
liveness- & readiness probes for config-reloader container in prometheus stateful set fail when listenLocal: true #6682

Closed
@sdickhoven

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Description

hi 👋

we use kube-prometheus-stack v55.5.0 as well as v60.1.0

the former comes with prometheus-operator v0.70.0, the latter with v0.74.0

in both versions prometheus-operator creates a broken stateful set if the Prometheus resource is configured with listenLocal: true.

specifically, the liveness & readiness probes for the config-reloader container are configured with

httpGet:
  path: /healthz
  port: 8080
  scheme: HTTP

even though the container is also configured with

args:
- --listen-address=127.0.0.1:8080
- ...

so the config-reloader container keeps getting murdered by kubernetes. 💥

without listenLocal: true, the config-reloader container is configured with

args:
- --listen-address=:8080
- ...

...in which case these liveness & readiness probes work fine.
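
for reference, the mismatch is easy to see on the generated stateful set (named prometheus-test here, matching the pod in the watch output below). pulling the config-reloader args and liveness probe out with a jsonpath query is one way to confirm it:

# args show --listen-address=127.0.0.1:8080
kubectl get statefulset prometheus-test -n default \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="config-reloader")].args}{"\n"}'
# probe is an httpGet against port 8080, which the pod IP can't reach
kubectl get statefulset prometheus-test -n default \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="config-reloader")].livenessProbe}{"\n"}'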

Steps to Reproduce

create Prometheus resource with

spec:
  listenLocal: true
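
then watch the pod: the config-reloader container starts failing its liveness probe and gets restarted over and over, which shows up in the restart counter and in the pod events (exact event wording depends on the kubelet version):

kubectl get pod prometheus-test-0 -n default --watch
kubectl describe pod prometheus-test-0 -n default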

Expected Result

either when listenLocal is set to true (or simply always), the liveness & readiness probes for the config-reloader container should look something like this:

exec:
  command:
  - sh
  - -c
  - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:8080/healthz;
    elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:8080/healthz;
    else exit 1; fi

...which is exactly what the prometheus container in the stateful set already uses (presumably to deal with precisely this issue).
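
for illustration only, the generated config-reloader container spec for listenLocal: true could then look roughly like this (args taken from the current output, probe shape borrowed from the prometheus container; the image, remaining args and probe timings are whatever the operator already defaults to and are omitted here):

name: config-reloader
args:
- --listen-address=127.0.0.1:8080
- ...
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:8080/healthz;
      elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:8080/healthz;
      else exit 1; fi
readinessProbe:
  # same exec handler as the liveness probe
  exec:
    command:
    - sh
    - -c
    - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:8080/healthz;
      elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:8080/healthz;
      else exit 1; fi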

Actual Result

the liveness & readiness probes for the config-reloader container are configured with

httpGet:
  path: /healthz
  port: 8080
  scheme: HTTP

Prometheus Operator Version

v0.70.0
v0.74.0

(presumably everything in between as well, and likely also some previous versions 🤷)

Kubernetes Version

clientVersion:
  buildDate: "2024-05-14T10:42:02Z"
  compiler: gc
  gitCommit: 6911225c3f747e1cd9d109c305436d08b668f086
  gitTreeState: clean
  gitVersion: v1.30.1
  goVersion: go1.22.2
  major: "1"
  minor: "30"
  platform: darwin/amd64
kustomizeVersion: v5.0.4-0.20230601165947-6ce0bf390ce3
serverVersion:
  buildDate: "2024-04-30T23:53:58Z"
  compiler: gc
  gitCommit: 9c0e57823b31865d0ee095997d9e7e721ffdc77f
  gitTreeState: clean
  gitVersion: v1.29.4-eks-036c24b
  goVersion: go1.21.9
  major: "1"
  minor: "29+"
  platform: linux/amd64

Kubernetes Cluster Type

EKS

How did you deploy Prometheus-Operator?

helm chart: prometheus-community/kube-prometheus-stack

Manifests

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: test
  namespace: default
spec:
  replicas: 1
  listenLocal: true
  version: v2.52.0
  podMonitorNamespaceSelector:
    matchLabels:
      foo: bar
  podMonitorSelector:
    matchLabels:
      foo: bar
  probeNamespaceSelector:
    matchLabels:
      foo: bar
  probeSelector:
    matchLabels:
      foo: bar
  ruleNamespaceSelector:
    matchLabels:
      foo: bar
  ruleSelector:
    matchLabels:
      foo: bar
  scrapeConfigNamespaceSelector:
    matchLabels:
      foo: bar
  scrapeConfigSelector:
    matchLabels:
      foo: bar
  serviceMonitorNamespaceSelector:
    matchLabels:
      foo: bar
  serviceMonitorSelector:
    matchLabels:
      foo: bar

# the selectors are irrelevant with respect to this bug.
# i just have a lot of `monitoring.coreos.com` resources that i don't want this test instance to pick up.

prometheus-operator log output

# kubectl logs -n kube-system deploy/kube-prometheus-stack-operator 
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:52608: EOF","ts":"2024-06-14T08:03:15.460833393Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:52642: read tcp 10.160.20.184:10250->10.160.142.27:52642: read: connection reset by peer","ts":"2024-06-14T08:03:15.554231481Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:34500: EOF","ts":"2024-06-14T08:05:18.445885664Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:34526: EOF","ts":"2024-06-14T08:05:18.456555184Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:34562: EOF","ts":"2024-06-14T08:05:18.555825338Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.144.214:53830: EOF","ts":"2024-06-14T08:15:45.06724675Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.144.214:44112: EOF","ts":"2024-06-14T09:26:45.243586185Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.144.214:51488: EOF","ts":"2024-06-14T10:37:36.262672939Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.144.214:52032: read tcp 10.160.20.184:10250->10.160.144.214:52032: read: connection reset by peer","ts":"2024-06-14T13:04:03.464037773Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.144.214:36356: EOF","ts":"2024-06-14T14:15:32.356966667Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.144.214:59308: EOF","ts":"2024-06-14T15:27:55.24354065Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:37294: EOF","ts":"2024-06-14T17:49:23.352276157Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:37304: EOF","ts":"2024-06-14T17:49:23.35314571Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:34628: EOF","ts":"2024-06-14T19:02:11.766423962Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:34642: EOF","ts":"2024-06-14T19:02:11.855791336Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:34660: EOF","ts":"2024-06-14T19:02:11.954423115Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:35206: EOF","ts":"2024-06-14T21:26:31.659543146Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:35244: EOF","ts":"2024-06-14T21:26:31.859287542Z"}

# kubectl get pod -n default --watch
NAME                         READY   STATUS     RESTARTS   AGE
prometheus-test-0            0/2     Init:0/1   0          7s
prometheus-test-0            0/2     PodInitializing   0          9s
prometheus-test-0            0/2     PodInitializing   0          13s
prometheus-test-0            0/2     Running           0          13s
prometheus-test-0            0/2     Running           0          17s
prometheus-test-0            1/2     Running           0          18s
prometheus-test-0            1/2     Running           1          43s
prometheus-test-0            1/2     Running           2 (2s ago)   72s

Anything else?

this issue is related to
