Is there an existing issue for this?
- I have searched the existing issues
What happened?
Description
hi 👋
we use kube-prometheus-stack v55.5.0 as well as v60.1.0
the former comes with prometheus-operator v0.70.0, the latter with v0.74.0
in both versions, prometheus-operator creates a broken StatefulSet if the `Prometheus` resource is configured with `listenLocal: true`.
specifically, the liveness & readiness probes for the `config-reloader` container are configured with

```yaml
httpGet:
  path: /healthz
  port: 8080
  scheme: HTTP
```

even though the container is also configured with

```yaml
args:
  - --listen-address=127.0.0.1:8080
  - ...
```

so the `config-reloader` container keeps getting killed by the kubelet. 💥
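putting the two together, the `config-reloader` container in the generated StatefulSet ends up looking roughly like this (my own trimmed-down sketch, only the fields relevant here):

```yaml
# sketch (trimmed) of the generated config-reloader container with listenLocal: true
containers:
  - name: config-reloader
    args:
      - --listen-address=127.0.0.1:8080   # only binds to loopback inside the pod...
      # ...
    livenessProbe:
      httpGet:                            # ...but httpGet probes are performed by the
        path: /healthz                    # kubelet against the pod IP, so they can
        port: 8080                        # never succeed
        scheme: HTTP
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
```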
without `listenLocal: true`, the `config-reloader` container is configured with

```yaml
args:
  - --listen-address=:8080
  - ...
```

...in which case these liveness & readiness probes work fine.
Steps to Reproduce
create a `Prometheus` resource with

```yaml
spec:
  listenLocal: true
```
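a minimal manifest that reproduces it (trimmed down from the full one under Manifests below; name and namespace don't matter):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: test          # arbitrary
  namespace: default  # arbitrary
spec:
  replicas: 1
  listenLocal: true
```

then watch the resulting `prometheus-test-0` pod: the `config-reloader` container never becomes ready and keeps restarting (see the output at the bottom).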
Expected Result
either when `listenLocal` is set to `true`, or unconditionally, the liveness & readiness probes for the `config-reloader` container should look something like this:

```yaml
exec:
  command:
    - sh
    - -c
    - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:8080/healthz; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:8080/healthz; else exit 1; fi
```

...which is exactly what the `prometheus` container in the StatefulSet already uses (presumably to deal with precisely this issue).
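spelled out as full probe definitions on the `config-reloader` container, that would look roughly like the sketch below (probe timing fields omitted; this also assumes the config-reloader image ships `sh` plus `curl` or `wget`, which the proposed command needs):

```yaml
# sketch: exec-based probes for the config-reloader container, mirroring
# the command the prometheus container already uses
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:8080/healthz; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:8080/healthz; else exit 1; fi
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:8080/healthz; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:8080/healthz; else exit 1; fi
```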
Actual Result
the liveness & readiness probes for the `config-reloader` container are configured with

```yaml
httpGet:
  path: /healthz
  port: 8080
  scheme: HTTP
```
Prometheus Operator Version
v0.70.0
v0.74.0
(presumably everything in between as well and likely also some previous versions 🤷)
Kubernetes Version
```yaml
clientVersion:
  buildDate: "2024-05-14T10:42:02Z"
  compiler: gc
  gitCommit: 6911225c3f747e1cd9d109c305436d08b668f086
  gitTreeState: clean
  gitVersion: v1.30.1
  goVersion: go1.22.2
  major: "1"
  minor: "30"
  platform: darwin/amd64
kustomizeVersion: v5.0.4-0.20230601165947-6ce0bf390ce3
serverVersion:
  buildDate: "2024-04-30T23:53:58Z"
  compiler: gc
  gitCommit: 9c0e57823b31865d0ee095997d9e7e721ffdc77f
  gitTreeState: clean
  gitVersion: v1.29.4-eks-036c24b
  goVersion: go1.21.9
  major: "1"
  minor: 29+
  platform: linux/amd64
```
Kubernetes Cluster Type
EKS
How did you deploy Prometheus-Operator?
helm chart: prometheus-community/kube-prometheus-stack
Manifests
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: test
  namespace: default
spec:
  replicas: 1
  listenLocal: true
  version: v2.52.0
  podMonitorNamespaceSelector:
    matchLabels:
      foo: bar
  podMonitorSelector:
    matchLabels:
      foo: bar
  probeNamespaceSelector:
    matchLabels:
      foo: bar
  probeSelector:
    matchLabels:
      foo: bar
  ruleNamespaceSelector:
    matchLabels:
      foo: bar
  ruleSelector:
    matchLabels:
      foo: bar
  scrapeConfigNamespaceSelector:
    matchLabels:
      foo: bar
  scrapeConfigSelector:
    matchLabels:
      foo: bar
  serviceMonitorNamespaceSelector:
    matchLabels:
      foo: bar
  serviceMonitorSelector:
    matchLabels:
      foo: bar
  # the selectors are irrelevant with respect to this bug.
  # i just have a lot of `monitoring.coreos.com` resources that i don't want this test instance to pick up.
```
prometheus-operator log output
# kubectl logs -n kube-system deploy/kube-prometheus-stack-operator
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:52608: EOF","ts":"2024-06-14T08:03:15.460833393Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:52642: read tcp 10.160.20.184:10250->10.160.142.27:52642: read: connection reset by peer","ts":"2024-06-14T08:03:15.554231481Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:34500: EOF","ts":"2024-06-14T08:05:18.445885664Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:34526: EOF","ts":"2024-06-14T08:05:18.456555184Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:34562: EOF","ts":"2024-06-14T08:05:18.555825338Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.144.214:53830: EOF","ts":"2024-06-14T08:15:45.06724675Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.144.214:44112: EOF","ts":"2024-06-14T09:26:45.243586185Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.144.214:51488: EOF","ts":"2024-06-14T10:37:36.262672939Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.144.214:52032: read tcp 10.160.20.184:10250->10.160.144.214:52032: read: connection reset by peer","ts":"2024-06-14T13:04:03.464037773Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.144.214:36356: EOF","ts":"2024-06-14T14:15:32.356966667Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.144.214:59308: EOF","ts":"2024-06-14T15:27:55.24354065Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:37294: EOF","ts":"2024-06-14T17:49:23.352276157Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:37304: EOF","ts":"2024-06-14T17:49:23.35314571Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:34628: EOF","ts":"2024-06-14T19:02:11.766423962Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:34642: EOF","ts":"2024-06-14T19:02:11.855791336Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:34660: EOF","ts":"2024-06-14T19:02:11.954423115Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:35206: EOF","ts":"2024-06-14T21:26:31.659543146Z"}
{"caller":"server.go:3411","msg":"http: TLS handshake error from 10.160.142.27:35244: EOF","ts":"2024-06-14T21:26:31.859287542Z"}
```
# kubectl get pod -n default --watch
NAME                READY   STATUS            RESTARTS     AGE
prometheus-test-0   0/2     Init:0/1          0            7s
prometheus-test-0   0/2     PodInitializing   0            9s
prometheus-test-0   0/2     PodInitializing   0            13s
prometheus-test-0   0/2     Running           0            13s
prometheus-test-0   0/2     Running           0            17s
prometheus-test-0   1/2     Running           0            18s
prometheus-test-0   1/2     Running           1            43s
prometheus-test-0   1/2     Running           2 (2s ago)   72s
```
Anything else?
this issue is related to