
cnrm-resource-stats-recorder crash loop triggered by recorder container; port 8888 already in use #449

Closed
jketcham opened this issue Apr 9, 2021 · 23 comments

@jketcham

jketcham commented Apr 9, 2021

Bug Description

A new revision of cnrm-resource-stats-recorder that tried to start yesterday is stuck in a crash loop in one of my clusters; the 'recorder' container complains that port 8888 is already in use.

The new revision is trying to run recorder: gcr.io/gke-release/cnrm/recorder:d399cc9 and prom-to-sd: k8s.gcr.io/prometheus-to-sd:v0.9.1.
The previous revision is still running fine with recorder: gcr.io/gke-release/cnrm/recorder:2081072 and prom-to-sd: k8s.gcr.io/prometheus-to-sd:v0.9.1.

Any insight into this issue would be appreciated!
Should I kill this revision and try re-applying the config connector manifests?

Additional Diagnostic Information

Kubernetes Cluster Version

1.17.17-gke.2800

Config Connector Version

1.39.0

Config Connector Mode

cluster

Log Output

The logs from the "recorder" container are (repeated with each crash):

{ "msg": "Recording the stats of Config Connector resources" }
{ "error": "listen tcp :8888: bind: address already in use", "msg": "error registering the Prometheus HTTP handler" }

Steps to Reproduce

Steps to reproduce the issue

I'm not sure exactly what triggered this issue, but it seems that it occurred when the recorder container was trying to upgrade to a new version while another revision was already running in the cluster.

jketcham added the bug label Apr 9, 2021
@maqiuyujoyce
Collaborator

Hi @jketcham, sorry that you've run into this issue.

I'm not sure exactly what triggered this issue, but it seems that it occurred when the recorder container was trying to upgrade to a new version while another revision was already running in the cluster.

It would be helpful if you could provide some more context:

  • Did it happen automatically or manually?
  • Did other pods also get upgraded?
  • Did you find any version changes of Config Connector or your GKE cluster?
  • And did you install Config Connector via the GKE addon or manually?

Meanwhile, could you try to kill the recorder pod following the steps below and see if it makes any difference?

  1. $ kubectl get pods -n cnrm-system
  2. Find the pod name starting with cnrm-resource-stats-recorder.
  3. $ kubectl delete pod [recorder_pod_name] -n cnrm-system
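If it's easier, the same thing can be done in a single command with a label selector. This is just a sketch; it assumes the recorder pods carry the cnrm.cloud.google.com/component=cnrm-resource-stats-recorder label (which is visible in the pod describe output later in this thread):

  # Delete all recorder pods in one go by their component label.
  $ kubectl delete pod -n cnrm-system -l cnrm.cloud.google.com/component=cnrm-resource-stats-recorder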

@eyalzek

eyalzek commented Apr 12, 2021

Just had it happen as well. The deployment uses the host network:

$ k get deployments.apps cnrm-resource-stats-recorder -oyaml |grep hostNetwork | tail -n1
      hostNetwork: true

and since the new pod was started on the same host as the old one, it failed to bind the port:

$ k get po -owide |grep recorder
cnrm-resource-stats-recorder-7cf8996bbf-2m8gg   2/2     Running            0          39d    10.0.0.39     gke-terraform-202010271506369711-78b5ca99-ti3k   <none>           <none>
cnrm-resource-stats-recorder-7d4f588f6c-hhnsx   1/2     CrashLoopBackOff   33         148m   10.0.0.39     gke-terraform-202010271506369711-78b5ca99-ti3k   <none>           <none>

I think it would make sense to either change the update strategy of this deployment to Recreate, or set a pod anti-affinity so that a new pod won't be scheduled on a node where one is already running (though anti-affinity might break on a single-node cluster).
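For illustration, a minimal sketch of the Recreate approach, assuming the deployment lives in the cnrm-system namespace (and keeping in mind that the add-on/operator may revert manual edits to this Deployment):

  # Switch to the Recreate strategy so the old pod is torn down
  # (freeing the host port) before the new one is scheduled.
  $ kubectl patch deployment cnrm-resource-stats-recorder -n cnrm-system \
      --type merge \
      -p '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'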

@jketcham
Author

@maqiuyujoyce thanks for your help, in answer to your questions:

Did it happen automatically or manually?

I believe this happened automatically, as I hadn't applied any Config Connector manifests beforehand, but I could be wrong if someone else who works on this cluster applied something.

Did other pods also get upgraded?

Yes, it looks like the other resources for config connector were updated along with the stats-recorder (they all at least have the same creation time as the stats-recorder pod that was crashing).

Did you find any version changes of Config Connector or your GKE cluster?

Not sure here, but I think this may have been triggered by an automatic minor version upgrade of the cluster.

And did you install Config Connector via the GKE addon or manually?

I installed config connector via the GKE addon.


I also followed your advice and deleted the old stats-recorder pod, which solved the port conflict, and the new recorder started without a problem afterwards. I think @eyalzek hit the nail on the head: both pods were using the host network on the same host, which caused the problem.

I'd be fine with closing this issue now, but it would be nice if Config Connector added something to prevent this from happening in the first place.

Thanks!

@mathieu-benoit

mathieu-benoit commented Apr 13, 2021

I have the exact same issue:

k get po -n cnrm-system
NAME                                           READY   STATUS             RESTARTS   AGE
cnrm-controller-manager-0                      2/2     Running            0          9h
cnrm-deletiondefender-0                        1/1     Running            0          9h
cnrm-resource-stats-recorder-9f4c5ccfb-dznxz   1/2     CrashLoopBackOff   111        9h
cnrm-webhook-manager-5ccc747594-9clsv          1/1     Running            0          9h
cnrm-webhook-manager-5ccc747594-9lngc          1/1     Running            0          9h

Unfortunately, even after kubectl delete pod [recorder_pod_name] -n cnrm-system, I'm still seeing the issue.

FYI:

  • GKE version 1.19.9-gke.100, Rapid channel
  • Config Connector version 1.45.0 - installation via its Operator

@toumorokoshi
Contributor

Thanks for all the info! As an update: we discussed internally and have decided to change the deployment strategy to Recreate, as well as declare the exposed host port in the Deployment spec to help the scheduler.

I'll post an update once there's a fix; the goal is to get it into the next release.

Unfortunately even after the kubectl delete pod [recorder_pod_name] -n cnrm-system I'm still having that issue.

@mathieu-benoit this should work. Could you try deleting all the recorder pods, rather than just a single one?

@mathieu-benoit

mathieu-benoit commented Apr 13, 2021

FYI @toumorokoshi, in my case I have just one cnrm-resource-stats-recorder pod, and it is getting the {"severity":"error","msg":"error registering the Prometheus HTTP handler","error":"listen tcp :8888: bind: address already in use"} error.

@eyalzek

eyalzek commented Apr 13, 2021

@mathieu-benoit go into the node it's running on and check what's listening on port 8888... there must be something there. If you have multiple nodes, you can try cordoning the problematic one while restarting the pod to see if this happens on all nodes or just that one.
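Something along these lines should work; the node name and zone below are placeholders, and it assumes SSH access to the GKE node:

  # SSH into the node the pod is scheduled on...
  $ gcloud compute ssh gke-your-node-name --zone your-zone
  # ...then, on the node, see what process holds port 8888.
  $ sudo netstat -tulpen | grep 8888

  # Back on your workstation, optionally cordon the node and restart the
  # pod to see whether the conflict follows the node or the pod.
  $ kubectl cordon gke-your-node-name
  $ kubectl delete pod [recorder_pod_name] -n cnrm-system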

@eyalzek

eyalzek commented Apr 14, 2021

@toumorokoshi I just realized that we're also seeing the same issue that @mathieu-benoit is seeing, but only on our rapid GKE cluster running version v1.19.8-gke.2000. On other clusters in the regular channel with version v1.18.16-gke.502 this is not a problem.

On the 1.19 nodes there's software running which is bound to port 8888:

# netstat -tulpen |grep 8888
tcp        0      0 127.0.0.1:8888          0.0.0.0:*               LISTEN      1000       27836      2222/otelsvc 

# ps ax |grep 2222
   2222 ?        Ssl    0:11 /otelsvc --config=/conf/gke-metrics-agent-config.yaml --metrics-prefix=

In this case the cnrm-resource-stats-recorder is crash looping and can never recover.

On the 1.18 nodes otelsvc is still running, but it doesn't seem to be binding any port. Digging deeper, this is the gke-metrics-agent DaemonSet: on 1.18 it runs version 0.3.5-gke.0 and on 1.19 version 0.3.8-gke.0. In both cases it uses the host network, but there is a difference in the gke-metrics-agent-conf ConfigMap (in the kube-system namespace). There's actually quite a bit of difference, but the critical part is this:

$ diff /tmp/cm-1.19.yaml /tmp/cm-1.18.yaml |grep -B1 8888 
<         static_configs:
<         - targets: ["127.0.0.1:8888"]

I can't find any info about this in the GKE changelog, but it means that a GKE cluster running 1.19 cannot run both the metrics addon and the Config Connector addon... I would suggest solving this internally and possibly changing the port on one of these workloads.
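To check whether a given cluster's metrics agent has that self-scrape target, something like the following should work, using the ConfigMap name mentioned above:

  $ kubectl get configmap gke-metrics-agent-conf -n kube-system -o yaml | grep -B2 '127.0.0.1:8888'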

@toumorokoshi
Contributor

toumorokoshi commented Apr 14, 2021

Got it, thanks for the digging and scoping to GKE-1.19!

I'll talk to a few folks internally to figure out the right next step here. I'll probably also ship a change to at least choose a more obscure host port to bind to.

@toumorokoshi
Contributor

Hello, as an update:

I'm working with the GKE folks now, but that resolution may take a while. For now we are targeting a fix in the next version of Config Connector. That will appear in the add-on within a couple of weeks (unfortunately there are delays in GKE add-on updates), and you can get the fix sooner by using the manual installation.

@mathieu-benoit

JFYI: I just tested version 1.46.0, and the issue in my case is now fixed, thanks!

@toumorokoshi
Contributor

Thanks for the information! I'll close this issue for now since we shipped a fix in 1.46. When it does pop up in the add-ons I'll try to paste some versions in this thread.

To be clear, the fix is to use a more obscure port, along with switching to the "Recreate" deployment strategy. So you may still see port conflicts if something else happens to use hostPort: 48797.
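If you want to confirm which host port and strategy your installed recorder uses, a quick check along these lines should work (assuming the port is declared in the Deployment spec, as in the fixed releases):

  # Show the declared container/host port of the recorder...
  $ kubectl get deployment cnrm-resource-stats-recorder -n cnrm-system -o yaml | grep -B1 hostPort
  # ...and its deployment strategy.
  $ kubectl get deployment cnrm-resource-stats-recorder -n cnrm-system -o jsonpath='{.spec.strategy.type}'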

@hsuabina

I just bumped into this issue today while playing around with Config Connector. Switching from the GKE add-on installation to manually installing the operator did the trick for me as well (I installed 1.51.0), and now the cnrm-resource-stats-recorder pod comes up without crashing.

Is there any ETA yet on when this fix will be available through the add-on?

@toumorokoshi
Contributor

Is there any ETA yet on when this fix will be available through the add-on?

Unfortunately we don't have a lot of control over add-on availability; it can take up to 8 weeks.

We are currently working on a project to try to reduce that time. At this point, manual installation is your best choice if you want to stay on the latest version.

@hsmade

hsmade commented May 26, 2021

Bumped into the same issue, and wanted to add that in my case the port clash is with the cnrm controller manager itself.
I have a 3-node test cluster; both cnrm-controller-manager and cnrm-resource-stats-recorder expose a Prometheus endpoint on port 8888, both use hostPort, and both ended up on the same node.
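A quick way to spot that kind of clash, as a sketch (it just greps the rendered workload specs for host ports and checks pod placement):

  # List host ports declared by the Config Connector workloads...
  $ kubectl get deploy,statefulset -n cnrm-system -o yaml | grep -B1 hostPort

  # ...and check which of their pods landed on the same node.
  $ kubectl get pods -n cnrm-system -o wide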

@derekperkins

1.19.10-gke.1600 finally fixed the problem for me, putting the addon at 1.49.1

@NeckBeardPrince

1.19.10-gke.1600 finally fixed the problem for me, putting the addon at 1.49.1

Hmm, I don't see that release in https://cloud.google.com/kubernetes-engine/docs/release-notes.

@jcanseco
Member

jcanseco commented Jun 2, 2021

Hi @NeckBeardPrince, it seems that it was released on May 19.

@hsmade, apologies that we just saw your comment. Are you still facing the issue? If you want to resolve it ASAP, we recommend switching to a manual installation, just like @hsuabina did in their comment above.

@hsmade

hsmade commented Jun 2, 2021

No worries. I'm running test clusters atm, so not impacted ;o)

@jcanseco
Member

jcanseco commented Jun 2, 2021

Gotcha, good to hear :)

@esn89

esn89 commented Jul 28, 2021

I am running into this same issue:

kubectl describe po cnrm-resource-stats-recorder-9f4c5ccfb-b4fxv -n cnrm-system
Name:         cnrm-resource-stats-recorder-9f4c5ccfb-b4fxv
Namespace:    cnrm-system
Priority:     0
Node:         gke-main-us-dev-main-5e0a-83b549e3-8obo/10.250.0.47
Start Time:   Wed, 28 Jul 2021 21:26:58 +0000
Labels:       cnrm.cloud.google.com/component=cnrm-resource-stats-recorder
              cnrm.cloud.google.com/system=true
              pod-template-hash=9f4c5ccfb
Annotations:  cnrm.cloud.google.com/version: 1.45.0
Status:       Running
IP:           10.250.0.47
IPs:
  IP:           10.250.0.47
Controlled By:  ReplicaSet/cnrm-resource-stats-recorder-9f4c5ccfb
Containers:
  recorder:
    Container ID:  docker://b3f10adacee155bab4761e6497a16a3506e2d8ef4cb18f792c7099ea9cfaccfb
    Image:         gcr.io/gke-release/cnrm/recorder:6ec9227
    Image ID:      docker-pullable://gcr.io/gke-release/cnrm/recorder@sha256:4b77e2a75cca39a94e5a4b042694eb1843851837533fa059cd0fe174bc2644dc
    Port:          <none>
    Host Port:     <none>
    Command:
      /configconnector/recorder
    Args:
      --prometheus-scrape-endpoint=:8888
      --metric-interval=60
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 28 Jul 2021 21:53:06 +0000
      Finished:     Wed, 28 Jul 2021 21:53:06 +0000
    Ready:          False
    Restart Count:  10
    Limits:
      memory:  64Mi
    Requests:
      cpu:      20m
      memory:   64Mi
    Readiness:  exec [cat /tmp/ready] delay=3s timeout=1s period=3s #success=1 #failure=3
    Environment:
      CONFIG_CONNECTOR_VERSION:  1.45.0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from cnrm-resource-stats-recorder-token-x69wh (ro)
  prom-to-sd:

My cluster version is 1.19.9-gke.1900, and I installed Config Connector via the add-on: https://cloud.google.com/config-connector/docs/how-to/install-upgrade-uninstall#addon-install

Is the add-on version still out of date?

@NeckBeardPrince

I am running into this same issue:

[...]

My cluster version is 1.19.9-gke.1900, and I installed Config Connector via the add-on: https://cloud.google.com/config-connector/docs/how-to/install-upgrade-uninstall#addon-install

Is the add-on version still out of date?

Same here. It's so annoying that GCP is taking this long to update the add-on; this has been going on for almost four months now.

@toumorokoshi
Contributor

Hello, sorry to hear you're still having issues, but I think upgrading your GKE master version would help (details below).

Is the add-on version still out of date?

Unfortunately yes. The version the issue was fixed in is 1.46.0, and the version listed in your manifest is 1.45.0. We just released 1.58 last week.

My cluster version is 1.19.9-gke.1900, and I installed Config Connector via the add-on: https://cloud.google.com/config-connector/docs/how-to/install-upgrade-uninstall#addon-install

Would you happen to be using a static version of GKE? I don't see that version available in the static, regular, or rapid channel.

For now the Config Connector add-on is tied to the GKE master version: you will need to upgrade the master to pick up a new Config Connector version.

Taking a look at the versions available in us-central1, all versions in any channel (rapid, regular, stable) contain at least version 1.51 of Config Connector.

If you are concerned about upgrading to a new Kubernetes minor version, patch version 1.19.12-gke.2100 contains Config Connector 1.56.
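For example, to see which versions are available in your region and upgrade only the control plane to a specific patch version (cluster name, region, and target version below are placeholders):

  # List GKE versions currently available in the region.
  $ gcloud container get-server-config --region us-central1

  # Upgrade just the control plane (master) to a specific version.
  $ gcloud container clusters upgrade my-cluster --master \
      --cluster-version 1.19.12-gke.2100 --region us-central1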

Regardless of whether this resolves your issue, we know there is general pain around the slowness of add-on version updates, and we are actively talking to the GKE team about options to speed things up.
