
cnrm-resource-stats-recorder crash loop triggered by recorder container; port 8888 already in use #449

Closed
jketcham opened this issue Apr 9, 2021 · 23 comments

@jketcham

jketcham commented Apr 9, 2021

Bug Description

A new revision of cnrm-resource-stats-recorder that tried to start yesterday is stuck in a crash loop in one of my clusters; the 'recorder' container complains that port 8888 is already in use.

The new revision is trying to run recorder: gcr.io/gke-release/cnrm/recorder:d399cc9 and prom-to-sd: k8s.gcr.io/prometheus-to-sd:v0.9.1.
The previous revision is still running fine with recorder: gcr.io/gke-release/cnrm/recorder:2081072 and prom-to-sd: k8s.gcr.io/prometheus-to-sd:v0.9.1.

Any insight into this issue would be appreciated!
Should I kill this revision and try re-applying the config connector manifests?

Additional Diagnostic Information

Kubernetes Cluster Version

1.17.17-gke.2800

Config Connector Version

1.39.0

Config Connector Mode

cluster

Log Output

The logs from the "recorder" container are (repeated with each crash):

{ "msg": "Recording the stats of Config Connector resources" }
{ "error": "listen tcp :8888: bind: address already in use", "msg": "error registering the Prometheus HTTP handler" }

Steps to Reproduce

Steps to reproduce the issue

I'm not sure exactly what triggered this issue, but it seems that it occurred when the recorder container was trying to upgrade to a new version while another revision was already running in the cluster.

jketcham added the bug label Apr 9, 2021
@maqiuyujoyce
Collaborator

Hi @jketcham, sorry that you've run into this issue.

I'm not sure exactly what triggered this issue, but it seems that it occurred when the recorder container was trying to upgrade to a new version while another revision was already running in the cluster.

It would be helpful if you could provide some more context:

  • Did it happen automatically or manually?
  • Did other pods also get upgraded?
  • Did you find any version changes of Config Connector or your GKE cluster?
  • And did you install Config Connector via the GKE addon or manually?

Meanwhile, could you try to kill the recorder pod following the steps below and see if it makes any difference?

  1. $ kubectl get pods -n cnrm-system
  2. Find the pod name starting with cnrm-resource-stats-recorder.
  3. $ kubectl delete pod [recorder_pod_name] -n cnrm-system
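If it's easier, the same thing can be done in a single command with a label selector. This is just a sketch; it assumes the recorder pods carry the cnrm.cloud.google.com/component=cnrm-resource-stats-recorder label (which is visible in the pod describe output later in this thread):

  # Delete all recorder pods in one go by their component label.
  $ kubectl delete pod -n cnrm-system -l cnrm.cloud.google.com/component=cnrm-resource-stats-recorder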

@eyalzek

eyalzek commented Apr 12, 2021

Just had it happen as well. The deployment uses the host network:

$ k get deployments.apps cnrm-resource-stats-recorder -oyaml |grep hostNetwork | tail -n1
      hostNetwork: true

and since the new pod was started on the same host as the old one, it failed to bind the port:

$ k get po -owide |grep recorder
cnrm-resource-stats-recorder-7cf8996bbf-2m8gg   2/2     Running            0          39d    10.0.0.39     gke-terraform-202010271506369711-78b5ca99-ti3k   <none>           <none>
cnrm-resource-stats-recorder-7d4f588f6c-hhnsx   1/2     CrashLoopBackOff   33         148m   10.0.0.39     gke-terraform-202010271506369711-78b5ca99-ti3k   <none>           <none>

I think it would make sense to either change the update strategy of this deployment to Recreate, or set a pod anti-affinity so that a new pod won't be scheduled on a node where one is already running (though anti-affinity might break on a single-node cluster).
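For illustration, a minimal sketch of the Recreate approach, assuming the deployment lives in the cnrm-system namespace (and keeping in mind that the add-on/operator may revert manual edits to this Deployment):

  # Switch to the Recreate strategy so the old pod is torn down
  # (freeing the host port) before the new one is scheduled.
  $ kubectl patch deployment cnrm-resource-stats-recorder -n cnrm-system \
      --type merge \
      -p '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'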

@jketcham
Author

@maqiuyujoyce thanks for your help, in answer to your questions:

Did it happen automatically or manually?

I believe this happened automatically, as I hadn't applied any Config Connector manifests beforehand, but I could be wrong if someone else who works on this cluster applied something.

Did other pods also get upgraded?

Yes, it looks like the other resources for config connector were updated along with the stats-recorder (they all at least have the same creation time as the stats-recorder pod that was crashing).

Did you find any version changes of Config Connector or your GKE cluster?

Not sure here, but I think this may have been triggered by an automatic minor version upgrade of the cluster.

And did you install Config Connector via the GKE addon or manually?

I installed config connector via the GKE addon.


I also followed your advice and deleted the old stats-recorder pod, which solved the port conflict, and the new recorder started without a problem afterwards. I think @eyalzek hit the nail on the head: both pods were using the host network on the same host, which caused the problem.

I'd be fine with closing this issue now, but it would be nice if Config Connector added something to prevent this from happening in the first place.

Thanks!

@mathieu-benoit

mathieu-benoit commented Apr 13, 2021

I have the exact same issue:

k get po -n cnrm-system
NAME                                           READY   STATUS             RESTARTS   AGE
cnrm-controller-manager-0                      2/2     Running            0          9h
cnrm-deletiondefender-0                        1/1     Running            0          9h
cnrm-resource-stats-recorder-9f4c5ccfb-dznxz   1/2     CrashLoopBackOff   111        9h
cnrm-webhook-manager-5ccc747594-9clsv          1/1     Running            0          9h
cnrm-webhook-manager-5ccc747594-9lngc          1/1     Running            0          9h

Unfortunately, even after kubectl delete pod [recorder_pod_name] -n cnrm-system, I'm still seeing the issue.

FYI:

  • GKE version 1.19.9-gke.100, Rapid channel
  • Config Connector version 1.45.0 - installation via its Operator

@toumorokoshi
Contributor

Thanks for all the info! As an update: we discussed internally and have decided to change the deployment strategy to Recreate, as well as declare the exposed host port in the Deployment spec to help the scheduler.

I'll post an update once there's a fix; the goal is to get it into the next release.

Unfortunately even after the kubectl delete pod [recorder_pod_name] -n cnrm-system I'm still having that issue.

@mathieu-benoit this should work. Could you try deleting all the recorder pods, rather than just a single one?

@mathieu-benoit

mathieu-benoit commented Apr 13, 2021

FYI @toumorokoshi, in my case I have just one cnrm-resource-stats-recorder pod, and it is getting the {"severity":"error","msg":"error registering the Prometheus HTTP handler","error":"listen tcp :8888: bind: address already in use"} error.

@eyalzek

eyalzek commented Apr 13, 2021

@mathieu-benoit go into the node it's running on and check what's listening on port 8888... there must be something there. If you have multiple nodes, you can try cordoning the problematic one while restarting the pod to see if this happens on all nodes or just that one.
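Something along these lines should work; the node name and zone below are placeholders, and it assumes SSH access to the GKE node:

  # SSH into the node the pod is scheduled on...
  $ gcloud compute ssh gke-your-node-name --zone your-zone
  # ...then, on the node, see what process holds port 8888.
  $ sudo netstat -tulpen | grep 8888

  # Back on your workstation, optionally cordon the node and restart the
  # pod to see whether the conflict follows the node or the pod.
  $ kubectl cordon gke-your-node-name
  $ kubectl delete pod [recorder_pod_name] -n cnrm-system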

@eyalzek

eyalzek commented Apr 14, 2021

@toumorokoshi I just realized that we're also seeing the same issue that @mathieu-benoit is seeing, but only on our rapid GKE cluster running version v1.19.8-gke.2000. On other clusters in the regular channel with version v1.18.16-gke.502 this is not a problem.

On the 1.19 nodes there's software running which is bound to port 8888:

# netstat -tulpen |grep 8888
tcp        0      0 127.0.0.1:8888          0.0.0.0:*               LISTEN      1000       27836      2222/otelsvc 

# ps ax |grep 2222
   2222 ?        Ssl    0:11 /otelsvc --config=/conf/gke-metrics-agent-config.yaml --metrics-prefix=

In this case the cnrm-resource-stats-recorder is crash looping and can never recover.

On the 1.18 nodes otelsvc is still running, but it doesn't seem to be binding any port. Digging deeper, this is the gke-metrics-agent DaemonSet: on 1.18 it runs version 0.3.5-gke.0 and on 1.19 version 0.3.8-gke.0. In both cases it uses the host network, but there is a difference in the gke-metrics-agent-conf ConfigMap (in the kube-system namespace). There's actually quite a bit of difference, but the critical part is this:

$ diff /tmp/cm-1.19.yaml /tmp/cm-1.18.yaml |grep -B1 8888 
<         static_configs:
<         - targets: ["127.0.0.1:8888"]

I can't find any info about this in the GKE changelog, but it means that a GKE cluster running 1.19 cannot run both the metrics addon and the Config Connector addon... I would suggest solving this internally and possibly changing the port on one of these workloads.
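To check whether a given cluster's metrics agent has that self-scrape target, something like the following should work, using the ConfigMap name mentioned above:

  $ kubectl get configmap gke-metrics-agent-conf -n kube-system -o yaml | grep -B2 '127.0.0.1:8888'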

@toumorokoshi
Contributor

toumorokoshi commented Apr 14, 2021

Got it, thanks for the digging and scoping to GKE-1.19!

I'll talk to a few folks internally to figure out the right next step here. I'll probably also ship a change to at least choose a more obscure host port to bind to.

@toumorokoshi
Contributor

Hello, as an update:

I'm working with the GKE folks now, but that resolution may take a while. For now we are targeting a fix in the next version of Config Connector. That will appear in the add-on within a couple of weeks (unfortunately there are delays in GKE add-on updates), and you can get the fix sooner by using the manual installation.

@mathieu-benoit

JFYI: I just tested version 1.46.0, and the issue in my case is now fixed, thanks!

@toumorokoshi
Contributor

Thanks for the information! I'll close this issue for now since we shipped a fix in 1.46. When it does pop up in the add-ons I'll try to paste some versions in this thread.

To be clear, the fix is to use a more obscure port, along with switching to the "Recreate" deployment strategy. So you may still see port conflicts if something else happens to use hostPort: 48797.
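If you want to confirm which host port and strategy your installed recorder uses, a quick check along these lines should work (assuming the port is declared in the Deployment spec, as in the fixed releases):

  # Show the declared container/host port of the recorder...
  $ kubectl get deployment cnrm-resource-stats-recorder -n cnrm-system -o yaml | grep -B1 hostPort
  # ...and its deployment strategy.
  $ kubectl get deployment cnrm-resource-stats-recorder -n cnrm-system -o jsonpath='{.spec.strategy.type}'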

@hsuabina

I just bumped into this issue today while playing around with Config Connector. Switching from the GKE add-on installation to manually installing the operator did the trick for me as well (I installed 1.51.0), and now the cnrm-resource-stats-recorder pod comes up without crashing.

Is there any ETA yet on when this fix will be available through the add-on?

@toumorokoshi
Contributor

Is there any ETA yet on when this fix will be available through the add-on?

Unfortunately we don't have a lot of control over add-on availability; it can take up to 8 weeks.

We are currently working on a project to try to reduce that time. At this point, manual installation is your best choice if you want to stay on the latest version.

@hsmade

hsmade commented May 26, 2021

Bumped into the same issue, and wanted to add that in my case the port clash is with the cnrm controller manager itself.
I have a 3-node test cluster; both cnrm-controller-manager and cnrm-resource-stats-recorder expose a Prometheus endpoint on port 8888, both use hostPort, and both ended up on the same node.
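A quick way to spot that kind of clash, as a sketch (it just greps the rendered workload specs for host ports and checks pod placement):

  # List host ports declared by the Config Connector workloads...
  $ kubectl get deploy,statefulset -n cnrm-system -o yaml | grep -B1 hostPort

  # ...and check which of their pods landed on the same node.
  $ kubectl get pods -n cnrm-system -o wide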

@derekperkins

1.19.10-gke.1600 finally fixed the problem for me, putting the addon at 1.49.1

@NeckBeardPrince

1.19.10-gke.1600 finally fixed the problem for me, putting the addon at 1.49.1

Hmm, I don't see that release in https://cloud.google.com/kubernetes-engine/docs/release-notes.

@jcanseco
Member

jcanseco commented Jun 2, 2021

Hi @NeckBeardPrince, it seems that it was released on May 19.

@hsmade, apologies that we just saw your comment. Are you still facing the issue? If you want to resolve it ASAP, we recommend switching to a manual installation, just like @hsuabina did in their comment above.

@hsmade

hsmade commented Jun 2, 2021

No worries. I'm running test clusters atm, so not impacted ;o)

@jcanseco
Member

jcanseco commented Jun 2, 2021

Gotcha, good to hear :)

@esn89

esn89 commented Jul 28, 2021

I am running into this same issue:

kubectl describe po cnrm-resource-stats-recorder-9f4c5ccfb-b4fxv -n cnrm-system
Name:         cnrm-resource-stats-recorder-9f4c5ccfb-b4fxv
Namespace:    cnrm-system
Priority:     0
Node:         gke-main-us-dev-main-5e0a-83b549e3-8obo/10.250.0.47
Start Time:   Wed, 28 Jul 2021 21:26:58 +0000
Labels:       cnrm.cloud.google.com/component=cnrm-resource-stats-recorder
              cnrm.cloud.google.com/system=true
              pod-template-hash=9f4c5ccfb
Annotations:  cnrm.cloud.google.com/version: 1.45.0
Status:       Running
IP:           10.250.0.47
IPs:
  IP:           10.250.0.47
Controlled By:  ReplicaSet/cnrm-resource-stats-recorder-9f4c5ccfb
Containers:
  recorder:
    Container ID:  docker://b3f10adacee155bab4761e6497a16a3506e2d8ef4cb18f792c7099ea9cfaccfb
    Image:         gcr.io/gke-release/cnrm/recorder:6ec9227
    Image ID:      docker-pullable://gcr.io/gke-release/cnrm/recorder@sha256:4b77e2a75cca39a94e5a4b042694eb1843851837533fa059cd0fe174bc2644dc
    Port:          <none>
    Host Port:     <none>
    Command:
      /configconnector/recorder
    Args:
      --prometheus-scrape-endpoint=:8888
      --metric-interval=60
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 28 Jul 2021 21:53:06 +0000
      Finished:     Wed, 28 Jul 2021 21:53:06 +0000
    Ready:          False
    Restart Count:  10
    Limits:
      memory:  64Mi
    Requests:
      cpu:      20m
      memory:   64Mi
    Readiness:  exec [cat /tmp/ready] delay=3s timeout=1s period=3s #success=1 #failure=3
    Environment:
      CONFIG_CONNECTOR_VERSION:  1.45.0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from cnrm-resource-stats-recorder-token-x69wh (ro)
  prom-to-sd:

My cluster version is 1.19.9-gke.1900, and I installed Config Connector via the add-on: https://cloud.google.com/config-connector/docs/how-to/install-upgrade-uninstall#addon-install

Is the add-on version still out of date?

@NeckBeardPrince

I am running into this same issue:

[...]

My cluster version is 1.19.9-gke.1900, and I installed Config Connector via the add-on: https://cloud.google.com/config-connector/docs/how-to/install-upgrade-uninstall#addon-install

Is the add-on version still out of date?

Same here. It's so annoying that GCP is taking this long to update the add-on; this has been going on for almost four months now.

@toumorokoshi
Contributor

Hello, sorry to hear you're still having issues, but I think upgrading your GKE master version would help (details below).

Is the add-on version still out of date?

Unfortunately yes. The version the issue was fixed in is 1.46.0, and the version listed in your manifest is 1.45.0. We just released 1.58 last week.

My cluster version is 1.19.9-gke.1900, and I installed Config Connector via the add-on: https://cloud.google.com/config-connector/docs/how-to/install-upgrade-uninstall#addon-install

Would you happen to be using a static version of GKE? I don't see that version available in the static, regular, or rapid channel.

For now the Config Connector add-on is tied to the GKE master version: you will need to upgrade the master to pick up a new Config Connector version.

Taking a look at the versions available in us-central1, all versions in any channel (rapid, regular, stable) contain at least version 1.51 of Config Connector.

If you are concerned about upgrading to a new Kubernetes minor version, patch version 1.19.12-gke.2100 contains Config Connector 1.56.
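For example, to see which versions are available in your region and upgrade only the control plane to a specific patch version (cluster name, region, and target version below are placeholders):

  # List GKE versions currently available in the region.
  $ gcloud container get-server-config --region us-central1

  # Upgrade just the control plane (master) to a specific version.
  $ gcloud container clusters upgrade my-cluster --master \
      --cluster-version 1.19.12-gke.2100 --region us-central1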

Regardless of whether this resolves your issue, we know there is general pain around the slowness of add-on version updates, and we are actively talking to the GKE team about options to speed things up.
