cnrm-resource-stats-recorder crash loop triggered by recorder container; port 8888 already in use #449
Comments
Hi @jketcham, sorry that you've run into this issue.
It would be helpful if you could provide some more context.
In the meantime, could you try killing the recorder pod following the steps below and see if it makes any difference?
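A minimal sketch of those steps, assuming the recorder runs in the usual cnrm-system namespace (the pod name is a placeholder):

```sh
# Find the crash-looping recorder pod (cnrm-system namespace assumed).
kubectl get pods -n cnrm-system | grep recorder

# Delete it; the Deployment controller will create a fresh pod.
kubectl delete pod -n cnrm-system <recorder-pod-name>
```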
Just had it happen as well. The deployment uses the host network:

$ k get deployments.apps cnrm-resource-stats-recorder -oyaml | grep hostNetwork | tail -n1
      hostNetwork: true

and since the new pod was scheduled on the same host as the old one, it failed to bind the port:

$ k get po -owide | grep recorder
cnrm-resource-stats-recorder-7cf8996bbf-2m8gg   2/2   Running            0    39d    10.0.0.39   gke-terraform-202010271506369711-78b5ca99-ti3k   <none>   <none>
cnrm-resource-stats-recorder-7d4f588f6c-hhnsx   1/2   CrashLoopBackOff   33    148m   10.0.0.39   gke-terraform-202010271506369711-78b5ca99-ti3k   <none>   <none>

I think it would make sense to either change the deployment strategy or stop binding to the host network, so that two recorder pods can't collide on the same node.
@maqiuyujoyce thanks for your help. In answer to your questions:
I believe this happened automatically, as I had not applied any Config Connector manifests prior to this happening, though I could be wrong; someone else who works on this cluster may have applied something.
Yes, it looks like the other Config Connector resources were updated along with the stats-recorder (they at least all have the same creation time as the stats-recorder pod that was crashing).
Not sure here, but I think this may have been triggered by an automatic minor version update to the cluster.
I installed Config Connector via the GKE add-on. I also followed your advice and deleted the old stats-recorder pod, which solved the port usage issue, and the new recorder was able to start without a problem afterwards. I think @eyalzek hit the nail on the head with his finding that both pods were using the host network on the same host, which caused the problem. I'd be fine with having this issue closed now, but it would be nice if Config Connector added something to prevent this from happening in the first place. Thanks!
I have the exact same issue. Unfortunately, even after deleting the recorder pod, the crash loop persists.
Thanks for all the info! As an update: we discussed internally and have decided to change the recorder's deployment strategy. I'll post an update on a fix; the goal is to get it in by the next release.
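A rough sketch of what such a change amounts to, assuming the standard Kubernetes Recreate strategy is what's meant (the cnrm-system namespace is an assumption, and the add-on may revert manual edits):

```sh
# Switch the recorder Deployment to the Recreate strategy so the old pod is
# terminated before its replacement tries to bind the same host port.
# JSON merge patch: setting rollingUpdate to null clears the old strategy settings.
kubectl -n cnrm-system patch deployment cnrm-resource-stats-recorder \
  --type merge \
  -p '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'
```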
@mathieu-benoit this should work. Could you try deleting all the recorder pods, rather than just a single one?
FYI @toumorokoshi, in my case I just have 1
@mathieu-benoit go into the node it's running on and check what's listening on port 8888... there must be something there. If you have multiple nodes, you can try to cordon the problematic one while restarting the pod to see if this happens on all nodes or just that one.
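In concrete terms, that debugging loop might look like the sketch below (node and pod names are placeholders, and the cnrm-system namespace is an assumption):

```sh
# On the suspect node (e.g. via SSH), see what already holds port 8888.
netstat -tulpen | grep 8888

# Cordon the node so the restarted pod gets scheduled somewhere else.
kubectl cordon <node-name>

# Delete the crash-looping recorder pod; the Deployment recreates it on another node.
kubectl delete pod -n cnrm-system <recorder-pod-name>

# Undo the cordon once you've seen whether the conflict follows the node.
kubectl uncordon <node-name>
```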
@toumorokoshi I just realized that we're also seeing the same issue that @mathieu-benoit is seeing, but only on our rapid GKE cluster running a 1.19 version. On the 1.19 nodes there's software running which is bound to port 8888:

# netstat -tulpen | grep 8888
tcp 0 0 127.0.0.1:8888 0.0.0.0:* LISTEN 1000 27836 2222/otelsvc
# ps ax | grep 2222
2222 ? Ssl 0:11 /otelsvc --config=/conf/gke-metrics-agent-config.yaml --metrics-prefix=

In this case the process is otelsvc, the GKE metrics agent. On the 1.18 nodes that scrape target is not configured:

$ diff /tmp/cm-1.19.yaml /tmp/cm-1.18.yaml | grep -B1 8888
<           static_configs:
<             - targets: ["127.0.0.1:8888"]

I can't find any info regarding this in the GKE changelog, but it means that a GKE cluster running 1.19 cannot work with both the metrics add-on and the Config Connector add-on... I would suggest solving this internally and possibly changing the port on one of these workloads.
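To check whether a given node image ships that scrape target, one can inspect the metrics-agent ConfigMap directly; the ConfigMap name below is a guess, so adjust it to whatever the first command shows:

```sh
# Locate the gke-metrics-agent ConfigMap (its exact name may differ per GKE version).
kubectl -n kube-system get configmaps | grep -i metrics

# Dump it and look for the 127.0.0.1:8888 scrape target seen above.
kubectl -n kube-system get configmap <gke-metrics-agent-configmap> -o yaml | grep -B1 8888
```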
Got it, thanks for the digging and for scoping this to GKE 1.19! I'll talk to a few folks internally to figure out the right next step here. I'll probably also ship a change to at least choose a more obscure host port to bind to.
Hello, as an update: I'm working with the GKE folks now, but that resolution may take a while. For now we are targeting shipping a fix in the next version of Config Connector. That will appear in the add-on within a couple of weeks (there are unfortunate delays in GKE add-on updates), and it is possible to get a fix sooner by using the manual installation.
JFYI: I just tested the version
Thanks for the information! I'll close this issue for now since we shipped a fix in 1.46. When it does pop up in the add-on I'll try to paste some versions in this thread. To be clear, the fix is to use a more obscure port, along with using the "Replace" deployment strategy. So you may still see port conflicts if something else happens to use the new port.
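If a conflict does show up again, one way to see which host port the recorder now asks for, and whether anything else on the node already holds it (cnrm-system namespace assumed, port left as a placeholder):

```sh
# Show the container ports declared by the recorder; with hostNetwork these bind
# directly on the node.
kubectl -n cnrm-system get deployment cnrm-resource-stats-recorder \
  -o jsonpath='{.spec.template.spec.containers[*].ports}'

# Then, on the node, check whether that port is already taken.
netstat -tulpen | grep <port>
```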
I just bumped into this issue today, when I was playing around with Config Connector. Switching from the GKE add-on installation to manually installing the operator did the trick for me as well. Is there any ETA yet on when this fix will be available through the add-on?
Unfortunately we don't have a lot of control over the add-on availability; it can be up to 8 weeks. We are currently working on a project to try to reduce that time. At this point manual installation is your best choice to stay on the edge.
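For reference, the manual installation looks roughly like the commands below; the bundle path and file names are recalled from the Config Connector install docs rather than quoted from this thread, so verify them against the current guide:

```sh
# Download and unpack the latest operator release bundle (path per the install docs).
gsutil cp gs://configconnector-operator/latest/release-bundle.tar.gz release-bundle.tar.gz
tar zxvf release-bundle.tar.gz

# Install the Config Connector operator; a ConfigConnector resource then configures it.
kubectl apply -f operator-system/configconnector-operator.yaml
```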
I bumped into the same issue, and wanted to add that in my case the port clash is with the cnrm controller manager itself.
Humm, I don't see that release in https://cloud.google.com/kubernetes-engine/docs/release-notes.
Hi @NeckBeardPrince, it seems that it was released on May 19. @hsmade, apologies that we just saw your comment. Are you still facing the issue? If you want to resolve it ASAP, we recommend switching to a manual installation instead, just like @hsuabina did in this comment.
No worries. I'm running test clusters atm, so not impacted ;o) |
Gotcha, good to hear :)
I am running into this same issue:
My cluster version is 1.19.9-gke.1900, and I installed Config Connector via the add-on (https://cloud.google.com/config-connector/docs/how-to/install-upgrade-uninstall#addon-install). Is the add-on version still out of date?
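One quick way to check which version the add-on is actually running (the cnrm-system namespace is an assumption) is to look at the images on the Config Connector deployments:

```sh
# List the Config Connector deployments and their images.
kubectl -n cnrm-system get deployments -o wide

# Or just the recorder's image tag.
kubectl -n cnrm-system get deployment cnrm-resource-stats-recorder \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
```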
Same here. So annoying this is taking so long for GCP to update the plugin. This has been going on for almost four months now. |
Hello, sorry to hear you're still having issues, but I think upgrading your GKE master version would help (details below).
Unfortunately yes. The version that the issue was fixed in is 1.46.0, and the version listed in your manifest is 1.45.0. We just released 1.58 last week.
Would you happen to be using a static version of GKE? I don't see that version available in the stable, regular, or rapid channel. For now the Config Connector add-on is tied to the GKE master version: you will need to update the master to pick up a new Config Connector.

Taking a look at the versions available in us-central1, all versions in any channel (rapid, regular, stable) contain at least version 1.51 of Config Connector. If you are concerned about updating a Kubernetes minor version, version 1.19.12-gke.2100 will contain Config Connector 1.56.

Regardless of whether this resolves your issue, we know there is general pain around the slowness of add-on versions, and we are actively talking to the GKE team about options to speed things up.
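As a sketch of the upgrade path described above (cluster name and zone are placeholders):

```sh
# List the GKE versions currently offered in your location.
gcloud container get-server-config --zone <zone>

# Upgrade only the control plane to a version that bundles a newer Config Connector;
# node pools can be upgraded separately.
gcloud container clusters upgrade <cluster-name> --zone <zone> \
  --master --cluster-version 1.19.12-gke.2100
```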
Bug Description
A new revision for cnrm-resource-stats-recorder that tried to start yesterday is failing in a crash loop in one of my clusters, complaining that port 8888 is already in use (in the 'recorder' container). It's trying to run recorder: gcr.io/gke-release/cnrm/recorder:d399cc9, prom-to-sd: k8s.gcr.io/prometheus-to-sd:v0.9.1. The previous version is still running fine, on versions: recorder: gcr.io/gke-release/cnrm/recorder:2081072, prom-to-sd: k8s.gcr.io/prometheus-to-sd:v0.9.1.

Any insight into this issue would be appreciated! Should I kill this revision and try re-applying the Config Connector manifests?
Additional Diagnostic Information
Kubernetes Cluster Version: 1.17.17-gke.2800
Config Connector Version: 1.39.0
Config Connector Mode: cluster
Log Output
The logs from the "recorder" container (repeated with each crash):
Steps to Reproduce
I'm not sure exactly what triggered this issue, but it seems that it occurred when the recorder container was trying to upgrade to a new version while another revision was already running in the cluster.