support to retry delete containers if failed to connect to containerd #130403
base: master
Conversation
Signed-off-by: ningmingxiao <[email protected]>
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. The triage/accepted label can be added by org members by writing /triage accepted in a comment. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @ningmingxiao. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ningmingxiao. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
if err := cgc.manager.removeContainer(ctx, containers[i].id); err != nil {
	klog.ErrorS(err, "Failed to remove container", "containerID", containers[i].id)
}
go func() {
Why would using a goroutine solve this problem?
IIUC, if an error occurs when calling removeContainer, why do we need to retry? What is the reason for retrying? Generally, retries should be used to wait for some action or resource to complete (for example, waiting for reconcile or waiting for binding) 🤔
I found that cgc.manager.removeContainer fails and there is no retry to delete the container next time.
Some logs show that deletion is only attempted the first time:
[containerd]$ cat k8s210.log |grep -i "failed to remove container"|wc -l
4906
[containerd]$ cat k8s210.log |grep -i "failed to remove container"|grep abbd21
E0210 21:37:42.651136 36915 kuberuntime_gc.go:151] "Failed to remove container" err="failed to get container status "abbd218805aa3e4aa64202463e719ca2a72e0ab56470bfb99bf0974f63e3d717": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"" containerID="abbd218805aa3e4aa64202463e719ca2a72e0ab56470bfb99bf0974f63e3d717"
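The decisive detail in the log above is the gRPC status: the removal fails with code Unavailable because the dial to /run/containerd/containerd.sock is refused. As a hedged illustration (not kubelet code), a hypothetical helper could classify such errors as transient and distinguish them from permanent per-container failures; the set of codes chosen here is an assumption:

```go
package main

import (
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isTransientRuntimeError reports whether err looks like a temporary failure
// to reach the container runtime (e.g. containerd restarting or overloaded),
// as opposed to a permanent error for this particular container.
// Hypothetical helper for illustration; not part of the kubelet.
func isTransientRuntimeError(err error) bool {
	if err == nil {
		return false
	}
	s, ok := status.FromError(err)
	if !ok {
		return false
	}
	// codes.Unavailable is what the log above shows ("connection refused"
	// while dialing containerd.sock); also treating DeadlineExceeded as
	// transient is an additional assumption.
	return s.Code() == codes.Unavailable || s.Code() == codes.DeadlineExceeded
}

func main() {
	transient := status.Error(codes.Unavailable,
		"connection error: dial unix /run/containerd/containerd.sock: connect: connection refused")
	permanent := errors.New("container not found")

	fmt.Println(isTransientRuntimeError(transient)) // true
	fmt.Println(isTransientRuntimeError(permanent)) // false
}
```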
From the error message, it seems that there is a problem with the socket connection to containerd. 🤔 It may be due to the network, containerd's status, or the socket file. I don't think this should be a normal reason for retrying. In addition, if a large number of goroutines are started to perform retries, will it put more pressure on the kubelet?
Can we limit the number of goroutines? We create 1000 pods on a single node, and sometimes it fails to connect to containerd because the system is very busy.
> Can we limit the number of goroutines?

I am not sure if this limitation is a good idea.

> We create 1000 pods on a single node, and sometimes it fails to connect to containerd because the system is very busy.

In addition, I remember that the official recommendation for a single node is 110 pods. Of course, more than 110 may be necessary in some scenarios, but I don't know whether it is reasonable to exceed it by this much.
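On the concern about unbounded goroutines: a buffered channel works as a simple semaphore that caps how many retry goroutines run at once. The following is only a self-contained sketch of the pattern under discussion; removeWithRetry, the attempt count, and the limit of 10 are illustrative assumptions, not the PR's implementation:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// removeWithRetry retries a container removal a fixed number of times with a
// short delay between attempts. Sketch only; not kubelet code.
func removeWithRetry(ctx context.Context, id string, remove func(context.Context, string) error) {
	const attempts = 3
	for i := 0; i < attempts; i++ {
		if err := remove(ctx, id); err == nil {
			return
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(2 * time.Second):
			// wait before the next attempt
		}
	}
}

func main() {
	// sem caps the number of concurrent retry goroutines so a busy node
	// does not pile up thousands of them.
	sem := make(chan struct{}, 10)
	var wg sync.WaitGroup

	// stand-in for cgc.manager.removeContainer
	remove := func(ctx context.Context, id string) error {
		fmt.Println("removing", id)
		return nil
	}

	for _, id := range []string{"c1", "c2", "c3"} {
		wg.Add(1)
		sem <- struct{}{} // blocks once 10 retries are already in flight
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }()
			removeWithRetry(context.Background(), id, remove)
		}(id)
	}
	wg.Wait()
}
```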
This looks like a very hacky solution.
Bear in mind that you are fixing this problem, but if the container runtime is failing as in the error you show, you'll hit another place later ... and we are not going to implement goroutine-based retries for all the operations ... so we'd better define the retriable failures for all operations with the runtime holistically ... but we should not try to compensate for external errors in the kubelet based on individual and anecdotal evidence.
/hold
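To make the "define retriable failures holistically" idea concrete, here is a minimal sketch of a single retry wrapper that every runtime call could share, built on k8s.io/apimachinery/pkg/util/wait. The helper names (withRuntimeRetry, retriable), the backoff parameters, and the choice of retriable gRPC codes are assumptions for illustration, not the kubelet's actual behavior:

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"k8s.io/apimachinery/pkg/util/wait"
)

// retriable classifies which runtime errors are worth retrying at all.
// Limiting it to codes.Unavailable is an assumption for illustration.
func retriable(err error) bool {
	s, ok := status.FromError(err)
	return ok && s.Code() == codes.Unavailable
}

// withRuntimeRetry wraps a runtime operation with one shared backoff policy,
// so every caller treats transient failures the same way. Sketch only.
func withRuntimeRetry(op func() error) error {
	backoff := wait.Backoff{Duration: 500 * time.Millisecond, Factor: 2.0, Steps: 4}
	var lastErr error
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		lastErr = op()
		if lastErr == nil {
			return true, nil
		}
		if retriable(lastErr) {
			return false, nil // transient: let the backoff try again
		}
		return false, lastErr // permanent: stop immediately
	})
	if errors.Is(err, wait.ErrWaitTimeout) {
		return lastErr // retries exhausted; surface the last runtime error
	}
	return err
}

func main() {
	err := withRuntimeRetry(func() error {
		return status.Error(codes.Unavailable,
			"dial unix /run/containerd/containerd.sock: connect: connection refused")
	})
	fmt.Println("final error:", err)
}
```

With a wrapper like this, the decision about what is retriable lives in one place instead of being re-implemented per call site.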
I added some tests: cgc.manager.removeContainer can be retried to delete the container, so there must be some reason the kubelet doesn't get a chance to call cgc.manager.removeContainer again.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
What type of PR is this?
Fixes #130331
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: