support to retry delete containers if failed to connect to containerd #130403

ningmingxiao · 2025-02-25T02:29:13Z

What type of PR is this?

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Signed-off-by: ningmingxiao <[email protected]>

k8s-ci-robot · 2025-02-25T02:29:15Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-02-25T02:29:22Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

k8s-ci-robot · 2025-02-25T02:29:22Z

Hi @ningmingxiao. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

k8s-ci-robot · 2025-02-25T02:29:29Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ningmingxiao
Once this PR has been reviewed and has the lgtm label, please assign yujuhong for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/kubelet/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

googs1025 · 2025-02-25T02:45:03Z

pkg/kubelet/kuberuntime/kuberuntime_gc.go

-		if err := cgc.manager.removeContainer(ctx, containers[i].id); err != nil {
-			klog.ErrorS(err, "Failed to remove container", "containerID", containers[i].id)
-		}
+		go func() {


Why would using a goroutine solve this problem?
IIUC, if an error occurs when calling removeContainer, why do we need to retry? What is the reason for retrying? Generally, retries should be used to wait for some action or resource to complete (for example, waiting for a reconcile or for binding) 🤔

I found that cgc.manager.removeContainer failed and the deletion was not attempted again afterwards.

The logs show that the delete was attempted only once:
[containerd]$ cat k8s210.log |grep -i "failed to remove container"|wc -l
4906
[containerd]$ cat k8s210.log |grep -i "failed to remove container"|grep abbd21
E0210 21:37:42.651136 36915 kuberuntime_gc.go:151] "Failed to remove container" err="failed to get container status "abbd218805aa3e4aa64202463e719ca2a72e0ab56470bfb99bf0974f63e3d717": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"" containerID="abbd218805aa3e4aa64202463e719ca2a72e0ab56470bfb99bf0974f63e3d717"
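
For illustration only, a minimal sketch of what a bounded retry around a removal call could look like. This is not the code from this PR: `removeFunc`, `removeWithRetry`, and `demoRetry` are hypothetical names, the sketch retries on any error, and the attempt count and delays are arbitrary assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// removeFunc is a hypothetical stand-in for a container removal call.
type removeFunc func(ctx context.Context, id string) error

// removeWithRetry calls remove up to attempts times, doubling the delay
// after each failure. It stops early on success or when ctx is cancelled.
func removeWithRetry(ctx context.Context, remove removeFunc, id string, attempts int, delay time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = remove(ctx, id); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point in sleeping after the last attempt
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
	}
	return fmt.Errorf("giving up on %s after %d attempts: %w", id, attempts, err)
}

// demoRetry simulates a runtime that refuses the first two calls.
func demoRetry() (int, error) {
	calls := 0
	flaky := func(ctx context.Context, id string) error {
		calls++
		if calls < 3 {
			return errors.New("connection refused")
		}
		return nil
	}
	err := removeWithRetry(context.Background(), flaky, "abbd21", 5, time.Millisecond)
	return calls, err
}

func main() {
	calls, err := demoRetry()
	fmt.Println(calls, err) // prints: 3 <nil>
}
```

The retry runs synchronously in the caller, so it adds latency to the caller instead of spawning extra goroutines.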

From the error message, it seems that there is a problem with the socket connection to containerd. 🤔 It may be caused by the network, the containerd status, or the socket file. I don't think this should be a normal reason for retrying. In addition, if a large number of goroutines are started to perform retries, won't that put more pressure on the kubelet?

Can we limit the number of goroutines? We created 1000 pods on one node, and connecting to containerd sometimes failed because the system was very busy.

"Can we limit the number of goroutines?"

I am not sure if this limitation is a good idea.

"We created 1000 pods on one node, and connecting to containerd sometimes failed because the system was very busy."

In addition, I remember that the official recommendation for a single node is 110 pods. Of course, more than 110 may be necessary in some scenarios, but I don't know if it is reasonable to exceed it by this much.
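
As a reference point for the question about limiting goroutines, here is a minimal sketch of the common buffered-channel semaphore pattern in Go. It is an illustration under stated assumptions, not code from the kubelet or from this PR: `runBounded` and `demoPeak` are hypothetical names, and the limit of 4 is arbitrary.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runBounded runs work(id) for every id with at most limit goroutines
// active at once. It returns the highest concurrency it observed.
func runBounded(ids []string, limit int, work func(id string)) int32 {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // one slot per allowed worker
	var wg sync.WaitGroup
	var running, peak int32
	for _, id := range ids {
		sem <- struct{}{} // blocks while limit workers are active
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			n := atomic.AddInt32(&running, 1)
			for {
				p := atomic.LoadInt32(&peak)
				if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
					break
				}
			}
			work(id)
			atomic.AddInt32(&running, -1)
		}(id)
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

// demoPeak runs 20 short tasks with a limit of 4 workers.
func demoPeak() int32 {
	ids := make([]string, 20)
	for i := range ids {
		ids[i] = fmt.Sprintf("container-%d", i)
	}
	return runBounded(ids, 4, func(string) { time.Sleep(time.Millisecond) })
}

func main() {
	fmt.Println("peak concurrency:", demoPeak())
}
```

A cap like this bounds the extra load on the kubelet, but it does not address the reviewer's underlying concern about whether retrying is appropriate at all.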

This looks like a very hacky solution.

Bear in mind that you are fixing this one problem, but if the container runtime is failing as in the error you show, you'll hit another place later, and we are not going to implement goroutine-based retries for all the operations. So we had better define the retriable failures in all operations with the runtime holistically, but we should not try to compensate for external errors in the kubelet based on individual and anecdotal evidence.

/hold
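
To make the idea of defining retriable failures in one place concrete, here is a small sketch of a central classifier. Everything in it is an assumption for illustration: `runtimeError`, `retriableCodes`, and `isRetriable` are hypothetical, and the sketch models the `code = Unavailable` field from the log line above as a plain string so that it needs only the standard library. A real implementation would more likely inspect the gRPC status of the returned error, and which codes count as retriable is a policy decision this sketch does not settle.

```go
package main

import (
	"errors"
	"fmt"
)

// runtimeError is a hypothetical error type carrying a status code name,
// mirroring the "code = Unavailable" field seen in the log line.
type runtimeError struct {
	Code string
	Msg  string
}

func (e *runtimeError) Error() string { return e.Code + ": " + e.Msg }

// retriableCodes is the single place where the policy lives.
// The chosen entries are an assumption, not an agreed-upon list.
var retriableCodes = map[string]bool{
	"Unavailable":      true,
	"DeadlineExceeded": true,
}

// isRetriable reports whether err, or any error it wraps, carries a
// code listed in retriableCodes.
func isRetriable(err error) bool {
	var re *runtimeError
	if errors.As(err, &re) {
		return retriableCodes[re.Code]
	}
	return false
}

// demoClassify classifies one wrapped transient error and one permanent one.
func demoClassify() (bool, bool) {
	transient := fmt.Errorf("remove container: %w",
		&runtimeError{Code: "Unavailable", Msg: "connection refused"})
	permanent := &runtimeError{Code: "NotFound", Msg: "no such container"}
	return isRetriable(transient), isRetriable(permanent)
}

func main() {
	a, b := demoClassify()
	fmt.Println(a, b) // prints: true false
}
```

With a shared predicate like this, every call site that talks to the runtime can apply the same policy instead of each one growing its own retry logic.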

ningmingxiao · 2025-02-25T09:30:26Z

I added a test: cgc.manager.removeContainer can be retried and the delete then succeeds, so there must be some reason the kubelet doesn't get a chance to call cgc.manager.removeContainer again.

k8s-triage-robot · 2025-05-26T14:10:56Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2025-06-25T14:15:49Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

support to retry delete containers if faild to connect to containerd

7af1f1b

Signed-off-by: ningmingxiao <[email protected]>

k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Feb 25, 2025

k8s-ci-robot requested review from mrunalp and yujuhong February 25, 2025 02:29

k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 25, 2025

googs1025 reviewed Feb 25, 2025

ningmingxiao changed the title ~~support to retry delete containers if faild to connect to containerd~~ support to retry delete containers if failed to connect to containerd Feb 25, 2025

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 25, 2025

bart0sh moved this from Triage to Waiting on Author in SIG Node: code and documentation PRs Feb 27, 2025

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 26, 2025

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 25, 2025
