
Fix Multiple RemoveContainer Calls Due to Race Condition #131312


Open
wants to merge 3 commits into base: master from dont-execute-removecontainer-multiple-times

Conversation

HirazawaUi
Contributor

@HirazawaUi HirazawaUi commented Apr 15, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

During pod deletion, a race condition can cause RemoveContainer to be executed multiple times for the same container.

Which issue(s) this PR fixes:

Fixes #131275

Special notes for your reviewer:

In PLEG, we send the ContainerDied events for both the sandbox and regular containers to plegCh simultaneously. In syncLoopIteration, upon receiving a ContainerDied event, we execute kl.cleanUpContainersInPod(e.ID, containerID) to clean up the containers in the pod.

// In PLEG, sandboxes are tracked alongside regular containers, so the sandbox
// also gets a ContainerDied event:
    for _, p := range pods {
        if p == nil {
            continue
        }
        fillCidSet(p.Containers)
        // Update sandboxes as containers
        // TODO: keep track of sandboxes explicitly.
        fillCidSet(p.Sandboxes)
    }
    return containers
}

// In syncLoopIteration, each ContainerDied event triggers cleanUpContainersInPod:
    case e := <-plegCh:
        if isSyncPodWorthy(e) {
            // PLEG event for a pod; sync it.
            if pod, ok := kl.podManager.GetPodByUID(e.ID); ok {
                klog.V(2).InfoS("SyncLoop (PLEG): event for pod", "pod", klog.KObj(pod), "event", e)
                handler.HandlePodSyncs([]*v1.Pod{pod})
            } else {
                // If the pod no longer exists, ignore the event.
                klog.V(4).InfoS("SyncLoop (PLEG): pod does not exist, ignore irrelevant event", "event", e)
            }
        }
        if e.Type == pleg.ContainerDied {
            if containerID, ok := e.Data.(string); ok {
                kl.cleanUpContainersInPod(e.ID, containerID)
            }
        }

In getContainersToDeleteInPod (reached from cleanUpContainersInPod via deleteContainersInPod), we explicitly select from podStatus.ContainerStatuses only the container whose ID matches exitedContainerID. This avoids triggering unnecessary RemoveContainer calls for the sandbox's ContainerDied event.

// In getContainersToDeleteInPod:
    matchedContainer := func(filterContainerId string, podStatus *kubecontainer.PodStatus) *kubecontainer.Status {
        if filterContainerId == "" {
            return nil
        }
        for _, containerStatus := range podStatus.ContainerStatuses {
            if containerStatus.ID.ID == filterContainerId {
                return containerStatus
            }
        }
        return nil
    }(filterContainerID, podStatus)

However, there is a race condition: if SyncTerminatingPod completes before the ContainerDied event is received, the pod's status.terminatedAt is already set. In that case, the removeAll variable in cleanUpContainersInPod is true, so the ContainerDied events for the sandbox and for the regular containers each trigger the same set of RemoveContainer calls.

func (kl *Kubelet) cleanUpContainersInPod(podID types.UID, exitedContainerID string) {
    if podStatus, err := kl.podCache.Get(podID); err == nil {
        // When an evicted or deleted pod has already synced, all containers can be removed.
        removeAll := kl.podWorkers.ShouldPodContentBeRemoved(podID)
        kl.containerDeletor.deleteContainersInPod(exitedContainerID, podStatus, removeAll)
    }
}
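
Roughly, the problematic ordering looks like this (a simplified timeline, not an exact trace):

1. All containers in the pod exit and SyncTerminatingPod runs to completion, so ShouldPodContentBeRemoved(podID) starts returning true.
2. PLEG relists and emits ContainerDied events for every regular container and for the sandbox.
3. Each of those events reaches cleanUpContainersInPod with removeAll == true, so every event independently tries to remove all exited containers in the pod.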

Since removeAll is true at this point, deleteContainersInPod iterates over every container in the pod that is in the Exited state and executes RemoveContainer for each of them. Because this happens once per ContainerDied event, the total number of RemoveContainer executions grows to roughly the square of the number of containers in the pod.

func (p *podContainerDeletor) deleteContainersInPod(filterContainerID string, podStatus *kubecontainer.PodStatus, removeAll bool) {
    containersToKeep := p.containersToKeep
    if removeAll {
        containersToKeep = 0
        filterContainerID = ""
    }

// getContainersToDeleteInPod then collects every exited container as a removal candidate:
    var candidates containerStatusbyCreatedList
    for _, containerStatus := range podStatus.ContainerStatuses {
        if containerStatus.State != kubecontainer.ContainerStateExited {
            continue
        }
        if matchedContainer == nil || matchedContainer.Name == containerStatus.Name {
            candidates = append(candidates, containerStatus)
        }
    }
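
To make the blow-up concrete, here is a minimal standalone Go sketch (illustrative only, not kubelet code; all names are made up) that counts the removals issued when every ContainerDied event runs with the removeAll behavior:

package main

import "fmt"

func main() {
    const numContainers = 3 // regular containers in the pod, all already exited

    removeCalls := 0
    events := numContainers + 1 // one ContainerDied per container, plus one for the sandbox
    for e := 0; e < events; e++ {
        // With removeAll == true, each event removes every exited container
        // instead of only the one named in the event.
        for c := 0; c < numContainers; c++ {
            removeCalls++ // stands in for one RemoveContainer CRI call
        }
    }

    // Prints 12 for 3 containers: (N+1) events x N removals per event, i.e. roughly N^2.
    fmt.Println("RemoveContainer calls:", removeCalls)
}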

In this PR's fix, I modified deleteContainersInPod so it no longer accepts the removeAll parameter. When filterContainerID is passed as an empty string, it follows the previous logic and deletes all containers by default (previously, when removeAll was set, deleteContainersInPod explicitly set filterContainerID to empty to achieve the same effect, which made the removeAll parameter redundant).

Additionally, I added a keepMinimContainers parameter to deleteContainersInPod. After a pod completes SyncTerminatingPod, we delete the pod's containers one by one based on the ContainerDied events generated by PLEG and no longer retain a minimum number of containers in the pod. While SyncTerminatingPod has not yet completed, the previous behavior is kept: containers are deleted one by one and a minimum number of containers is retained.
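
For illustration, the reworked helper looks roughly like this (a simplified sketch of the idea, not the exact diff in this PR):

// Sketch only, assuming the existing getContainersToDeleteInPod helper keeps its signature.
func (p *podContainerDeletor) deleteContainersInPod(filterContainerID string, podStatus *kubecontainer.PodStatus, keepMinimContainers bool) {
    containersToKeep := 0
    if keepMinimContainers {
        // Before SyncTerminatingPod completes, keep the configured minimum
        // number of dead containers around, as before.
        containersToKeep = p.containersToKeep
    }
    // An empty filterContainerID already means "consider every exited container",
    // which is what the removed removeAll flag used to force.
    if candidates := getContainersToDeleteInPod(filterContainerID, podStatus, containersToKeep); len(candidates) > 0 {
        p.worker <- candidates
    }
}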

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Apr 15, 2025
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.33 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.33.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Tue Apr 15 13:35:16 UTC 2025.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 15, 2025
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. labels Apr 15, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Apr 15, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: HirazawaUi
Once this PR has been reviewed and has the lgtm label, please assign random-liu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@HirazawaUi
Contributor Author

/retest

@HirazawaUi HirazawaUi changed the title from "Fix Race Condition in Pod Deletion to Prevent Multiple RemoveContainer Calls" to "Fix Multiple RemoveContainer Calls Due to Race Condition" on Apr 15, 2025
@HirazawaUi
Contributor Author

/retest

@bart0sh
Contributor

bart0sh commented Apr 16, 2025

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 16, 2025
@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node: code and documentation PRs Apr 16, 2025
@hshiina
Contributor

hshiina commented Apr 16, 2025

This PR will fix the case where a pod has only one container, which seems to be the most common case. I think this fix is worth it. However, if a pod has multiple containers, RemoveContainer() can still be called multiple times for one container due to removeAll.

@HirazawaUi HirazawaUi force-pushed the dont-execute-removecontainer-multiple-times branch from 4869685 to e30ad7e Compare April 18, 2025 02:09
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 18, 2025
@HirazawaUi
Contributor Author

/retest

@HirazawaUi
Contributor Author

This PR will fix the case where a pod has only one container, which seems to be the most common case. I think this fix is worth it. However, if a pod has multiple containers, RemoveContainer() can still be called multiple times for one container due to removeAll.

Yes, I also noticed this issue, and the number of RemoveContainer() calls is the square of the number of containers in the pod, which I think is unreasonable. I have already fixed this behavior in the PR.

@HirazawaUi
Contributor Author

/retest

@hshiina
Contributor

hshiina commented Apr 18, 2025

Removing removeAll may delay container deletion on the runtime, though I'm not sure how harmful that would be.

We can observe this delay with this pod:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: testpod
  name: testpod
spec:
  containers:
  - image: busybox
    name: container1
    command:
      - sh
      - -c
      - trap "exit 0" SIGTERM; while true; do sleep 1; done
  - image: busybox
    name: container2
    command:
      - sh
      - -c
      - sleep 1d

At pod termination, while container1 exits soon, container2 remains until the gracePeriod expires. Then, when the ContainerDied event for container1 is raised, RemoveContainer() is not called because SyncTerminatingPod() has not completed yet. With this fix, when a ContainerDied event for container2 or the sandbox is raised, container1 is not removed; it is removed later by garbage collection.

@hshiina
Contributor

hshiina commented Apr 18, 2025

By the way, when EventedPLEG is enabled, I guess ContainerDied events are usually delivered before SyncTerminatingPod() has completed. This gap with GenericPLEG might be another problem for the EventedPLEG feature.

@HirazawaUi
Contributor Author

HirazawaUi commented Apr 18, 2025

At pod termination, while container1 exits soon, container2 remains until the gracePeriod expires. Then, when the ContainerDied event for container1 is raised, RemoveContainer() is not called because SyncTerminatingPod() has not completed yet. With this fix, when a ContainerDied event for container2 or the sandbox is raised, container1 is not removed; it is removed later by garbage collection.

Removing removeAll may delay container deletion on the runtime, though I'm not sure how harmful that would be.

Thanks for the reminder. The e2e test failure is also caused by this. I initially hoped to minimize code changes and fix this issue without adding extra caching or locks, but that no longer seems feasible. I'll mark this PR as WIP and work on finding a better solution.

@HirazawaUi HirazawaUi changed the title from "Fix Multiple RemoveContainer Calls Due to Race Condition" to "WIP: Fix Multiple RemoveContainer Calls Due to Race Condition" on Apr 18, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 18, 2025
@HirazawaUi HirazawaUi changed the title from "WIP: Fix Multiple RemoveContainer Calls Due to Race Condition" to "Fix Multiple RemoveContainer Calls Due to Race Condition" on Apr 20, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 20, 2025
@HirazawaUi
Contributor Author

@hshiina Can you review my latest changes to see if I've missed any scenarios? :) Adding a cache to avoid repeated container cleanup was one of my initial ideas, but I was concerned that modifying too much code would complicate the review and PR process, so I initially focused on fixing the issue with minimal changes. However, adding this cache now seems unavoidable.

@hshiina
Contributor

hshiina commented Apr 20, 2025

It seems that RemoveContainer() can still cause NotFound in a case where two containers exit at almost the same time:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: testpod
  name: testpod
spec:
  containers:
  - image: busybox
    name: container1
    command:
      - sh
      - -c
      - trap "exit 0" SIGTERM; while true; do sleep 1; done
  - image: busybox
    name: container2
    command:
      - sh
      - -c
      - trap "exit 0" SIGTERM; while true; do sleep 1; done

When PLEG detects two containers exiting at the same time, it raises two ContainerDied events. The first event removes all containers, because all of the pod's containers are in Exited status in the pod cache. Then the single RemoveContainer() triggered by the second event fails with NotFound.
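
Spelled out, the sequence is roughly:

1. container1 and container2 exit within the same relist interval.
2. PLEG raises a ContainerDied event for each of them.
3. While handling the first event, the pod cache already shows both containers as Exited, so both are removed.
4. The second event then issues RemoveContainer for a container that is already gone, which returns NotFound.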

@HirazawaUi
Contributor Author

HirazawaUi commented Apr 21, 2025

It seems that RemoveContainer() can still cause NotFound in a case where two containers exit at almost the same time:

We would just need to modify the current code so that, if a pod already exists in the podCleanupTracker cache, the containers in that pod are not cleaned up again. That way we can avoid this situation, but I'm concerned it might break the consistency of the podWorker.

Compared to the previous behavior, where the number of RemoveContainer calls grew quadratically with the number of containers, I think the current fix might already be sufficient? Of course, I'm very open to any opinions.

@HirazawaUi HirazawaUi force-pushed the dont-execute-removecontainer-multiple-times branch from 3441b85 to b13c099 Compare April 21, 2025 12:25
@k8s-ci-robot
Contributor

k8s-ci-robot commented Apr 21, 2025

@HirazawaUi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-unit-windows-master
Commit: b13c099
Required: false
Rerun command: /test pull-kubernetes-unit-windows-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sreeram-venkitesh
Member

CC @haircommander

@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 28, 2025

Successfully merging this pull request may close these issues.

kubelet tries to remove pod multiple times(reopen)