
Fix Multiple RemoveContainer Calls Due to Race Condition #131312


Open
wants to merge 3 commits into base: master from dont-execute-removecontainer-multiple-times

Conversation

HirazawaUi
Contributor

@HirazawaUi HirazawaUi commented Apr 15, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

During pod deletion, a race condition can cause RemoveContainer to be executed multiple times for the same container.

Which issue(s) this PR fixes:

Fixes #131275

Special notes for your reviewer:

In PLEG, we send the ContainerDied events for both the sandbox and regular containers to plegCh simultaneously. In syncLoopIteration, upon receiving a ContainerDied event, we execute kl.cleanUpContainersInPod(e.ID, containerID) to clean up the containers in the pod.

// In PLEG, sandboxes are tracked alongside regular containers, so the sandbox
// also gets a ContainerDied event:
    for _, p := range pods {
        if p == nil {
            continue
        }
        fillCidSet(p.Containers)
        // Update sandboxes as containers
        // TODO: keep track of sandboxes explicitly.
        fillCidSet(p.Sandboxes)
    }
    return containers
}

// In syncLoopIteration, each ContainerDied event triggers cleanUpContainersInPod:
    case e := <-plegCh:
        if isSyncPodWorthy(e) {
            // PLEG event for a pod; sync it.
            if pod, ok := kl.podManager.GetPodByUID(e.ID); ok {
                klog.V(2).InfoS("SyncLoop (PLEG): event for pod", "pod", klog.KObj(pod), "event", e)
                handler.HandlePodSyncs([]*v1.Pod{pod})
            } else {
                // If the pod no longer exists, ignore the event.
                klog.V(4).InfoS("SyncLoop (PLEG): pod does not exist, ignore irrelevant event", "event", e)
            }
        }
        if e.Type == pleg.ContainerDied {
            if containerID, ok := e.Data.(string); ok {
                kl.cleanUpContainersInPod(e.ID, containerID)
            }
        }

In getContainersToDeleteInPod (reached from cleanUpContainersInPod via deleteContainersInPod), we explicitly select from podStatus.ContainerStatuses only the container whose ID matches exitedContainerID. This avoids triggering unnecessary RemoveContainer calls for the sandbox's ContainerDied event.

// In getContainersToDeleteInPod:
    matchedContainer := func(filterContainerId string, podStatus *kubecontainer.PodStatus) *kubecontainer.Status {
        if filterContainerId == "" {
            return nil
        }
        for _, containerStatus := range podStatus.ContainerStatuses {
            if containerStatus.ID.ID == filterContainerId {
                return containerStatus
            }
        }
        return nil
    }(filterContainerID, podStatus)

However, there is a race condition: if SyncTerminatingPod completes before the ContainerDied event is received, the pod's status.terminatedAt is already set. In that case, the removeAll variable in cleanUpContainersInPod is true, so the ContainerDied events for the sandbox and for the regular containers each trigger the same set of RemoveContainer calls.

func (kl *Kubelet) cleanUpContainersInPod(podID types.UID, exitedContainerID string) {
    if podStatus, err := kl.podCache.Get(podID); err == nil {
        // When an evicted or deleted pod has already synced, all containers can be removed.
        removeAll := kl.podWorkers.ShouldPodContentBeRemoved(podID)
        kl.containerDeletor.deleteContainersInPod(exitedContainerID, podStatus, removeAll)
    }
}
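
Roughly, the problematic ordering looks like this (a simplified timeline, not an exact trace):

1. All containers in the pod exit and SyncTerminatingPod runs to completion, so ShouldPodContentBeRemoved(podID) starts returning true.
2. PLEG relists and emits ContainerDied events for every regular container and for the sandbox.
3. Each of those events reaches cleanUpContainersInPod with removeAll == true, so every event independently tries to remove all exited containers in the pod.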

Since removeAll is true at this point, deleteContainersInPod iterates over every container in the pod that is in the Exited state and executes RemoveContainer for each of them. Because this happens once per ContainerDied event, the total number of RemoveContainer executions grows to roughly the square of the number of containers in the pod.

func (p *podContainerDeletor) deleteContainersInPod(filterContainerID string, podStatus *kubecontainer.PodStatus, removeAll bool) {
    containersToKeep := p.containersToKeep
    if removeAll {
        containersToKeep = 0
        filterContainerID = ""
    }

// getContainersToDeleteInPod then collects every exited container as a removal candidate:
    var candidates containerStatusbyCreatedList
    for _, containerStatus := range podStatus.ContainerStatuses {
        if containerStatus.State != kubecontainer.ContainerStateExited {
            continue
        }
        if matchedContainer == nil || matchedContainer.Name == containerStatus.Name {
            candidates = append(candidates, containerStatus)
        }
    }
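
To make the blow-up concrete, here is a minimal standalone Go sketch (illustrative only, not kubelet code; all names are made up) that counts the removals issued when every ContainerDied event runs with the removeAll behavior:

package main

import "fmt"

func main() {
    const numContainers = 3 // regular containers in the pod, all already exited

    removeCalls := 0
    events := numContainers + 1 // one ContainerDied per container, plus one for the sandbox
    for e := 0; e < events; e++ {
        // With removeAll == true, each event removes every exited container
        // instead of only the one named in the event.
        for c := 0; c < numContainers; c++ {
            removeCalls++ // stands in for one RemoveContainer CRI call
        }
    }

    // Prints 12 for 3 containers: (N+1) events x N removals per event, i.e. roughly N^2.
    fmt.Println("RemoveContainer calls:", removeCalls)
}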

In this PR's fix, I modified deleteContainersInPod so it no longer accepts the removeAll parameter. When filterContainerID is passed as an empty string, it follows the previous logic and deletes all containers by default (previously, when removeAll was set, deleteContainersInPod explicitly set filterContainerID to empty to achieve the same effect, which made the removeAll parameter redundant).

Additionally, I added a keepMinimContainers parameter to deleteContainersInPod. After a pod completes SyncTerminatingPod, we delete the pod's containers one by one based on the ContainerDied events generated by PLEG and no longer retain a minimum number of containers in the pod. While SyncTerminatingPod has not yet completed, the previous behavior is kept: containers are deleted one by one and a minimum number of containers is retained.
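
For illustration, the reworked helper looks roughly like this (a simplified sketch of the idea, not the exact diff in this PR):

// Sketch only, assuming the existing getContainersToDeleteInPod helper keeps its signature.
func (p *podContainerDeletor) deleteContainersInPod(filterContainerID string, podStatus *kubecontainer.PodStatus, keepMinimContainers bool) {
    containersToKeep := 0
    if keepMinimContainers {
        // Before SyncTerminatingPod completes, keep the configured minimum
        // number of dead containers around, as before.
        containersToKeep = p.containersToKeep
    }
    // An empty filterContainerID already means "consider every exited container",
    // which is what the removed removeAll flag used to force.
    if candidates := getContainersToDeleteInPod(filterContainerID, podStatus, containersToKeep); len(candidates) > 0 {
        p.worker <- candidates
    }
}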

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Apr 15, 2025
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.33 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.33.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Tue Apr 15 13:35:16 UTC 2025.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 15, 2025
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. labels Apr 15, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Apr 15, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: HirazawaUi
Once this PR has been reviewed and has the lgtm label, please assign random-liu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@HirazawaUi
Contributor Author

/retest

@HirazawaUi HirazawaUi changed the title from "Fix Race Condition in Pod Deletion to Prevent Multiple RemoveContainer Calls" to "Fix Multiple RemoveContainer Calls Due to Race Condition" on Apr 15, 2025
@HirazawaUi
Contributor Author

/retest

@bart0sh
Contributor

bart0sh commented Apr 16, 2025

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 16, 2025
@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node: code and documentation PRs Apr 16, 2025
@hshiina
Contributor

hshiina commented Apr 16, 2025

This PR will fix the case where a pod has only one container, which seems to be the most common case. I think this fix is worth it. However, if a pod has multiple containers, RemoveContainer() can still be called multiple times for one container due to removeAll.

@HirazawaUi HirazawaUi force-pushed the dont-execute-removecontainer-multiple-times branch from 4869685 to e30ad7e Compare April 18, 2025 02:09
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 18, 2025
@HirazawaUi
Contributor Author

/retest

@HirazawaUi
Contributor Author

This PR will fix the case where a pod has only one container, which seems to be the most common case. I think this fix is worth it. However, if a pod has multiple containers, RemoveContainer() can still be called multiple times for one container due to removeAll.

Yes, I also noticed this issue, and the number of RemoveContainer() calls is the square of the number of containers in the pod, which I think is unreasonable. I have already fixed this behavior in the PR.

@HirazawaUi
Contributor Author

/retest

@hshiina
Contributor

hshiina commented Apr 18, 2025

Removing removeAll may delay container deletion on the runtime, though I'm not sure how harmful that would be.

We can observe this delay with this pod:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: testpod
  name: testpod
spec:
  containers:
  - image: busybox
    name: container1
    command:
      - sh
      - -c
      - trap "exit 0" SIGTERM; while true; do sleep 1; done
  - image: busybox
    name: container2
    command:
      - sh
      - -c
      - sleep 1d

At pod termination, while container1 exits soon, container2 remains until the gracePeriod expires. Then, when the ContainerDied event for container1 is raised, RemoveContainer() is not called because SyncTerminatingPod() has not completed yet. With this fix, when a ContainerDied event for container2 or the sandbox is raised, container1 is not removed; it is removed later by garbage collection.

@hshiina
Contributor

hshiina commented Apr 18, 2025

By the way, when EventedPLEG is enabled, I guess ContainerDied events are usually delivered before SyncTerminatingPod() has completed. This gap with GenericPLEG might be another problem for the EventedPLEG feature.

@HirazawaUi
Contributor Author

HirazawaUi commented Apr 18, 2025

At pod termination, while container1 exits soon, container2 remains until the gracePeriod expires. Then, when the ContainerDied event for container1 is raised, RemoveContainer() is not called because SyncTerminatingPod() has not completed yet. With this fix, when a ContainerDied event for container2 or the sandbox is raised, container1 is not removed; it is removed later by garbage collection.

Removing removeAll may delay container deletion on the runtime, though I'm not sure how harmful that would be.

Thanks for the reminder. The e2e test failure is also caused by this. I initially hoped to minimize code changes and fix this issue without adding extra caching or locks, but that no longer seems feasible. I'll mark this PR as WIP and work on finding a better solution.

@HirazawaUi HirazawaUi changed the title from "Fix Multiple RemoveContainer Calls Due to Race Condition" to "WIP: Fix Multiple RemoveContainer Calls Due to Race Condition" on Apr 18, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 18, 2025
@HirazawaUi HirazawaUi changed the title from "WIP: Fix Multiple RemoveContainer Calls Due to Race Condition" to "Fix Multiple RemoveContainer Calls Due to Race Condition" on Apr 20, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 20, 2025
@HirazawaUi
Contributor Author

@hshiina Can you review my latest changes to see if I've missed any scenarios? :) Adding a cache to avoid repeated container cleanup was one of my initial ideas, but I was concerned that modifying too much code would complicate the review and PR process, so I initially focused on fixing the issue with minimal changes. However, adding this cache now seems unavoidable.

@hshiina
Contributor

hshiina commented Apr 20, 2025

It seems that RemoveContainer() can still cause NotFound in a case where two containers exit at almost the same time:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: testpod
  name: testpod
spec:
  containers:
  - image: busybox
    name: container1
    command:
      - sh
      - -c
      - trap "exit 0" SIGTERM; while true; do sleep 1; done
  - image: busybox
    name: container2
    command:
      - sh
      - -c
      - trap "exit 0" SIGTERM; while true; do sleep 1; done

When PLEG detects two containers exiting at the same time, it raises two ContainerDied events. The first event removes all containers, because all of the pod's containers are in Exited status in the pod cache. Then the single RemoveContainer() triggered by the second event fails with NotFound.
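
Spelled out, the sequence is roughly:

1. container1 and container2 exit within the same relist interval.
2. PLEG raises a ContainerDied event for each of them.
3. While handling the first event, the pod cache already shows both containers as Exited, so both are removed.
4. The second event then issues RemoveContainer for a container that is already gone, which returns NotFound.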

@HirazawaUi
Contributor Author

HirazawaUi commented Apr 21, 2025

It seems that RemoveContainer() can still cause NotFound in a case where two containers exit at almost the same time:

We would just need to modify the current code so that, if a pod already exists in the podCleanupTracker cache, the containers in that pod are not cleaned up again. That way we can avoid this situation, but I'm concerned it might break the consistency of the podWorker.

Compared to the previous behavior, where the number of RemoveContainer calls grew quadratically with the number of containers, I think the current fix might already be sufficient? Of course, I'm very open to any opinions.

@HirazawaUi HirazawaUi force-pushed the dont-execute-removecontainer-multiple-times branch from 3441b85 to b13c099 Compare April 21, 2025 12:25
@k8s-ci-robot
Contributor

k8s-ci-robot commented Apr 21, 2025

@HirazawaUi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-unit-windows-master
Commit: b13c099
Required: false
Rerun command: /test pull-kubernetes-unit-windows-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sreeram-venkitesh
Member

CC @haircommander

@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 28, 2025

Successfully merging this pull request may close these issues.

kubelet tries to remove pod multiple times(reopen)