Fix Multiple RemoveContainer Calls Due to Race Condition #131312
Conversation
Please note that we're already in Test Freeze. Fast forwards are scheduled to happen every 6 hours; the most recent run was: Tue Apr 15 13:35:16 UTC 2025.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: HirazawaUi. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
/retest
/retest
/triage accepted
This PR will fix the case where a pod has only one container, which seems the most common. I think this fix is worth it. However, if a pod has multiple containers, …
(force-pushed from 4869685 to e30ad7e)
/retest
Yes, I also noticed this issue, and the number of calls to …
/retest
Removing … We can observe this delay with this pod:
At pod termination, while …
By the way, when …
Thanks for the reminder. The e2e test failures are also caused by this. I initially hoped to minimize code changes and fix this issue without adding extra caching or locks, but that seems unfeasible now. I'll mark this PR as WIP and work on finding a better solution.
@hshiina Can you review my latest changes to see if I've missed any scenarios? :) Adding a cache to avoid repeated container cleanup was one of my initial ideas, but I was concerned that modifying too much code would complicate the review and PR process, so I initially focused on fixing the issue with minimal changes. However, it now seems unavoidable to add this cache.
It seems that …
When PLEG detects two containers exiting at the same time, it raises two …
Just modify the current code so that if a pod exists in the podCleanupTracker cache, the containers in that pod will not be cleaned up again. This way we can avoid this situation, but I'm concerned it might break the consistency of podWorker. Compared to the previous behavior, where the number of RemoveContainer calls grew quadratically with the number of containers, I think the current fix might already be sufficient. Of course, I'm very open to any opinions.
(force-pushed from 3441b85 to b13c099)
@HirazawaUi: The following test failed, say …
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind bug
What this PR does / why we need it:
During the pod deletion process, a race condition can cause RemoveContainer to be executed multiple times.
Which issue(s) this PR fixes:
Fixes #131275
Special notes for your reviewer:
In PLEG, we send the ContainerDied events for both the sandbox and regular containers to plegCh simultaneously. In syncLoopIteration, upon receiving a ContainerDied event, we execute kl.cleanUpContainersInPod(e.ID, containerID) to clean up the containers in the pod.
kubernetes/pkg/kubelet/pleg/generic.go
Lines 387 to 397 in 30469e1
kubernetes/pkg/kubelet/kubelet.go
Lines 2550 to 2566 in 30469e1
In cleanUpContainersInPod -> deleteContainersInPod -> getContainersToDeleteInPod, we explicitly select only the container ID from podStatus.ContainerStatuses that matches exitedContainerID. This is to avoid triggering unnecessary RemoveContainer calls due to the sandbox's ContainerDied event.
kubernetes/pkg/kubelet/pod_container_deletor.go
Lines 67 to 77 in 30469e1
However, there may be a race condition: if SyncTerminatingPod completes before the ContainerDied event is received, the pod's status.terminatedAt is set. In this case, the removeAll variable in the cleanUpContainersInPod function is set to true, causing both the sandbox and regular container ContainerDied events to trigger equivalent RemoveContainer executions.
kubernetes/pkg/kubelet/kubelet.go
Lines 3181 to 3187 in 30469e1
Since removeAll is true at this point, the deleteContainersInPod function iterates over all containers in the pod with an exited status and executes RemoveContainer for each of them. As a result, the total number of RemoveContainer executions grows with the square of the number of containers in the pod.
kubernetes/pkg/kubelet/pod_container_deletor.go
Lines 103 to 108 in b53b9fb
kubernetes/pkg/kubelet/pod_container_deletor.go
Lines 85 to 93 in b53b9fb
In the fix for this PR, I modified deleteContainersInPod to no longer accept the removeAll parameter. When filterContainerID is passed as an empty string, it follows the previous logic and deletes all containers by default (previously, when the removeAll parameter was passed, deleteContainersInPod explicitly set filterContainerID to empty to achieve this, making the removeAll parameter unnecessary).
Additionally, I added a keepMinimContainers parameter to deleteContainersInPod. After a pod completes SyncTerminatingPod, we delete the containers in the pod one by one based on the ContainerDied events generated by PLEG, and we no longer retain a minimum number of containers in the pod. When SyncTerminatingPod is not yet complete, it adheres to the previous behavior, deleting containers one by one while retaining a minimum number of containers.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: