kubelet: HandlePodCleanups takes an extra sync to restart pods #116690
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: smarterclayton. The full list of commands accepted by this bot can be found here. The pull request process is described here.
HandlePodCleanups is responsible for restarting pods that are no longer running (usually due to delete and recreation with the same UID in quick succession). We have to filter the list of pods to restart from podManager to get the list of admitted pods, which uses filterOutInactivePods on the kubelet. That method excludes pods the pod worker has already terminated. Since a restarted pod will be in the terminated state before HandlePodCleanups calls SyncKnownPods, we have to call filterOutInactivePods after SyncKnownPods; otherwise the to-be-restarted pod is ignored and we have to wait for the next housekeeping cycle to restart it. Since static pods are often critical system components, this extra 2s wait is undesirable and we should restart as soon as we can. Add a failing test that passes after we move the filter call after SyncKnownPods.
Force-pushed from 366d51e to d25572c.
Thanks for the fix and detailed explanation. /lgtm
LGTM label has been added. Git tree hash: 3de1782d48757c3c0d69e3fd45b7f13909bc2572
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:

/retest
/triage accepted
/kind bug
/sig node
/assign @bobbypage
Not for 1.28; this is a latency issue only, not a correctness issue.