kubelet: HandlePodCleanups takes an extra sync to restart pods #116690
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: smarterclayton. The full list of commands accepted by this bot can be found here. The pull request process is described here.
HandlePodCleanups is responsible for restarting pods that are no longer running (usually due to delete and recreation with the same UID in quick succession). We have to filter the list of pods to restart from podManager to get the list of admitted pods, which uses filterOutInactivePods on the kubelet. That method excludes pods the pod worker has already terminated. Since a restarted pod will be in the terminated state before HandlePodCleanups calls SyncKnownPods, we have to call filterOutInactivePods after SyncKnownPods; otherwise the to-be-restarted pod is ignored and we have to wait for the next housekeeping cycle to restart it. Since static pods are often critical system components, this extra 2s wait is undesirable and we should restart as soon as we can. Add a failing test that passes after we move the filter call after SyncKnownPods.
Force-pushed from 366d51e to d25572c.
Thanks for the fix and detailed explanation. /lgtm
LGTM label has been added. Git tree hash: 3de1782d48757c3c0d69e3fd45b7f13909bc2572
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:

/retest
/triage accepted
/kind bug
/sig node
/assign @bobbypage
Not for 1.28; this is a latency issue only, not a correctness issue.