Fix unexpected change of container ready state when kubelet restart #123982

LastNight1997 · 2024-03-19T04:00:21Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a pod configures a readinessProbe:

If the ready state is true and the kubelet restarts, the pod temporarily reports containerNotReady, which may make the service unavailable.
If the ready state is false and the kubelet restarts, the pod temporarily reports containerReady, which may also make the service unavailable.

Thus, this PR keeps the container ready state when the kubelet restarts.

Which issue(s) this PR fixes:

Fixes #100277

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2024-03-19T04:00:30Z

Hi @LastNight1997. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2024-03-19T04:01:03Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LastNight1997
Once this PR has been reviewed and has the lgtm label, please assign dchen1107 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/kubelet/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bart0sh · 2024-03-24T20:29:11Z

pkg/kubelet/prober/prober_manager.go

+		if container.ReadinessProbe != nil {
+			containersWithReadinessProbe.Insert(container.Name)
+		}
+	}


This code is duplicated for the initContainers below. Would it make sense to wrap it into a function?
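A minimal sketch of what such a helper could look like, assuming simplified stand-in types rather than the real `v1.PodSpec` and `v1.Container`, and a plain map instead of the string set used in the diff. The function name `containersWithReadinessProbe` is hypothetical and only mirrors the variable name in the hunk above.

```go
package main

import "fmt"

// Probe, Container and PodSpec are simplified stand-ins for the Kubernetes
// API types; only the fields needed for this illustration are modeled.
type Probe struct{}

type Container struct {
	Name           string
	ReadinessProbe *Probe
}

type PodSpec struct {
	InitContainers []Container
	Containers     []Container
}

// containersWithReadinessProbe is a hypothetical helper. It returns the names
// of all containers (regular and init) that define a readiness probe, so the
// same loop does not have to be written twice.
func containersWithReadinessProbe(spec PodSpec) map[string]struct{} {
	names := make(map[string]struct{})
	collect := func(containers []Container) {
		for _, c := range containers {
			if c.ReadinessProbe != nil {
				names[c.Name] = struct{}{}
			}
		}
	}
	collect(spec.Containers)
	collect(spec.InitContainers)
	return names
}

func main() {
	spec := PodSpec{
		InitContainers: []Container{{Name: "sidecar", ReadinessProbe: &Probe{}}},
		Containers:     []Container{{Name: "app", ReadinessProbe: &Probe{}}, {Name: "helper"}},
	}
	// Compute the set once, before iterating over container statuses.
	withProbe := containersWithReadinessProbe(spec)
	_, appHasProbe := withProbe["app"]
	_, helperHasProbe := withProbe["helper"]
	fmt.Println(len(withProbe), appHasProbe, helperHasProbe)
}
```

Building the set once and passing it around keeps the per-container lookup cheap, which is relevant if the result is consumed inside a loop.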

bart0sh · 2024-03-24T20:30:13Z

pkg/kubelet/prober/prober_manager.go

-				case w.manualTriggerCh <- struct{}{}:
-				default: // Non-blocking.
-					klog.InfoS("Failed to trigger a manual run", "probe", w.probeType.String())
-				}


and call that function only where its result is used, e.g. here

We can wrap it into a function, but calling it here is not a good idea, because this is inside the for loop.

This makes sense to me. It should be called before the loop.

bart0sh · 2024-03-24T20:30:26Z

pkg/kubelet/prober/prober_manager.go

-				case w.manualTriggerCh <- struct{}{}:
-				default: // Non-blocking.
-					klog.InfoS("Failed to trigger a manual run", "probe", w.probeType.String())
-				}


HirazawaUi · 2024-04-03T14:48:58Z

/sig network
because sig-network wants to move it forward.

bart0sh · 2024-04-03T16:11:51Z

/triage accepted
/priority important-longterm

bart0sh · 2024-04-03T16:12:37Z

/ok-to-test

k8s-ci-robot · 2024-04-03T16:43:59Z

@LastNight1997: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kubernetes-e2e-kind-ipv6	`10a35cb`	link	true	`/test pull-kubernetes-e2e-kind-ipv6`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

HirazawaUi · 2024-04-04T08:56:50Z

Write a small KEP detailing the change and the tests needed to prove it (I think the bug has a lot of text you can steal)

A. I think we can use a "deprecated" gate instead. So, concretely - a gate named "PodUnreadyOnKubeletRestart" which is {Default: false, PreRelease: featuregate.Deprecated}. The code in question would look like:

@thockin If we just use a "deprecated" gate, maybe we don't need a KEP, could we use the same precedent we used: #119789
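The snippet that followed the quoted proposal is not preserved in this thread. Purely as an illustration, a gate with the quoted fields might guard the old behavior roughly as below. The types are local stand-ins that imitate the shape of `featuregate.FeatureSpec`, and the helpers `enabled` and `readyAfterRestart`, as well as the assumed semantics (gate on means "report unready after a restart", gate off means "keep the previous state"), are assumptions, not code from the PR.

```go
package main

import "fmt"

// prerelease and FeatureSpec imitate the shape of the feature gate spec
// quoted above; they are local stand-ins, not the real featuregate package.
type prerelease string

const Deprecated prerelease = "DEPRECATED"

type FeatureSpec struct {
	Default    bool
	PreRelease prerelease
}

// Gate name taken from the quoted proposal.
const PodUnreadyOnKubeletRestart = "PodUnreadyOnKubeletRestart"

var gates = map[string]FeatureSpec{
	PodUnreadyOnKubeletRestart: {Default: false, PreRelease: Deprecated},
}

// enabled returns the effective gate value: an explicit override if present,
// otherwise the registered default. Hypothetical helper.
func enabled(name string, overrides map[string]bool) bool {
	if v, ok := overrides[name]; ok {
		return v
	}
	return gates[name].Default
}

// readyAfterRestart shows the assumed semantics: with the gate on, the old
// behavior (unready until the probe runs again) is kept; with the gate off,
// the previously reported ready state is carried forward.
func readyAfterRestart(previousReady bool, overrides map[string]bool) bool {
	if enabled(PodUnreadyOnKubeletRestart, overrides) {
		return false
	}
	return previousReady
}

func main() {
	optIn := map[string]bool{PodUnreadyOnKubeletRestart: true}
	fmt.Println(readyAfterRestart(true, nil))   // gate at its default (off)
	fmt.Println(readyAfterRestart(true, optIn)) // operator opted back in
}
```

The point of a deprecated, default-off gate is that the new behavior ships enabled while operators who depend on the old behavior still have a temporary escape hatch.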

aojea · 2024-04-04T09:01:18Z

@thockin If we just use a "deprecated" gate, maybe we don't need a KEP, could we use the same precedent we used: #119789

It would be good to have the KEP to describe the scenarios, behaviors, and tests, so that we keep this documented. The referenced issue is more narrowly scoped, and this has a larger blast radius. The last time we changed something in Pod readiness we broke Service, and it took more than 3 releases to notice it.

thockin · 2024-04-04T16:16:11Z

#119789 was a really inconsequential change - this is really not :)

LastNight1997 · 2024-04-09T06:33:37Z

You can assign the KEP to me (though it needs a SIG Node reviewer/approver, too)

@thockin Thank you. I have no experience writing a KEP. I would really appreciate it if you could do it.

thockin · 2024-04-17T20:48:52Z

@LastNight1997 LOL, what I meant was: you write the KEP and I'll review/shepherd it.

Writing a KEP basically means:

open an issue here: https://github.com/kubernetes/enhancements/issues
clone this dir: https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template into a new directory named after the issue number from above
fill in the files in this new directory
open a PR and assign it to me

You should be able to breeze thru most of it since you already know the code change. The most effort will go into thinking about tests and how this could fail. Try to imagine the most bizarre thing a person could do to break this change - how do we defend?

chenk008 · 2024-07-11T01:42:42Z

Will this PR continue to move forward?

I have a similar change... https://github.com/chenk008/kubernetes/pull/3/files#diff-e81aa7518bebe9f4412cb375a9008b3481b19ec3e851d3187b3021ee94148f0d

thockin · 2024-07-14T22:59:25Z

@chenk008 This still needs a KEP. #123982 (comment)

olyazavr · 2024-09-06T14:20:05Z

+1 to this PR, this bug has been plaguing us and has caused noticeable pain. I was in the process of writing a similar patch for our setup when I came across this PR

k8s-triage-robot · 2025-02-03T20:59:06Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

HirazawaUi · 2025-02-04T02:31:10Z

/remove-lifecycle stale

tallclair

Sorry, I realized I had reviewed this a while ago but never submitted the comments.

That said, looking through this again, I'm not sure this is the right approach. There's a lot of complexity here around knowing when to carry forward the old status.

What if instead we prepopulate the status cache & readiness results cache when a pod is added back? In other words, in HandlePodAdditions (kubelet.go), after the pod is admitted: if the pod status says it's running then add its status to the status manager, and add any ready containers to the readiness manager.

We need to carefully think through the implications of pre-populating the status cache though.
/cc @smarterclayton @bobbypage
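A rough sketch of the prepopulation idea, under stated assumptions: the types below are simplified stand-ins for the pod status and for the readiness results cache, and `prepopulateReadiness` is a hypothetical name. It does not model the status manager or the real `HandlePodAdditions` path, only the rule "if the pod is running, seed the cache with its ready containers".

```go
package main

import "fmt"

// ContainerStatus and PodStatus are simplified stand-ins for the API types.
type ContainerStatus struct {
	Name        string
	ContainerID string
	Ready       bool
}

type PodStatus struct {
	Phase             string
	ContainerStatuses []ContainerStatus
}

// readinessCache is a stand-in for the readiness results cache, keyed by
// container ID. A missing key means "no probe result yet".
type readinessCache map[string]bool

// prepopulateReadiness seeds the cache from the last reported pod status.
// Only running pods are considered, and only containers that were reported
// ready and have a container ID are added.
func prepopulateReadiness(cache readinessCache, status PodStatus) {
	if status.Phase != "Running" {
		return
	}
	for _, cs := range status.ContainerStatuses {
		if cs.Ready && cs.ContainerID != "" {
			cache[cs.ContainerID] = true
		}
	}
}

func main() {
	cache := readinessCache{}
	prepopulateReadiness(cache, PodStatus{
		Phase: "Running",
		ContainerStatuses: []ContainerStatus{
			{Name: "app", ContainerID: "containerd://abc", Ready: true},
			{Name: "db", ContainerID: "containerd://def", Ready: false},
		},
	})
	_, dbSeeded := cache["containerd://def"]
	fmt.Println(cache["containerd://abc"], dbSeeded)
}
```

In this sketch, containers that were not ready are left out of the cache, so they fall back to whatever the "no result yet" path decides. Whether that is the right default is part of the implications the comment above asks to think through.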

tallclair · 2024-11-05T20:52:56Z

pkg/kubelet/prober/prober_manager.go

+			ready = result == results.Success
+			if !ready {
+				m.tryTriggerWorker(pod.UID, c.Name, readiness)
+			}


This shouldn't manually retrigger in the case of a failure. There was already a result, the probe failed - it should retry on the regular probing period.

tallclair · 2024-11-05T20:54:59Z

pkg/kubelet/prober/prober_manager.go

 		} else {
 			// The check whether there is a probe which hasn't run yet.
-			w, exists := m.getWorker(pod.UID, c.Name, readiness)
-			ready = !exists // no readinessProbe -> always ready
-			if exists {
-				// Trigger an immediate run of the readinessProbe to update ready state
-				select {
-				case w.manualTriggerCh <- struct{}{}:
-				default: // Non-blocking.
-					klog.InfoS("Failed to trigger a manual run", "probe", w.probeType.String())
-				}
+			if containersWithReadinessProbe.Has(c.Name) {


nit: else if

} else if containersWithReadinessProbe.Has(c.Name) { // The check whether there is a probe which hasn't run yet.

tallclair · 2024-11-05T20:57:24Z

pkg/kubelet/prober/prober_manager.go

-					klog.InfoS("Failed to trigger a manual run", "probe", w.probeType.String())
-				}
+			if containersWithReadinessProbe.Has(c.Name) {
+				m.tryTriggerWorker(pod.UID, c.Name, readiness)


I feel like we should just remove the manual triggering here too. What purpose does it serve?

Looks like it was added in #98376, but I don't have the full context there. This looks like it violates the probe's InitialDelaySeconds to me.

/cc @SergeyKanzhelev
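For context, the manual trigger discussed above is a non-blocking channel send. The sketch below isolates that pattern with a stand-in `worker` type. The helper name `tryTrigger` and the buffer size of 1 are assumptions for illustration; the actual `tryTriggerWorker` in the PR is not shown in this thread.

```go
package main

import "fmt"

// worker is a stand-in for the prober worker; only the trigger channel is
// modeled here.
type worker struct {
	manualTriggerCh chan struct{}
}

// tryTrigger attempts to signal the worker without blocking. It returns
// false when a trigger is already pending, mirroring the select/default
// pattern in the removed lines of the diff.
func tryTrigger(w *worker) bool {
	select {
	case w.manualTriggerCh <- struct{}{}:
		return true
	default: // Non-blocking: drop the request if the buffer is full.
		return false
	}
}

func main() {
	// Assumption: a buffer of one, so at most one trigger can be queued.
	w := &worker{manualTriggerCh: make(chan struct{}, 1)}

	first := tryTrigger(w)
	second := tryTrigger(w)
	fmt.Println(first, second) // true false

	<-w.manualTriggerCh // the worker consumes the pending trigger
	fmt.Println(tryTrigger(w))
}
```

Because a successful send causes an immediate probe run, a trigger issued before the initial delay has elapsed would run the probe early, which is the concern raised about InitialDelaySeconds.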

tallclair · 2025-02-19T19:58:21Z

pkg/kubelet/prober/worker.go

@@ -230,7 +230,9 @@ func (w *worker) doProbe(ctx context.Context) (keepGoing bool) {
 			w.resultsManager.Remove(w.containerID)
 		}
 		w.containerID = kubecontainer.ParseContainerID(c.ContainerID)
-		w.resultsManager.Set(w.containerID, w.initialValue, w.pod)
+		if w.probeType != readiness {


What is going to reset the readiness to false if the container restarts?

k8s-triage-robot · 2025-05-20T20:31:51Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2025-06-24T19:07:50Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2025-06-25T11:14:48Z

The lifecycle/frozen label can not be applied to PRs.

This bot removes lifecycle/frozen from PRs because:

Commenting /lifecycle frozen on a PR has not worked since March 2021
PRs that remain open for >150 days are unlikely to be easily rebased

You can:

Rebase this PR and attempt to get it merged
Close this PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/remove-lifecycle frozen

Fix unexpected change of container ready state when kubelet restart

10a35cb

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 19, 2024

k8s-ci-robot requested review from endocrimes and tallclair March 19, 2024 04:00

k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 19, 2024

bart0sh reviewed Mar 24, 2024

k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Apr 3, 2024

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 3, 2024

thockin assigned tallclair, mrunalp and SergeyKanzhelev Apr 3, 2024

pololowww mentioned this pull request Aug 7, 2024

Fix inconsistent container ready state after kubelet restart kubernetes/enhancements#4781

Open

4 tasks

pacoxu mentioned this pull request Aug 9, 2024

KEP-4781: Fix inconsistent container start and ready state after kubelet restart kubernetes/enhancements#4784

Open

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 3, 2025

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2025

tallclair reviewed Feb 19, 2025

k8s-ci-robot requested review from bobbypage, smarterclayton and SergeyKanzhelev February 19, 2025 20:02

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2025

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 24, 2025

thockin added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jun 25, 2025

k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jun 25, 2025
