Fix unexpected change of container ready state when kubelet restart #123982
base: master
Conversation
Hi @LastNight1997. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: LastNight1997. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
if container.ReadinessProbe != nil {
	containersWithReadinessProbe.Insert(container.Name)
}
}
This code is duplicated for the initContainers below. Would it make sense to wrap it into a function?
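For illustration, a minimal sketch of such a helper; the function name and the sets.Set[string] type are assumptions based on the snippet above, not the PR's actual code:

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

// collectReadinessProbeContainers returns the names of all regular and init
// containers in the pod spec that define a readiness probe.
func collectReadinessProbeContainers(pod *v1.Pod) sets.Set[string] {
	names := sets.New[string]()
	for _, c := range pod.Spec.Containers {
		if c.ReadinessProbe != nil {
			names.Insert(c.Name)
		}
	}
	for _, c := range pod.Spec.InitContainers {
		if c.ReadinessProbe != nil {
			names.Insert(c.Name)
		}
	}
	return names
}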
case w.manualTriggerCh <- struct{}{}:
default: // Non-blocking.
	klog.InfoS("Failed to trigger a manual run", "probe", w.probeType.String())
}
and call that function only where its result is used, e.g. here
We can wrap it into a function, but calling it here is not a good idea, because it's inside the for loop.
this makes sense to me. It should be called before the loop.
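For example, roughly (a sketch, not the actual UpdatePodStatus code; it reuses the hypothetical helper sketched above):

// Build the set once per pod, before the per-container loop.
containersWithReadinessProbe := collectReadinessProbeContainers(pod)
for i, c := range podStatus.ContainerStatuses {
	ready := false
	if result, ok := m.readinessManager.Get(kubecontainer.ParseContainerID(c.ContainerID)); ok {
		// A cached probe result already exists for this container.
		ready = result == results.Success
	} else if !containersWithReadinessProbe.Has(c.Name) {
		ready = true // no readinessProbe -> always ready
	}
	podStatus.ContainerStatuses[i].Ready = ready
}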
case w.manualTriggerCh <- struct{}{}:
default: // Non-blocking.
	klog.InfoS("Failed to trigger a manual run", "probe", w.probeType.String())
}
and here
/sig network
/triage accepted
/ok-to-test
@LastNight1997: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@thockin If we just use a "deprecated" gate, maybe we don't need a KEP. Could we use the same precedent we used in #119789?
It would be good to have a KEP to describe the scenarios, behaviors, and tests, so we keep this documented. The referenced issue is more scoped, while this has a larger blast radius; last time we changed something in Pod readiness we broke Service, and it took more than 3 releases to notice it.
#119789 was a really inconsequential change - this is really not :)
@thockin Thank you. I have no experience writing a KEP. I would really appreciate it if you could do it.
@LastNight1997 LOL, what I meant was: you write the KEP and I'll review/shepherd it. Writing a KEP basically means:
You should be able to breeze through most of it since you already know the code change. The most effort will go into thinking about tests and how this could fail. Try to imagine the most bizarre thing a person could do to break this change - how do we defend?
Will this PR continue to move forward? I have a similar change... https://github.com/chenk008/kubernetes/pull/3/files#diff-e81aa7518bebe9f4412cb375a9008b3481b19ec3e851d3187b3021ee94148f0d
@chenk008 This still needs a KEP. #123982 (comment)
+1 to this PR. This bug has been plaguing us and has caused noticeable pain. I was in the process of writing a similar patch for our setup when I came across this PR.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
Sorry, I realized I had reviewed this a while ago but never submitted the comments.
That said, looking through this again, I'm not sure this is the right approach. There's a lot of complexity here around knowing when to carry forward the old status.
What if instead we prepopulate the status cache & readiness results cache when a pod is added back? In other words, in HandlePodAdditions (kubelet.go), after the pod is admitted: if the pod status says it's running, then add its status to the status manager, and add any ready containers to the readiness manager.
We need to carefully think through the implications of pre-populating the status cache though.
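Purely to illustrate the idea, a rough sketch against the existing kubelet types; the helper name is made up and the manager calls are approximations of the status-manager and readiness-results APIs, not a worked-out patch:

// prepopulateReadinessFromAPIStatus seeds the kubelet caches for a pod that is
// re-added after a kubelet restart, so containers that were ready before the
// restart stay ready until their probes actually report otherwise.
func (kl *Kubelet) prepopulateReadinessFromAPIStatus(pod *v1.Pod) {
	if pod.Status.Phase != v1.PodRunning {
		return
	}
	// Seed the status manager with the last status known to the API server.
	kl.statusManager.SetPodStatus(pod, pod.Status)

	// Seed the readiness results cache for containers reported as ready.
	for _, cs := range pod.Status.ContainerStatuses {
		if !cs.Ready || cs.ContainerID == "" {
			continue
		}
		id := kubecontainer.ParseContainerID(cs.ContainerID)
		kl.readinessManager.Set(id, results.Success, pod)
	}
}

A stale API status could mask a container that died while kubelet was down, which is exactly the kind of implication that would need thinking through.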
/cc @smarterclayton @bobbypage
ready = result == results.Success
if !ready {
	m.tryTriggerWorker(pod.UID, c.Name, readiness)
}
This shouldn't manually retrigger in the case of a failure. There was already a result, the probe failed - it should retry on the regular probing period.
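In other words, when a cached result exists it could be used as-is, leaving retries to the worker's regular period (sketch only):

// A result is already cached for this container: use it directly. If the
// probe failed, the worker retries on its configured period; no manual
// trigger is needed here.
ready = result == results.Success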
} else {
	// The check whether there is a probe which hasn't run yet.
	w, exists := m.getWorker(pod.UID, c.Name, readiness)
	ready = !exists // no readinessProbe -> always ready
	if exists {
		// Trigger an immediate run of the readinessProbe to update ready state
		select {
		case w.manualTriggerCh <- struct{}{}:
		default: // Non-blocking.
			klog.InfoS("Failed to trigger a manual run", "probe", w.probeType.String())
		}
		if containersWithReadinessProbe.Has(c.Name) {
nit: else if
} else if containersWithReadinessProbe.Has(c.Name) {
// The check whether there is a probe which hasn't run yet.
	klog.InfoS("Failed to trigger a manual run", "probe", w.probeType.String())
}
if containersWithReadinessProbe.Has(c.Name) {
	m.tryTriggerWorker(pod.UID, c.Name, readiness)
I feel like we should just remove the manual triggering here too. What purpose does it serve?
Looks like it was added in #98376, but I don't have the full context there. This looks like it violates the probe's InitialDelaySeconds to me.
/cc @SergeyKanzhelev
@@ -230,7 +230,9 @@ func (w *worker) doProbe(ctx context.Context) (keepGoing bool) {
	w.resultsManager.Remove(w.containerID)
}
w.containerID = kubecontainer.ParseContainerID(c.ContainerID)
w.resultsManager.Set(w.containerID, w.initialValue, w.pod)
if w.probeType != readiness {
What is going to reset the readiness to false if the container restarts?
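One way the worker could distinguish the two cases, sketched purely as an illustration (the IsEmpty condition and the comments are assumptions, not part of this PR):

if !w.containerID.IsEmpty() {
	// The worker was already tracking a container and it has been replaced:
	// a real container restart, so reset the result (readiness back to the
	// initial value, i.e. false) until the probe reports again.
	w.resultsManager.Remove(w.containerID)
	w.containerID = kubecontainer.ParseContainerID(c.ContainerID)
	w.resultsManager.Set(w.containerID, w.initialValue, w.pod)
} else {
	// First observation after a kubelet restart: adopt the container without
	// clobbering the readiness carried over from the API status.
	w.containerID = kubecontainer.ParseContainerID(c.ContainerID)
	if w.probeType != readiness {
		w.resultsManager.Set(w.containerID, w.initialValue, w.pod)
	}
}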
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
This bot removes
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /remove-lifecycle frozen
What type of PR is this?
/kind bug
What this PR does / why we need it:
For a pod whose containers configure a readinessProbe, the container ready state changes unexpectedly when kubelet restarts. Thus, this PR keeps the container ready state across a kubelet restart.
Which issue(s) this PR fixes:
Fixes #100277
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: