Fix pod start duration observation #132278
base: master
Conversation
Hi @ali-a-a. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: ali-a-a
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Great to see you fixing this issue! However, @natasha41575 wasn't able to reproduce it. Would you be willing to attempt reproduction and detail your steps in #132268?
@HirazawaUi I actually was able to reproduce the issue after trying a few more times - left a comment on #132268.
pkg/kubelet/kubelet.go
Outdated
@@ -1928,7 +1928,7 @@ func (kl *Kubelet) SyncPod(ctx context.Context, updateType kubetypes.SyncPodType
 	// Record the time it takes for the pod to become running
 	// since kubelet first saw the pod if firstSeenTime is set.
 	existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
-	if !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
+	if ok && existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
I'm wondering if the original intention of the code was:
- if ok && existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
+ if (!ok || existingStatus.Phase == v1.PodPending) && apiPodStatus.Phase == v1.PodRunning &&
That would mean if the status is not found in the status manager AND the pod is running, record the metric, which makes sense. But it looks to me like there was perhaps a mistake made in thinking through the logic / precedence of the operators.
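To make the precedence point concrete, here is a minimal standalone sketch with booleans standing in for the real phase checks (the variable names are illustrative, not the kubelet's):

package main

import "fmt"

func main() {
	ok := false      // status manager has no entry for the pod yet
	pending := false // stand-in for existingStatus.Phase == v1.PodPending
	running := false // stand-in for apiPodStatus.Phase == v1.PodRunning

	// In Go, && binds tighter than ||, so the original condition parses as
	// !ok || (pending && running): it fires whenever the status is missing,
	// even though the pod is not running yet.
	original := !ok || pending && running

	// The suggested grouping requires the pod to actually be running.
	fixed := (!ok || pending) && running

	fmt.Println(original, fixed) // prints: true false
}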
I think this is the preferable way to fix the issue. With the change this PR has now, it would not record the metric until the second sync loop iteration, which is less ideal.
I think this is the preferable way to fix the issue. With the change this PR has now, it would not record the metric until the second sync loop iteration, which is less ideal.
Hmm. After looking at the statusManager again, I think your suggested code is the intended logic. There can be a case where the pod isn't present in the manager but is running. My fix wouldn't handle this case.
I'll push the corrected version soon.
/ok-to-test
Thank you @natasha41575 for checking it again.
Force-pushed from e2d1189 to a1c4da0
/lgtm
/hold
Thanks for your quick action, but this is a bug fix. I think we should add an e2e test to ensure that the issue is indeed resolved.
Agreed that test coverage is helpful. However, a full e2e test is probably not needed; a unit test in kubelet_test.go should be sufficient.
@HirazawaUi I also agree that we should test the bug fix, but an e2e test for a single metric seems a bit too much. If you are all OK with it, I can add a unit test for the SyncPod logic.
For this issue, I don't have a strong preference here, but typically an e2e test provides more visible verification of whether a bug is actually fixed. If you believe a unit test is sufficient, I'd suggest you:
A bugfix requires a regression test; it is on the reviewers/approvers to judge the best way to test it.
New changes are detected. LGTM label has been removed.
I added a unit test that fails without the fix. I also added a short note in my PR description on how I tested the cluster before and after the fix. As for e2e tests, I couldn't find anything related to kubelet's metrics (other than the instrumentation package, which only grabs all the metrics). Still, let me know if you think the unit test is not sufficient and it is better to add an e2e test as well.
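For orientation, a rough sketch of the kind of regression assertion being described, not the actual test in this PR; pod and runningPodStatus are assumed to come from the surrounding test setup (runningPodStatus containing one running container, as in the snippets later in this thread):

// First sync: the status manager has no entry yet and the runtime status has
// no running containers, so the generated apiPodStatus is still Pending.
// Before the fix this sync already bumped the histogram.
_, err := kubelet.SyncPod(context.Background(), kubetypes.SyncPodSync, pod, nil,
	&kubecontainer.PodStatus{ID: pod.UID})
if err != nil {
	t.Errorf("unexpected error: %v", err)
}

// Second sync: runningPodStatus contains one running container, so apiPodStatus
// becomes Running and the startup should be observed.
_, err = kubelet.SyncPod(context.Background(), kubetypes.SyncPodUpdate, pod, nil, runningPodStatus)
if err != nil {
	t.Errorf("unexpected error: %v", err)
}

// Without the fix the count here would be 2; with the fix it is 1.
testutil.AssertHistogramTotalCount(t, "kubelet_pod_start_duration_seconds", map[string]string{}, 1)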
pkg/kubelet/kubelet_test.go
Outdated
{Name: "containerA"}, | ||
}, | ||
}, | ||
Status: v1.PodStatus{ |
It doesn't look like you are adding this status to the kubelet's statusManager, so this test covers the !ok && apiPodStatus.Phase == v1.PodRunning case that was previously failing, which is great as regression coverage.
For completeness, I think we should also have tests covering the cases of existingStatus == v1.PodPending and existingStatus == v1.PodRunning. This would be cleaner to do if you can convert this to table-driven tests.
As you mentioned, we already cover these cases:
- ok is false and apiPodStatus.Phase == v1.PodPending
- ok is true, existingStatus.Phase == v1.PodPending, and apiPodStatus.Phase == v1.PodRunning
I think the only case that remains is when existingStatus.Phase == v1.PodRunning, which means that we should have already observed the pod startup, and duplication shouldn't happen in the metric count. We can add this case by simply adding another SyncPod call to ensure that it won't observe the pod again. The added SyncPod call would be something like this:
// Syncing the running pod for the second time.
_, err = kubelet.SyncPod(context.Background(), kubetypes.SyncPodUpdate, pod, nil, podStatus) // podStatus is the same as in the previous call, with one running container
if err != nil {
t.Errorf("Unexpected err: %v", err)
}
// The metric should not consider the running pod again.
testutil.AssertHistogramTotalCount(t, "kubelet_pod_start_duration_seconds", map[string]string{}, 1)
I'm not sure whether table-driven testing is needed here. What I understand is that we want to test the behavior of SyncPod over changes in the pod's status and statusManager keys. If we used table-driven testing, the cases would be dependent on each other, which doesn't seem very clean to me.
It's fine with me if you prefer not to make it table-driven, as long as we have the coverage. Please make sure we also cover ok is false and apiPodStatus.Phase == v1.PodRunning.
Thanks for your comment. I changed the unit test to cover both the !ok && apiPodStatus.Phase == v1.PodRunning and existingStatus.Phase == v1.PodRunning cases.
I also tried table-driven testing, but since the counter of the histogram cannot be reset easily, the results of some cases would be dependent on their previous cases. So, I didn't use it.
I also tried table-driven testing, but since the counter of the histogram cannot be reset easily, the results of some cases would be dependent on their previous cases. So, I didn't use it.
We have other table driven tests that have this behavior where some cases are affected by previous cases. If that's the only reason you decided not to use it, we should still go with a table-driven structure. It will be easier to read and see that we are covering all the cases.
Something like:
tests := []struct {
name string
pod *v1.Pod
expectedHistogramBump bool
}{
...
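Purely as an illustration, the table might be filled in along these lines, with hypothetical case names and a hypothetical prePopulateStatus field modeling whether the status manager already knows the pod; the actual test in kubelet_test.go may structure this differently:

tests := []struct {
	name                  string
	prePopulateStatus     *v1.PodStatus // nil: status manager has no entry ("ok" is false)
	apiPodPhase           v1.PodPhase
	expectedHistogramBump bool
}{
	{
		name:                  "statusManager not populated and apiPodStatus is pending",
		apiPodPhase:           v1.PodPending,
		expectedHistogramBump: false,
	},
	{
		name:                  "statusManager not populated and apiPodStatus is running",
		apiPodPhase:           v1.PodRunning,
		expectedHistogramBump: true,
	},
	{
		name:                  "statusManager has pending status and apiPodStatus is running",
		prePopulateStatus:     &v1.PodStatus{Phase: v1.PodPending},
		apiPodPhase:           v1.PodRunning,
		expectedHistogramBump: true,
	},
	{
		name:                  "statusManager has running status and apiPodStatus is running",
		prePopulateStatus:     &v1.PodStatus{Phase: v1.PodRunning},
		apiPodPhase:           v1.PodRunning,
		expectedHistogramBump: false,
	},
}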
Force-pushed from b0f1b29 to f18685c
pkg/kubelet/kubelet_test.go
Outdated
{
	Name: "container1",
	State: kubecontainer.ContainerStateCreated,
},
Why does the container status here need to be populated? Can we instead just pass in &kubecontainer.PodStatus{ID: pod.UID} directly into the SyncPod call?
Yes, we can pass nothing to state if the container is not running. I just wanted to be explicit here.
pkg/kubelet/kubelet_test.go
Outdated
}

// In the first sync process, pod1 is pending.
_, err := kubelet.SyncPod(context.Background(), kubetypes.SyncPodSync, pod1, nil, pod1Status)
It looks to me like you are relying on this SyncPod call to populate the status manager so that we cover the ok && existingStatus.Phase == v1.PodPending case later. I think this is more confusing than it needs to be. Instead, each case in the table-driven test should have a different pod, and whenever we need to test the case where ok is true, we populate the status manager directly using kl.statusManager.SetPodStatus. This will also reduce how much each test case depends on the previous one.
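A rough sketch of that suggestion, reusing the hypothetical prePopulateStatus field from the table sketch above and a hypothetical newTestPod helper, and assuming SetPodStatus accepts the pod and a v1.PodStatus (the exact signature in the current tree may differ):

for _, tc := range tests {
	t.Run(tc.name, func(t *testing.T) {
		pod := newTestPod(tc.name) // hypothetical helper; each case gets its own pod

		// Seed the status manager directly for cases where the status should
		// already be known ("ok" is true), instead of relying on a prior SyncPod call.
		if tc.prePopulateStatus != nil {
			kubelet.statusManager.SetPodStatus(pod, *tc.prePopulateStatus) // assumed signature
		}

		// runtimeStatus is built per case so that the generated apiPodStatus
		// matches tc.apiPodPhase (construction omitted in this sketch).
		_, err := kubelet.SyncPod(context.Background(), kubetypes.SyncPodSync, pod, nil, runtimeStatus)
		if err != nil {
			t.Errorf("unexpected error: %v", err)
		}
	})
}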
Hmm. It is not just populating the status manager; it is also testing the case where ok is false. This call is exactly the call that would fail without the fix. But as I said in my other comment, I tried to follow your style in my latest commit.
I think the way I wanted to test it was different from your way. I wanted to write some sort of behavioral test that works based on changes to the pod status. Instead of focusing on the state of the status manager, I was focusing on transitions in the pod status. But I understand your point as well and agree with you. So, I changed the test to a table-driven test in my last commit. Please let me know if you have anything else in mind.
A couple of small nits.
I think the way I wanted to test it was different from your way. I wanted to write some sort of behavioral test that works based on changes to the pod status. Instead of focusing on the state of the status manager, I was focusing on transitions in the pod status. But I understand your point as well and agree with you. So, I changed the test to a table-driven test in my last commit. Please let me know if you have anything else in mind.
Thanks for working with me on this :)
/hold cancel
@HirazawaUi FYI that I'm removing the hold since I think the unit tests look pretty good now, and from your previous comment it sounded like you were OK if we don't have a specific e2e test for this.
pkg/kubelet/kubelet_test.go
Outdated
	expectedHistogramBump bool
}{
	{
		name: "ok is false and apiPodStatus is pending",
Super nit: someone just reading this test won't necessarily know what ok means without context; let's be more explicit:
- name: "ok is false and apiPodStatus is pending",
+ name: "statusManager not populated and apiPodStatus is pending",
This also applies to your other test names that refer to ok.
	},
}

histogramCount := 0
Optional nit: is it possible to set histogramCount to the current total histogram count here? (You can disregard this comment if it would be challenging or complex to do; just thinking if it would be a simple change then we might as well.)
I looked at it again. The only way I can think of is to do it like AssertHistogramTotalCount. We can probably add another function in testutil and remove duplicated code between the new function and AssertHistogramTotalCount, but I'm not sure whether it is needed. I haven't implemented it, but if you think it seems fine, we can try it out.
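For reference, one possible shape for such a helper, sketched against the plain Prometheus gatherer types; the function name and placement are assumptions, not existing API:

package testutil // placement is an assumption for this sketch

import "github.com/prometheus/client_golang/prometheus"

// gatherHistogramSampleCount sums the sample counts of every series of the
// named histogram exposed by the gatherer. A real helper would likely live
// next to AssertHistogramTotalCount and share its plumbing.
func gatherHistogramSampleCount(g prometheus.Gatherer, name string) (uint64, error) {
	families, err := g.Gather()
	if err != nil {
		return 0, err
	}
	var total uint64
	for _, mf := range families {
		if mf.GetName() != name {
			continue
		}
		for _, m := range mf.GetMetric() {
			total += m.GetHistogram().GetSampleCount()
		}
	}
	return total, nil
}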
Force-pushed from 5bbc6bd to 36ca173
@@ -1928,7 +1928,7 @@ func (kl *Kubelet) SyncPod(ctx context.Context, updateType kubetypes.SyncPodType
 	// Record the time it takes for the pod to become running
 	// since kubelet first saw the pod if firstSeenTime is set.
 	existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
-	if !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
+	if (!ok || existingStatus.Phase == v1.PodPending) && apiPodStatus.Phase == v1.PodRunning &&
Apologies, I've been caught up with other things and haven't had time to properly review this PR until now.
I understand the changes made here: we record the pod_start_duration_seconds metric when either the podStatus doesn't exist in statusManager, or when the status transitions from pending to running.
This works well in most scenarios, and checking for a non-existent podStatus in statusManager is good defensive programming (since it's nearly impossible to have apiPodStatus as running while podStatus is missing in statusManager).
But what happens during a kubelet restart?
If there are already running pods on the node, the statusManager won't have their podStatus. Since their apiPodStatus is already running, we'd record pod_start_duration_seconds for these pods.
However, because kubelet restarted, the firstSeenTime would be approximately equal to kubelet's startup time. This would make all recorded existing pod startup durations appear extremely short.
Does this recorded metric still hold meaning for these existing pods? It no longer represents the actual pod startup interval and would undoubtedly create confusion.
I'm uncertain about the exact behavior here. In my opinion, it might be better to either:
- Apply the following code changes:
existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
if ok {
	if existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
		!firstSeenTime.IsZero() {
		fmt.Printf("pod %s firstSeenTime: %v", pod.Name, firstSeenTime)
		metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
	}
}
- Introduce a separate metric specifically for pods that already existed on the node when kubelet restarted.
/cc @dashpole I'd appreciate broader and more accurate input from sig-Instrumentation on this matter.
I believe this confusion stems from the changes introduced in PR: #105829
It's my understanding that the pod_start_duration_seconds metric was originally intended to measure the duration of the pending → running transition. However, we messed up the implementation of the pod_start_duration_seconds metric, so it doesn't work as intended.
While we updated the metric's description to match its actual (incorrect) behavior in PR: #105829, making the documentation temporarily consistent with the implementation, this created a fundamental disconnect between the metric's name and its documented purpose.
Apply the following code changes:
existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
if ok {
	if existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
		!firstSeenTime.IsZero() {
		fmt.Printf("pod %s firstSeenTime: %v", pod.Name, firstSeenTime)
		metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
	}
}
This implementation is what I wrote first, but per my discussion with @natasha41575, it misses the case where the pod is running right away in the first call to SyncPod. In that case, we would calculate the wrong duration.
However, because kubelet restarted, the firstSeenTime would be approximately equal to kubelet's startup time. This would make all recorded pod startup durations appear extremely short.
Based on the current metric's description, isn't that what we want? The current description (whether right or wrong) says:
Duration in seconds from kubelet seeing a pod for the first time to the pod starting to run
If pods keep running and kubelet restarts, the difference between the time kubelet first sees a pod and the time the pod starts running should be almost zero, as the pod is already running.
I agree that the name of this metric is really misleading, but this implementation aligns with its description. I think the fix to the name or introducing a new metric can be done in another dedicated PR.
Based on the current metric's description, isn't that what we want? The current description (whether right or wrong) says:
Duration in seconds from kubelet seeing a pod for the first time to the pod starting to run
This is exactly what I mentioned in #132278.
Our original expectation was: the pod_start_duration_seconds metric should record the time taken for a pod to transition from the Pending to the Running state.
Both the name pod_start_duration_seconds and its previous description confirm that this is what we intended.
However, we messed up the implementation of the pod_start_duration_seconds metric, so it doesn't work as intended.
Subsequently, in #105829, we revised the description of the pod_start_duration_seconds metric based on this incorrect logic. This does not justify our change to the description, especially since the description now severely contradicts the metric's name.
I agree that the name of this metric is really misleading, but this implementation aligns with its description. I think the fix to the name or introducing a new metric can be done in another dedicated PR.
Perhaps we need to clarify the responsibility of the pod_start_duration_seconds metric, revise its description again to align with the metric name, and make it function correctly.
Let's cc the sig-node maintainers, @SergeyKanzhelev @tallclair @haircommander to get broader perspectives.
Ah thanks @HirazawaUi for the context, I misunderstood the expectation of this metric. Based on the metric description I had assumed that re-recording the metric upon kubelet restart was intentional, but from your description I see that this was not the original desired behavior.
I agree that the name of this metric is really misleading, but this implementation aligns with its description. I think the fix to the name or introducing a new metric can be done in another dedicated PR.
I'm inclined to agree with this comment that this PR is a net fix for the duplicate recording issue currently described by #132268, and is an incremental improvement despite not fixing the fundamental issue of the metric description and implementation not matching the original intention.
I'm inclined to agree with this comment that this PR is a net fix for the duplicate recording issue currently described by #132268, and is an incremental improvement despite not fixing the fundamental issue of the metric description and implementation not matching the original intention.
I agree to merge this PR to temporarily resolve the duplicate metrics reporting issue.
However, considering we may have handled things incorrectly before, we need to exercise extra caution when fixing this problem. We likely need further discussion to determine what we should do now :)
A few comments from a sig-instrumentation perspective (cc @kubernetes/sig-instrumentation-approvers to keep me honest):
We can achieve this simply by applying the following change:
We may also need to revert the comment changes made to the metric's description. @ali-a-a, would you mind making this change? Or do you have other opinions?
@HirazawaUi, just to double-check: if the pod is already running in the first call to SyncPod ...
If we want to record only the start duration of pods launched after kubelet starts, this is a good solution. What other issues might it have? I haven't thought of any yet. Aside from kubelet restarts, I don't see any scenario in the code path where a pod is already in the running state during its first SyncPod call. Or perhaps you're being cautious and feel this requires further discussion and more precise suggestions. You could hold off on making this change for now.
I hadn't looked into the process of syncing pods in depth and brought that question up to ensure we are aware of it. After looking at the code, I also couldn't find a case where that scenario would happen. I think we can go with this solution. @natasha41575, since you had this comment (#132278 (comment)) on this solution before, does this change sound good to you as well?
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR tries to fix issue #132268. It seems that the logic for observing pod startup duration is incorrect: when the pod status is not present in the status manager, kubelet records the duration (which doesn't seem right). Later, when the pod is actually running and present in the status manager, it observes the pod startup again. I suppose that is why kubelet counts pod startups twice.
I added the log below for debugging, and it is the output before the fix.
The pod in the first log is not running yet, but we observe its startup incorrectly. After the fix, the log happens only once - when the pod is actually running.
Which issue(s) this PR is related to:
Fixes #132268
Special notes for your reviewer:
In addition to the above test, I verified that the kubelet_pod_start_duration_seconds metric increments its count by one with each pod creation after applying the fix. My kubelet is compiled from the master branch. Also, my containerd version is as follows:
Before the fix:
After the fix:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: