Fix pod start duration observation #132278


Open
ali-a-a wants to merge 3 commits into master from fix-pod-start-duration-metric

Conversation

@ali-a-a (Contributor) commented Jun 13, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

This tries to fix issue #132268. The logic for observing pod startup duration appears to be incorrect: when the pod status is not yet present in the status manager, kubelet records the duration (which doesn't seem right), and later, when the pod is actually running and is present in the status manager, it observes the pod startup again. I suspect this is why kubelet counts each pod startup twice.

I added the log line below for debugging; the output shown after the snippet is from before the fix.

existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
if !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
	!firstSeenTime.IsZero() {
        // Debug log
	klog.V(0).Infof("Observe pod start duration: pod.UID: %v, ok: %v, existingStatus.Phase: %v, apiPodStatus.Phase: %v, firstSeen: %v", pod.UID, ok, existingStatus.Phase, apiPodStatus.Phase, firstSeenTime)
	metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
}
Observe pod start duration: pod.UID: 009e1a2b-4f75-4b4d-b30e-fd03ad52caad, ok: false, existingStatus.Phase: , apiPodStatus.Phase: Pending, firstSeen: 2025-06-13 03:57:50.571995801 +0000 UTC
Observe pod start duration: pod.UID: 009e1a2b-4f75-4b4d-b30e-fd03ad52caad, ok: true, existingStatus.Phase: Pending, apiPodStatus.Phase: Running, firstSeen: 2025-06-13 03:57:50.571995801 +0000 UTC

The pod in the first log line is not running yet, but we observe its startup anyway, which is incorrect. After the fix, the log line appears only once, when the pod is actually running.

Which issue(s) this PR is related to:

Fixes #132268

Special notes for your reviewer:

In addition to the above test, I verified that the kubelet_pod_start_duration_seconds metric increments its count by one with each pod creation after applying the fix.

My kubelet is built from the master branch, and my containerd version is as follows:

containerd github.com/containerd/containerd/v2 v2.1.1 cb1076646aa3740577fafbf3d914198b7fe8e3f7

Before the fix:

curl -k -H "Authorization: Bearer $METRICS_READER_TOKEN" https://localhost:10250/metrics | grep kubelet_pod_start_duration_seconds_count
kubelet_pod_start_duration_seconds_count 15

kubectl run nginx --image=nginx 

curl -k -H "Authorization: Bearer $METRICS_READER_TOKEN" https://localhost:10250/metrics | grep kubelet_pod_start_duration_seconds_count
kubelet_pod_start_duration_seconds_count 17

After the fix:

curl -k -H "Authorization: Bearer $METRICS_READER_TOKEN" https://localhost:10250/metrics | grep kubelet_pod_start_duration_seconds_count
kubelet_pod_start_duration_seconds_count 3

kubectl run nginx --image=nginx

curl -k -H "Authorization: Bearer $METRICS_READER_TOKEN" https://localhost:10250/metrics | grep kubelet_pod_start_duration_seconds_count
kubelet_pod_start_duration_seconds_count 4

Does this PR introduce a user-facing change?

Fix the issue with the `kubelet_pod_start_duration_seconds` metric being recorded twice per pod.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 13, 2025
@k8s-ci-robot (Contributor):

Hi @ali-a-a. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 13, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 13, 2025
@k8s-ci-robot k8s-ci-robot requested review from ffromani and odinuge June 13, 2025 04:42
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ali-a-a
Once this PR has been reviewed and has the lgtm label, please assign random-liu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@HirazawaUi (Contributor) commented Jun 13, 2025

Great to see you fixing this issue! However, @natasha41575 wasn't able to reproduce it. Would you be willing to attempt reproduction and detail your steps in #132268?

@natasha41575 (Contributor) left a comment

@HirazawaUi I actually was able to reproduce the issue after trying a few more times - left a comment on #132268.

@@ -1928,7 +1928,7 @@ func (kl *Kubelet) SyncPod(ctx context.Context, updateType kubetypes.SyncPodType
// Record the time it takes for the pod to become running
// since kubelet first saw the pod if firstSeenTime is set.
existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
- if !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
+ if ok && existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
@natasha41575 (Contributor) commented Jun 13, 2025

I'm wondering if the original intention of the code was:

Suggested change:
- if ok && existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
+ if (!ok || existingStatus.Phase == v1.PodPending) && apiPodStatus.Phase == v1.PodRunning &&

That would mean if the status is not found in the status manager AND the pod is running, record the metric, which makes sense. But it looks to me like there was perhaps a mistake made in thinking through the logic / precedence of the operators.

I think this is the preferable way to fix the issue. With the change this PR has now, it would not record the metric until the second sync loop iteration, which is less ideal.
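
For illustration of the precedence point: in Go, && binds more tightly than ||, so the two variants group differently. Below is a minimal, self-contained sketch (not kubelet code; pod phases are reduced to plain booleans rather than the real v1.PodPhase values):

package main

import "fmt"

func main() {
	// First sync from the debug log in the PR description: the pod is not in the
	// status manager yet (ok == false) and is still Pending.
	ok, existingPending, apiRunning, firstSeenSet := false, false, false, true

	// Original condition: parsed as !ok || (existingPending && apiRunning && firstSeenSet),
	// so a pod that is missing from the status manager is observed even while still Pending.
	original := !ok || existingPending && apiRunning && firstSeenSet

	// Suggested condition: the pod must actually be Running (and firstSeenTime set) in every case.
	suggested := (!ok || existingPending) && apiRunning && firstSeenSet

	fmt.Println(original, suggested) // prints "true false"
}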

@ali-a-a (Contributor, Author) replied

I think this is the preferable way to fix the issue. With the change this PR has now, it would not record the metric until the second sync loop iteration, which is less ideal.

Hmm. After looking at the statusManager again, I think your suggested code is the intended logic. There can be a case where the pod isn't present in the manager but is running; my fix wouldn't handle that case.

I'll push the corrected version soon.

@natasha41575 (Contributor):

/ok-to-test
/triage accepted

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 13, 2025
@natasha41575 natasha41575 moved this from Triage to Waiting on Author in SIG Node: code and documentation PRs Jun 13, 2025
@ali-a-a (Contributor, Author) commented Jun 13, 2025

@HirazawaUi I actually was able to reproduce the issue after trying a few more times - left a comment on #132268.

Thank you @natasha41575 for checking it again.

@ali-a-a ali-a-a force-pushed the fix-pod-start-duration-metric branch from e2d1189 to a1c4da0 on June 13, 2025 17:32
@natasha41575 (Contributor):

/lgtm
/priority important-soon

@k8s-ci-robot k8s-ci-robot removed the release-note-none Denotes a PR that doesn't merit a release note. label Jun 13, 2025
@HirazawaUi (Contributor) left a comment

/hold

Thanks for your quick action, but this is a bug fix. I think we should add an e2e test to ensure that the issue is indeed resolved.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2025
@natasha41575 (Contributor):

I think we should add an e2e test to ensure that the issue is indeed resolved.

Agreed that test coverage is helpful; however, a full e2e test is probably not needed. A unit test in kubelet_test.go should be sufficient.

@ali-a-a (Contributor, Author) commented Jun 14, 2025

I think we should add an e2e test to ensure that the issue is indeed resolved.

@HirazawaUi I also agree that we should test the bug fix, but an e2e test for a single metric seems a bit too much. If you are all ok with it, I can add a unit test for the kubelet_pod_start_duration_seconds metric.

@HirazawaUi (Contributor) commented Jun 14, 2025

I also agree that we should test the bug fix, but an e2e test for a single metric seems a bit too much. If you are all ok with it, I can add a unit test for the kubelet_pod_start_duration_seconds metric.

For this issue, I don't have a strong preference here, but typically an e2e test provides more visual verification of whether a bug is actually fixed. If you believe a unit test is sufficient, I'd suggest you:

  • Verify the fix through local testing
  • Document your test results in the PR description (not just "I've confirmed locally that this fix works well"); this would significantly help approvers during review.

@aojea (Member) commented Jun 14, 2025

A bug fix requires a regression test; it is on the reviewers/approvers to judge the best way to test it.
A unit test here seems required ... you should check whether an e2e test for this (or something similar) already exists; adding more coverage just by extending one of the existing tests would be highly beneficial.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jun 14, 2025
@k8s-ci-robot (Contributor):

New changes are detected. LGTM label has been removed.

@ali-a-a (Contributor, Author) commented Jun 15, 2025

@aojea @HirazawaUi

I added a unit test that fails without the fix. I also added a short document in my PR's description on how I tested the cluster before and after the fix.

Regarding e2e tests, I couldn't find anything related to kubelet's metrics (other than the instrumentation package, which only grabs all the metrics). Still, let me know if you think the unit test is not sufficient and it would be better to add an e2e test as well.

{Name: "containerA"},
},
},
Status: v1.PodStatus{
@natasha41575 (Contributor) commented

It doesn't look like you are adding this status to the kubelet's statusManager, so this test covers the !ok && apiPodStatus.Phase == v1.PodRunning case that was previously failing, which is great as regression coverage.

For completeness, I think we should also have tests covering the cases of existingStatus == v1.PodPending and existingStatus == v1.PodRunning. This would be cleaner to do if you can convert this to table-driven tests.

@ali-a-a (Contributor, Author) replied

As you mentioned, we already cover these cases:

  • ok is false and apiPodStatus.Phase == v1.PodPending
  • ok is true, existingStatus.Phase == v1.PodPending, and apiPodStatus.Phase == v1.PodRunning

I think the only case that remains is when existingStatus.Phase == v1.PodRunning, which means that we should have already observed the pod startup, and duplication shouldn't happen in the metric count. We can add this case by simply adding another SyncPod to ensure that it won't observe the pod again. The added SyncPod call would be something like this:

// Syncing the running pod for the second time.
_, err = kubelet.SyncPod(context.Background(), kubetypes.SyncPodUpdate, pod, nil, podStatus) // podStatus is as the previous call with one running container
if err != nil {
	t.Errorf("Unexpected err: %v", err)
}

// The metric should not consider the running pod again.
testutil.AssertHistogramTotalCount(t, "kubelet_pod_start_duration_seconds", map[string]string{}, 1)

I'm not sure whether table-driven testing is needed here. What I understand is that we want to test the behavior of SyncPod over changes in the pod's status and statusManager keys. If we used table-driven testing, cases would be dependent on each other, which doesn't seem really clean to me.

@natasha41575 (Contributor) commented Jun 16, 2025

It's fine with me if you prefer not to make it table-driven, as long as we have the coverage. Please make sure we also cover ok is false and apiPodStatus.Phase == v1.PodRunning.

@ali-a-a (Contributor, Author) replied

Thanks for your comment. I changed the unit test to cover both !ok && apiPodStatus.Phase == v1.PodRunning and existingStatus.Phase == v1.PodRunning cases.

I also tried table-driven testing, but since the counter of the histogram cannot be reset easily, the results of some cases would be dependent on their previous cases. So, I didn't use it.

@natasha41575 (Contributor) replied

I also tried table-driven testing, but since the counter of the histogram cannot be reset easily, the results of some cases would be dependent on their previous cases. So, I didn't use it.

We have other table-driven tests that have this behavior, where some cases are affected by previous cases. If that's the only reason you decided not to use it, we should still go with a table-driven structure. It will be easier to read and to see that we are covering all the cases.

Something like:

	tests := []struct {
		name                  string
		pod                   *v1.Pod
		expectedHistogramBump bool
	}{
	...
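
To make the suggestion concrete, here is a standalone, simplified sketch of such a table-driven test. It is not the actual kubelet test (which would drive kubelet.SyncPod and assert on the histogram with testutil.AssertHistogramTotalCount, as shown earlier in this thread); shouldObserve is a local stand-in for the fixed condition:

package example

import "testing"

// shouldObserve mirrors the condition after the suggested fix:
// (!ok || existing phase is Pending) && api phase is Running && firstSeenTime is set.
func shouldObserve(ok bool, existingPhase, apiPhase string, firstSeenSet bool) bool {
	return (!ok || existingPhase == "Pending") && apiPhase == "Running" && firstSeenSet
}

func TestPodStartDurationObservedOnce(t *testing.T) {
	tests := []struct {
		name                  string
		statusManagerHasEntry bool   // "ok" in the kubelet code
		existingPhase         string // phase stored in the status manager, if any
		apiPhase              string // phase computed for this sync
		expectedHistogramBump bool
	}{
		{"statusManager not populated and apiPodStatus is pending", false, "", "Pending", false},
		{"statusManager not populated and apiPodStatus is running", false, "", "Running", true},
		{"statusManager has pending status and apiPodStatus is running", true, "Pending", "Running", true},
		{"statusManager already has running status", true, "Running", "Running", false},
	}

	// Stands in for the running total that the real test would read via AssertHistogramTotalCount.
	histogramCount := 0
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			before := histogramCount
			if shouldObserve(tc.statusManagerHasEntry, tc.existingPhase, tc.apiPhase, true) {
				histogramCount++
			}
			if bumped := histogramCount > before; bumped != tc.expectedHistogramBump {
				t.Errorf("histogram bump = %v, want %v", bumped, tc.expectedHistogramBump)
			}
		})
	}
}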

@ali-a-a ali-a-a force-pushed the fix-pod-start-duration-metric branch from b0f1b29 to f18685c on June 17, 2025 00:42
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 17, 2025
{Name: "containerA"},
},
},
Status: v1.PodStatus{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also tried table-driven testing, but since the counter of the histogram cannot be reset easily, the results of some cases would be dependent on their previous cases. So, I didn't use it.

We have other table driven tests that have this behavior where some cases are affected by previous cases. If that's the only reason you decided not to use it, we should still go with a table-driven structure. It will be easier to read and see that we are covering all the cases.

Something like:

	tests := []struct {
		name                  string
		pod                   *v1.Pod
                expectedHistogramBump bool
	}{
        ...

{
Name: "container1",
State: kubecontainer.ContainerStateCreated,
},
@natasha41575 (Contributor) commented

Why does the container status here need to be populated? Can we instead just pass &kubecontainer.PodStatus{ID: pod.UID} directly into the SyncPod call?

@ali-a-a (Contributor, Author) replied

Yes, we can leave the state empty if the container is not running. I just wanted to be explicit here.

}

// In the first sync process, pod1 is pending.
_, err := kubelet.SyncPod(context.Background(), kubetypes.SyncPodSync, pod1, nil, pod1Status)
@natasha41575 (Contributor) commented

It looks to me like you are relying on this SyncPod call to populate the status manager so that we cover the ok && existingStatus.Phase == v1.PodPending case later. I think this is more confusing than it needs to be. Instead, each case in the table driven test should have a different pod, and whenever we need to test the case where ok is true, we populate the status manager directly using kl.statusManager.SetPodStatus. This will also reduce how much each test case depends on the previous one.

@ali-a-a (Contributor, Author) replied

Hmm. It is not just populating the status manager. It is also testing the case that ok is false. This call is exactly the call that would fail without the fix. But as I said in my other comment, I tried to follow your style in my latest commit.

@ali-a-a (Contributor, Author) commented Jun 18, 2025

We have other table driven tests that have this behavior where some cases are affected by previous cases. If that's the only reason you decided not to use it, we should still go with a table-driven structure. It will be easier to read and see that we are covering all the cases.

I think the way I wanted to test it was different from yours. I wanted to write some sort of behavioral test that works based on changes to the pod status: instead of focusing on the state of the status manager, I was focusing on transitions in the pod status. But I understand your point as well and agree with you. So, I changed the test to a table-driven test in my last commit. Please let me know if you have anything else in mind.

@natasha41575 (Contributor) left a comment

A couple of small nits.

I think the way I wanted to test it was different than your way. I wanted to write some sort of behavioral test that works based on changes to pod status. Instead of focusing on the state of the status manager, I was focusing on transitions in the pod status. But I understand your point as well and agree with you. So, I changed the test to a table-driven test in my last commit. Please let me know if you have anything else in your mind.

Thanks for working with me on this :)

/hold cancel

@HirazawaUi FYI that I'm removing the hold since I think the unit tests look pretty good now and from your previous comment sounded like you were OK if we don't have a specific e2e test for this

expectedHistogramBump bool
}{
{
name: "ok is false and apiPodStatus is pending",
@natasha41575 (Contributor) commented Jun 18, 2025

Super nit: someone just reading this test won't necessarily know what ok means without context; let's be more explicit:

Suggested change:
- name: "ok is false and apiPodStatus is pending",
+ name: "statusManager not populated and apiPodStatus is pending",

This also applies to your other test names that refer to ok

},
}

histogramCount := 0
@natasha41575 (Contributor) commented Jun 18, 2025

Optional nit: is it possible to set histogramCount to the current total histogram count here? (You can disregard this comment if it would be challenging or complex to do; I'm just thinking that if it would be a simple change, we might as well.)

@ali-a-a (Contributor, Author) replied

I looked at it again. The only way I can think of is to do it like AssertHistogramTotalCount. We can probably add another function in testutil and remove duplicated code between the new function and AssertHistogramTotalCount, but I'm not sure whether it is needed. I haven't implemented it, but if you think it seems fine, we can try it out.
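
For reference, the underlying mechanics of reading a histogram's current total count look roughly like this with plain client_golang (illustrative only; the kubelet metrics go through the k8s.io/component-base/metrics wrappers and its testutil helpers such as AssertHistogramTotalCount, so a real helper would live there and may differ):

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()
	h := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "demo_pod_start_duration_seconds", // illustrative name, not the kubelet metric
		Help: "Illustrative histogram.",
	})
	reg.MustRegister(h)

	h.Observe(1.5)
	h.Observe(2.5)

	// Gather the registry and read the histogram's sample count, which is what a
	// "current total count" helper would have to return.
	mfs, err := reg.Gather()
	if err != nil {
		panic(err)
	}
	for _, mf := range mfs {
		if mf.GetName() == "demo_pod_start_duration_seconds" {
			fmt.Println(mf.GetMetric()[0].GetHistogram().GetSampleCount()) // prints 2
		}
	}
}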

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 18, 2025
@ali-a-a ali-a-a force-pushed the fix-pod-start-duration-metric branch from 5bbc6bd to 36ca173 on June 18, 2025 23:26
@@ -1928,7 +1928,7 @@ func (kl *Kubelet) SyncPod(ctx context.Context, updateType kubetypes.SyncPodType
// Record the time it takes for the pod to become running
// since kubelet first saw the pod if firstSeenTime is set.
existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
- if !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
+ if (!ok || existingStatus.Phase == v1.PodPending) && apiPodStatus.Phase == v1.PodRunning &&
@HirazawaUi (Contributor) commented Jun 19, 2025

Apologies, I've been caught up with other things and haven't had time to properly review this PR until now.

I understand the changes made here: we record the pod_start_duration_seconds metric when either the podStatus doesn't exist in statusManager, or when the status transitions from pending to running.

This works well in most scenarios, and checking for non-existent podStatus in statusManager is good defensive programming (since it's nearly impossible to have apiPodStatus as running while podStatus is missing in statusManager).

But what happens during a kubelet restart?

If there are already running pods on the node, the statusManager won't have their podStatus. Since their apiPodStatus is already running, we'd record pod_start_duration_seconds for these pods.

However, because kubelet restarted, the firstSeenTime would be approximately equal to kubelet's startup time. This would make the recorded startup durations for these existing pods appear extremely short.

Does this recorded metric still hold meaning for these existing pods? It no longer represents the actual pod startup interval and would undoubtedly create confusion.

I'm uncertain about the exact behavior here. In my opinion, it might be better to either:

  • Apply the following code changes:
	existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
	if ok {
		if existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
			!firstSeenTime.IsZero() {
			fmt.Printf("pod %s firstSeenTime: %v", pod.Name, firstSeenTime)
			metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
		}
	}
  • Introduce a separate metric specifically for pods that already existed on the node when kubelet restarted.
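
If the second option were pursued, a hypothetical sketch of what such a metric might look like, using the same k8s.io/component-base/metrics API that the existing kubelet metrics are built on (the metric name, help text, and buckets here are invented for illustration, not an agreed-upon design):

package metrics

import "k8s.io/component-base/metrics"

// PodStartDurationAfterKubeletRestart is a hypothetical metric for pods that were
// already running on the node when the kubelet (re)started. Everything about it
// (name, help, buckets) is an assumption for illustration only.
var PodStartDurationAfterKubeletRestart = metrics.NewHistogram(
	&metrics.HistogramOpts{
		Subsystem:      "kubelet",
		Name:           "pod_already_running_at_kubelet_start_duration_seconds",
		Help:           "Duration in seconds from kubelet start until an already-existing pod is observed as running.",
		Buckets:        metrics.DefBuckets,
		StabilityLevel: metrics.ALPHA,
	},
)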

@HirazawaUi (Contributor) replied

/cc @dashpole I'd appreciate broader and more accurate input from sig-Instrumentation on this matter.

@HirazawaUi (Contributor) commented Jun 19, 2025

I believe this confusion stems from the changes introduced in PR: #105829

It's my understanding that the pod_start_duration_seconds metric was originally intended to measure the duration of the pending → running transition. However, we messed up the implementation of the pod_start_duration_seconds metric, so it doesn't work as intended.

While we updated the metric's description to match its actual (incorrect) behavior in PR: #105829, making the documentation temporarily consistent with the implementation, this created a fundamental disconnect between the metric's name and its documented purpose.

@ali-a-a (Contributor, Author) replied

Apply the following code changes:

existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
if ok {
	if existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
		!firstSeenTime.IsZero() {
		fmt.Printf("pod %s firstSeenTime: %v", pod.Name, firstSeenTime)
		metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
	}
}

This implementation is what I wrote first, but per my discussion with @natasha41575, it misses the case where the pod is running right away in the first call to SyncPod. In that case, we would get the duration wrong.

However, because kubelet restarted, the firstSeenTime would be approximately equal to kubelet's startup time. This would make all recorded pod startup durations appear extremely short.

Based on the current metric's description, isn't this what we want? The current description (whether right or wrong) says:

Duration in seconds from kubelet seeing a pod for the first time to the pod starting to run

If pods keep running and kubelet restarts, the difference between the time kubelet first sees a pod and the time the pod starts running should be almost zero, since the pod is already running.

I agree that the name of this metric is really misleading, but this implementation aligns with its description. I think the fix to the name or introducing a new metric can be done in another dedicated PR.

@HirazawaUi (Contributor) commented Jun 19, 2025

Based on the current metric's description, isn't this what we want? The current description (whether right or wrong) says:

Duration in seconds from kubelet seeing a pod for the first time to the pod starting to run

This is exactly what I mentioned in #132278.

Our original expectation was: the pod_start_duration_seconds metric should record the time taken for a pod to transition from the Pending to the Running state.

Both the name pod_start_duration_seconds and its previous description confirm that this is what we intended.

However, we messed up the implementation of the pod_start_duration_seconds metric, so it doesn't work as intended.

Subsequently, in #105829, we revised the description of the pod_start_duration_seconds metric based on this incorrect logic. This does not justify our change to the description, especially since the description now severely contradicts the metric's name.

I agree that the name of this metric is really misleading, but this implementation aligns with its description. I think the fix to the name or introducing a new metric can be done in another dedicated PR.

Perhaps we need to clarify the responsibility of the pod_start_duration_seconds metric, revise its description again to align with the metric name, and make it function correctly.

Let's cc the sig-node maintainers, @SergeyKanzhelev @tallclair @haircommander to get broader perspectives.

@natasha41575 (Contributor) replied

Ah thanks @HirazawaUi for the context, I misunderstood the expectation of this metric. Based on the metric description I had assumed that re-recording the metric upon kubelet restart was intentional, but from your description I see that this was not the original desired behavior.

@natasha41575 (Contributor) replied

I agree that the name of this metric is really misleading, but this implementation aligns with its description. I think the fix to the name or introducing a new metric can be done in another dedicated PR.

I'm inclined to agree with this comment that this PR is a net fix for the duplicate recording issue currently described by #132268, and is an incremental improvement despite not fixing the fundamental issue of the metric description and implementation not matching the original intention.

@HirazawaUi (Contributor) replied

I'm inclined to agree with this comment that this PR is a net fix for the duplicate recording issue currently described by #132268, and is an incremental improvement despite not fixing the fundamental issue of the metric description and implementation not matching the original intention.

I agree to merge this PR to temporarily resolve the duplicate metrics reporting issue.

However, considering we may have handled things incorrectly before, we need to exercise extra caution when fixing this problem. We likely need further discussion to determine what we should do now :)

@dashpole (Contributor):

A few comments from a sig-instrumentation perspective (cc @kubernetes/sig-instrumentation-approvers to keep me honest):

  • The pod_start_duration_seconds metric is alpha. This means you can technically make modifications and fixes to it. But if you think a lot of users are relying on the broken behavior, you should still avoid breaking them if possible.
  • Without looking into this too deeply, the metric does seem misleading to me. If there is a way to fix it to only record the start duration for pods that have started after the kubelet started, that seems like an appropriate fix to implement.

@HirazawaUi (Contributor) commented Jun 23, 2025

  • Without looking into this too deeply, the metric does seem misleading to me. If there is a way to fix it to only record the start duration for pods that have started after the kubelet started, that seems like an appropriate fix to implement.

We can achieve this simply by applying the following change:

	existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
	if ok {
		if existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
			!firstSeenTime.IsZero() {
			metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
		}
	}

We may also need to revert the comment changes made to pod_start_duration_seconds in PR #105829.

@ali-a-a, would you mind making this change? Or do you have other opinions?

@ali-a-a (Contributor, Author) commented Jun 23, 2025

We can achieve this simply by applying the following change:

	existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
	if ok {
		if existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
			!firstSeenTime.IsZero() {
			metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
		}
	}

@HirazawaUi, just to double-check. If the pod is already running in the first call to SyncPod, we miss its startup metric in the suggested implementation. Did you also consider this case?

@HirazawaUi (Contributor):

just to double-check. If the pod is already running in the first call to SyncPod, we miss its startup metric in the suggested implementation. Did you also consider this case?

If we want to record only the start duration of pods launched after kubelet starts, this is a good solution. What other issues might it have? I haven't thought of any yet.

Aside from kubelet restarts, I don’t see any scenario in the code path where a pod is already in the running state during its first SyncPod execution. SyncPod drives the transition of pod states, so it observes the entire lifecycle of a pod. Please correct me if I'm wrong.

Or perhaps you're being cautious and feel this requires further discussion and more precise suggestions. You could hold off on making this change for now.

@ali-a-a (Contributor, Author) commented Jun 24, 2025

Aside from kubelet restarts, I don’t see any scenario in the code path where a pod is already in the running state during its first SyncPod execution. SyncPod drives the transition of pod states, so it observes the entire lifecycle of a pod. Please correct me if I'm wrong.

I hadn't looked into the pod syncing process in depth and brought that question up just to make sure we were aware of it. After looking at the code, I also couldn't find a case where that scenario would happen. I think we can go with this solution.

@natasha41575 as you had this comment (#132278 (comment)) on this solution before, does this change sound good to you as well?

Labels
area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging this pull request may close these issues.

kubelet_pod_start_duration_seconds Metric Incorrectly Reports Double Pod Count
6 participants