Fix pod start duration observation #132278


Open
ali-a-a wants to merge 3 commits into master from fix-pod-start-duration-metric

Conversation

@ali-a-a (Contributor) commented Jun 13, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

This tries to fix issue #132268. The logic for observing pod startup duration appears to be incorrect: when the pod status is not yet present in the status manager, kubelet records the duration (which doesn't seem right), and later, when the pod is actually running and is present in the status manager, it observes the pod startup again. I suspect this is why kubelet counts each pod startup twice.

I added the log line below for debugging; the output shown after the snippet is from before the fix.

existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
if !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
	!firstSeenTime.IsZero() {
        // Debug log
	klog.V(0).Infof("Observe pod start duration: pod.UID: %v, ok: %v, existingStatus.Phase: %v, apiPodStatus.Phase: %v, firstSeen: %v", pod.UID, ok, existingStatus.Phase, apiPodStatus.Phase, firstSeenTime)
	metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
}
Observe pod start duration: pod.UID: 009e1a2b-4f75-4b4d-b30e-fd03ad52caad, ok: false, existingStatus.Phase: , apiPodStatus.Phase: Pending, firstSeen: 2025-06-13 03:57:50.571995801 +0000 UTC
Observe pod start duration: pod.UID: 009e1a2b-4f75-4b4d-b30e-fd03ad52caad, ok: true, existingStatus.Phase: Pending, apiPodStatus.Phase: Running, firstSeen: 2025-06-13 03:57:50.571995801 +0000 UTC

The pod in the first log line is not running yet, but we observe its startup anyway, which is incorrect. After the fix, the log line appears only once, when the pod is actually running.

Which issue(s) this PR is related to:

Fixes #132268

Special notes for your reviewer:

In addition to the above test, I verified that the kubelet_pod_start_duration_seconds metric increments its count by one with each pod creation after applying the fix.

My kubelet is built from the master branch, and my containerd version is as follows:

containerd github.com/containerd/containerd/v2 v2.1.1 cb1076646aa3740577fafbf3d914198b7fe8e3f7

Before the fix:

curl -k -H "Authorization: Bearer $METRICS_READER_TOKEN" https://localhost:10250/metrics | grep kubelet_pod_start_duration_seconds_count
kubelet_pod_start_duration_seconds_count 15

kubectl run nginx --image=nginx 

curl -k -H "Authorization: Bearer $METRICS_READER_TOKEN" https://localhost:10250/metrics | grep kubelet_pod_start_duration_seconds_count
kubelet_pod_start_duration_seconds_count 17

After the fix:

curl -k -H "Authorization: Bearer $METRICS_READER_TOKEN" https://localhost:10250/metrics | grep kubelet_pod_start_duration_seconds_count
kubelet_pod_start_duration_seconds_count 3

kubectl run nginx --image=nginx

curl -k -H "Authorization: Bearer $METRICS_READER_TOKEN" https://localhost:10250/metrics | grep kubelet_pod_start_duration_seconds_count
kubelet_pod_start_duration_seconds_count 4

Does this PR introduce a user-facing change?

Fix the issue with the `kubelet_pod_start_duration_seconds` metric being recorded twice per pod.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 13, 2025
@k8s-ci-robot (Contributor):

Hi @ali-a-a. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 13, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 13, 2025
@k8s-ci-robot k8s-ci-robot requested review from ffromani and odinuge June 13, 2025 04:42
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ali-a-a
Once this PR has been reviewed and has the lgtm label, please assign random-liu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@HirazawaUi (Contributor) commented Jun 13, 2025

Great to see you fixing this issue! However, @natasha41575 wasn't able to reproduce it. Would you be willing to attempt reproduction and detail your steps in #132268?

@natasha41575 (Contributor) left a comment

@HirazawaUi I actually was able to reproduce the issue after trying a few more times - left a comment on #132268.

@@ -1928,7 +1928,7 @@ func (kl *Kubelet) SyncPod(ctx context.Context, updateType kubetypes.SyncPodType
// Record the time it takes for the pod to become running
// since kubelet first saw the pod if firstSeenTime is set.
existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
- if !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
+ if ok && existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
@natasha41575 (Contributor) commented Jun 13, 2025

I'm wondering if the original intention of the code was:

Suggested change:
- if ok && existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
+ if (!ok || existingStatus.Phase == v1.PodPending) && apiPodStatus.Phase == v1.PodRunning &&

That would mean if the status is not found in the status manager AND the pod is running, record the metric, which makes sense. But it looks to me like there was perhaps a mistake made in thinking through the logic / precedence of the operators.

I think this is the preferable way to fix the issue. With the change this PR has now, it would not record the metric until the second sync loop iteration, which is less ideal.
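
For illustration of the precedence point: in Go, && binds more tightly than ||, so the two variants group differently. Below is a minimal, self-contained sketch (not kubelet code; pod phases are reduced to plain booleans rather than the real v1.PodPhase values):

package main

import "fmt"

func main() {
	// First sync from the debug log in the PR description: the pod is not in the
	// status manager yet (ok == false) and is still Pending.
	ok, existingPending, apiRunning, firstSeenSet := false, false, false, true

	// Original condition: parsed as !ok || (existingPending && apiRunning && firstSeenSet),
	// so a pod that is missing from the status manager is observed even while still Pending.
	original := !ok || existingPending && apiRunning && firstSeenSet

	// Suggested condition: the pod must actually be Running (and firstSeenTime set) in every case.
	suggested := (!ok || existingPending) && apiRunning && firstSeenSet

	fmt.Println(original, suggested) // prints "true false"
}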

@ali-a-a (Contributor, Author) replied

I think this is the preferable way to fix the issue. With the change this PR has now, it would not record the metric until the second sync loop iteration, which is less ideal.

Hmm. After looking at the statusManager again, I think your suggested code is the intended logic. There can be a case where the pod isn't present in the manager but is running; my fix wouldn't handle that case.

I'll push the corrected version soon.

@natasha41575 (Contributor):

/ok-to-test
/triage accepted

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 13, 2025
@natasha41575 natasha41575 moved this from Triage to Waiting on Author in SIG Node: code and documentation PRs Jun 13, 2025
@ali-a-a (Contributor, Author) commented Jun 13, 2025

@HirazawaUi I actually was able to reproduce the issue after trying a few more times - left a comment on #132268.

Thank you @natasha41575 for checking it again.

@ali-a-a ali-a-a force-pushed the fix-pod-start-duration-metric branch from e2d1189 to a1c4da0 on June 13, 2025 17:32
@natasha41575 (Contributor):

/lgtm
/priority important-soon

@k8s-ci-robot k8s-ci-robot removed the release-note-none Denotes a PR that doesn't merit a release note. label Jun 13, 2025
@HirazawaUi (Contributor) left a comment

/hold

Thanks for your quick action, but this is a bug fix. I think we should add an e2e test to ensure that the issue is indeed resolved.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2025
@natasha41575 (Contributor):

I think we should add an e2e test to ensure that the issue is indeed resolved.

Agreed that test coverage is helpful; however, a full e2e test is probably not needed. A unit test in kubelet_test.go should be sufficient.

@ali-a-a (Contributor, Author) commented Jun 14, 2025

I think we should add an e2e test to ensure that the issue is indeed resolved.

@HirazawaUi I also agree that we should test the bug fix, but an e2e test for a single metric seems a bit too much. If you are all ok with it, I can add a unit test for the kubelet_pod_start_duration_seconds metric.

@HirazawaUi (Contributor) commented Jun 14, 2025

I also agree that we should test the bug fix, but an e2e test for a single metric seems a bit too much. If you are all ok with it, I can add a unit test for the kubelet_pod_start_duration_seconds metric.

For this issue, I don't have a strong preference here, but typically an e2e test provides more visual verification of whether a bug is actually fixed. If you believe a unit test is sufficient, I'd suggest you:

  • Verify the fix through local testing
  • Document your test results in the PR description (not just "I've confirmed locally that this fix works well"); this would significantly help approvers during review.

@aojea (Member) commented Jun 14, 2025

A bug fix requires a regression test; it is on the reviewers/approvers to judge the best way to test it.
A unit test here seems required ... you should check whether an e2e test for this (or something similar) already exists; adding more coverage just by extending one of the existing tests would be highly beneficial.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jun 14, 2025
@k8s-ci-robot (Contributor):

New changes are detected. LGTM label has been removed.

@ali-a-a (Contributor, Author) commented Jun 15, 2025

@aojea @HirazawaUi

I added a unit test that fails without the fix. I also added a short document in my PR's description on how I tested the cluster before and after the fix.

Regarding e2e tests, I couldn't find anything related to kubelet's metrics (other than the instrumentation package, which only grabs all the metrics). Still, let me know if you think the unit test is not sufficient and it would be better to add an e2e test as well.

{Name: "containerA"},
},
},
Status: v1.PodStatus{
@natasha41575 (Contributor) commented

It doesn't look like you are adding this status to the kubelet's statusManager, so this test covers the !ok && apiPodStatus.Phase == v1.PodRunning case that was previously failing, which is great as regression coverage.

For completeness, I think we should also have tests covering the cases of existingStatus == v1.PodPending and existingStatus == v1.PodRunning. This would be cleaner to do if you can convert this to table-driven tests.

@ali-a-a (Contributor, Author) replied

As you mentioned, we already cover these cases:

  • ok is false and apiPodStatus.Phase == v1.PodPending
  • ok is true, existingStatus.Phase == v1.PodPending, and apiPodStatus.Phase == v1.PodRunning

I think the only case that remains is when existingStatus.Phase == v1.PodRunning, which means that we should have already observed the pod startup, and duplication shouldn't happen in the metric count. We can add this case by simply adding another SyncPod to ensure that it won't observe the pod again. The added SyncPod call would be something like this:

// Syncing the running pod for the second time.
_, err = kubelet.SyncPod(context.Background(), kubetypes.SyncPodUpdate, pod, nil, podStatus) // podStatus is as the previous call with one running container
if err != nil {
	t.Errorf("Unexpected err: %v", err)
}

// The metric should not consider the running pod again.
testutil.AssertHistogramTotalCount(t, "kubelet_pod_start_duration_seconds", map[string]string{}, 1)

I'm not sure whether table-driven testing is needed here. What I understand is that we want to test the behavior of SyncPod over changes in the pod's status and statusManager keys. If we used table-driven testing, cases would be dependent on each other, which doesn't seem really clean to me.

@natasha41575 (Contributor) commented Jun 16, 2025

It's fine with me if you prefer not to make it table-driven, as long as we have the coverage. Please make sure we also cover ok is false and apiPodStatus.Phase == v1.PodRunning.

@ali-a-a (Contributor, Author) replied

Thanks for your comment. I changed the unit test to cover both !ok && apiPodStatus.Phase == v1.PodRunning and existingStatus.Phase == v1.PodRunning cases.

I also tried table-driven testing, but since the counter of the histogram cannot be reset easily, the results of some cases would be dependent on their previous cases. So, I didn't use it.

@natasha41575 (Contributor) replied

I also tried table-driven testing, but since the counter of the histogram cannot be reset easily, the results of some cases would be dependent on their previous cases. So, I didn't use it.

We have other table-driven tests that have this behavior, where some cases are affected by previous cases. If that's the only reason you decided not to use it, we should still go with a table-driven structure. It will be easier to read and to see that we are covering all the cases.

Something like:

	tests := []struct {
		name                  string
		pod                   *v1.Pod
		expectedHistogramBump bool
	}{
	...
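
To make the suggestion concrete, here is a standalone, simplified sketch of such a table-driven test. It is not the actual kubelet test (which would drive kubelet.SyncPod and assert on the histogram with testutil.AssertHistogramTotalCount, as shown earlier in this thread); shouldObserve is a local stand-in for the fixed condition:

package example

import "testing"

// shouldObserve mirrors the condition after the suggested fix:
// (!ok || existing phase is Pending) && api phase is Running && firstSeenTime is set.
func shouldObserve(ok bool, existingPhase, apiPhase string, firstSeenSet bool) bool {
	return (!ok || existingPhase == "Pending") && apiPhase == "Running" && firstSeenSet
}

func TestPodStartDurationObservedOnce(t *testing.T) {
	tests := []struct {
		name                  string
		statusManagerHasEntry bool   // "ok" in the kubelet code
		existingPhase         string // phase stored in the status manager, if any
		apiPhase              string // phase computed for this sync
		expectedHistogramBump bool
	}{
		{"statusManager not populated and apiPodStatus is pending", false, "", "Pending", false},
		{"statusManager not populated and apiPodStatus is running", false, "", "Running", true},
		{"statusManager has pending status and apiPodStatus is running", true, "Pending", "Running", true},
		{"statusManager already has running status", true, "Running", "Running", false},
	}

	// Stands in for the running total that the real test would read via AssertHistogramTotalCount.
	histogramCount := 0
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			before := histogramCount
			if shouldObserve(tc.statusManagerHasEntry, tc.existingPhase, tc.apiPhase, true) {
				histogramCount++
			}
			if bumped := histogramCount > before; bumped != tc.expectedHistogramBump {
				t.Errorf("histogram bump = %v, want %v", bumped, tc.expectedHistogramBump)
			}
		})
	}
}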

@ali-a-a ali-a-a force-pushed the fix-pod-start-duration-metric branch from b0f1b29 to f18685c on June 17, 2025 00:42
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 17, 2025
{Name: "containerA"},
},
},
Status: v1.PodStatus{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also tried table-driven testing, but since the counter of the histogram cannot be reset easily, the results of some cases would be dependent on their previous cases. So, I didn't use it.

We have other table driven tests that have this behavior where some cases are affected by previous cases. If that's the only reason you decided not to use it, we should still go with a table-driven structure. It will be easier to read and see that we are covering all the cases.

Something like:

	tests := []struct {
		name                  string
		pod                   *v1.Pod
                expectedHistogramBump bool
	}{
        ...

{
Name: "container1",
State: kubecontainer.ContainerStateCreated,
},
@natasha41575 (Contributor) commented

Why does the container status here need to be populated? Can we instead just pass &kubecontainer.PodStatus{ID: pod.UID} directly into the SyncPod call?

@ali-a-a (Contributor, Author) replied

Yes, we can leave the state empty if the container is not running. I just wanted to be explicit here.

}

// In the first sync process, pod1 is pending.
_, err := kubelet.SyncPod(context.Background(), kubetypes.SyncPodSync, pod1, nil, pod1Status)
@natasha41575 (Contributor) commented

It looks to me like you are relying on this SyncPod call to populate the status manager so that we cover the ok && existingStatus.Phase == v1.PodPending case later. I think this is more confusing than it needs to be. Instead, each case in the table driven test should have a different pod, and whenever we need to test the case where ok is true, we populate the status manager directly using kl.statusManager.SetPodStatus. This will also reduce how much each test case depends on the previous one.

@ali-a-a (Contributor, Author) replied

Hmm. It is not just populating the status manager. It is also testing the case that ok is false. This call is exactly the call that would fail without the fix. But as I said in my other comment, I tried to follow your style in my latest commit.

@ali-a-a (Contributor, Author) commented Jun 18, 2025

We have other table driven tests that have this behavior where some cases are affected by previous cases. If that's the only reason you decided not to use it, we should still go with a table-driven structure. It will be easier to read and see that we are covering all the cases.

I think the way I wanted to test it was different from yours. I wanted to write some sort of behavioral test that works based on changes to the pod status: instead of focusing on the state of the status manager, I was focusing on transitions in the pod status. But I understand your point as well and agree with you. So, I changed the test to a table-driven test in my last commit. Please let me know if you have anything else in mind.

@natasha41575 (Contributor) left a comment

A couple of small nits.

I think the way I wanted to test it was different than your way. I wanted to write some sort of behavioral test that works based on changes to pod status. Instead of focusing on the state of the status manager, I was focusing on transitions in the pod status. But I understand your point as well and agree with you. So, I changed the test to a table-driven test in my last commit. Please let me know if you have anything else in your mind.

Thanks for working with me on this :)

/hold cancel

@HirazawaUi FYI that I'm removing the hold since I think the unit tests look pretty good now and from your previous comment sounded like you were OK if we don't have a specific e2e test for this

expectedHistogramBump bool
}{
{
name: "ok is false and apiPodStatus is pending",
@natasha41575 (Contributor) commented Jun 18, 2025

Super nit: someone just reading this test won't necessarily know what ok means without context; let's be more explicit:

Suggested change:
- name: "ok is false and apiPodStatus is pending",
+ name: "statusManager not populated and apiPodStatus is pending",

This also applies to your other test names that refer to ok

},
}

histogramCount := 0
@natasha41575 (Contributor) commented Jun 18, 2025

Optional nit: is it possible to set histogramCount to the current total histogram count here? (You can disregard this comment if it would be challenging or complex to do; I'm just thinking that if it would be a simple change, we might as well.)

@ali-a-a (Contributor, Author) replied

I looked at it again. The only way I can think of is to do it like AssertHistogramTotalCount. We can probably add another function in testutil and remove duplicated code between the new function and AssertHistogramTotalCount, but I'm not sure whether it is needed. I haven't implemented it, but if you think it seems fine, we can try it out.
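
For reference, the underlying mechanics of reading a histogram's current total count look roughly like this with plain client_golang (illustrative only; the kubelet metrics go through the k8s.io/component-base/metrics wrappers and its testutil helpers such as AssertHistogramTotalCount, so a real helper would live there and may differ):

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()
	h := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "demo_pod_start_duration_seconds", // illustrative name, not the kubelet metric
		Help: "Illustrative histogram.",
	})
	reg.MustRegister(h)

	h.Observe(1.5)
	h.Observe(2.5)

	// Gather the registry and read the histogram's sample count, which is what a
	// "current total count" helper would have to return.
	mfs, err := reg.Gather()
	if err != nil {
		panic(err)
	}
	for _, mf := range mfs {
		if mf.GetName() == "demo_pod_start_duration_seconds" {
			fmt.Println(mf.GetMetric()[0].GetHistogram().GetSampleCount()) // prints 2
		}
	}
}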

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 18, 2025
@ali-a-a ali-a-a force-pushed the fix-pod-start-duration-metric branch from 5bbc6bd to 36ca173 on June 18, 2025 23:26
@@ -1928,7 +1928,7 @@ func (kl *Kubelet) SyncPod(ctx context.Context, updateType kubetypes.SyncPodType
// Record the time it takes for the pod to become running
// since kubelet first saw the pod if firstSeenTime is set.
existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
- if !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
+ if (!ok || existingStatus.Phase == v1.PodPending) && apiPodStatus.Phase == v1.PodRunning &&
@HirazawaUi (Contributor) commented Jun 19, 2025

Apologies, I've been caught up with other things and haven't had time to properly review this PR until now.

I understand the changes made here: we record the pod_start_duration_seconds metric when either the podStatus doesn't exist in statusManager, or when the status transitions from pending to running.

This works well in most scenarios, and checking for non-existent podStatus in statusManager is good defensive programming (since it's nearly impossible to have apiPodStatus as running while podStatus is missing in statusManager).

But what happens during a kubelet restart?

If there are already running pods on the node, the statusManager won't have their podStatus. Since their apiPodStatus is already running, we'd record pod_start_duration_seconds for these pods.

However, because kubelet restarted, the firstSeenTime would be approximately equal to kubelet's startup time. This would make the recorded startup durations for these existing pods appear extremely short.

Does this recorded metric still hold meaning for these existing pods? It no longer represents the actual pod startup interval and would undoubtedly create confusion.

I'm uncertain about the exact behavior here. In my opinion, it might be better to either:

  • Apply the following code changes:
	existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
	if ok {
		if existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
			!firstSeenTime.IsZero() {
			fmt.Printf("pod %s firstSeenTime: %v", pod.Name, firstSeenTime)
			metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
		}
	}
  • Introduce a separate metric specifically for pods that already existed on the node when kubelet restarted.
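
If the second option were pursued, a hypothetical sketch of what such a metric might look like, using the same k8s.io/component-base/metrics API that the existing kubelet metrics are built on (the metric name, help text, and buckets here are invented for illustration, not an agreed-upon design):

package metrics

import "k8s.io/component-base/metrics"

// PodStartDurationAfterKubeletRestart is a hypothetical metric for pods that were
// already running on the node when the kubelet (re)started. Everything about it
// (name, help, buckets) is an assumption for illustration only.
var PodStartDurationAfterKubeletRestart = metrics.NewHistogram(
	&metrics.HistogramOpts{
		Subsystem:      "kubelet",
		Name:           "pod_already_running_at_kubelet_start_duration_seconds",
		Help:           "Duration in seconds from kubelet start until an already-existing pod is observed as running.",
		Buckets:        metrics.DefBuckets,
		StabilityLevel: metrics.ALPHA,
	},
)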

@HirazawaUi (Contributor) replied

/cc @dashpole I'd appreciate broader and more accurate input from sig-Instrumentation on this matter.

@HirazawaUi (Contributor) commented Jun 19, 2025

I believe this confusion stems from the changes introduced in PR: #105829

It's my understanding that the pod_start_duration_seconds metric was originally intended to measure the duration of the pending → running transition. However, we messed up the implementation of the pod_start_duration_seconds metric, so it doesn't work as intended.

While we updated the metric's description to match its actual (incorrect) behavior in PR: #105829, making the documentation temporarily consistent with the implementation, this created a fundamental disconnect between the metric's name and its documented purpose.

@ali-a-a (Contributor, Author) replied

Apply the following code changes:

existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
if ok {
	if existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
		!firstSeenTime.IsZero() {
		fmt.Printf("pod %s firstSeenTime: %v", pod.Name, firstSeenTime)
		metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
	}
}

This implementation is what I wrote first, but per my discussion with @natasha41575, it misses the case where the pod is running right away in the first call to SyncPod. In that case, we would get the duration wrong.

However, because kubelet restarted, the firstSeenTime would be approximately equal to kubelet's startup time. This would make all recorded pod startup durations appear extremely short.

Based on the current metric's description, isn't this what we want? The current description (whether right or wrong) says:

Duration in seconds from kubelet seeing a pod for the first time to the pod starting to run

If pods keep running and kubelet restarts, the difference between the time kubelet first sees a pod and the time the pod starts running should be almost zero, since the pod is already running.

I agree that the name of this metric is really misleading, but this implementation aligns with its description. I think the fix to the name or introducing a new metric can be done in another dedicated PR.

@HirazawaUi (Contributor) commented Jun 19, 2025

Based on the current metric's description, isn't this what we want? The current description (whether right or wrong) says:

Duration in seconds from kubelet seeing a pod for the first time to the pod starting to run

This is exactly what I mentioned in #132278.

Our original expectation was: the pod_start_duration_seconds metric should record the time taken for a pod to transition from the Pending to the Running state.

Both the name pod_start_duration_seconds and its previous description confirm that this is what we intended.

However, we messed up the implementation of the pod_start_duration_seconds metric, so it doesn't work as intended.

Subsequently, in #105829, we revised the description of the pod_start_duration_seconds metric based on this incorrect logic. This does not justify our change to the description, especially since the description now severely contradicts the metric's name.

I agree that the name of this metric is really misleading, but this implementation aligns with its description. I think the fix to the name or introducing a new metric can be done in another dedicated PR.

Perhaps we need to clarify the responsibility of the pod_start_duration_seconds metric, revise its description again to align with the metric name, and make it function correctly.

Let's cc the sig-node maintainers, @SergeyKanzhelev @tallclair @haircommander to get broader perspectives.

@natasha41575 (Contributor) replied

Ah thanks @HirazawaUi for the context, I misunderstood the expectation of this metric. Based on the metric description I had assumed that re-recording the metric upon kubelet restart was intentional, but from your description I see that this was not the original desired behavior.

@natasha41575 (Contributor) replied

I agree that the name of this metric is really misleading, but this implementation aligns with its description. I think the fix to the name or introducing a new metric can be done in another dedicated PR.

I'm inclined to agree with this comment that this PR is a net fix for the duplicate recording issue currently described by #132268, and is an incremental improvement despite not fixing the fundamental issue of the metric description and implementation not matching the original intention.

@HirazawaUi (Contributor) replied

I'm inclined to agree with this comment that this PR is a net fix for the duplicate recording issue currently described by #132268, and is an incremental improvement despite not fixing the fundamental issue of the metric description and implementation not matching the original intention.

I agree to merge this PR to temporarily resolve the duplicate metrics reporting issue.

However, considering we may have handled things incorrectly before, we need to exercise extra caution when fixing this problem. We likely need further discussion to determine what we should do now :)

@dashpole (Contributor):

A few comments from a sig-instrumentation perspective (cc @kubernetes/sig-instrumentation-approvers to keep me honest):

  • The pod_start_duration_seconds metric is alpha. This means you can technically make modifications and fixes to it. But if you think a lot of users are relying on the broken behavior, you should still avoid breaking them if possible.
  • Without looking into this too deeply, the metric does seem misleading to me. If there is a way to fix it to only record the start duration for pods that have started after the kubelet started, that seems like an appropriate fix to implement.

@HirazawaUi (Contributor) commented Jun 23, 2025

  • Without looking into this too deeply, the metric does seem misleading to me. If there is a way to fix it to only record the start duration for pods that have started after the kubelet started, that seems like an appropriate fix to implement.

We can achieve this simply by applying the following change:

	existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
	if ok {
		if existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
			!firstSeenTime.IsZero() {
			metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
		}
	}

We may also need to revert the comment changes made to pod_start_duration_seconds in PR #105829.

@ali-a-a, would you mind making this change? Or do you have other opinions?

@ali-a-a (Contributor, Author) commented Jun 23, 2025

We can achieve this simply by applying the following change:

	existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
	if ok {
		if existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
			!firstSeenTime.IsZero() {
			metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
		}
	}

@HirazawaUi, just to double-check. If the pod is already running in the first call to SyncPod, we miss its startup metric in the suggested implementation. Did you also consider this case?

@HirazawaUi (Contributor):

just to double-check. If the pod is already running in the first call to SyncPod, we miss its startup metric in the suggested implementation. Did you also consider this case?

If we want to record only the start duration of pods launched after kubelet starts, this is a good solution. What other issues might it have? I haven't thought of any yet.

Aside from kubelet restarts, I don’t see any scenario in the code path where a pod is already in the running state during its first SyncPod execution. SyncPod drives the transition of pod states, so it observes the entire lifecycle of a pod. Please correct me if I'm wrong.

Or perhaps you're being cautious and feel this requires further discussion and more precise suggestions. You could hold off on making this change for now.

@ali-a-a (Contributor, Author) commented Jun 24, 2025

Aside from kubelet restarts, I don’t see any scenario in the code path where a pod is already in the running state during its first SyncPod execution. SyncPod drives the transition of pod states, so it observes the entire lifecycle of a pod. Please correct me if I'm wrong.

I hadn't looked into the pod syncing process in depth and brought that question up just to make sure we were aware of it. After looking at the code, I also couldn't find a case where that scenario would happen. I think we can go with this solution.

@natasha41575 as you had this comment (#132278 (comment)) on this solution before, does this change sound good to you as well?

Labels
area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging this pull request may close these issues.

kubelet_pod_start_duration_seconds Metric Incorrectly Reports Double Pod Count
6 participants