Add metric for e2e pod startup latency including image pull #121041

ruiwen-zhao · 2023-10-06T22:07:23Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Currently, the pod startup latency tracker emits only a latency that excludes image pull time. That latency might not reflect what users actually see, because image pull is a major part of the overall latency. Therefore, I am adding a metric for e2e latency that includes the image pull time, so that we can:

monitor the latency that users actually see, and
monitor improvements and regressions in image pull latency.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Kubelet emits a metric for end-to-end pod startup latency including image pull.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

ruiwen-zhao · 2023-10-06T22:07:59Z

/cc @qiutongs @elfinhe

k8s-ci-robot · 2023-10-06T22:08:01Z

@ruiwen-zhao: GitHub didn't allow me to request PR reviews from the following users: elfinhe.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @qiutongs @elfinhe

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

aojea · 2023-10-06T22:12:10Z

pkg/kubelet/metrics/metrics.go

@@ -162,7 +163,7 @@ var (
 Subsystem: KubeletSubsystem,
 Name: PodStartDurationKey,
 Help: "Duration in seconds from kubelet seeing a pod for the first time to the pod starting to run",
- Buckets: metrics.DefBuckets,
+ Buckets: []float64{0.5, 1, 2, 3, 4, 5, 6, 8, 10, 20, 30, 45, 60, 120, 180, 240, 300, 360, 480, 600, 900, 1200, 1800, 2700, 3600},


3600 seconds?

Yeah, the buckets here are the same as those of pod_start_sli_duration_seconds. We need the very long buckets because node creation (with GKE's autoscaling, for example), stockouts, and retries can all happen between a pod being created and the pod starting, and each of those can take a long time.
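
The effect of the wider buckets can be sketched as follows. The `wideBuckets` slice is copied from the diff above; `defBuckets` is assumed to match the Prometheus client's default buckets, which top out at 10 seconds. The `bucketFor` helper is hypothetical and only mimics how a histogram assigns an observation to the smallest upper bound that contains it.

```go
package main

import (
	"fmt"
	"sort"
)

// defBuckets is assumed to mirror the Prometheus client default buckets (seconds).
var defBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// wideBuckets is copied from the diff in this review thread (seconds).
var wideBuckets = []float64{0.5, 1, 2, 3, 4, 5, 6, 8, 10, 20, 30, 45, 60, 120, 180,
	240, 300, 360, 480, 600, 900, 1200, 1800, 2700, 3600}

// bucketFor returns the smallest upper bound that is >= v, or "+Inf" when v
// exceeds every finite bound. The bucket slice must be sorted ascending.
func bucketFor(buckets []float64, v float64) string {
	i := sort.SearchFloat64s(buckets, v) // first index with buckets[i] >= v
	if i == len(buckets) {
		return "+Inf"
	}
	return fmt.Sprintf("%g", buckets[i])
}

func main() {
	// Latencies in seconds: a fast start, a slow image pull, a node scale-up.
	for _, v := range []float64{3, 95, 1500} {
		fmt.Println(v, bucketFor(defBuckets, v), bucketFor(wideBuckets, v))
	}
	// Output:
	// 3 5 3
	// 95 +Inf 120
	// 1500 +Inf 1800
}
```

With the default buckets, every startup slower than 10 seconds lands in the same overflow bucket, so a 95-second start and a 25-minute start are indistinguishable. The wider set keeps them apart.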

bart0sh · 2023-10-09T16:37:59Z

/triage accepted
/priority important-longterm

ruiwen-zhao · 2023-10-09T18:18:42Z

/retest

k8s-ci-robot · 2023-10-24T21:28:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp, ruiwen-zhao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [mrunalp]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

qiutongs · 2023-10-25T16:26:03Z

pkg/kubelet/metrics/metrics.go

@@ -41,6 +41,7 @@ const (
 PodWorkerDurationKey = "pod_worker_duration_seconds"
 PodStartDurationKey = "pod_start_duration_seconds"
 PodStartSLIDurationKey = "pod_start_sli_duration_seconds"
+ PodStartE2EDurationKey = "pod_start_e2e_duration_seconds"


nit for naming: how about pod_start_total_duration_seconds?

I prefer e2e, mostly because it sounds more "e2e" and it aligns with the name of the slo-monitor metric: https://github.com/kubernetes/perf-tests/blob/358844d5ba650dd5a57bae0072691d93cd82e1bd/slo-monitor/src/monitors/pod_monitor.go#L43

e2e in Kubernetes has a hard bias towards e2e tests; I suggest another, less confusing name.

Makes sense. I changed it to pod_start_total_duration_seconds, as Qiutong suggested.
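
For readers looking for the renamed metric in a scrape, here is a rough sketch of how the exported series name is likely composed. It assumes that `KubeletSubsystem` has the value "kubelet" and that no namespace is set; neither is shown in this thread. The `fqName` helper is hypothetical and only imitates the usual Prometheus convention of joining the non-empty parts with underscores.

```go
package main

import (
	"fmt"
	"strings"
)

// fqName joins the non-empty parts with "_". It is an illustrative stand-in
// for the Prometheus naming convention, not the library function itself.
func fqName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	// "kubelet" is an assumed value for KubeletSubsystem.
	fmt.Println(fqName("", "kubelet", "pod_start_total_duration_seconds"))
	// prints: kubelet_pod_start_total_duration_seconds
}
```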

qiutongs · 2023-10-25T16:26:19Z

/lgtm

k8s-ci-robot · 2023-10-25T16:26:28Z

LGTM label has been added.

Git tree hash: a465d8c3e39844efb2c251eab9524b9da6c9636c

Signed-off-by: ruiwen-zhao <[email protected]>

aojea · 2023-10-25T20:47:38Z

/lgtm

Awesome work, we have eyes now :)

k8s-ci-robot · 2023-10-25T20:47:45Z

LGTM label has been added.

Git tree hash: fb9e39b996f9985771727b6b4cd87f5a7f63b0ad

k8s-ci-robot removed the do-not-merge/needs-sig label Oct 6, 2023

k8s-ci-robot requested a review from qiutongs October 6, 2023 22:08

k8s-ci-robot requested review from tzneal and yujuhong October 6, 2023 22:08

aojea reviewed Oct 6, 2023

bart0sh added this to Triage in SIG Node PR Triage Oct 9, 2023

bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Oct 9, 2023

mrunalp reviewed Oct 24, 2023

ruiwen-zhao force-pushed the sli-add-pull branch from 724789d to 832c1b2 Compare October 24, 2023 21:26

k8s-ci-robot added the size/M label and removed the size/S label Oct 24, 2023

mrunalp approved these changes Oct 24, 2023

k8s-ci-robot added the approved label Oct 24, 2023

ruiwen-zhao force-pushed the sli-add-pull branch 2 times, most recently from a44db93 to 7bed864 Compare October 24, 2023 22:23

qiutongs reviewed Oct 25, 2023

k8s-ci-robot assigned qiutongs Oct 25, 2023

k8s-ci-robot added the lgtm label Oct 25, 2023

Add metric for e2e pod startup latency including image pull

1165609

Signed-off-by: ruiwen-zhao <[email protected]>

ruiwen-zhao force-pushed the sli-add-pull branch from 7bed864 to 1165609 Compare October 25, 2023 20:34

k8s-ci-robot removed the lgtm label Oct 25, 2023

k8s-ci-robot requested a review from qiutongs October 25, 2023 20:34

k8s-ci-robot assigned aojea Oct 25, 2023

k8s-ci-robot added the lgtm label Oct 25, 2023

k8s-ci-robot merged commit de70890 into kubernetes:master Oct 25, 2023
14 checks passed

SIG Node PR Triage automation moved this from Needs Reviewer to Done Oct 25, 2023

k8s-ci-robot added this to the v1.29 milestone Oct 25, 2023
