Add metric for e2e pod startup latency including image pull #121041

Merged
merged 1 commit into from
Oct 25, 2023


ruiwen-zhao
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Currently, the pod startup latency tracker only emits a latency that excludes image pull time. That latency might not reflect what users actually see, because image pull is a major part of the overall latency. This PR therefore adds a metric for end-to-end (e2e) latency that includes image pull time, so that we can:

  1. monitor the latency that users actually see,
  2. monitor improvements and regressions in image pull latency.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Kubelet emits a metric for end-to-end pod startup latency including image pull.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. labels Oct 6, 2023
@ruiwen-zhao
Contributor Author

/cc @qiutongs @elfinhe

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 6, 2023
@k8s-ci-robot
Contributor

@ruiwen-zhao: GitHub didn't allow me to request PR reviews from the following users: elfinhe.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @qiutongs @elfinhe

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -162,7 +163,7 @@ var (
 			Subsystem: KubeletSubsystem,
 			Name:      PodStartDurationKey,
 			Help:      "Duration in seconds from kubelet seeing a pod for the first time to the pod starting to run",
-			Buckets:   metrics.DefBuckets,
+			Buckets:   []float64{0.5, 1, 2, 3, 4, 5, 6, 8, 10, 20, 30, 45, 60, 120, 180, 240, 300, 360, 480, 600, 900, 1200, 1800, 2700, 3600},
Member

3600 seconds?

Contributor Author

Yeah, the buckets here are the same as those of pod_start_sli_duration_seconds. We need the very long buckets because node creation (with GKE's autoscaling, for example), stockouts, and retries can all happen between a pod being created and the pod starting, and each of those can take a long time.

@bart0sh bart0sh added this to Triage in SIG Node PR Triage Oct 9, 2023
@bart0sh
Contributor

bart0sh commented Oct 9, 2023

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 9, 2023
@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Oct 9, 2023
@ruiwen-zhao
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 24, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp, ruiwen-zhao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 24, 2023
@ruiwen-zhao ruiwen-zhao force-pushed the sli-add-pull branch 2 times, most recently from a44db93 to 7bed864 Compare October 24, 2023 22:23
@@ -41,6 +41,7 @@ const (
 	PodWorkerDurationKey   = "pod_worker_duration_seconds"
 	PodStartDurationKey    = "pod_start_duration_seconds"
 	PodStartSLIDurationKey = "pod_start_sli_duration_seconds"
+	PodStartE2EDurationKey = "pod_start_e2e_duration_seconds"
Contributor

nit for naming: how about pod_start_total_duration_seconds?

Contributor Author

I prefer e2e mostly because it sounds more "e2e", and it aligns with the name of the slo-monitor metric: https://github.com/kubernetes/perf-tests/blob/358844d5ba650dd5a57bae0072691d93cd82e1bd/slo-monitor/src/monitors/pod_monitor.go#L43

Member

In Kubernetes, "e2e" is strongly associated with e2e tests. I suggest another, less confusing name.

Contributor Author

Makes sense. I changed it to pod_start_total_duration_seconds, as Qiutong suggested.

@qiutongs
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a465d8c3e39844efb2c251eab9524b9da6c9636c

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2023
@aojea
Member

aojea commented Oct 25, 2023

/lgtm

Awesome work, we have eyes now :)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: fb9e39b996f9985771727b6b4cd87f5a7f63b0ad

@k8s-ci-robot k8s-ci-robot merged commit de70890 into kubernetes:master Oct 25, 2023
14 checks passed
SIG Node PR Triage automation moved this from Needs Reviewer to Done Oct 25, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 25, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

6 participants