Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a node startup latency tracker #118568

Merged
merged 1 commit into from
Sep 15, 2023

Conversation

qiutongs
Copy link
Contributor

@qiutongs qiutongs commented Jun 8, 2023

What type of PR is this?

kind/feature

What this PR does / why we need it:

Export metrics for node startup latency breakdown. Then we can understand how the node startup performs.

Initially, I am focusing on four Kubelet starts running for the first time timepoints.

  1. Kubelet starts running for the first time
  2. Kubelet attempts to register the node object for the first time
  3. Node object is created/registered for the first time
  4. Node’s ready condition is true for the first time

Therefore, I create 3 new latency metrics.

Which issue(s) this PR fixes:

Fixes # N/A

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Kubelet exposes latency metrics of different stages of the node startup.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 8, 2023
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 8, 2023
@pacoxu
Copy link
Member

pacoxu commented Jun 8, 2023

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Jun 8, 2023
@bart0sh bart0sh added this to WIP in SIG Node PR Triage Jun 8, 2023
@qiutongs qiutongs force-pushed the node-startup-latency branch 2 times, most recently from 088619a to 7d0773c Compare June 11, 2023 01:57
@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Jun 11, 2023
pkg/kubelet/util/node_startup_latency_tracker.go Outdated Show resolved Hide resolved
pkg/kubelet/metrics/metrics.go Outdated Show resolved Hide resolved
@qiutongs
Copy link
Contributor Author

qiutongs commented Aug 1, 2023

CC @linxiulei

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 2, 2023
@qiutongs qiutongs changed the title [WIP] Create a node startup latency tracker Create a node startup latency tracker Aug 2, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 2, 2023
@qiutongs
Copy link
Contributor Author

qiutongs commented Aug 2, 2023

/cc @azylinski @linxiulei

@aojea
Copy link
Member

aojea commented Sep 11, 2023

/lgtm

New metrics now look even better, @qiutongs can add a release note mentioning that the kubelet now has metrics to expose the duration of the different stages of the node startup?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: a4b98e71b840b7a9d9f96303072f15240a87806c

@qiutongs qiutongs closed this Sep 11, 2023
SIG Node PR Triage automation moved this from Needs Reviewer to Done Sep 11, 2023
@qiutongs qiutongs reopened this Sep 11, 2023
SIG Node PR Triage automation moved this from Done to Triage Sep 11, 2023
@aojea
Copy link
Member

aojea commented Sep 11, 2023

/test pull-kubernetes-e2e-gce

unrelated failure

Kubernetes e2e suite: [It] [sig-cli] Kubectl client Simple pod should contain last line of the log expand_less	26s
{ failed [FAILED] Expected
    <string>: failed to create fsnotify watcher: too many open files
to contain substring

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Sep 11, 2023
@qiutongs
Copy link
Contributor Author

/lgtm

New metrics now look even better, @qiutongs can add a release note mentioning that the kubelet now has metrics to expose the duration of the different stages of the node startup?

Updated it in #118568 (comment)

@aojea
Copy link
Member

aojea commented Sep 11, 2023

I think that we are good enough for merging, this metrics are important for working on improving the startup process of the kubelet, are you ok with this @dgrisonnet ?

For approvals
/assign @SergeyKanzhelev
for kubelet
/assign @wojtek-t
for kubemark

@dgrisonnet
Copy link
Member

I would still love to see #118568 (comment) addressed, but am I fine with adding the additional documentation in a follow-up PR.

@aojea
Copy link
Member

aojea commented Sep 12, 2023

I doesn't have to be super precise, but I think it would be valuable to give hints about what could be causing increase in latency at these layers.

I don't feel we can provide a good information right now about it, but we must not forget to update that metrics once we do the due diligence

@qiutongs
Copy link
Contributor Author

I doesn't have to be super precise, but I think it would be valuable to give hints about what could be causing increase in latency at these layers.

I don't feel we can provide a good information right now about it, but we must not forget to update that metrics once we do the due diligence

I will create a follow-up PR once I can summarize the steps. It will look like:

Duration in seconds of node startup before registration, including X, Y and Z.
(X, Y, Z are some critical steps before node registration)

@dgrisonnet
Copy link
Member

I don't feel we can provide a good information right now about it, but we must not forget to update that metrics once we do the due diligence

Sounds fair 👍

I will create a follow-up PR once I can summarize the steps. It will look like:
Duration in seconds of node startup before registration, including X, Y and Z.
(X, Y, Z are some critical steps before node registration)

Perfect, that's exactly what I was looking for :)

@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Sep 15, 2023
@dchen1107
Copy link
Member

+1 on addressing #118568 (comment) but ok for the following PR. Looking forward to seeing the reports generated from these metrics. Thanks!

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, qiutongs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 15, 2023
@dchen1107 dchen1107 moved this from Needs Reviewer to Triage in SIG Node PR Triage Sep 15, 2023
@qiutongs
Copy link
Contributor Author

qiutongs commented Sep 15, 2023

I was wondering if this can be backported to 1.28 or earlier. After talking to a few folks, it seems such change is not qualified as a bug fix so we generally don't backport.

If someone has different thoughts, please let me know!

@qiutongs
Copy link
Contributor Author

/retest-required

@k8s-ci-robot k8s-ci-robot merged commit 4fd8bd9 into kubernetes:master Sep 15, 2023
14 checks passed
SIG Node PR Triage automation moved this from Triage to Done Sep 15, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Sep 15, 2023
@qiutongs qiutongs deleted the node-startup-latency branch September 15, 2023 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet