Perf optimization: GetPodQOS() returns persisted value of PodStatus.QOSClass, if set. #119665

Merged (3 commits) on Oct 12, 2023

Contributor

@vinaykul vinaykul commented Jul 29, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Invoking GetPodQOS is a much more heavyweight operation than reading PodStatus.QOSClass, a value that does not change after initial assignment because a pod's QoS class is immutable. In the in-place pod resize e2e tests, GetPodQOS currently shows up in only 0.1% of the samples in the kubelet flamegraph, with 100 other pods on the node and each pod running 3 Burstable QoS class containers (the real-world cost may be higher or lower depending on the nature of the workloads). The intent is to eliminate that 0.1% and keep it that way by (re)setting this precedent, in light of KEP 2527 allowing status to be used as a reliable, persistent source of truth.

[Image: GetPodQOS]

Which issue(s) this PR fixes:

Fixes #
partially fixes 109547

Special notes for your reviewer:

In the flamegraph attached below, please see the sampling percentage of SetPodResizeStatus for this e2e pod resize test scenario. If this small trial-balloon optimization PR stands as a valid use of KEP 2527, we may also have a case for relying on PodStatus.Resize and PodStatus...AllocatedResources as the reliable, persistent source of truth instead of the checkpoint file, which would be a good optimization IMHO (with special handling for stand-alone kubelet, should we want to support in-place resize in that case).

Does this PR introduce a user-facing change?

GetPodQOS(pod *core.Pod) function now returns the stored value from PodStatus.QOSClass, if set. To compute/evaluate the value of QOSClass from scratch, ComputePodQOS(pod *core.Pod) must be used.
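
The split described in this release note can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the PR: the `Pod`, `PodStatus`, and `Container` types are simplified, hypothetical stand-ins for the real API types, resource quantities are plain integers, and the classification in `ComputePodQOS` is a condensed approximation of the usual Guaranteed/Burstable/BestEffort rules (it ignores init containers and API defaulting).

```go
package main

import "fmt"

// PodQOSClass mirrors the string-typed QoS class used by the Kubernetes API.
type PodQOSClass string

const (
	PodQOSGuaranteed PodQOSClass = "Guaranteed"
	PodQOSBurstable  PodQOSClass = "Burstable"
	PodQOSBestEffort PodQOSClass = "BestEffort"
)

// Simplified, hypothetical stand-ins for the real API types.
type Container struct {
	Requests map[string]int64
	Limits   map[string]int64
}

type PodStatus struct{ QOSClass PodQOSClass }

type Pod struct {
	Containers []Container
	Status     PodStatus
}

// Only cpu and memory take part in QoS classification in this sketch.
var qosResources = []string{"cpu", "memory"}

// ComputePodQOS evaluates the QoS class from scratch by walking every
// container. This walk is the work the cheap path avoids.
func ComputePodQOS(pod *Pod) PodQOSClass {
	requests := map[string]int64{}
	limits := map[string]int64{}
	guaranteed := true
	for _, c := range pod.Containers {
		for _, name := range qosResources {
			if v := c.Requests[name]; v > 0 {
				requests[name] += v
			}
			if v := c.Limits[name]; v > 0 {
				limits[name] += v
			}
		}
		// Guaranteed needs both cpu and memory limits on every container.
		if c.Limits["cpu"] <= 0 || c.Limits["memory"] <= 0 {
			guaranteed = false
		}
	}
	if len(requests) == 0 && len(limits) == 0 {
		return PodQOSBestEffort
	}
	// Guaranteed also needs every request to equal its limit.
	for name, req := range requests {
		if lim, ok := limits[name]; !ok || lim != req {
			guaranteed = false
		}
	}
	if guaranteed && len(requests) == len(limits) {
		return PodQOSGuaranteed
	}
	return PodQOSBurstable
}

// GetPodQOS returns the persisted status value when it is set and falls
// back to the full computation only when it is not.
func GetPodQOS(pod *Pod) PodQOSClass {
	if pod.Status.QOSClass != "" {
		return pod.Status.QOSClass
	}
	return ComputePodQOS(pod)
}

func main() {
	pod := &Pod{Containers: []Container{{
		Requests: map[string]int64{"cpu": 100},
		Limits:   map[string]int64{"cpu": 200},
	}}}
	fmt.Println(GetPodQOS(pod)) // status empty, so the class is computed
	pod.Status.QOSClass = ComputePodQOS(pod)
	fmt.Println(GetPodQOS(pod)) // status set, so the stored class is returned
}
```

Note that in this sketch `GetPodQOS` trusts the stored value even when the spec would now evaluate differently, so a caller that needs a fresh evaluation (for example, to compare the class implied by an old and a new spec) would call `ComputePodQOS` directly.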

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources

Kubelet flamegraph data:
[Flamegraph image: kubelet-getpodqos]

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 29, 2023
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.28 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.28.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Sat Jul 29 10:13:20 UTC 2023.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/kubectl area/kubelet labels Jul 29, 2023
@k8s-ci-robot k8s-ci-robot added do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. area/test needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 29, 2023
@vinaykul
Contributor Author

/test pull-kubernetes-e2e-gce-cos-alpha-features
/test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2

@vinaykul
Contributor Author

@vinaykul vinaykul changed the title from "Perf optimization: Move away from GetPodQOS, using PodStatus.QOSClass instead" to "Perf optimization: Move away from GetPodQOS, use PodStatus.QOSClass instead" on Jul 30, 2023
@bart0sh bart0sh added this to Triage in SIG Node PR Triage Jul 31, 2023
@bart0sh
Contributor

bart0sh commented Jul 31, 2023

/triage accepted
/priority important-longterm

@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Jul 31, 2023
	if pod.Status.QOSClass != "" {
		return pod.Status.QOSClass
	}
	return GetPodQOS(pod)
Member

Maybe note that this is much more expensive?

Member

Or even rename to GetPodQOSExpensive() or something?

Member

Or name this GetPodQOS() and move the old, expensive one to getPodQOSExpensive()

Contributor Author

That's a great suggestion, fewer changes! I'll rename the current expensive one to ComputePodQOS() or DeterminePodQOS(), then update and retest over the weekend.

Contributor Author

@thockin My last change has addressed this. PTAL, ty

pkg/apis/core/v1/helper/qos/qos.go Outdated Show resolved Hide resolved
@ndixita ndixita moved this from Triage to Archive-it in SIG Node CI/Test Board Aug 2, 2023
@vinaykul vinaykul changed the title from "Perf optimization: Move away from GetPodQOS, use PodStatus.QOSClass instead" to "Perf optimization: GetPodQOS() returns persisted value of PodStatus.QOSClass, if set." on Aug 7, 2023
Contributor

@iholder101 iholder101 left a comment

LGTM! Thanks @vinaykul!
see one nit below

pkg/apis/core/helper/qos/qos.go Outdated Show resolved Hide resolved
@k8s-ci-robot
Contributor

k8s-ci-robot commented Aug 8, 2023

@vinaykul: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-e2e-gce-cos-alpha-features | 54c11813c710283da9c84dd14fdf64537a5f20eb | link | unknown | /test pull-kubernetes-e2e-gce-cos-alpha-features |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@iholder101
Contributor

/lgtm
Thank you @vinaykul!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 14, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: e13504b92a8276f4ac5ed20efde13c77f414a14f

@bart0sh bart0sh moved this from Needs Reviewer to Needs Approver in SIG Node PR Triage Aug 24, 2023
@vinaykul
Contributor Author

vinaykul commented Oct 8, 2023

@thockin PTAL

Member

@thockin thockin left a comment

Thanks!

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: thockin, vinaykul

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 11, 2023
@k8s-ci-robot k8s-ci-robot merged commit a2cc9db into kubernetes:master Oct 12, 2023
12 checks passed
SIG Node PR Triage automation moved this from Needs Approver to Done Oct 12, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 12, 2023
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/kubectl
area/kubelet
area/test
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
release-note: Denotes a PR that will be considered when it comes time to generate release notes.
sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
sig/cli: Categorizes an issue or PR as relevant to SIG CLI.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

9 participants