
dra: pre-scheduled pods #118209

Merged: 5 commits into kubernetes:master on Jul 13, 2023

Conversation

@pohly (Contributor) commented May 23, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

When someone decides that a Pod should definitely run on a specific node, they
can create the Pod with spec.nodeName already set. Some custom scheduler might
do that. Then kubelet starts to check the pod and (if DRA is enabled) will
refuse to run it, either because the claims are still waiting for the first
consumer or the pod wasn't added to reservedFor. Both are things the scheduler
normally does.

A pod can also reach the same state if it was scheduled while the DRA feature was
disabled in the kube-scheduler.

Which issue(s) this PR fixes:

Fixes #114005 (Scheduling Pods with dynamic resources and using the Pod .spec.nodeName field fails)

Special notes for your reviewer:

The resource claim controller can handle these two cases by taking over for the
kube-scheduler when nodeName is set. Triggering an allocation is simpler than
in the scheduler because all it takes is creating the right
PodSchedulingContext with spec.selectedNode set. There's no need to list nodes
because that choice was already made, permanently. Adding the pod to
reservedFor also isn't hard.

What's currently missing is triggering de-allocation of claims to re-allocate
them for the desired node. This is not important for claims that get created
for the pod from a template and then only get used once, but it might be
worthwhile to add de-allocation in the future.
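
To make this concrete, here is a rough sketch of that takeover using the resource.k8s.io/v1alpha2 API. This is not the controller's actual code: the function name and client plumbing are hypothetical, and conflict handling, retries, and owner references are omitted.

```go
// Hypothetical sketch (not the controller's actual code): what "taking over
// for the kube-scheduler" amounts to for a pod with spec.nodeName already set.
package dracontroller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// handlePreScheduledPod triggers delayed allocation and reserves a claim for a
// pod that bypassed the scheduler.
func handlePreScheduledPod(ctx context.Context, client kubernetes.Interface,
	pod *corev1.Pod, claim *resourcev1alpha2.ResourceClaim) error {

	// Trigger allocation: a PodSchedulingContext with spec.selectedNode set is
	// enough, because the node choice was already made permanently. No
	// potentialNodes list is needed.
	schedulingCtx := &resourcev1alpha2.PodSchedulingContext{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pod.Name, // the real controller also sets an owner reference
			Namespace: pod.Namespace,
		},
		Spec: resourcev1alpha2.PodSchedulingContextSpec{
			SelectedNode: pod.Spec.NodeName,
		},
	}
	if _, err := client.ResourceV1alpha2().PodSchedulingContexts(pod.Namespace).
		Create(ctx, schedulingCtx, metav1.CreateOptions{}); err != nil {
		return err
	}

	// Reserve the claim for the pod so that kubelet accepts it: add the pod to
	// status.reservedFor once the claim is allocated.
	claim = claim.DeepCopy()
	claim.Status.ReservedFor = append(claim.Status.ReservedFor,
		resourcev1alpha2.ResourceClaimConsumerReference{
			Resource: "pods",
			Name:     pod.Name,
			UID:      pod.UID,
		})
	_, err := client.ResourceV1alpha2().ResourceClaims(claim.Namespace).
		UpdateStatus(ctx, claim, metav1.UpdateOptions{})
	return err
}
```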

Does this PR introduce a user-facing change?

kube-controller-manager: the dynamic resource controller steps in when a pod was created such that the scheduler ignores it (i.e. spec.nodeName is set) and takes care of triggering delayed resource claim allocation and/or reserving a claim for the pod.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- KEP: https://github.com/kubernetes/enhancements/issues/3063

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 23, 2023
@pohly (Contributor, Author) commented May 23, 2023

/cc @moshe010

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 23, 2023
@cici37 (Contributor) commented May 23, 2023

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 23, 2023
@bart0sh (Contributor) commented May 24, 2023

/triage accepted
/priority important-longterm
/cc @klueska

@k8s-ci-robot k8s-ci-robot requested a review from klueska May 24, 2023 07:18
@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 24, 2023
@bart0sh bart0sh moved this from Triage to PRs - Needs Reviewer in SIG Node CI/Test Board May 24, 2023
@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage May 24, 2023
SIG Node CI/Test Board automation moved this from PRs - Needs Reviewer to PRs - Needs Approver Jul 13, 2023
@klueska (Contributor) commented Jul 13, 2023

I didn't do a thorough review of all of the code myself, but in general the flow looks good and I trust @elezar's LGTM, knowing he did do a thorough review.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 13, 2023
@k8s-ci-robot (Contributor):
LGTM label has been added.

Git tree hash: 72afbcaa4b48617558d2e2cf531f6f39014127af

@pohly (Contributor, Author) commented Jul 13, 2023

/assign @liggitt

For plugin/pkg/auth/authorizer approval.

Review thread on the ResourceClaim update event handler:

},
UpdateFunc: func(old, updated interface{}) {
ec.onResourceClaimAddOrUpdate(logger, updated)
logger.V(6).Info("updated claim", "claimDiff", cmp.Diff(old, updated))

@liggitt (Member):

this diff will always run, regardless of whether the result is logged... isn't that expensive? also, cmp.Diff isn't guaranteed to not panic:

From https://pkg.go.dev/github.com/google/go-cmp/cmp:

Its propensity towards panicking means that it is unsuitable for production environments where a spurious panic may be fatal.

@pohly (Contributor, Author):

You are right; let's guard the call with an if check. I think at -v6 it's okay to take the risk of a panic. I've seen other log calls which dump diffs at higher verbosity levels.
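
The guard being suggested here would look roughly like this with logr (a sketch only; as noted below, the cmp.Diff call ended up being removed entirely):

```go
// Compute the (expensive, potentially panicking) diff only when V(6) logging
// is actually enabled.
if loggerV := logger.V(6); loggerV.Enabled() {
	loggerV.Info("updated claim", "claimDiff", cmp.Diff(old, updated))
}
```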

@liggitt (Member):

I think at -v6 it's okay to take the risk of a panic

Do we know this is a safe type to pass to cmp.Diff? Should we test that to give us confidence V(6) won't panic?

@pohly (Contributor, Author):

I have tested that manually while developing the PR. For SIG Instrumentation, I am running a periodic job with logging at a very high log level, which happens to include DRA. But automated testing in release-informing jobs is missing.

I could add a dedicated unit test, but that's not worth it; instead, I have removed the usage of cmp.Diff: https://github.com/kubernetes/kubernetes/compare/5cb4f18791a44d23c2d1d83ba7323ec903e1597b..80ab8f0542f9ddcf4935e24f742ef2a94b204471

@liggitt Okay to merge now?

@liggitt (Member) commented Jul 13, 2023:

@liggitt Okay to merge now?

yup

@liggitt (Member) commented Jul 13, 2023

/approve
for authorizer change

/hold
for resolution of unconditional cmp.Diff call. @pohly can unhold and lgtm once that is resolved

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 13, 2023
@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, pohly, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 13, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 13, 2023
@pohly (Contributor, Author) commented Jul 13, 2023

Update pushed. Let's wait for pull-kubernetes-kind-dra before lifting the hold, just in case.

Commit messages from the pushed update:

Enabling logging is useful to track what the code is doing.

There are some functional changes:
- The pod handler checks for the existence of claims. This avoids adding pods to the work queue in more cases where nothing needs to be done, at the cost of making the event handlers a bit slower. This will become more important when adding more work to the controller.
- The handler for deleted ResourceClaims did not check for cache.DeletedFinalStateUnknown (see the sketch after this list).
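
For reference, a sketch of the two handler changes described above, using the standard client-go tombstone pattern; handler and helper names (onResourceClaimDelete, enqueuePod) are illustrative, not the controller's exact code:

```go
// Deleted ResourceClaims: a missed delete event delivers a tombstone instead
// of the object itself, so unwrap it before use.
DeleteFunc: func(obj interface{}) {
	if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
		obj = tombstone.Obj
	}
	claim, ok := obj.(*resourcev1alpha2.ResourceClaim)
	if !ok {
		return
	}
	ec.onResourceClaimDelete(logger, claim) // hypothetical handler name
},

// Pod events: only enqueue pods that actually reference resource claims.
AddFunc: func(obj interface{}) {
	pod, ok := obj.(*corev1.Pod)
	if !ok || len(pod.Spec.ResourceClaims) == 0 {
		return // nothing for this controller to do
	}
	ec.enqueuePod(pod) // hypothetical helper
},
```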
The allocation mode is relevant when clearing reservedFor: for delayed allocation, deallocation gets requested; for immediate allocation it does not (see the sketch below). Both cases should get tested.

All pre-defined claims now use delayed allocation, just as they would if created normally.
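
In terms of the v1alpha2 API, the distinction being tested is roughly the following (illustrative sketch, not the controller's exact code):

```go
// After removing a pod from claim.Status.ReservedFor:
if len(claim.Status.ReservedFor) == 0 &&
	claim.Spec.AllocationMode == resourcev1alpha2.AllocationModeWaitForFirstConsumer {
	// Delayed allocation: ask the DRA driver to deallocate so the claim goes
	// back to waiting for its first consumer; the driver clears
	// status.allocation and this flag when done.
	claim.Status.DeallocationRequested = true
}
// AllocationModeImmediate: nothing to request; the allocation is kept for the
// next consumer.
```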
@liggitt (Member) commented Jul 13, 2023

/lgtm
/hold cancel

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jul 13, 2023
@k8s-ci-robot (Contributor):
LGTM label has been added.

Git tree hash: d67bdd0250fda6b848636b2ad40b1e3edd23ca07

@k8s-ci-robot k8s-ci-robot merged commit bea27f8 into kubernetes:master Jul 13, 2023
13 of 15 checks passed
SIG Node CI/Test Board automation moved this from PRs - Needs Approver to Done Jul 13, 2023
SIG Node PR Triage automation moved this from Needs Reviewer to Done Jul 13, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jul 13, 2023