
dra: pre-scheduled pods #118209

Merged: 5 commits into kubernetes:master on Jul 13, 2023

Conversation

@pohly (Contributor) commented May 23, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

When someone decides that a Pod should definitely run on a specific node, they
can create the Pod with spec.nodeName already set. Some custom scheduler might
do that. Then kubelet starts to check the pod and (if DRA is enabled) will
refuse to run it, either because the claims are still waiting for the first
consumer or the pod wasn't added to reservedFor. Both are things the scheduler
normally does.

A pod can also reach the same state if it was scheduled while the DRA feature was
disabled in the kube-scheduler.

Which issue(s) this PR fixes:

Fixes #114005 (Scheduling Pods with dynamic resources and using the Pod .spec.nodeName field fails)

Special notes for your reviewer:

The resource claim controller can handle these two cases by taking over for the
kube-scheduler when nodeName is set. Triggering an allocation is simpler than
in the scheduler because all it takes is creating the right
PodSchedulingContext with spec.selectedNode set. There's no need to list nodes
because that choice was already made, permanently. Adding the pod to
reservedFor also isn't hard.

What's currently missing is triggering de-allocation of claims to re-allocate
them for the desired node. This is not important for claims that get created
for the pod from a template and then only get used once, but it might be
worthwhile to add de-allocation in the future.
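
To make this concrete, here is a rough sketch of that takeover using the resource.k8s.io/v1alpha2 API. This is not the controller's actual code: the function name and client plumbing are hypothetical, and conflict handling, retries, and owner references are omitted.

```go
// Hypothetical sketch (not the controller's actual code): what "taking over
// for the kube-scheduler" amounts to for a pod with spec.nodeName already set.
package dracontroller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// handlePreScheduledPod triggers delayed allocation and reserves a claim for a
// pod that bypassed the scheduler.
func handlePreScheduledPod(ctx context.Context, client kubernetes.Interface,
	pod *corev1.Pod, claim *resourcev1alpha2.ResourceClaim) error {

	// Trigger allocation: a PodSchedulingContext with spec.selectedNode set is
	// enough, because the node choice was already made permanently. No
	// potentialNodes list is needed.
	schedulingCtx := &resourcev1alpha2.PodSchedulingContext{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pod.Name, // the real controller also sets an owner reference
			Namespace: pod.Namespace,
		},
		Spec: resourcev1alpha2.PodSchedulingContextSpec{
			SelectedNode: pod.Spec.NodeName,
		},
	}
	if _, err := client.ResourceV1alpha2().PodSchedulingContexts(pod.Namespace).
		Create(ctx, schedulingCtx, metav1.CreateOptions{}); err != nil {
		return err
	}

	// Reserve the claim for the pod so that kubelet accepts it: add the pod to
	// status.reservedFor once the claim is allocated.
	claim = claim.DeepCopy()
	claim.Status.ReservedFor = append(claim.Status.ReservedFor,
		resourcev1alpha2.ResourceClaimConsumerReference{
			Resource: "pods",
			Name:     pod.Name,
			UID:      pod.UID,
		})
	_, err := client.ResourceV1alpha2().ResourceClaims(claim.Namespace).
		UpdateStatus(ctx, claim, metav1.UpdateOptions{})
	return err
}
```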

Does this PR introduce a user-facing change?

kube-controller-manager: the dynamic resource controller steps in when a pod was created such that the scheduler ignores it (i.e. spec.nodeName is set) and takes care of triggering delayed resource claim allocation and/or reserving a claim for the pod.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- KEP: https://github.com/kubernetes/enhancements/issues/3063

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 23, 2023
@pohly (Contributor, Author) commented May 23, 2023

/cc @moshe010

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 23, 2023
@cici37 (Contributor) commented May 23, 2023

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 23, 2023
@bart0sh (Contributor) commented May 24, 2023

/triage accepted
/priority important-longterm
/cc @klueska

@k8s-ci-robot k8s-ci-robot requested a review from klueska May 24, 2023 07:18
@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 24, 2023
@bart0sh bart0sh moved this from Triage to PRs - Needs Reviewer in SIG Node CI/Test Board May 24, 2023
@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage May 24, 2023
SIG Node CI/Test Board automation moved this from PRs - Needs Reviewer to PRs - Needs Approver Jul 13, 2023
@klueska (Contributor) commented Jul 13, 2023

I didn't do a thorough review of all of the code myself, but in general the flow looks good and I trust @elezar's LGTM, knowing he did do a thorough review.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 13, 2023
@k8s-ci-robot (Contributor):
LGTM label has been added.

Git tree hash: 72afbcaa4b48617558d2e2cf531f6f39014127af

@pohly (Contributor, Author) commented Jul 13, 2023

/assign @liggitt

For plugin/pkg/auth/authorizer approval.

Review thread on the ResourceClaim update event handler:

},
UpdateFunc: func(old, updated interface{}) {
ec.onResourceClaimAddOrUpdate(logger, updated)
logger.V(6).Info("updated claim", "claimDiff", cmp.Diff(old, updated))

@liggitt (Member):

this diff will always run, regardless of whether the result is logged... isn't that expensive? also, cmp.Diff isn't guaranteed to not panic:

From https://pkg.go.dev/github.com/google/go-cmp/cmp:

Its propensity towards panicking means that it is unsuitable for production environments where a spurious panic may be fatal.

@pohly (Contributor, Author):

You are right; let's guard the call with an if check. I think at -v6 it's okay to take the risk of a panic. I've seen other log calls which dump diffs at higher verbosity levels.
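
The guard being suggested here would look roughly like this with logr (a sketch only; as noted below, the cmp.Diff call ended up being removed entirely):

```go
// Compute the (expensive, potentially panicking) diff only when V(6) logging
// is actually enabled.
if loggerV := logger.V(6); loggerV.Enabled() {
	loggerV.Info("updated claim", "claimDiff", cmp.Diff(old, updated))
}
```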

@liggitt (Member):

I think at -v6 it's okay to take the risk of a panic

Do we know this is a safe type to pass to cmp.Diff? Should we test that to give us confidence V(6) won't panic?

@pohly (Contributor, Author):

I have tested that manually while developing the PR. For SIG Instrumentation, I am running a periodic job with logging at a very high log level, which happens to include DRA. But automated testing in release-informing jobs is missing.

I could add a dedicated unit test, but that's not worth it; instead, I have removed the usage of cmp.Diff: https://github.com/kubernetes/kubernetes/compare/5cb4f18791a44d23c2d1d83ba7323ec903e1597b..80ab8f0542f9ddcf4935e24f742ef2a94b204471

@liggitt Okay to merge now?

@liggitt (Member) commented Jul 13, 2023:

@liggitt Okay to merge now?

yup

@liggitt (Member) commented Jul 13, 2023

/approve
for authorizer change

/hold
for resolution of unconditional cmp.Diff call. @pohly can unhold and lgtm once that is resolved

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 13, 2023
@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, pohly, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 13, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 13, 2023
@pohly (Contributor, Author) commented Jul 13, 2023

Update pushed. Let's wait for pull-kubernetes-kind-dra before lifting the hold, just in case.

Commit messages from the pushed update:

Enabling logging is useful to track what the code is doing.

There are some functional changes:
- The pod handler checks for the existence of claims. This avoids adding pods to the work queue in more cases where nothing needs to be done, at the cost of making the event handlers a bit slower. This will become more important when adding more work to the controller.
- The handler for deleted ResourceClaims did not check for cache.DeletedFinalStateUnknown (see the sketch after this list).
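
For reference, a sketch of the two handler changes described above, using the standard client-go tombstone pattern; handler and helper names (onResourceClaimDelete, enqueuePod) are illustrative, not the controller's exact code:

```go
// Deleted ResourceClaims: a missed delete event delivers a tombstone instead
// of the object itself, so unwrap it before use.
DeleteFunc: func(obj interface{}) {
	if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
		obj = tombstone.Obj
	}
	claim, ok := obj.(*resourcev1alpha2.ResourceClaim)
	if !ok {
		return
	}
	ec.onResourceClaimDelete(logger, claim) // hypothetical handler name
},

// Pod events: only enqueue pods that actually reference resource claims.
AddFunc: func(obj interface{}) {
	pod, ok := obj.(*corev1.Pod)
	if !ok || len(pod.Spec.ResourceClaims) == 0 {
		return // nothing for this controller to do
	}
	ec.enqueuePod(pod) // hypothetical helper
},
```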
The allocation mode is relevant when clearing reservedFor: for delayed allocation, deallocation gets requested; for immediate allocation it does not (see the sketch below). Both cases should get tested.

All pre-defined claims now use delayed allocation, just as they would if created normally.
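
In terms of the v1alpha2 API, the distinction being tested is roughly the following (illustrative sketch, not the controller's exact code):

```go
// After removing a pod from claim.Status.ReservedFor:
if len(claim.Status.ReservedFor) == 0 &&
	claim.Spec.AllocationMode == resourcev1alpha2.AllocationModeWaitForFirstConsumer {
	// Delayed allocation: ask the DRA driver to deallocate so the claim goes
	// back to waiting for its first consumer; the driver clears
	// status.allocation and this flag when done.
	claim.Status.DeallocationRequested = true
}
// AllocationModeImmediate: nothing to request; the allocation is kept for the
// next consumer.
```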
@liggitt (Member) commented Jul 13, 2023

/lgtm
/hold cancel

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jul 13, 2023
@k8s-ci-robot (Contributor):
LGTM label has been added.

Git tree hash: d67bdd0250fda6b848636b2ad40b1e3edd23ca07

@k8s-ci-robot k8s-ci-robot merged commit bea27f8 into kubernetes:master Jul 13, 2023
13 of 15 checks passed
SIG Node CI/Test Board automation moved this from PRs - Needs Approver to Done Jul 13, 2023
SIG Node PR Triage automation moved this from Needs Reviewer to Done Jul 13, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jul 13, 2023