dra: pre-scheduled pods #118209
Conversation
/cc @moshe010
/remove-sig api-machinery
/triage accepted
I didn't do a thorough review of all of the code myself, but in general the flow looks good and I trust @elezar's LGTM, knowing he did do a thorough review.
/lgtm
LGTM label has been added. Git tree hash: 72afbcaa4b48617558d2e2cf531f6f39014127af
/assign @liggitt For
},
UpdateFunc: func(old, updated interface{}) {
    ec.onResourceClaimAddOrUpdate(logger, updated)
    logger.V(6).Info("updated claim", "claimDiff", cmp.Diff(old, updated))
this diff will always run, regardless of whether the result is logged... isn't that expensive? Also, cmp.Diff isn't guaranteed not to panic.
From https://pkg.go.dev/github.com/google/go-cmp/cmp:
Its propensity towards panicking means that it is unsuitable for production environments where a spurious panic may be fatal.
You are right; better to guard the call with an if check. I think at -v6 it's okay to take the risk of a panic. I've seen other log calls which dump diffs at higher verbosity levels.
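For illustration, a minimal sketch of that guard pattern (the handleUpdate helper is hypothetical and this is not the code from this PR, which later dropped cmp.Diff entirely): the diff is only computed when V(6) logging is actually enabled.

```go
package example

import (
	"github.com/google/go-cmp/cmp"
	"k8s.io/klog/v2"
)

// handleUpdate only computes the (potentially expensive, panic-prone) diff
// when V(6) logging is enabled.
func handleUpdate(logger klog.Logger, old, updated interface{}) {
	if loggerV := logger.V(6); loggerV.Enabled() {
		loggerV.Info("updated claim", "claimDiff", cmp.Diff(old, updated))
	}
}
```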
I think at -v6 it's okay to take the risk of a panic
Do we know this is a safe type to pass to cmp.Diff? Should we test that to give us confidence V(6) won't panic?
I have tested that manually while developing the PR. For SIG Instrumentation, I am running a periodic job with logging at a very high log level, which happens to include DRA. But automated testing in release-informing jobs is missing.
I could add a dedicated unit test, but that's not worth it, so I have removed the usage of cmp.Diff instead: https://github.com/kubernetes/kubernetes/compare/5cb4f18791a44d23c2d1d83ba7323ec903e1597b..80ab8f0542f9ddcf4935e24f742ef2a94b204471
@liggitt Okay to merge now?
@liggitt Okay to merge now?
yup
/approve
/hold
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: liggitt, pohly, soltysh
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
c9a57cc to 5cb4f18
Update pushed. Let's wait for pull-kubernetes-kind-dra before lifting the hold, just in case.
Enabling logging is useful to track what the code is doing.

There are some functional changes:
- The pod handler checks for existence of claims. This avoids adding pods to the work queue in more cases when nothing needs to be done, at the cost of making the event handlers a bit slower. This will become more important when adding more work to the controller.
- The handler for deleted ResourceClaims did not check for cache.DeletedFinalStateUnknown (see the sketch below).
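As a rough sketch of what such a check looks like (assumed shape, not the exact controller code; onResourceClaimDelete is a hypothetical name): when the informer missed the actual delete event, the object arrives wrapped in a tombstone and has to be unwrapped first.

```go
package example

import (
	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	"k8s.io/client-go/tools/cache"
)

// onResourceClaimDelete tolerates cache.DeletedFinalStateUnknown tombstones
// before acting on the deleted claim.
func onResourceClaimDelete(obj interface{}) {
	if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
		// The informer missed the delete event; the tombstone carries
		// the last known state of the object.
		obj = tombstone.Obj
	}
	claim, ok := obj.(*resourcev1alpha2.ResourceClaim)
	if !ok {
		return // neither a claim nor a tombstone wrapping one
	}
	_ = claim // e.g. requeue pods that reference this claim
}
```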
The allocation mode is relevant when clearing reservedFor: for delayed allocation, deallocation gets requested; for immediate allocation it does not. Both should get tested. All pre-defined claims now use delayed allocation, just as they would if created normally.
When someone decides that a Pod should definitely run on a specific node, they can create the Pod with spec.nodeName already set. Some custom scheduler might do that. Then kubelet starts to check the pod and (if DRA is enabled) will refuse to run it, either because the claims are still waiting for the first consumer or the pod wasn't added to reservedFor. Both are things the scheduler normally does.

Also, if a pod got scheduled while the DRA feature was off in the kube-scheduler, a pod can reach the same state.

The resource claim controller can handle these two cases by taking over for the kube-scheduler when nodeName is set. Triggering an allocation is simpler than in the scheduler because all it takes is creating the right PodSchedulingContext with spec.selectedNode set. There's no need to list nodes because that choice was already made, permanently. Adding the pod to reservedFor also isn't hard.

What's currently missing is triggering de-allocation of claims to re-allocate them for the desired node. This is not important for claims that get created for the pod from a template and then only get used once, but it might be worthwhile to add de-allocation in the future.
5cb4f18 to 80ab8f0
/lgtm
LGTM label has been added. Git tree hash: d67bdd0250fda6b848636b2ad40b1e3edd23ca07
What type of PR is this?
/kind feature
What this PR does / why we need it:
When someone decides that a Pod should definitely run on a specific node, they
can create the Pod with spec.nodeName already set. Some custom scheduler might
do that. Then kubelet starts to check the pod and (if DRA is enabled) will
refuse to run it, either because the claims are still waiting for the first
consumer or the pod wasn't added to reservedFor. Both are things the scheduler
normally does.
Also, if a pod got scheduled while the DRA feature was off in the
kube-scheduler, a pod can reach the same state.
Which issue(s) this PR fixes:
Fixes #114005
Special notes for your reviewer:
The resource claim controller can handle these two cases by taking over for the
kube-scheduler when nodeName is set. Triggering an allocation is simpler than
in the scheduler because all it takes is creating the right
PodSchedulingContext with spec.selectedNode set. There's no need to list nodes
because that choice was already made, permanently. Adding the pod to
reservedFor also isn't hard.
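To make those two steps concrete, here is a rough sketch assuming the resource.k8s.io/v1alpha2 API used at the time; it is not the controller's actual code (the helper names ensureSchedulingContext and addPodToReservedFor are hypothetical, and real code would also set owner references, handle conflicts, and update the claim status via the API).

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureSchedulingContext creates a PodSchedulingContext whose selectedNode
// is simply the node the pod was already bound to via spec.nodeName.
func ensureSchedulingContext(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	scheduling := &resourcev1alpha2.PodSchedulingContext{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		Spec: resourcev1alpha2.PodSchedulingContextSpec{
			// No need to list potential nodes: the choice was already made.
			SelectedNode: pod.Spec.NodeName,
		},
	}
	_, err := client.ResourceV1alpha2().PodSchedulingContexts(pod.Namespace).Create(ctx, scheduling, metav1.CreateOptions{})
	return err
}

// addPodToReservedFor records the pod as a consumer of an allocated claim.
func addPodToReservedFor(claim *resourcev1alpha2.ResourceClaim, pod *corev1.Pod) {
	claim.Status.ReservedFor = append(claim.Status.ReservedFor,
		resourcev1alpha2.ResourceClaimConsumerReference{
			Resource: "pods",
			Name:     pod.Name,
			UID:      pod.UID,
		})
}
```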
What's currently missing is triggering de-allocation of claims to re-allocate
them for the desired node. This is not important for claims that get created
for the pod from a template and then only get used once, but it might be
worthwhile to add de-allocation in the future.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: