kubelet: support batched prepare/unprepare in v1alpha3 DRA plugin API #119012
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
A few test failures remain because the tests need to be updated, but the implementation is ready for review and probably okay.
Force-pushed b6ef657 to 3b74a71 (compare)
/lgtm
// The ResourceClaims for which preparation was done
// or attempted, with claim_uid as key.
//
// It is valid to return less entries. NodePrepareResources
Is there a default number of tries before backing off if no response is returned for a claim?
Is no response meant to indicate a transient error, while an explicit error indicates a definite error that cannot be overcome by a subsequent repetition of the prepare call for that claim?
There's no such semantic difference. kubelet treats a missing response as if the prepare call hadn't been invoked and will retry like it normally does - at least that was the goal. I need to double-check whether the implementation really does that.
The expectation was that drivers only return fewer entries when there was some kind of error that they reported. I suspect that returning no error and fewer entries is not currently caught in the post-processing.
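The missing-entry check discussed here could look roughly like the following sketch. The types and the `checkResponses` helper are simplified, hypothetical stand-ins for the generated v1alpha3 structs and the kubelet's post-processing, not the actual code:

```go
package main

import "fmt"

// Simplified stand-ins for the real DRA API types.
type Claim struct {
	Uid  string
	Name string
}

type NodePrepareResourceResponse struct {
	CDIDevices []string
	Error      string
}

// checkResponses treats a claim that is missing from the response map
// as an error, matching the semantics agreed on in this thread: a
// missing entry is not silently skipped.
func checkResponses(requested []Claim, claims map[string]NodePrepareResourceResponse) error {
	for _, c := range requested {
		result, ok := claims[c.Uid]
		if !ok {
			return fmt.Errorf("NodePrepareResources returned no result for claim %s", c.Name)
		}
		if result.Error != "" {
			return fmt.Errorf("NodePrepareResources failed for claim %s: %s", c.Name, result.Error)
		}
	}
	return nil
}

func main() {
	req := []Claim{{Uid: "uid-1", Name: "claim-a"}, {Uid: "uid-2", Name: "claim-b"}}
	resp := map[string]NodePrepareResourceResponse{
		"uid-1": {CDIDevices: []string{"vendor.example.com/gpu=0"}},
	}
	// claim-b has no entry in the response, so this reports an error.
	fmt.Println(checkResponses(req, resp))
}
```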
The wording was indeed unfortunate. The intent was to treat that situation like an error, so the description should also state it like one. Changed.
The necessary check in the code was indeed missing - added.
While doing so, I found some cut-and-paste errors in error messages - fixed.
Here's the full diff: https://github.com/kubernetes/kubernetes/compare/3b74a711a02abc495b351499033593f9831959c3..fd74b7b8ddf7313af813f5c5e8cbb4923f69c534
LGTM label has been added. Git tree hash: e12d882e980acf0f0ce055f8440485b7c83ab6cb
Force-pushed 3b74a71 to fd74b7b (compare)
Force-pushed fd74b7b to acf558e (compare)
/lgtm
pkg/kubelet/cm/dra/manager.go
Outdated
batches[pluginName] = append(batches[pluginName],
	&drapb.Claim{
		Namespace:      resourceClaim.Namespace,
		Uid:            string(resourceClaim.UID),
		Name:           resourceClaim.Name,
		ResourceHandle: resourceHandle.Data,
	})
nit: but I'd prefer to split this apart, i.e.:
pbclaim := &drapb.Claim{
Namespace: resourceClaim.Namespace,
Uid: string(resourceClaim.UID),
Name: resourceClaim.Name,
ResourceHandle: resourceHandle.Data,
}
batches[pluginName] = append(batches[pluginName], pbclaim)
Okay, changed.
Did this change get reverted as well?
This still doesn't seem to be changed...
pkg/kubelet/cm/dra/manager.go
Outdated
if result.Error != "" {
	return fmt.Errorf("NodePrepareResources failed for claim %s/%s: %s", reqClaim.Namespace, reqClaim.Name, err)
There's something still strange to me about this `Error` field. Is it solely for reporting here? Or is it used/interpreted elsewhere?
It's only checked here. The content of the `Error` field then only gets passed on as part of the error created here... except that `err` should have been `result.Error`! Fixed.
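A minimal illustration of the cut-and-paste bug and its fix; `buggyWrap` and `fixedWrap` are hypothetical helpers, not the actual manager code:

```go
package main

import "fmt"

// buggyWrap mirrors the original mistake: err is the gRPC error, which
// is nil here because the call itself succeeded, so the driver's
// per-claim message never reaches the caller.
func buggyWrap(namespace, name string, err error) error {
	return fmt.Errorf("NodePrepareResources failed for claim %s/%s: %s", namespace, name, err)
}

// fixedWrap interpolates the Error field from the response instead.
func fixedWrap(namespace, name, resultError string) error {
	return fmt.Errorf("NodePrepareResources failed for claim %s/%s: %s", namespace, name, resultError)
}

func main() {
	fmt.Println(buggyWrap("default", "gpu-claim", nil))
	fmt.Println(fixedWrap("default", "gpu-claim", "device busy"))
}
```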
pkg/kubelet/cm/dra/manager.go
Outdated
batches[pluginName] = append(batches[pluginName],
	&drapb.Claim{
		Namespace:      resourceClaim.Namespace,
		Uid:            string(resourceClaim.UID),
		Name:           resourceClaim.Name,
		ResourceHandle: resourceHandle.Data,
	})
Same comment as above for splitting this
Changed.
pkg/kubelet/cm/dra/manager.go
Outdated
// No errors, time to remove the pod reference from all claims that were involved.
for _, claimInfo := range claimInfos {
	// Delete last pod UID only if unprepare succeed.
This is a slight change in semantics from before, since previously we would remove pod references as `NodeUnprepareResources()` was called successfully on each individual claim. I'm not saying it's not OK to do it this new way, but we should acknowledge that it's a slight change in semantics from before.
Let's keep the original semantics. I moved this code into the for loop above, directly after checking the per-claim result.
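The per-claim placement described here can be sketched as follows; the types and function names are simplified stand-ins for the kubelet's claimInfo cache, not the actual implementation:

```go
package main

import "fmt"

// claimInfo is a simplified stand-in for a cache entry that tracks
// which pods reference a prepared claim.
type claimInfo struct {
	podUIDs map[string]struct{}
}

func (c *claimInfo) deletePodReference(uid string) {
	delete(c.podUIDs, uid)
}

// processUnprepareResults checks each per-claim result and removes the
// pod reference directly after that claim's result is checked, keeping
// the original per-claim semantics instead of batching the deletions
// into a second loop. results maps claim UID to an error string
// ("" means success).
func processUnprepareResults(podUID string, results map[string]string, infos map[string]*claimInfo) error {
	for claimUID, errMsg := range results {
		if errMsg != "" {
			return fmt.Errorf("NodeUnprepareResources failed for claim %s: %s", claimUID, errMsg)
		}
		// Delete the pod UID only if unprepare succeeded for this claim.
		infos[claimUID].deletePodReference(podUID)
	}
	return nil
}

func main() {
	infos := map[string]*claimInfo{
		"claim-1": {podUIDs: map[string]struct{}{"pod-a": {}}},
	}
	err := processUnprepareResults("pod-a", map[string]string{"claim-1": ""}, infos)
	fmt.Println(err, len(infos["claim-1"].podUIDs))
}
```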
pkg/kubelet/cm/dra/manager.go
Outdated
batches[pluginName] = append(batches[pluginName],
	&drapb.Claim{
		Namespace:      resourceClaim.Namespace,
		Uid:            string(resourceClaim.UID),
		Name:           resourceClaim.Name,
		ResourceHandle: resourceHandle.Data,
	})
Due to the code above for:
// Is the resource already prepared? Then add the pod UID to it.
if claimInfo := m.cache.get(claimName, pod.Namespace); claimInfo != nil {
// We delay checkpointing of this change until this call
// returns successfully. It is OK to do this because we
// will only return successfully from this call if the
// checkpoint has succeeded. That means if the kubelet is
// ever restarted before this checkpoint succeeds, the pod
// whose resources are being prepared would never have
// started, so it's OK (actually correct) to not include it
// in the cache.
claimInfo.addPodReference(pod.UID)
continue
}
It means we include claims in these batches if they haven't already been prepared (presumably by some other pod or a different round of preparing this pod where some claim allocations failed).
I'm not sure how I feel about that. Given that the underlying `NodePrepareResources()` call needs to be idempotent, I'd rather send it the full list of claims linked to the pod anytime `NodePrepareResources()` is called, and let it explicitly ignore the ones that have already been prepared.
I don't have a strong preference either way. As you prefer sending the full list, let's do that.
It leads to one additional corner case: all claims for a driver might have been prepared already. In that case we shouldn't call `NodePrepareResources`, should we?
It doesn't hurt if we do, the driver should just not actually do any preparation.
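Driver-side idempotency, as described here, might look like this sketch: a claim that was already prepared returns the cached result instead of reconfiguring hardware. The `driver` type and device-ID format are purely illustrative:

```go
package main

import "fmt"

// driver is a hypothetical plugin that caches prepared claims so that
// repeated NodePrepareResources calls for the same claim are no-ops.
type driver struct {
	prepared map[string][]string // claim UID -> CDI device IDs
}

func (d *driver) prepareClaim(uid string) []string {
	if devices, ok := d.prepared[uid]; ok {
		return devices // already prepared: nothing to do
	}
	// First time: "configure the hardware" and remember the result.
	devices := []string{"vendor.example.com/device=" + uid}
	d.prepared[uid] = devices
	return devices
}

func main() {
	d := &driver{prepared: map[string][]string{}}
	first := d.prepareClaim("uid-1")
	second := d.prepareClaim("uid-1") // idempotent repeat
	fmt.Println(first[0] == second[0], len(d.prepared))
}
```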
One thing this makes me realize though is that we originally discussed with @byako to both (1) batch claim requests and (2) pass pod info to the Prepare/Unprepare calls so that the driver would be aware of which pod's claims it was currently processing. If the pod info was also being passed, then it would definitely make sense to pass the full list of claims to Prepare/Unprepare even if there was nothing for the driver to do (so that it could keep track of this mapping if it wanted).
I presume @byako had a use case for passing the pod info (and I remember having a use case for it as well), but it's been a while since I looked at it and I can't remember off the top of my head why it was useful.
When passing pod info, what information should that be? Name/namespace/uid?
The problem then becomes `NodeUnprepareResources`: it doesn't get called for a pod when some other pod is still using the claim, so the driver will not be able to determine when a pod stops using a claim. Is that okay?
We have decided to avoid passing pod info in this iteration as well as keep the original semantics of only including claims in the batch if they are currently unprepared.
Just to clarify the use case: the Pod info was used to record in a CRD which Pod the claim was allocated for. When the plugin is called the first time to prepare one out of N claims for the Pod (without a batch of claims), the plugin has a chance of finding on its own, from the CRD, all the claims for the same Pod and preparing them all at once, because the HW only allows one configuration at a time. An example is SR-IOV Virtual Functions: after the first N Virtual Functions were created with one config call, subsequent ones (coming from other claims for the same HW) can only be created after the first configuration is removed.
The Pod info is not needed if the plugin is given the info for all of the Pod's claims, because then there is no need to traverse the CRD to find all claims allocated for the Pod.
Passing only unprepared claims should be fine; I have no opinion on whether it's good or bad. The only case where it might matter, that I can think of, is when on a previous prepare call some claims succeeded and some did not, and IMHO this should only happen when the claims were for different HW. There should not be a case where some claims for HW A succeeded while other claims for the same HW A in the same prepare call failed. If that were the case, then the driver should automatically unprepare the claims that succeeded for HW A in the same call and report them as failed as well (because subsequent claims might not be accepted by HW A), so the next prepare call would have all claims for HW A again.
pkg/kubelet/cm/dra/manager.go
Outdated
// Checkpoint to reduce redundant calls to NodeUnPrepareResource() after a kubelet restart.
err = m.cache.syncToCheckpoint()
// Checkpoint to reduce redundant calls to NodeUnPrepareResources() after a kubelet restart.
err := m.cache.syncToCheckpoint()
If we batch the pod reference deletions at the end like this, then I think the call to `syncToCheckpoint()` within this loop can simply be removed. We call it again just after the loop, without a bunch of gRPC calls between the checkpoints as before (which is why we wanted to checkpoint early, in case one of the `NodeUnprepareResource()` calls failed).
I think the `syncToCheckpoint` at the end of each function was also redundant. Doing it once after each gRPC call should be enough. Changed.
... except when the entire call was skipped. Then the final sync is needed.
I actually meant the other way around -- the final one is necessary -- the intermediate ones are not.
So prune them even further, i.e. do it potentially after multiple `NodePrepareResources` calls? That can happen when there are multiple plugins involved.
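One possible arrangement of the checkpointing being discussed, as a sketch: sync once after each plugin's batched gRPC call, plus a final sync that also covers the case where every call was skipped. The function names are illustrative, not the actual kubelet code:

```go
package main

import "fmt"

// unprepareAll walks the per-plugin batches, checkpointing after each
// batched gRPC call so that a kubelet restart does not repeat work
// that already succeeded, and once more at the end.
func unprepareAll(batches map[string][]string,
	call func(plugin string, claims []string) error,
	sync func() error) error {
	for plugin, claims := range batches {
		if err := call(plugin, claims); err != nil {
			return err
		}
		// Checkpoint right after the gRPC call.
		if err := sync(); err != nil {
			return err
		}
	}
	// Final sync: still needed when there was nothing to call.
	return sync()
}

func main() {
	syncs := 0
	err := unprepareAll(
		map[string][]string{"driver-a": {"claim-1"}},
		func(string, []string) error { return nil },
		func() error { syncs++; return nil },
	)
	fmt.Println(err, syncs)
}
```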
nodeClient drapb.NodeClient,
nodeClientOld drapbv1alpha2.NodeClient,
I was wondering how we were going to handle the old API. This seems like a reasonable solution - i.e. always make calls using the new API in `manager.go`, but convert them under the hood to the old API (if necessary) from within the client.
resp, err = nodeClient.NodePrepareResources(ctx, req)
if err != nil {
	status, _ := grpcstatus.FromError(err)
	if status.Code() == grpccodes.Unimplemented {
Is there no way to just check which API a plugin registered with, rather than attempting a call on the new API and falling back to the old one if it fails? If not, is there some way to make this more generic instead of inline in this call?
We would need some persistent state for the plugin. Perhaps this can be handled as part of #118619?
OK, let's leave this as is and I'll work with @TommyStarK to clean this up in his PR.
message NodePrepareResourcesResponse {
    // The ResourceClaims for which preparation was done
    // or attempted, with claim_uid as key.
    //
    // It is an error if some claim listed in NodePrepareResourcesRequest
    // does not get prepared. NodePrepareResources
    // will be called again for those that are missing.
    map<string, NodePrepareResourceResponse> claims = 1;
}

message NodePrepareResourceResponse {
    // These are the additional devices that kubelet must
    // make available via the container runtime. A resource
    // may have zero or more devices.
    repeated string cdi_devices = 1;
    // If non-empty, preparing the ResourceClaim failed.
    // cdi_devices is ignored in that case.
    string error = 2;
}
OK, I'm fine with it. I just wanted to make sure we were conscious of why we would do it this way vs. the other.
repeated string cdi_devices = 1;
// If non-empty, preparing the ResourceClaim failed.
// cdi_devices is ignored in that case.
string error = 2;
We can add it later without bumping the API, as you mention. I think it's fine as is for now. (Though it's a little weird to me to pass back an arbitrary string just for logging purposes.)
@klueska: I've pushed new commits. I kept the "include all claims in NodePrepareResources" change separate in case we want to remove that after all. It's now not consistent with …
/test pull-kubernetes-integration-eks
Just an experiment, can be ignored.
Force-pushed 1cdcdcf to b744050 (compare)
Because of that inconsistency, @klueska and I agreed on chat to go back to the original behavior of only including claims in …
}
claimInfos[resourceClaim.UID] = claimInfo
}
The comment that begins this loop is now incorrect, as we don't call `NodeUnprepareResource()` from within this loop anymore. That comment should be moved below, to the new loop with:
for pluginName, claims := range batches {
Please double-check against the `PrepareResources()` call and try to make the comments consistent.
The "only for the last pod that references the claim" seemed redundant to me. I dropped it in the updated comments.
pkg/kubelet/cm/dra/manager.go
Outdated
// Check for errors. If there is any error, processing gets aborted.
// We could try to continue, but that would make the code more complex.
Suggested change:
- // Check for errors. If there is any error, processing gets aborted.
- // We could try to continue, but that would make the code more complex.
+ // Call NodeUnprepareResources for all claims in the batch.
+ // If there is any error, processing gets aborted.
+ // We could try to continue, but that would make the code more complex.
Done, with "for all claims in each batch" - there's one for each plugin.
pkg/kubelet/cm/dra/manager.go
Outdated
return fmt.Errorf("NodeUnprepareResources failed for claim %s/%s: %s", reqClaim.Namespace, reqClaim.Name, err)
}

// Delete last pod UID only if unprepare succeed.
Suggested change:
- // Delete last pod UID only if unprepare succeed.
+ // Delete last pod UID only if unprepare succeeds.
// These are the additional devices that kubelet must
// make available via the container runtime. A resource
// may have zero or more devices.
repeated string cdi_devices = 1;
Can we make this:
cdi_devices = 1 [(gogoproto.customname) = "CDIDevices"];
So that it is referenced in Go as `CDIDevices` instead of `CdiDevices`.
Done.
Force-pushed b744050 to 22ec128 (compare)
Update pushed.
Minor nit, then I think this is ready to go.
pkg/kubelet/cm/dra/manager.go
Outdated
batches[pluginName] = append(batches[pluginName], | ||
&drapb.Claim{ | ||
Namespace: resourceClaim.Namespace, | ||
Uid: string(resourceClaim.UID), | ||
Name: resourceClaim.Name, | ||
ResourceHandle: resourceHandle.Data, | ||
}) |
This still doesn't seem to be changed...
Combining all prepare/unprepare operations for a pod enables plugins to optimize the execution. Plugins can continue to use the v1alpha2 API for now, but should switch. The new API is designed so that plugins which want to work on each claim one-by-one can do so and then report errors for each claim separately, i.e. partial success is supported.
Force-pushed 22ec128 to d743c50 (compare)
Thanks Patrick for being so responsive to my comments / feedback.
I'm still not 100% happy with the changes in `pkg/kubelet/cm/dra/plugin/client.go`, but we can address these concerns as part of #118619, which heavily modifies this file anyway.
/lgtm
/approve
LGTM label has been added. Git tree hash: 6fa3d8e1871eae5bb50f06b5501974aa10a423f9
/approve
vendor changes
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: byako, klueska, pohly, soltysh
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
/test pull-kubernetes-unit
flaky: …
What type of PR is this?
/kind feature
What this PR does / why we need it:
Combining all prepare/unprepare operations for a pod enables plugins to optimize the execution. Plugins can continue to use the v1alpha2 API for now, but should switch. The new API is designed so that plugins which want to work on each claim one-by-one can do so and then report errors for each claim separately, i.e. partial success is supported.
Which issue(s) this PR fixes:
Related-to: #117104
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/assign @byako