DRA: Improve allocator with better backtracking #130593
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
/cc @lauralorenz

@lauralorenz is this one of the reliability improvements for DRA? Can you please triage?
Force-pushed from 9435d77 to a509b9a
The gofmt error in pull-kubernetes-verify needs to be fixed.
Force-pushed from a509b9a to b6946f3
It is fixed now.
Force-pushed from d46451e to a8a65ee
/test pull-kubernetes-integration
Force-pushed from a8a65ee to c947d8a
Needs a rebase.
Force-pushed from c947d8a to 9bdcedb
/test pull-kubernetes-unit
Done
// allocationAttemptsByClaim collects the number of different permutations
// of devices that was attempted before succeeding or failing to allocate
// devices for a claim.
// The key in the map is the index of the claim in claimsToAllocate and
// the value is the number of permutations.
// Access to the map must be syncronized.
Suggested change:
-// allocationAttemptsByClaim collects the number of different permutations
-// of devices that was attempted before succeeding or failing to allocate
-// devices for a claim.
-// The key in the map is the index of the claim in claimsToAllocate and
-// the value is the number of permutations.
-// Access to the map must be syncronized.
+// allocationAttemptsByClaim collects the number of different permutations
+// of devices that were attempted before succeeding or failing to allocate
+// devices for a claim. In other words, this also includes
+// incomplete permutations.
+// The key in the map is the index of the claim in claimsToAllocate and
+// the value is the number of permutations.
+// Access to the map must be synchronized.
type Stats struct {
	// AllocationAttemptsByClaim counts the number of allocation attempts per claim.
	// We count every combination of devices that were attempted in order to satisfy
	// the claim as an allocation attempt.
Suggested change:
-	// the claim as an allocation attempt.
+	// the claim as an allocation attempt,
+	// including incomplete solutions.
allocationAttemptsByClaim := make(map[string]int64)
for claimIndex, attemptsForClaim := range a.allocationAttemptsByClaim {
	claimName := a.claimsToAllocate[claimIndex].Name
	allocationAttemptsByClaim[claimName] = attemptsForClaim
}
Suggested change:
-allocationAttemptsByClaim := make(map[string]int64)
-for claimIndex, attemptsForClaim := range a.allocationAttemptsByClaim {
-	claimName := a.claimsToAllocate[claimIndex].Name
-	allocationAttemptsByClaim[claimName] = attemptsForClaim
-}
+allocationAttemptsByClaim := maps.Clone(a.allocationAttemptsByClaim)
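For reference, a minimal sketch of what maps.Clone does (standard library "maps" package, Go 1.21+); the variable names are illustrative, not the PR's code. Note that Clone preserves the key type, so the result stays keyed by claim index rather than being converted to claim names as the original loop does:

package main

import (
	"fmt"
	"maps"
)

func main() {
	// Hypothetical attempt counts, keyed by claim index.
	attempts := map[int]int64{0: 3, 1: 12}

	// maps.Clone makes a shallow copy with the same key and value types.
	snapshot := maps.Clone(attempts)

	// Mutating the original afterwards does not affect the snapshot.
	attempts[0]++
	fmt.Println(snapshot[0], snapshot[1]) // prints: 3 12
}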
	// If we get here without finding a solution, then there is none.
	return false, nil
}

func (alloc *allocator) incAllocationAttempt(claimIndex, requestIndex int) {
The requestIndex parameter is unused.
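A minimal sketch of the obvious fix, assuming the request index really is not needed (the type and field names here are stand-ins for the PR's actual code):

type allocator struct {
	// allocationAttemptsByClaim is keyed by the claim's index in
	// claimsToAllocate; access to it must be synchronized.
	allocationAttemptsByClaim map[int]int64
}

// incAllocationAttempt records one more attempted device permutation
// for the given claim; the unused requestIndex parameter is dropped.
func (alloc *allocator) incAllocationAttempt(claimIndex int) {
	alloc.allocationAttemptsByClaim[claimIndex]++
}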
@@ -766,20 +838,32 @@ func (alloc *allocator) allocateOne(r deviceIndices, allocateSubRequest bool) (b

	// We already know how many devices per request are needed.
	if r.deviceIndex >= requestData.numDevices {
		// We have successfully allocated devices for this request.
		alloc.incAllocationAttempt(r.claimIndex, r.requestIndex)
I find it hard to verify that incAllocationAttempt is called in all places where it needs to be called. I'm also worried that we will forget to add it when adding more return statements in the future.

Can we establish a pattern such that every return in allocateOne is directly preceded by a call to incAllocationAttempt or a comment why there is none?

We also need very clear instructions in the comment for incAllocationAttempt when to call it - currently there are none.

In this case, the call could be made before the return statement, i.e. it doesn't need to be incremented before moving on to the next claim, right?

Actually, this code isn't moving on to the next claim. It's moving to the next request. Do we really want to count this as an attempt for the claim? It's neither failed nor complete.
Yeah, I agree with your concern here. It is already hard to make sure the counter is incremented in the right places and it will be hard to verify correctness as we update the code.

Maybe this is a good reason to switch this to count the number of calls to allocateOne as we discussed in #130593 (comment). I don't think either option is something we can expose as a metric without additional discussion, so we could just keep it simple and revisit if we need it?
Yes, let's count allocateOne calls because it's easier.
Updated
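For illustration, a sketch of the approach settled on above: a single counter incremented on entry to allocateOne, so every node visited in the backtracking search tree is counted exactly once and no return path can be missed. Field and type names are assumptions, not the PR's final identifiers:

// deviceIndices is a stand-in for the allocator's real cursor type.
type deviceIndices struct {
	claimIndex, requestIndex, deviceIndex int
}

type allocator struct {
	// numAllocateOneInvocations counts calls to allocateOne, i.e. the
	// number of nodes visited in the backtracking search tree.
	numAllocateOneInvocations int64
}

func (alloc *allocator) allocateOne(r deviceIndices, allocateSubRequest bool) (bool, error) {
	// Incrementing on entry makes the count independent of which of the
	// many return statements is eventually taken.
	alloc.numAllocateOneInvocations++

	// ... actual allocation and backtracking logic elided ...
	return false, nil
}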
// The key in the map is the index of the claim in claimsToAllocate and
// the value is the number of permutations.
// Access to the map must be syncronized.
allocationAttemptsByClaim map[int]int64
I wonder whether we really need to track this by claim. I expect the question of when to call incAllocationAttempt (see below) to become easier to answer when we count leaf nodes in the overall search tree, and the semantics also become simpler.
Force-pushed from 9bdcedb to 77cffa6
/kind feature
What this PR does / why we need it:
This makes sure that the allocator doesn't keep retrying identical allocations if the limit for the maximum number of devices for a claim has been reached.
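Conceptually (a heavily simplified sketch, not the DRA allocator's real code or types), the change amounts to pruning a branch of the backtracking search as soon as the per-claim device limit is hit, instead of continuing to enumerate permutations that can no longer succeed:

// allocate tries to pick `needed` devices from `remaining`. The limit
// check is the conceptual point: once it fires, the whole branch is
// abandoned rather than retried with equivalent permutations.
func allocate(chosen, remaining []string, needed, maxDevicesPerClaim int) bool {
	if len(chosen) == needed {
		return true // complete allocation found
	}
	if len(chosen) >= maxDevicesPerClaim {
		return false // limit reached: prune, don't keep retrying
	}
	for i := range remaining {
		next := append(append([]string{}, chosen...), remaining[i])
		rest := append(append([]string{}, remaining[:i]...), remaining[i+1:]...)
		if allocate(next, rest, needed, maxDevicesPerClaim) {
			return true
		}
	}
	return false
}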
Which issue(s) this PR fixes:
Related-to #131730
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: