DRA: Improve allocator with better backtracking #130593

Open: wants to merge 2 commits into base: master

Conversation

mortent
Member

@mortent mortent commented Mar 5, 2025

/kind feature

What this PR does / why we need it:

This makes sure that the allocator doesn't keep retrying identical allocations if the limit for the maximum number of devices for a claim has been reached.
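To illustrate the idea, here is a minimal sketch of a backtracking search that stops retrying once a device limit has been hit. It is not the PR's code: the `searcher` type, the `outcome` values, and the `prune` switch are hypothetical names made up for this example, and the limit check is deliberately simplified.

```go
package main

import "fmt"

// outcome is a hypothetical result type for this sketch.
type outcome int

const (
	found outcome = iota
	notFound
	limitReached
)

// searcher is an illustrative stand-in, not the real allocator type.
type searcher struct {
	inUse      []bool // one entry per candidate device
	needed     int    // devices the claim asks for
	maxDevices int    // assumed per-claim device limit
	prune      bool   // stop retrying once the limit was hit
	calls      int    // number of allocate invocations
}

// allocate picks a device for the given slot and recurses to the next slot.
func (s *searcher) allocate(slot int) outcome {
	s.calls++
	if slot == s.needed {
		return found
	}
	if slot >= s.maxDevices {
		// Swapping earlier devices cannot change this result.
		return limitReached
	}
	for i := range s.inUse {
		if s.inUse[i] {
			continue
		}
		s.inUse[i] = true
		res := s.allocate(slot + 1)
		if res == found {
			return found
		}
		s.inUse[i] = false // backtrack
		if res == limitReached && s.prune {
			// Every other device would fail the same way, so give up early.
			return limitReached
		}
	}
	return notFound
}

func main() {
	for _, prune := range []bool{false, true} {
		s := &searcher{inUse: make([]bool, 4), needed: 3, maxDevices: 2, prune: prune}
		res := s.allocate(0)
		fmt.Printf("prune=%v outcome=%d calls=%d\n", prune, res, s.calls)
	}
}
```

With four candidate devices, three devices needed, and an assumed limit of two, the unpruned search makes 17 calls before giving up, while the pruned search makes 3.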

Which issue(s) this PR fixes:

Related-to #131730

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 5, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/code-generation area/test kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 5, 2025
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Mar 5, 2025
@k8s-ci-robot k8s-ci-robot requested a review from andrewsykim March 5, 2025 16:52
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Mar 5, 2025
@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Mar 5, 2025
@k8s-ci-robot k8s-ci-robot requested a review from AxeZhan March 5, 2025 16:52
@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Mar 5, 2025
@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@SergeyKanzhelev
Member

/cc @lauralorenz

@lauralorenz is this one of the reliability improvements for DRA? Can you please triage?

@k8s-ci-robot k8s-ci-robot requested a review from lauralorenz March 5, 2025 18:31
@mortent mortent force-pushed the DRAAllocatorImprovements branch from 9435d77 to a509b9a Compare May 3, 2025 22:47
@pohly
Contributor

pohly commented May 5, 2025

The gofmt error in pull-kubernetes-verify needs to be fixed.

@mortent mortent force-pushed the DRAAllocatorImprovements branch from a509b9a to b6946f3 Compare May 5, 2025 17:09
@mortent
Member Author

mortent commented May 5, 2025

> The gofmt error in pull-kubernetes-verify needs to be fixed.

It is fixed now.

@mortent mortent force-pushed the DRAAllocatorImprovements branch 4 times, most recently from d46451e to a8a65ee Compare May 9, 2025 15:30
@mortent
Member Author

mortent commented May 9, 2025

/test pull-kubernetes-integration

@mortent mortent requested a review from pohly May 18, 2025 22:08
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 21, 2025
@mortent mortent force-pushed the DRAAllocatorImprovements branch from a8a65ee to c947d8a Compare May 21, 2025 15:38
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 21, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 31, 2025
@pohly
Contributor

pohly commented Jun 9, 2025

Needs a rebase.

@mortent mortent force-pushed the DRAAllocatorImprovements branch from c947d8a to 9bdcedb Compare June 9, 2025 21:34
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 9, 2025
@mortent
Member Author

mortent commented Jun 10, 2025

/test pull-kubernetes-unit

@mortent
Member Author

mortent commented Jun 10, 2025

> Needs a rebase.

Done

Comment on lines 69 to 74
// allocationAttemptsByClaim collects the number of different permutations
// of devices that was attempted before succeeding or failing to allocate
// devices for a claim.
// The key in the map is the index of the claim in claimsToAllocate and
// the value is the number of permutations.
// Access to the map must be syncronized.
Contributor

Suggested change
- // allocationAttemptsByClaim collects the number of different permutations
- // of devices that was attempted before succeeding or failing to allocate
- // devices for a claim.
- // The key in the map is the index of the claim in claimsToAllocate and
- // the value is the number of permutations.
- // Access to the map must be syncronized.
+ // allocationAttemptsByClaim collects the number of different permutations
+ // of devices that were attempted before succeeding or failing to allocate
+ // devices for a claim. In other words, this includes also
+ // incomplete permutations.
+ // The key in the map is the index of the claim in claimsToAllocate and
+ // the value is the number of permutations.
+ // Access to the map must be synchronized.

type Stats struct {
	// AllocationAttemptsByClaim counts the number of allocation attempts per claim.
	// We count every combination of devices that were attempted in order to satisfy
	// the claim as an allocation attempt.
Contributor

Suggested change
- // the claim as an allocation attempt.
+ // the claim as an allocation attempt,
+ // including incomplete solutions.

Comment on lines 411 to 415
allocationAttemptsByClaim := make(map[string]int64)
for claimIndex, attemptsForClaim := range a.allocationAttemptsByClaim {
	claimName := a.claimsToAllocate[claimIndex].Name
	allocationAttemptsByClaim[claimName] = attemptsForClaim
}
Contributor

Suggested change
- allocationAttemptsByClaim := make(map[string]int64)
- for claimIndex, attemptsForClaim := range a.allocationAttemptsByClaim {
- 	claimName := a.claimsToAllocate[claimIndex].Name
- 	allocationAttemptsByClaim[claimName] = attemptsForClaim
- }
+ allocationAttemptsByClaim := maps.Clone(a.allocationAttemptsByClaim)
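For readers unfamiliar with `maps.Clone` (standard library, Go 1.21 and later), the following sketch shows what the suggested one-liner does: it returns a shallow copy with the same key and value types. That implies the copied map stays keyed by claim index rather than by claim name, which is presumably the intent of the suggestion. The `snapshot` helper is a hypothetical name used only here.

```go
package main

import (
	"fmt"
	"maps"
)

// snapshot returns an independent copy of the per-claim counters.
// The keys remain claim indices; no translation to claim names happens.
func snapshot(attempts map[int]int64) map[int]int64 {
	return maps.Clone(attempts)
}

func main() {
	attempts := map[int]int64{0: 3, 1: 7}
	copied := snapshot(attempts)

	// Later updates to the original do not affect the copy.
	attempts[0]++
	fmt.Println(copied[0], attempts[0])
}
```

The copy is shallow, which is sufficient here because the values are plain `int64` counters.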

	// If we get here without finding a solution, then there is none.
	return false, nil
}

func (alloc *allocator) incAllocationAttempt(claimIndex, requestIndex int) {
Contributor

The requestIndex parameter is unused.

@@ -766,20 +838,32 @@ func (alloc *allocator) allocateOne(r deviceIndices, allocateSubRequest bool) (b

// We already know how many devices per request are needed.
if r.deviceIndex >= requestData.numDevices {
	// We have successfully allocated devices for this request.
	alloc.incAllocationAttempt(r.claimIndex, r.requestIndex)
Contributor

I find it hard to verify that incAllocationAttempt is called in all places where it needs to be called. I'm also worried that we will forget to add it when adding more return statements in the future.

Can we establish a pattern such that every return in allocateOne is directly preceded by a call to incAllocationAttempt or a comment why there is none?

We also need very clear instructions in the comment for incAllocationAttempt when to call it - currently there are none.

In this case, the call could be made before the return statement, i.e. it doesn't need to be incremented before moving on to the next claim, right?

Actually, this code isn't moving on to the next claim. It's moving to the next request. Do we really want to count this as an attempt for the claim? It's neither failed nor complete.

Member Author

Yeah, I agree with your concern here. It is already hard to make sure the counter is incremented in the right places and it will be hard to verify correctness as we update the code.

Maybe this is a good reason to switch this to count the number of calls to allocateOne as we discussed in #130593 (comment). I don't think either option is something we can expose as a metric without additional discussion, so we could just keep it simple and revisit if we need it?

Contributor

Yes, let's count allocateOne calls because it's easier.

Member Author

Updated
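The option agreed on above, counting calls to allocateOne, can be sketched as follows. This is an illustration under assumptions, not the updated PR code: the `allocator` struct is reduced to a toy, and the field name `numAllocateOneCalls` and the `remaining` parameter are hypothetical. The point is that a single increment at function entry covers every return path, so nothing needs to be verified per return statement.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// allocator is a toy stand-in for the real type.
type allocator struct {
	// numAllocateOneCalls is a hypothetical counter name. An atomic value
	// avoids the need for a separate lock around the counter.
	numAllocateOneCalls atomic.Int64
	free                []bool // true means the device is still available
}

// allocateOne tries to pick `remaining` more devices via backtracking.
func (a *allocator) allocateOne(remaining int) bool {
	// One increment at entry, independent of how the function returns.
	a.numAllocateOneCalls.Add(1)

	if remaining == 0 {
		return true
	}
	for i := range a.free {
		if !a.free[i] {
			continue
		}
		a.free[i] = false
		if a.allocateOne(remaining - 1) {
			return true
		}
		a.free[i] = true // backtrack
	}
	return false
}

func main() {
	a := &allocator{free: []bool{true, true}}
	ok := a.allocateOne(2)
	fmt.Println(ok, a.numAllocateOneCalls.Load())
}
```

Adding another return statement to `allocateOne` later does not require touching the counter.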

// The key in the map is the index of the claim in claimsToAllocate and
// the value is the number of permutations.
// Access to the map must be syncronized.
allocationAttemptsByClaim map[int]int64
Contributor

I wonder whether we really need to track this by claim. I expect the question of when to call incAllocationAttempt (see below) to become easier to answer when we count leaf nodes in the overall search tree, and the semantics also become simpler.
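A rough sketch of the leaf-counting alternative mentioned here, under the assumption that a leaf is either a complete solution or a dead end with no device left to try. The `search` type and its fields are hypothetical and much simpler than the real allocator; a single counter replaces the per-claim map.

```go
package main

import "fmt"

// search is a toy model of the overall search tree.
type search struct {
	free   []bool // true means the device is still available
	leaves int64  // one counter for the whole search, not one per claim
}

// visit tries to pick `remaining` more devices and counts leaf nodes.
func (s *search) visit(remaining int) bool {
	if remaining == 0 {
		s.leaves++ // leaf: complete solution
		return true
	}
	expanded := false
	for i := range s.free {
		if !s.free[i] {
			continue
		}
		expanded = true
		s.free[i] = false
		if s.visit(remaining - 1) {
			return true
		}
		s.free[i] = true // backtrack
	}
	if !expanded {
		s.leaves++ // leaf: dead end, nothing left to try
	}
	return false
}

func main() {
	s := &search{free: []bool{true, true}}
	ok := s.visit(3) // needs more devices than exist
	fmt.Println(ok, s.leaves)
}
```

Inner nodes that merely fan out to children are not counted, which keeps the rule for when to increment down to two places.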

@github-project-automation github-project-automation bot moved this from PRs - Needs Reviewer to PRs Waiting on Author in SIG Node CI/Test Board Jun 26, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to In Progress in SIG Apps Jun 26, 2025
@github-project-automation github-project-automation bot moved this from Triage to Waiting on Author in SIG Node: code and documentation PRs Jun 26, 2025
@mortent mortent force-pushed the DRAAllocatorImprovements branch from 9bdcedb to 77cffa6 Compare June 28, 2025 21:18
Labels
area/code-generation area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 👀 In review
Status: In Progress
Status: PRs Waiting on Author
Status: Waiting on Author
7 participants