Skip to content

DRA: fix deleting orphaned ResourceClaim on startup #132533

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nojnhuh
Copy link
Contributor

@nojnhuh nojnhuh commented Jun 25, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR implements the fix for #132334 proposed by @pohly in that issue and tested in #132295. This change fixes the logic in the ResourceClaim controller to clean up deleted ResourceClaims in upgrade and similar scenarios involving kube-controller-manager restarts.

Which issue(s) this PR is related to:

#132334

Special notes for your reviewer:

I plan to iterate on a test to reproduce the issue soon, but opened this PR now to unblock #132295.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 25, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 25, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Jun 25, 2025
@k8s-ci-robot k8s-ci-robot requested review from klueska and pohly June 25, 2025 16:12
Copy link

@dcq01 dcq01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! I'm a grad student working on a research project about using large language models to automate code review. Based on your commit ef117ed and the changes in pkg/controller/resourceclaim/controller.go, my tool generated this comment:

  1. Nil Pointer Dereference: Ensure that there are adequate checks in place to handle cases where newClaim or oldClaim could be nil before accessing their properties.
  2. Error Handling: Consider implementing a more robust error handling strategy to prevent cascading failures if ec.podIndexer.ByIndex returns an error.
  3. Error Handling Tests: Cover error handling paths, particularly when there are issues retrieving claims from the lister or checking ownership.
  4. Logical Inversion: The change from deleted := newObj != nil to deleted := newObj == nil is correct if the intention is to treat a nil newObj as an indication of deletion. Ensure this change aligns with the intended behavior of the function.
  5. Impact on Subsequent Logic: Verify the logging statement for deleted claims to ensure it correctly logs and handles the case when a claim is deleted.
  6. Testing: It is essential to have tests that cover scenarios for both adding and deleting resource claims to ensure that the new logic behaves as expected.
  7. Testing the Logging Behavior: Add tests to verify the correct logging output for both deleted and non-deleted claims.
  8. Separation of Concerns: Consider breaking the enqueueResourceClaim function into smaller, more focused functions for checking deletions, updating metrics, and enqueuing claims.
  9. Loop Efficiency: Consider optimizing the ec.enqueuePod function in the loop iterating over objs. Batching the enqueueing of pods or using a more efficient data structure could reduce overhead.
  10. Resource Claim Checks: In the podNeedsWork function, cache the results of ec.claimLister.ResourceClaims(pod.Namespace).Get(*claimName) if the same claims are checked multiple times within the same execution context.

As part of my research, I'm trying to understand how useful these comments are in real-world development. If you have a moment, I'd be super grateful if you could quickly reply to these two yes/no questions:

  1. Does this comment provide suggestions from a dimension you hadn’t considered?
    1. Do you find this comment helpful?

Thanks a lot for your time and feedback! And sorry again if this message is a bother.

@pohly
Copy link
Contributor

pohly commented Jun 26, 2025

And sorry again if this message is a bother.

@dcq01: it is a bother. Please don't do this in the Kubernetes project, or in any other open source project without first ensuring that the owners of the project welcome such comments.

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
/hold

There's still time to add tests, I'll lift the hold when I am ready to merge #132033.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nojnhuh, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: a7c1bb4213cc1efd998d3a9ff75d9d983fc32736

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2025
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jun 26, 2025
@k8s-ci-robot k8s-ci-robot removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jun 27, 2025
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 27, 2025
@k8s-ci-robot k8s-ci-robot requested a review from pohly June 27, 2025 04:40
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Jun 27, 2025

I tried updating the tests to better exercise this change. The approach I took was to modify the existing tests that check the metrics tracked by enqueueResourceClaim to also inspect the state of the workqueue and make sure the right pods and claims get enqueued for a given claim add/update/delete. I didn't add any steps that check the specific kube-controller-manager restart case described by #132334 since the existing steps already fail without the functional change in this PR.

I'm also happy to iterate on these tests in a separate PR if we want this fix sooner.

Comment on lines +530 to +543
g.Eventually(ec.queue.Len).
WithTimeout(5*time.Second).
Should(gomega.Equal(len(expectedKeys)), lenDiffMessage)
g.Consistently(ec.queue.Len).
WithTimeout(1*time.Second).
Should(gomega.Equal(len(expectedKeys)), lenDiffMessage)

for _, expected := range expectedKeys {
actual, shuttingDown := ec.queue.Get()
g.Expect(shuttingDown).To(gomega.BeFalseBecause("workqueue is unexpectedly shutting down"))
g.Expect(actual).To(gomega.Equal(expected))
ec.queue.Forget(actual)
ec.queue.Done(actual)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic doesn't feel perfectly airtight and it slows the test way down (from ~3s to ~12s, mostly from the Consistently), so I'm open to ways to make that more robust and/or eliminate the long timeouts here. This might be a place that could make use of the Mock workqueue if it were importable here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 👀 In review
Status: Needs Triage
Development

Successfully merging this pull request may close these issues.

4 participants