DRA: fix deleting orphaned ResourceClaim on startup #132533

nojnhuh · 2025-06-25T16:12:23Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR implements the fix for #132334 proposed by @pohly in that issue and tested in #132295. This change fixes the logic in the ResourceClaim controller to clean up deleted ResourceClaims in upgrade and similar scenarios involving kube-controller-manager restarts.

Which issue(s) this PR is related to:

#132334

Special notes for your reviewer:

I plan to iterate on a test to reproduce the issue soon, but opened this PR now to unblock #132295.

Does this PR introduce a user-facing change?

NONE

k8s-ci-robot · 2025-06-25T16:12:33Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

dcq01

Hi! I'm a grad student working on a research project about using large language models to automate code review. Based on your commit ef117ed and the changes in pkg/controller/resourceclaim/controller.go, my tool generated this comment:

Nil Pointer Dereference: Ensure that there are adequate checks in place to handle cases where newClaim or oldClaim could be nil before accessing their properties.
Error Handling: Consider implementing a more robust error handling strategy to prevent cascading failures if ec.podIndexer.ByIndex returns an error.
Error Handling Tests: Cover error handling paths, particularly when there are issues retrieving claims from the lister or checking ownership.
Logical Inversion: The change from deleted := newObj != nil to deleted := newObj == nil is correct if the intention is to treat a nil newObj as an indication of deletion. Ensure this change aligns with the intended behavior of the function.
Impact on Subsequent Logic: Verify the logging statement for deleted claims to ensure it correctly logs and handles the case when a claim is deleted.
Testing: It is essential to have tests that cover scenarios for both adding and deleting resource claims to ensure that the new logic behaves as expected.
Testing the Logging Behavior: Add tests to verify the correct logging output for both deleted and non-deleted claims.
Separation of Concerns: Consider breaking the enqueueResourceClaim function into smaller, more focused functions for checking deletions, updating metrics, and enqueuing claims.
Loop Efficiency: Consider optimizing the ec.enqueuePod function in the loop iterating over objs. Batching the enqueueing of pods or using a more efficient data structure could reduce overhead.
Resource Claim Checks: In the podNeedsWork function, cache the results of ec.claimLister.ResourceClaims(pod.Namespace).Get(*claimName) if the same claims are checked multiple times within the same execution context.

As part of my research, I'm trying to understand how useful these comments are in real-world development. If you have a moment, I'd be super grateful if you could quickly reply to these two yes/no questions:

Does this comment provide suggestions from a dimension you hadn’t considered?
1. Do you find this comment helpful?

Thanks a lot for your time and feedback! And sorry again if this message is a bother.

pohly · 2025-06-26T06:31:17Z

And sorry again if this message is a bother.

@dcq01: it is a bother. Please don't do this in the Kubernetes project, or in any other open source project without first ensuring that the owners of the project welcome such comments.

pohly

/lgtm
/approve
/hold

There's still time to add tests, I'll lift the hold when I am ready to merge #132033.

k8s-ci-robot · 2025-06-26T06:33:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nojnhuh, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/controller/resourceclaim/OWNERS~~ [pohly]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-06-26T06:33:43Z

LGTM label has been added.

Git tree hash: a7c1bb4213cc1efd998d3a9ff75d9d983fc32736

k8s-ci-robot · 2025-06-27T04:40:45Z

New changes are detected. LGTM label has been removed.

nojnhuh · 2025-06-27T04:40:46Z

I tried updating the tests to better exercise this change. The approach I took was to modify the existing tests that check the metrics tracked by enqueueResourceClaim to also inspect the state of the workqueue and make sure the right pods and claims get enqueued for a given claim add/update/delete. I didn't add any steps that check the specific kube-controller-manager restart case described by #132334 since the existing steps already fail without the functional change in this PR.

I'm also happy to iterate on these tests in a separate PR if we want this fix sooner.

nojnhuh · 2025-06-27T04:41:52Z

pkg/controller/resourceclaim/controller_test.go

+		g.Eventually(ec.queue.Len).
+			WithTimeout(5*time.Second).
+			Should(gomega.Equal(len(expectedKeys)), lenDiffMessage)
+		g.Consistently(ec.queue.Len).
+			WithTimeout(1*time.Second).
+			Should(gomega.Equal(len(expectedKeys)), lenDiffMessage)
+
+		for _, expected := range expectedKeys {
+			actual, shuttingDown := ec.queue.Get()
+			g.Expect(shuttingDown).To(gomega.BeFalseBecause("workqueue is unexpectedly shutting down"))
+			g.Expect(actual).To(gomega.Equal(expected))
+			ec.queue.Forget(actual)
+			ec.queue.Done(actual)
+		}


This logic doesn't feel perfectly airtight and it slows the test way down (from ~3s to ~12s, mostly from the Consistently), so I'm open to ways to make that more robust and/or eliminate the long timeouts here. This might be a place that could make use of the Mock workqueue if it were importable here.

DRA: fix deleting orphaned ResourceClaim on startup

ef117ed

github-project-automation bot added this to Dynamic Resource Allocation, SIG Apps and SIG Node: code and documentation PRs Jun 25, 2025

github-project-automation bot moved this to Triage in SIG Node: code and documentation PRs Jun 25, 2025

github-project-automation bot moved this to 🆕 New in Dynamic Resource Allocation Jun 25, 2025

github-project-automation bot moved this to Needs Triage in SIG Apps Jun 25, 2025

k8s-ci-robot requested review from klueska and pohly June 25, 2025 16:12

This was referenced Jun 25, 2025

DRA resource claim controller: fails to clean up deleted ResourceClaims on startup #132334

Open

WIP: DRA: automated upgrade/downgrade testing #132295

Draft

dcq01 reviewed Jun 25, 2025

View reviewed changes

pohly reviewed Jun 26, 2025

View reviewed changes

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2025

k8s-ci-robot assigned pohly Jun 26, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2025

pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jun 26, 2025

fixup! DRA: fix deleting orphaned ResourceClaim on startup

f184521

k8s-ci-robot removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jun 27, 2025

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 27, 2025

k8s-ci-robot requested a review from pohly June 27, 2025 04:40

nojnhuh commented Jun 27, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DRA: fix deleting orphaned ResourceClaim on startup #132533

DRA: fix deleting orphaned ResourceClaim on startup #132533

Uh oh!

nojnhuh commented Jun 25, 2025

Uh oh!

k8s-ci-robot commented Jun 25, 2025

Uh oh!

dcq01 left a comment

Uh oh!

pohly commented Jun 26, 2025

Uh oh!

pohly left a comment

Uh oh!

k8s-ci-robot commented Jun 26, 2025

Uh oh!

k8s-ci-robot commented Jun 26, 2025

Uh oh!

k8s-ci-robot commented Jun 27, 2025

Uh oh!

nojnhuh commented Jun 27, 2025

Uh oh!

nojnhuh Jun 27, 2025

Uh oh!

Uh oh!

DRA: fix deleting orphaned ResourceClaim on startup #132533

Are you sure you want to change the base?

DRA: fix deleting orphaned ResourceClaim on startup #132533

Uh oh!

Conversation

nojnhuh commented Jun 25, 2025

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Uh oh!

k8s-ci-robot commented Jun 25, 2025

Uh oh!

dcq01 left a comment

Choose a reason for hiding this comment

Uh oh!

pohly commented Jun 26, 2025

Uh oh!

pohly left a comment

Choose a reason for hiding this comment

Uh oh!

k8s-ci-robot commented Jun 26, 2025

Uh oh!

k8s-ci-robot commented Jun 26, 2025

Uh oh!

k8s-ci-robot commented Jun 27, 2025

Uh oh!

nojnhuh commented Jun 27, 2025

Uh oh!

nojnhuh Jun 27, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!