
DRA resource claim controller: fails to clean up deleted ResourceClaims on startup #132334

Open
@pohly

Description

What happened?

  • kube-controller-manager is stopped.
  • An allocated claim with one pod in ReservedFor is marked for deletion (but not removed yet because of the finalizer).
  • That pod gets deleted, terminates and gets removed.
  • kube-controller-manager is restarted.

The ResourceClaim controller logs:

I0616 15:55:01.290991       1 controller.go:390] "not enqueing deleted claim" logger="resourceclaim-controller" claim="dra-6273/external-claim-2"
I0616 15:55:01.291019       1 controller.go:401] "unrelated to any known pod" logger="resourceclaim-controller" claim="dra-6273/external-claim-2"

The controller does not do anything about the claim, so its deletion remains pending.

This was triggered while working on upgrade/downgrade scenarios.

/wg device-management
/sig node

What did you expect to happen?

The ResourceClaim controller should remove the pod from ReservedFor, the allocation, and the finalizer, thus unblocking the removal of the ResourceClaim.
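Purely for illustration, here is a minimal sketch of that expected cleanup, assuming the resource.k8s.io/v1beta1 API types. The helper name cleanupStaleClaim and the finalizer string are placeholders, not the controller's actual code, which performs this in its sync loop and writes the result back through the API server:

```go
// Hypothetical, simplified sketch of the expected cleanup: drop the stale
// pod from ReservedFor, and once the deleted claim is no longer reserved,
// clear the allocation and the controller's finalizer so the API server
// can finally remove the object.
package sketch

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

// claimFinalizer is a placeholder for the finalizer the real controller
// adds to allocated claims.
const claimFinalizer = "resource.kubernetes.io/delete-protection"

// cleanupStaleClaim is a hypothetical helper that mutates the claim in place.
func cleanupStaleClaim(claim *resourceapi.ResourceClaim, stalePodUID types.UID) {
	// Remove the pod that no longer exists from ReservedFor.
	reservedFor := claim.Status.ReservedFor[:0]
	for _, ref := range claim.Status.ReservedFor {
		if ref.UID != stalePodUID {
			reservedFor = append(reservedFor, ref)
		}
	}
	claim.Status.ReservedFor = reservedFor

	// If the claim is marked for deletion and nothing reserves it anymore,
	// drop the allocation and the finalizer, which unblocks the removal.
	if claim.DeletionTimestamp != nil && len(claim.Status.ReservedFor) == 0 {
		claim.Status.Allocation = nil
		finalizers := claim.Finalizers[:0]
		for _, f := range claim.Finalizers {
			if f != claimFinalizer {
				finalizers = append(finalizers, f)
			}
		}
		claim.Finalizers = finalizers
	}
}
```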

How can we reproduce it (as minimally and precisely as possible)?

Not easy to reproduce; it needs a WIP test.

Anything else we need to know?

This is a regression introduced by #127661.

The logic here is inverted:
https://github.com/kubernetes/kubernetes/blame/c2524cbf9b49f034053f758401ec3b08a4504e0e/pkg/controller/resourceclaim/controller.go#L330

The correct expression is deleted := newObj == nil.

This causes the enqueuing of the claim for processing to be skipped here:

// When starting up, we have to check all claims to find those with
// stale pods in ReservedFor. During an update, a pod might get added
// that already no longer exists.
key := claimKeyPrefix + claim.Namespace + "/" + claim.Name
logger.V(6).Info("enqueing new or updated claim", "claim", klog.KObj(claim), "key", key)
ec.queue.Add(key)

Normally this gets mitigated by the pod removal, which also has the desired effect, but in this particular case that removal is never observed: the pod is already gone when the kube-controller-manager starts.
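For illustration, a trimmed-down, hypothetical version of the event handler around that check (the names enqueueResourceClaim and claimKeyPrefix, the queue type, and the structure are simplifications, not the exact controller code):

```go
package sketch

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog/v2"
)

const claimKeyPrefix = "claim:" // placeholder, not necessarily the real prefix

type controller struct {
	queue workqueue.TypedRateLimitingInterface[string]
}

// enqueueResourceClaim receives the old and new object from the informer's
// add/update/delete callbacks; newObj is nil only for deletions.
func (ec *controller) enqueueResourceClaim(logger klog.Logger, oldObj, newObj interface{}) {
	// Correct check: the claim was deleted only if there is no new object.
	// The regression inverted this (`newObj != nil`), so claims observed
	// after a restart were treated as deleted and never enqueued.
	deleted := newObj == nil

	obj := newObj
	if deleted {
		obj = oldObj
	}
	claim, ok := obj.(*resourceapi.ResourceClaim)
	if !ok {
		return
	}

	if !deleted {
		// When starting up, we have to check all claims to find those with
		// stale pods in ReservedFor. During an update, a pod might get added
		// that already no longer exists.
		key := claimKeyPrefix + claim.Namespace + "/" + claim.Name
		logger.V(6).Info("enqueuing new or updated claim", "claim", klog.KObj(claim), "key", key)
		ec.queue.Add(key)
	} else {
		logger.V(6).Info("not enqueuing deleted claim", "claim", klog.KObj(claim))
	}
	// The real handler additionally checks pods referencing the claim; omitted here.
}
```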

Kubernetes version

Kubernetes >= 1.32.


Metadata

Labels

kind/bug, priority/important-soon, sig/node, triage/accepted, wg/device-management
