
DRA resource claim controller: fails to clean up deleted ResourceClaims on startup #132334

Open
@pohly

Description

What happened?

  • kube-controller-manager is stopped.
  • An allocated claim with one pod in ReservedFor is marked for deletion (but not removed yet because of the finalizer).
  • That pod gets deleted, terminates and gets removed.
  • kube-controller-manager is restarted.

The ResourceClaim controller logs:

I0616 15:55:01.290991       1 controller.go:390] "not enqueing deleted claim" logger="resourceclaim-controller" claim="dra-6273/external-claim-2"
I0616 15:55:01.291019       1 controller.go:401] "unrelated to any known pod" logger="resourceclaim-controller" claim="dra-6273/external-claim-2"

The controller does not do anything about the claim, so its deletion remains pending.

This was triggered while working on upgrade/downgrade scenarios.

/wg device-management
/sig node

What did you expect to happen?

The ResourceClaim controller should remove the pod from ReservedFor, the allocation, and the finalizer, thus unblocking the removal of the ResourceClaim.
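Purely for illustration, here is a minimal sketch of that expected cleanup, assuming the resource.k8s.io/v1beta1 API types. The helper name cleanupStaleClaim and the finalizer string are placeholders, not the controller's actual code, which performs this in its sync loop and writes the result back through the API server:

```go
// Hypothetical, simplified sketch of the expected cleanup: drop the stale
// pod from ReservedFor, and once the deleted claim is no longer reserved,
// clear the allocation and the controller's finalizer so the API server
// can finally remove the object.
package sketch

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

// claimFinalizer is a placeholder for the finalizer the real controller
// adds to allocated claims.
const claimFinalizer = "resource.kubernetes.io/delete-protection"

// cleanupStaleClaim is a hypothetical helper that mutates the claim in place.
func cleanupStaleClaim(claim *resourceapi.ResourceClaim, stalePodUID types.UID) {
	// Remove the pod that no longer exists from ReservedFor.
	reservedFor := claim.Status.ReservedFor[:0]
	for _, ref := range claim.Status.ReservedFor {
		if ref.UID != stalePodUID {
			reservedFor = append(reservedFor, ref)
		}
	}
	claim.Status.ReservedFor = reservedFor

	// If the claim is marked for deletion and nothing reserves it anymore,
	// drop the allocation and the finalizer, which unblocks the removal.
	if claim.DeletionTimestamp != nil && len(claim.Status.ReservedFor) == 0 {
		claim.Status.Allocation = nil
		finalizers := claim.Finalizers[:0]
		for _, f := range claim.Finalizers {
			if f != claimFinalizer {
				finalizers = append(finalizers, f)
			}
		}
		claim.Finalizers = finalizers
	}
}
```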

How can we reproduce it (as minimally and precisely as possible)?

Not easy to reproduce; it needs a WIP test.

Anything else we need to know?

This is a regression introduced by #127661.

The logic here is inverted:
https://github.com/kubernetes/kubernetes/blame/c2524cbf9b49f034053f758401ec3b08a4504e0e/pkg/controller/resourceclaim/controller.go#L330

The correct expression is deleted := newObj == nil.

This causes the enqueuing of the claim for processing to be skipped here:

// When starting up, we have to check all claims to find those with
// stale pods in ReservedFor. During an update, a pod might get added
// that already no longer exists.
key := claimKeyPrefix + claim.Namespace + "/" + claim.Name
logger.V(6).Info("enqueing new or updated claim", "claim", klog.KObj(claim), "key", key)
ec.queue.Add(key)

Normally this gets mitigated by the pod removal, which also has the desired effect, but in this particular case that removal is never observed: the pod is already gone when the kube-controller-manager starts.
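For illustration, a trimmed-down, hypothetical version of the event handler around that check (the names enqueueResourceClaim and claimKeyPrefix, the queue type, and the structure are simplifications, not the exact controller code):

```go
package sketch

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog/v2"
)

const claimKeyPrefix = "claim:" // placeholder, not necessarily the real prefix

type controller struct {
	queue workqueue.TypedRateLimitingInterface[string]
}

// enqueueResourceClaim receives the old and new object from the informer's
// add/update/delete callbacks; newObj is nil only for deletions.
func (ec *controller) enqueueResourceClaim(logger klog.Logger, oldObj, newObj interface{}) {
	// Correct check: the claim was deleted only if there is no new object.
	// The regression inverted this (`newObj != nil`), so claims observed
	// after a restart were treated as deleted and never enqueued.
	deleted := newObj == nil

	obj := newObj
	if deleted {
		obj = oldObj
	}
	claim, ok := obj.(*resourceapi.ResourceClaim)
	if !ok {
		return
	}

	if !deleted {
		// When starting up, we have to check all claims to find those with
		// stale pods in ReservedFor. During an update, a pod might get added
		// that already no longer exists.
		key := claimKeyPrefix + claim.Namespace + "/" + claim.Name
		logger.V(6).Info("enqueuing new or updated claim", "claim", klog.KObj(claim), "key", key)
		ec.queue.Add(key)
	} else {
		logger.V(6).Info("not enqueuing deleted claim", "claim", klog.KObj(claim))
	}
	// The real handler additionally checks pods referencing the claim; omitted here.
}
```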

Kubernetes version

Kubernetes >= 1.32.


Metadata

Labels

kind/bug, priority/important-soon, sig/node, triage/accepted, wg/device-management
