DRA kubelet: add dra_resource_claims_in_use gauge vector #131641
base: master
Conversation
// cdiDevicesAsList returns a list of CDIDevices from the provided claim info.
// When the request name is non-empty, only devices relevant for that request
// are returned.
func (info *ClaimInfo) cdiDevicesAsList(requestName string) []kubecontainer.CDIDevice {
I moved this method unchanged because it was odd that most of the ClaimInfo
methods were above, except for this one which came after the claimInfoCache
methods.
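Since only the method's signature and doc comment appear in the diff above, here is a minimal sketch of the documented filtering behavior. The types are simplified stand-ins (the real `ClaimInfo` and `kubecontainer.CDIDevice` live in the kubelet's DRA and container packages), and the body is illustrative, not the actual kubelet implementation:

```go
package main

import "fmt"

// CDIDevice is a simplified stand-in for kubecontainer.CDIDevice.
type CDIDevice struct{ Name string }

// ClaimInfo is a simplified stand-in; here CDIDevices maps a
// ResourceClaim request name to its CDI device IDs.
type ClaimInfo struct {
	CDIDevices map[string][]string
}

// cdiDevicesAsList mirrors the documented behavior: with an empty request
// name it returns devices for all requests, otherwise only those for the
// named request.
func (info *ClaimInfo) cdiDevicesAsList(requestName string) []CDIDevice {
	var devices []CDIDevice
	for name, ids := range info.CDIDevices {
		if requestName == "" || requestName == name {
			for _, id := range ids {
				devices = append(devices, CDIDevice{Name: id})
			}
		}
	}
	return devices
}

func main() {
	info := &ClaimInfo{CDIDevices: map[string][]string{
		"req-1": {"vendor.com/gpu=dev0"},
		"req-2": {"vendor.com/gpu=dev1"},
	}}
	fmt.Println(len(info.cdiDevicesAsList("")))      // all requests
	fmt.Println(len(info.cdiDevicesAsList("req-1"))) // only req-1
}
```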
@@ -382,6 +384,43 @@ var _ = framework.SIGDescribe("node")("DRA", feature.DynamicResourceAllocation,
	gomega.Eventually(kubeletPlugin2.GetGRPCCalls).WithTimeout(retryTestTimeout).Should(testdriver.NodeUnprepareResourcesSucceeded)
})

ginkgo.It("must provide metrics", func(ctx context.Context) { |
This is an e2e_node test because it is easier to get the kubelet metrics there.
If that could also be done in an E2E test, then putting the test there would be more appropriate: the test doesn't really depend on the kubelet configuration, and E2E tests are easier to run.
Non-blocking comments:
e2e-kind failed because of #131748.
test/e2e_node/dra_test.go
Outdated
gomega.Expect(kubeletPlugin1.GetGRPCCalls()).Should(testdriver.NodePrepareResourcesSucceeded, "Plugin 1 should have prepared resources.")
gomega.Expect(kubeletPlugin2.GetGRPCCalls()).Should(testdriver.NodePrepareResourcesSucceeded, "Plugin 2 should have prepared resources.")
driverName := func(element any) string {
	el := element.(*model.Sample)
This triggers
ERROR: Some files are importing packages under github.com/prometheus/* but are not allow-listed to do so.
See: https://github.com/kubernetes/kubernetes/issues/89267
Failing files:
./test/e2e_node/dra_test.go
This is how some other code checks metrics.
@dgrisonnet @richabanker: is this one of those cases where it's okay to extend the allow list? Or is there a different way of checking for the expected outcome?
It seems we have allowed extending the allow list in the past (ref), so I guess we can just do that now? Regarding the usage, I see similar usage of the model package elsewhere in the codebase to verify metrics data, so it should be fine? cc @serathius, the author of the linked issue, in case he has ideas on how to avoid importing the package here.
Alternatively, maybe you could try creating a map representation (map[string]float64) of the vector, where the key is the driver_name and the value is the metric value, and then use something like this to verify the values:
claimsInUse := convertVectorToMap(metrics, "dra_resource_claims_in_use")
gomega.Expect(claimsInUse).Should(gstruct.MatchKeys(gstruct.IgnoreExtras, gstruct.Keys{
"": gomega.BeEquivalentTo(1),
kubeletPlugin1Name: gomega.BeEquivalentTo(1),
kubeletPlugin2Name: gomega.BeEquivalentTo(1),
}), "metrics while pod is running")
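For context, `convertVectorToMap` is a helper the reviewer is proposing, not an existing function. A minimal sketch of what it might look like, with the Prometheus `model.Sample` and `testutil.Metrics` types replaced by simplified local stand-ins so the example is self-contained:

```go
package main

import "fmt"

// Sample is a simplified stand-in for github.com/prometheus/common/model.Sample.
type Sample struct {
	Metric map[string]string // label name -> label value
	Value  float64
}

// Metrics mimics testutil.Metrics: metric name -> samples.
type Metrics map[string][]*Sample

// convertVectorToMap (hypothetical name, from the review suggestion) flattens
// one metric's samples into a driver_name -> value map.
func convertVectorToMap(m Metrics, metricName string) map[string]float64 {
	out := map[string]float64{}
	for _, s := range m[metricName] {
		out[s.Metric["driver_name"]] = s.Value
	}
	return out
}

func main() {
	m := Metrics{"dra_resource_claims_in_use": {
		{Metric: map[string]string{"driver_name": ""}, Value: 1},
		{Metric: map[string]string{"driver_name": "plugin1"}, Value: 1},
	}}
	claimsInUse := convertVectorToMap(m, "dra_resource_claims_in_use")
	fmt.Println(claimsInUse["plugin1"])
}
```

The resulting plain map can then be matched with `gstruct.MatchKeys` as in the suggestion above, without the test file naming any Prometheus type.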
That feels like a workaround. I prefer adding a type alias in k8s.io/component-base/metrics/testutil: that package is geared towards use in tests, and already defines a type which directly exposes model.Sample, so letting consumers of that package also use that type directly seems fair. The code already depends on it anyway.
In other words, this was possible before:
var metrics testutils.Metrics
metrics = ...
samples := metrics["dra_resource_claims_in_use"]
sample := samples[0]
It should also be possible to write:
var sample *testutil.Sample
sample = samples[0]
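A Go type alias makes the two names refer to the identical type, which is why the second form works without importing the Prometheus package directly. A self-contained sketch (using a simplified local `modelSample` in place of the real `model.Sample`, which in the proposal is aliased inside k8s.io/component-base/metrics/testutil):

```go
package main

import "fmt"

// modelSample is a simplified stand-in for
// github.com/prometheus/common/model.Sample.
type modelSample struct {
	Value float64
}

// Sample is a type alias (note the "="): Sample and modelSample are the
// same type, so values and pointers convert freely in both directions.
type Sample = modelSample

func main() {
	samples := []*modelSample{{Value: 1}}
	var sample *Sample // identical to *modelSample thanks to the alias
	sample = samples[0]
	fmt.Println(sample.Value)
}
```

Because it is an alias rather than a new named type, existing code that returns the Prometheus type keeps working unchanged.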
I suppose some of the importers under hack/verify-prometheus-imports.sh (lines 72 to 79 in 3196c99):
./test/e2e/apimachinery/flowcontrol.go
./test/e2e_node/mirror_pod_grace_period_test.go
./test/e2e/node/pods.go
./test/e2e_node/resource_metrics_test.go
./test/instrumentation/main_test.go
./test/integration/apiserver/flowcontrol/concurrency_test.go
./test/integration/apiserver/flowcontrol/concurrency_util_test.go
./test/integration/metrics/metrics_test.go
For now I have added the type aliases to this PR and use them in dra_test.go.
13012f0 to 935ff9f (force-push)
pull-kubernetes-node-e2e-crio-cgrpv1-dra failure seems to be related to this PR.
/triage accepted
@pohly Feel free to unhold after the CI failures are fixed.
LGTM label has been added. Git tree hash: 2dec57a3e52b9747f6b07cca8930e236fc904ff3
cb99711 to 985f626 (force-push)
Test code using the testutil.Metrics type already depended on the Prometheus types, but couldn't reference them by name. This is necessary for example when using Gomega (to cast from `any` in a matcher) or when defining a `var sample *Sample` which gets set later.
The new metric informs admins whether DRA in general (special "driver_name: <any>" label) and/or specific DRA drivers (other label values) are in use on nodes. This is useful to know because removing a driver is only safe if it is not in use. If a driver gets removed while it has prepared a ResourceClaim, unpreparing that ResourceClaim and stopping pods is blocked.

The implementation of the metric uses read locking of the claim info cache. It retrieves "claims in use" and turns those into the metric.

The same code is also used to log changes in the claim info cache with a diff. This hooks into a write update of the claim info cache and uses contextual logging.

The unit tests check that metrics get calculated. The e2e_node test checks that the kubelet really exports the metrics data.

While at it, some bugs in claiminfo_test.go get fixed: the way the cache got populated in the test no longer matched the code.
985f626 to 6d6a749 (force-push)
@pohly: The following test failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@bart0sh: still LGTM? I fixed some test failures: https://github.com/kubernetes/kubernetes/compare/cb9971121208859af11e2657e7cf7db102c113ed..985f626be871d4023e7f402731ee46bd244e088c
/assign @klueska for approval.
/lgtm
LGTM label has been added. Git tree hash: 59fb3d89a93e3404fa347cd8d5315c2c0cc07aaf
What type of PR is this?
/kind feature
What this PR does / why we need it:
The new metric informs admins whether DRA in general (special "driver_name: <any>" label) and/or specific DRA drivers (other label values) are in use on nodes. This is useful to know because removing a driver is only safe if it is not in use. If a driver gets removed while it has prepared a ResourceClaim, unpreparing that ResourceClaim and stopping pods is blocked.
The implementation of the metric uses read locking of the claim info cache. It retrieves "claims in use" and turns those into the metric.
The same code is also used to log changes in the claim info cache with a diff. This hooks into a write update of the claim info cache and uses contextual logging.
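The approach described above can be sketched as follows. This is a minimal illustration, not the kubelet's actual API: the type and method names are hypothetical, and the real code feeds a component-base gauge vector with a `driver_name` label rather than returning a plain map.

```go
package main

import (
	"fmt"
	"sync"
)

// claimInfoCache is a simplified stand-in for the kubelet's claim info
// cache: prepared claims, keyed by claim UID, each using some DRA drivers.
type claimInfoCache struct {
	sync.RWMutex
	claims map[string][]string // claim UID -> driver names it uses
}

// claimsInUse takes only the read lock and counts prepared claims per
// driver. The special "" key carries the total across all drivers,
// standing in for the metric's "<any>" driver_name label value.
func (c *claimInfoCache) claimsInUse() map[string]int {
	c.RLock()
	defer c.RUnlock()
	counts := map[string]int{"": len(c.claims)}
	for _, drivers := range c.claims {
		for _, d := range drivers {
			counts[d]++
		}
	}
	return counts
}

func main() {
	cache := &claimInfoCache{claims: map[string][]string{
		"uid-1": {"driver-a"},
		"uid-2": {"driver-a", "driver-b"},
	}}
	// Each entry of this map would become one sample of the
	// dra_resource_claims_in_use gauge vector.
	fmt.Println(cache.claimsInUse()["driver-a"])
}
```

The same per-driver snapshot lends itself to diffing two cache states for the contextual log message mentioned above.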
Which issue(s) this PR fixes:
Slack discussion: https://kubernetes.slack.com/archives/C0409NGC1TK/p1746168044475379?thread_ts=1746001550.655339&cid=C0409NGC1TK
Related-to: kubernetes/enhancements#4381 (GA?)
Special notes for your reviewer:
The unit tests check that metrics get calculated. The e2e_node test checks that kubelet really exports the metrics data.
While at it, some bugs in claiminfo_test.go get fixed: the way the cache got populated in the test no longer matched the code.
Let's review this proposal, then document it as part of the 1.34 KEP update before merging the implementation.
/hold
/assign @bart0sh
Does this PR introduce a user-facing change?