DRA kubelet: add dra_resource_claims_in_use gauge vector #131641
base: master
Conversation
// cdiDevicesAsList returns a list of CDIDevices from the provided claim info.
// When the request name is non-empty, only devices relevant for that request
// are returned.
func (info *ClaimInfo) cdiDevicesAsList(requestName string) []kubecontainer.CDIDevice {
I moved this method unchanged because it was odd that most of the ClaimInfo
methods were above, except for this one which came after the claimInfoCache
methods.
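Since only the method's signature and doc comment appear in the diff above, here is a minimal sketch of the documented filtering behavior. The types are simplified stand-ins (the real `ClaimInfo` and `kubecontainer.CDIDevice` live in the kubelet's DRA and container packages), and the body is illustrative, not the actual kubelet implementation:

```go
package main

import "fmt"

// CDIDevice is a simplified stand-in for kubecontainer.CDIDevice.
type CDIDevice struct{ Name string }

// ClaimInfo is a simplified stand-in; here CDIDevices maps a
// ResourceClaim request name to its CDI device IDs.
type ClaimInfo struct {
	CDIDevices map[string][]string
}

// cdiDevicesAsList mirrors the documented behavior: with an empty request
// name it returns devices for all requests, otherwise only those for the
// named request.
func (info *ClaimInfo) cdiDevicesAsList(requestName string) []CDIDevice {
	var devices []CDIDevice
	for name, ids := range info.CDIDevices {
		if requestName == "" || requestName == name {
			for _, id := range ids {
				devices = append(devices, CDIDevice{Name: id})
			}
		}
	}
	return devices
}

func main() {
	info := &ClaimInfo{CDIDevices: map[string][]string{
		"req-1": {"vendor.com/gpu=dev0"},
		"req-2": {"vendor.com/gpu=dev1"},
	}}
	fmt.Println(len(info.cdiDevicesAsList("")))      // all requests
	fmt.Println(len(info.cdiDevicesAsList("req-1"))) // only req-1
}
```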
@@ -382,6 +384,43 @@ var _ = framework.SIGDescribe("node")("DRA", feature.DynamicResourceAllocation,
	gomega.Eventually(kubeletPlugin2.GetGRPCCalls).WithTimeout(retryTestTimeout).Should(testdriver.NodeUnprepareResourcesSucceeded)
})

ginkgo.It("must provide metrics", func(ctx context.Context) { |
This is an e2e_node test because it is easier to get the kubelet metrics there.
If that could also be done in an E2E test, then putting the test there would be more appropriate: the test doesn't really depend on the kubelet configuration, and E2E tests are easier to run.
Non-blocking comments:
e2e-kind failed because of #131748.
test/e2e_node/dra_test.go
Outdated
gomega.Expect(kubeletPlugin1.GetGRPCCalls()).Should(testdriver.NodePrepareResourcesSucceeded, "Plugin 1 should have prepared resources.")
gomega.Expect(kubeletPlugin2.GetGRPCCalls()).Should(testdriver.NodePrepareResourcesSucceeded, "Plugin 2 should have prepared resources.")
driverName := func(element any) string {
	el := element.(*model.Sample)
This triggers
ERROR: Some files are importing packages under github.com/prometheus/* but are not allow-listed to do so.
See: https://github.com/kubernetes/kubernetes/issues/89267
Failing files:
./test/e2e_node/dra_test.go
This is how some other code checks metrics.
@dgrisonnet @richabanker: is this one of those cases where it's okay to extend the allow list? Or is there a different way of checking for the expected outcome?
It seems we have allowed extending the allow list in the past (ref), so I guess we can just do that now? Regarding the usage, I see similar usage of the model package elsewhere in the codebase to verify metrics data, so it should be fine? cc @serathius, the author of the linked issue, in case he has ideas on how to avoid importing the package here.
Alternatively, maybe you could try creating a map representation (map[string]float64) of the vector, where the key is the driver_name and the value is the metric value, and then use something like this to verify the values:
claimsInUse := convertVectorToMap(metrics, "dra_resource_claims_in_use")
gomega.Expect(claimsInUse).Should(gstruct.MatchKeys(gstruct.IgnoreExtras, gstruct.Keys{
"": gomega.BeEquivalentTo(1),
kubeletPlugin1Name: gomega.BeEquivalentTo(1),
kubeletPlugin2Name: gomega.BeEquivalentTo(1),
}), "metrics while pod is running")
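For context, `convertVectorToMap` is a helper the reviewer is proposing, not an existing function. A minimal sketch of what it might look like, with the Prometheus `model.Sample` and `testutil.Metrics` types replaced by simplified local stand-ins so the example is self-contained:

```go
package main

import "fmt"

// Sample is a simplified stand-in for github.com/prometheus/common/model.Sample.
type Sample struct {
	Metric map[string]string // label name -> label value
	Value  float64
}

// Metrics mimics testutil.Metrics: metric name -> samples.
type Metrics map[string][]*Sample

// convertVectorToMap (hypothetical name, from the review suggestion) flattens
// one metric's samples into a driver_name -> value map.
func convertVectorToMap(m Metrics, metricName string) map[string]float64 {
	out := map[string]float64{}
	for _, s := range m[metricName] {
		out[s.Metric["driver_name"]] = s.Value
	}
	return out
}

func main() {
	m := Metrics{"dra_resource_claims_in_use": {
		{Metric: map[string]string{"driver_name": ""}, Value: 1},
		{Metric: map[string]string{"driver_name": "plugin1"}, Value: 1},
	}}
	claimsInUse := convertVectorToMap(m, "dra_resource_claims_in_use")
	fmt.Println(claimsInUse["plugin1"])
}
```

The resulting plain map can then be matched with `gstruct.MatchKeys` as in the suggestion above, without the test file naming any Prometheus type.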
That feels like a workaround. I prefer adding a type alias in k8s.io/component-base/metrics/testutil: that package is geared towards use in tests, and already defines a type which directly exposes model.Sample, so letting consumers of that package also use that type directly seems fair. The code already depends on it anyway.
In other words, this was possible before:
var metrics testutils.Metrics
metrics = ...
samples := metrics["dra_resource_claims_in_use"]
sample := samples[0]
It should also be possible to write:
var sample *testutil.Sample
sample = samples[0]
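A Go type alias makes the two names refer to the identical type, which is why the second form works without importing the Prometheus package directly. A self-contained sketch (using a simplified local `modelSample` in place of the real `model.Sample`, which in the proposal is aliased inside k8s.io/component-base/metrics/testutil):

```go
package main

import "fmt"

// modelSample is a simplified stand-in for
// github.com/prometheus/common/model.Sample.
type modelSample struct {
	Value float64
}

// Sample is a type alias (note the "="): Sample and modelSample are the
// same type, so values and pointers convert freely in both directions.
type Sample = modelSample

func main() {
	samples := []*modelSample{{Value: 1}}
	var sample *Sample // identical to *modelSample thanks to the alias
	sample = samples[0]
	fmt.Println(sample.Value)
}
```

Because it is an alias rather than a new named type, existing code that returns the Prometheus type keeps working unchanged.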
I suppose some of the importers under hack/verify-prometheus-imports.sh (lines 72 to 79 in 3196c99):
./test/e2e/apimachinery/flowcontrol.go
./test/e2e_node/mirror_pod_grace_period_test.go
./test/e2e/node/pods.go
./test/e2e_node/resource_metrics_test.go
./test/instrumentation/main_test.go
./test/integration/apiserver/flowcontrol/concurrency_test.go
./test/integration/apiserver/flowcontrol/concurrency_util_test.go
./test/integration/metrics/metrics_test.go
For now I have added the type aliases to this PR and use them in dra_test.go.
13012f0 to 935ff9f (force-push)
pull-kubernetes-node-e2e-crio-cgrpv1-dra failure seems to be related to this PR.
/triage accepted
@pohly Feel free to unhold after the CI failures are fixed.
LGTM label has been added. Git tree hash: 2dec57a3e52b9747f6b07cca8930e236fc904ff3
cb99711 to 985f626 (force-push)
Test code using the testutil.Metrics type already depended on the Prometheus types, but couldn't reference them by name. This is necessary for example when using Gomega (to cast from `any` in a matcher) or when defining a `var sample *Sample` which gets set later.
The new metric informs admins whether DRA in general (special "driver_name: <any>" label) and/or specific DRA drivers (other label values) are in use on nodes. This is useful to know because removing a driver is only safe if it is not in use. If a driver gets removed while it has prepared a ResourceClaim, unpreparing that ResourceClaim and stopping pods is blocked.

The implementation of the metric uses read locking of the claim info cache. It retrieves "claims in use" and turns those into the metric.

The same code is also used to log changes in the claim info cache with a diff. This hooks into a write update of the claim info cache and uses contextual logging.

The unit tests check that metrics get calculated. The e2e_node test checks that the kubelet really exports the metrics data.

While at it, some bugs in claiminfo_test.go get fixed: the way the cache got populated in the test no longer matched the code.
985f626 to 6d6a749 (force-push)
@pohly: The following test failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@bart0sh: still LGTM? I fixed some test failures: https://github.com/kubernetes/kubernetes/compare/cb9971121208859af11e2657e7cf7db102c113ed..985f626be871d4023e7f402731ee46bd244e088c
/assign @klueska for approval.
/lgtm
LGTM label has been added. Git tree hash: 59fb3d89a93e3404fa347cd8d5315c2c0cc07aaf
What type of PR is this?
/kind feature
What this PR does / why we need it:
The new metric informs admins whether DRA in general (special "driver_name: <any>" label) and/or specific DRA drivers (other label values) are in use on nodes. This is useful to know because removing a driver is only safe if it is not in use. If a driver gets removed while it has prepared a ResourceClaim, unpreparing that ResourceClaim and stopping pods is blocked.
The implementation of the metric uses read locking of the claim info cache. It retrieves "claims in use" and turns those into the metric.
The same code is also used to log changes in the claim info cache with a diff. This hooks into a write update of the claim info cache and uses contextual logging.
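The approach described above can be sketched as follows. This is a minimal illustration, not the kubelet's actual API: the type and method names are hypothetical, and the real code feeds a component-base gauge vector with a `driver_name` label rather than returning a plain map.

```go
package main

import (
	"fmt"
	"sync"
)

// claimInfoCache is a simplified stand-in for the kubelet's claim info
// cache: prepared claims, keyed by claim UID, each using some DRA drivers.
type claimInfoCache struct {
	sync.RWMutex
	claims map[string][]string // claim UID -> driver names it uses
}

// claimsInUse takes only the read lock and counts prepared claims per
// driver. The special "" key carries the total across all drivers,
// standing in for the metric's "<any>" driver_name label value.
func (c *claimInfoCache) claimsInUse() map[string]int {
	c.RLock()
	defer c.RUnlock()
	counts := map[string]int{"": len(c.claims)}
	for _, drivers := range c.claims {
		for _, d := range drivers {
			counts[d]++
		}
	}
	return counts
}

func main() {
	cache := &claimInfoCache{claims: map[string][]string{
		"uid-1": {"driver-a"},
		"uid-2": {"driver-a", "driver-b"},
	}}
	// Each entry of this map would become one sample of the
	// dra_resource_claims_in_use gauge vector.
	fmt.Println(cache.claimsInUse()["driver-a"])
}
```

The same per-driver snapshot lends itself to diffing two cache states for the contextual log message mentioned above.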
Which issue(s) this PR fixes:
Slack discussion: https://kubernetes.slack.com/archives/C0409NGC1TK/p1746168044475379?thread_ts=1746001550.655339&cid=C0409NGC1TK
Related-to: kubernetes/enhancements#4381 (GA?)
Special notes for your reviewer:
The unit tests check that metrics get calculated. The e2e_node test checks that kubelet really exports the metrics data.
While at it, some bugs in claiminfo_test.go get fixed: the way the cache got populated in the test no longer matched the code.
Let's review this proposal, then document it as part of the 1.34 KEP update before merging the implementation.
/hold
/assign @bart0sh
Does this PR introduce a user-facing change?