WIP: DRA GA tracking branch #132554

Draft, wants to merge 18 commits into base: master

@pohly pohly commented Jun 26, 2025

This is a combination of all PRs which are required for core DRA to become GA. It is not meant to be merged itself.

With benchstat it's easy to do before/after comparisons, but the section on
running benchmarks didn't mention it at all, and the commands shown there didn't work:
- benchmark results must be printed (FULL_LOG)
- timeout might have been too short (KUBE_TIMEOUT)
- only "short" benchmarks ran (SHORT)
- klog log output must be redirected (ARTIFACTS)
@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Jun 26, 2025
@k8s-ci-robot

Adding label do-not-merge/contains-merge-commits because PR contains merge commits, which are not allowed in this repository.
Use git rebase to reapply your commits on top of the target branch. Detailed instructions for doing so can be found here.


@k8s-ci-robot k8s-ci-robot added do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 26, 2025
@k8s-ci-robot

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 26, 2025
@k8s-ci-robot k8s-ci-robot requested review from bart0sh and denkensk June 26, 2025 10:39
@k8s-ci-robot k8s-ci-robot added area/code-generation area/kubelet area/test kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jun 26, 2025
@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Jun 26, 2025
@k8s-ci-robot k8s-ci-robot added wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 26, 2025
pohly added 2 commits June 26, 2025 14:15
Test code using the testutil.Metrics type already depended on the Prometheus
types, but couldn't reference them by name. Referencing them by name is
necessary, for example, when using Gomega (to cast from `any` in a matcher) or
when defining a `var sample *Sample` which gets set later.

The new metric informs admins whether DRA in general (special "driver_name: <any>"
label) and/or specific DRA drivers (other label values) are in use on nodes.
This is useful to know because removing a driver is only safe if it is not in
use. If a driver gets removed while it has prepared a ResourceClaim,
unpreparing that ResourceClaim and stopping pods is blocked.

The implementation of the metric uses read locking of the claim
info cache. It retrieves "claims in use" and turns those into the metric.
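
The read-locked computation described above might look roughly like the following sketch. It is not the kubelet code: `claimCache`, `claimInfo`, `inUseByDriver`, and the example driver names are hypothetical, and it assumes the metric is a per-driver count of claims in use, with the total reported under the special `<any>` label value.

```go
package main

import (
	"fmt"
	"sync"
)

// anyDriver mirrors the special "<any>" label value from the description.
const anyDriver = "<any>"

// claimInfo and claimCache are hypothetical, simplified stand-ins for the
// kubelet's claim info cache.
type claimInfo struct {
	drivers []string // drivers which prepared this claim
	inUse   bool
}

type claimCache struct {
	mu     sync.RWMutex
	claims map[string]claimInfo
}

// inUseByDriver only takes the read lock. It counts the claims in use per
// driver and additionally counts every claim in use under anyDriver.
func (c *claimCache) inUseByDriver() map[string]int {
	c.mu.RLock()
	defer c.mu.RUnlock()

	counts := map[string]int{}
	for _, info := range c.claims {
		if !info.inUse {
			continue
		}
		counts[anyDriver]++
		for _, driver := range info.drivers {
			counts[driver]++
		}
	}
	return counts
}

// sampleCache returns made-up data for the demonstration.
func sampleCache() *claimCache {
	return &claimCache{claims: map[string]claimInfo{
		"claim-a": {drivers: []string{"gpu.example.com"}, inUse: true},
		"claim-b": {drivers: []string{"gpu.example.com", "nic.example.com"}, inUse: true},
		"claim-c": {drivers: []string{"nic.example.com"}, inUse: false},
	}}
}

func main() {
	for driver, n := range sampleCache().inUseByDriver() {
		fmt.Printf("driver_name=%q claims_in_use=%d\n", driver, n)
	}
}
```

With a result like this, an admin can tell that removing a driver is unsafe while its count is above zero.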

The same code is also used to log changes in the claim info cache with
a diff. This hooks into a write update of the claim info cache and uses
contextual logging.

The unit tests check that metrics get calculated. The e2e_node test checks that
kubelet really exports the metrics data.

While at it, some bugs in claiminfo_test.go get fixed: the way the cache got
populated in the test no longer matched the code.
@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Jun 26, 2025
pohly added 15 commits June 26, 2025 17:10
When using context.WithCancelCause in the scheduler and context.Cause in plugins,
the status returned by plugins is more informative than just "context
canceled".

Context cancellation itself is not new, but many plugin authors probably
weren't aware of it because it wasn't documented.
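
As a minimal illustration of the difference, the following sketch uses only the standard library. `errEnoughNodes`, `filter`, and `runCanceled` are hypothetical names; the real scheduler and plugin code are more involved.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errEnoughNodes is a hypothetical cause. The real scheduler defines its own.
var errEnoughNodes = errors.New("enough feasible nodes found")

// filter stands in for a plugin's Filter call. When interrupted, it reports
// the cause of the cancellation instead of the generic context error.
func filter(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return fmt.Errorf("filter interrupted: %w", context.Cause(ctx))
	default:
		return nil
	}
}

// runCanceled cancels a context with a cause and returns the plugin's
// message next to the plain ctx.Err() message for comparison.
func runCanceled() (string, string) {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errEnoughNodes)
	return filter(ctx).Error(), ctx.Err().Error()
}

func main() {
	withCause, plain := runCanceled()
	fmt.Println(withCause) // filter interrupted: enough feasible nodes found
	fmt.Println(plain)     // context canceled
}
```

Plugins which only look at `ctx.Err()` keep working, they just report less detail.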

The only option is the filter timeout.
Its implementation follows in a separate commit.

The intent is to catch abnormal runtimes with the generously large default
timeout of 10 seconds.

We have to set up a context with the configured timeout (optional!), then
ensure that both CEL evaluation and the allocation logic itself properly
return the context error. The scheduler plugin can then convert that into
"unschedulable".

The allocator and thus Filter now also check for context cancellation by the
scheduler. This happens when enough nodes have been found.
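
The flow described above can be sketched as follows, using only the standard library. `allocate`, `filterNode`, `runExamples`, and the result strings are hypothetical; in particular, how the real plugin reports cancellation by the scheduler is an assumption here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// allocate is a hypothetical stand-in for the allocator: it tries candidate
// allocations and checks the context before each attempt.
func allocate(ctx context.Context, attempts int, perAttempt time.Duration) error {
	for i := 0; i < attempts; i++ {
		if err := ctx.Err(); err != nil {
			return err
		}
		time.Sleep(perAttempt)
	}
	return nil
}

// filterNode applies the optional timeout (zero disables it) and converts a
// deadline error into an "unschedulable" result for the node.
func filterNode(parent context.Context, timeout time.Duration, attempts int) string {
	ctx := parent
	if timeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(parent, timeout)
		defer cancel()
	}
	err := allocate(ctx, attempts, time.Millisecond)
	switch {
	case err == nil:
		return "success"
	case errors.Is(err, context.DeadlineExceeded):
		return "unschedulable: " + err.Error()
	default:
		// For example, cancellation by the caller.
		return "canceled: " + err.Error()
	}
}

// runExamples covers: no timeout, an exceeded timeout, a canceled parent.
func runExamples() []string {
	background := context.Background()
	canceled, cancel := context.WithCancel(background)
	cancel()
	return []string{
		filterNode(background, 0, 3),
		filterNode(background, 20*time.Millisecond, 1000),
		filterNode(canceled, time.Second, 3),
	}
}

func main() {
	for _, result := range runExamples() {
		fmt.Println(result)
	}
}
```

Checking the context inside the loop is what makes both the timeout and the scheduler's own cancellation take effect promptly.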

It's unclear why k8s.io/kubernetes/pkg/apis/resource/install needs
to be imported explicitly. Having the apiserver and scheduler ready
to be started ensures that all APIs are available.

This covers disabling the feature via the configuration, failing to schedule
because of timeouts for all nodes, and retrying after ResourceSlice changes with
partial success (timeout for one node, success for the other).

While at it, some helper code gets improved.

The DRASchedulerFilterTimeout feature gate simplifies disabling the timeout
because setting a feature gate is often easier than modifying the scheduler
configuration with a zero timeout value.

The timeout and feature gate are new. The gate starts as beta and is enabled by
default, which is consistent with the "smaller changes with low enough risk
that still may need to be disabled..." guideline.

Tests which only exercise the control plane don't need DRA drivers on the nodes
and thus can run in any cluster where the API and feature gate are
enabled. Eventually they can become conformance tests.

The actual test cases follow the same pattern and in some cases are run twice,
once for "control plane" testing and once for "kubelet" testing. The difference
is that in "control plane" mode, the drivers don't get deployed and pods are
only expected to get scheduled instead of starting to run.

Requested during PR review. This mirrors skipping validation of invalid fields
in a REST API because they get dropped before validation.

It's not okay to drop a claim from the response just because it encountered no
error. We want to be sure that a DRA driver really looked at the claim.
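
A check along these lines could be sketched as follows. `prepareResult` and `checkResponse` are hypothetical and much simpler than the actual gRPC response types; the point is only that a missing entry is treated as a failure rather than as success.

```go
package main

import "fmt"

// prepareResult is a hypothetical, simplified per-claim result.
type prepareResult struct {
	Err string // empty means success
}

// checkResponse ensures that the driver returned an entry for every
// requested claim, whether preparing it succeeded or failed.
func checkResponse(requested []string, response map[string]prepareResult) error {
	for _, uid := range requested {
		result, ok := response[uid]
		if !ok {
			return fmt.Errorf("driver response is missing claim %s", uid)
		}
		if result.Err != "" {
			return fmt.Errorf("claim %s: %s", uid, result.Err)
		}
	}
	return nil
}

func main() {
	response := map[string]prepareResult{"uid-1": {}}
	fmt.Println(checkResponse([]string{"uid-1"}, response))
	fmt.Println(checkResponse([]string{"uid-1", "uid-2"}, response))
}
```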

v1alpha4 was added in 1.31 and superseded by v1beta1 in 1.32. Since that
release, plugins are also required to advertise the supported gRPC services
during registration. In practice, all known DRA drivers use the helper code
from 1.32 or newer and thus don't need the legacy support.
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign thockin for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 27, 2025
@k8s-ci-robot

PR needs rebase.


Labels
area/code-generation area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.