WIP: DRA GA tracking branch #132554
With benchstat it's easy to do before/after comparisons, but the section on running benchmarks didn't mention it at all, and the commands shown there didn't work:
- benchmark results must be printed (FULL_LOG)
- the timeout might have been too short (KUBE_TIMEOUT)
- only "short" benchmarks ran (SHORT)
- klog log output must be redirected (ARTIFACTS)
Test code using the testutil.Metrics type already depended on the Prometheus types, but couldn't reference them by name. This is necessary for example when using Gomega (to cast from `any` in a matcher) or when defining a `var sample *Sample` which gets set later.
The new metric informs admins whether DRA in general (special "driver_name: <any>" label) and/or specific DRA drivers (other label values) are in use on nodes. This is useful to know because removing a driver is only safe if it is not in use. If a driver gets removed while it has prepared a ResourceClaim, unpreparing that ResourceClaim and stopping pods is blocked.

The implementation of the metric takes a read lock on the claim info cache, retrieves the "claims in use" and turns those into the metric. The same code is also used to log changes in the claim info cache with a diff. This hooks into write updates of the claim info cache and uses contextual logging.

The unit tests check that metrics get calculated. The e2e_node test checks that kubelet really exports the metrics data.

While at it, some bugs in claiminfo_test.go get fixed: the way the cache got populated in the test did not match the code anymore.
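The read-locked "claims in use" calculation described above could look roughly like the following. This is only a sketch under assumptions: the `claimInfoCache` layout, the `inUse` method and the counting rule (one count per claim under "<any>", one per claim for each driver that prepared it) are hypothetical and not taken from the kubelet code.

```go
package main

import (
	"fmt"
	"sync"
)

// anyDriver is the special label value that covers DRA in general.
const anyDriver = "<any>"

// claimInfoCache is a simplified, hypothetical stand-in for the kubelet's
// claim info cache: it maps a claim name to the drivers that prepared it.
type claimInfoCache struct {
	mu     sync.RWMutex
	claims map[string][]string
}

// inUse returns the number of prepared claims per driver name, plus the
// total under the "<any>" label. It only takes the read lock, so it does
// not block other readers of the cache.
func (c *claimInfoCache) inUse() map[string]int {
	c.mu.RLock()
	defer c.mu.RUnlock()

	counts := map[string]int{anyDriver: 0}
	for _, drivers := range c.claims {
		counts[anyDriver]++
		seen := map[string]bool{}
		for _, driver := range drivers {
			// Count each claim at most once per driver.
			if !seen[driver] {
				seen[driver] = true
				counts[driver]++
			}
		}
	}
	return counts
}

func main() {
	cache := &claimInfoCache{claims: map[string][]string{
		"claim-a": {"gpu.example.com"},
		"claim-b": {"gpu.example.com", "nic.example.com"},
	}}
	fmt.Println(cache.inUse())
}
```

With counts like these, an admin could tell that a driver whose count is zero is not referenced by any prepared claim on the node.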
When using context.WithCancelCause in the scheduler and context.Cause in plugins, the status returned by plugins is more informative than just "context canceled". Context cancellation itself is not new, but many plugin authors probably weren't aware of it because it wasn't documented.
The only option is the filter timeout. The implementation of it follows in a separate commit.
The intent is to catch abnormal runtimes with the generously large default timeout of 10 seconds. We have to set up a context with the configured timeout (optional!), then ensure that both CEL evaluation and the allocation logic itself properly return the context error. The scheduler plugin can then convert that into "unschedulable". The allocator and thus Filter now also check for context cancellation by the scheduler. This happens when enough nodes have been found.
It's unclear why k8s.io/kubernetes/pkg/apis/resource/install needs to be imported explicitly. Having the apiserver and scheduler ready to be started ensures that all APIs are available.
This covers disabling the feature via the configuration, failing to schedule because of timeouts for all nodes, and retrying after ResourceSlice changes with partial success (timeout for one node, success for the other). While at it, some helper code gets improved.
The DRASchedulerFilterTimeout feature gate simplifies disabling the timeout because setting a feature gate is often easier than modifying the scheduler configuration with a zero timeout value. The timeout and feature gate are new. The gate starts as beta and is enabled by default, which is consistent with the "smaller changes with low enough risk that still may need to be disabled..." guideline.
Tests which only exercise the control plane don't need DRA drivers on the nodes and thus can run in any cluster where the API and feature gate are enabled. Eventually they can become conformance tests. The actual test cases follow the same pattern and in some cases are run twice, once for "control plane" testing and once for "kubelet" testing. The difference is that in "control plane" mode, the drivers don't get deployed and pods are only expected to get scheduled instead of starting to run.
Requested during PR review. This mirrors skipping validation of invalid fields in a REST API because they get dropped before validation.
It's not okay to drop a claim from the response just because it encountered no error. We want to be sure that a DRA driver really looked at the claim.
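The check implied here is that every requested claim must appear in the driver's response, whether or not it failed. A minimal sketch, assuming a hypothetical `result` type and a response keyed by claim UID (the real gRPC response types are not reproduced here):

```go
package main

import "fmt"

// result is a hypothetical per-claim entry in a driver response.
// An empty Err means the driver handled the claim successfully.
type result struct {
	Err string
}

// checkResponse verifies that the driver returned an entry for every
// requested claim. A missing entry is treated as an error because it
// gives no evidence that the driver looked at the claim.
func checkResponse(requested []string, response map[string]result) error {
	for _, uid := range requested {
		r, ok := response[uid]
		if !ok {
			return fmt.Errorf("claim %s: missing from driver response", uid)
		}
		if r.Err != "" {
			return fmt.Errorf("claim %s: %s", uid, r.Err)
		}
	}
	return nil
}

func main() {
	requested := []string{"uid-1", "uid-2"}

	// The driver silently dropped uid-2.
	fmt.Println(checkResponse(requested, map[string]result{"uid-1": {}}))

	// The driver answered for both claims.
	fmt.Println(checkResponse(requested, map[string]result{"uid-1": {}, "uid-2": {}}))
}
```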
v1alpha4 was added in 1.31 and superseded by v1beta1 in 1.32. Since that release, plugins are also required to advertise the supported gRPC services during registration. In practice, all known DRA drivers use the helper code from 1.32 or newer and thus don't need the legacy support.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: pohly
This is a combination of all PRs which are required for core DRA to become GA. It is not meant to be merged itself.