Skip to content

DRA kubelet: connection monitoring #132058

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Jun 2, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

When a kubelet plugin shuts down without removing its registration socket (for example, because it was forcibly killed or crashed), then the kubelet does not clean up and delete ResourceSlices even if the plugin does not come back. With connection monitoring of the service socket, the kubelet recognizes when a plugin becomes unusable and then cleans up accordingly.

Which issue(s) this PR fixes:

Fixes #128696
Replaces #131073
Depends on #132096

Special notes for your reviewer:

Please review commit-by-commit. The initial commits refactor code to make the last commits simpler, so at least #132096 should get merged first.

Does this PR introduce a user-facing change?

DRA kubelet: the kubelet now also cleans up ResourceSlices in some additional failure scenarios (driver gets removed forcibly or crashes and does not restart).

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 2, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 2, 2025
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 2, 2025
@pohly pohly moved this from 🆕 New to 🏗 In progress in Dynamic Resource Allocation Jun 2, 2025
type pluginsStore struct {
sync.RWMutex
// Store keeps track of how to reach plugins registered for DRA drivers.
// Each plugin has a gRPC endpoint. There may be more than one plugin per driver.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my attempt to rationalize the difference between "kubelet plugin" and "DRA driver".

The DRA manager has traditionally favored "plugin" and "plugin name" when the rest of the system uses the terms "DRA driver" and "DRA driver name". IMHO this is unnecessary and the code would have been more readable when restricting the use of "plugin" to the registration mechanism. Even with some of my updates that separation is not complete and/or consistent.

Would it be worth doing more renames?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets worse: sometimes the plugin instance is also called "client", leading to "NewDRAPluginClient". Apropos, the "New" is also wrong there because it does not create an instance.

I'll add a commit with further renaming...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just code readability: kubelet also logs "pluginName" as an attribute where kube-scheduler and other components use "driverName". This is visible to users.

@pohly pohly force-pushed the dra-kubelet-connection-monitoring branch 3 times, most recently from e37d351 to 6c65e86 Compare June 3, 2025 10:27
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 3, 2025
@pohly
Copy link
Contributor Author

pohly commented Jun 3, 2025

/test pull-kubernetes-node-e2e-containerd-2-0-dra

@pohly pohly marked this pull request as draft June 3, 2025 12:15
@pohly pohly force-pushed the dra-kubelet-connection-monitoring branch from 6c65e86 to 23b2871 Compare June 3, 2025 16:28
bart0sh added 6 commits June 24, 2025 10:42
Fixed the following warnings:
dra_test.go:884:2: singleCaseSwitch: should rewrite switch statement to if statement (gocritic)
	switch podName {
	^
dra_test.go:686:4: SA4006: this value of kubeletPlugin is never used (staticcheck)
	kubeletPlugin = newDRAService(ctx, f.ClientSet, nodeName, driverName)
        ^
Unix socket path has to be <= 108 characters in length on Windows.
Shortened DRA socket path for TestConnectionHandling and
TestRegistrationHandler tests should fix the test run on Windows.
Added an ability to specify the socket path for the DRA gRPC
service in the e2e node tests.

The PluginSocket option is added to allow setting the name
of the socket inside the directory where the DRA driver
creates the socket for the DRA gRPC calls. This is used by
the kubelet to connect to the DRA plugin.

The newDRAService and newRegistrar functions are updated to
accept a socketPath parameter, which is used to configure
the PluginDataDirectoryPath and PluginSocket options for the
DRA plugin.

This change enables more flexible configuration of the DRA
plugin in e2e tests, allowing for testing with different
socket paths.
Added tests to verify DRA functionality with 2 different socket
configurations:
- the same socket is used for the registration and the DRA service
- 2 separate sockets are used for the registration and the DRA service

Used table-driven ginkgo to avoid code duplication:
specs https://onsi.github.io/ginkgo/#table-driven-tests

This change enhances the robustness of the DRA e2e tests by
validating its behavior with different socket setups.
Fixed race condition caused by removing already removed
socket directory.
Using the same socket path for different test cases caused
test failure on windows:
listen unix C:\Users\azureuser\AppData\Local\Temp\TestRegistrationHandler3881105518\001/dra-plugin-a.sock:
bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.

Creating unique socket path for every test case should fix it.
@pohly pohly force-pushed the dra-kubelet-connection-monitoring branch from f7f4708 to 6040344 Compare June 24, 2025 08:43
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 24, 2025
@bart0sh
Copy link
Contributor

bart0sh commented Jun 24, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 24, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 43d3df89d73c9843252c404af35f73c9e1ccaa7e

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Jun 24, 2025

@pohly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-kind-dra-n-1-canary aee4073 link false /test pull-kubernetes-kind-dra-n-1-canary
pull-kubernetes-kind-dra-n-2-canary aee4073 link false /test pull-kubernetes-kind-dra-n-2-canary

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@pohly
Copy link
Contributor Author

pohly commented Jun 24, 2025

/test pull-kubernetes-e2e-kind

Diverse unrelated failures, too busy to look for known flakes....

/skip

For previous canary job failures.

@bart0sh bart0sh moved this from Triage to Needs Approver in SIG Node: code and documentation PRs Jun 25, 2025
@pohly pohly moved this from 🏗 In progress to 👀 In review in Dynamic Resource Allocation Jun 26, 2025
@klueska
Copy link
Contributor

klueska commented Jun 26, 2025

/approve
based on @bart0sh's review

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: klueska, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2025
pohly added a commit to pohly/kubernetes that referenced this pull request Jun 26, 2025
@k8s-ci-robot k8s-ci-robot merged commit dcefe0e into kubernetes:master Jun 26, 2025
19 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 26, 2025
@github-project-automation github-project-automation bot moved this from Archive-it to Done in SIG Node CI/Test Board Jun 26, 2025
@github-project-automation github-project-automation bot moved this from Needs Approver to Done in SIG Node: code and documentation PRs Jun 26, 2025
pohly added a commit to pohly/kubernetes that referenced this pull request Jun 27, 2025
@bart0sh bart0sh moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Jun 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Development

Successfully merging this pull request may close these issues.

DRA: detect stale DRA plugin sockets
5 participants