-
Notifications
You must be signed in to change notification settings - Fork 40.9k
DRA kubelet: connection monitoring #132058
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRA kubelet: connection monitoring #132058
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
type pluginsStore struct { | ||
sync.RWMutex | ||
// Store keeps track of how to reach plugins registered for DRA drivers. | ||
// Each plugin has a gRPC endpoint. There may be more than one plugin per driver. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my attempt to rationalize the difference between "kubelet plugin" and "DRA driver".
The DRA manager has traditionally favored "plugin" and "plugin name" when the rest of the system uses the terms "DRA driver" and "DRA driver name". IMHO this is unnecessary and the code would have been more readable when restricting the use of "plugin" to the registration mechanism. Even with some of my updates that separation is not complete and/or consistent.
Would it be worth doing more renames?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It gets worse: sometimes the plugin instance is also called "client", leading to "NewDRAPluginClient". Apropos, the "New" is also wrong there because it does not create an instance.
I'll add a commit with further renaming...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not just code readability: kubelet also logs "pluginName" as an attribute where kube-scheduler and other components use "driverName". This is visible to users.
e37d351
to
6c65e86
Compare
/test pull-kubernetes-node-e2e-containerd-2-0-dra |
6c65e86
to
23b2871
Compare
Fixed the following warnings: dra_test.go:884:2: singleCaseSwitch: should rewrite switch statement to if statement (gocritic) switch podName { ^ dra_test.go:686:4: SA4006: this value of kubeletPlugin is never used (staticcheck) kubeletPlugin = newDRAService(ctx, f.ClientSet, nodeName, driverName) ^
Unix socket path has to be <= 108 characters in length on Windows. Shortened DRA socket path for TestConnectionHandling and TestRegistrationHandler tests should fix the test run on Windows.
Added an ability to specify the socket path for the DRA gRPC service in the e2e node tests. The PluginSocket option is added to allow setting the name of the socket inside the directory where the DRA driver creates the socket for the DRA gRPC calls. This is used by the kubelet to connect to the DRA plugin. The newDRAService and newRegistrar functions are updated to accept a socketPath parameter, which is used to configure the PluginDataDirectoryPath and PluginSocket options for the DRA plugin. This change enables more flexible configuration of the DRA plugin in e2e tests, allowing for testing with different socket paths.
Added tests to verify DRA functionality with 2 different socket configurations: - the same socket is used for the registration and the DRA service - 2 separate sockets are used for the registration and the DRA service Used table-driven ginkgo to avoid code duplication: specs https://onsi.github.io/ginkgo/#table-driven-tests This change enhances the robustness of the DRA e2e tests by validating its behavior with different socket setups.
Fixed race condition caused by removing already removed socket directory.
Using the same socket path for different test cases caused test failure on windows: listen unix C:\Users\azureuser\AppData\Local\Temp\TestRegistrationHandler3881105518\001/dra-plugin-a.sock: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted. Creating unique socket path for every test case should fix it.
f7f4708
to
6040344
Compare
/lgtm |
LGTM label has been added. Git tree hash: 43d3df89d73c9843252c404af35f73c9e1ccaa7e
|
@pohly: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
/test pull-kubernetes-e2e-kind Diverse unrelated failures, too busy to look for known flakes.... /skip For previous canary job failures. |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: klueska, pohly The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind feature
What this PR does / why we need it:
When a kubelet plugin shuts down without removing its registration socket (for example, because it was forcibly killed or crashed), then the kubelet does not clean up and delete ResourceSlices even if the plugin does not come back. With connection monitoring of the service socket, the kubelet recognizes when a plugin becomes unusable and then cleans up accordingly.
Which issue(s) this PR fixes:
Fixes #128696
Replaces #131073
Depends on #132096
Special notes for your reviewer:
Please review commit-by-commit. The initial commits refactor code to make the last commits simpler, so at least #132096 should get merged first.
Does this PR introduce a user-facing change?