DRA plugin: handle serving failures #132598

Open: bart0sh wants to merge 4 commits into base: master from PR182-DRA-handle-serving-failures

bart0sh (Contributor) commented Jun 28, 2025

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it:

  • Added an ErrorChannel option to the DRA plugin to report unrecoverable errors via a channel.
  • Updated the e2e_node test APIs to take options instead of positional arguments.
  • Added an e2e_node test to verify error reporting for gRPC serving failures.

Special notes for your reviewer:

If this PR is accepted, I will modify the example driver to fail on serving failures.

Does this PR introduce a user-facing change?

NONE

k8s-ci-robot added labels on Jun 28, 2025:
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • release-note-none: Denotes a PR that doesn't merit a release note.
k8s-ci-robot (Contributor):

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot added labels on Jun 28, 2025:
  • kind/bug: Categorizes issue or PR as related to a bug.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
  • kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/needs-sig: Indicates an issue or PR lacks a `sig/foo` label and requires one.
  • needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
k8s-ci-robot (Contributor):

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot changed labels on Jun 28, 2025.
Added:
  • needs-priority: Indicates a PR lacks a `priority/foo` label and requires one.
  • area/kubectl
  • area/test
  • sig/cli: Categorizes an issue or PR as relevant to SIG CLI.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
Removed:
  • do-not-merge/needs-sig: Indicates an issue or PR lacks a `sig/foo` label and requires one.
@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jun 28, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG CLI Jun 28, 2025
@bart0sh bart0sh force-pushed the PR182-DRA-handle-serving-failures branch from a214e7d to ea7767d Compare June 28, 2025 18:26
bart0sh (Contributor Author) commented Jun 28, 2025

/pull-kubernetes-verify

bart0sh (Contributor Author) commented Jun 28, 2025

/test pull-kubernetes-unit-windows-master
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra
/test pull-kubernetes-node-e2e-containerd-2-0-dra

@bart0sh bart0sh force-pushed the PR182-DRA-handle-serving-failures branch from ea7767d to 80517a9 Compare June 29, 2025 02:24
bart0sh (Contributor Author) commented Jun 29, 2025

/test pull-kubernetes-unit-windows-master
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra

bart0sh (Contributor Author) commented Jun 29, 2025

/test pull-kubernetes-node-e2e-crio-cgrpv2-dra

@bart0sh bart0sh force-pushed the PR182-DRA-handle-serving-failures branch from d1a6655 to 07d0757 Compare June 29, 2025 09:58
bart0sh (Contributor Author) commented Jun 29, 2025

/test pull-kubernetes-linter-hints
/test pull-kubernetes-unit-windows-master
/test pull-kubernetes-kind-dra-n-1
/test pull-kubernetes-kind-dra-n-2
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra
/test pull-kubernetes-node-e2e-containerd-2-0-dra

@bart0sh bart0sh marked this pull request as ready for review June 29, 2025 12:13
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 29, 2025
@k8s-ci-robot k8s-ci-robot requested review from pohly and tallclair June 29, 2025 12:13
bart0sh (Contributor Author) commented Jun 29, 2025

/cc @nojnhuh

@k8s-ci-robot k8s-ci-robot requested a review from nojnhuh June 29, 2025 12:14
@bart0sh bart0sh moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jun 29, 2025
bart0sh (Contributor Author) commented Jun 29, 2025

/cc @SergeyKanzhelev @lauralorenz

@@ -274,15 +274,15 @@ func PluginSocket(name string) Option {
 }
 }

-// PluginListener configures how to create the registrar socket.
+// PluginListener configures how to create the socket listener.
Contributor:

The listener for which socket?

"plugin" and "endpoint" are both ambiguous. Let's be clear about which socket we are dealing with.

I also like "create socket" better than "create socket listener". The main purpose of the function is to create a socket; the returned net.Listener is then just an implementation detail.

The Go documentation is similar: "net.Listen" is described as "announces on the local network address", not as "creates a listener".

So I would fix the copy-and-paste error with:

PluginListener configures how to create the DRA service socket

Contributor Author:

> The listener for which socket?

For either the plugin (service) socket or the registrar socket.

Contributor Author:

> PluginListener configures how to create the DRA service socket

This could be read as applying only to the plugin (service) socket, which is not correct.

Contributor:

> For either the plugin (service) socket or the registrar socket.

Really? There's a separate RegistrarListener for the registrar socket.

> This could be read as applying only to the plugin (service) socket, which is not correct.

It is correct. The parameter of PluginListener is only used for one socket.

Contributor Author:

Thank you for pointing that out! Fixed.

bart0sh (Contributor Author) commented Jun 29, 2025

/test pull-kubernetes-node-e2e-crio-cgrpv2-dra
/test pull-kubernetes-node-e2e-containerd-1-7-dra

@bart0sh bart0sh force-pushed the PR182-DRA-handle-serving-failures branch from 07d0757 to 708afa1 Compare June 29, 2025 18:50
@@ -430,6 +430,18 @@ func DRAService(enabled bool) Option {
 }
 }

+// ErrorChannel allows the plugin to send errors to the caller. This is
+// useful to report errors that happen in goroutines. The caller is expected
+// to read from the channel and handle errors appropriately.
pohly (Contributor), Jun 30, 2025:

Can you add guidance for DRA driver authors for what "handle errors appropriately" means?

Log them and continue? Log and exit because all errors are fatal? Are there special errors that are merely informative (probably not, but worth clarifying)?

bart0sh (Contributor Author), Jun 30, 2025:

For now, the only use of this is handling grpc.Server.Serve errors. The explanation below suggests performing a graceful shutdown. In other (future) cases it could be different.

Contributor:

It says "log this error and perform a graceful shutdown" (emphasis mine) - but it doesn't say how the caller can detect this error.

Either we need to document how it can detect that error, or recommend that it does that action for all errors.

Contributor Author:

Updated the description, PTAL.

+// to read from the channel and handle errors appropriately.
+// For example, this channel can be used to report errors returned by the
+// grpcserver.Serve. Plugins then can use this error to perform a graceful shutdown.
+func ErrorChannel(errorChannel chan error) Option {
Contributor:

Suggested change
-func ErrorChannel(errorChannel chan error) Option {
+func ErrorChannel(errorChannel chan<- error) Option {

The helper is only meant to write to this channel, not read from it.

Contributor Author:

Done, thanks for pointing that out!

@@ -78,7 +78,10 @@ func startGRPCServer(logger klog.Logger, grpcVerbosity int, unaryInterceptors []
 defer s.wg.Done()
 err := s.server.Serve(listener)
 if err != nil {
-logger.Error(err, "GRPC server failed")
+logger.Error(err, "gRPC server failed to serve")
Contributor:

Let's log here only if we don't pass on the error. Otherwise it might get logged twice.

Contributor Author:

Done.

@github-project-automation github-project-automation bot moved this from Triage to PRs Waiting on Author in SIG Node CI/Test Board Jun 30, 2025
@github-project-automation github-project-automation bot moved this from Triage to Waiting on Author in SIG Node: code and documentation PRs Jun 30, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to In Progress in SIG CLI Jun 30, 2025
@bart0sh bart0sh force-pushed the PR182-DRA-handle-serving-failures branch from 708afa1 to c361112 Compare June 30, 2025 14:53
bart0sh added 4 commits July 1, 2025 10:16

This change introduces an ErrorChannel option to the DRA kubelet plugin,
allowing errors from goroutines (such as gRPC server failures) to be
reported to the caller.

Refactor the DRA e2e_node test helpers and test cases to accept
variadic kubeletplugin.Option arguments.

This change improves test flexibility and maintainability, allowing
new options to be passed in the future without requiring widespread
code changes.

There are no functional changes to test coverage or behavior.

Added an e2e_node test to verify that the DRA plugin and registration
services report gRPC serving errors correctly. A closed
listener is created intentionally to simulate a serving error,
ensuring that grpc.Server.Serve fails as expected and the error is
sent to the error channel.
@bart0sh bart0sh force-pushed the PR182-DRA-handle-serving-failures branch from c361112 to 5d8e353 Compare July 1, 2025 07:17
Labels
  • area/kubectl
  • area/test
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
  • needs-priority: Indicates a PR lacks a `priority/foo` label and requires one.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • sig/cli: Categorizes an issue or PR as relevant to SIG CLI.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
  • wg/device-management: Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 👀 In review
Status: In Progress
Status: PRs Waiting on Author
Status: Waiting on Author

3 participants