WIP: DRA: automated upgrade/downgrade testing #132295

pohly · 2025-06-13T18:38:49Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Promoting any feature to GA, in this case core DRA, so far depends on manually testing the upgrade/downgrade path. It's hard to document what exactly was tested in a way that lets others verify or replicate the procedure.

This PR adds helper packages for upgrading/downgrading a kind cluster and running E2E tests against it, and uses them to test some DRA scenarios. It runs as part of the normal e2e.test invocation in pull/ci-kubernetes-kind-dra.

Which issue(s) this PR is related to:

#128965
KEP: kubernetes/enhancements#4381

Special notes for your reviewer:

It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done as an E2E test mostly for pragmatic reasons:

- It can reuse the existing helper code.
- It can run in the normal DRA E2E jobs, with timeouts defined there.
- Integration tests are limited to 10 minutes per directory and pull-kubernetes-integration is already a big and slow job. A different approach for per-SIG or per-feature integration testing would be needed.

The new helper code for managing a kind cluster is written so that it could be used both in an integration test and an E2E test. #122481 could make that a bit easier in an E2E test, but is not absolutely required.

Does this PR introduce a user-facing change?

NONE

k8s-ci-robot · 2025-06-13T18:38:52Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot · 2025-06-13T18:38:58Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

pohly · 2025-06-17T19:43:52Z

/test pull-kubernetes-kind-dra

This allows declaring a code region as one step without having to use an anonymous callback function, which has the advantage that variables set during the step are visible afterwards. In Python, this would be done as

    with ktesting.Step(tctx) as tctx:
        // some code inside step
    // code not in the same step

But Go has no such construct.

In contrast to WithStep, the start and end of the step are logged, including timing information.

This is a DRA-specific stop-gap solution for using the E2E framework together with ktesting. Long-term this should better land in the E2E framework itself.
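
One way to approximate a callback-free step in Go is a helper that logs the start and returns a function which logs the end with the elapsed time. This is only a minimal sketch of the idea: `stepLogger` and `beginStep` are hypothetical names made up for illustration, not the ktesting API added by this PR.

```go
package main

import (
	"fmt"
	"time"
)

// stepLogger is a hypothetical stand-in for a test context that can log.
type stepLogger struct {
	lines []string
}

// Logf records one formatted log line.
func (l *stepLogger) Logf(format string, args ...any) {
	l.lines = append(l.lines, fmt.Sprintf(format, args...))
}

// beginStep logs the start of a step and returns a function that logs its
// end together with the elapsed time. Code between the two calls runs in
// the caller's scope, so variables set there remain visible afterwards.
func beginStep(l *stepLogger, name string) (end func()) {
	start := time.Now()
	l.Logf("step %q: start", name)
	return func() {
		l.Logf("step %q: end after %s", name, time.Since(start).Round(time.Millisecond))
	}
}

func main() {
	l := &stepLogger{}

	end := beginStep(l, "deploy driver")
	claimName := "claim-1" // set inside the step
	end()

	// claimName is still in scope because no callback was involved.
	l.Logf("verifying %s", claimName)
	for _, line := range l.lines {
		fmt.Println(line)
	}
}
```

The trade-off compared to a callback is that nothing forces the caller to invoke `end`; a real helper would probably have to detect unfinished steps at the end of the test.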

pohly · 2025-06-18T20:10:53Z

/test pull-kubernetes-kind-dra pull-kubernetes-unit pull-kubernetes-integration

test/utils/kindcluster/kindcluster.go

pohly · 2025-06-23T05:52:49Z

/test pull-kubernetes-dra-integration-canary

The test brings up the kind cluster and uses that power to run through an upgrade/downgrade. Version skew testing (running tests while the cluster is partially up- or downgraded) could be added.

Several tests could run in parallel because each cluster is independent. Inside the test the steps have to be sequential.

It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done mostly for pragmatic reasons:
- It can reuse the existing helper code.
- It can run in the normal DRA E2E jobs, with timeouts defined there.
- Integration tests are limited to 10 minutes per directory and pull-kubernetes-integration is already a big and slow job. A different approach for per-SIG or per-feature integration testing would be needed.

The new helper code for managing a kind cluster is written so that it could be used both in an integration test and an E2E test. kubernetes#122481 could make that a bit easier in an E2E test, but is not absolutely required.

"KIND_COMMAND=kind go test ./test/e2e/dra" can run it, as if it was a unit test. -ginkgo.junit-report can be used to produce a JUnit report. The main advantage over a unit test is that the report will only contain the actual failure message, because Ginkgo (in contrast to `go test`) tracks failures separately from output.

From a practical perspective, doing this in a Ginkgo suite in the same directory as before has the advantage that none of the helper code has to be updated. Doing it as a Go unit test had been tried and turned into a major effort.

To avoid running this in "make test", the test function returns unless KIND_COMMAND is set.

We can recover from exec failing: the portproxy code already retries port forwarding.

pohly · 2025-06-23T11:10:27Z

/test pull-kubernetes-dra-integration-canary

k8s-ci-robot · 2025-06-23T13:34:02Z

@pohly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kubernetes-verify	`38fe4d7`	link	true	`/test pull-kubernetes-verify`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

aojea · 2025-06-25T08:19:38Z

test/utils/kindcluster/kindcluster.go

+type Cluster struct {
+	running    bool
+	name       string
+	dir        string
+	kubeConfig string
+}
+
+// Start brings up the cluster anew. If it was already running, it will be stopped first.
+//
+// Will be stopped automatically at the end of the test.
+// If the ARTIFACTS env variable is set and the test failed,
+// log files of the kind cluster get dumped into
+// $ARTIFACTS/<test name>/kind/<cluster name> before stopping it.
+//
+// The name should be unique to enable running clusters in parallel.
+// The config has to be a complete YAML according to https://kind.sigs.k8s.io/docs/user/configuration/.
+// The image source can be a directory containing Kubernetes source code or
+// a download URL like https://dl.k8s.io/ci/v1.33.1-19+f900f017250646/kubernetes-server-linux-amd64.tar.gz.
+func (c *Cluster) Start(tCtx ktesting.TContext, name, kindConfig, imageName string) {


My opinion as kind maintainer is that I really do not want to see kind coupled so tightly in tree. @BenTheElder WDYT?

Longer discussion in #132295 (comment) (helpfully hidden by GitHub when loading the page, thanks GitHub! 😠 ).

Yeah ... I'm not convinced the e2e framework is the right place to maintain logic like this, which will likely span releases etc.

This is developing really tight coupling to kind internals.

BenTheElder

I don't think this is the right approach. We should keep the tests in e2e.test generic and put logic like upgrading clusters into external tools (like kinder, or https://github.com/kubernetes/test-infra/tree/master/experiment/compatibility-versions).

I'm pretty concerned about having kubernetes/kubernetes take dependencies on internals we've repeatedly and explicitly told users not to use and documented as such. kind is tricky to maintain as-is.

BenTheElder · 2025-06-27T17:02:47Z

test/e2e/dra/versionskew_test.go

+// Adapted from https://github.com/kubernetes-sigs/kind/blob/3df64e784cc0ea74125b2a2e9877817418afa3af/pkg/build/nodeimage/internal/kube/source.go#L71-L104
+func sourceVersion(tCtx ktesting.TContext, kubeRoot string) (gitVersion string, dockerTag string, err error) {
+	// Get the version output.
+	cmd := exec.CommandContext(tCtx, "hack/print-workspace-status.sh")


I don't think e2e.test should be doing this, or compiling anything. This is something we'd do in a tool like kubetest2 or kinder. It's a slow, orthogonal concern to testing.

This code does not get built into e2e.test.

It's a stand-alone test binary which only does something when invoked by a dedicated job with the right environment. That I am implementing it in this package is a stop-gap solution: I want to make the shared helper code more reusable, but that refactoring will come later.

I might be able to move it sooner by exporting some helpers... I'll check.
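
To make the version lookup in this hunk concrete: the sketch below parses text of the kind that hack/print-workspace-status.sh is assumed to print, namely one "<key> <value>" pair per line with a `gitVersion` key, and derives a Docker tag by replacing "+" (not allowed in Docker tags) with "_". Both the output format and the replacement rule are assumptions based on the kind code linked in the hunk, and `parseGitVersion` is a hypothetical name; the script itself is not executed here.

```go
package main

import (
	"fmt"
	"strings"
)

// parseGitVersion extracts the value of the "gitVersion" key from lines of
// the form "<key> <value>" and derives a tag in which "+" is replaced.
func parseGitVersion(output string) (gitVersion, dockerTag string, err error) {
	for _, line := range strings.Split(output, "\n") {
		key, value, ok := strings.Cut(strings.TrimSpace(line), " ")
		if ok && key == "gitVersion" {
			return value, strings.ReplaceAll(value, "+", "_"), nil
		}
	}
	return "", "", fmt.Errorf("no gitVersion found in output")
}

func main() {
	// Sample input; the version string is the one from the download URL
	// mentioned in the Start documentation.
	sample := "gitCommit f900f017250646\ngitVersion v1.33.1-19+f900f017250646\n"
	gitVersion, dockerTag, err := parseGitVersion(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(gitVersion, dockerTag)
}
```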

BenTheElder · 2025-06-27T17:03:20Z

test/utils/kindcluster/kindcluster.go

Yeah ... I'm not convinced the e2e framework is the right place to maintain logic like this, which will likely span releases etc.

This is developing really tight coupling to kind internals.

BenTheElder · 2025-06-27T17:07:35Z

test/e2e/feature/feature.go

+	// They must support bringing up a cluster with a node image built from
+	// the current Kubernetes version on the host on which the E2E suite runs.
+	// The repo root must point to the Kubernetes source code.
+	KindCommand = framework.WithFeature(framework.ValidFeatures.Add("KindCommand"))


The main issue with local-up-cluster.sh is maintainers, @dims was the most active and has recently pointed people to kind. It's also a bit difficult for using locally: it runs locally and isn't sandboxed, so it manipulates the host directly quite a bit. In CI, of course, we stick it inside a container, and then it's similar to kind ...

The Go code in test/utils/kindcluster uses similar kind APIs as other jobs, with one exception: to replace components, it has to make assumptions about how control plane components are started (static pods), where the kubelet is located in the node image, and that there is a systemd unit for it.

I'm not happy with kubernetes/kubernetes code making assumptions about kind internals. If they ever change in kind, we need to patch it in every active release branch. This can totally deadlock both projects. We very explicitly tell users not to depend on details like "what is the path to kubelet", see here:
https://kind.sigs.k8s.io/docs/design/node-image/

We only support that node images will create a working Kubernetes node at the advertised version with the kind version they were released with (and best effort with other releases), see the release notes.

The contents and implementation of the images are subject to change at any time to fix bugs, improve reliability, performance, or maintainability.

DO NOT DEPEND ON THE INTERNALS OF THE NODE IMAGES.

KIND provides conformant Kubernetes, anything else is an implementation detail.

We will not accept bugs about “breaking changes” to node images and you depend on the implementation details at your own peril.

At least when test-infra experimental scripts leverage these, we can patch it once across release branches, though I also warned of these risks.

pohly · 2025-06-27T19:09:04Z

We should keep the tests in e2e.test generic and put logic like upgrading clusters into external tools (like kinder, or https://github.com/kubernetes/test-infra/tree/master/experiment/compatibility-versions).

I don't see how I can combine those tools with writing tests against the cluster before and after a cluster change in a way that is short, familiar, and concise - or at all. Both tools seem to be focused on executing the "normal" E2E tests. What I want to do is write tests that specifically check scenarios that only occur when up- or downgrading a cluster, like deploying something and then checking its state after a cluster change.

This is developing really tight coupling to kind internals.

Ack, so no kind.

The main issue with local-up-cluster.sh is maintainers, @dims was the most active and has recently pointed people to kind. It's also a bit difficult for using locally,

I am using it each day and will maintain it as long as I continue to do so. The modify Kubernetes/start cluster/run tests cycle is much faster and it's easier to restart individual components under a debugger. I understand that it's a power tool with some sharp edges, but it's a useful one.

k8s-ci-robot · 2025-06-27T19:09:13Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

github-project-automation bot added this to SIG Node CI/Test Board and SIG Node: code and documentation PRs Jun 13, 2025

k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2025

github-project-automation bot moved this to Triage in SIG Node CI/Test Board Jun 13, 2025

github-project-automation bot moved this to Triage in SIG Node: code and documentation PRs Jun 13, 2025

k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jun 13, 2025

k8s-ci-robot requested review from bart0sh and liggitt June 13, 2025 18:39

github-project-automation bot added this to Dynamic Resource Allocation Jun 13, 2025

github-project-automation bot moved this to 🆕 New in Dynamic Resource Allocation Jun 13, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2025

pohly mentioned this pull request Jun 13, 2025

DRA: version skew testing #128965

Open

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 14, 2025

pohly force-pushed the dra-version-skew branch from 2a1de75 to ba9e6e5 Compare June 17, 2025 16:16

pohly force-pushed the dra-version-skew branch from d77a4d9 to 069aa93 Compare June 17, 2025 19:54

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 17, 2025

pohly mentioned this pull request Jun 18, 2025

DRA canary: add integration test job kubernetes/test-infra#35004

Merged

pohly added 2 commits June 18, 2025 16:33

DRA E2E: support using ktesting

5f2e663

This is a DRA-specific stop-gap solution for using the E2E framework together with ktesting. Long-term this should better land in the E2E framework itself.

pohly force-pushed the dra-version-skew branch from 069aa93 to cf1a211 Compare June 18, 2025 19:34

pohly commented Jun 18, 2025

pohly added 3 commits June 23, 2025 13:04

DRA E2E: retry exec of hostpathplugin

38fe4d7

We can recover from exec failing, the portproxy code already retries port forwarding.

pohly force-pushed the dra-version-skew branch from cf1a211 to 38fe4d7 Compare June 23, 2025 11:05

pohly marked this pull request as ready for review June 23, 2025 12:29

k8s-ci-robot requested a review from klueska June 23, 2025 12:30

pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jun 25, 2025

pohly mentioned this pull request Jun 25, 2025

DRA resource claim controller: fails to clean up deleted ResourceClaims on startup #132334

Open

aojea reviewed Jun 25, 2025

nojnhuh mentioned this pull request Jun 25, 2025

DRA: fix deleting orphaned ResourceClaim on startup #132533

Open

SergeyKanzhelev moved this from Triage to PRs Waiting on Author in SIG Node CI/Test Board Jun 25, 2025

BenTheElder requested changes Jun 27, 2025

github-project-automation bot moved this from Triage to Waiting on Author in SIG Node: code and documentation PRs Jun 27, 2025

github-project-automation bot moved this from Needs Triage to In Progress in SIG Apps Jun 27, 2025

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 27, 2025
