Skip to content

WIP: DRA: automated upgrade/downgrade testing #132295

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Jun 13, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Promotion of any feature, in this case core DRA to GA, depends on (so far) manually testing the upgrade/downgrade path. It's hard to document what exactly was tested in a way that others can verify the procedure and/or replicate it.

This PR adds helper packages for upgrading/downgrading a kind cluster and running E2E tests against, and uses that to test some DRA scenarios. It runs as part of the normal e2e.test invocation in pull/ci-kubernetes-kind-dra.

Which issue(s) this PR is related to:

#128965
KEP: kubernetes/enhancements#4381

Special notes for your reviewer:

It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done as E2E test mostly for pragmatic reasons:

  • It can reuse the existing helper code.
  • It can run in the normal DRA E2E jobs, with timeouts defined there.
  • Integration tests are limited to 10 minutes per directory and pull-kubernetes-integration is already a big and slow job. A different approach for per-SIG or per-feature integration testing would be needed.

The new helper code for managing a kind cluster is written so that it could be used both in an integration test and an E2E test. #122481 could make that a bit easier in an E2E test, but is not absolutely required.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 13, 2025
@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2025
@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jun 13, 2025
@k8s-ci-robot k8s-ci-robot requested review from bart0sh and liggitt June 13, 2025 18:39
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 14, 2025
@pohly pohly force-pushed the dra-version-skew branch from 2a1de75 to ba9e6e5 Compare June 17, 2025 16:16
@pohly
Copy link
Contributor Author

pohly commented Jun 17, 2025

/test pull-kubernetes-kind-dra

@pohly pohly force-pushed the dra-version-skew branch from d77a4d9 to 069aa93 Compare June 17, 2025 19:54
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 17, 2025
pohly added 2 commits June 18, 2025 16:33
This allows declaring a code region as one step without having to use
an anonymous callback function, which has the advantage that variables
set during the step are visible afterwards.

In Python, this would be done as

    with ktesting.Step(tctx) as tcxt:
        // some code code inside step
    // code not in the same step

But Go has no such construct.

In contrast to WithStep, the start and end of the step are logged, including
timing information.
This is a DRA-specific stop-gap solution for using the E2E framework together
with ktesting. Long-term this should better land in the E2E framework itself.
@pohly pohly force-pushed the dra-version-skew branch from 069aa93 to cf1a211 Compare June 18, 2025 19:34
@pohly
Copy link
Contributor Author

pohly commented Jun 18, 2025

/test pull-kubernetes-kind-dra pull-kubernetes-unit pull-kubernetes-integration

@pohly
Copy link
Contributor Author

pohly commented Jun 23, 2025

/test pull-kubernetes-dra-integration-canary

pohly added 3 commits June 23, 2025 13:04
The test brings up the kind cluster and uses that power to run through
an upgrade/downgrade. Version skew testing (running tests while the cluster
is partially up- or downgraded) could be added.

Several tests could run in parallel because each cluster is independent.
Inside the test the steps have to be sequential.

It is debatable whether this should be an E2E test at all. Technically this
could also be an integration test. It's currently done mostly for pragmatic
reasons:
- It can reuse the existing helper code.
- It can run in the normal DRA E2E jobs, with timeouts defined there.
- Integration tests are limited to 10 minutes per directory and
  pull-kubernetes-integration is already a big and slow job.
  A different approach for per-SIG or per-feature integration
  testing would be needed.

The new helper code for managing a kind cluster is written so that it could be
used both in an integration test and an E2E
test. kubernetes#122481 could make that a
bit easier in an E2E test, but is not absolutely required.
"KIND_COMMAND=kind go test ./test/e2e/dra" can run it, as if it was a unit test.
-ginkgo.junit-report can be used to produce a JUnit report. The main advantage
over a unit test is that the report will only contain the actual failure
message, because Ginkgo (in contrast to `go test`) tracks failures separately
from output.

From a practical perspective, doing this in a Ginkgo suite in the same
directory as before has the advantage that none of the helper code has to be
updated. Doing it as a Go unit test had been tried and turned into a major
effort.

To avoid running this in "make test", the test function returns unless
KIND_COMMAND is set.
We can recover from exec failing, the portproxy code already retries port
forwarding.
@pohly pohly force-pushed the dra-version-skew branch from cf1a211 to 38fe4d7 Compare June 23, 2025 11:05
@pohly
Copy link
Contributor Author

pohly commented Jun 23, 2025

/test pull-kubernetes-dra-integration-canary

@pohly pohly marked this pull request as ready for review June 23, 2025 12:29
@k8s-ci-robot k8s-ci-robot requested a review from klueska June 23, 2025 12:30
@k8s-ci-robot
Copy link
Contributor

@pohly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-verify 38fe4d7 link true /test pull-kubernetes-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +65 to +83
type Cluster struct {
running bool
name string
dir string
kubeConfig string
}

// Start brings up the cluster anew. If it was already running, it will be stopped first.

// Will be stopped automatically at the end of the test.
// If the ARTIFACTS env variable is set and the test failed,
// log files of the kind cluster get dumped into
// $ARTIFACTS/<test name>/kind/<cluster name> before stopping it.
//
// The name should be unique to enable running clusters in parallel.
// The config has to be a complete YAML according to https://kind.sigs.k8s.io/docs/user/configuration/.
// The image source can be a directory containing Kubernetes source code or
// a download URL like https://dl.k8s.io/ci/v1.33.1-19+f900f017250646/kubernetes-server-linux-amd64.tar.gz.
func (c *Cluster) Start(tCtx ktesting.TContext, name, kindConfig, imageName string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my opinion as kind maintainer is that I really do not want to see kind coupled so tightly in tree, @BenTheElder WDYT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Longer discussion in #132295 (comment) (helpfully hidden by GitHub when loading the page, thanks GitHub! 😠 ).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah ... I'm not convinced the e2e framework is the right place to maintain logic like this, which will likely span releases etc.

This is developing really tight coupling to kind internals.

Copy link
Member

@BenTheElder BenTheElder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right approach. We should keep the tests in e2e.test generic and put logic like upgrading clusters into external tools (like kinder, or https://github.com/kubernetes/test-infra/tree/master/experiment/compatibility-versions

I'm pretty concerned about having kubernetes/kubernetes take dependencies on internals we've repeatedly and explicitly told users not to use and documented as such. kind is tricky to maintain as-is.

// Adapted from https://github.com/kubernetes-sigs/kind/blob/3df64e784cc0ea74125b2a2e9877817418afa3af/pkg/build/nodeimage/internal/kube/source.go#L71-L104
func sourceVersion(tCtx ktesting.TContext, kubeRoot string) (gitVersion string, dockerTag string, err error) {
// Get the version output.
cmd := exec.CommandContext(tCtx, "hack/print-workspace-status.sh")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think e2e.test should be doing this, or compiling anything. This is something we'd do in a tool like kubetest2 or kinder. It's a slow, orthogonal concern to testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code does not get built into e2e.test.

It's a stand-alone test binary which only does something when invoked by a dedicated job with the right environment. That I am implementing it in this package is a stop-gap solution: I want to make the shared helper code more reusable, but that refactoring will come later.

I might be able to move it sooner by exporting some helpers... I'll check.

Comment on lines +65 to +83
type Cluster struct {
running bool
name string
dir string
kubeConfig string
}

// Start brings up the cluster anew. If it was already running, it will be stopped first.

// Will be stopped automatically at the end of the test.
// If the ARTIFACTS env variable is set and the test failed,
// log files of the kind cluster get dumped into
// $ARTIFACTS/<test name>/kind/<cluster name> before stopping it.
//
// The name should be unique to enable running clusters in parallel.
// The config has to be a complete YAML according to https://kind.sigs.k8s.io/docs/user/configuration/.
// The image source can be a directory containing Kubernetes source code or
// a download URL like https://dl.k8s.io/ci/v1.33.1-19+f900f017250646/kubernetes-server-linux-amd64.tar.gz.
func (c *Cluster) Start(tCtx ktesting.TContext, name, kindConfig, imageName string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah ... I'm not convinced the e2e framework is the right place to maintain logic like this, which will likely span releases etc.

This is developing really tight coupling to kind internals.

// They must support bringing up a cluster with a node image built from
// the current Kubernetes version on the host on which the E2E suite runs.
// The repo root must point to the Kubernetes source code.
KindCommand = framework.WithFeature(framework.ValidFeatures.Add("KindCommand"))
Copy link
Member

@BenTheElder BenTheElder Jun 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue with local-up-cluster.sh is maintainers, @dims was the most active and has recently pointed people to kind. It's also a bit difficult for using locally, it runs locally and isn't sandboxed so it manipulates the host directly quite a bit, in CI of course we stick that inside a container and it's similar to kind ...

The Go code in test/utils/kindcluster uses similar kind APIs as other jobs, with one exception: to replace components, it has to make assumptions about how control plane components are started (static pods), where the kubelet is located in the node image, and that there is a systemd unit for it.

I'm not happy with kubernetes/kubernetes code making assumptions about kind internals, if they ever change in kind then we need to patch it in every active release branch. This can totally deadlock both projects. We very explicitly tell users not to depend on details like "what is the path to kubelet", see here:
https://kind.sigs.k8s.io/docs/design/node-image/

We only support that node images will create a working Kubernetes node at the advertised version with the kind version they were released with (and best effort with other releases), see the release notes.

The contents and implemlentation of the images are subject to change at any time to fix bugs, improve reliability, performance, or maintainability.

DO NOT DEPEND ON THE INTERNALS OF THE NODE IMAGES.

KIND provides conformant Kubernetes, anything else is an implementation detail.

We will not accept bugs about “breaking changes” to node images and you depend on the implementation details at your own peril.

At least when test-infra experimental scripts leverage these, we can patch it once across release branches, though I also warned of these risks.

@github-project-automation github-project-automation bot moved this from Triage to Waiting on Author in SIG Node: code and documentation PRs Jun 27, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to In Progress in SIG Apps Jun 27, 2025
@pohly
Copy link
Contributor Author

pohly commented Jun 27, 2025

We should keep the tests in e2e.test generic and put logic like upgrading clusters into external tools (like kinder, or https://github.com/kubernetes/test-infra/tree/master/experiment/compatibility-versions

I don't see how I can combine those tools with writing tests against the cluster before and after a cluster change in a way that is short, familiar, and concise - or at all. Both tools seem to be focused on executing the "normal" E2E tests. What I want to do is write tests that specifically check scenarios that only occur when up- or downgrading a cluster, like deploying something and then checking its state after a cluster change.

This is developing really tight coupling to kind internals.

Ack, so no kind.

The main issue with local-up-cluster.sh is maintainers, @dims was the most active and has recently pointed people to kind. It's also a bit difficult for using locally,

I am using it each day and will maintain it as long as I continue to do so. The modify Kubernetes/start cluster/run tests cycle is much faster and it's easier to restart individual components under a debugger. I understand that it's a power tool with some sharp edges, but it's a useful one.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 27, 2025
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 👀 In review
Status: In Progress
Status: PRs Waiting on Author
Status: Waiting on Author
Development

Successfully merging this pull request may close these issues.

5 participants