Skip to content

WIP: DRA: automated upgrade/downgrade testing #132295

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Jun 13, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Promotion of any feature, in this case core DRA to GA, depends on (so far) manually testing the upgrade/downgrade path. It's hard to document what exactly was tested in a way that others can verify the procedure and/or replicate it.

This PR adds helper packages for upgrading/downgrading a kind cluster and running E2E tests against, and uses that to test some DRA scenarios. It runs as part of the normal e2e.test invocation in pull/ci-kubernetes-kind-dra.

Which issue(s) this PR is related to:

#128965
KEP: kubernetes/enhancements#4381

Special notes for your reviewer:

It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done as E2E test mostly for pragmatic reasons:

  • It can reuse the existing helper code.
  • It can run in the normal DRA E2E jobs, with timeouts defined there.
  • Integration tests are limited to 10 minutes per directory and pull-kubernetes-integration is already a big and slow job. A different approach for per-SIG or per-feature integration testing would be needed.

The new helper code for managing a kind cluster is written so that it could be used both in an integration test and an E2E test. #122481 could make that a bit easier in an E2E test, but is not absolutely required.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 13, 2025
@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2025
@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jun 13, 2025
@k8s-ci-robot k8s-ci-robot requested review from bart0sh and liggitt June 13, 2025 18:39
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 14, 2025
@pohly pohly force-pushed the dra-version-skew branch from 2a1de75 to ba9e6e5 Compare June 17, 2025 16:16
pohly added 17 commits July 1, 2025 15:33
A special test (for example, one which manages its own cluster) could almost
construct a Framework instance because most fields are exported. The
`clientConfig` field isn't because REST configs often need to be deep-copied to
avoid accidentally updating some shared copy, so for this special case a
SetClientConfig is needed.
EOF occurs after restarting the API server and, despite a retry loop in
client-go/rest/request.go, sometimes is returned to the application.
Getting slices can be done with the helper.
Can be done via -vmodule, albeit not precisely because other controllers
also have a controller.go file.
During upgrade/downgrade testing, errors are encountered while the apiserver is
down. This is normal and handled via retrying, so we don't need to be verbose.
Hiding the error in WithError is the right choice for example
when it is used inside ktesting.Eventually. Most callers probably want to deal
with the unexpected error themselves. For those who don't, WithErrorLogging
continues to log it.
This allows declaring a code region as one step without having to use
an anonymous callback function, which has the advantage that variables
set during the step are visible afterwards.

In Python, this would be done as

    with ktesting.Step(tctx) as tcxt:
        // some code code inside step
    // code not in the same step

But Go has no such construct.

In contrast to WithStep, the start and end of the step are logged, including
timing information.
This is a DRA-specific stop-gap solution for using the E2E framework together
with ktesting. Long-term this should better land in the E2E framework itself.
We can recover from exec failing, the portproxy code already retries port
forwarding.
That WithCancel added a deferred cleanup which cancels on test termination was
unexpected. This automatic cancellation makes sense only for the initial root
TContext.
This closes a gap compared to the context package. It's useful when combined
with Ginkgo to keep something running beyond the end of the Ginkgo BeforeEach
or It node.
Showing the configuration (= variable assignments) without going all the way to
KUBE_VERBOSE > 4 is useful.
Some ports (apiserver, one kubelet port) were already configurable.
Several others were not.

Primarily this is done to document the ports which are in used by the different
components.
This may be useful during manual invocations to see what commands would be
executed and with which parameters, without actually running them.

But the main purpose is to use this mode in automated upgrade/downgrade testing
where the caller parses the output to execute those commands under its own
control. Such a caller can then replaced individual component binaries with
those from other releases.
If we know that the test binary shares the filesystem with the cluster (for
example, when using local-up-cluster.sh), then we can avoid the whole
complicated portproxy solution and work directly with the paths on the
host.

Only works with suitable permissions! /var/lib/kubelet/plugins,
/var/lib/kubelet/plugin_registry, and /var/run/cdi must be writable.

portproxy remains the default because it automatically gains sufficient
permissions also when combined with local-up-cluster.sh.
@pohly pohly force-pushed the dra-version-skew branch from 38fe4d7 to 3d0f786 Compare July 1, 2025 14:24
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 1, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pohly pohly force-pushed the dra-version-skew branch from 3d0f786 to b155544 Compare July 1, 2025 16:06
@pohly pohly marked this pull request as draft July 1, 2025 16:15
@pohly
Copy link
Contributor Author

pohly commented Jul 1, 2025

/test pull-kubernetes-local-e2e

@pohly pohly force-pushed the dra-version-skew branch from b155544 to 98553b2 Compare July 1, 2025 16:24
@pohly
Copy link
Contributor Author

pohly commented Jul 1, 2025

/test pull-kubernetes-local-e2e pull-kubernetes-kind-dra

@k8s-ci-robot
Copy link
Contributor

@pohly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-linter-hints 3d0f786 link false /test pull-kubernetes-linter-hints
pull-kubernetes-verify 3d0f786 link true /test pull-kubernetes-verify
pull-kubernetes-dependencies b155544 link true /test pull-kubernetes-dependencies
pull-kubernetes-kind-dra-n-2 b155544 link false /test pull-kubernetes-kind-dra-n-2
pull-kubernetes-kind-dra-all b155544 link false /test pull-kubernetes-kind-dra-all
pull-kubernetes-kind-dra-n-1 b155544 link false /test pull-kubernetes-kind-dra-n-1
pull-kubernetes-e2e-gce-csi-serial b155544 link false /test pull-kubernetes-e2e-gce-csi-serial
pull-kubernetes-e2e-gce-storage-slow b155544 link false /test pull-kubernetes-e2e-gce-storage-slow
pull-kubernetes-e2e-gce-storage-snapshot b155544 link false /test pull-kubernetes-e2e-gce-storage-snapshot
pull-kubernetes-conformance-kind-ga-only-parallel b155544 link true /test pull-kubernetes-conformance-kind-ga-only-parallel
pull-kubernetes-e2e-gce b155544 link true /test pull-kubernetes-e2e-gce
pull-kubernetes-typecheck b155544 link true /test pull-kubernetes-typecheck
pull-kubernetes-e2e-storage-kind-disruptive b155544 link false /test pull-kubernetes-e2e-storage-kind-disruptive
pull-kubernetes-e2e-kind b155544 link true /test pull-kubernetes-e2e-kind
pull-kubernetes-local-e2e 98553b2 link false /test pull-kubernetes-local-e2e
pull-kubernetes-kind-dra 98553b2 link false /test pull-kubernetes-kind-dra

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

pohly added 3 commits July 1, 2025 18:43
In upgrade/downgrade testings, sometimes kube-proxy doesn't get started. It's
not clear why, perhaps this additional output will show the reason.
The helper code is useful for a separate Ginkgo suite for upgrade/downgrade
testing. We don't want to import test/e2e/dra there because that would also
define additional tests.
The test brings up the cluster and uses that power to run through
an upgrade/downgrade scenario. Version skew testing (running tests while the cluster
is partially up- or downgraded) could be added.

The new helper code for managing the cluster is written so that it could be
used both in an integration test and an E2E
test. kubernetes#122481 could make that a
bit easier in an E2E test, but is not absolutely required.
@pohly pohly force-pushed the dra-version-skew branch from 98553b2 to 6386b6e Compare July 1, 2025 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 👀 In review
Status: In Progress
Status: PRs Waiting on Author
Status: Waiting on Author
Development

Successfully merging this pull request may close these issues.

5 participants