WIP: DRA: automated upgrade/downgrade testing #132295
base: master
Conversation
Skipping CI for Draft Pull Request.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the appropriate label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/test pull-kubernetes-kind-dra
This allows declaring a code region as one step without having to use an anonymous callback function, which has the advantage that variables set during the step are visible afterwards. In Python, this would be done as

    with ktesting.Step(tctx) as tctx:
        # code inside step
    # code not in the same step

But Go has no such construct. In contrast to WithStep, the start and end of the step are logged, including timing information.
This is a DRA-specific stop-gap solution for using the E2E framework together with ktesting. Long-term this should preferably land in the E2E framework itself.
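To make the pattern concrete, here is a minimal, self-contained Go sketch of the begin/end style described above. It deliberately does not use the real ktesting API (the exact signature of the new helper is not reproduced here); it only illustrates why the region style keeps variables visible after the step ends:

```go
package main

import (
	"fmt"
	"time"
)

// step mimics the described behavior: it logs the start of a named region and
// returns a function that logs the end together with timing information.
// Illustration only, not the actual ktesting implementation.
func step(name string) func() {
	start := time.Now()
	fmt.Printf("STEP %q: started\n", name)
	return func() {
		fmt.Printf("STEP %q: finished after %v\n", name, time.Since(start))
	}
}

func main() {
	end := step("create claim")
	claim := "claim-0" // set inside the step ...
	end()

	fmt.Println("... and still visible afterwards:", claim)
}
```

With a callback-based helper, `claim` would have been scoped to the anonymous function and unavailable after the step.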
/test pull-kubernetes-kind-dra pull-kubernetes-unit pull-kubernetes-integration
/test pull-kubernetes-dra-integration-canary
The test brings up the kind cluster and uses that power to run through an upgrade/downgrade. Version skew testing (running tests while the cluster is partially up- or downgraded) could be added. Several tests could run in parallel because each cluster is independent. Inside the test the steps have to be sequential.

It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done mostly for pragmatic reasons:
- It can reuse the existing helper code.
- It can run in the normal DRA E2E jobs, with timeouts defined there.
- Integration tests are limited to 10 minutes per directory and pull-kubernetes-integration is already a big and slow job. A different approach for per-SIG or per-feature integration testing would be needed.

The new helper code for managing a kind cluster is written so that it could be used both in an integration test and an E2E test. kubernetes#122481 could make that a bit easier in an E2E test, but is not absolutely required.
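To sketch the intended flow (not code from this PR), one upgrade/downgrade test might look roughly like the following; Cluster.Start matches the helper quoted further down, everything else is a hypothetical placeholder:

```go
// Each test owns its own kind cluster, so several tests can run in parallel;
// the steps inside one test stay sequential. upgradeCluster, downgradeCluster
// and the DRA workload checks are hypothetical placeholders.
func testUpgradeDowngrade(tCtx ktesting.TContext) {
	var cluster Cluster
	cluster.Start(tCtx, "dra-upgrade", kindConfigYAML, oldVersionImageSource)

	deployDRAWorkload(tCtx)          // e.g. ResourceClaims plus pods using them
	upgradeCluster(tCtx, &cluster)   // swap components for the new version
	checkDRAWorkload(tCtx)           // state must survive the upgrade

	downgradeCluster(tCtx, &cluster) // roll components back
	checkDRAWorkload(tCtx)
}
```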
"KIND_COMMAND=kind go test ./test/e2e/dra" can run it, as if it was a unit test. -ginkgo.junit-report can be used to produce a JUnit report. The main advantage over a unit test is that the report will only contain the actual failure message, because Ginkgo (in contrast to `go test`) tracks failures separately from output. From a practical perspective, doing this in a Ginkgo suite in the same directory as before has the advantage that none of the helper code has to be updated. Doing it as a Go unit test had been tried and turned into a major effort. To avoid running this in "make test", the test function returns unless KIND_COMMAND is set.
We can recover from exec failing; the portproxy code already retries port forwarding.
/test pull-kubernetes-dra-integration-canary
@pohly: The following test failed, say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
type Cluster struct {
	running    bool
	name       string
	dir        string
	kubeConfig string
}

// Start brings up the cluster anew. If it was already running, it will be stopped first.
// Will be stopped automatically at the end of the test.
// If the ARTIFACTS env variable is set and the test failed,
// log files of the kind cluster get dumped into
// $ARTIFACTS/<test name>/kind/<cluster name> before stopping it.
//
// The name should be unique to enable running clusters in parallel.
// The config has to be a complete YAML according to https://kind.sigs.k8s.io/docs/user/configuration/.
// The image source can be a directory containing Kubernetes source code or
// a download URL like https://dl.k8s.io/ci/v1.33.1-19+f900f017250646/kubernetes-server-linux-amd64.tar.gz.
func (c *Cluster) Start(tCtx ktesting.TContext, name, kindConfig, imageName string) {
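Based on the doc comment, a caller would use the helper roughly like this (illustrative; kindConfigYAML is a placeholder for a complete kind configuration):

```go
func example(tCtx ktesting.TContext) {
	var cluster Cluster

	// The image source may be a Kubernetes source tree on the host, e.g.
	// "/home/user/go/src/k8s.io/kubernetes", or a published server tarball:
	cluster.Start(tCtx, "dra-upgrade-test", kindConfigYAML,
		"https://dl.k8s.io/ci/v1.33.1-19+f900f017250646/kubernetes-server-linux-amd64.tar.gz")

	// No explicit cleanup: the cluster is stopped automatically at the end of
	// the test, with kind logs dumped under $ARTIFACTS on failure.
}
```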
my opinion as kind maintainer is that I really do not want to see kind coupled so tightly in tree, @BenTheElder WDYT
Longer discussion in #132295 (comment) (helpfully hidden by GitHub when loading the page, thanks GitHub! 😠 ).
Yeah ... I'm not convinced the e2e framework is the right place to maintain logic like this, which will likely span releases etc.
This is developing really tight coupling to kind internals.
I don't think this is the right approach. We should keep the tests in e2e.test generic and put logic like upgrading clusters into external tools (like kinder, or https://github.com/kubernetes/test-infra/tree/master/experiment/compatibility-versions).
I'm pretty concerned about having kubernetes/kubernetes take dependencies on internals we've repeatedly and explicitly told users not to use and documented as such. kind is tricky to maintain as-is.
// Adapted from https://github.com/kubernetes-sigs/kind/blob/3df64e784cc0ea74125b2a2e9877817418afa3af/pkg/build/nodeimage/internal/kube/source.go#L71-L104
func sourceVersion(tCtx ktesting.TContext, kubeRoot string) (gitVersion string, dockerTag string, err error) {
	// Get the version output.
	cmd := exec.CommandContext(tCtx, "hack/print-workspace-status.sh")
I don't think e2e.test should be doing this, or compiling anything. This is something we'd do in a tool like kubetest2 or kinder. It's a slow, orthogonal concern to testing.
This code does not get built into e2e.test.
It's a stand-alone test binary which only does something when invoked by a dedicated job with the right environment. That I am implementing it in this package is a stop-gap solution: I want to make the shared helper code more reusable, but that refactoring will come later.
I might be able to move it sooner by exporting some helpers... I'll check.
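For context, the rest of the sourceVersion helper quoted above roughly amounts to parsing the script's key/value output. The sketch below is an approximation, not the exact code from the diff; it assumes hack/print-workspace-status.sh prints "key value" lines that include gitVersion and STABLE_DOCKER_TAG (imports: fmt, strings, os/exec as in the original file):

```go
	cmd.Dir = kubeRoot
	output, err := cmd.Output()
	if err != nil {
		return "", "", fmt.Errorf("running hack/print-workspace-status.sh: %w", err)
	}
	for _, line := range strings.Split(string(output), "\n") {
		parts := strings.SplitN(strings.TrimSpace(line), " ", 2)
		if len(parts) != 2 {
			continue
		}
		switch parts[0] {
		case "gitVersion":
			gitVersion = parts[1]
		case "STABLE_DOCKER_TAG": // docker tags cannot contain "+", hence a separate key
			dockerTag = parts[1]
		}
	}
	return gitVersion, dockerTag, nil
```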
// They must support bringing up a cluster with a node image built from
// the current Kubernetes version on the host on which the E2E suite runs.
// The repo root must point to the Kubernetes source code.
KindCommand = framework.WithFeature(framework.ValidFeatures.Add("KindCommand"))
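For reference, a feature label defined like this is normally attached to specs so that only jobs which opt in via a Ginkgo label filter run them. The wiring below is illustrative, not from this diff, and the exact helpers used to attach the label may differ:

```go
// Illustrative only: gate the upgrade/downgrade specs behind the feature label.
var _ = framework.Describe("DRA upgrade/downgrade", feature.KindCommand, func() {
	framework.It("keeps allocated claims working across an upgrade", func(ctx context.Context) {
		// ... bring up a kind cluster, upgrade it, verify the claims ...
	})
})
```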
The main issue with local-up-cluster.sh is maintainers; @dims was the most active and has recently pointed people to kind. It's also a bit difficult to use locally: it runs directly on the host and isn't sandboxed, so it manipulates the host quite a bit. In CI we of course stick it inside a container, and then it's similar to kind ...
The Go code in test/utils/kindcluster uses kind APIs similar to those used by other jobs, with one exception: to replace components, it has to make assumptions about how control plane components are started (static pods), where the kubelet is located in the node image, and that there is a systemd unit for it.
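To make that concrete, the assumptions listed above boil down to operations like the following sketch. The paths, the systemd unit name, and the use of docker exec are exactly the internals under debate; they are assumptions here, not a stable kind interface:

```go
// Sketch only: replace the kubelet inside a kind node container and restart it.
// Assumes nodes are docker containers, the kubelet lives at /usr/bin/kubelet,
// and there is a "kubelet" systemd unit; control plane components would be
// swapped by rewriting their static pod manifests under /etc/kubernetes/manifests.
func replaceKubelet(tCtx ktesting.TContext, nodeName, newKubeletBinary string) error {
	steps := [][]string{
		{"docker", "exec", nodeName, "systemctl", "stop", "kubelet"},
		{"docker", "cp", newKubeletBinary, nodeName + ":/usr/bin/kubelet"},
		{"docker", "exec", nodeName, "systemctl", "start", "kubelet"},
	}
	for _, args := range steps {
		cmd := exec.CommandContext(tCtx, args[0], args[1:]...)
		if out, err := cmd.CombinedOutput(); err != nil {
			return fmt.Errorf("%q failed: %w\n%s", args, err, out)
		}
	}
	return nil
}
```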
I'm not happy with kubernetes/kubernetes code making assumptions about kind internals, if they ever change in kind then we need to patch it in every active release branch. This can totally deadlock both projects. We very explicitly tell users not to depend on details like "what is the path to kubelet", see here:
https://kind.sigs.k8s.io/docs/design/node-image/
We only support that node images will create a working Kubernetes node at the advertised version with the kind version they were released with (and best effort with other releases), see the release notes.
The contents and implementation of the images are subject to change at any time to fix bugs, improve reliability, performance, or maintainability.
DO NOT DEPEND ON THE INTERNALS OF THE NODE IMAGES.
KIND provides conformant Kubernetes, anything else is an implementation detail.
We will not accept bugs about “breaking changes” to node images and you depend on the implementation details at your own peril.
At least when test-infra experimental scripts leverage these, we can patch it once across release branches, though I also warned of these risks.
I don't see how I can combine those tools with writing tests against the cluster before and after a cluster change in a way that is short, familiar, and concise - or at all. Both tools seem to be focused on executing the "normal" E2E tests. What I want to do is write tests that specifically check scenarios that only occur when up- or downgrading a cluster, like deploying something and then checking its state after a cluster change.
Ack, so no kind.
I am using it each day and will maintain it as long as I continue to do so. The modify Kubernetes/start cluster/run tests cycle is much faster and it's easier to restart individual components under a debugger. I understand that it's a power tool with some sharp edges, but it's a useful one.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Promotion of any feature, in this case core DRA to GA, depends on (so far) manually testing the upgrade/downgrade path. It's hard to document what exactly was tested in a way that others can verify the procedure and/or replicate it.
This PR adds helper packages for upgrading/downgrading a kind cluster and running E2E tests against it, and uses them to test some DRA scenarios. It runs as part of the normal e2e.test invocation in pull/ci-kubernetes-kind-dra.
Which issue(s) this PR is related to:
#128965
KEP: kubernetes/enhancements#4381
Special notes for your reviewer:
It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done as an E2E test mostly for pragmatic reasons:
- It can reuse the existing helper code.
- It can run in the normal DRA E2E jobs, with timeouts defined there.
- Integration tests are limited to 10 minutes per directory and pull-kubernetes-integration is already a big and slow job. A different approach for per-SIG or per-feature integration testing would be needed.
The new helper code for managing a kind cluster is written so that it could be used both in an integration test and an E2E test. #122481 could make that a bit easier in an E2E test, but is not absolutely required.
Does this PR introduce a user-facing change?