Skip to content

Don't create extra namespaces during pod resize tests #131518

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

kevinrizza
Copy link
Contributor

@kevinrizza kevinrizza commented Apr 28, 2025

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:

  • Modifies the pod resize tests not to create extra unused namespaces
  • This reduces load on the namespace controller to help prevent timeouts when other tests are run in parallel
  • This is done by pulling the test framework initialization context up so that it is only done before iterating over the test specs rather than creating a separate test framework instance for each test spec. The way the tests were previously structured caused the framework's BeforeEach function to be registered before each test, causing a significant number of extra unused namespaces to be created during these tests.

Which issue(s) this PR fixes:

Fixes #131517

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 28, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @kevinrizza. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 28, 2025
@k8s-ci-robot k8s-ci-robot added area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Apr 28, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Apr 28, 2025
@kevinrizza
Copy link
Contributor Author

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 28, 2025
@cici37
Copy link
Contributor

cici37 commented Apr 28, 2025

/ok-to-test
/lgtm
/cc @BenTheElder for review when you have time :)

Thank you Kevin for spending time on solving this and appreciate the PR!

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 28, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 28, 2025
@cici37
Copy link
Contributor

cici37 commented Apr 28, 2025

Note: This will block #131514

@kevinrizza
Copy link
Contributor Author

/cc @esotsal who wrote these tests last year. From what I can tell there shouldn't be a problem sharing a namespace here

Copy link
Member

@BenTheElder BenTheElder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@kevinrizza
Copy link
Contributor Author

@pohly I'm seeing a lot of latency to calls to delete a single namespace as part of the test that @cici37 is enabling in #131514 -- on the order of minutes. I took a look at API calls in the logs of the API server and compared them to the logs in the controller manager when the client called delete and saw an over two minute delay in some cases.

Then I took a look at other API calls in the same time period and saw a lot of deletes being called from this set of pod resize tests. I guessed that there was something related to these specific namespaces causing the latency, and to try to prove that that was what was causing those tests to fail I disabled this set of tests and that seemed to remove the latency I was seeing in subsequent test runs and got rid of the flakiness in the ordered namespace delete tests. So my next assumption was that updating these tests to share a namespace that would have the same effect as disabling these tests on the ordered namespace delete's test flake.

It's possible there is some other confounding factor here, but I definitely see a strong correlation between this set of tests and the other flakes I am observing.

@@ -1390,7 +1389,6 @@ func doPodResizeErrorTests() {

for idx := range tests {
tc := tests[idx]
f := framework.NewDefaultFramework("pod-resize-error-tests")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving these outside the for loop is what is solving the problem here, right? IIUC each iteration was both creating a new spec and registering another BeforeEach. I see in the KCM logs from your linked PR before this change:

$ grep namespace_controller kcm.log | grep -Eo 'namespace="pod-resize-tests-[0-9]+"' | sort | uniq| wc -l
1999

And after:

$ grep namespace_controller kcm2.log | grep -Eo 'namespace="pod-resize-tests-[0-9]+"' | sort | uniq | wc -l
51

@kevinrizza
Copy link
Contributor Author

@benluddy You're right. And even reducing this test from creating numberOfTests^2 to numberOfTests namespaces still significantly improves the latency of the namespace controller wrt the flake I observed. It would probably be more better if we used a single namespace for all of these tests, but the improvement from that would be negligible by comparison.

This commit updates pod resize tests to resolve an
issue where extraneous unused namespaces are created
when the tests are run. This happens because a new
instance of the test framework is generated when
looping over the test specs, registering BeforeEach each
time.
@kevinrizza kevinrizza force-pushed the pod-resize-test-namespace branch from 43db65a to 903d6d4 Compare April 29, 2025 13:18
@kevinrizza kevinrizza changed the title Use single namespace for pod resize tests Don't create extra namespaces during pod resize tests Apr 29, 2025
@kevinrizza
Copy link
Contributor Author

@pohly @carlory I updated the commit message, pr title and description with a better explanation for what's actually happening here and why this pr resolves the problem. I also found another instance of this happening in the pod resize tests in test/e2e/node/pod_resize.go . It only has two test specs, so it wasn't nearly as many, but it was still another instance of the same test bug.

Hopefully the reason for this change is clearer now. Thank you for the review!

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Now the explanation for this PR makes sense.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 29, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 4fd6b2d5183ce38851c2828a979105d940e7ac7a

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, kevinrizza, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@esotsal
Copy link
Contributor

esotsal commented Apr 29, 2025

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 29, 2025
@BenTheElder
Copy link
Member

Yeah, we observed this causing flakiness in the ordered namespace deletion tests due to the load, creating one per overall test makes sense but this created far more before.

I had spoke to Cici about the context so I focused on reviewing the code diff over non-release-note original description. You can see by inspection how this fixes the issue.

/hold cancel
/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 29, 2025
@k8s-ci-robot k8s-ci-robot merged commit e0dbdc9 into kubernetes:master Apr 29, 2025
15 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Apr 29, 2025
@github-project-automation github-project-automation bot moved this from Triage to Done in SIG Node CI/Test Board Apr 29, 2025
@cici37
Copy link
Contributor

cici37 commented Apr 29, 2025

Thank you Kevin for the fix and thank you everyone for the review! ♥️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Development

Successfully merging this pull request may close these issues.

Pod InPlace Resize Container tests causing load issue on namespace controller
8 participants