Don't create extra namespaces during pod resize tests #131518
Conversation
Hi @kevinrizza. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/release-note-none
/ok-to-test
Thank you, Kevin, for spending time on solving this; the PR is appreciated!
Note: This will block #131514
/cc @esotsal, who wrote these tests last year. From what I can tell, there shouldn't be a problem sharing a namespace here.
/lgtm
/approve
@pohly I'm seeing a lot of latency in calls to delete a single namespace as part of the test that @cici37 is enabling in #131514 -- on the order of minutes. I took a look at API calls in the logs of the API server and compared them to the logs in the controller manager when the client called delete, and saw an over-two-minute delay in some cases. Then I looked at other API calls in the same time period and saw a lot of deletes being issued from this set of pod resize tests. I guessed that something related to these specific namespaces was causing the latency, and to try to prove that this was what was causing those tests to fail, I disabled this set of tests; that seemed to remove the latency I was seeing in subsequent test runs and got rid of the flakiness in the ordered namespace delete tests. So my next assumption was that updating these tests to share a namespace would have the same effect on the ordered namespace delete test flake as disabling them did. It's possible there is some other confounding factor here, but I definitely see a strong correlation between this set of tests and the other flakes I am observing.
```
@@ -1390,7 +1389,6 @@ func doPodResizeErrorTests() {

	for idx := range tests {
		tc := tests[idx]
		f := framework.NewDefaultFramework("pod-resize-error-tests")
```
Moving these outside the for loop is what is solving the problem here, right? IIUC each iteration was both creating a new spec and registering another BeforeEach. I see in the KCM logs from your linked PR before this change:
```
$ grep namespace_controller kcm.log | grep -Eo 'namespace="pod-resize-tests-[0-9]+"' | sort | uniq | wc -l
1999
```
And after:
```
$ grep namespace_controller kcm2.log | grep -Eo 'namespace="pod-resize-tests-[0-9]+"' | sort | uniq | wc -l
51
```
@benluddy You're right. And even reducing this test from creating …
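(For readers skimming the thread, here is a minimal sketch of the pattern being discussed. The package name, testCase type, and test bodies are placeholders rather than the actual suite code; only the framework-inside-the-loop shape is taken from the diff above.)

```go
// Sketch only: illustrates why creating the framework inside the loop
// registers one namespace-creating BeforeEach per test spec.
package podresizesketch // placeholder package name

import (
	"context"

	"github.com/onsi/ginkgo/v2"
	"k8s.io/kubernetes/test/e2e/framework"
)

type testCase struct{ name string } // placeholder for the real spec struct

// Before: a new framework per iteration, so every iteration registers
// another BeforeEach in the same container.
func doPodResizeErrorTestsBefore(tests []testCase) {
	for idx := range tests {
		tc := tests[idx]
		f := framework.NewDefaultFramework("pod-resize-error-tests") // registers another BeforeEach
		ginkgo.It(tc.name, func(ctx context.Context) {
			_ = f // placeholder for the real test body, which uses f.ClientSet / f.Namespace
		})
	}
}

// After: one framework shared by every spec in the container, so only a
// single namespace-creating BeforeEach is registered.
func doPodResizeErrorTestsAfter(tests []testCase) {
	f := framework.NewDefaultFramework("pod-resize-error-tests")
	for idx := range tests {
		tc := tests[idx]
		ginkgo.It(tc.name, func(ctx context.Context) {
			_ = f
		})
	}
}
```

If this sketch matches the real structure, the shared framework still gives each spec its own fresh namespace via the single registered BeforeEach; only the extra, per-iteration registrations go away.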
This commit updates pod resize tests to resolve an issue where extraneous unused namespaces are created when the tests are run. This happens because a new instance of the test framework is generated when looping over the test specs, registering BeforeEach each time.
Force-pushed from 43db65a to 903d6d4.
@pohly @carlory I updated the commit message, PR title, and description with a better explanation of what's actually happening here and why this PR resolves the problem. I also found another instance of this happening in the pod resize tests in test/e2e/node/pod_resize.go. It only has two test specs, so it wasn't creating nearly as many namespaces, but it was still another instance of the same test bug. Hopefully the reason for this change is clearer now. Thank you for the review!
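(To put rough numbers on the effect: in Ginkgo, every BeforeEach registered in a container runs before every spec in that container, so registering one framework per spec makes namespace creation grow roughly as the square of the number of specs. That would line up with the ~2000 vs. 51 namespaces in the KCM log greps above, and even the two-spec file in test/e2e/node/pod_resize.go would create four namespaces per run instead of two under the old pattern, assuming the same registration shape. The toy program below only illustrates that scaling; the spec count is made up.)

```go
package main

import "fmt"

// Toy illustration: if each loop iteration registers both a spec and a
// setup hook, every hook fires before every spec, so setups grow as N*N.
// With one shared hook they grow as N.
func main() {
	const numSpecs = 45 // made-up count, chosen only to show the scaling

	perIterationHooks := 0 // setups when a hook is registered per iteration
	sharedHook := 0        // setups when a single hook is shared by all specs

	for spec := 0; spec < numSpecs; spec++ {
		for hook := 0; hook < numSpecs; hook++ {
			perIterationHooks++ // every registered hook fires for this spec
		}
		sharedHook++ // the one shared hook fires once per spec
	}

	fmt.Printf("per-iteration hooks: %d setups, shared hook: %d setups\n",
		perIterationHooks, sharedHook)
	// prints: per-iteration hooks: 2025 setups, shared hook: 45 setups
}
```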
/lgtm
/approve
Now the explanation for this PR makes sense.
LGTM label has been added. Git tree hash: 4fd6b2d5183ce38851c2828a979105d940e7ac7a
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: BenTheElder, kevinrizza, pohly
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/hold cancel
Yeah, we observed this causing flakiness in the ordered namespace deletion tests due to the load; creating one namespace per overall test makes sense, but this created far more than that before. I had spoken to Cici about the context, so I focused on reviewing the code diff rather than the original non-release-note description. You can see by inspection how this fixes the issue.
/hold cancel
Thank you Kevin for the fix and thank you everyone for the review!
What type of PR is this?
/kind bug
/kind flake
What this PR does / why we need it:
Updates the pod resize tests so they no longer create extraneous, unused namespaces. A new instance of the test framework was being created while looping over the test specs, registering another BeforeEach (and therefore more namespaces) each time; the framework is now created once and shared by the specs.
Which issue(s) this PR fixes:
Fixes #131517
Does this PR introduce a user-facing change?
NONE