scheduler_perf: wait for all pods to be scheduled before deletion in churnOp recreate mode #132167


Open
wants to merge 1 commit into base: master

Conversation

Member

@utam0k commented Jun 7, 2025

What type of PR is this?

/kind feature
/sig scheduling

What this PR does / why we need it:

Fixes #125974

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

scheduler_perf: churnOp in recreate mode now waits for all pods to be scheduled before starting the deletion phase, ensuring consistent churn behavior and preventing unscheduled pods from getting stuck in the Pending state

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note, kind/feature, size/S, sig/scheduling, cncf-cla: yes, and needs-triage labels on Jun 7, 2025.
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-priority label on Jun 7, 2025.
@k8s-ci-robot requested review from AxeZhan and denkensk on Jun 7, 2025, 12:26.
@k8s-ci-robot added the area/test and sig/testing labels on Jun 7, 2025.
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: utam0k
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@utam0k
Member Author

utam0k commented Jun 7, 2025

/cc @macsko

@k8s-ci-robot requested a review from macsko on Jun 7, 2025, 12:26.
churnFns = append(churnFns, func(name string) string {
	if name != "" {
		if err := dynRes.Delete(e.tCtx, name, metav1.DeleteOptions{}); err != nil && !errors.Is(err, context.Canceled) {
			e.tCtx.Errorf("op %d: unable to delete %v: %v", opIndex, name, err)
		shouldDelete := true
Member

The problem with this approach will appear when a second batch of pods starts to be created. Then we would end up with a conflict when creating a pod in place of a pod that wasn't deleted.

Maybe you could try waiting for all pods to be scheduled just after all of them were created, before deleting any?
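
For illustration, a minimal sketch of what "wait for all pods to be scheduled just after they were created" could look like, assuming a typed clientset and a known churn namespace; the helper name, poll interval, and timeout below are placeholders rather than the PR's actual code:

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// Illustrative helper (assumption, not the PR's code): block until every pod
// in the given namespace has been bound to a node, or the timeout expires.
func waitForScheduledPodsInNamespace(ctx context.Context, cs kubernetes.Interface, namespace string) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
			if err != nil {
				return false, err
			}
			for _, pod := range pods.Items {
				// An empty NodeName means the scheduler has not bound this pod yet.
				if pod.Spec.NodeName == "" {
					return false, nil
				}
			}
			return true, nil
		})
}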

@k8s-ci-robot added the size/M label and removed the size/S label on Jun 29, 2025.
@utam0k changed the title from "scheduler_perf: only delete scheduled pods in churnOp recreate mode" to "scheduler_perf: wait for all pods to be scheduled before deletion in churnOp recreate mode" on Jun 29, 2025.
@k8s-ci-robot
Contributor

@utam0k: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-integration · Commit: d3bfd9f · Details: link · Required: true · Rerun command: /test pull-kubernetes-integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@@ -1868,7 +1921,29 @@ func (e *WorkloadExecutor) runChurnOp(opIndex int, op *churnOp) error {
		retVals[i] = make([]string, op.Number)
	}

	count := 0
	// Create all resources first
Member

Doing this, we will wait only for the first batch of pods. Then, after we create a second batch, we will proceed with deletion without waiting.

Member Author

@macsko
Thank you for pointing out this issue! You're right that the current implementation only waits for the first batch of pods; subsequent batches proceed without waiting.

I've been thinking about how to properly address this while maintaining the intended behavior of churnOp, and here are a few approaches I'm considering:

Option 1: Keep it simple - remove the waiting logic entirely.

  • Rely only on the scheduled check before deletion; unscheduled pods will remain until they are scheduled.
  • Pros: Simple, consistent behavior throughout the test.
  • Cons: Some pods might accumulate in the Pending state.

Option 2: Periodic synchronization - wait every N cycles.

  // Wait when starting a new cycle of pods
  if hasPods && count % op.Number == 0 && count > op.Number {
      waitForScheduledPodsInNamespace(...)
  }

  • Pros: Ensures fair treatment of all pod generations.
  • Cons: Adds complexity and periodic pauses.

Option 3: Dynamic throttling - adjust the churn rate based on the scheduled-pod ratio, and only proceed with deletion when X% of pods are scheduled.

  • Pros: Self-adjusts to scheduler performance; adaptive to system state.
  • Cons: More complex implementation.

I'm leaning towards Option 1, which removes the wait entirely. This option keeps the implementation simple while still addressing the original issue, #125974, by preventing the deletion of unscheduled pods.

What's your preference? Are there any other approaches that would better align with the goals of scheduler_perf testing?
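
For illustration, a rough sketch of the "scheduled check before deletion" from Option 1, again assuming a typed clientset and reusing the imports from the earlier sketch; deleteIfScheduled and its signature are hypothetical and not part of the PR:

// Hypothetical Option 1 helper: delete the pod only if the scheduler has
// already bound it to a node; otherwise leave it pending for a later cycle.
func deleteIfScheduled(ctx context.Context, cs kubernetes.Interface, namespace, name string) (bool, error) {
	pod, err := cs.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	if pod.Spec.NodeName == "" {
		// Not scheduled yet: skip deletion so churn never removes a pod the
		// scheduler has not placed.
		return false, nil
	}
	if err := cs.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		return false, err
	}
	return true, nil
}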

Member

@macsko macsko Jul 1, 2025

What if we do:

churnFns = append(churnFns, func(name string) string {
	if name != "" {
		// New: Wait for `name` pod to be scheduled here, before deleting
		if isPod(name) {
			waitForPod(name)
		}
		if err := dynRes.Delete(e.tCtx, name, metav1.DeleteOptions{}); err != nil && !errors.Is(err, context.Canceled) {
			e.tCtx.Errorf("op %d: unable to delete %v: %v", opIndex, name, err)
		}
		return ""
	}

	live, err := dynRes.Create(e.tCtx, unstructuredObj, metav1.CreateOptions{})
	if err != nil {
		return ""
	}
	return live.GetName()
})

We could also put this wait-before-deletion mechanism behind a boolean flag, to allow using the old behavior as well.
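
The isPod and waitForPod calls above are pseudocode; one possible shape for waitForPod, assuming a typed clientset and the same imports as the earlier sketch, is a simple poll on the pod's node assignment:

// Hypothetical realization of the waitForPod helper sketched above: poll a
// single pod until the scheduler has bound it to a node, or the timeout hits.
func waitForPod(ctx context.Context, cs kubernetes.Interface, namespace, name string) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := cs.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			return pod.Spec.NodeName != "", nil
		})
}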

Labels

area/test
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/feature (Categorizes issue or PR as related to a new feature.)
needs-priority (Indicates a PR lacks a `priority/foo` label and requires one.)
needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
sig/scheduling (Categorizes an issue or PR as relevant to SIG Scheduling.)
sig/testing (Categorizes an issue or PR as relevant to SIG Testing.)
size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make churnOp in scheduler_perf more useful for recreating the pods
3 participants