scheduler: avoid false "unschedulable" pod state #120334

Merged

Conversation

@pohly pohly commented Sep 1, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a plugin was recorded as "unschedulable" for a pod in a previous scheduling attempt, the pod kept that attribute forever. When that plugin later failed with an error that requires backoff, the pod was incorrectly moved to the "unschedulable" queue, where it got stuck until the periodic flushing, because there was no event that the plugin was waiting for.

Here's an example where that happened:

framework.go:1280: E0831 20:03:47.184243] Reserve/DynamicResources: Plugin failed err="Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" node="scheduler-perf-dra-7l2v2" plugin="DynamicResources" pod="test/test-dragxd5c"
schedule_one.go:1001: E0831 20:03:47.184345] Error scheduling pod; retrying err="running Reserve plugin \"DynamicResources\": Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" pod="test/test-dragxd5c"
...
scheduling_queue.go:745: I0831 20:03:47.198968] Pod moved to an internal scheduling queue pod="test/test-dragxd5c" event="ScheduleAttemptFailure" queue="Unschedulable" schedulingCycle=9576 hint="QueueSkip"

Pop still needs the information about unschedulable plugins to update the UnschedulableReason metric. It can reset that information before returning the PodInfo for the next scheduling attempt.
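
Conceptually, the fix sits at the top of Pop. Here is a minimal sketch (queue internals simplified; not the verbatim diff, which is quoted further down):

    // Sketch only: locking, type assertions, and other bookkeeping of the
    // real PriorityQueue.Pop are elided.
    func (p *PriorityQueue) Pop() (*framework.QueuedPodInfo, error) {
        obj, err := p.activeQ.Pop()
        if err != nil {
            return nil, err
        }
        pInfo := obj.(*framework.QueuedPodInfo)
        // The metric update still relies on the plugins recorded by the
        // previous attempt ...
        for plugin := range pInfo.UnschedulablePlugins {
            metrics.UnschedulableReason(plugin, pInfo.Pod.Spec.SchedulerName).Dec()
        }
        // ... but the next attempt must start clean: a stale entry would send
        // a pod that failed with a plain error to the "unschedulable" queue
        // instead of the backoff queue.
        pInfo.UnschedulablePlugins.Clear()
        return pInfo, nil
    }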

Special notes for your reviewer:

This was found when repeatedly running

    go test -timeout=0 -count=1 -v -bench=BenchmarkPerfScheduling/.*Multiple.*Claim.*/2000pods_100nodes -benchtime=1ns -run=xxx ./test/integration/scheduler_perf -args -perf-scheduling-label-filter=

The new SchedulingWithMultipleResourceClaims test case is not enabled by default in any Prow job. It is more likely to trigger the problem because it involves multiple DRA drivers and multiple claims per pod.

Does this PR introduce a user-facing change?

Sometimes, the scheduler incorrectly placed a pod in the "unschedulable" queue instead of the "backoff" queue. This happened when some plugin previously declared the pod "unschedulable" and then, in a later attempt, encountered some other error. Scheduling of that pod was then delayed by up to five minutes, after which periodic flushing moved the pod back into the "active" queue.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 1, 2023
@k8s-ci-robot

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Sep 1, 2023
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 1, 2023
@pohly pohly commented Sep 1, 2023

Once we agree on a solution, that solution should get backported.

@sanposhiho sanposhiho left a comment

/lgtm

@kubernetes/sig-scheduling-leads If anyone has concerns about this change, please let us know.
Leaving @alculquicondor for /approve, just in case.

/assign @alculquicondor

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 3, 2023
@k8s-ci-robot

LGTM label has been added.

Git tree hash: 14c941e98b68ac6b259189c8b8f1870488abf5a7

// Update the unschedulable-reason metric for each plugin recorded by the
// previous attempt, then clear the set so the next attempt starts clean.
for plugin := range pInfo.UnschedulablePlugins {
    metrics.UnschedulableReason(plugin, pInfo.Pod.Spec.SchedulerName).Dec()
}
pInfo.UnschedulablePlugins.Clear()
Member

LGTM. Can we make it UTed?

Contributor Author

What does "UTed" mean?

Contributor Author

Probably "unit tested" - I don't know the unit tests for this code well, but I can have a look at where such a test might fit.

Member

yup, unit tested. It'd be nice to cover the code change in a UT, or an integration test if you think that's easier.

Contributor Author

I couldn't find any unit tests for the "unschedulable_pods" metric, which is the other reason why Pop needs pInfo.UnschedulablePlugins.

Looking through the code, I am not convinced that this metric was handled correctly at all.

For example, when a pod first gets added to the unschedulable queue and then moved to the backoff queue, should it still count as "unschedulable"? I don't think so. Not only does the code not decrement the count in this case, it increments it again, counting the same pod twice, because AddUnschedulableIfNotPresent gets called twice - or at least that is what happens when the addUnschedulablePodBackToUnschedulablePods and moveAllToActiveOrBackoffQ operations run.
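
A hypothetical illustration of that double-count (not actual scheduler code; only the metric call pattern is taken from the change above):

    // Both requeue operations increment the gauge and nothing decrements it
    // in between, so the same pod ends up counted twice.
    metrics.UnschedulableReason("fooPlugin", "default-scheduler").Inc() // first AddUnschedulableIfNotPresent
    // ... pod moves from the unschedulable queue to the backoff queue ...
    metrics.UnschedulableReason("fooPlugin", "default-scheduler").Inc() // second call: same pod, counted again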

I don't have time to untangle this right now. Can we merge this PR as-is as a hotfix and then clean up later? "unschedulable_pods" is alpha, so it's okay-ish that it remains broken in 1.28.

I created #120395 with instructions and a preliminary unit test.

Member

But we don't run ForgetPod if Reserve fails, which I think is what happens in DRA.

We actually do, I think - whenever any Reserve plugin returns a non-success code:

// Run the Reserve method of reserve plugins.
if sts := fwk.RunReservePluginsReserve(ctx, state, assumedPod, scheduleResult.SuggestedHost); !sts.IsSuccess() {
    // trigger un-reserve to clean up state associated with the reserved Pod
    fwk.RunReservePluginsUnreserve(ctx, state, assumedPod, scheduleResult.SuggestedHost)
    if forgetErr := sched.Cache.ForgetPod(logger, assumedPod); forgetErr != nil {
        logger.Error(forgetErr, "Scheduler cache ForgetPod failed")
    }
    // ... (remainder of the error handling elided in this quote)
}

Member

But we don't run ForgetPod if Reserve fails, which I think is what happens in DRA.

Then maybe this PR actually fixes the issue?

This PR is semantically correct regardless, so we should merge it. But maybe we can focus the UT on having the plugins cleared, or have a UT at a higher level (maybe scheduler_test.go).

Contributor Author

I agree, it's hard to be certain about "correct" behavior at this level because the unit tests must faithfully replicate what happens in "real" scheduling attempts and that seems to be ambiguous.

Let me remove the unit tests and archive them here instead:

commit 2ce67afa02be3912576604900ea9eb7eefdb393d
Author: Patrick Ohly <[email protected]>
Date:   Wed Sep 6 10:52:38 2023 +0200

    WIP: unit tests

diff --git a/pkg/scheduler/internal/queue/scheduling_queue_test.go b/pkg/scheduler/internal/queue/scheduling_queue_test.go
index 12b92bb7bdd..24c0b72d05a 100644
--- a/pkg/scheduler/internal/queue/scheduling_queue_test.go
+++ b/pkg/scheduler/internal/queue/scheduling_queue_test.go
@@ -542,6 +542,84 @@ func Test_InFlightPods(t *testing.T) {
 				},
 			},
 		},
+		{
+			name:                         "failed pod is placed in backoff queue with hints",
+			isSchedulingQueueHintEnabled: true,
+			initialPods:                  []*v1.Pod{pod},
+			actions: []action{
+				{callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
+				{callback: func(t *testing.T, q *PriorityQueue) {
+					logger, _ := ktesting.NewTestContext(t)
+					// Unschedulable.
+					poppedPod.UnschedulablePlugins = sets.New("fooPlugin1")
+					if err := q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle()); err != nil {
+						t.Errorf("Unexpected error from AddUnschedulableIfNotPresent: %v", err)
+					}
+				}},
+				{eventHappens: &PvAdd}, // Active again.
+				{callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
+				{callback: func(t *testing.T, q *PriorityQueue) {
+					logger, _ := ktesting.NewTestContext(t)
+					// Failed (i.e. no UnschedulablePlugins). Should go to backoff.
+					if err := q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle()); err != nil {
+						t.Errorf("Unexpected error from AddUnschedulableIfNotPresent: %v", err)
+					}
+				}},
+			},
+			wantBackoffQPodNames: []string{pod.Name},
+			wantInFlightPods:     nil,
+			wantInFlightEvents:   nil,
+			queueingHintMap: QueueingHintMapPerProfile{
+				"": {
+					PvAdd: {
+						{
+							// The hint fn tells that this event makes a Pod schedulable immediately.
+							PluginName:     "fooPlugin1",
+							QueueingHintFn: queueHintReturnQueueImmediately,
+						},
+					},
+				},
+			},
+		},
+		{
+			name:                         "failed pod is placed in backoff queue without hints",
+			isSchedulingQueueHintEnabled: false,
+			initialPods:                  []*v1.Pod{pod},
+			actions: []action{
+				{callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
+				{callback: func(t *testing.T, q *PriorityQueue) {
+					logger, _ := ktesting.NewTestContext(t)
+					// Unschedulable.
+					poppedPod.UnschedulablePlugins = sets.New("fooPlugin1")
+					if err := q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle()); err != nil {
+						t.Errorf("Unexpected error from AddUnschedulableIfNotPresent: %v", err)
+					}
+				}},
+				{eventHappens: &PvAdd}, // Active again.
+				{callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
+				{callback: func(t *testing.T, q *PriorityQueue) {
+					logger, _ := ktesting.NewTestContext(t)
+					// Failed (i.e. no UnschedulablePlugins). Should go to backoff.
+					if err := q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle()); err != nil {
+						t.Errorf("Unexpected error from AddUnschedulableIfNotPresent: %v", err)
+					}
+				}},
+			},
+			wantBackoffQPodNames: []string{pod.Name},
+			wantInFlightPods:     nil,
+			wantInFlightEvents:   nil,
+			queueingHintMap: QueueingHintMapPerProfile{
+				"": {
+					PvAdd: {
+						{
+							// The hint fn tells that this event makes a Pod schedulable immediately.
+							PluginName:     "fooPlugin1",
+							QueueingHintFn: queueHintReturnQueueImmediately,
+						},
+					},
+				},
+			},
+		},
 	}
 
 	for _, test := range tests {

Then I'll let you decide whether you want to merge this PR and add unit tests elsewhere, perhaps in #119105.

Member

I would still leave a test that makes sure the list of unschedulable plugins is empty in a second attempt.

Contributor Author

Done. The unit test fails (as it should) without the new pInfo.UnschedulablePlugins.Clear() in Pop.

@pohly pohly force-pushed the scheduler-clear-unschedulable-plugins branch from 904adcd to 2ce67af on September 6, 2023 08:52
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 6, 2023
@Huang-Wei

A side request: can we patch (instead of update) the status if possible?

_, err := pl.clientset.ResourceV1alpha2().ResourceClaims(claim.Namespace).UpdateStatus(ctx, claim, metav1.UpdateOptions{})
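
A hedged sketch of that suggestion (not code from this PR; the payload is made up, and types refers to k8s.io/apimachinery/pkg/types):

    // Patch only the status subresource instead of sending the whole object
    // through UpdateStatus.
    patch := []byte(`{"status":{"driverName":"example.com"}}`)
    claim, err := pl.clientset.ResourceV1alpha2().ResourceClaims(claim.Namespace).Patch(
        ctx, claim.Name, types.MergePatchType, patch, metav1.PatchOptions{}, "status")

As discussed below, a merge patch does not perform the resourceVersion conflict check that Update does, so it trades that safety for fewer conflicts.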

{callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
{callback: func(t *testing.T, q *PriorityQueue) {
    logger, _ := ktesting.NewTestContext(t)
    // Failed (i.e. no UnschedulablePlugins). Should go to backoff.
Member

It looks like even if the unschedulablePlugins get cleared, determineSchedulingHintForInFlightPod() still returns QueueSkip, and hence requeuePodViaQueueingHint() chooses the unschedulable queue for the pod.

So I'm skeptical that this PR alone fixes the issue.

Member

#119105 will fix that.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Sep 7, 2023
@Huang-Wei Huang-Wei commented Sep 7, 2023

Patching doesn't really provide much benefit in such a case, does it?

Patch doesn't check that. Let me open a new issue (#120502) to track a possible improvement.

@pohly pohly force-pushed the scheduler-clear-unschedulable-plugins branch from 0b7d153 to 8e79f18 on September 8, 2023 09:09
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 8, 2023
@@ -542,6 +542,47 @@ func Test_InFlightPods(t *testing.T) {
},
},
},
{
	name:                         "popped pod must have empty UnschedulablePlugins",
	isSchedulingQueueHintEnabled: true,
Member

let's add another one without scheduling hints.

We should backport that one to all supported versions.

Member

I guess it would have to be a separate test function, as this is specific to the inflight implementation.

Contributor Author

I've moved the new test case into TestPop and run it with and without hints.
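
For reference, the core assertion of that TestPop case looks roughly like this (a sketch based on the WIP diff above; popPod comes from that diff, the rest is assumed rather than the merged code):

    // Pop a pod that a previous attempt recorded as unschedulable and verify
    // that Pop cleared the set before handing the pod out again.
    poppedPod := popPod(t, q, pod)
    if len(poppedPod.UnschedulablePlugins) > 0 {
        t.Errorf("popped pod should have empty UnschedulablePlugins, got %v", poppedPod.UnschedulablePlugins)
    }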

@pohly pohly force-pushed the scheduler-clear-unschedulable-plugins branch from 8e79f18 to 4e73634 on September 8, 2023 14:52
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 8, 2023
@alculquicondor alculquicondor left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 8, 2023
@k8s-ci-robot

LGTM label has been added.

Git tree hash: 2aa953eb4e61f50c11ebae684e3405cf2a2c5487

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 8, 2023
@pohly pohly commented Sep 8, 2023

/retest

@k8s-ci-robot k8s-ci-robot commented Sep 8, 2023

@pohly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-linter-hints
Commit: 4e73634
Required: false
Rerun command: /test pull-kubernetes-linter-hints

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot merged commit 4168923 into kubernetes:master Sep 8, 2023
13 of 14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Sep 8, 2023
k8s-ci-robot added a commit that referenced this pull request Sep 21, 2023
…4-origin-release-1.27

Automated cherry pick of #120334: scheduler: start scheduling attempt with clean
k8s-ci-robot added a commit that referenced this pull request Sep 21, 2023
…4-origin-release-1.26

Automated cherry pick of #120334: scheduler: start scheduling attempt with clean
k8s-ci-robot added a commit that referenced this pull request Sep 21, 2023
…4-origin-release-1.28

Automated cherry pick of #120334: scheduler: start scheduling attempt with clean
k8s-ci-robot added a commit that referenced this pull request Sep 21, 2023
…4-origin-release-1.25

Automated cherry pick of #120334: scheduler: start scheduling attempt with clean