scheduler: avoid false "unschedulable" pod state #120334

Merged

Conversation

@pohly pohly commented Sep 1, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a plugin was recorded as "unschedulable" for a pod in a previous scheduling attempt, the pod kept that attribute forever. When that plugin later failed with an error that requires backoff, the pod was incorrectly moved to the "unschedulable" queue, where it got stuck until the periodic flushing, because there was no event that the plugin was waiting for.

Here's an example where that happened:

framework.go:1280: E0831 20:03:47.184243] Reserve/DynamicResources: Plugin failed err="Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" node="scheduler-perf-dra-7l2v2" plugin="DynamicResources" pod="test/test-dragxd5c"
schedule_one.go:1001: E0831 20:03:47.184345] Error scheduling pod; retrying err="running Reserve plugin \"DynamicResources\": Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" pod="test/test-dragxd5c"
...
scheduling_queue.go:745: I0831 20:03:47.198968] Pod moved to an internal scheduling queue pod="test/test-dragxd5c" event="ScheduleAttemptFailure" queue="Unschedulable" schedulingCycle=9576 hint="QueueSkip"

Pop still needs the information about unschedulable plugins to update the UnschedulableReason metric. It can reset that information before returning the PodInfo for the next scheduling attempt.
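
Conceptually, the fix sits at the top of Pop. Here is a minimal sketch (queue internals simplified; not the verbatim diff, which is quoted further down):

    // Sketch only: locking, type assertions, and other bookkeeping of the
    // real PriorityQueue.Pop are elided.
    func (p *PriorityQueue) Pop() (*framework.QueuedPodInfo, error) {
        obj, err := p.activeQ.Pop()
        if err != nil {
            return nil, err
        }
        pInfo := obj.(*framework.QueuedPodInfo)
        // The metric update still relies on the plugins recorded by the
        // previous attempt ...
        for plugin := range pInfo.UnschedulablePlugins {
            metrics.UnschedulableReason(plugin, pInfo.Pod.Spec.SchedulerName).Dec()
        }
        // ... but the next attempt must start clean: a stale entry would send
        // a pod that failed with a plain error to the "unschedulable" queue
        // instead of the backoff queue.
        pInfo.UnschedulablePlugins.Clear()
        return pInfo, nil
    }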

Special notes for your reviewer:

This was found when repeatedly running

    go test -timeout=0 -count=1 -v -bench=BenchmarkPerfScheduling/.*Multiple.*Claim.*/2000pods_100nodes -benchtime=1ns -run=xxx ./test/integration/scheduler_perf -args -perf-scheduling-label-filter=

The new SchedulingWithMultipleResourceClaims test case is not enabled by default in any Prow job. It is more likely to trigger the problem because it involves multiple DRA drivers and multiple claims per pod.

Does this PR introduce a user-facing change?

Sometimes, the scheduler incorrectly placed a pod in the "unschedulable" queue instead of the "backoff" queue. This happened when some plugin previously declared the pod "unschedulable" and then, in a later attempt, encountered some other error. Scheduling of that pod was then delayed by up to five minutes, after which periodic flushing moved the pod back into the "active" queue.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 1, 2023
@k8s-ci-robot

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Sep 1, 2023
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 1, 2023
@pohly pohly commented Sep 1, 2023

Once we agree on a solution, that solution should get backported.

@sanposhiho sanposhiho left a comment

/lgtm

@kubernetes/sig-scheduling-leads If anyone has concerns about this change, please let us know.
Leaving @alculquicondor for /approve, just in case.

/assign @alculquicondor

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 3, 2023
@k8s-ci-robot

LGTM label has been added.

Git tree hash: 14c941e98b68ac6b259189c8b8f1870488abf5a7

// Update the unschedulable-reason metric for each plugin recorded by the
// previous attempt, then clear the set so the next attempt starts clean.
for plugin := range pInfo.UnschedulablePlugins {
    metrics.UnschedulableReason(plugin, pInfo.Pod.Spec.SchedulerName).Dec()
}
pInfo.UnschedulablePlugins.Clear()
Member

LGTM. Can we make it UTed?

Contributor Author

What does "UTed" mean?

Contributor Author

Probably "unit tested" - I don't know the unit tests for this code well, but I can have a look at where such a test might fit.

Member

yup, unit tested. It'd be nice to cover the code change in a UT, or an integration test if you think that's easier.

Contributor Author

I couldn't find any unit tests for the "unschedulable_pods" metric, which is the other reason why Pop needs pInfo.UnschedulablePlugins.

Looking through the code, I am not convinced that this metric was handled correctly at all.

For example, when a pod first gets added to the unschedulable queue and then moved to the backoff queue, should it still count as "unschedulable"? I don't think so. Not only does the code not decrement the count in this case, it increments it again, counting the same pod twice, because AddUnschedulableIfNotPresent gets called twice - or at least that is what happens when the addUnschedulablePodBackToUnschedulablePods and moveAllToActiveOrBackoffQ operations run.
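
A hypothetical illustration of that double-count (not actual scheduler code; only the metric call pattern is taken from the change above):

    // Both requeue operations increment the gauge and nothing decrements it
    // in between, so the same pod ends up counted twice.
    metrics.UnschedulableReason("fooPlugin", "default-scheduler").Inc() // first AddUnschedulableIfNotPresent
    // ... pod moves from the unschedulable queue to the backoff queue ...
    metrics.UnschedulableReason("fooPlugin", "default-scheduler").Inc() // second call: same pod, counted again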

I don't have time to untangle this right now. Can we merge this PR as-is as a hotfix and then clean up later? "unschedulable_pods" is alpha, so it's okay-ish that it remains broken in 1.28.

I created #120395 with instructions and a preliminary unit test.

Member

But we don't run ForgetPod if Reserve fails, which I think is what happens in DRA.

We actually do, I think - whenever any Reserve plugin returns a non-success code:

// Run the Reserve method of reserve plugins.
if sts := fwk.RunReservePluginsReserve(ctx, state, assumedPod, scheduleResult.SuggestedHost); !sts.IsSuccess() {
    // trigger un-reserve to clean up state associated with the reserved Pod
    fwk.RunReservePluginsUnreserve(ctx, state, assumedPod, scheduleResult.SuggestedHost)
    if forgetErr := sched.Cache.ForgetPod(logger, assumedPod); forgetErr != nil {
        logger.Error(forgetErr, "Scheduler cache ForgetPod failed")
    }
    // ... (remainder of the error handling elided in this quote)
}

Member

But we don't run ForgetPod if Reserve fails, which I think is what happens in DRA.

Then maybe this PR actually fixes the issue?

This PR is semantically correct regardless, so we should merge it. But maybe we can focus the UT on having the plugins cleared, or have a UT at a higher level (maybe scheduler_test.go).

Contributor Author

I agree, it's hard to be certain about "correct" behavior at this level because the unit tests must faithfully replicate what happens in "real" scheduling attempts and that seems to be ambiguous.

Let me remove the unit tests and archive them here instead:

commit 2ce67afa02be3912576604900ea9eb7eefdb393d
Author: Patrick Ohly <[email protected]>
Date:   Wed Sep 6 10:52:38 2023 +0200

    WIP: unit tests

diff --git a/pkg/scheduler/internal/queue/scheduling_queue_test.go b/pkg/scheduler/internal/queue/scheduling_queue_test.go
index 12b92bb7bdd..24c0b72d05a 100644
--- a/pkg/scheduler/internal/queue/scheduling_queue_test.go
+++ b/pkg/scheduler/internal/queue/scheduling_queue_test.go
@@ -542,6 +542,84 @@ func Test_InFlightPods(t *testing.T) {
 				},
 			},
 		},
+		{
+			name:                         "failed pod is placed in backoff queue with hints",
+			isSchedulingQueueHintEnabled: true,
+			initialPods:                  []*v1.Pod{pod},
+			actions: []action{
+				{callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
+				{callback: func(t *testing.T, q *PriorityQueue) {
+					logger, _ := ktesting.NewTestContext(t)
+					// Unschedulable.
+					poppedPod.UnschedulablePlugins = sets.New("fooPlugin1")
+					if err := q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle()); err != nil {
+						t.Errorf("Unexpected error from AddUnschedulableIfNotPresent: %v", err)
+					}
+				}},
+				{eventHappens: &PvAdd}, // Active again.
+				{callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
+				{callback: func(t *testing.T, q *PriorityQueue) {
+					logger, _ := ktesting.NewTestContext(t)
+					// Failed (i.e. no UnschedulablePlugins). Should go to backoff.
+					if err := q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle()); err != nil {
+						t.Errorf("Unexpected error from AddUnschedulableIfNotPresent: %v", err)
+					}
+				}},
+			},
+			wantBackoffQPodNames: []string{pod.Name},
+			wantInFlightPods:     nil,
+			wantInFlightEvents:   nil,
+			queueingHintMap: QueueingHintMapPerProfile{
+				"": {
+					PvAdd: {
+						{
+							// The hint fn tells that this event makes a Pod schedulable immediately.
+							PluginName:     "fooPlugin1",
+							QueueingHintFn: queueHintReturnQueueImmediately,
+						},
+					},
+				},
+			},
+		},
+		{
+			name:                         "failed pod is placed in backoff queue without hints",
+			isSchedulingQueueHintEnabled: false,
+			initialPods:                  []*v1.Pod{pod},
+			actions: []action{
+				{callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
+				{callback: func(t *testing.T, q *PriorityQueue) {
+					logger, _ := ktesting.NewTestContext(t)
+					// Unschedulable.
+					poppedPod.UnschedulablePlugins = sets.New("fooPlugin1")
+					if err := q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle()); err != nil {
+						t.Errorf("Unexpected error from AddUnschedulableIfNotPresent: %v", err)
+					}
+				}},
+				{eventHappens: &PvAdd}, // Active again.
+				{callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
+				{callback: func(t *testing.T, q *PriorityQueue) {
+					logger, _ := ktesting.NewTestContext(t)
+					// Failed (i.e. no UnschedulablePlugins). Should go to backoff.
+					if err := q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle()); err != nil {
+						t.Errorf("Unexpected error from AddUnschedulableIfNotPresent: %v", err)
+					}
+				}},
+			},
+			wantBackoffQPodNames: []string{pod.Name},
+			wantInFlightPods:     nil,
+			wantInFlightEvents:   nil,
+			queueingHintMap: QueueingHintMapPerProfile{
+				"": {
+					PvAdd: {
+						{
+							// The hint fn tells that this event makes a Pod schedulable immediately.
+							PluginName:     "fooPlugin1",
+							QueueingHintFn: queueHintReturnQueueImmediately,
+						},
+					},
+				},
+			},
+		},
 	}
 
 	for _, test := range tests {

Then I'll let you decide whether you want to merge this PR and add unit tests elsewhere, perhaps in #119105.

Member

I would still leave a test that makes sure the list of unschedulable plugins is empty in a second attempt.

Contributor Author

Done. The unit test fails (as it should) without the new pInfo.UnschedulablePlugins.Clear() in Pop.

@pohly pohly force-pushed the scheduler-clear-unschedulable-plugins branch from 904adcd to 2ce67af on September 6, 2023 08:52
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 6, 2023
@Huang-Wei

A side request: can we patch (instead of update) the status if possible?

_, err := pl.clientset.ResourceV1alpha2().ResourceClaims(claim.Namespace).UpdateStatus(ctx, claim, metav1.UpdateOptions{})
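
A hedged sketch of that suggestion (not code from this PR; the payload is made up, and types refers to k8s.io/apimachinery/pkg/types):

    // Patch only the status subresource instead of sending the whole object
    // through UpdateStatus.
    patch := []byte(`{"status":{"driverName":"example.com"}}`)
    claim, err := pl.clientset.ResourceV1alpha2().ResourceClaims(claim.Namespace).Patch(
        ctx, claim.Name, types.MergePatchType, patch, metav1.PatchOptions{}, "status")

As discussed below, a merge patch does not perform the resourceVersion conflict check that Update does, so it trades that safety for fewer conflicts.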

{callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
{callback: func(t *testing.T, q *PriorityQueue) {
    logger, _ := ktesting.NewTestContext(t)
    // Failed (i.e. no UnschedulablePlugins). Should go to backoff.
Member

It looks like even if the unschedulablePlugins get cleared, determineSchedulingHintForInFlightPod() still returns QueueSkip, and hence requeuePodViaQueueingHint() chooses the unschedulable queue for the pod.

So I'm skeptical that this PR alone fixes the issue.

Member

#119105 will fix that.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Sep 7, 2023
@Huang-Wei Huang-Wei commented Sep 7, 2023

Patching doesn't really provide much benefit in such a case, does it?

Patch doesn't check that. Let me open a new issue (#120502) to track a possible improvement.

@pohly pohly force-pushed the scheduler-clear-unschedulable-plugins branch from 0b7d153 to 8e79f18 on September 8, 2023 09:09
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 8, 2023
@@ -542,6 +542,47 @@ func Test_InFlightPods(t *testing.T) {
},
},
},
{
	name:                         "popped pod must have empty UnschedulablePlugins",
	isSchedulingQueueHintEnabled: true,
Member

let's add another one without scheduling hints.

We should backport that one to all supported versions.

Member

I guess it would have to be a separate test function, as this is specific to the inflight implementation.

Contributor Author

I've moved the new test case into TestPop and run it with and without hints.
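
For reference, the core assertion of that TestPop case looks roughly like this (a sketch based on the WIP diff above; popPod comes from that diff, the rest is assumed rather than the merged code):

    // Pop a pod that a previous attempt recorded as unschedulable and verify
    // that Pop cleared the set before handing the pod out again.
    poppedPod := popPod(t, q, pod)
    if len(poppedPod.UnschedulablePlugins) > 0 {
        t.Errorf("popped pod should have empty UnschedulablePlugins, got %v", poppedPod.UnschedulablePlugins)
    }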

@pohly pohly force-pushed the scheduler-clear-unschedulable-plugins branch from 8e79f18 to 4e73634 on September 8, 2023 14:52
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 8, 2023
@alculquicondor alculquicondor left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 8, 2023
@k8s-ci-robot

LGTM label has been added.

Git tree hash: 2aa953eb4e61f50c11ebae684e3405cf2a2c5487

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 8, 2023
@pohly pohly commented Sep 8, 2023

/retest

@k8s-ci-robot k8s-ci-robot commented Sep 8, 2023

@pohly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-linter-hints
Commit: 4e73634
Required: false
Rerun command: /test pull-kubernetes-linter-hints

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot merged commit 4168923 into kubernetes:master Sep 8, 2023
13 of 14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Sep 8, 2023
k8s-ci-robot added a commit that referenced this pull request Sep 21, 2023
…4-origin-release-1.27

Automated cherry pick of #120334: scheduler: start scheduling attempt with clean
k8s-ci-robot added a commit that referenced this pull request Sep 21, 2023
…4-origin-release-1.26

Automated cherry pick of #120334: scheduler: start scheduling attempt with clean
k8s-ci-robot added a commit that referenced this pull request Sep 21, 2023
…4-origin-release-1.28

Automated cherry pick of #120334: scheduler: start scheduling attempt with clean
k8s-ci-robot added a commit that referenced this pull request Sep 21, 2023
…4-origin-release-1.25

Automated cherry pick of #120334: scheduler: start scheduling attempt with clean