scheduler: avoid false "unschedulable" pod state #120334
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Once we agree on a solution, that solution should get backported.
/lgtm
@kubernetes/sig-scheduling-leads If other people have any concerns about this, or any idea why we're doing this, please let us know.
Leave @alculquicondor for /approval
just in case.
/assign @alculquicondor
LGTM label has been added. Git tree hash: 14c941e98b68ac6b259189c8b8f1870488abf5a7
for plugin := range pInfo.UnschedulablePlugins {
	metrics.UnschedulableReason(plugin, pInfo.Pod.Spec.SchedulerName).Dec()
}
pInfo.UnschedulablePlugins.Clear()
LGTM. Can we make it UTed?
What does "UTed" mean?
Probably "unit tested" - I don't know the unit tests for this code well, but I can have a look where such a test might fit.
yup, unit tested. It'd be nice to cover the code change in UT, or integration test if you think that's easier.
I couldn't find any unit tests for the "unschedulable_pods" metric, which is the other reason why Pop needs pInfo.UnschedulablePlugins.
Looking through the code I am not convinced that this metric was handled correctly at all.
For example, when a pod gets added first to the unschedulable queue and then moved to the backoff queue, should it still count as "unschedulable"? I don't think so. Not only does the code not decrement in this case, it increments again, counting the same pod twice, because AddUnschedulableIfNotPresent gets called twice - or at least that is what happens when the addUnschedulablePodBackToUnschedulablePods and moveAllToActiveOrBackoffQ operations get called.
I don't have time to untangle this right now. Can we merge this PR as-is as a hotfix and then clean up later? "unschedulable_pods" is alpha, so it's okayish that it remains broken in 1.28.
I created #120395 with instructions and a preliminary unit test.
But we don't run ForgetPod if Reserve fails, which I think is what happens in DRA.
We actually do I think - when any Reserve plugin returns non-success code:
kubernetes/pkg/scheduler/schedule_one.go
Lines 201 to 207 in 130a5a4
// Run the Reserve method of reserve plugins.
if sts := fwk.RunReservePluginsReserve(ctx, state, assumedPod, scheduleResult.SuggestedHost); !sts.IsSuccess() {
	// trigger un-reserve to clean up state associated with the reserved Pod
	fwk.RunReservePluginsUnreserve(ctx, state, assumedPod, scheduleResult.SuggestedHost)
	if forgetErr := sched.Cache.ForgetPod(logger, assumedPod); forgetErr != nil {
		logger.Error(forgetErr, "Scheduler cache ForgetPod failed")
	}
But we don't run ForgetPod if Reserve fails, which I think is what happens in DRA.
Then this PR maybe actually fixes the issue?
This PR is semantically correct regardless, so we should merge it. But maybe we can focus the UT on having the plugins cleared or have a UT at a higher level (maybe scheduler_test.go)
I agree, it's hard to be certain about "correct" behavior at this level because the unit tests must faithfully replicate what happens in "real" scheduling attempts and that seems to be ambiguous.
Let me remove the unit tests and archive them here instead:
commit 2ce67afa02be3912576604900ea9eb7eefdb393d
Author: Patrick Ohly <[email protected]>
Date: Wed Sep 6 10:52:38 2023 +0200
WIP: unit tests
diff --git a/pkg/scheduler/internal/queue/scheduling_queue_test.go b/pkg/scheduler/internal/queue/scheduling_queue_test.go
index 12b92bb7bdd..24c0b72d05a 100644
--- a/pkg/scheduler/internal/queue/scheduling_queue_test.go
+++ b/pkg/scheduler/internal/queue/scheduling_queue_test.go
@@ -542,6 +542,84 @@ func Test_InFlightPods(t *testing.T) {
},
},
},
+ {
+ name: "failed pod is placed in backoff queue with hints",
+ isSchedulingQueueHintEnabled: true,
+ initialPods: []*v1.Pod{pod},
+ actions: []action{
+ {callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
+ {callback: func(t *testing.T, q *PriorityQueue) {
+ logger, _ := ktesting.NewTestContext(t)
+ // Unschedulable.
+ poppedPod.UnschedulablePlugins = sets.New("fooPlugin1")
+ if err := q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle()); err != nil {
+ t.Errorf("Unexpected error from AddUnschedulableIfNotPresent: %v", err)
+ }
+ }},
+ {eventHappens: &PvAdd}, // Active again.
+ {callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
+ {callback: func(t *testing.T, q *PriorityQueue) {
+ logger, _ := ktesting.NewTestContext(t)
+ // Failed (i.e. no UnschedulablePlugins). Should go to backoff.
+ if err := q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle()); err != nil {
+ t.Errorf("Unexpected error from AddUnschedulableIfNotPresent: %v", err)
+ }
+ }},
+ },
+ wantBackoffQPodNames: []string{pod.Name},
+ wantInFlightPods: nil,
+ wantInFlightEvents: nil,
+ queueingHintMap: QueueingHintMapPerProfile{
+ "": {
+ PvAdd: {
+ {
+ // The hint fn tells that this event makes a Pod schedulable immediately.
+ PluginName: "fooPlugin1",
+ QueueingHintFn: queueHintReturnQueueImmediately,
+ },
+ },
+ },
+ },
+ },
+ {
+ name: "failed pod is placed in backoff queue without hints",
+ isSchedulingQueueHintEnabled: false,
+ initialPods: []*v1.Pod{pod},
+ actions: []action{
+ {callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
+ {callback: func(t *testing.T, q *PriorityQueue) {
+ logger, _ := ktesting.NewTestContext(t)
+ // Unschedulable.
+ poppedPod.UnschedulablePlugins = sets.New("fooPlugin1")
+ if err := q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle()); err != nil {
+ t.Errorf("Unexpected error from AddUnschedulableIfNotPresent: %v", err)
+ }
+ }},
+ {eventHappens: &PvAdd}, // Active again.
+ {callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
+ {callback: func(t *testing.T, q *PriorityQueue) {
+ logger, _ := ktesting.NewTestContext(t)
+ // Failed (i.e. no UnschedulablePlugins). Should go to backoff.
+ if err := q.AddUnschedulableIfNotPresent(logger, poppedPod, q.SchedulingCycle()); err != nil {
+ t.Errorf("Unexpected error from AddUnschedulableIfNotPresent: %v", err)
+ }
+ }},
+ },
+ wantBackoffQPodNames: []string{pod.Name},
+ wantInFlightPods: nil,
+ wantInFlightEvents: nil,
+ queueingHintMap: QueueingHintMapPerProfile{
+ "": {
+ PvAdd: {
+ {
+ // The hint fn tells that this event makes a Pod schedulable immediately.
+ PluginName: "fooPlugin1",
+ QueueingHintFn: queueHintReturnQueueImmediately,
+ },
+ },
+ },
+ },
+ },
}
for _, test := range tests {
Then I'll let you decide whether you want to merge this PR and add unit tests elsewhere, perhaps in #119105.
I would still leave a test that makes sure the list of unschedulable plugins is empty in a second attempt.
Done. The unit test fails (as it should) without the new pInfo.UnschedulablePlugins.Clear() in Pop.
Force-pushed from 904adcd to 2ce67af.
A side request: can we
{callback: func(t *testing.T, q *PriorityQueue) { poppedPod = popPod(t, q, pod) }},
{callback: func(t *testing.T, q *PriorityQueue) {
	logger, _ := ktesting.NewTestContext(t)
	// Failed (i.e. no UnschedulablePlugins). Should go to backoff.
It looks like even if the unschedulablePlugins get cleared, determineSchedulingHintForInFlightPod() still returns QueueSkip, and hence requeuePodViaQueueingHint() chooses the unschedulable queue to place the pod. So I'm skeptical that this PR would fix the issue.
#119105 will fix that.
Patch doesn't check that. Let me open a new issue (#120502) to track a possible improvement.
Force-pushed from 0b7d153 to 8e79f18.
@@ -542,6 +542,47 @@ func Test_InFlightPods(t *testing.T) {
		},
	},
},
{
	name: "popped pod must have empty UnschedulablePlugins",
	isSchedulingQueueHintEnabled: true,
let's add another one without scheduling hints.
We should backport that one to all supported versions
I guess it would have to be a separate test function, as this is specific to the inflight implementation.
I've moved the new test case into TestPop and run it with and without hints.
When some plugin was registered as "unschedulable" in some previous scheduling attempt, it kept that attribute for a pod forever. When that plugin then later failed with an error that requires backoff, the pod was incorrectly moved to the "unschedulable" queue where it got stuck until the periodic flushing because there was no event that the plugin was waiting for.

Here's an example where that happened:

framework.go:1280: E0831 20:03:47.184243] Reserve/DynamicResources: Plugin failed err="Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" node="scheduler-perf-dra-7l2v2" plugin="DynamicResources" pod="test/test-dragxd5c"
schedule_one.go:1001: E0831 20:03:47.184345] Error scheduling pod; retrying err="running Reserve plugin \"DynamicResources\": Operation cannot be fulfilled on podschedulingcontexts.resource.k8s.io \"test-dragxd5c\": the object has been modified; please apply your changes to the latest version and try again" pod="test/test-dragxd5c"
...
scheduling_queue.go:745: I0831 20:03:47.198968] Pod moved to an internal scheduling queue pod="test/test-dragxd5c" event="ScheduleAttemptFailure" queue="Unschedulable" schedulingCycle=9576 hint="QueueSkip"

Pop still needs the information about unschedulable plugins to update the UnschedulableReason metric. It can reset that information before returning the PodInfo for the next scheduling attempt.
Force-pushed from 8e79f18 to 4e73634.
/lgtm
/approve
LGTM label has been added. Git tree hash: 2aa953eb4e61f50c11ebae684e3405cf2a2c5487
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, pohly. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
@pohly: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
…4-origin-release-1.27 Automated cherry pick of #120334: scheduler: start scheduling attempt with clean
…4-origin-release-1.26 Automated cherry pick of #120334: scheduler: start scheduling attempt with clean
…4-origin-release-1.28 Automated cherry pick of #120334: scheduler: start scheduling attempt with clean
…4-origin-release-1.25 Automated cherry pick of #120334: scheduler: start scheduling attempt with clean
What type of PR is this?
/kind bug
What this PR does / why we need it:
When some plugin was registered as "unschedulable" in some previous scheduling attempt, it kept that attribute for a pod forever. When that plugin then later failed with an error that requires backoff, the pod was incorrectly moved to the "unschedulable" queue where it got stuck until the periodic flushing because there was no event that the plugin was waiting for.
Here's an example where that happened:
Pop still needs the information about unschedulable plugins to update the UnschedulableReason metric. It can reset that information before returning the PodInfo for the next scheduling attempt.
Special notes for your reviewer:
This was found when running
go test -timeout=0 -count=1 -v -bench=BenchmarkPerfScheduling/.*Multiple.*Claim.*/2000pods_100nodes -benchtime=1ns -run=xxx ./test/integration/scheduler_perf -args -perf-scheduling-label-filter=
repeatedly. That new SchedulingWithMultipleResourceClaims test case is not enabled by default in any Prow job. It is more likely to trigger the problem because it involves multiple DRA drivers and multiple claims per pod.
Does this PR introduce a user-facing change?