-
Notifications
You must be signed in to change notification settings - Fork 38.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature(scheduler): simplify QueueingHintFn by introducing new status Pending
#119517
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
5d9d802
to
c0d16fd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/hold
We should involve other approves as this PR proposes something completely new.
71c9512
to
428a670
Compare
cc @pohly |
@sanposhiho are you still working on this? |
Oh, sorry.. I didn't notice CI failed and I thought it's ready to get reviews. Let me modify it. |
6e2aca3
to
c8f8820
Compare
c8f8820
to
cb5dc46
Compare
// So, Pods rejected by such reasons don't need to suffer a penalty (backoff). | ||
// When the scheduling queue requeues Pods, which was rejected with Pending in the last scheduling, | ||
// the Pod goes to activeQ directly ignoring backoff. | ||
Pending |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer we keep them separate, to avoid fitting too many concepts into one word.
pkg/scheduler/framework/interface.go
Outdated
@@ -151,7 +180,7 @@ type Status struct { | |||
reasons []string | |||
err error | |||
// failedPlugin is an optional field that records the plugin name a Pod failed by. | |||
// It's set by the framework when code is Error, Unschedulable or UnschedulableAndUnresolvable. | |||
// It's set by the framework when code is Unschedulable or UnschedulableAndUnresolvable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or Pending?
maybe failedPlugin
is not the best name, but we can rename it in a separate PR. Maybe plugin
is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can rename it in a separate PR. Maybe plugin is enough.
👍
@@ -131,6 +131,7 @@ func TestStatusCodes(t *testing.T) { | |||
assertStatusCode(t, UnschedulableAndUnresolvable, 3) | |||
assertStatusCode(t, Wait, 4) | |||
assertStatusCode(t, Skip, 5) | |||
assertStatusCode(t, Pending, 6) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually don't see the point of this test, but we could remove it in another PR, if you agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% agree.
// because the scheduling result is used to proceed the Pod's scheduling forward, | ||
// that particular scheduling cycle is failed though. | ||
// So, Pods rejected by such reasons don't need to suffer a penalty (backoff). | ||
// When the scheduling queue requeues Pods, which was rejected with Pending in the last scheduling, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this scenario?
A pod failed with two plugins, plugin A returning Pending and plugin B returning Unschedulable.
An event comes and plugin A's hint is Queue, while plugin B's is Skip.
Will this Pod re-enter the queue? I guess it does, and it should, even if it's not schedulable yet. But a scheduling attempt can clear plugin A from the list of failed plugins, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that Pod is moved to activeQ immediately. As you said, in the next scheduling, probably plugin A is removed from the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops the scenario I had in mind was the opposite:
A pod failed with two plugins, plugin A returning Pending and plugin B returning Unschedulable.
An event comes and plugin B's hint is Queue, while plugin A's is Skip.
But the same reasoning applies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
LGTM label has been added. Git tree hash: e584ccf8c1d97596524b2fdc504d9c2f8be55c5f
|
/retest |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, sanposhiho The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel |
@alculquicondor Thanks. I opened two follow up PRs: |
Pending
What type of PR is this?
/kind feature
What this PR does / why we need it:
QueueingHintFn returns three values, QueueSkip, QueueAfterBackoff, and QueueImmediately.
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/types.go#L109-L121
QueueSkip is OK as it's pretty clear, but it's unclear when to return QueueAfterBackoff and when to return QueueImmediately. Then, the current unclear situation results in different opinions from different people about which value we should return like this.
#118551 (comment) is the thread which I wanted to clarify that point.
In the thread, #118551 (comment) is one idea to clarify this.
BackoffQ is a light way of keeping throughput high by preventing pods that are "permanently unschedulable" from blocking the queue. (quote from #117561 (comment)).
So, the backoff delay is a penalty of wasting the scheduling cycle -- the more rejection the Pod has gotten at the scheduling cycle, the more delay that Pod should get in backoffQ as a penalty for wasting scheduling cycles.
And based on that concept, we split the failure into two reasons:
And, we consider they equal when to return QueueAfterBackoff (the former) and when to return QueueImmediately (the latter).
Based on the above thoughts from the discussion, it feels like we shouldn't control how to requeue Pods via the returning values (QueueAfterBackoff/QueueImmediately). Rather, I believe we should control it how the Pod is rejected (with which status the Pod is rejected).
That's more natural than QueueAfterBackoff/QueueImmediately because we define when to return QueueAfterBackoff/QueueImmediately is tightly coupled with the reason the Pod was rejected.
This PR proposes to simplify QueueingHintFn's returning value.
QueueingHint will have only two values, Queue or QueueSkip.
And instead, introduce a new status, SuccessButReject.
SuccessButReject is used when the scheduler succeeds everything in the scheduling cycle but needs to reject Pod once, for example, in order to wait for the external component to do something.
For example, as said earlier, the DRA plugin sometimes needs to wait for the external device driver to provision the resource for the Pod based on the scheduling result.
It's different from when to return Unschedulable/UnschedulableAndUnresolvable, because in this case, the scheduling cycle, used to decide the Node in this example, isn't a vain effort, considering DRA can proceed the scheduling further by the scheduling result.
When the plugin rejectes a pod with this code, the scheduler remembers the pod and the plugin name, and when the appropriate event happens, which the plugin's QueueingHintFn returns Queue, the scheduler retries the scheduling immediately without waiting for the backoff. (Yes! it's a same behaviour as QueueImmediately that we now have)
Which issue(s) this PR fixes:
Part of: #114297
Special notes for your reviewer:
/cc @alculquicondor @kerthcet
There might be another better name instead of
SuccessButReject
. Please propose if any 😅EDIT: Per discussion in KEP, we use the word
Pending
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: