-
Notifications
You must be signed in to change notification settings - Fork 38.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add warnings for big number of completions and parallelism #118420
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/sig apps |
77799df
to
11dad0f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, some nits.
pkg/api/job/warnings_test.go
Outdated
warnings := WarningsForJobSpec(ctx, nil, &tc.spec, nil) | ||
if len(warnings) != tc.wantWarningsCount { | ||
t.Errorf("Got %d warnings, want %d", len(warnings), tc.wantWarningsCount) | ||
t.Logf("Warnings: %v", warnings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it left intentionally? I suppose it is a left-over of dev process. Maybe we could consider an additional field to match the warnings, but not requesting it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's intentional. It could be useful for debugging if the test fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe then in the Errorf
say not just the count, but also the received warnings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged above.
Template: validPodTemplate, | ||
}, | ||
}, | ||
"invalid Indexed high completions high parallelism": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to add test case with high completions and high parallelism when completionMode is NonIndexed, to show the validation is if
ed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
@@ -892,6 +899,32 @@ func TestJobStrategy_WarningsOnCreate(t *testing.T) { | |||
Spec: validSpec, | |||
}, | |||
}, | |||
"high completions and parallelism": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to have an analogous test for updates to ensure both entry paths to the new code are adjusted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
11dad0f
to
9bb36ff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proposal to the warning message and two questions / suggestions regarding tests
var warnings []string | ||
if spec.CompletionMode != nil && *spec.CompletionMode == batch.IndexedCompletion { | ||
if *spec.Completions > completionsSoftLimit && *spec.Parallelism > parallelismSoftLimitForUnlimitedCompletions { | ||
msg := "In Indexed Jobs with a number of completions higher than 10^5 and a parallelism higher than 10^4, Kubernetes might not be able to track completedIndexes when a big number of indexes fail" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking nit: I'm wondering about adding something like "Consider splitting the Job into smaller ones"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is within the scope of the API. But maybe we can add it to the documentation.
MatchLabels: map[string]string{"a": "b"}, | ||
} | ||
|
||
validPodTemplate = core.PodTemplateSpec{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I see declaring globally enables sharing, but on the other hand, if they are shared by many tests it might be not obvious if modifying them we don't change the intention of one of the tests that uses the shared objects. Thus, I would suggest declaring them scoped within a test,.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests that use this don't care about the Pod template. Being verbose doesn't add much value in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get it
@@ -749,6 +749,7 @@ func TestJobStrategy_WarningsOnUpdate(t *testing.T) { | |||
Generation: 0, | |||
}, | |||
Spec: batch.JobSpec{ | |||
CompletionMode: completionModePtr(batch.NonIndexedCompletion), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why adding this line? It seems unrelated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially wanted to assume that the object had gone through defaulting. But this caused too many tests to change. So I'll revert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, please do revert in the existing places. Adding in the new tests is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
058871c
to
3995209
Compare
/lgtm |
LGTM label has been added. Git tree hash: 4b8563f83bd62bbb6d2524061658626eff9afdca
|
pkg/api/job/warnings.go
Outdated
func WarningsForJobSpec(ctx context.Context, path *field.Path, spec, oldSpec *batch.JobSpec) []string { | ||
var warnings []string | ||
if spec.CompletionMode != nil && *spec.CompletionMode == batch.IndexedCompletion { | ||
if *spec.Completions > completionsSoftLimit && *spec.Parallelism > parallelismSoftLimitForUnlimitedCompletions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what prevents spec.Parallelism from being nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
API defaulting sets it to 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
API defaulting sets it to 1
Even so, given the pointer we should check nil-ness. Defaulting could change in future versions for instance and it's easy to check now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched the implementation to pointer.Int32Deref
3995209
to
8c0eb07
Compare
d361ff5
to
d255c87
Compare
The recent changes LGTM. Please take a look why the unit test failed, is it a flake? |
Yeah, I don't know why it's failing. I can't reproduce locally :( |
Change-Id: I63e192b1ce9da7d8bb04f8be1a6e19ec6fbbfa5a
d255c87
to
c27f9fd
Compare
/lgtm |
LGTM label has been added. Git tree hash: 6630c4f3c709bd94a9a422051420cd92a319627f
|
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, deads2k The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind bug
/kind api-change
What this PR does / why we need it:
Add a warning for Jobs when
completions > 1e5
andparallelism > 1e4
.In this case, the size of the field
.status.completedIndexes
might exceed the storage limits of etcd, if a big number of pod indexes fail.parallelism <= 1e4
, then the field consumes at most(10+1)*(10^4+10^4)=0.21Mi
.completions <= 1e5
, then the field consumes at most(5+1)*10^5=0.572Mi
These limits are well below the default etcd limits of ~1.5Mi
Which issue(s) this PR fixes:
Fixes #118085
Special notes for your reviewer:
This warning is in accordance with the proposal in kubernetes/enhancements#3967.
I'll also submit a PR to update the documentation for Indexed Job.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: