Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add warnings for big number of completions and parallelism #118420

Merged
merged 1 commit into from
Jun 15, 2023

Conversation

alculquicondor
Copy link
Member

@alculquicondor alculquicondor commented Jun 2, 2023

What type of PR is this?

/kind bug
/kind api-change

What this PR does / why we need it:

Add a warning for Jobs when completions > 1e5 and parallelism > 1e4.

In this case, the size of the field .status.completedIndexes might exceed the storage limits of etcd, if a big number of pod indexes fail.

  • If parallelism <= 1e4, then the field consumes at most (10+1)*(10^4+10^4)=0.21Mi.
  • If completions <= 1e5, then the field consumes at most (5+1)*10^5=0.572Mi

These limits are well below the default etcd limits of ~1.5Mi

Which issue(s) this PR fixes:

Fixes #118085

Special notes for your reviewer:

This warning is in accordance with the proposal in kubernetes/enhancements#3967.
I'll also submit a PR to update the documentation for Indexed Job.

Does this PR introduce a user-facing change?

ACTION_REQUIRED
When an Indexed Job has a number of completions higher than 10^5 and parallelism higher than 10^4, and a big number of Indexes fail, Kubernetes might not be able to track the termination of the Job. Kubernetes now emits a warning, at Job creation, when the Job manifest exceeds both of these limits.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [Usage]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#completion-mode

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 2, 2023
@k8s-ci-robot k8s-ci-robot added do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 2, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 2, 2023
@alculquicondor
Copy link
Member Author

/assign @mimowo @deads2k

@alculquicondor
Copy link
Member Author

/sig apps
/label api-review

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. api-review Categorizes an issue or PR as actively needing an API review. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 2, 2023
Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, some nits.

warnings := WarningsForJobSpec(ctx, nil, &tc.spec, nil)
if len(warnings) != tc.wantWarningsCount {
t.Errorf("Got %d warnings, want %d", len(warnings), tc.wantWarningsCount)
t.Logf("Warnings: %v", warnings)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it left intentionally? I suppose it is a left-over of dev process. Maybe we could consider an additional field to match the warnings, but not requesting it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's intentional. It could be useful for debugging if the test fails.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe then in the Errorf say not just the count, but also the received warnings?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged above.

Template: validPodTemplate,
},
},
"invalid Indexed high completions high parallelism": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to add test case with high completions and high parallelism when completionMode is NonIndexed, to show the validation is ifed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@@ -892,6 +899,32 @@ func TestJobStrategy_WarningsOnCreate(t *testing.T) {
Spec: validSpec,
},
},
"high completions and parallelism": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to have an analogous test for updates to ensure both entry paths to the new code are adjusted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposal to the warning message and two questions / suggestions regarding tests

var warnings []string
if spec.CompletionMode != nil && *spec.CompletionMode == batch.IndexedCompletion {
if *spec.Completions > completionsSoftLimit && *spec.Parallelism > parallelismSoftLimitForUnlimitedCompletions {
msg := "In Indexed Jobs with a number of completions higher than 10^5 and a parallelism higher than 10^4, Kubernetes might not be able to track completedIndexes when a big number of indexes fail"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking nit: I'm wondering about adding something like "Consider splitting the Job into smaller ones"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is within the scope of the API. But maybe we can add it to the documentation.

MatchLabels: map[string]string{"a": "b"},
}

validPodTemplate = core.PodTemplateSpec{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I see declaring globally enables sharing, but on the other hand, if they are shared by many tests it might be not obvious if modifying them we don't change the intention of one of the tests that uses the shared objects. Thus, I would suggest declaring them scoped within a test,.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests that use this don't care about the Pod template. Being verbose doesn't add much value in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it

@@ -749,6 +749,7 @@ func TestJobStrategy_WarningsOnUpdate(t *testing.T) {
Generation: 0,
},
Spec: batch.JobSpec{
CompletionMode: completionModePtr(batch.NonIndexedCompletion),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why adding this line? It seems unrelated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially wanted to assume that the object had gone through defaulting. But this caused too many tests to change. So I'll revert.

Copy link
Contributor

@mimowo mimowo Jun 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, please do revert in the existing places. Adding in the new tests is good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

@alculquicondor alculquicondor force-pushed the job_warnings branch 2 times, most recently from 058871c to 3995209 Compare June 5, 2023 16:58
@mimowo
Copy link
Contributor

mimowo commented Jun 5, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 4b8563f83bd62bbb6d2524061658626eff9afdca

func WarningsForJobSpec(ctx context.Context, path *field.Path, spec, oldSpec *batch.JobSpec) []string {
var warnings []string
if spec.CompletionMode != nil && *spec.CompletionMode == batch.IndexedCompletion {
if *spec.Completions > completionsSoftLimit && *spec.Parallelism > parallelismSoftLimitForUnlimitedCompletions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what prevents spec.Parallelism from being nil?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API defaulting sets it to 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API defaulting sets it to 1

Even so, given the pointer we should check nil-ness. Defaulting could change in future versions for instance and it's easy to check now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched the implementation to pointer.Int32Deref

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 13, 2023
@alculquicondor alculquicondor force-pushed the job_warnings branch 2 times, most recently from d361ff5 to d255c87 Compare June 13, 2023 20:17
@mimowo
Copy link
Contributor

mimowo commented Jun 14, 2023

The recent changes LGTM. Please take a look why the unit test failed, is it a flake?

@alculquicondor
Copy link
Member Author

Yeah, I don't know why it's failing. I can't reproduce locally :(

Change-Id: I63e192b1ce9da7d8bb04f8be1a6e19ec6fbbfa5a
@mimowo
Copy link
Contributor

mimowo commented Jun 14, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 14, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 6630c4f3c709bd94a9a422051420cd92a319627f

@deads2k
Copy link
Contributor

deads2k commented Jun 15, 2023

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 15, 2023
@k8s-ci-robot k8s-ci-robot merged commit b637006 into kubernetes:master Jun 15, 2023
12 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jun 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Indexed Jobs can break with high number of parallelism or completions
4 participants