Add warnings for big number of completions and parallelism #118420

alculquicondor · 2023-06-02T19:11:59Z

What type of PR is this?

/kind bug
/kind api-change

What this PR does / why we need it:

Add a warning for Indexed Jobs when completions > 1e5 and parallelism > 1e4.

In this case, the size of the field .status.completedIndexes might exceed the storage limits of etcd if a large number of pod indexes fail.

If parallelism <= 1e4, then the field consumes at most (10+1)*(10^4+10^4) = 220,000 bytes ≈ 0.21Mi.
If completions <= 1e5, then the field consumes at most (5+1)*10^5 = 600,000 bytes ≈ 0.572Mi.

These sizes are well below the default etcd limit of ~1.5Mi.
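
The arithmetic above can be checked with a small sketch. It assumes, as the formulas suggest, that each entry costs its decimal digits plus one separator byte. The helper name `worstCaseBytes` is hypothetical and is not part of this PR.

```go
package main

import "fmt"

const mib = 1 << 20 // bytes in one Mi

// worstCaseBytes is a hypothetical helper: it multiplies the bytes per entry
// (decimal digits plus one separator) by the number of entries.
func worstCaseBytes(digits, entries int) int {
	return (digits + 1) * entries
}

func main() {
	// Bound when parallelism <= 1e4: 10-digit indexes, 1e4+1e4 entries.
	byParallelism := worstCaseBytes(10, 10000+10000)
	// Bound when completions <= 1e5: 5-digit indexes, 1e5 entries.
	byCompletions := worstCaseBytes(5, 100000)

	fmt.Printf("parallelism bound: %d bytes (%.3f Mi)\n", byParallelism, float64(byParallelism)/mib)
	fmt.Printf("completions bound: %d bytes (%.3f Mi)\n", byCompletions, float64(byCompletions)/mib)
}
```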

Which issue(s) this PR fixes:

Fixes #118085

Special notes for your reviewer:

This warning is in accordance with the proposal in kubernetes/enhancements#3967.
I'll also submit a PR to update the documentation for Indexed Job.

Does this PR introduce a user-facing change?

ACTION_REQUIRED
When an Indexed Job has a number of completions higher than 10^5 and parallelism higher than 10^4, and a big number of Indexes fail, Kubernetes might not be able to track the termination of the Job. Kubernetes now emits a warning, at Job creation, when the Job manifest exceeds both of these limits.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [Usage]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#completion-mode

k8s-ci-robot · 2023-06-02T19:12:08Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

alculquicondor · 2023-06-02T19:12:15Z

/assign @mimowo @deads2k

alculquicondor · 2023-06-02T19:12:30Z

/sig apps
/label api-review

mimowo

LGTM, some nits.

mimowo · 2023-06-05T07:40:05Z

pkg/api/job/warnings_test.go

+ warnings := WarningsForJobSpec(ctx, nil, &tc.spec, nil)
+ if len(warnings) != tc.wantWarningsCount {
+     t.Errorf("Got %d warnings, want %d", len(warnings), tc.wantWarningsCount)
+     t.Logf("Warnings: %v", warnings)


Is this left in intentionally? I suppose it is a leftover from the development process. Maybe we could consider an additional field to match the warnings, but I'm not requesting it.

Yes, it's intentional. It could be useful for debugging if the test fails.

Maybe then in the Errorf say not just the count, but also the received warnings?

Merged above.
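
A minimal sketch of the merged form suggested in this thread, where the failure message carries the received warnings as well as the count. The helper `countMismatch` is hypothetical and only illustrates the message shape; the actual test in the PR may format it differently.

```go
package main

import "fmt"

// countMismatch is a hypothetical helper. It returns an empty string when the
// number of warnings matches, and otherwise a message that carries both the
// counts and the warnings themselves.
func countMismatch(got []string, want int) string {
	if len(got) == want {
		return ""
	}
	return fmt.Sprintf("Got %d warnings, want %d.\nWarnings: %q", len(got), want, got)
}

func main() {
	got := []string{"example warning"}
	if msg := countMismatch(got, 0); msg != "" {
		// In a real test this would be reported with t.Error(msg).
		fmt.Println(msg)
	}
}
```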

mimowo · 2023-06-05T07:45:36Z

pkg/api/job/warnings_test.go

+ Template: validPodTemplate,
+ },
+ },
+ "invalid Indexed high completions high parallelism": {


I think it would be good to add a test case with high completions and high parallelism when completionMode is NonIndexed, to show that the check is guarded by the completion mode condition.
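
A rough sketch of the kind of case being requested. The types below are simplified stand-ins for the batch API types, `warningsForJobSpec` is a reduced illustration of the check rather than the PR's function, and the limit values are taken from the 10^5 and 10^4 figures in the warning message.

```go
package main

import "fmt"

// Simplified stand-ins for the batch API types; only what the check needs.
type CompletionMode string

const (
	NonIndexedCompletion CompletionMode = "NonIndexed"
	IndexedCompletion    CompletionMode = "Indexed"
)

type JobSpec struct {
	CompletionMode *CompletionMode
	Completions    *int32
	Parallelism    *int32
}

const (
	completionsSoftLimit                        = 100_000
	parallelismSoftLimitForUnlimitedCompletions = 10_000
)

// warningsForJobSpec is an illustrative reduction of the check under review.
func warningsForJobSpec(spec *JobSpec) []string {
	var warnings []string
	// The size concern only applies to Indexed Jobs.
	if spec.CompletionMode == nil || *spec.CompletionMode != IndexedCompletion {
		return warnings
	}
	if spec.Completions != nil && spec.Parallelism != nil &&
		*spec.Completions > completionsSoftLimit &&
		*spec.Parallelism > parallelismSoftLimitForUnlimitedCompletions {
		warnings = append(warnings, "completions and parallelism exceed the soft limits")
	}
	return warnings
}

// ptr returns a pointer to v (hypothetical convenience helper).
func ptr[T any](v T) *T { return &v }

func main() {
	cases := map[string]struct {
		spec JobSpec
		want int
	}{
		"Indexed high completions high parallelism": {
			spec: JobSpec{CompletionMode: ptr(IndexedCompletion), Completions: ptr(int32(100_001)), Parallelism: ptr(int32(10_001))},
			want: 1,
		},
		"NonIndexed high completions high parallelism": {
			spec: JobSpec{CompletionMode: ptr(NonIndexedCompletion), Completions: ptr(int32(100_001)), Parallelism: ptr(int32(10_001))},
			want: 0,
		},
	}
	for name, tc := range cases {
		got := len(warningsForJobSpec(&tc.spec))
		fmt.Printf("%s: got %d warnings, want %d\n", name, got, tc.want)
	}
}
```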

mimowo · 2023-06-05T07:46:38Z

pkg/registry/batch/job/strategy_test.go

@@ -892,6 +899,32 @@ func TestJobStrategy_WarningsOnCreate(t *testing.T) {
 Spec: validSpec,
 },
 },
+ "high completions and parallelism": {


I would like to have an analogous test for updates to ensure both entry paths to the new code are covered.

mimowo

A proposal for the warning message and two questions/suggestions regarding tests.

mimowo · 2023-06-05T15:43:02Z

pkg/api/job/warnings.go

+ var warnings []string
+ if spec.CompletionMode != nil && *spec.CompletionMode == batch.IndexedCompletion {
+     if *spec.Completions > completionsSoftLimit && *spec.Parallelism > parallelismSoftLimitForUnlimitedCompletions {
+         msg := "In Indexed Jobs with a number of completions higher than 10^5 and a parallelism higher than 10^4, Kubernetes might not be able to track completedIndexes when a big number of indexes fail"


non-blocking nit: I'm wondering about adding something like "Consider splitting the Job into smaller ones"?

I don't think this is within the scope of the API. But maybe we can add it to the documentation.

mimowo · 2023-06-05T15:53:57Z

pkg/api/job/warnings_test.go

+ MatchLabels: map[string]string{"a": "b"},
+ }
+
+ validPodTemplate = core.PodTemplateSpec{


nit: I see that declaring them globally enables sharing, but when they are shared by many tests it may not be obvious whether modifying them changes the intent of one of the tests that uses the shared objects. Thus, I would suggest declaring them scoped within a test.

The tests that use this don't care about the Pod template. Being verbose doesn't add much value in this case.

mimowo · 2023-06-05T15:54:11Z

pkg/registry/batch/job/strategy_test.go

@@ -749,6 +749,7 @@ func TestJobStrategy_WarningsOnUpdate(t *testing.T) {
 Generation: 0,
 },
 Spec: batch.JobSpec{
+ CompletionMode: completionModePtr(batch.NonIndexedCompletion),


nit: why add this line? It seems unrelated.

I initially wanted to assume that the object had gone through defaulting. But this caused too many tests to change. So I'll revert.

I see, please do revert in the existing places. Adding in the new tests is good.

mimowo · 2023-06-05T17:14:31Z

/lgtm

k8s-ci-robot · 2023-06-05T17:14:39Z

LGTM label has been added.

Git tree hash: 4b8563f83bd62bbb6d2524061658626eff9afdca

deads2k · 2023-06-12T19:24:36Z

pkg/api/job/warnings.go

+func WarningsForJobSpec(ctx context.Context, path *field.Path, spec, oldSpec *batch.JobSpec) []string {
+    var warnings []string
+    if spec.CompletionMode != nil && *spec.CompletionMode == batch.IndexedCompletion {
+        if *spec.Completions > completionsSoftLimit && *spec.Parallelism > parallelismSoftLimitForUnlimitedCompletions {


what prevents spec.Parallelism from being nil?

API defaulting sets it to 1

> API defaulting sets it to 1

Even so, given the pointer we should check nil-ness. Defaulting could change in future versions for instance and it's easy to check now.

I switched the implementation to pointer.Int32Deref
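
A minimal sketch of what the nil-safe form can look like. The local `int32Deref` mirrors the behavior of `pointer.Int32Deref` from k8s.io/utils/pointer so that the example has no external dependency. The fallback value of 0 and the helper `exceedsSoftLimits` are assumptions for illustration, not taken from the PR.

```go
package main

import "fmt"

// int32Deref mirrors pointer.Int32Deref from k8s.io/utils/pointer: it returns
// the pointed-to value, or def when ptr is nil.
func int32Deref(ptr *int32, def int32) int32 {
	if ptr != nil {
		return *ptr
	}
	return def
}

const (
	completionsSoftLimit                        = 100_000
	parallelismSoftLimitForUnlimitedCompletions = 10_000
)

// exceedsSoftLimits is a hypothetical helper showing the comparison without
// dereferencing a possibly nil pointer. Using 0 as the fallback is an
// assumption: a nil field then never triggers the warning.
func exceedsSoftLimits(completions, parallelism *int32) bool {
	c := int32Deref(completions, 0)
	p := int32Deref(parallelism, 0)
	return c > completionsSoftLimit && p > parallelismSoftLimitForUnlimitedCompletions
}

func main() {
	big := int32(200_000)
	fmt.Println(exceedsSoftLimits(&big, nil))  // nil parallelism: no panic, prints false
	fmt.Println(exceedsSoftLimits(&big, &big)) // both above the limits: prints true
}
```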

mimowo · 2023-06-14T14:09:23Z

The recent changes LGTM. Please take a look at why the unit test failed. Is it a flake?

alculquicondor · 2023-06-14T14:16:10Z

Yeah, I don't know why it's failing. I can't reproduce locally :(

Change-Id: I63e192b1ce9da7d8bb04f8be1a6e19ec6fbbfa5a

mimowo · 2023-06-14T15:02:08Z

/lgtm

k8s-ci-robot · 2023-06-14T15:02:21Z

LGTM label has been added.

Git tree hash: 6630c4f3c709bd94a9a422051420cd92a319627f

deads2k · 2023-06-15T20:22:42Z

/approve

k8s-ci-robot · 2023-06-15T20:23:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/api/OWNERS~~ [deads2k]
~~pkg/registry/OWNERS~~ [deads2k]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from deads2k and justinsb June 2, 2023 19:12

k8s-ci-robot added do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 2, 2023

k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 2, 2023

k8s-ci-robot assigned deads2k and mimowo Jun 2, 2023

k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. api-review Categorizes an issue or PR as actively needing an API review. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 2, 2023

alculquicondor force-pushed the job_warnings branch from 77799df to 11dad0f Compare June 2, 2023 20:49

mimowo reviewed Jun 5, 2023

View reviewed changes

alculquicondor force-pushed the job_warnings branch from 11dad0f to 9bb36ff Compare June 5, 2023 15:33

mimowo reviewed Jun 5, 2023

View reviewed changes

alculquicondor force-pushed the job_warnings branch 2 times, most recently from 058871c to 3995209 Compare June 5, 2023 16:58

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023

deads2k reviewed Jun 12, 2023

View reviewed changes

alculquicondor force-pushed the job_warnings branch from 3995209 to 8c0eb07 Compare June 13, 2023 18:38

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 13, 2023

k8s-ci-robot requested review from deads2k and mimowo June 13, 2023 18:38

alculquicondor force-pushed the job_warnings branch 2 times, most recently from d361ff5 to d255c87 Compare June 13, 2023 20:17

Add warnings for big number of completions and parallelism

c27f9fd

Change-Id: I63e192b1ce9da7d8bb04f8be1a6e19ec6fbbfa5a

alculquicondor force-pushed the job_warnings branch from d255c87 to c27f9fd Compare June 14, 2023 14:38

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 14, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 15, 2023

k8s-ci-robot merged commit b637006 into kubernetes:master Jun 15, 2023
12 checks passed

k8s-ci-robot added this to the v1.28 milestone Jun 15, 2023

jaskaransarkaria mentioned this pull request Jun 5, 2024

Planning upgrade to EKS 1.28 ministryofjustice/cloud-platform#5570

Open
