
Fix tracking of terminating Pods when nothing else changes #121342

Conversation

dejanzele
Contributor

@dejanzele dejanzele commented Oct 19, 2023

… assertion

What type of PR is this?

/kind cleanup
/kind bug
/sig apps

What this PR does / why we need it:

Improves the assertions in the PodReplacementPolicy integration test by asserting in stages, which better exercises the whole flow.

Now we have the following flow (sketched in code below):

  1. Create a job and wait for active pods.
  2. Delete the pods and verify the terminating status.
  3. Fail the pods and verify the terminating, active & failed Job status counters for each PodReplacementPolicy.
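
A rough sketch of the staged structure (createJobWithPolicy, deleteJobPods, failJobPods and validateJobStatus are hypothetical stand-ins, not the actual helpers in job_test.go; tc is the assumed table-driven test case):

for _, policy := range []batchv1.PodReplacementPolicy{batchv1.Failed, batchv1.TerminatingOrFailed} {
	t.Run(string(policy), func(t *testing.T) {
		// Stage 1: create the Job and wait until its pods are active.
		jobObj := createJobWithPolicy(ctx, t, clientSet, policy)
		validateJobStatus(ctx, t, clientSet, jobObj, jobStatus{active: 2})

		// Stage 2: delete the pods and verify the terminating counter.
		deleteJobPods(ctx, t, clientSet, jobObj)
		validateJobStatus(ctx, t, clientSet, jobObj, tc.wantStatusAfterDeletion)

		// Stage 3: fail the pods and verify the terminating, active and
		// failed counters for the policy under test.
		failJobPods(ctx, t, clientSet, jobObj)
		validateJobStatus(ctx, t, clientSet, jobObj, tc.wantStatusAfterFailure)
	})
}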

It also fixes a bug where the Job controller missed the status update when the number of terminating pods decreased; the field was only updated when a different reason later triggered a Job status update.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This has been discussed with @alculquicondor

Does this PR introduce a user-facing change?

Fixed tracking of terminating Pods in the Job status. The field was not updated unless there were other changes to apply.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/apps Categorizes an issue or PR as relevant to SIG Apps. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 19, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 19, 2023
@k8s-ci-robot
Contributor

Hi @dejanzele. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dejanzele
Contributor Author

/cc @mimowo @alculquicondor

@dejanzele
Contributor Author

/assign dejanzele

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Oct 19, 2023
wantFailedAfterDeletion: 2,
wantFailedAfterFailure: 2,
wantTerminatingAfterDeletion: ptr.To[int32](2),
wantTerminatingAfterFailure: ptr.To[int32](2),
Contributor Author

I think there might be a bug in how terminating pods are counted in the Job status when PodReplacementPolicy is TerminatingOrFailed: the .Status.Terminating field never decreases to 0, unlike with the Failed policy, where it does.
Unless I am mistaken?
cc @mimowo @alculquicondor @kannon92

Contributor

I’ll look into this tomorrow, but my first thought is that this is expected.

Failed essentially says: I want all terminating pods to be fully terminated before starting new pods.

TerminatingOrFailed says: create new pods as soon as they are deleted. It takes time to create new ones, but it happens almost immediately after the deletion is registered.

So in your case, I think what is happening is that you are waiting for active pods in both, but with TerminatingOrFailed you get active pods while the terminating ones are still being deleted.
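
For reference, a minimal sketch of where the policy sits in the Job spec (illustrative values only; assumes the JobPodReplacementPolicy feature gate is enabled, with imports batchv1 "k8s.io/api/batch/v1", corev1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" and "k8s.io/utils/ptr"):

job := &batchv1.Job{
	ObjectMeta: metav1.ObjectMeta{Name: "example"},
	Spec: batchv1.JobSpec{
		Parallelism: ptr.To[int32](2),
		Completions: ptr.To[int32](2),
		// Failed: create replacements only once old pods are fully terminated.
		// TerminatingOrFailed: create replacements as soon as pods start
		// terminating, so terminating and active pods can overlap.
		PodReplacementPolicy: ptr.To(batchv1.Failed),
		Template: corev1.PodTemplateSpec{
			Spec: corev1.PodSpec{
				RestartPolicy: corev1.RestartPolicyNever,
				Containers: []corev1.Container{{
					Name:    "main",
					Image:   "busybox",
					Command: []string{"sleep", "60"},
				}},
			},
		},
	},
}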

Contributor Author

Hmmm, but I expected Terminating to drop to 0 after deleting & failing the pods; it didn't decrease from the value 2.

Member

I agree with @dejanzele. I would expect Terminating to drop to 0. Can you investigate why it's not dropping?

Contributor

They do drop on kind.

Again, with TerminatingOrFailed, delete & fail doesn't mean the terminating pods disappear immediately. My point is that I think you would need to wait for the terminating pods to go away separately from waiting for active pods.

Contributor Author

We delete pods, and this check asserts how many pods should be terminating - https://github.com/kubernetes/kubernetes/pull/121342/files#diff-ace35385ec6f5400e8ccbfe8e9f35c2f9573d0fc05889071dc9c4b36e5e975f7R1804

Then we fail pods and do the assertion here to verify the status after failing them - https://github.com/kubernetes/kubernetes/pull/121342/files#diff-ace35385ec6f5400e8ccbfe8e9f35c2f9573d0fc05889071dc9c4b36e5e975f7R1815

Both checks use validateJobsPodsStatusOnly, which polls with a 30-second timeout. I'd expect the 2 terminating pods to drop to 0 within 30 seconds.
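
(For reference, a condensed sketch of what such a poll-based assertion does; waitForJobStatus is an assumed shape, not the actual helper, and uses wait from k8s.io/apimachinery/pkg/util/wait, clientset from k8s.io/client-go/kubernetes, and ptr from k8s.io/utils/ptr:)

func waitForJobStatus(ctx context.Context, t *testing.T, clientSet clientset.Interface, ns, name string, want jobStatus) {
	t.Helper()
	err := wait.PollUntilContextTimeout(ctx, time.Second, 30*time.Second, true, func(ctx context.Context) (bool, error) {
		job, err := clientSet.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		// Succeed only once all tracked counters match the expectation.
		return job.Status.Active == want.active &&
			job.Status.Failed == want.failed &&
			ptr.Equal(job.Status.Terminating, want.terminating), nil
	})
	if err != nil {
		t.Fatalf("timed out waiting for job status: %v", err)
	}
}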

Contributor Author

It works as expected for the Failed policy: after deletion I assert 2 pods are terminating and 0 active; after failing the pods I assert 0 are terminating and 2 are active.

@kannon92
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 19, 2023
wantStatusAfterFailure: jobStatus{
active: 2,
failed: 2,
terminating: ptr.To[int32](2),
Member

did you investigate why this stays at 2?

Contributor Author

To some degree, this line https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go#L672 returns different results between the TerminatingOrFailed and Failed policy test cases, but I still cannot fully understand what is going on

Contributor Author

Here is the bug: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go#L925-L933

We set job.Status.Terminating = jobCtx.terminating on line 929 and then do the check needsStatusUpdate = needsStatusUpdate || !ptr.Equal(job.Status.Terminating, jobCtx.terminating), but at that point !ptr.Equal(job.Status.Terminating, jobCtx.terminating) always returns false.

With a simple reorder, all tests work correctly and the terminating count drops to 0 for the TerminatingOrFailed policy:

needsStatusUpdate := suspendCondChanged || active != job.Status.Active || !ptr.Equal(ready, job.Status.Ready)
needsStatusUpdate = needsStatusUpdate || !ptr.Equal(job.Status.Terminating, jobCtx.terminating)
job.Status.Active = active
job.Status.Ready = ready
job.Status.Terminating = jobCtx.terminating
err = jm.trackJobStatusAndRemoveFinalizers(ctx, jobCtx, needsStatusUpdate)
if err != nil {
	return fmt.Errorf("tracking status: %w", err)
}
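
To see the ordering pitfall in isolation, here is a standalone toy program (not controller code) using k8s.io/utils/ptr:

package main

import (
	"fmt"

	"k8s.io/utils/ptr"
)

func main() {
	status := ptr.To[int32](2)  // stands in for job.Status.Terminating
	desired := ptr.To[int32](0) // stands in for jobCtx.terminating

	// Buggy order: assign first, then compare. The check compares the
	// field against itself and can never observe the change.
	status = desired
	fmt.Println(!ptr.Equal(status, desired)) // false

	// Fixed order: compare first, then assign.
	status = ptr.To[int32](2)
	changed := !ptr.Equal(status, desired) // true: the drop from 2 to 0 is detected
	status = desired
	fmt.Println(changed, *status) // true 0
}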

/cc @mimowo @kannon92

Contributor

Is the only reason this worked for Failed that we had other updates around the same time?

In the case of TerminatingOrFailed, terminating was the only field changing, while for Failed the failed or active field was also being set?

This change looks good to me. I think you can update it in this PR if you want. We don't cherry-pick alpha features so I don't think we have to backport this.

Contributor Author

I guess it got decremented when the Job status was updated for some other reason, and the side effect was the terminating count dropping to 0; the dedicated update was missed because the terminating field had already been overwritten earlier.

@alculquicondor
Member

/hold
for possible bug in controller

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 19, 2023
@dejanzele dejanzele force-pushed the cleanup/refactor-pod-replacement-policy-int-test branch from 9f1b91b to 3c99165 on October 19, 2023 21:47
@dejanzele dejanzele force-pushed the cleanup/refactor-pod-replacement-policy-int-test branch from 56f3996 to e8c74af on October 23, 2023 20:11
@alculquicondor
Member

/lgtm
/approve
/hold cancel
/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Oct 23, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 23, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 3e4fdade5d2b1f02882722b1d47091ea13e18858

@pacoxu
Member

pacoxu commented Oct 24, 2023

CI may need a fix.
/hold
Please unhold when it is ready.

Invalid invocations of featuregatetesting.SetFeatureGateDuringTest():
test/integration/job/job_test.go:			t.Cleanup(featuregatetesting.SetFeatureGateDuringTest(t, feature.DefaultFeatureGate, features.JobPodReplacementPolicy, tc.podReplacementPolicyEnabled))
test/integration/job/job_test.go:			t.Cleanup(featuregatetesting.SetFeatureGateDuringTest(t, feature.DefaultFeatureGate, features.JobPodFailurePolicy, tc.jobSpec.PodFailurePolicy != nil))

Always make a deferred call to the returned function to ensure the feature gate is reset:
  defer featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.<FeatureName>, <value>)()
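
(Applied to the two flagged call sites, the pattern the verifier asks for is mechanical: same arguments, but a deferred call to the returned reset function instead of t.Cleanup:)

defer featuregatetesting.SetFeatureGateDuringTest(t, feature.DefaultFeatureGate, features.JobPodReplacementPolicy, tc.podReplacementPolicyEnabled)()
defer featuregatetesting.SetFeatureGateDuringTest(t, feature.DefaultFeatureGate, features.JobPodFailurePolicy, tc.jobSpec.PodFailurePolicy != nil)()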

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2023
@dejanzele
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2023
@dims
Member

dims commented Oct 24, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a3cfd957d49d51cbc13fbb469faf1d7a4412fb29

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, dejanzele, dims

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit f8a4e34 into kubernetes:master Oct 24, 2023
15 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 24, 2023
daemon365 pushed a commit to daemon365/kubernetes that referenced this pull request Oct 25, 2023
…s#121342)

* cleanup: refactor pod replacement policy integration test into staged assertion

* cleanup: remove typo in job_test.go

* refactor PodReplacementPolicy test and remove test for defaulting the policy

* fix issue with missing update in job controller for terminating status and refactor pod replacement policy integration test

* use t.Cleanup instead of defer in PodReplacementPolicy integration tests

* revert t.Cleanup to defer for reseting feature flag in PodReplacementPolicy integration tests
Sharpz7 pushed a commit to Sharpz7/kubernetes that referenced this pull request Oct 27, 2023
richabanker pushed a commit to richabanker/kubernetes that referenced this pull request Jan 9, 2024
jiahuif pushed a commit to jiahuif-forks/kubernetes that referenced this pull request Jan 23, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.