Fix tracking of terminating Pods when nothing else changes #121342
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @dejanzele. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign dejanzele
test/integration/job/job_test.go
Outdated
wantFailedAfterDeletion:      2,
wantFailedAfterFailure:       2,
wantTerminatingAfterDeletion: ptr.To[int32](2),
wantTerminatingAfterFailure:  ptr.To[int32](2),
I think there might be a bug with how terminating pods are counted in the Job status when PodReplacementPolicy is TerminatingOrFailed. The .Status.Terminating field never decreases to 0, unlike for the Failed policy where it decreases to 0. Unless I am mistaken?
cc @mimowo @alculquicondor @kannon92
I'll look into this tomorrow, but my first thought is that this is expected.
Failed essentially says: I want all terminating pods to be fully terminated before starting new pods.
TerminatingOrFailed says: create new pods as soon as they are deleted. It takes time to create new ones, but it happens almost immediately after the deletion is registered.
So in your case, what I think is happening is that you are waiting for active pods in both, but with TerminatingOrFailed you get active pods while the terminating ones are still being deleted.
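For reference, a minimal sketch of how the two policies are selected on the Job spec (not this PR's code; just the batch/v1 API types, assuming the JobPodReplacementPolicy feature gate is enabled):

import (
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/ptr"
)

// TerminatingOrFailed: replacement pods may be created as soon as old pods start terminating.
spec := batchv1.JobSpec{
	PodReplacementPolicy: ptr.To(batchv1.TerminatingOrFailed),
}

// Failed: replacement pods are created only once old pods are fully terminated.
spec.PodReplacementPolicy = ptr.To(batchv1.Failed)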
Hmmm, but I expected Terminating to drop to 0 after deleting & failing the pods; it didn't decrease from the value 2.
I agree with @dejanzele. I would expect Terminating to drop to 0. Can you investigate why it's not dropping?
They do drop on kind.
Again, deleting & failing pods with TerminatingOrFailed doesn't mean it will happen immediately. My point is that I think you would need to wait for the terminating pods to go away separately from waiting for active pods.
We delete pods and we have this check which asserts how many pods should be terminating: https://github.com/kubernetes/kubernetes/pull/121342/files#diff-ace35385ec6f5400e8ccbfe8e9f35c2f9573d0fc05889071dc9c4b36e5e975f7R1804
Then we fail pods and do the assertion here to verify the status after failing them: https://github.com/kubernetes/kubernetes/pull/121342/files#diff-ace35385ec6f5400e8ccbfe8e9f35c2f9573d0fc05889071dc9c4b36e5e975f7R1815
Both checks use validateJobsPodsStatusOnly, which polls with a 30-second timeout. I'd expect the 2 terminating pods to drop to 0 within 30 seconds.
It works as expected for the Failed policy: after deletion I assert that 2 pods are terminating and 0 are active; after failing the pods I assert that 0 are terminating and 2 are active.
/ok-to-test
test/integration/job/job_test.go
Outdated
wantStatusAfterFailure: jobStatus{
	active:      2,
	failed:      2,
	terminating: ptr.To[int32](2),
Did you investigate why this stays at 2?
To some degree. This line https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go#L672 returns different results between the TerminatingOrFailed and Failed policy test cases, but I still cannot fully understand what is going on.
But even if it does, they should be filtered out here, no? https://github.com/kubernetes/kubernetes/blob/baea2df9d6f02935849e00390e0e8a5fad596f51/pkg/controller/job/job_controller.go#L790C46-L790C46
Here is the bug: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go#L925-L933
We set job.Status.Terminating = jobCtx.terminating on line 929 and only then do the check needsStatusUpdate = needsStatusUpdate || !ptr.Equal(job.Status.Terminating, jobCtx.terminating), but at that point !ptr.Equal(job.Status.Terminating, jobCtx.terminating) will always return false.
If we just do a simple reorder, all tests work correctly and the terminating count correctly drops to 0 for the TerminatingOrFailed policy.
// Compare the old status against the freshly computed values *before*
// overwriting them, so that a change in the terminating count alone is
// enough to trigger a status update.
needsStatusUpdate := suspendCondChanged || active != job.Status.Active || !ptr.Equal(ready, job.Status.Ready)
needsStatusUpdate = needsStatusUpdate || !ptr.Equal(job.Status.Terminating, jobCtx.terminating)
job.Status.Active = active
job.Status.Ready = ready
job.Status.Terminating = jobCtx.terminating
err = jm.trackJobStatusAndRemoveFinalizers(ctx, jobCtx, needsStatusUpdate)
if err != nil {
	return fmt.Errorf("tracking status: %w", err)
}
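To make the ordering issue concrete, here is a small standalone sketch (only k8s.io/utils/ptr is used; the variable names mirror the controller fields, but this is not the controller code):

package main

import (
	"fmt"

	"k8s.io/utils/ptr"
)

func main() {
	statusTerminating := ptr.To[int32](2) // job.Status.Terminating from the last sync
	ctxTerminating := ptr.To[int32](0)    // jobCtx.terminating computed in this sync

	// Buggy order: assign first, then compare. Both sides are now the
	// same pointer, so the inequality check can never fire.
	statusTerminating = ctxTerminating
	fmt.Println(!ptr.Equal(statusTerminating, ctxTerminating)) // false — change missed

	// Fixed order: compare first, then assign.
	statusTerminating = ptr.To[int32](2)
	needsUpdate := !ptr.Equal(statusTerminating, ctxTerminating)
	statusTerminating = ctxTerminating
	fmt.Println(needsUpdate, *statusTerminating) // true 0 — update triggered
}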
Is the only reason this worked for Failed that we had other updates around the same time? In the case of TerminatingOrFailed, we had a case where terminating was the only field changing, while Failed also had the failed or active field set?
This change looks good to me. I think you can update it in this PR if you want. We don't cherry-pick alpha features, so I don't think we have to backport this.
I guess it got decremented when the Job status got updated for some other reason, and the side effect was the terminating count dropping to 0; the earlier update, where the terminating field was the only thing that changed, was being missed.
/hold
force-pushed from 9f1b91b to 3c99165
…s and refactor pod replacement policy integration test
force-pushed from 56f3996 to e8c74af
/lgtm
LGTM label has been added. Git tree hash: 3e4fdade5d2b1f02882722b1d47091ea13e18858
CI may need a fix.
…Policy integration tests
/unhold
/lgtm
LGTM label has been added. Git tree hash: a3cfd957d49d51cbc13fbb469faf1d7a4412fb29
[APPROVALNOTIFIER] This PR is APPROVED.
This pull-request has been approved by: alculquicondor, dejanzele, dims
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
…s#121342)
* cleanup: refactor pod replacement policy integration test into staged assertion
* cleanup: remove typo in job_test.go
* refactor PodReplacementPolicy test and remove test for defaulting the policy
* fix issue with missing update in job controller for terminating status and refactor pod replacement policy integration test
* use t.Cleanup instead of defer in PodReplacementPolicy integration tests
* revert t.Cleanup to defer for reseting feature flag in PodReplacementPolicy integration tests
… assertion
What type of PR is this?
/kind cleanup
/kind bug
/sig apps
What this PR does / why we need it:
Improves the assertions of the PodReplacementPolicy integration test by asserting in stages, thus better testing the whole flow.
Now we have the following flow: delete the pods and assert the expected active/failed/terminating counts, then fail the pods and assert the counts again (see the sketch below).
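As a rough sketch, each test case now carries a full expected status per stage, along the lines of the following (jobStatus matches the diff earlier in the thread; the test-case struct name and the wantStatusAfterDeletion field are guesses and may differ from job_test.go):

type jobStatus struct {
	active      int32
	failed      int32
	terminating *int32
}

type podReplacementPolicyTestCase struct {
	podReplacementPolicy    *batchv1.PodReplacementPolicy
	wantStatusAfterDeletion jobStatus
	wantStatusAfterFailure  jobStatus
}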
It also fixes a bug where the Job controller was missing the status update when only the terminating pod count decreased; the field was only updated when a different reason later triggered the Job status update.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
This has been discussed with @alculquicondor
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: