
Fix tracking of terminating Pods when nothing else changes #121342

Conversation

dejanzele
Contributor

@dejanzele dejanzele commented Oct 19, 2023

… assertion

What type of PR is this?

/kind cleanup
/kind bug
/sig apps

What this PR does / why we need it:

Improves the assertions in the PodReplacementPolicy integration test by asserting in stages, which better exercises the whole flow.

Now we have the following flow (sketched in code below):

  1. Create a job and wait for active pods.
  2. Delete the pods and verify the terminating status.
  3. Fail the pods and verify the terminating, active & failed Job status counters for each PodReplacementPolicy.
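
A rough sketch of the staged structure (createJobWithPolicy, deleteJobPods, failJobPods and validateJobStatus are hypothetical stand-ins, not the actual helpers in job_test.go; tc is the assumed table-driven test case):

for _, policy := range []batchv1.PodReplacementPolicy{batchv1.Failed, batchv1.TerminatingOrFailed} {
	t.Run(string(policy), func(t *testing.T) {
		// Stage 1: create the Job and wait until its pods are active.
		jobObj := createJobWithPolicy(ctx, t, clientSet, policy)
		validateJobStatus(ctx, t, clientSet, jobObj, jobStatus{active: 2})

		// Stage 2: delete the pods and verify the terminating counter.
		deleteJobPods(ctx, t, clientSet, jobObj)
		validateJobStatus(ctx, t, clientSet, jobObj, tc.wantStatusAfterDeletion)

		// Stage 3: fail the pods and verify the terminating, active and
		// failed counters for the policy under test.
		failJobPods(ctx, t, clientSet, jobObj)
		validateJobStatus(ctx, t, clientSet, jobObj, tc.wantStatusAfterFailure)
	})
}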

It also fixes a bug where the Job controller missed the status update when the number of terminating pods decreased; the field was only updated when a different reason later triggered a Job status update.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This has been discussed with @alculquicondor

Does this PR introduce a user-facing change?

Fixed tracking of terminating Pods in the Job status. The field was not updated unless there were other changes to apply.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/apps Categorizes an issue or PR as relevant to SIG Apps. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 19, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 19, 2023
@k8s-ci-robot
Contributor

Hi @dejanzele. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dejanzele
Contributor Author

/cc @mimowo @alculquicondor

@dejanzele
Contributor Author

/assign dejanzele

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Oct 19, 2023
wantFailedAfterDeletion: 2,
wantFailedAfterFailure: 2,
wantTerminatingAfterDeletion: ptr.To[int32](2),
wantTerminatingAfterFailure: ptr.To[int32](2),
Contributor Author

I think there might be a bug in how terminating pods are counted in the Job status when PodReplacementPolicy is TerminatingOrFailed: the .Status.Terminating field never decreases to 0, unlike with the Failed policy, where it does.
Unless I am mistaken?
cc @mimowo @alculquicondor @kannon92

Contributor

I’ll look into this tomorrow, but my first thought is that this is expected.

Failed essentially says: I want all terminating pods to be fully terminated before starting new pods.

TerminatingOrFailed says: create new pods as soon as they are deleted. It takes time to create new ones, but it happens almost immediately after the deletion is registered.

So in your case, I think what is happening is that you are waiting for active pods in both, but with TerminatingOrFailed you get active pods while the terminating ones are still being deleted.
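
For reference, a minimal sketch of where the policy sits in the Job spec (illustrative values only; assumes the JobPodReplacementPolicy feature gate is enabled, with imports batchv1 "k8s.io/api/batch/v1", corev1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" and "k8s.io/utils/ptr"):

job := &batchv1.Job{
	ObjectMeta: metav1.ObjectMeta{Name: "example"},
	Spec: batchv1.JobSpec{
		Parallelism: ptr.To[int32](2),
		Completions: ptr.To[int32](2),
		// Failed: create replacements only once old pods are fully terminated.
		// TerminatingOrFailed: create replacements as soon as pods start
		// terminating, so terminating and active pods can overlap.
		PodReplacementPolicy: ptr.To(batchv1.Failed),
		Template: corev1.PodTemplateSpec{
			Spec: corev1.PodSpec{
				RestartPolicy: corev1.RestartPolicyNever,
				Containers: []corev1.Container{{
					Name:    "main",
					Image:   "busybox",
					Command: []string{"sleep", "60"},
				}},
			},
		},
	},
}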

Contributor Author

Hmmm, but I expected Terminating to drop to 0 after deleting & failing the pods; it didn't decrease from the value 2.

Member

I agree with @dejanzele. I would expect Terminating to drop to 0. Can you investigate why it's not dropping?

Contributor

They do drop on kind.

Again, with TerminatingOrFailed, delete & fail doesn't mean the terminating pods disappear immediately. My point is that I think you would need to wait for the terminating pods to go away separately from waiting for active pods.

Contributor Author

We delete pods, and this check asserts how many pods should be terminating - https://github.com/kubernetes/kubernetes/pull/121342/files#diff-ace35385ec6f5400e8ccbfe8e9f35c2f9573d0fc05889071dc9c4b36e5e975f7R1804

Then we fail pods and do the assertion here to verify the status after failing them - https://github.com/kubernetes/kubernetes/pull/121342/files#diff-ace35385ec6f5400e8ccbfe8e9f35c2f9573d0fc05889071dc9c4b36e5e975f7R1815

Both checks use validateJobsPodsStatusOnly, which polls with a 30-second timeout. I'd expect the 2 terminating pods to drop to 0 within 30 seconds.
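
(For reference, a condensed sketch of what such a poll-based assertion does; waitForJobStatus is an assumed shape, not the actual helper, and uses wait from k8s.io/apimachinery/pkg/util/wait, clientset from k8s.io/client-go/kubernetes, and ptr from k8s.io/utils/ptr:)

func waitForJobStatus(ctx context.Context, t *testing.T, clientSet clientset.Interface, ns, name string, want jobStatus) {
	t.Helper()
	err := wait.PollUntilContextTimeout(ctx, time.Second, 30*time.Second, true, func(ctx context.Context) (bool, error) {
		job, err := clientSet.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		// Succeed only once all tracked counters match the expectation.
		return job.Status.Active == want.active &&
			job.Status.Failed == want.failed &&
			ptr.Equal(job.Status.Terminating, want.terminating), nil
	})
	if err != nil {
		t.Fatalf("timed out waiting for job status: %v", err)
	}
}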

Contributor Author

It works as expected for the Failed policy: after deletion I assert 2 pods are terminating and 0 active; after failing the pods I assert 0 are terminating and 2 are active.

@kannon92
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 19, 2023
wantStatusAfterFailure: jobStatus{
active: 2,
failed: 2,
terminating: ptr.To[int32](2),
Member

did you investigate why this stays at 2?

Contributor Author

To some degree, this line https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go#L672 returns different results between the TerminatingOrFailed and Failed policy test cases, but I still cannot fully understand what is going on

Contributor Author

Here is the bug: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go#L925-L933

We set job.Status.Terminating = jobCtx.terminating on line 929 and then do the check needsStatusUpdate = needsStatusUpdate || !ptr.Equal(job.Status.Terminating, jobCtx.terminating), but at that point !ptr.Equal(job.Status.Terminating, jobCtx.terminating) always returns false.

With a simple reorder, all tests work correctly and the terminating count drops to 0 for the TerminatingOrFailed policy:

needsStatusUpdate := suspendCondChanged || active != job.Status.Active || !ptr.Equal(ready, job.Status.Ready)
needsStatusUpdate = needsStatusUpdate || !ptr.Equal(job.Status.Terminating, jobCtx.terminating)
job.Status.Active = active
job.Status.Ready = ready
job.Status.Terminating = jobCtx.terminating
err = jm.trackJobStatusAndRemoveFinalizers(ctx, jobCtx, needsStatusUpdate)
if err != nil {
	return fmt.Errorf("tracking status: %w", err)
}
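
To see the ordering pitfall in isolation, here is a standalone toy program (not controller code) using k8s.io/utils/ptr:

package main

import (
	"fmt"

	"k8s.io/utils/ptr"
)

func main() {
	status := ptr.To[int32](2)  // stands in for job.Status.Terminating
	desired := ptr.To[int32](0) // stands in for jobCtx.terminating

	// Buggy order: assign first, then compare. The check compares the
	// field against itself and can never observe the change.
	status = desired
	fmt.Println(!ptr.Equal(status, desired)) // false

	// Fixed order: compare first, then assign.
	status = ptr.To[int32](2)
	changed := !ptr.Equal(status, desired) // true: the drop from 2 to 0 is detected
	status = desired
	fmt.Println(changed, *status) // true 0
}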

/cc @mimowo @kannon92

Contributor

Is the only reason this worked for Failed that we had other updates around the same time?

In the case of TerminatingOrFailed, terminating was the only field changing, while for Failed the failed or active field was also being set?

This change looks good to me. I think you can update it in this PR if you want. We don't cherry-pick alpha features so I don't think we have to backport this.

Contributor Author

I guess it got decremented when the Job status was updated for some other reason, and the side effect was the terminating count dropping to 0; the dedicated update was missed because the terminating field had already been overwritten earlier.

@alculquicondor
Member

/hold
for possible bug in controller

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 19, 2023
@dejanzele dejanzele force-pushed the cleanup/refactor-pod-replacement-policy-int-test branch from 9f1b91b to 3c99165 on October 19, 2023 21:47
@dejanzele dejanzele force-pushed the cleanup/refactor-pod-replacement-policy-int-test branch from 56f3996 to e8c74af on October 23, 2023 20:11
@alculquicondor
Member

/lgtm
/approve
/hold cancel
/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Oct 23, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 23, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 3e4fdade5d2b1f02882722b1d47091ea13e18858

@pacoxu
Member

pacoxu commented Oct 24, 2023

CI may need a fix.
/hold
Please unhold when it is ready.

Invalid invocations of featuregatetesting.SetFeatureGateDuringTest():
test/integration/job/job_test.go:			t.Cleanup(featuregatetesting.SetFeatureGateDuringTest(t, feature.DefaultFeatureGate, features.JobPodReplacementPolicy, tc.podReplacementPolicyEnabled))
test/integration/job/job_test.go:			t.Cleanup(featuregatetesting.SetFeatureGateDuringTest(t, feature.DefaultFeatureGate, features.JobPodFailurePolicy, tc.jobSpec.PodFailurePolicy != nil))

Always make a deferred call to the returned function to ensure the feature gate is reset:
  defer featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.<FeatureName>, <value>)()
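
(Applied to the two flagged call sites, the pattern the verifier asks for is mechanical: same arguments, but a deferred call to the returned reset function instead of t.Cleanup:)

defer featuregatetesting.SetFeatureGateDuringTest(t, feature.DefaultFeatureGate, features.JobPodReplacementPolicy, tc.podReplacementPolicyEnabled)()
defer featuregatetesting.SetFeatureGateDuringTest(t, feature.DefaultFeatureGate, features.JobPodFailurePolicy, tc.jobSpec.PodFailurePolicy != nil)()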

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2023
@dejanzele
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2023
@dims
Member

dims commented Oct 24, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a3cfd957d49d51cbc13fbb469faf1d7a4412fb29

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, dejanzele, dims

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit f8a4e34 into kubernetes:master Oct 24, 2023
15 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 24, 2023
daemon365 pushed a commit to daemon365/kubernetes that referenced this pull request Oct 25, 2023
…s#121342)

* cleanup: refactor pod replacement policy integration test into staged assertion

* cleanup: remove typo in job_test.go

* refactor PodReplacementPolicy test and remove test for defaulting the policy

* fix issue with missing update in job controller for terminating status and refactor pod replacement policy integration test

* use t.Cleanup instead of defer in PodReplacementPolicy integration tests

* revert t.Cleanup to defer for reseting feature flag in PodReplacementPolicy integration tests
Sharpz7 pushed a commit to Sharpz7/kubernetes that referenced this pull request Oct 27, 2023
richabanker pushed a commit to richabanker/kubernetes that referenced this pull request Jan 9, 2024
jiahuif pushed a commit to jiahuif-forks/kubernetes that referenced this pull request Jan 23, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.