[strategicpatch] support duplicated mergeKey values #125932

antomy-gc · 2024-07-06T18:00:51Z

What type of PR is this?

/kind bug
/sig api-machinery

What this PR does / why we need it:

Fix bug in strategicpatch which was unable to properly handle list values with duplicated MergeKey values.

Original manifest contains duplicated value with mergeKey 2:

mergingList:
- name: 1
- name: 2
  value: dup1
- name: 3
- name: 2
  value: dup2

Trying to remove one of duplicates (for example, with kubectl apply of following):

mergingList:
- name: 1
- name: 2
  value: dup2
- name: 3

Actual result of patching: removed both of duplicates:

mergingList:
- name: 1
- name: 3

Since we allow user to have duplicated values in manifests, it seems to be reasonable to provide at least minimal support of those in strategicmerge. Currently, the user cannot even delete one of the duplicates added by mistake, as this would result in a manifest missing the necessary value.

Adding stricter validation might be a more strategic solution, but it requires significant time for migration. Therefore, I suggest supporting duplicate values at least until the validation is implemented and prevents the creation of such manifests.

Which issue(s) this PR fixes:

Fixes #58477
Fixes #93266

Special notes for your reviewer:

Discovered reason of incorrect element ordering described in test case add map and delete map from merging list.
This behavior is caused by a side effect of mergeSliceWithSpecialElements function, which was changing contents of original slice upper in call stack, where it was used to determine elements order and assumed to be still unmodified.
Left comment, didnt fix.

Does this PR introduce a user-facing change?

Fixed strategicpatch bug when attempt to remove one of duplicated by mergeKey value elements from list 
with `kubectl appy` could lead to removal of all elements with this mergeKey.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

linux-foundation-easycla · 2024-07-06T18:00:54Z

The committers listed above are authorized under a signed CLA.

✅ login: antomy-gc / name: Ilya Kuzmin (278980f, 1dbfb3c, 8647582, 95060dd, 49ec604, a442300, 6af5417, e70c446, 243c862)

k8s-ci-robot · 2024-07-06T18:00:59Z

Welcome @antomy-gc!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2024-07-06T18:01:00Z

Hi @antomy-gc. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

antomy-gc · 2024-07-08T07:01:08Z

/assign @ncdc

cici37 · 2024-07-09T20:16:06Z

/ok-to-test
/assign @seans3 Thanks!
/triage accepted

antomy-gc · 2024-07-09T23:11:24Z

/test pull-kubernetes-unit

antomy-gc · 2024-07-14T20:42:45Z

Hi @seans3 could you please provide an estimate on when you might be able to review this?

liggitt · 2024-08-21T21:31:05Z

staging/src/k8s.io/apimachinery/pkg/util/strategicpatch/patch_test.go

  - name: 4
    other: c
-  - name: 3


/hold

Changes to these test case results represent compatibility breaks we cannot make. Existing API requests generally must produce the same results or we will break existing successful clients

I changed this test because of these reasons:

it seems to describe invalid behavior

it does not affect content of test data, just changes order of elements after patch

TwoWay patch directive says: add element '3', remove '1', place '3' after '2'.
But in original test we expect new element to be appended to the end of list ignoring order directive.

TwoWay: []byte(` $setElementOrder/mergingList: - name: 2 - name: 3 mergingList: - name: 3 - name: 1 $patch: delete `), Current: []byte(` mergingList: - name: 1 - name: 2 other: b - name: 4 other: c `), Result: []byte(` mergingList: - name: 2 other: b - name: 4 other: c - name: 3

Same problem for ThreeWay.

I was able to reproduce original behavior with my new code, but it resulted in breaking all other cases: instead of using order directive it was just appending new elements.
Then i found out why this behavior occurred in this test case: all the code was describing behavior like "keep previous order, then apply order directive for new elements", but side effect on slice`s underlying array was resulting in malformed 'order' slice.

Do we have any common approach for kubernetes to fix such cases? i`m sure its not the first time when we have incorrect behavior described in tests.

liggitt · 2024-08-21T21:36:38Z

staging/src/k8s.io/apimachinery/pkg/util/strategicpatch/patch.go

 	if err != nil {
 		return nil, nil, err
 	}

-	originalIndex, modifiedIndex := 0, 0


Does this change how the client generates a patch to send to the server, or how the server handles an existing patch request from the client?

This method changes the way how the client generates patch command and PR in general affects both client and server side.

All changes are only related to duplicate MergeKey values cases, non-duplicated elements are processed by existing rules.

This method changes the way how the client generates patch command and PR in general affects both client and server side.

Then we have to reason about the following combinations:

new client, new server. This is the only case we're exercising in unit tests now, I think (since both the client and server side are running with new code).

new client, old server. We have to capture the change in what a new client would generate/send as a patch, and find a way to demonstrate that an old server will do something no worse than what it does today.

old client, new server. We have to capture what an old client would generate/send as a patch, and make sure a new server will produce the same result when given that patch (which is one reason the ordering changes to the existing unit test were concerning).

I started trying to trace all the call sites that lead here to understand if this is only impacting clients, or all server-side logic, and which components this touches. I ended up with something like this before I ran out of time:

diffListsOfMaps diffLists handleSliceDiff diffMaps CreateThreeWayMergePatch kubectl apply (kubectl) CreateTwoWayMergeMapPatchUsingLookupPatchMeta CreateTwoWayMergeMapPatch (unused?) CreateTwoWayMergePatchUsingLookupPatchMeta CreateTwoWayMergePatch kubeadm pkg/controller PatchNodeTaints ... pkg/controller AddOrUpdateLabelsOnNode ... TTL controller (kube-controller-manager) pkg/scheduler/util PatchPodStatus (kube-scheduler) pkg/util/pod PatchPodStatus PodGC controller (kube-controller-manager) Taint eviction controller (kube-controller-manager) kubelet status manager (kubelet) kube-scheduler preemption (kube-scheduler) pkg/volume/util/resize_util PatchPV AddAnnPreResizeCapacity ...? DeleteAnnPreResizeCapacity ...? createPVCPatch MarkNodeExpansionInfeasible ...? MarkNodeExpansionFailedCondition ...? PatchPVCStatus ...? client-go tools/events/event_broadcaster ...? client-go tools/record/events_cache ...? cloud-provider node/helpers addOrUpdateLabelsOnNode AddOrUpdateLabelsOnNode cloud-provider/controllers/node (cloud-controller-manager) PatchNodeTaints AddOrUpdateTaintOnNode cloud-provider/controllers/nodelifecycle (cloud-controller-manager) RemoveTaintOffNode cloud-provider/controllers/nodelifecycle (cloud-controller-manager) getPatchBytes PatchService cloud-provider/controllers/service (cloud-controller-manager) component-helpers node/util/status ...? kubectl debug (kubectl) kubectl set (kubectl) kubectl taint (kubectl) kubectl edit (kubectl) kubectl drain (kubectl) diffListsOfMaps (recursive) handleMapDiff diffMaps (recursive)

The ...? bits need to be chased to ground.

I apologize for the long delay on this, but we have to fully understand what this is changing, which components it is changing in, and the impact of that new behavior on skewed components those talk to across a REST API. Unfortunately, the package this PR touches is very difficult to reason about, which is why we haven't made much progress on fully tracing out the impact of the change.

thank you for clarifying this.
i`ll try to collect full list of usages, but may be unable to properly analyze this data. Anyway, i will be back here with results asap.

And thank you for spending so much time on this and providing detailed and comprehensive comments!

liggitt · 2024-08-21T22:09:18Z

staging/src/k8s.io/apimachinery/pkg/util/strategicpatch/patch.go

 	if err != nil {
 		return nil, nil, err
 	}

-	originalIndex, modifiedIndex := 0, 0
-	for {


The structure of this change makes it really hard to reason about the side effects, and there's no reasonable way to mitigate a bug introduced by this change.

We have to be 100% sure this won't regress stuff that currently is working, and have a controlled way to roll back a change to the old behavior if an issue is found. Normally this takes the shape of an isolated code change guarded by a feature gate that can be opted into to start, then opted out of as a mitigation.

May you give any references on how this usually done? I guess i could isolate the changes

Normally this is done with a feature gate, and an isolated code path only active when the gate is enabled, but if the behavior change is client-side, that's harder to gate.

antomy-gc · 2024-09-30T16:08:37Z

/retest

antomy-gc · 2024-09-30T18:13:11Z

@liggitt could you check it once again?
i`ve managed to implement it without affecting existing unit tests and patch application code, only patch generation step has changed.
it also reduced amount of changed code so it should not take lot of your time to review

antomy-gc · 2024-12-05T20:51:21Z

@apelisse could you review it instead, please?
This PR is still waiting for review 3 months after last commit and i`m a bit confused on how to get it reviewed.
Appreciate your help!

liggitt · 2024-12-05T20:58:20Z

Apologies for the delay / radio silence, I've been consumed with 1.32 blockers and activities.

This is still in my queue. The difficulty in reviewing this is ensuring that it is not breaking compatibility between a new client (built with this code) talking to an old server, or an old client talking to a new server (built with this code). It's hard to tell if the tests in this PR are exercising those scenarios (or if they even can)

liggitt · 2024-12-06T20:01:12Z

To expand on #125932 (comment), these are the combinations I mentioned in #125932 (comment)

Normally, the approach to testing this is:

Capture client / server wire payloads for the relevant scenario
A. capture the request produced by an old client, and the response of an old server to that request
B. capture the request produced by a new client, and the response of a new server with the change in this PR
Write tests against master (without this PR) using the captured data
A. Test what a master server (an old server) does against the captured payload from a "new" client
B. Test what a master server (an old server) does against the captured payload from an "old" client
C. Test what a master client (an old client) does with the captured payload response(s) from a "new" server
D. Test what a master client (an old client) does with the captured payload response(s) from an "old" server
Make sure those tests still pass when run against the changes in this PR

Demonstrating the same tests pass against master (where the server and clients are old) and this PR (where the server and clients are new) builds confidence that we're not breaking behavior with any client / server skew combination

antomy-gc · 2024-12-14T11:53:44Z

should it be done in another PR first (add tests with captured data, ensure all is ok, merge) and then using those tests for validating this, or adding feature gate and using it in tests with captured data will be enough?

…uplicates support

antomy-gc · 2024-12-14T16:10:29Z

@liggitt

I tested the changes locally using a feature gate with all combinations of “new/old client/server,” and everything works.
I`ve also tested manually patch application with patches from new and old clients on main, it also works.
As expected, the changes only affect the patch generation process, not its application.

However, adding this feature gate was not such a good idea (or maybe i just did it wrong way) since it breaks dependencies across the entire project. But without it, as you said before, we can’t quickly roll back in case of a bug.
Replacing feature gate with option won’t help either, since this code is used throughout the project, and the flag would have to be passed from everywhere.

Maybe you have any suggestions on how we could handle this? For now i`ll use variable instead of gate.

k8s-ci-robot · 2024-12-14T17:17:13Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: antomy-gc
Once this PR has been reviewed and has the lgtm label, please assign pwittrock for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

staging/src/k8s.io/apimachinery/pkg/util/strategicpatch/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

antomy-gc · 2024-12-14T17:25:49Z

/retest

k8s-ci-robot · 2024-12-14T18:14:15Z

@antomy-gc: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kubernetes-e2e-gce-cos-alpha-features	`49ec604`	link	false	`/test pull-kubernetes-e2e-gce-cos-alpha-features`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

antomy-gc · 2024-12-14T19:03:33Z

/retest

antomy-gc · 2025-01-14T04:04:14Z

@liggitt could you give any estimates on when you`ll be able to check this?

k8s-triage-robot · 2025-06-15T14:40:19Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 6, 2024

k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 6, 2024

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 6, 2024

k8s-ci-robot requested review from apelisse and ncdc July 6, 2024 18:01

antomy-gc force-pushed the fix-duplicate-keys-merge branch from 6ed4b41 to f7fb7a1 Compare July 6, 2024 18:12

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jul 6, 2024

k8s-ci-robot assigned ncdc Jul 8, 2024

ncdc removed their assignment Jul 8, 2024

liggitt reviewed Aug 21, 2024

View reviewed changes

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 21, 2024

liggitt reviewed Aug 21, 2024

View reviewed changes

antomy-gc force-pushed the fix-duplicate-keys-merge branch from fe51050 to 371e8de Compare September 30, 2024 14:55

k8s-ci-robot removed the do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. label Sep 30, 2024

antomy-gc requested a review from liggitt September 30, 2024 18:09

antomy-gc added 3 commits December 14, 2024 15:38

[strategicpatch] add test case

278980f

[strategicpatch] fix deletion of duplicated element in list

6af5417

[strategicpatch] fix deletion of duplicated element in list

243c862

antomy-gc added 2 commits December 14, 2024 18:05

introduce feature gate on strategicpatch duplicates support

8647582

add tests for feature gate backward compatibility on strategicpatch d…

a442300

…uplicates support

antomy-gc force-pushed the fix-duplicate-keys-merge branch from 966bd44 to a442300 Compare December 14, 2024 14:41

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 14, 2024

antomy-gc added 2 commits December 14, 2024 19:57

move feature gate to apiextensions

49ec604

move feature gate to apimachinery

1dbfb3c

antomy-gc added 2 commits December 14, 2024 21:41

remove feature gate

95060dd

remove feature gate

e70c446

antomy-gc mentioned this pull request Dec 14, 2024

[flaky test] [It] [sig-network] Services should implement NodePort and HealthCheckNodePort correctly when ExternalTrafficPolicy changes #129221

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 15, 2025

[strategicpatch] support duplicated mergeKey values #125932

Are you sure you want to change the base?

[strategicpatch] support duplicated mergeKey values #125932

Uh oh!

Conversation

antomy-gc commented Jul 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Uh oh!

linux-foundation-easycla bot commented Jul 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Jul 6, 2024

Uh oh!

k8s-ci-robot commented Jul 6, 2024

Uh oh!

antomy-gc commented Jul 8, 2024

Uh oh!

cici37 commented Jul 9, 2024

Uh oh!

antomy-gc commented Jul 9, 2024

Uh oh!

antomy-gc commented Jul 14, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antomy-gc Aug 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antomy-gc commented Sep 30, 2024

Uh oh!

antomy-gc commented Sep 30, 2024

Uh oh!

antomy-gc commented Dec 5, 2024

Uh oh!

liggitt commented Dec 5, 2024

Uh oh!

liggitt commented Dec 6, 2024

Uh oh!

antomy-gc commented Dec 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

antomy-gc commented Dec 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Dec 14, 2024

Uh oh!

antomy-gc commented Dec 14, 2024

Uh oh!

k8s-ci-robot commented Dec 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

antomy-gc commented Dec 14, 2024

Uh oh!

antomy-gc commented Jan 14, 2025

Uh oh!

k8s-triage-robot commented Jun 15, 2025

Uh oh!

antomy-gc commented Jul 6, 2024 •

edited

Loading

linux-foundation-easycla bot commented Jul 6, 2024 •

edited

Loading

antomy-gc Aug 22, 2024 •

edited

Loading

antomy-gc commented Dec 14, 2024 •

edited

Loading

antomy-gc commented Dec 14, 2024 •

edited

Loading

k8s-ci-robot commented Dec 14, 2024 •

edited

Loading