Skip to content

[strategicpatch] support duplicated mergeKey values #125932

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

antomy-gc
Copy link

@antomy-gc antomy-gc commented Jul 6, 2024

What type of PR is this?

/kind bug
/sig api-machinery

What this PR does / why we need it:

Fix bug in strategicpatch which was unable to properly handle list values with duplicated MergeKey values.

Original manifest contains duplicated value with mergeKey 2:

mergingList:
- name: 1
- name: 2
  value: dup1
- name: 3
- name: 2
  value: dup2

Trying to remove one of duplicates (for example, with kubectl apply of following):

mergingList:
- name: 1
- name: 2
  value: dup2
- name: 3

Actual result of patching: removed both of duplicates:

mergingList:
- name: 1
- name: 3

Since we allow user to have duplicated values in manifests, it seems to be reasonable to provide at least minimal support of those in strategicmerge. Currently, the user cannot even delete one of the duplicates added by mistake, as this would result in a manifest missing the necessary value.

Adding stricter validation might be a more strategic solution, but it requires significant time for migration. Therefore, I suggest supporting duplicate values at least until the validation is implemented and prevents the creation of such manifests.

Which issue(s) this PR fixes:

Fixes #58477
Fixes #93266

Special notes for your reviewer:

Discovered reason of incorrect element ordering described in test case add map and delete map from merging list.
This behavior is caused by a side effect of mergeSliceWithSpecialElements function, which was changing contents of original slice upper in call stack, where it was used to determine elements order and assumed to be still unmodified.
Left comment, didnt fix.

Does this PR introduce a user-facing change?

Fixed strategicpatch bug when attempt to remove one of duplicated by mergeKey value elements from list 
with `kubectl appy` could lead to removal of all elements with this mergeKey.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 6, 2024
Copy link

linux-foundation-easycla bot commented Jul 6, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 6, 2024
@k8s-ci-robot
Copy link
Contributor

Welcome @antomy-gc!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 6, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @antomy-gc. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot requested review from apelisse and ncdc July 6, 2024 18:01
@antomy-gc antomy-gc force-pushed the fix-duplicate-keys-merge branch from 6ed4b41 to f7fb7a1 Compare July 6, 2024 18:12
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jul 6, 2024
@antomy-gc
Copy link
Author

/assign @ncdc

@ncdc ncdc removed their assignment Jul 8, 2024
@cici37
Copy link
Contributor

cici37 commented Jul 9, 2024

/ok-to-test
/assign @seans3 Thanks!
/triage accepted

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 9, 2024
@antomy-gc
Copy link
Author

/test pull-kubernetes-unit

@antomy-gc
Copy link
Author

Hi @seans3 could you please provide an estimate on when you might be able to review this?

- name: 4
other: c
- name: 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold

Changes to these test case results represent compatibility breaks we cannot make. Existing API requests generally must produce the same results or we will break existing successful clients

Copy link
Author

@antomy-gc antomy-gc Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this test because of these reasons:

  • it seems to describe invalid behavior
  • it does not affect content of test data, just changes order of elements after patch

TwoWay patch directive says: add element '3', remove '1', place '3' after '2'.
But in original test we expect new element to be appended to the end of list ignoring order directive.

			TwoWay: []byte(`
$setElementOrder/mergingList:
  - name: 2
  - name: 3
mergingList:
  - name: 3
  - name: 1
    $patch: delete
`),
			Current: []byte(`
mergingList:
  - name: 1
  - name: 2
    other: b
  - name: 4
    other: c
`),
			Result: []byte(`
mergingList:
  - name: 2
    other: b
  - name: 4
    other: c
  - name: 3

Same problem for ThreeWay.

I was able to reproduce original behavior with my new code, but it resulted in breaking all other cases: instead of using order directive it was just appending new elements.
Then i found out why this behavior occurred in this test case: all the code was describing behavior like "keep previous order, then apply order directive for new elements", but side effect on slice`s underlying array was resulting in malformed 'order' slice.

Do we have any common approach for kubernetes to fix such cases? i`m sure its not the first time when we have incorrect behavior described in tests.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 21, 2024
if err != nil {
return nil, nil, err
}

originalIndex, modifiedIndex := 0, 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change how the client generates a patch to send to the server, or how the server handles an existing patch request from the client?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method changes the way how the client generates patch command and PR in general affects both client and server side.

All changes are only related to duplicate MergeKey values cases, non-duplicated elements are processed by existing rules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method changes the way how the client generates patch command and PR in general affects both client and server side.

Then we have to reason about the following combinations:

  1. new client, new server. This is the only case we're exercising in unit tests now, I think (since both the client and server side are running with new code).
  2. new client, old server. We have to capture the change in what a new client would generate/send as a patch, and find a way to demonstrate that an old server will do something no worse than what it does today.
  3. old client, new server. We have to capture what an old client would generate/send as a patch, and make sure a new server will produce the same result when given that patch (which is one reason the ordering changes to the existing unit test were concerning).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started trying to trace all the call sites that lead here to understand if this is only impacting clients, or all server-side logic, and which components this touches. I ended up with something like this before I ran out of time:

diffListsOfMaps
  diffLists
    handleSliceDiff
      diffMaps
        CreateThreeWayMergePatch
          kubectl apply (kubectl)
        CreateTwoWayMergeMapPatchUsingLookupPatchMeta
          CreateTwoWayMergeMapPatch (unused?)
          CreateTwoWayMergePatchUsingLookupPatchMeta
            CreateTwoWayMergePatch
              kubeadm
              pkg/controller PatchNodeTaints
                ...
              pkg/controller AddOrUpdateLabelsOnNode
                ...
              TTL controller (kube-controller-manager)
              pkg/scheduler/util PatchPodStatus (kube-scheduler)
              pkg/util/pod PatchPodStatus
                PodGC controller (kube-controller-manager)
                Taint eviction controller (kube-controller-manager)
                kubelet status manager (kubelet)
                kube-scheduler preemption (kube-scheduler)
              pkg/volume/util/resize_util
                PatchPV
                  AddAnnPreResizeCapacity
                    ...?
                  DeleteAnnPreResizeCapacity
                    ...?
                createPVCPatch
                  MarkNodeExpansionInfeasible
                    ...?
                  MarkNodeExpansionFailedCondition
                    ...?
                  PatchPVCStatus
                    ...?
              client-go tools/events/event_broadcaster
                ...?
              client-go tools/record/events_cache
                ...?
              cloud-provider node/helpers
                addOrUpdateLabelsOnNode
                  AddOrUpdateLabelsOnNode
                    cloud-provider/controllers/node (cloud-controller-manager)
                PatchNodeTaints
                  AddOrUpdateTaintOnNode
                    cloud-provider/controllers/nodelifecycle (cloud-controller-manager)
                  RemoveTaintOffNode
                    cloud-provider/controllers/nodelifecycle (cloud-controller-manager)
                getPatchBytes
                  PatchService
                    cloud-provider/controllers/service (cloud-controller-manager)
              component-helpers node/util/status
                ...?
              kubectl debug (kubectl)
              kubectl set (kubectl)
              kubectl taint (kubectl)
              kubectl edit (kubectl)
              kubectl drain (kubectl)
        diffListsOfMaps (recursive)
        handleMapDiff
          diffMaps (recursive)

The ...? bits need to be chased to ground.

I apologize for the long delay on this, but we have to fully understand what this is changing, which components it is changing in, and the impact of that new behavior on skewed components those talk to across a REST API. Unfortunately, the package this PR touches is very difficult to reason about, which is why we haven't made much progress on fully tracing out the impact of the change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for clarifying this.
i`ll try to collect full list of usages, but may be unable to properly analyze this data. Anyway, i will be back here with results asap.

And thank you for spending so much time on this and providing detailed and comprehensive comments!

if err != nil {
return nil, nil, err
}

originalIndex, modifiedIndex := 0, 0
for {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structure of this change makes it really hard to reason about the side effects, and there's no reasonable way to mitigate a bug introduced by this change.

We have to be 100% sure this won't regress stuff that currently is working, and have a controlled way to roll back a change to the old behavior if an issue is found. Normally this takes the shape of an isolated code change guarded by a feature gate that can be opted into to start, then opted out of as a mitigation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May you give any references on how this usually done? I guess i could isolate the changes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally this is done with a feature gate, and an isolated code path only active when the gate is enabled, but if the behavior change is client-side, that's harder to gate.

@antomy-gc antomy-gc force-pushed the fix-duplicate-keys-merge branch from fe51050 to 371e8de Compare September 30, 2024 14:55
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. label Sep 30, 2024
@antomy-gc
Copy link
Author

/retest

@antomy-gc antomy-gc requested a review from liggitt September 30, 2024 18:09
@antomy-gc
Copy link
Author

@liggitt could you check it once again?
i`ve managed to implement it without affecting existing unit tests and patch application code, only patch generation step has changed.
it also reduced amount of changed code so it should not take lot of your time to review

@antomy-gc
Copy link
Author

@apelisse could you review it instead, please?
This PR is still waiting for review 3 months after last commit and i`m a bit confused on how to get it reviewed.
Appreciate your help!

@liggitt
Copy link
Member

liggitt commented Dec 5, 2024

Apologies for the delay / radio silence, I've been consumed with 1.32 blockers and activities.

This is still in my queue. The difficulty in reviewing this is ensuring that it is not breaking compatibility between a new client (built with this code) talking to an old server, or an old client talking to a new server (built with this code). It's hard to tell if the tests in this PR are exercising those scenarios (or if they even can)

@liggitt
Copy link
Member

liggitt commented Dec 6, 2024

To expand on #125932 (comment), these are the combinations I mentioned in #125932 (comment)

Normally, the approach to testing this is:

  1. Capture client / server wire payloads for the relevant scenario
    A. capture the request produced by an old client, and the response of an old server to that request
    B. capture the request produced by a new client, and the response of a new server with the change in this PR

  2. Write tests against master (without this PR) using the captured data
    A. Test what a master server (an old server) does against the captured payload from a "new" client
    B. Test what a master server (an old server) does against the captured payload from an "old" client
    C. Test what a master client (an old client) does with the captured payload response(s) from a "new" server
    D. Test what a master client (an old client) does with the captured payload response(s) from an "old" server

  3. Make sure those tests still pass when run against the changes in this PR

Demonstrating the same tests pass against master (where the server and clients are old) and this PR (where the server and clients are new) builds confidence that we're not breaking behavior with any client / server skew combination

@antomy-gc
Copy link
Author

antomy-gc commented Dec 14, 2024

should it be done in another PR first (add tests with captured data, ensure all is ok, merge) and then using those tests for validating this, or adding feature gate and using it in tests with captured data will be enough?

@antomy-gc antomy-gc force-pushed the fix-duplicate-keys-merge branch from 966bd44 to a442300 Compare December 14, 2024 14:41
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 14, 2024
@antomy-gc
Copy link
Author

antomy-gc commented Dec 14, 2024

@liggitt

I tested the changes locally using a feature gate with all combinations of “new/old client/server,” and everything works.
I`ve also tested manually patch application with patches from new and old clients on main, it also works.
As expected, the changes only affect the patch generation process, not its application.

However, adding this feature gate was not such a good idea (or maybe i just did it wrong way) since it breaks dependencies across the entire project. But without it, as you said before, we can’t quickly roll back in case of a bug.
Replacing feature gate with option won’t help either, since this code is used throughout the project, and the flag would have to be passed from everywhere.

Maybe you have any suggestions on how we could handle this? For now i`ll use variable instead of gate.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: antomy-gc
Once this PR has been reviewed and has the lgtm label, please assign pwittrock for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@antomy-gc
Copy link
Author

/retest

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Dec 14, 2024

@antomy-gc: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-e2e-gce-cos-alpha-features 49ec604 link false /test pull-kubernetes-e2e-gce-cos-alpha-features

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@antomy-gc
Copy link
Author

/retest

@antomy-gc
Copy link
Author

@liggitt could you give any estimates on when you`ll be able to check this?

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: No status
6 participants