[strategicpatch] support duplicated mergeKey values #125932
Conversation
Welcome @antomy-gc!
Hi @antomy-gc. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 6ed4b41 to f7fb7a1
/assign @ncdc
/ok-to-test
/test pull-kubernetes-unit
Hi @seans3, could you please provide an estimate of when you might be able to review this?
- name: 4
  other: c
- name: 3
/hold
Changes to these test case results represent compatibility breaks we cannot make. Existing API requests generally must produce the same results, or we will break existing successful clients.
I changed this test for the following reasons:
- it seems to describe invalid behavior
- it does not affect the content of the test data, it only changes the order of elements after the patch

The TwoWay patch directive says: add element '3', remove '1', place '3' after '2'.
But the original test expects the new element to be appended to the end of the list, ignoring the order directive.
TwoWay: []byte(`
$setElementOrder/mergingList:
  - name: 2
  - name: 3
mergingList:
  - name: 3
  - name: 1
    $patch: delete
`),
Current: []byte(`
mergingList:
  - name: 1
  - name: 2
    other: b
  - name: 4
    other: c
`),
Result: []byte(`
mergingList:
  - name: 2
    other: b
  - name: 4
    other: c
  - name: 3
`),
Same problem for ThreeWay.
I was able to reproduce the original behavior with my new code, but it broke all the other cases: instead of following the order directive, it just appended new elements.
Then I found out why this behavior occurred in this test case: all the code describes the behavior as "keep the previous order, then apply the order directive for new elements", but a side effect on the slice's underlying array produced a malformed 'order' slice.
Do we have any common approach in Kubernetes for fixing such cases? I'm sure it's not the first time we've had incorrect behavior described in tests.
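To make the failure mode concrete, here is a minimal, self-contained Go sketch of this kind of aliasing side effect; it is illustrative only, not the actual mergeSliceWithSpecialElements code:

package main

import "fmt"

func main() {
	original := []string{"1", "2", "3", "4"}

	// A sub-slice shares original's underlying array.
	kept := original[:2]

	// Appending within the shared capacity writes into that array,
	// silently overwriting original[2] as a side effect.
	kept = append(kept, "x")

	fmt.Println(original) // [1 2 x 4] -- mutated, even though we never touched it directly
	fmt.Println(kept)     // [1 2 x]
}

A caller that kept a reference to original to determine element order would now see a corrupted list, which matches the "malformed 'order' slice" described above.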
if err != nil {
	return nil, nil, err
}

originalIndex, modifiedIndex := 0, 0
Does this change how the client generates a patch to send to the server, or how the server handles an existing patch request from the client?
This method changes the way the client generates the patch, and the PR overall affects both the client and the server side.
All changes are related only to the duplicated-mergeKey case; non-duplicated elements are processed by the existing rules.
This method changes the way the client generates the patch, and the PR overall affects both the client and the server side.
Then we have to reason about the following combinations:
- new client, new server. This is the only case we're exercising in unit tests now, I think (since both the client and server side are running with new code).
- new client, old server. We have to capture the change in what a new client would generate/send as a patch, and find a way to demonstrate that an old server will do something no worse than what it does today.
- old client, new server. We have to capture what an old client would generate/send as a patch, and make sure a new server will produce the same result when given that patch (which is one reason the ordering changes to the existing unit test were concerning).
I started trying to trace all the call sites that lead here to understand if this is only impacting clients, or all server-side logic, and which components this touches. I ended up with something like this before I ran out of time:
diffListsOfMaps
  diffLists
    handleSliceDiff
      diffMaps
        CreateThreeWayMergePatch
          kubectl apply (kubectl)
        CreateTwoWayMergeMapPatchUsingLookupPatchMeta
          CreateTwoWayMergeMapPatch (unused?)
        CreateTwoWayMergePatchUsingLookupPatchMeta
          CreateTwoWayMergePatch
            kubeadm
            pkg/controller PatchNodeTaints
              ...
            pkg/controller AddOrUpdateLabelsOnNode
              ...
              TTL controller (kube-controller-manager)
            pkg/scheduler/util PatchPodStatus (kube-scheduler)
            pkg/util/pod PatchPodStatus
              PodGC controller (kube-controller-manager)
              Taint eviction controller (kube-controller-manager)
              kubelet status manager (kubelet)
              kube-scheduler preemption (kube-scheduler)
            pkg/volume/util/resize_util
              PatchPV
                AddAnnPreResizeCapacity
                  ...?
                DeleteAnnPreResizeCapacity
                  ...?
              createPVCPatch
                MarkNodeExpansionInfeasible
                  ...?
                MarkNodeExpansionFailedCondition
                  ...?
                PatchPVCStatus
                  ...?
            client-go tools/events/event_broadcaster
              ...?
            client-go tools/record/events_cache
              ...?
            cloud-provider node/helpers
              addOrUpdateLabelsOnNode
                AddOrUpdateLabelsOnNode
                  cloud-provider/controllers/node (cloud-controller-manager)
              PatchNodeTaints
                AddOrUpdateTaintOnNode
                  cloud-provider/controllers/nodelifecycle (cloud-controller-manager)
                RemoveTaintOffNode
                  cloud-provider/controllers/nodelifecycle (cloud-controller-manager)
              getPatchBytes
                PatchService
                  cloud-provider/controllers/service (cloud-controller-manager)
            component-helpers node/util/status
              ...?
            kubectl debug (kubectl)
            kubectl set (kubectl)
            kubectl taint (kubectl)
            kubectl edit (kubectl)
            kubectl drain (kubectl)
        diffListsOfMaps (recursive)
        handleMapDiff
          diffMaps (recursive)

The "...?" bits need to be chased to ground.
I apologize for the long delay on this, but we have to fully understand what this is changing, which components it is changing in, and the impact of the new behavior on version-skewed components that those components talk to across a REST API. Unfortunately, the package this PR touches is very difficult to reason about, which is why we haven't made much progress on fully tracing out the impact of the change.
Thank you for clarifying this.
I'll try to collect the full list of usages, but I may be unable to properly analyze this data. Anyway, I will be back here with results as soon as possible.
And thank you for spending so much time on this and providing detailed and comprehensive comments!
if err != nil {
	return nil, nil, err
}

originalIndex, modifiedIndex := 0, 0
for {
The structure of this change makes it really hard to reason about the side effects, and there's no reasonable way to mitigate a bug introduced by this change.
We have to be 100% sure this won't regress stuff that currently is working, and have a controlled way to roll back a change to the old behavior if an issue is found. Normally this takes the shape of an isolated code change guarded by a feature gate that can be opted into to start, then opted out of as a mitigation.
Could you give any references on how this is usually done? I guess I could isolate the changes.
Normally this is done with a feature gate, and an isolated code path only active when the gate is enabled, but if the behavior change is client-side, that's harder to gate.
Force-pushed from fe51050 to 371e8de
/retest
@liggitt, could you check it once again?
@apelisse, could you review it instead, please?
Apologies for the delay / radio silence; I've been consumed with 1.32 blockers and activities. This is still in my queue. The difficulty in reviewing this is ensuring that it does not break compatibility between a new client (built with this code) talking to an old server, or an old client talking to a new server (built with this code). It's hard to tell whether the tests in this PR exercise those scenarios (or whether they even can).
To expand on #125932 (comment), these are the combinations I mentioned in #125932 (comment). Normally, the approach to testing this is:
Demonstrating that the same tests pass against master (where the server and clients are old) and against this PR (where the server and clients are new) builds confidence that we're not breaking behavior with any client/server skew combination.
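A sketch of what such a captured-fixture test could look like: CreateTwoWayMergePatch is the real apimachinery entry point, but the pinned patch value below is illustrative, not bytes actually captured from an old client build:

package strategicpatch_test

import (
	"testing"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/strategicpatch"
)

// Pins the exact bytes a client generates, so the same fixture can be
// replayed against both old and new servers to check skew behavior.
func TestGeneratedPatchBytesUnchanged(t *testing.T) {
	original := []byte(`{"spec":{"containers":[{"name":"a","image":"old"}]}}`)
	modified := []byte(`{"spec":{"containers":[{"name":"a","image":"new"}]}}`)

	patch, err := strategicpatch.CreateTwoWayMergePatch(original, modified, &v1.Pod{})
	if err != nil {
		t.Fatal(err)
	}

	// Hypothetical value, to be captured from a build of the old client.
	want := `{"spec":{"$setElementOrder/containers":[{"name":"a"}],"containers":[{"image":"new","name":"a"}]}}`
	if string(patch) != want {
		t.Fatalf("generated patch changed:\ngot:  %s\nwant: %s", patch, want)
	}
}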
Should it be done in another PR first (add tests with captured data, ensure all is ok, merge) and then use those tests to validate this, or will adding a feature gate and using it in tests with captured data be enough?
Force-pushed from 966bd44 to a442300
I tested the changes locally using a feature gate with all combinations of new/old client/server, and everything works. However, adding this feature gate was not such a good idea (or maybe I just did it the wrong way), since it breaks dependencies across the entire project. But without it, as you said before, we can't quickly roll back in case of a bug. Do you have any suggestions on how we could handle this? For now I'll use a variable instead of a gate.
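For illustration, the variable-instead-of-a-gate approach could look something like the following sketch; all names here are hypothetical and the two diff functions are stand-ins, not the PR's actual code:

package main

import "fmt"

// Hypothetical rollback switch: a package-level variable selecting between
// the old and new merge logic. Cheaper than a feature gate (which
// apimachinery cannot easily depend on), yet still flippable in one place.
var enableDuplicatedMergeKeySupport = true

// legacyDiff stands in for the pre-PR merge logic.
func legacyDiff(items []string) []string {
	return items
}

// diffWithDuplicates stands in for the new duplicated-mergeKey-aware logic.
func diffWithDuplicates(items []string) []string {
	return append([]string{"dedup-aware"}, items...)
}

func diff(items []string) []string {
	if !enableDuplicatedMergeKeySupport {
		// Old code path, kept intact so behavior can be restored exactly.
		return legacyDiff(items)
	}
	return diffWithDuplicates(items)
}

func main() {
	fmt.Println(diff([]string{"a", "b"})) // new behavior
	enableDuplicatedMergeKeySupport = false
	fmt.Println(diff([]string{"a", "b"})) // rolled back to old behavior
}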
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: antomy-gc. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
@antomy-gc: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/retest
@liggitt, could you give any estimate of when you'll be able to check this?
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
What type of PR is this?
/kind bug
/sig api-machinery
What this PR does / why we need it:
Fixes a bug in strategicpatch, which was unable to properly handle list values with duplicated mergeKey values.
The original manifest contains a duplicated value with mergeKey 2. Trying to remove one of the duplicates (for example, with kubectl apply of a corrected manifest) actually removes both duplicates.
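A hypothetical, self-contained reproduction of this failure mode; the struct and data below are modeled on the package's test fixtures, not taken from the original report:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/strategicpatch"
)

// mergeItem and fixture mirror the style of the package's test types:
// a list merged on the "name" key. Both names are hypothetical.
type mergeItem struct {
	Name  string `json:"name,omitempty"`
	Other string `json:"other,omitempty"`
}

type fixture struct {
	MergingList []mergeItem `json:"mergingList,omitempty" patchStrategy:"merge" patchMergeKey:"name"`
}

func main() {
	// Two entries share the mergeKey name=2.
	original := []byte(`{"mergingList":[{"name":"1"},{"name":"2","other":"a"},{"name":"2","other":"b"}]}`)

	// A patch intended to delete just one of the duplicated entries.
	patch := []byte(`{"mergingList":[{"name":"2","$patch":"delete"}]}`)

	result, err := strategicpatch.StrategicMergePatch(original, patch, fixture{})
	if err != nil {
		panic(err)
	}

	// Before this PR, both name=2 entries are removed, not just one.
	fmt.Println(string(result))
}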
Since we allow users to have duplicated values in manifests, it seems reasonable to provide at least minimal support for them in strategic merge. Currently, the user cannot even delete one of the duplicates added by mistake, as this would result in a manifest missing the necessary value.
Adding stricter validation might be a more strategic solution, but it requires significant time for migration. Therefore, I suggest supporting duplicate values at least until the validation is implemented and prevents the creation of such manifests.
Which issue(s) this PR fixes:
Fixes #58477
Fixes #93266
Special notes for your reviewer:
Discovered the reason for the incorrect element ordering described in the test case "add map and delete map from merging list". This behavior is caused by a side effect of the mergeSliceWithSpecialElements function, which was changing the contents of the original slice higher up the call stack, where it was used to determine element order and was assumed to be still unmodified. Left a comment, didn't fix.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: