Fixed large resourceversion and limit for storages #132374
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the appropriate triage label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @PatrickLaabs. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.

I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Could you please add a unit test for it?
Thanks for the response, and of course. I was waiting for a response on this before investing more time.
The reason you get different errors lies in the different code paths involved. I'm not against unifying the semantics; I think it's even beneficial, but we definitely need a test for that. I suggest adding a scenario to the relevant test.
```go
		// If we can't get the current RV, use 0 as a fallback.
		currentRV = 0
	}
	return storage.NewTooLargeResourceVersionError(requestedRV, currentRV, 1)
```
If I remember correctly, the `1` in `NewTooLargeResourceVersionError` stands for 1 second in `Retry-After`. That seems like an incorrect suggestion for clients.
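For illustration, here is a simplified, hypothetical model of what that third argument controls. The real `NewTooLargeResourceVersionError` lives in `k8s.io/apiserver/pkg/storage`; `statusError` and the constructor below are stand-ins only, sketching how the retry-seconds value becomes the `Retry-After` hint served to clients:

```go
package main

import "fmt"

// statusError is a hypothetical stand-in for the apierrors.StatusError that
// storage.NewTooLargeResourceVersionError actually constructs. The point of
// the sketch: the third argument ends up as the RetryAfterSeconds hint that
// the apiserver serializes into the Retry-After response header.
type statusError struct {
	message           string
	retryAfterSeconds int
}

func (e statusError) Error() string { return e.message }

func newTooLargeResourceVersionError(requestedRV, currentRV uint64, retrySeconds int) statusError {
	return statusError{
		message:           fmt.Sprintf("Too large resource version: %d, current: %d", requestedRV, currentRV),
		retryAfterSeconds: retrySeconds,
	}
}

func main() {
	// Passing a hard-coded 1 here tells every client "retry in 1 second",
	// regardless of how far ahead the requested resourceVersion actually is.
	err := newTooLargeResourceVersionError(9999999, 42, 1)
	fmt.Printf("%s (Retry-After: %d)\n", err.Error(), err.retryAfterSeconds)
}
```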
Yes, that's correct. Setting it to a value of 0 is a better approach, right?
Could you check the other places where we use `NewTooLargeResourceVersionError` and see what we do there?
kubernetes/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go
Lines 1035 to 1049 in 0f478e5
```go
func (s *store) validateMinimumResourceVersion(minimumResourceVersion string, actualRevision uint64) error {
	if minimumResourceVersion == "" {
		return nil
	}
	minimumRV, err := s.versioner.ParseResourceVersion(minimumResourceVersion)
	if err != nil {
		return apierrors.NewBadRequest(fmt.Sprintf("invalid resource version: %v", err))
	}
	// Enforce the storage.Interface guarantee that the resource version of the returned data
	// "will be at least 'resourceVersion'".
	if minimumRV > actualRevision {
		return storage.NewTooLargeResourceVersionError(minimumRV, actualRevision, 0)
	}
	return nil
}
```
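As a minimal runnable sketch of the guard above, assuming a plain integer resourceVersion encoding in place of the real `s.versioner` and simple `fmt` errors in place of the apiserver error constructors:

```go
package main

import (
	"fmt"
	"strconv"
)

// validateMinimumResourceVersion is a self-contained re-implementation of the
// guard shown above; strconv.ParseUint stands in for the versioner, and plain
// fmt errors stand in for apierrors.NewBadRequest and
// storage.NewTooLargeResourceVersionError.
func validateMinimumResourceVersion(minimumResourceVersion string, actualRevision uint64) error {
	if minimumResourceVersion == "" {
		return nil // no minimum requested, nothing to enforce
	}
	minimumRV, err := strconv.ParseUint(minimumResourceVersion, 10, 64)
	if err != nil {
		return fmt.Errorf("invalid resource version: %v", err)
	}
	// Enforce the storage.Interface guarantee that returned data
	// "will be at least 'resourceVersion'".
	if minimumRV > actualRevision {
		return fmt.Errorf("too large resource version: %d, current: %d", minimumRV, actualRevision)
	}
	return nil
}

func main() {
	fmt.Println(validateMinimumResourceVersion("", 100))     // <nil>
	fmt.Println(validateMinimumResourceVersion("50", 100))   // <nil>
	fmt.Println(validateMinimumResourceVersion("9999", 100)) // too large resource version: 9999, current: 100
}
```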
I found 3 cases:

1. `return storage.NewTooLargeResourceVersionError(minimumRV, actualRevision, 0)` with `0`, used in the etcd3 store.
2. `return storage.NewTooLargeResourceVersionError(resourceVersion, w.resourceVersion, resourceVersionTooHighRetrySeconds)` with `resourceVersionTooHighRetrySeconds = 1`, used in the cache.
3. `wc.sendError(storage.NewTooLargeResourceVersionError(uint64(wc.initialRev), currentStorageRV, int(wait.Jitter(1*time.Second, 3).Seconds())))`

Looks like there is no established value. For now I would set it to `0` to be consistent with the etcd3 store. Long term it would be better to have one common strategy.
That's exactly what I thought. I'll set a reminder to open a follow-up issue for this one 👍
After reviewing the unit test, I found something:

If I am not completely wrong, that's exactly the point we are looking for in our unit tests, right? With the updated code I commented on in the reviews, I made these adjustments to the unit test:

What do you think? Or shall we extend the tests?
Yes.
/ok-to-test
/lgtm

PTAL @wojtek-t
LGTM label has been added. Git tree hash: 5a109d1122209e58f31535536f127fa2c8ca0d60
```diff
@@ -744,6 +746,14 @@ func (s *store) GetList(ctx context.Context, key string, opts storage.ListOption
 	})
 	metrics.RecordEtcdRequest(metricsOp, s.groupResource, err, startTime)
 	if err != nil {
+		if errors.Is(err, etcdrpc.ErrFutureRev) {
+			currentRV, getRVErr := s.GetCurrentResourceVersion(ctx)
+			if getRVErr != nil {
```
Do we need to surface this error somewhere, or maybe just log it?
This is already on the error handling path; we only call GetCurrentResourceVersion to provide more information in the error. I don't think there is a need to surface the error, but returning `NewTooLargeResourceVersionError` with rev 0 might not be correct. Maybe we should change the error.
I am currently a little busy at work. I'll get back to this on Friday 😊
I think we can improve that in a follow-up; let's merge this, as it already unifies the error types.
@wojtek-t Sounds good. Shall I create the issue, or do you want to create it and assign it to me?
Please go ahead and create it
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: PatrickLaabs, wojtek-t

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Follow-up issue to improve the returned error message:
What type of PR is this?
/kind bug
What this PR does / why we need it:
Currently, when you try to query the StorageClassList with a very high resourceVersion and a high limit value, you'll get a response like this:

NOTE: You'll not only get an error message from etcdserver; you will also get a 500 error code.

Issue #132358 suggested that we might want to return a more graceful error message and an error code of 504, like this:

After making some slight adjustments to the error handling, we can query the StorageClassList with a very high resourceVersion and a high limit value and receive the desired message:
Which issue(s) this PR is related to:
Fixes #132358
Special notes for your reviewer:
I am not quite sure if the current change is what we really want.
This was a challenge for me 😄
We check for the error returned by etcd: `if err == etcdrpc.ErrFutureRev`, which is in fact this sentinel error.

As I was not sure whether this is good practice, I looked at the `interpretListError` function, which is called within the `GetList` function. And yes, we are already doing it this way.

But I'd be more than happy for suggestions if this is not the right approach 👍
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: