Skip to content

Estimate average size of objects in etcd and plug it into request cost estimator #132355

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 26, 2025

Conversation

serathius
Copy link
Contributor

@serathius serathius commented Jun 17, 2025

/kind feature

Ref #132233
Alternative to #132292

Added SizeBasedListCostEstimate feature gate that allows apiserver to estimate sizes of objects to calculate cost of LIST requests

This PR proposes to track object size in resource by caching last observed state for each key and then calculating average size of keys available in etcd. Fetching the list of keys from etcd is required to make sure we remove stale keys that were not deleted due skipped watch events. We could take list of keys from cacher to avoid WithKeysOnly request.

/assign @deads2k @wojtek-t

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jun 17, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 17, 2025
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 17, 2025
@serathius serathius changed the title Apf estimate size Estimate average size of objects in etcd and plug it into request cost estimator Jun 17, 2025
@serathius serathius force-pushed the apf_estimate_size branch 5 times, most recently from 9771f1c to f4ea6bd Compare June 17, 2025 17:50
@serathius
Copy link
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 17, 2025
@yongruilin
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 17, 2025
@serathius serathius force-pushed the apf_estimate_size branch from f4ea6bd to 88bd225 Compare June 18, 2025 07:24
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 18, 2025
@serathius serathius force-pushed the apf_estimate_size branch from 88bd225 to 018bae7 Compare June 18, 2025 09:05
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 23, 2025
@k8s-ci-robot k8s-ci-robot requested review from deads2k and wojtek-t June 23, 2025 12:52
@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 23, 2025
@serathius serathius force-pushed the apf_estimate_size branch from de52fb4 to c0dc3e1 Compare June 23, 2025 14:35
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 23, 2025
@serathius
Copy link
Contributor Author

Updated implementation to solve both issues:

  • automatically execute GetKeys periodically if it was not executed within last minute.
  • Use watch cache to get list of keys if enabled.

I didn't had time to polish the PR, but please review the interfaces cc @wojtek-t @deads2k

@serathius serathius force-pushed the apf_estimate_size branch 2 times, most recently from 4bfdf01 to bdf3707 Compare June 23, 2025 21:21
@serathius serathius force-pushed the apf_estimate_size branch from bdf3707 to e9173b8 Compare June 24, 2025 10:03
@serathius
Copy link
Contributor Author

/retest

Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments - but overall this lgtm

@@ -792,6 +814,9 @@ func (s *store) GetList(ctx context.Context, key string, opts storage.ListOption

// being unable to set the version does not prevent the object from being extracted
if matched, err := opts.Predicate.Matches(obj); err == nil && matched {
if s.stats != nil {
s.stats.AddOrUpdate(string(kv.Key), kv.ModRevision, int64(len(kv.Value)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started wondering about the impact of this change on potential list latency (as well as the corresponding watch change on the watch throughput).

Recording the change in cache is fast, but if we want to acquire and release a lock for every single change here - do we have any benchmarks showing the impact?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote the updates to stats to handle all KVs at once instead of one by one to reduce locking.

Started doing benchmarks for List, preliminary results show 3-7% performance overhead. Pending estimation of watch overhead.

Copy link
Contributor Author

@serathius serathius Jun 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After rewrite it looks like the overhead is at max 5% when we paginate with 150`000 objects. As benchmark uses minimal object size this is worst case scenario.

$ benchstat -col /SizeBasedListCostEstimate bench5.txt
                                                                                     │    true     │               false                │
                                                                                     │   sec/op    │   sec/op     vs base               │
StoreList/Namespaces=10000/Pods=150000/Nodes=5000/RV=/Scope=Cluster/Paginate=0-24      212.3m ± 6%   204.8m ± 2%  -3.56% (p=0.001 n=10)
StoreList/Namespaces=10000/Pods=150000/Nodes=5000/RV=/Scope=Cluster/Paginate=1000-24   292.3m ± 2%   306.9m ± 1%  +5.02% (p=0.000 n=10)
StoreList/Namespaces=10000/Pods=150000/Nodes=5000/RV=/Scope=Node/Paginate=0-24         204.1m ± 1%   203.9m ± 3%       ~ (p=0.912 n=10)
StoreList/Namespaces=10000/Pods=150000/Nodes=5000/RV=/Scope=Namespace/Paginate=0-24    81.17µ ± 2%   83.82µ ± 1%  +3.27% (p=0.000 n=10)
StoreList/Namespaces=50/Pods=150000/Nodes=5000/RV=/Scope=Cluster/Paginate=0-24         206.9m ± 6%   208.5m ± 1%       ~ (p=0.579 n=10)
StoreList/Namespaces=50/Pods=150000/Nodes=5000/RV=/Scope=Cluster/Paginate=1000-24      320.3m ± 1%   325.7m ± 1%  +1.69% (p=0.004 n=10)
StoreList/Namespaces=50/Pods=150000/Nodes=5000/RV=/Scope=Node/Paginate=0-24            205.6m ± 3%   207.3m ± 5%       ~ (p=0.684 n=10)
StoreList/Namespaces=50/Pods=150000/Nodes=5000/RV=/Scope=Namespace/Paginate=0-24       4.034m ± 1%   4.030m ± 1%       ~ (p=0.315 n=10)
StoreList/Namespaces=50/Pods=150000/Nodes=5000/RV=/Scope=Namespace/Paginate=1000-24    3.752m ± 1%   3.703m ± 1%  -1.28% (p=0.002 n=10)
StoreList/Namespaces=100/Pods=110000/Nodes=1000/RV=/Scope=Cluster/Paginate=0-24        148.4m ± 2%   144.7m ± 3%  -2.43% (p=0.019 n=10)
StoreList/Namespaces=100/Pods=110000/Nodes=1000/RV=/Scope=Cluster/Paginate=1000-24     203.1m ± 1%   201.7m ± 1%       ~ (p=0.063 n=10)
StoreList/Namespaces=100/Pods=110000/Nodes=1000/RV=/Scope=Node/Paginate=0-24           147.6m ± 2%   145.1m ± 3%       ~ (p=0.143 n=10)
StoreList/Namespaces=100/Pods=110000/Nodes=1000/RV=/Scope=Node/Paginate=100-24          15.80 ± 1%    15.61 ± 1%  -1.16% (p=0.011 n=10)
StoreList/Namespaces=100/Pods=110000/Nodes=1000/RV=/Scope=Namespace/Paginate=0-24      1.434m ± 1%   1.419m ± 1%  -1.02% (p=0.043 n=10)
StoreList/Namespaces=100/Pods=110000/Nodes=1000/RV=/Scope=Namespace/Paginate=1000-24   1.430m ± 1%   1.418m ± 1%       ~ (p=0.089 n=10)
geomean                                                                                50.00m        49.95m       -0.10%

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the BenchmarkStatsCacheCleanKeys to observe the potential negative impact on watch latency during cleanup.

BenchmarkStatsCacheCleanKeys-24               55          21719938 ns/op        10496345 B/op        515 allocs/op
BenchmarkStatsCacheCleanKeys-24               57          28318478 ns/op        10496350 B/op        515 allocs/op
BenchmarkStatsCacheCleanKeys-24               62          22347929 ns/op        10496460 B/op        516 allocs/op

It takes 2-3ms to run cleanup for 150'000 keys.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I think that's acceptable thanks!

@serathius serathius force-pushed the apf_estimate_size branch 3 times, most recently from 34edc05 to 67b52d7 Compare June 25, 2025 15:25
@serathius serathius force-pushed the apf_estimate_size branch from 67b52d7 to 1d6409a Compare June 25, 2025 15:26
@serathius serathius force-pushed the apf_estimate_size branch from 1d6409a to 1639b09 Compare June 26, 2025 09:09
@wojtek-t
Copy link
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 3a0607c1e6a4b1231533052ebe8be461a89ce067

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: serathius, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2025
@k8s-ci-robot k8s-ci-robot merged commit 0256361 into kubernetes:master Jun 26, 2025
14 of 15 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants