Index pods on namespace and labels so KCM controllers can query the informer cache efficiently. Optimize Endpoint and EndpointSlice controller lock contention #132396
Conversation
/assign @aojea
/release-note-none
/retest
// For each label in the pod, create an index key in the format of "ns:<namespace>/label:<key>=<value>"
indexKeys := make([]string, 0, len(pod.Labels))
for k, v := range pod.Labels {
	indexKeys = append(indexKeys, fmt.Sprintf("ns:%s/label:%s=%s", pod.Namespace, k, v))
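For context, a minimal sketch (not the PR's exact code) of how an index function with this key format could be registered on the shared pod informer; the constant and function names below are illustrative stand-ins for the PR's controller.AddPodNamespaceLabelIndexer helper:

```go
// Sketch only: register a namespace+label index on the shared pod informer.
// Names are illustrative; the PR's helper is controller.AddPodNamespaceLabelIndexer.
package controller

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

const podNamespaceLabelIndex = "podNamespaceLabel"

// podNamespaceLabelIndexFunc emits one key per pod label in the form
// "ns:<namespace>/label:<key>=<value>", matching the hunk above.
func podNamespaceLabelIndexFunc(obj interface{}) ([]string, error) {
	pod, ok := obj.(*v1.Pod)
	if !ok {
		return nil, nil
	}
	keys := make([]string, 0, len(pod.Labels))
	for k, v := range pod.Labels {
		keys = append(keys, fmt.Sprintf("ns:%s/label:%s=%s", pod.Namespace, k, v))
	}
	return keys, nil
}

// addPodNamespaceLabelIndexer registers the index; callers typically ignore
// the "indexer already exists" style error, as the controller diffs below do.
func addPodNamespaceLabelIndexer(informer cache.SharedIndexInformer) error {
	return informer.AddIndexers(cache.Indexers{podNamespaceLabelIndex: podNamespaceLabelIndexFunc})
}
```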
curious question, what is the impact on memory consumption?
The benchmark shows a ~3x increase in allocations, from 21 allocs/op to 73 allocs/op.
Hmm - indexing on every single label of every pod is super risky IMHO, and in certain deployments may lead to a very substantial number of index keys, which I don't think is a good direction.
Yeah, we have experience with Cilium depending on labels to create indexes for network policies, and that scales very badly. The fact that labels also mutate during the pod lifecycle makes it problematic too. @hakuna-matatah, I agree with @wojtek-t; better not to pursue this path.
/hold
@aojea @wojtek-t really appreciate the input — I agree that unbounded label cardinality or mutation is concerning in terms of memory consumption. I want to clarify that the goal of the benchmark was to show performance impact, not optimize for memory per se.
Let's walk through the memory requirements to better quantify the tradeoff:
N = number of pods
X = number of labels unique to each pod (not shared with any other pod)
Y = number of labels shared across all pods
Rough memory usage:
Slice entry pointer = 8 B
Index key string (namespace + key + value) ~= 63 B
Unique-keys memory footprint: (N * X) * (63 B index key + 8 B ptr)
Shared-keys memory footprint: Y * (63 B index key + N * 8 B ptr)
Total memory approximation ~= unique-keys footprint + shared-keys footprint
If we take 150K pods in the worst case, with 10 unique labels and 10 shared labels per pod:
Unique-key memory footprint ~= 150K * 10 * (63 B + 8 B) ~= 106 MB
Shared-key memory footprint ~= 10 * (63 B + 150K * 8 B) ~= 12 MB
Total memory footprint approximation ~= 118 MB
With slice headers and map overhead it might add up to ~200 MB overall; still, I feel ~200 MB for 10 shared + 10 unique labels across 150K pods is OK?
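For concreteness, a tiny back-of-envelope sketch reproducing the arithmetic above (the byte sizes are assumptions, not measurements):

```go
// Back-of-envelope check of the estimate above; sizes are assumptions, not measurements.
package main

import "fmt"

func main() {
	const (
		pods         = 150_000
		uniqueLabels = 10 // labels unique to each pod
		sharedLabels = 10 // labels shared across all pods
		keyBytes     = 63 // "ns:<namespace>/label:<key>=<value>" index key string
		ptrBytes     = 8  // one slice entry per pod per index key
	)
	unique := pods * uniqueLabels * (keyBytes + ptrBytes) // each unique key holds a single pod
	shared := sharedLabels * (keyBytes + pods*ptrBytes)   // each shared key holds all pods
	fmt.Printf("unique ~ %d MB, shared ~ %d MB, total ~ %d MB\n",
		unique/1_000_000, shared/1_000_000, (unique+shared)/1_000_000)
	// Prints: unique ~ 106 MB, shared ~ 12 MB, total ~ 118 MB (before slice-header/map overhead)
}
```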
Given the ~15x performance gain in label-based pod lookups, I believe this could be a decent tradeoff in clusters where labels are controlled. To be safe, I propose:
Guarding the label indexing behind a KCM flag like --enable-pod-label-indexing (defaulting to false)
Cluster operators can opt in if they know label cardinality is bounded and pod labels are stable
This gives flexibility to performance-sensitive clusters without imposing memory costs on the general case. Let me know if this direction sounds reasonable.
I'd like to hear your thoughts on guarding it with a KCM flag and leaving it to Kubernetes users to opt in.
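If the opt-in route is taken, the gate itself could be as small as the sketch below; the boolean and helper names are hypothetical (mirroring the proposal above), and it assumes the same package and imports as the indexer sketch earlier in the thread:

```go
// Hypothetical wiring for the proposed --enable-pod-label-indexing opt-in;
// the flag/field names are illustrative, not existing KCM options.
func maybeAddPodNamespaceLabelIndexer(informer cache.SharedIndexInformer, enablePodLabelIndexing bool) error {
	if !enablePodLabelIndexing {
		// Default (false): skip the index entirely so clusters with high or
		// unstable label cardinality pay no extra memory.
		return nil
	}
	return addPodNamespaceLabelIndexer(informer) // see the earlier indexer sketch
}
```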
sgtm, some comments. /assign @wojtek-t for the scalability angle and the analysis of the trade-off between CPU and memory; since the number of labels is unbounded, the number of keys in the new index can grow.
@@ -109,6 +109,10 @@ func NewEndpointController(ctx context.Context, podInformer coreinformers.PodInf
	e.podLister = podInformer.Lister()
	e.podsSynced = podInformer.Informer().HasSynced

	// Initialize the pod indexer
	controller.AddPodNamespaceLabelIndexer(podInformer.Informer()) //nolint:errcheck
Why not check the error?
We do the same in multiple controllers today; this follows the existing pattern.
@@ -136,6 +136,9 @@ func NewController(ctx context.Context, podInformer coreinformers.PodInformer,
	})
	c.podLister = podInformer.Lister()
	c.podsSynced = podInformer.Informer().HasSynced
	// Initialize the pod indexer
	controller.AddPodNamespaceLabelIndexer(podInformer.Informer()) //nolint:errcheck
Same here
Namespace: "default",
Name:      fmt.Sprintf("pod-%d", i),
UID:       types.UID(fmt.Sprintf("uid-%d", i)),
Labels:    map[string]string{"app": v},
This benchmark shows the perf diff when there's 1 label.
It'd be good to also include 5 labels and 10 labels.
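One possible shape for that, assuming the index function name from the sketch earlier in the thread (illustrative, not the PR's actual benchmark):

```go
package controller

import (
	"fmt"
	"testing"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Benchmarks the index function at 1, 5 and 10 labels per pod.
func BenchmarkPodNamespaceLabelIndexFunc(b *testing.B) {
	for _, labelCount := range []int{1, 5, 10} {
		b.Run(fmt.Sprintf("labels=%d", labelCount), func(b *testing.B) {
			labels := make(map[string]string, labelCount)
			for i := 0; i < labelCount; i++ {
				labels[fmt.Sprintf("key-%d", i)] = fmt.Sprintf("value-%d", i)
			}
			pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{
				Namespace: "default",
				Name:      "pod-0",
				Labels:    labels,
			}}
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				if _, err := podNamespaceLabelIndexFunc(pod); err != nil {
					b.Fatal(err)
				}
			}
		})
	}
}
```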
/hold
/retest
What type of PR is this?
/kind feature
What this PR does / why we need it:
What?
Why?
TL;DR
Details:
We list all pods from the store in the endpoint controller - xref
We list all pods from the store in the endpointslice controller - xref
For every add/update/delete pod event belonging to a service, and/or for every add/update/delete service event that gets enqueued, these controllers hold the cache lock while processing each such item, with time complexity on the order of O(# of pods in the namespace). In the worst case this can be several hundred milliseconds (including the time to wait for and acquire the lock), based on the benchmarking I ran a couple of months ago when I opened this issue. The impact is especially pronounced in the kube-system namespace, where kube-proxy and other core components live.
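For reference, the shape of the change: instead of listing every pod in the service's namespace and filtering by selector, the controller can ask the index for pods matching a selector term and apply the remaining selector terms to the much smaller result. A sketch, assuming the index name and package from the indexer sketch in the review thread above rather than the PR's exact code:

```go
// Sketch: fetch candidate pods via the namespace+label index instead of
// listing the whole namespace; remaining selector terms are applied afterwards.
func podsMatchingSelectorTerm(indexer cache.Indexer, namespace, labelKey, labelValue string) ([]*v1.Pod, error) {
	objs, err := indexer.ByIndex(podNamespaceLabelIndex,
		fmt.Sprintf("ns:%s/label:%s=%s", namespace, labelKey, labelValue))
	if err != nil {
		return nil, err
	}
	pods := make([]*v1.Pod, 0, len(objs))
	for _, obj := range objs {
		if pod, ok := obj.(*v1.Pod); ok {
			pods = append(pods, pod)
		}
	}
	return pods, nil
}
```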
Benchmarking Results:
Which issue(s) this PR is related to:
Fixes #130767
Special notes for your reviewer:
Some PRs I made in the past improving lock contention: #130859, #130961, #132305
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: