
Index pods on namespace labels for KCM controllers to query efficiently from Informer cache. Optimize Endpoint and EndpointSlice controller lock contention #132396


Open
wants to merge 1 commit into
base: master

Conversation

hakuna-matatah
Contributor

@hakuna-matatah hakuna-matatah commented Jun 19, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

What?

  • This PR introduces a namespace label indexer for PodInformer to efficiently query Pods based on a set of namespace labels. This can be used across many KCM controllers to query the index store by label instead of listing everything from the store and filtering.

Why?

TL;DR

  • This improves the scale and performance of controllers and alleviates lock contention by roughly 12-20X in the Endpoint and EndpointSlice controllers.

Details:

  1. We list all pods from the store in the Endpoint controller - xref

  2. We list all pods from the store in the EndpointSlice controller - xref

For every pod add/update/delete event belonging to a service, and/or for every service add/update/delete event enqueued to the work queue, these controllers hold the informer cache lock while processing, with time complexity on the order of O(number of pods in the namespace). In the worst case this can take several hundreds of milliseconds (including the wait time to be processed and to acquire the lock), based on the benchmarking I ran a couple of months ago when I opened this issue. The impact is especially pronounced in the kube-system namespace, where kube-proxy and other core components live.
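To make the idea concrete, here is a minimal stdlib-only sketch (the Pod type and buildIndex helper are simplified stand-ins, not the PR's actual client-go indexer): one map lookup on a precomputed "ns:<namespace>/label:<key>=<value>" key replaces scanning and filtering every pod in the store.

```go
package main

import "fmt"

// Pod is a minimal stand-in for the real corev1.Pod type.
type Pod struct {
	Namespace string
	Labels    map[string]string
}

// indexKey mirrors the PR's key format: "ns:<namespace>/label:<key>=<value>".
func indexKey(ns, k, v string) string {
	return fmt.Sprintf("ns:%s/label:%s=%s", ns, k, v)
}

// buildIndex maps each index key to the pods carrying that label, so a
// lookup costs O(pods matching the label) instead of O(all pods in store).
func buildIndex(pods []*Pod) map[string][]*Pod {
	idx := make(map[string][]*Pod)
	for _, p := range pods {
		for k, v := range p.Labels {
			key := indexKey(p.Namespace, k, v)
			idx[key] = append(idx[key], p)
		}
	}
	return idx
}

func main() {
	pods := []*Pod{
		{Namespace: "kube-system", Labels: map[string]string{"k8s-app": "kube-proxy"}},
		{Namespace: "kube-system", Labels: map[string]string{"k8s-app": "coredns"}},
		{Namespace: "default", Labels: map[string]string{"app": "web"}},
	}
	idx := buildIndex(pods)
	// One map lookup replaces listing and filtering the whole store.
	matches := idx[indexKey("kube-system", "k8s-app", "kube-proxy")]
	fmt.Println(len(matches)) // prints 1
}
```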

Benchmarking Results:


Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkPodNamespaceLabelIndexer$ k8s.io/kubernetes/pkg/controller

goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/pkg/controller
cpu: AMD EPYC 7R13 Processor
BenchmarkPodIndexer_SingleLabel/uniform_LabelQueryWithIndex-96         	     435	   2481950 ns/op	 1234367 B/op	      73 allocs/op
BenchmarkPodIndexer_SingleLabel/uniform_LabelQueryFullScan-96         	      30	  38885333 ns/op	 2972792 B/op	      21 allocs/op
BenchmarkPodIndexer_SingleLabel/zipf_LabelQueryWithIndex-96            	     250	   4711724 ns/op	 1518890 B/op	      73 allocs/op
BenchmarkPodIndexer_SingleLabel/zipf_LabelQueryFullScan-96            	      32	  35888043 ns/op	 3398776 B/op	      23 allocs/op
PASS
ok  	k8s.io/kubernetes/pkg/controller	9.195s



The indexed query is ~15.7X faster (38,885,333 ns/op vs 2,481,950 ns/op in the uniform case), so the lock is released that much sooner.

Which issue(s) this PR is related to:

Fixes #130767

Special notes for your reviewer:

Some PRs I made in the past improving lock contention - #130859 , #130961 , #132305

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 19, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 19, 2025
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 19, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Jun 19, 2025
@hakuna-matatah
Contributor Author

/assign @soltysh @wojtek-t

@hakuna-matatah
Contributor Author

/assign @aojea

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jun 19, 2025

cc: @dims @mengqiy @shyamjvs

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hakuna-matatah
Once this PR has been reviewed and has the lgtm label, please ask for approval from soltysh. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dims
Member

dims commented Jun 19, 2025

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 19, 2025
@hakuna-matatah hakuna-matatah changed the title Index pods on namespace labels for KCM controllers to query efficiently from Informer cache. Optimize Endpoint controller lock contention Index pods on namespace labels for KCM controllers to query efficiently from Informer cache. Optimize Endpoint and EndpointSlice controller lock contention Jun 19, 2025
@hakuna-matatah
Contributor Author

/retest

// For each label in the pod, create an index key in the format "ns:<namespace>/label:<key>=<value>"
indexKeys := make([]string, 0, len(pod.Labels))
for k, v := range pod.Labels {
	indexKeys = append(indexKeys, fmt.Sprintf("ns:%s/label:%s=%s", pod.Namespace, k, v))
}
Member

curious question, what is the impact on memory consumption?

Member

the benchmark shows roughly a 3.5x increase, from 21 allocs/op to 73 allocs/op

Member

hmm - indexing on every single label of every pod is super risky imho, and in certain deployments may lead to a very substantial number of index keys, which I don't think is a good direction.

Member

yeah, we have the experience of cilium depending on labels to create indexes for network policies, and that scales very badly; the fact that labels also mutate during the pod lifecycle makes it problematic too. @hakuna-matatah I agree with @wojtek-t and would rather not pursue this path.

/hold

Contributor Author

@aojea @wojtek-t really appreciate the input. I agree that unbounded label cardinality or label mutation is concerning in terms of memory consumption. To clarify, the goal of the benchmark was to show the performance impact, not to optimize for memory per se.

Let's walk through the memory requirements to quantify the tradeoff:

N = number of pods

X = number of unique labels per pod (unique across all pods)

Y = number of labels shared by all pods

Rough memory usage:

Slice entry pointer = 8 B

Index key string (namespace + key + value) ≈ 63 B

Unique-key memory footprint: (N × X) × (63 B key + 8 B ptr)

Shared-key memory footprint: Y × (63 B key + N × 8 B ptr)

Total memory ≈ unique-key footprint + shared-key footprint

Taking a worst case of 150K pods, with 10 unique labels and 10 shared labels per pod:

Unique-key footprint ≈ 150K × 10 × (63 B + 8 B) ≈ 106 MB

Shared-key footprint ≈ 10 × (63 B + 150K × 8 B) ≈ 12 MB

Total memory footprint ≈ 118 MB

With slice headers and map overhead this might add up to ~200 MB overall; still, I feel ~200 MB for 10 shared + 10 unique labels across all 150K pods is acceptable.
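The back-of-envelope numbers above can be checked mechanically. This is a stdlib-only sketch; the 63 B key size and 8 B pointer size are the comment's own assumptions, not measured values:

```go
package main

import "fmt"

// footprint returns the estimated bytes for unique-label and shared-label
// index entries, given assumed sizes for the key string and slice pointer.
func footprint(pods, uniquePerPod, sharedLabels, keyBytes, ptrBytes int) (unique, shared int) {
	// Each unique label yields its own key string plus one slice entry.
	unique = pods * uniquePerPod * (keyBytes + ptrBytes)
	// Each shared label yields one key string plus a slice entry per pod.
	shared = sharedLabels * (keyBytes + pods*ptrBytes)
	return unique, shared
}

func main() {
	u, s := footprint(150_000, 10, 10, 63, 8)
	fmt.Printf("unique: ~%d MB\n", u/1_000_000)     // ~106 MB
	fmt.Printf("shared: ~%d MB\n", s/1_000_000)     // ~12 MB
	fmt.Printf("total:  ~%d MB\n", (u+s)/1_000_000) // ~118 MB
}
```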

Given the ~15x performance gain in label-based pod lookups, I believe this could be a decent tradeoff in clusters where labels are controlled. To be safe, I propose:

Guarding the label indexing behind a KCM flag like --enable-pod-label-indexing (default false)

Cluster operators can opt in if they know label cardinality is bounded and pod labels are stable

This gives flexibility to performance-sensitive clusters without imposing memory costs on the general case. I'd like to hear your thoughts on guarding it behind a KCM flag and leaving it to cluster operators to opt in.
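The opt-in gating proposed above could look roughly like the following. This is a hypothetical sketch: the flag name comes from the comment, and registerIndexer is an illustrative stand-in for the PR's AddPodNamespaceLabelIndexer call, not real KCM wiring:

```go
package main

import (
	"flag"
	"fmt"
)

// registerIndexer stands in for the PR's indexer registration; whether it
// runs is controlled entirely by the opt-in flag below.
func registerIndexer(enabled bool) string {
	if enabled {
		return "pod namespace-label indexer enabled"
	}
	return "pod namespace-label indexer disabled"
}

func main() {
	// Hypothetical opt-in flag, default false, as proposed in the comment.
	enable := flag.Bool("enable-pod-label-indexing", false,
		"Index pods by namespace labels for faster controller lookups (at a memory cost).")
	flag.Parse()
	fmt.Println(registerIndexer(*enable))
}
```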

@aojea
Member

aojea commented Jun 19, 2025

sgtm, some comments

/assign @wojtek-t

for the scalability angle and the analysis of the tradeoff between CPU and memory: since the number of labels is unbounded, the number of keys in the new index can grow

@@ -109,6 +109,10 @@ func NewEndpointController(ctx context.Context, podInformer coreinformers.PodInf
e.podLister = podInformer.Lister()
e.podsSynced = podInformer.Informer().HasSynced

// Initialize the pod indexer
controller.AddPodNamespaceLabelIndexer(podInformer.Informer()) //nolint:errcheck
Member

Why not checking error?

Contributor Author

We do the same in multiple controllers today; this follows the existing pattern.

@@ -136,6 +136,9 @@ func NewController(ctx context.Context, podInformer coreinformers.PodInformer,
})
c.podLister = podInformer.Lister()
c.podsSynced = podInformer.Informer().HasSynced
// Initialize the pod indexer
controller.AddPodNamespaceLabelIndexer(podInformer.Informer()) //nolint:errcheck
Member

Same here

Namespace: "default",
Name: fmt.Sprintf("pod-%d", i),
UID: types.UID(fmt.Sprintf("uid-%d", i)),
Labels: map[string]string{"app": v},
Member

This benchmark shows the perf diff when there's 1 label.
It'd be good to also include 5 labels and 10 labels.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 20, 2025
…ly. Also, Optimize Endpoint and EndpointSlice Controller Performance: Reduce Work Duration Time & Minimize Cache Locking
@wojtek-t
Member

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 20, 2025
@hakuna-matatah
Contributor Author

/retest

Development

Successfully merging this pull request may close these issues.

Informer Cache RW Lock Contention Causes DeltaFIFO Backlog and Cache Staleness in KCM.
7 participants