Description
What happened?
Summary:
The Kubernetes Controller Manager (KCM) Informer cache experiences read/write mutex contention at scale. KCM controllers frequently acquire a read lock on the Informer cache to list objects from the store, which can starve the cache controller's DeltaFIFO processing of the write lock it needs to update the Store/Cache during high throughput. This contention can lead to a backlog in DeltaFIFO, delaying delta event processing and leaving the cache stale.
Impact
A high backlog in DeltaFIFO results in stale Informer cache data, affecting KCM controllers that rely on it for resource reconciliation. This can lead to incorrect decisions, such as unnecessary resource creation and unnecessary load on the API server and etcd from KCM controllers. More details on the impact to the StatefulSet and DaemonSet controllers are provided in the examples section below.
Code flow:
- The cache controller's processLoop pops items from the DeltaFIFO queue into processDeltas.
- As the cache controller processes deltas, it adds to or updates the store based on ADD/UPDATE etc. events from the DeltaFIFO queue.
- For Add/Update operations on the store, it acquires a write lock here to update the store.
- Meanwhile a controller, in the process of listing pods, acquires a read lock on the store here while the DeltaFIFO processing above is trying to update the store at the same time. (A minimal sketch of this contention pattern follows below.)
- I believe Delete on the store is also prone to this; that code path was not exercised in this test.
Mutex pprof for KCM showing the write lock and read lock on the Cache/Store from the cache controller loop and the StatefulSet controller's getPodsFromList:
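Below is a minimal Go sketch of the locking pattern being described. It only imitates the shape of client-go's thread-safe store rather than reproducing its actual code: the DeltaFIFO processing path needs the write lock for every Add/Update, while a controller's List holds the read lock for the entire copy of the items.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// contentionStore mimics the locking pattern of the informer's thread-safe
// store: writes from the DeltaFIFO processing path take the write lock,
// while controller List calls take the read lock and hold it for the full copy.
type contentionStore struct {
	lock  sync.RWMutex
	items map[string]interface{}
}

// Update stands in for what processDeltas does for each Add/Update delta.
func (s *contentionStore) Update(key string, obj interface{}) {
	s.lock.Lock() // blocked while any controller List holds the read lock
	defer s.lock.Unlock()
	s.items[key] = obj
}

// List stands in for a controller listing from the informer cache. With
// hundreds of thousands of pods, this copy holds the read lock long enough
// to back up DeltaFIFO.
func (s *contentionStore) List() []interface{} {
	s.lock.RLock()
	defer s.lock.RUnlock()
	out := make([]interface{}, 0, len(s.items))
	for _, obj := range s.items {
		out = append(out, obj)
	}
	return out
}

func main() {
	s := &contentionStore{items: map[string]interface{}{}}
	// Writer: stands in for the cache controller's processLoop/processDeltas.
	go func() {
		for i := 0; ; i++ {
			s.Update(fmt.Sprintf("pod-%d", i), struct{}{})
		}
	}()
	// Reader: stands in for a KCM controller repeatedly listing the cache.
	for i := 0; i < 5; i++ {
		start := time.Now()
		_ = s.List()
		fmt.Printf("list %d took %v\n", i, time.Since(start))
	}
}
```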

Impact/analysis examples:
1. StatefulSet controller
In the StatefulSet controller's case, if/when the cache is stale and the pod phase still reflects nil in the cache even though the kubelet has already patched the pod to the Running phase, the StatefulSet controller will keep trying to create pods and getting 409 conflicts (the pod already exists in etcd), because this condition evaluates to true.
When I patched the KCM StatefulSet controller to measure this, I could see that just LISTING pods from the informer cache, when there are lots of pods in a single namespace, can take up to ~980 ms at peak, which creates a huge backlog in DeltaFIFO during that time. (A rough sketch of this kind of instrumentation follows after the trace example below.)
While listing pods, the StatefulSet controller acquires a read lock on the Store here while the DeltaFIFO processing above is trying to update the Store at the same time (more on this below).
Example of DeltaFIFO backlog
12 trace.go:236] Trace[410148869]: "DeltaFIFO Pop Process" ID:kube-system/kube-proxy-fv428,Depth:229809,Reason:slow event handlers blocking the queue ... (total time: 321ms):
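As a rough illustration of the kind of patch used for the measurement above (not the actual change; the package, function name, and log messages below are made up for the sketch), timing a namespace-wide list through the shared pod lister captures exactly the window during which the store's read lock is held:

```go
// Package instrumentation sketches how one might time a namespace-wide list
// against the shared informer cache. The names here are illustrative only.
package instrumentation

import (
	"time"

	"k8s.io/apimachinery/pkg/labels"
	corelisters "k8s.io/client-go/listers/core/v1"
	"k8s.io/klog/v2"
)

// measureListLatency lists every pod in the namespace from the informer
// cache and logs how long the call took. The list goes through the store's
// read lock, so a slow list here is exactly the window during which
// processDeltas cannot acquire the write lock.
func measureListLatency(podLister corelisters.PodLister, namespace string) {
	start := time.Now()
	pods, err := podLister.Pods(namespace).List(labels.Everything())
	elapsed := time.Since(start)
	if err != nil {
		klog.ErrorS(err, "listing pods from informer cache failed", "namespace", namespace)
		return
	}
	klog.InfoS("listed pods from informer cache",
		"namespace", namespace, "count", len(pods), "elapsed", elapsed)
}
```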
2. DaemonSet controller
The DaemonSet controller does a similar thing: it lists all pods from the cache while holding a read lock, which blocks the store from being updated in processDeltas for that entire time.
As a consequence of the stale cache, the DaemonSet controller can create extra DaemonSet pods for the same node.
What did you expect to happen?
Ideally, controllers should be isolated from each other to avoid the noisy-neighbor problem, which isn't possible today because controllers share the same shared informer cache.
Long term:
Food for thought:
- Introducing a dedicated Indexer/Cache solely for DeltaFIFO writes, and syncing it in a separate goroutine to a separate Indexer/Store that controllers read from, can reduce the lock contention caused by lists from downstream controllers. (A rough sketch of this idea follows after this list.)
- However, this means controllers will operate on slightly delayed data, which is still much better than very stale data at scale. We need to ensure this copy is lightweight, unlike the huge lists that controllers do today, which worsen the contention. This would also give the API Machinery code more control over the performance of these components, instead of depending on the implementations of downstream controllers.
- The main caveat with this idea is that we must not simply shift the problem to the second cache in order to reduce overhead on the first cache :). We can do something similar to what etcd does today: if the difference between applies and commits between the first cache and the second cache exceeds some threshold X, we stop granting the read lock on the second cache (i.e. the one controllers rely on) until it catches up to the first cache, guaranteeing controllers are not reading stale data.
- This comes with the con of increased memory overhead, because it essentially stores a separate copy of the cache.
- This requires changing pieces of the API Machinery architecture as it stands today.
- This approach could help not only KCM controllers but any controller relying on caches built out of API Machinery code, including ones running on data plane nodes.
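A very rough sketch of what the two-level cache could look like. Every name here is hypothetical; this is not existing client-go or API Machinery code, just the shape of the idea under the assumptions above:

```go
// Package doublebuffer sketches the "dedicated write cache plus periodically
// synced read cache" idea. All types and names are hypothetical.
package doublebuffer

import (
	"sync"
	"sync/atomic"
	"time"
)

type bufferedCache struct {
	writeMu    sync.Mutex // taken only by the DeltaFIFO path and syncLoop
	writeStore map[string]interface{}

	readMu    sync.RWMutex // taken only by controller reads and syncLoop
	readStore map[string]interface{}

	applied   atomic.Int64 // deltas applied to writeStore
	committed atomic.Int64 // deltas reflected in readStore
	maxLag    int64        // the threshold "X" from the text
}

func newBufferedCache(maxLag int64, syncEvery time.Duration) *bufferedCache {
	c := &bufferedCache{
		writeStore: map[string]interface{}{},
		readStore:  map[string]interface{}{},
		maxLag:     maxLag,
	}
	go c.syncLoop(syncEvery)
	return c
}

// Apply is called from the DeltaFIFO processing path. Controller List calls
// never contend with it, because they only lock readMu.
func (c *bufferedCache) Apply(key string, obj interface{}) {
	c.writeMu.Lock()
	c.writeStore[key] = obj
	c.writeMu.Unlock()
	c.applied.Add(1)
}

// syncLoop periodically copies the write store into the read store. A real
// implementation would copy only the changed keys to keep this lightweight.
func (c *bufferedCache) syncLoop(every time.Duration) {
	for range time.Tick(every) {
		c.writeMu.Lock()
		snapshot := make(map[string]interface{}, len(c.writeStore))
		for k, v := range c.writeStore {
			snapshot[k] = v
		}
		applied := c.applied.Load()
		c.writeMu.Unlock()

		c.readMu.Lock()
		c.readStore = snapshot
		c.readMu.Unlock()
		c.committed.Store(applied)
	}
}

// List is what controllers call. If the read store has fallen more than
// maxLag deltas behind the write store, readers wait for it to catch up,
// so controllers are guaranteed not to read arbitrarily stale data.
func (c *bufferedCache) List() []interface{} {
	for c.applied.Load()-c.committed.Load() > c.maxLag {
		time.Sleep(time.Millisecond) // a real implementation would use proper notification
	}
	c.readMu.RLock()
	defer c.readMu.RUnlock()
	out := make([]interface{}, 0, len(c.readStore))
	for _, v := range c.readStore {
		out = append(out, v)
	}
	return out
}
```

The point of the sketch is that Apply and List never take the same lock; the only coupling between them is the lag check, which plays the role of the applies/commits comparison described above.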
At a minimum, in the short term:
- We should ensure all controllers under KCM avoid heavy list operations on the Informer Cache/Store to reduce lock contention.
- Build and utilize indexers where applicable to minimize the number of items fetched, reducing read-lock duration and improving cache update efficiency. (An indexer example follows below.)
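As a concrete illustration of the indexer suggestion (the index name and key function below are made up for this sketch; some controllers already maintain similar indexes), a controller can register an index on the shared pod informer before it starts and then read only the subset it needs, instead of listing an entire namespace under the read lock:

```go
// Package indexed shows how a controller could register an index on the
// shared pod informer and read only the pods it cares about. The index name
// and key function are illustrative, not existing KCM code.
package indexed

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

const podsByNodeIndex = "podsByNode"

// addPodNodeIndex registers an index that maps node names to pods. It must
// be called before the informer is started. The index is maintained as the
// cache is updated, so later lookups hold the read lock only briefly.
func addPodNodeIndex(factory informers.SharedInformerFactory) error {
	podInformer := factory.Core().V1().Pods().Informer()
	return podInformer.AddIndexers(cache.Indexers{
		podsByNodeIndex: func(obj interface{}) ([]string, error) {
			pod, ok := obj.(*v1.Pod)
			if !ok || pod.Spec.NodeName == "" {
				return nil, nil
			}
			return []string{pod.Spec.NodeName}, nil
		},
	})
}

// podsOnNode fetches only the pods scheduled to nodeName via the index,
// instead of listing every pod and filtering inside the controller.
func podsOnNode(factory informers.SharedInformerFactory, nodeName string) ([]*v1.Pod, error) {
	indexer := factory.Core().V1().Pods().Informer().GetIndexer()
	objs, err := indexer.ByIndex(podsByNodeIndex, nodeName)
	if err != nil {
		return nil, err
	}
	pods := make([]*v1.Pod, 0, len(objs))
	for _, obj := range objs {
		if pod, ok := obj.(*v1.Pod); ok {
			pods = append(pods, pod)
		}
	}
	return pods, nil
}
```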
How can we reproduce it (as minimally and precisely as possible)?
- Create 200 StatefulSets of 1000 pods each at once.
- Ensure all of those pods are schedulable onto nodes.
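If it helps, here is a minimal client-go sketch for driving that load. The namespace, names, and image are arbitrary choices for illustration, not from the original test setup:

```go
// Package repro creates the load described above: 200 StatefulSets with
// 1000 replicas each.
package repro

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createStatefulSets(ctx context.Context, client kubernetes.Interface, namespace string) error {
	replicas := int32(1000)
	for i := 0; i < 200; i++ {
		name := fmt.Sprintf("scale-test-%d", i)
		labels := map[string]string{"app": name}
		sts := &appsv1.StatefulSet{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
			Spec: appsv1.StatefulSetSpec{
				Replicas:    &replicas,
				ServiceName: name,
				Selector:    &metav1.LabelSelector{MatchLabels: labels},
				Template: corev1.PodTemplateSpec{
					ObjectMeta: metav1.ObjectMeta{Labels: labels},
					Spec: corev1.PodSpec{
						Containers: []corev1.Container{{
							Name:  "pause",
							Image: "registry.k8s.io/pause:3.9",
						}},
					},
				},
			},
		}
		if _, err := client.AppsV1().StatefulSets(namespace).Create(ctx, sts, metav1.CreateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```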
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.32.2
Cloud provider
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)