Description
What happened?
Summary:
The Kubernetes Controller Manager (KCM) Informer cache experiences read/write mutex contention at scale. KCM controllers frequently acquire a read lock on the Informer cache to list objects from the store, which can starve the cache controller's DeltaFIFO processing of the write lock it needs to update the Store/Cache during high throughput. This contention can lead to a backlog in DeltaFIFO, delaying delta event processing and leaving the cache stale.
Impact
A high backlog in DeltaFIFO results in stale Informer cache data, affecting KCM controllers that rely on it for resource reconciliation. This can lead to incorrect decisions, such as unnecessary resource creation and unnecessary load on the API server and etcd from KCM controllers. More details on the impact to the StatefulSet and DaemonSet controllers are provided in the examples section below.
Code flow:
- The cache controller's processLoop pops items from the DeltaFIFO queue into processDeltas.
- As the cache controller processes deltas, it adds to or updates the store based on ADD/UPDATE etc. events from the DeltaFIFO queue.
- For Add/Update operations on the store, it acquires a write lock here to update the store.
- Meanwhile a controller, in the process of listing pods, acquires a read lock on the store here while the DeltaFIFO processing above is trying to update the store at the same time. (A minimal sketch of this contention pattern follows below.)
- I believe Delete on the store is also prone to this; that code path was not exercised in this test.
Mutex pprof for KCM showing the write lock and read lock on the Cache/Store from the cache controller loop and the StatefulSet controller's getPodsFromList:
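Below is a minimal Go sketch of the locking pattern being described. It only imitates the shape of client-go's thread-safe store rather than reproducing its actual code: the DeltaFIFO processing path needs the write lock for every Add/Update, while a controller's List holds the read lock for the entire copy of the items.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// contentionStore mimics the locking pattern of the informer's thread-safe
// store: writes from the DeltaFIFO processing path take the write lock,
// while controller List calls take the read lock and hold it for the full copy.
type contentionStore struct {
	lock  sync.RWMutex
	items map[string]interface{}
}

// Update stands in for what processDeltas does for each Add/Update delta.
func (s *contentionStore) Update(key string, obj interface{}) {
	s.lock.Lock() // blocked while any controller List holds the read lock
	defer s.lock.Unlock()
	s.items[key] = obj
}

// List stands in for a controller listing from the informer cache. With
// hundreds of thousands of pods, this copy holds the read lock long enough
// to back up DeltaFIFO.
func (s *contentionStore) List() []interface{} {
	s.lock.RLock()
	defer s.lock.RUnlock()
	out := make([]interface{}, 0, len(s.items))
	for _, obj := range s.items {
		out = append(out, obj)
	}
	return out
}

func main() {
	s := &contentionStore{items: map[string]interface{}{}}
	// Writer: stands in for the cache controller's processLoop/processDeltas.
	go func() {
		for i := 0; ; i++ {
			s.Update(fmt.Sprintf("pod-%d", i), struct{}{})
		}
	}()
	// Reader: stands in for a KCM controller repeatedly listing the cache.
	for i := 0; i < 5; i++ {
		start := time.Now()
		_ = s.List()
		fmt.Printf("list %d took %v\n", i, time.Since(start))
	}
}
```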

Impact/analysis examples:
1. StatefulSet controller
In the StatefulSet controller's case, if/when the cache is stale and the pod phase still reflects nil in the cache even though the kubelet has already patched the pod to the Running phase, the StatefulSet controller will keep trying to create pods and getting 409 conflicts (the pod already exists in etcd), because this condition evaluates to true.
When I patched the KCM StatefulSet controller to measure this, I could see that just LISTING pods from the informer cache, when there are lots of pods in a single namespace, can take up to ~980 ms at peak, which creates a huge backlog in DeltaFIFO during that time. (A rough sketch of this kind of instrumentation follows after the trace example below.)
While listing pods, the StatefulSet controller acquires a read lock on the Store here while the DeltaFIFO processing above is trying to update the Store at the same time (more on this below).
Example of DeltaFIFO backlog
12 trace.go:236] Trace[410148869]: "DeltaFIFO Pop Process" ID:kube-system/kube-proxy-fv428,Depth:229809,Reason:slow event handlers blocking the queue ... (total time: 321ms):
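As a rough illustration of the kind of patch used for the measurement above (not the actual change; the package, function name, and log messages below are made up for the sketch), timing a namespace-wide list through the shared pod lister captures exactly the window during which the store's read lock is held:

```go
// Package instrumentation sketches how one might time a namespace-wide list
// against the shared informer cache. The names here are illustrative only.
package instrumentation

import (
	"time"

	"k8s.io/apimachinery/pkg/labels"
	corelisters "k8s.io/client-go/listers/core/v1"
	"k8s.io/klog/v2"
)

// measureListLatency lists every pod in the namespace from the informer
// cache and logs how long the call took. The list goes through the store's
// read lock, so a slow list here is exactly the window during which
// processDeltas cannot acquire the write lock.
func measureListLatency(podLister corelisters.PodLister, namespace string) {
	start := time.Now()
	pods, err := podLister.Pods(namespace).List(labels.Everything())
	elapsed := time.Since(start)
	if err != nil {
		klog.ErrorS(err, "listing pods from informer cache failed", "namespace", namespace)
		return
	}
	klog.InfoS("listed pods from informer cache",
		"namespace", namespace, "count", len(pods), "elapsed", elapsed)
}
```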
2. DaemonSet controller
The DaemonSet controller does a similar thing: it lists all pods from the cache while holding a read lock, which blocks the store from being updated in processDeltas for that entire time.
As a consequence of the stale cache, the DaemonSet controller can create extra DaemonSet pods for the same node.
What did you expect to happen?
Ideally, controllers should be isolated from each other to avoid the noisy-neighbor problem, which isn't possible today because controllers share the same shared informer cache.
Long term:
Food for thought:
- Introducing a dedicated Indexer/Cache solely for DeltaFIFO writes, and syncing it in a separate goroutine to a separate Indexer/Store that controllers read from, can reduce the lock contention caused by lists from downstream controllers. (A rough sketch of this idea follows after this list.)
- However, this means controllers will operate on slightly delayed data, which is still much better than very stale data at scale. We need to ensure this copy is lightweight, unlike the huge lists that controllers do today, which worsen the contention. This would also give the API Machinery code more control over the performance of these components, instead of depending on the implementations of downstream controllers.
- The main caveat with this idea is that we must not simply shift the problem to the second cache in order to reduce overhead on the first cache :). We can do something similar to what etcd does today: if the difference between applies and commits between the first cache and the second cache exceeds some threshold X, we stop granting the read lock on the second cache (i.e. the one controllers rely on) until it catches up to the first cache, guaranteeing controllers are not reading stale data.
- This comes with the con of increased memory overhead, because it essentially stores a separate copy of the cache.
- This requires changing pieces of the API Machinery architecture as it stands today.
- This approach could help not only KCM controllers but any controller relying on caches built out of API Machinery code, including ones running on data plane nodes.
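A very rough sketch of what the two-level cache could look like. Every name here is hypothetical; this is not existing client-go or API Machinery code, just the shape of the idea under the assumptions above:

```go
// Package doublebuffer sketches the "dedicated write cache plus periodically
// synced read cache" idea. All types and names are hypothetical.
package doublebuffer

import (
	"sync"
	"sync/atomic"
	"time"
)

type bufferedCache struct {
	writeMu    sync.Mutex // taken only by the DeltaFIFO path and syncLoop
	writeStore map[string]interface{}

	readMu    sync.RWMutex // taken only by controller reads and syncLoop
	readStore map[string]interface{}

	applied   atomic.Int64 // deltas applied to writeStore
	committed atomic.Int64 // deltas reflected in readStore
	maxLag    int64        // the threshold "X" from the text
}

func newBufferedCache(maxLag int64, syncEvery time.Duration) *bufferedCache {
	c := &bufferedCache{
		writeStore: map[string]interface{}{},
		readStore:  map[string]interface{}{},
		maxLag:     maxLag,
	}
	go c.syncLoop(syncEvery)
	return c
}

// Apply is called from the DeltaFIFO processing path. Controller List calls
// never contend with it, because they only lock readMu.
func (c *bufferedCache) Apply(key string, obj interface{}) {
	c.writeMu.Lock()
	c.writeStore[key] = obj
	c.writeMu.Unlock()
	c.applied.Add(1)
}

// syncLoop periodically copies the write store into the read store. A real
// implementation would copy only the changed keys to keep this lightweight.
func (c *bufferedCache) syncLoop(every time.Duration) {
	for range time.Tick(every) {
		c.writeMu.Lock()
		snapshot := make(map[string]interface{}, len(c.writeStore))
		for k, v := range c.writeStore {
			snapshot[k] = v
		}
		applied := c.applied.Load()
		c.writeMu.Unlock()

		c.readMu.Lock()
		c.readStore = snapshot
		c.readMu.Unlock()
		c.committed.Store(applied)
	}
}

// List is what controllers call. If the read store has fallen more than
// maxLag deltas behind the write store, readers wait for it to catch up,
// so controllers are guaranteed not to read arbitrarily stale data.
func (c *bufferedCache) List() []interface{} {
	for c.applied.Load()-c.committed.Load() > c.maxLag {
		time.Sleep(time.Millisecond) // a real implementation would use proper notification
	}
	c.readMu.RLock()
	defer c.readMu.RUnlock()
	out := make([]interface{}, 0, len(c.readStore))
	for _, v := range c.readStore {
		out = append(out, v)
	}
	return out
}
```

The point of the sketch is that Apply and List never take the same lock; the only coupling between them is the lag check, which plays the role of the applies/commits comparison described above.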
At a minimum, in the short term:
- We should ensure all controllers under KCM avoid heavy list operations on the Informer Cache/Store to reduce lock contention.
- Build and utilize indexers where applicable to minimize the number of items fetched, reducing read-lock duration and improving cache update efficiency. (An indexer example follows below.)
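As a concrete illustration of the indexer suggestion (the index name and key function below are made up for this sketch; some controllers already maintain similar indexes), a controller can register an index on the shared pod informer before it starts and then read only the subset it needs, instead of listing an entire namespace under the read lock:

```go
// Package indexed shows how a controller could register an index on the
// shared pod informer and read only the pods it cares about. The index name
// and key function are illustrative, not existing KCM code.
package indexed

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

const podsByNodeIndex = "podsByNode"

// addPodNodeIndex registers an index that maps node names to pods. It must
// be called before the informer is started. The index is maintained as the
// cache is updated, so later lookups hold the read lock only briefly.
func addPodNodeIndex(factory informers.SharedInformerFactory) error {
	podInformer := factory.Core().V1().Pods().Informer()
	return podInformer.AddIndexers(cache.Indexers{
		podsByNodeIndex: func(obj interface{}) ([]string, error) {
			pod, ok := obj.(*v1.Pod)
			if !ok || pod.Spec.NodeName == "" {
				return nil, nil
			}
			return []string{pod.Spec.NodeName}, nil
		},
	})
}

// podsOnNode fetches only the pods scheduled to nodeName via the index,
// instead of listing every pod and filtering inside the controller.
func podsOnNode(factory informers.SharedInformerFactory, nodeName string) ([]*v1.Pod, error) {
	indexer := factory.Core().V1().Pods().Informer().GetIndexer()
	objs, err := indexer.ByIndex(podsByNodeIndex, nodeName)
	if err != nil {
		return nil, err
	}
	pods := make([]*v1.Pod, 0, len(objs))
	for _, obj := range objs {
		if pod, ok := obj.(*v1.Pod); ok {
			pods = append(pods, pod)
		}
	}
	return pods, nil
}
```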
How can we reproduce it (as minimally and precisely as possible)?
- Create 200 StatefulSets of 1000 pods each at once.
- Ensure all of those pods are schedulable onto nodes.
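If it helps, here is a minimal client-go sketch for driving that load. The namespace, names, and image are arbitrary choices for illustration, not from the original test setup:

```go
// Package repro creates the load described above: 200 StatefulSets with
// 1000 replicas each.
package repro

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createStatefulSets(ctx context.Context, client kubernetes.Interface, namespace string) error {
	replicas := int32(1000)
	for i := 0; i < 200; i++ {
		name := fmt.Sprintf("scale-test-%d", i)
		labels := map[string]string{"app": name}
		sts := &appsv1.StatefulSet{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
			Spec: appsv1.StatefulSetSpec{
				Replicas:    &replicas,
				ServiceName: name,
				Selector:    &metav1.LabelSelector{MatchLabels: labels},
				Template: corev1.PodTemplateSpec{
					ObjectMeta: metav1.ObjectMeta{Labels: labels},
					Spec: corev1.PodSpec{
						Containers: []corev1.Container{{
							Name:  "pause",
							Image: "registry.k8s.io/pause:3.9",
						}},
					},
				},
			},
		}
		if _, err := client.AppsV1().StatefulSets(namespace).Create(ctx, sts, metav1.CreateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```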
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.32.2
Cloud provider
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)