Description
What happened?
It appears that the APIServer watch cache occasionally loses events. We can confirm that this is NOT a stale watch cache issue.
In some 1.27 clusters, we observed that the watch caches in both APIServer instances are reasonably up-to-date (objects created within the last 60s can be found in both caches). However, we believe some delete events were lost in the APIServer watch cache: in the bad APIServer, a few objects that were deleted more than 24 hours ago still show up in its cache. It's possible that other types of events (e.g. updates) also get lost, but they are less noticeable than delete events, since a missed update is effectively recovered by the next update event.
This issue impacts k8s clients that use an informer cache. Once an informer has consumed the incomplete event stream from the bad APIServer, it won't recover until it is restarted. Replacing the bad APIServer with a good one does not help the informer discover the missing events.
We have observed at least 6 clusters run into this issue in EKS. Five of them started to show the issue shortly after a control plane upgrade, but one cluster started to show it more than an hour before the control plane upgrade kicked in.
The clusters were running fine on 1.26; the issue only started to show up after they were upgraded to 1.27.
During the incident we saw apiserver_watch_cache_events_received_total{resource="pods"} diverge between the 2 APIServer instances, even though both instances are expected to receive the same events and therefore show roughly the same delta.
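For reference, a quick way to sample this counter is to scrape the APIServer metrics endpoint (this goes through the regular endpoint, so it hits whichever instance the load balancer routes the request to; repeat it to sample both instances):

# Grep the watch cache counter for pods from whichever APIServer answers.
kubectl get --raw /metrics | grep apiserver_watch_cache_events_received_total | grep pods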
We ran the following command against 2 different APIServers:
kubectl get --raw "/api/v1/namespaces/my-ns/pods/my-pod?resourceVersion=0"
One returned the object and the other returned NotFound.
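A sketch of how the two instances can be compared; the per-instance endpoints below are placeholders, and this assumes you have some way to reach each APIServer instance directly (in EKS they normally sit behind a single load balancer):

# Placeholder endpoints; TLS/auth setup may need adjusting for direct access.
for ep in https://apiserver-1.example:443 https://apiserver-2.example:443; do
  echo "== $ep =="
  kubectl --server "$ep" get --raw "/api/v1/namespaces/my-ns/pods/my-pod?resourceVersion=0"
done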
EDIT: adding some additional data points.
We did see etcd memory keep increasing during the incident.
We believe the component that triggered this is Falco v0.35.1. It runs as a DaemonSet and makes a lot of watch requests without a resourceVersion. All 6 clusters had a couple hundred nodes when the incident started.
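For illustration only, a raw watch without a resourceVersion looks roughly like the request below (an approximation of the traffic pattern, not Falco's actual client code). With an unset resourceVersion the APIServer first returns synthetic ADDED events for every existing object before streaming changes, which is expensive at this scale:

# Illustrative raw watch on pods with no resourceVersion parameter.
kubectl get --raw "/api/v1/pods?watch=true&timeoutSeconds=10"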
In my repro cluster, I saw that one etcd member had a much higher etcd_debugging_mvcc_pending_events_total (over 1 million) than the other etcd instances (< 20k).
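For reference, a sketch of how this can be checked on an etcd member; the endpoint and certificate paths are assumptions and depend on how etcd is deployed:

# Assumed client endpoint and cert locations; adjust for your environment.
curl -s --cacert /etc/etcd/ca.crt --cert /etc/etcd/client.crt --key /etc/etcd/client.key \
  https://127.0.0.1:2379/metrics | grep etcd_debugging_mvcc_pending_events_total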
What did you expect to happen?
The APIServer watch cache should not lose events.
How can we reproduce it (as minimally and precisely as possible)?
EDIT:
We have a way to reproduce this in EKS; these may not be the minimal steps.
- Create a 1.27 cluster with 800 worker nodes
- Maintain high pod churn in the cluster; I keep it around 1000 pods per minute (a rough churn-generator sketch follows the values file below)
- Deploy falco using helm:
helm install falco falcosecurity/falco --create-namespace --namespace falco --version 3.6.0 --values falco-chart-values.yaml
Note that chart version 3.6.0 ships Falco 0.35.1. You need to mirror the Falco images to your own container registry; otherwise your kubelets will be heavily throttled by Docker Hub, which makes the DaemonSet pods come up slowly.
falco-chart-values.yaml
json_output: true
log_syslog: false

collectors:
  containerd:
    enabled: false
  crio:
    enabled: false
  docker:
    enabled: false

driver:
  enabled: true
  kind: module
  loader:
    enabled: true
    initContainer:
      image:
        registry: your-account.dkr.ecr.us-west-2.amazonaws.com
        repository: david-falco-driver-loader
        tag: 0.35.1
        pullPolicy: IfNotPresent

image:
  registry: your-account.dkr.ecr.us-west-2.amazonaws.com
  repository: david-falco
  tag: 0.35.1
  pullPolicy: IfNotPresent

podPriorityClassName: system-node-critical

tolerations:
  - operator: Exists

falcoctl:
  image:
    registry: your-account.dkr.ecr.us-west-2.amazonaws.com
    repository: david-falcoctl
    tag: 0.5.1
    pullPolicy: IfNotPresent
  artifact:
    install:
      enabled: false
    follow:
      enabled: false

falco:
  syscall_event_drops:
    actions:
      - log
      - alert
    rate: 1
    max_burst: 999
  metadata_download:
    maxMb: 200

extra:
  env:
    - name: SKIP_DRIVER_LOADER
      value: "yes"
Anything else we need to know?
No response
Kubernetes version
1.27
Cloud provider
AWS (EKS)
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here