
Pods in statefulsets sometimes have a blank podIP annotation, causing network issues during termination #4710

Closed
LeeHampton opened this issue Jul 1, 2021 · 12 comments

Comments

@LeeHampton

Summary

We're seeing a strange issue where pods in a statefulset sometimes lose network connectivity when they're in a Terminating state: some of our pods make TCP calls while shutting down, but those calls hit connection timeouts, leaving the pods hung until the shutdown grace period ends.

We've discovered that this happens when pods have an empty cni.projectcalico.org/podIP annotation. Most pods have an IP for that value, but some just have an empty string where the IP should be. Sometimes restarting pods gets them re-annotated, but not always.

The pods with empty podIP annotations have totally normal network connectivity when they are running, despite the blank annotation. They can be reached by other pods, reached from the public internet via NodePorts, and can communicate with other pods and external services like S3. It is only when they enter Terminating that they lose network connectivity.
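
A sketch of one way to list the affected pods (assuming jq is available; the annotation key is the standard Calico one):

    kubectl get pods -A -o json \
      | jq -r '.items[]
          | select(.metadata.annotations["cni.projectcalico.org/podIP"] == "")
          | "\(.metadata.namespace)/\(.metadata.name)"'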

Expected Behavior

We expect the pod to retain network connectivity throughout its Terminating state until the pod is fully torn down.

Steps to Reproduce (for bugs)

We have not found a way to reproduce the empty podIP annotation. But once a podIP annotation is empty, reproducing is straightforward.

  1. Ensure pod is configured to run some network call (push WAL segments to S3 in our case) during shutdown
  2. Delete the pod
  3. Observe network timeouts while the pod is Terminating.
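
A rough sketch of steps 2 and 3 (the pod name is a placeholder, and it assumes the image has nc or a similar tool available for the test call):

    kubectl delete pod foobar-2 --wait=false
    # while the pod is still Terminating, make an outbound call from inside it
    kubectl exec foobar-2 -- timeout 10 nc -zv s3.amazonaws.com 443 \
      || echo "outbound call timed out while Terminating"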

Your Environment

  • Calico version - 3.17.3
  • Orchestrator version (e.g. kubernetes, mesos, rkt): k8s 1.19.9 managed by kops 1.19.1
@fasaxc
Member

fasaxc commented Jul 1, 2021

My guess is that, because this is a stateful set:

  • The name of the pod gets reused when it gets restarted
  • The new pod with same name is starting up while the old one is in its graceful shutdown window
  • The graceful shutdown window ends and the CNI DEL happens, which patches the IP annotation on the wrong instance of the pod (removes it from the new pod instead of the old one).

We have explicit code in place to prevent exactly that scenario so it'll take a bit of debugging.

I checked that v3.17.3 has that fix (the cni-plugin and node repos are pinned to a version that includes it).

@LeeHampton Can you check that the Calico CNI binary on your host really is the right version? The binaries are usually in /opt/bin/cni; run ./calico -v and it should print the version.

@LeeHampton
Author

@fasaxc Yes, ./calico -v gives me v3.17.3 when I run it on the host nodes.

@song-jiang
Member

@LeeHampton Is it easy to reproduce? Could you reproduce the issue and take note of the pod name involved (e.g. pod-A)?

  • Can you confirm whether a new pod with the same name has been started?
  • Can you get the logs from calico-typha and from the calico-node pod scheduled on the same node as pod-A once you hit the issue?
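
Something like this should grab those logs (assuming the default kube-system namespace and the k8s-app labels from the Calico manifests; adjust for your install):

    NODE=$(kubectl get pod pod-A -o jsonpath='{.spec.nodeName}')
    CALICO_POD=$(kubectl -n kube-system get pods -l k8s-app=calico-node \
      --field-selector spec.nodeName="$NODE" -o name)
    kubectl -n kube-system logs "$CALICO_POD"
    kubectl -n kube-system logs -l k8s-app=calico-typha --prefix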

@feikesteenbergen

@song-jiang As far as we can tell, the new pod is only spun up after the old pod is torn down; at no point in time are there two pods with the exact same name in play.

@hindessm

hindessm commented Jul 22, 2021

We are seeing a similar issue. We hadn't observed the missing podIP annotation but will look for that in testing going forward. We were seeing problems with statefulset graceful shutdown, and noticed that if we ran strace on pid 1 in the container and ip -ts monitor route dev $(ip route get $PODIP | awk '/dev/{print $3}') on the host (commands below), the timestamps would often show the route being deleted before the TERM signal had been delivered to the process. In our case it is not related to the new pod starting, as the placement/PV requirements wouldn't allow that to happen.
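
For reference, the two host-side commands were roughly the following (run in separate terminals; how you find the container's pid 1 depends on your runtime, e.g. via crictl inspect or docker inspect):

    # trace signal delivery to the container's init process
    strace -tt -e trace=signal -p "$PID1"

    # log route changes, with timestamps, for the pod's host-side interface
    ip -ts monitor route dev "$(ip route get "$PODIP" | awk '/dev/{print $3}')"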

We are using k8s 1.20.8 and calico v3.17.4.

Update: checking the statefulset that I had been testing, the podIP/podIPs annotations are empty. The same is true of the pods of another statefulset in the cluster, but I checked a few pods that aren't in statefulsets and they have the correct annotations.

@hindessm

If we manually reset the empty annotations with the expected podIP/podIPs values then the graceful shutdown works as expected.

@LeeHampton
Author

If we manually reset the empty annotations with the expected podIP/podIPs values then the graceful shutdown works as expected.

That happens with us as well, @hindessm. Our current workaround is just to periodically find pods without those annotations and re-annotate them.
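
Roughly, the workaround loop looks like this (a sketch; assumes jq, and writes status.podIP back in the /32 form Calico uses for these annotations):

    kubectl get pods -A -o json \
      | jq -r '.items[]
          | select(.metadata.annotations["cni.projectcalico.org/podIP"] == "")
          | select(.status.podIP != null)
          | "\(.metadata.namespace) \(.metadata.name) \(.status.podIP)"' \
      | while read -r ns name ip; do
          kubectl -n "$ns" annotate pod "$name" --overwrite \
            cni.projectcalico.org/podIP="$ip/32" \
            cni.projectcalico.org/podIPs="$ip/32"
        done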

@hindessm

A colleague has been monitoring the annotations to see when they get cleared, and the clearing of the annotations on the new pod seems to coincide with this entry:

2021-07-22 16:33:24.319 [DEBUG][246369] workloadendpoint.go 132: Calculated pod patch. patch="{\"metadata\":{\"annotations\":{\"cni.projectcalico.org/podIP\":\"\",\"cni.projectcalico.org/podIPs\":\"\"},\"resourceVersion\":\"28332994\",\"uid\":\"5035ea85-9bfb-47e2-a7e7-dbbda513a768\"}}"

in the node/host log, /var/log/calico/cni/cni.log.
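
For anyone wanting to reproduce the observation, a crude watch loop like this is enough (the pod name is a placeholder):

    while true; do
      ts=$(date -u +%H:%M:%S)
      val=$(kubectl get pod foobar-2 \
        -o jsonpath='{.metadata.annotations.cni\.projectcalico\.org/podIP}')
      echo "$ts podIP='$val'"
      sleep 1
    done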

@hindessm

To give a little more context, the timeline when deleting a pod, foobar-2, that is part of a statefulset looks like:

  • 16:32:14 - old foobar-2 delete begins
  • 16:32:32 - old foobar-2 delete finishes, new foobar-2 pod starts showing up
  • 16:32:45 - IP claimed for new pod
2021-07-22 16:32:45.108 [INFO][243806] ipam.go 983: Successfully claimed IPs: [172.30.197.150/26] block=172.30.197.128/26 handle="k8s-pod-network.9b2efaad87d0b6c5c8dd6cf32161569d3f6d1dfa2a4f861024c9b2c4d8b3e9c2" host="10.36.58.103"
2021-07-22 16:32:45.160 [DEBUG][243794] workloadendpoint.go 132: Calculated pod patch. patch="{\"metadata\":{\"annotations\":{\"cni.projectcalico.org/podIP\":\"172.30.197.150/32\",\"cni.projectcalico.org/podIPs\":\"172.30.197.150/32\"}}}"
  • 16:33:24 - calico decides to patch out the IPs
2021-07-22 16:33:24.319 [DEBUG][246369] k8s.go 536: Performing 'DeleteKVP' for WorkloadEndpoint(default/10.36.58.103-k8s-foobar--2-eth0)
2021-07-22 16:33:24.319 [DEBUG][246369] workloadendpoint.go 79: Delete for WorkloadEndpoint, patching out annotations.
2021-07-22 16:33:24.319 [DEBUG][246369] workloadendpoint.go 160: Generating patch for specific rev rev="28332994"
2021-07-22 16:33:24.319 [DEBUG][246369] workloadendpoint.go 166: Generating patch for specific UID uid="5035ea85-9bfb-47e2-a7e7-dbbda513a768"
2021-07-22 16:33:24.319 [DEBUG][246369] workloadendpoint.go 132: Calculated pod patch. patch="{\"metadata\":{\"annotations\":{\"cni.projectcalico.org/podIP\":\"\",\"cni.projectcalico.org/podIPs\":\"\"},\"resourceVersion\":\"28332994\",\"uid\":\"5035ea85-9bfb-47e2-a7e7-dbbda513a768\"}}"

At this point, the annotations on the new pod have been cleared and at the next termination the network is torn down prematurely.
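
For anyone digging through their own node logs, the relevant entries can be pulled out of /var/log/calico/cni/cni.log with a filter along these lines:

    grep -E 'Successfully claimed IPs|Calculated pod patch|DeleteKVP|Teardown processing complete' \
      /var/log/calico/cni/cni.log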

@fasaxc
Member

fasaxc commented Jul 23, 2021

@hindessm sent me a full log via DM. I think I see what's going on; the issue appears to be a bit of missing handling in the kubernetes API datastore. The CNI plugin does this to spot if we're racing between an ADD of a new pod and teardown of an old pod:

		} else if wep.Spec.ContainerID != "" && args.ContainerID != wep.Spec.ContainerID {
			// If the ContainerID is populated and doesn't match the CNI_CONTAINERID provided for this execution, then
			// we shouldn't delete the workload endpoint. We identify workload endpoints based on pod name and namespace, which means
			// we can receive DEL commands for an old sandbox for a currently running pod. However, we key IPAM allocations based on the
			// CNI_CONTAINERID, so we should still do that below for this case.
			logger.WithField("WorkloadEndpoint", wep).Warning("CNI_CONTAINERID does not match WorkloadEndpoint ConainerID, don't delete WEP.")
		} else ...

However, the kubernetes API datastore doesn't seem to populate the ContainerID so we fall through to the else. That used to have no impact because:

  • We track IPAM allocations by containerID so they were correctly scoped to the particular instance of the pod already.
  • Before we added the podIPs="" annotation, WEP deletion was a no-op in the k8s API datastore.

However, with the current code, deletion is not a no-op: it sets podIPs to "" to signal that the IP has been removed from the pod. Hence, we're clobbering the value on the newly-created pod when the old one is torn down.

@hindessm I notice in your log that the stateful set pod is deleted successfully once, then kubelet (or CRI-O if you're using that) retries the DEL some time after the new pod has been created. If you can figure out why it's retrying (probably some error during pod teardown unrelated to CNI) you might be able to fix the immediate problem by stopping the retry.

2021-07-22 16:32:26.080 [INFO][242504] k8s.go 571: Teardown processing complete. ContainerID="4acb7ead1e808d46b4dfc443181a75b463633945c916b3bab01291363933380d"
2021-07-22 16:32:26.988 [INFO][242659] k8s.go 571: Teardown processing complete. ContainerID="4acb7ead1e808d46b4dfc443181a75b463633945c916b3bab01291363933380d"
2021-07-22 16:33:24.464 [INFO][246369] k8s.go 571: Teardown processing complete. ContainerID="4acb7ead1e808d46b4dfc443181a75b463633945c916b3bab01291363933380d"

^^ three CNI DELs for the same instance of the pod, I think.
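
A sketch for spotting the retries: count teardowns per container ID in the CNI log, then look at the kubelet log around the same window for whatever error triggered the retry (unit name and log locations depend on your setup):

    # how many times each container ID was torn down
    grep 'Teardown processing complete' /var/log/calico/cni/cni.log \
      | grep -o 'ContainerID="[^"]*"' | sort | uniq -c | sort -rn

    # errors around the teardown window that would make kubelet retry the DEL
    journalctl -u kubelet --since "2021-07-22 16:32" --until "2021-07-22 16:34" \
      | grep -iE 'foobar-2|sandbox|fail'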

fasaxc added a commit to fasaxc/libcalico-go that referenced this issue Jul 27, 2021
This should fix the following issue:
projectcalico/calico#4710
because the CNI plugin will now hit the container ID-based guard logic.
@fasaxc
Member

fasaxc commented Jul 29, 2021

This should be fixed in Calico v3.20, which is due to be released this week. If you want to try the release candidate, it's available via this nightly build of our docs: https://2021-07-29-v3-20-governai.docs.eng.tigera.net/

I also backported the fix to the v3.19 and v3.18 release branches, so it'll be in the next patch release for those minor versions.

@fasaxc
Member

fasaxc commented Aug 3, 2021

Closing: v3.20 is now released. Please re-open if you see this again after upgrading to v3.20.
