
Pods in statefulsets sometimes have a blank podIP annotation, causing network issues during termination #4710

Closed
LeeHampton opened this issue Jul 1, 2021 · 12 comments

Comments

@LeeHampton

Summary

We're seeing a strange issue where pods in a statefulset sometimes lose network connectivity when they're in a Terminating state: some of our pods make TCP calls while shutting down, but those calls hit connection timeouts, leaving the pods hung until the shutdown grace period ends.

We've discovered that this happens when pods have an empty cni.projectcalico.org/podIP annotation. Most pods have an IP for that value, but some just have an empty string where the IP should be. Sometimes restarting pods gets them re-annotated, but not always.

The pods with empty podIP annotations have totally normal network connectivity when they are running, despite the blank annotation. They can be reached by other pods, reached from the public internet via NodePorts, and can communicate with other pods and external services like S3. It is only when they enter Terminating that they lose network connectivity.
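
A sketch of one way to list the affected pods (assuming jq is available; the annotation key is the standard Calico one):

    kubectl get pods -A -o json \
      | jq -r '.items[]
          | select(.metadata.annotations["cni.projectcalico.org/podIP"] == "")
          | "\(.metadata.namespace)/\(.metadata.name)"'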

Expected Behavior

We expect the pod to retain network connectivity throughout its Terminating state until the pod is fully torn down.

Steps to Reproduce (for bugs)

We have not found a way to reproduce the empty podIP annotation. But once a podIP annotation is empty, reproducing is straightforward.

  1. Ensure pod is configured to run some network call (push WAL segments to S3 in our case) during shutdown
  2. Delete the pod
  3. Observe network timeouts while the pod is Terminating.
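
A rough sketch of steps 2 and 3 (the pod name is a placeholder, and it assumes the image has nc or a similar tool available for the test call):

    kubectl delete pod foobar-2 --wait=false
    # while the pod is still Terminating, make an outbound call from inside it
    kubectl exec foobar-2 -- timeout 10 nc -zv s3.amazonaws.com 443 \
      || echo "outbound call timed out while Terminating"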

Your Environment

  • Calico version - 3.17.3
  • Orchestrator version (e.g. kubernetes, mesos, rkt): k8s 1.19.9 managed by kops 1.19.1
@fasaxc
Member

fasaxc commented Jul 1, 2021

My guess is that, because this is a stateful set:

  • The name of the pod gets reused when it gets restarted
  • The new pod with same name is starting up while the old one is in its graceful shutdown window
  • The graceful shutdown window ends and the CNI DEL happens, which patches the IP annotation on the wrong instance of the pod (removes it from the new pod instead of the old one).

We have explicit code in place to prevent exactly that scenario so it'll take a bit of debugging.

I checked that v3.17.3 has that fix (the cni-plugin and node repos are pinned to a version that includes it).

@LeeHampton Can you check that the Calico CNI binary on your host really is the right version? The binaries are usually in /opt/bin/cni; run ./calico -v and it should print the version.

@LeeHampton
Author

@fasaxc Yes, ./calico -v gives me v3.17.3 when I run it on the host nodes.

@song-jiang
Member

@LeeHampton Is it easy to reproduce? Could you reproduce the issue and take note of the pod name involved (e.g. pod-A)?

  • Can you confirm whether a new pod with the same name has been started?
  • Can you get the logs from calico-typha and from the calico-node pod scheduled on the same node as pod-A once you hit the issue?
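
Something like this should grab those logs (assuming the default kube-system namespace and the k8s-app labels from the Calico manifests; adjust for your install):

    NODE=$(kubectl get pod pod-A -o jsonpath='{.spec.nodeName}')
    CALICO_POD=$(kubectl -n kube-system get pods -l k8s-app=calico-node \
      --field-selector spec.nodeName="$NODE" -o name)
    kubectl -n kube-system logs "$CALICO_POD"
    kubectl -n kube-system logs -l k8s-app=calico-typha --prefix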

@feikesteenbergen

@song-jiang As far as we can tell, the new pod is only spun up after the old pod is torn down; at no point in time are there two pods with the exact same name in play.

@hindessm

hindessm commented Jul 22, 2021

We are seeing a similar issue. We hadn't observed the missing podIP annotation but will look for that in testing going forward. We were seeing problems with statefulset graceful shutdown, and noticed that if we ran strace on pid 1 in the container and ip -ts monitor route dev $(ip route get $PODIP | awk '/dev/{print $3}') on the host (commands below), the timestamps would often show the route being deleted before the TERM signal had been delivered to the process. In our case it is not related to the new pod starting, as the placement/PV requirements wouldn't allow that to happen.
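
For reference, the two host-side commands were roughly the following (run in separate terminals; how you find the container's pid 1 depends on your runtime, e.g. via crictl inspect or docker inspect):

    # trace signal delivery to the container's init process
    strace -tt -e trace=signal -p "$PID1"

    # log route changes, with timestamps, for the pod's host-side interface
    ip -ts monitor route dev "$(ip route get "$PODIP" | awk '/dev/{print $3}')"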

We are using k8s 1.20.8 and calico v3.17.4.

Update: checking the statefulset that I had been testing, the podIP/podIPs annotations are empty. The same is true of the pods of another statefulset in the cluster, but I checked a few pods that aren't in statefulsets and they have the correct annotations.

@hindessm

If we manually reset the empty annotations with the expected podIP/podIPs values then the graceful shutdown works as expected.

@LeeHampton
Author

If we manually reset the empty annotations with the expected podIP/podIPs values then the graceful shutdown works as expected.

That happens with us as well, @hindessm. Our current workaround is just to periodically find pods without those annotations and re-annotate them.
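
Roughly, the workaround loop looks like this (a sketch; assumes jq, and writes status.podIP back in the /32 form Calico uses for these annotations):

    kubectl get pods -A -o json \
      | jq -r '.items[]
          | select(.metadata.annotations["cni.projectcalico.org/podIP"] == "")
          | select(.status.podIP != null)
          | "\(.metadata.namespace) \(.metadata.name) \(.status.podIP)"' \
      | while read -r ns name ip; do
          kubectl -n "$ns" annotate pod "$name" --overwrite \
            cni.projectcalico.org/podIP="$ip/32" \
            cni.projectcalico.org/podIPs="$ip/32"
        done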

@hindessm

A colleague has been monitoring the annotations to see when they get cleared, and the clearing of the annotations on the new pod seems to coincide with this entry:

2021-07-22 16:33:24.319 [DEBUG][246369] workloadendpoint.go 132: Calculated pod patch. patch="{\"metadata\":{\"annotations\":{\"cni.projectcalico.org/podIP\":\"\",\"cni.projectcalico.org/podIPs\":\"\"},\"resourceVersion\":\"28332994\",\"uid\":\"5035ea85-9bfb-47e2-a7e7-dbbda513a768\"}}"

in the node/host log, /var/log/calico/cni/cni.log.
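
For anyone wanting to reproduce the observation, a crude watch loop like this is enough (the pod name is a placeholder):

    while true; do
      ts=$(date -u +%H:%M:%S)
      val=$(kubectl get pod foobar-2 \
        -o jsonpath='{.metadata.annotations.cni\.projectcalico\.org/podIP}')
      echo "$ts podIP='$val'"
      sleep 1
    done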

@hindessm

To give a little more context, the timeline when deleting a pod, foobar-2, that is part of a statefulset looks like:

  • 16:32:14 - old foobar-2 delete begins
  • 16:32:32 - old foobar-2 delete finishes, new foobar-2 pod starts showing up
  • 16:32:45 - IP claimed for new pod
2021-07-22 16:32:45.108 [INFO][243806] ipam.go 983: Successfully claimed IPs: [172.30.197.150/26] block=172.30.197.128/26 handle="k8s-pod-network.9b2efaad87d0b6c5c8dd6cf32161569d3f6d1dfa2a4f861024c9b2c4d8b3e9c2" host="10.36.58.103"
2021-07-22 16:32:45.160 [DEBUG][243794] workloadendpoint.go 132: Calculated pod patch. patch="{\"metadata\":{\"annotations\":{\"cni.projectcalico.org/podIP\":\"172.30.197.150/32\",\"cni.projectcalico.org/podIPs\":\"172.30.197.150/32\"}}}"
  • 16:33:24 - calico decides to patch out the IPs
2021-07-22 16:33:24.319 [DEBUG][246369] k8s.go 536: Performing 'DeleteKVP' for WorkloadEndpoint(default/10.36.58.103-k8s-foobar--2-eth0)
2021-07-22 16:33:24.319 [DEBUG][246369] workloadendpoint.go 79: Delete for WorkloadEndpoint, patching out annotations.
2021-07-22 16:33:24.319 [DEBUG][246369] workloadendpoint.go 160: Generating patch for specific rev rev="28332994"
2021-07-22 16:33:24.319 [DEBUG][246369] workloadendpoint.go 166: Generating patch for specific UID uid="5035ea85-9bfb-47e2-a7e7-dbbda513a768"
2021-07-22 16:33:24.319 [DEBUG][246369] workloadendpoint.go 132: Calculated pod patch. patch="{\"metadata\":{\"annotations\":{\"cni.projectcalico.org/podIP\":\"\",\"cni.projectcalico.org/podIPs\":\"\"},\"resourceVersion\":\"28332994\",\"uid\":\"5035ea85-9bfb-47e2-a7e7-dbbda513a768\"}}"

At this point, the annotations on the new pod have been cleared and at the next termination the network is torn down prematurely.
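
For anyone digging through their own node logs, the relevant entries can be pulled out of /var/log/calico/cni/cni.log with a filter along these lines:

    grep -E 'Successfully claimed IPs|Calculated pod patch|DeleteKVP|Teardown processing complete' \
      /var/log/calico/cni/cni.log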

@fasaxc
Member

fasaxc commented Jul 23, 2021

@hindessm sent me a full log via DM. I think I see what's going on; the issue appears to be a bit of missing handling in the kubernetes API datastore. The CNI plugin does this to spot if we're racing between an ADD of a new pod and teardown of an old pod:

		} else if wep.Spec.ContainerID != "" && args.ContainerID != wep.Spec.ContainerID {
			// If the ContainerID is populated and doesn't match the CNI_CONTAINERID provided for this execution, then
			// we shouldn't delete the workload endpoint. We identify workload endpoints based on pod name and namespace, which means
			// we can receive DEL commands for an old sandbox for a currently running pod. However, we key IPAM allocations based on the
			// CNI_CONTAINERID, so we should still do that below for this case.
			logger.WithField("WorkloadEndpoint", wep).Warning("CNI_CONTAINERID does not match WorkloadEndpoint ConainerID, don't delete WEP.")
		} else ...

However, the kubernetes API datastore doesn't seem to populate the ContainerID so we fall through to the else. That used to have no impact because:

  • We track IPAM allocations by containerID so they were correctly scoped to the particular instance of the pod already.
  • Before we added the podIPs="" annotation, WEP deletion was a no-op in the k8s API datastore.

However, with the current code, deletion is not a no-op: it sets podIPs to "" to signal that the IP has been removed from the pod. Hence, we're clobbering the value on the newly-created pod when the old one is torn down.

@hindessm I notice in your log that the stateful set pod is deleted successfully once, then kubelet (or CRI-O if you're using that) retries the DEL some time after the new pod has been created. If you can figure out why it's retrying (probably some error during pod teardown unrelated to CNI) you might be able to fix the immediate problem by stopping the retry.

2021-07-22 16:32:26.080 [INFO][242504] k8s.go 571: Teardown processing complete. ContainerID="4acb7ead1e808d46b4dfc443181a75b463633945c916b3bab01291363933380d"
2021-07-22 16:32:26.988 [INFO][242659] k8s.go 571: Teardown processing complete. ContainerID="4acb7ead1e808d46b4dfc443181a75b463633945c916b3bab01291363933380d"
2021-07-22 16:33:24.464 [INFO][246369] k8s.go 571: Teardown processing complete. ContainerID="4acb7ead1e808d46b4dfc443181a75b463633945c916b3bab01291363933380d"

^^ three CNI DELs for the same instance of the pod, I think.
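
A sketch for spotting the retries: count teardowns per container ID in the CNI log, then look at the kubelet log around the same window for whatever error triggered the retry (unit name and log locations depend on your setup):

    # how many times each container ID was torn down
    grep 'Teardown processing complete' /var/log/calico/cni/cni.log \
      | grep -o 'ContainerID="[^"]*"' | sort | uniq -c | sort -rn

    # errors around the teardown window that would make kubelet retry the DEL
    journalctl -u kubelet --since "2021-07-22 16:32" --until "2021-07-22 16:34" \
      | grep -iE 'foobar-2|sandbox|fail'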

fasaxc added a commit to fasaxc/libcalico-go that referenced this issue Jul 27, 2021
This should fix the following issue:
projectcalico/calico#4710
because the CNI plugin will now hit the container ID-based guard logic.
@fasaxc
Member

fasaxc commented Jul 29, 2021

This should be fixed in Calico v3.20, which is due to be released this week. If you want to try the release candidate, it's available via this nightly build of our docs: https://2021-07-29-v3-20-governai.docs.eng.tigera.net/

I also backported the fix to the v3.19 and v3.18 release branches, so it'll be in the next patch release for those minor versions.

@fasaxc
Member

fasaxc commented Aug 3, 2021

Closing: v3.20 is now released. Please re-open if you see this again after upgrading to v3.20.
