
Load Balancer Node membership not being updated for externalTrafficPolicy Local #4230

Closed · JoelSpeed opened this issue Jul 5, 2023 · 1 comment · Fixed by #4234 or kubernetes/kubernetes#119128
Labels: kind/bug

@JoelSpeed (Contributor) commented:

What happened:

When creating a cluster and then creating the first service with externalTrafficPolicy: Local, we are observing that new nodes are not added to the load balancer in a timely manner.

It takes one of three things to trigger the nodes to be added to the load balancer:

  • Restarting the CCM
  • Adding another node (this triggers old nodes to add, but not always the new node)
  • Waiting for the full resync to occur (IIRC 10 hours?)

What you expected to happen:

Nodes should be added to the load balancer as soon as they are added to the cluster.

How to reproduce it (as minimally and precisely as possible):

Create a cluster with the standard load balancer and create a single service with eTP: Local. This doesn't reproduce every time, but we observe with good frequency that freshly added worker nodes are not added to the load balancer when it is created.
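A minimal client-go sketch of the service creation, assuming a default kubeconfig and a placeholder selector (`etp-local-repro` and `app: web` are illustrative names, not part of the original report); applying an equivalent YAML manifest with kubectl works just as well:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig; production code should check errors more carefully.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A LoadBalancer Service with externalTrafficPolicy: Local. The selector
	// ("app: web") is a placeholder; point it at any workload in the cluster.
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "etp-local-repro", Namespace: "default"},
		Spec: corev1.ServiceSpec{
			Type:                  corev1.ServiceTypeLoadBalancer,
			ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
			Selector:              map[string]string{"app": "web"},
			Ports:                 []corev1.ServicePort{{Port: 80}},
		},
	}
	if _, err := client.CoreV1().Services("default").Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	// After the Service is created, add a fresh node to the cluster and check
	// whether it appears in the load balancer backend pool.
}
```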

Anything else we need to know?:

I believe this to be an incompatibility between the predicates used in the cloud-provider core service controller and the logic in the Azure provider that excludes unready nodes from being added to the load balancer.

In kubernetes/kubernetes#109706, the service controller was updated so that it monitors the nodes it passes into UpdateLoadBalancer. The PR aimed to reduce the number of times the load balancer membership is resynced.

The UpdateLoadBalancer function is now only called when the set of nodes changes, meaning a node is added or removed, a node is excluded via label or taint, or the node's provider ID transitions. A sketch of this predicate behaviour is shown below.
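An illustrative sketch of that predicate logic (this is not the actual service controller code; the field comparisons are simplified, and only the exclusion label here is the well-known Kubernetes label). The key point is that the Ready condition is not one of the compared fields:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// excludeLBLabel is the well-known exclusion label; the other trigger
// fields below are illustrative simplifications.
const excludeLBLabel = "node.kubernetes.io/exclude-from-external-load-balancers"

// shouldTriggerLBSync sketches the post-#109706 behaviour: a node update
// only re-syncs load balancer membership when a "relevant" field changes.
func shouldTriggerLBSync(oldNode, newNode *corev1.Node) bool {
	if oldNode.Spec.ProviderID != newNode.Spec.ProviderID {
		return true // provider ID transition
	}
	if hasExcludeLabel(oldNode) != hasExcludeLabel(newNode) {
		return true // exclusion label added or removed
	}
	if len(oldNode.Spec.Taints) != len(newNode.Spec.Taints) {
		return true // (simplified) taint change
	}
	// Note: the Ready condition is deliberately not compared, so a
	// NotReady -> Ready flip does not trigger a re-sync on its own.
	return false
}

func hasExcludeLabel(n *corev1.Node) bool {
	_, ok := n.Labels[excludeLBLabel]
	return ok
}
```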

The call to update the load balancers eventually calls down to EnsureHostsInPool, which then calls out to ShouldNodeExcludedFromLoadBalancer. This determines whether the node should be included or not.

When a node is not ready, it is excluded, which means there is a race when a node is first added. If the CCM observes the load balancer update before the Node becomes ready, the node is excluded from the load balancer. With the new reduction of calls to UpdateLoadBalancer, when the node flips to Ready it no longer gets added to the load balancers, because nothing calls UpdateLoadBalancer: according to the core CCM code, nothing has changed.
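A simplified sketch of the exclusion behaviour described above (illustrative only, not the actual ShouldNodeExcludedFromLoadBalancer implementation): if the node is still NotReady at the moment membership is reconciled, it is filtered out, and nothing revisits it until a full resync.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// shouldExcludeNode mirrors the kind of check described above; details
// are simplified and names are illustrative.
func shouldExcludeNode(node *corev1.Node) bool {
	// The exclusion label always excludes the node.
	if _, ok := node.Labels["node.kubernetes.io/exclude-from-external-load-balancers"]; ok {
		return true
	}
	// Readiness-based exclusion: this is the part that races with freshly
	// added nodes. If the CCM reconciles before the node reports Ready,
	// the node is dropped here and not re-added until a full resync.
	return !isNodeReady(node)
}

func isNodeReady(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}
```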

In my opinion, and that of KEP-3458, we need to remove the readiness check from the Azure CCM and allow the nodes to be added independent of their readiness.
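Sketched as a change to the illustrative filter above (again, not a real patch), the proposal amounts to dropping the readiness condition so that membership is driven by the exclusion label, with unhealthy backends handled by health probes instead:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// shouldExcludeNodeProposed drops the readiness condition from the
// illustrative filter above; only the exclusion label removes a node
// from load balancer membership.
func shouldExcludeNodeProposed(node *corev1.Node) bool {
	_, excluded := node.Labels["node.kubernetes.io/exclude-from-external-load-balancers"]
	return excluded
}
```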

To do this, though, we must first fix the broken health checks (#3499), for which a PR is already up (#3887). This is also related to #3500, which adds another reason not to manage load balancer membership based on whether a node is ready or not.

Environment:

  • Kubernetes version (use kubectl version): 1.27
  • Cloud provider or hardware configuration: Azure 1.27
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
@feiskyer (Member) commented Jul 5, 2023:

Thanks @JoelSpeed for reporting the issues and adding the KEP links. Fully agreed with you that health probes should be improved and aligned with KEP-3836 and KEP-3458.

Added to v1.28 milestone for tracking.
