
kube-proxy 1.23 → 1.28+ direct upgrade may introduce latent failures #132516

Closed
@M00nF1sh

Description


What happened?

We encountered an issue where upgrading kube-proxy directly from v1.23 to v1.30 led to a latent failure that only surfaced a month later.

We’re aware this upgrade path violates the Kubernetes version skew policy and isn’t officially supported. However, the failure mode is subtle and could easily go unnoticed in real-world environments.

Sharing this here to let you decide whether it qualifies as a valid bug worth addressing 😄

From the changelog: in kube-proxy v1.23, KUBE-XLB-* chains were used to handle local-traffic masquerading for LoadBalancer-type Services. These were renamed to KUBE-EXT-* in v1.24, and handling of the old KUBE-XLB-* chains was fully removed starting in v1.28.
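
For context, a quick way to check whether a node still carries chains from the old scheme is to dump the nat table and look for them. This is only an illustrative check, not something kube-proxy does itself; on a host that ships both iptables variants, the binary may be iptables-legacy-save instead:

# List leftover v1.23-era chains and the KUBE-SVC-*/KUBE-SEP-* references they still hold.
iptables-save -t nat | grep -E '^(:|-A )KUBE-XLB-'
# Compare with the chains kube-proxy v1.24+ actually maintains.
iptables-save -t nat | grep -E '^(:|-A )KUBE-EXT-'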

What happened:

  • Initial state: The node was running kube-proxy v1.23, which created KUBE-XLB-* chains. These chains referenced KUBE-SVC-* and KUBE-SEP-* chains.

  • Upgrade: kube-proxy was upgraded directly to v1.30. At this point, both KUBE-XLB-* and KUBE-EXT-* chains existed, but only KUBE-EXT-* was actively used (e.g., referenced by KUBE-NODEPORTS). The node continued to function normally for about a month.

  • Failure trigger: A pod restarted on the node. The pod was behind a LoadBalancer Service with externalTrafficPolicy: Local. During the resulting iptables update, kube-proxy's iptables-restore call failed because of lingering references from the obsolete KUBE-XLB-* chains. This left kube-proxy broken indefinitely (until an operator flushed iptables on the node), and the error messages were obscure (a minimal illustration follows the log excerpt below):

2025-06-03T05:04:00.119705706Z stderr F I0603 05:04:00.119482       1 proxier.go:805] "Syncing iptables rules"
2025-06-03T05:04:00.125120581Z stderr F I0603 05:04:00.124957       1 proxier.go:1494] "Reloading service iptables data" numServices=48 numEndpoints=76 numFilterChains=6 numFilterRules=8 numNATChains=22 numNATRules=91
2025-06-03T05:04:00.131593017Z stderr F I0603 05:04:00.131407       1 proxier.go:799] "SyncProxyRules complete" elapsed="12.080955ms"
2025-06-03T05:04:11.434602934Z stderr F I0603 05:04:11.434487       1 proxier.go:805] "Syncing iptables rules"
2025-06-03T05:04:11.440533365Z stderr F I0603 05:04:11.440310       1 proxier.go:1494] "Reloading service iptables data" numServices=48 numEndpoints=73 numFilterChains=6 numFilterRules=11 numNATChains=22 numNATRules=85
2025-06-03T05:04:11.449614998Z stderr F E0603 05:04:11.449370       1 proxier.go:1511] "Failed to execute iptables-restore" err=<
2025-06-03T05:04:11.449649046Z stderr F 	exit status 1: iptables-restore: line 122 failed
2025-06-03T05:04:11.449656349Z stderr F  >
2025-06-03T05:04:11.449662389Z stderr F I0603 05:04:11.449402       1 proxier.go:810] "Sync failed" retryingTime="30s"
2025-06-03T05:04:11.449668176Z stderr F I0603 05:04:11.449423       1 proxier.go:799] "SyncProxyRules complete" elapsed="15.034086ms"
2025-06-03T05:04:12.866456601Z stderr F I0603 05:04:12.866339       1 proxier.go:805] "Syncing iptables rules"
2025-06-03T05:04:12.916236764Z stderr F I0603 05:04:12.916145       1 proxier.go:1494] "Reloading service iptables data" numServices=48 numEndpoints=76 numFilterChains=6 numFilterRules=11 numNATChains=133 numNATRules=286
2025-06-03T05:04:12.923157165Z stderr F E0603 05:04:12.923093       1 proxier.go:1511] "Failed to execute iptables-restore" err=<
2025-06-03T05:04:12.923171168Z stderr F 	exit status 1: iptables-restore: line 433 failed

(From this point on, every kube-proxy sync fails with an error like "iptables-restore: line XXX failed", where the line number varies.)
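
For illustration, the failure mode can be mimicked outside kube-proxy. The chain names and payload below are made-up stand-ins, not kube-proxy's actual output; the point is that a chain which is still referenced from a chain outside the restore input cannot be deleted, and iptables-restore reports nothing beyond a line number:

# Stand-ins for a stale KUBE-XLB-* chain holding a jump to an old KUBE-SEP-* chain.
iptables -t nat -N STALE-XLB-TEST
iptables -t nat -N OLD-SEP-TEST
iptables -t nat -A STALE-XLB-TEST -j OLD-SEP-TEST

# A restore payload that declares the old chain (flushing it) and then deletes it,
# roughly what kube-proxy does for chains it considers stale.
iptables-restore --noflush <<'EOF'
*nat
:OLD-SEP-TEST - [0:0]
-X OLD-SEP-TEST
COMMIT
EOF
# Fails with only: iptables-restore: line <N> failed

# Remove the stand-in chains again.
iptables -t nat -F STALE-XLB-TEST
iptables -t nat -X STALE-XLB-TEST
iptables -t nat -X OLD-SEP-TEST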

What did you expect to happen?

One of the following:

  1. kube-proxy should handle the situation gracefully instead of entering a permanently broken state (until an operator flushes iptables on the node); see the recovery sketch after this list.
  2. Clear error messages when iptables-restore fails.
    • Currently, the logs do not mention the KUBE-XLB chain or the specific rule that caused the failure.
    • Ideally, when a restore attempt fails, kube-proxy should log both the current iptables state and the desired state it tried to apply, to aid debugging.
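
For reference, one way an operator can recover without flushing the entire nat table is to remove the stale chains by hand. This is only a rough sketch, under the assumption that nothing else still jumps into the leftover KUBE-XLB-* chains (worth verifying with iptables-save first):

# Flush each leftover KUBE-XLB-* chain to drop its dangling KUBE-SVC-*/KUBE-SEP-* references,
# then delete it; kube-proxy's next sync should go through afterwards.
for chain in $(iptables -t nat -S | awk '/^-N KUBE-XLB-/ {print $2}'); do
    iptables -t nat -F "$chain"
    iptables -t nat -X "$chain"
done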

How can we reproduce it (as minimally and precisely as possible)?

  1. Deploy kube-proxy version v1.23.
  2. Add a node configured with only iptables-legacy (this may also affect nodes with both iptables-nft and iptables-legacy, but I haven't tested that configuration).
  3. Ensure kube-proxy v1.23 is running on the new node.
  4. Create a Service of type LoadBalancer with externalTrafficPolicy: Local, and deploy a DaemonSet backing this Service (any other way to ensure a pod lands on that node works as well); see the example manifest after this list.
  5. Upgrade kube-proxy directly to v1.30, and ensure it starts on the node.
  6. Delete the Service's pods on that node; kube-proxy on that node will enter this broken state indefinitely.
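
A sketch of steps 4 and 6 with made-up names (xlb-repro, plus placeholders for the node and the kube-proxy pod), purely for illustration:

# Step 4: a LoadBalancer Service with externalTrafficPolicy: Local, backed by a DaemonSet.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: xlb-repro
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: xlb-repro
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: xlb-repro
spec:
  selector:
    matchLabels:
      app: xlb-repro
  template:
    metadata:
      labels:
        app: xlb-repro
    spec:
      containers:
        - name: web
          image: nginx
          ports:
            - containerPort: 80
EOF

# Step 6: delete the backing pod on the affected node, then watch kube-proxy there fail its next sync.
kubectl delete pod -l app=xlb-repro --field-selector spec.nodeName=<node-name>
kubectl -n kube-system logs <kube-proxy-pod-on-that-node> | grep 'iptables-restore'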

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.13-eks-5d4a308

Cloud provider

EKS

OS version

# On Linux:
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2026-06-30"

$ uname -a
Linux ip-192-168-57-100.us-west-2.compute.internal 5.10.237-230.949.amzn2.x86_64 #1 SMP Thu Jun 5 23:30:10 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Metadata

Assignees

No one assigned

Labels

kind/bug (Categorizes issue or PR as related to a bug.)
needs-sig (Indicates an issue or PR lacks a `sig/foo` label and requires one.)
needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
