Description
What happened?
We encountered an issue where upgrading kube-proxy directly from v1.23 to v1.30 led to a latent failure that only surfaced a month later.
We’re aware this upgrade path violates the Kubernetes version skew policy and isn’t officially supported. However, the failure mode is subtle and could easily go unnoticed in real-world environments.
Sharing this here to let you decide whether it qualifies as a valid bug worth addressing 😄
From the changelog: in kube-proxy v1.23, KUBE-XLB-* chains were used to handle local-traffic masquerading for LoadBalancer-type Services. These were renamed to KUBE-EXT-* in v1.24, and all mention of KUBE-XLB-* was removed starting in v1.28.
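For illustration, this is roughly what the leftover nat-table state looks like after such an upgrade (a sketch; the chain hashes and port are made up for the example, and on nodes with both backends installed the iptables-legacy-* variants should be used instead):

$ iptables-save -t nat | grep -E 'KUBE-(XLB|EXT|NODEPORTS)'
# stale chain created by kube-proxy v1.23, still pointing at KUBE-SVC-*/KUBE-SEP-* chains:
:KUBE-XLB-EXAMPLEHASH0001 - [0:0]
-A KUBE-XLB-EXAMPLEHASH0001 -j KUBE-SEP-EXAMPLEHASH0002
# replacement chain created by kube-proxy >= v1.24; only this one is actively referenced:
:KUBE-EXT-EXAMPLEHASH0001 - [0:0]
-A KUBE-NODEPORTS -p tcp -m tcp --dport 31234 -j KUBE-EXT-EXAMPLEHASH0001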
What happened:
- Initial state: the node was running kube-proxy v1.23, which created KUBE-XLB-* chains. These chains referenced KUBE-SVC-* and KUBE-SEP-* chains.
- Upgrade: kube-proxy was upgraded directly to v1.30. At this point both KUBE-XLB-* and KUBE-EXT-* chains existed, but only KUBE-EXT-* was actively used (e.g., referenced by KUBE-NODEPORTS). The node continued to function normally for about a month.
- Failure trigger: a pod restarted on the node. This pod was behind a LoadBalancer Service with externalTrafficPolicy: Local. During the resulting iptables update, kube-proxy failed to run iptables-restore due to lingering references from the obsolete KUBE-XLB-* chains. This left kube-proxy broken indefinitely (until an operator went and flushed iptables on the node). The error messages were obscure:
2025-06-03T05:04:00.119705706Z stderr F I0603 05:04:00.119482 1 proxier.go:805] "Syncing iptables rules"
2025-06-03T05:04:00.125120581Z stderr F I0603 05:04:00.124957 1 proxier.go:1494] "Reloading service iptables data" numServices=48 numEndpoints=76 numFilterChains=6 numFilterRules=8 numNATChains=22 numNATRules=91
2025-06-03T05:04:00.131593017Z stderr F I0603 05:04:00.131407 1 proxier.go:799] "SyncProxyRules complete" elapsed="12.080955ms"
2025-06-03T05:04:11.434602934Z stderr F I0603 05:04:11.434487 1 proxier.go:805] "Syncing iptables rules"
2025-06-03T05:04:11.440533365Z stderr F I0603 05:04:11.440310 1 proxier.go:1494] "Reloading service iptables data" numServices=48 numEndpoints=73 numFilterChains=6 numFilterRules=11 numNATChains=22 numNATRules=85
2025-06-03T05:04:11.449614998Z stderr F E0603 05:04:11.449370 1 proxier.go:1511] "Failed to execute iptables-restore" err=<
2025-06-03T05:04:11.449649046Z stderr F exit status 1: iptables-restore: line 122 failed
2025-06-03T05:04:11.449656349Z stderr F >
2025-06-03T05:04:11.449662389Z stderr F I0603 05:04:11.449402 1 proxier.go:810] "Sync failed" retryingTime="30s"
2025-06-03T05:04:11.449668176Z stderr F I0603 05:04:11.449423 1 proxier.go:799] "SyncProxyRules complete" elapsed="15.034086ms"
2025-06-03T05:04:12.866456601Z stderr F I0603 05:04:12.866339 1 proxier.go:805] "Syncing iptables rules"
2025-06-03T05:04:12.916236764Z stderr F I0603 05:04:12.916145 1 proxier.go:1494] "Reloading service iptables data" numServices=48 numEndpoints=76 numFilterChains=6 numFilterRules=11 numNATChains=133 numNATRules=286
2025-06-03T05:04:12.923157165Z stderr F E0603 05:04:12.923093 1 proxier.go:1511] "Failed to execute iptables-restore" err=<
2025-06-03T05:04:12.923171168Z stderr F exit status 1: iptables-restore: line 433 failed
(All kube-proxy sync operations fail with errors like "iptables-restore: line XXX failed", where the line number varies, from this point on.)
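For reference, recovery along these lines works (a sketch, not an exact transcript of what our operator ran; flushing the nat table is disruptive, since Service traffic on the node breaks until kube-proxy and the CNI rebuild their rules, and the pod selector below assumes the common k8s-app=kube-proxy label):

# confirm the stale v1.23 chains are still present
$ iptables-save -t nat | grep -c KUBE-XLB
# flush all nat rules, delete the now-empty chains, then let kube-proxy resync from scratch
$ iptables -t nat -F
$ iptables -t nat -X
$ kubectl -n kube-system delete pod -l k8s-app=kube-proxy --field-selector spec.nodeName=<node-name>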
What did you expect to happen?
One of the following:
- kube-proxy should handle the situation gracefully instead of entering a permanently broken state (until an operator flushes iptables on the node).
- Clear error messages when iptables-restore fails.
- Currently, the logs do not mention the KUBE-XLB chain or the specific rule that caused the failure.
- Ideally, kube-proxy should log the current iptables state and the desired state whenever a restore attempt fails, to aid debugging (see the sketch below for what we had to do by hand).
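Right now the only way we found to get at this information is manually, e.g. snapshotting the node's nat table when the failure shows up and grepping for the stale chains ourselves (a sketch, not kube-proxy functionality):

# dump the live nat table as soon as "Failed to execute iptables-restore" appears
$ iptables-save -t nat > /tmp/nat-at-failure.rules
# stale chains kube-proxy v1.30 no longer owns but that still reference its KUBE-SVC-*/KUBE-SEP-* chains
$ grep -E '^(:|-A )KUBE-XLB' /tmp/nat-at-failure.rules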
How can we reproduce it (as minimally and precisely as possible)?
- Deploy kube-proxy version v1.23.
- Add a node configured with only iptables-legacy (this may also affect nodes with both iptables-nft and iptables-legacy, but I have not tested that configuration).
- Ensure kube-proxy v1.23 is running on the new node.
- Create a Service of type LoadBalancer with externalTrafficPolicy: Local, and deploy a DaemonSet backing this Service (any other way to ensure a pod lands on that node works as well).
- Upgrade kube-proxy directly to v1.30, and ensure it starts on the node.
- Delete the Service's pods on that node; kube-proxy on that node will enter this broken state indefinitely (see the verification sketch below).
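After the last step, the broken state can be confirmed with something like this (a sketch; substitute the kube-proxy pod running on that node):

# kube-proxy retries every sync and keeps failing
$ kubectl -n kube-system logs <kube-proxy-pod-on-that-node> | grep 'Failed to execute iptables-restore'
# the stale v1.23 chains are still present on the node
$ iptables-save -t nat | grep KUBE-XLB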
Anything else we need to know?
No response
Kubernetes version
kube-proxy v1.23 upgraded directly to v1.30 (full kubectl version output not captured)
Cloud provider
OS version
$ uname -a
Linux ip-192-168-57-100.us-west-2.compute.internal 5.10.237-230.949.amzn2.x86_64 #1 SMP Thu Jun 5 23:30:10 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux