
kube-proxy 1.23 → 1.28+ direct upgrade may introduce latent failures #132516

Closed
@M00nF1sh

Description


What happened?

We encountered an issue where upgrading kube-proxy directly from v1.23 to v1.30 led to a latent failure that only surfaced a month later.

We’re aware this upgrade path violates the Kubernetes version skew policy and isn’t officially supported. However, the failure mode is subtle and could easily go unnoticed in real-world environments.

Sharing this here to let you decide whether it qualifies as a valid bug worth addressing 😄

From the changelog: in kube-proxy v1.23, KUBE-XLB-* chains were used to handle local-traffic masquerading for LoadBalancer-type Services. These were renamed to KUBE-EXT-* in v1.24, and handling of the old KUBE-XLB-* chains was fully removed starting in v1.28.
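
For context, a quick way to check whether a node still carries chains from the old scheme is to dump the nat table and look for them. This is only an illustrative check, not something kube-proxy does itself; on a host that ships both iptables variants, the binary may be iptables-legacy-save instead:

# List leftover v1.23-era chains and the KUBE-SVC-*/KUBE-SEP-* references they still hold.
iptables-save -t nat | grep -E '^(:|-A )KUBE-XLB-'
# Compare with the chains kube-proxy v1.24+ actually maintains.
iptables-save -t nat | grep -E '^(:|-A )KUBE-EXT-'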

What happened:

  • Initial state: The node was running kube-proxy v1.23, which created KUBE-XLB-* chains. These chains referenced KUBE-SVC-* and KUBE-SEP-* chains.

  • Upgrade: kube-proxy was upgraded directly to v1.30. At this point, both KUBE-XLB-* and KUBE-EXT-* chains existed, but only KUBE-EXT-* was actively used (e.g., referenced by KUBE-NODEPORTS). The node continued to function normally for about a month.

  • Failure trigger: A pod restarted on the node. The pod was behind a LoadBalancer Service with externalTrafficPolicy: Local. During the resulting iptables update, kube-proxy's iptables-restore call failed because of lingering references from the obsolete KUBE-XLB-* chains. This left kube-proxy broken indefinitely (until an operator flushed iptables on the node), and the error messages were obscure (a minimal illustration follows the log excerpt below):

2025-06-03T05:04:00.119705706Z stderr F I0603 05:04:00.119482       1 proxier.go:805] "Syncing iptables rules"
2025-06-03T05:04:00.125120581Z stderr F I0603 05:04:00.124957       1 proxier.go:1494] "Reloading service iptables data" numServices=48 numEndpoints=76 numFilterChains=6 numFilterRules=8 numNATChains=22 numNATRules=91
2025-06-03T05:04:00.131593017Z stderr F I0603 05:04:00.131407       1 proxier.go:799] "SyncProxyRules complete" elapsed="12.080955ms"
2025-06-03T05:04:11.434602934Z stderr F I0603 05:04:11.434487       1 proxier.go:805] "Syncing iptables rules"
2025-06-03T05:04:11.440533365Z stderr F I0603 05:04:11.440310       1 proxier.go:1494] "Reloading service iptables data" numServices=48 numEndpoints=73 numFilterChains=6 numFilterRules=11 numNATChains=22 numNATRules=85
2025-06-03T05:04:11.449614998Z stderr F E0603 05:04:11.449370       1 proxier.go:1511] "Failed to execute iptables-restore" err=<
2025-06-03T05:04:11.449649046Z stderr F 	exit status 1: iptables-restore: line 122 failed
2025-06-03T05:04:11.449656349Z stderr F  >
2025-06-03T05:04:11.449662389Z stderr F I0603 05:04:11.449402       1 proxier.go:810] "Sync failed" retryingTime="30s"
2025-06-03T05:04:11.449668176Z stderr F I0603 05:04:11.449423       1 proxier.go:799] "SyncProxyRules complete" elapsed="15.034086ms"
2025-06-03T05:04:12.866456601Z stderr F I0603 05:04:12.866339       1 proxier.go:805] "Syncing iptables rules"
2025-06-03T05:04:12.916236764Z stderr F I0603 05:04:12.916145       1 proxier.go:1494] "Reloading service iptables data" numServices=48 numEndpoints=76 numFilterChains=6 numFilterRules=11 numNATChains=133 numNATRules=286
2025-06-03T05:04:12.923157165Z stderr F E0603 05:04:12.923093       1 proxier.go:1511] "Failed to execute iptables-restore" err=<
2025-06-03T05:04:12.923171168Z stderr F 	exit status 1: iptables-restore: line 433 failed

(From this point on, every kube-proxy sync fails with an error like "iptables-restore: line XXX failed", where the line number varies.)
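
For illustration, the failure mode can be mimicked outside kube-proxy. The chain names and payload below are made-up stand-ins, not kube-proxy's actual output; the point is that a chain which is still referenced from a chain outside the restore input cannot be deleted, and iptables-restore reports nothing beyond a line number:

# Stand-ins for a stale KUBE-XLB-* chain holding a jump to an old KUBE-SEP-* chain.
iptables -t nat -N STALE-XLB-TEST
iptables -t nat -N OLD-SEP-TEST
iptables -t nat -A STALE-XLB-TEST -j OLD-SEP-TEST

# A restore payload that declares the old chain (flushing it) and then deletes it,
# roughly what kube-proxy does for chains it considers stale.
iptables-restore --noflush <<'EOF'
*nat
:OLD-SEP-TEST - [0:0]
-X OLD-SEP-TEST
COMMIT
EOF
# Fails with only: iptables-restore: line <N> failed

# Remove the stand-in chains again.
iptables -t nat -F STALE-XLB-TEST
iptables -t nat -X STALE-XLB-TEST
iptables -t nat -X OLD-SEP-TEST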

What did you expect to happen?

One of the following:

  1. kube-proxy should handle the situation gracefully instead of entering a permanently broken state (until an operator flushes iptables on the node); see the recovery sketch after this list.
  2. Clear error messages when iptables-restore fails.
    • Currently, the logs do not mention the KUBE-XLB chain or the specific rule that caused the failure.
    • Ideally, when a restore attempt fails, kube-proxy should log both the current iptables state and the desired state it tried to apply, to aid debugging.
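
For reference, one way an operator can recover without flushing the entire nat table is to remove the stale chains by hand. This is only a rough sketch, under the assumption that nothing else still jumps into the leftover KUBE-XLB-* chains (worth verifying with iptables-save first):

# Flush each leftover KUBE-XLB-* chain to drop its dangling KUBE-SVC-*/KUBE-SEP-* references,
# then delete it; kube-proxy's next sync should go through afterwards.
for chain in $(iptables -t nat -S | awk '/^-N KUBE-XLB-/ {print $2}'); do
    iptables -t nat -F "$chain"
    iptables -t nat -X "$chain"
done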

How can we reproduce it (as minimally and precisely as possible)?

  1. Deploy kube-proxy version v1.23.
  2. Add a node configured with only iptables-legacy (this may also affect nodes with both iptables-nft and iptables-legacy, but I haven't tested that configuration).
  3. Ensure kube-proxy v1.23 is running on the new node.
  4. Create a Service of type LoadBalancer with externalTrafficPolicy: Local, and deploy a DaemonSet backing this Service (any other way to ensure a pod lands on that node works as well); see the example manifest after this list.
  5. Upgrade kube-proxy directly to v1.30, and ensure it starts on the node.
  6. Delete the Service's pods on that node; kube-proxy on that node will enter this broken state indefinitely.
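
A sketch of steps 4 and 6 with made-up names (xlb-repro, plus placeholders for the node and the kube-proxy pod), purely for illustration:

# Step 4: a LoadBalancer Service with externalTrafficPolicy: Local, backed by a DaemonSet.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: xlb-repro
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: xlb-repro
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: xlb-repro
spec:
  selector:
    matchLabels:
      app: xlb-repro
  template:
    metadata:
      labels:
        app: xlb-repro
    spec:
      containers:
        - name: web
          image: nginx
          ports:
            - containerPort: 80
EOF

# Step 6: delete the backing pod on the affected node, then watch kube-proxy there fail its next sync.
kubectl delete pod -l app=xlb-repro --field-selector spec.nodeName=<node-name>
kubectl -n kube-system logs <kube-proxy-pod-on-that-node> | grep 'iptables-restore'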

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.13-eks-5d4a308

Cloud provider

EKS

OS version

# On Linux:
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2026-06-30"

$ uname -a
Linux ip-192-168-57-100.us-west-2.compute.internal 5.10.237-230.949.amzn2.x86_64 #1 SMP Thu Jun 5 23:30:10 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Metadata

Assignees

No one assigned

Labels

kind/bug (Categorizes issue or PR as related to a bug.)
needs-sig (Indicates an issue or PR lacks a `sig/foo` label and requires one.)
needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
