Delay in processing HNS LB policies on kube-proxy start on Windows nodes results in unreachable services #109162

Closed
LP0101 opened this issue Mar 30, 2022 · 3 comments · Fixed by #109124
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/windows Categorizes an issue or PR as relevant to SIG Windows. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

LP0101 commented Mar 30, 2022

What happened?

When starting Windows nodes on a cluster with a high number of HNS LB policies/rules, there is a delay in processing them. Services are unreachable during this delay, which takes about half a minute per policy. This can be substantial given enough rules.

This occurs both when restarting kube-proxy and when rebooting the host. Once the system reaches a state where all the policy lists have been processed, incremental updates to the services (e.g. endpoint changes) are handled fine.
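
To give a rough sense of scale, the half-minute-per-policy figure above can be turned into a back-of-the-envelope estimate. This is only a sketch: it assumes the policies are processed one after another at a constant 30 seconds each, which is an approximation rather than a measured value, and the function names are made up for illustration.

```python
# Assumption: roughly 30 s per policy, processed serially ("about half a minute per policy").
SECONDS_PER_POLICY = 30


def estimated_delay_seconds(policy_count, seconds_per_policy=SECONDS_PER_POLICY):
    """Estimate how long services stay unreachable for a given number of policies."""
    return policy_count * seconds_per_policy


def format_duration(seconds):
    """Render a number of seconds as hours, minutes and seconds."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for count in (10, 100, 1000):
        print(count, "policies:", format_duration(estimated_delay_seconds(count)))
```

Under these assumptions, 100 policies come to about 50 minutes and 1000 policies to more than 8 hours.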

What did you expect to happen?

HNS policies should not cause a large delay for Windows nodes.

How can we reproduce it (as minimally and precisely as possible)?

With a large number of HNS policies in place, restart kube-proxy on a Windows node.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.7", GitCommit:"1f86634ff08f37e54e8bfcd86bc90b61c98f84d4", GitTreeState:"clean", BuildDate:"2021-11-17T14:41:19Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.7", GitCommit:"f74784f1eaf1e02b651778d6ee2df1ae5ee729ae", GitTreeState:"clean", BuildDate:"2022-03-10T07:58:41Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

Azure AKS

@LP0101 LP0101 added the kind/bug Categorizes issue or PR as related to a bug. label Mar 30, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 30, 2022
@neolit123
Member

/sig windows network

@k8s-ci-robot k8s-ci-robot added sig/windows Categorizes an issue or PR as relevant to SIG Windows. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 30, 2022
@jsturtevant
Contributor

jsturtevant commented Mar 31, 2022

/triage accepted

There are more details on timing in the fix that @daschott opened: #109124.

When syncing Services on a new node joining the cluster, HNS is queried for state on every endpoint in a service, which is expensive. When iterating over thousands of services, this can take hours (!).

The fix proposed by @daschott gets the HNS state once per sync instead of each time. With this plus a fix in the Windows OS, the sync is reduced to minutes on WS 2019 and ~1 min on WS 2022.
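
The general idea of querying once per sync can be illustrated with a small sketch. This is not the code from #109124 and not kube-proxy's actual implementation (which is written in Go); it is a Python illustration where `FakeHNS`, `sync_per_endpoint` and `sync_with_snapshot` are hypothetical names, and each `list_endpoints` call stands in for one expensive HNS query.

```python
from dataclasses import dataclass


@dataclass
class FakeHNS:
    """Hypothetical stand-in for HNS that counts how often it is queried."""
    endpoints: dict          # endpoint IP -> endpoint ID
    list_calls: int = 0

    def list_endpoints(self):
        # Each call represents one expensive round trip to HNS.
        self.list_calls += 1
        return dict(self.endpoints)


def sync_per_endpoint(hns, services):
    """Before: HNS is queried again for every endpoint of every service."""
    resolved = {}
    for name, ips in services.items():
        resolved[name] = [hns.list_endpoints().get(ip) for ip in ips]
    return resolved


def sync_with_snapshot(hns, services):
    """After: HNS is queried once per sync and the snapshot is reused."""
    snapshot = hns.list_endpoints()
    return {
        name: [snapshot.get(ip) for ip in ips]
        for name, ips in services.items()
    }


if __name__ == "__main__":
    endpoints = {f"10.0.0.{i}": f"ep-{i}" for i in range(1, 5)}
    services = {
        "svc-a": ["10.0.0.1", "10.0.0.2"],
        "svc-b": ["10.0.0.3", "10.0.0.4"],
    }
    slow, fast = FakeHNS(endpoints), FakeHNS(endpoints)
    sync_per_endpoint(slow, services)
    sync_with_snapshot(fast, services)
    print("queries per endpoint:", slow.list_calls)
    print("queries with snapshot:", fast.list_calls)
```

Both functions resolve the same endpoints, but in this sketch the number of queries grows with the total endpoint count in the first version and stays at one per sync in the second.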

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 31, 2022
@zhiweiv

zhiweiv commented Jun 9, 2022

@daschott @jsturtevant
For the Windows OS fix mentioned, do you know when it will arrive in Windows Server 2019?
