Description
What happened?
Hi everyone,
I encountered an issue when restarting one of our API servers (v1.29.10). After the restart it never becomes ready again, and its post-start hook fails with the following error:
F0223 14:49:57.253137 1 hooks.go:203] PostStartHook "start-service-ip-repair-controllers" failed: unable to perform initial IP and Port allocation check
Below is the output of its liveness endpoint (https://127.0.0.1:6443/livez):
curl -k https://127.0.0.1:6443/livez
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[-]poststarthook/start-service-ip-repair-controllers failed: reason withheld
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-system-namespaces-controller ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
[+]poststarthook/start-legacy-token-tracking-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/apiservice-openapiv3-controller ok
[+]poststarthook/apiservice-discovery-controller ok
livez check failed
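(Side note: the aggregate /livez output always prints "reason withheld" for failed checks. If I understand the health endpoints correctly, each check is also exposed as its own subpath, so querying it directly should return the underlying error, e.g.:

curl -k https://127.0.0.1:6443/livez/poststarthook/start-service-ip-repair-controllers

In my case the detailed reason also shows up in the apiserver log, as in the hooks.go line above.)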
I also see the following errors in the logs of my currently running API server instances:
E0223 14:48:58.900100 1 repair.go:85] Operation cannot be fulfilled on servicenodeportallocations: the provided resource version does not match
E0223 14:48:59.153223 1 repair.go:127] Operation cannot be fulfilled on serviceipallocations: the provided resource version does not match
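For what it's worth, these look like ordinary optimistic-concurrency conflicts: as far as I can tell, the repair loop only writes the allocation snapshot back if the stored object's resource version still matches the version it last read, and a concurrent write from another apiserver instance produces exactly this message. Below is a minimal sketch of that check, paraphrased from my reading of the allocator storage code (types and names simplified, not the actual source):

package main

import (
	"errors"
	"fmt"
)

// rangeAllocation stands in for api.RangeAllocation: the persisted
// snapshot of the service IP / node port bitmap.
type rangeAllocation struct {
	ResourceVersion string
}

// allocatorStore stands in for the etcd-backed allocator; `last` is the
// resource version of the snapshot the in-memory allocator was built from.
type allocatorStore struct {
	resource string
	last     string
}

// tryUpdate mimics the guarded write: if the stored object has moved past
// the version our snapshot was taken at, refuse to write and surface a
// conflict. The repair loop is expected to retry on its next pass.
func (s *allocatorStore) tryUpdate(existing *rangeAllocation) error {
	if existing.ResourceVersion == "" {
		return errors.New("cannot update an allocation that has never been stored")
	}
	if existing.ResourceVersion != s.last {
		return fmt.Errorf("Operation cannot be fulfilled on %s: the provided resource version does not match", s.resource)
	}
	// ...write the updated bitmap and record the new resource version...
	return nil
}

func main() {
	store := &allocatorStore{resource: "serviceipallocations", last: "41"}
	// Another apiserver instance wrote in the meantime, so our snapshot
	// at "41" is stale and the update is rejected with the error above.
	fmt.Println(store.tryUpdate(&rangeAllocation{ResourceVersion: "42"}))
}

Normally such a conflict should be transient and the next pass should succeed; what puzzles me is that the restarted instance never gets past its initial check.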
I also checked my etcd cluster, and everything is OK: there is no unusual latency or I/O wait, and read/write times are under 1 millisecond.
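(For reference, one way to sanity-check etcd here, assuming etcdctl v3 and the appropriate cert flags for your setup:

etcdctl endpoint status --cluster -w table
etcdctl check perf

The first shows per-member status and DB size; the second runs a quick latency benchmark.)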
What did you expect to happen?
I expected the API server to become ready again after a restart.
How can we reproduce it (as minimally and precisely as possible)?
I don't know how to reproduce this situation. I tried restarting an API server in my staging environment, and everything was fine.
Anything else we need to know?
I also checked the kube-apiserver code and realized that this error may relate to this part of the code:
https://github.com/kubernetes/kubernetes/blob/v1.29.10/pkg/registry/core/rest/storage_core.go#L466
or this part:
https://github.com/kubernetes/kubernetes/blob/v1.29.10/pkg/registry/core/service/allocator/storage/storage.go#L203
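My rough reading of that hook (a paraphrase with hypothetical names, not the actual implementation) is that it starts both repair loops and blocks until each loop's first pass succeeds, with a timeout, and returns the exact error above if the initial pass never completes, which in turn keeps livez red:

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// runRepairLoops stands in for the hook body: each repair loop keeps
// retrying until its first reconciliation pass succeeds, and the hook
// fails if every loop hasn't succeeded once within the timeout.
func runRepairLoops(timeout time.Duration, firstPass ...func() error) error {
	var wg sync.WaitGroup
	wg.Add(len(firstPass))
	for _, pass := range firstPass {
		go func(pass func() error) {
			defer wg.Done()
			for pass() != nil {
				time.Sleep(100 * time.Millisecond) // retry until the first clean pass
			}
		}(pass)
	}

	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()

	select {
	case <-done:
		return nil
	case <-time.After(timeout):
		// The error from my apiserver log above.
		return errors.New("unable to perform initial IP and Port allocation check")
	}
}

func main() {
	repairClusterIPs := func() error { return nil }                   // first pass succeeds
	repairNodePorts := func() error { return errors.New("conflict") } // never succeeds
	fmt.Println(runRepairLoops(2*time.Second, repairClusterIPs, repairNodePorts))
}

If that reading is right, the persistent resource-version conflicts above could be what keeps the initial pass from ever completing on the restarted instance.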
Kubernetes version
$ kubectl version
Client Version: v1.29.10
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.10
Cloud provider
OS version
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux controlplane1 5.15.0-92-generic #102-Ubuntu SMP Wed Jan 10 09:33:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux