PostStartHook "start-service-ip-repair-controllers" failed: unable to perform initial IP and Port allocation check #130377

Open
@hawwwdi

What happened?

Hi everyone,
I encountered an issue when restarting one of our API servers (v1.29.10). When I restart it, it never comes back up and remains unready. Its post-start hook fails with the following error:
F0223 14:49:57.253137 1 hooks.go:203] PostStartHook "start-service-ip-repair-controllers" failed: unable to perform initial IP and Port allocation check
Below is the output of its liveness endpoint (https://127.0.0.1:6443/livez):

curl -k https://127.0.0.1:6443/livez
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[-]poststarthook/start-service-ip-repair-controllers failed: reason withheld
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-system-namespaces-controller ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
[+]poststarthook/start-legacy-token-tracking-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/apiservice-openapiv3-controller ok
[+]poststarthook/apiservice-discovery-controller ok

livez check failed
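When the check list is this long, it can help to filter the output down to the failing entries. The snippet below is only a small sketch: `failed_livez_checks` is a hypothetical helper name, and it assumes the `[+]name ok` / `[-]name failed: ...` line format shown in the output above.

```shell
# Print the names of failing checks from /livez output read on stdin.
# Assumes each failing line looks like: "[-]<check-name> failed: <reason>".
failed_livez_checks() {
  sed -n 's/^\[-\]\([^ ]*\) failed.*/\1/p'
}

# Example usage against a local API server (adjust host/port as needed):
#   curl -sk 'https://127.0.0.1:6443/livez?verbose' | failed_livez_checks
```

With the output above, this would print only `poststarthook/start-service-ip-repair-controllers`.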

I also encounter the following errors on my currently running API server instances:

E0223 14:48:58.900100       1 repair.go:85] Operation cannot be fulfilled on servicenodeportallocations: the provided resource version does not match
E0223 14:48:59.153223       1 repair.go:127] Operation cannot be fulfilled on serviceipallocations: the provided resource version does not match
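
To get a rough idea of how often these conflicts occur and which allocation resource they hit, the log lines can be counted per resource. This is a sketch under assumptions: `count_repair_conflicts` is a hypothetical helper, it only matches the message format of the two lines above, and the pod name in the usage comment is a guess based on the hostname.

```shell
# Count "Operation cannot be fulfilled" errors from repair.go per resource.
# Reads klog-formatted lines on stdin; prints "<resource> <count>" sorted by name.
count_repair_conflicts() {
  grep 'repair\.go' \
    | grep -o 'Operation cannot be fulfilled on [a-z]*' \
    | awk '{ count[$NF]++ } END { for (r in count) print r, count[r] }' \
    | sort
}

# Example usage (pod name is an assumption, adjust to your cluster):
#   kubectl -n kube-system logs kube-apiserver-controlplane1 | count_repair_conflicts
```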

I also checked my etcd cluster, and everything looks fine: there is no latency or I/O wait, and read/write times are under 1 millisecond.

What did you expect to happen?

I expect the API server to work correctly after a restart.

How can we reproduce it (as minimally and precisely as possible)?

I don't know how to reproduce this situation. I tried it in my staging environment, and everything was fine.

Anything else we need to know?

I also checked the kube-apiserver code, and I think this error may be related to this part of the code:
https://github.com/kubernetes/kubernetes/blob/v1.29.10/pkg/registry/core/rest/storage_core.go#L466
or this part:
https://github.com/kubernetes/kubernetes/blob/v1.29.10/pkg/registry/core/service/allocator/storage/storage.go#L203

Kubernetes version

$ kubectl version
Client Version: v1.29.10
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.10

Cloud provider

Self-hosted bare metal, deployed with kubespray.

OS version

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux controlplane1 5.15.0-92-generic #102-Ubuntu SMP Wed Jan 10 09:33:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Install tools

kubespray

Container runtime (CRI) and version (if applicable)

containerd v1.7.22

Related plugins (CNI, CSI, ...) and versions (if applicable)

calico


Labels

kind/bug, lifecycle/stale, needs-sig, needs-triage
