Description
What would you like to be added?
When a non-graceful node shutdown occurs, kube-controller-manager updates the taints on the node and sets the node.kubernetes.io/unreachable taint.
When the pod's tolerationSeconds expires, the controller manager evicts the pod from the node. I set tolerationSeconds: 60 for my pod and it gets evicted in time, but it cannot start on another node because I use RWO storage and the volume must be detached from the failed node first. So the controller manager tries to detach the volume, and after 6 minutes it logs:
I0124 12:52:02.752565 1 reconciler.go:279] "attacherDetacher.DetachVolume started: this volume is not safe to detach, but maxWaitForUnmountDuration expired, force detaching" logger="persistentvolume-attach-detach-controller" duration="6m0s" node="prod-tages-k8s-worker-2" volumeName="kubernetes.io/csi/linstor.csi.linbit.com^pvc-fb36d4b7-d16d-4e6b-a82f-70cdf5dbf7d0"
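For completeness, the toleration I set on the pod looks roughly like this (sketched here with the k8s.io/api/core/v1 Go types rather than YAML; I am assuming it targets the NoExecute unreachable taint, since that is what drives the eviction):

```go
package example

import corev1 "k8s.io/api/core/v1"

// Hypothetical reconstruction of the pod's toleration: tolerate the NoExecute
// unreachable taint for 60 seconds, after which the eviction kicks in.
var tolerationSeconds int64 = 60

var unreachableToleration = corev1.Toleration{
	Key:               "node.kubernetes.io/unreachable",
	Operator:          corev1.TolerationOpExists,
	Effect:            corev1.TaintEffectNoExecute,
	TolerationSeconds: &tolerationSeconds,
}
```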
I know that the NodeOutOfServiceVolumeDetach feature gate has been GA since Kubernetes 1.28.
With that logic I have to taint the node with node.kubernetes.io/out-of-service=nodeshutdown:NoExecute to get the volume detached from the failed node and attached on the new node.
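Applying that taint programmatically would look roughly like this (a minimal sketch with client-go; markNodeOutOfService is just an illustrative helper name — manually it is the same as running kubectl taint nodes <node> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute):

```go
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markNodeOutOfService applies the out-of-service taint, which lets the
// attach/detach controller force-detach volumes from the failed node.
func markNodeOutOfService(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("get node %s: %w", nodeName, err)
	}
	taint := corev1.Taint{
		Key:    "node.kubernetes.io/out-of-service",
		Value:  "nodeshutdown",
		Effect: corev1.TaintEffectNoExecute,
	}
	// Skip the update if the taint is already present.
	for _, t := range node.Spec.Taints {
		if t.Key == taint.Key && t.Effect == taint.Effect {
			return nil
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, taint)
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```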
As far as I can see, maxWaitForUnmountDuration is hardcoded to 6 minutes in the code.
Why is this needed?
I think this is a common case.
My goal is fast pod migration in Kubernetes; I can't run this pod in multiple replicas.
But to achieve that I would need to write a special controller that taints nodes on failure: it would simply mark a node out of service some time after it becomes Unreachable (a sketch of what that would look like follows).
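A minimal sketch of such a controller, reusing markNodeOutOfService from the earlier snippet (the grace period and polling interval are arbitrary values picked for illustration):

```go
package example

import (
	"context"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// outOfServiceGracePeriod is a hypothetical knob: how long a node may stay
// unreachable before we declare it out of service.
const outOfServiceGracePeriod = 2 * time.Minute

// runOutOfServiceTainter polls nodes and taints any node whose Ready
// condition has been Unknown (i.e. unreachable) for longer than the grace
// period.
func runOutOfServiceTainter(ctx context.Context, client kubernetes.Interface) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
		nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err != nil {
			log.Printf("list nodes: %v", err)
			continue
		}
		for i := range nodes.Items {
			node := &nodes.Items[i]
			for _, cond := range node.Status.Conditions {
				if cond.Type != corev1.NodeReady || cond.Status != corev1.ConditionUnknown {
					continue
				}
				// LastTransitionTime tells us how long the node has been unreachable.
				if time.Since(cond.LastTransitionTime.Time) < outOfServiceGracePeriod {
					continue
				}
				if err := markNodeOutOfService(ctx, client, node.Name); err != nil {
					log.Printf("taint node %s: %v", node.Name, err)
				}
			}
		}
	}
}
```

In practice such a controller would also have to verify that the node is really fenced (for example powered off) before applying the taint, because force-detaching a volume from a node that is still running can corrupt data.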