Description
What would you like to be added?
For the VolumeBinding plugin, I would like updates to a PVC or PV in the PreBind stage to be retried on conflict.
Because there is no API rollback when an update fails, a pod with multiple PVCs can end up with only some of its PVCs updated before the pod is re-scheduled, which can leave the pod stuck in Pending forever (see Example for details).
The goal is to help pods bind successfully and to greatly reduce the cases where a pod has already been assumed in the scheduler cache but the final binding fails because of a PVC/PV update conflict.
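To make the request concrete, here is a minimal sketch of a retry-on-conflict loop. It is not the scheduler's code: `claim`, `fakeStore`, and `updateWithRetry` are hypothetical stand-ins that only mimic the API server's resourceVersion check. In a real implementation, the `retry.RetryOnConflict` helper from `k8s.io/client-go/util/retry` would likely play the role of the loop.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's 409 Conflict response.
var errConflict = errors.New("conflict: object has been modified")

// claim is a hypothetical, stripped-down stand-in for a PVC.
type claim struct {
	name            string
	resourceVersion int
	selectedNode    string
}

// fakeStore mimics optimistic concurrency: an update is rejected when the
// caller's resourceVersion differs from the stored one.
type fakeStore struct {
	claims map[string]claim
}

func (s *fakeStore) get(name string) claim { return s.claims[name] }

func (s *fakeStore) update(c claim) error {
	cur := s.claims[c.name]
	if c.resourceVersion != cur.resourceVersion {
		return errConflict
	}
	c.resourceVersion++
	s.claims[c.name] = c
	return nil
}

// updateWithRetry applies mutate and writes the object. On a conflict it
// re-reads the latest object and tries again, up to maxAttempts times.
// It returns the number of attempts made and the final error.
func updateWithRetry(s *fakeStore, stale claim, maxAttempts int, mutate func(*claim)) (int, error) {
	c := stale
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		mutate(&c)
		err := s.update(c)
		if err == nil {
			return attempt, nil
		}
		if !errors.Is(err, errConflict) {
			return attempt, err
		}
		c = s.get(c.name) // refresh the stale copy before retrying
	}
	return maxAttempts, errConflict
}

func main() {
	// The stored pvcA2 is at version 2, but the cached copy is at version 1.
	s := &fakeStore{claims: map[string]claim{
		"pvcA2": {name: "pvcA2", resourceVersion: 2},
	}}
	stale := claim{name: "pvcA2", resourceVersion: 1}

	attempts, err := updateWithRetry(s, stale, 3, func(c *claim) {
		c.selectedNode = "nodeA"
	})
	fmt.Println(attempts, err, s.get("pvcA2").selectedNode)
}
```

With a single attempt the stale write is simply rejected, which corresponds to the current behavior; allowing a refresh and a second attempt lets the update go through without sending the pod back through scheduling.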
@Huang-Wei @cofyc @kerthcet PTAL, thanks.
/sig scheduling
Why is this needed?
Currently, once the scheduler's copy of a PVC or PV is out of date, the update fails, the pod fails scheduling in the PreBind stage for this round, and it backs off to be re-scheduled.
In most situations the current implementation works fine. But when a pod has multiple PVCs, it can leave the pod stuck in the Pending state forever in some cases.
Example
There are only:
- two nodes: nodeA & nodeB,
- two pods: podA & podB, podA with pvcA1 & pvcA2, podB with pvcB1 & pvcB2.
We want those two pods to be deployed on different nodes with TopologySpreadConstraints.
Then:
- Scheduling podA: it is assumed onto nodeA in the cache, but fails in the VolumeBinding plugin's PreBind stage. The annotations on pvcA1 are updated successfully, while updating pvcA2 fails with a conflict.
- Scheduling podB: it is successfully bound to nodeA. Meanwhile, pvcA1 is bound.
- Re-scheduling podA: TopologySpreadConstraints wants podA on nodeB, but the node affinity of the PV for pvcA1 wants podA on nodeA.
Finally, podA is stuck in Pending forever, unless we delete pvcA1.
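The dead end in the re-scheduling step can be sketched as an empty intersection of two filters. This is a deliberately simplified model, not scheduler code: `feasibleNodes` is a hypothetical helper, the spread constraint is reduced to "at most one pod per node", and the PV node affinity is reduced to a single required node name.

```go
package main

import "fmt"

// feasibleNodes returns the nodes that pass both simplified filters:
//   - spread: the node must not already host a pod;
//   - PV affinity: if pvAffinityNode is set, the node must match it.
func feasibleNodes(nodes []string, occupied map[string]bool, pvAffinityNode string) []string {
	var out []string
	for _, n := range nodes {
		if occupied[n] {
			continue // rejected by the spread constraint
		}
		if pvAffinityNode != "" && n != pvAffinityNode {
			continue // rejected by the PV node affinity
		}
		out = append(out, n)
	}
	return out
}

func main() {
	nodes := []string{"nodeA", "nodeB"}
	occupied := map[string]bool{"nodeA": true} // podB is bound to nodeA

	// pvcA1 is bound to a PV pinned to nodeA: no node satisfies both filters.
	fmt.Println(feasibleNodes(nodes, occupied, "nodeA"))

	// After deleting pvcA1 the affinity disappears and nodeB becomes feasible.
	fmt.Println(feasibleNodes(nodes, occupied, ""))
}
```

The first call yields no candidates, which matches the pod staying in Pending; the second shows why deleting pvcA1 unblocks it.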