What happened?
The volume manager uses the NestedPendingOperations framework to run attach/mount/detach/unmount operations.
If a pod uses a PV and its VerifyControllerAttachedVolume operation keeps failing, expBackoff.durationBeforeRetry grows exponentially until it reaches maxDurationBeforeRetry.
When the user then deletes the pod, the operation is not removed from the operations queue.
Some time later (say 3 hours), the user creates another pod that uses the same PV and is also scheduled to the same node. NestedPendingOperations reuses the previous operation, and expBackoff is not reset to its initial value (500ms), so after the first failure the operation is not retried until maxDurationBeforeRetry has passed. The relevant code is nestedPendingOperations.Run:
func (grm *nestedPendingOperations) Run(
    volumeName v1.UniqueVolumeName,
    podName volumetypes.UniquePodName,
    nodeName types.NodeName,
    generatedOperations volumetypes.GeneratedOperations) error {
    grm.lock.Lock()
    defer grm.lock.Unlock()

    opKey := operationKey{volumeName, podName, nodeName}

    opExists, previousOpIndex := grm.isOperationExists(opKey)
    if opExists {
        previousOp := grm.operations[previousOpIndex]
        // Operation already exists
        if previousOp.operationPending {
            // Operation is pending
            return NewAlreadyExistsError(opKey)
        }

        backOffErr := previousOp.expBackoff.SafeToRetry(fmt.Sprintf("%+v", opKey))
        if backOffErr != nil {
            if previousOp.operationName == generatedOperations.OperationName {
                return backOffErr
            }
            // previous operation and new operation are different. reset op. name and exp. backoff
            grm.operations[previousOpIndex].operationName = generatedOperations.OperationName
            grm.operations[previousOpIndex].expBackoff = exponentialbackoff.ExponentialBackoff{}
        }

        // Update existing operation to mark as pending.
        grm.operations[previousOpIndex].operationPending = true
        grm.operations[previousOpIndex].key = opKey
    } else {
        // Create a new operation
        grm.operations = append(grm.operations,
            operation{
                key:              opKey,
                operationPending: true,
                operationName:    generatedOperations.OperationName,
                expBackoff:       exponentialbackoff.ExponentialBackoff{},
            })
    }

    go func() (eventErr, detailedErr error) {
        // Handle unhandled panics (very unlikely)
        defer k8sRuntime.HandleCrash()
        // Handle completion of and error, if any, from operationFunc()
        defer grm.operationComplete(opKey, &detailedErr)
        return generatedOperations.Run()
    }()

    return nil
}
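Note that within Run, expBackoff is only reset (to a fresh ExponentialBackoff{}) when the previous operation's name differs from the new one; when the same operation is re-submitted for the same key, the accumulated backoff state is reused as-is. The self-contained sketch below models that bookkeeping with a simplified backoff type (not the real exponentialbackoff.ExponentialBackoff; the 2-minute cap and the field/method names are illustrative assumptions, while the 500ms initial value is the one mentioned above):

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative constants; only the 500ms initial value comes from the report,
// the cap is an assumption for the sketch.
const (
	initialDurationBeforeRetry = 500 * time.Millisecond
	maxDurationBeforeRetry     = 2 * time.Minute // assumed cap
)

// backoff is a simplified stand-in for the per-operation backoff state.
type backoff struct {
	durationBeforeRetry time.Duration
}

// update mirrors the "double the wait on every failure, up to the cap"
// behavior applied each time VerifyControllerAttachedVolume fails.
func (b *backoff) update() {
	if b.durationBeforeRetry == 0 {
		b.durationBeforeRetry = initialDurationBeforeRetry
		return
	}
	b.durationBeforeRetry *= 2
	if b.durationBeforeRetry > maxDurationBeforeRetry {
		b.durationBeforeRetry = maxDurationBeforeRetry
	}
}

func main() {
	b := &backoff{}
	for i := 1; i <= 10; i++ {
		b.update()
		fmt.Printf("failure %2d -> wait %v before next retry\n", i, b.durationBeforeRetry)
	}
	// The operation entry (and this backoff state) stays around after the
	// first pod is deleted, so a later pod that maps to the same key starts
	// here, at the cap, instead of back at 500ms.
	fmt.Printf("state inherited by the next pod: %v\n", b.durationBeforeRetry)
}
```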
What did you expect to happen?
A new pod using the same PV should start its backoff from initialDurationBeforeRetry instead of inheriting the previous pod's accumulated backoff.
Alternatively, we need a consistent principle for when the backoff is reset.
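One possible direction, sketched against the simplified model above rather than against the real nestedPendingOperations code: treat backoff state whose most recent failure is older than the maximum backoff window as stale, and start again from the initial duration. The maybeReset helper and the 2-minute threshold below are assumptions for illustration, not an existing API:

```go
package main

import (
	"fmt"
	"time"
)

// Assumed values, matching the simplified model above.
const (
	initialDurationBeforeRetry = 500 * time.Millisecond
	maxDurationBeforeRetry     = 2 * time.Minute // assumed cap
)

// backoff is the same simplified model, extended with the last failure time.
type backoff struct {
	lastErrorTime       time.Time
	durationBeforeRetry time.Duration
}

// maybeReset is a hypothetical reset rule: if the most recent failure for this
// key is older than the maximum backoff window, drop the accumulated state so
// the next pod using the same PV/node starts at initialDurationBeforeRetry.
func (b *backoff) maybeReset(now time.Time) {
	if !b.lastErrorTime.IsZero() && now.Sub(b.lastErrorTime) > maxDurationBeforeRetry {
		b.durationBeforeRetry = 0
		b.lastErrorTime = time.Time{}
	}
}

func main() {
	// State left behind by the deleted pod: at the cap, last failure 3h ago.
	b := &backoff{
		lastErrorTime:       time.Now().Add(-3 * time.Hour),
		durationBeforeRetry: maxDurationBeforeRetry,
	}
	b.maybeReset(time.Now())
	// Prints 0s: the next failure would start the ladder again at 500ms.
	fmt.Printf("duration after reset check: %v\n", b.durationBeforeRetry)
}
```

Whether such a reset belongs in Run (next to the SafeToRetry check), in operationComplete, or in the pod-deletion path is a design decision for the maintainers; the sketch only illustrates the principle asked for above.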
How can we reproduce it (as minimally and precisely as possible)?
- Create a pod that uses a PV and is scheduled to a node.
- Let the attach (VerifyControllerAttachedVolume) operation fail several times, then delete the pod.
- After about 5 minutes, create another pod with the same PV scheduled to the same node.
- The first VerifyControllerAttachedVolume operation fails, and the retry then waits for the inherited maximum backoff instead of 500ms.
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
# paste output here
Cloud provider
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here