Job backoff delay increases despite ignore action in pod failure policy #130881

Open
@norbertcyran

What happened?

With the following pod failure policy:

  podFailurePolicy:
    rules:
      - action: Ignore
        onPodConditions:
        - type: DisruptionTarget

When a pod is evicted, it is not counted towards the backoff limit, which is expected. However, the backoff delay still increases, so the replacement pod is created according to the exponential backoff rules. For example, if we drain a node with 3 job pods running on it, the pods are terminated and new pods are created only after 40 seconds.
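The 40-second figure is consistent with the exponential backoff sequence described in the Kubernetes Job documentation (10s, 20s, 40s, ..., capped at six minutes) if each of the 3 evicted pods is counted as a recent failure. The sketch below is only a simplified model of that sequence, not the Job controller's actual code; the names `backoff_delay`, `BASE_DELAY_S` and `MAX_DELAY_S` are hypothetical, and the base and cap values are taken from the documentation rather than from this issue.

```python
# Simplified model of the documented Job backoff sequence (10s, 20s, 40s, ...).
# Assumptions: the base delay and the cap come from the Kubernetes Job docs,
# not from this issue; the real controller logic may differ in detail.
BASE_DELAY_S = 10    # assumed delay after the first counted failure
MAX_DELAY_S = 360    # assumed cap (six minutes)


def backoff_delay(recent_failures: int) -> int:
    """Return the modelled delay in seconds before a replacement pod is created."""
    if recent_failures <= 0:
        return 0  # no counted failures, so no delay
    return min(BASE_DELAY_S * 2 ** (recent_failures - 1), MAX_DELAY_S)


# Print the modelled delay for 0 to 4 counted failures.
for failures in range(5):
    print(failures, backoff_delay(failures))
```

Under this model, the behaviour requested in this report corresponds to evictions matched by the `Ignore` rule not being counted, i.e. `backoff_delay(0)`.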

What did you expect to happen?

The configured podFailurePolicy suggests that a pod disruption is not treated as a job failure. Besides not counting towards the backoff limit, a pod disruption also shouldn't increase the backoff delay. In the example above, new pods should be created on other nodes right away.

If this is working as intended (WAI), is there an option to bypass the backoff for an expected pod disruption?

How can we reproduce it (as minimally and precisely as possible)?

Example job:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  parallelism: 10
  completions: 10
  podFailurePolicy:
    rules:
      - action: Ignore
        onPodConditions:
        - type: DisruptionTarget
  template:
    spec:
      containers:
      - name: pause
        image: gcr.io/google_containers/pause
        resources:
          requests:
            cpu: 500m
      restartPolicy: Never
  backoffLimit: 4

Apply this job and wait for the pods to be running. Evict some pods, either via kubectl drain or the eviction API. The replacement pods are created only after the backoff delay has elapsed, not immediately.
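For the eviction API route, the request for a single pod can be sketched as follows. This only builds the request path and the `policy/v1` `Eviction` body; it does not contact a cluster. The helper name `eviction_request` and the pod name `test-job-abcde` are hypothetical, and the `default` namespace is an assumption.

```python
import json


def eviction_request(pod_name: str, namespace: str = "default"):
    """Build the path and JSON body for an API-initiated eviction of one pod."""
    path = f"/api/v1/namespaces/{namespace}/pods/{pod_name}/eviction"
    body = {
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod_name, "namespace": namespace},
    }
    return path, body


# "test-job-abcde" is a made-up pod name; use a real pod of test-job instead.
path, body = eviction_request("test-job-abcde")
print("POST", path)
print(json.dumps(body, indent=2))
```

One way to send it is to POST the body as JSON to the API server, for example through a local `kubectl proxy`, and then compare the time of the eviction with the creation timestamp of the replacement pod.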

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: v1.31.6-dispatcher
Kustomize Version: v5.4.2
Server Version: v1.31.5-gke.1023000

Cloud provider

GKE

Metadata

Assignees

No one assigned

    Labels

    kind/bug: Categorizes issue or PR as related to a bug.
    lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
    needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
    wg/batch: Categorizes an issue or PR as relevant to WG Batch.
