
Cronjob's stuck job was not marked as failed and blocked new schedules #128358

Closed as not planned
@koote

Description

What happened?

I have a CronJob that runs every 5 minutes, with concurrencyPolicy set to Forbid:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-job
spec:
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 200
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 0
      activeDeadlineSeconds: 240
      ......
      spec:
        restartPolicy: Never
        activeDeadlineSeconds: 240

When I checked, I found a job that had been sitting there for 84 minutes:

$ kubectl get jobs -n zzz
NAME                COMPLETIONS     DURATION    AGE
xxxxx-job-28833075  0/1             84m         84m

However, when I tried to list its pods, I got nothing:

$ kubectl get pods --selector=job-name=xxxxx-job-28833075 -n zzz
No resources found in zzz namespace

I think the cause was that one of the sidecar containers failed to start. However, since all pods owned by this job were deleted, I could not debug it further. My guess is that because .spec.jobTemplate.spec.activeDeadlineSeconds is set, all pods owned by the job were deleted once the job had run for more than 240 seconds. But in that case, shouldn't the job be marked as failed? Why did this job actually fail while its status showed neither Complete nor Failed?
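For reference, the job's recorded conditions can be dumped directly to check whether the controller ever set a terminal condition (job name taken from the output above):

$ kubectl get job xxxxx-job-28833075 -n zzz -o jsonpath='{.status.conditions}'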

Also, this stuck job caused the next 2 scheduled runs to be missed. I don't understand this: since .spec.jobTemplate.spec.activeDeadlineSeconds is set to 4 minutes, shouldn't the controller schedule a new job once this job has been active for 4 minutes?
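For anyone investigating, the CronJob's events and its .status.active list should show why runs were skipped; as far as I understand, with Forbid the controller skips a run whenever .status.active is non-empty (cronjob name from the manifest above):

$ kubectl describe cronjob backup-job -n zzz
$ kubectl get cronjob backup-job -n zzz -o jsonpath='{.status.active}'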

What did you expect to happen?

If any pods/containers owned by a job fail to run, the job should be marked as Failed, and the failed pods/containers should be kept for debugging (since failedJobsHistoryLimit defaults to 1).
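For context, this is a sketch of the Failed condition I would expect the Job controller to set once activeDeadlineSeconds is exceeded (the expected shape, not what I actually observed):

status:
  conditions:
  - type: Failed
    status: "True"
    reason: DeadlineExceeded
    message: Job was active longer than specified deadline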

How can we reproduce it (as minimally and precisely as possible)?

I am not sure whether this is expected behavior, so I cannot provide reliable reproduction steps.

Anything else we need to know?

No response

Kubernetes version

v1.28.13-eks-a737599

Cloud provider

AWS

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Labels

kind/bug, lifecycle/rotten, needs-triage, sig/apps, triage/needs-information
