
Job with no parallelism randomly creates 2 duplicate Pods instead of 1 #130683

Open
@istepaniuk

Description


What happened?

A Job is created with completions: 1, parallelism: 1. However, two pods appear a few minutes apart, both with identical ownerReferences (name, uid, etc. all point to the same unique Job).
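
For reference, this is roughly how I compared the two Pods' owner UIDs (a sketch; the namespace and Job name are the same redacted placeholders as in the logs below, and it relies on the job-name label the Job controller adds to its Pods):

$ kubectl -n the-namesapce get pods -l job-name=the-job-name-1741199325 \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.ownerReferences[0].uid}{"\n"}{end}'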

I don't understand what I see in the kube-controller-manager logs. When the first pod is scheduled, I see:

I0305 18:28:46.167341       1 job_controller.go:566] "enqueueing job" logger="job-controller" key="the-namesapce/the-job-name-1741199325"

That same "enqueueing job" log line repeats 9 times, most within the same second and some a bit later:

I0305 18:28:46.167341 ...
I0305 18:28:46.183597 ...
I0305 18:28:46.192648 ...
I0305 18:28:46.195377 ...
I0305 18:28:46.233094 ...
I0305 18:28:48.915103 ...
I0305 18:29:24.315840 ...
I0305 18:29:25.328100 ...
I0305 18:29:26.339424 ...

A few minutes later, while the first pod is already running, a second one appears. At this exact time the logs show similar messages:

I0305 18:31:46.236414       1 job_controller.go:566] "enqueueing job" logger="job-controller" key="the-namesapce/the-job-name-1741199325"
I0305 18:31:47.613379       1 job_controller.go:566] "enqueueing job" logger="job-controller" key="the-namesapce/the-job-name-1741199325"
E0305 18:31:50.044308       1 job_controller.go:599] syncing job: tracking status: adding uncounted pods to status: Operation cannot be fulfilled on jobs.batch "the-job-name-1741199325": the object has been modified; please apply your changes to the latest version and try again
...
I0305 18:31:51.068789       1 job_controller.go:566] "enqueueing job" logger="job-controller" .... (repeats again 7 times rapidly)

The only clue I have is this "the object has been modified" error, but it does not make any sense to me. The Job object was created with a single "kubectl create -f job.yaml"; there is nothing fancy in it. What could be going on?
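
If more detail would help with triage, these are the kinds of commands I can run against the affected Job (a sketch using the redacted placeholder names; --show-managed-fields only reveals which clients have written to the object):

$ kubectl -n the-namesapce get job the-job-name-1741199325 -o yaml --show-managed-fields
# managedFields lists the field managers (e.g. kubectl-create, kube-controller-manager) that wrote to the Job
$ kubectl -n the-namesapce get events --field-selector involvedObject.name=the-job-name-1741199325
# should show whether the job-controller recorded two SuccessfulCreate events, one per Pod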

What did you expect to happen?

Only one Pod should be scheduled.

How can we reproduce it (as minimally and precisely as possible)?

Unfortunately, this seems to happen randomly (once in hundreds of Jobs). I need help understanding what causes it; if I do, I can try to reproduce it.
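
The closest thing I have to a reproducer so far is a brute-force loop that creates many minimal Jobs like the one above and then flags any Job that ends up owning more than one Pod (a sketch; the dup-test names and the busybox image are placeholders, not what we actually run):

# create a few hundred Jobs with completions: 1, parallelism: 1
for i in $(seq 1 500); do
  kubectl create -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: dup-test-$i
spec:
  completions: 1
  parallelism: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["/bin/true"]
EOF
done

# after they have run, count the Pods each Job owns via the job-name label
for i in $(seq 1 500); do
  n=$(kubectl get pods -l job-name="dup-test-$i" --no-headers 2>/dev/null | wc -l)
  if [ "$n" -gt 1 ]; then echo "dup-test-$i owns $n pods"; fi
done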

Anything else we need to know?

Something similar seems to have been reported for a very old version in #120790, but it's difficult to tell whether it's the same issue.

Kubernetes version

$ kubectl version
Client Version: v1.30.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.4

Cloud provider

On-prem 20-node cluster deployed with Kubespray.

OS version

$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

$ uname -a
Linux [redacted host name] 6.1.0-28-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.119-1 (2024-11-22) x86_64 GNU/Linux

Install tools

Kubespray

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Metadata

Assignees

No one assigned

    Labels

    kind/bug: Categorizes issue or PR as related to a bug.
    kind/support: Categorizes issue or PR as a support question.
    lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
    needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
    sig/apps: Categorizes an issue or PR as relevant to SIG Apps.

    Type

    No type

    Projects

    Status

    Needs Triage

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
