A mechanism to fail a Pod that is stuck due to an invalid Image #122300

Open
@alculquicondor

What would you like to be added?

A mechanism to set a Pod to phase=Failed when the image pull has failed a number of times (perhaps configurable).
Currently, the Pod just stays in phase=Pending.
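
To make the request concrete, here is a minimal sketch of the kind of check such a mechanism might perform. It is not an existing Kubernetes API. It assumes the Pod is available as a dict shaped like the output of `kubectl get pod -o json`, and the names `image_pull_blocked`, `should_fail_pod`, `pull_failures`, and `max_pull_failures` are hypothetical. The Pod status does not carry a pull-attempt counter, so the caller is assumed to supply one (for example, from Events).

```python
# Waiting reasons the kubelet reports when a container image cannot be used.
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}


def image_pull_blocked(pod: dict) -> bool:
    """Return True if the Pod is Pending and a container is waiting on its image."""
    status = pod.get("status") or {}
    if status.get("phase") != "Pending":
        return False
    statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    for container_status in statuses:
        waiting = (container_status.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in IMAGE_PULL_REASONS:
            return True
    return False


def should_fail_pod(pod: dict, pull_failures: int, max_pull_failures: int = 5) -> bool:
    """Hypothetical policy: fail the Pod once image pulls have failed too often.

    pull_failures is supplied by the caller; max_pull_failures stands in for
    the "perhaps configurable" threshold and its default of 5 is arbitrary.
    """
    return image_pull_blocked(pod) and pull_failures >= max_pull_failures


example_pod = {
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {"name": "main", "state": {"waiting": {"reason": "ImagePullBackOff"}}}
        ],
    }
}
print(should_fail_pod(example_pod, pull_failures=6))  # True
```

Where the threshold would live (Pod spec, kubelet configuration, or elsewhere) is left open here, as it is in the request.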

Why is this needed?

This is especially problematic for Jobs submitted through a queueing system.

In a queued environment, the Job might start running (that is, its Pods are created) hours or even days after the Job is created. The user who submitted the Job might then not notice their mistake until it is too late. Because these Pods hold resources in the cluster, they can prevent other pending Jobs from starting.

If the Pods stay in phase=Pending, the job controller cannot do anything about them, because it only performs "failure handling" once the Pods actually terminate with phase=Failed.
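
The gap can be illustrated with a deliberately simplified model of failure counting. This is a sketch, not the job controller's actual code: the function names are hypothetical, Pods are plain dicts with a `status.phase` field, and the only rule modeled is that a Job is considered failed once the number of Failed Pods exceeds `backoffLimit`.

```python
def failed_pod_count(pods: list) -> int:
    """Count Pods that have terminated with phase=Failed."""
    return sum(1 for pod in pods if (pod.get("status") or {}).get("phase") == "Failed")


def job_exceeded_backoff(pods: list, backoff_limit: int) -> bool:
    """Simplified rule: the Job fails once failures exceed backoff_limit."""
    return failed_pod_count(pods) > backoff_limit


# Pods stuck on an invalid image stay Pending, so they are never counted.
stuck_pods = [{"status": {"phase": "Pending"}} for _ in range(3)]
print(job_exceeded_backoff(stuck_pods, backoff_limit=0))  # False

# The same Pods, if they were moved to Failed, would trip the limit.
failed_pods = [{"status": {"phase": "Failed"}} for _ in range(3)]
print(job_exceeded_backoff(failed_pods, backoff_limit=0))  # True
```

In this model, no number of Pending Pods ever triggers failure handling, which is the behavior described above.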

Metadata

    Labels

    kind/feature: Categorizes issue or PR as related to a new feature.
    needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
    priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
    sig/node: Categorizes an issue or PR as relevant to SIG Node.
    wg/batch: Categorizes an issue or PR as relevant to WG Batch.
