Report actionable error when GC fails due to disk pressure #132578

drigz · 2025-06-27T14:11:01Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

When the image GC fails to free enough space due to disk pressure, the user-visible event it generates is misleading (see #71869) - the technical detail leads operators to suspect GC problems. This PR makes the message actionable by focusing on the disk pressure.

I had hoped to include the total disk space consumed by images that cannot be collected, to help operators quickly determine if disk pressure is caused by a large number of active images or by other files on the node, but this is blocked because the CRI API doesn't reveal the total disk use (depending on the runtime it has either the compressed or uncompressed space: #120698).

Which issue(s) this PR is related to:

Fixes #71869

Special notes for your reviewer:

See individual commits for a breakdown into:

add test for user-visible message
rephrase message to focus on disk usage (saves user from running df -h)

Regarding formatSize(): I considered using resource.NewQuantity but it wasn't clear how to round the size appropriately, and it can't generate non-whole numbers like 12.3 G.

Does this PR introduce a user-facing change?

When image garbage collection is unable to free enough disk space, the FreeDiskSpaceFailed warning event is now more actionable. Example: `Insufficient free disk space on the node's image filesystem (95.0% of 10.0 GiB used). Failed to free sufficient space by deleting unused images. Consider resizing the disk or deleting unused files.`

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None - although I considered collecting the suggested fixes from #71869 (kill processes that are holding deleted files open, delete orphaned volumes, etc) I decided this would easily become crufty with irrelevant steps.

linux-foundation-easycla · 2025-06-27T14:11:06Z

The committers listed above are authorized under a signed CLA.

✅ login: drigz / name: Rodrigo Queiro (d340c55, 466c985, bf23d26, 2995f0f, f3a0fca, a1e19a0, 8d14da6)

k8s-ci-robot · 2025-06-27T14:11:10Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-06-27T14:11:11Z

Hi @drigz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-06-27T14:11:17Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: drigz
Once this PR has been reviewed and has the lgtm label, please assign sergeykanzhelev for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/kubelet/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

drigz · 2025-06-27T14:14:26Z

@haircommander I'm still figuring out how to manually test my kubelet change in a real cluster (just ran unit tests so far) but in the meantime, can you /ok-to-test this? Thanks in advance!

drigz · 2025-06-28T15:23:52Z

Looks like this exposes a bug where an error message is used as format string: (%!o(MISSING))

Warning  FreeDiskSpaceFailed      24s    kubelet          Insufficient free disk space on the node's image filesystem (53.5% of 976.2 GiB used). Failed to free sufficient space by deleting unused images (429.0 MiB used for active images). Consider resizing the disk or deleting unused files.
  Warning  ImageGCFailed            24s    kubelet          Insufficient free disk space on the node's image filesystem (53.5%!o(MISSING)f 976.2 GiB used). Failed to free sufficient space by deleting unused images (429.0 MiB used for active images). Consider resizing the disk or deleting unused files.

Should be easy to fix by adding "%s", but is it WAI that both FreeDiskSpaceFailed and ImageGCFailed are generated in this case?

I'm also unsure if the XX.X GB used for active images is adding value here since for containerd it's only the compressed size (containerd/containerd#9261) and for other runtimes it's uncompressed size, but in both cases it's an underestimate, and the CRI API doesn't let us report the true size:

root@kind-control-plane:/# du -sh /var/lib/containerd/* | sort -h
[snip: small directories]
1.6M    /var/lib/containerd/io.containerd.metadata.v1.bolt
421M    /var/lib/containerd/io.containerd.content.v1.content
702M    /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs

That said, if it's less than 0.1 * disk size it's still a pointer that "too many/big images" is not the cause, whereas if it's greater than 0.3 * disk size it might suggest "too many/big images" is the cause. But not sure there's a good way to suggest that in the error message...

kannon92 · 2025-06-28T15:34:06Z

pkg/kubelet/kubelet.go

@@ -1561,7 +1561,7 @@ func (kl *Kubelet) StartGarbageCollection() {
 			if prevImageGCFailed {
 				klog.ErrorS(err, "Image garbage collection failed multiple times in a row")
 				// Only create an event for repeated failures
-				kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, err.Error())
+				kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, "%s", err.Error())


Why was this change added? Was the error not reported?

The ImageGCFailed event had the following format problem in its message: 53.5%!o(MISSING)f 976.2 GiB used because the error contained a % character and was incorrectly used as a format string.

kannon92 · 2025-06-28T15:43:52Z

/ok-to-test

Please sign the CLA.

Two of your commits have no author on it which may need to be fixed before merge.

This reverts commit f3a0fca, because different container runtimes report either compressed or uncompressed size but not total disk usage, so this is misleading as to the total disk space used for container images.

drigz · 2025-06-28T17:05:55Z

/test pull-kubernetes-linter-hints

drigz · 2025-06-28T17:09:17Z

Two of your commits have no author on it which may need to be fixed before merge.

Thanks, sorry for the noise, rookie error!

kannon92 · 2025-06-28T17:34:09Z

pkg/kubelet/images/image_gc_manager.go

+			// be due to an unusually large number or size of in-use container images.
+			message := fmt.Sprintf("Insufficient free disk space on the node's image filesystem (%.1f%% of %s used). "+
+				"Failed to free sufficient space by deleting unused images. "+
+				"Consider resizing the disk or deleting unused files.",


I don't really think we should mention "resizing" the disk.

This is not that helpful to an end-user as in many cases it is not easy to "resize" a disk.

For fedora/fcos, I don't think it is possible to resize /var on an existing node.

If you are on a cloud instance, you could provision a new disk and remount but that is very different.

Anyway, maybe you could just suggest taking a look at other resources on this node as you mention in your comment.

Someone called it out on the original issue (comment):

I changed Node disk size from 10 GiB to 20GiB and error didn't appeard

which is why I added it here. However, it's implicit when you see an error like 95% of 10GB used that you could just bump the disk size, so perhaps it's not needed.

you could just suggest taking a look at other resources on this node

I'm assuming you mean the code comment here ("This usually means the disk is full for reasons other than container images, such as logs, volumes, or other files. However, it could also be due to an unusually large number or size of in-use container images."). I've changed to:

Investigate disk usage, as it could be used by active images, logs, volumes, or other data.

This strikes me as a necessary first step, and immediate mitigation is probably not the operator's priority because this is just a "warning".

Thanks! Lgtm.

drigz added 2 commits June 26, 2025 19:52

test: verify FreeDiskSpaceFailed event when GC fails

466c985

fix: Add disk usage to image GC failure error message

8d14da6

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 27, 2025

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 27, 2025

k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 27, 2025

k8s-ci-robot requested review from kannon92 and matthyx June 27, 2025 14:11

k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 27, 2025

github-project-automation bot added this to SIG Node: code and documentation PRs Jun 27, 2025

github-project-automation bot moved this to Triage in SIG Node: code and documentation PRs Jun 27, 2025

fix: Add total image size to image GC failure message

f3a0fca

drigz force-pushed the actionable-gc-error branch from 3680aa4 to f3a0fca Compare June 27, 2025 14:12

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 27, 2025

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 28, 2025

kannon92 reviewed Jun 28, 2025

View reviewed changes

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 28, 2025

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 28, 2025

drigz added 2 commits June 28, 2025 18:47

fix: use error as string, not format string

d340c55

Revert "fix: Add total image size to image GC failure message"

bf23d26

This reverts commit f3a0fca, because different container runtimes report either compressed or uncompressed size but not total disk usage, so this is misleading as to the total disk space used for container images.

drigz force-pushed the actionable-gc-error branch from 63cac12 to bf23d26 Compare June 28, 2025 16:47

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 28, 2025

fix: lint error

2995f0f

drigz requested a review from kannon92 June 28, 2025 17:09

kannon92 reviewed Jun 28, 2025

View reviewed changes

fix: avoid suggesting resizing the disk

a1e19a0

drigz changed the title ~~WIP Report actionable error when GC fails due to disk pressure~~ Report actionable error when GC fails due to disk pressure Jun 29, 2025

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 29, 2025

drigz requested a review from kannon92 June 29, 2025 11:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Report actionable error when GC fails due to disk pressure #132578

Report actionable error when GC fails due to disk pressure #132578

drigz commented Jun 27, 2025 •

edited

Loading

Uh oh!

linux-foundation-easycla bot commented Jun 27, 2025 •

edited

Loading

Uh oh!

k8s-ci-robot commented Jun 27, 2025

Uh oh!

k8s-ci-robot commented Jun 27, 2025

Uh oh!

k8s-ci-robot commented Jun 27, 2025

Uh oh!

drigz commented Jun 27, 2025

Uh oh!

drigz commented Jun 28, 2025

Uh oh!

kannon92 Jun 28, 2025

Uh oh!

drigz Jun 28, 2025

Uh oh!

kannon92 commented Jun 28, 2025

Uh oh!

drigz commented Jun 28, 2025

Uh oh!

drigz commented Jun 28, 2025

Uh oh!

kannon92 Jun 28, 2025

Uh oh!

drigz Jun 29, 2025

Uh oh!

kannon92 Jun 29, 2025

Uh oh!

Uh oh!

Report actionable error when GC fails due to disk pressure #132578

Are you sure you want to change the base?

Report actionable error when GC fails due to disk pressure #132578

Conversation

drigz commented Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Uh oh!

linux-foundation-easycla bot commented Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Jun 27, 2025

Uh oh!

k8s-ci-robot commented Jun 27, 2025

Uh oh!

k8s-ci-robot commented Jun 27, 2025

Uh oh!

drigz commented Jun 27, 2025

Uh oh!

drigz commented Jun 28, 2025

Uh oh!

kannon92 Jun 28, 2025

Choose a reason for hiding this comment

Uh oh!

drigz Jun 28, 2025

Choose a reason for hiding this comment

Uh oh!

kannon92 commented Jun 28, 2025

Uh oh!

drigz commented Jun 28, 2025

Uh oh!

drigz commented Jun 28, 2025

Uh oh!

kannon92 Jun 28, 2025

Choose a reason for hiding this comment

Uh oh!

drigz Jun 29, 2025

Choose a reason for hiding this comment

Uh oh!

kannon92 Jun 29, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

drigz commented Jun 27, 2025 •

edited

Loading

linux-foundation-easycla bot commented Jun 27, 2025 •

edited

Loading