Skip to content

Report actionable error when GC fails due to disk pressure #132578

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

drigz
Copy link
Contributor

@drigz drigz commented Jun 27, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

When the image GC fails to free enough space due to disk pressure, the user-visible event it generates is misleading (see #71869) - the technical detail leads operators to suspect GC problems. This PR makes the message actionable by focusing on the disk pressure.

I had hoped to include the total disk space consumed by images that cannot be collected, to help operators quickly determine if disk pressure is caused by a large number of active images or by other files on the node, but this is blocked because the CRI API doesn't reveal the total disk use (depending on the runtime it has either the compressed or uncompressed space: #120698).

Which issue(s) this PR is related to:

Fixes #71869

Special notes for your reviewer:

See individual commits for a breakdown into:

  1. add test for user-visible message
  2. rephrase message to focus on disk usage (saves user from running df -h)

Regarding formatSize(): I considered using resource.NewQuantity but it wasn't clear how to round the size appropriately, and it can't generate non-whole numbers like 12.3 G.

Does this PR introduce a user-facing change?

When image garbage collection is unable to free enough disk space, the FreeDiskSpaceFailed warning event is now more actionable. Example: `Insufficient free disk space on the node's image filesystem (95.0% of 10.0 GiB used). Failed to free sufficient space by deleting unused images. Consider resizing the disk or deleting unused files.`

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None - although I considered collecting the suggested fixes from #71869 (kill processes that are holding deleted files open, delete orphaned volumes, etc) I decided this would easily become crufty with irrelevant steps.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels Jun 27, 2025
Copy link

linux-foundation-easycla bot commented Jun 27, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 27, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 27, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @drigz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 27, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: drigz
Once this PR has been reviewed and has the lgtm label, please assign sergeykanzhelev for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from kannon92 and matthyx June 27, 2025 14:11
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 27, 2025
@drigz drigz force-pushed the actionable-gc-error branch from 3680aa4 to f3a0fca Compare June 27, 2025 14:12
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 27, 2025
@drigz
Copy link
Contributor Author

drigz commented Jun 27, 2025

@haircommander I'm still figuring out how to manually test my kubelet change in a real cluster (just ran unit tests so far) but in the meantime, can you /ok-to-test this? Thanks in advance!

@drigz
Copy link
Contributor Author

drigz commented Jun 28, 2025

Looks like this exposes a bug where an error message is used as format string: (%!o(MISSING))

Warning  FreeDiskSpaceFailed      24s    kubelet          Insufficient free disk space on the node's image filesystem (53.5% of 976.2 GiB used). Failed to free sufficient space by deleting unused images (429.0 MiB used for active images). Consider resizing the disk or deleting unused files.
  Warning  ImageGCFailed            24s    kubelet          Insufficient free disk space on the node's image filesystem (53.5%!o(MISSING)f 976.2 GiB used). Failed to free sufficient space by deleting unused images (429.0 MiB used for active images). Consider resizing the disk or deleting unused files.

Should be easy to fix by adding "%s", but is it WAI that both FreeDiskSpaceFailed and ImageGCFailed are generated in this case?

I'm also unsure if the XX.X GB used for active images is adding value here since for containerd it's only the compressed size (containerd/containerd#9261) and for other runtimes it's uncompressed size, but in both cases it's an underestimate, and the CRI API doesn't let us report the true size:

root@kind-control-plane:/# du -sh /var/lib/containerd/* | sort -h
[snip: small directories]
1.6M    /var/lib/containerd/io.containerd.metadata.v1.bolt
421M    /var/lib/containerd/io.containerd.content.v1.content
702M    /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs

That said, if it's less than 0.1 * disk size it's still a pointer that "too many/big images" is not the cause, whereas if it's greater than 0.3 * disk size it might suggest "too many/big images" is the cause. But not sure there's a good way to suggest that in the error message...

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 28, 2025
@@ -1561,7 +1561,7 @@ func (kl *Kubelet) StartGarbageCollection() {
if prevImageGCFailed {
klog.ErrorS(err, "Image garbage collection failed multiple times in a row")
// Only create an event for repeated failures
kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, err.Error())
kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, "%s", err.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this change added? Was the error not reported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ImageGCFailed event had the following format problem in its message: 53.5%!o(MISSING)f 976.2 GiB used because the error contained a % character and was incorrectly used as a format string.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 28, 2025
@kannon92
Copy link
Contributor

/ok-to-test

Please sign the CLA.

Two of your commits have no author on it which may need to be fixed before merge.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 28, 2025
drigz added 2 commits June 28, 2025 18:47
This reverts commit f3a0fca, because
different container runtimes report either compressed or uncompressed
size but not total disk usage, so this is misleading as to the total
disk space used for container images.
@drigz drigz force-pushed the actionable-gc-error branch from 63cac12 to bf23d26 Compare June 28, 2025 16:47
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 28, 2025
@drigz
Copy link
Contributor Author

drigz commented Jun 28, 2025

/test pull-kubernetes-linter-hints

@drigz
Copy link
Contributor Author

drigz commented Jun 28, 2025

Two of your commits have no author on it which may need to be fixed before merge.

Thanks, sorry for the noise, rookie error!

@drigz drigz requested a review from kannon92 June 28, 2025 17:09
// be due to an unusually large number or size of in-use container images.
message := fmt.Sprintf("Insufficient free disk space on the node's image filesystem (%.1f%% of %s used). "+
"Failed to free sufficient space by deleting unused images. "+
"Consider resizing the disk or deleting unused files.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really think we should mention "resizing" the disk.

This is not that helpful to an end-user as in many cases it is not easy to "resize" a disk.

For fedora/fcos, I don't think it is possible to resize /var on an existing node.

If you are on a cloud instance, you could provision a new disk and remount but that is very different.

Anyway, maybe you could just suggest taking a look at other resources on this node as you mention in your comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone called it out on the original issue (comment):

I changed Node disk size from 10 GiB to 20GiB and error didn't appeard

which is why I added it here. However, it's implicit when you see an error like 95% of 10GB used that you could just bump the disk size, so perhaps it's not needed.

you could just suggest taking a look at other resources on this node

I'm assuming you mean the code comment here ("This usually means the disk is full for reasons other than container images, such as logs, volumes, or other files. However, it could also be due to an unusually large number or size of in-use container images."). I've changed to:

Investigate disk usage, as it could be used by active images, logs, volumes, or other data.

This strikes me as a necessary first step, and immediate mitigation is probably not the operator's priority because this is just a "warning".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Lgtm.

@drigz drigz changed the title WIP Report actionable error when GC fails due to disk pressure Report actionable error when GC fails due to disk pressure Jun 29, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 29, 2025
@drigz drigz requested a review from kannon92 June 29, 2025 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

failed to garbage collect required amount of images. Wanted to free 473842483 bytes, but freed 0 bytes
3 participants