Report actionable error when GC fails due to disk pressure #132578
base: master
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
Hi @drigz. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: drigz. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
Force-pushed from 3680aa4 to f3a0fca.
@haircommander I'm still figuring out how to manually test my kubelet change in a real cluster (just ran unit tests so far), but in the meantime, can you /ok-to-test this? Thanks in advance!
Looks like this exposes a bug where an error message is used as a format string:
Should be easy to fix by adding an explicit "%s" format string. I'm also unsure if the image size reported by the CRI reflects the actual disk usage, since different runtimes report either the compressed or the uncompressed size.
That said, if it's less than 0.1 * disk size it's still a pointer that "too many/big images" is not the cause, whereas if it's greater than 0.3 * disk size it might suggest "too many/big images" is the cause. But I'm not sure there's a good way to suggest that in the error message...
@@ -1561,7 +1561,7 @@ func (kl *Kubelet) StartGarbageCollection() {
 	if prevImageGCFailed {
 		klog.ErrorS(err, "Image garbage collection failed multiple times in a row")
 		// Only create an event for repeated failures
-		kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, err.Error())
+		kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, "%s", err.Error())
Why was this change added? Was the error not reported?
The ImageGCFailed event had the following format problem in its message: "53.5%!o(MISSING)f 976.2 GiB used", because the error contained a % character and was incorrectly used as a format string.
/ok-to-test
Please sign the CLA. Two of your commits have no author set, which may need to be fixed before merge.
This reverts commit f3a0fca, because different container runtimes report either compressed or uncompressed size but not total disk usage, so this is misleading as to the total disk space used for container images.
Force-pushed from 63cac12 to bf23d26.
/test pull-kubernetes-linter-hints
Thanks, sorry for the noise, rookie error!
// be due to an unusually large number or size of in-use container images.
message := fmt.Sprintf("Insufficient free disk space on the node's image filesystem (%.1f%% of %s used). "+
	"Failed to free sufficient space by deleting unused images. "+
	"Consider resizing the disk or deleting unused files.",
I don't really think we should mention "resizing" the disk.
This is not that helpful to an end-user, as in many cases it is not easy to "resize" a disk.
For fedora/fcos, I don't think it is possible to resize /var on an existing node.
If you are on a cloud instance, you could provision a new disk and remount but that is very different.
Anyway, maybe you could just suggest taking a look at other resources on this node as you mention in your comment.
Someone called it out on the original issue (comment):
I changed Node disk size from 10 GiB to 20GiB and error didn't appeard
which is why I added it here. However, it's implicit when you see an error like 95% of 10GB used
that you could just bump the disk size, so perhaps it's not needed.
you could just suggest taking a look at other resources on this node
I'm assuming you mean the code comment here ("This usually means the disk is full for reasons other than container images, such as logs, volumes, or other files. However, it could also be due to an unusually large number or size of in-use container images."). I've changed to:
Investigate disk usage, as it could be used by active images, logs, volumes, or other data.
This strikes me as a necessary first step, and immediate mitigation is probably not the operator's priority because this is just a "warning".
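As a rough sketch (not the PR's exact code: the function name, signature, and the formatSize helper discussed in the PR notes below are all assumed here), the resulting event message could be assembled like this:

```go
// buildImageGCFailureMessage is illustrative only; formatSize is assumed to
// render a byte count as a human-readable string such as "976.2 GiB".
func buildImageGCFailureMessage(usedPercent float64, capacityBytes uint64) string {
	return fmt.Sprintf("Insufficient free disk space on the node's image filesystem (%.1f%% of %s used). "+
		"Failed to free sufficient space by deleting unused images. "+
		"Investigate disk usage, as it could be used by active images, logs, volumes, or other data.",
		usedPercent, formatSize(capacityBytes))
}
```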
Thanks! Lgtm.
What type of PR is this?
/kind bug
What this PR does / why we need it:
When the image GC fails to free enough space due to disk pressure, the user-visible event it generates is misleading (see #71869) - the technical detail leads operators to suspect GC problems. This PR makes the message actionable by focusing on the disk pressure.
I had hoped to include the total disk space consumed by images that cannot be collected, to help operators quickly determine if disk pressure is caused by a large number of active images or by other files on the node, but this is blocked because the CRI API doesn't reveal the total disk usage (depending on the runtime, it reports either the compressed or the uncompressed size: #120698).
Which issue(s) this PR is related to:
Fixes #71869
Special notes for your reviewer:
See individual commits for a breakdown into:
Regarding formatSize(): I considered using resource.NewQuantity, but it wasn't clear how to round the size appropriately, and it can't generate non-whole numbers like 12.3 G.
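For context, a minimal sketch of what such a size formatter could look like, assuming binary (1024-based) units; the PR's actual formatSize may round and label sizes differently:

```go
package main

import "fmt"

// formatSize renders a byte count as a human-readable string like "12.3 GiB".
// This is an illustrative sketch, not the PR's implementation.
func formatSize(bytes uint64) string {
	const unit = 1024
	if bytes < unit {
		return fmt.Sprintf("%d B", bytes)
	}
	div, exp := uint64(unit), 0
	for n := bytes / unit; n >= unit; n /= unit {
		div *= unit
		exp++
	}
	return fmt.Sprintf("%.1f %ciB", float64(bytes)/float64(div), "KMGTPE"[exp])
}

func main() {
	fmt.Println(formatSize(20 * 1024 * 1024 * 1024)) // prints "20.0 GiB"
}
```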
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
None - although I considered collecting the suggested fixes from #71869 (kill processes that are holding deleted files open, delete orphaned volumes, etc.), I decided this would easily become crufty with irrelevant steps.