kubelet: new kubelet config option for disabling group oom kill #122813
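For orientation, here is a minimal sketch of the kind of KubeletConfiguration field this PR is discussing. The field name `SingleProcessOOMKill` and its exact shape are illustrative assumptions, not the naming settled in this PR (the name itself is discussed below).

```go
// Hypothetical sketch only: the field name and shape are assumptions for
// illustration, not the API agreed on in this PR.
package config

// KubeletConfiguration (heavily elided) with one possible form of the new knob.
type KubeletConfiguration struct {
	// ...existing fields elided...

	// SingleProcessOOMKill, when true, would ask the kubelet not to enable the
	// cgroup v2 group OOM kill behavior (memory.oom.group) for container
	// cgroups, so an OOM kills only the selected process rather than every
	// process in the container.
	SingleProcessOOMKill *bool `json:"singleProcessOOMKill,omitempty"`
}
```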
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: tzneal. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/cc @haircommander
Also, I have no attachment to the name.
/remove-sig api-machinery
@tzneal As someone who's made 0 contributions to K8s, I just wanted to chime in with my opinion here (won't be offended if it's dismissed out of hand :)). I think this configuration should live at the container level, not at the kubelet level. Whether a single process or the entire container should be killed depends entirely on the specific application, and likely doesn't have much of an impact outside the specific container in question. So the people writing the config for a container are going to be in a far better position to determine its behavior than admins setting the kubelet config. Of course, nothing here precludes adding the container-level configuration in addition to this one.
If this has agreement on direction from kubelet component owners (since there's no linked design, it's not clear to me that it does), get the kubelet packages reviewed / approved, then relabel for api-review for the mechanical flag / config plumbing. /remove-label api-review
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Is there a specific use-case that needs to be supported? If so, what is it? I kind of miss the point of supporting a broken cgroup v1 behavior for backward-compatibility's sake at a point in time when cgroup v1 is deprecated. Another thing to mention is that the killed process won't get restarted automatically; something would need to either restart the process inside the container or restart the container itself when needed.
I think there are lots of situations where not automatically killing the entire container is desirable behavior. The most common example I've seen others point to is postgres, which is designed to let the main postmaster process do cleanup in case a child process crashes. A hard SIGKILL to everything is undesirable from a stability perspective.

A pattern I've used is having a main process on a container which handles light housekeeping logic (e.g. reading from a queue, heartbeating) and a child process which does the heavy lifting of handling a request. I've even been forced to use this pattern in Python applications due to GIL-locking issues. In this case, it's generally going to be the child process that is the proximate cause of the OOM, and most likely what the kernel is going to target. In this pattern, the main process can detect that the request itself is "bad" because it resulted in a killed process, and can handle that appropriately (e.g. return an error, discard the message, etc.). If the entire container were OOMKilled, there's no opportunity to do any housekeeping, and the rest of the system has no insight into what went wrong (maybe the pod disconnected due to a network issue?), meaning I need to retry the job. As you note, the child process will need to be restarted, but most process pool libraries handle this out of the box, so it isn't much of an issue.

With all that being said, I stand by my comment above that I don't believe this configuration belongs in the kubelet, since the situations where this is desirable behavior are very application/container-specific. I'd rather have the configuration live at that level, and I'm worried that having the configuration at the kubelet level as well would muddy the waters.
We have a bunch of processes that currently have a Python wrapper program managing a task through a bunch of phases that call out to other binaries. If any of those binaries fails due to a segfault (or, more frequently, an OOM event), the Python code is responsible for seeing that it OOMed and updating our records of how that job failed. There are definitely cons to this setup, and more K8s-native ways to reimplement it, but it's a big lift to refactor our processes to handle this in the near to medium term. I'm also certain that we're not the only folks relying on this mode of operation.
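A minimal sketch, in Go rather than the commenters' Python, of the supervision pattern described in the two comments above. It assumes the kernel OOM killer terminates the child with SIGKILL, which the parent can observe from the wait status; the worker binary path and flags are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"os/exec"
	"syscall"
)

// runTask launches one unit of heavy work as a child process and reports
// whether it was killed by a signal (as the kernel OOM killer does when group
// OOM kill is disabled), so the supervising process can record the failure
// and keep running.
func runTask(bin string, args ...string) error {
	cmd := exec.Command(bin, args...)
	err := cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		if status, ok := exitErr.Sys().(syscall.WaitStatus); ok && status.Signaled() {
			// SIGKILL here usually means the OOM killer (or an operator) hit
			// the child; only this task is lost, the supervisor survives.
			return fmt.Errorf("task killed by signal %v", status.Signal())
		}
	}
	return err
}

func main() {
	// Hypothetical worker binary and arguments, for illustration only.
	if err := runTask("/usr/local/bin/heavy-worker", "--job", "42"); err != nil {
		// Housekeeping: record the failure, requeue or discard the job, and
		// keep the main process alive.
		log.Printf("job failed: %v", err)
	}
}
```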
Thank you @mkarrmann and @gdhagger for the detailed explanation!
If so, I'd say that it's not a broken behavior, but rather a feature that's needed for exceptional use-cases and that needs to be used with care.
Since this behavior is expected to be valuable only for exceptional use-cases, I tend to agree that we shouldn't define it on the kubelet (=node) level, unless there's a use-case for turning it on for the whole node that I can't think of.
I agree that the container level is best, but then I would suggest reverting #117793. I just upgraded our cluster and our workloads were fatally affected, and there is no way around it. #117793 (comment)
@superbrothers I think the community consensus is that cgroup-aware OOM kills are the better default behavior. The K8s docs have always been written in a way that implies this is the case, so arguably it was a bug that this wasn't the default. It also takes forethought to take advantage of a non-cgroup-aware OOM killer, which most devs won't do.

The fact that #117793 broke workloads is unfortunate, but I also think the damage is done. Now that it's been released, there are likely workloads that implicitly depend on that behavior, so reverting it would break those.

I'm glad to see we agree that container-level config would be best :) I'm brand new to K8s development, but I have a goal of shipping container-level configuration for this flag. Hopefully that can be completed and out the door relatively soon, so that all use cases can be satisfied.
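For reference, on cgroup v2 the behavior being debated here is controlled by the per-cgroup memory.oom.group file: 1 means the OOM killer kills every process in the cgroup as a unit, 0 keeps the single-victim behavior. Below is a small sketch of toggling that knob directly, with a hypothetical cgroup path; in practice the kubelet and runtime arrange this when the container cgroup is set up, and the sketch only shows the kernel interface involved.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// setGroupOOMKill flips the cgroup v2 memory.oom.group knob for one cgroup:
// "1" makes the kernel OOM killer kill the whole cgroup as a unit, "0" keeps
// the classic single-victim behavior that this PR wants to remain available.
func setGroupOOMKill(cgroupDir string, group bool) error {
	val := "0"
	if group {
		val = "1"
	}
	return os.WriteFile(filepath.Join(cgroupDir, "memory.oom.group"), []byte(val), 0o644)
}

func main() {
	// Hypothetical container cgroup path, for illustration only.
	dir := "/sys/fs/cgroup/kubepods.slice/kubepods-pod1234.slice/cri-containerd-abc.scope"

	if cur, err := os.ReadFile(filepath.Join(dir, "memory.oom.group")); err == nil {
		fmt.Printf("current memory.oom.group: %s\n", strings.TrimSpace(string(cur)))
	}
	if err := setGroupOOMKill(dir, false); err != nil {
		fmt.Fprintf(os.Stderr, "failed to disable group OOM kill: %v\n", err)
	}
}
```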
@mkarrmann Hi, Matt 👋
I think this is the community consensus. It would take a lot of time and effort to get this configuration to the container level. I fear that many users will not be able to migrate from cgroup v1 to v2 (kubernetes/enhancements#4572) in the meantime.
@utam0k Hello! I personally am unsure whether introducing and documenting a feature that is intended to be deprecated soon is a great idea (I don't know whether there's precedent for that in K8s). I'm also quite sure that the kubelet solution will be overly burdensome for the majority of developers who wish to configure this (I bet in most orgs, the vast majority of devs don't even know who to reach out to to discuss making a kubelet change on their nodes). Upon reflection, I think I would actually prefer temporarily reverting #117793, as opposed to introducing this config to compensate. But that's all just my personal opinion and I don't have much conviction on this point; I'm definitely not going to object to anything trying to unblock users. I also grant your point that shipping the container-level config may simply take too long.
I agree that this should not be a kubelet/node-level setting. One problem with kubelet flags is that users don't have good visibility into them, and in the worst case nodes might be configured differently. A possible way to provide per-container control would be QoS-resources (KEP). The OOM kill behavior would be a Kubernetes-managed QoS-resource with two possible settings (classes): group kill or process kill. It would also have the advantage of allowing per-cluster/per-node/per-namespace defaults, and the user could always see what behavior their applications get.
I think we need to decide whether this is a behavior/setting we want to support long-term, or simply a temporary crutch to help with the transition from cgroup v1 to v2. If it's the latter, we should not be adding anything to the k8s core API.
I've opened an issue describing my proposal: #124253. I tried to thoroughly articulate why it's important that first-class support for this behavior is provided.
@tzneal @haircommander @mrunalp @dchen1107
For the record, I largely retract my original comment about not wanting this merged. It will take a while for a proper solution to land, and I agree this is important enough that it deserves a sooner fix (even if this solution will be inconvenient for many).

One suggestion, @tzneal: make the config of type

I still think it's important to eventually have an option for configuring this per Pod in the K8s API, in which case it'd be best to avoid confusion or ambiguity regarding how that will interact with this feature.
@mkarrmann Thanks for your understanding and effort 🙏 I totally agree with you.
Closing this for now as I don't have time at the moment. |
@tzneal Thanks for your effort 🙏 I understand your situation. Can I take over this PR? |
Sure thing, please do! |
What type of PR is this?
/kind feature