kubelet: new kubelet config option for disabling group oom kill #122813

Closed

Conversation

tzneal
Contributor

@tzneal tzneal commented Jan 16, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added `singleProcessOOMKill` flag to the kubelet configuration. Setting it to true enables single-process OOM killing in cgroups v2: if a single process in a container is OOM killed, the remaining processes are not OOM killed.
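
As a rough illustration of the proposed surface (a sketch based on the release note above, not the PR's actual diff; the struct is a simplified stand-in for the kubelet configuration types):

```go
// Package kubeletconfig is a simplified stand-in used only for illustration.
package kubeletconfig

// KubeletConfiguration sketches where the proposed flag would live; only the
// new field is shown, and the json tag mirrors the `singleProcessOOMKill` key
// named in the release note.
type KubeletConfiguration struct {
	// SingleProcessOOMKill, when true, asks the kubelet to configure container
	// cgroups (cgroup v2 only) so that an OOM kill targets the single offending
	// process instead of every process in the container.
	SingleProcessOOMKill bool `json:"singleProcessOOMKill,omitempty"`
}
```

In a kubelet configuration file this would presumably be set as `singleProcessOOMKill: true`.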

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 16, 2024
@k8s-ci-robot k8s-ci-robot added area/code-generation area/kubelet area/test kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 16, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tzneal
Once this PR has been reviewed and has the lgtm label, please assign liggitt for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tzneal
Contributor Author

tzneal commented Jan 16, 2024

/cc @haircommander

@tzneal tzneal force-pushed the support-disabling-oom-group-kill branch from d59ff64 to a443686 on January 16, 2024 at 19:02
@tzneal
Contributor Author

tzneal commented Jan 16, 2024

Also, I have no attachment to the name SingleProcessOOMKill; I'm open to suggestions if anyone has a better one.

@tzneal tzneal force-pushed the support-disabling-oom-group-kill branch from a443686 to 9b58625 on January 16, 2024 at 20:46
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 16, 2024
@cici37
Contributor

cici37 commented Jan 16, 2024

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jan 16, 2024
@tzneal tzneal force-pushed the support-disabling-oom-group-kill branch from 9b58625 to ee5e6e8 on January 17, 2024 at 13:18
@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 18, 2024
@mkarrmann
Contributor

@tzneal As someone who's made 0 contributions to K8s, I just wanted to chime in with my opinion here (I won't be offended if it's dismissed out of hand :)).

I think this configuration should be at the container level, not the kubelet level. Whether a single process or the entire container should be killed depends entirely on the specific application, and likely doesn't have much of an impact outside of the specific container in question. So I think the people writing the config for a container are in a far better position to determine the behavior of their container than admins setting the kubelet config.

Of course, nothing here precludes adding the container-level configuration in addition to this configuration (or using nodeSelector to accomplish something similar), but I'm worried that having both types of configuration would be unnecessarily confusing and offer little advantage over container-level config alone. So I'd prefer not to see this config option added in the first place.

@liggitt
Member

liggitt commented Feb 22, 2024

If this has agreement on direction from kubelet component owners (since there's no linked design, it's not clear to me that it does), get the kubelet packages reviewed / approved, then relabel for api-review for the mechanical flag / config plumbing.

/remove-label api-review

@k8s-ci-robot k8s-ci-robot removed the api-review Categorizes an issue or PR as actively needing an API review. label Feb 22, 2024
@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@iholder101
Contributor

this feature is not recommended yet so maybe we can pass on a doc update.

It's not that the feature isn't recommended yet; I don't think it will ever be recommended. With cgroups v2, memory.oom.group is set by default, which makes OOM kills work the way the documentation has stated (i.e. kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource, where the docs talk about containers being killed). It didn't work that way previously because a user-space handler was needed to implement what memory.oom.group now does automatically.

This change just adds support for the hopefully small number of users who want the arguably broken behavior back, where individual processes in a container can be OOM killed while the remaining processes keep running.

Is there a specific use case that needs to be supported? If so, what is it?

I kind of miss the point of supporting a broken cgroup v1 behavior for backward compatibility's sake at a point in time when cgroup v1 is deprecated.

Another thing to mention is that the killed process won't get restarted automatically. Something would need to either restart the process inside the container or restart the container itself when needed.
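
For context on the memory.oom.group mechanism described above: on cgroup v2 the kernel exposes a per-cgroup memory.oom.group file, where writing 1 makes the OOM killer take out every process in the cgroup and 0 restores single-process kills. A minimal sketch of toggling it (the cgroup path and helper name are illustrative, not taken from this PR):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// setGroupOOMKill toggles cgroup v2 group OOM killing for one container
// cgroup. Writing "1" to memory.oom.group means an OOM kill takes the whole
// group; "0" means only the offending process is killed.
func setGroupOOMKill(cgroupPath string, group bool) error {
	val := "0"
	if group {
		val = "1"
	}
	return os.WriteFile(filepath.Join(cgroupPath, "memory.oom.group"), []byte(val), 0o644)
}

func main() {
	// Example: restore single-process OOM kills for a hypothetical container
	// cgroup path; real paths depend on the cgroup driver and pod layout.
	if err := setGroupOOMKill("/sys/fs/cgroup/kubepods.slice/example-container", false); err != nil {
		fmt.Fprintln(os.Stderr, "failed to set memory.oom.group:", err)
	}
}
```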

@mkarrmann
Contributor

mkarrmann commented Mar 18, 2024

@iholder101

Is there a specific use case that needs to be supported? If so, what is it?

I think there are lots of situations where not automatically killing the entire container is desirable behavior. The most common example I've seen others point to is postgres, which is designed to let the main postmaster process do cleanup in case a child process crashes. A hard SIGKILL to everything is undesirable from a stability perspective.

A pattern I've used is having a main process on a container which handles light housekeeping logic (e.g. reading from a queue, heartbeating) and a child process which does the heavy lifting of handling a request. I've even been forced to use this pattern in Python applications due to GIL-locking issues. In this case, it's generally going to be the child process which is the proximate cause of the OOM, and most likely what the kernel is going to target. In this pattern, the main process can detect that the request itself is "bad" because it resulted in a killed process, and can handle that appropriately (e.g. return an error, discard the message, etc.). If the entire container were OOMKilled, then there's no opportunity to do any housekeeping and the rest of the system has no insight into what went wrong (maybe the pod disconnected due to a network issue?), meaning I need to retry the job.

As you note, the child process will need to be restarted, but most process-pool libraries handle this out of the box, so it isn't much of an issue.
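
A minimal Go sketch of the parent/child pattern described above (the real-world examples here are Python process pools; the binary path and helper name are hypothetical):

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
)

// runWorker launches a worker process for one request and reports whether it
// was killed by SIGKILL, which is consistent with (though not proof of) a
// single-process OOM kill; checking kernel logs or cgroup events would confirm it.
func runWorker(args ...string) (oomLikely bool, err error) {
	cmd := exec.Command(args[0], args[1:]...)
	err = cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		if status, ok := exitErr.Sys().(syscall.WaitStatus); ok &&
			status.Signaled() && status.Signal() == syscall.SIGKILL {
			return true, err
		}
	}
	return false, err
}

func main() {
	// The parent stays alive to mark the request as bad instead of dying with
	// the worker, which is exactly what group OOM kills would prevent.
	if oom, err := runWorker("/usr/local/bin/heavy-job", "--request", "42"); oom {
		fmt.Println("worker appears to have been OOM killed; discarding the request")
	} else if err != nil {
		fmt.Println("worker failed:", err)
	}
}
```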

With all that being said, I stand by my above comment that I don't believe this configuration belongs in the kubelet, since the situations where this is desirable behavior are very application/container-specific. I'd rather have the configuration live at that level, and I'm worried that having the configuration at the kubelet level as well would muddy the waters.

@gdhagger

We have a bunch of processes that currently have a Python wrapper program that manages a task through a bunch of phases that call out to other binaries. If any of those binaries fails due to a segfault, or, more frequently, an OOM event, the Python code is responsible for seeing that it OOMed and updating our records of how that job failed.

There are definitely cons to this setup, and more K8s-native ways to reimplement it, but it's a big lift to refactor our processes to handle this in the near-to-medium term. I'm also certain that we're not the only folks relying on this mode of operation.

@iholder101
Contributor

Thank you @mkarrmann and @gdhagger for the detailed explanation!

This change is just to add support for the hopefully small number of users that want the arguably broken behavior

If so, I'd say that it's not broken behavior, but rather a feature that's needed for exceptional use cases and that needs to be used with care.

I think that this configuration should be at the container-level, not at the Kubelet.

Since this behavior is expected to be valuable only for exceptional use cases, I tend to agree that we shouldn't define it at the kubelet (= node) level, unless there's a use case for turning it on for the whole node that I can't think of.

@superbrothers
Member

superbrothers commented Apr 3, 2024

I agree that the container level is best, but then I would suggest reverting #117793. I just upgraded our cluster, and our workloads were fatally affected with no way around it. #117793 (comment)

@mkarrmann
Contributor

@superbrothers I think the community consensus is that cgroup-aware OOMs are better default behavior. The K8s docs have always been written in a way that implies this is the case, so arguably it was a bug that this wasn't the default behavior. It also takes forethought to take advantage of a non-cgroup-aware OOM killer, which most devs won't do.

The fact that #117793 broke workloads is unfortunate, but I also think the damage is done. Now that it has been released, there are likely workloads that implicitly depend on that behavior, so reverting it would break those.

I'm glad to see we agree that container-level config would be best :) I'm brand new to K8s development, but I have a goal of shipping container-level configuration for this flag. Hopefully that can be completed and out the door relatively soon, so that all use cases can be satisfied.

@utam0k
Member

utam0k commented Apr 9, 2024

@mkarrmann Hi, Matt 👋

I think the community consensus is that cgroup-aware OOMs are better default behavior.

I think this is the community consensus.
#117793 (comment)

It would take a lot of time and effort to get this configuration in at the container level. I fear that many users will not be able to complete the cgroup v1-to-v2 migration (kubernetes/enhancements#4572) in the meantime.
You may be right about the final landing point, but I just think a practical solution would be to support it in the kubelet for now, write migration documentation, and deprecate the kubelet configuration once it becomes possible to configure it at the container level. What do you think?

@mkarrmann
Contributor

@utam0k Hello!

I personally am unsure whether introducing and documenting a feature that is intended to be deprecated soon is a great idea (I don't know whether there's precedent for that in K8s). I'm also quite sure that the kubelet solution will be overly burdensome for the majority of developers who wish to configure this (I bet that in most orgs, the vast majority of devs don't even know who to reach out to about making a kubelet change on their nodes). Upon reflection, I think I would actually prefer temporarily reverting #117793, as opposed to introducing this config to compensate.

But that's all just my personal opinion and I don't have much conviction on this point—I'm definitely not going to object to anything trying to unblock users. I also grant your point that shipping the container-level config simply may take too long.

@marquiz
Contributor

marquiz commented Apr 10, 2024

I think that this configuration should be at the container-level, not at the Kubelet. Whether or not a single process or the entire container should be killed is entirely dependent on the specific application

I agree that this should not be a kubelet/node-level setting. I also feel that one problem with kubelet flags is that users don't have good visibility, and in the worst case nodes might be configured differently.

A possible solution to provide per-container control would be QoS resources (KEP). The OOM kill behavior would be a Kubernetes-managed QoS resource with two possible settings (classes): group kill or process kill. It would also have the advantage of allowing per-cluster/per-node/per-namespace defaults. Plus, the user could always see what behavior their applications get.

@yujuhong
Contributor

I think we need to decide whether this is a behavior/setting we want to support long-term, or simply a temporary crutch to help with the transition from cgroup v1 to v2. If it's the latter, we should not be adding anything to the k8s core API.

@mkarrmann
Contributor

I've opened an issue describing my proposal: #124253

I tried to thoroughly articulate why it's important that first-class support for this behavior is provided.

@utam0k
Member

utam0k commented Apr 30, 2024

@tzneal @haircommander @mrunalp @dchen1107
Sorry for pinging you, but I'd be happy to know the status of this PR within the sig-node team. I think it is a bit risky to enter the transitional period between cgroup v1 and v2 with this PR in limbo.
Two users (see footnotes 1 and 2) have already struggled with this behavior, and I expect more will as they migrate to cgroup v2 soon. For example, now that Ubuntu 24.04 has been released, users may attempt the v1-to-v2 migration as part of that upgrade.

Footnotes

  1. https://github.com/kubernetes/kubernetes/pull/117793#issuecomment-2059012668

  2. https://github.com/kubernetes/kubernetes/pull/117793#issuecomment-1837400726

@mkarrmann
Contributor

For the record, I largely retract my original comment about not wanting this merged. It will take a while for a proper solution to be merged, and I agree this is important enough that it deserves a sooner fix (even if this solution will be inconvenient for many).

One suggestion @tzneal :

Make the config of type *bool with a default of nil, where the semantics of nil are that no guarantee is made at the kubelet level about which option will be used. In practice, this would mean respecting any container/pod-level configuration that is added later, while the kubelet can still default to using group OOM kills.

I still think it's important to eventually have an option for configuring this per Pod in the K8s API, in which case it'd be best to avoid confusion or ambiguity regarding how that will interact with this feature.
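
A small sketch of the tri-state idea, with assumed field names and one possible precedence order chosen purely for illustration:

```go
package main

import "fmt"

// KubeletConfig is a simplified stand-in; the point of interest is that the
// field is a *bool, so "not set" (nil) is distinguishable from an explicit
// true or false.
type KubeletConfig struct {
	SingleProcessOOMKill *bool
}

// effectiveSingleProcessOOMKill resolves the flag: an explicit kubelet value
// wins, a hypothetical future pod/container-level value is honored when the
// kubelet is silent, and the fallback keeps group OOM kills on cgroup v2.
func effectiveSingleProcessOOMKill(cfg KubeletConfig, podLevel *bool) bool {
	if cfg.SingleProcessOOMKill != nil {
		return *cfg.SingleProcessOOMKill
	}
	if podLevel != nil {
		return *podLevel
	}
	return false
}

func main() {
	t := true
	fmt.Println(effectiveSingleProcessOOMKill(KubeletConfig{}, &t))  // true: pod-level opt-in honored
	fmt.Println(effectiveSingleProcessOOMKill(KubeletConfig{}, nil)) // false: group kill remains the default
}
```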

@utam0k
Member

utam0k commented Apr 30, 2024

@mkarrmann Thanks for your understanding and effort 🙏 I totally agree with you.

For the record, I largely retract my original comment about not wanting this merged. It will take a while for a proper solution to be merged, and I agree this is important enough that it deserves a sooner fix (even if this solution will be inconvenient for many).

@tzneal
Contributor Author

tzneal commented May 23, 2024

Closing this for now as I don't have time at the moment.

@tzneal tzneal closed this May 23, 2024
@utam0k
Member

utam0k commented May 24, 2024

@tzneal Thanks for your effort 🙏 I understand your situation. Can I take over this PR?

@tzneal
Contributor Author

tzneal commented May 24, 2024

@tzneal Thanks for your effort 🙏 I understand your situation. Can I take over this PR?

Sure thing, please do!

Labels
area/code-generation area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.