
use the cgroup aware OOM killer if available #117793

Merged: 1 commit merged into kubernetes:master on Jun 12, 2023

Conversation

@tzneal (Contributor) commented May 5, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Sets the memory.oom.group bit for container cgroups on kernels with cgroups v2.

Which issue(s) this PR fixes:

Fixes #117070

Special notes for your reviewer:

Does this PR introduce a user-facing change?

If using cgroups v2, then the cgroup aware OOM killer will be enabled for container cgroups via `memory.oom.group`. This causes processes within the cgroup to be treated as a unit and killed simultaneously in the event of an OOM kill on any process in the cgroup.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

https://github.com/torvalds/linux/blob/3c4aa44343777844e425c28f1427127f3e55826f/Documentation/admin-guide/cgroup-v2.rst?plain=1#L1280
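
For context on what this toggles: memory.oom.group is a per-cgroup control file in cgroup v2, and writing 1 to it asks the kernel to treat the cgroup and its descendants as a single unit when the OOM killer fires. Below is a minimal sketch of that mechanism at the filesystem level only; the cgroup path and function name are hypothetical, and the PR itself plumbs the setting through the kubelet and container runtime rather than writing the file directly.

// Illustrative only: enabling the cgroup-aware OOM killer for one cgroup v2
// directory by writing "1" to its memory.oom.group file. The path below is a
// made-up example; it is not how the kubelet applies the setting.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func enableGroupOOMKill(cgroupDir string) error {
	// "1" tells the kernel to kill the whole cgroup as a unit on OOM.
	p := filepath.Join(cgroupDir, "memory.oom.group")
	if err := os.WriteFile(p, []byte("1"), 0o644); err != nil {
		return fmt.Errorf("setting memory.oom.group on %s: %w", cgroupDir, err)
	}
	return nil
}

func main() {
	// Hypothetical container cgroup path, for illustration only.
	if err := enableGroupOOMKill("/sys/fs/cgroup/kubepods.slice/example.scope"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}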

@k8s-ci-robot added the release-note, kind/feature, size/L, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, and needs-priority labels on May 5, 2023
@tzneal (Contributor, Author) commented May 5, 2023

/sig node

@k8s-ci-robot added the sig/node, area/kubelet, area/test, and sig/testing labels and removed the do-not-merge/needs-sig label on May 5, 2023
@tzneal force-pushed the memory-oom-group-support branch 3 times, most recently from 97c3c08 to ab4192d on May 5, 2023 01:54
@bart0sh added this to Triage in SIG Node PR Triage on May 5, 2023
@tzneal force-pushed the memory-oom-group-support branch 2 times, most recently from 4e8beb7 to bfb463b on May 5, 2023 13:54
@bobbypage (Member) commented May 5, 2023

Thanks for making the change. I agree this would be nice to have; my only concern is that some workloads may want to maintain the current behavior and avoid a cgroup-wide OOM kill. I agree this is a better default for the majority of workloads, but does it make sense to have some way for a workload to opt out of this behavior? Some folks mention that some workloads (e.g. PostgreSQL, nginx) do handle this case correctly, xref: #50632 (comment)

/cc @mrunalp @haircommander @giuseppe

for thoughts

@k8s-ci-robot (Contributor):

@bobbypage: GitHub didn't allow me to request PR reviews from the following users: for, thoughts.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Thanks for making the change. I agree this would be nice to have; my only concern is that some workloads may want to maintain the current behavior and avoid a cgroup-wide OOM kill. I agree this is a better default for the majority of workloads, but does it make sense to have some way for a workload to opt out of this behavior? Some folks mention that some workloads (e.g. PostgreSQL, nginx) do handle this case correctly, xref: #50632 (comment)

/cc @mrunalp @haircommander @giuseppe for thoughts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@haircommander (Contributor):

Yeah, at this point we have an established behavior of not doing this (even if it's the right thing). Unfortunately, if we want it tunable, we may need to introduce a knob in the pod API.

@tzneal (Contributor, Author) commented May 5, 2023

Yeah, at this point we have an established behavior of not doing this (even if it's the right thing). Unfortunately, if we want it tunable, we may need to introduce a knob in the pod API.

What's the consensus on enabling the behavior by default if it's tunable? E.g. something like:

diff --git a/staging/src/k8s.io/api/core/v1/types.go b/staging/src/k8s.io/api/core/v1/types.go
index 567592efd7c..a18ff2b3400 100644
--- a/staging/src/k8s.io/api/core/v1/types.go
+++ b/staging/src/k8s.io/api/core/v1/types.go
@@ -2534,6 +2534,11 @@ type Container struct {
        // Default is false.
        // +optional
        TTY bool `json:"tty,omitempty" protobuf:"varint,18,opt,name=tty"`
+       // Whether an OOM kill on a process in the container should only kill that process, or all
+       // processes in the container if possible.
+       // Default is false.
+       // +optional
+       OOMSingleProcess bool `json:"oomSingle,omitempty" protobuf:"varint,19,opt,name=oomSingle"`
 }

Happily accepting better naming suggestions :)

@haircommander (Contributor):

I am happy to put it on by default. As another thought (which I worry will unnecessarily increase complexity): are there other cgroup fields we want to optionally enable? Would an opaque key-value field like memoryCgroupOptions be useful? I am not immediately seeing other options that the kubelet wouldn't already make opinionated decisions about (like memory.high), so maybe we don't need the complexity. I am just wary of adding too many fields to the pod API, so I want to think it through before adding one for just this field.
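
Purely to illustrate the idea floated above (this is not an actual proposal, and the field name is invented), an opaque key-value field on the container spec might look roughly like this:

// Hypothetical sketch only; MemoryCgroupOptions is not a real pod API field.
type Container struct {
        // ... existing fields ...

        // MemoryCgroupOptions would pass selected memory-controller settings
        // through to the container's cgroup, e.g. {"oom.group": "0"}.
        // +optional
        MemoryCgroupOptions map[string]string `json:"memoryCgroupOptions,omitempty"`
}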

@dims (Member) commented Jun 12, 2023

/approve
/lgtm

Merging this to get more soak time in the CI.

@k8s-ci-robot added the lgtm label on Jun 12, 2023
@k8s-ci-robot (Contributor):

LGTM label has been added.

Git tree hash: da72515532e8b5504b621294ee68cfd297d90760

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, SergeyKanzhelev, tzneal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jun 12, 2023
@k8s-ci-robot merged commit 86d7860 into kubernetes:master on Jun 12, 2023
13 checks passed
SIG Node CI/Test Board automation moved this from Archive-it to Done Jun 12, 2023
SIG Node PR Triage automation moved this from Needs Reviewer to Done Jun 12, 2023
@k8s-ci-robot added this to the v1.28 milestone on Jun 12, 2023
@tzneal deleted the memory-oom-group-support branch on June 13, 2023 13:42
bertinatto added a commit to bertinatto/kubernetes that referenced this pull request Jul 18, 2023
@superbrothers (Member) commented Dec 3, 2023

This feature affected several of our workloads and caused a lot of OOM killing in the cluster. One was an ML workload where, previously, the main process was not killed when a child process was OOM-killed, so other computations could continue. The second was an interactive workload: for example, what used to work well with make -j20 -k || make -j10 is now OOMKilled.

We wanted to disable this feature because it was difficult to fix those workloads immediately, but there was no option to do so, so we ended up patching the kubelet to disable it.

I would like an option to disable this feature. I understand that this feature works better for many workloads. (It would be better if it could be controlled on a per-pod basis, but from the discussion so far it appears that this is not going to be provided.) Many clusters are currently running on cgroup v1; the problem will become more apparent when they switch to cgroup v2.

@tzneal (Contributor, Author) commented Dec 4, 2023

We discussed this at the SIG Node meeting before the change went in. I added an agenda item to discuss this feedback at the meeting tomorrow.

@poolnitsol commented Dec 6, 2023

We would like an option to disable this feature as well. We saw several containers restarting when one of their processes got OOMKilled. It impacted us at a massive scale, and we had to revert to cgroup v1 in the middle of our Kubernetes upgrades. We are trying to figure out what it would take to come up with a workaround if this goes live without an option to disable it.

@haircommander (Contributor):

In the SIG Node meeting, we decided to pursue a node-level kubelet config field that enables or disables this feature, to allow an admin to opt in. Stay tuned for this in 1.30.

@gdhagger commented Jan 8, 2024

@haircommander what's the best place to stay informed about progress on that feature? I don't see anything in the sig-node Google group or Slack channel that seems directly related.

@haircommander (Contributor):

It was mostly discussed in the meeting (the notes can be found here). As we enter the 1.30 cycle there will be additional documentation.

@tzneal (Contributor, Author) commented Jan 17, 2024

PR to add a flag to enable the old behavior: #122813
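
For readers following along, the node-level opt-out discussed above would live in the kubelet configuration; #122813 is the authoritative source for the actual field name and semantics. The snippet below is only an illustrative sketch of the shape such a setting could take, with the field name assumed rather than confirmed:

// Illustrative only; see #122813 for the real change. The name here is an assumption.
type KubeletConfiguration struct {
        // ... existing fields ...

        // SingleProcessOOMKill, if true, would restore the previous behavior on
        // cgroup v2 nodes: an OOM kill targets only the offending process rather
        // than every process in the container's cgroup.
        // +optional
        SingleProcessOOMKill *bool `json:"singleProcessOOMKill,omitempty"`
}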

@mbrt commented Apr 16, 2024

Hey folks!

A bit late to the party, but we upgraded to Kubernetes 1.28 last week and, like others [1, 2, 3], had to face outages due to increased OOMs. At Celonis, among other things, we run a compute and memory intensive workload (10k cores, 80TiB memory), which loads data models (>100k) into memory to serve queries. Each data model is handled by a separate process, managed within a Pod by a coordinator process. Multiple Pods are managed by an orchestrator. The old behavior allowed serving tens of thousands of models with just a few nodes in a fairly dynamic way. If one process died because of OOM, the orchestrator would reschedule it. This had no repercussions on other models served in the same Pod, it’s simple, and memory utilization is very good.

I’m sharing these details because I think this is a valid use case. Splitting each process into its own Pod is infeasible, given the sheer number of processes (up to 100k) and the variability in memory usage of a single process. Splitting them into coarser-grained Pods is not a long-term solution either, as it would only reduce the blast radius rather than eliminate the problem of a single model taking down hundreds more; resource utilization would also be lower and result in a rather expensive bill. Staying on cgroups v1 also seems infeasible long term, as I’m seeing discussions about dropping support in Kubernetes, systemd, and AWS Bottlerocket (which we’re using). We were also already using cgroups v2, and downgrading to v1 seems to fix the problem in the wrong way.

I agree that the new behavior seems like a saner default. I’m afraid, however, that we’re treating the previous use case as no longer valid, and that is the deeper question we would like to address.

In summary, I have the following concerns:

  1. The proposed mitigations seem to go along the lines of "adding a flag" to give people more time to migrate. Those would also be coming 2 releases late (at best), while we’re leaving people reliant on the previous behavior exposed without a real mitigation.
  2. There seems to be no way forward for workloads managing their own memory in a granular way through sub-processes.
  3. Solutions to the previous point all sound like workarounds that are not viable long term (I only see [Feature Proposal]: Ability to Configure Whether cgroupv2's group OOMKill is Used at the Pod Level #124253 (comment) being proposed, but it comes as part of a much bigger feature, which makes it hard to know when and if it will land).

Is there anything we could do to mitigate the impact of this change? For example, short term, backporting the new kubelet flag to 1.28 or 1.29, and longer term, expediting a subset of solution 3 targeted at OOMs only?

These are just ideas, and I’m happy to take other suggestions back to our teams as well. I hope SIG-Node can take this into consideration (cc @haircommander, @tzneal).

Thanks!

@xhejtman:

Hello,

Would it be possible to mark the pod status as OOMKilled when any OOM kill occurs in the pod? It seems that the OOMKilled status is set only if the OOM kill is received by PID 1 in the container.
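
For reference, the place where the OOMKilled reason currently surfaces is the container status termination state, which (as the question above notes) today tends to reflect the container's main process. A minimal client-go sketch, assuming an already-configured clientset is passed in (the package and function names are illustrative), that reports containers whose last termination was an OOM kill:

// Minimal sketch: list containers in a pod whose last termination reason was
// "OOMKilled". Assumes a configured client-go clientset is supplied by the caller.
package oomcheck

import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
)

func reportOOMKills(ctx context.Context, cs kubernetes.Interface, namespace, podName string) error {
        pod, err := cs.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
        if err != nil {
                return err
        }
        for _, st := range pod.Status.ContainerStatuses {
                // LastTerminationState.Terminated carries the reason and exit code
                // of the most recent container termination, if any.
                if t := st.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
                        fmt.Printf("container %s was OOMKilled (exit code %d)\n", st.Name, t.ExitCode)
                }
        }
        return nil
}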

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/code-generation
area/kubelet
area/test
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/api-change: Categorizes issue or PR as related to adding, removing, or otherwise changing an API.
kind/feature: Categorizes issue or PR as related to a new feature.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
release-note: Denotes a PR that will be considered when it comes time to generate release notes.
sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[SIG-Node] Cgroups v2 memory.oom.group support (kill all pids in a pod on oom)