
use the cgroup aware OOM killer if available #117793

Merged: 1 commit merged into kubernetes:master on Jun 12, 2023

Conversation

@tzneal (Contributor) commented May 5, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Sets the memory.oom.group bit for container cgroups on kernels with cgroups v2.

Which issue(s) this PR fixes:

Fixes #117070

Special notes for your reviewer:

Does this PR introduce a user-facing change?

If using cgroups v2, then the cgroup aware OOM killer will be enabled for container cgroups via `memory.oom.group`. This causes processes within the cgroup to be treated as a unit and killed simultaneously in the event of an OOM kill on any process in the cgroup.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

https://github.com/torvalds/linux/blob/3c4aa44343777844e425c28f1427127f3e55826f/Documentation/admin-guide/cgroup-v2.rst?plain=1#L1280
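
For context on what this toggles: memory.oom.group is a per-cgroup control file in cgroup v2, and writing 1 to it asks the kernel to treat the cgroup and its descendants as a single unit when the OOM killer fires. Below is a minimal sketch of that mechanism at the filesystem level only; the cgroup path and function name are hypothetical, and the PR itself plumbs the setting through the kubelet and container runtime rather than writing the file directly.

// Illustrative only: enabling the cgroup-aware OOM killer for one cgroup v2
// directory by writing "1" to its memory.oom.group file. The path below is a
// made-up example; it is not how the kubelet applies the setting.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func enableGroupOOMKill(cgroupDir string) error {
	// "1" tells the kernel to kill the whole cgroup as a unit on OOM.
	p := filepath.Join(cgroupDir, "memory.oom.group")
	if err := os.WriteFile(p, []byte("1"), 0o644); err != nil {
		return fmt.Errorf("setting memory.oom.group on %s: %w", cgroupDir, err)
	}
	return nil
}

func main() {
	// Hypothetical container cgroup path, for illustration only.
	if err := enableGroupOOMKill("/sys/fs/cgroup/kubepods.slice/example.scope"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}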

@k8s-ci-robot added the release-note, kind/feature, size/L, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, and needs-priority labels on May 5, 2023
@tzneal (Contributor, Author) commented May 5, 2023

/sig node

@k8s-ci-robot added the sig/node, area/kubelet, area/test, and sig/testing labels and removed the do-not-merge/needs-sig label on May 5, 2023
@tzneal force-pushed the memory-oom-group-support branch 3 times, most recently from 97c3c08 to ab4192d on May 5, 2023 01:54
@bart0sh added this to Triage in SIG Node PR Triage on May 5, 2023
@tzneal force-pushed the memory-oom-group-support branch 2 times, most recently from 4e8beb7 to bfb463b on May 5, 2023 13:54
@bobbypage (Member) commented May 5, 2023

Thanks for making the change. I agree this would be nice to have; my only concern is that some workloads may want to maintain the current behavior and avoid a cgroup-wide OOM kill. I agree this is a better default for the majority of workloads, but does it make sense to have some way for a workload to opt out of this behavior? Some folks mention that some workloads (e.g. PostgreSQL, nginx) do handle this case correctly, xref: #50632 (comment)

/cc @mrunalp @haircommander @giuseppe

for thoughts

@k8s-ci-robot (Contributor):

@bobbypage: GitHub didn't allow me to request PR reviews from the following users: for, thoughts.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Thanks for making the change. I agree this would be nice to have; my only concern is that some workloads may want to maintain the current behavior and avoid a cgroup-wide OOM kill. I agree this is a better default for the majority of workloads, but does it make sense to have some way for a workload to opt out of this behavior? Some folks mention that some workloads (e.g. PostgreSQL, nginx) do handle this case correctly, xref: #50632 (comment)

/cc @mrunalp @haircommander @giuseppe for thoughts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@haircommander (Contributor):

Yeah, at this point we have an established behavior of not doing this (even if it's the right thing). Unfortunately, if we want it tunable, we may need to introduce a knob in the pod API.

@tzneal (Contributor, Author) commented May 5, 2023

Yeah, at this point we have an established behavior of not doing this (even if it's the right thing). Unfortunately, if we want it tunable, we may need to introduce a knob in the pod API.

What's the consensus on enabling the behavior by default if it's tunable? E.g. something like:

diff --git a/staging/src/k8s.io/api/core/v1/types.go b/staging/src/k8s.io/api/core/v1/types.go
index 567592efd7c..a18ff2b3400 100644
--- a/staging/src/k8s.io/api/core/v1/types.go
+++ b/staging/src/k8s.io/api/core/v1/types.go
@@ -2534,6 +2534,11 @@ type Container struct {
        // Default is false.
        // +optional
        TTY bool `json:"tty,omitempty" protobuf:"varint,18,opt,name=tty"`
+       // Whether an OOM kill on a process in the container should only kill that process, or all
+       // processes in the container if possible.
+       // Default is false.
+       // +optional
+       OOMSingleProcess bool `json:"oomSingle,omitempty" protobuf:"varint,19,opt,name=oomSingle"`
 }

Happily accepting better naming suggestions :)

@haircommander (Contributor):

I am happy to put it on by default. As another thought (which I worry will unnecessarily increase complexity): are there other cgroup fields we want to optionally enable? Would an opaque key-value field like memoryCgroupOptions be useful? I am not immediately seeing other options that the kubelet wouldn't already make opinionated decisions about (like memory.high), so maybe we don't need the complexity. I am just wary of adding too many fields to the pod API, so I want to think it through before adding one for just this field.
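
Purely to illustrate the idea floated above (this is not an actual proposal, and the field name is invented), an opaque key-value field on the container spec might look roughly like this:

// Hypothetical sketch only; MemoryCgroupOptions is not a real pod API field.
type Container struct {
        // ... existing fields ...

        // MemoryCgroupOptions would pass selected memory-controller settings
        // through to the container's cgroup, e.g. {"oom.group": "0"}.
        // +optional
        MemoryCgroupOptions map[string]string `json:"memoryCgroupOptions,omitempty"`
}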

@dims (Member) commented Jun 12, 2023

/approve
/lgtm

Merging this to get more soak time in the CI.

@k8s-ci-robot added the lgtm label on Jun 12, 2023
@k8s-ci-robot (Contributor):

LGTM label has been added.

Git tree hash: da72515532e8b5504b621294ee68cfd297d90760

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, SergeyKanzhelev, tzneal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jun 12, 2023
@k8s-ci-robot merged commit 86d7860 into kubernetes:master on Jun 12, 2023
13 checks passed
SIG Node CI/Test Board automation moved this from Archive-it to Done Jun 12, 2023
SIG Node PR Triage automation moved this from Needs Reviewer to Done Jun 12, 2023
@k8s-ci-robot added this to the v1.28 milestone on Jun 12, 2023
@tzneal deleted the memory-oom-group-support branch on June 13, 2023 13:42
bertinatto added a commit to bertinatto/kubernetes that referenced this pull request Jul 18, 2023
@superbrothers (Member) commented Dec 3, 2023

This feature affected several of our workloads and caused a lot of OOM killing in the cluster. One was an ML workload where, previously, the main process was not killed when a child process was OOM-killed, so other computations could continue. The second was an interactive workload: for example, what used to work well with make -j20 -k || make -j10 is now OOMKilled.

We wanted to disable this feature because it was difficult to fix those workloads immediately, but there was no option to do so, so we ended up patching the kubelet to disable it.

I would like an option to disable this feature. I understand that this feature works better for many workloads. (It would be better if it could be controlled on a per-pod basis, but from the discussion so far it appears that this is not going to be provided.) Many clusters are currently running on cgroup v1; the problem will become more apparent when they switch to cgroup v2.

@tzneal (Contributor, Author) commented Dec 4, 2023

We discussed this at the SIG Node meeting before the change went in. I added an agenda item to discuss this feedback at the meeting tomorrow.

@poolnitsol commented Dec 6, 2023

We would like an option to disable this feature as well. We saw several containers restarting when one of their processes got OOMKilled. It impacted us at a massive scale, and we had to revert to cgroup v1 in the middle of our Kubernetes upgrades. We are trying to figure out what it would take to come up with a workaround if this goes live without an option to disable it.

@haircommander (Contributor):

In the SIG Node meeting, we decided to pursue a node-level kubelet config field that enables or disables this feature, to allow an admin to opt in. Stay tuned for this in 1.30.

@gdhagger commented Jan 8, 2024

@haircommander what's the best place to stay informed about progress on that feature? I don't see anything in the sig-node Google group or Slack channel that seems directly related.

@haircommander (Contributor):

It was mostly discussed in the meeting (the notes can be found here). As we enter the 1.30 cycle there will be additional documentation.

@tzneal (Contributor, Author) commented Jan 17, 2024

PR to add a flag to enable the old behavior: #122813
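
For readers following along, the node-level opt-out discussed above would live in the kubelet configuration; #122813 is the authoritative source for the actual field name and semantics. The snippet below is only an illustrative sketch of the shape such a setting could take, with the field name assumed rather than confirmed:

// Illustrative only; see #122813 for the real change. The name here is an assumption.
type KubeletConfiguration struct {
        // ... existing fields ...

        // SingleProcessOOMKill, if true, would restore the previous behavior on
        // cgroup v2 nodes: an OOM kill targets only the offending process rather
        // than every process in the container's cgroup.
        // +optional
        SingleProcessOOMKill *bool `json:"singleProcessOOMKill,omitempty"`
}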

@mbrt commented Apr 16, 2024

Hey folks!

A bit late to the party, but we upgraded to Kubernetes 1.28 last week and, like others [1, 2, 3], had to face outages due to increased OOMs. At Celonis, among other things, we run a compute and memory intensive workload (10k cores, 80TiB memory), which loads data models (>100k) into memory to serve queries. Each data model is handled by a separate process, managed within a Pod by a coordinator process. Multiple Pods are managed by an orchestrator. The old behavior allowed serving tens of thousands of models with just a few nodes in a fairly dynamic way. If one process died because of OOM, the orchestrator would reschedule it. This had no repercussions on other models served in the same Pod, it’s simple, and memory utilization is very good.

I’m sharing these details because I think this is a valid use case. Splitting each process into its own Pod is infeasible, given the sheer number of processes (up to 100k) and the variability in memory usage of a single process. Splitting them into coarser-grained Pods is not a long-term solution either, as it would only reduce the blast radius rather than eliminate the problem of a single model taking down hundreds more; resource utilization would also be lower and result in a rather expensive bill. Staying on cgroups v1 also seems infeasible long term, as I’m seeing discussions about dropping support in Kubernetes, systemd, and AWS Bottlerocket (which we’re using). We were also already using cgroups v2, and downgrading to v1 seems to fix the problem in the wrong way.

I agree that the new behavior seems like a saner default. I’m afraid, however, that we’re treating the previous use case as no longer valid, and that is the deeper question we would like to address.

In summary, I have the following concerns:

  1. The proposed mitigations seem to go along the lines of "adding a flag" to give people more time to migrate. Those would also be coming 2 releases late (at best), while we’re leaving people reliant on the previous behavior exposed without a real mitigation.
  2. There seems to be no way forward for workloads managing their own memory in a granular way through sub-processes.
  3. Solutions to the previous point all sound like workarounds that are not viable long term (I only see [Feature Proposal]: Ability to Configure Whether cgroupv2's group OOMKill is Used at the Pod Level #124253 (comment) being proposed, but it comes as part of a much bigger feature, which makes it hard to know when and if it will land).

Is there anything we could do to mitigate the impact of this change? For example, short term, backporting the new kubelet flag to 1.28 or 1.29, and longer term, expediting a subset of solution 3 targeted at OOMs only?

These are just ideas, and I’m happy to take other suggestions back to our teams as well. I hope SIG-Node can take this into consideration (cc @haircommander, @tzneal).

Thanks!

@xhejtman:

Hello,

Would it be possible to mark the pod status as OOMKilled when any OOM kill occurs in the pod? It seems that the OOMKilled status is set only if the OOM kill is received by PID 1 in the container.
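
For reference, the place where the OOMKilled reason currently surfaces is the container status termination state, which (as the question above notes) today tends to reflect the container's main process. A minimal client-go sketch, assuming an already-configured clientset is passed in (the package and function names are illustrative), that reports containers whose last termination was an OOM kill:

// Minimal sketch: list containers in a pod whose last termination reason was
// "OOMKilled". Assumes a configured client-go clientset is supplied by the caller.
package oomcheck

import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
)

func reportOOMKills(ctx context.Context, cs kubernetes.Interface, namespace, podName string) error {
        pod, err := cs.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
        if err != nil {
                return err
        }
        for _, st := range pod.Status.ContainerStatuses {
                // LastTerminationState.Terminated carries the reason and exit code
                // of the most recent container termination, if any.
                if t := st.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
                        fmt.Printf("container %s was OOMKilled (exit code %d)\n", st.Name, t.ExitCode)
                }
        }
        return nil
}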

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/code-generation
area/kubelet
area/test
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/api-change: Categorizes issue or PR as related to adding, removing, or otherwise changing an API.
kind/feature: Categorizes issue or PR as related to a new feature.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
release-note: Denotes a PR that will be considered when it comes time to generate release notes.
sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[SIG-Node] Cgroups v2 memory.oom.group support (kill all pids in a pod on oom)