Description
What would you like to be added?
Add a new field to the Pod API which disables group OOM kills (see https://github.com/kubernetes/kubernetes/pull/117793 for the current behavior).
Specifically, I propose adding the optional field `DisableCgroupGroupKill *bool` to `Pod` (see `kubernetes/pkg/apis/core/types.go`, line 3995 at 9791f0d).
The default value will be equivalent to false, which means the default behavior is respected (currently, group kills are enabled whenever cgroup v2 is available). When set to true, it will disable memory.oom.group
for the given container if cgroup v2 is available, and it will be a no-op on Windows or if cgroup v2 is not available.
However, I'm fairly indifferent to the specifics here, and I fully expect someone to be able to propose something more idiomatic :)
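To make the proposal concrete, here is a minimal sketch of roughly what the field could look like. It assumes the field lands on the container spec (whether it lives on `Pod` or per-container is exactly the kind of detail I'd expect to change), and the name and doc comment are just placeholders:

```go
package core

// Sketch only: one possible shape for the proposed field, not a final design.
type Container struct {
	// ...existing fields elided...

	// DisableCgroupGroupKill, when true, tells the kubelet not to set
	// memory.oom.group for this container's cgroup, so the OOM killer
	// targets individual offending processes rather than the whole
	// container. nil/false keeps the current default (group kill when
	// cgroup v2 is available); the field is a no-op on Windows or when
	// cgroup v2 is unavailable.
	// +optional
	DisableCgroupGroupKill *bool
}
```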
Why is this needed?
Context
cgroup v2 added the memory.oom.group option, which, when enabled, tells the OOMKiller to kill the entire cgroup (container) instead of targeting individual processes. Generally, this is desirable behavior, so a change was made to the kubelet to opt in to this whenever cgroup v2 is available.
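For readers unfamiliar with the knob, a small illustrative sketch of what the kubelet is effectively toggling (the path assumes the usual cgroup v2 mount inside a container; normally the runtime/kubelet, not the application, owns this file):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// On a cgroup v2 host, each cgroup exposes memory.oom.group.
	// "1" means an OOM kills every process in the cgroup; "0" keeps the
	// older behavior of killing only the offending process.
	data, err := os.ReadFile("/sys/fs/cgroup/memory.oom.group")
	if err != nil {
		fmt.Println("memory.oom.group not available (cgroup v1 or Windows?):", err)
		return
	}
	fmt.Println("memory.oom.group =", strings.TrimSpace(string(data)))
}
```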
However, users then reported that this change caused issues for them (e.g. here). Prior to this change, the OOMKiller would only ever target a single process instead of an entire container, and there are various ways that workloads can either implicitly or explicitly rely upon this behavior. So, there's consensus that something needs to be done to support users who rely upon this.
SIG-Node's original proposed solution was to add a config option to the kubelet to allow admins to disable this behavior, although that PR has sat idle, I believe largely because consensus wasn't reached that this was the right way to go.
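For comparison, the kubelet-level alternative would amount to something like a new KubeletConfiguration field. This is only a sketch, and the field name is hypothetical rather than what that PR actually proposes:

```go
package kubeletconfig

// Sketch of the kubelet-level alternative: one node-wide switch.
type KubeletConfiguration struct {
	// ...existing fields elided...

	// SingleProcessOOMKill (hypothetical name), when true, would stop the
	// kubelet from setting memory.oom.group for every container on the
	// node, restoring per-process OOM kills node-wide rather than per
	// workload.
	// +optional
	SingleProcessOOMKill *bool
}
```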
Why at the Container Level
I think that it is more appropriate to put this configuration at the container-level instead of at the kubelet-level. The only counter-argument I've heard is that making changes to the kubelet is far easier than making changes to the Pod API (this is of course a valid and important point, although my view is that this is important enough to deserve proper support).
The situations where disabling this is beneficial are specific to individual containers/applications, and have almost no bearing on the rest of the node or cluster. Therefore, only someone deeply familiar with the specific application will be in a position to determine whether to set this flag. Any solution to this problem which places the configuration at the kubelet level is ugly and overly burdensome on developers, who often won't even know who to contact in order to make a kubelet configuration change.
Why is it Important
As noted above, most will likely agree that, in principle, the container level is the more appropriate place for this configuration. However, the main pushback might be that this issue doesn't warrant introducing such an API change.
The most obvious reason why this is important is that this is how Kubernetes (at least on Linux) has worked up until very recently, meaning that most likely many workloads are relying on this behavior in some way or another. Obviously avoiding disruption to previously working workloads is important, and, for the above reasons, I believe that a container-level configuration is the only serious way to do so.
Besides backwards-compatibility concerns, I also think there are scenarios where it is correct and beneficial for the OOMKiller to target individual offending processes instead of the entire container. This is usually the case for any multi-process application where:
- Almost always, the offending process will be an isolated child process
- The application is in a position to efficiently recover from a child-process OOM, and OOMs perhaps even have application-level implications
That is, most well-designed applications using a main-process/child-process architecture!
The specific scenario where I've personally relied upon this, and which convinced me that this is important enough to deserve first-class support, is:
I have an application which reads from a work queue to handle RPC calls. These RPC calls can contain arbitrary queries (it's a legacy system, please don't judge me :)). So the queries can be arbitrarily large, and there's no good way to detect this ahead of time. Obviously, OOM kills are a serious concern here. To make matters worse, we need to ensure each query is handled at least once; therefore, if a worker which is handling a task ever goes down, another worker needs to pick up the task, because that worker could have gone down or disconnected for any number of reasons (this is a common pattern/concern in such worker-consumer architectures). So, if a query is bad and causes OOMs, then the best we can do is process it a few times. Only after we've detected that we've handled it a certain number of times can we decide the query is bad and return an error. Of course, bringing down multiple Pods just to handle such queries is awful!
The solution is simple: have a main process which handles housekeeping (e.g. heartbeating, reading from the queue, etc.), and a worker process which handles the actual query. >>99% of the time, it will be the worker process that receives the OOMKill. The main process can handle this by immediately marking the query as "bad", returning an error, and spinning up a new process to continue handling requests. This simple change cut down Pod crashes by orders of magnitude! Of course, this relies upon the OOMKiller only targeting single processes, and it will stop working if my cloud team ever introduces instances which have cgroup v2.
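To illustrate the pattern (all names here are made up for the example, not my actual system): the main process delegates each query to a child process and treats a SIGKILL exit, which is how an OOM kill shows up, as "this query is bad", without the container ever restarting. A rough Go sketch:

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
)

// runQuery runs one query in a separate child process so that, when the
// OOM killer targets individual processes (memory.oom.group disabled),
// only the child dies and the main process survives to report the failure.
// "worker-binary" and the queue/heartbeat plumbing are hypothetical.
func runQuery(query string) error {
	cmd := exec.Command("./worker-binary", query)
	err := cmd.Run()
	if err == nil {
		return nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		if status, ok := exitErr.Sys().(syscall.WaitStatus); ok &&
			status.Signaled() && status.Signal() == syscall.SIGKILL {
			// Very likely an OOM kill of the worker: mark the query as bad
			// instead of letting the whole container go down.
			return fmt.Errorf("query %q likely OOM-killed: %w", query, err)
		}
	}
	return err
}

func main() {
	// The main process handles housekeeping (heartbeats, reading the queue)
	// and hands each query to a fresh worker process.
	if err := runQuery("SELECT * FROM legacy_table"); err != nil {
		fmt.Println("worker failed:", err)
		// ...return an error for this query and continue with the next one...
	}
}
```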
In my mind, this is not an abuse of containers or Kubernetes, and it is a use case deserving of first-class support in order to ensure that it will continue to work into the future.