kubelet: new kubelet config option for disabling group oom kill #122813
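For orientation, here is a minimal sketch of the kind of KubeletConfiguration field this PR is discussing. The field name `SingleProcessOOMKill` and its exact shape are illustrative assumptions, not the naming settled in this PR (the name itself is discussed below).

```go
// Hypothetical sketch only: the field name and shape are assumptions for
// illustration, not the API agreed on in this PR.
package config

// KubeletConfiguration (heavily elided) with one possible form of the new knob.
type KubeletConfiguration struct {
	// ...existing fields elided...

	// SingleProcessOOMKill, when true, would ask the kubelet not to enable the
	// cgroup v2 group OOM kill behavior (memory.oom.group) for container
	// cgroups, so an OOM kills only the selected process rather than every
	// process in the container.
	SingleProcessOOMKill *bool `json:"singleProcessOOMKill,omitempty"`
}
```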
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: tzneal. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/cc @haircommander
Also, I have no attachment to the name.
/remove-sig api-machinery
@tzneal As someone who's made 0 contributions to K8s, I just wanted to chime in with my opinion here (won't be offended if it's dismissed out of hand :)). I think this configuration should live at the container level, not at the kubelet level. Whether a single process or the entire container should be killed depends entirely on the specific application, and likely doesn't have much of an impact outside the specific container in question. So the people writing the config for a container are going to be in a far better position to determine its behavior than admins setting the kubelet config. Of course, nothing here precludes adding the container-level configuration in addition to this one.
If this has agreement on direction from kubelet component owners (since there's no linked design, it's not clear to me that it does), get the kubelet packages reviewed / approved, then relabel for api-review for the mechanical flag / config plumbing. /remove-label api-review
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Is there a specific use-case that needs to be supported? If so, what is it? I kind of miss the point of supporting a broken cgroup v1 behavior for backward-compatibility's sake at a point in time when cgroup v1 is deprecated. Another thing to mention is that the killed process won't get restarted automatically; something would need to either restart the process inside the container or restart the container itself when needed.
I think there are lots of situations where not automatically killing the entire container is desirable behavior. The most common example I've seen others point to is postgres, which is designed to let the main postmaster process do cleanup in case a child process crashes. A hard SIGKILL to everything is undesirable from a stability perspective.

A pattern I've used is having a main process on a container which handles light housekeeping logic (e.g. reading from a queue, heartbeating) and a child process which does the heavy lifting of handling a request. I've even been forced to use this pattern in Python applications due to GIL-locking issues. In this case, it's generally going to be the child process that is the proximate cause of the OOM, and most likely what the kernel is going to target. In this pattern, the main process can detect that the request itself is "bad" because it resulted in a killed process, and can handle that appropriately (e.g. return an error, discard the message, etc.). If the entire container were OOMKilled, there's no opportunity to do any housekeeping, and the rest of the system has no insight into what went wrong (maybe the pod disconnected due to a network issue?), meaning I need to retry the job. As you note, the child process will need to be restarted, but most process pool libraries handle this out of the box, so it isn't much of an issue.

With all that being said, I stand by my comment above that I don't believe this configuration belongs in the kubelet, since the situations where this is desirable behavior are very application/container-specific. I'd rather have the configuration live at that level, and I'm worried that having the configuration at the kubelet level as well would muddy the waters.
We have a bunch of processes that currently have a Python wrapper program managing a task through a bunch of phases that call out to other binaries. If any of those binaries fails due to a segfault (or, more frequently, an OOM event), the Python code is responsible for seeing that it OOMed and updating our records of how that job failed. There are definitely cons to this setup, and more K8s-native ways to reimplement it, but it's a big lift to refactor our processes to handle this in the near to medium term. I'm also certain that we're not the only folks relying on this mode of operation.
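A minimal sketch, in Go rather than the commenters' Python, of the supervision pattern described in the two comments above. It assumes the kernel OOM killer terminates the child with SIGKILL, which the parent can observe from the wait status; the worker binary path and flags are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"os/exec"
	"syscall"
)

// runTask launches one unit of heavy work as a child process and reports
// whether it was killed by a signal (as the kernel OOM killer does when group
// OOM kill is disabled), so the supervising process can record the failure
// and keep running.
func runTask(bin string, args ...string) error {
	cmd := exec.Command(bin, args...)
	err := cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		if status, ok := exitErr.Sys().(syscall.WaitStatus); ok && status.Signaled() {
			// SIGKILL here usually means the OOM killer (or an operator) hit
			// the child; only this task is lost, the supervisor survives.
			return fmt.Errorf("task killed by signal %v", status.Signal())
		}
	}
	return err
}

func main() {
	// Hypothetical worker binary and arguments, for illustration only.
	if err := runTask("/usr/local/bin/heavy-worker", "--job", "42"); err != nil {
		// Housekeeping: record the failure, requeue or discard the job, and
		// keep the main process alive.
		log.Printf("job failed: %v", err)
	}
}
```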
Thank you @mkarrmann and @gdhagger for the detailed explanation!
If so, I'd say that it's not a broken behavior, but rather a feature that's needed for exceptional use-cases and that needs to be used with care.
Since this behavior is expected to be valuable only for exceptional use-cases, I tend to agree that we shouldn't define it on the kubelet (=node) level, unless there's a use-case for turning it on for the whole node that I can't think of.
I agree that the container level is best, but then I would suggest reverting #117793. I just upgraded our cluster and our workloads were fatally affected, and there is no way around it. #117793 (comment)
@superbrothers I think the community consensus is that cgroup-aware OOM kills are the better default behavior. The K8s docs have always been written in a way that implies this is the case, so arguably it was a bug that this wasn't the default. It also takes forethought to take advantage of a non-cgroup-aware OOM killer, which most devs won't do.

The fact that #117793 broke workloads is unfortunate, but I also think the damage is done. Now that it's been released, there are likely workloads that implicitly depend on that behavior, so reverting it would break those.

I'm glad to see we agree that container-level config would be best :) I'm brand new to K8s development, but I have a goal of shipping container-level configuration for this flag. Hopefully that can be completed and out the door relatively soon, so that all use cases can be satisfied.
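For reference, on cgroup v2 the behavior being debated here is controlled by the per-cgroup memory.oom.group file: 1 means the OOM killer kills every process in the cgroup as a unit, 0 keeps the single-victim behavior. Below is a small sketch of toggling that knob directly, with a hypothetical cgroup path; in practice the kubelet and runtime arrange this when the container cgroup is set up, and the sketch only shows the kernel interface involved.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// setGroupOOMKill flips the cgroup v2 memory.oom.group knob for one cgroup:
// "1" makes the kernel OOM killer kill the whole cgroup as a unit, "0" keeps
// the classic single-victim behavior that this PR wants to remain available.
func setGroupOOMKill(cgroupDir string, group bool) error {
	val := "0"
	if group {
		val = "1"
	}
	return os.WriteFile(filepath.Join(cgroupDir, "memory.oom.group"), []byte(val), 0o644)
}

func main() {
	// Hypothetical container cgroup path, for illustration only.
	dir := "/sys/fs/cgroup/kubepods.slice/kubepods-pod1234.slice/cri-containerd-abc.scope"

	if cur, err := os.ReadFile(filepath.Join(dir, "memory.oom.group")); err == nil {
		fmt.Printf("current memory.oom.group: %s\n", strings.TrimSpace(string(cur)))
	}
	if err := setGroupOOMKill(dir, false); err != nil {
		fmt.Fprintf(os.Stderr, "failed to disable group OOM kill: %v\n", err)
	}
}
```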
@mkarrmann Hi, Matt 👋
I think this is the community consensus. It would take a lot of time and effort to get this configuration to the container level. I fear that many users will not be able to migrate from cgroup v1 to v2 (kubernetes/enhancements#4572) in the meantime.
@utam0k Hello! I personally am unsure whether introducing and documenting a feature that is intended to be deprecated soon is a great idea (I don't know whether there's precedent for that in K8s). I'm also quite sure that the kubelet solution will be overly burdensome for the majority of developers who wish to configure this (I bet in most orgs, the vast majority of devs don't even know who to reach out to to discuss making a kubelet change on their nodes). Upon reflection, I think I would actually prefer temporarily reverting #117793, as opposed to introducing this config to compensate. But that's all just my personal opinion and I don't have much conviction on this point; I'm definitely not going to object to anything trying to unblock users. I also grant your point that shipping the container-level config may simply take too long.
I agree that this should not be a kubelet/node-level setting. One problem with kubelet flags is that users don't have good visibility into them, and in the worst case nodes might be configured differently. A possible way to provide per-container control would be QoS-resources (KEP). The OOM kill behavior would be a Kubernetes-managed QoS-resource with two possible settings (classes): group kill or process kill. It would also have the advantage of allowing per-cluster/per-node/per-namespace defaults, and the user could always see what behavior their applications get.
I think we need to decide whether this is a behavior/setting we want to support long-term, or simply a temporary crutch to help with the transition from cgroup v1 to v2. If it's the latter, we should not be adding anything to the k8s core API.
I've opened an issue describing my proposal: #124253. I tried to thoroughly articulate why it's important that first-class support for this behavior is provided.
@tzneal @haircommander @mrunalp @dchen1107
For the record, I largely retract my original comment about not wanting this merged. It will take a while for a proper solution to land, and I agree this is important enough that it deserves a sooner fix (even if this solution will be inconvenient for many).

One suggestion, @tzneal: make the config of type

I still think it's important to eventually have an option for configuring this per Pod in the K8s API, in which case it'd be best to avoid confusion or ambiguity regarding how that will interact with this feature.
@mkarrmann Thanks for your understanding and effort 🙏 I totally agree with you.
Closing this for now as I don't have time at the moment. |
@tzneal Thanks for your effort 🙏 I understand your situation. Can I take over this PR? |
Sure thing, please do! |
What type of PR is this?
/kind feature