Add full cgroup v2 swap support with automatically calculated swap limit for LimitedSwap and Burstable QoS Pods #118764
Conversation
Hi @iholder101. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/hold The following bugfix commit is temporary: 7c015fa. It mimics the bugfix that was merged to cadvisor here: google/cadvisor#3293. Since there is no release containing the fix yet, the commit is added so we can test and review the PR in the meantime. Please don't hesitate to start reviewing the PR; I won't remove the hold until this commit is removed. It will be removed once a cadvisor release takes place and we can consume it in Kubernetes.
/ok-to-test
/cc @pacoxu
/lgtm
/approve
LGTM label has been added. Git tree hash: 352425308b6d2102f5bfe2b6868394c8b0ebeea1
@dims ptal
/approve
for hack
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: iholder101, mrunalp, SergeyKanzhelev, thockin The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:
You can:
/retest
// memorySwapLimit = total permitted memory+swap; if equal to memory limit, => 0 swap above memory limit
// Some swapping is still possible.
// Note that if memory limit is 0, memory swap limit is ignored.
lcr.MemorySwapLimitInBytes = lcr.MemoryLimitInBytes
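Side note for reviewers: a minimal shell sketch of the semantics above, assuming the CRI convention that MemorySwapLimitInBytes is the combined memory+swap ceiling, so setting it equal to the memory limit leaves zero swap headroom. All values below are made-up examples, not taken from the kubelet:

```shell
# Assumed example: a container with a 1Gi memory limit.
memory_limit=$((1 * 1024 * 1024 * 1024))

# Set the combined memory+swap ceiling equal to the memory limit,
# mirroring the assignment in the snippet above.
memory_swap_limit=$memory_limit

# The swap the container may use above its memory limit is the difference.
swap_headroom=$(( memory_swap_limit - memory_limit ))
echo "swap headroom: $swap_headroom"
```

With the two values equal, the headroom is 0, i.e. no swap above the memory limit.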
I suspect this line failed some tests without swap enabled, like some Ubuntu.
#119467.
What type of PR is this?
/kind feature
/sig node
What this PR does / why we need it:
Adds full support for node swap running on cgroup v2, for both Limited and Unlimited swap behaviors as per KEP 2400 [1].
In addition, as the KEP dictates, support for cgroup v1 is being removed.
Also per the KEP, when LimitedSwap is enabled the swap limit is automatically calculated for
Burstable QoS pods. For Best-Effort / Guaranteed QoS pods, swap is disabled.
The formula for the swap limit for Burstable QoS pods is:
(<memory-request>/<node-memory-capacity>)*<node-swap-capacity>
In addition, containers with memory limits equal to their memory requests will not be able to access swap either.
This is a nice way to opt out of swap usage when LimitedSwap is enabled.
[1] https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md
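The formula above can be sketched numerically. This is only an illustrative calculation under assumed example values (a 0.5Gi memory request, ~400Gi of node memory, ~40Gi of swap); the real calculation lives in the kubelet:

```shell
# Hedged numeric sketch of the KEP 2400 LimitedSwap formula:
#   swap_limit = (<memory-request>/<node-memory-capacity>)*<node-swap-capacity>
# Example values only. Capacities are expressed in GiB so the POSIX
# integer arithmetic below stays within a 64-bit range.
container_request_bytes=$((512 * 1024 * 1024))   # 0.5Gi memory request
node_memory_capacity_gib=400                     # assumed ~400Gi node memory
node_swap_capacity_gib=40                        # assumed ~40Gi node swap

# Multiply before dividing to reduce integer-truncation error.
swap_limit_bytes=$(( container_request_bytes * node_swap_capacity_gib / node_memory_capacity_gib ))
echo "$swap_limit_bytes"
```

With these numbers the limit comes out to roughly 0.05Gi, i.e. the pod gets swap in proportion to its share of node memory.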
Which issue(s) this PR fixes:
Fixes #119427
Fixes #119428
Fixes #119430
Special notes for your reviewer:
This PR introduces a commit to fix a cadvisor bug. This bug was fixed upstream, although cadvisor does not seem to have had a new release since then. For testing purposes I keep the commit as part of the PR, but it should eventually be removed in favor of bumping cadvisor's version. UPDATE: a new cadvisor version with the fix is now consumed via this PR: #119225.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
Manual testing and sample output
In order to ease the review process and give a clearer idea of how this feature functions,
I'll present some sample outputs and show swap usage in action.
In the following test cases, I'll be using the following test pod:
The container image is based on the latest fedora, but also installs stress-ng. This would be useful to test swap behavior.
In addition, I'm using k instead of ./cluster/kubectl.sh for simplicity.
Test case #1 - sanity check: without swap enabled
Create the pod:
> k create -f pod.yaml
pod/test-pod created
Check the swap memory limit:
See that swap is disabled (equals to 0).
Before this PR, it used to be max, which means there was no limitation on swap usage. This is fixed in this PR.
Now, let's stress the pod to check whether swap is being used. To do so, we can run the following two commands in parallel.
First, continuously print the swap usage on that Pod:
Meanwhile, let's stress the pod to force swapping out pages:
The current swap usage is as follows:
As can be seen, swap is not being used, as expected.
Case #2 - With unlimited swap
First, before deploying the cluster, the following environment variable has to be defined:
Create the pod:
> k create -f pod.yaml
pod/test-pod created
Check the swap memory limit:
This is now set to max, as expected.
Now, let's stress the pod in order to show swap being used. To do so, we can run the following two commands in parallel.
First, continuously print the swap usage on that Pod:
Meanwhile, let's stress the pod to force swapping out pages:
When we look back at the previous command continuously printing the swap usage we see:
As can be seen, the swap usage is getting larger and larger as time goes by.
Case #3 - With limited swap and a Burstable QoS Pod
First, before deploying the cluster, the following environment variable has to be defined:
Note that LIMITED_SWAP is a new environment variable introduced in this PR.
Create the pod:
> k create -f pod.yaml
pod/test-pod created
As per KEP 2400, the swap limit is automatically calculated for Burstable QoS pods. As written in the KEP, the formula is:
(<container-request>/<node-memory-capacity>)*<swap-size>
On my system, I have approximately 400Gi of total memory and a swap size of approximately 40Gi. Therefore, the swap limit needs to be approximately (0.5Gi/400Gi)*40Gi == 0.05Gi.
Let's check the swap limit that was set automatically:
Check the swap memory limit:
Since 54,435,840 bytes == (54,435,840/1024^3)Gi ~= 0.05Gi, we can tell that the limit was set as expected.
Now, as before, let's stress the pod and see the actual swap usage:
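For reference, here is the byte-to-GiB conversion as a quick arithmetic check (assuming awk is available):

```shell
# Convert the observed cgroup limit (54,435,840 bytes) to GiB.
gib=$(awk 'BEGIN { printf "%.3f\n", 54435840 / (1024 * 1024 * 1024) }')
echo "$gib GiB"
```

The result is about 0.051 GiB, matching the value the formula predicts for a 0.5Gi request on this node.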
As we can see, the pod cannot exceed its swap limit, as expected.
Case #4 - With limited swap and a Burstable QoS Pod with memory limits==requests
In this case the KEP dictates that no swap memory should be allocated to the container.
This way it's easy to opt-out of swap.
Note: the same behavior would apply to Best-Effort / Guaranteed QoS Pods.
Note: for this example I've changed the pod example above so that memory limits are equal to requests
As before, we'll deploy the cluster with the NodeSwap feature gate and LimitedSwap, and create the pod:
> k create -f pod.yaml
pod/test-pod created
Check the swap memory limit:
See that swap is disabled (equals to 0).
If we repeat the same steps as before, meaning stressing the container:
As can be seen, swap is not being used, as expected.
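The opt-out behavior in this case can be sketched together with the Burstable formula as one small decision. This is only an illustration with made-up values (0.5Gi request and limit, ~400Gi memory, ~40Gi swap), not the kubelet's actual code:

```shell
# Hedged sketch of the LimitedSwap decision for a single container:
# if memory limit == memory request (as in this test case), swap is disabled;
# otherwise the Burstable formula applies. Example values only.
memory_request=$((512 * 1024 * 1024))
memory_limit=$((512 * 1024 * 1024))   # equal to the request, as in this test case
node_memory_gib=400
node_swap_gib=40

if [ "$memory_limit" -eq "$memory_request" ]; then
  # Opting out: no swap is allocated to this container.
  swap_limit=0
else
  swap_limit=$(( memory_request * node_swap_gib / node_memory_gib ))
fi
echo "$swap_limit"
```

With limit == request the computed swap limit is 0, which matches the memory.swap.max value observed above.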