
Add full cgroup v2 swap support with automatically calculated swap limit for LimitedSwap and Burstable QoS Pods #118764

Merged 6 commits into kubernetes:master on Jul 18, 2023

Conversation

@iholder101 (Contributor) commented Jun 20, 2023

What type of PR is this?

/kind feature
/sig node

What this PR does / why we need it:

Adds full support for node swap running on cgroup v2, for both Limited and Unlimited swap behaviors as per KEP 2400 [1].

As the KEP dictates, support for cgroup v1 is removed.

Also per the KEP, when LimitedSwap is enabled the swap limit is automatically calculated for
Burstable QoS pods. For Best-Effort / Guaranteed QoS pods, swap is disabled.

The formula for the swap limit for Burstable QoS pods is:
`(<memory-request>/<node-memory-capacity>)*<node-swap-capacity>`.

Containers with memory limits equal to their memory requests are also unable to access swap.
This provides a convenient way to opt out of swap usage when LimitedSwap is enabled.

[1] https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md
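
To make the calculation concrete, here is a minimal sketch of the limit computation in Go (illustrative only, not the kubelet's actual code; the function name and signature are made up):

package main

import (
	"fmt"
	"math"
)

// swapLimit implements (<memory-request>/<node-memory-capacity>)*<node-swap-capacity>.
// All values are in bytes.
func swapLimit(memoryRequest, nodeMemoryCapacity, nodeSwapCapacity int64) int64 {
	return int64(math.Round(float64(memoryRequest) / float64(nodeMemoryCapacity) * float64(nodeSwapCapacity)))
}

func main() {
	// Roughly the numbers from test case #3 below: a 512M request on a node
	// with ~400Gi of memory and ~40Gi of swap.
	fmt.Println(swapLimit(512_000_000, 400<<30, 40<<30)) // 51200000 bytes, ~0.05Gi
}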

Which issue(s) this PR fixes:

Fixes #119427
Fixes #119428
Fixes #119430

Special notes for your reviewer:

This PR introduces a commit to fix a cadvisor bug. The bug was fixed upstream, but cadvisor has not had a new release since then. For testing purposes the commit is kept as part of the PR, but it should eventually be removed in favor of bumping cadvisor's version.

UPDATE: a new cadvisor version with the fix is now consumed via this PR: #119225.

Does this PR introduce a user-facing change?

Add full cgroup v2 swap support for both Limited and Unlimited swap.

When LimitedSwap is enabled, the swap limit will be automatically calculated for
Burstable QoS pods. For Best-Effort / Guaranteed QoS pods, swap will be disabled.

Containers with memory requests equal to their memory limits will not have
swap access either; this is a way to opt out of swap for a single container.

The formula for the swap limit for Burstable QoS pods is:
`(<memory-request>/<node-memory-capacity>)*<node-swap-capacity>`.

Support for cgroup v1 is removed.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/

Manual testing and sample output

To ease the review process and give a clearer idea of how this feature functions,
I'll present some sample outputs and show swap usage in action.

In the following test cases, I'll be using the following test pod:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: c1
    resources:
      requests:
        memory: "512M"
    image: quay.io/mabekitzur/fedora-with-stress-ng:12-3-2022
    command: ["/bin/bash"]
    args: ["-c", "sleep 9999"]

The container image is based on the latest Fedora with stress-ng installed, which is useful for testing swap behavior.
In addition, I'm using `k` as an alias for ./cluster/kubectl.sh for simplicity.

Test case #1 - sanity check: without swap enabled

Create the pod:

> k create -f pod.yaml
pod/test-pod created

Check the swap memory limit:

> k exec -it test-pod -- "cat" "/sys/fs/cgroup/memory.swap.max"
0

Swap is disabled (the limit equals 0).
Before this PR, the limit used to be max, meaning there was no limitation on swap usage; this PR fixes that.

Now, let's stress the pod to check whether swap is being used. To do that, we run the following two commands in parallel.
First, continuously print the swap usage on that Pod:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done

Meanwhile, let's stress the pod to force swapping out pages:

> k exec -it test-pod -- bash
[root@test-pod /]> stress-ng --vm-bytes 525158751436b --vm-keep -m 1

The current swap usage is as follows:

> k exec -it test-pod -c c1 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...

As can be seen, swap is not being used, as expected.

Test case #2 - with unlimited swap

First, before deploying the cluster, the following environment variable has to be defined:

> export FEATURE_GATES="NodeSwap=true"

Create the pod:

> k create -f pod.yaml
pod/test-pod created

Check the swap memory limit:

> k exec -it test-pod -- "cat" "/sys/fs/cgroup/memory.swap.max"
max

This is now set to max as expected.

Now, let's stress the pod to show swap being used. To do that, we run the following two commands in parallel.
First, continuously print the swap usage on that Pod:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done

Meanwhile, let's stress the pod to force swapping out pages:

> k exec -it test-pod -- bash
[root@test-pod /]> stress-ng --vm-bytes 525158751436b --vm-keep -m 1

Looking back at the command that continuously prints the swap usage, we see:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done
0
# ... many zeros
0
...
2653233152
6713556992
9481170944
12596342784
16045711360
19471413248
22949982208
26457309184
30129651712
33872568320
37397729280
41005793280

As can be seen, swap usage grows steadily over time.

Test case #3 - with limited swap and a Burstable QoS pod

First, before deploying the cluster, the following environment variables have to be defined:

> export FEATURE_GATES="NodeSwap=true"
> export LIMITED_SWAP=true

Note that LIMITED_SWAP is a new environment variable introduced in this PR.

Create the pod:

> k create -f pod.yaml
pod/test-pod created

As per KEP-2400, the swap limit is automatically calculated for Burstable QoS pods. As written in the KEP, the formula is: (<memory-request>/<node-memory-capacity>)*<node-swap-capacity>. On my system, I have approximately 400Gi of total memory and a swap size of approximately 40Gi. Therefore, the swap limit should be approximately (0.5Gi/400Gi)*40Gi == 0.05Gi.

Let's check the swap memory limit that was set automatically:

> k exec -it test-pod -- "cat" "/sys/fs/cgroup/memory.swap.max"
54435840

Since 54,435,840 bytes == (54,435,840/1024^3)Gi ~= 0.05Gi, we can tell that the limit was set as expected.

Now, as before, let's stress the pod and see the actual swap usage:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done
0
# ... many zeros
0
...
54435840
54435840
54435840
54435840
...

As we can see, the pod cannot exceed its swap limit, as expected.

Test case #4 - with limited swap and a Burstable QoS pod with memory limits == requests

In this case the KEP dictates that no swap memory should be allocated to the container.
This makes it easy to opt out of swap.

Note: the same behavior applies to Best-Effort / Guaranteed QoS pods.
Note: for this example I've changed the pod spec above so that memory limits are equal to requests, as shown below.
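
For reference, the modified pod manifest would look roughly like this (a sketch of the change; only the limits stanza differs from the manifest above):

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: c1
    resources:
      requests:
        memory: "512M"
      limits:
        memory: "512M"
    image: quay.io/mabekitzur/fedora-with-stress-ng:12-3-2022
    command: ["/bin/bash"]
    args: ["-c", "sleep 9999"]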

As before, we'll deploy the cluster with the NodeSwap feature gate and LimitedSwap enabled, and create the pod:

> export FEATURE_GATES="NodeSwap=true"
> export LIMITED_SWAP=true
> k create -f pod.yaml
pod/test-pod created

Check the swap memory limit:

> k exec -it test-pod -- "cat" "/sys/fs/cgroup/memory.swap.max"
0

Swap is disabled (the limit equals 0).

If we repeat the same steps as before, stressing the container:

> k exec -it test-pod -c c1 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...

As can be seen, swap is not being used, as expected.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 20, 2023
@k8s-ci-robot commented:

Hi @iholder101. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bart0sh bart0sh added this to Triage in SIG Node PR Triage Jun 20, 2023
@iholder101 iholder101 marked this pull request as ready for review June 22, 2023 14:16
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 22, 2023
@iholder101 (Author) commented:

/hold

The following bugfix commit is temporary: 7c015fa. It mimics the bugfix that was merged to cadvisor here: google/cadvisor#3293.

Since there is no release containing the fix yet, the commit is added so the PR can be tested and reviewed in the meantime.

Please don't hesitate to start reviewing the PR. I won't remove the hold until this commit is removed, which will happen once a cadvisor release takes place and we can consume it in Kubernetes.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 22, 2023
@harche (Contributor) commented Jun 22, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 22, 2023
@iholder101 (Author) commented:

/cc @pacoxu

@k8s-ci-robot k8s-ci-robot requested a review from pacoxu June 26, 2023 15:02
@swatisehgal (Contributor) commented:

/triage accepted
/priority important-soon
This work is captured as a prerequisite for Beta graduation of node swap support. The feature is already tracked in the SIG Node planning doc and is targeted for the 1.28 release.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 27, 2023
@SergeyKanzhelev (Member) left a comment:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 17, 2023
@k8s-ci-robot commented:

LGTM label has been added.

Git tree hash: 352425308b6d2102f5bfe2b6868394c8b0ebeea1

@mrunalp (Contributor) commented Jul 17, 2023

@dims ptal

@thockin (Member) left a comment:

/approve

for hack

@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iholder101, mrunalp, SergeyKanzhelev, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 18, 2023
@k8s-triage-robot commented:

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot k8s-ci-robot merged commit da2fdf8 into kubernetes:master Jul 18, 2023
14 of 15 checks passed
SIG Node CI/Test Board automation moved this from Triage to Done Jul 18, 2023
SIG Node PR Triage automation moved this from Triage to Done Jul 18, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jul 18, 2023
// memorySwapLimit = total permitted memory+swap; if equal to memory limit, => 0 swap above memory limit
// Some swapping is still possible.
// Note that if memory limit is 0, memory swap limit is ignored.
lcr.MemorySwapLimitInBytes = lcr.MemoryLimitInBytes
A reviewer (Member) commented on this line:

I suspect this line failed some tests on nodes without swap enabled, such as some Ubuntu images:
#119467.
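
For context on why this assignment disables swap entirely (my understanding, not code from this PR): in the CRI, MemorySwapLimitInBytes covers memory plus swap, whereas cgroup v2's memory.swap.max counts swap alone. Runtimes therefore convert between the two roughly as follows, modeled on runc's cgroup v2 conversion (an illustrative sketch; the name is made up):

// swapMaxV2 derives the cgroup v2 swap-only limit from the CRI's
// combined memory+swap limit.
func swapMaxV2(memorySwapLimit, memoryLimit int64) int64 {
	// With MemorySwapLimitInBytes == MemoryLimitInBytes this yields 0,
	// i.e. memory.swap.max = 0 and the container cannot use swap at all.
	return memorySwapLimit - memoryLimit
}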
