[POC]: Support automatically configured Limited Swap support for cgroups v1 and v2 #117392

Closed
wants to merge 5 commits

Conversation

@iholder101 (Contributor) commented Apr 16, 2023

What type of PR is this?

/kind feature

Note: One of this PR's commits fixes a bug in cAdvisor's MachineInfo cloning. It is carried in this PR for the purposes of testing the POC, and the fix has also been sent upstream to cAdvisor: google/cadvisor#3293.

Overview and motivation

This PR serves as a POC for supporting LimitedSwap, configured automatically on a QoS basis, which works on both cgroups v1 and v2.

This is important for two reasons:

  1. According to the docs [1]:
     "cgroups v2: Kubernetes workloads cannot use swap memory."

     Not only is LimitedSwap currently unsupported on cgroups v2, but in the
     current implementation [2] it is effectively turned off for v1 as well.
     This is nicely explained in an in-code comment [3].

  2. This is one of the main requirements for graduating swap to Beta [4].

This PR (unlike my previous POC [5]) avoids introducing new APIs. Instead, the swap limit is calculated automatically based on the pod's QoS, the node's swap size, and the pod's and container's memory requests. See the section below for full details.

[1] https://kubernetes.io/blog/2021/08/09/run-nodes-with-swap-alpha/
[2] https://github.com/kubernetes/kubernetes/blob/v1.27.0-alpha.3/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go#L96
[3] https://github.com/kubernetes/kubernetes/blob/v1.27.0-alpha.3/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go#LL99
[4] https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md?plain=1#L542
[5] #116637

How swap memory allocation is calculated

As mentioned above, this PR does not introduce any new APIs. Instead, the swap limit is calculated automatically and transparently.

Note: swap access is granted only to pods of Burstable QoS. Guaranteed QoS pods are usually higher-priority pods, so we want to spare them swap's performance penalty. Best-Effort pods, by contrast, are low-priority pods that are the first to be killed under node pressure. In addition, they are unpredictable, so it is hard to assess how much swap memory is reasonable to allocate for them.

Let C be the container that's being configured.
Let's define the following:

totalPodMemory: The sum of all the pod's containers memory requests.
CMemory: C's memory request.
requestedMemoryProportion: CMemory divided by totalPodMemory.
swapMemoryProportion: The node's swap capacity divided by the node's memory capacity.
swapLimit: C's swap limitation.

Then, swapLimit is calculated as follows:
swapLimit = requestedMemoryProportion * swapMemoryProportion * CMemory

That is, a Burstable QoS pod gets a swap allocation proportional to its memory requests and to the node's swap-to-memory capacity ratio, and every container in that pod gets a swap allocation proportional to its own memory request.
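
To make the calculation concrete, here is a minimal, self-contained Go sketch of it. The function name, signature, and the plain-bool QoS check are illustrative only; the PR wires this logic into the kubelet's container resources generation rather than a standalone helper like this:

package main

import "fmt"

// swapLimitBytes returns the swap limit for a single container.
// Only Burstable QoS pods get swap access; everything else gets 0.
// All sizes are in bytes.
func swapLimitBytes(isBurstablePod bool, containerMemRequest, totalPodMemRequest, nodeSwapCapacity, nodeMemCapacity int64) int64 {
	if !isBurstablePod || totalPodMemRequest == 0 || nodeMemCapacity == 0 {
		return 0
	}
	requestedMemoryProportion := float64(containerMemRequest) / float64(totalPodMemRequest)
	swapMemoryProportion := float64(nodeSwapCapacity) / float64(nodeMemCapacity)
	return int64(requestedMemoryProportion * swapMemoryProportion * float64(containerMemRequest))
}

func main() {
	// Values taken from the Case #3 kubelet logs below.
	fmt.Println(swapLimitBytes(true, 512000000, 768000000, 42949664768, 403955572736)) // ≈ 36291496 (c1)
	fmt.Println(swapLimitBytes(true, 256000000, 768000000, 42949664768, 403955572736)) // ≈ 9072874 (c2)
}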

Test Environment

  • Kubernetes: latest master branch
  • CRI-O: v1.26.2, built from source
      * The following is used to bring up the local cluster: CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' ./hack/local-up-cluster.sh

I'm using the following pod for testing:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: c1
    resources:
      requests:
        memory: "512M"
    image: quay.io/mabekitzur/fedora-with-stress-ng:12-3-2022
    command: ["/bin/bash"]
    args: ["-c", "sleep 9999"]
  - name: c2
    resources:
      requests:
        memory: "256M"
    image: quay.io/mabekitzur/fedora-with-stress-ng:12-3-2022
    command: ["/bin/bash"]
    args: ["-c", "sleep 9999"]

The container image is based on the latest Fedora with stress-ng installed, which is useful for testing swap behavior. The containers do nothing but sleep for a very long time. I'm going to refer to this pod as pod.yaml in the sections below.

This pod is slightly modified for each test case; for simplicity, only the relevant changes are shown rather than the whole manifest.

In addition, I'm using k as shorthand for ./cluster/kubectl.sh.

I've also added a commit with debug logs to make the calculations behind the scenes easier to follow.

Tests

All of the following tests were done on an environment running cgroups v2.

Case #1 - sanity check: without swap enabled

Create the pod:

> k create -f pod.yaml
pod/test-pod created

Check the swap memory limit:

> k exec -it test-pod -- "cat" "/sys/fs/cgroup/memory.swap.max"
0

Swap is disabled (the limit equals 0).
Before this PR it used to be max, which means there was no limit on swap usage; this PR fixes that.

Now, let's stress the pod to check whether swap gets used. To do that, we run the following two commands in parallel.
First, continuously print the pod's swap usage:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done

Meanwhile, let's stress the pod to force swapping out pages:

> k exec -it test-pod -- bash
[root@test-pod /]> stress-ng --vm-bytes 525158751436b --vm-keep -m 1

C1:

> k exec -it test-pod -c c1 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...

C2:

> k exec -it test-pod -c c2 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...

As can be seen, swap is not being used.

Case #2 - With unlimited swap

First, before deploying the cluster, the following environment variable has to be defined:

> export FEATURE_GATES="NodeSwap=true"

Create the pod:

> k create -f pod.yaml
pod/test-pod created

Check the swap memory limit:

> k exec -it test-pod -- "cat" "/sys/fs/cgroup/memory.swap.max"
max

This is now set to max as expected.
In addition, kubelet logs show:

[generateLinuxContainerResources()]: configuring container: c1
configuring unlimited swap

Now, let's try to stress the pod in order to show swap is being used. In order to achieve that, we can run the following two in parallel.
First, continuously print the swap usage on that Pod:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done

Meanwhile, let's stress the pod to force swapping out pages:

> k exec -it test-pod -- bash
[root@test-pod /]> stress-ng --vm-bytes 525158751436b --vm-keep -m 1

Looking back at the command that continuously prints swap usage, we see:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done
0
# ... many zeros
0
...
5034381312
9559465984
13672230912
18211680256
22897905664
27367641088
28998823936

As can be seen, swap usage keeps growing over time.

Case #3 - With limited swap and a Burstable QoS Pod

First, before deploying the cluster, the following environment variable has to be defined:

> export FEATURE_GATES="NodeSwap=true"
> export LIMITED_SWAP=true

Note that LIMITED_SWAP is a new environment variable introduced for the purposes of this POC in this commit.

Create the pod:

> k create -f pod.yaml
pod/test-pod created

Kubelet logs show:

[generateLinuxContainerResources()]: configuring container: c1
configuring limited swap
[limited swap]: totalPodMemory=768000000, containerMemoryRequest=512000000, requestedMemoryProportion=0.67
[limited swap]: node memory capacity=403955572736, swap capacity=42949664768, swap proportion=0.11
[limited swap]: swap limitation: 36291496

[generateLinuxContainerResources()]: configuring container: c2
configuring limited swap
[limited swap]: totalPodMemory=768000000, containerMemoryRequest=256000000, requestedMemoryProportion=0.33
[limited swap]: node memory capacity=403955572736, swap capacity=42949664768, swap proportion=0.11 
[limited swap]: swap limitation: 9072874

Check the swap memory limit:

> k exec -it test-pod -c c1 -- "cat" "/sys/fs/cgroup/memory.swap.max"
36290560
> k exec -it test-pod -c c2 -- "cat" "/sys/fs/cgroup/memory.swap.max"
9072640

The swap limits are configured correctly.
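
Cross-checking against the formula: for c1, (512000000 / 768000000) * (42949664768 / 403955572736) * 512000000 ≈ 36291496 bytes, and the same expression with 256000000 as the container request gives ≈ 9072874 bytes for c2, matching the logged limits. The values read back from memory.swap.max (36290560 and 9072640) appear to be these limits rounded down by the kernel to a multiple of the 4 KiB page size.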

Now, let's repeat the previous test's procedure. This is the output:

Running stress (done for both c1 and c2 containers):

> k exec -it test-pod -c c1 -- bash
[root@test-pod /]# stress-ng --vm-bytes 525158751436b --vm-keep -m 1

C1:

> k exec -it test-pod -c c1 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...
15691776
25255936
36171776
36171776
36171776

C2:

> k exec -it test-pod -c c2 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...
9072640
9072640
9072640

As can be seen, swap usage approaches the limit but, as expected, does not exceed it.

Case #4 - With limited swap and a Best-Effort / Guaranteed QoS Pod

As explained above, in these cases we expect the containers to have no swap access at all.

Best-Effort QoS pod manifest (partial):

kind: Pod
spec:
  containers:
  - name: c1
    resources: {}
  - name: c2
    resources: {}

Guaranteed QoS pod manifest (partial):

kind: Pod
spec:
  containers:
  - name: c1
    resources:
      requests:
        cpu: 1.5
        memory: "512M"
      limits:
        cpu: 1.5
        memory: "512M"
  - name: c2
    resources:
      requests:
        cpu: 1.5
        memory: "256M"
      limits:
        cpu: 1.5
        memory: "256M"

Create the pod:

> k create -f pod.yaml
pod/test-pod created

Kubelet logs show:

[generateLinuxContainerResources()]: configuring container: c1
configuring no swap

[generateLinuxContainerResources()]: configuring container: c2
configuring no swap

Check the swap memory limit:

> k exec -it test-pod -c c1 -- "cat" "/sys/fs/cgroup/memory.swap.max"
0
> k exec -it test-pod -c c2 -- "cat" "/sys/fs/cgroup/memory.swap.max"
0

Now, let's stress the pod to show that swap is not being used. To do that, we run the following two commands in parallel.
First, continuously print the pod's swap usage:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done

Meanwhile, let's stress the pod to force swapping out pages:

> k exec -it test-pod -- bash
[root@test-pod /]> stress-ng --vm-bytes 525158751436b --vm-keep -m 1

C1:

> k exec -it test-pod -c c1 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...

C2:

> k exec -it test-pod -c c2 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...

As can be seen, swap is not being used, as expected.

@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 16, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 16, 2023
@k8s-ci-robot
Contributor

Hi @iholder101. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 16, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: iholder101
Once this PR has been reviewed and has the lgtm label, please assign thockin for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 16, 2023
@iholder101 iholder101 changed the title POC]: Support automatically configured Limited Swap support for cgroups v1 and v2 [POC]: Support automatically configured Limited Swap support for cgroups v1 and v2 Apr 16, 2023
@iholder101
Contributor Author

/cc @harche @pacoxu @ffromani

@@ -263,6 +263,7 @@ func (m *MachineInfo) Clone() *MachineInfo {
NumSockets: m.NumSockets,
CpuFrequency: m.CpuFrequency,
MemoryCapacity: m.MemoryCapacity,
SwapCapacity: m.SwapCapacity,
Contributor

we can start maybe sending PR(s) to cadvisor to make sure it has all the bits and pieces to support swap?

Contributor Author

You commented too fast :D

Here's the fix: google/cadvisor#3293 on upstream cAdvisor.
I'll also add it to the commit description.

Obviously, if we choose to go with this implementation, we would drop this commit and wait for the upstream fix to be merged.

@@ -103,7 +103,7 @@ func (m *kubeGenericRuntimeManager) generateLinuxContainerResources(pod *v1.Pod,
// NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:
// https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
switch m.memorySwapBehavior {
case kubelettypes.UnlimitedSwap:
case "", kubelettypes.UnlimitedSwap:
Contributor

this looks like a change in behavior and, if so, needs discussion and docs.
In any case, please elaborate in the commit message why this change is beneficial

Contributor Author

You're right, no need for this behavior change.
Now the "default" case is configuring "NoSwap", as before.

As a bonus, it now covers both v1 and v2 cases.
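
As a rough, self-contained Go sketch of that shape (the constant values and the helper are hypothetical stand-ins, not the PR's code; the real swap-behavior constants live under pkg/kubelet/types):

package main

import "fmt"

// Hypothetical stand-ins for the kubelet swap-behavior values.
const (
	unlimitedSwap = "UnlimitedSwap"
	limitedSwap   = "LimitedSwap"
)

// describeSwapBehavior mirrors the switch discussed above: any value other
// than an explicit behavior (including the empty default) means no swap,
// for both cgroup v1 and v2.
func describeSwapBehavior(memorySwapBehavior string) string {
	switch memorySwapBehavior {
	case unlimitedSwap:
		return "unlimited swap"
	case limitedSwap:
		return "limited swap (per-container limit from QoS and memory requests)"
	default:
		return "no swap"
	}
}

func main() {
	fmt.Println(describeSwapBehavior(""))            // no swap
	fmt.Println(describeSwapBehavior(limitedSwap))   // limited swap ...
	fmt.Println(describeSwapBehavior(unlimitedSwap)) // unlimited swap
}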

After this commit, when LimitedSwap is enabled,
containers get swap access limited with respect
to the Pod's QoS, memory request proportion, and
swap capacity proportion.

The swap limitation is calculated in the following way:
1. Only pods of Burstable QoS have swap access. Other pods
are limited to 0 swap usage.
2. Let "totalPodMemory" be the sum of all of the Pod's
containers memory requests.
3. Let "requestedMemoryProportion" be the division of a
container's memory request by "totalPodMemory".
4. Let "swapMemoryProportion" be the division of the swap
capacity by the node's total memory capacity.
5. Let "swapLimit" be the multiplication of the three
values above, that is:
requestedMemoryProportion * swapMemoryProportion * containerMemRequest

swapLimit then serves as the swap limitation for that
container.

Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
PR for the fix at upstream cAdvisor:
google/cadvisor#3293

Signed-off-by: Itamar Holder <[email protected]>
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dims
Member

dims commented Oct 24, 2023

This work-in-progress PR needs a rebase. Please rebase if this PR is still needed for the upcoming release.

@dims dims added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 4, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
