[POC]: Support automatically configured Limited Swap support for cgroups v1 and v2 #117392

Closed
wants to merge 5 commits

Conversation

@iholder101 (Contributor) commented Apr 16, 2023

What type of PR is this?

/kind feature

Note: One of this PR's commits fixes a bug in cAdvisor's MachineInfo cloning. It is carried in this PR for the purposes of testing the POC, and the fix has also been sent upstream to cAdvisor: google/cadvisor#3293.

Overview and motivation

This PR serves as a POC for supporting LimitedSwap, configured automatically on a QoS basis, which works on both cgroups v1 and v2.

This is important for two reasons:

  1. According to the docs [1]:
     "cgroups v2: Kubernetes workloads cannot use swap memory."

     Not only is LimitedSwap currently unsupported on cgroups v2, but in the
     current implementation [2] it is effectively turned off for v1 as well.
     This is nicely explained in an in-code comment [3].

  2. This is one of the main requirements for graduating swap to Beta [4].

This PR (unlike my previous POC [5]) avoids introducing new APIs. Instead, the swap limit is calculated automatically based on the pod's QoS, the node's swap size, and the pod's and container's memory requests. See the section below for full details.

[1] https://kubernetes.io/blog/2021/08/09/run-nodes-with-swap-alpha/
[2] https://github.com/kubernetes/kubernetes/blob/v1.27.0-alpha.3/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go#L96
[3] https://github.com/kubernetes/kubernetes/blob/v1.27.0-alpha.3/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go#LL99
[4] https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md?plain=1#L542
[5] #116637

How swap memory allocation is calculated

As mentioned above, this PR does not introduce any new APIs. Instead, the swap limit is calculated automatically and transparently.

Note: swap access is granted only to pods of Burstable QoS. Guaranteed QoS pods are usually higher-priority pods, so we want to spare them swap's performance penalty. Best-Effort pods, by contrast, are low-priority pods that are the first to be killed under node pressure. In addition, they are unpredictable, so it is hard to assess how much swap memory is reasonable to allocate for them.

Let C be the container that's being configured.
Let's define the following:

totalPodMemory: The sum of all the pod's containers memory requests.
CMemory: C's memory request.
requestedMemoryProportion: CMemory divided by totalPodMemory.
swapMemoryProportion: The node's swap capacity divided by the node's memory capacity.
swapLimit: C's swap limitation.

Then, swapLimit is calculated as follows:
swapLimit = requestedMemoryProportion * swapMemoryProportion * CMemory

That is, a Burstable QoS pod gets a swap allocation proportional to its memory requests and to the node's swap-to-memory capacity ratio, and every container in that pod gets a swap allocation proportional to its own memory request.
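
To make the calculation concrete, here is a minimal, self-contained Go sketch of it. The function name, signature, and the plain-bool QoS check are illustrative only; the PR wires this logic into the kubelet's container resources generation rather than a standalone helper like this:

package main

import "fmt"

// swapLimitBytes returns the swap limit for a single container.
// Only Burstable QoS pods get swap access; everything else gets 0.
// All sizes are in bytes.
func swapLimitBytes(isBurstablePod bool, containerMemRequest, totalPodMemRequest, nodeSwapCapacity, nodeMemCapacity int64) int64 {
	if !isBurstablePod || totalPodMemRequest == 0 || nodeMemCapacity == 0 {
		return 0
	}
	requestedMemoryProportion := float64(containerMemRequest) / float64(totalPodMemRequest)
	swapMemoryProportion := float64(nodeSwapCapacity) / float64(nodeMemCapacity)
	return int64(requestedMemoryProportion * swapMemoryProportion * float64(containerMemRequest))
}

func main() {
	// Values taken from the Case #3 kubelet logs below.
	fmt.Println(swapLimitBytes(true, 512000000, 768000000, 42949664768, 403955572736)) // ≈ 36291496 (c1)
	fmt.Println(swapLimitBytes(true, 256000000, 768000000, 42949664768, 403955572736)) // ≈ 9072874 (c2)
}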

Test Environment

  • Kubernetes: latest master branch
  • CRI-O: v1.26.2, built from source
      * The following is used to bring up the local cluster: CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' ./hack/local-up-cluster.sh

I'm using the following pod for testing:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: c1
    resources:
      requests:
        memory: "512M"
    image: quay.io/mabekitzur/fedora-with-stress-ng:12-3-2022
    command: ["/bin/bash"]
    args: ["-c", "sleep 9999"]
  - name: c2
    resources:
      requests:
        memory: "256M"
    image: quay.io/mabekitzur/fedora-with-stress-ng:12-3-2022
    command: ["/bin/bash"]
    args: ["-c", "sleep 9999"]

The container image is based on the latest Fedora with stress-ng installed, which is useful for testing swap behavior. The containers do nothing but sleep for a very long time. I'm going to refer to this pod as pod.yaml in the sections below.

This pod is slightly modified for each test case; for simplicity, only the relevant changes are shown rather than the whole manifest.

In addition, I'm using k as shorthand for ./cluster/kubectl.sh.

I've also added a commit with debug logs to make the calculations behind the scenes easier to follow.

Tests

All of the following tests were done on an environment running cgroups v2.

Case #1 - sanity check: without swap enabled

Create the pod:

> k create -f pod.yaml
pod/test-pod created

Check the swap memory limit:

> k exec -it test-pod -- "cat" "/sys/fs/cgroup/memory.swap.max"
0

Swap is disabled (the limit equals 0).
Before this PR it used to be max, which means there was no limit on swap usage; this PR fixes that.

Now, let's stress the pod to check whether swap gets used. To do that, we run the following two commands in parallel.
First, continuously print the pod's swap usage:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done

Meanwhile, let's stress the pod to force swapping out pages:

> k exec -it test-pod -- bash
[root@test-pod /]> stress-ng --vm-bytes 525158751436b --vm-keep -m 1

C1:

> k exec -it test-pod -c c1 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...

C2:

> k exec -it test-pod -c c2 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...

As can be seen, swap is not being used.

Case #2 - With unlimited swap

First, before deploying the cluster, the following environment variable has to be defined:

> export FEATURE_GATES="NodeSwap=true"

Create the pod:

> k create -f pod.yaml
pod/test-pod created

Check the swap memory limit:

> k exec -it test-pod -- "cat" "/sys/fs/cgroup/memory.swap.max"
max

This is now set to max as expected.
In addition, kubelet logs show:

[generateLinuxContainerResources()]: configuring container: c1
configuring unlimited swap

Now, let's try to stress the pod in order to show swap is being used. In order to achieve that, we can run the following two in parallel.
First, continuously print the swap usage on that Pod:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done

Meanwhile, let's stress the pod to force swapping out pages:

> k exec -it test-pod -- bash
[root@test-pod /]> stress-ng --vm-bytes 525158751436b --vm-keep -m 1

Looking back at the command that continuously prints swap usage, we see:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done
0
# ... many zeros
0
...
5034381312
9559465984
13672230912
18211680256
22897905664
27367641088
28998823936

As can be seen, swap usage keeps growing over time.

Case #3 - With limited swap and a Burstable QoS Pod

First, before deploying the cluster, the following environment variable has to be defined:

> export FEATURE_GATES="NodeSwap=true"
> export LIMITED_SWAP=true

Note that LIMITED_SWAP is a new environment variable introduced for the purposes of this POC in this commit.

Create the pod:

> k create -f pod.yaml
pod/test-pod created

Kubelet logs show:

[generateLinuxContainerResources()]: configuring container: c1
configuring limited swap
[limited swap]: totalPodMemory=768000000, containerMemoryRequest=512000000, requestedMemoryProportion=0.67
[limited swap]: node memory capacity=403955572736, swap capacity=42949664768, swap proportion=0.11
[limited swap]: swap limitation: 36291496

[generateLinuxContainerResources()]: configuring container: c2
configuring limited swap
[limited swap]: totalPodMemory=768000000, containerMemoryRequest=256000000, requestedMemoryProportion=0.33
[limited swap]: node memory capacity=403955572736, swap capacity=42949664768, swap proportion=0.11 
[limited swap]: swap limitation: 9072874

Check the swap memory limit:

> k exec -it test-pod -c c1 -- "cat" "/sys/fs/cgroup/memory.swap.max"
36290560
> k exec -it test-pod -c c2 -- "cat" "/sys/fs/cgroup/memory.swap.max"
9072640

The swap limits are configured correctly.
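
Cross-checking against the formula: for c1, (512000000 / 768000000) * (42949664768 / 403955572736) * 512000000 ≈ 36291496 bytes, and the same expression with 256000000 as the container request gives ≈ 9072874 bytes for c2, matching the logged limits. The values read back from memory.swap.max (36290560 and 9072640) appear to be these limits rounded down by the kernel to a multiple of the 4 KiB page size.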

Now, let's repeat the previous test's procedure. This is the output:

Running stress (done for both c1 and c2 containers):

> k exec -it test-pod -c c1 -- bash
[root@test-pod /]# stress-ng --vm-bytes 525158751436b --vm-keep -m 1

C1:

> k exec -it test-pod -c c1 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...
15691776
25255936
36171776
36171776
36171776

C2:

> k exec -it test-pod -c c2 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...
9072640
9072640
9072640

As can be seen, swap usage approaches the limit but, as expected, does not exceed it.

Case #4 - With limited swap and a Best-Effort / Guaranteed QoS Pod

As explained above, in these cases we expect the containers to have no swap access at all.

Best-Effort QoS pod manifest (partial):

kind: Pod
spec:
  containers:
  - name: c1
    resources: {}
  - name: c2
    resources: {}

Guaranteed QoS pod manifest (partial):

kind: Pod
spec:
  containers:
  - name: c1
    resources:
      requests:
        cpu: 1.5
        memory: "512M"
      limits:
        cpu: 1.5
        memory: "512M"
  - name: c2
    resources:
      requests:
        cpu: 1.5
        memory: "256M"
      limits:
        cpu: 1.5
        memory: "256M"

Create the pod:

> k create -f pod.yaml
pod/test-pod created

Kubelet logs show:

[generateLinuxContainerResources()]: configuring container: c1
configuring no swap

[generateLinuxContainerResources()]: configuring container: c2
configuring no swap

Check the swap memory limit:

> k exec -it test-pod -c c1 -- "cat" "/sys/fs/cgroup/memory.swap.max"
0
> k exec -it test-pod -c c2 -- "cat" "/sys/fs/cgroup/memory.swap.max"
0

Now, let's stress the pod to show that swap is not being used. To do that, we run the following two commands in parallel.
First, continuously print the pod's swap usage:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done

Meanwhile, let's stress the pod to force swapping out pages:

> k exec -it test-pod -- bash
[root@test-pod /]> stress-ng --vm-bytes 525158751436b --vm-keep -m 1

C1:

> k exec -it test-pod -c c1 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...

C2:

> k exec -it test-pod -c c2 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...

As can be seen, swap is not being used, as expected.

@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 16, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 16, 2023
@k8s-ci-robot
Contributor

Hi @iholder101. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 16, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: iholder101
Once this PR has been reviewed and has the lgtm label, please assign thockin for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 16, 2023
@iholder101 iholder101 changed the title POC]: Support automatically configured Limited Swap support for cgroups v1 and v2 [POC]: Support automatically configured Limited Swap support for cgroups v1 and v2 Apr 16, 2023
@iholder101
Contributor Author

/cc @harche @pacoxu @ffromani

@@ -263,6 +263,7 @@ func (m *MachineInfo) Clone() *MachineInfo {
NumSockets: m.NumSockets,
CpuFrequency: m.CpuFrequency,
MemoryCapacity: m.MemoryCapacity,
SwapCapacity: m.SwapCapacity,
Contributor

we can start maybe sending PR(s) to cadvisor to make sure it has all the bits and pieces to support swap?

Contributor Author

You commented too fast :D

Here's the fix: google/cadvisor#3293 on upstream cAdvisor.
I'll also add it to the commit description.

Obviously, if we choose to go with this implementation, we would drop this commit and wait for the upstream fix to be merged.

@@ -103,7 +103,7 @@ func (m *kubeGenericRuntimeManager) generateLinuxContainerResources(pod *v1.Pod,
// NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:
// https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
switch m.memorySwapBehavior {
case kubelettypes.UnlimitedSwap:
case "", kubelettypes.UnlimitedSwap:
Contributor

this looks like a change in behavior and, if so, needs discussion and docs.
In any case, please elaborate in the commit message why this change is beneficial

Contributor Author

You're right, no need for this behavior change.
Now the "default" case is configuring "NoSwap", as before.

As a bonus, it now covers both v1 and v2 cases.
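
As a rough, self-contained Go sketch of that shape (the constant values and the helper are hypothetical stand-ins, not the PR's code; the real swap-behavior constants live under pkg/kubelet/types):

package main

import "fmt"

// Hypothetical stand-ins for the kubelet swap-behavior values.
const (
	unlimitedSwap = "UnlimitedSwap"
	limitedSwap   = "LimitedSwap"
)

// describeSwapBehavior mirrors the switch discussed above: any value other
// than an explicit behavior (including the empty default) means no swap,
// for both cgroup v1 and v2.
func describeSwapBehavior(memorySwapBehavior string) string {
	switch memorySwapBehavior {
	case unlimitedSwap:
		return "unlimited swap"
	case limitedSwap:
		return "limited swap (per-container limit from QoS and memory requests)"
	default:
		return "no swap"
	}
}

func main() {
	fmt.Println(describeSwapBehavior(""))            // no swap
	fmt.Println(describeSwapBehavior(limitedSwap))   // limited swap ...
	fmt.Println(describeSwapBehavior(unlimitedSwap)) // unlimited swap
}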

After this commit, when LimitedSwap is enabled,
containers get swap access limited with respect
to the Pod's QoS, memory request proportion, and
swap capacity proportion.

The swap limitation is calculated in the following way:
1. Only pods of Burstable QoS have swap access. Other pods
are limited to 0 swap usage.
2. Let "totalPodMemory" be the sum of all of the Pod's
containers memory requests.
3. Let "requestedMemoryProportion" be the division of a
container's memory request by "totalPodMemory".
4. Let "swapMemoryProportion" be the division of the swap
capacity by the node's total memory capacity.
5. Let "swapLimit" be the multiplication of the three
values above, that is:
requestedMemoryProportion * swapMemoryProportion * containerMemRequest

swapLimit then serves as the swap limitation for that
container.

Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
PR for the fix at upstream cAdvisor:
google/cadvisor#3293

Signed-off-by: Itamar Holder <[email protected]>
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dims
Member

dims commented Oct 24, 2023

This work-in-progress PR needs a rebase. Please rebase if this PR is still needed for the upcoming release.

@dims dims added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 4, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
