
Add full cgroup v2 swap support with automatically calculated swap limit for LimitedSwap and Burstable QoS Pods #118764

Merged 6 commits into kubernetes:master on Jul 18, 2023

Conversation

@iholder101 (Contributor) commented Jun 20, 2023

What type of PR is this?

/kind feature
/sig node

What this PR does / why we need it:

Adds full support for node swap running on cgroup v2, for both Limited and Unlimited swap behaviors as per KEP 2400 [1].

As the KEP dictates, support for cgroup v1 is removed.

Also per the KEP, when LimitedSwap is enabled the swap limit is automatically calculated for
Burstable QoS pods. For Best-Effort / Guaranteed QoS pods, swap is disabled.

The formula for the swap limit for Burstable QoS pods is:
`(<memory-request>/<node-memory-capacity>)*<node-swap-capacity>`.

Containers with memory limits equal to their memory requests are also unable to access swap.
This provides a convenient way to opt out of swap usage when LimitedSwap is enabled.

[1] https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md
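
To make the calculation concrete, here is a minimal sketch of the limit computation in Go (illustrative only, not the kubelet's actual code; the function name and signature are made up):

package main

import (
	"fmt"
	"math"
)

// swapLimit implements (<memory-request>/<node-memory-capacity>)*<node-swap-capacity>.
// All values are in bytes.
func swapLimit(memoryRequest, nodeMemoryCapacity, nodeSwapCapacity int64) int64 {
	return int64(math.Round(float64(memoryRequest) / float64(nodeMemoryCapacity) * float64(nodeSwapCapacity)))
}

func main() {
	// Roughly the numbers from test case #3 below: a 512M request on a node
	// with ~400Gi of memory and ~40Gi of swap.
	fmt.Println(swapLimit(512_000_000, 400<<30, 40<<30)) // 51200000 bytes, ~0.05Gi
}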

Which issue(s) this PR fixes:

Fixes #119427
Fixes #119428
Fixes #119430

Special notes for your reviewer:

This PR introduces a commit to fix a cadvisor bug. The bug was fixed upstream, but cadvisor has not had a new release since then. For testing purposes the commit is kept as part of the PR, but it should eventually be removed in favor of bumping cadvisor's version.

UPDATE: a new cadvisor version with the fix is now consumed via this PR: #119225.

Does this PR introduce a user-facing change?

Add full cgroup v2 swap support for both Limited and Unlimited swap.

When LimitedSwap is enabled, the swap limit will be automatically calculated for
Burstable QoS pods. For Best-Effort / Guaranteed QoS pods, swap will be disabled.

Containers with memory requests equal to their memory limits will not have
swap access either; this is a way to opt out of swap for a single container.

The formula for the swap limit for Burstable QoS pods is:
`(<memory-request>/<node-memory-capacity>)*<node-swap-capacity>`.

Support for cgroup v1 is removed.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/

Manual testing and sample output

To ease the review process and give a clearer idea of how this feature functions,
I'll present some sample outputs and show swap usage in action.

In the following test cases, I'll be using the following test pod:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: c1
    resources:
      requests:
        memory: "512M"
    image: quay.io/mabekitzur/fedora-with-stress-ng:12-3-2022
    command: ["/bin/bash"]
    args: ["-c", "sleep 9999"]

The container image is based on the latest Fedora with stress-ng installed, which is useful for testing swap behavior.
In addition, I'm using `k` as an alias for ./cluster/kubectl.sh for simplicity.

Test case #1 - sanity check: without swap enabled

Create the pod:

> k create -f pod.yaml
pod/test-pod created

Check the swap memory limit:

> k exec -it test-pod -- "cat" "/sys/fs/cgroup/memory.swap.max"
0

Swap is disabled (the limit equals 0).
Before this PR, the limit used to be max, meaning there was no limitation on swap usage; this PR fixes that.

Now, let's stress the pod to check whether swap is being used. To do that, we run the following two commands in parallel.
First, continuously print the swap usage on that Pod:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done

Meanwhile, let's stress the pod to force swapping out pages:

> k exec -it test-pod -- bash
[root@test-pod /]> stress-ng --vm-bytes 525158751436b --vm-keep -m 1

The current swap usage is as follows:

> k exec -it test-pod -c c1 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...

As can be seen, swap is not being used, as expected.

Test case #2 - with unlimited swap

First, before deploying the cluster, the following environment variable has to be defined:

> export FEATURE_GATES="NodeSwap=true"

Create the pod:

> k create -f pod.yaml
pod/test-pod created

Check the swap memory limit:

> k exec -it test-pod -- "cat" "/sys/fs/cgroup/memory.swap.max"
max

This is now set to max as expected.

Now, let's stress the pod to show swap being used. To do that, we run the following two commands in parallel.
First, continuously print the swap usage on that Pod:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done

Meanwhile, let's stress the pod to force swapping out pages:

> k exec -it test-pod -- bash
[root@test-pod /]> stress-ng --vm-bytes 525158751436b --vm-keep -m 1

Looking back at the command that continuously prints the swap usage, we see:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done
0
# ... many zeros
0
...
2653233152
6713556992
9481170944
12596342784
16045711360
19471413248
22949982208
26457309184
30129651712
33872568320
37397729280
41005793280

As can be seen, swap usage grows steadily over time.

Test case #3 - with limited swap and a Burstable QoS pod

First, before deploying the cluster, the following environment variables have to be defined:

> export FEATURE_GATES="NodeSwap=true"
> export LIMITED_SWAP=true

Note that LIMITED_SWAP is a new environment variable introduced in this PR.

Create the pod:

> k create -f pod.yaml
pod/test-pod created

As per KEP-2400, the swap limit is automatically calculated for Burstable QoS pods. As written in the KEP, the formula is: (<memory-request>/<node-memory-capacity>)*<node-swap-capacity>. On my system, I have approximately 400Gi of total memory and a swap size of approximately 40Gi. Therefore, the swap limit should be approximately (0.5Gi/400Gi)*40Gi == 0.05Gi.

Let's check the swap memory limit that was set automatically:

> k exec -it test-pod -- "cat" "/sys/fs/cgroup/memory.swap.max"
54435840

Since 54,435,840 bytes == (54,435,840/1024^3)Gi ~= 0.05Gi, we can tell that the limit was set as expected.

Now, as before, let's stress the pod and see the actual swap usage:

> k exec -it test-pod -- bash
[root@test-pod /]> while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 7; done
0
# ... many zeros
0
...
54435840
54435840
54435840
54435840
...

As we can see, the pod cannot exceed its swap limit, as expected.

Test case #4 - with limited swap and a Burstable QoS pod with memory limits == requests

In this case the KEP dictates that no swap memory should be allocated to the container.
This makes it easy to opt out of swap.

Note: the same behavior applies to Best-Effort / Guaranteed QoS pods.
Note: for this example I've changed the pod spec above so that memory limits are equal to requests, as shown below.
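
For reference, the modified pod manifest would look roughly like this (a sketch of the change; only the limits stanza differs from the manifest above):

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: c1
    resources:
      requests:
        memory: "512M"
      limits:
        memory: "512M"
    image: quay.io/mabekitzur/fedora-with-stress-ng:12-3-2022
    command: ["/bin/bash"]
    args: ["-c", "sleep 9999"]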

As before, we'll deploy the cluster with the NodeSwap feature gate and LimitedSwap enabled, and create the pod:

> export FEATURE_GATES="NodeSwap=true"
> export LIMITED_SWAP=true
> k create -f pod.yaml
pod/test-pod created

Check the swap memory limit:

> k exec -it test-pod -- "cat" "/sys/fs/cgroup/memory.swap.max"
0

Swap is disabled (the limit equals 0).

If we repeat the same steps as before, stressing the container:

> k exec -it test-pod -c c1 -- bash
[root@test-pod /]# while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 2; done
0
0
...

As can be seen, swap is not being used, as expected.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 20, 2023
@k8s-ci-robot commented:

Hi @iholder101. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bart0sh bart0sh added this to Triage in SIG Node PR Triage Jun 20, 2023
@iholder101 iholder101 marked this pull request as ready for review June 22, 2023 14:16
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 22, 2023
@iholder101 (Author) commented:

/hold

The following bugfix commit is temporary: 7c015fa. It mimics the bugfix that was merged to cadvisor here: google/cadvisor#3293.

Since there is no release containing the fix yet, the commit is added so the PR can be tested and reviewed in the meantime.

Please don't hesitate to start reviewing the PR. I won't remove the hold until this commit is removed, which will happen once a cadvisor release takes place and we can consume it in Kubernetes.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 22, 2023
@harche (Contributor) commented Jun 22, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 22, 2023
@iholder101 (Author) commented:

/cc @pacoxu

@k8s-ci-robot k8s-ci-robot requested a review from pacoxu June 26, 2023 15:02
@swatisehgal (Contributor) commented:

/triage accepted
/priority important-soon
This work is captured as a prerequisite for Beta graduation of node swap support. The feature is already tracked in the SIG Node planning doc and is targeted for the 1.28 release.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 27, 2023
@SergeyKanzhelev (Member) left a comment:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 17, 2023
@k8s-ci-robot commented:

LGTM label has been added.

Git tree hash: 352425308b6d2102f5bfe2b6868394c8b0ebeea1

@mrunalp (Contributor) commented Jul 17, 2023

@dims ptal

@thockin (Member) left a comment:

/approve

for hack

@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iholder101, mrunalp, SergeyKanzhelev, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 18, 2023
@k8s-triage-robot commented:

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot k8s-ci-robot merged commit da2fdf8 into kubernetes:master Jul 18, 2023
14 of 15 checks passed
SIG Node CI/Test Board automation moved this from Triage to Done Jul 18, 2023
SIG Node PR Triage automation moved this from Triage to Done Jul 18, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jul 18, 2023
// memorySwapLimit = total permitted memory+swap; if equal to memory limit, => 0 swap above memory limit
// Some swapping is still possible.
// Note that if memory limit is 0, memory swap limit is ignored.
lcr.MemorySwapLimitInBytes = lcr.MemoryLimitInBytes
A reviewer (Member) commented on this line:

I suspect this line failed some tests on nodes without swap enabled, such as some Ubuntu images:
#119467.
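
For context on why this assignment disables swap entirely (my understanding, not code from this PR): in the CRI, MemorySwapLimitInBytes covers memory plus swap, whereas cgroup v2's memory.swap.max counts swap alone. Runtimes therefore convert between the two roughly as follows, modeled on runc's cgroup v2 conversion (an illustrative sketch; the name is made up):

// swapMaxV2 derives the cgroup v2 swap-only limit from the CRI's
// combined memory+swap limit.
func swapMaxV2(memorySwapLimit, memoryLimit int64) int64 {
	// With MemorySwapLimitInBytes == MemoryLimitInBytes this yields 0,
	// i.e. memory.swap.max = 0 and the container cannot use swap at all.
	return memorySwapLimit - memoryLimit
}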
