[POC]: Support automatically configured Limited Swap support for cgroups v1 and v2 #117392
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @iholder101. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: iholder101
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@@ -263,6 +263,7 @@ func (m *MachineInfo) Clone() *MachineInfo {
 		NumSockets:     m.NumSockets,
 		CpuFrequency:   m.CpuFrequency,
 		MemoryCapacity: m.MemoryCapacity,
+		SwapCapacity:   m.SwapCapacity,
Maybe we can start sending PR(s) to cAdvisor to make sure it has all the bits and pieces to support swap?
You commented too fast :D
Here's the fix: google/cadvisor#3293 on upstream cAdvisor.
I'll also add it to the commit description.
Obviously, if we choose to go with this implementation, we would have to delete this commit and wait for the fix to be merged.
@@ -103,7 +103,7 @@ func (m *kubeGenericRuntimeManager) generateLinuxContainerResources(pod *v1.Pod,
 	// NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:
 	// https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
 	switch m.memorySwapBehavior {
-	case kubelettypes.UnlimitedSwap:
+	case "", kubelettypes.UnlimitedSwap:
this looks like a change in behavior and, if so, needs discussion and docs.
In any case, please elaborate in the commit message why this change is beneficial
You're right, no need for this behavior change.
Now the "default" case is configuring "NoSwap", as before.
As a bonus, it now covers both v1 and v2 cases.
Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
After this commit, when LimitedSwap is enabled, containers get swap access limited with respect to the Pod's QoS, the memory request proportion, and the swap capacity proportion. The swap limitation is calculated in the following way:
1. Only pods of Burstable QoS have swap access. Other pods are limited to 0 swap usage.
2. Let "totalPodMemory" be the sum of all of the Pod's containers' memory requests.
3. Let "requestedMemoryProportion" be a container's memory request divided by "totalPodMemory".
4. Let "swapMemoryProportion" be the swap capacity divided by the node's total memory capacity.
5. Let "swapLimit" be the product of the three values above, that is: requestedMemoryProportion * swapMemoryProportion * containerMemRequest
"swapLimit" then serves as the swap limitation for that container.
Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
PR for the fix at upstream cAdvisor: google/cadvisor#3293 Signed-off-by: Itamar Holder <[email protected]>
Force-pushed from 38857e7 to 46b0dfb.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This work-in-progress PR needs a rebase. Please rebase if this PR is still needed for the upcoming release.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
* After 90d of inactivity, lifecycle/stale is applied
* After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
* After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
* Reopen this PR with /reopen
* Mark this PR as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What type of PR is this?
/kind feature
Note: One of this PR's commits fixes a bug in cAdvisor regarding cloning MachineInfo. It is fixed in this PR for the purposes of testing the POC, but a fix has also been sent to upstream cAdvisor: google/cadvisor#3293.
Overview and motivation
This PR serves as a POC for supporting LimitedSwap, configured automatically on a QoS basis, which works on both cgroups v1 and v2.
This is important for two reasons: not only is LimitedSwap currently not supported at all on cgroups v2, but in the current implementation [2] it is effectively turned off for cgroups v1 as well. This is nicely explained in an in-code comment [3].
This PR (unlike my previous POC [5]) avoids introducing new APIs. Instead, the swap limitation is calculated automatically based on the Pod's QoS, the swap size, and the container's and Pod's memory requests. See the section below for full details.
[1] https://kubernetes.io/blog/2021/08/09/run-nodes-with-swap-alpha/
[2] https://github.com/kubernetes/kubernetes/blob/v1.27.0-alpha.3/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go#L96
[3] https://github.com/kubernetes/kubernetes/blob/v1.27.0-alpha.3/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go#LL99
[4] https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md?plain=1#L542
[5] #116637
How swap memory allocation is calculated
As mentioned above, this PR does not introduce any new APIs. Instead, the swap limit is calculated automatically and transparently.
Note: swap access is granted only to pods of Burstable QoS. Guaranteed QoS pods are usually higher-priority pods, therefore we want to avoid swap's performance penalty for them. Best-Effort pods, on the contrary, are low-priority pods that are the first to be killed under node pressure. In addition, they're unpredictable, therefore it's hard to assess how much swap memory is a reasonable amount to allocate for them.
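A pod's QoS class, and therefore whether it is granted swap under this scheme, can be read straight from its status; for example (the pod name is illustrative):

```bash
# Prints "Guaranteed", "Burstable" or "BestEffort"; only Burstable pods get swap here.
kubectl get pod test-pod -o jsonpath='{.status.qosClass}'
```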
Let C be the container that's being configured. Let's define the following:

* `totalPodMemory`: the sum of all the pod's containers' memory requests.
* `CMemory`: C's memory request.
* `requestedMemoryProportion`: `CMemory` divided by `totalPodMemory`.
* `swapMemoryProportion`: the node's swap capacity divided by the node's memory capacity.
* `swapLimit`: C's swap limitation.

Then, `swapLimit` is calculated as follows:

`swapLimit = requestedMemoryProportion * swapMemoryProportion * CMemory`
That is, a Burstable QoS pod gets swap allocation proportional to its memory request and the swap/memory capacity on the node.
Every container in that pod gets swap allocation proportionate to its memory request.
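To make this concrete, here is a small worked example with assumed numbers (they are not taken from the test environment below): a node with 64Gi of memory and 16Gi of swap, and a Burstable pod whose two containers request 1Gi and 3Gi, with C being the 1Gi container.

```bash
# Assumed values, in bytes
node_memory_capacity=$((64 * 1024 * 1024 * 1024))   # 64Gi of node memory
swap_capacity=$((16 * 1024 * 1024 * 1024))           # 16Gi of swap
c_memory_request=$((1 * 1024 * 1024 * 1024))         # container C requests 1Gi
total_pod_memory=$((4 * 1024 * 1024 * 1024))         # 1Gi + 3Gi across the pod's containers

# The three factors from the definitions above
requested_memory_proportion=$(echo "$c_memory_request / $total_pod_memory" | bc -l)      # 0.25
swap_memory_proportion=$(echo "$swap_capacity / $node_memory_capacity" | bc -l)          # 0.25
swap_limit=$(echo "$requested_memory_proportion * $swap_memory_proportion * $c_memory_request" | bc -l)

printf 'swapLimit for C: %.0f bytes\n' "$swap_limit"  # 67108864 bytes, i.e. 64Mi
```

With these numbers, C would be allowed up to 64Mi of swap, while its 3Gi sibling would get (3/4) * (1/4) * 3Gi = 576Mi.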
Test Environment
* The following is used to bring up the local cluster:
CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' ./hack/local-up-cluster.sh
I'm using the following pod for testing: the container image is based on the latest Fedora, but also installs stress-ng, which is useful for testing swap behavior. The container does nothing but sleep for a very long time. I'm going to refer to this pod as `pod.yaml` in the sections below.
This Pod will be slightly changed with respect to the relevant test case. For simplicity, I will show only the relevant changes rather than the whole pod.
In addition, I'm using `k` instead of `./cluster/kubectl.sh`. I've also added a commit with debug logs in order to make it easier to follow the calculations behind the scenes.
Tests
All of the following tests were done on an environment running cgroups v2.
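Before walking through the cases, a quick sanity check can confirm that the node really is on cgroup v2 and actually has swap enabled. These commands are generic and not part of the PR:

```bash
# Prints "cgroup2fs" when the unified cgroup v2 hierarchy is mounted
stat -fc %T /sys/fs/cgroup

# Show the active swap devices/files and overall memory/swap usage on the node
swapon --show
free -h
```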
Case #1 - sanity check: without swap enabled
Create the pod:
> k create -f pod.yaml
pod/test-pod created
Check the swap memory limit:
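For illustration, assuming cgroup v2 and a runtime that gives the container a private cgroup namespace, the limit can be read from inside the container (`c1` is a container of the test pod; the exact way to reach the cgroup files may differ in other setups):

```bash
# memory.swap.max holds the cgroup v2 swap limit for the container's cgroup:
# "0" means no swap is allowed, "max" means unlimited swap.
k exec test-pod -c c1 -- cat /sys/fs/cgroup/memory.swap.max
```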
See that swap is disabled (equals 0).
Before this PR, it used to be `max`, which means there was no limitation on swap usage. This is fixed in this PR.
Now, let's try to stress the pod to see whether swap is being used. To achieve that, we can run the following two commands in parallel.
First, continuously print the swap usage on that Pod:
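For example, a simple polling loop over the cgroup v2 `memory.swap.current` file (the interval, and the container name as above, are illustrative):

```bash
# Print the container's current swap usage (bytes) once per second
k exec test-pod -c c1 -- sh -c 'while true; do cat /sys/fs/cgroup/memory.swap.current; sleep 1; done'
```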
Meanwhile, let's stress the pod to force swapping out pages:
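For example, with stress-ng (the allocation size and duration are assumptions; the idea is to allocate more anonymous memory than the container can keep resident, so that pages get swapped out whenever swap is actually available to it):

```bash
# Allocate and keep dirtying ~900M of anonymous memory for 5 minutes
k exec test-pod -c c1 -- stress-ng --vm 1 --vm-bytes 900M --vm-keep --timeout 300s
```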
C1:
C1:
As can be seen, swap is not being used.
Case #2 - With unlimited swap
First, before deploying the cluster, the following environment variable has to be defined:
Create the pod:
> k create -f pod.yaml
pod/test-pod created
Check the swap memory limit:
This is now set to `max`, as expected.
In addition, kubelet logs show:
Now, let's try to stress the pod in order to show swap is being used. To achieve that, we can run the following two commands in parallel.
First, continuously print the swap usage on that Pod:
Meanwhile, let's stress the pod to force swapping out pages:
When we look back at the previous command continuously printing the swap usage we see:
As can be seen, the swap usage is getting larger and larger as time goes by.
Case #3 - With limited swap and a Burstable QoS Pod
First, before deploying the cluster, the following environment variable has to be defined:
Note that `LIMITED_SWAP` is a new environment variable introduced for the purposes of this POC in this commit.
Create the pod:
> k create -f pod.yaml
pod/test-pod created
Kubelet logs show:
Check the swap memory limit:
The swap limitation is configured correctly.
Now, let's repeat the previous test's procedure. This is the output:
Running stress (done for both c1 and c2 containers):
C1:
C1:
As can be seen, swap usage gets close to the limit but, as expected, does not exceed it.
Case #4 - With limited swap and a Best-Effort / Guaranteed QoS Pod
As explained above, in these cases we expect that the containers wouldn't have swap access whatsoever.
Best-Effort QoS pod manifest (partial):
Guaranteed QoS pod manifest (partial):
Create the pod:
> k create -f pod.yaml
pod/test-pod created
Kubelet logs show:
Check the swap memory limit:
Now, let's try to stress the pod to show that swap is not being used. To achieve that, we can run the following two commands in parallel.
First, continuously print the swap usage on that Pod:
Meanwhile, let's stress the pod to force swapping out pages:
C1:
C1: