[KEP-2400]: Swap-aware memory eviction (with no additional APIs) #129578
Conversation
Skipping CI for Draft Pull Request.
Force-pushed from 6f9cd4d to 12438f5.
Force-pushed from b07084f to de3ab3b.
/test pull-kubernetes-node-swap-conformance-fedora-serial
Turn swap on and off and sleep for a few seconds. This helps stabilize the tests, which will start with a fresh, unused swap space. Signed-off-by: Itamar Holder <[email protected]>
Accessible swap is the amount of swap that is accessible by Kubernetes pods according to the LimitedSwap swap behavior. This PR treats accessible swap as additional node memory capacity in the context of the eviction manager making memory signal observations. Signed-off-by: Itamar Holder <[email protected]>
This PR treats used swap memory as additional memory usage in the context of the eviction manager ranking pods for eviction. In addition, a pod's accessible swap is considered an additional memory request. Signed-off-by: Itamar Holder <[email protected]>
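To make this accounting concrete, here is a minimal, self-contained sketch of the signal arithmetic these commits describe; the values and variable names are illustrative, not taken from the kubelet's eviction manager:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Hypothetical node figures; real values come from cAdvisor stats and
	// the kubelet's LimitedSwap accounting.
	memCapacity := resource.NewQuantity(4<<30, resource.BinarySI)    // 4Gi physical memory
	accessibleSwap := resource.NewQuantity(2<<30, resource.BinarySI) // 2Gi swap accessible to pods

	// Treat accessible swap as additional memory capacity when observing
	// the memory.available signal.
	effectiveCapacity := memCapacity.DeepCopy()
	effectiveCapacity.Add(*accessibleSwap)

	// Count used swap as additional memory usage.
	workingSet := resource.NewQuantity(3<<30, resource.BinarySI) // memory in use
	swapUsed := resource.NewQuantity(1<<30, resource.BinarySI)   // swap in use
	usage := workingSet.DeepCopy()
	usage.Add(*swapUsed)

	available := effectiveCapacity.DeepCopy()
	available.Sub(usage)
	fmt.Println("swap-aware memory.available:", available.String()) // 2Gi
}
```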
Force-pushed from a55c7d3 to eedc64c.
/test pull-kubernetes-node-swap-conformance-fedora-serial
@kannon92 the PR is now merged, this PR is rebased on top of it, and it seems the swap-conformance lanes are both affected and passing. Is anything missing from your side for an LGTM?
I think the code and concept look good to me. But I will leave lgtm/approval to the people who were concerned about this feature, i.e. the Google GKE folks.
Thanks @dchen1107 @yujuhong, PTAL.
/test pull-kubernetes-unit
/retest
By the way, the https://testgrid.k8s.io/sig-node-kubelet#kubelet-swap-conformance-fedora-serial jobs are flaking. These jobs do not contain this code, but it is still worth investigating why evictions on a swap-enabled node behave differently. The jobs are not flaking at all in the main lane (https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv2-node-e2e-eviction), so swap seems to impact this.
@iholder101: The following test failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
p1AccessibleSwap, p1Err := swapLimitCalculator.CalcPodSwapLimit(*p1)
p2AccessibleSwap, p2Err := swapLimitCalculator.CalcPodSwapLimit(*p2)

p1MemoryRequest.Add(*resource.NewQuantity(p1AccessibleSwap, resource.BinarySI))
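For context outside the diff, here is a hedged, self-contained sketch of the ordering rule these lines feed into; the `podStats` struct and its fields are hypothetical stand-ins for the stats-provider and `CalcPodSwapLimit` outputs:

```go
package main

import "fmt"

// podStats is a hypothetical condensed view of the values the comparator
// needs; in the PR these come from the stats provider and the swap limit
// calculator (CalcPodSwapLimit).
type podStats struct {
	name           string
	memoryUsage    int64 // working set, in bytes
	swapUsage      int64 // swap in use, in bytes
	memoryRequest  int64 // sum of container memory requests, in bytes
	accessibleSwap int64 // swap the pod may use under LimitedSwap, in bytes
}

// exceedsRequests applies the swap-aware rule: a pod ranks higher for
// eviction if memory usage + swap usage exceeds memory request + accessible swap.
func exceedsRequests(p podStats) bool {
	return p.memoryUsage+p.swapUsage > p.memoryRequest+p.accessibleSwap
}

func main() {
	p := podStats{"burstable-pod", 900 << 20, 300 << 20, 1 << 30, 128 << 20}
	fmt.Println(p.name, "exceeds requests:", exceedsRequests(p)) // true: 1200Mi > 1152Mi
}
```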
Treating two unequal resources as equal seems wrong to me as a way of addressing memory pressure.
- Swap is a less expensive resource, so it is easy to imagine larger swap spaces being provisioned than memory; disk/swap availability cannot be compared with memory usage. For example, a user could set up 40GB of swap space on a node with 4GB of physical memory. Even when 50% of that swap is implicitly allocated to a workload, the eviction condition will likely never be satisfied, because plenty of swap capacity will still be available.
- Swap is intended as an overflow mechanism, not primary memory; eviction could trigger before swap utilization even begins, especially in large swap configurations.
- Swap is significantly slower than memory, and treating both equally for handling memory pressure doesn't sound right. Kubelet's intent for eviction is to handle memory pressure, and if it were to evict a pod that would free up more swap space than memory, it wouldn't help in reducing memory pressure.
The other two criteria (pod priority and relative usage) are reasonable, but they are still affected by this fundamental equivalence problem.
I think the existing eviction formula "memory usage + swap usage > memory requests + swap requests" creates a false equivalence. If swap is to be considered for eviction, a better approach would be some form of weighted formula (as suggested by @pacoxu). Or, if we want to simplify without the weights, consider:
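As an illustration only of the weighted-formula idea, here is a minimal sketch; `swapWeight` is a hypothetical tuning factor, not a value proposed anywhere in this thread:

```go
package main

import "fmt"

// swapWeight is a hypothetical factor (< 1) that discounts swap relative to
// memory, so that freeing swap counts for less than freeing memory.
const swapWeight = 0.5

// weightedExcess computes (usage - requests) with swap discounted.
// All arguments are in bytes.
func weightedExcess(memUsage, swapUsage, memRequest, swapRequest float64) float64 {
	return (memUsage + swapWeight*swapUsage) - (memRequest + swapWeight*swapRequest)
}

func main() {
	// A pod 1GiB over its memory request but using only a quarter of its swap
	// share: the weighted formula still flags it, dominated by the memory term.
	fmt.Printf("weighted excess: %.0f bytes\n",
		weightedExcess(3<<30, 256<<20, 2<<30, 1<<30))
}
```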
I'm also thinking about how this eviction strategy aligns with a future API. Will the API be at the pod or container level? We ration swap today at the container level; if the API ends up at the pod level, the meaning of 'swap accessibility' for these resources would also change. Also, the KEP suggests considering swap as a separate resource constraint in the future. If swap becomes a first-class resource in Kubernetes with its own requests and limits, it would make sense to consider swap pressure independently of memory pressure.
As called out in earlier discussions, Kubernetes should consider adding I/O pressure as a factor in eviction decisions, which is especially relevant given swap usage's impact on node stability.
This actually makes total sense to me, and was one of the reasons I was concerned about running eviction tests on the swap-conformance lane. The failures come down to the node not entering memory pressure, which makes total sense. I think we need to revert these changes. I've opened a PR here: kubernetes/test-infra#34558. As explained in the PR:
An API for swap is a complicated discussion for many reasons: who controls swap limits (it would be dangerous to give the pod owner direct control), how to set concrete swap limits, etc. In fact, the KEP explicitly says:
I find it very problematic that, after sig-node agreed on this with a large consensus, the question is now being reopened as if nothing was agreed. It makes sense to me that the bar for swap is very high, and that the burden of proof is on me. However, I think it's essential that once we decide on the scope of a feature we respect those agreements. In any case, I'm okay with re-discussing the API, as long as we look at it from the perspective of the most minimal API needed as a first step for the current KEP to GA, while deferring everything that's not strictly necessary to follow-up KEPs, which I'll gladly work on.
I don't see swap ever becoming a first-class resource, for many reasons. This was already discussed and ruled out in the past.
I agree. This sounds like a great topic for a dedicated KEP, as the kubelet currently doesn't have any protection mechanism against I/O pressure. In addition, we can partially address this concern with documentation, since provisioning swap on a dedicated disk would greatly help in reducing I/O pressure on the system. I'd go a step further and say that the eviction manager's code is very old, and we may want to frame this follow-up KEP around more general eviction improvements.
PR needs rebase.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
What type of PR is this?
/kind feature
/sig node
Problem and background
Prior to this PR, kubelet's eviction manager completely overlooked swap memory, leading to several issues:
In addition, this PR serves as an alternative to #128137. In that PR, swap is presented as a standalone signal alongside memory. However, I believe this approach is problematic, as it attempts to completely separate memory and swap. In reality, the two are inherently connected and should be addressed as a single concern. As a small example of how the two are connected: swap is not used until memory is full.
What this PR does / why we need it:
This PR addresses the above issues by making the eviction manager swap-aware. The proposed logic is fully backward compatible and requires no additional configuration, making it completely transparent to the user.
The main idea
Let `accessible swap` be the amount of swap that is accessible by pods according to the LimitedSwap swap behavior [1]. Note that the amount of accessible swap changes over time according to the pods running on the node.
Triggering evictions: The eviction manager considers `accessible swap` as additional memory capacity.
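As a minimal sketch of how a pod's `accessible swap` can be derived under LimitedSwap, assuming the proportional rule from KEP-2400 (swap share proportional to the pod's memory request); the function name and values here are illustrative:

```go
package main

import "fmt"

// limitedSwapForPod sketches the LimitedSwap behavior: a Burstable pod's
// accessible swap is proportional to its share of the node's memory.
// All values are in bytes.
func limitedSwapForPod(podMemoryRequest, nodeMemoryCapacity, totalSwap int64) int64 {
	return int64(float64(podMemoryRequest) / float64(nodeMemoryCapacity) * float64(totalSwap))
}

func main() {
	// A pod requesting 1GiB on a 4GiB node with 2GiB of swap can access 512MiB,
	// and this figure changes as pods come and go on the node.
	fmt.Println(limitedSwapForPod(1<<30, 4<<30, 2<<30)) // 536870912
}
```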
Eviction ordering: The eviction order will be defined as follows (for more info, see the discussion here):
Which issue(s) this PR fixes:
Fixes #120800
Special notes for your reviewer:
~~Please don't mind the last commit. It cherry-picks the fix from [KEP-2400] [Bugfix]: Ensure container-level swap metrics are collected #129486 and will be removed after this PR is merged.~~ The PR is now merged and this one is rebased on top of it.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: