[KEP-2400] Add full cgroup v2 swap support with automatically calculated swap limit for LimitedSwap and Burstable QoS Pods #118764
Changes from all commits: f37aec6, e4da568, a30410d, 4b6314f, 619be9c, 4321d8c
@@ -20,6 +20,9 @@ limitations under the License.
package kuberuntime

import (
    "fmt"
    cadvisorv1 "github.com/google/cadvisor/info/v1"
    kubeapiqos "k8s.io/kubernetes/pkg/apis/core/v1/helper/qos"
    "math"
    "os"
    "strconv"
@@ -46,7 +49,7 @@ func (m *kubeGenericRuntimeManager) applyPlatformSpecificContainerConfig(config
    enforceMemoryQoS := false
    // Set memory.min and memory.high if MemoryQoS enabled with cgroups v2
    if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQoS) &&
        libcontainercgroups.IsCgroup2UnifiedMode() {
        isCgroup2UnifiedMode() {
        enforceMemoryQoS = true
    }
    cl, err := m.generateLinuxContainerConfig(container, pod, uid, username, nsTarget, enforceMemoryQoS)
@@ -99,21 +102,17 @@ func (m *kubeGenericRuntimeManager) generateLinuxContainerResources(pod *v1.Pod,

    lcr.HugepageLimits = GetHugepageLimitsFromResources(container.Resources)

    if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
    if swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo); utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
Review comment: why do we need to initialize this object before checking the feature gate? Ideally there should be no change when the feature gate is not set.

Reply: As mentioned in another comment, cgroup configures unlimited swap by default. If the feature gate is turned off, there's a need to actively limit the swap usage to 0. This is also consistent with the current logic [1].
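To make the reply above concrete, this is roughly what "actively limiting swap to 0" means at the CRI level on a cgroup v2 node (illustrative snippet, not part of the diff; `memory.swap.max` is the cgroup v2 interface file for the swap limit):

```go
// Illustrative only: with the NodeSwap feature gate off, the kubelet asks the
// runtime to pin the container's cgroup v2 swap limit to zero, overriding the
// kernel/runtime default of unlimited ("max") swap.
lcr.Unified = map[string]string{
    "memory.swap.max": "0", // no swap for this container
}
```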
        // NOTE(ehashman): Behaviour is defined in the opencontainers runtime spec:
        // https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
        switch m.memorySwapBehavior {
        case kubelettypes.UnlimitedSwap:
            // -1 = unlimited swap
            lcr.MemorySwapLimitInBytes = -1
        case kubelettypes.LimitedSwap:
            fallthrough
            swapConfigurationHelper.ConfigureLimitedSwap(lcr, pod, container)
        default:
            // memorySwapLimit = total permitted memory+swap; if equal to memory limit, => 0 swap above memory limit
            // Some swapping is still possible.
            // Note that if memory limit is 0, memory swap limit is ignored.
            lcr.MemorySwapLimitInBytes = lcr.MemoryLimitInBytes
            swapConfigurationHelper.ConfigureUnlimitedSwap(lcr)
        }
    } else {
        swapConfigurationHelper.ConfigureNoSwap(lcr)
Review comment: Is this what we have today in alpha? If the feature gate is disabled, do we configure no swap? I thought we do not make any configuration in this case.

Reply: You're right, today we don't configure anything if the feature gate is disabled. However, the question is whether we want to ensure that swap is not accessible if the feature gate is off, or whether we want to let the CRI decide what to do in this case. I know that in cgroups the default is unlimited swap. WDYT?

Review comment: I am trying to understand whether this will break any customer who doesn't use the feature gate at all. Let's say somebody was running with swap enabled on the node:
- If they never enabled nor disabled the feature gate, they would have the default container runtime's behavior today. Once upgraded to 1.28, they will have the feature gate enabled by default.
- If they explicitly enabled the feature gate, they will get the same behavior with …
- If they explicitly disabled the feature gate in …

So the actual concern is not this code branch; it is more about the new default with no feature gate set. Maybe we need to rethink the default behavior from being unlimited to being whatever the Container Runtime default is? If we cannot find a quick answer here, perhaps what we can do: …

Reply: Yeah, let's keep it off by default and revisit.

Reply: @SergeyKanzhelev Thanks for elaborating. Yup, I agree.
    }

    // Set memory.min and memory.high to enforce MemoryQoS
@@ -122,7 +121,7 @@ func (m *kubeGenericRuntimeManager) generateLinuxContainerResources(pod *v1.Pod,
        memoryRequest := container.Resources.Requests.Memory().Value()
        memoryLimit := container.Resources.Limits.Memory().Value()
        if memoryRequest != 0 {
            unified[cm.MemoryMin] = strconv.FormatInt(memoryRequest, 10)
            unified[cm.Cgroup2MemoryMin] = strconv.FormatInt(memoryRequest, 10)
        }

        // Guaranteed pods by their QoS definition requires that memory request equals memory limit and cpu request must equal cpu limit.

@@ -148,7 +147,7 @@ func (m *kubeGenericRuntimeManager) generateLinuxContainerResources(pod *v1.Pod,
            }
        }
        if memoryHigh != 0 && memoryHigh > memoryRequest {
            unified[cm.MemoryHigh] = strconv.FormatInt(memoryHigh, 10)
            unified[cm.Cgroup2MemoryHigh] = strconv.FormatInt(memoryHigh, 10)
        }
    }
    if len(unified) > 0 {
@@ -171,7 +170,7 @@ func (m *kubeGenericRuntimeManager) generateContainerResources(pod *v1.Pod, cont
    enforceMemoryQoS := false
    // Set memory.min and memory.high if MemoryQoS enabled with cgroups v2
    if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQoS) &&
        libcontainercgroups.IsCgroup2UnifiedMode() {
        isCgroup2UnifiedMode() {
        enforceMemoryQoS = true
    }
    return &runtimeapi.ContainerResources{
@@ -216,7 +215,7 @@ func (m *kubeGenericRuntimeManager) calculateLinuxResources(cpuRequest, cpuLimit
    }

    // runc requires cgroupv2 for unified mode
    if libcontainercgroups.IsCgroup2UnifiedMode() {
    if isCgroup2UnifiedMode() {
        resources.Unified = map[string]string{
            // Ask the kernel to kill all processes in the container cgroup in case of OOM.
            // See memory.oom.group in https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html for
@@ -298,3 +297,94 @@ func toKubeContainerResources(statusResources *runtimeapi.ContainerResources) *k
    }
    return cStatusResources
}

// Note: this function variable is being added here so it would be possible to mock
// the cgroup version for unit tests by assigning a new mocked function into it. Without it,
// the cgroup version would solely depend on the environment running the test.
var isCgroup2UnifiedMode = func() bool {
    return libcontainercgroups.IsCgroup2UnifiedMode()
}
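Not part of the PR: because isCgroup2UnifiedMode is a package-level function variable, unit tests in the same package can pin the cgroup version. A minimal sketch of such a test; the test name is hypothetical and it relies on the file's existing imports plus the standard testing package:

```go
func TestSwapDisabledOnCgroupV1(t *testing.T) {
    // Restore the real implementation once the test finishes.
    defer func(orig func() bool) { isCgroup2UnifiedMode = orig }(isCgroup2UnifiedMode)
    isCgroup2UnifiedMode = func() bool { return false } // pretend the node uses cgroup v1

    lcr := &runtimeapi.LinuxContainerResources{MemoryLimitInBytes: 1 << 30}
    newSwapConfigurationHelper(cadvisorv1.MachineInfo{}).ConfigureUnlimitedSwap(lcr)

    // On cgroup v1 the helper falls back to "no swap": memory+swap is capped at the memory limit.
    if lcr.MemorySwapLimitInBytes != lcr.MemoryLimitInBytes {
        t.Errorf("expected swap to be disabled, got swap limit %d", lcr.MemorySwapLimitInBytes)
    }
}
```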
type swapConfigurationHelper struct {
    machineInfo cadvisorv1.MachineInfo
}

func newSwapConfigurationHelper(machineInfo cadvisorv1.MachineInfo) *swapConfigurationHelper {
    return &swapConfigurationHelper{machineInfo: machineInfo}
}

func (m swapConfigurationHelper) ConfigureLimitedSwap(lcr *runtimeapi.LinuxContainerResources, pod *v1.Pod, container *v1.Container) {
    podQos := kubeapiqos.GetPodQOS(pod)
    containerDoesNotRequestMemory := container.Resources.Requests.Memory().IsZero() && container.Resources.Limits.Memory().IsZero()
    memoryRequestEqualsToLimit := container.Resources.Requests.Memory().Cmp(*container.Resources.Limits.Memory()) == 0

    if podQos != v1.PodQOSBurstable || containerDoesNotRequestMemory || !isCgroup2UnifiedMode() || memoryRequestEqualsToLimit {
        m.ConfigureNoSwap(lcr)
Review comment: let's also add the logic on kubelet startup to fail to start if swap is configured with cgroup v1.

Reply: Good idea! Thanks!

Review comment: so will you send a separate PR?

Reply: Yes, here's the PR: #119327. It's still WIP. While the logic itself is ready I also need to adjust the tests. I will finalize it shortly.
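For illustration, the kind of startup validation discussed above could look roughly like the sketch below. This is not the code from #119327; the function name validateSwapConfiguration and the error message are made up, and it assumes the existing utilfeature, kubefeatures, and fmt imports:

```go
// Hypothetical sketch of a kubelet startup check: refuse to run with swap
// enabled on a cgroup v1 host, since swap limiting is only supported on cgroup v2.
func validateSwapConfiguration(nodeHasSwap bool) error {
    if !utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {
        return nil // nothing to validate when the feature gate is off
    }
    if nodeHasSwap && !isCgroup2UnifiedMode() {
        return fmt.Errorf("swap is enabled on the node, but swap support requires cgroup v2")
    }
    return nil
}
```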
        return
    }

    containerMemoryRequest := container.Resources.Requests.Memory()
    swapLimit, err := calcSwapForBurstablePods(containerMemoryRequest.Value(), int64(m.machineInfo.MemoryCapacity), int64(m.machineInfo.SwapCapacity))

    if err != nil {
        klog.ErrorS(err, "cannot calculate swap allocation amount; disallowing swap")
        m.ConfigureNoSwap(lcr)
        return
    }

    m.configureSwap(lcr, swapLimit)
}

func (m swapConfigurationHelper) ConfigureNoSwap(lcr *runtimeapi.LinuxContainerResources) {
    if !isCgroup2UnifiedMode() {
        // memorySwapLimit = total permitted memory+swap; if equal to memory limit, => 0 swap above memory limit
        // Some swapping is still possible.
        // Note that if memory limit is 0, memory swap limit is ignored.
        lcr.MemorySwapLimitInBytes = lcr.MemoryLimitInBytes
Review comment: I suspect this line failed some tests without swap enabled, like some Ubuntu images.
        return
    }

    m.configureSwap(lcr, 0)
}

func (m swapConfigurationHelper) ConfigureUnlimitedSwap(lcr *runtimeapi.LinuxContainerResources) {
    if !isCgroup2UnifiedMode() {
        m.ConfigureNoSwap(lcr)
        return
    }

    if lcr.Unified == nil {
        lcr.Unified = map[string]string{}
    }

    lcr.Unified[cm.Cgroup2MaxSwapFilename] = "max"
}

func (m swapConfigurationHelper) configureSwap(lcr *runtimeapi.LinuxContainerResources, swapMemory int64) {
    if !isCgroup2UnifiedMode() {
        klog.ErrorS(fmt.Errorf("swap configuration is not supported with cgroup v1"), "swap configuration under cgroup v1 is unexpected")
        return
    }

    if lcr.Unified == nil {
        lcr.Unified = map[string]string{}
    }

    lcr.Unified[cm.Cgroup2MaxSwapFilename] = fmt.Sprintf("%d", swapMemory)
}

// The swap limit is calculated as (<containerMemoryRequest>/<nodeTotalMemory>)*<totalPodsSwapAvailable>.
// For more info, please look at the following KEP: https://kep.k8s.io/2400
func calcSwapForBurstablePods(containerMemoryRequest, nodeTotalMemory, totalPodsSwapAvailable int64) (int64, error) {
Review comment: Can we add some comments here to make it more clear? The logic is defined in the KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#steps-to-calculate-swap-limit

Reply: Done
    if nodeTotalMemory <= 0 {
        return 0, fmt.Errorf("total node memory is 0")
    }
    if containerMemoryRequest > nodeTotalMemory {
        return 0, fmt.Errorf("container request %d is larger than total node memory %d", containerMemoryRequest, nodeTotalMemory)
    }

    containerMemoryProportion := float64(containerMemoryRequest) / float64(nodeTotalMemory)
    swapAllocation := containerMemoryProportion * float64(totalPodsSwapAvailable)

    return int64(swapAllocation), nil
}
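A quick worked example of the formula above (illustrative numbers, not part of the diff): on a node with 32 GiB of memory and 8 GiB of swap, a Burstable container requesting 4 GiB of memory is allowed (4/32) * 8 GiB = 1 GiB of swap.

```go
// Illustrative only: proportional swap for a Burstable container on a node
// with 32 GiB of RAM and 8 GiB of swap configured.
const gib = int64(1 << 30)

func exampleSwapLimit() int64 {
    limit, err := calcSwapForBurstablePods(4*gib, 32*gib, 8*gib)
    if err != nil {
        return 0
    }
    return limit // (4 GiB / 32 GiB) * 8 GiB = 1 GiB, i.e. 1073741824 bytes
}
```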
Review comment: we can enable it by default. There is an additional configuration that needs to be tweaked anyway, so no need to ask to change the feature gate as well.

Reply: for other reviewers: we found that enabling it by default breaks some upgrade scenarios that we will deal with in beta2.