
Best-effort topology mgr policy doesn't give best-effort CPU NUMA alignment #106270

Closed as not planned
@JanScheurich

Description


What happened?

Use case:

A DPDK application uses VLAN trunking on SR-IOV NICs and requires dedicated SR-IOV NICs. For cost reasons there is only one SR-IOV NIC per server, but to exploit the CPU resources optimally, the application needs to run one single-NUMA pod per CPU socket. For these pods, CPUs and hugepages must be allocated from the same NUMA node, while the SR-IOV device may, if necessary, be allocated from the NIC on the remote NUMA node.

Problem description:

K8s bare-metal node with the following CPU topology:
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
The single SR-IOV NIC is on NUMA 0.

Kubelet is configured with
• CPU manager policy "static"
• Topology manager policy "best-effort"
• reserved_cpus: 0,1,40,41
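
For reference, here is a sketch of the same three settings expressed against the KubeletConfiguration API (k8s.io/kubelet/config/v1beta1); in the kubelet config file these are the cpuManagerPolicy, topologyManagerPolicy, and reservedSystemCPUs fields. The values are taken from the list above, nothing else is implied.

```go
package main

import (
	"fmt"

	kubeletconfig "k8s.io/kubelet/config/v1beta1"
)

func main() {
	// The three settings from the list above, as KubeletConfiguration fields.
	cfg := kubeletconfig.KubeletConfiguration{
		CPUManagerPolicy:      "static",      // exclusive CPU pinning for Guaranteed QoS pods
		TopologyManagerPolicy: "best-effort", // admit pods even without a preferred NUMA alignment
		ReservedSystemCPUs:    "0,1,40,41",   // CPUs withheld from the exclusive/shared pools
	}
	fmt.Println(cfg.CPUManagerPolicy, cfg.TopologyManagerPolicy, cfg.ReservedSystemCPUs)
}
```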

The application creates two Guaranteed QoS DPDK pods requesting 32 CPUs each. The remaining 6 CPUs per NUMA node are meant to be used by best-effort and burstable QoS pods.

The expected behavior with best-effort policy is that the CPU manager provides both Guaranteed QoS pods with CPUs from a single NUMA node each, even if the device manager cannot provide each pod with a local SR-IOV VF.

Unfortunately, this is not what happens: the CPU manager assigns CPUs 2-32,42-72 on NUMA node 0 to the first pod, and the remaining CPUs 34-38,74-78 on NUMA node 0 plus CPUs 3-25,43-65 on NUMA node 1 to the second pod, thus breaking the DPDK application, which requires single-NUMA CPU allocation.

What did you expect to happen?

The expected behavior with best-effort policy is that the CPU manager provides both Guaranteed QoS pods with CPUs from a single NUMA node each, even if the device manager cannot provide each pod with a local SR-IOV VF.

How can we reproduce it (as minimally and precisely as possible)?

See above. Create two Guaranteed QoS pods, each with an integer CPU request and a request for an SR-IOV device from a pool that exists on only one NUMA node, sized so that the two pods cannot fit on the same NUMA node, but a single pod does not fully occupy the NUMA node hosting the SR-IOV NIC either. A sketch of such a pod follows.
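
A minimal sketch of one of the two reproduction pods, built with the Kubernetes Go API types. The SR-IOV resource name, image, and memory/hugepage sizes are placeholders; substitute whatever your SR-IOV device plugin and application actually use.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// dpdkPod builds one of the two Guaranteed QoS pods: requests == limits for
// every resource, an integer CPU count, hugepages, and one SR-IOV VF.
func dpdkPod(name string) *corev1.Pod {
	res := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("32"),
		corev1.ResourceMemory: resource.MustParse("8Gi"),
		// Placeholder names/sizes below; adjust to your device plugin and app.
		corev1.ResourceName("hugepages-1Gi"):             resource.MustParse("8Gi"),
		corev1.ResourceName("intel.com/sriov_netdevice"): resource.MustParse("1"),
	}
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:      "dpdk",
				Image:     "dpdk-app:latest", // placeholder image
				Resources: corev1.ResourceRequirements{Requests: res, Limits: res},
			}},
		},
	}
}

func main() {
	for _, name := range []string{"dpdk-pod-1", "dpdk-pod-2"} {
		fmt.Println("would create", dpdkPod(name).Name)
	}
}
```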

Anything else we need to know?

Analysis:

The problem is that, for the second pod (which should land on NUMA node 1), the CPU manager offers the topology hints [10 (preferred), 11 (not preferred)]; the affinity bit strings enumerate the NUMA nodes right to left. The device manager's hint is [01 (preferred)]. The topology manager unconditionally merges these into a best hint of 01 (not preferred). It does so by iterating over the cross-product of all provider hints and taking the bitwise AND of the affinity masks; for non-zero results, the preferred status is set to true if and only if all combined provider hints were preferred. In our case the only non-zero affinity mask is 11 & 01 = 01, and it is not preferred.
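
For illustration, here is a simplified, self-contained sketch of that merge step. It is not the kubelet code itself, only the cross-product/bitwise-AND logic described above; running it on the hints from this example yields exactly one non-empty combination, 01 (not preferred).

```go
package main

import "fmt"

// hint is a NUMA affinity bit mask (bit 0 = NUMA node 0) plus a preferred flag.
type hint struct {
	affinity  uint
	preferred bool
}

// merge takes one hint list per provider and returns every non-empty
// combination obtained by picking one hint per provider and ANDing the masks.
func merge(providers [][]hint) []hint {
	// Start from "all NUMA nodes, preferred" on a two-node system.
	merged := []hint{{affinity: 0b11, preferred: true}}
	for _, hints := range providers {
		var next []hint
		for _, m := range merged {
			for _, h := range hints {
				combined := m.affinity & h.affinity
				if combined == 0 {
					continue // no common NUMA node, drop this combination
				}
				next = append(next, hint{
					affinity:  combined,
					preferred: m.preferred && h.preferred, // preferred only if all inputs were
				})
			}
		}
		merged = next
	}
	return merged
}

func main() {
	cpuManager := []hint{{0b10, true}, {0b11, false}} // hints offered for the second pod
	deviceManager := []hint{{0b01, true}}             // the only SR-IOV NIC is on NUMA node 0
	for _, h := range merge([][]hint{cpuManager, deviceManager}) {
		fmt.Printf("affinity=%02b preferred=%v\n", h.affinity, h.preferred)
	}
	// Prints a single line: affinity=01 preferred=false. The "best hint" is
	// therefore 01 (not preferred), which the CPU manager never offered.
}
```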

With the "single-numa-node" or "restricted" topology manager policies, the topology manager would immediately reject pod admission. With the "best-effort" policy it admits the pod and returns the computed "best hint" 01 (not preferred) to the CPU and device managers for their resource allocations. Hence the CPU manager starts allocating CPUs from NUMA node 0 and, since there are not enough, fills up the rest from NUMA node 1. Note that the "best hint" 01 is not even among the options supplied by the CPU manager in the first place.

Proposal:

If there is no preferred best hint, the topology manager with the "best-effort" policy should return to each provider one preferred hint drawn from that provider's own original hint list. For the device manager that would be 01; for the CPU manager it would be 10. That way each resource owner could do its best to guarantee NUMA locality among its own resources (see the sketch below).
We will provide a corresponding PR to open the discussion on how to improve the best-effort behavior of the topology manager.
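
A rough illustration of the proposed behavior (not the actual PR; names and structure are invented for the sketch): when the merged best hint is not preferred, the best-effort policy hands each provider one of its own preferred hints instead, so each provider can keep its resources on a single NUMA node.

```go
package main

import "fmt"

// hint is a NUMA affinity bit mask (bit 0 = NUMA node 0) plus a preferred flag.
type hint struct {
	affinity  uint
	preferred bool
}

// perProviderHints decides what each hint provider should allocate against.
// bestHint is the merged result; providerHints are the original hint lists,
// keyed by a provider name (names here are just labels for the example).
func perProviderHints(bestHint hint, providerHints map[string][]hint) map[string]hint {
	result := make(map[string]hint)
	for name, hints := range providerHints {
		if bestHint.preferred {
			// Normal case: every provider allocates against the merged best hint.
			result[name] = bestHint
			continue
		}
		// Proposed best-effort fallback: use the provider's own first preferred
		// hint, so each resource type at least stays on one NUMA node.
		chosen := bestHint
		for _, h := range hints {
			if h.preferred {
				chosen = h
				break
			}
		}
		result[name] = chosen
	}
	return result
}

func main() {
	best := hint{affinity: 0b01, preferred: false} // merged result from the analysis above
	providers := map[string][]hint{
		"cpumanager":    {{0b10, true}, {0b11, false}},
		"devicemanager": {{0b01, true}},
	}
	for name, h := range perProviderHints(best, providers) {
		fmt.Printf("%s -> affinity=%02b preferred=%v\n", name, h.affinity, h.preferred)
	}
	// Output (order may vary): cpumanager -> 10, devicemanager -> 01, i.e. CPUs
	// from NUMA node 1 only and the SR-IOV VF from NUMA node 0.
}
```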

Kubernetes version

$ kubectl version
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"841a4f4f3d4528aa284171074e00503faea18496", GitTreeState:"clean", BuildDate:"2021-08-31T06:52:44Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

none

OS version

# On Linux:
$ cat /etc/os-release
NAME="SLES"
VERSION="15-SP2"
VERSION_ID="15.2"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP2"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp2"

$ uname -a
Linux control-plane-n108-mast-n027 5.3.18-24.75.3.22886.0.PTF.1187468-default #1 SMP Thu Sep 9 23:24:48 UTC 2021 (37ce29d) x86_64 x86_64 x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)


Labels

kind/feature, lifecycle/rotten, needs-triage, sig/node
