-
Notifications
You must be signed in to change notification settings - Fork 40.9k
DRA: Introduce a helper function producing standardized 'resource.kubernetes.io/pcieRoot' device attributes for DRA drivers #132296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…utes for DRA drivers The shared function can provide a way for DRA drivers to produce standardized device attributes in a consistent way. Signed-off-by: Shingo Omura <[email protected]>
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: everpeace The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/assign @aojea |
/cc @michaelasp |
@michaelasp: GitHub didn't allow me to request PR reviews from the following users: michaelasp. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome, thanks for adding this!
3cbc0aa
to
db0bf33
Compare
Signed-off-by: Shingo Omura <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of this is linux specific so the parts that are should be in a _linux.go
specific file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 73d65f8
// The value is a string value in the format `pci<domain>:<bus>`, | ||
// referring to a PCIe (Peripheral Component Interconnect Express) Root Complex. | ||
// This attribute can be used to identify devices that share the same PCIe Root Complex. | ||
StandardDeviceAttributePCIeRoot = StandardDeviceAttributePrefix + "pcieRoot" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StandardDeviceAttributePCIeRoot = StandardDeviceAttributePrefix + "pcieRoot" | |
StandardDeviceAttributePCIeRoot resourceapi.QualifiedName = StandardDeviceAttributePrefix + "pcieRoot" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 8ed8217
) | ||
|
||
var ( | ||
bdfRegexp = regexp.MustCompile(`^([0-9a-f]{4}):([0-9a-f]{2}):([0-9a-f]{2})\.([0-9a-f]{1})$`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Give that this is only use din one function it seems unnecessary to declare as a global like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 88b34ea
staging/src/k8s.io/dynamic-resource-allocation/deviceattribute/pci.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/dynamic-resource-allocation/deviceattribute/pci.go
Outdated
Show resolved
Hide resolved
…helper functions" This reverts commit 2869b6f.
Signed-off-by: Shingo Omura <[email protected]>
…PCIeRootAttributeByPCIBusID' with it Signed-off-by: Shingo Omura <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets conslidate this into the attribute.go file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in caa64af
// GetPCIeRootAttributeByPCIBusID retrieves the PCIe Root Complex for a given PCI Bus ID. | ||
// in BDF (Bus-Device-Function) format, e.g., "0123:45:1e.7". | ||
// | ||
// It returns a DeviceAttribute with the PCIe Root Complex information("pci<domain>:<bus>") | ||
// as a string value or an error if the PCI Bus ID is invalid or the root complex cannot be determined. | ||
// | ||
// ref: https://wiki.xenproject.org/wiki/Bus:Device.Function_(BDF)_Notation | ||
func GetPCIeRootAttributeByPCIBusID(pciBusID string) (DeviceAttribute, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I said this elsewhere, but the reason to be explicit about ByPCIBusID
is that I think we will want a GetPCIeRootAttributeByCpuID()
at some point soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I also imagined GetPCIeRootAttributeByInterfaceName()
, too (but, my original idea was to implement these variations via option patterns).
We can resolve PCIe root for interfaces by readlink /sys/class/net/<ifname>/device
.
# PCIe root for "enp48s0f0np0" is "pci0000:20"
$ readlink -f /sys/class/net/enp48s0f0np0/device
/sys/devices/pci0000:20/0000:20:03.1/0000:26:00.0/0000:27:08.0/0000:2e:00.0/0000:2f:10.0/0000:30:00.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can revisit this if / when we get another variant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way,
I think we will want a
GetPCIeRootAttributeByCpuID()
at some point soon.
It might be GetCpuSocketNumberByCpuID()
or GetNumaNodeByCpuID()
?? I think CPU is not within the PCIe hierarchy. Thus, aligning NIC and CPUs in hardware topology, we would need another standard device attribute, e.g. cpuSocketNumber
or numaNode
, in the near future (as stated in kubernetes/enhancements#5316 (comment)).
EXAMPLE: I attached an example of lstopo
output image. In this example, there are two packages, and each package has one NUMA Node, each numa node has 64 cores and 4 PCI Host Bridges (PCIe Roots).
$ lstopo -v --of console
Machine (P#0 total=1584474936KB DMIProductName="REDACTED" DMIProductVersion=xxxxxx DMIBoardVendor=REDACTED DMIBoardName=REDACTED DMIBoardVersion=REDACTED DMIBoardAssetTag="REDACTED" DMIChassisVendor=REDACTED DMIChassisType=1 DMIChassisVersion=REDACTED DMIChassisAsse
Package L#0 (P#0 total=791818556KB CPUVendor=REDACTED CPUFamilyNumber=REDACTED CPUModelNumber=REDACTED CPUModel="REDACTED" CPUStepping=1)
NUMANode L#0 (P#0 local=791818556KB total=791818556KB)
...<CPUs here>...
HostBridge L#0 (buses=0000:[00-0e])
...<PCIe devices/bridges here>...
HostBridge L#12 (buses=0000:[20-34])
...<PCIe devices/bridges here>...
HostBridge L#24 (buses=0000:[40-46])
...<PCIe devices/bridges here>...
HostBridge L#29 (buses=0000:[60-6b])
...<PCIe devices/bridges here>...
Package L#1 (P#1 total=792656380KB CPUVendor=REDACTED CPUFamilyNumber=REDACTED CPUModelNumber=REDACTED CPUModel="REDACTED" CPUStepping=1)
NUMANode L#1 (P#1 local=792656380KB total=792656380KB)
...<CPUs here>...
HostBridge L#39 (buses=0000:[80-87])
...<PCIe devices/bridges here>...
HostBridge L#44 (buses=0000:[a0-a6])
...<PCIe devices/bridges here>...
HostBridge L#49 (buses=0000:[c0-c6])
...<PCIe devices/bridges here>...
HostBridge L#53 (buses=0000:[e0-e7])
...<PCIe devices/bridges here>...
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to map a given CPU to its PCIe root complex independent of NUMA. We've done alignment by NUMA in the past, but going forward direct alignment by PCIe root complex is the preferred approach.
One way to see this association on my 4-GPU system is as follows:
$ cat /sys/bus/pci/devices/0019:01:00.0/local_cpulist
72-143
$ cat /sys/bus/pci/devices/0018:01:00.0/local_cpulist
72-143
$ cat /sys/bus/pci/devices/0009:01:00.0/local_cpulist
0-71
$ cat /sys/bus/pci/devices/0008:01:00.0/local_cpulist
0-71
The first two GPUs are "local" to CPUs 72-143 and the second two GPUs are "local" to CPUs 0-71. Where "local" implies that they are on the same PCIe root complex. You can imagine how one could use this to implement a robust GetPCIeRootAttributeByCpuID()
function.
There may be other more straightforward ways to glean this information, but this is gist of it.
/cc @kad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first two GPUs are "local" to CPUs 72-143, and the second two GPUs are "local" to CPUs 0-71. Where "local" implies that they are on the same PCIe root complex. You can imagine how one could use this to implement a robust GetPCIeRootAttributeByCpuID() function.
In my 8-GPU system,
$ nvidia-smi --query-gpu=gpu_bus_id --format=csv,noheader
00000000:04:00.0
00000000:23:00.0
00000000:43:00.0
00000000:64:00.0
00000000:84:00.0
00000000:A3:00.0
00000000:C3:00.0
00000000:E4:00.0
For example, For 0000:04:00.0
(1st) and 0000:23:00.0
(2nd):
- both are "local" to
0-63
, - BUT, they have different PCIe roots
# For 0000:04:00.0:
# it is "local" to CPU 0-63
$ cat /sys/bus/pci/devices/0000:04:00.0/local_cpulist
0-63
# its PCIe root is pci0000:00
$ readlink /sys/bus/pci/devices/0000:04:00.0
../../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:01.0/0000:04:00.0
# For 0000:23:00.0:
# it is "local" to CPU 0-63, too
$ cat /sys/bus/pci/devices/0000:23:00.0/local_cpulist
0-63
# but, its PCIe Root is pci0000:20
$ readlink /sys/bus/pci/devices/0000:23:00.0
../../../devices/pci0000:20/0000:20:01.1/0000:21:00.0/0000:22:00.0/0000:23:00.0
Thus, CPU 0-63
are "local" to at least two PCIe roots(pci0000:00
and pci0000:20
). So, I am still not sure how we can map a given CPU to its PCIe root complex🤔. Are there any other ways to determine/define it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is not necessarily a 1:1 mapping for CPU <--> PCIe root complex, unfortunately (which can obviously complicate things when we want to do exact string matching for alignment). Seems when it comes to CPUs we may need a list rather than a single value -- which unfortunately won't work with simple matchAttributes
but will when we have the more expressive version of this: kubernetes/enhancements#5254
Note, on my 8-GPU system I don't have this "issue":
$ nvidia-smi --query-gpu=gpu_bus_id --format=csv,noheader
00000000:07:00.0
00000000:0F:00.0
00000000:47:00.0
00000000:4E:00.0
00000000:87:00.0
00000000:90:00.0
00000000:B7:00.0
00000000:BD:00.0
$ cat /sys/bus/pci/devices/0000\:07\:00.0/local_cpulist
48-63,176-191
$ cat /sys/bus/pci/devices/0000\:0f\:00.0/local_cpulist
48-63,176-191
$ readlink /sys/bus/pci/devices/0000\:07\:00.0
../../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:00.0/0000:05:00.0/0000:06:00.0/0000:07:00.0
$ readlink /sys/bus/pci/devices/0000\:0f\:00.0
../../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:04.0/0000:0a:00.0/0000:0b:10.0/0000:0d:00.0/0000:0e:00.0/0000:0f:00.0
$ cat /sys/bus/pci/devices/0000\:47\:00.0/local_cpulist
16-31,144-159
$ cat /sys/bus/pci/devices/0000\:4e\:00.0/local_cpulist
16-31,144-159
$ readlink /sys/bus/pci/devices/0000\:47\:00.0
../../../devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.0/0000:44:00.0/0000:45:00.0/0000:46:00.0/0000:47:00.0
$ readlink /sys/bus/pci/devices/0000\:4e\:00.0
../../../devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:04.0/0000:49:00.0/0000:4a:10.0/0000:4c:00.0/0000:4d:00.0/0000:4e:00.0
$ cat /sys/bus/pci/devices/0000\:87\:00.0/local_cpulist
112-127,240-255
$ cat /sys/bus/pci/devices/0000\:90\:00.0/local_cpulist
112-127,240-255
$ readlink /sys/bus/pci/devices/0000\:87\:00.0
../../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/0000:82:00.0/0000:83:00.0/0000:84:00.0/0000:85:00.0/0000:86:00.0/0000:87:00.0
$ readlink /sys/bus/pci/devices/0000\:90\:00.0
../../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/0000:82:04.0/0000:8b:00.0/0000:8c:10.0/0000:8e:00.0/0000:8f:00.0/0000:90:00.0
$ cat /sys/bus/pci/devices/0000\:b7\:00.0/local_cpulist
80-95,208-223
$ cat /sys/bus/pci/devices/0000\:bd\:00.0/local_cpulist
80-95,208-223
$ readlink /sys/bus/pci/devices/0000\:b7\:00.0
../../../devices/pci0000:b0/0000:b0:01.1/0000:b1:00.0/0000:b2:00.0/0000:b3:00.0/0000:b4:00.0/0000:b5:00.0/0000:b6:00.0/0000:b7:00.0
$ readlink /sys/bus/pci/devices/0000\:bd\:00.0
../../../devices/pci0000:b0/0000:b0:01.1/0000:b1:00.0/0000:b2:04.0/0000:b8:00.0/0000:b9:10.0/0000:bb:00.0/0000:bc:00.0/0000:bd:00.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
Seems when it comes to CPUs we may need a list rather than a single value -- which unfortunately won't work with simple matchAttributes but will when we have the more expressive version of this: kubernetes/enhancements#5254
Yes, I agree. In that case, could we plan to add resource.kubernetes.io/pcieRoots
(plural) rather than resource.kubernetes.io/pcieRoot
in the near future? And, could we add an array type as device attribute's value type?
Note, on my 8-GPU system I don't have this "issue":
lucky you 😄
Signed-off-by: Shingo Omura <[email protected]>
@pohly can you point us where this can be exercised in the e2e driver? |
The test/e2e/dra/test-driver? That driver has no hardware dependency, so the code in this PR cannot be exercised through it. In general, writing any kind of integration or E2E test for this in the Kubernetes CI is problematic because it needs some real, deterministic sysfs. We don't have that. |
Thanks @pohly that makes sense. Do you have any comments on the name of the package / API for this? |
/test pull-kubernetes-apidiff |
/test pull-kubernetes-apidiff Might have died because of OOM. |
expectedErrMsg string | ||
}{ | ||
"valid": { | ||
smockSysfsSetup: func(t *testing.T, mockSysfs sysfsPath) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does the "s" in "smock" stand for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it's just typo... Thanks for the catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 94973ba
} | ||
} | ||
|
||
func TouchFile(t *testing.T, path string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func TouchFile(t *testing.T, path string) { | |
func touchFile(t *testing.T, path string) { |
It's easier to see that this is not a public API when it's not exported (doesn't make a different in _test.go, but still...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 2bc45ca
} | ||
} | ||
|
||
func CreateSymlink(t *testing.T, target, link string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func CreateSymlink(t *testing.T, target, link string) { | |
func createSymlink(t *testing.T, target, link string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 2bc45ca
// and can be replaced with a mock in tests. | ||
type sysfsPath string | ||
|
||
func (s sysfsPath) Devices(path string) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (s sysfsPath) Devices(path string) string { | |
func (s sysfsPath) devices(path string) string { |
Same below.
In this case, exporting the method makes it callable outside of the package (assuming that some exported method returns a sysfsPath) to that caller and thus part of the package API. We want a lean API to make future changes safer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 2bc45ca.
Signed-off-by: Shingo Omura <[email protected]>
…ymlink) Signed-off-by: Shingo Omura <[email protected]>
Signed-off-by: Shingo Omura <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pohly Thank you for your review. I addressed your comments. PTAL.
expectedErrMsg string | ||
}{ | ||
"valid": { | ||
smockSysfsSetup: func(t *testing.T, mockSysfs sysfsPath) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it's just typo... Thanks for the catch!
expectedErrMsg string | ||
}{ | ||
"valid": { | ||
smockSysfsSetup: func(t *testing.T, mockSysfs sysfsPath) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 94973ba
} | ||
} | ||
|
||
func TouchFile(t *testing.T, path string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 2bc45ca
} | ||
} | ||
|
||
func CreateSymlink(t *testing.T, target, link string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 2bc45ca
// and can be replaced with a mock in tests. | ||
type sysfsPath string | ||
|
||
func (s sysfsPath) Devices(path string) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 2bc45ca.
// Commented out for future use or if needed later | ||
// func (s sysfsPath) block(path string) string { | ||
// return filepath.Join(string(s), "block", path) | ||
// } | ||
|
||
// Commented out for future use or if needed later | ||
// func (s sysfsPath) class(path string) string { | ||
// return filepath.Join(string(s), "class", path) | ||
// } | ||
|
||
// Commented out for future use or if needed later | ||
// func (s sysfsPath) dev(path string) string { | ||
// return filepath.Join(string(s), "dev", path) | ||
// } | ||
|
||
// Commented out for future use or if needed later | ||
// func (s sysfsPath) firmware(path string) string { | ||
// return filepath.Join(string(s), "firmware", path) | ||
// } | ||
// func (s sysfsPath) kernel(path string) string { | ||
// return filepath.Join(string(s), "kernel", path) | ||
// } | ||
|
||
// Commented out for future use or if needed later | ||
// func (s sysfsPath) module(path string) string { | ||
// return filepath.Join(string(s), "module", path) | ||
// } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These unused methods are just commented out. We can remove these. WDYT??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove them if they are unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed in d28791f
Signed-off-by: Shingo Omura <[email protected]>
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces a helper function
GetPCIeRootAttributeByPCIBusID(pciBusID)
for DRA drivers, allowing them to produce standardized device attributes in a consistent way.Which issue(s) this PR is related to:
ref: NVIDIA/k8s-dra-driver-gpu#401 (comment)
KEP: kubernetes/enhancements#4381
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: