Description
I am working with a kubeadm cluster on a bare-metal server with 3 NVIDIA A40 GPUs.
Problem: the gpu.nvidia.com resource driver is not able to allocate a GPU for my claim.
My deployment:
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: random
spec:
  spec:
    resourceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-sample
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["sleep", "10000"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: random
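For reference, the template above points at resourceClassName: gpu.nvidia.com, and the controller logs below report driver="gpu.resource.nvidia.com", so I expect the chart to have created a ResourceClass roughly like this (my own sketch based on those names, not copied from the cluster):
# sketch: what I assume the chart-installed ResourceClass looks like
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
  name: gpu.nvidia.com
driverName: gpu.resource.nvidia.com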
Here are the outputs for debugging:
$ kubectl describe pod dra-sample
Name: dra-sample
Namespace: default
Priority: 0
Service Account: default
Node: <none>
Labels: <none>
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
ctr:
Image: ubuntu:22.04
Port: <none>
Host Port: <none>
Command:
sleep
10000
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x52xb (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-x52xb:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 25s default-scheduler 0/1 nodes are available: 1 waiting for resource driver to allocate resource.
Warning FailedScheduling 23s default-scheduler 0/1 nodes are available: 1 waiting for resource driver to provide information.
$ kubectl logs -n gpu-operator nvidia-dra-device-plugin-k8s-dra-driver-controller-c8d57bb4fw6s
I0103 07:33:19.351405 1 controller.go:295] "resource controller: Starting" driver="gpu.resource.nvidia.com"
I0103 07:33:19.351515 1 reflector.go:287] Starting reflector *v1alpha2.ResourceClaim (0s) from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351519 1 reflector.go:287] Starting reflector *v1alpha2.PodSchedulingContext (0s) from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351538 1 reflector.go:323] Listing and watching *v1alpha2.ResourceClaim from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351544 1 reflector.go:323] Listing and watching *v1alpha2.PodSchedulingContext from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351555 1 reflector.go:287] Starting reflector *v1alpha2.ResourceClass (0s) from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351572 1 reflector.go:323] Listing and watching *v1alpha2.ResourceClass from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.357423 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclaims?limit=500&resourceVersion=0 200 OK in 5 milliseconds
I0103 07:33:19.357445 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/podschedulingcontexts?limit=500&resourceVersion=0 200 OK in 5 milliseconds
I0103 07:33:19.357459 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclasses?limit=500&resourceVersion=0 200 OK in 5 milliseconds
I0103 07:33:19.358836 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/podschedulingcontexts?allowWatchBookmarks=true&resourceVersion=1463537&timeout=7m2s&timeoutSeconds=422&watch=true 200 OK in 0 milliseconds
I0103 07:33:19.358847 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclaims?allowWatchBookmarks=true&resourceVersion=1463541&timeout=5m15s&timeoutSeconds=315&watch=true 200 OK in 1 milliseconds
I0103 07:33:19.359089 1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclasses?allowWatchBookmarks=true&resourceVersion=1463537&timeout=9m53s&timeoutSeconds=593&watch=true 200 OK in 1 milliseconds
I0103 07:33:19.452100 1 shared_informer.go:344] caches populated
I0103 07:33:27.927510 1 controller.go:241] "resource controller: new object" type="ResourceClaim" content="{\"metadata\":{\"name\":\"dra-sample-gpu-vgbkc\",\"generateName\":\"dra-sample-gpu-\",\"namespace\":\"default\",\"uid\":\"cf3f2558-4f63-402a-a7c3-333b7d928885\",\"resourceVersion\":\"1463627\",\"creationTimestamp\":\"2024-01-03T07:33:27Z\",\"annotations\":{\"resource.kubernetes.io/pod-claim-name\":\"gpu\"},\"ownerReferences\":[{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"name\":\"dra-sample\",\"uid\":\"803e19c7-41d6-4331-94a2-e86b3c32b177\",\"controller\":true,\"blockOwnerDeletion\":true}],\"managedFields\":[{\"manager\":\"kube-controller-manager\",\"operation\":\"Update\",\"apiVersion\":\"resource.k8s.io/v1alpha2\",\"time\":\"2024-01-03T07:33:27Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:resource.kubernetes.io/pod-claim-name\":{}},\"f:generateName\":{},\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"803e19c7-41d6-4331-94a2-e86b3c32b177\\\"}\":{}}},\"f:spec\":{\"f:allocationMode\":{},\"f:resourceClassName\":{}}}}]},\"spec\":{\"resourceClassName\":\"gpu.nvidia.com\",\"allocationMode\":\"WaitForFirstConsumer\"},\"status\":{}}"
I0103 07:33:27.927541 1 controller.go:260] "resource controller: Adding new work item" key="claim:default/dra-sample-gpu-vgbkc"
I0103 07:33:27.927583 1 controller.go:332] "resource controller: processing" key="claim:default/dra-sample-gpu-vgbkc"
I0103 07:33:27.927603 1 controller.go:476] "resource controller: ResourceClaim waiting for first consumer" key="claim:default/dra-sample-gpu-vgbkc"
I0103 07:33:27.927613 1 controller.go:336] "resource controller: completed" key="claim:default/dra-sample-gpu-vgbkc"
I0103 07:33:27.935496 1 controller.go:241] "resource controller: new object" type="PodSchedulingContext" content="{\"metadata\":{\"name\":\"dra-sample\",\"namespace\":\"default\",\"uid\":\"e8b1fa44-574f-4a92-bdd9-26e33f816689\",\"resourceVersion\":\"1463629\",\"creationTimestamp\":\"2024-01-03T07:33:27Z\",\"ownerReferences\":[{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"name\":\"dra-sample\",\"uid\":\"803e19c7-41d6-4331-94a2-e86b3c32b177\",\"controller\":true,\"blockOwnerDeletion\":true}],\"managedFields\":[{\"manager\":\"kube-scheduler\",\"operation\":\"Update\",\"apiVersion\":\"resource.k8s.io/v1alpha2\",\"time\":\"2024-01-03T07:33:27Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"803e19c7-41d6-4331-94a2-e86b3c32b177\\\"}\":{}}},\"f:spec\":{\"f:potentialNodes\":{},\"f:selectedNode\":{}}}}]},\"spec\":{\"selectedNode\":\"nm-shakti-worker6\",\"potentialNodes\":[\"nm-shakti-worker6\"]},\"status\":{}}"
I0103 07:33:27.935520 1 controller.go:260] "resource controller: Adding new work item" key="schedulingCtx:default/dra-sample"
I0103 07:33:27.935547 1 controller.go:332] "resource controller: processing" key="schedulingCtx:default/dra-sample"
I0103 07:33:27.937738 1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/default/pods/dra-sample 200 OK in 2 milliseconds
I0103 07:33:27.939722 1 controller.go:390] "resource controller: ResourceClaim not found, no need to process it" key="schedulingCtx:default/dra-sample"
I0103 07:33:27.939739 1 controller.go:658] "resource controller: Found no pending pod claims" key="schedulingCtx:default/dra-sample"
I0103 07:33:27.939747 1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:default/dra-sample"
I0103 07:33:57.940731 1 controller.go:332] "resource controller: processing" key="schedulingCtx:default/dra-sample"
I0103 07:33:57.944575 1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/default/pods/dra-sample 200 OK in 3 milliseconds
I0103 07:33:57.945164 1 controller.go:390] "resource controller: ResourceClaim not found, no need to process it" key="schedulingCtx:default/dra-sample"
I0103 07:33:57.945186 1 controller.go:658] "resource controller: Found no pending pod claims" key="schedulingCtx:default/dra-sample"
I0103 07:33:57.945198 1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:default/dra-sample"
I0103 07:34:27.946163 1 controller.go:332] "resource controller: processing" key="schedulingCtx:default/dra-sample"
I0103 07:34:27.950857 1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/default/pods/dra-sample 200 OK in 4 milliseconds
I0103 07:34:27.951558 1 controller.go:390] "resource controller: ResourceClaim not found, no need to process it" key="schedulingCtx:default/dra-sample"
I0103 07:34:27.951578 1 controller.go:658] "resource controller: Found no pending pod claims" key="schedulingCtx:default/dra-sample"
I0103 07:34:27.951589 1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:default/dra-sample"
$ kubectl logs -n gpu-operator nvidia-dra-device-plugin-k8s-dra-driver-kubelet-plugin-7txkm
Defaulted container "plugin" out of: plugin, init (init)
I0103 07:33:20.280685 1 device_state.go:142] using devRoot=/driver-root
I0103 07:33:20.291222 1 nonblockinggrpcserver.go:107] "dra: GRPC server started"
I0103 07:33:20.291293 1 nonblockinggrpcserver.go:107] "registrar: GRPC server started"
We can see that the node does not pick up any GPU resources:
$ kubectl describe node nm-shakti-worker6
Name: nm-shakti-worker6
Roles: control-plane
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.CETSS=true
feature.node.kubernetes.io/cpu-cpuid.CLZERO=true
feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
feature.node.kubernetes.io/cpu-cpuid.CPBOOST=true
feature.node.kubernetes.io/cpu-cpuid.EFER_LMSLE_UNS=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.FP256=true
feature.node.kubernetes.io/cpu-cpuid.FXSR=true
feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
feature.node.kubernetes.io/cpu-cpuid.IBPB=true
feature.node.kubernetes.io/cpu-cpuid.IBRS=true
feature.node.kubernetes.io/cpu-cpuid.IBRS_PREFERRED=true
feature.node.kubernetes.io/cpu-cpuid.IBRS_PROVIDES_SMP=true
feature.node.kubernetes.io/cpu-cpuid.IBS=true
feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT=true
feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM=true
feature.node.kubernetes.io/cpu-cpuid.IBSFFV=true
feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT=true
feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT=true
feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM=true
feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT=true
feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK=true
feature.node.kubernetes.io/cpu-cpuid.IBS_FETCH_CTLX=true
feature.node.kubernetes.io/cpu-cpuid.IBS_OPFUSE=true
feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST=true
feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD=true
feature.node.kubernetes.io/cpu-cpuid.INVLPGB=true
feature.node.kubernetes.io/cpu-cpuid.LAHF=true
feature.node.kubernetes.io/cpu-cpuid.LBRVIRT=true
feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW=true
feature.node.kubernetes.io/cpu-cpuid.MCOMMIT=true
feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
feature.node.kubernetes.io/cpu-cpuid.MOVU=true
feature.node.kubernetes.io/cpu-cpuid.MSRIRC=true
feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH=true
feature.node.kubernetes.io/cpu-cpuid.NRIPS=true
feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
feature.node.kubernetes.io/cpu-cpuid.PPIN=true
feature.node.kubernetes.io/cpu-cpuid.PSFD=true
feature.node.kubernetes.io/cpu-cpuid.RDPRU=true
feature.node.kubernetes.io/cpu-cpuid.SEV=true
feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT=true
feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE=true
feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP=true
feature.node.kubernetes.io/cpu-cpuid.SEV_ES=true
feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED=true
feature.node.kubernetes.io/cpu-cpuid.SEV_SNP=true
feature.node.kubernetes.io/cpu-cpuid.SHA=true
feature.node.kubernetes.io/cpu-cpuid.SME=true
feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT=true
feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
feature.node.kubernetes.io/cpu-cpuid.SSE4A=true
feature.node.kubernetes.io/cpu-cpuid.STIBP=true
feature.node.kubernetes.io/cpu-cpuid.STIBP_ALWAYSON=true
feature.node.kubernetes.io/cpu-cpuid.SUCCOR=true
feature.node.kubernetes.io/cpu-cpuid.SVM=true
feature.node.kubernetes.io/cpu-cpuid.SVMDA=true
feature.node.kubernetes.io/cpu-cpuid.SVMFBASID=true
feature.node.kubernetes.io/cpu-cpuid.SVML=true
feature.node.kubernetes.io/cpu-cpuid.SVMNP=true
feature.node.kubernetes.io/cpu-cpuid.SVMPF=true
feature.node.kubernetes.io/cpu-cpuid.SVMPFT=true
feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
feature.node.kubernetes.io/cpu-cpuid.TLB_FLUSH_NESTED=true
feature.node.kubernetes.io/cpu-cpuid.TOPEXT=true
feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR=true
feature.node.kubernetes.io/cpu-cpuid.VAES=true
feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN=true
feature.node.kubernetes.io/cpu-cpuid.VMPL=true
feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT=true
feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
feature.node.kubernetes.io/cpu-cpuid.VTE=true
feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true
feature.node.kubernetes.io/cpu-cpuid.X87=true
feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
feature.node.kubernetes.io/cpu-hardware_multithreading=true
feature.node.kubernetes.io/cpu-model.family=25
feature.node.kubernetes.io/cpu-model.id=1
feature.node.kubernetes.io/cpu-model.vendor_id=AMD
feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
feature.node.kubernetes.io/cpu-rdt.RDTMON=true
feature.node.kubernetes.io/kernel-config.NO_HZ=true
feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
feature.node.kubernetes.io/kernel-config.PREEMPT=true
feature.node.kubernetes.io/kernel-version.full=5.15.0-91-lowlatency
feature.node.kubernetes.io/kernel-version.major=5
feature.node.kubernetes.io/kernel-version.minor=15
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/memory-numa=true
feature.node.kubernetes.io/network-sriov.capable=true
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-10de.sriov.capable=true
feature.node.kubernetes.io/pci-1a03.present=true
feature.node.kubernetes.io/pci-8086.present=true
feature.node.kubernetes.io/pci-8086.sriov.capable=true
feature.node.kubernetes.io/system-os_release.ID=ubuntu
feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
feature.node.kubernetes.io/usb-ef_0b1f_03ee.present=true
kubernetes.io/arch=amd64
kubernetes.io/hostname=nm-shakti-worker6
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node.kubernetes.io/exclude-from-external-load-balancers=
nvidia.com/cuda.driver.major=525
nvidia.com/cuda.driver.minor=147
nvidia.com/cuda.driver.rev=05
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=0
nvidia.com/dra.controller=true
nvidia.com/dra.kubelet-plugin=true
nvidia.com/gfd.timestamp=1704222592
nvidia.com/gpu-driver-upgrade-state=upgrade-done
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=6
nvidia.com/gpu.count=3
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=pre-installed
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=ampere
nvidia.com/gpu.machine=AS--4124GS-TNR
nvidia.com/gpu.memory=49140
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-A40
nvidia.com/gpu.replicas=1
nvidia.com/mig.capable=false
nvidia.com/mig.strategy=single
Annotations: csi.volume.kubernetes.io/nodeid: {"csi.tigera.io":"nm-shakti-worker6"}
flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"42:b2:da:55:59:b8"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 10.82.14.19
kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
nfd.node.kubernetes.io/feature-labels:
cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.CETSS,cpu-cpuid.CLZERO,cpu-cpuid.CMPXCHG8,cpu-cpuid.CPBOOST,cpu-cpuid...
nfd.node.kubernetes.io/master.version: v0.14.2
nfd.node.kubernetes.io/worker.version: v0.14.2
node.alpha.kubernetes.io/ttl: 0
nvidia.com/gpu-driver-upgrade-enabled: true
projectcalico.org/IPv4Address: 10.82.14.19/24
projectcalico.org/IPv4VXLANTunnelAddr: 10.244.170.0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 26 Dec 2023 10:42:19 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: nm-shakti-worker6
AcquireTime: <unset>
RenewTime: Wed, 03 Jan 2024 07:36:56 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Tue, 02 Jan 2024 19:08:58 +0000 Tue, 02 Jan 2024 19:08:58 +0000 CalicoIsUp Calico is running on this node
MemoryPressure False Wed, 03 Jan 2024 07:34:36 +0000 Tue, 26 Dec 2023 10:42:17 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 03 Jan 2024 07:34:36 +0000 Tue, 26 Dec 2023 10:42:17 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 03 Jan 2024 07:34:36 +0000 Tue, 26 Dec 2023 10:42:17 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 03 Jan 2024 07:34:36 +0000 Tue, 02 Jan 2024 17:18:27 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.82.14.19
Hostname: nm-shakti-worker6
Capacity:
cpu: 256
ephemeral-storage: 459778128Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 528139536Ki
nvidia.com/gpu: 0
pods: 110
Allocatable:
cpu: 256
ephemeral-storage: 423731522064
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 528037136Ki
nvidia.com/gpu: 0
pods: 110
System Info:
Machine ID: 604c16e1b6dc4bf182c91ec14fcce1bd
System UUID: 7f12d000-0f90-11ed-8000-3cecefeab242
Boot ID: 29e635b2-b1cc-491b-af4a-3930fc2bd8d8
Kernel Version: 5.15.0-91-lowlatency
OS Image: Ubuntu 22.04.3 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.14
Kubelet Version: v1.29.0
Kube-Proxy Version: v1.29.0
PodCIDR: 10.244.0.0/24
PodCIDRs: 10.244.0.0/24
Non-terminated Pods: (24 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
calico-apiserver calico-apiserver-6598988b78-8bl8p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d19h
calico-apiserver calico-apiserver-6598988b78-zq5lh 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d19h
calico-system calico-kube-controllers-779fc55954-hbvlw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d19h
calico-system calico-node-hzpnc 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d19h
calico-system calico-typha-6fd5cc6495-wr86s 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d19h
calico-system csi-node-driver-nqdwf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d19h
gpu-operator gpu-feature-discovery-cdp92 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d15h
gpu-operator gpu-operator-1703607135-node-feature-discovery-gc-796f559d2vwht 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d15h
gpu-operator gpu-operator-1703607135-node-feature-discovery-master-59bf42qtd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d15h
gpu-operator gpu-operator-1703607135-node-feature-discovery-worker-qck97 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d15h
gpu-operator gpu-operator-99d8f4cd7-vn45x 200m (0%) 500m (0%) 100Mi (0%) 350Mi (0%) 7d15h
gpu-operator nvidia-container-toolkit-daemonset-jqr8v 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d15h
gpu-operator nvidia-dcgm-exporter-khgjc 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d15h
gpu-operator nvidia-dra-device-plugin-k8s-dra-driver-controller-c8d57bb4fw6s 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3m46s
gpu-operator nvidia-dra-device-plugin-k8s-dra-driver-kubelet-plugin-7txkm 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3m45s
gpu-operator nvidia-operator-validator-dtxs2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9h
kube-system coredns-76f75df574-lqdtc 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 7d15h
kube-system coredns-76f75df574-xzmqw 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 7d15h
kube-system etcd-nm-shakti-worker6 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 7d20h
kube-system kube-apiserver-nm-shakti-worker6 250m (0%) 0 (0%) 0 (0%) 0 (0%) 14h
kube-system kube-controller-manager-nm-shakti-worker6 200m (0%) 0 (0%) 0 (0%) 0 (0%) 14h
kube-system kube-proxy-t7gxs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d20h
kube-system kube-scheduler-nm-shakti-worker6 100m (0%) 0 (0%) 0 (0%) 0 (0%) 14h
tigera-operator tigera-operator-55585899bf-lcljn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d19h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1050m (0%) 500m (0%)
memory 340Mi (0%) 690Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events: <none>
$ kubectl describe -n gpu-operator nodeallocationstates.nas.gpu.resource.nvidia.com nm-shakti-worker6
Name: nm-shakti-worker6
Namespace: gpu-operator
Labels: <none>
Annotations: <none>
API Version: nas.gpu.resource.nvidia.com/v1alpha1
Kind: NodeAllocationState
Metadata:
Creation Timestamp: 2024-01-02T21:04:04Z
Generation: 15
Owner References:
API Version: v1
Kind: Node
Name: nm-shakti-worker6
UID: 7195b117-ec11-443e-ba2e-c84bb407708e
Resource Version: 1463603
UID: 39a10030-b2bd-4bd4-8a8b-4c5bb7c757c7
Spec:
Allocatable Devices:
Gpu:
Architecture: Ampere
Brand: Nvidia
Cuda Compute Capability: 8.6
Index: 0
Memory Bytes: 51527024640
Mig Enabled: false
Product Name: NVIDIA A40
Uuid: GPU-e060a342-afa6-2f46-7342-52ab49773d47
Gpu:
Architecture: Ampere
Brand: Nvidia
Cuda Compute Capability: 8.6
Index: 1
Memory Bytes: 51527024640
Mig Enabled: false
Product Name: NVIDIA A40
Uuid: GPU-e72a049f-0e52-1f3a-4e93-fac0c3ecfe50
Gpu:
Architecture: Ampere
Brand: Nvidia
Cuda Compute Capability: 8.6
Index: 2
Memory Bytes: 51527024640
Mig Enabled: false
Product Name: NVIDIA A40
Uuid: GPU-3e9c1e0e-50e4-5121-70dc-438753eeaa1c
Status: Ready
Events: <none>
How did I install k8s-dra-driver?
To fulfill the prerequisites, I simply used the GPU Operator, which I already had working in the cluster, and just disabled its device plugin:
helm upgrade --install gpu-operator-1703607135 --namespace gpu-operator nvidia/gpu-operator --set devicePlugin.enabled=false
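For context, my understanding is that DRA is still alpha in Kubernetes 1.29, so the cluster itself also needs the DynamicResourceAllocation feature gate and the resource.k8s.io/v1alpha2 API group enabled. Roughly, that looks like the following for kubeadm (a sketch of the prerequisite as I understand it, not a dump of my actual config):
# sketch: cluster-side DRA prerequisites for kubeadm (assumed, not my exact config)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: DynamicResourceAllocation=true
    runtime-config: resource.k8s.io/v1alpha2=true
controllerManager:
  extraArgs:
    feature-gates: DynamicResourceAllocation=true
scheduler:
  extraArgs:
    feature-gates: DynamicResourceAllocation=true
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true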
To install NVIDIA/k8s-dra-driver, I followed the script (demo/clusters/kind/scripts/build-driver-image.sh) to build the image and then ran:
helm upgrade --install nvidia-dra-device-plugin --namespace gpu-operator k8s-dra-driver/
This gave me:
$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-cdp92 1/1 Running 1 (12h ago) 7d15h
gpu-operator-1703607135-node-feature-discovery-gc-796f559d2vwht 1/1 Running 2 (12h ago) 7d15h
gpu-operator-1703607135-node-feature-discovery-master-59bf42qtd 1/1 Running 2 (12h ago) 7d15h
gpu-operator-1703607135-node-feature-discovery-worker-qck97 1/1 Running 3 (12h ago) 7d15h
gpu-operator-99d8f4cd7-vn45x 1/1 Running 4 (12h ago) 7d15h
nvidia-container-toolkit-daemonset-jqr8v 1/1 Running 2 (12h ago) 7d15h
nvidia-cuda-validator-n2ltp 0/1 Completed 0 9h
nvidia-dcgm-exporter-khgjc 1/1 Running 1 (12h ago) 7d15h
nvidia-dra-device-plugin-k8s-dra-driver-controller-c8d57bb4fw6s 1/1 Running 0 7m33s
nvidia-dra-device-plugin-k8s-dra-driver-kubelet-plugin-7txkm 1/1 Running 0 7m32s
nvidia-operator-validator-dtxs2 1/1 Running 0 9h
The setup works fine with the default device plugin from the GPU Operator, but it fails to find and allocate GPUs with NVIDIA/k8s-dra-driver. My installation is probably wrong. What did I miss?
Thanks!