
DRA driver is not able to allocate gpu #46

Closed
@parthyadav3105

Description

I am working with a kubeadm cluster on a bare-metal server with 3 NVIDIA A40 GPUs.

Problem: The gpu.nvidia.com driver is not able to allocate a GPU.

My manifests:

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: random
spec:
  spec:  # nested ResourceClaimSpec used for claims generated from this template
    resourceClassName: gpu.nvidia.com

---
apiVersion: v1
kind: Pod
metadata:
  name: dra-sample
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["sleep", "10000"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: random
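
For reference, the controller logs below show the driver registering as "gpu.resource.nvidia.com", so I assume the ResourceClass backing this claim looks roughly like the following sketch (based on those two names, not the chart's exact manifest):

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
  name: gpu.nvidia.com
driverName: gpu.resource.nvidia.com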

Here are outputs for debugging:

$ kubectl describe pod dra-sample

Name:             dra-sample
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Containers:
  ctr:
    Image:      ubuntu:22.04
    Port:       <none>
    Host Port:  <none>
    Command:
      sleep
      10000
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x52xb (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-x52xb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  25s   default-scheduler  0/1 nodes are available: 1 waiting for resource driver to allocate resource.
  Warning  FailedScheduling  23s   default-scheduler  0/1 nodes are available: 1 waiting for resource driver to provide information.

$ kubectl logs -n gpu-operator nvidia-dra-device-plugin-k8s-dra-driver-controller-c8d57bb4fw6s

I0103 07:33:19.351405       1 controller.go:295] "resource controller: Starting" driver="gpu.resource.nvidia.com"
I0103 07:33:19.351515       1 reflector.go:287] Starting reflector *v1alpha2.ResourceClaim (0s) from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351519       1 reflector.go:287] Starting reflector *v1alpha2.PodSchedulingContext (0s) from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351538       1 reflector.go:323] Listing and watching *v1alpha2.ResourceClaim from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351544       1 reflector.go:323] Listing and watching *v1alpha2.PodSchedulingContext from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351555       1 reflector.go:287] Starting reflector *v1alpha2.ResourceClass (0s) from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351572       1 reflector.go:323] Listing and watching *v1alpha2.ResourceClass from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.357423       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclaims?limit=500&resourceVersion=0 200 OK in 5 milliseconds
I0103 07:33:19.357445       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/podschedulingcontexts?limit=500&resourceVersion=0 200 OK in 5 milliseconds
I0103 07:33:19.357459       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclasses?limit=500&resourceVersion=0 200 OK in 5 milliseconds
I0103 07:33:19.358836       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/podschedulingcontexts?allowWatchBookmarks=true&resourceVersion=1463537&timeout=7m2s&timeoutSeconds=422&watch=true 200 OK in 0 milliseconds
I0103 07:33:19.358847       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclaims?allowWatchBookmarks=true&resourceVersion=1463541&timeout=5m15s&timeoutSeconds=315&watch=true 200 OK in 1 milliseconds
I0103 07:33:19.359089       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclasses?allowWatchBookmarks=true&resourceVersion=1463537&timeout=9m53s&timeoutSeconds=593&watch=true 200 OK in 1 milliseconds
I0103 07:33:19.452100       1 shared_informer.go:344] caches populated
I0103 07:33:27.927510       1 controller.go:241] "resource controller: new object" type="ResourceClaim" content="{\"metadata\":{\"name\":\"dra-sample-gpu-vgbkc\",\"generateName\":\"dra-sample-gpu-\",\"namespace\":\"default\",\"uid\":\"cf3f2558-4f63-402a-a7c3-333b7d928885\",\"resourceVersion\":\"1463627\",\"creationTimestamp\":\"2024-01-03T07:33:27Z\",\"annotations\":{\"resource.kubernetes.io/pod-claim-name\":\"gpu\"},\"ownerReferences\":[{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"name\":\"dra-sample\",\"uid\":\"803e19c7-41d6-4331-94a2-e86b3c32b177\",\"controller\":true,\"blockOwnerDeletion\":true}],\"managedFields\":[{\"manager\":\"kube-controller-manager\",\"operation\":\"Update\",\"apiVersion\":\"resource.k8s.io/v1alpha2\",\"time\":\"2024-01-03T07:33:27Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:resource.kubernetes.io/pod-claim-name\":{}},\"f:generateName\":{},\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"803e19c7-41d6-4331-94a2-e86b3c32b177\\\"}\":{}}},\"f:spec\":{\"f:allocationMode\":{},\"f:resourceClassName\":{}}}}]},\"spec\":{\"resourceClassName\":\"gpu.nvidia.com\",\"allocationMode\":\"WaitForFirstConsumer\"},\"status\":{}}"
I0103 07:33:27.927541       1 controller.go:260] "resource controller: Adding new work item" key="claim:default/dra-sample-gpu-vgbkc"
I0103 07:33:27.927583       1 controller.go:332] "resource controller: processing" key="claim:default/dra-sample-gpu-vgbkc"
I0103 07:33:27.927603       1 controller.go:476] "resource controller: ResourceClaim waiting for first consumer" key="claim:default/dra-sample-gpu-vgbkc"
I0103 07:33:27.927613       1 controller.go:336] "resource controller: completed" key="claim:default/dra-sample-gpu-vgbkc"
I0103 07:33:27.935496       1 controller.go:241] "resource controller: new object" type="PodSchedulingContext" content="{\"metadata\":{\"name\":\"dra-sample\",\"namespace\":\"default\",\"uid\":\"e8b1fa44-574f-4a92-bdd9-26e33f816689\",\"resourceVersion\":\"1463629\",\"creationTimestamp\":\"2024-01-03T07:33:27Z\",\"ownerReferences\":[{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"name\":\"dra-sample\",\"uid\":\"803e19c7-41d6-4331-94a2-e86b3c32b177\",\"controller\":true,\"blockOwnerDeletion\":true}],\"managedFields\":[{\"manager\":\"kube-scheduler\",\"operation\":\"Update\",\"apiVersion\":\"resource.k8s.io/v1alpha2\",\"time\":\"2024-01-03T07:33:27Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"803e19c7-41d6-4331-94a2-e86b3c32b177\\\"}\":{}}},\"f:spec\":{\"f:potentialNodes\":{},\"f:selectedNode\":{}}}}]},\"spec\":{\"selectedNode\":\"nm-shakti-worker6\",\"potentialNodes\":[\"nm-shakti-worker6\"]},\"status\":{}}"
I0103 07:33:27.935520       1 controller.go:260] "resource controller: Adding new work item" key="schedulingCtx:default/dra-sample"
I0103 07:33:27.935547       1 controller.go:332] "resource controller: processing" key="schedulingCtx:default/dra-sample"
I0103 07:33:27.937738       1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/default/pods/dra-sample 200 OK in 2 milliseconds
I0103 07:33:27.939722       1 controller.go:390] "resource controller: ResourceClaim not found, no need to process it" key="schedulingCtx:default/dra-sample"
I0103 07:33:27.939739       1 controller.go:658] "resource controller: Found no pending pod claims" key="schedulingCtx:default/dra-sample"
I0103 07:33:27.939747       1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:default/dra-sample"
I0103 07:33:57.940731       1 controller.go:332] "resource controller: processing" key="schedulingCtx:default/dra-sample"
I0103 07:33:57.944575       1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/default/pods/dra-sample 200 OK in 3 milliseconds
I0103 07:33:57.945164       1 controller.go:390] "resource controller: ResourceClaim not found, no need to process it" key="schedulingCtx:default/dra-sample"
I0103 07:33:57.945186       1 controller.go:658] "resource controller: Found no pending pod claims" key="schedulingCtx:default/dra-sample"
I0103 07:33:57.945198       1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:default/dra-sample"
I0103 07:34:27.946163       1 controller.go:332] "resource controller: processing" key="schedulingCtx:default/dra-sample"
I0103 07:34:27.950857       1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/default/pods/dra-sample 200 OK in 4 milliseconds
I0103 07:34:27.951558       1 controller.go:390] "resource controller: ResourceClaim not found, no need to process it" key="schedulingCtx:default/dra-sample"
I0103 07:34:27.951578       1 controller.go:658] "resource controller: Found no pending pod claims" key="schedulingCtx:default/dra-sample"
I0103 07:34:27.951589       1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:default/dra-sample"
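
Note that the controller does see the new ResourceClaim and marks it as waiting for its first consumer, but when it processes the PodSchedulingContext it repeatedly logs "ResourceClaim not found" and "Found no pending pod claims". To inspect the generated claim directly (name taken from the log lines above):

$ kubectl get resourceclaim dra-sample-gpu-vgbkc -n default -o yaml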

$ kubectl logs -n gpu-operator nvidia-dra-device-plugin-k8s-dra-driver-kubelet-plugin-7txkm

Defaulted container "plugin" out of: plugin, init (init)
I0103 07:33:20.280685       1 device_state.go:142] using devRoot=/driver-root
I0103 07:33:20.291222       1 nonblockinggrpcserver.go:107] "dra: GRPC server started"
I0103 07:33:20.291293       1 nonblockinggrpcserver.go:107] "registrar: GRPC server started"
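
The kubelet plugin starts its gRPC servers but logs nothing further. As a sanity check that it actually registered with the kubelet, I could look for its sockets on the node (assuming the default kubelet state directory; the exact socket names depend on the driver):

$ ls /var/lib/kubelet/plugins_registry/
$ ls /var/lib/kubelet/plugins/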

The node also reports no nvidia.com/gpu capacity (expected here, since I disabled the device plugin and DRA does not advertise GPUs as extended resources):
$ kubectl describe node nm-shakti-worker6

Name:               nm-shakti-worker6
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.CETSS=true
                    feature.node.kubernetes.io/cpu-cpuid.CLZERO=true
                    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
                    feature.node.kubernetes.io/cpu-cpuid.CPBOOST=true
                    feature.node.kubernetes.io/cpu-cpuid.EFER_LMSLE_UNS=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.FP256=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSR=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBPB=true
                    feature.node.kubernetes.io/cpu-cpuid.IBRS=true
                    feature.node.kubernetes.io/cpu-cpuid.IBRS_PREFERRED=true
                    feature.node.kubernetes.io/cpu-cpuid.IBRS_PROVIDES_SMP=true
                    feature.node.kubernetes.io/cpu-cpuid.IBS=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSFFV=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK=true
                    feature.node.kubernetes.io/cpu-cpuid.IBS_FETCH_CTLX=true
                    feature.node.kubernetes.io/cpu-cpuid.IBS_OPFUSE=true
                    feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST=true
                    feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD=true
                    feature.node.kubernetes.io/cpu-cpuid.INVLPGB=true
                    feature.node.kubernetes.io/cpu-cpuid.LAHF=true
                    feature.node.kubernetes.io/cpu-cpuid.LBRVIRT=true
                    feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW=true
                    feature.node.kubernetes.io/cpu-cpuid.MCOMMIT=true
                    feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
                    feature.node.kubernetes.io/cpu-cpuid.MOVU=true
                    feature.node.kubernetes.io/cpu-cpuid.MSRIRC=true
                    feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH=true
                    feature.node.kubernetes.io/cpu-cpuid.NRIPS=true
                    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.PPIN=true
                    feature.node.kubernetes.io/cpu-cpuid.PSFD=true
                    feature.node.kubernetes.io/cpu-cpuid.RDPRU=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_ES=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_SNP=true
                    feature.node.kubernetes.io/cpu-cpuid.SHA=true
                    feature.node.kubernetes.io/cpu-cpuid.SME=true
                    feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT=true
                    feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
                    feature.node.kubernetes.io/cpu-cpuid.SSE4A=true
                    feature.node.kubernetes.io/cpu-cpuid.STIBP=true
                    feature.node.kubernetes.io/cpu-cpuid.STIBP_ALWAYSON=true
                    feature.node.kubernetes.io/cpu-cpuid.SUCCOR=true
                    feature.node.kubernetes.io/cpu-cpuid.SVM=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMDA=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMFBASID=true
                    feature.node.kubernetes.io/cpu-cpuid.SVML=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMNP=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMPF=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMPFT=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
                    feature.node.kubernetes.io/cpu-cpuid.TLB_FLUSH_NESTED=true
                    feature.node.kubernetes.io/cpu-cpuid.TOPEXT=true
                    feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR=true
                    feature.node.kubernetes.io/cpu-cpuid.VAES=true
                    feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN=true
                    feature.node.kubernetes.io/cpu-cpuid.VMPL=true
                    feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT=true
                    feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
                    feature.node.kubernetes.io/cpu-cpuid.VTE=true
                    feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true
                    feature.node.kubernetes.io/cpu-cpuid.X87=true
                    feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/cpu-model.family=25
                    feature.node.kubernetes.io/cpu-model.id=1
                    feature.node.kubernetes.io/cpu-model.vendor_id=AMD
                    feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
                    feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMON=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-config.PREEMPT=true
                    feature.node.kubernetes.io/kernel-version.full=5.15.0-91-lowlatency
                    feature.node.kubernetes.io/kernel-version.major=5
                    feature.node.kubernetes.io/kernel-version.minor=15
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/memory-numa=true
                    feature.node.kubernetes.io/network-sriov.capable=true
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-10de.sriov.capable=true
                    feature.node.kubernetes.io/pci-1a03.present=true
                    feature.node.kubernetes.io/pci-8086.present=true
                    feature.node.kubernetes.io/pci-8086.sriov.capable=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    feature.node.kubernetes.io/usb-ef_0b1f_03ee.present=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=nm-shakti-worker6
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
                    nvidia.com/cuda.driver.major=525
                    nvidia.com/cuda.driver.minor=147
                    nvidia.com/cuda.driver.rev=05
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=0
                    nvidia.com/dra.controller=true
                    nvidia.com/dra.kubelet-plugin=true
                    nvidia.com/gfd.timestamp=1704222592
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=6
                    nvidia.com/gpu.count=3
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=pre-installed
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=AS--4124GS-TNR
                    nvidia.com/gpu.memory=49140
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-A40
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
Annotations:        csi.volume.kubernetes.io/nodeid: {"csi.tigera.io":"nm-shakti-worker6"}
                    flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"42:b2:da:55:59:b8"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.82.14.19
                    kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.CETSS,cpu-cpuid.CLZERO,cpu-cpuid.CMPXCHG8,cpu-cpuid.CPBOOST,cpu-cpuid...
                    nfd.node.kubernetes.io/master.version: v0.14.2
                    nfd.node.kubernetes.io/worker.version: v0.14.2
                    node.alpha.kubernetes.io/ttl: 0
                    nvidia.com/gpu-driver-upgrade-enabled: true
                    projectcalico.org/IPv4Address: 10.82.14.19/24
                    projectcalico.org/IPv4VXLANTunnelAddr: 10.244.170.0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 26 Dec 2023 10:42:19 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  nm-shakti-worker6
  AcquireTime:     <unset>
  RenewTime:       Wed, 03 Jan 2024 07:36:56 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 02 Jan 2024 19:08:58 +0000   Tue, 02 Jan 2024 19:08:58 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Wed, 03 Jan 2024 07:34:36 +0000   Tue, 26 Dec 2023 10:42:17 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 03 Jan 2024 07:34:36 +0000   Tue, 26 Dec 2023 10:42:17 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 03 Jan 2024 07:34:36 +0000   Tue, 26 Dec 2023 10:42:17 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Wed, 03 Jan 2024 07:34:36 +0000   Tue, 02 Jan 2024 17:18:27 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.82.14.19
  Hostname:    nm-shakti-worker6
Capacity:
  cpu:                256
  ephemeral-storage:  459778128Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             528139536Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                256
  ephemeral-storage:  423731522064
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             528037136Ki
  nvidia.com/gpu:     0
  pods:               110
System Info:
  Machine ID:                 604c16e1b6dc4bf182c91ec14fcce1bd
  System UUID:                7f12d000-0f90-11ed-8000-3cecefeab242
  Boot ID:                    29e635b2-b1cc-491b-af4a-3930fc2bd8d8
  Kernel Version:             5.15.0-91-lowlatency
  OS Image:                   Ubuntu 22.04.3 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.14
  Kubelet Version:            v1.29.0
  Kube-Proxy Version:         v1.29.0
PodCIDR:                      10.244.0.0/24
PodCIDRs:                     10.244.0.0/24
Non-terminated Pods:          (24 in total)
  Namespace                   Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                               ------------  ----------  ---------------  -------------  ---
  calico-apiserver            calico-apiserver-6598988b78-8bl8p                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
  calico-apiserver            calico-apiserver-6598988b78-zq5lh                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
  calico-system               calico-kube-controllers-779fc55954-hbvlw                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
  calico-system               calico-node-hzpnc                                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
  calico-system               calico-typha-6fd5cc6495-wr86s                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
  calico-system               csi-node-driver-nqdwf                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
  gpu-operator                gpu-feature-discovery-cdp92                                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d15h
  gpu-operator                gpu-operator-1703607135-node-feature-discovery-gc-796f559d2vwht    0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d15h
  gpu-operator                gpu-operator-1703607135-node-feature-discovery-master-59bf42qtd    0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d15h
  gpu-operator                gpu-operator-1703607135-node-feature-discovery-worker-qck97        0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d15h
  gpu-operator                gpu-operator-99d8f4cd7-vn45x                                       200m (0%)     500m (0%)   100Mi (0%)       350Mi (0%)     7d15h
  gpu-operator                nvidia-container-toolkit-daemonset-jqr8v                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d15h
  gpu-operator                nvidia-dcgm-exporter-khgjc                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d15h
  gpu-operator                nvidia-dra-device-plugin-k8s-dra-driver-controller-c8d57bb4fw6s    0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m46s
  gpu-operator                nvidia-dra-device-plugin-k8s-dra-driver-kubelet-plugin-7txkm       0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m45s
  gpu-operator                nvidia-operator-validator-dtxs2                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         9h
  kube-system                 coredns-76f75df574-lqdtc                                           100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     7d15h
  kube-system                 coredns-76f75df574-xzmqw                                           100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     7d15h
  kube-system                 etcd-nm-shakti-worker6                                             100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         7d20h
  kube-system                 kube-apiserver-nm-shakti-worker6                                   250m (0%)     0 (0%)      0 (0%)           0 (0%)         14h
  kube-system                 kube-controller-manager-nm-shakti-worker6                          200m (0%)     0 (0%)      0 (0%)           0 (0%)         14h
  kube-system                 kube-proxy-t7gxs                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d20h
  kube-system                 kube-scheduler-nm-shakti-worker6                                   100m (0%)     0 (0%)      0 (0%)           0 (0%)         14h
  tigera-operator             tigera-operator-55585899bf-lcljn                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                1050m (0%)  500m (0%)
  memory             340Mi (0%)  690Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0
Events:              <none>

$ kubectl describe -n gpu-operator nodeallocationstates.nas.gpu.resource.nvidia.com nm-shakti-worker6

Name:         nm-shakti-worker6
Namespace:    gpu-operator
Labels:       <none>
Annotations:  <none>
API Version:  nas.gpu.resource.nvidia.com/v1alpha1
Kind:         NodeAllocationState
Metadata:
  Creation Timestamp:  2024-01-02T21:04:04Z
  Generation:          15
  Owner References:
    API Version:     v1
    Kind:            Node
    Name:            nm-shakti-worker6
    UID:             7195b117-ec11-443e-ba2e-c84bb407708e
  Resource Version:  1463603
  UID:               39a10030-b2bd-4bd4-8a8b-4c5bb7c757c7
Spec:
  Allocatable Devices:
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.6
      Index:                    0
      Memory Bytes:             51527024640
      Mig Enabled:              false
      Product Name:             NVIDIA A40
      Uuid:                     GPU-e060a342-afa6-2f46-7342-52ab49773d47
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.6
      Index:                    1
      Memory Bytes:             51527024640
      Mig Enabled:              false
      Product Name:             NVIDIA A40
      Uuid:                     GPU-e72a049f-0e52-1f3a-4e93-fac0c3ecfe50
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.6
      Index:                    2
      Memory Bytes:             51527024640
      Mig Enabled:              false
      Product Name:             NVIDIA A40
      Uuid:                     GPU-3e9c1e0e-50e4-5121-70dc-438753eeaa1c
Status:                         Ready
Events:                         <none>
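
So the kubelet plugin has enumerated all three A40s in the NodeAllocationState spec, and the CR reports Status: Ready, which suggests device discovery on the node itself works.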

How I installed k8s-dra-driver:

To fulfill the prerequisites, I simply used the gpu-operator, which I already had working in the cluster, and just disabled its device plugin:

helm upgrade --install gpu-operator-1703607135 --namespace gpu-operator nvidia/gpu-operator --set devicePlugin.enabled=false

To install NVIDIA/k8s-dra-driver, I followed the script (demo/clusters/kind/scripts/build-driver-image.sh) to build the driver image and then ran:

helm upgrade --install nvidia-dra-device-plugin --namespace gpu-operator k8s-dra-driver/
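
To double-check the API objects after installing, the resource.k8s.io resources can be listed (I assume the chart is supposed to create a ResourceClass named gpu.nvidia.com):

$ kubectl get resourceclasses
$ kubectl get resourceclaims -A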

This gave me the following pods:
$ kubectl get pods -n gpu-operator

NAME                                                              READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-cdp92                                       1/1     Running     1 (12h ago)   7d15h
gpu-operator-1703607135-node-feature-discovery-gc-796f559d2vwht   1/1     Running     2 (12h ago)   7d15h
gpu-operator-1703607135-node-feature-discovery-master-59bf42qtd   1/1     Running     2 (12h ago)   7d15h
gpu-operator-1703607135-node-feature-discovery-worker-qck97       1/1     Running     3 (12h ago)   7d15h
gpu-operator-99d8f4cd7-vn45x                                      1/1     Running     4 (12h ago)   7d15h
nvidia-container-toolkit-daemonset-jqr8v                          1/1     Running     2 (12h ago)   7d15h
nvidia-cuda-validator-n2ltp                                       0/1     Completed   0             9h
nvidia-dcgm-exporter-khgjc                                        1/1     Running     1 (12h ago)   7d15h
nvidia-dra-device-plugin-k8s-dra-driver-controller-c8d57bb4fw6s   1/1     Running     0             7m33s
nvidia-dra-device-plugin-k8s-dra-driver-kubelet-plugin-7txkm      1/1     Running     0             7m32s
nvidia-operator-validator-dtxs2                                   1/1     Running     0             9h

The setup works fine with the default device plugin from the gpu-operator, but NVIDIA/k8s-dra-driver fails to find and allocate GPUs. My installation is probably wrong. What did I miss?

Thanks!
