Maximize GPU network bandwidth with GPUDirect and multi-networking


This page shows you how to maximize network bandwidth and throughput for high-performance GPU workloads in Google Kubernetes Engine (GKE) clusters in Standard mode. This page is intended for machine learning (ML) engineers and platform administrators who facilitate ML workloads. You should already be familiar with networking technologies such as network interface cards (NICs) and TCP, and with accelerator technologies like the NVIDIA Collective Communications Library (NCCL).

Artificial intelligence (AI), ML, and high performance computing (HPC) applications require powerful acceleration to optimize performance by reducing job completion times. For example, ML models that focus on conversational AI and image generation require high scalability and compute power.

About Google Cloud GPU supercomputers

Google Cloud has accelerator-optimized supercomputers that are built for scalable, massive models. These machines have the following benefits:

  • Eight NVIDIA H100 GPUs per machine.
  • Up to 200 Gbps bandwidth on the primary NIC.
  • Secondary NICs (up to eight on A3 Mega machine types and up to four on A3 standard machine types), each supporting up to 200 Gbps bandwidth for GPU data transfer.

For a full list of benefits, see A3 machine series in the Compute Engine documentation.
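
As a rough illustration of what these figures add up to, the following sketch multiplies the per-NIC peak by the number of secondary NICs. It assumes that every secondary NIC reaches its advertised 200 Gbps peak at the same time, which real workloads might not achieve. The helper name aggregate_gbps is hypothetical.

```shell
# Rough upper bound on secondary-NIC bandwidth per node.
# Assumption: every secondary NIC reaches its 200 Gbps peak simultaneously.
aggregate_gbps() {
  nic_count=$1        # number of secondary NICs on the machine type
  per_nic_gbps=200    # "up to 200 Gbps" per secondary NIC, from the list above
  echo $((nic_count * per_nic_gbps))
}

echo "A3 standard (4 secondary NICs): $(aggregate_gbps 4) Gbps"
echo "A3 Mega (8 secondary NICs): $(aggregate_gbps 8) Gbps"
```

These numbers are ceilings for GPU data transfer, not expected throughput.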

Your GKE workload must use all available GPUs and all available secondary NICs on a single node and use a significant portion of the available bandwidth. The solution described in this document is ideal for workloads that require high performance, high throughput, and low latency.

Required features and capabilities for maximized bandwidth

To maximize your network bandwidth in GPU supercomputer nodes, use all of the following features:

  • GPUDirect networking stack: The A3 machine series supports two networking stacks for custom, remote direct memory access (RDMA):
    • On A3 standard machine types, utilize GPUDirect-TCPX to reduce the overhead required to transfer packet payloads to and from GPUs, which significantly improves throughput at scale compared to GPUs that don't use GPUDirect.
    • On A3 Mega machine types, utilize GPUDirect-TCPXO, which further improves GPU to VM communication.
  • gVNIC: Enable GPUDirect capabilities such as packet header splitting, flow steering, and buffer management. gVNIC is required to use GPUDirect-TCPX or GPUDirect-TCPXO. For details about gVNIC, see Increase network traffic speed for GPU nodes.
  • Multi-networking: Add secondary NICs to the accelerator-optimized machine. Each NIC is associated with a separate subnet in its own VPC to avoid conflicts. For details about multi-network support, see Setup multi-network support for Pods.
  • Placement policies: Use a resource placement policy to place all GPU nodes for a specific workload on physically close servers to minimize latency. For details, see Define compact placement for GKE nodes.

Procedure outline

To use all of these capabilities together, you'll do the following:

  1. Create Virtual Private Cloud (VPC) networks and subnets
  2. Create the GKE environment:
    1. Create a cluster with multi-networking enabled
    2. Create a node pool with the following characteristics:
      1. gVNIC enabled
      2. Multi-networking subnets specified for each secondary NIC
      3. A3 machine series with H100 GPUs backing the nodes
      4. Latest NVIDIA drivers installed
  3. Install the GPUDirect binary and the NCCL plugin
  4. Deploy a test workload to verify GPUDirect setup

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
  • Ensure that you have enough quota for H100 GPUs. To request more quota, see GPU quotas.

Requirements

The following requirements apply to both GPUDirect-TCPX and GPUDirect-TCPXO unless otherwise indicated.

  • GPUDirect-TCPX is supported on GKE version 1.27 or later and requires:
    • A3 standard machine type (for example, a3-highgpu-8g).
    • For GKE version 1.27, use GKE patch version 1.27.7-gke.1121000 or later.
    • For GKE version 1.28, use GKE patch version 1.28.8-gke.1095000 or later.
    • For GKE version 1.29, use GKE patch version 1.29.3-gke.1093000 or later.
  • GPUDirect-TCPXO is supported on GKE version 1.28 or later and requires:
    • A3 Mega machine type (for example, a3-megagpu-8g).
    • For GKE version 1.28, use GKE patch version 1.28.9-gke.1250000 or later.
    • For GKE version 1.29, use GKE patch version 1.29.4-gke.1542000 or later.
  • Your GPU nodes must use NVIDIA driver version 535 or later.
  • You must use GKE Dataplane V2.
  • The GKE node must use a Container-Optimized OS (COS) node image. Ubuntu and Windows node images are not supported.
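
The version requirements above can be checked mechanically. The following sketch is one way to do so in a POSIX shell. The helper names min_version_for and version_at_least are hypothetical, and the comparison assumes that versions follow the MAJOR.MINOR.PATCH-gke.BUILD pattern shown in the list.

```shell
# Minimum patch version for a stack ("tcpx" or "tcpxo") and a GKE minor
# version, copied from the requirements list above.
min_version_for() {
  case "$1:$2" in
    tcpx:1.27)  echo "1.27.7-gke.1121000" ;;
    tcpx:1.28)  echo "1.28.8-gke.1095000" ;;
    tcpx:1.29)  echo "1.29.3-gke.1093000" ;;
    tcpxo:1.28) echo "1.28.9-gke.1250000" ;;
    tcpxo:1.29) echo "1.29.4-gke.1542000" ;;
    *) return 1 ;;   # combination not covered by the list
  esac
}

# Succeeds if version $1 is greater than or equal to version $2.
# Compares the four numeric fields (major, minor, patch, build) in order.
version_at_least() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    gsub(/-gke/, "", have); gsub(/-gke/, "", want)
    split(have, h, "."); split(want, w, ".")
    for (i = 1; i <= 4; i++) {
      if (h[i] + 0 > w[i] + 0) exit 0
      if (h[i] + 0 < w[i] + 0) exit 1
    }
    exit 0
  }'
}

if version_at_least "1.28.9-gke.1250000" "$(min_version_for tcpxo 1.28)"; then
  echo "version meets the GPUDirect-TCPXO minimum"
else
  echo "version is too old for GPUDirect-TCPXO"
fi
```

The first argument could come from any source that reports the cluster version; the literal value here is only an example.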

Limitations

The following limitations apply:

Create VPCs and subnets

Create separate VPC networks in your project for each virtual NIC that you'll add to your nodes. Each VPC network must have a subnet and a firewall rule that allows internal network traffic.

  1. Create the VPC networks for GPUDirect in your project, each with a subnet and a firewall rule. Choose the GPUDirect-TCPX tab for A3 standard machine types, or choose the GPUDirect-TCPXO tab for A3 Mega machine types, then complete the following instructions:

    GPUDirect-TCPX

    To maximize your bandwidth, we recommend that you create four new networks.

    for N in $(seq 1 4); do
    gcloud compute networks create PROJECT_ID-net-$N \
        --subnet-mode=custom \
        --mtu=8244
    
    gcloud compute networks subnets create PROJECT_ID-sub-$N \
        --network=PROJECT_ID-net-$N \
        --region=REGION \
        --range=SUBNET_RANGE
    
    gcloud compute firewall-rules create PROJECT_ID-internal-$N \
      --network=PROJECT_ID-net-$N \
      --action=ALLOW \
      --rules=tcp:0-65535,udp:0-65535,icmp \
      --source-ranges=SOURCE_RANGE
    done
    

    Replace the following:

    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the Compute Engine region for each subnet.
    • SUBNET_RANGE: the IP address range of each subnet in CIDR notation. This example command iterates for four subnets, so you should use a variable to change the IP address for each subnet. For example, specify 192.168.$N.0/24 so that the first subnet uses 192.168.1.0/24, the second subnet uses 192.168.2.0/24, etc.
    • SOURCE_RANGE: The source IP address range for the firewall rule to allow ingress traffic, in CIDR notation. For example, 192.168.0.0/16.

    GPUDirect-TCPXO

    To maximize your bandwidth, we recommend that you create eight new networks.

    for N in $(seq 1 8); do
    gcloud compute networks create PROJECT_ID-net-$N \
        --subnet-mode=custom \
        --mtu=8244
    
    gcloud compute networks subnets create PROJECT_ID-sub-$N \
        --network=PROJECT_ID-net-$N \
        --region=REGION \
        --range=SUBNET_RANGE
    
    gcloud compute firewall-rules create PROJECT_ID-internal-$N \
      --network=PROJECT_ID-net-$N \
      --action=ALLOW \
      --rules=tcp:0-65535,udp:0-65535,icmp \
      --source-ranges=SOURCE_RANGE
    done
    

    Replace the following:

    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the Compute Engine region for each subnet.
    • SUBNET_RANGE: the IP address range of each subnet in CIDR notation. This example command iterates for eight subnets, so you should use a variable to change the IP address for each subnet. For example, specify 192.168.$N.0/24 so that the first subnet uses 192.168.1.0/24, the second subnet uses 192.168.2.0/24, etc.
    • SOURCE_RANGE: The source IP address range for the firewall rule to allow ingress traffic, in CIDR notation. For example, 192.168.0.0/16.
  2. Verify that the networks were created:

    gcloud compute networks list
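
The SUBNET_RANGE placeholder in step 1 has to change on every loop iteration. The following sketch shows one way to derive it from the loop index, following the 192.168.N.0/24 example. It only prints the subnet commands as a dry run and does not call gcloud. The helper names subnet_range and print_subnet_commands are hypothetical, and PROJECT_ID and REGION remain placeholders.

```shell
# Derive a distinct /24 range for the Nth subnet.
subnet_range() {
  echo "192.168.$1.0/24"
}

# Dry run: print the subnet creation command for each network.
# Pass 4 for GPUDirect-TCPX or 8 for GPUDirect-TCPXO.
print_subnet_commands() {
  count=$1
  n=1
  while [ "$n" -le "$count" ]; do
    echo "gcloud compute networks subnets create PROJECT_ID-sub-$n --network=PROJECT_ID-net-$n --region=REGION --range=$(subnet_range "$n")"
    n=$((n + 1))
  done
}

print_subnet_commands 4
```

With this scheme, a SOURCE_RANGE of 192.168.0.0/16 covers every generated subnet.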
    

Create the GKE environment

Create a new GKE cluster that uses multi-networking (Preview) and create a GPU node pool that uses A3 machines with H100 GPUs and additional NICs. You can't update an existing cluster to use multi-networking.

GPUDirect-TCPX

  1. Create a Standard cluster:

    gcloud container clusters create CLUSTER_NAME \
        --location=LOCATION \
        --cluster-version=VERSION \
        --enable-dataplane-v2 --enable-ip-alias \
        --enable-multi-networking \
        --no-enable-autoupgrade
    

    Replace the following:

    • CLUSTER_NAME: the name of your new cluster.
    • LOCATION: the Compute Engine region for the cluster.
    • VERSION: the GKE version for the cluster. Use a supported version as described in the Requirements section.
  2. Create Network and GKENetworkParamSet resources in the cluster that correspond to the VPC networks and subnetworks that you created:

    kubectl apply -f - <<EOF
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: vpc1
    spec:
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: vpc1
      type: Device
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: vpc2
    spec:
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: vpc2
      type: Device
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: vpc3
    spec:
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: vpc3
      type: Device
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: vpc4
    spec:
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: vpc4
      type: Device
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: vpc1
    spec:
      vpc: PROJECT_ID-net-1
      vpcSubnet: PROJECT_ID-sub-1
      deviceMode: NetDevice
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: vpc2
    spec:
      vpc: PROJECT_ID-net-2
      vpcSubnet: PROJECT_ID-sub-2
      deviceMode: NetDevice
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: vpc3
    spec:
      vpc: PROJECT_ID-net-3
      vpcSubnet: PROJECT_ID-sub-3
      deviceMode: NetDevice
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: vpc4
    spec:
      vpc: PROJECT_ID-net-4
      vpcSubnet: PROJECT_ID-sub-4
      deviceMode: NetDevice
    EOF
    

    These resources tell GKE to configure the NICs for GPU traffic in passthrough mode. GKE doesn't apply built-in networking programming using eBPF to this traffic.

  3. Create a node pool for the H100 GPUs:

    gcloud container node-pools create NODE_POOL_NAME \
        --cluster=CLUSTER_NAME \
        --location=LOCATION \
        --machine-type=a3-highgpu-8g \
        --accelerator=type=nvidia-h100-80gb,count=8,gpu-driver-version=LATEST \
        --additional-node-network=network=PROJECT_ID-net-1,subnetwork=PROJECT_ID-sub-1 \
        --additional-node-network=network=PROJECT_ID-net-2,subnetwork=PROJECT_ID-sub-2 \
        --additional-node-network=network=PROJECT_ID-net-3,subnetwork=PROJECT_ID-sub-3 \
        --additional-node-network=network=PROJECT_ID-net-4,subnetwork=PROJECT_ID-sub-4 \
        --enable-gvnic \
        --no-enable-autoupgrade
    

    Replace NODE_POOL_NAME with the name of the node pool.

    If this command fails, you might not have enough H100 GPU quota in your project. Ensure that you have quota and retry the command.

  4. Get a list of nodes in the cluster:

    kubectl get nodes
    
  5. Verify that each GPU node has eight GPUs:

    kubectl describe node NODE_NAME
    

    The output is similar to the following:

    Capacity:
      ...
      nvidia.com/gpu:             8
    Allocatable:
      ...
      nvidia.com/gpu:             8
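
The GPU count check in the last step can be scripted. The following sketch parses text in the shape of the kubectl describe node output shown above and succeeds only if both Capacity and Allocatable report the expected count. The helper names gpu_counts and has_expected_gpus are hypothetical, and the sample input is the example output from this page, not live cluster data.

```shell
# Read `kubectl describe node` output on stdin and print each
# nvidia.com/gpu value (Capacity first, then Allocatable).
gpu_counts() {
  awk '$1 == "nvidia.com/gpu:" { print $2 }'
}

# Succeeds only if exactly two values are found and both equal $1.
has_expected_gpus() {
  expected=$1
  counts=$(gpu_counts)
  [ "$counts" = "$(printf '%s\n%s' "$expected" "$expected")" ]
}

has_expected_gpus 8 <<'EOF' && echo "GPU count matches"
Capacity:
  nvidia.com/gpu:             8
Allocatable:
  nvidia.com/gpu:             8
EOF
```

On a real cluster, you would pipe the output of kubectl describe node NODE_NAME into has_expected_gpus 8 instead of using the inline sample.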
    

GPUDirect-TCPXO

  1. Choose an available GKE version that supports GPUDirect-TCPXO. To list the versions, run this command:

    gcloud container get-server-config \
      --format="yaml(validMasterVersions)" \
      --zone=ZONE \
      --project=PROJECT_ID
    

    Replace the following:

    • ZONE: the compute zone for the cluster control plane.
    • PROJECT_ID: your Google Cloud project ID.
  2. Create a cluster:

    gcloud --project=PROJECT_ID beta container clusters create CLUSTER_NAME \
      --enable-dataplane-v2 --enable-ip-alias --zone=ZONE \
      --enable-multi-networking --cluster-version=VERSION \
      --no-enable-autoupgrade
    

    Replace the following:

    • PROJECT_ID: your Google Cloud project ID.
    • CLUSTER_NAME: the name of your new cluster.
    • VERSION: a GKE version that supports GPUDirect-TCPXO, as described in Requirements.
    • ZONE: the compute zone for the cluster.
  3. Create Network and GKENetworkParamSet resources in the cluster that correspond to the VPC networks and subnetworks that you created:

    kubectl apply -f - <<EOF
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: vpc1
    spec:
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: vpc1
      type: Device
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: vpc2
    spec:
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: vpc2
      type: Device
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: vpc3
    spec:
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: vpc3
      type: Device
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: vpc4
    spec:
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: vpc4
      type: Device
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: vpc5
    spec:
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: vpc5
      type: Device
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: vpc6
    spec:
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: vpc6
      type: Device
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: vpc7
    spec:
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: vpc7
      type: Device
    ---
    apiVersion: networking.gke.io/v1
    kind: Network
    metadata:
      name: vpc8
    spec:
      parametersRef:
        group: networking.gke.io
        kind: GKENetworkParamSet
        name: vpc8
      type: Device
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: vpc1
    spec:
      vpc: PROJECT_ID-net-1
      vpcSubnet: PROJECT_ID-sub-1
      deviceMode: NetDevice
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: vpc2
    spec:
      vpc: PROJECT_ID-net-2
      vpcSubnet: PROJECT_ID-sub-2
      deviceMode: NetDevice
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: vpc3
    spec:
      vpc: PROJECT_ID-net-3
      vpcSubnet: PROJECT_ID-sub-3
      deviceMode: NetDevice
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: vpc4
    spec:
      vpc: PROJECT_ID-net-4
      vpcSubnet: PROJECT_ID-sub-4
      deviceMode: NetDevice
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: vpc5
    spec:
      vpc: PROJECT_ID-net-5
      vpcSubnet: PROJECT_ID-sub-5
      deviceMode: NetDevice
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: vpc6
    spec:
      vpc: PROJECT_ID-net-6
      vpcSubnet: PROJECT_ID-sub-6
      deviceMode: NetDevice
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: vpc7
    spec:
      vpc: PROJECT_ID-net-7
      vpcSubnet: PROJECT_ID-sub-7
      deviceMode: NetDevice
    ---
    apiVersion: networking.gke.io/v1
    kind: GKENetworkParamSet
    metadata:
      name: vpc8
    spec:
      vpc: PROJECT_ID-net-8
      vpcSubnet: PROJECT_ID-sub-8
      deviceMode: NetDevice
    EOF
    

    These resources tell GKE to configure the NICs for GPU traffic in passthrough mode. GKE doesn't apply built-in networking programming using eBPF to this traffic.

  4. Create a node pool for the H100 GPUs:

    gcloud beta container node-pools create NODE_POOL_NAME \
        --zone=ZONE \
        --cluster=CLUSTER_NAME \
        --project=PROJECT_ID \
        --accelerator=type=nvidia-h100-mega-80gb,count=8,gpu-driver-version=LATEST \
        --machine-type=a3-megagpu-8g \
        --num-nodes=2 \
        --additional-node-network network=PROJECT_ID-net-1,subnetwork=PROJECT_ID-sub-1 \
        --additional-node-network network=PROJECT_ID-net-2,subnetwork=PROJECT_ID-sub-2 \
        --additional-node-network network=PROJECT_ID-net-3,subnetwork=PROJECT_ID-sub-3 \
        --additional-node-network network=PROJECT_ID-net-4,subnetwork=PROJECT_ID-sub-4 \
        --additional-node-network network=PROJECT_ID-net-5,subnetwork=PROJECT_ID-sub-5 \
        --additional-node-network network=PROJECT_ID-net-6,subnetwork=PROJECT_ID-sub-6 \
        --additional-node-network network=PROJECT_ID-net-7,subnetwork=PROJECT_ID-sub-7 \
        --additional-node-network network=PROJECT_ID-net-8,subnetwork=PROJECT_ID-sub-8 \
        --enable-gvnic \
        --no-enable-autoupgrade \
        --scopes "https://www.googleapis.com/auth/cloud-platform" \
        [--placement-policy=POLICY_NAME \
        --reservation-affinity=specific \
        --reservation=RESERVATION_NAME \
        --host-maintenance-interval=PERIODIC]
    

    Replace NODE_POOL_NAME with your node pool name.

    In the example, the --scopes "https://www.googleapis.com/auth/cloud-platform" argument sets the node instance's scope to be cloud-platform for testing convenience. For production, you may want to limit the scope to configure finer-grained credentials.

    Use the --placement-policy, --reservation-affinity, and --reservation flags if you are using a reservation. Specify these flags to configure the policy name and reservation in the node pool.

    If this command fails, you might not have enough H100 GPU quota in your project. Ensure that you have sufficient quota and retry the command.

  5. Get a list of nodes in the cluster:

    kubectl get nodes
    
  6. Verify that each GPU node has eight GPUs:

    kubectl describe node NODE_NAME
    

    The output is similar to the following:

    Capacity:
      ...
      nvidia.com/gpu:             8
    Allocatable:
      ...
      nvidia.com/gpu:             8
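
The Network and GKENetworkParamSet objects created earlier in this procedure follow a regular pattern, one pair per secondary NIC. As a sketch, a small shell function can generate them instead of writing each one by hand. The function name generate_network_manifests is hypothetical. The object names (vpcN, PROJECT_ID-net-N, PROJECT_ID-sub-N) mirror the manifest shown in this section, and PROJECT_ID remains a placeholder to replace.

```shell
# Print one Network and one GKENetworkParamSet per secondary NIC.
# Pass 4 for GPUDirect-TCPX or 8 for GPUDirect-TCPXO.
generate_network_manifests() {
  count=$1
  n=1
  while [ "$n" -le "$count" ]; do
    cat <<EOF
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc$n
spec:
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc$n
  type: Device
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc$n
spec:
  vpc: PROJECT_ID-net-$n
  vpcSubnet: PROJECT_ID-sub-$n
  deviceMode: NetDevice
EOF
    n=$((n + 1))
  done
}

# Count the generated Network objects as a quick sanity check.
generate_network_manifests 8 | grep -c '^kind: Network$'
```

After you review the generated YAML, you could pipe it to kubectl apply -f - in place of the inline manifest.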
    

Install the GPUDirect binary and configure NCCL

This section shows you how to install the GPUDirect binary based on your A3 machine type (GPUDirect-TCPX for A3 standard, GPUDirect-TCPXO for A3 Mega) and a specific NCCL library version using a DaemonSet.

GPUDirect-TCPX

  1. Review the DaemonSet manifest:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nccl-tcpx-installer
      namespace: kube-system
      labels:
        k8s-app: nccl-tcpx-installer
    spec:
      selector:
        matchLabels:
          k8s-app: nccl-tcpx-installer
      updateStrategy:
        type: RollingUpdate
      template:
        metadata:
          labels:
            name: nccl-tcpx-installer
            k8s-app: nccl-tcpx-installer
        spec:
          priorityClassName: system-node-critical
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: cloud.google.com/gke-accelerator
                        operator: In
                        values:
                          - nvidia-h100-80gb
          tolerations:
            - operator: "Exists"
          hostNetwork: true
          hostPID: true
          volumes:
            - name: var-lib
              hostPath:
                path: /var/lib
            - name: tcpx
              hostPath:
                path: /var/lib/tcpx
            - name: library-dir-host
              hostPath:
                path: /home/kubernetes/bin
          initContainers:
            - image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
              name: nccl-tcpx-installer
              resources:
                requests:
                  cpu: 150m
              securityContext:
                privileged: true
              volumeMounts:
                - name: var-lib
                  mountPath: /var/lib
                - name: library-dir-host
                  mountPath: /usr/local
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -ex
                  /scripts/container_entry.sh install --install-nccl
                  mkdir -p /usr/local/nvidia/lib64
                  cp -r /var/lib/tcpx/lib64/. /usr/local/nvidia/lib64
                  echo "installation finishes"
          containers:
            - image: "gcr.io/google-containers/pause:2.0"
              name: pause
    

    This DaemonSet does the following:

    1. Installs the NCCL library and GPUDirect-TCPX binary on the node.
    2. Stores the library and the binary in the /home/kubernetes/bin/nvidia/lib64 directory on the VM. By default, GKE mounts this directory into the /usr/local/nvidia/lib64 path in GPU containers that need to use NCCL and GPUDirect-TCPX.
  2. Deploy the DaemonSet:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-tcpx-installer.yaml
    

    The NCCL plugin takes approximately two minutes to start running.

  3. Verify the status of the DaemonSet Pods:

    kubectl get pods -n=kube-system -l=name=nccl-tcpx-installer
    

    The output is similar to the following:

    nccl-tcpx-installer-6c2pv                    1/1     Running   0          2m11s
    nccl-tcpx-installer-qgg82                    1/1     Running   0          2m11s
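
To script the verification step, the following sketch reads text in the shape of the kubectl get pods output shown above and succeeds only if at least one Pod is listed and every listed Pod is Running. The helper name all_installers_running is hypothetical, and the inline sample is the example output from this page.

```shell
# Read `kubectl get pods` output on stdin. Column 3 is the STATUS column.
all_installers_running() {
  awk '
    NF == 0 || $1 == "NAME" { next }   # skip blank lines and the header row
    { total++ }
    $3 == "Running" { running++ }
    END { exit !(total > 0 && running == total) }
  '
}

all_installers_running <<'EOF' && echo "all installer Pods are Running"
nccl-tcpx-installer-6c2pv                    1/1     Running   0          2m11s
nccl-tcpx-installer-qgg82                    1/1     Running   0          2m11s
EOF
```

As an alternative to parsing text, kubectl rollout status daemonset/nccl-tcpx-installer -n kube-system waits until the DaemonSet rollout completes.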
    

GPUDirect-TCPXO

  1. Review the DaemonSet manifest:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nccl-tcpxo-installer
      namespace: kube-system
      labels:
        k8s-app: nccl-tcpxo-installer
    spec:
      selector:
        matchLabels:
          k8s-app: nccl-tcpxo-installer
      updateStrategy:
        type: RollingUpdate
      template:
        metadata:
          labels:
            name: nccl-tcpxo-installer
            k8s-app: nccl-tcpxo-installer
        spec:
          priorityClassName: system-node-critical
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: cloud.google.com/gke-accelerator
                        operator: In
                        values:
                          - nvidia-h100-mega-80gb
          tolerations:
            - operator: "Exists"
          hostNetwork: true
          hostPID: true
          volumes:
            - name: var-lib
              hostPath:
                path: /var/lib
            - name: tcpxo
              hostPath:
                path: /var/lib/tcpxo
            - name: library-dir-host
              hostPath:
                path: /home/kubernetes/bin
          initContainers:
            - image: "ubuntu"
              name: pre-installation
              securityContext:
                privileged: true
              command:
                - nsenter
                - -at
                - '1'
                - --
                - sh
                - -c
                - /sbin/iptables -I INPUT -p tcp -m tcp -j ACCEPT && modprobe import-helper
            - name: nccl-tcpxo-installer
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.2
              resources:
                requests:
                  cpu: 150m
              securityContext:
                privileged: true
              volumeMounts:
                - name: var-lib
                  mountPath: /var/lib
                - name: library-dir-host
                  mountPath: /usr/local
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -ex
                  chmod 755 /scripts/container_entry.sh
                  /scripts/container_entry.sh install --install-nccl
                  mkdir -p /usr/local/nvidia/lib64
                  cp -r /var/lib/tcpxo/lib64/. /usr/local/nvidia/lib64
                  echo "installation finishes"
          containers:
            - image: "gcr.io/google-containers/pause:2.0"
              name: pause
    

    This DaemonSet does the following:

    1. Installs the NCCL library and GPUDirect-TCPXO binary on the node.
    2. Stores the library and the binary in the /home/kubernetes/bin/nvidia/lib64 directory on the VM. By default, GKE mounts this directory into the /usr/local/nvidia/lib64 path in GPU containers that need to use NCCL and GPUDirect-TCPXO.
  2. Deploy the DaemonSet:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/nccl-tcpxo-installer.yaml
    

    The NCCL plugin takes approximately two minutes to start running.

  3. Verify the status of the DaemonSet Pods:

    kubectl get pods -n=kube-system -l=name=nccl-tcpxo-installer
    

    The output is similar to the following:

    nccl-tcpxo-installer-6c2pv                    1/1     Running   0          2m11s
    nccl-tcpxo-installer-qgg82                    1/1     Running   0          2m11s
    

Deploy a test workload

In this section, you deploy a sample workload to verify that NCCL and GPUDirect-TCPX or GPUDirect-TCPXO work as expected.

GPUDirect-TCPX

This workload includes a sidecar container named the tcpx-daemon, which runs a service that lets the Pod use GPUDirect-TCPX. You must add this sidecar container to any Pods in your own environment that need to use GPUDirect-TCPX. For a snippet of the required fields to add to your manifests, see Add GPUDirect to your manifest.

  1. Review the nccl-config.yaml ConfigMap manifest in GitHub. This manifest deploys scripts that initialize an NCCL all-gather test and sets NCCL-specific configuration settings.
  2. Review the nccl-test.yaml Deployment manifest in GitHub. This manifest does the following:

    1. Deploys two Pods, each of which runs in a node that has H100 GPUs.
    2. Deploys a sidecar container named tcpx-daemon in each Pod to let those Pods use GPUDirect-TCPX.
  3. Deploy the ConfigMap and the test workload:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-config.yaml
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-test.yaml
    
  4. Run the following command to trigger an NCCL all-gather test for the nodes:

    kubectl exec \
      --stdin --tty --container=nccl-test nccl-test-host-1 \
      -- /configs/allgather.sh nccl-host-1 nccl-host-2
    

    The output is similar to the following:

    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        1048576         16384     float    none      -1    696.8    1.50    1.41      0    729.0    1.44    1.35      0
        2097152         32768     float    none      -1    776.4    2.70    2.53      0    726.7    2.89    2.71      0
        4194304         65536     float    none      -1    774.3    5.42    5.08      0    805.1    5.21    4.88      0
        8388608        131072     float    none      -1    812.1   10.33    9.68      0    817.6   10.26    9.62      0
        16777216        262144     float    none      -1   1035.2   16.21   15.19      0   1067.8   15.71   14.73      0
        33554432        524288     float    none      -1   1183.3   28.36   26.59      0   1211.8   27.69   25.96      0
        67108864       1048576     float    none      -1   1593.4   42.12   39.49      0   1510.5   44.43   41.65      0
      134217728       2097152     float    none      -1   2127.8   63.08   59.13      0   2312.7   58.03   54.41      0
      268435456       4194304     float    none      -1   3603.0   74.50   69.85      0   3586.2   74.85   70.17      0
      536870912       8388608     float    none      -1   7101.7   75.60   70.87      0   7060.9   76.03   71.28      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 29.8293
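
To compare runs, it can help to pull the headline numbers out of this output. The following sketch extracts the peak out-of-place bus bandwidth and the reported average from text in the format shown above. The helper names peak_busbw and avg_busbw are hypothetical, and the column positions assume the 13-column layout in this example output.

```shell
# Print the highest out-of-place bus bandwidth (column 8, GB/s).
# Data rows start with a numeric size; header rows start with "#".
peak_busbw() {
  awk '
    $1 ~ /^[0-9]+$/ && NF >= 13 { if ($8 + 0 > max) max = $8 + 0 }
    END { print max + 0 }
  '
}

# Print the value from the "Avg bus bandwidth" summary line.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/ /, "", $2); print $2 }'
}

# Sample input: the last two data rows from the output above.
peak_busbw <<'EOF'
   268435456       4194304     float    none      -1   3603.0   74.50   69.85      0   3586.2   74.85   70.17      0
   536870912       8388608     float    none      -1   7101.7   75.60   70.87      0   7060.9   76.03   71.28      0
EOF
```

The average includes the small message sizes, so the peak value is usually the more useful number when checking whether GPUDirect is active.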
    

GPUDirect-TCPXO

This workload includes a sidecar container named the tcpxo-daemon, which runs a service that lets the Pod use GPUDirect-TCPXO. You must add this sidecar container to any Pods in your own environment that need to use GPUDirect-TCPXO. For a snippet of the required fields to add to your manifests, see Add GPUDirect to your manifest.

  1. Review the nccl-test.yaml manifest in GitHub. This manifest does the following:

    1. Deploys two Pods, each of which runs in a node that has H100 GPUs.
    2. Deploys a sidecar container named tcpxo-daemon in each Pod to let those Pods use GPUDirect-TCPXO.
  2. Deploy two Pods with the test workload:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/nccl-test.yaml
    
  3. Run the following command to trigger an NCCL all-gather test for the two nodes:

    kubectl exec --stdin --tty --container=nccl-test nccl-test-host-1 -- /scripts/allgather.sh nccl-host-1 nccl-host-2
    

    The output is similar to the following:

    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
         1048576         16384     float    none      -1   4654.5    0.23    0.21      0   3890.9    0.27    0.25      0
         2097152         32768     float    none      -1   4117.2    0.51    0.48      0   5153.5    0.41    0.38      0
         4194304         65536     float    none      -1   6417.4    0.65    0.61      0   7295.5    0.57    0.54      0
         8388608        131072     float    none      -1   7872.1    1.07    1.00      0   6451.4    1.30    1.22      0
        16777216        262144     float    none      -1   6990.7    2.40    2.25      0   5609.3    2.99    2.80      0
        33554432        524288     float    none      -1   8254.0    4.07    3.81      0   7415.1    4.53    4.24      0
        67108864       1048576     float    none      -1   5546.3   12.10   11.34      0   6484.0   10.35    9.70      0
       134217728       2097152     float    none      -1   6507.3   20.63   19.34      0   6015.4   22.31   20.92      0
       268435456       4194304     float    none      -1   6744.1   39.80   37.32      0   7023.1   38.22   35.83      0
       536870912       8388608     float    none      -1   8939.8   60.05   56.30      0    11706   45.86   43.00      0
      1073741824      16777216     float    none      -1   8241.7  130.28  122.14      0   8375.2  128.20  120.19      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 22.449
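
To interpret the sample output, it can help to reproduce the algbw and busbw columns from the size and time columns. The following Python sketch assumes the all-gather definitions used by the NCCL tests, where bus bandwidth is algorithm bandwidth scaled by (n - 1) / n for n ranks. It also assumes 16 ranks (two nodes with eight GPUs each), which is consistent with the sample output but isn't stated in it. The function name is illustrative only.

```python
def allgather_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for one all-gather measurement."""
    # Algorithm bandwidth: bytes moved divided by elapsed time.
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    # Bus bandwidth for all-gather: scale by (n - 1) / n.
    busbw = algbw * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Largest out-of-place row in the sample output: 1 GiB in 8241.7 microseconds.
# Assumption: 16 ranks (2 nodes x 8 GPUs).
algbw, busbw = allgather_bandwidth(1073741824, 8241.7, n_ranks=16)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

The reported average bus bandwidth is much lower than the value for the largest message because it is averaged over every message size in the run, including the small sizes where per-operation latency dominates.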
    

Use recommended NCCL configuration settings to improve performance

The following key-value pairs are the recommended NCCL configuration settings for GPUDirect-TCPX and GPUDirect-TCPXO. When you deploy workloads that use NCCL, set these pairs as environment variables to optimize performance.

GPUDirect-TCPX

"NCCL_SOCKET_IFNAME=\"eth0\"",
"NCCL_ALGO=Ring",
"NCCL_PROTO=Simple",
"NCCL_CROSS_NIC=0",
"NCCL_NET_GDR_LEVEL=PIX",
"NCCL_P2P_PXN_LEVEL=0",
"NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4",
"NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0",
"NCCL_DYNAMIC_CHUNK_SIZE=524288",
"NCCL_P2P_NET_CHUNKSIZE=524288",
"NCCL_P2P_PCI_CHUNKSIZE=524288",
"NCCL_P2P_NVL_CHUNKSIZE=1048576",
"NCCL_BUFFSIZE=4194304",
"NCCL_NSOCKS_PERTHREAD=4",
"NCCL_SOCKET_NTHREADS=1",
"NCCL_GPUDIRECTTCPX_TX_BINDINGS=\"eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177\"",
"NCCL_GPUDIRECTTCPX_RX_BINDINGS=\"eth1:22-35,126-139;eth2:22-35,126-139;eth3:74-87,178-191;eth4:74-87,178-191\"",
"NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=500000"

GPUDirect-TCPXO

"NCCL_FASTRAK_CTRL_DEV=eth0",
"NCCL_FASTRAK_IFNAME=eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8",
"NCCL_SOCKET_IFNAME=eth0",
"NCCL_CROSS_NIC=0",
"NCCL_ALGO=Ring,Tree",
"NCCL_PROTO=Simple",
"NCCL_MIN_NCHANNELS=4",
"NCCL_TUNER_PLUGIN=libnccl-tuner.so",
"NCCL_TUNER_CONFIG_PATH=/usr/local/nvidia/lib64/a3plus_tuner_config.textproto",
"NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE=/usr/local/nvidia/lib64/a3plus_guest_config.textproto",
"NCCL_DYNAMIC_CHUNK_SIZE=524288",
"NCCL_P2P_NET_CHUNKSIZE=524288",
"NCCL_P2P_PCI_CHUNKSIZE=524288",
"NCCL_P2P_NVL_CHUNKSIZE=1048576",
"NCCL_FASTRAK_NUM_FLOWS=2",
"NCCL_FASTRAK_USE_SNAP=1",
"NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS=600000",
"NCCL_FASTRAK_ENABLE_CONTROL_CHANNEL=0",
"NCCL_BUFFSIZE=8388608",
"CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7",
"NCCL_NET_GDR_LEVEL=PIX",
"NCCL_FASTRAK_ENABLE_HOTPATH_LOGGING=0",
"NCCL_FASTRAK_USE_LLCM=1",
"NCCL_NVLS_ENABLE=0"

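If you keep these settings as KEY=VALUE strings, you might want to turn them into the env entries of a container specification. The following Python sketch shows one way to do that with only the standard library. The function names are hypothetical, and the sketch assumes that double quotes around a value (as in NCCL_SOCKET_IFNAME) are shell quoting rather than part of the value. Every value is written as a quoted string because Kubernetes requires environment variable values to be strings, so a bare 0 would be rejected.

```python
import json

# A few of the recommended settings, copied from the lists above.
SETTINGS = [
    'NCCL_SOCKET_IFNAME="eth0"',
    "NCCL_CROSS_NIC=0",
    "NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4",
]


def to_env_entries(settings):
    """Split KEY=VALUE strings into Kubernetes-style env entries."""
    entries = []
    for setting in settings:
        name, _, value = setting.partition("=")
        # Assumption: surrounding double quotes are shell quoting only.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        entries.append({"name": name, "value": value})
    return entries


def render_env_yaml(entries, indent=4):
    """Render env entries as YAML text with every value quoted."""
    pad = " " * indent
    lines = [f"{pad}env:"]
    for entry in entries:
        lines.append(f"{pad}- name: {entry['name']}")
        # json.dumps emits a double-quoted string, which is also valid YAML.
        lines.append(f"{pad}  value: {json.dumps(entry['value'])}")
    return "\n".join(lines)


print(render_env_yaml(to_env_entries(SETTINGS)))
```

You can paste the printed block into each container that uses NCCL, next to the LD_LIBRARY_PATH variable.
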
Add GPUDirect to your manifests

This section describes the fields that you must add to your Kubernetes manifests so that your Pods can use GPUDirect.

GPUDirect-TCPX

  1. Add the following fields to the Pod specification:

    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      volumes:
      - name: libraries
        hostPath:
          path: /home/kubernetes/bin/nvidia/lib64
      - name: tcpx-socket
        hostPath:
          path: /run/tcpx
    
  2. Add the following container to the manifest to run the tcpx-daemon service:

    - name: tcpx-daemon
      image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.9
      command:
        - /tcpgpudmarxd/build/app/tcpgpudmarxd
        - --gpu_nic_preset
        - a3vm
        - --gpu_shmem_type
        - fd
        - --uds_path
        - /run/tcpx
        - --setup_param
        - \"--verbose 128 2 0 \"
      securityContext:
        privileged: true
      volumeMounts:
        - name: libraries
          mountPath: /usr/local/nvidia/lib64
        - name: tcpx-socket
          mountPath: /run/tcpx
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
    
  3. Add the following volume mounts to any containers that request GPUs:

    volumeMounts:
    - name: tcpx-socket
      mountPath: /run/tcpx
    - name: libraries
      mountPath: /usr/local/nvidia/lib64
    
  4. Add the following environment variable to every GPU container:

    env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64
    
  5. Add environment variables to configure NCCL options. For details, see the Use recommended NCCL configuration settings to improve performance section in this document.

A completed Pod specification looks like the following:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    name: example-pod
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: tcpx-daemon
    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.9
    command:
      - /tcpgpudmarxd/build/app/tcpgpudmarxd
      - --gpu_nic_preset
      - a3vm
      - --gpu_shmem_type
      - fd
      - --uds_path
      - /run/tcpx
      - --setup_param
      - \"--verbose 128 2 0 \"
    securityContext:
      privileged: true
    volumeMounts:
      - name: libraries
        mountPath: /usr/local/nvidia/lib64
      - name: tcpx-socket
        mountPath: /run/tcpx
    env:
      - name: LD_LIBRARY_PATH
        value: /usr/local/nvidia/lib64
  - name: nccl-test
    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx:v3.1.2
    imagePullPolicy: Always
    command:
      - /bin/sh
      - -c
      - "while true; do echo hello; sleep 1; done"
    env:
      - name: LD_LIBRARY_PATH
        value: /usr/local/nvidia/lib64
    volumeMounts:
      - name: tcpx-socket
        mountPath: /run/tcpx
      - name: libraries
        mountPath: /usr/local/nvidia/lib64
    resources:
      limits:
        nvidia.com/gpu: 8
  volumes:
    - name: libraries
      hostPath:
        path: /home/kubernetes/bin/nvidia/lib64
    - name: tcpx-socket
      hostPath:
        path: /run/tcpx

GPUDirect-TCPXO

  1. Add the following fields to the Pod specification:

    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      volumes:
      - name: nvidia-install-dir-host
        hostPath:
          path: /home/kubernetes/bin/nvidia
    
  2. Add the following container to the manifest to run the tcpxo-daemon service:

    - name: tcpxo-daemon
      image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.3
      imagePullPolicy: Always
      command: ["/bin/sh", "-c"]
      args:
        - |
          set -ex
          chmod 755 /fts/entrypoint_rxdm_container.sh
          /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr
      securityContext:
        privileged: true
      volumeMounts:
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
    
  3. Add the following environment variable and volume mount to every GPU container:

    env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64
    volumeMounts:
    - name: nvidia-install-dir-host
      mountPath: /usr/local/nvidia
    
  4. Set privileged: true in the security context of every GPU container:

    securityContext:
      privileged: true
    
  5. Add environment variables to configure NCCL options. For details, see the Use recommended NCCL configuration settings to improve performance section in this document.

A completed Pod specification looks like the following:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    name: example-pod
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
    - name: tcpxo-daemon
      image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.3
      imagePullPolicy: Always
      command: ["/bin/sh", "-c"]
      args:
        - |
          set -ex
          chmod 755 /fts/entrypoint_rxdm_container.sh
          /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr
      securityContext:
        privileged: true
      volumeMounts:
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
    - name: example-container
      image: example-image
      imagePullPolicy: Always
    ...
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
      securityContext:
        privileged: true
      volumeMounts:
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
      resources:
        limits:
          nvidia.com/gpu: 8
  volumes:
    - name: nvidia-install-dir-host
      hostPath:
        path: /home/kubernetes/bin/nvidia
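
Before you apply a manifest, you might want to check it against the fields that this section requires. The following Python sketch is a hypothetical helper, not part of GKE or kubectl. It assumes that the manifest is already loaded into a dictionary (for example, with a YAML parser) and that GPU containers are the ones that set an nvidia.com/gpu limit. The require_privileged parameter is an illustrative switch for the additional GPUDirect-TCPXO requirement.

```python
def check_gpudirect_pod(pod, require_privileged=False):
    """Return a list of problems found in a Pod manifest given as a dict."""
    problems = []
    spec = pod.get("spec", {})
    if spec.get("hostNetwork") is not True:
        problems.append("spec.hostNetwork must be true")
    if spec.get("dnsPolicy") != "ClusterFirstWithHostNet":
        problems.append("spec.dnsPolicy must be ClusterFirstWithHostNet")

    declared = {volume["name"] for volume in spec.get("volumes", [])}
    for container in spec.get("containers", []):
        name = container["name"]
        # Every volume mount must refer to a volume declared in the Pod spec.
        for mount in container.get("volumeMounts", []):
            if mount["name"] not in declared:
                problems.append(f"{name}: volume {mount['name']!r} is not declared")

        limits = container.get("resources", {}).get("limits", {})
        if "nvidia.com/gpu" not in limits:
            continue  # The remaining checks apply only to GPU containers.

        env = {e["name"]: e.get("value") for e in container.get("env", [])}
        if env.get("LD_LIBRARY_PATH") != "/usr/local/nvidia/lib64":
            problems.append(f"{name}: LD_LIBRARY_PATH must be /usr/local/nvidia/lib64")
        if require_privileged and not container.get("securityContext", {}).get("privileged"):
            problems.append(f"{name}: securityContext.privileged must be true")
    return problems


# Example: a GPU container that mounts a volume the Pod never declares.
example_pod = {
    "spec": {
        "hostNetwork": True,
        "dnsPolicy": "ClusterFirstWithHostNet",
        "volumes": [{"name": "libraries"}],
        "containers": [{
            "name": "example-container",
            "resources": {"limits": {"nvidia.com/gpu": 8}},
            "env": [{"name": "LD_LIBRARY_PATH", "value": "/usr/local/nvidia/lib64"}],
            "volumeMounts": [{"name": "nvidia-install-dir-host",
                              "mountPath": "/usr/local/nvidia"}],
        }],
    }
}
for problem in check_gpudirect_pod(example_pod, require_privileged=True):
    print(problem)
```

An empty result only means that these specific checks passed. It doesn't validate the rest of the manifest or confirm that the daemon container is configured correctly.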

What's next