Scale out RAPIDS on Google Cloud Dataproc

Raghav Mani
RAPIDS AI

--

In April, we talked about how you can try out RAPIDS for free on Google Colab, using a single NVIDIA Tesla T4 GPU. But what do you do if your dataset is much larger than GPU memory? You then need to scale out your work to multiple GPUs.

At a high level, there are two commonly used approaches for this type of scale out: Dask & Apache Spark.

  • Dask is a popular parallel computing library that can be used to scale out analytics workloads. It’s lightweight & pure Python.
  • Spark is a robust, scalable data analytics engine. It came out of the Hadoop ecosystem and supports multiple languages, with Scala & Java being the most commonly used.

We’re working with others to add RAPIDS support to Apache Spark; more information will come later this year.

That said, RAPIDS today supports scale out using Dask! In fact, you can quickly try out our early access Dask implementation on Google Cloud Dataproc, which is a service for launching clusters of VMs on Google Cloud.

Follow the instructions below to spin up a RAPIDS cluster on Dataproc.

Install Google Cloud SDK

Step #1: If you don’t have a GCP account, you can sign up for one here.

Step #2: Install Google Cloud SDK

Step #3: Initialize the Google Cloud SDK

RAPIDS in Dataproc currently requires you to use the Google Cloud SDK command line interface. Future versions of Dataproc will support configuring GPUs through its web UI.

Create a Dataproc Cluster

Now that you’re set up with Google Cloud SDK, go ahead and run the following command to create and initialize a cluster:

CLUSTER_NAME=YOUR_CLUSTER_NAME
ZONE=YOUR_COMPUTE_ZONE
STORAGE_BUCKET=YOUR_STORAGE_BUCKET
NUM_GPUS_IN_MASTER=4
NUM_GPUS_IN_WORKER=4
NUM_WORKERS=2
DATAPROC_BUCKET=dataproc-initialization-actions

gcloud beta dataproc clusters create $CLUSTER_NAME \
  --zone $ZONE \
  --master-accelerator type=nvidia-tesla-t4,count=$NUM_GPUS_IN_MASTER \
  --master-machine-type n1-standard-32 \
  --worker-accelerator type=nvidia-tesla-t4,count=$NUM_GPUS_IN_WORKER \
  --worker-machine-type n1-standard-32 \
  --num-workers $NUM_WORKERS \
  --bucket $STORAGE_BUCKET \
  --metadata "JUPYTER_PORT=8888" \
  --initialization-actions gs://$DATAPROC_BUCKET/rapids/rapids.sh \
  --optional-components=ANACONDA,JUPYTER \
  --enable-component-gateway

The placeholder values (YOUR_CLUSTER_NAME, YOUR_COMPUTE_ZONE, and YOUR_STORAGE_BUCKET) are the ones you’ll need to set based on your GCP account. If you don’t specify a storage bucket, Dataproc will create a new one for you with a unique ID.

For the remaining parameters, you could simply use the defaults in the command. Check out the full documentation of this API here.

This command creates a cluster with 1 master and 2 worker VMs. Each of these machines has 4 T4 GPUs. Since each T4 has 16 GB of memory, these 12 GPUs provide a combined 192 GB of GPU memory.

If you need more memory, you can increase the:

  1. number of GPUs per node using NUM_GPUS_IN_MASTER & NUM_GPUS_IN_WORKER
  2. number of workers using NUM_WORKERS

I would recommend maxing out the number of GPUs in one VM first, before scaling out to more VMs. Currently, you can attach a maximum of 4 T4 GPUs to this type of VM. You can learn more about T4 GPUs on Google Cloud here.
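If you want a quick back-of-the-envelope check before picking a configuration, here is a minimal Python sketch of the memory arithmetic above. It assumes 16 GB per T4 and one master node plus NUM_WORKERS worker nodes; in practice your usable headroom will be somewhat lower once RAPIDS and Dask overhead is accounted for.

def total_gpu_memory_gb(num_gpus_in_master, num_gpus_in_worker, num_workers):
    # Rough total GPU memory for a 1-master, N-worker Dataproc cluster.
    t4_memory_gb = 16  # each NVIDIA T4 has 16 GB of GPU memory
    total_gpus = num_gpus_in_master + num_gpus_in_worker * num_workers
    return total_gpus * t4_memory_gb

# The configuration used in the command above: 4 + 4*2 = 12 GPUs -> 192 GB
print(total_gpu_memory_gb(4, 4, 2))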

The “initialization actions” in this command are run on all nodes and they do the following:

  • Install NVIDIA driver
  • Install RAPIDS conda packages
  • Start dask-scheduler and workers

The cluster creation command also sets up Jupyter on port 8888 and the Dask dashboard on port 8787.

This entire process might take around 15 minutes to complete.

Congrats! You now have your cluster up & running. When you’re done with your work, don’t forget to shut down the cluster using the instructions at the end of this post.

Connect to Cluster Using Jupyter

Go to the Clusters page on the GCP console to see the cluster that you’ve successfully launched.

Click into your cluster and select the Web Interfaces tab. There you will see options to launch Jupyter & JupyterLab.

Explore Example Notebooks

Check out this notebook to see how you can scale out RAPIDS in a multi-node, multi-GPU environment. It is based on the NYC taxi dataset, and it shows how you can use dask_cudf to distribute dataframe manipulations across GPUs using pandas-like APIs. It then builds an XGBoost model using dask_xgboost, the distributed version of GPU-accelerated XGBoost.
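To give you a feel for what the notebook does before you open it, here is a minimal sketch of the dask_cudf + dask_xgboost pattern it follows. The file path, column names, and XGBoost parameters below are illustrative placeholders rather than the notebook’s actual values, and the scheduler address assumes the default dask-scheduler port (8786) on the master node.

from dask.distributed import Client
import dask_cudf
import dask_xgboost

# Connect to the Dask scheduler that the initialization action started
# on the master node (8786 is the default dask-scheduler port).
client = Client("localhost:8786")

# Read CSV files from Cloud Storage into a distributed GPU dataframe.
# The bucket path and column names are hypothetical placeholders.
df = dask_cudf.read_csv("gs://YOUR_STORAGE_BUCKET/nyc-taxi/*.csv")

# Pandas-like manipulations execute on the GPUs across the cluster.
df = df[df.fare_amount > 0]
X = df[["passenger_count", "trip_distance"]]
y = df["fare_amount"]

# Train a distributed, GPU-accelerated XGBoost model.
params = {
    "tree_method": "gpu_hist",
    "objective": "reg:squarederror",
    "max_depth": 8,
}
booster = dask_xgboost.train(client, params, X, y, num_boost_round=100)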

Upload the notebook to Dataproc and try it out for yourself!

Note: before running the notebook, make sure you’re using the RAPIDS kernel. You can check this in the top right corner of your notebook.

Similarly, if you are starting a new notebook, look for the RAPIDS kernel in the launcher:

Monitor Progress using Dask Dashboard

RAPIDS uses Dask for distributing & coordinating work across the different nodes in your cluster. Dask has a suite of powerful monitoring tools that can be accessed from a browser. Follow the instructions below to get access to this Dask dashboard:

Step #1: Get IP address of the cluster’s master node

Run the following command to obtain the IP address of your cluster’s master node. You’ll need to change the placeholder values (YOUR_PROJECT_ID, YOUR_COMPUTE_ZONE, and YOUR_CLUSTER_NAME) based on your GCP account and cluster:

gcloud compute --project YOUR_PROJECT_ID ssh --zone YOUR_COMPUTE_ZONE YOUR_CLUSTER_NAME-m
curl ifconfig.me

The first command opens an SSH session to the master node; running curl ifconfig.me inside that session prints the node’s external IP.

This command will also set up the SSH keys that are required for step #2.

If you just want to get the IP address, you can go to the VM instances page on the GCP console. Here you’ll see all the nodes in your cluster. The node ending with “-m” is your master node.

Step #2: Set up an SSH tunnel between your local machine and the master node

ssh -i ~/.ssh/google_compute_engine -L 8787:localhost:8787 YOUR_USERNAME@EXTERNAL_IP_OF_MASTER_NODE

You should now be able to access the Dask dashboard at http://localhost:8787.
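You can also check, from a notebook running on the cluster, that all of your workers are registered with the scheduler. Here is a minimal sketch assuming the dask-scheduler started by the initialization action is listening on its default port (8786) on the master node:

from dask.distributed import Client

# Connect to the Dask scheduler on the master node (default port 8786).
client = Client("localhost:8786")

# URL of the dashboard served by the scheduler (the one tunneled to port 8787 above).
print(client.dashboard_link)

# Workers currently registered with the scheduler.
workers = client.scheduler_info()["workers"]
print(f"{len(workers)} Dask workers connected")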

Wrap Up

Once you’ve explored RAPIDS to your heart’s content, you can tear down the cluster using this command:

gcloud dataproc clusters delete YOUR_CLUSTER_NAME

We’d love to hear your feedback, and do consider contributing to RAPIDS!
