This performance report shows the relationship between the number of GPUs and execution time.
Currently, DeepVariant (v0.7.x) supports only a single GPU, so it gains no benefit from multi-GPU machines such as the NVIDIA DGX-1. DeepVariant-on-Spark therefore uses Apache Spark to launch multiple DeepVariant processes in parallel, so that all GPU resources can be fully utilized.
gcloud beta dataproc clusters create my-dos1 \
--subnet default --zone us-west1-b \
--master-machine-type n1-highmem-8 --master-boot-disk-size 256 \
--num-workers 4 --worker-machine-type n1-highmem-16 \
--worker-min-cpu-platform "Intel Skylake" \
--worker-boot-disk-size 384 \
--worker-accelerator type=nvidia-tesla-p100,count=1 \
--num-worker-local-ssds 1 --image-version 1.2.59-deb9 \
--initialization-actions gs://seqslab-deepvariant/scripts/initialization-on-dataproc.sh \
--initialization-action-timeout 20m
The same command, with a different GPU count, is used to demonstrate the performance improvement from adding more GPUs.
Because `call_variants` is the only step in the pipeline that uses GPU acceleration, you can run this step directly with the following command:
bash ./deepvariant-on-spark/scripts/call_variants.sh /output/examples /bed/19/contiguous_unmasked_regions_156_parts 19 GRCH /output/variants
The script's usage is described below:
Usage:
./deepvariant-on-spark/scripts/call_variants.sh <Example folder> <BED folder> <Reference Version> <Contig Style> <Output Folder>
Parameters:
<Example folder>: the output folder of make_examples
<BED folder>: the bed file for Adaptive Data Parallelization (ADP)
<Reference Version>: [ 19 | 38 ]
<Contig Style>: [ HG | GRCH ]
<Output Folder>: the output folder on HDFS
Examples:
./deepvariant-on-spark/scripts/call_variants.sh output_HG002/examples /bed/19/contiguous_unmasked_regions_156_parts 19 GRCH output_HG002/variants
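Because the script takes five positional arguments, two of which accept only fixed values, a small pre-flight check can catch mistakes before a job is submitted. The following is a minimal sketch; `validate_call_variants_args` is a hypothetical helper written for this report, not part of DeepVariant-on-Spark, and it checks only the argument count and the two enumerated values listed above.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of deepvariant-on-spark).
# Verifies the argument count and the two enumerated parameters only;
# it does not check that the folders exist.
validate_call_variants_args() {
  if [ "$#" -ne 5 ]; then
    echo "expected 5 arguments, got $#" >&2
    return 1
  fi
  case "$3" in
    19|38) ;;
    *) echo "Reference Version must be 19 or 38, got '$3'" >&2; return 1 ;;
  esac
  case "$4" in
    HG|GRCH) ;;
    *) echo "Contig Style must be HG or GRCH, got '$4'" >&2; return 1 ;;
  esac
}

# Sample arguments taken from the example above.
args="output_HG002/examples /bed/19/contiguous_unmasked_regions_156_parts 19 GRCH output_HG002/variants"
# Word splitting of $args is intentional here.
if validate_call_variants_args $args; then
  echo "arguments look valid: ./deepvariant-on-spark/scripts/call_variants.sh $args"
fi
```

If the check passes, the printed command line can be run as-is.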
The following table shows the execution time of the `call_variants` step for different numbers of GPUs.
| Hardware Spec. | Pure CPU Cluster | 1-GPU Cluster | 2-GPU Cluster | 4-GPU Cluster |
|---|---|---|---|---|
| Execution Time | 1h 20m 49s | 16m 46s | 9m 34s | 7m 19s |
- The number of GPU cards attached to each worker is adjustable by changing `N` in `--worker-accelerator type=nvidia-tesla-p100,count=N` when launching the cluster.
- Machine spec: 4 × n1-highmem-16 workers (16 vCPUs and 104 GB of memory each).
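The speedups implied by the table can be worked out with a short shell sketch. The helper names `to_seconds` and `speedup` are illustrative only; the durations are copied from the table above.

```shell
#!/bin/sh
# Convert a duration such as "1h 20m 49s" into seconds.
to_seconds() {
  echo "$1" | awk '{
    total = 0
    for (i = 1; i <= NF; i++) {
      value = substr($i, 1, length($i) - 1) + 0   # numeric part
      unit  = substr($i, length($i), 1)           # trailing h, m, or s
      if (unit == "h")      total += value * 3600
      else if (unit == "m") total += value * 60
      else if (unit == "s") total += value
    }
    print total
  }'
}

# Speedup of a run ($2) relative to a baseline ($1), to two decimal places.
speedup() {
  awk -v base="$(to_seconds "$1")" -v run="$(to_seconds "$2")" \
    'BEGIN { printf "%.2f\n", base / run }'
}

baseline="1h 20m 49s"   # Pure CPU cluster
for entry in "1-GPU:16m 46s" "2-GPU:9m 34s" "4-GPU:7m 19s"; do
  printf '%s cluster speedup over pure CPU: %sx\n' \
    "${entry%%:*}" "$(speedup "$baseline" "${entry#*:}")"
done
```

With the figures above this gives about 4.82x, 8.45x, and 11.05x over the pure CPU cluster. Going from 1 to 4 GPUs is a further 2.29x, so the scaling in this measurement is sublinear.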
When the cluster is launched with 4 GPUs per node, `nvidia-smi` shows that the GPU resources are fully utilized:
Fri Jan 11 22:35:59 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87 Driver Version: 390.87 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:05.0 Off | 0 |
| N/A 62C P0 171W / 250W | 16112MiB / 16280MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000000:00:06.0 Off | 0 |
| N/A 65C P0 166W / 250W | 15624MiB / 16280MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 00000000:00:07.0 Off | 0 |
| N/A 63C P0 163W / 250W | 15624MiB / 16280MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... Off | 00000000:00:08.0 Off | 0 |
| N/A 60C P0 157W / 250W | 16070MiB / 16280MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 19863 C /usr/bin/python 4227MiB |
| 0 19868 C /usr/bin/python 4227MiB |
| 0 19880 C /usr/bin/python 4225MiB |
| 0 22558 C /usr/bin/python 3421MiB |
| 1 19876 C /usr/bin/python 4225MiB |
| 1 20093 C /usr/bin/python 4225MiB |
| 1 20094 C /usr/bin/python 4225MiB |
| 1 21556 C /usr/bin/python 2937MiB |
| 2 19864 C /usr/bin/python 4225MiB |
| 2 19874 C /usr/bin/python 4225MiB |
| 2 20095 C /usr/bin/python 4225MiB |
| 2 21555 C /usr/bin/python 2937MiB |
| 3 19879 C /usr/bin/python 4227MiB |
| 3 21986 C /usr/bin/python 3513MiB |
| 3 22495 C /usr/bin/python 4225MiB |
| 3 22686 C /usr/bin/python 4093MiB |
+-----------------------------------------------------------------------------+
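The same check can be scripted rather than read by eye. The sketch below assumes the CSV query mode of `nvidia-smi` (`--query-gpu` with `--format=csv,noheader,nounits`); `check_gpu_utilization` is a hypothetical helper, and the 90% threshold is an arbitrary choice for illustration. The sample rows reuse the utilization and memory figures from the output above so the block runs without a GPU.

```shell
#!/bin/sh
# Read "index, utilization.gpu, memory.used, memory.total" CSV rows on stdin
# and report every GPU whose utilization is below the threshold (percent).
# Returns 0 when all GPUs meet the threshold, 1 otherwise.
check_gpu_utilization() {
  threshold="${1:-90}"
  awk -F', *' -v min="$threshold" '
    $2 + 0 < min {
      printf "GPU %s underutilized: %s%% busy, %s/%s MiB\n", $1, $2, $3, $4
      low = 1
    }
    END { exit low + 0 }
  '
}

# Sample rows taken from the nvidia-smi output shown above.
sample_rows() {
  printf '%s\n' \
    "0, 100, 16112, 16280" \
    "1, 100, 15624, 16280" \
    "2, 100, 15624, 16280" \
    "3, 100, 16070, 16280"
}

if sample_rows | check_gpu_utilization 90; then
  echo "all GPUs at or above 90% utilization"
fi

# On a GPU worker node, feed live data instead of the sample rows:
# nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#   --format=csv,noheader,nounits | check_gpu_utilization 90
```

Because the function signals underutilization through its exit status, it can be called from a monitoring loop or a cron job on each worker.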