This initialization action installs libraries required to run Apache Datasketches on a Google Cloud Dataproc cluster.
This initialization action installs dataksketches libraries on Dataproc cluster at /usr/lib/datasketches
location, below jars will be deployed:
datasketches-memory-2.0.0.jar
datasketches-java-3.1.0.jar
datasketches-pig-1.1.0.jar
datasketches-hive-1.2.0.jar
spark-java-thetasketches-1.0-SNAPSHOT.jar [ Only if Spark version < 3.5.0 ]
-
Using the
gcloud
command to create a new cluster with this initialization action. The following command will create a new standard cluster named${CLUSTER_NAME}
.REGION=<region> CLUSTER_NAME=<cluster_name> gcloud dataproc clusters create ${CLUSTER_NAME} \ --region ${REGION} \ --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/datasketches/dataksketches.sh
Note: Starting Apache Spark version 3.5.0, Datasketches libraries are already integrated, follow this example
-
For Older 3.X Spark versions, follow Thetasketches example from Datasketches documentation.
Note:
spark-java-thetasketches
example jar will be available under/usr/lib/datasketches
as a part of this init action, runspark-submit
withspark-java-thetasketches-1.0-SNAPSHOT.jar
to try Thetasketches example.spark-submit --jars /usr/lib/datasketches/datasketches-java-3.1.0.jar,/usr/lib/datasketches/datasketches-memory-2.0.0.jar --class Aggregate target/spark-java-thetasketches-1.0-SNAPSHOT.jar
If you modify the java code, use below instructions to build the jar.
-
Generate artifacts with Maven:
mvn archetype:generate -DgroupId=org.apache.datasketches -DartifactId=spark-java-thetasketches -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
-
Replace pom.xml with https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/datasketches/pom.xml
-
Add modified code from https://datasketches.apache.org/docs/Theta/ThetaSparkExample.html under $local_path/src/main/java/org/apache/datasketches directory, remove the sample App.java file
Example:
root@cluster-$hostname-m:$local_path/spark-java-thetasketches/src/main/java/org/apache/datasketches# ls -lrt total 20 -rw-r--r-- 1 root root 1920 Feb 21 17:03 ThetaSketchJavaSerializable.java -rw-r--r-- 1 root root 2459 Feb 21 17:03 Spark2DatasetMapPartitionsReduceJavaSerialization.java -rw-r--r-- 1 root root 3654 Feb 21 17:03 MapPartitionsToPairReduceByKey.java -rw-r--r-- 1 root root 3142 Feb 21 17:03 AggregateByKey2.java -rw-r--r-- 1 root root 2123 Feb 21 17:03 Aggregate.java
-
Compile the code and package a jar:
mvn package
-
Verify if jar is created under
target/
root@cluster-$hostname-m:$local_path/spark-java-thetasketches# ls -lrt target/ total 48 drwxr-xr-x 3 root root 4096 Feb 29 18:36 maven-status drwxr-xr-x 3 root root 4096 Feb 29 18:36 generated-sources drwxr-xr-x 2 root root 4096 Feb 29 18:36 classes drwxr-xr-x 3 root root 4096 Feb 29 18:36 generated-test-sources drwxr-xr-x 3 root root 4096 Feb 29 18:36 test-classes drwxr-xr-x 2 root root 4096 Feb 29 18:36 surefire-reports drwxr-xr-x 2 root root 4096 Feb 29 18:36 maven-archiver -rw-r--r-- 1 root root 17542 Feb 29 18:36 spark-java-thetasketches-1.0-SNAPSHOT.jar
-
Run
spark-submit
with newly generated jar from above step.root@cluster-$hostname-m:$local_path/spark-java-thetasketches# spark-submit --jars /usr/lib/datasketches/datasketches-java-3.1.0.jar,/usr/lib/datasketches/datasketches-memory-2.0.0.jar --class Aggregate target/spark-java-thetasketches-1.0-SNAPSHOT.jar
-
- cd to
/usr/lib/datasketches
and follow Datasketches Hive examples
- cd to
/usr/lib/datasketches
and follow Datasketches Pig examples