Apache Datasketches

⚠️ NOTICE: This initialization action is supported only on Dataproc image version 2.1 and later.

This initialization action installs libraries required to run Apache Datasketches on a Google Cloud Dataproc cluster.

Using this initialization action

⚠️ NOTICE: See the best practices for using initialization actions in production.

This initialization action installs the Datasketches libraries on the Dataproc cluster under /usr/lib/datasketches. The following jars are deployed:

datasketches-memory-2.0.0.jar
datasketches-java-3.1.0.jar
datasketches-pig-1.1.0.jar
datasketches-hive-1.2.0.jar
spark-java-thetasketches-1.0-SNAPSHOT.jar (only if the Spark version is earlier than 3.5.0)
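
After the cluster is up, it can be useful to confirm that the jars listed above actually landed on a node. The following is a minimal sketch, not part of the initialization action: the function name `missing_datasketches_jars` is hypothetical, and the jar list is copied from the list above (the conditional spark-java-thetasketches jar is left out on purpose).

```shell
# Hypothetical helper: print every expected jar that is absent from a directory.
# Returns 0 when all four jars are present, 1 otherwise.
missing_datasketches_jars() {
  _dir="${1:-/usr/lib/datasketches}"   # default install location from this README
  _status=0
  for _jar in \
      datasketches-memory-2.0.0.jar \
      datasketches-java-3.1.0.jar \
      datasketches-pig-1.1.0.jar \
      datasketches-hive-1.2.0.jar; do
    if [ ! -f "${_dir}/${_jar}" ]; then
      echo "${_jar}"
      _status=1
    fi
  done
  return "${_status}"
}
```

On a cluster node, `missing_datasketches_jars` with no argument checks /usr/lib/datasketches; empty output means nothing is missing.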

Use the gcloud command to create a new cluster with this initialization action. The following command creates a new standard cluster named ${CLUSTER_NAME}:

REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/datasketches/datasketches.sh
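
The initialization action URI depends on the region, so a forgotten or unreplaced `<region>` placeholder produces a bucket name that does not exist. As a small sketch, the hypothetical helper below builds the URI and refuses an empty or placeholder region. It assumes the script in this directory is named datasketches.sh.

```shell
# Hypothetical helper: print the Cloud Storage URI of this init action for a region.
# Fails if the region is empty or still looks like a <placeholder>.
datasketches_init_action_uri() {
  case "$1" in
    ""|"<"*">")
      echo "set REGION to a real region first" >&2
      return 1
      ;;
  esac
  printf 'gs://goog-dataproc-initialization-actions-%s/datasketches/datasketches.sh\n' "$1"
}
```

It could then be passed to the command above as `--initialization-actions "$(datasketches_init_action_uri "${REGION}")"`.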

Apache Datasketches Examples:

Spark:

Note: Starting with Apache Spark 3.5.0, the Datasketches libraries are already integrated into Spark; follow this example.

For older Spark 3.x versions, follow the Thetasketches example from the Datasketches documentation.
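
Which of the two paths applies depends only on whether the Spark version is at least 3.5.0. The sketch below shows one way to make that comparison in a script. The function name is hypothetical, it expects a full `x.y.z` version string, and it assumes a `sort` that supports `-V` (as GNU coreutils does).

```shell
# Hypothetical helper: succeed when the given Spark version is 3.5.0 or later.
# Version-sorts the threshold and the argument; if 3.5.0 comes first (or ties),
# the argument is at least 3.5.0.
spark_has_builtin_datasketches() {
  [ "$(printf '%s\n' "3.5.0" "$1" | sort -V | head -n 1)" = "3.5.0" ]
}

# Example: choose a path for a known version string.
if spark_has_builtin_datasketches "3.3.2"; then
  echo "use the built-in Datasketches functions"
else
  echo "use the Thetasketches example jar"
fi
```

The version string itself can be read from the banner printed by `spark-submit --version`.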

Note: This initialization action places the spark-java-thetasketches example jar under /usr/lib/datasketches. To try the Thetasketches example, run spark-submit with spark-java-thetasketches-1.0-SNAPSHOT.jar:

spark-submit --jars /usr/lib/datasketches/datasketches-java-3.1.0.jar,/usr/lib/datasketches/datasketches-memory-2.0.0.jar --class Aggregate /usr/lib/datasketches/spark-java-thetasketches-1.0-SNAPSHOT.jar

If you modify the Java code, use the instructions below to build the jar.

Generate artifacts with Maven:

mvn archetype:generate -DgroupId=org.apache.datasketches -DartifactId=spark-java-thetasketches -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Replace the generated pom.xml with https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/datasketches/pom.xml

Add the modified code from https://datasketches.apache.org/docs/Theta/ThetaSparkExample.html to the $local_path/spark-java-thetasketches/src/main/java/org/apache/datasketches directory, and remove the sample App.java file.

Example:

root@cluster-$hostname-m:$local_path/spark-java-thetasketches/src/main/java/org/apache/datasketches# ls -lrt
total 20
-rw-r--r-- 1 root root 1920 Feb 21 17:03 ThetaSketchJavaSerializable.java
-rw-r--r-- 1 root root 2459 Feb 21 17:03 Spark2DatasetMapPartitionsReduceJavaSerialization.java
-rw-r--r-- 1 root root 3654 Feb 21 17:03 MapPartitionsToPairReduceByKey.java
-rw-r--r-- 1 root root 3142 Feb 21 17:03 AggregateByKey2.java
-rw-r--r-- 1 root root 2123 Feb 21 17:03 Aggregate.java

Compile the code and package a jar:
```
mvn package
```

Verify that the jar was created under target/:

root@cluster-$hostname-m:$local_path/spark-java-thetasketches# ls -lrt target/
total 48
drwxr-xr-x 3 root root  4096 Feb 29 18:36 maven-status
drwxr-xr-x 3 root root  4096 Feb 29 18:36 generated-sources
drwxr-xr-x 2 root root  4096 Feb 29 18:36 classes
drwxr-xr-x 3 root root  4096 Feb 29 18:36 generated-test-sources
drwxr-xr-x 3 root root  4096 Feb 29 18:36 test-classes
drwxr-xr-x 2 root root  4096 Feb 29 18:36 surefire-reports
drwxr-xr-x 2 root root  4096 Feb 29 18:36 maven-archiver
-rw-r--r-- 1 root root 17542 Feb 29 18:36 spark-java-thetasketches-1.0-SNAPSHOT.jar

Run spark-submit with the newly generated jar from the previous step:

root@cluster-$hostname-m:$local_path/spark-java-thetasketches# spark-submit --jars /usr/lib/datasketches/datasketches-java-3.1.0.jar,/usr/lib/datasketches/datasketches-memory-2.0.0.jar --class Aggregate target/spark-java-thetasketches-1.0-SNAPSHOT.jar
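
The `--jars` value in the spark-submit commands above is a comma-separated list of paths, and a typo in any one of them only surfaces once the job starts. As a rough sketch, the hypothetical helper below joins jar paths with commas and stops early if one of them does not exist.

```shell
# Hypothetical helper: join jar paths with commas for spark-submit --jars.
# Fails with a message on stderr if any path is not an existing file.
join_jars() {
  _out=""
  for _jar in "$@"; do
    [ -f "${_jar}" ] || { echo "missing jar: ${_jar}" >&2; return 1; }
    _out="${_out:+${_out},}${_jar}"   # add a comma only between entries
  done
  printf '%s\n' "${_out}"
}
```

For example, `spark-submit --jars "$(join_jars /usr/lib/datasketches/datasketches-java-3.1.0.jar /usr/lib/datasketches/datasketches-memory-2.0.0.jar)" ...` would produce the same `--jars` value as the commands above.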

Hive:

Change to the /usr/lib/datasketches directory and follow the Datasketches Hive examples.

Pig:

Change to the /usr/lib/datasketches directory and follow the Datasketches Pig examples.
