This initialization action installs version 0.12.0 RC2 of Apache Gobblin on all nodes within a Google Cloud Dataproc cluster. The distribution is hosted in the Dataproc-team-owned Google Cloud Storage bucket `gobblin-dist`.
You can use this initialization action to create a new Dataproc cluster with Gobblin installed:

- Use the `gcloud` command to create a new cluster with this initialization action:

    ```bash
    REGION=<region>
    CLUSTER_NAME=<cluster_name>
    gcloud dataproc clusters create ${CLUSTER_NAME} \
        --region ${REGION} \
        --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gobblin/gobblin.sh
    ```
- Submit jobs:

    ```bash
    gcloud dataproc jobs submit hadoop --cluster=<CLUSTER_NAME> \
        --class org.apache.gobblin.runtime.mapreduce.CliMRJobLauncher \
        --properties mapreduce.job.user.classpath.first=true \
        -- \
        -sysconfig /usr/local/lib/gobblin/conf/gobblin-mapreduce.properties \
        -jobconfig gs://<PATH_TO_JOB_CONFIG>
    ```

    Alternatively, you can submit jobs through the Gobblin launcher scripts located in `/usr/local/lib/gobblin/bin`. By default, Gobblin is configured only for MapReduce mode.

- To learn how to use Gobblin, read the Getting Started guide in the Gobblin documentation.
- For Gobblin to work with the Dataproc Job API, any additional client libraries (for example, Kafka or MySQL) must be symlinked into the `/usr/lib/hadoop/lib` directory on each node.
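The `-jobconfig` argument in the job-submission command above points at a Gobblin job configuration file. As a rough sketch, such a file is a Java properties file along these lines; the job name, source class, and other values here are illustrative placeholders rather than part of this initialization action, so consult the Gobblin Getting Started guide for real examples:

```properties
# Sketch of a Gobblin job config (.pull file); all values are illustrative.
job.name=ExampleIngest
job.group=examples

# The source determines what data is pulled; this class name is a placeholder.
source.class=org.apache.gobblin.example.wikipedia.WikipediaSource

# Write output as Avro files to HDFS and publish with the default publisher.
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```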
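To illustrate the symlinking note above, the following is a minimal sketch of a helper that links matching client jars into Hadoop's lib directory. The `link_client_jars` name and the example paths are assumptions for illustration, not part of the initialization action:

```shell
# Hypothetical helper: symlink jars matching a glob pattern from a source
# directory into a destination directory (e.g. /usr/lib/hadoop/lib on a node).
link_client_jars() {
  local src_dir=$1 pattern=$2 dest_dir=$3
  local jar
  for jar in "${src_dir}"/${pattern}; do
    # Skip when the glob matched nothing.
    if [ -e "${jar}" ]; then
      ln -sf "${jar}" "${dest_dir}/"
    fi
  done
}

# Example (run as root on each cluster node; paths are illustrative):
# link_client_jars /usr/local/lib/gobblin/lib 'kafka-*.jar' /usr/lib/hadoop/lib
```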