Apache Gobblin Initialization Action

This initialization action installs version 0.12.0 RC2 of Apache Gobblin on all nodes within a Google Cloud Dataproc cluster.

The distribution is hosted in the Dataproc-team-owned Google Cloud Storage bucket gobblin-dist.

Using this initialization action

⚠️ NOTICE: See the best practices for using initialization actions in production.

You can use this initialization action to create a new Dataproc cluster with Gobblin installed:

  1. Use the gcloud command to create a new cluster with this initialization action:

    REGION=<region>
    CLUSTER_NAME=<cluster_name>
    gcloud dataproc clusters create ${CLUSTER_NAME} \
        --region ${REGION} \
        --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gobblin/gobblin.sh
  2. Submit jobs:

    gcloud dataproc jobs submit hadoop --cluster=<CLUSTER_NAME> \
        --class org.apache.gobblin.runtime.mapreduce.CliMRJobLauncher \
        --properties mapreduce.job.user.classpath.first=true \
        -- \
        -sysconfig /usr/local/lib/gobblin/conf/gobblin-mapreduce.properties \
        -jobconfig gs://<PATH_TO_JOB_CONFIG>

    Alternatively, you can submit jobs through the Gobblin launcher scripts located in /usr/local/lib/gobblin/bin. By default, Gobblin is configured for mapreduce mode only.

  3. To learn how to use Gobblin, read the Getting Started guide in the documentation.
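
The -jobconfig flag in step 2 points at a Gobblin job configuration file stored in Cloud Storage. As a sketch only, here is a minimal job config modeled on the Wikipedia example from Gobblin's Getting Started guide; the class names and property values are illustrative and should be adapted to your actual source and sink:

```properties
# Hypothetical job config, modeled on Gobblin's Wikipedia example.
job.name=PullFromWikipedia
job.group=Wikipedia
job.description=A sample job that pulls revisions from Wikipedia

# Source and converter classes from the Gobblin examples module.
source.class=org.apache.gobblin.example.wikipedia.WikipediaSource
converter.classes=org.apache.gobblin.example.wikipedia.WikipediaConverter

# Write Avro output to HDFS and publish it with the default publisher.
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```

Upload a file like this to a bucket and pass its gs:// path as the -jobconfig argument.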

Important notes

  1. For Gobblin to work with the Dataproc Job API, any additional client libraries (for example, Kafka or MySQL) must be symlinked into the /usr/lib/hadoop/lib directory on each node.
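
The note above can be sketched as a small helper script. link_client_libs is a hypothetical function name introduced here for illustration; the destination path mirrors the /usr/lib/hadoop/lib directory mentioned in this README, and the script would need to run as root on each node:

```shell
#!/bin/bash
# Sketch: symlink every jar from a client-library directory into Hadoop's
# lib directory so the jars land on the Hadoop classpath.
#
# On a Dataproc node you would run, as root:
#   link_client_libs /path/to/client/libs /usr/lib/hadoop/lib
link_client_libs() {
  local src=$1 dst=$2
  local jar
  mkdir -p "${dst}"
  for jar in "${src}"/*.jar; do
    [ -e "${jar}" ] || continue   # skip when the glob matches nothing
    ln -sf "${jar}" "${dst}/$(basename "${jar}")"
  done
}
```

Running this on every node, for example from a custom initialization action, makes the extra client classes loadable by jobs submitted through the Dataproc Job API.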