Oozie

This initialization action installs the Oozie workflow scheduler on a Google Cloud Dataproc cluster. The Oozie server, client, and web interface are installed.

Alternatives to Oozie for workflow orchestration

Using this initialization action

⚠️ NOTICE: See best practices of using initialization actions in production.

You can use this initialization action to create a new Dataproc cluster with Oozie installed:

  1. Use the gcloud command to create a new cluster with this initialization action. The following command creates a new cluster with the name set in CLUSTER_NAME:

    REGION=<region>
    CLUSTER_NAME=<cluster_name>
    gcloud dataproc clusters create ${CLUSTER_NAME} \
        --region ${REGION} \
        --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/oozie/oozie.sh

    Optional arguments which can be passed as --metadata values:

    1. http-proxy - HTTP proxy to use for outbound requests
    2. email-smtp-host - SMTP server to use for outbound email
    3. email-from-address - Address from which to send email
    4. oozie-db-name - MySQL database name - default: "oozie"
    5. oozie-db-username - MySQL user by which the database is accessed - default: "oozie"
    6. oozie-password-secret-name - Name of Secret Manager secret used to store oozie database user's password
    7. oozie-password-secret-version - Version of Secret Manager secret used to store oozie database user's password - default: 1
    8. mysql-root-username - Administrative MySQL user by which the database is managed - default: "root"
    9. mysql-root-password-secret-name - Name of Secret Manager secret used to store MySQL root password
    10. mysql-root-password-secret-version - Version of Secret Manager secret used to store MySQL root password - default: 1
    11. oozie-enable-ssl - Whether to enable SSL for oozie service - default: "false"
  2. Once the cluster has been created, Oozie should be running on the master node.

You can find more information about using initialization actions with Dataproc in the Dataproc documentation.
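
As a sketch of how the optional arguments listed above might be combined, the following builds a --metadata string from a few of those keys and prints the resulting gcloud command for review instead of running it. The region, cluster name, secret name, and SMTP host are placeholder values assumed for illustration, not values this action requires.

```shell
#!/bin/sh
# Placeholder values; substitute your own.
REGION="us-central1"
CLUSTER_NAME="my-dataproc-cluster"

# Optional metadata keys from the list above. The values are examples only.
METADATA="oozie-db-name=oozie"
METADATA="${METADATA},oozie-db-username=oozie"
# Hypothetical Secret Manager secret name.
METADATA="${METADATA},oozie-password-secret-name=my-oozie-db-password"
METADATA="${METADATA},oozie-password-secret-version=1"
# Hypothetical SMTP host.
METADATA="${METADATA},email-smtp-host=smtp.example.com"

# Print the command for review; remove the leading 'echo' to run it.
echo gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --region "${REGION}" \
    --initialization-actions "gs://goog-dataproc-initialization-actions-${REGION}/oozie/oozie.sh" \
    --metadata "${METADATA}"
```

Keys that are left out fall back to the defaults shown in the list.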

Testing Oozie

You can test this Oozie installation by running the oozie-examples included with Oozie. The examples are in an archive at /usr/share/doc/oozie/oozie-examples.tar.gz. To run the MapReduce example, do the following from a cluster master node:

  1. Copy the examples to your home directory:

    cp /usr/share/doc/oozie/oozie-examples.tar.gz ~
    
  2. Decompress the archive:

    tar -xzf oozie-examples.tar.gz
    
  3. Edit the MapReduce example (~/examples/apps/map-reduce/job.properties) with details for your cluster:

    On standard and single node clusters, use the master node hostname:

    nameNode=hdfs://<cluster-name-m>:8020
    jobTracker=<cluster-name-m>:8032
    

    On high availability clusters, use the nameservice ID (by default, the cluster name):

    nameNode=hdfs://<cluster-name>:8020
    jobTracker=<cluster-name>:8032
    
  4. Copy the Oozie examples to HDFS:

    hadoop fs -put ~/examples/ /user/${USER}/
    
  5. Run the example on the command line with:

    oozie job -oozie http://${HOSTNAME}:11000/oozie -config ~/examples/apps/map-reduce/job.properties -run
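
The edit in step 3 can be scripted. The sketch below assumes a standard (non-HA) cluster and a job.properties file that already contains nameNode= and jobTracker= lines; the helper name set_cluster_endpoints is made up for illustration. It is demonstrated on a scratch file so that nothing on the cluster is modified.

```shell
#!/bin/sh
# set_cluster_endpoints FILE MASTER_HOST
# Hypothetical helper: points the nameNode and jobTracker properties in FILE
# at MASTER_HOST, using the ports given in this README (8020 and 8032).
set_cluster_endpoints() {
    props_file="$1"
    master_host="$2"
    tmp_file="${props_file}.tmp"
    # Rewrite only the two endpoint lines and leave every other property as-is.
    sed -e "s|^nameNode=.*|nameNode=hdfs://${master_host}:8020|" \
        -e "s|^jobTracker=.*|jobTracker=${master_host}:8032|" \
        "${props_file}" > "${tmp_file}" && mv "${tmp_file}" "${props_file}"
}

# Demonstration on a scratch file with made-up starting values.
demo_file="$(mktemp)"
printf 'nameNode=hdfs://localhost:8020\njobTracker=localhost:8021\nqueueName=default\n' > "${demo_file}"
set_cluster_endpoints "${demo_file}" "my-dataproc-cluster-m"
cat "${demo_file}"
```

On a master node, the same helper could be pointed at ~/examples/apps/map-reduce/job.properties with the node's own hostname as the second argument. If the properties file uses different key names, the sed expressions would need to be adjusted to match.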
    

Oozie web interface

The Oozie web interface is available on port 11000 on the master node of the cluster. For example, for a cluster named my-dataproc-cluster, the Oozie web interface would be available at the following address:

http://my-dataproc-cluster-m:11000/oozie

To connect to the web interface, you will need to create an SSH tunnel and use a SOCKS proxy. Instructions on how to do this are available in the Cloud Dataproc documentation.
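
One way to set up such a tunnel, sketched under assumptions: the commands below follow the general pattern of opening a SOCKS proxy with gcloud compute ssh and pointing a browser at it. The project, zone, cluster name, and local port are placeholders, and the browser binary name (google-chrome here) varies by platform. The script only prints the commands so they can be reviewed before use.

```shell
#!/bin/sh
# Placeholder values; substitute your own.
PROJECT="my-project"
ZONE="us-central1-a"
CLUSTER_NAME="my-dataproc-cluster"
SOCKS_PORT=1080

# Master node hostname and the Oozie address from this README.
MASTER="${CLUSTER_NAME}-m"
OOZIE_URL="http://${MASTER}:11000/oozie"

# 1. Open a SOCKS proxy through an SSH tunnel to the master node.
#    -D opens a dynamic (SOCKS) forward on the local port; -N runs no remote command.
echo "gcloud compute ssh ${MASTER} --project=${PROJECT} --zone=${ZONE} -- -D ${SOCKS_PORT} -N"

# 2. In a second terminal, start a browser that sends its traffic through the proxy.
#    'google-chrome' is an assumed binary name; a separate profile directory
#    keeps the proxy settings away from your normal browser profile.
echo "google-chrome --proxy-server=socks5://localhost:${SOCKS_PORT} --user-data-dir=/tmp/${MASTER} ${OOZIE_URL}"
```

The first command stays in the foreground while the tunnel is open; closing it ends the proxy.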

Important notes

  • As Oozie is updated in Bigtop, the version of Oozie installed by this action will change.
  • HDFS (the NameNode) listens on port 8020 and the YARN ResourceManager, which Oozie's jobTracker property points to, listens on port 8032. Some jobs need these values.
  • The hive2 action is recommended over the hive action.
    • The hive2 action connects to the cluster's HiveServer2 and behaves like Dataproc Hive jobs.
    • The hive action uses Oozie's bundled version of Hive (1.2 in Oozie 4.3) and does not by default use the cluster's Hive metastore, which will cause table metadata to be lost.