Getting started with Google Cloud Dataflow

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. This guide walks you through the steps needed to run an Apache Beam pipeline on the Google Cloud Dataflow runner.

Setting up your Google Cloud project

The following instructions help you prepare your Google Cloud project.

  1. Install the Cloud SDK.

    ℹ️ This is not required in Cloud Shell since it already has the Cloud SDK pre-installed.

  2. Create a new Google Cloud project via the New Project page, or with the gcloud command-line tool.

    export PROJECT=your-google-cloud-project-id
    gcloud projects create $PROJECT

  3. Set up the Cloud SDK for your GCP project.

    gcloud init

  4. Enable billing.

  5. Enable the Dataflow API.

  6. Create a service account JSON key via the Create service account key page, or with the following gcloud commands.

    export PROJECT=$(gcloud config get-value project)
    export SA_NAME=samples
    export IAM_ACCOUNT=$SA_NAME@$PROJECT.iam.gserviceaccount.com
    
    # Create the service account.
    gcloud iam service-accounts create $SA_NAME --display-name $SA_NAME
    
    # Set the role to Project Owner (*).
    gcloud projects add-iam-policy-binding $PROJECT \
      --member serviceAccount:$IAM_ACCOUNT \
      --role roles/owner
    
    # Create a JSON file with the service account credentials.
    export GOOGLE_APPLICATION_CREDENTIALS=path/to/your/credentials.json
    gcloud iam service-accounts keys create $GOOGLE_APPLICATION_CREDENTIALS \
      --iam-account=$IAM_ACCOUNT

    ℹ️ The Role field authorizes your service account to access resources. You can view and change this field later by using the GCP Console IAM page. If you are developing a production app, specify more granular permissions than roles/owner.

    To learn more about roles in service accounts, see Granting roles to service accounts.

    To learn more about service accounts, see Creating and managing service accounts.

  7. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your service account key file.

    export GOOGLE_APPLICATION_CREDENTIALS=path/to/your/credentials.json
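
Steps 5 and 7 can also be handled from the command line. The following is a minimal sketch, not part of the original instructions: the function names `enable_dataflow_api` and `check_credentials` are hypothetical helpers, and the sketch assumes the Cloud SDK is installed and a project is already selected via `gcloud init`.

```shell
# Hypothetical helpers; the function names are illustrative and not part of the Cloud SDK.

# Step 5 from the command line: enable the Dataflow API for the active project.
enable_dataflow_api() {
  gcloud services enable dataflow.googleapis.com
}

# Step 7 sanity check: succeed only if GOOGLE_APPLICATION_CREDENTIALS
# is set and points to a readable, non-empty file.
check_credentials() {
  if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
    echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
    return 1
  fi
  if [ ! -r "$GOOGLE_APPLICATION_CREDENTIALS" ] || [ ! -s "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
    echo "Cannot read key file: $GOOGLE_APPLICATION_CREDENTIALS" >&2
    return 1
  fi
  echo "Using credentials from $GOOGLE_APPLICATION_CREDENTIALS"
}
```

Calling `check_credentials` before launching a pipeline catches a missing or mistyped key path early. It only checks that the file exists and is readable; it does not validate the key itself.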

Setting up a Java development environment

The following instructions help you prepare your development environment.

  1. Download and install the Java Development Kit. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.

    "$JAVA_HOME/bin/java" -version

  2. Download and install Apache Maven by following the Maven installation guide for your specific operating system.

    mvn --version

  3. (Optional) Set up an IDE such as IntelliJ, VS Code, Eclipse, or NetBeans.
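
The version checks above can be bundled into a single preflight step. This is a minimal sketch under the assumption that `java` and `mvn` should both be on your `PATH`; `require_tools` is a hypothetical helper name, not a standard command.

```shell
# Hypothetical helper: report whether each required tool is on the PATH.
# Returns 0 if every tool is found, 1 if at least one is missing.
require_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For this guide you would call it as `require_tools java mvn`. It only confirms the tools can be found; it does not check their versions.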

(Optional) Create a new Apache Beam pipeline

The easiest way to create a new Apache Beam pipeline is through the starter Maven archetype.

export NAME=your-pipeline-name
export PACKAGE=org.apache.beam.samples
export JAVA_VERSION=11

# This creates a new directory with the pipeline's code within it.
mvn archetype:generate \
    -DarchetypeGroupId=org.apache.beam \
    -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-starter \
    -DtargetPlatform=$JAVA_VERSION \
    -DartifactId=$NAME \
    -DgroupId=$PACKAGE \
    -DinteractiveMode=false

# Navigate to the pipeline contents.
cd $NAME

Make sure you have the latest plugin and dependency versions, and update your pom.xml file accordingly.

# Check your plugin versions.
mvn versions:display-plugin-updates

# Check your dependency versions.
mvn versions:display-dependency-updates

Finally, add the runners and I/O transforms you need to your pom.xml file.
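
To confirm that a runner has actually been added, you can search the pom.xml for its artifact. The sketch below assumes the Dataflow runner is declared through the `beam-runners-google-cloud-dataflow-java` artifact; `pom_has_artifact` is a hypothetical helper name used for illustration only.

```shell
# Hypothetical helper: succeed if the given pom.xml mentions an artifactId.
# This is a plain text match, so it also matches entries that sit inside
# a Maven profile or an XML comment.
pom_has_artifact() {
  pom_file=$1
  artifact=$2
  grep -qF "<artifactId>${artifact}</artifactId>" "$pom_file"
}
```

From the pipeline directory, `pom_has_artifact pom.xml beam-runners-google-cloud-dataflow-java` exits with status 0 when the Dataflow runner artifact appears in the file.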