Datastream to MySQL or PostgreSQL (Stream) template

The Datastream to SQL template is a streaming pipeline that reads Datastream data and replicates it into any MySQL or PostgreSQL database. The template reads data from Cloud Storage using Pub/Sub notifications and replicates this data into SQL replica tables.

The template does not support data definition language (DDL) and expects that all tables already exist in the database. Replication uses Dataflow stateful transforms to filter stale data and ensure consistency in out of order data. For example, if a more recent version of a row has already passed through, a late arriving version of that row is ignored. The data manipulation language (DML) that executes is a best attempt to perfectly replicate source to target data. The DML statements executed follow the following rules:

  • If a primary key exists, insert and update operations use upsert syntax (ie. INSERT INTO table VALUES (...) ON CONFLICT (...) DO UPDATE).
  • If primary keys exist, deletes are replicated as a delete DML.
  • If no primary key exists, both insert and update operations are inserted into the table.
  • If no primary keys exist, deletes are ignored.

If you are using the Oracle to Postgres utilities, add ROWID in SQL as the primary key when none exists.

Pipeline requirements

  • A Datastream stream that is ready to or already replicating data.
  • Cloud Storage Pub/Sub notifications are enabled for the Datastream data.
  • A PostgreSQL database was seeded with the required schema.
  • Network access between Dataflow workers and PostgreSQL is set up.

Template parameters

Required parameters

  • inputFilePattern : The file location for the Datastream files in Cloud Storage to replicate. This file location is typically the root path for the stream.
  • databaseHost : The SQL host to connect on.
  • databaseUser : The SQL user with all required permissions to write to all tables in replication.
  • databasePassword : The password for the SQL user.

Optional parameters

  • gcsPubSubSubscription : The Pub/Sub subscription with Datastream file notifications. For example, projects/<PROJECT_ID>/subscriptions/<SUBSCRIPTION_ID>.
  • inputFileFormat : The format of the output file produced by Datastream. For example, avro or json. Defaults to avro.
  • streamName : The name or template for the stream to poll for schema information. The default value is {_metadata_stream}.
  • rfcStartDateTime : The starting DateTime used to fetch from Cloud Storage (https://tools.ietf.org/html/rfc3339). Defaults to: 1970-01-01T00:00:00.00Z.
  • dataStreamRootUrl : Datastream API Root URL. Defaults to: https://datastream.googleapis.com/.
  • databaseType : The database type to write to (for example, Postgres). Defaults to: postgres.
  • databasePort : The SQL database port to connect to. The default value is 5432.
  • databaseName : The name of the SQL database to connect to. The default value is postgres.
  • schemaMap : A map of key/values used to dictate schema name changes (ie. old_name:new_name,CaseError:case_error). Defaults to empty.
  • customConnectionString : Optional connection string which will be used instead of the default database string.

Run the template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Cloud Datastream to SQL template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --enable-streaming-engine \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/Cloud_Datastream_to_SQL \
    --parameters \
inputFilePattern=GCS_FILE_PATH,\
gcsPubSubSubscription=GCS_SUBSCRIPTION_NAME,\
databaseHost=DATABASE_HOST,\
databaseUser=DATABASE_USER,\
databasePassword=DATABASE_PASSWORD
  

Replace the following:

  • PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • REGION_NAME: the region where you want to deploy your Dataflow job—for example, us-central1
  • VERSION: the version of the template that you want to use

    You can use the following values:

  • GCS_FILE_PATH: the Cloud Storage path to Datastream data. For example: gs://bucket/path/to/data/
  • GCS_SUBSCRIPTION_NAME: the Pub/Sub subscription to read changed files from. For example: projects/my-project-id/subscriptions/my-subscription-id.
  • DATABASE_HOST: your SQL host IP.
  • DATABASE_USER: your SQL user.
  • DATABASE_PASSWORD: your SQL password.

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {

          "inputFilePattern": "GCS_FILE_PATH",
          "gcsPubSubSubscription": "GCS_SUBSCRIPTION_NAME",
          "databaseHost": "DATABASE_HOST",
          "databaseUser": "DATABASE_USER",
          "databasePassword": "DATABASE_PASSWORD"
      },
      "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/Cloud_Datastream_to_SQL",
   }
}
  

Replace the following:

  • PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • LOCATION: the region where you want to deploy your Dataflow job—for example, us-central1
  • VERSION: the version of the template that you want to use

    You can use the following values:

  • GCS_FILE_PATH: the Cloud Storage path to Datastream data. For example: gs://bucket/path/to/data/
  • GCS_SUBSCRIPTION_NAME: the Pub/Sub subscription to read changed files from. For example: projects/my-project-id/subscriptions/my-subscription-id.
  • DATABASE_HOST: your SQL host IP.
  • DATABASE_USER: your SQL user.
  • DATABASE_PASSWORD: your SQL password.

What's next