Use Slowly changing dimensions in Dataform

This document you how to use the open-source Slowly changing dimensions package in Dataform as an example of using an open-source package.

The Slowly changing dimensions package contains common data models for creating type 2 slowly changing dimensions tables from mutable data sources in Dataform.

Slowly changing dimensions tables are incremental tables that contain data that can change unpredictably, not on a regular schedule, such as customers or products. In a type 2 slowly changing dimensions table, new data is appended in a new row without overwriting existing table rows. Table history is preserved in multiple records for a given key in the slowly changing dimension key. Each record has a unique key.

The Slowly changing dimensions package creates the following relations in BigQuery for a given NAME:

  • NAME - a view with scd_valid_from and scd_valid_to fields
  • NAME_updates - an incremental table that stores the change history of the source table

Dataform updates slowly changing dimensions every time it executes a slowly changing dimensions incremental table. You might want to schedule slowly changing dimensions table to run every day or every hour, depending on the granularity of changes you want to capture.

To learn how to schedule Dataform executions, see Schedule executions with Cloud Composer or Schedule executions with Workflows and Cloud Scheduler.

Before you begin

  1. On the Slowly changing dimensions releases page, copy the .tar.gz URL of the newest release.
  2. Create a Dataform repository.
  3. Create and initialize a workspace in the repository.
  4. Install the Slowly changing dimensions package in your Dataform repository.

Required roles

To get the permissions that you need to configure a package, ask your administrator to grant you the Dataform Editor (roles/dataform.editor) IAM role on workspaces. For more information about granting roles, see Manage access.

You might also be able to get the required permissions through custom roles or other predefined roles.

Create a slowly changing dimensions table with the Slowly changing dimensions package

To create a slowly changing dimensions table with the Slowly changing dimensions package package in Dataform, follow these steps:

  1. In the Files pane, next to definitions/, click the More menu.

  2. Click Create file.

  3. In the Create new file pane, do the following:

    1. In the Add a file path field, after definitions/, enter the name of the file followed by .js. For example, definitions/definitions.js.

    2. Click Create file.

  4. In the Files pane, select the newly created .js. file.

  5. Import the package to the file in the following format:

     const CONSTANT-NAME = require("dataform-scd");
    

    Replace CONSTANT-NAME with a name for the constant, for example, scd.

  6. Create slowly changing dimensions tables in the following format:

    scd("source_data_scd", {
      uniqueKey: "UNIQUE_ID",
      timestamp: "UPDATED_AT", // A field that stores a timestamp or date of when the row was last changed.
      source: {
        schema: "SOURCE_SCHEMA",     // The source table to build slowly changing dimensions from.
        name: "SOURCE_SCHEMA_NAME",
      },
      incrementalConfig: {        // Any configuration parameters to apply to the incremental table that will be created.
        bigquery: {
          partitionBy: "UPDATED_AT",
        },
      },
    });
    

    Replace the following:

    • UNIQUE_ID: a unique identifier for rows in the table
    • UPDATED_AT: a name for the field that stores a timestamp or date of when the row was last changed, for example, updated_at
    • SOURCE_SCHEMA: the schema of the source table, for example, dataform_scd_example
    • SOURCE_SCHEMA_NAME: the name of the source table, for example, source_data
  7. Optional: Click Format.

The following code sample shows a slowly changing dimensions table definition created with the Slowly changing dimensions package:

const scd = require("dataform-scd");

/**
 * Create an SCD table on top of the table defined in source_data.sqlx.
 */
const { updates, view } = scd("source_data_scd", {
  // A unique identifier for rows in the table.
  uniqueKey: "user_id",
  // A field that stores a timestamp or date of when the row was last changed.
  timestamp: "updated_at",
  // The source table to build slowly changing dimensions from.
  source: {
    schema: "dataform_scd_example",
    name: "source_data",
  },
  // Any tags that will be added to actions.
  tags: ["slowly-changing-dimensions"],
  // Documentation of table columns
  columns: {user_id: "User ID", some_field: "Data Field", updated_at: "Timestamp for updates"},
  // Configuration parameters to apply to the incremental table that will be created.
  incrementalConfig: {
    bigquery: {
      partitionBy: "updated_at",
    },
  },
});

// Additional customization of the created models can be done by using the returned actions objects.
updates.config({
  description: "Updates table for SCD",
});

What's next