Splitting repositories

This document discusses strategies to split a Dataform repository and manage cross-repository dependencies.

Repositories are the core units in Dataform. A repository stores all SQLX and JavaScript files that make up your SQL workflow, as well as Dataform configuration files and packages. You can store a SQL workflow in a single repository, or split a SQL workflow between multiple repositories.

Splitting a repository in Dataform comes with the following advantages:

  • Adhering to Dataform limits on compilation resource usage. Splitting a large SQL workflow into multiple smaller repositories lowers the risk of exceeding these limits.
  • Finer-grained processes. You can set processes, such as continuous integration (CI) rules, individually for each split fragment of your SQL workflow and the team developing it.
  • Finer-grained permissions. You can set permissions individually for each split fragment of your SQL workflow and the team developing it, which enhances the overall security of the SQL workflow.
  • Improving collaboration by minimizing the number of collaborators working on each split fragment of your SQL workflow.
  • Improving codebase readability. Splitting the files that make up a large SQL workflow into multiple repositories makes it easier to navigate each repository individually than to navigate the entire SQL workflow at once.
  • Speeding up workflow execution of each split fragment of your SQL workflow in comparison to execution of the entire SQL workflow.

Splitting a repository in Dataform comes with the following downsides:

  • Custom continuous integration/continuous deployment (CI/CD) configuration required for each Dataform repository and its corresponding Git repository.
  • Custom scheduling configuration required for each Dataform repository and its corresponding Git repository.
  • Difficulty in managing dependencies between objects of your workflow housed in multiple repositories.
  • Lack of comprehensive directed acyclic graph (DAG) visualization of the SQL workflow split between multiple repositories. In each repository, the generated DAG represents only a portion of the complete SQL workflow.

Strategies for splitting a repository

When you split a repository, you divide the files that make up a parent SQL workflow into smaller child SQL workflows housed in separate Dataform repositories.

You might choose to split a repository in one of the following ways:

  • One repository per development team.
  • One repository per domain, for example, sales, marketing, or logistics.
  • One central repository and one repository per domain that uses the contents of the central repository as data sources.

To house the parent SQL workflow on a third-party Git hosting platform, you need to individually connect each of the separate repositories containing child workflows to a dedicated third-party Git repository.

Managing cross-repository dependencies

The most efficient way to split a repository is to divide the parent SQL workflow into self-contained child SQL workflows, creating independent repositories. An independent repository does not use the contents of a different repository as a data source. This approach does not require managing cross-repository dependencies.

When you cannot avoid cross-repository dependencies, you can manage them by splitting a repository into a succession of repositories in which each repository depends on its predecessor and is a data source for its successor. The succession of repositories and their dependencies should reflect the structure of your parent SQL workflow as closely as possible.

You can create dependencies between repositories with Dataform data source declarations. In the repository that you are editing, you can declare a BigQuery table produced by a different Dataform repository as a data source. After you declare a data source, you can reference it like any other Dataform SQL workflow object and use it to develop your SQL workflow.
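
As a minimal sketch of such a declaration, the JavaScript file below declares a table that another repository produces. The file path and the project, dataset, and table names (`upstream-project`, `sales_reporting`, `orders`) are hypothetical placeholders. The `typeof` guard is an addition of this sketch, so that the file also loads outside Dataform, where the `declare` global does not exist.

```javascript
// definitions/sources/upstream_orders.js (hypothetical path)

// Location of a BigQuery table that a different Dataform repository produces.
// All three values are placeholders; replace them with your own.
const upstreamOrders = {
  database: "upstream-project", // Google Cloud project ID (assumed)
  schema: "sales_reporting",    // BigQuery dataset (assumed)
  name: "orders",               // BigQuery table (assumed)
};

// Dataform provides `declare` as a global when it compiles files in definitions/.
// The guard is only here so that this sketch also loads outside Dataform.
if (typeof declare === "function") {
  declare(upstreamOrders);
}
```

After the declaration compiles, other files in the same repository can reference the table with `ref("orders")`, in the same way as a table that is defined locally.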

When you schedule execution of a SQL workflow split between repositories with cross-repository dependencies, you must execute the repositories one by one in the order of cross-repository dependencies.
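
The ordering rule above can be sketched as a topological sort over the repositories. The example below is an illustration only and is not part of Dataform: the repository names, the `repositoryDependsOn` map, and the `executionOrder` helper are all hypothetical, and the sketch assumes you maintain the dependency map yourself.

```javascript
// Hypothetical map: each repository lists the repositories it uses as data sources.
const repositoryDependsOn = {
  central: [],
  sales: ["central"],
  marketing: ["central"],
  reporting: ["sales", "marketing"],
};

// Returns repository names ordered so that each repository comes after all of
// its data sources (Kahn's algorithm). Throws if the dependencies form a cycle.
function executionOrder(graph) {
  const remaining = new Map();  // repository -> number of data sources not yet ordered
  const dependents = new Map(); // repository -> repositories that read from it
  for (const repo of Object.keys(graph)) {
    remaining.set(repo, graph[repo].length);
    dependents.set(repo, []);
  }
  for (const [repo, sources] of Object.entries(graph)) {
    for (const source of sources) {
      if (!dependents.has(source)) {
        throw new Error(`Unknown repository: ${source}`);
      }
      dependents.get(source).push(repo);
    }
  }

  // Start with the repositories that have no data sources in other repositories.
  const ready = [...remaining.keys()].filter((repo) => remaining.get(repo) === 0);
  const order = [];
  while (ready.length > 0) {
    const repo = ready.shift();
    order.push(repo);
    for (const dependent of dependents.get(repo)) {
      remaining.set(dependent, remaining.get(dependent) - 1);
      if (remaining.get(dependent) === 0) ready.push(dependent);
    }
  }

  if (order.length !== remaining.size) {
    throw new Error("Cyclic dependency between repositories");
  }
  return order;
}

console.log(executionOrder(repositoryDependsOn).join(" -> "));
// prints "central -> sales -> marketing -> reporting"
```

A scheduler could then trigger one execution per repository in the returned order, waiting for each to finish before starting the next.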

We recommend that you avoid splitting a repository into a group of repositories with two-way dependencies. A two-way dependency between repositories occurs when a repository is a data source for a different repository and also uses that repository as a data source. Two-way dependencies between repositories complicate scheduling and execution of the parent SQL workflow, as well as development processes.
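
One way to catch two-way dependencies before they reach production is a small check over a map of repositories and their data sources. The sketch below is hypothetical and is not a Dataform feature: the `dataSources` map and the `findTwoWayDependencies` helper are assumptions made for illustration. It detects only direct two-way dependencies between a pair of repositories, not longer cycles.

```javascript
// Hypothetical map: each repository lists the repositories it uses as data sources.
const dataSources = {
  sales: ["marketing"],
  marketing: ["sales", "central"],
  central: [],
};

// Returns every pair of repositories in which each one uses the other as a data source.
function findTwoWayDependencies(graph) {
  const pairs = [];
  for (const [repo, sources] of Object.entries(graph)) {
    for (const source of sources) {
      const reverse = graph[source] || [];
      // Compare names so that each pair is reported once, not twice.
      if (repo < source && reverse.includes(repo)) {
        pairs.push([repo, source]);
      }
    }
  }
  return pairs;
}

console.log(JSON.stringify(findTwoWayDependencies(dataSources)));
// prints [["marketing","sales"]]
```

Running a check like this in CI would flag the `sales` and `marketing` pair, which could then be resolved by moving the shared tables into a repository that both depend on.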
