Google Cloud Data Catalog and Tableau integration

Leverage Data Catalog to discover & annotate Tableau assets

Ricardo Mendes
Google Cloud - Community

--

Background photo by Lauren Mancke on Unsplash

The Google Cloud Data Catalog Team has recently announced its product is now GA and ready to accept custom (aka user-defined) entries! This brand new feature opens up scope for integrations and now users can leverage Data Catalog’s well-known potential to manage metadata from almost any kind of data asset.

To demonstrate how it works, I’ll share design thoughts and sample code to connect Data Catalog to market-leader Business Intelligence/Data Visualization tools, covering Tableau metadata integration in this blog post. They come from the experience of participating in the development of fully operational sample connectors, publicly available on GitHub.

Disclaimer: Google and/or Google Cloud do not officially support any tool to connect Data Catalog to non-GCP systems at the time this article has been written down (May 2020). What you will find here is merely the result of my experience as a Data Catalog early adopter.

Technical requirements

Let’s get started with the minimum technical requirements to make the integration happen:

  1. from Tableau: allow read-access to metadata belonging to assets hosted in a given server/instance;
  2. from Data Catalog: allow users to ingest metadata representing the Tableau-managed assets;
  3. connector: a component that scrapes metadata from a Tableau instance, prepares the information to fit Data Catalog’s entity model and ingests the prepared metadata.
Architecture overview: integrating Google Cloud Data Catalog and Tableau
Image 1. Architecture overview: integrating Data Catalog and Tableau

Since 1 and 2 are already provided by Tableau and Google Cloud, our focus has been on the development of a software component addressing the 3rd one.

Adapting distinct concepts

One of the first steps in this kind of integration is to map business entities from the source to the target system. Three Tableau types were chosen to illustrate this: Workbook, Sheet, and Dashboard.

Google Cloud Data Catalog and Tableau integration: adapting entity models
Image 2. Data Catalog and Tableau integration: adapting entity models

We need to adjust such types to fit the Data Catalog model, which is more generic. All Tableau assets of these types will turn into entries in the catalog since Entry is a first-class citizen there. They will be distinguishable by the new userSpecifiedType attribute added to the Entry class, as we’ll see next.

But Tableau entities have many more attributes than catalog entries can support… In this case, we can use Tags to annotate them and avoid missing meaningful information. Tags are based on Templates, which means we will leverage three Data Catalog main types to achieve the goal. I assume you have at least a basic understanding of them, by the way. In case you don’t, please take a look at this blog post.

There’s a missing piece when it comes to entity associations and parent/child relationships. E.g., a workbook contains multiple sheets and dashboards; at the same time, it depends on database tables to retrieve information from. Data Catalog currently doesn’t provide native support for structuring such relationships in custom types. And, again, Tags may help us to map them.

Deciding between Tableau REST and Metadata APIs

Tableau provides distinct ways to access content on their servers — among them, the so-called REST and Metadata APIs. Given this, we need to make a decision on which one will be used to build the bridge with Data Catalog.

The REST API is more mature and allows users to manage and change Tableau Server resources programmatically, using HTTP. In general, it allows us to retrieve less information than the Metadata API.

The Metadata API is newer yet powerful in the sense it allows users to read more comprehensive metadata set from Tableau assets, including lineage information! Yes, it brings the relationships between the asset you’re evaluating and other items managed by either Tableau or external systems, such as database servers.

There’s also a significant technical difference between them: REST API endpoints (or resources) are accessible through standard HTTP requests, i.e. GET, POST, and so on. The Metadata API, on the other hand, is powered by GraphQL.

Like REST APIs, GraphQL is also served over HTTP. However, instead of sending requests to multiple endpoints to return the data of interest, you can send one query to one endpoint and have that filtered to return only what you ask for. A GraphQL server is associated with one endpoint. Introduction to Tableau Metadata API

Probably you guessed the Metadata API is more suitable for our needs. And you’re right! It provides 99% of what we need to make the integration happen, except authenticating users and retrieving a list of sites belonging to a given Tableau server, as far as we have investigated. So, in fact, we need both APIs…

Connecting to Tableau

Once the general scenario has been presented, it’s time to see some practical stuff. The code snippets for this article were written in Python.

Before anything else, we need to authenticate, and the REST API is the key to do it programmatically. A vanilla HTTP request is enough in this case:

The above method returns an access token that must be provided as the X-Tableau-Auth header in the subsequent requests.

Sidenote: although a Python Client is available, we decided not to depend on it since we’ll have very few interactions with that API.

A similar approach will be used to retrieve information from the Metadata API (there are no client libraries for this API, by the way). The snippet below shows how it works:

Please notice how GraphQL simplifies retrieving information from correlated objects (lines 5 to 14). The workbook is the main entity in this case, but the same query can be used to fetch its sheets. Input query has a specific format; response comes in standard JSON.

We’ll see more details of the Metadata API in the next section.

Scrape Tableau metadata

Understanding how Tableau entities are related to each other is helpful when writing GraphQL queries. I believe a picture is worth a thousand words, so let me “transcript” some information from the Metadata API Reference into a diagram:

A subset of classes belonging to the Tableau Metadata Model
Image 3. A subset of classes belonging to the Tableau Metadata Model

We can write a more complex query based on such a model:

What really matters in the above sample?

  • line 2, tableauSites: GraphQL allows fetching assets’ metadata in a relationship-based approach. Hence, we can write queries to retrieve information from the top-level assets — sites — to the lower-level ones such as sheets and database tables;
  • line 6, workbooks: sites may have workbooks, and in case it happens, the requested workbooks’ info will be retrieved by this query;
  • line 12, sheets: the same as line 6, but referring to the relationship between a workbook and its sheets;
  • line 19, upstreamTables: conceptually, this relationship is a bit different from the previous ones in the sense workbooks may depend on assets to fetch data from, but they don’t own such assets. The word “tables”, in this context, means assets that are usually owned/managed by external systems, such as database servers or even Microsoft Excel. Please also notice the upstream prefix — it means a lineage shortcut in the Tableau metadata model. Lineage shortcuts start with upstream or downstream and may refer to internal or external assets.

For example, you can use upstreamTablesConnection to query the tables used by a data source or use downstreamSheetsConnection to query sheets used by a workbook. Understand the Metadata Model

  • line 25, dashboards: unfortunately NOT (YET?)… At the time this article has been written down (Jan 2020), the Metadata API does not support fetching dashboards nested to their parent workbooks. Hence, an extra query is required to scrape their metadata.

If the above concepts are clear for you, you’re ready to scrape any kind of metadata by querying the Tableau Metadata API. Time to move forward!

Prepare the metadata for Data Catalog

At this point, we have the knowledge to gather all the information we need from Tableau. Creating Entries seems to be a good starting point to convert them into Data Catalog entities since everything else depends on them.

Methods starting with __ represent helper stuff, out of the scope of the snippet.

The above code is intended to be self-explanatory (even for those who don’t know Python…), but there are important points to pay attention:

  • line 5, entry = types.Entry(): only Google Cloud data resources were managed by Data Catalog until now. Being more specific: only automatically synchronized entries (referring to BigQuery and Pub/Sub assets) were managed at the time the product was first launched in beta; some months later, users were able to manually create entries referring to GCS Filesets. From now on, we can create any kind of Entry;
  • line 9, YOUR-ENTRY-GROUP-ID: every Entry must belong to an Entry Group, created before the entries through a very simple API call as we’ll see in the next section. For your reference, automatically managed groups for BigQuery and Pub/Sub entries have the @bigquery and @pubsub ids. User-defined entries will belong to user-defined groups — exactly the same approach used by GCS Filesets. Since the project and location ids are prepended to the entry group id, one can simply use tableau at this point;
  • line 12, entry.user_specified_system = 'tableau': a common value for all Tableau-related entries;
  • line 13, entry.user_specified_type = 'workbook': used to distinguish Tableau-related entry types. The following values can be used to fulfill this attribute in the scope of this discussion: workbook, sheet, and dashboard.

As I mentioned in the Adapting distinct concepts section, Entry’s attributes cover only a small set of Tableau assets’ metadata. But we can leverage Data Catalog Tags to avoid losing valuable information. First, let’s take a look at a Tag Template suggestion:

A Tag from this Template can be attached to each Workbook-related Entry. By doing this, we enrich the metadata available in Data Catalog as if we “added” two extra fields to such entries: Site name and Tables used by the workbook.

Sidenote: Data Catalog currently doesn’t support lists as Tag values, so the table names must be concatenated in order to fulfill the upstream_tables field.

Metadata ingestion

This is the last stage of the integration process, where the Entries and Tags are ingested into Data Catalog. As mentioned in the previous section, every Entry must belong to an Entry Group, so let’s start by creating it:

Then we can iterate through the entries created in the preparation stage and ask Data Catalog to persist them and their Tags as well:

Once the ingestion process is finished you can see the new metadata by using the Data Catalog UI. Type system=tableau and Search:

Google Cloud Data Catalog and Tableau integration: using the UI to search Tableau assets
Image 4. Using Data Catalog UI to search Tableau assets

The below picture shows an Entry and a Tag for a sample Sheet:

Tableau Sheet metadata through the Google Cloud Data Catalog UI
Image 5. Tableau Sheet metadata through the Data Catalog UI

And the SCRAPE-PREPARE-INGEST process is DONE!

The sample connector

All topics discussed in this article are covered in a sample connector, available on GitHub. Feel free to get it and run according to the instructions. Contributions are welcome, by the way!

It’s licensed under the Apache License Version 2.0, distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

The code snippets provided along with the text are based on the connector but simplified for readability reasons. You’ll notice actual code is instrumented with a bit more consistency checks, exception handling, and tackles not only create, but also update and delete operations for entries and tags. Required templates are created at the beginning of the ingestion phase.

Also, it’s split into two major components: datacatalog-connector-commons and datacatalog-connector-tableau. There are sample connectors for other non-GCP systems such as Looker, Hive, MySQL, Oracle, and others. The code they share, chiefly the ingestion phase-related classes, is managed in the datacatalog-connector-commons component. Ideally, by design, only this component should communicate with Data Catalog custom types API. On the other hand, all Tableau communication and specific metadata manipulation code reside on datacatalog-connector-tableau. The commons component knows nothing about external source systems as shown in the picture below:

Google Cloud Data Catalog and Tableau integration components
Image 6. Data Catalog and Tableau integration components

Tableau allows two sync strategies: Full and Incremental. Full sync happens when all assets belonging to a given instance are synced on a single execution. It’s recommended when ingesting metadata from that instance for the first time. Incremental (or partial) sync happens when only a subset of the assets are synchronized due to changes in their metadata. Tableau introduced Webhooks support in version 2019.4 and this technology is an incremental sync enabler.

Tests have shown that scraping 1,000 assets from Tableau, preparing them and ingesting into Data Catalog, takes about 25 minutes. An Entry and an enriching Tag are generated for each asset. Each Entry + Tag pair needs 4 API calls to be fully synced: 1 GET Entry, 1 CREATE or UPDATE Entry, 1 GET Tag, and finally 1 CREATE or UPDATE Tag. Data Catalog currently does not support batch API operations.

That’s all, folks!

References

--

--