Google Cloud Data Catalog and Looker integration

Leverage Data Catalog to discover & annotate Looker assets

Ricardo Mendes
Google Cloud - Community


The Google Cloud Data Catalog team has recently announced the product is now GA and ready to accept custom (aka user-defined) entries! This brand-new feature opens the door to integrations: users can now leverage Data Catalog’s well-known potential to manage metadata from almost any kind of data asset.

To demonstrate how it works, I’ll share design thoughts and sample code for connecting Data Catalog to market-leading Business Intelligence/Data Visualization tools, covering Looker metadata integration in this blog post. These thoughts come from the experience of participating in the development of fully operational sample connectors, publicly available on GitHub.

Disclaimer: Google and/or Google Cloud do not officially support any tool to connect Data Catalog to non-GCP systems at the time of writing (April 2020). What you will find here is merely the result of my experience as a Data Catalog early adopter.

Technical requirements

Let’s get started with the minimum technical requirements to make the integration happen:

  1. from Looker: allow read-access to metadata belonging to assets hosted in a given server/instance;
  2. from Data Catalog: allow users to ingest metadata representing the Looker-managed assets;
  3. connector: a component that scrapes metadata from a Looker instance, prepares the information to fit Data Catalog’s entity model and ingests the prepared metadata.
Image 1. Architecture overview: integrating Data Catalog and Looker

Since 1 and 2 are already provided by Looker and Google Cloud, respectively, our focus has been on developing a software component that addresses the third requirement.

Adapting distinct concepts

One of the first steps in this kind of integration is to map business entities from the source to the target system. Five Looker types were chosen to illustrate this: Folder, Look, Dashboard, Dashboard Element (aka Tile), and Query.

Image 2. Data Catalog and Looker integration: adapting entity models

We need to adjust these types to fit the Data Catalog model, which is more generic. All Looker assets of these types will turn into entries in the catalog, since Entry is a first-class citizen there. They will be distinguishable by the new userSpecifiedType attribute added to the Entry class, as we’ll see next.

But Looker entities have many more attributes than catalog entries can support… In this case, we can use Tags to annotate them and avoid missing meaningful information. Tags are based on Templates, which means we will leverage three of Data Catalog’s main types (Entry, Tag, and Template) to achieve the goal. I assume you have at least a basic understanding of them, by the way. In case you don’t, please take a look at this blog post.

There’s a missing piece when it comes to entity associations and parent/child relationships: e.g., a folder contains multiple dashboards and looks; a dashboard contains multiple tiles; a given query can feed multiple looks and tiles. Data Catalog currently doesn’t provide native support for structuring such relationships in custom types. And, again, Tags can help us map them.

Looker API and Client SDK

Once the overall design decisions have been presented, it’s time to see some practical stuff. Looker has a rich REST API, and this article relies on its Folder, Dashboard, Look, and Query resources.

Browse through their reference pages and you’ll get a clear idea of what I mean by a rich API: there is plenty of information available for each asset.

It’s possible to access the API programmatically through pre-built client SDKs, available in the most popular languages, including Python and JavaScript. Both the sample connector and the code snippets presented here were written in Python.

The API client needs a configuration file, whose content is described in the SDK docs. Once you have the proper file, instantiating a client is as simple as:
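
The snippet below is a minimal sketch, assuming the looker_sdk Python package and a looker.ini configuration file in the working directory (init31 targets the 3.1 API version used throughout this article):

    import looker_sdk

    # Reads base_url, client_id, and client_secret from looker.ini
    # and authenticates against the API 3.1 endpoints.
    sdk = looker_sdk.init31('looker.ini')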

Sidenote: ApiAuth: API Authentication describes how to authenticate in the API. The document mentions client credentials required for the login, which must be obtained by creating an API3 key on a user account in the Looker Admin console. A shortcut for the Looker Admin console is https://<YOUR-LOOKER-INSTANCE>/admin/users/api3_key/<YOUR-USER-ID>.

Any API call can be wrapped by the client. As an example, let’s say you want to retrieve information about a given folder; the corresponding endpoint is documented at https://docs.looker.com/reference/api-and-integration/api-reference/v3.1/folder#get_folder. To perform the call programmatically, we do:
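
Here is a sketch; the folder id and the field names are illustrative:

    # Retrieves a single folder, limiting the response to three fields.
    folder = sdk.folder(folder_id='42', fields='id,name,parent_id')
    print(folder.name)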

Notice the fields argument: it allows us to specify which fields we want in the response, instead of retrieving all the information provided by the API.

Scrape Looker metadata

We’ve explored two strategies for scraping Looker metadata; let me refer to them as eager and lazy.

Eager scraping reads metadata from all assets of a given type using a single API call. It’s useful for reading common (usually user-managed) assets’ metadata. After reading all the metadata, the connector iterates through the results and uses attribute values to reconstruct relationships among the objects, e.g. appending/nesting an array of dashboards to their parent folder.

Lazy scraping, on the other hand, is useful when processing information on special (including system-managed) assets, such as the so-called lookml folder and its children, which are responsible for handling LookML stuff. It requires a series of subsequent API calls to retrieve nested assets’ information.

The next image illustrates the differences. From what we’ve seen so far, combining them is good practice, since each fills gaps left by the other.

Image 3. Eager vs. lazy Looker metadata scraping

Let me add a few words about coding decisions made when implementing eager scraping. There are two main options to retrieve lists of objects: sdk.all_*() and sdk.search_*(). The all_* methods return abbreviated objects, while the search_* methods are more flexible in the sense that you can specify the fields you want in the response. In summary:

  • sdk.all_folders() response does not include all the fields you may actually need, so I recommend using sdk.search_folders() instead. Also, empty folders are not included in the all_folders response;
  • sdk.all_dashboards() response does not include all the fields you may actually need, so I recommend using sdk.search_dashboards() instead. Please note that LookML dashboards are not included in the search_dashboards response and need lazy handling;
  • sdk.all_looks() response does not include all the fields you may actually need, so I recommend using sdk.search_looks() instead.

Overall, we prioritized the eager approach over the lazy one. Eager implies fewer API calls and more in-memory processing, hence it’s faster. Lazy scraping was used only to address corner cases that aren’t eagerly solvable: LookML dashboards, for instance, are not included in the sdk.search_dashboards() (eager) response, so they are processed lazily.
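
Below is a rough sketch of the eager flow; the response field names (folder_id in particular) are assumptions for illustration:

    # One API call per asset type...
    folders = sdk.search_folders(fields='id,name,parent_id')
    dashboards = sdk.search_dashboards(fields='id,title,folder_id')

    # ...then in-memory processing to rebuild the folder -> dashboards
    # relationship from the scraped attribute values.
    dashboards_by_folder = {}
    for dashboard in dashboards:
        dashboards_by_folder.setdefault(
            dashboard.folder_id, []).append(dashboard)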

Prepare the metadata for Data Catalog

At this point, we have all the information we need from Looker. It’s time to convert it into Data Catalog entities. Creating Entries seems to be a good starting point since everything else depends on them.
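
The snippet below is a simplified sketch of that conversion, assuming the google-cloud-datacatalog Python client (datacatalog_v1beta1 at the time of writing); the Look fields it reads and the entry id format are illustrative:

    from google.cloud.datacatalog_v1beta1 import types


    def make_entry_for_look(look):
        entry = types.Entry()
        entry.display_name = look.title
        entry.description = look.description or ''
        # Link back to the source asset in the Looker instance.
        entry.linked_resource = f'https://YOUR-LOOKER-INSTANCE/looks/{look.id}'
        # The full entry name embeds the project, location, and entry
        # group ids, plus an id for the entry itself.
        entry.name = ('projects/YOUR-PROJECT-ID/locations/YOUR-LOCATION-ID'
                      '/entryGroups/YOUR-ENTRY-GROUP-ID/entries/'
                      f'lkr_look_{look.id}')
        entry.user_specified_system = 'looker'
        entry.user_specified_type = 'look'
        return entry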

In the actual connector, methods prefixed with __ represent helper stuff, out of the scope of this discussion.

The above code is intended to be self-explanatory (even for those who don’t know Python…), but there are important points to pay attention to:

  • entry = types.Entry(): only Google Cloud data resources were managed by Data Catalog until now. Being more specific: only automatically synchronized entries (referring to BigQuery and Pub/Sub assets) were managed when the product was first launched in beta; some months later, users became able to manually create entries referring to GCS Filesets. From now on, we can create any kind of Entry;
  • YOUR-ENTRY-GROUP-ID: every Entry must belong to an Entry Group, created before the entries through a very simple API call, as we’ll see in the next section. For your reference, the automatically managed groups for BigQuery and Pub/Sub entries have the @bigquery and @pubsub ids. User-defined entries belong to user-defined groups (exactly the same approach used by GCS Filesets). Since the project and location ids are prepended to the entry group id, one can simply use looker at this point;
  • entry.user_specified_system = 'looker': a common value for all Looker-related entries;
  • entry.user_specified_type = 'look': used to distinguish Looker-related entry types. The following values can be used to fill this attribute in the scope of this discussion: dashboard, dashboard_element, folder, look, and query.

As I mentioned in the Adapting distinct concepts section, Entry’s attributes cover only a small set of Looker assets’ metadata. But we can leverage Data Catalog Tags to avoid losing valuable information. First, let’s take a look at a Tag Template suggestion:
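
The sketch below builds such a Template with the v1beta1 client; the field ids are suggestions, and all four fields are typed as strings for simplicity:

    from google.cloud.datacatalog_v1beta1 import enums, types

    template = types.TagTemplate()
    template.display_name = 'Looker Look Metadata'

    # Four extra "fields" to enrich Look-related entries.
    for field_id, display_name in (
            ('folder_name', 'Folder name'),
            ('query_id', 'Query Id'),
            ('query_entry', 'Data Catalog Entry for the Query'),
            ('excel_file_url', 'Excel File Url')):
        template.fields[field_id].display_name = display_name
        template.fields[field_id].type.primitive_type = \
            enums.FieldType.PrimitiveType.STRING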

A Tag created from this Template can be attached to each Look-related Entry. By doing so, we enrich the metadata available in Data Catalog as if we "added" four extra fields to such entries: Folder name, Query Id, Data Catalog Entry for the Query, and Excel File Url. Folder name, query id, and Excel file are typical technical metadata. Entry for the query can be filled with an HTML link referring to the corresponding Query Entry in Data Catalog’s web console. This link can be generated during the prepare phase by a quite simple algorithm.
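
A hypothetical version of that algorithm follows; the console URL pattern is an assumption made for illustration purposes, not a documented format:

    # Hypothetical helper: builds an HTML link pointing at a Query
    # Entry's page in the Data Catalog web console.
    def make_query_entry_link(query_entry_name):
        # The URL pattern below is an assumption, not a documented format.
        url = f'https://console.cloud.google.com/datacatalog/{query_entry_name}'
        return f'<a href="{url}">{query_entry_name}</a>'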

The same approach applies when enriching metadata for other asset types. There’s an interesting example regarding queries: Looker allows us to get the SQL statement for any query through the Run Query method (passing sql as the result_format argument). So the SQL statement turns into an attribute in Query-related Tags.
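
For instance (the query id is illustrative):

    # Returns the SQL statement behind the query as plain text.
    sql_statement = sdk.run_query(query_id=123, result_format='sql')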

Metadata ingestion

This is the last stage of the integration process, where the Entries and Tags are ingested into Data Catalog. As mentioned in the previous section, every Entry must belong to an Entry Group, so let’s start by creating it:
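
A sketch with the v1beta1 client, using looker as the entry group id:

    from google.cloud import datacatalog_v1beta1

    datacatalog = datacatalog_v1beta1.DataCatalogClient()

    # The project and location ids are prepended to the entry group id,
    # so 'looker' is enough to identify the group.
    entry_group = datacatalog.create_entry_group(
        parent='projects/YOUR-PROJECT-ID/locations/YOUR-LOCATION-ID',
        entry_group_id='looker',
        entry_group=datacatalog_v1beta1.types.EntryGroup())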

Then we can iterate through the entries and ask Data Catalog to persist them and their Tags as well:
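
In simplified form (the prepared_assets collection and its tuples are illustrative):

    # entry_id, entry, and tag come from the prepare phase.
    for entry_id, entry, tag in prepared_assets:
        persisted_entry = datacatalog.create_entry(
            parent=entry_group.name, entry_id=entry_id, entry=entry)
        datacatalog.create_tag(parent=persisted_entry.name, tag=tag)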

Once the ingestion process is finished, you can see the new metadata using the Data Catalog UI: type system=looker and hit Search:

Image 4. Using the Data Catalog UI to search Looker assets

The picture below (screenshot manipulated to improve readability) shows an Entry and a Tag for a sample Query:

Image 5. Looker Query metadata through the Data Catalog UI

And the SCRAPE-PREPARE-INGEST process is DONE!

The sample connector

All topics discussed in this article are covered by a sample connector, available on GitHub. Feel free to get it and run it according to the instructions. Contributions are welcome, by the way!

It’s licensed under the Apache License Version 2.0, distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

The code snippets provided along with the text are based on the connector, but simplified for readability. You’ll notice the actual code is instrumented with a few more consistency checks and exception handling, and tackles not only create, but also update and delete operations for entries and tags. The required templates are created at the beginning of the ingestion phase.

Also, it’s split into two major components: datacatalog-connector-commons and datacatalog-connector-looker. There are sample connectors for other non-GCP systems, such as Tableau, Hive, MySQL, Oracle, and others. The code they share, chiefly the ingestion-phase classes, is managed in the datacatalog-connector-commons component. By design, only this component should communicate with the Data Catalog custom types API. On the other hand, all Looker communication and specific metadata manipulation code resides in datacatalog-connector-looker. The commons component knows nothing about external source systems, as shown in the picture below:

Image 6. Data Catalog and Looker integration components

Since Looker currently doesn’t offer webhooks or a similar mechanism that could support incremental synchronization, all assets belonging to a given server are synced on each execution.

Tests have shown that scraping 1,000 assets from Looker, preparing, and ingesting them into Data Catalog, takes about 25 minutes. An Entry and an enriching Tag are generated for each asset. Each Entry + Tag pair needs 4 API calls to be fully synced: 1 GET Entry, 1 CREATE or UPDATE Entry, 1 GET Tag, and finally 1 CREATE or UPDATE Tag. Data Catalog currently does not support batch API operations.

That’s all, folks!
