ETL of Oracle data to Google BigQuery using Google Cloud Dataflow and Dataprep

Paulo Amaral
Google Cloud - Community
4 min read · Apr 5, 2018


ETL made easy

Although an ETL task can be pretty challenging when it comes to loading Big Data, there are still scenarios in which you can load gigabytes or terabytes of data from Oracle into BigQuery (Google Cloud's analytics data warehouse) relatively easily and very effectively.

Data preparation and transformation consume nearly 80% of a Data Scientist's time.

In my case, the diagram below is based on a cloud architecture we created for our customer, which uses machine learning libraries such as pandas, NumPy, and others for Data Science. Our customer's database table was imported into BigQuery for further data preparation and transformation using Google Cloud Dataprep.

  • I would like to mention the excellent work of Trifacta, the company that developed Dataprep together with Google.

This tool is really useful and time-saving; let's just say it made our Data Scientists smile.

Data pipeline

Steps:

1. Export Oracle database table to CSV.

You can also use a third-party ETL tool to connect your database directly to BigQuery, as described in this link, eliminating some of the steps described below.
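If you prefer to script this step, here is a minimal sketch using the python-oracledb driver (the successor to cx_Oracle); the credentials, DSN, table name, and output file are all hypothetical placeholders.

import csv
import oracledb

# Hypothetical connection details -- replace with your own.
conn = oracledb.connect(user="scott", password="tiger",
                        dsn="dbhost.example.com/orclpdb")

with conn.cursor() as cur, open("table_export.csv", "w", newline="") as f:
    cur.execute("SELECT * FROM my_table")  # hypothetical table name
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row
    for row in cur:
        writer.writerow(row)  # stream each row straight to the CSV

conn.close()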

2. Upload the file to a Cloud Storage bucket.

GCP Cloud Storage bucket
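As a scriptable alternative to uploading through the console, here is a short sketch using the google-cloud-storage client library; the bucket name and object path are hypothetical.

from google.cloud import storage

client = storage.Client()  # picks up your default GCP credentials
bucket = client.bucket("my-etl-bucket")            # hypothetical bucket
blob = bucket.blob("oracle-exports/table_export.csv")
blob.upload_from_filename("table_export.csv")      # local file from step 1
print("Uploaded to gs://" + bucket.name + "/" + blob.name)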

3. Create a dataset and import CSV data into Google Cloud Dataprep

Select Dataset and import data from Cloud Storage:

Dataprep — Select CSV and create a dataset

Choose the Cloud Storage bucket and select the file to import:

4. Add the dataset into a flow to run a job on Google Cloud Dataflow

Add new recipe

Edit recipe

5. Clean Data and set job to run

I really like that it integrates easily with all the other tools on Google Cloud Platform. You can save a lot of time on data cleaning, and Data Scientists can focus on business value.

I will not cover Dataprep in much detail because there are other great posts you can use as a reference. ;)

6. Edit Job

You can also schedule the job to run periodically; Dataprep offers this option.

7. Create a new Table on BigQuery
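In this walkthrough the table is created through Dataprep's publishing settings, but for reference, here is a hypothetical sketch of creating an equivalent table with the google-cloud-bigquery client library; the project, dataset, and schema are placeholders.

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.BCM.AI"  # hypothetical project.dataset.table
schema = [
    bigquery.SchemaField("id", "INTEGER"),
    bigquery.SchemaField("description", "STRING"),
]
table = client.create_table(bigquery.Table(table_id, schema=schema))
print("Created table " + table.full_table_id)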

8. Save Settings and Run Job

9. View job running on Dataflow

Google Cloud Dataflow Job Detail

10. Check results

Dataprep is as important to data analysis as a pre-flight checklist is to a pilot.

11. View BigQuery table

Example of a SELECT searching for a string (38.5M rows).

#standardSQL
SELECT *
FROM `project.BCM.AI` t
WHERE REGEXP_CONTAINS(LOWER(TO_JSON_STRING(t)), r'caixa marcha')

  • Query complete (16.0s elapsed, 4.78 GB processed)
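If you want to run the same search from Python instead of the BigQuery console, here is a minimal sketch using the google-cloud-bigquery client library; the table name is taken from the query above.

from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT *
FROM `project.BCM.AI` t
WHERE REGEXP_CONTAINS(LOWER(TO_JSON_STRING(t)), r'caixa marcha')
"""
for row in client.query(sql).result():  # submits the job and waits
    print(dict(row))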

Next Steps:

  • Explore BigQuery with GCP Datalab (notebooks for Data Science).
Datalab
