Designing Secure Data Pipelines with VPC Service Controls

Chie Hayashida
Google Cloud - Community
5 min read · Sep 3, 2020


There are a lot of data analytics use cases on GCP these days, and one of the most common concerns about building a data analytics platform on GCP is security. Google Cloud offers many security products to protect customers’ data, and one of them is VPC Service Controls (VPC SC).

VPC SC provides an additional layer of security defense for Google Cloud services that is independent of Identity and Access Management (IAM). While IAM enables granular identity-based access control, VPC SC enables broader context-based perimeter security, including controlling data egress across the perimeter.

This blog post describes an example of how to build a data platform using Cloud Functions, Dataflow, Google Cloud Storage, and BigQuery with VPC Service Controls.

Example Architecture

Let’s start with a customer who wants to create a data pipeline from GCS to BigQuery with Cloud Dataflow. The Dataflow job is triggered by Cloud Functions once data is stored in GCS.

Example Architecture

Set up VPC Service Controls

In order for developers to view and edit resources protected by VPC Service Controls from the Google Cloud Console or the terminal, the perimeter has to include access level settings. The first step is to create an access policy and access levels to configure on VPC Service Controls.

After that, you will configure the VPC SC perimeter for your projects using the access policy and access levels you created.

To configure the access policy, access levels, and VPC SC, the Resource Manager Organization Viewer role is required.
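
If you need to grant this role, an organization administrator can do so with a command along the following lines; the organization ID and user are placeholders for illustration.

gcloud organizations add-iam-policy-binding <ORGANIZATION_ID> \
  --member="user:<user>" \
  --role="roles/resourcemanager.organizationViewer"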

Creating Access Policy and Access Levels

Access levels can be set for IP addresses, devices, users, and service accounts. In this project, we’ll create an access level which includes IP addresses, users, and service accounts. Setting access levels for users and service accounts is available only from the command line. Please see Creating an access policy.
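
If your organization does not have an access policy yet, one can also be created from the CLI. Here is a minimal sketch, with the organization ID and title as placeholders:

gcloud access-context-manager policies create \
  --organization <ORGANIZATION_ID> --title "my-access-policy"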

Here is how to create an access level with the CLI.

1. Create the following `CONDITIONS.yaml`:

- ipSubnetworks:
  - 252.0.2.0/24
  - 2001:db8::/32
- members:
  - serviceAccount:<serviceaccount>
  - user:<user>

2. Create an access level with the following command:

gcloud access-context-manager levels create <NAME> --title <TITLE> --basic-level-spec CONDITIONS.yaml --combine-function=OR --policy=<POLICY_NAME>

3. You can view the access level you created in the Google Cloud Console by selecting the organization that contains the project and navigating to Security -> Access Context Manager.
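
You can also check it from the CLI, for example:

gcloud access-context-manager levels list --policy=<POLICY_NAME>
gcloud access-context-manager levels describe <NAME> --policy=<POLICY_NAME>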

Note that you cannot see the code preview of the script in the Cloud Functions console view if you configure VPC SC with an access level that restricts access to users and service accounts only. To resolve this, add IP address or device access restrictions to the access level. For more information on how to create an access level, see Creating a basic access level.

Creating a Service Perimeter

Let’s create a VPC SC perimeter that protects the following services.

  • BigQuery API
  • Cloud Functions API
  • Google Cloud Dataflow API
  • Google Cloud Storage API

The detailed documentation for creating a VPC SC perimeter is here. You should use the access level you created under `Ingress Policies: Access Levels`.

VPC SC perimeter configuration sample
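
If you prefer the CLI over the console, a perimeter roughly equivalent to the sample above can be created with a command like the following; the perimeter name and project number are placeholders, and the access level is the one created earlier.

gcloud access-context-manager perimeters create <PERIMETER_NAME> \
  --title="my perimeter" \
  --resources=projects/<PROJECT_NUMBER> \
  --restricted-services=bigquery.googleapis.com,cloudfunctions.googleapis.com,dataflow.googleapis.com,storage.googleapis.com \
  --access-levels=<NAME> \
  --policy=<POLICY_NAME>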

While BigQuery and GCS can be used with the configuration above, Cloud Functions and Dataflow require additional settings. The following sections describe them.

Configuration for Cloud Functions with VPC SC

To use Cloud Functions with VPC Service Controls, you have to configure organization policies and Serverless VPC Access.

Set up Organization Policies

To use VPC SC with Cloud Functions, you have to set up the following organization policies.

Mandatory:

Optional:

To manage organization policies, you need the Organization Policy Administrator role.

Please see Using VPC Service Controls for more details.
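
The exact constraints are listed in that document. As a rough sketch, assuming the `constraints/cloudfunctions.requireVPCConnector` constraint as the mandatory one and `constraints/cloudfunctions.allowedIngressSettings` as an optional one, they can be set with gcloud like this (the project ID is a placeholder):

# Mandatory: require a Serverless VPC Access connector for functions in the project
gcloud resource-manager org-policies enable-enforce \
  constraints/cloudfunctions.requireVPCConnector --project=<PROJECT_ID>

# Optional: allow only internal traffic to reach functions
gcloud resource-manager org-policies allow \
  constraints/cloudfunctions.allowedIngressSettings ALLOW_INTERNAL_ONLY \
  --project=<PROJECT_ID>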

Set up Serverless VPC Access

Create a connector according to Creating a connector. Here is an example.

Connector configuration sample
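
The same connector can be created from the CLI. Here is a minimal sketch, with the connector name, network, region, and IP range as placeholders:

gcloud compute networks vpc-access connectors create <CONNECTOR_NAME> \
  --network default \
  --region asia-northeast1 \
  --range 10.8.0.0/28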

Update the Access Level after Deploying Cloud Functions

You have to add the Cloud Build service account to the access level you created, according to this manual. The first time you deploy Cloud Functions, the deployment will fail with an error because the Cloud Build service account is blocked by the perimeter.

You should add the Cloud Build service account to the `CONDITIONS.yaml` you created in the previous section and update the access level with the following command.

gcloud access-context-manager levels update <NAME> --title <TITLE> --basic-level-spec CONDITIONS.yaml --combine-function=OR --policy=<POLICY_NAME>

Custom Dataflow Templates with VPC SC

When running Dataflow templates with VPC SC, the worker instances must be created on a subnetwork with Private Google Access enabled. This section describes the configuration for this and the command-line arguments required for template staging and execution.

Set up Private Google Access

The subnetwork to be used by Dataflow worker instances must be configured with Private Google Access as described earlier. Private Google Access can be configured by an IAM user who has the project owner, editor, or network administrator role.
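
For example, Private Google Access can be enabled on an existing subnetwork with a command like this; the subnetwork name and region are placeholders.

gcloud compute networks subnets update <SUBNETWORK_NAME> \
  --region=asia-northeast1 \
  --enable-private-ip-google-access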

Stage Custom Dataflow Templates for VPC SC environment

To stage your template, you have to use a service account that is included in the access level you set on the VPC SC perimeter. Set the environment variable as below.

export GOOGLE_APPLICATION_CREDENTIALS=<credential path>

To ensure that the Dataflow worker nodes used during template staging use a subnetwork with Private Google Access configured, the `--subnetwork=<SUBNETWORK_NAME> --usePublicIps=false` options are necessary in the command-line arguments.

The entire command line will look like this.

mvn compile exec:java \
  -Dexec.mainClass=com.example.myclass \
  -Dexec.args="--runner=DataflowRunner \
  --project=YOUR_PROJECT_ID \
  --stagingLocation=gs://YOUR_BUCKET_NAME/staging \
  --templateLocation=gs://YOUR_BUCKET_NAME/templates/YOUR_TEMPLATE_NAME \
  --subnetwork=SUBNETWORK_NAME \
  --usePublicIps=false"

Execute Custom Dataflow templates with VPC SC

You need to specify subnetwork and ip_configuration as Dataflow API arguments in the script that is called by Cloud Functions.

A sample script in Python:

from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials


def run(event, context):
    # Triggered by a new object in the GCS bucket; `event` contains the object metadata.
    bucket = 'my_bucket'
    projectId = 'my_project'
    location = 'asia-northeast1'
    jobName = 'my-job'
    tableSpec = 'my_project:test_dataset.test_table'
    gcsPath = f'gs://{bucket}/templates/my-template'
    textFilePath = f'gs://{bucket}/' + event['name']

    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials)
    body = {
        "jobName": jobName,
        "parameters": {
            "textFilePath": textFilePath,
            "tableSpec": tableSpec
        },
        # Run the workers on the Private Google Access subnetwork without public IPs.
        "environment": {
            "subnetwork": "mysubnetwork",
            "ip_configuration": "WORKER_IP_PRIVATE"
        }
    }
    res = service.projects().locations().templates().launch(
        projectId=projectId,
        gcsPath=gcsPath,
        location=location,
        body=body,
    ).execute()
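
For completeness, a function running this script could be deployed with the Serverless VPC Access connector created earlier. This is a hedged sketch; the function name, runtime, bucket, connector, and region are placeholders and should be adjusted to your environment.

gcloud functions deploy launch-dataflow-template \
  --runtime python37 \
  --entry-point run \
  --trigger-resource my_bucket \
  --trigger-event google.storage.object.finalize \
  --vpc-connector <CONNECTOR_NAME> \
  --region asia-northeast1 \
  --ingress-settings internal-only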

Conclusion

In this blog post, I introduced how to build a secure data pipeline in GCP using VPC SC.
Enjoy a good Data engineer life with GCP!
