Getting Started with Kerberized Dataproc Clusters with Cross-Realm Trust
In this post, we walk through the architecture and deployment of multiple Dataproc clusters with Kerberos that interoperate using cross-realm trust. The Terraform module and scripts for the deployment described in this post are available in the GitHub repo kerberized_data_lake.
Overview
The architecture below is one of several approaches to deploying Dataproc with Kerberos. In this architecture, we leverage Dataproc’s Security Configuration to create an on-cluster KDC and manage the cluster’s service principals and keytabs, as required for Kerberizing a Hadoop cluster.
With this architecture, the Dataproc clusters must establish trust so that authenticated users can run jobs on the analytics clusters as well as access the central Hive metastore. One-way trust configured across the Kerberos realms gives authenticated users seamless access to the required Hadoop services. The three key components in the cross-realm trust are:
Corporate directory service (Active Directory, MIT KDC, FreeIPA)
- Single Master Dataproc cluster with a KDC for managing accounts for users & teams
Analytics Dataproc clusters
- Cluster(s) for users to execute Spark, MapReduce, Hive/Beeline, Presto, etc.
- One-way trust with corporate domain to verify end users
Hive Metastore Dataproc cluster
- Central metastore for databases and tables
- One-way trust with corporate domain to verify end user access to metastore
- One-way trust with the analytics clusters’ domain to verify the HiveServer2 service principal for HiveServer2 impersonation
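To make the realm layout concrete, here is a rough sketch of what the krb5.conf on an analytics node would contain once the trusts are configured. This fragment is illustrative only (it is not taken from the repo); the realm names come from this post, while the KDC hostnames are assumptions based on the cluster names.

```ini
[realms]
  ANALYTICS.FOO.COM = {
    kdc = analytics-cluster-m         ; local on-cluster KDC (assumed hostname)
  }
  FOO.COM = {
    kdc = kdc-cluster-m               ; corporate KDC (assumed hostname)
  }
  HIVE-METASTORE.FOO.COM = {
    kdc = metastore-cluster-m         ; metastore KDC (assumed hostname)
  }
```

With all three realms listed, a client in ANALYTICS.FOO.COM knows which KDC to contact when it follows a cross-realm referral into FOO.COM or HIVE-METASTORE.FOO.COM.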
Deployment
The kerberized_data_lake Git repository provides a Terragrunt/Terraform deployment of the Kerberos architecture described above. The deployment requires multiple GCP products, including Dataproc, Cloud KMS, Cloud Storage, and VPC, and can be deployed in a sandbox environment to review all components and services used.
As a prerequisite, the Dataproc clusters are configured with no external IPs, so the subnetwork must have Private Google Access enabled.
1. Set environment variables (modify as needed):
export PROJECT=$(gcloud info --format='value(config.project)')
export ZONE=$(gcloud info --format='value(config.properties.compute.zone)')
export REGION=${ZONE%-*}
export DATALAKE_ADMIN=$(gcloud config list --format='value(core.account)')
export DATAPROC_SUBNET=default
2. Create terraform variables file:
cat > kerb_datalake.tfvars << EOL
project="${PROJECT}"
region="${REGION}"
zone="${ZONE}"
data_lake_super_admin="${DATALAKE_ADMIN}"
dataproc_subnet="${DATAPROC_SUBNET}"
users=["bob", "alice"]
tenants=["core-data"]
EOL
Configs note:
- project, region, zone - the project, region, and zone for the deployment
- data_lake_super_admin - the IAM admin of the resources being deployed (KMS key, GCS buckets, etc.)
- dataproc_subnet - the subnet of the Dataproc clusters
3. Review kerb_datalake.tfvars and run the deployment:
terraform workspace new ${PROJECT}
terraform init
terraform apply -var-file kerb_datalake.tfvars
Setup Details
In this example, we use an MIT KDC to implement the corporate domain, managing user principals in a single KDC. To verify the deployment, we create three users: [email protected], [email protected], and an application service account [email protected] that does not require human interaction (i.e. a password prompt) for authentication when executing jobs.
There are three clusters, each with its own Kerberos realm, since each cluster deploys its own KDC. FOO.COM represents the corporate realm, while the others are specific to the data lake. It is important to note that while we create Dataproc clusters for all services, such as the standalone central KDC, we configure each cluster to support only its specific service rather than as a traditional cluster for data processing. Below are descriptions of the three clusters and their respective Kerberos realms:
cluster : kdc-cluster
realm : FOO.COM
desc : corporate directory service / domain controller where user/service accounts are created

cluster : metastore-cluster
realm : HIVE-METASTORE.FOO.COM
desc : master-only nodes that host the Hive metastore catalog and provide the default Hive connection for all analytics clusters

cluster : analytics-cluster
realm : ANALYTICS.FOO.COM
desc : multi-tenant cluster(s) for data processing; users kinit as [email protected] and are authenticated to access resources and execute jobs
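For reference, an on-cluster KDC is enabled at cluster creation time through Dataproc's Kerberos flags. The repo drives this through Terraform rather than gcloud, so the following is only a sketch of the equivalent gcloud command; the key ring path and flag values are placeholders, not values from the repo.

```shell
# Sketch only: equivalent gcloud invocation for a Kerberized cluster.
# The Terraform module in the repo sets these properties; values here are
# placeholders (PROJECT, REGION, DATAPROC_SUBNET from the earlier env setup).
gcloud dataproc clusters create analytics-cluster \
  --region "${REGION}" \
  --subnet "${DATAPROC_SUBNET}" \
  --no-address \
  --kerberos-root-principal-password-uri \
    "gs://${PROJECT}-dataproc-secrets/analytics-cluster_principal.encrypted" \
  --kerberos-kms-key "projects/${PROJECT}/locations/${REGION}/keyRings/..."
```

The two Kerberos flags are what tie the cluster to the encrypted root-principal secret in GCS and the KMS key used to decrypt it.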
Secrets for KDCs
A Dataproc Kerberos deployment requires a secret for the on-cluster KDC and additional secrets for trust:
1) the encrypted secret for the KDC root principal
2) the encrypted secret for establishing trust to a remote KDC
Terraform generates random secrets, encrypts them locally, and pushes them to a dedicated GCS secrets bucket accessible by the clusters. On creation of a cluster, the setup takes the GCS URI of the encrypted secret, streams in its contents, decrypts it using the KMS key, and applies the necessary configuration without storing the secret on the cluster.
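That flow can be sketched with gcloud and gsutil. This is not the repo's actual script; the key and keyring names (dataproc-key, dataproc-keyring) are assumed placeholders. The important property is the last pipeline: the ciphertext is streamed from GCS and decrypted to stdout, so the plaintext never lands on the cluster's disk.

```shell
# On the deployment machine (sketch; key/keyring names are placeholders):
openssl rand -base64 32 > secret.txt
gcloud kms encrypt \
  --key dataproc-key --keyring dataproc-keyring --location "${REGION}" \
  --plaintext-file secret.txt --ciphertext-file secret.encrypted
gsutil cp secret.encrypted \
  "gs://${PROJECT}-dataproc-secrets/kdc-cluster_principal.encrypted"
rm secret.txt

# On the cluster (sketch): stream from GCS, decrypt to stdout, never to disk.
gsutil cat "gs://${PROJECT}-dataproc-secrets/kdc-cluster_principal.encrypted" \
  | gcloud kms decrypt \
      --key dataproc-key --keyring dataproc-keyring --location "${REGION}" \
      --ciphertext-file - --plaintext-file -
```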
Secrets for Kerberos Principals
In addition to the secrets used to set up the on-cluster KDCs, we create user principals for [email protected] and [email protected], and service principals for the application service account [email protected]. These are created in the kdc-cluster, and the (intentionally insecure) passwords for the users are simply the first 4 characters of the username followed by 123 (e.g. alice → alic123).
Since [email protected] principal is an application service, we generate a keytab for this account and make that available on the master node of analytics-cluster. The credential is secured in /etc/security/keytab/core-data-svc.keytab with ownership of core-data-svc:core-data-svc and permissions 400. With this keytab available, the application service can authenticate using the keytab to execute jobs.
As this is for demonstration purposes only, we do not use an edge node, but use the master node for launching jobs for simplicity.
MIT KDC — FOO.COM Realm
The first cluster deployed is the MIT KDC. It doesn’t require setting up trust to other clusters; instead, the other clusters establish trust with it as they are created. Establishing trust with this KDC allows the other clusters to trust the user principals requesting access to Hadoop on the data processing clusters. The dependencies stored in GCS for this cluster are:
gs://{var.project}-dataproc-secrets/
kdc-cluster_principal.encrypted (kdc secret)
gs://{var.project}-dataproc-scripts/init-actions/
create-users.sh (setup test users)
Metastore Cluster — HIVE-METASTORE.FOO.COM Realm
This cluster deploys a centralized Hive Metastore for managing shared metadata across several clusters. It requires a one-way trust with FOO.COM to allow authenticated users access to metastore resources (e.g. a Spark SQL job running as [email protected] on analytics-cluster querying tables) and is set up in two steps:
1) Add trust on local metastore-cluster (performed automatically)
2) Add trust on remote kdc-cluster (initialization setup-kerberos-trust.sh)
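Under the hood, a one-way trust of this shape is established by creating the same cross-realm krbtgt principal, with an identical password, in both KDC databases. The following is a sketch of what setup-kerberos-trust.sh effectively does for the metastore trust, not its actual contents; the principal names come from this post, and the password handling is simplified to a placeholder.

```shell
# Sketch: for FOO.COM users to obtain service tickets in
# HIVE-METASTORE.FOO.COM, the cross-realm principal
# krbtgt/HIVE-METASTORE.FOO.COM@FOO.COM must exist in BOTH KDC databases
# with the same password (<trust-secret> is a placeholder).

# On metastore-cluster's KDC (local side, performed automatically):
kadmin.local -q "addprinc -pw <trust-secret> krbtgt/HIVE-METASTORE.FOO.COM@FOO.COM"

# On kdc-cluster's KDC (remote side, done by the init action):
kadmin.local -q "addprinc -pw <trust-secret> krbtgt/HIVE-METASTORE.FOO.COM@FOO.COM"
```

Because only this one direction of krbtgt principal is created, the trust is one-way: FOO.COM users can reach the metastore realm, but not the reverse.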
Below are dependencies for the metastore cluster:
gs://{var.project}-dataproc-secrets/
metastore-cluster_principal.encrypted (kdc secret)
trust_metastore-cluster_kdc-cluster_principal.encrypted (trust w/ foo secret)
gs://{var.project}-dataproc-scripts/init-actions/
setup-kerberos-config.sh (updates krb5.conf)
setup-kerberos-trust.sh (setup trust on remote kdc)
setup-users-config.sh (setup test users)
disable-history-server.sh (disable unneeded services)
gs://{var.project}-dataproc-scripts/shutdown-scripts/
shutdown-cleanup-trust.sh (remove remote trust)
Analytics Cluster — ANALYTICS.FOO.COM Realm
The last cluster is the end-user cluster used for data processing by multiple tenants. This cluster requires trust established with FOO.COM, as well as a reverse trust with the metastore (metastore trusts analytics) to permit the HiveServer2 principal access to the metastore. The setup steps are as follows:
1) Add trust on local analytics-cluster (performed automatically)
2) Add trust on remote kdc-cluster (initialization setup-kerberos-trust.sh)
3) Add reverse trust on local analytics-cluster (initialization setup-kerberos-trust.sh)
4) Add reverse trust on remote metastore-cluster (initialization setup-kerberos-trust.sh)
Below are dependencies for the analytics cluster:
gs://{var.project}-dataproc-secrets/
analytics-cluster_principal.encrypted (kdc secret)
trust_analytics-cluster_kdc-cluster_principal.encrypted (trust w/ foo secret)
trust_metastore-cluster_analytics-cluster_principal.encrypted (metastore trusts analytics secret)
gs://{var.project}-dataproc-scripts/init-actions/
setup-kerberos-config.sh (updates krb5.conf)
setup-kerberos-trust.sh (setup trust on remote kdc)
setup-users-config.sh (setup test users)
disable-history-server.sh (disable unneeded services)
gs://{var.project}-dataproc-scripts/shutdown-scripts/
shutdown-cleanup-trust.sh (remove remote trust)
Verify Kerberos Deployment
SSH & Kinit
Log in, authenticate as alice, and run Hadoop commands on analytics-cluster.
$ gcloud compute ssh alice@analytics-cluster-m --tunnel-through-iap
$ kinit [email protected] # remember insecure pwd alic123
Password for [email protected]:
$ klist
$ hadoop fs -ls /user/
MapReduce Test
Execute a job on the Kerberized cluster and persist the output to the data lake bucket.
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen -Dmapreduce.job.maps=4 10000000 gs://jh-data-sandbox-us-data-lake/test.db/test
Beeline Test
This tests alice’s authentication to HiveServer2 and HiveServer2’s requests to the Hive Metastore.
$ beeline -u "jdbc:hive2://localhost:10000/default;principal=hive/[email protected]"
jdbc:hive2://localhost:10000/default> create database test_db location 'gs://jh-data-sandbox-us-data-lake/test.db';
jdbc:hive2://localhost:10000/default> show databases;
jdbc:hive2://localhost:10000/default> use test_db;
jdbc:hive2://localhost:10000/default> create external table test (one string) location 'gs://jh-data-sandbox-us-data-lake/test.db/test/';
jdbc:hive2://localhost:10000/default> describe formatted test;
jdbc:hive2://localhost:10000/default> select count(1) from test;
jdbc:hive2://localhost:10000/default> set hive.metastore.uris;
jdbc:hive2://localhost:10000/default> !q
Spark and Metastore Test
$ spark-shell
scala> spark.sql("select count(1) from test_db.test").show(false)
scala> spark.catalog.listDatabases.show(false)
scala> spark.sql("use test_db")
scala> spark.sql("describe formatted test").show(false)
scala> println(spark.sparkContext.hadoopConfiguration.get("hive.metastore.uris"))
scala> :q
Application Service Account Test [email protected]
Kinit using the keytab, then run a Spark job on the analytics cluster that accesses the metastore.
$ gcloud compute ssh core-data-svc@analytics-cluster-m --tunnel-through-iap
$ kinit -kt /etc/security/keytab/core-data-svc.keytab [email protected]
$ spark-shell <<< 'spark.sql("select count(1) from test_db.test").show(false)'
Summary
In this blog, we covered the steps for deploying Dataproc clusters with Kerberos and setting up one-way trust between KDCs for interoperability. We then verified the deployment by running jobs that not only executed on the local Kerberized Dataproc cluster, but also validated authenticated test users against a remote Hive Metastore. Lastly, we walked through verification for automated jobs by authenticating with a keytab. Hopefully these steps provide a good foundation for understanding Kerberos on Dataproc and how to configure multiple clusters for your data lake.