Getting Started with Kerberized Dataproc Clusters with Cross-Realm Trust
In this post, we walk through the architecture and deployment of multiple Dataproc clusters with Kerberos that interoperate using cross-realm trust. The Terraform module and scripts for the deployment described in this post are available in the GitHub repo kerberized_data_lake.
Overview
The architecture below is one of several approaches to deploying Dataproc with Kerberos. In this architecture, we leverage Dataproc’s Security Configuration to create an on-cluster KDC and manage the cluster’s service principals and keytabs, as required for Kerberizing a Hadoop cluster.
With this architecture, the Dataproc clusters must establish trust so that authenticated users can run jobs on the analytics clusters as well as access the central Hive metastore. One-way trust configured across the Kerberos realms gives authenticated users seamless access to the required Hadoop services. The three key components in the cross-realm trust are:
Corporate directory service (Active Directory, MIT KDC, FreeIPA)
- Single Master Dataproc cluster with a KDC for managing accounts for users & teams
Analytics Dataproc clusters
- Cluster(s) for users to execute Spark, MapReduce, Hive/Beeline, Presto, etc.
- One-way trust with corporate domain to verify end users
Hive Metastore Dataproc cluster
- Central metastore for databases and tables
- One-way trust with corporate domain to verify end user access to metastore
- One-way trust with the analytics clusters’ domain to verify the HiveServer2 service principal for HiveServer2 impersonation
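To make the realm layout concrete, here is a rough sketch of what the krb5.conf on an analytics node would contain once the trusts are configured. This fragment is illustrative only (it is not taken from the repo); the realm names come from this post, while the KDC hostnames are assumptions based on the cluster names.

```ini
[realms]
  ANALYTICS.FOO.COM = {
    kdc = analytics-cluster-m         ; local on-cluster KDC (assumed hostname)
  }
  FOO.COM = {
    kdc = kdc-cluster-m               ; corporate KDC (assumed hostname)
  }
  HIVE-METASTORE.FOO.COM = {
    kdc = metastore-cluster-m         ; metastore KDC (assumed hostname)
  }
```

With all three realms listed, a client in ANALYTICS.FOO.COM knows which KDC to contact when it follows a cross-realm referral into FOO.COM or HIVE-METASTORE.FOO.COM.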
Deployment
The kerberized_data_lake Git repository provides a Terragrunt/Terraform deployment of the Kerberos architecture described above. The deployment requires multiple GCP products, including Dataproc, Cloud KMS, Cloud Storage, and VPC, and can be deployed in a sandbox environment to review all components and services used.
As a prerequisite, the Dataproc clusters are configured with no external IPs, so the subnetwork must have Private Google Access enabled.
1. Set environment variables (modify as needed):
export PROJECT=$(gcloud info --format='value(config.project)')
export ZONE=$(gcloud info --format='value(config.properties.compute.zone)')
export REGION=${ZONE%-*}
export DATALAKE_ADMIN=$(gcloud config list --format='value(core.account)')
export DATAPROC_SUBNET=default
2. Create terraform variables file:
cat > kerb_datalake.tfvars << EOL
project="${PROJECT}"
region="${REGION}"
zone="${ZONE}"
data_lake_super_admin="${DATALAKE_ADMIN}"
dataproc_subnet="${DATAPROC_SUBNET}"
users=["bob", "alice"]
tenants=["core-data"]
EOL
Configs note:
- project, region, zone - the project, region, and zone for the deployment
- data_lake_super_admin - the IAM admin of the resources being deployed (KMS key, GCS buckets, etc.)
- dataproc_subnet - the subnet of the Dataproc clusters
3. Review kerb_datalake.tfvars and run the deployment:
terraform workspace new ${PROJECT}
terraform init
terraform apply -var-file kerb_datalake.tfvars
Setup Details
In this example, we use an MIT KDC to implement the corporate domain, managing user principals in a single KDC. To verify the deployment, we create three users: [email protected], [email protected], and an application service account [email protected] that does not require human interaction (i.e. a password prompt) for authentication when executing jobs.
There are three clusters, each with its own Kerberos realm, since each cluster deploys its own KDC. FOO.COM represents the corporate realm, while the others are specific to the data lake. It is important to note that while we create Dataproc clusters for all services, such as the standalone central KDC, we configure each cluster to support only its specific service rather than as a traditional cluster for data processing. Below are descriptions of the three clusters and their respective Kerberos realms:
cluster : kdc-cluster
realm : FOO.COM
desc : corporate directory service / domain controller where user/service accounts are created

cluster : metastore-cluster
realm : HIVE-METASTORE.FOO.COM
desc : master-only nodes that host the Hive metastore catalog and provide the default Hive connection for all analytics clusters

cluster : analytics-cluster
realm : ANALYTICS.FOO.COM
desc : multi-tenant cluster(s) for data processing; users kinit as [email protected] and are authenticated to access resources and execute jobs
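For reference, an on-cluster KDC is enabled at cluster creation time through Dataproc's Kerberos flags. The repo drives this through Terraform rather than gcloud, so the following is only a sketch of the equivalent gcloud command; the key ring path and flag values are placeholders, not values from the repo.

```shell
# Sketch only: equivalent gcloud invocation for a Kerberized cluster.
# The Terraform module in the repo sets these properties; values here are
# placeholders (PROJECT, REGION, DATAPROC_SUBNET from the earlier env setup).
gcloud dataproc clusters create analytics-cluster \
  --region "${REGION}" \
  --subnet "${DATAPROC_SUBNET}" \
  --no-address \
  --kerberos-root-principal-password-uri \
    "gs://${PROJECT}-dataproc-secrets/analytics-cluster_principal.encrypted" \
  --kerberos-kms-key "projects/${PROJECT}/locations/${REGION}/keyRings/..."
```

The two Kerberos flags are what tie the cluster to the encrypted root-principal secret in GCS and the KMS key used to decrypt it.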
Secrets for KDCs
A Dataproc Kerberos deployment requires a secret for the on-cluster KDC and additional secrets for trust:
1) the encrypted secret for the KDC root principal
2) the encrypted secret for establishing trust to a remote KDC
Terraform generates random secrets, encrypts them locally, and pushes them to a dedicated GCS secrets bucket accessible by the clusters. On creation of a cluster, the setup takes the GCS URI of the encrypted secret, streams in its contents, decrypts it using the KMS key, and applies the necessary configuration without storing the secret on the cluster.
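That flow can be sketched with gcloud and gsutil. This is not the repo's actual script; the key and keyring names (dataproc-key, dataproc-keyring) are assumed placeholders. The important property is the last pipeline: the ciphertext is streamed from GCS and decrypted to stdout, so the plaintext never lands on the cluster's disk.

```shell
# On the deployment machine (sketch; key/keyring names are placeholders):
openssl rand -base64 32 > secret.txt
gcloud kms encrypt \
  --key dataproc-key --keyring dataproc-keyring --location "${REGION}" \
  --plaintext-file secret.txt --ciphertext-file secret.encrypted
gsutil cp secret.encrypted \
  "gs://${PROJECT}-dataproc-secrets/kdc-cluster_principal.encrypted"
rm secret.txt

# On the cluster (sketch): stream from GCS, decrypt to stdout, never to disk.
gsutil cat "gs://${PROJECT}-dataproc-secrets/kdc-cluster_principal.encrypted" \
  | gcloud kms decrypt \
      --key dataproc-key --keyring dataproc-keyring --location "${REGION}" \
      --ciphertext-file - --plaintext-file -
```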
Secrets for Kerberos Principals
In addition to the secrets used to set up the on-cluster KDCs, we create user principals for [email protected] and [email protected], and service principals for the application service account [email protected]. These are created in the kdc-cluster, and the (intentionally insecure) passwords for the users are simply the first 4 characters of the username followed by 123 (e.g. alice → alic123).
Since [email protected] principal is an application service, we generate a keytab for this account and make that available on the master node of analytics-cluster. The credential is secured in /etc/security/keytab/core-data-svc.keytab with ownership of core-data-svc:core-data-svc and permissions 400. With this keytab available, the application service can authenticate using the keytab to execute jobs.
As this is for demonstration purposes only, we do not use an edge node, but use the master node for launching jobs for simplicity.
MIT KDC — FOO.COM Realm
The first cluster deployed is the MIT KDC. It doesn’t require setting up trust to other clusters; instead, the other clusters establish trust with it as they are created. Establishing trust with this KDC allows the other clusters to trust the user principals requesting access to Hadoop on the data processing clusters. The dependencies stored in GCS for this cluster are:
gs://{var.project}-dataproc-secrets/
kdc-cluster_principal.encrypted (kdc secret)
gs://{var.project}-dataproc-scripts/init-actions/
create-users.sh (setup test users)
Metastore Cluster — HIVE-METASTORE.FOO.COM Realm
This cluster deploys a centralized Hive Metastore for managing shared metadata across several clusters. It requires a one-way trust with FOO.COM to allow authenticated users access to metastore resources (e.g. a Spark SQL job running as [email protected] on analytics-cluster querying tables) and is set up in two steps:
1) Add trust on local metastore-cluster (performed automatically)
2) Add trust on remote kdc-cluster (initialization setup-kerberos-trust.sh)
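Under the hood, a one-way trust of this shape is established by creating the same cross-realm krbtgt principal, with an identical password, in both KDC databases. The following is a sketch of what setup-kerberos-trust.sh effectively does for the metastore trust, not its actual contents; the principal names come from this post, and the password handling is simplified to a placeholder.

```shell
# Sketch: for FOO.COM users to obtain service tickets in
# HIVE-METASTORE.FOO.COM, the cross-realm principal
# krbtgt/HIVE-METASTORE.FOO.COM@FOO.COM must exist in BOTH KDC databases
# with the same password (<trust-secret> is a placeholder).

# On metastore-cluster's KDC (local side, performed automatically):
kadmin.local -q "addprinc -pw <trust-secret> krbtgt/HIVE-METASTORE.FOO.COM@FOO.COM"

# On kdc-cluster's KDC (remote side, done by the init action):
kadmin.local -q "addprinc -pw <trust-secret> krbtgt/HIVE-METASTORE.FOO.COM@FOO.COM"
```

Because only this one direction of krbtgt principal is created, the trust is one-way: FOO.COM users can reach the metastore realm, but not the reverse.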
Below are dependencies for the metastore cluster:
gs://{var.project}-dataproc-secrets/
metastore-cluster_principal.encrypted (kdc secret)
trust_metastore-cluster_kdc-cluster_principal.encrypted (trust w/ foo secret)
gs://{var.project}-dataproc-scripts/init-actions/
setup-kerberos-config.sh (updates krb5.conf)
setup-kerberos-trust.sh (setup trust on remote kdc)
setup-users-config.sh (setup test users)
disable-history-server.sh (disable unneeded services)
gs://{var.project}-dataproc-scripts/shutdown-scripts/
shutdown-cleanup-trust.sh (remove remote trust)
Analytics Cluster — ANALYTICS.FOO.COM Realm
The last cluster is the end-user cluster used for data processing by multiple tenants. This cluster requires trust established with FOO.COM, as well as a reverse trust with the metastore (metastore trusts analytics) to permit the HiveServer2 principal access to the metastore. The setup steps are as follows:
1) Add trust on local analytics-cluster (performed automatically)
2) Add trust on remote kdc-cluster (initialization setup-kerberos-trust.sh)
3) Add reverse trust on local analytics-cluster (initialization setup-kerberos-trust.sh)
4) Add reverse trust on remote metastore-cluster (initialization setup-kerberos-trust.sh)
Below are dependencies for the analytics cluster:
gs://{var.project}-dataproc-secrets/
analytics-cluster_principal.encrypted (kdc secret)
trust_analytics-cluster_kdc-cluster_principal.encrypted (trust w/ foo secret)
trust_metastore-cluster_analytics-cluster_principal.encrypted (metastore trusts analytics secret)
gs://{var.project}-dataproc-scripts/init-actions/
setup-kerberos-config.sh (updates krb5.conf)
setup-kerberos-trust.sh (setup trust on remote kdc)
setup-users-config.sh (setup test users)
disable-history-server.sh (disable unneeded services)
gs://{var.project}-dataproc-scripts/shutdown-scripts/
shutdown-cleanup-trust.sh (remove remote trust)
Verify Kerberos Deployment
SSH & Kinit
Log in, authenticate as alice, and run Hadoop commands on analytics-cluster.
$ gcloud compute ssh alice@analytics-cluster-m --tunnel-through-iap
$ kinit [email protected] # remember insecure pwd alic123
Password for [email protected]:
$ klist
$ hadoop fs -ls /user/
MapReduce Test
Execute a job on the Kerberized cluster and persist the output to the data lake bucket.
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen -Dmapreduce.job.maps=4 10000000 gs://jh-data-sandbox-us-data-lake/test.db/test
Beeline Test
This tests alice’s authentication to HiveServer2 and HiveServer2’s requests to the Hive Metastore.
$ beeline -u "jdbc:hive2://localhost:10000/default;principal=hive/[email protected]"
jdbc:hive2://localhost:10000/default> create database test_db location 'gs://jh-data-sandbox-us-data-lake/test.db';
jdbc:hive2://localhost:10000/default> show databases;
jdbc:hive2://localhost:10000/default> use test_db;
jdbc:hive2://localhost:10000/default> create external table test (one string) location 'gs://jh-data-sandbox-us-data-lake/test.db/test/';
jdbc:hive2://localhost:10000/default> describe formatted test;
jdbc:hive2://localhost:10000/default> select count(1) from test;
jdbc:hive2://localhost:10000/default> set hive.metastore.uris;
jdbc:hive2://localhost:10000/default> !q
Spark and Metastore Test
$ spark-shell
scala> spark.sql("select count(1) from test_db.test").show(false)
scala> spark.catalog.listDatabases.show(false)
scala> spark.sql("use test_db")
scala> spark.sql("describe formatted test").show(false)
scala> println(spark.sparkContext.hadoopConfiguration.get("hive.metastore.uris"))
scala> :q
Application Service Account Test [email protected]
Kinit using the keytab, then run a Spark job on the analytics cluster that accesses the metastore.
$ gcloud compute ssh core-data-svc@analytics-cluster-m --tunnel-through-iap
$ kinit -kt /etc/security/keytab/core-data-svc.keytab [email protected]
$ spark-shell <<< 'spark.sql("select count(1) from test_db.test").show(false)'
Summary
In this blog, we covered the steps for deploying Dataproc clusters with Kerberos and setting up one-way trust between KDCs for interoperability. We then verified the deployment by running jobs that not only executed on the local Kerberized Dataproc cluster, but also validated authenticated test users against a remote Hive Metastore. Lastly, we walked through verification for automated jobs by authenticating with a keytab. Hopefully these steps provide a good foundation for understanding Kerberos on Dataproc and how to configure multiple clusters for your data lake.