HBase Sequence Files to Cloud Bigtable using Beam

This folder contains tools, built on Cloud Dataflow, for importing HBase data into Google Cloud Bigtable and for exporting data from HBase and Bigtable.

Setup

To use the tools in this folder, you can download them from the Maven repository or build them yourself with Maven.

Download the jars

Download the import/export jar, which is an aggregation of all required jars.
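
For example, a 2.3.0 download from Maven Central might look like the following (the group and artifact IDs match this module; verify the current version and exact file name, including any -shaded classifier, against the repository before downloading):

wget https://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-beam-import/2.3.0/bigtable-beam-import-2.3.0-shaded.jar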

Build the jars yourself

Go to the top-level directory and build the repo, then return to this subdirectory.

cd ../../
mvn clean install -DskipTests=true
cd bigtable-dataflow-parent/bigtable-beam-import
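
If the build succeeds, the aggregated jar should be produced under this module's target/ directory; the exact file name depends on the version you built:

ls target/bigtable-beam-import-*.jar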

Tools

Data export pipeline

You can export data into a snapshot or into sequence files. If you're migrating your data from HBase to Bigtable, using snapshots is the preferred method.

Exporting snapshots from HBase

Perform these steps from a Unix shell on an HBase edge node.

  1. Set the environment variables.

    TABLE_NAME=your-table-name
    SNAPSHOT_NAME=your-snapshot-name 
    SNAPSHOT_EXPORT_PATH=/hbase-migration-snap
    BUCKET_NAME="gs://bucket-name"
    
    NUM_MAPPERS=16
    
  2. Take the snapshot

    echo "snapshot '$TABLE_NAME', '$SNAPSHOT_NAME'" | hbase shell -n
    
  3. Export the snapshot

    1. Install the Cloud Storage connector for Hadoop so the cluster can write to gs:// paths.
    2. Copy the snapshot to your Cloud Storage bucket:
     hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
         -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers $NUM_MAPPERS
    
  4. Create hashes for the table to be used during the data validation step. Visit the HBase documentation for more information on each parameter.

    hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=10 --numhashfiles=10 \
    $TABLE_NAME $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/hashtable
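
After the export and hash steps finish, you can optionally confirm that the snapshot exists and that the files landed in Cloud Storage. A minimal check, assuming the gsutil CLI is available on the edge node:

echo "list_snapshots" | hbase shell -n
gsutil ls $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data
gsutil ls $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/hashtable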
    

Exporting sequence files from HBase

  1. On a node with HDFS access, set the environment variables and create the export directory.
    TABLE_NAME="my-new-table"
    EXPORTDIR=/usr/[USERNAME]/hbase-${TABLE_NAME}-export
    hadoop fs -mkdir -p ${EXPORTDIR}
    MAXVERSIONS=2147483647
    
  2. On an edge node that has the HBase classpath configured, run the export command.
    cd $HBASE_HOME
    bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
        -Dmapred.output.compress=true \
        -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
        -Dhbase.client.scanner.caching=100 \
        -Dmapred.map.tasks.speculative.execution=false \
        -Dmapred.reduce.tasks.speculative.execution=false \
        $TABLE_NAME $EXPORTDIR $MAXVERSIONS
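
The export above writes sequence files to HDFS, while the Dataflow import below reads them from Cloud Storage. One way to copy them is distcp; this sketch assumes the Cloud Storage connector is configured on the cluster and reuses the BUCKET_NAME variable from the earlier steps:

hadoop distcp $EXPORTDIR $BUCKET_NAME/hbase-export/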
    

Exporting snapshots from Bigtable

Exporting HBase snapshots from Bigtable is not supported.

Exporting sequence files from Bigtable

  1. Set the environment variables.
    PROJECT_ID=your-project-id
    INSTANCE_ID=your-instance-id
    CLUSTER_NUM_NODES=3
    TABLE_NAME=your-table-name
    REGION=us-central1
    
    BUCKET_NAME=gs://bucket-name
    
  2. Run the export.
    java -jar bigtable-beam-import-2.3.0.jar export \
         --runner=dataflow \
         --project=$PROJECT_ID \
         --bigtableInstanceId=$INSTANCE_ID \
         --bigtableTableId=$TABLE_NAME \
         --destinationPath=$BUCKET_NAME/hbase_export/ \
         --tempLocation=$BUCKET_NAME/hbase_temp/ \
         --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
         --region=$REGION
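
When the export job finishes, you can list the generated sequence files (assuming gsutil is installed):

gsutil ls $BUCKET_NAME/hbase_export/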
    

Importing to Bigtable

You can import data into Bigtable from a snapshot or from sequence files. Before you begin your import, you must create the tables and column families in Bigtable, either with the schema translation tool or with the cbt command line tool by running the following:

cbt createtable your-table-name
cbt createfamily your-table-name your-column-family
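
cbt needs to know which project and instance to use; you can pass them as flags (or set them once in ~/.cbtrc). For example, with the placeholder names used in this document:

cbt -project your-project-id -instance your-instance-id createtable your-table-name
cbt -project your-project-id -instance your-instance-id createfamily your-table-name your-column-family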

Once your import has completed, follow the instructions for the validator below to ensure it was successful.

Please pay attention to the cluster's CPU usage and adjust the number of Dataflow workers accordingly.

Snapshots (preferred method)

  1. Set the environment variables.

    PROJECT_ID=your-project-id
    INSTANCE_ID=your-instance-id
    TABLE_NAME=your-table-name
    REGION=us-central1
    CLUSTER_NUM_NODES=3
    
    BUCKET_NAME=gs://bucket-name
    SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
    SNAPSHOT_NAME=your-snapshot-name
    
  2. Run the import.

    java -jar bigtable-beam-import-2.3.0.jar importsnapshot \
        --runner=DataflowRunner \
        --project=$PROJECT_ID \
        --bigtableInstanceId=$INSTANCE_ID \
        --bigtableTableId=$TABLE_NAME \
        --hbaseSnapshotSourceDir=$SNAPSHOT_GCS_PATH/data \
        --snapshotName=$SNAPSHOT_NAME \
        --stagingLocation=$SNAPSHOT_GCS_PATH/staging \
        --gcpTempLocation=$SNAPSHOT_GCS_PATH/temp \
        --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
        --region=$REGION
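
Before running the full validator, a quick sanity check is to count the imported rows. Note that cbt count scans the whole table, so it can be slow for large tables:

cbt -project $PROJECT_ID -instance $INSTANCE_ID count $TABLE_NAME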
    

Snappy compressed Snapshots

  1. Set the environment variables.

    PROJECT_ID=your-project-id
    INSTANCE_ID=your-instance-id
    TABLE_NAME=your-table-name
    REGION=us-central1
    CLUSTER_NUM_NODES=3
    
    BUCKET_NAME=gs://bucket-name
    SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
    SNAPSHOT_NAME=your-snapshot-name
    
  2. Run the import.

    java -jar bigtable-beam-import-2.3.0.jar importsnapshot \
        --runner=DataflowRunner \
        --project=$PROJECT_ID \
        --bigtableInstanceId=$INSTANCE_ID \
        --bigtableTableId=$TABLE_NAME \
        --hbaseSnapshotSourceDir=$SNAPSHOT_GCS_PATH/data \
        --snapshotName=$SNAPSHOT_NAME \
        --stagingLocation=$SNAPSHOT_GCS_PATH/staging \
        --gcpTempLocation=$SNAPSHOT_GCS_PATH/temp \
        --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
        --region=$REGION \
        --enableSnappy=true
    

Sequence Files

  1. Set the environment variables.
    PROJECT_ID=your-project-id
    INSTANCE_ID=your-instance-id
    CLUSTER_NUM_NODES=3
    CLUSTER_ZONE=us-central1-a
    TABLE_NAME=your-table-name
    REGION=us-central1
    
    BUCKET_NAME=gs://bucket-name
    
  2. Run the import.
    java -jar bigtable-beam-import-2.3.0.jar import \
        --runner=dataflow \
        --project=$PROJECT_ID \
        --bigtableInstanceId=$INSTANCE_ID \
        --bigtableTableId=$TABLE_NAME \
        --sourcePattern=$BUCKET_NAME/hbase-export/part-* \
        --tempLocation=$BUCKET_NAME/hbase_temp \
        --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES)  \
        --zone=$CLUSTER_ZONE \
        --region=$REGION
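
You can monitor any of these imports from the Dataflow page in the Cloud Console, or from the command line with gcloud, for example:

gcloud dataflow jobs list --project=$PROJECT_ID --region=$REGION --status=active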
    

Validating data

Once your snapshot or sequence files have been imported, run the validator to check for rows with mismatched data.

  1. Set the environment variables.
    PROJECT_ID=your-project-id
    INSTANCE_ID=your-instance-id
    TABLE_NAME=your-table-name
    REGION=us-central1
    
    BUCKET_NAME=gs://bucket-name
    SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
    
  2. Run the sync job. It writes the results to $SNAPSHOT_GCS_PATH/sync-table/output-TIMESTAMP.
    java -jar bigtable-beam-import-2.3.0.jar sync-table  \
        --runner=dataflow \
        --project=$PROJECT_ID \
        --bigtableInstanceId=$INSTANCE_ID \
        --bigtableTableId=$TABLE_NAME \
        --outputPrefix=$SNAPSHOT_GCS_PATH/sync-table/output-$(date +"%s") \
        --stagingLocation=$SNAPSHOT_GCS_PATH/sync-table/staging \
        --hashTableOutputDir=$SNAPSHOT_GCS_PATH/hashtable \
        --tempLocation=$SNAPSHOT_GCS_PATH/sync-table/dataflow-test/temp \
        --region=$REGION
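
When the job completes, the results are written under the output prefix; the exact file layout depends on the job, but you can list and inspect them with gsutil:

gsutil ls $SNAPSHOT_GCS_PATH/sync-table/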