Developing NLP solutions with T5X and Vertex AI

This repository compiles prescriptive guidance and code samples that show how to operationalize the Google Research T5X framework using Google Cloud Vertex AI. Using T5X with Vertex AI enables streamlined experimentation, development, and deployment of natural language processing (NLP) solutions at scale.

The guidance assumes that you're familiar with ML concepts such as large language models (LLMs), and that you're generally familiar with Google Cloud features like Cloud Storage, Cloud TPUs, and Google Vertex AI.

Introduction

T5X is a machine learning (ML) framework for developing high-performance sequence models, including large language models (LLMs). For more information about T5X, see the T5X GitHub repo.

T5X is built as a JAX-based library for training, evaluating, and inferring with sequence models. T5X's primary focus is on Transformer-type language models. You can use T5X to pretrain language models and to fine-tune pretrained language models. The T5X GitHub repo includes references to a large number of pretrained Transformer models, including the T5 and Switch Transformer families of models.

T5X is streamlined, modular, and composable. You can implement pretraining, fine-tuning, evaluating, and inferring by configuring reusable components that are provided by T5X rather than having to develop custom Python modules.

Vertex AI is Google Cloud's unified ML platform that's designed to help data scientists and ML engineers increase their velocity of experimentation, deploy faster, and manage models with confidence.

Repository structure

.
├── configs
├── docs
├── notebooks
├── scripts
├── tasks
├── Dockerfile
└── README.md

/notebooks: Example notebooks demonstrating T5X fine-tuning, evaluating, and inferring scenarios.
/configs: Configuration files for the scenarios demonstrated in the notebooks.
/scripts: Vertex AI Training T5X job configuration templates for selected fine-tuning, evaluating, or inferring scenarios:
- Fine-tuning 20B UL2 on XSUM
/tasks: Python modules implementing custom SeqIO Tasks.
/docs: Technical guides compiling best practices for running T5X on Vertex AI:
- Running and monitoring T5X jobs with Vertex AI
- Implementing model and data parallelism
The main folder also includes the Dockerfile for the custom container image used by Vertex AI Training.

Environment setup

This section outlines the steps to configure the Google Cloud environment required to run the code samples in this repo.

You use a user-managed instance of Vertex AI Workbench as your development environment and the primary interface to Vertex AI services.
You run T5X training, evaluating, and inferring tasks as Vertex Training custom jobs using a custom training container image.
You use Vertex AI Experiments and Vertex AI TensorBoard for job monitoring and experiment tracking.
You use a regional Cloud Storage bucket to manage artifacts created by T5X jobs.

To set up the environment, complete the following steps.

Select a Google Cloud project

In the Google Cloud Console, on the project selector page, select or create a Google Cloud project. You need to be a project owner in order to set up the environment.

Enable the required services

From Cloud Shell, run the following commands to enable the required Cloud APIs:

export PROJECT_ID=<YOUR_PROJECT_ID>
 
gcloud config set project $PROJECT_ID
 
gcloud services enable \
  cloudbuild.googleapis.com \
  compute.googleapis.com \
  cloudresourcemanager.googleapis.com \
  iam.googleapis.com \
  container.googleapis.com \
  cloudapis.googleapis.com \
  cloudtrace.googleapis.com \
  containerregistry.googleapis.com \
  iamcredentials.googleapis.com \
  monitoring.googleapis.com \
  logging.googleapis.com \
  notebooks.googleapis.com \
  aiplatform.googleapis.com \
  storage.googleapis.com
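
Before moving on, it can help to confirm that every service in the list above ended up enabled. The following is a minimal sketch and not part of this repo: `missing_services` is a hypothetical helper that compares required service names against a newline-separated list of enabled ones. The commented usage assumes that `gcloud services list --enabled --format="value(config.name)"` prints one service name per line; verify that against your gcloud version.

```shell
# Hypothetical helper (not part of the repo): print every required service
# that does not appear in the newline-separated list given as the first argument.
missing_services() {
  enabled_list=$1
  shift
  for svc in "$@"; do
    # -F: fixed string, -x: match the whole line, -q: no output
    printf '%s\n' "$enabled_list" | grep -Fxq "$svc" || printf '%s\n' "$svc"
  done
}

# Usage (needs an authenticated gcloud; commented out so the sketch has no side effects):
# ENABLED=$(gcloud services list --enabled --format="value(config.name)")
# missing_services "$ENABLED" aiplatform.googleapis.com notebooks.googleapis.com storage.googleapis.com
```

An empty result means that all of the names you passed are enabled.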

Note: When you work with Vertex AI user-managed notebooks, be sure that all the services that you're using are provisioned in the same project and in the same compute region, and that this region is one where Vertex AI TPU pods are available. For a list of regions where TPU pods are available, see Locations in the Vertex AI documentation.

Verify quota to run jobs using Vertex AI TPUs

Some notebooks demonstrate scenarios that require as many as 128 TPU cores.

If you need an increase in Vertex AI TPU quota values, follow these steps:

In the Cloud Console, navigate to the Quotas tab of the Vertex AI API page.
In the Enter property name or value box that's next to the Filter label, add a filter that has the following conditions:

Quota: Custom model training TPU V2 cores per region or Custom model training TPU V3 cores per region
Dimensions (e.g. location): Region: <YOUR_REGION>

Note: Vertex AI TPUs are not available in all regions. If the Limit value in the listing is 8, TPUs are available, and you can request more by increasing the Quota value. If the Limit value is 0, no TPUs are available, and the Quota value cannot be changed.

In the listing, select the quota that matches your filter criteria and then click Edit Quotas.
In the New limit box, enter the required value and then submit the quota change request.

A quota increase doesn't directly affect your billing, because you still specify the number of TPU cores each time you submit a T5X task. Only tasks submitted with a large number of TPU cores result in higher billing.

Configure Vertex AI Workbench

You can create a user-managed notebooks instance from the command line.

Note: Make sure that you're following these steps in the same project as before.

In Cloud Shell, enter the following command. For <YOUR_INSTANCE_NAME>, enter a name that starts with a lowercase letter followed by lowercase letters, numbers, or dashes. For <YOUR_LOCATION>, specify a zone (for example, us-central1-a or europe-west4-a).

PROJECT_ID=$(gcloud config list --format 'value(core.project)')
INSTANCE_NAME=<YOUR_INSTANCE_NAME>
LOCATION=<YOUR_LOCATION>
gcloud notebooks instances create $INSTANCE_NAME \
     --vm-image-project=deeplearning-platform-release \
     --vm-image-family=common-cpu-notebooks \
     --machine-type=n1-standard-4 \
     --location=$LOCATION

Vertex AI Workbench creates a user-managed notebooks instance based on the properties that you specified and then automatically starts the instance. When the instance is ready to use, Vertex AI Workbench activates an Open JupyterLab link next to the instance name in the Vertex AI Workbench Cloud Console page. To connect to your user-managed notebooks instance, click Open JupyterLab.
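
The naming rule for <YOUR_INSTANCE_NAME> given earlier in this section can be checked before you call gcloud. This is a minimal sketch: `is_valid_instance_name` is a hypothetical helper, and it encodes only the rule stated in this guide (the service may enforce further limits, such as a maximum length).

```shell
# Hypothetical helper (not part of the repo): succeed only if the name starts
# with a lowercase letter and continues with lowercase letters, digits, or dashes.
is_valid_instance_name() {
  # LC_ALL=C keeps the [a-z] range limited to ASCII lowercase letters.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z][a-z0-9-]*$'
}

INSTANCE_NAME=t5x-workbench-1   # illustrative value
if is_valid_instance_name "$INSTANCE_NAME"; then
  echo "ok: $INSTANCE_NAME"
else
  echo "invalid instance name: $INSTANCE_NAME" >&2
fi
```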

Clone the repo, install dependencies, and build the base container image

After JupyterLab opens in your Vertex AI Workbench user-managed notebooks instance, perform the following steps:

On the Launcher page, start a new terminal session by clicking the Terminal icon.
Clone the repository to your notebook instance:

git clone https://github.com/GoogleCloudPlatform/t5x-on-vertex-ai.git

Install code dependencies:

cd t5x-on-vertex-ai
pip install -U pip
pip install "google-cloud-aiplatform[tensorboard]" tfds-nightly "t5[gcp]"

Build the base T5X container image in Container Registry. For <YOUR_PROJECT_ID>, use the ID of the Google Cloud project that you are working with.

export PROJECT_ID=<YOUR_PROJECT_ID>
gcloud config set project $PROJECT_ID
 
IMAGE_NAME=t5x-base
IMAGE_URI=gcr.io/${PROJECT_ID}/${IMAGE_NAME}
gcloud builds submit --timeout "2h" --tag ${IMAGE_URI} . --machine-type=e2-highcpu-8

Create a staging Cloud Storage bucket

The notebooks in the repo require access to a Cloud Storage bucket that's used for staging and for managing ML artifacts created by the jobs submitted. The bucket must be in the same Google Cloud region as the region you will use to run Vertex AI custom jobs.

In the JupyterLab terminal, create the bucket. For <YOUR_REGION>, specify the region. For <YOUR_BUCKET_NAME>, use a globally unique name.

REGION=<YOUR_REGION>
BUCKET_NAME=<YOUR_BUCKET_NAME>
gsutil mb -l $REGION -p $PROJECT_ID gs://$BUCKET_NAME
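
If you want the bucket in the region that contains the zone you chose for the notebook instance, the region is the zone name without its final suffix (for example, us-central1-a is in us-central1). A minimal sketch of that derivation, where `region_from_zone` is a hypothetical helper and not part of this repo:

```shell
# Hypothetical helper: strip the final "-<suffix>" from a zone name
# to obtain its region (us-central1-a -> us-central1).
region_from_zone() {
  printf '%s\n' "${1%-*}"
}

LOCATION=us-central1-a                  # illustrative zone
REGION=$(region_from_zone "$LOCATION")
echo "$REGION"                          # prints us-central1
```

Confirm that the resulting region also supports the Vertex AI TPUs you plan to use before you create the bucket there.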

Create a Vertex AI TensorBoard instance

In the JupyterLab terminal, create the Vertex AI TensorBoard instance:

DISPLAY_NAME=<YOUR_INSTANCE_NAME>
gcloud ai tensorboards create --display-name $DISPLAY_NAME --project $PROJECT_ID --region=$REGION

Prepare the datasets

Before you walk through the example notebooks, make sure that you pre-build all the required TensorFlow Datasets (TFDS) datasets.

From the JupyterLab terminal:

BUCKET_NAME=<YOUR_BUCKET_NAME>
export TFDS_DATA_DIR=gs://${BUCKET_NAME}/datasets

squad

tfds build --data_dir $TFDS_DATA_DIR --experimental_latest_version squad

wmt_t2t_translate

tfds build --data_dir $TFDS_DATA_DIR --experimental_latest_version wmt_t2t_translate

cnn_dailymail

tfds build --data_dir $TFDS_DATA_DIR --experimental_latest_version cnn_dailymail
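
The three builds above differ only in the dataset name, so they can be wrapped in a loop. The following is a sketch under stated assumptions: `build_tfds_datasets` is a hypothetical helper, `TFDS_CMD` is a variable invented here so that the loop can be dry-run with `echo`, and `gs://my-bucket/datasets` is a placeholder path. The `tfds build` flags are the ones used in this guide.

```shell
# Hypothetical helper (not part of the repo): build each named TFDS dataset
# into the given data directory, stopping at the first failure.
# TFDS_CMD defaults to the real tfds CLI; set TFDS_CMD=echo for a dry run.
build_tfds_datasets() {
  data_dir=$1
  shift
  for name in "$@"; do
    ${TFDS_CMD:-tfds} build --data_dir "$data_dir" --experimental_latest_version "$name" || return 1
  done
}

# Dry run: prints the arguments that would be passed to tfds, one line per dataset.
TFDS_CMD=echo build_tfds_datasets gs://my-bucket/datasets squad wmt_t2t_translate cnn_dailymail
```

To run the real builds, call the function without `TFDS_CMD=echo` and pass `$TFDS_DATA_DIR` as the first argument.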

xsum

To build xsum you need to download and prepare the source data manually.

Follow the instructions to create the xsum-extracts-from-downloads folder with source data.
Create a tar archive from the xsum-extracts-from-downloads folder.

tar -czvf xsum-extracts-from-downloads.tar.gz xsum-extracts-from-downloads/

Copy the archive to the TFDS manual downloads folder.

gsutil cp -r xsum-extracts-from-downloads.tar.gz ${TFDS_DATA_DIR}/downloads/manual/

Build the dataset.

tfds build --data_dir $TFDS_DATA_DIR --experimental_latest_version xsum

The environment is ready.

Getting started

Start by reading the Running and monitoring T5X jobs with Vertex AI guide and walking through the Getting Started notebook.

Getting help

If you have any questions or find any problems with this repository, please report them through GitHub issues.
