Get online predictions from a custom-trained model

Vertex AI on Google Distributed Cloud (GDC) air-gapped offers online predictions through the Prediction API. A prediction is the output of a trained machine learning model. Specifically, online predictions are synchronous requests made to your own model endpoint.

Online predictions let you upload, deploy, and serve your own prediction models on a set of supported containers, and send requests to them. Use online predictions when you are making requests in response to application input or in situations that require timely inference. This page provides an overview of the workflow for getting online predictions from your custom-trained models on Vertex AI.

Use the Prediction API by applying Kubernetes custom resources (CRs) to the dedicated Prediction user cluster that your Infrastructure Operator (IO) creates for you.
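As a minimal sketch, deploying a model through a CR might look like the following. The API group, kind, and spec fields shown here (`prediction.aiplatform.gdc.goog`, `DeployedModel`, `endpoint`, and so on) are illustrative assumptions, not the exact schema; list the CRDs installed on your Prediction cluster to find the actual resource definitions.

```
# Minimal sketch, not the exact schema: the apiVersion, kind, and spec fields
# below are assumptions for illustration. Run `kubectl get crds` against your
# Prediction cluster to discover the actual resource definitions.
kubectl apply --kubeconfig PREDICTION_CLUSTER_KUBECONFIG -f - <<EOF
apiVersion: prediction.aiplatform.gdc.goog/v1   # assumed API group and version
kind: DeployedModel                             # assumed kind
metadata:
  name: my-tf-model           # hypothetical resource name
  namespace: my-project       # hypothetical project namespace
spec:
  endpoint: my-endpoint                         # assumed: endpoint to serve from
  modelArtifactLocation: /models/my-tf-model/   # assumed: path to the saved model
  containerImage: tf2-cpu.2-6:latest            # a supported image (see Available container images)
EOF
```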

Before getting online predictions, you must first deploy the model resource to an endpoint. This action associates compute resources with the model so that it can serve online predictions with low latency. Then, you can get online predictions from a custom trained model by sending a request.
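Once the model is deployed to an endpoint, you can send it a prediction request. The following curl sketch uses a placeholder endpoint URL, URL path, and bearer token; obtain the actual URL and credentials from your deployed endpoint resource. The request body follows the Vertex AI online prediction format, with an `instances` array carrying the model inputs.

```
# Sketch of an online prediction request. ENDPOINT_URL, the URL path, and
# TOKEN are placeholders; get the real values from your endpoint resource.
# The {"instances": [...]} body is the Vertex AI online prediction format.
curl -X POST \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  "https://ENDPOINT_URL/v1/model/my-tf-model:predict" \
  -d '{
    "instances": [
      [1.0, 2.0, 3.0]
    ]
  }'
```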

Available container images

The following table lists the container images supported for Vertex AI online predictions in GDC.

| ML framework | Version | Supported accelerators | Supported images |
| --- | --- | --- | --- |
| TensorFlow | 2.6 | CPU only | tf2-cpu.2-6:latest |
| TensorFlow | 2.6 | GPU | tf2-gpu.2-6:latest |