View logs and metrics for Vertex AI

You can view logs and metrics for Vertex AI and the pre-trained APIs installed with it. The installed pre-trained APIs are Optical Character Recognition (OCR), Speech-to-Text, and Translation.

You can monitor some of the Vertex AI metrics in the observability tools. You can also create queries to monitor specific Vertex AI metrics. For information about observability in Google Distributed Cloud (GDC) air-gapped, see Monitor metrics and logs.

Before you begin

To get the permissions you need to view logs and metrics for Vertex AI, ask your Organization IAM Admin to grant you the Organization Grafana Viewer (organization-grafana-viewer) cluster role in the platform-obs namespace.

The following topics help you monitor Vertex AI and troubleshoot issues using logs and metrics.

View Vertex AI logs and metrics

To view Vertex AI logs and metrics, you must enable the Vertex AI pre-trained APIs. For more information, see Get the statuses of the pre-trained APIs.

To view Vertex AI logs and metrics, do the following:

  1. If you aren't signed in to the GDC console, sign in using the steps in Sign in.

  2. In the navigation pane, expand Vertex AI, then click Pre-trained APIs.

  3. On the Pre-trained APIs page, click Monitor services to open the monitoring dashboard.

  4. In the monitoring dashboard, click Explore to access logs and metrics.

Use the monitoring dashboard to view Vertex AI logs and metrics

You can view Vertex AI metrics in the monitoring dashboard. For example, you can create a query to view how Vertex AI affects CPU usage.

To view metrics in the monitoring dashboard, do the following:

  1. Open the monitoring dashboard from Vertex AI. For more information, see View Vertex AI logs and metrics.

  2. In the monitoring dashboard, on the Explore tab, select one of the following data sources:

    • For metrics, select the prometheus data source.

    • For Vertex AI operational logs, select the Operational Logs data source.

    • For Vertex AI audit logs, select the Audit Logs data source.

  3. Click the plus symbol (+) to create a custom dashboard for your metric and log queries.

  4. Create queries and run them in the custom dashboard. The custom dashboard preserves your queries so that you can access them later.

Sample Vertex AI platform queries for the monitoring dashboard

The following are sample queries to help you construct your own metric and log queries to monitor Vertex AI in your air-gapped environment.

Sample Vertex AI platform metric queries

For metric queries, the data source must be prometheus.

The following are sample queries that show the CPU usage of the operator containers:

  • L1 operator CPU usage percentage: rate(container_cpu_usage_seconds_total{namespace="ai-system",container="l1operator"}[30s]) * 100
  • L2 operator CPU usage percentage: rate(container_cpu_usage_seconds_total{namespace="ai-system",container="l2operator"}[30s]) * 100
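
To watch both operators in one panel, you can query the same metric for both containers and group the result by pod. This is a sketch, assuming the metric carries the standard Kubernetes pod label:

  • Combined operator CPU usage percentage by pod: sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="ai-system",container=~"l1operator|l2operator"}[30s])) * 100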

The following are sample queries that show the memory usage of the operator containers:

  • L1 operator memory usage in MB: container_memory_usage_bytes{namespace="ai-system",container="l1operator"} * 1e-6
  • L2 operator memory usage in MB: container_memory_usage_bytes{namespace="ai-system",container="l2operator"} * 1e-6
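
If you want one panel that tracks both operators, you can query both containers at once and group the result by container. This is a sketch that reuses the metric and labels from the queries above:

  • Combined operator memory usage in MB: sum by (container) (container_memory_usage_bytes{namespace="ai-system",container=~"l1operator|l2operator"}) * 1e-6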

Sample Vertex AI platform log queries

For operational logs, the data source must be Operational Logs. For audit logs, the data source must be Audit Logs.

Sample Vertex AI platform operational log queries

The following are sample queries to view Vertex AI operational logs:

  • L1 operator logs: {service_name="vai-l1operator"}
  • L2 operator logs: {service_name="vai-l2operator"}
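
To narrow the results to specific log lines, you can append a line filter to either query. This is a sketch, assuming the Operational Logs data source accepts LogQL line filter expressions; the string "error" is only an example of a term to match:

  • L1 operator error lines: {service_name="vai-l1operator"} |= "error"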

Sample Vertex AI platform audit log queries

The following are sample queries to view Vertex AI audit logs:

  • Vertex AI platform frontend: {namespace="istio-system",service_name="istio"} | json | resource_cluster_name="vai-web-plugin-frontend.ai-system"
  • Vertex AI platform backend: {namespace="istio-system",service_name="istio"} | json | resource_cluster_name="vai-web-plugin-backend.ai-system"
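
To view frontend and backend audit logs together, you can match both values with a regular expression. This is a sketch, assuming the Audit Logs data source supports regular-expression label filters after the json parser:

  • Vertex AI platform frontend and backend: {namespace="istio-system",service_name="istio"} | json | resource_cluster_name=~"vai-web-plugin-(frontend|backend).ai-system"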

Sample Vertex AI service queries for the monitoring dashboard

The following are sample queries to help you construct your own metric and log queries to monitor the pre-trained APIs installed with Vertex AI on GDC. You can monitor metrics and logs for Optical Character Recognition (OCR), Speech-to-Text, and Translation.

Sample Vertex AI pre-trained API metric queries

For metric queries, the data source must be prometheus.

The following are sample queries that show the effect of a pre-trained API on CPU usage. There is one sample query for each pre-trained API.

  • OCR CPU usage: rate(container_cpu_usage_seconds_total{namespace="ai-ocr-system",container="CONTAINER_NAME"}[30s]) * 100, where CONTAINER_NAME is vision-extractor, vision-frontend, or vision-vms-ocr
  • Speech-to-Text CPU usage: rate(container_cpu_usage_seconds_total{namespace="ai-speech-system",container="CONTAINER_NAME"}[30s]) * 100
  • Translation CPU usage: rate(container_cpu_usage_seconds_total{namespace="ai-translation-system",container="CONTAINER_NAME"}[30s]) * 100, where CONTAINER_NAME is translation-aligner, translation-frontend, or translation-prediction
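
If you aren't sure which container to inspect, you can group CPU usage by container across a pre-trained API namespace. This is a sketch for OCR, assuming the same metric and labels as the queries above:

  • OCR CPU usage by container: sum by (container) (rate(container_cpu_usage_seconds_total{namespace="ai-ocr-system"}[30s])) * 100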

The following are sample queries that use the destination_service filter label to get the error rate over the last 60 minutes:

  • OCR error rate: rate(istio_requests_total{destination_service=~".*ai-ocr-system.svc.cluster.local",response_code=~"[4-5][0-9][0-9]"}[60m])
  • Speech-to-Text error rate: rate(istio_requests_total{destination_service=~".*ai-speech-system.svc.cluster.local",response_code=~"[4-5][0-9][0-9]"}[60m])
  • Translation error rate: rate(istio_requests_total{destination_service=~".*ai-translation-system.svc.cluster.local",response_code=~"[4-5][0-9][0-9]"}[60m])
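
To express errors as a fraction of all requests instead of an absolute rate, you can divide the error rate by the total request rate. This is a sketch for OCR that reuses the istio_requests_total metric from the queries above:

  • OCR error ratio: sum(rate(istio_requests_total{destination_service=~".*ai-ocr-system.svc.cluster.local",response_code=~"[4-5][0-9][0-9]"}[60m])) / sum(rate(istio_requests_total{destination_service=~".*ai-ocr-system.svc.cluster.local"}[60m]))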

Sample Vertex AI pre-trained API log queries

For operational logs, the data source must be Operational Logs. For audit logs, the data source must be Audit Logs.

Sample pre-trained API operational log queries

Operational log queries for the pre-trained APIs are constructed similarly to the Vertex AI operational log queries. The primary difference is that the namespace used as the main filter specifies the pre-trained API. The three namespaces are:

  • ai-translation-system
  • ai-speech-system
  • ai-ocr-system

You can get more granular results by adding labels, such as service_name or pod, to your query. The following are operational log query examples for the pre-trained APIs:

  • OCR: {namespace="ai-ocr-system"}
  • Speech-to-Text: {namespace="ai-speech-system"}
  • Translation: {namespace="ai-translation-system"}
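
For example, to scope OCR operational logs to a single service, add the service_name label. SERVICE_NAME is a placeholder; replace it with the name of the service whose logs you want to view:

  • OCR logs for one service: {namespace="ai-ocr-system",service_name="SERVICE_NAME"}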

Sample pre-trained API audit log queries

The following are sample queries to view audit logs for the pre-trained APIs:

  • OCR: {namespace="istio-system",service_name="istio"} | json | resource_cluster_name="dep-vision-frontend-server.ai-ocr-system"
  • Speech-to-Text: {namespace="istio-system",service_name="istio"} | json | resource_cluster_name="dep-speech.ai-speech-system"
  • Translation: {namespace="istio-system",service_name="istio"} | json | resource_cluster_name="dep-translation-frontend-server.ai-translation-system"
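
To chart how many audit log entries a pre-trained API produces over time, you can wrap a query in a range aggregation. This is a sketch for OCR, assuming the Audit Logs data source supports LogQL metric queries such as count_over_time:

  • OCR audit log entries per 5 minutes: count_over_time({namespace="istio-system",service_name="istio"} | json | resource_cluster_name="dep-vision-frontend-server.ai-ocr-system" [5m])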

Get the statuses of the pre-trained APIs

To view the statuses of the pre-trained APIs, see View service statuses and endpoints.