Introduction to Mosaic AI Agent Evaluation

Preview

This feature is in Public Preview.

This article describes Mosaic AI Agent Evaluation. Agent Evaluation enables developers to quickly and reliably evaluate the quality, latency, and cost of generative AI applications. The capabilities of Agent Evaluation are unified across the development, staging, and production phases of the LLMOps life cycle, and all evaluation metrics and data are logged to MLflow Runs.

Generative AI applications are complex and involve many different components, so evaluating their performance is not as straightforward as evaluating a traditional ML model. The qualitative and quantitative metrics used to evaluate quality are inherently more complex. This article gives an overview of how to work with Agent Evaluation and includes links to articles with more detail.

Establish ground truth with an evaluation set

To measure the quality of an AI application, you need to define what a high-quality, accurate response looks like. To do that, you create an evaluation set: a set of representative questions, ground-truth answers, and, optionally, the supporting documents that you expect the response to be based on.

For details about evaluation sets, including the schema, metric dependencies, and best practices, see Evaluation sets.
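The sketch below shows one way such an evaluation set might be assembled as a pandas DataFrame. The column names (request, expected_response, expected_retrieved_context) are assumed to follow the schema described in Evaluation sets, and the questions, answers, and document URIs are hypothetical.

```python
import pandas as pd

# A minimal evaluation set: representative questions, ground-truth answers,
# and (optionally) the supporting documents the response should be based on.
# Column names are assumed; check them against the evaluation-set schema.
eval_set = pd.DataFrame(
    [
        {
            "request": "What is Mosaic AI Agent Evaluation?",
            "expected_response": "A tool for evaluating the quality, latency, and cost of generative AI applications.",
            "expected_retrieved_context": [
                {"doc_uri": "docs/agent-evaluation/index.md"}  # hypothetical document URI
            ],
        },
        {
            "request": "Where are evaluation metrics logged?",
            "expected_response": "All evaluation metrics and data are logged to MLflow Runs.",
            "expected_retrieved_context": [
                {"doc_uri": "docs/agent-evaluation/metrics.md"}  # hypothetical document URI
            ],
        },
    ]
)
```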

Assess performance with the right metrics

Evaluating an AI application requires several sets of metrics, including:

  • Retrieval metrics, which measure whether the retriever returned chunks that are relevant to the input request.

  • Response metrics, which measure whether the response is accurate, consistent with the retrieved context, and relevant to the input request.

  • Performance metrics, which measure the number of tokens across all LLM generation calls and the latency in seconds for the trace.

For details about metrics and LLM judges, see Use agent metrics and LLM judges to evaluate RAG performance.
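As a rough sketch of how these metric families surface after a run, the example below reuses the eval_set DataFrame from the previous sketch and a hypothetical agent registered as an MLflow model. Passing model_type="databricks-agent" to mlflow.evaluate is what routes the run through Agent Evaluation; the "eval_results" table key used for per-request details is an assumption, so check the metrics article and your run's artifacts for the exact names.

```python
import mlflow

# Sketch: evaluate a hypothetical agent against the evaluation set above.
results = mlflow.evaluate(
    data=eval_set,                  # evaluation set from the earlier sketch
    model="models:/my_agent/1",     # hypothetical model URI for the agent
    model_type="databricks-agent",  # invokes Agent Evaluation metrics and LLM judges
)

# Aggregate metrics span the three families described above:
# retrieval, response, and performance (token counts, latency).
print(results.metrics)

# Per-request results, including retrieved chunks and judge assessments.
# The table key is an assumption; check the run's artifacts for the exact name.
print(results.tables["eval_results"].head())
```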

Evaluation runs

For details about how to run an evaluation, see How to run an evaluation and view the results. Agent Evaluation supports two options for providing the application's output:

  • You can run the GenAI application, typically a chain or agent, as part of the evaluation run. The application generates results for each input in the evaluation set.

  • You can provide output from a previous run of the application.

For details and explanation of when to use each option, see How to provide input to an evaluation run.
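The following minimal sketch illustrates both options using the same mlflow.evaluate workflow as above; the model URI, the response values, and the column names are hypothetical and should be checked against How to provide input to an evaluation run.

```python
import mlflow

# Option 1: run the application as part of the evaluation.
# Agent Evaluation invokes the agent for every request in the evaluation set.
results_live = mlflow.evaluate(
    data=eval_set,                  # evaluation set from the earlier sketch
    model="models:/my_agent/1",     # hypothetical model URI for the agent
    model_type="databricks-agent",
)

# Option 2: evaluate output captured from a previous run of the application.
# The evaluation set carries a response column (and optionally retrieved
# context), so no model is passed and nothing is re-generated.
precomputed = eval_set.assign(
    response=[
        "Agent Evaluation measures quality, latency, and cost.",  # hypothetical outputs
        "Metrics are logged to MLflow Runs.",
    ]
)
results_precomputed = mlflow.evaluate(
    data=precomputed,
    model_type="databricks-agent",
)
```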

Get human feedback about the quality of a GenAI application

The Databricks review app makes it easy to gather feedback from human reviewers about the quality of a GenAI application. For details, see Get feedback about the quality of a RAG agent.

Information about the models powering LLM judges

  • LLM judges might use third-party services to evaluate your GenAI applications, including Azure OpenAI operated by Microsoft.

  • For Azure OpenAI, Databricks has opted out of Abuse Monitoring, so no prompts or responses are stored with Azure OpenAI.

  • For European Union (EU) workspaces, LLM judges use models hosted in the EU. All other regions use models hosted in the US.

  • Disabling Partner-powered AI assistive features will prevent the LLM judge from calling partner-powered models.

  • Data sent to the LLM judge is not used for any model training.

  • LLM judges are intended to help customers evaluate their RAG applications, and LLM judge outputs should not be used to train, improve, or fine-tune an LLM.