LLM Evaluation with MLflow Example Notebook
In this notebook, we will demonstrate how to evaluate various LLMs and RAG systems with MLflow, leveraging simple metrics such as toxicity, as well as LLM-judged metrics such as relevance, and even custom LLM-judged metrics such as professionalism. We need to set our OpenAI API key, since we will be using GPT-4 for our LLM-judged metrics. To set your private key safely, either export the key from a command-line terminal for your current session or, for a permanent addition across sessions, configure it in your shell profile. The workflow then has three steps:

1. Create a test case of inputs that will be passed into the model, along with a ground_truth column used to compare against the generated output.
2. Create a simple OpenAI model that asks gpt-4o to answer each question in two sentences.
3. Call mlflow.evaluate() with the model and the evaluation dataframe.
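The notebook walks through these steps in separate cells; the sketch below condenses them under stated assumptions. It uses the mlflow.openai flavor and gpt-4o as described above, but the eval_data contents and prompt wording are illustrative placeholders rather than the notebook's exact code.

```python
import openai
import pandas as pd
import mlflow

# Step 1: test cases -- questions plus the ground truth used for comparison
eval_data = pd.DataFrame(
    {
        "inputs": [
            "How does useEffect() work?",
            "What does the static keyword in a function mean?",
        ],
        "ground_truth": [
            "The useEffect() hook tells React that your component needs to do "
            "something after render.",
            "Static members belong to the class itself rather than to any instance.",
        ],
    }
)

with mlflow.start_run():
    # Step 2: a simple gpt-4o model that answers each question in two sentences
    system_prompt = "Answer the following question in two sentences."
    logged_model = mlflow.openai.log_model(
        model="gpt-4o",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    # Step 3: evaluate the logged model against the test cases
    results = mlflow.evaluate(
        logged_model.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(results.metrics)
```

With model_type="question-answering", MLflow computes built-in metrics such as exact_match out of the box; LLM-judged metrics such as answer_relevance can be added through the extra_metrics argument.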
MLflow is a powerful open-source platform for managing the machine learning lifecycle. While it’s traditionally used for tracking model experiments, logging parameters, and managing deployments, MLflow has recently introduced support for evaluating Large Language Models (LLMs). In this tutorial, we explore how to use MLflow to evaluate the performance of an LLM—in our case, Google’s Gemini model—on a set of fact-based prompts. We’ll generate responses to fact-based prompts using Gemini and assess their quality using a variety of metrics supported directly by MLflow.
For this tutorial, we’ll be using both the OpenAI and Gemini APIs. MLflow’s built-in generative AI evaluation metrics currently rely on OpenAI models (e.g., GPT-4) to act as judges for metrics like answer similarity or faithfulness, so an OpenAI API key is required. You can obtain an OpenAI API key from the OpenAI platform and a Gemini API key from Google AI Studio.

In this step, we define a small evaluation dataset containing factual prompts along with their correct ground truth answers. These prompts span topics such as science, health, web development, and programming. This structured format allows us to objectively compare the Gemini-generated responses against known correct answers using various evaluation metrics in MLflow.
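As a rough illustration of the kind of dataset described here, such a table could be built with pandas; the column names (inputs, ground_truth) and the example prompts below are assumptions, not the tutorial's exact data.

```python
import pandas as pd

# Illustrative evaluation dataset: factual prompts with known correct answers
eval_data = pd.DataFrame(
    {
        "inputs": [
            "Who developed the theory of general relativity?",
            "Which vitamin is produced when skin is exposed to sunlight?",
            "Which HTML tag is used to create a hyperlink?",
            "Which keyword defines a function in Python?",
        ],
        "ground_truth": [
            "Albert Einstein",
            "Vitamin D",
            "The <a> tag",
            "def",
        ],
    }
)
```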
This code block defines a helper function gemini_completion() that sends a prompt to the Gemini 1.5 Flash model using the Google Generative AI SDK and returns the generated response as plain text. We then apply this function to each prompt in our evaluation dataset to generate the model’s predictions, storing them in a new “predictions” column. These predictions will later be evaluated against the ground truth answers.
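A minimal sketch of what this helper might look like, assuming the google-generativeai package and the eval_data DataFrame with an inputs column from the previous step; the tutorial's exact function body may differ.

```python
import os
import google.generativeai as genai

# Configure the SDK with the Google API key set earlier
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

def gemini_completion(prompt: str) -> str:
    """Send a prompt to Gemini 1.5 Flash and return the response as plain text."""
    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content(prompt)
    return response.text.strip()

# Generate a prediction for every prompt in the evaluation dataset
eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion)
```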
If you’re experimenting with Large Language Models (LLMs) like Google’s Gemini and want reliable, transparent evaluation—this guide is for you. Evaluating LLM outputs can be surprisingly tricky, especially as their capabilities expand and their use cases multiply. How do you know if an LLM is accurate, consistent, or even safe in its responses?
And how do you systematically track and compare results across experiments so you can confidently improve your models? That’s where MLflow steps in. Traditionally known for experiment tracking and model management, MLflow is rapidly evolving into a robust platform for LLM evaluation. The latest enhancements make it easier than ever to benchmark LLMs using standardized, automated metrics—no more cobbling together manual scripts or spreadsheets. In this hands-on tutorial, I’ll walk you through evaluating the Gemini model with MLflow, using a set of fact-based prompts and metrics that matter. By the end, you’ll know not just how to run an LLM evaluation workflow, but why each step matters—and how to use your findings to iterate smarter.
You might wonder, “Don’t LLMs just work out of the box?” While today’s models are impressively capable, they’re not infallible. They can hallucinate facts, misunderstand context, or simply give inconsistent answers. If you’re deploying LLMs in production—for search, chatbots, summarization, or anything mission-critical—evaluation isn’t optional. It’s essential. MLflow’s recent updates add out-of-the-box support for evaluating LLMs—leveraging the strengths of both OpenAI’s robust metrics and Gemini’s powerful generation capabilities. Cloudera AI’s experiment tracking features allow you to use MLflow APIs for LLM evaluation.
MLflow provides an API, mlflow.evaluate(), to help evaluate your LLMs. LLMs can generate text for a variety of tasks, such as question answering, translation, and text summarization. MLflow’s LLM evaluation functionality consists of three main components: a model to evaluate, metrics, and evaluation data. Two categories of metrics are supported. Heuristic-based metrics: for more information and an example of how to use MLflow to evaluate an LLM with heuristic-based metrics, see Using Heuristic-based Metrics.
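As an illustration of heuristic-based metrics, the sketch below evaluates a static table of pre-generated outputs with mlflow.evaluate. The example data is made up, and the exact set of default metrics depends on your MLflow version.

```python
import mlflow
import pandas as pd

# Static evaluation: questions paired with pre-generated model outputs
data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?", "What is Apache Spark?"],
        "outputs": [
            "MLflow is an open-source platform for managing the ML lifecycle.",
            "Apache Spark is a distributed engine for large-scale data processing.",
        ],
    }
)

# model_type="text" enables heuristic metrics such as toxicity and
# readability grade levels; extra_metrics can add further built-in metrics.
results = mlflow.evaluate(
    data=data,
    predictions="outputs",
    model_type="text",
    extra_metrics=[mlflow.metrics.token_count()],
)
print(results.metrics)
```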
LLM-as-a-Judge metrics: LLM-as-a-Judge is a new type of metric that uses LLMs to score the quality of model outputs. It overcomes the limitations of heuristic-based metrics, which often miss nuances like context and semantic accuracy. LLM-as-a-Judge metrics provide a more human-like evaluation for complex language tasks while being more scalable and cost-effective than human evaluation. MLflow provides various built-in LLM-as-a-Judge metrics and supports creating custom metrics with your own prompt, grading criteria, and reference examples. See the LLM-as-a-Judge Metrics section for more details.
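As a sketch of a custom LLM-as-a-Judge metric (for example, the professionalism metric mentioned earlier), one can use mlflow.metrics.genai.make_genai_metric. The definition, grading prompt, and reference example below are illustrative assumptions, not the notebook's exact wording.

```python
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

# Illustrative custom LLM-judged metric: "professionalism", scored 1-5 by a judge model.
professionalism = make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism measures how formal, respectful, and appropriate "
        "the response is for a workplace setting."
    ),
    grading_prompt=(
        "Score 1 if the response is very casual or inappropriate, "
        "3 if it is neutral, and 5 if it is consistently formal and polished."
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is like, totally the coolest thing for ML, dude.",
            score=1,
            justification="The tone is overly casual and uses slang.",
        )
    ],
    model="openai:/gpt-4",          # judge model
    parameters={"temperature": 0.0},
    greater_is_better=True,
)

# The metric can then be passed to mlflow.evaluate via extra_metrics, e.g.:
# results = mlflow.evaluate(model_uri, eval_data, targets="ground_truth",
#                           model_type="question-answering",
#                           extra_metrics=[professionalism])
```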
The notebooks listed below contain step-by-step tutorials on how to use MLflow to evaluate LLMs. The first set of notebooks is centered around evaluating an LLM for question-answering with a prompt engineering approach. The second set is centered around evaluating a RAG system. All the notebooks demonstrate how to use MLflow's built-in metrics such as token_count and toxicity, as well as LLM-judged intelligent metrics such as answer_relevance.

- Learn how to evaluate various LLMs and RAG systems with MLflow, leveraging simple metrics such as toxicity, as well as LLM-judged metrics such as relevance, and even custom LLM-judged metrics such as professionalism.
- Learn how to evaluate various open-source LLMs available on Hugging Face, leveraging MLflow's built-in LLM metrics and experiment tracking to manage models and evaluation results.
To run the Gemini evaluation tutorial, set your OpenAI and Google API keys as environment variables:

```python
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API Key:")
os.environ["GOOGLE_API_KEY"] = getpass("Enter Google API Key:")
```
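With the keys in place, the Gemini predictions generated earlier can be scored against the ground truth. The sketch below assumes the eval_data DataFrame with inputs, ground_truth, and predictions columns from the previous steps and uses the built-in answer_similarity judge; the tutorial's exact metric choices may differ.

```python
import mlflow
from mlflow.metrics.genai import answer_similarity

# Score the static Gemini predictions against the ground truth answers.
# The answer_similarity judge calls an OpenAI model, hence OPENAI_API_KEY above.
with mlflow.start_run(run_name="gemini_evaluation"):
    results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="predictions",
        model_type="question-answering",
        extra_metrics=[answer_similarity(model="openai:/gpt-4")],
    )
    print(results.metrics)
    # Per-row scores and judge justifications:
    print(results.tables["eval_results_table"].head())
```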
People Also Search
- LLM Evaluation with MLflow Example Notebook
- mlflow/examples/llms/RAG/question-generation-retrieval-evaluation.ipynb ...
- Getting Started with MLFlow for LLM Evaluation - MarkTechPost
- Beginning With MLFLOW For The LLM Evaluation - learnopoly.com
- Getting Started with MLflow for Large Language Model (LLM) Evaluation ...
- Evaluating LLM with MLFlow - docs.cloudera.com
- LLM Evaluation Examples - MLflow
- mlflow/examples/llms/README.md at master - GitHub
- 22.Evaluate_a_Hugging_Face_LLM_with_mlflow_evaluate.ipynb - Colab
- Getting Started with MLFlow for LLM Evaluation - Satoshi Source