
Testing and Evaluation

The VideoSDK Agent SDK provides a structured evaluation framework that lets you run controlled tests on individual agent components (Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS)) and collect performance metrics such as latency, accuracy, and stability.

Evaluation Components

To test your agent, use the Evaluation class. This allows you to define different scenarios (called "turns") and run them to see how your agent performs.

Key components include:

  • Evaluation: Runs all your test scenarios.
  • EvalTurn: Represents a single conversational turn, one complete exchange where the user gives input and the agent processes it to provide a response.
  • EvalMetric: Measurements like STT_LATENCY, LLM_LATENCY, etc.
  • LLMAsJudge: Uses an LLM to "judge" the quality of your agent's response.

These are the criteria the Judge can use to evaluate the agent:

| Metric | Description |
| --- | --- |
| REASONING | Explains why the agent responded in a certain way. Useful for debugging logic. |
| RELEVANCE | Checks if the response actually answers the user's question. |
| CLARITY | Checks if the response is easy to understand. |
| SCORE | Gives a numerical rating (0-10) for the quality of the response. |
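
For example, a judge that focuses on relevance and clarity rather than reasoning could be configured as below (a sketch; it assumes RELEVANCE and CLARITY are exposed on LLMAsJudgeMetric, matching the criteria listed above):

judge = LLMAsJudge.google(
    model="gemini-2.5-flash-lite",
    prompt="Is the response relevant to the user's question, and is it easy to understand?",
    checks=[LLMAsJudgeMetric.RELEVANCE, LLMAsJudgeMetric.CLARITY]
)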

Implementation

The following steps explain how to set up a test for your agent.

1. Import Libraries

First, import the necessary modules from the SDK.

import logging
import aiohttp
from videosdk.agents import (
    Evaluation, EvalTurn, EvalMetric, LLMAsJudgeMetric,
    LLMAsJudge, STTEvalConfig, LLMEvalConfig, TTSEvalConfig,
    STTComponent, LLMComponent, TTSComponent, function_tool
)

# Set up logging to see the output
logging.basicConfig(level=logging.INFO)
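
The providers used in the examples below (Deepgram and Google) authenticate through credentials that are not shown here. A common pattern is to keep them in a .env file and load them before creating the evaluation; the variable names below are examples only, so check your provider plugin documentation for the exact names it expects.

import os
from dotenv import load_dotenv  # requires the python-dotenv package

# Load provider credentials from a local .env file (example variable names).
load_dotenv()
for key in ("DEEPGRAM_API_KEY", "GOOGLE_API_KEY"):
    if not os.getenv(key):
        logging.warning("Environment variable %s is not set", key)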

2. Define Tools

If your agent uses tools (like checking the weather), you need to define them here so the evaluation can use them.

@function_tool
async def get_weather(
    latitude: str,
    longitude: str,
):
    """
    Called when the user asks about the weather. Returns the weather for the given location.

    Args:
        latitude: The latitude of the location
        longitude: The longitude of the location
    """
    print("### Getting weather for", latitude, longitude)
    url = f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m"
    weather_data = {}
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                if response.status == 200:
                    data = await response.json()
                    print("Weather data", data)
                    weather_data = {
                        "temperature": data["current"]["temperature_2m"],
                        "temperature_unit": "Celsius",
                    }
                else:
                    print(f"Failed to get weather data, status code: {response.status}")
                    raise Exception(f"Failed to get weather data, status code: {response.status}")
    except Exception as e:
        print(f"Exception in get_weather tool: {e}")
        raise e

    return weather_data
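
If you want to confirm the weather endpoint itself is reachable before wiring the tool into an evaluation, a quick standalone check can be run directly (the coordinates below are just an example, roughly Delhi):

import asyncio
import aiohttp

async def _check_open_meteo():
    # Hit the same Open-Meteo endpoint the tool uses, with example coordinates.
    url = "https://api.open-meteo.com/v1/forecast?latitude=28.61&longitude=77.21&current=temperature_2m"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            print(response.status, await response.json())

asyncio.run(_check_open_meteo())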

3. Setup Evaluation

Create an Evaluation instance. You can specify which metrics you want to track.

eval = Evaluation(
    name="basic-agent-eval",
    include_context=False,
    metrics=[
        EvalMetric.STT_LATENCY,
        EvalMetric.LLM_LATENCY,
        EvalMetric.TTS_LATENCY,
        EvalMetric.END_TO_END_LATENCY
    ],
    output_dir="./reports"
)

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | Name of the evaluation suite. |
| include_context | bool | Whether to include conversation context. |
| metrics | list | List of metrics to calculate (e.g., EvalMetric.STT_LATENCY). |
| output_dir | str | Directory to save the evaluation reports. |

4. Add Test Scenarios (Turns)

Add "turns" to your evaluation. A turn simulates a single complete interaction loop (Input -> Processing -> Response) between the user and the agent. You can mix and match mock inputs (text) and real inputs (audio files).

Scenario 1: Complex Interaction

Here, we test the full pipeline:

  1. STT: Transcribes an audio file (sample.wav).
  2. LLM: Receives a mock text input (overriding the STT output for this test) and uses the get_weather tool.
  3. TTS: Generates speech from a mock text string.
  4. Judge: An LLM reviews the answer to see if it is relevant.
Note

Only .wav files are supported for STT evaluation. Please ensure your audio files are in this format.
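
If you are not sure whether an audio file is a valid .wav, a quick check with Python's standard wave module (the file name is just an example) will raise an error for non-WAV input:

import wave

# Opens the file as WAV and prints basic properties; raises wave.Error for non-WAV input.
with wave.open("./sample.wav", "rb") as audio:
    print(audio.getnchannels(), audio.getframerate(), audio.getnframes())

The full turn for this scenario looks like this: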

eval.add_turn(
    EvalTurn(
        stt=STTComponent.deepgram(
            STTEvalConfig(file_path="./sample.wav")
        ),
        llm=LLMComponent.google(
            LLMEvalConfig(
                model="gemini-2.5-flash-lite",
                use_stt_output=False,
                mock_input="write one paragraph about Water and get weather of Delhi",
                tools=[get_weather]
            )
        ),
        tts=TTSComponent.google(
            TTSEvalConfig(
                model="en-US-Standard-A",
                use_llm_output=False,
                mock_input="Peter Piper picked a peck of pickled peppers"
            )
        ),
        judge=LLMAsJudge.google(
            model="gemini-2.5-flash-lite",
            prompt="Can you evaluate the agent's response based on the following criteria: Is it relevant to the user input?",
            checks=[LLMAsJudgeMetric.REASONING, LLMAsJudgeMetric.SCORE]
        )
    )
)

Configuration Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| file_path | str | Path to the audio file. Note: Only .wav files are supported. |
| model | str | Model name used by the component (e.g., gemini-2.5-flash-lite for the LLM, en-US-Standard-A for TTS). |
| use_stt_output | bool | If True, the LLM receives the STT transcript; if False, it uses mock_input instead. |
| mock_input | str | Mock text input used when the previous component's output is not consumed. |
| tools | list | Function tools (e.g., get_weather) the LLM may call during the turn. |
| use_llm_output | bool | If True, the TTS receives the LLM's response; if False, it uses mock_input instead. |

Scenario 2: End-to-End Flow

This scenario uses the output from one step as the input for the next. The STT output is fed into the LLM, and the LLM output is fed into the TTS.

eval.add_turn(
    EvalTurn(
        stt=STTComponent.deepgram(
            STTEvalConfig(file_path="./Sports.wav")
        ),
        llm=LLMComponent.google(
            LLMEvalConfig(
                model="gemini-2.5-flash-lite",
                use_stt_output=True,  # Use the text from STT
            )
        ),
        tts=TTSComponent.google(
            TTSEvalConfig(
                model="en-US-Standard-A",
                use_llm_output=True  # Use the text from LLM
            )
        ),
        judge=LLMAsJudge.google(
            model="gemini-2.5-flash-lite",
            prompt="Is the response relevant?",
            checks=[LLMAsJudgeMetric.REASONING, LLMAsJudgeMetric.SCORE]
        )
    )
)

Scenario 3: Individual Component Testing

You can also test components in isolation.

eval.add_turn(
    EvalTurn(
        stt=STTComponent.deepgram(
            STTEvalConfig(file_path="./Sports.wav")
        )
    )
)
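
The same applies to the other components. For instance, an LLM-only turn driven by a mock prompt (a sketch reusing the configuration fields shown in the scenarios above) could look like:

eval.add_turn(
    EvalTurn(
        llm=LLMComponent.google(
            LLMEvalConfig(
                model="gemini-2.5-flash-lite",
                use_stt_output=False,
                mock_input="Summarize the rules of cricket in two sentences."
            )
        )
    )
)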

5. Run and Save Results

Finally, run the evaluation and save the report; it is written to the output_dir configured earlier (./reports in this example).

results = eval.run()
results.save()
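
After the run completes, you can inspect whatever report files were written to the configured output directory, for example:

from pathlib import Path

# Print the report files generated in the output_dir configured above.
for report_file in Path("./reports").iterdir():
    print(report_file)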
