Testing and Evaluation
The VideoSDK Agent SDK provides a structured evaluation framework that lets you run controlled tests on the individual components of your agent: Speech-to-Text (STT), the Large Language Model (LLM), and Text-to-Speech (TTS). For each test you can collect performance metrics such as latency, accuracy, and stability.
Evaluation Components
To test your agent, use the Evaluation class. This allows you to define different scenarios (called "turns") and run them to see how your agent performs.
Key components include:
- Evaluation: Runs all your test scenarios.
- EvalTurn: Represents a single conversational turn, one complete exchange where the user gives input and the agent processes it to provide a response.
- EvalMetric: Measurements such as STT_LATENCY, LLM_LATENCY, and so on.
- LLMAsJudge: Uses an LLM to "judge" the quality of your agent's response.
These are the criteria the Judge can use to evaluate the agent:
| Metric | Description |
|---|---|
| REASONING | Explains why the agent responded in a certain way. Useful for debugging logic. |
| RELEVANCE | Checks if the response actually answers the user's question. |
| CLARITY | Checks if the response is easy to understand. |
| SCORE | Gives a numerical rating (0-10) for the quality of the response. |
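For example, a judge configured to return reasoning, a relevance check, a clarity check, and a numeric score might look like the following. This is a minimal sketch based on the LLMAsJudge.google(...) constructor used later in this guide; it assumes LLMAsJudgeMetric exposes one member per criterion in the table above.

judge = LLMAsJudge.google(
    model="gemini-2.5-flash-lite",
    prompt="Evaluate the agent's response: is it relevant to the user input and easy to understand?",
    checks=[
        LLMAsJudgeMetric.REASONING,
        LLMAsJudgeMetric.RELEVANCE,
        LLMAsJudgeMetric.CLARITY,
        LLMAsJudgeMetric.SCORE,
    ]
)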
Implementation
The following steps explain how to set up a test for your agent.
1. Import Libraries
First, import the necessary modules from the SDK.
import logging
import aiohttp
from videosdk.agents import (
    Evaluation, EvalTurn, EvalMetric, LLMAsJudgeMetric,
    LLMAsJudge, STTEvalConfig, LLMEvalConfig, TTSEvalConfig,
    STTComponent, LLMComponent, TTSComponent, function_tool
)
# Set up logging to see the output
logging.basicConfig(level=logging.INFO)
2. Define Tools
If your agent uses tools (like checking the weather), you need to define them here so the evaluation can use them.
@function_tool
async def get_weather(
    latitude: str,
    longitude: str,
):
    """
    Called when the user asks about the weather. Returns the weather for the given location.
    Args:
        latitude: The latitude of the location
        longitude: The longitude of the location
    """
    print("### Getting weather for", latitude, longitude)
    url = f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m"
    weather_data = {}
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                if response.status == 200:
                    data = await response.json()
                    print("Weather data", data)
                    weather_data = {
                        "temperature": data["current"]["temperature_2m"],
                        "temperature_unit": "Celsius",
                    }
                else:
                    print(f"Failed to get weather data, status code: {response.status}")
                    raise Exception(f"Failed to get weather data, status code: {response.status}")
    except Exception as e:
        print(f"Exception in get_weather tool: {e}")
        raise e
    return weather_data
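Tools are plain async functions decorated with @function_tool, and you can register more than one by passing them together in the tools list shown later (for example, tools=[get_weather, get_current_time]). The get_current_time tool below is a hypothetical extra example, not part of the SDK; it only illustrates the pattern.

from datetime import datetime, timezone

# Hypothetical second tool (not part of the SDK): any async function decorated
# with @function_tool can be passed to the evaluation alongside get_weather.
@function_tool
async def get_current_time():
    """
    Called when the user asks for the current time. Returns the current UTC time as an ISO 8601 string.
    """
    return datetime.now(timezone.utc).isoformat()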
3. Setup Evaluation
Create an Evaluation instance. You can specify which metrics you want to track.
eval = Evaluation(
    name="basic-agent-eval",
    include_context=False,
    metrics=[
        EvalMetric.STT_LATENCY,
        EvalMetric.LLM_LATENCY,
        EvalMetric.TTS_LATENCY,
        EvalMetric.END_TO_END_LATENCY
    ],
    output_dir="./reports"
)
Parameters:
| Parameter | Type | Description |
|---|---|---|
| name | str | Name of the evaluation suite. |
| include_context | bool | Whether to include conversation context. |
| metrics | list | List of metrics to calculate (e.g., EvalMetric.STT_LATENCY). |
| output_dir | str | Directory to save the evaluation reports. |
4. Add Test Scenarios (Turns)
Add "turns" to your evaluation. A turn simulates a single complete interaction loop (Input -> Processing -> Response) between the user and the agent. You can mix and match mock inputs (text) and real inputs (audio files).
Scenario 1: Complex Interaction
Here, we test the full pipeline:
- STT: Transcribes an audio file (sample.wav).
- LLM: Receives a mock text input (overriding the STT output for this test) and uses the get_weather tool.
- TTS: Generates speech from a mock text string.
- Judge: An LLM reviews the answer to see if it is relevant.
Only .wav files are supported for STT evaluation. Please ensure your audio files are in this format.
eval.add_turn(
    EvalTurn(
        stt=STTComponent.deepgram(
            STTEvalConfig(file_path="./sample.wav")
        ),
        llm=LLMComponent.google(
            LLMEvalConfig(
                model="gemini-2.5-flash-lite",
                use_stt_output=False,
                mock_input="write one paragraph about Water and get weather of Delhi",
                tools=[get_weather]
            )
        ),
        tts=TTSComponent.google(
            TTSEvalConfig(
                model="en-US-Standard-A",
                use_llm_output=False,
                mock_input="Peter Piper picked a peck of pickled peppers"
            )
        ),
        judge=LLMAsJudge.google(
            model="gemini-2.5-flash-lite",
            prompt="Can you evaluate the agent's response based on the following criteria: Is it relevant to the user input?",
            checks=[LLMAsJudgeMetric.REASONING, LLMAsJudgeMetric.SCORE]
        )
    )
)
Configuration Parameters:
STTEvalConfig
| Parameter | Type | Description |
|---|---|---|
| file_path | str | Path to the audio file. Note: Only .wav files are supported. |

LLMEvalConfig
| Parameter | Type | Description |
|---|---|---|
| model | str | The LLM model to use (e.g., gemini-2.5-flash-lite). |
| use_stt_output | bool | If True, uses the output from the STT stage as input. |
| mock_input | str | Text input to use if use_stt_output is False. |
| tools | list | List of tool functions available to the LLM. |

TTSEvalConfig
| Parameter | Type | Description |
|---|---|---|
| model | str | The TTS model to use. |
| use_llm_output | bool | If True, uses the output from the LLM stage as input. |
| mock_input | str | Text input to use if use_llm_output is False. |

LLMAsJudge
| Parameter | Type | Description |
|---|---|---|
| model | str | The LLM model to use for judging. |
| prompt | str | The prompt/criteria for the judge. |
| checks | list | List of metrics to check (e.g., LLMAsJudgeMetric.REASONING, LLMAsJudgeMetric.SCORE). |
Scenario 2: End-to-End Flow
This scenario uses the output from one step as the input for the next. The STT output is fed into the LLM, and the LLM output is fed into the TTS.
eval.add_turn(
    EvalTurn(
        stt=STTComponent.deepgram(
            STTEvalConfig(file_path="./Sports.wav")
        ),
        llm=LLMComponent.google(
            LLMEvalConfig(
                model="gemini-2.5-flash-lite",
                use_stt_output=True,  # Use the text from STT
            )
        ),
        tts=TTSComponent.google(
            TTSEvalConfig(
                model="en-US-Standard-A",
                use_llm_output=True  # Use the text from LLM
            )
        ),
        judge=LLMAsJudge.google(
            model="gemini-2.5-flash-lite",
            prompt="Is the response relevant?",
            checks=[LLMAsJudgeMetric.REASONING, LLMAsJudgeMetric.SCORE]
        )
    )
)
Scenario 3: Individual Component Testing
You can also test components in isolation.
STT Only:

eval.add_turn(
    EvalTurn(
        stt=STTComponent.deepgram(
            STTEvalConfig(file_path="./Sports.wav")
        )
    )
)

LLM Only:

eval.add_turn(
    EvalTurn(
        llm=LLMComponent.google(
            LLMEvalConfig(
                model="gemini-2.5-flash-lite",
                use_stt_output=False,
                mock_input="write one paragraph about trees",
            )
        )
    )
)

TTS Only:

eval.add_turn(
    EvalTurn(
        tts=TTSComponent.google(
            TTSEvalConfig(
                model="en-US-Standard-A",
                use_llm_output=False,
                mock_input="A big black bug bit a big black bear, made the big black bear bleed blood."
            )
        )
    )
)
5. Run and Save Results
Finally, run the evaluation and save the results. The report is written to the output_dir configured in step 3 (./reports in this example).
results = eval.run()
results.save()
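If you want to confirm what was written, you can list the contents of the report directory afterwards. This is just a convenience sketch using the ./reports path configured above; the exact file names and format depend on the SDK version.

from pathlib import Path

# List whatever report files the evaluation wrote to ./reports.
for report_file in sorted(Path("./reports").glob("*")):
    print(report_file.name)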
Examples - Try It Out Yourself
Got a question? Ask us on Discord.