
Testing and Evaluation

The VideoSDK Agent SDK provides a structured evaluation framework that lets you run controlled tests on individual agent components (Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS)) and collect performance metrics such as latency, accuracy, and stability.

Evaluation Components

To test your agent, use the Evaluation class. This allows you to define different scenarios (called "turns") and run them to see how your agent performs.

Key components include:

  • Evaluation: Runs all your test scenarios.
  • EvalTurn: Represents a single conversational turn, one complete exchange where the user gives input and the agent processes it to provide a response.
  • EvalMetric: Measurements like STT_LATENCY, LLM_LATENCY, etc.
  • LLMAsJudge: Uses an LLM to "judge" the quality of your agent's response.

These are the criteria the Judge can use to evaluate the agent:

| Metric | Description |
| --- | --- |
| REASONING | Explains why the agent responded in a certain way. Useful for debugging logic. |
| RELEVANCE | Checks if the response actually answers the user's question. |
| CLARITY | Checks if the response is easy to understand. |
| SCORE | Gives a numerical rating (0-10) for the quality of the response. |
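
For example, a judge that focuses on relevance and clarity rather than reasoning could be configured as below (a sketch; it assumes RELEVANCE and CLARITY are exposed on LLMAsJudgeMetric, matching the criteria listed above):

judge = LLMAsJudge.google(
    model="gemini-2.5-flash-lite",
    prompt="Is the response relevant to the user's question, and is it easy to understand?",
    checks=[LLMAsJudgeMetric.RELEVANCE, LLMAsJudgeMetric.CLARITY]
)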

Implementation

The following steps explain how to set up a test for your agent.

1. Import Libraries

First, import the necessary modules from the SDK.

import logging
import aiohttp
from videosdk.agents import (
    Evaluation, EvalTurn, EvalMetric, LLMAsJudgeMetric,
    LLMAsJudge, STTEvalConfig, LLMEvalConfig, TTSEvalConfig,
    STTComponent, LLMComponent, TTSComponent, function_tool
)

# Set up logging to see the output
logging.basicConfig(level=logging.INFO)
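
The providers used in the examples below (Deepgram and Google) authenticate through credentials that are not shown here. A common pattern is to keep them in a .env file and load them before creating the evaluation; the variable names below are examples only, so check your provider plugin documentation for the exact names it expects.

import os
from dotenv import load_dotenv  # requires the python-dotenv package

# Load provider credentials from a local .env file (example variable names).
load_dotenv()
for key in ("DEEPGRAM_API_KEY", "GOOGLE_API_KEY"):
    if not os.getenv(key):
        logging.warning("Environment variable %s is not set", key)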

2. Define Tools

If your agent uses tools (like checking the weather), you need to define them here so the evaluation can use them.

@function_tool
async def get_weather(
    latitude: str,
    longitude: str,
):
    """
    Called when the user asks about the weather. Returns the weather for the given location.

    Args:
        latitude: The latitude of the location
        longitude: The longitude of the location
    """
    print("### Getting weather for", latitude, longitude)
    url = f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m"
    weather_data = {}
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                if response.status == 200:
                    data = await response.json()
                    print("Weather data", data)
                    weather_data = {
                        "temperature": data["current"]["temperature_2m"],
                        "temperature_unit": "Celsius",
                    }
                else:
                    print(f"Failed to get weather data, status code: {response.status}")
                    raise Exception(f"Failed to get weather data, status code: {response.status}")
    except Exception as e:
        print(f"Exception in get_weather tool: {e}")
        raise e

    return weather_data
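
If you want to confirm the weather endpoint itself is reachable before wiring the tool into an evaluation, a quick standalone check can be run directly (the coordinates below are just an example, roughly Delhi):

import asyncio
import aiohttp

async def _check_open_meteo():
    # Hit the same Open-Meteo endpoint the tool uses, with example coordinates.
    url = "https://api.open-meteo.com/v1/forecast?latitude=28.61&longitude=77.21&current=temperature_2m"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            print(response.status, await response.json())

asyncio.run(_check_open_meteo())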

3. Setup Evaluation

Create an Evaluation instance. You can specify which metrics you want to track.

eval = Evaluation(
    name="basic-agent-eval",
    include_context=False,
    metrics=[
        EvalMetric.STT_LATENCY,
        EvalMetric.LLM_LATENCY,
        EvalMetric.TTS_LATENCY,
        EvalMetric.END_TO_END_LATENCY
    ],
    output_dir="./reports"
)

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | Name of the evaluation suite. |
| include_context | bool | Whether to include conversation context. |
| metrics | list | List of metrics to calculate (e.g., EvalMetric.STT_LATENCY). |
| output_dir | str | Directory to save the evaluation reports. |

4. Add Test Scenarios (Turns)

Add "turns" to your evaluation. A turn simulates a single complete interaction loop (Input -> Processing -> Response) between the user and the agent. You can mix and match mock inputs (text) and real inputs (audio files).

Scenario 1: Complex Interaction

Here, we test the full pipeline:

  1. STT: Transcribes an audio file (sample.wav).
  2. LLM: Receives a mock text input (overriding the STT output for this test) and uses the get_weather tool.
  3. TTS: Generates speech from a mock text string.
  4. Judge: An LLM reviews the answer to see if it is relevant.
Note

Only .wav files are supported for STT evaluation. Please ensure your audio files are in this format.
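
If you are not sure whether an audio file is a valid .wav, a quick check with Python's standard wave module (the file name is just an example) will raise an error for non-WAV input:

import wave

# Opens the file as WAV and prints basic properties; raises wave.Error for non-WAV input.
with wave.open("./sample.wav", "rb") as audio:
    print(audio.getnchannels(), audio.getframerate(), audio.getnframes())

The full turn for this scenario looks like this: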

eval.add_turn(
    EvalTurn(
        stt=STTComponent.deepgram(
            STTEvalConfig(file_path="./sample.wav")
        ),
        llm=LLMComponent.google(
            LLMEvalConfig(
                model="gemini-2.5-flash-lite",
                use_stt_output=False,
                mock_input="write one paragraph about Water and get weather of Delhi",
                tools=[get_weather]
            )
        ),
        tts=TTSComponent.google(
            TTSEvalConfig(
                model="en-US-Standard-A",
                use_llm_output=False,
                mock_input="Peter Piper picked a peck of pickled peppers"
            )
        ),
        judge=LLMAsJudge.google(
            model="gemini-2.5-flash-lite",
            prompt="Can you evaluate the agent's response based on the following criteria: Is it relevant to the user input?",
            checks=[LLMAsJudgeMetric.REASONING, LLMAsJudgeMetric.SCORE]
        )
    )
)

Configuration Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| file_path | str | Path to the audio file. Note: Only .wav files are supported. |
| model | str | Model name used by the component (e.g., gemini-2.5-flash-lite for the LLM, en-US-Standard-A for TTS). |
| use_stt_output | bool | If True, the LLM receives the STT transcript; if False, it uses mock_input instead. |
| mock_input | str | Mock text input used when the previous component's output is not consumed. |
| tools | list | Function tools (e.g., get_weather) the LLM may call during the turn. |
| use_llm_output | bool | If True, the TTS receives the LLM's response; if False, it uses mock_input instead. |

Scenario 2: End-to-End Flow

This scenario uses the output from one step as the input for the next. The STT output is fed into the LLM, and the LLM output is fed into the TTS.

eval.add_turn(
    EvalTurn(
        stt=STTComponent.deepgram(
            STTEvalConfig(file_path="./Sports.wav")
        ),
        llm=LLMComponent.google(
            LLMEvalConfig(
                model="gemini-2.5-flash-lite",
                use_stt_output=True,  # Use the text from STT
            )
        ),
        tts=TTSComponent.google(
            TTSEvalConfig(
                model="en-US-Standard-A",
                use_llm_output=True  # Use the text from LLM
            )
        ),
        judge=LLMAsJudge.google(
            model="gemini-2.5-flash-lite",
            prompt="Is the response relevant?",
            checks=[LLMAsJudgeMetric.REASONING, LLMAsJudgeMetric.SCORE]
        )
    )
)

Scenario 3: Individual Component Testing

You can also test components in isolation.

eval.add_turn(
    EvalTurn(
        stt=STTComponent.deepgram(
            STTEvalConfig(file_path="./Sports.wav")
        )
    )
)
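
The same applies to the other components. For instance, an LLM-only turn driven by a mock prompt (a sketch reusing the configuration fields shown in the scenarios above) could look like:

eval.add_turn(
    EvalTurn(
        llm=LLMComponent.google(
            LLMEvalConfig(
                model="gemini-2.5-flash-lite",
                use_stt_output=False,
                mock_input="Summarize the rules of cricket in two sentences."
            )
        )
    )
)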

5. Run and Save Results

Finally, run the evaluation and save the report; it is written to the output_dir configured earlier (./reports in this example).

results = eval.run()
results.save()
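
After the run completes, you can inspect whatever report files were written to the configured output directory, for example:

from pathlib import Path

# Print the report files generated in the output_dir configured above.
for report_file in Path("./reports").iterdir():
    print(report_file)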
