AI Voice Agent Quick Start
This guide will help you integrate an AI-powered voice agent into your VideoSDK meetings.
Prerequisites​
Before you begin, ensure you have:
- A VideoSDK authentication token (generate from app.videosdk.live)
- A VideoSDK meeting ID (you can generate one using the Create Room API or through the VideoSDK dashboard)
- Python 3.12 or higher
- API Key: An API key corresponding to your chosen model provider:
- OpenAI API key (for OpenAI models)
- Google Gemini API key (for Gemini's LiveAPI)
- AWS credentials (aws_access_key_id and aws_secret_access_key) for Amazon Nova Sonic
Installation​
Create and activate a virtual environment with Python 3.12 or higher:
- macOS/Linux
- Windows
python3.12 -m venv venv
source venv/bin/activate
python -m venv venv
venv\Scripts\activate
First, install the VideoSDK AI Agent package using pip:
pip install videosdk-agents
Now it's time to install the plugin for your chosen AI model. Each plugin is tailored for seamless integration with the VideoSDK AI Agent SDK.
- OpenAI
- Gemini(LiveAPI)
- AWS-Nova-Sonic
pip install "videosdk-plugins-openai"
pip install "videosdk-plugins-google"
pip install "videosdk-plugins-aws"
Environment Setup​
It's recommended to use environment variables for secure storage of API keys, secret tokens, and authentication tokens. Create a .env file in your project root:
- OpenAI
- Gemini(LiveAPI)
- AWS-Nova-Sonic
VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token
OPENAI_API_KEY=your_openai_api_key
VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token
GOOGLE_API_KEY=your_google_api_key
VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_DEFAULT_REGION=your_aws_region
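The SDK reads these values from the process environment, so the .env file must be loaded before your script runs. The python-dotenv package (`pip install python-dotenv`, then `load_dotenv()`) is the usual choice; as a dependency-free sketch, a minimal loader might look like this (the parsing rules here are an assumption and cover only simple `KEY=value` lines):

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> dict:
    """Minimal .env loader: KEY=value lines; blanks and '#' comments are ignored."""
    values = {}
    env_path = Path(path)
    if env_path.exists():
        for line in env_path.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                values[key.strip()] = value.strip()
    os.environ.update(values)  # make the values visible to the SDK
    return values
```

For anything beyond simple key/value pairs (quoting, multiline values, variable expansion), prefer python-dotenv's `load_dotenv()` over a hand-rolled parser.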
Generating a VideoSDK Meeting ID​
Before your AI agent can join a meeting, you'll need to create a meeting ID. You can generate one using the VideoSDK Create Room API:
Using cURL​
curl -X POST https://api.videosdk.live/v2/rooms \
-H "Authorization: YOUR_JWT_TOKEN_HERE" \
-H "Content-Type: application/json"
For more details on the Create Room API, refer to the VideoSDK documentation.
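The same call can be made from Python using only the standard library. A small sketch is shown below; the helper names are illustrative, and the `roomId` response field follows the Create Room API's documented response shape:

```python
import json
import urllib.request

CREATE_ROOM_URL = "https://api.videosdk.live/v2/rooms"

def build_create_room_request(auth_token: str) -> urllib.request.Request:
    """Build the POST request for the Create Room endpoint."""
    return urllib.request.Request(
        CREATE_ROOM_URL,
        method="POST",
        headers={
            "Authorization": auth_token,
            "Content-Type": "application/json",
        },
    )

def create_meeting_id(auth_token: str) -> str:
    """Call the Create Room API and return the new meeting's roomId."""
    with urllib.request.urlopen(build_create_room_request(auth_token)) as resp:
        return json.loads(resp.read())["roomId"]
```

With a token loaded from the environment, `create_meeting_id(os.environ["VIDEOSDK_AUTH_TOKEN"])` returns a meeting ID you can pass to both the agent and your client application.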
1. Creating a Custom Agent​
First, let's create a custom voice agent by inheriting from the base Agent class:
from videosdk.agents import Agent, function_tool
# External tool (standalone function; full definition appears in the next section)
# async def get_weather(latitude: str, longitude: str): ...
class VoiceAgent(Agent):
def __init__(self):
super().__init__(
instructions="You are a helpful voice assistant that can answer questions and help with tasks.",
tools=[get_weather] # You can register any external tool defined outside of this scope
)
async def on_enter(self) -> None:
"""Called when the agent first joins the meeting"""
await self.session.say("Hi there! How can I help you today?")
This code defines a basic voice agent with:
- Custom instructions that define the agent's personality and capabilities
- A tools list that registers the external get_weather tool
- An entry message spoken when the agent joins the meeting
2. Implementing Function Tools​
Function tools allow your agent to perform actions beyond conversation. There are two ways to define tools:
- External Tools: Defined as standalone functions outside the agent class and registered via the tools argument in the agent's constructor.
- Internal Tools: Defined as methods inside the agent class and decorated with @function_tool.
Below is an example of both:
import aiohttp
# External Function Tools
@function_tool
async def get_weather(latitude: str, longitude: str):
"""Called when the user asks about the weather. This function will return the weather for
the given location. When given a location, please estimate the latitude and longitude of the
location and do not ask the user for them.
Args:
latitude: The latitude of the location
longitude: The longitude of the location
"""
print(f"Getting weather for {latitude}, {longitude}")
url = f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m"
async with aiohttp.ClientSession() as session:
async with session.get(url) as response:
if response.status == 200:
data = await response.json()
return {
"temperature": data["current"]["temperature_2m"],
"temperature_unit": "Celsius",
}
else:
raise Exception(
f"Failed to get weather data, status code: {response.status}"
)
class VoiceAgent(Agent):
# ... previous code ...
# Internal Function Tools
@function_tool
async def get_horoscope(self, sign: str) -> dict:
"""Get today's horoscope for a given zodiac sign.
Args:
sign: The zodiac sign (e.g., Aries, Taurus, Gemini, etc.)
"""
horoscopes = {
"Aries": "Today is your lucky day!",
"Taurus": "Focus on your goals today.",
"Gemini": "Communication will be important today.",
}
return {
"sign": sign,
"horoscope": horoscopes.get(sign, "The stars are aligned for you today!"),
}
- Use external tools for reusable, standalone functions (registered via tools=[...]).
- Use internal tools for agent-specific logic as class methods.
- Both must be decorated with @function_tool for the agent to recognize and use them.
3. Setting Up the Pipeline​
The pipeline connects your agent to an AI model. In this example, we're using OpenAI's real-time model:
- OpenAI
- Gemini(LiveAPI)
- AWS-Nova-Sonic
from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig
from videosdk.agents import RealTimePipeline
from openai.types.beta.realtime.session import TurnDetection
async def start_session(context: dict):
# Initialize the AI model
model = OpenAIRealtime(
model="gpt-4o-realtime-preview",
# When OPENAI_API_KEY is set in .env - DON'T pass api_key parameter
api_key="sk-proj-XXXXXXXXXXXXXXXXXXXX",
config=OpenAIRealtimeConfig(
voice="alloy", # alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, and verse
modalities=["text", "audio"],
turn_detection=TurnDetection(
type="server_vad",
threshold=0.5,
prefix_padding_ms=300,
silence_duration_ms=200,
),
tool_choice="auto"
)
)
pipeline = RealTimePipeline(model=model)
# Continue to the next steps...
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from videosdk.agents import RealTimePipeline
async def start_session(context: dict):
# Initialize the AI model
model = GeminiRealtime(
model="gemini-2.0-flash-live-001",
# When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter
api_key="AKZSXXXXXXXXXXXXXXXXXXXX",
config=GeminiLiveConfig(
voice="Leda", # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr.
response_modalities=["AUDIO"]
)
)
pipeline = RealTimePipeline(model=model)
# Continue to the next steps...
from videosdk.plugins.aws import NovaSonicRealtime, NovaSonicConfig
from videosdk.agents import RealTimePipeline
async def start_session(context: dict):
# Initialize the AI model
model = NovaSonicRealtime(
model="amazon.nova-sonic-v1:0",
# When AWS credentials and region are set in .env - DON'T pass credential parameters
region="us-east-1", # Currently, only "us-east-1" is supported for Amazon Nova Sonic.
aws_access_key_id="AWSXXXXXXXXXXXXXXXXXXXX",
aws_secret_access_key="AQSXXXXXXXXXXXXXXXXXXXX",
config=NovaSonicConfig(
voice="tiffany", # "tiffany","matthew", "amy"
temperature=0.7,
top_p=0.9,
max_tokens=1024
)
)
pipeline = RealTimePipeline(model=model)
# Continue to the next steps...
To initiate a conversation with Amazon Nova Sonic, the user must speak first. The model listens for user input to begin the interaction.
When using a .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code.
4. Assembling and Starting the Agent Session​
Now, let's put everything together and start the agent session:
import asyncio
from videosdk.agents import AgentSession
async def start_session(context: dict):
# ... previous setup code ...
# Create the agent session
session = AgentSession(
agent=VoiceAgent(),
pipeline=pipeline,
context=context
)
try:
# Start the session
await session.start()
# Keep the session running until manually terminated
await asyncio.Event().wait()
finally:
# Clean up resources when done
await session.close()
if __name__ == "__main__":
def make_context():
# When VIDEOSDK_AUTH_TOKEN is set in .env - DON'T include videosdk_auth
return {
"meetingId": "your_actual_meeting_id_here", # Replace with actual meeting ID
"name": "AI Voice Agent",
"videosdk_auth": "your_videosdk_auth_token_here" # Replace with actual token
}
asyncio.run(start_session(context=make_context()))
5. Connecting with VideoSDK Client Applications​
After setting up your AI Agent, you'll need a client application to connect with it. You can use any of the VideoSDK quickstart examples to create a client that joins the same meeting:
When setting up your client application, make sure to use the same meeting ID that your AI Agent is using.
6. Running the Project​
Once you have completed the setup, you can run your AI Voice Agent project using Python. Make sure your .env file is properly configured and all dependencies are installed.
python main.py
Get started quickly with the Quick Start Example for the VideoSDK AI Agent SDK — everything you need to build your first AI agent fast.