AI Voice Agent Quick Start
This guide will help you integrate an AI-powered voice agent into your VideoSDK meetings.
Prerequisites
Before you begin, ensure you have:
- A VideoSDK authentication token (generate from app.videosdk.live)
- A VideoSDK meeting ID (you can generate one using the Create Room API)
- An OpenAI API key (or Gemini API key if using Google's models)
Installation
First, install the VideoSDK AI Agent package using pip:
```bash
pip install videosdk-agents
```
Environment Setup
It's recommended to use environment variables for secure storage of API keys and tokens. Create a `.env` file in your project root:

```
OPENAI_API_KEY=your_openai_api_key
VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token
GOOGLE_API_KEY=your_google_api_key
GOOGLE_APPLICATION_CREDENTIALS=path_to_google_credentials_json
```
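To make these values available to your script, load the file at startup. Here's a minimal sketch that assumes the `python-dotenv` package is installed (`pip install python-dotenv`); the SDK may also read these variables from the environment on its own:

```python
import os

from dotenv import load_dotenv

# Read the key-value pairs from .env into the process environment
load_dotenv()

VIDEOSDK_AUTH_TOKEN = os.getenv("VIDEOSDK_AUTH_TOKEN")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
```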
Generating a VideoSDK Meeting ID
Before your AI agent can join a meeting, you'll need to create a meeting ID. You can generate one using the VideoSDK Create Room API:
Using cURL
```bash
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_JWT_TOKEN_HERE" \
  -H "Content-Type: application/json"
```
For more details on the Create Room API, refer to the VideoSDK documentation.
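If you'd rather create the room from Python, the same request looks roughly like this. This is a sketch based on the cURL call above; it assumes the `requests` package is installed and that the response JSON contains the new room's ID under a `roomId` field:

```python
import os

import requests

# Same endpoint and headers as the cURL example above
response = requests.post(
    "https://api.videosdk.live/v2/rooms",
    headers={
        "Authorization": os.getenv("VIDEOSDK_AUTH_TOKEN"),
        "Content-Type": "application/json",
    },
)
response.raise_for_status()

# Assumption: the response body exposes the new room's ID as "roomId"
meeting_id = response.json()["roomId"]
print(meeting_id)
```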
1. Creating a Custom Agent
First, let's create a custom voice agent by inheriting from the base `Agent` class:
```python
from videosdk.agents import Agent, AgentState, function_tool

class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions and help with tasks.",
        )
        # Register any tools the agent can use
        self.register_tools([self.get_weather, self.get_horoscope])

    async def on_enter(self) -> None:
        """Called when the agent first joins the meeting"""
        await self.session.say("Hi there! How can I help you today?")
```
This code defines a basic voice agent with:
- Custom instructions that define the agent's personality and capabilities
- An entry message spoken when the agent joins a meeting
- Registered tools (implemented in the next step) that extend what the agent can do
2. Implementing Function Tools
Function tools allow your agent to perform actions beyond conversation. Let's add two example tools:
```python
import aiohttp

class VoiceAgent(Agent):
    # ... previous code ...

    @function_tool
    async def get_weather(self, latitude: str, longitude: str):
        """Called when the user asks about the weather. This function will return the weather for
        the given location. When given a location, please estimate the latitude and longitude of the
        location and do not ask the user for them.

        Args:
            latitude: The latitude of the location
            longitude: The longitude of the location
        """
        print(f"Getting weather for {latitude}, {longitude}")
        url = f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m"
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                if response.status == 200:
                    data = await response.json()
                    return {
                        "temperature": data["current"]["temperature_2m"],
                        "temperature_unit": "Celsius",
                    }
                else:
                    raise Exception(
                        f"Failed to get weather data, status code: {response.status}"
                    )

    @function_tool
    async def get_horoscope(self, sign: str) -> dict:
        """Get today's horoscope for a given zodiac sign.

        Args:
            sign: The zodiac sign (e.g., Aries, Taurus, Gemini, etc.)
        """
        horoscopes = {
            "Aries": "Today is your lucky day!",
            "Taurus": "Focus on your goals today.",
            "Gemini": "Communication will be important today.",
        }
        return {
            "sign": sign,
            "horoscope": horoscopes.get(sign, "The stars are aligned for you today!"),
        }
```
Each function tool:
- Is decorated with `@function_tool`
- Includes a detailed docstring that helps the AI understand when and how to use the tool
- Accepts parameters from the AI and returns structured data
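Because each tool is ultimately an async method on your agent, you can sanity-check it outside of a meeting. The sketch below assumes the `@function_tool` decorator leaves the method directly awaitable; verify this against your SDK version:

```python
import asyncio

async def main():
    agent = VoiceAgent()
    # Call the tool directly with the arguments the model would supply
    result = await agent.get_horoscope(sign="Gemini")
    print(result)  # {'sign': 'Gemini', 'horoscope': 'Communication will be important today.'}

asyncio.run(main())
```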
3. Setting Up the Pipeline
The pipeline connects your agent to an AI model. In this example, we're using OpenAI's real-time model:
```python
from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig
from videosdk.agents import RealTimePipeline
from openai.types.beta.realtime.session import InputAudioTranscription, TurnDetection

async def start_session(jobctx):
    # Initialize the AI model
    model = OpenAIRealtime(
        model="gpt-4o-realtime-preview",
        config=OpenAIRealtimeConfig(
            modalities=["text", "audio"],
            input_audio_transcription=InputAudioTranscription(
                model="whisper-1"
            ),
            turn_detection=TurnDetection(
                type="server_vad",
                threshold=0.5,
                prefix_padding_ms=300,
                silence_duration_ms=200,
            ),
            tool_choice="auto",
        ),
    )

    pipeline = RealTimePipeline(model=model)

    # Continue to the next steps...
```
4. Assembling and Starting the Agent Session
Now, let's put everything together and start the agent session:
```python
import asyncio
from videosdk.agents import AgentSession, WorkerJob

async def start_session(jobctx):
    # ... previous setup code ...

    # Create the agent session
    session = AgentSession(
        agent=VoiceAgent(),
        pipeline=pipeline,
        context=jobctx,
    )

    try:
        # Start the session
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()

def entryPoint(jobctx):
    asyncio.run(start_session(jobctx))

if __name__ == "__main__":
    def make_context():
        return {
            "meetingId": "<Your-Meeting-ID>",  # Use the generated meeting ID from earlier
            "name": "AI Assistant",  # Name displayed in the meeting
        }

    # Create and start the worker job
    job = WorkerJob(job_func=entryPoint, jobctx=make_context)
    job.start()
```
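With everything in place, run the script (assuming you saved it as `main.py` with your `.env` file in the same directory):

```bash
python main.py
```

The agent will join the meeting specified in `make_context` and greet participants as soon as it connects.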
5. Connecting with VideoSDK Client Applications
After setting up your AI Agent, you'll need a client application to connect with it. You can use any of the VideoSDK quickstart examples to create a client that joins the same meeting.
When setting up your client application, make sure to use the same meeting ID that your AI Agent is using.