AI Voice Agent Quick Start

This guide will help you integrate an AI-powered voice agent into your VideoSDK meetings.

Prerequisites

Before you begin, ensure you have:

  • A VideoSDK authentication token (generate one from app.videosdk.live)
  • A VideoSDK meeting ID (you can generate one using the Create Room API)
  • An OpenAI API key (or Gemini API key if using Google's models)

Installation

First, install the VideoSDK AI Agent package using pip:

pip install videosdk-agents
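
To confirm the install, a quick import check is enough (a sanity sketch; these are the same classes used throughout this guide):

# Verify the package is importable
from videosdk.agents import Agent, AgentSession
print("videosdk-agents installed")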

Environment Setup

It's recommended to store API keys and tokens in environment variables rather than hard-coding them. Create a .env file in your project root:

.env
OPENAI_API_KEY=your_openai_api_key
VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token
GOOGLE_API_KEY=your_google_api_key
GOOGLE_APPLICATION_CREDENTIALS=path_to_google_credentials_json
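
These variables are read from the process environment. If you use a loader such as python-dotenv (an assumption; any mechanism that populates the environment works), a minimal sketch looks like this:

import os
from dotenv import load_dotenv  # assumes: pip install python-dotenv

# Populate the process environment from the .env file
load_dotenv()

# Fail fast if a required secret is missing
for key in ("VIDEOSDK_AUTH_TOKEN", "OPENAI_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"Missing required environment variable: {key}")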

Generating a VideoSDK Meeting ID

Before your AI agent can join a meeting, you'll need to create a meeting ID. You can generate one using the VideoSDK Create Room API:

Using cURL

curl -X POST https://api.videosdk.live/v2/rooms \
-H "Authorization: YOUR_JWT_TOKEN_HERE" \
-H "Content-Type: application/json"

For more details on the Create Room API, refer to the VideoSDK documentation.
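
The same request can be made from Python. The sketch below assumes the requests library and that the response carries the new room's ID in a roomId field (check the Create Room API reference for the exact schema):

import os
import requests  # assumes: pip install requests

def create_meeting(auth_token: str) -> str:
    """Create a VideoSDK room and return its meeting ID."""
    response = requests.post(
        "https://api.videosdk.live/v2/rooms",
        headers={
            "Authorization": auth_token,
            "Content-Type": "application/json",
        },
    )
    response.raise_for_status()
    return response.json()["roomId"]

meeting_id = create_meeting(os.getenv("VIDEOSDK_AUTH_TOKEN"))
print(f"Meeting ID: {meeting_id}")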

1. Creating a Custom Agent

Let's create a custom voice agent by inheriting from the base Agent class:

from videosdk.agents import Agent, function_tool

class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions and help with tasks.",
        )
        # Register any tools the agent can use
        self.register_tools([self.get_weather, self.get_horoscope])

    async def on_enter(self) -> None:
        """Called when the agent first joins the meeting"""
        await self.session.say("Hi there! How can I help you today?")

This code defines a basic voice agent with:

  • Custom instructions that define the agent's personality and capabilities
  • Tool registration so the agent can call the functions defined in the next step
  • An entry message spoken when the agent first joins the meeting

2. Implementing Function Tools

Function tools allow your agent to perform actions beyond conversation. Let's add two example tools:

import aiohttp

class VoiceAgent(Agent):
    # ... previous code ...

    @function_tool
    async def get_weather(self, latitude: str, longitude: str):
        """Called when the user asks about the weather. This function will return the weather for
        the given location. When given a location, please estimate the latitude and longitude of the
        location and do not ask the user for them.

        Args:
            latitude: The latitude of the location
            longitude: The longitude of the location
        """
        print(f"Getting weather for {latitude}, {longitude}")
        url = f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m"

        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                if response.status == 200:
                    data = await response.json()
                    return {
                        "temperature": data["current"]["temperature_2m"],
                        "temperature_unit": "Celsius",
                    }
                else:
                    raise Exception(
                        f"Failed to get weather data, status code: {response.status}"
                    )

    @function_tool
    async def get_horoscope(self, sign: str) -> dict:
        """Get today's horoscope for a given zodiac sign.

        Args:
            sign: The zodiac sign (e.g., Aries, Taurus, Gemini, etc.)
        """
        horoscopes = {
            "Aries": "Today is your lucky day!",
            "Taurus": "Focus on your goals today.",
            "Gemini": "Communication will be important today.",
        }
        return {
            "sign": sign,
            "horoscope": horoscopes.get(sign, "The stars are aligned for you today!"),
        }

Each function tool:

  • Is decorated with @function_tool
  • Includes detailed docstrings that help the AI understand when and how to use the tool
  • Accepts parameters from the AI and returns structured data
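
Tools don't have to call external services. As a further illustration, here is a hypothetical get_current_time tool (not part of the guide's agent) that follows the same pattern using only the standard library:

from datetime import datetime, timezone

class VoiceAgent(Agent):
    # ... previous code ...

    @function_tool
    async def get_current_time(self) -> dict:
        """Get the current date and time in UTC. Called when the user asks
        for the time or date.
        """
        now = datetime.now(timezone.utc)
        return {
            "date": now.strftime("%Y-%m-%d"),
            "time_utc": now.strftime("%H:%M"),
        }

If you add a tool like this, remember to include it in the register_tools call in __init__ so the model can invoke it.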

3. Setting Up the Pipeline

The pipeline connects your agent to an AI model. In this example, we're using OpenAI's real-time model:

from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig
from videosdk.agents import RealTimePipeline
from openai.types.beta.realtime.session import InputAudioTranscription, TurnDetection

async def start_session(jobctx):
    # Initialize the AI model
    model = OpenAIRealtime(
        model="gpt-4o-realtime-preview",
        config=OpenAIRealtimeConfig(
            modalities=["text", "audio"],
            input_audio_transcription=InputAudioTranscription(
                model="whisper-1"
            ),
            turn_detection=TurnDetection(
                type="server_vad",
                threshold=0.5,
                prefix_padding_ms=300,
                silence_duration_ms=200,
            ),
            tool_choice="auto"
        )
    )

    pipeline = RealTimePipeline(model=model)

    # Continue to the next steps...
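
The TurnDetection settings control when the model decides the user has finished speaking: threshold sets the voice-activity sensitivity, prefix_padding_ms keeps a little audio from just before speech begins, and silence_duration_ms is how long a pause must last to end the turn. As a rough sketch (example values, not a recommendation), a noisier room might call for a higher threshold and a longer silence window:

from openai.types.beta.realtime.session import TurnDetection

# Example only: a less sensitive VAD configuration for noisy environments
turn_detection = TurnDetection(
    type="server_vad",
    threshold=0.7,            # require a stronger speech signal before triggering
    prefix_padding_ms=300,
    silence_duration_ms=500,  # wait longer before treating a pause as end of turn
)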

4. Assembling and Starting the Agent Session

Now, let's put everything together and start the agent session:

import asyncio
from videosdk.agents import AgentSession, WorkerJob

async def start_session(jobctx):
    # ... previous setup code ...

    # Create the agent session
    session = AgentSession(
        agent=VoiceAgent(),
        pipeline=pipeline,
        context=jobctx
    )

    try:
        # Start the session
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()

def entryPoint(jobctx):
    asyncio.run(start_session(jobctx))

if __name__ == "__main__":
    def make_context():
        return {
            "meetingId": "<Your-Meeting-ID>",  # Use the generated meeting ID from earlier
            "name": "AI Assistant",  # Name displayed in the meeting
        }

    # Create and start the worker job
    job = WorkerJob(job_func=entryPoint, jobctx=make_context)
    job.start()
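
If you kept a helper like the create_meeting sketch from earlier (hypothetical, not part of the SDK), make_context can request a fresh room on every run instead of relying on a hand-pasted ID:

import os

def make_context():
    # Create a new room at startup rather than hard-coding the meeting ID
    return {
        "meetingId": create_meeting(os.getenv("VIDEOSDK_AUTH_TOKEN")),
        "name": "AI Assistant",
    }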

5. Connecting with VideoSDK Client Applications

After setting up your AI Agent, you'll need a client application to connect with it. You can use any of the VideoSDK quickstart examples to build a client that joins the same meeting.

When setting up your client application, make sure to use the same meeting ID that your AI Agent is using.

Got a question? Ask us on Discord.