AI Voice Agent Quick Start
Get started with VideoSDK Agents in minutes. This guide covers both Realtime (speech-to-speech) and Cascaded (STT-LLM-TTS) pipeline implementations.
API Keys - Create API keys for the services you'll use. For the Realtime pipeline: OpenAI ↗, Google AI Studio ↗ & VideoSDK Dashboard ↗. For the Cascading pipeline: Deepgram ↗, OpenAI ↗, ElevenLabs ↗ & VideoSDK Dashboard ↗. Follow the guide to generate a VideoSDK token.
Access keys safely using environment variables. After saving your changes (Cmd + S on Mac, Ctrl + S on Windows), a Playground link will appear in the terminal → copy it and open it in a new tab.
Prerequisites
Before you begin, ensure you have:
- A VideoSDK authentication token (generate one from app.videosdk.live; follow the guide to generate a VideoSDK token)
- A VideoSDK meeting ID (generate one using the Create Room API or through the VideoSDK dashboard; see the sketch after this list)
- Python 3.12 or higher
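If you don't have a meeting ID yet, you can create one programmatically. Below is a minimal sketch of calling the Create Room API using the requests package (an assumption; any HTTP client works), reading the auth token from a VIDEOSDK_AUTH_TOKEN environment variable:

import os
import requests

# POST /v2/rooms creates a new room; the VideoSDK auth token goes in the Authorization header
response = requests.post(
    "https://api.videosdk.live/v2/rooms",
    headers={"Authorization": os.environ["VIDEOSDK_AUTH_TOKEN"]},
)
response.raise_for_status()
room_id = response.json()["roomId"]  # use this as your meeting ID
print(room_id)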
Understanding the Architecture
Before diving into implementation, let's understand the two main pipeline architectures available:
- Realtime Pipeline
- Cascading Pipeline
Realtime Pipeline provides direct speech-to-speech processing with minimal latency. A single unified model handles the entire loop:
- User Voice Input → Speech-to-Speech Model → Agent Voice Output
This approach offers the fastest response times and is ideal for natural real-time conversations.
Cascading Pipeline processes audio through three distinct, sequential stages for maximum control:
- User Voice Input → STT (Speech-to-Text) → LLM (Large Language Model) → TTS (Text-to-Speech) → Agent Voice Output
This approach gives you fine-grained control over each processing stage and supports more complex AI reasoning.
Installation
Create and activate a virtual environment with Python 3.12 or higher:
- macOS/Linux
- Windows
python3.12 -m venv venv
source venv/bin/activate
python -m venv venv
venv\Scripts\activate
- Cascading Pipeline (STT-LLM-TTS)
- Realtime Pipeline
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"
Want to use a different provider? Check out our plugins for STT, LLM, and TTS.
pip install videosdk-agents
# Choose your real-time provider:
# For OpenAI
pip install "videosdk-plugins-openai"
# For Gemini (LiveAPI)
pip install "videosdk-plugins-google"
# For AWS Nova
pip install "videosdk-plugins-aws"
Environment Setup
It's recommended to store API keys and authentication tokens in environment variables. Create a .env file in your project root:
- Cascading Pipeline (STT-LLM-TTS)
- Realtime Pipeline
DEEPGRAM_API_KEY="Your Deepgram API Key"
OPENAI_API_KEY="Your OpenAI API Key"
ELEVENLABS_API_KEY="Your ElevenLabs API Key"
VIDEOSDK_AUTH_TOKEN="Your VideoSDK Auth Token"
API Keys - Get API keys from Deepgram ↗, OpenAI ↗, ElevenLabs ↗ & VideoSDK Dashboard ↗. Follow the guide to generate a VideoSDK token.
VIDEOSDK_AUTH_TOKEN="VideoSDK Auth token"
OPENAI_API_KEY="Your OpenAI API Key"
# For Google Live API
# GOOGLE_API_KEY="Google Live API Key"
# For AWS Nova
# AWS_ACCESS_KEY_ID="AWS Key Id"
# AWS_SECRET_ACCESS_KEY="AWS Secret Key"
# AWS_DEFAULT_REGION="AWS Region"
API Keys - Get API keys from OpenAI ↗, Gemini ↗, or AWS Nova Sonic ↗ & VideoSDK Dashboard ↗. Follow the guide to generate a VideoSDK token.
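The VideoSDK plugins typically read these keys from the environment at runtime. If you want to load the .env file yourself when your script starts, here is a minimal sketch assuming the python-dotenv package (install with pip install python-dotenv):

import os
from dotenv import load_dotenv

# Load variables from .env in the project root into os.environ
load_dotenv()

# Fail fast if a required key is missing
assert os.getenv("VIDEOSDK_AUTH_TOKEN"), "VIDEOSDK_AUTH_TOKEN is not set"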
Step 1: Creating a Custom Agent
First, let's create a custom voice agent by inheriting from the base Agent class:
- Cascading Pipeline (STT-LLM-TTS)
- Realtime Pipeline
import asyncio, os

from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the turn detector model so the first session start is not delayed
pre_download_model()

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
import asyncio, os

from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig
from openai.types.beta.realtime.session import TurnDetection

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
This code defines a basic voice agent with:
- Custom instructions that define the agent's personality and capabilities
- An entry greeting spoken when the agent joins the meeting (on_enter)
- A farewell message spoken when the agent leaves (on_exit)
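Agents can also expose tools that the LLM may call during a conversation. The sketch below assumes the function_tool decorator exported by videosdk.agents (check the SDK reference for the exact import and signature); get_current_time is a hypothetical helper added for illustration:

from datetime import datetime, timezone
from videosdk.agents import Agent, function_tool

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant.")

    @function_tool
    async def get_current_time(self) -> str:
        """Return the current UTC time so the agent can answer time questions."""
        return datetime.now(timezone.utc).isoformat()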
Step 2: Assembling and Starting the Agent Session
Next, assemble the pipeline that connects your agent to the AI model, then start the session.
- Cascading Pipeline (STT-LLM-TTS)
- Realtime Pipeline
async def start_session(context: JobContext):
    # Create the agent and its conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Assemble the STT -> LLM -> TTS pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
async def start_session(context: JobContext):
    # Initialize the realtime model
    model = OpenAIRealtime(
        model="gpt-4o-realtime-preview",
        config=OpenAIRealtimeConfig(
            voice="alloy",  # Available voices: alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, verse
            modalities=["text", "audio"],
            turn_detection=TurnDetection(
                type="server_vad",
                threshold=0.5,
                prefix_padding_ms=300,
                silence_duration_ms=200,
            )
        )
    )

    # Create the pipeline
    pipeline = RealTimePipeline(model=model)

    session = AgentSession(
        agent=MyVoiceAgent(),
        pipeline=pipeline
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Realtime Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
Step 3: Running the Project
Once you have completed the setup, you can run your AI Voice Agent project with Python. Make sure your .env file is properly configured and all dependencies are installed.
python main.py
Step 4: Connecting with VideoSDK Client Applications
When working with a Client SDK, make sure to create the room first using the Create Room API. Then pass the generated room id in both your client SDK and the RoomOptions for your AI Agent so they connect to the same session.
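For example, once your backend has created the room and shared the id with your client app, point the agent at the same room. A minimal sketch, where YOUR_MEETING_ID is a placeholder for the id returned by the Create Room API:

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="YOUR_MEETING_ID",  # same room id your client application joins
        name="VideoSDK Agent",
        playground=False,  # disable the playground when connecting your own client
    )
    return JobContext(room_options=room_options)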
Get started quickly with the Quick Start Example for the VideoSDK AI Agent SDK — everything you need to build your first AI agent fast.
Got a question? Ask us on Discord.