
AI Voice Agent Quick Start

Get started with VideoSDK Agents in minutes. This guide covers both Realtime (speech-to-speech) and Cascaded (STT-LLM-TTS) pipeline implementations.

Try out these complete examples in CodeSandbox

Access keys safely using environment variables. Save changes with Cmd + S (Mac) or Ctrl + S (Windows). A Playground link will appear in the terminal; copy it and open it in a new tab.

Prerequisites

Before you begin, ensure you have:

  • Python 3.12 or higher installed
  • API keys for the services you'll use. For the Realtime Pipeline: OpenAI ↗, Google AI Studio ↗ & VideoSDK Dashboard ↗. For the Cascading Pipeline: Deepgram ↗, OpenAI ↗, ElevenLabs ↗ & VideoSDK Dashboard ↗
  • A VideoSDK auth token (follow the guide to generate one)

Understanding the Architecture

Before diving into implementation, let's understand the two main pipeline architectures available:

Realtime Pipeline provides direct speech-to-speech processing with minimal latency:

Realtime Pipeline Architecture

The realtime pipeline processes audio directly through a unified model that handles:

  • User Voice Input → Speech-to-Speech Model → Agent Voice Output

This approach offers the fastest response times and is ideal for real-time conversations.
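
The rest of this guide builds a cascaded agent, but for contrast, here is a minimal sketch of a realtime pipeline. The RealTimePipeline and OpenAIRealtime names are assumptions based on the SDK's plugin naming pattern; check the Realtime pipeline docs for the exact API:

# Hypothetical sketch; class names may differ across SDK versions
from videosdk.agents import RealTimePipeline
from videosdk.plugins.openai import OpenAIRealtime

# A single speech-to-speech model replaces the separate STT, LLM, and TTS stages
pipeline = RealTimePipeline(
    model=OpenAIRealtime(model="gpt-4o-realtime-preview")
)

The Cascaded Pipeline, by contrast, chains separate STT, LLM, and TTS stages (User Voice Input → STT → LLM → TTS → Agent Voice Output). It trades some latency for control over each stage and free choice of providers.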

Installation

Create and activate a virtual environment with Python 3.12 or higher:

python3.12 -m venv venv
source venv/bin/activate
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"

Want to use a different provider? Check out our plugins for STT, LLM, and TTS.

Environment Setup

Store API keys and auth tokens in environment variables rather than hardcoding them in your source. Create a .env file in your project root:

.env
DEEPGRAM_API_KEY="Your Deepgram API Key"
OPENAI_API_KEY="Your OpenAI API Key"
ELEVENLABS_API_KEY="Your ElevenLabs API Key"
VIDEOSDK_AUTH_TOKEN="Your VideoSDK Auth Token"

API Keys - Get your API keys from Deepgram ↗, OpenAI ↗ & ElevenLabs ↗, and your auth token from the VideoSDK Dashboard ↗. Follow the guide to generate a VideoSDK token.
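
To load these values at runtime, here is a minimal sketch assuming the python-dotenv package (pip install python-dotenv). The VideoSDK plugins typically read their keys from the environment, so calling load_dotenv() at the top of main.py before constructing the pipeline is usually enough:

import os
from dotenv import load_dotenv

# Read key=value pairs from .env into the process environment
load_dotenv()

# Fail fast if a required key is missing
assert os.getenv("VIDEOSDK_AUTH_TOKEN"), "VIDEOSDK_AUTH_TOKEN is not set"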

Step 1: Creating a Custom Agent

First, let's create a custom voice agent by inheriting from the base Agent class:

main.py
import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the turn detector model so the first session starts quickly
pre_download_model()

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions and help with tasks."
        )

    async def on_enter(self):
        # Greet the user when the agent joins the session
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        # Say goodbye when the session ends
        await self.session.say("Goodbye!")

This code defines a basic voice agent with:

  • Custom instructions that define the agent's personality and capabilities
  • An entry message spoken when the agent joins the session (on_enter)
  • A farewell message spoken when the session ends (on_exit)
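
Agents can also expose tools that the LLM can call mid-conversation. A minimal sketch, assuming the function_tool decorator exported by videosdk.agents (check the SDK reference for the exact import); the weather lookup is a hypothetical placeholder:

from videosdk.agents import Agent, function_tool

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant.")

    @function_tool
    async def get_weather(self, city: str) -> str:
        """Return a short weather report for the given city."""
        # Hypothetical placeholder; call a real weather API here
        return f"It is sunny in {city} today."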

Step 2: Assembling and Starting the Agent Session

The pipeline wires your agent to its AI providers: speech-to-text, the LLM, and text-to-speech, along with voice activity detection (VAD) and turn detection for natural turn-taking.

main.py
async def start_session(context: JobContext):
    # Create the agent and its conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Assemble the cascaded pipeline: STT -> LLM -> TTS, with VAD and turn detection
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 3: Running the Project

Once you have completed the setup, you can run your AI Voice Agent project using Python. Make sure your .env file is properly configured and all dependencies are installed.

python main.py

With playground=True set in RoomOptions, a Playground link appears in the terminal once the agent starts; open it in your browser to talk to the agent.

Step 4: Connecting with VideoSDK Client Applications

When working with a Client SDK, create the room first using the Create Room API. Then pass the generated room ID to both your client SDK and the RoomOptions of your AI Agent so they connect to the same session.
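
For reference, here is a minimal sketch of creating a room from Python, assuming the v2 /rooms endpoint and the requests package; see the Create Room API reference for the authoritative request shape:

import os
import requests

# Assumes VIDEOSDK_AUTH_TOKEN is set in the environment (see Environment Setup)
response = requests.post(
    "https://api.videosdk.live/v2/rooms",
    headers={"Authorization": os.environ["VIDEOSDK_AUTH_TOKEN"]},
)
response.raise_for_status()
room_id = response.json()["roomId"]
print(room_id)  # Pass this ID to your client SDK and to RoomOptions(room_id=...)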

tip

Get started quickly with the Quick Start Example for the VideoSDK AI Agent SDK — everything you need to build your first AI agent fast.

Got a question? Ask us on Discord.