--- title: Introduction hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." pagination_label: "Introduction" keywords: - AI Agent SDK - VideoSDK Agents - Introduction - Python SDK - Voice AI - Real-time Communication - AI Integration - VideoSDK Cloud - Conversational AI - Build AI Agents image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Introduction slug: introduction --- import { AgentCardGrid, GithubIcon, RobotIcon, DocumentIcon, PlayIcon, CodeIcon, ExternalLinkIcon, SettingsIcon, TelephonyIcon, WaveformIcon, DocsIcon, CloudIcon, PuzzlePieceSimpleIcon, MetricsIcon, BulbIcon, DiscordIcon, SupportIcon } from '@site/src/components/agent/cards'; # AI Voice Agents The VideoSDK AI Agent SDK is a powerful Python framework for developers to seamlessly integrate intelligent, real-time voice agents into any application. Bridge the gap between advanced AI models and human interaction, creating natural, engaging, and responsive conversational experiences. 
, showArrow: false }, { title: "AI Telephony Agent Quickstart", description: "Build an AI Telephony Agent in less than 10 minutes", link: "/ai_agents/ai-phone-agent-quick-start", icon: , showArrow: false }, { title: "Github Repository", description: "The videosdk agent code and examples", link: "https://github.com/videosdk-live/agents", icon: }, { title: "Agent Starter Apps", description: "Ready-to-run starter apps to get your AI agent up and running fast.", link: "/ai_agents/agent-runtime/connect-agent/web-integrations/agent-starter-react", icon: } ]} /> ## The Architecture The VideoSDK AI Agents framework connects four key components to enable seamless AI voice interactions: - Your **Infrastructure** hosts the agent management system - The **Agent Worker** creates and manages AI sessions - The **VideoSDK Room** handles real-time meeting operations - **User Devices** connect through web, mobile apps, or phone calls to interact with intelligent agents that can listen, understand, and respond naturally in real-time conversations. ![Introduction](https://assets.videosdk.live/images/agent-architecture.png) ## Use Cases Here are some real-world applications where VideoSDK AI Agents can be deployed to create intelligent, voice-enabled experiences across different industries and scenarios. You can use this, or refer this to create your customized agent. ## The Building Blocks Our SDK is built on four primary, modular components that work together to create powerful and customizable agents. Understand these concepts, and you're ready to build. 
, showArrow: false }, { title: "Deployment Options", description: "Deploy your agent on cloud or self-host it on your own infrastructure", link: "/ai_agents/deployments/introduction", icon: , showArrow: false }, { title: "Observability", description: "Monitor and debug with confidence using our built-in session analytics, latency tracking, and detailed traces.", link: "/ai_agents/tracing-observability/session-analytics", icon: , showArrow: false }, { title: "Plugin Ecosystem", description: "Integrate with dozens of providers like OpenAI, Google, Anthropic, and Elevenlabs for STT, LLM, and TTS.", link: "/ai_agents/plugins/realtime/openai", icon: , showArrow: false } ]} /> ## Need Help? If you have any queries, please feel free to reach out to us using one of the following methods: }, { title: "GitHub", description: "Ask your questions on GitHub.", link: "https://github.com/videosdk-live/agents/issues", icon: }, { title: "Support", description: "Talk to an expert, book demo or talk to sales.", link: "https://www.videosdk.live/contact", icon: } ]} columns={3} /> ## Frequently Asked Questions
What programming language and version are required? The AI Agent SDK is built in Python. You'll need Python 3.12 or higher to use the SDK.
Can my agent answer phone calls? Yes. By integrating with our SIP/telephony services, your AI agent can join a room initiated by a standard phone call. This allows you to build powerful IVR systems, automated appointment schedulers, AI-powered call centers, and more.
What AI models are supported? The SDK supports various AI models including: - **Real-time Models**: OpenAI, Google Gemini, AWS Nova Sonic - **LLM Providers**: OpenAI, Google Gemini, Anthropic Claude, Sarvam AI, Cerebras - **TTS Providers**: ElevenLabs, OpenAI, Google, AWS Polly, Cartesia, and many more - **STT Providers**: OpenAI Whisper, Deepgram, Google, AssemblyAI, and others
Can I use my own custom models? Absolutely! The SDK's modular architecture allows you to create custom plugins for any AI provider. Check our [plugin development guide](https://github.com/videosdk-live/agents/blob/main/BUILD_YOUR_OWN_PLUGIN.md) for detailed instructions.
How is pricing handled for the AI Agent SDK? VideoSDK offers a free tier with limited usage. The AI Agent SDK itself is open-source, but you'll need API keys for the AI services you choose to use (OpenAI, Google, etc.). Check the [pricing page](https://www.videosdk.live/pricing) for VideoSDK usage limits.
Can agents handle more than just voice? Absolutely! Agents support multimodal interactions including vision processing, data messages, and real-time video streams. They can also use function tools to interact with external systems and APIs.
Is the SDK production-ready? Yes, the AI Agent SDK is stable and production-ready. It is designed to be self-hosted on your own infrastructure for full control and scalability, from a single server to a Kubernetes cluster. It includes comprehensive error handling, metrics collection, and deployment flexibility.
--- --- title: A2A Implementation Guide hide_title: false hide_table_of_contents: false description: "Complete implementation guide for building Agent to Agent (A2A) systems with VideoSDK AI Agents. Learn to create customer service and specialist agents that collaborate seamlessly using real-world examples." pagination_label: "A2A Implementation" keywords: - A2A Implementation - Agent to Agent Example - Multi-Agent System - Multiple Agent - A2A Protocol - AI Agent - Google's A2A - Customer Service Agent - Loan Specialist Agent - VideoSDK Agents - AI Agent SDK - Python Implementation - Agent Collaboration image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Implementation slug: implementation --- # A2A Implementation Guide This guide shows you how to build a complete Agent to Agent (A2A) system using the concepts from the [A2A Overview](overview). We'll create a banking customer service system with a main customer service agent and a loan specialist. ## Implementation Overview We'll build a system with: - **Customer Service Agent**: Voice-enabled interface agent using **RealTimePipeline** for low-latency voice interactions - **Loan Specialist Agent**: Text-based domain expert using **CascadingPipeline** for efficient text processing - **Intelligent Routing**: Automatic detection and forwarding of loan queries - **Seamless Communication**: Users get expert responses without knowing about the routing ## Structure of the project ```js A2A ├── agents/ │ ├── customer_agent.py # CustomerServiceAgent definition │ ├── loan_agent.py # LoanAgent definition │ ├── session_manager.py # Handles session creation, pipeline setup, meeting join/leave └── main.py # Entry point: runs main() and starts agents ``` ## Sequence Diagram ![A2A Architecture](https://cdn.videosdk.live/website-resources/docs-resources/a2a_sequence_diagram.png) ## Step 1: Create the Customer Service Agent - **`Interface Agent`**: Creates `CustomerServiceAgent` as the main user-facing agent 
with voice capabilities and customer service instructions. - **`Function Tool`**: Implements `forward_to_specialist()` as a `@function_tool` that uses A2A discovery to find domain specialists and route queries to them. - **`Response Relay`**: Includes a `handle_specialist_response()` method that automatically receives specialist responses and relays them back to users. ```python title="agents/customer_agent.py" from videosdk.agents import Agent, AgentCard, A2AMessage, function_tool import asyncio from typing import Dict, Any class CustomerServiceAgent(Agent): def __init__(self): super().__init__( agent_id="customer_service_1", instructions=( "You are a helpful bank customer service agent. " "For general banking queries (account balances, transactions, basic services), answer directly. " "For ANY loan-related queries, questions, or follow-ups, ALWAYS use the forward_to_specialist function " "with domain set to 'loan'. This includes initial loan questions AND all follow-up questions about loans. " "Do NOT attempt to answer loan questions yourself - always forward them to the specialist. " "After forwarding a loan query, stay engaged and automatically relay any response you receive from the specialist. " "When you receive responses from specialists, immediately relay them naturally to the customer." 
) ) @function_tool async def forward_to_specialist(self, query: str, domain: str) -> Dict[str, Any]: """Forward queries to domain specialist agents using A2A discovery""" # Use A2A discovery to find specialists by domain specialists = self.a2a.registry.find_agents_by_domain(domain) id_of_target_agent = specialists[0] if specialists else None if not id_of_target_agent: return {"error": f"No specialist found for domain {domain}"} # Send A2A message to the specialist await self.a2a.send_message( to_agent=id_of_target_agent, message_type="specialist_query", content={"query": query} ) return { "status": "forwarded", "specialist": id_of_target_agent, "message": "Let me get that information for you from our loan specialist..." } async def handle_specialist_response(self, message: A2AMessage) -> None: """Handle responses from specialist agents and relay to user""" response = message.content.get("response") if response: # Brief pause for natural conversation flow await asyncio.sleep(0.5) # Try multiple methods to relay the response to the user prompt = f"The loan specialist has responded: {response}" methods_to_try = [ (self.session.pipeline.send_text_message, prompt),# While using Cascading as main agent, comment this (self.session.pipeline.model.send_message, response),# While using Cascading as main agent, comment this (self.session.say, response) ] for method, arg in methods_to_try: try: await method(arg) break except Exception as e: print(f"Error with {method.__name__}: {e}") async def on_enter(self): # Register this agent with the A2A system await self.register_a2a(AgentCard( id="customer_service_1", name="Customer Service Agent", domain="customer_service", capabilities=["query_handling", "specialist_coordination"], description="Handles customer queries and coordinates with specialists" )) await self.session.say("Hello! I am your customer service agent. 
How can I help you?") # Set up message listener for specialist responses self.a2a.on_message("specialist_response", self.handle_specialist_response) async def on_exit(self): print("Customer agent left the meeting") ``` ## Step 2: Create the Loan Specialist Agent - **`Specialist Agent Setup`**: Creates the `LoanAgent` class with specialized loan expertise instructions and the agent_id `"specialist_1"`. - **`Message Handlers`**: Implements `handle_specialist_query()` to process incoming queries and `handle_model_response()` to send responses back. - **`Registration`**: Registers with the A2A system using the domain `"loan"` so it can be discovered by other agents needing loan expertise. ```python title="agents/loan_agent.py" from videosdk.agents import Agent, AgentCard, A2AMessage class LoanAgent(Agent): def __init__(self): super().__init__( agent_id="specialist_1", instructions=( "You are a specialized loan expert at a bank. " "Provide detailed, helpful information about loans including interest rates, terms, and requirements. " "Give complete answers with specific details when possible. " "You can discuss personal loans, car loans, home loans, and business loans. " "Provide helpful guidance and next steps for loan applications. " "Be friendly and professional in your responses. " "Keep responses concise within 5-7 lines and easily understandable." 
) ) async def handle_specialist_query(self, message: A2AMessage): """Process incoming queries from customer service agent""" query = message.content.get("query") if query: # Send the query to our AI model for processing await self.session.pipeline.send_text_message(query) async def handle_model_response(self, message: A2AMessage): """Send processed responses back to requesting agent""" response = message.content.get("response") requesting_agent = message.to_agent if response and requesting_agent: # Send the specialist response back to the customer service agent await self.a2a.send_message( to_agent=requesting_agent, message_type="specialist_response", content={"response": response} ) async def on_enter(self): await self.register_a2a(AgentCard( id="specialist_1", name="Loan Specialist Agent", domain="loan", capabilities=["loan_consultation", "loan_information", "interest_rates"], description="Handles loan queries" )) self.a2a.on_message("specialist_query", self.handle_specialist_query) self.a2a.on_message("model_response", self.handle_model_response) async def on_exit(self): print("LoanAgent Left") ``` ## Step 3: Configure Session Management - **`Pipeline Architecture`**: Uses **RealTimePipeline** for customer agent (audio-enabled Gemini for voice interaction) and **CascadingPipeline** for specialist agent (text-only OpenAI for efficient processing). - **`Session Factory`**: Provides `create_pipeline()` and `create_session()` functions to properly configure agent sessions based on their roles. - **`Modality Separation`**: Ensures customer agent can handle voice while specialist processes text in background. 
```python title="session_manager.py" from videosdk.agents import AgentSession, CascadingPipeline, RealTimePipeline, ConversationFlow from videosdk.plugins.openai import OpenAILLM from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig import os class MyConversationFlow(ConversationFlow): async def on_turn_start(self, transcript: str) -> None: pass async def on_turn_end(self) -> None: pass def create_pipeline(agent_type: str): if agent_type == "customer": # Customer agent: RealTimePipeline for voice interaction return RealTimePipeline( model=GeminiRealtime( model="gemini-2.5-flash-native-audio-preview-12-2025", config=GeminiLiveConfig( voice="Leda", response_modalities=["AUDIO"] ) ) ) else: # Specialist agent: CascadingPipeline for text processing return CascadingPipeline( llm=OpenAILLM(api_key=os.getenv("OPENAI_API_KEY")), ) def create_session(agent, pipeline) -> AgentSession: return AgentSession( agent=agent, pipeline=pipeline, conversation_flow=MyConversationFlow(agent=agent), ) ``` :::note While setting up pipelines, make sure: - The **customer agent** has **voice capabilities only** (via `RealTimePipeline`). - The **specialist agent (Loan Agent)** operates in **text-only mode** (via `CascadingPipeline`). ::: :::info **Pipeline Support**: The VideoSDK AI Agents framework supports both **RealTimePipeline** and **CascadingPipeline**, enabling flexible configurations for voice and text processing with **A2A**. You can run a full `RealTimePipeline` or `CascadingPipeline` for both modalities, or create a hybrid setup that combines the two. This allows you to tailor the use of STT, TTS, and LLM to suit your specific use case, whether for low-latency interactions, complex processing flows, or a mix of both. ::: ## Step 4: Deploy A2A System on VideoSDK Platform - **`Meeting Setup`**: Customer agent joins VideoSDK meeting for user interaction while specialist runs in background mode. 
Requires environment variables: `VIDEOSDK_AUTH_TOKEN`, `GOOGLE_API_KEY`, and `OPENAI_API_KEY`. - **`System Orchestration`**: Uses `JobContext` and `WorkerJob` to manage the meeting lifecycle and agent coordination. - **`Resource Management`**: Handles startup sequence, keeps system running, and provides clean shutdown with proper A2A unregistration ```python title="main.py" import asyncio from contextlib import suppress from agents.customer_agent import CustomerServiceAgent from agents.loan_agent import LoanAgent from session_manager import create_pipeline, create_session from videosdk.agents import JobContext, RoomOptions, WorkerJob async def main(ctx: JobContext): specialist_agent = LoanAgent() specialist_pipeline = create_pipeline("specialist") specialist_session = create_session(specialist_agent, specialist_pipeline) customer_agent = CustomerServiceAgent() customer_pipeline = create_pipeline("customer") customer_session = create_session(customer_agent, customer_pipeline) specialist_task = asyncio.create_task(specialist_session.start()) try: await ctx.connect() await customer_session.start() await asyncio.Event().wait() except (KeyboardInterrupt, asyncio.CancelledError): print("Shutting down...") finally: specialist_task.cancel() with suppress(asyncio.CancelledError): await specialist_task await specialist_session.close() await customer_session.close() await specialist_agent.unregister_a2a() await customer_agent.unregister_a2a() await ctx.shutdown() def customer_agent_context() -> JobContext: room_options = RoomOptions(room_id="", name="Customer Service Agent", playground=True) return JobContext( room_options=room_options ) if __name__ == "__main__": job = WorkerJob(entrypoint=main, jobctx=customer_agent_context) job.start() ``` :::note Ensure that the `JobContext` is created **only for the primary (main) agent**, i.e., the agent responsible for user-facing interaction (e.g., Customer Agent). 
The background agent (e.g., Loan Agent) should not have its own context or initiate a separate connection. ::: #### Running the Application Set the required environment variables: ```bash export VIDEOSDK_AUTH_TOKEN="your_videosdk_token" export GOOGLE_API_KEY="your_google_api_key" export OPENAI_API_KEY="your_openai_api_key" ``` Set `room_id` in `RoomOptions` (left empty in the code above) to your actual meeting ID, then run: ```bash cd A2A python main.py ``` :::tip Quick Start Get the complete working example at [A2A Quick Start Repository](https://github.com/videosdk-live/agents-quickstart/tree/main/A2A) with all the code ready to run. ::: --- --- title: Agent to Agent (A2A) hide_title: false hide_table_of_contents: false description: "Understanding the core concepts of Agent to Agent (A2A) communication in VideoSDK AI Agents - AgentCard, A2AMessage, agent registration, and discovery mechanisms for building collaborative multi-agent systems." pagination_label: "A2A Overview" keywords: - A2A Overview - A2A Protocol - Agent To Agent - AI Agent - Google's A2A - AgentCard - A2AMessage - Agent Registration - Agent Discovery - Multi-Agent Communication - VideoSDK Agents - AI Agent SDK - Agent Collaboration image: img/videosdklive-thumbnail.jpg sidebar_position: 5 sidebar_label: Overview slug: overview --- # Agent to Agent (A2A) The Agent to Agent (A2A) protocol enables seamless collaboration between specialized AI agents, allowing them to communicate, share knowledge, and coordinate responses based on their unique capabilities and domain expertise. With VideoSDK's A2A implementation, you can create multi-agent systems where different agents work together to provide comprehensive solutions. ## How It Works ### Basic Flow 1. **Agent Registration**: Agents register themselves with an `AgentCard` that contains their capabilities and domain expertise 2. **Client Query**: Client sends a query to the main agent 3. **Agent Discovery**: Main agent discovers relevant specialist agents using agent cards 4. 
**Query Forwarding**: Main agent forwards specialized queries to appropriate agents 5. **Response Chain**: Specialist agents process queries and respond back to the main agent 6. **Client Response**: Main agent formats and delivers the final response to the client ![A2A Architecture](https://cdn.videosdk.live/website-resources/docs-resources/a2a_diagram.png) ### Example Scenario ``` Client → "Book a flight to New York and find a hotel" ↓ Travel Agent (Main) → Analyzes query ↓ Travel Agent → Discovers Flight Booking Agent & Hotel Booking Agent ↓ Travel Agent → Forwards flight query to Flight Booking Agent Travel Agent → Forwards hotel query to Hotel Booking Agent ↓ Specialist Agents → Process queries and respond back (text format) ↓ Travel Agent → Combines responses and sends to client (audio format) ``` # Core Components ## 1. AgentCard The `AgentCard` is how agents identify themselves and advertise their capabilities to other agents. #### Structure ```python AgentCard( id="agent_flight_001", name="Skymate", domain="flight", capabilities=[ "search_flights", "modify_bookings", "show_flight_status" ], description="Handles all flight-related tasks" ) ``` #### Parameters | Parameter | Type | Required | Description | | -------------- | ------ | -------- | ------------------------------------ | | `id` | string | Yes | Unique identifier for the agent | | `name` | string | Yes | Human-readable agent name | | `domain` | string | Yes | Primary expertise domain | | `capabilities` | list | Yes | List of specific capabilities | | `description` | string | Yes | Brief description of agent's purpose | | `metadata` | dict | No | Additional metadata for the agent | ## 2. A2AMessage `A2AMessage` is the standardized communication format between agents. 
#### Structure

```python
message = A2AMessage(
    from_agent="travel_agent_1",
    to_agent="agent_flight_001",
    type="flight_status_query",
    content={"query": "What's the status of AI202?"},
    metadata={"client_id": "xyz123", "urgency": "medium"}
)
```

#### Parameters

| Parameter    | Type   | Required | Description                 |
| ------------ | ------ | -------- | --------------------------- |
| `from_agent` | string | Yes      | ID of the sending agent     |
| `to_agent`   | string | Yes      | ID of the receiving agent   |
| `type`       | string | Yes      | Message type/event name     |
| `content`    | dict   | Yes      | Message payload             |
| `metadata`   | dict   | No       | Additional message metadata |

## 3. Agent Registry

#### `register_a2a(agent_card)`

Register an agent with the A2A system.

```python
async def on_enter(self):
    await self.register_a2a(AgentCard(
        id="agent_flight_001",
        name="Skymate",
        domain="flight",
        capabilities=[
            "search_flights",
            "modify_bookings",
            "show_flight_status"
        ],
        description="Handles all flight-related tasks"
    ))
```

**What Registration Does:**

- Adds the agent to the global `AgentRegistry` singleton
- Makes the agent discoverable by other agents
- Stores both the `AgentCard` and agent instance
- Enables message routing to this agent

#### `unregister_a2a()`

Unregister an agent from the A2A system.

```python
await self.unregister_a2a()
```

## 4. A2AProtocol Class

The main class for managing agent-to-agent communication.

### Agent Discovery

#### `find_agents_by_domain(domain: str)`

Discover agents based on their domain expertise.

```python
agents = self.a2a.registry.find_agents_by_domain("hotel")
# Returns: ["agent_hotel_001"]
```

#### `find_agents_by_capability(cap: str)`

Find agents with specific skills.

```python
agents = await self.a2a.registry.find_agents_by_capability("modify_bookings")
# Returns: ["agent_flight_001"]
```

---

### Agent Communications

#### `send_message(to_agent, message_type, content, metadata=None)`

Send messages directly to other agents.
```python await self.a2a.send_message( to_agent="agent_hotel_001", message_type="hotel_booking_query", # Event name that the receiving agent listens for content={"query": "Find 3-star hotels in Delhi under $100"}, metadata={"client_id": "xyz123"} # Optional metadata ) ``` **Parameters:** - `to_agent` (string): Target agent ID - `message_type` (string): Event name the receiving agent listens for - `content` (dict): Message payload - `metadata` (dict, optional): Additional message metadata #### `on_message(message_type, handler)` Register message handlers for incoming messages. ```python # Register a handler for specialist queries self.a2a.on_message("hotel_booking_query", self.handle_specialist_query) async def handle_specialist_query(self, message): # Process the incoming message query = message.content.get("query") # ... process query ... # Return response return {"response": "Current mortgage rates are 6.5%"} ``` ## Next Steps Now that you're familiar with the core A2A concepts, it's time to move from theory to practice: 👉 **[Explore the Full A2A Implementation](implementation)** Dive into a complete, working example that demonstrates agent discovery, messaging, and collaboration in action. --- --- title: Build a Custom Voice AI Agent in Minutes hide_title: false hide_table_of_contents: false description: "Use VideoSDK's low-code builder to design, test, and deploy a personalized voice agent powered by your preferred LLM." 
keywords: - voice ai agent - low-code agent builder - conversational ai - videosdk agents - gemini - realtime pipeline - telephony - knowledge base - speech recognition - tts image: https://strapi.videosdk.live/uploads/Screenshot_2025_11_17_at_5_06_23_PM_33a509fd4e.png sidebar_label: Build Agent slug: build-agent --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import Step from '@site/src/components/Step' # Agent Runtime Guide AI voice agents are transforming how businesses interact with customers, providing natural, conversational experiences through voice interfaces. VideoSDK's **Agent Runtime** feature offers a powerful **no-code/low-code interface** that enables you to build sophisticated AI voice agents without extensive programming knowledge. ## Prerequisites Before you begin, ensure you have: - **VideoSDK Account:** Visit [VideoSDK Dashboard](https://app.videosdk.live) to sign up for a free account and access the AI Agent builder. ## Step-By-Step Guide
### Step 1: Create a New Agent
1. In the dashboard, navigate to **AI Agent > Agents** or visit [Agents Dashboard](https://app.videosdk.live/agents/agents). 2. You'll see the `AI Agent > Agents` section in the dashboard. 3. To create a voice agent, click on **Agents** in the sidebar. ![Select Agents in Dashboard](https://strapi.videosdk.live/uploads/1_Select_Agents_in_Dashboard_1b6a6f6d0c.png)
### Step 2: Click `Add New Agent`
This is where you'll start creating your voice agent. If no agent has been created yet, you'll see an **Add New Agent** button. If agents already exist, you'll see a list of all AI voice agents, and you can click the button in the top right corner to create a new agent. ![Click Create AI Voice Agent Button](https://strapi.videosdk.live/uploads/2_Click_Create_AI_Voice_Agent_Button_349f3799f2.png)
### Step 3: Configure Agent Details
This is where you can define your AI voice agent's persona and behavior: - **Agent Name:** Set a descriptive name for your agent (e.g., "AI Interviewer"). - **System Prompt:** Define the agent's role, personality, and behavior guidelines. - **Welcome Message:** Set the message that plays when the agent joins a conversation. - **Closing Message:** Set the message that plays when the agent leaves a conversation. ![Create Voice Agent Persona](https://strapi.videosdk.live/uploads/3_Create_Voice_Agent_Persona_6281a768ef.png)
### Step 4: Configure the Pipeline
The pipeline is the core engine of your voice agent, processing audio through speech recognition, AI reasoning, and text-to-speech. VideoSDK offers two pipeline options: **Realtime Pipeline** and **Cascading Pipeline**.

The **Realtime Pipeline** provides direct speech-to-speech processing with minimal latency, ideal for natural, conversational interactions.

Example: Adding **Gemini Realtime Model**

1. Add your Gemini API key in the pipeline configuration or at [Realtime Integrations](https://app.videosdk.live/agents/integrations/realtime).
2. To get your API key, visit [Gemini API Keys](https://aistudio.google.com/api-keys).

![Gemini Add Your API Key](https://strapi.videosdk.live/uploads/4_Gemini_Add_Your_API_Key_bcf81a0f82.png)

**Available models:**

- `gemini-2.5-flash-native-audio-preview-12-2025`
- `gemini-2.0-flash`
- `gemini-2.5-flash-native-audio`

The **Cascading Pipeline** processes audio through distinct stages (STT → LLM → TTS), providing maximum control over each component. Configure your providers for [STT Integrations](https://app.videosdk.live/agents/integrations/stt), [LLM Integrations](https://app.videosdk.live/agents/integrations/llm) and [TTS Integrations](https://app.videosdk.live/agents/integrations/tts).

![STT Providers](https://strapi.videosdk.live/uploads/stt_e2522d9ea2.png)

Example: Adding **Deepgram STT**

- Get API Key at: [Deepgram Console](https://console.deepgram.com/)

**Available models:**

- `flux-general-en`
- `nova-2` or `nova-2-general` (for non-English transcriptions)
- `nova-3` or `nova-3-general`
- `base`
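To make the STT → LLM → TTS flow of the Cascading Pipeline concrete, here is a minimal conceptual sketch in Python. It is not the VideoSDK API and says nothing about how the dashboard runs the pipeline internally; `CascadingSketch`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical stand-ins that only illustrate how each stage's output feeds the next.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures, for illustration only (not VideoSDK types).
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class CascadingSketch:
    """Chains three independent stages: audio -> text -> reply text -> audio."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)   # stage 1: speech recognition
        reply = self.llm(transcript)      # stage 2: reasoning over the transcript
        return self.tts(reply)            # stage 3: speech synthesis


# Stub stages so the sketch runs without any provider or API key.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadingSketch(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply_audio = pipeline.handle_turn(b"hello")
```

Because each stage is a separate callable, any one of them can be swapped without touching the others, which is the kind of per-component control the cascading option is described as offering. A realtime model, by contrast, would collapse all three stages into a single speech-to-speech call.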
### Step 5: Knowledge Base Integration
Upload a knowledge base to provide context and domain expertise to your voice agent. This can improve answer accuracy and lets your agent handle specialized queries. - Navigate to the **Knowledge Base** tab in your agent configuration. - Upload documents, FAQs, or product sheets that contain relevant information. - The agent will use this knowledge to provide more accurate and contextual responses. ![Add Knowledge Base in VideoSDK](https://strapi.videosdk.live/uploads/Add_Knowlodege_base_in_videosdk_363aaa82f3.png)
### Step 6: Configure Telephony Settings
Configure telephony settings to enable your agent to handle phone calls: - **Agent Type:** Set the type of agent (inbound, outbound, or both). - **Inbound Gateways:** Set up gateways to receive incoming calls. - **Outbound Gateways:** Set up gateways to make outbound calls. - **Routing Rules:** Create rules to map phone numbers to your agent. - **Calling Settings:** Configure call handling preferences and behavior. ![Telephony Configuration](https://strapi.videosdk.live/uploads/telephony_agents_dd2c2080ac.png) This configuration is essential for **call center automation**, **platform integration**, and smooth **agent orchestration**.
### Step 7: Test Your Voice Agent
You can interact with the agent directly from the dashboard before connecting it to production channels: 1. Visit [Agents Dashboard](https://app.videosdk.live/agents/agents). 2. Locate your agent in the list and click the **Test** button in the top-right corner. 3. Use the built-in simulator to speak with the agent in real time, view live transcripts, and fine-tune prompts based on the conversation. ![Test AI Voice Agent](https://strapi.videosdk.live/uploads/test_ai_voice_agent_30e0045af0.png)
### Step 8: Connect Voice Agent
Once your agent is configured, you can connect it to various platforms and devices: - **Web:** Integrate your agent into web applications. - **Mobile:** Connect to iOS and Android mobile apps. - **Telephony:** Deploy to phone systems for voice calls. - **IoT Devices:** Connect to Internet of Things devices. ![Connect AI Voice Agent](https://strapi.videosdk.live/uploads/8_connect_ai_voice_agent_17fe428419.png) ## Next Steps Congratulations! You've successfully created your AI voice agent. Here are the next steps: - **Test Your Agent:** Use the built-in test simulator to verify your agent's behavior and responses. - **Deploy to Production:** Connect your agent to production environments and real user interactions. - **Monitor Performance:** Track agent performance, user satisfaction, and conversation quality. - **Iterate and Improve:** Refine your agent's prompts, knowledge base, and configuration based on real-world usage. Keep refining your agent's configuration to build a powerful voice AI solution tailored to your specific business needs. ### Starter Apps import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, RobotIcon, GithubIcon } from '@site/src/components/agent/cards'; --- --- title: Flutter Agent Starter hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using Flutter frontend and a no-code agent from the dashboard. 
sidebar_label: Flutter pagination_label: Agent Runtime with Flutter keywords: - ai agent - no-code - voice interaction - real-time communication - flutter sdk image: img/videosdklive-thumbnail.jpg sidebar_position: 2 slug: agent-starter-flutter --- import Step from '@site/src/components/Step' import CreateAgent from '@site/mdx/_ai-agent-starter-sdk-guide.mdx' # Agent Starter App - Flutter VideoSDK enables you to seamlessly add a voice-enabled AI agent to your Flutter app — this guide walks you through connecting your Flutter frontend to an agent configured and deployed directly from the VideoSDK dashboard. ## Prerequisites - A deployed AI agent on VideoSDK Agent Cloud. If you haven't done this yet, create and deploy your agent using the [Low-Code Deployment UI](/ai_agents/agent-runtime/build-agent) on the VideoSDK Dashboard — no coding required. Once deployed, note down your **Agent ID**. - If your target platform is iOS, your development environment must meet the following requirements: - Flutter 3.8.0 or later - Dart 3.x or later - Valid Video SDK [Account](https://app.videosdk.live/) import APISecret from '@site/mdx/introduction/_api-key.mdx'; ## Run the Sample Project
### Step 1: Clone the sample project
Clone the repository to your local environment.

```bash
git clone https://github.com/videosdk-live/agent-starter-app-flutter.git
cd agent-starter-app-flutter
```
### Step 2: Install the dependencies
Install all the dependencies to run the project. ```bash flutter pub get ```
### Step 3: Create Your Agent (Optional)
:::info If you've already configured and deployed your agent from the VideoSDK Dashboard, you can jump directly to [Step 4](#step-4-setup-environment-variables). :::
### Step 4: Setup Environment Variables
Copy the `.env.example` file to `.env`. ```bash cp .env.example .env ``` Update the `.env` file with your credentials. The `AGENT_ID` is the identifier for the Low-Code agent you deployed from the VideoSDK Dashboard. ```env title=".env" AUTH_TOKEN=your_videosdk_auth_token AGENT_ID=your_agent_id MEETING_ID=your_meeting_id VERSION_ID=your_version_id ``` > **Tip:** You can obtain your `AUTH_TOKEN` and `AGENT_ID` from the [VideoSDK Dashboard](https://app.videosdk.live/) under your Agent Cloud deployment. `MEETING_ID` is optional — if left blank, the app will create a new meeting automatically.
### Step 5: Run the Sample App
Bingo, it's time to push the launch button. **Android:** ```bash flutter run ``` **iOS:** ```bash cd ios && pod install && cd .. flutter run -d ios ``` Once running, the app will use the Dispatch API to send your deployed agent into the meeting room. You'll see the live transcription as you speak, and the agent will respond in real time. --- ## Troubleshooting ### Common Issues: 1. **Agent not joining:** - Check that the `AGENT_ID` and `VERSION_ID` in your `.env` are correctly set. - Verify your VideoSDK token is valid and has the necessary permissions. 2. **Audio not working:** - Check device permissions for microphone access. 3. **"Failed to connect agent" error:** - Verify your `AGENT_ID` and `VERSION_ID` are correct. - Check the debug console for any network errors. 4. **Flutter build issues:** - Ensure your Flutter version is compatible (3.8.0 or later for iOS targets). - Try cleaning the build: `flutter clean`. - Delete `pubspec.lock` and run `flutter pub get`. - For iOS: run `cd ios && pod install` before `flutter run`. --- --- title: iOS Agent Starter hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using iOS frontend and a no-code agent from the dashboard. sidebar_label: iOS pagination_label: Agent Runtime with iOS keywords: - ai agent - no-code - voice interaction - real-time communication - ios sdk image: img/videosdklive-thumbnail.jpg sidebar_position: 3 slug: agent-starter-ios --- import Step from '@site/src/components/Step' import CreateAgent from '@site/mdx/_ai-agent-starter-sdk-guide.mdx' # Agent Starter App - iOS VideoSDK enables you to seamlessly add a voice-enabled AI agent to your iOS app — this guide walks you through connecting your iOS application to an agent configured and deployed directly from the VideoSDK dashboard. ## Prerequisites - A deployed AI agent on VideoSDK Agent Cloud. 
If you haven't done this yet, create and deploy your agent using the [Low-Code Deployment UI](/ai_agents/agent-runtime/build-agent) on the VideoSDK Dashboard — no coding required. Once deployed, note down your **Agent ID**. - For iOS, your development environment must meet the following requirements: - iOS 18 or later - Xcode 16.4 or later - Valid Video SDK [Account](https://app.videosdk.live/) import APISecret from '@site/mdx/introduction/_api-key.mdx'; ## Run the Sample Project
### Step 1: Clone the sample project
Clone the repository to your local environment.

```bash
git clone https://github.com/videosdk-live/agent-starter-app-ios.git
cd agent-starter-app-ios
```
### Step 2: Open the project in Xcode
Open the `agent-starter-ios.xcodeproj` file using Xcode.
### Step 3: Create Your Agent (Optional)
:::info If you've already configured and deployed your agent from the VideoSDK Dashboard, you can jump directly to [Step 4](#step-4-set-up-credentials). :::
### Step 4: Set up credentials
Before running the app, configure your authentication details. Open `agent-starter-ios/Constants/MeetingConfig.swift` and supply the required values:

```
AUTH_TOKEN:
AGENT_ID:
MEETING_ID:
VERSION_ID:
```

> **Tip:** You can obtain your `AUTH_TOKEN` and `AGENT_ID` from the [VideoSDK Dashboard](https://app.videosdk.live/) under your Agent Cloud deployment. `MEETING_ID` is optional; if left blank, the app creates a new meeting automatically. `VERSION_ID` is also optional; if left blank, the app fetches the agent's versions and uses the latest one.
### Step 5: Build and Run
Bingo, Now Select your target physical device and click the Run button (or press Cmd + R) in Xcode! Once running, the app will use the Dispatch API to send your deployed agent into the meeting room. You'll see the live transcription as you speak, and the agent will respond in real time. --- ## Troubleshooting ### Common Issues: 1. **Agent not joining:** - Check that the `AGENT_ID` and `VERSION_ID` in your `agent-starter-ios/Constants/MeetingConfig.swift` are correctly set. - Verify your VideoSDK token is valid and has the necessary permissions. 2. **Audio not working:** - Check device permissions for microphone access. 3. **"Failed to connect agent" error:** - Verify your `AGENT_ID` and `VERSION_ID` are correct. - Check the debug console for any network errors. --- --- title: Agent Runtime with Flutter hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using Flutter frontend and a no-code agent from the dashboard. sidebar_label: With Flutter pagination_label: Agent Runtime with Flutter keywords: - ai agent - no-code - voice interaction - real-time communication - flutter sdk image: img/videosdklive-thumbnail.jpg sidebar_position: 2 slug: with-flutter --- import Step from '@site/src/components/Step' # Agent Runtime with Flutter VideoSDK empowers you to seamlessly integrate AI agents with real-time voice interaction into your Flutter application within minutes. This guide shows you how to connect a Flutter frontend with an AI agent created and configured entirely from the VideoSDK dashboard. ## Prerequisites Before proceeding, ensure that your development environment meets the following requirements: - Video SDK Developer Account (Not having one, follow **[Video SDK Dashboard](https://app.videosdk.live/)**) - Flutter installed on your device - Familiarity with creating a no-code voice agent. 
If you're new to this, please follow our guide on how to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)** first. :::important You need a VideoSDK account to generate a token and an agent from the dashboard. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token. ::: ## Project Structure Your project structure should look like this: ```jsx title="Project Structure" root ├── android ├── ios ├── lib │ ├── api_call.dart │ ├── join_screen.dart │ ├── main.dart │ ├── meeting_controls.dart │ ├── meeting_screen.dart │ └── participant_tile.dart ├── macos ├── web └── windows ``` You will be working on the following files: - `join_screen.dart`: Responsible for the user interface to join a meeting. - `meeting_screen.dart`: Displays the meeting interface and handles meeting logic. - `api_call.dart`: Handles API calls for creating meetings and dispatching agents. ## 1. Flutter Frontend
### Step 1: Getting Started
Follow these steps to create the environment necessary to add AI agent functionality to your app. #### Create a New Flutter App Create a new Flutter app using the following command: ```bash $ flutter create videosdk_ai_agent_flutter_app ``` #### Install VideoSDK Install the VideoSDK using the following Flutter command. Make sure you are in your Flutter app directory before you run this command. ```bash $ flutter pub add videosdk $ flutter pub add http ```
### Step 2: Configure Project
#### For Android

- Update `/android/app/src/main/AndroidManifest.xml` with the permissions required for the audio and video features.

```xml title="android/app/src/main/AndroidManifest.xml"
```

- If necessary, increase the `minSdkVersion` in the `defaultConfig` of your `build.gradle` to `23` (the default Flutter generator currently sets it to `16`).

#### For iOS

- Add the following entries to your `/ios/Runner/Info.plist` file to allow your app to access the camera and microphone:

```xml title="/ios/Runner/Info.plist"
<key>NSCameraUsageDescription</key>
<string>$(PRODUCT_NAME) Camera Usage!</string>
<key>NSMicrophoneUsageDescription</key>
<string>$(PRODUCT_NAME) Microphone Usage!</string>
```

- Uncomment the following line in `/ios/Podfile` to define a global platform for your project:

```ruby title="/ios/Podfile"
platform :ios, '12.0'
```

#### For macOS

- Add the following entries to your `/macos/Runner/Info.plist` file to allow your app to access the camera and microphone:

```xml title="/macos/Runner/Info.plist"
<key>NSCameraUsageDescription</key>
<string>$(PRODUCT_NAME) Camera Usage!</string>
<key>NSMicrophoneUsageDescription</key>
<string>$(PRODUCT_NAME) Microphone Usage!</string>
```

- Add the following entries to your `/macos/Runner/DebugProfile.entitlements` file to allow your app to access the camera and microphone and to open outgoing network connections:

```xml title="/macos/Runner/DebugProfile.entitlements"
<key>com.apple.security.network.client</key>
<true/>
<key>com.apple.security.device.camera</key>
<true/>
<key>com.apple.security.device.microphone</key>
<true/>
```

- Add the following entries to your `/macos/Runner/Release.entitlements` file to allow your app to access the camera and microphone and to open network connections:

```xml title="/macos/Runner/Release.entitlements"
<key>com.apple.security.network.server</key>
<true/>
<key>com.apple.security.network.client</key>
<true/>
<key>com.apple.security.device.camera</key>
<true/>
<key>com.apple.security.device.microphone</key>
<true/>
```
### Step 3: Configure Environment and Credentials
Create a meeting room using the VideoSDK API:

```bash
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_JWT_TOKEN_HERE" \
  -H "Content-Type: application/json"
```

Copy the `roomId` from the response and configure it in `lib/api_call.dart` along with your agent credentials.

```dart title="lib/api_call.dart"
import 'dart:convert';
import 'package:http/http.dart' as http;

// Auth token used to create a meeting and connect to it
const token = 'YOUR_VIDEOSDK_AUTH_TOKEN';
const agentId = 'YOUR_AGENT_ID';
const versionId = 'YOUR_VERSION_ID';

// API call to create a meeting
Future<String> createMeeting() async {
  final http.Response httpResponse = await http.post(
    Uri.parse('https://api.videosdk.live/v2/rooms'),
    headers: {'Authorization': token},
  );

  // Extract the roomId from the response body
  return json.decode(httpResponse.body)['roomId'];
}

// API call to connect the agent to the meeting
Future<void> connectAgent(String meetingId) async {
  final http.Response httpResponse = await http.post(
    Uri.parse('https://api.videosdk.live/v2/agent/general/dispatch'),
    headers: {
      'Authorization': token,
      'Content-Type': 'application/json',
    },
    body: json.encode({
      'agentId': agentId,
      'meetingId': meetingId,
      'versionId': versionId,
    }),
  );

  // Any status other than 200 is treated as a failed dispatch
  if (httpResponse.statusCode != 200) {
    throw Exception('Failed to connect agent');
  }
}
```
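The dispatch call in the Dart code above can also be read as a plain description of one HTTP request. The following JavaScript sketch only assembles that request (URL, headers, JSON body) without sending it, which makes the shape easy to inspect or log. The endpoint and field names are taken from the code above; the helper name `buildDispatchRequest` is hypothetical and not part of any VideoSDK package.

```javascript
// Sketch only: describes the dispatch request without sending it.
// Endpoint and body fields mirror the Dart example; the helper name is an assumption.
const DISPATCH_URL = "https://api.videosdk.live/v2/agent/general/dispatch";

function buildDispatchRequest({ token, agentId, meetingId, versionId }) {
  return {
    url: DISPATCH_URL,
    options: {
      method: "POST",
      headers: {
        Authorization: token,
        "Content-Type": "application/json",
      },
      // The body carries the same three identifiers as the Dart version.
      body: JSON.stringify({ agentId, meetingId, versionId }),
    },
  };
}
```

The returned `url` and `options` could then be passed to `fetch(url, options)`; keeping construction separate from sending makes it possible to check the payload before any network call is made.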
### Step 4: Design the User Interface (UI)
Update the UI files to add the "Connect Agent" button and connect the logic.

```dart title="lib/join_screen.dart"
import 'package:flutter/material.dart';
import 'api_call.dart';
import 'meeting_screen.dart';

class JoinScreen extends StatelessWidget {
  final _meetingIdController = TextEditingController();

  JoinScreen({super.key});

  void onJoinButtonPressed(BuildContext context) {
    // Check that the meeting id is not null or invalid.
    // If the meeting id is valid, navigate to MeetingScreen with meetingId and token.
    Navigator.of(context).push(
      MaterialPageRoute(
        builder: (context) =>
            MeetingScreen(meetingId: "YOUR_MEETING_ID", token: token),
      ),
    );
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('VideoSDK QuickStart')),
      body: Padding(
        padding: const EdgeInsets.all(12.0),
        child: Center(
          child: ElevatedButton(
            onPressed: () => onJoinButtonPressed(context),
            child: const Text('Join Meeting'),
          ),
        ),
      ),
    );
  }
}
```

```dart title="lib/meeting_screen.dart"
import 'package:flutter/material.dart';
import 'package:videosdk/videosdk.dart';
import 'participant_tile.dart';
import 'meeting_controls.dart';
import 'api_call.dart';

class MeetingScreen extends StatefulWidget {
  final String meetingId;
  final String token;

  const MeetingScreen({
    super.key,
    required this.meetingId,
    required this.token,
  });

  @override
  State<MeetingScreen> createState() => _MeetingScreenState();
}

class _MeetingScreenState extends State<MeetingScreen> {
  late Room _room;
  var micEnabled = true;
  var camEnabled = true;
  bool _isAgentConnected = false;

  Map<String, Participant> participants = {};

  @override
  void initState() {
    super.initState();

    // Create the room
    _room = VideoSDK.createRoom(
      roomId: widget.meetingId,
      token: widget.token,
      displayName: "John Doe",
      micEnabled: micEnabled,
      camEnabled: false,
      defaultCameraIndex: 1, // Index of MediaDevices used as the default camera
    );

    setMeetingEventListener();

    // Join the room
    _room.join();
  }

  // Listen to meeting events
  void setMeetingEventListener() {
    _room.on(Events.roomJoined, () {
      setState(() {
        participants.putIfAbsent(
          _room.localParticipant.id,
          () => _room.localParticipant,
        );
      });
    });

    _room.on(Events.participantJoined, (Participant participant) {
      setState(
        () => participants.putIfAbsent(participant.id, () => participant),
      );
    });

    _room.on(Events.participantLeft, (String participantId) {
      if (participants.containsKey(participantId)) {
        setState(() => participants.remove(participantId));
      }
    });

    _room.on(Events.roomLeft, () {
      participants.clear();
      Navigator.popUntil(context, ModalRoute.withName('/'));
    });
  }

  void _connectAgent() async {
    try {
      await connectAgent(widget.meetingId);
      // The widget may have been disposed while the request was in flight
      if (!mounted) return;
      setState(() {
        _isAgentConnected = true;
      });
      ScaffoldMessenger.of(context).showSnackBar(
        const SnackBar(content: Text('Agent connected successfully!')),
      );
    } catch (e) {
      if (!mounted) return;
      ScaffoldMessenger.of(context).showSnackBar(
        SnackBar(content: Text('Failed to connect agent: ${e.toString()}')),
      );
    }
  }

  // Leave the room when the back button is pressed
  Future<bool> _onWillPop() async {
    _room.leave();
    return true;
  }

  @override
  Widget build(BuildContext context) {
    return WillPopScope(
      onWillPop: () => _onWillPop(),
      child: Scaffold(
        appBar: AppBar(title: const Text('VideoSDK QuickStart')),
        body: Padding(
          padding: const EdgeInsets.all(8.0),
          child: Column(
            children: [
              Text(widget.meetingId),
              // Render all participants
              Expanded(
                child: Padding(
                  padding: const EdgeInsets.all(8.0),
                  child: GridView.builder(
                    gridDelegate:
                        const SliverGridDelegateWithFixedCrossAxisCount(
                      crossAxisCount: 2,
                      crossAxisSpacing: 10,
                      mainAxisSpacing: 10,
                      mainAxisExtent: 300,
                    ),
                    itemBuilder: (context, index) {
                      return ParticipantTile(
                        key: Key(participants.values.elementAt(index).id),
                        participant: participants.values.elementAt(index),
                      );
                    },
                    itemCount: participants.length,
                  ),
                ),
              ),
              MeetingControls(
                onToggleMicButtonPressed: () {
                  micEnabled ? _room.muteMic() : _room.unmuteMic();
                  micEnabled = !micEnabled;
                },
                onLeaveButtonPressed: () => _room.leave(),
                // Disable the button once the agent is connected
                onConnectAgentButtonPressed:
                    _isAgentConnected ? null : _connectAgent,
              ),
            ],
          ),
        ),
      ),
    );
  }
}
```

```dart title="lib/meeting_controls.dart"
import 'package:flutter/material.dart';

class MeetingControls extends StatelessWidget {
  final void Function() onToggleMicButtonPressed;
  final void Function() onLeaveButtonPressed;
  final void Function()? onConnectAgentButtonPressed;

  const MeetingControls({
    super.key,
    required this.onToggleMicButtonPressed,
    required this.onLeaveButtonPressed,
    required this.onConnectAgentButtonPressed,
  });

  @override
  Widget build(BuildContext context) {
    return Row(
      mainAxisAlignment: MainAxisAlignment.spaceEvenly,
      children: [
        ElevatedButton(
          onPressed: onLeaveButtonPressed,
          child: const Text('Leave'),
        ),
        ElevatedButton(
          onPressed: onToggleMicButtonPressed,
          child: const Text('Toggle Mic'),
        ),
        ElevatedButton(
          onPressed: onConnectAgentButtonPressed,
          child: const Text('Connect Agent'),
        ),
      ],
    );
  }
}
```

## 2. Creating the AI Agent from Dashboard (No-Code)

You can create and configure a powerful AI agent directly from the VideoSDK dashboard.
### Step 1: Create Your Agent
First, follow our detailed guide to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)**. This will walk you through creating the agent's persona, configuring its pipeline (Realtime or Cascading), and testing it directly from the dashboard.
### Step 2: Get Agent and Version ID
Once your agent is created, you need its `agentId` and `versionId` to connect it to your frontend application.

1. After creating your agent, go to the agent's page and find the JSON editor on the right side. Copy the `agentId`.
2. To get the `versionId`, click the three dots beside the **Deploy** button and select "Version History". Use the copy button next to the version you want to copy its version ID.

![Get agentId and versionId](https://strapi.videosdk.live/uploads/agent_version_id_0f8b59830a.png)
### Step 3: Configure IDs in Frontend
Now, update your `lib/api_call.dart` file with these IDs. ```dart title="lib/api_call.dart" const token = 'your_videosdk_auth_token_here'; const agentId = 'paste_your_agent_id_here'; const versionId = 'paste_your_version_id_here'; ``` ## 3. Run the Application
### Step 1: Run the Frontend
Once you have completed all the steps mentioned above, start your Flutter application: ```bash flutter run ```
### Step 2: Connect and Interact
1. **Join the meeting from the Flutter app:** - Click the "Join Meeting" button. - Allow microphone permissions when prompted. 2. **Connect the agent:** - Once you join, click the "Connect Agent" button. - You should see a confirmation that the agent was connected. - The AI agent will join the meeting and greet you. 3. **Start playing:** - Interact with your AI agent using your microphone. ## Troubleshooting ### Common Issues: 1. **Agent not joining:** - Check that the `roomId`, `agentId`, and `versionId` are correctly set. - Verify your VideoSDK token is valid and has the necessary permissions. 2. **Audio not working:** - Check device permissions for microphone access. 3. **"Failed to connect agent" error:** - Verify your `agentId` and `versionId` are correct. - Check the debug console for any network errors. 4. **Flutter build issues:** - Ensure your Flutter version is compatible. - Try cleaning the build: `flutter clean`. - Delete `pubspec.lock` and run `flutter pub get`. --- --- title: Agent Runtime with iOS hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using an iOS frontend and a no-code agent from the dashboard. sidebar_label: With iOS pagination_label: Agent Runtime with iOS keywords: - ai agent - no-code - voice interaction - real-time communication - ios sdk - swiftui image: img/videosdklive-thumbnail.jpg sidebar_position: 2 slug: with-ios --- import Step from '@site/src/components/Step' # Agent Runtime with iOS VideoSDK empowers you to integrate an AI voice agent into your iOS app within minutes. This guide shows you how to connect an iOS (SwiftUI) frontend with an AI agent created and configured entirely from the VideoSDK dashboard. ## Prerequisites - macOS with Xcode 15.0+ - iOS 13.0+ deployment target - Valid VideoSDK [Account](https://app.videosdk.live/) - Familiarity with creating a no-code voice agent. 
If you're new to this, please follow our guide on how to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)** first. :::important You need a VideoSDK account to generate a token and an agent from the dashboard. :::
### Step 1: Clone the sample project
Clone the repository to your local environment and move into the iOS quickstart folder.

```bash
git clone https://github.com/videosdk-live/agents-quickstart.git
cd agents-quickstart/mobile-quickstarts/ios/
```
### Step 2: Environment Configuration
### Create a Meeting Room Create a meeting room using the VideoSDK API: ```bash curl -X POST https://api.videosdk.live/v2/rooms \ -H "Authorization: YOUR_VIDEOSDK_AUTH_TOKEN" \ -H "Content-Type: application/json" ``` Use the returned `roomId` in your configuration files. ### Configuration Files Update the following files with your credentials. The Agent and Version IDs will be retrieved in a later step. **MeetingViewController.swift** (line 14): ```swift var token = "YOUR_VIDEOSDK_AUTH_TOKEN" // Add Your token here var agentId = "YOUR_AGENT_ID" var versionId = "YOUR_VERSION_ID" ``` **JoinScreenView.swift** (line 13): ```swift let meetingId: String = "YOUR_MEETING_ID" ```
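The `roomId` has to be copied out of the JSON response of the room-creation call. As a rough illustration, the JavaScript sketch below pulls it out of a raw response body and fails loudly when it is missing. The only detail taken from this guide is that the response contains a `roomId` field; the helper name `extractRoomId` and the error messages are hypothetical.

```javascript
// Sketch only: reads `roomId` from the raw response text of POST /v2/rooms.
// Assumes the body is JSON with a string `roomId` field, as described above.
function extractRoomId(responseText) {
  let parsed;
  try {
    parsed = JSON.parse(responseText);
  } catch (err) {
    throw new Error("Response is not valid JSON");
  }

  const roomId = parsed && parsed.roomId;
  if (typeof roomId !== "string" || roomId.length === 0) {
    throw new Error("Response does not contain a roomId");
  }
  return roomId;
}
```

Failing early here is a deliberate choice: an expired or malformed token typically yields an error body without `roomId`, and it is easier to notice that at this step than later as an agent that never joins.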
### Step 3: iOS Frontend Modifications
### Step 1: Add Connect Agent Button In `MeetingView.swift`, add a button to connect the agent. ```swift title="MeetingView.swift" // Add this button to your view hierarchy Button(action: { meetingVC.connectAgent() }) { Text("Connect Agent") } .disabled(meetingVC.isAgentConnected) ``` ### Step 2: Implement Connect Logic In `MeetingViewController.swift`, add the logic to call the dispatch API. ```swift title="MeetingViewController.swift" // Add state to track if the agent is connected @Published var isAgentConnected = false // ... func connectAgent() { guard let url = URL(string: "https://api.videosdk.live/v2/agent/general/dispatch") else { return } var request = URLRequest(url: url) request.httpMethod = "POST" request.setValue("application/json", forHTTPHeaderField: "Content-Type") request.setValue(token, forHTTPHeaderField: "Authorization") let body: [String: Any] = [ "agentId": agentId, "meetingId": room?.id ?? "", "versionId": versionId ] request.httpBody = try? JSONSerialization.data(withJSONObject: body) URLSession.shared.dataTask(with: request) { data, response, error in if let error = error { print("Connect error: \(error.localizedDescription)") return } if let httpResponse = response as? HTTPURLResponse, httpResponse.statusCode == 200 { DispatchQueue.main.async { self.isAgentConnected = true print("Agent connected successfully") } } else { print("Failed to connect agent") } }.resume() } ```
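Note that the Swift code above falls back to an empty string when `room?.id` is `nil`, so a dispatch request can be sent with a blank `meetingId`. One way to guard against that is to check the payload before sending it. The JavaScript sketch below shows the idea; the helper name `findMissingDispatchFields` is hypothetical, and treating all three fields as required is an assumption based on the request body used in this guide.

```javascript
// Sketch only: lists dispatch fields that are absent or blank.
// Assumption: agentId, meetingId and versionId must all be non-empty strings.
function findMissingDispatchFields(payload) {
  const required = ["agentId", "meetingId", "versionId"];
  return required.filter(
    (key) => typeof payload[key] !== "string" || payload[key].trim() === ""
  );
}
```

If the returned array is not empty, the app can skip the request and tell the user which value is missing instead of waiting for the API to reject the call.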
### Step 4: Creating the AI Agent from Dashboard (No-Code)
### Step 1: Create Your Agent

First, follow our detailed guide to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)**. This will walk you through creating the agent's persona, configuring its pipeline (Realtime or Cascading), and testing it directly from the dashboard.

### Step 2: Get Agent and Version ID

Once your agent is created, you need its `agentId` and `versionId` to connect it to your frontend application.

1. After creating your agent, go to the agent's page and find the JSON editor on the right side. Copy the `agentId`.
2. To get the `versionId`, click the three dots beside the **Deploy** button and select "Version History". Use the copy button next to the version you want to copy its version ID.

![Get agentId and versionId](https://strapi.videosdk.live/uploads/agent_version_id_0f8b59830a.png)

### Step 3: Configure IDs in Frontend

Now, update your `MeetingViewController.swift` file with these IDs.

```swift title="MeetingViewController.swift"
var agentId = "paste_your_agent_id_here"
var versionId = "paste_your_version_id_here"
```
### Step 5: Run the iOS Frontend
1. **Open Xcode:** ```bash open videosdk-agents-quickstart-ios.xcodeproj ``` 2. **Configure your development team:** - Select the project in Xcode - Go to "Signing & Capabilities" - Select your development team 3. **Build and run:** - Select your target device or simulator - Press `Cmd + R` to build and run
### Step 6: Connect and Interact
1. Join the meeting from the app and allow microphone permissions. 2. When you join, click the "Connect Agent" button to call the agent into the meeting. 3. Talk to the agent in real time. ## Troubleshooting ### Common Issues 1. **Build Errors:** - Ensure Xcode 15.0+ is installed - Check iOS deployment target (13.0+) - Verify VideoSDK package dependency 2. **Authentication Issues:** - Verify `VIDEOSDK_AUTH_TOKEN` in `MeetingViewController.swift` - Check token permissions include `allow_join` 3. **Meeting Connection Issues:** - Ensure `YOUR_MEETING_ID` is correct - Verify network connectivity - Check VideoSDK account status 4. **AI Agent Issues:** - Verify `agentId` and `versionId` are set correctly - Check for errors in the Xcode console when connecting the agent. --- --- title: Agent Runtime with React Native hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using a React Native frontend and a no-code agent from the dashboard. sidebar_label: With React Native pagination_label: Agent Runtime with React Native keywords: - ai agent - no-code - voice interaction - real-time communication - react native sdk image: img/videosdklive-thumbnail.jpg sidebar_position: 2 slug: with-react-native --- import Step from '@site/src/components/Step' import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Agent Runtime with React Native VideoSDK empowers you to integrate an AI voice agent into your React Native app (Android/iOS) within minutes. This guide shows you how to connect a React Native frontend with an AI agent created and configured entirely from the VideoSDK dashboard. ## Prerequisites - VideoSDK Developer Account (get token from the [dashboard](https://app.videosdk.live/api-keys)) - Node.js and a working React Native environment (Android Studio and/or Xcode) - Familiarity with creating a no-code voice agent. 
If you're new to this, please follow our guide on how to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)** first.

:::important
You need a VideoSDK token and an agent from the dashboard. Generate your VideoSDK token from the dashboard.
:::

## Project Structure

First, create an empty project folder using `mkdir folder_name` in your preferred location for the React Native frontend. Your final project structure should look like this:

```jsx title="Directory Structure"
root
├── android/
├── ios/
├── App.js
├── constants.js
└── index.js
```

You will work on:

- `android/`: Contains the Android-specific project files.
- `ios/`: Contains the iOS-specific project files.
- `App.js`: The main React Native component, containing the UI and meeting logic.
- `constants.js`: Stores the token, meetingId, and agent credentials for the frontend.
- `index.js`: The entry point of the React Native application, where VideoSDK is registered.

## Building the React Native Frontend
### Step 1: Create App and Install SDKs
Create a React Native app and install the VideoSDK RN SDK: ```bash npx react-native init videosdkAiAgentRN cd videosdkAiAgentRN # Install VideoSDK npm install "@videosdk.live/react-native-sdk" ```
### Step 2: Configure the Project
#### Android Setup

```xml title="android/app/src/main/AndroidManifest.xml"
```

```java title="android/app/build.gradle"
dependencies {
  implementation project(':rnwebrtc')
}
```

```gradle title="android/settings.gradle"
include ':rnwebrtc'
project(':rnwebrtc').projectDir = new File(rootProject.projectDir, '../node_modules/@videosdk.live/react-native-webrtc/android')
```

```java title="MainApplication.kt"
import live.videosdk.rnwebrtc.WebRTCModulePackage

class MainApplication : Application(), ReactApplication {
  override val reactNativeHost: ReactNativeHost =
    object : DefaultReactNativeHost(this) {
      override fun getPackages(): List<ReactPackage> {
        val packages = PackageList(this).packages.toMutableList()
        // Register the WebRTC module used by the VideoSDK
        packages.add(WebRTCModulePackage())
        return packages
      }
      // ...
    }
}
```

```java title="android/gradle.properties"
# This one fixes a weird WebRTC runtime problem on some devices.
android.enableDexingArtifactTransform.desugaring=false
```

```java title="android/app/proguard-rules.pro"
-keep class org.webrtc.** { *; }
```

```java title="android/build.gradle"
buildscript {
  ext {
    minSdkVersion = 23
  }
}
```

#### iOS Setup

To update CocoaPods, you can reinstall the gem using the following command:

```gem
$ sudo gem install cocoapods
```

```sh title="ios/Podfile"
pod 'react-native-webrtc', :path => '../node_modules/@videosdk.live/react-native-webrtc'
```

Change the platform field in the Podfile to 12.0 or above, because react-native-webrtc doesn't support iOS versions earlier than 12.0. Update the line to `platform :ios, '12.0'`.

After updating the version, install the pods by running the following command:

```sh
pod install
```

Add the following lines to your `Info.plist` file located at (project folder/ios/projectname/Info.plist):

```html title="ios/MyApp/Info.plist"
<key>NSCameraUsageDescription</key>
<string>Camera permission description</string>
<key>NSMicrophoneUsageDescription</key>
<string>Microphone permission description</string>
```
### Step 3: Register Service and Configure
Register VideoSDK services in your root `index.js` file for the initialization service. ```js title="index.js" import { AppRegistry } from "react-native"; import App from "./App"; import { name as appName } from "./app.json"; import { register } from "@videosdk.live/react-native-sdk"; register(); AppRegistry.registerComponent(appName, () => App); ``` Create a `constants.js` file to store your token, meeting ID, and agent credentials. ```js title="constants.js" export const token = "YOUR_VIDEOSDK_AUTH_TOKEN"; export const meetingId = "YOUR_MEETING_ID"; export const name = "User Name"; export const agentId = "YOUR_AGENT_ID"; export const versionId = "YOUR_VERSION_ID"; ```
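Because `constants.js` ships with placeholder strings, it is easy to start the app with a value that was never filled in. A small check like the following JavaScript sketch can flag that early. It assumes that unfilled values still start with `YOUR_`, as in the template above; the helper name `findPlaceholderKeys` is hypothetical.

```javascript
// Sketch only: returns the keys whose values still look like template placeholders.
// Assumption: placeholders start with "YOUR_", as in the constants.js template.
function findPlaceholderKeys(config) {
  return Object.keys(config).filter(
    (key) => typeof config[key] === "string" && config[key].startsWith("YOUR_")
  );
}
```

For example, calling it with the imported constants (`findPlaceholderKeys({ token, meetingId, agentId, versionId })`) before joining returns an array that is empty only when every value has been replaced.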
### Step 4: Build UI and wire up MeetingProvider
```js title="App.js" import React, { useState } from 'react'; import { SafeAreaView, TouchableOpacity, Text, View, FlatList, Alert, } from 'react-native'; import { MeetingProvider, useMeeting, } from '@videosdk.live/react-native-sdk'; import { meetingId, token, name, agentId, versionId } from './constants'; const Button = ({ onPress, buttonText, backgroundColor }) => { return ( {buttonText} ); }; function ControlsContainer({ join, leave, toggleMic }) { const [connected, setConnected] = useState(false); const connectAgent = async () => { try { const response = await fetch("https://api.videosdk.live/v2/agent/general/dispatch", { method: "POST", headers: { "Content-Type": "application/json", Authorization: token, }, body: JSON.stringify({ agentId: agentId, meetingId: meetingId, versionId: versionId }), }); if (response.ok) { Alert.alert("Agent connected successfully!"); setConnected(true); } else { Alert.alert("Failed to connect agent."); } } catch (error) { console.error("Error connecting agent:", error); Alert.alert("An error occurred while connecting the agent."); } }; return (
```
### Step 3: Configure the Frontend
Create a meeting room using the VideoSDK API: ```bash curl -X POST https://api.videosdk.live/v2/rooms \ -H "Authorization: YOUR_JWT_TOKEN_HERE" \ -H "Content-Type: application/json" ``` Copy the `roomId` from the response and configure it in `config.js`. You will get the Agent and Version IDs in the next section. ```js title="config.js" TOKEN = "your_videosdk_auth_token_here"; ROOM_ID = "YOUR_MEETING_ID"; AGENT_ID = "YOUR_AGENT_ID"; VERSION_ID = "YOUR_VERSION_ID"; ```
### Step 4: Implement Meeting Logic
In `index.js`, retrieve DOM elements, declare variables, and add the core meeting functionalities, including the logic to connect the agent. ```js title="index.js" // getting Elements from Dom const leaveButton = document.getElementById("leaveBtn"); const toggleMicButton = document.getElementById("toggleMicBtn"); const createButton = document.getElementById("createMeetingBtn"); const connectAgentButton = document.getElementById("connectAgentBtn"); const audioContainer = document.getElementById("audioContainer"); const textDiv = document.getElementById("textDiv"); // declare Variables let meeting = null; let meetingId = ""; let isMicOn = false; // Join Agent Meeting Button Event Listener createButton.addEventListener("click", async () => { document.getElementById("join-screen").style.display = "none"; textDiv.textContent = "Please wait, we are joining the meeting"; meetingId = ROOM_ID; initializeMeeting(); }); // Initialize meeting function initializeMeeting() { window.VideoSDK.config(TOKEN); meeting = window.VideoSDK.initMeeting({ meetingId: meetingId, name: "C.V.Raman", micEnabled: true, webcamEnabled: false, }); meeting.join(); meeting.localParticipant.on("stream-enabled", (stream) => { if (stream.kind === "audio") { setAudioTrack(stream, meeting.localParticipant, true); } }); meeting.on("meeting-joined", () => { textDiv.textContent = null; document.getElementById("grid-screen").style.display = "block"; document.getElementById("meetingIdHeading").textContent = `Meeting Id: ${meetingId}`; }); meeting.on("meeting-left", () => { audioContainer.innerHTML = ""; }); meeting.on("participant-joined", (participant) => { let audioElement = createAudioElement(participant.id); participant.on("stream-enabled", (stream) => { if (stream.kind === "audio") { setAudioTrack(stream, participant, false); audioContainer.appendChild(audioElement); } }); }); meeting.on("participant-left", (participant) => { let aElement = document.getElementById(`a-${participant.id}`); if (aElement) 
aElement.remove(); }); } // Create audio elements for participants function createAudioElement(pId) { let audioElement = document.createElement("audio"); audioElement.setAttribute("autoPlay", "false"); audioElement.setAttribute("playsInline", "true"); audioElement.setAttribute("controls", "false"); audioElement.setAttribute("id", `a-${pId}`); audioElement.style.display = "none"; return audioElement; } // Set audio track function setAudioTrack(stream, participant, isLocal) { if (stream.kind === "audio") { if (isLocal) { isMicOn = true; } else { const audioElement = document.getElementById(`a-${participant.id}`); if (audioElement) { const mediaStream = new MediaStream(); mediaStream.addTrack(stream.track); audioElement.srcObject = mediaStream; audioElement.play().catch((err) => console.error("audioElem.play() failed", err)); } } } } // Implement controls leaveButton.addEventListener("click", async () => { meeting?.leave(); document.getElementById("grid-screen").style.display = "none"; document.getElementById("join-screen").style.display = "block"; }); toggleMicButton.addEventListener("click", async () => { if (isMicOn) meeting?.muteMic(); else meeting?.unmuteMic(); isMicOn = !isMicOn; }); connectAgentButton.addEventListener("click", async () => { try { const response = await fetch("https://api.videosdk.live/v2/agent/general/dispatch", { method: "POST", headers: { "Content-Type": "application/json", Authorization: TOKEN, }, body: JSON.stringify({ agentId: AGENT_ID, meetingId: ROOM_ID, versionId: VERSION_ID }), }); if (response.ok) { alert("Agent connected successfully!"); connectAgentButton.style.display = "none"; } else { alert("Failed to connect agent."); } } catch (error) { console.error("Error connecting agent:", error); alert("An error occurred while connecting the agent."); } }); ``` ## Creating the AI Agent from Dashboard (No-Code) You can create and configure a powerful AI agent directly from the VideoSDK dashboard.
### Step 1: Create Your Agent
First, follow our detailed guide to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)**. This will walk you through creating the agent's persona, configuring its pipeline (Realtime or Cascading), and testing it directly from the dashboard.
### Step 2: Get Agent and Version ID
Once your agent is created, you need its `agentId` and `versionId` to connect it to your frontend application.

1. After creating your agent, go to the agent's page and find the JSON editor on the right side. Copy the `agentId`.
2. To get the `versionId`, click the three dots beside the Deploy button and select "Version History". Use the copy button next to the version you want to copy its version ID.

![Get agentId and versionId](https://strapi.videosdk.live/uploads/agent_version_id_0f8b59830a.png)
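These IDs end up in the JSON body of the dispatch request that the frontend's "Connect Agent" button sends. As a rough sketch, the same request can be assembled in Python to check the values before wiring them into the UI. `build_dispatch_request` is a hypothetical helper, and the placeholder check simply mirrors the placeholder strings used in this guide's config files.

```python
import json
import urllib.request

DISPATCH_URL = "https://api.videosdk.live/v2/agent/general/dispatch"

# Prefixes of the placeholder values shown in this guide's config files.
PLACEHOLDER_PREFIXES = ("YOUR_", "paste_your_")


def build_dispatch_request(token, agent_id, meeting_id, version_id):
    """Build (but do not send) the agent dispatch request."""
    fields = {
        "agentId": agent_id,
        "meetingId": meeting_id,
        "versionId": version_id,
    }
    for label, value in fields.items():
        if not value or value.startswith(PLACEHOLDER_PREFIXES):
            raise ValueError(f"{label} is missing or still a placeholder")
    return urllib.request.Request(
        DISPATCH_URL,
        data=json.dumps(fields).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": token,
        },
    )


if __name__ == "__main__":
    request = build_dispatch_request("token", "agent-123", "room-abc", "version-1")
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

Failing early on placeholder values avoids a round trip that can only end in a "Failed to connect agent" alert.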
### Step 3: Configure IDs in Frontend
Now, update your `config.js` file with these IDs. ```js title="config.js" TOKEN = "your_videosdk_auth_token_here"; ROOM_ID = "YOUR_MEETING_ID"; AGENT_ID = "paste_your_agent_id_here"; VERSION_ID = "paste_your_version_id_here"; ``` ## Run the Application
### Step 1: Start the Frontend
Once you have completed all the steps, serve your frontend files: ```bash # Using Python's built-in server python3 -m http.server 8000 # Or using Node.js http-server npx http-server -p 8000 ``` Open `http://localhost:8000` in your web browser.
### Step 2: Connect and Interact
1. **Join the meeting from the frontend:** - Click the "Join Agent Meeting" button in your browser. - Allow microphone permissions when prompted. 2. **Connect the agent:** - Once you join, click the "Connect Agent" button. - You should see an alert confirming the agent was connected. - The AI agent will join the meeting and greet you. 3. **Start playing:** - Interact with your AI agent using your microphone. ## Final Output You have completed the implementation of an AI agent with real-time voice interaction using VideoSDK and a no-code agent from the dashboard. ## Troubleshooting ### Common Issues: 1. **Agent not joining:** - Check that the `ROOM_ID`, `AGENT_ID`, and `VERSION_ID` are correctly set in `config.js`. - Verify your VideoSDK token is valid and has the necessary permissions. 2. **Audio not working:** - Check browser permissions for microphone access. 3. **"Failed to connect agent" error:** - Verify your `AGENT_ID` and `VERSION_ID` are correct. - Check the browser's developer console for any network errors. --- --- title: Agent Runtime with React hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using React frontend and a no-code backend. sidebar_label: With React pagination_label: Agent Runtime with React keywords: - ai agent - no-code - voice interaction - real-time communication - react sdk image: img/videosdklive-thumbnail.jpg sidebar_position: 2 slug: with-react --- import Step from '@site/src/components/Step' # Agent Runtime with React VideoSDK empowers you to seamlessly integrate AI agents with real-time voice interaction into your React application within minutes. This guide shows you how to connect a React frontend with an AI agent created and configured entirely from the VideoSDK dashboard. 
## Prerequisites

Before proceeding, ensure that your development environment meets the following requirements:

- Video SDK Developer Account (if you don't have one, sign up via the **[Video SDK Dashboard](https://app.videosdk.live/)**)
- Node.js installed on your device
- Familiarity with creating a no-code voice agent. If you're new to this, please follow our guide on how to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)** first.

:::important
You need a VideoSDK account to generate a token and to create an agent from the dashboard. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token.
:::

## Project Structure

Your project structure should look like this.

```jsx title="Project Structure"
root
├── node_modules
├── public
├── src
│    ├── config.js
│    ├── App.js
│    └── index.js
└── .env
```

You will be working on the following files:

- `App.js`: Responsible for creating a basic UI for joining the meeting
- `config.js`: Responsible for storing the token, room ID, and agent credentials
- `index.js`: The entry point of your React application

## Part 1: React Frontend
### Step 1: Getting Started with the Code!
#### Create new React App Create a new React App using the below command. ```bash $ npx create-react-app videosdk-ai-agent-react-app ``` #### Install VideoSDK Install the VideoSDK using the below-mentioned npm command. Make sure you are in your react app directory before you run this command. ```bash $ npm install "@videosdk.live/react-sdk" ```
### Step 2: Configure Environment and Credentials
Create a meeting room using the VideoSDK API: ```bash curl -X POST https://api.videosdk.live/v2/rooms \ -H "Authorization: YOUR_JWT_TOKEN_HERE" \ -H "Content-Type: application/json" ``` Copy the `roomId` from the response and configure it in `src/config.js`. You will get the Agent and Version IDs in the next section. ```js title="src/config.js" export const TOKEN = "YOUR_VIDEOSDK_AUTH_TOKEN"; export const ROOM_ID = "YOUR_MEETING_ID"; export const AGENT_ID = "YOUR_AGENT_ID"; export const VERSION_ID = "YOUR_VERSION_ID"; ```
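The `Authorization` header used above carries a JWT, so an expired token can be one reason a request is rejected. As a quick local sanity check, the sketch below decodes a token's payload and inspects its expiry. It assumes the token includes the standard `exp` claim (an assumption, not something this guide states) and it does not verify the signature. `decode_jwt_payload` and `is_expired` are hypothetical helper names.

```python
import base64
import json
import time
from typing import Optional


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a JWT: expected three dot-separated segments")
    segment = parts[1]
    segment += "=" * (-len(segment) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(segment))


def is_expired(token: str, now: Optional[float] = None) -> bool:
    """Return True when the token's exp claim is in the past."""
    exp = decode_jwt_payload(token).get("exp")
    if exp is None:
        return False  # no expiry claim present, so nothing to compare
    current = time.time() if now is None else now
    return current >= exp
```

This only reads the claims locally; whether the token is actually accepted is still decided by the VideoSDK API.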
### Step 3: Design the user interface (UI)
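Create the main App component with audio-only interaction in `src/App.js`. This includes the "Connect Agent" button.

```js title="src/App.js"
import React, { useEffect, useRef, useState } from "react";
import { MeetingProvider, MeetingConsumer, useMeeting, useParticipant } from "@videosdk.live/react-sdk";
import { TOKEN, ROOM_ID, AGENT_ID, VERSION_ID } from "./config";

// NOTE: The JSX markup in this file is reconstructed. The element structure,
// the "Leave" and "Toggle Mic" labels, and the MeetingProvider config values
// are assumptions; adjust them to match your app.

function ParticipantAudio({ participantId }) {
  const { micStream, micOn, isLocal, displayName } = useParticipant(participantId);
  const audioRef = useRef(null);

  // Attach the participant's mic track to the <audio> element whenever it changes.
  useEffect(() => {
    if (!audioRef.current) return;
    if (micOn && micStream) {
      const mediaStream = new MediaStream();
      mediaStream.addTrack(micStream.track);
      audioRef.current.srcObject = mediaStream;
      audioRef.current.play().catch(() => {});
    } else {
      audioRef.current.srcObject = null;
    }
  }, [micStream, micOn]);

  return (
    <div>
      <p>
        Participant: {displayName} | Mic: {micOn ? "ON" : "OFF"}
      </p>
      {/* Mute local playback to avoid hearing your own voice */}
      <audio ref={audioRef} autoPlay muted={isLocal} />
    </div>
  );
}

function Controls() {
  const { leave, toggleMic } = useMeeting();
  const [connected, setConnected] = useState(false);

  // Ask VideoSDK to dispatch the dashboard-configured agent into this meeting.
  const connectAgent = async () => {
    try {
      const response = await fetch("https://api.videosdk.live/v2/agent/general/dispatch", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: TOKEN,
        },
        body: JSON.stringify({ agentId: AGENT_ID, meetingId: ROOM_ID, versionId: VERSION_ID }),
      });
      if (response.ok) {
        alert("Agent connected successfully!");
        setConnected(true);
      } else {
        alert("Failed to connect agent.");
      }
    } catch (error) {
      console.error("Error connecting agent:", error);
      alert("An error occurred while connecting the agent.");
    }
  };

  return (
    <div>
      <button onClick={() => leave()}>Leave</button>
      <button onClick={() => toggleMic()}>Toggle Mic</button>
      {!connected && <button onClick={connectAgent}>Connect Agent</button>}
    </div>
  );
}

function MeetingView({ meetingId, onMeetingLeave }) {
  const [joined, setJoined] = useState(null);
  const { join, participants } = useMeeting({
    onMeetingJoined: () => setJoined("JOINED"),
    onMeetingLeft: onMeetingLeave,
  });

  const joinMeeting = () => {
    setJoined("JOINING");
    join();
  };

  return (
    <div>
      <h3>Meeting Id: {meetingId}</h3>
      {joined === "JOINED" ? (
        <div>
          <Controls />
          {[...participants.keys()].map((pid) => (
            <ParticipantAudio key={pid} participantId={pid} />
          ))}
        </div>
      ) : joined === "JOINING" ? (
        <p>Joining the meeting...</p>
      ) : (
        <button onClick={joinMeeting}>Join</button>
      )}
    </div>
  );
}

export default function App() {
  const [meetingId] = useState(ROOM_ID);

  const onMeetingLeave = () => {
    // no-op; simple sample
  };

  return (
    <MeetingProvider
      config={{
        meetingId,
        micEnabled: true, // audio-only: microphone on
        webcamEnabled: false, // audio-only: camera off
        name: "User", // assumed display name
      }}
      token={TOKEN}
    >
      <MeetingConsumer>
        {() => <MeetingView meetingId={meetingId} onMeetingLeave={onMeetingLeave} />}
      </MeetingConsumer>
    </MeetingProvider>
  );
}
```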
## Part 2: Creating the AI Agent from Dashboard (No-Code) You can create and configure a powerful AI agent directly from the VideoSDK dashboard.
### Step 1: Create Your Agent
First, follow our detailed guide to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)**. This will walk you through creating the agent's persona, configuring its pipeline (Realtime or Cascading), and testing it directly from the dashboard.
### Step 2: Get Agent and Version ID
Once your agent is created, you need its `agentId` and `versionId` to connect it to your frontend application.

1. After creating your agent, go to the agent's page and find the JSON editor on the right side. Copy the `agentId`.
2. To get the `versionId`, click the three dots beside the Deploy button and select "Version History". Use the copy button next to the version you want to copy its version ID.

![Get agentId and versionId](https://strapi.videosdk.live/uploads/agent_version_id_0f8b59830a.png)
### Step 3: Configure IDs in Frontend
Now, update your `src/config.js` file with these IDs. ```js title="src/config.js" export const TOKEN = "your_videosdk_auth_token_here"; export const ROOM_ID = "YOUR_MEETING_ID"; export const AGENT_ID = "paste_your_agent_id_here"; export const VERSION_ID = "paste_your_version_id_here"; ``` ## Part 3: Run the Application
### Step 1: Run the Frontend
Once you have completed all the steps mentioned above, start your React application: ```bash # Install dependencies npm install # Start the development server npm start ``` Open `http://localhost:3000` in your web browser.
### Step 2: Connect and Interact
1. **Join the meeting from the React app:** - Click the "Join" button in your browser - Allow microphone permissions when prompted 2. **Connect the agent:** - Once you join, click the "Connect Agent" button. - You should see an alert confirming the agent was connected. - The AI agent will join the meeting and greet you. 3. **Start playing:** - Interact with your AI agent using your microphone. ## Final Output You have completed the implementation of an AI agent with real-time voice interaction using VideoSDK and a no-code agent from the dashboard in React. ## Troubleshooting ### Common Issues: 1. **Agent not joining:** - Check that the `ROOM_ID`, `AGENT_ID`, and `VERSION_ID` are correctly set in `src/config.js`. - Verify your VideoSDK token is valid and has the necessary permissions. 2. **Audio not working:** - Check browser permissions for microphone access. 3. **"Failed to connect agent" error:** - Verify your `AGENT_ID` and `VERSION_ID` are correct. - Check the browser's developer console for any network errors. 4. **React build issues:** - Ensure Node.js version is compatible - Try clearing npm cache: `npm cache clean --force` - Delete `node_modules` and reinstall: `rm -rf node_modules && npm install` --- --- title: SIP hide_title: false hide_table_of_contents: false description: " A framework for creating AI-powered voice agents using VideoSDK and various SIP providers" pagination_label: "VideoSDK AI SIP Framework" keywords: - AI Agent SDK - VideoSDK Agents - SIP - Trunking - Python SDK - Voice AI - Real-time Communication - AI Integration - VideoSDK Cloud - Development sidebar_label: SIP slug: sip --- import Step from '@site/src/components/Step' import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # VideoSDK AI SIP Framework A production-ready framework for creating AI-powered voice agents using VideoSDK and various SIP providers (e.g., Twilio). 
This framework enables you to build and deploy sophisticated conversational AI agents that can handle both inbound and outbound phone calls with natural language processing. ## How It Works The framework simplifies a complex process into a manageable workflow. Here’s a high-level overview of the architecture: 1. **Phone Call**: A user calls a phone number you have acquired from a SIP provider (like Twilio, Plivo, etc.). 2. **SIP Provider**: The provider receives the call and sends a webhook notification to your application server. 3. **Your Application Server**: This is the application you build using this framework. * It receives the webhook. * It uses the `SIPManager` to create a secure VideoSDK room for the call. * It launches your custom AI Agent. * It responds to the SIP provider with instructions (e.g., TwiML) to forward the call's audio into the VideoSDK room. 4. **VideoSDK & AI Agent**: Your AI Agent joins the room, receives the live audio from the phone call, processes it using your chosen AI models (for speech-to-text, language understanding, and text-to-speech), and responds in real-time to create a seamless, interactive conversation. --- ## Prerequisites Before you get started, ensure you have the following: ### System Requirements - **Python**: 3.11 or higher - **Network**: Public internet access for webhook delivery ### Required Credentials - **VideoSDK Credentials**: Sign up at [app.videosdk.live](https://app.videosdk.live/) to get your token and SIP credentials. ![VideoSDK SIP Credentials](https://strapi.videosdk.live/uploads/sip_dashboard_screenshot_8025aba2ec.png) - **SIP Provider Account**: Obtain provider-specific credentials. - **AI Model Provider**: An account with Google, OpenAI, or another supported provider. --- ## Get Started ### 1. 
Installation

Create and activate a virtual environment.

On macOS/Linux:

```bash
python3 -m venv venv
source venv/bin/activate
```

On Windows:

```bash
python3 -m venv venv
venv\Scripts\activate
```

Install the core framework:

```bash
pip install videosdk-plugins-sip
```

Install plugins for your chosen AI services (e.g., Google):

```bash
pip install videosdk-plugins-google
```

### 2. Environment Configuration

Your agent requires credentials for both VideoSDK and your chosen SIP provider. You can provide these through environment variables (recommended) or directly in your code.

Create a `.env` file in your project's root directory and add your credentials to it.

#### **VideoSDK Credentials (Required)**

These are essential for the framework to function.

```ini
VIDEOSDK_AUTH_TOKEN=your_videosdk_jwt_token
VIDEOSDK_SIP_USERNAME=your_videosdk_sip_username
VIDEOSDK_SIP_PASSWORD=your_videosdk_sip_password
```

#### **AI Model Credentials (Required)**

Add the API key for your chosen AI provider.

```ini
GOOGLE_API_KEY=your_google_api_key_here
```

#### **SIP Provider Credentials**

Fill in the details for the provider you will be using. The framework will automatically use the correct variables based on the `SIP_PROVIDER` you set.

Get your credentials from the [Twilio console](https://console.twilio.com/dashboard).

```ini
SIP_PROVIDER=twilio
TWILIO_ACCOUNT_SID=ACxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TWILIO_AUTH_TOKEN=your_auth_token
TWILIO_PHONE_NUMBER=+1234567890
```

Copy the example environment file and populate it with your credentials.
```bash cp env.example .env ``` Now, edit the `.env` file: ```ini # VideoSDK Configuration VIDEOSDK_AUTH_TOKEN=your_videosdk_jwt_token VIDEOSDK_SIP_USERNAME=your_videosdk_sip_username VIDEOSDK_SIP_PASSWORD=your_videosdk_sip_password # AI Model Configuration (Example for Google Gemini) GOOGLE_API_KEY=your_google_api_key # Provider Selection (currently, 'twilio' is supported) SIP_PROVIDER=twilio # Twilio Configuration TWILIO_ACCOUNT_SID=your_twilio_account_sid TWILIO_AUTH_TOKEN=your_twilio_auth_token TWILIO_PHONE_NUMBER=+1234567890 ``` ## AI Agent and SIP Setup Here’s how to structure your application.
### Step 1: Initialize the SIP Manager
The `create_sip_manager` function is the main entry point. It establishes the connection to your SIP provider by reading the environment variables you configured. ```python import os from dotenv import load_dotenv from videosdk.plugins.sip import create_sip_manager # Load variables from the .env file load_dotenv() # This function reads your .env variables and configures the correct provider sip_manager = create_sip_manager( provider=os.getenv("SIP_PROVIDER"), videosdk_token=os.getenv("VIDEOSDK_AUTH_TOKEN"), # The provider_config dictionary passes provider-specific environment variables. provider_config={ # Twilio "account_sid": os.getenv("TWILIO_ACCOUNT_SID"), "auth_token": os.getenv("TWILIO_AUTH_TOKEN"), "phone_number": os.getenv("TWILIO_PHONE_NUMBER"), } ) ```
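Because `os.getenv` returns `None` for anything that is unset, a missing variable would only surface later as a provider or authentication error. A small pre-flight check, sketched below, can report missing names before `create_sip_manager` is called. `missing_env_vars` is a hypothetical helper, and the variable lists simply mirror the `.env` examples in this guide.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the .env examples in this guide.
REQUIRED_VARS = {
    "common": [
        "SIP_PROVIDER",
        "VIDEOSDK_AUTH_TOKEN",
        "VIDEOSDK_SIP_USERNAME",
        "VIDEOSDK_SIP_PASSWORD",
        "GOOGLE_API_KEY",
    ],
    "twilio": [
        "TWILIO_ACCOUNT_SID",
        "TWILIO_AUTH_TOKEN",
        "TWILIO_PHONE_NUMBER",
    ],
}


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    provider = (env.get("SIP_PROVIDER") or "").lower()
    names = list(REQUIRED_VARS["common"]) + REQUIRED_VARS.get(provider, [])
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set.")
```

Run it after `load_dotenv()` so that values from the `.env` file are visible in `os.environ`.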
### Step 2: Define Your Agent's Pipeline
The pipeline defines which AI models your agent uses. Here, we are using Google's Gemini for a [Real-time Pipeline](https://docs.videosdk.live/ai_agents/core-components/realtime-pipeline). You could also use a [Cascading Pipeline](https://docs.videosdk.live/ai_agents/core-components/cascading-pipeline).

```python
import os

from videosdk.agents import RealTimePipeline
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig


def create_agent_pipeline():
    """This creates the AI model pipeline for our agent."""
    model = GeminiRealtime(
        api_key=os.getenv("GOOGLE_API_KEY"),
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        config=GeminiLiveConfig(
            voice="Leda",  # Choose your desired voice
            response_modalities=["AUDIO"],  # We want the agent to speak back
        ),
    )
    return RealTimePipeline(model=model)
```
### Step 3: Define Your Agent's Personality and Tools
The `Agent` class defines the system prompt (instructions), personality, and custom [function tools](https://docs.videosdk.live/ai_agents/core-components/agent) and [MCP Servers](https://docs.videosdk.live/ai_agents/mcp-integration) that your agent can use.

```python
import asyncio
from typing import Optional

from videosdk.agents import Agent, function_tool, JobContext


class SIPAIAgent(Agent):
    """An AI agent for handling voice calls."""

    def __init__(self, ctx: Optional[JobContext] = None):
        super().__init__(
            instructions="You are a friendly and helpful voice assistant. Keep your responses concise.",
            tools=[self.end_call],  # You can also integrate other function tools and MCP Servers here.
        )
        self.ctx = ctx
        self.greeting_message = "Hello! Thank you for calling. How can I assist you today?"

    async def on_enter(self) -> None:
        pass

    async def greet_user(self) -> None:
        """Greets the user with the message defined above."""
        await self.session.say(self.greeting_message)

    async def on_exit(self) -> None:
        pass

    @function_tool
    async def end_call(self) -> str:
        """Say goodbye when the caller asks to end the call."""
        # Minimal placeholder so that `tools=[self.end_call]` resolves.
        # Add your own hang-up or session teardown logic here.
        await self.session.say("Thank you for calling. Goodbye!")
        return "Call ended"
```

## Server Setup and Deployment

Your application must be accessible from the public internet so that your SIP provider can send it webhooks. You have two main options for this.

For testing on your local machine, `ngrok` is a convenient tool. It creates a secure, public URL that tunnels directly to your local server. The `lifespan` manager in our example code handles this for you automatically. When you start the server, it will generate a unique URL and automatically configure the `SIPManager` with it.
**Code Snippet (FastAPI Lifespan Manager):** ```python import os import logging from contextlib import asynccontextmanager from fastapi import FastAPI from pyngrok import ngrok logger = logging.getLogger(__name__) @asynccontextmanager async def lifespan(app: FastAPI): """Lifespan manager for FastAPI app startup and shutdown.""" port = int(os.getenv("PORT", 8000)) try: ngrok.kill() ngrok_auth_token = os.getenv("NGROK_AUTHTOKEN") if ngrok_auth_token: ngrok.set_auth_token(ngrok_auth_token) tunnel = ngrok.connect(port, "http") # The Base URL is generated here sip_manager.set_base_url(tunnel.public_url) logger.info(f"NGROK TUNNEL CREATED: {tunnel.public_url}") except Exception as e: logger.error(f"Failed to start ngrok tunnel: {e}") yield try: ngrok.kill() logger.info("Ngrok tunnel closed") except Exception as e: logger.error(f"Error closing ngrok tunnel: {e}") app = FastAPI(title="SIP AI Agent", lifespan=lifespan) ``` For a live application, you will deploy your code to a cloud server (e.g., AWS EC2, Google Cloud Run, Heroku) that has a permanent public IP address or domain name. In this case, you should **not** use the `ngrok` `lifespan` manager. Instead, set the base URL directly in your code. **Code Snippet (Cloud Server Setup):** ```python from fastapi import FastAPI # Your FastAPI app for production app = FastAPI(title="SIP AI Agent") # IMPORTANT: Set your server's public URL before starting the app. # This should be the actual domain where your service is hosted. PUBLIC_URL = "https://api.your-public-url.com" sip_manager.set_base_url(PUBLIC_URL) ``` :::note You must configure your SIP provider's webhook to point to `https://your-public-or-ngrok-url.com/webhook/incoming`. ::: ## API Endpoint Guide Your application server, powered by the `sip` framework, exposes a set of endpoints for controlling and monitoring calls. --- ### `POST /webhook/incoming` This is the **most important endpoint for handling inbound calls**. 
When a user calls your SIP provider's phone number, the provider sends an HTTP request (a webhook) to this URL.

* **Purpose**: To serve as the primary entry point for all incoming phone calls.
* **Provider Configuration**: You **must** configure this full URL in your SIP provider's dashboard for your phone number.
* **Core Process**:
    1. Receives the webhook from the SIP provider.
    2. Creates a new VideoSDK room for the call.
    3. Launches your `SIPAIAgent` in a separate process, which then waits in the room.
    4. Responds to the provider with instructions (XML-based TwiML/ExoML) detailing how to forward the call's audio stream to the newly created room's SIP address.

---

### `POST /call/make`

This endpoint allows you to **programmatically initiate an outbound call** from your agent to a user's phone number.

```bash
# Replace with the destination phone number.
# The leading "+" is URL-encoded as %2B; a raw "+" in a query string is typically decoded as a space.
curl -X POST "http://localhost:8000/call/make?to_number=%2B1234567890"
```

* **Purpose**: To start new conversations with users. Ideal for automated reminders, lead qualification, or proactive support.
* **Query Parameters**:

| Parameter | Type | Description | Required |
| :--- | :--- | :--- | :--- |
| `to_number` | `string` | The full phone number to call, in E.164 format (e.g., `+15551234567`). | Yes |

* **Core Process (Outbound Call Flow)**:
    1. Your request hits the endpoint.
    2. The `SIPManager` creates a VideoSDK room and immediately launches your `SIPAIAgent`. The agent then waits in the room.
    3. The manager sends an API request to your SIP provider (e.g., Twilio), instructing it to call the `to_number`.
    4. Crucially, it provides the SIP provider with a unique webhook URL for this specific call: `https:///sip/answer/{room_id}`.
    5. When the user answers their phone, the SIP provider sends a webhook to that unique answer URL to connect the user to the waiting agent.

---

### `POST /sip/answer/{room_id}`

This is an **internal-facing endpoint** designed to complete the outbound call loop.
You will not call this endpoint directly. * **Purpose**: To serve as the dynamic "answer URL" for outbound calls. * **Path Parameters**: | Parameter | Type | Description | | :--- | :--- | :--- | | `room_id` | `string` | The unique ID of the VideoSDK room where the agent is waiting. | * **Core Process**: 1. This endpoint is called by the SIP provider *only after* the user answers an outbound call initiated by `/call/make`. 2. It uses the `room_id` to find the correct SIP address for the room where the agent is waiting. 3. It returns a simple TwiML/XML response that tells the provider how to bridge the just-answered call with the agent. --- ### `GET /sessions` A simple utility endpoint for **monitoring the health and status** of your service. * **Purpose**: To see how many calls are currently active. * **Core Process**: 1. Receives a simple `GET` request. 2. Checks the `SIPManager`'s internal state. 3. Returns a count of active sessions and a list of their corresponding room IDs. --- :::tip If you experience high latency when connecting a call, it may be due to a mismatch between the geographical region of your VideoSDK meeting server (which defaults to the nearest server region to you) and your SIP provider's region. To reduce latency, upgrade to an enterprise plan and set `VIDEOSDK_REGION=sip_provider_region` in your `.env` file for a low-latency experience. ::: --- --- title: Playground hide_title: false hide_table_of_contents: false description: "Test and interact with your VideoSDK AI agents in real-time using Playground mode. Learn how to enable the interactive testing environment for rapid development and debugging of voice AI agents." 
pagination_label: "Playground" keywords: - AI Agent SDK - VideoSDK Agents - Playground - Testing - Python SDK - Voice AI - Real-time Communication - AI Integration - VideoSDK Cloud - Development - Debugging image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Playground slug: playground
---

# Agents Playground

The Agents Playground provides an interactive testing environment where you can directly communicate with your AI agents during development. This feature enables rapid prototyping, testing, and debugging of your voice AI implementations without needing a separate client application.

## Overview

Playground mode creates a web-based interface that connects directly to your agent session, allowing you to:

- Test your agent in real time
- Demonstrate agent capabilities to stakeholders

## Enabling Playground Mode

To activate playground mode, set `playground=True` in the `RoomOptions` you pass to `JobContext`.

### Basic Implementation

```python
from videosdk.agents import RoomOptions, JobContext, WorkerJob


async def entrypoint(ctx: JobContext):
    # Your agent implementation here
    # This is where you create your pipeline, agent, and session
    pass


def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",
        name="Test Agent",
        playground=True  # Enable playground mode
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
```

## Accessing the Playground

Once your agent session starts, the playground URL will be displayed in your terminal:

```
Agent started in playground mode
Interact with agent here at: https://playground.videosdk.live?token={auth_token}&meetingId={meeting_id}
```

### URL Structure

The playground URL follows this format:

```
https://playground.videosdk.live?token={auth_token}&meetingId={meeting_id}
```

Where:

- `auth_token`: The `videosdk_auth` value provided in the session context or in the env file.
- `meeting_id`: The meeting ID specified in session context. **Note**: Playground mode is designed for development and testing purposes. For production deployments, ensure playground mode is disabled to maintain security and performance. --- --- title: AI Agent with Flutter - Quick Start hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using Flutter frontend and Python agent. sidebar_label: Flutter pagination_label: AI Agent with Flutter - Quick Start keywords: - ai agent - voice interaction - real-time communication - flutter sdk - python agent image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-flutter --- import AiAgentQuickStartFlutter from '@site/mdx/\_ai-agent-quick-start-flutter.mdx'; --- --- title: AI Agent with iOS - Quick Start hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using iOS Swift frontend and Python backend. sidebar_label: iOS pagination_label: AI Agent with iOS - Quick Start keywords: - ai agent - voice interaction - real-time communication - ios sdk - python backend - swift image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-ios --- import AiAgentQuickStartiOS from '@site/mdx/\_ai-agent-quick-start-ios.mdx'; --- --- title: AI Agent with IoT - Quick Start hide_title: false hide_table_of_contents: false description: Integrate a real-time AI agent with an ESP32 device using VideoSDK, enabling voice-based interaction through Google Gemini Live API. 
pagination_label: AI Agent with IoT - Quick Start keywords: - iot - esp32 - ai agent - videosdk - real-time communication image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-iot sidebar_label: Physical AI (IoT) --- import AiAgentQuickStartIoT from '@site/mdx/\_ai-agent-quick-start-iot.mdx'; --- --- title: AI Agent with JavaScript - Quick Start hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using JavaScript frontend. sidebar_label: JavaScript pagination_label: AI Agent with JavaScript - Quick Start keywords: - ai agent - voice interaction - real-time communication - javascript sdk image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-js --- import AiAgentQuickStartJS from '@site/mdx/\_ai-agent-quick-start-js.mdx'; --- --- title: AI Agent with React Native - Quick Start hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using a React Native frontend and Python backend. sidebar_label: React Native pagination_label: AI Agent with React Native - Quick Start keywords: - ai agent - voice interaction - real-time communication - react native sdk - python backend image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-react-native --- import AiAgentQuickStartReactNative from '@site/mdx/\_ai-agent-quick-start-react-native.mdx'; --- --- title: AI Agent with React - Quick Start hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using React frontend and Python backend. 
sidebar_label: React pagination_label: AI Agent with React - Quick Start keywords: - ai agent - voice interaction - real-time communication - react sdk - python backend image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-react --- import AiAgentQuickStartReact from '@site/mdx/\_ai-agent-quick-start-react.mdx'; --- --- title: AI Agent with Unity - Quick Start hide_title: false hide_table_of_contents: false description: Integrate a real-time AI agent with Unity using VideoSDK, enabling voice-based interaction through Google Gemini Live API. sidebar_label: Unity pagination_label: AI Agent with Unity - Quick Start keywords: - unity - ai agent - videosdk - real-time communication image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-unity --- import AiAgentQuickStartUnity from '@site/mdx/\_ai-agent-quick-start-unity.mdx'; --- --- title: AI Telephony Agent Quick Start hide_title: false hide_table_of_contents: false description: "A comprehensive guide to creating a fully functional AI telephony agent using VideoSDK Agent SDK. Learn how to run the agent locally, connect it to the global telephone network using SIP, and enable it to handle both inbound and outbound phone calls." pagination_label: "AI Telephony Agent Quick Start" keywords: - AI Telephony Agent - Quick Start - VideoSDK Agents - AI Agent SDK - Python - SIP - Telephony - Phone Calls - Inbound Calls - Outbound Calls - Gemini - Google API - Voice Integration image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: AI Telephony Agent slug: ai-phone-agent-quick-start --- import TelephonyQuickStart from '@site/mdx/\_ai-telephony-agent-quick-start.mdx'; --- --- title: AI Voice Agent Quick Start hide_title: false hide_table_of_contents: false description: "A step-by-step guide to quickly integrate an AI-powered voice agent into your VideoSDK meetings using the AI Agent SDK. 
Covers prerequisites, installation, custom agent creation, function tools, pipeline setup, and session management." pagination_label: "AI Voice Agent Quick Start" keywords: - AI Voice Agent - Quick Start - VideoSDK Agents - AI Agent SDK - Python - OpenAI - Gemini - Live API - Speech To Speech - Amazon Nova Sonic - AWS Nova Sonic - Function Tools - Realtime AI - Voice Integration - VideoSDK Meetings image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: AI Voice Agent slug: voice-agent-quick-start --- import Step from '@site/src/components/Step' import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import AllSDKCard from '@site/src/components/AllSDKCard' # AI Voice Agent Quick Start Get started with VideoSDK Agents in minutes. This guide covers both Realtime (speech-to-speech) and Cascaded (STT-LLM-TTS) pipeline implementations. ## Prerequisites Before you begin, ensure you have: - A VideoSDK authentication token (generate from [app.videosdk.live](https://app.videosdk.live)), follow to guide to [generate videosdk token](/ai_agents/authentication-and-token) - A VideoSDK meeting ID (you can generate one using the [Create Room API](https://docs.videosdk.live/api-reference/realtime-communication/create-room) or through the VideoSDK dashboard) - Python 3.12 or higher ## Understanding the Architecture Before diving into implementation, let's understand the two main pipeline architectures available: **Realtime Pipeline** provides direct speech-to-speech processing with minimal latency: ![Realtime Pipeline Architecture](https://cdn.videosdk.live/website-resources/docs-resources/videosdk_realtime_pipeline.png) The realtime pipeline processes audio directly through a unified model that handles: - **User Voice Input** → **Speech to Speech model** → **Agent Voice Output** This approach offers the fastest response times and is ideal for real-time conversations. 
**Cascading Pipeline** processes audio through distinct stages for maximum control: ![Cascading Pipeline Architecture](https://cdn.videosdk.live/website-resources/docs-resources/videosdk_casading_pipeline.png) The cascading pipeline processes audio through three sequential stages: - **User Voice Input** → **STT (Speech-to-Text)** → **LLM (Large Language Model)** → **TTS (Text-to-Speech)** → **Agent Voice Output** This approach provides better control over each processing stage and supports more complex AI reasoning. ## Installation Create and activate a virtual environment with Python 3.12 or higher: ```js python3.12 -m venv venv source venv/bin/activate ``` ```js python -m venv venv venv\Scripts\activate ``` ```bash pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]" ``` > Want to use a different provider? Check out our plugins for [STT](https://docs.videosdk.live/ai_agents/plugins/stt/openai), [LLM](https://docs.videosdk.live/ai_agents/plugins/llm/openai), and [TTS](https://docs.videosdk.live/ai_agents/plugins/tts/eleven-labs). ```bash pip install videosdk-agents # Choose your real-time provider: # For OpenAI pip install "videosdk-plugins-openai" # For Gemini (LiveAPI) pip install "videosdk-plugins-google" # For AWS Nova pip install "videosdk-plugins-aws" ``` ## Environment Setup It's recommended to use environment variables for secure storage of API keys, secret tokens, and authentication tokens. 
Create a `.env` file in your project root.

For the cascading pipeline:

```shell title=".env"
DEEPGRAM_API_KEY="Your Deepgram API Key"
OPENAI_API_KEY="Your OpenAI API Key"
ELEVENLABS_API_KEY="Your ElevenLabs API Key"
VIDEOSDK_AUTH_TOKEN="VideoSDK Auth token"
```

> **API Keys** - Get API keys from [Deepgram ↗](https://console.deepgram.com/), [OpenAI ↗](https://platform.openai.com/api-keys), [ElevenLabs ↗](https://elevenlabs.io/app/settings/api-keys), and the [VideoSDK Dashboard ↗](https://app.videosdk.live/api-keys). Follow the guide to [generate a VideoSDK token](/ai_agents/authentication-and-token).

For the realtime pipeline:

```bash title=".env"
VIDEOSDK_AUTH_TOKEN="VideoSDK Auth token"
OPENAI_API_KEY="Your OpenAI API Key"

# For Google Live API
# GOOGLE_API_KEY="Google Live API Key"

# For AWS Nova API
# AWS_ACCESS_KEY_ID="AWS Key Id"
# AWS_SECRET_ACCESS_KEY="AWS Secret Key"
# AWS_DEFAULT_REGION="AWS Region"
```

> **API Keys** - Get API keys from [OpenAI ↗](https://platform.openai.com/api-keys), [Gemini ↗](https://aistudio.google.com/app/apikey), or [AWS Nova Sonic ↗](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html), and from the [VideoSDK Dashboard ↗](https://app.videosdk.live/api-keys). Follow the guide to [generate a VideoSDK token](/ai_agents/authentication-and-token).
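A missing key usually surfaces only when the first provider call fails, so it can help to check the environment before starting the agent. The following is a minimal sketch using only the standard library. It assumes the values from `.env` have already been loaded into the process environment (for example by your shell or a dotenv loader). The helper names `missing_env_vars` and `require_env` are hypothetical and not part of the SDK.

```python
import os
from typing import Iterable, Mapping, Optional

# Keys used by the cascading pipeline example in this guide.
CASCADING_KEYS = (
    "VIDEOSDK_AUTH_TOKEN",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "ELEVENLABS_API_KEY",
)


def missing_env_vars(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    source = os.environ if env is None else env
    return [name for name in required if not source.get(name, "").strip()]


def require_env(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> None:
    """Raise RuntimeError listing every missing variable (hypothetical helper)."""
    missing = missing_env_vars(required, env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_env(CASCADING_KEYS)` at the top of `main.py` would stop the script early with a single message that lists every missing key.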
### Step 1: Creating a Custom Agent
First, let's create a custom voice agent by inheriting from the base `Agent` class.

**Cascading pipeline:**

```python title="main.py"
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from typing import AsyncIterator

# Pre-downloading the Turn Detector model
pre_download_model()

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```

**Realtime pipeline:**

```python title="main.py"
import asyncio, os
from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig
from openai.types.beta.realtime.session import TurnDetection

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```

This code defines a basic voice agent with:

- Custom instructions that define the agent's personality and capabilities
- An entry message spoken when the agent joins a meeting
- An exit message spoken when the agent leaves the meeting
### Step 2: Assembling and Starting the Agent Session
The pipeline connects your agent to an AI model.

**Cascading pipeline:**

```python title="main.py"
async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

**Realtime pipeline:**

```python title="main.py"
async def start_session(context: JobContext):
    # Initialize Model
    model = OpenAIRealtime(
        model="gpt-realtime-2025-08-28",
        config=OpenAIRealtimeConfig(
            voice="alloy",  # Available voices: alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, and verse
            modalities=["text", "audio"],
            turn_detection=TurnDetection(
                type="server_vad",
                threshold=0.5,
                prefix_padding_ms=300,
                silence_duration_ms=200,
            )
        )
    )

    # Create pipeline
    pipeline = RealTimePipeline(
        model=model
    )

    session = AgentSession(
        agent=MyVoiceAgent(),
        pipeline=pipeline
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Realtime Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
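The `try`/`finally` shape of `start_session` is what ensures cleanup runs once the session stops. The sketch below reproduces only that control flow with the standard library, so it can be run without the SDK. `FakeSession` is a hypothetical stand-in for `AgentSession`, and an explicit stop event replaces manual termination so the example finishes on its own.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for AgentSession that records lifecycle calls."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    async def start(self) -> None:
        self.calls.append("start")

    async def close(self) -> None:
        self.calls.append("close")


async def run_until_stopped(session: FakeSession, stop: asyncio.Event) -> None:
    """Start the session, block until `stop` is set, then always clean up."""
    try:
        await session.start()
        await stop.wait()  # plays the role of `await asyncio.Event().wait()`
    finally:
        await session.close()  # runs on normal exit, errors, and cancellation


async def demo() -> list[str]:
    session = FakeSession()
    stop = asyncio.Event()
    task = asyncio.create_task(run_until_stopped(session, stop))
    await asyncio.sleep(0)  # give the task a chance to start the session
    stop.set()              # simulate the session being terminated
    await task
    return session.calls
```

Running `asyncio.run(demo())` records `start` followed by `close`. In the quick start the wait never returns by itself, so the `finally` block runs when the task is cancelled or the process is interrupted.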
### Step 3: Running the Project
Once setup is complete, run your AI Voice Agent project with Python. Make sure your `.env` file is properly configured and all dependencies are installed.

To test quickly, run the agent in console mode. You interact with the agent directly through the terminal, speaking and listening through your local system, without joining a meeting room:

```bash
python main.py console
```

![Console Mode](https://cdn.videosdk.live/website-resources/docs-resources/ai_agents_console_mode_image.png)

Learn more about [Console Mode](/ai_agents/console_mode).

To run the agent in a room, start the script without arguments:

```bash
python main.py
```

Once you run this command, a playground URL will appear in your terminal. You can use this URL to interact with your AI agent.
### Step 4: Connecting with VideoSDK Client Applications
When working with a Client SDK, create the room first using the [Create Room API](https://docs.videosdk.live/api-reference/realtime-communication/create-room). Then pass the generated room ID to both your client SDK and the `RoomOptions` of your AI agent so that they connect to the same session.

:::tip
Get started quickly with the [Quick Start Example](https://github.com/videosdk-live/agents-quickstart/) for the VideoSDK AI Agent SDK. It contains everything you need to build your first AI agent fast.
:::

---

---
title: Authentication and Token | Video SDK
hide_title: true
hide_table_of_contents: false
description: Video SDK and Audio SDK, developers need to implement a token server. This requires efforts on both the front-end and backend.
sidebar_label: Authentication and Tokens
pagination_label: Authentication and Tokens
keywords:
  - audio calling
  - video calling
  - real-time communication
  - collaboration
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
slug: authentication-and-token
---

# Why we use JWT-based tokens

Token-based authentication lets users verify their identity with a token generated from an API key and secret. We use JWTs because token-based authentication is **widely used** in modern web applications and APIs and offers several benefits over traditional authentication. For example, it can **reduce the risk of the credentials being misused**, and it allows for **more fine-grained control** over access to resources. Additionally, tokens can be easily revoked or expired, making it easier to manage access rights.

## How to generate a token

To manage secured communication, every participant that connects to the meeting needs an access token. You can generate this token using your `apiKey` and `secret-key`, which you can get from the [VideoSDK Dashboard](https://app.videosdk.live/api-keys).

### 1. Generating token from Dashboard

For **testing or development**, you can generate a temporary token from the [VideoSDK Dashboard's API section](https://app.videosdk.live/api-keys).

import ReactPlayer from "react-player";
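Outside the dashboard, a token can also be produced in code by signing a payload with your secret. The following is a minimal sketch of HS256 signing using only the Python standard library, so that each step of the JWT format is visible. It assumes the simplified AI agent payload (`apikey` and `permissions`) described on this page, plus the standard `iat` and `exp` claims for expiry. The function names are hypothetical, and in practice a maintained JWT library is usually the safer choice.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(payload: dict, secret: str) -> str:
    """Return a compact JWT (header.payload.signature) signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"


def agent_token(
    api_key: str, secret: str, ttl_seconds: int = 3600, now: Optional[int] = None
) -> str:
    """Build a short-lived agent token (hypothetical helper, not an SDK function)."""
    issued_at = int(time.time()) if now is None else now
    payload = {
        "apikey": api_key,
        "permissions": ["allow_join"],
        "iat": issued_at,                # standard "issued at" claim
        "exp": issued_at + ttl_seconds,  # standard expiry claim
    }
    return sign_hs256(payload, secret)
```

Keep the secret on your server and hand only the resulting token to the agent or client.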
:::tip The best practice for getting token includes generating it from your backend server which will help in **keeping your credentials safe**. ::: ### 2. Generating token in your backend - Your server will generate access token using your API key and secret. - While generating a token, you can provide **expiration time, permissions and roles** which are discussed later in this section. - Your client obtains token from your backend server. - For token validation, client will pass this token to VideoSDK server. - VideoSDK server will only allow entry in the meeting if the token is valid. ![img2.png](/img/authentication-and-token.png) import GenerateToken from "@site/src/theme/GenerateTokenContainer"; Follow our official example repositories to setup token API [videosdk-rtc-api-server-examples](https://github.com/videosdk-live/videosdk-rtc-api-server-examples) ### Payload while generating token For AI Agent authentication, the payload is simplified to include only the essential parameters: ```js { apikey: API_KEY, //MANDATORY permissions: [`allow_join`], //MANDATORY } ``` - **`apikey`(Mandatory)**: This must be the API Key generated from the VideoSDK Dashboard. You can get it from [here](https://app.videosdk.live/api-keys). - **`permissions`(Mandatory)**: For AI agents, typically use `allow_join` to enable the agent to join meetings directly. Available permissions for AI agents: - **`allow_join`**: The AI agent is **allowed to join** the meeting directly. - **`ask_join`**: The AI agent is required to **ask for permission to join** the meeting. Then, you have to sign this payload with your **`SECRET KEY`** and `jwt` options using the **`HS256 algorithm`**. ### Expiration time You can set any expiration time to the token. But in the **production environment**, it is recommended to generate a token with **short expiration time** because by any chance if someone gets hold of the token, it won't be valid for a longer period of time. ### What happens if token is expired? 
If your token has expired, the user won't be able to join the meeting, and all API calls will fail with the message `Token is invalid or expired`.

:::note
The token is validated only once, when joining the meeting. If a participant joins and the token expires afterwards, the current meeting is not affected.
:::

## How to check validity of token?

1. After generating the token, visit [jwt.io](https://jwt.io) and paste your token in the given area.
2. You will be able to see the payload you passed while generating the token, along with the expiration time and token creation time.

![img1.png](/img/validate-token.png)

---

---
title: Console Mode for AI Agents
hide_title: false
hide_table_of_contents: false
description: "Learn how to use VideoSDK AI Agents in console mode for direct terminal-based voice interactions without joining a meeting room."
pagination_label: "Console Mode"
keywords:
  - AI Voice Agent
  - Console Mode
  - Terminal Interaction
  - VideoSDK Agents
  - CLI Mode
  - Voice Testing
  - Local Development
  - Quick Testing
  - Videosdk Console Mode
image: img/videosdklive-thumbnail.jpg
sidebar_position: 8
sidebar_label: Console Mode
slug: console_mode
---

# Console Mode for AI Agents

Console mode allows you to interact with your AI agent directly through the terminal without joining a VideoSDK meeting room. This is particularly useful for:

- Quick testing of agent functionality
- Local development and debugging
- Testing function tools and MCP integrations
- Validating pipeline configurations

## How It Works

When running your agent script in console mode:

1. The agent runs in a terminal-based environment
2. Your microphone input is captured directly through the terminal
3. Agent responses are played through your system audio
4. The full `Cascading Pipeline` or `RealTime Pipeline` runs locally without connecting to a meeting, which makes it easier to verify that audio flows, agent logic, and response generation work correctly before deploying into a live session
5. Function tools, MCP integrations, and other features remain fully functional

## Using Console Mode

To use console mode, add the `console` argument when running your agent script:

```bash
python main.py console
```

import ReactPlayer from 'react-player'
The console will display: - Agent speech output - User speech input - Various latency metrics (STT, TTS, LLM,EOU) - Pipeline processing information This flexibility allows you to use the same agent code for both development and production environments. --- --- title: Agent Session hide_title: false hide_table_of_contents: false description: "Discover how the `AgentSession` in VideoSDK's AI Agent SDK orchestrates various components into a unified workflow, managing the agent's interaction lifecycle and context for seamless real-time communication." pagination_label: "Agent Session" keywords: - AgentSession - AI Agent SDK - VideoSDK Agents - Component Orchestration - Session Management - Context Handling - Agent Workflow - Real-time AI - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 8 sidebar_label: Agent Session slug: agent-session --- import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, RobotIcon, GithubIcon } from '@site/src/components/agent/cards'; # Agent Session The `AgentSession` is the central orchestrator that integrates the `Agent`, `Pipeline`, and optional `ConversationFlow` into a cohesive workflow. It manages the complete lifecycle of an agent's interaction within a VideoSDK meeting, handling initialization, execution, and cleanup. ![Agent Session](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_agent_session.png) ## Core Features - **Component Orchestration:** Unifies agent, pipeline, and conversation flow components. - **Lifecycle Management:** Handles session start, execution, and cleanup ## State Management The `AgentSession` provides comprehensive state tracking for both users and agents, automatically emitting state change events for real-time monitoring. :::tip Version Requirement The state management features and enhanced methods (`reply()`, `interrupt()`) are available in versions above v0.0.35. 
::: ### User States - **IDLE** - User is not actively speaking or listening - **SPEAKING** - User is currently speaking - **LISTENING** - User is actively listening to the agent ### Agent States - **STARTING** - Agent is initializing - **IDLE** - Agent is ready and waiting - **SPEAKING** - Agent is currently generating speech - **LISTENING** - Agent is processing user input - **THINKING** - Agent is processing and generating response - **CLOSING** - Agent is shutting down ### State Event Monitoring State changes are automatically emitted as events that you can listen to: ```python title="main.py" def on_user_state_changed(data): print("User state:", data) def on_agent_state_changed(data): print("Agent state:", data) session.on("user_state_changed", on_user_state_changed) session.on("agent_state_changed", on_agent_state_changed) ``` ## Constructor Parameters ```python AgentSession( agent: Agent, pipeline: Pipeline, conversation_flow: Optional[ConversationFlow] = None, wake_up: Optional[int] = None ) ``` Either RealTimePipeline or{" "} CascadingPipeline ), }, { title: "Conversation Flow", description: "Optional conversation state management", link: "/ai_agents/core-components/conversation-flow", } ]} columns={3} /> ### Wake-Up Call Wake-up call automatically triggers actions when users are inactive for a specified period of time, helping maintain engagement. ```python title="main.py" # Configure wake-up timer session = AgentSession( agent=MyAgent(), pipeline=pipeline, wake_up=10 # Trigger after 10 seconds of inactivity ) # Set callback function async def on_wake_up(): await session.say("Are you still there? How can I help?") session.on_wake_up = on_wake_up ``` :::note Important: If a `wake_up` time is provided, you must set a callback function before starting the session. If no `wake_up` time is specified, no timer or callback will be activated. 
::: ## Basic Usage To get an agent running, you initialize an `AgentSession` with your custom `Agent` and a configured `Pipeline`. The session handles the underlying connection and data flow. ### Example Implementation: import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ```python title="main.py" from videosdk.agents import AgentSession, Agent, WorkerJob, JobContext, RoomOptions from videosdk.plugins.openai import OpenAIRealtime from videosdk.agents import RealTimePipeline class MyAgent(Agent): def __init__(self): super().__init__(instructions="You are a helpful meeting assistant.") async def on_enter(self): await self.session.say("Hello! How can I help you today?") def setup_state_monitoring(self): def on_user_state_changed(data): print(f"User state changed to: {data['state']}") def on_agent_state_changed(data): print(f"Agent state changed to: {data['state']}") self.session.on("user_state_changed", on_user_state_changed) self.session.on("agent_state_changed", on_agent_state_changed) async def start_session(ctx: JobContext): model = OpenAIRealtime(model="gpt-4o-realtime-preview") pipeline = RealTimePipeline(model=model) session = AgentSession( agent=MyAgent(), pipeline=pipeline ) await ctx.connect() await session.start() # Session runs until manually stopped or meeting ends def make_context(): return JobContext( room_options=RoomOptions( room_id="your-room-id", auth_token="your-auth-token", name="Assistant Bot" ) ) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ```python title="main.py" from videosdk.agents import AgentSession, Agent, WorkerJob, JobContext, RoomOptions from videosdk.plugins.openai import OpenAISTT, OpenAITTS, OpenAILLM from videosdk.agents import CascadingPipeline class MyAgent(Agent): def __init__(self): super().__init__(instructions="You are a helpful meeting assistant.") async def on_enter(self): await self.session.say("Hello! 
How can I help you today?") def setup_state_monitoring(self): def on_user_state_changed(data): print(f"User state changed to: {data['state']}") def on_agent_state_changed(data): print(f"Agent state changed to: {data['state']}") self.session.on("user_state_changed", on_user_state_changed) self.session.on("agent_state_changed", on_agent_state_changed) async def start_session(ctx: JobContext): # Configure individual components stt = OpenAISTT(model="whisper-1") llm = OpenAILLM(model="gpt-4") tts = OpenAITTS(model="tts-1", voice="alloy") pipeline = CascadingPipeline( stt=stt, llm=llm, tts=tts ) session = AgentSession( agent=MyAgent(), pipeline=pipeline ) await ctx.connect() await session.start() # Session runs until manually stopped or meeting ends def make_context(): return JobContext( room_options=RoomOptions( room_id="your-room-id", auth_token="your-auth-token", name="Assistant Bot" ) ) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ## Development and Testing Features The `AgentSession` supports several modes for development, testing, and user engagement: ### Playground Mode Playground mode provides a web-based interface for testing your agent without building a separate client application. #### Usage To activate playground mode, simply set `playground: True` in your RoomOptions for JobContext. 
```python title="main.py"
from videosdk.agents import RoomOptions, JobContext, WorkerJob

async def entrypoint(ctx: JobContext):
    # Your agent implementation here
    # This is where you create your pipeline, agent, and session
    pass

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",
        name="Test Agent",
        playground=True  # Enable playground mode
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    # WorkerJob is already imported at the top of the file
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
```

When enabled, the playground URL is automatically displayed in your terminal for easy access.

:::note
Playground mode is designed for development and testing purposes. For production deployments, ensure playground mode is disabled to maintain security and performance.
:::

### Console Mode

Console mode allows you to test your agent directly in the terminal using your microphone and speakers, without joining a VideoSDK meeting.

#### Usage

To use console mode, add the `console` argument when running your agent script:

```bash
python main.py console
```

import ReactPlayer from 'react-player'
The console will display: - Agent speech output - User speech input - Various latency metrics (STT, TTS, LLM,EOU) - Pipeline processing information This flexibility allows you to use the same agent code for both development and production environments. ## Session Lifecycle Management The `AgentSession` provides methods to control the agent's presence and behavior in the meeting. }, { title: "say(message: str)", description: "Sends a message from the agent to the meeting participants. Allows the agent to communicate with users in the meeting.", link: "https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=the%20agent%20session.-,async%20def%20say,self%2C%20message%3A%C2%A0str)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE", icon: }, { title: "close()", description: "Gracefully shuts down the session. Finalizes metrics collection, cancels wake-up timer, and calls agent's on_exit() hook.", link: "https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=Methods-,async%20def%20close,self)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE", icon: }, { title: "leave()", description: "Leaves the meeting without full session cleanup. Provides a quick exit option while maintaining session state.", link: "https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=the%20agent%20session.-,async%20def%20leave,self)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE", icon: }, { title: "reply(instructions, wait_for_playback)", description: "Generate agent responses using instructions and current chat context. 
Includes playback control and prevents concurrent calls.", icon: , link: "https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=the%20agent%20session.-,async%20def%20reply,self%2C%20instructions%3A%C2%A0str%2C%20wait_for_playback%3A%C2%A0bool%C2%A0%3D%C2%A0True)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE" }, { title: "interrupt()", description: "Immediately interrupt the agent's current operation, stopping speech generation and LLM processing for emergency stops or user interruptions.", icon: , link: "https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=the%20agent%20session.-,async%20def%20interrupt,self)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE" } ]} /> ### Example of Managing the Lifecycle: ```python title="main.py" import asyncio from videosdk.agents import AgentSession, Agent, WorkerJob, JobContext, RoomOptions from videosdk.plugins.openai import OpenAIRealtime from videosdk.agents import RealTimePipeline class MyAgent(Agent): def __init__(self): super().__init__(instructions="You are a helpful meeting assistant.") # LIFECYCLE: Agent entry point - called when session starts async def on_enter(self): await self.session.say("Hello! 
How can I help you today?") # LIFECYCLE: Agent exit point - called when session ends async def on_exit(self): print("Agent is leaving the session") @function_tool async def provide_summary(self) -> str: """Provide a conversation summary using the new reply method""" await self.session.reply("Let me summarize our conversation so far...") return "Summary provided" @function_tool async def stop_speaking(self) -> str: """Emergency stop functionality""" await self.session.interrupt() return "Agent stopped successfully" async def run_agent_session(ctx: JobContext): # LIFECYCLE STAGE 1: Session Creation model = OpenAIRealtime(model="gpt-4o-realtime-preview") pipeline = RealTimePipeline(model=model) session = AgentSession(agent=MyAgent(), pipeline=pipeline) try: # LIFECYCLE STAGE 2: Connection Establishment await ctx.connect() # LIFECYCLE STAGE 3: Session Start await session.start() # LIFECYCLE STAGE 4: Session Running await asyncio.Event().wait() finally: # LIFECYCLE STAGE 5: Session Cleanup await session.close() # LIFECYCLE STAGE 6: Context Shutdown await ctx.shutdown() # LIFECYCLE STAGE 0: Context Creation def make_context() -> JobContext: room_options = RoomOptions(room_id="your-room-id", auth_token="your-token") return JobContext(room_options=room_options) if __name__ == "__main__": # LIFECYCLE ORCHESTRATION: Worker Job Management # Creates and starts the worker job that manages the entire lifecycle job = WorkerJob(entrypoint=run_agent_session, jobctx=make_context) job.start() ``` ## Examples - Try Out Yourself We have examples to get you started. Go ahead, try out, talk to agent, understand and customize according to your needs. }, { title: "Agent to agent Example", description: "Agent Session with Customer and Loan agent", link: "https://github.com/videosdk-live/agents/tree/main/examples/a2a", icon: } ]} /> --- --- title: Agent hide_title: false hide_table_of_contents: false description: "Learn about the `Agent` base class in the VideoSDK AI Agent SDK. 
Understand how to create custom agents, define system prompts, manage state, and register function tools." pagination_label: "Agent" keywords: - Agent Class - AI Agent SDK - VideoSDK Agents - Custom Agents - System Prompts - State Management - Function Tools - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Agent slug: agent --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards'; # Agent The `Agent` class is the base class for defining AI agent behavior and capabilities. It provides the foundation for creating intelligent conversational agents with support for function tools, MCP servers, and advanced lifecycle management. ![Agent](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_agent.png) ## Basic Usage ### Simple Agent This is how you can initialize a simple agent with the `Agent` class, where `instructions` defines how the agent should behave. ```python title="main.py" from videosdk.agents import Agent class MyAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful assistant." ) ``` ## Agent with Function Tools Function tools allow your agent to perform actions and interact with external services, extending its capabilities beyond simple conversation. You can register tools that are defined either outside or inside your agent class. ### External Tools External tools are defined as standalone functions and are passed into the agent's constructor via the tools list. This is useful for sharing common tools across multiple agents. 
```python title="main.py" from videosdk.agents import Agent, function_tool # External tool defined outside the class @function_tool(description="Get weather information") def get_weather(location: str) -> str: """Get weather information for a specific location.""" # Weather logic here return f"Weather in {location}: Sunny, 72°F" class WeatherAgent(Agent): def __init__(self): super().__init__( instructions="You are a weather assistant.", tools=[get_weather] # Register the external tool ) ``` ### Internal Tools Internal tools are defined as methods within your agent class and are decorated with `@function_tool`. This is useful for logic that is specific to the agent and needs access to its internal state (`self`). ```python title="main.py" from videosdk.agents import Agent, function_tool class FinanceAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful financial assistant." ) self.portfolio = {"AAPL": 10, "GOOG": 5} @function_tool def get_portfolio_value(self) -> dict: """Get the current value of the user's stock portfolio.""" # In a real scenario, you'd fetch live stock prices # This is a simplified example return {"total_value": 5000, "holdings": self.portfolio} ``` ## Agent with MCP Server `MCPServerStdio` enables your agent to communicate with external processes via standard input/output streams. This is ideal for integrating complex, standalone Python scripts or other local executables as tools. 
```python title="main.py" import sys from pathlib import Path from videosdk.agents import Agent, MCPServerStdio # Path to your external Python script that runs the MCP server mcp_server_path = Path(__file__).parent / "mcp_server_script.py" class MCPAgent(Agent): def __init__(self): super().__init__( instructions="You are an assistant that can leverage external tools via MCP.", mcp_servers=[ MCPServerStdio( executable_path=sys.executable, process_arguments=[str(mcp_server_path)], session_timeout=30 ) ] ) ``` ## Agent Lifecycle and Methods The `Agent` class provides lifecycle hooks and methods to manage state and behavior at critical points in the agent's session. ### Lifecycle Hooks These methods are designed to be overridden in your custom agent class to implement specific behaviors. - `async def on_enter(self) -> None`: Called once when the agent successfully joins the meeting. This is the ideal place for introductions or initial actions, such as greeting participants. - `async def on_exit(self) -> None`: Called when the agent is about to exit the meeting. Use this for cleanup tasks or for saying goodbye. ```python title="main.py" from videosdk.agents import Agent class LifecycleAgent(Agent): async def on_enter(self): print("Agent has entered the meeting.") await self.session.say("Hello everyone! I'm here to help.") async def on_exit(self): print("Agent is exiting the meeting.") await self.session.say("It was a pleasure assisting you. Goodbye!") ``` ## Human in the Loop (HITL) Human in the Loop enables AI agents to escalate specific queries to human operators for review and approval. This implementation uses Discord as the human interface through an MCP server, allowing seamless handoffs between AI automation and human oversight. 
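At its core, this hand-off is an awaitable that resolves when a human replies, or gives up after a timeout. The sketch below shows that pattern with plain `asyncio`, independent of Discord and of the SDK. `HumanEscalation`, `answer`, `demo`, and the timeout values are hypothetical names chosen for illustration, not part of the VideoSDK API.

```python
import asyncio
from typing import Optional, Tuple


class HumanEscalation:
    """Hypothetical stand-in for a human channel such as a Discord thread."""

    def __init__(self) -> None:
        self._pending: Optional[asyncio.Future] = None

    async def ask(self, question: str, timeout: float) -> str:
        # Create a future that the human side resolves later.
        self._pending = asyncio.get_running_loop().create_future()
        try:
            return await asyncio.wait_for(self._pending, timeout=timeout)
        except asyncio.TimeoutError:
            return "Timed out waiting for a human response"

    def answer(self, text: str) -> None:
        # Ignore replies that arrive when nothing is pending or after a timeout.
        if self._pending is not None and not self._pending.done():
            self._pending.set_result(text)


async def demo() -> Tuple[str, str]:
    human = HumanEscalation()
    # Simulate the operator replying shortly after the question is asked.
    asyncio.get_running_loop().call_later(0.01, human.answer, "10% discount approved")
    answered = await human.ask("Can we offer a discount?", timeout=1.0)
    # Nobody replies to the second question, so the timeout path is taken.
    unanswered = await human.ask("Anyone there?", timeout=0.05)
    return answered, unanswered
```

Returning a plain string on timeout, rather than raising, lets the agent tell the caller that no human was available instead of failing the tool call.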
### Use Cases - **Discount Requests**: AI escalates pricing queries to human sales agents - **Complex Support**: Technical issues requiring human expertise - **Policy Decisions**: Requests that need human approval or clarification - **Escalation Scenarios**: Situations where AI confidence is low ### Implementation The HITL pattern combines the Agent's MCP server capability with a Discord-based human interface: ```python title="main.py" from videosdk.agents import Agent, MCPServerStdio, CascadingPipeline, AgentSession, JobContext, RoomOptions, WorkerJob from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.anthropic import AnthropicLLM from videosdk.plugins.google import GoogleTTS from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector import pathlib import sys import os from typing import Optional class CustomerAgent(Agent): def __init__(self, ctx: Optional[JobContext] = None): current_dir = pathlib.Path(__file__).parent discord_mcp_server_path = current_dir / "discord_mcp_server.py" super().__init__( instructions="You are a customer-facing agent for VideoSDK. You have access to various tools to assist with customer inquiries, provide support, and handle tasks. When a user asks for a discount percentage, always use the appropriate tool to retrieve and provide the accurate answer from your superior human agent.", mcp_servers=[ MCPServerStdio( executable_path=sys.executable, process_arguments=[str(discord_mcp_server_path)], session_timeout=30 ), ] ) self.ctx = ctx async def on_enter(self) -> None: """Called when the agent first joins the meeting""" await self.session.say("Hi! I'm your VideoSDK customer support agent. How can I help you today?") async def on_exit(self) -> None: """Called when the agent exits the meeting""" await self.session.say("Thank you for contacting VideoSDK support. 
Have a great day!") # Pipeline configuration integrated into the main setup def create_pipeline() -> CascadingPipeline: """Create and configure the cascading pipeline with all components""" return CascadingPipeline( stt=DeepgramSTT(api_key=os.getenv("DEEPGRAM_API_KEY")), llm=AnthropicLLM(api_key=os.getenv("ANTHROPIC_API_KEY")), tts=GoogleTTS(api_key=os.getenv("GOOGLE_API_KEY")), vad=SileroVAD(), turn_detector=TurnDetector(threshold=0.8) ) async def start_session(ctx: JobContext): """Main entry point that creates agent, pipeline, and starts the session""" # Create the pipeline pipeline = create_pipeline() # Create the agent with context agent = CustomerAgent(ctx=ctx) # Create the agent session session = AgentSession( agent=agent, pipeline=pipeline ) try: # Connect to the room await ctx.connect() # Start the agent session await session.start() # Keep running until interrupted import asyncio await asyncio.Event().wait() finally: # Clean up resources await session.close() await ctx.shutdown() def make_context() -> JobContext: """Create the job context with room configuration""" room_options = RoomOptions( room_id=os.getenv("VIDEOSDK_ROOM_ID", "your-room-id"), auth_token=os.getenv("VIDEOSDK_AUTH_TOKEN"), name="VideoSDK Customer Agent", playground=True ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ```python title="discord_mcp_server.py" import asyncio import os from mcp.server.fastmcp import FastMCP import discord from discord.ext import commands class DiscordHuman: def __init__(self, user_id: int, channel_id: int, bot_token: str): self.user_id = user_id self.channel_id = channel_id self.bot_token = bot_token self.bot = commands.Bot(command_prefix="!", intents=discord.Intents.all()) self.response_future = None self.setup_bot_events() def setup_bot_events(self): @self.bot.event async def on_ready(): print(f'{self.bot.user} has connected to Discord!') @self.bot.event 
        async def on_message(message):
            if (message.author.id == self.user_id and
                message.channel.id in [thread.id for thread in self.bot.get_all_channels() if hasattr(thread, 'parent')] and
                self.response_future and not self.response_future.done()):
                self.response_future.set_result(message.content)

    async def start_bot(self):
        """Start the Discord bot"""
        await self.bot.start(self.bot_token)

    async def ask(self, question: str) -> str:
        if not self.bot.is_ready():
            return "❌ Discord bot is not ready"
        try:
            channel = self.bot.get_channel(self.channel_id)
            if not channel:
                return "❌ Channel not found"
            thread = await channel.create_thread(
                name=question[:100],
                type=discord.ChannelType.public_thread
            )
            await thread.send(f"<@{self.user_id}> {question}")
            # Create the future on the loop that is currently running
            self.response_future = asyncio.get_running_loop().create_future()
            try:
                response = await asyncio.wait_for(self.response_future, timeout=600)
                return response
            except asyncio.TimeoutError:
                return "⏱️ Timed out waiting for a human response"
        except Exception as e:
            return f"❌ Error: {str(e)}"

# Initialize Discord human instance
discord_human = DiscordHuman(
    user_id=int(os.getenv("DISCORD_USER_ID")),
    channel_id=int(os.getenv("DISCORD_CHANNEL_ID")),
    bot_token=os.getenv("DISCORD_TOKEN")
)

# MCP Server Setup
mcp = FastMCP("HumanInTheLoopServer")

@mcp.tool(description="Ask a human agent via Discord for a specific user query such as discount percentage, etc.")
async def ask_human(question: str) -> str:
    """Ask a human agent via Discord for assistance"""
    return await discord_human.ask(question)

async def main():
    """Main function to start both the Discord bot and MCP server"""
    # Start Discord bot in background (keep a reference so the task is not garbage collected)
    bot_task = asyncio.create_task(discord_human.start_bot())
    # Wait a moment for bot to initialize
    await asyncio.sleep(2)
    # Start MCP server over stdio. FastMCP.run() is synchronous and starts its own
    # event loop, so the async variant is used inside this coroutine.
    await mcp.run_stdio_async()

if __name__ == "__main__":
    asyncio.run(main())
```

Set the following environment variables:

```bash title=".env"
DISCORD_TOKEN=your_discord_bot_token
DISCORD_USER_ID=human_operator_user_id
DISCORD_CHANNEL_ID=channel_id_for_escalations DEEPGRAM_API_KEY=your_deepgram_key ANTHROPIC_API_KEY=your_anthropic_key GOOGLE_API_KEY=your_google_key VIDEOSDK_AUTH_TOKEN=your_videosdk_token VIDEOSDK_ROOM_ID=your_room_id ``` The Discord MCP server provides the `ask_human` tool that creates Discord threads for human operator responses. This leverages the same MCP integration pattern shown in the previous section. Complete implementation with full source code, setup instructions, and configuration examples available in the [VideoSDK Agents GitHub repository](https://github.com/videosdk-live/agents/tree/main/examples/human_in_the_loop). --> ## Examples - Try Out Yourself Checkout the examples of function tool usage and MCP server. }, { title: "MCP Server", description: "Implement agent with MCP server integration", link: "https://github.com/videosdk-live/agents/blob/main/examples/mcp_example.py", icon: }, { title: "Human in the Loop", description: "Escalate queries to human operators via Discord", link: "https://github.com/videosdk-live/agents/tree/main/examples/human_in_the_loop", icon: } ]} columns={3} /> --- --- title: Avatar hide_title: false hide_table_of_contents: false description: "Learn how to add virtual avatars to your VideoSDK AI Agents. Understand avatar integration, configuration, and how to create lifelike visual representations for your agents." pagination_label: "Avatar" keywords: - Avatar - Virtual Avatar - Simli - Visual Representation - AI Agent Avatar - Video Avatar - Agent Appearance - VideoSDK Agents - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 10 sidebar_label: Avatar slug: avatar --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Avatar Avatars add a visual, human-like presence to your AI agents, creating more engaging and natural interactions. 
The VideoSDK Agents framework supports virtual avatars through the Simli integration, providing lifelike video representations that sync with your agent's speech. ![Avatar](https://cdn.videosdk.live/website-resources/docs-resources/voice_agent_avatar.png) ## Overview Avatar functionality enables your AI agents to: - **Visual Presence**: Display a human-like avatar that represents your agent - **Lip Sync**: Automatically synchronize avatar mouth movements with speech - **Real-time Rendering**: Generate avatar video in real-time during conversations - **Customizable Appearance**: Choose from different avatar faces and styles - **Seamless Integration**: Works with both CascadingPipeline and RealTimePipeline ## What Avatars Enable With avatar capabilities, your agents can: - Provide a more human and approachable interface - Increase user engagement through visual interaction - Create branded agent personalities with custom appearances - Enhance accessibility through visual communication cues - Build trust through consistent visual representation ## Simli Avatar Integration ### Basic Setup The Simli avatar integration provides high-quality virtual avatars with real-time lip synchronization. 
```python title="main.py"
from videosdk.plugins.simli import SimliAvatar, SimliConfig

# Configure your avatar
avatar_config = SimliConfig(
    apiKey="your-simli-api-key",
    faceId="0c2b8b04-5274-41f1-a21c-d5c98322efa9",  # Default face
    syncAudio=True,
    handleSilence=True
)

avatar = SimliAvatar(config=avatar_config)
```

### Avatar Configuration Options

Customize your avatar's behavior and appearance:

```python title="main.py"
from videosdk.plugins.simli import SimliConfig

config = SimliConfig(
    apiKey="your-simli-api-key",
    faceId="your-custom-face-id",  # Choose avatar appearance
    syncAudio=True,                # Enable lip sync
    handleSilence=True,            # Manage silent periods
    maxSessionLength=1800,         # 30 minutes max session
    maxIdleTime=300                # 5 minutes idle timeout
)
```

## Pipeline Integration

Add the avatar to your cascading pipeline setup:

```python title="main.py"
from videosdk.agents import CascadingPipeline, AgentSession
from videosdk.plugins.simli import SimliAvatar, SimliConfig

# Configure avatar
avatar_config = SimliConfig(apiKey="your-simli-api-key")
avatar = SimliAvatar(config=avatar_config)

# Create pipeline with avatar
pipeline = CascadingPipeline(
    stt=your_stt_provider,
    llm=your_llm_provider,
    tts=your_tts_provider,
    avatar=avatar  # Add avatar to pipeline
)
```

Integrate the avatar with real-time models:

```python title="main.py"
from videosdk.agents import RealTimePipeline
from videosdk.plugins.simli import SimliAvatar, SimliConfig
from videosdk.plugins.openai import OpenAIRealtime

# Configure avatar
avatar = SimliAvatar(config=SimliConfig(apiKey="your-api-key"))

# Configure real-time model
model = OpenAIRealtime(model="gpt-4o-realtime-preview")

# Create pipeline with avatar
pipeline = RealTimePipeline(
    model=model,
    avatar=avatar
)
```

:::info
You can also specify the avatar in your room configuration:

```python title="main.py"
from videosdk.agents import JobContext, RoomOptions
from videosdk.plugins.simli import SimliAvatar, SimliConfig

def make_context():
    avatar = SimliAvatar(config=SimliConfig(apiKey="your-api-key"))
    return JobContext(
room_options=RoomOptions( room_id="your-room-id", auth_token="your-auth-token", name="Avatar Agent", avatar=avatar # Specify avatar in room options ) ) ``` ::: ## Example - Try Out Yourself import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards'; } ]} /> --- --- title: Background Audio hide_title: false hide_table_of_contents: false description: "Learn about Background Audio in the VideoSDK AI Agent SDK. Enable ambient sounds, thinking audio, and background music to enhance conversational experiences." pagination_label: "Background Audio" keywords: - Background Audio - Thinking Audio - Ambient Sound - Background Music - override_thinking - RoomOptions - Audio Playback - VideoSDK Agents - Python SDK - AI Agents image: img/videosdklive-thumbnail.jpg sidebar_position: 8 sidebar_label: Background Audio slug: background-audio --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards'; import { LanguageTable } from '@site/src/components/agent'; # Background Audio The Background Audio feature enables voice agents to play audio during conversations, enhancing user experience with ambient sounds and processing feedback. There are two ways to set the audio: 1. **Thinking Audio:** Plays automatically during agent processing (e.g., keyboard typing sounds) 2. 
**Background Audio:** Plays on-demand for ambient music or soundscapes ![Thinking Audio](https://assets.videosdk.live/images/thinking_audio.png) ![Background Audio](https://assets.videosdk.live/images/background-audio.png) ## Getting Started ### Enable Background Audio ```python from videosdk.agents import RoomOptions, JobContext room_options = RoomOptions( room_id="your-room-id", name="My Agent", # highlight-start background_audio=True # Enable background audio support #highlight-end ) context = JobContext(room_options=room_options) ``` ### Agent Methods **1. Set Thinking Audio** `set_thinking_audio()`: Configures audio that plays automatically while the agent processes responses. **Parameters:** - `file (str, optional)`: Path to custom WAV audio file. If not provided, uses built-in `agent_keyboard.wav` - `volume (float, optional)`: Volume of the audio. Default: `0.3` **Example:** ```python class MyAgent(Agent): def __init__(self): super().__init__(instructions="...") # highlight-start # Use default keyboard sound self.set_thinking_audio() # Or use custom audio # self.set_thinking_audio(file="path/to/custom.wav") # highlight-end ``` **2. Play Background Audio** `play_background_audio()`: Starts playing background audio during the conversation. **Parameters:** - `file (str, optional)`: Path to custom WAV audio file. If not provided, uses built-in `classical.wav` - `looping (bool, optional)`: Whether to loop the audio. Default: `False` - `override_thinking (bool, optional)`: Whether to stop thinking audio when background audio starts. Default: `True` - `volume (float, optional)`: Volume of the audio. Default: `1.0` **Example:** ```python @function_tool async def play_music(self): """Plays background music""" # highlight-start await self.play_background_audio( looping=True, override_thinking=False ) # highlight-end return "Music started" ``` **3. Stop Background Audio** `stop_background_audio()`: Stops currently playing background audio. 
**Example:** ```python @function_tool async def stop_music(self): """Stops background music""" # highlight-start await self.stop_background_audio() # highlight-end return "Music stopped" ``` ## Complete Example ```python title="main.py" from videosdk.agents import ( Agent, AgentSession, CascadingPipeline, WorkerJob, ConversationFlow, JobContext, RoomOptions, function_tool ) from videosdk.plugins.openai import OpenAILLM, OpenAITTS from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector class MusicAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful assistant. Use control_music to play or stop background music." ) #highlight-start # Enable thinking audio with default keyboard sound self.set_thinking_audio() #highlight-end async def on_enter(self): await self.session.say("Hello! Ask me to play music.") async def on_exit(self): await self.session.say("Goodbye! Hope you enjoyed the music.") @function_tool async def control_music(self, action: str): """ Controls background music. :param action: 'play' to start music, 'stop' to end it """ if action == "play": #highlight-start await self.play_background_audio( override_thinking=True, looping=True ) #highlight-end return "Music started" elif action == "stop": #highlight-start await self.stop_background_audio() #highlight-end return "Music stopped" return "Invalid action" async def entrypoint(ctx: JobContext): agent = MusicAgent() pipeline = CascadingPipeline( stt=DeepgramSTT(), llm=OpenAILLM(), tts=OpenAITTS(), vad=SileroVAD(), turn_detector=TurnDetector() ) session = AgentSession( agent=agent, pipeline=pipeline, conversation_flow=ConversationFlow(agent) ) await session.start(wait_for_participant=True, run_until_shutdown=True) def make_context(): return JobContext( room_options=RoomOptions( room_id="", name="Music Agent", #highlight-start background_audio=True # Required! 
#highlight-end ) ) if __name__ == "__main__": job = WorkerJob(entrypoint=entrypoint, jobctx=make_context) job.start() ``` ## Pipeline Support Background audio works with both pipeline types: ### Cascading Pipeline - Thinking audio plays automatically during LLM processing - Background audio can be controlled via agent methods - Audio stops automatically when agent speaks ### RealTime Pipeline - Full background audio support with streaming models - Automatic lifecycle management during conversation turns ## Audio Behavior | Feature | Thinking Audio | Background Audio | |---------|---------------|------------------| | **Trigger** | Automatic during processing | Manual via `play_background_audio()` | | **Default File** | `agent_keyboard.wav` | `classical.wav` | | **Typical Duration** | Short (during LLM call) | Long/continuous | | **Looping** | Optional | Recommended (`looping=True`) | | **User Control** | No | Yes (via function tools) | | **Stops When** | Agent speaks | Agent speaks or `stop_background_audio()` | ## Audio File Requirements - **Format:** WAV (`.wav`) - **Recommended:** 16-bit PCM, 16kHz sample rate, mono channel - **Built-in files:** - `agent_keyboard.wav`: Default thinking sound - `classical.wav`: Default background music ## Best Practices 1. **Always enable in RoomOptions:** Set `background_audio=True` before using audio methods 2. **Use `override_thinking=True`:** When playing music to avoid overlapping sounds 3. **Loop background audio:** Set `looping=True` for continuous ambient sounds 4. **Control via function tools:** Let users control music through natural language 5. 
**Clean audio files:** Use high-quality WAV files to avoid distortion ## Common Use Cases - **Music player agent:** Control playback through conversation - **Ambient soundscapes:** Create atmosphere during interactions - **Processing feedback:** Custom thinking sounds for different agent personalities - **Hold music:** Play audio while agent performs long operations ## Example - Try It Yourself } ]} /> ## FAQs

### Troubleshooting

| Issue | Solution |
|-------|----------|
| Audio not playing | Verify `background_audio=True` in `RoomOptions` |
| Audio quality issues | Use WAV format with 16-bit PCM encoding |
| Audio doesn't stop | Ensure `stop_background_audio()` is called properly |
| Overlapping sounds | Use `override_thinking=True` when playing background audio |
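
When audio quality is the problem, it can help to check a custom file against the recommended format (16-bit PCM, 16kHz, mono) before passing it to `set_thinking_audio()` or `play_background_audio()`. The sketch below uses only Python's standard `wave` module. The helper name `check_wav` and the `RECOMMENDED` table are assumptions made for illustration and are not part of the SDK.

```python
import wave

# Recommended format from the "Audio File Requirements" section.
RECOMMENDED = {
    "sample_width_bytes": 2,   # 16-bit PCM
    "sample_rate_hz": 16000,   # 16kHz
    "channels": 1,             # mono
}


def check_wav(path: str) -> list:
    """Return a list of mismatches against the recommended format (empty if none)."""
    with wave.open(path, "rb") as wav:
        actual = {
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "channels": wav.getnchannels(),
        }
    return [
        f"{key}: expected {expected}, found {actual[key]}"
        for key, expected in RECOMMENDED.items()
        if actual[key] != expected
    ]
```

An empty list means the file matches the recommendation. Any other result names the properties to fix when re-exporting the audio.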
---

---
title: Call Transfer
hide_title: false
hide_table_of_contents: false
description: Learn how to enable your AI Agent to seamlessly transfer a live SIP call to a different phone number
pagination_label: Call Transfer
keywords:
  - AI Agent SDK
  - AI Telephony Agent
  - ai-agent
  - SIP
  - VideoSDK
  - Call Transfer
  - Telephony Integration
  - Webhooks
image: img/videosdklive-thumbnail.jpg
sidebar_position: 4
sidebar_label: Call Transfer
slug: call-transfer
---

import { AgentCardGrid, GithubIcon, DocumentIcon, } from "@site/src/components/agent/cards";

Call Transfer lets your AI Agent move an ongoing SIP call to another phone number without ending the current session. Instead of making the caller hang up and dial a new number, the agent can automatically route the call.

## How Call Transfer Works

- The agent evaluates the user’s intent to determine when a call transfer is required and then triggers the function tool.
- When the function tool is triggered, it tells the system to move the call to another phone number.
- The ongoing SIP call is forwarded to the new number instantly, without disconnecting or redialing.

## Trigger Call Transfer

To set up incoming call handling, outbound calling, and routing rules, check out the [Quick Start Example](https://docs.videosdk.live/telephony/ai-telephony-agent-quick-start#part-2-connect-your-agent-to-the-phone-network).

```python title="main.py"
import os

from videosdk.agents import Agent, function_tool


class CallTransferAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a call transfer agent that helps transfer an ongoing call to a new number. Use the transfer_call tool to transfer the call to the new number.",
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello Buddy, How can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye Buddy, Thank you for calling!")

    @function_tool
    async def transfer_call(self) -> None:
        """Transfer the call to the provided number"""
        # The auth token and the destination number are read from the environment
        token = os.getenv("VIDEOSDK_AUTH_TOKEN")
        transfer_to = os.getenv("CALL_TRANSFER_TO")
        return await self.session.call_transfer(token, transfer_to)
```

## Example - Try It Yourself

}, ]} columns={2} />

---

---
title: Cascading Pipeline
hide_title: false
hide_table_of_contents: false
description: "Explore the `Cascading Pipeline` component in the VideoSDK AI Agent SDK. Learn how it manages AI models (like OpenAI and Gemini), configurations, streaming audio, and multi-modal capabilities."
pagination_label: "Cascading Pipeline"
keywords:
  - Pipeline Component
  - AI Agent SDK
  - VideoSDK Agents
  - AI Models
  - OpenAI
  - Gemini
  - Model Configuration
  - Streaming Audio
  - Multi-modal AI
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 4
sidebar_label: Cascading Pipeline
slug: cascading-pipeline
---

# Cascading Pipeline

The `Cascading Pipeline` component provides a flexible, modular approach to building AI agents by allowing you to mix and match different components for Speech-to-Text (STT), Large Language Models (LLM), Text-to-Speech (TTS), Voice Activity Detection (VAD), and Turn Detection.

## Core Architecture

The pipeline is composed of five key stages that work in sequence to handle a conversation:

- **VAD (Voice Activity Detection)** - Detects the presence of human speech in the audio stream to know when to start processing.
- **STT (Speech-to-Text)** - Converts the detected speech from audio into a text transcript.
- **LLM (Large Language Model)** - Takes the text transcript as input, processes it, and generates a meaningful response.
- **TTS (Text-to-Speech)** - Converts the LLM's text response back into audible speech. - **Turn Detection** - Manages the back-and-forth of the conversation, determining when one speaker has finished and another can begin. ![Cascading Pipeline Architecture](https://cdn.videosdk.live/website-resources/docs-resources/videosdk_casading_pipeline.png) ## Basic Usage ### Simple Pipeline Here is the most basic setup, combining STT, LLM, and TTS components. The SDK will use default configurations if no specific settings are provided. ```python title="main.py" from videosdk.agents import CascadingPipeline from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.openai import OpenAILLM from videosdk.plugins.elevenlabs import ElevenLabsTTS from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector pipeline = CascadingPipeline( stt=DeepgramSTT(), llm=OpenAILLM(), tts=ElevenLabsTTS(), vad=SileroVAD(), turn_detector=TurnDetector() ) ``` ## Key Features: - **Modular Component Selection** - Choose different providers for each component - **Flexible Configuration** - Mix and match STT, LLM, TTS, VAD, and Turn Detection - **Custom Processing** - Add custom processing for STT and LLM outputs - **Provider Agnostic** - Support for multiple AI service providers - **Advanced Control** - Fine-tune each component independently ## Advance Configuration You can fine-tune the behavior of each component by passing specific parameters during initialization. 
```python title="main.py"
from videosdk.agents import CascadingPipeline
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

# Configure each component separately, then pass them to the pipeline
stt = DeepgramSTT(
    model="nova-2",
    language="en",
    punctuate=True,
    diarize=True
)

llm = OpenAILLM(
    model="gpt-4o",
    temperature=0.7,
    max_tokens=1000
)

tts = ElevenLabsTTS(
    model="eleven_flash_v2_5",
    voice_id="21m00Tcm4TlvDq8ikWAM"
)

vad = SileroVAD(
    threshold=0.35,
    min_silence_duration=0.5
)

turn_detector = TurnDetector(
    threshold=0.8,
    min_turn_duration=1.0
)

pipeline = CascadingPipeline(stt=stt, llm=llm, tts=tts, vad=vad, turn_detector=turn_detector)
```

## Dynamic Component Changes

The pipeline supports swapping components at runtime:

```python
# Change components during runtime
await pipeline.change_component(
    stt=new_stt_provider,
    llm=new_llm_provider,
    tts=new_tts_provider
)
```

## Plugin Ecosystem

There are multiple plugins available for STT, LLM, and TTS. Check them out here:

import { AgentCardGrid, GithubIcon, RobotIcon, PlayIcon, CodeIcon, ExternalLinkIcon, SettingsIcon } from '@site/src/components/agent/cards';

## Plugin Development

### Creating Custom Plugins

To create custom plugins, follow the [plugin development guide ↗](https://github.com/videosdk-live/agents/blob/main/BUILD_YOUR_OWN_PLUGIN.md). Key requirements include:

- Inherit from the correct base class (`STT`, `LLM`, or `TTS`)
- Implement all abstract methods
- Handle errors consistently using `self.emit("error", message)`
- Clean up resources in the `aclose()` method

## Plugin Installation

Install additional plugins as needed:

```bash
# Install specific provider plugins
pip install videosdk-plugins-openai
pip install videosdk-plugins-elevenlabs
pip install videosdk-plugins-deepgram
```

## Best Practices

1. **Component Selection:** Choose providers based on your specific requirements (latency, quality, cost)
2. 
**Error Handling:** Implement proper error handling and fallback strategies 3. **Resource Management:** Use the `cleanup()` method to properly close components. 4. **Configuration Monitoring:** Use `get_component_configs()` for debugging and monitoring 5. **Audio Format:** Ensure your custom plugins handle the 48kHz audio format correctly ## Key Benefits The Cascading Pipeline offers several advantages over integrated solutions: - **Multi-language Support** - Use specialized STT for different languages - **Cost Optimization** - Mix premium and cost-effective services - **Custom Voice Processing** - Add domain-specific processing logic - **Performance Optimization** - Choose fastest providers for each component - **Compliance Requirements** - Use specific providers for regulatory compliance ## Comparison with Realtime Pipeline | Feature | Cascading Pipeline | Realtime Pipeline | | ------------- | ----------------------------------- | ----------------------------- | | Control | Maximum control over each component | Integrated model control | | Flexibility | Mix different providers | Single model provider | | Latency | Higher due to sequential processing | Lower with streaming | | Customization | Extensive customization options | Limited to model capabilities | | Complexity | More complex configuration | Simpler setup | The `Cascading Pipeline` is ideal when you need maximum flexibility and control over each processing stage, while the `Realtime Pipeline` is better for low-latency applications with integrated model providers. ## Examples - Try Out Yourself We have examples to get you started. Go ahead, try out, talk to agent, understand and customize according to your needs. } ]} /> --- --- title: Conversation Flow hide_title: false hide_table_of_contents: false description: "Explore the `Conversation Flow` component in the VideoSDK AI Agent SDK. 
Learn how it manages turn taking in the Agents" pagination_label: "Conversation Flow" keywords: - Conversation Flow - AI Agent SDK - VideoSDK Agents - AI Models - OpenAI - Gemini - Model Configuration - Streaming Audio - Multi-modal AI - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 5 sidebar_label: Conversation Flow slug: conversation-flow --- import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, ExternalLinkIcon } from '@site/src/components/agent/cards'; # Conversation Flow The `Conversation Flow` component manages the turn-taking logic in AI agent conversations, ensuring smooth and natural interactions. It is an inheritable class that allows you to inject custom logic into the `Cascading Pipeline`, enabling advanced capabilities like context preservation, dynamic adaptation, and Retrieval-Augmented Generation (RAG) before the final LLM call. ![Conversation Flow](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_conversation_flow.png) :::note Conversation Flow is a powerful feature that currently works exclusively with the [Cascading Pipeline ↗](https://docs.videosdk.live/ai_agents/core-components/cascading-pipeline). 
::: ## Core Features The key methods allow you to inject custom logic at different stages of the conversation flow, enabling sophisticated AI agent behaviors while maintaining clean separation of concerns: ### **Core Capabilities** - **Turn-taking Management:** Control the flow and timing of agent and user turns - **Context Preservation:** Maintain conversation history and user data across turns (handled automatically) - **Advanced Flow Control:** Build stateful conversations that can adapt to user input - **Performance Optimization:** Fine-tune conversation processing for speed and efficiency - **Error Handling:** Implement robust error recovery and fallback mechanisms ### **Advanced Use Cases** - **RAG Implementation:** Retrieve relevant documents and context before LLM processing - **Memory Management:** Store and recall conversation history across sessions - **Content Filtering:** Apply safety checks and content moderation on input/output - **Analytics & Logging:** Track conversation metrics and user behavior patterns (built-in metrics integration) - **Business Logic Integration:** Add domain-specific processing and validation rules - **Multi-step Workflows:** Implement complex conversation flows with state management - **Function Tool Execution:** Automatic execution of function tools when requested by the LLM. ## Basic Usage ### Complete Setup with CascadingPipeline The recommended approach is to use `ConversationFlow` with a `CascadingPipeline`, which handles component configuration automatically: ```python title="main.py" from videosdk.agents import ConversationFlow, Agent, CascadingPipeline # First, define your agent class MyAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful assistant." 
) async def on_enter(self): # Initialize agent state pass async def on_exit(self): # Cleanup resources pass # Create pipeline and conversation flow pipeline = CascadingPipeline(stt=my_stt, llm=my_llm, tts=my_tts) conversation_flow = ConversationFlow(MyAgent()) # Pipeline automatically configures all components pipeline.set_conversation_flow(conversation_flow) ``` ### Constructor Parameters The ConversationFlow constructor accepts comprehensive configuration options: ```python ConversationFlow( agent: Agent, stt: STT | None = None, llm: LLM | None = None, tts: TTS | None = None, vad: VAD | None = None, turn_detector: EOU | None = None, denoise: Denoise | None = None ) ``` To add custom behavior, you inherit from `ConversationFlow` and override its methods. ## Built-in Methods ### Core Processing Methods - `process_with_llm()`: Processes the current chat context with the LLM and handles function tool execution automatically. - `say(message: str)`: Direct TTS synthesis for agent responses. - `process_text_input(text: str)`: Handle text input for A2A communication, bypassing STT. ### Lifecycle Hooks Override these methods to add custom behavior at specific conversation points: ```python class CustomFlow(ConversationFlow): async def on_turn_start(self, transcript: str) -> None: """Called when a user turn begins.""" print(f"User said: {transcript}") async def on_turn_end(self) -> None: """Called when a user turn ends.""" print("Turn completed") ``` ## Automatic Features - **Context Management**: The conversation flow automatically manages the agent's chat context. Do not manually add user messages as this will create duplicates. - **Audio Processing**: Audio data is automatically processed through send_audio_delta(), handling denoising, STT, and VAD processing. - **Interruption Handling**: The system includes sophisticated interruption logic that gracefully handles user interruptions during agent responses. 
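
`process_with_llm()` streams its output, so any method that wraps it is an async generator. If you want to experiment with that pattern before configuring any providers, a stand-in sketch like the one below may help. `FakeFlow` and its canned chunks are hypothetical and not part of the SDK; only the method names mirror the ones described above.

```python
import asyncio
from typing import AsyncIterator


class FakeFlow:
    """Hypothetical stand-in for ConversationFlow; needs no SDK or providers."""

    async def process_with_llm(self) -> AsyncIterator[str]:
        # Pretend the LLM streams its answer in three chunks.
        for chunk in ("Hello", ", ", "world"):
            await asyncio.sleep(0)  # hand control back to the event loop
            yield chunk

    async def run(self, transcript: str) -> AsyncIterator[str]:
        # Forward each chunk as soon as it arrives instead of buffering.
        async for chunk in self.process_with_llm():
            yield chunk.upper()


async def collect(transcript: str) -> list[str]:
    return [chunk async for chunk in FakeFlow().run(transcript)]


chunks = asyncio.run(collect("hi"))
print(chunks)  # ['HELLO', ', ', 'WORLD']
```

Because `run` yields inside the loop, the caller receives the first chunk before the last one has been produced, which is what keeps perceived latency low.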
## Custom Conversation Flows

### RAG (Retrieval-Augmented Generation) Integration

Enhance your agent's knowledge by integrating RAG to retrieve relevant information from external documents and databases.

**Benefits:**

- Access external documents and FAQs
- Reduce hallucinations with real data
- Dynamic context retrieval

```python title="rag_example.py"
from typing import AsyncIterator

from videosdk.agents import ConversationFlow


class RAGConversationFlow(ConversationFlow):
    async def run(self, transcript: str) -> AsyncIterator[str]:
        # Retrieve relevant context. retrieve_relevant_documents() is assumed
        # to be a method you define on your own Agent subclass.
        context = await self.agent.retrieve_relevant_documents(transcript)

        # Add context to conversation
        if context:
            self.agent.chat_context.add_message(
                role="system",
                content=f"Use this information: {context}"
            )

        # Generate response with enhanced context
        async for response in self.process_with_llm():
            yield response
```

See our [RAG Integration Documentation](../core-components/rag) for complete implementation.

### Implementing Custom Flows

You can create a custom flow by inheriting from `ConversationFlow` and overriding the `run` method. This allows you to intercept the user's transcript, modify it, manage state, and even change the response from the LLM.
```python title="main.py"
from typing import AsyncIterator
from videosdk.agents import ConversationFlow, Agent

class CustomConversationFlow(ConversationFlow):
    def __init__(self, agent):
        super().__init__(agent)
        self.turn_count = 0

    async def run(self, transcript: str) -> AsyncIterator[str]:
        """Override the main conversation loop to add custom logic."""
        self.turn_count += 1

        # The user's transcript is already in the agent's chat context
        # (see "Automatic Features"), so do not add it again here.
        # You can still add other messages, such as system notes, before calling the LLM.

        # Process with the standard LLM call
        first_chunk = True
        async for response_chunk in self.process_with_llm():
            # Apply custom processing to the first chunk only, so the prefix
            # is not repeated for every streamed chunk
            if first_chunk:
                response_chunk = await self.apply_custom_processing(response_chunk)
                first_chunk = False
            yield response_chunk

    async def apply_custom_processing(self, chunk: str) -> str:
        """A helper method to modify the LLM's output."""
        if self.turn_count == 1:
            # Prepend a greeting on the first turn
            return f"Hello! {chunk}"
        elif self.turn_count > 5:
            # Offer to summarize after many turns
            return f"This is an interesting topic. To summarize: {chunk}"
        else:
            return chunk
```

### Advanced Turn-Taking Logic

For more complex interactions, you can implement a state machine within your conversation flow to manage different states of the conversation.

```python title="main.py"
from typing import AsyncIterator
from videosdk.agents import ConversationFlow

class AdvancedTurnTakingFlow(ConversationFlow):
    def __init__(self, agent):
        super().__init__(agent)
        self.conversation_state = "listening"  # Initial state

    async def run(self, transcript: str) -> AsyncIterator[str]:
        """A state-driven conversation loop."""
        if self.conversation_state == "listening":
            # If we were listening, we now process the user's input
            await self.process_user_input(transcript)
            if self.conversation_state == "waiting_for_confirmation":
                # The input needs confirmation: ask for it instead of calling the LLM
                yield "Are you sure? Please say yes to confirm."
                return

            # Otherwise transition to the responding state
            self.conversation_state = "responding"
            async for response_chunk in self.process_with_llm():
                yield response_chunk

            # Once done responding, go back to listening
            self.conversation_state = "listening"

        elif self.conversation_state == "waiting_for_confirmation":
            # Handle a confirmation state
            if "yes" in transcript.lower():
                yield "Great! Proceeding."
            else:
                yield "Okay, cancelling."
            self.conversation_state = "listening"

    async def process_user_input(self, transcript: str):
        """Custom logic for processing user input."""
        print(f"Processing user input in state: {self.conversation_state}")
        # Add logic here, e.g., check if the user is asking a question that needs confirmation
        if "delete my account" in transcript.lower():
            self.conversation_state = "waiting_for_confirmation"
```

### Context-Aware Conversations

Maintain conversation history and user preferences to create a personalized and context-aware experience.

```python title="main.py"
import time
from typing import AsyncIterator
from videosdk.agents import ConversationFlow, ChatRole

class ContextAwareFlow(ConversationFlow):
    def __init__(self, agent):
        super().__init__(agent)
        self.conversation_history = []
        self.current_topic = "general"

    async def run(self, transcript: str) -> AsyncIterator[str]:
        # First, update the context with the new transcript
        await self.update_context(transcript)

        # The agent's chat_context (automatically managed) will be
        # used by process_with_llm() to generate a context-aware response.
        async for response_chunk in self.process_with_llm():
            yield response_chunk

    async def update_context(self, transcript: str):
        """Update history and identify the topic before calling the LLM."""
        self.conversation_history.append({
            'role': 'user',
            'content': transcript,
            'timestamp': time.time()
        })
        await self.identify_topic(transcript)

        # Add topic-specific context (system messages are safe to add)
        self.agent.chat_context.add_message(
            role=ChatRole.SYSTEM,
            content=f"System note: The user is asking about {self.current_topic}."
        )

    async def identify_topic(self, transcript: str):
        """A simple topic identification logic."""
        if "weather" in transcript.lower():
            self.current_topic = "weather"
        elif "finance" in transcript.lower():
            self.current_topic = "finance"
```

## Performance Optimization

To ensure the best user experience, consider the following optimization strategies:

- **Efficient Context**: Keep the context provided to the LLM concise. Summarize earlier parts of the conversation to reduce token count and improve LLM response time.
- **Asynchronous Operations**: When performing RAG or calling external APIs for data, ensure the operations are fully asynchronous (async/await) to avoid blocking the event loop.
- **Caching**: Cache frequently accessed data (e.g., from a database or RAG store) to reduce lookup latency on subsequent turns.
- **Streaming**: The `run` method returns an `AsyncIterator`. Process and yield response chunks as soon as they are available from the LLM to minimize perceived latency for the user.

## Examples - Try Out Yourself

We have examples to get you started. Try them out, talk to the agent, and customize them to fit your needs.

--- --- title: Conversational Graph hide_title: false hide_table_of_contents: false description: "Learn how to use Conversational Graph to build structured, state-based conversation flows for your AI agents." 
sidebar_label: Conversational Graph pagination_label: Conversational Graph keywords: - Conversational Graph - AI Agents - State Machine - Conversation Flow - Python SDK image: img/videosdklive-thumbnail.jpg --- import Step from '@site/src/components/Step' import { AgentCardGrid, GithubIcon } from '@site/src/components/agent/cards'; # Conversational Graph The **Conversational Graph** is a powerful tool that allows you to define complex, structured conversation flows for your AI agents. Instead of relying solely on an LLM's inherent reasoning, which can sometimes be unpredictable, you can use a graph-based approach to guide the conversation through specific states and transitions. ## Installation To use the Conversational Graph, you need to install the `videosdk-conversational-graph` package. ```bash pip install videosdk-conversational-graph ``` :::note Check out the latest version of [videosdk-conversational-graph](https://pypi.org/project/videosdk-conversational-graph/) on PyPI. ::: ## Core Concepts The Conversational Graph is built around a few key concepts: 1. **ConversationalGraph**: The main object that manages the states and transitions. 2. **State**: A specific point in the conversation (e.g., "Greeting", "Asking for Name"). Each state has instructions for the agent. 3. **Transition**: Logic that dictates how the agent moves from one state to another based on user input or collected data. ## Example: Loan Application Let's walk through a complete example of building a Loan Application agent. This agent will guide the user through selecting a loan type (Personal, Home, or Car) and collecting the necessary details. ![Conversational Graph Loan Application](https://assets.videosdk.live/images/conversational-Graph.png)
### Step 1: Define the Data Model
First, define the data you want to collect using `ConversationalDataModel`. This ensures the agent knows exactly what information to extract.

```python title="main.py"
from pydantic import Field
from conversational_graph import ConversationalGraph, ConversationalDataModel

class LoanFlow(ConversationalDataModel):
    loan_type: str = Field(None, description="Type of loan: personal, home, car")
    annual_income: int = Field(None, description="Annual income of the applicant in INR")
    credit_score: int = Field(None, description="Credit score of the applicant. Must be between 300 and 850")
    property_value: int = Field(None, description="Value of the property for home loan in INR")
    vehicle_price: int = Field(None, description="Price of the vehicle for car loan in INR")
    loan_amount: int = Field(None, description="Desired loan amount in INR. MUST be greater than ₹11 lakh for approval")
```
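
The field descriptions above state two numeric rules (credit score between 300 and 850, loan amount above ₹11 lakh). If you also want to check collected values in your own code, a sketch along these lines could work. It uses a plain dataclass as a hypothetical stand-in for `LoanFlow` so it runs without the SDK; `LoanData`, `missing_fields`, and `validation_errors` are illustrative names, not library APIs.

```python
from dataclasses import dataclass, fields
from typing import Optional

MIN_LOAN_AMOUNT_INR = 1_100_000  # ₹11 lakh, taken from the loan_amount description


@dataclass
class LoanData:
    """Hypothetical stand-in mirroring three of the LoanFlow fields."""
    loan_type: Optional[str] = None
    credit_score: Optional[int] = None
    loan_amount: Optional[int] = None


def missing_fields(data: LoanData) -> list[str]:
    # Fields that have not been collected yet.
    return [f.name for f in fields(data) if getattr(data, f.name) is None]


def validation_errors(data: LoanData) -> list[str]:
    # Only values that are present are checked; missing ones are reported separately.
    errors = []
    if data.loan_type is not None and data.loan_type not in {"personal", "home", "car"}:
        errors.append("loan_type must be personal, home, or car")
    if data.credit_score is not None and not 300 <= data.credit_score <= 850:
        errors.append("credit_score must be between 300 and 850")
    if data.loan_amount is not None and data.loan_amount <= MIN_LOAN_AMOUNT_INR:
        errors.append("loan_amount must be greater than 1,100,000 INR")
    return errors


partial = LoanData(loan_type="home", credit_score=900)
print(missing_fields(partial))     # ['loan_amount']
print(validation_errors(partial))  # ['credit_score must be between 300 and 850']
```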
### Step 2: Initialize the Graph
Create an instance of `ConversationalGraph` and pass your data model.

```python title="main.py"
loan_application = ConversationalGraph(
    name="Loan Application",
    DataModel=LoanFlow,
    off_topic_threshold=5
)
```
### Step 3: Define States
Define the various states of your conversation. Each state has a `name` and an `instruction` that tells the agent what to do in that state. You can also define specific `tools` available to the agent within that state.

```python title="main.py"
# Start Greeting
q0 = loan_application.state(
    name="Greeting",
    instruction="Welcome user and start the conversation about loan application. Ask if they are ready to apply for a loan.",
)

# Loan Type Selection
q1 = loan_application.state(
    name="Loan Type Selection",
    instruction="Ask user to select loan type. We only offer personal loan, home loan, and car loan at the moment.",
)

# highlight-start
# Tool for state q2
submit_loan_application = PreDefinedTool().http_tool(
    HttpToolRequest(
        name="submit_loan_application",
        description="Called when loan request is approved and submitted.",
        url="https://videosdk.free.beeceptor.com/apply",
        method="POST"
    )
)

q2 = loan_application.state(
    name="Review and Confirm",
    instruction="Review all loan details with the user and get confirmation.",
    tool=submit_loan_application
)
# highlight-end

# Master / Off-topic handler
q_master = loan_application.state(
    name="Off-topic Handler",
    instruction="Handle off-topic or inappropriate inputs respectfully and end the call politely",
    master=True
)
```
### Step 4: Define Transitions
Now, link the states together using transitions. You specify the `from_state`, `to_state`, and a `condition` that must be met to trigger the transition.

```python title="main.py"
# Greeting → Loan Type Selection
loan_application.transition(
    from_state=q0,
    to_state=q1,
    condition="User ready to apply for loan"
)

# Branch from Loan Type Selection
# (q1a is the personal-loan details state; it is not shown in Step 3 and is
# assumed to be defined the same way as the states there)
loan_application.transition(
    from_state=q1,
    to_state=q1a,
    condition="User wants personal loan"
)

# Personal loan branch → Review and Confirm
loan_application.transition(
    from_state=q1a,
    to_state=q2,
    condition="Personal loan details collected and verified"
)

# ... More Transitions ...
```
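
To build intuition for how states and transitions fit together, here is a toy state machine. In the snippet above, conditions are written as plain-language strings; the toy swaps them for Python predicates over the collected data so it can run without the SDK or an LLM. `ToyGraph`, its methods, and the state name "Personal Loan Details" are hypothetical and do not reflect how `ConversationalGraph` is implemented.

```python
from typing import Callable

Condition = Callable[[dict], bool]


class ToyGraph:
    """Hypothetical miniature of a state graph with guarded transitions."""

    def __init__(self, start: str) -> None:
        self.current = start
        self._transitions: dict[str, list[tuple[str, Condition]]] = {}

    def transition(self, from_state: str, to_state: str, condition: Condition) -> None:
        self._transitions.setdefault(from_state, []).append((to_state, condition))

    def step(self, data: dict) -> str:
        # Take the first outgoing transition whose condition holds; otherwise stay put.
        for to_state, condition in self._transitions.get(self.current, []):
            if condition(data):
                self.current = to_state
                break
        return self.current


graph = ToyGraph(start="Greeting")
graph.transition("Greeting", "Loan Type Selection", lambda d: d.get("ready") is True)
graph.transition("Loan Type Selection", "Personal Loan Details",
                 lambda d: d.get("loan_type") == "personal")

data: dict = {}
first = graph.step(data)       # no condition met yet
data["ready"] = True
second = graph.step(data)
data["loan_type"] = "personal"
third = graph.step(data)
print(first, "->", second, "->", third)
# Greeting -> Loan Type Selection -> Personal Loan Details
```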
### Step 5: Integrate with the Agent Pipeline
Finally, pass the `conversational_graph` to your `CascadingPipeline`. ```python title="main.py" class VoiceAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful voice assistant that can assist you with your loan applications") async def on_enter(self) -> None: await self.session.say("Hello, I am here to help with your loan application. How can I help you today?", interruptible=False) async def on_exit(self) -> None: await self.session.say("Goodbye!") async def entrypoint(ctx: JobContext): agent = VoiceAgent() conversation_flow = ConversationFlow(agent) # highlight-start pipeline = CascadingPipeline( stt= DeepgramSTT(), llm=OpenAILLM(), tts=GoogleTTS(), vad=SileroVAD(), turn_detector=TurnDetector(), conversational_graph = loan_application ) # highlight-end session = AgentSession( agent=agent, pipeline=pipeline, conversation_flow=conversation_flow ) await session.start(wait_for_participant=True, run_until_shutdown=True) def make_context() -> JobContext: room_options = RoomOptions(room_id="", name="Workflow Agent", playground=True) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=entrypoint, jobctx=make_context) job.start() ``` ## Working Example } ]} /> --- --- title: De-noise hide_title: false hide_table_of_contents: false description: "Learn how to enhance voice quality by removing background noise in VideoSDK AI Agents. Implement real-time audio denoising for clearer conversations." 
pagination_label: "De-noise" keywords: - Voice Enhancement - Noise Removal - Audio Denoising - RNNoise - Audio Processing - Background Noise - Voice Quality - AI Agent SDK - VideoSDK Agents - Real-time Audio - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 7 sidebar_label: De-noise slug: de-noise --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards'; # De-noise De-noise improves audio quality in your AI agent conversations by filtering out background noise. This creates more professional and engaging interactions, especially in noisy environments. ## Overview The VideoSDK Agents framework provides real-time audio denoising capabilities via `RNNoise` plugin that: - **Remove Background Noise**: Filters out ambient sounds, keyboard typing, air conditioning, and other distractions - **Enhance Voice Clarity**: Improves speech intelligibility and quality - **Work in Real-time**: Processes audio with minimal latency during live conversations - **Integrate Seamlessly**: Works with both `CascadingPipeline` and `RealTimePipeline` architectures ## What De-noise Solves Without noise removal, your agents may struggle with: - Poor audio quality affecting transcription accuracy - Background noise interfering with conversation flow - Unprofessional sound quality in business applications - Difficulty understanding users in noisy environments With De-noise, you get: - Crystal clear audio for better user experience - Improved speech-to-text accuracy - Professional-grade audio quality - Better performance in various acoustic environments ## RNNoise Implementation `RNNoise` is a real-time noise suppression library that uses deep learning to distinguish between speech and noise, providing effective background noise removal. 
### Key Features - **Real-time Processing**: Low-latency noise removal suitable for live conversations - **Adaptive Filtering**: Automatically adjusts to different types of background noise - **Speech Preservation**: Maintains voice quality while removing unwanted sounds - **Lightweight**: Efficient processing with minimal computational overhead ### Basic Setup ```python from videosdk.plugins.rnnoise import RNNoise # Initialize noise removal denoise = RNNoise() ``` ## Pipeline Integration Add noise removal to your cascading pipeline: ```python title="main.py" from videosdk.agents import Agent, CascadingPipeline, AgentSession # highlight-start from videosdk.plugins.rnnoise import RNNoise # highlight-end # Add your preferred providers from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.openai import OpenAILLM from videosdk.plugins.elevenlabs import ElevenLabsTTS from videosdk.plugins.silero import SileroVAD class EnhancedVoiceAgent(Agent): def __init__(self): super().__init__( instructions="You are a professional assistant with crystal-clear audio quality. Help users with their questions while maintaining excellent conversation flow." ) async def on_enter(self): await self.session.say("Hello! I'm here with enhanced audio quality for our conversation.") async def on_exit(self): await self.session.say("Goodbye! 
It was great talking with you.") # Set up pipeline with noise removal pipeline = CascadingPipeline( stt=DeepgramSTT(api_key="your-deepgram-key"), llm=OpenAILLM(api_key="your-openai-key", model="gpt-4"), tts=ElevenLabsTTS(api_key="your-elevenlabs-key", voice_id="your-voice-id"), vad=SileroVAD(), # highlight-start denoise=RNNoise() # Enable noise removal # highlight-end ) # Create and start session async def main(): session = AgentSession(agent=EnhancedVoiceAgent(), pipeline=pipeline) await session.start() if __name__ == "__main__": import asyncio asyncio.run(main()) ``` Integrate with real-time models: ```python title="main.py" from videosdk.agents import Agent, RealTimePipeline, AgentSession # highlight-start from videosdk.plugins.rnnoise import RNNoise # highlight-end from videosdk.plugins.openai import OpenAIRealtime class EnhancedRealtimeAgent(Agent): def __init__(self): super().__init__( instructions="You are a professional assistant with crystal-clear audio quality. Engage in natural, real-time conversations while providing helpful responses." ) async def on_enter(self): await self.session.say("Hello! I'm ready for a real-time conversation with enhanced audio quality.") async def on_exit(self): await self.session.say("Thank you for the conversation! Take care.") # Set up real-time model model = OpenAIRealtime( model="gpt-4o-realtime-preview", api_key="your-openai-key", voice="alloy" # Choose from: alloy, echo, fable, onyx, nova, shimmer ) # Set up pipeline with noise removal pipeline = RealTimePipeline( model=model, # highlight-start denoise=RNNoise() # Enable noise removal # highlight-end ) # Create and start session async def main(): session = AgentSession(agent=EnhancedRealtimeAgent(), pipeline=pipeline) await session.start() if __name__ == "__main__": import asyncio asyncio.run(main()) ``` ## Audio Processing Flow When noise removal is enabled, audio processing follows this flow: 1. **Raw Audio Input:** Microphone captures audio with background noise 2. 
**Noise Removal:** `RNNoise` filters out unwanted sounds 3. **Enhanced Audio:** Clean audio is passed to speech processing 4. **Improved Results:** Better transcription and conversation quality ## Example - Try Out Yourself } ]} /> --- --- title: DTMF Events hide_title: false hide_table_of_contents: false description: Learn how to enable and listen to DTMF (Dual-Tone Multi-Frequency) events in VideoSDK AI Agents. pagination_label: DTMF Events keywords: - SIP - VideoSDK - DTMF - Telephony Integration - Webhooks - PubSub - Events image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: DTMF Events slug: dtmf-events --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import { AgentCardGrid, GithubIcon, } from "@site/src/components/agent/cards"; DTMF (Dual-Tone Multi-Frequency) events happen when a caller presses keys (0–9, *, #) on their phone or SIP device during a call. AI agents can listen for these events to capture user input, run specific actions, or respond to the caller based on the key they pressed. DTMF provides a simple and reliable way for users to interact with the agent during a call. ## How It Works - **DTMF Event Detection**: The agent detects key presses (0–9, *, #) from the caller during a call session. - **Real-Time Processing**: Each key press generates a DTMF event that is delivered to the agent immediately. - **Callback Integration**: A user-defined callback function handles incoming DTMF events. - **Action Execution**: The agent executes actions or triggers workflows based on the received DTMF input like building IVR flows, collecting user input, or triggering actions in your application. ## How to enable DTMF Events ### Step 1 : Activate DTMF Detection DTMF event detection can be enabled in two ways: When creating an Inbound SIP gateway in the VideoSDK dashboard, enable the `DTMF` option. 
![dtmf-event](https://assets.videosdk.live/images/DTMF-events.png) Set the `enableDtmf` parameter to `true` when creating or updating a SIP gateway using the API. ```bash curl -H 'Authorization: $YOUR_TOKEN' \ -H 'Content-Type: application/json' \ -d '{ "name" : "Twilio Inbound Gateway", "enableDtmf" : "true", "numbers" : ["+0123456789"] }' \ -XPOST https://api.videosdk.live/v2/sip/inbound-gateways ``` ### Step 2 : Implementation To set up inbound calls, outbound calls, and routing rules check out the [Quick Start Example](https://docs.videosdk.live/telephony/managing-calls/making-outbound-calls). ```python title="main.py" from videosdk.agents import AgentSession, DTMFHandler async def entrypoint(ctx: JobContext): async def dtmf_callback(digit: int): if digit == 1: agent.instructions = "You are a Sales Representative. Your goal is to sell our products" await agent.session.say( "Routing you to Sales. Hi, I'm from Sales. How can I help you today?" ) elif digit == 2: agent.instructions = "You are a Support Specialist. Your goal is to help customers with technical issues." await agent.session.say( "Routing you to Support. Hi, I'm from Support. What issue are you facing?" ) else: await agent.session.say( "Invalid input. Press 1 for Sales or 2 for Support." ) #highlight-start dtmf_handler = DTMFHandler(dtmf_callback) #highlight-end session = AgentSession( #highlight-start dtmf_handler = dtmf_handler, #highlight-end ) ``` ## Example - Try It Yourself , }, ]} columns={2} /> --- --- title: Fallback Adapter hide_title: false hide_table_of_contents: false description: "Learn about Fallback and recovery for STT, LLM, and TTS providers in VideoSDK AI Agents." 
pagination_label: "Fallback Adapter" keywords: - Fallback Adapter - Fallback and recovery - VideoSDK Agents - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Fallback Adapter slug: fallback-adapter --- import { AgentCardGrid, GithubIcon, } from "@site/src/components/agent/cards"; # Fallback Adapter The `Fallback Adapter` provides automatic failover between multiple STT, LLM, or TTS providers. If a provider becomes unavailable, the system automatically switches to the next configured provider without interrupting the session. ## Features - **Automatic Fallback**: Switches to lower-priority providers if the primary provider fails. - **Cooldown-based Retry**: Implements a cooldown period before retrying a failed provider, preventing immediate repeated failures. - **Auto-Recovery**: Automatically switches back to a higher-priority provider once it becomes healthy again. - **Permanent Disable**: Permanently disables a provider after a configured number of failed recovery attempts. ## Example Usage Here is how you can implement fallback providers for STT, LLM, and TTS in your agent configuration. 
```python from videosdk.agents import FallbackSTT, FallbackLLM, FallbackTTS from videosdk.plugins.openai import OpenAISTT, OpenAILLM, OpenAITTS from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.cerebras import CerebrasLLM from videosdk.plugins.cartesia import CartesiaTTS # Configure Fallback STT stt_provider = FallbackSTT( [OpenAISTT(), DeepgramSTT()], temporary_disable_sec=30.0, permanent_disable_after_attempts=3 ) # Configure Fallback LLM llm_provider = FallbackLLM( [OpenAILLM(model="gpt-4o-mini"), CerebrasLLM()], temporary_disable_sec=30.0, permanent_disable_after_attempts=3 ) # Configure Fallback TTS tts_provider = FallbackTTS( [OpenAITTS(voice="alloy"), CartesiaTTS()], temporary_disable_sec=30.0, permanent_disable_after_attempts=3 ) ``` ## Configuration Options You can configure the fallback behavior using the following parameters: | Parameter | Description | | :--- | :--- | | `temporary_disable_sec` | The duration (in seconds) to wait before retrying a failed provider. | | `permanent_disable_after_attempts` | The maximum number of recovery attempts allowed before a provider is permanently disabled. | ## Examples - Try Out Yourself } ]} /> --- --- title: Memory hide_title: false hide_table_of_contents: false description: "Enable your VideoSDK AI Agents with long-term memory to create personalized, context-aware conversations. This guide covers integrating memory providers like Mem0, retrieving context, and enhancing user experience." pagination_label: "Memory" keywords: - Memory - Long-term Memory - Conversation History - Personalization - Mem0 - AI Agent Memory - Context Retrieval - VideoSDK Agents - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 12 sidebar_label: Memory slug: memory --- import { AgentCardGrid, GithubIcon, CodeIcon } from '@site/src/components/agent/cards'; import Step from '@site/src/components/Step'; # Memory Give your AI agents the ability to remember past interactions and user preferences. 
By integrating a memory provider, your agent can move beyond the limits of its immediate context window to deliver truly personalized and context-aware conversations. ## How Memory Enhances Conversations A standard LLM's memory is limited to its context window. A dedicated memory provider solves this by creating a persistent, intelligent storage layer that recalls information across different sessions. ![Memory-enabled Conversation Flow](https://assets.videosdk.live/images/voice-agent-memory-manager.png) As the diagram shows, the agent intelligently stores key facts and retrieves them in later conversations to provide a personalized, efficient interaction. ## Implementation with Mem0 This guide demonstrates how to implement long-term memory using [**Mem0**](https://mem0.ai/), an open-source platform designed to give AI agents a persistent memory layer. This example creates a "Concierge Agent" that remembers returning users. We will break down the implementation into logical steps. :::note The following sections outline the steps you might follow. For a complete working example, see the GitHub repository: - https://github.com/videosdk-live/agents-quickstart/tree/main/Memory ::: ### Prerequisites - A Mem0 API key, available from the [Mem0 dashboard](https://app.mem0.ai/). - Ensure your agent environment is set up per the [AI Voice Agent Quickstart](/ai_agents/voice-agent-quick-start). This is the baseline app where we'll implement the memory features in the steps below.
### Step 1: Create a Dedicated Memory Manager
Start by creating a memory manager class that abstracts your chosen memory provider's API. This class should handle three core operations: storing memories, retrieving memories, and deciding what to remember.

The key is to implement a `should_store` method that intelligently determines which conversations are worth remembering based on keywords, user intent, or other criteria you define.

```python title="memory_utils.py"
from mem0.client.main import AsyncMemoryClient

class Mem0MemoryManager:
    """Handles all interactions with the Mem0 API."""

    def __init__(self, api_key: str, user_id: str):
        self.user_id = user_id
        self._client = AsyncMemoryClient(api_key=api_key)

    async def fetch_recent_memories(self, limit: int = 5) -> list[str]:
        """Retrieves the most recent memories for the user."""
        try:
            response = await self._client.get_all(filters={"user_id": self.user_id}, limit=limit)
            return [entry.get("memory", "") for entry in response]
        except Exception as e:
            print(f"Error fetching memories: {e}")
            return []

    def should_store(self, user_message: str) -> bool:
        """Determines if a message contains keywords indicating a fact to remember."""
        keywords = ("remember", "preference", "my name is", "likes", "dislike")
        return any(keyword in user_message.lower() for keyword in keywords)

    async def record_memory(self, user_message: str, assistant_message: str | None = None):
        """Stores a conversational turn in Mem0."""
        # ... implementation to call self._client.add()
        ...  # placeholder so the method body is valid Python
```
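
While developing, you may want to exercise the rest of your agent without a Mem0 API key. One option is a small in-memory stand-in that exposes the same three methods as the class above. `InMemoryMemoryManager` is a hypothetical helper, not part of Mem0 or the SDK, and its newest-first ordering is an assumption made for this sketch.

```python
import asyncio
from typing import Optional


class InMemoryMemoryManager:
    """Hypothetical local stand-in with the same method names as Mem0MemoryManager."""

    KEYWORDS = ("remember", "preference", "my name is", "likes", "dislike")

    def __init__(self, user_id: str) -> None:
        self.user_id = user_id
        self._memories: list[str] = []

    async def fetch_recent_memories(self, limit: int = 5) -> list[str]:
        # Newest first, capped at `limit` (ordering is an assumption of this sketch).
        return list(reversed(self._memories))[:limit]

    def should_store(self, user_message: str) -> bool:
        text = user_message.lower()
        return any(keyword in text for keyword in self.KEYWORDS)

    async def record_memory(self, user_message: str,
                            assistant_message: Optional[str] = None) -> None:
        # Only the user's message is kept here; a real provider may store both sides.
        self._memories.append(user_message)


async def demo() -> list[str]:
    manager = InMemoryMemoryManager(user_id="demo-user")
    for message in ("What's the weather?", "My name is Priya", "Remember I prefer tea"):
        if manager.should_store(message):
            await manager.record_memory(message)
    return await manager.fetch_recent_memories()


stored = asyncio.run(demo())
print(stored)  # ['Remember I prefer tea', 'My name is Priya']
```

The weather question is skipped because it contains none of the keywords, which shows how coarse keyword matching is; you would likely refine `should_store` for production use.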
### Step 2: Accessing Memory to Personalize the Agent
Implement memory retrieval at session startup to personalize your agent's behavior. Create a function that fetches relevant user memories and injects them into your agent's system prompt or context.

Consider how you want to use retrieved memories: for personalized greetings, context-aware responses, or maintaining conversation continuity across sessions.

```python title="main.py"
from videosdk.agents import Agent
from memory_utils import Mem0MemoryManager

class MemoryAgent(Agent):
    def __init__(self, instructions: str, remembered_facts: list[str] | None = None):
        self._remembered_facts = remembered_facts or []
        super().__init__(instructions=instructions)

    async def on_enter(self):
        # Use up to two retrieved facts for a personalized greeting
        if self._remembered_facts:
            top_facts = "; ".join(self._remembered_facts[:2])
            await self.session.say(f"Welcome back! I remember that {top_facts}. What can I help you with?")
        else:
            await self.session.say("Hello! How can I help today?")

# This helper function runs at the start of the session
async def build_agent_instructions(memory_manager: Mem0MemoryManager | None) -> tuple[str, list[str]]:
    base_instructions = "You are a helpful voice concierge..."
    if not memory_manager:
        return base_instructions, []

    # Fetches memories and adds them to the system prompt
    remembered_facts = await memory_manager.fetch_recent_memories()
    if not remembered_facts:
        return base_instructions, []

    memory_lines = "\n".join(f"- {fact}" for fact in remembered_facts)
    enriched_instructions = f"{base_instructions}\n\nKnown details about this caller:\n{memory_lines}"
    return enriched_instructions, remembered_facts
```
### Step 3: Storing New Memories with a Custom Conversation Flow
Extend your conversation flow to capture and store new memories during interactions. Override the conversation flow's main processing method to evaluate each user message after the agent responds. The goal is to identify valuable information (user preferences, personal details, important facts) and store it without impacting response latency. You can implement this as a post-processing step or integrate it into your existing conversation handling logic.

:::tip
Want a deeper dive or to run this locally?
- Review core concepts in the [Conversation Flow](./conversation-flow.md) guide.
- To run your agent, follow the [AI Voice Agent Quickstart](/ai_agents/voice-agent-quick-start).
:::

```python title="memory_utils.py"
from typing import AsyncIterator

from videosdk.agents import Agent, ConversationFlow

class Mem0ConversationFlow(ConversationFlow):
    """A custom flow that records memories after each turn."""

    def __init__(self, agent: Agent, memory_manager: Mem0MemoryManager, **kwargs):
        super().__init__(agent=agent, **kwargs)
        self._memory_manager = memory_manager
        self._pending_user_message: str | None = None

    async def run(self, transcript: str) -> AsyncIterator[str]:
        self._pending_user_message = transcript

        # First, let the standard conversation turn happen,
        # passing each response chunk through to the pipeline as it arrives
        full_response = ""
        async for chunk in super().run(transcript):
            full_response += chunk
            yield chunk

        # After the response, decide if the turn should be stored in memory
        if self._pending_user_message and self._memory_manager.should_store(self._pending_user_message):
            await self._memory_manager.record_memory(self._pending_user_message, full_response or None)
        self._pending_user_message = None
```
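The flow above awaits `record_memory` before `run` finishes. If the memory provider is slow, one option is to schedule the write as a background task so the turn is not held up. The sketch below shows that pattern with plain `asyncio`; `BackgroundRecorder` is a hypothetical helper, not an SDK class, and `slow_record` stands in for a real provider call.

```python
import asyncio
from typing import Awaitable, Callable, Optional

RecordFn = Callable[[str, Optional[str]], Awaitable[None]]


class BackgroundRecorder:
    """Hypothetical helper: schedules memory writes without blocking the turn."""

    def __init__(self, record_fn: RecordFn):
        self._record_fn = record_fn
        self._tasks: set = set()

    def schedule(self, user_message: str, assistant_message: Optional[str]) -> None:
        task = asyncio.create_task(self._safe_record(user_message, assistant_message))
        # Keep a reference so the task is not garbage-collected before it finishes
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def _safe_record(self, user_message: str, assistant_message: Optional[str]) -> None:
        try:
            await self._record_fn(user_message, assistant_message)
        except Exception as exc:
            # A failed write should never break the conversation
            print(f"Memory write failed: {exc}")

    async def drain(self) -> None:
        """Wait for pending writes, e.g. during session shutdown."""
        if self._tasks:
            await asyncio.gather(*list(self._tasks))


async def demo():
    stored: list[str] = []

    async def slow_record(user_message: str, assistant_message: Optional[str]) -> None:
        await asyncio.sleep(0.01)  # stands in for a network call
        stored.append(user_message)

    recorder = BackgroundRecorder(slow_record)
    recorder.schedule("My name is Ana", "Nice to meet you, Ana!")
    stored_before_drain = len(stored)  # the write has not run yet
    await recorder.drain()
    return stored_before_drain, stored


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Inside a custom flow, `recorder.schedule(...)` would replace the awaited `record_memory` call, with `drain()` called when the session closes so pending writes are not lost.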
### Step 4: Assembling the Agent Session
Integrate all components in your main application entry point. Initialize your memory manager, use it to build personalized agent instructions, and configure your session with the enhanced conversation flow. This is where you connect the memory system to your agent's lifecycle, ensuring memories are loaded at startup and new information is captured during conversations.

```python title="main.py"
async def start_session(context: JobContext):
    # 1. Setup memory manager
    memory_manager = Mem0MemoryManager(api_key=os.getenv("MEM0_API_KEY"), user_id="demo-user")

    # 2. Build agent with personalized instructions
    instructions, facts = await build_agent_instructions(memory_manager)
    agent = MemoryAgent(instructions=instructions, remembered_facts=facts)

    # 3. Setup conversation flow with memory capabilities
    conversation_flow = Mem0ConversationFlow(agent=agent, memory_manager=memory_manager, ...)

    # 4. Create the session with the custom flow
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,  # your pipeline
        conversation_flow=conversation_flow
    )
    # ... rest of your session and job context setup
```

This creates a powerful feedback loop where each interaction enriches the agent's knowledge, leading to smarter and more personalized conversations over time.
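Note that `os.getenv("MEM0_API_KEY")` returns `None` when the variable is unset, and `build_agent_instructions` from Step 2 already accepts `None` as the memory manager. A minimal sketch of a guard that skips memory in that case follows; `optional_memory_manager` is a hypothetical helper, and the factory argument is used here only so the example runs without Mem0 installed.

```python
import os
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def optional_memory_manager(
    factory: Callable[[str, str], T],
    user_id: str,
    env_var: str = "MEM0_API_KEY",
) -> Optional[T]:
    """Build a memory manager only when the API key is configured."""
    api_key = os.getenv(env_var)
    if not api_key:
        print(f"{env_var} is not set; running without long-term memory")
        return None
    return factory(api_key, user_id)


if __name__ == "__main__":
    # In the real app the factory would be something like:
    # lambda key, uid: Mem0MemoryManager(api_key=key, user_id=uid)
    manager = optional_memory_manager(lambda key, uid: {"api_key": key, "user_id": uid}, "demo-user")
    print(manager)
```

When this returns `None`, the session would also need to fall back to a plain `ConversationFlow`, since `Mem0ConversationFlow` expects a manager.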
### Step 5: Run the Agent
Start your worker process and connect the agent to a room using a `JobContext`. This boots your agent and keeps it running.

```python title="main.py"
from videosdk.agents import WorkerJob, JobContext, RoomOptions

def make_context() -> JobContext:
    return JobContext(room_options=RoomOptions(name="Concierge Agent", playground=True))

if __name__ == "__main__":
    WorkerJob(entrypoint=start_session, jobctx=make_context).start()
```

This will initialize the session using your `start_session` function from Step 4 and keep the worker alive.

## Example - Try It Yourself

Explore our complete, runnable example on GitHub to see how to integrate a memory provider into a VideoSDK AI Agent.

---

---
title: Multi Agent Switching
hide_title: false
hide_table_of_contents: false
description: "Learn how to switch between multiple specialized agents in VideoSDK for context-aware workflow using real world examples"
pagination_label: "Multi Agent Switching"
keywords:
  - Multi Agent Switching
  - Multi-Agent Orchestration
  - VideoSDK Agents
  - VideoSDK AI Voice
  - Python SDK
  - Real-time Transcription
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Multi Agent Switching
slug: multi-agent-switching
---

import { AgentCardGrid, GithubIcon } from "@site/src/components/agent/cards";

Multi agent switching allows you to break a complex workflow into multiple specialized agents, each responsible for a specific domain or task. Instead of relying on a single agent to manage every tool and decision, you can coordinate smaller agents that operate independently.

### Context Inheritance

When switching agents, you can control whether the new agent should be aware of the previous conversation using the `inherit_context` flag.

- **`inherit_context=True`**: The new agent receives the full chat context. This is ideal for maintaining continuity, so the user doesn't have to repeat information.
- **`inherit_context=False`** (Default): The new agent starts with a fresh state. This is useful when switching to a completely unrelated task.

## How It Works

- The primary VideoSDK agent identifies whether specialized assistance is needed based on the user's intent.
- It invokes a `function_tool` to switch to the appropriate specialized agent.
- Control automatically shifts to the new agent, which has access to the previous chat context because `inherit_context=True` was passed.
- The specialized agent handles the user’s request and completes the interaction.

### Implementation

```python title="main.py"
from videosdk.agents import Agent, function_tool

class TravelAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a travel assistant. Help users with general travel inquiries and guide them to booking when needed.""",
        )

    async def on_enter(self) -> None:
        await self.session.reply(instructions="Greet the user and ask how you can help with their travel plans.")

    async def on_exit(self) -> None:
        await self.session.say("Safe travels!")

    @function_tool()
    async def transfer_to_booking(self) -> Agent:
        """Transfer the user to a booking specialist for reservations and scheduling."""
        return BookingAgent(inherit_context=True)

class BookingAgent(Agent):
    def __init__(self, inherit_context: bool = False):
        super().__init__(
            instructions="""You are a booking specialist. Help users book or modify flights, hotels, and travel reservations.""",
            inherit_context=inherit_context
        )

    async def on_enter(self) -> None:
        await self.session.say("I'm a booking specialist. What would you like to book or modify today?")

    async def on_exit(self) -> None:
        await self.session.say("Your booking request is complete. Have a great trip!")
```

## Example - Try It Yourself

- [Health Care Agent Example](https://github.com/videosdk-live/agents-quickstart/blob/main/Multi%20Agent%20Switch/Health%20Care%20agent/)

---

---
title: Overview
hide_title: false
hide_table_of_contents: false
description: "Get an overview of the VideoSDK AI Agent SDK, a framework for building AI agents for real-time conversations. Learn about its core components: Agent, Pipeline, and Agent Session."
pagination_label: "Overview"
keywords:
  - AI Agent SDK
  - VideoSDK Agents
  - Core Components
  - Agent
  - Pipeline
  - Agent Session
  - Real-time AI
  - Agentic Workflow
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Overview
slug: overview
---

The VideoSDK AI Agent SDK provides a powerful framework for building AI agents that can participate in real-time conversations. This guide explains the core components and demonstrates how to create a complete agentic workflow. The SDK serves as a real-time bridge between AI models and your users, facilitating seamless voice and media interactions.

## Architecture

The Agent Session orchestrates the entire workflow, combining the Agent with a Pipeline for real-time communication. You can use a direct Realtime Pipeline for speech-to-speech, or a Cascading Pipeline with a Conversation Flow for modular STT-LLM-TTS control.

![Overview](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_overview.png)

1. **Agent** - This is the base class for defining your agent's identity and behavior. Here, you can configure custom instructions, manage its state, and register function tools.
2. **Pipeline** - This component manages the real-time flow of audio and data between the user and the AI models.
   The SDK offers two types of pipelines:
   - **Realtime Pipeline** - A speech-to-speech pipeline that needs no separate speech-to-text or text-to-speech conversion and no LLM to configure in between.
   - **Cascading Pipeline** - The traditional STT-LLM-TTS pipeline, which lets you mix and match different providers for Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS).
3. **Agent Session** - This component brings together the agent, pipeline, and conversation flow to manage the agent's lifecycle within a VideoSDK meeting.
4. **Conversation Flow** - This inheritable class works with the `CascadingPipeline` to let you define custom turn-taking logic and preprocess transcripts.

## Supporting Components

These components work behind the scenes to support the core functionality of the AI Agent SDK:

- Execution & Lifecycle Management
  - **JobContext** - Provides the execution environment and lifecycle management for AI agents. It encapsulates the context in which an agent job is running.
  - **WorkerJob** - Manages the execution of jobs and worker processes using Python's multiprocessing, allowing for concurrent agent operations.
- Configuration & Settings
  - **RoomOptions** - This allows you to configure the behavior of the session, including room settings and other advanced features for the agent's interaction within a meeting.
  - **Options** - This is used to configure the behavior of the worker, including logging and other execution settings.
- External Integration
  - **MCP Servers** - These enable the integration of external tools through either stdio or HTTP transport.
    - **MCPServerStdio** - Facilitates direct process communication for local Python scripts.
    - **MCPServerHTTP** - Enables HTTP-based communication for remote servers and services.

## Advanced Features

The AI Agent SDK includes a range of advanced features to build sophisticated conversational agents:

import {
  AgentCardGrid,
  GithubIcon,
  RobotIcon,
  DocumentIcon,
  PlayIcon,
  CodeIcon,
  ExternalLinkIcon,
  SettingsIcon,
  EyeIcon,
  RecordingIcon,
  NetworkIcon,
  MCPServerIcon,
} from '@site/src/components/agent/cards';

- [Playground Mode](https://docs.videosdk.live/ai_agents/core-components/agent-session#playground-mode): A testing environment to experiment with different agent configurations
- [Vision Integration](https://docs.videosdk.live/ai_agents/core-components/vision-and-multi-modality): Enable agents to receive and process video input from the meeting
- [Recording Capabilities](https://docs.videosdk.live/ai_agents/core-components/recording): Record agent sessions for analysis and quality assurance
- [A2A Communication](https://docs.videosdk.live/ai_agents/a2a/overview): Allows for seamless collaboration between specialized AI agents
- [MCP Server Integration](https://docs.videosdk.live/ai_agents/mcp-integration): Connect agents to external tools and data sources

## Examples - Try Out Yourself

We have examples to get you started. Try them out, talk to the agent, and customize them to your needs.

- [Human in the loop](https://github.com/videosdk-live/agents/tree/main/examples/human_in_the_loop): Implement human intervention capabilities in AI agent conversations for better control and oversight
- [Enhanced Pronunciation](https://github.com/videosdk-live/agents/blob/main/examples/enhanced_pronounciation.py): Improve speech quality and pronunciation accuracy for better user experience and communication clarity
- [PubSub Messaging](https://github.com/videosdk-live/agents/blob/main/examples/pubsub_example.py): Facilitates real-time messaging between agent and client

---

---
title: Preemptive Response
hide_title: false
hide_table_of_contents: false
description: "Learn how to enable Preemptive generation for faster STT responses using the VideoSDK AI Agent SDK."
pagination_label: "Preemptive Response"
keywords:
  - Preemptive Generation
  - Preemptive Response
  - Deepgram STT
  - Speech To Text
  - VideoSDK Agents
  - Python SDK
  - Real-time Transcription
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Preemptive Response
slug: preemptive-response
---

# Preemptive Response

Preemptive Response is a feature that allows the Speech-to-Text (STT) engine to produce **partial, low-latency text output** while the user is still speaking. This is crucial for building highly responsive conversational AI agents. By enabling preemptive response, your agent can begin processing the user's intent and formulating a response before the full utterance is completed, significantly reducing the perceived latency.

## How It Works

![preemptive-response](https://assets.videosdk.live/images/preemptive-response.png)

- User audio is streamed to the STT, which generates partial transcripts.
- These partial transcripts are immediately sent to the LLM to enable preemptive (early) responses.
- The LLM output is then passed to the TTS to generate the spoken response.

## Prerequisites

Ensure you have the required packages installed:

```text
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"
```

:::tip
Currently, preemptive response generation is limited to Deepgram’s STT implementation and is available only in the Flux model.
:::

## Enabling Preemptive Generation

To enable this feature, set the `enable_preemptive_generation` flag to `True` when initializing your STT plugin (e.g., `DeepgramSTTV2`).

```python
from videosdk.plugins.deepgram import DeepgramSTTV2

stt = DeepgramSTTV2(
    enable_preemptive_generation=True
)
```

## Full Working Example

The following example demonstrates how to build a voice agent with preemptive transcription enabled. This setup uses Deepgram for STT, OpenAI for LLM, and ElevenLabs for TTS.

```python
import asyncio
import os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTTV2
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model to avoid delays during startup
pre_download_model()

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")

    async def on_enter(self):
        await self.session.say("Hello! How can I help you today?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # 1. Create the agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # 2. Define the pipeline with Preemptive Generation enabled
    pipeline = CascadingPipeline(
        stt=DeepgramSTTV2(
            model="flux-general-en",
            enable_preemptive_generation=True  # Enable low-latency partials
        ),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    # 3. Initialize the session
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running
        await asyncio.Event().wait()
    finally:
        # Clean up resources
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

---

---
title: Pub/Sub Messaging
hide_title: false
hide_table_of_contents: false
description: "Learn how to implement real-time, bidirectional communication between your VideoSDK AI Agent and client applications using Pub/Sub messaging. This guide covers sending and receiving messages, handling events, and practical use cases."
pagination_label: "Pub/Sub Messaging"
keywords:
  - Pub/Sub
  - Real-time Messaging
  - Bidirectional Communication
  - VideoSDK Agents
  - AI Agent SDK
  - Event Handling
  - Client-Agent Communication
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 11
sidebar_label: Pub/Sub Messaging
slug: pubsub-messaging
---

import { AgentCardGrid, GithubIcon, CodeIcon } from '@site/src/components/agent/cards';
import Step from '@site/src/components/Step';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Pub/Sub Messaging

Pub/Sub (Publish/Subscribe) messaging enables real-time, bidirectional communication between your AI agent and client applications within a VideoSDK meeting.
This allows you to build interactive experiences where the client can send commands or data to the agent, and the agent can push updates or notifications back to the client, all without relying on voice.

![Pub/Sub Architecture Diagram](https://strapi.videosdk.live/uploads/user_agent_pubsub_chat_e62ce8f209.png)

## Key Features

- **Send Messages**: Agents can publish messages to any specified Pub/Sub topic, which can be received by any participant (including client applications) subscribed to that topic.
- **Receive Messages**: Agents can subscribe to topics to receive messages published by client applications or other participants.
- **Bidirectional Flow**: Communication is not one-way. Both the agent and the client can publish and subscribe, enabling a fully interactive loop.
- **Decoupled Communication**: The client and agent do not need to know about each other's existence directly. They communicate through shared topics, which simplifies the architecture.

## Implementation

Implementing Pub/Sub involves two main parts: subscribing to topics to receive messages and publishing messages.

Subscribing is typically the first step on both the agent and client side. The snippets below show how to subscribe to a Pub/Sub topic from the AI Agent and from each client SDK.

```python title="Subscribe on Room Context"
from videosdk import PubSubSubscribeConfig

def on_client_message(message):
    print(f"Received: {message}")

# Run this inside your async entrypoint, after connecting to the room
await ctx.room.subscribe_to_pubsub(PubSubSubscribeConfig(
    topic="CHAT",
    cb=on_client_message
))
```

```js title="Subscribe on meeting join"
// Subscribe to CHAT
meeting.on("meeting-joined", () => {
  meeting.pubSub.subscribe("CHAT", (data) => {
    const { message, senderId, senderName, timestamp } = data;
    console.log("Client command:", message);
  });
});
```

```js title="usePubSub hook"
import { usePubSub } from "@videosdk.live/react-sdk";

function ClientCommands() {
  usePubSub("CHAT", {
    onMessageReceived: ({ message, senderId }) => {
      console.log("Client command:", message);
    },
  });
  return null;
}
```

```js title="usePubSub hook"
import { usePubSub } from "@videosdk.live/react-native-sdk";

function ClientCommands() {
  const { messages } = usePubSub("CHAT", {
    onMessageReceived: (message) => {
      console.log("Client command:", message.message);
    },
  });
  return null;
}
```

```swift title="Subscribe with listener"
class ClientCommandsListener: PubSubMessageListener {
    func onMessageReceived(message: PubSubMessage) {
        print("Client command: \(message.message)")
    }
}

let listener = ClientCommandsListener()
meeting?.pubsub.subscribe(topic: "CHAT", forListener: listener)
```

```kotlin title="Subscribe with listener"
val listener = PubSubMessageListener { message ->
    Log.d("PubSub", "Client command: ${message.message}")
}
meeting?.pubSub?.subscribe("CHAT", listener)
```

```dart title="Subscribe with handler"
void messageHandler(PubSubMessage message) {
  print("Client command: ${message.message}");
}

final messages = await room.pubSub.subscribe(
  "CHAT",
  messageHandler,
);
```

The most effective way for an agent to publish messages is by exposing a `function_tool`. This allows the LLM to decide when to send a message based on the conversation.
To publish, you use `PubSubPublishConfig` and call the `publish_to_pubsub` method on the `JobContext` room object.

```python
from videosdk import PubSubPublishConfig
from videosdk.agents import Agent, function_tool, JobContext

class MyPubSubAgent(Agent):
    def __init__(self, ctx: JobContext):
        super().__init__(
            instructions="You can send messages to the client using the send_message_to_client tool."
        )
        self.ctx = ctx

    @function_tool
    async def send_message_to_client(self, message: str):
        """Sends a text message to the client application on the 'CHAT' topic."""
        publish_config = PubSubPublishConfig(
            topic="CHAT",
            message=message
        )
        await self.ctx.room.publish_to_pubsub(publish_config)
        return f"Message '{message}' sent to client."
```

To receive messages, the agent must subscribe to a topic using `PubSubSubscribeConfig` and the `subscribe_to_pubsub` method, which registers a callback function to handle incoming messages. This setup is typically done in your main `entrypoint` function after connecting to the room.

```python
import asyncio
from videosdk import PubSubSubscribeConfig
from videosdk.agents import JobContext

# Define the callback function that will process incoming messages
def on_client_message(message):
    print(f"Received message from client: {message}")
    # Add your logic here to process the message.
    # For example, you could pass it to the agent's pipeline.

async def entrypoint(ctx: JobContext):
    # ... (agent and session setup)

    try:
        await ctx.connect()
        await ctx.room.wait_for_participant()

        # Configure the subscription
        subscribe_config = PubSubSubscribeConfig(
            topic="CHAT",
            cb=on_client_message
        )

        # Subscribe to the topic
        await ctx.room.subscribe_to_pubsub(subscribe_config)

        # Start the agent session
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await ctx.shutdown()
```

## Best Practices

- **Topic Naming Conventions**: Use clear and consistent topic names (e.g., `CHAT`, `AGENT_STATUS`) to keep your application organized.
- **Structured Data**: Use JSON for your message payloads. This makes messages easy to parse and allows for sending complex data structures.
- **Error Handling**: Your callback function should gracefully handle malformed or unexpected messages to prevent crashes.
- **Asynchronous Callbacks**: If your callback function performs long-running tasks, make sure it is `async` and consider running tasks in the background with `asyncio.create_task()` to avoid blocking the main event loop.

## Example - Try Out Yourself

Check out our quickstart repository for a complete, runnable example of an agent using Pub/Sub.

---

---
title: RAG (Retrieval-Augmented Generation)
hide_title: false
hide_table_of_contents: false
description: "Learn how to implement Retrieval-Augmented Generation (RAG) with VideoSDK AI Agents to enhance your agent's knowledge base with external documents, databases, and real-time information retrieval capabilities."
pagination_label: "RAG Integration"
keywords:
  - RAG
  - Retrieval-Augmented Generation
  - Knowledge Base
  - Document Retrieval
  - Vector Database
  - Embeddings
  - Semantic Search
  - Context Retrieval
  - AI Agent Knowledge
  - Document Processing
  - Information Retrieval
  - VideoSDK Agents
  - AI Agent SDK
  - Python
  - Real-time Knowledge
  - External Data Sources
  - Voice Agent Sessions
  - Knowledge Management
image: img/videosdklive-thumbnail.jpg
sidebar_position: 5
sidebar_label: RAG
slug: rag
---

import ReactPlayer from "react-player";

# RAG (Retrieval-Augmented Generation) integration

**RAG** helps your AI agent find relevant information from documents to give better answers. It searches through your knowledge base and uses that context to respond more accurately.

## Architecture

![RAG](https://cdn.videosdk.live/website-resources/docs-resources/voice_agent_rag.png)

The RAG pipeline flow:

1. **Voice Input** → Deepgram STT converts speech to text
2. **Retrieval** → Query embeddings fetch relevant documents from ChromaDB
3. **Augmentation** → Retrieved context is injected into the prompt
4. **Generation** → OpenAI LLM generates a grounded response
5. **Voice Output** → ElevenLabs TTS converts response to speech

## Managed RAG

With Managed RAG, you can upload knowledge bases from the VideoSDK dashboard and attach them to your agent to enhance responses using retrieval-augmented generation.

#### Step 1: Upload Knowledge Base on the dashboard

#### Step 2: Configure it in Cascading Pipeline

After uploading, the Knowledge Base is assigned a unique ID (as shown in Step 1), which you can use to load it, enabling the agent to fetch relevant information during conversations.

```python title="main.py"
import os
from videosdk.agents import KnowledgeBase, KnowledgeBaseConfig

# Initialize Knowledge Base with ID from Dashboard
kb_id = os.getenv("KNOWLEDGE_BASE_ID")
config = KnowledgeBaseConfig(id=kb_id, top_k=3)

# Load Knowledge Base and pass it to the agent
agent = VoiceAgent(
    knowledge_base=KnowledgeBase(config)
)
```

## Custom RAG

### Prerequisites

- Install VideoSDK agents with all dependencies:

```bash
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"
pip install chromadb openai numpy
```

- Set API keys in environment:

```shell title=".env"
DEEPGRAM_API_KEY = "Your Deepgram API Key"
OPENAI_API_KEY = "Your OpenAI API Key"
ELEVENLABS_API_KEY = "Your ElevenLabs API Key"
VIDEOSDK_AUTH_TOKEN = "VideoSDK Auth token"
```

:::tip
For a complete working example with all the code integrated together, check out our GitHub repository: [RAG Implementation Example](https://github.com/videosdk-live/agents-quickstart/blob/main/RAG/rag.py)
:::

## Implementation

### Step 1: Custom Voice Agent with RAG

Create a custom agent class that extends `Agent` and adds retrieval capabilities:

```python title="main.py"
import os

import chromadb
from openai import AsyncOpenAI
from videosdk.agents import Agent

class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a helpful voice assistant that answers questions based on provided context.
            Use the retrieved documents to ground your answers. If no relevant context is found, say so.
            Be concise and conversational."""
        )

        # Initialize OpenAI client for embeddings
        self.openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

        # Define your knowledge base
        self.documents = [
            "What is VideoSDK? VideoSDK is a comprehensive video calling and live streaming platform...",
            "How do I authenticate with VideoSDK? Use JWT tokens generated with your API key...",
            # Add more documents
        ]

        # Set up ChromaDB
        self.chroma_client = chromadb.Client()  # In-memory
        # For persistence: chromadb.PersistentClient(path="./chroma_db")
        self.collection = self.chroma_client.create_collection(
            name="videosdk_faq_collection"
        )

        # Generate embeddings and populate database
        self._initialize_knowledge_base()

    def _initialize_knowledge_base(self):
        """Generate embeddings and store documents."""
        embeddings = [self._get_embedding_sync(doc) for doc in self.documents]
        self.collection.add(
            documents=self.documents,
            embeddings=embeddings,
            ids=[f"doc_{i}" for i in range(len(self.documents))]
        )
```

### Step 2: Embedding Generation

Implement both synchronous (for initialization) and asynchronous (for runtime) embedding methods:

```python title="main.py"
# Add these methods to VoiceAgent
    def _get_embedding_sync(self, text: str) -> list[float]:
        """Synchronous embedding for initialization."""
        import openai
        client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        response = client.embeddings.create(
            input=text,
            model="text-embedding-ada-002"
        )
        return response.data[0].embedding

    async def get_embedding(self, text: str) -> list[float]:
        """Async embedding for runtime queries."""
        response = await self.openai_client.embeddings.create(
            input=text,
            model="text-embedding-ada-002"
        )
        return response.data[0].embedding
```

### Step 3: Retrieval Method

Add semantic search capability:

```python title="main.py"
# Add this method to VoiceAgent
    async def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Retrieve top-k most relevant documents from vector database."""
        # Generate query embedding
        query_embedding = await self.get_embedding(query)

        # Query ChromaDB
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=k
        )

        # Return matching documents
        return results["documents"][0] if results["documents"] else []
```

### Step 4: Agent Lifecycle Hooks

Define agent behavior on entry and exit:

```python title="main.py"
# Add these methods to VoiceAgent
    async def on_enter(self) -> None:
        """Called when agent session starts."""
        await self.session.say("Hello! I'm your VideoSDK assistant. How can I help you today?")

    async def on_exit(self) -> None:
        """Called when agent session ends."""
        await self.session.say("Thank you for using VideoSDK. Goodbye!")
```

### Step 5: Custom Conversation Flow

Override the conversation flow to inject retrieved context:

```python title="main.py"
from typing import AsyncIterator

from videosdk.agents import ConversationFlow

class RAGConversationFlow(ConversationFlow):
    async def run(self, transcript: str) -> AsyncIterator[str]:
        """
        Process user input with RAG pipeline.

        Args:
            transcript: User's speech transcribed to text

        Yields:
            Generated response chunks
        """
        # Step 1: Retrieve relevant documents
        context_docs = await self.agent.retrieve(transcript)

        # Step 2: Format context
        if context_docs:
            context_str = "\n\n".join([f"Document {i+1}: {doc}" for i, doc in enumerate(context_docs)])
        else:
            context_str = "No relevant context found."

        # Step 3: Inject context into conversation
        self.agent.chat_context.add_message(
            role="system",
            content=f"Retrieved Context:\n{context_str}\n\nUse this context to answer the user's question."
        )

        # Step 4: Generate response with LLM
        async for response_chunk in self.process_with_llm():
            yield response_chunk
```

### Step 6: Session and Pipeline Setup

Configure the agent session and start the job:

```python title="main.py"
import asyncio

from videosdk.agents import AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

async def entrypoint(ctx: JobContext):
    agent = VoiceAgent()
    conversation_flow = RAGConversationFlow(
        agent=agent,
    )

    session = AgentSession(
        agent=agent,
        pipeline=CascadingPipeline(
            stt=DeepgramSTT(),
            llm=OpenAILLM(),
            tts=ElevenLabsTTS(),
            vad=SileroVAD(),
            turn_detector=TurnDetector()
        ),
        conversation_flow=conversation_flow,
    )

    # Register cleanup
    ctx.add_shutdown_callback(lambda: session.close())

    # Start agent
    try:
        await ctx.connect()
        print("Waiting for participant...")
        await ctx.room.wait_for_participant()
        print("Participant joined - starting session")
        await session.start()
        await asyncio.Event().wait()
    except KeyboardInterrupt:
        print("\nShutting down gracefully...")
    finally:
        await session.close()
        await ctx.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(name="RAG Voice Assistant", playground=True)
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
```

## Advanced Features

### Dynamic Document Updates

Add documents at runtime:

```python title="main.py"
# Add this method to VoiceAgent
    async def add_document(self, document: str, metadata: dict | None = None):
        """Add a new document to the knowledge base."""
        doc_id = f"doc_{len(self.documents)}"
        embedding = await self.get_embedding(document)
        self.collection.add(
            documents=[document],
            embeddings=[embedding],
            ids=[doc_id],
            metadatas=[metadata] if metadata else None
        )
        self.documents.append(document)
```

### Document Chunking

Split large documents for better retrieval:

```python title="main.py"
# Add this method to VoiceAgent
    def chunk_document(self, document: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
        """Split document into overlapping chunks."""
        words = document.split()
        chunks = []
        for i in range(0, len(words), chunk_size - overlap):
            chunk = " ".join(words[i:i + chunk_size])
            chunks.append(chunk)
        return chunks

        # Use when adding documents (e.g. in __init__, before embeddings are generated)
        for doc in large_documents:
            chunks = self.chunk_document(doc)
            for chunk in chunks:
                self.documents.append(chunk)
```

#### Best Practices

1. Document Quality: Use clear, well-structured documents with specific information
2. Chunk Size: Keep chunks between 300-800 words for optimal retrieval
3. Retrieval Count: Start with k=2-3, adjust based on response quality and latency
4. Context Window: Ensure retrieved context fits within LLM token limits
5. Persistent Storage: Use PersistentClient in production to save embeddings
6. Error Handling: Always handle retrieval failures gracefully
7. Testing: Test with diverse queries to ensure good coverage

#### Common Issues

| Issue              | Solution                                                                      |
| ------------------ | ----------------------------------------------------------------------------- |
| Slow responses     | Reduce retrieval count (k), use faster embedding model, or cache embeddings   |
| Irrelevant results | Improve document quality, adjust chunking strategy, or use metadata filtering |
| Out of memory      | Use PersistentClient instead of in-memory Client                              |

---

---
title: Realtime Pipeline
hide_title: false
hide_table_of_contents: false
description: "Explore the `Realtime Pipeline` component in the VideoSDK AI Agent SDK. Learn how it manages AI models (like OpenAI and Gemini), configurations, streaming audio, and multi-modal capabilities."
pagination_label: "Realtime Pipeline"
keywords:
  - Pipeline Component
  - AI Agent SDK
  - VideoSDK Agents
  - AI Models
  - OpenAI
  - Gemini
  - Model Configuration
  - Streaming Audio
  - Multi-modal AI
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 3
sidebar_label: Realtime Pipeline
slug: realtime-pipeline
---

# RealTime Pipeline

The `Realtime Pipeline` provides direct speech-to-speech processing with minimal latency.
It uses unified models that handle the entire audio processing pipeline in a single step, which gives lower response latency than chaining separate STT, LLM, and TTS components.

![Realtime Pipeline Architecture](https://cdn.videosdk.live/website-resources/docs-resources/videosdk_realtime_pipeline.png)

:::tip
The `RealTimePipeline` is designed for real-time AI models that provide end-to-end speech processing, which suits conversational agents. For use cases that need more granular control over individual components (STT, LLM, TTS), for example for context support and response control, the [CascadingPipeline ↗](https://docs.videosdk.live/ai_agents/core-components/cascading-pipeline) is more appropriate.
:::

## Basic Usage

Setting up a `RealTimePipeline` is straightforward: initialize your chosen real-time model and pass it to the pipeline's constructor.

```python title="main.py"
from videosdk.agents import RealTimePipeline
from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig

# Initialize the desired real-time model
model = OpenAIRealtime(
    model="gpt-4o-realtime-preview",
    config=OpenAIRealtimeConfig(
        voice="alloy",
        response_modalities=["AUDIO"]
    )
)

# Create the pipeline with the model
pipeline = RealTimePipeline(model=model)
```

In addition to [OpenAI ↗](https://docs.videosdk.live/ai_agents/plugins/realtime/openai), the Realtime Pipeline also supports other real-time models such as [Google Gemini (Live API) ↗](https://docs.videosdk.live/ai_agents/plugins/realtime/google-live-api) and [AWS Nova Sonic ↗](https://docs.videosdk.live/ai_agents/plugins/realtime/aws-nova-sonic), each with its own feature set. See their pages for advanced configuration options.
import { AgentCardGrid, GithubIcon, RobotIcon, DocumentIcon, PlayIcon, CodeIcon, ExternalLinkIcon, SettingsIcon, GeminiIcon, OpenAIIcon, AWSNovaSonicIcon, } from '@site/src/components/agent/cards';

- [Google Gemini](https://docs.videosdk.live/ai_agents/plugins/realtime/google-live-api): More about the Gemini Realtime plugin
- [AWS Nova Sonic](https://docs.videosdk.live/ai_agents/plugins/realtime/aws-nova-sonic): More about the AWS Nova Sonic Realtime plugin

:::tip - Choose a model based on its optimal audio sample rate (OpenAI/Nova Sonic: 16kHz, Gemini: 24kHz) to best fit your needs. - For cloud providers like AWS, select the server region closest to your users to minimize network latency. ::: ## Custom Model Integration To integrate a custom real-time model, you need to implement the `RealtimeBaseModel` interface, which requires implementing methods like `connect()`, `handle_audio_input()`, `send_message()`, and `interrupt()`. ```python title="main.py" from videosdk.agents import RealtimeBaseModel, Agent, CustomAudioStreamTrack from typing import Literal, Optional import asyncio class CustomRealtime(RealtimeBaseModel[Literal["user_speech_started", "error"]]): """Custom real-time AI model implementation""" def __init__(self, model_name: str, api_key: str): super().__init__() self.model_name = model_name self.api_key = api_key self.audio_track: Optional[CustomAudioStreamTrack] = None self._instructions = "" self._tools = [] self._connected = False def set_agent(self, agent: Agent) -> None: """Set agent instructions and tools""" self._instructions = agent.instructions self._tools = agent.tools async def connect(self) -> None: """Initialize connection to your AI service""" # Your connection logic here self._connected = True print(f"Connected to {self.model_name}") async def handle_audio_input(self, audio_data: bytes) -> None: """Process incoming audio from user""" if not self._connected: return # Process audio and generate response # Your audio processing logic here # Emit user speech detection self.emit("user_speech_started", {"type": "detected"}) # Generate and play response audio if self.audio_track: response_audio = b"your_generated_audio_bytes" await self.audio_track.add_new_bytes(response_audio) async def send_message(self, message: str) -> None: """Send text message to model""" # Your text processing logic here pass async def interrupt(self) -> None: """Interrupt current response""" if self.audio_track: 
            self.audio_track.interrupt()

    async def aclose(self) -> None:
        """Cleanup resources"""
        self._connected = False
        if self.audio_track:
            await self.audio_track.cleanup()
```

## Comparison with Cascading Pipeline

The key architectural difference is that `RealTimePipeline` uses integrated models that handle the entire speech-to-speech pipeline internally, while cascading pipelines coordinate separate STT, LLM, and TTS components.

| Feature | Realtime Pipeline | Cascading Pipeline |
| :-------------- | :--- | :--- |
| **Latency** | Significantly lower latency, ideal for highly interactive, real-time conversations. | Higher latency due to coordinating separate STT, LLM, and TTS components. |
| **Control** | Less granular control; tools are handled directly by the integrated model. | Granular control over each step (STT, LLM, TTS), allowing for more complex logic. |
| **Flexibility** | Limited to the capabilities of the single, chosen real-time model. | Allows mixing and matching different providers for each component (e.g., Google STT, OpenAI LLM). |
| **Complexity** | Simpler to configure as it involves a single, unified model. | More complex to set up due to the coordination of multiple separate components. |
| **Cost** | Varies depending on the chosen real-time model and usage patterns. | Varies depending on the combination of providers and usage for each component. |

## Examples - Try Out Yourself

We have examples to get you started. Try them out, talk to the agent, and customize them to fit your needs.
}, { title: "OpenAI", description: "Implement with OpenAI", link: "https://github.com/videosdk-live/agents-quickstart/tree/main/OpenAI", icon: }, { title: "Google Gemini (LiveAPI)", description: "Implement with Google Gemini (LiveAPI)", link: "https://github.com/videosdk-live/agents-quickstart/tree/main/Google%20Gemini%20(LiveAPI)", icon: }, { title: "AWS Nova Sonic", description: "Implement with AWS Nova Sonic", link: "https://github.com/videosdk-live/agents-quickstart/tree/main/AWS%20Nova%20Sonic", icon: } ]} /> --- --- id: recording title: Recording hide_title: false hide_table_of_contents: false description: "Learn how to enable the recording functionality with VideoSDK AI Agents for agent sessions and user interactions." pagination_label: "Recording" keywords: - Agent Recording - AI Agents - Recording - AI Agent Oversight - Traces - Playback - VideoSDK Agents - MCP Server - Python SDK - Audio Store - Autoscroll Transcript - Timestamped Playback image: img/videosdklive-thumbnail.jpg sidebar_position: 11 sidebar_label: Recording slug: recording --- import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon, APIIcon } from '@site/src/components/agent/cards'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Recording Recording capabilities in VideoSDK Agents allow you to capture and store meeting conversations, enabling features like conversation analysis, compliance documentation, and quality assurance. VideoSDK provides three distinct recording approaches, each suited for different use cases and requirements. ## Recording Types Overview VideoSDK offers three types of recording functionality: 1. **Participant Recording** - Built-in automatic recording managed by the agent framework 2. **Track Recording** - Individual audio/video track recording with granular control 3. **Meeting Recording** - Complete meeting session recording with composite output ## 1. 
Participant Recording (Built-in) Participant recording is the simplest approach, automatically managed by the VideoSDK Agents framework when you enable the `recording` parameter. ### How It Works When recording=True is set in RoomOptions, the system automatically: - Starts recording when the agent joins the meeting. - Starts recording for each participant as they join. - Stops and merges recordings when the session ends. ### Basic Setup ```python title="main.py" from videosdk.agents import JobContext, RoomOptions def make_context(): return JobContext( room_options=RoomOptions( room_id="your-room-id", auth_token="your-auth-token", name="Recording Agent", #highlight-start recording=True # Enable automatic participant recording #highlight-end ) ) ``` ## 2. Track Recording Track recording provides granular control over individual audio and video tracks, allowing you to record specific streams with custom configurations. ### When to Use Track Recording - Need to record specific audio/video tracks separately - Require custom recording configurations per track - Want to control recording start/stop timing manually - Need different quality settings for different tracks ### Key Features - **Individual Control**: Start/stop recording for specific tracks - **Custom Configuration**: Set different recording parameters per track - **Flexible Output**: Choose output formats and quality settings - **Manual Management**: Full control over recording lifecycle ### API References for Track Recording }, { title: "Stop Track Recording", description: "This API lets you stop recording of track of participant of your room by passing roomId, participantId and kind as a body parameter.", link: "https://docs.videosdk.live/api-reference/realtime-communication/stop-track-recording", icon: }, { title: "Fetch a Track Recording", description: "This API lets you fetch a particular track recording info by passing trackRecordingId as parameter.", link: 
"https://docs.videosdk.live/api-reference/realtime-communication/fetch-a-track-recording", icon: }, { title: "Fetch All Track Recordings", description: "This API lets you fetch details of your track recording by passing roomId, sessionId and participantId as query parameters.", link: "https://docs.videosdk.live/api-reference/realtime-communication/fetch-all-track-recordings", icon: }, { title: "Delete A Track Recording", description: "This API lets you delete a particular track recording by passing trackRecordingId as parameter.", link: "https://docs.videosdk.live/api-reference/realtime-communication/delete-track-recording", icon: } ]} /> ## 3. Meeting Recording Meeting recording captures the entire meeting session as a single composite recording, including all participants and their interactions. ### When to Use Meeting Recording - Need a single recording file for the entire meeting - Want automatic mixing of all audio/video streams - Require meeting-level recording controls - Need simplified post-processing workflow ### Key Features - **Composite Output**: Single recording file with all participants - **Automatic Mixing**: Audio/video streams automatically combined - **Meeting-level Control**: Start/stop recording for entire meeting - **Simplified Management**: One recording per meeting session ### API References for Meeting Recording }, { title: "Stop Recording", description: "This API lets you stop recording of your room by passing roomId as a body parameter.", link: "https://docs.videosdk.live/api-reference/realtime-communication/stop-recording", icon: }, { title: "Fetch Recordings", description: "This API lets you fetch details of your recording by passing roomId and sessionId as query parameters.", link: "https://docs.videosdk.live/api-reference/realtime-communication/fetch-recordings", icon: }, { title: "List all Recordings", description: "This API lets you fetch a particular recording info by passing recording Id as parameter.", link: 
"https://docs.videosdk.live/api-reference/realtime-communication/fetch-recording-using-recordingId", icon: }, { title: "Delete a Recording", description: "This API lets you delete a particular recording by passing recording Id as parameter.", link: "https://docs.videosdk.live/api-reference/realtime-communication/delete-recording", icon: } ]} /> ## Choosing the Right Recording Type | Use Case | Recommended Type | Reason | | :--- | :--- | :--- | | Agent conversations with automatic management | Participant Recording | Built-in automation and channel separation | | Custom recording workflows | Track Recording | Granular control over individual streams | | Simple meeting archival | Meeting Recording | Single composite file for entire meeting | | Compliance and audit trails | Participant Recording | Automatic lifecycle management | | Advanced post-processing | Track Recording | Individual track access and control | ## Best Practices ### Recording Management - Choose the appropriate recording type based on your use case - Ensure proper authentication tokens for recording API access - Monitor recording status and handle errors gracefully - Plan for adequate storage capacity ### Privacy and Compliance - Inform participants that sessions are being recorded - Implement proper data retention and deletion policies - Ensure compliance with local privacy regulations - Use appropriate recording type for your compliance requirements --- --- id: room-options title: RoomOptions hide_title: false hide_table_of_contents: false description: "Learn how to configure RoomOptions for VideoSDK AI Agents to customize meeting connection, agent behavior, and session management." 
pagination_label: "RoomOptions" keywords: - RoomOptions - AI Agents Configuration - Meeting Connection - Agent Settings - VideoSDK Agents - Session Management - Python SDK - Agent Identity - Playground Mode image: img/videosdklive-thumbnail.jpg sidebar_position: 12 sidebar_label: RoomOptions slug: room-options --- # RoomOptions `RoomOptions` is a configuration class that defines how an AI agent connects to and behaves within a VideoSDK meeting room. It serves as the primary interface for customizing agent behavior, meeting connection parameters, and session management settings. ## Introduction The `RoomOptions` class is the central configuration point for VideoSDK AI agents, providing comprehensive control over how agents join meetings, interact with participants, and manage their sessions. This configuration is passed to the `JobContext` during agent initialization and influences all aspects of the agent's behavior within the meeting environment. ## Core Features - **Meeting Connection**: Configure room ID and authentication for VideoSDK meetings - **Agent Identity**: Set display name and visual representation - **Session Management**: Control automatic session termination and timeouts - **Media Capabilities**: Enable vision processing and meeting recording - **Development Tools**: Playground mode for testing and development - **Error Handling**: Custom error handling callbacks - **Avatar Integration**: Support for virtual avatars ## Basic Example ```python title="main.py" from videosdk.agents import RoomOptions, JobContext # Basic configuration room_options = RoomOptions( room_id="your-meeting-id", name="My AI Agent", playground=True ) # Create job context context = JobContext(room_options=room_options) ``` ## Parameters Parameters that you can pass with `RoomOptions`: | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `room_id` | Optional[str] | None | Unique identifier for the VideoSDK meeting | | `auth_token` | Optional[str] | None | 
VideoSDK authentication token | | `name` | Optional[str] | "Agent" | Display name of the agent in the meeting | | `playground` | bool | True | Enable playground mode for easy testing | | `vision` | bool | False | Enable video processing capabilities | | `recording` | bool | False | Enable meeting recording | | `avatar` | Optional[Any] | None | Virtual avatar for visual representation | | `join_meeting` | Optional[bool] | True | Whether agent should join the meeting | | `on_room_error` | Optional[Callable] | None | Error handling callback function | | `auto_end_session` | bool | True | Automatically end session when participants leave | | `session_timeout_seconds` | Optional[int] | 30 | Timeout for automatic session termination | | `signaling_base_url` | Optional[str] | None | Custom VideoSDK signaling server URL | ## Additional Resources import { AgentCardGrid, GithubIcon, RobotIcon, DocumentIcon, PlayIcon, CodeIcon, ExternalLinkIcon, SettingsIcon, EyeIcon, RecordingIcon, NetworkIcon, MCPServerIcon, } from '@site/src/components/agent/cards'; }, { title: "Vision Integration", description: "Enable agents to receive and process video input from the meeting", link: "https://docs.videosdk.live/ai_agents/core-components/vision-and-multi-modality", icon: }, { title: "Recording Capabilities", description: "Record agent sessions for analysis and quality assurance", link: "https://docs.videosdk.live/ai_agents/core-components/recording", icon: }, { title: "Avatar", description: "Use Avatar for visually engaging AI Voice Agent", link: "https://docs.videosdk.live/ai_agents/core-components/avatar", icon: } ]} /> --- --- title: Speech Handle hide_title: false hide_table_of_contents: false description: "Learn about Speech Handle in the VideoSDK AI Agent SDK. Understand how to control agent speech at both session and utterance levels, manage interruptions, and coordinate sequential speech playback." 
pagination_label: "Speech Handle" keywords: - Speech Handle - Agent Speech - Utterance Handle - Session Control - Interruption Handling - Async Await - AgentSession - TTS - VideoSDK Agents - Python SDK - AI Agents image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Speech Handle slug: speech-handle --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards'; import { LanguageTable } from '@site/src/components/agent'; # Speech Handle Speech control in VideoSDK agents operates through two complementary layers: **session-level** methods for initiating speech and **utterance-level** handles for managing speech lifecycle. This document covers both aspects of controlling agent speech output. ## Session-Level Speech Control The `AgentSession` provides three primary methods for controlling agent speech output **1. Say** `say(message: str, interruptible: bool = True)`: Sends a direct message from the agent to meeting participants with interruption control. **Parameters:** - `message`: The message to be spoken. - `interruptible`: When `True`, the agent’s speech can be interrupted. When `False`, the agent will continue speaking until the message is fully delivered. Default is `True`. ```python # Basic usage # highlight-start await session.say("Critical update!", interruptible=False) # highlight-end # In agent lifecycle hooks class MyAgent(Agent): async def on_enter(self): # highlight-start await self.session.say("Welcome to the meeting!") # highlight-end ``` **2. Reply** `reply(instructions: str, wait_for_playback: bool = True, interruptible: bool = True)`: Generates agent responses dynamically using custom instructions with interruption control. 
**Parameters:** - `instructions`: Custom instructions for generating the response - `wait_for_playback`: When `True`, prevents user interruptions until playback completes - `interruptible`: When `True`, the agent’s response can be interrupted. When `False`, the agent will continue speaking without interruption. Default is `True`. ```python # Generate immediate response # highlight-start await session.reply(instructions="Please summarize the conversation so far", interruptible=False) # highlight-end # Wait for complete playback before allowing new inputs # highlight-start await session.reply( instructions="Explain the next steps", wait_for_playback=True ) # highlight-end # Practical example in function tools class MyAgent(Agent): @function_tool async def get_summary(self) -> str: #highlight-start await self.session.reply( instructions="Based on our conversation, let me provide a summary..." ) #highlight-end return "Summary generated" ``` **3. Interrupt** `interrupt()`: Immediately stops the agent's current speech operation. ```python # Emergency stop during agent response # highlight-start session.interrupt() # highlight-end # User interruption handling class InteractiveAgent(Agent): async def handle_user_input(self, user_input: str): if "stop" in user_input.lower(): #highlight-start self.session.interrupt() #highlight-end await self.session.reply(instructions="How can I help you instead?") @function_tool async def emergency_stop(self) -> str: """Stop current agent operation immediately""" # highlight-start self.session.interrupt() # highlight-end return "Agent stopped successfully" ``` ## Utterance-Level Management `UtteranceHandle` manages individual agent utterances, preventing overlapping speech and enabling graceful interruption handling. ### Core Concepts - **Lifecycle Management** - Each `UtteranceHandle` tracks a single utterance from creation through completion. - **Completion States** An utterance can complete in two ways: 1. 
**Natural Completion:** The TTS finishes playing the audio 2. **User Interruption:** The user starts speaking during playback - **Awaitable Pattern** - The handle supports Python's async/await syntax for sequential speech control. ### API Reference | Property/Method | Return Type | Description | |------------------|--------------|--------------| | `id` | `str` | Unique identifier for the utterance | | `done()` | `bool` | Returns `True` if utterance is complete | | `interrupted` | `bool` | Returns `True` if user interrupted | | `interrupt()` | `None` | Manually marks utterance as interrupted | | `__await__()` | `Generator` | Enables awaiting the handle | ### Usage Patterns - **Sequential Speech** To prevent overlapping TTS, await each handle before starting the next utterance: ```python # Correct approach handle1 = self.session.say(f"The current temperature is {temperature}°C.") await handle1 # Wait for first utterance to complete handle2 = self.session.say("Do you live in this city?") await handle2 # Wait for second utterance to complete ``` - **Checking Interruption Status** Access the current utterance handle via `self.session.current_utterance` to detect interruptions: ```python utterance: UtteranceHandle | None = self.session.current_utterance # In long-running operations, check periodically for i in range(10): if utterance and utterance.interrupted: logger.info("Task was interrupted by the user.") return "The task was cancelled because you interrupted me." 
await asyncio.sleep(1) ``` ### Best Practices - **Sequential Speech:** Always await handles when you need sequential speech to prevent audio overlap - **Interruption Handling:** Check `interrupted` status in long-running operations to enable graceful cancellation - **Handle References:** Store handle references if you need to check status later in your function - **Avoid Concurrent Tasks:** Don't use `create_task()` for speech that should play sequentially ### Common Use Cases - **Multi-part responses:** When function tools need to speak multiple sentences in sequence - **Long-running operations:** Tasks that should be cancellable when users interrupt - **Conversational flows:** Scenarios requiring precise timing between utterances ## Example - Try It Yourself } ]} /> ## FAQs
**Troubleshooting**

| Issue | Solution |
|--------|-----------|
| Overlapping speech | Use `await` on handles instead of `create_task()` |
| Tasks not cancelling on interruption | Check `utterance.interrupted` in loops |
| Handle is None | Only available during function tool execution via `session.current_utterance` |

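
The "check `utterance.interrupted` in loops" advice, together with the "Handle is None" case, can be sketched without the SDK. `FakeUtterance` and `long_task` below are hypothetical stand-ins written for illustration; only the `interrupted` flag mirrors the handle described on this page, and the timings are arbitrary.

```python
import asyncio


class FakeUtterance:
    """Hypothetical stand-in for a handle; exposes only the `interrupted` flag."""

    def __init__(self) -> None:
        self.interrupted = False

    def interrupt(self) -> None:
        self.interrupted = True


async def long_task(utterance, steps: int = 10, step_seconds: float = 0.01) -> str:
    """Simulated long-running tool that polls for interruption before each step."""
    for i in range(steps):
        # Guard against a missing handle before reading the flag
        if utterance is not None and utterance.interrupted:
            return f"cancelled after {i} steps"
        await asyncio.sleep(step_seconds)  # placeholder for real work
    return "completed"


async def demo():
    utterance = FakeUtterance()
    task = asyncio.create_task(long_task(utterance))
    await asyncio.sleep(0.025)   # let a few steps run
    utterance.interrupt()        # simulate the user speaking over the agent
    interrupted_result = await task

    normal_result = await long_task(FakeUtterance(), steps=3)
    none_result = await long_task(None, steps=2)  # no handle available
    return interrupted_result, normal_result, none_result


results = asyncio.run(demo())
print(results)
```

Polling at the top of each iteration keeps the cancellation latency bounded by one step, which is why shorter steps make the agent feel more responsive to interruptions.
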
**Correct Usage Pattern**

#### ✅ Correct: Sequential Speech

Await each handle to prevent overlapping TTS.

```python
handle1 = session.say("First")
await handle1
handle2 = session.say("Second")
await handle2
```

---

#### ❌ Incorrect: Concurrent Speech

Using `create_task()` causes audio overlap.

```python
asyncio.create_task(session.say("First"))
asyncio.create_task(session.say("Second"))
```
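
The difference between the two patterns can be reproduced with a small simulation. This is a sketch under stated assumptions: `FakeSession` is a hypothetical stand-in, not the SDK's `AgentSession`, and its `say()` is a plain coroutine that sleeps to imitate playback time.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in that logs when each utterance starts and ends."""

    def __init__(self) -> None:
        self.events = []

    async def say(self, text: str, duration: float = 0.01) -> None:
        self.events.append(("start", text))
        await asyncio.sleep(duration)  # simulated playback time
        self.events.append(("end", text))


async def sequential(session: FakeSession) -> None:
    # Each utterance finishes before the next one starts
    await session.say("First")
    await session.say("Second")


async def concurrent(session: FakeSession) -> None:
    # Both utterances are scheduled at once, so playback overlaps
    t1 = asyncio.create_task(session.say("First"))
    t2 = asyncio.create_task(session.say("Second"))
    await asyncio.gather(t1, t2)


def overlaps(events) -> bool:
    """Return True if an utterance starts while another is still playing."""
    active = 0
    for kind, _ in events:
        active += 1 if kind == "start" else -1
        if active > 1:
            return True
    return False


seq_session = FakeSession()
asyncio.run(sequential(seq_session))

con_session = FakeSession()
asyncio.run(concurrent(con_session))

print("sequential overlaps:", overlaps(seq_session.events))
print("concurrent overlaps:", overlaps(con_session.events))
```

In the sequential run every `start` is followed by its own `end`; in the concurrent run the second `start` is logged before the first `end`, which is the overlap the FAQ warns about.
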
--- --- title: Testing and Evaluation hide_title: false hide_table_of_contents: false description: "Learn how to test and evaluate your AI agents using the VideoSDK Agent SDK. Measure latency, accuracy, and reasoning capabilities." pagination_label: "Testing and Evaluation" keywords: - Testing - Evaluation - Metrics - Latency - LLM Judge - AI Agent SDK - VideoSDK Agents image: img/videosdklive-thumbnail.jpg sidebar_position: 7 sidebar_label: Testing and Evaluation slug: testing-and-evaluation --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import { AgentCardGrid, GithubIcon } from '@site/src/components/agent/cards'; # Testing and Evaluation The VideoSDK Agent SDK provides a structured evaluation framework that allows you to run controlled tests on individual agent components: Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS) and collect performance metrics such as latency, accuracy, and stability. ## Evaluation Components To test your agent, use the `Evaluation` class. This allows you to define different scenarios (called "turns") and run them to see how your agent performs. Key components include: - **`Evaluation`**: Runs all your test scenarios. - **`EvalTurn`**: Represents a single conversational turn, one complete exchange where the user gives input and the agent processes it to provide a response. - **`EvalMetric`**: Measurements like `STT_LATENCY`, `LLM_LATENCY`, etc. - **`LLMAsJudge`**: Uses an LLM to "judge" the quality of your agent's response. These are the criteria the Judge can use to evaluate the agent: | Metric | Description | | :--- | :--- | | **REASONING** | Explains *why* the agent responded in a certain way. Useful for debugging logic. | | **RELEVANCE** | Checks if the response actually answers the user's question. | | **CLARITY** | Checks if the response is easy to understand. | | **SCORE** | Gives a numerical rating (0-10) for the quality of the response. 
|

## Implementation

The following steps explain how to set up a test for your agent.

### 1. Import Libraries

First, import the necessary modules from the SDK.

```python
import logging
import aiohttp
from videosdk.agents import (
    Evaluation, EvalTurn, EvalMetric, LLMAsJudgeMetric, LLMAsJudge,
    STTEvalConfig, LLMEvalConfig, TTSEvalConfig,
    STTComponent, LLMComponent, TTSComponent,
    function_tool
)

# Set up logging to see the output
logging.basicConfig(level=logging.INFO)
```

### 2. Define Tools

If your agent uses tools (like checking the weather), you need to define them here so the evaluation can use them.

```python
@function_tool
async def get_weather(
    latitude: str,
    longitude: str,
):
    """
    Called when the user asks about the weather. Returns the weather for the given location.

    Args:
        latitude: The latitude of the location
        longitude: The longitude of the location
    """
    print("### Getting weather for", latitude, longitude)
    url = f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m"
    weather_data = {}
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                if response.status == 200:
                    data = await response.json()
                    print("Weather data", data)
                    weather_data = {
                        "temperature": data["current"]["temperature_2m"],
                        "temperature_unit": "Celsius",
                    }
                else:
                    print(f"Failed to get weather data, status code: {response.status}")
                    raise Exception(f"Failed to get weather data, status code: {response.status}")
    except Exception as e:
        print(f"Exception in get_weather tool: {e}")
        # Re-raise with the original traceback intact
        raise
    return weather_data
```

### 3. Setup Evaluation

Create an `Evaluation` instance. You can specify which metrics you want to track.
```python eval = Evaluation( name="basic-agent-eval", include_context=False, metrics=[ EvalMetric.STT_LATENCY, EvalMetric.LLM_LATENCY, EvalMetric.TTS_LATENCY, EvalMetric.END_TO_END_LATENCY ], output_dir="./reports" ) ``` **Parameters:** | Parameter | Type | Description | | :--- | :--- | :--- | | `name` | `str` | Name of the evaluation suite. | | `include_context` | `bool` | Whether to include conversation context. | | `metrics` | `list` | List of metrics to calculate (e.g., `EvalMetric.STT_LATENCY`). | | `output_dir` | `str` | Directory to save the evaluation reports. | ### 4. Add Test Scenarios (Turns) Add "turns" to your evaluation. A turn simulates a single complete interaction loop (Input -> Processing -> Response) between the user and the agent. You can mix and match mock inputs (text) and real inputs (audio files). #### Scenario 1: Complex Interaction Here, we test the full pipeline: 1. **STT**: Transcribes an audio file (`sample.wav`). 2. **LLM**: Receives a mock text input (overriding the STT output for this test) and uses the `get_weather` tool. 3. **TTS**: Generates speech from a mock text string. 4. **Judge**: An LLM reviews the answer to see if it is relevant. :::warning Note Only `.wav` files are supported for STT evaluation. Please ensure your audio files are in this format. 
::: ```python eval.add_turn( EvalTurn( stt=STTComponent.deepgram( STTEvalConfig(file_path="./sample.wav") ), llm=LLMComponent.google( LLMEvalConfig( model="gemini-2.5-flash-lite", use_stt_output=False, mock_input="write one paragraph about Water and get weather of Delhi", tools=[get_weather] ) ), tts=TTSComponent.google( TTSEvalConfig( model="en-US-Standard-A", use_llm_output=False, mock_input="Peter Piper picked a peck of pickled peppers" ) ), judge=LLMAsJudge.google( model="gemini-2.5-flash-lite", prompt="Can you evaluate the agent's response based on the following criteria: Is it relevant to the user input?", checks=[LLMAsJudgeMetric.REASONING, LLMAsJudgeMetric.SCORE] ) ) ) ``` **Configuration Parameters:** | Parameter | Type | Description | | :--- | :--- | :--- | | `file_path` | `str` | Path to the audio file. **Note:** Only `.wav` files are supported. | | Parameter | Type | Description | | :--- | :--- | :--- | | `model` | `str` | The LLM model to use (e.g., `gemini-2.5-flash-lite`). | | `use_stt_output` | `bool` | If `True`, uses the output from the STT stage as input. | | `mock_input` | `str` | Text input to use if `use_stt_output` is `False`. | | `tools` | `list` | List of tool functions available to the LLM. | | Parameter | Type | Description | | :--- | :--- | :--- | | `model` | `str` | The TTS model to use. | | `use_llm_output` | `bool` | If `True`, uses the output from the LLM stage as input. | | `mock_input` | `str` | Text input to use if `use_llm_output` is `False`. | | Parameter | Type | Description | | :--- | :--- | :--- | | `model` | `str` | The LLM model to use for judging. | | `prompt` | `str` | The prompt/criteria for the judge. | | `checks` | `list` | List of metrics to check (e.g., `LLMAsJudgeMetric.REASONING`, `LLMAsJudgeMetric.SCORE`). | #### Scenario 2: End-to-End Flow This scenario uses the output from one step as the input for the next. The STT output is fed into the LLM, and the LLM output is fed into the TTS. 
```python eval.add_turn( EvalTurn( stt=STTComponent.deepgram( STTEvalConfig(file_path="./Sports.wav") ), llm=LLMComponent.google( LLMEvalConfig( model="gemini-2.5-flash-lite", use_stt_output=True, # Use the text from STT ) ), tts=TTSComponent.google( TTSEvalConfig( model="en-US-Standard-A", use_llm_output=True # Use the text from LLM ) ), judge=LLMAsJudge.google( model="gemini-2.5-flash-lite", prompt="Is the response relevant?", checks=[LLMAsJudgeMetric.REASONING, LLMAsJudgeMetric.SCORE] ) ) ) ``` #### Scenario 3: Individual Component Testing You can also test components in isolation. ```python eval.add_turn( EvalTurn( stt=STTComponent.deepgram( STTEvalConfig(file_path="./Sports.wav") ) ) ) ``` ```python eval.add_turn( EvalTurn( llm=LLMComponent.google( LLMEvalConfig( model="gemini-2.5-flash-lite", use_stt_output=False, mock_input="write one paragraph about trees", ) ), ) ) ``` ```python eval.add_turn( EvalTurn( tts=TTSComponent.google( TTSEvalConfig( model="en-US-Standard-A", use_llm_output=False, mock_input="A big black bug bit a big black bear, made the big black bear bleed blood." ) ) ) ) ``` ### 5. Run and Save Results Finally, run the evaluation and save the report. The report will be saved to the `output_dir`. ```python results = eval.run() results.save() ``` ## Examples - Try It Out Yourself } ]} columns={2} /> --- --- title: Turn Detection & Voice Activity Detection (VAD) hide_title: false hide_table_of_contents: false description: "Learn about Turn Detection in the VideoSDK AI Agent SDK. Understand Voice Activity Detection (VAD), End-of-Utterance (EOU) detection, and how to implement natural conversation flow in your AI agents." 
pagination_label: "Turn Detection & VAD" keywords: - Turn Detection - Voice Activity Detection - VAD - End of Utterance - EOU - Conversation Flow - SileroVAD - TurnDetector - AI Agent SDK - VideoSDK Agents - Speech Processing - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Turn Detection and VAD slug: turn-detection-and-vad --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards'; import { LanguageTable } from '@site/src/components/agent'; # Turn Detection and Voice Activity Detection In conversational AI, timing is everything. Traditional voice agents rely on simple silence-based timers (Voice Activity Detection or VAD) to guess when a user has finished speaking. This often leads to awkward interruptions or unnatural pauses. To solve this, VideoSDK created **Namo-v1**: an open-source, high-performance turn-detection model that understands the _meaning_ of the conversation, not just the silence. ![Namo Turn Detection](https://strapi.videosdk.live/uploads/namo_v1_turn_detection_12e042c6ec.png) ## From Silence Detection to Speech Understanding Namo shifts from basic audio analysis to sophisticated Natural Language Understanding (NLU), allowing your agent to know when a user is truly finished speaking versus just pausing to think. | Traditional VAD (Silence-Based) | Namo Turn Detector (Semantic-Based) | | :---------------------------------------------- | :------------------------------------------------------- | | **Listens for silence.** | **Understands words and context.** | | Relies on a fixed timer (e.g., 800ms). | Uses a transformer model to predict intent. | | Often interrupts or lags. | Knows when to wait and when to respond instantly. | | Struggles with natural pauses and filler words. | Distinguishes between a brief pause and a true endpoint. 
| This semantic understanding enables AI agents to respond quicker and more naturally, creating a fluid, human-like conversational experience. :::tip Learn More For a deep dive into Namo's architecture, performance benchmarks, and how to use it as a standalone model, check out the dedicated [**Namo Turn Detector plugin page**](/ai_agents/plugins/namo-turn-detector). ::: ## Implementation For the most robust setup, you can use VAD and Namo together. VAD acts as a basic speech detector, while Namo intelligently decides if the turn is over. ### 1. Voice Activity Detection (VAD) First, configure VAD to detect the presence of speech. This helps manage interruptions and acts as a first-pass filter. ```python from videosdk.plugins.silero import SileroVAD # Configure VAD to detect speech activity vad = SileroVAD( threshold=0.5, # Sensitivity to speech (0.3-0.8) min_speech_duration=0.1, # Ignore very brief sounds min_silence_duration=0.75 # Wait time before considering speech ended ) ``` ### 2. Namo Turn Detection Next, add the `NamoTurnDetectorV1` plugin to analyze the content of the speech and predict the user's intent. #### Multilingual Model If your agent needs to support multiple languages, use the default multilingual model. It's a single, powerful model that works across more than 20 languages. ```python from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model # Pre-download the multilingual model to avoid runtime delays pre_download_namo_turn_v1_model() # Initialize the multilingual Turn Detector turn_detector = NamoTurnDetectorV1( threshold=0.7 # Confidence level for triggering a response ) ``` The table below lists all supported languages with their performance metrics and language codes. #### Language-Specific Models For maximum performance and accuracy in a single language, use a specialized model. These models are faster and have a smaller memory footprint. 
```python
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Pre-download a specific language model (e.g., German)
pre_download_namo_turn_v1_model(language="de")

# Initialize the Turn Detector for German
turn_detector = NamoTurnDetectorV1(
    language="de",
    threshold=0.7
)
```

| Language | Code | Model | Accuracy |
| :--- | :--- | :--- | :--- |
| Korean | | Namo-v1-Korean | 97.3% |
| 🇹🇷 Turkish | `tr` | Namo-v1-Turkish | 96.8% |
| 🇯🇵 Japanese | `ja` | Namo-v1-Japanese | 93.5% |
| 🇮🇳 Hindi | `hi` | Namo-v1-Hindi | 93.1% |
| 🇩🇪 German | `de` | Namo-v1-German | 91.9% |
| 🇬🇧 English | `en` | Namo-v1-English | 91.5% |
| 🇳🇱 Dutch | `nl` | Namo-v1-Dutch | 90.0% |
| 🇮🇳 Marathi | `mr` | Namo-v1-Marathi | 89.7% |
| 🇨🇳 Chinese | `zh` | Namo-v1-Chinese | 88.8% |
| 🇵🇱 Polish | `pl` | Namo-v1-Polish | 87.8% |
| 🇳🇴 Norwegian | `no` | Namo-v1-Norwegian | 87.3% |
| 🇮🇩 Indonesian | `id` | Namo-v1-Indonesian | 87.1% |
| 🇵🇹 Portuguese | `pt` | Namo-v1-Portuguese | 86.9% |
| 🇮🇹 Italian | `it` | Namo-v1-Italian | 86.8% |
| 🇪🇸 Spanish | `es` | Namo-v1-Spanish | 86.7% |
| 🇩🇰 Danish | `da` | Namo-v1-Danish | 86.5% |
| 🇧🇩 Bengali | `bn` | Namo-v1-Bengali | 79.2% |
| 🇸🇦 Arabic | `ar` | Namo-v1-Arabic | 79.7% |
| 🇫🇮 Finnish | `fi` | Namo-v1-Finnish | 84.8% |
| 🇫🇷 French | `fr` | Namo-v1-French | 85.0% |
| 🇺🇦 Ukrainian | `uk` | Namo-v1-Ukrainian | 86.2% |
| 🇻🇳 Vietnamese | `vi` | Namo-v1-Vietnamese | 82.37% |
| 🇷🇺 Russian | `ru` | Namo-v1-Russian | 84.1% |

:::note
To see all available models for different languages, along with their benchmarks and accuracy, please visit our [Hugging Face models page](https://huggingface.co/videosdk-live/models).
:::

### 3. Adaptive End-of-Utterance (EOU) Handling

The **Adaptive EOU** mode dynamically adjusts the speech-wait timeout based on the confidence scores. This ensures that the agent waits longer when the user is hesitant and responds faster when the user's intent is clear, creating a more natural conversational flow.

You can configure this by setting the `eou_config` in your pipeline options:

```python
pipeline = CascadingPipeline(
    # ... other config
    eou_config=EOUConfig(
        mode='ADAPTIVE',  # or 'DEFAULT'
        min_max_speech_wait_timeout=[0.5, 0.8]  # Min 0.5s, Max 0.8s wait
    )
)
```

#### Configuration Parameters

| Parameter | Type | Description |
| :--- | :--- | :--- |
| `mode` | `str` | • **DEFAULT**: Uses a fixed timeout value.<br/>• **ADAPTIVE**: Dynamically adjusts timeout based on confidence scores. |
| `min_max_speech_wait_timeout` | `list[float]` | Defines the minimum and maximum wait time (in seconds). |

##### Example

| User Input | Agent Reaction | Wait Time | Example |
|------------|----------------|-----------|---------|
| **Mode = DEFAULT**<br/>Speaks clearly | Responds immediately | ~0.5s | `“Book a meeting for tomorrow at 10.”` |
| **Mode = DEFAULT**<br/>Pauses or hesitates mid-sentence | Waits slightly longer | ~0.8s | `“Book a meeting for… um… tomorrow…”` |
| **Mode = ADAPTIVE** | Adjusts based on speech clarity | Scaled between min/max | `“Remind me to call… uh… John later.”` |

### 4. Interruption Detection (VAD + STT)

Interruption Detection controls when the system should treat user speech as an intentional interruption. It evaluates both voice activity and recognized speech content to avoid triggering interruptions from short noises, filler words, or background audio. The agent only stops or responds when the user clearly intends to speak.

#### Configuration Example (HYBRID mode)

```python
pipeline = CascadingPipeline(
    # ... other config
    interrupt_config=InterruptConfig(
        mode="HYBRID",
        interrupt_min_duration=0.2,  # 200ms of continuous speech
        interrupt_min_words=2,  # At least 2 words recognized
    )
)
```

#### VAD_ONLY mode

```python
pipeline = CascadingPipeline(
    # ... other config
    interrupt_config=InterruptConfig(
        mode="VAD_ONLY",
        interrupt_min_duration=0.2,  # 200ms of continuous speech
    )
)
```

#### STT_ONLY mode

```python
pipeline = CascadingPipeline(
    # ... other config
    interrupt_config=InterruptConfig(
        mode="STT_ONLY",
        interrupt_min_words=2,  # At least 2 words recognized
    )
)
```

#### Configuration Parameters

| Parameter | Type | Description |
| :--- | :--- | :--- |
| `mode` | `str` | • **HYBRID**: Combines VAD and STT. Requires both audio detection and recognized words to trigger an interruption.<br/>• **VAD_ONLY**: Uses only raw speech activity detection. Faster but may be triggered by background noise.<br/>• **STT_ONLY**: Relies only on recognized words from the transcript. Slower but ensures speech is intelligible. |
| `interrupt_min_duration` | `float` | Minimum duration (in seconds) of continuous speech required to trigger interruption. |
| `interrupt_min_words` | `int` | Minimum number of words that must be recognized (used in `HYBRID` and `STT_ONLY` modes). |

### 5. False-Interruption Recovery

The **False-Interruption Recovery** feature detects accidental or brief user noises and allows the agent to automatically resume speaking when interruptions are not genuine.

#### Configuration Example

```python
pipeline = CascadingPipeline(
    # ... other config
    interrupt_config=InterruptConfig(
        false_interrupt_pause_duration=2.0,  # Wait 2 seconds to confirm interruption
        resume_on_false_interrupt=True,  # Auto-resume if interruption is brief
    )
)
```

#### Configuration Parameters

| Parameter | Type | Description |
| :--- | :--- | :--- |
| `false_interrupt_pause_duration` | `float` | Duration (in seconds) to wait after detecting an interruption before considering it false. If the user doesn't continue speaking within this time, the interruption is considered accidental and the agent resumes. |
| `resume_on_false_interrupt` | `bool` | If `True`, the agent will automatically resume speaking after detecting a false interruption. If `False`, the agent will remain paused even after brief interruptions. |

## Pipeline Integration

Combine VAD and Namo in your `CascadingPipeline` to bring it all together.
```python
from videosdk.agents import CascadingPipeline
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Pre-download the model you intend to use
pre_download_namo_turn_v1_model(language="en")

pipeline = CascadingPipeline(
    stt=your_stt_provider,
    llm=your_llm_provider,
    tts=your_tts_provider,
    # highlight-start
    vad=SileroVAD(threshold=0.5),
    turn_detector=NamoTurnDetectorV1(language="en", threshold=0.7)
    # highlight-end
)
```

:::tip
The `RealTimePipeline` for providers like OpenAI includes built-in turn detection, so external VAD and Turn Detector components are not required.
:::

## Example Implementation

Here’s a complete example showing Namo in a conversational agent.

```python title="main.py"
from videosdk.agents import Agent, CascadingPipeline, AgentSession
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model
from your_providers import your_stt_provider, your_llm_provider, your_tts_provider

class ConversationalAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant that waits for users to finish speaking before responding."
        )

    async def on_enter(self):
        await self.session.say("Hello! I'm listening and will respond when you're ready.")

# 1. Pre-download the model to ensure fast startup
pre_download_namo_turn_v1_model(language="en")

# 2. Set up the pipeline with Namo for intelligent turn detection
pipeline = CascadingPipeline(
    stt=your_stt_provider,
    llm=your_llm_provider,
    tts=your_tts_provider,
    # highlight-start
    vad=SileroVAD(threshold=0.5),
    turn_detector=NamoTurnDetectorV1(language="en", threshold=0.7)
    # highlight-end
)

# 3. Create and start the session
session = AgentSession(agent=ConversationalAgent(), pipeline=pipeline)
# ... connect to your call transport
```

## Examples - Try It Yourself

- [Cascading Pipeline](https://github.com/videosdk-live/agents/blob/main/examples/cascade_basic.py): Turn detection and VAD with the cascading pipeline

---

---
title: Utterance Handle
hide_title: false
hide_table_of_contents: false
description: "Learn about UtteranceHandle in the VideoSDK AI Agent SDK. Understand how to manage agent utterances, prevent overlapping speech, and handle user interruptions gracefully."
pagination_label: "Utterance Handle"
keywords:
  - Utterance Handle
  - Speech Management
  - Interruption Handling
  - Async Await
  - Conversation Flow
  - AgentSession
  - TTS
  - VideoSDK Agents
  - Python SDK
  - AI Agents
image: img/videosdklive-thumbnail.jpg
sidebar_position: 7
sidebar_label: Utterance Handle
slug: utterance-handle
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards';
import { LanguageTable } from '@site/src/components/agent';

# Utterance Handle

`UtteranceHandle` is a lifecycle management class for agent utterances in the videosdk-agents framework. It solves two critical problems:

- preventing overlapping text-to-speech (TTS) output
- enabling graceful interruption handling when users speak during agent responses.

This is essential for creating natural conversational experiences where agents can generate multiple sequential speech outputs without audio overlap.

## Core Concepts

### Lifecycle Management

Each `UtteranceHandle` instance tracks a single utterance from creation through completion. The handle manages state transitions automatically as the conversation progresses.

### Completion States

An utterance can complete in two ways:

1. **Natural Completion:** The TTS finishes playing the audio to completion
2. **User Interruption:** The user starts speaking, triggering an interruption

### Awaitable Pattern

The handle is compatible with Python's async/await syntax. This allows you to write sequential speech code that waits for each utterance to complete before starting the next one.

## API Reference

### Properties

| Property/Method | Return Type | Description |
|----------------|-------------|-------------|
| `id` | `str` | Unique identifier for the utterance |
| `done()` | `bool` | Returns `True` if the utterance is complete |
| `interrupted` | `bool` | Returns `True` if the user interrupted |
| `interrupt()` | `None` | Manually marks the utterance as interrupted |
| `__await__()` | `Generator` | Enables awaiting the handle |

### Methods

- `interrupt()`: Manually marks the utterance as interrupted
- `__await__()`: Enables awaiting the handle to wait for completion

## Usage Patterns

### Sequential Speech

To prevent overlapping TTS, await each handle before starting the next utterance:

```python
# Correct approach
handle1 = self.session.say(f"The current temperature is {temperature}°C.")
await handle1  # Wait for first utterance to complete

handle2 = self.session.say("Do you live in this city?")
await handle2  # Wait for second utterance to complete
```

### Checking Interruption Status

Access the current utterance handle via `self.session.current_utterance` in function tools to detect interruptions:

```python
utterance: UtteranceHandle | None = self.session.current_utterance

# In long-running operations, check periodically
for i in range(10):
    if utterance and utterance.interrupted:
        logger.info("Task was interrupted by the user.")
        return "The task was cancelled because you interrupted me."
    await asyncio.sleep(1)
```

## Anti-Pattern: Concurrent Speech

Never use `asyncio.create_task()` for speech that should be sequential, as this causes overlapping audio:

```python
# INCORRECT - causes overlapping speech
asyncio.create_task(self.session.say(f"The current temperature is {temperature}°C."))
asyncio.create_task(self.session.say("Do you live in this city?"))
```

## Integration with AgentSession

The `session.say()` method returns an `UtteranceHandle` instance. During function tool execution, the current utterance is accessible via `self.session.current_utterance`. The handle's lifecycle is managed automatically by the session, with completion and interruption states updated as the conversation progresses.

### Complete Example

```python
@function_tool
async def get_weather(self, latitude: str, longitude: str) -> dict:
    utterance: UtteranceHandle | None = self.session.current_utterance

    # Fetch weather data
    temperature = await fetch_temperature(latitude, longitude)

    # Sequential speech with await
    handle1 = self.session.say(f"The current temperature is {temperature}°C.")
    await handle1

    handle2 = self.session.say("Do you live in this city?")
    await handle2

    # Check if user interrupted
    if utterance and utterance.interrupted:
        return {"response": "Weather request cancelled due to user interruption."}

    return {"response": f"The temperature is {temperature}°C."}
```

## Best Practices

1. Always await handles when you need sequential speech to prevent audio overlap
2. Check `interrupted` status in long-running operations to enable graceful cancellation
3. Store handle references if you need to check status later in your function
4. Avoid `create_task()` for speech that should play sequentially

## Common Use Cases

- **Multi-part responses:** When function tools need to speak multiple sentences in sequence
- **Long-running operations:** Tasks that should be cancellable when users interrupt
- **Conversational flows:** Scenarios requiring precise timing between utterances

## FAQs
Troubleshooting | Issue | Solution | |--------|-----------| | Overlapping speech | Use `await` on handles instead of `create_task()` | | Tasks not cancelling on interruption | Check `utterance.interrupted` in loops | | Handle is None | Only available during function tool execution via `session.current_utterance` |
Correct Usage Pattern #### ✅ Correct: Sequential Speech Await each handle to prevent overlapping TTS. ```python handle1 = session.say("First") await handle1 handle2 = session.say("Second") await handle2 ``` --- #### ❌ Incorrect: Concurrent Speech Using `create_task()` causes audio overlap. ```python asyncio.create_task(session.say("First")) asyncio.create_task(session.say("Second")) ```
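
To make the awaitable pattern more concrete, here is a minimal, self-contained sketch of how a handle can be made awaitable through `__await__`. `SketchHandle` and `fake_say` are hypothetical names used only for illustration. This is not the SDK's `UtteranceHandle` implementation, just one way such a class could be built on an `asyncio.Future`, with playback simulated by a short sleep.

```python
import asyncio


class SketchHandle:
    """Hypothetical stand-in for an utterance handle (not the SDK class)."""

    def __init__(self, utterance_id: str) -> None:
        self.id = utterance_id
        self.interrupted = False
        # Resolved when playback finishes or the utterance is interrupted
        self._future = asyncio.get_running_loop().create_future()
        self._task = None

    def done(self) -> bool:
        return self._future.done()

    def interrupt(self) -> None:
        self.interrupted = True
        self._mark_done()

    def _mark_done(self) -> None:
        if not self._future.done():
            self._future.set_result(None)

    def __await__(self):
        # Delegating to the future makes `await handle` wait for completion
        return self._future.__await__()


def fake_say(text: str, spoken: list, duration: float = 0.01) -> SketchHandle:
    """Hypothetical say(): schedules simulated playback, returns a handle at once."""
    handle = SketchHandle(utterance_id=text)

    async def _play() -> None:
        await asyncio.sleep(duration)  # simulate TTS playback time
        if not handle.done():
            spoken.append(text)  # playback completed without interruption
        handle._mark_done()

    # Keep a reference so the task is not garbage-collected mid-flight
    handle._task = asyncio.get_running_loop().create_task(_play())
    return handle


async def demo() -> list:
    spoken: list = []
    await fake_say("First", spoken)   # completes before the next one starts
    await fake_say("Second", spoken)
    return spoken
```

Running `asyncio.run(demo())` plays the two simulated utterances strictly one after the other, which is the same ordering guarantee you get from awaiting each handle returned by `session.say()`.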
--- --- title: Vision & Multi-modality hide_title: false hide_table_of_contents: false description: "Learn how to add vision and multi-modal capabilities to your VideoSDK AI Agents. Understand image processing, live video input, and multi-modal conversation flows." pagination_label: "Vision & Multi-modality" keywords: - Vision - Multi-modality - Image Processing - Live Video - Visual AI - ImageContent - Gemini Live - Real-time Vision - AI Agent SDK - VideoSDK Agents - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 9 sidebar_label: Vision & Multi-modality slug: vision-and-multi-modality --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Vision & Multi-modality Vision and multi-modal capabilities enable your AI agents to process and understand visual content alongside text and audio. This creates richer, more interactive experiences where agents can analyze images, respond to visual cues, and engage in conversations about what they see. The VideoSDK Agents framework supports vision capabilities through two distinct pipeline architectures, each with different capabilities and use cases. ## Pipeline Architecture Overview The framework provides two pipeline types with different vision support: | Pipeline Type | Vision Capabilities | Supported Models | Use Cases | |---|---|---|---| | CascadingPipeline | Live frame capture & static images | OpenAI, Anthropic, Google | On-demand frame analysis, document analysis, visual Q&A | | RealTimePipeline | Continuous live video streaming | Google Gemini Live only | Real-time visual interactions, live video commentary | ## Cascading Pipeline Vision The CascadingPipeline supports vision through two approaches: capturing live video frames from participants, or processing static images. This works with all supported LLM providers (OpenAI, Anthropic, Google). 
### Enabling Vision

Enable vision capabilities by setting `vision=True` in `RoomOptions`:

```python
from videosdk.agents import JobContext, RoomOptions

room_options = RoomOptions(
    room_id="your-room-id",
    name="Vision Agent",
    #highlight-start
    vision=True  # Enable vision capabilities
    #highlight-end
)

job_context = JobContext(room_options=room_options)
```

### Live Frame Capture

Capture video frames from meeting participants on demand using `agent.capture_frames()`:

```python
import asyncio

from videosdk.agents import Agent, AgentSession, CascadingPipeline, ConversationFlow, JobContext
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.google import GoogleLLM
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

class VisionAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant that can analyze images."
        )

async def entrypoint(ctx: JobContext):
    # VisionAgent.__init__ takes no arguments, so it is created without ctx
    agent = VisionAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(),
        llm=GoogleLLM(),
        tts=ElevenLabsTTS(),
        vad=SileroVAD(),
        turn_detector=TurnDetector()
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )

    shutdown_event = asyncio.Event()

    #highlight-start
    async def on_pubsub_message(message):
        print("Pubsub message received:", message)
        if isinstance(message, dict) and message.get("message") == "capture_frames":
            print("Capturing frame....")
            try:
                frames = agent.capture_frames(num_of_frames=1)
                if frames:
                    print(f"Captured {len(frames)} frame(s)")
                    await session.reply(
                        "Please analyze this frame and describe what you see in detail within one line.",
                        frames=frames
                    )
                else:
                    print("No frames available. Make sure vision is enabled in RoomOptions.")
            except ValueError as e:
                print(f"Error: {e}")

    def on_pubsub_message_wrapper(message):
        # Schedule the async handler from a synchronous callback
        asyncio.create_task(on_pubsub_message(message))
    #highlight-end

    #rest of the code..
```

:::tip
The `capture_frames` function returns an array of frames, and you can request at most 5 frames per call (`num_of_frames <= 5`).
:::

**Key Features:**

- **On-Demand Capture:** Capture frames only when needed, triggered by events or user requests
- **Event-Driven:** Use PubSub or other triggers to capture frames at the right moment
- **Flexible Analysis:** Send custom instructions along with frames for specific analysis tasks

### Silent Capture (Saving Captured Frames)

You can save captured video frames to disk for later analysis or debugging. The frames returned by `agent.capture_frames()` are `av.VideoFrame` objects that can be converted to JPEG images. The capture is "silent" because the agent does not announce that a frame is being captured unless you explicitly make it do so.

```python title="main.py"
import io
from av import VideoFrame
from PIL import Image

def save_frame_as_jpeg(frame: VideoFrame, filename: str) -> None:
    """Save a video frame as a JPEG file."""
    img = frame.to_image()  # Convert to PIL Image
    img.save(filename, format="JPEG")

# In your agent code
frames = agent.capture_frames(num_of_frames=1)
if frames:
    # Save the first frame
    save_frame_as_jpeg(frames[0], "captured_frame.jpg")

    # Or save as bytes for uploading/processing
    buffer = io.BytesIO()
    frames[0].to_image().save(buffer, format="JPEG")
    jpeg_bytes = buffer.getvalue()
```

**Use Cases:**

- **Debugging:** Save frames to verify what the agent is seeing
- **Logging:** Archive frames for audit trails or quality assurance
- **Preprocessing:** Save frames before sending to external vision APIs
- **Thumbnails:** Generate preview images for user interfaces

### Static Image Processing

For pre-existing images or URLs, use the `ImageContent` class:

```python
from videosdk.agents import ChatRole, ImageContent

# Add image from URL
agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[ImageContent(image="https://example.com/image.jpg")]
)

# Add image with custom settings
image_content = ImageContent(
image="https://example.com/document.png", inference_detail="high" # "auto", "high", or "low" ) agent.chat_context.add_message( role=ChatRole.USER, content=[image_content] ) ``` ### Provider Support All major LLM providers support vision in `CascadingPipeline`: | Provider | Vision Models | Capabilities | |-----------|----------------|---------------| | OpenAI | GPT-4 Vision models | Configurable detail levels, URL & base64 support | | Anthropic | Claude 3 models | Advanced image understanding, document analysis | | Google | Gemini models | Comprehensive visual analysis, multi-image support | ### Best Practices - **Frame Timing:** Capture frames at meaningful moments (e.g., when user asks "what do you see?") - **Error Handling:** Always check if frames are available before processing - **Vision Enablement:** Ensure `vision=True` is set in `RoomOptions` for frame capture - **Image Quality:** Use appropriate resolutions for your use case (1024x1024 recommended for detailed analysis) *Here is the example you can try out : [Cascade Pipeline Vision Example](https://github.com/videosdk-live/agents/blob/main/examples/vision/vision_cascade.py)* --- ## RealTime Pipeline Vision The `RealTimePipeline` enables continuous live video processing for real-time visual interactions. Video frames are automatically streamed to the model as they arrive. ### Live Video Processing Live video input is enabled through the `vision` parameter in `RoomOptions` and requires Google's Gemini Live model. 
```python title="main.py"
from videosdk.agents import Agent, AgentSession, RealTimePipeline, WorkerJob, JobContext, RoomOptions
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig

async def start_session(context: JobContext):
    # Initialize Gemini with vision capabilities
    model = GeminiRealtime(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        config=GeminiLiveConfig(
            voice="Leda",
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)

    # VisionAgent is the Agent subclass from the Live Frame Capture example
    agent = VisionAgent()

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
    )

    await session.start(wait_for_participant=True, run_until_shutdown=True)

# Enable live video processing
def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",
        name="Sandbox Agent",
        playground=True,
        #highlight-start
        vision=True
        #highlight-end
    )
    return JobContext(
        room_options=room_options
    )
```

### Video Processing Flow

When vision is enabled, the system automatically does the following:

1. **Continuous Capture**: Captures video frames from meeting participants
2. **Frame Processing**: Processes frames at optimal intervals (throttled to 0.5 seconds)
3. **Model Integration**: Sends visual data to the Gemini Live model
4. **Context Integration**: Integrates visual understanding with conversation context

### RealTimePipeline Limitations

- **Model Restriction**: Only works with the `GeminiRealtime` model
- **Network Requirements**: Requires stable network connections for optimal performance
- **Frame Rate**: Automatically throttled to prevent overwhelming the model

*Here is an example you can try out: [**Realtime Pipeline Vision Example**](https://github.com/videosdk-live/agents/blob/main/examples/vision/vision_realtime.py)*

## Choosing the Right Approach

| Use Case | Recommended Pipeline | Why |
|-----------|----------------------|-----|
| On-demand frame analysis | CascadingPipeline | Capture frames only when needed, works with all LLM providers |
| Document/image Q&A | CascadingPipeline | Process static images with custom instructions |
| Real-time video commentary | RealTimePipeline | Continuous streaming for live visual interactions |
| Multi-provider support | CascadingPipeline | Works with OpenAI, Anthropic, and Google |
| Lowest latency | RealTimePipeline | Direct streaming to Gemini Live model |

## Examples - Try Out Yourself

Check out examples of the Realtime and Cascading vision functionality.

import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, RobotIcon, GithubIcon } from '@site/src/components/agent/cards';

- [Realtime Pipeline Vision](https://github.com/videosdk-live/agents/blob/main/examples/vision/vision_realtime.py): Continuous video streaming with Gemini Realtime API

## Frequently Asked Questions
Can I use vision with any LLM provider? CascadingPipeline vision works with OpenAI, Anthropic, and Google LLMs. RealTimePipeline vision only works with Google's Gemini Live model.
How do I capture frames at specific moments? Use event-driven triggers like PubSub messages or user speech to call `agent.capture_frames()` at the right time. See the example code above for implementation details.
What's the difference between frame capture and continuous streaming? Frame capture (CascadingPipeline) captures frames on-demand when you call `capture_frames()`. Continuous streaming (RealTimePipeline) automatically sends video frames to the model in real-time.
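
The throttling mentioned in the video processing flow (roughly one frame every 0.5 seconds during continuous streaming) can be pictured with a small rate limiter. The sketch below only illustrates the general technique and is not the SDK's internal code. `FrameThrottle`, `should_forward`, and `throttle_frames` are hypothetical names, and timestamps are passed in explicitly so the logic is easy to follow and test.

```python
from typing import Iterable, List, Optional


class FrameThrottle:
    """Hypothetical helper: lets a frame through at most once per interval."""

    def __init__(self, min_interval: float = 0.5) -> None:
        self.min_interval = min_interval
        self._last_forwarded: Optional[float] = None

    def should_forward(self, timestamp: float) -> bool:
        """Return True if a frame arriving at `timestamp` (seconds) should be sent."""
        if (
            self._last_forwarded is None
            or timestamp - self._last_forwarded >= self.min_interval
        ):
            self._last_forwarded = timestamp
            return True
        return False  # too soon after the last forwarded frame: drop it


def throttle_frames(timestamps: Iterable[float], min_interval: float = 0.5) -> List[float]:
    """Return the arrival times of the frames that would be forwarded."""
    throttle = FrameThrottle(min_interval)
    return [t for t in timestamps if throttle.should_forward(t)]
```

With frames arriving every 0.25 seconds, for example, only every second frame would pass this filter. On-demand capture with `capture_frames()` needs no such throttle because you decide when each frame is taken.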
---

---
title: Voice Mail Detection
hide_title: false
hide_table_of_contents: false
description: "Learn how VideoSDK AI agents detect voicemail systems during outbound calls and take actions such as leaving a voicemail message or ending the call"
pagination_label: "Voice Mail Detection"
keywords:
  - Voice Mail Detection
  - VideoSDK Agents
  - VideoSDK AI Voice
  - Python SDK
  - Real-time Transcription
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Voice Mail Detection
slug: voice-mail-detection
---

import { AgentCardGrid, GithubIcon, } from "@site/src/components/agent/cards";

# Voice Mail Detection

Voice Mail Detection allows you to automatically handle voicemail scenarios when making outbound calls with a VideoSDK AI agent. When an outbound call is forwarded to a voicemail system, the detector triggers a callback so your agent can take an action such as leaving a voicemail message or ending the call.

## What Problem This Solves

In outbound calling workflows, unanswered calls are often routed to voicemail systems. Without detection, agents may continue speaking or wait unnecessarily. Voice Mail Detection lets you:

- Detect voicemail systems automatically
- Control how your agent responds
- End calls cleanly after voicemail handling

:::info
To set up outbound calling and routing rules, check out the [Quick Start Example](https://docs.videosdk.live/telephony/managing-calls/making-outbound-calls).
:::

## Enabling Voice Mail Detection

To use voicemail detection, import `VoiceMailDetector`, add it to your agent configuration, and register a callback that defines how voicemail should be handled.
```python
from videosdk.agents import AgentSession, VoiceMailDetector
from videosdk.plugins.openai import OpenAILLM


async def voice_mail_callback(message):
    # Runs whenever the detector reports a voicemail system
    print("Voice Mail message received:", message)

# highlight-start
voicemail = VoiceMailDetector(
    llm=OpenAILLM(),
    duration=5,
    callback=voice_mail_callback,
)
# highlight-end

session = AgentSession(
    # highlight-start
    voice_mail_detector=voicemail
    # highlight-end
)
```

## Parameters

| Parameter | Description |
|----------|-------------|
| `llm` | LLM to process the detected voicemail. |
| `duration` | The minimum period of silence (in seconds) that triggers voicemail detection. |
| `callback` | A function that is called whenever a voicemail is detected, allowing for custom actions like hanging up or leaving a message. |

## Example - Try It Yourself

---

---
title: Worker
hide_title: false
hide_table_of_contents: false
description: "The `Worker` class in VideoSDK's AI Agent SDK serves as the central orchestrator that manages job execution, backend registration, and agent lifecycle coordination. It handles task execution through configurable process/thread executors, manages VideoSDK room connections, and coordinates between agents, pipelines, and infrastructure components for seamless real-time AI communication."
pagination_label: "Worker"
keywords:
  - Worker
  - AI Agent SDK
  - VideoSDK Agents
  - Job Orchestration
  - Task Execution
  - Backend Registration
  - Agent Lifecycle
  - Process Management
  - Session Coordination
  - Real-time AI
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 13
sidebar_label: Worker
slug: worker
---

import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, ExternalLinkIcon, RobotIcon, GithubIcon } from '@site/src/components/agent/cards';

# Worker

This document covers the `worker` and `job` execution system that manages `agent` processes, handles backend registration, and coordinates job assignment and execution.
This system provides the foundation for running VideoSDK agents either locally or as part of a distributed backend infrastructure. ## Architecture Overview The `worker` and `job` system consists of three primary components that work together to execute agent code: - **WorkerJob**: The main entry point that configures and starts agent execution - **Worker**: Manages process pools, backend communication, and job lifecycle - **JobContext**: Provides runtime context and resources to agent entrypoint functions ![Worker](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_worker.png) ## Core Components ### Worker Class The `Worker` class manages the complete lifecycle of agent execution, including process management, backend communication, and job coordination. **Core Responsibilities:** - Process pool management and lifecycle - Backend registry communication - Job assignment and execution coordination - Resource monitoring and cleanup - Error handling and recovery ### WorkerJob The `WorkerJob` class serves as the primary entry point for creating and running agents. It accepts an `entrypoint function` and `configuration options`, then delegates to the Worker class for execution. ```python from videosdk.agents import WorkerJob, Options, JobContext, RoomOptions # Configure worker options options = Options( agent_id="MyAgent", max_processes=5, register=True, # Registers worker with backend for job scheduling ) # Set up room configuration room_options = RoomOptions( name="My Agent", ) # Create job context job_context = JobContext(room_options=room_options) # Define your agent entrypoint async def your_agent_function(ctx: JobContext): # Your agent logic here await ctx.connect() # Agent implementation... 
# Create and start the worker job # highlight-start job = WorkerJob( entrypoint=your_agent_function, jobctx=lambda: job_context, options=options, ) job.start() # highlight-end ``` - **Entrypoint:** An async function that serves as your agent's main execution logic. This function receives a `JobContext` parameter and contains your agent implementation. - **JobContext:** Provides the runtime environment for your agent, managing room connections and VideoSDK integration. It handles room setup, authentication, and cleanup operations. - **Options:** Configuration settings for worker execution including process management, authentication, and backend registration. You can find worker options [here ↗](https://docs.videosdk.live/ai_agents/deployments/self-hosting/worker-configuration#worker-options-explained). **Key Methods:** - `start()`: Initiates worker execution based on configuration ## Deployments Choose how to deploy your VideoSDK agents based on your infrastructure needs and requirements. ## Examples - Try Out Yourself We have examples to get you started. Go ahead, try out, talk to agent, understand and customize according to your needs. } ]} /> --- --- title: Deploy Your Agents hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." pagination_label: "Deploy Your Agents" keywords: - AI Agent SDK - VideoSDK Agents - Introduction - Python SDK - AI Integration - VideoSDK Cloud image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Deploy Your Agents slug: deploy-your-agents --- # Deploy Your Agents This guide shows you how to deploy AI Agents with the [videosdk-agents](https://pypi.org/project/videosdk-agents/) python package. Once your AI Agent is ready to use, you need to create an AI Deployment. 
The AI Deployment is responsible for running your AI Agent. Before proceeding, ensure you have completed the steps under **Prerequisites**. ## Prerequisites To deploy your AI Deployment, make sure you have: - Created an AI Deployment using the [Create AI Deployment API](/api-reference/agent-cloud/create-deployment). - A VideoSDK authentication token (generate from [VideoSDK Dashboard](https://app.videosdk.live)) ## YAML Configuration Create a `videosdk.yaml` file with the following structure: ``` version: "1.0" deployment: id: your_ai_deployment_id entry: path: entry_point_for_deployment env: # Optional to run your agent locally path: "./.env" secrets: VIDEOSDK_AUTH_TOKEN: your_auth_token deploy: cloud: true ``` ### Field Descriptions | Field | Description | | ----------------------------- | ---------------------------------------------------------------------------------------------- | | `deployment.id` | The `deploymentId` obtained from the [Create AI Deployment API](/api-reference/agent-cloud/create-deployment) | | `deployment.entry.path` | Path to the entry point script for your AI Deployment. | | `env.path` | Path to your `.env` file, used only when running the agent locally. | | `secrets.VIDEOSDK_AUTH_TOKEN` | Your VideoSDK auth token (required for deployment). | | `deploy.cloud` | Set to `true` to allow deploying the deployment to VideoSDK Cloud, when using the deploy command. Use `false` to avoid accidental deploys. | ## CLI Commands - ###### Run the AI Deployment locally for Testing. ``` videosdk run ``` - ###### Deploy the AI Deployment. ``` videosdk deploy ``` ## Next Steps After deploying your AI Deployment, you can start using it by: 1. Creating a new session using the [Start Session API](/api-reference/agent-cloud/start-session) 2. 
Ending the session using the [End Session API](/api-reference/agent-cloud/end-session) --- --- title: Authentication hide_title: false hide_table_of_contents: false description: "Learn how to authenticate with the VideoSDK CLI using login and logout commands. Set up your credentials for Agent Cloud deployments." pagination_label: "CLI Authentication" keywords: - VideoSDK CLI - Authentication - Login - Logout - Auth Token - Agent Cloud image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Authentication slug: authentication --- # Authentication Before deploying agents to Agent Cloud, you need to authenticate the VideoSDK CLI with your account. This section covers the authentication commands that link your local development environment to your VideoSDK account. ## Login The `login` command authenticates your CLI session with your VideoSDK account. ### Usage ```bash videosdk auth login ``` ### What Happens 1. **Browser Opens**: The CLI automatically opens your default browser to the authentication page 2. **Login & Confirm**: You log in to your VideoSDK account and confirm the CLI authentication request 3. **Token Storage**: Once approved, an authentication token is securely stored locally 4. **Ready to Deploy**: The stored token will be used for all future CLI commands ### Example Output ```bash $ videosdk auth login ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Authentication ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ℹ Initiating browser authentication... ✓ Opened authentication URL in browser https://app.videosdk.live/cli/confirm-auth?requestId=abc123xyz ⠋ Waiting for authentication... ✓ Successfully authenticated! 
``` ### Notes - The CLI will wait for you to complete the authentication in the browser - If the browser doesn't open automatically, copy and paste the displayed URL - Authentication will timeout if not completed within the specified time - You can cancel the authentication anytime by pressing `Ctrl+C` ## Logout The `logout` command removes the stored authentication token from your local environment. ### Usage ```bash videosdk auth logout ``` ### What Happens 1. **Token Removal**: The stored authentication token is removed from your local configuration 2. **Session End**: Your CLI session is disconnected from the VideoSDK account ### Example Output ```bash $ videosdk auth logout ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Logout ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ✓ Successfully logged out ``` ### Notes - After logging out, you'll need to run `videosdk auth login` again before using authenticated commands - This command does not affect any existing deployments on Agent Cloud ## Next Steps After authenticating, you're ready to initialize your agent. See the [Initialize Agent](./init-agent) documentation for more details. --- --- title: Build & Push hide_title: false hide_table_of_contents: false toc_max_heading_level: 2 description: "Learn how to build and push Docker images for your AI agents using the VideoSDK CLI. Package your agent code and deploy to container registries." pagination_label: "CLI Build & Push" keywords: - VideoSDK CLI - Docker Build - Docker Push - Container Registry - Agent Deployment image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: Build & Push slug: build-push --- # Build & Push After developing your AI agent, you need to package it as a Docker image and push it to a container registry before deploying to Agent Cloud. This section covers the build and push commands. ## Prerequisites Before building your image, you must have a `Dockerfile` and a `requirements.txt` file in your project's root directory. 
The `Dockerfile` is automatically generated when you run [videosdk agent init](./init-agent). ### requirements.txt Create a `requirements.txt` file listing all the Python packages your agent needs. ### Dockerfile A `Dockerfile` is automatically created when you run `videosdk agent init`. Below is a minimal **multi-stage build** example that keeps your final image small while ensuring all build-time dependencies are met. ```dockerfile # Stage 1: Build Stage FROM python:3.12-slim AS builder WORKDIR /app # Install build-time dependencies # These are only needed to compile/build packages like aec-audio-processing RUN apt-get update && apt-get install -y --no-install-recommends \ build-essential \ python3-dev \ swig \ pkg-config \ meson \ ninja-build \ && rm -rf /var/lib/apt/lists/* # Create a virtual environment to keep dependency installation isolated and easy to move RUN python -m venv /opt/venv ENV PATH="/opt/venv/bin:$PATH" COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt # Stage 2: Runtime Stage FROM python:3.12-slim WORKDIR /app # Copy the virtual environment from the builder stage # This includes all the installed Python packages but NONE of the build tools COPY --from=builder /opt/venv /opt/venv # Ensure the app uses the virtual environment's Python and packages ENV PATH="/opt/venv/bin:$PATH" # Install minimal runtime libraries if needed (e.g., libstdc++ for compiled extensions) RUN apt-get update && apt-get install -y --no-install-recommends \ libstdc++6 \ && rm -rf /var/lib/apt/lists/* # Copy your application code COPY agent.py . # Run the application CMD ["python", "agent.py"] ``` ## Build The `build` command creates a Docker image for your agent using a Dockerfile. 
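Before running the build, it can help to confirm that the files named under Prerequisites are in place. The following is a minimal sketch, not part of the VideoSDK CLI: `missing_build_files` is a hypothetical helper that only checks a project directory for `Dockerfile` and `requirements.txt`.

```python
from pathlib import Path

# Files the Prerequisites section says must exist in the project root.
REQUIRED_FILES = ("Dockerfile", "requirements.txt")


def missing_build_files(project_root: str = ".") -> list[str]:
    """Return the required files that are absent from project_root."""
    root = Path(project_root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_build_files()
    if missing:
        print("Missing before build:", ", ".join(missing))
    else:
        print("Prerequisites found; ready for `videosdk agent build`.")
```

Running the script from the project root lists whatever is missing, so the problem surfaces before Docker is invoked.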
### Usage ```bash videosdk agent build [OPTIONS] ``` ### Options | Option | Short | Description | Default | | --------- | ----- | -------------------------------------------------------- | -------------------- | | `--image` | `-i` | Image name with optional tag (e.g., `myrepo/myagent:v1`) | From `videosdk.yaml` | | `--file` | `-f` | Path to Dockerfile | `./Dockerfile` | ### What Happens 1. **Dockerfile Detection**: The CLI locates your Dockerfile (default: `./Dockerfile`) 2. **Image Build**: Docker builds the image for the `linux/arm64` platform 3. **Local Storage**: The built image is stored in your local Docker registry ### Example Output ```bash $ videosdk agent build --image myrepo/myagent:v1 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Building Docker Image ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Platform linux/arm64 Image myrepo/myagent:v1 Dockerfile /path/to/your/project/Dockerfile ──────────────────────────────────────── [Docker build output...] ✓ Successfully built image: myrepo/myagent:v1 ``` ### Examples ```bash # Build with explicit image name videosdk agent build --image myrepo/myagent:v1 # Build with custom Dockerfile videosdk agent build --image myrepo/myagent:v1 --file Dockerfile.prod # Build using image from videosdk.yaml videosdk agent build ``` ### Notes - The image name must be lowercase - In examples like `myrepo/myagent:v1`, `myrepo` is a placeholder for your Docker registry username (e.g., your Docker Hub username). - If `--image` is not provided, the CLI reads from `agent.image` in your `videosdk.yaml` - Docker must be installed and running on your machine - The build uses `linux/arm64` platform for Agent Cloud compatibility ## Push The `push` command uploads your Docker image to a container registry. 
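The Build notes require a lowercase image name, so checking the name before pushing can catch a rejected reference early. This is a minimal sketch under stated assumptions: `split_image` and `has_lowercase_name` are hypothetical helpers, and they check only the repository part of the reference (everything before the tag), since Docker tags themselves may contain uppercase letters. It does not validate the full Docker reference grammar.

```python
def split_image(image: str) -> tuple[str, str]:
    """Split 'repo/name:tag' into (name, tag); tag is '' when absent."""
    last_slash = image.rfind("/")
    colon = image.rfind(":")
    # A colon only marks a tag when it comes after the last slash;
    # otherwise it belongs to a registry port such as host:5000/name.
    if colon > last_slash:
        return image[:colon], image[colon + 1:]
    return image, ""


def has_lowercase_name(image: str) -> bool:
    """True when the repository part of the image reference is lowercase."""
    name, _tag = split_image(image)
    return name == name.lower()


if __name__ == "__main__":
    for candidate in ("myrepo/myagent:v1", "MyRepo/myagent:v1"):
        status = "ok" if has_lowercase_name(candidate) else "not lowercase"
        print(f"{candidate}: {status}")
```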
### Usage ```bash videosdk agent push [OPTIONS] ``` ### Options | Option | Short | Description | Default | | ------------ | ----- | ----------------------------------------------- | -------------------- | | `--image` | `-i` | Image name with tag (e.g., `myrepo/myagent:v1`) | From `videosdk.yaml` | | `--server` | `-s` | Registry server URL | `docker.io` | | `--username` | `-u` | Registry username for authentication | None | | `--password` | `-p` | Registry password for authentication | None | ### What Happens 1. **Authentication** (optional): If credentials are provided, the CLI logs into the registry 2. **Image Push**: The Docker image is uploaded to the specified registry 3. **Ready for Deploy**: The image is now available for Agent Cloud deployments ### Example Output ```bash $ videosdk agent push --image myrepo/myagent:v1 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Pushing Docker Image ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Image myrepo/myagent:v1 Registry docker.io ──────────────────────────────────────── [Docker push output...] ✓ Successfully pushed image: myrepo/myagent:v1 ``` ### Examples ```bash # Push to Docker Hub (default) videosdk agent push --image myrepo/myagent:v1 # Push to GitHub Container Registry with authentication videosdk agent push --image myrepo/myagent:v1 --server ghcr.io -u username -p token # Push to private registry videosdk agent push --image myrepo/myagent:v1 --server registry.example.com -u user -p pass # Push using image from videosdk.yaml videosdk agent push ``` ### Supported Registries | Registry | Server URL | | ------------------------- | ------------------------------------------ | | Docker Hub | `docker.io` (default) | | GitHub Container Registry | `ghcr.io` | | AWS ECR | `.dkr.ecr..amazonaws.com` | | Google Container Registry | `gcr.io` | | Azure Container Registry | `.azurecr.io` | ### Notes - Ensure the image is built before pushing (`videosdk agent build`) - Replace `myrepo` with your actual Docker registry username. 
- For Docker Hub, you can omit `--server` as it's the default - For private registries, you must provide authentication credentials - The registry server is automatically detected from the image name if `--server` is not specified ## yaml Configuration Both commands can read the image name from your `videosdk.yaml` configuration file: ```yaml agent: id: your-agent-id image: myrepo/myagent:v1 ``` When the `image` is configured in `videosdk.yaml`, you can simply run: ```bash videosdk agent build videosdk agent push ``` ## Example Here's a typical workflow for building and pushing your agent: ```bash # 1. Build the Docker image videosdk agent build --image myrepo/myagent:v1 # 2. Push to container registry videosdk agent push --image myrepo/myagent:v1 # 3. Deploy to Agent Cloud (covered in deployment docs) videosdk agent deploy --image myrepo/myagent:v1 ``` ## Next Steps After pushing your image to a container registry, you're ready to deploy your agent to Agent Cloud. See the [deployment documentation](./deploy) for more details. --- --- title: Deploy & Version Commands hide_title: false hide_table_of_contents: false toc_max_heading_level: 2 description: "Learn how to deploy and manage versions of your AI agents on Agent Cloud using VideoSDK CLI commands." pagination_label: "CLI Deployment" keywords: - VideoSDK CLI - Agent Deploy - Version Management - Agent Cloud - Replicas - Scaling image: img/videosdklive-thumbnail.jpg sidebar_position: 4 sidebar_label: Deployment slug: deploy --- # Deploy & Version Commands This section covers all CLI commands for deploying and managing versions of your AI agents on Agent Cloud. ## Deploy Create a new version of your agent on VideoSDK cloud. 
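When deployments are scripted (for example from CI), the command line can be assembled programmatically. The following is a minimal sketch that uses only flags documented in the Options table of this section. `build_deploy_command` is a hypothetical helper, not part of the CLI, and the check that the minimum replica count does not exceed the maximum is an assumption made here rather than documented CLI behavior.

```python
from typing import List, Optional

# Compute profiles listed in the Options table for `videosdk agent deploy`.
PROFILES = {"cpu-small", "cpu-medium", "cpu-large"}


def build_deploy_command(
    image: str,
    name: Optional[str] = None,
    min_replica: Optional[int] = None,
    max_replica: Optional[int] = None,
    profile: Optional[str] = None,
    region: Optional[str] = None,
) -> List[str]:
    """Assemble `videosdk agent deploy [NAME] [OPTIONS]` as an argument list."""
    if profile is not None and profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile}")
    # Assumption: a minimum above the maximum is treated as a mistake here.
    if (
        min_replica is not None
        and max_replica is not None
        and min_replica > max_replica
    ):
        raise ValueError("min_replica must not exceed max_replica")

    command = ["videosdk", "agent", "deploy"]
    if name:
        command.append(name)  # optional version name
    command += ["--image", image]
    for flag, value in (
        ("--min-replica", min_replica),
        ("--max-replica", max_replica),
        ("--profile", profile),
        ("--region", region),
    ):
        if value is not None:
            command += [flag, str(value)]
    return command


if __name__ == "__main__":
    # Prints the command; pass the list to subprocess.run(...) to execute it.
    print(" ".join(build_deploy_command("myrepo/myagent:v1", region="in002")))
```

Returning a list rather than a single string keeps each argument separate, which avoids shell-quoting problems if the list is later handed to `subprocess.run`.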
### Usage ```bash videosdk agent deploy [NAME] [OPTIONS] ``` ### Arguments | Argument | Required | Description | | -------- | -------- | --------------------- | | `NAME` | No | Optional version name | ### Options | Option | Short | Description | Default | | --------------------- | ----- | ------------------------------------------------------- | -------------------- | | `--image` | `-i` | Docker image URL (e.g., `myrepo/myagent:v1`) | From `videosdk.yaml` | | `--version-tag` | | Version tag (e.g., `main/0.0.2`) | None | | `--min-replica` | | Minimum number of replicas | `0` | | `--max-replica` | | Maximum number of replicas | `3` | | `--profile` | | Compute profile: `cpu-small`, `cpu-medium`, `cpu-large` | `cpu-small` | | `--agent-id` | | Agent ID | From `videosdk.yaml` | | `--deployment-id` | | Deployment ID | From `videosdk.yaml` | | `--env-secret` | | ID of the environment secret set to use | From `videosdk.yaml` | | `--image-pull-secret` | | Name of the image pull secret for private registries | From `videosdk.yaml` | | `--region` | | Deployment region (e.g., `in002`, `us002`) | `us002` | :::note In examples like `myrepo/myagent:v1`, `myrepo` is a placeholder for your Docker registry username (e.g., your Docker Hub username). ::: ### Example Output ```bash $ videosdk agent deploy --image myrepo/myagent:v1 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Creating Version ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Version Name my-agent's version Agent ID abc123xyz Deployment ID dep-456 Image myrepo/myagent:v1 ──────────────────────────────────────── ⠋ Creating Version... 
✓ Version created successfully for agent: abc123xyz Version ID ver-789 ℹ Next step: Check version status videosdk agent version status -v ver-789 ``` ### Examples ```bash # Basic deployment videosdk agent deploy --image myrepo/myagent:v1 # Named version with version tag videosdk agent deploy my-version --image myrepo/myagent:v1 --version-tag main/0.0.1 # Deployment with secrets and custom profile videosdk agent deploy --image myrepo/myagent:v1 --env-secret my-secrets --profile cpu-medium # Deployment to specific region videosdk agent deploy --image myrepo/myagent:v1 --region in002 ``` ## List List all versions for an agent deployment. ### Usage ```bash videosdk agent version list [OPTIONS] ``` ### Options | Option | Description | Default | | ----------------- | ----------------------------------------------------- | -------------------- | | `--agent-id` | Agent ID | From `videosdk.yaml` | | `--deployment-id` | Deployment ID | From `videosdk.yaml` | | `--page` | Page number | `1` | | `--per-page` | Items per page | `10` | | `--sort` | Sort order: `1` (oldest first) or `-1` (newest first) | `-1` | ### Example Output ```bash $ videosdk agent version list ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Listing Versions ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Agent ID abc123xyz Deployment ID dep-456 ┌────────────┬─────────┬──────────┬────────────┬─────────────┐ │ Version ID │ Status │ Region │ Profile │ Replicas │ ├────────────┼─────────┼──────────┼────────────┼─────────────┤ │ ver-001 │ active │ us002 │ cpu-small │ 2/5 │ │ ver-002 │ inactive│ in002 │ cpu-medium │ 0/10 │ └────────────┴─────────┴──────────┴────────────┴─────────────┘ ``` ### Examples ```bash # List all versions videosdk agent version list # List versions for specific agent videosdk agent version list --agent-id abc123 # Paginated listing videosdk agent version list --page 2 --per-page 20 # Sort oldest first videosdk agent version list --sort 1 ``` ## Update Update an existing version configuration. 
### Usage ```bash videosdk agent version update [OPTIONS] ``` ### Options | Option | Short | Description | Required | | --------------------- | ----- | --------------------------- | -------- | | `--version-id` | `-v` | Version ID to update | **Yes** | | `--min-replica` | | New minimum replicas | No | | `--max-replica` | | New maximum replicas | No | | `--profile` | | New compute profile | No | | `--image-pull-secret` | | New image pull secret name | No | | `--env-secret` | | New environment secret name | No | ### Example Output ```bash $ videosdk agent version update -v ver123 --min-replica 3 --max-replica 15 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Updating Version ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Min Replicas 3 Max Replicas 15 ⠋ Updating Version... ✓ Version updated successfully ``` ### Examples ```bash # Update replica counts videosdk agent version update -v ver123 --min-replica 2 --max-replica 10 # Update to larger profile videosdk agent version update -v ver123 --profile cpu-large # Update environment secrets videosdk agent version update -v ver123 --env-secret new-secrets ``` ## Activate Activate a version to start receiving traffic. ### Usage ```bash videosdk agent version activate [OPTIONS] ``` ### Options | Option | Short | Description | Required | | -------------- | ----- | ---------------------- | -------- | | `--version-id` | `-v` | Version ID to activate | **Yes** | ### Example Output ```bash $ videosdk agent version activate -v ver123 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Activating Version ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⠋ Activating Version... ✓ Version activated successfully ``` ## Deactivate Deactivate a version to stop receiving new traffic. 
### Usage ```bash videosdk agent version deactivate [OPTIONS] ``` ### Options | Option | Short | Description | Required | | -------------- | ----- | ------------------------------------------ | -------- | | `--version-id` | `-v` | Version ID to deactivate | **Yes** | | `--force` | | Force deactivate even with active sessions | No | ### Example Output ```bash $ videosdk agent version deactivate -v ver123 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Deactivating Version ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⠋ Deactivating Version... ✓ Version deactivated successfully ``` ### Examples ```bash # Graceful deactivation videosdk agent version deactivate -v ver123 # Force deactivation (terminates active sessions) videosdk agent version deactivate -v ver123 --force ``` ## Status Get the current status of a version. ### Usage ```bash videosdk agent version status [OPTIONS] ``` ### Options | Option | Short | Description | Required | | -------------- | ----- | ------------------- | -------- | | `--version-id` | `-v` | Version ID to check | **Yes** | ### Example Output ```bash $ videosdk agent version status -v ver123 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Getting Version Status ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Version ID ver123 Status active Replicas 3/5 running Health healthy ℹ Next step: Start a session videosdk agent session start -v ver123 ``` ## Describe Get detailed information about a version. 
### Usage ```bash videosdk agent version describe [OPTIONS] ``` ### Options | Option | Short | Description | Required | | -------------- | ----- | ---------------------- | -------- | | `--version-id` | `-v` | Version ID to describe | **Yes** | ### Example Output ```bash $ videosdk agent version describe -v ver123 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Describing Version ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Version ID ver123 Agent ID abc123xyz Deployment ID dep-456 Status active Image myrepo/myagent:v1 Profile cpu-medium Min Replicas 2 Max Replicas 10 Current Replicas 5 Region in002 Created At 2026-01-15 10:30:00 ``` ## Quick Reference | Command | Description | | ----------------------------------- | ------------------------- | | `videosdk agent deploy` | Create a new version | | `videosdk agent version list` | List all versions | | `videosdk agent version update` | Update version config | | `videosdk agent version activate` | Activate a version | | `videosdk agent version deactivate` | Deactivate a version | | `videosdk agent version status` | Get version status | | `videosdk agent version describe` | Get detailed version info | ## videosdk.yaml Reference You can configure your entire deployment process in the `videosdk.yaml` file. This allows you to run commands like `videosdk agent build`, `videosdk agent push`, and `videosdk agent deploy` without providing extra flags. 
```yaml agent: id: ag_xxxxxx # automatically generated name: agent-test # automatically generated, if not provided in videosdk agent init build: image: username/myagent:v1 # docker image logs: id: b_xxxxxx # automatically generated enabled: true # build logs enabled or not deploy: id: dep_xxxxxx # automatically generated replicas: min: 0 max: 3 profile: cpu-small region: us002 # options: in002, us002 secrets: env: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx # env secrets, global image-pull: image-pull-secret-name # image pull secrets for private images, region specific ``` ## Next Steps - Learn about managing [environment secrets](./env-secrets.md) or [image pull secrets](./image-pull-secrets.md) for your deployments - View [sessions](./sessions) for your agents - Use [Up & Down](./up-down) commands to streamline your workflow --- --- title: Environment Secrets hide_title: false hide_table_of_contents: false toc_max_heading_level: 3 description: "Manage environment secrets for your AI agents using the VideoSDK CLI." pagination_label: "Environment Secrets" keywords: - VideoSDK CLI - Secrets - Environment Variables - Agent Cloud image: img/videosdklive-thumbnail.jpg sidebar_position: 4 sidebar_label: Environment Secrets slug: env-secrets --- # Environment Secrets Environment secrets are key-value pairs that are securely injected as environment variables into your agent containers at runtime. ## List List all secret sets. ### Usage ```bash videosdk agent secrets list ``` ### Example Output ```bash $ videosdk agent secrets list ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Listing Secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ┌──────────────────┬─────────────────┬──────────┐ │ Name │ Secret ID │ Type │ ├──────────────────┼─────────────────┼──────────┤ │ my-secrets │ sec-abc123 │ env │ │ prod-credentials │ sec-xyz789 │ env │ └──────────────────┴─────────────────┴──────────┘ ✓ Secrets listed successfully ``` ## Create Create a new secret set.
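The `--file` option of this command reads a `.env` file of key=value pairs. As a sketch of sanity-checking such a file before uploading it, the hypothetical helpers below parse the pairs and print them with masked values, similar to the example output. The parsing rules (skip blank lines and `#` comments, split on the first `=`) are common `.env` conventions assumed here, not documented CLI behavior.

```python
def parse_env_lines(lines):
    """Parse key=value lines into a dict, skipping blanks and # comments."""
    pairs = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"not a key=value pair: {raw!r}")
        pairs[key.strip()] = value.strip()
    return pairs


def masked_summary(pairs):
    """List the keys with masked values so secrets never reach the terminal."""
    return [f"- {key}: ******" for key in pairs]


if __name__ == "__main__":
    sample = ["# local credentials", "API_KEY=abc123", "", "DATABASE_URL=db://host?x=1"]
    print("\n".join(masked_summary(parse_env_lines(sample))))
```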
### Usage ```bash videosdk agent secrets create [OPTIONS] ``` ### Options | Option | Short | Description | Default | | ---------- | ----- | -------------------------------------- | ----------------------- | | `--file` | `-f` | Path to .env file with key=value pairs | None (interactive mode) | | `--region` | | Region for storing secrets | None | ### Example Output ```bash $ videosdk agent secrets create my-secrets --file .env ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Creating Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Secret Name: my-secrets File: .env Secrets to be saved: - API_KEY: ****** - DATABASE_URL: ****** Confirm action ❯ Save secrets Cancel Saving secrets... Secrets saved successfully. ✓ Secret 'my-secrets' created successfully Do you want to add this secret to videosdk.yaml? [y/N]: y ✓ Secret ID saved to videosdk.yaml: ``` ### videosdk.yaml Structure When saved to `videosdk.yaml`, the secret ID is added under the `secrets` section: ```yaml secrets: env: ``` ### Examples ```bash # Create from .env file videosdk agent secrets create my-secrets --file .env # Create interactively (will prompt for key-value pairs) videosdk agent secrets create my-secrets # Create with specific region videosdk agent secrets create my-secrets --file .env --region in002 ``` ## Add Add new keys to an existing secret set. ### Usage ```bash videosdk agent secrets add ``` ### Example Output ```bash $ videosdk agent secrets add my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Adding to Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Adding secret... Enter key: NEW_API_KEY Enter value: ******** Add another secret? ❯ Yes No Secrets to be saved: - NEW_API_KEY: ****** Confirm action ❯ Save secrets Cancel Secret added successfully. ✓ Keys added to secret 'my-secrets' successfully ``` ## Remove Remove specific keys from a secret set.
### Usage ```bash videosdk agent secrets remove ``` ### Example Output ```bash $ videosdk agent secrets remove my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Removing Keys from Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Removing secret... Enter key: OLD_API_KEY Remove another key? ❯ Yes No Secret removed successfully. ``` ## Describe Show details of a secret set (keys only, values are hidden). ### Usage ```bash videosdk agent secrets describe ``` ### Example Output ```bash $ videosdk agent secrets describe my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Describing Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Name my-secrets Secret ID sec-abc123 Type env ┌──────────────────┬──────────┐ │ Key │ Value │ ├──────────────────┼──────────┤ │ OPENAI_API_KEY │ ****** │ │ DATABASE_URL │ ****** │ │ SECRET_TOKEN │ ****** │ └──────────────────┴──────────┘ ``` ## Delete Permanently delete a secret set. ### Usage ```bash videosdk agent secrets delete ``` ### Example Output ```bash $ videosdk agent secrets delete my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Deleting Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ✓ Secret 'my-secrets' deleted successfully ``` > This action is permanent and cannot be undone. All keys in the secret set will be deleted. ## Using Environment Secrets in Deployments Once you've created environment secrets, you can reference them when deploying your agent: ```bash videosdk agent deploy --image myrepo/myagent:v1 --env-secret my-secrets ``` --- --- title: Image Pull Secrets hide_title: false hide_table_of_contents: false toc_max_heading_level: 3 description: "Create and manage image pull secrets for private registries using the VideoSDK CLI." 
pagination_label: "Image Pull Secrets" keywords: - VideoSDK CLI - Secrets - Image Pull Secret - Container Registry - Agent Cloud image: img/videosdklive-thumbnail.jpg sidebar_label: Image Pull Secrets sidebar_position: 5 slug: image-pull-secrets --- # Image Pull Secrets Image pull secrets store container registry credentials, allowing Agent Cloud to pull images from private registries. ## Create Image Pull Secret Create an image pull secret for private container registries. ### Usage ```bash videosdk agent image-pull-secret [OPTIONS] ``` ### Arguments | Argument | Required | Description | | -------- | -------- | ------------------------------ | | `name` | **Yes** | Name for the image pull secret | ### Options | Option | Short | Required | Description | Default | | ------------ | ----- | -------- | --------------------------------- | ------- | | `--server` | | **Yes** | Registry server URL | — | | `--username` | `-u` | **Yes** | Registry username | — | | `--password` | `-p` | **Yes** | Registry password or access token | — | | `--region` | | No | Deployment region for the secret | `us002` | ::::note `--username` and `--password` are **container registry credentials**, not your VideoSDK login or cloud console password. :::: ## Obtaining Registry Credentials Depending on your registry provider, follow the steps below to get the correct username and password: ### AWS Elastic Container Registry (ECR) - **Account ID & Region**: Your server URL will be `.dkr.ecr..amazonaws.com`. - **Username**: Always use `AWS`. - **Password**: Generate a temporary token using the AWS CLI: ```bash aws ecr get-login-password --region ``` ### Azure Container Registry (ACR) - **Username**: Use the **Registry Name** (found in Azure Portal > ACR > Access Keys). - **Password**: Enable the **Admin user** in ACR settings and use one of the generated passwords. - *Alternatively*, use a **Service Principal** Application ID as the username and its Secret as the password. 
### Google Artifact Registry (GAR) - **Server URL**: `-docker.pkg.dev` - **Username**: Always use `_json_key`. - **Password**: The **content of your Service Account JSON key file**. ```bash # Example usage -p "$(cat keyfile.json)" ``` - **Permissions**: Ensure the service account has the `Artifact Registry Reader` role. ### Docker Hub - **Username**: Your Docker Hub username. - **Password**: Use a **Personal Access Token (PAT)** instead of your account password. - Go to **Account Settings** > **Security** > **New Access Token**. ### GitHub Container Registry (GHCR) - **Username**: Your GitHub username. - **Password**: A **Personal Access Token (classic)** with `read:packages` scope. --- ### What Happens 1. The CLI validates the required registry details provided via flags. 2. Credentials are securely stored on VideoSDK Cloud. 3. **Automatic Configuration**: The CLI prompts you to save the secret to your `videosdk.yaml` file. If confirmed, the secret name is automatically added under the `secrets` section. #### Example Output ```bash ✓ Image pull secret 'my-registry-secret' created successfully Do you want to add this secret to videosdk.yaml? 
[y/N]: y ✓ Secret Name saved to videosdk.yaml: my-registry-secret ``` #### videosdk.yaml Structure ```yaml secrets: image-pull: my-registry-secret ``` ### Examples ```bash # ECR (AWS) videosdk agent image-pull-secret my-ecr-secret \ --server 1234567890.dkr.ecr.ap-south-1.amazonaws.com \ -u AWS \ -p $(aws ecr get-login-password --region ap-south-1) # ACR (Azure) videosdk agent image-pull-secret my-acr-secret \ --server myregistry.azurecr.io \ -u myusername \ -p mypassword # Google Artifact Registry (GAR) videosdk agent image-pull-secret my-gcr-secret \ --server https://-docker.pkg.dev \ -u _json_key \ -p "$(cat keyfile.json)" \ --region us002 # Docker Hub videosdk agent image-pull-secret my-dockerhub-secret \ --server https://index.docker.io/v1/ \ -u myusername \ -p mypassword \ --region us002 ``` ## Using Image Pull Secrets in Deployments Once you've created an image pull secret, you can reference it when deploying your agent: ```bash videosdk agent deploy --image ghcr.io/myorg/myagent:v1 --image-pull-secret ghcr-secret ``` ## Next Steps: Registry-Specific Guides Follow these step-by-step guides for building, pushing, and deploying agents from popular registries to VideoSDK Cloud: - [Deploy from AWS ECR to VideoSDK Cloud](/ai_agents/deployments/agent-cloud/deployment-guides/ecr-to-videosdk-cloud) - [Deploy from Azure Container Registry (ACR) to VideoSDK Cloud](/ai_agents/deployments/agent-cloud/deployment-guides/acr-to-videosdk-cloud) - [Deploy from Google Container Registry (GCR) to VideoSDK Cloud](/ai_agents/deployments/agent-cloud/deployment-guides/gcr-to-videosdk-cloud) --- --- title: Initialize Agent hide_title: false hide_table_of_contents: false description: "Learn how to initialize a new AI agent deployment using the VideoSDK CLI. Set up your agent configuration and deployment settings." 
pagination_label: "CLI Initialize" keywords: - VideoSDK CLI - Initialize Agent - Agent Init - videosdk.yaml - Agent Cloud sidebar_position: 2 sidebar_label: Initialize slug: init-agent --- # Initialize Agent The `init` command sets up a new agent deployment by creating an agent and a deployment in VideoSDK cloud. It also generates a `videosdk.yaml` configuration file and a `Dockerfile` in your project directory. ## Initialize Create a new agent and deployment. ### Usage ```bash videosdk agent init --name my-agent ``` ### Options | Option | Short | Description | Default | | ----------- | ----- | ------------------------------------------------ | --------------------------- | | `--name` | `-n` | Name for your deployment | Auto-generated if not provided | | `--template`| `-t` | Template ID to use (e.g., Template01) | None | ### What Happens 1. **Cloud Creation**: The CLI communicates with VideoSDK cloud to create a new agent and a corresponding deployment. 2. **Config Generation**: A `videosdk.yaml` file is created in your current directory. This file contains the unique IDs for your agent and deployment. 3. **Dockerfile Generation**: A standard `Dockerfile` is automatically created, optimized for running VideoSDK AI agents. 4. **Project Setup**: Your local project is now linked to the cloud resources. ### Example Output ```bash $ videosdk agent init --name my-agent ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Initializing Deployment ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⠋ Initializing Deployment... 
✓ Deployment initialized successfully ℹ Next step: Build your agent Docker image videosdk agent build --image /: ``` ### videosdk.yaml Structure The generated `videosdk.yaml` file will look like this: ```yaml agent: id: ag_48bnvu name: agent-test deploy: id: ddv_h3b5sd ``` | Field | Description | | ---------- | ------------------------------------------------ | | `agent.id` | Unique identifier for your AI agent | | `agent.name`| Name of your agent | | `deploy.id`| Unique identifier for this specific deployment | ### Update Agent Code After the `videosdk.yaml` file is generated, you must update your agent's code with the `id` from the `agent` field. This links your local agent logic to the cloud resource. 1. Open your `videosdk.yaml` file and copy the `id` under the `agent` section. 2. In your Python agent code, set the `agent_id` in the `Options` class. **Example:** ```python if __name__ == "__main__": options = Options( agent_id="ag_48bnvu", # Use the id from your videosdk.yaml ) job = WorkerJob(entrypoint=start_session, jobctx=make_context, options=options) job.start() ``` ### Notes - You should run this command in the root of your agent's project directory. - The `videosdk.yaml` file should be committed to your version control system (e.g., Git). - If you already have a `videosdk.yaml` file, running `init` again will prompt you or might overwrite settings depending on the version. ## Next Steps After initializing your agent, the next step is to build and push your agent's Docker image. See the [Build & Push](./build-push) documentation for more details. --- --- title: Installation hide_title: false hide_table_of_contents: false description: "Get started with VideoSDK CLI. Learn how to install the CLI on Linux and macOS using curl or pip." 
pagination_label: "CLI Installation" keywords: - VideoSDK CLI - Installation - Install - curl - pip - Agent Cloud sidebar_position: 0 sidebar_label: Installation slug: installation --- # Installation To get started with VideoSDK Agent Cloud, you need to install the VideoSDK CLI. There are two ways to install it: ## Using pip You can install the VideoSDK CLI using `pip`, the Python package manager. ```bash pip install videosdk-cli ``` ## Using curl This is the quickest way to install the VideoSDK CLI on Linux and macOS. ```bash curl -fsSL https://videosdk.live/install | bash ``` ## Verify Installation Once installed, verify the installation by running the help command. ```bash videosdk --help ``` ## Next Steps After installing the CLI, the next step is to authenticate your account. See the [Authentication](./authentication) documentation for more details. --- --- title: Secrets Management hide_title: false hide_table_of_contents: false toc_max_heading_level: 3 description: "Learn how to manage environment secrets and image pull secrets for your AI agents using the VideoSDK CLI." pagination_label: "CLI Secrets" keywords: - VideoSDK CLI - Secrets - Environment Variables - Image Pull Secret - Container Registry - Agent Cloud image: img/videosdklive-thumbnail.jpg sidebar_position: 5 sidebar_label: Secrets slug: secrets --- # Secrets Management This section covers all CLI commands for managing secrets used by your AI agents on Agent Cloud. Secrets allow you to securely store sensitive configuration values like API keys, database credentials, and registry authentication. There are two types of secrets: - **Environment secrets**: key-value pairs injected as environment variables. - **Image pull secrets**: container registry credentials for pulling private images. ## Environment Secrets Environment secrets are key-value pairs that are securely injected as environment variables into your agent containers at runtime. ### List List all secret sets. 
#### Usage ```bash videosdk agent secrets list ``` #### Example Output ```bash $ videosdk agent secrets list ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Listing Secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ┌──────────────────┬─────────────────┬──────────┐ │ Name │ Secret ID │ Type │ ├──────────────────┼─────────────────┼──────────┤ │ my-secrets │ sec-abc123 │ env │ │ prod-credentials │ sec-xyz789 │ env │ └──────────────────┴─────────────────┴──────────┘ ✓ Secrets listed successfully ``` ### Create Create a new secret set. #### Usage ```bash videosdk agent secrets create [OPTIONS] ``` #### Options | Option | Short | Description | Default | | ---------- | ----- | -------------------------------------- | ----------------------- | | `--file` | `-f` | Path to .env file with key=value pairs | None (interactive mode) | | `--region` | | Region for storing secrets | None | #### Example Output ```bash $ videosdk agent secrets create my-secrets --file .env ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Creating Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Secret Name: my-secrets File: .env Secrets to be saved: - API_KEY: ****** - DATABASE_URL: ****** Confirm action ❯ Save secrets Cancel Saving secrets... Secrets saved successfully. ✓ Secret 'my-secrets' created successfully Do you want to add this secret to videosdk.yaml? [y/N]: y ✓ Secret ID saved to videosdk.yaml: ``` #### videosdk.yaml Structure When saved to `videosdk.yaml`, the secret ID is added under the `secrets` section: ```yaml secrets: env: ``` #### Examples ```bash # Create from .env file videosdk agent secrets create my-secrets --file .env # Create interactively (will prompt for key-value pairs) videosdk agent secrets create my-secrets # Create with specific region videosdk agent secrets create my-secrets --file .env --region in002 ``` ### Add Add new keys to an existing secret set. 
#### Usage ```bash videosdk agent secrets add ``` #### Example Output ```bash $ videosdk agent secrets add my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Adding to Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Adding secret... Enter key: NEW_API_KEY Enter value: ******** Add another secret? ❯ Yes No Secrets to be saved: - NEW_API_KEY: ****** Confirm action ❯ Save secrets Cancel Secret added successfully. ✓ Keys added to secret 'my-secrets' successfully ``` ### Remove Remove specific keys from a secret set. #### Usage ```bash videosdk agent secrets remove ``` #### Example Output ```bash $ videosdk agent secrets remove my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Removing Keys from Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Removing secret... Enter key: OLD_API_KEY Remove another key? ❯ Yes No Secret removed successfully. ``` ### Describe Show details of a secret set (keys only, values are hidden). #### Usage ```bash videosdk agent secrets describe ``` #### Example Output ```bash $ videosdk agent secrets describe my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Describing Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Name my-secrets Secret ID sec-abc123 Type env ┌──────────────────┬──────────┐ │ Key │ Value │ ├──────────────────┼──────────┤ │ OPENAI_API_KEY │ ****** │ │ DATABASE_URL │ ****** │ │ SECRET_TOKEN │ ****** │ └──────────────────┴──────────┘ ``` ### Delete Permanently delete a secret set. #### Usage ```bash videosdk agent secrets delete ``` #### Example Output ```bash $ videosdk agent secrets delete my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Deleting Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ✓ Secret 'my-secrets' deleted successfully ``` :::caution This action is permanent and cannot be undone. All keys in the secret set will be deleted. ::: ## Image Pull Secrets Image pull secrets store container registry credentials, allowing Agent Cloud to pull images from private registries. 
### Create Image Pull Secret Create an image pull secret for private container registries. #### Usage ```bash videosdk agent image-pull-secret [OPTIONS] ``` #### Arguments | Argument | Required | Description | | -------- | -------- | ------------------------------ | | `name` | **Yes** | Name for the image pull secret | #### Options | Option | Short | Required | Description | Default | | ------------------- | ----- | -------- | --------------------------------- | ------- | | `--server` | | **Yes** | Registry server URL | — | | `--username` | `-u` | **Yes** | Registry username | — | | `--password` | `-p` | **Yes** | Registry password or access token | — | | `--region` | | No | Deployment region for the secret | `us002` | ### What Happens 1. The CLI validates the required registry details provided via flags. 2. Credentials are securely stored and can be referenced in deployments. 3. **Automatic Configuration**: The CLI prompts you to save the secret to your `videosdk.yaml` file. If confirmed, the secret name is automatically added under the `secrets` section. #### Example Output ```bash ✓ Image pull secret 'my-registry-secret' created successfully Do you want to add this secret to videosdk.yaml? 
[y/N]: y ✓ Secret Name saved to videosdk.yaml: my-registry-secret ``` #### videosdk.yaml Structure ```yaml secrets: image-pull: my-registry-secret ``` #### Examples ```bash # ECR (AWS) videosdk agent image-pull-secret my-ecr-secret \ --server 1234567890.dkr.ecr.ap-south-1.amazonaws.com \ -u AWS \ -p $(aws ecr get-login-password --region ap-south-1) # ACR (Azure) videosdk agent image-pull-secret my-acr-secret \ --server myregistry.azurecr.io \ -u myusername \ -p mypassword # GCR (GCP) videosdk agent image-pull-secret my-gcr-secret \ --server https://-docker.pkg.dev \ -u _json_key \ -p "$(cat keyfile.json)" \ --region us002 # Docker Hub videosdk agent image-pull-secret my-dockerhub-secret \ --server https://index.docker.io/v1/ \ -u myusername \ -p mypassword \ --region us002 ``` ## Using Secrets in Deployments Once you've created secrets, you can reference them when deploying your agent: :::note In examples like `myrepo/myagent:v1`, `myrepo` is a placeholder for your Docker registry username (e.g., your Docker Hub username). Replace it with your actual username. 
::: ### Environment Secrets ```bash videosdk agent deploy --image myrepo/myagent:v1 --env-secret my-secrets ``` ### Image Pull Secrets ```bash videosdk agent deploy --image ghcr.io/myorg/myagent:v1 --image-pull-secret ghcr-secret ``` ### Combined Example ```bash videosdk agent deploy \ --image ghcr.io/myorg/myagent:v1 \ --env-secret prod-credentials \ --image-pull-secret ghcr-secret \ --min-replica 2 \ --max-replica 10 ``` ## Quick Reference | Command | Description | | ----------------------------------------- | --------------------------- | | `videosdk agent secrets list` | List all secret sets | | `videosdk agent secrets create ` | Create a new secret set | | `videosdk agent secrets add ` | Add keys to a secret | | `videosdk agent secrets remove ` | Remove keys from a secret | | `videosdk agent secrets describe ` | Show secret details | | `videosdk agent secrets delete ` | Delete a secret set | | `videosdk agent image-pull-secret ` | Create registry credentials | ## Best Practices 1. **Use .env files for bulk creation**: When you have many secrets, create a `.env` file and use `--file .env` 2. **Separate secrets by environment**: Create different secret sets for development, staging, and production 3. **Rotate secrets regularly**: Delete and recreate secrets periodically for security 4. **Use descriptive names**: Name your secrets clearly (e.g., `prod-api-keys`, `staging-db-creds`) --- --- title: Session Management hide_title: false hide_table_of_contents: false toc_max_heading_level: 3 description: "Learn how to start, stop, and list agent sessions using the VideoSDK CLI." pagination_label: "CLI Sessions" keywords: - VideoSDK CLI - Sessions - Agent Sessions - Room Management - Agent Cloud image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Sessions slug: sessions --- # Session Management This section covers all CLI commands for managing agent sessions on Agent Cloud. Sessions represent individual instances of your agent running in rooms. 
## Session Commands Control individual agent sessions - start agents in rooms and stop running sessions. ### Start Start an agent session in a room. #### Usage ```bash videosdk agent session start [OPTIONS] ``` #### Options | Option | Short | Description | Default | | -------------- | ----- | -------------------------------------------------- | -------------------- | | `--version-id` | `-v` | Version ID to use | Latest version | | `--room-id` | `-r` | Room ID to join (creates new room if not provided) | Auto-created | | `--agent-id` | `-a` | Agent ID | From `videosdk.yaml` | #### Example Output ```bash $ videosdk agent session start -v ver123 -r room-abc ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Starting Session ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⠋ Starting Session... ✓ Session started successfully Room ID room-abc ℹ Useful commands: View logs: videosdk agent version logs Stop session: videosdk agent session stop -r room-abc ``` #### Examples ```bash # Start with specific version and room videosdk agent session start -v ver123 -r room-abc # Start with specific version (creates new room) videosdk agent session start -v ver123 # Start with latest version in existing room videosdk agent session start -r room-abc # Start with latest version (creates new room) videosdk agent session start ``` ### Stop Stop an agent session. #### Usage ```bash videosdk agent session stop [OPTIONS] ``` #### Options | Option | Short | Description | Required | | -------------- | ----- | ------------------ | --------------------------- | | `--room-id` | `-r` | Room ID of session | **Yes** (or `--session-id`) | | `--session-id` | `-s` | Session ID to stop | **Yes** (or `--room-id`) | :::note Either `--room-id` or `--session-id` must be provided. ::: #### Example Output ```bash $ videosdk agent session stop -r room-abc ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Stopping Session ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⠋ Ending Session... 
✓ Session ended successfully ``` #### Examples ```bash # Stop by room ID videosdk agent session stop -r room-abc # Stop by session ID videosdk agent session stop -s session-123 ``` ## Sessions List View and filter all sessions for your agent. ### List List all sessions for an agent. #### Usage ```bash videosdk agent sessions list [OPTIONS] ``` #### Options | Option | Short | Description | Default | | -------------- | ----- | ----------------------------------------------------- | -------------------- | | `--agent-id` | | Agent ID | From `videosdk.yaml` | | `--version-id` | `-v` | Filter by Version ID | None | | `--room-id` | | Filter by Room ID | None | | `--session-id` | | Filter by Session ID | None | | `--page` | | Page number | `1` | | `--per-page` | | Items per page | `10` | | `--sort` | | Sort order: `1` (oldest first) or `-1` (newest first) | `-1` | #### Example Output ```bash $ videosdk agent sessions list ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Listing Sessions ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Agent ID abc123xyz Deployment ID dep-456 +------------+----------+---------------+---------+----------+ | Session ID | Room ID | Deployment ID | Status | Duration | +------------+----------+---------------+---------+----------+ | sess-001 | room-abc | dep-456 | running | 5m 30s | | sess-002 | room-xyz | dep-456 | ended | 12m 45s | | sess-003 | room-123 | dep-456 | ended | 3m 15s | +------------+----------+---------------+---------+----------+ ``` #### Examples ```bash # List all sessions videosdk agent sessions list # List sessions for specific agent videosdk agent sessions list --agent-id abc123 # Filter by version videosdk agent sessions list --version-id ver123 # Filter by room videosdk agent sessions list --room-id room-abc # Paginated listing videosdk agent sessions list --page 2 --per-page 20 # Sort oldest first videosdk agent sessions list --sort 1 ``` ## Quick Reference | Command | Description | | ------------------------------ | 
------------------------ | | `videosdk agent session start` | Start an agent in a room | | `videosdk agent session stop` | Stop an agent session | | `videosdk agent sessions list` | List all sessions | ## Workflow Example Here's a typical workflow for managing agent sessions: ```bash # 1. Start a session with your deployed version videosdk agent session start -v ver123 # 2. Check running sessions videosdk agent sessions list # 3. View logs for debugging videosdk agent logs -v ver123 # 4. Stop the session when done videosdk agent session stop -r room-abc ``` --- --- title: Up & Down Commands hide_title: false hide_table_of_contents: false toc_max_heading_level: 2 description: "Learn how to use the up and down commands to quickly build, push, and deploy your AI agents, or stop all running versions." pagination_label: "CLI Up & Down" keywords: - VideoSDK CLI - Agent Up - Agent Down - Deployment - Automation image: img/videosdklive-thumbnail.jpg sidebar_position: 5 sidebar_label: Up & Down slug: up-down --- # Up & Down Commands The `up` and `down` commands provide a streamlined way to manage your agent's lifecycle on Agent Cloud. ## Up The `up` command performs the **build**, **push**, and **deploy** actions in a single command, replacing three separate steps in the development-to-deployment workflow with one. 
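To make the relationship concrete, the sketch below lists the separate CLI invocations that a single `up` stands in for. It is an illustration under assumptions, not the CLI's actual implementation: `up_equivalent_steps` is a hypothetical helper, and it only mirrors the `build`, `push`, and `deploy` commands with the `--image` flag as they are used in the deployment guides. The `skip_build` and `skip_push` parameters correspond to the `--skip-build` and `--skip-push` options of `up`.

```python
from typing import List


def up_equivalent_steps(
    image: str, skip_build: bool = False, skip_push: bool = False
) -> List[List[str]]:
    """Return the separate commands that one `videosdk agent up` replaces (hypothetical helper)."""
    steps: List[List[str]] = []
    if not skip_build:
        steps.append(["videosdk", "agent", "build", "--image", image])
    if not skip_push:
        steps.append(["videosdk", "agent", "push", "--image", image])
    # Deploy is always included; it is the step that creates the new version.
    steps.append(["videosdk", "agent", "deploy", "--image", image])
    return steps


if __name__ == "__main__":
    for step in up_equivalent_steps("myrepo/myagent:v1"):
        print(" ".join(step))
```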
### Usage ```bash videosdk agent up [OPTIONS] ``` ### Options | Option | Short | Description | Default | | ------------ | ----- | ------------------------------------------------ | -------------------- | | `--image` | `-i` | Image name with tag (e.g., `myrepo/myagent:v1`) | From `videosdk.yaml` | | `--file` | `-f` | Path to Dockerfile | `./Dockerfile` | | `--server` | `-s` | Registry server URL | `docker.io` | | `--username` | `-u` | Registry username for authentication | None | | `--password` | `-p` | Registry password for authentication | None | | `--skip-build` | | Skip build step (use existing local image) | - | | `--skip-push` | | Skip push step (image already in registry) | - | ### What Happens 1. **Build**: The CLI builds the Docker image for the `linux/arm64` platform locally (unless `--skip-build` is used). 2. **Push**: The CLI pushes the image to your container registry (unless `--skip-push` is used). 3. **Deploy**: The CLI creates and activates a new version on Agent Cloud using the image. ### Example Output ```bash $ videosdk agent up --image myrepo/myagent:v1 ◆ Step 1/3 — Building Docker Image Platform linux/arm64 Image myrepo/myagent:v1 Dockerfile /path/to/your/project/Dockerfile ──────────────────────────────────────── [Docker build output...] ✓ Successfully built image: myrepo/myagent:v1 ◆ Step 2/3 — Pushing Docker Image Image myrepo/myagent:v1 Registry docker.io ──────────────────────────────────────── [Docker push output...] ✓ Pushed image: myrepo/myagent:v1 ◆ Step 3/3 — Deploying Agent Agent ID: ag_xxxxxx Image: myrepo/myagent:v1 Min Replicas: 1 Max Replicas: 3 Profile: cpu-small ──────────────────────────────────────────────────────────── Creating Version ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:01 ✓ Agent is up! 
Version ID: v_xxxxxx ℹ Check status: videosdk agent version status ℹ View logs: videosdk agent logs ℹ Take it down: videosdk agent down ``` ### Examples ```bash # Build, push, and deploy using defaults from videosdk.yaml videosdk agent up # Specify image and registry credentials videosdk agent up --image myrepo/myagent:v1 -u username -p password # Skip build and use existing image videosdk agent up --image myrepo/myagent:v1 --skip-build # Skip build and push, only deploy videosdk agent up --image myrepo/myagent:v1 --skip-build --skip-push ``` --- ## Down The `down` command deactivates **all running versions** of a specific agent. This is the quickest way to stop all active instances of your agent across all deployed versions. ### Usage ```bash videosdk agent down [OPTIONS] ``` ### Options | Option | Short | Description | Default | | --------- | ----- | ------------------------------------------ | ------- | | `--force` | | Force deactivate even with active sessions | No | | `--yes` | `-y` | Skip confirmation prompt | No | ### What Happens 1. **Version Identification**: The CLI identifies all active versions for the specified agent (from `videosdk.yaml`). 2. **Deactivation**: It sends a deactivation request for every active version. 3. **Session Transition**: Once deactivated, these versions will stop receiving new sessions. Existing sessions will continue until they finish (unless `--force` is used). ### Example Output ```bash $ videosdk agent down ◆ Bringing Agent Down Found 1 active version(s): • v_xxxxxx — my-agent's version Deactivate 1 version(s)? [y/N]: y Deactivating v_xxxxxx ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 ✓ Deactivated v_xxxxxx ✓ Agent is down. All versions deactivated. 
``` ### Examples ```bash # Graceful deactivation with confirmation videosdk agent down # Force deactivation (terminates active sessions) videosdk agent down --force # Skip confirmation prompt videosdk agent down -y ``` --- ## Next Steps - Learn more about individual [Build & Push](./build-push.md) commands. - Explore detailed [Deployment & Version](./deploy.md) management. --- --- title: Deploy from ACR to VideoSDK Cloud hide_title: false hide_table_of_contents: false toc_max_heading_level: 3 description: "Build, push, and deploy an AI agent image from Azure Container Registry (ACR) to VideoSDK Cloud." pagination_label: "ACR → VideoSDK Cloud" keywords: - VideoSDK CLI - Azure - ACR - Image Pull Secret - Agent Cloud image: img/videosdklive-thumbnail.jpg sidebar_label: ACR → VideoSDK Cloud sidebar_position: 2 slug: acr-to-videosdk-cloud --- # Deploy from Azure Container Registry (ACR) to VideoSDK Cloud This guide walks you through building, pushing, and deploying an agent container image stored in **Azure Container Registry (ACR)** to **VideoSDK Agent Cloud**. You will: 1. Build your Docker image with the VideoSDK CLI. 2. Push the image to Azure Container Registry. 3. Create an image pull secret in VideoSDK Cloud for ACR. 4. Deploy your agent using the ACR image and image pull secret. ## Prerequisites - A working agent project with a `Dockerfile`. - An Azure account with access to an ACR registry. - Azure CLI installed and logged in. - VideoSDK CLI installed and authenticated. ## 1. Build Image with VideoSDK CLI Use the `videosdk agent build` command to build your Docker image. ```bash videosdk agent build \ --image myregistry.azurecr.io/my-agent:latest ``` > Replace `myregistry` with your ACR registry name and `my-agent` with your repository name. ## 2. Push Image to ACR Log in to ACR and push the built image. 
```bash # Login to ACR az acr login --name myregistry # Push using VideoSDK CLI videosdk agent push \ --image myregistry.azurecr.io/my-agent:latest \ --server myregistry.azurecr.io ``` ## 3. Create Image Pull Secret for ACR To allow VideoSDK Cloud to pull your private image, you need to create an image pull secret with your ACR credentials. ### Obtaining ACR Credentials You can use either an **Admin User** or a **Service Principal**: 1. **Admin User (Simplest)**: - Go to the **Azure Portal** > **Container Registries** > Select your registry. - Under **Settings**, select **Access keys**. - Ensure **Admin user** is enabled. - Use the **Registry name** as `-u` (username) and one of the **passwords** as `-p`. 2. **Service Principal (Recommended)**: - Create a Service Principal with `AcrPull` permissions. - Use the **Application (client) ID** as `-u` (username) and the **Client Secret** as `-p` (password). ### Create the Secret ```bash videosdk agent image-pull-secret my-acr-secret \ --server myregistry.azurecr.io \ -u myusername \ -p mypassword ``` > Replace `myusername` and `mypassword` with the values obtained above. ## 4. Deploy Agent Using ACR Image Now deploy your agent, referencing both the ACR image and the image pull secret you just created. ```bash videosdk agent deploy \ --image myregistry.azurecr.io/my-agent:latest \ --image-pull-secret my-acr-secret \ --min-replica 1 \ --max-replica 3 ``` Your agent will now run on VideoSDK Cloud using the image stored in Azure Container Registry. --- --- title: Deploy from ECR to VideoSDK Cloud hide_title: false hide_table_of_contents: false toc_max_heading_level: 3 description: "Build, push, and deploy an AI agent image from AWS ECR to VideoSDK Cloud." 
pagination_label: "ECR → VideoSDK Cloud" keywords: - VideoSDK CLI - AWS - ECR - Image Pull Secret - Agent Cloud image: img/videosdklive-thumbnail.jpg sidebar_label: ECR → VideoSDK Cloud sidebar_position: 1 slug: ecr-to-videosdk-cloud --- # Deploy from AWS ECR to VideoSDK Cloud This guide walks you through building, pushing, and deploying an agent container image stored in **AWS Elastic Container Registry (ECR)** to **VideoSDK Agent Cloud**. You will: 1. Build your Docker image with the VideoSDK CLI. 2. Push the image to AWS ECR. 3. Create an image pull secret in VideoSDK Cloud for ECR. 4. Deploy your agent using the ECR image and image pull secret. ## Prerequisites - A working agent project with a `Dockerfile`. - AWS account with permissions to create and push images to ECR. - AWS CLI installed and configured (`aws configure`). - VideoSDK CLI installed and authenticated. ## 1. Build Image with VideoSDK CLI Use the `videosdk agent build` command to build your Docker image. ```bash videosdk agent build \ --image 1234567890.dkr.ecr.ap-south-1.amazonaws.com/my-agent:latest ``` > Replace `1234567890` with your AWS account ID, `ap-south-1` with your ECR region, and `my-agent` with your repository name. ## 2. Push Image to ECR Authenticate to ECR and push the built image to your ECR repository. ```bash # Login to ECR aws ecr get-login-password --region ap-south-1 \ | docker login \ --username AWS \ --password-stdin 1234567890.dkr.ecr.ap-south-1.amazonaws.com # Push using VideoSDK CLI videosdk agent push \ --image 1234567890.dkr.ecr.ap-south-1.amazonaws.com/my-agent:latest ``` ## 3. Create Image Pull Secret for ECR To allow VideoSDK Cloud to pull your private image, you need to create an image pull secret with your ECR credentials. ### Obtaining ECR Details 1. **Server URL**: Your ECR server URL follows the format: `.dkr.ecr..amazonaws.com`. - **Account ID**: Find your **Account ID** in the top-right corner of the **AWS Management Console**. 
   - **Region**: The **Region code** (e.g., `us-east-1` or `ap-south-1`) where your ECR repository is located.
2. **Username**: Always use `AWS` for ECR.
3. **Password**: A temporary authorization token generated by the AWS CLI. ECR tokens expire after 12 hours, so generate a fresh one when you create the secret.

### Create the Secret

```bash
videosdk agent image-pull-secret my-ecr-secret \
  --server 1234567890.dkr.ecr.ap-south-1.amazonaws.com \
  -u AWS \
  -p "$(aws ecr get-login-password --region ap-south-1)"
```

> Ensure you have the AWS CLI configured (`aws configure`) with permissions to access ECR. Replace the account ID and region with your own.

## 4. Deploy Agent Using ECR Image

Now deploy your agent, referencing both the ECR image and the image pull secret you just created.

```bash
videosdk agent deploy \
  --image 1234567890.dkr.ecr.ap-south-1.amazonaws.com/my-agent:latest \
  --image-pull-secret my-ecr-secret \
  --min-replica 1 \
  --max-replica 3
```

Your agent will now run on VideoSDK Cloud using the image stored in AWS ECR.

---

---
title: Deploy from GCR (Artifact Registry) to VideoSDK Cloud
hide_title: false
hide_table_of_contents: false
toc_max_heading_level: 3
description: "Build, push, and deploy an AI agent image from Google Artifact Registry (GAR) to VideoSDK Cloud."
pagination_label: "GCR → VideoSDK Cloud"
keywords:
  - VideoSDK CLI
  - Google Cloud
  - GCR
  - Artifact Registry
  - Google Artifact Registry
  - Image Pull Secret
  - Agent Cloud
image: img/videosdklive-thumbnail.jpg
sidebar_label: GCR → VideoSDK Cloud
sidebar_position: 3
slug: gcr-to-videosdk-cloud
---

# Deploy from GCR (Artifact Registry) to VideoSDK Cloud

This guide walks you through building, pushing, and deploying an agent container image stored in **Google Artifact Registry** to **VideoSDK Agent Cloud**.

## Prerequisites

- A working AI agent project.
- A Google Cloud project with the **Artifact Registry API** enabled.
- `gcloud` CLI installed and authenticated.
- VideoSDK CLI installed and authenticated.

## 1. Set Up Google Cloud Credentials

To allow VideoSDK Cloud to pull images from your private Artifact Registry, you need a Service Account with the correct permissions.

> In the commands below, replace `<PROJECT_ID>` with your Google Cloud project ID and `<LOCATION>` with your repository location (e.g., `us-central1`).

### Create a Service Account

```bash
# Create service account
gcloud iam service-accounts create videosdk-ar-puller \
  --display-name "VideoSDK Artifact Registry Puller"
```

### Grant Artifact Registry Reader Role

```bash
# Grant Artifact Registry Reader role at the project level
gcloud projects add-iam-policy-binding <PROJECT_ID> \
  --member="serviceAccount:videosdk-ar-puller@<PROJECT_ID>.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"
```

### Generate JSON Key

```bash
# Generate JSON key and save to keyfile.json
gcloud iam service-accounts keys create keyfile.json \
  --iam-account videosdk-ar-puller@<PROJECT_ID>.iam.gserviceaccount.com
```

> **Warning**: Keep `keyfile.json` secure and do not commit it to version control.

---

## 2. Create and Configure Repository

Artifact Registry organizes images into repositories. It is recommended to create a dedicated repository for your VideoSDK workers.

### Create Repository

```bash
# Create a single-purpose repository
gcloud artifacts repositories create videosdk-worker \
  --repository-format=docker \
  --location=<LOCATION> \
  --description="Docker repository for VideoSDK Worker"
```

### Grant Repository-Specific Access (Optional)

If you prefer more granular access control, grant read-only access to the specific repository instead of the entire project:

```bash
gcloud artifacts repositories add-iam-policy-binding videosdk-worker \
  --location=<LOCATION> \
  --member="serviceAccount:videosdk-ar-puller@<PROJECT_ID>.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"
```

---

## 3. Build and Push Image

Before pushing, configure Docker to authenticate with your regional registry.
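The regional registry host is derived from the repository location, and image references are built on top of it. As a minimal sketch of that naming scheme (the helper names `registry_host` and `image_ref` are hypothetical and not part of the VideoSDK CLI or `gcloud`):

```python
def registry_host(location: str) -> str:
    """Return the Artifact Registry Docker host for a location such as us-central1."""
    return f"{location}-docker.pkg.dev"


def image_ref(location: str, project_id: str, repository: str, image: str, tag: str) -> str:
    """Build a full image reference: host/project/repository/image:tag."""
    return f"{registry_host(location)}/{project_id}/{repository}/{image}:{tag}"


if __name__ == "__main__":
    # Illustrative values only; substitute your own location and project ID.
    print(registry_host("us-central1"))
    print(image_ref("us-central1", "my-proj", "videosdk-worker", "my-agent", "v1"))
```

The host is what Docker authenticates against, while the full reference is what you pass to `--image` when building, pushing, and deploying.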
### Authenticate Docker

```bash
# Replace <LOCATION> with your repository location (e.g., us-central1)
gcloud auth configure-docker <LOCATION>-docker.pkg.dev
```

### Build and Push

Update your `videosdk.yaml` or use the CLI flags to specify the target image path.

```bash
# Build the image
videosdk agent build --image <LOCATION>-docker.pkg.dev/<PROJECT_ID>/videosdk-worker/my-agent:v1

# Push the image
videosdk agent push --image <LOCATION>-docker.pkg.dev/<PROJECT_ID>/videosdk-worker/my-agent:v1
```

---

## 4. Create Image Pull Secret

Create the secret in VideoSDK Cloud using the service account key.

```bash
videosdk agent image-pull-secret my-gcr-secret \
  --server https://<LOCATION>-docker.pkg.dev \
  -u _json_key \
  -p "$(cat keyfile.json)" \
  --region us002
```

---

## 5. Deploy Agent

If you have already pushed the image, use `videosdk agent deploy`. The `videosdk agent up` command offers a streamlined workflow that handles the final build-push-deploy sequence in one step.

```bash
# Deploy using the GAR image and secret
videosdk agent deploy \
  --image <LOCATION>-docker.pkg.dev/<PROJECT_ID>/videosdk-worker/my-agent:v1 \
  --image-pull-secret my-gcr-secret
```

Alternatively, use the shortcut:

```bash
videosdk agent up --image <LOCATION>-docker.pkg.dev/<PROJECT_ID>/videosdk-worker/my-agent:v1
```

---

## Summary of URL Format

| Field      | Format                                                              | Example                                              |
| ---------- | ------------------------------------------------------------------- | ---------------------------------------------------- |
| **Server** | `https://<LOCATION>-docker.pkg.dev`                                 | `https://us-central1-docker.pkg.dev`                 |
| **Image**  | `<LOCATION>-docker.pkg.dev/<PROJECT_ID>/<REPOSITORY>/<IMAGE>:<TAG>` | `us-central1-docker.pkg.dev/my-proj/worker/agent:v1` |

---

---
title: Introduction
hide_title: false
hide_table_of_contents: false
description: "Learn the fundamental terminology and concepts of Agent Cloud deployment, including what Agent Cloud is, how deployments work, and understanding versioning with replicas and resource profiles."
pagination_label: "Introduction" keywords: - AI Agent SDK - VideoSDK Agents - Agent Cloud - Deployment - Version - Replicas - Cloud Infrastructure - Low Code - CLI Deployment image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Introduction slug: introduction --- ## What is Agent Cloud? **Agent Cloud** is VideoSDK's fully managed cloud infrastructure for deploying and running AI voice agents. It abstracts away all the complexity of server management, scaling, and maintenance, allowing you to focus entirely on building your agent logic. Agent Cloud supports two deployment workflows: ### Low-Code Deployment (UI-Based) For users who prefer a visual approach, Agent Cloud provides a **low-code interface** where you can: - Design your AI agent directly from the VideoSDK dashboard - Configure agent behavior, prompts, and integrations through the UI - Deploy with a single click – no coding required This approach is ideal for rapid prototyping, non-technical users, or teams that want to iterate quickly without writing deployment scripts. ### Developer Deployment (CLI-Based) For developers who build custom AI voice agents using the **VideoSDK Pipeline**, Agent Cloud provides a **CLI-based deployment** workflow: - Develop your AI voice agent using the VideoSDK Agents Python SDK - Use the VideoSDK CLI to package and deploy your agent to the cloud - Manage deployments, versions, and configurations programmatically This approach gives developers full control over their agent code while leveraging the managed infrastructure benefits of Agent Cloud. :::info Check out the [CLI Installation Guide](./cli/installation) to get started with deploying your agents to Agent Cloud. ::: ### Agent Cloud Architecture A single deployment can have **multiple running versions** simultaneously, allowing you to manage and update your agents with flexibility. ![Agent Cloud Architecture](https://assets.videosdk.live/images/cloud-deployment.png) --- ## What is a Deployment? 
A **Deployment** represents a managed instance of your AI agent running on VideoSDK's cloud infrastructure. When you deploy an agent to Agent Cloud, VideoSDK handles: - **Infrastructure Provisioning**: Automatically allocates compute resources - **Load Balancing**: Distributes incoming requests across available replicas - **Health Monitoring**: Continuously monitors agent health and restarts failed instances - **Scaling**: Automatically scales replicas based on demand within configured limits Each deployment is identified by a unique name and contains configuration for how your agent should be run, scaled, and managed. --- ## What is a Version? A **Version** represents a specific release of your AI agent within a deployment. Each time you update your agent code or configuration and deploy it, a new version is created. ### Version Configuration Every version includes the following configurable parameters: | Parameter | Description | | ---------------- | -------------------------------------------------------------------------------------------------------------------------------- | | **Min Replicas** | The minimum number of agent instances that should always be running. This ensures baseline availability even during low traffic. | | **Max Replicas** | The maximum number of agent instances that can be scaled up to during high demand. This caps your resource usage and costs. | | **Profile** | The compute resource profile that defines CPU and memory allocation for each replica. 
| ### Resource Profiles Agent Cloud offers predefined resource profiles to match your agent's computational requirements: | Profile | Description | Best For | | -------------- | -------------------------------------------------------------------- | ----------------------------------------------------- | | **cpu-small** | Lightweight compute resources with minimal CPU and memory allocation | Simple agents, low-traffic applications | | **cpu-medium** | Balanced compute resources suitable for most production workloads | Standard agents, moderate traffic | | **cpu-large** | High-performance compute resources with increased CPU and memory | Complex agents, high-traffic, compute-intensive tasks | ### Deployment Regions Agent Cloud is available in multiple regions to ensure low latency and compliance with data residency requirements: | Region | Location | Description | | --------- | ------------- | ---------------------------------------------- | | **in002** | India | Optimized for users in the Indian subcontinent | | **us002** | United States | Optimized for users in North America (default) | :::note If no region is specified during deployment, **us002** (United States) is used as the default region. ::: Choose a region closest to your users for the best performance. You can specify the region when deploying your agent using the `--region` flag: ```bash videosdk agent deploy --image myrepo/myagent:v1 --region in002 ``` :::note In examples like `myrepo/myagent:v1`, `myrepo` is a placeholder for your Docker registry username (e.g., your Docker Hub username). ::: ### Replica Scaling Replicas are individual instances of your agent running within a version. Agent Cloud automatically manages replicas based on your configuration: - **Minimum Replicas (`minReplica`)**: Guarantees this many instances are always running, ensuring your agent is ready to handle requests without cold start delays. - **Maximum Replicas (`maxReplica`)**: Sets the upper limit for scaling. 
When traffic increases, Agent Cloud automatically spins up additional replicas up to this limit. **Example Configuration:** ``` Min Replicas: 2 Max Replicas: 10 Profile: cpu-medium ``` In this example, your agent will always have at least 2 instances running but can scale up to 10 instances during peak demand, each using medium-tier compute resources. ## Summary | Term | Definition | | ---------------- | ----------------------------------------------------------------------------------------------------- | | **Agent Cloud** | VideoSDK's managed cloud platform for deploying AI voice agents | | **Deployment** | A managed instance of your agent on Agent Cloud, capable of running multiple versions | | **Version** | A specific release of your agent within a deployment, with its own scaling and resource configuration | | **Replica** | An individual running instance of your agent within a version | | **Min Replicas** | Minimum number of agent instances always running | | **Max Replicas** | Maximum number of agent instances during peak scaling | | **Profile** | Compute resource tier (cpu-small, cpu-medium, cpu-large) for each replica | | **Region** | Geographic location for deployment (in002 for India, us002 for US) | Understanding these concepts is essential for effectively deploying and managing your AI agents on Agent Cloud. In the following guides, we'll explore how to create deployments, manage versions, and configure scaling for your specific use case. --- --- --- title: Agent Cloud-v1 (Managed) hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." 
pagination_label: "Deploy Your Agents"
keywords:
  - AI Agent SDK
  - VideoSDK Agents
  - Introduction
  - Python SDK
  - AI Integration
  - VideoSDK Cloud
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Agent Cloud-v1 (Managed)
slug: agent-cloud-v1
---

# Agent Cloud-v1

This guide shows you how to deploy AI Agents with the [videosdk-agents](https://pypi.org/project/videosdk-agents/) Python package.

Once your AI Agent is ready to use, you need to create an AI Deployment. The AI Deployment is responsible for running your AI Agent. Before proceeding, ensure you have completed the steps under **Prerequisites**.

## Prerequisites

To deploy your AI Deployment, make sure you have:

- Created an AI Deployment using the [Create AI Deployment API](/api-reference/agent-cloud/create-deployment).
- A VideoSDK authentication token (generate one from the [VideoSDK Dashboard](https://app.videosdk.live)).

## YAML Configuration

Create a `videosdk.yaml` file with the following structure:

```yaml
version: "1.0"
deployment:
  id: your_ai_deployment_id
  entry:
    path: entry_point_for_deployment
env:
  # Optional to run your agent locally
  path: "./.env"
secrets:
  VIDEOSDK_AUTH_TOKEN: your_auth_token
deploy:
  cloud: true
```

### Field Descriptions

| Field                         | Description                                                                                                                            |
| ----------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `deployment.id`               | The `deploymentId` obtained from the [Create AI Deployment API](/api-reference/agent-cloud/create-deployment).                         |
| `deployment.entry.path`       | Path to the entry point script for your AI Deployment.                                                                                 |
| `env.path`                    | Path to your `.env` file, used only when running the agent locally.                                                                    |
| `secrets.VIDEOSDK_AUTH_TOKEN` | Your VideoSDK auth token (required for deployment).                                                                                    |
| `deploy.cloud`                | Set to `true` to allow the deploy command to deploy to VideoSDK Cloud. Set to `false` to guard against accidental deploys.             |

## CLI Commands

- ###### Run the AI Deployment locally for testing.

```bash
videosdk run
```

- ###### Deploy the AI Deployment.

```bash
videosdk deploy
```

## Next Steps

After deploying your AI Deployment, you can start using it by:

1. Creating a new session using the [Start Session API](/api-reference/agent-cloud/start-session)
2. Ending the session using the [End Session API](/api-reference/agent-cloud/end-session)

---

---
title: Agents deployments
hide_title: false
hide_table_of_contents: false
description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions."
pagination_label: "Introduction to deployments"
keywords:
  - AI Agent SDK
  - VideoSDK Agents
  - Introduction
  - Python SDK
  - AI Integration
  - VideoSDK Cloud
  - Deployments
  - Worker
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Overview
slug: introduction
---

# Deployments

### Overview

The VideoSDK Agents framework provides multiple deployment options to run your AI agents in production environments. Understanding these options helps you choose the right deployment strategy for your specific use case.

VideoSDK Agents supports two primary deployment modes:

1. **Agent Cloud (Managed)** - Fully managed deployment hosted on VideoSDK infrastructure
2. **Self-Hosting** - Self-managed deployment on your own infrastructure (EC2, Docker, Kubernetes, etc.)

### [Agent Cloud (Hosted on Our Infrastructure)](./agent-cloudv1.md)

Agent Cloud is a fully managed service that handles the deployment, scaling, and maintenance of your AI agents.
When you deploy to Agent Cloud: - **Zero Infrastructure Management**: No need to manage servers, containers, or scaling - **Automatic Scaling**: Built-in load balancing and auto-scaling capabilities - **High Availability**: Redundant infrastructure with automatic failover - **Managed Updates**: Automatic security patches and framework updates - **Global Distribution**: Agents deployed across multiple regions for low latency - **Built-in Monitoring**: Integrated metrics, logging, and health monitoring **Best for**: Teams that want to focus on agent development rather than infrastructure management, or applications with variable traffic patterns. ### [Self-Hosting (EC2, Docker, or Custom Infrastructure)](./self-hosting/understanding-worker.md) Self-hosting gives you complete control over your deployment environment and infrastructure. When self-hosting: - **Full Control**: Complete control over hardware, networking, and configuration - **Custom Integrations**: Ability to integrate with existing infrastructure and tools - **Cost Optimization**: Potential cost savings for high-volume, predictable workloads - **Compliance**: Meet specific security, compliance, or data residency requirements - **Custom Scaling**: Implement your own scaling strategies and resource management **Best for**: Organizations with existing infrastructure, specific compliance requirements, or predictable high-volume workloads. 
### When to Choose Agent Cloud vs Self-Hosting #### Choose Agent Cloud when: - You want to get started quickly without infrastructure setup - You have variable or unpredictable traffic patterns - You need global distribution and low latency - You want automatic scaling and high availability - You prefer a managed service with built-in monitoring #### Choose Self-Hosting when: - You need to meet specific compliance or security requirements - You have predictable, high-volume workloads where cost optimization is important - You require custom integrations with existing systems - You need complete control over the deployment environment ### Common Terminology Understanding these key terms will help you navigate the deployment documentation: | Term | Definition | | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | **Agent** | Your AI application built using the VideoSDK Agents framework. An agent can handle voice conversations, process audio, and respond with synthesized speech. | | **Worker** | A runtime component that executes your agent code. Workers can run in different environments (Agent Cloud or self-hosted) and handle job assignments from the backend registry system. | | **Backend Registry** | The central service that manages worker registration, job assignment, and load balancing. Workers connect to this registry to receive job assignments and report their status. | | **Job** | A single execution instance of your agent. When a user starts a conversation, the backend registry assigns a job to an available worker. | | **JobContext** | The execution context for a job, containing room configuration, pipeline setup, and session management. This is the main interface your agent code interacts with. 
| | **Worker Registration** | The process by which self-hosted workers register themselves with the VideoSDK backend registry to receive job assignments. | | **Load Threshold** | A configuration parameter that determines when a worker is considered "at capacity" and should not receive new job assignments. | | **Health Check** | Regular monitoring of worker status to ensure they're available and functioning correctly. Workers provide health endpoints for monitoring. | | **Resource Management** | The system for managing worker resources including process/thread allocation, memory limits, and concurrent job handling. | | **Session Management** | Handles the lifecycle of agent sessions including automatic session ending, timeouts, and cleanup. | | **Horizontal Scaling** | The manual process of deploying additional worker instances to handle increased load (requires manual deployment of new worker instances). | | **Vertical Scaling** | The automatic scaling within a single worker up to its configured maximum capacity (`max_processes`). | | **Dispatch API** | A REST API endpoint that allows you to dynamically dispatch agents to meetings on-demand. | | **AI Deployment** | The deployment configuration that runs your AI agent, either in Agent Cloud or self-hosted environments. | This terminology will be referenced throughout the deployment documentation as we explore specific deployment scenarios and configurations. --- --- title: Dispatch Agents hide_title: false hide_table_of_contents: false description: "Dynamically dispatch AI agents to meetings using the VideoSDK API." pagination_label: "Dispatch Agents" keywords: - AI Agent SDK - VideoSDK Agents - Dispatch API - Agent Assignment image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Dispatch Agents slug: dispatch-agents --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Dispatch Agents Dynamically assign your AI agents to meetings using the VideoSDK dispatch API. 
This API supports dispatching for both self-hosted agents created with the Agents SDK and agents managed through the VideoSDK dashboard (Agent Runtime). ## How It Works 1. **Your app** calls the dispatch API 2. **VideoSDK backend** finds an available server 3. **Server spawns a job/process** to join the meeting 4. **Agent starts** and begins processing in the meeting ## API Usage ### Endpoint ```bash POST https://api.videosdk.live/v2/agent/dispatch ``` ### Request Body Parameters | Parameter | Type | Required | Description | | :---------- | :----- | :------- | :--------------------------------------------------------------------------------------------------------------------------------------- | | meetingId | string | Yes | The ID of the meeting to which the agent should be dispatched. | | agentId | string | Yes | The ID of the agent to dispatch. | | metadata | object | No | Optional metadata to pass to the agent, such as variables. | | versionId | string | No | The specific version of a dashboard-managed agent to dispatch. If omitted, the latest deployed version is used. Not for self-hosted agents. | ### Example Request ```bash curl -X POST "https://api.videosdk.live/v2/agent/dispatch" \ -H "Authorization: YOUR_VIDEOSDK_AUTH_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "meetingId": "xxxx-xxxx-xxxx", "agentId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", "metadata": { "variables":[ { "name":"fname", "value":"john" } ] }, "versionId":"abcd-abcd-abcd-abcd" }' ``` ### Responses **On Success** A successful request will return a confirmation that the dispatch has been initiated. ```json { "message": "Agent dispatch requested successfully.", "data": { "success": true, "status": "assigned", "roomId": "xxxx-xxxx-xxxx", "agentId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" } } ``` **On Error** If the dispatch fails, you will receive one of the following error messages: This error occurs when no servers and agents are configured to handle the request. 
```json
{
  "message": "No workers available"
}
```

This error is specific to **self-hosted (Agents SDK) agents**. It means that while the `agentId` is valid, no server has registered for that `agentId`.

```json
{
  "message": "No workers have registered with agentId 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'"
}
```

This error is specific to **dashboard-managed agents**. It indicates that the agent exists but has no deployed version available for dispatch, or that the specific version requested is not deployed.

```json
{
  "message": "No agent is deployed with agentId 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'"
}
```

## Dispatching Your Agent

The prerequisites for dispatching an agent depend on how it was created.

### For Self-Hosted Agents (Agents SDK)

If you created your agent using the Python Agents SDK, you are responsible for hosting it. Your server must be:

1. **Registered**: The server must be configured with `register=True`.
2. **Connected**: The server must be running and connected to the VideoSDK backend.
3. **Available**: The server must have the capacity to handle new jobs.

The `versionId` parameter is not applicable in this scenario.

**Server Configuration Example**

```python
from videosdk.agents import Options

options = Options(
    agent_id="MyAgent",   # Must match agentId in API call
    register=True,        # Required for dispatch
    max_processes=10,
    load_threshold=0.75,
)
```

### For Dashboard-Managed Agents (Agent Runtime)

If you created your agent using the dashboard interface, VideoSDK manages the hosting for you. The only prerequisite is that your agent must be **deployed**.

- You can deploy your agent via the dashboard.
- You can use the optional `versionId` parameter in your dispatch request to specify which deployed version of the agent to use.
- If `versionId` is not provided, the **latest deployed version** will be dispatched by default.
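Because each failure mode above calls for a different fix, it can help to map the response body to a short label before deciding what to do next. The following is a rough sketch, assuming the response has already been decoded into a dictionary and that the message texts match the samples shown above; the function name and the returned labels are hypothetical, not part of the API.

```python
from typing import Any, Dict

# Message prefixes copied from the documented error responses (assumed stable).
_NO_WORKERS = "No workers available"
_NOT_REGISTERED = "No workers have registered with agentId"
_NOT_DEPLOYED = "No agent is deployed with agentId"


def classify_dispatch_response(body: Dict[str, Any]) -> str:
    """Map a decoded dispatch response body to a short outcome label."""
    data = body.get("data") or {}
    if data.get("success") is True:
        return "assigned"

    message = str(body.get("message", ""))
    if message.startswith(_NOT_REGISTERED):
        # Self-hosted agent: start a worker configured with register=True.
        return "self_hosted_not_registered"
    if message.startswith(_NOT_DEPLOYED):
        # Dashboard-managed agent: deploy the agent or the requested version.
        return "dashboard_agent_not_deployed"
    if message.startswith(_NO_WORKERS):
        return "no_workers_available"
    return "unknown"


if __name__ == "__main__":
    print(classify_dispatch_response({"message": "No workers available"}))
```

Matching on message prefixes is brittle if the wording changes, so treat the `unknown` branch as the place to log the raw body.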
## Code Examples

```python
import requests


def dispatch_agent(auth_token, meeting_id, agent_id, metadata=None, version_id=None):
    url = "https://api.videosdk.live/v2/agent/dispatch"
    headers = {
        "Authorization": auth_token,
        "Content-Type": "application/json",
    }
    payload = {
        "meetingId": meeting_id,
        "agentId": agent_id,
    }
    # Optional fields are only sent when provided
    if metadata:
        payload["metadata"] = metadata
    if version_id:
        payload["versionId"] = version_id

    # A timeout keeps the call from hanging indefinitely
    response = requests.post(url, headers=headers, json=payload, timeout=10)
    return response.json()


# Usage
result = dispatch_agent("your-token", "room-123", "MyAgent")
```

```javascript
async function dispatchAgent(authToken, meetingId, agentId, metadata, versionId) {
  const url = "https://api.videosdk.live/v2/agent/dispatch";
  const headers = {
    Authorization: authToken,
    "Content-Type": "application/json",
  };
  const body = {
    meetingId,
    agentId,
  };
  // Optional fields are only sent when provided
  if (metadata) {
    body.metadata = metadata;
  }
  if (versionId) {
    body.versionId = versionId;
  }

  const response = await fetch(url, {
    method: "POST",
    headers,
    body: JSON.stringify(body),
  });
  return response.json();
}

// Usage
dispatchAgent("your-token", "room-123", "MyAgent");
```

---

---
title: AWS EC2 Deployment
hide_title: false
hide_table_of_contents: false
description: "Deploy your VideoSDK AI Agent on AWS EC2 with minimal setup."
pagination_label: "AWS EC2 Deployment"
keywords:
  - AI Agent SDK
  - VideoSDK Agents
  - AWS EC2
  - Self Hosting
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: AWS EC2
slug: aws-ec2
---

# AWS EC2

Deploy your VideoSDK AI Agent Worker on AWS EC2 instances.

## Prerequisites

- AWS account
- SSH key pair
- VideoSDK authentication token

## Quick Setup

### 1. Launch EC2 Instance

```bash
aws ec2 run-instances \
  --image-id ami-0c02fb55956c7d316 \
  --instance-type t3.medium \
  --key-name your-key-pair \
  --security-group-ids sg-xxxxxxxxx \
  --user-data file://user-data.sh
```

### 2. User Data Script

```bash
#!/bin/bash
yum update -y
yum install -y python3 python3-pip git

# Clone private repository with token
git clone https://YOUR_TOKEN@github.com/your-org/your-agent.git /opt/agent
cd /opt/agent

# Install dependencies
pip3 install -r requirements.txt

# Create systemd service
cat > /etc/systemd/system/agent-worker.service << EOF
[Unit]
Description=VideoSDK Agent Worker
After=network.target

[Service]
Type=simple
User=ec2-user
WorkingDirectory=/opt/agent
Environment=VIDEOSDK_AUTH_TOKEN=your_auth_token
ExecStart=/usr/bin/python3 main.py
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Start the service
systemctl enable agent-worker
systemctl start agent-worker
```

### 3. Security Group

Configure your security group with these rules:

- **SSH (22)**: Your IP
- **Custom TCP (8081)**: Your IP (for health checks)
- **HTTPS (443)**: 0.0.0.0/0 (for VideoSDK API)

## Deploy Updates

```bash
# Connect to your instance
ssh -i your-key.pem ec2-user@your-instance-ip

# Update your agent (the user data script created /opt/agent as root)
cd /opt/agent
sudo git pull
sudo systemctl restart agent-worker
```

## Monitor

```bash
# Check service status
systemctl status agent-worker

# View logs
journalctl -u agent-worker -f
```

## Scaling

> To support more concurrent agents, you can spin up additional EC2 instances using the same process. Each instance will register with the VideoSDK backend registry and automatically receive job assignments. The backend will distribute the load across all available workers.

**To add more instances:**

1. Use the same user data script
2. Launch additional EC2 instances
3. Each instance will automatically join the worker pool
4. The VideoSDK backend will handle load balancing

**Example:**

```bash
# Launch multiple instances
aws ec2 run-instances \
  --image-id ami-0c02fb55956c7d316 \
  --instance-type t3.medium \
  --key-name your-key-pair \
  --security-group-ids sg-xxxxxxxxx \
  --user-data file://user-data.sh \
  --count 3
```

---

---
title: Docker Deployment
hide_title: false
hide_table_of_contents: false
description: "Deploy your VideoSDK AI Agent using Docker containers."
pagination_label: "Docker Deployment"
keywords:
  - AI Agent SDK
  - VideoSDK Agents
  - Docker
  - Self Hosting
image: img/videosdklive-thumbnail.jpg
sidebar_position: 2
sidebar_label: Docker
slug: docker
---

# Docker

Deploy your VideoSDK AI Agent Worker using Docker containers.

## Prerequisites

- Docker installed
- VideoSDK authentication token

## Quick Setup

### 1. Create Dockerfile

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose debug port
EXPOSE 8081

# Run the worker
CMD ["python", "main.py"]
```

### 2. Build and Run

```bash
# Build the image
docker build -t my-agent-worker .

# Run the container
docker run -d \
  --name my-agent-worker \
  -p 8081:8081 \
  -e VIDEOSDK_AUTH_TOKEN="your_auth_token" \
  my-agent-worker
```

### 3. Docker Compose (Optional)

Create `docker-compose.yml`:

```yaml title="docker-compose.yml"
version: "3.8"
services:
  agent-worker:
    build: .
    ports:
      - "8081:8081"
    environment:
      - VIDEOSDK_AUTH_TOKEN=${VIDEOSDK_AUTH_TOKEN}
    restart: unless-stopped
```

Run with:

```bash
docker-compose up -d
```

## Deploy Updates

```bash
# Stop container
docker stop my-agent-worker

# Remove old container
docker rm my-agent-worker

# Build new image
docker build -t my-agent-worker .

# Run new container
docker run -d \
  --name my-agent-worker \
  -p 8081:8081 \
  -e VIDEOSDK_AUTH_TOKEN="your_auth_token" \
  my-agent-worker
```

## Monitor

```bash
# Check container status
docker ps

# View logs
docker logs my-agent-worker

# Execute commands in container
docker exec -it my-agent-worker bash
```

## Scaling

> To support more concurrent agents, you can run multiple containers using the same image. Each container will register with the VideoSDK backend registry and automatically receive job assignments.

**Run multiple containers:**

```bash
# Run additional containers
docker run -d \
  --name my-agent-worker-2 \
  -p 8082:8081 \
  -e VIDEOSDK_AUTH_TOKEN="your_auth_token" \
  my-agent-worker

docker run -d \
  --name my-agent-worker-3 \
  -p 8083:8081 \
  -e VIDEOSDK_AUTH_TOKEN="your_auth_token" \
  my-agent-worker
```

**Or scale with Docker Compose:**

```bash
docker-compose up -d --scale agent-worker=3
```

Only one container can bind a given host port, so the fixed `8081:8081` mapping in the Compose file above will conflict when scaling. Remove the `ports` entry or publish only the container port before running more than one replica.

---

---
title: Kubernetes Deployment
hide_title: false
hide_table_of_contents: false
description: "Deploy your VideoSDK AI Agent on Kubernetes clusters."
pagination_label: "Kubernetes Deployment"
keywords:
  - AI Agent SDK
  - VideoSDK Agents
  - Kubernetes
  - K8s
  - Self Hosting
image: img/videosdklive-thumbnail.jpg
sidebar_position: 3
sidebar_label: Kubernetes
slug: kubernetes
---

# Kubernetes

Deploy your VideoSDK AI Agent Worker on Kubernetes clusters.

## Prerequisites

- Kubernetes cluster (EKS, GKE, or self-hosted)
- kubectl configured
- Docker image of your agent

## Quick Setup

### 1. Create Namespace

```bash
kubectl create namespace agent-workers
```

### 2. Create Secret

```bash
kubectl create secret generic agent-secrets \
  --from-literal=VIDEOSDK_AUTH_TOKEN=your_auth_token \
  --namespace agent-workers
```

### 3. Deploy Agent

```yaml title="deployment.yaml"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-worker
  namespace: agent-workers
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-worker
  template:
    metadata:
      labels:
        app: agent-worker
    spec:
      containers:
        - name: agent-worker
          image: your-registry/agent-worker:latest
          ports:
            - containerPort: 8081
          env:
            - name: VIDEOSDK_AUTH_TOKEN
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: VIDEOSDK_AUTH_TOKEN
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
```

Apply the deployment:

```bash
kubectl apply -f deployment.yaml
```

## Monitor

```bash
# Check deployment status
kubectl get deployments -n agent-workers

# Check pods
kubectl get pods -n agent-workers

# View logs
kubectl logs -f deployment/agent-worker -n agent-workers
```

## Deploy Updates

```bash
# Update image (use a new tag; setting the same tag again does not trigger a rollout)
kubectl set image deployment/agent-worker agent-worker=your-registry/agent-worker:v2 -n agent-workers

# Check rollout status
kubectl rollout status deployment/agent-worker -n agent-workers
```

## Scaling

> To support more concurrent agents, you can scale the deployment by increasing the number of replicas. Each pod will register with the VideoSDK backend registry and automatically receive job assignments.
**Scale the deployment:** ```bash # Scale to 5 replicas kubectl scale deployment agent-worker --replicas=5 -n agent-workers # Or use HPA for automatic scaling kubectl autoscale deployment agent-worker --cpu-percent=70 --min=2 --max=10 -n agent-workers ``` **Check scaling:** ```bash # View current replicas kubectl get deployment agent-worker -n agent-workers # View HPA status kubectl get hpa -n agent-workers ``` ## Cleanup ```bash # Delete deployment kubectl delete deployment agent-worker -n agent-workers # Delete namespace (removes everything) kubectl delete namespace agent-workers ``` --- --- title: Monitoring APIs hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." pagination_label: "Introduction to deployments" keywords: - AI Agent SDK - VideoSDK Agents - Introduction - Python SDK - AI Integration - VideoSDK Cloud - Deployments - Worker - Self Hosting image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Monitoring APIs slug: monitoring-apis --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Monitoring APIs Monitor your worker status and performance using HTTP endpoints. All endpoints are available at `http://localhost:8081`. 
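Before sending traffic to a freshly started worker, it can help to wait until its HTTP server answers. The following is a minimal sketch using only the standard library; it polls the `/health` endpoint listed under Available Endpoints. The function names, retry count, and delay are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_until_healthy(
    base_url: str = "http://localhost:8081",
    attempts: int = 5,
    delay: float = 1.0,
    fetch: Callable[[str], int] = fetch_status,
) -> bool:
    """Poll <base_url>/health until it returns 200 or attempts run out."""
    for attempt in range(attempts):
        if fetch(f"{base_url}/health") == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

The `fetch` parameter is injectable so the polling logic can be exercised without a running worker.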
## Available Endpoints - **`/health`** - Basic health check - **`/worker`** - Worker status - **`/stats`** - Detailed statistics - **`/debug`** - Configuration info - **`/`** - Web dashboard ## Quick Health Check ```bash curl http://localhost:8081/health ``` **Response:** ``` OK ``` ## Worker Status ```bash curl http://localhost:8081/worker ``` **Response:** ```json { "agent_id": "MyAgent", "active_jobs": 3, "connected": true, "worker_id": "worker-123", "worker_load": 0.3 } ``` ## Detailed Statistics ```bash curl http://localhost:8081/stats ``` **Response:** ```json { "worker_load": 0.3, "current_jobs": 3, "max_processes": 10, "agent_id": "MyAgent", "backend_connected": true, "resource_stats": { "total_resources": 10, "available_resources": 7, "active_resources": 3 } } ``` ## Web Dashboard Open `http://localhost:8081/` in your browser for a visual interface showing: - Real-time worker status - Resource utilization - Active jobs - Performance metrics ## Integration Examples ```python import requests def check_worker_health(): response = requests.get("http://localhost:8081/health") return response.status_code == 200 def get_worker_stats(): response = requests.get("http://localhost:8081/stats") return response.json() # Usage if check_worker_health(): stats = get_worker_stats() print(f"Active jobs: {stats['current_jobs']}") ``` ```javascript async function checkWorkerHealth() { const response = await fetch("http://localhost:8081/health"); return response.ok; } async function getWorkerStats() { const response = await fetch("http://localhost:8081/stats"); return response.json(); } // Usage if (await checkWorkerHealth()) { const stats = await getWorkerStats(); console.log(`Active jobs: ${stats.current_jobs}`); } ``` ## Common Use Cases - **Health monitoring**: Use `/health` for load balancer checks - **Performance tracking**: Use `/stats` for resource monitoring - **Debugging**: Use `/debug` to verify configuration - **Visual monitoring**: Use web dashboard for real-time 
overview --- --- title: Understanding the Worker hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." pagination_label: "Introduction to deployments" keywords: - AI Agent SDK - VideoSDK Agents - Introduction - Python SDK - AI Integration - VideoSDK Cloud - Deployments - Worker - Self Hosting image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Understanding the Worker slug: understanding-worker --- # Understanding the Worker The **Worker** is the runtime engine that executes your AI agents in production. Think of it as the "server" that runs your agent code and handles multiple conversations simultaneously. ![AI Agent Worker](https://cdn.videosdk.live/website-resources/docs-resources/ai_agent_worker.png) ## What the Worker Does The Worker manages the lifecycle of your AI agents by: - **Executing** your agent code when users start conversations - **Managing** multiple concurrent conversations efficiently - **Connecting** to VideoSDK's backend to receive job assignments - **Monitoring** health and performance automatically - **Scaling** up or down based on demand ## Why Use the Built-in Worker? The VideoSDK Agents framework includes a production-ready Worker that handles all the complex infrastructure concerns, so you can focus on building your AI agent logic. 
**Key Benefits:** - **Production-Ready**: Built for real-world workloads with proper error handling - **Auto-Scaling**: Automatically handles multiple conversations within a single worker - **Health Monitoring**: Built-in health checks and status reporting - **Zero-Downtime**: Graceful shutdown and deployment capabilities --- --- title: Worker Configuration hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." pagination_label: "Introduction to deployments" keywords: - AI Agent SDK - VideoSDK Agents - Introduction - Python SDK - AI Integration - VideoSDK Cloud - Deployments - Worker - Self Hosting image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Worker Configuration slug: worker-configuration --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import DeploymentCard from '@site/src/components/DeploymentCard' # Worker Configuration Workers are the execution engines that run your **AI Agent jobs**. Think of them as the bridge between your **agent logic** and the **VideoSDK runtime**. This guide walks you through how to configure and tune a Worker for different environments — from local dev to production. ## Quick Start: Minimal Worker Here’s the simplest Worker setup to get going: ```python from videosdk.agents import WorkerJob, Options, JobContext, RoomOptions options = Options( agent_id="MyAgent", max_processes=5, register=True, # Registers worker with the backend for job scheduling ) room_options = RoomOptions( name="My Agent", ) job_context = JobContext(room_options=room_options) job = WorkerJob( entrypoint=your_agent_function, jobctx=lambda: job_context, options=options, ) job.start() ``` That’s enough to start processing jobs locally or in staging. 
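To run the same worker in different environments without editing code, the `Options` arguments can be derived from environment variables. This is only a sketch: the variable names (`AGENT_ID`, `AGENT_MAX_PROCESSES`, `AGENT_REGISTER`) and the helper `options_kwargs_from_env` are hypothetical; only the keyword names mirror the `Options` fields used above.

```python
import os
from typing import Any, Dict, Mapping, Optional

_TRUE_VALUES = {"1", "true", "yes", "on"}


def options_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build keyword arguments for Options(...) from environment variables."""
    env = os.environ if env is None else env
    return {
        "agent_id": env.get("AGENT_ID", "MyAgent"),
        "max_processes": int(env.get("AGENT_MAX_PROCESSES", "5")),
        # Default to not registering so a bare local run stays isolated.
        "register": env.get("AGENT_REGISTER", "false").strip().lower() in _TRUE_VALUES,
    }
```

With the SDK installed, this could be used as `options = Options(**options_kwargs_from_env())`.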
## Worker Options Explained The `Options` class gives you fine-grained control over Worker behavior: | Option | Purpose | Example | | -------------------- | ------------------------------------------- | ------------------------------- | | `agent_id` | Unique identifier for your agent | `"SupportBot01"` | | `max_processes` | Maximum concurrent jobs | `10` | | `num_idle_processes` | Pre-warmed processes for faster startup | `2` | | `load_threshold` | Max CPU/Load tolerance before refusing jobs | `0.75` | | `register` | Whether to register with backend | `True` (prod) / `False` (local) | | `log_level` | Logging verbosity | `"DEBUG"`, `"INFO"`, `"ERROR"` | | `host`, `port` | Bind address for health/status endpoints | `"0.0.0.0"`, `8081` | | `memory_warn_mb` | Trigger warning logs at this usage | `500.0` | | `memory_limit_mb` | Hard memory cap (`0` = unlimited) | `1000.0` | | `ping_interval` | Heartbeat interval in seconds | `30.0` | | `max_retry` | Max connection retries before giving up | `16` | ## Example Configurations **Standard Production** configuration for typical deployments: ```python options = Options( agent_id="StandardAgent", max_processes=5, register=True, log_level="INFO", ) ``` This configuration is suitable for: - Standard production deployments - Moderate traffic loads - Most business applications **High-Scale Production** configuration for enterprise workloads: ```python options = Options( agent_id="EnterpriseAgent", max_processes=20, num_idle_processes=5, load_threshold=0.8, memory_limit_mb=2000.0, register=True, log_level="DEBUG", ) ``` This configuration is optimized for: - Enterprise-scale deployments - High concurrent user loads - Advanced monitoring requirements **Local Development** configuration for development: ```python options = Options( agent_id="DevAgent", max_processes=1, register=False, # Don't register with backend log_level="DEBUG", host="localhost", port=8081, ) ``` This configuration is ideal for: - Local development and testing - 
Debugging agent behavior - Isolated development environments ## Hosting Environments ## Scaling Your Workers Workers can scale both **vertically** (more power per instance) and **horizontally** (more instances). - **Vertical Scaling** → Increase `max_processes` to run more jobs per worker. - **Horizontal Scaling** → Deploy multiple workers; the backend registry will balance load. - **Idle Processes** → Use `num_idle_processes` to reduce cold start latency. - **Load Threshold** → Tune `load_threshold` (default `0.75`) to prevent overload. - **Memory Safety** → Use `memory_warn_mb` and `memory_limit_mb` to keep processes healthy. ## Pro Tips - **Start small** → Begin with `max_processes=5` and adjust as you observe metrics. - **Log smart** → Use `DEBUG` in dev, but `INFO` or `WARN` in prod to reduce noise. - **Monitor & Auto-Scale** → Pair with metrics (Prometheus, Grafana, CloudWatch, etc.) to auto-scale horizontally. - **Keep processes warm** → Set at least `num_idle_processes=1` in production for faster first-response times. --- --- title: Function Tools hide_title: false hide_table_of_contents: false description: "Learn how to extend your VideoSDK AI Agent's capabilities with function tools. Create custom actions, integrate with external services, and enable your agent to perform tasks beyond conversation using the @function_tool decorator." pagination_label: "Function Tools" keywords: - Function Tools - function_tool - Agent Tools - Custom Actions - External Services - API Integration - Agent Capabilities - VideoSDK Agents - AI Agent SDK - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Function Tools slug: function-tools --- import { AgentCardGrid, GithubIcon } from '@site/src/components/agent/cards'; # Function Tools Function tools allow your AI agent to perform actions and interact with external services, extending its capabilities beyond simple conversation. 
By registering function tools, you enable your agent to execute custom logic, call APIs, access databases, and perform various tasks based on user requests. ## Overview Function tools are Python functions decorated with `@function_tool` that your agent can call during conversations. The LLM automatically decides when to use these tools based on the user's request and the tool's description. ## External Tools External tools are defined as standalone functions and passed into the agent's constructor via the `tools` parameter. This approach is useful for sharing common tools across multiple agents. ```python title="main.py" from videosdk.agents import Agent, function_tool # External tool defined outside the class @function_tool(description="Get weather information for a location") def get_weather(location: str) -> str: """Get weather information for a specific location.""" # Weather logic here return f"Weather in {location}: Sunny, 72°F" class WeatherAgent(Agent): def __init__(self): super().__init__( instructions="You are a weather assistant.", tools=[get_weather] # Register the external tool ) ``` ## Internal Tools Internal tools are defined as methods within your agent class and decorated with `@function_tool`. This approach is useful for logic that is specific to the agent and needs access to its internal state (`self`). ```python title="main.py" from videosdk.agents import Agent, function_tool class FinanceAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful financial assistant." ) self.portfolio = {"AAPL": 10, "GOOG": 5} @function_tool def get_portfolio_value(self) -> dict: """Get the current value of the user's stock portfolio.""" # Access agent state via self return {"total_value": 5000, "holdings": self.portfolio} ``` ## Async Function Tools Function tools can be asynchronous, which is essential for making HTTP requests, performing I/O operations, or integrating with async VideoSDK features. 
```python title="main.py" import aiohttp from videosdk.agents import Agent, function_tool class WeatherAgent(Agent): def __init__(self): super().__init__( instructions="You are a weather assistant that can fetch real-time weather data." ) @function_tool async def get_weather_async(self, location: str) -> dict: """Fetch real-time weather data from an API.""" async with aiohttp.ClientSession() as session: async with session.get(f"https://api.weather.com/{location}") as response: data = await response.json() return { "location": location, "temperature": data.get("temp"), "condition": data.get("condition") } ``` :::note **Sarvam AI LLM**: When using Sarvam AI as the LLM option, function tool calls and MCP tools will not work. Consider using alternative LLM providers if you need function tool support. ::: ## Examples - Try Out Yourself }, { title: "Real-life Usecase", description: "Complete example demonstrating internal and external function tools", link: "https://github.com/videosdk-live/agents/blob/ee3ced912078c3be9dd62c7576c95c1bbe227bae/examples/a2a/agents/customer_agent.py#L22", icon: } ]} columns={2} /> --- --- title: Human in the Loop hide_title: false hide_table_of_contents: false description: "Learn how to implement Human in the Loop (HITL) functionality with VideoSDK AI Agents using Discord integration for human oversight and intervention." pagination_label: "Human in the Loop" keywords: - Human in the Loop - HITL - Discord Integration - AI Agent Oversight - Human Intervention - VideoSDK Agents - MCP Server - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 5 sidebar_label: Human in the Loop slug: human-in-the-loop --- # Human in the Loop Human in the Loop (HITL) enables AI agents to escalate specific queries to human operators for review and approval. This implementation uses Discord as the human interface, allowing seamless handoffs between AI automation and human oversight. 
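At its core, the handoff is an awaitable that resolves when the operator replies, bounded by a timeout so the caller is never left waiting indefinitely. The sketch below shows that pattern with plain `asyncio`, independent of Discord. `HumanInbox`, `ask`, and `reply` are illustrative names, not SDK or discord.py APIs.

```python
import asyncio
from typing import Optional


class HumanInbox:
    """Holds one pending question and resolves it when a human replies."""

    def __init__(self) -> None:
        self._pending: Optional[asyncio.Future] = None

    async def ask(self, question: str, timeout: float = 600.0) -> str:
        # A real implementation would forward `question` to the operator here.
        # Create the future on the loop that will await it.
        self._pending = asyncio.get_running_loop().create_future()
        try:
            return await asyncio.wait_for(self._pending, timeout=timeout)
        except asyncio.TimeoutError:
            return "Timed out waiting for a human response"
        finally:
            self._pending = None

    def reply(self, answer: str) -> bool:
        """Deliver the operator's answer; returns False if nothing is pending."""
        if self._pending is None or self._pending.done():
            return False
        self._pending.set_result(answer)
        return True


async def demo() -> str:
    inbox = HumanInbox()
    # Simulate the operator answering shortly after the question is asked.
    asyncio.get_running_loop().call_later(0.01, inbox.reply, "10% discount approved")
    return await inbox.ask("Can this customer get a discount?", timeout=1.0)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Returning a fallback string on timeout, rather than raising, lets the agent tell the customer that no answer arrived and keep the conversation going.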
## Overview

The HITL system allows AI agents to:

- Handle routine customer inquiries autonomously
- Escalate specific queries (like discount requests) to human operators via Discord
- Receive human responses and relay them back to customers
- Maintain conversation flow while waiting for human input

## Use Cases

- **Discount Requests**: The AI escalates pricing queries to human sales agents
- **Complex Support**: Technical issues that require human expertise
- **Policy Decisions**: Requests that need human approval or clarification
- **Escalation Scenarios**: Situations where AI confidence is low

## Example Overview

The implementation consists of two main components:

1. **Customer Agent**: A VideoSDK AI agent that handles customer interactions and escalates specific queries
2. **Discord MCP Server**: An MCP server that creates Discord threads for human operator responses

## Example Implementation

### Customer Agent Setup

```python
import pathlib
import sys
from typing import Optional

from videosdk.agents import Agent, JobContext, MCPServerStdio


class CustomerAgent(Agent):
    def __init__(self, ctx: Optional[JobContext] = None):
        # Resolve the Discord MCP server script relative to this file
        current_dir = pathlib.Path(__file__).parent
        discord_mcp_server_path = current_dir / "discord_mcp_server.py"
        super().__init__(
            instructions="You are a customer-facing agent for VideoSDK. You have access to various tools to assist with customer inquiries, provide support, and handle tasks. 
When a user asks for a discount percentage, always use the appropriate tool to retrieve and provide the accurate answer from your superior human agent.",
            mcp_servers=[
                MCPServerStdio(
                    executable_path=sys.executable,
                    process_arguments=[str(discord_mcp_server_path)],
                    session_timeout=30
                ),
            ]
        )
        self.ctx = ctx
```

### Discord MCP Server

```python
import asyncio

import discord
from discord.ext import commands
from mcp.server.fastmcp import FastMCP


class DiscordHuman:
    def __init__(self, user_id: int, channel_id: int):
        self.user_id = user_id
        self.channel_id = channel_id
        self.bot = commands.Bot(command_prefix="!", intents=discord.Intents.all())
        self.response_future = None

    async def ask(self, question: str) -> str:
        # Open a thread named after the question and mention the human operator
        channel = self.bot.get_channel(self.channel_id)
        thread = await channel.create_thread(
            name=question[:100],
            type=discord.ChannelType.public_thread
        )
        await thread.send(f"<@{self.user_id}> {question}")
        # Create the future on the running loop; it is resolved when the
        # operator replies (reply handling is not shown in this excerpt)
        self.response_future = asyncio.get_running_loop().create_future()
        try:
            return await asyncio.wait_for(self.response_future, timeout=600)
        except asyncio.TimeoutError:
            return "⏱️ Timed out waiting for a human response"


# MCP Server Setup
mcp = FastMCP("HumanInTheLoopServer")


@mcp.tool(description="Ask a human agent via Discord for a specific user query such as discount percentage, etc.")
async def ask_human(question: str) -> str:
    # discord_human is assumed to be a DiscordHuman instance created at
    # startup; its construction is not shown in this excerpt
    return await discord_human.ask(question)
```

### Pipeline Configuration

```python
import os

# Plugin imports (CascadingPipeline, DeepgramSTT, AnthropicLLM, GoogleTTS,
# SileroVAD, TurnDetector) are omitted here; see the full example
pipeline = CascadingPipeline(
    stt=DeepgramSTT(api_key=os.getenv("DEEPGRAM_API_KEY")),
    llm=AnthropicLLM(api_key=os.getenv("ANTHROPIC_API_KEY")),
    tts=GoogleTTS(api_key=os.getenv("GOOGLE_API_KEY")),
    vad=SileroVAD(),
    turn_detector=TurnDetector(threshold=0.8)
)
```

### Environment Variables

Set the following environment variables:

```bash
DISCORD_TOKEN=your_discord_bot_token
DISCORD_USER_ID=human_operator_user_id
DISCORD_CHANNEL_ID=channel_id_for_escalations
DEEPGRAM_API_KEY=your_deepgram_key
ANTHROPIC_API_KEY=your_anthropic_key
GOOGLE_API_KEY=your_google_key
```

### Example Link

Complete
implementation with full source code, setup instructions, and configuration examples available in the [VideoSDK Agents GitHub repository](https://github.com/videosdk-live/agents/tree/main/examples/human_in_the_loop). --- --- title: Introduction hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." pagination_label: "Introduction" keywords: - AI Agent SDK - VideoSDK Agents - Introduction - Python SDK - Voice AI - Real-time Communication - AI Integration - VideoSDK Cloud - Conversational AI - Build AI Agents image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Introduction slug: introduction --- import { AgentCardGrid, GithubIcon, RobotIcon, DocumentIcon, PlayIcon, CodeIcon, ExternalLinkIcon, SettingsIcon, TelephonyIcon, WaveformIcon, DocsIcon, CloudIcon, PuzzlePieceSimpleIcon, MetricsIcon, BulbIcon, DiscordIcon, SupportIcon } from '@site/src/components/agent/cards'; # AI Voice Agents The VideoSDK AI Agent SDK is a powerful Python framework for developers to seamlessly integrate intelligent, real-time voice agents into any application. Bridge the gap between advanced AI models and human interaction, creating natural, engaging, and responsive conversational experiences. 
- [AI Telephony Agent Quickstart](/ai_agents/ai-phone-agent-quick-start): Build an AI Telephony Agent in less than 10 minutes
- [GitHub Repository](https://github.com/videosdk-live/agents): The VideoSDK agent code and examples
- [Agent Starter Apps](/ai_agents/agent-runtime/connect-agent/web-integrations/agent-starter-react): Ready-to-run starter apps to get your AI agent up and running fast.

## The Architecture

The VideoSDK AI Agents framework connects four key components to enable seamless AI voice interactions:

- Your **Infrastructure** hosts the agent management system
- The **Agent Worker** creates and manages AI sessions
- The **VideoSDK Room** handles real-time meeting operations
- **User Devices** connect through web, mobile apps, or phone calls to interact with intelligent agents that can listen, understand, and respond naturally in real-time conversations.

![Introduction](https://assets.videosdk.live/images/agent-architecture.png)

## Use Cases

Here are some real-world applications where VideoSDK AI Agents can be deployed to create intelligent, voice-enabled experiences across different industries and scenarios. You can use them as they are, or as a reference for building your own customized agent.

## The Building Blocks

Our SDK is built on four primary, modular components that work together to create powerful and customizable agents. Understand these concepts, and you're ready to build.
, showArrow: false }, { title: "Deployment Options", description: "Deploy your agent on cloud or self-host it on your own infrastructure", link: "/ai_agents/deployments/introduction", icon: , showArrow: false }, { title: "Observability", description: "Monitor and debug with confidence using our built-in session analytics, latency tracking, and detailed traces.", link: "/ai_agents/tracing-observability/session-analytics", icon: , showArrow: false }, { title: "Plugin Ecosystem", description: "Integrate with dozens of providers like OpenAI, Google, Anthropic, and Elevenlabs for STT, LLM, and TTS.", link: "/ai_agents/plugins/realtime/openai", icon: , showArrow: false } ]} /> ## Need Help? If you have any queries, please feel free to reach out to us using one of the following methods: }, { title: "GitHub", description: "Ask your questions on GitHub.", link: "https://github.com/videosdk-live/agents/issues", icon: }, { title: "Support", description: "Talk to an expert, book demo or talk to sales.", link: "https://www.videosdk.live/contact", icon: } ]} columns={3} /> ## Frequently Asked Questions
What programming language and version are required? The AI Agent SDK is built in Python. You'll need Python 3.12 or higher to use the SDK.
Can my agent answer phone calls? Yes. By integrating with our SIP/telephony services, your AI agent can join a room initiated by a standard phone call. This allows you to build powerful IVR systems, automated appointment schedulers, AI-powered call centers, and more.
What AI models are supported? The SDK supports various AI models including: - **Real-time Models**: OpenAI, Google Gemini, AWS Nova Sonic - **LLM Providers**: OpenAI, Google Gemini, Anthropic Claude, Sarvam AI, Cerebras - **TTS Providers**: ElevenLabs, OpenAI, Google, AWS Polly, Cartesia, and many more - **STT Providers**: OpenAI Whisper, Deepgram, Google, AssemblyAI, and others
Can I use my own custom models? Absolutely! The SDK's modular architecture allows you to create custom plugins for any AI provider. Check our [plugin development guide](https://github.com/videosdk-live/agents/blob/main/BUILD_YOUR_OWN_PLUGIN.md) for detailed instructions.
How is pricing handled for the AI Agent SDK? VideoSDK offers a free tier with limited usage. The AI Agent SDK itself is open-source, but you'll need API keys for the AI services you choose to use (OpenAI, Google, etc.). Check the [pricing page](https://www.videosdk.live/pricing) for VideoSDK usage limits.
Can agents handle more than just voice? Absolutely! Agents support multimodal interactions including vision processing, data messages, and real-time video streams. They can also use function tools to interact with external systems and APIs.
Is the SDK production-ready? Yes, the AI Agent SDK is stable and production-ready. It is designed to be self-hosted on your own infrastructure for full control and scalability, from a single server to a Kubernetes cluster. It includes comprehensive error handling, metrics collection, and deployment flexibility.
---

---
title: MCP Integration
hide_title: false
hide_table_of_contents: false
description: "Learn how to integrate Model Context Protocol (MCP) servers with VideoSDK AI Agents to extend your agent's capabilities with external services, databases, and APIs using STDIO and HTTP transport methods."
pagination_label: "MCP Integration"
keywords:
  - MCP Integration
  - Model Context Protocol
  - MCP Client
  - MCP Servers
  - Multiple MCP Servers
  - MCP Server Client Example
  - VideoSDK Agents
  - AI Agent SDK
  - Python
  - MCP Tools
  - MCP Standard Input/Output (stdio)
  - MCP Streamable HTTP
  - MCP Server-Sent Events (SSE)
  - External APIs
  - Voice Agent Sessions
  - Real Time MCP
image: img/videosdklive-thumbnail.jpg
sidebar_position: 4
sidebar_label: MCP Integration
slug: mcp-integration
---

The Model Context Protocol (MCP) is an open standard that enables AI assistants to securely connect to data sources and tools. With VideoSDK's AI Agents, you can integrate MCP servers to extend your agent's capabilities with external services, applications, databases, and APIs.

## MCP Server Types

VideoSDK supports two transport methods for MCP servers:

### 1. STDIO Transport

- Direct process communication
- Local Python scripts
- Best for custom tools and functions
- Ideal for server-side integrations

### 2. HTTP Transport (Streamable HTTP or SSE)

- Network-based communication
- External MCP services
- Best for third-party integrations
- Supports remote MCP servers

## How It Works with VideoSDK's AI Agent

MCP tools are automatically discovered and made available to your agent. The agent chooses which tools to use based on the user's request.
When a user asks for information that requires external data, the agent will:

- Identify the need for external data based on the user's request
- Select appropriate tools from available MCP servers
- Execute the tools with relevant parameters
- Process the results and provide a natural language response

This integration allows your voice agent to access real-time data and external services while maintaining a natural conversational flow.

## Creating an MCP Server

### Basic MCP Server Structure

A simple MCP server using STDIO to return the current time. First, install the required package:

```bash
pip install fastmcp
```

```python title="mcp_stdio_example.py"
from mcp.server.fastmcp import FastMCP
import datetime

# Create the MCP server
mcp = FastMCP("CurrentTimeServer")

@mcp.tool()
def get_current_time() -> str:
    """Get the current time in the user's location"""
    # Get current time
    now = datetime.datetime.now()
    # Return formatted time string
    return f"The current time is {now.strftime('%H:%M:%S')} on {now.strftime('%Y-%m-%d')}"

if __name__ == "__main__":
    # Run the server with STDIO transport
    mcp.run(transport="stdio")
```

## Integrating MCP with VideoSDK Agent

The following example integrates MCP servers with your VideoSDK AI Agent:

```python title="main.py"
import asyncio
import pathlib
import sys
from videosdk.agents import Agent, AgentSession, RealTimePipeline, MCPServerStdio, MCPServerHTTP
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig

class MyVoiceAgent(Agent):
    def __init__(self):
        # Define paths to your MCP servers
        mcp_script = pathlib.Path(__file__).parent.parent / "MCP_Example" / "mcp_stdio_example.py"
        super().__init__(
            instructions="""You are a helpful assistant with access to real-time data. You can provide current time information. 
Always be conversational and helpful in your responses.""",
            mcp_servers=[
                # STDIO MCP Server (local Python script for time)
                MCPServerStdio(
                    executable_path=sys.executable,  # Use current Python interpreter
                    process_arguments=[str(mcp_script)],
                    session_timeout=30
                ),
                # HTTP MCP Server (external service, e.g., Zapier)
                MCPServerHTTP(
                    endpoint_url="https://your-mcp-service.com/api/mcp",
                    session_timeout=30
                )
            ]
        )

    async def on_enter(self) -> None:
        await self.session.say("Hi there! How can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Thank you for using the assistant. Goodbye!")

async def main(context: dict):
    # Configure Gemini Realtime model
    model = GeminiRealtime(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        config=GeminiLiveConfig(
            voice="Leda",  # Available voices: Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, Zephyr
            response_modalities=["AUDIO"]
        )
    )
    pipeline = RealTimePipeline(model=model)
    agent = MyVoiceAgent()
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        context=context
    )
    try:
        # Start the session
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()

if __name__ == "__main__":
    def make_context():
        # If VIDEOSDK_AUTH_TOKEN is already set in .env, omit the "videosdk_auth" key
        return {
            "meetingId": "your_actual_meeting_id_here",  # Replace with actual meeting ID
            "name": "AI Voice Agent",
            "videosdk_auth": "your_videosdk_auth_token_here"  # Replace with actual token
        }

    # Run the agent until it is interrupted
    asyncio.run(main(context=make_context()))
```

:::tip
Get started quickly with the [Quick Start Example](https://github.com/videosdk-live/agents-quickstart/tree/main/MCP) for the VideoSDK AI Agent SDK with MCP: everything you need to build your first AI agent fast.
:::

---

---
title: Anam AI Avatar
hide_title: false
hide_table_of_contents: false
description: "Build video agents with unmatched realism using Anam AI avatars and the VideoSDK AI Agent SDK. 
This guide covers configuration, API integration, and adding a lifelike visual avatar to your agent." pagination_label: "Anam AI" keywords: - Anam - Anam AI - Avatar - Real-time - VideoSDK Agents - Python SDK - AI Agent - Virtual Avatar - Realistic Avatar image: img/videosdklive-thumbnail.jpg sidebar_position: 5 sidebar_label: Anam AI slug: anam-ai --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Anam AI Avatar The Anam AI Avatar plugin allows you to integrate a real-time, lip-synced AI avatar into your VideoSDK agent. It provides a visual representation of the agent with expressive facial movements synchronized to speech output, and works with both `CascadingPipeline` and `RealTimePipeline`. ## Installation Install the Anam-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-anam" ``` ## Authentication The Anam AI plugin requires an [Anam API key](https://www.anam.ai/). ## Importing ```python from videosdk.plugins.anam import AnamAvatar ``` ## Setup Credentials To use Anam AI, you need an API key and an avatar ID. You can get them from the [Anam AI Dashboard](https://www.anam.ai/). ```bash ANAM_API_KEY="YOUR_ANAM_API_KEY" ANAM_AVATAR_ID="YOUR_ANAM_AVATAR_ID" ``` ## Example Usage Here's how you can integrate the Anam AI Avatar with both `RealTimePipeline` and `CascadingPipeline`. This example shows how to add the Anam AI Avatar to a `RealTimePipeline`. ```python import os from videosdk.agents import RealTimePipeline from videosdk.plugins.anam import AnamAvatar # 1. Create an AnamAvatar instance anam_avatar = AnamAvatar( api_key=os.getenv("ANAM_API_KEY"), avatar_id=os.getenv("ANAM_AVATAR_ID"), ) # 2. Add the avatar to the pipeline pipeline = RealTimePipeline( avatar=anam_avatar ) ``` For a full working example, see the [Anam Realtime Example on GitHub](https://github.com/videosdk-live/agents/blob/main/examples/avatar/anam_realtime_example.py). This example shows how to add the Anam AI Avatar to a `CascadingPipeline`. 
```python import os from videosdk.agents import CascadingPipeline from videosdk.plugins.anam import AnamAvatar # 1. Create an AnamAvatar instance anam_avatar = AnamAvatar( api_key=os.getenv("ANAM_API_KEY"), avatar_id=os.getenv("ANAM_AVATAR_ID"), ) # 2. Add the avatar to the pipeline pipeline = CascadingPipeline( avatar=anam_avatar ) ``` For a full working example, see the [Anam Cascading Example on GitHub](https://github.com/videosdk-live/agents/blob/main/examples/avatar/anam_cascading_example.py). ## Configuration Options ### `AnamAvatar` - `api_key`: (str, **required**) Your Anam API key. - `avatar_id`: (str, optional) The ID of the avatar to use. Defaults to `"d9ebe82e-2f34-4ff6-9632-16cb73e7de08"`. ## Additional Resources The following resources provide more information about using Anam AI with VideoSDK Agents SDK. - **[Anam AI docs](https://docs.anam.ai/)**: Anam AI official documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Simli Avatar hide_title: false hide_table_of_contents: false description: "Learn how to use Simli's real-time AI avatars with the VideoSDK AI Agent SDK. This guide covers configuration, API integration, and adding a visual avatar to your agent." pagination_label: "Simli Avatar" keywords: - Simli - Avatar - Real-time - VideoSDK Agents - Python SDK - AI Agent - Virtual Avatar image: img/videosdklive-thumbnail.jpg sidebar_position: 4 sidebar_label: Simli slug: simli --- # Simli Avatar The Simli Avatar plugin allows you to integrate a real-time, lip-synced AI avatar into your VideoSDK agent. This creates a more engaging and interactive experience for users by providing a visual representation of the AI agent. Simli offers two avatar types: Legacy (30 FPS) and Trinity (25 FPS). When creating a SimliAvatar, set is_trinity_avatar=True if you're using a Trinity avatar (default is False). Always select the correct faceID from the Simli dashboard. 
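Because the flag has to match the avatar type, one way to avoid mismatches is to derive it from a single setting. The helper below is a hypothetical sketch: the name `simli_avatar_kwargs` and the `"legacy"`/`"trinity"` strings are assumptions, and only `is_trinity_avatar` and the frame rates come from the description above.

```python
from typing import Any, Dict

# Frame rates as stated above: Legacy runs at 30 FPS, Trinity at 25 FPS.
AVATAR_FPS = {"legacy": 30, "trinity": 25}


def simli_avatar_kwargs(avatar_type: str) -> Dict[str, Any]:
    """Return the extra keyword arguments for SimliAvatar for a given avatar type."""
    normalized = avatar_type.strip().lower()
    if normalized not in AVATAR_FPS:
        raise ValueError(f"Unknown Simli avatar type: {avatar_type!r}")
    return {"is_trinity_avatar": normalized == "trinity"}
```

With the plugin installed, this could be used as `SimliAvatar(config=simli_config, **simli_avatar_kwargs("trinity"))`.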
## Installation Install the Simli-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-simli" ``` ## Authentication The Simli plugin requires a [Simli API key](https://app.simli.com/apikey). Set `SIMLI_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.simli import SimliAvatar, SimliConfig ``` ## Setup Credentials To use Simli, you need an API key. You can get one from the [Simli Dashboard](https://app.simli.com/profile). Set up your credentials by exporting them as an environment variable: ```bash export SIMLI_API_KEY="YOUR_SIMLI_API_KEY" ``` You can also provide a `faceId` if you have a custom one. ```bash export SIMLI_FACE_ID="YOUR_FACE_ID" ``` ## Example Usage Here's how you can integrate the Simli Avatar with both `CascadingPipeline` and `RealTimePipeline`. ### Cascading Pipeline This example shows how to add the Simli Avatar to a `CascadingPipeline`. ```python import os from videosdk.agents import CascadingPipeline from videosdk.plugins.simli import SimliAvatar, SimliConfig # Import other necessary components like STT, LLM, TTS # 1. Initialize SimliConfig simli_config = SimliConfig( apiKey=os.getenv("SIMLI_API_KEY"), faceId=os.getenv("SIMLI_FACE_ID"), # This is optional and has a default value ) # 2. Create a SimliAvatar instance # For Legacy avatars (default) simli_avatar = SimliAvatar(config=simli_config) # For Trinity avatars # simli_avatar = SimliAvatar( # config=simli_config, # is_trinity_avatar=True, # ) # 3. Add the avatar to the pipeline pipeline = CascadingPipeline( # ... stt=stt, llm=llm, tts=tts avatar=simli_avatar ) ``` ### Real-time Pipeline This example shows how to add the Simli Avatar to a `RealTimePipeline`. ```python import os from videosdk.agents import RealTimePipeline from videosdk.plugins.simli import SimliAvatar, SimliConfig # from videosdk.plugins.google import GeminiRealtime # Example model # 1. Initialize SimliConfig simli_config = SimliConfig( apiKey=os.getenv("SIMLI_API_KEY"), ) # 2.
Create a SimliAvatar instance # For Legacy avatars (default) simli_avatar = SimliAvatar(config=simli_config) # For Trinity avatars # simli_avatar = SimliAvatar( # config=simli_config, # is_trinity_avatar=True, # ) # 3. Add the avatar to the pipeline pipeline = RealTimePipeline( model=your_realtime_model, # e.g., GeminiRealtime() avatar=simli_avatar ) ``` :::note When using an environment variable for credentials, you should still load it in your code using `os.getenv("SIMLI_API_KEY")` and pass it to `SimliConfig`. ::: ## Configuration Options You can customize the avatar's behavior using the `SimliConfig` and `SimliAvatar` classes. ### `SimliConfig` - `faceId`: (str, optional) The ID for the avatar face. You can find available faces in the [Simli Docs](https://docs.simli.com/api-reference/available-faces) or create your own. Defaults to `"0c2b8b04-5274-41f1-a21c-d5c98322efa9"`. - `maxSessionLength`: (int, optional) A hard time limit in seconds after which the session will disconnect. Defaults to `1800` (30 minutes). - `maxIdleTime`: (int, optional) A soft time limit in seconds that disconnects the session after a period of not sending data. Defaults to `300` (5 minutes). ### `SimliAvatar` - `config`: (`SimliConfig`) A `SimliConfig` object with your desired settings. - `is_trinity_avatar`: (bool, optional) Set to `True` when using Trinity avatars. Defaults to `False` for Legacy avatars. ## Additional Resources The following resources provide more information about using Simli with VideoSDK Agents SDK. - **[Simli docs](https://docs.simli.com/overview)**: Simli docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: RNNoise Denoise hide_title: false hide_table_of_contents: false description: "Learn how to use RNNoise with the VideoSDK AI Agent SDK. 
This guide covers how to denoise your audio input" pagination_label: "Denoise" keywords: - RNNoise - Denoise - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Denoise slug: denoise --- # Denoise The RNNoise plugin enhances audio quality by removing background noise from your audio input, resulting in improved speech-to-text (STT) accuracy and better overall audio processing performance. RNNoise is a real-time noise suppression library powered by a recurrent neural network that intelligently filters out environmental noise such as air conditioning, computer fans, and other stationary background sounds while preserving the clarity and quality of speech. ## Installation Install the RNNoise plugin for denoising in VideoSDK Agents package: ```bash pip install "videosdk-plugins-rnnoise" ``` ## Importing ```python from videosdk.plugins.rnnoise import RNNoise ``` ## Example Usage ```python from videosdk.plugins.rnnoise import RNNoise from videosdk.agents import CascadingPipeline # Initialize the RNNoise Plugin rnnoise = RNNoise() # Add Denoise Plugin to cascading pipeline pipeline = CascadingPipeline(denoise=rnnoise) ``` It also works with [`RealTimePipeline`](/ai_agents/core-components/realtime-pipeline.md). 
## Example Usage in RealTime Pipeline ```python from videosdk.plugins.rnnoise import RNNoise from videosdk.agents import RealTimePipeline # Initialize the RNNoise Plugin rnnoise = RNNoise() # Add Denoise Plugin to realtime pipeline pipeline = RealTimePipeline(denoise=rnnoise) ``` ## Benefits - **Enhanced STT Accuracy**: Cleaner audio input leads to more accurate speech-to-text transcription - **Real-time Processing**: Processes audio streams with minimal latency for seamless user experience - **Intelligent Noise Reduction**: Effectively removes background noise while preserving speech clarity ## Additional Resources The following resources provide more information about using RNNoise with VideoSDK Agents SDK. - **[RNNoise project](https://github.com/xiph/rnnoise)**: The open source RNNoise library that powers the VideoSDK RNNoise plugin. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: VideoSDK Inference hide_title: false hide_table_of_contents: false description: "Learn how to use VideoSDK's Inference Gateway to easily integrate various AI models for STT, TTS, and Realtime communication without your API key." sidebar_label: VideoSDK Inference slug: videosdk-inference --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # VideoSDK Inference VideoSDK Inference provides a unified gateway to access various AI models for Speech-to-Text (STT), LLM (Large Language Models), Text-to-Speech (TTS), and Real-time multimodal communication. With VideoSDK Inference, you don't need to provide your own API keys for individual AI providers (like Sarvam AI, Google Gemini, etc.). VideoSDK handles the authentication and API connections through its unified gateway, allowing you to get started instantly. The services will be charged from your VideoSDK account balance. ## Installation The Inference plugin is part of the core VideoSDK Agents SDK. 
You can install it using pip: ```bash pip install videosdk-agents ``` ## Importing You can import the `STT`, `LLM`, `TTS`, `Denoise`, and `Realtime` classes from the `videosdk.agents.inference` module. ```python from videosdk.agents.inference import STT, LLM, TTS, Denoise, Realtime ``` ## Setup Authentication Authentication for the Inference gateway is handled via the `VIDEOSDK_AUTH_TOKEN` environment variable. ```bash VIDEOSDK_AUTH_TOKEN="your-videosdk-auth-token" ``` In a `CascadingPipeline`, you can use VideoSDK Inference to handle speech recognition and synthesis. This example shows how to use Sarvam AI's models via the VideoSDK gateway. ### Example Usage ```python import logging from videosdk.agents import ( Agent, AgentSession, CascadingPipeline, ConversationFlow, JobContext, RoomOptions, WorkerJob, ) # highlight-start from videosdk.agents.inference import STT, LLM, TTS, Denoise # highlight-end from videosdk.plugins.silero import SileroVAD # Minimal logging logging.basicConfig( level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s" ) class SimpleAgent(Agent): """Simple voice agent for testing inference STT.""" def __init__(self): super().__init__( instructions="You are a helpful voice assistant. Keep responses brief and conversational.", ) async def on_enter(self) -> None: await self.session.say( "Hello! I'm using VideoSDK Inference for speech recognition. How can I help you?" 
) async def on_exit(self) -> None: await self.session.say("Goodbye!") async def entrypoint(ctx: JobContext): """Main entrypoint for the agent.""" agent = SimpleAgent() conversation_flow = ConversationFlow(agent) # Create pipeline with Inference STT, LLM, TTS & Denoise (via VideoSDK Gateway) pipeline = CascadingPipeline( # highlight-start # Inference STT, LLM, TTS, Denoise (via VideoSDK Gateway) stt=STT.sarvam(model_id="saarika:v2.5", language="en-IN"), llm=LLM.google(model_id="gemini-2.5-flash"), tts=TTS.sarvam(model_id="bulbul:v2", speaker="anushka", language="en-IN"), denoise=Denoise.sanas(), # highlight-end vad=SileroVAD(), ) session = AgentSession( agent=agent, pipeline=pipeline, conversation_flow=conversation_flow, ) await session.start(wait_for_participant=True, run_until_shutdown=True) def make_context() -> JobContext: """Create job context for playground mode.""" room_options = RoomOptions( name="Inference Test Agent", playground=True ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=entrypoint, jobctx=make_context) job.start() ``` The `RealTimePipeline` uses the VideoSDK Inference Gateway to handle multimodal models like Gemini Live 2.5 Flash Native Audio, which manages the connection efficiently and reduces latency. ### Example Usage ```python import logging from videosdk.agents import ( Agent, AgentSession, RealTimePipeline, ConversationFlow, JobContext, RoomOptions, WorkerJob, ) # highlight-start from videosdk.agents.inference import Realtime # highlight-end # Minimal logging logging.basicConfig( level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s" ) class SimpleAgent(Agent): """Simple voice agent for testing inference realtime.""" def __init__(self): super().__init__( instructions="""You are a helpful and friendly voice assistant. You speak in a natural, conversational tone. 
Keep your responses concise but informative.""", ) async def on_enter(self) -> None: await self.session.say( "Hello! I'm using the VideoSDK Inference Gateway with Gemini. How can I help you today?" ) async def on_exit(self) -> None: await self.session.say("Goodbye! Have a great day!") async def entrypoint(ctx: JobContext): """Main entrypoint for the agent.""" agent = SimpleAgent() conversation_flow = ConversationFlow(agent) # Create RealTimePipeline with Inference Realtime (Gemini) pipeline = RealTimePipeline( # highlight-start model=Realtime.gemini( model_id="gemini-2.5-flash-native-audio-preview-12-2025", voice="Puck", language_code="en-US", response_modalities=["AUDIO"], temperature=0.7 ), # highlight-end ) session = AgentSession( agent=agent, pipeline=pipeline, conversation_flow=conversation_flow, ) await session.start(wait_for_participant=True, run_until_shutdown=True) def make_context() -> JobContext: """Create job context for playground mode.""" room_options = RoomOptions( name="Inference Agent", playground=True ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=entrypoint, jobctx=make_context) job.start() ``` ## Configuration Options ### STT Configuration #### `STT.sarvam()` - `model_id`: (str) The specific Sarvam model ID (e.g., `"saarika:v2.5"`). - `language`: (str) Language code for transcription (e.g., `"en-IN"`). #### `STT.google()` - `model_id`: (str) The Google model ID (e.g., `"chirp_3"`). - `language`: (str) Language code for transcription (default: `"en-US"`). ### LLM Configuration #### `LLM.google()` - `model_id`: (str) The Gemini model version (e.g., `"gemini-2.5-flash"`). - `temperature`: (float) Sampling temperature for response randomness (default: `0.7`). ### TTS Configuration #### `TTS.sarvam()` - `model_id`: (str) The Sarvam model ID (e.g., `"bulbul:v2"`). - `speaker`: (str) The speaker name (e.g., `"anushka"`). - `language`: (str) Language code (e.g., `"en-IN"`). 
#### `TTS.google()` - `model_id`: (str) The Google model ID (e.g., `"Chirp3-HD"`). - `voice_id`: (str) The voice ID (e.g., `"Achernar"`). - `language`: (str) Language code (e.g., `"en-US"`). ### Denoise Configuration #### `Denoise.sanas()` - Integrates Sanas for real-time speech enhancement and noise suppression. ### Realtime Configuration #### `Realtime.gemini()` - `model_id`: (str) The Gemini model version (e.g., `"gemini-2.5-flash-native-audio-preview-12-2025"`). - `voice`: (str) The voice to use (e.g., `"Puck"`, `"Charon"`, `"Kore"`, `"Fenrir"`, `"Aoede"`). - `language_code`: (str) Language code (e.g., `"en-US"`). - `response_modalities`: (list) List of modalities, e.g., `["AUDIO"]` or `["TEXT", "AUDIO"]`. - `temperature`: (float) Sampling temperature (default: `0.7`). ## Additional Resources The following resources provide more information about using VideoSDK inferencing. - **[Inference Pricing](https://docs.videosdk.live/help_docs/pricing-inference)**: Detailed provider wise pricing import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Anthropic LLM hide_title: false hide_table_of_contents: false description: "Learn how to use Anthropic's LLM models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text-based AI capabilities for your conversational agents." pagination_label: "Anthropic LLM" keywords: - Anthropic - Claude - LLM - Large Language Model - VideoSDK Agents - Python SDK - Text Generation - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Anthropic slug: anthropic-llm --- # Anthropic LLM The Anthropic AI LLM provider enables your agent to use Anthropic AI's language models for text-based conversations and processing. 
It also supports vision input capabilities, allowing your agent to analyze and respond to images alongside text with the [supported](https://docs.anthropic.com/en/docs/about-claude/models/overview) models. ## Installation Install the Anthropic-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-anthropic" ``` ## Importing ```python from videosdk.plugins.anthropic import AnthropicLLM ``` ## Authentication The Anthropic plugin requires an [Anthropic API key](https://console.anthropic.com/dashboard). Set `ANTHROPIC_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.anthropic import AnthropicLLM from videosdk.agents import CascadingPipeline # Initialize the Anthropic LLM model llm = AnthropicLLM( model="claude-sonnet-4-20250514", temperature=0.7, max_tokens=1024, ) # Add llm to cascading pipeline pipeline = CascadingPipeline(llm=llm) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `model`: (str) The Anthropic model to use (default: `"claude-sonnet-4-20250514"`). - `api_key`: (str) Your Anthropic API key. Can also be set via the `ANTHROPIC_API_KEY` environment variable. - `base_url`: (str) Optional custom base URL for Claude API (default: `None`). - `temperature`: (float) Sampling temperature for response randomness (default: `0.7`). - `tool_choice`: (ToolChoice) Tool selection mode (`"auto"`, `"required"`, `"none"`) (default: `"auto"`). - `max_tokens`: (int) Maximum number of tokens in the response (default: `1024`). - `top_p`: (float) Nucleus sampling probability (optional). - `top_k`: (int) Top-k sampling parameter (optional). ## Additional Resources The following resources provide more information about using Anthropic with VideoSDK Agents SDK. 
- **[Anthropic docs](https://docs.anthropic.com/en/docs/intro)**: Anthropic documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Azure OpenAI LLM hide_title: false hide_table_of_contents: false description: "Learn how to use Azure OpenAI's LLM models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text-based AI capabilities for your conversational agents." pagination_label: "Azure OpenAI LLM" keywords: - OpenAI - Azure - Azure OpenAI - GPT-4o - LLM - Large Language Model - VideoSDK Agents - Python SDK - Text Generation - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Azure OpenAI slug: azureopenai --- # Azure OpenAI LLM The Azure OpenAI LLM provider enables your agent to use Azure OpenAI's language models (like GPT-4o) for text-based conversations and processing. It also supports vision input capabilities, allowing your agent to analyze and respond to images alongside text with the [supported](https://platform.openai.com/docs/models) models. ## Installation Install the Azure OpenAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-openai" ``` ## Importing ```python from videosdk.plugins.openai import OpenAILLM ``` ## Authentication The Azure OpenAI plugin requires an [Azure OpenAI API key](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/create-resource?pivots=web-portal), along with the endpoint URL and API version of your Azure OpenAI resource. Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, and `OPENAI_API_VERSION` in your `.env` file.
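Because the Azure setup depends on three separate variables, it can help to fail fast when one of them is missing. The following is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper name, not part of the SDK, and the sketch assumes the variables have already been loaded into the process environment.

```python
import os

# Variable names taken from the Authentication section above.
REQUIRED_AZURE_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "OPENAI_API_VERSION",
)


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Report any missing settings before constructing the LLM.
missing = missing_env_vars(REQUIRED_AZURE_VARS)
if missing:
    print("Missing Azure OpenAI settings:", ", ".join(missing))
```

Running a check like this at startup surfaces a configuration problem as one clear message instead of an authentication error during the first request.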
## Example Usage ```python from videosdk.plugins.openai import OpenAILLM from videosdk.agents import CascadingPipeline # Initialize the Azure OpenAI LLM model llm = OpenAILLM.azure( azure_deployment="gpt-4o", temperature=0.7, ) # Add llm to cascading pipeline pipeline = CascadingPipeline(llm=llm) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `azure_deployment`: The OpenAI deployment ID to use (by default it is model name: e.g., `"gpt-4o"`, `"gpt-4o-mini"`) - `api_key`: Your Azure OpenAI API key (can also be set via environment variable) - `azure_endpoint`: Your Azure OpenAI Deployment Endpoint URL (can also be set via environment variable) - `api_version`: Your Azure OpenAI API version (can also be set via environment variable) - `temperature`: (float) Sampling temperature for response randomness (0.0 to 2.0, default: 0.7) - `tool_choice`: Tool selection mode (e.g., `"auto"`, `"none"`, or specific tool) - `max_completion_tokens`: (int) Maximum number of tokens in the completion response ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Cerebras LLM hide_title: false hide_table_of_contents: false description: "Learn how to use Cerebras's LLM models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text-based AI capabilities for your conversational agents." 
pagination_label: "Cerebras LLM" keywords: - Cerebras - LLM - Large Language Model - VideoSDK Agents - Python SDK - Text Generation - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Cerebras slug: Cerebras-llm --- # Cerebras LLM The Cerebras AI LLM provider enables your agent to use Cerebras AI's language models for text-based conversations and processing. ## Installation Install the Cerebras-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-cerebras" ``` ## Importing ```python from videosdk.plugins.cerebras import CerebrasLLM ``` ## Authentication The Cerebras plugin requires a [Cerebras API key](https://cloud.cerebras.ai/). Set `CEREBRAS_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.cerebras import CerebrasLLM from videosdk.agents import CascadingPipeline # Initialize the Cerebras LLM model llm = CerebrasLLM( model="llama3.3-70b", temperature=0.7, max_completion_tokens=1024, ) # Add llm to cascading pipeline pipeline = CascadingPipeline(llm=llm) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `model`: (str) The Cerebras model to use (default: `"llama3.3-70b"`). Supported models include: `llama3.3-70b`, `llama3.1-8b`, `llama-4-scout-17b-16e-instruct`, `qwen-3-32b`, `deepseek-r1-distill-llama-70b` (private preview). - `api_key`: (str) Your Cerebras API key. Can also be set via the `CEREBRAS_API_KEY` environment variable. - `temperature`: (float) Sampling temperature for response randomness (default: `0.7`). - `tool_choice`: (ToolChoice) Tool selection mode (`"auto"`, `"required"`, `"none"`) (default: `"auto"`). - `max_completion_tokens`: (int) Maximum number of tokens to generate in the response (optional).
- `top_p`: (float) Nucleus sampling probability (optional). - `seed`: (int) Random seed for reproducible completions (optional). - `stop`: (str) Stop sequence that halts generation when encountered (optional). - `user`: (str) Identifier for the end user triggering the request (optional). ## Additional Resources The following resources provide more information about using Cerebras with VideoSDK Agents SDK. - **[Cerebras docs](https://inference-docs.cerebras.ai/introduction)**: Cerebras documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Google LLM hide_title: false hide_table_of_contents: false description: "Learn how to use Google's LLM models (Gemini) with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text-based AI capabilities for your conversational agents." pagination_label: "Google LLM" keywords: - Google - Gemini - gemini-2.0-flash-001 - gemini-3-flash-preview - gemini-3-pro-preview - LLM - Large Language Model - VideoSDK Agents - Python SDK - Text Generation - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Google slug: google-llm --- # Google LLM The Google LLM provider enables your agent to use Google's Gemini family of language models for text-based conversations and processing. It also supports vision input capabilities, allowing your agent to analyze and respond to images alongside text with the [supported](https://ai.google.dev/gemini-api/docs/models) models. ## Installation Install the Google-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-google" ``` ## Importing ```python from videosdk.plugins.google import GoogleLLM ``` ## Authentication The Google plugin requires a [Gemini API key](https://aistudio.google.com/apikey). Set `GOOGLE_API_KEY` in your `.env` file.
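The key can reach the plugin in two ways: through the environment or as an explicit `api_key` argument. Here is a small sketch of choosing between them, assuming a hypothetical `credential_kwargs` helper (not part of the SDK) and assuming that the environment variable, when present, should take priority:

```python
import os


def credential_kwargs(env_var, explicit_key=None, env=None):
    """Build the credential keyword arguments for a plugin constructor.

    Returns an empty dict when the environment variable is set, so the
    api_key argument is omitted; otherwise falls back to the explicit key.
    """
    env = os.environ if env is None else env
    if env.get(env_var):
        return {}
    if explicit_key:
        return {"api_key": explicit_key}
    raise RuntimeError(f"Set {env_var} or pass an explicit API key")


# The printed result depends on whether GOOGLE_API_KEY is set locally.
print(credential_kwargs("GOOGLE_API_KEY", explicit_key="your-google-api-key"))
```

The result can be unpacked into the constructor, for example `GoogleLLM(model="gemini-3-flash-preview", **credential_kwargs("GOOGLE_API_KEY"))`, which keeps the key out of source code whenever the environment already provides it.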
## Example Usage ```python from videosdk.plugins.google import GoogleLLM from videosdk.agents import CascadingPipeline # Initialize the Google LLM model llm = GoogleLLM( model="gemini-3-flash-preview", # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-google-api-key", temperature=0.7, tool_choice="auto", max_output_tokens=1000 ) # Add llm to cascading pipeline pipeline = CascadingPipeline(llm=llm) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Vertex AI Integration You can also use Google's Gemini models through Vertex AI. This requires a different authentication and configuration setup. ### Authentication for Vertex AI For Vertex AI, you need to set up Google Cloud credentials. Create a service account, download the JSON key file, and set the path to this file in your environment. ```bash export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/keyfile.json" ``` You should also configure your project ID and location. These can be set as environment variables or directly in the code. If not set, the `project_id` is inferred from the credentials file and the `location` defaults to `us-central1`. ```bash export GOOGLE_CLOUD_PROJECT="your-gcp-project-id" export GOOGLE_CLOUD_LOCATION="your-gcp-location" ``` ### Example Usage with Vertex AI To use Vertex AI, set `vertexai=True` when initializing `GoogleLLM`. You can configure the project and location using `VertexAIConfig`, which will take precedence over environment variables. 
```python from videosdk.plugins.google import GoogleLLM, VertexAIConfig from videosdk.agents import CascadingPipeline # Example STT and TTS components (require the Deepgram and ElevenLabs plugins) from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.elevenlabs import ElevenLabsTTS # Initialize GoogleLLM with Vertex AI configuration llm = GoogleLLM( vertexai=True, vertexai_config=VertexAIConfig( project_id="videosdk", location="us-central1" ) ) # Add llm to a cascading pipeline pipeline = CascadingPipeline( stt=DeepgramSTT(), # Example STT llm=llm, tts=ElevenLabsTTS() # Example TTS ) ``` ## Configuration Options - `model`: (str) The Google model to use (e.g., `"gemini-3-flash-preview"`, `"gemini-3-pro-preview"`, `"gemini-2.0-flash-001"`) (default: `"gemini-2.0-flash-001"`). - `api_key`: (str) Your Google API key. Can also be set via the `GOOGLE_API_KEY` environment variable. - `temperature`: (float) Sampling temperature for response randomness (default: `0.7`). - `tool_choice`: (ToolChoice) Tool selection mode (`"auto"`, `"required"`, `"none"`) (default: `"auto"`). - `max_output_tokens`: (int) Maximum number of tokens in the completion response (optional). - `top_p`: (float) The nucleus sampling probability (optional). - `top_k`: (int) The top-k sampling parameter (optional). - `presence_penalty`: (float) Penalizes new tokens based on whether they appear in the text so far (optional). - `frequency_penalty`: (float) Penalizes new tokens based on their existing frequency in the text so far (optional). ## Additional Resources The following resources provide more information about using Google with VideoSDK Agents SDK. - **[Gemini docs](https://ai.google.dev/gemini-api/docs/models)**: Google Gemini documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: OpenAI LLM hide_title: false hide_table_of_contents: false description: "Learn how to use OpenAI's LLM models with the VideoSDK AI Agent SDK.
This guide covers model configuration, API integration, and implementing text-based AI capabilities for your conversational agents." pagination_label: "OpenAI LLM" keywords: - OpenAI - GPT-4o - LLM - Large Language Model - VideoSDK Agents - Python SDK - Text Generation - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: OpenAI slug: openai --- # OpenAI LLM The OpenAI LLM provider enables your agent to use OpenAI's language models (like GPT-4o) for text-based conversations and processing. It also supports vision input capabilities, allowing your agent to analyze and respond to images alongside text with the [supported](https://platform.openai.com/docs/models) models. ## Installation Install the OpenAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-openai" ``` ## Importing ```python from videosdk.plugins.openai import OpenAILLM ``` ## Authentication The OpenAI plugin requires an [OpenAI API key](https://platform.openai.com/api-keys). Set `OPENAI_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.openai import OpenAILLM from videosdk.agents import CascadingPipeline # Initialize the OpenAI LLM model llm = OpenAILLM( model="gpt-4o", # When OPENAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-openai-api-key", temperature=0.7, tool_choice="auto", max_completion_tokens=1000 ) # Add llm to cascading pipeline pipeline = CascadingPipeline(llm=llm) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. 
::: ## Configuration Options - `model`: The OpenAI model to use (e.g., `"gpt-4o"`, `"gpt-4o-mini"`, `"gpt-3.5-turbo"`) - `api_key`: Your OpenAI API key (can also be set via environment variable) - `base_url`: Custom base URL for OpenAI API (optional) - `temperature`: (float) Sampling temperature for response randomness (0.0 to 2.0, default: 0.7) - `tool_choice`: Tool selection mode (e.g., `"auto"`, `"none"`, or specific tool) - `max_completion_tokens`: (int) Maximum number of tokens in the completion response ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. - **[OpenAI docs](https://platform.openai.com/docs/)**: OpenAI documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Sarvam AI LLM hide_title: false hide_table_of_contents: false description: "Learn how to use Sarvam AI's LLM models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text-based AI capabilities for your conversational agents." pagination_label: "Sarvam AI LLM" keywords: - Sarvam AI - sarvam-m - LLM - Large Language Model - VideoSDK Agents - Python SDK - Text Generation - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: Sarvam AI slug: sarvam-ai-llm --- # Sarvam AI LLM The Sarvam AI LLM provider enables your agent to use Sarvam AI's language models for text-based conversations and processing. ## Installation Install the Sarvam AI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-sarvamai" ``` ## Importing ```python from videosdk.plugins.sarvamai import SarvamAILLM ``` :::note When using Sarvam AI as the LLM, function tool calls and MCP tools will not work. ::: ## Authentication The Sarvam plugin requires a [Sarvam API key](https://dashboard.sarvam.ai/key-management). Set `SARVAMAI_API_KEY` in your `.env` file.
## Example Usage ```python from videosdk.plugins.sarvamai import SarvamAILLM from videosdk.agents import CascadingPipeline # Initialize the Sarvam AI LLM model llm = SarvamAILLM( model="sarvam-m", # When SARVAMAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-sarvam-ai-api-key", temperature=0.7, tool_choice="auto", max_completion_tokens=1000, reasoning_effort="medium", # Optional: "low", "medium", "high" wiki_grounding=False, # Optional: enable Wikipedia-grounded responses ) # Add llm to cascading pipeline pipeline = CascadingPipeline(llm=llm) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `model`: (str) The Sarvam AI model to use (default: `"sarvam-m"`). - `api_key`: (str) Your Sarvam AI API key. Can also be set via the `SARVAMAI_API_KEY` environment variable. - `temperature`: (float) Sampling temperature for response randomness (default: `0.7`). - `tool_choice`: (ToolChoice) Tool selection mode (default: `"auto"`). - `max_completion_tokens`: (int) Maximum number of tokens in the completion response (optional). - `reasoning_effort`: (str) Controls reasoning depth for the model. Allowed values: `"low"`, `"medium"`, `"high"` (default: `None`). - `wiki_grounding`: (bool) Enables Wikipedia search to ground responses with factual information (default: `False`). ## Additional Resources The following resources provide more information about using Sarvam AI with VideoSDK Agents SDK. - **[Sarvam docs](https://docs.sarvam.ai/)**: Sarvam's full docs site. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Namo Turn Detector hide_title: false hide_table_of_contents: false description: "Learn how to use NamoTurnDetectorV1 model with the VideoSDK AI Agent SDK. This guide covers model configuration." 
pagination_label: "Turn Detector" keywords: - Turn Detection - Namo Turn Detector - Large Language Model - VideoSDK Agents - Multilingual - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Namo Turn Detector slug: namo-turn-detector --- # Namo Turn Detector The Namo Turn Detector v1 utilizes a custom fine-tuned model from VideoSDK to accurately determine whether a user has finished speaking. This allows for precise management of conversation flow, especially in cascading pipeline setups. It can operate as a multilingual model or be configured for a specific language for optimized performance. ## Installation Install the Turn Detector-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-turn-detector" ``` ## Importing ```python from videosdk.plugins.turn_detector import NamoTurnDetectorV1 ``` ## Example Usage **1. For a specific language (e.g., English):** ```python from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model from videosdk.agents import CascadingPipeline # Pre-download the English model to avoid delays pre_download_namo_turn_v1_model(language="en") # Initialize the Turn Detector for English turn_detector = NamoTurnDetectorV1( language="en", threshold=0.7 ) # Add the Turn Detector to a cascading pipeline pipeline = CascadingPipeline(turn_detector=turn_detector) ``` **2. For multilingual support:** If you don't specify a language, the detector will default to the multilingual model, which can handle various languages. 
```python from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model from videosdk.agents import CascadingPipeline # Pre-download the multilingual model pre_download_namo_turn_v1_model() # Initialize the multilingual Turn Detector turn_detector = NamoTurnDetectorV1( threshold=0.7 ) # Add the Turn Detector to a cascading pipeline pipeline = CascadingPipeline(turn_detector=turn_detector) ``` ## Configuration Options - `language`: (Optional, `str`): Specifies the language for the turn detection model. If left as `None` (the default), it loads a multilingual model capable of handling all supported languages. - `threshold`: (float) Confidence threshold for turn completion detection (0.0 to 1.0, default: `0.7`) ## Supported Languages The `NamoTurnDetectorV1` supports a wide range of languages when you specify the corresponding language code. If no language is specified, the multilingual model will be used. Here is a list of the supported languages and their codes: | Language | Code | | :--- | :--- | | Arabic | `ar` | | Bengali | `bn` | | Chinese | `zh` | | Danish | `da` | | Dutch | `nl` | | English | `en` | | Finnish | `fi` | | French | `fr` | | German | `de` | | Hindi | `hi` | | Indonesian |`id` | | Italian | `it` | | Japanese | `ja` | | Korean | `ko` | | Marathi | `mr` | | Norwegian | `no` | | Polish | `pl` | | Portuguese | `pt` | | Russian | `ru` | | Spanish | `es` | | Turkish | `tr` | | Ukrainian | `uk` | | Vietnamese |`vi` | ## Pre-downloading Model To avoid delays during agent initialization, you can pre-download the Hugging Face model: You can pre-download a specific language model: ```python from videosdk.plugins.turn_detector import pre_download_namo_turn_v1_model # Download the English model before the agent runs pre_download_namo_turn_v1_model(language="en") ``` Or pre-download the multilingual model: ```python from videosdk.plugins.turn_detector import pre_download_namo_turn_v1_model # Download the multilingual model 
pre_download_namo_turn_v1_model() ``` ## Additional Resources The following resources provide more information about the VideoSDK Turn Detector plugin for the AI Agents SDK. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: AWS Nova Sonic hide_title: false hide_table_of_contents: false description: "Learn how to use Amazon's Nova Sonic model with the VideoSDK AI Agent SDK. This guide covers model configuration, streaming audio, and integration with your agent pipeline." pagination_label: "Amazon Nova Sonic" keywords: - Amazon's Nova Sonic - AWS Nova Sonic - AWS Model - Amazon Nova Sonic - NovaSonicRealtime - NovaSonicLiveConfig - Real-time AI - VideoSDK Agents - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: AWS Nova Sonic slug: aws-nova-sonic --- # AWS Nova Sonic The AWS Nova Sonic provider enables your agent to use Amazon's Nova Sonic model for real-time, speech-to-speech AI interactions. ### Prerequisites Before you start using AWS Nova Sonic with the VideoSDK AI Agent, ensure the following: - `AWS Account`: You have an active AWS account with permissions to access Amazon Bedrock. - `Model Access`: You've requested and obtained access to the Amazon Nova models (Nova Lite and Nova Canvas) via the Amazon Bedrock console. - `Region Selection`: You're operating in the US East (N. Virginia) (us-east-1) region, as model access is region-specific. - `AWS Credentials`: Your AWS credentials (aws_access_key_id and aws_secret_access_key) are configured, either through environment variables or your preferred credential management method. ## Installation Install the AWS-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-aws" ``` ## Authentication The Amazon Nova Sonic plugin requires [AWS credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html).
Set the following environment variables in your `.env` file: ```shell AWS_ACCESS_KEY_ID= AWS_SECRET_ACCESS_KEY= AWS_DEFAULT_REGION= ``` ## Importing ```python from videosdk.plugins.aws import NovaSonicRealtime, NovaSonicConfig ``` ## Example Usage ```python from videosdk.plugins.aws import NovaSonicRealtime, NovaSonicConfig from videosdk.agents import RealTimePipeline # Initialize the Nova Sonic real-time model model = NovaSonicRealtime( model="amazon.nova-sonic-v1:0", # When AWS credentials and region are set in .env - DON'T pass credential parameters region="us-east-1", # Currently, only "us-east-1" is supported for Amazon Nova Sonic. aws_access_key_id="YOUR_ACCESS_KEY", aws_secret_access_key="YOUR_SECRET_KEY", config=NovaSonicConfig( voice="tiffany", # "tiffany","matthew", "amy" temperature=0.7, top_p=0.9, max_tokens=1024 ) ) # Create the pipeline with the model pipeline = RealTimePipeline(model=model) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: :::note To initiate a conversation with Amazon Nova Sonic, the user must speak first. The model listens for user input to begin the interaction. ::: ## See it in Action Explore a complete, end-to-end implementation of an agent using this provider in our [AI Agent Quickstart Guide](https://docs.videosdk.live/ai_agents/voice-agent-quick-start). ## Configuration Options - `model`: The Amazon Nova Sonic model to use (e.g., "amazon.nova-sonic-v1:0"). - `region`: AWS region where the model is hosted (e.g., "us-east-1"). - `aws_access_key_id`: Your AWS access key ID. - `aws_secret_access_key`: Your AWS secret access key. - `config`: A NovaSonicConfig object for advanced options: - `voice`: (str or None) The voice to use for audio output (e.g., "matthew", "tiffany", "amy"). 
- `temperature`: (float or None) Sampling temperature for response randomness. - `top_p`: (float or None) Nucleus sampling probability. - `max_tokens`: (int or None) Maximum number of tokens in the output ## Additional Resources The following resources provide more information about using AWS Nova Sonic with VideoSDK Agents SDK. - **[Plugin quickstart](https://github.com/videosdk-live/agents-quickstart/blob/main/Realtime%20Pipeline/AWS%20Nova%20Sonic/aws_novasonic_agent_quickstart.py)**: Quickstart for the AWS Nova Sonic API plugin. - **[AWS Nova Sonic docs](https://docs.aws.amazon.com/nova/latest/userguide/speech.html)**: AWS Nova Sonic documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Azure Voice Live API hide_title: false hide_table_of_contents: false description: "Learn how to use Azure's Voice Live API with the VideoSDK AI Agent SDK. This guide covers model configuration, real-time speech interactions, and integration with your agent pipeline." pagination_label: "Azure Voice Live" keywords: - Azure - Voice Live API - AzureVoiceLive - AzureVoiceLiveConfig - Real-time AI - VideoSDK Agents - Python SDK - Speech-to-Speech - GPT-4o - Microsoft - Azure Speech Services - Azure AI Speech - Azure AI Foundry image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: Azure Voice Live slug: azure-voice-live --- # Azure Voice Live API (Beta) The Azure Voice Live API provider enables your agent to use Microsoft's comprehensive speech-to-speech solution for low-latency, high-quality voice interactions. This unified API eliminates the need to manually orchestrate multiple components by integrating speech recognition, generative AI, and text-to-speech into a single interface. :::note Preview Feature This feature is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads. 
Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/voice-live). ::: ## Installation Install the Azure-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-azure" ``` ## Authentication The Azure Voice Live plugin requires an Azure AI Services resource with Cognitive Services endpoint. **Setup Steps:** 1. Create an AI Services resource for Speech in the [Azure portal](https://portal.azure.com) or from [Azure AI Foundry](https://ai.azure.com/foundryProject/overview) 2. Get the AI Services resource endpoint and primary key. After your resource is deployed, select "Go to resource" to view and manage keys Set `AZURE_VOICE_LIVE_ENDPOINT` and `AZURE_VOICE_LIVE_API_KEY` in your `.env` file: ```bash AZURE_VOICE_LIVE_ENDPOINT=your-azure-ai-service-endpoint AZURE_VOICE_LIVE_API_KEY=your-azure-ai-service-primary-key ``` ## Importing ```python from videosdk.plugins.azure import AzureVoiceLive, AzureVoiceLiveConfig from videosdk.agents import RealTimePipeline ``` ## Example Usage ```python from videosdk.plugins.azure import AzureVoiceLive, AzureVoiceLiveConfig from videosdk.agents import RealTimePipeline # Configure the Voice Live API settings config = AzureVoiceLiveConfig( voice="en-US-EmmaNeural", # Azure neural voice temperature=0.7, turn_detection_timeout=1000, enable_interruption=True ) # Initialize the Azure Voice Live model model = AzureVoiceLive( # When environment variables are set in .env - DON'T pass credentials # api_key="your-azure-speech-key", model="gpt-4o-realtime-preview", config=config ) # Create the pipeline with the model pipeline = RealTimePipeline(model=model) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. 
The SDK automatically reads environment variables, so omit `api_key`, `speech_region`, and other credential parameters from your code. ::: :::note To initiate a conversation with Azure Voice Live, the user must speak first. The model listens for user input to begin the interaction. ::: ## Configuration Options - `model`: The Voice Live model to use (e.g., `"gpt-4o-realtime-preview"`, `"gpt-4o-mini-realtime-preview"`) - `api_key`: Your Azure Speech API key (can also be set via environment variable) - `speech_region`: Your Azure Speech region (can also be set via environment variable) - `credential`: Azure DefaultAzureCredential for authentication (alternative to API key) - `config`: An `AzureVoiceLiveConfig` object for advanced options: - `voice`: (str) The Azure neural voice to use (e.g., `"en-US-EmmaNeural"`, `"hi-IN-AnanyaNeural"`) - `temperature`: (float) Sampling temperature for response randomness (default: 0.7) - `turn_detection_timeout`: (int) Timeout for turn detection in milliseconds - `enable_interruption`: (bool) Allow users to interrupt the agent during speech - `noise_suppression`: (bool) Enable noise suppression for clearer audio - `echo_cancellation`: (bool) Enable echo cancellation - `phrase_list`: (List[str]) Custom phrases for improved recognition accuracy ## See it in Action Explore a complete, end-to-end implementation of an agent using this provider in our [AI Agent Quickstart Guide](https://docs.videosdk.live/ai_agents/voice-agent-quick-start). ## Additional Resources The following resources provide more information about using Azure Voice Live with VideoSDK Agents SDK. - **[Azure Voice Live API Documentation](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/voice-live)**: Complete Azure Voice Live API documentation. - **[Azure Speech Service Overview](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/overview)**: Overview of Azure Speech services. 
import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Google Gemini (LiveAPI) hide_title: false hide_table_of_contents: false description: "Learn how to use Google's Gemini models with the VideoSDK AI Agent SDK. This guide covers model configuration, streaming audio, and integration with your agent pipeline." pagination_label: "Google Gemini" keywords: - Google Gemini - GeminiRealtime - GeminiLiveConfig - Real-time AI - VideoSDK Agents - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Google Gemini (LiveAPI) slug: google-live-api --- # Google Gemini (LiveAPI) The Google Gemini (Live API) provider allows your agent to leverage Google's Gemini models for real-time, multimodal AI interactions. ## Installation Install the Gemini-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-google" ``` ## Authentication The Google plugin requires a [Gemini API key](https://aistudio.google.com/apikey). Set `GOOGLE_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig ``` ## Example Usage ```python from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig from videosdk.agents import RealTimePipeline # Initialize the Gemini real-time model model = GeminiRealtime( model="gemini-2.5-flash-native-audio-preview-12-2025", # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-google-api-key", config=GeminiLiveConfig( voice="Leda", # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr. response_modalities=["AUDIO"] ) ) # Create the pipeline with the model pipeline = RealTimePipeline(model=model) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code.
::: ## Vertex AI Integration You can also use Google's Gemini models through Vertex AI. This requires a different authentication and configuration setup. ### Authentication for Vertex AI For Vertex AI, you need to set up Google Cloud credentials. Create a service account, download the JSON key file, and set the path to this file in your environment. ```bash export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/keyfile.json" ``` You should also configure your project ID and location. These can be set as environment variables or directly in the code. If not set, the `project_id` is inferred from the credentials file and the `location` defaults to `us-central1`. ```bash export GOOGLE_CLOUD_PROJECT="your-gcp-project-id" export GOOGLE_CLOUD_LOCATION="your-gcp-location" ``` ### Example Usage with Vertex AI To use Vertex AI, set `vertexai=True` when initializing `GeminiRealtime`. You can configure the project and location using `VertexAIConfig`, which will take precedence over environment variables. ```python from videosdk.plugins.google import GeminiRealtime, VertexAIConfig from videosdk.agents import RealTimePipeline # Initialize the Gemini real-time model with Vertex AI configuration model = GeminiRealtime( model="gemini-live-2.5-flash-native-audio", vertexai=True, vertexai_config=VertexAIConfig( project_id="videosdk", location="us-central1" ) ) # Create the pipeline with the model pipeline = RealTimePipeline(model=model) ``` ## Vision Support Google Gemini Live can also accept `video stream` directly from the VideoSDK room. To enable this, simply turn on your camera and set the vision flag to true in the session context. Once that's done, start your agent as usual—no additional changes are required in the pipeline. 
```python from videosdk.agents import AgentSession, JobContext, RealTimePipeline, RoomOptions # `model` and `my_agent` are assumed to be defined as in the earlier examples pipeline = RealTimePipeline(model=model) session = AgentSession( agent=my_agent, pipeline=pipeline, ) job_context = JobContext( room_options=RoomOptions( room_id="YOUR_ROOM_ID", name="Agent", vision=True ) ) ``` - `vision` (bool, room options): when `True`, forwards the video stream from the VideoSDK room to Gemini’s LiveAPI (defaults to `False`). ## See it in Action Explore a complete, end-to-end implementation of an agent using this provider in our [AI Agent Quickstart Guide](https://docs.videosdk.live/ai_agents/voice-agent-quick-start). ## Configuration Options - `model`: The Gemini model to use (e.g., `"gemini-2.5-flash-native-audio-preview-12-2025"`). Other supported models include: `"gemini-2.5-flash-preview-native-audio-dialog"` and `"gemini-2.5-flash-exp-native-audio-thinking-dialog"`. - `api_key`: Your Google API key (can also be set via environment variable) - `config`: A `GeminiLiveConfig` object for advanced options: - `voice`: (str or None) The voice to use for audio output (e.g., `"Puck"`). - `language_code`: (str or None) The language code for the conversation (e.g., `"en-US"`). - `temperature`: (float or None) Sampling temperature for response randomness. - `top_p`: (float or None) Nucleus sampling probability. - `top_k`: (float or None) Top-k sampling for response diversity. - `candidate_count`: (int or None) Number of candidate responses to generate. - `max_output_tokens`: (int or None) Maximum number of tokens in the output. - `presence_penalty`: (float or None) Penalty for introducing new topics. - `frequency_penalty`: (float or None) Penalty for repeating tokens. - `response_modalities`: (List[str] or None) List of enabled output modalities: `["TEXT"]` or `["AUDIO"]`, one at a time. - `output_audio_transcription`: (`AudioTranscriptionConfig` or None) Configuration for audio output transcription. ## Additional Resources The following resources provide more information about using Google with the VideoSDK Agents SDK.
- **[Plugin quickstart]()**: Quickstart for the Gemini Realtime API plugin. - **[Gemini docs](https://ai.google.dev/gemini-api/docs/live)**: Gemini Live API documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: OpenAI hide_title: false hide_table_of_contents: false description: "Learn how to use OpenAI's real-time models with the VideoSDK AI Agent SDK. This guide covers model configuration, streaming audio, and integration with your agent pipeline." pagination_label: "OpenAI" keywords: - OpenAI - GPT-4o - Real-time AI - VideoSDK Agents - Python SDK - OpenAIRealtime - OpenAIRealtimeConfig image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: OpenAI slug: openai --- # OpenAI The OpenAI provider enables your agent to use OpenAI's real-time models (like GPT-4o) for text and audio interactions. ## Installation Install the OpenAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-openai" ``` ## Authentication The OpenAI plugin requires an [OpenAI API key](https://platform.openai.com/api-keys). Set `OPENAI_API_KEY` in your `.env` file. 
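Before constructing the model, it can help to confirm that the key is actually available. The following is a minimal, standard-library-only sketch; `load_env_file` and `require_env` are hypothetical helper names and are not part of the VideoSDK or OpenAI packages. It assumes a simple `KEY=VALUE` `.env` format, and a dedicated loader such as `python-dotenv` may be preferable in real projects.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env-style file (hypothetical helper)."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def require_env(name: str, env_file: str = ".env") -> str:
    """Return `name` from the process environment, falling back to the .env file."""
    value = os.environ.get(name) or load_env_file(env_file).get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to {env_file} or export it")
    return value
```

Calling `require_env("OPENAI_API_KEY")` at startup surfaces a missing key as a clear error instead of a failed request later in the session.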
## Importing ```python from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig ``` ## Example Usage ```python from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig from videosdk.agents import RealTimePipeline from openai.types.beta.realtime.session import TurnDetection # Initialize the OpenAI real-time model model = OpenAIRealtime( model="gpt-realtime-2025-08-28", # When OPENAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-openai-api-key", config=OpenAIRealtimeConfig( voice="alloy", # alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, and verse modalities=["text", "audio"], turn_detection=TurnDetection( type="server_vad", threshold=0.5, prefix_padding_ms=300, silence_duration_ms=200, ), tool_choice="auto" ) ) # Create the pipeline with the model pipeline = RealTimePipeline(model=model) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## See it in Action Explore a complete, end-to-end implementation of an agent using this provider in our [AI Agent Quickstart Guide](https://docs.videosdk.live/ai_agents/voice-agent-quick-start). ## Configuration Options - `model`: The OpenAI model to use (e.g., `"gpt-realtime-2025-08-28"`) - `api_key`: Your OpenAI API key (can also be set via environment variable) - `config`: An `OpenAIRealtimeConfig` object for advanced options: - `voice`: (str) The voice to use for audio output (e.g., `"alloy"`). - `temperature`: (float) Sampling temperature for response randomness. - `turn_detection`: (`TurnDetection` or None) Configure how the agent detects when a user has finished speaking. - `input_audio_transcription`: (`InputAudioTranscription` or None) Configure audio-to-text (e.g., Whisper). - `tool_choice`: (str or None) Tool selection mode (e.g., `"auto"`). 
- `modalities`: (list[str]) List of enabled modalities (e.g., `["text", "audio"]`). ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. - **[Plugin quickstart](https://github.com/videosdk-live/agents-quickstart/tree/main/Realtime%20Pipeline/OpenAI)**: Quickstart for the OpenAI Realtime API plugin. - **[OpenAI docs](https://platform.openai.com/docs/guides/realtime)**: OpenAI Realtime API documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Ultravox hide_title: false hide_table_of_contents: false description: "Learn how to use Ultravox's real-time AI models with the VideoSDK AI Agent SDK. This guide covers model configuration, function calling, MCP integration, and connecting to your agent pipeline." pagination_label: "Ultravox" keywords: - Ultravox - Real-time AI - VideoSDK Agents - Python SDK - UltravoxRealtime - UltravoxLiveConfig image: img/videosdklive-thumbnail.jpg sidebar_position: 5 sidebar_label: Ultravox slug: ultravox --- # Ultravox The Ultravox provider enables your agent to use Ultravox's models for real-time, conversational AI interactions. ## Installation Install the Ultravox-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-ultravox" ``` ## Authentication The Ultravox plugin requires an [Ultravox API key](https://app.ultravox.ai/). Set the `ULTRAVOX_API_KEY` in your `.env` file. 
## Importing ```python from videosdk.plugins.ultravox import UltravoxRealtime, UltravoxLiveConfig ``` ## Example Usage ```python from videosdk.plugins.ultravox import UltravoxRealtime, UltravoxLiveConfig from videosdk.agents import RealTimePipeline # Initialize the Ultravox real-time model model = UltravoxRealtime( model="fixie-ai/ultravox", # When ULTRAVOX_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-ultravox-api-key", config=UltravoxLiveConfig( voice="54ebeae1-88df-4d66-af13-6c41283b4332" ) ) # Create the pipeline with the model pipeline = RealTimePipeline(model=model) ``` :::note When using a `.env` file for credentials, you do not need to pass the `api_key` as an argument to the model instance; the SDK reads it automatically. ::: ## Key Features - **Real-time Interactions**: Utilize Ultravox's powerful models for low-latency voice conversations. - **Function Calling**: Empower your agent to perform actions like retrieving weather data or calling external APIs. - **Custom Agent Behaviors**: Define a unique personality and interaction style for your agent through system prompts. - **Call Control**: Agents can manage the conversation flow and gracefully terminate calls. - **MCP Integration**: Connect to external tools and data sources using the Model Context Protocol (MCP) via `MCPServerStdio` for local processes or `MCPServerHTTP` for remote services. ## Configuration Options - `model`: The Ultravox model to use (e.g., `"fixie-ai/ultravox"`). - `api_key`: Your Ultravox API key (can also be set via the `ULTRAVOX_API_KEY` environment variable). - `config`: An `UltravoxLiveConfig` object for advanced options: - `voice`: (str) The Voice ID for the synthesized speech. - `language_hint`: (str) A hint for the conversation's language (e.g., `"en"`). - `temperature`: (float) Controls the randomness of responses (0.0 to 1.0). - `vad_turn_endpoint_delay`: (int) Delay in milliseconds for voice activity detection to determine the end of a turn. 
- `vad_minimum_turn_duration`: (int) The minimum duration in milliseconds for a valid speech turn. ## Additional Resources The following resources provide more information about using Ultravox with the VideoSDK Agents SDK. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: xAI (Grok) hide_title: false hide_table_of_contents: false description: "Learn how to use xAI's Grok models with the VideoSDK AI Agent SDK. This guide covers model configuration, real-time speech interactions, and integration with your agent pipeline." pagination_label: "xAI (Grok)" keywords: - xAI - Grok - Real-time AI - VideoSDK Agents - Python SDK - XAIRealtime - XAIRealtimeConfig image: img/videosdklive-thumbnail.jpg sidebar_position: 4 sidebar_label: xAI (Grok) slug: xai-grok --- # xAI (Grok) The xAI (Grok) provider enables your agent to use xAI's powerful Grok models for real-time, multimodal AI interactions. ## Installation Install the xAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-xai" ``` ## Authentication The xAI plugin requires an [xAI API key](https://console.x.ai). Set `XAI_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.xai import XAIRealtime, XAIRealtimeConfig ``` ## Example Usage ```python from videosdk.plugins.xai import XAIRealtime, XAIRealtimeConfig from videosdk.agents import RealTimePipeline # Initialize the xAI Grok real-time model model = XAIRealtime( model="grok-4-1-fast-non-reasoning", # When XAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-xai-api-key", config=XAIRealtimeConfig( voice="Eve", # collection_id="your-collection-id" # Optional ) ) # Create the pipeline with the model pipeline = RealTimePipeline(model=model) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit the `api_key` parameter from your code. 
::: ## Key Features - **Multi-modal Interactions**: Utilize xAI's powerful Grok models for voice and text. - **Function Calling**: Define custom tools to retrieve weather data, interact with external APIs, or perform other actions. - **Web Search**: Enable real-time web search capabilities by setting `enable_web_search=True`. - **X Search**: Access X (formerly Twitter) content by setting `enable_x_search=True` and providing `allowed_x_handles`. ## Configuration Options - `model`: The Grok model to use (e.g., `"grok-4-1-fast-non-reasoning"`). - `api_key`: Your xAI API key (can also be set via the `XAI_API_KEY` environment variable). - `config`: An `XAIRealtimeConfig` object for advanced options: - `voice`: (str) The voice to use for audio output (e.g., `"Eve"`, `"Ara"`, `"Rex"`, `"Sal"`, `"Leo"`). - `enable_web_search`: (bool) Enable or disable web search capabilities. - `enable_x_search`: (bool) Enable or disable search on X (Twitter). - `allowed_x_handles`: (List[str]) A list of allowed X handles to search within. - `collection_id`: (str, optional) The ID of a custom collection from your xAI Console storage to provide additional context. - `turn_detection`: Configuration for detecting when a user has finished speaking. ## Collection Storage xAI Grok supports using "collections" to provide additional context to your agent, grounding its responses in your own documents or data. To use a collection: 1. **Navigate to xAI Console**: Go to your [console.x.ai](https://console.x.ai) dashboard. 2. **Access Storage**: Click on the **Storage** section in the sidebar. 3. **Create New Collection**: Click the "Create New Collection" button. 4. **Upload Files**: Upload your relevant documents or data files to the new collection. 5. **Get Collection ID**: Once the collection is created, copy its **Collection ID**. 6. 
**Use in Config**: Pass the copied ID to your agent's configuration: ```python config=XAIRealtimeConfig( voice="Eve", collection_id="your-collection-id-from-console", # ... other config options ) ``` The agent will now use the content of this collection to inform its responses. ## Additional Resources The following resources provide more information about using xAI (Grok) with the VideoSDK Agents SDK. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Silero VAD hide_title: false hide_table_of_contents: false description: "Learn how to use Silero's VAD with the VideoSDK AI Agent SDK. This guide covers model configuration, related events." pagination_label: "Silero VAD" keywords: - Silero - VAD - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Silero VAD slug: silero-vad --- # Silero VAD The Silero VAD (Voice Activity Detection) provider enables your agent to detect when users start and stop speaking. When added to a cascading pipeline, it automatically enables interrupt functionality - allowing users to interrupt the agent mid-response. 
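To build intuition for how start and stop detection can work, here is a simplified sketch of a threshold-based state machine over per-frame speech probabilities. It is an illustration under stated assumptions, not the Silero model or the plugin's actual implementation: `detect_turns` is a hypothetical function, `frame_s` (seconds per frame) is an assumed input, and the parameter names mirror `threshold`, `min_speech_duration`, and `min_silence_duration` from the Configuration Options on this page.

```python
import math


def detect_turns(probs, frame_s, threshold=0.3,
                 min_speech_duration=0.1, min_silence_duration=0.75):
    """Return (event, time_s) tuples for speech start/end (illustrative only)."""
    # Convert the duration settings into whole frames to avoid float drift.
    speech_frames = max(1, math.ceil(min_speech_duration / frame_s))
    silence_frames = max(1, math.ceil(min_silence_duration / frame_s))

    events = []
    speaking = False
    run = 0  # consecutive frames that disagree with the current state
    for i, prob in enumerate(probs):
        is_speech = prob >= threshold
        if is_speech == speaking:
            run = 0  # the frame agrees with the current state, so reset
            continue
        run += 1
        needed = silence_frames if speaking else speech_frames
        if run >= needed:
            # The disagreement lasted long enough: flip state and emit an event.
            speaking = not speaking
            kind = "speech_start" if speaking else "speech_end"
            events.append((kind, (i + 1) * frame_s))
            run = 0
    return events
```

In this sketch a pause shorter than `min_silence_duration` does not end the turn, which is why raising that value makes the detector more tolerant of mid-sentence hesitation at the cost of slower end-of-speech detection.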
## Installation Install the Silero VAD-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-silero" ``` ## Importing ```python from videosdk.plugins.silero import SileroVAD ``` ## Example Usage ```python from videosdk.plugins.silero import SileroVAD from videosdk.agents import CascadingPipeline # Initialize the Silero VAD vad = SileroVAD( input_sample_rate=48000, model_sample_rate=16000, threshold=0.3, min_speech_duration=0.1, min_silence_duration=0.75, prefix_padding_duration=0.3 ) # Add VAD to cascading pipeline - automatically enables interrupts pipeline = CascadingPipeline(vad=vad) ``` ## Configuration Options - `input_sample_rate`: (int) Sample rate of input audio in Hz (default: `48000`) - `model_sample_rate`: (Literal[8000, 16000]) Model's expected sample rate (default: `16000`) - `threshold`: (float) Voice activity detection sensitivity (0.0 to 1.0, default: `0.3`) - `min_speech_duration`: (float) Minimum speech duration to trigger detection in seconds (default: `0.1`) - `min_silence_duration`: (float) Minimum silence duration to end speech detection in seconds (default: `0.75`) - `max_buffered_speech`: (float) Maximum speech buffer duration in seconds (default: `60.0`) - `force_cpu`: (bool) Force CPU usage instead of GPU acceleration (default: `True`) - `prefix_padding_duration`: (float) Audio padding before speech detection in seconds (default: `0.3`) ## Additional Resources The following resources provide more information about using Silero VAD with VideoSDK Agents SDK. - **[Silero VAD project](https://github.com/snakers4/silero-vad)**: The open source VAD model that powers the VideoSDK Silero VAD plugin. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: AssemblyAI STT hide_title: false hide_table_of_contents: false description: "Learn how to use AssemblyAI's real-time speech-to-text models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing streaming transcription." pagination_label: "AssemblyAI STT" keywords: - AssemblyAI - real-time transcription - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: AssemblyAI slug: assemblyai --- # AssemblyAI STT The AssemblyAI STT provider enables your agent to use AssemblyAI's real-time WebSocket API for fast and accurate speech-to-text conversion. ## Installation Install the AssemblyAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-assemblyai" ``` ## Authentication The AssemblyAI plugin requires an [AssemblyAI API key](https://www.assemblyai.com/dashboard/docs/your-api-key). Set `ASSEMBLYAI_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.assemblyai import AssemblyAISTT ``` ## Example Usage ```python from videosdk.plugins.assemblyai import AssemblyAISTT from videosdk.agents import CascadingPipeline # Initialize the AssemblyAI STT model stt = AssemblyAISTT( api_key="your-assemblyai-api-key", language_code="en_us" ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your AssemblyAI API key (required, can also be set via `ASSEMBLYAI_API_KEY` environment variable). - `language_code`: The language code for transcription (e.g., `"en_us"`, `"es"`). ## Additional Resources The following resources provide more information about using AssemblyAI with the VideoSDK Agents SDK. 
- **[AssemblyAI Docs](https://www.assemblyai.com/docs/guides/speech-to-text/real-time-streaming-transcription)**: AssemblyAI's official real-time streaming transcription documentation.

```
import PluginResourceCards from '@site/src/components/PluginResourceCards'
```

---

---
title: Azure OpenAI STT
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Azure OpenAI's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for Azure OpenAI's services"
pagination_label: "Azure OpenAI STT"
keywords:
  - OpenAI
  - Azure
  - Azure OpenAI
  - gpt-4o-mini-transcribe
  - whisper-1
  - STT
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Speech To Text
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 2
sidebar_label: Azure OpenAI
slug: azureopenai
---

# Azure OpenAI STT

The Azure OpenAI STT provider enables your agent to use Azure OpenAI's speech-to-text models (like Whisper) for converting audio input to text.

## Installation

Install the Azure OpenAI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-openai"
```

## Authentication

The Azure OpenAI plugin requires an [Azure OpenAI API key](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/create-resource?pivots=web-portal). Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, and `OPENAI_API_VERSION` in your `.env` file.

## Importing

```python
from videosdk.plugins.openai import OpenAISTT
```

## Example Usage

```python
from videosdk.plugins.openai import OpenAISTT
from videosdk.agents import CascadingPipeline

# Initialize the Azure OpenAI STT model
stt = OpenAISTT.azure(
    azure_deployment="gpt-4o-transcribe",
    language="en",
)

# Add stt to cascading pipeline
pipeline = CascadingPipeline(stt=stt)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects.
The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `azure_deployment`: The OpenAI deployment ID to use (by default it is model name: e.g., `"gpt-4o-mini-transcribe"`, `"gpt-4o-transcribe"`) - `api_key`: Your Azure OpenAI API key (can also be set via environment variable) - `azure_endpoint`: Your Azure OpenAI Deployment Endpoint URL (can also be set via environment variable) - `api_version`: Your Azure OpenAI API version (can also be set via environment variable) - `language`: (str) Language code for transcription (default: `"en"`) ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Azure STT hide_title: false hide_table_of_contents: false description: "Learn how to use Azure's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for Azure's services" pagination_label: "Azure STT" keywords: - Azure - Speech-to-Text - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI - Microsoft - Azure Speech Services - Azure AI Speech - Azure AI Foundry image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: Azure AI Speech slug: azure-ai-stt --- # Azure STT The Azure STT provider enables your agent to use Microsoft Azure's advanced speech-to-text models for high-accuracy, real-time audio transcription with support for multiple languages and custom phrase lists. ## Installation Install the Azure-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-azure" ``` ## Importing ```python from videosdk.plugins.azure import AzureSTT ``` ## Authentication The Azure STT plugin requires an Azure AI Speech Service resource. **Setup Steps:** 1. 
Create an AI Services resource for Speech in the [Azure portal](https://portal.azure.com) or from [Azure AI Foundry](https://ai.azure.com/foundryProject/overview) 2. Get the Speech resource key and region. After your Speech resource is deployed, select "Go to resource" to view and manage keys Set `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` in your `.env` file: ```bash AZURE_SPEECH_KEY=your-azure-speech-key AZURE_SPEECH_REGION=your-azure-region ``` ## Example Usage ```python from videosdk.plugins.azure import AzureSTT from videosdk.agents import CascadingPipeline # Initialize the Azure STT model stt = AzureSTT( language="en-US", sample_rate=16000, enable_phrase_list=True, phrase_list=["VideoSDK", "artificial intelligence", "machine learning"] ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using environment variables for credentials, don't pass the `speech_key` and `speech_region` as arguments to the model instance. The SDK automatically reads the environment variables. ::: ## Configuration Options - `speech_key`: (Optional[str]) Azure Speech API key. Uses `AZURE_SPEECH_KEY` environment variable if not provided. - `speech_region`: (Optional[str]) Azure Speech region (e.g., `"eastus"`, `"westus2"`). Uses `AZURE_SPEECH_REGION` environment variable if not provided. - `language`: (str) The language code for transcription (default: `"en-US"`). See [supported languages](https://learn.microsoft.com/en-us/globalization/locale/standard-locale-names). - `sample_rate`: (int) The target audio sample rate in Hz for transcription (default: `16000`). The input audio at 48000Hz will be resampled to this rate. - `enable_phrase_list`: (bool) Whether to enable phrase list for better recognition accuracy (default: `False`). - `phrase_list`: (Optional[List[str]]) List of phrases to boost recognition for domain-specific terms (default: `None`). 
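The `sample_rate` option notes that 48000 Hz input is resampled to the target rate. As a rough illustration of what that means for 48000 to 16000 Hz (an integer factor of 3), the sketch below averages each group of three samples into one. This is a toy under stated assumptions, not the plugin's resampler: `downsample_by_averaging` is a hypothetical helper, and real resamplers apply proper low-pass filtering.

```python
from typing import List


def downsample_by_averaging(
    samples: List[int], in_rate: int = 48000, out_rate: int = 16000
) -> List[int]:
    """Reduce the sample rate by an integer factor by averaging sample groups."""
    if in_rate % out_rate != 0:
        raise ValueError("this sketch only handles integer rate ratios")
    factor = in_rate // out_rate
    # Ignore a trailing group that has fewer than `factor` samples.
    usable = len(samples) - len(samples) % factor
    return [
        int(sum(samples[i:i + factor]) / factor)  # truncates toward zero
        for i in range(0, usable, factor)
    ]
```

With the defaults, 48000 input samples (one second of mono audio) produce 16000 output samples.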
## Additional Resources The following resources provide more information about using Azure with VideoSDK Agents SDK. - **[Azure Speech Service Overview](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/overview)**: Complete overview of Azure Speech services. - **[Azure STT docs](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/index-speech-to-text)**: Azure Speech-to-Text documentation. - **[Getting Started Guide](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speech-to-text?tabs=macos&pivots=programming-language-python#prerequisites)**: Azure STT setup and prerequisites. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Cartesia STT hide_title: false hide_table_of_contents: false description: "Learn how to use Cartesia's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for Cartesia's services" pagination_label: "Cartesia STT" keywords: - Cartesia - Speech-to-Text - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Cartesia slug: cartesia-stt --- # Cartesia STT The Cartesia STT provider enables your agent to use Cartesia's advanced speech-to-text models for high-accuracy, real-time audio transcription. ## Installation Install the Cartesia-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-cartesia" ``` ## Importing ```python from videosdk.plugins.cartesia import CartesiaSTT ``` ## Authentication The Cartesia plugin requires a [Cartesia API key](https://play.cartesia.ai/keys). Set `CARTESIA_API_KEY` in your `.env` file. 
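If it is unclear how a `.env` entry ends up as an environment variable, the sketch below shows the general idea using only the standard library: parse `KEY=value` lines and copy them into a mapping without overriding values that are already set. `load_env_file` is a hypothetical helper for illustration only. In practice a package such as `python-dotenv` is commonly used, and how the SDK itself loads the file is not covered here.

```python
import os
from typing import MutableMapping


def load_env_file(text: str, environ: MutableMapping[str, str] = os.environ) -> None:
    """Copy KEY=value pairs from .env-style text into `environ`."""
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault keeps a value that is already present in the environment.
        environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


# Demo on a plain dict so the real process environment is left untouched.
demo_env = {}
load_env_file("# Cartesia credentials\nCARTESIA_API_KEY=your-cartesia-api-key\n", demo_env)
```

After the demo call, `demo_env` holds the single `CARTESIA_API_KEY` entry from the text.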
## Example Usage ```python from videosdk.plugins.cartesia import CartesiaSTT from videosdk.agents import CascadingPipeline # Initialize the Cartesia STT model stt = CartesiaSTT( # When CARTESIA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-cartesia-api-key", language="en-US", model="ink-whisper", ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using an environment variable for credentials, don't pass the `api_key` as an argument to the model instance. The SDK automatically reads the environment variable. ::: ## Configuration Options - `api_key`: (str) Your Cartesia API key. Can also be set via the `CARTESIA_API_KEY` environment variable. - `model`: (str) The Cartesia STT model to use (e.g., `"ink-whisper"`). Defaults to `"ink-whisper"`. - `language`: (str) Language code for transcription (default: `"en"`). ## Additional resources The following resources provide more information about using Cartesia with VideoSDK Agents. - **[Cartesia docs](https://docs.cartesia.ai/build-with-cartesia/models/stt)**: Cartesia STT docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Deepgram STT hide_title: false hide_table_of_contents: false description: "Learn how to use Deepgram's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for Deepgram's services" pagination_label: "Deepgram STT" keywords: - Deepgram - nova-2 - nova-3 - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Deepgram slug: deepgram --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Deepgram STT The Deepgram STT provider enables your agent to use Deepgram's advanced speech-to-text models for high-accuracy, real-time audio transcription. 
## Installation Install the Deepgram-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-deepgram" ``` ## Authentication The Deepgram plugin requires a [Deepgram API key](https://console.deepgram.com/). Set `DEEPGRAM_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.deepgram import DeepgramSTT ``` ```python from videosdk.plugins.deepgram import DeepgramSTTV2 ``` ## Example Usage ```python from videosdk.plugins.deepgram import DeepgramSTT from videosdk.agents import CascadingPipeline # Initialize the Deepgram STT model stt = DeepgramSTT( # When DEEPGRAM_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-deepgram-api-key", model="nova-2", language="en-US", interim_results=True, punctuate=True, smart_format=True, profanity_filter=False, numerals=False, tag=None, enable_diarization=False, ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` ```python from videosdk.plugins.deepgram import DeepgramSTTV2 from videosdk.agents import CascadingPipeline # Initialize the Deepgram STT V2 model with Flux stt = DeepgramSTTV2( # When DEEPGRAM_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-deepgram-api-key", model="flux-general-en", eager_eot_threshold=0.6, eot_threshold=0.8, eot_timeout_ms=7000, enable_preemptive_generation=True, tag=None ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. 
:::

## Configuration Options

**`DeepgramSTT`**

- `api_key`: Your Deepgram API key (can also be set via the `DEEPGRAM_API_KEY` environment variable)
- `model`: The Deepgram model to use (e.g., `"nova-2"`, `"nova-3"`, `"whisper-large"`) (default: `"nova-2"`)
- `language`: (str) Language code for transcription (e.g., `"en-US"`, `"es"`, `"fr"`) (default: `"en-US"`)
- `interim_results`: (bool) Enable real-time partial transcription results (default: `True`)
- `punctuate`: (bool) Add punctuation to transcription (default: `True`)
- `smart_format`: (bool) Apply intelligent formatting to output (default: `True`)
- `filler_words`: (bool) Include filler words like "uh", "um" in transcription (default: `True`)
- `sample_rate`: (int) Audio sample rate in Hz (default: `48000`)
- `endpointing`: (int) Silence detection threshold in milliseconds (default: `50`)
- `base_url`: (str) WebSocket endpoint URL (default: `"wss://api.deepgram.com/v1/listen"`)
- `profanity_filter`: (bool) Whether to filter profanity from the transcription. Defaults to `False`.
- `numerals`: (bool) Whether to include numerals in the transcription. Defaults to `False`.
- `tag`: (str | list[str]) Tag or list of tags to add to the requests for usage reporting. Defaults to `None`.
- `enable_diarization`: (bool) Diarization recognizes speaker changes and assigns a speaker to each word in the transcript. Defaults to `False`.

**`DeepgramSTTV2`**

- `api_key`: Your Deepgram API key (can also be set via the `DEEPGRAM_API_KEY` environment variable)
- `model`: The Flux model to use; the language is embedded in the model name (default: `"flux-general-en"`; currently only English is available)
- `input_sample_rate`: (int) Input audio sample rate in Hz (default: `48000`)
- `target_sample_rate`: (int) Target sample rate for Deepgram processing (default: `16000`)
- `eager_eot_threshold`: Confidence threshold for early end-of-turn detection, range 0.0-1.0 (default: `0.6`)
  - Lower values = more aggressive early detection
  - Higher values = wait for higher confidence before early turn end
- `eot_threshold`: Standard end-of-turn confidence threshold, range 0.0-1.0 (default: `0.8`)
  - Controls when a turn is definitively ended
- `eot_timeout_ms`: Timeout in milliseconds before forcing end-of-turn (default: `7000`)
  - Maximum silence duration before automatically ending turn
- `base_url`: (str) WebSocket endpoint URL (default: `"wss://api.deepgram.com/v2/listen"`)
- `tag`: (str | list[str]) Tag or list of tags to add to the requests for usage reporting. Defaults to `None`.
- `enable_preemptive_generation`: (bool) Enable preemptive generation based on EagerEndOfTurn events (default: `False`).

## Additional Resources

The following resources provide more information about using Deepgram with VideoSDK Agents SDK.
- **[Deepgram docs V1](https://developers.deepgram.com/docs/live-streaming-audio)**: Deepgram's STT V1 docs
- **[Deepgram docs V2](https://developers.deepgram.com/docs/flux/quickstart)**: Deepgram's STT V2 docs
- **[GitHub URL V1](https://github.com/videosdk-live/agents/blob/main/videosdk-plugins/videosdk-plugins-deepgram/videosdk/plugins/deepgram/stt.py)**: Deepgram STT plugin source code
- **[GitHub URL V2](https://github.com/videosdk-live/agents/blob/main/videosdk-plugins/videosdk-plugins-deepgram/videosdk/plugins/deepgram/stt_v2.py)**: Deepgram STT V2 plugin source code

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: ElevenLabs STT
hide_title: false
hide_table_of_contents: false
description: "Learn how to use ElevenLabs's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for ElevenLabs's services"
pagination_label: "ElevenLabs STT"
keywords:
  - ElevenLabs
  - scribe_v2_realtime
  - STT
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Speech To Text
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: ElevenLabs
slug: eleven-labs
---

# ElevenLabs STT

The ElevenLabs STT provider enables your agent to use ElevenLabs' speech-to-text models for high-accuracy, real-time audio transcription with built-in voice activity detection.

## Installation

Install the ElevenLabs-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-elevenlabs"
```

## Importing

```python
from videosdk.plugins.elevenlabs import ElevenLabsSTT
```

## Authentication

The ElevenLabs plugin requires an [ElevenLabs API key](https://elevenlabs.io/app/settings/api-keys). Set `ELEVENLABS_API_KEY` in your `.env` file.
## Example Usage ```python from videosdk.plugins.elevenlabs import ElevenLabsSTT from videosdk.agents import CascadingPipeline # Initialize the ElevenLabs STT model stt = ElevenLabsSTT( # When ELEVENLABS_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-elevenlabs-api-key", model_id="scribe_v2_realtime", language_code="en", commit_strategy="vad", vad_silence_threshold_secs=0.8, vad_threshold=0.4, min_speech_duration_ms=50, min_silence_duration_ms=50, include_language_detection=False ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your ElevenLabs API key (can also be set via ELEVENLABS_API_KEY environment variable) - `model_id`: (str) STT model identifier (default: `"scribe_v2_realtime"`) - `language_code`: (str) Language code for transcription (default: `"en"`) - `sample_rate`: (int) Sample rate of input audio in Hz (default: `48000`) - `commit_strategy`: (str) Strategy for committing transcripts (default: `"vad"`) - `"vad"` - Voice Activity Detection based commit strategy - `vad_silence_threshold_secs`: (float) Duration of silence in seconds to detect end-of-speech (default: `0.8`) - `vad_threshold`: (float) Threshold for detecting voice activity (default: `0.4`) - `min_speech_duration_ms`: (int) Minimum duration in milliseconds for a speech segment (default: `50`) - `min_silence_duration_ms`: (int) Minimum duration in milliseconds of silence to consider end-of-speech (default: `50`) - `include_language_detection`: (bool) Whether to include language detection in the transcription (default: `False`) ## Additional Resources The following resources provide more information about using ElevenLabs with VideoSDK Agents SDK. 
- **[ElevenLabs docs](https://elevenlabs.io/docs)**: ElevenLabs STT docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Gladia STT hide_title: false hide_table_of_contents: false description: "Learn how to use Gladia's real-time speech-to-text models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing streaming transcription." pagination_label: "Gladia STT" keywords: - Gladia - STT - Speech-to-Text - real-time transcription - code-switching - VideoSDK Agents - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 4 sidebar_label: Gladia slug: gladia --- # Gladia STT The Gladia STT provider enables your agent to use Gladia's fast and accurate speech-to-text models for real-time audio transcription with support for multiple languages and code-switching. ## Installation Install the Gladia-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-gladia" ``` ## Authentication The Gladia plugin requires a [Gladia API key](https://app.gladia.io/signup). Set `GLADIA_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.gladia import GladiaSTT ``` ## Example Usage ```python from videosdk.plugins.gladia import GladiaSTT from videosdk.agents import CascadingPipeline # Initialize the Gladia STT model stt = GladiaSTT( # When GLADIA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-gladia-api-key", languages=["en"], code_switching=True, receive_partial_transcripts=True ) # Add stt to a cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using a `.env` file for credentials, you do not need to pass the `api_key` as an argument to the model instance; the SDK reads it automatically. ::: ## Configuration Options - `api_key`: (str, optional) Your Gladia API key. Can also be set via the `GLADIA_API_KEY` environment variable. - `model`: (str, optional) The model to use. Defaults to `"solaria-1"`. 
- `languages`: (List[str], optional) A list of language codes to detect (e.g., `["en", "fr"]`). Defaults to `["en"]`. - `code_switching`: (bool, optional) Enables automatic language switching between the provided languages. Defaults to `True`. - `input_sample_rate`: (int, optional) The sample rate of the incoming audio. Defaults to `48000`. - `output_sample_rate`: (int, optional) The sample rate Gladia should process. Defaults to `16000`. - `encoding`: (str, optional) The audio encoding format. Defaults to `"wav/pcm"`. - `bit_depth`: (int, optional) The bit depth of the audio. Defaults to `16`. - `channels`: (int, optional) The number of audio channels. Defaults to `1` (mono). - `receive_partial_transcripts`: (bool, optional) Set to `True` to receive interim transcription results for lower latency. Defaults to `False`. --- --- title: Google STT hide_title: false hide_table_of_contents: false description: "Learn how to use Google's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for Google's services" pagination_label: "Google STT" keywords: - Google - Speech-to-Text - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Google slug: google --- # Google STT The Google STT provider enables your agent to use Google's advanced speech-to-text models for high-accuracy, real-time audio transcription. ## Installation Install the Google-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-google" ``` ## Importing ```python from videosdk.plugins.google import GoogleSTT, VoiceActivityConfig ``` ## Setup Credentials/Authentication To use Google STT, you need to set up your Google Cloud credentials. You can do this by setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your service account key file. 
```bash export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/keyfile.json" ``` Alternatively, you can pass the path to the key file directly to the `GoogleSTT` constructor via the `api_key` parameter. **or** Set `GOOGLE_APPLICATION_CREDENTIALS` in your `.env` file. ## Example Usage ```python from videosdk.plugins.google import GoogleSTT, VoiceActivityConfig from videosdk.agents import CascadingPipeline voice_activity_timeout = VoiceActivityConfig( speech_start_timeout=1.0, speech_end_timeout=5.0 ) # Initialize the Google STT model stt = GoogleSTT( # If GOOGLE_APPLICATION_CREDENTIALS is set, you can omit api_key api_key="/path/to/your/keyfile.json", languages="en-US", model="latest_long", interim_results=True, punctuate=True, profanity_filter=False, voice_activity_timeout = voice_activity_timeout ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using an environment variable for credentials, don't pass the `api_key` as an argument to the model instance. The SDK automatically reads the environment variable. ::: ## Configuration Options - `api_key`: (str) Path to your Google Cloud service account JSON file. This can also be set via the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. - `languages`: (Union[str, list[str]]) Language code or a list of language codes for transcription (default: `"en-US"`). - `model`: (str) The Google STT model to use (e.g., `"latest_long"`, `"telephony"`) (default: `"latest_long"`). - `sample_rate`: (int) The target audio sample rate in Hz for transcription (default: `16000`). The input audio at 48000Hz will be resampled to this rate. - `interim_results`: (bool) Enable real-time partial transcription results (default: `True`). - `punctuate`: (bool) Add punctuation to transcription (default: `True`). - `min_confidence_threshold`: (float) The minimum confidence level for a transcription result to be considered valid (default: `0.1`). 
- `location`: (str) The Google Cloud location to use for the STT service (default: `"global"`). - `profanity_filter`: (bool) detect profane words and return only the first letter followed by asterisks in the transcript (default: `False`). - `voice_activity_timeout`: (`VoiceActivityConfig`) Configure speech activity timeouts (default: `None`). - `speech_start_timeout`: (float) Seconds to wait for speech to begin before timing out. Minimum `0.5` (default: `1.0`). - `speech_end_timeout`: (float) Seconds of silence after speech before ending. Minimum `0.1` (default: `5.0`). ## Additional Resources The following resources provide more information about using Google with VideoSDK Agents SDK. - **[Google STT docs](https://cloud.google.com/speech-to-text/docs)**: Google Cloud STT documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Navana STT hide_title: false hide_table_of_contents: false description: "Learn how to use Navana's Bodhi STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech-to-text, with a focus on Indian languages." pagination_label: "Navana STT" keywords: - Navana - Bodhi - Indian languages - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Navana slug: navana --- # Navana STT The Navana STT provider enables your agent to use Navana's Bodhi speech-to-text models, which are highly optimized for a variety of Indian languages and accents. ## Installation Install the Navana-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-navana" ``` ## Authentication The Navana plugin requires a **Customer ID** and an **API Key** from your [Navana Bodhi account](https://bodhi.navana.ai/). Set both `NAVANA_API_KEY` and `NAVANA_CUSTOMER_ID` in your `.env` file. 
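Because two variables are required, a startup check can fail fast with a clear message when either one is missing. A minimal sketch, assuming the check lives in your own application code; `missing_env_vars` is a hypothetical helper and not part of the SDK.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(
    names: Iterable[str], environ: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or blank in `environ`."""
    return [name for name in names if not environ.get(name, "").strip()]


# Check both Navana credentials before constructing the STT instance.
missing = missing_env_vars(["NAVANA_API_KEY", "NAVANA_CUSTOMER_ID"])
if missing:
    print("Missing Navana credentials:", ", ".join(missing))
```

In a real application you would likely raise an error instead of printing, so the agent does not start with incomplete credentials.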
## Importing ```python from videosdk.plugins.navana import NavanaSTT ``` ## Example Usage ```python from videosdk.plugins.navana import NavanaSTT from videosdk.agents import CascadingPipeline # Initialize the Navana STT model stt = NavanaSTT( api_key="your-navana-api-key", customer_id="your-navana-customer-id", model="en-in-general-v2-8khz", language="en-IN" ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key`, `customer_id`, and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your Navana API key (required, can also be set via `NAVANA_API_KEY` environment variable). - `customer_id`: Your Navana Customer ID (required, can also be set via `NAVANA_CUSTOMER_ID` environment variable). - `model`: The Navana STT model to use (e.g., `"en-in-general-v2-8khz"`, `"hi-general-v2-8khz"`). - `language`: The language code for transcription (e.g., `"en-IN"`, `"hi-IN"`). ## Additional Resources The following resources provide more information about using Navana with the VideoSDK Agents SDK. - **[Navana Docs](https://navana.gitbook.io/bodhi/streaming-asr/streaming-websocket)**: Navana's official streaming API documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Nvidia STT hide_title: false hide_table_of_contents: false description: "Learn how to use Nvidia's Riva STT models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing speech to text for Nvidia's services" pagination_label: "Nvidia STT" keywords: - Nvidia - Riva - Parakeet - STT - VideoSDK Agents - Python SDK - Speech To Text - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_label: Nvidia slug: nvidia --- # Nvidia STT The Nvidia STT provider enables your agent to use Nvidia's Riva speech-to-text models for high-performance, low-latency speech recognition. ## Installation Install the Nvidia-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-nvidia" ``` ## Authentication The Nvidia plugin requires an Nvidia API key. Set `NVIDIA_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.nvidia import NvidiaSTT ``` ## Example Usage ```python from videosdk.plugins.nvidia import NvidiaSTT from videosdk.agents import CascadingPipeline # Initialize the Nvidia STT model stt = NvidiaSTT( # When NVIDIA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-nvidia-api-key", model="parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer", language_code="en-US", profanity_filter=False, automatic_punctuation=True ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key and other credential parameters from your code. 
::: ## Configuration Options - `api_key`: Your Nvidia API key (required, can also be set via environment variable) - `model`: The Nvidia Riva model to use (default: `"parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer"`) - `server`: The Nvidia Riva server address (default: `"grpc.nvcf.nvidia.com:443"`) - `function_id`: The specific function ID for the service (default: `"1598d209-5e27-4d3c-8079-4751568b1081"`) - `language_code`: Language code for transcription (default: `"en-US"`) - `sample_rate`: Audio sample rate in Hz (default: `16000`) - `profanity_filter`: (bool) Enable or disable profanity filtering (default: `False`) - `automatic_punctuation`: (bool) Enable or disable automatic punctuation (default: `True`) - `use_ssl`: (bool) Enable SSL connection (default: `True`) ## Additional Resources The following resources provide more information about using Nvidia Riva with VideoSDK Agents SDK. - **[Nvidia Riva docs](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html)**: Nvidia Riva documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: OpenAI STT hide_title: false hide_table_of_contents: false description: "Learn how to use OpenAI's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for OpenAI's services" pagination_label: "OpenAI STT" keywords: - OpenAI - gpt-4o-mini-transcribe - whisper-1 - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: OpenAI slug: openai --- # OpenAI STT The OpenAI STT provider enables your agent to use OpenAI's speech-to-text models (like Whisper) for converting audio input to text. 
## Installation Install the OpenAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-openai" ``` ## Authentication The OpenAI plugin requires an [OpenAI API key](https://platform.openai.com/api-keys). Set `OPENAI_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.openai import OpenAISTT ``` ## Example Usage ```python from videosdk.plugins.openai import OpenAISTT from videosdk.agents import CascadingPipeline # Initialize the OpenAI STT model stt = OpenAISTT( # When OPENAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-openai-api-key", model="whisper-1", language="en", prompt="Transcribe this audio with proper punctuation and formatting." ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your OpenAI API key (required, can also be set via environment variable) - `model`: The OpenAI STT model to use (e.g., `"whisper-1"`, `"gpt-4o-mini-transcribe"`) - `base_url`: Custom base URL for OpenAI API (optional) - `prompt`: (str) Custom prompt to guide transcription style and format - `language`: (str) Language code for transcription (default: `"en"`) - `turn_detection`: (dict) Configuration for detecting conversation turns ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. - **[OpenAI docs](https://platform.openai.com/docs/guides/speech-to-text)**: OpenAI STT API documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Sarvam AI STT hide_title: false hide_table_of_contents: false description: "Learn how to use Sarvam AI's STT models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing speech to text for Sarvam AI's services"
pagination_label: "Sarvam AI STT"
keywords:
  - Sarvam AI
  - saarika:v2
  - saaras:v3
  - STT
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Speech To Text
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 3
sidebar_label: Sarvam AI
slug: sarvam-ai
---

# Sarvam AI STT

The Sarvam AI STT provider enables your agent to use Sarvam AI's speech-to-text models for transcription. This provider uses Voice Activity Detection (VAD) to send audio chunks for transcription after a period of silence.

## Installation

Install the Sarvam AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-sarvamai"
```

## Importing

```python
from videosdk.plugins.sarvamai import SarvamAISTT
```

## Authentication

The Sarvam plugin requires a [Sarvam API key](https://dashboard.sarvam.ai/key-management). Set `SARVAMAI_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.sarvamai import SarvamAISTT
from videosdk.agents import CascadingPipeline

# Initialize the Sarvam AI STT model
stt = SarvamAISTT(
    # When SARVAMAI_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-sarvam-ai-api-key",
    model="saaras:v3",
    language="en-IN",
)

# Add stt to cascading pipeline
pipeline = CascadingPipeline(stt=stt)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `api_key`: (str) Your Sarvam AI API key. Can also be set via the `SARVAMAI_API_KEY` environment variable.
- `model`: (str) The Sarvam AI model to use (default: `"saaras:v3"`).
- `language`: (str) Language code for transcription (default: `"en-IN"`).
- `input_sample_rate`: (int) The sample rate of the audio from the source in Hz (default: `48000`). - `output_sample_rate`: (int) The sample rate to which the audio is resampled before sending for transcription (default: `16000`). - `mode`: (str) Mode of operation. Only applicable for `saaras:v3`. Allowed values: `"transcribe"`, `"translate"`, `"verbatim"`, `"translit"`, `"codemix"` (default: `"transcribe"` for `saaras:v3`, `None` for other models). - `high_vad_sensitivity`: (bool) Whether to use high sensitivity voice activity detection (default: `None`). - `flush_signal`: (bool) Whether to send flush signal (default: `None`). - `translation`: (bool) Enable speech-to-text translation. Supported on `saaras:v3` and `saaras:v2.5` models. When enabled, routes to the translation endpoint (default: `False`). - `prompt`: (str) Prompt to guide the translation. Only applicable when `translation` is `True` (default: `None`). ## Additional Resources The following resources provide more information about using Sarvam AI with VideoSDK Agents SDK. - **[Sarvam docs](https://docs.sarvam.ai/)**: Sarvam's full docs site. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: AWS Polly TTS hide_title: false hide_table_of_contents: false description: "Learn how to use AWS Polly's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for AWS's services" pagination_label: "AWS Polly TTS" keywords: - AWS Polly - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: AWS Polly slug: aws-polly-tts --- # AWS Polly TTS The AWS Polly TTS provider enables your agent to use AWS Polly's high-quality text-to-speech models for generating natural-sounding voice output. 
## Installation

Install the AWS Polly-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-aws"
```

## Importing

```python
from videosdk.plugins.aws import AWSPollyTTS
```

## Authentication

Before using the AWS Polly plugin, make sure the following are in place:

- `AWS Account`: You have an active AWS account with permissions to access Amazon Polly.
- `Region Selection`: You're operating in the US East (N. Virginia) (us-east-1) region, as model access is region-specific.
- `AWS Credentials`: Your AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`) are configured, either through environment variables or your preferred credential management method.

## Example Usage

```python
from videosdk.plugins.aws import AWSPollyTTS
from videosdk.agents import CascadingPipeline

# Initialize the AWS Polly TTS model
tts = AWSPollyTTS(
    voice="Joanna",
    engine="neural",
    speed=1.2,
    pitch=0.1,
)

# Add tts to cascading pipeline
pipeline = CascadingPipeline(tts=tts)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `aws_access_key_id`, `aws_secret_access_key`, and other credential parameters from your code.
:::

## Configuration Options

- `voice`: (str) Voice ID for the TTS output (default: `"Joanna"`).
- `engine`: (str) Polly engine type: `"standard"` or `"neural"` (default: `"neural"`).
- `region`: (str) AWS region for Polly service (default: `"us-east-1"` or from `AWS_DEFAULT_REGION`).
- `aws_access_key_id`: (str) AWS access key ID (optional; can be set via environment variable).
- `aws_secret_access_key`: (str) AWS secret access key (optional; can be set via environment variable).
- `aws_session_token`: (str) Optional AWS session token for temporary credentials.
- `speed`: (float) Speech rate multiplier (e.g., `1.0` is normal speed, `1.5` is 50% faster).
- `pitch`: (float) Pitch adjustment (e.g., `0.0` is normal, `0.2` raises pitch).
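The region and credential options above can be checked before the TTS object is constructed. The following is a minimal sketch under stated assumptions: `resolve_polly_settings` is a hypothetical helper (not part of the SDK) that reads the environment variables named in the Authentication section, falls back to `us-east-1` when no region is set, and reports which credentials are missing.

```python
import os
from typing import Mapping, Optional

# Credentials named in the Authentication section above
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def resolve_polly_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper (not part of the SDK).

    Returns the region to use and the names of any missing credentials.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_AWS_VARS if not env.get(name)]
    # Fall back to the documented default region when none is configured
    region = env.get("AWS_DEFAULT_REGION") or "us-east-1"
    return {"region": region, "missing": missing}


if __name__ == "__main__":
    settings = resolve_polly_settings()
    if settings["missing"]:
        print("Missing AWS credentials:", ", ".join(settings["missing"]))
    else:
        print("Credentials found; region:", settings["region"])
```

If the check passes, the resolved value could be handed to the documented `region` option, for example `AWSPollyTTS(region=settings["region"])`.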
## Additional Resources

The following resources provide more information about using AWS Polly with VideoSDK Agents SDK.

- **[AWS Polly docs](https://docs.aws.amazon.com/polly/latest/dg/what-is.html)**: AWS Polly documentation.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Azure OpenAI TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Azure OpenAI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Azure OpenAI's services"
pagination_label: "Azure OpenAI TTS"
keywords:
  - OpenAI
  - Azure
  - Azure OpenAI
  - gpt-4o-mini-tts
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 2
sidebar_label: Azure OpenAI
slug: azureopenai
---

# Azure OpenAI TTS

The Azure OpenAI TTS provider enables your agent to use Azure OpenAI's text-to-speech models for converting text responses to natural-sounding audio output.

## Installation

Install the Azure OpenAI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-openai"
```

## Importing

```python
from videosdk.plugins.openai import OpenAITTS
```

## Authentication

The Azure OpenAI plugin requires an [Azure OpenAI API key](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/create-resource?pivots=web-portal). Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, and `OPENAI_API_VERSION` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.openai import OpenAITTS
from videosdk.agents import CascadingPipeline

# Initialize the Azure OpenAI TTS model
tts = OpenAITTS.azure(
    azure_deployment="gpt-4o-mini-tts",
    speed=1.0,
    response_format="pcm"
)

# Add tts to cascading pipeline
pipeline = CascadingPipeline(tts=tts)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects.
The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `azure_deployment`: The OpenAI deployment ID to use (by default it is model name: e.g., `"gpt-4o-mini-tts"`) - `api_key`: Your Azure OpenAI API key (can also be set via environment variable) - `azure_endpoint`: Your Azure OpenAI Deployment Endpoint URL (can also be set via environment variable) - `api_version`: Your Azure OpenAI API version (can also be set via environment variable) - `voice`: (str) Voice to use for audio output (e.g., `"alloy"`, `"echo"`, `"fable"`, `"onyx"`, `"nova"`, `"shimmer"`) - `speed`: (float) Speed of the generated audio (0.25 to 4.0, default: 1.0) ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Azure TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Azure's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Azure's services" pagination_label: "Azure TTS" keywords: - Azure - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI - Microsoft - Azure Speech Services - Azure AI Speech - Azure AI Foundry image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: Azure AI Speech slug: azure-ai-tts --- # Azure TTS The Azure TTS provider enables your agent to use Microsoft Azure's high-quality text-to-speech models for generating natural-sounding voice output with advanced voice tuning and expressive speaking styles. 
## Installation Install the Azure-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-azure" ``` ## Importing ```python from videosdk.plugins.azure import AzureTTS, VoiceTuning, SpeakingStyle ``` ## Authentication The Azure TTS plugin requires an Azure AI Speech Service resource. **Setup Steps:** 1. Create an AI Services resource for Speech in the [Azure portal](https://portal.azure.com) or from [Azure AI Foundry](https://ai.azure.com/foundryProject/overview) 2. Get the Speech resource key and region. After your Speech resource is deployed, select "Go to resource" to view and manage keys Set `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` in your `.env` file: ```bash AZURE_SPEECH_KEY=your-azure-speech-key AZURE_SPEECH_REGION=your-azure-region ``` ## Example Usage ```python from videosdk.plugins.azure import AzureTTS, VoiceTuning, SpeakingStyle from videosdk.agents import CascadingPipeline # Configure voice tuning for prosody control voice_tuning = VoiceTuning( rate="fast", volume="loud", pitch="high" ) # Configure speaking style for expressive speech speaking_style = SpeakingStyle( style="cheerful", degree=1.5 ) # Initialize the Azure TTS model tts = AzureTTS( voice="en-US-EmmaNeural", language="en-US", tuning=voice_tuning, style=speaking_style ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `speech_key`, `speech_region`, and other credential parameters from your code. ::: ## Configuration Options - `speech_key`: (Optional[str]) Azure Speech API key. Uses `AZURE_SPEECH_KEY` environment variable if not provided. - `speech_region`: (Optional[str]) Azure Speech region (e.g., `"eastus"`, `"westus2"`). Uses `AZURE_SPEECH_REGION` environment variable if not provided. - `speech_endpoint`: (Optional[str]) Custom endpoint URL. 
Uses `AZURE_SPEECH_ENDPOINT` environment variable if not provided. - `voice`: (str) Voice name to use for audio output (default: `"en-US-EmmaNeural"`). Get available voices using the [Azure voices API](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis?tabs=browserjs%2Cterminal&pivots=programming-language-python#select-synthesis-language-and-voice). - `language`: (str) Language code (optional, inferred from voice if not specified). - `tuning`: (`VoiceTuning`) Voice tuning object for rate, volume, and pitch control: - `rate`: (str) Speaking rate (`"x-slow"`, `"slow"`, `"medium"`, `"fast"`, `"x-fast"` or percentage like `"50%"`) - `volume`: (str) Speaking volume (`"silent"`, `"x-soft"`, `"soft"`, `"medium"`, `"loud"`, `"x-loud"` or percentage) - `pitch`: (str) Voice pitch (`"x-low"`, `"low"`, `"medium"`, `"high"`, `"x-high"` or frequency like `"+50Hz"`) - `style`: (`SpeakingStyle`) Speaking style object for expressive speech: - `style`: (str) Speaking style (e.g., `"cheerful"`, `"sad"`, `"angry"`, `"excited"`, `"friendly"`) - `degree`: (float) Style intensity from 0.01 to 2.0 (default: 1.0) - `deployment_id`: (str) Custom deployment ID for custom models. - `speech_auth_token`: (str) Authorization token for authentication. ## Voice Selection You can find available voices using the Azure Voices List API: ```bash curl --location --request GET 'https://eastus2.tts.speech.microsoft.com/cognitiveservices/voices/list' \ --header 'Ocp-Apim-Subscription-Key: YOUR_SPEECH_KEY' ``` Popular voice options include: - `en-US-EmmaNeural` (Female, neutral) - `en-US-BrianNeural` (Male, neutral) - `en-US-AriaNeural` (Female, cheerful) - `en-GB-SoniaNeural` (Female, British) ## Additional Resources The following resources provide more information about using Azure with VideoSDK Agents SDK. - **[Azure Speech Service Overview](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/overview)**: Complete overview of Azure Speech services. 
- **[Azure TTS docs](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/index-text-to-speech)**: Azure Text-to-Speech documentation. - **[Voice Selection Guide](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis?tabs=browserjs%2Cterminal&pivots=programming-language-python#select-synthesis-language-and-voice)**: Guide for selecting synthesis language and voice. - **[Speech Synthesis Markup](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice#adjust-prosody)**: Learn about prosody adjustments and voice tuning. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: CambAI TTS hide_title: false hide_table_of_contents: false description: "Learn how to use CambAI's TTS models with the VideoSDK AI Voice Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for CambAI's services" pagination_label: "CambAI TTS" keywords: - CambAI - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI - AI Voice Agents image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: CambAI slug: cambai-tts --- # CambAI TTS The CambAI TTS provider enables your agent to use CambAI's high-quality, low-latency text-to-speech models for generating natural-sounding voice output with advanced voice customization capabilities. ## Installation Install the CambAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-cambai" ``` ## Importing ```python from videosdk.plugins.cambai import CambAITTS, InferenceOptions, VoiceSettings, OutputConfiguration ``` ## Authentication The CambAI plugin requires a [CambAI API key](https://studio.camb.ai/). Set `CAMBAI_API_KEY` in your `.env` file. 
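How the `.env` file reaches the running process depends on your setup; many projects use the third-party `python-dotenv` package for this. As a dependency-free illustration, the sketch below parses simple `KEY=value` lines and exports them without overriding variables that are already set. `load_env_file` is a hypothetical helper, not part of the SDK, and it handles only this basic syntax.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Hypothetical helper: parse simple KEY=value lines and export them.

    Variables already present in the environment are left untouched.
    Returns the key/value pairs parsed from the file.
    """
    parsed = {}
    file = Path(path)
    if not file.is_file():
        return parsed
    for raw in file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop optional surrounding quotes
        parsed[key] = value
        os.environ.setdefault(key, value)
    return parsed


if __name__ == "__main__":
    load_env_file()
    print("CAMBAI_API_KEY set:", bool(os.environ.get("CAMBAI_API_KEY")))
```

Calling a loader like this once at startup, before any plugin object is created, lets the plugin pick up `CAMBAI_API_KEY` from the environment as described above.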
## Example Usage ```python from videosdk.plugins.cambai import CambAITTS, InferenceOptions, VoiceSettings, OutputConfiguration from videosdk.agents import CascadingPipeline inference_options = InferenceOptions( stability=0.5, temperature=0.7, inference_steps=60, speaker_similarity=0.8, localize_speaker_weight=0.5, acoustic_quality_boost=True ) # Configure voice settings voice_settings = VoiceSettings( enhance_reference_audio_quality=False, maintain_source_accent=False, ) output_configuration = OutputConfiguration( format="wav", sample_rate=24000, # Audio sample rate duration=None ) # Initialize CambAI TTS with optional audio output settings tts = CambAITTS( speech_model="mars-pro", voice_id=147320, language="en-us", user_instructions=None, # Optional for mars-instruct enhance_named_entities_pronunciation=True, voice_settings=voice_settings, inference_options=inference_options, output_configuration=output_configuration, ) # Add TTS to a cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `api_key`: (str) Your CambAI API key. Can also be set via the `CAMBAI_API_KEY` environment variable. - `speech_model`: (str) The CambAI TTS model to use (e.g., `"mars-pro"`, `"mars-flash"`, `"mars-instruct"`). Defaults to `"mars-pro"`. - `voice_id`: (int) Numeric voice profile ID from CambAI's voice library. Defaults to `147320`. - `language`: (str) BCP-47 locale string (e.g., `"en-us"`). Defaults to `"en-us"`. - `user_instructions`: (str) Style and tone guidance for the generated speech. Only supported when `speech_model` is set to `"mars-instruct"`. - `enhance_named_entities_pronunciation`: (bool) Improve pronunciation of brand names and proper nouns (default: `False`). 
- `voice_settings`: (`VoiceSettings`) Voice behaviour preferences: - `enhance_reference_audio_quality`: (bool) Enhance the quality of reference audio (default: `False`) - `maintain_source_accent`: (bool) Preserve the original speaker's accent (default: `False`) - `inference_options`: (`InferenceOptions`) Model sampling controls: - `stability`: (float) Voice stability control (optional) - `temperature`: (float) Sampling temperature (optional) - `inference_steps`: (int) Number of inference steps (optional) - `speaker_similarity`: (float) Speaker similarity control (optional) - `localize_speaker_weight`: (float) Speaker localization weight (optional) - `acoustic_quality_boost`: (bool) Enable acoustic quality enhancement (optional) - `output_configuration`: (`OutputConfiguration`) Audio output format and pacing options: - `format`: (str) Output audio format. Currently `"wav"` is supported (default: `"wav"`) - `sample_rate`: (int) Audio sample rate in Hz (default: `24000`) - `duration`: (float) Target speech duration in seconds. When set, the model attempts to pace the audio to match the requested duration. Omit or set to `None` for natural pacing (optional) ## Additional Resources The following resources provide more information about using CambAI with VideoSDK Agents. - **[CambAI docs](https://docs.camb.ai/)**: CambAI TTS docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Cartesia TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Cartesia's TTS models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing text to speech for Cartesia's services" pagination_label: "Cartesia TTS" keywords: - Cartesia - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Cartesia slug: cartesia-tts --- # Cartesia TTS The Cartesia TTS provider enables your agent to use Cartesia's high-quality, low-latency text-to-speech models for generating natural-sounding voice output. ## Installation Install the Cartesia-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-cartesia" ``` ## Importing ```python from videosdk.plugins.cartesia import CartesiaTTS ``` ## Authentication The Cartesia plugin requires a [Cartesia API key](https://play.cartesia.ai/keys). Set `CARTESIA_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.cartesia import CartesiaTTS from videosdk.agents import CascadingPipeline # Initialize the Cartesia TTS model tts = CartesiaTTS( # When CARTESIA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-cartesia-api-key", model="sonic-2", voice_id="794f9389-aac1-45b6-b726-9d9369183238", language="en", pronunciation_dict_id= None, max_buffer_delay_ms=None ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `api_key`: (str) Your Cartesia API key. Can also be set via the `CARTESIA_API_KEY` environment variable. - `model`: (str) The Cartesia TTS model to use (e.g., `"sonic-2"`, `"sonic-turbo"`). Defaults to `"sonic-2"`. - `voice_id`: (str) The ID of the voice to use for generating speech. 
- `language`: (str) The language of the voice (e.g., `"en"`, `"fr"`). Defaults to `"en"`.
- `pronunciation_dict_id`: (str) The ID of the pronunciation dictionary to use for generating speech.
- `max_buffer_delay_ms`: (int) Maximum buffering delay before audio streaming starts. Values between 0 and 5000 ms are supported. Defaults to 3000 ms.

## Additional Resources

The following resources provide more information about using Cartesia with VideoSDK Agents.

- **[Cartesia docs](https://docs.cartesia.ai/build-with-cartesia/models/tts)**: Cartesia TTS docs.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Deepgram TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Deepgram's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Deepgram's services"
pagination_label: "Deepgram TTS"
keywords:
  - Deepgram TTS
  - Text-to-Speech
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Deepgram TTS
slug: deepgram
---

# Deepgram TTS

The Deepgram TTS provider enables your agent to use Deepgram's high-quality text-to-speech models for generating natural, expressive voice output with advanced voice capabilities.

## Installation

Install the Deepgram-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-deepgram"
```

## Importing

```python
from videosdk.plugins.deepgram import DeepgramTTS
```

## Authentication

The Deepgram plugin requires a [Deepgram API key](https://developers.deepgram.com/docs/create-additional-api-keys). Set `DEEPGRAM_API_KEY` in your `.env` file.
## Example Usage ```python from videosdk.plugins.deepgram import DeepgramTTS from videosdk.agents import CascadingPipeline # Initialize the Deepgram TTS model tts = DeepgramTTS( # When DEEPGRAM_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-deepgram-api-key", model="aura-asteria-en", encoding="linear16", # linear16, mulaw, alaw, opus, mp3, flac, aac sample_rate=24000 ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `model` : The Deepgram model to use (e.g., `"aura-asteria-en"`, `"aura-luna-en"`) - `api_key`: Your Deepgram API key (can also be set via environment variable) - `encoding` : (str) Encoding allows you to specify the expected encoding of your audio output (default : `"linear16"`) - `sample_rate`: (int) Sample rate for output (default: `24000`) ## Additional Resources The following resources provide more information about using Deepgram with VideoSDK Agents SDK. - **[Deepgram docs](https://developers.deepgram.com/reference/text-to-speech-api/speak-streaming)**: Deepgram TTS docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: ElevenLabs TTS hide_title: false hide_table_of_contents: false description: "Learn how to use ElevenLabs's TTS models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing text to speech for ElevenLabs's services" pagination_label: "ElevenLabs TTS" keywords: - ElevenLabs - eleven_flash_v2_5 - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: ElevenLabs slug: eleven-labs --- # ElevenLabs TTS The ElevenLabs TTS provider enables your agent to use ElevenLabs' high-quality text-to-speech models for generating natural, expressive voice output with advanced voice cloning capabilities. ## Installation Install the ElevenLabs-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-elevenlabs" ``` ## Importing ```python from videosdk.plugins.elevenlabs import ElevenLabsTTS, VoiceSettings ``` ## Authentication The ElevenLabs plugin requires an [ElevenLabs API key](https://elevenlabs.io/app/settings/api-keys). Set `ELEVENLABS_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.elevenlabs import ElevenLabsTTS, VoiceSettings from videosdk.agents import CascadingPipeline # Configure voice settings voice_settings = VoiceSettings( stability=0.71, similarity_boost=0.5, style=0.0, use_speaker_boost=True ) # Initialize the ElevenLabs TTS model tts = ElevenLabsTTS( # When ELEVENLABS_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-elevenlabs-api-key", model="eleven_flash_v2_5", voice="EXAVITQu4vr4xnSDxMaL", speed=1.0, response_format="pcm_24000", enable_streaming=True, enable_ssml_parsing=False, apply_text_normalization="auto", auto_mode="auto", voice_settings=voice_settings, ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. 
::: ## Configuration Options - `model`: The ElevenLabs model to use (e.g., `"eleven_flash_v2_5"`, `"eleven_multilingual_v2"`) - `voice`: (str) Voice ID to use for audio output (default: "EXAVITQu4vr4xnSDxMaL") - `speed`: (float) Speed of the generated audio (default: 1.0) - `api_key`: Your ElevenLabs API key (can also be set via environment variable) - `response_format`: (str) Audio format for output (default: `"pcm_24000"`) - `voice_settings`: (`VoiceSettings`) Advanced voice configuration options: - `stability`: (float) Voice stability (0.0 to 1.0, default: 0.71) - `similarity_boost`: (float) Voice similarity enhancement (0.0 to 1.0, default: 0.5) - `style`: (float) Voice style exaggeration (0.0 to 1.0, default: 0.0) - `use_speaker_boost`: (bool) Enable speaker boost for clarity (default: `True`) - `base_url`: (str) Custom base URL for ElevenLabs API (optional) - `enable_streaming`: (bool) Enable real-time audio streaming (default: `False`) - `enable_ssml_parsing`: (bool) Whether to enable SSML parsing (default: `False`) - `apply_text_normalization`: (str) Controls text normalization (e.g., spelling out numbers). Modes: - "auto" (default) – System decides automatically - "on" – Always applied - "off" – Skipped - `Note`: For `eleven_turbo_v2_5` and `eleven_flash_v2_5` models, enabling text normalization requires an Enterprise plan. - `auto_mode`: (bool) Reduces latency by disabling chunk schedule and buffers. Recommended for full sentences/phrases. ## Additional Resources The following resources provide more information about using ElevenLabs with VideoSDK Agents SDK. - **[ElevenLabs docs](https://elevenlabs.io/docs)**: ElevenLabs TTS docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Google TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Google's TTS models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing text to speech for Google's services" pagination_label: "Google TTS" keywords: - Google - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Google slug: google-tts --- # Google TTS The Google TTS plugin enables your agent to use Google's text-to-speech models for generating natural-sounding voice output. It supports low-latency gRPC streaming with Chirp 3 HD voices and Vertex AI endpoints. ## Installation ```bash pip install "videosdk-plugins-google" ``` ## Authentication Set your Google API key as an environment variable: ```bash export GOOGLE_API_KEY="your-google-api-key" ``` You can obtain an API key from the [Google AI Studio](https://aistudio.google.com/apikey). ## Example Usage ```python from videosdk.plugins.google import GoogleTTS, GoogleVoiceConfig from videosdk.agents import CascadingPipeline # Configure voice settings voice_config = GoogleVoiceConfig( languageCode="en-US", name="en-US-Chirp3-HD-Aoede", ssmlGender="FEMALE" ) # Initialize the Google TTS model tts = GoogleTTS( # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-google-api-key", speed=1.0, pitch=0.0, voice_config=voice_config, custom_pronunciations=[{"tomato": "təˈmeɪtoʊ"}], # Optional IPA overrides ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` ### Vertex AI To use the Vertex AI endpoint instead of an API key, authenticate using [Application Default Credentials (ADC)](https://cloud.google.com/docs/authentication/application-default-credentials) and set your project ID: ```bash export GOOGLE_CLOUD_PROJECT="my-gcp-project" ``` ```python from videosdk.plugins.google import GoogleTTS, VertexAIConfig tts = GoogleTTS( vertexai=True, vertexai_config=VertexAIConfig(location="us-central1"), streaming=False, # Streaming cannot 
be used with Vertex AI ) ``` :::note - `streaming=True` (the default) requires a Chirp 3 HD voice (e.g. `en-US-Chirp3-HD-Aoede`) and cannot be combined with `vertexai=True`. - Vertex AI requires a GCP project ID via `VertexAIConfig(project_id="...")`, the `GOOGLE_CLOUD_PROJECT` env variable, or a `GOOGLE_APPLICATION_CREDENTIALS` service-account file. ::: ## Configuration Options - `api_key`: (str) Your Google Cloud TTS API key. Can also be set via the `GOOGLE_API_KEY` environment variable. - `speed`: (float) The speaking rate of the generated audio (default: `1.0`). - `pitch`: (float) The pitch of the generated audio. Can be between -20.0 and 20.0 (default: `0.0`). - `response_format`: (str) The format of the audio response. Currently only supports `"pcm"` (default: `"pcm"`). - `voice_config`: (`GoogleVoiceConfig`) Configuration for the voice to be used. - `languageCode`: (str) The language code of the voice (e.g., `"en-US"`, `"en-GB"`) (default: `"en-US"`). - `name`: (str) The name of the voice to use (e.g., `"en-US-Chirp3-HD-Aoede"`, `"en-US-News-N"`) (default: `"en-US-Chirp3-HD-Aoede"`). - `ssmlGender`: (str) The gender of the voice (`"MALE"`, `"FEMALE"`, `"NEUTRAL"`) (default: `"FEMALE"`). - `custom_pronunciations`: (list[dict] | dict | None) IPA pronunciation overrides for specific words (e.g., `[{"tomato": "təˈmeɪtoʊ"}]`). Defaults to `None`. - `streaming`: (bool) Use gRPC `StreamingSynthesize` for lower-latency audio generation. Only compatible with Chirp 3 HD voices and cannot be combined with `vertexai=True` (default: `True`). - `vertexai`: (bool) Use the Vertex AI TTS endpoint with Application Default Credentials (ADC) instead of an API key (default: `False`). - `vertexai_config`: (`VertexAIConfig`) Project and region settings for Vertex AI. - `project_id`: (str | None) Your GCP project ID. Falls back to `GOOGLE_CLOUD_PROJECT` or `GOOGLE_APPLICATION_CREDENTIALS` (default: `None`). 
- `location`: (str) GCP region for the TTS endpoint (default: `"us-central1"`). ## Additional Resources The following resources provide more information about using Google with VideoSDK Agents SDK. - **[Google TTS docs](https://cloud.google.com/text-to-speech/docs)**: Google Cloud TTS documentation. - **[Chirp 3 HD voices](https://cloud.google.com/text-to-speech/docs/chirp3-hd)**: Available voices for low-latency streaming synthesis. - **[Vertex AI TTS](https://cloud.google.com/vertex-ai/docs/text-to-speech)**: Vertex AI Text-to-Speech documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Groq TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Groq's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Groq's services" pagination_label: "Groq TTS" keywords: - Groq - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Groq slug: groq-ai-tts --- # Groq TTS The Groq TTS provider enables your agent to use Groq's high-quality text-to-speech models for generating natural-sounding voice output. ## Installation Install the Groq-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-groq" ``` ## Importing ```python from videosdk.plugins.groq import GroqTTS ``` ## Authentication The Groq plugin requires a [Groq API key](https://console.groq.com/keys). Set `GROQ_API_KEY` in your `.env` file.
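Before constructing the model, it can help to confirm that the key is actually visible to the process. The following is a minimal sketch, not part of the VideoSDK or Groq APIs: `require_env` is a hypothetical helper that uses only the standard library, and it assumes your `.env` file has already been loaded into the environment (for example by a loader such as python-dotenv).

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; it is not provided by the SDK.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it."
        )
    return value


if __name__ == "__main__":
    # Fail early with a readable message if the key is missing.
    require_env("GROQ_API_KEY")
    print("GROQ_API_KEY is set")
```

Calling a check like this at startup surfaces a missing key as one readable error instead of a failure later in the pipeline.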
## Example Usage ```python from videosdk.plugins.groq import GroqTTS from videosdk.agents import CascadingPipeline # Initialize the Groq AI TTS model tts = GroqTTS( model="playai-tts", voice="Fritz-PlayAI", ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `model` (str): The TTS model to use. Default: "playai-tts" - `voice` (str): The voice to use. Default: "Fritz-PlayAI" - `speed` (float): Speed of speech (0.5 to 5.0). Default: 1.0 - `api_key` (str, optional): Groq API key. If not provided, uses GROQ_API_KEY environment variable ## Additional Resources The following resources provide more information about using Groq with VideoSDK Agents SDK. - **[Groq docs](https://console.groq.com/docs/text-to-speech)**: Groq TTS docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Hume AI TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Hume AI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Hume AI's services" pagination_label: "Hume AI TTS" keywords: - Hume AI - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Hume AI slug: hume-ai-tts --- # Hume AI TTS The Hume AI TTS provider enables your agent to use Hume AI's high-quality text-to-speech models for generating natural-sounding voice output. 
## Installation Install the Hume AI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-humeai" ``` ## Importing ```python from videosdk.plugins.humeai import HumeAITTS ``` ## Authentication The Hume plugin requires a [Hume API key](https://platform.hume.ai/settings/keys). Set `HUMEAI_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.humeai import HumeAITTS from videosdk.agents import CascadingPipeline # Initialize the Hume AI TTS model tts = HumeAITTS( voice="Serene Assistant", instant_mode=True, ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `instant_mode`: (bool) Whether to use instant mode synthesis (default: `True`). Instant mode requires specifying a voice. - `voice`: (str) Voice name to use (default: `"Serene Assistant"`). Required when `instant_mode` is `True`. - `speed`: (float) Speaking rate multiplier (default: `1.0`). Values >1.0 increase speed. - `api_key`: (str) Hume AI API key. Can also be set via the `HUMEAI_API_KEY` environment variable. ## Additional Resources The following resources provide more information about using Hume with VideoSDK Agents SDK. - **[Hume AI docs](https://dev.hume.ai/docs/text-to-speech-tts)**: Hume AI docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Inworld AI TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Inworld AI's TTS models with the VideoSDK AI Agent SDK.
This guide covers model configuration, API integration, and implementing text to speech for Inworld AI's services" pagination_label: "Inworld AI TTS" keywords: - Inworld AI - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Inworld AI slug: inworld-ai-tts --- # Inworld AI TTS The Inworld AI TTS provider enables your agent to use Inworld AI's high-quality text-to-speech models for generating natural-sounding voice output. ## Installation Install the Inworld AI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-inworldai" ``` ## Importing ```python from videosdk.plugins.inworldai import InworldAITTS ``` ## Authentication The Inworld plugin requires an [Inworld API key](https://studio.inworld.ai/login). Set `INWORLD_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.inworldai import InworldAITTS from videosdk.agents import CascadingPipeline # Initialize the Inworld AI TTS model tts = InworldAITTS( api_key="your-api-key", voice_id="Hades", model_id="inworld-tts-1" ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `model_id`: (str) Inworld TTS model identifier (default: `"inworld-tts-1"`). - `voice_id`: (str) Voice identifier to use (default: `"Hades"`). - `temperature`: (float) Sampling temperature for variation in prosody (default: `0.8`). - `api_key`: (str) Inworld API key. Can also be set via the `INWORLD_API_KEY` environment variable. ## Additional Resources The following resources provide more information about using Inworld with VideoSDK Agents SDK.
- **[Inworld AI docs](https://docs.inworld.ai/docs/introduction)**: Inworld AI docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: LMNT AI TTS hide_title: false hide_table_of_contents: false description: "Learn how to use LMNT AI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for LMNT AI's services" pagination_label: "LMNT AI TTS" keywords: - LMNT AI - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: LMNT AI slug: lmnt-ai-tts --- # LMNT AI TTS The LMNT AI TTS provider enables your agent to use LMNT AI's high-quality text-to-speech models for generating natural-sounding voice output. ## Installation Install the LMNT AI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-lmnt" ``` ## Importing ```python from videosdk.plugins.lmnt import LMNTTTS ``` ## Authentication The LMNT plugin requires an [LMNT API key](https://app.lmnt.com/account). Set `LMNT_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.lmnt import LMNTTTS from videosdk.agents import CascadingPipeline # Initialize the LMNT TTS model tts = LMNTTTS( voice="ava", model="blizzard", language="auto", ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. 
::: ## Configuration Options - `api_key`: Your LMNT API key (can also be set via `LMNT_API_KEY` environment variable) - `voice`: Voice ID to use for synthesis (required) - `model`: Model to use for synthesis (default: "blizzard") - `language`: Language code for synthesis (default: "auto") ## Additional Resources The following resources provide more information about using LMNT with VideoSDK Agents SDK. - **[LMNT docs](https://docs.lmnt.com/)**: LMNT API docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Murf AI TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Murf AI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Murf AI's services" pagination_label: "Murf AI TTS" keywords: - Murf AI - Falcon - GEN2 - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Murf AI slug: murf-ai-tts --- # Murf AI TTS The Murf AI TTS provider enables your agent to use Murf AI's high-quality text-to-speech models for generating natural, expressive voice output with advanced voice customization. ## Installation Install the Murf AI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-murfai" ``` ## Importing ```python from videosdk.plugins.murfai import MurfAITTS, MurfAIVoiceSettings ``` ## Authentication The Murf AI plugin requires a [Murf AI API key](https://murf.ai/). Set `MURFAI_API_KEY` in your `.env` file. 
## Example Usage ```python from videosdk.plugins.murfai import MurfAITTS, MurfAIVoiceSettings from videosdk.agents import CascadingPipeline # Configure voice settings voice_settings = MurfAIVoiceSettings( pitch=0, rate=0, style="Conversational", variation=1, multi_native_locale=None ) # Initialize the Murf AI TTS model tts = MurfAITTS( # When MURFAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-murfai-api-key", region="US_EAST", model="Falcon", voice="en-US-natalie", voice_settings=voice_settings, enable_streaming=True ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your Murf AI API key (can also be set via MURFAI_API_KEY environment variable) - `region`: (str) The region code for API deployment (default: `"US_EAST"`) - Available regions: `"GLOBAL"`, `"US_EAST"`, `"US_WEST"`, `"INDIA"`, `"CANADA"`, `"SOUTH_KOREA"`, `"UAE"`, `"JAPAN"`, `"AUSTRALIA"`, `"EU_CENTRAL"`, `"UK"`, `"SOUTH_AFRICA"` - `model`: (str) The Murf AI model to use (default: `"Falcon"`) - Available models: `"Gen2"`, `"Falcon"` - `voice`: (str) Voice ID to use for audio output (default: `"en-US-natalie"`) - `voice_settings`: (`MurfAIVoiceSettings`) Advanced voice configuration options: - `pitch`: (int) Voice pitch adjustment, range varies by voice (default: `0`) - `rate`: (int) Speech rate adjustment, range varies by voice (default: `0`) - `style`: (str) Voice style/emotion (default: `"Conversational"`) - `variation`: (int) Voice variation for diversity (default: `1`) - `multi_native_locale`: (str) Optional locale for multi-native voices (default: `None`) - `enable_streaming`: (bool) Enable WebSocket streaming for low latency. 
When `False`, uses HTTP chunked transfer (default: `True`) ## Additional Resources The following resources provide more information about using Murf AI with VideoSDK Agents SDK. - **[Murf AI docs](https://murf.ai/api/docs/introduction/overview)**: Murf AI TTS docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Neuphonic TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Neuphonic's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Neuphonic's services" pagination_label: "Neuphonic TTS" keywords: - Neuphonic - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Neuphonic slug: neuphonic-tts --- # Neuphonic TTS The Neuphonic TTS provider enables your agent to use Neuphonic's high-quality text-to-speech models for generating natural-sounding voice output. ## Installation Install the Neuphonic-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-neuphonic" ``` ## Importing ```python from videosdk.plugins.neuphonic import NeuphonicTTS ``` ## Authentication The Neuphonic plugin requires a [Neuphonic API key](https://app.neuphonic.com/apikey). Set `NEUPHONIC_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.neuphonic import NeuphonicTTS from videosdk.agents import CascadingPipeline # Initialize the Neuphonic AI TTS model tts = NeuphonicTTS( lang_code="en", voice_id="8e9c4bc8-3979-48ab-8626-df53befc2090", speed=1.0, ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
::: ## Configuration Options - `api_key`: Your Neuphonic API key (can also be set via `NEUPHONIC_API_KEY` environment variable) - `lang_code`: Language code for the desired language (e.g., 'en', 'es', 'de', 'nl', 'hi') - `voice_id`: The voice ID for the desired voice - `speed`: Playback speed of the audio (range: 0.7-2.0, default: 1.0) ## Additional Resources The following resources provide more information about using Neuphonic with VideoSDK Agents SDK. - **[Neuphonic AI docs](https://docs.neuphonic.com/)**: Neuphonic docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Nvidia TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Nvidia's Riva TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Nvidia's services" pagination_label: "Nvidia TTS" keywords: - Nvidia - Riva - TTS - VideoSDK Agents - Python SDK - Text To Speech - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_label: Nvidia slug: nvidia --- # Nvidia TTS The Nvidia TTS provider enables your agent to use Nvidia's Riva text-to-speech models for converting text responses to natural-sounding audio output with low latency. ## Installation Install the Nvidia-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-nvidia" ``` ## Authentication The Nvidia plugin requires an Nvidia API key. Set `NVIDIA_API_KEY` in your `.env` file. 
## Importing ```python from videosdk.plugins.nvidia import NvidiaTTS ``` ## Example Usage ```python from videosdk.plugins.nvidia import NvidiaTTS from videosdk.agents import CascadingPipeline # Initialize the Nvidia TTS model tts = NvidiaTTS( # When NVIDIA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-nvidia-api-key", voice_name="Magpie-Multilingual.EN-US.Aria", language_code="en-US", sample_rate=24000 ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your Nvidia API key (required, can also be set via environment variable) - `server`: The Nvidia Riva server address (default: `"grpc.nvcf.nvidia.com:443"`) - `function_id`: The specific function ID for the service (default: `"877104f7-e885-42b9-8de8-f6e4c6303969"`) - `voice_name`: (str) The voice to use (default: `"Magpie-Multilingual.EN-US.Aria"`) - `language_code`: (str) Language code for synthesis (default: `"en-US"`) - `sample_rate`: (int) Audio sample rate in Hz (default: `24000`) - `use_ssl`: (bool) Enable SSL connection (default: `True`) ## Additional Resources The following resources provide more information about using Nvidia Riva with VideoSDK Agents SDK. - **[Nvidia Riva docs](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html)**: Nvidia Riva documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: OpenAI TTS hide_title: false hide_table_of_contents: false description: "Learn how to use OpenAI's TTS models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing text to speech for OpenAI's services" pagination_label: "OpenAI TTS" keywords: - OpenAI - gpt-4o-mini-tts - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: OpenAI slug: openai --- # OpenAI TTS The OpenAI TTS provider enables your agent to use OpenAI's text-to-speech models for converting text responses to natural-sounding audio output. ## Installation Install the OpenAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-openai" ``` ## Importing ```python from videosdk.plugins.openai import OpenAITTS ``` ## Authentication The OpenAI plugin requires an [OpenAI API key](https://platform.openai.com/api-keys). Set `OPENAI_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.openai import OpenAITTS from videosdk.agents import CascadingPipeline # Initialize the OpenAI TTS model tts = OpenAITTS( # When OPENAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-openai-api-key", model="tts-1", voice="alloy", speed=1.0, response_format="pcm" ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. 
::: ## Configuration Options - `model`: The OpenAI TTS model to use (e.g., `"tts-1"`, `"tts-1-hd"`) - `voice`: (str) Voice to use for audio output (e.g., `"alloy"`, `"echo"`, `"fable"`, `"onyx"`, `"nova"`, `"shimmer"`) - `speed`: (float) Speed of the generated audio (0.25 to 4.0, default: 1.0) - `instructions`: (str) Custom instructions to guide speech synthesis style - `api_key`: Your OpenAI API key (can also be set via environment variable) - `base_url`: Custom base URL for OpenAI API (optional) - `response_format`: (str) Audio format for output (default: `"pcm"`) ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. - **[OpenAI docs](https://platform.openai.com/docs/guides/text-to-speech)**: OpenAI TTS API documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Papla Media TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Papla Media's text-to-speech service with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing TTS for Papla Media." pagination_label: "Papla Media TTS" keywords: - Papla Media - TTS - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: Papla Media slug: papla-media --- # Papla Media TTS The Papla Media TTS provider enables your agent to use Papla Media's text-to-speech service for converting text responses into spoken audio. ## Installation Install the Papla Media-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-papla" ``` ## Importing ```python from videosdk.plugins.papla import PaplaTTS ``` ## Authentication The Papla Media plugin requires an API key, which you can generate from your app dashboard. Set `PAPLA_API_KEY` in your `.env` file. 
## Example Usage ```python from videosdk.plugins.papla import PaplaTTS from videosdk.agents import CascadingPipeline # Initialize the Papla Media TTS service tts = PaplaTTS( # When PAPLA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-papla-media-api-key", ) # Add tts to a cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so you should omit the `api_key` parameter from your code. ::: ## Configuration Options ### Initialization Parameters These are the options you can set when creating an instance of `PaplaTTS`. - `model_id` (str): The TTS model to use. Defaults to `"papla_p1"`. - `api_key` (str, optional): Your Papla Media API key. It's recommended to set this via the `PAPLA_API_KEY` environment variable instead. - `base_url` (str, optional): Custom base URL for the Papla Media API. Defaults to `"https://api.papla.media/v1"`. ## Additional Resources The following resources provide more information about using Papla Media with the VideoSDK Agent Framework. - **[Papla Media API Docs](https://api.papla.media/docs)**: Papla Media's official API documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Resemble AI TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Resemble AI's TTS models with the VideoSDK AI Agent SDK.
This guide covers model configuration, API integration, and implementing text to speech for Resemble AI's services" pagination_label: "Resemble AI TTS" keywords: - Resemble AI - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Resemble AI slug: resemble-ai-tts --- # Resemble AI TTS The Resemble AI TTS provider enables your agent to use Resemble AI's high-quality text-to-speech models for generating natural-sounding voice output. ## Installation Install the Resemble AI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-resemble" ``` ## Importing ```python from videosdk.plugins.resemble import ResembleTTS ``` ## Authentication The Resemble plugin requires a [Resemble API key](https://app.resemble.ai/account/api). Set `RESEMBLE_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.resemble import ResembleTTS from videosdk.agents import CascadingPipeline # Initialize the Resemble AI TTS model tts = ResembleTTS( # When RESEMBLE_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-resemble-api-key", voice_uuid="55592656" ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `api_key`: (str) Your Resemble AI API key. Can also be set via the `RESEMBLE_API_KEY` environment variable. - `voice_uuid`: (str) The UUID of the voice to use for synthesis (default: `"55592656"`). ## Additional Resources The following resources provide more information about using Resemble with VideoSDK Agents SDK. - **[Resemble AI docs](https://docs.app.resemble.ai)**: Resemble AI docs.
import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Rime AI TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Rime AI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Rime AI's services" pagination_label: "Rime AI TTS" keywords: - Rime AI - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Rime AI slug: rime-ai-tts --- # Rime AI TTS The Rime AI TTS provider enables your agent to use Rime AI's high-quality text-to-speech models for generating natural-sounding voice output. ## Installation Install the Rime AI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-rime" ``` ## Importing ```python from videosdk.plugins.rime import RimeTTS ``` ## Authentication The Rime plugin requires a [Rime API key](https://rime.ai/). Set `RIME_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.rime import RimeTTS from videosdk.agents import CascadingPipeline # Initialize the Rime AI TTS model tts = RimeTTS( speaker="river", model_id="mist", lang="eng", speed_alpha=1.0 ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `speaker`: (str) Voice ID to use (default: `"river"`). Must match the model's available speakers. - `model_id`: (str) Rime model identifier (default: `"mist"`). Supported: `"mist"`, `"mistv2"`. - `lang`: (str) Language code for the voice (default: `"eng"`). - `speed_alpha`: (float) Controls speaking rate (`1.0` is normal speed).
- `reduce_latency`: (bool) Whether to minimize streaming delay (default: `False`). - `pause_between_brackets`: (bool) Insert pauses around bracketed text (default: `False`). - `phonemize_between_brackets`: (bool) Use phonemes for bracketed text (default: `False`). - `inline_speed_alpha`: (str) Optional per-word speed override (e.g., `"1.2,1.0,0.8"`). - `api_key`: (str) Rime API key. Can also be set via the `RIME_API_KEY` environment variable. ## Additional Resources The following resources provide more information about using Rime with VideoSDK Agents SDK. - **[Rime AI docs](https://docs.rime.ai/)**: Rime AI docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Sarvam AI TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Sarvam AI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Sarvam AI's services" pagination_label: "Sarvam AI TTS" keywords: - Sarvam AI - bulbul:v2 - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: Sarvam AI slug: sarvam-ai-tts --- # Sarvam AI TTS The Sarvam AI TTS provider enables your agent to use Sarvam AI's text-to-speech models for generating voice output. ## Installation Install the Sarvam AI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-sarvamai" ``` ## Importing ```python from videosdk.plugins.sarvamai import SarvamAITTS ``` ## Authentication The Sarvam plugin requires a [Sarvam API key](https://dashboard.sarvam.ai/key-management). Set `SARVAMAI_API_KEY` in your `.env` file.
## Example Usage ```python from videosdk.plugins.sarvamai import SarvamAITTS from videosdk.agents import CascadingPipeline # Initialize the Sarvam AI TTS model tts = SarvamAITTS( # When SARVAMAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-sarvam-ai-api-key", model="bulbul:v2", speaker="anushka", language="en-IN", pitch=0.0, pace=1.0, loudness=1.0, ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `api_key`: (str) Your Sarvam AI API key. Can also be set via the `SARVAMAI_API_KEY` environment variable. - `model`: (str) The Sarvam AI model to use, e.g. `"bulbul:v2"`, `"bulbul:v3"`, `"bulbul:v3-beta"` (default: `"bulbul:v2"`). - `speaker`: (str) The speaker voice to use (default: `"anushka"`). - `language`: (str) The language code for the generated audio (default: `"en-IN"`). - `enable_streaming`: (bool) If `True`, uses WebSockets for low-latency streaming. If `False`, uses HTTP for batch synthesis (default: `True`). - `sample_rate`: (int) The audio sample rate in Hz (default: `8000`). - `output_audio_codec`: (str) The output audio codec (default: `"linear16"`). - `pitch`: (float | None) Pitch of the voice. Only supported on `bulbul:v2`. Range: [-0.75, 0.75]. Set to `None` to omit (default: `0.0`). - `pace`: (float | None) Pace/speed of the voice. `bulbul:v2`: range [0.3, 3.0]; `bulbul:v3`/`bulbul:v3-beta`: range [0.5, 2.0]. Set to `None` to omit (default: `1.0`). - `loudness`: (float | None) Loudness of the voice. Only supported on `bulbul:v2`. Range: [0.3, 3.0]. Set to `None` to omit (default: `1.0`). - `temperature`: (float | None) Sampling temperature. Only supported on `bulbul:v3` and `bulbul:v3-beta`. Range: [0.01, 1.0]. 
Set to `None` to omit (default: `0.6`). - `output_audio_bitrate`: (str) Output audio bitrate. Allowed values: `"32k"`, `"64k"`, `"96k"`, `"128k"`, `"192k"` (default: `"128k"`). - `min_buffer_size`: (int) Minimum character length that triggers buffer flushing (default: `50`). - `max_chunk_length`: (int) Maximum chunk length for sentence splitting (default: `150`). - `enable_preprocessing`: (bool) Controls normalization of English words and numeric entities (e.g., numbers, dates). Recommended for mixed-language text. Only supported on `bulbul:v2` (default: `False`). ## Additional Resources The following resources provide more information about using Sarvam AI with VideoSDK Agents SDK. - **[Sarvam docs](https://docs.sarvam.ai/)**: Sarvam's full docs site. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: SmallestAI TTS hide_title: false hide_table_of_contents: false description: "Learn how to use SmallestAI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for SmallestAI's services" pagination_label: "SmallestAI TTS" keywords: - SmallestAI - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: SmallestAI slug: smallestai-tts --- # SmallestAI TTS The SmallestAI TTS provider enables your agent to use SmallestAI's high-quality text-to-speech models for generating voice output. ## Installation Install the SmallestAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-smallestai" ``` ## Importing ```python from videosdk.plugins.smallestai import SmallestAITTS ``` ## Authentication The Smallest AI plugin requires a [Smallest AI API key](https://console.smallest.ai/apikeys). Set `SMALLEST_API_KEY` in your `.env` file. 
## Example Usage ```python from videosdk.plugins.smallestai import SmallestAITTS from videosdk.agents import CascadingPipeline # Initialize the SmallestAI TTS model tts = SmallestAITTS( # When SMALLEST_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-smallestai-api-key", model="lightning", voice_id="emily" ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` from your code. ::: ## Configuration Options - `api_key`: (str) Your SmallestAI API key. Can also be set via the `SMALLEST_API_KEY` environment variable. - `model`: (str) The TTS model to use (e.g., `"lightning"`, `"lightning-large"`). Defaults to `"lightning"`. - `voice_id`: (str) The ID of the voice to use. Defaults to `"emily"`. - `speed`: (float) Speech speed multiplier. Defaults to `1.0`. - `consistency`: (float) Controls word repetition and skipping. Only supported in `lightning-large` model. Defaults to `0.5`. - `similarity`: (float) Controls similarity to the reference audio. Only supported in `lightning-large` model. Defaults to `0.0`. - `enhancement`: (bool) Enhances speech quality at the cost of increased latency. Only supported in `lightning-large` model. Defaults to `False`. ## Additional Resources The following resources provide more information about using Smallest AI with VideoSDK Agents SDK. - **[Smallest AI docs](https://waves-docs.smallest.ai/v3.0.1/content/introduction/introduction)**: Smallest AI docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Speechify TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Speechify's TTS models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing text to speech for Speechify's services"
pagination_label: "Speechify TTS"
keywords:
  - Speechify
  - Text-to-Speech
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Speechify
slug: speechify-tts
---

# Speechify TTS

The Speechify TTS provider enables your agent to use Speechify's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the Speechify-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-speechify"
```

## Importing

```python
from videosdk.plugins.speechify import SpeechifyTTS
```

## Authentication

The Speechify plugin requires a [Speechify API key](https://console.sws.speechify.com/).

Set `SPEECHIFY_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.speechify import SpeechifyTTS
from videosdk.agents import CascadingPipeline

# Initialize the Speechify TTS model
tts = SpeechifyTTS(
    voice_id="kristy",
    model="simba-english"
)

# Add tts to cascading pipeline
pipeline = CascadingPipeline(tts=tts)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `voice_id`: (str) The Speechify voice to use (default: `"kristy"`).
- `api_key`: (str) Speechify API key. Can also be set via the `SPEECHIFY_API_KEY` environment variable.
- `model`: (str) The model variant to use (`"simba-base"`, `"simba-english"`, `"simba-multilingual"`, `"simba-turbo"`). Default: `"simba-english"`.
- `language`: (str) Optional ISO language code for multilingual models (e.g., `"en"`, `"es"`).
## Additional Resources The following resources provide more information about using Speechify with VideoSDK Agents SDK. - **[Speechify AI docs](https://docs.sws.speechify.com/v1/docs)**: Speechify AI docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Turn Detector hide_title: false hide_table_of_contents: false description: "Learn how to use TurnDetector model with the VideoSDK AI Agent SDK. This guide covers model configuration." pagination_label: "Turn Detector" keywords: - Turn Detection - Turn Detector - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Turn Detector slug: turn-detector --- # Turn Detector The Turn Detector uses a Hugging Face model to determine whether a user's turn is completed or not, enabling precise conversation flow management in cascading pipelines. ## Installation Install the Turn Detector-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-turn-detector" ``` ## Importing ```python from videosdk.plugins.turn_detector import TurnDetector ``` ## Example Usage ```python from videosdk.plugins.turn_detector import TurnDetector, pre_download_model from videosdk.agents import CascadingPipeline # Pre-download the model (optional but recommended) pre_download_model() # Initialize the Turn Detector turn_detector = TurnDetector( threshold=0.7 ) # Add Turn Detector to cascading pipeline pipeline = CascadingPipeline(turn_detector=turn_detector) ``` ## Configuration Options - `threshold`: (float) Confidence threshold for turn completion detection (0.0 to 1.0, default: `0.7`) ## Pre-downloading Model To avoid delays during agent initialization, you can pre-download the Hugging Face model: ```python from videosdk.plugins.turn_detector import pre_download_model # Download model before running the agent pre_download_model() ``` ## Additional Resources The following 
resources provide more information about the VideoSDK Turn Detector plugin for the AI Agents SDK.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
id: recording
title: Recording
hide_title: false
hide_table_of_contents: false
description: "Learn how to enable the recording functionality with VideoSDK AI Agents for agent sessions and user interactions."
pagination_label: "Recording"
keywords:
  - Agent Recording
  - AI Agents
  - Recording
  - AI Agent Oversight
  - Traces
  - Playback
  - VideoSDK Agents
  - MCP Server
  - Python SDK
  - Audio Store
  - Autoscroll Transcript
  - Timestamped Playback
image: img/videosdklive-thumbnail.jpg
sidebar_position: 5
sidebar_label: Recording
slug: recording
---

The AI Agent SDK now supports session recordings, which can be enabled with a simple configuration. When enabled, all interactions between the user and the agent are recorded. These recordings can be played back directly from the dashboard with autoscrolling transcripts and precise timestamps, and you can also download them for offline review and analysis.

## Enabling Recording

To enable recording for an AI agent session, set the `recording` flag to `True` in the `RoomOptions` passed to the `JobContext`. Once that's done, start your agent as usual; no additional changes are required in the pipeline. By default, the `recording` flag is `False`.

```python
from videosdk.agents import JobContext, RoomOptions

job_context = JobContext(
    room_options=RoomOptions(
        room_id="YOUR_ROOM_ID",
        name="Agent",
        recording=True
    )
)
```

---

---
title: Running Agents with Worker
hide_title: false
hide_table_of_contents: false
description: "Learn how to run AI agent instances using the Worker system in the VideoSDK AI Agent SDK. Understand WorkerJob and JobContext for robust agent deployment with proper process isolation and lifecycle management."
pagination_label: "Running Agents with Worker System" keywords: - Worker System - VideoSDK Agents - AI Agent SDK - Python - Multiprocessing - Process Isolation - WorkerJob - JobContext - Voice Agent Sessions - Agent Deployment image: img/videosdklive-thumbnail.jpg sidebar_position: 4 sidebar_label: Running Agents with Worker slug: running-multiple-agents --- The worker system provides a robust way to run AI agent instances using Python's multiprocessing. It offers process isolation, proper lifecycle management, and a clean separation between agent logic and infrastructure concerns. ## Key Components ### 1. WorkerJob `WorkerJob` is the main class that defines an agent task to be executed in a separate process. It takes two parameters: - `entrypoint`: An async function that accepts a JobContext parameter - `jobctx`: A JobContext object or a callable that returns a JobContext ```python job = WorkerJob(entrypoint=my_function, jobctx=my_context) ``` ### 2. JobContext `JobContext` provides the runtime environment for your agent, including: - **Room Management**: Handles VideoSDK room connections - **Shutdown Callbacks**: Allows cleanup operations - **Process Isolation**: Each job runs in its own process ### 3. 
Worker

`Worker` manages the execution of jobs in separate processes, providing:

- Process isolation for each agent instance
- Automatic cleanup on shutdown
- Error handling and logging

## Usage Example

Here's a complete example of how to use the worker system with a voice agent:

```python
import asyncio
import aiohttp
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from videosdk.agents import Agent, AgentSession, RealTimePipeline, WorkerJob, JobContext, RoomOptions

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions and help with tasks.",
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello, how can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")

async def entrypoint(ctx: JobContext):
    model = GeminiRealtime(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        config=GeminiLiveConfig(
            voice="Leda",
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)
    # MyVoiceAgent.__init__ takes no arguments, so the context is not passed in
    agent = MyVoiceAgent()
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
    )

    async def cleanup_session():
        print("Cleaning up session...")

    # Run cleanup_session when the job shuts down
    ctx.add_shutdown_callback(cleanup_session)

    try:
        # connect to the room
        await ctx.connect()
        await ctx.room.wait_for_participant()
        await session.start()
        # Keep the job alive until it is interrupted
        await asyncio.Event().wait()
    except KeyboardInterrupt:
        print("Shutting down...")
    finally:
        await session.close()
        await ctx.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",
        name="Sandbox Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
```

## Configuration Options

### RoomOptions

- `room_id`: The VideoSDK meeting ID
- `auth_token`: Authentication token (or use VIDEOSDK_AUTH_TOKEN env var)
- `name`: Agent name displayed in the meeting
- `playground`: Enable playground mode for testing
- `vision`: 
Enable vision capabilities - `avatar`: Use virtual avatars from available providers ## Best Practices 1. **Always use cleanup callbacks**: Register shutdown callbacks to ensure proper resource cleanup 2. **Handle exceptions gracefully**: Use try-finally blocks to ensure cleanup happens 3. **Use playground mode for testing**: Set `playground=True` for easy testing and debugging 4. **Set environment variables**: Use `VIDEOSDK_AUTH_TOKEN` for authentication 5. **Wait for participants**: Use `wait_for_participant()` to ensure agent waits for a participant The worker system provides a production-ready way to deploy AI agents with proper isolation, lifecycle management, and error handling. --- --- id: session-analytics title: Session Analytics hide_title: false hide_table_of_contents: false description: "Understand how to use Tracing & Observability for the AI Agent SDK on the VideoSDK Dashboard to inspect sessions, transcripts, and end‑to‑end latency per component." keywords: - AI Agent SDK - VideoSDK Agents - Tracing and Observability - Session Analytics - Telemetry and Metrics - Latency Measurement - End-to-End Latency - Session Debugging - Transcript Playback - Conversation Analytics - Interaction Turns - Tool Calls Monitoring - Audio/Video Session Recording - Agent Responsiveness - Performance Monitoring - User-Agent Interaction - Real-time Insights - Session Playback - VideoSDK Dashboard image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Session Analytics slug: session-analytics --- VideoSDK's AI Agent framework offers powerful **Tracing and Observability** tools, providing deep insights into your AI agent's performance and behavior. These tools, accessible from the VideoSDK dashboard, allow you to monitor sessions, analyze interactions, and debug issues with precision. 
## Prerequisites

To view Tracing and Observability on the VideoSDK Dashboard, install the VideoSDK AI Agent package using pip:

```bash
pip install videosdk-agents==0.0.23
```

:::note
Tracing and Observability support was added in version 0.0.23, so make sure the installed version is 0.0.23 or later.
:::

## Sessions

The Sessions dashboard provides a comprehensive list of all interactions with your AI agents. Each session is a unique conversation between a user and an agent, identified by a `Session ID` and associated with a `Room ID`.
### Key Metrics For each session, you can monitor the following key metrics at a glance: - **Session ID**: A unique identifier for the session. - **Room ID**: The identifier of the room where the session took place. - **TTFW (Time to First Word)**: The time it takes for the agent to utter its first word after the user has finished speaking. This metric is crucial for measuring the responsiveness of your agent. - **P50, P90, P95**: These are percentile metrics for latency, providing a statistical distribution of response times. For example, P90 indicates that 90% of the responses were faster than the specified value. - **Interruption**: The number of times the agent was interrupted by the user. - **Duration**: The total duration of the session. - **Recording**: Indicates whether the session was recorded. You can play back the recording directly from the dashboard. - **Created At**: The timestamp of when the session was created. - **Actions**: From here, you can navigate to the detailed analytics view for the session. ## Session View By clicking on "View Analytics" for a specific session, you are taken to the Session View. This view provides a complete transcript of the conversation, along with timestamps and speaker identification (Caller or Agent).
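To make the P50, P90, and P95 columns from the Key Metrics list concrete, here is a minimal sketch that computes them from a list of latency samples using the nearest-rank method. The sample values and the `percentile` helper are illustrative assumptions; the dashboard may use a different percentile definition (for example, interpolation between samples).

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample that at least p% of samples do not exceed."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the sample that covers p percent of the data
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical per-response latencies for one session, in milliseconds
latencies_ms = [420, 380, 510, 950, 460, 400, 620, 440, 1300, 480]

summary = {f"P{p}": percentile(latencies_ms, p) for p in (50, 90, 95)}
print(summary)
```

A large gap between P50 and P95 in such a summary points to occasional slow responses rather than uniformly high latency.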
If the session was recorded, you can play back the audio and follow along with the transcript, which automatically scrolls as the conversation progresses. This is an invaluable tool for understanding the user experience and identifying areas for improvement. By analyzing these metrics, you can quickly identify underperforming agents, diagnose latency issues, and gain a holistic view of the user experience. The next section will delve into the detailed session and trace views, where you can explore individual conversations and their underlying processes. --- --- id: traces title: Trace Insights keywords: - VideoSDK Tracing - Trace View - Spans and Traces - Speech-to-Text (STT) - Text-to-Speech (TTS) - Large Language Model (LLM) - End-of-Utterance (EOU) - AI Agent Performance - Session Trace Analysis - Conversation Turn Breakdown - Latency Metrics - Tool Call Debugging - Agent Interaction Insights - Real-time Trace Visualization - AI Agent Observability - VideoSDK Dashboard Traces image: img/videosdklive-thumbnail.jpg sidebar_position: 4 sidebar_label: Traces slug: traces --- The real power of VideoSDK's Tracing and Observability tools lies in the detailed session and trace views. These views provide a granular breakdown of each conversation, allowing you to analyze every turn, inspect component latencies, and understand the agent's decision-making process. ## Trace View The Trace View offers an even deeper level of insight, breaking down the entire session into a hierarchical structure of traces and spans.
### Session Configuration At the top level, you'll find the **Session Configuration**, which details all the parameters the agent was initialized with. This includes the models used for STT, LLM, and TTS, as well as any function tools or MCP tools that were configured. This information is crucial for reproducing and debugging specific agent behaviors. ### User & Agent Turns The core of the Trace View is the breakdown of the conversation into **User & Agent Turns**. Each turn represents a single exchange between the user and the agent.
Within each turn, you can see a detailed timeline of the underlying processes, including: - **STT (Speech-to-Text) Processing**: The time it took to transcribe the user's speech. - **EOU (End-of-Utterance) Detection**: The time taken to detect that the user has finished speaking. - **LLM Processing**: The time the Large Language Model took to process the input and generate a response. - **TTS (Text-to-Speech) Processing**: The time it took to convert the LLM's text response into speech. - **Time to First Byte**: The initial delay before the agent starts speaking. - **User Input Speech**: The duration of the user's speech. - **Agent Output Speech**: The duration of the agent's spoken response. ### Turn Properties For each turn, you can inspect the properties of the components involved. This includes the transcript of the user's input, the response from the LLM, and any errors that may have occurred.
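As a rough illustration of how the per-turn timeline can be read, the sketch below takes the component durations of one turn and reports the slowest component and each component's share of the total. The `TurnTimings` class and the numbers are hypothetical; they are not an export format of the dashboard. The naive sum also assumes the stages run one after another, which may not hold in a streaming pipeline where stages overlap.

```python
from dataclasses import dataclass


@dataclass
class TurnTimings:
    """Per-component durations for one turn, in milliseconds (hypothetical shape)."""
    stt_ms: float
    eou_ms: float
    llm_ms: float
    tts_ms: float

    def components(self) -> dict[str, float]:
        return {"STT": self.stt_ms, "EOU": self.eou_ms, "LLM": self.llm_ms, "TTS": self.tts_ms}

    def total_ms(self) -> float:
        # Naive sum; treats the stages as strictly sequential
        return sum(self.components().values())

    def slowest(self) -> tuple[str, float]:
        # Component with the largest duration in this turn
        return max(self.components().items(), key=lambda item: item[1])

    def shares(self) -> dict[str, float]:
        # Fraction of the summed time spent in each component
        total = self.total_ms()
        return {name: value / total for name, value in self.components().items()}


turn = TurnTimings(stt_ms=120, eou_ms=80, llm_ms=640, tts_ms=210)
name, duration = turn.slowest()
print(f"Slowest component: {name} ({duration} ms of {turn.total_ms()} ms)")
```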
By leveraging the detailed information in the Trace View, you can pinpoint performance bottlenecks, debug errors, and gain a comprehensive understanding of your AI agent's inner workings. ### Tool Calls When an LLM invokes a tool, the Trace View provides specific details about the tool call, including the tool's name and the parameters it was called with. This is essential for debugging integrations and ensuring that your agent's tools are functioning as expected.
--- --- title: Wake Up Call hide_title: false hide_table_of_contents: false description: "Learn how to implement Wake Up Call functionality with VideoSDK AI Agents to automatically trigger actions when users are inactive for a specified duration." pagination_label: "Wake Up Call" keywords: - Wake Up Call - Inactivity Detection - Auto Trigger - User Engagement - VideoSDK Agents - Callback Functions - Session Management - AgentSession - Timeout Management image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Wake Up Call slug: wakeup-call --- # Wake Up Call Wake Up Call enables AI agents to automatically trigger actions when users remain inactive for a specified duration. This feature helps maintain user engagement and provides proactive assistance during conversation sessions. ## Overview The Wake Up Call system allows AI agents to: - Monitor user inactivity periods during conversations - Automatically trigger custom callback functions after specified timeouts - Re-engage users with proactive messages or actions ## Key Components ### 1. Wake Up Configuration Set the inactivity timeout duration in the `AgentSession` constructor using the `wake_up` parameter: ```python session = AgentSession( agent=agent, pipeline=pipeline, conversation_flow=conversation_flow, wake_up=10 # seconds ) ``` **Important**: If a `wake_up` time is provided, you must set a callback function before starting the session. If no `wake_up` time is specified, no timer or callback will be activated. ### 2. 
Callback Function

Define a custom async function that will be executed when the inactivity threshold is reached:

```python
async def on_wake_up():
    print("Wake up triggered - user inactive for 10 seconds")
    # session.say() is awaited, as in the other agent examples
    await session.say("Hello, how can I help you today?")

# Assign the callback function to the session
session.on_wake_up = on_wake_up
```

:::tip
Get started quickly with the [Wake Up Call Example](https://github.com/videosdk-live/agents/tree/main/examples/wakeup_call.py), which contains everything you need to implement inactivity detection in your AI agents.
:::

---

---
title: WhatsApp Agent Quick Start
hide_title: false
hide_table_of_contents: false
description: "A comprehensive guide to creating a powerful AI voice agent that can answer calls made to your WhatsApp Business number. Learn how to integrate with Meta Business Platform using direct SIP integration and VideoSDK."
pagination_label: "WhatsApp Agent Quick Start"
keywords:
  - WhatsApp Voice Agent
  - Quick Start
  - VideoSDK Agents
  - AI Agent SDK
  - Python
  - SIP
  - WhatsApp Business
  - Meta Business Platform
  - Voice Integration
  - AI Assistant
  - Customer Service
  - Real-time Communication
image: img/videosdklive-thumbnail.jpg
sidebar_position: 4
sidebar_label: AI WhatsApp Agent
slug: whatsapp-voice-agent-quick-start
---

import WhatsAppQuickStart from '@site/mdx/\_whatsapp-voice-agent-quick-start.mdx';

---

# A2A Implementation Guide

This guide shows you how to build a complete Agent to Agent (A2A) system using the concepts from the [A2A Overview](overview). We'll create a banking customer service system with a main customer service agent and a loan specialist.
## Implementation Overview

We'll build a system with:

- **Customer Service Agent**: Voice-enabled interface agent using a **cascading Pipeline** (STT + LLM + TTS) for voice interactions
- **Loan Specialist Agent**: Text-based domain expert using an **LLM-only Pipeline** for efficient background text processing
- **Intelligent Routing**: Automatic detection and forwarding of loan queries
- **Seamless Communication**: Users get expert responses without knowing about the routing

## Supported Pipeline Configurations

| Configuration | Customer Agent | Specialist Agent | Description |
|---|---|---|---|
| **Cascading + Cascading (LLM-only)** | STT + LLM + TTS + VAD | LLM only | Customer uses full voice pipeline, specialist processes text in background |
| **Realtime + Cascading (LLM-only)** | Realtime model (e.g., Gemini) | LLM only | Customer uses realtime model for voice, specialist processes text in background |

## Structure of the project

```js
A2A
├── agents/
│   ├── customer_agent.py     # CustomerServiceAgent definition
│   └── loan_agent.py         # LoanAgent definition
├── session_manager.py        # Handles session creation, pipeline setup
└── main.py                   # Entry point: runs main() and starts agents
```

## Sequence Diagram

![A2A Architecture](https://cdn.videosdk.live/website-resources/docs-resources/a2a_sequence_diagram.png)

## Step 1: Create the Customer Service Agent

- **`Interface Agent`**: Creates `CustomerServiceAgent` as the main user-facing agent with voice capabilities and customer service instructions.
- **`Function Tool`**: Implements `@function_tool forward_to_specialist()` that uses A2A discovery to find and route queries to domain specialists.
- **`Response Relay`**: Includes `handle_specialist_response()` method that automatically receives and relays specialist responses back to users via `session.say()`.
```python title="agents/customer_agent.py" from videosdk.agents import Agent, AgentCard, A2AMessage, function_tool import asyncio from typing import Dict, Any class CustomerServiceAgent(Agent): def __init__(self): super().__init__( agent_id="customer_service_1", instructions=( "You are a helpful bank customer service agent. " "For general banking queries (account balances, transactions, basic services), answer directly. " "For ANY loan-related queries, questions, or follow-ups, ALWAYS use the forward_to_specialist function " "with domain set to 'loan'. This includes initial loan questions AND all follow-up questions about loans. " "Do NOT attempt to answer loan questions yourself - always forward them to the specialist. " "After forwarding a loan query, stay engaged and automatically relay any response you receive from the specialist. " "When you receive responses from specialists, immediately relay them naturally to the customer." ) ) @function_tool async def forward_to_specialist(self, query: str, domain: str) -> Dict[str, Any]: """Forward queries to domain specialist agents using A2A discovery""" # Use A2A discovery to find specialists by domain specialists = self.a2a.registry.find_agents_by_domain(domain) id_of_target_agent = specialists[0] if specialists else None if not id_of_target_agent: return {"error": f"No specialist found for domain {domain}"} # Send A2A message to the specialist await self.a2a.send_message( to_agent=id_of_target_agent, message_type="specialist_query", content={"query": query} ) return { "status": "forwarded", "specialist": id_of_target_agent, "message": "Let me get that information for you from our loan specialist..." 
} async def handle_specialist_response(self, message: A2AMessage) -> None: """Handle responses from specialist agents and relay to user via TTS directly""" response = message.content.get("response") if response: # Brief pause for natural conversation flow await asyncio.sleep(0.5) await self.session.say(response) async def on_enter(self): # Register this agent with the A2A system await self.register_a2a(AgentCard( id="customer_service_1", name="Customer Service Agent", domain="customer_service", capabilities=["query_handling", "specialist_coordination"], description="Handles customer queries and coordinates with specialists" )) await self.session.say("Hello! I am your customer service agent. How can I help you?") # Set up message listener for specialist responses self.a2a.on_message("specialist_response", self.handle_specialist_response) async def on_exit(self): print("Customer agent left the meeting") ``` ## Step 2: Create the Loan Specialist Agent - **`Specialist Agent Setup`**: Creates `LoanAgent` class with specialized loan expertise instructions and agent_id `"specialist_1"`. - **`Message Handlers`**: Implements `handle_specialist_query()` to process incoming queries and `handle_model_response()` to send responses back. - **`Registration`**: Registers with A2A system using domain "loan" so it can be `discovered` by other agents needing loan expertise. ```python title="agents/loan_agent.py" from videosdk.agents import Agent, AgentCard, A2AMessage class LoanAgent(Agent): def __init__(self): super().__init__( agent_id="specialist_1", instructions=( "You are a specialized loan expert at a bank. " "Provide detailed, helpful information about loans including interest rates, terms, and requirements. " "Give complete answers with specific details when possible. " "You can discuss personal loans, car loans, home loans, and business loans. " "Provide helpful guidance and next steps for loan applications. " "Be friendly and professional in your responses. 
" "Keep responses concise within 5-7 lines and easily understandable." ) ) async def handle_specialist_query(self, message: A2AMessage): """Process incoming queries from customer service agent""" query = message.content.get("query") if query: # Send the query to our AI model for processing await self.session.pipeline.send_text_message(query) async def handle_model_response(self, message: A2AMessage): """Send processed responses back to requesting agent""" response = message.content.get("response") requesting_agent = message.to_agent if response and requesting_agent: # Send the specialist response back to the customer service agent await self.a2a.send_message( to_agent=requesting_agent, message_type="specialist_response", content={"response": response} ) async def on_enter(self): await self.register_a2a(AgentCard( id="specialist_1", name="Loan Specialist Agent", domain="loan", capabilities=["loan_consultation", "loan_information", "interest_rates"], description="Handles loan queries" )) self.a2a.on_message("specialist_query", self.handle_specialist_query) self.a2a.on_message("model_response", self.handle_model_response) async def on_exit(self): print("LoanAgent Left") ``` ## Step 3: Configure Session Management - **`Unified Pipeline`**: Uses a single `Pipeline` class for both agents. The Pipeline auto-detects the mode based on the components you provide. - **`Session Factory`**: Provides `create_pipeline()` and `create_session()` functions to configure agent sessions based on their roles. - **`Modality Separation`**: Ensures customer agent can handle voice while specialist processes text in background. ### Option A: Cascading Customer + Cascading Specialist (Recommended) The customer agent uses a full cascade (STT + LLM + TTS + VAD) for voice interaction, and the specialist uses an LLM-only pipeline for text processing. 
The specialist's response goes directly to the customer's TTS via `session.say()` — **only 1 LLM call** for the specialist query (no duplicate processing). ```python title="session_manager.py" from videosdk.agents import AgentSession, Pipeline from videosdk.plugins.openai import OpenAILLM from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.google import GoogleLLM, GoogleTTS from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector, pre_download_model import os pre_download_model() def create_pipeline(agent_type: str) -> Pipeline: if agent_type == "customer": # Customer agent: Full cascade for voice interaction return Pipeline( stt=DeepgramSTT(), llm=GoogleLLM(api_key=os.getenv("GOOGLE_API_KEY")), tts=GoogleTTS(api_key=os.getenv("GOOGLE_API_KEY")), vad=SileroVAD(), turn_detector=TurnDetector(), ) else: # Specialist agent: LLM-only pipeline for background text processing return Pipeline( llm=OpenAILLM(api_key=os.getenv("OPENAI_API_KEY")), ) def create_session(agent, pipeline) -> AgentSession: return AgentSession( agent=agent, pipeline=pipeline, ) ``` ### Option B: Realtime Customer + Cascading Specialist The customer agent uses a realtime model (Gemini) for low-latency voice interaction, and the specialist uses an LLM-only cascade. 
```python title="session_manager.py" from videosdk.agents import AgentSession, Pipeline from videosdk.plugins.openai import OpenAILLM from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig import os def create_pipeline(agent_type: str) -> Pipeline: if agent_type == "customer": # Customer agent: Realtime model for voice interaction return Pipeline( llm=GeminiRealtime( model="gemini-3.1-flash-live-preview", config=GeminiLiveConfig( voice="Leda", response_modalities=["AUDIO"] ) ) ) else: # Specialist agent: LLM-only pipeline for background text processing return Pipeline( llm=OpenAILLM(api_key=os.getenv("OPENAI_API_KEY")), ) def create_session(agent, pipeline) -> AgentSession: return AgentSession( agent=agent, pipeline=pipeline, ) ``` :::note While setting up pipelines, make sure: - The **customer agent** has **voice capabilities** (via cascading STT+LLM+TTS or a realtime model in the `Pipeline`). - The **specialist agent (Loan Agent)** operates in **text-only mode** (via a standard LLM like `OpenAILLM` in the `Pipeline`). ::: :::info **Pipeline Support**: The VideoSDK AI Agents framework uses a **unified `Pipeline`** class that automatically detects whether you're using a realtime model or cascading components. You can pass a realtime model (like `GeminiRealtime` or `OpenAIRealtime`) or cascading components (STT + LLM + TTS + VAD) to the same `Pipeline` class. This enables flexible configurations for voice and text processing with **A2A**. ::: ## Step 4: Deploy A2A System on VideoSDK Platform - **`Meeting Setup`**: Customer agent joins VideoSDK meeting for user interaction while specialist runs in background mode. Requires environment variables: `VIDEOSDK_AUTH_TOKEN`, `GOOGLE_API_KEY`, and `OPENAI_API_KEY`. - **`System Orchestration`**: Uses `JobContext` and `WorkerJob` to manage the meeting lifecycle and agent coordination. 
- **`Resource Management`**: Handles startup sequence, keeps system running, and provides clean shutdown with proper A2A unregistration ```python title="main.py" import asyncio from contextlib import suppress from agents.customer_agent import CustomerServiceAgent from agents.loan_agent import LoanAgent from session_manager import create_pipeline, create_session from videosdk.agents import JobContext, RoomOptions, WorkerJob async def main(ctx: JobContext): specialist_agent = LoanAgent() specialist_pipeline = create_pipeline("specialist") specialist_session = create_session(specialist_agent, specialist_pipeline) customer_agent = CustomerServiceAgent() customer_pipeline = create_pipeline("customer") customer_session = create_session(customer_agent, customer_pipeline) specialist_task = asyncio.create_task(specialist_session.start()) try: await ctx.connect() await customer_session.start() await asyncio.Event().wait() except (KeyboardInterrupt, asyncio.CancelledError): print("Shutting down...") finally: specialist_task.cancel() with suppress(asyncio.CancelledError): await specialist_task await specialist_session.close() await customer_session.close() await specialist_agent.unregister_a2a() await customer_agent.unregister_a2a() await ctx.shutdown() def customer_agent_context() -> JobContext: room_options = RoomOptions(room_id="", name="Customer Service Agent", playground=True) return JobContext( room_options=room_options ) if __name__ == "__main__": job = WorkerJob(entrypoint=main, jobctx=customer_agent_context) job.start() ``` :::note Ensure that the `JobContext` is created **only for the primary (main) agent**, i.e., the agent responsible for user-facing interaction (e.g., Customer Agent). The background agent (e.g., Loan Agent) should not have its own context or initiate a separate connection. 
::: #### Running the Application Set the required environment variables: ```bash export VIDEOSDK_AUTH_TOKEN="your_videosdk_token" export GOOGLE_API_KEY="your_google_api_key" export OPENAI_API_KEY="your_openai_api_key" ``` Replace `` in the code with your actual meeting ID, then run: ```bash cd A2A python main.py ``` :::tip Quick Start Get the complete working example at [A2A Quick Start Repository](https://github.com/videosdk-live/agents-quickstart/tree/main/A2A) with all the code ready to run. ::: --- --- title: Agent to Agent (A2A) hide_title: false hide_table_of_contents: false description: "Understanding the core concepts of Agent to Agent (A2A) communication in VideoSDK AI Agents - AgentCard, A2AMessage, agent registration, and discovery mechanisms for building collaborative multi-agent systems." pagination_label: "A2A Overview" keywords: - A2A Overview - A2A Protocol - Agent To Agent - AI Agent - Google's A2A - AgentCard - A2AMessage - Agent Registration - Agent Discovery - Multi-Agent Communication - VideoSDK Agents - AI Agent SDK - Agent Collaboration image: img/videosdklive-thumbnail.jpg sidebar_position: 5 sidebar_label: Overview slug: overview --- # Agent to Agent (A2A) The Agent to Agent (A2A) protocol enables seamless collaboration between specialized AI agents, allowing them to communicate, share knowledge, and coordinate responses based on their unique capabilities and domain expertise. With VideoSDK's A2A implementation, you can create multi-agent systems where different agents work together to provide comprehensive solutions. ## How It Works ### Basic Flow 1. **Agent Registration**: Agents register themselves with an `AgentCard` that contains their capabilities and domain expertise 2. **Client Query**: Client sends a query to the main agent 3. **Agent Discovery**: Main agent discovers relevant specialist agents using agent cards 4. **Query Forwarding**: Main agent forwards specialized queries to appropriate agents 5. 
**Response Chain**: Specialist agents process queries and respond back to the main agent 6. **Client Response**: Main agent formats and delivers the final response to the client ![A2A Architecture](https://cdn.videosdk.live/website-resources/docs-resources/a2a_diagram.png) ### Example Scenario ``` Client → "Book a flight to New York and find a hotel" ↓ Travel Agent (Main) → Analyzes query ↓ Travel Agent → Discovers Flight Booking Agent & Hotel Booking Agent ↓ Travel Agent → Forwards flight query to Flight Booking Agent Travel Agent → Forwards hotel query to Hotel Booking Agent ↓ Specialist Agents → Process queries and respond back (text format) ↓ Travel Agent → Combines responses and sends to client (audio format) ``` # Core Components ## 1. AgentCard The `AgentCard` is how agents identify themselves and advertise their capabilities to other agents. #### Structure ```python AgentCard( id="agent_flight_001", name="Skymate", domain="flight", capabilities=[ "search_flights", "modify_bookings", "show_flight_status" ], description="Handles all flight-related tasks" ) ``` #### Parameters | Parameter | Type | Required | Description | | -------------- | ------ | -------- | ------------------------------------ | | `id` | string | Yes | Unique identifier for the agent | | `name` | string | Yes | Human-readable agent name | | `domain` | string | Yes | Primary expertise domain | | `capabilities` | list | Yes | List of specific capabilities | | `description` | string | Yes | Brief description of agent's purpose | | `metadata` | dict | No | Additional metadata for the agent | ## 2. A2AMessage `A2AMessage` is the standardized communication format between agents. 
#### Structure

```python
message = A2AMessage(
    from_agent="travel_agent_1",
    to_agent="agent_flight_001",
    type="flight_status_query",
    content={"query": "What's the status of AI202?"},
    metadata={"client_id": "xyz123", "urgency": "medium"}
)
```

#### Parameters

| Parameter    | Type   | Required | Description                 |
| ------------ | ------ | -------- | --------------------------- |
| `from_agent` | string | Yes      | ID of the sending agent     |
| `to_agent`   | string | Yes      | ID of the receiving agent   |
| `type`       | string | Yes      | Message type/event name     |
| `content`    | dict   | Yes      | Message payload             |
| `metadata`   | dict   | No       | Additional message metadata |

## 3. Agent Registry

#### `register_a2a(agent_card)`

Register an agent with the A2A system.

```python
async def on_enter(self):
    await self.register_a2a(AgentCard(
        id="agent_flight_001",
        name="Skymate",
        domain="flight",
        capabilities=[
            "search_flights",
            "modify_bookings",
            "show_flight_status"
        ],
        description="Handles all flight-related tasks"
    ))
```

**What Registration Does:**

- Adds the agent to the global `AgentRegistry` singleton
- Makes the agent discoverable by other agents
- Stores both the `AgentCard` and agent instance
- Enables message routing to this agent

#### `unregister_a2a()`

Unregister an agent from the A2A system.

```python
await self.unregister_a2a()
```

## 4. A2AProtocol Class

The main class for managing agent-to-agent communication.

### Agent Discovery

#### `find_agents_by_domain(domain: str)`

Discover agents based on their domain expertise.

```python
agents = self.a2a.registry.find_agents_by_domain("hotel")
# Returns: ["agent_hotel_001"]
```

#### `find_agents_by_capability(cap: str)`

Find agents with specific skills.

```python
agents = await self.a2a.registry.find_agents_by_capability("modify_bookings")
# Returns: ["agent_flight_001"]
```

---

### Agent Communications

#### `send_message(to_agent, message_type, content, metadata=None)`

Send messages directly to other agents.
```python await self.a2a.send_message( to_agent="agent_hotel_001", message_type="hotel_booking_query", # Event name that the receiving agent listens for content={"query": "Find 3-star hotels in Delhi under $100"}, metadata={"client_id": "xyz123"} # Optional metadata ) ``` **Parameters:** - `to_agent` (string): Target agent ID - `message_type` (string): Event name the receiving agent listens for - `content` (dict): Message payload - `metadata` (dict, optional): Additional message metadata #### `on_message(message_type, handler)` Register message handlers for incoming messages. ```python # Register a handler for specialist queries self.a2a.on_message("hotel_booking_query", self.handle_specialist_query) async def handle_specialist_query(self, message): # Process the incoming message query = message.content.get("query") # ... process query ... # Return response return {"response": "Current mortgage rates are 6.5%"} ``` ## Next Steps Now that you're familiar with the core A2A concepts, it's time to move from theory to practice: 👉 **[Explore the Full A2A Implementation](implementation)** Dive into a complete, working example that demonstrates agent discovery, messaging, and collaboration in action. --- --- title: Build a Custom Voice AI Agent in Minutes hide_title: false hide_table_of_contents: false description: "Use VideoSDK's low-code builder to design, test, and deploy a personalized voice agent powered by your preferred LLM." 
keywords: - voice ai agent - low-code agent builder - conversational ai - videosdk agents - gemini - realtime - telephony - knowledge base - speech recognition - tts image: https://strapi.videosdk.live/uploads/Screenshot_2025_11_17_at_5_06_23_PM_33a509fd4e.png sidebar_label: Build Agent slug: build-agent --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import Step from '@site/src/components/Step' # Agent Runtime Guide AI voice agents are transforming how businesses interact with customers, providing natural, conversational experiences through voice interfaces. VideoSDK's **Agent Runtime** feature offers a powerful **no-code/low-code interface** that enables you to build sophisticated AI voice agents without extensive programming knowledge. ## Prerequisites Before you begin, ensure you have: - **VideoSDK Account:** Visit [VideoSDK Dashboard](https://app.videosdk.live) to sign up for a free account and access the AI Agent builder. ## Step-By-Step Guide
### Step 1: Create a New Agent
1. In the dashboard, navigate to **AI Agent > Agents** or visit [Agents Dashboard](https://app.videosdk.live/agents/agents). 2. You'll see the `AI Agent > Agents` section in the dashboard. 3. To create a voice agent, click on **Agents** in the sidebar. ![Select Agents in Dashboard](https://strapi.videosdk.live/uploads/1_Select_Agents_in_Dashboard_1b6a6f6d0c.png)
### Step 2: Click `Add New Agent`
This is where you'll start creating your voice agent. If no agent has been created yet, you'll see an **Add New Agent** button. If agents already exist, you'll see a list of all AI voice agents, and you can click the button in the top right corner to create a new agent.

![Click Create AI Voice Agent Button](https://strapi.videosdk.live/uploads/2_Click_Create_AI_Voice_Agent_Button_349f3799f2.png)
### Step 3: Configure Agent Details
This is where you can define your AI voice agent's persona and behavior: - **Agent Name:** Set a descriptive name for your agent (e.g., "AI Interviewer"). - **System Prompt:** Define the agent's role, personality, and behavior guidelines. - **Welcome Message:** Set the message that plays when the agent joins a conversation. - **Closing Message:** Set the message that plays when the agent leaves a conversation. ![Create Voice Agent Persona](https://strapi.videosdk.live/uploads/3_Create_Voice_Agent_Persona_6281a768ef.png)
### Step 4: Configure the Pipeline
The pipeline is the core engine of your voice agent, processing audio through speech recognition, AI reasoning, and text-to-speech. VideoSDK offers two pipeline options: **Realtime** and **Cascade**.

The **Realtime** pipeline provides direct speech-to-speech processing with minimal latency, ideal for natural, conversational interactions.

Example: Adding **Gemini Realtime Model**

1. Add your Gemini API key in the pipeline configuration or at [Realtime Integrations](https://app.videosdk.live/agents/integrations/realtime).
2. To get your API key, visit [Gemini API Keys](https://aistudio.google.com/api-keys).

![Gemini Add Your API Key](https://strapi.videosdk.live/uploads/4_Gemini_Add_Your_API_Key_bcf81a0f82.png)

**Available models:**

- `gemini-2.5-flash-native-audio-preview-12-2025`
- `gemini-2.0-flash`
- `gemini-2.5-flash-native-audio`

The **Cascade** pipeline processes audio through distinct stages (STT → LLM → TTS), providing maximum control over each component. Configure your providers for [STT Integrations](https://app.videosdk.live/agents/integrations/stt), [LLM Integrations](https://app.videosdk.live/agents/integrations/llm) and [TTS Integrations](https://app.videosdk.live/agents/integrations/tts).

![STT Providers](https://strapi.videosdk.live/uploads/stt_e2522d9ea2.png)

Example: Adding **Deepgram STT**

- Get API Key at: [Deepgram Console](https://console.deepgram.com/)

**Available models:**

- `flux-general-en`
- `nova-2` or `nova-2-general` (for non-English transcriptions)
- `nova-3` or `nova-3-general`
- `base`
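To make the Cascade idea concrete, here is a minimal conceptual sketch of the STT → LLM → TTS hand-off in plain Python. It is not the VideoSDK `Pipeline` class: `CascadePipeline`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical stand-ins used only to show how each stage consumes the previous stage's output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadePipeline:
    """Hypothetical three-stage pipeline; NOT the VideoSDK Pipeline class."""

    stt: Callable[[bytes], str]   # speech-to-text stage
    llm: Callable[[str], str]     # reasoning stage
    tts: Callable[[str], bytes]   # text-to-speech stage

    def respond(self, audio: bytes) -> bytes:
        # Each stage receives the output of the stage before it.
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stand-in stages (assumptions for illustration): real providers would call
# an STT, LLM, or TTS service here instead of transforming strings.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
```

Because each stage is an independent callable, swapping one provider (for example, a different STT model) leaves the other two untouched, which is the control the Cascade option is trading latency for.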
### Step 5: Knowledge Base Integration
Upload a knowledge base to provide context and domain expertise to your voice agent. This dramatically improves answer accuracy and enables your agent to handle specialized queries. - Navigate to the **Knowledge Base** tab in your agent configuration. - Upload documents, FAQs, or product sheets that contain relevant information. - The agent will use this knowledge to provide more accurate and contextual responses. ![Add Knowledge Base in VideoSDK](https://strapi.videosdk.live/uploads/Add_Knowlodege_base_in_videosdk_363aaa82f3.png)
### Step 6: Configure Telephony Settings
Configure telephony settings to enable your agent to handle phone calls: - **Agent Type:** Set the type of agent (inbound, outbound, or both). - **Inbound Gateways:** Set up gateways to receive incoming calls. - **Outbound Gateways:** Set up gateways to make outbound calls. - **Routing Rules:** Create rules to map phone numbers to your agent. - **Calling Settings:** Configure call handling preferences and behavior. ![Telephony Configuration](https://strapi.videosdk.live/uploads/telephony_agents_dd2c2080ac.png) This configuration is essential for **call center automation**, **platform integration**, and smooth **agent orchestration**.
### Step 7: Test Your Voice Agent
You can interact with the agent directly from the dashboard before connecting it to production channels: 1. Visit [Agents Dashboard](https://app.videosdk.live/agents/agents). 2. Locate your agent in the list and click the **Test** button in the top-right corner. 3. Use the built-in simulator to speak with the agent in real time, view live transcripts, and fine-tune prompts based on the conversation. ![Test AI Voice Agent](https://strapi.videosdk.live/uploads/test_ai_voice_agent_30e0045af0.png)
### Step 8: Connect Voice Agent
Once your agent is configured, you can connect it to various platforms and devices: - **Web:** Integrate your agent into web applications. - **Mobile:** Connect to iOS and Android mobile apps. - **Telephony:** Deploy to phone systems for voice calls. - **IoT Devices:** Connect to Internet of Things devices. ![Connect AI Voice Agent](https://strapi.videosdk.live/uploads/8_connect_ai_voice_agent_17fe428419.png) ## Next Steps Congratulations! You've successfully created your AI voice agent. Here are the next steps: - **Test Your Agent:** Use the built-in test simulator to verify your agent's behavior and responses. - **Deploy to Production:** Connect your agent to production environments and real user interactions. - **Monitor Performance:** Track agent performance, user satisfaction, and conversation quality. - **Iterate and Improve:** Refine your agent's prompts, knowledge base, and configuration based on real-world usage. Keep refining your agent's configuration to build a powerful voice AI solution tailored to your specific business needs. ### Starter Apps import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, RobotIcon, GithubIcon } from '@site/src/components/agent/cards'; --- --- title: Android Agent Starter hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using an Android frontend and a no-code agent from the dashboard. sidebar_label: Android pagination_label: Agent Runtime with Android keywords: - ai agent - no-code - voice interaction - real-time communication - android sdk image: img/videosdklive-thumbnail.jpg sidebar_position: 3 slug: agent-starter-android --- import Step from '@site/src/components/Step' import CreateAgent from '@site/mdx/\_ai-agent-starter-sdk-guide.mdx' # Agent Starter App - Android VideoSDK enables you to add a voice-enabled AI agent to your Android app. 
This guide walks you through connecting your Android frontend to an agent configured and deployed directly from the VideoSDK dashboard. ## Prerequisites - A deployed AI agent on VideoSDK Agent Cloud. If you haven't done this yet, create and deploy your agent using the [Low-Code Deployment UI](/ai_agents/agent-runtime/build-agent) on the VideoSDK Dashboard. Once deployed, note down your **Agent ID**. - Android 8.0 (API level 26) or later - Android Studio (latest stable) with JDK 17 - Valid Video SDK [Account](https://app.videosdk.live/) :::info Explore the complete Quickstart implementation in the [Android Agent Starter](https://github.com/videosdk-live/agent-starter-app-android) and see how to run and customize it for your own use case. ::: import APISecret from '@site/mdx/introduction/\_api-key.mdx'; ## Run the Sample Project
### Step 1: Clone the sample project
Clone the repository to your local environment. ```bash git clone https://github.com/videosdk-live/agent-starter-app-android.git cd agent-starter-app-android ```
### Step 2: Open in Android Studio
Launch Android Studio and open the `agent-starter-app-android` folder. Let Gradle finish syncing before proceeding.
### Step 3: Create Your Agent (Optional)
:::info If you've already configured and deployed your agent from the VideoSDK Dashboard, you can jump directly to [Step 4](#step-4-setup-environment-variables). :::
### Step 4: Setup Environment Variables
Copy the `local.properties.example` file to `local.properties` at the project root. ```bash cp local.properties.example local.properties ``` Update the `local.properties` file with your credentials. The `agentId` is the identifier for the Low-Code agent you deployed from the VideoSDK Dashboard. ```properties title="local.properties" authToken=your_videosdk_auth_token agentId=your_agent_id meetingId=your_meeting_id versionId=your_version_id ``` :::tip You can obtain your `authToken` and `agentId` from the [VideoSDK Dashboard](https://app.videosdk.live/) under your Agent Cloud deployment. `meetingId` and `versionId` are optional — if left blank, the app will create a new meeting automatically and use the latest deployed version of your agent. :::
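If you want to sanity-check the file before building, a small script along these lines can catch missing or placeholder values. This is a sketch under assumptions: the helper names are hypothetical, the parser handles only simple `key=value` lines, and treating values that start with `your_` as unset is a convention borrowed from the example file above.

```python
# Keys taken from the local.properties example above.
REQUIRED_KEYS = ("authToken", "agentId")
OPTIONAL_KEYS = ("meetingId", "versionId")
PLACEHOLDER_PREFIX = "your_"  # assumption: example placeholders start with this


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blanks and comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


def missing_required(values: dict[str, str]) -> list[str]:
    """Return required keys that are absent, empty, or still placeholders."""
    return [
        key
        for key in REQUIRED_KEYS
        if not values.get(key) or values[key].startswith(PLACEHOLDER_PREFIX)
    ]
```

A typical use would be to read `local.properties`, pass its text to `parse_properties`, and stop early if `missing_required` returns a non-empty list.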
### Step 5: Run the Sample App
Bingo, it's time to push the launch button. Connect a physical Android device (or start an emulator running API 26+), select it as the run target in Android Studio, and click the **Run** button (or press `Shift + F10`). You can also install a debug build from the command line: ```bash ./gradlew installDebug ``` Once running, the app will use the Dispatch API to send your deployed agent into the meeting room. You'll see the live transcription as you speak, and the agent will respond in real time. ## Troubleshooting ### Common Issues: 1. **Agent not joining:** - Check that the `agentId` and `versionId` in your `local.properties` are correctly set. - Verify your VideoSDK token is valid and has the necessary permissions. 2. **Audio not working:** - Check device permissions for microphone access. - Ensure `RECORD_AUDIO` permission is declared in `AndroidManifest.xml`. 3. **"Failed to connect agent" error:** - Verify your `agentId` and `versionId` are correct. - Check the debug console for any network errors. 4. **Android build issues:** - Ensure you are running Android 8.0 (API level 26) or higher. - Verify JDK 17 is configured in Android Studio under **File → Project Structure → SDK Location**. - Confirm Android Gradle Plugin version is 9.0.1 and Kotlin version is 2.0.21. - Try cleaning the build: `./gradlew clean`. - Try **File → Invalidate Caches / Restart** in Android Studio, then sync Gradle again. --- --- title: Flutter Agent Starter hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using Flutter frontend and a no-code agent from the dashboard. 
sidebar_label: Flutter pagination_label: Agent Runtime with Flutter keywords: - ai agent - no-code - voice interaction - real-time communication - flutter sdk image: img/videosdklive-thumbnail.jpg sidebar_position: 2 slug: agent-starter-flutter --- import Step from '@site/src/components/Step' import CreateAgent from '@site/mdx/\_ai-agent-starter-sdk-guide.mdx' # Agent Starter App - Flutter VideoSDK enables you to seamlessly add a voice-enabled AI agent to your Flutter app — this guide walks you through connecting your Flutter frontend to an agent configured and deployed directly from the VideoSDK dashboard. ## Prerequisites - A deployed AI agent on VideoSDK Agent Cloud. If you haven't done this yet, create and deploy your agent using the [Low-Code Deployment UI](/ai_agents/agent-runtime/build-agent) on the VideoSDK Dashboard — no coding required. Once deployed, note down your **Agent ID**. - If your target platform is iOS, your development environment must meet the following requirements: - Flutter 3.8.0 or later - Dart 3.x or later - Valid Video SDK [Account](https://app.videosdk.live/) :::info Explore the complete Quickstart implementation in the [Flutter Agent Starter](https://github.com/videosdk-live/agent-starter-app-flutter) and see how to run and customize it for your own use case. ::: import APISecret from '@site/mdx/introduction/\_api-key.mdx'; ## Run the Sample Project
### Step 1: Clone the sample project
Clone the repository to your local environment.

```bash
git clone https://github.com/videosdk-live/agent-starter-app-flutter.git
cd agent-starter-app-flutter
```
### Step 2: Install the dependencies
Install all the dependencies to run the project. ```bash flutter pub get ```
### Step 3: Create Your Agent (Optional)
:::info If you've already configured and deployed your agent from the VideoSDK Dashboard, you can jump directly to [Step 4](#step-4-setup-environment-variables). :::
### Step 4: Setup Environment Variables
Copy the `.env.example` file to `.env`. ```bash cp .env.example .env ``` Update the `.env` file with your credentials. The `AGENT_ID` is the identifier for the Low-Code agent you deployed from the VideoSDK Dashboard. ```env title=".env" AUTH_TOKEN=your_videosdk_auth_token AGENT_ID=your_agent_id MEETING_ID=your_meeting_id VERSION_ID=your_version_id ``` **Tip:** You can obtain your `AUTH_TOKEN` and `AGENT_ID` from the [VideoSDK Dashboard](https://app.videosdk.live/) under your Agent Cloud deployment. `MEETING_ID` is optional — if left blank, the app will create a new meeting automatically.
### Step 5: Run the Sample App
Bingo, it's time to push the launch button. **Android:** ```bash flutter run ``` **iOS:** ```bash cd ios && pod install && cd .. flutter run -d ios ``` Once running, the app will use the Dispatch API to send your deployed agent into the meeting room. You'll see the live transcription as you speak, and the agent will respond in real time. --- ## Troubleshooting ### Common Issues: 1. **Agent not joining:** - Check that the `AGENT_ID` and `VERSION_ID` in your `.env` are correctly set. - Verify your VideoSDK token is valid and has the necessary permissions. 2. **Audio not working:** - Check device permissions for microphone access. 3. **"Failed to connect agent" error:** - Verify your `AGENT_ID` and `VERSION_ID` are correct. - Check the debug console for any network errors. 4. **Flutter build issues:** - Ensure your Flutter version is compatible (3.8.0 or later for iOS targets). - Try cleaning the build: `flutter clean`. - Delete `pubspec.lock` and run `flutter pub get`. - For iOS: run `cd ios && pod install` before `flutter run`. --- --- title: iOS Agent Starter hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using iOS frontend and a no-code agent from the dashboard. sidebar_label: iOS pagination_label: Agent Runtime with iOS keywords: - ai agent - no-code - voice interaction - real-time communication - ios sdk image: img/videosdklive-thumbnail.jpg sidebar_position: 3 slug: agent-starter-ios --- import Step from '@site/src/components/Step' import CreateAgent from '@site/mdx/\_ai-agent-starter-sdk-guide.mdx' # Agent Starter App - iOS VideoSDK enables you to seamlessly add a voice-enabled AI agent to your iOS app — this guide walks you through connecting your iOS application to an agent configured and deployed directly from the VideoSDK dashboard. ## Prerequisites - A deployed AI agent on VideoSDK Agent Cloud. 
If you haven't done this yet, create and deploy your agent using the [Low-Code Deployment UI](/ai_agents/agent-runtime/build-agent) on the VideoSDK Dashboard, no coding required. Once deployed, note down your **Agent ID**.
- For iOS, your development environment must meet the following requirements:
  - iOS 18 or later
  - Xcode 16.4 or later
- Valid Video SDK [Account](https://app.videosdk.live/)

:::info
Explore the complete Quickstart implementation in the [iOS Agent Starter](https://github.com/videosdk-live/agent-starter-app-ios) and see how to run and customize it for your own use case.
:::

import APISecret from '@site/mdx/introduction/\_api-key.mdx';

## Run the Sample Project
### Step 1: Clone the sample project
Clone the repository to your local environment.

```bash
git clone https://github.com/videosdk-live/agent-starter-app-ios.git
cd agent-starter-app-ios
```
### Step 2: Open the project in Xcode
Open the `agent-starter-ios.xcodeproj` file using Xcode.
### Step 3: Create Your Agent (Optional)
:::info If you've already configured and deployed your agent from the VideoSDK Dashboard, you can jump directly to [Step 4](#step-4-set-up-credentials). :::
### Step 4: Set up credentials
Before running the app, you need to configure your authentication details. Open `agent-starter-ios/Constants/MeetingConfig.swift` and supply the required values:

```
AUTH_TOKEN:
AGENT_ID:
MEETING_ID:
VERSION_ID:
```

**Tip:** You can obtain your `AUTH_TOKEN` and `AGENT_ID` from the [VideoSDK Dashboard](https://app.videosdk.live/) under your Agent Cloud deployment. `MEETING_ID` is optional: if left blank, the app will create a new meeting automatically. `VERSION_ID` is also optional: if left blank, the app will fetch the agent's versions and use the latest one for the meeting.
### Step 5: Build and Run
Bingo, now select your target physical device and click the **Run** button (or press `Cmd + R`) in Xcode!

Once running, the app will use the Dispatch API to send your deployed agent into the meeting room. You'll see the live transcription as you speak, and the agent will respond in real time.

---

## Troubleshooting

### Common Issues:

1. **Agent not joining:**
   - Check that the `AGENT_ID` and `VERSION_ID` in your `agent-starter-ios/Constants/MeetingConfig.swift` are correctly set.
   - Verify your VideoSDK token is valid and has the necessary permissions.

2. **Audio not working:**
   - Check device permissions for microphone access.

3. **"Failed to connect agent" error:**
   - Verify your `AGENT_ID` and `VERSION_ID` are correct.
   - Check the debug console for any network errors.

---

---
title: Agent Runtime with Flutter
hide_title: false
hide_table_of_contents: false
description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using Flutter frontend and a no-code agent from the dashboard.
sidebar_label: With Flutter
pagination_label: Agent Runtime with Flutter
keywords:
  - ai agent
  - no-code
  - voice interaction
  - real-time communication
  - flutter sdk
image: img/videosdklive-thumbnail.jpg
sidebar_position: 2
slug: with-flutter
---

import Step from '@site/src/components/Step'

# Agent Runtime with Flutter

VideoSDK lets you integrate AI agents with real-time voice interaction into your Flutter application within minutes. This guide shows you how to connect a Flutter frontend with an AI agent created and configured entirely from the VideoSDK dashboard.

## Prerequisites

Before proceeding, ensure that your development environment meets the following requirements:

- Video SDK Developer Account (if you don't have one, sign up on the **[Video SDK Dashboard](https://app.videosdk.live/)**)
- Flutter installed on your device
- Familiarity with creating a no-code voice agent.
If you're new to this, please follow our guide on how to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)** first. :::important You need a VideoSDK account to generate a token and an agent from the dashboard. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token. ::: ## Project Structure Your project structure should look like this: ```jsx title="Project Structure" root ├── android ├── ios ├── lib │ ├── api_call.dart │ ├── join_screen.dart │ ├── main.dart │ ├── meeting_controls.dart │ ├── meeting_screen.dart │ └── participant_tile.dart ├── macos ├── web └── windows ``` You will be working on the following files: - `join_screen.dart`: Responsible for the user interface to join a meeting. - `meeting_screen.dart`: Displays the meeting interface and handles meeting logic. - `api_call.dart`: Handles API calls for creating meetings and dispatching agents. ## 1. Flutter Frontend
### Step 1: Getting Started
Follow these steps to create the environment necessary to add AI agent functionality to your app. #### Create a New Flutter App Create a new Flutter app using the following command: ```bash $ flutter create videosdk_ai_agent_flutter_app ``` #### Install VideoSDK Install the VideoSDK using the following Flutter command. Make sure you are in your Flutter app directory before you run this command. ```bash $ flutter pub add videosdk $ flutter pub add http ```
### Step 2: Configure Project
#### For Android

- Update `/android/app/src/main/AndroidManifest.xml` with the permissions required for the audio and video features.

```xml title="android/app/src/main/AndroidManifest.xml"
```

- If necessary, increase `minSdkVersion` in the `defaultConfig` block of `build.gradle` to `23` (the default Flutter generator currently sets it to `16`).

#### For iOS

- Add the following entries, which allow your app to access the camera and microphone, to your `/ios/Runner/Info.plist` file:

```xml title="/ios/Runner/Info.plist"
<key>NSCameraUsageDescription</key>
<string>$(PRODUCT_NAME) Camera Usage!</string>
<key>NSMicrophoneUsageDescription</key>
<string>$(PRODUCT_NAME) Microphone Usage!</string>
```

- Uncomment the following line in `/ios/Podfile` to define a global platform for your project:

```ruby title="/ios/Podfile"
platform :ios, '12.0'
```

#### For macOS

- Add the following entries, which allow your app to access the camera and microphone, to your `/macos/Runner/Info.plist` file.

```xml title="/macos/Runner/Info.plist"
<key>NSCameraUsageDescription</key>
<string>$(PRODUCT_NAME) Camera Usage!</string>
<key>NSMicrophoneUsageDescription</key>
<string>$(PRODUCT_NAME) Microphone Usage!</string>
```

- Add the following entries, which allow your app to access the camera and microphone and to open outgoing network connections, to your `/macos/Runner/DebugProfile.entitlements` file.

```xml title="/macos/Runner/DebugProfile.entitlements"
<key>com.apple.security.network.client</key>
<true/>
<key>com.apple.security.device.camera</key>
<true/>
<key>com.apple.security.device.microphone</key>
<true/>
```

- Add the following entries, which allow your app to access the camera and microphone and to open network connections, to your `/macos/Runner/Release.entitlements` file.

```xml title="/macos/Runner/Release.entitlements"
<key>com.apple.security.network.server</key>
<true/>
<key>com.apple.security.network.client</key>
<true/>
<key>com.apple.security.device.camera</key>
<true/>
<key>com.apple.security.device.microphone</key>
<true/>
```
### Step 3: Configure Environment and Credentials
Create a meeting room using the VideoSDK API:

```bash
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_JWT_TOKEN_HERE" \
  -H "Content-Type: application/json"
```

Copy the `roomId` from the response and configure it in `lib/api_call.dart` along with your agent credentials.

```dart title="lib/api_call.dart"
import 'dart:convert';
import 'package:http/http.dart' as http;

// Auth token used to create a meeting and connect to it
const token = 'YOUR_VIDEOSDK_AUTH_TOKEN';
const agentId = 'YOUR_AGENT_ID';
const versionId = 'YOUR_VERSION_ID';

// API call to create a meeting
Future<String> createMeeting() async {
  final http.Response httpResponse = await http.post(
    Uri.parse('https://api.videosdk.live/v2/rooms'),
    headers: {'Authorization': token},
  );

  // Extract the roomId from the response body
  return json.decode(httpResponse.body)['roomId'];
}

// API call to dispatch the agent into the meeting
Future<void> connectAgent(String meetingId) async {
  final http.Response httpResponse = await http.post(
    Uri.parse('https://api.videosdk.live/v2/agent/general/dispatch'),
    headers: {
      'Authorization': token,
      'Content-Type': 'application/json',
    },
    body: json.encode({
      'agentId': agentId,
      'meetingId': meetingId,
      'versionId': versionId,
    }),
  );

  if (httpResponse.statusCode != 200) {
    throw Exception('Failed to connect agent');
  }
}
```
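For quick testing outside the Flutter app (for example, from a backend or a script), the same two calls can be sketched in Python using only the standard library. The endpoints and JSON fields are the ones used in `api_call.dart` above; the function names are hypothetical, and the response shape beyond `roomId` is not assumed.

```python
import json
import urllib.request

API_BASE = "https://api.videosdk.live/v2"


def build_create_room_request(token: str) -> urllib.request.Request:
    """Build the POST /rooms request used to create a meeting."""
    return urllib.request.Request(
        f"{API_BASE}/rooms",
        method="POST",
        headers={"Authorization": token, "Content-Type": "application/json"},
    )


def build_dispatch_request(
    token: str, agent_id: str, meeting_id: str, version_id: str
) -> urllib.request.Request:
    """Build the dispatch request that sends the agent into a meeting."""
    payload = {
        "agentId": agent_id,
        "meetingId": meeting_id,
        "versionId": version_id,
    }
    return urllib.request.Request(
        f"{API_BASE}/agent/general/dispatch",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Authorization": token, "Content-Type": "application/json"},
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode a JSON body; raises on HTTP errors."""
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8")
    return json.loads(body) if body else {}
```

Usage would follow the same order as the Dart code: call `send(build_create_room_request(token))`, read `roomId` from the result, then pass it as `meeting_id` to `build_dispatch_request` and send that request.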
### Step 4: Design the User Interface (UI)
Update the UI files to add the "Connect Agent" button and wire up its logic.

```dart title="lib/join_screen.dart"
import 'package:flutter/material.dart';
import 'api_call.dart';
import 'meeting_screen.dart';

class JoinScreen extends StatelessWidget {
  final _meetingIdController = TextEditingController();

  JoinScreen({super.key});

  void onJoinButtonPressed(BuildContext context) {
    // check that the meeting id is not null or invalid
    // if the meeting id is valid, navigate to MeetingScreen with meetingId and token
    Navigator.of(context).push(
      MaterialPageRoute(
        builder: (context) =>
            MeetingScreen(meetingId: "YOUR_MEETING_ID", token: token),
      ),
    );
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('VideoSDK QuickStart')),
      body: Padding(
        padding: const EdgeInsets.all(12.0),
        child: Center(
          child: ElevatedButton(
            onPressed: () => onJoinButtonPressed(context),
            child: const Text('Join Meeting'),
          ),
        ),
      ),
    );
  }
}
```

```dart title="lib/meeting_screen.dart"
import 'package:flutter/material.dart';
import 'package:videosdk/videosdk.dart';
import 'participant_tile.dart';
import 'meeting_controls.dart';
import 'api_call.dart';

class MeetingScreen extends StatefulWidget {
  final String meetingId;
  final String token;

  const MeetingScreen({
    super.key,
    required this.meetingId,
    required this.token,
  });

  @override
  State<MeetingScreen> createState() => _MeetingScreenState();
}

class _MeetingScreenState extends State<MeetingScreen> {
  late Room _room;
  var micEnabled = true;
  var camEnabled = true;
  bool _isAgentConnected = false;

  Map<String, Participant> participants = {};

  @override
  void initState() {
    // create room
    _room = VideoSDK.createRoom(
      roomId: widget.meetingId,
      token: widget.token,
      displayName: "John Doe",
      micEnabled: micEnabled,
      camEnabled: false,
      defaultCameraIndex: 1, // Index of MediaDevices will be used to set default camera
    );

    setMeetingEventListener();

    // Join room
    _room.join();

    super.initState();
  }

  // listening to meeting events
  void setMeetingEventListener() {
    _room.on(Events.roomJoined, () {
      setState(() {
        participants.putIfAbsent(
          _room.localParticipant.id,
          () => _room.localParticipant,
        );
      });
    });

    _room.on(Events.participantJoined, (Participant participant) {
      setState(
        () => participants.putIfAbsent(participant.id, () => participant),
      );
    });

    _room.on(Events.participantLeft, (String participantId) {
      if (participants.containsKey(participantId)) {
        setState(() => participants.remove(participantId));
      }
    });

    _room.on(Events.roomLeft, () {
      participants.clear();
      Navigator.popUntil(context, ModalRoute.withName('/'));
    });
  }

  void _connectAgent() async {
    try {
      await connectAgent(widget.meetingId);
      // the widget may have been disposed while the request was in flight
      if (!mounted) return;
      setState(() {
        _isAgentConnected = true;
      });
      ScaffoldMessenger.of(context).showSnackBar(
        const SnackBar(content: Text('Agent connected successfully!')),
      );
    } catch (e) {
      if (!mounted) return;
      ScaffoldMessenger.of(context).showSnackBar(
        SnackBar(content: Text('Failed to connect agent: ${e.toString()}')),
      );
    }
  }

  // leave the room when the back button is pressed
  Future<bool> _onWillPop() async {
    _room.leave();
    return true;
  }

  @override
  Widget build(BuildContext context) {
    return WillPopScope(
      onWillPop: () => _onWillPop(),
      child: Scaffold(
        appBar: AppBar(title: const Text('VideoSDK QuickStart')),
        body: Padding(
          padding: const EdgeInsets.all(8.0),
          child: Column(
            children: [
              Text(widget.meetingId),
              // render all participants
              Expanded(
                child: Padding(
                  padding: const EdgeInsets.all(8.0),
                  child: GridView.builder(
                    gridDelegate: const SliverGridDelegateWithFixedCrossAxisCount(
                      crossAxisCount: 2,
                      crossAxisSpacing: 10,
                      mainAxisSpacing: 10,
                      mainAxisExtent: 300,
                    ),
                    itemBuilder: (context, index) {
                      return ParticipantTile(
                        key: Key(participants.values.elementAt(index).id),
                        participant: participants.values.elementAt(index),
                      );
                    },
                    itemCount: participants.length,
                  ),
                ),
              ),
              MeetingControls(
                onToggleMicButtonPressed: () {
                  micEnabled ? _room.muteMic() : _room.unmuteMic();
                  micEnabled = !micEnabled;
                },
                onLeaveButtonPressed: () => _room.leave(),
                // a null callback disables the button once the agent is connected
                onConnectAgentButtonPressed:
                    _isAgentConnected ? null : _connectAgent,
              ),
            ],
          ),
        ),
      ),
    );
  }
}
```

```dart title="lib/meeting_controls.dart"
import 'package:flutter/material.dart';

class MeetingControls extends StatelessWidget {
  final void Function() onToggleMicButtonPressed;
  final void Function() onLeaveButtonPressed;
  final void Function()? onConnectAgentButtonPressed;

  const MeetingControls({
    super.key,
    required this.onToggleMicButtonPressed,
    required this.onLeaveButtonPressed,
    required this.onConnectAgentButtonPressed,
  });

  @override
  Widget build(BuildContext context) {
    return Row(
      mainAxisAlignment: MainAxisAlignment.spaceEvenly,
      children: [
        ElevatedButton(
          onPressed: onLeaveButtonPressed,
          child: const Text('Leave'),
        ),
        ElevatedButton(
          onPressed: onToggleMicButtonPressed,
          child: const Text('Toggle Mic'),
        ),
        ElevatedButton(
          onPressed: onConnectAgentButtonPressed,
          child: const Text('Connect Agent'),
        ),
      ],
    );
  }
}
```

## 2. Creating the AI Agent from Dashboard (No-Code)

You can create and configure a powerful AI agent directly from the VideoSDK dashboard.
### Step 1: Create Your Agent
First, follow our detailed guide to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)**. This will walk you through creating the agent's persona, configuring its pipeline (Realtime or Cascading), and testing it directly from the dashboard.
### Step 2: Get Agent and Version ID
Once your agent is created, you need its `agentId` and `versionId` to connect it to your frontend application.

1. After creating your agent, open the agent's page and find the JSON editor on the right side. Copy the `agentId`.
2. To get the `versionId`, click the three dots beside the Deploy button and select "Version History". Use the copy button next to the version you want to copy its version ID.

![Get agentId and versionId](https://strapi.videosdk.live/uploads/agent_version_id_0f8b59830a.png)
### Step 3: Configure IDs in Frontend
Now, update your `lib/api_call.dart` file with these IDs.

```dart title="lib/api_call.dart"
const token = 'your_videosdk_auth_token_here';
const agentId = 'paste_your_agent_id_here';
const versionId = 'paste_your_version_id_here';
```

## 3. Run the Application
### Step 1: Run the Frontend
Once you have completed all the steps above, start your Flutter application:

```bash
flutter run
```
### Step 2: Connect and Interact
1. **Join the meeting from the Flutter app:** - Click the "Join Meeting" button. - Allow microphone permissions when prompted. 2. **Connect the agent:** - Once you join, click the "Connect Agent" button. - You should see a confirmation that the agent was connected. - The AI agent will join the meeting and greet you. 3. **Start playing:** - Interact with your AI agent using your microphone. ## Troubleshooting ### Common Issues: 1. **Agent not joining:** - Check that the `roomId`, `agentId`, and `versionId` are correctly set. - Verify your VideoSDK token is valid and has the necessary permissions. 2. **Audio not working:** - Check device permissions for microphone access. 3. **"Failed to connect agent" error:** - Verify your `agentId` and `versionId` are correct. - Check the debug console for any network errors. 4. **Flutter build issues:** - Ensure your Flutter version is compatible. - Try cleaning the build: `flutter clean`. - Delete `pubspec.lock` and run `flutter pub get`. --- --- title: Agent Runtime with iOS hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using an iOS frontend and a no-code agent from the dashboard. sidebar_label: With iOS pagination_label: Agent Runtime with iOS keywords: - ai agent - no-code - voice interaction - real-time communication - ios sdk - swiftui image: img/videosdklive-thumbnail.jpg sidebar_position: 2 slug: with-ios --- import Step from '@site/src/components/Step' # Agent Runtime with iOS VideoSDK empowers you to integrate an AI voice agent into your iOS app within minutes. This guide shows you how to connect an iOS (SwiftUI) frontend with an AI agent created and configured entirely from the VideoSDK dashboard. ## Prerequisites - macOS with Xcode 15.0+ - iOS 13.0+ deployment target - Valid VideoSDK [Account](https://app.videosdk.live/) - Familiarity with creating a no-code voice agent. 
If you're new to this, please follow our guide on how to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)** first.

:::important
You need a VideoSDK account to generate a token and an agent from the dashboard.
:::
### Step 1: Clone the sample project
Clone the repository to your local environment, then move into the iOS quickstart directory inside it.

```bash
git clone https://github.com/videosdk-live/agents-quickstart.git
cd agents-quickstart/mobile-quickstarts/ios/
```
### Step 2: Environment Configuration
### Create a Meeting Room

Create a meeting room using the VideoSDK API:

```bash
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_VIDEOSDK_AUTH_TOKEN" \
  -H "Content-Type: application/json"
```

Use the returned `roomId` in your configuration files.

### Configuration Files

Update the following files with your credentials. The Agent and Version IDs will be retrieved in a later step.

**MeetingViewController.swift** (line 14):

```swift
var token = "YOUR_VIDEOSDK_AUTH_TOKEN" // Add Your token here
var agentId = "YOUR_AGENT_ID"
var versionId = "YOUR_VERSION_ID"
```

**JoinScreenView.swift** (line 13):

```swift
let meetingId: String = "YOUR_MEETING_ID"
```
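If you would rather script the room creation than run `curl` by hand, the same request can be sketched in Python. This is a minimal sketch using only the standard library. The helper names (`build_create_room_request`, `extract_room_id`) are hypothetical, and the assumption that `roomId` is a top-level key of the JSON response is inferred from the text above rather than confirmed.

```python
import json
import urllib.request

# Endpoint taken from the curl command above.
ROOMS_URL = "https://api.videosdk.live/v2/rooms"


def build_create_room_request(token):
    """Build (but do not send) the POST request that creates a room."""
    return urllib.request.Request(
        ROOMS_URL,
        method="POST",
        headers={"Authorization": token, "Content-Type": "application/json"},
    )


def extract_room_id(response_body):
    """Return the roomId from a JSON response body.

    Assumption: roomId is a top-level key in the response.
    """
    payload = json.loads(response_body)
    room_id = payload.get("roomId")
    if not room_id:
        raise ValueError("response did not contain a roomId")
    return room_id


# To actually send the request (requires a valid token and network access):
# with urllib.request.urlopen(build_create_room_request("YOUR_TOKEN")) as resp:
#     print(extract_room_id(resp.read().decode("utf-8")))
```

Keeping request construction separate from sending makes it easy to check the headers before any network call is made.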
### Step 3: iOS Frontend Modifications
### Step 1: Add Connect Agent Button

In `MeetingView.swift`, add a button to connect the agent.

```swift title="MeetingView.swift"
// Add this button to your view hierarchy
Button(action: {
    meetingVC.connectAgent()
}) {
    Text("Connect Agent")
}
.disabled(meetingVC.isAgentConnected)
```

### Step 2: Implement Connect Logic

In `MeetingViewController.swift`, add the logic to call the dispatch API.

```swift title="MeetingViewController.swift"
// Add state to track if the agent is connected
@Published var isAgentConnected = false

// ...

func connectAgent() {
    guard let url = URL(string: "https://api.videosdk.live/v2/agent/general/dispatch") else { return }

    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.setValue(token, forHTTPHeaderField: "Authorization")

    let body: [String: Any] = [
        "agentId": agentId,
        "meetingId": room?.id ?? "",
        "versionId": versionId
    ]
    request.httpBody = try? JSONSerialization.data(withJSONObject: body)

    URLSession.shared.dataTask(with: request) { data, response, error in
        if let error = error {
            print("Connect error: \(error.localizedDescription)")
            return
        }
        if let httpResponse = response as? HTTPURLResponse, httpResponse.statusCode == 200 {
            // Publish the state change on the main thread so SwiftUI can update
            DispatchQueue.main.async {
                self.isAgentConnected = true
                print("Agent connected successfully")
            }
        } else {
            print("Failed to connect agent")
        }
    }.resume()
}
```
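The dispatch call is a plain HTTP `POST`, so it can be tried outside the app before wiring up the button. The following is a minimal Python sketch that mirrors the URL, headers, and JSON body from the Swift code above. The function name `build_dispatch_request` is hypothetical, and rejecting empty IDs up front is a design choice of this sketch, not documented API behavior.

```python
import json
import urllib.request

# URL taken from the Swift connectAgent() example above.
DISPATCH_URL = "https://api.videosdk.live/v2/agent/general/dispatch"


def build_dispatch_request(token, agent_id, meeting_id, version_id):
    """Build (but do not send) the request that dispatches an agent to a meeting."""
    body = {"agentId": agent_id, "meetingId": meeting_id, "versionId": version_id}
    # Fail early instead of sending an empty ID to the API.
    empty = [key for key, value in body.items() if not value]
    if empty:
        raise ValueError("missing values for: " + ", ".join(empty))
    return urllib.request.Request(
        DISPATCH_URL,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", "Authorization": token},
    )


# To send it (requires valid credentials and network access):
# with urllib.request.urlopen(build_dispatch_request(token, a, m, v)) as resp:
#     print(resp.status)
```

Unlike the Swift snippet, which falls back to an empty `meetingId` when the room is not yet available, this sketch raises an error in that case so the problem surfaces before the request is made.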
### Step 4: Creating the AI Agent from Dashboard (No-Code)
### Step 1: Create Your Agent

First, follow our detailed guide to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)**. This will walk you through creating the agent's persona, configuring its pipeline (Realtime or Cascading), and testing it directly from the dashboard.

### Step 2: Get Agent and Version ID

Once your agent is created, you need its `agentId` and `versionId` to connect it to your frontend application.

1. After creating your agent, open the agent's page and find the JSON editor on the right side. Copy the `agentId`.
2. To get the `versionId`, click the three dots beside the Deploy button and select "Version History". Use the copy button next to the version you want to copy its version ID.

![Get agentId and versionId](https://strapi.videosdk.live/uploads/agent_version_id_0f8b59830a.png)

### Step 3: Configure IDs in Frontend

Now, update your `MeetingViewController.swift` file with these IDs.

```swift title="MeetingViewController.swift"
var agentId = "paste_your_agent_id_here"
var versionId = "paste_your_version_id_here"
```
### Step 5: Run the iOS Frontend
1. **Open Xcode:**

   ```bash
   open videosdk-agents-quickstart-ios.xcodeproj
   ```

2. **Configure your development team:**
   - Select the project in Xcode
   - Go to "Signing & Capabilities"
   - Select your development team

3. **Build and run:**
   - Select your target device or simulator
   - Press `Cmd + R` to build and run
### Step 6: Connect and Interact
1. Join the meeting from the app and allow microphone permissions. 2. When you join, click the "Connect Agent" button to call the agent into the meeting. 3. Talk to the agent in real time. ## Troubleshooting ### Common Issues 1. **Build Errors:** - Ensure Xcode 15.0+ is installed - Check iOS deployment target (13.0+) - Verify VideoSDK package dependency 2. **Authentication Issues:** - Verify `VIDEOSDK_AUTH_TOKEN` in `MeetingViewController.swift` - Check token permissions include `allow_join` 3. **Meeting Connection Issues:** - Ensure `YOUR_MEETING_ID` is correct - Verify network connectivity - Check VideoSDK account status 4. **AI Agent Issues:** - Verify `agentId` and `versionId` are set correctly - Check for errors in the Xcode console when connecting the agent. --- --- title: Agent Runtime with React Native hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using a React Native frontend and a no-code agent from the dashboard. sidebar_label: With React Native pagination_label: Agent Runtime with React Native keywords: - ai agent - no-code - voice interaction - real-time communication - react native sdk image: img/videosdklive-thumbnail.jpg sidebar_position: 2 slug: with-react-native --- import Step from '@site/src/components/Step' import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Agent Runtime with React Native VideoSDK empowers you to integrate an AI voice agent into your React Native app (Android/iOS) within minutes. This guide shows you how to connect a React Native frontend with an AI agent created and configured entirely from the VideoSDK dashboard. ## Prerequisites - VideoSDK Developer Account (get token from the [dashboard](https://app.videosdk.live/api-keys)) - Node.js and a working React Native environment (Android Studio and/or Xcode) - Familiarity with creating a no-code voice agent. 
If you're new to this, please follow our guide on how to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)** first.

:::important
You need a VideoSDK token and an agent. Both can be created from the dashboard.
:::

## Project Structure

First, create an empty project folder for the React Native frontend in your preferred location using `mkdir folder_name`. Your final project structure should look like this:

```jsx title="Directory Structure"
root
├── android/
├── ios/
├── App.js
├── constants.js
└── index.js
```

You will work on:

- `android/`: Contains the Android-specific project files.
- `ios/`: Contains the iOS-specific project files.
- `App.js`: The main React Native component, containing the UI and meeting logic.
- `constants.js`: Stores the token, meetingId, and agent credentials for the frontend.
- `index.js`: The entry point of the React Native application, where VideoSDK is registered.

## Building the React Native Frontend
### Step 1: Create App and Install SDKs
Create a React Native app and install the VideoSDK React Native SDK:

```bash
npx react-native init videosdkAiAgentRN
cd videosdkAiAgentRN

# Install VideoSDK
npm install "@videosdk.live/react-native-sdk"
```
### Step 2: Configure the Project
#### Android Setup

```xml title="android/app/src/main/AndroidManifest.xml"
```

```java title="android/app/build.gradle"
dependencies {
  implementation project(':rnwebrtc')
}
```

```gradle title="android/settings.gradle"
include ':rnwebrtc'
project(':rnwebrtc').projectDir = new File(rootProject.projectDir, '../node_modules/@videosdk.live/react-native-webrtc/android')
```

```java title="MainApplication.kt"
import live.videosdk.rnwebrtc.WebRTCModulePackage

class MainApplication : Application(), ReactApplication {
  override val reactNativeHost: ReactNativeHost =
    object : DefaultReactNativeHost(this) {
      override fun getPackages(): List<ReactPackage> {
        val packages = PackageList(this).packages.toMutableList()
        // Register the VideoSDK WebRTC module alongside the autolinked packages
        packages.add(WebRTCModulePackage())
        return packages
      }
      // ...
    }
}
```

```java title="android/gradle.properties"
# This one fixes a weird WebRTC runtime problem on some devices.
android.enableDexingArtifactTransform.desugaring=false
```

```java title="android/app/proguard-rules.pro"
-keep class org.webrtc.** { *; }
```

```java title="android/build.gradle"
buildscript {
  ext {
    minSdkVersion = 23
  }
}
```

#### iOS Setup

To update CocoaPods, you can reinstall the gem using the following command:

```gem
$ sudo gem install cocoapods
```

```sh title="ios/Podfile"
pod 'react-native-webrtc', :path => '../node_modules/@videosdk.live/react-native-webrtc'
```

Set the platform field in the Podfile to 12.0 or above, because react-native-webrtc doesn't support iOS versions earlier than 12.0. Update the line to `platform :ios, '12.0'`.

After updating the version, install the pods by running the following command:

```sh
pod install
```

Add the following lines to your `Info.plist` file (located at `<project folder>/ios/<project name>/Info.plist`):

```html title="ios/MyApp/Info.plist"
<key>NSCameraUsageDescription</key>
<string>Camera permission description</string>
<key>NSMicrophoneUsageDescription</key>
<string>Microphone permission description</string>
```
### Step 3: Register Service and Configure
Register the VideoSDK services in your root `index.js` file so they are initialized when the app starts.

```js title="index.js"
import { AppRegistry } from "react-native";
import App from "./App";
import { name as appName } from "./app.json";
import { register } from "@videosdk.live/react-native-sdk";

register();
AppRegistry.registerComponent(appName, () => App);
```

Create a `constants.js` file to store your token, meeting ID, and agent credentials.

```js title="constants.js"
export const token = "YOUR_VIDEOSDK_AUTH_TOKEN";
export const meetingId = "YOUR_MEETING_ID";
export const name = "User Name";
export const agentId = "YOUR_AGENT_ID";
export const versionId = "YOUR_VERSION_ID";
```
### Step 4: Build UI and wire up MeetingProvider
```js title="App.js" import React, { useState } from 'react'; import { SafeAreaView, TouchableOpacity, Text, View, FlatList, Alert, } from 'react-native'; import { MeetingProvider, useMeeting, } from '@videosdk.live/react-native-sdk'; import { meetingId, token, name, agentId, versionId } from './constants'; const Button = ({ onPress, buttonText, backgroundColor }) => { return ( {buttonText} ); }; function ControlsContainer({ join, leave, toggleMic }) { const [connected, setConnected] = useState(false); const connectAgent = async () => { try { const response = await fetch("https://api.videosdk.live/v2/agent/general/dispatch", { method: "POST", headers: { "Content-Type": "application/json", Authorization: token, }, body: JSON.stringify({ agentId: agentId, meetingId: meetingId, versionId: versionId }), }); if (response.ok) { Alert.alert("Agent connected successfully!"); setConnected(true); } else { Alert.alert("Failed to connect agent."); } } catch (error) { console.error("Error connecting agent:", error); Alert.alert("An error occurred while connecting the agent."); } }; return (
```
### Step 3: Configure the Frontend
Create a meeting room using the VideoSDK API:

```bash
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_JWT_TOKEN_HERE" \
  -H "Content-Type: application/json"
```

Copy the `roomId` from the response and configure it in `config.js`. You will get the Agent and Version IDs in the next section.

```js title="config.js"
TOKEN = "your_videosdk_auth_token_here";
ROOM_ID = "YOUR_MEETING_ID";
AGENT_ID = "YOUR_AGENT_ID";
VERSION_ID = "YOUR_VERSION_ID";
```
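A common cause of a failed connection is a configuration value that still holds its placeholder text. As a rough illustration, the check below flags values that look like the placeholders used in this guide. The function name `find_unset_values` and the prefix list are assumptions made for this sketch, based only on the placeholder strings shown above.

```python
# Prefixes of the placeholder strings used in this guide, compared case-insensitively.
# Assumption: real tokens and IDs never start with these prefixes.
PLACEHOLDER_PREFIXES = ("your_", "paste_your_")


def find_unset_values(config):
    """Return the names of config entries that are empty or still placeholders."""
    return [
        name
        for name, value in config.items()
        if not value or value.lower().startswith(PLACEHOLDER_PREFIXES)
    ]


config = {
    "TOKEN": "your_videosdk_auth_token_here",
    "ROOM_ID": "abcd-efgh-ijkl",
    "AGENT_ID": "YOUR_AGENT_ID",
    "VERSION_ID": "",
}
print(find_unset_values(config))
```

A check like this only catches untouched placeholders. It cannot tell whether a filled-in ID actually exists in your dashboard.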
### Step 4: Implement Meeting Logic
In `index.js`, retrieve DOM elements, declare variables, and add the core meeting functionality, including the logic to connect the agent.

```js title="index.js"
// Get elements from the DOM
const leaveButton = document.getElementById("leaveBtn");
const toggleMicButton = document.getElementById("toggleMicBtn");
const createButton = document.getElementById("createMeetingBtn");
const connectAgentButton = document.getElementById("connectAgentBtn");
const audioContainer = document.getElementById("audioContainer");
const textDiv = document.getElementById("textDiv");

// Declare variables
let meeting = null;
let meetingId = "";
let isMicOn = false;

// Join Agent Meeting button event listener
createButton.addEventListener("click", async () => {
  document.getElementById("join-screen").style.display = "none";
  textDiv.textContent = "Please wait, we are joining the meeting";
  meetingId = ROOM_ID;
  initializeMeeting();
});

// Initialize meeting
function initializeMeeting() {
  window.VideoSDK.config(TOKEN);

  meeting = window.VideoSDK.initMeeting({
    meetingId: meetingId,
    name: "C.V.Raman",
    micEnabled: true,
    webcamEnabled: false,
  });

  meeting.join();

  meeting.localParticipant.on("stream-enabled", (stream) => {
    if (stream.kind === "audio") {
      setAudioTrack(stream, meeting.localParticipant, true);
    }
  });

  meeting.on("meeting-joined", () => {
    textDiv.textContent = null;
    document.getElementById("grid-screen").style.display = "block";
    document.getElementById("meetingIdHeading").textContent = `Meeting Id: ${meetingId}`;
  });

  meeting.on("meeting-left", () => {
    audioContainer.innerHTML = "";
  });

  meeting.on("participant-joined", (participant) => {
    // Attach the audio element to the DOM first so that setAudioTrack()
    // can find it by id when the first audio stream arrives.
    let audioElement = createAudioElement(participant.id);
    audioContainer.appendChild(audioElement);

    participant.on("stream-enabled", (stream) => {
      if (stream.kind === "audio") {
        setAudioTrack(stream, participant, false);
      }
    });
  });

  meeting.on("participant-left", (participant) => {
    let aElement = document.getElementById(`a-${participant.id}`);
    if (aElement) aElement.remove();
  });
}

// Create audio elements for participants
function createAudioElement(pId) {
  let audioElement = document.createElement("audio");
  audioElement.setAttribute("autoPlay", "false");
  audioElement.setAttribute("playsInline", "true");
  audioElement.setAttribute("controls", "false");
  audioElement.setAttribute("id", `a-${pId}`);
  audioElement.style.display = "none";
  return audioElement;
}

// Set audio track
function setAudioTrack(stream, participant, isLocal) {
  if (stream.kind === "audio") {
    if (isLocal) {
      isMicOn = true;
    } else {
      const audioElement = document.getElementById(`a-${participant.id}`);
      if (audioElement) {
        const mediaStream = new MediaStream();
        mediaStream.addTrack(stream.track);
        audioElement.srcObject = mediaStream;
        audioElement.play().catch((err) => console.error("audioElem.play() failed", err));
      }
    }
  }
}

// Implement controls
leaveButton.addEventListener("click", async () => {
  meeting?.leave();
  document.getElementById("grid-screen").style.display = "none";
  document.getElementById("join-screen").style.display = "block";
});

toggleMicButton.addEventListener("click", async () => {
  if (isMicOn) meeting?.muteMic();
  else meeting?.unmuteMic();
  isMicOn = !isMicOn;
});

connectAgentButton.addEventListener("click", async () => {
  try {
    const response = await fetch("https://api.videosdk.live/v2/agent/general/dispatch", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: TOKEN,
      },
      body: JSON.stringify({ agentId: AGENT_ID, meetingId: ROOM_ID, versionId: VERSION_ID }),
    });

    if (response.ok) {
      alert("Agent connected successfully!");
      connectAgentButton.style.display = "none";
    } else {
      alert("Failed to connect agent.");
    }
  } catch (error) {
    console.error("Error connecting agent:", error);
    alert("An error occurred while connecting the agent.");
  }
});
```

## Creating the AI Agent from Dashboard (No-Code)

You can create and configure a powerful AI agent directly from the VideoSDK dashboard.
### Step 1: Create Your Agent
First, follow our detailed guide to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)**. This will walk you through creating the agent's persona, configuring its pipeline (Realtime or Cascading), and testing it directly from the dashboard.
### Step 2: Get Agent and Version ID
Once your agent is created, you need its `agentId` and `versionId` to connect it to your frontend application.

1. After creating your agent, open the agent's page and find the JSON editor on the right side. Copy the `agentId`.
2. To get the `versionId`, click the three dots beside the Deploy button and select "Version History". Use the copy button next to the version you want to copy its version ID.

![Get agentId and versionId](https://strapi.videosdk.live/uploads/agent_version_id_0f8b59830a.png)
### Step 3: Configure IDs in Frontend
Now, update your `config.js` file with these IDs.

```js title="config.js"
TOKEN = "your_videosdk_auth_token_here";
ROOM_ID = "YOUR_MEETING_ID";
AGENT_ID = "paste_your_agent_id_here";
VERSION_ID = "paste_your_version_id_here";
```

## Run the Application
### Step 1: Start the Frontend
Once you have completed all the steps, serve your frontend files:

```bash
# Using Python's built-in server
python3 -m http.server 8000

# Or using Node.js http-server
npx http-server -p 8000
```

Open `http://localhost:8000` in your web browser.
### Step 2: Connect and Interact
1. **Join the meeting from the frontend:** - Click the "Join Agent Meeting" button in your browser. - Allow microphone permissions when prompted. 2. **Connect the agent:** - Once you join, click the "Connect Agent" button. - You should see an alert confirming the agent was connected. - The AI agent will join the meeting and greet you. 3. **Start playing:** - Interact with your AI agent using your microphone. ## Final Output You have completed the implementation of an AI agent with real-time voice interaction using VideoSDK and a no-code agent from the dashboard. ## Troubleshooting ### Common Issues: 1. **Agent not joining:** - Check that the `ROOM_ID`, `AGENT_ID`, and `VERSION_ID` are correctly set in `config.js`. - Verify your VideoSDK token is valid and has the necessary permissions. 2. **Audio not working:** - Check browser permissions for microphone access. 3. **"Failed to connect agent" error:** - Verify your `AGENT_ID` and `VERSION_ID` are correct. - Check the browser's developer console for any network errors. --- --- title: Agent Runtime with React hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using React frontend and a no-code backend. sidebar_label: With React pagination_label: Agent Runtime with React keywords: - ai agent - no-code - voice interaction - real-time communication - react sdk image: img/videosdklive-thumbnail.jpg sidebar_position: 2 slug: with-react --- import Step from '@site/src/components/Step' # Agent Runtime with React VideoSDK empowers you to seamlessly integrate AI agents with real-time voice interaction into your React application within minutes. This guide shows you how to connect a React frontend with an AI agent created and configured entirely from the VideoSDK dashboard. 
## Prerequisites

Before proceeding, ensure that your development environment meets the following requirements:

- VideoSDK Developer Account (if you don't have one, sign up on the **[VideoSDK Dashboard](https://app.videosdk.live/)**)
- Node.js installed on your device
- Familiarity with creating a no-code voice agent. If you're new to this, please follow our guide on how to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)** first.

:::important
You need a VideoSDK account to generate a token and an agent from the dashboard. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token.
:::

## Project Structure

Your project structure should look like this.

```jsx title="Project Structure"
root
├── node_modules
├── public
├── src
│    ├── config.js
│    ├── App.js
│    └── index.js
└── .env
```

You will be working on the following files:

- `App.js`: Creates a basic UI for joining the meeting.
- `config.js`: Stores the token, room ID, and agent credentials.
- `index.js`: The entry point of your React application.

## Part 1: React Frontend
### Step 1: Getting Started with the Code!
#### Create new React App

Create a new React app using the command below.

```bash
$ npx create-react-app videosdk-ai-agent-react-app
```

#### Install VideoSDK

Install the VideoSDK React SDK using the npm command below. Make sure you are in your React app directory before you run it.

```bash
$ npm install "@videosdk.live/react-sdk"
```
### Step 2: Configure Environment and Credentials
Create a meeting room using the VideoSDK API:

```bash
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_JWT_TOKEN_HERE" \
  -H "Content-Type: application/json"
```

Copy the `roomId` from the response and configure it in `src/config.js`. You will get the Agent and Version IDs in the next section.

```js title="src/config.js"
export const TOKEN = "YOUR_VIDEOSDK_AUTH_TOKEN";
export const ROOM_ID = "YOUR_MEETING_ID";
export const AGENT_ID = "YOUR_AGENT_ID";
export const VERSION_ID = "YOUR_VERSION_ID";
```
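Because the `Authorization` header carries a JWT, you can inspect the token locally before pasting it into your config. The sketch below decodes the payload segment without verifying the signature and reports two likely problems. The helper names are hypothetical, `exp` is the standard JWT expiry claim, and the assumption that permissions live in a `permissions` array containing `allow_join` is based on the troubleshooting notes in these guides rather than on a confirmed token schema.

```python
import base64
import json
import time


def decode_jwt_payload(token):
    """Decode the payload segment of a JWT without verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    segment = parts[1]
    # base64url segments omit padding, so restore it before decoding
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def token_problems(token, now=None):
    """Return a list of human-readable problems found in the token payload."""
    payload = decode_jwt_payload(token)
    now = time.time() if now is None else now
    problems = []
    if "exp" in payload and payload["exp"] <= now:
        problems.append("token has expired")
    # Assumption: permissions are stored as a list under the "permissions" claim.
    if "allow_join" not in payload.get("permissions", []):
        problems.append("permissions do not include allow_join")
    return problems
```

This only reads the claims. It says nothing about whether the signature is valid, so a token that passes this check can still be rejected by the API.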
### Step 3: Design the user interface (UI)
Create the main App component with audio-only interaction in `src/App.js`. This includes the "Connect Agent" button. ```js title="src/App.js" import React, { useEffect, useRef, useState } from "react"; import { MeetingProvider, MeetingConsumer, useMeeting, useParticipant } from "@videosdk.live/react-sdk"; import { TOKEN, ROOM_ID, AGENT_ID, VERSION_ID } from "./config"; function ParticipantAudio({ participantId }) { const { micStream, micOn, isLocal, displayName } = useParticipant(participantId); const audioRef = useRef(null); useEffect(() => { if (!audioRef.current) return; if (micOn && micStream) { const mediaStream = new MediaStream(); mediaStream.addTrack(micStream.track); audioRef.current.srcObject = mediaStream; audioRef.current.play().catch(() => {}); } else { audioRef.current.srcObject = null; } }, [micStream, micOn]); return (

Participant: {displayName} | Mic: {micOn ? "ON" : "OFF"}

); } function Controls() { const { leave, toggleMic } = useMeeting(); const [connected, setConnected] = useState(false); const connectAgent = async () => { try { const response = await fetch("https://api.videosdk.live/v2/agent/general/dispatch", { method: "POST", headers: { "Content-Type": "application/json", Authorization: TOKEN, }, body: JSON.stringify({ agentId: AGENT_ID, meetingId: ROOM_ID, versionId: VERSION_ID }), }); if (response.ok) { alert("Agent connected successfully!"); setConnected(true); } else { alert("Failed to connect agent."); } } catch (error) { console.error("Error connecting agent:", error); alert("An error occurred while connecting the agent."); } }; return (
{!connected && }
); } function MeetingView({ meetingId, onMeetingLeave }) { const [joined, setJoined] = useState(null); const { join, participants } = useMeeting({ onMeetingJoined: () => setJoined("JOINED"), onMeetingLeft: onMeetingLeave, }); const joinMeeting = () => { setJoined("JOINING"); join(); }; return (

Meeting Id: {meetingId}

{joined === "JOINED" ? (
{[...participants.keys()].map((pid) => ( ))}
) : joined === "JOINING" ? (

Joining the meeting...

) : ( )}
); } export default function App() { const [meetingId] = useState(ROOM_ID); const onMeetingLeave = () => { // no-op; simple sample }; return ( {() => } ); } ```
## Part 2: Creating the AI Agent from Dashboard (No-Code) You can create and configure a powerful AI agent directly from the VideoSDK dashboard.
### Step 1: Create Your Agent
First, follow our detailed guide to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)**. This will walk you through creating the agent's persona, configuring its pipeline (Realtime or Cascading), and testing it directly from the dashboard.
### Step 2: Get Agent and Version ID
Once your agent is created, you need its `agentId` and `versionId` to connect it to your frontend application.

1. After creating your agent, open the agent's page and find the JSON editor on the right side. Copy the `agentId`.
2. To get the `versionId`, click the three dots beside the Deploy button and select "Version History". Use the copy button next to the version you want to copy its version ID.

![Get agentId and versionId](https://strapi.videosdk.live/uploads/agent_version_id_0f8b59830a.png)
### Step 3: Configure IDs in Frontend
Now, update your `src/config.js` file with these IDs.

```js title="src/config.js"
export const TOKEN = "your_videosdk_auth_token_here";
export const ROOM_ID = "YOUR_MEETING_ID";
export const AGENT_ID = "paste_your_agent_id_here";
export const VERSION_ID = "paste_your_version_id_here";
```

## Part 3: Run the Application
### Step 1: Run the Frontend
Once you have completed all the steps above, start your React application:

```bash
# Install dependencies
npm install

# Start the development server
npm start
```

Open `http://localhost:3000` in your web browser.
### Step 2: Connect and Interact
1. **Join the meeting from the React app:** - Click the "Join" button in your browser - Allow microphone permissions when prompted 2. **Connect the agent:** - Once you join, click the "Connect Agent" button. - You should see an alert confirming the agent was connected. - The AI agent will join the meeting and greet you. 3. **Start playing:** - Interact with your AI agent using your microphone. ## Final Output You have completed the implementation of an AI agent with real-time voice interaction using VideoSDK and a no-code agent from the dashboard in React. ## Troubleshooting ### Common Issues: 1. **Agent not joining:** - Check that the `ROOM_ID`, `AGENT_ID`, and `VERSION_ID` are correctly set in `src/config.js`. - Verify your VideoSDK token is valid and has the necessary permissions. 2. **Audio not working:** - Check browser permissions for microphone access. 3. **"Failed to connect agent" error:** - Verify your `AGENT_ID` and `VERSION_ID` are correct. - Check the browser's developer console for any network errors. 4. **React build issues:** - Ensure Node.js version is compatible - Try clearing npm cache: `npm cache clean --force` - Delete `node_modules` and reinstall: `rm -rf node_modules && npm install` --- --- title: SIP hide_title: false hide_table_of_contents: false description: " A framework for creating AI-powered voice agents using VideoSDK and various SIP providers" pagination_label: "VideoSDK AI SIP Framework" keywords: - AI Agent SDK - VideoSDK Agents - SIP - Trunking - Python SDK - Voice AI - Real-time Communication - AI Integration - VideoSDK Cloud - Development sidebar_label: SIP slug: sip --- import Step from '@site/src/components/Step' import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # VideoSDK AI SIP Framework A production-ready framework for creating AI-powered voice agents using VideoSDK and various SIP providers (e.g., Twilio). 
This framework enables you to build and deploy sophisticated conversational AI agents that can handle both inbound and outbound phone calls with natural language processing. ## How It Works The framework simplifies a complex process into a manageable workflow. Here’s a high-level overview of the architecture: 1. **Phone Call**: A user calls a phone number you have acquired from a SIP provider (like Twilio, Plivo, etc.). 2. **SIP Provider**: The provider receives the call and sends a webhook notification to your application server. 3. **Your Application Server**: This is the application you build using this framework. * It receives the webhook. * It uses the `SIPManager` to create a secure VideoSDK room for the call. * It launches your custom AI Agent. * It responds to the SIP provider with instructions (e.g., TwiML) to forward the call's audio into the VideoSDK room. 4. **VideoSDK & AI Agent**: Your AI Agent joins the room, receives the live audio from the phone call, processes it using your chosen AI models (for speech-to-text, language understanding, and text-to-speech), and responds in real-time to create a seamless, interactive conversation. --- ## Prerequisites Before you get started, ensure you have the following: ### System Requirements - **Python**: 3.11 or higher - **Network**: Public internet access for webhook delivery ### Required Credentials - **VideoSDK Credentials**: Sign up at [app.videosdk.live](https://app.videosdk.live/) to get your token and SIP credentials. ![VideoSDK SIP Credentials](https://strapi.videosdk.live/uploads/sip_dashboard_screenshot_8025aba2ec.png) - **SIP Provider Account**: Obtain provider-specific credentials. - **AI Model Provider**: An account with Google, OpenAI, or another supported provider. --- ## Get Started ### 1. 
Installation

Create and activate a virtual environment.

On macOS/Linux:

```bash
python3 -m venv venv
source venv/bin/activate
```

On Windows:

```bash
python3 -m venv venv
venv\Scripts\activate
```

Install the core framework:

```bash
pip install videosdk-plugins-sip
```

Install plugins for your chosen AI services (e.g., Google):

```bash
pip install videosdk-plugins-google
```

### 2. Environment Configuration

Your agent requires credentials for both VideoSDK and your chosen SIP provider. You can provide these through environment variables (recommended) or directly in your code.

Create a `.env` file in your project's root directory and add your credentials to it.

#### **VideoSDK Credentials (Required)**

These are essential for the framework to function.

```ini
VIDEOSDK_AUTH_TOKEN=your_videosdk_jwt_token
VIDEOSDK_SIP_USERNAME=your_videosdk_sip_username
VIDEOSDK_SIP_PASSWORD=your_videosdk_sip_password
```

#### **AI Model Credentials (Required)**

Add the API key for your chosen AI provider.

```ini
GOOGLE_API_KEY=your_google_api_key_here
```

#### **SIP Provider Credentials**

Fill in the details for the provider you will be using. The framework will automatically use the correct variables based on the `SIP_PROVIDER` you set. Get your credentials from the [Twilio console](https://console.twilio.com/dashboard).

```ini
SIP_PROVIDER=twilio
TWILIO_ACCOUNT_SID=ACxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TWILIO_AUTH_TOKEN=your_auth_token
TWILIO_PHONE_NUMBER=+1234567890
```

Alternatively, copy the example environment file and populate it with your credentials.
```bash cp env.example .env ``` Now, edit the `.env` file: ```ini # VideoSDK Configuration VIDEOSDK_AUTH_TOKEN=your_videosdk_jwt_token VIDEOSDK_SIP_USERNAME=your_videosdk_sip_username VIDEOSDK_SIP_PASSWORD=your_videosdk_sip_password # AI Model Configuration (Example for Google Gemini) GOOGLE_API_KEY=your_google_api_key # Provider Selection (currently, 'twilio' is supported) SIP_PROVIDER=twilio # Twilio Configuration TWILIO_ACCOUNT_SID=your_twilio_account_sid TWILIO_AUTH_TOKEN=your_twilio_auth_token TWILIO_PHONE_NUMBER=+1234567890 ``` ## AI Agent and SIP Setup Here’s how to structure your application.
### Step 1: Initialize the SIP Manager
The `create_sip_manager` function is the main entry point. It establishes the connection to your SIP provider by reading the environment variables you configured. ```python import os from dotenv import load_dotenv from videosdk.plugins.sip import create_sip_manager # Load variables from the .env file load_dotenv() # This function reads your .env variables and configures the correct provider sip_manager = create_sip_manager( provider=os.getenv("SIP_PROVIDER"), videosdk_token=os.getenv("VIDEOSDK_AUTH_TOKEN"), # The provider_config dictionary passes provider-specific environment variables. provider_config={ # Twilio "account_sid": os.getenv("TWILIO_ACCOUNT_SID"), "auth_token": os.getenv("TWILIO_AUTH_TOKEN"), "phone_number": os.getenv("TWILIO_PHONE_NUMBER"), } ) ```
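Because the snippet above passes the result of `os.getenv` straight through, any variable missing from `.env` reaches `create_sip_manager` as `None`. A minimal sketch of a pre-flight check, assuming the variable names from the `.env` example; `REQUIRED_VARS` and `missing_env_vars` are hypothetical helpers for illustration and are not part of the SIP framework:

```python
import os

# Variable names taken from the .env example in this guide.
# The helper below is illustrative only, not part of videosdk-plugins-sip.
REQUIRED_VARS = (
    "SIP_PROVIDER",
    "VIDEOSDK_AUTH_TOKEN",
    "TWILIO_ACCOUNT_SID",
    "TWILIO_AUTH_TOKEN",
    "TWILIO_PHONE_NUMBER",
)


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or blank in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]
```

One way to use it is to call `missing_env_vars()` right after `load_dotenv()` and stop startup with a clear message if the returned list is not empty.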
### Step 2: Define Your Agent's Pipeline
The pipeline defines which AI models your agent uses. This example uses Google's Gemini in a [Pipeline](https://docs.videosdk.live/ai_agents/core-components/pipeline) running in realtime mode. A Pipeline in cascading mode works as well.

```python
import os

from videosdk.agents import Pipeline
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig


def create_agent_pipeline():
    """Create the AI model pipeline for the agent."""
    model = GeminiRealtime(
        api_key=os.getenv("GOOGLE_API_KEY"),
        model="gemini-3.1-flash-live-preview",
        config=GeminiLiveConfig(
            voice="Leda",  # Choose your desired voice
            response_modalities=["AUDIO"],  # The agent should speak back
        ),
    )
    return Pipeline(llm=model)
```
### Step 3: Define Your Agent's Personality and Tools
The `Agent` class defines the system prompt (instructions), personality, and custom [function tools](https://docs.videosdk.live/ai_agents/core-components/agent) and [MCP Servers](https://docs.videosdk.live/ai_agents/mcp-integration) that your agent can use. ```python import asyncio from videosdk.agents import Agent, function_tool, JobContext from typing import Optional class SIPAIAgent(Agent): """An AI agent for handling voice calls.""" def __init__(self, ctx: Optional[JobContext] = None): super().__init__( instructions="You are a friendly and helpful voice assistant. Keep your responses concise.", tools=[self.end_call], # You can also integrate other function tools and MCP Servers here. ) self.ctx = ctx self.greeting_message = "Hello! Thank you for calling. How can I assist you today?" async def on_enter(self) -> None: pass async def greet_user(self) -> None: """Greets the user with the message defined above.""" await self.session.say(self.greeting_message) async def on_exit(self) -> None: pass ``` ## Server Setup and Deployment Your application must be accessible from the public internet so that your SIP provider can send it webhooks. You have two main options for this. For testing on your local machine, `ngrok` is the perfect tool. It creates a secure, public URL that tunnels directly to your local server. The `lifespan` manager in our example code handles this for you automatically. When you start the server, it will generate a unique URL and automatically configure the `SIPManager` with it. 
**Code Snippet (FastAPI Lifespan Manager):** ```python import os import logging from contextlib import asynccontextmanager from fastapi import FastAPI from pyngrok import ngrok logger = logging.getLogger(__name__) @asynccontextmanager async def lifespan(app: FastAPI): """Lifespan manager for FastAPI app startup and shutdown.""" port = int(os.getenv("PORT", 8000)) try: ngrok.kill() ngrok_auth_token = os.getenv("NGROK_AUTHTOKEN") if ngrok_auth_token: ngrok.set_auth_token(ngrok_auth_token) tunnel = ngrok.connect(port, "http") # The Base URL is generated here sip_manager.set_base_url(tunnel.public_url) logger.info(f"NGROK TUNNEL CREATED: {tunnel.public_url}") except Exception as e: logger.error(f"Failed to start ngrok tunnel: {e}") yield try: ngrok.kill() logger.info("Ngrok tunnel closed") except Exception as e: logger.error(f"Error closing ngrok tunnel: {e}") app = FastAPI(title="SIP AI Agent", lifespan=lifespan) ``` For a live application, you will deploy your code to a cloud server (e.g., AWS EC2, Google Cloud Run, Heroku) that has a permanent public IP address or domain name. In this case, you should **not** use the `ngrok` `lifespan` manager. Instead, set the base URL directly in your code. **Code Snippet (Cloud Server Setup):** ```python from fastapi import FastAPI # Your FastAPI app for production app = FastAPI(title="SIP AI Agent") # IMPORTANT: Set your server's public URL before starting the app. # This should be the actual domain where your service is hosted. PUBLIC_URL = "https://api.your-public-url.com" sip_manager.set_base_url(PUBLIC_URL) ``` :::note You must configure your SIP provider's webhook to point to `https://your-public-or-ngrok-url.com/webhook/incoming`. ::: ## API Endpoint Guide Your application server, powered by the `sip` framework, exposes a set of endpoints for controlling and monitoring calls. --- ### `POST /webhook/incoming` This is the **most important endpoint for handling inbound calls**. 
When a user calls your SIP provider's phone number, the provider sends an HTTP request (a webhook) to this URL. * **Purpose**: To serve as the primary entry point for all incoming phone calls. * **Provider Configuration**: You **must** configure this full URL in your SIP provider's dashboard for your phone number. * **Core Process**: 1. Receives the webhook from the SIP provider. 2. Creates a new VideoSDK room for the call. 3. Launches your `SIPAIAgent` in a separate process, which then waits in the room. 4. Responds to the provider with instructions (XML-based TwiML/ExoML) detailing how to forward the call's audio stream to the newly created room's SIP address. --- ### `POST /call/make` This endpoint allows you to **programmatically initiate an outbound call** from your agent to a user's phone number. ```bash # Replace with the destination phone number curl -X POST "http://localhost:8000/call/make?to_number=+1234567890" ``` * **Purpose**: To start new conversations with users. Ideal for automated reminders, lead qualification, or proactive support. * **Query Parameters**: | Parameter | Type | Description | Required | | :--- | :--- | :--- | :--- | | `to_number` | `string` | The full phone number to call, in E.164 format (e.g., `+15551234567`). | Yes | * **Core Process (Outbound Call Flow)**: 1. Your request hits the endpoint. 2. The `SIPManager` creates a VideoSDK room and immediately launches your `SIPAIAgent`. The agent then waits in the room. 3. The manager sends an API request to your SIP provider (e.g., Twilio), instructing it to call the `to_number`. 4. Crucially, it provides the SIP provider with a unique webhook URL for this specific call: `https:///sip/answer/{room_id}`. 5. When the user answers their phone, the SIP provider sends a webhook to that unique answer URL to connect the user to the waiting agent. --- ### `POST /sip/answer/{room_id}` This is an **internal-facing endpoint** designed to complete the outbound call loop. 
You will not call this endpoint directly. * **Purpose**: To serve as the dynamic "answer URL" for outbound calls. * **Path Parameters**: | Parameter | Type | Description | | :--- | :--- | :--- | | `room_id` | `string` | The unique ID of the VideoSDK room where the agent is waiting. | * **Core Process**: 1. This endpoint is called by the SIP provider *only after* the user answers an outbound call initiated by `/call/make`. 2. It uses the `room_id` to find the correct SIP address for the room where the agent is waiting. 3. It returns a simple TwiML/XML response that tells the provider how to bridge the just-answered call with the agent. --- ### `GET /sessions` A simple utility endpoint for **monitoring the health and status** of your service. * **Purpose**: To see how many calls are currently active. * **Core Process**: 1. Receives a simple `GET` request. 2. Checks the `SIPManager`'s internal state. 3. Returns a count of active sessions and a list of their corresponding room IDs. --- :::tip If you experience high latency when connecting a call, it may be due to a mismatch between the geographical region of your VideoSDK meeting server (which defaults to the nearest server region to you) and your SIP provider's region. To reduce latency, upgrade to an enterprise plan and set `VIDEOSDK_REGION=sip_provider_region` in your `.env` file for a low-latency experience. ::: --- --- title: Playground hide_title: false hide_table_of_contents: false description: "Test and interact with your VideoSDK AI agents in real-time using Playground mode. Learn how to enable the interactive testing environment for rapid development and debugging of voice AI agents." 
pagination_label: "Playground" keywords: - AI Agent SDK - VideoSDK Agents - Playground - Testing - Python SDK - Voice AI - Real-time Communication - AI Integration - VideoSDK Cloud - Development - Debugging image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Playground slug: playground ---

# Agents Playground

The Agents Playground provides an interactive testing environment where you can directly communicate with your AI agents during development. This feature enables rapid prototyping, testing, and debugging of your voice AI implementations without needing a separate client application.

## Overview

Playground mode creates a web-based interface that connects directly to your agent session, allowing you to:

- Test your agent in real time
- Demonstrate agent capabilities to stakeholders

## Enabling Playground Mode

To activate playground mode, set `playground=True` in the `RoomOptions` you pass to `JobContext`.

### Basic Implementation

```python
from videosdk.agents import RoomOptions, JobContext, WorkerJob


async def entrypoint(ctx: JobContext):
    # Your agent implementation here
    # This is where you create your pipeline, agent, and session
    pass


def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",
        name="Test Agent",
        playground=True,  # Enable playground mode
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
```

## Accessing the Playground

Once your agent session starts, the playground URL is displayed in your terminal:

```
Agent started in playground mode
Interact with agent here at: https://playground.videosdk.live?token={auth_token}&meetingId={meeting_id}
```

### URL Structure

The playground URL follows this format:

```
https://playground.videosdk.live?token={auth_token}&meetingId={meeting_id}
```

Where:

- `auth_token`: The `videosdk_auth` value provided in the session context or in the env file.
- `meeting_id`: The meeting ID specified in session context. **Note**: Playground mode is designed for development and testing purposes. For production deployments, ensure playground mode is disabled to maintain security and performance. --- --- title: AI Agent with Flutter - Quick Start hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using Flutter frontend and Python agent. sidebar_label: Flutter pagination_label: AI Agent with Flutter - Quick Start keywords: - ai agent - voice interaction - real-time communication - flutter sdk - python agent image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-flutter --- import AiAgentQuickStartFlutter from '@site/mdx/\_ai-agent-quick-start-flutter-v1.mdx'; --- --- title: AI Agent with iOS - Quick Start hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using iOS Swift frontend and Python backend. sidebar_label: iOS pagination_label: AI Agent with iOS - Quick Start keywords: - ai agent - voice interaction - real-time communication - ios sdk - python backend - swift image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-ios --- import AiAgentQuickStartiOS from '@site/mdx/\_ai-agent-quick-start-ios-v1.mdx'; --- --- title: AI Agent with IoT - Quick Start hide_title: false hide_table_of_contents: false description: Integrate a real-time AI agent with an ESP32 device using VideoSDK, enabling voice-based interaction through Google Gemini Live API. 
pagination_label: AI Agent with IoT - Quick Start keywords: - iot - esp32 - ai agent - videosdk - real-time communication image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-iot sidebar_label: Physical AI (IoT) --- import AiAgentQuickStartIoT from '@site/mdx/\_ai-agent-quick-start-iot-v1.mdx'; --- --- title: AI Agent with JavaScript - Quick Start hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using JavaScript frontend. sidebar_label: JavaScript pagination_label: AI Agent with JavaScript - Quick Start keywords: - ai agent - voice interaction - real-time communication - javascript sdk image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-js --- import AiAgentQuickStartJS from '@site/mdx/\_ai-agent-quick-start-js-v1.mdx'; --- --- title: AI Agent with React Native - Quick Start hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using a React Native frontend and Python backend. sidebar_label: React Native pagination_label: AI Agent with React Native - Quick Start keywords: - ai agent - voice interaction - real-time communication - react native sdk - python backend image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-react-native --- import AiAgentQuickStartReactNative from '@site/mdx/\_ai-agent-quick-start-react-native-v1.mdx'; --- --- title: AI Agent with React - Quick Start hide_title: false hide_table_of_contents: false description: VideoSDK enables the opportunity to integrate AI agents with real-time voice interaction using React frontend and Python backend. 
sidebar_label: React pagination_label: AI Agent with React - Quick Start keywords: - ai agent - voice interaction - real-time communication - react sdk - python backend image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-react --- import AiAgentQuickStartReact from '@site/mdx/\_ai-agent-quick-start-react-v1.mdx'; --- --- title: AI Agent with Unity - Quick Start hide_title: false hide_table_of_contents: false description: Integrate a real-time AI agent with Unity using VideoSDK, enabling voice-based interaction through Google Gemini Live API. sidebar_label: Unity pagination_label: AI Agent with Unity - Quick Start keywords: - unity - ai agent - videosdk - real-time communication image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: ai-agent-quickstart-unity --- import AiAgentQuickStartUnity from '@site/mdx/\_ai-agent-quick-start-unity-v1.mdx'; --- --- title: AI Telephony Agent Quick Start hide_title: false hide_table_of_contents: false description: "A comprehensive guide to creating a fully functional AI telephony agent using VideoSDK Agent SDK. Learn how to run the agent locally, connect it to the global telephone network using SIP, and enable it to handle both inbound and outbound phone calls." pagination_label: "AI Telephony Agent Quick Start" keywords: - AI Telephony Agent - Quick Start - VideoSDK Agents - AI Agent SDK - Python - SIP - Telephony - Phone Calls - Inbound Calls - Outbound Calls - Gemini - Google API - Voice Integration image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: AI Telephony Agent slug: ai-phone-agent-quick-start --- import TelephonyQuickStart from '@site/mdx/\_ai-telephony-agent-quick-start-v1.mdx'; --- --- title: AI Voice Agent Quick Start hide_title: false hide_table_of_contents: false description: "A step-by-step guide to quickly integrate an AI-powered voice agent into your VideoSDK meetings using the AI Agent SDK. 
Covers prerequisites, installation, custom agent creation, function tools, pipeline setup, and session management." pagination_label: "AI Voice Agent Quick Start" keywords: - AI Voice Agent - Quick Start - VideoSDK Agents - AI Agent SDK - Python - OpenAI - Gemini - Live API - Speech To Speech - Amazon Nova Sonic - AWS Nova Sonic - Function Tools - Realtime AI - Voice Integration - VideoSDK Meetings image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: AI Voice Agent slug: voice-agent-quick-start --- import Step from '@site/src/components/Step' import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import AllSDKCard from '@site/src/components/AllSDKCard' # AI Voice Agent Quick Start Get started with VideoSDK Agents in minutes. This guide covers both Realtime (speech-to-speech) and Cascaded (STT-LLM-TTS) pipeline implementations. ## Prerequisites Before you begin, ensure you have: - A VideoSDK authentication token (generate from [app.videosdk.live](https://app.videosdk.live)), follow to guide to [generate videosdk token](/ai_agents/authentication-and-token) - A VideoSDK meeting ID (you can generate one using the [Create Room API](https://docs.videosdk.live/api-reference/realtime-communication/create-room) or through the VideoSDK dashboard) - Python 3.12 or higher ## Understanding the Architecture Before diving into implementation, let's understand the two main pipeline architectures available: **Realtime** provides direct speech-to-speech processing with minimal latency: ![Realtime Architecture](https://cdn.videosdk.live/website-resources/docs-resources/videosdk_realtime_pipeline.png) The realtime processes audio directly through a unified model that handles: - **User Voice Input** → **Speech to Speech model** → **Agent Voice Output** This approach offers the fastest response times and is ideal for real-time conversations. 
**Cascade** processes audio through distinct stages for maximum control:

![Cascade Architecture](https://cdn.videosdk.live/website-resources/docs-resources/videosdk_casading_pipeline.png)

The cascade pipeline processes audio through three sequential stages:

- **User Voice Input** → **STT (Speech-to-Text)** → **LLM (Large Language Model)** → **TTS (Text-to-Speech)** → **Agent Voice Output**

This approach provides better control over each processing stage and supports more complex AI reasoning.

## Installation

Create and activate a virtual environment with Python 3.12 or higher.

On macOS/Linux:

```bash
python3.12 -m venv venv
source venv/bin/activate
```

On Windows:

```bash
python -m venv venv
venv\Scripts\activate
```

For the cascade pipeline:

```bash
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"
```

> Want to use a different provider? Check out our plugins for [STT](https://docs.videosdk.live/ai_agents/plugins/stt/openai), [LLM](https://docs.videosdk.live/ai_agents/plugins/llm/openai), and [TTS](https://docs.videosdk.live/ai_agents/plugins/tts/eleven-labs).

For the realtime pipeline:

```bash
pip install videosdk-agents

# Choose your real-time provider:

# For OpenAI
pip install "videosdk-plugins-openai"

# For Gemini (LiveAPI)
pip install "videosdk-plugins-google"

# For AWS Nova
pip install "videosdk-plugins-aws"
```

## Environment Setup

It's recommended to use environment variables for secure storage of API keys, secret tokens, and authentication tokens.
Create a `.env` file in your project root.

For the cascade pipeline:

```shell title=".env"
DEEPGRAM_API_KEY="Your Deepgram API Key"
OPENAI_API_KEY="Your OpenAI API Key"
ELEVENLABS_API_KEY="Your ElevenLabs API Key"
VIDEOSDK_AUTH_TOKEN="VideoSDK Auth token"
```

> **API Keys** - Get API keys from [Deepgram ↗](https://console.deepgram.com/), [OpenAI ↗](https://platform.openai.com/api-keys), [ElevenLabs ↗](https://elevenlabs.io/app/settings/api-keys), and the [VideoSDK Dashboard ↗](https://app.videosdk.live/api-keys). For the VideoSDK token, follow the guide to [generate a VideoSDK token](/ai_agents/authentication-and-token).

For the realtime pipeline:

```bash title=".env"
VIDEOSDK_AUTH_TOKEN="VideoSDK Auth token"
OPENAI_API_KEY="Your OpenAI API Key"

# For Google Live API
# GOOGLE_API_KEY="Google Live API Key"

# For AWS Nova API
# AWS_ACCESS_KEY_ID="AWS Key Id"
# AWS_SECRET_ACCESS_KEY="AWS Secret Key"
# AWS_DEFAULT_REGION="AWS Region"
```

> **API Keys** - Get API keys from [OpenAI ↗](https://platform.openai.com/api-keys), [Gemini ↗](https://aistudio.google.com/app/apikey), or [AWS Nova Sonic ↗](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html), and from the [VideoSDK Dashboard ↗](https://app.videosdk.live/api-keys). For the VideoSDK token, follow the guide to [generate a VideoSDK token](/ai_agents/authentication-and-token).
### Step 1: Creating a Custom Agent
First, let's create a custom voice agent by inheriting from the base `Agent` class.

For the cascade pipeline:

```python title="main.py"
import asyncio
import os

from videosdk.agents import Agent, AgentSession, Pipeline, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model
pre_download_model()


class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```

For the realtime pipeline:

```python title="main.py"
import asyncio
import os

from videosdk.agents import Agent, AgentSession, Pipeline, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig
from openai.types.beta.realtime.session import TurnDetection


class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```

This code defines a basic voice agent with:

- Custom instructions that define the agent's personality and capabilities
- An entry message (`on_enter`) spoken when the agent joins a meeting
- An exit message (`on_exit`) spoken when the agent leaves
### Step 2: Assembling and Starting the Agent Session
The pipeline connects your agent to an AI model. ```python title="main.py" async def start_session(context: JobContext): # Create agent agent = MyVoiceAgent() # Create pipeline pipeline = Pipeline( stt=DeepgramSTT(model="nova-2", language="en"), llm=OpenAILLM(model="gpt-4o"), tts=ElevenLabsTTS(model="eleven_flash_v2_5"), vad=SileroVAD(threshold=0.35), turn_detector=TurnDetector(threshold=0.8) ) session = AgentSession( agent=agent, pipeline=pipeline ) try: await context.connect() await session.start() # Keep the session running until manually terminated await asyncio.Event().wait() finally: # Clean up resources when done await session.close() await context.shutdown() def make_context() -> JobContext: room_options = RoomOptions( # room_id="YOUR_MEETING_ID", # Set to join a pre-created room; omit to auto-create name="VideoSDK Cascaded Agent", playground=True ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ```python title="main.py" async def start_session(context: JobContext): # Initialize Model model = OpenAIRealtime( model="gpt-realtime-2025-08-28", config=OpenAIRealtimeConfig( voice="alloy", # Available voices:alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, and verse modalities=["text", "audio"], turn_detection=TurnDetection( type="server_vad", threshold=0.5, prefix_padding_ms=300, silence_duration_ms=200, ) ) ) # Create pipeline pipeline = Pipeline( llm=model ) session = AgentSession( agent=MyVoiceAgent(), pipeline=pipeline ) try: await context.connect() await session.start() # Keep the session running until manually terminated await asyncio.Event().wait() finally: # Clean up resources when done await session.close() await context.shutdown() def make_context() -> JobContext: room_options = RoomOptions( # room_id="YOUR_MEETING_ID", # Set to join a pre-created room; omit to auto-create name="VideoSDK Realtime Agent", playground=True ) return 
JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ```
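Both variants above share one shutdown pattern: block on an `asyncio.Event` and release resources in `finally`. A minimal, SDK-free sketch of that pattern follows. `FakeSession` is a hypothetical stand-in for `AgentSession`, and the `stop` event is an assumption added so the example can end on its own:

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for AgentSession; records lifecycle calls."""

    def __init__(self):
        self.calls = []

    async def start(self):
        self.calls.append("start")

    async def close(self):
        self.calls.append("close")


async def run_until_stopped(session, stop: asyncio.Event):
    """Start the session, wait for the stop signal, and always clean up."""
    try:
        await session.start()
        await stop.wait()  # Blocks until stop.set() is called
    finally:
        await session.close()  # Runs on normal exit, errors, and cancellation


async def demo():
    session = FakeSession()
    stop = asyncio.Event()
    task = asyncio.create_task(run_until_stopped(session, stop))
    await asyncio.sleep(0)  # Give the task a chance to start
    stop.set()
    await task
    return session.calls
```

In the quick start code the event is created inline and never set, so the agent keeps running until the process itself is stopped.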
### Step 3: Running the Project
Once you have completed the setup, you can run your AI Voice Agent project using Python. Make sure your `.env` file is properly configured and all dependencies are installed. ```bash python main.py console ``` Want to see the magic instantly? Try console mode to interact with your agent directly through the terminal! No need to join a meeting room - just speak and listen through your local system. Perfect for quick testing and development. ![Console Mode](https://cdn.videosdk.live/website-resources/docs-resources/ai_agents_console_mode_image.png) Learn more about [Console Mode](/ai_agents/console_mode). ```bash python main.py ``` Once you run this command, a playground URL will appear in your terminal. You can use this URL to interact with your AI agent.
### Step 4: Connecting with VideoSDK Client Applications
When working with a Client SDK, make sure to create the room first using the [Create Room API](https://docs.videosdk.live/api-reference/realtime-communication/create-room) . Then, simply pass the generated `room id` in both your client SDK and the `RoomOptions` for your AI Agent so they connect to the same session. :::tip Get started quickly with the [Quick Start Example](https://github.com/videosdk-live/agents-quickstart/) for the VideoSDK AI Agent SDK — everything you need to build your first AI agent fast. ::: --- --- title: Authentication and Token | Video SDK hide_title: true hide_table_of_contents: false description: Video SDK and Audio SDK, developers need to implement a token server. This requires efforts on both the front-end and backend. sidebar_label: Authentication and Tokens pagination_label: Authentication and Tokens keywords: - audio calling - video calling - real-time communication - collaboration image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: authentication-and-token --- # Why we are using JWT based Token ? Token based authentication allows users to verify their identity by providing generated API key and secret. We use JWT token for the authentication purpose because Token-based authentication is **widely used** in modern web applications and APIs because it offers several benefits over traditional authentication. For example, it can **reduce the risk of the credentials being misused**, and it allows for **more fine-grained control** over access to resources. Additionally, tokens can be easily revoked or expired, making it easier to manage access rights. ## How to generate Token ? To manage secured communication, every participant that connects to the meeting needs an access token. You can easily generate this token by using your `apiKey` and `secret-key` which you can get from [VideoSDK Dashboard](https://app.videosdk.live/api-keys). ### 1. 
Generating token from Dashboard If you are looking to do **testing or for development purpose**, you can generate a temporary token from [VideoSDK Dashboard's API section](https://app.videosdk.live/api-keys). import ReactPlayer from "react-player";
:::tip
As a best practice, generate the token on your backend server to **keep your credentials safe**.
:::

### 2. Generating token in your backend

- Your server generates an access token using your API key and secret.
- While generating a token, you can provide an **expiration time, permissions, and roles**, which are discussed later in this section.
- Your client obtains the token from your backend server.
- For token validation, the client passes this token to the VideoSDK server.
- The VideoSDK server only allows entry to the meeting if the token is valid.

![img2.png](/img/authentication-and-token.png)

import GenerateToken from "@site/src/theme/GenerateTokenContainer";

Follow our official example repositories to set up a token API: [videosdk-rtc-api-server-examples](https://github.com/videosdk-live/videosdk-rtc-api-server-examples)

### Payload while generating token

For AI Agent authentication, the payload is simplified to include only the essential parameters:

```js
{
  apikey: API_KEY, // MANDATORY
  permissions: [`allow_join`], // MANDATORY
}
```

- **`apikey` (Mandatory)**: This must be the API Key generated from the VideoSDK Dashboard. You can get it from [here](https://app.videosdk.live/api-keys).
- **`permissions` (Mandatory)**: For AI agents, typically use `allow_join` to enable the agent to join meetings directly.

Available permissions for AI agents:

- **`allow_join`**: The AI agent is **allowed to join** the meeting directly.
- **`ask_join`**: The AI agent is required to **ask for permission to join** the meeting.

Then sign this payload with your **`SECRET KEY`** and `jwt` options using the **`HS256 algorithm`**.

### Expiration time

You can set any expiration time on the token. In a **production environment**, however, a **short expiration time** is recommended: if someone gets hold of the token, it will only remain valid for a short period.

### What happens if token is expired?
If your token has expired, the user won't be able to join the meeting, and all API calls will return an error with the message `Token is invalid or expired`.

:::note

The token is validated only once, when joining the meeting. If it expires after a participant has joined, the current meeting is not affected.

:::

## How to check validity of token?

1. After generating the token, visit [jwt.io](https://jwt.io) and paste your token in the given area.
2. You will see the payload you passed while generating the token, along with the token's expiration and creation times.

![img1.png](/img/validate-token.png)

---

---
title: Console Mode for AI Agents
hide_title: false
hide_table_of_contents: false
description: "Learn how to use VideoSDK AI Agents in console mode for direct terminal-based voice interactions without joining a meeting room."
pagination_label: "Console Mode"
keywords:
  - AI Voice Agent
  - Console Mode
  - Terminal Interaction
  - VideoSDK Agents
  - CLI Mode
  - Voice Testing
  - Local Development
  - Quick Testing
  - Videosdk Console Mode
image: img/videosdklive-thumbnail.jpg
sidebar_position: 8
sidebar_label: Console Mode
slug: console-mode
---

# Console Mode for AI Agents

Console mode allows you to interact with your AI agent directly through the terminal without joining a VideoSDK meeting room. This is particularly useful for:

- Quick testing of agent functionality
- Local development and debugging
- Testing function tools and MCP integrations
- Validating pipeline configurations

## How It Works

When running your agent script in console mode:

1. The agent runs in a terminal-based environment
2. Your microphone input is captured directly through the terminal
3. Agent responses are played through your system audio
4. The full Pipeline (cascading or realtime mode) runs locally without connecting to a meeting.
   This makes it easier to verify that audio flow, agent logic, and response generation work correctly before deploying into a live session.
5. Function tools, MCP integrations, and other features remain fully functional.

## Using Console Mode

To use console mode, add the `console` argument when running your agent script:

```bash
python main.py console
```

import ReactPlayer from 'react-player'
The console will display:

- Agent speech output
- User speech input
- Latency metrics (STT, TTS, LLM, EOU)
- Pipeline processing information

Because console mode is enabled by a command-line argument, you can use the same agent code for both development and production environments.

---

---
title: Agent Session
hide_title: false
hide_table_of_contents: false
description: "Discover how the `AgentSession` in VideoSDK's AI Agent SDK orchestrates various components into a unified workflow, managing the agent's interaction lifecycle and context for seamless real-time communication."
pagination_label: "Agent Session"
keywords:
  - AgentSession
  - AI Agent SDK
  - VideoSDK Agents
  - Component Orchestration
  - Session Management
  - Context Handling
  - Agent Workflow
  - Real-time AI
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 8
sidebar_label: Agent Session
slug: agent-session
---

import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, RobotIcon, GithubIcon } from '@site/src/components/agent/cards';

# Agent Session

The `AgentSession` is the central orchestrator that integrates the `Agent` and `Pipeline` into a cohesive workflow. It manages the complete lifecycle of an agent's interaction within a VideoSDK meeting, handling initialization, execution, and cleanup.

![Agent Session](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_agent_session.png)

## Core Features

- **Component Orchestration:** Unifies agent and pipeline components.
- **Lifecycle Management:** Handles session start, execution, and cleanup.

## State Management

The `AgentSession` tracks state for both users and agents and automatically emits state change events for real-time monitoring.

:::tip Version Requirement

The state management features and enhanced methods (`reply()`, `interrupt()`) are available in versions above v0.0.35.
::: ### User States - **IDLE** - User is not actively speaking or listening - **SPEAKING** - User is currently speaking - **LISTENING** - User is actively listening to the agent ### Agent States - **STARTING** - Agent is initializing - **IDLE** - Agent is ready and waiting - **SPEAKING** - Agent is currently generating speech - **LISTENING** - Agent is processing user input - **THINKING** - Agent is processing and generating response - **CLOSING** - Agent is shutting down ### State Event Monitoring State changes are automatically emitted as events that you can listen to: ```python title="main.py" def on_user_state_changed(data): print("User state:", data) def on_agent_state_changed(data): print("Agent state:", data) session.on("user_state_changed", on_user_state_changed) session.on("agent_state_changed", on_agent_state_changed) ``` ## Constructor Parameters ```python AgentSession( agent: Agent, pipeline: Pipeline, wake_up: Optional[int] = None ) ``` ### Wake-Up Call Wake-up call automatically triggers actions when users are inactive for a specified period of time, helping maintain engagement. ```python title="main.py" # Configure wake-up timer session = AgentSession( agent=MyAgent(), pipeline=pipeline, wake_up=10 # Trigger after 10 seconds of inactivity ) # Set callback function async def on_wake_up(): await session.say("Are you still there? How can I help?") session.on_wake_up = on_wake_up ``` :::note Important: If a `wake_up` time is provided, you must set a callback function before starting the session. If no `wake_up` time is specified, no timer or callback will be activated. ::: ## Basic Usage To get an agent running, you initialize an `AgentSession` with your custom `Agent` and a configured `Pipeline`. The session handles the underlying connection and data flow. 
### Example Implementation: import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ```python title="main.py" from videosdk.agents import AgentSession, Agent, Pipeline, WorkerJob, JobContext, RoomOptions from videosdk.plugins.openai import OpenAIRealtime class MyAgent(Agent): def __init__(self): super().__init__(instructions="You are a helpful meeting assistant.") async def on_enter(self): await self.session.say("Hello! How can I help you today?") def setup_state_monitoring(self): def on_user_state_changed(data): print(f"User state changed to: {data['state']}") def on_agent_state_changed(data): print(f"Agent state changed to: {data['state']}") self.session.on("user_state_changed", on_user_state_changed) self.session.on("agent_state_changed", on_agent_state_changed) async def start_session(ctx: JobContext): model = OpenAIRealtime(model="gpt-4o-realtime-preview") pipeline = Pipeline(llm=model) session = AgentSession( agent=MyAgent(), pipeline=pipeline ) await ctx.connect() await session.start() # Session runs until manually stopped or meeting ends def make_context(): return JobContext( room_options=RoomOptions( room_id="your-room-id", auth_token="your-auth-token", name="Assistant Bot" ) ) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ```python title="main.py" from videosdk.agents import AgentSession, Agent, Pipeline, WorkerJob, JobContext, RoomOptions from videosdk.plugins.openai import OpenAISTT, OpenAITTS, OpenAILLM class MyAgent(Agent): def __init__(self): super().__init__(instructions="You are a helpful meeting assistant.") async def on_enter(self): await self.session.say("Hello! 
How can I help you today?") def setup_state_monitoring(self): def on_user_state_changed(data): print(f"User state changed to: {data['state']}") def on_agent_state_changed(data): print(f"Agent state changed to: {data['state']}") self.session.on("user_state_changed", on_user_state_changed) self.session.on("agent_state_changed", on_agent_state_changed) async def start_session(ctx: JobContext): # Configure individual components stt = OpenAISTT(model="whisper-1") llm = OpenAILLM(model="gpt-4") tts = OpenAITTS(model="tts-1", voice="alloy") pipeline = Pipeline( stt=stt, llm=llm, tts=tts ) session = AgentSession( agent=MyAgent(), pipeline=pipeline ) await ctx.connect() await session.start() # Session runs until manually stopped or meeting ends def make_context(): return JobContext( room_options=RoomOptions( room_id="your-room-id", auth_token="your-auth-token", name="Assistant Bot" ) ) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ## Development and Testing Features The `AgentSession` supports several modes for development, testing, and user engagement: ### Playground Mode Playground mode provides a web-based interface for testing your agent without building a separate client application. #### Usage To activate playground mode, simply set `playground: True` in your RoomOptions for JobContext. 
```python title="main.py"
from videosdk.agents import RoomOptions, JobContext, WorkerJob


async def entrypoint(ctx: JobContext):
    # Your agent implementation here
    # This is where you create your pipeline, agent, and session
    pass


def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",
        name="Test Agent",
        playground=True  # Enable playground mode
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    # WorkerJob is already imported at the top of the file
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
```

When enabled, the playground URL is automatically displayed in your terminal for easy access.

:::note

Note: Playground mode is designed for development and testing purposes. For production deployments, ensure playground mode is disabled to maintain security and performance.

:::

### Console Mode

Console mode allows you to test your agent directly in the terminal using your microphone and speakers, without joining a VideoSDK meeting.

#### Usage

To use console mode, add the `console` argument when running your agent script:

```bash
python main.py console
```

import ReactPlayer from 'react-player'
The console will display: - Agent speech output - User speech input - Various latency metrics (STT, TTS, LLM,EOU) - Pipeline processing information This flexibility allows you to use the same agent code for both development and production environments. ## Session Lifecycle Management The `AgentSession` provides methods to control the agent's presence and behavior in the meeting. }, { title: "say(message: str)", description: "Sends a message from the agent to the meeting participants. Allows the agent to communicate with users in the meeting.", link: "https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=the%20agent%20session.-,async%20def%20say,self%2C%20message%3A%C2%A0str)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE", icon: }, { title: "close()", description: "Gracefully shuts down the session. Finalizes metrics collection, cancels wake-up timer, and calls agent's on_exit() hook.", link: "https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=Methods-,async%20def%20close,self)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE", icon: }, { title: "leave()", description: "Leaves the meeting without full session cleanup. Provides a quick exit option while maintaining session state.", link: "https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=the%20agent%20session.-,async%20def%20leave,self)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE", icon: }, { title: "reply(instructions, wait_for_playback)", description: "Generate agent responses using instructions and current chat context. 
Includes playback control and prevents concurrent calls.", icon: , link: "https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=the%20agent%20session.-,async%20def%20reply,self%2C%20instructions%3A%C2%A0str%2C%20wait_for_playback%3A%C2%A0bool%C2%A0%3D%C2%A0True)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE" }, { title: "interrupt()", description: "Immediately interrupt the agent's current operation, stopping speech generation and LLM processing for emergency stops or user interruptions.", icon: , link: "https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=the%20agent%20session.-,async%20def%20interrupt,self)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE" } ]} /> ### Example of Managing the Lifecycle: ```python title="main.py" import asyncio from videosdk.agents import AgentSession, Agent, Pipeline, WorkerJob, JobContext, RoomOptions, function_tool from videosdk.plugins.openai import OpenAIRealtime class MyAgent(Agent): def __init__(self): super().__init__(instructions="You are a helpful meeting assistant.") # LIFECYCLE: Agent entry point - called when session starts async def on_enter(self): await self.session.say("Hello! 
How can I help you today?") # LIFECYCLE: Agent exit point - called when session ends async def on_exit(self): print("Agent is leaving the session") @function_tool async def provide_summary(self) -> str: """Provide a conversation summary using the new reply method""" await self.session.reply("Let me summarize our conversation so far...") return "Summary provided" @function_tool async def stop_speaking(self) -> str: """Emergency stop functionality""" await self.session.interrupt() return "Agent stopped successfully" async def run_agent_session(ctx: JobContext): # LIFECYCLE STAGE 1: Session Creation model = OpenAIRealtime(model="gpt-4o-realtime-preview") pipeline = Pipeline(llm=model) session = AgentSession(agent=MyAgent(), pipeline=pipeline) try: # LIFECYCLE STAGE 2: Connection Establishment await ctx.connect() # LIFECYCLE STAGE 3: Session Start await session.start() # LIFECYCLE STAGE 4: Session Running await asyncio.Event().wait() finally: # LIFECYCLE STAGE 5: Session Cleanup await session.close() # LIFECYCLE STAGE 6: Context Shutdown await ctx.shutdown() # LIFECYCLE STAGE 0: Context Creation def make_context() -> JobContext: room_options = RoomOptions(room_id="your-room-id", auth_token="your-token") return JobContext(room_options=room_options) if __name__ == "__main__": # LIFECYCLE ORCHESTRATION: Worker Job Management # Creates and starts the worker job that manages the entire lifecycle job = WorkerJob(entrypoint=run_agent_session, jobctx=make_context) job.start() ``` ## Examples - Try Out Yourself We have examples to get you started. Go ahead, try out, talk to agent, understand and customize according to your needs. }, { title: "Agent to agent Example", description: "Agent Session with Customer and Loan agent", link: "https://github.com/videosdk-live/agents/tree/main/examples/a2a", icon: } ]} /> --- --- title: Agent hide_title: false hide_table_of_contents: false description: "Learn about the `Agent` base class in the VideoSDK AI Agent SDK. 
Understand how to create custom agents, define system prompts, manage state, and register function tools." pagination_label: "Agent" keywords: - Agent Class - AI Agent SDK - VideoSDK Agents - Custom Agents - System Prompts - State Management - Function Tools - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Agent slug: agent --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards'; # Agent The `Agent` class is the base class for defining AI agent behavior and capabilities. It provides the foundation for creating intelligent conversational agents with support for function tools, MCP servers, and advanced lifecycle management. ![Agent](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_agent.png) ## Basic Usage ### Simple Agent This is how you can initialize a simple agent with the `Agent` class, where `instructions` defines how the agent should behave. ```python title="main.py" from videosdk.agents import Agent class MyAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful assistant." ) ``` ## Agent with Function Tools Function tools allow your agent to perform actions and interact with external services, extending its capabilities beyond simple conversation. You can register tools that are defined either outside or inside your agent class. ### External Tools External tools are defined as standalone functions and are passed into the agent's constructor via the tools list. This is useful for sharing common tools across multiple agents. 
```python title="main.py" from videosdk.agents import Agent, function_tool # External tool defined outside the class @function_tool(description="Get weather information") def get_weather(location: str) -> str: """Get weather information for a specific location.""" # Weather logic here return f"Weather in {location}: Sunny, 72°F" class WeatherAgent(Agent): def __init__(self): super().__init__( instructions="You are a weather assistant.", tools=[get_weather] # Register the external tool ) ``` ### Internal Tools Internal tools are defined as methods within your agent class and are decorated with `@function_tool`. This is useful for logic that is specific to the agent and needs access to its internal state (`self`). ```python title="main.py" from videosdk.agents import Agent, function_tool class FinanceAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful financial assistant." ) self.portfolio = {"AAPL": 10, "GOOG": 5} @function_tool def get_portfolio_value(self) -> dict: """Get the current value of the user's stock portfolio.""" # In a real scenario, you'd fetch live stock prices # This is a simplified example return {"total_value": 5000, "holdings": self.portfolio} ``` ## Agent with MCP Server `MCPServerStdio` enables your agent to communicate with external processes via standard input/output streams. This is ideal for integrating complex, standalone Python scripts or other local executables as tools. 
```python title="main.py" import sys from pathlib import Path from videosdk.agents import Agent, MCPServerStdio # Path to your external Python script that runs the MCP server mcp_server_path = Path(__file__).parent / "mcp_server_script.py" class MCPAgent(Agent): def __init__(self): super().__init__( instructions="You are an assistant that can leverage external tools via MCP.", mcp_servers=[ MCPServerStdio( executable_path=sys.executable, process_arguments=[str(mcp_server_path)], session_timeout=30 ) ] ) ``` ## Agent Lifecycle and Methods The `Agent` class provides lifecycle hooks and methods to manage state and behavior at critical points in the agent's session. ### Lifecycle Hooks These methods are designed to be overridden in your custom agent class to implement specific behaviors. - `async def on_enter(self) -> None`: Called once when the agent successfully joins the meeting. This is the ideal place for introductions or initial actions, such as greeting participants. - `async def on_exit(self) -> None`: Called when the agent is about to exit the meeting. Use this for cleanup tasks or for saying goodbye. ```python title="main.py" from videosdk.agents import Agent class LifecycleAgent(Agent): async def on_enter(self): print("Agent has entered the meeting.") await self.session.say("Hello everyone! I'm here to help.") async def on_exit(self): print("Agent is exiting the meeting.") await self.session.say("It was a pleasure assisting you. Goodbye!") ``` ## Human in the Loop (HITL) Human in the Loop enables AI agents to escalate specific queries to human operators for review and approval. This implementation uses Discord as the human interface through an MCP server, allowing seamless handoffs between AI automation and human oversight. 
### Use Cases - **Discount Requests**: AI escalates pricing queries to human sales agents - **Complex Support**: Technical issues requiring human expertise - **Policy Decisions**: Requests that need human approval or clarification - **Escalation Scenarios**: Situations where AI confidence is low ### Implementation The HITL pattern combines the Agent's MCP server capability with a Discord-based human interface: ```python title="main.py" from videosdk.agents import Agent, MCPServerStdio, Pipeline, AgentSession, JobContext, RoomOptions, WorkerJob from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.anthropic import AnthropicLLM from videosdk.plugins.google import GoogleTTS from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector import pathlib import sys import os from typing import Optional class CustomerAgent(Agent): def __init__(self, ctx: Optional[JobContext] = None): current_dir = pathlib.Path(__file__).parent discord_mcp_server_path = current_dir / "discord_mcp_server.py" super().__init__( instructions="You are a customer-facing agent for VideoSDK. You have access to various tools to assist with customer inquiries, provide support, and handle tasks. When a user asks for a discount percentage, always use the appropriate tool to retrieve and provide the accurate answer from your superior human agent.", mcp_servers=[ MCPServerStdio( executable_path=sys.executable, process_arguments=[str(discord_mcp_server_path)], session_timeout=30 ), ] ) self.ctx = ctx async def on_enter(self) -> None: """Called when the agent first joins the meeting""" await self.session.say("Hi! I'm your VideoSDK customer support agent. How can I help you today?") async def on_exit(self) -> None: """Called when the agent exits the meeting""" await self.session.say("Thank you for contacting VideoSDK support. 
Have a great day!") # Pipeline configuration integrated into the main setup def create_pipeline() -> Pipeline: """Create and configure the pipeline with all components""" return Pipeline( stt=DeepgramSTT(api_key=os.getenv("DEEPGRAM_API_KEY")), llm=AnthropicLLM(api_key=os.getenv("ANTHROPIC_API_KEY")), tts=GoogleTTS(api_key=os.getenv("GOOGLE_API_KEY")), vad=SileroVAD(), turn_detector=TurnDetector(threshold=0.8) ) async def start_session(ctx: JobContext): """Main entry point that creates agent, pipeline, and starts the session""" # Create the pipeline pipeline = create_pipeline() # Create the agent with context agent = CustomerAgent(ctx=ctx) # Create the agent session session = AgentSession( agent=agent, pipeline=pipeline ) try: # Connect to the room await ctx.connect() # Start the agent session await session.start() # Keep running until interrupted import asyncio await asyncio.Event().wait() finally: # Clean up resources await session.close() await ctx.shutdown() def make_context() -> JobContext: """Create the job context with room configuration""" room_options = RoomOptions( room_id=os.getenv("VIDEOSDK_ROOM_ID", "your-room-id"), auth_token=os.getenv("VIDEOSDK_AUTH_TOKEN"), name="VideoSDK Customer Agent", playground=True ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ```python title="discord_mcp_server.py" import asyncio import os from mcp.server.fastmcp import FastMCP import discord from discord.ext import commands class DiscordHuman: def __init__(self, user_id: int, channel_id: int, bot_token: str): self.user_id = user_id self.channel_id = channel_id self.bot_token = bot_token self.bot = commands.Bot(command_prefix="!", intents=discord.Intents.all()) self.response_future = None self.setup_bot_events() def setup_bot_events(self): @self.bot.event async def on_ready(): print(f'{self.bot.user} has connected to Discord!') @self.bot.event async def 
on_message(message): if (message.author.id == self.user_id and message.channel.id in [thread.id for thread in self.bot.get_all_channels() if hasattr(thread, 'parent')] and self.response_future and not self.response_future.done()): self.response_future.set_result(message.content) async def start_bot(self): """Start the Discord bot""" await self.bot.start(self.bot_token) async def ask(self, question: str) -> str: if not self.bot.is_ready(): return "❌ Discord bot is not ready" try: channel = self.bot.get_channel(self.channel_id) if not channel: return "❌ Channel not found" thread = await channel.create_thread( name=question[:100], type=discord.ChannelType.public_thread ) await thread.send(f"<@{self.user_id}> {question}") self.response_future = asyncio.get_running_loop().create_future() try: response = await asyncio.wait_for(self.response_future, timeout=600) return response except asyncio.TimeoutError: return "⏱️ Timed out waiting for a human response" except Exception as e: return f"❌ Error: {str(e)}" # Initialize Discord human instance discord_human = DiscordHuman( user_id=int(os.getenv("DISCORD_USER_ID")), channel_id=int(os.getenv("DISCORD_CHANNEL_ID")), bot_token=os.getenv("DISCORD_TOKEN") ) # MCP Server Setup mcp = FastMCP("HumanInTheLoopServer") @mcp.tool(description="Ask a human agent via Discord for a specific user query such as discount percentage, etc.") async def ask_human(question: str) -> str: """Ask a human agent via Discord for assistance""" return await discord_human.ask(question) async def main(): """Main function to start both the Discord bot and MCP server""" # Start Discord bot in background bot_task = asyncio.create_task(discord_human.start_bot()) # Wait a moment for bot to initialize await asyncio.sleep(2) # Start MCP server await mcp.run_stdio_async() if __name__ == "__main__": asyncio.run(main()) ``` Set the following environment variables: ```bash title=".env" DISCORD_TOKEN=your_discord_bot_token DISCORD_USER_ID=human_operator_user_id 
DISCORD_CHANNEL_ID=channel_id_for_escalations DEEPGRAM_API_KEY=your_deepgram_key ANTHROPIC_API_KEY=your_anthropic_key GOOGLE_API_KEY=your_google_key VIDEOSDK_AUTH_TOKEN=your_videosdk_token VIDEOSDK_ROOM_ID=your_room_id ``` The Discord MCP server provides the `ask_human` tool that creates Discord threads for human operator responses. This leverages the same MCP integration pattern shown in the previous section. Complete implementation with full source code, setup instructions, and configuration examples available in the [VideoSDK Agents GitHub repository](https://github.com/videosdk-live/agents/tree/main/examples/human_in_the_loop). --> ## Examples - Try Out Yourself Checkout the examples of function tool usage and MCP server. }, { title: "MCP Server", description: "Implement agent with MCP server integration", link: "https://github.com/videosdk-live/agents/blob/main/examples/mcp_example.py", icon: }, { title: "Human in the Loop", description: "Escalate queries to human operators via Discord", link: "https://github.com/videosdk-live/agents/tree/main/examples/human_in_the_loop", icon: } ]} columns={3} /> --- --- title: Avatar Server hide_title: false hide_table_of_contents: false description: "Understand how the Avatar Server works in VideoSDK Agents — a separate participant that receives agent audio via data channel, renders synchronized video, and publishes it directly to the room." pagination_label: "Avatar Server" keywords: - Avatar Server - Avatar Worker - Avatar Runner - VideoSDK Agents - Local Avatar - Cloud Avatar Plugin - AvatarAudioIn - AvatarServer - AvatarSynchronizer - AvatarRenderer - Data Channel image: img/videosdklive-thumbnail.jpg sidebar_position: 11 sidebar_label: Avatar Server slug: avatar-server --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Avatar Server When you add an avatar to your agent, a second participant — the **Avatar Server** — joins the VideoSDK room alongside the agent. 
Your agent continues handling the conversation (STT → LLM → TTS) and streams its TTS audio output to the Avatar Server over VideoSDK's built-in **data channel**. The Avatar Server receives that audio, renders video frames synchronized to the speech, and publishes both audio and video tracks directly back into the room.

The end user only ever sees and hears the Avatar Server's output. The agent participant publishes silence and no video.

![Avatar Server](https://cdn.videosdk.live/website-resources/docs-resources/agent_v1_avatar_server.jpg)

---

## How the Audio Gets There — Data Channel

The agent and Avatar Server communicate entirely over VideoSDK's built-in data channels. No external message broker, queue, or WebSocket is needed.

| Message | Direction | Reliability | Purpose |
|---|---|---|---|
| PCM audio chunk | Agent → Avatar Server | **Unreliable** | Raw TTS audio, chunked at ≤15 KB |
| `segment_end` | Agent → Avatar Server | Reliable | TTS turn has finished |
| `INTERRUPT` | Agent → Avatar Server | Reliable | Stop playback immediately |
| `stream_ended` | Avatar Server → Agent | Reliable | Playback complete acknowledgment |

Audio chunks use unreliable delivery — a dropped packet is better than a delayed one. Control signals use reliable delivery so that a missed interrupt or segment boundary never permanently desyncs state.

---

## Two Ways to Run an Avatar Server

There are two paths depending on whether you want the framework to handle A/V synchronization for you or whether you are building your own rendering backend.

### Path 1 - Local Avatar

![Local Avatar Server](https://cdn.videosdk.live/website-resources/docs-resources/agent_v1_local_avatar.jpg)

When using a local avatar, the framework's built-in components handle receiving audio from the data channel, orchestrating your renderer, and pacing frames into the room. You only need to implement the visual rendering logic itself.

**`AvatarAudioIn`** — runs inside your Avatar Server process.
It listens on the data channel, reassembles the PCM stream, handles interrupts (clearing its buffer with a 0.3 s cooldown to drop any in-flight chunks), and exposes a clean async iterator of audio frames and segment markers. **`AvatarServer`** — the orchestrator. It drains `AvatarAudioIn`, feeds each frame into your `AvatarRenderer`, forwards the rendered output into `AvatarSynchronizer`, and sends a `stream_ended` acknowledgment back to the agent at the end of each TTS turn. **`AvatarSynchronizer`** — paces audio and video frames into their respective custom tracks at the configured FPS. At 30 FPS and 24 kHz, each video frame corresponds to exactly 800 audio samples. It sleeps between frames if the renderer runs faster than real time. **`AvatarRenderer`** — the only thing you implement. For each audio frame you receive, produce one video frame and yield them in order (video first, then audio). The framework wires everything else. A small **dispatcher** (`POST /launch`) runs as a separate HTTP service and spawns one Avatar Server process per room on demand. ### Path 2 - Cloud Plugin or Custom Backend ![Remote Avatar Server](https://cdn.videosdk.live/website-resources/docs-resources/agent_v1_remote_avatar.jpg) For a custom or cloud-hosted Avatar Server, your backend joins the VideoSDK room directly as the Avatar Server participant. It subscribes to the agent's data channel, receives the raw PCM audio, renders video using its own engine, and publishes custom audio + video tracks back to the room — all without using any of the local framework components. The framework generates a pre-signed VideoSDK JWT for the Avatar Server and passes it to your plugin's `connect()` call. Your backend uses that token to join the room and begin receiving. **What your backend needs to do:** 1. Join the VideoSDK room using the token received from `connect()` 2. Subscribe to the agent's data channel 3. Receive incoming PCM audio chunks (unreliable) and control messages (reliable) 4. 
Render video frames using your own engine in sync with the audio
5. Publish custom audio and video tracks as the Avatar Server participant

**Your plugin on the agent side needs only three things:**

```python
import httpx


class MyProviderAvatarConnection:
    def __init__(self, provider_url: str):
        self.provider_url = provider_url

    @property
    def participant_id(self) -> str:
        return "my_provider_avatar"

    async def connect(self, room_id: str, token: str) -> None:
        # framework passes a pre-signed VideoSDK JWT
        # tell your backend to join and start rendering
        async with httpx.AsyncClient() as client:
            await client.post(
                f"{self.provider_url}/v1/avatar/start",
                json={"room_id": room_id, "token": token},
            )

    async def aclose(self) -> None:
        pass
```

This pattern is the foundation for any cloud-hosted avatar provider: their backend joins the room, receives audio from the data channel, renders video, and publishes it back, all on their own infrastructure.

---

## Comparison

| | Local Avatar | Cloud / Custom Backend |
|---|---|---|
| **Who runs the Avatar Server** | You, on your own machine or server | Your backend or a cloud provider |
| **Audio received via** | Framework's `AvatarAudioIn` (data channel) | Your own data channel subscriber |
| **A/V synchronization** | Framework's `AvatarSynchronizer` | Your own engine or provider's |
| **What you implement** | `AvatarRenderer` (one class) | Full backend service |
| **Best for** | Custom visuals, full control, prototyping | Production lip-sync, managed infrastructure |
| **Examples** | Circular glow, waveform visualizer | Any cloud avatar provider |

---

## Key Components

| Component | Description |
|----------|-------------|
| `AvatarAudioOut` | Agent-side component that dispatches the Avatar Server and streams TTS audio in chunks over the data channel |
| `AvatarAudioIn` | Service-side component that receives data channel messages and provides an async iterator of audio frames |
| `AvatarServer` | Service-side orchestrator that connects `AvatarAudioIn` 
→ renderer → synchronizer → media tracks | | `AvatarSynchronizer` | Handles timing and pacing of audio + video frames based on configured FPS | | `AvatarRenderer` | Abstract base class — implement this to define avatar visuals and rendering logic | | `AvatarSettings` | Configuration object for resolution, FPS, and audio sample rate | | `generate_avatar_credentials` | Utility to sign a VideoSDK JWT for authenticating the Avatar Server participant | --- ## See Also import { AgentCardGrid, GithubIcon } from '@site/src/components/agent/cards'; }, { title: "Custom Backend Example", description: "Provider backend example — join room, receive data channel audio, publish custom tracks", link: "https://github.com/videosdk-live/agents/tree/main/examples/avatar/provider_backend", icon: } ]} /> --- --- title: Avatar hide_title: false hide_table_of_contents: false description: "Learn how to add virtual avatars to your VideoSDK AI Agents. Understand avatar integration, configuration, and how to create lifelike visual representations for your agents." pagination_label: "Avatar" keywords: - Avatar - Virtual Avatar - Simli - Visual Representation - AI Agent Avatar - Video Avatar - Agent Appearance - VideoSDK Agents - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 10 sidebar_label: Avatar slug: avatar --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Avatar Avatars add a visual, human-like presence to your AI agents, creating more engaging and natural interactions. The VideoSDK Agents framework supports virtual avatars through the Simli integration, providing lifelike video representations that sync with your agent's speech. 
![Avatar](https://cdn.videosdk.live/website-resources/docs-resources/voice_agent_avatar.png) ## Overview Avatar functionality enables your AI agents to: - **Visual Presence**: Display a human-like avatar that represents your agent - **Lip Sync**: Automatically synchronize avatar mouth movements with speech - **Real-time Rendering**: Generate avatar video in real-time during conversations - **Customizable Appearance**: Choose from different avatar faces and styles - **Seamless Integration**: Works with all Pipeline modes (Cascading, Realtime, Hybrid) ## What Avatars Enable With avatar capabilities, your agents can: - Provide a more human and approachable interface - Increase user engagement through visual interaction - Create branded agent personalities with custom appearances - Enhance accessibility through visual communication cues - Build trust through consistent visual representation ## Simli Avatar Integration ### Basic Setup The Simli avatar integration provides high-quality virtual avatars with real-time lip synchronization. 
```python title="main.py" from videosdk.plugins.simli import SimliAvatar, SimliConfig # Configure your avatar avatar_config = SimliConfig( apiKey="your-simli-api-key", faceId="0c2b8b04-5274-41f1-a21c-d5c98322efa9", # Default face syncAudio=True, handleSilence=True ) avatar = SimliAvatar(config=avatar_config) ``` ### Avatar Configuration Options Customize your avatar's behavior and appearance: ```python title="main.py" from videosdk.plugins.simli import SimliConfig config = SimliConfig( apiKey="your-simli-api-key", faceId="your-custom-face-id", # Choose avatar appearance syncAudio=True, # Enable lip sync handleSilence=True, # Manage silent periods maxSessionLength=1800, # 30 minutes max session maxIdleTime=300 # 5 minutes idle timeout ) ``` ## Pipeline Integration Add avatar to your cascade setup: ```python title="main.py" from videosdk.agents import Pipeline, AgentSession from videosdk.plugins.simli import SimliAvatar, SimliConfig # Configure avatar avatar_config = SimliConfig(apiKey="your-simli-api-key") avatar = SimliAvatar(config=avatar_config) # Create pipeline with avatar pipeline = Pipeline( stt=your_stt_provider, llm=your_llm_provider, tts=your_tts_provider, vad=your_vad_provider, turn_detector=your_turn_detector, avatar=avatar # Add avatar to pipeline ) ``` Integrate avatar with real-time models: ```python title="main.py" from videosdk.agents import Pipeline from videosdk.plugins.simli import SimliAvatar, SimliConfig from videosdk.plugins.openai import OpenAIRealtime # Configure avatar avatar = SimliAvatar(config=SimliConfig(apiKey="your-api-key")) # Configure real-time model model = OpenAIRealtime(model="gpt-4o-realtime-preview") # Create pipeline with avatar pipeline = Pipeline( llm=model, avatar=avatar ) ``` :::info You can also specify the avatar in your room configuration: ```python title="main.py" from videosdk.agents import JobContext, RoomOptions def make_context(): avatar = SimliAvatar(config=SimliConfig(apiKey="your-api-key")) return JobContext( 
room_options=RoomOptions( room_id="your-room-id", auth_token="your-auth-token", name="Avatar Agent", avatar=avatar # Specify avatar in room options ) ) ``` ::: ## Example - Try Out Yourself import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards'; } ]} /> --- --- title: Background Audio hide_title: false hide_table_of_contents: false description: "Learn about Background Audio in the VideoSDK AI Agent SDK. Enable ambient sounds, thinking audio, and background music to enhance conversational experiences." pagination_label: "Background Audio" keywords: - Background Audio - Thinking Audio - Ambient Sound - Background Music - override_thinking - RoomOptions - Audio Playback - VideoSDK Agents - Python SDK - AI Agents image: img/videosdklive-thumbnail.jpg sidebar_position: 8 sidebar_label: Background Audio slug: background-audio --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards'; import { LanguageTable } from '@site/src/components/agent'; # Background Audio The Background Audio feature enables voice agents to play audio during conversations, enhancing user experience with ambient sounds and processing feedback. There are two ways to set the audio: 1. **Thinking Audio:** Plays automatically during agent processing (e.g., keyboard typing sounds) 2. 
**Background Audio:** Plays on-demand for ambient music or soundscapes ![Thinking Audio](https://assets.videosdk.live/images/thinking_audio.png) ![Background Audio](https://assets.videosdk.live/images/background-audio.png) ## Getting Started ### Enable Background Audio ```python from videosdk.agents import RoomOptions, JobContext room_options = RoomOptions( room_id="your-room-id", name="My Agent", # highlight-start background_audio=True # Enable background audio support #highlight-end ) context = JobContext(room_options=room_options) ``` ### Agent Methods **1. Set Thinking Audio** `set_thinking_audio()`: Configures audio that plays automatically while the agent processes responses. **Parameters:** - `file (str, optional)`: Path to custom WAV audio file. If not provided, uses built-in `agent_keyboard.wav` - `volume (float, optional)`: Volume of the audio. Default: `0.3` **Example:** ```python class MyAgent(Agent): def __init__(self): super().__init__(instructions="...") # highlight-start # Use default keyboard sound self.set_thinking_audio() # Or use custom audio # self.set_thinking_audio(file="path/to/custom.wav") # highlight-end ``` **2. Play Background Audio** `play_background_audio()`: Starts playing background audio during the conversation. **Parameters:** - `file (str, optional)`: Path to custom WAV audio file. If not provided, uses built-in `classical.wav` - `looping (bool, optional)`: Whether to loop the audio. Default: `False` - `override_thinking (bool, optional)`: Whether to stop thinking audio when background audio starts. Default: `True` - `volume (float, optional)`: Volume of the audio. Default: `1.0` **Example:** ```python @function_tool async def play_music(self): """Plays background music""" # highlight-start await self.play_background_audio( looping=True, override_thinking=False ) # highlight-end return "Music started" ``` **3. Stop Background Audio** `stop_background_audio()`: Stops currently playing background audio. 
**Example:** ```python @function_tool async def stop_music(self): """Stops background music""" # highlight-start await self.stop_background_audio() # highlight-end return "Music stopped" ``` ## Complete Example ```python title="main.py" from videosdk.agents import ( Agent, AgentSession, Pipeline, WorkerJob, JobContext, RoomOptions, function_tool ) from videosdk.plugins.openai import OpenAILLM, OpenAITTS from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector class MusicAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful assistant. Use control_music to play or stop background music." ) #highlight-start # Enable thinking audio with default keyboard sound self.set_thinking_audio() #highlight-end async def on_enter(self): await self.session.say("Hello! Ask me to play music.") async def on_exit(self): await self.session.say("Goodbye! Hope you enjoyed the music.") @function_tool async def control_music(self, action: str): """ Controls background music. :param action: 'play' to start music, 'stop' to end it """ if action == "play": #highlight-start await self.play_background_audio( override_thinking=True, looping=True ) #highlight-end return "Music started" elif action == "stop": #highlight-start await self.stop_background_audio() #highlight-end return "Music stopped" return "Invalid action" async def entrypoint(ctx: JobContext): agent = MusicAgent() pipeline = Pipeline( stt=DeepgramSTT(), llm=OpenAILLM(), tts=OpenAITTS(), vad=SileroVAD(), turn_detector=TurnDetector() ) session = AgentSession( agent=agent, pipeline=pipeline ) await session.start(wait_for_participant=True, run_until_shutdown=True) def make_context(): return JobContext( room_options=RoomOptions( room_id="", name="Music Agent", #highlight-start background_audio=True # Required! 
#highlight-end ) ) if __name__ == "__main__": job = WorkerJob(entrypoint=entrypoint, jobctx=make_context) job.start() ``` ## Pipeline Support Background audio works with both pipeline types: ### Cascading Mode - Thinking audio plays automatically during LLM processing - Background audio can be controlled via agent methods - Audio stops automatically when agent speaks ### Realtime Mode - Full background audio support with streaming models - Automatic lifecycle management during conversation turns ## Audio Behavior | Feature | Thinking Audio | Background Audio | |---------|---------------|------------------| | **Trigger** | Automatic during processing | Manual via `play_background_audio()` | | **Default File** | `agent_keyboard.wav` | `classical.wav` | | **Typical Duration** | Short (during LLM call) | Long/continuous | | **Looping** | Optional | Recommended (`looping=True`) | | **User Control** | No | Yes (via function tools) | | **Stops When** | Agent speaks | Agent speaks or `stop_background_audio()` | ## Audio File Requirements - **Format:** WAV (`.wav`) - **Recommended:** 16-bit PCM, 16kHz sample rate, mono channel - **Built-in files:** - `agent_keyboard.wav`: Default thinking sound - `classical.wav`: Default background music ## Best Practices 1. **Always enable in RoomOptions:** Set `background_audio=True` before using audio methods 2. **Use `override_thinking=True`:** When playing music to avoid overlapping sounds 3. **Loop background audio:** Set `looping=True` for continuous ambient sounds 4. **Control via function tools:** Let users control music through natural language 5. 
**Clean audio files:** Use high-quality WAV files to avoid distortion ## Common Use Cases - **Music player agent:** Control playback through conversation - **Ambient soundscapes:** Create atmosphere during interactions - **Processing feedback:** Custom thinking sounds for different agent personalities - **Hold music:** Play audio while agent performs long operations ## Example - Try It Yourself } ]} /> ## FAQs
Troubleshooting

| Issue | Solution |
|--------|-----------|
| Audio not playing | Verify `background_audio=True` in `RoomOptions` |
| Audio quality issues | Use WAV format with 16-bit PCM encoding |
| Audio doesn't stop | Ensure `stop_background_audio()` is called properly |
| Overlapping sounds | Use `override_thinking=True` when playing background audio |
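
Many audio quality issues come down to a file that does not match the recommended format (16-bit PCM, 16 kHz, mono). The sketch below shows one way to check a WAV file with Python's standard `wave` module before passing its path to `set_thinking_audio()` or `play_background_audio()`. The helper name `check_wav_format` is hypothetical and not part of the SDK, and the SDK may well accept other formats; this only compares a file against the recommendation on this page.

```python
import io
import wave


def check_wav_format(source, sample_rate=16000, sample_width=2, channels=1):
    """Compare a WAV file against the recommended format.

    `source` is a file path or a binary file-like object.
    Returns a list of human-readable mismatches; an empty list means
    the file matches the recommendation (16-bit PCM, 16 kHz, mono).
    """
    problems = []
    # wave.open raises wave.Error for files that are not uncompressed PCM WAV.
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != sample_rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {sample_rate} Hz")
        if wav.getsampwidth() != sample_width:
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected {sample_width * 8}-bit")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels, expected {channels}")
    return problems
```

Calling `check_wav_format("path/to/custom.wav")` at startup and logging any returned mismatches is usually enough to spot a stereo or 44.1 kHz file before it reaches the agent.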
---

---
title: Call Transfer
hide_title: false
hide_table_of_contents: false
description: Learn how to enable your AI Agent to seamlessly transfer a live SIP call to a different phone number
pagination_label: Call Transfer
keywords:
  - AI Agent SDK
  - AI Telephony Agent
  - ai-agent
  - SIP
  - VideoSDK
  - Call Transfer
  - Telephony Integration
  - Webhooks
image: img/videosdklive-thumbnail.jpg
sidebar_position: 4
sidebar_label: Call Transfer
slug: call-transfer
---

import { AgentCardGrid, GithubIcon, DocumentIcon, } from "@site/src/components/agent/cards";

Call Transfer lets your AI Agent move an ongoing SIP call to another phone number without ending the current session. Instead of making the caller hang up and dial a new number, the agent can automatically route the call.

## How Call Transfer Works

- The agent evaluates the user’s intent to determine when a call transfer is required and then triggers the function tool.
- When the function tool is triggered, it tells the system to move the call to another phone number.
- The ongoing SIP call is forwarded to the new number instantly, without disconnecting or redialing.

## Trigger Call Transfer

To set up incoming call handling, outbound calling, and routing rules, check out the [Quick Start Example](https://docs.videosdk.live/telephony/ai-telephony-agent-quick-start#part-2-connect-your-agent-to-the-phone-network).

```python title="main.py"
import os

from videosdk.agents import Agent, function_tool


class CallTransferAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a call transfer agent that helps transfer an ongoing call to a new number. "
                "Use the transfer_call tool to transfer the call to the new number."
            ),
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello Buddy, How can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye Buddy, Thank you for calling!")

    @function_tool
    async def transfer_call(self) -> None:
        """Transfer the call to the provided number."""
        # Both values are read from the environment at call time.
        token = os.getenv("VIDEOSDK_AUTH_TOKEN")
        transfer_to = os.getenv("CALL_TRANSFER_TO")
        return await self.session.call_transfer(token, transfer_to)
```

## Example - Try It Yourself

}, ]} columns={2} />

---

---
title: Context Window
hide_title: false
hide_table_of_contents: false
description: "Learn about Context Window in the VideoSDK AI Agent SDK. Manage conversation history with automatic compression, truncation, and token budgeting for long-running voice agents."
pagination_label: "Context Window"
keywords:
  - Context Window
  - AI Agent SDK
  - VideoSDK Agents
  - Token Management
  - Conversation History
  - Context Compression
  - Context Truncation
  - LLM Memory
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Context Window
slug: context-window
---

import { AgentCardGrid, GithubIcon, PlayIcon, CodeIcon, ExternalLinkIcon, SettingsIcon } from '@site/src/components/agent/cards';

# Context Window

Context Window provides automatic conversation history management for the [Pipeline](https://docs.videosdk.live/ai_agents/core-components/pipeline). Instead of manually trimming messages, you configure a `ContextWindow` with token and item budgets. Before each LLM call, it compresses old turns into summaries and truncates excess items, so your agent keeps long-term memory without exceeding context limits.

![Context Window](https://cdn.videosdk.live/website-resources/docs-resources/context_window.jpg)

:::tip
Context Window replaces manual context management.
All token budgeting, history compression, and truncation is now handled automatically through a single configuration object. ::: ## How Context Window Works Context Window is configured on a `Pipeline` instance via the `context_window` parameter. It runs a two-step management cycle before every LLM call: | Step | Action | Purpose | | :--- | :--- | :--- | | **1. Compress** | Summarize old conversation turns via LLM | Preserve long-term memory without keeping every message | | **2. Truncate** | Remove oldest non-protected items | Enforce hard token and item count limits | Three item types are **always protected** and never removed: | Protected Item | Reason | | :--- | :--- | | **System message** | Agent instructions must persist | | **Summary message** | Compressed history is the agent's long-term memory | | **Last user message** | LLMs require conversation to end with a user turn | ```python title="main.py" from videosdk.agents import Agent, Pipeline, AgentSession, JobContext, RoomOptions, WorkerJob, ContextWindow from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.openai import OpenAILLM from videosdk.plugins.cartesia import CartesiaTTS from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector pipeline = Pipeline( stt=DeepgramSTT(), llm=OpenAILLM(), tts=CartesiaTTS(), vad=SileroVAD(), turn_detector=TurnDetector(), # Configure context window management context_window=ContextWindow( max_tokens=4000, max_context_items=20, keep_recent_turns=3, max_tool_calls_per_turn=10, ), ) ``` --- ## Configuration Parameters ### `max_tokens` Maximum estimated token budget for the entire conversation history. When exceeded, old turns are compressed then truncated. 
**Type:** `int | None` **Default:** `None` (no token limit) **Example:** ```python title="main.py" context_window=ContextWindow( max_tokens=4000, # ~5 city plans + conversation with 3 tools each ) ``` ### `max_context_items` Maximum number of items (messages + tool calls + tool results) in the context. Either limit can trigger compression/truncation. **Type:** `int | None` **Default:** `None` (no item limit) **Example:** ```python title="main.py" context_window=ContextWindow( max_context_items=20, # Keep context compact ) ``` ### `keep_recent_turns` Number of recent user-assistant exchanges kept verbatim during compression. Everything older gets summarized by the LLM. **Type:** `int` **Default:** `3` **Example:** ```python title="main.py" context_window=ContextWindow( keep_recent_turns=5, # Keep last 5 exchanges word-for-word ) ``` ### `max_tool_calls_per_turn` Maximum number of tool calls allowed in a single user turn. This is a safety limit to prevent infinite loops where the LLM keeps requesting tools without ever producing a text response. **Type:** `int` **Default:** `10` **Example:** ```python title="main.py" context_window=ContextWindow( max_tool_calls_per_turn=10, # Allow up to 10 sequential tool calls ) ``` :::note For multi-city queries like "Plan for Dubai AND Mumbai", each city requires 3 tool calls. Setting this to 10 gives headroom for 2-3 cities plus any redundant LLM calls. ::: ### `summary_llm` Optional separate LLM for generating summaries. If not set, the agent's main LLM is used automatically. Use a smaller/cheaper model to reduce costs. **Type:** `LLM | None` **Default:** `None` (uses agent's main LLM) **Example:** ```python title="main.py" from videosdk.plugins.openai import OpenAILLM context_window=ContextWindow( max_tokens=4000, keep_recent_turns=3, summary_llm=OpenAILLM(model="gpt-4o-mini"), # Cheaper model for summaries ) ``` --- ## The manage() Cycle The `manage()` method runs automatically before each LLM call. 
It performs two steps in order: ### Step 1: Compress When the context exceeds the token or item budget **and** there are enough old turns to compress (more than `keep_recent_turns + 1`), compression kicks in: 1. **Split** — Separate items into old turns and recent turns (keeping the last N user exchanges) 2. **Render** — Convert old items into human-readable text for the summarization prompt 3. **Summarize** — Call the LLM (or `summary_llm`) to generate a concise summary 4. **Replace** — Remove all old items and insert the summary as an assistant message marked `{"summary": True}` **What the summary preserves:** - Key facts, names, and numbers - Decisions made and their reasoning - Tool/function call results and outcomes - Commitments or promises the assistant made - User objectives, preferences, and unresolved tasks ### Step 2: Truncate After compression (or if compression wasn't needed), truncation enforces hard limits: 1. Remove the oldest non-protected items one at a time 2. Function call/output pairs are removed together to avoid orphaned tool calls 3. Continue until both `max_tokens` and `max_context_items` are satisfied 4. If only protected items remain, stop even if still over budget --- ## How Tool Chaining Works Context Window integrates seamlessly with tool chaining. Here's the lifecycle of a multi-tool turn: 1. User says "Plan for Dubai" → LLM returns `get_weather(Dubai)` 2. Tool executes → result added to context → LLM called again 3. LLM returns `get_clothing_advice(22°C)` → execute → call LLM again 4. LLM returns `get_activity_suggestion(22°C, "jacket")` → execute → call LLM 5. LLM returns text "Dubai is 22°C, wear a jacket, go hiking!" → spoken by TTS That's 3 tool calls + 1 text response = 4 rounds, well within `max_tool_calls_per_turn=10`. :::note Some LLMs (Anthropic Claude, OpenAI GPT-4o) can return multiple tool calls in a single response. 
These are collected and executed in parallel using `asyncio.gather`, then all results are added to context before the next LLM call. Google Gemini sends one tool call at a time (always sequential).
:::

---

## Complete Example

Here's a full example combining Context Window with tool chaining for a production-ready travel assistant:

```python title="main.py"
import aiohttp
from videosdk.agents import Agent, AgentSession, Pipeline, function_tool, JobContext, RoomOptions, WorkerJob, ContextWindow
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.cartesia import CartesiaTTS
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector


@function_tool
async def get_weather(city: str) -> dict:
    """Get the current weather temperature for a given city."""
    city_coords = {
        "dubai": (25.2048, 55.2708),
        "mumbai": (19.0760, 72.8777),
        "new york": (40.7128, -74.0060),
    }
    # Unknown cities fall back to Dubai's coordinates.
    coords = city_coords.get(city.lower(), (25.2048, 55.2708))
    lat, lon = coords
    url = f"https://api.open-meteo.com/v1/forecast?latitude={lat}&longitude={lon}&current=temperature_2m"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            if response.status == 200:
                data = await response.json()
                temp = data["current"]["temperature_2m"]
                return {"city": city, "temperature": temp, "unit": "Celsius"}
            else:
                return {"city": city, "temperature": 25, "unit": "Celsius", "note": "fallback"}


@function_tool
async def get_clothing_advice(temperature: float) -> dict:
    """Get clothing recommendation based on temperature."""
    if temperature > 35:
        advice = "Very light breathable clothes, hat, and sunscreen."
    elif temperature > 25:
        advice = "Light clothes like t-shirt and shorts."
    elif temperature > 15:
        advice = "Light jacket or sweater with comfortable pants."
    elif temperature > 5:
        advice = "Warm coat, scarf, and layered clothing."
    else:
        advice = "Heavy winter coat, gloves, hat, and thermal layers."
    return {"temperature": temperature, "clothing_advice": advice}


class TravelAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a helpful travel assistant. When a user asks what to do in a city:\n"
                "1. FIRST call get_weather to get the temperature\n"
                "2. THEN call get_clothing_advice with that temperature\n"
                "3. Combine results into a natural spoken response (2-3 sentences max)."
            ),
            tools=[get_weather, get_clothing_advice],
        )

    async def on_enter(self) -> None:
        await self.session.say("Hi! I'm your travel assistant. Ask me about any city!")

    async def on_exit(self) -> None:
        pass


async def start_session(context: JobContext):
    agent = TravelAgent()
    pipeline = Pipeline(
        stt=DeepgramSTT(),
        llm=OpenAILLM(),
        tts=CartesiaTTS(),
        vad=SileroVAD(),
        turn_detector=TurnDetector(),
        # ── Context Window Configuration ───────────────────────────
        #
        # max_tokens: Token budget for the conversation.
        #   With 2 tools per city, each city adds ~150 tokens.
        #   4000 tokens fits ~8 city plans + conversation.
        #
        # max_context_items: Maximum messages + tool calls.
        #   Either limit can trigger compression/truncation.
        #
        # keep_recent_turns: Recent exchanges kept verbatim.
        #   Everything older gets summarized by the LLM.
        #
        # max_tool_calls_per_turn: Safety limit per turn.
        #   Prevents infinite tool-call loops.
        #
        context_window=ContextWindow(
            max_tokens=4000,
            max_context_items=20,
            keep_recent_turns=3,
            max_tool_calls_per_turn=10,
        ),
    )
    session = AgentSession(agent=agent, pipeline=pipeline)
    await session.start(wait_for_participant=True, run_until_shutdown=True)


def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",
        name="Travel Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

---

## Token Estimation

Context Window uses a lightweight heuristic for token counting (~4 characters per token).
This is fast enough for real-time budget decisions but is not a replacement for provider-reported usage. | Item Type | Estimation Method | | :--- | :--- | | **Text message** | `len(text) // 4` | | **Image content** | Fixed 300 tokens | | **Function call** | `len(name) // 4 + len(arguments) // 4 + 5` | | **Function output** | `len(name) // 4 + len(output) // 4 + 5` | | **Per-item overhead** | 4 tokens | --- ## Parameter Reference | Parameter | Type | Default | Purpose | | :--- | :--- | :--- | :--- | | `max_tokens` | `int \| None` | `None` | Token budget for conversation history | | `max_context_items` | `int \| None` | `None` | Maximum items (messages + tool calls) | | `keep_recent_turns` | `int` | `3` | Recent exchanges kept verbatim | | `max_tool_calls_per_turn` | `int` | `10` | Safety limit for tool calls per turn | | `summary_llm` | `LLM \| None` | `None` | Optional dedicated LLM for summaries | ## Examples - Try Out Yourself } ]} /> --- --- title: Conversation Flow hide_title: false hide_table_of_contents: false description: "Explore the `Conversation Flow` component in the VideoSDK AI Agent SDK. Learn how it manages turn taking in the Agents" pagination_label: "Conversation Flow" keywords: - Conversation Flow - AI Agent SDK - VideoSDK Agents - AI Models - OpenAI - Gemini - Model Configuration - Streaming Audio - Multi-modal AI - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 5 sidebar_label: Conversation Flow slug: conversation-flow --- import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, ExternalLinkIcon } from '@site/src/components/agent/cards'; # Conversation Flow The `Conversation Flow` component manages the turn-taking logic in AI agent conversations, ensuring smooth and natural interactions. It is an inheritable class that allows you to inject custom logic into the `Cascade`, enabling advanced capabilities like context preservation, dynamic adaptation, and Retrieval-Augmented Generation (RAG) before the final LLM call. 
![Conversation Flow](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_conversation_flow.png) :::note Conversation Flow is a powerful feature that currently works exclusively with the [Cascade ↗](https://docs.videosdk.live/ai_agents/core-components/cascading-pipeline). ::: ## Core Features The key methods allow you to inject custom logic at different stages of the conversation flow, enabling sophisticated AI agent behaviors while maintaining clean separation of concerns: ### **Core Capabilities** - **Turn-taking Management:** Control the flow and timing of agent and user turns - **Context Preservation:** Maintain conversation history and user data across turns (handled automatically) - **Advanced Flow Control:** Build stateful conversations that can adapt to user input - **Performance Optimization:** Fine-tune conversation processing for speed and efficiency - **Error Handling:** Implement robust error recovery and fallback mechanisms ### **Advanced Use Cases** - **RAG Implementation:** Retrieve relevant documents and context before LLM processing - **Memory Management:** Store and recall conversation history across sessions - **Content Filtering:** Apply safety checks and content moderation on input/output - **Analytics & Logging:** Track conversation metrics and user behavior patterns (built-in metrics integration) - **Business Logic Integration:** Add domain-specific processing and validation rules - **Multi-step Workflows:** Implement complex conversation flows with state management - **Function Tool Execution:** Automatic execution of function tools when requested by the LLM. 
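
As one illustration of the content-filtering use case listed above, the sketch below wraps a stream of response chunks in an async generator that redacts blocked terms before they are passed on. It is a standalone example that does not import the SDK: `redact_stream`, `BLOCKED_TERMS`, and `fake_llm_stream` are hypothetical names used only for illustration. In a custom flow, a wrapper like this could presumably be applied to the chunks produced by the LLM before they are yielded.

```python
import asyncio
import re
from typing import AsyncIterator, Iterable, List

# Hypothetical blocklist; a real deployment would more likely call a moderation service.
BLOCKED_TERMS = ("password", "ssn")


async def redact_stream(
    chunks: AsyncIterator[str],
    blocked: Iterable[str] = BLOCKED_TERMS,
    mask: str = "[redacted]",
) -> AsyncIterator[str]:
    """Yield each chunk with blocked terms replaced by `mask` (case-insensitive).

    Limitation: a term that is split across two chunks is not caught.
    """
    pattern = re.compile("|".join(re.escape(term) for term in blocked), re.IGNORECASE)
    async for chunk in chunks:
        yield pattern.sub(mask, chunk)


async def fake_llm_stream() -> AsyncIterator[str]:
    # Stand-in for streamed LLM output.
    for chunk in ("Your Password is ", "hunter2."):
        yield chunk


async def collect_filtered() -> List[str]:
    # Gather the filtered chunks into a list so they can be inspected.
    return [chunk async for chunk in redact_stream(fake_llm_stream())]
```

Filtering chunk by chunk keeps streaming latency low, at the cost of missing terms that straddle a chunk boundary; buffering to sentence boundaries before filtering is the usual trade-off if that matters.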
## Basic Usage ### Complete Setup with CascadingPipeline The recommended approach is to use `ConversationFlow` with a `CascadingPipeline`, which handles component configuration automatically: ```python title="main.py" from videosdk.agents import ConversationFlow, Agent, CascadingPipeline # First, define your agent class MyAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful assistant." ) async def on_enter(self): # Initialize agent state pass async def on_exit(self): # Cleanup resources pass # Create pipeline and conversation flow pipeline = CascadingPipeline(stt=my_stt, llm=my_llm, tts=my_tts) conversation_flow = ConversationFlow(MyAgent()) # Pipeline automatically configures all components pipeline.set_conversation_flow(conversation_flow) ``` ### Constructor Parameters The ConversationFlow constructor accepts comprehensive configuration options: ```python ConversationFlow( agent: Agent, stt: STT | None = None, llm: LLM | None = None, tts: TTS | None = None, vad: VAD | None = None, turn_detector: EOU | None = None, denoise: Denoise | None = None ) ``` To add custom behavior, you inherit from `ConversationFlow` and override its methods. ## Built-in Methods ### Core Processing Methods - `process_with_llm()`: Processes the current chat context with the LLM and handles function tool execution automatically. - `say(message: str)`: Direct TTS synthesis for agent responses. - `process_text_input(text: str)`: Handle text input for A2A communication, bypassing STT. 
### Lifecycle Hooks Override these methods to add custom behavior at specific conversation points: ```python class CustomFlow(ConversationFlow): async def on_turn_start(self, transcript: str) -> None: """Called when a user turn begins.""" print(f"User said: {transcript}") async def on_turn_end(self) -> None: """Called when a user turn ends.""" print("Turn completed") ``` ## Automatic Features - **Context Management**: The conversation flow automatically manages the agent's chat context. Do not manually add user messages as this will create duplicates. - **Audio Processing**: Audio data is automatically processed through send_audio_delta(), handling denoising, STT, and VAD processing. - **Interruption Handling**: The system includes sophisticated interruption logic that gracefully handles user interruptions during agent responses. ## Custom Conversation Flows ### RAG (Retrieval-Augmented Generation) Integration Enhance your agent's knowledge by integrating RAG to retrieve relevant information from external documents and databases. **Benefits:** - Access external documents and FAQs - Reduce hallucinations with real data - Dynamic context retrieval ```python title="rag_example.py" class RAGConversationFlow(ConversationFlow): async def run(self, transcript: str) -> AsyncIterator[str]: # Retrieve relevant context context = await self.agent.retrieve_relevant_documents(transcript) # Add context to conversation if context: self.agent.chat_context.add_message( role="system", content=f"Use this information: {context}" ) # Generate response with enhanced context async for response in self.process_with_llm(): yield response ``` See our [RAG Integration Documentation](../core-components/rag) for complete implementation. ### Implementing Custom Flows You can create a custom flow by inheriting from `ConversationFlow` and overriding the `run` method. This allows you to intercept the user's transcript, modify it, manage state, and even change the response from the LLM. 
```python title="main.py" from typing import AsyncIterator from videosdk.agents import ConversationFlow, Agent class CustomConversationFlow(ConversationFlow): def __init__(self, agent): super().__init__(agent) self.turn_count = 0 async def run(self, transcript: str) -> AsyncIterator[str]: """Override the main conversation loop to add custom logic.""" self.turn_count += 1 # You can access and add to the agent's chat context before calling the LLM self.agent.chat_context.add_message(role=ChatRole.USER, content=transcript) # Process with the standard LLM call async for response_chunk in self.process_with_llm(): # Apply custom processing to the response processed_chunk = await self.apply_custom_processing(response_chunk) yield processed_chunk async def apply_custom_processing(self, chunk: str) -> str: """A helper method to modify the LLM's output.""" if self.turn_count == 1: # Prepend a greeting on the first turn return f"Hello! {chunk}" elif self.turn_count > 5: # Offer to summarize after many turns return f"This is an interesting topic. To summarize: {chunk}" else: return chunk ``` ### Advanced Turn-Taking Logic For more complex interactions, you can implement a state machine within your conversation flow to manage different states of the conversation. ```python title="main.py" class AdvancedTurnTakingFlow(ConversationFlow): def __init__(self, agent): super().__init__(agent) self.conversation_state = "listening" # Initial state async def run(self, transcript: str) -> AsyncIterator[str]: """A state-driven conversation loop.""" if self.conversation_state == "listening": # If we were listening, we now process the user's input # and transition to the responding state. 
await self.process_user_input(transcript) self.conversation_state = "responding" async for response_chunk in self.process_with_llm(): yield response_chunk # Once done responding, go back to listening self.conversation_state = "listening" elif self.conversation_state == "waiting_for_confirmation": # Handle a confirmation state if "yes" in transcript.lower(): yield "Great! Proceeding." self.conversation_state = "listening" else: yield "Okay, cancelling." self.conversation_state = "listening" async def process_user_input(self, transcript: str): """Custom logic for processing user input.""" print(f"Processing user input in state: {self.conversation_state}") # Add logic here, e.g., check if the user is asking a question that needs confirmation if "delete my account" in transcript.lower(): self.conversation_state = "waiting_for_confirmation" ``` ### Context-Aware Conversations Maintain conversation history and user preferences to create a personalized and context-aware experience. ```python title="main.py" import time class ContextAwareFlow(ConversationFlow): def __init__(self, agent): super().__init__(agent) self.conversation_history = [] self.current_topic = "general" async def run(self, transcript: str) -> AsyncIterator[str]: # First, update the context with the new transcript await self.update_context(transcript) # The agent's chat_context (automatically managed) will be # used by process_with_llm() to generate a context-aware response. 
async for response_chunk in self.process_with_llm(): yield response_chunk async def update_context(self, transcript: str): """Update history and identify the topic before calling the LLM.""" self.conversation_history.append({ 'role': 'user', 'content': transcript, 'timestamp': time.time() }) await self.identify_topic(transcript) # Add topic-specific context (system messages are safe to add) if hasattr(self, 'current_topic'): self.agent.chat_context.add_message( role=ChatRole.SYSTEM, content=f"System note: The user is asking about {self.current_topic}." ) async def identify_topic(self, transcript: str): """A simple topic identification logic.""" if "weather" in transcript.lower(): self.current_topic = "weather" elif "finance" in transcript.lower(): self.current_topic = "finance" ``` ## Performance Optimization To ensure the best user experience, consider the following optimization strategies: - **Efficient Context**: Keep the context provided to the LLM concise. Summarize earlier parts of the conversation to reduce token count and improve LLM response time. - **Asynchronous Operations**: When performing RAG or calling external APIs for data, ensure the operations are fully asynchronous (async/await) to avoid blocking the event loop. - **Caching**: Cache frequently accessed data (e.g., from a database or RAG store) to reduce lookup latency on subsequent turns. - **Streaming**: The run method returns an `AsyncIterator`. Process and yield response chunks as soon as they are available from the LLM to minimize perceived latency for the user. ## Examples - Try Out Yourself We have examples to get you started. Go ahead, try out, talk to agent, understand and customize according to your needs. } ]} /> --- --- title: De-noise hide_title: false hide_table_of_contents: false description: "Learn how to enhance voice quality by removing background noise in VideoSDK AI Agents. Implement real-time audio denoising for clearer conversations." 
pagination_label: "De-noise" keywords: - Voice Enhancement - Noise Removal - Audio Denoising - RNNoise - Audio Processing - Background Noise - Voice Quality - AI Agent SDK - VideoSDK Agents - Real-time Audio - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 7 sidebar_label: De-noise slug: de-noise --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards'; # De-noise De-noise improves audio quality in your AI agent conversations by filtering out background noise. This creates more professional and engaging interactions, especially in noisy environments. ## Overview The VideoSDK Agents framework provides real-time audio denoising capabilities via `RNNoise` plugin that: - **Remove Background Noise**: Filters out ambient sounds, keyboard typing, air conditioning, and other distractions - **Enhance Voice Clarity**: Improves speech intelligibility and quality - **Work in Real-time**: Processes audio with minimal latency during live conversations - **Integrate Seamlessly**: Works with `Pipeline` in both cascading and realtime modes ## What De-noise Solves Without noise removal, your agents may struggle with: - Poor audio quality affecting transcription accuracy - Background noise interfering with conversation flow - Unprofessional sound quality in business applications - Difficulty understanding users in noisy environments With De-noise, you get: - Crystal clear audio for better user experience - Improved speech-to-text accuracy - Professional-grade audio quality - Better performance in various acoustic environments ## RNNoise Implementation `RNNoise` is a real-time noise suppression library that uses deep learning to distinguish between speech and noise, providing effective background noise removal. 
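To make the real-time framing concrete, here is a minimal sketch of frame-based processing. The upstream RNNoise library works on 10 ms frames of 480 samples at 48 kHz; how the `videosdk.plugins.rnnoise` plugin buffers audio internally is not described here, so `FrameBuffer` and `suppress_frame` are hypothetical names, and the "denoiser" is a placeholder that simply halves the amplitude.

```python
from array import array

FRAME_SAMPLES = 480  # one 10 ms frame at 48 kHz, the frame size RNNoise expects


def suppress_frame(frame: array) -> array:
    """Placeholder for a real denoiser: halves each sample so the effect is visible."""
    return array("h", (sample // 2 for sample in frame))


class FrameBuffer:
    """Hypothetical helper that cuts an incoming PCM stream into fixed-size frames."""

    def __init__(self) -> None:
        self._pending = array("h")  # 16-bit signed samples not yet processed

    def push(self, samples: array) -> list[array]:
        # Append new audio, then emit every complete frame that is available.
        self._pending.extend(samples)
        frames = []
        while len(self._pending) >= FRAME_SAMPLES:
            frame = self._pending[:FRAME_SAMPLES]
            del self._pending[:FRAME_SAMPLES]
            frames.append(suppress_frame(frame))
        return frames

    @property
    def pending(self) -> int:
        # Samples waiting for the next push to complete a frame.
        return len(self._pending)


buf = FrameBuffer()
frames = buf.push(array("h", [100] * 1000))
```

Pushing 1000 samples yields two full frames and leaves 40 samples buffered until more audio arrives, which is why frame-based denoisers add at most one frame of latency.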
### Key Features - **Real-time Processing**: Low-latency noise removal suitable for live conversations - **Adaptive Filtering**: Automatically adjusts to different types of background noise - **Speech Preservation**: Maintains voice quality while removing unwanted sounds - **Lightweight**: Efficient processing with minimal computational overhead ### Basic Setup ```python from videosdk.plugins.rnnoise import RNNoise # Initialize noise removal denoise = RNNoise() ``` ## Pipeline Integration Add noise removal to your cascade: ```python title="main.py" from videosdk.agents import Agent, Pipeline, AgentSession # highlight-start from videosdk.plugins.rnnoise import RNNoise # highlight-end # Add your preferred providers from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.openai import OpenAILLM from videosdk.plugins.elevenlabs import ElevenLabsTTS from videosdk.plugins.silero import SileroVAD class EnhancedVoiceAgent(Agent): def __init__(self): super().__init__( instructions="You are a professional assistant with crystal-clear audio quality. Help users with their questions while maintaining excellent conversation flow." ) async def on_enter(self): await self.session.say("Hello! I'm here with enhanced audio quality for our conversation.") async def on_exit(self): await self.session.say("Goodbye! 
It was great talking with you.") # Set up pipeline with noise removal pipeline = Pipeline( stt=DeepgramSTT(api_key="your-deepgram-key"), llm=OpenAILLM(api_key="your-openai-key", model="gpt-4"), tts=ElevenLabsTTS(api_key="your-elevenlabs-key", voice_id="your-voice-id"), vad=SileroVAD(), # highlight-start denoise=RNNoise() # Enable noise removal # highlight-end ) # Create and start session async def main(): session = AgentSession(agent=EnhancedVoiceAgent(), pipeline=pipeline) await session.start() if __name__ == "__main__": import asyncio asyncio.run(main()) ``` Integrate with real-time models: ```python title="main.py" from videosdk.agents import Agent, Pipeline, AgentSession # highlight-start from videosdk.plugins.rnnoise import RNNoise # highlight-end from videosdk.plugins.openai import OpenAIRealtime class EnhancedRealtimeAgent(Agent): def __init__(self): super().__init__( instructions="You are a professional assistant with crystal-clear audio quality. Engage in natural, real-time conversations while providing helpful responses." ) async def on_enter(self): await self.session.say("Hello! I'm ready for a real-time conversation with enhanced audio quality.") async def on_exit(self): await self.session.say("Thank you for the conversation! Take care.") # Set up real-time model model = OpenAIRealtime( model="gpt-4o-realtime-preview", api_key="your-openai-key", voice="alloy" # Choose from: alloy, echo, fable, onyx, nova, shimmer ) # Set up pipeline with noise removal pipeline = Pipeline( llm=model, # highlight-start denoise=RNNoise() # Enable noise removal # highlight-end ) # Create and start session async def main(): session = AgentSession(agent=EnhancedRealtimeAgent(), pipeline=pipeline) await session.start() if __name__ == "__main__": import asyncio asyncio.run(main()) ``` ## Audio Processing Flow When noise removal is enabled, audio processing follows this flow: 1. **Raw Audio Input:** Microphone captures audio with background noise 2. 
**Noise Removal:** `RNNoise` filters out unwanted sounds 3. **Enhanced Audio:** Clean audio is passed to speech processing 4. **Improved Results:** Better transcription and conversation quality ## Example - Try Out Yourself } ]} /> --- --- title: DTMF Events hide_title: false hide_table_of_contents: false description: Learn how to enable and listen to DTMF (Dual-Tone Multi-Frequency) events in VideoSDK AI Agents. pagination_label: DTMF Events keywords: - SIP - VideoSDK - DTMF - Telephony Integration - Webhooks - PubSub - Events image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: DTMF Events slug: dtmf-events --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import { AgentCardGrid, GithubIcon, } from "@site/src/components/agent/cards"; DTMF (Dual-Tone Multi-Frequency) events happen when a caller presses keys (0–9, *, #) on their phone or SIP device during a call. AI agents can listen for these events to capture user input, run specific actions, or respond to the caller based on the key they pressed. DTMF provides a simple and reliable way for users to interact with the agent during a call. ## How It Works - **DTMF Event Detection**: The agent detects key presses (0–9, *, #) from the caller during a call session. - **Real-Time Processing**: Each key press generates a DTMF event that is delivered to the agent immediately. - **Callback Integration**: A user-defined callback function handles incoming DTMF events. - **Action Execution**: The agent executes actions or triggers workflows based on the received DTMF input like building IVR flows, collecting user input, or triggering actions in your application. ## How to enable DTMF Events ### Step 1 : Activate DTMF Detection DTMF event detection can be enabled in two ways: When creating an Inbound SIP gateway in the VideoSDK dashboard, enable the `DTMF` option. 
![dtmf-event](https://assets.videosdk.live/images/DTMF-events.png)

Alternatively, set the `enableDtmf` parameter to `true` when creating or updating a SIP gateway using the API.

```bash
# Double quotes let the shell expand $YOUR_TOKEN
curl -H "Authorization: $YOUR_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "name" : "Twilio Inbound Gateway",
    "enableDtmf" : "true",
    "numbers" : ["+0123456789"]
  }' \
  -XPOST https://api.videosdk.live/v2/sip/inbound-gateways
```

### Step 2 : Implementation

To set up inbound calls, outbound calls, and routing rules, check out the [Quick Start Example](https://docs.videosdk.live/telephony/managing-calls/making-outbound-calls).

```python title="main.py"
from videosdk.agents import AgentSession, DTMFHandler, JobContext

async def entrypoint(ctx: JobContext):
    # `agent` is your Agent instance, created earlier in the entrypoint.

    async def dtmf_callback(digit: int):
        if digit == 1:
            agent.instructions = "You are a Sales Representative. Your goal is to sell our products"
            await agent.session.say(
                "Routing you to Sales. Hi, I'm from Sales. How can I help you today?"
            )
        elif digit == 2:
            agent.instructions = "You are a Support Specialist. Your goal is to help customers with technical issues."
            await agent.session.say(
                "Routing you to Support. Hi, I'm from Support. What issue are you facing?"
            )
        else:
            await agent.session.say(
                "Invalid input. Press 1 for Sales or 2 for Support."
            )

    #highlight-start
    dtmf_handler = DTMFHandler(dtmf_callback)
    #highlight-end

    session = AgentSession(
        # Other arguments (agent, pipeline) omitted for brevity
        #highlight-start
        dtmf_handler = dtmf_handler,
        #highlight-end
    )
```

---

---
title: Fallback Adapter
hide_title: false
hide_table_of_contents: false
description: "Learn about Fallback and recovery for STT, LLM, and TTS providers in VideoSDK AI Agents."
pagination_label: "Fallback Adapter" keywords: - Fallback Adapter - Fallback and recovery - VideoSDK Agents - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Fallback Adapter slug: fallback-adapter --- import { AgentCardGrid, GithubIcon, } from "@site/src/components/agent/cards"; # Fallback Adapter The `Fallback Adapter` provides automatic failover between multiple STT, LLM, or TTS providers. If a provider becomes unavailable, the system automatically switches to the next configured provider without interrupting the session. ## Features - **Automatic Fallback**: Switches to lower-priority providers if the primary provider fails. - **Cooldown-based Retry**: Implements a cooldown period before retrying a failed provider, preventing immediate repeated failures. - **Auto-Recovery**: Automatically switches back to a higher-priority provider once it becomes healthy again. - **Permanent Disable**: Permanently disables a provider after a configured number of failed recovery attempts. ## Example Usage Here is how you can implement fallback providers for STT, LLM, and TTS in your agent configuration. 
```python from videosdk.agents import FallbackSTT, FallbackLLM, FallbackTTS from videosdk.plugins.openai import OpenAISTT, OpenAILLM, OpenAITTS from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.cerebras import CerebrasLLM from videosdk.plugins.cartesia import CartesiaTTS # Configure Fallback STT stt_provider = FallbackSTT( [OpenAISTT(), DeepgramSTT()], temporary_disable_sec=30.0, permanent_disable_after_attempts=3 ) # Configure Fallback LLM llm_provider = FallbackLLM( [OpenAILLM(model="gpt-4o-mini"), CerebrasLLM()], temporary_disable_sec=30.0, permanent_disable_after_attempts=3 ) # Configure Fallback TTS tts_provider = FallbackTTS( [OpenAITTS(voice="alloy"), CartesiaTTS()], temporary_disable_sec=30.0, permanent_disable_after_attempts=3 ) ``` ## Configuration Options You can configure the fallback behavior using the following parameters: | Parameter | Description | | :--- | :--- | | `temporary_disable_sec` | The duration (in seconds) to wait before retrying a failed provider. | | `permanent_disable_after_attempts` | The maximum number of recovery attempts allowed before a provider is permanently disabled. | ## Examples - Try Out Yourself } ]} /> --- --- title: Memory hide_title: false hide_table_of_contents: false description: "Enable your VideoSDK AI Agents with long-term memory to create personalized, context-aware conversations. This guide covers integrating memory providers like Mem0, retrieving context, and enhancing user experience." pagination_label: "Memory" keywords: - Memory - Long-term Memory - Conversation History - Personalization - Mem0 - AI Agent Memory - Context Retrieval - VideoSDK Agents - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 12 sidebar_label: Memory slug: memory --- import { AgentCardGrid, GithubIcon, CodeIcon } from '@site/src/components/agent/cards'; import Step from '@site/src/components/Step'; # Memory Give your AI agents the ability to remember past interactions and user preferences. 
By integrating a memory provider, your agent can move beyond the limits of its immediate context window to deliver truly personalized and context-aware conversations. ## How Memory Enhances Conversations A standard LLM's memory is limited to its context window. A dedicated memory provider solves this by creating a persistent, intelligent storage layer that recalls information across different sessions. ![Memory-enabled Conversation Flow](https://assets.videosdk.live/images/voice-agent-memory-manager.png) The agent stores key facts (name, preferences, interests) and retrieves them in later conversations to provide a personalized interaction. ## Implementation with Mem0 This guide demonstrates how to implement long-term memory using [**Mem0**](https://mem0.ai/). We'll build a personal assistant that remembers returning users. :::note For a complete working example, see the GitHub repository: - https://github.com/videosdk-live/agents-quickstart/tree/main/Memory ::: ### Prerequisites - A Mem0 API key from the [Mem0 dashboard](https://app.mem0.ai/). - Agent environment set up per the [AI Voice Agent Quickstart](/ai_agents/voice-agent-quick-start).
### Step 1: Create a Memory Manager
Create a class that wraps the Mem0 REST API. It handles fetching, storing, and deciding what's worth remembering.

```python title="main.py"
import httpx


class Mem0Memory:
    STORE_KEYWORDS = (
        "remember", "my name", "i like", "i dislike", "favorite",
        "i prefer", "i love", "i hate", "i'm", "i am", "i work",
    )

    def __init__(self, api_key: str, user_id: str):
        self.user_id = user_id
        self._client = httpx.AsyncClient(
            base_url="https://api.mem0.ai",
            headers={"Authorization": f"Token {api_key}", "Content-Type": "application/json"},
            timeout=10.0,
        )

    async def get_memories(self, limit: int = 5) -> list[str]:
        """Fetch all stored memories for this user."""
        try:
            r = await self._client.get("/v1/memories/", params={"user_id": self.user_id})
            r.raise_for_status()
            entries = r.json() if isinstance(r.json(), list) else r.json().get("results", [])
            return [
                e.get("memory", "")
                for e in entries
                if isinstance(e, dict) and e.get("memory", "").strip()
            ][:limit]
        except Exception:
            return []

    async def search(self, query: str, top_k: int = 5) -> list[str]:
        """Search for memories relevant to the user's current query."""
        try:
            r = await self._client.post(
                "/v1/memories/search/",
                json={"query": query, "user_id": self.user_id, "top_k": top_k},
            )
            r.raise_for_status()
            results = r.json() if isinstance(r.json(), list) else r.json().get("results", [])
            return [
                e.get("memory", "")
                for e in results
                if isinstance(e, dict) and e.get("memory", "").strip()
            ]
        except Exception:
            return []

    async def store(self, user_msg: str, assistant_msg: str | None = None):
        """Store a conversation turn. Mem0 extracts what's worth remembering."""
        messages = [{"role": "user", "content": user_msg}]
        if assistant_msg:
            messages.append({"role": "assistant", "content": assistant_msg})
        r = await self._client.post(
            "/v1/memories/", json={"messages": messages, "user_id": self.user_id}
        )
        r.raise_for_status()
```
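The `STORE_KEYWORDS` tuple is defined on the class but not referenced by any of its methods. One plausible use, shown here purely as an assumption, is a local pre-filter that decides whether a message is worth sending to Mem0 at all. The helper name `should_store` is hypothetical and not part of the example repository; the guide's own hooks store every turn and let Mem0 decide.

```python
# Same keywords as Mem0Memory.STORE_KEYWORDS, repeated so the sketch is self-contained.
STORE_KEYWORDS = (
    "remember", "my name", "i like", "i dislike", "favorite",
    "i prefer", "i love", "i hate", "i'm", "i am", "i work",
)


def should_store(message: str, keywords: tuple[str, ...] = STORE_KEYWORDS) -> bool:
    """Return True if the message contains any phrase that hints at a personal fact.

    Hypothetical helper: a plain case-insensitive substring check.
    """
    text = message.lower()
    return any(keyword in text for keyword in keywords)


personal = should_store("My name is Ada and I like chess")
small_talk = should_store("What time is it?")
```

A substring check like this is cheap but coarse, so it can produce false positives; skipping the filter and storing every turn trades extra API calls for better recall.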
### Step 2: Create the Agent with Personalized Greeting
At session startup, fetch stored memories and inject them into the agent's system prompt. The agent greets returning users differently from new users.

```python title="main.py"
from videosdk.agents import Agent


class PersonalAssistant(Agent):
    def __init__(self, instructions: str, memories: list[str]):
        self._memories = memories
        super().__init__(instructions=instructions)

    async def on_enter(self):
        if self._memories:
            await self.session.say("Hey! Welcome back. How can I help you today?")
        else:
            await self.session.say(
                "Hi there! I'm your personal assistant. "
                "Tell me about yourself so I can remember you next time!"
            )

    async def on_exit(self):
        await self.session.say("Bye! I'll remember everything for next time.")
```
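The prompt-injection step can be isolated into a small pure function, which makes it easy to test without a running session. This is a sketch: the name `build_instructions` is an assumption, while the string format mirrors the inline logic used in Step 4.

```python
def build_instructions(base: str, memories: list[str]) -> str:
    """Append known facts about the user to the base system prompt.

    Returns the base prompt unchanged for a new user with no memories.
    """
    if not memories:
        return base
    facts = "\n".join(f"- {memory}" for memory in memories)
    return f"{base}\n\nYou already know this about the user:\n{facts}"


base_prompt = "You are a friendly personal assistant."
new_user_prompt = build_instructions(base_prompt, [])
returning_user_prompt = build_instructions(base_prompt, ["Name is Ada", "Likes chess"])
```

Keeping the facts in a bulleted block after the base prompt leaves the agent's core persona untouched while still giving the model the stored context.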
### Step 3: Store Memories with Pipeline Hooks
Use two hooks together:

- **`user_turn_start`**: Search Mem0 for relevant memories and inject them into `chat_context` **before** the LLM runs. This lets the agent reference past conversations.
- **`llm`**: After the LLM responds, store the conversation turn. Mem0 automatically extracts what's worth remembering.

:::tip
Review core concepts in the [Pipeline Hooks](./pipeline-hooks) guide.
:::

```python title="main.py"
# Register these hooks inside entrypoint() (see Step 4), where `memory`,
# `agent`, and `pipeline` are in scope. `nonlocal` only works inside that
# enclosing function.
pending_msg = None

@pipeline.on("user_turn_start")
async def on_user(transcript: str):
    nonlocal pending_msg
    pending_msg = transcript

    # Search for relevant past memories and inject into context
    relevant = await memory.search(transcript)
    if relevant:
        context = "\n".join(f"- {m}" for m in relevant)
        agent.chat_context.add_message(
            role="system",
            content=f"Relevant memories about this user:\n{context}\n\nUse these to answer personally.",
        )

@pipeline.on("llm")
async def on_llm(data: dict):
    nonlocal pending_msg
    if not memory or not pending_msg:
        pending_msg = None
        return

    # Store every turn; Mem0 decides what's worth remembering
    await memory.store(pending_msg, data.get("text", ""))
    pending_msg = None
```
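The ordering of the two hooks (search before the LLM runs, store after it responds) can be simulated without the SDK or a Mem0 account. In this sketch, `InMemoryStore` is a hypothetical stand-in for `Mem0Memory` that matches on shared words, and `simulate_turn` is an assumed helper that plays the role of one pipeline turn.

```python
import asyncio


class InMemoryStore:
    """Hypothetical stand-in for Mem0Memory; keeps turns in a plain list."""

    def __init__(self) -> None:
        self.turns: list[tuple[str, str]] = []

    async def search(self, query: str) -> list[str]:
        # Naive relevance: any stored user message sharing a word with the query.
        words = set(query.lower().split())
        return [user for user, _ in self.turns if words & set(user.lower().split())]

    async def store(self, user_msg: str, assistant_msg: str) -> None:
        self.turns.append((user_msg, assistant_msg))


async def simulate_turn(memory: InMemoryStore, context: list[str],
                        transcript: str, reply: str) -> None:
    # Equivalent of the "user_turn_start" hook: inject memories before the LLM runs.
    for hit in await memory.search(transcript):
        context.append(f"memory: {hit}")
    # Equivalent of the "llm" hook: persist the finished turn.
    await memory.store(transcript, reply)


async def demo() -> tuple[list[str], list[tuple[str, str]]]:
    memory, context = InMemoryStore(), []
    await simulate_turn(memory, context, "my name is Ada", "Nice to meet you, Ada.")
    await simulate_turn(memory, context, "what is my name", "Your name is Ada.")
    return context, memory.turns


context, turns = asyncio.run(demo())
```

The first turn finds nothing to inject because the store is empty; the second turn retrieves the first user message, which is the behavior the two hooks provide across sessions.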
### Step 4: Wire Everything Together
Initialize the memory manager, build personalized instructions, create the pipeline, register hooks, and start the session.

```python title="main.py"
import os

from videosdk.agents import Agent, AgentSession, Pipeline, WorkerJob, JobContext, RoomOptions
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector


async def entrypoint(ctx: JobContext):
    # 1. Setup memory
    memory = Mem0Memory(api_key=os.getenv("MEM0_API_KEY"), user_id="demo-user")
    memories = await memory.get_memories()

    # 2. Build personalized instructions
    base = "You are a friendly personal assistant. Keep responses short and conversational."
    if memories:
        facts = "\n".join(f"- {m}" for m in memories)
        instructions = f"{base}\n\nYou already know this about the user:\n{facts}"
    else:
        instructions = base

    # 3. Create agent and pipeline
    agent = PersonalAssistant(instructions=instructions, memories=memories)
    pipeline = Pipeline(
        stt=DeepgramSTT(),
        llm=OpenAILLM(),
        tts=ElevenLabsTTS(),
        vad=SileroVAD(),
        turn_detector=TurnDetector(),
    )

    # 4. Register memory hooks (see Step 3)
    pending_msg = None

    @pipeline.on("user_turn_start")
    async def on_user(transcript: str):
        nonlocal pending_msg
        pending_msg = transcript

        # Search and inject relevant memories before LLM runs
        relevant = await memory.search(transcript)
        if relevant:
            context = "\n".join(f"- {m}" for m in relevant)
            agent.chat_context.add_message(
                role="system",
                content=f"Relevant memories about this user:\n{context}\n\nUse these to answer personally.",
            )

    @pipeline.on("llm")
    async def on_llm(data: dict):
        nonlocal pending_msg
        if not pending_msg:
            return

        # Store every turn; Mem0 decides what's worth remembering
        await memory.store(pending_msg, data.get("text", ""))
        pending_msg = None

    # 5. Start session
    session = AgentSession(agent=agent, pipeline=pipeline)
    await session.start(wait_for_participant=True, run_until_shutdown=True)


def make_context() -> JobContext:
    return JobContext(room_options=RoomOptions(name="Personal Assistant", playground=True))


if __name__ == "__main__":
    WorkerJob(entrypoint=entrypoint, jobctx=make_context).start()
```

---

---
title: Multi Agent Switching
hide_title: false
hide_table_of_contents: false
description: "Learn how to switch between multiple specialized agents in VideoSDK for context-aware workflow using real world examples"
pagination_label: "Multi Agent Switching"
keywords:
  - Multi Agent Switching
  - Multi-Agent Orchestration
  - VideoSDK Agents
  - VideoSDK AI Voice
  - Python SDK
  - Real-time Transcription
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Multi Agent Switching
slug: multi-agent-switching
---

import { AgentCardGrid, GithubIcon, } from "@site/src/components/agent/cards";

Multi agent switching allows you to break a complex workflow into multiple specialized agents, each responsible for a specific domain or task. Instead of relying on a single agent to manage every tool and decision, you can coordinate smaller agents that operate independently.

### Context Inheritance

When switching agents, you can control whether the new agent should be aware of the previous conversation using the `inherit_context` flag.

- **`inherit_context=True`**: The new agent receives the full chat context. This is ideal for maintaining continuity, so the user doesn't have to repeat information.
- **`inherit_context=False`** (Default): The new agent starts with a fresh state. This is useful when switching to a completely unrelated task.

## How It Works

- The primary VideoSDK agent identifies whether specialized assistance is needed based on the user's intent.
- It invokes a `function tool` to switch to the appropriate specialized agent.
- Control automatically shifts to the new agent, which has access to the previous chat context because `inherit_context=True` was passed.
- The specialized agent handles the user's request and completes the interaction.

### Implementation

```python title="main.py"
from videosdk.agents import Agent, function_tool


class TravelAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a travel assistant. Help users with general travel inquiries and guide them to booking when needed.""",
        )

    async def on_enter(self) -> None:
        await self.session.reply(instructions="Greet the user and ask how you can help with their travel plans.")

    async def on_exit(self) -> None:
        await self.session.say("Safe travels!")

    @function_tool()
    async def transfer_to_booking(self) -> Agent:
        """Transfer the user to a booking specialist for reservations and scheduling."""
        return BookingAgent(inherit_context=True)


class BookingAgent(Agent):
    def __init__(self, inherit_context: bool = False):
        super().__init__(
            instructions="""You are a booking specialist. Help users book or modify flights, hotels, and travel reservations.""",
            inherit_context=inherit_context
        )

    async def on_enter(self) -> None:
        await self.session.say("I'm a booking specialist. What would you like to book or modify today?")

    async def on_exit(self) -> None:
        await self.session.say("Your booking request is complete. Have a great trip!")
```

## Example - Try It Yourself

- [Health Care Agent Example](https://github.com/videosdk-live/agents-quickstart/blob/main/Multi%20Agent%20Switch/Health%20Care%20agent/): Implement Health Care Agent Example

---

---
title: Overview
hide_title: false
hide_table_of_contents: false
description: "Get an overview of the VideoSDK AI Agent SDK, a framework for building AI agents for real-time conversations. Learn about its core components: Agent, Pipeline, and Agent Session."
pagination_label: "Overview"
keywords:
  - AI Agent SDK
  - VideoSDK Agents
  - Core Components
  - Agent
  - Pipeline
  - Agent Session
  - Real-time AI
  - Agentic Workflow
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Overview
slug: overview
---

The VideoSDK AI Agent SDK provides a powerful framework for building AI agents that can participate in real-time conversations. This guide explains the core components and demonstrates how to create a complete agentic workflow. The SDK serves as a real-time bridge between AI models and your users, facilitating seamless voice and media interactions.

## Architecture

The Agent Session orchestrates the entire workflow, combining the Agent with a Pipeline for real-time communication. The unified Pipeline automatically detects the best mode based on the components you provide, whether that's a full cascade STT-LLM-TTS setup, a realtime speech-to-speech model, or a hybrid of both.

![Overview](https://cdn.videosdk.live/website-resources/docs-resources/agent_v1_architecture.jpg)

1. **Agent** - This is the base class for defining your agent's identity and behavior. Here, you can configure custom instructions, manage its state, and register function tools.
2. **Pipeline** - This unified component manages the real-time flow of audio and data between the user and the AI models. It auto-detects the optimal mode based on the components you provide:
   - **Cascade Mode** - Provide STT, LLM, TTS, VAD, and Turn Detector for maximum flexibility and control over each processing stage.
   - **Realtime Mode** - Provide a realtime model (e.g., OpenAI Realtime, Google Gemini Live, AWS Nova Sonic) for lowest-latency speech-to-speech processing.
   - **Hybrid Mode** - Combine a realtime model with an external STT (for knowledge base support) or external TTS (for custom voice support).
3. **Agent Session** - This component brings together the agent and pipeline to manage the agent's lifecycle within a VideoSDK meeting.
4. **Pipeline Hooks** - A middleware system for intercepting and processing data at any stage of the pipeline. Use hooks for custom STT/TTS processing, observing or modifying LLM output, lifecycle events, and more.

## Supporting Components

These components work behind the scenes to support the core functionality of the AI Agent SDK:

- Execution & Lifecycle Management
  - **JobContext** - Provides the execution environment and lifecycle management for AI agents. It encapsulates the context in which an agent job is running.
  - **WorkerJob** - Manages the execution of jobs and worker processes using Python's multiprocessing, allowing for concurrent agent operations.
- Configuration & Settings
  - **RoomOptions** - This allows you to configure the behavior of the session, including room settings and other advanced features for the agent's interaction within a meeting.
  - **Options** - This is used to configure the behavior of the worker, including logging and other execution settings.
- External Integration
  - **MCP Servers** - These enable the integration of external tools through either stdio or HTTP transport.
    - **MCPServerStdio** - Facilitates direct process communication for local Python scripts.
    - **MCPServerHTTP** - Enables HTTP-based communication for remote servers and services.
## Advanced Features The AI Agent SDK includes a range of advanced features to build sophisticated conversational agents: import { AgentCardGrid, GithubIcon, RobotIcon, DocumentIcon, PlayIcon, CodeIcon, ExternalLinkIcon, SettingsIcon, EyeIcon, RecordingIcon, NetworkIcon, MCPServerIcon, } from '@site/src/components/agent/cards'; }, { title: "Playground Mode", description: "A testing environment to experiment with different agent configurations", link: "https://docs.videosdk.live/ai_agents/core-components/agent-session#playground-mode", icon: }, { title: "Vision Integration", description: "Enable agents to receive and process video input from the meeting", link: "https://docs.videosdk.live/ai_agents/core-components/vision-and-multi-modality", icon: }, { title: "Recording Capabilities", description: "Record agent sessions for analysis and quality assurance", link: "https://docs.videosdk.live/ai_agents/core-components/recording", icon: }, { title: "A2A Communication", description: "Allows for seamless collaboration between specialized AI agents", link: "https://docs.videosdk.live/ai_agents/a2a/overview", icon: }, { title: "MCP Server Integration", description: "Connect agents to external tools and data sources", link: "https://docs.videosdk.live/ai_agents/mcp-integration", icon: } ]} /> ## Examples - Try Out Yourself We have examples to get you started. Go ahead, try out, talk to agent and customize according to your needs. 
- [Human in the loop](https://github.com/videosdk-live/agents/tree/main/examples/human_in_the_loop): Implement human intervention capabilities in AI agent conversations for better control and oversight
- [Enhanced Pronunciation](https://github.com/videosdk-live/agents/blob/main/examples/enhanced_pronounciation.py): Improve speech quality and pronunciation accuracy for better user experience and communication clarity
- [PubSub Messaging](https://github.com/videosdk-live/agents/blob/main/examples/pubsub_example.py): Facilitates real-time messaging between agent and client

---

---
title: Pipeline Hooks
hide_title: false
hide_table_of_contents: false
description: "Learn about Pipeline Hooks in the VideoSDK AI Agent SDK. Intercept, observe, or modify processing at any stage of the pipeline with hooks for STT, LLM, TTS, vision, and lifecycle events."
pagination_label: "Pipeline Hooks"
keywords:
  - Pipeline Hooks
  - AI Agent SDK
  - VideoSDK Agents
  - Middleware
  - STT Hook
  - LLM Hook
  - TTS Hook
  - Vision Hook
  - Lifecycle Hooks
  - Conversation Flow
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 5
sidebar_label: Pipeline Hooks
slug: pipeline-hooks
---

import { AgentCardGrid, GithubIcon, PlayIcon, CodeIcon, ExternalLinkIcon, SettingsIcon } from '@site/src/components/agent/cards';

# Pipeline Hooks

Pipeline Hooks provide a middleware system for intercepting and processing data at different stages of the [Pipeline](https://docs.videosdk.live/ai_agents/core-components/pipeline). Instead of subclassing, you register hook functions using the `@pipeline.on()` decorator to add custom logic, from preprocessing audio before STT, to observing or modifying LLM output, to tracking lifecycle events.
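
To make the registration model concrete, here is a minimal, SDK-independent sketch of how a decorator such as `@pipeline.on()` can collect hook functions. `ToyPipeline`, `_single`, and `hooks_for` are hypothetical names invented for this illustration. The code is not the SDK's implementation; it only models the behaviour documented on this page, where a second `stt` or `tts` hook overwrites the first and other events accept several hooks.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class ToyPipeline:
    """Hypothetical stand-in for a pipeline; NOT the VideoSDK class."""

    # Assumption: events that keep only the most recently registered hook.
    _single = {"stt", "tts"}

    def __init__(self) -> None:
        self._hooks: DefaultDict[str, List[Callable]] = defaultdict(list)

    def on(self, event: str) -> Callable[[Callable], Callable]:
        """Return a decorator that registers a function under `event`."""
        def register(fn: Callable) -> Callable:
            if event in self._single:
                self._hooks[event] = [fn]      # overwrite the earlier hook
            else:
                self._hooks[event].append(fn)  # keep registration order
            return fn  # the decorated function stays usable as-is
        return register

    def hooks_for(self, event: str) -> List[Callable]:
        """Return a copy of the hooks registered for `event`."""
        return list(self._hooks[event])


toy = ToyPipeline()


@toy.on("user_turn_start")
async def log_turn(transcript: str) -> None:
    print(f"User said: {transcript}")


@toy.on("user_turn_start")
async def count_turn(transcript: str) -> None:
    print(f"{len(transcript.split())} words")


@toy.on("tts")
async def first_tts(text_stream):
    async for text in text_stream:
        yield text


@toy.on("tts")
async def second_tts(text_stream):
    async for text in text_stream:
        yield text.upper()
```

In this toy model both `user_turn_start` hooks are kept in registration order, while only `second_tts` remains registered for `tts`.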
![Pipeline Hooks](https://cdn.videosdk.live/website-resources/docs-resources/agent_v1_pipeline_hooks_1.jpg) :::tip Pipeline Hooks replace the previous `ConversationFlow` class. All custom turn-taking logic, RAG integration, content filtering, and lifecycle management is now done through hooks. ::: ## How Hooks Work Hooks are registered on a `Pipeline` instance using the `@pipeline.on("event_name")` decorator. There are three categories of hooks: | Category | Hooks | Purpose | | :--- | :--- | :--- | | **Stream Processing** | `stt`, `tts` | Replace or wrap the built-in STT/TTS processing | | **LLM** | `llm` | Observe or modify LLM-generated content before TTS | | **Lifecycle** | `user_turn_start`, `user_turn_end`, `agent_turn_start`, `agent_turn_end`, `vision_frame` | Observe events and trigger side effects | ```python title="main.py" from videosdk.agents import Pipeline from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.openai import OpenAILLM from videosdk.plugins.cartesia import CartesiaTTS from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector pipeline = Pipeline( stt=DeepgramSTT(), llm=OpenAILLM(), tts=CartesiaTTS(), vad=SileroVAD(), turn_detector=TurnDetector() ) # Register hooks using the decorator pattern @pipeline.on("user_turn_start") async def on_user_start(transcript: str): print(f"User said: {transcript}") @pipeline.on("agent_turn_end") async def on_agent_done(): print("Agent finished speaking") ``` --- ## Stream Processing Hooks Stream processing hooks replace or wrap the built-in component processing. They receive an async iterator as input and yield processed results. ### `stt` — Speech-to-Text Stream Hook Intercept and process the audio stream before/after STT processing. This hook **replaces** the built-in STT pipeline entirely — you are responsible for running STT within the hook. 
**Signature:** ```python async def stt_hook(audio_stream: AsyncIterator[bytes]) -> AsyncIterator[SpeechEvent] ``` **When it fires:** When audio arrives from the user and needs to be transcribed. **Behavior:** Only one `stt` hook can be registered. Registering a second will overwrite the first. **Example — Preprocess audio and normalize transcripts:** ```python title="main.py" import re @pipeline.on("stt") async def stt_hook(audio_stream): # Phase 1: Preprocess audio (filter small chunks) async def filtered_audio(): async for audio in audio_stream: if len(audio) < 300: continue # Skip very small audio chunks yield audio # Phase 2: Run STT on filtered audio and normalize output async for event in run_stt(filtered_audio()): if event.data and event.data.text: text = event.data.text.lower() # Remove filler words text = re.sub(r"\b(uh|um|like)\b", "", text) event.data.text = " ".join(text.split()) yield event ``` ### `tts` — Text-to-Speech Stream Hook Intercept and process the text stream before/after TTS synthesis. This hook **replaces** the built-in TTS processing — you run TTS within the hook. **Signature:** ```python async def tts_hook(text_stream: AsyncIterator[str]) -> AsyncIterator[bytes] ``` **When it fires:** When the LLM generates text that needs to be synthesized to speech. **Behavior:** Only one `tts` hook can be registered. Registering a second will overwrite the first. **Example — Format text before synthesis:** ```python title="main.py" @pipeline.on("tts") async def tts_hook(text_stream): # Preprocess text for better pronunciation async def preprocess(): async for text in text_stream: yield text.replace("AM", "A M").replace("PM", "P M") # Run TTS on preprocessed text async for audio in run_tts(preprocess()): yield audio ``` --- ## LLM Hook The `llm` hook fires when the LLM response is fully collected, before TTS synthesis begins. It can be used to **observe** or **modify** the generated response. 
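
The observe-or-modify contract can be modelled in plain Python. The sketch below is an illustration under stated assumptions, not SDK code: `run_llm_hooks` is a hypothetical helper that passes `{"text": ...}` to each hook in turn, keeps the text when a hook returns `None`, and replaces it when a hook returns a string or yields string chunks.

```python
import asyncio
import inspect
from typing import Callable, List


async def run_llm_hooks(text: str, hooks: List[Callable]) -> str:
    """Hypothetical helper: apply hooks in order, threading the text through."""
    for hook in hooks:
        result = hook({"text": text})
        if inspect.isasyncgen(result):
            # The hook yielded chunks: join them into the replacement text.
            chunks = [chunk async for chunk in result]
            text = "".join(chunks)
        else:
            returned = await result
            if returned is not None:
                text = returned  # a returned string replaces the response
    return text


seen: List[str] = []


async def observe(data: dict) -> None:
    # Observe only: record the text and return nothing.
    seen.append(data["text"])


async def redact(data: dict) -> str:
    # Modify by returning a new string.
    return data["text"].replace("SSN", "[REDACTED]")


async def spell_out(data: dict):
    # Modify by yielding chunks.
    yield data["text"].replace("PM", "P M")


final_text = asyncio.run(
    run_llm_hooks("Your SSN is on file. Call at 5 PM.", [observe, redact, spell_out])
)
print(final_text)
```

Here `observe` sees the original text, `redact` replaces it, and `spell_out` receives the already redacted text, which is the chaining order the hook list defines.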
### `llm`

**Signature:**

```python
# Observe only (return None or no return)
async def on_llm(data: dict) -> None

# Modify response (return str or yield str chunks)
async def on_llm(data: dict) -> str | AsyncIterator[str]
```

**Data format:** `{"text": "the full generated response"}`

**When it fires:** After the LLM has finished generating its full response, before TTS synthesis.

**Behavior:** You can register **multiple** `llm` hooks. They are chained: each receives the (possibly modified) text from the previous hook. If a hook returns or yields a string, it replaces the response. If it returns `None`, the text passes through unchanged.

**Example (observe):**

```python title="main.py"
import logging

@pipeline.on("llm")
async def on_llm(data: dict):
    text = data.get("text", "")
    logging.info(f"LLM generated: {text[:100]}...")
    # store_response is a placeholder for your own persistence function
    await store_response(text)
```

**Example (modify by returning):**

```python title="main.py"
@pipeline.on("llm")
async def filter_response(data: dict):
    text = data.get("text", "")
    # Remove sensitive information before TTS
    return text.replace("SSN", "[REDACTED]")
```

**Example (modify by yielding chunks):**

```python title="main.py"
@pipeline.on("llm")
async def format_response(data: dict):
    text = data.get("text", "")
    yield text.replace("AM", "A M").replace("PM", "P M")
```

## Vision Hook

The `vision_frame` hook processes video frames when vision is enabled. Hooks are chained: each hook receives the output of the previous one.

### `vision_frame`

**Signature:**

```python
async def vision_hook(frame_stream: AsyncIterator[av.VideoFrame]) -> AsyncIterator[av.VideoFrame]
```

**Example:**

```python title="main.py"
@pipeline.on("vision_frame")
async def process_frames(frame_stream):
    async for frame in frame_stream:
        # Apply custom filter or analysis (apply_filter is your own function)
        processed = apply_filter(frame)
        yield processed
```

---

## Lifecycle Hooks

Lifecycle hooks are side-effect-only hooks for observing events.
They don't modify data flow — use them for logging, analytics, state management, and triggering external actions. :::note You can register **multiple** lifecycle hooks for the same event. All registered hooks will be called in order. ::: ### `user_turn_start` Called when the user's final transcript is available and a new turn begins. **Signature:** ```python async def on_user_turn_start(transcript: str) -> None ``` **Example:** ```python title="main.py" @pipeline.on("user_turn_start") async def on_user_turn_start(transcript: str): logging.info(f"User said: {transcript}") await analytics.track_user_input(transcript) ``` ### `user_turn_end` Called after the agent finishes generating and speaking its response for the current turn. **Signature:** ```python async def on_user_turn_end() -> None ``` **Example:** ```python title="main.py" @pipeline.on("user_turn_end") async def on_user_turn_end(): logging.info("Turn completed") await save_turn_to_history() ``` ### `agent_turn_start` Called when the agent starts speaking (first audio byte is sent to the audio track). **Signature:** ```python async def on_agent_turn_start() -> None ``` **Example:** ```python title="main.py" @pipeline.on("agent_turn_start") async def on_agent_turn_start(): logging.info("Agent started speaking") ``` ### `agent_turn_end` Called when the agent finishes speaking (audio track buffer has fully played out). 
**Signature:** ```python async def on_agent_turn_end() -> None ``` **Example:** ```python title="main.py" @pipeline.on("agent_turn_end") async def on_agent_turn_end(): logging.info("Agent finished speaking") await update_conversation_state() ``` --- ## Complete Example Here's a full example combining multiple hooks for a production-ready voice agent: ```python title="main.py" import logging import re from videosdk.agents import Agent, Pipeline, AgentSession, JobContext from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.google import GoogleLLM from videosdk.plugins.cartesia import CartesiaTTS from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector class VoiceAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful customer support agent." ) async def entrypoint(ctx: JobContext): agent = VoiceAgent() pipeline = Pipeline( stt=DeepgramSTT(), llm=GoogleLLM(), tts=CartesiaTTS(), vad=SileroVAD(), turn_detector=TurnDetector(), ) # --- Stream Processing Hooks --- @pipeline.on("stt") async def stt_hook(audio_stream): """Filter audio and clean up transcripts""" async def filtered_audio(): async for audio in audio_stream: if len(audio) < 300: continue yield audio async for event in run_stt(filtered_audio()): if event.data and event.data.text: text = event.data.text.lower() text = re.sub(r"\b(uh|um|like)\b", "", text) event.data.text = " ".join(text.split()) yield event @pipeline.on("tts") async def tts_hook(text_stream): """Improve pronunciation of abbreviations""" async def preprocess(): async for text in text_stream: yield text.replace("AM", "A M").replace("PM", "P M") async for audio in run_tts(preprocess()): yield audio # --- LLM Hook --- @pipeline.on("llm") async def on_llm(data: dict): """Observe or modify LLM output""" logging.info(f"[LLM] {data.get('text', '')[:80]}...") # --- Lifecycle Hooks --- @pipeline.on("user_turn_start") async def on_user_start(transcript: str): 
        logging.info(f"[USER] {transcript}")

    @pipeline.on("user_turn_end")
    async def on_user_end():
        logging.info("[TURN END]")

    @pipeline.on("agent_turn_start")
    async def on_agent_start():
        logging.info("[AGENT SPEAKING]")

    @pipeline.on("agent_turn_end")
    async def on_agent_done():
        logging.info("[AGENT DONE]")

    # --- Start Session ---
    session = AgentSession(agent=agent, pipeline=pipeline)
    await session.start(wait_for_participant=True, run_until_shutdown=True)
```

## Migration from ConversationFlow

If you were previously using `ConversationFlow`, here's how to migrate to Pipeline Hooks:

| ConversationFlow Pattern | Pipeline Hooks Equivalent |
| :--- | :--- |
| `on_turn_start(transcript)` | `@pipeline.on("user_turn_start")` |
| `on_turn_end()` | `@pipeline.on("user_turn_end")` |
| Override `run()` for RAG | Use knowledge base or custom LLM logic |
| Override `run()` for direct response | Use custom LLM logic |
| Custom STT processing | `@pipeline.on("stt")` |
| Custom TTS processing | `@pipeline.on("tts")` |
| State machine in `run()` | Use conversational graph or custom logic |

**Before (ConversationFlow):**

```python
class RAGFlow(ConversationFlow):
    async def on_turn_start(self, transcript: str):
        logging.info(f"User: {transcript}")

    async def run(self, transcript: str):
        context = await retrieve_docs(transcript)
        if context:
            self.agent.chat_context.add_message(
                role="system", content=f"Context: {context}"
            )
        async for chunk in self.process_with_llm():
            yield chunk

pipeline = CascadingPipeline(stt=stt, llm=llm, tts=tts)
flow = RAGFlow(agent)
pipeline.set_conversation_flow(flow)
```

**After (Pipeline Hooks):**

```python
pipeline = Pipeline(stt=stt, llm=llm, tts=tts, vad=vad, turn_detector=turn_detector)

@pipeline.on("user_turn_start")
async def on_user_start(transcript: str):
    logging.info(f"User: {transcript}")

@pipeline.on("llm")
async def on_llm(data: dict):
    text = data.get("text", "")
    logging.info(f"LLM generated: {text[:100]}")
```

## Hook Reference

| Hook | Signature | Multiple? | Purpose |
| :--- | :--- | :--- | :--- |
| `stt` | `(AsyncIterator[bytes]) → AsyncIterator[SpeechEvent]` | No | STT processing |
| `tts` | `(AsyncIterator[str]) → AsyncIterator[bytes]` | No | TTS processing |
| `llm` | `(dict) → str \| AsyncIterator[str] \| None` | Yes | Observe or modify LLM-generated content |
| `vision_frame` | `(AsyncIterator[VideoFrame]) → AsyncIterator[VideoFrame]` | Yes | Process video frames |
| `user_turn_start` | `(str) → None` | Yes | Observe user turn start |
| `user_turn_end` | `() → None` | Yes | Observe user turn end |
| `agent_turn_start` | `() → None` | Yes | Observe agent speech start |
| `agent_turn_end` | `() → None` | Yes | Observe agent speech end |

## Examples - Try Out Yourself

- [Understanding Hooks](https://youtu.be/ZGqmIu-tE18?feature=shared): Video explaining pipeline hooks

---

---
title: Pipeline Observability
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Pipeline Observability hooks in the VideoSDK AI Agent SDK. Capture metrics, recording lifecycle events, errors, and access session context history for debugging and post-processing."
pagination_label: "Pipeline Observability"
keywords:
  - Pipeline Observability
  - Metrics Hooks
  - AI Agent SDK
  - VideoSDK Agents
  - STT Latency
  - LLM TTFT
  - TTS TTFB
  - EOU Latency
  - Realtime Metrics
  - Recording Lifecycle
  - Error Hook
  - Context History
  - Session Debugging
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Pipeline Observability
slug: pipeline-observability
---

import { AgentCardGrid, GithubIcon, PlayIcon, CodeIcon, ExternalLinkIcon, SettingsIcon } from '@site/src/components/agent/cards';

# Pipeline Observability

Pipeline Observability gives you first-class visibility into what your agent is doing at runtime.
You can capture component-level latency and token usage, observe recording lifecycle events, centralize error handling across the [Pipeline](https://docs.videosdk.live/ai_agents/core-components/pipeline), and inspect session context for post-processing — all through a small set of decorators on the `Pipeline` and a method on the `AgentSession`. :::tip These hooks are **side-effect-only**. They observe pipeline events without changing the data flow — safe to register many of them for logging, analytics, and external monitoring. ::: ## What You Get | Capability | API | Purpose | | :--- | :--- | :--- | | **Metrics Hooks** | `@pipeline.metrics.on(...)` | Capture latency, durations, and token usage across STT, LLM, TTS, EOU, and Realtime (S2S) | | **Recording Lifecycle Hooks** | `@pipeline.on("recording_started" \| "recording_stopped" \| "recording_failed")` | Observe when participant or track recording starts, stops, or fails | | **Error Hook** | `@pipeline.on("error")` | Centralize error handling across the entire pipeline | | **Context Access** | `session.get_context_history(...)` | Read session messages for debugging and post-processing | --- ## Metrics Hooks Metrics hooks fire at the end of each component's work for a turn and deliver a `dict` payload describing latency, durations, and (for LLM / Realtime) token usage. Use them to ship metrics to your APM, log lines to a SIEM, or compute SLOs in real time. Register a metrics hook with the `@pipeline.metrics.on("")` decorator: ```python title="main.py" @pipeline.metrics.on("llm") def on_llm_metrics(metrics: dict): print(f"[METRICS] LLM TTFT: {metrics.get('llm_ttft')}ms") ``` ### `stt` — Speech-to-Text Metrics **When it fires:** When an STT turn completes (final transcript available). 
**Payload keys:** `stt_latency` ```python title="main.py" @pipeline.metrics.on("stt") def on_stt_metrics(metrics: dict): """Fired when STT turn completes.""" print(f"[METRICS] STT Latency: {metrics.get('stt_latency')}ms") ``` ### `llm` — LLM Metrics **When it fires:** When the LLM finishes generating its response for a turn. **Payload keys:** `llm_ttft`, `llm_duration`, `prompt_tokens`, `completion_tokens`, `total_tokens` ```python title="main.py" @pipeline.metrics.on("llm") def on_llm_metrics(metrics: dict): """Fired when LLM generation completes.""" print( f"[METRICS] LLM TTFT: {metrics.get('llm_ttft')}ms | " f"Total Duration: {metrics.get('llm_duration')}ms" ) print( "[METRICS] LLM Tokens (P/C/T): " f"{metrics.get('prompt_tokens')}/" f"{metrics.get('completion_tokens')}/" f"{metrics.get('total_tokens')}" ) ``` ### `tts` — Text-to-Speech Metrics **When it fires:** When TTS finishes synthesizing the agent's response for a turn. **Payload keys:** `ttfb`, `tts_latency` ```python title="main.py" @pipeline.metrics.on("tts") def on_tts_metrics(metrics: dict): """Fired when TTS finishes speaking.""" print( f"[METRICS] TTS TTFB: {metrics.get('ttfb')}ms | " f"Total Latency: {metrics.get('tts_latency')}ms" ) ``` ### `eou` — End-of-Utterance Metrics **When it fires:** When the Turn Detector matches end-of-utterance. **Payload keys:** `eou_latency` ```python title="main.py" @pipeline.metrics.on("eou") def on_eou_metrics(metrics: dict): """Fired when TurnDetector matches end-of-utterance.""" print(f"[METRICS] EOU Latency: {metrics.get('eou_latency')}ms") ``` ### `realtime` — Realtime (S2S) Metrics **When it fires:** For realtime / speech-to-speech models like OpenAI Realtime, Gemini Live, or AWS Nova Sonic — fires once per turn after the model responds. 
**Payload keys:** `realtime_ttfb`, `realtime_input_tokens`, `realtime_output_tokens`, `realtime_total_tokens`, `realtime_input_text_tokens`, `realtime_output_text_tokens`, `realtime_input_audio_tokens`, `realtime_output_audio_tokens` ```python title="main.py" @pipeline.metrics.on("realtime") def on_realtime_metrics(metrics: dict): """Fired for realtime (speech-to-speech) models.""" print( "[METRICS] Realtime " f"TTFB: {metrics.get('realtime_ttfb')}ms | " f"Tokens (in/out/total): " f"{metrics.get('realtime_input_tokens')}/" f"{metrics.get('realtime_output_tokens')}/" f"{metrics.get('realtime_total_tokens')} | " f"TextTokens (in/out): " f"{metrics.get('realtime_input_text_tokens')}/" f"{metrics.get('realtime_output_text_tokens')} | " f"AudioTokens (in/out): " f"{metrics.get('realtime_input_audio_tokens')}/" f"{metrics.get('realtime_output_audio_tokens')}" ) ``` :::note Use `stt`, `llm`, `tts`, and `eou` metrics with Cascade pipelines. Use `realtime` metrics when the pipeline runs in Realtime (S2S) or Hybrid mode with a realtime model as the LLM. ::: ### Metrics Reference | Hook | Mode | Key Fields | | :--- | :--- | :--- | | `stt` | Cascade | `stt_latency` | | `llm` | Cascade | `llm_ttft`, `llm_duration`, `prompt_tokens`, `completion_tokens`, `total_tokens` | | `tts` | Cascade | `ttfb`, `tts_latency` | | `eou` | Cascade | `eou_latency` | | `realtime` | Realtime / Hybrid | `realtime_ttfb`, `realtime_input_tokens`, `realtime_output_tokens`, `realtime_total_tokens`, `realtime_input_text_tokens`, `realtime_output_text_tokens`, `realtime_input_audio_tokens`, `realtime_output_audio_tokens` | --- ## Recording Lifecycle Hooks Recording lifecycle hooks let you observe the start, stop, and failure of [Recording](https://docs.videosdk.live/ai_agents/core-components/recording) without polling APIs. They fire for both participant and track recordings. :::note These hooks only fire when recording is enabled. 
Turn recording on via [Observability Options](https://docs.videosdk.live/ai_agents/tracing-observability/observability-options) on `session.start()`, or by setting `recording=True` in [RoomOptions](https://docs.videosdk.live/ai_agents/core-components/room-options). ::: ### `recording_started` Fired when participant or track recording starts successfully. ```python title="main.py" @pipeline.on("recording_started") def on_recording_started(data): """Fired when participant or track recording starts successfully.""" print(f"[RECORDING HOOK] Started: {data}") ``` ### `recording_stopped` Fired when recording stops successfully (typically at session end). ```python title="main.py" @pipeline.on("recording_stopped") def on_recording_stopped(data): """Fired when recording stops successfully.""" print(f"[RECORDING HOOK] Stopped: {data}") ``` ### `recording_failed` Fired when recording fails to start or stop. Use this to surface issues to your monitoring system before the session ends. ```python title="main.py" @pipeline.on("recording_failed") def on_recording_failed(data): """Fired when recording fails to start or stop.""" print(f"[RECORDING HOOK] Failed: {data}") ``` --- ## Error Hook The `error` hook centralizes error handling across the pipeline. Instead of attaching error listeners on each component, register one hook on the pipeline and receive errors from STT, LLM, TTS, VAD, Turn Detector, and the underlying VideoSDK Room connection. **Payload keys:** `source`, `error` ```python title="main.py" @pipeline.on("error") def on_pipeline_error(data): """ Catch any errors from STT, LLM, TTS, VAD, Turn Detector, or the VideoSDK Room connection. 
""" source = data.get("source", "unknown") error = data.get("error", "No error details") print(f"[ERROR HOOK] Pipeline Error from {source}: {error}") ``` :::tip Pair the error hook with the [Fallback Adapter](https://docs.videosdk.live/ai_agents/core-components/fallback-adapter) — the hook gives you visibility into which component failed, while the adapter handles automatic provider failover. ::: --- ## Accessing Session Context `session.get_context_history()` returns the conversation messages accumulated during the session. Use it inside lifecycle methods like `Agent.on_exit()` to log final transcripts, send summaries to a backend, or build evaluation datasets. ### Signature ```python session.get_context_history( include_function_calls: bool = False, include_system_messages: bool = False, ) -> list[dict] ``` ### Parameters | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `include_function_calls` | `bool` | `False` | Include function/tool calls and their results in the returned history | | `include_system_messages` | `bool` | `False` | Include system messages (e.g., the agent's instructions) in the returned history | ### Return Value A list of message dicts. Each message has: - `role`: One of `user`, `assistant`, `system`, or `tool`. - `content`: A string, or a list of content parts (e.g., text and images for multi-modal turns). ### Example — Print Final Transcript on Session End ```python title="main.py" class MyVoiceAgent(Agent): def __init__(self): super().__init__( instructions="You are VideoSDK's Voice Agent." 
        )

    async def on_exit(self) -> None:
        history = self.session.get_context_history(
            include_function_calls=True,
            include_system_messages=False,
        )

        print("\n=== SESSION END: CONTEXT HISTORY ===")
        for msg in history:
            role = msg.get("role", "unknown").upper()
            content = msg.get("content", "")

            # Multi-modal turns carry a list of parts; keep text, label the rest
            if isinstance(content, list):
                text_blocks = [
                    c if isinstance(c, str) else "[Image/Other]"
                    for c in content
                ]
                content = " ".join(text_blocks)

            print(f"{role}: {content}")
        print("===================================\n")

        await self.session.say("Goodbye!")
```

---

## Complete Example

For a runnable script that wires up metrics, error, and recording hooks together with `session.get_context_history()` on exit, see the [Observability Hooks example on GitHub](https://github.com/videosdk-live/agents/blob/main/examples/observability_hooks.py).

:::tip Pair with the Dashboard
The metrics emitted by these hooks are also visualized on the VideoSDK Dashboard. See [Session Analytics](https://docs.videosdk.live/ai_agents/tracing-observability/session-analytics) and [Trace Insights](https://docs.videosdk.live/ai_agents/tracing-observability/traces) to inspect the same data alongside transcripts and recordings.
:::

## Examples - Try Out Yourself

- [Pipeline Hooks](/ai_agents/core-components/pipeline-hooks): Stream and lifecycle hooks for STT, LLM, TTS, and turns

---

---
title: Pipeline
hide_title: false
hide_table_of_contents: false
description: "Explore the unified `Pipeline` component in the VideoSDK AI Agent SDK. Learn how it auto-detects modes (Cascade, Realtime, Hybrid) based on the components you provide."
pagination_label: "Pipeline"
keywords:
  - Pipeline Component
  - AI Agent SDK
  - VideoSDK Agents
  - Cascade Pipeline
  - Realtime
  - Speech-to-Speech
  - STT LLM TTS
  - Pipeline Modes
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 3
sidebar_label: Pipeline
slug: pipeline
---

import { AgentCardGrid, GithubIcon, RobotIcon, PlayIcon, CodeIcon, ExternalLinkIcon, SettingsIcon, GeminiIcon, OpenAIIcon, AWSNovaSonicIcon } from '@site/src/components/agent/cards';

# Pipeline

The `Pipeline` is a unified component that configures itself based on the components you provide. Instead of choosing between separate pipeline classes, you pass the components you need, and the Pipeline detects the matching mode and wires everything together.

:::tip
The `Pipeline` replaces the previous `CascadingPipeline` and `RealTimePipeline` classes. You now use a single `Pipeline` that auto-detects the right mode. For custom turn-taking and processing logic previously handled by `ConversationFlow`, see [Pipeline Hooks](https://docs.videosdk.live/ai_agents/core-components/pipeline-hooks).
:::

## Core Architecture

The Pipeline auto-detects which mode to use based on the components you provide:

| Mode | Components Provided | Use Case |
| :--- | :--- | :--- |
| **Cascade** | STT + LLM + TTS + VAD + Turn Detector | Full voice agent with maximum control |
| **Realtime (S2S)** | Realtime model only (e.g., OpenAI Realtime, Gemini Live) | Lowest latency speech-to-speech |
| **Hybrid** | Realtime model + external STT or TTS | Knowledge base support, custom voice/STT |
| **LLM + TTS** | LLM + TTS | Text-in, voice-out |
| **STT + LLM** | STT + LLM | Voice-in, text-out |
| **Partial** | Any other combination | Custom setups |

## Basic Usage

### Cascade Mode

![Cascade Mode](https://cdn.videosdk.live/website-resources/docs-resources/agent_v1_cascade_pipeline_1.jpg)

Provide STT, LLM, TTS, VAD, and Turn Detector components to get a full Cascade pipeline with granular control over each stage.

```python title="main.py"
from videosdk.agents import Pipeline, Agent, AgentSession
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

class MyAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant."
        )

pipeline = Pipeline(
    stt=DeepgramSTT(),
    llm=OpenAILLM(),
    tts=ElevenLabsTTS(),
    vad=SileroVAD(),
    turn_detector=TurnDetector()
)
# The pipeline auto-detects full Cascade mode

session = AgentSession(agent=MyAgent(), pipeline=pipeline)
```

### Realtime Mode

![Realtime Mode](https://cdn.videosdk.live/website-resources/docs-resources/agent_v1_realtime_pipeline_1.jpg)

Pass a realtime model (e.g., OpenAI Realtime, Google Gemini Live, AWS Nova Sonic) as the `llm` parameter to get a speech-to-speech pipeline with minimal latency.
```python title="main.py"
from videosdk.agents import Pipeline, Agent, AgentSession
from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig

class MyAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant."
        )

model = OpenAIRealtime(
    model="gpt-4o-realtime-preview",
    config=OpenAIRealtimeConfig(
        voice="alloy",
        response_modalities=["AUDIO"]
    )
)

pipeline = Pipeline(llm=model)
# The pipeline auto-detects: REALTIME mode (FULL_S2S)

session = AgentSession(agent=MyAgent(), pipeline=pipeline)
```

In addition to OpenAI, the Pipeline also supports other realtime models like Google Gemini (Live API) and AWS Nova Sonic.

- [Google Gemini](https://docs.videosdk.live/ai_agents/plugins/realtime/google-live-api): More about Gemini Realtime Plugin
- [AWS Nova Sonic](https://docs.videosdk.live/ai_agents/plugins/realtime/aws-nova-sonic): More about AWSNovaSonic Realtime Plugin

### Hybrid Mode

Combine a realtime model with an external STT or TTS. The Pipeline auto-detects the hybrid sub-mode based on which additional components you provide, with no extra configuration needed.

**Hybrid STT**: Use your own STT provider with a realtime model. This is useful when you need local knowledge base (KB) retrieval, since the external STT gives you the transcript text needed to query your KB before the realtime model responds.
```python title="main.py" from videosdk.agents import Pipeline, Agent, AgentSession, KnowledgeBase, KnowledgeBaseConfig from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig from videosdk.plugins.sarvamai import SarvamAISTT from videosdk.plugins.silero import SileroVAD model = GeminiRealtime( model="gemini-3.1-flash-live-preview", config=GeminiLiveConfig( voice="Puck", response_modalities=["AUDIO"] ) ) # Provide external STT — Pipeline auto-detects hybrid_stt mode pipeline = Pipeline( stt=SarvamAISTT(), llm=model, vad=SileroVAD() ) ``` **Hybrid TTS** — Use your own TTS/voice provider with a realtime model. This is useful when you need a specific custom voice that the realtime model doesn't support. ```python title="main.py" from videosdk.agents import Pipeline from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig from videosdk.plugins.elevenlabs import ElevenLabsTTS model = OpenAIRealtime( model="gpt-4o-realtime-preview", config=OpenAIRealtimeConfig(voice="alloy") ) # Provide external TTS — Pipeline auto-detects hybrid_tts mode pipeline = Pipeline( llm=model, tts=ElevenLabsTTS() ) ``` #### Realtime Sub-Modes When using a realtime model, the Pipeline auto-detects the sub-mode: | Sub-Mode | What It Does | When To Use | | :--- | :--- | :--- | | `full_s2s` | End-to-end speech model (default) | Lowest latency, simplest setup | | `hybrid_stt` | External STT + Realtime LLM & TTS | Knowledge base retrieval, custom STT language support | | `hybrid_tts` | Realtime STT & LLM + External TTS | Custom voice support with a specific TTS provider | ## Advanced Configuration Fine-tune the behavior of each component by passing specific parameters during initialization. 
```python title="main.py"
from videosdk.agents import Pipeline, EOUConfig, InterruptConfig
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

stt = DeepgramSTT(
    model="nova-2",
    language="en",
    punctuate=True,
    diarize=True
)

llm = OpenAILLM(
    model="gpt-4o",
    temperature=0.7,
    max_tokens=1000
)

tts = ElevenLabsTTS(
    model="eleven_flash_v2_5",
    voice_id="21m00Tcm4TlvDq8ikWAM"
)

vad = SileroVAD(
    threshold=0.35,
    min_silence_duration=0.5
)

turn_detector = TurnDetector(
    threshold=0.8,
    min_turn_duration=1.0
)

pipeline = Pipeline(
    stt=stt,
    llm=llm,
    tts=tts,
    vad=vad,
    turn_detector=turn_detector,
    eou_config=EOUConfig(
        mode="ADAPTIVE",
        min_max_speech_wait_timeout=[0.5, 0.8]
    ),
    interrupt_config=InterruptConfig(
        mode="HYBRID",
        interrupt_min_duration=0.5,
        interrupt_min_words=2,
        resume_on_false_interrupt=False
    )
)
```

### Configuration Parameters

#### EOUConfig

Controls end-of-utterance detection, that is, how the pipeline decides the user has finished speaking.

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `mode` | `"DEFAULT"` \| `"ADAPTIVE"` | `"DEFAULT"` | `ADAPTIVE` uses LLM confidence to adjust wait time |
| `min_max_speech_wait_timeout` | `[float, float]` | `[0.5, 0.8]` | Min and max wait time (seconds) after speech ends |

#### InterruptConfig

Controls how the pipeline handles user interruptions during agent speech.

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `mode` | `"VAD_ONLY"` \| `"STT_ONLY"` \| `"HYBRID"` | `"HYBRID"` | Detection method for interruptions |
| `interrupt_min_duration` | `float` | `0.5` | Minimum speech duration (seconds) to trigger interrupt |
| `interrupt_min_words` | `int` | `2` | Minimum words needed to confirm an interrupt |
| `false_interrupt_pause_duration` | `float` | `2.0` | Pause duration (seconds) on false interrupt |
| `resume_on_false_interrupt` | `bool` | `False` | Whether to resume agent speech after a false interrupt |

## Dynamic Component Changes

The Pipeline supports swapping components at runtime without restarting.
### Swap Individual Components

```python
# Change a single component during runtime
await pipeline.change_component(
    tts=new_tts_provider
)
```

### Reconfigure Entire Pipeline

```python
# Reconfigure the full pipeline (can change modes)
await pipeline.change_pipeline(
    stt=new_stt,
    llm=new_llm,
    tts=new_tts,
    vad=new_vad,
    turn_detector=new_turn_detector
)
```

:::note
`change_component()` swaps individual components within the same pipeline mode. Use `change_pipeline()` when you need to reconfigure the entire pipeline or switch modes (e.g., from Cascade to Realtime).
:::

## Plugin Ecosystem

There are multiple plugins available for STT, LLM, and TTS.

## Plugin Installation

Install the plugins you need:

```bash
# Install specific provider plugins
pip install videosdk-plugins-openai
pip install videosdk-plugins-elevenlabs
pip install videosdk-plugins-deepgram
```

## Plugin Development

To create custom plugins, follow the [plugin development guide ↗](https://github.com/videosdk-live/agents/blob/main/BUILD_YOUR_OWN_PLUGIN.md). Key requirements include:

- Inherit from the correct base class (`STT`, `LLM`, or `TTS`)
- Implement all abstract methods
- Handle errors consistently using `self.emit("error", message)`
- Clean up resources in the `aclose()` method

## Best Practices

1. **Component Selection:** Choose providers based on your specific requirements (latency, quality, cost)
2. **Mode Awareness:** Let the Pipeline auto-detect the mode: provide only the components you need and it will configure itself
3. **Error Handling:** Implement proper error handling and fallback strategies using the [Fallback Adapter](https://docs.videosdk.live/ai_agents/core-components/fallback-adapter)
4. **Resource Management:** Use the `cleanup()` method to properly release components
5. **Audio Format:** Ensure your custom plugins handle the 48kHz audio format correctly
6. **Custom Processing:** Use [Pipeline Hooks](https://docs.videosdk.live/ai_agents/core-components/pipeline-hooks) for custom turn-taking logic, RAG, content filtering, and lifecycle events

## Pipeline Mode Comparison

| Feature | Cascade Mode | Realtime Mode | Hybrid Mode |
| :--- | :--- | :--- | :--- |
| **Control** | Maximum control over each component | Integrated model control | Mix of both |
| **Flexibility** | Mix different providers | Single model provider | Partial provider choice |
| **Latency** | Higher due to sequential processing | Lowest with streaming | Between Cascade and Realtime |
| **Customization** | Extensive via hooks and config | Limited to model capabilities | Selective customization |
| **Complexity** | More components to configure | Simplest setup | Moderate |
| **Cost** | Per-component pricing | Single model pricing | Mixed pricing |

## Examples - Try Out Yourself

- [Pipeline Hooks](https://github.com/videosdk-live/agents/blob/main/examples/voice_pipeline_hooks.py): Pipeline with custom hooks
- [OpenAI Realtime](https://github.com/videosdk-live/agents-quickstart/tree/main/OpenAI): Realtime speech-to-speech implementation
- [Google Gemini (LiveAPI)](https://github.com/videosdk-live/agents-quickstart/tree/main/Google%20Gemini%20(LiveAPI)): Implement with Google Gemini (LiveAPI)
- [AWS Nova Sonic](https://github.com/videosdk-live/agents-quickstart/tree/main/AWS%20Nova%20Sonic): Implement with AWS Nova Sonic

---

---
title: Preemptive Response
hide_title: false
hide_table_of_contents: false
description: "Learn how to enable Preemptive generation for faster STT responses using the VideoSDK AI Agent SDK."
pagination_label: "Preemptive Response"
keywords:
  - Preemptive Generation
  - Preemptive Response
  - Deepgram STT
  - Speech To Text
  - VideoSDK Agents
  - Python SDK
  - Real-time Transcription
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Preemptive Response
slug: preemptive-response
---

# Preemptive Response

Preemptive Response is a feature that allows the Speech-to-Text (STT) engine to produce **partial, low-latency text output** while the user is still speaking. This is crucial for building highly responsive conversational AI agents. With preemptive response enabled, your agent can begin processing the user's intent and formulating a response before the full utterance is complete, significantly reducing the perceived latency.

## How It Works

![preemptive-response](https://assets.videosdk.live/images/preemptive-response.png)

- User audio is streamed to the STT, which generates partial transcripts.
- These partial transcripts are immediately sent to the LLM to enable preemptive (early) responses.
- The LLM output is then passed to the TTS to generate the spoken response.

## Prerequisites

Ensure you have the required packages installed:

```text
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"
```

:::tip
Currently, preemptive response generation is limited to Deepgram’s STT implementation and is available only in the Flux model.
:::

## Enabling Preemptive Generation

To enable this feature, set the `enable_preemptive_generation` flag to `True` when initializing your STT plugin (e.g., `DeepgramSTTV2`).

```python
from videosdk.plugins.deepgram import DeepgramSTTV2

stt = DeepgramSTTV2(
    enable_preemptive_generation=True
)
```

## Full Working Example

The following example demonstrates how to build a voice agent with preemptive transcription enabled. This setup uses Deepgram for STT, OpenAI for LLM, and ElevenLabs for TTS.
```python import asyncio import os from videosdk.agents import Agent, AgentSession, Pipeline, JobContext, RoomOptions, WorkerJob from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector, pre_download_model from videosdk.plugins.deepgram import DeepgramSTTV2 from videosdk.plugins.openai import OpenAILLM from videosdk.plugins.elevenlabs import ElevenLabsTTS # Pre-download the Turn Detector model to avoid delays during startup pre_download_model() class MyVoiceAgent(Agent): def __init__(self): super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.") async def on_enter(self): await self.session.say("Hello! How can I help you today?") async def on_exit(self): await self.session.say("Goodbye!") async def start_session(context: JobContext): # 1. Create the agent agent = MyVoiceAgent() # 2. Define the pipeline with Preemptive Generation enabled pipeline = Pipeline( stt=DeepgramSTTV2( model="flux-general-en", enable_preemptive_generation=True # Enable low-latency partials ), llm=OpenAILLM(model="gpt-4o"), tts=ElevenLabsTTS(model="eleven_flash_v2_5"), vad=SileroVAD(threshold=0.35), turn_detector=TurnDetector(threshold=0.8) ) # 3. 
Initialize the session session = AgentSession( agent=agent, pipeline=pipeline ) try: await context.connect() await session.start() # Keep the session running await asyncio.Event().wait() finally: # Clean up resources await session.close() await context.shutdown() def make_context() -> JobContext: room_options = RoomOptions( name="VideoSDK Cascaded Agent", playground=True ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` --- --- title: Pub/Sub Messaging hide_title: false hide_table_of_contents: false description: "Learn how to implement real-time, bidirectional communication between your VideoSDK AI Agent and client applications using Pub/Sub messaging. This guide covers sending and receiving messages, handling events, and practical use cases." pagination_label: "Pub/Sub Messaging" keywords: - Pub/Sub - Real-time Messaging - Bidirectional Communication - VideoSDK Agents - AI Agent SDK - Event Handling - Client-Agent Communication - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 11 sidebar_label: Pub/Sub Messaging slug: pubsub-messaging --- import { AgentCardGrid, GithubIcon, CodeIcon } from '@site/src/components/agent/cards'; import Step from '@site/src/components/Step'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Pub/Sub Messaging Pub/Sub (Publish/Subscribe) messaging enables real-time, bidirectional communication between your AI agent and client applications within a VideoSDK meeting. This allows you to build interactive experiences where the client can send commands or data to the agent, and the agent can push updates or notifications back to the client, all without relying on voice. 
![Pub/Sub Architecture Diagram](https://strapi.videosdk.live/uploads/user_agent_pubsub_chat_e62ce8f209.png) ## Key Features - **Send Messages**: Agents can publish messages to any specified Pub/Sub topic, which can be received by any participant (including client applications) subscribed to that topic. - **Receive Messages**: Agents can subscribe to topics to receive messages published by client applications or other participants. - **Bidirectional Flow**: Communication is not one-way. Both the agent and the client can publish and subscribe, enabling a fully interactive loop. - **Decoupled Communication**: The client and agent do not need to know about each other's existence directly. They communicate through shared topics, which simplifies the architecture. ## Implementation Implementing Pub/Sub involves two main parts: subscribing to topics to receive messages and publishing messages. Subscribing is typically the first step on both the agent and client side. Use the tabs below to see how to subscribe to a Pub/Sub topic across the AI Agent and client SDKs. 
```python title="Subscribe on Room Context" from videosdk import PubSubSubscribeConfig def on_client_message(message): print(f"Received: {message}") await ctx.room.subscribe_to_pubsub(PubSubSubscribeConfig( topic="CHAT", cb=on_client_message )) ``` ```js title="Subscribe on meeting join" // Subscribe to CHAT meeting.on("meeting-joined", () => { meeting.pubSub.subscribe("CHAT", (data) => { const { message, senderId, senderName, timestamp } = data; console.log("Client command:", message); }); }); ``` ```js title="usePubSub hook" import { usePubSub } from "@videosdk.live/react-sdk"; function ClientCommands() { usePubSub("CHAT", { onMessageReceived: ({ message, senderId }) => { console.log("Client command:", message); }, }); return null; } ``` ```js title="usePubSub hook" import { usePubSub } from "@videosdk.live/react-native-sdk"; function ClientCommands() { const { messages } = usePubSub("CHAT", { onMessageReceived: (message) => { console.log("Client command:", message.message); }, }); return null; } ``` ```swift title="Subscribe with listener" class ClientCommandsListener: PubSubMessageListener { func onMessageReceived(message: PubSubMessage) { print("Client command: \(message.message)") } } let listener = ClientCommandsListener() meeting?.pubsub.subscribe(topic: "CHAT", forListener: listener) ``` ```kotlin title="Subscribe with listener" val listener = PubSubMessageListener { message -> Log.d("PubSub", "Client command: ${message.message}") } meeting?.pubSub?.subscribe("CHAT", listener) ``` ```dart title="Subscribe with handler" void messageHandler(PubSubMessage message) { print("Client command: ${message.message}"); } final messages = await room.pubSub.subscribe( "CHAT", messageHandler, ); ``` The most effective way for an agent to publish messages is by exposing a `function_tool`. This allows the LLM to decide when to send a message based on the conversation. 
To publish, use `PubSubPublishConfig` and call the `publish_to_pubsub` method on the `JobContext` room object.

```python
from videosdk import PubSubPublishConfig
from videosdk.agents import Agent, function_tool, JobContext

class MyPubSubAgent(Agent):
    def __init__(self, ctx: JobContext):
        super().__init__(
            # The tool name in the instructions matches the method defined below
            instructions="You can send messages to the client using the send_message_to_client tool."
        )
        self.ctx = ctx

    @function_tool
    async def send_message_to_client(self, message: str):
        """Sends a text message to the client application on the 'CHAT' topic."""
        publish_config = PubSubPublishConfig(
            topic="CHAT",
            message=message
        )
        await self.ctx.room.publish_to_pubsub(publish_config)
        return f"Message '{message}' sent to client."
```

To receive messages, the agent must subscribe to a topic using `PubSubSubscribeConfig` and the `subscribe_to_pubsub` method, which registers a callback function to handle incoming messages. This setup is typically done in your main `entrypoint` function after connecting to the room.

```python
import asyncio
from videosdk import PubSubSubscribeConfig
from videosdk.agents import JobContext

# Define the callback function that will process incoming messages
def on_client_message(message):
    print(f"Received message from client: {message}")
    # Add your logic here to process the message.
    # For example, you could pass it to the agent's pipeline.

async def entrypoint(ctx: JobContext):
    # ... (agent and session setup)
    try:
        await ctx.connect()
        await ctx.room.wait_for_participant()

        # Configure the subscription
        subscribe_config = PubSubSubscribeConfig(
            topic="CHAT",
            cb=on_client_message
        )

        # Subscribe to the topic
        await ctx.room.subscribe_to_pubsub(subscribe_config)

        # Start the agent session
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await ctx.shutdown()
```

## Best Practices

- **Topic Naming Conventions**: Use clear and consistent topic names (e.g., `CHAT`, `AGENT_STATUS`) to keep your application organized.
- **Structured Data**: Use JSON for your message payloads. This makes messages easy to parse and allows for sending complex data structures.
- **Error Handling**: Your callback function should gracefully handle malformed or unexpected messages to prevent crashes.
- **Asynchronous Callbacks**: If your callback function performs long-running tasks, make sure it is `async` and consider running tasks in the background with `asyncio.create_task()` to avoid blocking the main event loop.

## Example - Try Out Yourself

Check out our quickstart repository for a complete, runnable example of an agent using Pub/Sub.

---

---
title: RAG (Retrieval-Augmented Generation)
hide_title: false
hide_table_of_contents: false
description: "Learn how to implement Retrieval-Augmented Generation (RAG) with VideoSDK AI Agents to enhance your agent's knowledge base with external documents, databases, and real-time information retrieval capabilities."
pagination_label: "RAG Integration"
keywords:
  - RAG
  - Retrieval-Augmented Generation
  - Knowledge Base
  - Document Retrieval
  - Vector Database
  - Embeddings
  - Semantic Search
  - Context Retrieval
  - AI Agent Knowledge
  - Document Processing
  - Information Retrieval
  - VideoSDK Agents
  - AI Agent SDK
  - Python
  - Real-time Knowledge
  - External Data Sources
  - Voice Agent Sessions
  - Knowledge Management
image: img/videosdklive-thumbnail.jpg
sidebar_position: 5
sidebar_label: RAG
slug: rag
---

import ReactPlayer from "react-player";

# RAG (Retrieval-Augmented Generation) Integration

**RAG** helps your AI agent find relevant information from documents to give better answers. It searches through your knowledge base and uses that context to respond more accurately.

## Architecture

![RAG](https://cdn.videosdk.live/website-resources/docs-resources/voice_agent_rag.png)

The RAG pipeline flow:

1. **Voice Input** → STT converts speech to text
2. **Retrieval** → Knowledge base fetches relevant documents based on the transcript
3. **Augmentation** → Retrieved context is injected into the LLM prompt
4. **Generation** → LLM generates a grounded response using the context
5. **Voice Output** → TTS converts response to speech

## Managed RAG

With Managed RAG, you can upload knowledge bases from the VideoSDK dashboard and attach them to your agent to enhance responses using retrieval-augmented generation.

#### Step 1: Upload Knowledge Base on the Dashboard

#### Step 2: Configure It in the Pipeline

After uploading, the Knowledge Base is assigned a unique ID (as shown in Step 1). Use this ID to load the Knowledge Base so the agent can fetch relevant information during conversations.

```python title="main.py"
import os
from videosdk.agents import KnowledgeBase, KnowledgeBaseConfig

# Initialize Knowledge Base with ID from Dashboard
kb_id = os.getenv("KNOWLEDGE_BASE_ID")
config = KnowledgeBaseConfig(id=kb_id, top_k=3)

# Load Knowledge Base and pass it to the agent
# (VoiceAgent is assumed to be your own Agent subclass, defined elsewhere)
agent = VoiceAgent(
    knowledge_base=KnowledgeBase(config)
)
```

## Custom RAG

Build your own RAG pipeline using any vector database (ChromaDB, Pinecone, etc.) with the `user_turn_start` hook. This hook fires when the user's transcript is ready, **before** the LLM is called, which makes it the right place to retrieve documents and inject context.
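The retrieve-then-inject flow described above can be tried out without any vector database or API key. The following is a minimal sketch under stated assumptions, not the SDK's implementation: `embed`, `cosine`, `retrieve`, and `build_context` are hypothetical helpers introduced only for illustration, and the bag-of-words "embedding" is a stand-in for a real embedding model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_context(docs: list[str]) -> str:
    """Format retrieved documents as a block that can be added to the prompt."""
    return "\n\n".join(f"Document {i + 1}: {doc}" for i, doc in enumerate(docs))


# Hypothetical sample data for illustration only
documents = [
    "a pipeline connects stt llm and tts components",
    "recording can be enabled through room options",
    "pubsub lets the agent exchange messages with clients",
]

top = retrieve("how does the pipeline connect stt and tts", documents, k=1)
context = build_context(top)
print(context)
```

In a real agent, the string returned by `build_context` is what would be added to the chat context inside the hook, and `embed` would be replaced by calls to an embedding model and a vector store.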
### Prerequisites ```bash pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]" pip install chromadb openai numpy ``` ```shell title=".env" DEEPGRAM_API_KEY = "Your Deepgram API Key" OPENAI_API_KEY = "Your OpenAI API Key" ELEVENLABS_API_KEY = "Your ElevenLabs API Key" VIDEOSDK_AUTH_TOKEN = "VideoSDK Auth token" ``` :::tip For a complete working example with all the code integrated together, check out our GitHub repository: [Custom RAG Example](https://github.com/videosdk-live/agents/blob/main/examples/custom_rag_agent.py) ::: ### Step 1: Agent with Vector Store Create a custom agent that initializes a ChromaDB collection with your documents and provides an async `retrieve()` method: ```python title="main.py" import os import chromadb from openai import OpenAI, AsyncOpenAI from videosdk.agents import Agent class RAGVoiceAgent(Agent): def __init__(self): super().__init__( instructions="""You are a helpful voice assistant that answers questions based on provided context. Use the retrieved documents to ground your answers. If no relevant context is found, say so. Be concise and conversational.""" ) self.openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY")) # Your documents — replace with your own data self.documents = [ "What is VideoSDK? VideoSDK is a comprehensive real-time communication platform that provides APIs and SDKs for video calling, live streaming, and AI-powered voice agents.", "How do I authenticate with VideoSDK? Use JWT tokens generated with your API key and secret from the VideoSDK dashboard. Set the token as the VIDEOSDK_AUTH_TOKEN environment variable.", "How do I build voice agents with VideoSDK? You can build voice agents by installing the Python library: pip install videosdk-agents. It supports Cascading, Realtime, and Hybrid modes. Visit https://www.videosdk.live/ for more information.", "What is a Pipeline in VideoSDK Agents? 
A Pipeline is a unified component that automatically detects the best mode (Cascading, Realtime, or Hybrid) based on the components you provide.", "If a user's question is related to VideoSDK and the answer is unknown, direct them to https://www.videosdk.live/ for more information." ] # Set up ChromaDB (in-memory; use PersistentClient for production) self.chroma_client = chromadb.Client() self.collection = self.chroma_client.create_collection(name="rag_docs") self._initialize_knowledge_base() def _initialize_knowledge_base(self): """Generate embeddings and store documents.""" client = OpenAI(api_key=os.getenv("OPENAI_API_KEY")) embeddings = [] for doc in self.documents: resp = client.embeddings.create(input=doc, model="text-embedding-ada-002") embeddings.append(resp.data[0].embedding) self.collection.add( documents=self.documents, embeddings=embeddings, ids=[f"doc_{i}" for i in range(len(self.documents))], ) ``` ### Step 2: Retrieval Method Add an async method that generates a query embedding and searches ChromaDB: ```python title="main.py" async def retrieve(self, query: str, k: int = 2) -> list[str]: """Retrieve top-k most relevant documents from the vector store.""" response = await self.openai_client.embeddings.create( input=query, model="text-embedding-ada-002" ) query_embedding = response.data[0].embedding results = self.collection.query( query_embeddings=[query_embedding], n_results=k ) return results["documents"][0] if results["documents"] else [] ``` ### Step 3: Agent Lifecycle ```python title="main.py" async def on_enter(self) -> None: await self.session.say( "Hello! I'm your VideoSDK assistant powered by a local knowledge base. " "Ask me anything about VideoSDK." ) async def on_exit(self) -> None: await self.session.say("Thank you for using VideoSDK. Goodbye!") ``` ### Step 4: Pipeline with RAG Hook Use the `user_turn_start` hook to retrieve documents and inject them into `chat_context` **before** the LLM runs. 
Optionally use the `llm` hook to observe the generated response.

```python title="main.py"
import logging
from videosdk.agents import Pipeline, AgentSession, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

logger = logging.getLogger(__name__)

async def entrypoint(ctx: JobContext):
    agent = RAGVoiceAgent()

    pipeline = Pipeline(
        stt=DeepgramSTT(),
        llm=OpenAILLM(),
        tts=ElevenLabsTTS(),  # matches the ElevenLabs plugin and API key from the prerequisites
        vad=SileroVAD(),
        turn_detector=TurnDetector(),
    )

    @pipeline.on("user_turn_start")
    async def on_user_turn_start(transcript: str):
        """Retrieve docs and inject context before the LLM is called."""
        context_docs = await agent.retrieve(transcript)
        if context_docs:
            context_str = "\n\n".join(
                f"Document {i+1}: {doc}" for i, doc in enumerate(context_docs)
            )
            agent.chat_context.add_message(
                role="system",
                content=f"Retrieved Context:\n{context_str}\n\nUse this context to answer the user's question.",
            )
            logger.info(f"Injected {len(context_docs)} docs into chat context")

    @pipeline.on("llm")
    async def on_llm(data: dict):
        """Observe the LLM response (optional)."""
        text = data.get("text", "")
        logger.info(f"[LLM] Generated ({len(text)} chars): {text[:120]}...")

    session = AgentSession(agent=agent, pipeline=pipeline)
    await session.start(wait_for_participant=True, run_until_shutdown=True)

def make_context() -> JobContext:
    return JobContext(room_options=RoomOptions(name="RAG Voice Assistant", playground=True))

if __name__ == "__main__":
    WorkerJob(entrypoint=entrypoint, jobctx=make_context).start()
```

:::note
The `user_turn_start` hook fires when the user's final transcript is ready, **before** the LLM generates a response. This is the right place to retrieve documents and add them to `chat_context`. The LLM then sees the injected context and uses it to generate a grounded answer.
::: ## Advanced Features ### Dynamic Document Updates Add documents at runtime: ```python title="main.py" async def add_document(self, document: str, metadata: dict = None): """Add a new document to the knowledge base.""" response = await self.openai_client.embeddings.create( input=document, model="text-embedding-ada-002" ) self.collection.add( documents=[document], embeddings=[response.data[0].embedding], ids=[f"doc_{len(self.documents)}"], metadatas=[metadata] if metadata else None, ) self.documents.append(document) ``` ### Document Chunking Split large documents for better retrieval: ```python title="main.py" def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]: """Split document into overlapping chunks.""" words = text.split() chunks = [] for i in range(0, len(words), chunk_size - overlap): chunk = " ".join(words[i:i + chunk_size]) chunks.append(chunk) return chunks ``` ## Best Practices 1. **Retrieval Count:** Start with `k=2-3`, adjust based on response quality and latency 2. **Chunk Size:** Keep chunks between 300-800 words for optimal retrieval 3. **Context Window:** Ensure retrieved context fits within LLM token limits 4. **Persistent Storage:** Use `chromadb.PersistentClient(path="./chroma_db")` in production 5. **Error Handling:** Always handle retrieval failures gracefully 6. 
**Caching:** Cache embeddings for frequently asked queries to reduce latency #### Common Issues | Issue | Solution | | ------------------ | ----------------------------------------------------------------------------- | | Slow responses | Reduce retrieval count (k), use faster embedding model, or cache embeddings | | Irrelevant results | Improve document quality, adjust chunking strategy, or use metadata filtering | | Out of memory | Use `PersistentClient` instead of in-memory `Client` | --- --- id: recording title: Recording hide_title: false hide_table_of_contents: false description: "Learn how to enable the recording functionality with VideoSDK AI Agents for agent sessions and user interactions." pagination_label: "Recording" keywords: - Agent Recording - AI Agents - Recording - AI Agent Oversight - Traces - Playback - VideoSDK Agents - MCP Server - Python SDK - Audio Store - Autoscroll Transcript - Timestamped Playback image: img/videosdklive-thumbnail.jpg sidebar_position: 11 sidebar_label: Recording slug: recording --- import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon, APIIcon } from '@site/src/components/agent/cards'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Recording Recording capabilities in VideoSDK Agents allow you to capture and store meeting conversations, enabling features like conversation analysis, compliance documentation, and quality assurance. VideoSDK provides three distinct recording approaches, each suited for different use cases and requirements. ## Recording Types Overview VideoSDK offers three types of recording functionality: 1. **Participant Recording** - Built-in automatic recording managed by the agent framework 2. **Track Recording** - Individual audio/video track recording with granular control 3. **Meeting Recording** - Complete meeting session recording with composite output ## 1. 
Participant Recording (Built-in) Participant recording is the simplest approach, automatically managed by the VideoSDK Agents framework when you enable the `recording` parameter. ### How It Works When `recording=True` is set in `RoomOptions`, the system automatically: - Starts recording when the agent joins the meeting. - Starts recording for each participant as they join. - Stops and merges recordings when the session ends. ### Basic Setup ```python title="main.py" from videosdk.agents import JobContext, RoomOptions def make_context(): return JobContext( room_options=RoomOptions( room_id="your-room-id", auth_token="your-auth-token", name="Recording Agent", #highlight-start recording=True # Enable automatic participant recording #highlight-end ) ) ``` ### Fine-grained Control with RecordingOptions `RecordingOptions` lets you opt in to additional streams beyond audio when `recording=True`. Audio is always recorded via the track API when `recording=True`. | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `video` | bool | False | Record the agent's camera video track (composite audio+video participant recording) | | `screen_share` | bool | False | Record the screen-share track. Requires `vision=True` in `RoomOptions`. 
| ```python title="main.py" from videosdk.agents import JobContext, RoomOptions, RecordingOptions # Audio only (default — recording_options omitted) def make_context(): return JobContext( room_options=RoomOptions( room_id="your-room-id", recording=True, ) ) # Audio + camera video (composite participant recording) def make_context(): return JobContext( room_options=RoomOptions( room_id="your-room-id", recording=True, recording_options=RecordingOptions(video=True), ) ) # Audio + screen share def make_context(): return JobContext( room_options=RoomOptions( room_id="your-room-id", vision=True, # required for screen share recording recording=True, recording_options=RecordingOptions(screen_share=True), ) ) ``` :::note `recording_options.screen_share=True` requires `vision=True` because vision subscribes to the video/share streams. Setting `screen_share=True` without `vision=True` raises a `ValueError` at startup. ::: ## 2. Track Recording Track recording provides granular control over individual audio and video tracks, allowing you to record specific streams with custom configurations. 
### When to Use Track Recording

- Need to record specific audio/video tracks separately
- Require custom recording configurations per track
- Want to control recording start/stop timing manually
- Need different quality settings for different tracks

### Key Features

- **Individual Control**: Start/stop recording for specific tracks
- **Custom Configuration**: Set different recording parameters per track
- **Flexible Output**: Choose output formats and quality settings
- **Manual Management**: Full control over recording lifecycle

### API References for Track Recording

- [Stop Track Recording](https://docs.videosdk.live/api-reference/realtime-communication/stop-track-recording): Stop recording a participant's track in your room by passing `roomId`, `participantId`, and `kind` as body parameters.
- [Fetch a Track Recording](https://docs.videosdk.live/api-reference/realtime-communication/fetch-a-track-recording): Fetch the details of a particular track recording by passing `trackRecordingId` as a parameter.
- [Fetch All Track Recordings](https://docs.videosdk.live/api-reference/realtime-communication/fetch-all-track-recordings): Fetch the details of your track recordings by passing `roomId`, `sessionId`, and `participantId` as query parameters.
- [Delete a Track Recording](https://docs.videosdk.live/api-reference/realtime-communication/delete-track-recording): Delete a particular track recording by passing `trackRecordingId` as a parameter.

## 3. Meeting Recording

Meeting recording captures the entire meeting session as a single composite recording, including all participants and their interactions.
### When to Use Meeting Recording - Need a single recording file for the entire meeting - Want automatic mixing of all audio/video streams - Require meeting-level recording controls - Need simplified post-processing workflow ### Key Features - **Composite Output**: Single recording file with all participants - **Automatic Mixing**: Audio/video streams automatically combined - **Meeting-level Control**: Start/stop recording for entire meeting - **Simplified Management**: One recording per meeting session ### API References for Meeting Recording }, { title: "Stop Recording", description: "This API lets you stop recording of your room by passing roomId as a body parameter.", link: "https://docs.videosdk.live/api-reference/realtime-communication/stop-recording", icon: }, { title: "Fetch Recordings", description: "This API lets you fetch details of your recording by passing roomId and sessionId as query parameters.", link: "https://docs.videosdk.live/api-reference/realtime-communication/fetch-recordings", icon: }, { title: "List all Recordings", description: "This API lets you fetch a particular recording info by passing recording Id as parameter.", link: "https://docs.videosdk.live/api-reference/realtime-communication/fetch-recording-using-recordingId", icon: }, { title: "Delete a Recording", description: "This API lets you delete a particular recording by passing recording Id as parameter.", link: "https://docs.videosdk.live/api-reference/realtime-communication/delete-recording", icon: } ]} /> ## Choosing the Right Recording Type | Use Case | Recommended Type | Reason | | :--- | :--- | :--- | | Agent conversations with automatic management | Participant Recording | Built-in automation and channel separation | | Custom recording workflows | Track Recording | Granular control over individual streams | | Simple meeting archival | Meeting Recording | Single composite file for entire meeting | | Compliance and audit trails | Participant Recording | Automatic lifecycle 
management | | Advanced post-processing | Track Recording | Individual track access and control | ## Best Practices ### Recording Management - Choose the appropriate recording type based on your use case - Ensure proper authentication tokens for recording API access - Monitor recording status and handle errors gracefully - Plan for adequate storage capacity ### Privacy and Compliance - Inform participants that sessions are being recorded - Implement proper data retention and deletion policies - Ensure compliance with local privacy regulations - Use appropriate recording type for your compliance requirements --- --- id: room-options title: RoomOptions hide_title: false hide_table_of_contents: false description: "Learn how to configure RoomOptions for VideoSDK AI Agents to customize meeting connection, agent behavior, and session management." pagination_label: "RoomOptions" keywords: - RoomOptions - AI Agents Configuration - Meeting Connection - Agent Settings - VideoSDK Agents - Session Management - Python SDK - Agent Identity - Playground Mode image: img/videosdklive-thumbnail.jpg sidebar_position: 12 sidebar_label: RoomOptions slug: room-options --- # RoomOptions `RoomOptions` is a configuration class that defines how an AI agent connects to and behaves within a VideoSDK meeting room. It serves as the primary interface for customizing agent behavior, meeting connection parameters, and session management settings. ## Introduction The `RoomOptions` class is the central configuration point for VideoSDK AI agents, providing comprehensive control over how agents join meetings, interact with participants, and manage their sessions. This configuration is passed to the `JobContext` during agent initialization and influences all aspects of the agent's behavior within the meeting environment. 
## Core Features - **Meeting Connection**: Configure room ID, authentication, and transport mode for VideoSDK meetings - **Agent Identity**: Set display name, participant ID, and visual representation - **Session Management**: Control automatic session termination and timeouts - **Media Capabilities**: Enable vision processing, meeting recording, and background audio - **Telemetry**: Configure traces, metrics, and log collection/export - **Transport Modes**: Support for VideoSDK, WebSocket, and WebRTC transports - **Development Tools**: Playground mode for testing and development - **Error Handling**: Custom error handling callbacks - **Avatar Integration**: Support for virtual avatars ## Basic Example ```python title="main.py" from videosdk.agents import RoomOptions, JobContext # Basic configuration room_options = RoomOptions( room_id="your-meeting-id", name="My AI Agent", playground=True ) # Create job context context = JobContext(room_options=room_options) ``` ## Parameters Parameters that you can pass with `RoomOptions`: ### Connection & Identity | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `room_id` | Optional[str] | None | Unique identifier for the VideoSDK meeting | | `auth_token` | Optional[str] | None | VideoSDK authentication token | | `name` | Optional[str] | "Agent" | Display name of the agent in the meeting | | `agent_participant_id` | Optional[str] | None | Custom participant ID for the agent | | `join_meeting` | Optional[bool] | True | Whether agent should join the meeting | | `signaling_base_url` | Optional[str] | "api.videosdk.live" | VideoSDK signaling server URL | ### Media & Features | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `playground` | bool | True | Enable playground mode for easy testing | | `vision` | bool | False | Enable video processing capabilities | | `recording` | bool | False | Enable meeting recording. When `True`, audio is always recorded via the track API. 
Use `recording_options` to additionally record camera video or screen share. | | `recording_options` | Optional[RecordingOptions] | None | Fine-grained control over what is recorded alongside audio (see [RecordingOptions](#recordingoptions)) | | `background_audio` | bool | False | Enable background audio (e.g., thinking sounds) | | `avatar` | Optional[Any] | None | Virtual avatar for visual representation | ### Session Management | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `auto_end_session` | bool | True | Automatically end session when participants leave | | `session_timeout_seconds` | Optional[int] | 5 | Seconds to wait before ending the session after the **last** participant leaves | | `no_participant_timeout_seconds` | Optional[int] | 90 | Seconds to wait for the **first** participant to join before automatically shutting down. Set to `None` to wait indefinitely. | | `on_room_error` | Optional[Callable] | None | Error handling callback function | ### Telemetry & Logging | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `traces` | Optional[TracesOptions] | TracesOptions() | OpenTelemetry trace export configuration | | `metrics` | Optional[MetricsOptions] | MetricsOptions() | Metrics collection and export configuration | | `logs` | Optional[LoggingOptions] | LoggingOptions() | Log collection and export configuration | | `send_logs_to_dashboard` | bool | False | Send logs to VideoSDK dashboard | | `dashboard_log_level` | str | "INFO" | Log level for dashboard logs | | `send_analytics_to_pubsub` | Optional[bool] | False | Send analytics data via PubSub | ### Transport Configuration | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `transport_mode` | Optional[str \| TransportMode] | "videosdk" | Transport mode: `"videosdk"`, `"websocket"`, or `"webrtc"` | | `websocket` | Optional[WebSocketConfig] | WebSocketConfig() | WebSocket transport configuration | | `webrtc` | 
Optional[WebRTCConfig] | WebRTCConfig() | WebRTC transport configuration | --- ### TracesOptions | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `enabled` | bool | True | Enable trace collection | | `export_url` | Optional[str] | None | Custom export endpoint URL | | `export_headers` | Optional[Dict[str, str]] | None | Custom headers for the export endpoint | ### MetricsOptions | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `enabled` | bool | True | Enable metrics collection | | `export_url` | Optional[str] | None | Custom export endpoint URL | | `export_headers` | Optional[Dict[str, str]] | None | Custom headers for the export endpoint | ### LoggingOptions | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `enabled` | bool | False | Enable log export | | `level` | str | "INFO" | Log level filter (DEBUG, INFO, WARNING, ERROR) | | `export_url` | Optional[str] | None | Custom export endpoint URL | | `export_headers` | Optional[Dict[str, str]] | None | Custom headers for the export endpoint | ### WebSocketConfig | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `port` | int | 8080 | Port for the WebSocket server | | `path` | str | "/ws" | Endpoint path for the WebSocket connection | ### WebRTCConfig | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `signaling_url` | Optional[str] | None | Signaling server URL (required for WebRTC mode) | | `signaling_type` | str | "websocket" | Type of signaling transport | | `ice_servers` | Optional[list] | `[{"urls": "stun:stun.l.google.com:19302"}]` | ICE server configuration for NAT traversal | ### RecordingOptions Controls what is recorded in addition to audio when `recording=True`. Audio is **always** recorded via the track API when `recording=True`; these fields opt in to additional streams. 
| Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `video` | bool | False | Also record the agent's camera video track (composite audio+video via participant recording API) | | `screen_share` | bool | False | Also record the screen-share track. Requires `vision=True`. | ```python title="main.py" from videosdk.agents import RoomOptions, RecordingOptions # Audio-only recording (default when recording_options is omitted) room_options = RoomOptions( room_id="your-room-id", recording=True, ) # Audio + screen share recording room_options = RoomOptions( room_id="your-room-id", vision=True, # required for screen share recording recording=True, recording_options=RecordingOptions(screen_share=True), ) # Composite audio + video (participant recording API) room_options = RoomOptions( room_id="your-room-id", recording=True, recording_options=RecordingOptions(video=True), ) ``` :::note `recording_options.screen_share=True` requires `vision=True` because vision subscribes to the video/share streams needed for screen recording. Setting `screen_share=True` without `vision=True` raises a `ValueError` at startup. 
:::

## Transport Modes

`RoomOptions` supports three transport modes:

```python title="main.py"
from videosdk.agents import RoomOptions, WebSocketConfig, WebRTCConfig

# Default: VideoSDK transport
room_options = RoomOptions(room_id="your-meeting-id")

# WebSocket transport
room_options = RoomOptions(
    transport_mode="websocket",
    websocket=WebSocketConfig(port=8080, path="/ws")
)

# WebRTC transport
room_options = RoomOptions(
    transport_mode="webrtc",
    webrtc=WebRTCConfig(signaling_url="wss://your-signaling-server.com")
)
```

## Telemetry Configuration

Configure traces, metrics, and logging for observability:

```python title="main.py"
from videosdk.agents import RoomOptions, TracesOptions, MetricsOptions, LoggingOptions

room_options = RoomOptions(
    room_id="your-meeting-id",
    traces=TracesOptions(
        enabled=True,
        export_url="https://your-otel-collector.com/v1/traces",
        export_headers={"Authorization": "Bearer your-token"}
    ),
    metrics=MetricsOptions(
        enabled=True,
        export_url="https://your-otel-collector.com/v1/metrics"
    ),
    logs=LoggingOptions(
        enabled=True,
        level="DEBUG"
    )
)
```

## Additional Resources

- [Vision Integration](https://docs.videosdk.live/ai_agents/core-components/vision-and-multi-modality): Enable agents to receive and process video input from the meeting
- [Recording Capabilities](https://docs.videosdk.live/ai_agents/core-components/recording): Record agent sessions for analysis and quality assurance
- [Avatar](https://docs.videosdk.live/ai_agents/core-components/avatar): Use an avatar for a visually engaging AI voice agent

---

---
title: Speech Handle
hide_title: false
hide_table_of_contents: false
description: "Learn about Speech
Handle in the VideoSDK AI Agent SDK. Understand how to control agent speech at both session and utterance levels, manage interruptions, and coordinate sequential speech playback."
pagination_label: "Speech Handle"
keywords:
  - Speech Handle
  - Agent Speech
  - Utterance Handle
  - Session Control
  - Interruption Handling
  - Async Await
  - AgentSession
  - TTS
  - VideoSDK Agents
  - Python SDK
  - AI Agents
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Speech Handle
slug: speech-handle
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards';
import { LanguageTable } from '@site/src/components/agent';

# Speech Handle

Speech control in VideoSDK agents operates through two complementary layers: **session-level** methods for initiating speech and **utterance-level** handles for managing the speech lifecycle. This document covers both aspects of controlling agent speech output.

## Session-Level Speech Control

The `AgentSession` provides three primary methods for controlling agent speech output:

**1. Say**

`say(message: str, interruptible: bool = True)`: Sends a direct message from the agent to meeting participants with interruption control.

**Parameters:**

- `message`: The message to be spoken.
- `interruptible`: When `True`, the agent’s speech can be interrupted. When `False`, the agent will continue speaking until the message is fully delivered. Default is `True`.

```python
# Basic usage
# highlight-start
await session.say("Critical update!", interruptible=False)
# highlight-end

# In agent lifecycle hooks
class MyAgent(Agent):
    async def on_enter(self):
        # highlight-start
        await self.session.say("Welcome to the meeting!")
        # highlight-end
```

**2. 
Reply** `reply(instructions: str, wait_for_playback: bool = True, interruptible: bool = True)`: Generates agent responses dynamically using custom instructions with interruption control. **Parameters:** - `instructions`: Custom instructions for generating the response - `wait_for_playback`: When `True`, prevents user interruptions until playback completes - `interruptible`: When `True`, the agent’s response can be interrupted. When `False`, the agent will continue speaking without interruption. Default is `True`. ```python # Generate immediate response # highlight-start await session.reply(instructions="Please summarize the conversation so far", interruptible=False) # highlight-end # Wait for complete playback before allowing new inputs # highlight-start await session.reply( instructions="Explain the next steps", wait_for_playback=True ) # highlight-end # Practical example in function tools class MyAgent(Agent): @function_tool async def get_summary(self) -> str: #highlight-start await self.session.reply( instructions="Based on our conversation, let me provide a summary..." ) #highlight-end return "Summary generated" ``` **3. Interrupt** `interrupt()`: Immediately stops the agent's current speech operation. ```python # Emergency stop during agent response # highlight-start session.interrupt() # highlight-end # User interruption handling class InteractiveAgent(Agent): async def handle_user_input(self, user_input: str): if "stop" in user_input.lower(): #highlight-start self.session.interrupt() #highlight-end await self.session.reply(instructions="How can I help you instead?") @function_tool async def emergency_stop(self) -> str: """Stop current agent operation immediately""" # highlight-start self.session.interrupt() # highlight-end return "Agent stopped successfully" ``` ## Utterance-Level Management `UtteranceHandle` manages individual agent utterances, preventing overlapping speech and enabling graceful interruption handling. 
### Core Concepts - **Lifecycle Management** - Each `UtteranceHandle` tracks a single utterance from creation through completion. - **Completion States** An utterance can complete in two ways: 1. **Natural Completion:** The TTS finishes playing the audio 2. **User Interruption:** The user starts speaking during playback - **Awaitable Pattern** - The handle supports Python's async/await syntax for sequential speech control. ### API Reference | Property/Method | Return Type | Description | |------------------|--------------|--------------| | `id` | `str` | Unique identifier for the utterance | | `done()` | `bool` | Returns `True` if utterance is complete | | `interrupted` | `bool` | Returns `True` if user interrupted | | `interrupt()` | `None` | Manually marks utterance as interrupted | | `__await__()` | `Generator` | Enables awaiting the handle | ### Usage Patterns - **Sequential Speech** To prevent overlapping TTS, await each handle before starting the next utterance: ```python # Correct approach handle1 = self.session.say(f"The current temperature is {temperature}°C.") await handle1 # Wait for first utterance to complete handle2 = self.session.say("Do you live in this city?") await handle2 # Wait for second utterance to complete ``` - **Checking Interruption Status** Access the current utterance handle via `self.session.current_utterance` to detect interruptions: ```python utterance: UtteranceHandle | None = self.session.current_utterance # In long-running operations, check periodically for i in range(10): if utterance and utterance.interrupted: logger.info("Task was interrupted by the user.") return "The task was cancelled because you interrupted me." 
await asyncio.sleep(1) ``` ### Best Practices - **Sequential Speech:** Always await handles when you need sequential speech to prevent audio overlap - **Interruption Handling:** Check `interrupted` status in long-running operations to enable graceful cancellation - **Handle References:** Store handle references if you need to check status later in your function - **Avoid Concurrent Tasks:** Don't use `create_task()` for speech that should play sequentially ### Common Use Cases - **Multi-part responses:** When function tools need to speak multiple sentences in sequence - **Long-running operations:** Tasks that should be cancellable when users interrupt - **Conversational flows:** Scenarios requiring precise timing between utterances ## Example - Try It Yourself } ]} /> ## FAQs
### Troubleshooting

| Issue | Solution |
|--------|-----------|
| Overlapping speech | Use `await` on handles instead of `create_task()` |
| Tasks not cancelling on interruption | Check `utterance.interrupted` in loops |
| Handle is None | Only available during function tool execution via `session.current_utterance` |
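The interruption check from the table can be tried without the SDK. The sketch below is only an illustration: `FakeUtterance`, `long_task`, and `simulate_user_interrupt` are hypothetical names, and `FakeUtterance` is a stand-in that mimics just the `interrupted` flag and `interrupt()` method, not the SDK's `UtteranceHandle`. It shows a long-running loop polling the flag on every iteration so it can stop early.

```python
import asyncio
from typing import Optional, Tuple


class FakeUtterance:
    """Hypothetical stand-in for an utterance handle (not the SDK class)."""

    def __init__(self) -> None:
        self.interrupted = False

    def interrupt(self) -> None:
        # Mark the utterance as interrupted, as a user speaking would.
        self.interrupted = True


async def long_task(utterance: Optional[FakeUtterance], steps: int = 10) -> str:
    """Do `steps` units of work, checking for interruption before each one."""
    for step in range(steps):
        if utterance and utterance.interrupted:
            return f"cancelled after {step} steps"
        await asyncio.sleep(0.01)  # placeholder for real work
    return "finished"


async def simulate_user_interrupt(utterance: FakeUtterance, delay: float) -> None:
    """Flip the interrupted flag after `delay` seconds."""
    await asyncio.sleep(delay)
    utterance.interrupt()


async def main() -> Tuple[str, str]:
    # Case 1: nobody interrupts, so the task runs to completion.
    uninterrupted = await long_task(FakeUtterance(), steps=3)

    # Case 2: the flag is set while the task is still looping.
    utterance = FakeUtterance()
    task = asyncio.create_task(long_task(utterance, steps=50))
    await simulate_user_interrupt(utterance, delay=0.03)
    interrupted = await task
    return uninterrupted, interrupted


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The key design point is that the check sits inside the loop: a task that inspects the flag only once, before starting, would never notice an interruption that arrives later.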
### Correct Usage Pattern

#### ✅ Correct: Sequential Speech

Await each handle to prevent overlapping TTS.

```python
handle1 = session.say("First")
await handle1
handle2 = session.say("Second")
await handle2
```

---

#### ❌ Incorrect: Concurrent Speech

Using `create_task()` causes audio overlap.

```python
asyncio.create_task(session.say("First"))
asyncio.create_task(session.say("Second"))
```
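To see why awaiting matters, here is a minimal, SDK-free sketch. It rests on an assumption: `fake_say` is a hypothetical coroutine that stands in for TTS playback by sleeping and recording start and end times; it is not `session.say()`. Awaiting each call produces back-to-back intervals, while scheduling both with `asyncio.create_task()` lets them run at the same time.

```python
import asyncio
import time
from typing import List, Tuple

# (text, start time, end time) for one simulated playback
Interval = Tuple[str, float, float]


async def fake_say(text: str, log: List[Interval], duration: float = 0.05) -> None:
    """Hypothetical stand-in for TTS playback: sleep, then record the interval."""
    start = time.monotonic()
    await asyncio.sleep(duration)
    log.append((text, start, time.monotonic()))


def overlaps(log: List[Interval]) -> bool:
    """Return True if any playback started before the previous one ended."""
    ordered = sorted(log, key=lambda item: item[1])
    return any(later[1] < earlier[2] for earlier, later in zip(ordered, ordered[1:]))


async def sequential() -> List[Interval]:
    log: List[Interval] = []
    await fake_say("First", log)   # finishes before the next call starts
    await fake_say("Second", log)
    return log


async def concurrent() -> List[Interval]:
    log: List[Interval] = []
    tasks = [
        asyncio.create_task(fake_say("First", log)),
        asyncio.create_task(fake_say("Second", log)),
    ]
    await asyncio.gather(*tasks)   # both playbacks run at the same time
    return log


if __name__ == "__main__":
    print("sequential overlaps:", overlaps(asyncio.run(sequential())))
    print("concurrent overlaps:", overlaps(asyncio.run(concurrent())))
```

Run as a script, the sequential case should report no overlap and the concurrent case should report overlap, which mirrors the audio behavior described above.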
--- --- title: Turn Detection & Voice Activity Detection (VAD) hide_title: false hide_table_of_contents: false description: "Learn about Turn Detection in the VideoSDK AI Agent SDK. Understand Voice Activity Detection (VAD), End-of-Utterance (EOU) detection, and how to implement natural conversation flow in your AI agents." pagination_label: "Turn Detection & VAD" keywords: - Turn Detection - Voice Activity Detection - VAD - End of Utterance - EOU - Conversation Flow - SileroVAD - TurnDetector - AI Agent SDK - VideoSDK Agents - Speech Processing - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Turn Detection and VAD slug: turn-detection-and-vad --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards'; import { LanguageTable } from '@site/src/components/agent'; # Turn Detection and Voice Activity Detection In conversational AI, timing is everything. Traditional voice agents rely on simple silence-based timers (Voice Activity Detection or VAD) to guess when a user has finished speaking. This often leads to awkward interruptions or unnatural pauses. To solve this, VideoSDK created **Namo-v1**: an open-source, high-performance turn-detection model that understands the _meaning_ of the conversation, not just the silence. ![Namo Turn Detection](https://strapi.videosdk.live/uploads/namo_v1_turn_detection_12e042c6ec.png) ## From Silence Detection to Speech Understanding Namo shifts from basic audio analysis to sophisticated Natural Language Understanding (NLU), allowing your agent to know when a user is truly finished speaking versus just pausing to think. 
| Traditional VAD (Silence-Based) | Namo Turn Detector (Semantic-Based) |
| :--- | :--- |
| **Listens for silence.** | **Understands words and context.** |
| Relies on a fixed timer (e.g., 800ms). | Uses a transformer model to predict intent. |
| Often interrupts or lags. | Knows when to wait and when to respond instantly. |
| Struggles with natural pauses and filler words. | Distinguishes between a brief pause and a true endpoint. |

This semantic understanding enables AI agents to respond more quickly and more naturally, creating a fluid, human-like conversational experience.

:::tip Learn More
For a deep dive into Namo's architecture, performance benchmarks, and how to use it as a standalone model, check out the dedicated [**Namo Turn Detector plugin page**](/ai_agents/plugins/namo-turn-detector).
:::

## Implementation

For the most robust setup, you can use VAD and Namo together. VAD acts as a basic speech detector, while Namo intelligently decides if the turn is over.

### 1. Voice Activity Detection (VAD)

First, configure VAD to detect the presence of speech. This helps manage interruptions and acts as a first-pass filter.

```python
from videosdk.plugins.silero import SileroVAD

# Configure VAD to detect speech activity
vad = SileroVAD(
    threshold=0.5,              # Sensitivity to speech (0.3-0.8)
    min_speech_duration=0.1,    # Ignore very brief sounds
    min_silence_duration=0.75   # Wait time before considering speech ended
)
```

### 2. Namo Turn Detection

Next, add the `NamoTurnDetectorV1` plugin to analyze the content of the speech and predict the user's intent.

#### Multilingual Model

If your agent needs to support multiple languages, use the default multilingual model. It's a single, powerful model that works across more than 20 languages.
```python from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model # Pre-download the multilingual model to avoid runtime delays pre_download_namo_turn_v1_model() # Initialize the multilingual Turn Detector turn_detector = NamoTurnDetectorV1( threshold=0.7 # Confidence level for triggering a response ) ``` The table below lists all supported languages with their performance metrics and language codes. #### Language-Specific Models For maximum performance and accuracy in a single language, use a specialized model. These models are faster and have a smaller memory footprint. ```python from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model # Pre-download a specific language model (e.g., German) pre_download_namo_turn_v1_model(language="de") # Initialize the Turn Detector for German turn_detector = NamoTurnDetectorV1( language="de", threshold=0.7 ) ``` Namo-v1-Korean, accuracy: '97.3%' }, { language: '🇹🇷 Turkish', code: 'tr', modelLink: Namo-v1-Turkish, accuracy: '96.8%' }, { language: '🇯🇵 Japanese', code: 'ja', modelLink: Namo-v1-Japanese, accuracy: '93.5%' }, { language: '🇮🇳 Hindi', code: 'hi', modelLink: Namo-v1-Hindi, accuracy: '93.1%' }, { language: '🇩🇪 German', code: 'de', modelLink: Namo-v1-German, accuracy: '91.9%' }, { language: '🇬🇧 English', code: 'en', modelLink: Namo-v1-English, accuracy: '91.5%' }, { language: '🇳🇱 Dutch', code: 'nl', modelLink: Namo-v1-Dutch, accuracy: '90.0%' }, { language: '🇮🇳 Marathi', code: 'mr', modelLink: Namo-v1-Marathi, accuracy: '89.7%' }, { language: '🇨🇳 Chinese', code: 'zh', modelLink: Namo-v1-Chinese, accuracy: '88.8%' }, { language: '🇵🇱 Polish', code: 'pl', modelLink: Namo-v1-Polish, accuracy: '87.8%' }, { language: '🇳🇴 Norwegian', code: 'no', modelLink: Namo-v1-Norwegian, accuracy: '87.3%' }, { language: '🇮🇩 Indonesian', code: 'id', modelLink: Namo-v1-Indonesian, accuracy: '87.1%' }, { language: '🇵🇹 Portuguese', code: 'pt', modelLink: 
Namo-v1-Portuguese, accuracy: '86.9%' }, { language: '🇮🇹 Italian', code: 'it', modelLink: Namo-v1-Italian, accuracy: '86.8%' }, { language: '🇪🇸 Spanish', code: 'es', modelLink: Namo-v1-Spanish, accuracy: '86.7%' }, { language: '🇩🇰 Danish', code: 'da', modelLink: Namo-v1-Danish, accuracy: '86.5%' }, { language: '🇧🇩 Bengali', code: 'bn', modelLink: Namo-v1-Bengali, accuracy: '79.2%' }, { language: '🇸🇦 Arabic', code: 'ar', modelLink: Namo-v1-Arabic, accuracy: '79.7%' }, { language: '🇫🇮 Finnish', code: 'fi', modelLink: Namo-v1-Finnish, accuracy: '84.8%' }, { language: '🇫🇷 French', code: 'fr', modelLink: Namo-v1-French, accuracy: '85.0%' }, { language: '🇺🇦 Ukrainian', code: 'uk', modelLink: Namo-v1-Ukrainian, accuracy: '86.2%' }, { language: '🇻🇳 Vietnamese', code: 'vi', modelLink: Namo-v1-Vietnamese, accuracy: '82.37%' }, { language: '🇷🇺 Russian', code: 'ru', modelLink: Namo-v1-Russian, accuracy: '84.1%' } ]} /> :::note To see all available models for different languages, along with their benchmarks and accuracy, please visit our [Hugging Face models page](https://huggingface.co/videosdk-live/models). ::: ### 3. Adaptive End-of-Utterance (EOU) Handling The **Adaptive EOU** mode dynamically adjusts the speech-wait timeout based on the confidence scores. This ensures that the agent waits longer when the user is hesitant and responds faster when the user's intent is clear, creating a more natural conversational flow. You can configure this by setting the `eou_config` in your pipeline options: ```python pipeline = Pipeline( # ... other config eou_config=EOUConfig( mode='ADAPTIVE', # or 'DEFAULT' min_max_speech_wait_timeout=[0.5, 0.8] # Min 0.5s, Max 0.8s wait ) ) ``` #### Configuration Parameters | Parameter | Type | Description | | :--- | :--- | :--- | | `mode` | `str` | • **DEFAULT**: Uses a fixed timeout value.
• **ADAPTIVE**: Dynamically adjusts timeout based on confidence scores.| | `min_max_speech_wait_timeout` | `list[float]` | Defines the minimum and maximum wait time (in seconds)| ##### Example | User Input | Agent Reaction | Wait Time | Example | |------------|----------------|-----------|---------| | **Mode = DEFAULT**
Speaks clearly | Responds immediately | ~0.5s | `“Book a meeting for tomorrow at 10.”` | | **Mode = DEFAULT**
Pauses or hesitates mid-sentence | Waits slightly longer | ~0.8s | `“Book a meeting for… um… tomorrow…”` | | **Mode = ADAPTIVE** | Adjusts based on speech clarity | Scaled between min/max | `“Remind me to call… uh… John later.”` | ### 4. Interruption Detection (VAD + STT) Interruption Detection controls when the system should treat user speech as an intentional interruption. It evaluates both voice activity and recognized speech content to avoid triggering interruptions from short noises, filler words, or background audio. The agent only stops or responds when the user clearly intends to speak. #### Configuration Example (HYBRID mode) ```python pipeline = Pipeline( # ... other config interrupt_config=InterruptConfig( mode="HYBRID", interrupt_min_duration=0.2, # 200ms of continuous speech interrupt_min_words=2, # At least 2 words recognized ) ) ``` #### VAD_ONLY mode ```python pipeline = Pipeline( # ... other config interrupt_config=InterruptConfig( mode="VAD_ONLY", interrupt_min_duration=0.2, # 200ms of continuous speech ) ) ``` #### STT_ONLY mode ```python pipeline = Pipeline( # ... other config interrupt_config=InterruptConfig( mode="STT_ONLY", interrupt_min_words=2, # At least 2 words recognized ) ) ``` #### Configuration Parameters | Parameter | Type | Description | | :--- | :--- | :--- | | `mode` | `str` |• **HYBRID** : Combines VAD and STT. Requires both audio detection and recognized words to trigger an interruption.
• **VAD_ONLY** : Uses only raw speech activity detection. Faster but may be triggered by background noise.
• **STT_ONLY** : Relies only on recognized words from the transcript. Slower but ensures speech is intelligible. | | `interrupt_min_duration` | `float` | Minimum duration (in seconds) of continuous speech required to trigger interruption. | | `interrupt_min_words` | `int` | Minimum number of words that must be recognized (used in `HYBRID` and `STT_ONLY` modes). | ### 5. False-Interruption Recovery The **False-Interruption Recovery** feature detects accidental or brief user noises and allows the agent to automatically resume speaking when interruptions are not genuine. #### Configuration Example ```python pipeline = Pipeline( # ... other config interrupt_config=InterruptConfig( false_interrupt_pause_duration=2.0, # Wait 2 seconds to confirm interruption resume_on_false_interrupt=True, # Auto-resume if interruption is brief ) ) ``` #### Configuration Parameters | Parameter | Type | Description | | :--- | :--- | :--- | | `false_interrupt_pause_duration` | `float` | Duration (in seconds) to wait after detecting an interruption before considering it false. If the user doesn't continue speaking within this time, the interruption is considered accidental and the agent resumes. | | `resume_on_false_interrupt` | `bool` | If `True`, the agent will automatically resume speaking after detecting a false interruption. If `False`, the agent will remain paused even after brief interruptions. | ## Pipeline Integration Combine VAD and Namo in your `Pipeline` to bring it all together. 
```python from videosdk.agents import Pipeline from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model # Pre-download the model you intend to use pre_download_namo_turn_v1_model(language="en") pipeline = Pipeline( stt=your_stt_provider, llm=your_llm_provider, tts=your_tts_provider, # highlight-start vad=SileroVAD(threshold=0.5), turn_detector=NamoTurnDetectorV1(language="en", threshold=0.7) # highlight-end ) ``` :::tip The `Pipeline` in realtime mode for providers like OpenAI includes built-in turn detection, so external VAD and Turn Detector components are not required. ::: ## Example Implementation Here’s a complete example showing Namo in a conversational agent. ```python title="main.py" from videosdk.agents import Agent, Pipeline, AgentSession from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model from your_providers import your_stt_provider, your_llm_provider, your_tts_provider class ConversationalAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful assistant that waits for users to finish speaking before responding." ) async def on_enter(self): await self.session.say("Hello! I'm listening and will respond when you're ready.") # 1. Pre-download the model to ensure fast startup pre_download_namo_turn_v1_model(language="en") # 2. Set up the pipeline with Namo for intelligent turn detection pipeline = Pipeline( stt=your_stt_provider, llm=your_llm_provider, tts=your_tts_provider, # highlight-start vad=SileroVAD(threshold=0.5), turn_detector=NamoTurnDetectorV1(language="en", threshold=0.7) # highlight-end ) # 3. Create and start the session session = AgentSession(agent=ConversationalAgent(), pipeline=pipeline) # ... 
connect to your call transport
```

## Examples - Try It Yourself

- [Cascading Pipeline](https://github.com/videosdk-live/agents/blob/main/examples/cascade_basic.py): Turn detection and VAD with a cascading pipeline

---

---
title: Utterance Handle
hide_title: false
hide_table_of_contents: false
description: "Learn about UtteranceHandle in the VideoSDK AI Agent SDK. Understand how to manage agent utterances, prevent overlapping speech, and handle user interruptions gracefully."
pagination_label: "Utterance Handle"
keywords:
  - Utterance Handle
  - Speech Management
  - Interruption Handling
  - Async Await
  - Conversation Flow
  - AgentSession
  - TTS
  - VideoSDK Agents
  - Python SDK
  - AI Agents
image: img/videosdklive-thumbnail.jpg
sidebar_position: 7
sidebar_label: Utterance Handle
slug: utterance-handle
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, GithubIcon } from '@site/src/components/agent/cards';
import { LanguageTable } from '@site/src/components/agent';

# Utterance Handle

`UtteranceHandle` is a lifecycle management class for agent utterances in the videosdk-agents framework. It solves two critical problems:

- preventing overlapping text-to-speech (TTS) output
- enabling graceful interruption handling when users speak during agent responses.

This is essential for creating natural conversational experiences where agents can generate multiple sequential speech outputs without audio overlap.

## Core Concepts

### Lifecycle Management

Each `UtteranceHandle` instance tracks a single utterance from creation through completion. The handle manages state transitions automatically as the conversation progresses.

### Completion States

An utterance can complete in two ways:

1. **Natural Completion:** The TTS finishes playing the audio to completion
2. 
**User Interruption:** The user starts speaking, triggering an interruption ### Awaitable Pattern The handle is compatible with Python's async/await syntax. This allows you to write sequential speech code that waits for each utterance to complete before starting the next one. ## API Reference ### Properties | Property/Method | Return Type | Description | |----------------|-------------|-------------| | id | str | Unique identifier for the utterance | | done() | bool | Returns True if utterance is complete | | interrupted | bool | Returns True if user interrupted | | interrupt() | None | Manually marks utterance as interrupted | | __await__() | Generator | Enables awaiting the handle | ### Methods - `interrupt()`: Manually marks the utterance as interrupted - `__await__()`: Enables awaiting the handle to wait for completion ## Usage Patterns ### Sequential Speech To prevent overlapping TTS, await each handle before starting the next utterance: ```python # Correct approach handle1 = self.session.say(f"The current temperature is {temperature}°C.") await handle1 # Wait for first utterance to complete handle2 = self.session.say("Do you live in this city?") await handle2 # Wait for second utterance to complete ``` ### Checking Interruption Status Access the current utterance handle via `self.session.current_utterance` in function tools to detect interruptions: ```python utterance: UtteranceHandle | None = self.session.current_utterance # In long-running operations, check periodically for i in range(10): if utterance and utterance.interrupted: logger.info("Task was interrupted by the user.") return "The task was cancelled because you interrupted me." 
await asyncio.sleep(1) ``` ## Anti-Pattern: Concurrent Speech Never use `asyncio.create_task()` for speech that should be sequential, as this causes overlapping audio: ```python # INCORRECT - causes overlapping speech asyncio.create_task(self.session.say(f"The current temperature is {temperature}°C.")) asyncio.create_task(self.session.say("Do you live in this city?")) ``` ## Integration with AgentSession The `session.say()` method returns an `UtteranceHandle` instance. During function tool execution, the current utterance is accessible via `self.session.current_utterance`. The handle's lifecycle is managed automatically by the session, with completion and interruption states updated as the conversation progresses. ### Complete Example ```python @function_tool async def get_weather(self, latitude: str, longitude: str) -> dict: utterance: UtteranceHandle | None = self.session.current_utterance # Fetch weather data temperature = await fetch_temperature(latitude, longitude) # Sequential speech with await handle1 = self.session.say(f"The current temperature is {temperature}°C.") await handle1 handle2 = self.session.say("Do you live in this city?") await handle2 # Check if user interrupted if utterance and utterance.interrupted: return {"response": "Weather request cancelled due to user interruption."} return {"response": f"The temperature is {temperature}°C."} ``` ## Best Practices 1. Always await handles when you need sequential speech to prevent audio overlap 2. Check `interrupted` status in long-running operations to enable graceful cancellation 3. Store handle references if you need to check status later in your function 4. 
Avoid `create_task()` for speech that should play sequentially ## Common Use Cases - **Multi-part responses:** When function tools need to speak multiple sentences in sequence - **Long-running operations:** Tasks that should be cancellable when users interrupt - **Conversational flows:** Scenarios requiring precise timing between utterances ## Example - Try It Yourself } ]} /> ## FAQs
Troubleshooting

| Issue | Solution |
|--------|-----------|
| Overlapping speech | Use `await` on handles instead of `create_task()` |
| Tasks not cancelling on interruption | Check `utterance.interrupted` in loops |
| Handle is None | Only available during function tool execution via `session.current_utterance` |
Correct Usage Pattern

#### ✅ Correct: Sequential Speech

Await each handle to prevent overlapping TTS.

```python
handle1 = session.say("First")
await handle1
handle2 = session.say("Second")
await handle2
```

---

#### ❌ Incorrect: Concurrent Speech

Using `create_task()` causes audio overlap.

```python
asyncio.create_task(session.say("First"))
asyncio.create_task(session.say("Second"))
```
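The difference between the two patterns can be tried without the SDK. The sketch below is a minimal simulation, assuming a hypothetical `FakeSession` and `FakeHandle` that only mimic the awaitable behaviour of `session.say()`; neither class is part of `videosdk.agents`, and playback is replaced by a short sleep. It records when each utterance starts and ends so that any overlap becomes visible.

```python
import asyncio


class FakeHandle:
    """Hypothetical stand-in for UtteranceHandle: awaitable, wraps a playback task."""

    def __init__(self, task: asyncio.Task):
        self._task = task

    def done(self) -> bool:
        return self._task.done()

    def __await__(self):
        # Awaiting the handle waits for the simulated playback to finish.
        return self._task.__await__()


class FakeSession:
    """Hypothetical stand-in for AgentSession that logs start/end of each utterance."""

    def __init__(self):
        self.events = []

    def say(self, text: str) -> FakeHandle:
        async def _play():
            self.events.append(f"start:{text}")
            await asyncio.sleep(0.05)  # simulated TTS playback time
            self.events.append(f"end:{text}")

        return FakeHandle(asyncio.create_task(_play()))


async def sequential() -> list:
    session = FakeSession()
    await session.say("First")   # finishes before the next utterance starts
    await session.say("Second")
    return session.events


async def overlapping() -> list:
    session = FakeSession()
    first = session.say("First")    # both utterances are started
    second = session.say("Second")  # before either one is awaited
    await first
    await second
    return session.events


if __name__ == "__main__":
    print("sequential: ", asyncio.run(sequential()))
    print("overlapping:", asyncio.run(overlapping()))
```

In the sequential run, `Second` starts only after `First` has ended. In the overlapping run, both utterances start before either one ends, which is the audio overlap that awaiting each handle avoids.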
---

---
title: Vision & Multi-modality
hide_title: false
hide_table_of_contents: false
description: "Learn how to add vision and multi-modal capabilities to your VideoSDK AI Agents. Understand image processing, live video input, and multi-modal conversation flows."
pagination_label: "Vision & Multi-modality"
keywords:
  - Vision
  - Multi-modality
  - Image Processing
  - Live Video
  - Visual AI
  - ImageContent
  - Gemini Live
  - Real-time Vision
  - AI Agent SDK
  - VideoSDK Agents
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 9
sidebar_label: Vision & Multi-modality
slug: vision-and-multi-modality
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Vision & Multi-modality

Vision and multi-modal capabilities enable your AI agents to process and understand visual content alongside text and audio. This creates richer, more interactive experiences where agents can analyze images, respond to visual cues, and engage in conversations about what they see.

The VideoSDK Agents framework supports vision capabilities through two distinct pipeline architectures, each with different capabilities and use cases.

## Pipeline Architecture Overview

The framework provides two pipeline types with different vision support:

| Pipeline Mode | Vision Capabilities | Supported Models | Use Cases |
|---|---|---|---|
| Cascading Mode | Live frame capture & static images | OpenAI, Anthropic, Google | On-demand frame analysis, document analysis, visual Q&A |
| Realtime Mode | Continuous live video streaming | Google Gemini Live only | Real-time visual interactions, live video commentary |

## Cascading Mode Vision

The Pipeline in cascading mode supports vision through two approaches: capturing live video frames from participants, or processing static images. This works with all supported LLM providers (OpenAI, Anthropic, Google).
### Enabling Vision Enable vision capabilities by setting `vision=True` in `RoomOptions`: ```python from videosdk.agents import JobContext, RoomOptions room_options = RoomOptions( room_id="your-room-id", name="Vision Agent", #highlight-start vision=True # Enable vision capabilities #highlight-end ) job_context = JobContext(room_options=room_options) ``` ### Live Frame Capture Capture video frames from meeting participants on demand using `agent.capture_frames()`: ```python import asyncio from videosdk.agents import Agent, AgentSession, JobContext, Pipeline from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.elevenlabs import ElevenLabsTTS from videosdk.plugins.google import GoogleLLM from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector class VisionAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful assistant that can analyze images." ) async def entrypoint(ctx: JobContext): agent = VisionAgent() pipeline = Pipeline( stt=DeepgramSTT(), llm=GoogleLLM(), tts=ElevenLabsTTS(), vad=SileroVAD(), turn_detector=TurnDetector() ) session = AgentSession( agent=agent, pipeline=pipeline ) shutdown_event = asyncio.Event() #highlight-start async def on_pubsub_message(message): print("Pubsub message received:", message) if isinstance(message, dict) and message.get("message") == "capture_frames": print("Capturing frame....") try: frames = agent.capture_frames(num_of_frames=1) if frames: print(f"Captured {len(frames)} frame(s)") await session.reply( "Please analyze this frame and describe what you see in detail within one line.", frames=frames ) else: print("No frames available. Make sure vision is enabled in RoomOptions.") except ValueError as e: print(f"Error: {e}") def on_pubsub_message_wrapper(message): asyncio.create_task(on_pubsub_message(message)) #highlight-end #rest of the code.. 
``` :::tip The `capture_frames` function returns an array of frames, and you can request at most 5 frames per call (`num_of_frames <= 5`). ::: **Key Features:** - **On-Demand Capture:** Capture frames only when needed, triggered by events or user requests - **Event-Driven:** Use PubSub or other triggers to capture frames at the right moment - **Flexible Analysis:** Send custom instructions along with frames for specific analysis tasks ### Silent Capture (Saving Captured Frames) You can save captured video frames to disk for later analysis or debugging. The frames returned by `agent.capture_frames()` are `av.VideoFrame` objects that can be converted to JPEG images. The capture is called silent because the agent does not announce that a frame is being captured unless you explicitly configure it to do so. ```python title="main.py" import io from av import VideoFrame from PIL import Image def save_frame_as_jpeg(frame: VideoFrame, filename: str) -> None: """Save a video frame as a JPEG file.""" img = frame.to_image() # Convert to PIL Image img.save(filename, format="JPEG") # In your agent code frames = agent.capture_frames(num_of_frames=1) if frames: # Save the first frame save_frame_as_jpeg(frames[0], "captured_frame.jpg") # Or save as bytes for uploading/processing buffer = io.BytesIO() frames[0].to_image().save(buffer, format="JPEG") jpeg_bytes = buffer.getvalue() ``` **Use Cases:** - **Debugging:** Save frames to verify what the agent is seeing - **Logging:** Archive frames for audit trails or quality assurance - **Preprocessing:** Save frames before sending to external vision APIs - **Thumbnails:** Generate preview images for user interfaces ### Static Image Processing For pre-existing images or URLs, use the `ImageContent` class: ```python from videosdk.agents import ChatRole, ImageContent # Add image from URL agent.chat_context.add_message( role=ChatRole.USER, content=[ImageContent(image="https://example.com/image.jpg")] ) # Add image with custom settings image_content = ImageContent( 
image="https://example.com/document.png", inference_detail="high" # "auto", "high", or "low" ) agent.chat_context.add_message( role=ChatRole.USER, content=[image_content] ) ``` ### Provider Support All major LLM providers support vision in cascading mode: | Provider | Vision Models | Capabilities | |-----------|----------------|---------------| | OpenAI | GPT-4 Vision models | Configurable detail levels, URL & base64 support | | Anthropic | Claude 3 models | Advanced image understanding, document analysis | | Google | Gemini models | Comprehensive visual analysis, multi-image support | ### Best Practices - **Frame Timing:** Capture frames at meaningful moments (e.g., when user asks "what do you see?") - **Error Handling:** Always check if frames are available before processing - **Vision Enablement:** Ensure `vision=True` is set in `RoomOptions` for frame capture - **Image Quality:** Use appropriate resolutions for your use case (1024x1024 recommended for detailed analysis) *Here is the example you can try out : [Cascade Vision Example](https://github.com/videosdk-live/agents/blob/main/examples/vision/vision_cascade.py)* --- ## Realtime Mode Vision The Pipeline in realtime mode enables continuous live video processing for real-time visual interactions. Video frames are automatically streamed to the model as they arrive. ### Live Video Processing Live video input is enabled through the `vision` parameter in `RoomOptions` and requires Google's Gemini Live model. 
```python title="main.py" from videosdk.agents import Agent, AgentSession, Pipeline, WorkerJob, JobContext, RoomOptions from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig async def start_session(context: JobContext): # Initialize Gemini with vision capabilities model = GeminiRealtime( model="gemini-3.1-flash-live-preview", config=GeminiLiveConfig( voice="Leda", response_modalities=["AUDIO"] ) ) pipeline = Pipeline(llm=model) agent = VisionAgent() session = AgentSession( agent=agent, pipeline=pipeline, ) await session.start(wait_for_participant=True, run_until_shutdown=True) # Enable live video processing def make_context() -> JobContext: room_options = RoomOptions(room_id="", name="Sandbox Agent", playground=True, #highlight-start vision=True #highlight-end ) return JobContext( room_options=room_options ) ``` ### Video Processing Flow When vision is enabled, the system automatically does the following: 1. **Continuous Capture**: Captures video frames from meeting participants 2. **Frame Processing**: Processes frames at throttled intervals (one frame every 0.5 seconds) 3. **Model Integration**: Sends visual data to the Gemini Live model 4. 
**Context Integration**: Integrates visual understanding with conversation context ### Realtime Mode Limitations - **Model Restriction**: Only works with the `GeminiRealtime` model - **Network Requirements**: Requires stable network connections for optimal performance - **Frame Rate**: Automatically throttled to prevent overwhelming the model *Here is the example you can try out : [**Realtime Vision Example**](https://github.com/videosdk-live/agents/blob/main/examples/vision/vision_realtime.py)* ## Choosing the Right Approach | Use Case | Recommended Pipeline | Why | |-----------|----------------------|-----| | On-demand frame analysis | Pipeline (Cascading) | Capture frames only when needed, works with all LLM providers | | Document/image Q&A | Pipeline (Cascading) | Process static images with custom instructions | | Real-time video commentary | Pipeline (Realtime) | Continuous streaming for live visual interactions | | Multi-provider support | Pipeline (Cascading) | Works with OpenAI, Anthropic, and Google | | Lowest latency | Pipeline (Realtime) | Direct streaming to Gemini Live model | ## Examples - Try Out Yourself Check out examples of the realtime and cascading vision functionality. import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, DocumentIcon, ExternalLinkIcon, RobotIcon, GithubIcon } from '@site/src/components/agent/cards'; }, { title: "Realtime Vision", description: "Continuous video streaming with Gemini Realtime API", link: "https://github.com/videosdk-live/agents/blob/main/examples/vision/vision_realtime.py", icon: } ]} /> ## Frequently Asked Questions
Can I use vision with any LLM provider?

Pipeline vision in cascading mode works with OpenAI, Anthropic, and Google LLMs. Vision in realtime mode only works with Google's Gemini Live model.

How do I capture frames at specific moments?

Use event-driven triggers like PubSub messages or user speech to call `agent.capture_frames()` at the right time. See the example code above for implementation details.

What's the difference between frame capture and continuous streaming?

Frame capture (cascading mode) captures frames on-demand when you call `capture_frames()`. Continuous streaming (realtime mode) automatically sends video frames to the model in real-time.
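The contrast between the two modes can be illustrated without the SDK. The sketch below is only a model of the idea, not the SDK's internals: `capture_latest` and `FrameThrottle` are hypothetical names, the 5-frame limit and the 0.5-second interval are taken from the figures quoted on this page, and raising `ValueError` for an out-of-range request is an assumption. On-demand capture returns frames only when called, while the throttle decides which frames of a continuous stream are forwarded.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence

MAX_FRAMES_PER_CAPTURE = 5  # mirrors the documented `num_of_frames <= 5` limit


def capture_latest(buffer: Sequence[str], num_of_frames: int = 1) -> List[str]:
    """On-demand model: return the newest frames, and only when asked."""
    if not 1 <= num_of_frames <= MAX_FRAMES_PER_CAPTURE:
        # Assumed behaviour for this sketch; the SDK may handle this differently.
        raise ValueError("num_of_frames must be between 1 and 5")
    return list(buffer[-num_of_frames:])


@dataclass
class FrameThrottle:
    """Streaming model: forward a frame only if enough time has passed."""

    min_interval: float = 0.5  # seconds between forwarded frames
    _last_sent: Optional[float] = field(default=None, init=False)

    def should_send(self, timestamp: float) -> bool:
        if self._last_sent is None or timestamp - self._last_sent >= self.min_interval:
            self._last_sent = timestamp
            return True
        return False


def stream(timestamps: Sequence[float], throttle: FrameThrottle) -> List[float]:
    """Return the timestamps of the frames that pass the throttle."""
    return [t for t in timestamps if throttle.should_send(t)]


if __name__ == "__main__":
    # On-demand: nothing is sent until the application asks for frames.
    print(capture_latest(["f1", "f2", "f3"], num_of_frames=2))
    # Streaming: frames arrive every 0.25 s, but only some are forwarded.
    print(stream([0.0, 0.25, 0.5, 0.75, 1.0, 1.75], FrameThrottle()))
```

The on-demand path does no work between requests, which suits event-driven triggers; the throttled path runs continuously and trades completeness for a bounded frame rate.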
---

---
title: Voice Mail Detection
hide_title: false
hide_table_of_contents: false
description: "Learn how VideoSDK AI agents detect voicemail systems during outbound calls and take actions such as leaving a voicemail message or ending the call"
pagination_label: "Voice Mail Detection"
keywords:
  - Voice Mail Detection
  - VideoSDK Agents
  - VideoSDK AI Voice
  - Python SDK
  - Real-time Transcription
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Voice Mail Detection
slug: voice-mail-detection
---

import { AgentCardGrid, GithubIcon, } from "@site/src/components/agent/cards";

# Voice Mail Detection

Voice Mail Detection allows you to automatically handle voicemail scenarios when making outbound calls with a VideoSDK AI agent. When an outbound call is forwarded to a voicemail system, the detector triggers a callback so your agent can take an action such as leaving a voicemail message or ending the call.

## What Problem This Solves

In outbound calling workflows, unanswered calls are often routed to voicemail systems. Without detection, agents may continue speaking or wait unnecessarily. Voice Mail Detection lets you:

- Detect voicemail systems automatically
- Control how your agent responds
- End calls cleanly after voicemail handling

:::info
To set up outbound calling and routing rules, check out the [Quick Start Example](https://docs.videosdk.live/telephony/managing-calls/making-outbound-calls).
:::

## Enabling Voice Mail Detection

To use voicemail detection, import `VoiceMailDetector`, add it to your agent configuration, and register a callback that defines how voicemail should be handled.
```python from videosdk.agents import AgentSession, VoiceMailDetector from videosdk.plugins.openai import OpenAILLM async def voice_mail_callback(message): print("Voice Mail message received:", message) # highlight-start voicemail = VoiceMailDetector( llm=OpenAILLM(), duration=5, callback=voice_mail_callback, ) # highlight-end session = AgentSession( # highlight-start voice_mail_detector=voicemail # highlight-end ) ``` ## Parameters | Parameter | Description | |----------|-------------| | `llm` | LLM used to process the detected voicemail. | | `duration` | The minimum period of silence (in seconds) that triggers voicemail detection. | | `callback` | A function that is called whenever a voicemail is detected, allowing for custom actions like hanging up or leaving a message. | ## Example - Try It Yourself , }, ]} columns={2} /> --- --- title: Worker hide_title: false hide_table_of_contents: false description: "The `Worker` class in VideoSDK's AI Agent SDK serves as the central orchestrator that manages job execution, backend registration, and agent lifecycle coordination. It handles task execution through configurable process/thread executors, manages VideoSDK room connections, and coordinates between agents, pipelines, and infrastructure components for seamless real-time AI communication." pagination_label: "Worker" keywords: - Worker - AI Agent SDK - VideoSDK Agents - Job Orchestration - Task Execution - Backend Registration - Agent Lifecycle - Process Management - Session Coordination - Real-time AI - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 13 sidebar_label: Worker slug: worker --- import { AgentCardGrid, SettingsIcon, PlayIcon, CodeIcon, ExternalLinkIcon, RobotIcon, GithubIcon } from '@site/src/components/agent/cards'; # Worker This document covers the `worker` and `job` execution system that manages `agent` processes, handles backend registration, and coordinates job assignment and execution. 
This system provides the foundation for running VideoSDK agents either locally or as part of a distributed backend infrastructure. ## Architecture Overview The `worker` and `job` system consists of three primary components that work together to execute agent code: - **WorkerJob**: The main entry point that configures and starts agent execution - **Worker**: Manages process pools, backend communication, and job lifecycle - **JobContext**: Provides runtime context and resources to agent entrypoint functions ![Worker](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_worker.png) ## Core Components ### Worker Class The `Worker` class manages the complete lifecycle of agent execution, including process management, backend communication, and job coordination. **Core Responsibilities:** - Process pool management and lifecycle - Backend registry communication - Job assignment and execution coordination - Resource monitoring and cleanup - Error handling and recovery ### WorkerJob The `WorkerJob` class serves as the primary entry point for creating and running agents. It accepts an `entrypoint function` and `configuration options`, then delegates to the Worker class for execution. ```python from videosdk.agents import WorkerJob, Options, JobContext, RoomOptions # Configure worker options options = Options( agent_id="MyAgent", max_processes=5, register=True, # Registers worker with backend for job scheduling ) # Set up room configuration room_options = RoomOptions( name="My Agent", ) # Create job context job_context = JobContext(room_options=room_options) # Define your agent entrypoint async def your_agent_function(ctx: JobContext): # Your agent logic here await ctx.connect() # Agent implementation... 
# Create and start the worker job # highlight-start job = WorkerJob( entrypoint=your_agent_function, jobctx=lambda: job_context, options=options, ) job.start() # highlight-end ``` - **Entrypoint:** An async function that serves as your agent's main execution logic. This function receives a `JobContext` parameter and contains your agent implementation. - **JobContext:** Provides the runtime environment for your agent, managing room connections and VideoSDK integration. It handles room setup, authentication, and cleanup operations. - **Options:** Configuration settings for worker execution including process management, authentication, and backend registration. You can find worker options [here ↗](https://docs.videosdk.live/ai_agents/deployments/self-hosting/worker-configuration#worker-options-explained). **Key Methods:** - `start()`: Initiates worker execution based on configuration ## Deployments Choose how to deploy your VideoSDK agents based on your infrastructure needs and requirements. ## Examples - Try Out Yourself We have examples to get you started. Try them out, talk to the agent, and customize them to fit your needs. } ]} /> --- --- title: Deploy Your Agents hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." pagination_label: "Deploy Your Agents" keywords: - AI Agent SDK - VideoSDK Agents - Introduction - Python SDK - AI Integration - VideoSDK Cloud image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Deploy Your Agents slug: deploy-your-agents --- # Deploy Your Agents This guide shows you how to deploy AI Agents with the [videosdk-agents](https://pypi.org/project/videosdk-agents/) Python package. Once your AI Agent is ready to use, you need to create an AI Deployment. 
The AI Deployment is responsible for running your AI Agent. Before proceeding, ensure you have completed the steps under **Prerequisites**. ## Prerequisites To deploy your AI Deployment, make sure you have: - Created an AI Deployment using the [Create AI Deployment API](/api-reference/agent-cloud/create-deployment). - A VideoSDK authentication token (generate from [VideoSDK Dashboard](https://app.videosdk.live)) ## YAML Configuration Create a `videosdk.yaml` file with the following structure: ``` version: "1.0" deployment: id: your_ai_deployment_id entry: path: entry_point_for_deployment env: # Optional to run your agent locally path: "./.env" secrets: VIDEOSDK_AUTH_TOKEN: your_auth_token deploy: cloud: true ``` ### Field Descriptions | Field | Description | | ----------------------------- | ---------------------------------------------------------------------------------------------- | | `deployment.id` | The `deploymentId` obtained from the [Create AI Deployment API](/api-reference/agent-cloud/create-deployment) | | `deployment.entry.path` | Path to the entry point script for your AI Deployment. | | `env.path` | Path to your `.env` file, used only when running the agent locally. | | `secrets.VIDEOSDK_AUTH_TOKEN` | Your VideoSDK auth token (required for deployment). | | `deploy.cloud` | Set to `true` to allow deploying the deployment to VideoSDK Cloud, when using the deploy command. Use `false` to avoid accidental deploys. | ## CLI Commands - ###### Run the AI Deployment locally for Testing. ``` videosdk run ``` - ###### Deploy the AI Deployment. ``` videosdk deploy ``` ## Next Steps After deploying your AI Deployment, you can start using it by: 1. Creating a new session using the [Start Session API](/api-reference/agent-cloud/start-session) 2. 
Ending the session using the [End Session API](/api-reference/agent-cloud/end-session) --- --- title: Authentication hide_title: false hide_table_of_contents: false description: "Learn how to authenticate with the VideoSDK CLI using login and logout commands. Set up your credentials for Agent Cloud deployments." pagination_label: "CLI Authentication" keywords: - VideoSDK CLI - Authentication - Login - Logout - Auth Token - Agent Cloud image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Authentication slug: authentication --- # Authentication Before deploying agents to Agent Cloud, you need to authenticate the VideoSDK CLI with your account. This section covers the authentication commands that link your local development environment to your VideoSDK account. ## Login The `login` command authenticates your CLI session with your VideoSDK account. ### Usage ```bash videosdk auth login ``` ### What Happens 1. **Browser Opens**: The CLI automatically opens your default browser to the authentication page 2. **Login & Confirm**: You log in to your VideoSDK account and confirm the CLI authentication request 3. **Token Storage**: Once approved, an authentication token is securely stored locally 4. **Ready to Deploy**: The stored token will be used for all future CLI commands ### Example Output ```bash $ videosdk auth login ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Authentication ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ℹ Initiating browser authentication... ✓ Opened authentication URL in browser https://app.videosdk.live/cli/confirm-auth?requestId=abc123xyz ⠋ Waiting for authentication... ✓ Successfully authenticated! 
``` ### Notes - The CLI will wait for you to complete the authentication in the browser - If the browser doesn't open automatically, copy and paste the displayed URL - Authentication will timeout if not completed within the specified time - You can cancel the authentication anytime by pressing `Ctrl+C` ## Logout The `logout` command removes the stored authentication token from your local environment. ### Usage ```bash videosdk auth logout ``` ### What Happens 1. **Token Removal**: The stored authentication token is removed from your local configuration 2. **Session End**: Your CLI session is disconnected from the VideoSDK account ### Example Output ```bash $ videosdk auth logout ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Logout ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ✓ Successfully logged out ``` ### Notes - After logging out, you'll need to run `videosdk auth login` again before using authenticated commands - This command does not affect any existing deployments on Agent Cloud ## Next Steps After authenticating, you're ready to initialize your agent. See the [Initialize Agent](./init-agent) documentation for more details. --- --- title: Build & Push hide_title: false hide_table_of_contents: false toc_max_heading_level: 2 description: "Learn how to build and push Docker images for your AI agents using the VideoSDK CLI. Package your agent code and deploy to container registries." pagination_label: "CLI Build & Push" keywords: - VideoSDK CLI - Docker Build - Docker Push - Container Registry - Agent Deployment image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: Build & Push slug: build-push --- # Build & Push After developing your AI agent, you need to package it as a Docker image and push it to a container registry before deploying to Agent Cloud. This section covers the build and push commands. ## Prerequisites Before building your image, you must have a `Dockerfile` and a `requirements.txt` file in your project's root directory. 
The `Dockerfile` is automatically generated when you run [videosdk agent init](./init-agent). ### requirements.txt Create a `requirements.txt` file listing all the Python packages your agent needs. ### Dockerfile A `Dockerfile` is automatically created when you run `videosdk agent init`. Below is a minimal **multi-stage build** example that keeps your final image small while ensuring all build-time dependencies are met. ```dockerfile # Stage 1: Build Stage FROM python:3.12-slim AS builder WORKDIR /app # Install build-time dependencies # These are only needed to compile/build packages like aec-audio-processing RUN apt-get update && apt-get install -y --no-install-recommends \ build-essential \ python3-dev \ swig \ pkg-config \ meson \ ninja-build \ && rm -rf /var/lib/apt/lists/* # Create a virtual environment to keep dependency installation isolated and easy to move RUN python -m venv /opt/venv ENV PATH="/opt/venv/bin:$PATH" COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt # Stage 2: Runtime Stage FROM python:3.12-slim WORKDIR /app # Copy the virtual environment from the builder stage # This includes all the installed Python packages but NONE of the build tools COPY --from=builder /opt/venv /opt/venv # Ensure the app uses the virtual environment's Python and packages ENV PATH="/opt/venv/bin:$PATH" # Install minimal runtime libraries if needed (e.g., libstdc++ for compiled extensions) RUN apt-get update && apt-get install -y --no-install-recommends \ libstdc++6 \ && rm -rf /var/lib/apt/lists/* # Copy your application code COPY agent.py . # Run the application CMD ["python", "agent.py"] ``` ## Build The `build` command creates a Docker image for your agent using a Dockerfile. 
### Usage ```bash videosdk agent build [OPTIONS] ``` ### Options | Option | Short | Description | Default | | --------- | ----- | -------------------------------------------------------- | -------------------- | | `--image` | `-i` | Image name with optional tag (e.g., `myrepo/myagent:v1`) | From `videosdk.yaml` | | `--file` | `-f` | Path to Dockerfile | `./Dockerfile` | ### What Happens 1. **Dockerfile Detection**: The CLI locates your Dockerfile (default: `./Dockerfile`) 2. **Image Build**: Docker builds the image for the `linux/arm64` platform 3. **Local Storage**: The built image is stored in your local Docker registry ### Example Output ```bash $ videosdk agent build --image myrepo/myagent:v1 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Building Docker Image ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Platform linux/arm64 Image myrepo/myagent:v1 Dockerfile /path/to/your/project/Dockerfile ──────────────────────────────────────── [Docker build output...] ✓ Successfully built image: myrepo/myagent:v1 ``` ### Examples ```bash # Build with explicit image name videosdk agent build --image myrepo/myagent:v1 # Build with custom Dockerfile videosdk agent build --image myrepo/myagent:v1 --file Dockerfile.prod # Build using image from videosdk.yaml videosdk agent build ``` ### Notes - The image name must be lowercase - In examples like `myrepo/myagent:v1`, `myrepo` is a placeholder for your Docker registry username (e.g., your Docker Hub username). - If `--image` is not provided, the CLI reads from `agent.image` in your `videosdk.yaml` - Docker must be installed and running on your machine - The build uses `linux/arm64` platform for Agent Cloud compatibility ## Push The `push` command uploads your Docker image to a container registry. 
### Usage ```bash videosdk agent push [OPTIONS] ``` ### Options | Option | Short | Description | Default | | ------------ | ----- | ----------------------------------------------- | -------------------- | | `--image` | `-i` | Image name with tag (e.g., `myrepo/myagent:v1`) | From `videosdk.yaml` | | `--server` | `-s` | Registry server URL | `docker.io` | | `--username` | `-u` | Registry username for authentication | None | | `--password` | `-p` | Registry password for authentication | None | ### What Happens 1. **Authentication** (optional): If credentials are provided, the CLI logs into the registry 2. **Image Push**: The Docker image is uploaded to the specified registry 3. **Ready for Deploy**: The image is now available for Agent Cloud deployments ### Example Output ```bash $ videosdk agent push --image myrepo/myagent:v1 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Pushing Docker Image ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Image myrepo/myagent:v1 Registry docker.io ──────────────────────────────────────── [Docker push output...] ✓ Successfully pushed image: myrepo/myagent:v1 ``` ### Examples ```bash # Push to Docker Hub (default) videosdk agent push --image myrepo/myagent:v1 # Push to GitHub Container Registry with authentication videosdk agent push --image myrepo/myagent:v1 --server ghcr.io -u username -p token # Push to private registry videosdk agent push --image myrepo/myagent:v1 --server registry.example.com -u user -p pass # Push using image from videosdk.yaml videosdk agent push ``` ### Supported Registries | Registry | Server URL | | ------------------------- | ------------------------------------------ | | Docker Hub | `docker.io` (default) | | GitHub Container Registry | `ghcr.io` | | AWS ECR | `.dkr.ecr..amazonaws.com` | | Google Container Registry | `gcr.io` | | Azure Container Registry | `.azurecr.io` | ### Notes - Ensure the image is built before pushing (`videosdk agent build`) - Replace `myrepo` with your actual Docker registry username. 
- For Docker Hub, you can omit `--server` as it's the default - For private registries, you must provide authentication credentials - The registry server is automatically detected from the image name if `--server` is not specified ## YAML Configuration Both commands can read the image name from your `videosdk.yaml` configuration file: ```yaml agent: id: your-agent-id image: myrepo/myagent:v1 ``` When `image` is configured in `videosdk.yaml`, you can run both commands without flags: ```bash videosdk agent build videosdk agent push ``` ## Example Here's a typical workflow for building and pushing your agent: ```bash # 1. Build the Docker image videosdk agent build --image myrepo/myagent:v1 # 2. Push to container registry videosdk agent push --image myrepo/myagent:v1 # 3. Deploy to Agent Cloud (covered in deployment docs) videosdk agent deploy --image myrepo/myagent:v1 ``` ## Next Steps After pushing your image to a container registry, you're ready to deploy your agent to Agent Cloud. See the [deployment documentation](./deploy) for more details. --- --- title: Deploy & Version Commands hide_title: false hide_table_of_contents: false toc_max_heading_level: 2 description: "Learn how to deploy and manage versions of your AI agents on Agent Cloud using VideoSDK CLI commands." pagination_label: "CLI Deployment" keywords: - VideoSDK CLI - Agent Deploy - Version Management - Agent Cloud - Replicas - Scaling image: img/videosdklive-thumbnail.jpg sidebar_position: 4 sidebar_label: Deployment slug: deploy --- # Deploy & Version Commands This section covers all CLI commands for deploying and managing versions of your AI agents on Agent Cloud. ## Deploy Create a new version of your agent on VideoSDK cloud.
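When deployments are scripted, for example from CI, it can help to assemble the `videosdk agent deploy` arguments in one place and check the replica bounds before calling the CLI. The following is a minimal Python sketch under stated assumptions: `build_deploy_args` is a hypothetical helper, not part of the CLI; it uses only flags and defaults listed in the options table for this command; and the `min <= max` check is a local sanity check rather than documented CLI behavior.

```python
import shlex
from typing import List, Optional


def build_deploy_args(
    image: str,
    min_replica: int = 0,
    max_replica: int = 3,
    profile: str = "cpu-small",
    region: Optional[str] = None,
) -> List[str]:
    """Build an argument list for `videosdk agent deploy` (hypothetical helper)."""
    # Local sanity check; the CLI itself may validate differently.
    if min_replica < 0 or max_replica < min_replica:
        raise ValueError("expected 0 <= min_replica <= max_replica")
    args = [
        "videosdk", "agent", "deploy",
        "--image", image,
        "--min-replica", str(min_replica),
        "--max-replica", str(max_replica),
        "--profile", profile,
    ]
    # Only pass --region when the caller sets it, so the CLI default applies otherwise.
    if region is not None:
        args += ["--region", region]
    return args


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection.
    print(shlex.join(build_deploy_args("myrepo/myagent:v1", region="in002")))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues in image names or secret names.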
### Usage ```bash videosdk agent deploy [NAME] [OPTIONS] ``` ### Arguments | Argument | Required | Description | | -------- | -------- | --------------------- | | `NAME` | No | Optional version name | ### Options | Option | Short | Description | Default | | --------------------- | ----- | ------------------------------------------------------- | -------------------- | | `--image` | `-i` | Docker image URL (e.g., `myrepo/myagent:v1`) | From `videosdk.yaml` | | `--version-tag` | | Version tag (e.g., `main/0.0.2`) | None | | `--min-replica` | | Minimum number of replicas | `0` | | `--max-replica` | | Maximum number of replicas | `3` | | `--profile` | | Compute profile: `cpu-small`, `cpu-medium`, `cpu-large` | `cpu-small` | | `--agent-id` | | Agent ID | From `videosdk.yaml` | | `--deployment-id` | | Deployment ID | From `videosdk.yaml` | | `--env-secret` | | ID of the environment secret set to use | From `videosdk.yaml` | | `--image-pull-secret` | | Name of the image pull secret for private registries | From `videosdk.yaml` | | `--region` | | Deployment region (e.g., `in002`, `us002`) | `us002` | :::note In examples like `myrepo/myagent:v1`, `myrepo` is a placeholder for your Docker registry username (e.g., your Docker Hub username). ::: ### Example Output ```bash $ videosdk agent deploy --image myrepo/myagent:v1 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Creating Version ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Version Name my-agent's version Agent ID abc123xyz Deployment ID dep-456 Image myrepo/myagent:v1 ──────────────────────────────────────── ⠋ Creating Version... 
✓ Version created successfully for agent: abc123xyz Version ID ver-789 ℹ Next step: Check version status videosdk agent version status -v ver-789 ``` ### Examples ```bash # Basic deployment videosdk agent deploy --image myrepo/myagent:v1 # Named version with version tag videosdk agent deploy my-version --image myrepo/myagent:v1 --version-tag main/0.0.1 # Deployment with secrets and custom profile videosdk agent deploy --image myrepo/myagent:v1 --env-secret my-secrets --profile cpu-medium # Deployment to specific region videosdk agent deploy --image myrepo/myagent:v1 --region in002 ``` ## List List all versions for an agent deployment. ### Usage ```bash videosdk agent version list [OPTIONS] ``` ### Options | Option | Description | Default | | ----------------- | ----------------------------------------------------- | -------------------- | | `--agent-id` | Agent ID | From `videosdk.yaml` | | `--deployment-id` | Deployment ID | From `videosdk.yaml` | | `--page` | Page number | `1` | | `--per-page` | Items per page | `10` | | `--sort` | Sort order: `1` (oldest first) or `-1` (newest first) | `-1` | ### Example Output ```bash $ videosdk agent version list ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Listing Versions ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Agent ID abc123xyz Deployment ID dep-456 ┌────────────┬─────────┬──────────┬────────────┬─────────────┐ │ Version ID │ Status │ Region │ Profile │ Replicas │ ├────────────┼─────────┼──────────┼────────────┼─────────────┤ │ ver-001 │ active │ us002 │ cpu-small │ 2/5 │ │ ver-002 │ inactive│ in002 │ cpu-medium │ 0/10 │ └────────────┴─────────┴──────────┴────────────┴─────────────┘ ``` ### Examples ```bash # List all versions videosdk agent version list # List versions for specific agent videosdk agent version list --agent-id abc123 # Paginated listing videosdk agent version list --page 2 --per-page 20 # Sort oldest first videosdk agent version list --sort 1 ``` ## Update Update an existing version configuration. 
### Usage ```bash videosdk agent version update [OPTIONS] ``` ### Options | Option | Short | Description | Required | | --------------------- | ----- | --------------------------- | -------- | | `--version-id` | `-v` | Version ID to update | **Yes** | | `--min-replica` | | New minimum replicas | No | | `--max-replica` | | New maximum replicas | No | | `--profile` | | New compute profile | No | | `--image-pull-secret` | | New image pull secret name | No | | `--env-secret` | | New environment secret name | No | ### Example Output ```bash $ videosdk agent version update -v ver123 --min-replica 3 --max-replica 15 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Updating Version ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Min Replicas 3 Max Replicas 15 ⠋ Updating Version... ✓ Version updated successfully ``` ### Examples ```bash # Update replica counts videosdk agent version update -v ver123 --min-replica 2 --max-replica 10 # Update to larger profile videosdk agent version update -v ver123 --profile cpu-large # Update environment secrets videosdk agent version update -v ver123 --env-secret new-secrets ``` ## Activate Activate a version to start receiving traffic. ### Usage ```bash videosdk agent version activate [OPTIONS] ``` ### Options | Option | Short | Description | Required | | -------------- | ----- | ---------------------- | -------- | | `--version-id` | `-v` | Version ID to activate | **Yes** | ### Example Output ```bash $ videosdk agent version activate -v ver123 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Activating Version ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⠋ Activating Version... ✓ Version activated successfully ``` ## Deactivate Deactivate a version to stop receiving new traffic. 
### Usage ```bash videosdk agent version deactivate [OPTIONS] ``` ### Options | Option | Short | Description | Required | | -------------- | ----- | ------------------------------------------ | -------- | | `--version-id` | `-v` | Version ID to deactivate | **Yes** | | `--force` | | Force deactivate even with active sessions | No | ### Example Output ```bash $ videosdk agent version deactivate -v ver123 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Deactivating Version ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⠋ Deactivating Version... ✓ Version deactivated successfully ``` ### Examples ```bash # Graceful deactivation videosdk agent version deactivate -v ver123 # Force deactivation (terminates active sessions) videosdk agent version deactivate -v ver123 --force ``` ## Status Get the current status of a version. ### Usage ```bash videosdk agent version status [OPTIONS] ``` ### Options | Option | Short | Description | Required | | -------------- | ----- | ------------------- | -------- | | `--version-id` | `-v` | Version ID to check | **Yes** | ### Example Output ```bash $ videosdk agent version status -v ver123 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Getting Version Status ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Version ID ver123 Status active Replicas 3/5 running Health healthy ℹ Next step: Start a session videosdk agent session start -v ver123 ``` ## Describe Get detailed information about a version. 
### Usage ```bash videosdk agent version describe [OPTIONS] ``` ### Options | Option | Short | Description | Required | | -------------- | ----- | ---------------------- | -------- | | `--version-id` | `-v` | Version ID to describe | **Yes** | ### Example Output ```bash $ videosdk agent version describe -v ver123 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Describing Version ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Version ID ver123 Agent ID abc123xyz Deployment ID dep-456 Status active Image myrepo/myagent:v1 Profile cpu-medium Min Replicas 2 Max Replicas 10 Current Replicas 5 Region in002 Created At 2026-01-15 10:30:00 ``` ## Quick Reference | Command | Description | | ----------------------------------- | ------------------------- | | `videosdk agent deploy` | Create a new version | | `videosdk agent version list` | List all versions | | `videosdk agent version update` | Update version config | | `videosdk agent version activate` | Activate a version | | `videosdk agent version deactivate` | Deactivate a version | | `videosdk agent version status` | Get version status | | `videosdk agent version describe` | Get detailed version info | ## videosdk.yaml Reference You can configure your entire deployment process in the `videosdk.yaml` file. This allows you to run commands like `videosdk agent build`, `videosdk agent push`, and `videosdk agent deploy` without providing extra flags. 
```yaml agent: id: ag_xxxxxx # automatically generated name: agent-test # automatically generated, if not provided in videosdk agent init build: image: username/myagent:v1 # docker image logs: id: b_xxxxxx # automatically generated enabled: true # whether build logs are enabled deploy: id: dep_xxxxxx # automatically generated replicas: min: 0 max: 3 profile: cpu-small region: us002 # options: in002, us002 secrets: env: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx # env secrets, global image-pull: image-pull-secret-name # image pull secrets for private images, region specific ``` ## Next Steps - Learn about managing [environment secrets](./env-secrets.md) or [image pull secrets](./image-pull-secrets.md) for your deployments - View [sessions](./sessions) for your agents - Use [Up & Down](./up-down) commands to streamline your workflow --- --- title: Environment Secrets hide_title: false hide_table_of_contents: false toc_max_heading_level: 3 description: "Manage environment secrets for your AI agents using the VideoSDK CLI." pagination_label: "Environment Secrets" keywords: - VideoSDK CLI - Secrets - Environment Variables - Agent Cloud image: img/videosdklive-thumbnail.jpg sidebar_position: 4 sidebar_label: Environment Secrets slug: env-secrets --- # Environment Secrets Environment secrets are key-value pairs that are securely injected as environment variables into your agent containers at runtime. ## List List all secret sets. ### Usage ```bash videosdk agent secrets list ``` ### Example Output ```bash $ videosdk agent secrets list ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Listing Secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ┌──────────────────┬─────────────────┬──────────┐ │ Name │ Secret ID │ Type │ ├──────────────────┼─────────────────┼──────────┤ │ my-secrets │ sec-abc123 │ env │ │ prod-credentials │ sec-xyz789 │ env │ └──────────────────┴─────────────────┴──────────┘ ✓ Secrets listed successfully ``` ## Create Create a new secret set.
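Since `create` can read key=value pairs from a `.env` file, previewing which keys a file would contribute can be useful before running the command. The following is a minimal Python sketch, not the CLI's own parser: the function names `parse_env_text` and `masked_preview` are hypothetical, and the parsing rules (skip blank lines and `#` comments, split on the first `=`) are assumptions that may differ from how the CLI handles quoting or `export` prefixes.

```python
from typing import Dict, List


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines (assumed format; not the CLI's parser)."""
    pairs: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without a separator.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, value = line.split("=", 1)
        pairs[key.strip()] = value.strip()
    return pairs


def masked_preview(pairs: Dict[str, str]) -> List[str]:
    """List keys with masked values, similar in spirit to the CLI's confirmation prompt."""
    return [f"- {key}: ******" for key in pairs]


if __name__ == "__main__":
    sample = "API_KEY=abc123\n# comment\nDATABASE_URL=postgres://u:p@host/db?x=1\n"
    print("\n".join(masked_preview(parse_env_text(sample))))
```

The preview prints keys only, so it is safe to run in shared terminals or CI logs.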
### Usage ```bash videosdk agent secrets create [OPTIONS] ``` ### Options | Option | Short | Description | Default | | ---------- | ----- | -------------------------------------- | ----------------------- | | `--file` | `-f` | Path to .env file with key=value pairs | None (interactive mode) | | `--region` | | Region for storing secrets | None | ### Example Output ```bash $ videosdk agent secrets create my-secrets --file .env ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Creating Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Secret Name: my-secrets File: .env Secrets to be saved: - API_KEY: ****** - DATABASE_URL: ****** Confirm action ❯ Save secrets Cancel Saving secrets... Secrets saved successfully. ✓ Secret 'my-secrets' created successfully Do you want to add this secret to videosdk.yaml? [y/N]: y ✓ Secret ID saved to videosdk.yaml: ``` ### videosdk.yaml Structure When saved to `videosdk.yaml`, the secret ID is added under the `secrets` section: ```yaml secrets: env: ``` ### Examples ```bash # Create from .env file videosdk agent secrets create my-secrets --file .env # Create interactively (will prompt for key-value pairs) videosdk agent secrets create my-secrets # Create with specific region videosdk agent secrets create my-secrets --file .env --region in002 ``` ## Add Add new keys to an existing secret set. ### Usage ```bash videosdk agent secrets add ``` ### Example Output ```bash $ videosdk agent secrets add my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Adding to Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Adding secret... Enter key: NEW_API_KEY Enter value: ******** Add another secret? ❯ Yes No Secrets to be saved: - NEW_API_KEY: ****** Confirm action ❯ Save secrets Cancel Secret added successfully. ✓ Keys added to secret 'my-secrets' successfully ``` ## Remove Remove specific keys from a secret set.
### Usage ```bash videosdk agent secrets remove ``` ### Example Output ```bash $ videosdk agent secrets remove my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Removing Keys from Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Removing secret... Enter key: OLD_API_KEY Remove another key? ❯ Yes No Secret removed successfully. ``` ## Describe Show details of a secret set (keys only, values are hidden). ### Usage ```bash videosdk agent secrets describe ``` ### Example Output ```bash $ videosdk agent secrets describe my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Describing Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Name my-secrets Secret ID sec-abc123 Type env ┌──────────────────┬──────────┐ │ Key │ Value │ ├──────────────────┼──────────┤ │ OPENAI_API_KEY │ ****** │ │ DATABASE_URL │ ****** │ │ SECRET_TOKEN │ ****** │ └──────────────────┴──────────┘ ``` ## Delete Permanently delete a secret set. ### Usage ```bash videosdk agent secrets delete ``` ### Example Output ```bash $ videosdk agent secrets delete my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Deleting Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ✓ Secret 'my-secrets' deleted successfully ``` > This action is permanent and cannot be undone. All keys in the secret set will be deleted. ## Using Environment Secrets in Deployments Once you've created environment secrets, you can reference them when deploying your agent: ```bash videosdk agent deploy --image myrepo/myagent:v1 --env-secret my-secrets ``` --- --- title: Image Pull Secrets hide_title: false hide_table_of_contents: false toc_max_heading_level: 3 description: "Create and manage image pull secrets for private registries using the VideoSDK CLI." 
pagination_label: "Image Pull Secrets" keywords: - VideoSDK CLI - Secrets - Image Pull Secret - Container Registry - Agent Cloud image: img/videosdklive-thumbnail.jpg sidebar_label: Image Pull Secrets sidebar_position: 5 slug: image-pull-secrets --- # Image Pull Secrets Image pull secrets store container registry credentials, allowing Agent Cloud to pull images from private registries. ## Create Image Pull Secret Create an image pull secret for private container registries. ### Usage ```bash videosdk agent image-pull-secret [OPTIONS] ``` ### Arguments | Argument | Required | Description | | -------- | -------- | ------------------------------ | | `name` | **Yes** | Name for the image pull secret | ### Options | Option | Short | Required | Description | Default | | ------------ | ----- | -------- | --------------------------------- | ------- | | `--server` | | **Yes** | Registry server URL | — | | `--username` | `-u` | **Yes** | Registry username | — | | `--password` | `-p` | **Yes** | Registry password or access token | — | | `--region` | | No | Deployment region for the secret | `us002` | ::::note `--username` and `--password` are **container registry credentials**, not your VideoSDK login or cloud console password. :::: ## Obtaining Registry Credentials Depending on your registry provider, follow the steps below to get the correct username and password: ### AWS Elastic Container Registry (ECR) - **Account ID & Region**: Your server URL will be `.dkr.ecr..amazonaws.com`. - **Username**: Always use `AWS`. - **Password**: Generate a temporary token using the AWS CLI: ```bash aws ecr get-login-password --region ``` ### Azure Container Registry (ACR) - **Username**: Use the **Registry Name** (found in Azure Portal > ACR > Access Keys). - **Password**: Enable the **Admin user** in ACR settings and use one of the generated passwords. - *Alternatively*, use a **Service Principal** Application ID as the username and its Secret as the password. 
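The ECR and ACR server values described above follow fixed host name patterns, so they can be derived from identifiers you already have. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical and not part of the VideoSDK CLI, the ECR pattern is the standard `account.dkr.ecr.region.amazonaws.com` form (matching the example host `1234567890.dkr.ecr.ap-south-1.amazonaws.com` used in this page), and the ACR pattern matches the `myregistry.azurecr.io` example.

```python
def ecr_server(account_id: str, region: str) -> str:
    """Build the ECR registry host for an AWS account and region (hypothetical helper)."""
    if not account_id or not region:
        raise ValueError("account_id and region are required")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def acr_server(registry_name: str) -> str:
    """Build the ACR login server host for a registry name (hypothetical helper)."""
    if not registry_name:
        raise ValueError("registry_name is required")
    return f"{registry_name}.azurecr.io"


if __name__ == "__main__":
    # Values suitable for the --server flag of `videosdk agent image-pull-secret`.
    print(ecr_server("1234567890", "ap-south-1"))
    print(acr_server("myregistry"))
```

Deriving the host in one place keeps the value passed to `--server` consistent with the registry prefix used in the image name.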
### Google Artifact Registry (GAR) - **Server URL**: `-docker.pkg.dev` - **Username**: Always use `_json_key`. - **Password**: The **content of your Service Account JSON key file**. ```bash # Example usage -p "$(cat keyfile.json)" ``` - **Permissions**: Ensure the service account has the `Artifact Registry Reader` role. ### Docker Hub - **Username**: Your Docker Hub username. - **Password**: Use a **Personal Access Token (PAT)** instead of your account password. - Go to **Account Settings** > **Security** > **New Access Token**. ### GitHub Container Registry (GHCR) - **Username**: Your GitHub username. - **Password**: A **Personal Access Token (classic)** with `read:packages` scope. --- ### What Happens 1. The CLI validates the required registry details provided via flags. 2. Credentials are securely stored on VideoSDK Cloud. 3. **Automatic Configuration**: The CLI prompts you to save the secret to your `videosdk.yaml` file. If confirmed, the secret name is automatically added under the `secrets` section. #### Example Output ```bash ✓ Image pull secret 'my-registry-secret' created successfully Do you want to add this secret to videosdk.yaml? 
[y/N]: y ✓ Secret Name saved to videosdk.yaml: my-registry-secret ``` #### videosdk.yaml Structure ```yaml secrets: image-pull: my-registry-secret ``` ### Examples ```bash # ECR (AWS) videosdk agent image-pull-secret my-ecr-secret \ --server 1234567890.dkr.ecr.ap-south-1.amazonaws.com \ -u AWS \ -p $(aws ecr get-login-password --region ap-south-1) # ACR (Azure) videosdk agent image-pull-secret my-acr-secret \ --server myregistry.azurecr.io \ -u myusername \ -p mypassword # Google Artifact Registry (GAR) videosdk agent image-pull-secret my-gcr-secret \ --server https://-docker.pkg.dev \ -u _json_key \ -p "$(cat keyfile.json)" \ --region us002 # Docker Hub videosdk agent image-pull-secret my-dockerhub-secret \ --server https://index.docker.io/v1/ \ -u myusername \ -p mypassword \ --region us002 ``` ## Using Image Pull Secrets in Deployments Once you've created an image pull secret, you can reference it when deploying your agent: ```bash videosdk agent deploy --image ghcr.io/myorg/myagent:v1 --image-pull-secret ghcr-secret ``` ## Next Steps: Registry-Specific Guides Follow these step-by-step guides for building, pushing, and deploying agents from popular registries to VideoSDK Cloud: - [Deploy from AWS ECR to VideoSDK Cloud](/ai_agents/deployments/agent-cloud/deployment-guides/ecr-to-videosdk-cloud) - [Deploy from Azure Container Registry (ACR) to VideoSDK Cloud](/ai_agents/deployments/agent-cloud/deployment-guides/acr-to-videosdk-cloud) - [Deploy from Google Container Registry (GCR) to VideoSDK Cloud](/ai_agents/deployments/agent-cloud/deployment-guides/gcr-to-videosdk-cloud) --- --- title: Initialize Agent hide_title: false hide_table_of_contents: false description: "Learn how to initialize a new AI agent deployment using the VideoSDK CLI. Set up your agent configuration and deployment settings." 
pagination_label: "CLI Initialize" keywords: - VideoSDK CLI - Initialize Agent - Agent Init - videosdk.yaml - Agent Cloud sidebar_position: 2 sidebar_label: Initialize slug: init-agent --- # Initialize Agent The `init` command sets up a new agent deployment by creating an agent and a deployment in VideoSDK cloud. It also generates a `videosdk.yaml` configuration file and a `Dockerfile` in your project directory. ## Initialize Create a new agent and deployment. ### Usage ```bash videosdk agent init --name my-agent ``` ### Options | Option | Short | Description | Default | | ----------- | ----- | ------------------------------------------------ | --------------------------- | | `--name` | `-n` | Name for your deployment | Auto-generated if not provided | | `--template`| `-t` | Template ID to use (e.g., Template01) | None | ### What Happens 1. **Cloud Creation**: The CLI communicates with VideoSDK cloud to create a new agent and a corresponding deployment. 2. **Config Generation**: A `videosdk.yaml` file is created in your current directory. This file contains the unique IDs for your agent and deployment. 3. **Dockerfile Generation**: A standard `Dockerfile` is automatically created, optimized for running VideoSDK AI agents. 4. **Project Setup**: Your local project is now linked to the cloud resources. ### Example Output ```bash $ videosdk agent init --name my-agent ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Initializing Deployment ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⠋ Initializing Deployment... 
✓ Deployment initialized successfully ℹ Next step: Build your agent Docker image videosdk agent build --image /: ``` ### videosdk.yaml Structure The generated `videosdk.yaml` file will look like this: ```yaml agent: id: ag_48bnvu name: agent-test deploy: id: ddv_h3b5sd ``` | Field | Description | | ---------- | ------------------------------------------------ | | `agent.id` | Unique identifier for your AI agent | | `agent.name`| Name of your agent | | `deploy.id`| Unique identifier for this specific deployment | ### Update Agent Code After the `videosdk.yaml` file is generated, you must update your agent's code with the `id` from the `agent` field. This links your local agent logic to the cloud resource. 1. Open your `videosdk.yaml` file and copy the `id` under the `agent` section. 2. In your Python agent code, set the `agent_id` in the `Options` class. **Example:** ```python if __name__ == "__main__": options = Options( agent_id="ag_48bnvu", # Use the id from your videosdk.yaml ) job = WorkerJob(entrypoint=start_session, jobctx=make_context, options=options) job.start() ``` ### Notes - You should run this command in the root of your agent's project directory. - The `videosdk.yaml` file should be committed to your version control system (e.g., Git). - If you already have a `videosdk.yaml` file, running `init` again will prompt you or might overwrite settings depending on the version. ## Next Steps After initializing your agent, the next step is to build and push your agent's Docker image. See the [Build & Push](./build-push) documentation for more details. --- --- title: Installation hide_title: false hide_table_of_contents: false description: "Get started with VideoSDK CLI. Learn how to install the CLI on Linux and macOS using curl or pip." 
pagination_label: "CLI Installation" keywords: - VideoSDK CLI - Installation - Install - curl - pip - Agent Cloud sidebar_position: 0 sidebar_label: Installation slug: installation --- # Installation To get started with VideoSDK Agent Cloud, you need to install the VideoSDK CLI. There are two ways to install it: ## Using pip You can install the VideoSDK CLI using `pip`, the Python package manager. ```bash pip install videosdk-cli ``` ## Using curl This is the quickest way to install the VideoSDK CLI on Linux and macOS. ```bash curl -fsSL https://videosdk.live/install | bash ``` ## Verify Installation Once installed, you can verify the installation by running the help command. ```bash videosdk --help ``` ## Next Steps After installing the CLI, the next step is to authenticate your account. See the [Authentication](./authentication) documentation for more details. --- --- title: Secrets Management hide_title: false hide_table_of_contents: false toc_max_heading_level: 3 description: "Learn how to manage environment secrets and image pull secrets for your AI agents using the VideoSDK CLI." pagination_label: "CLI Secrets" keywords: - VideoSDK CLI - Secrets - Environment Variables - Image Pull Secret - Container Registry - Agent Cloud image: img/videosdklive-thumbnail.jpg sidebar_position: 5 sidebar_label: Secrets slug: secrets --- # Secrets Management This section covers all CLI commands for managing secrets used by your AI agents on Agent Cloud. Secrets allow you to securely store sensitive configuration values like API keys, database credentials, and registry authentication. There are two types of secrets: - **Environment secrets**: key-value pairs injected as environment variables. - **Image pull secrets**: container registry credentials for pulling private images. ## Environment Secrets Environment secrets are key-value pairs that are securely injected as environment variables into your agent containers at runtime. ### List List all secret sets.
#### Usage ```bash videosdk agent secrets list ``` #### Example Output ```bash $ videosdk agent secrets list ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Listing Secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ┌──────────────────┬─────────────────┬──────────┐ │ Name │ Secret ID │ Type │ ├──────────────────┼─────────────────┼──────────┤ │ my-secrets │ sec-abc123 │ env │ │ prod-credentials │ sec-xyz789 │ env │ └──────────────────┴─────────────────┴──────────┘ ✓ Secrets listed successfully ``` ### Create Create a new secret set. #### Usage ```bash videosdk agent secrets create [OPTIONS] ``` #### Options | Option | Short | Description | Default | | ---------- | ----- | -------------------------------------- | ----------------------- | | `--file` | `-f` | Path to .env file with key=value pairs | None (interactive mode) | | `--region` | | Region for storing secrets | None | #### Example Output ```bash $ videosdk agent secrets create my-secrets --file .env ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Creating Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Secret Name: my-secrets File: .env Secrets to be saved: - API_KEY: ****** - DATABASE_URL: ****** Confirm action ❯ Save secrets Cancel Saving secrets... Secrets saved successfully. ✓ Secret 'my-secrets' created successfully Do you want to add this secret to videosdk.yaml? [y/N]: y ✓ Secret ID saved to videosdk.yaml: ``` #### videosdk.yaml Structure When saved to `videosdk.yaml`, the secret ID is added under the `secrets` section: ```yaml secrets: env: ``` #### Examples ```bash # Create from .env file videosdk agent secrets create my-secrets --file .env # Create interactively (will prompt for key-value pairs) videosdk agent secrets create my-secrets # Create with specific region videosdk agent secrets create my-secrets --file .env --region in002 ``` ### Add Add new keys to an existing secret set.
#### Usage ```bash videosdk agent secrets add ``` #### Example Output ```bash $ videosdk agent secrets add my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Adding to Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Adding secret... Enter key: NEW_API_KEY Enter value: ******** Add another secret? ❯ Yes No Secrets to be saved: - NEW_API_KEY: ****** Confirm action ❯ Save secrets Cancel Secret added successfully. ✓ Keys added to secret 'my-secrets' successfully ``` ### Remove Remove specific keys from a secret set. #### Usage ```bash videosdk agent secrets remove ``` #### Example Output ```bash $ videosdk agent secrets remove my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Removing Keys from Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Removing secret... Enter key: OLD_API_KEY Remove another key? ❯ Yes No Secret removed successfully. ``` ### Describe Show details of a secret set (keys only, values are hidden). #### Usage ```bash videosdk agent secrets describe ``` #### Example Output ```bash $ videosdk agent secrets describe my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Describing Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Name my-secrets Secret ID sec-abc123 Type env ┌──────────────────┬──────────┐ │ Key │ Value │ ├──────────────────┼──────────┤ │ OPENAI_API_KEY │ ****** │ │ DATABASE_URL │ ****** │ │ SECRET_TOKEN │ ****** │ └──────────────────┴──────────┘ ``` ### Delete Permanently delete a secret set. #### Usage ```bash videosdk agent secrets delete ``` #### Example Output ```bash $ videosdk agent secrets delete my-secrets ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Deleting Secret ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ✓ Secret 'my-secrets' deleted successfully ``` :::caution This action is permanent and cannot be undone. All keys in the secret set will be deleted. ::: ## Image Pull Secrets Image pull secrets store container registry credentials, allowing Agent Cloud to pull images from private registries. 
### Create Image Pull Secret Create an image pull secret for private container registries. #### Usage ```bash videosdk agent image-pull-secret [OPTIONS] ``` #### Arguments | Argument | Required | Description | | -------- | -------- | ------------------------------ | | `name` | **Yes** | Name for the image pull secret | #### Options | Option | Short | Required | Description | Default | | ------------------- | ----- | -------- | --------------------------------- | ------- | | `--server` | | **Yes** | Registry server URL | — | | `--username` | `-u` | **Yes** | Registry username | — | | `--password` | `-p` | **Yes** | Registry password or access token | — | | `--region` | | No | Deployment region for the secret | `us002` | ### What Happens 1. The CLI validates the required registry details provided via flags. 2. Credentials are securely stored and can be referenced in deployments. 3. **Automatic Configuration**: The CLI prompts you to save the secret to your `videosdk.yaml` file. If confirmed, the secret name is automatically added under the `secrets` section. #### Example Output ```bash ✓ Image pull secret 'my-registry-secret' created successfully Do you want to add this secret to videosdk.yaml? 
[y/N]: y ✓ Secret Name saved to videosdk.yaml: my-registry-secret ``` #### videosdk.yaml Structure ```yaml secrets: image-pull: my-registry-secret ``` #### Examples ```bash # ECR (AWS) videosdk agent image-pull-secret my-ecr-secret \ --server 1234567890.dkr.ecr.ap-south-1.amazonaws.com \ -u AWS \ -p $(aws ecr get-login-password --region ap-south-1) # ACR (Azure) videosdk agent image-pull-secret my-acr-secret \ --server myregistry.azurecr.io \ -u myusername \ -p mypassword # GCR (GCP) videosdk agent image-pull-secret my-gcr-secret \ --server https://-docker.pkg.dev \ -u _json_key \ -p "$(cat keyfile.json)" \ --region us002 # Docker Hub videosdk agent image-pull-secret my-dockerhub-secret \ --server https://index.docker.io/v1/ \ -u myusername \ -p mypassword \ --region us002 ``` ## Using Secrets in Deployments Once you've created secrets, you can reference them when deploying your agent: :::note In examples like `myrepo/myagent:v1`, `myrepo` is a placeholder for your Docker registry username (e.g., your Docker Hub username). Replace it with your actual username. 
::: ### Environment Secrets ```bash videosdk agent deploy --image myrepo/myagent:v1 --env-secret my-secrets ``` ### Image Pull Secrets ```bash videosdk agent deploy --image ghcr.io/myorg/myagent:v1 --image-pull-secret ghcr-secret ``` ### Combined Example ```bash videosdk agent deploy \ --image ghcr.io/myorg/myagent:v1 \ --env-secret prod-credentials \ --image-pull-secret ghcr-secret \ --min-replica 2 \ --max-replica 10 ``` ## Quick Reference | Command | Description | | ----------------------------------------- | --------------------------- | | `videosdk agent secrets list` | List all secret sets | | `videosdk agent secrets create ` | Create a new secret set | | `videosdk agent secrets add ` | Add keys to a secret | | `videosdk agent secrets remove ` | Remove keys from a secret | | `videosdk agent secrets describe ` | Show secret details | | `videosdk agent secrets delete ` | Delete a secret set | | `videosdk agent image-pull-secret ` | Create registry credentials | ## Best Practices 1. **Use .env files for bulk creation**: When you have many secrets, create a `.env` file and use `--file .env` 2. **Separate secrets by environment**: Create different secret sets for development, staging, and production 3. **Rotate secrets regularly**: Delete and recreate secrets periodically for security 4. **Use descriptive names**: Name your secrets clearly (e.g., `prod-api-keys`, `staging-db-creds`) --- --- title: Session Management hide_title: false hide_table_of_contents: false toc_max_heading_level: 3 description: "Learn how to start, stop, and list agent sessions using the VideoSDK CLI." pagination_label: "CLI Sessions" keywords: - VideoSDK CLI - Sessions - Agent Sessions - Room Management - Agent Cloud image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: Sessions slug: sessions --- # Session Management This section covers all CLI commands for managing agent sessions on Agent Cloud. Sessions represent individual instances of your agent running in rooms. 
## Session Commands Control individual agent sessions - start agents in rooms and stop running sessions. ### Start Start an agent session in a room. #### Usage ```bash videosdk agent session start [OPTIONS] ``` #### Options | Option | Short | Description | Default | | -------------- | ----- | -------------------------------------------------- | -------------------- | | `--version-id` | `-v` | Version ID to use | Latest version | | `--room-id` | `-r` | Room ID to join (creates new room if not provided) | Auto-created | | `--agent-id` | `-a` | Agent ID | From `videosdk.yaml` | #### Example Output ```bash $ videosdk agent session start -v ver123 -r room-abc ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Starting Session ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⠋ Starting Session... ✓ Session started successfully Room ID room-abc ℹ Useful commands: View logs: videosdk agent version logs Stop session: videosdk agent session stop -r room-abc ``` #### Examples ```bash # Start with specific version and room videosdk agent session start -v ver123 -r room-abc # Start with specific version (creates new room) videosdk agent session start -v ver123 # Start with latest version in existing room videosdk agent session start -r room-abc # Start with latest version (creates new room) videosdk agent session start ``` ### Stop Stop an agent session. #### Usage ```bash videosdk agent session stop [OPTIONS] ``` #### Options | Option | Short | Description | Required | | -------------- | ----- | ------------------ | --------------------------- | | `--room-id` | `-r` | Room ID of session | **Yes** (or `--session-id`) | | `--session-id` | `-s` | Session ID to stop | **Yes** (or `--room-id`) | :::note Either `--room-id` or `--session-id` must be provided. ::: #### Example Output ```bash $ videosdk agent session stop -r room-abc ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Stopping Session ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⠋ Ending Session... 
✓ Session ended successfully
```

#### Examples

```bash
# Stop by room ID
videosdk agent session stop -r room-abc

# Stop by session ID
videosdk agent session stop -s session-123
```

## Sessions List

View and filter all sessions for your agent.

### List

List all sessions for an agent.

#### Usage

```bash
videosdk agent sessions list [OPTIONS]
```

#### Options

| Option         | Short | Description                                           | Default              |
| -------------- | ----- | ----------------------------------------------------- | -------------------- |
| `--agent-id`   |       | Agent ID                                              | From `videosdk.yaml` |
| `--version-id` | `-v`  | Filter by Version ID                                  | None                 |
| `--room-id`    |       | Filter by Room ID                                     | None                 |
| `--session-id` |       | Filter by Session ID                                  | None                 |
| `--page`       |       | Page number                                           | `1`                  |
| `--per-page`   |       | Items per page                                        | `10`                 |
| `--sort`       |       | Sort order: `1` (oldest first) or `-1` (newest first) | `-1`                 |

#### Example Output

```bash
$ videosdk agent sessions list

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Listing Sessions ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Agent ID       abc123xyz
Deployment ID  dep-456

+------------+----------+---------------+---------+----------+
| Session ID | Room ID  | Deployment ID | Status  | Duration |
+------------+----------+---------------+---------+----------+
| sess-001   | room-abc | dep-456       | running | 5m 30s   |
| sess-002   | room-xyz | dep-456       | ended   | 12m 45s  |
| sess-003   | room-123 | dep-456       | ended   | 3m 15s   |
+------------+----------+---------------+---------+----------+
```

#### Examples

```bash
# List all sessions
videosdk agent sessions list

# List sessions for specific agent
videosdk agent sessions list --agent-id abc123

# Filter by version
videosdk agent sessions list --version-id ver123

# Filter by room
videosdk agent sessions list --room-id room-abc

# Paginated listing
videosdk agent sessions list --page 2 --per-page 20

# Sort oldest first
videosdk agent sessions list --sort 1
```

## Quick Reference

| Command                        | Description              |
| ------------------------------ | ------------------------ |
| `videosdk agent session start` | Start an agent in a room |
| `videosdk agent session stop`  | Stop an agent session    |
| `videosdk agent sessions list` | List all sessions        |

## Workflow Example

Here's a typical workflow for managing agent sessions:

```bash
# 1. Start a session with your deployed version
videosdk agent session start -v ver123

# 2. Check running sessions
videosdk agent sessions list

# 3. View logs for debugging
videosdk agent logs -v ver123

# 4. Stop the session when done
videosdk agent session stop -r room-abc
```

---

---
title: Up & Down Commands
hide_title: false
hide_table_of_contents: false
toc_max_heading_level: 2
description: "Learn how to use the up and down commands to quickly build, push, and deploy your AI agents, or stop all running versions."
pagination_label: "CLI Up & Down"
keywords:
  - VideoSDK CLI
  - Agent Up
  - Agent Down
  - Deployment
  - Automation
image: img/videosdklive-thumbnail.jpg
sidebar_position: 5
sidebar_label: Up & Down
slug: up-down
---

# Up & Down Commands

The `up` and `down` commands provide a streamlined way to manage your agent's lifecycle on Agent Cloud.

## Up

The `up` command runs the **build**, **push**, and **deploy** steps in a single invocation. It shortens the development-to-deployment workflow by replacing three separate commands with one.
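
To make the relationship concrete, the sketch below lists the separate commands that a single `up` call stands in for. It is illustrative only: the function name `up_equivalent_commands` is hypothetical, its `skip_build` and `skip_push` arguments merely mirror the `--skip-build` and `--skip-push` options, and it uses just the `build`, `push`, and `deploy` subcommands with the `--image` flag as they appear elsewhere in these docs. It is not a description of how the CLI is implemented internally.

```python
import shlex


def up_equivalent_commands(image, skip_build=False, skip_push=False):
    """Return the separate CLI invocations that `videosdk agent up` combines.

    Hypothetical helper for illustration; each entry is an argv list.
    """
    steps = []
    if not skip_build:
        steps.append(["videosdk", "agent", "build", "--image", image])
    if not skip_push:
        steps.append(["videosdk", "agent", "push", "--image", image])
    # The deploy step always runs; it creates a new version from the image.
    steps.append(["videosdk", "agent", "deploy", "--image", image])
    return steps


if __name__ == "__main__":
    # Print the commands rather than executing them.
    for argv in up_equivalent_commands("myrepo/myagent:v1"):
        print(shlex.join(argv))
```

Running the three printed commands one after another is the manual counterpart of `videosdk agent up --image myrepo/myagent:v1`.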
### Usage ```bash videosdk agent up [OPTIONS] ``` ### Options | Option | Short | Description | Default | | ------------ | ----- | ------------------------------------------------ | -------------------- | | `--image` | `-i` | Image name with tag (e.g., `myrepo/myagent:v1`) | From `videosdk.yaml` | | `--file` | `-f` | Path to Dockerfile | `./Dockerfile` | | `--server` | `-s` | Registry server URL | `docker.io` | | `--username` | `-u` | Registry username for authentication | None | | `--password` | `-p` | Registry password for authentication | None | | `--skip-build` | | Skip build step (use existing local image) | - | | `--skip-push` | | Skip push step (image already in registry) | - | ### What Happens 1. **Build**: The CLI builds the Docker image for the `linux/arm64` platform locally (unless `--skip-build` is used). 2. **Push**: The CLI pushes the image to your container registry (unless `--skip-push` is used). 3. **Deploy**: The CLI creates and activates a new version on Agent Cloud using the image. ### Example Output ```bash $ videosdk agent up --image myrepo/myagent:v1 ◆ Step 1/3 — Building Docker Image Platform linux/arm64 Image myrepo/myagent:v1 Dockerfile /path/to/your/project/Dockerfile ──────────────────────────────────────── [Docker build output...] ✓ Successfully built image: myrepo/myagent:v1 ◆ Step 2/3 — Pushing Docker Image Image myrepo/myagent:v1 Registry docker.io ──────────────────────────────────────── [Docker push output...] ✓ Pushed image: myrepo/myagent:v1 ◆ Step 3/3 — Deploying Agent Agent ID: ag_xxxxxx Image: myrepo/myagent:v1 Min Replicas: 1 Max Replicas: 3 Profile: cpu-small ──────────────────────────────────────────────────────────── Creating Version ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:01 ✓ Agent is up! 
Version ID: v_xxxxxx ℹ Check status: videosdk agent version status ℹ View logs: videosdk agent logs ℹ Take it down: videosdk agent down ``` ### Examples ```bash # Build, push, and deploy using defaults from videosdk.yaml videosdk agent up # Specify image and registry credentials videosdk agent up --image myrepo/myagent:v1 -u username -p password # Skip build and use existing image videosdk agent up --image myrepo/myagent:v1 --skip-build # Skip build and push, only deploy videosdk agent up --image myrepo/myagent:v1 --skip-build --skip-push ``` --- ## Down The `down` command deactivates **all running versions** of a specific agent. This is the quickest way to stop all active instances of your agent across all deployed versions. ### Usage ```bash videosdk agent down [OPTIONS] ``` ### Options | Option | Short | Description | Default | | --------- | ----- | ------------------------------------------ | ------- | | `--force` | | Force deactivate even with active sessions | No | | `--yes` | `-y` | Skip confirmation prompt | No | ### What Happens 1. **Version Identification**: The CLI identifies all active versions for the specified agent (from `videosdk.yaml`). 2. **Deactivation**: It sends a deactivation request for every active version. 3. **Session Transition**: Once deactivated, these versions will stop receiving new sessions. Existing sessions will continue until they finish (unless `--force` is used). ### Example Output ```bash $ videosdk agent down ◆ Bringing Agent Down Found 1 active version(s): • v_xxxxxx — my-agent's version Deactivate 1 version(s)? [y/N]: y Deactivating v_xxxxxx ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 ✓ Deactivated v_xxxxxx ✓ Agent is down. All versions deactivated. 
```

### Examples

```bash
# Graceful deactivation with confirmation
videosdk agent down

# Force deactivation (terminates active sessions)
videosdk agent down --force

# Skip confirmation prompt
videosdk agent down -y
```

---

## Next Steps

- Learn more about individual [Build & Push](./build-push.md) commands.
- Explore detailed [Deployment & Version](./deploy.md) management.

---

---
title: Deploy from ACR to VideoSDK Cloud
hide_title: false
hide_table_of_contents: false
toc_max_heading_level: 3
description: "Build, push, and deploy an AI agent image from Azure Container Registry (ACR) to VideoSDK Cloud."
pagination_label: "ACR → VideoSDK Cloud"
keywords:
  - VideoSDK CLI
  - Azure
  - ACR
  - Image Pull Secret
  - Agent Cloud
image: img/videosdklive-thumbnail.jpg
sidebar_label: ACR → VideoSDK Cloud
sidebar_position: 2
slug: acr-to-videosdk-cloud
---

# Deploy from Azure Container Registry (ACR) to VideoSDK Cloud

This guide walks you through building, pushing, and deploying an agent container image stored in **Azure Container Registry (ACR)** to **VideoSDK Agent Cloud**.

You will:

1. Build your Docker image with the VideoSDK CLI.
2. Push the image to Azure Container Registry.
3. Create an image pull secret in VideoSDK Cloud for ACR.
4. Deploy your agent using the ACR image and image pull secret.

## Prerequisites

- A working agent project with a `Dockerfile`.
- Azure account with access to an ACR registry.
- Azure CLI installed and logged in.
- VideoSDK CLI installed and authenticated.

## 1. Build Image with VideoSDK CLI

Use the `videosdk agent build` command to build your Docker image.

```bash
videosdk agent build \
  --image myregistry.azurecr.io/my-agent:latest
```

> Replace `myregistry` with your ACR registry name and `my-agent` with your repository name.

## 2. Push Image to ACR

Log in to ACR and push the built image.
```bash # Login to ACR az acr login --name myregistry # Push using VideoSDK CLI videosdk agent push \ --image myregistry.azurecr.io/my-agent:latest \ --server myregistry.azurecr.io ``` ## 3. Create Image Pull Secret for ACR To allow VideoSDK Cloud to pull your private image, you need to create an image pull secret with your ACR credentials. ### Obtaining ACR Credentials You can use either an **Admin User** or a **Service Principal**: 1. **Admin User (Simplest)**: - Go to the **Azure Portal** > **Container Registries** > Select your registry. - Under **Settings**, select **Access keys**. - Ensure **Admin user** is enabled. - Use the **Registry name** as `-u` (username) and one of the **passwords** as `-p`. 2. **Service Principal (Recommended)**: - Create a Service Principal with `AcrPull` permissions. - Use the **Application (client) ID** as `-u` (username) and the **Client Secret** as `-p` (password). ### Create the Secret ```bash videosdk agent image-pull-secret my-acr-secret \ --server myregistry.azurecr.io \ -u myusername \ -p mypassword ``` > Replace `myusername` and `mypassword` with the values obtained above. ## 4. Deploy Agent Using ACR Image Now deploy your agent, referencing both the ACR image and the image pull secret you just created. ```bash videosdk agent deploy \ --image myregistry.azurecr.io/my-agent:latest \ --image-pull-secret my-acr-secret \ --min-replica 1 \ --max-replica 3 ``` Your agent will now run on VideoSDK Cloud using the image stored in Azure Container Registry. --- --- title: Deploy from ECR to VideoSDK Cloud hide_title: false hide_table_of_contents: false toc_max_heading_level: 3 description: "Build, push, and deploy an AI agent image from AWS ECR to VideoSDK Cloud." 
pagination_label: "ECR → VideoSDK Cloud"
keywords:
  - VideoSDK CLI
  - AWS
  - ECR
  - Image Pull Secret
  - Agent Cloud
image: img/videosdklive-thumbnail.jpg
sidebar_label: ECR → VideoSDK Cloud
sidebar_position: 1
slug: ecr-to-videosdk-cloud
---

# Deploy from AWS ECR to VideoSDK Cloud

This guide walks you through building, pushing, and deploying an agent container image stored in **AWS Elastic Container Registry (ECR)** to **VideoSDK Agent Cloud**.

You will:

1. Build your Docker image with the VideoSDK CLI.
2. Push the image to AWS ECR.
3. Create an image pull secret in VideoSDK Cloud for ECR.
4. Deploy your agent using the ECR image and image pull secret.

## Prerequisites

- A working agent project with a `Dockerfile`.
- AWS account with permissions to create and push images to ECR.
- AWS CLI installed and configured (`aws configure`).
- VideoSDK CLI installed and authenticated.

## 1. Build Image with VideoSDK CLI

Use the `videosdk agent build` command to build your Docker image.

```bash
videosdk agent build \
  --image 1234567890.dkr.ecr.ap-south-1.amazonaws.com/my-agent:latest
```

> Replace `1234567890` with your AWS account ID, `ap-south-1` with your ECR region, and `my-agent` with your repository name.

## 2. Push Image to ECR

Authenticate to ECR and push the built image to your ECR repository.

```bash
# Log in to ECR
aws ecr get-login-password --region ap-south-1 \
  | docker login \
    --username AWS \
    --password-stdin 1234567890.dkr.ecr.ap-south-1.amazonaws.com

# Push using VideoSDK CLI
videosdk agent push \
  --image 1234567890.dkr.ecr.ap-south-1.amazonaws.com/my-agent:latest
```

## 3. Create Image Pull Secret for ECR

To allow VideoSDK Cloud to pull your private image, you need to create an image pull secret with your ECR credentials.

### Obtaining ECR Details

1. **Server URL**: Your ECR server URL follows the format `<account-id>.dkr.ecr.<region>.amazonaws.com`.
   - **Account ID**: Find your **Account ID** in the top-right corner of the **AWS Management Console**.
- **Region**: The **Region code** (e.g., `us-east-1` or `ap-south-1`) where your ECR repository is located. 2. **Username**: Always use `AWS` for ECR. 3. **Password**: This is a temporary authorization token generated by the AWS CLI. ### Create the Secret ```bash videosdk agent image-pull-secret my-ecr-secret \ --server 1234567890.dkr.ecr.ap-south-1.amazonaws.com \ -u AWS \ -p $(aws ecr get-login-password --region ap-south-1) ``` > Ensure you have the AWS CLI configured (`aws configure`) with permissions to access ECR. Replace the account ID and region with your own. ## 4. Deploy Agent Using ECR Image Now deploy your agent, referencing both the ECR image and the image pull secret you just created. ```bash videosdk agent deploy \ --image 1234567890.dkr.ecr.ap-south-1.amazonaws.com/my-agent:latest \ --image-pull-secret my-ecr-secret \ --min-replica 1 \ --max-replica 3 ``` Your agent will now run on VideoSDK Cloud using the image stored in AWS ECR. --- --- title: Deploy from GCR (Artifact Registry) to VideoSDK Cloud hide_title: false hide_table_of_contents: false toc_max_heading_level: 3 description: "Build, push, and deploy an AI agent image from Google Artifact Registry (GAR) to VideoSDK Cloud." pagination_label: "GCR → VideoSDK Cloud" keywords: - VideoSDK CLI - Google Cloud - GCR - Artifact Registry - Google Artifact Registry - Image Pull Secret - Agent Cloud image: img/videosdklive-thumbnail.jpg sidebar_label: GCR → VideoSDK Cloud sidebar_position: 3 slug: gcr-to-videosdk-cloud --- # Deploy from GCR (Artifact Registry) to VideoSDK Cloud This guide walks you through building, pushing, and deploying an agent container image stored in **Google Artifact Registry** to **VideoSDK Agent Cloud**. ## Prerequisites - A working AI agent project. - A Google Cloud project with the **Artifact Registry API** enabled. - `gcloud` CLI installed and authenticated. - VideoSDK CLI installed and authenticated. ## 1. 
Set Up Google Cloud Credentials To allow VideoSDK Cloud to pull images from your private Artifact Registry, you need a Service Account with the correct permissions. ### Create a Service Account ```bash # Create service account gcloud iam service-accounts create videosdk-ar-puller \ --display-name "VideoSDK Artifact Registry Puller" ``` ### Grant Artifact Registry Reader Role ```bash # Grant Artifact Registry Reader role at the project level gcloud projects add-iam-policy-binding \ --member="serviceAccount:videosdk-ar-puller@.iam.gserviceaccount.com" \ --role="roles/artifactregistry.reader" ``` ### Generate JSON Key ```bash # Generate JSON key and save to keyfile.json gcloud iam service-accounts keys create keyfile.json \ --iam-account videosdk-ar-puller@.iam.gserviceaccount.com ``` > **Warning**: Keep `keyfile.json` secure and do not commit it to version control. --- ## 2. Create and Configure Repository Artifact Registry organizes images into repositories. It is recommended to create a dedicated repository for your VideoSDK workers. ### Create Repository ```bash # Create a single-purpose repository gcloud artifacts repositories create videosdk-worker \ --repository-format=docker \ --location= \ --description="Docker repository for VideoSDK Worker" ``` ### Grant Repository-Specific Access (Optional) If you prefer more granular access control, grant read-only access only to the specific repository instead of the entire project: ```bash gcloud artifacts repositories add-iam-policy-binding videosdk-worker \ --location= \ --member="serviceAccount:videosdk-ar-puller@.iam.gserviceaccount.com" \ --role="roles/artifactregistry.reader" ``` --- ## 3. Build and Push Image Before pushing, configure Docker to authenticate with your regional registry. 
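
The registry host and image path used by the commands in this section follow a fixed pattern, summarized in the URL format table at the end of this guide. As a rough sketch, they can be composed from their parts as shown below. The helper names `artifact_registry_server` and `artifact_registry_image` are hypothetical and exist only for illustration; the values in the demo are the sample values from the summary table, not real resources.

```python
def artifact_registry_server(location):
    """Return the registry server URL for a repository location (hypothetical helper)."""
    return f"https://{location}-docker.pkg.dev"


def artifact_registry_image(location, project, repository, image, tag):
    """Return the full image reference: host/project/repository/image:tag."""
    return f"{location}-docker.pkg.dev/{project}/{repository}/{image}:{tag}"


if __name__ == "__main__":
    # Sample values taken from the summary table in this guide.
    print(artifact_registry_server("us-central1"))
    print(artifact_registry_image("us-central1", "my-proj", "worker", "agent", "v1"))
```

The server URL is what the image pull secret expects for `--server`, while the image reference (without the `https://` scheme) is what `build`, `push`, and `deploy` take for `--image`.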
### Authenticate Docker ```bash # Replace with your repository location (e.g., us-central1) gcloud auth configure-docker -docker.pkg.dev ``` ### Build and Push Update your `videosdk.yaml` or use the CLI flags to specify the target image path. ```bash # Build the image videosdk agent build --image -docker.pkg.dev//videosdk-worker/my-agent:v1 # Push the image videosdk agent push --image -docker.pkg.dev//videosdk-worker/my-agent:v1 ``` --- ## 4. Create Image Pull Secret Create the secret in VideoSDK Cloud using the service account key. ```bash videosdk agent image-pull-secret my-gcr-secret \ --server https://-docker.pkg.dev \ -u _json_key \ -p "$(cat keyfile.json)" \ --region us002 ``` --- ## 5. Deploy Agent Use the `videosdk agent up` command for a streamlined workflow that handles the final build-push-deploy sequence, or use `deploy` if you've already pushed. ```bash # Deploy using the GAR image and secret videosdk agent deploy \ --image -docker.pkg.dev//videosdk-worker/my-agent:v1 \ --image-pull-secret my-gcr-secret ``` Alternatively, use the shortcut: ```bash videosdk agent up --image -docker.pkg.dev//videosdk-worker/my-agent:v1 ``` --- ## Summary of URL Format | Field | Format | Example | | ---------- | ------------------------------------------- | ------------------------------------------- | | **Server** | `https://-docker.pkg.dev` | `https://us-central1-docker.pkg.dev` | | **Image** | `-docker.pkg.dev///:` | `us-central1-docker.pkg.dev/my-proj/worker/agent:v1` | --- --- title: Introduction hide_title: false hide_table_of_contents: false description: "Learn the fundamental terminology and concepts of Agent Cloud deployment, including what Agent Cloud is, how deployments work, and understanding versioning with replicas and resource profiles." 
pagination_label: "Introduction" keywords: - AI Agent SDK - VideoSDK Agents - Agent Cloud - Deployment - Version - Replicas - Cloud Infrastructure - Low Code - CLI Deployment image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Introduction slug: introduction --- ## What is Agent Cloud? **Agent Cloud** is VideoSDK's fully managed cloud infrastructure for deploying and running AI voice agents. It abstracts away all the complexity of server management, scaling, and maintenance, allowing you to focus entirely on building your agent logic. Agent Cloud supports two deployment workflows: ### Low-Code Deployment (UI-Based) For users who prefer a visual approach, Agent Cloud provides a **low-code interface** where you can: - Design your AI agent directly from the VideoSDK dashboard - Configure agent behavior, prompts, and integrations through the UI - Deploy with a single click – no coding required This approach is ideal for rapid prototyping, non-technical users, or teams that want to iterate quickly without writing deployment scripts. ### Developer Deployment (CLI-Based) For developers who build custom AI voice agents using the **VideoSDK Pipeline**, Agent Cloud provides a **CLI-based deployment** workflow: - Develop your AI voice agent using the VideoSDK Agents Python SDK - Use the VideoSDK CLI to package and deploy your agent to the cloud - Manage deployments, versions, and configurations programmatically This approach gives developers full control over their agent code while leveraging the managed infrastructure benefits of Agent Cloud. :::info Check out the [CLI Installation Guide](./cli/installation) to get started with deploying your agents to Agent Cloud. ::: ### Agent Cloud Architecture A single deployment can have **multiple running versions** simultaneously, allowing you to manage and update your agents with flexibility. ![Agent Cloud Architecture](https://assets.videosdk.live/images/cloud-deployment.png) --- ## What is a Deployment? 
A **Deployment** represents a managed instance of your AI agent running on VideoSDK's cloud infrastructure. When you deploy an agent to Agent Cloud, VideoSDK handles: - **Infrastructure Provisioning**: Automatically allocates compute resources - **Load Balancing**: Distributes incoming requests across available replicas - **Health Monitoring**: Continuously monitors agent health and restarts failed instances - **Scaling**: Automatically scales replicas based on demand within configured limits Each deployment is identified by a unique name and contains configuration for how your agent should be run, scaled, and managed. --- ## What is a Version? A **Version** represents a specific release of your AI agent within a deployment. Each time you update your agent code or configuration and deploy it, a new version is created. ### Version Configuration Every version includes the following configurable parameters: | Parameter | Description | | ---------------- | -------------------------------------------------------------------------------------------------------------------------------- | | **Min Replicas** | The minimum number of agent instances that should always be running. This ensures baseline availability even during low traffic. | | **Max Replicas** | The maximum number of agent instances that can be scaled up to during high demand. This caps your resource usage and costs. | | **Profile** | The compute resource profile that defines CPU and memory allocation for each replica. 
| ### Resource Profiles Agent Cloud offers predefined resource profiles to match your agent's computational requirements: | Profile | Description | Best For | | -------------- | -------------------------------------------------------------------- | ----------------------------------------------------- | | **cpu-small** | Lightweight compute resources with minimal CPU and memory allocation | Simple agents, low-traffic applications | | **cpu-medium** | Balanced compute resources suitable for most production workloads | Standard agents, moderate traffic | | **cpu-large** | High-performance compute resources with increased CPU and memory | Complex agents, high-traffic, compute-intensive tasks | ### Deployment Regions Agent Cloud is available in multiple regions to ensure low latency and compliance with data residency requirements: | Region | Location | Description | | --------- | ------------- | ---------------------------------------------- | | **in002** | India | Optimized for users in the Indian subcontinent | | **us002** | United States | Optimized for users in North America (default) | :::note If no region is specified during deployment, **us002** (United States) is used as the default region. ::: Choose a region closest to your users for the best performance. You can specify the region when deploying your agent using the `--region` flag: ```bash videosdk agent deploy --image myrepo/myagent:v1 --region in002 ``` :::note In examples like `myrepo/myagent:v1`, `myrepo` is a placeholder for your Docker registry username (e.g., your Docker Hub username). ::: ### Replica Scaling Replicas are individual instances of your agent running within a version. Agent Cloud automatically manages replicas based on your configuration: - **Minimum Replicas (`minReplica`)**: Guarantees this many instances are always running, ensuring your agent is ready to handle requests without cold start delays. - **Maximum Replicas (`maxReplica`)**: Sets the upper limit for scaling. 
When traffic increases, Agent Cloud automatically spins up additional replicas up to this limit. **Example Configuration:** ``` Min Replicas: 2 Max Replicas: 10 Profile: cpu-medium ``` In this example, your agent will always have at least 2 instances running but can scale up to 10 instances during peak demand, each using medium-tier compute resources. ## Summary | Term | Definition | | ---------------- | ----------------------------------------------------------------------------------------------------- | | **Agent Cloud** | VideoSDK's managed cloud platform for deploying AI voice agents | | **Deployment** | A managed instance of your agent on Agent Cloud, capable of running multiple versions | | **Version** | A specific release of your agent within a deployment, with its own scaling and resource configuration | | **Replica** | An individual running instance of your agent within a version | | **Min Replicas** | Minimum number of agent instances always running | | **Max Replicas** | Maximum number of agent instances during peak scaling | | **Profile** | Compute resource tier (cpu-small, cpu-medium, cpu-large) for each replica | | **Region** | Geographic location for deployment (in002 for India, us002 for US) | Understanding these concepts is essential for effectively deploying and managing your AI agents on Agent Cloud. In the following guides, we'll explore how to create deployments, manage versions, and configure scaling for your specific use case. --- --- --- title: Agent Cloud-v1 (Managed) hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." 
pagination_label: "Deploy Your Agents"
keywords:
  - AI Agent SDK
  - VideoSDK Agents
  - Introduction
  - Python SDK
  - AI Integration
  - VideoSDK Cloud
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Agent Cloud-v1 (Managed)
slug: agent-cloud-v1
---

# Agent Cloud-v1

This guide shows you how to deploy AI Agents with the [videosdk-agents](https://pypi.org/project/videosdk-agents/) Python package. Once your AI Agent is ready to use, you need to create an AI Deployment. The AI Deployment is responsible for running your AI Agent.

Before proceeding, ensure you have completed the steps under **Prerequisites**.

## Prerequisites

To deploy your AI Deployment, make sure you have:

- Created an AI Deployment using the [Create AI Deployment API](/api-reference/agent-cloud/create-deployment).
- A VideoSDK authentication token (generate one from the [VideoSDK Dashboard](https://app.videosdk.live)).

## YAML Configuration

Create a `videosdk.yaml` file with the following structure:

```yaml
version: "1.0"
deployment:
  id: your_ai_deployment_id
  entry:
    path: entry_point_for_deployment
env:
  # Optional to run your agent locally
  path: "./.env"
secrets:
  VIDEOSDK_AUTH_TOKEN: your_auth_token
deploy:
  cloud: true
```

### Field Descriptions

| Field                         | Description                                                                                                    |
| ----------------------------- | -------------------------------------------------------------------------------------------------------------- |
| `deployment.id`               | The `deploymentId` obtained from the [Create AI Deployment API](/api-reference/agent-cloud/create-deployment)  |
| `deployment.entry.path`       | Path to the entry point script for your AI Deployment.                                                         |
| `env.path`                    | Path to your `.env` file, used only when running the agent locally.                                            |
| `secrets.VIDEOSDK_AUTH_TOKEN` | Your VideoSDK auth token (required for deployment).                                                            |
| `deploy.cloud`                | Set to `true` to allow the deploy command to deploy to VideoSDK Cloud. Use `false` to avoid accidental deploys.
| ## CLI Commands - ###### Run the AI Deployment locally for Testing. ``` videosdk run ``` - ###### Deploy the AI Deployment. ``` videosdk deploy ``` ## Next Steps After deploying your AI Deployment, you can start using it by: 1. Creating a new session using the [Start Session API](/api-reference/agent-cloud/start-session) 2. Ending the session using the [End Session API](/api-reference/agent-cloud/end-session) --- --- title: Agents deployments hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." pagination_label: "Introduction to deployments" keywords: - AI Agent SDK - VideoSDK Agents - Introduction - Python SDK - AI Integration - VideoSDK Cloud - Deployments - Worker image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Overview slug: introduction --- # Deployments ### Overview The VideoSDK Agents framework provides multiple deployment options to run your AI agents in production environments. Understanding these options helps you choose the right deployment strategy for your specific use case. VideoSDK Agents supports two primary deployment modes: 1. **Agent Cloud (Managed)** - Fully managed deployment hosted on VideoSDK infrastructure 2. **Self-Hosting** - Self-managed deployment on your own infrastructure (EC2, Docker, Kubernetes, etc.) ### [Agent Cloud (Hosted on Our Infrastructure)](./agent-cloudv1.md) Agent Cloud is a fully managed service that handles the deployment, scaling, and maintenance of your AI agents. 
When you deploy to Agent Cloud: - **Zero Infrastructure Management**: No need to manage servers, containers, or scaling - **Automatic Scaling**: Built-in load balancing and auto-scaling capabilities - **High Availability**: Redundant infrastructure with automatic failover - **Managed Updates**: Automatic security patches and framework updates - **Global Distribution**: Agents deployed across multiple regions for low latency - **Built-in Monitoring**: Integrated metrics, logging, and health monitoring **Best for**: Teams that want to focus on agent development rather than infrastructure management, or applications with variable traffic patterns. ### [Self-Hosting (EC2, Docker, or Custom Infrastructure)](./self-hosting/understanding-worker.md) Self-hosting gives you complete control over your deployment environment and infrastructure. When self-hosting: - **Full Control**: Complete control over hardware, networking, and configuration - **Custom Integrations**: Ability to integrate with existing infrastructure and tools - **Cost Optimization**: Potential cost savings for high-volume, predictable workloads - **Compliance**: Meet specific security, compliance, or data residency requirements - **Custom Scaling**: Implement your own scaling strategies and resource management **Best for**: Organizations with existing infrastructure, specific compliance requirements, or predictable high-volume workloads. 
### When to Choose Agent Cloud vs Self-Hosting #### Choose Agent Cloud when: - You want to get started quickly without infrastructure setup - You have variable or unpredictable traffic patterns - You need global distribution and low latency - You want automatic scaling and high availability - You prefer a managed service with built-in monitoring #### Choose Self-Hosting when: - You need to meet specific compliance or security requirements - You have predictable, high-volume workloads where cost optimization is important - You require custom integrations with existing systems - You need complete control over the deployment environment ### Common Terminology Understanding these key terms will help you navigate the deployment documentation: | Term | Definition | | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | **Agent** | Your AI application built using the VideoSDK Agents framework. An agent can handle voice conversations, process audio, and respond with synthesized speech. | | **Worker** | A runtime component that executes your agent code. Workers can run in different environments (Agent Cloud or self-hosted) and handle job assignments from the backend registry system. | | **Backend Registry** | The central service that manages worker registration, job assignment, and load balancing. Workers connect to this registry to receive job assignments and report their status. | | **Job** | A single execution instance of your agent. When a user starts a conversation, the backend registry assigns a job to an available worker. | | **JobContext** | The execution context for a job, containing room configuration, pipeline setup, and session management. This is the main interface your agent code interacts with. 
| | **Worker Registration** | The process by which self-hosted workers register themselves with the VideoSDK backend registry to receive job assignments. | | **Load Threshold** | A configuration parameter that determines when a worker is considered "at capacity" and should not receive new job assignments. | | **Health Check** | Regular monitoring of worker status to ensure they're available and functioning correctly. Workers provide health endpoints for monitoring. | | **Resource Management** | The system for managing worker resources including process/thread allocation, memory limits, and concurrent job handling. | | **Session Management** | Handles the lifecycle of agent sessions including automatic session ending, timeouts, and cleanup. | | **Horizontal Scaling** | The manual process of deploying additional worker instances to handle increased load (requires manual deployment of new worker instances). | | **Vertical Scaling** | The automatic scaling within a single worker up to its configured maximum capacity (`max_processes`). | | **Dispatch API** | A REST API endpoint that allows you to dynamically dispatch agents to meetings on-demand. | | **AI Deployment** | The deployment configuration that runs your AI agent, either in Agent Cloud or self-hosted environments. | This terminology will be referenced throughout the deployment documentation as we explore specific deployment scenarios and configurations. --- --- title: Dispatch Agents hide_title: false hide_table_of_contents: false description: "Dynamically dispatch AI agents to meetings using the VideoSDK API." pagination_label: "Dispatch Agents" keywords: - AI Agent SDK - VideoSDK Agents - Dispatch API - Agent Assignment image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Dispatch Agents slug: dispatch-agents --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Dispatch Agents Dynamically assign your AI agents to meetings using the VideoSDK dispatch API. 
This API supports dispatching for both self-hosted agents created with the Agents SDK and agents managed through the VideoSDK dashboard (Agent Runtime). ## How It Works 1. **Your app** calls the dispatch API 2. **VideoSDK backend** finds an available server 3. **Server spawns a job/process** to join the meeting 4. **Agent starts** and begins processing in the meeting ## API Usage ### Endpoint ```bash POST https://api.videosdk.live/v2/agent/dispatch ``` ### Request Body Parameters | Parameter | Type | Required | Description | | :---------- | :----- | :------- | :--------------------------------------------------------------------------------------------------------------------------------------- | | meetingId | string | Yes | The ID of the meeting to which the agent should be dispatched. | | agentId | string | Yes | The ID of the agent to dispatch. | | metadata | object | No | Optional metadata to pass to the agent, such as variables. | | versionId | string | No | The specific version of a dashboard-managed agent to dispatch. If omitted, the latest deployed version is used. Not for self-hosted agents. | ### Example Request ```bash curl -X POST "https://api.videosdk.live/v2/agent/dispatch" \ -H "Authorization: YOUR_VIDEOSDK_AUTH_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "meetingId": "xxxx-xxxx-xxxx", "agentId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", "metadata": { "variables":[ { "name":"fname", "value":"john" } ] }, "versionId":"abcd-abcd-abcd-abcd" }' ``` ### Responses **On Success** A successful request will return a confirmation that the dispatch has been initiated. ```json { "message": "Agent dispatch requested successfully.", "data": { "success": true, "status": "assigned", "roomId": "xxxx-xxxx-xxxx", "agentId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" } } ``` **On Error** If the dispatch fails, you will receive one of the following error messages: This error occurs when no servers and agents are configured to handle the request. 
```json
{
  "message": "No workers available"
}
```

This error is specific to **self-hosted (Agents SDK) agents**. It means that while the `agentId` is valid, no server has registered for that `agentId`.

```json
{
  "message": "No workers have registered with agentId 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'"
}
```

This error is specific to **dashboard-managed agents**. It indicates that the agent exists but either has no deployed versions available for dispatch, or the specific version requested in `versionId` is not deployed.

```json
{
  "message": "No agent is deployed with agentId 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'"
}
```

## Dispatching Your Agent

The prerequisites for dispatching an agent depend on how it was created.

### For Self-Hosted Agents (Agents SDK)

If you created your agent using the Python Agents SDK, you are responsible for hosting it. Your server must be:

1. **Registered**: The server must be configured with `register=True`.
2. **Connected**: The server must be running and connected to the VideoSDK backend.
3. **Available**: The server must have the capacity to handle new jobs.

The `versionId` parameter is not applicable in this scenario.

**Server Configuration Example**

```python
from videosdk.agents import Options

options = Options(
    agent_id="MyAgent",    # Must match agentId in API call
    register=True,         # Required for dispatch
    max_processes=10,
    load_threshold=0.75,
)
```

### For Dashboard-Managed Agents (Agent Runtime)

If you created your agent using the dashboard interface, VideoSDK manages the hosting for you. The only prerequisite is that your agent must be **deployed**.

- You can deploy your agent via the dashboard.
- You can use the optional `versionId` parameter in your dispatch request to specify which deployed version of the agent to use.
- If `versionId` is not provided, the **latest deployed version** will be dispatched by default.
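Client code often needs to tell the success payload apart from the three error payloads shown under **Responses** before deciding whether to retry or alert someone. The following is a minimal sketch, not part of the SDK: the function name `classify_dispatch_response` and the labels it returns are hypothetical, and it assumes the error `message` strings keep the prefixes shown in the examples above.

```python
def classify_dispatch_response(body: dict) -> str:
    """Map a dispatch API response body to a short label.

    The labels are illustrative only; the message prefixes are taken
    from the example responses in this guide.
    """
    # A successful dispatch carries a "data" object with success=true
    data = body.get("data") or {}
    if data.get("success"):
        return "assigned"

    message = body.get("message", "")
    if message == "No workers available":
        return "no_workers"
    if message.startswith("No workers have registered with agentId"):
        # Self-hosted agent: no worker has registered for this agentId
        return "self_hosted_not_registered"
    if message.startswith("No agent is deployed with agentId"):
        # Dashboard-managed agent: nothing deployed for this agentId/version
        return "dashboard_not_deployed"
    return "unknown"
```

A caller could pass the parsed JSON body of the dispatch response to this helper and, for example, retry only on `no_workers` while surfacing the two configuration errors immediately.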
## Code Examples ```python import requests def dispatch_agent(auth_token, meeting_id, agent_id, metadata=None, version_id=None): url = "https://api.videosdk.live/v2/agent/dispatch" headers = { "Authorization": auth_token, "Content-Type": "application/json" } payload = { "meetingId": meeting_id, "agentId": agent_id, } if metadata: payload["metadata"] = metadata if version_id: payload["versionId"] = version_id response = requests.post(url, headers=headers, json=payload) return response.json() # Usage result = dispatch_agent("your-token", "room-123", "MyAgent") ``` ```javascript async function dispatchAgent(authToken, meetingId, agentId, metadata, versionId) { const url = "https://api.videosdk.live/v2/agent/dispatch"; const headers = { Authorization: authToken, "Content-Type": "application/json", }; const body = { meetingId, agentId, }; if (metadata) { body.metadata = metadata; } if (versionId) { body.versionId = versionId; } const response = await fetch(url, { method: "POST", headers, body: JSON.stringify(body), }); return response.json(); } // Usage dispatchAgent("your-token", "room-123", "MyAgent"); ``` --- --- title: AWS EC2 Deployment hide_title: false hide_table_of_contents: false description: "Deploy your VideoSDK AI Agent on AWS EC2 with minimal setup." pagination_label: "AWS EC2 Deployment" keywords: - AI Agent SDK - VideoSDK Agents - AWS EC2 - Self Hosting image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: AWS EC2 slug: aws-ec2 --- # AWS EC2 Deploy your VideoSDK AI Agent Worker on AWS EC2 instances. ## Prerequisites - AWS account - SSH key pair - VideoSDK authentication token ## Quick Setup ### 1. Launch EC2 Instance ```bash aws ec2 run-instances \ --image-id ami-0c02fb55956c7d316 \ --instance-type t3.medium \ --key-name your-key-pair \ --security-group-ids sg-xxxxxxxxx \ --user-data file://user-data.sh ``` ### 2. 
User Data Script

```bash
#!/bin/bash
yum update -y
yum install -y python3 python3-pip git

# Clone private repository with token
git clone https://YOUR_TOKEN@github.com/your-org/your-agent.git /opt/agent
cd /opt/agent

# Install dependencies
pip3 install -r requirements.txt

# Create systemd service
cat > /etc/systemd/system/agent-worker.service << EOF
[Unit]
Description=VideoSDK Agent Worker
After=network.target

[Service]
Type=simple
User=ec2-user
WorkingDirectory=/opt/agent
Environment=VIDEOSDK_AUTH_TOKEN=your_auth_token
ExecStart=/usr/bin/python3 main.py
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Start the service
systemctl enable agent-worker
systemctl start agent-worker
```

### 3. Security Group

Configure your security group with these rules:

- **SSH (22)**: Your IP
- **Custom TCP (8081)**: Your IP (for health checks)
- **HTTPS (443)**: 0.0.0.0/0 (for VideoSDK API)

## Deploy Updates

```bash
# Connect to your instance
ssh -i your-key.pem ec2-user@your-instance-ip

# Update your agent (the user data script runs as root, so /opt/agent
# and the service are root-owned and need sudo from the ec2-user shell)
cd /opt/agent
sudo git pull
sudo systemctl restart agent-worker
```

## Monitor

```bash
# Check service status
systemctl status agent-worker

# View logs
journalctl -u agent-worker -f
```

## Scaling

> To support more concurrent agents, you can spin up additional EC2 instances using the same process. Each instance will register with the VideoSDK backend registry and automatically receive job assignments. The backend will distribute the load across all available workers.

**To add more instances:**

1. Use the same user data script
2. Launch additional EC2 instances
3. Each instance will automatically join the worker pool
4.
The VideoSDK backend will handle load balancing **Example:** ```bash # Launch multiple instances aws ec2 run-instances \ --image-id ami-0c02fb55956c7d316 \ --instance-type t3.medium \ --key-name your-key-pair \ --security-group-ids sg-xxxxxxxxx \ --user-data file://user-data.sh \ --count 3 ``` --- --- title: Docker Deployment hide_title: false hide_table_of_contents: false description: "Deploy your VideoSDK AI Agent using Docker containers." pagination_label: "Docker Deployment" keywords: - AI Agent SDK - VideoSDK Agents - Docker - Self Hosting image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Docker slug: docker --- # Docker Deploy your VideoSDK AI Agent Worker using Docker containers. ## Prerequisites - Docker installed - VideoSDK authentication token ## Quick Setup ### 1. Create Dockerfile ```dockerfile FROM python:3.11-slim WORKDIR /app # Install system dependencies RUN apt-get update && apt-get install -y \ gcc \ && rm -rf /var/lib/apt/lists/* # Copy requirements and install Python dependencies COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt # Copy application code COPY . . # Expose debug port EXPOSE 8081 # Run the worker CMD ["python", "main.py"] ``` ### 2. Build and Run ```bash # Build the image docker build -t my-agent-worker . # Run the container docker run -d \ --name my-agent-worker \ -p 8081:8081 \ -e VIDEOSDK_AUTH_TOKEN="your_auth_token" \ my-agent-worker ``` ### 3. Docker Compose (Optional) Create `docker-compose.yml`: ```yaml title="docker-compose.yml" version: "3.8" services: agent-worker: build: . ports: - "8081:8081" environment: - VIDEOSDK_AUTH_TOKEN=${VIDEOSDK_AUTH_TOKEN} restart: unless-stopped ``` Run with: ```bash docker-compose up -d ``` ## Deploy Updates ```bash # Stop container docker stop my-agent-worker # Remove old container docker rm my-agent-worker # Build new image docker build -t my-agent-worker . 
# Run new container docker run -d \ --name my-agent-worker \ -p 8081:8081 \ -e VIDEOSDK_AUTH_TOKEN="your_auth_token" \ my-agent-worker ``` ## Monitor ```bash # Check container status docker ps # View logs docker logs my-agent-worker # Execute commands in container docker exec -it my-agent-worker bash ``` ## Scaling > To support more concurrent agents, you can run multiple containers using the same image. Each container will register with the VideoSDK backend registry and automatically receive job assignments. **Run multiple containers:** ```bash # Run additional containers docker run -d \ --name my-agent-worker-2 \ -p 8082:8081 \ -e VIDEOSDK_AUTH_TOKEN="your_auth_token" \ my-agent-worker docker run -d \ --name my-agent-worker-3 \ -p 8083:8081 \ -e VIDEOSDK_AUTH_TOKEN="your_auth_token" \ my-agent-worker ``` **Or scale with Docker Compose:** ```bash docker-compose up -d --scale agent-worker=3 ``` --- --- title: Kubernetes Deployment hide_title: false hide_table_of_contents: false description: "Deploy your VideoSDK AI Agent on Kubernetes clusters." pagination_label: "Kubernetes Deployment" keywords: - AI Agent SDK - VideoSDK Agents - Kubernetes - K8s - Self Hosting image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: Kubernetes slug: kubernetes --- # Kubernetes Deploy your VideoSDK AI Agent Worker on Kubernetes clusters. ## Prerequisites - Kubernetes cluster (EKS, GKE, or self-hosted) - kubectl configured - Docker image of your agent ## Quick Setup ### 1. Create Namespace ```bash kubectl create namespace agent-workers ``` ### 2. Create Secret ```bash kubectl create secret generic agent-secrets \ --from-literal=VIDEOSDK_AUTH_TOKEN=your_auth_token \ --namespace agent-workers ``` ### 3. 
Deploy Agent ```yaml title="deployment.yaml" apiVersion: apps/v1 kind: Deployment metadata: name: agent-worker namespace: agent-workers spec: replicas: 3 selector: matchLabels: app: agent-worker template: metadata: labels: app: agent-worker spec: containers: - name: agent-worker image: your-registry/agent-worker:latest ports: - containerPort: 8081 env: - name: VIDEOSDK_AUTH_TOKEN valueFrom: secretKeyRef: name: agent-secrets key: VIDEOSDK_AUTH_TOKEN resources: requests: memory: "512Mi" cpu: "250m" limits: memory: "1Gi" cpu: "500m" ``` Apply the deployment: ```bash kubectl apply -f deployment.yaml ``` ## Monitor ```bash # Check deployment status kubectl get deployments -n agent-workers # Check pods kubectl get pods -n agent-workers # View logs kubectl logs -f deployment/agent-worker -n agent-workers ``` ## Deploy Updates ```bash # Update image kubectl set image deployment/agent-worker agent-worker=your-registry/agent-worker:latest -n agent-workers # Check rollout status kubectl rollout status deployment/agent-worker -n agent-workers ``` ## Scaling > To support more concurrent agents, you can scale the deployment by increasing the number of replicas. Each pod will register with the VideoSDK backend registry and automatically receive job assignments. 
**Scale the deployment:** ```bash # Scale to 5 replicas kubectl scale deployment agent-worker --replicas=5 -n agent-workers # Or use HPA for automatic scaling kubectl autoscale deployment agent-worker --cpu-percent=70 --min=2 --max=10 -n agent-workers ``` **Check scaling:** ```bash # View current replicas kubectl get deployment agent-worker -n agent-workers # View HPA status kubectl get hpa -n agent-workers ``` ## Cleanup ```bash # Delete deployment kubectl delete deployment agent-worker -n agent-workers # Delete namespace (removes everything) kubectl delete namespace agent-workers ``` --- --- title: Monitoring APIs hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." pagination_label: "Introduction to deployments" keywords: - AI Agent SDK - VideoSDK Agents - Introduction - Python SDK - AI Integration - VideoSDK Cloud - Deployments - Worker - Self Hosting image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Monitoring APIs slug: monitoring-apis --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Monitoring APIs Monitor your worker status and performance using HTTP endpoints. All endpoints are available at `http://localhost:8081`. 
## Available Endpoints - **`/health`** - Basic health check - **`/worker`** - Worker status - **`/stats`** - Detailed statistics - **`/debug`** - Configuration info - **`/`** - Web dashboard ## Quick Health Check ```bash curl http://localhost:8081/health ``` **Response:** ``` OK ``` ## Worker Status ```bash curl http://localhost:8081/worker ``` **Response:** ```json { "agent_id": "MyAgent", "active_jobs": 3, "connected": true, "worker_id": "worker-123", "worker_load": 0.3 } ``` ## Detailed Statistics ```bash curl http://localhost:8081/stats ``` **Response:** ```json { "worker_load": 0.3, "current_jobs": 3, "max_processes": 10, "agent_id": "MyAgent", "backend_connected": true, "resource_stats": { "total_resources": 10, "available_resources": 7, "active_resources": 3 } } ``` ## Web Dashboard Open `http://localhost:8081/` in your browser for a visual interface showing: - Real-time worker status - Resource utilization - Active jobs - Performance metrics ## Integration Examples ```python import requests def check_worker_health(): response = requests.get("http://localhost:8081/health") return response.status_code == 200 def get_worker_stats(): response = requests.get("http://localhost:8081/stats") return response.json() # Usage if check_worker_health(): stats = get_worker_stats() print(f"Active jobs: {stats['current_jobs']}") ``` ```javascript async function checkWorkerHealth() { const response = await fetch("http://localhost:8081/health"); return response.ok; } async function getWorkerStats() { const response = await fetch("http://localhost:8081/stats"); return response.json(); } // Usage if (await checkWorkerHealth()) { const stats = await getWorkerStats(); console.log(`Active jobs: ${stats.current_jobs}`); } ``` ## Common Use Cases - **Health monitoring**: Use `/health` for load balancer checks - **Performance tracking**: Use `/stats` for resource monitoring - **Debugging**: Use `/debug` to verify configuration - **Visual monitoring**: Use web dashboard for real-time 
overview --- --- title: Understanding the Worker hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." pagination_label: "Introduction to deployments" keywords: - AI Agent SDK - VideoSDK Agents - Introduction - Python SDK - AI Integration - VideoSDK Cloud - Deployments - Worker - Self Hosting image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Understanding the Worker slug: understanding-worker --- # Understanding the Worker The **Worker** is the runtime engine that executes your AI agents in production. Think of it as the "server" that runs your agent code and handles multiple conversations simultaneously. ![AI Agent Worker](https://cdn.videosdk.live/website-resources/docs-resources/ai_agent_worker.png) ## What the Worker Does The Worker manages the lifecycle of your AI agents by: - **Executing** your agent code when users start conversations - **Managing** multiple concurrent conversations efficiently - **Connecting** to VideoSDK's backend to receive job assignments - **Monitoring** health and performance automatically - **Scaling** up or down based on demand ## Why Use the Built-in Worker? The VideoSDK Agents framework includes a production-ready Worker that handles all the complex infrastructure concerns, so you can focus on building your AI agent logic. 
**Key Benefits:** - **Production-Ready**: Built for real-world workloads with proper error handling - **Auto-Scaling**: Automatically handles multiple conversations within a single worker - **Health Monitoring**: Built-in health checks and status reporting - **Zero-Downtime**: Graceful shutdown and deployment capabilities --- --- title: Worker Configuration hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." pagination_label: "Introduction to deployments" keywords: - AI Agent SDK - VideoSDK Agents - Introduction - Python SDK - AI Integration - VideoSDK Cloud - Deployments - Worker - Self Hosting image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Worker Configuration slug: worker-configuration --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import DeploymentCard from '@site/src/components/DeploymentCard' # Worker Configuration Workers are the execution engines that run your **AI Agent jobs**. Think of them as the bridge between your **agent logic** and the **VideoSDK runtime**. This guide walks you through how to configure and tune a Worker for different environments — from local dev to production. ## Quick Start: Minimal Worker Here’s the simplest Worker setup to get going: ```python from videosdk.agents import WorkerJob, Options, JobContext, RoomOptions options = Options( agent_id="MyAgent", max_processes=5, register=True, # Registers worker with the backend for job scheduling ) room_options = RoomOptions( name="My Agent", ) job_context = JobContext(room_options=room_options) job = WorkerJob( entrypoint=your_agent_function, jobctx=lambda: job_context, options=options, ) job.start() ``` That’s enough to start processing jobs locally or in staging. 
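Because the same Worker usually runs in several environments, one option is to derive the `Options` arguments from environment variables instead of hard-coding them. The sketch below only builds a plain keyword dictionary. The variable names (`AGENT_ID`, `AGENT_MAX_PROCESSES`, `AGENT_REGISTER`) and the helper `worker_options_from_env` are hypothetical and not something the SDK defines; the defaults mirror the Quick Start values.

```python
import os
from typing import Mapping, Optional


def worker_options_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for Options from environment variables.

    The variable names are illustrative assumptions, not SDK conventions.
    """
    env = os.environ if env is None else env
    register_flag = env.get("AGENT_REGISTER", "true").strip().lower()
    return {
        "agent_id": env.get("AGENT_ID", "MyAgent"),
        "max_processes": int(env.get("AGENT_MAX_PROCESSES", "5")),
        # Register with the backend unless explicitly disabled (e.g. local runs)
        "register": register_flag in ("1", "true", "yes"),
    }


# Usage (assumes videosdk.agents from the Quick Start is installed):
# options = Options(**worker_options_from_env())
```

Keeping the parsing in a small function like this makes it easy to switch `register` off for local runs without editing the agent code.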
## Worker Options Explained The `Options` class gives you fine-grained control over Worker behavior: | Option | Purpose | Example | | -------------------- | ------------------------------------------- | ------------------------------- | | `agent_id` | Unique identifier for your agent | `"SupportBot01"` | | `max_processes` | Maximum concurrent jobs | `10` | | `num_idle_processes` | Pre-warmed processes for faster startup | `2` | | `load_threshold` | Max CPU/Load tolerance before refusing jobs | `0.75` | | `register` | Whether to register with backend | `True` (prod) / `False` (local) | | `log_level` | Logging verbosity | `"DEBUG"`, `"INFO"`, `"ERROR"` | | `host`, `port` | Bind address for health/status endpoints | `"0.0.0.0"`, `8081` | | `memory_warn_mb` | Trigger warning logs at this usage | `500.0` | | `memory_limit_mb` | Hard memory cap (`0` = unlimited) | `1000.0` | | `ping_interval` | Heartbeat interval in seconds | `30.0` | | `max_retry` | Max connection retries before giving up | `16` | ## Example Configurations **Standard Production** configuration for typical deployments: ```python options = Options( agent_id="StandardAgent", max_processes=5, register=True, log_level="INFO", ) ``` This configuration is suitable for: - Standard production deployments - Moderate traffic loads - Most business applications **High-Scale Production** configuration for enterprise workloads: ```python options = Options( agent_id="EnterpriseAgent", max_processes=20, num_idle_processes=5, load_threshold=0.8, memory_limit_mb=2000.0, register=True, log_level="DEBUG", ) ``` This configuration is optimized for: - Enterprise-scale deployments - High concurrent user loads - Advanced monitoring requirements **Local Development** configuration for development: ```python options = Options( agent_id="DevAgent", max_processes=1, register=False, # Don't register with backend log_level="DEBUG", host="localhost", port=8081, ) ``` This configuration is ideal for: - Local development and testing - 
Debugging agent behavior - Isolated development environments ## Hosting Environments ## Scaling Your Workers Workers can scale both **vertically** (more power per instance) and **horizontally** (more instances). - **Vertical Scaling** → Increase `max_processes` to run more jobs per worker. - **Horizontal Scaling** → Deploy multiple workers; the backend registry will balance load. - **Idle Processes** → Use `num_idle_processes` to reduce cold start latency. - **Load Threshold** → Tune `load_threshold` (default `0.75`) to prevent overload. - **Memory Safety** → Use `memory_warn_mb` and `memory_limit_mb` to keep processes healthy. ## Pro Tips - **Start small** → Begin with `max_processes=5` and adjust as you observe metrics. - **Log smart** → Use `DEBUG` in dev, but `INFO` or `WARN` in prod to reduce noise. - **Monitor & Auto-Scale** → Pair with metrics (Prometheus, Grafana, CloudWatch, etc.) to auto-scale horizontally. - **Keep processes warm** → Set at least `num_idle_processes=1` in production for faster first-response times. --- --- title: Function Tools hide_title: false hide_table_of_contents: false description: "Learn how to extend your VideoSDK AI Agent's capabilities with function tools. Create custom actions, integrate with external services, and enable your agent to perform tasks beyond conversation using the @function_tool decorator." pagination_label: "Function Tools" keywords: - Function Tools - function_tool - Agent Tools - Custom Actions - External Services - API Integration - Agent Capabilities - VideoSDK Agents - AI Agent SDK - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Function Tools slug: function-tools --- import { AgentCardGrid, GithubIcon } from '@site/src/components/agent/cards'; # Function Tools Function tools allow your AI agent to perform actions and interact with external services, extending its capabilities beyond simple conversation. 
By registering function tools, you enable your agent to execute custom logic, call APIs, access databases, and perform various tasks based on user requests. ## Overview Function tools are Python functions decorated with `@function_tool` that your agent can call during conversations. The LLM automatically decides when to use these tools based on the user's request and the tool's description. ## External Tools External tools are defined as standalone functions and passed into the agent's constructor via the `tools` parameter. This approach is useful for sharing common tools across multiple agents. ```python title="main.py" from videosdk.agents import Agent, function_tool # External tool defined outside the class @function_tool(description="Get weather information for a location") def get_weather(location: str) -> str: """Get weather information for a specific location.""" # Weather logic here return f"Weather in {location}: Sunny, 72°F" class WeatherAgent(Agent): def __init__(self): super().__init__( instructions="You are a weather assistant.", tools=[get_weather] # Register the external tool ) ``` ## Internal Tools Internal tools are defined as methods within your agent class and decorated with `@function_tool`. This approach is useful for logic that is specific to the agent and needs access to its internal state (`self`). ```python title="main.py" from videosdk.agents import Agent, function_tool class FinanceAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful financial assistant." ) self.portfolio = {"AAPL": 10, "GOOG": 5} @function_tool def get_portfolio_value(self) -> dict: """Get the current value of the user's stock portfolio.""" # Access agent state via self return {"total_value": 5000, "holdings": self.portfolio} ``` ## Async Function Tools Function tools can be asynchronous, which is essential for making HTTP requests, performing I/O operations, or integrating with async VideoSDK features. 
```python title="main.py" import aiohttp from videosdk.agents import Agent, function_tool class WeatherAgent(Agent): def __init__(self): super().__init__( instructions="You are a weather assistant that can fetch real-time weather data." ) @function_tool async def get_weather_async(self, location: str) -> dict: """Fetch real-time weather data from an API.""" async with aiohttp.ClientSession() as session: async with session.get(f"https://api.weather.com/{location}") as response: data = await response.json() return { "location": location, "temperature": data.get("temp"), "condition": data.get("condition") } ``` :::note **Sarvam AI LLM**: When using Sarvam AI as the LLM option, function tool calls and MCP tools will not work. Consider using alternative LLM providers if you need function tool support. ::: ## Examples - Try Out Yourself }, { title: "Real-life Usecase", description: "Complete example demonstrating internal and external function tools", link: "https://github.com/videosdk-live/agents/blob/ee3ced912078c3be9dd62c7576c95c1bbe227bae/examples/a2a/agents/customer_agent.py#L22", icon: } ]} columns={2} /> --- --- title: Human in the Loop hide_title: false hide_table_of_contents: false description: "Learn how to implement Human in the Loop (HITL) functionality with VideoSDK AI Agents using Discord integration for human oversight and intervention." pagination_label: "Human in the Loop" keywords: - Human in the Loop - HITL - Discord Integration - AI Agent Oversight - Human Intervention - VideoSDK Agents - MCP Server - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 5 sidebar_label: Human in the Loop slug: human-in-the-loop --- # Human in the Loop Human in the Loop (HITL) enables AI agents to escalate specific queries to human operators for review and approval. This implementation uses Discord as the human interface, allowing seamless handoffs between AI automation and human oversight. 
## Overview

The HITL system allows AI agents to:

- Handle routine customer inquiries autonomously
- Escalate specific queries (like discount requests) to human operators via Discord
- Receive human responses and relay them back to customers
- Maintain conversation flow while waiting for human input

## Use Cases

- **Discount Requests**: AI escalates pricing queries to human sales agents
- **Complex Support**: Technical issues requiring human expertise
- **Policy Decisions**: Requests that need human approval or clarification
- **Escalation Scenarios**: Situations where AI confidence is low

## Example Overview

The implementation consists of two main components:

1. **Customer Agent**: VideoSDK AI agent that handles customer interactions and escalates specific queries
2. **Discord MCP Server**: MCP server that creates Discord threads for human operator responses

## Example Implementation

### Customer Agent Setup

```python
import pathlib
import sys
from typing import Optional

from videosdk.agents import Agent, JobContext, MCPServerStdio


class CustomerAgent(Agent):
    def __init__(self, ctx: Optional[JobContext] = None):
        # The Discord MCP server script lives next to this file
        current_dir = pathlib.Path(__file__).parent
        discord_mcp_server_path = current_dir / "discord_mcp_server.py"
        super().__init__(
            instructions="You are a customer-facing agent for VideoSDK. You have access to various tools to assist with customer inquiries, provide support, and handle tasks. When a user asks for a discount percentage, always use the appropriate tool to retrieve and provide the accurate answer from your superior human agent.",
            mcp_servers=[
                MCPServerStdio(
                    executable_path=sys.executable,
                    process_arguments=[str(discord_mcp_server_path)],
                    session_timeout=30
                ),
            ]
        )
        self.ctx = ctx
```

### Discord MCP Server

```python
import asyncio

import discord
from discord.ext import commands
from mcp.server.fastmcp import FastMCP


class DiscordHuman:
    def __init__(self, user_id: int, channel_id: int):
        self.user_id = user_id
        self.channel_id = channel_id
        self.bot = commands.Bot(command_prefix="!", intents=discord.Intents.all())
        self.response_future = None

    async def ask(self, question: str) -> str:
        # Open a thread for the question and mention the human operator
        channel = self.bot.get_channel(self.channel_id)
        thread = await channel.create_thread(
            name=question[:100],
            type=discord.ChannelType.public_thread
        )
        await thread.send(f"<@{self.user_id}> {question}")
        # The future is expected to be resolved by the bot's message handler
        # (not shown in this excerpt) when the operator replies
        self.response_future = asyncio.get_running_loop().create_future()
        try:
            return await asyncio.wait_for(self.response_future, timeout=600)
        except asyncio.TimeoutError:
            return "⏱️ Timed out waiting for a human response"


# MCP Server Setup
mcp = FastMCP("HumanInTheLoopServer")


# `discord_human` is assumed to be a DiscordHuman instance created at
# server startup (not shown in this excerpt)
@mcp.tool(description="Ask a human agent via Discord for a specific user query such as discount percentage, etc.")
async def ask_human(question: str) -> str:
    return await discord_human.ask(question)
```

### Pipeline Configuration

```python
import os

# The STT, LLM, TTS, VAD, and turn-detector classes come from their
# respective VideoSDK plugin packages (imports omitted in this excerpt)
pipeline = Pipeline(
    stt=DeepgramSTT(api_key=os.getenv("DEEPGRAM_API_KEY")),
    llm=AnthropicLLM(api_key=os.getenv("ANTHROPIC_API_KEY")),
    tts=GoogleTTS(api_key=os.getenv("GOOGLE_API_KEY")),
    vad=SileroVAD(),
    turn_detector=TurnDetector(threshold=0.8)
)
```

### Environment Variables

Set the following environment variables:

```bash
DISCORD_TOKEN=your_discord_bot_token
DISCORD_USER_ID=human_operator_user_id
DISCORD_CHANNEL_ID=channel_id_for_escalations
DEEPGRAM_API_KEY=your_deepgram_key
ANTHROPIC_API_KEY=your_anthropic_key
GOOGLE_API_KEY=your_google_key
```

### Example Link

Complete implementation with
full source code, setup instructions, and configuration examples available in the [VideoSDK Agents GitHub repository](https://github.com/videosdk-live/agents/tree/main/examples/human_in_the_loop). --- --- title: Introduction hide_title: false hide_table_of_contents: false description: "Introduce yourself to the VideoSDK AI Agent SDK, a Python framework for integrating AI-powered voice agents into VideoSDK meetings. Understand its high-level architecture and how it bridges AI models with users for real-time interactions." pagination_label: "Introduction" keywords: - AI Agent SDK - VideoSDK Agents - Introduction - Python SDK - Voice AI - Real-time Communication - AI Integration - VideoSDK Cloud - Conversational AI - Build AI Agents image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Introduction slug: introduction --- import { AgentCardGrid, GithubIcon, RobotIcon, DocumentIcon, PlayIcon, CodeIcon, ExternalLinkIcon, SettingsIcon, TelephonyIcon, WaveformIcon, DocsIcon, CloudIcon, PuzzlePieceSimpleIcon, MetricsIcon, BulbIcon, DiscordIcon, SupportIcon } from '@site/src/components/agent/cards'; # AI Voice Agents The VideoSDK AI Agent SDK is a powerful Python framework for developers to seamlessly integrate intelligent, real-time voice agents into any application. Bridge the gap between advanced AI models and human interaction, creating natural, engaging, and responsive conversational experiences. 
- **[AI Telephony Agent Quickstart](/ai_agents/ai-phone-agent-quick-start)**: Build an AI Telephony Agent in less than 10 minutes.
- **[GitHub Repository](https://github.com/videosdk-live/agents)**: The VideoSDK agent code and examples.
- **[Agent Starter Apps](/ai_agents/agent-runtime/connect-agent/web-integrations/agent-starter-react)**: Ready-to-run starter apps to get your AI agent up and running fast.

## The Architecture

The VideoSDK AI Agents framework connects four key components to enable seamless AI voice interactions:

- Your **Infrastructure** hosts the agent management system
- The **Agent Worker** creates and manages AI sessions
- The **VideoSDK Room** handles real-time meeting operations
- **User Devices** connect through web, mobile apps, or phone calls to interact with intelligent agents that can listen, understand, and respond naturally in real-time conversations.

![Introduction](https://assets.videosdk.live/images/agent-architecture.png)

## Use Cases

Here are some real-world applications where VideoSDK AI Agents can be deployed to create intelligent, voice-enabled experiences across different industries and scenarios. You can use them as they are or refer to them when building your own customized agent.

## The Building Blocks

Our SDK is built on four primary, modular components that work together to create powerful and customizable agents. Understand these concepts, and you're ready to build.
- **[Deployment Options](/ai_agents/deployments/introduction)**: Deploy your agent on cloud or self-host it on your own infrastructure.
- **[Observability](/ai_agents/tracing-observability/session-analytics)**: Monitor and debug with confidence using our built-in session analytics, latency tracking, and detailed traces.
- **[Plugin Ecosystem](/ai_agents/plugins/realtime/openai)**: Integrate with dozens of providers like OpenAI, Google, Anthropic, and ElevenLabs for STT, LLM, and TTS.

## Need Help?

If you have any queries, please feel free to reach out to us using one of the following methods:

- **[GitHub](https://github.com/videosdk-live/agents/issues)**: Ask your questions on GitHub.
- **[Support](https://www.videosdk.live/contact)**: Talk to an expert, book a demo, or talk to sales.

## Frequently Asked Questions
**What programming language and version are required?**

The AI Agent SDK is built in Python. You'll need Python 3.12 or higher to use the SDK.
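A quick way to confirm that an interpreter meets this requirement is a small standard-library check. This is only a sketch: the helper name `python_is_supported` is illustrative and not part of the SDK.

```python
import sys

MIN_PYTHON = (3, 12)  # minimum version stated in the answer above


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the given interpreter version meets the stated minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    found = sys.version.split()[0]
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required; found {found}")
```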
**Can my agent answer phone calls?**

Yes. By integrating with our SIP/telephony services, your AI agent can join a room initiated by a standard phone call. This allows you to build powerful IVR systems, automated appointment schedulers, AI-powered call centers, and more.
**What AI models are supported?**

The SDK supports various AI models including:

- **Real-time Models**: OpenAI, Google Gemini, AWS Nova Sonic
- **LLM Providers**: OpenAI, Google Gemini, Anthropic Claude, Sarvam AI, Cerebras
- **TTS Providers**: ElevenLabs, OpenAI, Google, AWS Polly, Cartesia, and many more
- **STT Providers**: OpenAI Whisper, Deepgram, Google, AssemblyAI, and others
**Can I use my own custom models?**

Absolutely! The SDK's modular architecture allows you to create custom plugins for any AI provider. Check our [plugin development guide](https://github.com/videosdk-live/agents/blob/main/BUILD_YOUR_OWN_PLUGIN.md) for detailed instructions.
**How is pricing handled for the AI Agent SDK?**

VideoSDK offers a free tier with limited usage. The AI Agent SDK itself is open-source, but you'll need API keys for the AI services you choose to use (OpenAI, Google, etc.). Check the [pricing page](https://www.videosdk.live/pricing) for VideoSDK usage limits.
**Can agents handle more than just voice?**

Absolutely! Agents support multimodal interactions including vision processing, data messages, and real-time video streams. They can also use function tools to interact with external systems and APIs.
**Is the SDK production-ready?**

Yes, the AI Agent SDK is stable and production-ready. It is designed to be self-hosted on your own infrastructure for full control and scalability, from a single server to a Kubernetes cluster. It includes comprehensive error handling, metrics collection, and deployment flexibility.
---

---
title: MCP Integration
hide_title: false
hide_table_of_contents: false
description: "Learn how to integrate Model Context Protocol (MCP) servers with VideoSDK AI Agents to extend your agent's capabilities with external services, databases, and APIs using STDIO and HTTP transport methods."
pagination_label: "MCP Integration"
keywords:
  - MCP Integration
  - Model Context Protocol
  - MCP Client
  - MCP Servers
  - Multiple MCP Servers
  - MCP Server Client Example
  - VideoSDK Agents
  - AI Agent SDK
  - Python
  - MCP Tools
  - MCP Standard Input/Output (stdio)
  - MCP Streamable HTTP
  - MCP Server-Sent Events (SSE)
  - External APIs
  - Voice Agent Sessions
  - Real Time MCP
image: img/videosdklive-thumbnail.jpg
sidebar_position: 4
sidebar_label: MCP Integration
slug: mcp-integration
---

The Model Context Protocol (MCP) is an open standard that enables AI assistants to securely connect to data sources and tools. With VideoSDK's AI Agents, you can seamlessly integrate MCP servers to extend your agent's capabilities with external services or applications, databases, and APIs.

## MCP Server Types

VideoSDK supports two transport methods for MCP servers:

### 1. STDIO Transport

- Direct process communication
- Local Python scripts
- Best for custom tools and functions
- Ideal for server-side integrations

### 2. HTTP Transport (Streamable HTTP or SSE)

- Network-based communication
- External MCP services
- Best for third-party integrations
- Supports remote MCP servers

## How It Works with VideoSDK's AI Agent

MCP tools are automatically discovered and made available to your agent. The agent then chooses which tools to use based on the user's request.
When a user asks for information that requires external data, the agent will:

- Identify the need for external data based on the user's request
- Select appropriate tools from available MCP servers
- Execute the tools with relevant parameters
- Process the results and provide a natural language response

This seamless integration allows your voice agent to access real-time data and external services while maintaining a natural conversational flow.

## Creating an MCP Server

### Basic MCP Server Structure

A simple MCP server using STDIO to return the current time. First, install the required package:

```bash
pip install fastmcp
```

```python title="mcp_stdio_example.py"
from mcp.server.fastmcp import FastMCP
import datetime

# Create the MCP server
mcp = FastMCP("CurrentTimeServer")


@mcp.tool()
def get_current_time() -> str:
    """Get the current time in the user's location"""
    # Get current time
    now = datetime.datetime.now()
    # Return formatted time string
    return f"The current time is {now.strftime('%H:%M:%S')} on {now.strftime('%Y-%m-%d')}"


if __name__ == "__main__":
    # Run the server with STDIO transport
    mcp.run(transport="stdio")
```

## Integrating MCP with VideoSDK Agent

Now we'll see how to integrate MCP servers with your VideoSDK AI Agent:

```python title="main.py"
import asyncio
import pathlib
import sys

from videosdk.agents import Agent, AgentSession, Pipeline, MCPServerStdio, MCPServerHTTP
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig


class MyVoiceAgent(Agent):
    def __init__(self):
        # Define paths to your MCP servers
        mcp_script = pathlib.Path(__file__).parent.parent / "MCP_Example" / "mcp_stdio_example.py"
        super().__init__(
            instructions="""You are a helpful assistant with access to real-time data.
            You can provide current time information.
            Always be conversational and helpful in your responses.""",
            mcp_servers=[
                # STDIO MCP Server (Local Python script for time)
                MCPServerStdio(
                    executable_path=sys.executable,  # Use current Python interpreter
                    process_arguments=[str(mcp_script)],
                    session_timeout=30
                ),
                # HTTP MCP Server (External service example e.g Zapier)
                MCPServerHTTP(
                    endpoint_url="https://your-mcp-service.com/api/mcp",
                    session_timeout=30
                )
            ]
        )

    async def on_enter(self) -> None:
        await self.session.say("Hi there! How can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Thank you for using the assistant. Goodbye!")


async def main(context: dict):
    # Configure Gemini Realtime model
    model = GeminiRealtime(
        model="gemini-3.1-flash-live-preview",
        config=GeminiLiveConfig(
            voice="Leda",  # Available voices: Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, Zephyr
            response_modalities=["AUDIO"]
        )
    )
    pipeline = Pipeline(llm=model)
    agent = MyVoiceAgent()
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        context=context
    )
    try:
        # Start the session
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()


if __name__ == "__main__":
    def make_context():
        # When VIDEOSDK_AUTH_TOKEN is set in .env - DON'T include videosdk_auth
        return {
            "meetingId": "your_actual_meeting_id_here",  # Replace with actual meeting ID
            "name": "AI Voice Agent",
            "videosdk_auth": "your_videosdk_auth_token_here"  # Replace with actual token
        }

    # Run the agent with the context built above
    asyncio.run(main(context=make_context()))
```

:::tip
Get started quickly with the [Quick Start Example](https://github.com/videosdk-live/agents-quickstart/tree/main/MCP) for the VideoSDK AI Agent SDK with MCP, which has everything you need to build your first AI agent fast.
:::

---

---
title: Anam AI Avatar
hide_title: false
hide_table_of_contents: false
description: "Build video agents with unmatched realism using Anam AI avatars and the VideoSDK AI Agent SDK. This guide covers configuration, API integration, and adding a lifelike visual avatar to your agent."
pagination_label: "Anam AI"
keywords:
  - Anam
  - Anam AI
  - Avatar
  - Real-time
  - VideoSDK Agents
  - Python SDK
  - AI Agent
  - Virtual Avatar
  - Realistic Avatar
image: img/videosdklive-thumbnail.jpg
sidebar_position: 5
sidebar_label: Anam AI
slug: anam-ai
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Anam AI Avatar

The Anam AI Avatar plugin allows you to integrate a real-time, lip-synced AI avatar into your VideoSDK agent. It provides a visual representation of the agent with expressive facial movements synchronized to speech output, and works with `Pipeline` in both cascading and realtime modes.

## Installation

Install the Anam plugin for VideoSDK Agents:

```bash
pip install "videosdk-plugins-anam"
```

## Authentication

The Anam AI plugin requires an [Anam API key](https://www.anam.ai/).

## Importing

```python
from videosdk.plugins.anam import AnamAvatar
```

## Setup Credentials

To use Anam AI, you need an API key and an avatar ID. You can get them from the [Anam AI Dashboard](https://www.anam.ai/).

```bash
ANAM_API_KEY="YOUR_ANAM_API_KEY"
ANAM_AVATAR_ID="YOUR_ANAM_AVATAR_ID"
```

## Example Usage

Here's how you can integrate the Anam AI Avatar with `Pipeline` in both realtime and cascading modes.

This example shows how to add the Anam AI Avatar to a `Pipeline` (realtime mode).

```python
import os
from videosdk.agents import Pipeline
from videosdk.plugins.anam import AnamAvatar

# 1. Create an AnamAvatar instance
anam_avatar = AnamAvatar(
    api_key=os.getenv("ANAM_API_KEY"),
    avatar_id=os.getenv("ANAM_AVATAR_ID"),
)

# 2. Add the avatar to the pipeline
pipeline = Pipeline(
    avatar=anam_avatar
)
```

For a full working example, see the [Anam Realtime Example on GitHub](https://github.com/videosdk-live/agents/blob/main/examples/avatar/anam_realtime_example.py).

This example shows how to add the Anam AI Avatar to a `Pipeline` (cascading mode).

```python
import os
from videosdk.agents import Pipeline
from videosdk.plugins.anam import AnamAvatar

# 1. Create an AnamAvatar instance
anam_avatar = AnamAvatar(
    api_key=os.getenv("ANAM_API_KEY"),
    avatar_id=os.getenv("ANAM_AVATAR_ID"),
)

# 2. Add the avatar to the pipeline
pipeline = Pipeline(
    avatar=anam_avatar
)
```

For a full working example, see the [Anam Cascade Example on GitHub](https://github.com/videosdk-live/agents/blob/main/examples/avatar/anam_cascade_example.py).

## Configuration Options

### `AnamAvatar`

- `api_key`: (str, **required**) Your Anam API key.
- `avatar_id`: (str, optional) The ID of the avatar to use. Defaults to `"d9ebe82e-2f34-4ff6-9632-16cb73e7de08"`.

## Additional Resources

The following resources provide more information about using Anam AI with the VideoSDK Agents SDK.

- **[Anam AI docs](https://docs.anam.ai/)**: Anam AI official documentation.

---

---
title: Simli Avatar
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Simli's real-time AI avatars with the VideoSDK AI Agent SDK. This guide covers configuration, API integration, and adding a visual avatar to your agent."
pagination_label: "Simli Avatar"
keywords:
  - Simli
  - Avatar
  - Real-time
  - VideoSDK Agents
  - Python SDK
  - AI Agent
  - Virtual Avatar
image: img/videosdklive-thumbnail.jpg
sidebar_position: 4
sidebar_label: Simli
slug: simli
---

# Simli Avatar

The Simli Avatar plugin allows you to integrate a real-time, lip-synced AI avatar into your VideoSDK agent. This creates a more engaging and interactive experience for users by providing a visual representation of the AI agent.

Simli offers two avatar types: Legacy (30 FPS) and Trinity (25 FPS). When creating a `SimliAvatar`, set `is_trinity_avatar=True` if you're using a Trinity avatar (the default is `False`). Always select the correct `faceId` from the Simli dashboard.
## Installation

Install the Simli plugin for VideoSDK Agents:

```bash
pip install "videosdk-plugins-simli"
```

## Authentication

The Simli plugin requires a [Simli API key](https://app.simli.com/apikey). Set `SIMLI_API_KEY` in your `.env` file.

## Importing

```python
from videosdk.plugins.simli import SimliAvatar, SimliConfig
```

## Setup Credentials

To use Simli, you need an API key. You can get one from the [Simli Dashboard](https://app.simli.com/profile). Set up your credentials by exporting them as an environment variable:

```bash
export SIMLI_API_KEY="YOUR_SIMLI_API_KEY"
```

You can also provide a `faceId` if you have a custom one.

```bash
export SIMLI_FACE_ID="YOUR_FACE_ID"
```

## Example Usage

Here's how you can integrate the Simli Avatar with `Pipeline` in both cascading and realtime modes.

### Cascade Mode

This example shows how to add the Simli Avatar to a `Pipeline` (cascading mode).

```python
import os
from videosdk.agents import Pipeline
from videosdk.plugins.simli import SimliAvatar, SimliConfig
# Import other necessary components like STT, LLM, TTS

# 1. Initialize SimliConfig
simli_config = SimliConfig(
    apiKey=os.getenv("SIMLI_API_KEY"),
    faceId=os.getenv("SIMLI_FACE_ID"),  # This is optional and has a default value
)

# 2. Create a SimliAvatar instance
# For Legacy avatars (default)
simli_avatar = SimliAvatar(config=simli_config)

# For Trinity avatars
# simli_avatar = SimliAvatar(
#     config=simli_config,
#     is_trinity_avatar=True,
# )

# 3. Add the avatar to the pipeline
pipeline = Pipeline(
    # ... stt=stt, llm=llm, tts=tts
    avatar=simli_avatar
)
```

### Real-time Pipeline

This example shows how to add the Simli Avatar to a `Pipeline` (realtime mode).

```python
import os
from videosdk.agents import Pipeline
from videosdk.plugins.simli import SimliAvatar, SimliConfig
# from videosdk.plugins.google import GeminiRealtime  # Example model

# 1. Initialize SimliConfig
simli_config = SimliConfig(
    apiKey=os.getenv("SIMLI_API_KEY"),
)

# 2. Create a SimliAvatar instance
# For Legacy avatars (default)
simli_avatar = SimliAvatar(config=simli_config)

# For Trinity avatars
# simli_avatar = SimliAvatar(
#     config=simli_config,
#     is_trinity_avatar=True,
# )

# 3. Add the avatar to the pipeline
pipeline = Pipeline(
    llm=your_realtime_model,  # e.g., GeminiRealtime()
    avatar=simli_avatar
)
```

:::note
When using an environment variable for credentials, you should still load it in your code using `os.getenv("SIMLI_API_KEY")` and pass it to `SimliConfig`.
:::

## Configuration Options

You can customize the avatar's behavior using the `SimliConfig` and `SimliAvatar` classes.

### `SimliConfig`

- `faceId`: (str, optional) The ID for the avatar face. You can find available faces in the [Simli Docs](https://docs.simli.com/api-reference/available-faces) or create your own. Defaults to `"0c2b8b04-5274-41f1-a21c-d5c98322efa9"`.
- `maxSessionLength`: (int, optional) A hard time limit in seconds after which the session will disconnect. Defaults to `1800` (30 minutes).
- `maxIdleTime`: (int, optional) A soft time limit in seconds that disconnects the session after a period of not sending data. Defaults to `300` (5 minutes).

### `SimliAvatar`

- `config`: (`SimliConfig`) A `SimliConfig` object with your desired settings.
- `is_trinity_avatar`: (bool, optional) Set to `True` when using Trinity avatars. Defaults to `False` for Legacy avatars.

## Additional Resources

The following resources provide more information about using Simli with the VideoSDK Agents SDK.

- **[Simli docs](https://docs.simli.com/overview)**: Simli official documentation.

---

---
title: RNNoise Denoise
hide_title: false
hide_table_of_contents: false
description: "Learn how to use RNNoise with the VideoSDK AI Agent SDK. This guide covers how to denoise your audio input."
pagination_label: "Denoise"
keywords:
  - RNNoise
  - Denoise
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Denoise
slug: denoise
---

# Denoise

The RNNoise plugin enhances audio quality by removing background noise from your audio input, resulting in improved speech-to-text (STT) accuracy and better overall audio processing performance.

RNNoise is a real-time noise suppression library powered by a recurrent neural network that intelligently filters out environmental noise such as air conditioning, computer fans, and other stationary background sounds while preserving the clarity and quality of speech.

## Installation

Install the RNNoise denoising plugin for VideoSDK Agents:

```bash
pip install "videosdk-plugins-rnnoise"
```

## Importing

```python
from videosdk.plugins.rnnoise import RNNoise
```

## Example Usage

```python
from videosdk.plugins.rnnoise import RNNoise
from videosdk.agents import Pipeline

# Initialize the RNNoise Plugin
rnnoise = RNNoise()

# Add Denoise Plugin to pipeline
pipeline = Pipeline(denoise=rnnoise)
```

It also works with [`Pipeline`](/ai_agents/core-components/pipeline) in realtime mode.

## Example Usage in Realtime

```python
from videosdk.plugins.rnnoise import RNNoise
from videosdk.agents import Pipeline

# Initialize the RNNoise Plugin
rnnoise = RNNoise()

# Add Denoise Plugin to pipeline
pipeline = Pipeline(denoise=rnnoise)
```

## Benefits

- **Enhanced STT Accuracy**: Cleaner audio input leads to more accurate speech-to-text transcription
- **Real-time Processing**: Processes audio streams with minimal latency for seamless user experience
- **Intelligent Noise Reduction**: Effectively removes background noise while preserving speech clarity

## Additional Resources

The following resources provide more information about using RNNoise with the VideoSDK Agents SDK.

- **[RNNoise project](https://github.com/xiph/rnnoise)**: The open source RNNoise library that powers the VideoSDK RNNoise plugin.

---

---
title: VideoSDK Inference
hide_title: false
hide_table_of_contents: false
description: "Learn how to use VideoSDK's Inference Gateway to easily integrate various AI models for STT, TTS, and Realtime communication without your own API keys."
sidebar_label: VideoSDK Inference
slug: videosdk-inference
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# VideoSDK Inference

VideoSDK Inference provides a unified gateway to access various AI models for Speech-to-Text (STT), Large Language Models (LLM), Text-to-Speech (TTS), and real-time multimodal communication.

With VideoSDK Inference, you don't need to provide your own API keys for individual AI providers (like Sarvam AI, Google Gemini, etc.). VideoSDK handles the authentication and API connections through its unified gateway, allowing you to get started instantly. Usage is charged against your VideoSDK account balance.

## Installation

The Inference plugin is part of the core VideoSDK Agents SDK. You can install it using pip:

```bash
pip install videosdk-agents
```

## Importing

You can import the `STT`, `LLM`, `TTS`, `Denoise`, and `Realtime` classes from the `videosdk.agents.inference` module.

```python
from videosdk.agents.inference import STT, LLM, TTS, Denoise, Realtime
```

## Setup Authentication

Authentication for the Inference gateway is handled via the `VIDEOSDK_AUTH_TOKEN` environment variable.

```bash
VIDEOSDK_AUTH_TOKEN="your-videosdk-auth-token"
```

In cascading mode, you can use VideoSDK Inference to handle speech recognition and synthesis. This example shows how to use Sarvam AI's models via the VideoSDK gateway.

### Example Usage

```python
import logging

from videosdk.agents import (
    Agent,
    AgentSession,
    Pipeline,
    JobContext,
    RoomOptions,
    WorkerJob,
)
# highlight-start
from videosdk.agents.inference import STT, LLM, TTS, Denoise
# highlight-end
from videosdk.plugins.silero import SileroVAD

# Minimal logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


class SimpleAgent(Agent):
    """Simple voice agent for testing inference STT."""

    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
        )

    async def on_enter(self) -> None:
        await self.session.say(
            "Hello! I'm using VideoSDK Inference for speech recognition. How can I help you?"
        )

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")


async def entrypoint(ctx: JobContext):
    """Main entrypoint for the agent."""
    agent = SimpleAgent()

    # Create pipeline with Inference STT, LLM, TTS & Denoise (via VideoSDK Gateway)
    pipeline = Pipeline(
        # highlight-start
        # Inference STT, LLM, TTS, Denoise (via VideoSDK Gateway)
        stt=STT.sarvam(model_id="saarika:v2.5", language="en-IN"),
        llm=LLM.google(model_id="gemini-2.5-flash"),
        tts=TTS.sarvam(model_id="bulbul:v2", speaker="anushka", language="en-IN"),
        denoise=Denoise.sanas(),
        # highlight-end
        vad=SileroVAD(),
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
    )
    await session.start(wait_for_participant=True, run_until_shutdown=True)


def make_context() -> JobContext:
    """Create job context for playground mode."""
    room_options = RoomOptions(
        name="Inference Test Agent",
        playground=True
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
```

In realtime mode, the Pipeline uses the VideoSDK Inference Gateway to handle multimodal models such as Gemini Live 2.5 Flash Native Audio. The gateway manages the connection efficiently and reduces latency.
### Example Usage

```python
import logging

from videosdk.agents import (
    Agent,
    AgentSession,
    Pipeline,
    JobContext,
    RoomOptions,
    WorkerJob,
)
# highlight-start
from videosdk.agents.inference import Realtime
# highlight-end

# Minimal logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


class SimpleAgent(Agent):
    """Simple voice agent for testing inference realtime."""

    def __init__(self):
        super().__init__(
            instructions="""You are a helpful and friendly voice assistant.
            You speak in a natural, conversational tone.
            Keep your responses concise but informative.""",
        )

    async def on_enter(self) -> None:
        await self.session.say(
            "Hello! I'm using the VideoSDK Inference Gateway with Gemini. How can I help you today?"
        )

    async def on_exit(self) -> None:
        await self.session.say("Goodbye! Have a great day!")


async def entrypoint(ctx: JobContext):
    """Main entrypoint for the agent."""
    agent = SimpleAgent()

    # Create Pipeline with Inference Realtime (Gemini)
    pipeline = Pipeline(
        # highlight-start
        llm=Realtime.gemini(
            model_id="gemini-3.1-flash-live-preview",
            voice="Puck",
            language_code="en-US",
            response_modalities=["AUDIO"],
            temperature=0.7
        ),
        # highlight-end
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
    )
    await session.start(wait_for_participant=True, run_until_shutdown=True)


def make_context() -> JobContext:
    """Create job context for playground mode."""
    room_options = RoomOptions(
        name="Inference Agent",
        playground=True
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
```

## Configuration Options

### STT Configuration

#### `STT.sarvam()`

- `model_id`: (str) The specific Sarvam model ID (e.g., `"saarika:v2.5"`).
- `language`: (str) Language code for transcription (e.g., `"en-IN"`).

#### `STT.google()`

- `model_id`: (str) The Google model ID (e.g., `"chirp_3"`).
- `language`: (str) Language code for transcription (default: `"en-US"`).

### LLM Configuration

#### `LLM.google()`

- `model_id`: (str) The Gemini model version (e.g., `"gemini-2.5-flash"`).
- `temperature`: (float) Sampling temperature for response randomness (default: `0.7`).

### TTS Configuration

#### `TTS.sarvam()`

- `model_id`: (str) The Sarvam model ID (e.g., `"bulbul:v2"`).
- `speaker`: (str) The speaker name (e.g., `"anushka"`).
- `language`: (str) Language code (e.g., `"en-IN"`).

#### `TTS.google()`

- `model_id`: (str) The Google model ID (e.g., `"Chirp3-HD"`).
- `voice_id`: (str) The voice ID (e.g., `"Achernar"`).
- `language`: (str) Language code (e.g., `"en-US"`).

### Denoise Configuration

#### `Denoise.sanas()`

- Integrates Sanas for real-time speech enhancement and noise suppression.

### Realtime Configuration

#### `Realtime.gemini()`

- `model_id`: (str) The Gemini model version (e.g., `"gemini-3.1-flash-live-preview"`).
- `voice`: (str) The voice to use (e.g., `"Puck"`, `"Charon"`, `"Kore"`, `"Fenrir"`, `"Aoede"`).
- `language_code`: (str) Language code (e.g., `"en-US"`).
- `response_modalities`: (list) List of modalities, e.g., `["AUDIO"]` or `["TEXT", "AUDIO"]`.
- `temperature`: (float) Sampling temperature (default: `0.7`).

## Additional Resources

The following resources provide more information about using VideoSDK Inference.

- **[Inference Pricing](https://docs.videosdk.live/help_docs/pricing-inference)**: Detailed provider-wise pricing.

---

---
title: Anthropic LLM
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Anthropic's LLM models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text-based AI capabilities for your conversational agents."
pagination_label: "Anthropic LLM"
keywords:
  - Anthropic
  - Claude
  - LLM
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text Generation
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 2
sidebar_label: Anthropic
slug: anthropic-llm
---

# Anthropic LLM

The Anthropic LLM provider enables your agent to use Anthropic's Claude language models for text-based conversations and processing. It also supports vision input, allowing your agent to analyze and respond to images alongside text with the [supported](https://docs.anthropic.com/en/docs/about-claude/models/overview) models.

## Installation

Install the Anthropic plugin for VideoSDK Agents:

```bash
pip install "videosdk-plugins-anthropic"
```

## Importing

```python
from videosdk.plugins.anthropic import AnthropicLLM
```

## Authentication

The Anthropic plugin requires an [Anthropic API key](https://console.anthropic.com/dashboard). Set `ANTHROPIC_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.anthropic import AnthropicLLM
from videosdk.agents import Pipeline

llm = AnthropicLLM(
    model="claude-sonnet-4-20250514",
    temperature=0.7,
    max_tokens=1024,
)

pipeline = Pipeline(llm=llm)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

### Core

- `model`: The Claude model to use. Default: `"claude-sonnet-4-20250514"`.
- `api_key`: Your Anthropic API key. Falls back to the `ANTHROPIC_API_KEY` environment variable.
- `base_url`: Optional custom base URL for the Claude API. Default: `None`.
- `temperature`: Sampling temperature. Default: `0.7`.
- `tool_choice`: Tool selection mode: `"auto"`, `"required"`, `"none"`, or a dict `{"type": "function", "function": {"name": "my_tool"}}` to force a specific tool. Default: `"auto"`.
- `max_tokens`: Maximum tokens in the response. Default: `1024`.
- `top_p`: Nucleus sampling probability mass (float, optional).
- `top_k`: Top-k sampling parameter (int, optional).

### Tool calling

- `parallel_tool_calls`: Allow (`True`) or prevent (`False`) Claude from issuing multiple tool calls in a single response turn. When `False`, Claude is forced to call tools one at a time (optional).

### Prompt caching

- `caching`: Set to `"ephemeral"` to enable [Anthropic prompt caching](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching). When enabled, the SDK automatically applies cache markers to the system prompt, tool schemas, and the most recent conversation turns. Cache hits reduce input token costs. Default: `None` (disabled).

### Extended thinking

- `thinking`: Dict that enables extended thinking (Claude's internal reasoning pass before answering). Example: `{"type": "enabled", "budget_tokens": 4096}`. When set, `max_tokens` must be greater than `budget_tokens`. Default: `None` (disabled).

## Prompt Caching

Prompt caching reduces costs when the same system prompt, tool schemas, or recent conversation turns are reused across requests. Set `caching="ephemeral"` to let the SDK handle marker placement automatically.

```python
from videosdk.plugins.anthropic import AnthropicLLM
from videosdk.agents import Pipeline

llm = AnthropicLLM(
    model="claude-sonnet-4-20250514",
    temperature=0.7,
    max_tokens=1024,
    caching="ephemeral",        # cache system prompt + tools + last turns
    parallel_tool_calls=True,   # let Claude call multiple tools at once
)

pipeline = Pipeline(llm=llm)
```

When caching is active, `LLMResponse.metadata["usage"]` will include two additional keys:

- `cache_creation_tokens`: tokens written to the cache on this request (charged at a higher rate, amortised over future hits)
- `cache_read_tokens`: tokens read from the cache on this request (charged at a lower rate)

:::note
Prompt caching requires a minimum cacheable block size (1024 tokens for Sonnet/Opus, 2048 for Haiku). Very short system prompts or tool lists may not qualify for caching.
:::

## Extended Thinking

Extended thinking gives Claude additional reasoning time before producing its final answer. This can improve accuracy on complex multi-step tasks.

```python
from videosdk.plugins.anthropic import AnthropicLLM
from videosdk.agents import Pipeline

llm = AnthropicLLM(
    model="claude-sonnet-4-20250514",
    thinking={"type": "enabled", "budget_tokens": 4096},
    max_tokens=8192,  # must be greater than budget_tokens
)

pipeline = Pipeline(llm=llm)
```

:::warning
`max_tokens` must be strictly greater than `budget_tokens` inside the `thinking` dict. If `budget_tokens` is 4096, set `max_tokens` to at least 4097 (typically much higher to leave room for the final answer).
:::

## Additional Resources

- **[Anthropic docs](https://docs.anthropic.com/en/docs/intro)**: Anthropic documentation.

---

---
title: Azure OpenAI LLM
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Azure OpenAI's LLM models with the VideoSDK AI Agent SDK.
This guide covers model configuration, API integration, and implementing text-based AI capabilities for your conversational agents." pagination_label: "Azure OpenAI LLM" keywords: - OpenAI - Azure - Azure OpenAI - GPT-4o - LLM - Large Language Model - VideoSDK Agents - Python SDK - Text Generation - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Azure OpenAI slug: azureopenai --- # Azure OpenAI LLM The Azure OpenAI LLM provider enables your agent to use Azure OpenAI's language models (like GPT-4o) for text-based conversations and processing. It also supports vision input capabilities, allowing your agent to analyze and respond to images alongside text with the [supported](https://platform.openai.com/docs/models) models. ## Installation Install the Azure OpenAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-openai" ``` ## Importing ```python from videosdk.plugins.openai import OpenAILLM ``` ## Authentication The Azure OpenAI plugin requires an [Azure OpenAI API key](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/create-resource?pivots=web-portal). Set the following in your `.env` file: ``` AZURE_OPENAI_API_KEY=your-azure-api-key AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/ OPENAI_API_VERSION=2024-02-01 ``` ## Example Usage ```python from videosdk.plugins.openai import OpenAILLM from videosdk.agents import Pipeline llm = OpenAILLM.azure( azure_deployment="gpt-4o", temperature=0.7, seed=42, parallel_tool_calls=True, ) pipeline = Pipeline(llm=llm) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key`, `videosdk_auth`, and other credential parameters from your code. ::: ## Configuration Options ### Core - `azure_deployment` — The Azure OpenAI deployment ID (defaults to the `model` value, e.g. `"gpt-4o"`, `"gpt-4o-mini"`). 
- `api_key` — Your Azure OpenAI API key. Falls back to the `AZURE_OPENAI_API_KEY` environment variable. - `azure_endpoint` — Your Azure OpenAI endpoint URL. Falls back to `AZURE_OPENAI_ENDPOINT`. - `api_version` — Azure OpenAI API version. Falls back to `OPENAI_API_VERSION`. - `azure_ad_token` — Azure AD bearer token (alternative to `api_key`). Falls back to `AZURE_OPENAI_AD_TOKEN`. - `organization` — OpenAI organisation ID. Falls back to `OPENAI_ORG_ID` (optional). - `project` — OpenAI project ID. Falls back to `OPENAI_PROJECT_ID` (optional). - `base_url` — Override the default API base URL (optional). - `temperature` — Sampling temperature (0.0 – 2.0). Default: `0.7`. - `tool_choice` — Tool selection mode: `"auto"`, `"required"`, `"none"`, or a dict to force a specific tool. Default: `"auto"`. - `max_completion_tokens` — Maximum tokens in the completion response (optional). ### Generation knobs - `top_p` — Nucleus sampling probability mass (float, optional). - `frequency_penalty` — Penalises repeated tokens by frequency (float, -2.0 – 2.0, optional). - `presence_penalty` — Penalises tokens already present in the response (float, -2.0 – 2.0, optional). - `seed` — Integer seed for deterministic sampling (optional). ### Tool calling - `parallel_tool_calls` — Allow (`True`) or disallow (`False`) multiple tool calls per turn (optional). ## Advanced Example ```python from videosdk.plugins.openai import OpenAILLM from videosdk.agents import Pipeline llm = OpenAILLM.azure( azure_deployment="gpt-4o", temperature=0.7, top_p=0.95, frequency_penalty=0.1, seed=42, parallel_tool_calls=True, max_completion_tokens=2048, ) pipeline = Pipeline(llm=llm) ``` ## Additional Resources import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Cerebras LLM hide_title: false hide_table_of_contents: false description: "Learn how to use Cerebras's LLM models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing text-based AI capabilities for your conversational agents."
pagination_label: "Cerebras LLM"
keywords:
  - Cerebras
  - LLM
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text Generation
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 2
sidebar_label: Cerebras
slug: Cerebras-llm
---

# Cerebras LLM

The Cerebras AI LLM provider enables your agent to use Cerebras AI's language models for text-based conversations and processing.

## Installation

Install the Cerebras-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-cerebras"
```

## Importing

```python
from videosdk.plugins.cerebras import CerebrasLLM
```

## Authentication

The Cerebras plugin requires a [Cerebras API key](https://cloud.cerebras.ai/).

Set `CEREBRAS_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.cerebras import CerebrasLLM
from videosdk.agents import Pipeline

# Initialize the Cerebras LLM model
llm = CerebrasLLM(
    model="llama3.3-70b",
    temperature=0.7,
    max_completion_tokens=1024,  # parameter name as listed under Configuration Options
)

# Add llm to pipeline
pipeline = Pipeline(llm=llm)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `model`: (str) The Cerebras model to use (default: `"llama3.3-70b"`). Supported models include: `llama3.3-70b`, `llama3.1-8b`, `llama-4-scout-17b-16e-instruct`, `qwen-3-32b`, `deepseek-r1-distill-llama-70b` (private preview).
- `api_key`: (str) Your Cerebras API key. Can also be set via the `CEREBRAS_API_KEY` environment variable.
- `temperature`: (float) Sampling temperature for response randomness (default: `0.7`).
- `tool_choice`: (ToolChoice) Tool selection mode (`"auto"`, `"required"`, `"none"`) (default: `"auto"`).
- `max_completion_tokens`: (int) Maximum number of tokens to generate in the response (optional). - `top_p`: (float) Nucleus sampling probability (optional). - `seed`: (int) Random seed for reproducible completions (optional). - `stop`: (str) Stop sequence that halts generation when encountered (optional). - `user`: (str) Identifier for the end user triggering the request (optional). ## Additional Resources The following resources provide more information about using Cerebras with VideoSDK Agents SDK. - **[Cerebras docs](https://inference-docs.cerebras.ai/introduction)**: Cerebras documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Google LLM hide_title: false hide_table_of_contents: false description: "Learn how to use Google's LLM models (Gemini) with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text-based AI capabilities for your conversational agents." pagination_label: "Google LLM" keywords: - Google - Gemini - gemini-2.0-flash-001 - gemini-3-flash-preview - gemini-3-pro-preview - LLM - Large Language Model - VideoSDK Agents - Python SDK - Text Generation - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Google slug: google-llm --- # Google LLM The Google LLM provider enables your agent to use Google's Gemini family of language models for text-based conversations and processing. It also supports vision input capabilities, allowing your agent to analyze and respond to images alongside text with the [supported](https://ai.google.dev/gemini-api/docs/models) models. ## Installation Install the Google-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-google" ``` ## Importing ```python from videosdk.plugins.google import GoogleLLM ``` ## Authentication The Google plugin requires a [Gemini API key](https://aistudio.google.com/apikey). Set `GOOGLE_API_KEY` in your `.env` file. 
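
To fail fast with a clear message when the key is missing, you could check the environment before constructing the model. The sketch below uses only the standard library; `require_env` is a hypothetical helper written for illustration and is not part of the VideoSDK SDK.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank.

    Hypothetical helper for illustration; not part of the VideoSDK SDK.
    """
    # Default to the real process environment; a plain dict can be passed for testing.
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value
```

Calling `require_env("GOOGLE_API_KEY")` at startup, before the pipeline is built, turns a missing credential into an immediate and readable error.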
## Example Usage ```python from videosdk.plugins.google import GoogleLLM from videosdk.agents import Pipeline llm = GoogleLLM( model="gemini-2.5-flash-lite", temperature=0.7, tool_choice="auto", max_output_tokens=1000, ) pipeline = Pipeline(llm=llm) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options ### Core - `model` — The Gemini model to use (e.g. `"gemini-2.5-flash-lite"`, `"gemini-3-flash-preview"`, `"gemini-3-pro-preview"`). Default: `"gemini-2.5-flash-lite"`. - `api_key` — Your Google API key. Falls back to the `GOOGLE_API_KEY` environment variable. - `temperature` — Sampling temperature. Default: `0.7`. - `tool_choice` — Tool selection mode: `"auto"`, `"required"`, `"none"`. Default: `"auto"`. - `max_output_tokens` — Maximum tokens in the response (optional). - `top_p` — Nucleus sampling probability mass (float, optional). - `top_k` — Restricts sampling to the top-k most probable tokens (int, optional). - `presence_penalty` — Penalises tokens that have already appeared (float, optional). - `frequency_penalty` — Penalises tokens by their existing frequency in the response (float, optional). ### Generation knobs - `seed` — Integer seed for deterministic sampling (optional). ### Safety - `safety_settings` — List of `google.genai.types.SafetySetting` objects (or equivalent dicts) to override the model's default content-safety thresholds (optional). ### Extended thinking - `thinking_budget` — Token budget for extended reasoning on **Gemini 2.5** models. Set to a positive integer to enable; `0` to explicitly disable. Omit (`None`) to use the API default (optional). - `thinking_level` — Qualitative reasoning effort for **Gemini 3** models: `"low"`, `"medium"`, `"high"`, or `"minimal"`. Ignored on Gemini 2.5 (optional). 
- `include_thoughts` — When `True`, the model's internal reasoning steps are surfaced in the response metadata alongside the final answer. Works with `thinking_budget` on Gemini 2.5 (bool, optional). ## Extended Thinking Gemini models support extended thinking — an internal reasoning pass the model performs before producing the final answer. ### Gemini 2.5 — `thinking_budget` ```python from videosdk.plugins.google import GoogleLLM from videosdk.agents import Pipeline llm = GoogleLLM( model="gemini-2.5-flash-lite", thinking_budget=1024, # token budget for internal reasoning include_thoughts=True, # surface thoughts in response metadata ) pipeline = Pipeline(llm=llm) ``` ### Gemini 3 — `thinking_level` ```python from videosdk.plugins.google import GoogleLLM from videosdk.agents import Pipeline llm = GoogleLLM( model="gemini-3-flash-preview", thinking_level="medium", # "low" | "medium" | "high" | "minimal" ) pipeline = Pipeline(llm=llm) ``` :::note `thinking_budget` and `thinking_level` are mutually exclusive. Use `thinking_budget` for Gemini 2.5 models and `thinking_level` for Gemini 3 models. The plugin automatically routes to the correct configuration based on the model name. ::: ## Vertex AI Integration You can use Gemini models through Vertex AI. This requires different authentication and configuration. 
### Authentication for Vertex AI Create a service account, download the JSON key file, and set the path in your environment: ```bash export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/keyfile.json" export GOOGLE_CLOUD_PROJECT="your-gcp-project-id" export GOOGLE_CLOUD_LOCATION="your-gcp-location" ``` ### Example Usage with Vertex AI ```python from videosdk.plugins.google import GoogleLLM, VertexAIConfig from videosdk.agents import Pipeline llm = GoogleLLM( vertexai=True, vertexai_config=VertexAIConfig( project_id="videosdk", location="us-central1", ), ) pipeline = Pipeline(llm=llm) ``` ## Additional Resources - **[Gemini docs](https://ai.google.dev/gemini-api/docs/models)**: Google Gemini documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: LangChain & LangGraph LLM hide_title: false hide_table_of_contents: false description: "Use any LangChain BaseChatModel or a compiled LangGraph StateGraph as the LLM in your VideoSDK AI Agent pipeline." pagination_label: "LangChain & LangGraph LLM" keywords: - LangChain - LangGraph - LLM - Large Language Model - VideoSDK Agents - Python SDK - Text Generation - AI Chat - Conversational AI - Function Tools - Tool Calling - Deep Agents - Langchain Open AI - Voice AI - State Machine - Strict Flow - Automation - State Graph - Tavily - Slack image: img/videosdklive-thumbnail.jpg sidebar_position: 7 sidebar_label: LangChain & LangGraph slug: langchain-llm --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # LangChain & LangGraph LLM The `videosdk-plugins-langchain` package provides two LLM adapters that let you drop any LangChain-compatible model or a full LangGraph workflow directly into the VideoSDK voice pipeline — no changes needed to the rest of your pipeline. 
| Adapter | When to use | | :--- | :--- | | `LangChainLLM` | Wrap a single `BaseChatModel` (OpenAI, Anthropic, Gemini, Mistral, …) and optionally attach LangChain-native tools | | `LangGraphLLM` | Wrap a compiled `StateGraph` — multi-node flows, conditional routing, tool nodes, planners, and more | ## Installation ```bash pip install "videosdk-plugins-langchain" ``` You also need the LangChain integration package for your chosen model provider, e.g.: ```bash pip install langchain-openai # OpenAI / Azure OpenAI pip install langchain-google-genai # Google Gemini pip install langchain-anthropic # Anthropic Claude ``` ## Importing ```python from videosdk.plugins.langchain import LangChainLLM, LangGraphLLM ``` --- ## LangChainLLM `LangChainLLM` adapts any LangChain `BaseChatModel` for use inside the VideoSDK pipeline. It supports two tool-calling modes that can be combined on the same instance. ### Tool-calling modes **Mode A — VideoSDK `@function_tool` methods** Define tools as `@function_tool` methods on your `Agent` subclass (exactly as you would with `OpenAILLM` or `GoogleLLM`). The adapter converts them to LangChain stubs for schema binding and lets VideoSDK dispatch and re-feed the results — the standard VideoSDK tool loop. **Mode B — LangChain-native tools** Pass LangChain tools (e.g. `TavilySearchResults`, `WikipediaQueryRun`, or any custom `@tool` function) at init via `tools=[...]`. The full call→execute→feed loop runs _inside_ the adapter; the voice pipeline only receives the final text stream. Both modes can be active simultaneously. ### Basic Usage ```python title="agent.py" from langchain_openai import ChatOpenAI from videosdk.plugins.langchain import LangChainLLM from videosdk.agents import Agent, Pipeline, function_tool class SlackAgent(Agent): def __init__(self): super().__init__(instructions="You are a helpful Slack assistant.") @function_tool async def post_message(self, channel: str, message: str) -> str: """Post a message to a Slack channel. 
Args: channel: Channel name (e.g. 'general') message: The message text to post """ # ... your Slack API call here return f"Message posted to #{channel}." langchain_llm = LangChainLLM( llm=ChatOpenAI(model="gpt-4o-mini", streaming=True), ) pipeline = Pipeline(llm=langchain_llm, ...) ``` ```python title="agent.py" from langchain_openai import ChatOpenAI from langchain_community.tools.tavily_search import TavilySearchResults from videosdk.plugins.langchain import LangChainLLM from videosdk.agents import Pipeline langchain_llm = LangChainLLM( llm=ChatOpenAI(model="gpt-4o-mini", streaming=True), tools=[TavilySearchResults(max_results=3)], ) pipeline = Pipeline(llm=langchain_llm, ...) ``` ### Configuration Options | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `llm` | `BaseChatModel` | required | Any LangChain chat model instance | | `tools` | `list \| None` | `None` | LangChain-native tools executed internally (Mode B) | | `max_tool_iterations` | `int` | `10` | Safety cap on consecutive internal tool-call rounds | :::note When using a `.env` file for credentials, pass credentials to the LangChain model constructor (e.g. `ChatOpenAI()`), not to `LangChainLLM`. The SDK reads environment variables automatically, so you can also omit them entirely and rely on the provider's SDK defaults. ::: ### Full Example — Voice-controlled Slack Assistant ```python title="agent.py" """ Voice-controlled Slack assistant powered by LangChain. 
Pipeline: DeepgramSTT + LangChainLLM + CartesiaTTS + SileroVAD + TurnDetector Env Vars: VIDEOSDK_AUTH_TOKEN, DEEPGRAM_API_KEY, OPENAI_API_KEY, CARTESIA_API_KEY, SLACK_BOT_TOKEN """ import os from slack_sdk.web.async_client import AsyncWebClient from langchain_openai import ChatOpenAI from videosdk.agents import Agent, AgentSession, Pipeline, WorkerJob, JobContext, RoomOptions, function_tool from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.cartesia import CartesiaTTS from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector, pre_download_model from videosdk.plugins.langchain import LangChainLLM pre_download_model() _slack = AsyncWebClient(token=os.environ.get("SLACK_BOT_TOKEN", "")) class SlackVoiceAgent(Agent): def __init__(self): super().__init__( instructions=( "You are Max, a voice-controlled Slack assistant. " "You can post messages to channels. " "After executing any action, confirm it briefly." ), ) async def on_enter(self) -> None: await self.session.say("Hey! I'm Max. Which channel would you like to post to?") @function_tool async def post_message(self, channel: str, message: str) -> str: """Post a message to a Slack channel. Args: channel: Channel name or ID (e.g. 'general', 'C01234ABCDE') message: The message text to post """ channel_name = channel.lstrip("#") try: await _slack.chat_postMessage(channel=f"#{channel_name}", text=message) return f"Message posted to #{channel_name}." 
except Exception as exc: return f"Failed to post to #{channel_name}: {exc}" async def entrypoint(ctx: JobContext): agent = SlackVoiceAgent() langchain_llm = LangChainLLM( llm=ChatOpenAI(model="gpt-4o-mini", streaming=True), ) pipeline = Pipeline( stt=DeepgramSTT(), llm=langchain_llm, tts=CartesiaTTS(), vad=SileroVAD(), turn_detector=TurnDetector(), ) session = AgentSession(agent=agent, pipeline=pipeline) await session.start(wait_for_participant=True, run_until_shutdown=True) def make_context() -> JobContext: return JobContext(room_options=RoomOptions(room_id="", name="Slack Assistant", playground=True)) if __name__ == "__main__": WorkerJob(entrypoint=entrypoint, jobctx=make_context).start() ``` --- ## LangGraphLLM `LangGraphLLM` wraps a compiled LangGraph `StateGraph` as a VideoSDK LLM. The entire graph — nodes, edges, tool nodes, conditional routing, and internal state — runs as the "LLM" from the pipeline's perspective. ### Key concepts - **`output_node`** — Only text chunks emitted by this node name reach TTS. Use it to suppress intermediate planner/researcher nodes and expose only the final synthesis node to the voice pipeline. - **`stream_mode`** — `"messages"` (default) streams `AIMessageChunk` tokens. `"custom"` streams arbitrary objects emitted via `graph.send()`. Pass a list for both simultaneously. - **`subgraphs`** — Set `True` to stream tokens from nested subgraphs (requires LangGraph ≥ 0.2). - **`config`** — Optional LangGraph `RunnableConfig` dict for thread IDs, recursion limits, or custom callbacks. ### Basic Usage ```python title="agent.py" from videosdk.plugins.langchain import LangGraphLLM from videosdk.agents import Pipeline # graph is a compiled LangGraph StateGraph llm = LangGraphLLM( graph=my_compiled_graph, output_node="synthesizer_node", # only this node's text reaches TTS ) pipeline = Pipeline(llm=llm, ...) 
``` ### Configuration Options | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `graph` | `CompiledStateGraph` | required | A compiled LangGraph graph (`StateGraph.compile()`) | | `output_node` | `str \| None` | `None` | Node name whose text is forwarded to TTS; `None` forwards all AI text | | `config` | `dict \| None` | `None` | LangGraph `RunnableConfig` (thread IDs, recursion limit, callbacks) | | `stream_mode` | `str \| list[str]` | `"messages"` | Streaming mode: `"messages"`, `"custom"`, or both as a list | | `subgraphs` | `bool` | `False` | Stream tokens from nested subgraphs | | `context` | `Any \| None` | `None` | LangGraph 2.0 context object injected at runtime | ### Full Example — Voice-driven Blog Writer This example shows a 3-question information-gathering flow before a multi-step sequential writing pipeline. ``` START → coordinator_node (extract topic / audience / tone) ↓ all 3 gathered? → planner_node (plan 4 BlogSection objects) → write_sections (4 sequential LLM calls — one per section) → compiler_node (join + save {slug}.md) → synthesizer_node ← OUTPUT NODE (spoken announcement) ↓ info still missing? → synthesizer_node (asks the next gathering question) END ``` ```python title="agent.py" """ Voice-driven blog writer powered by a sequential LangGraph pipeline. 
Env Vars: VIDEOSDK_AUTH_TOKEN, DEEPGRAM_API_KEY, GOOGLE_API_KEY, CARTESIA_API_KEY """ import os from langchain_core.messages import AIMessage, HumanMessage, SystemMessage from langgraph.graph import END, START, MessagesState, StateGraph from langchain_google_genai import ChatGoogleGenerativeAI from pydantic import BaseModel, Field from videosdk.agents import Agent, AgentSession, Pipeline, WorkerJob, JobContext, RoomOptions from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.cartesia import CartesiaTTS from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector, pre_download_model from videosdk.plugins.langchain import LangGraphLLM pre_download_model() # --- Pydantic schemas --- class GatheringInfo(BaseModel): topic: str = Field(default="") audience: str = Field(default="") tone: str = Field(default="") class BlogSection(BaseModel): name: str description: str class BlogSections(BaseModel): sections: list[BlogSection] class BlogState(MessagesState): topic: str audience: str tone: str sections: list[BlogSection] completed_sections: list[str] filename: str blog_done: bool # --- LLMs --- _MODEL = "gemini-2.5-flash" _coordinator_llm = ChatGoogleGenerativeAI(model=_MODEL).with_structured_output(GatheringInfo) _planner_llm = ChatGoogleGenerativeAI(model=_MODEL).with_structured_output(BlogSections) _writer_llm = ChatGoogleGenerativeAI(model=_MODEL, streaming=True) _synth_llm = ChatGoogleGenerativeAI(model=_MODEL, streaming=True) # --- Graph nodes (abbreviated — see full example in repo) --- def coordinator_node(state: BlogState) -> dict: ... def planner_node(state: BlogState) -> dict: ... def write_sections_node(state: BlogState) -> dict: ... def compiler_node(state: BlogState) -> dict: ... def synthesizer_node(state: BlogState) -> dict: ... def route_after_coordinator(state: BlogState) -> str: ... 
builder = StateGraph(BlogState) builder.add_node("coordinator", coordinator_node) builder.add_node("planner", planner_node) builder.add_node("write_sections", write_sections_node) builder.add_node("compiler", compiler_node) builder.add_node("synthesizer_node", synthesizer_node) builder.add_edge(START, "coordinator") builder.add_conditional_edges( "coordinator", route_after_coordinator, {"planner": "planner", "synthesizer_node": "synthesizer_node"}, ) builder.add_edge("planner", "write_sections") builder.add_edge("write_sections", "compiler") builder.add_edge("compiler", "synthesizer_node") builder.add_edge("synthesizer_node", END) blog_graph = builder.compile() # --- Agent --- class BlogWriterAgent(Agent): def __init__(self): super().__init__(instructions="You are Aria, a friendly AI writing assistant.") async def on_enter(self) -> None: await self.session.say( "Hi! I'm Aria. What topic would you like me to write a blog about?" ) async def entrypoint(ctx: JobContext): agent = BlogWriterAgent() langgraph_llm = LangGraphLLM( graph=blog_graph, output_node="synthesizer_node", # only synthesizer text reaches TTS ) pipeline = Pipeline( stt=DeepgramSTT(), llm=langgraph_llm, tts=CartesiaTTS(), vad=SileroVAD(), turn_detector=TurnDetector(), ) session = AgentSession(agent=agent, pipeline=pipeline) await session.start(wait_for_participant=True, run_until_shutdown=True) def make_context() -> JobContext: return JobContext(room_options=RoomOptions(room_id="", name="Blog Writer", playground=True)) if __name__ == "__main__": WorkerJob(entrypoint=entrypoint, jobctx=make_context).start() ``` :::tip output_node Always set `output_node` to your final "speech synthesis" node when using multi-node graphs. Without it, intermediate planner/researcher node text is also forwarded to TTS, producing unexpected spoken output. ::: --- ## Choosing Between LangChainLLM and LangGraphLLM | Scenario | Recommended adapter | | :--- | :--- | | Simple model swap (e.g. 
use Mistral instead of OpenAI) | `LangChainLLM` | | Add web-search / RAG tools without changing agent code | `LangChainLLM` with `tools=[...]` | | Multi-step sequential pipeline (plan → write → compile) | `LangGraphLLM` | | Conditional routing / state machines | `LangGraphLLM` | | Mixture-of-experts or parallel sub-agents | `LangGraphLLM` | ## Additional Resources import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: OpenAI LLM hide_title: false hide_table_of_contents: false description: "Learn how to use OpenAI's LLM models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text-based AI capabilities for your conversational agents." pagination_label: "OpenAI LLM" keywords: - OpenAI - GPT-4o - LLM - Large Language Model - VideoSDK Agents - Python SDK - Text Generation - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: OpenAI slug: openai --- # OpenAI LLM The OpenAI LLM provider enables your agent to use OpenAI's language models (like GPT-4o) for text-based conversations and processing. It also supports vision input capabilities, allowing your agent to analyze and respond to images alongside text with the [supported](https://platform.openai.com/docs/models) models. ## Installation Install the OpenAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-openai" ``` ## Importing ```python from videosdk.plugins.openai import OpenAILLM ``` ## Authentication The OpenAI plugin requires an [OpenAI API key](https://platform.openai.com/api-keys). Set `OPENAI_API_KEY` in your `.env` file. 
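
How the `.env` file reaches the process environment depends on your setup; many projects use the `python-dotenv` package for this step. As a rough sketch of what that step does, assuming a file made of simple `KEY=VALUE` lines, the hypothetical helpers below (`parse_env_text` and `apply_env`, neither of which is part of the SDK) parse the text and copy the values into the environment without overwriting variables that are already set.

```python
import os
from typing import Dict, MutableMapping, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; skips blank lines, `#` comments, and lines without `=`."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Trim whitespace and any surrounding quote characters.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def apply_env(
    values: Dict[str, str],
    environ: Optional[MutableMapping[str, str]] = None,
) -> None:
    """Copy parsed values into the environment, keeping variables that already exist."""
    target = os.environ if environ is None else environ
    for key, value in values.items():
        target.setdefault(key, value)
```

This is deliberately minimal: it does not handle multi-line values, `export` prefixes, or variable interpolation, which a dedicated library would.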
## Example Usage ```python from videosdk.plugins.openai import OpenAILLM from videosdk.agents import Pipeline llm = OpenAILLM( model="gpt-4o", temperature=0.7, top_p=0.95, seed=42, parallel_tool_calls=True, max_completion_tokens=1024, ) pipeline = Pipeline(llm=llm) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options ### Core - `model` — The OpenAI model to use (e.g. `"gpt-4o"`, `"gpt-4o-mini"`). Default: `"gpt-4o-mini"`. - `api_key` — Your OpenAI API key. Falls back to the `OPENAI_API_KEY` environment variable. - `base_url` — Custom base URL for the OpenAI API (optional). - `temperature` — Sampling temperature (0.0 – 2.0). Default: `0.7`. - `tool_choice` — Tool selection mode: `"auto"`, `"required"`, `"none"`, or a dict `{"type": "function", "function": {"name": "my_tool"}}` to force a specific tool. Default: `"auto"`. - `max_completion_tokens` — Maximum tokens in the completion response (optional). ### Generation knobs - `top_p` — Nucleus sampling: only the tokens comprising the top `top_p` probability mass are considered (float, optional). - `frequency_penalty` — Penalises tokens that appear frequently in the response so far; reduces repetition (float, -2.0 – 2.0, optional). - `presence_penalty` — Penalises tokens that have appeared at all in the response so far; encourages new topics (float, -2.0 – 2.0, optional). - `seed` — Integer seed for deterministic sampling. The same seed + same inputs will produce the same output (optional). ### Organisation and project - `organization` — Your OpenAI organisation ID. Falls back to the `OPENAI_ORG_ID` environment variable (optional). - `project` — Your OpenAI project ID. Falls back to the `OPENAI_PROJECT_ID` environment variable (optional). 
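
The numeric ranges given above for `temperature` and the penalty options can be checked before the model is constructed. The following is a minimal sketch, not SDK behaviour: `check_sampling_options` is a hypothetical helper, and the `0.0` to `1.0` range for `top_p` is an assumption, since the list above does not state it.

```python
from typing import Dict, Optional, Tuple

# Ranges taken from the option descriptions above; the top_p range is an assumption.
_RANGES: Dict[str, Tuple[float, float]] = {
    "temperature": (0.0, 2.0),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "top_p": (0.0, 1.0),
}


def check_sampling_options(**options: Optional[float]) -> Dict[str, float]:
    """Return the options that are not None, raising ValueError on unknown or out-of-range ones."""
    checked: Dict[str, float] = {}
    for name, value in options.items():
        if name not in _RANGES:
            raise ValueError(f"unknown sampling option: {name}")
        if value is None:
            continue  # omitted options are left to the provider default
        low, high = _RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        checked[name] = value
    return checked
```

The returned dict can be unpacked into the constructor, for example `OpenAILLM(model="gpt-4o", **check_sampling_options(temperature=0.7, top_p=0.95))`, so an invalid value is caught locally rather than by the API.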
### Tool calling

- `parallel_tool_calls` — When `True`, allows the model to call multiple tools in a single turn. When `False`, forces one tool call at a time. Default: provider default (optional).

## Advanced Example

```python
from videosdk.plugins.openai import OpenAILLM
from videosdk.agents import Pipeline

llm = OpenAILLM(
    model="gpt-4o",
    temperature=0.7,
    top_p=0.95,
    frequency_penalty=0.1,
    presence_penalty=0.1,
    seed=42,
    parallel_tool_calls=True,
    max_completion_tokens=2048,
)

pipeline = Pipeline(llm=llm)
```

## Additional Resources

- **[OpenAI docs](https://platform.openai.com/docs/)**: OpenAI documentation.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Sarvam AI LLM
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Sarvam AI's LLM models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text-based AI capabilities for your conversational agents."
pagination_label: "Sarvam AI LLM"
keywords:
  - Sarvam AI
  - sarvam-m
  - LLM
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text Generation
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 3
sidebar_label: Sarvam AI
slug: sarvam-ai-llm
---

# Sarvam AI LLM

The Sarvam AI LLM provider enables your agent to use Sarvam AI's language models for text-based conversations and processing.

## Installation

Install the Sarvam AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-sarvamai"
```

## Importing

```python
from videosdk.plugins.sarvamai import SarvamAILLM
```

:::note
When you use Sarvam AI as the LLM, function tool calls and MCP tools will not work.
:::

## Authentication

The Sarvam plugin requires a [Sarvam API key](https://dashboard.sarvam.ai/key-management).

Set `SARVAMAI_API_KEY` in your `.env` file.
## Example Usage

```python
from videosdk.plugins.sarvamai import SarvamAILLM
from videosdk.agents import Pipeline

# Initialize the Sarvam AI LLM model
llm = SarvamAILLM(
    model="sarvam-105b",
    # When SARVAMAI_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-sarvam-ai-api-key",
    temperature=0.7,
    tool_choice="auto",
    max_completion_tokens=1000,
    reasoning_effort="medium",  # Optional: "low", "medium", "high"
    wiki_grounding=False,  # Optional: enable Wikipedia-grounded responses
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

# Add llm to pipeline
pipeline = Pipeline(llm=llm)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `model`: (str) The Sarvam AI model to use (default: `"sarvam-m"`).
- `api_key`: (str) Your Sarvam AI API key. Can also be set via the `SARVAMAI_API_KEY` environment variable.
- `temperature`: (float) Sampling temperature for response randomness (default: `0.7`).
- `tool_choice`: (ToolChoice) Tool selection mode (default: `"auto"`).
- `max_completion_tokens`: (int) Maximum number of tokens in the completion response (optional).
- `reasoning_effort`: (str) Controls reasoning depth for the model. Allowed values: `"low"`, `"medium"`, `"high"` (default: `None`).
- `wiki_grounding`: (bool) Enables Wikipedia search to ground responses with factual information (default: `False`).
- `top_p`: (float) Nucleus sampling, an alternative to sampling with temperature (default: `None`).
- `frequency_penalty`: (float) Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far.
- `presence_penalty`: (float) Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far.
- `stop`: Up to 4 sequences where the API will stop generating further tokens.
The returned text will not contain the stop sequence.

## Additional Resources

The following resources provide more information about using Sarvam AI with VideoSDK Agents SDK.

- **[Sarvam docs](https://docs.sarvam.ai/api-reference-docs/chat/chat-completions)**: Sarvam's full docs site.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Namo Turn Detector
hide_title: false
hide_table_of_contents: false
description: "Learn how to use NamoTurnDetectorV1 model with the VideoSDK AI Agent SDK. This guide covers model configuration."
pagination_label: "Turn Detector"
keywords:
  - Turn Detection
  - Namo Turn Detector
  - Large Language Model
  - VideoSDK Agents
  - Multilingual
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Namo Turn Detector
slug: namo-turn-detector
---

# Namo Turn Detector

The Namo Turn Detector v1 utilizes a custom fine-tuned model from VideoSDK to accurately determine whether a user has finished speaking. This allows for precise management of conversation flow, especially in cascade setups. It can operate as a multilingual model or be configured for a specific language for optimized performance.

## Installation

Install the Turn Detector-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-turn-detector"
```

## Importing

```python
from videosdk.plugins.turn_detector import NamoTurnDetectorV1
```

## Example Usage

**1. For a specific language (e.g., English):**

```python
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model
from videosdk.agents import Pipeline

# Pre-download the English model to avoid delays
pre_download_namo_turn_v1_model(language="en")

# Initialize the Turn Detector for English
turn_detector = NamoTurnDetectorV1(
    language="en",
    threshold=0.7
)

# Add the Turn Detector to a pipeline
pipeline = Pipeline(turn_detector=turn_detector)
```

**2.
For multilingual support:**

If you don't specify a language, the detector will default to the multilingual model, which can handle various languages.

```python
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model
from videosdk.agents import Pipeline

# Pre-download the multilingual model
pre_download_namo_turn_v1_model()

# Initialize the multilingual Turn Detector
turn_detector = NamoTurnDetectorV1(
    threshold=0.7
)

# Add the Turn Detector to a cascade pipeline
pipeline = Pipeline(turn_detector=turn_detector)
```

## Configuration Options

- `language`: (str, optional) Specifies the language for the turn detection model. If left as `None` (the default), it loads a multilingual model capable of handling all supported languages.
- `threshold`: (float) Confidence threshold for turn completion detection (0.0 to 1.0, default: `0.7`).

## Supported Languages

The `NamoTurnDetectorV1` supports a wide range of languages when you specify the corresponding language code. If no language is specified, the multilingual model will be used.
Here is a list of the supported languages and their codes:

| Language | Code |
| :--- | :--- |
| Arabic | `ar` |
| Bengali | `bn` |
| Chinese | `zh` |
| Danish | `da` |
| Dutch | `nl` |
| English | `en` |
| Finnish | `fi` |
| French | `fr` |
| German | `de` |
| Hindi | `hi` |
| Indonesian | `id` |
| Italian | `it` |
| Japanese | `ja` |
| Korean | `ko` |
| Marathi | `mr` |
| Norwegian | `no` |
| Polish | `pl` |
| Portuguese | `pt` |
| Russian | `ru` |
| Spanish | `es` |
| Turkish | `tr` |
| Ukrainian | `uk` |
| Vietnamese | `vi` |

## Pre-downloading Model

To avoid delays during agent initialization, you can pre-download the Hugging Face model ahead of time.

To pre-download a specific language model:

```python
from videosdk.plugins.turn_detector import pre_download_namo_turn_v1_model

# Download the English model before the agent runs
pre_download_namo_turn_v1_model(language="en")
```

Or pre-download the multilingual model:

```python
from videosdk.plugins.turn_detector import pre_download_namo_turn_v1_model

# Download the multilingual model
pre_download_namo_turn_v1_model()
```

## Additional Resources

The following resources provide more information about the VideoSDK Turn Detector plugin for the AI Agents SDK.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: AWS Nova Sonic
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Amazon's Nova Sonic model with the VideoSDK AI Agent SDK. This guide covers model configuration, streaming audio, and integration with your agent pipeline."
pagination_label: "Amazon Nova Sonic"
keywords:
  - Amazon's Nova Sonic
  - AWS Nova Sonic
  - AWS Model
  - Amazon Nova Sonic
  - NovaSonicRealtime
  - NovaSonicLiveConfig
  - Real-time AI
  - VideoSDK Agents
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 2
sidebar_label: AWS Nova Sonic
slug: aws-nova-sonic
---

# AWS Nova Sonic

The AWS Nova Sonic provider enables your agent to use Amazon's Nova Sonic model for real-time, speech-to-speech AI interactions.

### Prerequisites

Before you start using AWS Nova Sonic with the VideoSDK AI Agent, ensure the following:

- `AWS Account`: You have an active AWS account with permissions to access Amazon Bedrock.
- `Model Access`: You've requested and obtained access to the Amazon Nova models (Nova Lite and Nova Canvas) via the Amazon Bedrock console.
- `Region Selection`: You're operating in the US East (N. Virginia) (us-east-1) region, as model access is region-specific.
- `AWS Credentials`: Your AWS credentials (`aws_access_key_id` and `aws_secret_access_key`) are configured, either through environment variables or your preferred credential management method.

## Installation

Install the AWS-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-aws"
```

## Authentication

The Amazon Nova Sonic plugin requires [AWS credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html).
Set the following environment variables in your `.env` file:

```shell
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_DEFAULT_REGION=
```

## Importing

```python
from videosdk.plugins.aws import NovaSonicRealtime, NovaSonicConfig
```

## Example Usage

```python
from videosdk.plugins.aws import NovaSonicRealtime, NovaSonicConfig
from videosdk.agents import Pipeline

# Initialize the Nova Sonic real-time model
model = NovaSonicRealtime(
    model="amazon.nova-sonic-v1:0",
    # When AWS credentials and region are set in .env - DON'T pass credential parameters
    region="us-east-1",  # Currently, only "us-east-1" is supported for Amazon Nova Sonic.
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
    config=NovaSonicConfig(
        voice="tiffany",  # "tiffany", "matthew", "amy"
        temperature=0.7,
        top_p=0.9,
        max_tokens=1024
    )
)

# Create the pipeline with the model
pipeline = Pipeline(llm=model)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `aws_access_key_id`, `aws_secret_access_key`, and other credential parameters from your code.
:::

:::note
To initiate a conversation with Amazon Nova Sonic, the user must speak first. The model listens for user input to begin the interaction.
:::

## See it in Action

Explore a complete, end-to-end implementation of an agent using this provider in our [AI Agent Quickstart Guide](https://docs.videosdk.live/ai_agents/voice-agent-quick-start).

## Configuration Options

- `model`: The Amazon Nova Sonic model to use (e.g., `"amazon.nova-sonic-v1:0"`).
- `region`: AWS region where the model is hosted (e.g., `"us-east-1"`).
- `aws_access_key_id`: Your AWS access key ID.
- `aws_secret_access_key`: Your AWS secret access key.
- `config`: A `NovaSonicConfig` object for advanced options:
  - `voice`: (str or None) The voice to use for audio output (e.g., `"matthew"`, `"tiffany"`, `"amy"`).
  - `temperature`: (float or None) Sampling temperature for response randomness.
  - `top_p`: (float or None) Nucleus sampling probability.
  - `max_tokens`: (int or None) Maximum number of tokens in the output.

## Additional Resources

The following resources provide more information about using AWS Nova Sonic with VideoSDK Agents SDK.

- **[Plugin quickstart](https://github.com/videosdk-live/agents-quickstart/blob/main/Realtime%20Pipeline/AWS%20Nova%20Sonic/aws_novasonic_agent_quickstart.py)**: Quickstart for the AWS Nova Sonic API plugin.
- **[AWS Nova Sonic docs](https://docs.aws.amazon.com/nova/latest/userguide/speech.html)**: AWS Nova Sonic documentation.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Azure Voice Live API
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Azure's Voice Live API with the VideoSDK AI Agent SDK. This guide covers model configuration, real-time speech interactions, and integration with your agent pipeline."
pagination_label: "Azure Voice Live"
keywords:
  - Azure
  - Voice Live API
  - AzureVoiceLive
  - AzureVoiceLiveConfig
  - Real-time AI
  - VideoSDK Agents
  - Python SDK
  - Speech-to-Speech
  - GPT-4o
  - Microsoft
  - Azure Speech Services
  - Azure AI Speech
  - Azure AI Foundry
image: img/videosdklive-thumbnail.jpg
sidebar_position: 3
sidebar_label: Azure Voice Live
slug: azure-voice-live
---

# Azure Voice Live API (Beta)

The Azure Voice Live API provider enables your agent to use Microsoft's comprehensive speech-to-speech solution for low-latency, high-quality voice interactions. This unified API eliminates the need to manually orchestrate multiple components by integrating speech recognition, generative AI, and text-to-speech into a single interface.

:::note Preview Feature
This feature is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads.
Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/voice-live).
:::

## Installation

Install the Azure-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-azure"
```

## Authentication

The Azure Voice Live plugin requires an Azure AI Services resource with a Cognitive Services endpoint.

**Setup Steps:**

1. Create an AI Services resource for Speech in the [Azure portal](https://portal.azure.com) or from [Azure AI Foundry](https://ai.azure.com/foundryProject/overview).
2. Get the AI Services resource endpoint and primary key. After your resource is deployed, select "Go to resource" to view and manage keys.

Set `AZURE_VOICE_LIVE_ENDPOINT` and `AZURE_VOICE_LIVE_API_KEY` in your `.env` file:

```bash
AZURE_VOICE_LIVE_ENDPOINT=your-azure-ai-service-endpoint
AZURE_VOICE_LIVE_API_KEY=your-azure-ai-service-primary-key
```

## Importing

```python
from videosdk.plugins.azure import AzureVoiceLive, AzureVoiceLiveConfig
from videosdk.agents import Pipeline
```

## Example Usage

```python
from videosdk.plugins.azure import AzureVoiceLive, AzureVoiceLiveConfig
from videosdk.agents import Pipeline

# Configure the Voice Live API settings
config = AzureVoiceLiveConfig(
    voice="en-US-EmmaNeural",  # Azure neural voice
    temperature=0.7,
    turn_detection_timeout=1000,
    enable_interruption=True
)

# Initialize the Azure Voice Live model
model = AzureVoiceLive(
    # When environment variables are set in .env - DON'T pass credentials
    # api_key="your-azure-speech-key",
    model="gpt-4o-realtime-preview",
    config=config
)

# Create the pipeline with the model
pipeline = Pipeline(llm=model)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects.
The SDK automatically reads environment variables, so omit `api_key`, `speech_region`, and other credential parameters from your code.
:::

:::note
To initiate a conversation with Azure Voice Live, the user must speak first. The model listens for user input to begin the interaction.
:::

## Configuration Options

- `model`: The Voice Live model to use (e.g., `"gpt-4o-realtime-preview"`, `"gpt-4o-mini-realtime-preview"`).
- `api_key`: Your Azure Speech API key (can also be set via environment variable).
- `speech_region`: Your Azure Speech region (can also be set via environment variable).
- `credential`: Azure DefaultAzureCredential for authentication (alternative to API key).
- `config`: An `AzureVoiceLiveConfig` object for advanced options:
  - `voice`: (str) The Azure neural voice to use (e.g., `"en-US-EmmaNeural"`, `"hi-IN-AnanyaNeural"`).
  - `temperature`: (float) Sampling temperature for response randomness (default: 0.7).
  - `turn_detection_timeout`: (int) Timeout for turn detection in milliseconds.
  - `enable_interruption`: (bool) Allow users to interrupt the agent during speech.
  - `noise_suppression`: (bool) Enable noise suppression for clearer audio.
  - `echo_cancellation`: (bool) Enable echo cancellation.
  - `phrase_list`: (List[str]) Custom phrases for improved recognition accuracy.

## See it in Action

Explore a complete, end-to-end implementation of an agent using this provider in our [AI Agent Quickstart Guide](https://docs.videosdk.live/ai_agents/voice-agent-quick-start).

## Additional Resources

The following resources provide more information about using Azure Voice Live with VideoSDK Agents SDK.

- **[Azure Voice Live API Documentation](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/voice-live)**: Complete Azure Voice Live API documentation.
- **[Azure Speech Service Overview](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/overview)**: Overview of Azure Speech services.
import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Google Gemini (LiveAPI)
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Google's Gemini models with the VideoSDK AI Agent SDK. This guide covers model configuration, streaming audio, and integration with your agent pipeline."
pagination_label: "Google Gemini"
keywords:
  - Google Gemini
  - GeminiRealtime
  - GeminiLiveConfig
  - Real-time AI
  - VideoSDK Agents
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 2
sidebar_label: Google Gemini (LiveAPI)
slug: google-live-api
---

# Google Gemini (LiveAPI)

The Google Gemini (Live API) provider allows your agent to leverage Google's Gemini models for real-time, multimodal AI interactions.

## Installation

Install the Gemini-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-google"
```

## Authentication

The Google plugin requires a [Gemini API key](https://aistudio.google.com/apikey).

Set `GOOGLE_API_KEY` in your `.env` file.

## Importing

```python
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
```

## Example Usage

```python
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from videosdk.agents import Pipeline

# Initialize the Gemini real-time model
model = GeminiRealtime(
    model="gemini-3.1-flash-live-preview",
    # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-google-api-key",
    config=GeminiLiveConfig(
        voice="Leda",  # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr.
        response_modalities=["AUDIO"]
    )
)

# Create the pipeline with the model
pipeline = Pipeline(llm=model)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key`, `videosdk_auth`, and other credential parameters from your code.
:::

## Vertex AI Integration

You can also use Google's Gemini models through Vertex AI. This requires a different authentication and configuration setup.

### Authentication for Vertex AI

For Vertex AI, you need to set up Google Cloud credentials. Create a service account, download the JSON key file, and set the path to this file in your environment.

```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/keyfile.json"
```

You should also configure your project ID and location. These can be set as environment variables or directly in the code. If not set, the `project_id` is inferred from the credentials file and the `location` defaults to `us-central1`.

```bash
export GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
export GOOGLE_CLOUD_LOCATION="your-gcp-location"
```

### Example Usage with Vertex AI

To use Vertex AI, set `vertexai=True` when initializing `GeminiRealtime`. You can configure the project and location using `VertexAIConfig`, which will take precedence over environment variables.

```python
from videosdk.plugins.google import GeminiRealtime, VertexAIConfig
from videosdk.agents import Pipeline

# Initialize the Gemini real-time model with Vertex AI configuration
model = GeminiRealtime(
    model="gemini-live-2.5-flash-native-audio",
    vertexai=True,
    vertexai_config=VertexAIConfig(
        project_id="videosdk",
        location="us-central1"
    )
)

# Create the pipeline with the model
pipeline = Pipeline(llm=model)
```

## Vision Support

Google Gemini Live can also accept a video stream directly from the VideoSDK room. To enable this, turn on your camera and set the `vision` flag to `True` in the room options. Once that's done, start your agent as usual; no additional changes are required in the pipeline.
```python
from videosdk.agents import AgentSession, JobContext, Pipeline, RoomOptions

# `model` and `my_agent` are assumed to be created as in the examples above
pipeline = Pipeline(llm=model)

session = AgentSession(
    agent=my_agent,
    pipeline=pipeline,
)

job_context = JobContext(
    room_options=RoomOptions(
        room_id="YOUR_ROOM_ID",
        name="Agent",
        vision=True
    )
)
```

- `vision`: (bool, room options) When `True`, forwards the video stream from the VideoSDK room to Gemini's Live API (defaults to `False`).

## See it in Action

Explore a complete, end-to-end implementation of an agent using this provider in our [AI Agent Quickstart Guide](https://docs.videosdk.live/ai_agents/voice-agent-quick-start).

## Configuration Options

- `model`: The Gemini model to use (e.g., `"gemini-3.1-flash-live-preview"`). Other supported models include: `"gemini-2.5-flash-preview-native-audio-dialog"` and `"gemini-2.5-flash-exp-native-audio-thinking-dialog"`.
- `api_key`: Your Google API key (can also be set via environment variable).
- `config`: A `GeminiLiveConfig` object for advanced options:
  - `voice`: (str or None) The voice to use for audio output (e.g., `"Puck"`).
  - `language_code`: (str or None) The language code for the conversation (e.g., `"en-US"`).
  - `temperature`: (float or None) Sampling temperature for response randomness.
  - `top_p`: (float or None) Nucleus sampling probability.
  - `top_k`: (float or None) Top-k sampling for response diversity.
  - `candidate_count`: (int or None) Number of candidate responses to generate.
  - `max_output_tokens`: (int or None) Maximum number of tokens in the output.
  - `presence_penalty`: (float or None) Penalty applied to tokens that have already appeared, which encourages new topics.
  - `frequency_penalty`: (float or None) Penalty for repeating tokens.
  - `response_modalities`: (List[str] or None) List of enabled output modalities (e.g., `["TEXT"]` or `["AUDIO"]`, one at a time).
  - `output_audio_transcription`: (`AudioTranscriptionConfig` or None) Configuration for audio output transcription.

## Additional Resources

The following resources provide more information about using Google with VideoSDK Agents SDK.
- **[Plugin quickstart]()**: Quickstart for the Gemini Realtime API plugin.
- **[Gemini docs](https://ai.google.dev/gemini-api/docs/live)**: Gemini Live API documentation.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: OpenAI
hide_title: false
hide_table_of_contents: false
description: "Learn how to use OpenAI's real-time models with the VideoSDK AI Agent SDK. This guide covers model configuration, streaming audio, and integration with your agent pipeline."
pagination_label: "OpenAI"
keywords:
  - OpenAI
  - GPT-4o
  - Real-time AI
  - VideoSDK Agents
  - Python SDK
  - OpenAIRealtime
  - OpenAIRealtimeConfig
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: OpenAI
slug: openai
---

# OpenAI

The OpenAI provider enables your agent to use OpenAI's real-time models (like GPT-4o) for text and audio interactions.

## Installation

Install the OpenAI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-openai"
```

## Authentication

The OpenAI plugin requires an [OpenAI API key](https://platform.openai.com/api-keys).

Set `OPENAI_API_KEY` in your `.env` file.
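
A small startup check can catch a missing key before the agent tries to connect. The sketch below is illustrative only: `require_env` is a hypothetical helper, not part of the VideoSDK or OpenAI packages, and it assumes the values from your `.env` file have already been loaded into the process environment.

```python
import os


def require_env(*names, environ=None):
    """Return the requested variables, or raise if any are unset or blank.

    Hypothetical helper for illustration; not part of the VideoSDK SDK.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Calling `require_env("OPENAI_API_KEY")` at the top of your entry point fails fast with a readable message instead of surfacing an authentication error mid-session. The optional `environ` argument exists only so the check can be exercised with a plain dictionary.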
## Importing

```python
from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig
```

## Example Usage

```python
from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig
from videosdk.agents import Pipeline
from openai.types.beta.realtime.session import TurnDetection

# Initialize the OpenAI real-time model
model = OpenAIRealtime(
    model="gpt-realtime-2025-08-28",
    # When OPENAI_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-openai-api-key",
    config=OpenAIRealtimeConfig(
        voice="alloy",  # alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, and verse
        modalities=["text", "audio"],
        turn_detection=TurnDetection(
            type="server_vad",
            threshold=0.5,
            prefix_padding_ms=300,
            silence_duration_ms=200,
        ),
        tool_choice="auto"
    )
)

# Create the pipeline with the model
pipeline = Pipeline(llm=model)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key`, `videosdk_auth`, and other credential parameters from your code.
:::

## See it in Action

Explore a complete, end-to-end implementation of an agent using this provider in our [AI Agent Quickstart Guide](https://docs.videosdk.live/ai_agents/voice-agent-quick-start).

## Configuration Options

- `model`: The OpenAI model to use (e.g., `"gpt-realtime-2025-08-28"`).
- `api_key`: Your OpenAI API key (can also be set via environment variable).
- `config`: An `OpenAIRealtimeConfig` object for advanced options:
  - `voice`: (str) The voice to use for audio output (e.g., `"alloy"`).
  - `temperature`: (float) Sampling temperature for response randomness.
  - `turn_detection`: (`TurnDetection` or None) Configure how the agent detects when a user has finished speaking.
  - `input_audio_transcription`: (`InputAudioTranscription` or None) Configure audio-to-text (e.g., Whisper).
  - `tool_choice`: (str or None) Tool selection mode (e.g., `"auto"`).
  - `modalities`: (list[str]) List of enabled modalities (e.g., `["text", "audio"]`).

## Additional Resources

The following resources provide more information about using OpenAI with VideoSDK Agents SDK.

- **[Plugin quickstart](https://github.com/videosdk-live/agents-quickstart/tree/main/Realtime%20Pipeline/OpenAI)**: Quickstart for the OpenAI Realtime API plugin.
- **[OpenAI docs](https://platform.openai.com/docs/guides/realtime)**: OpenAI Realtime API documentation.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Ultravox
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Ultravox's real-time AI models with the VideoSDK AI Agent SDK. This guide covers model configuration, function calling, MCP integration, and connecting to your agent pipeline."
pagination_label: "Ultravox"
keywords:
  - Ultravox
  - Real-time AI
  - VideoSDK Agents
  - Python SDK
  - UltravoxRealtime
  - UltravoxLiveConfig
image: img/videosdklive-thumbnail.jpg
sidebar_position: 5
sidebar_label: Ultravox
slug: ultravox
---

# Ultravox

The Ultravox provider enables your agent to use Ultravox's models for real-time, conversational AI interactions.

## Installation

Install the Ultravox-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-ultravox"
```

## Authentication

The Ultravox plugin requires an [Ultravox API key](https://app.ultravox.ai/).

Set `ULTRAVOX_API_KEY` in your `.env` file.
## Importing

```python
from videosdk.plugins.ultravox import UltravoxRealtime, UltravoxLiveConfig
```

## Example Usage

```python
from videosdk.plugins.ultravox import UltravoxRealtime, UltravoxLiveConfig
from videosdk.agents import Pipeline

# Initialize the Ultravox real-time model
model = UltravoxRealtime(
    model="fixie-ai/ultravox",
    # When ULTRAVOX_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-ultravox-api-key",
    config=UltravoxLiveConfig(
        voice="54ebeae1-88df-4d66-af13-6c41283b4332"
    )
)

# Create the pipeline with the model
pipeline = Pipeline(llm=model)
```

:::note
When using a `.env` file for credentials, you do not need to pass the `api_key` as an argument to the model instance; the SDK reads it automatically.
:::

## Key Features

- **Real-time Interactions**: Utilize Ultravox's powerful models for low-latency voice conversations.
- **Function Calling**: Empower your agent to perform actions like retrieving weather data or calling external APIs.
- **Custom Agent Behaviors**: Define a unique personality and interaction style for your agent through system prompts.
- **Call Control**: Agents can manage the conversation flow and gracefully terminate calls.
- **MCP Integration**: Connect to external tools and data sources using the Model Context Protocol (MCP) via `MCPServerStdio` for local processes or `MCPServerHTTP` for remote services.

## Configuration Options

- `model`: The Ultravox model to use (e.g., `"fixie-ai/ultravox"`).
- `api_key`: Your Ultravox API key (can also be set via the `ULTRAVOX_API_KEY` environment variable).
- `config`: An `UltravoxLiveConfig` object for advanced options:
  - `voice`: (str) The Voice ID for the synthesized speech.
  - `language_hint`: (str) A hint for the conversation's language (e.g., `"en"`).
  - `temperature`: (float) Controls the randomness of responses (0.0 to 1.0).
  - `vad_turn_endpoint_delay`: (int) Delay in milliseconds for voice activity detection to determine the end of a turn.
  - `vad_minimum_turn_duration`: (int) The minimum duration in milliseconds for a valid speech turn.

## Additional Resources

The following resources provide more information about using Ultravox with the VideoSDK Agents SDK.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: xAI (Grok)
hide_title: false
hide_table_of_contents: false
description: "Learn how to use xAI's Grok models with the VideoSDK AI Agent SDK. This guide covers model configuration, real-time speech interactions, and integration with your agent pipeline."
pagination_label: "xAI (Grok)"
keywords:
  - xAI
  - Grok
  - Real-time AI
  - VideoSDK Agents
  - Python SDK
  - XAIRealtime
  - XAIRealtimeConfig
image: img/videosdklive-thumbnail.jpg
sidebar_position: 4
sidebar_label: xAI (Grok)
slug: xai-grok
---

# xAI (Grok)

The xAI (Grok) provider enables your agent to use xAI's powerful Grok models for real-time, multimodal AI interactions.

## Installation

Install the xAI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-xai"
```

## Authentication

The xAI plugin requires an [xAI API key](https://console.x.ai).

Set `XAI_API_KEY` in your `.env` file.

## Importing

```python
from videosdk.plugins.xai import XAIRealtime, XAIRealtimeConfig
```

## Example Usage

```python
from videosdk.plugins.xai import XAIRealtime, XAIRealtimeConfig
from videosdk.agents import Pipeline

# Initialize the xAI Grok real-time model
model = XAIRealtime(
    model="grok-4-1-fast-non-reasoning",
    # When XAI_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-xai-api-key",
    config=XAIRealtimeConfig(
        voice="Eve",
        # collection_id="your-collection-id"  # Optional
    )
)

# Create the pipeline with the model
pipeline = Pipeline(llm=model)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit the `api_key` parameter from your code.
:::

## Key Features

- **Multi-modal Interactions**: Utilize xAI's powerful Grok models for voice and text.
- **Function Calling**: Define custom tools to retrieve weather data, interact with external APIs, or perform other actions.
- **Web Search**: Enable real-time web search capabilities by setting `enable_web_search=True`.
- **X Search**: Access X (formerly Twitter) content by setting `enable_x_search=True` and providing `allowed_x_handles`.

## Configuration Options

- `model`: The Grok model to use (e.g., `"grok-4-1-fast-non-reasoning"`).
- `api_key`: Your xAI API key (can also be set via the `XAI_API_KEY` environment variable).
- `config`: An `XAIRealtimeConfig` object for advanced options:
  - `voice`: (str) The voice to use for audio output (e.g., `"Eve"`, `"Ara"`, `"Rex"`, `"Sal"`, `"Leo"`).
  - `enable_web_search`: (bool) Enable or disable web search capabilities.
  - `enable_x_search`: (bool) Enable or disable search on X (Twitter).
  - `allowed_x_handles`: (List[str]) A list of allowed X handles to search within.
  - `collection_id`: (str, optional) The ID of a custom collection from your xAI Console storage to provide additional context.
  - `turn_detection`: Configuration for detecting when a user has finished speaking.

## Collection Storage

xAI Grok supports using "collections" to provide additional context to your agent, grounding its responses in your own documents or data.

To use a collection:

1. **Navigate to xAI Console**: Go to your [console.x.ai](https://console.x.ai) dashboard.
2. **Access Storage**: Click on the **Storage** section in the sidebar.
3. **Create New Collection**: Click the "Create New Collection" button.
4. **Upload Files**: Upload your relevant documents or data files to the new collection.
5. **Get Collection ID**: Once the collection is created, copy its **Collection ID**.
6.
**Use in Config**: Pass the copied ID to your agent's configuration:

```python
from videosdk.plugins.xai import XAIRealtimeConfig

config = XAIRealtimeConfig(
    voice="Eve",
    collection_id="your-collection-id-from-console",
    # ... other config options
)
```

The agent will now use the content of this collection to inform its responses.

## Additional Resources

The following resources provide more information about using xAI (Grok) with the VideoSDK Agents SDK.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Silero VAD
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Silero's VAD with the VideoSDK AI Agent SDK. This guide covers model configuration and related events."
pagination_label: "Silero VAD"
keywords:
  - Silero
  - VAD
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Silero VAD
slug: silero-vad
---

# Silero VAD

The Silero VAD (Voice Activity Detection) provider enables your agent to detect when users start and stop speaking. When added to a pipeline, it automatically enables interrupt functionality, allowing users to interrupt the agent mid-response.
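
As a rough picture of what start and stop detection involves, the sketch below turns a stream of per-frame speech probabilities into `start` and `end` events. It is not how the Silero plugin is implemented: `detect_speech_events` is a hypothetical function, the 50 ms frame size is an assumption, and the parameters only loosely mirror the `threshold`, `min_speech_duration`, and `min_silence_duration` options on this page (expressed here in milliseconds).

```python
def detect_speech_events(probs, frame_ms=50, threshold=0.3,
                         min_speech_ms=100, min_silence_ms=750):
    """Turn per-frame speech probabilities into ("start" | "end", time_ms) events.

    Illustrative only; this is not the Silero plugin's implementation.
    """
    events = []
    speaking = False
    pending_ms = 0  # how long the input has disagreed with the current state
    for index, prob in enumerate(probs):
        frame_is_speech = prob >= threshold
        if frame_is_speech == speaking:
            pending_ms = 0  # input agrees with the state, so reset the counter
            continue
        pending_ms += frame_ms
        required_ms = min_silence_ms if speaking else min_speech_ms
        if pending_ms >= required_ms:
            # The disagreement lasted long enough: flip state and record it.
            speaking = not speaking
            events.append(("start" if speaking else "end", (index + 1) * frame_ms))
            pending_ms = 0
    return events
```

Requiring a minimum run of speech before emitting `start` filters out short noises, while the longer silence requirement keeps brief pauses from ending a turn. In a voice agent, a `start` event that arrives while the agent is talking is the kind of signal that would trigger an interruption.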
## Installation Install the Silero VAD-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-silero" ``` ## Importing ```python from videosdk.plugins.silero import SileroVAD ``` ## Example Usage ```python from videosdk.plugins.silero import SileroVAD from videosdk.agents import Pipeline # Initialize the Silero VAD vad = SileroVAD( input_sample_rate=48000, model_sample_rate=16000, threshold=0.3, min_speech_duration=0.1, min_silence_duration=0.75, prefix_padding_duration=0.3 ) # Add VAD to pipeline - automatically enables interrupts pipeline = Pipeline(vad=vad) ``` ## Configuration Options - `input_sample_rate`: (int) Sample rate of input audio in Hz (default: `48000`) - `model_sample_rate`: (Literal[8000, 16000]) Model's expected sample rate (default: `16000`) - `threshold`: (float) Voice activity detection sensitivity (0.0 to 1.0, default: `0.3`) - `min_speech_duration`: (float) Minimum speech duration to trigger detection in seconds (default: `0.1`) - `min_silence_duration`: (float) Minimum silence duration to end speech detection in seconds (default: `0.75`) - `max_buffered_speech`: (float) Maximum speech buffer duration in seconds (default: `60.0`) - `force_cpu`: (bool) Force CPU usage instead of GPU acceleration (default: `True`) - `prefix_padding_duration`: (float) Audio padding before speech detection in seconds (default: `0.3`) ## Additional Resources The following resources provide more information about using Silero VAD with VideoSDK Agents SDK. - **[Silero VAD project](https://github.com/snakers4/silero-vad)**: The open source VAD model that powers the VideoSDK Silero VAD plugin. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: AssemblyAI STT hide_title: false hide_table_of_contents: false description: "Learn how to use AssemblyAI's real-time speech-to-text models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing streaming transcription." 
pagination_label: "AssemblyAI STT" keywords: - AssemblyAI - real-time transcription - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: AssemblyAI slug: assemblyai --- # AssemblyAI STT The AssemblyAI STT provider enables your agent to use AssemblyAI's real-time WebSocket API for fast and accurate speech-to-text conversion. ## Installation Install the AssemblyAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-assemblyai" ``` ## Authentication The AssemblyAI plugin requires an [AssemblyAI API key](https://www.assemblyai.com/dashboard/docs/your-api-key). Set `ASSEMBLYAI_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.assemblyai import AssemblyAISTT ``` ## Example Usage ```python from videosdk.plugins.assemblyai import AssemblyAISTT from videosdk.agents import Pipeline # Initialize the AssemblyAI STT model stt = AssemblyAISTT( api_key="your-assemblyai-api-key", language_code="en_us" ) # Add stt to pipeline pipeline = Pipeline(stt=stt) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your AssemblyAI API key (required, can also be set via `ASSEMBLYAI_API_KEY` environment variable). - `language_code`: The language code for transcription (e.g., `"en_us"`, `"es"`). ## Additional Resources The following resources provide more information about using AssemblyAI with the VideoSDK Agents SDK. - **[AssemblyAI Docs](https://www.assemblyai.com/docs/guides/speech-to-text/real-time-streaming-transcription)**: AssemblyAI's official real-time streaming transcription documentation. 
```
import PluginResourceCards from '@site/src/components/PluginResourceCards'
```

---

---
title: Azure OpenAI STT
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Azure OpenAI's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for Azure OpenAI's services"
pagination_label: "Azure OpenAI STT"
keywords:
  - OpenAI
  - Azure
  - Azure OpenAI
  - gpt-4o-mini-transcribe
  - whisper-1
  - STT
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Speech To Text
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 2
sidebar_label: Azure OpenAI
slug: azureopenai
---

# Azure OpenAI STT

The Azure OpenAI STT provider enables your agent to use Azure OpenAI's speech-to-text models (like Whisper) for converting audio input to text.

## Installation

Install the Azure OpenAI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-openai"
```

## Authentication

The Azure OpenAI plugin requires an [Azure OpenAI API key](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/create-resource?pivots=web-portal).

Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, and `OPENAI_API_VERSION` in your `.env` file.

## Importing

```python
from videosdk.plugins.openai import OpenAISTT
```

## Example Usage

```python
from videosdk.plugins.openai import OpenAISTT
from videosdk.agents import Pipeline

# Initialize the Azure OpenAI STT model
stt = OpenAISTT.azure(
    azure_deployment="gpt-4o-transcribe",
    language="en",
)

# Add stt to pipeline
pipeline = Pipeline(stt=stt)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key`, `videosdk_auth`, and other credential parameters from your code.
:::

## Configuration Options

- `azure_deployment`: The Azure OpenAI deployment ID to use. By default this is the model name (e.g., `"gpt-4o-mini-transcribe"`, `"gpt-4o-transcribe"`).
- `api_key`: Your Azure OpenAI API key (can also be set via the `AZURE_OPENAI_API_KEY` environment variable).
- `azure_endpoint`: Your Azure OpenAI deployment endpoint URL (can also be set via the `AZURE_OPENAI_ENDPOINT` environment variable).
- `api_version`: Your Azure OpenAI API version (can also be set via the `OPENAI_API_VERSION` environment variable).
- `language`: (str) Language code for transcription (default: `"en"`).

## Additional Resources

The following resources provide more information about using Azure OpenAI with the VideoSDK Agents SDK.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Azure STT
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Azure's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for Azure's services"
pagination_label: "Azure STT"
keywords:
  - Azure
  - Speech-to-Text
  - STT
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Speech To Text
  - AI Chat
  - Conversational AI
  - Microsoft
  - Azure Speech Services
  - Azure AI Speech
  - Azure AI Foundry
image: img/videosdklive-thumbnail.jpg
sidebar_position: 3
sidebar_label: Azure AI Speech
slug: azure-ai-stt
---

# Azure STT

The Azure STT provider enables your agent to use Microsoft Azure's advanced speech-to-text models for high-accuracy, real-time audio transcription with support for multiple languages and custom phrase lists.

## Installation

Install the Azure-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-azure"
```

## Importing

```python
from videosdk.plugins.azure import AzureSTT
```

## Authentication

The Azure STT plugin requires an Azure AI Speech Service resource.

**Setup Steps:**

1.
Create an AI Services resource for Speech in the [Azure portal](https://portal.azure.com) or from [Azure AI Foundry](https://ai.azure.com/foundryProject/overview) 2. Get the Speech resource key and region. After your Speech resource is deployed, select "Go to resource" to view and manage keys Set `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` in your `.env` file: ```bash AZURE_SPEECH_KEY=your-azure-speech-key AZURE_SPEECH_REGION=your-azure-region ``` ## Example Usage ```python from videosdk.plugins.azure import AzureSTT from videosdk.agents import Pipeline # Initialize the Azure STT model stt = AzureSTT( language="en-US", sample_rate=16000, enable_phrase_list=True, phrase_list=["VideoSDK", "artificial intelligence", "machine learning"] ) # Add stt to cascade pipeline = Pipeline(stt=stt) ``` :::note When using environment variables for credentials, don't pass the `speech_key` and `speech_region` as arguments to the model instance. The SDK automatically reads the environment variables. ::: ## Configuration Options - `speech_key`: (Optional[str]) Azure Speech API key. Uses `AZURE_SPEECH_KEY` environment variable if not provided. - `speech_region`: (Optional[str]) Azure Speech region (e.g., `"eastus"`, `"westus2"`). Uses `AZURE_SPEECH_REGION` environment variable if not provided. - `language`: (str) The language code for transcription (default: `"en-US"`). See [supported languages](https://learn.microsoft.com/en-us/globalization/locale/standard-locale-names). - `sample_rate`: (int) The target audio sample rate in Hz for transcription (default: `16000`). The input audio at 48000Hz will be resampled to this rate. - `enable_phrase_list`: (bool) Whether to enable phrase list for better recognition accuracy (default: `False`). - `phrase_list`: (Optional[List[str]]) List of phrases to boost recognition for domain-specific terms (default: `None`). ## Additional Resources The following resources provide more information about using Azure with VideoSDK Agents SDK. 
- **[Azure Speech Service Overview](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/overview)**: Complete overview of Azure Speech services. - **[Azure STT docs](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/index-speech-to-text)**: Azure Speech-to-Text documentation. - **[Getting Started Guide](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speech-to-text?tabs=macos&pivots=programming-language-python#prerequisites)**: Azure STT setup and prerequisites. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Cartesia STT hide_title: false hide_table_of_contents: false description: "Learn how to use Cartesia's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for Cartesia's services" pagination_label: "Cartesia STT" keywords: - Cartesia - Speech-to-Text - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Cartesia slug: cartesia-stt --- # Cartesia STT The Cartesia STT provider enables your agent to use Cartesia's advanced speech-to-text models for high-accuracy, real-time audio transcription. ## Installation Install the Cartesia-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-cartesia" ``` ## Importing ```python from videosdk.plugins.cartesia import CartesiaSTT ``` ## Authentication The Cartesia plugin requires a [Cartesia API key](https://play.cartesia.ai/keys). Set `CARTESIA_API_KEY` in your `.env` file. 
## Example Usage ```python from videosdk.plugins.cartesia import CartesiaSTT from videosdk.agents import Pipeline # Initialize the Cartesia STT model stt = CartesiaSTT( # When CARTESIA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-cartesia-api-key", language="en-US", model="ink-whisper", ) # Add stt to pipeline pipeline = Pipeline(stt=stt) ``` :::note When using an environment variable for credentials, don't pass the `api_key` as an argument to the model instance. The SDK automatically reads the environment variable. ::: ## Configuration Options - `api_key`: (str) Your Cartesia API key. Can also be set via the `CARTESIA_API_KEY` environment variable. - `model`: (str) The Cartesia STT model to use (e.g., `"ink-whisper"`). Defaults to `"ink-whisper"`. - `language`: (str) Language code for transcription (default: `"en"`). ## Additional resources The following resources provide more information about using Cartesia with VideoSDK Agents. - **[Cartesia docs](https://docs.cartesia.ai/build-with-cartesia/models/stt)**: Cartesia STT docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Deepgram STT hide_title: false hide_table_of_contents: false description: "Learn how to use Deepgram's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for Deepgram's services" pagination_label: "Deepgram STT" keywords: - Deepgram - nova-2 - nova-3 - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Deepgram slug: deepgram --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Deepgram STT The Deepgram STT provider enables your agent to use Deepgram's advanced speech-to-text models for high-accuracy, real-time audio transcription. 
## Installation Install the Deepgram-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-deepgram" ``` ## Authentication The Deepgram plugin requires a [Deepgram API key](https://console.deepgram.com/). Set `DEEPGRAM_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.deepgram import DeepgramSTT ``` ```python from videosdk.plugins.deepgram import DeepgramSTTV2 ``` ## Example Usage ```python from videosdk.plugins.deepgram import DeepgramSTT from videosdk.agents import Pipeline # Initialize the Deepgram STT model stt = DeepgramSTT( # When DEEPGRAM_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-deepgram-api-key", model="nova-2", language="en-US", interim_results=True, punctuate=True, smart_format=True, profanity_filter=False, numerals=False, tag=None, enable_diarization=False, ) # Add stt to pipeline pipeline = Pipeline(stt=stt) ``` ```python from videosdk.plugins.deepgram import DeepgramSTTV2 from videosdk.agents import Pipeline # Initialize the Deepgram STT V2 model with Flux stt = DeepgramSTTV2( # When DEEPGRAM_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-deepgram-api-key", model="flux-general-en", eager_eot_threshold=0.6, eot_threshold=0.8, eot_timeout_ms=7000, enable_preemptive_generation=True, tag=None ) # Add stt to cascade pipeline = Pipeline(stt=stt) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. 
::: ## Configuration Options - `api_key`: Your Deepgram API key (can also be set via DEEPGRAM_API_KEY environment variable) - `model`: The Deepgram model to use (e.g., "`nova-2`", "`nova-3`", "`whisper-large`") (default: "`nova-2`") - `language`: (str) Language code for transcription (e.g., "`en-US`", "`es`", "`fr`") (default: "`en-US`") - `interim_results`: (bool) Enable real-time partial transcription results (default: `True`) - `punctuate`: (bool) Add punctuation to transcription (default: `True`) - `smart_format`: (bool) Apply intelligent formatting to output (default: `True`) - `filler_words`: (bool) Include filler words like "uh", "um" in transcription (default: `True`) - `sample_rate`: (int) Audio sample rate in Hz (default: `48000`) - `endpointing`: (int) Silence detection threshold in milliseconds (default: `50`) - `base_url`: (str) WebSocket endpoint URL (default: `"wss://api.deepgram.com/v1/listen"`) - `profanity_filter`: (bool) Whether to filter profanity from the transcription. Defaults to `False`. - `numerals`: (bool) Whether to include numerals in the transcription. Defaults to `False`. - `tag`: (str | list[str]) Tag or list of tags to add to the requests for usage reporting. Defaults to `None`. - `enable_diarization`: (bool) Diarize recognizes speaker changes and assigns a speaker to each word in the transcript. Defaults to `False`. 
The following options apply to `DeepgramSTTV2`:

- `api_key`: Your Deepgram API key (can also be set via the `DEEPGRAM_API_KEY` environment variable).
- `model`: The Flux model to use. The language is embedded in the model name, and currently only English is available (default: `"flux-general-en"`).
- `input_sample_rate`: (int) Input audio sample rate in Hz (default: `48000`).
- `target_sample_rate`: (int) Target sample rate for Deepgram processing (default: `16000`).
- `eager_eot_threshold`: Confidence threshold for early end-of-turn detection, range 0.0-1.0 (default: `0.6`).
  - Lower values make early detection more aggressive.
  - Higher values wait for higher confidence before ending the turn early.
- `eot_threshold`: Standard end-of-turn confidence threshold, range 0.0-1.0 (default: `0.8`).
  - Controls when a turn is definitively ended.
- `eot_timeout_ms`: Timeout in milliseconds before forcing end-of-turn (default: `7000`).
  - Maximum silence duration before the turn is ended automatically.
- `base_url`: (str) WebSocket endpoint URL (default: `"wss://api.deepgram.com/v2/listen"`).
- `tag`: (str | list[str]) Tag or list of tags to add to the requests for usage reporting. Defaults to `None`.
- `enable_preemptive_generation`: (bool) Enable preemptive generation based on EagerEndOfTurn events (default: `False`).

## Additional Resources

The following resources provide more information about using Deepgram with the VideoSDK Agents SDK.
- **[Deepgram docs V1](https://developers.deepgram.com/docs/live-streaming-audio)**: Deepgram's STT V1 docs - **[Deepgram docs V2](https://developers.deepgram.com/docs/flux/quickstart)**: Deepgram's STT V2 docs - **[Github URL V1](https://github.com/videosdk-live/agents/blob/main/videosdk-plugins/videosdk-plugins-deepgram/videosdk/plugins/deepgram/stt.py)** : Deepgram STT Plugin Source Code - **[Github URL V2](https://github.com/videosdk-live/agents/blob/main/videosdk-plugins/videosdk-plugins-deepgram/videosdk/plugins/deepgram/stt_v2.py)** : Deepgram STT V2 Plugin Source Code import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: ElevenLabs STT hide_title: false hide_table_of_contents: false description: "Learn how to use ElevenLabs's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for ElevenLabs's services" pagination_label: "ElevenLabs STT" keywords: - ElevenLabs - scribe_v2_realtime - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: ElevenLabs slug: eleven-labs --- # ElevenLabs STT The ElevenLabs STT provider enables your agent to use `ElevenLabs` advanced speech-to-text models for high-accuracy, real-time audio transcription with advanced voice activity detection. ## Installation Install the ElevenLabs-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-elevenlabs" ``` ## Importing ```python from videosdk.plugins.elevenlabs import ElevenLabsSTT ``` ## Authentication The ElevenLabs plugin requires an [ElevenLabs API key](https://elevenlabs.io/app/settings/api-keys). Set `ELEVENLABS_API_KEY` in your `.env` file. 
## Example Usage ```python from videosdk.plugins.elevenlabs import ElevenLabsSTT from videosdk.agents import Pipeline # Initialize the ElevenLabs STT model stt = ElevenLabsSTT( # When ELEVENLABS_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-elevenlabs-api-key", model_id="scribe_v2_realtime", language_code="en", commit_strategy="vad", vad_silence_threshold_secs=0.8, vad_threshold=0.4, min_speech_duration_ms=50, min_silence_duration_ms=50, include_language_detection=False ) # Add stt to cascade pipeline = Pipeline(stt=stt) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your ElevenLabs API key (can also be set via ELEVENLABS_API_KEY environment variable) - `model_id`: (str) STT model identifier (default: `"scribe_v2_realtime"`) - `language_code`: (str) Language code for transcription (default: `"en"`) - `sample_rate`: (int) Sample rate of input audio in Hz (default: `48000`) - `commit_strategy`: (str) Strategy for committing transcripts (default: `"vad"`) - `"vad"` - Voice Activity Detection based commit strategy - `vad_silence_threshold_secs`: (float) Duration of silence in seconds to detect end-of-speech (default: `0.8`) - `vad_threshold`: (float) Threshold for detecting voice activity (default: `0.4`) - `min_speech_duration_ms`: (int) Minimum duration in milliseconds for a speech segment (default: `50`) - `min_silence_duration_ms`: (int) Minimum duration in milliseconds of silence to consider end-of-speech (default: `50`) - `include_language_detection`: (bool) Whether to include language detection in the transcription (default: `False`) ## Additional Resources The following resources provide more information about using ElevenLabs with VideoSDK Agents SDK. 
- **[ElevenLabs docs](https://elevenlabs.io/docs)**: ElevenLabs STT docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Gladia STT hide_title: false hide_table_of_contents: false description: "Learn how to use Gladia's real-time speech-to-text models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing streaming transcription." pagination_label: "Gladia STT" keywords: - Gladia - STT - Speech-to-Text - real-time transcription - code-switching - VideoSDK Agents - Python SDK image: img/videosdklive-thumbnail.jpg sidebar_position: 4 sidebar_label: Gladia slug: gladia --- # Gladia STT The Gladia STT provider enables your agent to use Gladia's fast and accurate speech-to-text models for real-time audio transcription with support for multiple languages and code-switching. ## Installation Install the Gladia-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-gladia" ``` ## Authentication The Gladia plugin requires a [Gladia API key](https://app.gladia.io/signup). Set `GLADIA_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.gladia import GladiaSTT ``` ## Example Usage ```python from videosdk.plugins.gladia import GladiaSTT from videosdk.agents import Pipeline # Initialize the Gladia STT model stt = GladiaSTT( # When GLADIA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-gladia-api-key", languages=["en"], code_switching=True, receive_partial_transcripts=True ) # Add stt to a cascade pipeline = Pipeline(stt=stt) ``` :::note When using a `.env` file for credentials, you do not need to pass the `api_key` as an argument to the model instance; the SDK reads it automatically. ::: ## Configuration Options - `api_key`: (str, optional) Your Gladia API key. Can also be set via the `GLADIA_API_KEY` environment variable. - `model`: (str, optional) The model to use. Defaults to `"solaria-1"`. 
- `languages`: (List[str], optional) A list of language codes to detect (e.g., `["en", "fr"]`). Defaults to `["en"]`. - `code_switching`: (bool, optional) Enables automatic language switching between the provided languages. Defaults to `True`. - `input_sample_rate`: (int, optional) The sample rate of the incoming audio. Defaults to `48000`. - `output_sample_rate`: (int, optional) The sample rate Gladia should process. Defaults to `16000`. - `encoding`: (str, optional) The audio encoding format. Defaults to `"wav/pcm"`. - `bit_depth`: (int, optional) The bit depth of the audio. Defaults to `16`. - `channels`: (int, optional) The number of audio channels. Defaults to `1` (mono). - `receive_partial_transcripts`: (bool, optional) Set to `True` to receive interim transcription results for lower latency. Defaults to `False`. --- --- title: Google STT hide_title: false hide_table_of_contents: false description: "Learn how to use Google's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for Google's services" pagination_label: "Google STT" keywords: - Google - Speech-to-Text - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Google slug: google --- # Google STT The Google STT provider enables your agent to use Google's advanced speech-to-text models for high-accuracy, real-time audio transcription. ## Installation Install the Google-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-google" ``` ## Importing ```python from videosdk.plugins.google import GoogleSTT, VoiceActivityConfig ``` ## Setup Credentials/Authentication To use Google STT, you need to set up your Google Cloud credentials. You can do this by setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your service account key file. 
```bash export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/keyfile.json" ``` Alternatively, you can pass the path to the key file directly to the `GoogleSTT` constructor via the `api_key` parameter. **or** Set `GOOGLE_APPLICATION_CREDENTIALS` in your `.env` file. ## Example Usage ```python from videosdk.plugins.google import GoogleSTT, VoiceActivityConfig from videosdk.agents import Pipeline voice_activity_timeout = VoiceActivityConfig( speech_start_timeout=1.0, speech_end_timeout=5.0 ) # Initialize the Google STT model stt = GoogleSTT( # If GOOGLE_APPLICATION_CREDENTIALS is set, you can omit api_key api_key="/path/to/your/keyfile.json", languages="en-US", model="latest_long", interim_results=True, punctuate=True, profanity_filter=False, voice_activity_timeout = voice_activity_timeout ) # Add stt to pipeline pipeline = Pipeline(stt=stt) ``` :::note When using an environment variable for credentials, don't pass the `api_key` as an argument to the model instance. The SDK automatically reads the environment variable. ::: ## Configuration Options - `api_key`: (str) Path to your Google Cloud service account JSON file. This can also be set via the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. - `languages`: (Union[str, list[str]]) Language code or a list of language codes for transcription (default: `"en-US"`). - `model`: (str) The Google STT model to use (e.g., `"latest_long"`, `"telephony"`) (default: `"latest_long"`). - `sample_rate`: (int) The target audio sample rate in Hz for transcription (default: `16000`). The input audio at 48000Hz will be resampled to this rate. - `interim_results`: (bool) Enable real-time partial transcription results (default: `True`). - `punctuate`: (bool) Add punctuation to transcription (default: `True`). - `min_confidence_threshold`: (float) The minimum confidence level for a transcription result to be considered valid (default: `0.1`). 
- `location`: (str) The Google Cloud location to use for the STT service (default: `"global"`).
- `profanity_filter`: (bool) Detect profane words and return only their first letter followed by asterisks in the transcript (default: `False`).
- `voice_activity_timeout`: (`VoiceActivityConfig`) Configure speech activity timeouts (default: `None`).
  - `speech_start_timeout`: (float) Seconds to wait for speech to begin before timing out. Minimum `0.5` (default: `1.0`).
  - `speech_end_timeout`: (float) Seconds of silence after speech before ending. Minimum `0.1` (default: `5.0`).

## Additional Resources

The following resources provide more information about using Google with the VideoSDK Agents SDK.

- **[Google STT docs](https://cloud.google.com/speech-to-text/docs)**: Google Cloud STT documentation.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Navana STT
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Navana's Bodhi STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech-to-text, with a focus on Indian languages."
pagination_label: "Navana STT"
keywords:
  - Navana
  - Bodhi
  - Indian languages
  - STT
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Speech To Text
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 2
sidebar_label: Navana
slug: navana
---

# Navana STT

The Navana STT provider enables your agent to use Navana's Bodhi speech-to-text models, which are highly optimized for a variety of Indian languages and accents.

## Installation

Install the Navana-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-navana"
```

## Authentication

The Navana plugin requires a **Customer ID** and an **API Key** from your [Navana Bodhi account](https://bodhi.navana.ai/).

Set both `NAVANA_API_KEY` and `NAVANA_CUSTOMER_ID` in your `.env` file.
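Because two variables are required here, it can help to check for both before the agent starts, so a missing credential shows up as one clear error instead of a failure on the first transcription request. The helper below is a hypothetical sketch using only the standard library; `require_env` is not part of the VideoSDK or Navana SDKs, and it assumes the `.env` values have already been loaded into the process environment (for example by your process manager or a dotenv loader).

```python
import os
from typing import Dict, Iterable, Mapping


def require_env(
    names: Iterable[str], env: Mapping[str, str] = os.environ
) -> Dict[str, str]:
    """Return the requested variables, or raise naming every one that is missing or empty."""
    values = {name: env.get(name, "").strip() for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values
```

Calling `require_env(["NAVANA_API_KEY", "NAVANA_CUSTOMER_ID"])` at startup either returns both values or raises a `RuntimeError` that names whichever variable is absent.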
## Importing ```python from videosdk.plugins.navana import NavanaSTT ``` ## Example Usage ```python from videosdk.plugins.navana import NavanaSTT from videosdk.agents import Pipeline # Initialize the Navana STT model stt = NavanaSTT( api_key="your-navana-api-key", customer_id="your-navana-customer-id", model="en-in-general-v2-8khz", language="en-IN" ) # Add stt to pipeline pipeline = Pipeline(stt=stt) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key`, `customer_id`, and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your Navana API key (required, can also be set via `NAVANA_API_KEY` environment variable). - `customer_id`: Your Navana Customer ID (required, can also be set via `NAVANA_CUSTOMER_ID` environment variable). - `model`: The Navana STT model to use (e.g., `"en-in-general-v2-8khz"`, `"hi-general-v2-8khz"`). - `language`: The language code for transcription (e.g., `"en-IN"`, `"hi-IN"`). ## Additional Resources The following resources provide more information about using Navana with the VideoSDK Agents SDK. - **[Navana Docs](https://navana.gitbook.io/bodhi/streaming-asr/streaming-websocket)**: Navana's official streaming API documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Nvidia STT hide_title: false hide_table_of_contents: false description: "Learn how to use Nvidia's Riva STT models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing speech to text for Nvidia's services" pagination_label: "Nvidia STT" keywords: - Nvidia - Riva - Parakeet - STT - VideoSDK Agents - Python SDK - Speech To Text - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_label: Nvidia slug: nvidia --- # Nvidia STT The Nvidia STT provider enables your agent to use Nvidia's Riva speech-to-text models for high-performance, low-latency speech recognition. ## Installation Install the Nvidia-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-nvidia" ``` ## Authentication The Nvidia plugin requires an Nvidia API key. Set `NVIDIA_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.nvidia import NvidiaSTT ``` ## Example Usage ```python from videosdk.plugins.nvidia import NvidiaSTT from videosdk.agents import Pipeline # Initialize the Nvidia STT model stt = NvidiaSTT( # When NVIDIA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-nvidia-api-key", model="parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer", language_code="en-US", profanity_filter=False, automatic_punctuation=True ) # Add stt to pipeline pipeline = Pipeline(stt=stt) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key and other credential parameters from your code. 
::: ## Configuration Options - `api_key`: Your Nvidia API key (required, can also be set via environment variable) - `model`: The Nvidia Riva model to use (default: `"parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer"`) - `server`: The Nvidia Riva server address (default: `"grpc.nvcf.nvidia.com:443"`) - `function_id`: The specific function ID for the service (default: `"1598d209-5e27-4d3c-8079-4751568b1081"`) - `language_code`: Language code for transcription (default: `"en-US"`) - `sample_rate`: Audio sample rate in Hz (default: `16000`) - `profanity_filter`: (bool) Enable or disable profanity filtering (default: `False`) - `automatic_punctuation`: (bool) Enable or disable automatic punctuation (default: `True`) - `use_ssl`: (bool) Enable SSL connection (default: `True`) ## Additional Resources The following resources provide more information about using Nvidia Riva with VideoSDK Agents SDK. - **[Nvidia Riva docs](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html)**: Nvidia Riva documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: OpenAI STT hide_title: false hide_table_of_contents: false description: "Learn how to use OpenAI's STT models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing speech to text for OpenAI's services" pagination_label: "OpenAI STT" keywords: - OpenAI - gpt-4o-mini-transcribe - whisper-1 - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: OpenAI slug: openai --- # OpenAI STT The OpenAI STT provider enables your agent to use OpenAI's speech-to-text models (like Whisper) for converting audio input to text. 
## Installation Install the OpenAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-openai" ``` ## Authentication The OpenAI plugin requires an [OpenAI API key](https://platform.openai.com/api-keys). Set `OPENAI_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.openai import OpenAISTT ``` ## Example Usage ```python from videosdk.plugins.openai import OpenAISTT from videosdk.agents import Pipeline # Initialize the OpenAI STT model stt = OpenAISTT( # When OPENAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-openai-api-key", model="whisper-1", language="en", prompt="Transcribe this audio with proper punctuation and formatting." ) # Add stt to pipeline pipeline = Pipeline(stt=stt) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your OpenAI API key (required, can also be set via environment variable) - `model`: The OpenAI STT model to use (e.g., `"whisper-1"`, `"gpt-4o-mini-transcribe"`) - `base_url`: Custom base URL for OpenAI API (optional) - `prompt`: (str) Custom prompt to guide transcription style and format - `language`: (str) Language code for transcription (default: `"en"`) - `turn_detection`: (dict) Configuration for detecting conversation turns ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. - **[OpenAI docs](https://platform.openai.com/docs/guides/speech-to-text)**: OpenAI STT API documentation. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Sarvam AI STT hide_title: false hide_table_of_contents: false description: "Learn how to use Sarvam AI's STT models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing speech to text for Sarvam AI's services" pagination_label: "Sarvam AI STT" keywords: - Sarvam AI - saarika:v2 - saaras:v3 - STT - Large Language Model - VideoSDK Agents - Python SDK - Speech To Text - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: Sarvam AI slug: sarvam-ai --- # Sarvam AI STT The Sarvam AI STT provider enables your agent to use Sarvam AI's speech-to-text models for transcription. This provider uses Voice Activity Detection (VAD) to send audio chunks for transcription after a period of silence. ## Installation Install the Sarvam AI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-sarvamai" ``` ## Importing ```python from videosdk.plugins.sarvamai import SarvamAISTT ``` ## Authentication The Sarvam plugin requires a [Sarvam API key](https://dashboard.sarvam.ai/key-management). Set `SARVAMAI_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.sarvamai import SarvamAISTT from videosdk.agents import Pipeline # Initialize the Sarvam AI STT model stt = SarvamAISTT( # When SARVAMAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-sarvam-ai-api-key", model="saaras:v3", language="en-IN", ) # Add stt to pipeline pipeline = Pipeline(stt=stt) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `api_key`: (str) Your Sarvam AI API key. Can also be set via the `SARVAMAI_API_KEY` environment variable. - `model`: (str) The Sarvam AI model to use (default: `"saaras:v3"`). - `language`: (str) Language code for transcription (default: `"en-IN"`). - `input_sample_rate`: (int) The sample rate of the audio from the source in Hz (default: `48000`). 
- `output_sample_rate`: (int) The sample rate to which the audio is resampled before sending for transcription (default: `16000`). - `mode`: (str) Mode of operation. Only applicable for `saaras:v3`. Allowed values: `"transcribe"`, `"translate"`, `"verbatim"`, `"translit"`, `"codemix"` (default: `"transcribe"` for `saaras:v3`, `None` for other models). - `high_vad_sensitivity`: (bool) Whether to use high sensitivity voice activity detection (default: `None`). - `flush_signal`: (bool) Whether to send flush signal (default: `None`). - `translation`: (bool) Enable speech-to-text translation. Supported on `saaras:v3` and `saaras:v2.5` models. When enabled, routes to the translation endpoint (default: `False`). - `prompt`: (str) Prompt to guide the translation. Only applicable when `translation` is `True` (default: `None`). ## Additional Resources The following resources provide more information about using Sarvam AI with VideoSDK Agents SDK. - **[Sarvam docs](https://docs.sarvam.ai/api-reference-docs/getting-started/models/saaras)**: Sarvam's full docs site. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: AWS Polly TTS hide_title: false hide_table_of_contents: false description: "Learn how to use AWS Polly's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for AWS's services" pagination_label: "AWS Polly TTS" keywords: - AWS Polly - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 6 sidebar_label: AWS Polly slug: aws-polly-tts --- # AWS Polly TTS The AWS Polly TTS provider enables your agent to use AWS Polly's high-quality text-to-speech models for generating natural-sounding voice output. 
## Installation

Install the AWS Polly-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-aws"
```

## Importing

```python
from videosdk.plugins.aws import AWSPollyTTS
```

## Authentication

Make sure the following prerequisites are met:

- `AWS Account`: You have an active AWS account with permissions to access Amazon Polly.
- `Region Selection`: You're operating in the US East (N. Virginia) (`us-east-1`) region, as model access is region-specific.
- `AWS Credentials`: Your AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`) are configured, either through environment variables or your preferred credential management method.

## Example Usage

```python
from videosdk.plugins.aws import AWSPollyTTS
from videosdk.agents import Pipeline

# Initialize the AWS Polly TTS model
tts = AWSPollyTTS(
    voice="Joanna",
    engine="neural",
    speed=1.2,
    pitch=0.1,
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `aws_access_key_id`, `aws_secret_access_key`, and other credential parameters from your code.
:::

## Configuration Options

- `voice`: (str) Voice ID for the TTS output (default: `"Joanna"`).
- `engine`: (str) Polly engine type: `"standard"` or `"neural"` (default: `"neural"`).
- `region`: (str) AWS region for Polly service (default: `"us-east-1"` or from `AWS_DEFAULT_REGION`).
- `aws_access_key_id`: (str) AWS access key ID (optional; can be set via environment variable).
- `aws_secret_access_key`: (str) AWS secret access key (optional; can be set via environment variable).
- `aws_session_token`: (str) Optional AWS session token for temporary credentials.
- `speed`: (float) Speech rate multiplier (e.g., `1.0` is normal speed, `1.5` is 50% faster).
- `pitch`: (float) Pitch adjustment (e.g., `0.0` is normal, `0.2` raises pitch).

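
Before constructing `AWSPollyTTS`, it can help to confirm that the credentials listed under Authentication are actually present in the environment. The following is a minimal sketch using only the standard library. `missing_aws_settings` and `resolve_region` are hypothetical helpers written for illustration, not part of the SDK, and the fallback to `us-east-1` simply mirrors the default documented for `region` above.

```python
import os

# Environment variables named in the Authentication section above.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
DEFAULT_REGION = "us-east-1"  # documented default for `region`


def missing_aws_settings(env=None):
    """Return the names of required AWS variables that are unset or empty.

    Hypothetical helper for illustration; not part of the VideoSDK plugin.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


def resolve_region(env=None):
    """Use AWS_DEFAULT_REGION when set, otherwise fall back to us-east-1."""
    env = os.environ if env is None else env
    return env.get("AWS_DEFAULT_REGION") or DEFAULT_REGION


if __name__ == "__main__":
    missing = missing_aws_settings()
    if missing:
        print("Missing AWS settings:", ", ".join(missing))
    else:
        print("AWS credentials found; region:", resolve_region())
```

Running a check like this at startup surfaces a missing credential as a readable message rather than as a failure on the first synthesis request.
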
## Additional Resources

The following resources provide more information about using AWS Polly with the VideoSDK Agents SDK.

- **[AWS Polly docs](https://docs.aws.amazon.com/polly/latest/dg/what-is.html)**: AWS Polly documentation.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Azure OpenAI TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Azure OpenAI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Azure OpenAI's services"
pagination_label: "Azure OpenAI TTS"
keywords:
  - OpenAI
  - Azure
  - Azure OpenAI
  - gpt-4o-mini-tts
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 2
sidebar_label: Azure OpenAI
slug: azureopenai
---

# Azure OpenAI TTS

The Azure OpenAI TTS provider enables your agent to use Azure OpenAI's text-to-speech models for converting text responses to natural-sounding audio output.

## Installation

Install the Azure OpenAI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-openai"
```

## Importing

```python
from videosdk.plugins.openai import OpenAITTS
```

## Authentication

The Azure OpenAI plugin requires an [Azure OpenAI API key](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/create-resource?pivots=web-portal). Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, and `OPENAI_API_VERSION` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.openai import OpenAITTS
from videosdk.agents import Pipeline

# Initialize the Azure OpenAI TTS model
tts = OpenAITTS.azure(
    azure_deployment="gpt-4o-mini-tts",
    speed=1.0,
    response_format="pcm"
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects.
The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `azure_deployment`: The OpenAI deployment ID to use (by default it is model name: e.g., `"gpt-4o-mini-tts"`) - `api_key`: Your Azure OpenAI API key (can also be set via environment variable) - `azure_endpoint`: Your Azure OpenAI Deployment Endpoint URL (can also be set via environment variable) - `api_version`: Your Azure OpenAI API version (can also be set via environment variable) - `voice`: (str) Voice to use for audio output (e.g., `"alloy"`, `"echo"`, `"fable"`, `"onyx"`, `"nova"`, `"shimmer"`) - `speed`: (float) Speed of the generated audio (0.25 to 4.0, default: 1.0) ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Azure TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Azure's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Azure's services" pagination_label: "Azure TTS" keywords: - Azure - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI - Microsoft - Azure Speech Services - Azure AI Speech - Azure AI Foundry image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: Azure AI Speech slug: azure-ai-tts --- # Azure TTS The Azure TTS provider enables your agent to use Microsoft Azure's high-quality text-to-speech models for generating natural-sounding voice output with advanced voice tuning and expressive speaking styles. 
## Installation Install the Azure-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-azure" ``` ## Importing ```python from videosdk.plugins.azure import AzureTTS, VoiceTuning, SpeakingStyle ``` ## Authentication The Azure TTS plugin requires an Azure AI Speech Service resource. **Setup Steps:** 1. Create an AI Services resource for Speech in the [Azure portal](https://portal.azure.com) or from [Azure AI Foundry](https://ai.azure.com/foundryProject/overview) 2. Get the Speech resource key and region. After your Speech resource is deployed, select "Go to resource" to view and manage keys Set `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` in your `.env` file: ```bash AZURE_SPEECH_KEY=your-azure-speech-key AZURE_SPEECH_REGION=your-azure-region ``` ## Example Usage ```python from videosdk.plugins.azure import AzureTTS, VoiceTuning, SpeakingStyle from videosdk.agents import Pipeline # Configure voice tuning for prosody control voice_tuning = VoiceTuning( rate="fast", volume="loud", pitch="high" ) # Configure speaking style for expressive speech speaking_style = SpeakingStyle( style="cheerful", degree=1.5 ) # Initialize the Azure TTS model tts = AzureTTS( voice="en-US-EmmaNeural", language="en-US", tuning=voice_tuning, style=speaking_style ) # Add tts to pipeline pipeline = Pipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `speech_key`, `speech_region`, and other credential parameters from your code. ::: ## Configuration Options - `speech_key`: (Optional[str]) Azure Speech API key. Uses `AZURE_SPEECH_KEY` environment variable if not provided. - `speech_region`: (Optional[str]) Azure Speech region (e.g., `"eastus"`, `"westus2"`). Uses `AZURE_SPEECH_REGION` environment variable if not provided. - `speech_endpoint`: (Optional[str]) Custom endpoint URL. 
Uses `AZURE_SPEECH_ENDPOINT` environment variable if not provided. - `voice`: (str) Voice name to use for audio output (default: `"en-US-EmmaNeural"`). Get available voices using the [Azure voices API](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis?tabs=browserjs%2Cterminal&pivots=programming-language-python#select-synthesis-language-and-voice). - `language`: (str) Language code (optional, inferred from voice if not specified). - `tuning`: (`VoiceTuning`) Voice tuning object for rate, volume, and pitch control: - `rate`: (str) Speaking rate (`"x-slow"`, `"slow"`, `"medium"`, `"fast"`, `"x-fast"` or percentage like `"50%"`) - `volume`: (str) Speaking volume (`"silent"`, `"x-soft"`, `"soft"`, `"medium"`, `"loud"`, `"x-loud"` or percentage) - `pitch`: (str) Voice pitch (`"x-low"`, `"low"`, `"medium"`, `"high"`, `"x-high"` or frequency like `"+50Hz"`) - `style`: (`SpeakingStyle`) Speaking style object for expressive speech: - `style`: (str) Speaking style (e.g., `"cheerful"`, `"sad"`, `"angry"`, `"excited"`, `"friendly"`) - `degree`: (float) Style intensity from 0.01 to 2.0 (default: 1.0) - `deployment_id`: (str) Custom deployment ID for custom models. - `speech_auth_token`: (str) Authorization token for authentication. ## Voice Selection You can find available voices using the Azure Voices List API: ```bash curl --location --request GET 'https://eastus2.tts.speech.microsoft.com/cognitiveservices/voices/list' \ --header 'Ocp-Apim-Subscription-Key: YOUR_SPEECH_KEY' ``` Popular voice options include: - `en-US-EmmaNeural` (Female, neutral) - `en-US-BrianNeural` (Male, neutral) - `en-US-AriaNeural` (Female, cheerful) - `en-GB-SoniaNeural` (Female, British) ## Additional Resources The following resources provide more information about using Azure with VideoSDK Agents SDK. - **[Azure Speech Service Overview](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/overview)**: Complete overview of Azure Speech services. 
- **[Azure TTS docs](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/index-text-to-speech)**: Azure Text-to-Speech documentation. - **[Voice Selection Guide](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis?tabs=browserjs%2Cterminal&pivots=programming-language-python#select-synthesis-language-and-voice)**: Guide for selecting synthesis language and voice. - **[Speech Synthesis Markup](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice#adjust-prosody)**: Learn about prosody adjustments and voice tuning. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: CambAI TTS hide_title: false hide_table_of_contents: false description: "Learn how to use CambAI's TTS models with the VideoSDK AI Voice Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for CambAI's services" pagination_label: "CambAI TTS" keywords: - CambAI - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI - AI Voice Agents image: img/videosdklive-thumbnail.jpg sidebar_position: 3 sidebar_label: CambAI slug: cambai-tts --- # CambAI TTS The CambAI TTS provider enables your agent to use CambAI's high-quality, low-latency text-to-speech models for generating natural-sounding voice output with advanced voice customization capabilities. ## Installation Install the CambAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-cambai" ``` ## Importing ```python from videosdk.plugins.cambai import CambAITTS, InferenceOptions, VoiceSettings, OutputConfiguration ``` ## Authentication The CambAI plugin requires a [CambAI API key](https://studio.camb.ai/). Set `CAMBAI_API_KEY` in your `.env` file. 
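
As a rough illustration of what the `.env` entry looks like and how a missing key could be caught early, here is a minimal sketch. It assumes the file uses plain `KEY=value` lines. `parse_env_text` and `require_key` are hypothetical helpers written for this example (projects commonly rely on a package such as `python-dotenv` instead), and the SDK's own loading logic may differ.

```python
def parse_env_text(text):
    """Parse simple KEY=value lines, ignoring blank lines and # comments.

    Hypothetical helper for illustration; not part of the VideoSDK SDK.
    """
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_key(values, name="CAMBAI_API_KEY"):
    """Return the value for `name`, or raise a clear error if it is missing."""
    value = values.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


if __name__ == "__main__":
    sample = '# CambAI credentials\nCAMBAI_API_KEY="your-cambai-api-key"\n'
    print(require_key(parse_env_text(sample)))
```
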
## Example Usage ```python from videosdk.plugins.cambai import CambAITTS, InferenceOptions, VoiceSettings, OutputConfiguration from videosdk.agents import Pipeline inference_options = InferenceOptions( stability=0.5, temperature=0.7, inference_steps=60, speaker_similarity=0.8, localize_speaker_weight=0.5, acoustic_quality_boost=True ) # Configure voice settings voice_settings = VoiceSettings( enhance_reference_audio_quality=False, maintain_source_accent=False, ) output_configuration = OutputConfiguration( format="wav", sample_rate=24000, # Audio sample rate duration=None ) # Initialize CambAI TTS with optional audio output settings tts = CambAITTS( speech_model="mars-pro", voice_id=147320, language="en-us", user_instructions=None, # Optional for mars-instruct enhance_named_entities_pronunciation=True, voice_settings=voice_settings, inference_options=inference_options, output_configuration=output_configuration, ) # Add TTS to a cascade pipeline = Pipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `api_key`: (str) Your CambAI API key. Can also be set via the `CAMBAI_API_KEY` environment variable. - `speech_model`: (str) The CambAI TTS model to use (e.g., `"mars-pro"`, `"mars-flash"`, `"mars-instruct"`). Defaults to `"mars-pro"`. - `voice_id`: (int) Numeric voice profile ID from CambAI's voice library. Defaults to `147320`. - `language`: (str) BCP-47 locale string (e.g., `"en-us"`). Defaults to `"en-us"`. - `user_instructions`: (str) Style and tone guidance for the generated speech. Only supported when `speech_model` is set to `"mars-instruct"`. - `enhance_named_entities_pronunciation`: (bool) Improve pronunciation of brand names and proper nouns (default: `False`). 
- `voice_settings`: (`VoiceSettings`) Voice behaviour preferences: - `enhance_reference_audio_quality`: (bool) Enhance the quality of reference audio (default: `False`) - `maintain_source_accent`: (bool) Preserve the original speaker's accent (default: `False`) - `inference_options`: (`InferenceOptions`) Model sampling controls: - `stability`: (float) Voice stability control (optional) - `temperature`: (float) Sampling temperature (optional) - `inference_steps`: (int) Number of inference steps (optional) - `speaker_similarity`: (float) Speaker similarity control (optional) - `localize_speaker_weight`: (float) Speaker localization weight (optional) - `acoustic_quality_boost`: (bool) Enable acoustic quality enhancement (optional) - `output_configuration`: (`OutputConfiguration`) Audio output format and pacing options: - `format`: (str) Output audio format. Currently `"wav"` is supported (default: `"wav"`) - `sample_rate`: (int) Audio sample rate in Hz (default: `24000`) - `duration`: (float) Target speech duration in seconds. When set, the model attempts to pace the audio to match the requested duration. Omit or set to `None` for natural pacing (optional) ## Additional Resources The following resources provide more information about using CambAI with VideoSDK Agents. - **[CambAI docs](https://docs.camb.ai/)**: CambAI TTS docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Cartesia TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Cartesia's TTS models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing text to speech for Cartesia's services" pagination_label: "Cartesia TTS" keywords: - Cartesia - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Cartesia slug: cartesia-tts --- # Cartesia TTS The Cartesia TTS provider enables your agent to use Cartesia's high-quality, low-latency text-to-speech models for generating natural-sounding voice output. ## Installation Install the Cartesia-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-cartesia" ``` ## Importing ```python from videosdk.plugins.cartesia import CartesiaTTS ``` ## Authentication The Cartesia plugin requires a [Cartesia API key](https://play.cartesia.ai/keys). Set `CARTESIA_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.cartesia import CartesiaTTS from videosdk.agents import Pipeline # Initialize the Cartesia TTS model tts = CartesiaTTS( # When CARTESIA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-cartesia-api-key", model="sonic-2", voice_id="794f9389-aac1-45b6-b726-9d9369183238", language="en", pronunciation_dict_id= None, max_buffer_delay_ms=None, word_timestamps=True ) # Add tts to pipeline pipeline = Pipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `api_key`: (str) Your Cartesia API key. Can also be set via the `CARTESIA_API_KEY` environment variable. - `model`: (str) The Cartesia TTS model to use (e.g., `"sonic-2"`, `"sonic-turbo"`). Defaults to `"sonic-2"`. - `voice_id`: (str) The ID of the voice to use for generating speech. - `language`: (str) The language of the voice (e.g., `"en"`, `"fr"`). 
Defaults to `"en"`.
- `pronunciation_dict_id`: (str) The ID of the pronunciation dictionary to use for generating speech.
- `max_buffer_delay_ms`: (int) Maximum buffering delay before audio streaming starts. Values from 0 to 5000 ms are supported. Defaults to `3000` ms.
- `word_timestamps`: (bool) Enable word-level timestamps in the TTS output. Defaults to `False`.

## Additional Resources

The following resources provide more information about using Cartesia with VideoSDK Agents.

- **[Cartesia docs](https://docs.cartesia.ai/build-with-cartesia/models/tts)**: Cartesia TTS docs.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Deepgram TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Deepgram's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Deepgram's services"
pagination_label: "Deepgram TTS"
keywords:
  - Deepgram TTS
  - Text-to-Speech
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Deepgram TTS
slug: deepgram
---

# Deepgram TTS

The Deepgram TTS provider enables your agent to use Deepgram's high-quality text-to-speech models for generating natural, expressive voice output with advanced voice capabilities.

## Installation

Install the Deepgram-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-deepgram"
```

## Importing

```python
from videosdk.plugins.deepgram import DeepgramTTS
```

## Authentication

The Deepgram plugin requires a [Deepgram API key](https://developers.deepgram.com/docs/create-additional-api-keys). Set `DEEPGRAM_API_KEY` in your `.env` file.

## Example Usage ```python from videosdk.plugins.deepgram import DeepgramTTS from videosdk.agents import Pipeline # Initialize the Deepgram TTS model tts = DeepgramTTS( # When DEEPGRAM_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-deepgram-api-key", model="aura-asteria-en", encoding="linear16", # linear16, mulaw, alaw, opus, mp3, flac, aac sample_rate=24000 ) # Add tts to pipeline pipeline = Pipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `model` : The Deepgram model to use (e.g., `"aura-asteria-en"`, `"aura-luna-en"`) - `api_key`: Your Deepgram API key (can also be set via environment variable) - `encoding` : (str) Encoding allows you to specify the expected encoding of your audio output (default : `"linear16"`) - `sample_rate`: (int) Sample rate for output (default: `24000`) ## Additional Resources The following resources provide more information about using Deepgram with VideoSDK Agents SDK. - **[Deepgram docs](https://developers.deepgram.com/reference/text-to-speech-api/speak-streaming)**: Deepgram TTS docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: ElevenLabs TTS hide_title: false hide_table_of_contents: false description: "Learn how to use ElevenLabs's TTS models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing text to speech for ElevenLabs's services"
pagination_label: "ElevenLabs TTS"
keywords:
  - ElevenLabs
  - eleven_flash_v2_5
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: ElevenLabs
slug: eleven-labs
---

# ElevenLabs TTS

The ElevenLabs TTS provider enables your agent to use ElevenLabs' high-quality text-to-speech models for generating natural, expressive voice output with advanced voice cloning capabilities.

## Installation

Install the ElevenLabs-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-elevenlabs"
```

## Importing

```python
from videosdk.plugins.elevenlabs import ElevenLabsTTS, VoiceSettings
```

## Authentication

The ElevenLabs plugin requires an [ElevenLabs API key](https://elevenlabs.io/app/settings/api-keys). Set `ELEVENLABS_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.elevenlabs import ElevenLabsTTS, VoiceSettings
from videosdk.agents import Pipeline

# Configure voice settings
voice_settings = VoiceSettings(
    stability=0.71,
    similarity_boost=0.5,
    style=0.0,
    use_speaker_boost=True
)

# Initialize the ElevenLabs TTS model
tts = ElevenLabsTTS(
    # When ELEVENLABS_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-elevenlabs-api-key",
    model="eleven_flash_v2_5",
    voice="EXAVITQu4vr4xnSDxMaL",
    speed=1.0,
    response_format="pcm_24000",
    enable_streaming=True,
    enable_ssml_parsing=False,
    apply_text_normalization="auto",
    auto_mode=True,
    voice_settings=voice_settings,
    word_timestamps=True,
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code.
::: ## Configuration Options - `model`: The ElevenLabs model to use (e.g., `"eleven_flash_v2_5"`, `"eleven_multilingual_v2"`) - `voice`: (str) Voice ID to use for audio output (default: "EXAVITQu4vr4xnSDxMaL") - `speed`: (float) Speed of the generated audio (default: 1.0) - `api_key`: Your ElevenLabs API key (can also be set via environment variable) - `response_format`: (str) Audio format for output (default: `"pcm_24000"`) - `voice_settings`: (`VoiceSettings`) Advanced voice configuration options: - `stability`: (float) Voice stability (0.0 to 1.0, default: 0.71) - `similarity_boost`: (float) Voice similarity enhancement (0.0 to 1.0, default: 0.5) - `style`: (float) Voice style exaggeration (0.0 to 1.0, default: 0.0) - `use_speaker_boost`: (bool) Enable speaker boost for clarity (default: `True`) - `base_url`: (str) Custom base URL for ElevenLabs API (optional) - `enable_streaming`: (bool) Enable real-time audio streaming (default: `False`) - `enable_ssml_parsing`: (bool) Whether to enable SSML parsing (default: `False`) - `apply_text_normalization`: (str) Controls text normalization (e.g., spelling out numbers). Modes: - "auto" (default) – System decides automatically - "on" – Always applied - "off" – Skipped - `Note`: For `eleven_turbo_v2_5` and `eleven_flash_v2_5` models, enabling text normalization requires an Enterprise plan. - `auto_mode`: (bool) Reduces latency by disabling chunk schedule and buffers. Recommended for full sentences/phrases. - `word_timestamps`: (bool) Enable word-level timestamps in the TTS output. Defaults to `False`. ## Additional Resources The following resources provide more information about using ElevenLabs with VideoSDK Agents SDK. - **[ElevenLabs docs](https://elevenlabs.io/docs)**: ElevenLabs TTS docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Google TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Google's TTS models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing text to speech for Google's services" pagination_label: "Google TTS" keywords: - Google - Text-to-Speech - TTS - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 2 sidebar_label: Google slug: google-tts --- # Google TTS The Google TTS plugin enables your agent to use Google's text-to-speech models for generating natural-sounding voice output. It supports low-latency gRPC streaming with Chirp 3 HD voices and Vertex AI endpoints. ## Installation ```bash pip install "videosdk-plugins-google" ``` ## Authentication Set your Google API key as an environment variable: ```bash export GOOGLE_API_KEY="your-google-api-key" ``` You can obtain an API key from the [Google AI Studio](https://aistudio.google.com/apikey). ## Example Usage ```python from videosdk.plugins.google import GoogleTTS, GoogleVoiceConfig from videosdk.agents import Pipeline # Configure voice settings voice_config = GoogleVoiceConfig( languageCode="en-US", name="en-US-Chirp3-HD-Aoede", ssmlGender="FEMALE" ) # Initialize the Google TTS model tts = GoogleTTS( # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-google-api-key", speed=1.0, pitch=0.0, voice_config=voice_config, custom_pronunciations=[{"tomato": "təˈmeɪtoʊ"}], # Optional IPA overrides ) # Add tts to pipeline pipeline = Pipeline(tts=tts) ``` ### Vertex AI To use the Vertex AI endpoint instead of an API key, authenticate using [Application Default Credentials (ADC)](https://cloud.google.com/docs/authentication/application-default-credentials) and set your project ID: ```bash export GOOGLE_CLOUD_PROJECT="my-gcp-project" ``` ```python from videosdk.plugins.google import GoogleTTS, VertexAIConfig tts = GoogleTTS( vertexai=True, vertexai_config=VertexAIConfig(location="us-central1"), streaming=False, # Streaming cannot be used with Vertex AI ) ``` 
:::note
- `streaming=True` (the default) requires a Chirp 3 HD voice (e.g. `en-US-Chirp3-HD-Aoede`) and cannot be combined with `vertexai=True`.
- Vertex AI requires a GCP project ID via `VertexAIConfig(project_id="...")`, the `GOOGLE_CLOUD_PROJECT` env variable, or a `GOOGLE_APPLICATION_CREDENTIALS` service-account file.
:::

## Configuration Options

- `api_key`: (str) Your Google Cloud TTS API key. Can also be set via the `GOOGLE_API_KEY` environment variable.
- `speed`: (float) The speaking rate of the generated audio (default: `1.0`).
- `pitch`: (float) The pitch of the generated audio. Can be between -20.0 and 20.0 (default: `0.0`).
- `response_format`: (str) The format of the audio response. Currently only supports `"pcm"` (default: `"pcm"`).
- `voice_config`: (`GoogleVoiceConfig`) Configuration for the voice to be used.
  - `languageCode`: (str) The language code of the voice (e.g., `"en-US"`, `"en-GB"`) (default: `"en-US"`).
  - `name`: (str) The name of the voice to use (e.g., `"en-US-Chirp3-HD-Aoede"`, `"en-US-News-N"`) (default: `"en-US-Chirp3-HD-Aoede"`).
  - `ssmlGender`: (str) The gender of the voice (`"MALE"`, `"FEMALE"`, `"NEUTRAL"`) (default: `"FEMALE"`).
- `custom_pronunciations`: (list[dict] | dict | None) IPA pronunciation overrides for specific words (e.g., `[{"tomato": "təˈmeɪtoʊ"}]`). Defaults to `None`.
- `streaming`: (bool) Use gRPC `StreamingSynthesize` for lower-latency audio generation. Only compatible with Chirp 3 HD voices and cannot be combined with `vertexai=True` (default: `True`).
- `vertexai`: (bool) Use the Vertex AI TTS endpoint with Application Default Credentials (ADC) instead of an API key (default: `False`).
- `vertexai_config`: (`VertexAIConfig`) Project and region settings for Vertex AI.
  - `project_id`: (str | None) Your GCP project ID. Falls back to `GOOGLE_CLOUD_PROJECT` or `GOOGLE_APPLICATION_CREDENTIALS` (default: `None`).
  - `location`: (str) GCP region for the TTS endpoint (default: `"us-central1"`).
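
The option constraints listed above (streaming versus Vertex AI, the Chirp 3 HD requirement, the pitch range, and the single supported response format) can be checked before an agent starts. The following is a minimal sketch only: `check_google_tts_options` is a hypothetical helper, not part of the plugin, and detecting a Chirp 3 HD voice by looking for `Chirp3-HD` in the voice name is an assumption based on the example voice name on this page.

```python
from typing import List


def check_google_tts_options(
    voice_name: str = "en-US-Chirp3-HD-Aoede",
    streaming: bool = True,
    vertexai: bool = False,
    pitch: float = 0.0,
    response_format: str = "pcm",
) -> List[str]:
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    problems: List[str] = []

    # Documented rule: streaming cannot be combined with the Vertex AI endpoint.
    if streaming and vertexai:
        problems.append("streaming=True cannot be combined with vertexai=True")

    # Documented rule: streaming needs a Chirp 3 HD voice.
    # Assumption: such voices carry "Chirp3-HD" in their name.
    if streaming and "Chirp3-HD" not in voice_name:
        problems.append(
            f"streaming=True requires a Chirp 3 HD voice, got {voice_name!r}"
        )

    # Documented range for pitch: -20.0 to 20.0.
    if not -20.0 <= pitch <= 20.0:
        problems.append(f"pitch must be between -20.0 and 20.0, got {pitch}")

    # Documented rule: only "pcm" is currently supported.
    if response_format != "pcm":
        problems.append(f"response_format must be 'pcm', got {response_format!r}")

    return problems
```

A caller could run this with the same values it plans to pass to `GoogleTTS` and log or raise on a non-empty result; the plugin may well perform its own validation, so treat this as a convenience rather than a requirement.
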
## Additional Resources

The following resources provide more information about using Google with VideoSDK Agents SDK.

- **[Google TTS docs](https://cloud.google.com/text-to-speech/docs)**: Google Cloud TTS documentation.
- **[Chirp 3 HD voices](https://cloud.google.com/text-to-speech/docs/chirp3-hd)**: Available voices for low-latency streaming synthesis.
- **[Vertex AI TTS](https://cloud.google.com/vertex-ai/docs/text-to-speech)**: Vertex AI Text-to-Speech documentation.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Groq TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Groq's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Groq's services"
pagination_label: "Groq TTS"
keywords:
  - Groq
  - Text-to-Speech
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Groq
slug: groq-ai-tts
---

# Groq TTS

The Groq TTS provider enables your agent to use Groq's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the Groq-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-groq"
```

## Importing

```python
from videosdk.plugins.groq import GroqTTS
```

## Authentication

The Groq plugin requires a [Groq API key](https://console.groq.com/keys).

Set `GROQ_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.groq import GroqTTS
from videosdk.agents import Pipeline

# Initialize the Groq AI TTS model
tts = GroqTTS(
    model="playai-tts",
    voice="Fritz-PlayAI",
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects.
The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `model` (str): The TTS model to use. Default: "playai-tts"
- `voice` (str): The voice to use. Default: "Fritz-PlayAI"
- `speed` (float): Speed of speech (0.5 to 5.0). Default: 1.0
- `api_key` (str, optional): Groq API key. If not provided, uses GROQ_API_KEY environment variable

## Additional Resources

The following resources provide more information about using Groq with VideoSDK Agents SDK.

- **[Groq docs](https://console.groq.com/docs/text-to-speech)**: Groq TTS docs.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Hume AI TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Hume AI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Hume AI's services"
pagination_label: "Hume AI TTS"
keywords:
  - Hume AI
  - Text-to-Speech
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Hume AI
slug: hume-ai-tts
---

# Hume AI TTS

The Hume AI TTS provider enables your agent to use Hume AI's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the Hume AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-humeai"
```

## Importing

```python
from videosdk.plugins.humeai import HumeAITTS
```

## Authentication

The Hume plugin requires a [Hume API key](https://platform.hume.ai/settings/keys).

Set `HUMEAI_API_KEY` in your `.env` file.
## Example Usage

```python
from videosdk.plugins.humeai import HumeAITTS
from videosdk.agents import Pipeline

# Initialize the Hume AI TTS model
tts = HumeAITTS(
    voice="Serene Assistant",
    instant_mode=True,
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `instant_mode`: (bool) Whether to use instant mode synthesis (default: `True`). Instant mode requires specifying a voice.
- `voice`: (str) Voice name to use (default: `"Serene Assistant"`). Required when `instant_mode` is `True`.
- `speed`: (float) Speaking rate multiplier (default: `1.0`). Values >1.0 increase speed.
- `api_key`: (str) Hume AI API key. Can also be set via the `HUMEAI_API_KEY` environment variable.

## Additional Resources

The following resources provide more information about using Hume with VideoSDK Agents SDK.

- **[Hume AI docs](https://dev.hume.ai/docs/text-to-speech-tts)**: Hume AI docs.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Inworld AI TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Inworld AI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Inworld AI's services"
pagination_label: "Inworld AI TTS"
keywords:
  - Inworld AI
  - Text-to-Speech
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Inworld AI
slug: inworld-ai-tts
---

# Inworld AI TTS

The Inworld AI TTS provider enables your agent to use Inworld AI's high-quality text-to-speech models for generating natural-sounding voice output.
## Installation

Install the Inworld AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-inworldai"
```

## Importing

```python
from videosdk.plugins.inworldai import InworldAITTS
```

## Authentication

The Inworld plugin requires an [Inworld API key](https://studio.inworld.ai/login).

Set `INWORLD_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.inworldai import InworldAITTS
from videosdk.agents import Pipeline

# Initialize the Inworld AI TTS model
tts = InworldAITTS(
    api_key="your-api-key",
    voice_id="Hades",
    model_id="inworld-tts-1"
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `model_id`: (str) Inworld TTS model identifier (default: `"inworld-tts-1"`).
- `voice_id`: (str) Voice identifier to use (default: `"Hades"`).
- `temperature`: (float) Sampling temperature for variation in prosody (default: `0.8`).
- `api_key`: (str) Inworld API key. Can also be set via the `INWORLD_API_KEY` environment variable.

## Additional Resources

The following resources provide more information about using Inworld with VideoSDK Agents SDK.

- **[Inworld AI docs](https://docs.inworld.ai/docs/introduction)**: Inworld AI docs.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: LMNT AI TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use LMNT AI's TTS models with the VideoSDK AI Agent SDK.
This guide covers model configuration, API integration, and implementing text to speech for LMNT AI's services"
pagination_label: "LMNT AI TTS"
keywords:
  - LMNT AI
  - Text-to-Speech
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: LMNT AI
slug: lmnt-ai-tts
---

# LMNT AI TTS

The LMNT AI TTS provider enables your agent to use LMNT AI's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the LMNT AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-lmnt"
```

## Importing

```python
from videosdk.plugins.lmnt import LMNTTTS
```

## Authentication

The LMNT plugin requires an [LMNT API key](https://app.lmnt.com/account).

Set `LMNT_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.lmnt import LMNTTTS
from videosdk.agents import Pipeline

# Initialize the LMNT TTS model
tts = LMNTTTS(
    voice="ava",
    model="blizzard",
    language="auto",
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `api_key`: Your LMNT API key (can also be set via `LMNT_API_KEY` environment variable)
- `voice`: Voice ID to use for synthesis (required)
- `model`: Model to use for synthesis (default: "blizzard")
- `language`: Language code for synthesis (default: "auto")

## Additional Resources

The following resources provide more information about using LMNT with VideoSDK Agents SDK.

- **[LMNT docs](https://docs.lmnt.com/)**: LMNT API docs.
import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Murf AI TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Murf AI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Murf AI's services"
pagination_label: "Murf AI TTS"
keywords:
  - Murf AI
  - Falcon
  - GEN2
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Murf AI
slug: murf-ai-tts
---

# Murf AI TTS

The Murf AI TTS provider enables your agent to use Murf AI's high-quality text-to-speech models for generating natural, expressive voice output with advanced voice customization.

## Installation

Install the Murf AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-murfai"
```

## Importing

```python
from videosdk.plugins.murfai import MurfAITTS, MurfAIVoiceSettings
```

## Authentication

The Murf AI plugin requires a [Murf AI API key](https://murf.ai/).

Set `MURFAI_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.murfai import MurfAITTS, MurfAIVoiceSettings
from videosdk.agents import Pipeline

# Configure voice settings
voice_settings = MurfAIVoiceSettings(
    pitch=0,
    rate=0,
    style="Conversational",
    variation=1,
    multi_native_locale=None
)

# Initialize the Murf AI TTS model
tts = MurfAITTS(
    # When MURFAI_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-murfai-api-key",
    region="US_EAST",
    model="Falcon",
    voice="en-US-natalie",
    voice_settings=voice_settings,
    enable_streaming=True
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code.
:::

## Configuration Options

- `api_key`: Your Murf AI API key (can also be set via MURFAI_API_KEY environment variable)
- `region`: (str) The region code for API deployment (default: `"US_EAST"`)
  - Available regions: `"GLOBAL"`, `"US_EAST"`, `"US_WEST"`, `"INDIA"`, `"CANADA"`, `"SOUTH_KOREA"`, `"UAE"`, `"JAPAN"`, `"AUSTRALIA"`, `"EU_CENTRAL"`, `"UK"`, `"SOUTH_AFRICA"`
- `model`: (str) The Murf AI model to use (default: `"Falcon"`)
  - Available models: `"Gen2"`, `"Falcon"`
- `voice`: (str) Voice ID to use for audio output (default: `"en-US-natalie"`)
- `voice_settings`: (`MurfAIVoiceSettings`) Advanced voice configuration options:
  - `pitch`: (int) Voice pitch adjustment, range varies by voice (default: `0`)
  - `rate`: (int) Speech rate adjustment, range varies by voice (default: `0`)
  - `style`: (str) Voice style/emotion (default: `"Conversational"`)
  - `variation`: (int) Voice variation for diversity (default: `1`)
  - `multi_native_locale`: (str) Optional locale for multi-native voices (default: `None`)
- `enable_streaming`: (bool) Enable WebSocket streaming for low latency. When `False`, uses HTTP chunked transfer (default: `True`)

## Additional Resources

The following resources provide more information about using Murf AI with VideoSDK Agents SDK.

- **[Murf AI docs](https://murf.ai/api/docs/introduction/overview)**: Murf AI TTS docs.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Neuphonic TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Neuphonic's TTS models with the VideoSDK AI Agent SDK.
This guide covers model configuration, API integration, and implementing text to speech for Neuphonic's services"
pagination_label: "Neuphonic TTS"
keywords:
  - Neuphonic
  - Text-to-Speech
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Neuphonic
slug: neuphonic-tts
---

# Neuphonic TTS

The Neuphonic TTS provider enables your agent to use Neuphonic's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the Neuphonic-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-neuphonic"
```

## Importing

```python
from videosdk.plugins.neuphonic import NeuphonicTTS
```

## Authentication

The Neuphonic plugin requires a [Neuphonic API key](https://app.neuphonic.com/apikey).

Set `NEUPHONIC_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.neuphonic import NeuphonicTTS
from videosdk.agents import Pipeline

# Initialize the Neuphonic AI TTS model
tts = NeuphonicTTS(
    lang_code="en",
    voice_id="8e9c4bc8-3979-48ab-8626-df53befc2090",
    speed=1.0,
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `api_key`: Your Neuphonic API key (can also be set via `NEUPHONIC_API_KEY` environment variable)
- `lang_code`: Language code for the desired language (e.g., 'en', 'es', 'de', 'nl', 'hi')
- `voice_id`: The voice ID for the desired voice
- `speed`: Playback speed of the audio (range: 0.7-2.0, default: 1.0)

## Additional Resources

The following resources provide more information about using Neuphonic with VideoSDK Agents SDK.

- **[Neuphonic AI docs](https://docs.neuphonic.com/)**: Neuphonic docs.
import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Nvidia TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Nvidia's Riva TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Nvidia's services"
pagination_label: "Nvidia TTS"
keywords:
  - Nvidia
  - Riva
  - TTS
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_label: Nvidia
slug: nvidia
---

# Nvidia TTS

The Nvidia TTS provider enables your agent to use Nvidia's Riva text-to-speech models for converting text responses to natural-sounding audio output with low latency.

## Installation

Install the Nvidia-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-nvidia"
```

## Authentication

The Nvidia plugin requires an Nvidia API key.

Set `NVIDIA_API_KEY` in your `.env` file.

## Importing

```python
from videosdk.plugins.nvidia import NvidiaTTS
```

## Example Usage

```python
from videosdk.plugins.nvidia import NvidiaTTS
from videosdk.agents import Pipeline

# Initialize the Nvidia TTS model
tts = NvidiaTTS(
    # When NVIDIA_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-nvidia-api-key",
    voice_name="Magpie-Multilingual.EN-US.Aria",
    language_code="en-US",
    sample_rate=24000
)

# Add tts to cascade
pipeline = Pipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key and other credential parameters from your code.
:::

## Configuration Options

- `api_key`: Your Nvidia API key (required, can also be set via environment variable)
- `server`: The Nvidia Riva server address (default: `"grpc.nvcf.nvidia.com:443"`)
- `function_id`: The specific function ID for the service (default: `"877104f7-e885-42b9-8de8-f6e4c6303969"`)
- `voice_name`: (str) The voice to use (default: `"Magpie-Multilingual.EN-US.Aria"`)
- `language_code`: (str) Language code for synthesis (default: `"en-US"`)
- `sample_rate`: (int) Audio sample rate in Hz (default: `24000`)
- `use_ssl`: (bool) Enable SSL connection (default: `True`)

## Additional Resources

The following resources provide more information about using Nvidia Riva with VideoSDK Agents SDK.

- **[Nvidia Riva docs](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html)**: Nvidia Riva documentation.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: OpenAI TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use OpenAI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for OpenAI's services"
pagination_label: "OpenAI TTS"
keywords:
  - OpenAI
  - gpt-4o-mini-tts
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 2
sidebar_label: OpenAI
slug: openai
---

# OpenAI TTS

The OpenAI TTS provider enables your agent to use OpenAI's text-to-speech models for converting text responses to natural-sounding audio output.

## Installation

Install the OpenAI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-openai"
```

## Importing

```python
from videosdk.plugins.openai import OpenAITTS
```

## Authentication

The OpenAI plugin requires an [OpenAI API key](https://platform.openai.com/api-keys).

Set `OPENAI_API_KEY` in your `.env` file.
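
Because the key is read from the environment rather than passed in code, a missing or blank variable may only surface once the first synthesis request fails. The sketch below shows one way to fail fast at startup. It uses only the standard library; `missing_env_vars` is a hypothetical helper, not part of the SDK, and it assumes the `.env` file has already been loaded into the process environment (for example by a loader such as python-dotenv).

```python
import os
from typing import List


def missing_env_vars(*names: str) -> List[str]:
    """Return the given variable names that are unset or blank (hypothetical helper)."""
    return [name for name in names if not os.environ.get(name, "").strip()]
```

Calling `missing_env_vars("OPENAI_API_KEY")` before constructing `OpenAITTS` and aborting on a non-empty result gives a clear error message up front; the same check applies to the other providers' key variables.
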
## Example Usage

```python
from videosdk.plugins.openai import OpenAITTS
from videosdk.agents import Pipeline

# Initialize the OpenAI TTS model
tts = OpenAITTS(
    # When OPENAI_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-openai-api-key",
    model="tts-1",
    voice="alloy",
    speed=1.0,
    response_format="pcm"
)

# Add tts to cascade
pipeline = Pipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code.
:::

## Configuration Options

- `model`: The OpenAI TTS model to use (e.g., `"tts-1"`, `"tts-1-hd"`)
- `voice`: (str) Voice to use for audio output (e.g., `"alloy"`, `"echo"`, `"fable"`, `"onyx"`, `"nova"`, `"shimmer"`)
- `speed`: (float) Speed of the generated audio (0.25 to 4.0, default: 1.0)
- `instructions`: (str) Custom instructions to guide speech synthesis style
- `api_key`: Your OpenAI API key (can also be set via environment variable)
- `base_url`: Custom base URL for OpenAI API (optional)
- `response_format`: (str) Audio format for output (default: `"pcm"`)

## Additional Resources

The following resources provide more information about using OpenAI with VideoSDK Agents SDK.

- **[OpenAI docs](https://platform.openai.com/docs/guides/text-to-speech)**: OpenAI TTS API documentation.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Papla Media TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Papla Media's text-to-speech service with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing TTS for Papla Media."
pagination_label: "Papla Media TTS"
keywords:
  - Papla Media
  - TTS
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 3
sidebar_label: Papla Media
slug: papla-media
---

# Papla Media TTS

The Papla Media TTS provider enables your agent to use Papla Media's text-to-speech service for converting text responses into spoken audio.

## Installation

Install the Papla Media-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-papla"
```

## Importing

```python
from videosdk.plugins.papla import PaplaTTS
```

## Authentication

The Papla Media plugin requires an API key, which you can generate from your app dashboard.

Set `PAPLA_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.papla import PaplaTTS
from videosdk.agents import Pipeline

# Initialize the Papla Media TTS service
tts = PaplaTTS(
    # When PAPLA_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-papla-media-api-key",
)

# Add tts to a cascade
pipeline = Pipeline(tts=tts)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so you should omit the `api_key` parameter from your code.
:::

## Configuration Options

### Initialization Parameters

These are the options you can set when creating an instance of `PaplaTTS`.

- `model_id` (str): The TTS model to use. Defaults to `"papla_p1"`.
- `api_key` (str, optional): Your Papla Media API key. It's recommended to set this via the `PAPLA_API_KEY` environment variable instead.
- `base_url` (str, optional): Custom base URL for the Papla Media API. Defaults to `"https://api.papla.media/v1"`.

## Additional Resources

The following resources provide more information about using Papla Media with the VideoSDK Agent Framework.

- **[Papla Media API Docs](https://api.papla.media/docs)**: Papla Media's official API documentation.
import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Resemble AI TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Resemble AI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Resemble AI's services"
pagination_label: "Resemble AI TTS"
keywords:
  - Resemble AI
  - Text-to-Speech
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Resemble AI
slug: resemble-ai-tts
---

# Resemble AI TTS

The Resemble AI TTS provider enables your agent to use Resemble AI's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the Resemble AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-resemble"
```

## Importing

```python
from videosdk.plugins.resemble import ResembleTTS
```

## Authentication

The Resemble plugin requires a [Resemble API key](https://app.resemble.ai/account/api).

Set `RESEMBLE_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.resemble import ResembleTTS
from videosdk.agents import Pipeline

# Initialize the Resemble AI TTS model
tts = ResembleTTS(
    # When RESEMBLE_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-resemble-api-key",
    voice_uuid="55592656"
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `api_key`: (str) Your Resemble AI API key. Can also be set via the `RESEMBLE_API_KEY` environment variable.
- `voice_uuid`: (str) The UUID of the voice to use for synthesis (default: `"55592656"`).
## Additional Resources

The following resources provide more information about using Resemble with VideoSDK Agents SDK.

- **[Resemble AI docs](https://docs.app.resemble.ai)**: Resemble AI docs.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Rime AI TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Rime AI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Rime AI's services"
pagination_label: "Rime AI TTS"
keywords:
  - Rime AI
  - Text-to-Speech
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Rime AI
slug: rime-ai-tts
---

# Rime AI TTS

The Rime AI TTS provider enables your agent to use Rime AI's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the Rime AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-rime"
```

## Importing

```python
from videosdk.plugins.rime import RimeTTS
```

## Authentication

The Rime plugin requires a [Rime API key](https://rime.ai/).

Set `RIME_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.rime import RimeTTS
from videosdk.agents import Pipeline

# Initialize the Rime AI TTS model
tts = RimeTTS(
    speaker="river",
    model_id="mist",
    lang="eng",
    speed_alpha=1.0
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `speaker`: (str) Voice ID to use (default: `"river"`). Must match the model's available speakers.
- `model_id`: (str) Rime model identifier (default: `"mist"`).
  Supported: `"mist"`, `"mistv2"`.
- `lang`: (str) Language code for the voice (default: `"eng"`).
- `speed_alpha`: (float) Controls speaking rate (`1.0` is normal speed).
- `reduce_latency`: (bool) Whether to minimize streaming delay (default: `False`).
- `pause_between_brackets`: (bool) Insert pauses around bracketed text (default: `False`).
- `phonemize_between_brackets`: (bool) Use phonemes for bracketed text (default: `False`).
- `inline_speed_alpha`: (str) Optional per-word speed override (e.g., `"1.2,1.0,0.8"`).
- `api_key`: (str) Rime API key. Can also be set via the `RIME_API_KEY` environment variable.

## Additional Resources

The following resources provide more information about using Rime with VideoSDK Agents SDK.

- **[Rime AI docs](https://docs.rime.ai/)**: Rime AI docs.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: Sarvam AI TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use Sarvam AI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for Sarvam AI's services"
pagination_label: "Sarvam AI TTS"
keywords:
  - Sarvam AI
  - bulbul:v2
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 3
sidebar_label: Sarvam AI
slug: sarvam-ai-tts
---

# Sarvam AI TTS

The Sarvam AI TTS provider enables your agent to use Sarvam AI's text-to-speech models for generating voice output.

## Installation

Install the Sarvam AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-sarvamai"
```

## Importing

```python
from videosdk.plugins.sarvamai import SarvamAITTS
```

## Authentication

The Sarvam plugin requires a [Sarvam API key](https://dashboard.sarvam.ai/key-management).

Set `SARVAMAI_API_KEY` in your `.env` file.
## Example Usage

```python
from videosdk.plugins.sarvamai import SarvamAITTS
from videosdk.agents import Pipeline

# Initialize the Sarvam AI TTS model
tts = SarvamAITTS(
    # When SARVAMAI_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-sarvam-ai-api-key",
    model="bulbul:v3",
    speaker="shubh",
    language="en-IN",
    pitch=0.0,
    pace=1.0,
    loudness=1.0,
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `api_key`: (str) Your Sarvam AI API key. Can also be set via the `SARVAMAI_API_KEY` environment variable.
- `model`: (str) The Sarvam AI model to use, e.g. `"bulbul:v3"`, `"bulbul:v2"`, `"bulbul:v3-beta"` (default: `"bulbul:v3"`).
- `speaker`: (str) The speaker voice to use (default: `"shubh"`).
- `language`: (str) The language code for the generated audio (default: `"en-IN"`).
- `enable_streaming`: (bool) If `True`, uses WebSockets for low-latency streaming. If `False`, uses HTTP for batch synthesis (default: `True`).
- `sample_rate`: (int) The audio sample rate in Hz (default: `24000`).
- `output_audio_codec`: (str) The output audio codec (default: `"linear16"`).
- `pitch`: (float | None) Pitch of the voice. Only supported on `bulbul:v2`. Range: [-0.75, 0.75]. Set to `None` to omit (default: `0.0`).
- `pace`: (float | None) Pace/speed of the voice. `bulbul:v2`: range [0.3, 3.0]; `bulbul:v3`/`bulbul:v3-beta`: range [0.5, 2.0]. Set to `None` to omit (default: `1.0`).
- `loudness`: (float | None) Loudness of the voice. Only supported on `bulbul:v2`. Range: [0.3, 3.0]. Set to `None` to omit (default: `1.0`).
- `temperature`: (float | None) Sampling temperature. Only supported on `bulbul:v3` and `bulbul:v3-beta`. Range: [0.01, 1.0]. Set to `None` to omit (default: `0.6`).
- `output_audio_bitrate`: (str) Output audio bitrate. Allowed values: `"32k"`, `"64k"`, `"96k"`, `"128k"`, `"192k"` (default: `"128k"`).
- `min_buffer_size`: (int) Minimum character length that triggers buffer flushing (default: `50`).
- `max_chunk_length`: (int) Maximum chunk length for sentence splitting (default: `150`).
- `enable_preprocessing`: (bool) Controls normalization of English words and numeric entities (e.g., numbers, dates). Recommended for mixed-language text. Only supported on `bulbul:v2` (default: `False`).

## Additional Resources

The following resources provide more information about using Sarvam AI with VideoSDK Agents SDK.

- **[Sarvam docs](https://docs.sarvam.ai/api-reference-docs/getting-started/models/bulbul)**: Sarvam's full docs site.

import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
title: SmallestAI TTS
hide_title: false
hide_table_of_contents: false
description: "Learn how to use SmallestAI's TTS models with the VideoSDK AI Agent SDK. This guide covers model configuration, API integration, and implementing text to speech for SmallestAI's services"
pagination_label: "SmallestAI TTS"
keywords:
  - SmallestAI
  - Text-to-Speech
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 2
sidebar_label: SmallestAI
slug: smallestai-tts
---

# SmallestAI TTS

The SmallestAI TTS provider enables your agent to use SmallestAI's high-quality text-to-speech models for generating voice output.

## Installation

Install the SmallestAI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-smallestai"
```

## Importing

```python
from videosdk.plugins.smallestai import SmallestAITTS
```

## Authentication

The Smallest AI plugin requires a [Smallest AI API key](https://console.smallest.ai/apikeys).

Set `SMALLEST_API_KEY` in your `.env` file.
## Example Usage ```python from videosdk.plugins.smallestai import SmallestAITTS from videosdk.agents import Pipeline # Initialize the SmallestAI TTS model tts = SmallestAITTS( # When SMALLEST_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-smallestai-api-key", model="lightning", voice_id="emily" ) # Add tts to pipeline pipeline = Pipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` from your code. ::: ## Configuration Options - `api_key`: (str) Your SmallestAI API key. Can also be set via the `SMALLEST_API_KEY` environment variable. - `model`: (str) The TTS model to use (e.g., `"lightning"`, `"lightning-large"`). Defaults to `"lightning"`. - `voice_id`: (str) The ID of the voice to use. Defaults to `"emily"`. - `speed`: (float) Speech speed multiplier. Defaults to `1.0`. - `consistency`: (float) Controls word repetition and skipping. Only supported in `lightning-large` model. Defaults to `0.5`. - `similarity`: (float) Controls similarity to the reference audio. Only supported in `lightning-large` model. Defaults to `0.0`. - `enhancement`: (bool) Enhances speech quality at the cost of increased latency. Only supported in `lightning-large` model. Defaults to `False`. ## Additional Resources The following resources provide more information about using Smallest AI with VideoSDK Agents SDK. - **[Smallest AI docs](https://waves-docs.smallest.ai/v3.0.1/content/introduction/introduction)**: Smallest AI docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Speechify TTS hide_title: false hide_table_of_contents: false description: "Learn how to use Speechify's TTS models with the VideoSDK AI Agent SDK. 
This guide covers model configuration, API integration, and implementing text to speech for Speechify's services"
pagination_label: "Speechify TTS"
keywords:
  - Speechify
  - Text-to-Speech
  - TTS
  - Large Language Model
  - VideoSDK Agents
  - Python SDK
  - Text To Speech
  - AI Chat
  - Conversational AI
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Speechify
slug: speechify-tts
---

# Speechify TTS

The Speechify TTS provider enables your agent to use Speechify's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the Speechify-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-speechify"
```

## Importing

```python
from videosdk.plugins.speechify import SpeechifyTTS
```

## Authentication

The Speechify plugin requires a [Speechify API key](https://console.sws.speechify.com/).

Set `SPEECHIFY_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.speechify import SpeechifyTTS
from videosdk.agents import Pipeline

# Initialize the Speechify TTS model
tts = SpeechifyTTS(
    voice_id="kristy",
    model="simba-english"
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
```

:::note

When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.

:::

## Configuration Options

- `voice_id`: (str) The Speechify voice to use (default: `"kristy"`).
- `api_key`: (str) Speechify API key. Can also be set via the `SPEECHIFY_API_KEY` environment variable.
- `model`: (str) The model variant to use (`"simba-base"`, `"simba-english"`, `"simba-multilingual"`, `"simba-turbo"`). Default: `"simba-english"`.
- `language`: (str) Optional ISO language code for multilingual models (e.g., `"en"`, `"es"`).

## Additional Resources

The following resources provide more information about using Speechify with VideoSDK Agents SDK.
- **[Speechify AI docs](https://docs.sws.speechify.com/v1/docs)**: Speechify AI docs. import PluginResourceCards from '@site/src/components/PluginResourceCards' --- --- title: Turn Detector hide_title: false hide_table_of_contents: false description: "Learn how to use TurnDetector model with the VideoSDK AI Agent SDK. This guide covers model configuration." pagination_label: "Turn Detector" keywords: - Turn Detection - Turn Detector - Large Language Model - VideoSDK Agents - Python SDK - Text To Speech - AI Chat - Conversational AI image: img/videosdklive-thumbnail.jpg sidebar_position: 1 sidebar_label: Turn Detector slug: turn-detector --- # Turn Detector The Turn Detector uses a Hugging Face model to determine whether a user's turn is completed or not, enabling precise conversation flow management in cascades. ## Installation Install the Turn Detector-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-turn-detector" ``` ## Importing ```python from videosdk.plugins.turn_detector import TurnDetector ``` ## Example Usage ```python from videosdk.plugins.turn_detector import TurnDetector, pre_download_model from videosdk.agents import Pipeline # Pre-download the model (optional but recommended) pre_download_model() # Initialize the Turn Detector turn_detector = TurnDetector( threshold=0.7 ) # Add Turn Detector to pipeline pipeline = Pipeline(turn_detector=turn_detector) ``` ## Configuration Options - `threshold`: (float) Confidence threshold for turn completion detection (0.0 to 1.0, default: `0.7`) ## Pre-downloading Model To avoid delays during agent initialization, you can pre-download the Hugging Face model: ```python from videosdk.plugins.turn_detector import pre_download_model # Download model before running the agent pre_download_model() ``` ## Additional Resources The following resources provide more information about VideoSDK Turn Detector plugin for AI Agents SDK. 
import PluginResourceCards from '@site/src/components/PluginResourceCards'

---

---
id: recording
title: Recording
hide_title: false
hide_table_of_contents: false
description: "Learn how to enable the recording functionality with VideoSDK AI Agents for agent sessions and user interactions."
pagination_label: "Recording"
keywords:
  - Agent Recording
  - AI Agents
  - Recording
  - AI Agent Oversight
  - Traces
  - Playback
  - VideoSDK Agents
  - MCP Server
  - Python SDK
  - Audio Store
  - Autoscroll Transcript
  - Timestamped Playback
image: img/videosdklive-thumbnail.jpg
sidebar_position: 5
sidebar_label: Recording
slug: recording
---

The AI Agent SDK supports session recordings, which can be enabled with a single configuration flag. When enabled, all interactions between the user and the agent are recorded. These recordings can be played back directly from the dashboard with autoscrolling transcripts and precise timestamps, and you can also download them for offline review and analysis.

## Enabling Recording

To enable recording for an AI agent session, set the `recording` flag to `True` in the `RoomOptions` you pass to the `JobContext`. Once that's done, start your agent as usual; no additional changes are required in the pipeline. By default, the recording flag is set to `False`.

```python
from videosdk.agents import JobContext, RoomOptions

job_context = JobContext(
    room_options=RoomOptions(
        room_id="YOUR_ROOM_ID",
        name="Agent",
        recording=True  # defaults to False
    )
)
```

---

---
title: Running Agents with Worker
hide_title: false
hide_table_of_contents: false
description: "Learn how to run AI agent instances using the Worker system in the VideoSDK AI Agent SDK. Understand WorkerJob and JobContext for robust agent deployment with proper process isolation and lifecycle management."
pagination_label: "Running Agents with Worker System"
keywords:
  - Worker System
  - VideoSDK Agents
  - AI Agent SDK
  - Python
  - Multiprocessing
  - Process Isolation
  - WorkerJob
  - JobContext
  - Voice Agent Sessions
  - Agent Deployment
image: img/videosdklive-thumbnail.jpg
sidebar_position: 4
sidebar_label: Running Agents with Worker
slug: running-multiple-agents
---

The worker system provides a robust way to run AI agent instances using Python's multiprocessing. It offers process isolation, proper lifecycle management, and a clean separation between agent logic and infrastructure concerns.

## Key Components

### 1. WorkerJob

`WorkerJob` is the main class that defines an agent task to be executed in a separate process. It takes two parameters:

- `entrypoint`: An async function that accepts a `JobContext` parameter
- `jobctx`: A `JobContext` object or a callable that returns a `JobContext`

```python
job = WorkerJob(entrypoint=my_function, jobctx=my_context)
```

### 2. JobContext

`JobContext` provides the runtime environment for your agent, including:

- **Room Management**: Handles VideoSDK room connections
- **Shutdown Callbacks**: Allows cleanup operations
- **Process Isolation**: Each job runs in its own process

### 3. Worker

`Worker` manages the execution of jobs in separate processes, providing:

- Process isolation for each agent instance
- Automatic cleanup on shutdown
- Error handling and logging

## Usage Example

Here's a complete example of how to use the worker system with a voice agent:

```python
import asyncio

from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from videosdk.agents import Agent, AgentSession, Pipeline, WorkerJob, JobContext, RoomOptions


class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions and help with tasks.",
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello, how can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")


async def entrypoint(ctx: JobContext):
    model = GeminiRealtime(
        model="gemini-3.1-flash-live-preview",
        config=GeminiLiveConfig(
            voice="Leda",
            response_modalities=["AUDIO"]
        )
    )
    pipeline = Pipeline(llm=model)

    # MyVoiceAgent.__init__ takes no arguments, so the context is not passed here
    agent = MyVoiceAgent()
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
    )

    async def cleanup_session():
        print("Cleaning up session...")

    # Run cleanup_session when the job shuts down
    ctx.add_shutdown_callback(cleanup_session)

    try:
        # Connect to the room and wait for a participant before starting
        await ctx.connect()
        await ctx.room.wait_for_participant()
        await session.start()
        # Keep the job alive until it is interrupted
        await asyncio.Event().wait()
    except KeyboardInterrupt:
        print("Shutting down...")
    finally:
        await session.close()
        await ctx.shutdown()


def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",
        name="Sandbox Agent",
        playground=True
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
```

## Configuration Options

### RoomOptions

- `room_id`: The VideoSDK meeting ID
- `auth_token`: Authentication token (or use the `VIDEOSDK_AUTH_TOKEN` environment variable)
- `name`: Agent name displayed in the meeting
- `playground`: Enable playground mode for testing
- `vision`: Enable vision capabilities
- `avatar`: Use virtual avatars from available providers

## Best Practices

1. **Always use cleanup callbacks**: Register shutdown callbacks to ensure proper resource cleanup
2. **Handle exceptions gracefully**: Use try-finally blocks to ensure cleanup happens
3. **Use playground mode for testing**: Set `playground=True` for easy testing and debugging
4. **Set environment variables**: Use `VIDEOSDK_AUTH_TOKEN` for authentication
5. **Wait for participants**: Use `wait_for_participant()` so the agent starts only after a participant has joined

The worker system provides a production-ready way to deploy AI agents with proper isolation, lifecycle management, and error handling.

---

---
id: observability-options
title: Observability Options
hide_title: false
hide_table_of_contents: false
description: "Configure recording, traces, metrics, and logs at AgentSession startup using a single ObservabilityOptions object in the VideoSDK AI Agent SDK."
pagination_label: "Observability Options"
keywords:
  - ObservabilityOptions
  - AI Agent SDK
  - VideoSDK Agents
  - Recording
  - Traces
  - Metrics
  - Logs
  - OpenTelemetry
  - OTLP
  - AgentSession start
  - Python SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Observability Options
slug: observability-options
---

import { AgentCardGrid, GithubIcon, PlayIcon, CodeIcon, ExternalLinkIcon, SettingsIcon } from '@site/src/components/agent/cards';

# Observability Options

`ObservabilityOptions` is a single, declarative way to enable observability features (**recording**, **traces**, **metrics**, and **logs**) directly when you start an `AgentSession`. Instead of toggling each feature in different places, you pass one config object to `session.start()` and the framework wires everything up.
## Quick Start ```python title="main.py" from videosdk.agents import ( AgentSession, ObservabilityOptions, RecordingOptions, TracesOptions, MetricsOptions, LoggingOptions, ) session = AgentSession(agent=agent, pipeline=pipeline) await session.start( wait_for_participant=True, run_until_shutdown=True, #highlight-start observability=ObservabilityOptions( recording=RecordingOptions(video=True), traces=TracesOptions( enabled=True, export_url="https://otlp.example.com/v1/traces", ), metrics=MetricsOptions( enabled=True, export_url="https://otlp.example.com/v1/metrics", ), logs=LoggingOptions(enabled=True, level="DEBUG"), ), #highlight-end ) ``` :::tip Each field is **optional**. Pass only the features you want to enable — omitted fields stay off. ::: ## ObservabilityOptions Parameters | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `recording` | `Optional[RecordingOptions]` | `None` | Configure session recording (audio, video, screen share) | | `traces` | `Optional[TracesOptions]` | `None` | Configure OpenTelemetry trace collection and OTLP export | | `metrics` | `Optional[MetricsOptions]` | `None` | Configure OpenTelemetry metric collection and OTLP export | | `logs` | `Optional[LoggingOptions]` | `None` | Configure log level filtering and optional log export | :::note `TracesOptions`, `MetricsOptions`, `LoggingOptions`, and `RecordingOptions` are the **same classes** used in [`RoomOptions`](https://docs.videosdk.live/ai_agents/core-components/room-options#telemetry--logging). `ObservabilityOptions` simply lets you pass them at `session.start()` instead of at room construction — pick whichever entry point suits your code structure. ::: --- ## Recording Provide a `RecordingOptions` to capture the session. Audio is recorded by default; pass additional flags to also record video and/or screen share. 
### `RecordingOptions` | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `video` | `bool` | `False` | Record the agent's camera video track (composite audio+video participant recording) | | `screen_share` | `bool` | `False` | Record the screen-share track. Requires `vision=True` in `RoomOptions` | ```python title="main.py" from videosdk.agents import ObservabilityOptions, RecordingOptions await session.start( observability=ObservabilityOptions( recording=RecordingOptions(video=True), ), ) ``` :::note For full details on recording behavior, output formats, and the difference between participant, track, and meeting recording, see the [Recording](https://docs.videosdk.live/ai_agents/core-components/recording) doc. ::: --- ## Traces Stream OpenTelemetry traces (spans for STT, LLM, TTS, EOU, tool calls, etc.) to your own observability backend over OTLP/HTTP — for example Grafana Tempo, Honeycomb, Jaeger, Datadog, or any OTLP-compatible collector. ### `TracesOptions` | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `enabled` | `bool` | `True` | Enable trace collection | | `export_url` | `Optional[str]` | `None` | OTLP/HTTP endpoint that should receive trace spans | | `export_headers` | `Optional[Dict[str, str]]` | `None` | Custom headers for the export endpoint (e.g., auth tokens) | ```python title="main.py" from videosdk.agents import ObservabilityOptions, TracesOptions await session.start( observability=ObservabilityOptions( traces=TracesOptions( enabled=True, export_url="https://otlp.example.com/v1/traces", export_headers={"Authorization": "Bearer your-token"}, ), ), ) ``` :::tip Traces are also available on the [VideoSDK Dashboard](https://docs.videosdk.live/ai_agents/tracing-observability/traces) without any extra setup. Use `TracesOptions` when you want to ship traces to an external system in addition to (or instead of) the dashboard. 
::: --- ## Metrics Export pipeline metrics (latency, durations, token usage) to your own OTLP-compatible metrics backend — for example Prometheus (via OTLP), Grafana Mimir, Datadog, or New Relic. ### `MetricsOptions` | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `enabled` | `bool` | `True` | Enable metrics collection | | `export_url` | `Optional[str]` | `None` | OTLP/HTTP endpoint that should receive metric points | | `export_headers` | `Optional[Dict[str, str]]` | `None` | Custom headers for the export endpoint (e.g., auth tokens) | ```python title="main.py" from videosdk.agents import ObservabilityOptions, MetricsOptions await session.start( observability=ObservabilityOptions( metrics=MetricsOptions( enabled=True, export_url="https://otlp.example.com/v1/metrics", export_headers={"Authorization": "Bearer your-token"}, ), ), ) ``` :::tip For programmatic, in-process access to the same metrics, use [Pipeline Observability](https://docs.videosdk.live/ai_agents/core-components/pipeline-observability) hooks (`@pipeline.metrics.on(...)`). ::: --- ## Logs Filter the session's log output by level, and optionally export logs to an external collector. Useful for cranking up verbosity in development or shipping structured logs to your SIEM in production. 
### `LoggingOptions` | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `enabled` | `bool` | `False` | Enable log export to the configured `export_url` | | `level` | `str` | `"INFO"` | Log level filter — one of `"DEBUG"`, `"INFO"`, `"WARNING"`, `"ERROR"` | | `export_url` | `Optional[str]` | `None` | OTLP/HTTP endpoint that should receive log records | | `export_headers` | `Optional[Dict[str, str]]` | `None` | Custom headers for the export endpoint (e.g., auth tokens) | ```python title="main.py" from videosdk.agents import ObservabilityOptions, LoggingOptions # Local: just bump the log level for the session await session.start( observability=ObservabilityOptions( logs=LoggingOptions(level="DEBUG"), ), ) # Remote: also ship logs to an OTLP collector await session.start( observability=ObservabilityOptions( logs=LoggingOptions( enabled=True, level="INFO", export_url="https://otlp.example.com/v1/logs", export_headers={"Authorization": "Bearer your-token"}, ), ), ) ``` --- ## Complete Example For a runnable script that turns on recording, traces, metrics, and logs together inside `session.start()`, see the [Observability Hooks example on GitHub](https://github.com/videosdk-live/agents/blob/main/examples/observability_hooks.py). 
## Related

- **[Pipeline Observability](/ai_agents/core-components/pipeline-observability)**: Hooks for metrics, errors, recording lifecycle, and context access
- **[Session Analytics](/ai_agents/tracing-observability/session-analytics)**: Inspect sessions, transcripts, and latency on the VideoSDK Dashboard
- **[Trace Insights](/ai_agents/tracing-observability/traces)**: Granular per-turn breakdown of STT, LLM, TTS, and tool calls
- **[Recording](/ai_agents/core-components/recording)**: Recording types, options, and lifecycle

---

---
id: session-analytics
title: Session Analytics
hide_title: false
hide_table_of_contents: false
description: "Understand how to use Tracing & Observability for the AI Agent SDK on the VideoSDK Dashboard to inspect sessions, transcripts, and end-to-end latency per component."
keywords:
  - AI Agent SDK
  - VideoSDK Agents
  - Tracing and Observability
  - Session Analytics
  - Telemetry and Metrics
  - Latency Measurement
  - End-to-End Latency
  - Session Debugging
  - Transcript Playback
  - Conversation Analytics
  - Interaction Turns
  - Tool Calls Monitoring
  - Audio/Video Session Recording
  - Agent Responsiveness
  - Performance Monitoring
  - User-Agent Interaction
  - Real-time Insights
  - Session Playback
  - VideoSDK Dashboard
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
sidebar_label: Session Analytics
slug: session-analytics
---

VideoSDK's AI Agent framework offers **Tracing and Observability** tools that provide deep insight into your AI agent's performance and behavior. These tools, accessible from the VideoSDK dashboard, allow you to monitor sessions, analyze interactions, and debug issues with precision.
## Prerequisites

To view Tracing and Observability data on the VideoSDK Dashboard, install the VideoSDK AI Agent package using pip:

```bash
pip install videosdk-agents==0.0.23
```

:::note

Tracing and Observability support was added in version 0.0.23, so that version or later is required.

:::

## Sessions

The Sessions dashboard provides a comprehensive list of all interactions with your AI agents. Each session is a unique conversation between a user and an agent, identified by a `Session ID` and associated with a `Room ID`.
### Key Metrics For each session, you can monitor the following key metrics at a glance: - **Session ID**: A unique identifier for the session. - **Room ID**: The identifier of the room where the session took place. - **TTFW (Time to First Word)**: The time it takes for the agent to utter its first word after the user has finished speaking. This metric is crucial for measuring the responsiveness of your agent. - **P50, P90, P95**: These are percentile metrics for latency, providing a statistical distribution of response times. For example, P90 indicates that 90% of the responses were faster than the specified value. - **Interruption**: The number of times the agent was interrupted by the user. - **Duration**: The total duration of the session. - **Recording**: Indicates whether the session was recorded. You can play back the recording directly from the dashboard. - **Created At**: The timestamp of when the session was created. - **Actions**: From here, you can navigate to the detailed analytics view for the session. ## Session View By clicking on "View Analytics" for a specific session, you are taken to the Session View. This view provides a complete transcript of the conversation, along with timestamps and speaker identification (Caller or Agent).
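To make the P50, P90, and P95 columns listed under Key Metrics more concrete, here is a minimal sketch of how such percentiles can be computed from a list of response latencies. It uses the nearest-rank method, which is an assumption for illustration; the dashboard's exact calculation is not documented here, and the sample latencies are made up.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Illustrative per-response latencies in milliseconds (not real data)
latencies_ms = [420, 380, 510, 460, 900, 430, 395, 470, 445, 1200]

for p in (50, 90, 95):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

A large gap between P50 and P95 in a sketch like this points to a few slow outlier responses rather than uniformly slow behavior.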
If the session was recorded, you can play back the audio and follow along with the transcript, which automatically scrolls as the conversation progresses. This is an invaluable tool for understanding the user experience and identifying areas for improvement. By analyzing these metrics, you can quickly identify underperforming agents, diagnose latency issues, and gain a holistic view of the user experience. The next section will delve into the detailed session and trace views, where you can explore individual conversations and their underlying processes. --- --- id: traces title: Trace Insights keywords: - VideoSDK Tracing - Trace View - Spans and Traces - Speech-to-Text (STT) - Text-to-Speech (TTS) - Large Language Model (LLM) - End-of-Utterance (EOU) - AI Agent Performance - Session Trace Analysis - Conversation Turn Breakdown - Latency Metrics - Tool Call Debugging - Agent Interaction Insights - Real-time Trace Visualization - AI Agent Observability - VideoSDK Dashboard Traces image: img/videosdklive-thumbnail.jpg sidebar_position: 4 sidebar_label: Traces slug: traces --- The real power of VideoSDK's Tracing and Observability tools lies in the detailed session and trace views. These views provide a granular breakdown of each conversation, allowing you to analyze every turn, inspect component latencies, and understand the agent's decision-making process. ## Trace View The Trace View offers an even deeper level of insight, breaking down the entire session into a hierarchical structure of traces and spans.
### Session Configuration At the top level, you'll find the **Session Configuration**, which details all the parameters the agent was initialized with. This includes the models used for STT, LLM, and TTS, as well as any function tools or MCP tools that were configured. This information is crucial for reproducing and debugging specific agent behaviors. ### User & Agent Turns The core of the Trace View is the breakdown of the conversation into **User & Agent Turns**. Each turn represents a single exchange between the user and the agent.
Within each turn, you can see a detailed timeline of the underlying processes, including: - **STT (Speech-to-Text) Processing**: The time it took to transcribe the user's speech. - **EOU (End-of-Utterance) Detection**: The time taken to detect that the user has finished speaking. - **LLM Processing**: The time the Large Language Model took to process the input and generate a response. - **TTS (Text-to-Speech) Processing**: The time it took to convert the LLM's text response into speech. - **Time to First Byte**: The initial delay before the agent starts speaking. - **User Input Speech**: The duration of the user's speech. - **Agent Output Speech**: The duration of the agent's spoken response. ### Turn Properties For each turn, you can inspect the properties of the components involved. This includes the transcript of the user's input, the response from the LLM, and any errors that may have occurred.
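The timeline entries above can be modeled as a small record per turn to reason about where time is spent. This is only a sketch: `TurnTimings` and its field names are hypothetical and are not an SDK or dashboard export format, and it treats the stages as strictly sequential even though real pipelines may stream and overlap them.

```python
from dataclasses import asdict, dataclass


@dataclass
class TurnTimings:
    """Hypothetical per-turn component latencies, in milliseconds."""

    stt_ms: float
    eou_ms: float
    llm_ms: float
    tts_ms: float

    def total_ms(self) -> float:
        # Simplification: sums the stages as if they ran one after another.
        return self.stt_ms + self.eou_ms + self.llm_ms + self.tts_ms

    def bottleneck(self) -> str:
        # Name of the field with the largest latency in this turn.
        timings = asdict(self)
        return max(timings, key=timings.get)


# Illustrative values only
turn = TurnTimings(stt_ms=120, eou_ms=80, llm_ms=650, tts_ms=210)
print(turn.total_ms(), turn.bottleneck())
```

Comparing the bottleneck across many turns is one way to decide whether to tune the STT, LLM, or TTS component first.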
By leveraging the detailed information in the Trace View, you can pinpoint performance bottlenecks, debug errors, and gain a comprehensive understanding of your AI agent's inner workings. ### Tool Calls When an LLM invokes a tool, the Trace View provides specific details about the tool call, including the tool's name and the parameters it was called with. This is essential for debugging integrations and ensuring that your agent's tools are functioning as expected.
---

---
title: Wake Up Call
hide_title: false
hide_table_of_contents: false
description: "Learn how to implement Wake Up Call functionality with VideoSDK AI Agents to automatically trigger actions when users are inactive for a specified duration."
pagination_label: "Wake Up Call"
keywords:
  - Wake Up Call
  - Inactivity Detection
  - Auto Trigger
  - User Engagement
  - VideoSDK Agents
  - Callback Functions
  - Session Management
  - AgentSession
  - Timeout Management
image: img/videosdklive-thumbnail.jpg
sidebar_position: 6
sidebar_label: Wake Up Call
slug: wakeup-call
---

# Wake Up Call

Wake Up Call enables AI agents to automatically trigger actions when users remain inactive for a specified duration. This feature helps maintain user engagement and provides proactive assistance during conversation sessions.

## Overview

The Wake Up Call system allows AI agents to:

- Monitor user inactivity periods during conversations
- Automatically trigger custom callback functions after specified timeouts
- Re-engage users with proactive messages or actions

## Key Components

### 1. Wake Up Configuration

Set the inactivity timeout duration in the `AgentSession` constructor using the `wake_up` parameter:

```python
session = AgentSession(
    agent=agent,
    pipeline=pipeline,
    wake_up=10  # seconds
)
```

**Important**: If a `wake_up` time is provided, you must set a callback function before starting the session. If no `wake_up` time is specified, no timer or callback will be activated.

### 2. Callback Function

Define a custom async function that will be executed when the inactivity threshold is reached:

```python
async def on_wake_up():
    print("Wake up triggered - user inactive for 10 seconds")
    # session.say is a coroutine, so it must be awaited
    await session.say("Hello, how can I help you today?")

# Assign the callback function to the session
session.on_wake_up = on_wake_up
```

:::tip

Get started quickly with the [Wake Up Call Example](https://github.com/videosdk-live/agents/tree/main/examples/wakeup_call.py), which contains everything you need to implement inactivity detection in your AI agents.

:::

---

---
title: WhatsApp Agent Quick Start
hide_title: false
hide_table_of_contents: false
description: "A comprehensive guide to creating a powerful AI voice agent that can answer calls made to your WhatsApp Business number. Learn how to integrate with Meta Business Platform using direct SIP integration and VideoSDK."
pagination_label: "WhatsApp Agent Quick Start"
keywords:
  - WhatsApp Voice Agent
  - Quick Start
  - VideoSDK Agents
  - AI Agent SDK
  - Python
  - SIP
  - WhatsApp Business
  - Meta Business Platform
  - Voice Integration
  - AI Assistant
  - Customer Service
  - Real-time Communication
image: img/videosdklive-thumbnail.jpg
sidebar_position: 4
sidebar_label: AI WhatsApp Agent
slug: whatsapp-voice-agent-quick-start
---

import WhatsAppQuickStart from '@site/mdx/\_whatsapp-voice-agent-quick-start-v1.mdx';

---

---
title: Custom Tracks
hide_title: true
hide_table_of_contents: false
description: Custom Video Track features quick integrate in Javascript, React JS, Android, IOS, React Native, Flutter with Video SDK to add live video & audio conferencing to your applications.
sidebar_label: Custom Tracks
pagination_label: Custom Tracks
keywords:
  - custom Track
  - audio calling
  - video calling
  - real-time communication
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
---

## Custom Video Track - Android

- You can create a Video Track using the `createCameraVideoTrack()` method of `VideoSDK`.
- This method can be used to create a video track with different encoding parameters, camera facing mode, bitrateMode, maxLayer and optimization mode.

### Parameters

- **encoderConfig**:

  - type: `String`
  - required: `true`
  - default: `h480p_w720p`
  - You can choose from the list of values below for the encoder config.

| Config        | Resolution | Frame Rate | Optimized (kbps) | Balanced (kbps) | High Quality (kbps) |
| :------------ | :--------: | :--------: | :--------------: | :-------------: | :-----------------: |
| h144p_w176p   |  176x144   |   15 fps   |        60        |       100       |         150         |
| h240p_w320p   |  320x240   |   15 fps   |        80        |       150       |         300         |
| h360p_w640p   |  640x360   |   25 fps   |       200        |       400       |         800         |
| h480p_w640p   |  640x480   |   25 fps   |       300        |       600       |        1000         |
| h480p_w720p   |  720x480   |   30 fps   |       400        |       700       |        1100         |
| h720p_w960p   |  960x720   |   30 fps   |       800        |      1300       |        1800         |
| h720p_w1280p  |  1280x720  |   30 fps   |       1000       |      1600       |        2400         |
| h1080p_w1440p | 1440x1080  |   30 fps   |       2000       |      2500       |        3500         |

:::note

The encoder configurations above are valid in both landscape and portrait mode.

:::

- **facingMode**:

  - type: `String`
  - required: `true`
  - Allowed values: `front` | `back`
  - Specifies whether to use the front or back camera for the video track.

- **optimizationMode**

  - type: `CustomStreamTrack.VideoMode`
  - required: `true`
  - Allowed values: `motion` | `text` | `detail`
  - Specifies the optimization mode for the video track being generated.

- **multiStream**:

  - type: `boolean`
  - required: `true`
  - Specifies whether the stream should send multiple resolution layers or a single resolution layer.

- **context**:

  - type: `Context`
  - required: `true`
  - Pass the Android Context for this parameter.

- **observer**:

  - type: `CapturerObserver`
  - required: `false`
  - If you want to use a video filter from an external SDK (e.g., [Banuba](https://www.banuba.com/)), pass an instance of `CapturerObserver` in this parameter.
- **videoDeviceInfo**:

  - type: `VideoDeviceInfo`
  - required: `false`
  - Use this to specify the camera device to be used in the meeting.

- **bitrateMode**:

  - type: `BitrateMode`
  - required: `false`
  - Allowed values: `BitrateMode.BANDWIDTH_OPTIMIZED` | `BitrateMode.BALANCED` | `BitrateMode.HIGH_QUALITY`
  - Controls the video quality and bandwidth consumption. You can choose between `BitrateMode.HIGH_QUALITY` for the best picture, `BitrateMode.BANDWIDTH_OPTIMIZED` to save data, or `BitrateMode.BALANCED` for a mix of both. Defaults to `BitrateMode.BALANCED`.

- **maxLayer**:

  - type: `Integer`
  - required: `false`
  - Allowed values: `2` | `3`
  - Specifies the maximum number of simulcast layers to publish. This parameter only has an effect if `multiStream` is set to true.

:::note

For Banuba integration with VideoSDK, please visit [Banuba Integration with VideoSDK](/android/guide/video-and-audio-calling-api-sdk/video-processor/banuba-integration)

:::

:::info

- To learn more about optimizations and best practices for using custom video tracks, [follow this guide](/android/guide/video-and-audio-calling-api-sdk/render-media/optimize-video-track).

:::

#### Returns

- `CustomStreamTrack`

### Example

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

```javascript
val videoCustomTrack: CustomStreamTrack = VideoSDK.createCameraVideoTrack("h720p_w960p", "front", CustomStreamTrack.VideoMode.MOTION, true, this, 3, BitrateMode.BALANCED)
```

```javascript
CustomStreamTrack customStreamTrack = VideoSDK.createCameraVideoTrack("h720p_w960p", "front", CustomStreamTrack.VideoMode.MOTION, true, this, 3, BitrateMode.BALANCED);
```

## Custom Audio Track - Android

- You can create an Audio Track using the `createAudioTrack()` method of `VideoSDK`.
- This method can be used to create an audio track with different encoding parameters.
### Parameters

- **encoderConfig**:

  - type: `String`
  - required: `true`
  - default: `speech_standard`
  - You can choose from the encoder config values listed below.

| Encoder Config      | Bitrate  | Auto Gain | Echo Cancellation | Noise Suppression |
| ------------------- | :------: | :-------: | :---------------: | :---------------: |
| speech_low_quality  | 16 kbps  |   TRUE    |       TRUE        |       TRUE        |
| speech_standard     | 24 kbps  |   TRUE    |       TRUE        |       TRUE        |
| music_standard      | 32 kbps  |   FALSE   |       FALSE       |       FALSE       |
| standard_stereo     | 64 kbps  |   FALSE   |       FALSE       |       FALSE       |
| high_quality        | 128 kbps |   FALSE   |       FALSE       |       FALSE       |
| high_quality_stereo | 192 kbps |   FALSE   |       FALSE       |       FALSE       |

- **context**

  - type: `Context`
  - required: `true`
  - Pass the Android Context for this parameter.

#### Returns

- `CustomStreamTrack`

### Example

```js
val audioCustomTrack: CustomStreamTrack = VideoSDK.createAudioTrack("speech_standard", this)
```

```js
CustomStreamTrack audioCustomTrack = VideoSDK.createAudioTrack("speech_standard", this);
```

## Custom Screen Share Track - Android

- You can create a Screen Share track using the `createScreenShareVideoTrack()` method of `VideoSDK`.
- This method creates a screen share track with custom encoding parameters.

### Parameters

- **encoderConfig**:

  - type: `String`
  - required: `true`
  - default: `h720p_15fps`
  - You can choose from the encoder config values listed below.

| Encoder Config | Resolution | Frame Rate |   Bitrate   |
| -------------- | :--------: | :--------: | :---------: |
| h360p_30fps    |  640x360   |   30 fps   | 200000 bps  |
| h720p_5fps     |  1280x720  |   5 fps    | 400000 bps  |
| h720p_15fps    |  1280x720  |   15 fps   | 1000000 bps |
| h1080p_15fps   | 1920x1080  |   15 fps   | 1500000 bps |
| h1080p_30fps   | 1920x1080  |   30 fps   | 1000000 bps |

:::note
The encoder configurations listed above are valid for both landscape and portrait mode.
::: - **data** - type: `Intent` - required: `true` - It is Intent received from onActivityResult when user provide permission for ScreenShare. - **context** - type: `Context` - required: `true` - Pass the Android Context for this parameter. - **listener** - type: `CustomTrackListener` - required: `true` - Callback to this listener will be made when track is ready with CustomTrack as parameter. ### Example ```javascript // data is received from onActivityResult method. VideoSDK.createScreenShareVideoTrack("h720p_15fps", data, this) { track -> meeting!!.enableScreenShare(track) } ``` ```javascript // data is received from onActivityResult method. VideoSDK.createScreenShareVideoTrack("h720p_15fps", data, this, (track)->{ meeting.enableScreenShare(track); }); ``` --- --- sidebar_position: 2 sidebar_label: Meeting Error Codes pagination_label: Meeting Error Codes title: Meeting Error Codes --- # Meeting Error Codes - Android If you encounter any of the errors listed below, refer to the [Developer Experience Guide](../../guide/best-practices/developer-experience.md#listen-for-error-events), which offers recommended solutions based on common error categories. import ServerErrorCodes from '../../../mdx/\_server-error-codes.mdx' import SDKErrorCodes from '../../data/\_sdk-error-codes.mdx' --- --- sidebar_position: 2 sidebar_label: Initializing a Meeting pagination_label: Initializing a Meeting title: Initializing a Meeting --- # Initializing a Meeting - Android
## initialize()

To initialize the meeting, you first have to initialize the `VideoSDK`. You can do this using the `initialize()` method provided by the SDK.

#### Parameters

- **context**: Context

#### Returns

- _`void`_

```js title="initialize"
VideoSDK.initialize(Context context)
```

---

## config()

Next, set the `token` property of the `VideoSDK` class using the `config()` method.

Please refer to this [documentation](/api-reference/realtime-communication/intro/) to generate a token.

#### Parameters

- **token**: String

#### Returns

- _`void`_

```js title="config"
VideoSDK.config(String token)
```

---

## initMeeting()

- Now, you can initialize the meeting using a factory method provided by the SDK called `initMeeting()`.

- `initMeeting()` will generate a new [`Meeting`](./meeting-class/introduction.md) class and the initiated meeting will be returned.

```js title="initMeeting"
VideoSDK.initMeeting(
  Context context,
  String meetingId,
  String name,
  boolean micEnabled,
  boolean webcamEnabled,
  String participantId,
  String mode,
  boolean multiStream,
  Map customTracks,
  JSONObject metaData,
  String signalingBaseUrl,
  PreferredProtocol preferredProtocol
)
```

## Parameters

### context

- Context of activity.

- type : Context

- `REQUIRED`

### meetingId

- Unique Id of the meeting that the participant will join.

- type : `String`

- `REQUIRED`

Please refer to this [documentation](/api-reference/realtime-communication/create-room) to create a room.

### name

- Name of the participant who will be joining the meeting. This name will be displayed to other participants in the same meeting.

- type : String

- `REQUIRED`

### micEnabled

- Whether the `mic` of the participant will be on while joining the meeting. If it is set to `false`, the mic of that participant will be `disabled` by default, but can be `enabled` or `disabled` later.
- type: `Boolean`

- `REQUIRED`

### webcamEnabled

- Whether the `webcam` of the participant will be on while joining the meeting. If it is set to `false`, the webcam of that participant will be `disabled` by default, but can be `enabled` or `disabled` later.

- type: `Boolean`

- `REQUIRED`

### participantId

- Unique Id of the participant. If you pass `null`, the SDK will generate an Id itself and use it.

- type : `String` or `null`

- `REQUIRED`

### Modes

There are three modes available:

- **`SEND_AND_RECV`**: In this mode, both audio and video streams will be produced and consumed.
- **`SIGNALLING_ONLY`**: In this mode, no audio or video streams will be produced or consumed. It is used solely for signaling.
- **`RECV_ONLY`**: This mode allows only the consumption of audio and video streams without producing any.

**Type**: `String` or `null`

**Default Value**: `SEND_AND_RECV`

import CautionMessage from '@site/src/theme/CautionMessage';

### multiStream

- It specifies whether the stream should send multiple resolution layers or a single resolution layer.

- type: `boolean`

- `REQUIRED`

### customTracks

- If you want to use custom tracks from the start of the meeting, you can pass a map of custom tracks in this parameter.

- type : `Map` or `null`

- `REQUIRED`

Please refer to this [documentation](../../guide/video-and-audio-calling-api-sdk/features/custom-track/custom-video-track) to know more about CustomTrack.

### metaData

- If you want to provide additional details about a user joining a meeting, such as their profile image, you can pass that information in this parameter.

- type: `JSONObject`

- `REQUIRED`

### signalingBaseUrl

- If you want to use a proxy server with the VideoSDK, you can specify your baseURL here.
- type: `String`

- `OPTIONAL`

:::note

If you intend to use a proxy server with the VideoSDK, inform us beforehand at support@videosdk.live

:::

### preferredProtocol

- If you want to provide a preferred network protocol for communication, you can specify it in `PreferredProtocol`, with options including `UDP_ONLY`, `UDP_OVER_TCP`, and `TCP_ONLY`.

- type: `PreferredProtocol`

- `OPTIONAL`

## Returns

### meeting

- After initializing the meeting, `initMeeting()` will return a new [`Meeting`](./meeting-class/introduction.md) instance.

---

## Example

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

```js title="initMeeting"
VideoSDK.initialize(applicationContext)

// Configure the token
VideoSDK.config(token) // pass the token generated from VideoSDK Dashboard

// Initialize the meeting
var meeting = VideoSDK.initMeeting(
  this@MainActivity, // context
  "abc-1234-xyz",    // meetingId
  "John Doe",        // name
  true,              // micEnabled
  true,              // webcamEnabled
  null,              // participantId
  null,              // mode
  false,             // multiStream
  null,              // customTracks
  null               // metaData
)
```

```js title="initMeeting"
VideoSDK.initialize(getApplicationContext());

// Configure the token
VideoSDK.config(token); // pass the token generated from VideoSDK Dashboard

// Initialize the meeting
Meeting meeting = VideoSDK.initMeeting(
  MainActivity.this, // context
  "abc-1234-xyz",    // meetingId
  "John Doe",        // name
  true,              // micEnabled
  true,              // webcamEnabled
  null,              // participantId
  null,              // mode
  false,             // multiStream
  null,              // customTracks
  null,              // metaData
  null               // signalingBaseUrl
);
```
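
The `mode` parameter above accepts one of three strings or `null`, with `SEND_AND_RECV` as the documented default. The following is a minimal sketch of how an app might normalize that value before calling `initMeeting()`. `MeetingModeSketch` and `resolveMode` are hypothetical names and not part of the VideoSDK API, and rejecting unknown strings is an assumption of this sketch rather than documented SDK behavior.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of the VideoSDK API.
class MeetingModeSketch {
    // The three mode strings listed in the Modes section.
    static final List<String> MODES =
            Arrays.asList("SEND_AND_RECV", "SIGNALLING_ONLY", "RECV_ONLY");

    // Mirrors the documented default: a null mode falls back to SEND_AND_RECV.
    static String resolveMode(String mode) {
        if (mode == null) {
            return "SEND_AND_RECV";
        }
        // Assumption of this sketch: fail fast on strings that are not listed.
        if (!MODES.contains(mode)) {
            throw new IllegalArgumentException("Unknown mode: " + mode);
        }
        return mode;
    }

    public static void main(String[] args) {
        System.out.println(resolveMode(null));
        System.out.println(resolveMode("RECV_ONLY"));
    }
}
```

Validating the string on the app side keeps a typo in the mode name from reaching the SDK call.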
--- --- title: MediaEffects library hide_title: true hide_table_of_contents: false description: The MediaEffects library enhances video applications by providing advanced media effects, including virtual backgrounds. sidebar_label: MediaEffects library pagination_label: MediaEffects library keywords: - Apply Virtual Background - Remove Virtual Background - Change Virtual Background image: img/videosdklive-thumbnail.jpg sidebar_position: 1 --- # MediaEffects library - Android
## Introduction

- The `MediaEffects` library enhances video applications with advanced media effects, including virtual backgrounds. It supports real-time processing and is optimized for Android devices.

- The `MediaEffects` library offers three classes to customize your video background: using a custom image, applying a blur effect, or choosing a solid color.

:::info
The Virtual Background feature in VideoSDK can be used regardless of the meeting environment, including the pre-call screen.
:::

## 1. BackgroundImageProcessor

- `BackgroundImageProcessor` sets a specified image as the background in a video stream, allowing you to customize the visual appearance of the video.

- The `BackgroundImageProcessor` class provides the following method.

  - `setBackgroundSource()` updates the virtual background by setting the new image that the user wants to switch to.

    - **Parameters**: `Uri`: An image URI for the background image.

    - **Return Type**: `void`

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

```js
val uri = Uri.parse("https://st.depositphotos.com/2605379/52364/i/450/depositphotos_523648932-stock-photo-concrete-rooftop-night-city-view.jpg")
val backgroundImageProcessor = BackgroundImageProcessor(uri) // Sets the background image

val newUri = Uri.parse("https://img.freepik.com/free-photo/plant-against-blue-wall-mockup_53876-96052.jpg?size=626&ext=jpg&ga=GA1.1.2008272138.1723420800&semt=ais_hybrid")
backgroundImageProcessor.setBackgroundSource(newUri) // Changes the background image
```

```java
// The first URI gets its own name so that `uri` is declared only once below.
Uri initialUri = Uri.parse("https://st.depositphotos.com/2605379/52364/i/450/depositphotos_523648932-stock-photo-concrete-rooftop-night-city-view.jpg");
BackgroundImageProcessor backgroundImageProcessor = new BackgroundImageProcessor(initialUri); // Sets the background image

Uri uri = Uri.parse("https://img.freepik.com/free-photo/plant-against-blue-wall-mockup_53876-96052.jpg?size=626&ext=jpg&ga=GA1.1.2008272138.1723420800&semt=ais_hybrid");
backgroundImageProcessor.setBackgroundSource(uri); //Changed background image ``` ## 2. BackgroundBlurProcessor - `BackgroundBlurProcessor` applies a blur effect to the video background, with the intensity controlled by a float value, creating a softened visual effect. - `BackgroundBlurProcessor` class provides following method. - `setBlurRadius()` method adjusts the blur effect on the video background, with the blur strength controlled by the specified float value. - **Parameters**: `Float`: representing the blur strength; higher values mean stronger blur. The supported range is 0 to 25. - **Return Type**: `void` ```js val backgroundBlurProcessor = BackgroundBlurProcessor(25, this) // Applies a blur with intensity 25 backgroundBlurProcessor.setBlurRadius(17) // Changes the blur intensity to 17 ``` ```java BackgroundBlurProcessor backgroundBlurProcessor = new BackgroundBlurProcessor(25, this);// Applies a blur with intensity 25 backgroundBlurProcessor.setBlurRadius(17); // changes the blur intensity to 17 ``` ## 3. BackgroundColorProcessor - `BackgroundColorProcessor` sets a solid color as the video background using a `Color` object, enabling you to create a uniform color backdrop for your video. - `BackgroundColorProcessor` class provides following method. - `setBackgroundColor()` method sets the color that user wants to switch to, for virtual background effect. - **Parameters**: `Integer`: Specifies the color for the virtual background. - **Return Type**: `void` ```js val backgroundColorProcessor = BackgroundColorProcessor(Color.BLUE) // Sets the background color to blue backgroundColorProcessor.setBackgroundColor(Color.CYAN) // Changes the background color to CYAN ``` ```java BackgroundColorProcessor backgroundColorProcessor = new BackgroundColorProcessor(Color.BLUE);// Sets the background color to blue backgroundColorProcessor.setBackgroundColor(Color.CYAN); // changed the background color to CYAN ```
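
`setBackgroundColor()` takes an `Integer`, and the examples pass constants such as `Color.BLUE`. The sketch below shows how a custom color could be packed into a single int, assuming the method expects the same packed ARGB layout that the `android.graphics.Color` constants use. The source does not state the layout, so treat that as an assumption. `ColorIntSketch` and `argb` are hypothetical names that avoid any Android dependency.

```java
// Hypothetical helper for illustration only; it is not part of MediaEffects.
class ColorIntSketch {
    // Packs alpha, red, green and blue (each 0..255) into one ARGB int,
    // the layout used by android.graphics.Color constants.
    static int argb(int alpha, int red, int green, int blue) {
        return ((alpha & 0xFF) << 24)
                | ((red & 0xFF) << 16)
                | ((green & 0xFF) << 8)
                | (blue & 0xFF);
    }

    public static void main(String[] args) {
        // Opaque blue, printed as an 8-digit hex string.
        System.out.println(String.format("%08X", argb(255, 0, 0, 255)));
    }
}
```

On Android, `Color.argb()` or `Color.parseColor()` would normally be used instead of a hand-written helper.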
--- --- sidebar_position: 1 sidebar_label: Introduction pagination_label: Intro to Video SDK Meeting Class title: Video SDK Meeting Class --- # Video SDK Meeting Class - Android
## Introduction The `Meeting` class includes properties, methods and meeting-event-listener-class for managing a meeting, participants, video, audio and share streams, messaging and UI customization. import LinksGrid from "../../../../src/theme/LinksGrid"; import properties from "./../data/meeting-class/properties.json"; import methods from "./../data/meeting-class/methods.json"; import events from "./../data/meeting-class/events.json"; ## Meeting Properties
- [getMeetingId()](/android/api/sdk-reference/meeting-class/properties#getmeetingid)
- [getLocalParticipant()](./properties#getlocalparticipant)
- [getConnectionState()](./properties#getconnectionstate)
- [getParticipants()](./properties#getparticipants)
- [pubSub](./properties#pubsub)
## Meeting Methods
- [join()](./methods#join)
- [leave()](./methods#leave)
- [end()](./methods#end)
- [enableWebcam()](./methods#enablewebcam)
- [disableWebcam()](./methods#disablewebcam)
- [unmuteMic()](./methods#unmutemic)
- [muteMic()](./methods#mutemic)
- [enableScreenShare()](./methods#enablescreenshare)
- [disableScreenShare()](./methods#disablescreenshare)
- [startRecording()](./methods#startrecording)
- [stopRecording()](./methods#stoprecording)
- [startLiveStream()](./methods#startlivestream)
- [stopLiveStream()](./methods#stoplivestream)
- [startHls()](./methods#starthls)
- [stopHls()](./methods#stophls)
- [startTranscription()](./methods#starttranscription)
- [stopTranscription()](./methods#stoptranscription)
- [changeMode()](./methods#changemode)
- [getMics()](./methods#getmics)
- [changeMic()](./methods#changemic)
- [setAudioDeviceChangeListener()](./methods#setaudiodevicechangelistener)
- [changeWebcam()](./methods#changewebcam)
- [uploadBase64File()](./methods#uploadbase64file)
- [fetchBase64File()](./methods#fetchbase64file)
- [addEventListener()](./methods#addeventlistener)
- [removeEventListener()](./methods#removeeventlistener)
- [removeAllListeners()](./methods#removealllisteners)
- [startWhiteboard()](./methods#startwhiteboard)
- [stopWhiteboard()](./methods#stopwhiteboard)
- [pauseAllStreams()](./methods#pauseallstreams)
- [resumeAllStreams()](./methods#resumeallstreams)
- [requestMediaRelay()](./methods#requestmediarelay)
- [stopMediaRelay()](./methods#stopmediarelay)
- [switchTo()](./methods#switchto)
## Meeting Events
- [onMeetingJoined](./meeting-event-listener-class#onmeetingjoined)
- [onMeetingLeft](./meeting-event-listener-class#onmeetingleft)
- [onParticipantJoined](./meeting-event-listener-class#onparticipantjoined)
- [onParticipantLeft](./meeting-event-listener-class#onparticipantleft)
- [onSpeakerChanged](./meeting-event-listener-class#onspeakerchanged)
- [onPresenterChanged](./meeting-event-listener-class#onpresenterchanged)
- [onEntryRequested](./meeting-event-listener-class#onentryrequested)
- [onEntryResponded](./meeting-event-listener-class#onentryresponded)
- [onWebcamRequested](./meeting-event-listener-class#onwebcamrequested)
- [onMicRequested](./meeting-event-listener-class#onmicrequested)
- [onRecordingStateChanged](./meeting-event-listener-class#onrecordingstatechanged)
- [onRecordingStarted](./meeting-event-listener-class#onrecordingstarted)
- [onRecordingStopped](./meeting-event-listener-class#onrecordingstopped)
- [onLivestreamStateChanged](./meeting-event-listener-class#onlivestreamstatechanged)
- [onLivestreamStarted](./meeting-event-listener-class#onlivestreamstarted)
- [onLivestreamStopped](./meeting-event-listener-class#onlivestreamstopped)
- [onHlsStateChanged](./meeting-event-listener-class#onhlsstatechanged)
- [onTranscriptionStateChanged](./meeting-event-listener-class#ontranscriptionstatechanged)
- [onTranscriptionText](./meeting-event-listener-class#ontranscriptiontext)
- [onExternalCallStarted](./meeting-event-listener-class#onexternalcallstarted)
- [onMeetingStateChanged](./meeting-event-listener-class#onmeetingstatechanged)
- [onParticipantModeChanged](./meeting-event-listener-class#onparticipantmodechanged)
- [onPinStateChanged()](./meeting-event-listener-class#onpinstatechanged)
- [onWhiteboardStarted()](./meeting-event-listener-class#onwhiteboardstarted)
- [onWhiteboardStopped()](./meeting-event-listener-class#onwhiteboardstopped)
- [onExternalCallRinging()](./meeting-event-listener-class#onexternalcallringing)
- [onExternalCallStarted()](./meeting-event-listener-class#onexternalcallstarted-1)
- [onExternalCallHangup()](./meeting-event-listener-class#onexternalcallhangup)
- [onPausedAllStreams()](./meeting-event-listener-class#onpausedallstreams)
- [onResumedAllStreams()](./meeting-event-listener-class#onresumedallstreams)
- [onMediaRelayRequestReceived()](./meeting-event-listener-class#onmediarelayrequestreceived)
- [onMediaRelayRequestResponse()](./meeting-event-listener-class#onmediarelayrequestresponse)
- [onMediaRelayStarted()](./meeting-event-listener-class#onmediarelaystarted)
- [onMediaRelayStopped()](./meeting-event-listener-class#onmediarelaystopped)
- [onMediaRelayError()](./meeting-event-listener-class#onmediarelayerror)
--- --- sidebar_position: 1 sidebar_label: MeetingEventListener Class pagination_label: MeetingEventListener Class title: MeetingEventListener Class --- # MeetingEventListener Class - Android
---

### implementation

- You can implement all the methods of the `MeetingEventListener` abstract class and add the listener to the `Meeting` class using its `addEventListener()` method.

#### Example

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

```javascript
private val meetingEventListener: MeetingEventListener = object : MeetingEventListener() {
  override fun onMeetingJoined() {
    Log.d("#meeting", "onMeetingJoined()")
  }
}
```

```javascript
private final MeetingEventListener meetingEventListener = new MeetingEventListener() {
  @Override
  public void onMeetingJoined() {
    Log.d("#meeting", "onMeetingJoined()");
  }
};
```

---

### onMeetingJoined()

- This event will be emitted when the [localParticipant](./properties#getlocalparticipant) has successfully joined the meeting.

#### Example

```javascript
override fun onMeetingJoined() {
  Log.d("#meeting", "onMeetingJoined()")
}
```

```javascript
@Override
public void onMeetingJoined() {
  Log.d("#meeting", "onMeetingJoined()");
}
```

---

### onMeetingLeft()

- This event will be emitted when the [localParticipant](./properties#getlocalparticipant) has left the meeting.

#### Example

```javascript
override fun onMeetingLeft() {
  Log.d("#meeting", "onMeetingLeft()")
}
```

```javascript
@Override
public void onMeetingLeft() {
  Log.d("#meeting", "onMeetingLeft()");
}
```

---

### onParticipantJoined()

- This event will be emitted when a new [participant](../participant-class/introduction) has joined the meeting.
#### Event callback parameters

- **participant**: [Participant](../participant-class/introduction)

#### Example

```javascript
override fun onParticipantJoined(participant: Participant) {
  Log.d("#meeting", participant.displayName + " joined")
}
```

```javascript
@Override
public void onParticipantJoined(Participant participant) {
  Log.d("#meeting", participant.getDisplayName() + " joined");
}
```

---

### onParticipantLeft()

- This event will be emitted when a joined [participant](../participant-class/introduction) has left the meeting.

#### Event callback parameters

- **participant**: [Participant](../participant-class/introduction)

#### Example

```javascript
override fun onParticipantLeft(participant: Participant) {
  Log.d("#meeting", participant.displayName + " left")
}
```

```javascript
@Override
public void onParticipantLeft(Participant participant) {
  Log.d("#meeting", participant.getDisplayName() + " left");
}
```

---

### onSpeakerChanged()

- This event will be emitted when the active speaker changes.
- Use this event to find out which participant is actively speaking.
- If no participant is actively speaking, this event will pass `null` as an event callback parameter.

#### Event callback parameters

- **participantId**: String

#### Example

```javascript
override fun onSpeakerChanged(participantId: String?) {
  //
}
```

```javascript
@Override
public void onSpeakerChanged(String participantId) {
  //
}
```

---

### onPresenterChanged()

- This event will be emitted when any [participant](../participant-class/introduction) starts or stops screen sharing.
- It will pass `participantId` as an event callback parameter.
- If a participant stops screen sharing, this event will pass `null` as an event callback parameter.
#### Event callback parameters

- **participantId**: String

#### Example

```javascript
// participantId is nullable because null is passed when screen sharing stops
override fun onPresenterChanged(participantId: String?) {
  //
}
```

```javascript
@Override
public void onPresenterChanged(String participantId) {
  //
}
```

---

### onEntryRequested()

- This event will be emitted when a new [participant](../participant-class/introduction) who has the **`ask_join`** permission in their token tries to join the meeting.
- This event will only be emitted to the [participants](./properties#getparticipants) in the meeting who have the **`allow_join`** permission in their token.
- This event will pass the following as event parameters: the `participantId` and `name` of the new participant who is trying to join the meeting, and `allow()` and `deny()` to take the required action.

#### Event callback parameters

- **peerId**: String
- **name**: String

#### Example

```javascript
override fun onEntryRequested(id: String?, name: String?) {
  //
}
```

```javascript
@Override
public void onEntryRequested(String id, String name) {
  //
}
```

---

### onEntryResponded()

- This event will be emitted when the `join()` request has been responded to.
- This event will be emitted to the [participants](./properties#getparticipants) in the meeting who have the **`allow_join`** permission in their token.
- This event will also be emitted to the [participant](../participant-class/introduction) who requested to join the meeting.

#### Event callback parameters

- **participantId**: _String_
- **decision**: _"allowed"_ | _"denied"_

#### Example

```javascript
override fun onEntryResponded(id: String?, decision: String?) {
  //
}
```

```javascript
@Override
public void onEntryResponded(String id, String decision) {
  //
}
```

---

### onWebcamRequested()

- This event will be emitted to participant `B` when any other participant `A` requests to enable the webcam of participant `B`.
- On accepting the request, the webcam of participant `B` will be enabled.
#### Event callback parameters - **participantId**: String - **listener**: WebcamRequestListener \{ **accept**: Method; **reject**: Method } #### Example ```javascript override fun onWebcamRequested(participantId: String, listener: WebcamRequestListener) { // if accept request listener.accept() // if reject request listener.reject() } ``` ```javascript @Override public void onWebcamRequested(String participantId, WebcamRequestListener listener) { // if accept request listener.accept(); // if reject request listener.reject(); } ``` ### onMicRequested() - This event will be emitted to the participant `B` when any other participant `A` requests to enable mic of participant `B`. - On accepting the request, mic of participant `B` will be enabled. #### Event callback parameters - **participantId**: String - **listener**: MicRequestListener \{ **accept**: Method; **reject**: Method } #### Example ```javascript override fun onMicRequested(participantId: String, listener: MicRequestListener) { // if accept request listener.accept() // if reject request listener.reject() } ``` ```javascript @Override public void onMicRequested(String participantId, MicRequestListener listener) { // if accept request listener.accept(); // if reject request listener.reject(); } ``` --- ### onRecordingStateChanged() - This event will be emitted when the meeting's recording status changed. #### Event callback parameters - **recordingState**: String `recordingState` has following values - `RECORDING_STARTING` - Recording is in starting phase and hasn't started yet. - `RECORDING_STARTED` - Recording has started successfully. - `RECORDING_STOPPING` - Recording is in stopping phase and hasn't stopped yet. - `RECORDING_STOPPED` - Recording has stopped successfully. 
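
The four `recordingState` strings listed above arrive as plain `String` values. The following is a minimal sketch of mapping them to an enum so the rest of an app does not compare raw strings. `RecordingStateSketch`, its `RecordingState` enum, and the `UNKNOWN` fallback are hypothetical and not part of the SDK.

```java
// Hypothetical helper for illustration only; it is not part of the VideoSDK API.
class RecordingStateSketch {
    enum RecordingState { STARTING, STARTED, STOPPING, STOPPED, UNKNOWN }

    // Maps the documented state strings to enum values.
    // UNKNOWN is an assumption of this sketch for null or unlisted strings.
    static RecordingState parse(String raw) {
        if (raw == null) {
            return RecordingState.UNKNOWN;
        }
        switch (raw) {
            case "RECORDING_STARTING": return RecordingState.STARTING;
            case "RECORDING_STARTED":  return RecordingState.STARTED;
            case "RECORDING_STOPPING": return RecordingState.STOPPING;
            case "RECORDING_STOPPED":  return RecordingState.STOPPED;
            default:                   return RecordingState.UNKNOWN;
        }
    }

    // True while recording is between two stable states, for example to
    // disable a record button until the transition finishes.
    static boolean isTransitional(RecordingState state) {
        return state == RecordingState.STARTING || state == RecordingState.STOPPING;
    }

    public static void main(String[] args) {
        RecordingState state = parse("RECORDING_STARTING");
        System.out.println(state + " transitional=" + isTransitional(state));
    }
}
```

Inside `onRecordingStateChanged()`, the callback's string argument could be passed straight to `parse()`.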
#### Example ```javascript override fun onRecordingStateChanged(recordingState: String) { when (recordingState) { "RECORDING_STARTING" -> { Log.d("onRecordingStateChanged", "Meeting recording is starting") } "RECORDING_STARTED" -> { Log.d("onRecordingStateChanged", "Meeting recording is started") } "RECORDING_STOPPING" -> { Log.d("onRecordingStateChanged", "Meeting recording is stopping") } "RECORDING_STOPPED" -> { Log.d("onRecordingStateChanged", "Meeting recording is stopped") } } } ``` ```javascript @Override public void onRecordingStateChanged(String recordingState) { switch (recordingState) { case "RECORDING_STARTING": Log.d("onRecordingStateChanged", "Meeting recording is starting"); break; case "RECORDING_STARTED": Log.d("onRecordingStateChanged", "Meeting recording is started"); break; case "RECORDING_STOPPING": Log.d("onRecordingStateChanged", "Meeting recording is stopping"); break; case "RECORDING_STOPPED": Log.d("onRecordingStateChanged", "Meeting recording is stopped"); break; } } ``` --- ### onRecordingStarted() _`This event will be deprecated soon`_ - This event will be emitted when recording of the meeting is started. #### Example ```javascript override fun onRecordingStarted() { // } ``` ```javascript @Override public void onRecordingStarted() { // } ``` --- ### onRecordingStopped() _`This event will be deprecated soon`_ - This event will be emitted when recording of the meeting is stopped. #### Example ```javascript override fun onRecordingStopped() { // } ``` ```javascript @Override public void onRecordingStopped() { // } ``` --- ### onLivestreamStateChanged() - This event will be emitted when the meeting's livestream status changed. #### Event callback parameters - **livestreamState**: String `livestreamState` has following values - `LIVESTREAM_STARTING` - Livestream is in starting phase and hasn't started yet. - `LIVESTREAM_STARTED` - Livestream has started successfully. 
- `LIVESTREAM_STOPPING` - Livestream is in stopping phase and hasn't stopped yet. - `LIVESTREAM_STOPPED` - Livestream has stopped successfully. #### Example ```javascript override fun onLivestreamStateChanged(livestreamState: String?) { when (livestreamState) { "LIVESTREAM_STARTING" -> Log.d( "LivestreamStateChanged", "Meeting livestream is starting" ) "LIVESTREAM_STARTED" -> Log.d( "LivestreamStateChanged", "Meeting livestream is started" ) "LIVESTREAM_STOPPING" -> Log.d("LivestreamStateChanged", "Meeting livestream is stopping" ) "LIVESTREAM_STOPPED" -> Log.d("LivestreamStateChanged", "Meeting livestream is stopped" ) } } ``` ```javascript @Override public void onLivestreamStateChanged(String livestreamState) { switch (livestreamState) { case "LIVESTREAM_STARTING": Log.d("LivestreamStateChanged", "Meeting livestream is starting"); break; case "LIVESTREAM_STARTED": Log.d("LivestreamStateChanged", "Meeting livestream is started"); break; case "LIVESTREAM_STOPPING": Log.d("LivestreamStateChanged", "Meeting livestream is stopping"); break; case "LIVESTREAM_STOPPED": Log.d("LivestreamStateChanged", "Meeting livestream is stopped"); break; } } ``` --- ### onLivestreamStarted() _`This event will be deprecated soon`_ - This event will be emitted when `RTMP` live stream of the meeting is started. #### Example ```javascript override fun onLivestreamStarted() { // } ``` ```javascript @Override public void onLivestreamStarted() { // } ``` --- ### onLivestreamStopped() _`This event will be deprecated soon`_ - This event will be emitted when `RTMP` live stream of the meeting is stopped. #### Example ```javascript override fun onLivestreamStopped() { // } ``` ```javascript @Override public void onLivestreamStopped() { // } ``` --- ### onHlsStateChanged() - This event will be emitted when the meeting's HLS(Http Livestreaming) status changed. 
#### Event callback parameters

- **HlsState**: \{ **status**: String}

  - `status` has the following values:
    - `HLS_STARTING` - HLS is in the starting phase and hasn't started yet.
    - `HLS_STARTED` - HLS has started successfully.
    - `HLS_PLAYABLE` - HLS is playable now.
    - `HLS_STOPPING` - HLS is in the stopping phase and hasn't stopped yet.
    - `HLS_STOPPED` - HLS has stopped successfully.

  - When you receive the `HLS_PLAYABLE` status, you will receive 2 URLs in the response:
    - `playbackHlsUrl` - Live HLS with playback support
    - `livestreamUrl` - Live HLS without playback support

:::note

`downstreamUrl` is now deprecated. Use `playbackHlsUrl` or `livestreamUrl` in place of `downstreamUrl`

:::

#### Example

```javascript
override fun onHlsStateChanged(HlsState: JSONObject) {
  when (HlsState.getString("status")) {
    "HLS_STARTING" -> Log.d("onHlsStateChanged", "Meeting hls is starting")
    "HLS_STARTED" -> Log.d("onHlsStateChanged", "Meeting hls is started")
    "HLS_PLAYABLE" -> {
      Log.d("onHlsStateChanged", "Meeting hls is playable now")
      // on hls playable you will receive playbackHlsUrl and livestreamUrl
      val playbackHlsUrl = HlsState.getString("playbackHlsUrl")
      val livestreamUrl = HlsState.getString("livestreamUrl")
    }
    "HLS_STOPPING" -> Log.d("onHlsStateChanged", "Meeting hls is stopping")
    "HLS_STOPPED" -> Log.d("onHlsStateChanged", "Meeting hls is stopped")
  }
}
```

```javascript
@Override
public void onHlsStateChanged(JSONObject HlsState) {
  // JSONObject.getString() throws the checked JSONException, so it is handled here
  try {
    switch (HlsState.getString("status")) {
      case "HLS_STARTING":
        Log.d("onHlsStateChanged", "Meeting hls is starting");
        break;
      case "HLS_STARTED":
        Log.d("onHlsStateChanged", "Meeting hls is started");
        break;
      case "HLS_PLAYABLE":
        Log.d("onHlsStateChanged", "Meeting hls is playable now");
        // on hls playable you will receive playbackHlsUrl and livestreamUrl
        String playbackHlsUrl = HlsState.getString("playbackHlsUrl");
        String livestreamUrl = HlsState.getString("livestreamUrl");
        break;
      case "HLS_STOPPING":
        Log.d("onHlsStateChanged", "Meeting hls is stopping");
        break;
      case "HLS_STOPPED":
        Log.d("onHlsStateChanged", "Meeting hls is stopped");
        break;
    }
  } catch (JSONException e) {
    e.printStackTrace();
  }
}
```

---

### onTranscriptionStateChanged()

- This event will be triggered whenever the state of realtime transcription changes.

#### Event callback parameters

- **data**: \{ **status**: String, **id**: String }
  - **status**: String
  - **id**: String

`status` has the following values:

- `TRANSCRIPTION_STARTING` - Realtime Transcription is in the starting phase and hasn't started yet.
- `TRANSCRIPTION_STARTED` - Realtime Transcription has started successfully.
- `TRANSCRIPTION_STOPPING` - Realtime Transcription is in the stopping phase and hasn't stopped yet.
- `TRANSCRIPTION_STOPPED` - Realtime Transcription has stopped successfully.

#### Example

```javascript
override fun onTranscriptionStateChanged(data: JSONObject) {
  //Status can be :: TRANSCRIPTION_STARTING
  //Status can be :: TRANSCRIPTION_STARTED
  //Status can be :: TRANSCRIPTION_STOPPING
  //Status can be :: TRANSCRIPTION_STOPPED
  val status = data.getString("status")
  Log.d("MeetingActivity", "Transcription status: $status")
}
```

```javascript
@Override
public void onTranscriptionStateChanged(JSONObject data) {
  //Status can be :: TRANSCRIPTION_STARTING
  //Status can be :: TRANSCRIPTION_STARTED
  //Status can be :: TRANSCRIPTION_STOPPING
  //Status can be :: TRANSCRIPTION_STOPPED
  try {
    String status = data.getString("status");
    Log.d("MeetingActivity", "Transcription status: " + status);
  } catch (JSONException e) {
    e.printStackTrace();
  }
}
```

---

### onTranscriptionText()

- This event will be emitted when text from the running realtime transcription is received.
#### Event callback parameters

- **data**: TranscriptionText
  - **TranscriptionText.participantId**: String
  - **TranscriptionText.participantName**: String
  - **TranscriptionText.text**: String
  - **TranscriptionText.timestamp**: int
  - **TranscriptionText.type**: String

#### Example

```kotlin
override fun onTranscriptionText(data: TranscriptionText) {
    val participantId = data.participantId
    val participantName = data.participantName
    val text = data.text
    val timestamp = data.timestamp
    val type = data.type
    Log.d("MeetingActivity", "$participantName: $text $timestamp")
}
```

```java
@Override
public void onTranscriptionText(TranscriptionText data) {
    String participantId = data.getParticipantId();
    String participantName = data.getParticipantName();
    String text = data.getText();
    int timestamp = data.getTimestamp();
    String type = data.getType();
    Log.d("MeetingActivity", participantName + ": " + text + " " + timestamp);
}
```

---

### onWhiteboardStarted()

- This event is triggered when the whiteboard is successfully started.

#### Event callback parameters

- **url**: String

#### Example

```kotlin
override fun onWhiteboardStarted(url: String) {
    super.onWhiteboardStarted(url)
    //...
}
```

```java
@Override
public void onWhiteboardStarted(String url) {
    super.onWhiteboardStarted(url);
    //...
}
```

---

### onWhiteboardStopped()

- This event is triggered when the whiteboard session is successfully terminated.

#### Example

```kotlin
override fun onWhiteboardStopped() {
    super.onWhiteboardStopped()
    //...
}
```

```java
@Override
public void onWhiteboardStopped() {
    super.onWhiteboardStopped();
    //...
}
```

---

### onExternalCallStarted()

- This event is emitted when the local participant receives an incoming call.

#### Example

```kotlin
override fun onExternalCallStarted() {
    //
}
```

```java
@Override
public void onExternalCallStarted() {
    //
}
```

---

### onMeetingStateChanged()

- This event is emitted when the state of the meeting changes.
- It passes **`state`** as an event callback parameter, which indicates the current state of the meeting.
- All available states are `CONNECTING`, `CONNECTED`, `RECONNECTING`, `DISCONNECTED`.

#### Event callback parameters

- **state**: MeetingState

#### Example

```kotlin
override fun onMeetingStateChanged(state: ConnectionState) {
    super.onMeetingStateChanged(state)
    Log.d("TAG", "onMeetingStateChanged: $state")
}
```

```java
@Override
public void onMeetingStateChanged(ConnectionState state) {
    super.onMeetingStateChanged(state);
    Log.d("TAG", "onMeetingStateChanged: " + state);
}
```

---

### onExternalCallRinging()

This callback is triggered when the user's phone starts ringing, whether it’s a traditional phone call or a VoIP call (e.g., WhatsApp). This event allows us to detect when the user is receiving an external call.

#### Example

```kotlin
override fun onExternalCallRinging() {
    Log.d("#meeting", "onExternalCallRinging: User phone is ringing")
}
```

```java
@Override
public void onExternalCallRinging() {
    Log.d("#meeting", "onExternalCallRinging: User phone is ringing");
}
```

---

### onExternalCallStarted()

This callback is triggered when the user answers an external phone call, whether it’s a traditional phone call or a VoIP call (e.g., WhatsApp). This event allows us to detect when the user has started a call.

#### Example

```kotlin
override fun onExternalCallStarted() {
    Log.d("#meeting", "onExternalCallStarted: User call is answered")
}
```

```java
@Override
public void onExternalCallStarted() {
    Log.d("#meeting", "onExternalCallStarted: User call is answered");
}
```

---

### onExternalCallHangup()

This callback is triggered when an external call ends, whether it’s a traditional phone call or a VoIP call (e.g., WhatsApp).
This event detects when a call has ended.

#### Example

```kotlin
override fun onExternalCallHangup() {
    Log.d("#meeting", "onExternalCallHangup: User call ends")
}
```

```java
@Override
public void onExternalCallHangup() {
    Log.d("#meeting", "onExternalCallHangup: User call ends");
}
```

---

### onPausedAllStreams()

- This callback is triggered when all or specified media streams within the meeting are successfully paused.

#### Parameters

- **`kind`**: Specifies the type of media stream that was paused.
  - **Type**: `String`
  - **Possible values**:
    - `"audio"`: Indicates that audio streams have been paused.
    - `"video"`: Indicates that video streams have been paused.
    - `"share"`: Indicates that screen-sharing video streams have been paused.

#### Example

```kotlin
override fun onPausedAllStreams(kind: String) {
    Log.d("TAG", "onPausedAllStreams: $kind")
    super.onPausedAllStreams(kind)
}
```

```java
@Override
public void onPausedAllStreams(String kind) {
    Log.d("TAG", "onPausedAllStreams: " + kind);
    super.onPausedAllStreams(kind);
}
```

---

### onResumedAllStreams()

- This callback is triggered when all or specified media streams within the meeting are successfully resumed.

#### Parameters

- **`kind`**: Specifies the type of media stream that was resumed.
  - **Type**: `String`
  - **Possible values**:
    - `"audio"`: Indicates that audio streams have been resumed.
    - `"video"`: Indicates that video streams have been resumed.
    - `"share"`: Indicates that screen-sharing video streams have been resumed.

#### Example

```kotlin
override fun onResumedAllStreams(kind: String) {
    Log.d("TAG", "onResumedAllStreams: $kind")
    super.onResumedAllStreams(kind)
}
```

```java
@Override
public void onResumedAllStreams(String kind) {
    Log.d("TAG", "onResumedAllStreams: " + kind);
    super.onResumedAllStreams(kind);
}
```

---

### onParticipantModeChanged()

This event is triggered when a participant's mode is updated.
It passes `data` as an event callback parameter, whose `mode` takes one of the following values:

- **`SEND_AND_RECV`**: Both audio and video streams will be produced and consumed.
- **`SIGNALLING_ONLY`**: Audio and video streams will not be produced or consumed. It is used solely for signaling.
- **`RECV_ONLY`**: Audio and video streams will only be consumed, without producing any.

#### Event Callback Parameters

- **data**: `{ mode: String, participantId: String }`
  - **mode**: `String`
  - **participantId**: `String`

#### Example

```kotlin
override fun onParticipantModeChanged(data: JSONObject?) {
    //...
}
```

```java
@Override
public void onParticipantModeChanged(JSONObject data) {
    //...
}
```

---

### onPinStateChanged()

- This event is triggered when any participant is pinned or unpinned by any participant.

#### Event callback parameters

- **pinStateData**: \{ **peerId**: String, **state**: JSONObject, **pinnedBy**: String }
  - **peerId**: String
  - **state**: JSONObject
  - **pinnedBy**: String

#### Example

```kotlin
override fun onPinStateChanged(pinStateData: JSONObject?) {
    val data = pinStateData ?: return // the payload is nullable, so bail out early
    Log.d("onPinStateChanged: ", data.getString("peerId")) // id of the participant who was pinned
    Log.d("onPinStateChanged: ", data.getJSONObject("state").toString()) // { cam: true, share: true }
    Log.d("onPinStateChanged: ", data.getString("pinnedBy")) // id of the participant who pinned that participant
}
```

```java
@Override
public void onPinStateChanged(JSONObject pinStateData) {
    try {
        Log.d("onPinStateChanged: ", pinStateData.getString("peerId")); // id of the participant who was pinned
        Log.d("onPinStateChanged: ", pinStateData.getJSONObject("state").toString()); // { cam: true, share: true }
        Log.d("onPinStateChanged: ", pinStateData.getString("pinnedBy")); // id of the participant who pinned that participant
    } catch (JSONException e) {
        // getString() and getJSONObject() throw the checked JSONException when a key is missing
        e.printStackTrace();
    }
}
```

---

### onMediaRelayRequestReceived()

- This callback is triggered when a request is received for media relay in the destination meeting.

#### Event callback parameters

- **`participantId - (String)`**: Specifies the participantId who requested the media relay.
- **`meetingId - (String)`**: Specifies the meeting from where the media relay request was made.
- **`listener - (RelayRequestListener)`**: A callback interface with the following methods:
  - **accept()**: Call this to approve the media relay request.
  - **reject()**: Call this to deny the media relay request.

#### Example

```kotlin
override fun onMediaRelayRequestReceived(participantId: String, liveStreamId: String, listener: RelayRequestListener) {
    // If accepting the request
    listener.accept()
    // If rejecting the request
    listener.reject()
}
```

```java
@Override
public void onMediaRelayRequestReceived(String participantId, String liveStreamId, RelayRequestListener listener) {
    // if accept request
    listener.accept();
    // if reject request
    listener.reject();
}
```

---

### onMediaRelayRequestResponse()

- This callback is triggered when a response is received for a media relay request in the source meeting.
#### Event callback parameters

- **`participantId - (String)`**: Specifies the participantId who responded to the request for the media relay.
- **`decision - (String)`**: Specifies the decision whether the request for media relay was accepted or not.

#### Example

```kotlin
override fun onMediaRelayRequestResponse(participantId: String, decision: String) {
    super.onMediaRelayRequestResponse(participantId, decision)
    Log.d("MediaRelay", "Participant ID: $participantId, Decision: $decision")
}
```

```java
@Override
public void onMediaRelayRequestResponse(String participantId, String decision) {
    super.onMediaRelayRequestResponse(participantId, decision);
    Log.d("MediaRelay", "Participant ID: " + participantId + ", Decision: " + decision);
}
```

---

### onMediaRelayStarted()

- This callback is triggered when the media relay to the destination meeting successfully starts.

#### Parameters

- **`meetingId - (String)`**: Specifies the meeting where the media relay started.

#### Example

```kotlin
override fun onMediaRelayStarted(relayMeetingId: String) {
    super.onMediaRelayStarted(relayMeetingId)
    Log.d("MediaRelay", "Media relay started to meeting ID: $relayMeetingId")
}
```

```java
@Override
public void onMediaRelayStarted(String relayMeetingId) {
    super.onMediaRelayStarted(relayMeetingId);
    Log.d("MediaRelay", "Media relay started to meeting ID: " + relayMeetingId);
}
```

---

### onMediaRelayStopped()

- This callback is triggered when the media relay to the destination meeting stops for any reason.

#### Parameters

- **`meetingId - (String)`**: Specifies the meeting where the media relay stopped.
- **`reason - (String)`**: Specifies the reason why the media relay stopped.

#### Example

```kotlin
override fun onMediaRelayStopped(meetingId: String, reason: String) {
    super.onMediaRelayStopped(meetingId, reason)
    Log.d("MediaRelay", "Media relay stopped for meeting ID: $meetingId, Reason: $reason")
}
```

```java
@Override
public void onMediaRelayStopped(String meetingId, String reason) {
    super.onMediaRelayStopped(meetingId, reason);
    Log.d("MediaRelay", "Media relay stopped for meeting ID: " + meetingId + ", Reason: " + reason);
}
```

---

### onMediaRelayError()

- This callback is triggered when an error occurs during media relay to the destination meeting.

#### Parameters

- **`meetingId - (String)`**: Specifies the meeting where the media relay error occurred.
- **`error - (String)`**: Specifies the error that occurred.

#### Example

```kotlin
override fun onMediaRelayError(meetingId: String, error: String) {
    super.onMediaRelayError(meetingId, error)
    Log.e("MediaRelay", "Media relay error for meeting ID: $meetingId, Error: $error")
}
```

```java
@Override
public void onMediaRelayError(String meetingId, String error) {
    super.onMediaRelayError(meetingId, error);
    Log.e("MediaRelay", "Media relay error for meeting ID: " + meetingId + ", Error: " + error);
}
```

---
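The media relay callbacks above arrive separately, so an app that relays to several meetings may want one place to look up the latest outcome per destination. The following is a minimal sketch in plain Java, not part of the VideoSDK API: `MediaRelayTracker`, `RelayState`, and all of its method names are hypothetical, and treating an error as a `FAILED` state is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not an SDK class) that mirrors relay callbacks into a lookup table. */
final class MediaRelayTracker {
    enum RelayState { STARTED, STOPPED, FAILED }

    private final Map<String, RelayState> states = new HashMap<>();
    private final Map<String, String> details = new HashMap<>();

    // Intended to be called from onMediaRelayStarted(meetingId).
    void onStarted(String meetingId) {
        states.put(meetingId, RelayState.STARTED);
        details.remove(meetingId); // a fresh start clears any earlier reason or error
    }

    // Intended to be called from onMediaRelayStopped(meetingId, reason).
    void onStopped(String meetingId, String reason) {
        states.put(meetingId, RelayState.STOPPED);
        details.put(meetingId, reason);
    }

    // Intended to be called from onMediaRelayError(meetingId, error).
    // Assumption: an error is recorded as FAILED; the SDK itself defines no such state.
    void onError(String meetingId, String error) {
        states.put(meetingId, RelayState.FAILED);
        details.put(meetingId, error);
    }

    /** Last recorded state, or null if no callback was seen for this meeting. */
    RelayState stateOf(String meetingId) {
        return states.get(meetingId);
    }

    /** Stop reason or error text from the latest callback, or null if there is none. */
    String detailOf(String meetingId) {
        return details.get(meetingId);
    }
}
```

Each listener override would forward its arguments to the matching tracker method, and UI code could then query `stateOf()` instead of keeping its own flags.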
---

---
sidebar_position: 1
sidebar_label: Methods
pagination_label: Meeting Class Methods
title: Meeting Class Methods
---

# Meeting Class Methods - Android
### join()

- It is used to join a meeting.
- [`initMeeting()`](../initMeeting) returns a new instance of [Meeting](./introduction), but it does not join the meeting automatically. To join the meeting, call `join()`.

#### Events associated with `join()`:

- The local participant will receive an [`onMeetingJoined`](./meeting-event-listener-class#onmeetingjoined) event when successfully joined.
- Remote participants will receive an [`onParticipantJoined`](./meeting-event-listener-class#onparticipantjoined) event with the newly joined [`Participant`](../participant-class/introduction) object from the event callback.

#### Participant having `ask_join` permission inside token

- If a token contains the permission `ask_join`, then the participant will not join the meeting directly after calling `join()`. Instead, an event called [`onEntryRequested`](./meeting-event-listener-class#onentryrequested) will be emitted to the participant having the permission `allow_join`.

- After the decision from the remote participant, an event called [`onEntryResponded`](./meeting-event-listener-class#onentryresponded) will be emitted to the requesting participant. This event will contain the decision made by the remote participant.

#### Participant having `allow_join` permission inside token

- If a token contains the permission `allow_join`, then the participant will join the meeting directly after calling `join()`.

#### Returns

- _`void`_

---

### leave()

- It is used to leave the current meeting.

#### Events associated with `leave()`:

- The local participant will receive an [`onMeetingLeft`](./meeting-event-listener-class#onmeetingleft) event.
- All remote participants will receive an [`onParticipantLeft`](./meeting-event-listener-class#onparticipantleft) event with `participantId`.

#### Returns

- _`void`_

---

### end()

- It is used to end the current running session.
- By calling `end()`, all joined [participants](properties#getparticipants), including [localParticipant](./properties.md#getlocalparticipant), of that session will leave the meeting.

#### Events associated with `end()`:

- All [participants](./properties.md#getparticipants), including [localParticipant](./properties.md#getlocalparticipant), will receive the [`onMeetingLeft`](./meeting-event-listener-class#onmeetingleft) event.

#### Returns

- _`void`_

---

### enableWebcam()

- It is used to enable self camera.
- [`onStreamEnabled`](../participant-class/participant-event-listener-class#onstreamenabled) event of `ParticipantEventListener` will be emitted with [`stream`](../stream-class/introduction) object from the event callback.

#### Returns

- _`void`_

---

### disableWebcam()

- It is used to disable self camera.
- [`onStreamDisabled`](../participant-class/participant-event-listener-class#onstreamdisabled) event of `ParticipantEventListener` will be emitted with [`stream`](../stream-class/introduction) object from the event callback.

#### Returns

- _`void`_

---

### unmuteMic()

- It is used to enable self microphone.
- [`onStreamEnabled`](../participant-class/participant-event-listener-class#onstreamenabled) event of `ParticipantEventListener` will be emitted with [`stream`](../stream-class/introduction) object from the event callback.

#### Returns

- _`void`_

---

### muteMic()

- It is used to disable self microphone.
- [`onStreamDisabled`](../participant-class/participant-event-listener-class#onstreamdisabled) event of `ParticipantEventListener` will be emitted with [`stream`](../stream-class/introduction) object from the event callback.

#### Returns

- _`void`_

---

### enableScreenShare()

- It is used to enable screen-sharing.
- [`onStreamEnabled`](../participant-class/participant-event-listener-class#onstreamenabled) event of `ParticipantEventListener` will be emitted with [`stream`](../stream-class/introduction) object from the event callback.
- [`onPresenterChanged()`](./meeting-event-listener-class#onpresenterchanged) event will be triggered for all participants with `participantId`.

#### Parameters

- **data**: Intent

#### Returns

- _`void`_

---

### disableScreenShare()

- It is used to disable screen-sharing.
- [`onStreamDisabled`](../participant-class/participant-event-listener-class#onstreamdisabled) event of `ParticipantEventListener` will be emitted with [`stream`](../stream-class/introduction) object from the event callback.
- [`onPresenterChanged()`](./meeting-event-listener-class#onpresenterchanged) event will be triggered for all participants with `null`.

#### Returns

- _`void`_

---

### uploadBase64File()

- It is used to upload your file to VideoSDK's temporary storage.

- `base64Data`: convert your file to base64 and pass it here.

- `token`: pass your VideoSDK token. Read more about tokens [here](/android/guide/video-and-audio-calling-api-sdk/authentication-and-token)

- `fileName`: provide your file name with its extension.

- `TaskCompletionListener` will handle the result of the upload operation.

  - When the upload is complete, the `onComplete()` method of `TaskCompletionListener` will provide the corresponding `fileUrl`, which can be used to retrieve the uploaded file.

  - If an error occurs during the upload process, the `onError()` method of `TaskCompletionListener` will provide the error details.

#### Parameters

- **base64Data**: String
- **token**: String
- **fileName**: String
- **listener**: TaskCompletionListener

#### Returns

- _`void`_

#### Example

```kotlin
private fun uploadFile() {
    val base64Data = "" // Convert your file to base64 and pass here
    val token = ""
    val fileName = "myImage.jpeg" // Provide name with extension here
    meeting!!.uploadBase64File(
        base64Data,
        token,
        fileName,
        object : TaskCompletionListener {
            override fun onComplete(data: String?) {
                Log.d("VideoSDK", "Uploaded file url: $data")
            }

            override fun onError(error: String?) {
                Log.d("VideoSDK", "Error in upload file: $error")
            }
        }
    )
}
```

```java
private void uploadFile() {
    String base64Data = ""; // Convert your file to base64 and pass here
    String token = "";
    String fileName = "myImage.jpeg"; // Provide name with extension here
    meeting.uploadBase64File(base64Data, token, fileName, new TaskCompletionListener() {
        @Override
        public void onComplete(@Nullable String data) {
            Log.d("VideoSDK", "Uploaded file url: " + data);
        }

        @Override
        public void onError(@Nullable String error) {
            Log.d("VideoSDK", "Error in upload file: " + error);
        }
    });
}
```

---

### fetchBase64File()

- It is used to retrieve your file from VideoSDK's temporary storage.

- `url`: pass the `fileUrl` which is returned by `uploadBase64File()`.

- `token`: pass your VideoSDK token. Read more about tokens [here](/android/guide/video-and-audio-calling-api-sdk/authentication-and-token)

- `TaskCompletionListener` will handle the result of the fetch operation.

  - When the fetch operation is complete, the `onComplete()` method of `TaskCompletionListener` will provide the file in `base64` format.

  - If an error occurs during the fetch operation, the `onError()` method of `TaskCompletionListener` will provide the error details.

#### Parameters

- **url**: String
- **token**: String
- **listener**: TaskCompletionListener

#### Returns

- _`void`_

#### Example

```kotlin
private fun fetchFile() {
    val url = "" // Provide fileUrl which is returned by uploadBase64File()
    val token = ""
    meeting.fetchBase64File(url, token, object : TaskCompletionListener {
        override fun onComplete(data: String?) {
            Log.d("VideoSDK", "Fetched file in base64:$data")
        }

        override fun onError(error: String?) {
            Log.d("VideoSDK", "Error in fetch file: $error")
        }
    })
}
```

```java
private void fetchFile() {
    String url = ""; // Provide fileUrl which is returned by uploadBase64File()
    String token = "";
    meeting.fetchBase64File(url, token, new TaskCompletionListener() {
        @Override
        public void onComplete(@Nullable String data) {
            Log.d("VideoSDK", "Fetched file in base64:" + data);
        }

        @Override
        public void onError(@Nullable String error) {
            Log.d("VideoSDK", "Error in fetch file: " + error);
        }
    });
}
```

---

### startRecording()

- `startRecording` is used to start meeting recording.

- `webhookUrl` will be triggered when the recording is completed and stored on the server. Read more about webhooks [here](https://en.wikipedia.org/wiki/Webhook).

- `awsDirPath` is the path in your S3 bucket where you want to store recordings. To allow us to store recordings in your S3 bucket, you will need to fill in this form with the required values: [VideoSDK AWS S3 Integration](https://zfrmz.in/RVlFLFiturVJ7Q97fr23)

- `config: mode` is used to record either both video and audio or only audio. By default, it is video-and-audio.

- `config: quality` is only applicable to video-and-audio.

- `transcription`: this parameter lets you start post transcription for the recording.
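The interplay between `mode` and `quality` described above can be sketched in plain Java. This is only an illustration under stated assumptions: `RecordingConfigSketch` and `normalize()` are hypothetical names, not SDK calls, and dropping `quality` for audio-only recordings is one possible way to respect the rule that `quality` applies only to video-and-audio. How the SDK itself treats a stray `quality` value is not stated here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not an SDK class) that normalizes the mode/quality pair. */
final class RecordingConfigSketch {
    static final String DEFAULT_MODE = "video-and-audio";

    /**
     * Returns the entries that would go into the recording config.
     * A null mode falls back to the documented default, and quality is kept
     * only when the effective mode is video-and-audio.
     */
    static Map<String, String> normalize(String mode, String quality) {
        String effectiveMode = (mode == null) ? DEFAULT_MODE : mode;
        if (!effectiveMode.equals("video-and-audio") && !effectiveMode.equals("audio")) {
            throw new IllegalArgumentException("Unknown mode: " + effectiveMode);
        }

        Map<String, String> config = new LinkedHashMap<>();
        config.put("mode", effectiveMode);
        // Assumption: quality is simply omitted for audio-only recordings.
        if (quality != null && effectiveMode.equals("video-and-audio")) {
            config.put("quality", quality);
        }
        return config;
    }
}
```

The resulting entries could then be copied into the `JSONObject` that is passed as `config` to `startRecording()`.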
#### Parameters

- **webhookUrl**: String
- **awsDirPath**: String
- **config**:
  - **layout**:
    - **type**: _"GRID"_ | _"SPOTLIGHT"_ | _"SIDEBAR"_
    - **priority**: _"SPEAKER"_ | _"PIN"_
    - **gridSize**: Number _`max 4`_
  - **theme**: _"DARK"_ | _"LIGHT"_ | _"DEFAULT"_
  - **mode**: _"video-and-audio"_ | _"audio"_
  - **quality**: _"low"_ | _"med"_ | _"high"_
  - **orientation**: _"landscape"_ | _"portrait"_
- **transcription**: **PostTranscriptionConfig**
  - **PostTranscriptionConfig.enabled**: boolean
  - **PostTranscriptionConfig.modelId**: String
  - **PostTranscriptionConfig.summary**: SummaryConfig
    - **SummaryConfig.enabled**: boolean
    - **SummaryConfig.prompt**: String

#### Returns

- _`void`_

#### Events associated with `startRecording()`:

- Every participant will receive a callback on [`onRecordingStateChanged()`](./meeting-event-listener-class#onrecordingstatechanged)

#### Example

```kotlin
val webhookUrl = "https://webhook.your-api-server.com"

var config = JSONObject()
var layout = JSONObject()
JsonUtils.jsonPut(layout, "type", "SPOTLIGHT")
JsonUtils.jsonPut(layout, "priority", "PIN")
JsonUtils.jsonPut(layout, "gridSize", 9)
JsonUtils.jsonPut(config, "layout", layout)
JsonUtils.jsonPut(config, "theme", "DARK")

val prompt = "Write summary in sections like Title, Agenda, Speakers, Action Items, Outlines, Notes and Summary"
val summaryConfig = SummaryConfig(true, prompt)
val modelId = "raman_v1"
val transcription = PostTranscriptionConfig(true, summaryConfig, modelId)

meeting!!.startRecording(webhookUrl, null, config, transcription)
```

```java
String webhookUrl = "https://webhook.your-api-server.com";

JSONObject config = new JSONObject();
JSONObject layout = new JSONObject();
JsonUtils.jsonPut(layout, "type", "SPOTLIGHT");
JsonUtils.jsonPut(layout, "priority", "PIN");
JsonUtils.jsonPut(layout, "gridSize", 9);
JsonUtils.jsonPut(config, "layout", layout);
JsonUtils.jsonPut(config, "theme", "DARK");

String prompt = "Write summary in sections like Title, Agenda, Speakers, Action Items, Outlines, Notes and Summary";
SummaryConfig summaryConfig = new SummaryConfig(true, prompt);
String modelId = "raman_v1";
PostTranscriptionConfig transcription = new PostTranscriptionConfig(true, summaryConfig, modelId);

meeting.startRecording(webhookUrl, null, config, transcription);
```

---

### stopRecording()

- It is used to stop meeting recording.

#### Returns

- _`void`_

#### Events associated with `stopRecording()`:

- Every participant will receive a callback on [`onRecordingStateChanged()`](./meeting-event-listener-class#onrecordingstatechanged)

#### Example

```kotlin
meeting!!.stopRecording()
```

```java
meeting.stopRecording();
```

---

### startLivestream()

- `startLivestream()` is used to start meeting livestreaming.

- You will be able to livestream meetings to other platforms such as YouTube, Facebook, etc. that support `RTMP` streaming.

#### Parameters

- **outputs**: `List<LivestreamOutput>`
- **config**:
  - **layout**:
    - **type**: _"GRID"_ | _"SPOTLIGHT"_ | _"SIDEBAR"_
    - **priority**: _"SPEAKER"_ | _"PIN"_
    - **gridSize**: Number _`max 25`_
  - **theme**: _"DARK"_ | _"LIGHT"_ | _"DEFAULT"_

#### Returns

- _`void`_

#### Events associated with `startLivestream()`:

- Every participant will receive a callback on [`onLivestreamStateChanged()`](./meeting-event-listener-class#onlivestreamstatechanged)

#### Example

```kotlin
val YOUTUBE_RTMP_URL = "rtmp://a.rtmp.youtube.com/live2"
val YOUTUBE_RTMP_STREAM_KEY = ""

val outputs: MutableList<LivestreamOutput> = ArrayList()
outputs.add(LivestreamOutput(YOUTUBE_RTMP_URL, YOUTUBE_RTMP_STREAM_KEY))

var config = JSONObject()
var layout = JSONObject()
JsonUtils.jsonPut(layout, "type", "SPOTLIGHT")
JsonUtils.jsonPut(layout, "priority", "PIN")
JsonUtils.jsonPut(layout, "gridSize", 9)
JsonUtils.jsonPut(config, "layout", layout)
JsonUtils.jsonPut(config, "theme", "DARK")

meeting!!.startLivestream(outputs, config)
```

```java
final String YOUTUBE_RTMP_URL = "rtmp://a.rtmp.youtube.com/live2";
final String YOUTUBE_RTMP_STREAM_KEY = "";

List<LivestreamOutput> outputs = new ArrayList<>();
outputs.add(new LivestreamOutput(YOUTUBE_RTMP_URL, YOUTUBE_RTMP_STREAM_KEY));

JSONObject config = new JSONObject();
JSONObject layout = new JSONObject();
JsonUtils.jsonPut(layout, "type", "SPOTLIGHT");
JsonUtils.jsonPut(layout, "priority", "PIN");
JsonUtils.jsonPut(layout, "gridSize", 9);
JsonUtils.jsonPut(config, "layout", layout);
JsonUtils.jsonPut(config, "theme", "DARK");

meeting.startLivestream(outputs, config);
```

---

### stopLivestream()

- It is used to stop meeting livestreaming.

#### Returns

- _`void`_

#### Events associated with `stopLivestream()`:

- Every participant will receive a callback on [`onLivestreamStateChanged()`](./meeting-event-listener-class#onlivestreamstatechanged)

#### Example

```kotlin
meeting!!.stopLivestream()
```

```java
meeting.stopLivestream();
```

---

### startHls()

- `startHls()` will start HLS streaming of your meeting.

- You will be able to start HLS and watch the live stream of the meeting over HLS.

- `mode` is used to start HLS streaming of either both video and audio or only audio. By default, it is video-and-audio.

- `quality` is only applicable to video-and-audio.

- `transcription`: this parameter lets you start post transcription for the recording.
#### Parameters

- **config**:
  - **layout**:
    - **type**: _"GRID"_ | _"SPOTLIGHT"_ | _"SIDEBAR"_
    - **priority**: _"SPEAKER"_ | _"PIN"_
    - **gridSize**: Number _`max 25`_
  - **theme**: _"DARK"_ | _"LIGHT"_ | _"DEFAULT"_
  - **mode**: _"video-and-audio"_ | _"audio"_
  - **quality**: _"low"_ | _"med"_ | _"high"_
- **transcription**: **PostTranscriptionConfig**
  - **PostTranscriptionConfig.enabled**: boolean
  - **PostTranscriptionConfig.modelId**: String
  - **PostTranscriptionConfig.summary**: SummaryConfig
    - **SummaryConfig.enabled**: boolean
    - **SummaryConfig.prompt**: String

#### Returns

- _`void`_

#### Events associated with `startHls()`:

- Every participant will receive a callback on [`onHlsStateChanged()`](./meeting-event-listener-class#onhlsstatechanged)

#### Example

```kotlin
var config = JSONObject()
var layout = JSONObject()
JsonUtils.jsonPut(layout, "type", "SPOTLIGHT")
JsonUtils.jsonPut(layout, "priority", "PIN")
JsonUtils.jsonPut(layout, "gridSize", 9)
JsonUtils.jsonPut(config, "layout", layout)
JsonUtils.jsonPut(config, "orientation", "portrait")
JsonUtils.jsonPut(config, "theme", "DARK")

val prompt = "Write summary in sections like Title, Agenda, Speakers, Action Items, Outlines, Notes and Summary"
val summaryConfig = SummaryConfig(true, prompt)
val modelId = "raman_v1"
val transcription = PostTranscriptionConfig(true, summaryConfig, modelId)

meeting!!.startHls(config, transcription)
```

```java
JSONObject config = new JSONObject();
JSONObject layout = new JSONObject();
JsonUtils.jsonPut(layout, "type", "SPOTLIGHT");
JsonUtils.jsonPut(layout, "priority", "PIN");
JsonUtils.jsonPut(layout, "gridSize", 9);
JsonUtils.jsonPut(config, "layout", layout);
JsonUtils.jsonPut(config, "orientation", "portrait");
JsonUtils.jsonPut(config, "theme", "DARK");

String prompt = "Write summary in sections like Title, Agenda, Speakers, Action Items, Outlines, Notes and Summary";
SummaryConfig summaryConfig = new SummaryConfig(true, prompt);
String modelId = "raman_v1";
PostTranscriptionConfig transcription = new PostTranscriptionConfig(true, summaryConfig, modelId);

meeting.startHls(config, transcription);
```

---

### stopHls()

- `stopHls()` is used to stop the HLS streaming.

#### Returns

- _`void`_

#### Events associated with `stopHls()`:

- Every participant will receive a callback on [`onHlsStateChanged()`](./meeting-event-listener-class#onhlsstatechanged)

#### Example

```kotlin
meeting!!.stopHls()
```

```java
meeting.stopHls();
```

---

### startTranscription()

- `startTranscription()` is used to start realtime transcription.

#### Parameters

#### config

- type : `TranscriptionConfig`

- This specifies the configuration for realtime transcription. You can specify the following properties:

  - `TranscriptionConfig.webhookUrl`: Webhooks will be triggered when the state of realtime transcription changes. Read more about webhooks [here](https://en.wikipedia.org/wiki/Webhook)

  - `TranscriptionConfig.summary`: `SummaryConfig`

    - `enabled`: Indicates whether realtime transcription summary generation is enabled. The summary will be available after realtime transcription has stopped. Default: `false`

    - `prompt`: Provides guidelines or instructions for generating a custom summary based on the realtime transcription content.
#### Returns

- _`void`_

#### Events associated with `startTranscription()`:

- Every participant will receive a callback on [`onTranscriptionStateChanged()`](./meeting-event-listener-class#ontranscriptionstatechanged)

- Every participant will receive a callback on [`onTranscriptionText()`](./meeting-event-listener-class#ontranscriptiontext)

#### Example

```kotlin
// Realtime Transcription Configuration
val webhookUrl = "https://www.example.com"

val summaryConfig = SummaryConfig(
    true,
    "Write summary in sections like Title, Agenda, Speakers, Action Items, Outlines, Notes and Summary"
)

val transcriptionConfig = TranscriptionConfig(
    webhookUrl,
    summaryConfig
)

meeting!!.startTranscription(transcriptionConfig)
```

```java
// Realtime Transcription Configuration
final String webhookUrl = "https://www.example.com";

SummaryConfig summaryConfig = new SummaryConfig(
    true,
    "Write summary in sections like Title, Agenda, Speakers, Action Items, Outlines, Notes and Summary"
);

TranscriptionConfig transcriptionConfig = new TranscriptionConfig(
    webhookUrl,
    summaryConfig
);

meeting.startTranscription(transcriptionConfig);
```

---

### stopTranscription()

- `stopTranscription()` is used to stop realtime transcription.

#### Returns

- _`void`_

#### Events associated with `stopTranscription()`:

- Every participant will receive a callback on [`onTranscriptionStateChanged()`](./meeting-event-listener-class#ontranscriptionstatechanged)

#### Example

```kotlin
meeting!!.stopTranscription()
```

```java
meeting.stopTranscription();
```

---

### startWhiteboard()

- It is used to initialize a whiteboard session.

#### Returns

- _`void`_

---

### stopWhiteboard()

- It is used to end a whiteboard session.

#### Returns

- _`void`_

---

### changeMode()

- It is used to change the mode.

- You can toggle between the following modes:

  - **`SEND_AND_RECV`**: Both audio and video streams will be produced and consumed.
  - **`SIGNALLING_ONLY`**: Audio and video streams will not be produced or consumed. It is used solely for signaling.
  - **`RECV_ONLY`**: Audio and video streams will only be consumed, without producing any.

#### Parameters

- **mode**: `String`

#### Returns

- _`void`_

#### Events associated with `changeMode()`:

- Every participant will receive a callback on [`onParticipantModeChanged()`](./meeting-event-listener-class#onparticipantmodechanged)

#### Example

```kotlin
meeting!!.changeMode("SIGNALLING_ONLY")
```

```java
meeting.changeMode("SIGNALLING_ONLY");
```

---

### getMics()

- It will return all connected mic devices.

#### Returns

- `Set`

#### Example

```kotlin
val mics = meeting!!.mics
var mic: String
for (i in mics.indices) {
    mic = mics.toTypedArray()[i].toString()
    Toast.makeText(this, "Mic : $mic", Toast.LENGTH_SHORT).show()
}
```

```java
Set mics = meeting.getMics();
String mic;
for (int i = 0; i < mics.size(); i++) {
    mic = mics.toArray()[i].toString();
    Toast.makeText(this, "Mic : " + mic, Toast.LENGTH_SHORT).show();
}
```

---

### changeMic()

- It is used to change the mic device.

- If multiple mic devices are connected, you can switch between them by using `changeMic()`.

#### Parameters

- **device**: AppRTCAudioManager.AudioDevice

#### Returns

- _`void`_

#### Example

```kotlin
meeting!!.changeMic(AppRTCAudioManager.AudioDevice.BLUETOOTH)
```

```java
meeting.changeMic(AppRTCAudioManager.AudioDevice.BLUETOOTH);
```

---

### changeWebcam()

- It is used to change the camera device.

- If multiple camera devices are connected, you can switch to a specific camera by passing its device id to `changeWebcam()`.
- You can get a list of connected video devices using [`VideoSDK.getVideoDevices()`](../videosdk-class/methods#getvideodevices) #### Parameters - **deviceId**: - The `deviceId` represents the unique identifier of the camera device you wish to switch to. If no deviceId is provided, the facing mode will toggle, from the back camera to the front camera if the back camera is currently in use, or from the front camera to the back camera if the front camera is currently in use. - type : String - `OPTIONAL` #### Returns - _`void`_ #### Example ```javascript meeting!!.changeWebcam() ``` ```javascript meeting.changeWebcam(); ``` --- ### pauseAllStreams() This method pauses active media streams within the meeting. #### Parameters - **kind**: Specifies the type of media stream to be paused. If this parameter is omitted, all media streams (audio, video, and screen share) will be paused. - **Type**: `String` - **Optional**: Yes - Possible values: - `"audio"`: Pauses audio streams. - `"video"`: Pauses video streams. - `"share"`: Pauses screen-sharing video streams. #### Returns - _`void`_ #### Example ```javascript meeting!!.pauseAllStreams() ``` ```javascript meeting.pauseAllStreams(); ``` --- ### resumeAllStreams() This method resumes media streams that have been paused #### Parameters - **kind**: Specifies the type of media stream to be resumed. If this parameter is omitted, all media streams (audio, video, and screen share) will be resumed. - **Type**: `String` - **Optional**: Yes - Possible values: - `"audio"`: Resumes audio streams. - `"video"`: Resumes video streams. - `"share"`:Resumes screen-sharing video streams. #### Returns - _`void`_ #### Example ```javascript meeting!!.resumeAllStreams() ``` ```javascript meeting.resumeAllStreams(); ``` --- ### requestMediaRelay() This method starts relaying selected media streams (like camera video, microphone audio, screen share) from the current meeting to a specified destination meeting. 
#### Parameters

- **destinationMeetingId (String) – Required**: ID of the target meeting where media should be relayed.
- **token (String)**: Authentication token for the destination meeting.
  - If you pass `null`, the SDK will use the existing authentication token.
- **kinds (Array of Strings)**: Array of media types to relay.
  - Possible values:
    - `"audio"`: Relays audio streams.
    - `"video"`: Relays video streams.
    - `"share"`: Relays screen-sharing video streams.
  - If you pass `null`, all media types (audio, video, share) will be relayed by default.

#### Returns

- _`void`_

#### Example

```kotlin
meeting.requestMediaRelay("", null, null)
```

```java
meeting.requestMediaRelay("",null,null);
```

---

### stopMediaRelay()

This method stops the ongoing media relay to a specific destination meeting.

#### Parameters

- **destinationMeetingId (String) – Required**: ID of the destination meeting where the media relay should be stopped.

#### Returns

- _`void`_

#### Example

```kotlin
meeting.stopMediaRelay("")
```

```java
meeting.stopMediaRelay("");
```

---

### switchTo()

This method enables a seamless transition from the current meeting to another, without needing to disconnect and reconnect manually.

#### Parameters

- **meetingId (String) – Required**: ID of the new meeting to switch to.
- **token (String) – Optional**: Authentication token for the new meeting.

#### Returns

- _`void`_

#### Example

```kotlin
meeting.switchTo("")
//or
meeting.switchTo("",token)
```

```java
meeting.switchTo("");
//or
meeting.switchTo("",token);
```

---

### setAudioDeviceChangeListener()

- When the local participant changes the mic, `AppRTCAudioManager.AudioManagerEvents()` is triggered. The listener can be set using the `setAudioDeviceChangeListener()` method.
#### Parameters - **audioManagerEvents**: AppRTCAudioManager.AudioManagerEvents #### Returns - _`void`_ #### Example ```javascript meeting!!.setAudioDeviceChangeListener(object : AudioManagerEvents { override fun onAudioDeviceChanged( selectedAudioDevice: AppRTCAudioManager.AudioDevice, availableAudioDevices: Set ) { when (selectedAudioDevice) { AppRTCAudioManager.AudioDevice.BLUETOOTH -> Toast.makeText(this@MainActivity, "Selected AudioDevice: BLUETOOTH", Toast.LENGTH_SHORT).show() AppRTCAudioManager.AudioDevice.WIRED_HEADSET -> Toast.makeText(this@MainActivity, "Selected AudioDevice: WIRED_HEADSET", Toast.LENGTH_SHORT).show() AppRTCAudioManager.AudioDevice.SPEAKER_PHONE -> Toast.makeText(this@MainActivity, "Selected AudioDevice: SPEAKER_PHONE", Toast.LENGTH_SHORT).show() AppRTCAudioManager.AudioDevice.EARPIECE -> Toast.makeText(this@MainActivity, "Selected AudioDevice: EARPIECE", Toast.LENGTH_SHORT).show() } } }) ``` ```javascript meeting.setAudioDeviceChangeListener(new AppRTCAudioManager.AudioManagerEvents() { @Override public void onAudioDeviceChanged(AppRTCAudioManager.AudioDevice selectedAudioDevice, Set availableAudioDevices) { switch (selectedAudioDevice) { case BLUETOOTH: Toast.makeText(MainActivity.this, "Selected AudioDevice: BLUETOOTH", Toast.LENGTH_SHORT).show(); break; case WIRED_HEADSET: Toast.makeText(MainActivity.this, "Selected AudioDevice: WIRED_HEADSET", Toast.LENGTH_SHORT).show(); break; case SPEAKER_PHONE: Toast.makeText(MainActivity.this, "Selected AudioDevice: SPEAKER_PHONE", Toast.LENGTH_SHORT).show(); break; case EARPIECE: Toast.makeText(MainActivity.this, "Selected AudioDevice: EARPIECE", Toast.LENGTH_SHORT).show(); break; } } }); ``` --- ### addEventListener() #### Parameters - **listener**: MeetingEventListener #### Returns - _`void`_ --- ### removeEventListener() #### Parameters - **listener**: MeetingEventListener #### Returns - _`void`_ --- ### removeAllListeners() #### Returns - _`void`_
--- --- sidebar_position: 1 sidebar_label: Properties pagination_label: Meeting Class Properties title: Meeting Class Properties --- # Meeting Class Properties - Android
## getmeetingId()

- type: `String`

- `getmeetingId()` will return the `meetingId`, the unique id of the meeting that the participant has joined.

---

## getLocalParticipant()

- type: [Participant](../participant-class/introduction)

- It is the instance of the [Participant](../participant-class/introduction) class for the local participant (you) who joined the meeting.

---

## getMeetingState()

- type: `MeetingState`

- `getMeetingState()` will return the `MeetingState`, the current connection state of the meeting.

#### Example

```kotlin
meeting!!.getMeetingState()
```

```java
meeting.getMeetingState();
```

---

## getParticipants()

- type: [`Map`](https://developer.android.com/reference/java/util/Map) of [Participant](../participant-class/introduction)

- `Map` - Map{'<'}`participantId`, [Participant](../participant-class/introduction)>

- It will contain all joined participants in the meeting except the `localParticipant`.

- Each participant is stored in the [`Map`](https://developer.android.com/reference/java/util/Map) with that participant's id as the key.

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

```kotlin
val remoteParticipantId = "ajf897"
val participant = meeting!!.participants[remoteParticipantId]
```

```java
String remoteParticipantId = "ajf897";
Participant participant = meeting.getParticipants().get(remoteParticipantId);
```

---

### pubSub

- type: [`PubSub`](../pubsub-class/introduction)

- It is used to enable the Publisher-Subscriber feature in the [`meeting`](introduction) class. Learn more about `PubSub` [here](../pubsub-class/introduction).
---

---
title: Meeting class for android SDK.
hide_title: false
hide_table_of_contents: false
description: RTC Meeting Class provides features to implement custom meeting layout in your application.
sidebar_label: Meeting Class
pagination_label: Meeting Class
keywords:
  - RTC Android
  - Meeting Class
  - Video API
  - Video Conferencing
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
slug: meeting-class
---

import NoIndex from '/mdx/\_no-index.mdx';

# Meeting Class

## using Meeting Class

The `Meeting Class` includes methods and events for managing meetings, participants, video & audio streams, data channels and UI customization.

import MethodListGroup from '@theme/MethodListGroup';
import MethodListItemLabel from '@theme/MethodListItemLabel';
import MethodListHeading from '@theme/MethodListHeading';

## Constructor

### Meeting(String meetingId, Participant localParticipant)

- return type : `void`

## Properties

### getmeetingId()

- `getmeetingId()` will return the `meetingId` of the current meeting
- return type : `String`

### getLocalParticipant()

- `getLocalParticipant()` will return the local participant
- return type : `Participant`

### getParticipants()

- `getParticipants()` will return all remote participants
- return type : `Map`

### pubSub()

- `pubSub()` will return an object of the `PubSub` class
- return type : `PubSub`

### Events

### Methods

---

---
title: MeetingEventListener Class for android SDK.
hide_title: false
hide_table_of_contents: false
description: The `MeetingEventListener Class` includes list of events which can be useful for the design custom user interface.
sidebar_label: MeetingEventListener Class pagination_label: MeetingEventListener Class keywords: - RTC Android - MeetingEventListener Class - Video API - Video Conferencing image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: meeting-event-listener-class --- import NoIndex from '/mdx/\_no-index.mdx'; # MeetingEventListener Class ## using MeetingEventListener Class The `MeetingEventListener Class` is responsible for listening to all the events that are related to `Meeting Class`. import MethodListGroup from '@theme/MethodListGroup'; import MethodListItemLabel from '@theme/MethodListItemLabel'; import MethodListHeading from '@theme/MethodListHeading'; ### Listeners --- --- title: Video SDK Participant Class sidebar_position: 1 sidebar_label: Introduction pagination_label: Video SDK Participant Class --- # Video SDK Participant Class - Android
import properties from './../data/participant-class/properties.json' import methods from './../data/participant-class/methods.json' import events from './../data/participant-class/events.json' import LinksGrid from '../../../../src/theme/LinksGrid' Participant class includes all the properties, methods and events related to all the participants joined in a particular meeting. ## Get local and remote participants You can get the local streams and participant meta from `meeting.getLocalParticipant()`. And a Map of joined participants is always available via `meeting.getParticipants()` import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ```js title="Javascript" val localParticipant = meeting!!.getLocalParticipant() val participants = meeting!!.getParticipants() ``` ```js title="Javascript" Participant localParticipant = meeting.getLocalParticipant(); Map participants = meeting.getParticipants(); ``` ## Participant Properties
- [getId()](./properties#getid)
- [getDisplayName()](./properties#getdisplayname)
- [getQuality()](./properties#getquality)
- [isLocal()](./properties#islocal)
- [getStreams()](./properties#getstreams)
- [getMode()](./properties#getmode)
- [getMetaData()](./properties#getmetadata)
## Participant Methods
- [enableWebcam()](./methods#enablewebcam)
- [disableWebcam()](./methods#disablewebcam)
- [enableMic()](./methods#enablemic)
- [disableMic()](./methods#disablemic)
- [remove()](./methods#remove)
- [setQuality()](./methods#setquality)
- [setViewPort()](./methods#setviewport)
- [captureImage()](./methods#captureimage)
## Participant Events
- [onStreamEnabled](./participant-event-listener-class#onstreamenabled)
- [onStreamDisabled](./participant-event-listener-class#onstreamdisabled)
--- --- title: Participant Class Methods sidebar_position: 1 sidebar_label: Methods pagination_label: Participant Class Methods --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; # Participant Class Methods - Android
### enableWebcam()

- `enableWebcam()` is used to enable the participant's camera.

#### Events associated with `enableWebcam()` :

- First, the participant will get a callback on [onWebcamRequested()](../meeting-class/meeting-event-listener-class#onwebcamrequested), and once the participant accepts the request, the webcam will be enabled.

- Every participant will receive a `streamEnabled` event of the `ParticipantEventListener` class with the `stream` object.

#### Returns

- `void`

---

### disableWebcam()

- `disableWebcam()` is used to disable the participant's camera.

#### Events associated with `disableWebcam()` :

- Every participant will receive a `streamDisabled` event of the `ParticipantEventListener` class with the `stream` object.

#### Returns

- `void`

---

### enableMic()

- `enableMic()` is used to enable the participant's microphone.

#### Events associated with `enableMic()` :

- First, the participant will get a callback on [onMicRequested()](../meeting-class/meeting-event-listener-class#onmicrequested), and once the participant accepts the request, the mic will be enabled.

- Every participant will receive a `streamEnabled` event of the `ParticipantEventListener` class with the `stream` object.

#### Returns

- `void`

---

### disableMic()

- `disableMic()` is used to disable the participant's microphone.

#### Events associated with `disableMic()`:

- Every participant will receive a `streamDisabled` event of the `ParticipantEventListener` class with the `stream` object.

#### Returns

- `void`

---

### pin()

- It is used to set the pin state of the participant. You can use it to pin the participant's screen share, camera, or both. It accepts a parameter of type `String`. Default is `SHARE_AND_CAM`

#### Parameters

- **pinType**: `SHARE_AND_CAM` | `CAM` | `SHARE`

---

### unpin()

- It is used to unpin the participant. You can use it to unpin the participant's screen share, camera, or both. It accepts a parameter of type `String`.
Default is `SHARE_AND_CAM`

#### Parameters

- **pinType**: `SHARE_AND_CAM` | `CAM` | `SHARE`

---

### setQuality()

- `setQuality()` is used to set the quality of the participant's video stream.

#### Parameters

- `quality`: low | med | high

#### Returns

- `void`

---

### setViewPort()

- `setViewPort()` is used to set the quality of the participant's video stream based on the viewport height and width.

#### Parameters

- **width**: int
- **height**: int

#### Returns

- `void`

:::info

MultiStream is not supported by the Android SDK. To change the quality for a participant who joined using the Android SDK, use `customTrack` rather than `setQuality()` and `setViewPort()`. To know more about customTrack, visit [here](/android/guide/video-and-audio-calling-api-sdk/features/custom-track/custom-video-track)

:::

---

### remove()

- `remove()` is used to remove the participant from the meeting.

#### Events associated with `remove()` :

- The local participant will receive an [`onMeetingLeft`](../meeting-class/meeting-event-listener-class.md#onmeetingleft) event.

- All remote participants will receive an [`onParticipantLeft`](../meeting-class/meeting-event-listener-class.md#onparticipantleft) event with `participantId`.

#### Returns

- `void`

---

### captureImage()

- It is used to capture an image of the local participant's current video stream.
- You need to pass an implementation of `TaskCompletionListener` as a parameter. This listener will handle the result of the image capture task.
- When the image capture task is complete, the `onComplete()` method will provide the image in the form of a `base64` string. If an error occurs, the `onError()` method will provide the error details.
#### Parameters

- **height**: int
- **width**: int
- **listener**: TaskCompletionListener

#### Returns

- _`void`_

---

### getVideoStats()

- `getVideoStats()` will return a JSONArray containing the participant's critical video metrics such as **Jitter**, **Packet Loss**, **Quality Score** etc.

#### Returns

- `JSONArray`
  - `jitter` : It represents the distortion in the stream.
  - `bitrate` : It represents the bitrate of the stream which is being transmitted.
  - `totalPackets` : It represents the total number of packets transmitted for that particular stream.
  - `packetsLost` : It represents the total packets lost during the transmission of the stream.
  - `rtt` : It represents the time the stream takes to reach the client from the server, in milliseconds (ms).
  - `codec`: It represents the codec used for the stream.
  - `network`: It represents the network used to transmit the stream.
  - `size`: It is an object containing the height, width and frame rate of the stream.

```kotlin
val videoStats = participant.getVideoStats() // will return all possible stream layers in JSONArray
val videoStat = videoStats.getJSONObject(0) // will return the first stream layer in JSONObject
```

```java
JSONArray videoStats = participant.getVideoStats(); // will return all possible stream layers in JSONArray
JSONObject videoStat = videoStats.getJSONObject(0); // will return the first stream layer in JSONObject
```

:::note

getVideoStats() will return the metrics for the participant at that given point of time and not average data of the complete meeting.

To view the metrics for the complete meeting, use the stats API documented [here](/api-reference/realtime-communication/fetch-session-quality-stats).

:::

:::info

If you are getting `rtt` greater than 300ms, try using a different region which is nearest to your user. To know more about changing region [visit here](/api-reference/realtime-communication/create-room).
If you are getting high packet loss, try using the `customTrack` for a better experience. To know more about customTrack [visit here](/android/guide/video-and-audio-calling-api-sdk/features/custom-track/custom-video-track)

:::

---

### getAudioStats()

- `getAudioStats()` will return a JSONObject containing the participant's critical audio metrics such as **Jitter**, **Packet Loss**, **Quality Score** etc.

#### Returns

- `JSONObject`
  - `jitter` : It represents the distortion in the stream.
  - `bitrate` : It represents the bitrate of the stream which is being transmitted.
  - `totalPackets` : It represents the total number of packets transmitted for that particular stream.
  - `packetsLost` : It represents the total packets lost during the transmission of the stream.
  - `rtt` : It represents the time the stream takes to reach the client from the server, in milliseconds (ms).
  - `codec`: It represents the codec used for the stream.
  - `network`: It represents the network used to transmit the stream.

```kotlin
val audioStat = participant.getAudioStats()
```

```java
JSONObject audioStat = participant.getAudioStats();
```

:::note

getAudioStats() will return the metrics for the participant at that given point of time and not average data of the complete meeting.

To view the metrics for the complete meeting, use the stats API documented [here](/api-reference/realtime-communication/fetch-session-quality-stats).

:::

:::info

If you are getting `rtt` greater than 300ms, try using a different region which is nearest to your user. To know more about changing region [visit here](/api-reference/realtime-communication/create-room).

:::

### getShareStats()

- `getShareStats()` will return a JSONObject containing the participant's critical video metrics such as **Jitter**, **Packet Loss**, **Quality Score** etc.

#### Returns

- `JSONObject`
  - `jitter` : It represents the distortion in the stream.
  - `bitrate` : It represents the bitrate of the stream which is being transmitted.
  - `totalPackets` : It represents the total number of packets transmitted for that particular stream.
  - `packetsLost` : It represents the total packets lost during the transmission of the stream.
  - `rtt` : It represents the time the stream takes to reach the client from the server, in milliseconds (ms).
  - `codec`: It represents the codec used for the stream.
  - `network`: It represents the network used to transmit the stream.
  - `size`: It is an object containing the height, width and frame rate of the stream.

```kotlin
val shareStat = participant.getShareStats()
```

```java
JSONObject shareStat = participant.getShareStats();
```

:::note

getShareStats() will return the metrics for the participant at that given point of time and not average data of the complete meeting.

To view the metrics for the complete meeting, use the stats API documented [here](/api-reference/realtime-communication/fetch-session-quality-stats).

:::

:::info

If you are getting `rtt` greater than 300ms, try using a different region which is nearest to your user. To know more about changing region [visit here](/api-reference/realtime-communication/create-room).

:::

### addEventListener()

#### Parameters

- **listener**: ParticipantEventListener

#### Returns

- _`void`_

---

### removeEventListener()

#### Parameters

- **listener**: ParticipantEventListener

#### Returns

- _`void`_

---

### removeAllListeners()

#### Returns

- _`void`_
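
The `totalPackets`, `packetsLost`, and `rtt` fields documented for `getVideoStats()`, `getAudioStats()`, and `getShareStats()` can be turned into a simple health check once the values have been read out of the returned JSON. The following is a minimal sketch in plain Java: the `StreamHealth` class and its method names are hypothetical and not part of the SDK, and it assumes `totalPackets` is a suitable denominator for `packetsLost`. The 300 ms level comes from the info notes on this page.

```java
/** Hypothetical helper for interpreting stream stats; not part of the VideoSDK API. */
final class StreamHealth {
    /** RTT level above which the notes on this page suggest trying a nearer region. */
    static final double RTT_WARNING_MS = 300.0;

    private StreamHealth() {}

    /**
     * Packet loss as a percentage of totalPackets.
     * Assumption: totalPackets is a valid denominator for packetsLost.
     */
    static double packetLossPercent(long totalPackets, long packetsLost) {
        if (totalPackets <= 0) {
            return 0.0; // nothing transmitted yet, so no measurable loss
        }
        return 100.0 * packetsLost / totalPackets;
    }

    /** True when the round-trip time exceeds the 300 ms level. */
    static boolean isRttHigh(double rttMs) {
        return rttMs > RTT_WARNING_MS;
    }
}
```

Because the stats describe a single point in time, a check like this is probably most useful when sampled repeatedly rather than evaluated once.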
--- --- title: ParticipantEventListener Class sidebar_position: 1 sidebar_label: ParticipantEventListener pagination_label: ParticipantEventListener Class --- # ParticipantEventListener Class - Android
### Implementation - You can implement all the methods of `ParticipantEventListener` abstract Class and add the listener to `Participant` class using the `addEventListener()` method of `Participant` Class. --- ### onStreamEnabled() - `onStreamEnabled()` is a callback which gets triggered whenever a participant's video, audio or screen share stream is enabled. #### Event callback parameters - **stream**: [Stream](../stream-class/introduction.md) --- ### onStreamDisabled() - `onStreamDisabled()` is a callback which gets triggered whenever a participant's video, audio or screen share stream is disabled. #### Event callback parameters - **stream**: [Stream](../stream-class/introduction.md) --- ### onStreamPaused() - This event will be emitted when any participant pauses consuming or producing stream of any type. --- ### onStreamResumed() - This event will be emitted when any participant resumes consuming or producing stream of any type. --- ### onE2eeStateChanged() - This event will be emitted when participant's E2EE State changes. --- ### Example import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ```js meeting!!.localParticipant.addEventListener(object : ParticipantEventListener() { override fun onStreamEnabled(stream: Stream) { // } override fun onStreamDisabled(stream: Stream) { // } override fun onStreamPaused(kind: String, reason: String) { // } override fun onStreamResumed(kind: String, reason: String) { // } override fun onE2eeStateChanged(state: E2EEState, stream: Stream) { // } }); ``` ```js participant.addEventListener(new ParticipantEventListener() { @Override public void onStreamEnabled(Stream stream) { // } @Override public void onStreamDisabled(Stream stream) { // } @Override public void onStreamPaused(String kind, String reason) { // } @Override public void onStreamResumed(String kind, String reason) { // } @Override public void onE2eeStateChanged(E2EEState state, Stream stream) { // } }); ```
--- --- title: Participant Class Properties sidebar_position: 1 sidebar_label: Properties pagination_label: Participant Class Properties --- # Participant Class Properties - Android
### getId()

- type: `String`

- `getId()` will return the unique id of the participant who has joined the meeting.

---

### getDisplayName()

- type: `String`

- It will return the `displayName` of the participant who has joined the meeting.

---

### getQuality()

- type: `String`

- `getQuality()` will return the quality of the participant's stream. The stream could be `audio`, `video` or `share`.

---

### isLocal()

- type: `boolean`

- `isLocal()` will return `true` if the participant is local, `false` otherwise.

---

### getStreams()

- type: `Map`

- It represents the streams of that particular participant who has joined the meeting. Streams could be `audio`, `video` or `share`.

---

### getMode()

- type : `String`

- `getMode()` will return the mode of the participant.

---

### getMetaData()

- type : `JSONObject`

- `getMetaData()` will return the additional information that you have passed in `initMeeting()`.
--- --- title: Participant class for android SDK. hide_title: false hide_table_of_contents: false description: The `Participant Class` includes methods and events for participants and their associated video & audio streams, data channels and UI customization. sidebar_label: Participant Class pagination_label: Participant Class keywords: - RTC Android - Participant Class - Video API - Video Conferencing image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: participant-class --- import NoIndex from '/mdx/\_no-index.mdx'; # Participant Class ## Introduction The `Participant Class` includes methods and events for participants and their associated video & audio streams, data channels and UI customization. import MethodListGroup from '@theme/MethodListGroup'; import MethodListItemLabel from '@theme/MethodListItemLabel'; import MethodListHeading from '@theme/MethodListHeading'; ## Properties ### getId() - `getId()` will return participant's Id - return type : `String` ### getDisplayName() - `getDisplayName()` will return name of participant - return type : `String` ### getQuality() - `getQuality()` will return quality of participant's video stream - return type : `String` ### isLocal() - `isLocal()` will return `true` if participant is Local,`false` otherwise - return type : `boolean` ### getStreams() - `getStreams()` will return streams of participant - return type : `Map` - Map contains `streamId` as key and `stream` as value ## Events ### addEventListener(ParticipantEventListener listener) - By using `addEventListener(ParticipantEventListener listener)`, we can add listener to the List of `ParticipantEventListener` - return type : `void` ### removeEventListener(ParticipantEventListener listener) - By using `removeEventListener(ParticipantEventListener listener)`, we can remove listener from List of `ParticipantEventListener` - return type : `void` ### removeAllListeners() - By using `removeAllListeners()`, we can remove all listener from List - return type : 
`void`

## Methods

### enableMic()

- By using the `enableMic()` function, a participant can enable the mic of any particular remote participant
- When `enableMic()` is called,
  - Local Participant will receive a callback on `streamEnabled()` of `ParticipantEventListener` class
  - Remote Participant will receive a callback for `onMicRequested()` and once the remote participant accepts the request, mic will be enabled for that participant
- return type : `void`

### disableMic()

- By using the `disableMic()` function, a participant can disable the mic of any particular remote participant
- When `disableMic()` is called,
  - Local Participant will receive a callback on `streamDisabled()` of `ParticipantEventListener` class
  - Remote Participant will receive a callback on `streamDisabled()` of `ParticipantEventListener` class
- return type : `void`

### enableWebcam()

- By using the `enableWebcam()` function, a participant can enable the webcam of any particular remote participant
- When `enableWebcam()` is called,
  - Local Participant will receive a callback on `streamEnabled()` of `ParticipantEventListener` class
  - Remote Participant will receive a callback for `webcamRequested()` and once the remote participant accepts the request, webcam will be enabled for that participant
- return type : `void`

### disableWebcam()

- By using the `disableWebcam()` function, a participant can disable the webcam of any particular remote participant
- When `disableWebcam()` is called,
  - Local Participant will receive a callback on `streamDisabled()` of `ParticipantEventListener` class
  - Remote Participant will receive a callback on `streamDisabled()` of `ParticipantEventListener` class
- return type : `void`

### remove()

- By using the `remove()` function, a participant can remove any particular remote participant
- When `remove()` is called,
  - Local Participant will receive a callback on `meetingLeft()`
  - Remote Participant will receive a callback on `participantLeft()`
- return type : `void`

### setQuality()

- By using `setQuality()`, you can set the quality of the participant's video stream
- return type : `void`

---

---
title: ParticipantEventListener Class for android SDK.
hide_title: false
hide_table_of_contents: false
description: The `ParticipantEventListener Class` includes list of events which can be useful for the design custom user interface.
sidebar_label: ParticipantEventListener Class
pagination_label: ParticipantEventListener Class
keywords:
  - RTC Android
  - ParticipantEventListener Class
  - Video API
  - Video Conferencing
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
slug: participant-event-listener-class
---

# ParticipantEventListener Class

## using ParticipantEventListener Class

The `ParticipantEventListener Class` is responsible for listening to all the events that are related to `Participant Class`.

import MethodListGroup from '@theme/MethodListGroup';
import MethodListItemLabel from '@theme/MethodListItemLabel';
import MethodListHeading from '@theme/MethodListHeading';

### Listeners

---

---
title: Video SDK PubSub Class
sidebar_position: 1
sidebar_label: Introduction
pagination_label: Video SDK PubSub Class
---

# Video SDK PubSub Class - Android
## Introduction PubSub class provides the methods to implement Publisher-Subscriber feature in your Application. ## PubSub Methods
- [publish()](methods#publish)
- [subscribe()](methods#subscribe)
- [unsubscribe()](methods#unsubscribe)
--- --- sidebar_position: 1 sidebar_label: Methods pagination_label: PubSub Class Methods title: PubSub Class Methods --- # PubSub Class Methods - Android
### publish() - `publish()` is used to publish messages on a specified topic in the meeting. #### Parameters - topic - type: `String` - This is the name of the topic, for which message will be published. - message - type: `String` - This is the actual message. - options - type: [`PubSubPublishOptions`](pubsub-publish-options-class) - This specifies the options for publish. - payload - type: `JSONObject` - `OPTIONAL` - If you need to include additional information along with a message, you can pass here as `JSONObject`. #### Returns - _`void`_ #### Example import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ```js // Publish message for 'CHAT' topic val publishOptions = PubSubPublishOptions() publishOptions.isPersist = true meeting!!.pubSub.publish("CHAT", "Hello from Android", publishOptions) ``` ```js // Publish message for 'CHAT' topic PubSubPublishOptions publishOptions = new PubSubPublishOptions(); publishOptions.setPersist(true); meeting.pubSub.publish("CHAT", "Hello from Android", publishOptions); ``` --- ### subscribe() - `subscribe()` is used to subscribe a particular topic to get all the messages of that particular topic in the meeting. #### Parameters - topic: - type: `String` - Participants can listen to messages on that particular topic. 
- listener:
  - type: [`PubSubMessageListener`](pubsub-message-listener-class)

#### Returns

- [_`List`_](pubsub-message-class)

#### Example

```kotlin
var pubSubMessageListener: PubSubMessageListener = PubSubMessageListener { message ->
    Log.d("#message", "onMessageReceived: " + message.message)
}

// Subscribe for 'CHAT' topic
val pubSubMessageList = meeting!!.pubSub.subscribe("CHAT", pubSubMessageListener)
```

```java
PubSubMessageListener pubSubMessageListener = new PubSubMessageListener() {
    @Override
    public void onMessageReceived(PubSubMessage message) {
        Log.d("#message", "onMessageReceived: " + message.getMessage());
    }
};

// Subscribe for 'CHAT' topic
List pubSubMessageList = meeting.pubSub.subscribe("CHAT", pubSubMessageListener);
```

---

### unsubscribe()

- `unsubscribe()` is used to unsubscribe from a topic that you previously subscribed to.

#### Parameters

- topic:
  - type: `String`
  - This is the name of the topic to be unsubscribed.
- listener:
  - type: [`PubSubMessageListener`](pubsub-message-listener-class)

#### Returns

- _`void`_

#### Example

```kotlin
// Unsubscribe for 'CHAT' topic
meeting!!.pubSub.unsubscribe("CHAT", pubSubMessageListener)
```

```java
// Unsubscribe for 'CHAT' topic
meeting.pubSub.unsubscribe("CHAT", pubSubMessageListener);
```
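
How `publish()`, `subscribe()`, and `unsubscribe()` relate to each other can be illustrated with a toy in-memory model. This is only a sketch and not the SDK's implementation: `ToyPubSub` is a hypothetical class, messages are plain strings, and it assumes that `subscribe()` hands back the messages that were published with persistence enabled before the subscription, while later messages arrive through the listener.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy in-memory stand-in for topic-based pub/sub; not the VideoSDK implementation. */
final class ToyPubSub {
    private final Map<String, List<String>> persisted = new HashMap<>();
    private final Map<String, List<Consumer<String>>> listeners = new HashMap<>();

    /** Delivers to the topic's current listeners; stores the message only when persist is true. */
    void publish(String topic, String message, boolean persist) {
        if (persist) {
            persisted.computeIfAbsent(topic, t -> new ArrayList<>()).add(message);
        }
        List<Consumer<String>> current =
                listeners.getOrDefault(topic, Collections.<Consumer<String>>emptyList());
        // Iterate over a copy so a listener may unsubscribe while being notified.
        for (Consumer<String> listener : new ArrayList<>(current)) {
            listener.accept(message);
        }
    }

    /** Registers the listener and returns the messages persisted on the topic so far. */
    List<String> subscribe(String topic, Consumer<String> listener) {
        listeners.computeIfAbsent(topic, t -> new ArrayList<>()).add(listener);
        return new ArrayList<>(persisted.getOrDefault(topic, Collections.<String>emptyList()));
    }

    /** Removes the listener; the same instance passed to subscribe() must be supplied. */
    void unsubscribe(String topic, Consumer<String> listener) {
        List<Consumer<String>> current = listeners.get(topic);
        if (current != null) {
            current.remove(listener);
        }
    }
}
```

The model also shows why the examples above keep a reference to `pubSubMessageListener`: unsubscribing needs the same listener instance that was used to subscribe.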
--- --- sidebar_position: 1 sidebar_label: Properties pagination_label: Properties title: Properties --- # Properties - Android
### getId()

- type: `String`

- `getId()` returns the unique id of the pubsub message.

---

### getMessage()

- type: `String`

- `getMessage()` returns the message that was published on the specific topic.

---

### getTopic()

- type: `String`

- `getTopic()` returns the name of the message topic.

---

### getSenderId()

- type: `String`

- `getSenderId()` returns the id of the participant who sent the message.

---

### getSenderName()

- type: `String`

- `getSenderName()` returns the name of the participant who sent the pubsub message.

---

### getTimestamp()

- type: `long`

- `getTimestamp()` returns the timestamp at which the pubsub message was sent.

---

### getPayload()

- type: `JSONObject`

- `getPayload()` returns the data that you sent along with the message.
--- --- sidebar_position: 1 sidebar_label: PubSubMessageListener Class pagination_label: PubSubMessageListener Class title: PubSubMessageListener Class --- # PubSubMessageListener Class - Android
---

#### onMessageReceived()

- This event is emitted whenever a pubsub message is received.

#### Example

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

```javascript
var pubSubMessageListener = PubSubMessageListener { message ->
    Log.d("#message", "onMessageReceived: " + message.message)
}
```

```javascript
PubSubMessageListener pubSubMessageListener = new PubSubMessageListener() {
    @Override
    public void onMessageReceived(PubSubMessage message) {
        Log.d("#message", "onMessageReceived: " + message.getMessage());
    }
};
```
--- --- sidebar_position: 1 sidebar_label: PubSubPublishOptions Class pagination_label: PubSubPublishOptions Class title: PubSubPublishOptions Class --- # PubSubPublishOptions Class - Android
## Properties

### persist

- type: `boolean`

- defaultValue: `false`

- This property specifies whether the server stores messages for upcoming participants.

- If the value of this property is `true`, the server stores pubsub messages for the upcoming participants.

---

### sendOnly

- type: `String[]`

- defaultValue: `null`

- To send a message only to specific participants, pass their respective `participantId` values here. If you don't provide any IDs or pass a `null` value, the message is sent to all participants by default.

:::note
Make sure that every participant whose `participantId` is in the array is subscribed to that specific topic.
:::
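The `sendOnly` rule can be sketched in plain Java. This is only a model of the behavior described above, not SDK code: `SendOnlyModel` and `recipients()` are hypothetical names, and treating an empty array the same as `null` is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the sendOnly delivery rule; not SDK code.
final class SendOnlyModel {
    // Returns the participant ids that would receive a message.
    // subscribers: ids currently subscribed to the topic.
    // sendOnly: targeted ids, or null/empty to address every subscriber.
    static List<String> recipients(Set<String> subscribers, String[] sendOnly) {
        if (sendOnly == null || sendOnly.length == 0) {
            return new ArrayList<>(subscribers);
        }
        List<String> result = new ArrayList<>();
        // De-duplicate the targeted ids while keeping their order.
        for (String id : new LinkedHashSet<>(Arrays.asList(sendOnly))) {
            // A targeted participant only receives the message if subscribed.
            if (subscribers.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }
}
```

The model makes the note above visible: an id listed in `sendOnly` that is not subscribed to the topic simply does not appear among the recipients.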
--- --- title: PubSub class for android SDK. hide_title: false hide_table_of_contents: false description: PubSub Class sidebar_label: PubSub Class pagination_label: PubSub Class keywords: - RTC Android - Publisher-Subscriber - PubSub image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: pubsub-class --- # PubSub Class ## using PubSub Class The `PubSub` includes methods for pubsub. import MethodListGroup from '@theme/MethodListGroup'; import MethodListItemLabel from '@theme/MethodListItemLabel'; import MethodListHeading from '@theme/MethodListHeading'; ### Methods "} /> --- --- title: PubSubMessage class for android SDK. hide_title: false hide_table_of_contents: false description: PubSubMessage Class sidebar_label: PubSubMessage Class pagination_label: PubSubMessage Class keywords: - RTC Android - Publisher-Subscriber - PubSub - PubSubMessage image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: pubsub-message-class --- # PubSubMessage Class ## using PubSubMessage Class The `PubSubMessage` includes properties of PubSub message. import MethodListGroup from '@theme/MethodListGroup'; import MethodListItemLabel from '@theme/MethodListItemLabel'; import MethodListHeading from '@theme/MethodListHeading'; ### Properties --- --- title: PubSubPublishOptions class for android SDK. hide_title: false hide_table_of_contents: false description: PubSubPublishOptions Class sidebar_label: PubSubPublishOptions Class pagination_label: PubSubPublishOptions Class keywords: - RTC Android - PubSub - PubSubPublishOptions image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: pubsub-publish-options-class --- # PubSubPublishOptions Class ## using PubSubPublishOptions Class The `PubSubPublishOptions` includes properties of PubSub options. 
import MethodListGroup from '@theme/MethodListGroup';
import MethodListItemLabel from '@theme/MethodListItemLabel';
import MethodListHeading from '@theme/MethodListHeading';

### Properties

### Methods

---

---
id: setup
title: Installation steps for RTC Android SDK
hide_title: false
hide_table_of_contents: false
description: RTC Android SDK provides a client for almost all Android devices. It uses little CPU and memory.
sidebar_label: Setup
pagination_label: Setup
keywords:
  - RTC Android
  - Android SDK
  - Kotlin SDK
  - Java SDK
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
slug: setup
---

# Setup - Android

## Setting up the Android SDK

The Android SDK is a client for real-time communication on Android devices. It uses the same terminology as all the other SDKs.

## Minimum OS/SDK versions

It supports the following OS/SDK versions.

### Android: minSdkVersion >= 21

## Installation

1. If your Android Studio version is older than Android Studio Bumblebee, add the repository to the project's `build.gradle` file.
If you are using Android Studio Bumblebee or a newer version, add the repository to the `settings.gradle` file.

:::note
You can use imports with Maven Central after rtc-android-sdk version `0.1.12`.

Whether on Maven or Jitpack, the same version numbers always refer to the same SDK.
:::

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

```js title="build.gradle"
allprojects {
  repositories {
    // ...
    google()
    mavenCentral()
    maven { url "https://maven.aliyun.com/repository/jcenter" }
  }
}
```

```js title="settings.gradle"
dependencyResolutionManagement{
  repositories {
    // ...
    google()
    mavenCentral()
    maven { url "https://maven.aliyun.com/repository/jcenter" }
  }
}
```

```js title="build.gradle"
allprojects {
  repositories {
    // ...
    google()
    maven { url 'https://jitpack.io' }
    mavenCentral()
    maven { url "https://maven.aliyun.com/repository/jcenter" }
  }
}
```

```js title="settings.gradle"
dependencyResolutionManagement{
  repositories {
    // ...
    google()
    maven { url 'https://jitpack.io' }
    mavenCentral()
    maven { url "https://maven.aliyun.com/repository/jcenter" }
  }
}
```

### Step 2: Add the following dependency in your app's `app/build.gradle`.

```js title="app/build.gradle"
dependencies {
  implementation 'live.videosdk:rtc-android-sdk:0.3.0'

  // library to perform Network call to generate a meeting id
  implementation 'com.amitshekhar.android:android-networking:1.0.2'

  // other app dependencies
}
```

:::important
The Android SDK is compatible with the armeabi-v7a, arm64-v8a, and x86_64 architectures. If you want to run the application in an emulator, choose ABI x86_64 when creating a device.
:::

## Integration

### Step 1: Add the following permissions in `AndroidManifest.xml`.

```xml title="AndroidManifest.xml"
```

### Step 2: Create a `MainApplication` class that extends `android.app.Application`.
```js title="MainApplication.kt"
package live.videosdk.demo

import android.app.Application
import live.videosdk.android.VideoSDK

class MainApplication : Application() {
    override fun onCreate() {
        super.onCreate()
        VideoSDK.initialize(applicationContext)
    }
}
```

```js title="MainApplication.java"
package live.videosdk.demo;

import android.app.Application;
import live.videosdk.android.VideoSDK;

public class MainApplication extends Application {
    @Override
    public void onCreate() {
        super.onCreate();
        VideoSDK.initialize(getApplicationContext());
    }
}
```

### Step 3: Add `MainApplication` to `AndroidManifest.xml`.

```js title="AndroidManifest.xml"
```

### Step 4: In your `JoinActivity`, add the following code in the `onCreate()` method.

```js title="JoinActivity.kt"
override fun onCreate(savedInstanceState: Bundle?) {
    super.onCreate(savedInstanceState)
    setContentView(R.layout.activity_join)

    val meetingId = ""
    val participantName = "John Doe"
    var micEnabled = true
    var webcamEnabled = true

    // generate the jwt token from your api server and add it here
    VideoSDK.config("JWT TOKEN GENERATED FROM SERVER")

    // create a new meeting instance
    meeting = VideoSDK.initMeeting(
        this@JoinActivity, meetingId, participantName, micEnabled, webcamEnabled,
        null, null, false, null, null
    )

    // get permissions and join the meeting with meeting.join();
    // checkPermissionAndJoinMeeting();
}
```

```js title="JoinActivity.java"
@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_join);

    final String meetingId = "";
    final String participantName = "John Doe";
    final boolean micEnabled = true;
    final boolean webcamEnabled = true;

    // generate the jwt token from your api server and add it here
    VideoSDK.config("JWT TOKEN GENERATED FROM SERVER");

    // create a new meeting instance
    Meeting meeting = VideoSDK.initMeeting(
        JoinActivity.this, meetingId, participantName, micEnabled, webcamEnabled,
        null, null, false, null, null
    );

    // get permissions and join the meeting with meeting.join();
    // checkPermissionAndJoinMeeting();
}
```

All set! Here is the link to the complete sample code on [Github](https://github.com/videosdk-live/videosdk-rtc-android-java-sdk-example).

Please refer to the [documentation](initMeeting) for a full list of available methods, events and features of the SDK.

---

---
title: Video SDK Stream Class
sidebar_position: 1
sidebar_label: Introduction
pagination_label: Video SDK Stream Class
---

# Video SDK Stream Class - Android
import properties from './../data/stream-class/properties.json' import methods from './../data/stream-class/methods.json' import LinksGrid from '../../../../src/theme/LinksGrid' Stream class is responsible for handling audio, video and screen sharing streams. Stream class defines instance of audio, video and shared screen stream of participants. ## Stream Properties
- [getId()](./properties#getid)
- [getKind()](./properties#getkind)
- [getTrack()](./properties#gettrack)
## Stream Methods
- [resume()](methods#resume)
- [pause()](./methods#pause)
--- --- title: Stream Class Methods sidebar_position: 1 sidebar_label: Methods pagination_label: Stream Class Methods --- # Stream Class Methods - Android
### resume()

- By using `resume()`, a participant can resume the stream of a remote participant.

#### Returns

- `void`

---

### pause()

- By using `pause()`, a participant can pause the stream of a remote participant.

#### Returns

- `void`
--- --- title: Stream Class Properties sidebar_position: 1 sidebar_label: Properties pagination_label: Stream Class Properties --- # Stream Class Properties - Android
### getId()

- type: `String`

- `getId()` returns the id of the stream.

---

### getKind()

- type: `String`

- `getKind()` returns the `kind` of the stream, which is one of `audio`, `video`, or `share`.

---

### getTrack()

- type: `MediaStreamTrack`

- `getTrack()` returns the `MediaStreamTrack` object stored in the `MediaStream` object.
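Because `getKind()` returns a plain `String`, code that handles streams usually branches on the three documented values. The following is a minimal sketch of that branching; `StreamKinds` and `describe()` are hypothetical helpers, not SDK classes, and the display labels are made up for illustration.

```java
// Hypothetical helper: maps the documented kind values to a display label.
final class StreamKinds {
    static String describe(String kind) {
        switch (kind) {
            case "audio":
                return "Audio stream";
            case "video":
                return "Video stream";
            case "share":
                return "Screen-share stream";
            default:
                // Fail loudly on values this sketch does not know about.
                throw new IllegalArgumentException("Unknown stream kind: " + kind);
        }
    }
}
```

In an app, the argument would come from `stream.getKind()`. Rejecting unknown values in the `default` branch is one design choice; ignoring them is equally reasonable if forward compatibility matters more.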
--- --- title: Stream class for android SDK. hide_title: false hide_table_of_contents: false description: RTC Stream Class enables opportunity to . sidebar_label: Stream Class pagination_label: Stream Class keywords: - RTC Android - Stream Class - Video API - Video Conferencing image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: stream-class --- # Stream Class ## Introduction The `Stream Class` includes methods and events of video & audio streams. import MethodListGroup from '@theme/MethodListGroup'; import MethodListItemLabel from '@theme/MethodListItemLabel'; import MethodListHeading from '@theme/MethodListHeading'; ## Properties ### getId() - `getId()` will return Id of stream - return type : `String` ### getKind() - `getKind()` will return kind of stream, which can `audio`,`video` or `share` - return type : `String` ### getTrack() - `getTrack()` will return a MediaStreamTrack object stored in the MediaStream object - return type : `MediaStreamTrack` ## Methods ### pause() - By using `pause()` function, a participant can pause the stream of Remote Participant - return type : `void` ### resume() - By using `resume()` function, a participant can resume the stream of Remote Participant - return type : `void` --- --- title: Terminology - Video SDK Documentation hide_title: true hide_table_of_contents: false description: Video SDK enables the opportunity to integrate native IOS, Android & Web SDKs to add live video & audio conferencing to your applications. sidebar_label: Terminology pagination_label: Terminology keywords: - audio calling - video calling - real-time communication - collaboration image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: terminology --- import Terminology from '../../../mdx/introduction/\_terminology.mdx'; --- --- title: Video SDK Class for android SDK. hide_title: false hide_table_of_contents: false description: Video SDK Class is a factory for initialize, configure and init meetings. 
sidebar_label: VideoSDK Class pagination_label: VideoSDK Class keywords: - RTC Android - VideoSDK Class - Video API - Video Conferencing image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: videosdk-class --- # VideoSDK Class The entry point into real-time communication SDK. ## using VideoSDK Class The `VideoSDK Class` includes methods and events to initialize and configure the SDK. It is a factory class. import MethodListGroup from '@theme/MethodListGroup'; import MethodListItemLabel from '@theme/MethodListItemLabel'; import MethodListHeading from '@theme/MethodListHeading'; ### Parameters ### Methods ## Example import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ```js title="initMeeting" // Configure the token VideoSDK.config(token) // Initialize the meeting Meeting meeting = VideoSDK.initMeeting( context, meetingId, // required name, // required micEnabled, // required webcamEnabled, // required null, // required null, // required null // required ) }); ``` ```js title="initMeeting" // Configure the token VideoSDK.config(token) // Initialize the meeting Meeting meeting = VideoSDK.initMeeting({ context, meetingId, // required name, // required micEnabled, // required webcamEnabled, // required null, // required null // required null // required }); ``` --- --- sidebar_position: 1 sidebar_label: Events pagination_label: VideoSDK Class Events title: VideoSDK Class Events --- # VideoSDK Class Events - Android
### onAudioDeviceChanged()

- This event is emitted when an audio device is connected to or removed from the device.

#### Example

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

```javascript
VideoSDK.setAudioDeviceChangeListener(object : VideoSDK.AudioDeviceChangeEvent {
    override fun onAudioDeviceChanged(
        selectedAudioDevice: AudioDeviceInfo?,
        audioDevices: MutableSet<AudioDeviceInfo>?
    ) {
        // Both arguments are declared nullable, so use safe calls
        Log.d("VideoSDK", "Selected Audio Device: " + selectedAudioDevice?.label)
        audioDevices?.forEach { audioDevice ->
            Log.d("VideoSDK", "Audio Devices" + audioDevice.label)
        }
    }
})
```

```javascript
VideoSDK.setAudioDeviceChangeListener(new VideoSDK.AudioDeviceChangeEvent() {
    @Override
    public void onAudioDeviceChanged(AudioDeviceInfo selectedAudioDevice, Set<AudioDeviceInfo> audioDevices) {
        Log.d("VideoSDK", "Selected Audio Device: " + selectedAudioDevice.getLabel());
        for (AudioDeviceInfo audioDevice : audioDevices) {
            Log.d("VideoSDK", "Audio Devices" + audioDevice.getLabel());
        }
    }
});
```

---
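The callback receives the full set of audio devices rather than the one device that was connected or removed. One way to work out what changed is to remember the labels from the previous callback and compare the two sets. The sketch below shows that comparison in plain Java; `DeviceDiff` is a hypothetical helper, and plain label strings stand in for `AudioDeviceInfo` objects.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for finding which device labels appeared or disappeared
// between two callbacks. Labels stand in for device objects.
final class DeviceDiff {
    // Labels present now but not before.
    static Set<String> added(Set<String> previous, Set<String> current) {
        Set<String> result = new TreeSet<>(current);
        result.removeAll(previous);
        return result;
    }

    // Labels present before but not now.
    static Set<String> removed(Set<String> previous, Set<String> current) {
        Set<String> result = new TreeSet<>(previous);
        result.removeAll(current);
        return result;
    }
}
```

In a listener, `previous` would be a field updated at the end of each `onAudioDeviceChanged()` call, and `current` would be built from `getLabel()` on each device in the new set.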
--- --- sidebar_position: 1 sidebar_label: Introduction pagination_label: Intro to VideoSDK Class title: VideoSDK Class --- # VideoSDK Class - Android
## Introduction The `VideoSDK` class includes properties, methods and events for creating and configuring a meeting, and managing media devices. import LinksGrid from "../../../../src/theme/LinksGrid"; //import properties from "./../data/meeting-class/properties.json"; import methods from "./../data/meeting-class/methods.json"; import events from "./../data/meeting-class/events.json"; ## VideoSDK Properties
- [getSelectedAudioDevice()](./properties.md#getselectedaudiodevice)
- [getSelectedVideoDevice()](./properties#getselectedvideodevice)
## VideoSDK Methods
- [initialize()](./methods#initialize)
- [config()](./methods#config)
- [initMeeting()](./methods#initmeeting)
- [getDevices()](./methods#getdevices)
- [getVideoDevices()](./methods#getvideodevices)
- [getAudioDevices()](./methods#getaudiodevices)
- [checkPermissions()](./methods#checkpermissions)
- [setAudioDeviceChangeListener()](./methods#setaudiodevicechangelistener)
- [setSelectedAudioDevice()](./methods#setselectedaudiodevice)
- [setSelectedVideoDevice()](./methods#setselectedvideodevice)
## VideoSDK Events
- [onAudioDeviceChanged](./events.md#onaudiodevicechanged)
--- --- sidebar_position: 1 sidebar_label: Methods pagination_label: VideoSDK Class Methods title: VideoSDK Class Methods --- # VideoSDK Class Methods - Android
### initialize()

To initialize the meeting, first you have to initialize the `VideoSDK`. You can initialize the `VideoSDK` using the `initialize()` method provided by the SDK.

#### Parameters

- **context**: Context

#### Returns

- _`void`_

```js title="initialize"
VideoSDK.initialize(Context context)
```

---

### config()

By using the `config()` method, you can set the `token` property of the `VideoSDK` class.

Please refer to this [documentation](/api-reference/realtime-communication/intro/) to generate a token.

#### Parameters

- **token**: String

#### Returns

- _`void`_

```js title="config"
VideoSDK.config(String token)
```

---

### initMeeting()

- Initialize the meeting using a factory method provided by the SDK called `initMeeting()`.

- `initMeeting()` will generate a new [`Meeting`](../meeting-class/introduction.md) class and the initiated meeting will be returned.

```js title="initMeeting"
VideoSDK.initMeeting(
    Context context,
    String meetingId,
    String name,
    boolean micEnabled,
    boolean webcamEnabled,
    String participantId,
    String mode,
    boolean multiStream,
    Map customTracks,
    JSONObject metaData,
    String signalingBaseUrl,
    PreferredProtocol preferredProtocol
)
```

- Please refer to this [documentation](../initMeeting.md#initmeeting) to know more about `initMeeting()`.

---

### getDevices()

- The `getDevices()` method returns the currently available media devices, such as microphones, cameras, headsets, and so forth. The method returns a set of `DeviceInfo` objects describing the devices.

- The `DeviceInfo` class has four properties:

  1. `DeviceInfo.deviceId`

     - Returns a string that is an identifier for the represented device, persisted across sessions.

  2. `DeviceInfo.label`

     - Returns a string describing this device (for example `BLUETOOTH`).

  3. `DeviceInfo.kind`

     - Returns an enumerated value that is either `video` or `audio`.

  4. `DeviceInfo.facingMode`

     - Returns a value of type `FacingMode` indicating which camera device is in use (front or back).
#### Returns

- `Set<DeviceInfo>`

#### Example

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

```javascript
val devices: Set<DeviceInfo> = VideoSDK.getDevices()
for (deviceInfo in devices) {
    Log.d("VideoSDK", "Device's DeviceId " + deviceInfo.deviceId)
    Log.d("VideoSDK", "Device's Label " + deviceInfo.label)
    Log.d("VideoSDK", "Device's Kind " + deviceInfo.kind)
    Log.d("VideoSDK", "Device's Facing Mode " + deviceInfo.facingMode) // Value will be null for audio devices
}
```

```javascript
Set<DeviceInfo> devices = VideoSDK.getDevices();
for (DeviceInfo deviceInfo : devices) {
    Log.d("VideoSDK", "Device's DeviceId " + deviceInfo.getDeviceId());
    Log.d("VideoSDK", "Device's Label " + deviceInfo.getLabel());
    Log.d("VideoSDK", "Device's Kind " + deviceInfo.getKind());
    Log.d("VideoSDK", "Device's Facing Mode " + deviceInfo.getFacingMode()); // Value will be null for audio devices
}
```

---

### getVideoDevices()

- The `getVideoDevices()` method returns the currently available video devices. The method returns a set of `VideoDeviceInfo` objects describing the video devices.

- The `VideoDeviceInfo` class has four properties:

  1. `VideoDeviceInfo.deviceId`

     - Returns a string that is an identifier for the represented device, persisted across sessions.

  2. `VideoDeviceInfo.label`

     - Returns a string describing this device (for example `BLUETOOTH`).

  3. `VideoDeviceInfo.kind`

     - Returns an enumerated value that is `video`.

  4. `VideoDeviceInfo.facingMode`

     - Returns a value of type `FacingMode` indicating which camera device is in use (front or back).
#### Returns

- `Set<VideoDeviceInfo>`

#### Example

```js
val videoDevices: Set<VideoDeviceInfo> = VideoSDK.getVideoDevices()
for (videoDevice in videoDevices) {
    Log.d("VideoSDK", "Video Device's DeviceId " + videoDevice.deviceId)
    Log.d("VideoSDK", "Video Device's Label " + videoDevice.label)
    Log.d("VideoSDK", "Video Device's Kind " + videoDevice.kind)
}
```

```js
Set<VideoDeviceInfo> videoDevices = VideoSDK.getVideoDevices();
for (VideoDeviceInfo videoDevice : videoDevices) {
    Log.d("VideoSDK", "Video Device's DeviceId " + videoDevice.getDeviceId());
    Log.d("VideoSDK", "Video Device's Label " + videoDevice.getLabel());
    Log.d("VideoSDK", "Video Device's Kind " + videoDevice.getKind());
}
```

---

### getAudioDevices()

- The `getAudioDevices()` method returns the currently available audio devices. The method returns a set of `AudioDeviceInfo` objects describing the audio devices.

- The `AudioDeviceInfo` class has three properties:

  1. `AudioDeviceInfo.deviceId`

     - Returns a string that is an identifier for the represented device, persisted across sessions.

  2. `AudioDeviceInfo.label`

     - Returns a string describing this device (for example `BLUETOOTH`).

  3. `AudioDeviceInfo.kind`

     - Returns an enumerated value that is `audio`.

#### Returns

- `Set<AudioDeviceInfo>`

#### Example

```js
val audioDevices: Set<AudioDeviceInfo> = VideoSDK.getAudioDevices()
for (audioDevice in audioDevices) {
    Log.d("VideoSDK", "Audio Device's DeviceId " + audioDevice.deviceId)
    Log.d("VideoSDK", "Audio Device's Label " + audioDevice.label)
    Log.d("VideoSDK", "Audio Device's Kind " + audioDevice.kind)
}
```

```js
Set<AudioDeviceInfo> audioDevices = VideoSDK.getAudioDevices();
for (AudioDeviceInfo audioDevice : audioDevices) {
    Log.d("VideoSDK", "Audio Device's DeviceId " + audioDevice.getDeviceId());
    Log.d("VideoSDK", "Audio Device's Label " + audioDevice.getLabel());
    Log.d("VideoSDK", "Audio Device's Kind " + audioDevice.getKind());
}
```

---

### setAudioDeviceChangeListener()

- The `AudioDeviceChangeEvent` is emitted when an audio device is connected to or removed from the device.
This event can be set by using the `setAudioDeviceChangeListener()` method.

#### Parameters

- **audioDeviceChangeEvent**: AudioDeviceChangeEvent

#### Returns

- _`void`_

#### Example

```javascript
VideoSDK.setAudioDeviceChangeListener { selectedAudioDevice: AudioDeviceInfo, audioDevices: Set<AudioDeviceInfo> ->
    Log.d("VideoSDK", "Selected Audio Device: " + selectedAudioDevice.label)
    for (audioDevice in audioDevices) {
        Log.d("VideoSDK", "Audio Devices" + audioDevice.label)
    }
}
```

```javascript
VideoSDK.setAudioDeviceChangeListener((selectedAudioDevice, audioDevices) -> {
    Log.d("VideoSDK", "Selected Audio Device: " + selectedAudioDevice.getLabel());
    for (AudioDeviceInfo audioDevice : audioDevices) {
        Log.d("VideoSDK", "Audio Devices" + audioDevice.getLabel());
    }
});
```

---

### checkPermissions()

- The `checkPermissions()` method verifies whether permissions to access camera and microphone devices have been granted. If the required permissions are not granted, the method requests them from the user.

#### Parameters

- context
  - type: `Context`
  - `REQUIRED`
  - The Android context.

- permission
  - type: `List<Permission>`
  - `REQUIRED`
  - The permissions to be requested.

- permissionHandler
  - type: `PermissionHandler`
  - `REQUIRED`
  - The permission handler object for handling callbacks of various user actions such as permission granted, permission denied, etc.

- rationale
  - type: `String`
  - `OPTIONAL`
  - Explanation to be shown to the user if they have denied permission earlier. If this parameter is not provided, permissions will be requested without showing the rationale dialog.

- options
  - type: `Permissions.Options`
  - `OPTIONAL`
  - The options object for setting the title and description of the dialog box that prompts users to manually grant permissions by navigating to device settings. If this parameter is not provided, the default title and description will be used for the dialog box.
#### Returns - _`void`_ #### Example ```js private val permissionHandler: PermissionHandler = object : PermissionHandler() { override fun onGranted() {} override fun onBlocked( context: Context, blockedList: java.util.ArrayList ): Boolean { for (blockedPermission in blockedList) { Log.d("VideoSDK Permission", "onBlocked: $blockedPermission") } return super.onBlocked(context, blockedList) } override fun onDenied( context: Context, deniedPermissions: java.util.ArrayList ) { for (deniedPermission in deniedPermissions) { Log.d("VideoSDK Permission", "onDenied: $deniedPermission") } super.onDenied(context, deniedPermissions) } override fun onJustBlocked( context: Context, justBlockedList: java.util.ArrayList, deniedPermissions: java.util.ArrayList ) { for (justBlockedPermission in justBlockedList) { Log.d("VideoSDK Permission", "onJustBlocked: $justBlockedPermission") } super.onJustBlocked(context, justBlockedList, deniedPermissions) } } val permissionList: MutableList = ArrayList() permissionList.add(Permission.audio) permissionList.add(Permission.video) permissionList.add(Permission.bluetooth) val rationale = "Please provide permissions" val options = Permissions.Options().setRationaleDialogTitle("Info").setSettingsDialogTitle("Warning") //If you wish to disable the dialog box that prompts //users to manually grant permissions by navigating to device settings, //you can set options.sendDontAskAgainToSettings(false) VideoSDK.checkPermissions(this, permissionList, rationale, options, permissionHandler) ``` ```js private final PermissionHandler permissionHandler = new PermissionHandler() { @Override public void onGranted() { } @Override public boolean onBlocked(Context context, ArrayList blockedList) { for (Permission blockedPermission : blockedList) { Log.d("VideoSDK Permission", "onBlocked: " + blockedPermission); } return super.onBlocked(context, blockedList); } @Override public void onDenied(Context context, ArrayList deniedPermissions) { for (Permission 
deniedPermission : deniedPermissions) { Log.d("VideoSDK Permission", "onDenied: " + deniedPermission); } super.onDenied(context, deniedPermissions); } @Override public void onJustBlocked(Context context, ArrayList justBlockedList, ArrayList deniedPermissions) { for (Permission justBlockedPermission : justBlockedList) { Log.d("VideoSDK Permission", "onJustBlocked: " + justBlockedPermission); } super.onJustBlocked(context, justBlockedList, deniedPermissions); } }; List permissionList = new ArrayList<>(); permissionList.add(Permission.audio); permissionList.add(Permission.video); permissionList.add(Permission.bluetooth); String rationale = "Please provide permissions"; Permissions.Options options = new Permissions.Options().setRationaleDialogTitle("Info").setSettingsDialogTitle("Warning"); //If you wish to disable the dialog box that prompts //users to manually grant permissions by navigating to device settings, //you can set options.sendDontAskAgainToSettings(false) VideoSDK.checkPermissions(this, permissionList, rationale, options, permissionHandler); ``` --- ### setSelectedAudioDevice() - It sets the selected audio device, allowing the user to specify which audio device to use in the meeting. #### Parameters - **selectedAudioDevice**: AudioDeviceInfo #### Returns - _`void`_ #### Example ```js val audioDevices: Set = VideoSDK.getAudioDevices() val audioDeviceInfo: AudioDeviceInfo = audioDevices.toTypedArray().get(0) as AudioDeviceInfo VideoSDK.setSelectedAudioDevice(audioDeviceInfo) ``` ```js Set audioDevices = VideoSDK.getAudioDevices(); AudioDeviceInfo audioDeviceInfo = (AudioDeviceInfo) audioDevices.toArray()[0]; VideoSDK.setSelectedAudioDevice(audioDeviceInfo); ``` --- ### setSelectedVideoDevice() - It sets the selected video device, allowing the user to specify which video device to use in the meeting. 
#### Parameters

- **selectedVideoDevice**: VideoDeviceInfo

#### Returns

- _`void`_

#### Example

```js
val videoDevices: Set<VideoDeviceInfo> = VideoSDK.getVideoDevices()
val videoDeviceInfo: VideoDeviceInfo = videoDevices.toTypedArray().get(1) as VideoDeviceInfo
VideoSDK.setSelectedVideoDevice(videoDeviceInfo)
```

```js
Set<VideoDeviceInfo> videoDevices = VideoSDK.getVideoDevices();
VideoDeviceInfo videoDeviceInfo = (VideoDeviceInfo) videoDevices.toArray()[1];
VideoSDK.setSelectedVideoDevice(videoDeviceInfo);
```

---

### applyVideoProcessor()

- This method allows users to dynamically apply a virtual background to their video stream during a live session.

#### Parameters

- videoFrameProcessor
  - type: `VideoFrameProcessor`
  - This is an object of the `VideoFrameProcessor` class, which overrides the `onFrameCaptured(VideoFrame videoFrame)` method.

#### Returns

- _`void`_

#### Example

```js
val uri = Uri.parse("https://st.depositphotos.com/2605379/52364/i/450/depositphotos_523648932-stock-photo-concrete-rooftop-night-city-view.jpg")
val backgroundImageProcessor = BackgroundImageProcessor(uri)
VideoSDK.applyVideoProcessor(backgroundImageProcessor)
```

```java
Uri uri = Uri.parse("https://st.depositphotos.com/2605379/52364/i/450/depositphotos_523648932-stock-photo-concrete-rooftop-night-city-view.jpg");
BackgroundImageProcessor backgroundImageProcessor = new BackgroundImageProcessor(uri);
VideoSDK.applyVideoProcessor(backgroundImageProcessor);
```

---

### removeVideoProcessor()

- This method reverts the video background to its original state, removing any previously applied virtual background.

#### Returns

- _`void`_

#### Example

```js
VideoSDK.removeVideoProcessor();
```
--- --- sidebar_position: 1 sidebar_label: Properties pagination_label: VideoSDK Class Properties title: VideoSDK Class Properties --- # VideoSDK Class Properties - Android
### getSelectedAudioDevice() - type: `AudioDeviceInfo` - The `getSelectedAudioDevice()` method will return the object of the audio device, which is currently in use. --- ### getSelectedVideoDevice() - type: `VideoDeviceInfo` - The `getSelectedVideoDevice()` method will return the object of the video device, which is currently in use.
--- ### 5.Errors associated with Media These errors involve media access, device availability, or permission-related issues affecting camera, microphone, and screen sharing. ### Device access-related errors | Type | Code | Message | |-------------------------------------------|-------|--------------------------------------------------| | ERROR_CAMERA_ACCESS | 3002 | Something went wrong. Unable to access camera. | | ERROR_MIC_ACCESS_DENIED | 3003 | It seems like microphone access was denied or dismissed. To proceed, kindly grant access through your device's settings. | | ERROR_CAMERA_ACCESS_DENIED | 3004 | It seems like camera access was denied or dismissed. To proceed, kindly grant access through your device's settings. | ### 6.Errors associated with Track These errors occur when there are issues with video or audio tracks, such as disconnections or invalid custom tracks. | Type | Code | Message | |-----------------------------|-------|--------------------------------------------------| | ERROR_CUSTOM_SCREEN_SHARE_TRACK_ENDED | 3005 | The provided custom track is in an ended state. Please try again with new custom track. | | ERROR_CUSTOM_SCREEN_SHARE_TRACK_DISPOSED | 3006 | The provided custom track was disposed. Please try again with new custom track. | | ERROR_CHANGE_WEBCAM | 3007 | Something went wrong, and the camera could not be changed. Please try again. | ### 7.Errors associated with Actions Below error is triggered when an action is attempted before joining a meeting. | Type | Code | Message | |-----------------------------|-------|--------------------------------------------------| | ERROR_ACTION_PERFORMED_BEFORE_MEETING_JOINED | 3001 | Oops! Something went wrong. The room was in a connecting state, and during that time, an action encountered an issue. Please try again after joining a meeting. 
| --- --- sidebar_label: App Size Optimization pagination_label: App Size Optimization --- # App Size Optimization - Android This guide is designed to help developers optimize app size, enhancing performance and efficiency across different devices. By following these best practices, you can reduce load times, minimize storage requirements, and deliver a more seamless experience to users, all while preserving essential functionality. ## Deliver Leaner Apps with App Bundles Using Android App Bundles (AAB) is an effective way to optimize the size of your application, making it lighter and more efficient for users to download and install. App Bundles allow Google Play to dynamically generate APKs tailored to each device, so users only download the resources and code relevant to their specific configuration. This approach reduces app size significantly, leading to faster installs and conserving storage space on users’ devices. Recommended Practices: - `Enable App Bundles`: Configure your build to use the App Bundle format instead of APKs. This will allow Google Play to optimize your app for each device type automatically. - `Organize Resources by Device Type`: Ensure that resources (like images and layouts) are organized by device type (such as screen density or language) to maximize the benefits of App Bundles. - `Test Modularization`: If your app contains large, optional features, use dynamic feature modules to let users download them on demand. This reduces the initial download size and provides features only as needed. - `Monitor Size Reductions`: Regularly analyze your app size to see where the most savings occur, and make sure that App Bundle optimizations are effectively reducing your app size across different device configurations. ### Optimize Libraries for a Leaner App Experience Managing dependencies carefully is essential for minimizing app size and improving performance. 
Every library or dependency included in your app adds to its overall size, so it’s crucial to only incorporate what’s necessary. Optimizing dependencies helps streamline your app, reduce load times, and enhance maintainability. Recommended Practices: - `Use Only Essential Libraries`: Review all libraries and dependencies, removing any that are not critical to your app’s functionality. This helps avoid unnecessary bloat. - `Leverage Lightweight Alternatives`: Whenever possible, choose lightweight libraries or modularized versions of larger ones. For example, opt for a specific feature module rather than including an entire library. - `Monitor Library Updates`: Regularly update your dependencies to take advantage of any optimizations or size reductions made by the library maintainers. Newer versions are often more efficient. - `Minimize Native Libraries`: If your app uses native libraries, ensure they’re essential and compatible across platforms, as they can significantly increase app size. - `Analyze Dependency Tree`: Use tools like Gradle’s dependency analyzer to identify unnecessary or redundant dependencies, ensuring your app’s dependency tree is as lean as possible. ### Optimize with ProGuard ProGuard is a powerful tool for shrinking, optimizing, and obfuscating your code, which can significantly reduce your app's size and improve performance. By removing unused code and reducing the size of classes, fields, and methods, ProGuard helps to minimize the footprint of your app without sacrificing functionality. Additionally, ProGuard’s obfuscation feature enhances security by making reverse engineering more difficult. You can refer to the official [documentation](https://developer.android.com/build/shrink-code) for more information. 
Recommended Practices: - `Enable ProGuard`: To enable ProGuard in your project, ensure that your `proguard-rules.pro` file is properly configured, and add the following lines to your `build.gradle` file: ```js title="build.gradle" buildTypes { release { minifyEnabled true proguardFiles getDefaultProguardFile('proguard-android-optimize.txt'), 'proguard-rules.pro' } } ``` - `Customize ProGuard Rules`: Carefully review and customize ProGuard rules in the proguard-rules.pro file to avoid stripping essential code. For example, to keep a specific class, add: ```js -keep class com.example.myapp.MyClass { *; } ``` If you encounter an issue after enabling ProGuard rules, refer to our [known issues section](https://docs.videosdk.live/android/guide/video-and-audio-calling-api-sdk/known-issues). --- --- sidebar_label: Developer Experience Guidelines pagination_label: Developer Experience Guidelines --- # Developer Experience Guidelines - Android This section provides best practices for creating a smooth and efficient development process when working with VideoSDK. From handling errors gracefully to managing resources and event subscriptions, these guidelines help developers build more reliable and maintainable applications. Following these practices can simplify troubleshooting, prevent common pitfalls, and improve overall application performance. ### Initiate Key Features After Meeting Join Event To provide a seamless and reliable meeting experience, initiate specific features **only** after the [onMeetingJoined()](https://docs.videosdk.live/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onmeetingjoined) event has been triggered. - **Trigger Key Actions After Joining the Meeting** : Initiating crucial actions after the `onMeetingJoined()` event helps avoid errors and optimizes the meeting setup, ensuring a smoother experience for participants. 
If your application uses any of the following features, or needs to perform an action as soon as the meeting is joined, invoke them only after the meeting has been joined successfully:

- `Chat Subscription`: To enable in-meeting chat functionality, subscribe to the chat topic after the `onMeetingJoined()` event is triggered. This ensures that messages are reliably received by participants.
- `Device Management`: If you need users to use specific audio or video devices when the meeting is first joined, you can utilize the [`setSelectedAudioDevice()`](https://docs.videosdk.live/android/api/sdk-reference/videosdk-class/methods#setselectedaudiodevice) and [`setSelectedVideoDevice()`](https://docs.videosdk.live/android/api/sdk-reference/videosdk-class/methods#setselectedvideodevice) methods of `VideoSDK` class. - `Recording and Transcription`: To automatically start recording or transcription as soon as the meeting begins, configure the `autoStartConfig` in the `createMeeting` API. For detailed information, refer to the documentation [here](https://docs.videosdk.live/api-reference/realtime-communication/create-room#autoCloseConfig). ### Dispose Custom Tracks When Necessary Proper disposal of custom tracks is essential for managing system resources and ensuring a smooth experience. In most scenarios, tracks are automatically disposed of by the SDK, ensuring efficient resource management. However, in specific cases outlined below, you will need to dispose of custom tracks explicitly: 1. **When Enabling/Disabling the Camera on a Precall Screen**: - If your application includes a precall screen and you want to ensure that the device's camera is not used when the camera is disabled, you must dispose of the custom video track. Otherwise, the device’s camera will continue to be used even when the camera is off. - Additionally, remember to create a new track when the user enables the camera again. - If you don’t need to manage the camera's usage on the device level (i.e., you’re okay with the camera being used whether it’s enabled or disabled), you can skip this step. 
- Here's how you can manage a custom track on a precall screen:

```js
import android.view.View
import androidx.appcompat.app.AppCompatActivity
import live.videosdk.rtc.android.CustomStreamTrack
import live.videosdk.rtc.android.VideoSDK
import live.videosdk.rtc.android.VideoView

class JoinActivity : AppCompatActivity() {
    private var videoTrack: CustomStreamTrack? = null
    private var joinView: VideoView? = null
    private var isWebcamEnabled = false

    private fun toggleWebcam(videoDevice: VideoDeviceInfo?) {
        if (isWebcamEnabled) {
            // Check that the track state is LIVE
            if (videoTrack?.track?.state()?.equals("LIVE") == true) {
                // Disable first: a disposed track can no longer be modified
                videoTrack?.track?.setEnabled(false) // Disable the track
                videoTrack?.track?.dispose() // Dispose the current video track
            }
            videoTrack = null
            joinView!!.removeTrack() // Remove the video track from the view
            joinView!!.releaseSurfaceViewRenderer()
            joinView!!.visibility = View.INVISIBLE
        } else {
            // Re-enable the webcam by creating a new track
            videoTrack = VideoSDK.createCameraVideoTrack(
                "h720p_w960p",
                "front",
                CustomStreamTrack.VideoMode.TEXT,
                true,
                this,
                videoDevice // Passes the VideoDeviceInfo object of the user's selected device
            )
            // Display in the local view
            joinView!!.addTrack(videoTrack!!.track as VideoTrack?)
            joinView!!.visibility = View.VISIBLE
        }
        isWebcamEnabled = !isWebcamEnabled // Toggle webcam state
    }
}
```

```js
import android.view.View;
import androidx.appcompat.app.AppCompatActivity;
import live.videosdk.rtc.android.CustomStreamTrack;
import live.videosdk.rtc.android.VideoSDK;
import live.videosdk.rtc.android.VideoView;

public class JoinActivity extends AppCompatActivity {
    private CustomStreamTrack videoTrack = null;
    private VideoView joinView = null;
    private boolean isWebcamEnabled = false;

    private void toggleWebcam(VideoDeviceInfo videoDevice) {
        if (isWebcamEnabled) {
            // Check that the track state is LIVE
            if (videoTrack != null && "LIVE".equals(videoTrack.getTrack().state())) {
                // Disable first: a disposed track can no longer be modified
                videoTrack.getTrack().setEnabled(false); // Disable the track
                videoTrack.getTrack().dispose(); // Dispose the current video track
            }
            videoTrack = null;
            joinView.removeTrack(); // Remove the video track from the view
            joinView.releaseSurfaceViewRenderer();
            joinView.setVisibility(View.INVISIBLE);
        } else {
            // Re-enable the webcam by creating a new track
            videoTrack = VideoSDK.createCameraVideoTrack(
                "h720p_w960p",
                "front",
                CustomStreamTrack.VideoMode.TEXT,
                true,
                this,
                videoDevice // Passes the VideoDeviceInfo object of the user's selected device
            );
            // Display in the local view
            joinView.addTrack((VideoTrack) videoTrack.getTrack());
            joinView.setVisibility(View.VISIBLE);
        }
        isWebcamEnabled = !isWebcamEnabled; // Toggle webcam state
    }
}
```

## Listen for Error Events

Listening to error events enables your application to handle unexpected issues efficiently, providing users with clear feedback and potential solutions. Error codes pinpoint specific problems, whether from configuration settings, account restrictions, permission limitations, or device constraints.

Here are recommended solutions based on common error categories:

1.
[Errors associated with Organization](../../api/sdk-reference/error-codes#1-errors-associated-with-organization): If you encounter errors related to your organization (e.g., account status or participant limits), reach out to support at support@videosdk.live or reach out to us on [Discord](https://discord.com/invite/Qfm8j4YAUJ) for assistance. 2. [Errors associated with Token](../../api/sdk-reference/error-codes#2-errors-associated-with-token): For errors related to authentication tokens, ensure the token is valid and hasn’t expired, then try the request again. 3. [Errors associated with Meeting and Participant](../../api/sdk-reference/error-codes#3-errors-associated-with-meeting-and-participant): Check that meetingId and participantId are correctly passed and valid. Also, ensure each participant has a unique participantId to avoid duplicate entries. 4. [Errors associated with Add-on Service](../../api/sdk-reference/error-codes#4-errors-associated-with-add-on-service): If you encounter errors with add-on services (such as recording or streaming), try restarting the service after receiving a failure event. For example, if a `START_RECORDING_FAILED` error event occurs, attempt to call the `startRecording()` method again. If you're using webhooks, you can also retry on [recording-failed](https://docs.videosdk.live/api-reference/realtime-communication/user-webhooks#recording-failed) hook. 5. [Errors associated with Media](../../api/sdk-reference/error-codes#5errors-associated-with-media): Inform the user about media access issues, such as microphone or camera permissions. Design the UI to clearly indicate what is preventing the mic or camera from enabling, helping the user understand the problem. 6. [Errors associated with Track](../../api/sdk-reference/error-codes#6errors-associated-with-track): Ensure that the track you’ve created and passed to enable the mic or camera methods meets the required specifications. 7. 
[Errors associated with Actions](../../api/sdk-reference/error-codes#7errors-associated-with-actions): If you need to perform actions as soon as a meeting joins, only initiate them after receiving the [onMeetingJoined()](https://docs.videosdk.live/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onmeetingjoined) event, otherwise it will not work well. - Here's how to listen for the error event: ```js private val meetingEventListener: MeetingEventListener = object : MeetingEventListener() { //.. override fun onError(error: JSONObject) { try { val errorCodes: JSONObject = VideoSDK.getErrorCodes() val code = error.getInt("code") Log.d("#error", "Error is: " + error["message"]) } catch (e: Exception) { e.printStackTrace() } } } ``` ```js private final MeetingEventListener meetingEventListener = new MeetingEventListener() { //.. @Override public void onError(JSONObject error) { try { JSONObject errorCodes = VideoSDK.getErrorCodes(); int code = error.getInt("code"); Log.d("#error", "Error is: " + error.get("message")); } catch (Exception e) { e.printStackTrace(); } } }; ``` --- --- sidebar_label: Handle Large Rooms pagination_label: Handle Large Rooms --- # Handle Large Rooms - Android Managing large meetings requires specific strategies to ensure performance, stability, and a seamless user experience. This section provides best practices for optimizing VideoSDK applications to handle high participant volumes effectively. By implementing these recommendations, you can reduce lag, maintain video and audio quality, and provide a smooth experience even in large rooms. ## User Interface Optimization When hosting large meetings, an optimized UI helps manage participant visibility and ensures smooth performance. Recommended Practices: - `Limit Visible Participants`: Display only a limited number of participants on screen at any given time, adapting the view based on screen size. 
Use pagination to allow users to browse or switch between additional participants seamlessly. For example, you could display only users whose video stream is enabled, or you could choose to display all active speakers. This approach helps manage screen space efficiently, ensuring that the most relevant participants are visible without overwhelming the interface. - `Prioritize Active Speakers`: Ensure all active speakers are displayed on the screen to highlight who is currently talking, helping participants stay engaged and aware of ongoing discussions. To identify which participant is speaking, you can use the [onSpeakerChanged()](https://docs.videosdk.live/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onspeakerchanged) event. ## Optimizing Media Streams In large video calls, it’s important to manage media streams effectively to optimize system resources while maintaining a smooth user experience. Recommended Practices: - `Pause Streams for Non-Visible Participants`: To optimize performance, pause the video and/or audio streams of participants who are not currently visible on the screen. This reduces unnecessary resource consumption. - `Resume Streams When Visible`: Once a participant comes into view, resume their stream to provide an uninterrupted experience as they appear on the screen. For detailed setup instructions on how to achieve this, check out our in-depth documentation [here](https://docs.videosdk.live/android/guide/video-and-audio-calling-api-sdk/render-media/layout-and-grid-management#pauseresume-stream). ## Media Stream Quality Adjustment In large meetings, managing media stream quality is essential to balance performance and user experience. Recommended Practices: - `High Quality for Active Speakers`: For all active speakers, set the video stream quality to a higher level using the setQuality method (e.g., `setQuality("high")`). 
This ensures that participants will receive higher-quality video for active speakers, providing a clearer and more engaging experience. - `Lower Quality for Non-Speaking Participants`: For other participants who are not actively speaking, set their video stream quality to a lower level (e.g., `setQuality("low")`). This helps conserve bandwidth and system resources while maintaining overall meeting performance. Checkout the documentation for `setQuality()` method [here](https://docs.videosdk.live/android/api/sdk-reference/participant-class/methods#setquality) --- --- sidebar_label: User Experience Guidelines pagination_label: User Experience Guidelines --- # User Experience Guidelines - Android This guide aims to help developers optimize the user experience and functionality of video conferencing applications with VideoSDK. By following these best practices, you can create smoother interactions, minimize common issues, and deliver a more reliable experience for users. Here are our recommended best practices to enhance the user experience in your application: | **Section** | **Description** | |--------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------| | [Configure Precall for Effortless Meeting Join](#configure-precall-for-effortless-meeting-join) | Users may enter meetings unprepared due to device or connection issues. A Precall setup can help them configure devices and settings beforehand for a smooth start. | | [Listen Key Events for Optimal User Experience](#listen-key-events-for-optimal-user-experience) | Users can feel lost without real-time updates on meeting status, events, and errors. Event monitoring and notifications keep them informed and engaged. | | [Handling Media Devices](#handling-media-devices) | Users may want to change their audio or video setup mid-meeting but struggle to manage device controls. 
Providing easy device switching enhances control and flexibility. |
| [Monitoring Real-Time Participant Statistics](#monitoring-real-time-participant-statistics) | Poor video or audio quality without real-time feedback leaves users frustrated. Real-time stats let them assess connection quality and troubleshoot issues actively. |

## Configure Precall for Effortless Meeting Join

A Precall step is crucial for ensuring users are set up correctly and have no device issues before joining a meeting. This step allows users to configure their devices and settings before entering a meeting, leading to a smoother experience and minimizing technical issues once the call begins.

Recommended Practices:

- `Request Permissions`: Prompt users to grant microphone and camera permissions before entering the meeting, ensuring seamless access to their devices.
- `Device Selection`: Allow users to select their preferred camera and microphone, giving them control over their setup from the start.
- `Entry Preferences`: Provide options to join with the microphone and camera either on or off, letting users choose their level of engagement upon entry.
- `Camera Preview`: Show a live camera preview, allowing users to adjust angles and lighting to ensure they appear clearly and professionally.
- `Virtual Backgrounds`: Allow users to choose from different virtual backgrounds or enter with a virtual background enabled, enhancing privacy and creating a more polished appearance.

For detailed setup instructions on each of these features, check out our in-depth documentation [here](https://docs.videosdk.live/android/guide/video-and-audio-calling-api-sdk/setup-call/precall).
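
The `Request Permissions` and `Entry Preferences` practices interact: a device can only start enabled if the user asked for it and the app is allowed to use it. The following is a minimal sketch of that rule in plain Java. `PrecallState` and all of its members are hypothetical names chosen for illustration; they are not part of the VideoSDK API, and the sketch leaves out how permissions are actually requested on Android.

```java
/** Hypothetical precall model; none of these names come from the VideoSDK API. */
final class PrecallState {
    final boolean micPermissionGranted;
    final boolean cameraPermissionGranted;
    final boolean wantsMicOn;
    final boolean wantsCameraOn;

    PrecallState(boolean micPermissionGranted, boolean cameraPermissionGranted,
                 boolean wantsMicOn, boolean wantsCameraOn) {
        this.micPermissionGranted = micPermissionGranted;
        this.cameraPermissionGranted = cameraPermissionGranted;
        this.wantsMicOn = wantsMicOn;
        this.wantsCameraOn = wantsCameraOn;
    }

    /** The mic starts enabled only if the user chose it and permission was granted. */
    boolean joinWithMicOn() {
        return wantsMicOn && micPermissionGranted;
    }

    /** The camera starts enabled only if the user chose it and permission was granted. */
    boolean joinWithCameraOn() {
        return wantsCameraOn && cameraPermissionGranted;
    }

    /** True when the user asked for a device the app may not use yet, so the UI should prompt. */
    boolean needsPermissionPrompt() {
        return (wantsMicOn && !micPermissionGranted)
            || (wantsCameraOn && !cameraPermissionGranted);
    }
}
```

With a state such as `new PrecallState(true, false, true, true)`, the user would join with the microphone on and the camera off, and the precall screen would know to ask for camera permission before the join button is pressed.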
## Monitor Key Events for Optimal User Experience Listening for crucial events is vital for providing users with a responsive and engaging experience in your application. By effectively managing state changes and user notifications, you can keep participants informed and enhance their overall experience during meetings. Recommended Practices: - `Monitor State Change Events`: Listen for state change events, such as `onMeetingStateChanged` and `onRecordingStateChanged`, and notify users promptly about these transitions. Keeping users informed helps them understand the current state of the meeting.
- `UI Handling on Event Trigger`: Update the user interface only in response to specific events. For instance, display that the meeting is recording only when the `onRecordingStateChanged` event with the status `RECORDING_STARTED` is received, rather than when the record button is clicked. This ensures users receive accurate and timely updates.
- `Notify Participants of Join/Leave Events`: Keep users informed about participant activity by notifying them when someone joins or leaves the meeting. This fosters a sense of presence and awareness of who is currently available to engage.
- `Listen for Error Events`: It is crucial to monitor error events and notify users promptly when issues arise. Clear communication about errors can help users troubleshoot and address problems quickly, minimizing disruptions to the meeting.

## Handling Media Devices

Providing seamless control over devices enhances user convenience and allows participants to adjust their setup for the best meeting experience. Proper device management within the UI also helps users stay informed about their current settings and troubleshoot issues effectively.

Recommended Practices:

- `Allow Device Switching`: Provide users with the option to switch between available microphone and camera devices during the meeting. This flexibility is essential, especially if users want to adjust their setup mid-call.
- `Display Selected Devices`: Ensure the UI shows users which microphone and camera devices are currently selected. Clear device labeling in the interface can reduce confusion and help users verify their setup at a glance.
## Monitoring Real-Time Participant Statistics Providing real-time insights into stream quality allows participants to monitor and optimize their connection for the best experience. With detailed metrics on video, audio, and screen sharing, users can assess and troubleshoot quality issues, ensuring smooth and uninterrupted meetings. To display these statistics, you can use the [getVideoStats()](https://docs.videosdk.live/android/api/sdk-reference/participant-class/methods#getvideostats), [getAudioStats()](https://docs.videosdk.live/android/api/sdk-reference/participant-class/methods#getaudiostats), and [getShareStats()](https://docs.videosdk.live/android/api/sdk-reference/participant-class/methods#getsharestats) methods. import ReactPlayer from 'react-player'

:::note To show the popup dialog for the participant's realtime stats, you can [refer to this function](https://github.com/videosdk-live/videosdk-rtc-android-kotlin-sdk-example/blob/main/app/src/main/java/live/videosdk/rtc/android/kotlin/Common/Utils/HelperClass.kt#L91). ::: --- --- sidebar_label: Face Match API pagination_label: Face Match API --- # Face Match API import FaceMatchApi from '../../../mdx/\_api-face-match.mdx'; --- --- sidebar_label: Face Spoof Detection API pagination_label: Face Spoof Detection API --- # Face Spoof Detection import FaceSpoofDetection from '../../../mdx/\_api-spoof-detection.mdx'; --- --- sidebar_label: Number of Face Detection API pagination_label: Number of Face Detection API --- # Number of Face Detection import NoOfFaceDetectionApi from '../../../mdx/\_api-number-of-face-detection.mdx'; --- --- sidebar_label: OCR API pagination_label: OCR API --- # OCR API import OcrApi from '../../../mdx/\_api-ocr.mdx'; --- --- title: Understanding Analytics Dashboard | Video SDK hide_title: true hide_table_of_contents: false description: Learn how to access and use VideoSDK's Analytics Dashboard to optimize session performance and diagnose issues. sidebar_label: Understanding Analytics Dashboard pagination_label: Understanding Analytics Dashboard keywords: - audio calling - video calling - real-time communication - collaboration image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: understanding-analytics-dashboard --- # Understanding Analytics Dashboard - Android Welcome to the world of actionable insights and empowered decision-making. VideoSDK's session analytics dashboard is your gateway to understanding, optimizing, and elevating every aspect of your sessions. ## Accessing Analytics Made Easy Navigating through session data is a breeze with VideoSDK. Simply head to your session page at https://app.videosdk.live/meetings/sessions, where a treasure trove of session information awaits. ### How to Access Analytics? 
Open Session Analytics effortlessly by following these steps: 1. **Click on Meeting-ID:** - Directly access analytics by clicking on the meeting-ID within the session table. 2. **Hover and Click View Analytics:** - Hover over a specific meeting row to reveal the **View Analytics** button in the Actions Column. - Click on **View Analytics** to open the Session Overview sidebar, unlocking a wealth of insights. ![Access Analytics](https://cdn.videosdk.live/website-resources/docs-resources/access_analytics.png) --- There are three tabs available within the session analytics view: ## **1. Session Overview** This tab provides an overview of the session, including its duration and participant details. You can explore individual participant statistics to understand their session performance better. ## **2. Errors** In this tab, you can find information about any errors encountered during the session. It helps you identify and address issues like network problems or technical glitches promptly. ## **3. Session Stats** Explore data and metrics sent and received by participants to measure performance of the session in this tab. It offers insights into data exchange among participants, including metrics on jitter, RTT, and packet loss, resolutions, and fps aiding in assessing communication efficiency. Let's dive deeper into each of these tabs to gain a better understanding of session analytics. # Session Overview The session overview page is your compass in the sea of data. Here's what you'll uncover: - **Meeting ID:** Unique identifier for the meeting in the format of `abcd-efgh-ijkl`. - **Session ID:** Unique identifier for each session, uniquely identified by its `sessionId`. - **Session Status:** Indicates if the session is ongoing or ended. - **Session Initiating Time:** Time taken by the first participant to establish the connection. - **Start and End Time:** Marks the start and end of the session. 
- **Total Unique Participants Joined:** Total number of unique participants in the session.
- **Total Session Duration:** Overall duration of the session from start to end.
- **Total Participant Minutes:** The sum of the durations of all participants.
- **Recording, HLS, RTMP:** Additional services used in the session.

### Participant Table List

- **Participant ID:** Unique identifier for each participant.
- **Participant Name:** Personalized identification for each participant.
- **Join Time:** Time taken by the participant to establish the connection.
- **Duration:** Total time spent by this participant in the session.

### Efficient Session Management with Actions

Enhance your session management with streamlined actions:

- **Kick Out Participants:** Remove (kick out) participants from ongoing sessions.
- **Detailed Participant Analytics:**
  - Hover over a specific participant row to reveal the `View Stats` button at the end of the row.
  - Click on `View Stats` to open the Participant Overview sidebar.

![View Stats](https://cdn.videosdk.live/website-resources/docs-resources/view_stats.png)

## Explore Participant Insights

Discover valuable participant data that provides a clear view of engagement and experience:

- **Participant ID:** Unique identifier for each participant.
- **Participant Name:** Personalized identification for better interaction.
- **Joined At:** Indicates the precise moment when the participant connected to the session.
- **Left At:** Indicates the precise moment when the participant left the session.
- **Total Duration:** Total time the participant spent in that session.
- **Joining Time:** Time taken by this participant to establish the connection.
- **Location:** Approximate geographic location from which the participant joined.
- **Platform:** Specifies whether the participant is using a desktop or mobile device.
- **Device Info:** Offers details regarding the participant's device.
- **OS:** Provides information about the participant's OS.
- **Browser:** Provides specifics about the participant's browser.
- **SDK Version:** Indicates the version of the SDK used by the participant.

### Understanding Participant Call Health

We've developed call health to offer a rapid assessment of participant performance during calls. This feature shows the performance of audio, video, screen-share audio, and screen-share video separately. The bars are color-coded for clarity: green indicates good performance, orange moderate, and red poor. For detailed insights, simply hover over the bars.

![Participant Call Health](https://cdn.videosdk.live/website-resources/docs-resources/participant_call_health.png)

## Participant Session Stats

Dive into Participant Session Stats for valuable insights into your session experience! Click `View Session Stats` at the bottom of the page to see the data you need to understand audio and video experiences. Within this section, you can observe quality metrics for the selected participant and compare them to others. Additionally, the `Session Stats` tab compares quality stats across every participant. Keep reading to explore more about Session Stats.

**Visualizing Session Performance from Both Sides**

This section provides a two-sided view of your session's metrics.

- **Left Side: Sender Participant Graph** This section displays graphs representing the metrics sent by the sender participant. Here, you can see how the various factors impacted the data you transmitted.
- **Right Side: Receiver Participant Graph** On the right side, you'll find a dropdown menu where you can select a specific participant. Choosing a receiver will display graphs showcasing the metrics **received by that participant**. This allows you to compare the sending experience (left side) with the receiving experience (right side) for different participants.
![VideoSDK Jitter Graph](https://cdn.videosdk.live/website-resources/docs-resources/video_jitter_graph.png) **See How Your Session Performed** These metrics give you a clear picture of your session's quality. Understanding them helps you spot any issues and keep things running smoothly. Let's dive into what each metric means! **Jitter:** Imagine your internet connection like a bumpy road. Jitter is how much those bumps cause your signal to bounce around. Less jitter means a smoother ride for your data (audio and video, threshold ≤ 30ms). **RTT (Latency):** This is how long it takes for data to travel between you and the server you're connected to. Think of it like the time it takes for a message to get delivered – a lower RTT means a faster delivery (affects both audio and video, threshold ≤ 300ms). **Bitrate:** This measures the amount of data flowing through your connection per second. Imagine it like the width of a water pipe – a higher bitrate allows for more data to flow, which can be good for high-quality audio/video or fast downloads. **Packet Loss:** Think of data traveling in tiny packets. Packet loss is when some of those packets get dropped along the way. More packet loss means information might be missing, affecting things like audio/video quality or lag in games (threshold ≤ 5%). **Resolution:** This refers to the sharpness and detail of an image or video. Think of it like the number of pixels on your screen – a higher resolution means a crisper picture (**Video Only**). **FPS (Frames Per Second):** This measures how many images (frames) are displayed on your screen every second. Imagine it like a flipbook – a higher FPS creates a smoother and more fluid animation or video experience (**Video Only**). 
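
The thresholds quoted above (jitter ≤ 30ms, RTT ≤ 300ms, packet loss ≤ 5%) can also be applied in your own code, for example to flag a participant whose connection is degrading. The sketch below is plain Java and assumes you already have the three values as numbers; `CallHealthCheck` and `findViolations` are illustrative names, not part of the VideoSDK API, and how the values are sampled is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: compares sampled stats with the thresholds quoted in this guide. */
final class CallHealthCheck {
    // Thresholds taken from the metric descriptions above.
    static final double MAX_JITTER_MS = 30.0;
    static final double MAX_RTT_MS = 300.0;
    static final double MAX_PACKET_LOSS_PERCENT = 5.0;

    /** Returns the names of the metrics that exceed their threshold (empty if all are within limits). */
    static List<String> findViolations(double jitterMs, double rttMs, double packetLossPercent) {
        List<String> violations = new ArrayList<>();
        if (jitterMs > MAX_JITTER_MS) {
            violations.add("jitter");
        }
        if (rttMs > MAX_RTT_MS) {
            violations.add("rtt");
        }
        if (packetLossPercent > MAX_PACKET_LOSS_PERCENT) {
            violations.add("packetLoss");
        }
        return violations;
    }
}
```

For a hypothetical sample of 45 ms jitter, 180 ms RTT, and 7.5% packet loss, `findViolations` reports jitter and packet loss, while a sample that sits exactly on every threshold is treated as healthy because the limits are inclusive.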
![VideoSDK FPS Graph](https://cdn.videosdk.live/website-resources/docs-resources/video_fps_graph.png) ![VideoSDK Resolution Graph](https://cdn.videosdk.live/website-resources/docs-resources/video_resolution_graph.png) --- # **Investigate Session Errors** Get to the root of smoother sessions by addressing errors directly. Use the Errors tab to explore details of errors encountered during your session. Note: Session errors are visible with JS SDK v0.0.82 or higher, React SDK v0.1.85 or higher, and React Native v0.1.6 or newer versions. - **Participant Name & ID:** Quickly identify the participant associated with each error for swift resolution. - **Error Types:** Understand the different types of errors encountered, such as network issues or connection disruptions. - **Detailed Descriptions:** Access clear explanations of errors to take actionable steps towards solution. ![VideoSDK Error Details](https://cdn.videosdk.live/website-resources/docs-resources/error_new.png) # **Analyse Session Stats** Similar to `Participant Session Stats`, this tab covers quality statistics sent by individual participants. You can select any participant to compare effectively with others. Choose a sender participant from the dropdown menu on the left and select a recipient on the right to compare data over different time frames. This allows you to identify and explore issues within specific durations. This tab covers the same metrics as covered in the `participant session stats`: Jitter, RTT (Latency), Bitrate, Packet-loss, Resolution (Video only), and FPS (Video only). ![VideoSDK Session Stats](https://cdn.videosdk.live/website-resources/docs-resources/session_stats.png) --- --- title: Understanding Call Quality | Video SDK hide_title: true hide_table_of_contents: false description: Learn how factors like bandwidth, latency, and device quality impact your app's call quality with Video SDK. 
sidebar_label: Understanding Call Quality
pagination_label: Understanding Call Quality
keywords:
  - audio calling
  - video calling
  - real-time communication
  - collaboration
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
slug: understanding-call-quality
---

# Understanding Call Quality - Android

When developing a video call app, customer satisfaction heavily depends on the app's video and audio quality and how much that quality fluctuates.

## Call Quality

From the user's perspective, good video quality is defined as smooth and clear video, along with crystal clear audio. Developers consider good video quality as high-resolution, high-frame-rate video with minimal latency and high-bitrate audio with minimal audio loss.

## Factors affecting Quality

When measuring video and audio quality, several variables come into play. The common factors affecting quality are as follows:

### `1. Network Bandwidth`

- Network bandwidth is the measure of a user's network capacity, indicating how much data can be received and sent.
- If the participant's bandwidth is low, they are likely to experience pixelated or frozen video, along with robotic voice or minor audio interruptions.
- Frequent changes in network providers can also lead to significant fluctuations in internet bandwidth, resulting in poor video quality.

### `2. Latency`

- Latency refers to the time it takes for data to transfer from one machine to another.
- If the meeting is hosted in the `Ohio` region and users are joining from the `Singapore` region, data can take a long time to travel between machines. Therefore, it is advisable to choose a server based on your user base.
- With VideoSDK, you can specify the server during the creation of a Meeting/Room.

### `3. CPU Usage`

- CPU Usage is also a determining factor, as all the audio and video streams going out and coming in need to be encoded or decoded, requiring a significant amount of computation.
- The higher the resolution and frame rate of the videos, the greater the computation required, which can lead to a bottleneck in delivering good-quality video. - If the CPU is heavily consumed, it can also result in choppy or robotic audio. ### `4. Camera and Mic Quality` - The camera and microphone should capture high-quality streams to ensure that remote users don't receive a low-quality stream even if they have bad network bandwidth. ## Identifying various issues related to Quality. In order to identify potential issues, VideoSDK collects various audio and video-related metrics that can help pinpoint quality concerns. Take a look at these metrics and understand what they indicate. ### `1. Resolution and Framerate` - Resolution and frame rate serve as crucial metrics for video quality in a video call app. Resolution indicates the number of pixels in a video image, while frame rate denotes the number of frames displayed per second. - While higher resolutions and frame rates can enhance video quality, they also demand more bandwidth and processing power. It's essential to optimize these metrics based on the devices and network conditions of your users. - For instance, if the majority of your users are on mobile devices, opting for a lower resolution and frame rate may be more suitable to ensure smooth playback and minimize bandwidth usage. - If your user base comprises both mobile and desktop devices, adopting higher resolution for desktops and mid-resolutions for mobile devices can contribute to improved performance and quality while conserving bandwidth on mobile devices. ### `2. Bitrate` - Bitrate represents the number of bits per second transmitted or received during the transmission of audio or video streams. It is a crucial parameter for assessing the quality of audio or video streams and should be adjusted for each resolution to strike a balance between performance and bandwidth utilization. 
- In scenarios with excellent network conditions, a higher bitrate can lead to significantly improved video quality. However, it's essential to be cautious when dealing with very high bitrates on mobile devices, as it may result in heating issues due to the substantial computational requirements for encoding and decoding videos. ##### Example We used the same phone **(iPhone 14)**, for both participants, but there were differences in resolution and bitrate. The first participant had a resolution of **`1280x720`** with a bitrate of **`1442 kbps`**, while the second participant had a resolution of **`960x540`** with a bitrate of **`642 kbps`**. Surprisingly, both participants' videos appeared to be of equal quality despite variations in resolution and bitrate. ![resolution-and-bitrate](/img/resolution-and-bitrate.png) ### `3. Packet Loss` - Packet loss is a metric that reveals the number of lost data packets during transmission from the sender to the receiver. It can happen due to network congestion, hardware or software issues, or network latency. Increased packet loss can lead to degraded video and audio quality, as the absence of packets may cause gaps or distortions in the media stream. import ReactPlayer from 'react-player'
### `4. Jitter`

- Audio and video packets are sent at roughly regular intervals as they travel between the client and the server. Jitter is the variation in the delay with which these packets arrive, typically caused by network congestion or an unstable connection.
- When packets are delayed on their way to a participant, usually because the network is busy, they arrive at irregular intervals. The result can be pixelated video during a video call, or choppy, distorted, or robotic audio in a voice call.

### `5. Round Trip Time (Latency)`

- Round trip time refers to the duration it takes for data packets to be transmitted from the user's device to the server and back. If the servers are located far from the user's location, users may experience high latency (delay).
- VideoSDK addresses this factor by automatically choosing the nearest available server for participants. However, if you are geofencing to a specific region, ensure that you choose the server nearest to your users.
![round-trip-time](/img/rtt.png)
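To make the jitter metric from section 4 more concrete, the sketch below keeps a running interarrival jitter estimate in the style of RFC 3550: for each packet it takes the change in transit time `D` relative to the previous packet and updates `J += (|D| - J) / 16`. This is an illustration under stated assumptions: `JitterEstimator` is a hypothetical class, timestamps are assumed to be in milliseconds on a shared clock, and this is not necessarily how VideoSDK computes the jitter it reports.

```java
/** Hypothetical RFC 3550-style jitter estimator; not part of the VideoSDK API. */
final class JitterEstimator {
    private double jitterMs = 0.0;
    private double previousTransitMs = 0.0;
    private boolean hasPrevious = false;

    /** Feeds one packet's send and arrival time (ms) and returns the updated estimate. */
    double onPacket(double sentAtMs, double receivedAtMs) {
        double transitMs = receivedAtMs - sentAtMs;
        if (hasPrevious) {
            // D = difference in transit time between this packet and the previous one
            double d = Math.abs(transitMs - previousTransitMs);
            // Smooth the estimate with a gain of 1/16
            jitterMs += (d - jitterMs) / 16.0;
        }
        previousTransitMs = transitMs;
        hasPrevious = true;
        return jitterMs;
    }

    double getJitterMs() {
        return jitterMs;
    }

    public static void main(String[] args) {
        // Packets sent every 20 ms with a constant 50 ms transit time: no jitter.
        JitterEstimator steady = new JitterEstimator();
        for (int i = 0; i < 5; i++) {
            steady.onPacket(i * 20.0, i * 20.0 + 50.0);
        }
        System.out.println(steady.getJitterMs()); // prints 0.0

        // Second packet takes 16 ms longer in transit: estimate moves to 16 / 16 = 1.0.
        JitterEstimator bursty = new JitterEstimator();
        bursty.onPacket(0.0, 50.0);
        bursty.onPacket(20.0, 86.0);
        System.out.println(bursty.getJitterMs()); // prints 1.0
    }
}
```

A constant delay, however large, produces no jitter; only changes in delay between consecutive packets do.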
## Checking Realtime Statistics

VideoSDK provides methods to check the realtime statistics of audio and video of all the participants.

### `getVideoStats()`

- The `getVideoStats()` method of the `Participant` class returns an object containing the different quality parameters for the video stream.
- This object will contain values for the specific participant's video stream, including resolution, frame rate, bitrate, jitter, round trip time, and packet loss.
### `getAudioStats()`

- The `getAudioStats()` method of the `Participant` class returns an object containing the different quality parameters for the audio stream.
- This object will contain values for the specific participant's audio stream, including bitrate, jitter, round trip time, and packet loss.

### `getShareStats()`

- The `getShareStats()` method of the `Participant` class returns an object containing the different quality parameters for the screen share stream.
- This object will contain values for the specific participant's screen share stream, including resolution, frame rate, bitrate, jitter, round trip time, and packet loss.

### `getShareAudioStats()`

- The `getShareAudioStats()` method returns an object containing the different quality parameters for a participant's screen share audio stream.
- This object will contain values for the specific participant's screen share audio stream, including bitrate, jitter, round trip time, and packet loss.

:::note
To show the popup dialog for the participant's realtime stats, you can [refer to this component](https://github.com/videosdk-live/videosdk-rtc-react-sdk-example/blob/main/src/utils/common.js#L142).
:::

## Quality analysis Graphs

For all sessions conducted using VideoSDK, you can access quality analysis graphs from the [VideoSDK Dashboard](https://app.videosdk.live/meetings/sessions). These graphs help you visualize real-time data and identify spikes in certain parameters during calls, aiding in understanding the reasons for quality issues.

![quality analysis](/img/quality-analysis.png)

## API Reference

The API references for all the methods and events utilized in this guide are provided below.
- [getVideoStats()](/android/api/sdk-reference/participant-class/methods#getvideostats) - [getAudioStats()](/android/api/sdk-reference/participant-class/methods#getaudiostats) - [getShareStats()](/android/api/sdk-reference/participant-class/methods#getsharestats) --- --- sidebar_label: Change Mode pagination_label: Change Mode --- # Change Mode - Android In a live stream, audience members usually join in `RECV_ONLY` mode, meaning they can only view and listen to the hosts. However, if a host invites an audience member to actively participate (e.g., to speak or present), the audience member can switch their mode to `SEND_AND_RECV` using the changeMode() method. This guide explains how to use the changeMode() method and walks through a sample implementation where a host invites an audience member to become a host using PubSub. ### `changeMode()` - The `changeMode()` method from the `Meeting` class allows a participant to switch between modes during a live stream—for example, from audience to host. #### Example import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ```js class LiveStreamActivity : AppCompatActivity() { override fun onCreate(savedInstanceState: Bundle?) { super.onCreate(savedInstanceState) setContentView(R.layout.activity_live_stream); // initialize the meeting liveStream = VideoSDK.initMeeting(... ) ... // join meeting liveStream!!.join() // Button to change mode val changeModeBtn: Button = findViewById(R.id.btnChangeMode) changeModeBtn.setOnClickListener { liveStream.changeMode(MeetingMode.SEND_AND_RECV) } } } ``` ```java public class LiveStreamActivity extends AppCompatActivity { @Override protected void onCreate(Bundle savedInstanceState) { super.onCreate(savedInstanceState); setContentView(R.layout.activity_live_stream); // initialize the meeting Meeting liveStream = VideoSDK.initMeeting( ...); ... 
        // join meeting
        liveStream.join();

        // Button to change mode
        Button changeModeBtn = findViewById(R.id.btnChangeMode);
        changeModeBtn.setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View v) {
                liveStream.changeMode(MeetingMode.SEND_AND_RECV);
            }
        });
    }
}
```

#### Implementation Guide

#### Step 1 : Create a Pubsub Topic

- Set up a PubSub topic to send a mode change request from the host to a specific audience member.

```js
fun sendInvite(livestream: Meeting, participantId: String) {
    val pubSubPublishOptions = PubSubPublishOptions().apply {
        isPersist = false
    }
    livestream.pubSub.publish(
        "REQUEST_TO_JOIN_AS_HOST_$participantId", // PubSub topic specific to participant
        "SEND_AND_RECV", // message
        pubSubPublishOptions
    )
}
```

```java
void sendInvite(Meeting livestream, String participantId) {
    PubSubPublishOptions pubSubPublishOptions = new PubSubPublishOptions();
    pubSubPublishOptions.setPersist(false);
    livestream.pubSub.publish(
        "REQUEST_TO_JOIN_AS_HOST_" + participantId, // PubSub topic specific to participant
        "SEND_AND_RECV", // message
        pubSubPublishOptions
    );
}
```

#### Step 2 : Create an Invite Button

- Add an "Invite on Stage" button for each audience member. When clicked, it publishes a PubSub message with the mode "SEND_AND_RECV" to that participant.
```js // In ParticipantListAdapter.kt, inside showPopup method if (participant!!.mode == "RECV_ONLY") { popup.menu.add("Add as a co-host") popup.setOnMenuItemClickListener { item: MenuItem -> if (item.toString() == "Add as a co-host") { sendInvite(liveStream,participantId); holder.requestedIndicator.visibility = View.VISIBLE holder.btnParticipantMoreOptions.isEnabled = false return@setOnMenuItemClickListener true } false } } ``` ```java // In ParticipantListAdapter.java, inside showPopup method if ("RECV_ONLY".equals(participant.getMode())) { popup.getMenu().add("Add as a co-host"); popup.setOnMenuItemClickListener(new PopupMenu.OnMenuItemClickListener() { @Override public boolean onMenuItemClick(MenuItem item) { if ("Add as a co-host".equals(item.toString())) { sendInvite(liveStream,participantId); holder.requestedIndicator.setVisibility(View.VISIBLE); holder.btnParticipantMoreOptions.setEnabled(false); return true; } return false; } }); } ``` #### Step 3 : Create a Listener to Change the Mode - On the audience side, subscribe to the specific PubSub topic. When a mode request is received, update the participant’s mode using changeMode(). ```js // In your class where you define coHostListener val coHostListener = object : PubSubMessageListener { override fun onMessage(pubSubMessage: PubSubMessage) { showCoHostRequestDialog() } } liveStream.pubSub.subscribe( "REQUEST_TO_JOIN_AS_HOST_${liveStream.localParticipant.id}", coHostListener ) // In the showCoHostRequestDialog method private fun showCoHostRequestDialog() { // Dialog setup code... acceptBtn.setOnClickListener { liveStream.changeMode("SEND_AND_RECV") } // Rest of the dialog code... 
}
```

```java
// In your class where you define coHostListener
PubSubMessageListener coHostListener = new PubSubMessageListener() {
    @Override
    public void onMessage(PubSubMessage pubSubMessage) {
        showCoHostRequestDialog();
    }
};

liveStream.pubSub.subscribe("REQUEST_TO_JOIN_AS_HOST_" + liveStream.getLocalParticipant().getId(), coHostListener);

// In the showCoHostRequestDialog method
private void showCoHostRequestDialog() {
    // Dialog setup code...
    acceptBtn.setOnClickListener(new View.OnClickListener() {
        @Override
        public void onClick(View v) {
            liveStream.changeMode("SEND_AND_RECV");
        }
    });
    // Rest of the dialog code...
}
```

import ReactPlayer from "react-player";

## API Reference

The API references for all the methods and events utilized in this guide are provided below.

- [changeMode()](/android/api/sdk-reference/meeting-class/methods#changemode)
- [Participant](/android/api/sdk-reference/participant-class/properties)
- [Meeting](/android/api/sdk-reference/meeting-class/properties)
- [pubSub()](/android/api/sdk-reference/pubsub-class/introduction)

---

---
title: Remove participant from the meeting - React JS SDK
hide_title: false
hide_table_of_contents: false
description: Remove a participant or a peer from the meeting while it is still in progress. It helps in meeting moderation.
sidebar_label: Remove Participant
pagination_label: Remove Participant
keywords:
  - remove participant
  - block participant
  - react js
image: img/videosdklive-thumbnail.jpg
sidebar_position: 1
slug: remove-participant
---

# Remove Participant - Android

When hosting a live stream, it's essential for the host to be able to remove a participant from the live stream. This can be useful in various scenarios where a participant is causing a disturbance, behaving inappropriately, or not following the guidelines.

This guide focuses on removing a participant from the live stream. VideoSDK provides three ways to do so:

1. [Using SDK](#1-using-sdk)
2. 
[Using VideoSDK Dashboard](#2-using-videosdk-dashboard) 3. [Using Rest API](#3-using-rest-api) ## 1. Using SDK ### `remove()` The `remove()` method allows for the removal of a participant during an on-going session. This can be helpful when moderation is required in a particular live stream. import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ```js btnRemoveParticipant!!.setOnClickListener { _: View? -> val remoteParticipantId = "" // Get specific participant instance val participant = meeting!!.participants[remoteParticipantId] // Remove participant from active session // This will emit an event called "onParticipantLeft" for that particular participant participant!!.remove() } ``` ```js btnRemoveParticipant.setOnClickListener(new View.OnClickListener() { @Override public void onClick(View v) { String remoteParticipantId = ""; // Get specific participant instance Participant participant = meeting.getParticipants().get(remoteParticipantId); // Remove participant from active session // This will emit an event called "onParticipantLeft" for that particular participant participant.remove(); } }); ``` ### Events associated with remove() Following callbacks are received when a participant is removed from the meeting. - The participant who was removed from the meeting will receive a callback on the [`onMeetingLeft`](/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onmeetingleft) event of the `Meeting` class. - All other [remote participants](/android/guide/video-and-audio-calling-api-sdk/concept-and-architecture#2-participant) will receive a callback [`onParticipantLeft`](/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onparticipantleft) with Participant object. ## 2. Using VideoSDK Dashboard - For removing a participant using the VideoSDK Dashboard, navigate to the session page on [VideoSDK Dashboard](https://app.videosdk.live/meetings/sessions). 
Select the specific session, and from the list of participants, choose the participant you wish to remove. Utilize the provided options to remove the selected participant from the session. import ReactPlayer from 'react-player'
## 3. Using Rest API - You can also remove a particular participant from the live stream [using the REST API](/api-reference/realtime-communication/remove-participant). - To employ this method, you need the `sessionId` of the live stream and the `participantId` of the individual you intend to remove. ## API Reference The API references for all the methods and events utilized in this guide are provided below. - [remove()](/android/api/sdk-reference/participant-class/methods#remove) - [onMeetingLeft()](/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onmeetingleft) - [onParticipantLeft()](/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onparticipantleft) --- --- title: Audience Polls during Live Stream - Video SDK Docs hide_title: false hide_table_of_contents: false description: PubSub features quick integrate in Javascript, React JS, Android, IOS, React Native, Flutter with Video SDK to add live video & audio conferencing to your applications. sidebar_label: Audience Polls pagination_label: Audience Polls keywords: - Polls during Live Stream - Livestream audience polls - real-time communication image: img/videosdklive-thumbnail.jpg sidebar_position: 2 slug: audience-polls-during-livestream --- # Audience Polls during Live Stream - Android Interactive polls are a great way to increase engagement during livestreams. Using VideoSDK’s PubSub mechanism, you can easily implement real-time audience polling, where viewers can vote and see live results instantly. This guide walks you through how to create, send, and visualize poll results during a livestream. ## Step 1: Creating and Publishing a Poll To initiate a poll, use the `PubSub` class with a `POLL` topic. The poll structure should include a question and multiple options. This message will be published to all participants. 
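The snippets in this guide reference a `SimplePoll` type that is not defined here. A minimal sketch of what such a class might look like is shown below. Everything in it is an assumption made for illustration: the field names, the random `id`, and the hand-written `toJsonString()` are not part of the VideoSDK API. In an Android app you would more likely serialize with `org.json.JSONObject` or a JSON library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

/** Hypothetical poll model for illustration; not part of the VideoSDK API. */
final class SimplePoll {
    private final String id;
    private final String question;
    private final List<String> options;

    SimplePoll(String question, List<String> options) {
        this(UUID.randomUUID().toString(), question, options);
    }

    SimplePoll(String id, String question, List<String> options) {
        this.id = id;
        this.question = question;
        this.options = Collections.unmodifiableList(new ArrayList<>(options));
    }

    String getId() { return id; }
    String getQuestion() { return question; }
    List<String> getOptions() { return options; }

    /** Wraps a string in quotes, escaping quotes, backslashes and control characters. */
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Serializes the poll as a JSON object string suitable for a PubSub message. */
    String toJsonString() {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"id\":").append(quote(id))
          .append(",\"question\":").append(quote(question))
          .append(",\"options\":[");
        for (int i = 0; i < options.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(quote(options.get(i)));
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        SimplePoll poll = new SimplePoll("poll-1", "Best codec?", List.of("VP8", "H.264"));
        // prints {"id":"poll-1","question":"Best codec?","options":["VP8","H.264"]}
        System.out.println(poll.toJsonString());
    }
}
```

Giving each poll an `id` lets responses be matched to the poll they belong to, which matters once more than one poll has been published on the same topic.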
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

```js
class CreatePollDialog() {
    companion object {
        const val POLL_TOPIC = "POLL"
    }

    fun show() {
        // ... UI setup code ...

        dialog.getButton(AlertDialog.BUTTON_POSITIVE).setOnClickListener {
            // Validate inputs

            // Create poll object
            val options = mutableListOf(option1, option2)
            if (option3.isNotEmpty()) options.add(option3)
            if (option4.isNotEmpty()) options.add(option4)
            val poll = SimplePoll(question, options)

            // Publish poll to all participants
            val pubSubPublishOptions = PubSubPublishOptions()
            pubSubPublishOptions.isPersist = true
            liveStream.pubSub.publish(POLL_TOPIC, poll.toJsonString(), pubSubPublishOptions)

            // Show results dialog to host
            PollResultsDialog(context, liveStream).show(poll)
            dialog.dismiss()
        }
    }
}
```

```java
public class CreatePollDialog {
    public static final String POLL_TOPIC = "POLL";

    public CreatePollDialog() {
        // default constructor
    }

    public void show() {
        // ... UI setup code ...

        dialog.getButton(AlertDialog.BUTTON_POSITIVE).setOnClickListener(v -> {
            // Validate inputs

            // Create poll object
            List<String> options = new ArrayList<>();
            options.add(option1);
            options.add(option2);
            if (!option3.isEmpty()) options.add(option3);
            if (!option4.isEmpty()) options.add(option4);
            SimplePoll poll = new SimplePoll(question, options);

            // Publish poll to all participants
            PubSubPublishOptions pubSubPublishOptions = new PubSubPublishOptions();
            pubSubPublishOptions.setPersist(true);
            liveStream.pubSub.publish(POLL_TOPIC, poll.toJsonString(), pubSubPublishOptions);

            // Show results dialog to host
            new PollResultsDialog(context, liveStream).show(poll);
            dialog.dismiss();
        });
    }
}
```

## Step 2: Subscribing to Polls and Displaying Options

Participants can listen to the POLL topic and render voting options dynamically based on the incoming data.

```js
class PollVotingDialog() {
    companion object {
        const val POLL_RESPONSE_TOPIC = "POLL_RESPONSE"
    }

    fun show(poll: SimplePoll) {
        // ... UI setup code ...
        // Create option buttons dynamically
        poll.options.forEach { option ->
            val button = Button(context)
            button.text = option
            button.setOnClickListener {
                // Submit vote
                val response = SimplePollResponse(
                    pollId = poll.id,
                    option = option,
                    participantId = liveStream.localParticipant.id,
                    participantName = liveStream.localParticipant.displayName
                )
                liveStream.pubSub.publish(POLL_RESPONSE_TOPIC, response.toJsonString())

                // Disable all buttons after voting
                for (i in 0 until optionsContainer.childCount) {
                    optionsContainer.getChildAt(i).isEnabled = false
                }
            }
            optionsContainer.addView(button)
        }
    }
}
```

```java
public class PollVotingDialog {
    public static final String POLL_RESPONSE_TOPIC = "POLL_RESPONSE";

    public void show(SimplePoll poll) {
        // ... UI setup code ...

        // Create option buttons dynamically
        for (String option : poll.getOptions()) {
            Button button = new Button(context);
            button.setText(option);
            button.setOnClickListener(v -> {
                // Submit vote
                SimplePollResponse response = new SimplePollResponse(
                    poll.getId(),
                    option,
                    liveStream.getLocalParticipant().getId(),
                    liveStream.getLocalParticipant().getDisplayName()
                );
                liveStream.pubSub.publish(POLL_RESPONSE_TOPIC, response.toJsonString());

                // Disable all buttons after voting
                for (int i = 0; i < optionsContainer.getChildCount(); i++) {
                    optionsContainer.getChildAt(i).setEnabled(false);
                }
            });
            optionsContainer.addView(button);
        }
    }
}
```

## Step 3: Aggregating and Displaying Poll Results

The host can subscribe to the POLL_RESPONSE topic to collect responses and render the result in real-time.

```js
class PollResultsDialog() {
    private val pollResults = ConcurrentHashMap<String, Int>()
    private var totalVotes = 0

    companion object {
        const val POLL_RESPONSE_TOPIC = "POLL_RESPONSE"
    }

    fun show(poll: SimplePoll) {
        // ... UI setup code ...
        // Initialize results map with 0 votes for each option
        poll.options.forEach { option -> pollResults[option] = 0 }

        // Create initial result bars
        updateResultBars()

        // Listen for poll responses
        responsesListener = PubSubMessageListener { pubSubMessage ->
            try {
                val response = SimplePollResponse.fromJsonString(pubSubMessage.message)
                val option = response.option

                // Update vote count
                pollResults[option] = (pollResults[option] ?: 0) + 1
                totalVotes++

                // Update UI on main thread
                (context as? android.app.Activity)?.runOnUiThread {
                    updateResultBars()
                }
            } catch (e: Exception) {
                e.printStackTrace()
            }
        }

        // Subscribe to poll responses
        liveStream.pubSub.subscribe(POLL_RESPONSE_TOPIC, responsesListener)
    }

    private fun updateResultBars() {
        // Clear previous results
        optionsContainer?.removeAllViews()

        // Create result bars for each option
        pollResults.forEach { (option, votes) ->
            val percentage = if (totalVotes > 0) (votes * 100) / totalVotes else 0
            // Create progress bar UI showing percentage
            // ... UI code to display results ...
        }
    }
}
```

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.Map;

public class PollResultsDialog {
    private ConcurrentHashMap<String, Integer> pollResults = new ConcurrentHashMap<>();
    private int totalVotes = 0;
    public static final String POLL_RESPONSE_TOPIC = "POLL_RESPONSE";

    public void show(SimplePoll poll) {
        // ... UI setup code ...
        // Initialize results map with 0 votes for each option
        for (String option : poll.options) {
            pollResults.put(option, 0);
        }

        // Create initial result bars
        updateResultBars();

        // Listen for poll responses
        responsesListener = new PubSubMessageListener() {
            @Override
            public void onMessage(PubSubMessage pubSubMessage) {
                try {
                    SimplePollResponse response = SimplePollResponse.fromJsonString(pubSubMessage.message);
                    String option = response.option;

                    // Update vote count
                    pollResults.put(option, pollResults.getOrDefault(option, 0) + 1);
                    totalVotes++;

                    // Update UI on main thread
                    if (context instanceof android.app.Activity) {
                        ((android.app.Activity) context).runOnUiThread(new Runnable() {
                            @Override
                            public void run() {
                                updateResultBars();
                            }
                        });
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        };

        // Subscribe to poll responses
        liveStream.pubSub.subscribe(POLL_RESPONSE_TOPIC, responsesListener);
    }

    private void updateResultBars() {
        // Clear previous results
        if (optionsContainer != null) {
            optionsContainer.removeAllViews();
        }

        // Create result bars for each option
        for (Map.Entry<String, Integer> entry : pollResults.entrySet()) {
            String option = entry.getKey();
            int votes = entry.getValue();
            int percentage = (totalVotes > 0) ? (votes * 100) / totalVotes : 0;
            // Create progress bar UI showing percentage
            // ... UI code to display results ...
        }
    }
}
```

### API Reference

The API references for all the methods and events utilized in this guide are provided below.

- [pubSub()](/android/api/sdk-reference/pubsub-class/introduction)

---

---
title: Authentication and Token | Video SDK
hide_title: true
hide_table_of_contents: false
description: Video SDK and Audio SDK, developers need to implement a token server. This requires efforts on both the front-end and backend.
sidebar_label: Authentication and Tokens pagination_label: Authentication and Tokens keywords: - audio calling - video calling - real-time communication - collaboration image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: authentication-and-token sidebar: ilsSidebar --- import ServerSetup from '../../../mdx/introduction/\_server-setup.mdx'; --- --- sidebar_label: Developer Experience Guidelines pagination_label: Developer Experience Guidelines sidebar: ilsSidebar --- import AndroidDeveloperExperience from '/mdx/sdk-pages/android/best-practices/developer-experience.mdx'; # Developer Experience Guidelines - Android --- --- sidebar_label: Handle Large Rooms pagination_label: Handle Large Rooms sidebar: ilsSidebar --- import AndroidHandleLargeRooms from '/mdx/sdk-pages/android/best-practices/handle-large-rooms.mdx'; # Handle Large Rooms - Android --- --- sidebar_label: User Experience Guidelines pagination_label: User Experience Guidelines sidebar: ilsSidebar --- import AndroidUserExperience from '/mdx/sdk-pages/android/best-practices/user-experience.mdx'; # User Experience Guidelines - Android --- --- title: Chat during Live Stream - Video SDK Docs hide_title: false hide_table_of_contents: false description: PubSub features quick integrate in Javascript, React JS, Android, IOS, React Native, Flutter with Video SDK to add live video & audio conferencing to your applications. sidebar_label: Chat during Live Stream pagination_label: Chat during Live Stream keywords: - Chat during Live Stream - Livestream - real-time communication image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: chat-during-livestream --- # Chat during Live Stream - Android Enhance your live stream experience by enabling real-time audience chat using VideoSDK's PubSub class. Whether you’re streaming a webinar, online event, or an interactive session, integrating a chat system lets your viewers engage, ask questions, and react instantly. 
This guide shows how to build a group or private chat interface for a live stream using the Publish-Subscribe (PubSub) mechanism. If you are not familiar with the PubSub mechanism and the `PubSub` class, you can [follow this guide](/android/guide/video-and-audio-calling-api-sdk/collaboration-in-meeting/pubsub).

## Implementing Chat

### `Group Chat`

1. The first step in creating a group chat is choosing the topic that all participants will publish and subscribe to in order to send and receive messages. We will be using `CHAT` as the topic for this one.

2. On the send button, publish the message that the sender typed in the `EditText` field.

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

```js
import androidx.appcompat.app.AppCompatActivity
import android.os.Bundle
import android.text.TextUtils
import android.view.View
import android.widget.EditText
import android.widget.Toast
import androidx.appcompat.widget.Toolbar
import live.videosdk.rtc.android.Meeting
import live.videosdk.rtc.android.listeners.PubSubMessageListener
import live.videosdk.rtc.android.model.PubSubPublishOptions

class ChatActivity : AppCompatActivity() {

    // Meeting
    var liveStream: Meeting? = null

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_chat)

        /**
         * Here, we have created 'MainApplication' class, which extends android.app.Application class.
         * It has Meeting property and getter and setter methods of Meeting property.
         * In your android manifest, you must declare the class implementing android.app.Application
         * (add the android:name=".MainApplication" attribute to the existing application tag):
         * In MainActivity.kt, we have set Meeting property.
         *
         * For Example: (MainActivity.kt)
         * var meeting = VideoSDK.initMeeting(context, meetingId, ParticipantName, micEnabled, webcamEnabled, participantId, mode, multiStream, customTrack, metaData, signalingBaseUrl)
         * (this.application as MainApplication).meeting = meeting
         */

        // Get Meeting
        liveStream = (this.application as MainApplication).meeting

        findViewById<View>(R.id.btnSend).setOnClickListener { sendMessage() }
    }

    private fun sendMessage() {
        // get message from EditText
        val message: String = etmessage.getText().toString()
        if (!TextUtils.isEmpty(message)) {
            val publishOptions = PubSubPublishOptions()
            publishOptions.setPersist(true)

            // Sending the Message using the publish method
            //highlight-next-line
            liveStream!!.pubSub.publish("CHAT", message, publishOptions)

            // Clearing the message input
            etmessage.setText("")
        } else {
            Toast.makeText(
                this@ChatActivity, "Please Enter Message",
                Toast.LENGTH_SHORT
            ).show()
        }
    }
}
```

```js
import androidx.appcompat.app.AppCompatActivity;
import android.os.Bundle;
import android.widget.Toast;
import java.util.List;
import live.videosdk.rtc.android.Meeting;
import live.videosdk.rtc.android.lib.PubSubMessage;
import live.videosdk.rtc.android.listeners.PubSubMessageListener;
import live.videosdk.rtc.android.model.PubSubPublishOptions;

public class ChatActivity extends AppCompatActivity {

    // Meeting
    Meeting liveStream;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_chat);

        /**
         * Here, we have created 'MainApplication' class, which extends android.app.Application class.
         * It has Meeting property and getter and setter methods of Meeting property.
         * In your android manifest, you must declare the class implementing android.app.Application
         * (add the android:name=".MainApplication" attribute to the existing application tag):
         * In MainActivity.java, we have set Meeting property.
* * For Example: (MainActivity.java) * Meeting meeting = VideoSDK.initMeeting(context, meetingId, ParticipantName, micEnabled, webcamEnabled, participantId, mode, mutliStream, customTrack,metaData, signalingBaseUrl, preferredProtocol); * ((MainApplication) this.getApplication()).setMeeting(meeting); */ // Get Meeting liveStream = ((MainApplication) this.getApplication()).getMeeting(); findViewById(R.id.btnSend).setOnClickListener(view -> sendMessage()); } private void sendMessage() { // get message from EditText String message = etmessage.getText().toString(); if (!message.equals("")) { PubSubPublishOptions publishOptions = new PubSubPublishOptions(); publishOptions.setPersist(true); // Sending the Message using the publish method //highlight-next-line liveStream.pubSub.publish("CHAT", message, publishOptions); // Clearing the message input etmessage.setText(""); } else { Toast.makeText(ChatActivity.this, "Please Enter Message", Toast.LENGTH_SHORT).show(); } } } ``` 3. Next step would be to display the messages others send. For this we have to `subscribe` to that topic i.e `CHAT` and display all the messages. ```js class ChatActivity : AppCompatActivity() { // PubSubMessageListener //highlight-start var pubSubMessageListener = PubSubMessageListener { message -> // New message Toast.makeText( this@ChatActivity, message.senderName + " says : " + message.message, Toast.LENGTH_SHORT ).show() } //highlight-end override fun onCreate(savedInstanceState: Bundle?) { super.onCreate(savedInstanceState) setContentView(R.layout.activity_chat) //... 
// Subscribe for 'CHAT' topic //highlight-next-line val pubSubMessageList = liveStream!!.pubSub.subscribe("CHAT", pubSubMessageListener) for (message in pubSubMessageList) { // Persisted messages Toast.makeText( this@ChatActivity, message.senderName + " says : " + message.message, Toast.LENGTH_SHORT ).show() } } } ``` ```js public class ChatActivity extends AppCompatActivity { // PubSubMessageListener //highlight-start private PubSubMessageListener pubSubMessageListener = new PubSubMessageListener() { @Override public void onMessageReceived(PubSubMessage message) { // New message Toast.makeText( ChatActivity.this, message.senderName + " says : "+ message.getMessag(), Toast.LENGTH_SHORT ).show(); } }; //highlight-end @Override protected void onCreate(Bundle savedInstanceState) { super.onCreate(savedInstanceState); setContentView(R.layout.activity_chat); //.. // Subscribe for 'CHAT' topic //highlight-next-line List pubSubMessageList = liveStream.pubSub.subscribe("CHAT", pubSubMessageListener); for(PubSubMessage message : pubSubMessageList){ // Persisted messages Toast.makeText( ChatActivity.this, message.senderName + " says : "+ message.getMessag(), Toast.LENGTH_SHORT ).show(); } } } ``` 4. Final step in the group chat would be `unsubscribe` to that topic, which you had previously subscribed but no longer needed. Here we are `unsubscribe` to `CHAT` topic on activity destroy. ```js class ChatActivity : AppCompatActivity() { override fun onCreate(savedInstanceState: Bundle?) { super.onCreate(savedInstanceState) setContentView(R.layout.activity_chat) //... } override fun onDestroy() { // Unsubscribe for 'CHAT' topic //highlight-next-line liveStream!!.pubSub.unsubscribe("CHAT", pubSubMessageListener) super.onDestroy() } } ``` ```js public class ChatActivity extends AppCompatActivity { @Override protected void onCreate(Bundle savedInstanceState) { super.onCreate(savedInstanceState); setContentView(R.layout.activity_chat); //.. 
} @Override protected void onDestroy() { // Unsubscribe for 'CHAT' topic //highlight-next-line liveStream.pubSub.unsubscribe("CHAT", pubSubMessageListener); super.onDestroy(); } } ``` ### `Private Chat` (1:1 between Host and Viewer) Private messaging is ideal when a host or moderator needs to directly respond to a viewer’s question. This can be achieved using the `sendOnly` property. ```js class ChatActivity : AppCompatActivity() { //.. private fun sendMessage() { // get message from EditText val message: String = etmessage.getText().toString() if (!TextUtils.isEmpty(message)) { val publishOptions = PubSubPublishOptions() publishOptions.setPersist(true) //highlight-start // Pass the participantId of the participant to whom you want to send the message. var sendOnly: Array = arrayOf("xyz") publishOptions.setSendOnly(sendOnly); //highlight-end // Sending the Message using the publish method //highlight-next-line liveStream!!.pubSub.publish("CHAT", message, publishOptions) // Clearing the message input etmessage.setText("") } else { Toast.makeText( this@ChatActivity, "Please Enter Message", Toast.LENGTH_SHORT ).show() } } } ``` ```js public class ChatActivity extends AppCompatActivity { //... private void sendMessage() { // get message from EditText String message = etmessage.getText().toString(); if (!message.equals("")) { PubSubPublishOptions publishOptions = new PubSubPublishOptions(); publishOptions.setPersist(true); //highlight-start // Pass the participantId of the participant to whom you want to send the message. 
        String[] sendOnly = { "xyz" };
        publishOptions.setSendOnly(sendOnly);
        //highlight-end

        // Sending the Message using the publish method
        //highlight-next-line
        liveStream.pubSub.publish("CHAT", message, publishOptions);

        // Clearing the message input
        etmessage.setText("");
    } else {
        Toast.makeText(ChatActivity.this, "Please Enter Message", Toast.LENGTH_SHORT).show();
    }
  }
}
```

### Downloading Chat Messages

All PubSub messages that were published with `persist : true` can be downloaded as a `.csv` file. This file is available in the VideoSDK dashboard as well as through the [Sessions API](/api-reference/realtime-communication/fetch-session-using-sessionid).

### API Reference

The API references for all the methods and events utilised in this guide are provided below.

- [pubSub()](/android/api/sdk-reference/pubsub-class/introduction)

---

---
title: Cloud Proxy | Secure and Manage Streaming Traffic | Video SDK
hide_title: true
hide_table_of_contents: false
description: Leverage Video SDK's Cloud Proxy to securely manage and optimize your video streaming traffic. Ideal for enhancing performance and securing data.
sidebar_label: Cloud Proxy pagination_label: Cloud Proxy keywords: - cloud proxy - secure streaming - video SDK security - traffic management image: img/videosdklive-thumbnail.jpg sidebar_position: 2 slug: cloud-proxy --- # Cloud Proxy - Android import CloudProxy from '../../../mdx/\_cloud-proxy.mdx'; ## Implementation import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ```js val liveStream: Meeting = VideoSDK.initMeeting( this@MainActivity, "meetingId", "John Due", true, true, null, null, false, null, null,"proxy.yourwebsite.com", VideoSDK.PreferredProtocol.UDP_ONLY ); ``` ```java Meeting liveStream = VideoSDK.initMeeting( MainActivity.this, "meetingId", "John Due",true, true, null, null, false, null,null,"proxy.yourwebsite.com", VideoSDK.PreferredProtocol.UDP_ONLY ); ``` ### Parameters - preferredProtocol: - UDP_OVER_TCP (default): Initially the server attempts to establish a connection using UDP, if that fails it automatically switches to TCP protocol. - UDP_ONLY: Force UDP protocol - TCP_ONLY: Force TCP protocol - signalingBaseUrl: Proxy URL to origin signaling and media. ## API Reference The API references for all the methods and events utilized in this guide are provided below. - [initMeeting()](/android/api/sdk-reference/initMeeting) --- --- sidebar_label: Concept And Architecture pagination_label: Concept And Architecture sidebar: ilsSidebar --- # Concept and Architecture - Android Before diving into the concept, let's understand the VideoSDK, VideoSDK is a software development kit that offers tools and APIs for creating apps that are based on video and audio. It typically includes features such as video and audio calls, chat, cloud recording, simulcasting (RTMP), interactive live streaming (HLS), and many more across a wide range of platforms and devices. ## Concepts ![img.png](../../../static/img/room-concept.png) ### `1. 
Meeting / Room` - Meeting or Room object in the VideoSDK provide a virtual place for participants to interact and engage in real-time voice, video, and screen-sharing sessions. The object is in charge of handling media streams and participant communication. - Meeting or Room can be uniquely identified by `meetingId` or `roomId`. ### `2. Participant` - Participant is a VideoSDK object that represents each user/client in the meeting or room and allows them to share audio/video assets. - `2.1 Local Participant` : The local participant is the one that runs on the user's device. The local participant has control over their own media streams, including the ability to start and stop audio and video. - The local participant in a meeting/room can also connect with other participants by transmitting and receiving audio and video streams, exchanging chat messages, and more. - `2.2 Remote Participant` : The remote participant receives audio and video streams from the local participant and other remote participants and also has the ability to exchange audio, video, and chat messages with the local participant. - Each participant in VideoSDK can be uniquely identified by `participantId`. ### `3. MediaStream & Track` - A mediastream is a collection of audio & video tracks that can be transmitted between participants in real-time. - A track is a continuous flow of audio or video data and can be thought of as a stream of media frames. - A mediastream can contain multiple tracks. One video track for the video feed from the camera and one audio track for the audio feed from the microphone. These tracks can be transmitted between participants in VideoSDK Meeting / Room. ### `4. Events / Notifications` - Events / Notifications can be used to inform users about various activities happening in a Meeting / Room, including participant join/leave and new messages. They can also be used to alert users about any SDK-level errors that occur during a call. ### `5. 
Session` - A Session is the instance of an ongoing meeting/room which has one or more participants in it. A single room or meeting can have multiple sessions. - Each session can be uniquely identified by `sessionId`. ![img.png](../../../static/img/meeting-session.jpg) --- ![img.png](../../../static/img/recording-hls-rtmp.png) ### `6. Cloud Recording` - Cloud recording in VideoSDK refers to the process of recording audio or video content and storing it on a remote server or VideoSDK server. ### `7. Simulcasting (RTMP)` - RTMP is a popular protocol for live streaming video content from a VideoSDK to platforms such as YouTube, Twitch, Facebook, and others. - By providing the platform-specific `stream key` and `stream URL`, the VideoSDK can connect to the platform's RTMP server and transmit the live video stream. ### `8. Http Live Streaming (HLS)` - Interactive live streaming (HLS) refers to a type of live streaming where viewers can actively engage with the content being streamed and with other viewers in real-time. - In an interactive live stream (HLS), viewers can take part in a variety of activities like live polling, Q&A sessions, and even sending virtual gifts to the content creator or each other. ## Architecture This diagram demonstrates end-to-end flow to implement video & audio calls, record calls, and go live on social media. ![Video-sdk-architecture!](/img/video-sdk-archietecture.svg) --- --- title: Custom Audio Sources - React JS SDK hide_title: false hide_table_of_contents: false description: Custom Audio Sources sidebar_label: Custom Audio Sources pagination_label: Custom Audio Sources keywords: - custom audio sources - react js image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: custom-audio-sources --- # Custom Audio Sources - Android For a high-quality streaming experience, fine-tuning audio tracks becomes essential—especially when delivering content to a broader live audience. 
To enhance your live audio pipeline, we've introduced the capability to provide a custom audio track for a hosts's stream both before and during a live session. ## Custom Audio Track This feature allows you to integrate advanced audio layers like background noise suppression, echo cancellation, and more—so your stream sounds polished and professional to every viewer. ### `How to Create Custom Audio Track ?` - You can create a Audio Track using `createAudioTrack()` method of `VideoSDK`. - This method can be used to create audio track using different encoding parameters. #### Example import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ```js val audioCustomTrack: CustomStreamTrack = VideoSDK.createAudioTrack("speech_standard",this) // `high_quality` | `music_standard`, Default : `speech_standard` ``` ```js CustomStreamTrack audioCustomTrack=VideoSDK.createAudioTrack("speech_standard", this); // `high_quality` | `music_standard`, Default : `speech_standard` ``` - `speech_standard` : This config is optimised for normal voice communication. - `high_quality` : This config is used for getting RAW audio, where you can apply your `noiseConfig`. - `music_standard` : This config is optimised for communication, where sharing of musical notes such as songs or instrumental sounds, is important. ### `How to Setup Custom Audio Track ?` The custom track can be set up both before and after the initialization of the meeting. 1. [Setting up a Custom Track during the initialization of a meeting](/android/guide/video-and-audio-calling-api-sdk/render-media/optimize-audio-track#1-setting-up-a-custom-track-during-the-initialization-of-a-meeting) 2. [Setting up a Custom Track with methods](/android/guide/video-and-audio-calling-api-sdk/render-media/optimize-audio-track#2-setting-up-a-custom-track-with-methods) ##### 1. 
Setup during live stream initialization If you're starting the stream with the mic enabled `(micEnabled: true)` and wish to use a custom track from the beginning, pass it through the config of MeetingProvider. :::caution Custom Track will not apply on `micEnabled: false` configuration. ::: ##### Example ```js override fun onCreate(savedInstanceState: Bundle?) { //.. val customTracks: MutableMap = HashMap() //highlight-start val audioCustomTrack: CustomStreamTrack = VideoSDK.createAudioTrack("high_quality", this) customTracks["mic"] = audioCustomTrack //Key must be "mic" //highlight-end // create a new meeting instance val liveStream = VideoSDK.initMeeting( this@MainActivity,meetingId,participantName, //MicEnabled , If true, it will use the passed custom track to turn mic on true, //WebcamEnabled true, //ParticipantId null, //Mode null, //MultiStream false, //Pass the custom tracks here //highlight-next-line customTracks, //MetaData null ) } ``` ```js @Override protected void onCreate(Bundle savedInstanceState) { //.. Map customTracks = new HashMap<>(); //highlight-start CustomStreamTrack audioCustomTrack = VideoSDK.createAudioTrack("high_quality", this); customTracks.put("mic", audioCustomTrack); //Key must be "mic" //highlight-end // create a new meeting instance Meeting liveStream = VideoSDK.initMeeting( MainActivity.this, meetingId, participantName, //MicEnabled , If true, it will use the passed custom track to turn mic on true, //WebcamEnabled true, //ParticipantId null, //Mode null, //MultiStream false, //Pass the custom tracks here //highlight-next-line customTracks, //MetaData null ); } ``` #### 2. Setup dynamically using methods During the live stream, you can update the audio source by passing the `CustomStreamTrack` in the `unmuteMic()` method of `Meeting`. You can also pass custom track in `changeMic()` method of `Meeting`. :::tip Make sure to call the `muteMic()` method before you create a new track as it may lead to unexpected behavior. 
::: ##### Example ```js try { val audioCustomTrack: CustomStreamTrack = VideoSDK.createAudioTrack("high_quality", this) liveStream!!.unmuteMic(audioCustomTrack) //or liveStream!!.changeMic(AppRTCAudioManager.AudioDevice.BLUETOOTH, audioCustomTrack) } catch (e: JSONException) { e.printStackTrace() } ``` ```js try { CustomStreamTrack audioCustomTrack = VideoSDK.createAudioTrack("high_quality", this); liveStream.unmuteMic(audioCustomTrack); //or liveStream.changeMic(AppRTCAudioManager.AudioDevice.BLUETOOTH,audioCustomTrack); }catch (JSONException e) { e.printStackTrace(); } ``` ## API Reference The API references for all the methods and events utilised in this guide are provided below. - [Custom Audio Track](/android/api/sdk-reference/custom-tracks#custom-audio-track---android) --- --- title: Custom ScreenShare Sources - React JS SDK hide_title: false hide_table_of_contents: false description: Custom ScreenShare Sources sidebar_label: Custom ScreenShare Sources pagination_label: Custom ScreenShare Sources keywords: - custom screenshare sources - react js image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: custom-screenshare-sources --- # Custom ScreenShare Sources - Android To deliver high-quality livestreams, it's essential to fine-tune screen share tracks being broadcasted. Whether you’re hosting a webinar, or going live with a presentation, using custom media tracks gives you better control over stream quality and performance. ## Custom Screen Share Track This feature enables the customization of screenshare streams with enhanced optimization modes and predefined encoder configuration (resolution + FPS) for specific use cases, which can then be sent to other hosts and audience members. ### `How to Create Custom Screen Share Track ?` - You can create a Screen Share track using `createScreenShareVideoTrack()` method of `VideoSDK`. - This method can be used to create video track using different encoding parameters and optimization mode. 
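
The encoder configuration for a screen share track is passed as a string such as `h720p_15fps`, which names the capture height and the frame rate. The Java sketch below only illustrates how that naming convention can be read. `ScreenShareConfig` is a hypothetical helper written for this guide and is not part of the VideoSDK API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of VideoSDK): reads the "h<height>p_<fps>fps" naming convention.
public class ScreenShareConfig {
    private static final Pattern FORMAT = Pattern.compile("h(\\d+)p_(\\d+)fps");

    public final int height; // capture height in pixels
    public final int fps;    // frames per second

    private ScreenShareConfig(int height, int fps) {
        this.height = height;
        this.fps = fps;
    }

    // Parses a config string such as "h720p_15fps"; rejects anything that does not match the pattern.
    public static ScreenShareConfig parse(String config) {
        Matcher m = FORMAT.matcher(config);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised config: " + config);
        }
        return new ScreenShareConfig(Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
    }

    public static void main(String[] args) {
        ScreenShareConfig c = parse("h720p_15fps");
        System.out.println(c.height + "p at " + c.fps + " fps"); // prints "720p at 15 fps"
    }
}
```

A helper like this can be useful for showing the selected quality in a settings screen. The SDK itself takes the string as-is.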
#### Example import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ```javascript // data is received from onActivityResult method. VideoSDK.createScreenShareVideoTrack( //highlight-next-line // This will accept the height & FPS of video you want to capture. "h720p_15fps", // `h360p_30fps` | `h1080p_30fps` // Default : `h720p_15fps` //highlight-next-line // It is Intent received from onActivityResult when user provide permission for ScreenShare. data, //highlight-next-line // Pass Conext this) //highlight-next-line //Callback to this listener will be made when track is ready with CustomTrack as parameter { track -> meeting!!.enableScreenShare(track) } ``` ```javascript // data is received from onActivityResult method. VideoSDK.createScreenShareVideoTrack( //highlight-next-line // This will accept the height & FPS of video you want to capture. "h720p_15fps", // `h360p_30fps` | `h1080p_30fps` // Default : `h720p_15fps` /highlight-next-line // It is Intent received from onActivityResult when user provide permission for ScreenShare data, //highlight-next-line // Pass Conext this, //highlight-next-line //Callback to this listener will be made when track is ready with CustomTrack as parameter (track)->{meeting.enableScreenShare(track);} ); ``` ### `How to Setup Custom Screen Share Track ?` In order to switch tracks during the meeting, you have to pass the `CustomStreamTrack` in the `enableScreenShare()` method of `Meeting`. :::note Make sure to call `disableScreenShare()` before you create a new track as it may lead to unexpected behavior. 
::: ##### Example ```javascript @TargetApi(21) private fun askPermissionForScreenShare() { val mediaProjectionManager = application.getSystemService( Context.MEDIA_PROJECTION_SERVICE ) as MediaProjectionManager startActivityForResult( mediaProjectionManager.createScreenCaptureIntent(), CAPTURE_PERMISSION_REQUEST_CODE ) } @RequiresApi(api = Build.VERSION_CODES.LOLLIPOP) override fun onActivityResult(requestCode: Int, resultCode: Int, data: Intent?) { super.onActivityResult(requestCode, resultCode, data) if (requestCode != CAPTURE_PERMISSION_REQUEST_CODE) return if (resultCode == RESULT_OK) { //highlight-start VideoSDK.createScreenShareVideoTrack("h720p_15fps", data, this) { track -> liveStream!!.enableScreenShare(track) } //highlight-end } } ``` ```javascript @TargetApi(21) private void askPermissionForScreenShare() { MediaProjectionManager mediaProjectionManager = (MediaProjectionManager) getApplication().getSystemService( Context.MEDIA_PROJECTION_SERVICE); startActivityForResult( mediaProjectionManager.createScreenCaptureIntent(), CAPTURE_PERMISSION_REQUEST_CODE); } @RequiresApi(api = Build.VERSION_CODES.LOLLIPOP) @Override public void onActivityResult(int requestCode, int resultCode, Intent data) { super.onActivityResult(requestCode, resultCode, data); if (requestCode != CAPTURE_PERMISSION_REQUEST_CODE) return; if (resultCode == Activity.RESULT_OK) { //highlight-start VideoSDK.createScreenShareVideoTrack("h720p_15fps", data, this, (track)->{ liveStream.enableScreenShare(track); }); //highlight-end } } ``` ## API Reference The API references for all the methods and events utilised in this guide are provided below. 
- [Custom Video Track](/react/api/sdk-reference/custom-tracks#custom-video-track---react) - [Custom Screen Share Track](/react/api/sdk-reference/custom-tracks#custom-screen-share-track---react) --- --- title: Custom Video Sources - React JS SDK hide_title: false hide_table_of_contents: false description: Custom Video Sources sidebar_label: Custom Video Sources pagination_label: Custom Video Sources keywords: - custom video sources - react js image: img/videosdklive-thumbnail.jpg sidebar_position: 1 slug: custom-video-sources --- # Custom Video Sources - Android To deliver high-quality livestreams, it's essential to fine-tune the video tracks being broadcasted. Whether you’re hosting a webinar, or streaming an event, using custom video tracks gives you better control over stream quality and performance. ## Custom Video Track This feature can be used to add custom video encoder configurations, optimization mode (whether you want to focus on motion, text or detail of the video) and background removal & video filter from external SDK(e.g., Banuba)and send it to other participants. ### `How to Create a Custom Video Track ?` - You can create a Video Track using `createCameraVideoTrack()` method of `VideoSDK`. - This method can be used to create video track using different encoding parameters, camera facing mode, and optimization mode and return `CustomStreamTrack`. #### Example import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; ```javascript val videoCustomTrack: CustomStreamTrack = VideoSDK.createCameraVideoTrack( // highlight-next-line // This will accept the resolution (height x width) of video you want to capture. "h720p_w960p", // "h720p_w960p" | "h720p_w1280p" ... // Default : "h480p_w720p" // highlight-next-line // It will specify whether to use front or back camera for the video track. "front", "back", Default : "front" // highlight-next-line // We will discuss this parameter in next step. 
CustomStreamTrack.VideoMode.MOTION, // CustomStreamTrack.VideoMode.TEXT, CustomStreamTrack.VideoMode.DETAIL , Default : CustomStreamTrack.VideoMode.MOTION // highlight-next-line // multiStream - we will discuss this parameter in next step. false, // true // highlight-next-line // Pass Context this, // highlight-next-line // This is Optional parameter. We will discuss this parameter in next step. observer) ``` ```javascript CustomStreamTrack customStreamTrack = VideoSDK.createCameraVideoTrack( // highlight-next-line // This will accept the resolution (height x width) of video you want to capture. "h480p_w640p", // "h720p_w960p" | "h720p_w1280p" ... // Default : "h480p_w640p" // highlight-next-line // It will specify whether to use front or back camera for the video track. "front", // "back, Default : "front"" // highlight-next-line // We will discuss this parameter in next step. CustomStreamTrack.VideoMode.MOTION, // CustomStreamTrack.VideoMode.TEXT, CustomStreamTrack.VideoMode.DETAIL , Default : CustomStreamTrack.VideoMode.MOTION // highlight-next-line // multiStream - we will discuss this parameter in next step. false, // true // highlight-next-line // Pass Context this // highlight-next-line // This is Optional parameter. We will discuss this parameter in next step. observer); ``` :::caution The behavior of custom track configurations is influenced by the capabilities of the device. For example, if you set the encoder configuration to 1080p but the webcam only supports 720p, the encoder configuration will automatically adjust to the highest resolution that the device can handle, which in this case is 720p. ::: ##### What is `optimizationMode`? - This parameter specifies the optimization mode for the video track being generated. - `motion` : This type of track focuses more on motion video such as webcam video, movies or video games. - It will degrade `resolution` in order to maintain `frame rate`. 
- `text` : This type of track focuses on significant sharp edges and areas of consistent color that can change frequently such as presentations or web pages with text content. - It will degrade `frame rate` in order to maintain `resolution`. - `detail` : This type of track focuses more on the details of the video such as, presentations, painting or line art. - It will degrade `frame rate` in order to maintain `resolution`. ##### What is `multiStream`? - By enabling multiStream, your livestream will broadcast multiple resolutions (e.g., 720p, 480p, 360p), allowing viewers to receive the best stream quality based on their network. The **`multiStream : true`** configuration indicates that VideoSDK, by default, sends multiple resolution video streams to the server. For example, if a user's device capability is 720p, VideoSDK sends streams in 720p, 640p, and 480p resolution. This enables VideoSDK to deliver the appropriate stream to each participant based on their network bandwidth.
![Multi Stream True](/img/multistream_true.png)
Setting **`multiStream : false`** restricts VideoSDK to sending only one stream, which helps maintain quality by focusing on a single resolution.
![Multi Stream False](/img/multistream_false.png)
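
To make the difference between the two modes concrete, here is a minimal, self-contained Java sketch of how a server could choose a stream for each viewer. It does not use the VideoSDK API and is not a description of VideoSDK's internal logic. The `LayerPicker` class, the `Layer` type, and the bitrate figures are assumptions chosen only for illustration.

```java
import java.util.List;

// Illustrative only: models per-viewer stream selection, not VideoSDK internals.
public class LayerPicker {

    // One published resolution. bitrateKbps is an assumed figure, not an SDK value.
    public static class Layer {
        public final int height;
        public final int bitrateKbps;

        public Layer(int height, int bitrateKbps) {
            this.height = height;
            this.bitrateKbps = bitrateKbps;
        }
    }

    // Returns the tallest layer whose bitrate fits the viewer's bandwidth.
    // Falls back to the lowest layer when none fits.
    public static Layer pick(List<Layer> layers, int viewerKbps) {
        Layer lowest = layers.get(0);
        Layer best = null;
        for (Layer layer : layers) {
            if (layer.height < lowest.height) {
                lowest = layer;
            }
            boolean fits = layer.bitrateKbps <= viewerKbps;
            if (fits && (best == null || layer.height > best.height)) {
                best = layer;
            }
        }
        return best != null ? best : lowest;
    }

    public static void main(String[] args) {
        // multiStream : true -> several resolutions are available to choose from
        List<Layer> multi = List.of(new Layer(720, 1500), new Layer(480, 800), new Layer(360, 400));
        // multiStream : false -> only a single resolution is published
        List<Layer> single = List.of(new Layer(720, 1500));

        System.out.println(pick(multi, 1000).height);  // prints 480
        System.out.println(pick(single, 1000).height); // prints 720
    }
}
```

With several layers, a viewer on a constrained network can be served a lower resolution. With a single layer, every viewer receives the same stream regardless of bandwidth.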
:::danger
The `setQuality` parameter will not have any effect if multiStream is set to `false`.
:::

### `How to Setup a Custom Video Track ?`

You can plug in your custom video track either before going live or dynamically while the session is ongoing.

1. [Setup during live stream initialization](/react/guide/video-and-audio-calling-api-sdk/render-media/optimize-video-track#1-setup-during-live-stream-initialization)
2. [Setup dynamically using methods](/react/guide/video-and-audio-calling-api-sdk/render-media/optimize-video-track#2-setup-dynamically-using-methods)

##### 1. Setting up a Custom Track during the initialization of a liveStream

If you're starting the stream with the webcam enabled `(webcamEnabled: true)` and wish to use a custom track from the beginning, pass it through the `customTracks` argument of `initMeeting` as shown below.

:::caution
A custom track is not applied when `webcamEnabled` is `false`.
:::

##### Example

```kotlin
override fun onCreate(savedInstanceState: Bundle?) {
    //..

    //highlight-start
    // Map of track kind to custom track; the key for the camera track must be "video"
    val customTracks: MutableMap<String, CustomStreamTrack> = HashMap()

    val videoCustomTrack: CustomStreamTrack = VideoSDK.createCameraVideoTrack("h720p_w960p", "front", CustomStreamTrack.VideoMode.MOTION, false, this)
    customTracks["video"] = videoCustomTrack //Key must be "video"
    //highlight-end

    // create a new meeting instance
    val liveStream = VideoSDK.initMeeting(
        this@MainActivity, meetingId, participantName,
        //MicEnabled
        true,
        //WebcamEnabled , If true, it will use the passed custom track to turn webcam on
        true,
        // ParticipantId
        null,
        // Mode
        null,
        // MultiStream
        false,
        //Pass the custom tracks here
        //highlight-next-line
        customTracks,
        //MetaData
        null
    )
}
```

```java
@Override
protected void onCreate(Bundle savedInstanceState) {
    //..

    //highlight-start
    // Map of track kind to custom track; the key for the camera track must be "video"
    Map<String, CustomStreamTrack> customTracks = new HashMap<>();

    CustomStreamTrack videoCustomTrack = VideoSDK.createCameraVideoTrack("h720p_w960p", "front", CustomStreamTrack.VideoMode.MOTION, false, this);
    customTracks.put("video", videoCustomTrack); //Key must be "video"
    //highlight-end

    // create a new meeting instance
    Meeting liveStream = VideoSDK.initMeeting(
        MainActivity.this, meetingId, participantName,
        //MicEnabled
        true,
        //WebcamEnabled , If true, it will use the passed custom track to turn webcam on
        true,
        // ParticipantId
        null,
        // Mode
        null,
        // MultiStream
        false,
        //Pass the custom tracks here
        //highlight-next-line
        customTracks,
        //MetaData
        null
    );
}
```

##### 2. Setting up a Custom Track with methods

To switch tracks during the meeting, pass the `CustomStreamTrack` to the `enableWebcam()` method of `Meeting`.

:::tip
Make sure to call `disableWebcam()` before you create a new track; otherwise it may lead to unexpected behavior.
:::

##### Example

```kotlin
val customStreamTrack: CustomStreamTrack = VideoSDK.createCameraVideoTrack("h720p_w960p", "back", CustomStreamTrack.VideoMode.MOTION, false, this)
liveStream!!.enableWebcam(customStreamTrack)
```

```java
CustomStreamTrack customStreamTrack = VideoSDK.createCameraVideoTrack("h720p_w960p", "back", CustomStreamTrack.VideoMode.MOTION, false, this);
liveStream.enableWebcam(customStreamTrack);
```

### `Which Configuration is suitable for Device ?`

This section covers the `encoder` (resolution) and `multiStream` configuration to use for different participant counts.
## API Reference

The API references for all the methods and events utilised in this guide are provided below.

- [Custom Video Track](/react/api/sdk-reference/custom-tracks#custom-video-track---react)

---

---
title: Customized Live Stream
sidebar_position: 1
sidebar_label: Customized Live Stream
hide_table_of_contents: false
---

# Customized Live Stream - Android

VideoSDK is a platform that offers a range of video streaming tools and solutions for content creators, publishers, and developers.

### Custom Template

- A custom template is a template for a live stream that lets users add real-time graphics to their streams.
- With custom templates, users can create unique and engaging video experiences by overlaying graphics, text, images, and animations onto their live streams. These graphics can be customized to match their branding.
- With live scoreboards, social media feeds, and other customizations, users can create visually appealing streams that stand out from the crowd.

:::note
Custom templates can also be used with the recording and RTMP services provided by VideoSDK.
:::

### What you can do with Custom Template

Using a custom template, you can create a variety of modes. Here are a few of the better-known ones:

- **`PK Host:`** The host can organise a player-vs-player battle. The image below shows an example of a gaming battle.
- **`Watermark:`** The host can add and update a watermark anywhere in the template. In the image below, a VideoSDK watermark has been added at the top right of the screen.
- **`News Mode:`** The host can add dynamic text in a lower-third banner. In the image below, sample text has been added at the bottom left of the screen.

![Mobile Custom Template ](https://cdn.videosdk.live/website-resources/docs-resources/mobile_custom_template.png)

## Custom template with VideoSDK

In this section, we will discuss how Custom Templates work with VideoSDK.
- **`Host`**: The host is responsible for starting the live stream by passing the `templateURL`, which is the URL of the hosted template webpage. The host also manages the template, for example by changing text and logos or switching the template layout.
- **`VideoSDK Template Engine`**: The VideoSDK Template Engine accepts the `templateURL` and opens it in a browser. It listens to all the events performed by the host and customizes the template according to the host's preferences.
- **`Viewer`**: The viewer watches the live stream with the added real-time graphics, which makes for a unique and engaging viewing experience.

![custom template](https://cdn.videosdk.live/website-resources/docs-resources/custom_template.png)

### Understanding Template URL

The template URL is the webpage that the VideoSDK Template Engine opens while composing the live stream. The template URL will appear as shown below.

![template url](https://cdn.videosdk.live/website-resources/docs-resources/custom_template_url.png)

The template URL consists of two parts:

- Your actual page URL, which will look something like `https://example.com/videosdk-template`.
- Query parameters, which allow the VideoSDK Template Engine to join the meeting when the URL is opened. There are three query parameters:
  - `token`: The token used to join the meeting.
  - `meetingId`: The ID of the meeting that the VideoSDK Template Engine joins.
  - `participantId`: The participant ID of the VideoSDK Template Engine. Pass it when joining the meeting from your template so that the template engine participant is not visible to other participants. **This parameter is added by VideoSDK.**

:::info
The query parameters mentioned above are mandatory.
Apart from these parameters, you can pass any extra parameters that your use case requires.

:::

### **Creating Template**

**`Step 1:`** Create a new React App using the command below.

```js
npx create-react-app videosdk-custom-template
```

:::note

You can use VideoSDK's React or JavaScript SDK to create a custom template. The following example builds a custom template with the React SDK.

:::

**`Step 2:`** Install the VideoSDK using the npm command below. Make sure you are in your React app directory before you run this command.

```js
npm install "@videosdk.live/react-sdk"

//For the Participants Video
npm install "react-player"
```

###### App Architecture

![template architecture](https://cdn.videosdk.live/website-resources/docs-resources/custom_template_arch.png)

###### Structure of the Project

```jsx title="Project Structure"
root
├── node_modules
├── public
├── src
│    ├── components
│         ├── MeetingContainer.js
│         ├── ParticipantsAudioPlayer.js
│         ├── ParticipantView.js
│         ├── Notification.js
│    ├── icons
│    ├── App.js
│    ├── index.js
├── package.json
.
```

**`Step 3:`** Next, we will fetch the query parameters from the URL, which we will later use to initialize the meeting.

```js title=App.js
import { useMemo } from "react";

function App() {
  const { meetingId, token, participantId } = useMemo(() => {
    //highlight-start
    const location = window.location;

    // Read the query string appended by the VideoSDK Template Engine.
    const urlParams = new URLSearchParams(location.search);

    const paramKeys = {
      meetingId: "meetingId",
      token: "token",
      participantId: "participantId",
    };

    // Replace each key's placeholder with its value from the URL, or null if absent.
    Object.keys(paramKeys).forEach((key) => {
      paramKeys[key] = urlParams.get(key)
        ? decodeURIComponent(urlParams.get(key))
        : null;
    });

    return paramKeys;
    //highlight-end
  }, []);
}
```

**`Step 4:`** Now we will initialize the meeting with the parameters we extracted from the URL. Make sure `joinWithoutUserInteraction` is specified, so that the template engine can join the meeting directly on page load.

```js title=App.js
function App(){
  //highlight-next-line
  ...
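  return meetingId && token && participantId ? (
    <div>
      <MeetingProvider
        config={{
          meetingId,
          participantId,
          // The template engine only renders the stream; it publishes no media.
          micEnabled: false,
          webcamEnabled: false,
          // Display name for the template engine participant (any label works).
          name: "recorder",
        }}
        token={token}
        joinWithoutUserInteraction
      >
        {/* We will create this in upcoming steps */}
        <MeetingContainer />
      </MeetingProvider>
    </div>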
  ) : null;
}
```

**`Step 5:`** Let us create the `MeetingContainer`, which will render the meeting view for us.

- It will also listen to the PubSub messages from the `CHANGE_BACKGROUND` topic, which will change the background color of the meeting.
- It will have a `Notification` component, which will show any messages shared by the Host.

:::note

We will be using the PubSub mechanism to communicate with the template. You can learn more about [PubSub from here](../video-and-audio-calling-api-sdk/collaboration-in-meeting/pubsub).

:::

import CautionMessage from '@site/src/theme/CautionMessage';

```js title=MeetingContainer.js
import { Constants, useMeeting, usePubSub } from "@videosdk.live/react-sdk";
import { Notification } from "./Notification";
import { ParticipantsAudioPlayer } from "./ParticipantsAudioPlayer";
import { ParticipantView } from "./ParticipantView";

export const MeetingContainer = () => {
  const { isMeetingJoined, participants, localParticipant } = useMeeting();
  //highlight-next-line
  const { messages } = usePubSub("CHANGE_BACKGROUND");

  // Only remote participants who publish media are shown in the grid.
  const remoteSpeakers = [...participants.values()].filter((participant) => {
    return (
      participant.mode === Constants.modes.SEND_AND_RECV && !participant.local
    );
  });

  return isMeetingJoined ? (
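    <div
      style={{
        display: "flex",
        height: "100vh",
        //highlight-start
        // Use the most recent CHANGE_BACKGROUND message as the background color.
        backgroundColor:
          messages.length > 0
            ? messages.at(messages.length - 1).message
            : "#fff",
        //highlight-end
      }}
    >
      //highlight-next-line
      <ParticipantsAudioPlayer />
      <div
        style={{
          display: "grid",
          gridTemplateColumns: remoteSpeakers.length > 1 ? "1fr 1fr" : "1fr",
          flex: 1,
          maxHeight: `100vh`,
          overflowY: "auto",
          gap: "20px",
          padding: "20px",
          alignItems: "center",
          justifyItems: "center",
        }}
      >
        {[...remoteSpeakers].map((participant) => {
          return (
            //highlight-start
            <ParticipantView
              key={participant.id}
              participantId={participant.id}
            />
            //highlight-end
          );
        })}
      </div>
      //highlight-next-line
      <Notification />
    </div>
  ) : (
    // Render nothing until the meeting has been joined.
    null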
  );
};
```

**`Step 6:`** Let us create the `ParticipantView` and `ParticipantsAudioPlayer` components, which will render the video and audio of the participants respectively.

```js title=ParticipantView.js
import { useParticipant } from "@videosdk.live/react-sdk";
import { useMemo } from "react";
import ReactPlayer from "react-player";
import MicOffIcon from "../icons/MicOffIcon";

export const ParticipantView = (props) => {
  const { webcamStream, webcamOn, displayName, micOn } = useParticipant(
    props.participantId
  );

  // Wrap the webcam track in a MediaStream so the player can render it.
  const videoStream = useMemo(() => {
    if (webcamOn && webcamStream) {
      const mediaStream = new MediaStream();
      mediaStream.addTrack(webcamStream.track);
      return mediaStream;
    }
  }, [webcamStream, webcamOn]);

  return (
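    <div>
      {webcamOn && webcamStream ? (
        <ReactPlayer
          playsinline
          controls={false}
          muted={true}
          playing={true}
          url={videoStream}
          onError={(err) => {
            console.log(err, "participant video error");
          }}
        />
      ) : (
        <div>{String(displayName).charAt(0).toUpperCase()}</div>
      )}
      <div>
        {displayName}{" "}
        {!micOn && <MicOffIcon />}
      </div>
    </div>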
  );
};
```

```js title=ParticipantsAudioPlayer.js
import { useMeeting, useParticipant } from "@videosdk.live/react-sdk";
import { useEffect, useRef } from "react";

const ParticipantAudio = ({ participantId }) => {
  const { micOn, micStream, isLocal } = useParticipant(participantId);
  const audioPlayer = useRef();

  useEffect(() => {
    if (!isLocal && audioPlayer.current && micOn && micStream) {
      // Attach the remote participant's mic track to the audio element.
      const mediaStream = new MediaStream();
      mediaStream.addTrack(micStream.track);

      audioPlayer.current.srcObject = mediaStream;
      audioPlayer.current.play().catch((err) => {});
    } else if (audioPlayer.current) {
      // Detach the stream when the mic is off or the participant is local.
      audioPlayer.current.srcObject = null;
    }
  }, [micStream, micOn, isLocal, participantId]);

  return