# A2A Implementation Guide

This guide shows you how to build a complete Agent to Agent (A2A) system using the concepts from the [A2A Overview](overview). We'll create a banking customer service system with a main customer service agent and a loan specialist.

## Implementation Overview

We'll build a system with:

- **Customer Service Agent**: Voice-enabled interface agent using **RealTimePipeline** for low-latency voice interactions
- **Loan Specialist Agent**: Text-based domain expert using **CascadingPipeline** for efficient text processing
- **Intelligent Routing**: Automatic detection and forwarding of loan queries
- **Seamless Communication**: Users get expert responses without knowing about the routing

## Structure of the project

```js
A2A
├── agents/
│   ├── customer_agent.py    # CustomerServiceAgent definition
│   └── loan_agent.py        # LoanAgent definition
├── session_manager.py       # Handles session creation, pipeline setup, meeting join/leave
└── main.py                  # Entry point: runs main() and starts agents
```

## Sequence Diagram

![A2A Architecture](https://cdn.videosdk.live/website-resources/docs-resources/a2a_sequence_diagram.png)

## Step 1: Create the Customer Service Agent

- **`Interface Agent`**: Creates `CustomerServiceAgent` as the main user-facing agent with voice capabilities and customer service instructions.
- **`Function Tool`**: Implements the `@function_tool`-decorated `forward_to_specialist()`, which uses A2A discovery to find and route queries to domain specialists.
- **`Response Relay`**: Includes a `handle_specialist_response()` method that automatically receives and relays specialist responses back to users.

```python title="agents/customer_agent.py"
from videosdk.agents import Agent, AgentCard, A2AMessage, function_tool
import asyncio
from typing import Dict, Any

class CustomerServiceAgent(Agent):
    def __init__(self):
        super().__init__(
            agent_id="customer_service_1",
            instructions=(
                "You are a helpful bank customer service agent. 
" "For general banking queries (account balances, transactions, basic services), answer directly. " "For ANY loan-related queries, questions, or follow-ups, ALWAYS use the forward_to_specialist function " "with domain set to 'loan'. This includes initial loan questions AND all follow-up questions about loans. " "Do NOT attempt to answer loan questions yourself - always forward them to the specialist. " "After forwarding a loan query, stay engaged and automatically relay any response you receive from the specialist. " "When you receive responses from specialists, immediately relay them naturally to the customer." ) ) @function_tool async def forward_to_specialist(self, query: str, domain: str) -> Dict[str, Any]: """Forward queries to domain specialist agents using A2A discovery""" # Use A2A discovery to find specialists by domain specialists = self.a2a.registry.find_agents_by_domain(domain) id_of_target_agent = specialists[0] if specialists else None if not id_of_target_agent: return {"error": f"No specialist found for domain {domain}"} # Send A2A message to the specialist await self.a2a.send_message( to_agent=id_of_target_agent, message_type="specialist_query", content={"query": query} ) return { "status": "forwarded", "specialist": id_of_target_agent, "message": "Let me get that information for you from our loan specialist..." 
        }

    async def handle_specialist_response(self, message: A2AMessage) -> None:
        """Handle responses from specialist agents and relay to user"""
        response = message.content.get("response")
        if response:
            # Brief pause for natural conversation flow
            await asyncio.sleep(0.5)

            # Try multiple methods to relay the response to the user
            prompt = f"The loan specialist has responded: {response}"
            methods_to_try = [
                (self.session.pipeline.send_text_message, prompt),  # Comment this out when the main agent uses a CascadingPipeline
                (self.session.pipeline.model.send_message, response),  # Comment this out when the main agent uses a CascadingPipeline
                (self.session.say, response)
            ]
            for method, arg in methods_to_try:
                try:
                    await method(arg)
                    break
                except Exception as e:
                    print(f"Error with {method.__name__}: {e}")

    async def on_enter(self):
        # Register this agent with the A2A system
        await self.register_a2a(AgentCard(
            id="customer_service_1",
            name="Customer Service Agent",
            domain="customer_service",
            capabilities=["query_handling", "specialist_coordination"],
            description="Handles customer queries and coordinates with specialists"
        ))
        await self.session.say("Hello! I am your customer service agent. How can I help you?")
        # Set up message listener for specialist responses
        self.a2a.on_message("specialist_response", self.handle_specialist_response)

    async def on_exit(self):
        print("Customer agent left the meeting")
```

## Step 2: Create the Loan Specialist Agent

- **`Specialist Agent Setup`**: Creates the `LoanAgent` class with specialized loan expertise instructions and agent_id `"specialist_1"`.
- **`Message Handlers`**: Implements `handle_specialist_query()` to process incoming queries and `handle_model_response()` to send responses back.
- **`Registration`**: Registers with the A2A system using domain "loan" so it can be discovered by other agents needing loan expertise.
```python title="agents/loan_agent.py" from videosdk.agents import Agent, AgentCard, A2AMessage class LoanAgent(Agent): def __init__(self): super().__init__( agent_id="specialist_1", instructions=( "You are a specialized loan expert at a bank. " "Provide detailed, helpful information about loans including interest rates, terms, and requirements. " "Give complete answers with specific details when possible. " "You can discuss personal loans, car loans, home loans, and business loans. " "Provide helpful guidance and next steps for loan applications. " "Be friendly and professional in your responses. " "Keep responses concise within 5-7 lines and easily understandable." ) ) async def handle_specialist_query(self, message: A2AMessage): """Process incoming queries from customer service agent""" query = message.content.get("query") if query: # Send the query to our AI model for processing await self.session.pipeline.send_text_message(query) async def handle_model_response(self, message: A2AMessage): """Send processed responses back to requesting agent""" response = message.content.get("response") requesting_agent = message.to_agent if response and requesting_agent: # Send the specialist response back to the customer service agent await self.a2a.send_message( to_agent=requesting_agent, message_type="specialist_response", content={"response": response} ) async def on_enter(self): await self.register_a2a(AgentCard( id="specialist_1", name="Loan Specialist Agent", domain="loan", capabilities=["loan_consultation", "loan_information", "interest_rates"], description="Handles loan queries" )) self.a2a.on_message("specialist_query", self.handle_specialist_query) self.a2a.on_message("model_response", self.handle_model_response) async def on_exit(self): print("LoanAgent Left") ``` ## Step 3: Configure Session Management - **`Pipeline Architecture`**: Uses **RealTimePipeline** for customer agent (audio-enabled Gemini for voice interaction) and **CascadingPipeline** for specialist 
agent (text-only OpenAI for efficient processing). - **`Session Factory`**: Provides `create_pipeline()` and `create_session()` functions to properly configure agent sessions based on their roles. - **`Modality Separation`**: Ensures customer agent can handle voice while specialist processes text in background. ```python title="session_manager.py" from videosdk.agents import AgentSession, CascadingPipeline, RealTimePipeline, ConversationFlow from videosdk.plugins.openai import OpenAILLM from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig import os class MyConversationFlow(ConversationFlow): async def on_turn_start(self, transcript: str) -> None: pass async def on_turn_end(self) -> None: pass def create_pipeline(agent_type: str): if agent_type == "customer": # Customer agent: RealTimePipeline for voice interaction return RealTimePipeline( model=GeminiRealtime( model="gemini-2.5-flash-native-audio-preview-12-2025", config=GeminiLiveConfig( voice="Leda", response_modalities=["AUDIO"] ) ) ) else: # Specialist agent: CascadingPipeline for text processing return CascadingPipeline( llm=OpenAILLM(api_key=os.getenv("OPENAI_API_KEY")), ) def create_session(agent, pipeline) -> AgentSession: return AgentSession( agent=agent, pipeline=pipeline, conversation_flow=MyConversationFlow(agent=agent), ) ``` :::note While setting up pipelines, make sure: - The **customer agent** has **voice capabilities only** (via `RealTimePipeline`). - The **specialist agent (Loan Agent)** operates in **text-only mode** (via `CascadingPipeline`). ::: :::info **Pipeline Support**: The VideoSDK AI Agents framework supports both **RealTimePipeline** and **CascadingPipeline**, enabling flexible configurations for voice and text processing with **A2A**. You can run a full `RealTimePipeline` or `CascadingPipeline` for both modalities, or create a hybrid setup that combines the two. 
This allows you to tailor the use of STT, TTS, and LLM to suit your specific use case, whether for low-latency interactions, complex processing flows, or a mix of both. ::: ## Step 4: Deploy A2A System on VideoSDK Platform - **`Meeting Setup`**: Customer agent joins VideoSDK meeting for user interaction while specialist runs in background mode. Requires environment variables: `VIDEOSDK_AUTH_TOKEN`, `GOOGLE_API_KEY`, and `OPENAI_API_KEY`. - **`System Orchestration`**: Uses `JobContext` and `WorkerJob` to manage the meeting lifecycle and agent coordination. - **`Resource Management`**: Handles startup sequence, keeps system running, and provides clean shutdown with proper A2A unregistration ```python title="main.py" import asyncio from contextlib import suppress from agents.customer_agent import CustomerServiceAgent from agents.loan_agent import LoanAgent from session_manager import create_pipeline, create_session from videosdk.agents import JobContext, RoomOptions, WorkerJob async def main(ctx: JobContext): specialist_agent = LoanAgent() specialist_pipeline = create_pipeline("specialist") specialist_session = create_session(specialist_agent, specialist_pipeline) customer_agent = CustomerServiceAgent() customer_pipeline = create_pipeline("customer") customer_session = create_session(customer_agent, customer_pipeline) specialist_task = asyncio.create_task(specialist_session.start()) try: await ctx.connect() await customer_session.start() await asyncio.Event().wait() except (KeyboardInterrupt, asyncio.CancelledError): print("Shutting down...") finally: specialist_task.cancel() with suppress(asyncio.CancelledError): await specialist_task await specialist_session.close() await customer_session.close() await specialist_agent.unregister_a2a() await customer_agent.unregister_a2a() await ctx.shutdown() def customer_agent_context() -> JobContext: room_options = RoomOptions(room_id="", name="Customer Service Agent", playground=True) return JobContext( room_options=room_options 
    )

if __name__ == "__main__":
    job = WorkerJob(entrypoint=main, jobctx=customer_agent_context)
    job.start()
```

:::note
Ensure that the `JobContext` is created **only for the primary (main) agent**, i.e., the agent responsible for user-facing interaction (e.g., Customer Agent). The background agent (e.g., Loan Agent) should not have its own context or initiate a separate connection.
:::

#### Running the Application

Set the required environment variables:

```bash
export VIDEOSDK_AUTH_TOKEN="your_videosdk_token"
export GOOGLE_API_KEY="your_google_api_key"
export OPENAI_API_KEY="your_openai_api_key"
```

Set the empty `room_id` in `customer_agent_context()` to your actual meeting ID, then run:

```bash
cd A2A
python main.py
```

:::tip Quick Start
Get the complete working example at [A2A Quick Start Repository](https://github.com/videosdk-live/agents-quickstart/tree/main/A2A) with all the code ready to run.
:::

---

# Agent to Agent (A2A)

The Agent to Agent (A2A) protocol enables seamless collaboration between specialized AI agents, allowing them to communicate, share knowledge, and coordinate responses based on their unique capabilities and domain expertise. With VideoSDK's A2A implementation, you can create multi-agent systems where different agents work together to provide comprehensive solutions.

## How It Works

### Basic Flow

1. **Agent Registration**: Agents register themselves with an `AgentCard` that contains their capabilities and domain expertise
2. **Client Query**: Client sends a query to the main agent
3. **Agent Discovery**: Main agent discovers relevant specialist agents using agent cards
4. **Query Forwarding**: Main agent forwards specialized queries to appropriate agents
5. **Response Chain**: Specialist agents process queries and respond back to the main agent
6.
**Client Response**: Main agent formats and delivers the final response to the client ![A2A Architecture](https://cdn.videosdk.live/website-resources/docs-resources/a2a_diagram.png) ### Example Scenario ``` Client → "Book a flight to New York and find a hotel" ↓ Travel Agent (Main) → Analyzes query ↓ Travel Agent → Discovers Flight Booking Agent & Hotel Booking Agent ↓ Travel Agent → Forwards flight query to Flight Booking Agent Travel Agent → Forwards hotel query to Hotel Booking Agent ↓ Specialist Agents → Process queries and respond back (text format) ↓ Travel Agent → Combines responses and sends to client (audio format) ``` # Core Components ## 1. AgentCard The `AgentCard` is how agents identify themselves and advertise their capabilities to other agents. #### Structure ```python AgentCard( id="agent_flight_001", name="Skymate", domain="flight", capabilities=[ "search_flights", "modify_bookings", "show_flight_status" ], description="Handles all flight-related tasks" ) ``` #### Parameters | Parameter | Type | Required | Description | | -------------- | ------ | -------- | ------------------------------------ | | `id` | string | Yes | Unique identifier for the agent | | `name` | string | Yes | Human-readable agent name | | `domain` | string | Yes | Primary expertise domain | | `capabilities` | list | Yes | List of specific capabilities | | `description` | string | Yes | Brief description of agent's purpose | | `metadata` | dict | No | Additional metadata for the agent | ## 2. A2AMessage `A2AMessage` is the standardized communication format between agents. 
#### Structure

```python
message = A2AMessage(
    from_agent="travel_agent_1",
    to_agent="agent_flight_001",
    type="flight_status_query",
    content={"query": "What's the status of AI202?"},
    metadata={"client_id": "xyz123", "urgency": "medium"}
)
```

#### Parameters

| Parameter    | Type   | Required | Description                 |
| ------------ | ------ | -------- | --------------------------- |
| `from_agent` | string | Yes      | ID of the sending agent     |
| `to_agent`   | string | Yes      | ID of the receiving agent   |
| `type`       | string | Yes      | Message type/event name     |
| `content`    | dict   | Yes      | Message payload             |
| `metadata`   | dict   | No       | Additional message metadata |

## 3. Agent Registry

#### `register_a2a(agent_card)`

Register an agent with the A2A system.

```python
async def on_enter(self):
    await self.register_a2a(AgentCard(
        id="agent_flight_001",
        name="Skymate",
        domain="flight",
        capabilities=[
            "search_flights",
            "modify_bookings",
            "show_flight_status"
        ],
        description="Handles all flight-related tasks"
    ))
```

**What Registration Does:**

- Adds the agent to the global `AgentRegistry` singleton
- Makes the agent discoverable by other agents
- Stores both the `AgentCard` and agent instance
- Enables message routing to this agent

#### `unregister_a2a()`

Unregister an agent from the A2A system.

```python
await self.unregister_a2a()
```

## 4. A2AProtocol Class

The main class for managing agent-to-agent communication.

### Agent Discovery

#### `find_agents_by_domain(domain: str)`

Discover agents based on their domain expertise.

```python
agents = self.a2a.registry.find_agents_by_domain("hotel")
# Returns: ["agent_hotel_001"]
```

#### `find_agents_by_capability(cap: str)`

Find agents with specific skills.

```python
agents = await self.a2a.registry.find_agents_by_capability("modify_bookings")
# Returns: ["agent_flight_001"]
```

---

### Agent Communications

#### `send_message(to_agent, message_type, content, metadata=None)`

Send messages directly to other agents.
```python
await self.a2a.send_message(
    to_agent="agent_hotel_001",
    message_type="hotel_booking_query",  # Event name that the receiving agent listens for
    content={"query": "Find 3-star hotels in Delhi under $100"},
    metadata={"client_id": "xyz123"}  # Optional metadata
)
```

**Parameters:**

- `to_agent` (string): Target agent ID
- `message_type` (string): Event name the receiving agent listens for
- `content` (dict): Message payload
- `metadata` (dict, optional): Additional message metadata

#### `on_message(message_type, handler)`

Register message handlers for incoming messages.

```python
# Register a handler for hotel booking queries
self.a2a.on_message("hotel_booking_query", self.handle_specialist_query)

async def handle_specialist_query(self, message):
    # Process the incoming message
    query = message.content.get("query")
    # ... process query ...
    # Return response
    return {"response": "Found two 3-star hotels in Delhi under $100"}
```

## Next Steps

Now that you're familiar with the core A2A concepts, it's time to move from theory to practice:

👉 **[Explore the Full A2A Implementation](implementation)**

Dive into a complete, working example that demonstrates agent discovery, messaging, and collaboration in action.

---

# Agent Runtime Guide

AI voice agents are transforming how businesses interact with customers, providing natural, conversational experiences through voice interfaces. VideoSDK's **Agent Runtime** feature offers a powerful **no-code/low-code interface** that enables you to build sophisticated AI voice agents without extensive programming knowledge.

## Prerequisites

Before you begin, ensure you have:

- **VideoSDK Account:** Visit the [VideoSDK Dashboard](https://app.videosdk.live) to sign up for a free account and access the AI Agent builder.

## Step-By-Step Guide
### Step 1: Create a New Agent
1. In the dashboard, navigate to the **AI Agent > Agents** section, or visit the [Agents Dashboard](https://app.videosdk.live/agents/agents) directly.
2. To create a voice agent, click on **Agents** in the sidebar.

![Select Agents in Dashboard](https://strapi.videosdk.live/uploads/1_Select_Agents_in_Dashboard_1b6a6f6d0c.png)
### Step 2: Click `Add New Agent`
This is where you'll start creating your voice agent. If no agent has been created yet, you'll see an **Add New Agent** button. If agents already exist, you'll see a list of all your AI voice agents, and you can click the button in the top-right corner to create a new one.

![Click Create AI Voice Agent Button](https://strapi.videosdk.live/uploads/2_Click_Create_AI_Voice_Agent_Button_349f3799f2.png)
### Step 3: Configure Agent Details
This is where you can define your AI voice agent's persona and behavior: - **Agent Name:** Set a descriptive name for your agent (e.g., "AI Interviewer"). - **System Prompt:** Define the agent's role, personality, and behavior guidelines. - **Welcome Message:** Set the message that plays when the agent joins a conversation. - **Closing Message:** Set the message that plays when the agent leaves a conversation. ![Create Voice Agent Persona](https://strapi.videosdk.live/uploads/3_Create_Voice_Agent_Persona_6281a768ef.png)
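For example, an "AI Interviewer" persona might fill the four fields like this (illustrative values only, shown as Python strings for readability — these are not dashboard field names in an API, just sample content):

```python
# Illustrative persona configuration (hypothetical sample values for the
# dashboard fields above: Agent Name, System Prompt, Welcome Message, Closing Message).
agent_name = "AI Interviewer"
system_prompt = (
    "You are a friendly technical interviewer. Ask one question at a time, "
    "listen carefully, and follow up on the candidate's answers. Keep replies short."
)
welcome_message = "Hi! I'm your AI interviewer. Shall we begin?"
closing_message = "Thanks for your time today. We'll be in touch soon."
print(agent_name)
```

A concrete, constrained system prompt like this tends to produce more predictable agent behavior than a one-line role description.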
### Step 4: Configure the Pipeline
The pipeline is the core engine of your voice agent, processing audio through speech recognition, AI reasoning, and text-to-speech. VideoSDK offers two pipeline options: **Realtime Pipeline** and **Cascading Pipeline**.

**Realtime:** The **Realtime Pipeline** provides direct speech-to-speech processing with minimal latency, ideal for natural, conversational interactions.

Example: Adding **Gemini Realtime Model**

1. Add your Gemini API key in the pipeline configuration or at [Realtime Integrations](https://app.videosdk.live/agents/integrations/realtime).
2. To get your API key, visit [Gemini API Keys](https://aistudio.google.com/api-keys).

![Gemini Add Your API Key](https://strapi.videosdk.live/uploads/4_Gemini_Add_Your_API_Key_bcf81a0f82.png)

**Available models:**

- `gemini-2.5-flash-native-audio-preview-12-2025`
- `gemini-2.0-flash`
- `gemini-2.5-flash-native-audio`

---

**Cascading:** The **Cascading Pipeline** processes audio through distinct stages (STT → LLM → TTS), providing maximum control over each component. Configure your providers for [STT Integrations](https://app.videosdk.live/agents/integrations/stt), [LLM Integrations](https://app.videosdk.live/agents/integrations/llm) and [TTS Integrations](https://app.videosdk.live/agents/integrations/tts).

![STT Providers](https://strapi.videosdk.live/uploads/stt_e2522d9ea2.png)

Example: Adding **Deepgram STT**

- Get your API key at the [Deepgram Console](https://console.deepgram.com/)

**Available models:**

- `flux-general-en`
- `nova-2` or `nova-2-general` (for non-English transcriptions)
- `nova-3` or `nova-3-general`
- `base`
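Conceptually, the difference between the two options is how many stages the audio passes through. A toy sketch of the two data flows (plain Python stand-ins, not VideoSDK code):

```python
# Toy data-flow sketch, NOT VideoSDK code: a cascading pipeline chains three
# stages (STT -> LLM -> TTS), while a realtime pipeline is one speech-to-speech model.
def stt(audio: bytes) -> str:   # speech -> text (stand-in transcription)
    return audio.decode()

def llm(text: str) -> str:      # text -> text (stand-in reasoning)
    return f"answer({text})"

def tts(text: str) -> bytes:    # text -> speech (stand-in synthesis)
    return text.encode()

def cascading_pipeline(audio: bytes) -> bytes:
    # Three hops: each stage can be swapped or tuned independently.
    return tts(llm(stt(audio)))

def realtime_pipeline(audio: bytes) -> bytes:
    # One hop: a single model consumes and produces speech directly.
    return f"answer({audio.decode()})".encode()

print(cascading_pipeline(b"hello"))
```

The single hop is why the realtime option achieves lower latency, while the cascading option trades latency for per-stage control over STT, LLM, and TTS providers.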
### Step 5: Knowledge Base Integration
Upload a knowledge base to provide context and domain expertise to your voice agent. This dramatically improves answer accuracy and enables your agent to handle specialized queries. - Navigate to the **Knowledge Base** tab in your agent configuration. - Upload documents, FAQs, or product sheets that contain relevant information. - The agent will use this knowledge to provide more accurate and contextual responses. ![Add Knowledge Base in VideoSDK](https://strapi.videosdk.live/uploads/Add_Knowlodege_base_in_videosdk_363aaa82f3.png)
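Under the hood, knowledge-base answering generally follows a retrieve-then-answer pattern: the passages most relevant to the user's query are found and injected into the agent's context before the LLM responds. A toy sketch of the retrieval idea (plain Python, not VideoSDK's actual implementation):

```python
# Toy keyword-overlap retrieval -- illustrates the idea only; real knowledge
# bases typically use embeddings rather than word overlap.
def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    q = set(query.lower().split())
    # Rank documents by how many query words they share with the document.
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

docs = [
    "Personal loans start at 6.5% APR for qualified applicants.",
    "Our savings accounts earn 4% interest.",
]
print(retrieve("what is the personal loan rate", docs))
```

This is why well-structured, specific documents improve answer accuracy: the retriever can only ground the agent in passages that actually contain the relevant facts.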
### Step 6: Configure Telephony Settings
Configure telephony settings to enable your agent to handle phone calls: - **Agent Type:** Set the type of agent (inbound, outbound, or both). - **Inbound Gateways:** Set up gateways to receive incoming calls. - **Outbound Gateways:** Set up gateways to make outbound calls. - **Routing Rules:** Create rules to map phone numbers to your agent. - **Calling Settings:** Configure call handling preferences and behavior. ![Telephony Configuration](https://strapi.videosdk.live/uploads/telephony_agents_dd2c2080ac.png) This configuration is essential for **call center automation**, **platform integration**, and smooth **agent orchestration**.
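A routing rule boils down to a mapping from a dialed phone number to an agent. A minimal illustration of the concept (hypothetical numbers and agent IDs, not the dashboard's actual rule format):

```python
# Hypothetical routing table: dialed number -> agent ID.
# The dashboard's routing rules express this same mapping declaratively.
ROUTES = {
    "+15550100": "support_agent",
    "+15550101": "sales_agent",
}

def route_call(dialed_number: str) -> str:
    # Fall back to a default agent when no rule matches the dialed number.
    return ROUTES.get(dialed_number, "default_agent")

print(route_call("+15550100"))
```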
### Step 7: Test Your Voice Agent
You can interact with the agent directly from the dashboard before connecting it to production channels: 1. Visit [Agents Dashboard](https://app.videosdk.live/agents/agents). 2. Locate your agent in the list and click the **Test** button in the top-right corner. 3. Use the built-in simulator to speak with the agent in real time, view live transcripts, and fine-tune prompts based on the conversation. ![Test AI Voice Agent](https://strapi.videosdk.live/uploads/test_ai_voice_agent_30e0045af0.png)
### Step 8: Connect Voice Agent
Once your agent is configured, you can connect it to various platforms and devices:

- **Web:** Integrate your agent into web applications.
- **Mobile:** Connect to iOS and Android mobile apps.
- **Telephony:** Deploy to phone systems for voice calls.
- **IoT Devices:** Connect to Internet of Things devices.

![Connect AI Voice Agent](https://strapi.videosdk.live/uploads/8_connect_ai_voice_agent_17fe428419.png)

## Next Steps

Congratulations! You've successfully created your AI voice agent. Here are the next steps:

- **Test Your Agent:** Use the built-in test simulator to verify your agent's behavior and responses.
- **Deploy to Production:** Connect your agent to production environments and real user interactions.
- **Monitor Performance:** Track agent performance, user satisfaction, and conversation quality.
- **Iterate and Improve:** Refine your agent's prompts, knowledge base, and configuration based on real-world usage.

Keep refining your agent's configuration to build a powerful voice AI solution tailored to your specific business needs.

### Integrations

- [Connect with JavaScript](/ai_agents/agent-runtime/connect-agent/web-integrations/with-javascript): Core language of the web.
- [Connect with React](/ai_agents/agent-runtime/connect-agent/web-integrations/with-react): UI library for building interactive web apps.
- [Connect with React Native](/ai_agents/agent-runtime/connect-agent/mobile-integrations/with-react-native): Cross-platform mobile app JS framework.
- [Connect with Flutter](/ai_agents/agent-runtime/connect-agent/mobile-integrations/with-flutter): Cross-platform apps from one codebase.
- [Connect with iOS](/ai_agents/agent-runtime/connect-agent/mobile-integrations/with-ios): Mobile apps for Apple devices.

---

# Agent Runtime with Flutter

VideoSDK empowers you to seamlessly integrate AI agents with real-time voice interaction into your Flutter application within minutes.
This guide shows you how to connect a Flutter frontend with an AI agent created and configured entirely from the VideoSDK dashboard.

## Prerequisites

Before proceeding, ensure that your development environment meets the following requirements:

- A VideoSDK developer account (if you don't have one, sign up via the **[VideoSDK Dashboard](https://app.videosdk.live/)**)
- Flutter installed on your device
- Familiarity with creating a no-code voice agent. If you're new to this, please follow our guide on how to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)** first.

:::important
You need a VideoSDK account to generate a token and an agent from the dashboard. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token.
:::

## Project Structure

Your project structure should look like this:

```jsx title="Project Structure"
root
├── android
├── ios
├── lib
│   ├── api_call.dart
│   ├── join_screen.dart
│   ├── main.dart
│   ├── meeting_controls.dart
│   ├── meeting_screen.dart
│   └── participant_tile.dart
├── macos
├── web
└── windows
```

You will be working on the following files:

- `join_screen.dart`: Responsible for the user interface to join a meeting.
- `meeting_screen.dart`: Displays the meeting interface and handles meeting logic.
- `api_call.dart`: Handles API calls for creating meetings and dispatching agents.

## 1. Flutter Frontend

### Step 1: Getting Started

Follow these steps to create the environment necessary to add AI agent functionality to your app.

#### Create a New Flutter App

Create a new Flutter app using the following command:

```bash
$ flutter create videosdk_ai_agent_flutter_app
```

#### Install VideoSDK

Install the VideoSDK using the following Flutter command. Make sure you are in your Flutter app directory before you run this command.
```bash
$ flutter pub add videosdk
$ flutter pub add http
```

### Step 2: Configure Project

#### For Android

- Update `/android/app/src/main/AndroidManifest.xml` with the permissions we will be using to implement the audio and video features.

```xml title="android/app/src/main/AndroidManifest.xml"
<!-- Typical permissions for audio/video features; adjust to your app's needs. -->
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.CAMERA" />
<uses-permission android:name="android.permission.MODIFY_AUDIO_SETTINGS" />
```

- If necessary, increase the `minSdkVersion` of `defaultConfig` in `build.gradle` to `23` (the default set by the Flutter generator is `16`).

#### For iOS

- Add the following entries, which allow your app to access the camera and microphone, to your `/ios/Runner/Info.plist` file:

```xml title="/ios/Runner/Info.plist"
<key>NSCameraUsageDescription</key>
<string>$(PRODUCT_NAME) Camera Usage!</string>
<key>NSMicrophoneUsageDescription</key>
<string>$(PRODUCT_NAME) Microphone Usage!</string>
```

- Uncomment the following line to define a global platform for your project in `/ios/Podfile`:

```ruby title="/ios/Podfile"
platform :ios, '12.0'
```

#### For MacOS

- Add the following entries to your `/macos/Runner/Info.plist` file, which allow your app to access the camera and microphone.

```xml title="/macos/Runner/Info.plist"
<key>NSCameraUsageDescription</key>
<string>$(PRODUCT_NAME) Camera Usage!</string>
<key>NSMicrophoneUsageDescription</key>
<string>$(PRODUCT_NAME) Microphone Usage!</string>
```

- Add the following entries to your `/macos/Runner/DebugProfile.entitlements` file, which allow your app to access the camera and microphone and open outgoing network connections.

```xml title="/macos/Runner/DebugProfile.entitlements"
<key>com.apple.security.network.client</key>
<true/>
<key>com.apple.security.device.camera</key>
<true/>
<key>com.apple.security.device.microphone</key>
<true/>
```

- Add the following entries to your `/macos/Runner/Release.entitlements` file, which allow your app to access the camera and microphone and open outgoing network connections.
```xml title="/macos/Runner/Release.entitlements"
<key>com.apple.security.network.server</key>
<true/>
<key>com.apple.security.network.client</key>
<true/>
<key>com.apple.security.device.camera</key>
<true/>
<key>com.apple.security.device.microphone</key>
<true/>
```

### Step 3: Configure Environment and Credentials

Create a meeting room using the VideoSDK API:

```bash
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_JWT_TOKEN_HERE" \
  -H "Content-Type: application/json"
```

Copy the `roomId` from the response and configure it in `lib/api_call.dart` along with your agent credentials.

```dart title="lib/api_call.dart"
import 'dart:convert';
import 'package:http/http.dart' as http;

// Auth token we will use to generate a meeting and connect to it
const token = 'YOUR_VIDEOSDK_AUTH_TOKEN';
const agentId = 'YOUR_AGENT_ID';
const versionId = 'YOUR_VERSION_ID';

// API call to create a meeting
Future<String> createMeeting() async {
  final http.Response httpResponse = await http.post(
    Uri.parse('https://api.videosdk.live/v2/rooms'),
    headers: {'Authorization': token},
  );

  // Destructure the roomId from the response
  return json.decode(httpResponse.body)['roomId'];
}

// API call to connect the agent
Future<void> connectAgent(String meetingId) async {
  final http.Response httpResponse = await http.post(
    Uri.parse('https://api.videosdk.live/v2/agent/general/dispatch'),
    headers: {
      'Authorization': token,
      'Content-Type': 'application/json',
    },
    body: json.encode({
      'agentId': agentId,
      'meetingId': meetingId,
      'versionId': versionId,
    }),
  );

  if (httpResponse.statusCode != 200) {
    throw Exception('Failed to connect agent');
  }
}
```

### Step 4: Design the User Interface (UI)

Update the UI files to add the "Connect Agent" button and connect the logic.
```dart title="lib/join_screen.dart"
import 'package:flutter/material.dart';
import 'api_call.dart';
import 'meeting_screen.dart';

class JoinScreen extends StatelessWidget {
  final _meetingIdController = TextEditingController();
  JoinScreen({super.key});

  void onJoinButtonPressed(BuildContext context) {
    // Check that the meeting ID is not null or invalid.
    // If the meeting ID is valid, navigate to MeetingScreen with the meetingId and token.
    Navigator.of(context).push(
      MaterialPageRoute(
        builder: (context) =>
            MeetingScreen(meetingId: "YOUR_MEETING_ID", token: token),
      ),
    );
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('VideoSDK QuickStart')),
      body: Padding(
        padding: const EdgeInsets.all(12.0),
        child: Center(
          child: ElevatedButton(
            onPressed: () => onJoinButtonPressed(context),
            child: const Text('Join Meeting'),
          ),
        ),
      ),
    );
  }
}
```

```dart title="lib/meeting_screen.dart"
import 'package:flutter/material.dart';
import 'package:videosdk/videosdk.dart';
import 'participant_tile.dart';
import 'meeting_controls.dart';
import 'api_call.dart';

class MeetingScreen extends StatefulWidget {
  final String meetingId;
  final String token;

  const MeetingScreen({
    super.key,
    required this.meetingId,
    required this.token,
  });

  @override
  State createState() => _MeetingScreenState();
}

class _MeetingScreenState extends State {
  late Room _room;
  var micEnabled = true;
  var camEnabled = true;
  bool _isAgentConnected = false;

  Map participants = {};

  @override
  void initState() {
    // create room
    _room = VideoSDK.createRoom(
      roomId: widget.meetingId,
      token: widget.token,
      displayName: "John Doe",
      micEnabled: micEnabled,
      camEnabled: false,
      defaultCameraIndex: 1, // Index of MediaDevices will be used to set default camera
    );

    setMeetingEventListener();

    // Join room
    _room.join();

    super.initState();
  }

  // listening to meeting events
  void setMeetingEventListener() {
    _room.on(Events.roomJoined, () {
      setState(() {
        participants.putIfAbsent(
          _room.localParticipant.id,
          () =>
_room.localParticipant, ); }); }); _room.on(Events.participantJoined, (Participant participant) { setState( () => participants.putIfAbsent(participant.id, () => participant), ); }); _room.on(Events.participantLeft, (String participantId) { if (participants.containsKey(participantId)) { setState(() => participants.remove(participantId)); } }); _room.on(Events.roomLeft, () { participants.clear(); Navigator.popUntil(context, ModalRoute.withName('/')); }); } void _connectAgent() async { try { await connectAgent(widget.meetingId); setState(() { _isAgentConnected = true; }); ScaffoldMessenger.of(context).showSnackBar( const SnackBar(content: Text('Agent connected successfully!')), ); } catch (e) { ScaffoldMessenger.of(context).showSnackBar( SnackBar(content: Text('Failed to connect agent: ${e.toString()}')), ); } } // onbackButton pressed leave the room Future _onWillPop() async { _room.leave(); return true; } @override Widget build(BuildContext context) { return WillPopScope( onWillPop: () => _onWillPop(), child: Scaffold( appBar: AppBar(title: const Text('VideoSDK QuickStart')), body: Padding( padding: const EdgeInsets.all(8.0), child: Column( children: [ Text(widget.meetingId), //render all participant Expanded( child: Padding( padding: const EdgeInsets.all(8.0), child: GridView.builder( gridDelegate: const SliverGridDelegateWithFixedCrossAxisCount( crossAxisCount: 2, crossAxisSpacing: 10, mainAxisSpacing: 10, mainAxisExtent: 300, ), itemBuilder: (context, index) { return ParticipantTile( key: Key(participants.values.elementAt(index).id), participant: participants.values.elementAt(index), ); }, itemCount: participants.length, ), ), ), MeetingControls( onToggleMicButtonPressed: () { micEnabled ? _room.muteMic() : _room.unmuteMic(); micEnabled = !micEnabled; }, onLeaveButtonPressed: () => _room.leave(), onConnectAgentButtonPressed: _isAgentConnected ? 
null : _connectAgent, ), ], ), ), ), ); } } ``` ```dart title="lib/meeting_controls.dart" import 'package:flutter/material.dart'; class MeetingControls extends StatelessWidget { final void Function() onToggleMicButtonPressed; final void Function() onLeaveButtonPressed; final void Function()? onConnectAgentButtonPressed; const MeetingControls({ super.key, required this.onToggleMicButtonPressed, required this.onLeaveButtonPressed, required this.onConnectAgentButtonPressed, }); @override Widget build(BuildContext context) { return Row( mainAxisAlignment: MainAxisAlignment.spaceEvenly, children: [ ElevatedButton( onPressed: onLeaveButtonPressed, child: const Text('Leave'), ), ElevatedButton( onPressed: onToggleMicButtonPressed, child: const Text('Toggle Mic'), ), ElevatedButton( onPressed: onConnectAgentButtonPressed, child: const Text('Connect Agent'), ), ], ); } } ``` ## 2. Creating the AI Agent from Dashboard (No-Code) You can create and configure a powerful AI agent directly from the VideoSDK dashboard. ### Step 1: Create Your Agent First, follow our detailed guide to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)**. This will walk you through creating the agent's persona, configuring its pipeline (Realtime or Cascading), and testing it directly from the dashboard. ### Step 2: Get Agent and Version ID Once your agent is created, you need to get its `agentId` and `versionId` to connect it to your frontend application. 1. After creating your agent, go to the agent's page and find the JSON editor on right side. Copy the `agentId`. 2. To get the `versionId`, click on 3 dots besides Deploy button and click on "Version History" in it. Copy the version id via copy button of the version you want. ![Get agentId and versionId](https://strapi.videosdk.live/uploads/agent_version_id_0f8b59830a.png) ### Step 3: Configure IDs in Frontend Now, update your `lib/api_call.dart` file with these IDs. 
```dart title="lib/api_call.dart" const token = 'your_videosdk_auth_token_here'; const agentId = 'paste_your_agent_id_here'; const versionId = 'paste_your_version_id_here'; ``` ## 3. Run the Application ### Step 1: Run the Frontend Once you have completed all the steps mentioned above, start your Flutter application: ```bash flutter run ``` ### Step 2: Connect and Interact 1. **Join the meeting from the Flutter app:** - Click the "Join Meeting" button. - Allow microphone permissions when prompted. 2. **Connect the agent:** - Once you join, click the "Connect Agent" button. - You should see a confirmation that the agent was connected. - The AI agent will join the meeting and greet you. 3. **Start playing:** - Interact with your AI agent using your microphone. ## Troubleshooting ### Common Issues: 1. **Agent not joining:** - Check that the `roomId`, `agentId`, and `versionId` are correctly set. - Verify your VideoSDK token is valid and has the necessary permissions. 2. **Audio not working:** - Check device permissions for microphone access. 3. **"Failed to connect agent" error:** - Verify your `agentId` and `versionId` are correct. - Check the debug console for any network errors. 4. **Flutter build issues:** - Ensure your Flutter version is compatible. - Try cleaning the build: `flutter clean`. - Delete `pubspec.lock` and run `flutter pub get`. --- # Agent Runtime with iOS VideoSDK empowers you to integrate an AI voice agent into your iOS app within minutes. This guide shows you how to connect an iOS (SwiftUI) frontend with an AI agent created and configured entirely from the VideoSDK dashboard. ## Prerequisites - macOS with Xcode 15.0+ - iOS 13.0+ deployment target - Valid VideoSDK [Account](https://app.videosdk.live/) - Familiarity with creating a no-code voice agent. If you're new to this, please follow our guide on how to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)** first. 
:::important You need a VideoSDK account to generate a token and an agent from the dashboard. :::
### Step 1: Clone the sample project
Clone the repository to your local environment. ```bash git clone https://github.com/videosdk-live/agents-quickstart.git cd mobile-quickstarts/ios/ ```
### Step 2: Environment Configuration
### Create a Meeting Room Create a meeting room using the VideoSDK API: ```bash curl -X POST https://api.videosdk.live/v2/rooms \ -H "Authorization: YOUR_VIDEOSDK_AUTH_TOKEN" \ -H "Content-Type: application/json" ``` Use the returned `roomId` in your configuration files. ### Configuration Files Update the following files with your credentials. The Agent and Version IDs will be retrieved in a later step. **MeetingViewController.swift** (line 14): ```swift var token = "YOUR_VIDEOSDK_AUTH_TOKEN" // Add Your token here var agentId = "YOUR_AGENT_ID" var versionId = "YOUR_VERSION_ID" ``` **JoinScreenView.swift** (line 13): ```swift let meetingId: String = "YOUR_MEETING_ID" ```
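If you prefer to script room creation instead of running `curl` by hand, the request/response handling is small enough to sketch. A minimal Python sketch of the two pure pieces — building the headers and pulling `roomId` out of the JSON response body (the helper names are our own, and the sample response is made up):

```python
import json

# Endpoint from the curl example above
API_URL = "https://api.videosdk.live/v2/rooms"

def build_headers(token: str) -> dict:
    # VideoSDK expects the raw JWT in the Authorization header
    return {"Authorization": token, "Content-Type": "application/json"}

def extract_room_id(response_body: str) -> str:
    # The create-room response is JSON with a "roomId" field
    return json.loads(response_body)["roomId"]

# Example with a made-up response body:
sample = '{"roomId": "abcd-efgh-ijkl"}'
print(extract_room_id(sample))  # abcd-efgh-ijkl
```

The actual POST can then be made with any HTTP client; only the headers and the `roomId` extraction are VideoSDK-specific.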
### Step 3: iOS Frontend Modifications
### Step 1: Add Connect Agent Button In `MeetingView.swift`, add a button to connect the agent. ```swift title="MeetingView.swift" // Add this button to your view hierarchy Button(action: { meetingVC.connectAgent() }) { Text("Connect Agent") } .disabled(meetingVC.isAgentConnected) ``` ### Step 2: Implement Connect Logic In `MeetingViewController.swift`, add the logic to call the dispatch API. ```swift title="MeetingViewController.swift" // Add state to track if the agent is connected @Published var isAgentConnected = false // ... func connectAgent() { guard let url = URL(string: "https://api.videosdk.live/v2/agent/general/dispatch") else { return } var request = URLRequest(url: url) request.httpMethod = "POST" request.setValue("application/json", forHTTPHeaderField: "Content-Type") request.setValue(token, forHTTPHeaderField: "Authorization") let body: [String: Any] = [ "agentId": agentId, "meetingId": room?.id ?? "", "versionId": versionId ] request.httpBody = try? JSONSerialization.data(withJSONObject: body) URLSession.shared.dataTask(with: request) { data, response, error in if let error = error { print("Connect error: \(error.localizedDescription)") return } if let httpResponse = response as? HTTPURLResponse, httpResponse.statusCode == 200 { DispatchQueue.main.async { self.isAgentConnected = true print("Agent connected successfully") } } else { print("Failed to connect agent") } }.resume() } ```
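Whatever the client language, the dispatch call is the same POST with three required fields in the JSON body. A hedged Python sketch of building and sanity-checking that body before sending it (the validation helper is our own, not part of any SDK):

```python
import json

# Dispatch endpoint used in the Swift code above
DISPATCH_URL = "https://api.videosdk.live/v2/agent/general/dispatch"
REQUIRED_FIELDS = ("agentId", "meetingId", "versionId")

def build_dispatch_body(agent_id: str, meeting_id: str, version_id: str) -> str:
    body = {"agentId": agent_id, "meetingId": meeting_id, "versionId": version_id}
    missing = [field for field in REQUIRED_FIELDS if not body[field]]
    if missing:
        # Catch unset placeholders before making the network call
        raise ValueError(f"missing dispatch fields: {missing}")
    return json.dumps(body)
```

Failing fast on an empty `meetingId` (e.g. when the room was never created) gives a clearer error than the server's generic "Failed to connect agent".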
### Step 4: Creating the AI Agent from Dashboard (No-Code)
### Step 1: Create Your Agent

First, follow our detailed guide to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)**. It walks you through creating the agent's persona, configuring its pipeline (Realtime or Cascading), and testing it directly from the dashboard.

### Step 2: Get Agent and Version ID

Once your agent is created, you need its `agentId` and `versionId` to connect it to your frontend application.

1. On the agent's page, find the JSON editor on the right side and copy the `agentId`.
2. To get the `versionId`, click the three dots beside the Deploy button and select "Version History". Use the copy button next to the version you want.

![Get agentId and versionId](https://strapi.videosdk.live/uploads/agent_version_id_0f8b59830a.png)

### Step 3: Configure IDs in Frontend

Now, update your `MeetingViewController.swift` file with these IDs.

```swift title="MeetingViewController.swift"
var agentId = "paste_your_agent_id_here"
var versionId = "paste_your_version_id_here"
```
### Step 5: Run the iOS Frontend
1. **Open Xcode:** ```bash open videosdk-agents-quickstart-ios.xcodeproj ``` 2. **Configure your development team:** - Select the project in Xcode - Go to "Signing & Capabilities" - Select your development team 3. **Build and run:** - Select your target device or simulator - Press `Cmd + R` to build and run
### Step 6: Connect and Interact
1. Join the meeting from the app and allow microphone permissions. 2. When you join, click the "Connect Agent" button to call the agent into the meeting. 3. Talk to the agent in real time. ## Troubleshooting ### Common Issues 1. **Build Errors:** - Ensure Xcode 15.0+ is installed - Check iOS deployment target (13.0+) - Verify VideoSDK package dependency 2. **Authentication Issues:** - Verify `VIDEOSDK_AUTH_TOKEN` in `MeetingViewController.swift` - Check token permissions include `allow_join` 3. **Meeting Connection Issues:** - Ensure `YOUR_MEETING_ID` is correct - Verify network connectivity - Check VideoSDK account status 4. **AI Agent Issues:** - Verify `agentId` and `versionId` are set correctly - Check for errors in the Xcode console when connecting the agent. --- # Agent Runtime with React Native VideoSDK empowers you to integrate an AI voice agent into your React Native app (Android/iOS) within minutes. This guide shows you how to connect a React Native frontend with an AI agent created and configured entirely from the VideoSDK dashboard. ## Prerequisites - VideoSDK Developer Account (get token from the [dashboard](https://app.videosdk.live/api-keys)) - Node.js and a working React Native environment (Android Studio and/or Xcode) - Familiarity with creating a no-code voice agent. If you're new to this, please follow our guide on how to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)** first. :::important You need a VideoSDK token and an agent from the dashboard. Generate your VideoSDK token from the dashboard. ::: ## Project Structure First, create an empty project using `mkdir folder_name` on your preferable location for the React Native Frontend. Your final project structure should look like this: ```jsx title="Directory Structure" root ├── android/ ├── ios/ ├── App.js ├── constants.js └── index.js ``` You will work on: - `android/`: Contains the Android-specific project files. 
- `ios/`: Contains the iOS-specific project files.
- `App.js`: The main React Native component, containing the UI and meeting logic.
- `constants.js`: To store token, meetingId, and agent credentials for the frontend.
- `index.js`: The entry point of the React Native application, where VideoSDK is registered.

## Building the React Native Frontend

### Step 1: Create App and Install SDKs

Create a React Native app and install the VideoSDK RN SDK:

```bash
npx react-native init videosdkAiAgentRN
cd videosdkAiAgentRN

# Install VideoSDK
npm install "@videosdk.live/react-native-sdk"
```

### Step 2: Configure the Project

#### Android Setup

Add the audio and network permissions to the Android manifest:

```xml title="android/app/src/main/AndroidManifest.xml"
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.MODIFY_AUDIO_SETTINGS" />
<uses-permission android:name="android.permission.CAMERA" />
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />
```

```java title="android/app/build.gradle"
dependencies {
  implementation project(':rnwebrtc')
}
```

```gradle title="android/settings.gradle"
include ':rnwebrtc'
project(':rnwebrtc').projectDir = new File(rootProject.projectDir, '../node_modules/@videosdk.live/react-native-webrtc/android')
```

```kotlin title="MainApplication.kt"
import live.videosdk.rnwebrtc.WebRTCModulePackage

class MainApplication : Application(), ReactApplication {
  override val reactNativeHost: ReactNativeHost =
      object : DefaultReactNativeHost(this) {
        override fun getPackages(): MutableList<ReactPackage> =
            PackageList(this).packages.apply {
              // Register the VideoSDK WebRTC module package
              add(WebRTCModulePackage())
            }
        // ...the rest of the generated MainApplication stays unchanged
      }
}
```

#### Register the SDK in `index.js`

```js title="index.js"
import { AppRegistry } from 'react-native';
import { register } from '@videosdk.live/react-native-sdk';
import { name as appName } from './app.json';
import App from './App';

// Register VideoSDK services before the app starts
register();
AppRegistry.registerComponent(appName, () => App);
```

### Step 3: Build the App UI (`App.js`)

```js title="App.js"
import React, { useState } from "react";
import { SafeAreaView, View, Text, Button, FlatList, Alert } from "react-native";
import {
  MeetingProvider,
  useMeeting,
} from "@videosdk.live/react-native-sdk";
import { token, meetingId, name, agentId, versionId } from "./constants";

function ControlsContainer({ join, leave, toggleMic }) {
  const [connected, setConnected] = useState(false);

  const connectAgent = async () => {
    try {
      const response = await fetch("https://api.videosdk.live/v2/agent/general/dispatch", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: token,
        },
        body: JSON.stringify({
          agentId: agentId,
          meetingId: meetingId,
          versionId: versionId,
        }),
      });
      if (response.ok) {
        Alert.alert("Agent connected successfully!");
        setConnected(true);
      } else {
        Alert.alert("Failed to connect agent.");
      }
    } catch (error) {
      console.error("Error connecting agent:", error);
      Alert.alert("An error occurred while connecting the agent.");
    }
  };

  return (
    <View>
      <Button title="Join" onPress={join} />
      <Button title="Leave" onPress={leave} />
      <Button title="Toggle Mic" onPress={toggleMic} />
      {!connected && (
        <Button title="Connect Agent" onPress={connectAgent} />
      )}
    </View>
  );
}

function ParticipantView({ participantDisplayName }) {
  return (
    <View>
      <Text>Participant: {participantDisplayName}</Text>
    </View>
  );
}

function ParticipantList({ participants }) {
  return participants.length > 0 ? (
    <FlatList
      data={participants}
      keyExtractor={(item, index) => index.toString()}
      renderItem={({ item }) => {
        return <ParticipantView participantDisplayName={item.displayName} />;
      }}
    />
  ) : (
    <Text>No participants have joined yet.</Text>
  );
}

function MeetingView() {
  const { join, leave, toggleMic, participants, meetingId } = useMeeting({});
  const participantsList = [...participants.values()].map(participant => ({
    displayName: participant.displayName,
  }));

  return (
    <SafeAreaView>
      {meetingId ? <Text>Meeting Id: {meetingId}</Text> : null}
      <ParticipantList participants={participantsList} />
      <ControlsContainer join={join} leave={leave} toggleMic={toggleMic} />
    </SafeAreaView>
  );
}

export default function App() {
  if (!meetingId || !token) {
    return <Text>Please set token and meetingId in constants.js</Text>;
  }
  return (
    <MeetingProvider
      config={{ meetingId, micEnabled: true, webcamEnabled: false, name }}
      token={token}
    >
      <MeetingView />
    </MeetingProvider>
  );
}
```

## Creating the AI Agent from Dashboard (No-Code)

You can create and configure a powerful AI agent directly from the VideoSDK dashboard.

### Step 1: Create Your Agent

First, follow our detailed guide to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)**. It walks you through creating the agent's persona, configuring its pipeline (Realtime or Cascading), and testing it directly from the dashboard.

### Step 2: Get Agent and Version ID

Once your agent is created, you need its `agentId` and `versionId` to connect it to your frontend application.

1. On the agent's page, find the JSON editor on the right side and copy the `agentId`.
2. To get the `versionId`, click the three dots beside the Deploy button and select "Version History". Use the copy button next to the version you want.

![Get agentId and versionId](https://strapi.videosdk.live/uploads/agent_version_id_0f8b59830a.png)

### Step 3: Configure IDs in Frontend

Now, update your `constants.js` file with these IDs.

```js title="constants.js"
export const token = "your_videosdk_auth_token_here";
export const meetingId = "YOUR_MEETING_ID";
export const name = "User Name";
export const agentId = "paste_your_agent_id_here";
export const versionId = "paste_your_version_id_here";
```

## Run the Application

### 1) Start the React Native app

```bash
npm install

# Android
npm run android

# iOS (macOS only)
cd ios && pod install && cd ..
npm run ios ``` ### 2) Connect and interact 1. Join the meeting from the app and allow microphone permissions. 2. When you join, click the "Connect Agent" button to call the agent into the meeting. 3. Talk to the agent in real time. ## Troubleshooting - Ensure the same `meetingId` is used and the `agentId` and `versionId` are correct in `constants.js`. - Verify microphone permissions on the device/simulator. - Confirm your VideoSDK token is valid. - If audio is silent, check device output volume. --- # Agent Runtime with JavaScript VideoSDK empowers you to seamlessly integrate AI agents with real-time voice interaction into your JavaScript application within minutes. This guide shows you how to connect a JavaScript frontend with an AI agent created and configured entirely from the VideoSDK dashboard. ## Prerequisites Before proceeding, ensure that your development environment meets the following requirements: - Video SDK Developer Account (Not having one, follow **[Video SDK Dashboard](https://app.videosdk.live/)**) - Node.js installed on your device - Familiarity with creating a no-code voice agent. If you're new to this, please follow our guide on how to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)** first. :::important You need a VideoSDK account to generate a token and an agent from the dashboard. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token. ::: ## Project Structure First, create an empty project using `mkdir folder_name` on your preferable location for the JavaScript Frontend. Your final project structure should look like this: ```jsx title="Project Structure" root ├── index.html ├── config.js └── index.js ``` You will be working on the following files: - `index.html`: Responsible for creating a basic UI for joining the meeting. - `config.js`: Responsible for storing the token, room ID, and agent credentials. 
- `index.js`: Responsible for rendering the meeting view and audio functionality.

## Building the JavaScript Frontend

### Step 1: Install VideoSDK

Import VideoSDK with a `<script>` tag from the VideoSDK CDN, or install it with npm or yarn:

**npm:**

```bash
npm install @videosdk.live/js-sdk
```

---

**yarn:**

```bash
yarn add @videosdk.live/js-sdk
```

### Step 2: Design the User Interface

Create an `index.html` file containing a `join-screen` and a `grid-screen` for audio-only interaction. The "Connect Agent" button is used to call the AI agent into the meeting. The element `id`s below are the ones `index.js` looks up.

```html title="index.html"
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>VideoSDK AI Agent</title>
  </head>
  <body>
    <!-- Join screen -->
    <div id="join-screen">
      <button id="createMeetingBtn">Join Agent Meeting</button>
    </div>

    <!-- Status text shown while joining -->
    <div id="textDiv"></div>

    <!-- Meeting screen -->
    <div id="grid-screen" style="display: none">
      <h3 id="meetingIdHeading"></h3>
      <button id="leaveBtn">Leave</button>
      <button id="toggleMicBtn">Toggle Mic</button>
      <button id="connectAgentBtn">Connect Agent</button>
      <div id="audioContainer"></div>
    </div>

    <!-- Load the VideoSDK script here (CDN build of @videosdk.live/js-sdk) -->
    <script src="config.js"></script>
    <script src="index.js"></script>
  </body>
</html>
``` ### Step 3: Configure the Frontend Create a meeting room using the VideoSDK API: ```bash curl -X POST https://api.videosdk.live/v2/rooms \ -H "Authorization: YOUR_JWT_TOKEN_HERE" \ -H "Content-Type: application/json" ``` Copy the `roomId` from the response and configure it in `config.js`. You will get the Agent and Version IDs in the next section. ```js title="config.js" TOKEN = "your_videosdk_auth_token_here"; ROOM_ID = "YOUR_MEETING_ID"; AGENT_ID = "YOUR_AGENT_ID"; VERSION_ID = "YOUR_VERSION_ID"; ``` ### Step 4: Implement Meeting Logic In `index.js`, retrieve DOM elements, declare variables, and add the core meeting functionalities, including the logic to connect the agent. ```js title="index.js" // getting Elements from Dom const leaveButton = document.getElementById("leaveBtn"); const toggleMicButton = document.getElementById("toggleMicBtn"); const createButton = document.getElementById("createMeetingBtn"); const connectAgentButton = document.getElementById("connectAgentBtn"); const audioContainer = document.getElementById("audioContainer"); const textDiv = document.getElementById("textDiv"); // declare Variables let meeting = null; let meetingId = ""; let isMicOn = false; // Join Agent Meeting Button Event Listener createButton.addEventListener("click", async () => { document.getElementById("join-screen").style.display = "none"; textDiv.textContent = "Please wait, we are joining the meeting"; meetingId = ROOM_ID; initializeMeeting(); }); // Initialize meeting function initializeMeeting() { window.VideoSDK.config(TOKEN); meeting = window.VideoSDK.initMeeting({ meetingId: meetingId, name: "C.V.Raman", micEnabled: true, webcamEnabled: false, }); meeting.join(); meeting.localParticipant.on("stream-enabled", (stream) => { if (stream.kind === "audio") { setAudioTrack(stream, meeting.localParticipant, true); } }); meeting.on("meeting-joined", () => { textDiv.textContent = null; document.getElementById("grid-screen").style.display = "block"; 
document.getElementById("meetingIdHeading").textContent = `Meeting Id: ${meetingId}`; }); meeting.on("meeting-left", () => { audioContainer.innerHTML = ""; }); meeting.on("participant-joined", (participant) => { let audioElement = createAudioElement(participant.id); participant.on("stream-enabled", (stream) => { if (stream.kind === "audio") { setAudioTrack(stream, participant, false); audioContainer.appendChild(audioElement); } }); }); meeting.on("participant-left", (participant) => { let aElement = document.getElementById(`a-${participant.id}`); if (aElement) aElement.remove(); }); } // Create audio elements for participants function createAudioElement(pId) { let audioElement = document.createElement("audio"); audioElement.setAttribute("autoPlay", "false"); audioElement.setAttribute("playsInline", "true"); audioElement.setAttribute("controls", "false"); audioElement.setAttribute("id", `a-${pId}`); audioElement.style.display = "none"; return audioElement; } // Set audio track function setAudioTrack(stream, participant, isLocal) { if (stream.kind === "audio") { if (isLocal) { isMicOn = true; } else { const audioElement = document.getElementById(`a-${participant.id}`); if (audioElement) { const mediaStream = new MediaStream(); mediaStream.addTrack(stream.track); audioElement.srcObject = mediaStream; audioElement.play().catch((err) => console.error("audioElem.play() failed", err)); } } } } // Implement controls leaveButton.addEventListener("click", async () => { meeting?.leave(); document.getElementById("grid-screen").style.display = "none"; document.getElementById("join-screen").style.display = "block"; }); toggleMicButton.addEventListener("click", async () => { if (isMicOn) meeting?.muteMic(); else meeting?.unmuteMic(); isMicOn = !isMicOn; }); connectAgentButton.addEventListener("click", async () => { try { const response = await fetch("https://api.videosdk.live/v2/agent/general/dispatch", { method: "POST", headers: { "Content-Type": "application/json", 
Authorization: TOKEN, }, body: JSON.stringify({ agentId: AGENT_ID, meetingId: ROOM_ID, versionId: VERSION_ID }), }); if (response.ok) { alert("Agent connected successfully!"); connectAgentButton.style.display = "none"; } else { alert("Failed to connect agent."); } } catch (error) { console.error("Error connecting agent:", error); alert("An error occurred while connecting the agent."); } }); ``` ## Creating the AI Agent from Dashboard (No-Code) You can create and configure a powerful AI agent directly from the VideoSDK dashboard. ### Step 1: Create Your Agent First, follow our detailed guide to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)**. This will walk you through creating the agent's persona, configuring its pipeline (Realtime or Cascading), and testing it directly from the dashboard. ### Step 2: Get Agent and Version ID Once your agent is created, you need to get its `agentId` and `versionId` to connect it to your frontend application. 1. After creating your agent, go to the agent's page and find the JSON editor on right side. Copy the `agentId`. 2. To get the `versionId`, click on 3 dots besides Deploy button and click on "Version History" in it. Copy the version id via copy button of the version you want. ![Get agentId and versionId](https://strapi.videosdk.live/uploads/agent_version_id_0f8b59830a.png) ### Step 3: Configure IDs in Frontend Now, update your `config.js` file with these IDs. ```js title="config.js" TOKEN = "your_videosdk_auth_token_here"; ROOM_ID = "YOUR_MEETING_ID"; AGENT_ID = "paste_your_agent_id_here"; VERSION_ID = "paste_your_version_id_here"; ``` ## Run the Application ### Step 1: Start the Frontend Once you have completed all the steps, serve your frontend files: ```bash # Using Python's built-in server python3 -m http.server 8000 # Or using Node.js http-server npx http-server -p 8000 ``` Open `http://localhost:8000` in your web browser. ### Step 2: Connect and Interact 1. 
**Join the meeting from the frontend:** - Click the "Join Agent Meeting" button in your browser. - Allow microphone permissions when prompted. 2. **Connect the agent:** - Once you join, click the "Connect Agent" button. - You should see an alert confirming the agent was connected. - The AI agent will join the meeting and greet you. 3. **Start playing:** - Interact with your AI agent using your microphone. ## Final Output You have completed the implementation of an AI agent with real-time voice interaction using VideoSDK and a no-code agent from the dashboard. ## Troubleshooting ### Common Issues: 1. **Agent not joining:** - Check that the `ROOM_ID`, `AGENT_ID`, and `VERSION_ID` are correctly set in `config.js`. - Verify your VideoSDK token is valid and has the necessary permissions. 2. **Audio not working:** - Check browser permissions for microphone access. 3. **"Failed to connect agent" error:** - Verify your `AGENT_ID` and `VERSION_ID` are correct. - Check the browser's developer console for any network errors. --- # Agent Runtime with React VideoSDK empowers you to seamlessly integrate AI agents with real-time voice interaction into your React application within minutes. This guide shows you how to connect a React frontend with an AI agent created and configured entirely from the VideoSDK dashboard. ## Prerequisites Before proceeding, ensure that your development environment meets the following requirements: - Video SDK Developer Account (Not having one, follow **[Video SDK Dashboard](https://app.videosdk.live/)**) - Node.js installed on your device - Familiarity with creating a no-code voice agent. If you're new to this, please follow our guide on how to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)** first. :::important You need a VideoSDK account to generate a token and an agent from the dashboard. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token. 
::: ## Project Structure Your project structure should look like this. ```jsx title="Project Structure" root ├── node_modules ├── public ├── src │ ├── config.js │ ├── App.js │ └── index.js └── .env ``` You will be working on the following files: - `App.js`: Responsible for creating a basic UI for joining the meeting - `config.js`: Responsible for storing the token, room ID, and agent credentials - `index.js`: This is the entry point of your React application. ## Part 1: React Frontend ### Step 1: Getting Started with the Code! #### Create new React App Create a new React App using the below command. ```bash $ npx create-react-app videosdk-ai-agent-react-app ``` #### Install VideoSDK Install the VideoSDK using the below-mentioned npm command. Make sure you are in your react app directory before you run this command. ```bash $ npm install "@videosdk.live/react-sdk" ``` ### Step 2: Configure Environment and Credentials Create a meeting room using the VideoSDK API: ```bash curl -X POST https://api.videosdk.live/v2/rooms \ -H "Authorization: YOUR_JWT_TOKEN_HERE" \ -H "Content-Type: application/json" ``` Copy the `roomId` from the response and configure it in `src/config.js`. You will get the Agent and Version IDs in the next section. ```js title="src/config.js" export const TOKEN = "YOUR_VIDEOSDK_AUTH_TOKEN"; export const ROOM_ID = "YOUR_MEETING_ID"; export const AGENT_ID = "YOUR_AGENT_ID"; export const VERSION_ID = "YOUR_VERSION_ID"; ``` ### Step 3: Design the user interface (UI) Create the main App component with audio-only interaction in `src/App.js`. This includes the "Connect Agent" button. 
```js title="src/App.js"
import React, { useEffect, useRef, useState } from "react";
import {
  MeetingProvider,
  useMeeting,
  useParticipant,
} from "@videosdk.live/react-sdk";
import { TOKEN, ROOM_ID, AGENT_ID, VERSION_ID } from "./config";

function ParticipantAudio({ participantId }) {
  const { micStream, micOn, isLocal, displayName } = useParticipant(participantId);
  const audioRef = useRef(null);

  useEffect(() => {
    if (!audioRef.current) return;
    if (micOn && micStream) {
      const mediaStream = new MediaStream();
      mediaStream.addTrack(micStream.track);
      audioRef.current.srcObject = mediaStream;
      audioRef.current.play().catch(() => {});
    } else {
      audioRef.current.srcObject = null;
    }
  }, [micStream, micOn]);

  return (
    <div>
      {/* Mute our own audio to avoid an echo */}
      <audio ref={audioRef} autoPlay playsInline muted={isLocal} />
      <p>
        Participant: {displayName} | Mic: {micOn ? "ON" : "OFF"}
      </p>
    </div>
  );
}

function Controls() {
  const { leave, toggleMic } = useMeeting();
  const [connected, setConnected] = useState(false);

  const connectAgent = async () => {
    try {
      const response = await fetch("https://api.videosdk.live/v2/agent/general/dispatch", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: TOKEN,
        },
        body: JSON.stringify({
          agentId: AGENT_ID,
          meetingId: ROOM_ID,
          versionId: VERSION_ID,
        }),
      });
      if (response.ok) {
        alert("Agent connected successfully!");
        setConnected(true);
      } else {
        alert("Failed to connect agent.");
      }
    } catch (error) {
      console.error("Error connecting agent:", error);
      alert("An error occurred while connecting the agent.");
    }
  };

  return (
    <div>
      <button onClick={() => leave()}>Leave</button>
      <button onClick={() => toggleMic()}>Toggle Mic</button>
      {!connected && <button onClick={connectAgent}>Connect Agent</button>}
    </div>
  );
}

function MeetingView({ meetingId, onMeetingLeave }) {
  const [joined, setJoined] = useState(null);
  const { join, participants } = useMeeting({
    onMeetingJoined: () => setJoined("JOINED"),
    onMeetingLeft: onMeetingLeave,
  });

  const joinMeeting = () => {
    setJoined("JOINING");
    join();
  };

  return (
    <div>
      <h3>Meeting Id: {meetingId}</h3>
      {joined === "JOINED" ? (
        <div>
          <Controls />
          {[...participants.keys()].map((pid) => (
            <ParticipantAudio key={pid} participantId={pid} />
          ))}
        </div>
      ) : joined === "JOINING" ? (
        <p>Joining the meeting...</p>
      ) : (
        <button onClick={joinMeeting}>Join</button>
      )}
    </div>
  );
}

export default function App() {
  const [meetingId] = useState(ROOM_ID);

  const onMeetingLeave = () => {
    // no-op; simple sample
  };

  return (
    <MeetingProvider
      config={{
        meetingId,
        micEnabled: true,
        webcamEnabled: false,
        name: "User",
      }}
      token={TOKEN}
    >
      <MeetingView meetingId={meetingId} onMeetingLeave={onMeetingLeave} />
    </MeetingProvider>
  );
}
```

## Part 2: Creating the AI Agent from Dashboard (No-Code)

You can create and configure a powerful AI agent directly from the VideoSDK dashboard.

### Step 1: Create Your Agent

First, follow our detailed guide to **[Build a Custom Voice AI Agent in Minutes](/ai_agents/agent-runtime/build-agent)**. It walks you through creating the agent's persona, configuring its pipeline (Realtime or Cascading), and testing it directly from the dashboard.

### Step 2: Get Agent and Version ID

Once your agent is created, you need its `agentId` and `versionId` to connect it to your frontend application.

1. On the agent's page, find the JSON editor on the right side and copy the `agentId`.
2. To get the `versionId`, click the three dots beside the Deploy button and select "Version History". Use the copy button next to the version you want.

![Get agentId and versionId](https://strapi.videosdk.live/uploads/agent_version_id_0f8b59830a.png)

### Step 3: Configure IDs in Frontend

Now, update your `src/config.js` file with these IDs.

```js title="src/config.js"
export const TOKEN = "your_videosdk_auth_token_here";
export const ROOM_ID = "YOUR_MEETING_ID";
export const AGENT_ID = "paste_your_agent_id_here";
export const VERSION_ID = "paste_your_version_id_here";
```

## Part 3: Run the Application

### Step 1: Run the Frontend

Once you have completed all the steps mentioned above, start your React application:

```bash
# Install dependencies
npm install

# Start the development server
npm start
```

Open `http://localhost:3000` in your web browser.

### Step 2: Connect and Interact

1. **Join the meeting from the React app:**
   - Click the "Join" button in your browser.
   - Allow microphone permissions when prompted.
2. **Connect the agent:**
   - Once you join, click the "Connect Agent" button.
- You should see an alert confirming the agent was connected. - The AI agent will join the meeting and greet you. 3. **Start playing:** - Interact with your AI agent using your microphone. ## Final Output You have completed the implementation of an AI agent with real-time voice interaction using VideoSDK and a no-code agent from the dashboard in React. ## Troubleshooting ### Common Issues: 1. **Agent not joining:** - Check that the `ROOM_ID`, `AGENT_ID`, and `VERSION_ID` are correctly set in `src/config.js`. - Verify your VideoSDK token is valid and has the necessary permissions. 2. **Audio not working:** - Check browser permissions for microphone access. 3. **"Failed to connect agent" error:** - Verify your `AGENT_ID` and `VERSION_ID` are correct. - Check the browser's developer console for any network errors. 4. **React build issues:** - Ensure Node.js version is compatible - Try clearing npm cache: `npm cache clean --force` - Delete `node_modules` and reinstall: `rm -rf node_modules && npm install` --- VideoSDK empowers you to seamlessly integrate AI agents with real-time voice interaction into your Flutter application within minutes. In this quickstart, you'll explore how to create an AI agent that joins a flutter meeting room and interacts with users through voice using Google Gemini Live API. ## Prerequisites Before proceeding, ensure that your development environment meets the following requirements: - Video SDK Developer Account (Not having one, follow **[Video SDK Dashboard](https://app.videosdk.live/)**) - Flutter and Python 3.12+ installed on your device - Google API Key with Gemini Live API access :::important You need a VideoSDK account to generate a token and a Google API key for the Gemini Live API. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token and the **[Google AI Studio](https://aistudio.google.com/api-keys)** for Google API key. 
:::

## Project Structure

Your project structure should look like this:

```jsx title="Project Structure"
root
├── android
├── ios
├── lib
│   ├── api_call.dart
│   ├── join_screen.dart
│   ├── main.dart
│   ├── meeting_controls.dart
│   ├── meeting_screen.dart
│   └── participant_tile.dart
├── macos
├── web
├── windows
├── agent-flutter.py
└── .env
```

You will be working on the following files:

- `join_screen.dart`: Responsible for the user interface to join a meeting.
- `meeting_screen.dart`: Displays the meeting interface and handles meeting logic.
- `api_call.dart`: Handles API calls for creating meetings.
- `agent-flutter.py`: The Python AI agent backend using the Google Gemini Live API.
- `.env`: For storing API keys.

## 1. Flutter Frontend

### Step 1: Getting Started

Follow these steps to create the environment necessary to add AI agent functionality to your app.

#### Create a New Flutter App

Create a new Flutter app using the following command:

```bash
$ flutter create videosdk_ai_agent_flutter_app
```

#### Install VideoSDK

Install the VideoSDK using the following Flutter commands. Make sure you are in your Flutter app directory before you run them.

```bash
$ flutter pub add videosdk
$ flutter pub add http
```

### Step 2: Configure Project

#### For Android

- Update `/android/app/src/main/AndroidManifest.xml` with the permissions needed for the audio and video features:

```xml title="android/app/src/main/AndroidManifest.xml"
<!-- Typical permissions required by VideoSDK for audio/video -->
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.MODIFY_AUDIO_SETTINGS" />
<uses-permission android:name="android.permission.CAMERA" />
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />
```

- If necessary, increase `minSdkVersion` in the `defaultConfig` section of `build.gradle` to `23` (the Flutter project generator currently sets it to `16`).

#### For iOS

- Add the following entries, which allow your app to access the camera and microphone, to your `/ios/Runner/Info.plist` file:

```xml title="/ios/Runner/Info.plist"
<key>NSCameraUsageDescription</key>
<string>$(PRODUCT_NAME) Camera Usage!</string>
<key>NSMicrophoneUsageDescription</key>
<string>$(PRODUCT_NAME) Microphone Usage!</string>
```

- Uncomment the following line to define a global platform for your project in `/ios/Podfile`:

```ruby title="/ios/Podfile"
platform :ios, '12.0'
```

#### For macOS

- Add the following entries, which allow your app to access the camera and microphone, to your `/macos/Runner/Info.plist` file:

```xml title="/macos/Runner/Info.plist"
<key>NSCameraUsageDescription</key>
<string>$(PRODUCT_NAME) Camera Usage!</string>
<key>NSMicrophoneUsageDescription</key>
<string>$(PRODUCT_NAME) Microphone Usage!</string>
```

- Add the following entries, which allow your app to access the camera and microphone and open outgoing network connections, to your `/macos/Runner/DebugProfile.entitlements` file:

```xml title="/macos/Runner/DebugProfile.entitlements"
<key>com.apple.security.network.client</key>
<true/>
<key>com.apple.security.device.camera</key>
<true/>
<key>com.apple.security.device.microphone</key>
<true/>
```

- Add the following entries, which additionally allow incoming network connections, to your `/macos/Runner/Release.entitlements` file:

```xml title="/macos/Runner/Release.entitlements"
<key>com.apple.security.network.server</key>
<true/>
<key>com.apple.security.network.client</key>
<true/>
<key>com.apple.security.device.camera</key>
<true/>
<key>com.apple.security.device.microphone</key>
<true/>
```

### Step 3: Configure Environment and Credentials

Create a meeting room using the VideoSDK API:

```bash
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_JWT_TOKEN_HERE" \
  -H "Content-Type: application/json"
```

Copy the `roomId` from the response and configure it in `lib/join_screen.dart` and `lib/api_call.dart`.
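If you prefer to script room creation instead of running `curl` by hand, the request is easy to mirror in Python. This is a hypothetical sketch (the helper names `build_room_request` and `extract_room_id` are ours); only the endpoint, headers, and the `roomId` response field come from the `curl` example above.

```python
import json

# Endpoint and headers mirror the curl example above.
API_URL = "https://api.videosdk.live/v2/rooms"

def build_room_request(token: str) -> dict:
    # The token goes in the Authorization header as-is (no "Bearer " prefix).
    return {
        "url": API_URL,
        "headers": {"Authorization": token, "Content-Type": "application/json"},
    }

def extract_room_id(response_body: str) -> str:
    # The rooms API responds with JSON containing a top-level "roomId" field.
    return json.loads(response_body)["roomId"]

req = build_room_request("YOUR_JWT_TOKEN_HERE")
print(req["url"])
print(extract_room_id('{"roomId": "abcd-efgh-ijkl"}'))  # abcd-efgh-ijkl
```

Pair `build_room_request` with any HTTP client (for example, `requests.post(req["url"], headers=req["headers"])`) and paste the printed `roomId` into the files noted above.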
```js title="lib/api_call.dart"
import 'dart:convert';
import 'package:http/http.dart' as http;

// Auth token we will use to generate a meeting and connect to it
const token = 'YOUR_VIDEOSDK_AUTH_TOKEN';

// API call to create a meeting
Future<String> createMeeting() async {
  final http.Response httpResponse = await http.post(
    Uri.parse('https://api.videosdk.live/v2/rooms'),
    headers: {'Authorization': token},
  );

  // Destructuring the roomId from the response
  return json.decode(httpResponse.body)['roomId'];
}
```

```js title="lib/join_screen.dart"
import 'package:flutter/material.dart';
import 'api_call.dart';
import 'meeting_screen.dart';

class JoinScreen extends StatelessWidget {
  final _meetingIdController = TextEditingController();

  JoinScreen({super.key});

  void onCreateButtonPressed(BuildContext context) async {
    // Call the API to create a meeting and navigate to MeetingScreen with meetingId and token
    await createMeeting().then((meetingId) {
      if (!context.mounted) return;
      Navigator.of(context).push(
        MaterialPageRoute(
          builder: (context) => MeetingScreen(meetingId: meetingId, token: token),
        ),
      );
    });
  }

  void onJoinButtonPressed(BuildContext context) {
    // Check that the meeting id is not null or invalid;
    // if it is valid, navigate to MeetingScreen with meetingId and token
    Navigator.of(context).push(
      MaterialPageRoute(
        builder: (context) =>
            MeetingScreen(meetingId: "YOUR_MEETING_ID", token: token),
      ),
    );
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('VideoSDK QuickStart')),
      body: Padding(
        padding: const EdgeInsets.all(12.0),
        child: Center(
          child: ElevatedButton(
            onPressed: () => onJoinButtonPressed(context),
            child: const Text('Join Meeting'),
          ),
        ),
      ),
    );
  }
}
```

### Step 4: Design the User Interface (UI)

Create the main `MeetingScreen` component with audio-only interaction in `lib/meeting_screen.dart`:

```js title="lib/meeting_screen.dart"
import 'package:flutter/material.dart';
import 'package:videosdk/videosdk.dart';
import
'participant_tile.dart';
import 'meeting_controls.dart';

class MeetingScreen extends StatefulWidget {
  final String meetingId;
  final String token;

  const MeetingScreen({
    super.key,
    required this.meetingId,
    required this.token,
  });

  @override
  State<MeetingScreen> createState() => _MeetingScreenState();
}

class _MeetingScreenState extends State<MeetingScreen> {
  late Room _room;
  var micEnabled = true;
  var camEnabled = true;
  Map<String, Participant> participants = {};

  @override
  void initState() {
    // create room
    _room = VideoSDK.createRoom(
      roomId: widget.meetingId,
      token: widget.token,
      displayName: "John Doe",
      micEnabled: micEnabled,
      camEnabled: false,
      defaultCameraIndex: 1, // index of MediaDevices used to set the default camera
    );

    setMeetingEventListener();

    // Join room
    _room.join();

    super.initState();
  }

  // listening to meeting events
  void setMeetingEventListener() {
    _room.on(Events.roomJoined, () {
      setState(() {
        participants.putIfAbsent(
          _room.localParticipant.id,
          () => _room.localParticipant,
        );
      });
    });

    _room.on(Events.participantJoined, (Participant participant) {
      setState(
        () => participants.putIfAbsent(participant.id, () => participant),
      );
    });

    _room.on(Events.participantLeft, (String participantId) {
      if (participants.containsKey(participantId)) {
        setState(() => participants.remove(participantId));
      }
    });

    _room.on(Events.roomLeft, () {
      participants.clear();
      Navigator.popUntil(context, ModalRoute.withName('/'));
    });
  }

  // on back button pressed, leave the room
  Future<bool> _onWillPop() async {
    _room.leave();
    return true;
  }

  @override
  Widget build(BuildContext context) {
    return WillPopScope(
      onWillPop: () => _onWillPop(),
      child: Scaffold(
        appBar: AppBar(title: const Text('VideoSDK QuickStart')),
        body: Padding(
          padding: const EdgeInsets.all(8.0),
          child: Column(
            children: [
              Text(widget.meetingId),
              // render all participants
              Expanded(
                child: Padding(
                  padding: const EdgeInsets.all(8.0),
                  child: GridView.builder(
                    gridDelegate: const SliverGridDelegateWithFixedCrossAxisCount(
                      crossAxisCount: 2,
                      crossAxisSpacing: 10,
                      mainAxisSpacing:
10,
                      mainAxisExtent: 300,
                    ),
                    itemBuilder: (context, index) {
                      return ParticipantTile(
                        key: Key(participants.values.elementAt(index).id),
                        participant: participants.values.elementAt(index),
                      );
                    },
                    itemCount: participants.length,
                  ),
                ),
              ),
              MeetingControls(
                onToggleMicButtonPressed: () {
                  micEnabled ? _room.muteMic() : _room.unmuteMic();
                  micEnabled = !micEnabled;
                },
                onLeaveButtonPressed: () => _room.leave(),
              ),
            ],
          ),
        ),
      ),
    );
  }
}
```

```js title="lib/participant_tile.dart"
import 'package:flutter/material.dart';
import 'package:videosdk/videosdk.dart';

class ParticipantTile extends StatefulWidget {
  final Participant participant;

  const ParticipantTile({super.key, required this.participant});

  @override
  State<ParticipantTile> createState() => _ParticipantTileState();
}

class _ParticipantTileState extends State<ParticipantTile> {
  late String participantName;

  @override
  void initState() {
    participantName = widget.participant.displayName;
    super.initState();
  }

  @override
  Widget build(BuildContext context) {
    return Padding(
      padding: const EdgeInsets.all(8.0),
      child: Container(
        color: Colors.grey.shade800,
        child: Center(
          child: Text(
            participantName,
            style: const TextStyle(color: Colors.white),
          ),
        ),
      ),
    );
  }
}
```

```js title="lib/meeting_controls.dart"
import 'package:flutter/material.dart';

class MeetingControls extends StatelessWidget {
  final void Function() onToggleMicButtonPressed;
  final void Function() onLeaveButtonPressed;

  const MeetingControls({
    super.key,
    required this.onToggleMicButtonPressed,
    required this.onLeaveButtonPressed,
  });

  @override
  Widget build(BuildContext context) {
    return Row(
      mainAxisAlignment: MainAxisAlignment.spaceEvenly,
      children: [
        ElevatedButton(
          onPressed: onLeaveButtonPressed,
          child: const Text('Leave'),
        ),
        ElevatedButton(
          onPressed: onToggleMicButtonPressed,
          child: const Text('Toggle Mic'),
        ),
      ],
    );
  }
}
```

## 2. 
Python AI Agent ### Step 1: Create Python AI Agent Create a `.env` file to store your API keys securely for the Python agent: ```env title=".env" # Google API Key for Gemini Live API GOOGLE_API_KEY=your_google_api_key_here # VideoSDK Authentication Token VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token_here ``` Create the Python AI agent that will join the same meeting room and interact with users through voice. ```python title="agent-flutter.py" from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig import logging logging.getLogger().setLevel(logging.INFO) class MyVoiceAgent(Agent): def __init__(self): super().__init__( instructions="You are a high-energy game-show host guiding the caller to guess a secret number from 1 to 100 to win 1,000,000$.", ) async def on_enter(self) -> None: await self.session.say("Welcome to the Videosdk's AI Agent game show! I'm your host, and we're about to play for 1,000,000$. Are you ready to play?") async def on_exit(self) -> None: await self.session.say("Goodbye!") async def start_session(context: JobContext): agent = MyVoiceAgent() model = GeminiRealtime( model="gemini-2.5-flash-native-audio-preview-12-2025", # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter # api_key="AIXXXXXXXXXXXXXXXXXXXX", config=GeminiLiveConfig( voice="Leda", # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr. 
response_modalities=["AUDIO"] ) ) pipeline = RealTimePipeline(model=model) session = AgentSession( agent=agent, pipeline=pipeline ) def on_transcription(data: dict): role = data.get("role") text = data.get("text") print(f"[TRANSCRIPT][{role}]: {text}") pipeline.on("realtime_model_transcription", on_transcription) await context.run_until_shutdown(session=session, wait_for_participant=True) def make_context() -> JobContext: room_options = RoomOptions( # Static meeting ID - same as used in frontend room_id="YOUR_MEETING_ID", # Replace it with your actual room_id name="Gemini Agent", playground=True, ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ## 3. Run the Application ### Step 1: Run the Frontend Once you have completed all the steps mentioned above, start your Flutter application: ```bash flutter run ``` ### Step 2: Run the AI Agent Open a new terminal and run the Python agent: ```bash # Install Python dependencies pip install "videosdk-plugins-google" pip install videosdk-agents # Run the AI agent python agent-flutter.py ``` ### Step 3: Connect and Interact 1. **Join the meeting from the Flutter app:** - Click the "Join Meeting" button. - Allow microphone permissions when prompted. 2. **Agent connection:** - Once you join, the Python agent will detect your participation. - You should see "Participant joined" in the terminal. - The AI agent will greet you and start the game. 3. **Start playing:** - The agent will guide you through a number guessing game (1-100). - Use your microphone to interact with the AI host. - The agent will provide hints and encouragement throughout the game. ## Troubleshooting ### Common Issues: 1. **"Waiting for participant..." but no connection:** - Ensure both the frontend and the agent are running. - Check that the room ID matches in both `lib/join_screen.dart` and `agent-flutter.py`. - Verify your VideoSDK token is valid. 2. 
**Audio not working:** - Check browser permissions for microphone access. - Ensure your Google API key has Gemini Live API access enabled. 3. **Agent not responding:** - Verify your Google API key is correctly set in the environment. - Check that the Gemini Live API is enabled in your Google Cloud Console. 4. **Flutter build issues:** - Ensure your Flutter version is compatible. - Try cleaning the build: `flutter clean`. - Delete `pubspec.lock` and run `flutter pub get`. ## Next Steps Clone repo for quick implementation - [Quickstart Example](https://github.com/videosdk-live/agents-quickstart/tree/main/mobile-quickstarts/flutter): Complete working example with source code --- VideoSDK empowers you to integrate an AI voice agent into your iOS app within minutes. The agent joins the same meeting room and interacts over voice using the Google Gemini Live API. ## Prerequisites - iOS 13.0+ - Xcode 13.0+ - Swift 5.0+ - VideoSDK Developer Account (get token from the [dashboard](https://app.videosdk.live/api-keys)) - Python 3.12+ - Google API Key with Gemini Live API access :::important You need a VideoSDK account to generate a token and a Google API key for the Gemini Live API. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token and the **[Google AI Studio](https://aistudio.google.com/api-keys)** for Google API key. ::: ## Project Structure ```jsx title="Directory Structure" . ├── videosdk-agents-quickstart-ios/ │ ├── JoinScreenView.swift │ ├── MeetingView.swift │ ├── MeetingViewController.swift │ ├── RoomsStruct.swift │ └── videosdk_agents_quickstart_iosApp.swift ├── videosdk-agents-quickstart-ios.xcodeproj/ ├── agent-ios.py └── .env ``` You will work on: - `JoinScreenView.swift`: Join screen UI - `MeetingView.swift`: Meeting interface with audio controls - `MeetingViewController.swift`: Handles meeting logic and events - `agent-ios.py`: Python AI agent (Gemini Live) - `.env`: For storing API keys. ## 1. 
iOS Frontend

### Step 1: Create App and Install VideoSDK

Create a new iOS app in Xcode:

1. Create a new Xcode project
2. Choose the **App** template
3. Add a **Product Name** and save the project

Install VideoSDK using Swift Package Manager:

1. In Xcode, go to `File > Add Packages...`
2. Enter the repository URL: `https://github.com/videosdk-live/videosdk-rtc-ios-sdk.git`
3. Choose the latest version and click `Add Package`

Add permissions to `Info.plist`:

```xml title="Info.plist"
<key>NSCameraUsageDescription</key>
<string>Camera permission description</string>
<key>NSMicrophoneUsageDescription</key>
<string>Microphone permission description</string>
```

### Step 2: Create Models and Views

Create Swift models and views for the meeting interface:

```swift title="RoomsStruct.swift"
struct RoomsStruct: Codable {
    let createdAt, updatedAt, roomID: String?
    let links: Links?
    let id: String?

    enum CodingKeys: String, CodingKey {
        case createdAt, updatedAt
        case roomID = "roomId"
        case links, id
    }
}

struct Links: Codable {
    let getRoom, getSession: String?

    enum CodingKeys: String, CodingKey {
        case getRoom = "get_room"
        case getSession = "get_session"
    }
}
```

```swift title="JoinScreenView.swift"
import SwiftUI

struct JoinScreenView: View {
    // Meeting id to join and the user's display name
    let meetingId: String = "YOUR_MEETING_ID"
    @State var name: String

    var body: some View {
        NavigationView {
            VStack {
                Text("VideoSDK")
                    .font(.largeTitle)
                    .fontWeight(.bold)
                Text("AI Agent Quickstart")
                    .font(.largeTitle)
                    .fontWeight(.semibold)
                    .padding(.bottom)
                TextField("Enter Your Name", text: $name)
                    .foregroundColor(Color.black)
                    .autocorrectionDisabled()
                    .font(.headline)
                    .overlay(
                        Image(systemName: "xmark.circle.fill")
                            .padding()
                            .offset(x: 10)
                            .foregroundColor(Color.gray)
                            .opacity(name.isEmpty ? 
0.0 : 1.0) .onTapGesture { UIApplication.shared.endEditing() name = "" } , alignment: .trailing) .padding() .background( RoundedRectangle(cornerRadius: 25) .fill(Color.secondary.opacity(0.5)) .shadow(color: Color.gray.opacity(0.10), radius: 10)) .padding() NavigationLink(destination: MeetingView(meetingId: self.meetingId, userName: name ?? "Guest") .navigationBarBackButtonHidden(true)) { Text("Join Meeting") .foregroundColor(Color.white) .padding() .background( RoundedRectangle(cornerRadius: 25.0) .fill(Color.blue)) } } } } } extension UIApplication { func endEditing() { sendAction(#selector(UIResponder.resignFirstResponder), to: nil, from: nil, for: nil) } } ``` ```swift title="MeetingView.swift" import SwiftUI import VideoSDKRTC struct MeetingView: View{ @Environment(\.presentationMode) var presentationMode @ObservedObject var meetingViewController = MeetingViewController() @State var meetingId: String? @State var userName: String? @State var isUnMute: Bool = true var body: some View { VStack { if meetingViewController.participants.count == 0 { Text("Meeting Initializing") } else { VStack { VStack(spacing: 20) { Text("Meeting ID: \(meetingViewController.meetingID)") .padding(.vertical) List { ForEach(meetingViewController.participants.indices, id: \.self) { index in Text("Participant Name: \(meetingViewController.participants[index].displayName)") } } } VStack { HStack(spacing: 15) { // mic button Button { if isUnMute { isUnMute = false meetingViewController.meeting?.muteMic() } else { isUnMute = true meetingViewController.meeting?.unmuteMic() } } label: { Text("Toggle Mic") .foregroundStyle(Color.white) .font(.caption) .padding() .background( RoundedRectangle(cornerRadius: 25) .fill(Color.blue)) } // end meeting button Button { meetingViewController.meeting?.end() presentationMode.wrappedValue.dismiss() } label: { Text("End Call") .foregroundStyle(Color.white) .font(.caption) .padding() .background( RoundedRectangle(cornerRadius: 25) .fill(Color.red)) } } 
.padding(.bottom)
                    }
                }
            }
        }
        .onAppear() {
            // MARK: - configuring the VideoSDK
            VideoSDK.config(token: meetingViewController.token)
            if meetingId?.isEmpty == false {
                // join an existing meeting with the provided meeting id
                meetingViewController.joinMeeting(meetingId: meetingId!, userName: userName!)
            }
        }
    }
}
```

### Step 3: Implement Meeting Logic

Create the main meeting view controller to handle events:

```swift title="MeetingViewController.swift"
import Foundation
import VideoSDKRTC

class MeetingViewController: ObservableObject {
    var token = "YOUR_VIDEOSDK_AUTH_TOKEN" // Add your token here
    var meetingId: String = ""
    var name: String = ""

    @Published var meeting: Meeting? = nil
    @Published var participants: [Participant] = []
    @Published var meetingID: String = ""

    func initializeMeeting(meetingId: String, userName: String) {
        meeting = VideoSDK.initMeeting(
            meetingId: meetingId,
            participantName: userName,
            micEnabled: true,
            webcamEnabled: false
        )
        meeting?.addEventListener(self)
        meeting?.join()
    }
}

extension MeetingViewController: MeetingEventListener {
    func onMeetingJoined() {
        guard let localParticipant = self.meeting?.localParticipant else { return }
        // add to list
        participants.append(localParticipant)
        localParticipant.addEventListener(self)
    }

    func onParticipantJoined(_ participant: Participant) {
        participants.append(participant)
        // add listener
        participant.addEventListener(self)
    }

    func onParticipantLeft(_ participant: Participant) {
        participants = participants.filter({ $0.id != participant.id })
    }

    func onMeetingLeft() {
        meeting?.localParticipant.removeEventListener(self)
        meeting?.removeEventListener(self)
    }

    func onMeetingStateChanged(meetingState: MeetingState) {
        switch meetingState {
        case .DISCONNECTED:
            participants.removeAll()
        default:
            break
        }
    }
}

extension MeetingViewController: ParticipantEventListener { }

extension MeetingViewController {
    func joinMeeting(meetingId: String, userName: String) {
        if 
!token.isEmpty { self.meetingID = meetingId self.initializeMeeting(meetingId: meetingId, userName: userName) } else { print("Auth token required") } } } ``` ### Step 4: App Entry Point Configure the main app entry point: ```swift title="videosdk_agents_quickstart_iosApp.swift" import SwiftUI @main struct videosdk_agents_quickstart_iosApp: App { var body: some Scene { WindowGroup { JoinScreenView(name: "") } } } ``` ## 2. Python AI Agent ### Step 1: Configure Environment Create a `.env` file in the `mobile-quickstarts/ios` directory to store your API keys securely. ```env title=".env" # Google API Key for Gemini Live API GOOGLE_API_KEY=your_google_api_key_here # VideoSDK Authentication Token VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token_here ``` ### 2. Create Python AI Agent Create the Python AI agent that will join the same meeting room and interact with users through voice. ```python title="agent-ios.py" from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig import logging logging.getLogger().setLevel(logging.INFO) class MyVoiceAgent(Agent): def __init__(self): super().__init__( instructions="You are a high-energy game-show host guiding the caller to guess a secret number from 1 to 100 to win 1,000,000$.", ) async def on_enter(self) -> None: await self.session.say("Welcome to the Videosdk's AI Agent game show! I'm your host, and we're about to play for 1,000,000$. Are you ready to play?") async def on_exit(self) -> None: await self.session.say("Goodbye!") async def start_session(context: JobContext): agent = MyVoiceAgent() model = GeminiRealtime( model="gemini-2.5-flash-native-audio-preview-12-2025", # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter # api_key="AIXXXXXXXXXXXXXXXXXXXX", config=GeminiLiveConfig( voice="Leda", # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr. 
response_modalities=["AUDIO"] ) ) pipeline = RealTimePipeline(model=model) session = AgentSession( agent=agent, pipeline=pipeline ) def on_transcription(data: dict): role = data.get("role") text = data.get("text") print(f"[TRANSCRIPT][{role}]: {text}") pipeline.on("realtime_model_transcription", on_transcription) await context.run_until_shutdown(session=session, wait_for_participant=True) def make_context() -> JobContext: room_options = RoomOptions( # Static meeting ID - same as used in frontend room_id="YOUR_MEETING_ID", # Replace it with your actual room_id name="Gemini Agent", playground=True, ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ## 3. Run the Application ### Step 1: Run the iOS App Build and run the app from Xcode on a simulator or physical device. ### Step 2: Run the AI Agent Open a new terminal and run the Python agent: ```bash pip install videosdk-agents pip install "videosdk-plugins-google" python agent-ios.py ``` ### Step 3: Connect and Interact 1. Run the iOS app on a simulator or device 2. Join the meeting and allow microphone permissions 3. When you join, the Python agent detects your participation and starts speaking 4. Talk to the agent in real time and play the number guessing game ## Troubleshooting - Ensure the same `room_id` is set in both the iOS app and the agent's `RoomOptions` - Verify microphone permissions in iOS Settings > Privacy & Security > Microphone - Confirm your VideoSDK token is valid and Google API key is set - For simulator issues, ensure you're using a physical device for microphone testing ## Next Steps Clone repo for quick implementation - [Quickstart Example](https://github.com/videosdk-live/agents-quickstart/tree/main/mobile-quickstarts/ios): Complete working example with source code --- VideoSDK empowers you to seamlessly integrate AI agents with real-time voice interaction with your IoT device within minutes. 
In this quickstart, you'll explore how to create an AI agent that connects with an IoT device and interacts with users through voice using Google Gemini Live API. ## Prerequisites Before you begin, ensure you have the following: - **ESP-IDF v5.3**: Installed and configured for your ESP32-S3 board. - **Python**: Version 3.12 or higher. - **VideoSDK Account**: If you don't have one, sign up at the [VideoSDK Dashboard](https://app.videosdk.live/). - **Google API Key**: For using the Gemini Live API. :::important You need a VideoSDK account to generate a token and a Google API key for the Gemini Live API. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token and the **[Google AI Studio](https://aistudio.google.com/api-keys)** for Google API key. ::: ## Project Structure ``` IoT-quickstart/ ├── main/ │ ├── ai-demo.c │ ├── CMakeLists.txt │ ├── idf_component.yml │ └── Kconfig.projbuild ├── agent-iot.py ├── partitions.csv ├── sdkconfig.defaults └── README.md ``` You will be working with the following files: - **`main/ai-demo.c`**: Main application logic for the ESP32 firmware. - **`agent-iot.py`**: The Python AI agent that joins the meeting. - **Configuration Files**: `main/idf_component.yml`, `main/CMakeLists.txt`, `main/Kconfig.projbuild`, `partitions.csv`, and `sdkconfig.defaults` for project setup. ## 1. ESP32-S3 Firmware Setup ### Step 1: Create a Meeting Room First, create a meeting room using the VideoSDK API. This will provide a static `roomId` that both the ESP32 device and the AI agent will use to connect. ```bash curl -X POST https://api.videosdk.live/v2/rooms \ -H "Authorization: YOUR_JWT_TOKEN_HERE" \ -H "Content-Type: application/json" ``` Replace `YOUR_JWT_TOKEN_HERE` with your VideoSDK auth token. Copy the `roomId` from the response to use in the following steps. ### Step 2: Configure the Project Update the configuration files to set up your project dependencies, build settings, and hardware specifics. 
**Dependencies:** ```yaml title="main/idf_component.yml" ## IDF Component Manager Manifest File dependencies: iot-sdk: path: /path/to/your/IoTSdk # Replace with the absolute path to your cloned IoTSdk protocol_examples_common: path: ${IDF_PATH}/examples/common_components/protocol_examples_common idf: version: =5.3.0 mdns: '*' espressif/esp_audio_codec: ~2.3.0 espressif/esp_codec_dev: ~1.3.4 espressif/esp_audio_effects: ~1.1.0 sepfy/srtp: ^2.3.0 ``` --- **CMake:** ```cmake title="main/CMakeLists.txt" idf_component_register(SRCS "ai-demo.c" INCLUDE_DIRS "." REQUIRES mbedtls REQUIRES json REQUIRES esp_netif REQUIRES fatfs REQUIRES vfs REQUIRES esp_common REQUIRES esp_timer REQUIRES esp_lcd REQUIRES nvs_flash REQUIRES bt ) target_compile_options(${COMPONENT_LIB} PRIVATE "-Wno-format") ``` --- **Hardware Config:** ```kconfig title="main/Kconfig.projbuild" menu "SET Microcontroller" choice AUDIO_BOARD prompt "Audio hardware board" default ESP32S3_XIAO help Select an audio board to use config ESP32_S3_KORVO_2_V3_0_BOARD bool "ESP32-S3-Korvo-2" depends on IDF_TARGET_ESP32S3 config ESP32S3_XIAO bool "ESP32-S3-XIAO" endchoice endmenu ``` --- **Partitions:** ```csv title="partitions.csv" # ESP-IDF Partition Table # Name, Type, SubType, Offset, Size, Flags nvs, data, nvs, 0x9000, 0x6000, phy_init, data, phy, 0xf000, 0x1000, factory, app, factory, 0x10000, 4M, ``` --- **SDK Config:** ```ini title="sdkconfig.defaults" # This file was generated using idf.py save-defconfig. It can be edited manually. 
# Espressif IoT Development Framework (ESP-IDF) 5.2.2 Project Minimal Configuration # CONFIG_IDF_TARGET="esp32s3" CONFIG_APP_RETRIEVE_LEN_ELF_SHA=16 CONFIG_ESPTOOLPY_FLASHSIZE_4MB=y CONFIG_PARTITION_TABLE_CUSTOM=y CONFIG_ESP32S3_XIAO_SENSE=y CONFIG_EXAMPLE_WIFI_SSID="myssid" CONFIG_EXAMPLE_WIFI_PASSWORD="mypassword" CONFIG_EXAMPLE_CONNECT_IPV6=n CONFIG_ESP_PHY_REDUCE_TX_POWER=y CONFIG_SPIRAM=y CONFIG_SPIRAM_MODE_OCT=y CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ_240=y CONFIG_ESP_SYSTEM_EVENT_TASK_STACK_SIZE=2048 CONFIG_ESP_MAIN_TASK_STACK_SIZE=4096 CONFIG_ESP_TASK_WDT_CHECK_IDLE_TASK_CPU1=n CONFIG_ESP_IPC_TASK_STACK_SIZE=2048 CONFIG_ESP_WIFI_DYNAMIC_RX_BUFFER_NUM=16 CONFIG_ESP_WIFI_STATIC_TX_BUFFER_NUM=32 CONFIG_ESP_WIFI_CACHE_TX_BUFFER_NUM=64 CONFIG_LWIP_IPV6_AUTOCONFIG=y CONFIG_LWIP_IPV6_DHCP6=y CONFIG_LWIP_TCP_SND_BUF_DEFAULT=5744 CONFIG_LWIP_TCP_WND_DEFAULT=5744 CONFIG_MBEDTLS_EXTERNAL_MEM_ALLOC=y CONFIG_MBEDTLS_SSL_PROTO_DTLS=y CONFIG_PTHREAD_TASK_STACK_SIZE_DEFAULT=8192 ``` ### Step 3: Implement the Firmware Logic Update `main/ai-demo.c` with your VideoSDK token and the `roomId` you created. 
```c title="main/ai-demo.c"
// Standard C headers (a typical set; adjust to your project's needs)
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <inttypes.h>

#include "esp_event.h"
#include "esp_log.h"
#include "esp_mac.h"
#include "esp_netif.h"
#include "esp_partition.h"
#include "esp_system.h"
#include "freertos/FreeRTOS.h"
#include "nvs_flash.h"
#include "protocol_examples_common.h"
#include "videosdk.h"

static const char *TAG = "Videosdk";
const char *token = "YOUR_VIDEOSDK_AUTH_TOKEN"; // Replace with your VideoSDK auth token

static void meeting_task(void *pvParameters) {
    create_meeting_result_t result = create_meeting(token);
    if (result.room_id) {
        ESP_LOGI(TAG, "Created meeting roomId = %s", result.room_id);
        free(result.room_id);
    } else {
        ESP_LOGE(TAG, "Failed to create meeting");
    }
    ESP_LOGI(TAG, "meeting_task finished, deleting self");
    vTaskDelete(NULL);
}

void app_main(void) {
    static char deviceid[32] = {0};
    uint8_t mac[8] = {0};

    esp_log_level_set("*", ESP_LOG_INFO);
    esp_log_level_set("esp-tls", ESP_LOG_VERBOSE);
    esp_log_level_set("MQTT_CLIENT", ESP_LOG_VERBOSE);
    esp_log_level_set("MQTT_EXAMPLE", ESP_LOG_VERBOSE);
    esp_log_level_set("TRANSPORT_BASE", ESP_LOG_VERBOSE);
    esp_log_level_set("TRANSPORT", ESP_LOG_VERBOSE);
    esp_log_level_set("OUTBOX", ESP_LOG_VERBOSE);

    ESP_ERROR_CHECK(nvs_flash_init());
    ESP_ERROR_CHECK(esp_netif_init());
    ESP_ERROR_CHECK(esp_event_loop_create_default());
    ESP_ERROR_CHECK(example_connect());

    BaseType_t ok = xTaskCreate(meeting_task, "meeting_task", 16384, (void *)token, 5, NULL);
    if (ok != pdPASS) {
        ESP_LOGE(TAG, "Failed to create meeting_task");
    }

    init_config_t init_cfg = {
        .meetingID = "YOUR_MEETING_ID", // Replace with your meeting ID
        .token = token,
        .displayName = "ESP32-Device",
        .audioCodec = AUDIO_CODEC_OPUS,
    };

    result_t init_result = init(&init_cfg);
    printf("Init result: %d\n", init_result);

    result_t result_publish = startPublishAudio("");
    result_t result_subscribe = startSubscribeAudio("", NULL);
    printf("Publish result: %d, Subscribe result: %d\n", result_publish, result_subscribe);

    while (1) {
        vTaskDelay(pdMS_TO_TICKS(10));
    }
}
```

## 2. 
Python AI Agent ### Step 1: Configure Environment and Credentials Create a `.env` file in the `IoT-quickstart` directory to store your API keys securely. ```env title=".env" # Google API Key for Gemini Live API GOOGLE_API_KEY="your_google_api_key_here" # VideoSDK Authentication Token VIDEOSDK_AUTH_TOKEN="your_videosdk_auth_token_here" ``` ### Step 2: Create the Python AI Agent The Python agent joins the same meeting room and uses the Gemini Live API to interact with the user. Update `agent-iot.py` with the `roomId` you created earlier. ```python title="agent-iot.py" from videosdk.agents import Agent, AgentSession, RealTimePipeline,JobContext, RoomOptions, WorkerJob from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig import logging logging.getLogger().setLevel(logging.INFO) class MyVoiceAgent(Agent): def __init__(self): super().__init__( instructions="You are a high-energy game-show host guiding the caller to guess a secret number from 1 to 100 to win 1,000,000$.", ) async def on_enter(self) -> None: await self.session.say("Welcome to the Videosdk's AI Agent game show! I'm your host, and we're about to play for 1,000,000$. Are you ready to play?") async def on_exit(self) -> None: await self.session.say("Goodbye!") async def start_session(context: JobContext): agent = MyVoiceAgent() model = GeminiRealtime( model="gemini-2.5-flash-native-audio-preview-12-2025", # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter # api_key="AIXXXXXXXXXXXXXXXXXXXX", config=GeminiLiveConfig( voice="Leda", # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr. 
response_modalities=["AUDIO"] ) ) pipeline = RealTimePipeline(model=model) session = AgentSession( agent=agent, pipeline=pipeline ) def on_transcription(data: dict): role = data.get("role") text = data.get("text") print(f"[TRANSCRIPT][{role}]: {text}") pipeline.on("realtime_model_transcription", on_transcription) await context.run_until_shutdown(session=session,wait_for_participant=True) def make_context() -> JobContext: room_options = RoomOptions( # Static meeting ID - same as used in IoT room_id="YOUR_MEETING_ID", # Replace it with your actual room_id name="Gemini Agent", playground=True, ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ## 3. Run the Application ### Step 1: Run the ESP32 Firmware Configure, build, and flash the firmware onto your ESP32 board. 1. **Set the target board:** ```bash idf.py set-target esp32s3 ``` 2. **Run menuconfig to set WiFi and other board settings:** ```bash idf.py menuconfig ``` Inside `menuconfig`, navigate to: - `Component config` -> `mbedtls` -> Enable `Support DTLS` and `Support TLS`. - `Example Connection Configuration` -> Set your `WIFI SSID` and `WIFI Password`. - `Partition table` -> Enable `Custom partition table CSV`. - `Serial flasher config` -> Adjust the flash size for your board. - `Set Microcontroller` -> Select your audio hardware board. 3. **Build and flash the project:** ```bash idf.py build idf.py flash monitor ``` ### Step 2: Run the Python AI Agent Open a new terminal, navigate to the `IoT-quickstart` directory, and run the Python agent. ```bash # Install Python dependencies pip install videosdk-agents "videosdk-plugins-google" # Run the AI agent python agent-iot.py ``` Once the ESP32 device joins the meeting, the AI agent will detect it and begin the interactive game show. 
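The `on_transcription` callback above only prints each event as it arrives. The same callback slot can instead feed a small buffer if you want to keep the conversation for later inspection — a hypothetical helper, not part of the SDK:

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptBuffer:
    """Accumulates realtime transcription events as formatted lines."""
    lines: list = field(default_factory=list)

    def handle(self, data: dict) -> None:
        # Same payload shape the pipeline callback receives: {"role": ..., "text": ...}
        role = data.get("role", "unknown")
        text = data.get("text", "")
        self.lines.append(f"[TRANSCRIPT][{role}]: {text}")


buffer = TranscriptBuffer()
# In the agent you would register it the same way as on_transcription:
#   pipeline.on("realtime_model_transcription", buffer.handle)
buffer.handle({"role": "user", "text": "Is it 50?"})
buffer.handle({"role": "agent", "text": "Higher!"})
print(buffer.lines)
```

Because `handle` has the same signature as the `on_transcription` function in the quickstart, it can be swapped in without touching the rest of the session setup.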
## Next Steps

Clone the repo for a quick implementation:

- [Quickstart Example](https://github.com/videosdk-live/agents-quickstart/tree/main/IoT-quickstart): Complete working example with source code

---

VideoSDK empowers you to seamlessly integrate AI agents with real-time voice interaction into your JavaScript application within minutes. In this quickstart, you'll explore how to create an AI agent that joins a meeting room and interacts with users through voice using the Google Gemini Live API.

## Prerequisites

Before proceeding, ensure that your development environment meets the following requirements:

- Video SDK Developer Account (if you don't have one, sign up via the **[Video SDK Dashboard](https://app.videosdk.live/)**)
- Node.js and Python 3.12+ installed on your device
- Google API Key with Gemini Live API access

:::important
You need a VideoSDK account to generate a token and a Google API key for the Gemini Live API. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token and **[Google AI Studio](https://aistudio.google.com/api-keys)** for the Google API key.
:::

## Project Structure

First, create an empty project using `mkdir folder_name` in your preferred location for the JavaScript frontend. Your final project structure should look like this:

```jsx title="Project Structure"
root
├── index.html
├── config.js
├── index.js
├── agent-js.py
└── .env
```

You will be working on the following files:

- `index.html`: Responsible for creating a basic UI for joining the meeting.
- `config.js`: Responsible for storing the token and room ID for the JavaScript frontend.
- `index.js`: Responsible for rendering the meeting view and audio functionality.
- `agent-js.py`: The Python agent using the Google Gemini Live API.
- `.env`: Environment variables for the Python agent's API keys.

## 1.
Building the JavaScript Frontend

### Step 1: Install VideoSDK

Install the VideoSDK JS SDK with npm or yarn (it can also be loaded in the browser via a `<script>` tag):

**npm:**

```bash
npm install @videosdk.live/js-sdk
```

**yarn:**

```bash
yarn add @videosdk.live/js-sdk
```

### Step 2: Design the User Interface

Create an `index.html` file containing `join-screen` and `grid-screen` sections for audio-only interaction.

```html title="index.html"
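<!-- Reconstructed minimal markup (assumption): the original markup is not
     preserved in this guide; the element ids below are taken from the
     index.js code that follows, so treat the layout itself as illustrative. -->
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>VideoSDK AI Agent</title>
  </head>
  <body>
    <div id="join-screen">
      <button id="createMeetingBtn">Join Agent Meeting</button>
    </div>
    <div id="textDiv"></div>
    <div id="grid-screen" style="display: none">
      <h3 id="meetingIdHeading"></h3>
      <button id="leaveBtn">Leave</button>
      <button id="toggleMicBtn">Toggle Mic</button>
      <div id="audioContainer"></div>
    </div>
    <!-- Load the VideoSDK bundle here (CDN script tag or your bundled install), then: -->
    <script src="config.js"></script>
    <script src="index.js"></script>
  </body>
</html>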
``` ### Step 3: Configure the Frontend Create a meeting room using the VideoSDK API: ```bash curl -X POST https://api.videosdk.live/v2/rooms \ -H "Authorization: YOUR_JWT_TOKEN_HERE" \ -H "Content-Type: application/json" ``` Copy the `roomId` from the response and configure it in `config.js` for the JavaScript frontend: ```js title="config.js" TOKEN = "your_videosdk_auth_token_here"; ROOM_ID = "YOUR_MEETING_ID"; // Static room ID shared between frontend and agent ``` ### Step 4: Implement Meeting Logic In `index.js`, retrieve DOM elements, declare variables, and add the core meeting functionalities. ```js title="index.js" // getting Elements from Dom const leaveButton = document.getElementById("leaveBtn"); const toggleMicButton = document.getElementById("toggleMicBtn"); const createButton = document.getElementById("createMeetingBtn"); const audioContainer = document.getElementById("audioContainer"); const textDiv = document.getElementById("textDiv"); // declare Variables let meeting = null; let meetingId = ""; let isMicOn = false; // Join Agent Meeting Button Event Listener createButton.addEventListener("click", async () => { document.getElementById("join-screen").style.display = "none"; textDiv.textContent = "Please wait, we are joining the meeting"; meetingId = ROOM_ID; initializeMeeting(); }); // Initialize meeting function initializeMeeting() { window.VideoSDK.config(TOKEN); meeting = window.VideoSDK.initMeeting({ meetingId: meetingId, name: "C.V.Raman", micEnabled: true, webcamEnabled: false, }); meeting.join(); meeting.localParticipant.on("stream-enabled", (stream) => { if (stream.kind === "audio") { setAudioTrack(stream, meeting.localParticipant, true); } }); meeting.on("meeting-joined", () => { textDiv.textContent = null; document.getElementById("grid-screen").style.display = "block"; document.getElementById("meetingIdHeading").textContent = `Meeting Id: ${meetingId}`; }); meeting.on("meeting-left", () => { audioContainer.innerHTML = ""; }); 
meeting.on("participant-joined", (participant) => { let audioElement = createAudioElement(participant.id); participant.on("stream-enabled", (stream) => { if (stream.kind === "audio") { setAudioTrack(stream, participant, false); audioContainer.appendChild(audioElement); } }); }); meeting.on("participant-left", (participant) => { let aElement = document.getElementById(`a-${participant.id}`); if (aElement) aElement.remove(); }); } // Create audio elements for participants function createAudioElement(pId) { let audioElement = document.createElement("audio"); audioElement.setAttribute("autoPlay", "false"); audioElement.setAttribute("playsInline", "true"); audioElement.setAttribute("controls", "false"); audioElement.setAttribute("id", `a-${pId}`); audioElement.style.display = "none"; return audioElement; } // Set audio track function setAudioTrack(stream, participant, isLocal) { if (stream.kind === "audio") { if (isLocal) { isMicOn = true; } else { const audioElement = document.getElementById(`a-${participant.id}`); if (audioElement) { const mediaStream = new MediaStream(); mediaStream.addTrack(stream.track); audioElement.srcObject = mediaStream; audioElement.play().catch((err) => console.error("audioElem.play() failed", err)); } } } } // Implement controls leaveButton.addEventListener("click", async () => { meeting?.leave(); document.getElementById("grid-screen").style.display = "none"; document.getElementById("join-screen").style.display = "block"; }); toggleMicButton.addEventListener("click", async () => { if (isMicOn) meeting?.muteMic(); else meeting?.unmuteMic(); isMicOn = !isMicOn; }); ``` ## 2. 
Building the Python Agent ### Step 1: Configure the Agent Create a `.env` file to store your API keys securely for the Python agent: ```env title=".env" # Google API Key for Gemini Live API GOOGLE_API_KEY=your_google_api_key_here # VideoSDK Authentication Token VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token_here ``` ### Step 2: Create the Python Agent Create the Python agent (`agent-js.py`) that will join the same meeting room and interact with users through voice. ```python title="agent-js.py" from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig import logging logging.getLogger().setLevel(logging.INFO) class MyVoiceAgent(Agent): def __init__(self): super().__init__( instructions="You are a high-energy game-show host guiding the caller to guess a secret number from 1 to 100 to win 1,000,000$.", ) async def on_enter(self) -> None: await self.session.say("Welcome to the Videosdk's AI Agent game show! I'm your host, and we're about to play for 1,000,000$. Are you ready to play?") async def on_exit(self) -> None: await self.session.say("Goodbye!") async def start_session(context: JobContext): agent = MyVoiceAgent() model = GeminiRealtime( model="gemini-2.5-flash-native-audio-preview-12-2025", # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter # api_key="AIXXXXXXXXXXXXXXXXXXXX", config=GeminiLiveConfig( voice="Leda", # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr. 
response_modalities=["AUDIO"] ) ) pipeline = RealTimePipeline(model=model) session = AgentSession(agent=agent, pipeline=pipeline) def on_transcription(data: dict): role = data.get("role") text = data.get("text") print(f"[TRANSCRIPT][{role}]: {text}") pipeline.on("realtime_model_transcription", on_transcription) await context.run_until_shutdown(session=session, wait_for_participant=True) def make_context() -> JobContext: room_options = RoomOptions( room_id="YOUR_MEETING_ID", # Replace with your actual room_id name="Gemini Agent", playground=True, ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ## 3. Run the Application ### Step 1: Start the Frontend Once you have completed all the steps, serve your frontend files: ```bash # Using Python's built-in server python3 -m http.server 8000 # Or using Node.js http-server npx http-server -p 8000 ``` Open `http://localhost:8000` in your web browser. ### Step 2: Start the Python Agent Open a new terminal and run the Python agent: ```bash # Install Python dependencies pip install "videosdk-plugins-google" pip install videosdk-agents # Run the Python agent (reads GOOGLE_API_KEY from .env file) python agent-js.py ``` ### Step 3: Connect and Interact 1. **Join the meeting from the frontend:** - Click the "Join Agent Meeting" button in your browser. - Allow microphone permissions when prompted. 2. **Agent connection:** - Once you join, the Python agent will detect your participation. - You should see "Participant joined" in the terminal. - The AI agent will greet you and initiate the game. 3. **Start playing:** - The agent will guide you through a number guessing game (1-100). - Use your microphone to interact with the AI host. ## Troubleshooting ### Common Issues: 1. **"Waiting for participant..." but no connection:** - Ensure both frontend and the agent are running. 
- Check that the `ROOM_ID` matches in `config.js` and `agent-js.py`. - Verify your VideoSDK token is valid. 2. **Audio not working:** - Check browser permissions for microphone access. - Ensure your Google API key has Gemini Live API access enabled. 3. **Agent not responding:** - Verify your `GOOGLE_API_KEY` is set in the `.env` file. - Check that the Gemini Live API is enabled in your Google Cloud Console. ## Next Steps Clone repo for quick implementation - [Quickstart Example](https://github.com/videosdk-live/agents-quickstart/tree/main/mobile-quickstarts/js): Complete working example with source code --- VideoSDK empowers you to integrate an AI voice agent into your React Native app (Android/iOS) within minutes. The agent joins the same meeting room and interacts over voice using the Google Gemini Live API. ## Prerequisites - VideoSDK Developer Account (get token from the [dashboard](https://app.videosdk.live/api-keys)) - Node.js and a working React Native environment (Android Studio and/or Xcode) - Python 3.12+ - Google API Key with Gemini Live API access :::important You need a VideoSDK account to generate a token and a Google API key for the Gemini Live API. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token and the **[Google AI Studio](https://aistudio.google.com/api-keys)** for Google API key. ::: ## Project Structure First, create an empty project using `mkdir folder_name` on your preferable location for the React Native Frontend. Your final project structure should look like this: ```jsx title="Directory Structure" root ├── android/ ├── ios/ ├── App.js ├── constants.js ├── index.js ├── agent-react-native.py └── .env ``` You will work on: - `android/`: Contains the Android-specific project files. - `ios/`: Contains the iOS-specific project files. - `App.js`: The main React Native component, containing the UI and meeting logic. - `constants.js`: To store token and meetingId for the frontend. 
- `index.js`: The entry point of the React Native application, where VideoSDK is registered.
- `agent-react-native.py`: The Python agent that joins the meeting.
- `.env`: Environment variables file for the Python agent (stores API keys).

## 1. Building the React Native Frontend

### Step 1: Create App and Install SDKs

Create a React Native app and install the VideoSDK RN SDK:

```bash
npx react-native init videosdkAiAgentRN
cd videosdkAiAgentRN

# Install VideoSDK
npm install "@videosdk.live/react-native-sdk"
```

### Step 2: Configure the Project

#### Android Setup

```xml title="android/app/src/main/AndroidManifest.xml"
<!-- Typical permissions for audio/video calling (the original list is not
     preserved in this guide; adjust to your app's needs) -->
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.CAMERA" />
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.MODIFY_AUDIO_SETTINGS" />
```

```java title="android/app/build.gradle"
dependencies {
    implementation project(':rnwebrtc')
}
```

```gradle title="android/settings.gradle"
include ':rnwebrtc'
project(':rnwebrtc').projectDir = new File(rootProject.projectDir, '../node_modules/@videosdk.live/react-native-webrtc/android')
```

```java title="MainApplication.kt"
import live.videosdk.rnwebrtc.WebRTCModulePackage
// plus the standard React Native template imports

class MainApplication : Application(), ReactApplication {
    override val reactNativeHost: ReactNativeHost = object : DefaultReactNativeHost(this) {
        // Register the WebRTC module alongside the autolinked packages.
        // (Body reconstructed from the standard template; the original was truncated.)
        override fun getPackages(): List<ReactPackage> =
            PackageList(this).packages.apply {
                add(WebRTCModulePackage())
            }
        // ... rest of the template unchanged ...
    }
}
```

### Step 3: Implement the Meeting UI

A sketch of `App.js` is shown below. The markup was rebuilt around the surviving component signatures, so treat the exact views as illustrative rather than definitive:

```js title="App.js"
// Reconstructed sketch: views are illustrative; component structure follows the surviving logic.
import React from "react";
import { SafeAreaView, View, Text, FlatList, TouchableOpacity } from "react-native";
import { MeetingProvider, useMeeting } from "@videosdk.live/react-native-sdk";
import { token, meetingId } from "./constants";

function Button({ onPress, buttonText }) {
  return (
    <TouchableOpacity onPress={onPress}>
      <Text>{buttonText}</Text>
    </TouchableOpacity>
  );
}

function ControlsContainer({ join, leave, toggleMic }) {
  return (
    <View>
      <Button onPress={() => join()} buttonText="Join" />
      <Button onPress={() => toggleMic()} buttonText="Toggle Mic" />
      <Button onPress={() => leave()} buttonText="Leave" />
    </View>
  );
}

function ParticipantView({ participantDisplayName }) {
  return (
    <View>
      <Text>{participantDisplayName}</Text>
    </View>
  );
}

function ParticipantList({ participants }) {
  return participants.length > 0 ? (
    <FlatList
      data={participants}
      renderItem={({ item }) => {
        return <ParticipantView participantDisplayName={item.displayName} />;
      }}
    />
  ) : (
    <Text>No participants yet</Text>
  );
}

function MeetingView() {
  const { join, leave, toggleMic, participants, meetingId } = useMeeting({});
  const participantsList = [...participants.values()].map(participant => ({
    displayName: participant.displayName,
  }));
  return (
    <View>
      {meetingId ? <Text>Meeting Id: {meetingId}</Text> : null}
      <ParticipantList participants={participantsList} />
      <ControlsContainer join={join} leave={leave} toggleMic={toggleMic} />
    </View>
  );
}

export default function App() {
  if (!meetingId || !token) {
    return (
      <SafeAreaView>
        <Text>Set token and meetingId in constants.js</Text>
      </SafeAreaView>
    );
  }
  return (
    <MeetingProvider
      config={{ meetingId, micEnabled: true, webcamEnabled: false, name: "User" }}
      token={token}
    >
      <MeetingView />
    </MeetingProvider>
  );
}
```

## 2.
Building the Python Agent ### Step 1: Configure the Agent Create a `.env` file to store your API keys securely for the Python agent: ```env title=".env" # Google API Key for Gemini Live API GOOGLE_API_KEY=your_google_api_key_here # VideoSDK Authentication Token VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token_here ``` ### Step 2: Create the Python Agent ```python title="agent-react-native.py" from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig import logging logging.getLogger().setLevel(logging.INFO) class MyVoiceAgent(Agent): def __init__(self): super().__init__( instructions="You are a high-energy game-show host guiding the caller to guess a secret number from 1 to 100 to win 1,000,000$.", ) async def on_enter(self) -> None: await self.session.say("Welcome to the Videosdk's AI Agent game show! I'm your host, and we're about to play for 1,000,000$. Are you ready to play?") async def on_exit(self) -> None: await self.session.say("Goodbye!") async def start_session(context: JobContext): agent = MyVoiceAgent() model = GeminiRealtime( model="gemini-2.5-flash-native-audio-preview-12-2025", # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter # api_key="AIXXXXXXXXXXXXXXXXXXXX", config=GeminiLiveConfig( voice="Leda", # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr. 
response_modalities=["AUDIO"] ) ) pipeline = RealTimePipeline(model=model) session = AgentSession( agent=agent, pipeline=pipeline ) def on_transcription(data: dict): role = data.get("role") text = data.get("text") print(f"[TRANSCRIPT][{role}]: {text}") pipeline.on("realtime_model_transcription", on_transcription) await context.run_until_shutdown(session=session, wait_for_participant=True) def make_context() -> JobContext: room_options = RoomOptions( # Static meeting ID - same as used in frontend room_id="YOUR_MEETING_ID", # Replace it with your actual room_id name="Gemini Agent", playground=True, ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ## 3. Run the Application ### 1) Start the React Native app ```bash npm install # Android npm run android # iOS (macOS only) cd ios && pod install && cd .. npm run ios ``` ### 2) Start the Python Agent ```bash pip install videosdk-agents pip install "videosdk-plugins-google" python agent-react-native.py ``` ### 3) Connect and interact 1. Join the meeting from the app and allow microphone permissions. 2. When you join, the Python agent detects your participation and starts speaking. 3. Talk to the agent in real time and play the number guessing game. ## Troubleshooting - Ensure the same `room_id` is set in both the RN app (`constants.js`) and the agent's `RoomOptions`. - Verify microphone and camera permissions on the device/simulator. - Confirm your VideoSDK token is valid and Google API key is set. - If audio is silent, check device output volume and that the agent is not in playground mode. 
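The `constants.js` file referenced throughout the React Native section is not shown in this guide. A minimal version might look like the following — the export names are assumptions and must match whatever your `App.js` imports:

```js title="constants.js"
// Hypothetical constants.js — keep these names in sync with the App.js imports.
export const token = "YOUR_VIDEOSDK_AUTH_TOKEN"; // from https://app.videosdk.live/api-keys
export const meetingId = "YOUR_MEETING_ID"; // created via POST https://api.videosdk.live/v2/rooms
```

The same `meetingId` value must also be set as `room_id` in the agent's `RoomOptions` so both peers land in the same room.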
## Next Steps Clone repo for quick implementation - [Quickstart Example](https://github.com/videosdk-live/agents-quickstart/tree/main/mobile-quickstarts/react-native): Complete working example with source code --- VideoSDK empowers you to seamlessly integrate AI agents with real-time voice interaction into your React application within minutes. In this quickstart, you'll explore how to create an AI agent that joins a meeting room and interacts with users through voice using Google Gemini Live API. ## Prerequisites Before proceeding, ensure that your development environment meets the following requirements: - Video SDK Developer Account (Not having one, follow **[Video SDK Dashboard](https://app.videosdk.live/)**) - Node.js and Python 3.12+ installed on your device - Google API Key with Gemini Live API access :::important You need a VideoSDK account to generate a token and a Google API key for the Gemini Live API. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token and the **[Google AI Studio](https://aistudio.google.com/api-keys)** for Google API key. ::: ## Project Structure Your project structure should look like this. ```jsx title="Project Structure" root ├── node_modules ├── public ├── src │ ├── config.js │ ├── App.js │ └── index.js ├── agent-react.py └── .env ``` You will be working on the following files: - `App.js`: Responsible for creating a basic UI for joining the meeting - `config.js`: Responsible for storing the token and room ID - `index.js`: This is the entry point of your React application. - `agent-react.py`: Python AI agent backend using Google Gemini Live API - `.env`: Environment variables for API keys ## Part 1: React Frontend ### Step 1: Getting Started with the Code! #### Create new React App Create a new React App using the below command. ```bash $ npx create-react-app videosdk-ai-agent-react-app ``` #### Install VideoSDK Install the VideoSDK using the below-mentioned npm command. 
Make sure you are in your React app directory before you run this command.

```bash
$ npm install "@videosdk.live/react-sdk"
```

### Step 2: Configure Environment and Credentials

Create a meeting room using the VideoSDK API:

```bash
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_JWT_TOKEN_HERE" \
  -H "Content-Type: application/json"
```

Copy the `roomId` from the response and configure it in `src/config.js`:

```js title="src/config.js"
export const TOKEN = "YOUR_VIDEOSDK_AUTH_TOKEN";
export const ROOM_ID = "YOUR_MEETING_ID"; // Create using VideoSDK API (curl -X POST https://api.videosdk.live/v2/rooms)
```

### Step 3: Design the User Interface (UI)

Create the main App component with audio-only interaction in `src/App.js`:

```js title="src/App.js"
import React, { useEffect, useRef, useState } from "react";
import { MeetingProvider, useMeeting, useParticipant } from "@videosdk.live/react-sdk";
import { TOKEN, ROOM_ID } from "./config";

function ParticipantAudio({ participantId }) {
  const { micStream, micOn, isLocal, displayName } = useParticipant(participantId);
  const audioRef = useRef(null);

  useEffect(() => {
    if (!audioRef.current) return;
    if (micOn && micStream) {
      const mediaStream = new MediaStream();
      mediaStream.addTrack(micStream.track);
      audioRef.current.srcObject = mediaStream;
      audioRef.current.play().catch(() => {});
    } else {
      audioRef.current.srcObject = null;
    }
  }, [micStream, micOn]);

  return (
    // Markup below reconstructed — the original JSX tags are not preserved in this guide.
    <div>
      <audio ref={audioRef} autoPlay playsInline muted={isLocal} />
      <p>
        Participant: {displayName} | Mic: {micOn ? "ON" : "OFF"}
      </p>
    </div>
  );
}

function Controls() {
  const { leave, toggleMic } = useMeeting();
  return (
    <div>
      <button onClick={() => leave()}>Leave</button>
      <button onClick={() => toggleMic()}>Toggle Mic</button>
    </div>
  );
}

function MeetingView({ meetingId, onMeetingLeave }) {
  const [joined, setJoined] = useState(null);
  const { join, participants } = useMeeting({
    onMeetingJoined: () => setJoined("JOINED"),
    onMeetingLeft: onMeetingLeave,
  });
  const joinMeeting = () => {
    setJoined("JOINING");
    join();
  };
  return (
    <div>
      <h3>Meeting Id: {meetingId}</h3>
      {joined === "JOINED" ? (
        <div>
          {[...participants.keys()].map((pid) => (
            <ParticipantAudio key={pid} participantId={pid} />
          ))}
          <Controls />
        </div>
      ) : joined === "JOINING" ? (
        <p>Joining the meeting...</p>
      ) : (
        <button onClick={joinMeeting}>Join</button>
      )}
    </div>
  );
}

export default function App() {
  const [meetingId] = useState(ROOM_ID);
  const onMeetingLeave = () => {
    // no-op; simple sample
  };
  return (
    // Provider wiring reconstructed (the original JSX was not preserved);
    // config fields follow the VideoSDK React SDK's MeetingProvider API.
    <MeetingProvider
      config={{ meetingId, micEnabled: true, webcamEnabled: false, name: "User" }}
      token={TOKEN}
    >
      <MeetingView meetingId={meetingId} onMeetingLeave={onMeetingLeave} />
    </MeetingProvider>
  );
}
```

## Part 2: Python AI Agent

### Step 1: Create AI Agent Backend

Create a `.env` file to store your API keys securely for the Python agent:

```env title=".env"
# Google API Key for Gemini Live API
GOOGLE_API_KEY=your_google_api_key_here

# VideoSDK Authentication Token
VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token_here
```

Create the Python AI agent that will join the same meeting room and interact with users through voice.

```python title="agent-react.py"
from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
import logging

logging.getLogger().setLevel(logging.INFO)

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a high-energy game-show host guiding the caller to guess a secret number from 1 to 100 to win 1,000,000$.",
        )

    async def on_enter(self) -> None:
        await self.session.say("Welcome to the Videosdk's AI Agent game show! I'm your host, and we're about to play for 1,000,000$. Are you ready to play?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    model = GeminiRealtime(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter
        # api_key="AIXXXXXXXXXXXXXXXXXXXX",
        config=GeminiLiveConfig(
            voice="Leda",  # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr.
response_modalities=["AUDIO"] ) ) pipeline = RealTimePipeline(model=model) session = AgentSession( agent=agent, pipeline=pipeline ) def on_transcription(data: dict): role = data.get("role") text = data.get("text") print(f"[TRANSCRIPT][{role}]: {text}") pipeline.on("realtime_model_transcription", on_transcription) await context.run_until_shutdown(session=session, wait_for_participant=True) def make_context() -> JobContext: room_options = RoomOptions( # Static meeting ID - same as used in frontend room_id="YOUR_MEETING_ID", # Replace it with your actual room_id name="Gemini Agent", playground=True, ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ## Part 3: Run the Application ### Step 1: Run the Frontend Once you have completed all the steps mentioned above, start your React application: ```bash # Install dependencies npm install # Start the development server npm start ``` Open `http://localhost:3000` in your web browser. ### Step 2: Run the AI Agent Open a new terminal and run the Python agent: ```bash # Install Python dependencies pip install "videosdk-plugins-google" pip install videosdk-agents # Run the AI agent python agent-react.py ``` ### Step 3: Connect and Interact 1. **Join the meeting from the React app:** - Click the "Join" button in your browser - Allow microphone permissions when prompted 2. **Agent connection:** - Once you join, the Python backend will detect your participation - You should see "Participant joined" in the terminal - The AI agent will greet and initiate the game 3. **Start playing:** - The agent will guide you through a number guessing game (1-100) - Use your microphone to interact with the AI host - The agent will provide hints and encouragement throughout the game ## Troubleshooting ### Common Issues: 1. **"Waiting for participant..." 
but no connection:** - Ensure both frontend and backend are running - Check that the room ID matches in both `src/config.js` and `agent-react.py` - Verify your VideoSDK token is valid 2. **Audio not working:** - Check browser permissions for microphone access - Ensure your Google API key has Gemini Live API access enabled 3. **Agent not responding:** - Verify your Google API key is correctly set in the environment - Check that the Gemini Live API is enabled in your Google Cloud Console 4. **React build issues:** - Ensure Node.js version is compatible - Try clearing npm cache: `npm cache clean --force` - Delete `node_modules` and reinstall: `rm -rf node_modules && npm install` ## Next Steps Clone repo for quick implementation - [Quickstart Example](https://github.com/videosdk-live/agents-quickstart/tree/main/mobile-quickstarts/react): Complete working example with source code --- This guide demonstrates how to integrate a real-time AI agent with a Unity application. The agent, powered by Google's Gemini Live API, acts as a high-energy game show host, guiding the user through a number-guessing game via voice. ## Prerequisites Before you begin, ensure you have the following: - **Unity 2022.3 LTS** or later. - A **VideoSDK Account** for generating an auth token. If you don't have one, sign up at the [VideoSDK Dashboard](https://app.videosdk.live/). - **Python 3.12+** for the AI agent. - A **Google API Key** for the Gemini Live API. :::important You need a VideoSDK account to generate a token and a Google API key for the Gemini Live API. Visit the VideoSDK **[dashboard](https://app.videosdk.live/api-keys)** to generate a token and the **[Google AI Studio](https://aistudio.google.com/api-keys)** for Google API key. 
::: ## Project Structure ``` Unity-quickstart/ ├── Unity/ │ ├── Assets/ │ │ └── Scripts/ │ │ └── GameManager.cs │ ├── ProjectSettings/ │ └── Packages/ ├── agent-unity.py └── README.md ``` You will be working with the following files: - **`Unity/Assets/Scripts/GameManager.cs`**: The main script for the Unity application, handling meeting logic and UI. - **`agent-unity.py`**: The Python AI agent that joins the meeting. ## 1. Unity Frontend ### Step 1: Install VideoSDK Package 1. Open Unity’s **Package Manager** by selecting from the top bar: Window -> Package Manager. 2. Click the + button in the top left corner and select **Add package from git URL.** 3. Paste the following URL and click Add: ```js https://github.com/videosdk-live/videosdk-rtc-unity-sdk.git ``` ![Install VideoSDK Package](https://assets.videosdk.live/images/Project%20Manager.gif) 4. Add the `com.unity.nuget.newtonsoft-json` package by following the instructions provided [here](https://github.com/applejag/Newtonsoft.Json-for-Unity/wiki/Install-official-via-UPM). ### Step 2: Platform Setup **Android Setup** To integrate the VideoSDK into your Android project, follow these steps: 1. Add the following repository configuration to your `settingsTemplate.gradle` file: ```js title="settingsTemplate.gradle" dependencyResolutionManagement { repositoriesMode.set(RepositoriesMode.PREFER_SETTINGS) repositories { **ARTIFACTORYREPOSITORY** google() mavenCentral() jcenter() maven { url = uri("https://maven.aliyun.com/repository/jcenter") } flatDir { dirs "${project(':unityLibrary').projectDir}/libs" } } } ``` 2. Install Android SDK in `mainTemplate.gradle` ```js title="mainTemplate.gradle" dependencies { implementation 'live.videosdk:rtc-android-sdk:0.3.1' } ``` 3. If your project has set `android.useAndroidX=true`, then set `android.enableJetifier=true` in the `gradleTemplate.properties` file to migrate your project to AndroidX and avoid duplicate class conflict. 
```properties title="gradleTemplate.properties"
android.enableJetifier=true
android.useAndroidX=true
android.suppressUnsupportedCompileSdk=34
```

**Setting Up for iOS**

1. **Build for iOS**: In Unity, export the project for iOS.
2. **Open in Xcode**: Navigate to the generated Xcode project and open it.
3. **Configure Frameworks**:
   - Select the **Unity-iPhone** target.
   - Go to the **General** tab.
   - Under **Frameworks, Libraries, and Embedded Content**, add **VideoSDK** and its required frameworks.

![Unity iPhone](https://assets.videosdk.live/images/ios%20setup%201.png)
![Frameworks, Libraries, and Embedded Content](https://assets.videosdk.live/images/ios%20setup%202.png)

### Step 3: Create a Meeting Room

Create a static `roomId` using the VideoSDK API. Both the Unity app and the AI agent will use this ID to connect to the same meeting.

```bash
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_JWT_TOKEN_HERE" \
  -H "Content-Type: application/json"
```

Replace `YOUR_JWT_TOKEN_HERE` with your VideoSDK auth token. Copy the `roomId` from the response for the next steps.

### Step 4: Configure Unity Project

Update `GameManager.cs` with your VideoSDK auth token and the `roomId` you just created.

```csharp title="Unity/Assets/Scripts/GameManager.cs"
// ... existing code ...
public class GameManager : MonoBehaviour
{
    // ... existing code ...
    private readonly string _token = "YOUR_VIDEOSDK_AUTH_TOKEN";
    private readonly string _staticMeetingId = "YOUR_MEETING_ID";
    // ... existing code ...
}
```

### Step 5: Set Up Platform-Specific Permissions

**For Android:** Add the following permissions to your `AndroidManifest.xml` file:

```xml title="AndroidManifest.xml"
<!-- Typical permissions for audio/video calling (the original list is not
     preserved in this guide; adjust to your app's needs) -->
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.CAMERA" />
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.MODIFY_AUDIO_SETTINGS" />
```

**For iOS:** Ensure your `Info.plist` includes descriptions for camera and microphone usage:

```xml title="Info.plist"
<key>NSCameraUsageDescription</key>
<string>Camera access is required for video calls.</string>
<key>NSMicrophoneUsageDescription</key>
<string>Microphone access is required for audio calls.</string>
```

## 2.
Python AI Agent ### Step 1: Configure Environment and Credentials Create a `.env` file in the `Unity-quickstart` directory to store your API keys. ```env title=".env" # Google API Key for Gemini Live API GOOGLE_API_KEY="your_google_api_key_here" # VideoSDK Authentication Token VIDEOSDK_AUTH_TOKEN="your_videosdk_auth_token_here" ``` ### Step 2: Create the Python AI Agent The Python agent joins the same meeting room to interact with the user. Update `agent-unity.py` with your `roomId`. ```python title="agent-unity.py" from videosdk.agents import Agent, AgentSession, RealTimePipeline,JobContext, RoomOptions, WorkerJob from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig import logging logging.getLogger().setLevel(logging.INFO) class MyVoiceAgent(Agent): def __init__(self): super().__init__( instructions="You are a high-energy game-show host guiding the caller to guess a secret number from 1 to 100 to win 1,000,000$.", ) async def on_enter(self) -> None: await self.session.say("Welcome to the Videosdk's AI Agent game show! I'm your host, and we're about to play for 1,000,000$. 
Are you ready to play?") async def on_exit(self) -> None: await self.session.say("Goodbye!") async def start_session(context: JobContext): agent = MyVoiceAgent() model = GeminiRealtime( model="gemini-2.5-flash-native-audio-preview-12-2025", config=GeminiLiveConfig( voice="Leda", response_modalities=["AUDIO"] ) ) pipeline = RealTimePipeline(model=model) session = AgentSession( agent=agent, pipeline=pipeline ) def on_transcription(data: dict): role = data.get("role") text = data.get("text") print(f"[TRANSCRIPT][{role}]: {text}") pipeline.on("realtime_model_transcription", on_transcription) await context.run_until_shutdown(session=session,wait_for_participant=True) def make_context() -> JobContext: room_options = RoomOptions( room_id="YOUR_MEETING_ID", # Replace with your actual room_id name="Gemini Agent", playground=True, ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ## 3. Run the Application ### Step 1: Run the Python AI Agent Open a terminal, navigate to the `Unity-quickstart` directory, and run the agent. ```bash # Install Python dependencies pip install videosdk-agents "videosdk-plugins-google" # Run the AI agent python agent-unity.py ``` ### Step 2: Run the Unity Application 1. Open the `Unity/` project in Unity Hub. 2. Once the project is loaded, press the **Play** button in the Unity Editor to start the application. 3. Click the **Join Meeting** button in the app to connect to the session. Once you join the meeting, the AI agent will greet you and start the game. - [Quickstart Example](https://github.com/videosdk-live/agents-quickstart/tree/main/Unity-quickstart): Complete working example with source code --- This guide will walk you through creating a fully functional AI telephony agent using **[VideoSDK Agent SDK](https://github.com/videosdk-live/agents)**. 
You'll learn how to run the agent locally, connect it to the global telephone network using SIP, and enable it to handle both inbound and outbound phone calls. By the end, you'll have a working AI assistant that you can talk to from any phone.

## The Architecture

Before we dive in, let's look at the high-level architecture. A call from the phone network is directed by a [SIP provider](/telephony/integrations/twilio-sip-integration) (like Twilio) to VideoSDK's telephony infrastructure. A [Routing Rule](/telephony/call-routing/setting-up-routing-rules) then intelligently dispatches the call to your self-hosted AI agent, which processes the audio and responds in real-time.

![Architecture: Connecting a Voice Agent to the Telephony Network](https://strapi.videosdk.live/uploads/inbound_outbound_call_c70708607c.png)

## What You'll Build

We'll create a simple yet powerful project with the following structure:

```
├── main.py           # The core logic for your AI voice agent
├── requirements.txt  # Python package dependencies
└── .env              # Your secret credentials
```

## Prerequisites

To get started, you'll need a few things:

- **Python 3.12+**: Ensure you have a modern version of Python installed.
- **VideoSDK Account**: Sign up for a free [VideoSDK account](https://app.videosdk.live/api-keys) to get your `VIDEOSDK_TOKEN`. This token is used to authenticate your agent and manage telephony settings.

## Part 1: Build and Run the AI Agent Locally

First, let's get the AI agent running on your machine.

### Step 1: Set Up Your Project

1. Create a `.env` file to store your secret keys.
Add your credentials:

**realtime-pipeline:**

```bash title=".env"
VIDEOSDK_AUTH_TOKEN="your_videosdk_token_here"
GOOGLE_API_KEY="your_google_api_key_here"
```

> **API Keys** - Get API keys: [Google API Key ↗](https://aistudio.google.com/app/apikey) & Create your [VideoSDK Account ↗](https://app.videosdk.live/api-keys) and follow this guide to [generate a VideoSDK token](/ai_agents/authentication-and-token)

---

**cascading-pipeline:**

```bash title=".env"
VIDEOSDK_AUTH_TOKEN="your_videosdk_token_here"
DEEPGRAM_API_KEY="your_deepgram_api_key_here"
OPENAI_API_KEY="your_openai_api_key_here"
ELEVENLABS_API_KEY="your_elevenlabs_api_key_here"
```

> **API Keys** - Get API keys: [Deepgram ↗](https://console.deepgram.com/), [OpenAI ↗](https://platform.openai.com/api-keys), [ElevenLabs ↗](https://elevenlabs.io/app/settings/api-keys) & Create your [VideoSDK Account ↗](https://app.videosdk.live/api-keys) and follow this guide to [generate a VideoSDK token](/ai_agents/authentication-and-token)

2. Create a `requirements.txt` file and paste in the necessary dependencies:

**realtime-pipeline:**

```text title="requirements.txt"
videosdk-agents==0.0.45
videosdk-plugins-google==0.0.45
python-dotenv==1.1.1
requests==2.31.0
```

---

**cascading-pipeline:**

```text title="requirements.txt"
videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]==0.0.45
python-dotenv==1.1.1
requests==2.31.0
```

> **Latest Version**: Check the latest [videosdk-agents version on PyPI](https://pypi.org/project/videosdk-agents/) for the most recent release.

3. Finally, create the `main.py` file. This script defines your agent's personality and handles the connection to VideoSDK.
**realtime-pipeline:**

```python title="main.py"
import asyncio
import traceback
from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob, Options
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from dotenv import load_dotenv
import os
import logging

logging.basicConfig(level=logging.INFO)
load_dotenv()

# Define the agent's behavior and personality
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful AI assistant that answers phone calls. Keep your responses concise and friendly.",
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello! I'm your real-time assistant. How can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye! It was great talking with you!")

async def start_session(context: JobContext):
    # Configure the Gemini model for real-time voice
    model = GeminiRealtime(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        api_key=os.getenv("GOOGLE_API_KEY"),
        config=GeminiLiveConfig(
            voice="Leda",
            response_modalities=["AUDIO"]
        )
    )
    pipeline = RealTimePipeline(model=model)
    session = AgentSession(agent=MyVoiceAgent(), pipeline=pipeline)

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions()
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    try:
        # Register the agent with a unique ID
        options = Options(
            agent_id="MyTelephonyAgent",  # CRITICAL: Unique identifier for routing
            register=True,                # REQUIRED: Register with VideoSDK for telephony
            max_processes=10,             # Concurrent calls to handle
            host="localhost",
            port=8081,
        )
        job = WorkerJob(entrypoint=start_session, jobctx=make_context, options=options)
        job.start()
    except Exception as e:
        traceback.print_exc()
```

---

**cascading-pipeline:**

```python title="main.py"
import asyncio
import traceback
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, Options, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from dotenv import load_dotenv
import os
import logging

logging.basicConfig(level=logging.INFO)
load_dotenv()

# Pre-downloading the Turn Detector model
pre_download_model()

# Define the agent's behavior and personality
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful AI assistant that answers phone calls. Keep your responses concise and friendly.",
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello! I'm your AI telephony assistant. How can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye! It was great talking with you!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions()
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    try:
        # Register the agent with a unique ID
        options = Options(
            agent_id="MyTelephonyAgent",  # CRITICAL: Unique identifier for routing
            register=True,                # REQUIRED: Register with VideoSDK for telephony
            max_processes=10,             # Concurrent calls to handle
            host="localhost",
            port=8081,
        )
        job = WorkerJob(entrypoint=start_session, jobctx=make_context, options=options)
        job.start()
    except Exception as e:
        traceback.print_exc()
```

### Step 2: Set Up Your Environment and Install Dependencies

Create and activate a virtual environment to keep your project dependencies isolated.

```bash
# Create the virtual environment
python3 -m venv .venv

# Activate it (macOS/Linux)
source .venv/bin/activate
# On Windows, use: .venv\Scripts\activate

# Install the required packages
pip install -r requirements.txt
```

### Step 3: Run the Agent

Now, start your agent by running the Python script.

```bash
python main.py
```

Your terminal will show that the agent is running and has registered itself with VideoSDK using the ID `MyTelephonyAgent`. This ID is crucial for routing calls to it later.

![Running AI Agent Locally](https://strapi.videosdk.live/uploads/run_local_agent_327c1b161c.gif)

**Important:** Keep this terminal window open. Your agent must remain running to accept connections.

## Part 2: Connect Your Agent to the Phone Network

With your agent running locally, it's time to connect it to the outside world. This involves setting up [gateways and routing rules](/telephony/introduction) in your VideoSDK dashboard.

### Step 1: Configure an Inbound Gateway

An **[Inbound Gateway](/telephony/managing-calls/receiving-inbound-calls)** is the entry point for calls coming into VideoSDK.

**dashboard:**

- Navigate to **Telephony > Inbound Gateways** in the [VideoSDK Dashboard](https://app.videosdk.live/telephony/inbound-gateways) and click **Add**.
- Give your gateway a name and enter the phone number you purchased from your SIP provider (e.g., Twilio, Vonage, Telnyx, Plivo, Exotel).
- After creating it, copy the **Inbound Gateway URL**.
- In your SIP provider's dashboard, paste this URL into the **Origination SIP URI** field for your phone number.
This tells your provider to forward all incoming calls to VideoSDK.
---

**api:**

```bash
curl --request POST \
  --url https://api.videosdk.live/v2/sip/inbound-gateways \
  --header 'Authorization: YOUR_VIDEOSDK_TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "My Inbound Gateway",
    "numbers": ["+1234567890"]
  }'
```

> **API Reference**: [Create Inbound Gateway](/api-reference/realtime-communication/sip/inbound-gateway/create-inbound-gateway)

### Step 2: Configure an Outbound Gateway

An **[Outbound Gateway](/telephony/managing-calls/making-outbound-calls)** is the exit point for calls your agent makes to the phone network.

**dashboard:**

- Go to **Telephony > Outbound Gateways** in the dashboard and click **Add**.
- Give it a name and paste the **Termination SIP URI** and credentials from your SIP provider into the required fields.
---

**api:**

```bash
curl --request POST \
  --url https://api.videosdk.live/v2/sip/outbound-gateways \
  --header 'Authorization: YOUR_VIDEOSDK_TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "My Outbound Gateway",
    "numbers": ["+12065551234"],
    "address": "sip.myprovider.com",
    "transport": "udp",
    "auth": {
      "username": "your-username",
      "password": "your-password"
    }
  }'
```

> **API Reference**: [Create Outbound Gateway](/api-reference/realtime-communication/sip/outbound-gateway/create-outbound-gateway)

### Step 3: Create a Routing Rule

A **[Routing Rule](/telephony/call-routing/setting-up-routing-rules)** acts as a switchboard, connecting your gateways to your agent. This is where the magic happens.

**dashboard:**

- Go to **Telephony > Routing Rules** and click **Add**.
- Configure the rule:
  - **Gateway**: Select the Inbound Gateway you just created.
  - **Numbers**: Add the phone number associated with the gateway.
  - **Dispatch**: Choose **Agent**.
  - **Agent Type**: Set to `Self Hosted`.
  - **Agent ID**: Enter `MyTelephonyAgent`. This **must match** the `agent_id` in your `main.py` file.
- Click **Create** to save the rule.

![Setting Up Routing Rules for an AI Agent](https://assets.videosdk.live/images/routing-rules.gif)

---

**api:**

```bash
curl --request POST \
  --url https://api.videosdk.live/v2/sip/routing-rules \
  --header 'Authorization: YOUR_VIDEOSDK_TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
    "gatewayId": "gateway_in_123456789",
    "name": "Support Line Rule",
    "numbers": ["+1234567890"],
    "dispatch": "agent",
    "agentType": "self_hosted",
    "agentId": "MyTelephonyAgent"
  }'
```

> **API Reference**: [Create Routing Rule](/api-reference/realtime-communication/sip/routing-rules/create-routing-rule)

You have now successfully instructed VideoSDK to route all inbound calls from your phone number directly to your running Python agent.

## Part 3: Time to Talk! Make and Receive Calls

Your setup is complete! Let's test it out.
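Alongside dialing from a phone or the dashboard, calls can be driven programmatically. As a sketch using only the Python standard library, here is one way to prepare the outbound-call request used later in this section (the token, gateway ID, and destination number are placeholders):

```python
import json
from urllib import request

VIDEOSDK_TOKEN = "YOUR_VIDEOSDK_TOKEN"  # placeholder - use your real auth token

def build_outbound_call(gateway_id: str, phone_number: str) -> request.Request:
    """Prepare the POST request that asks VideoSDK to dial out through a gateway."""
    body = json.dumps({"gatewayId": gateway_id, "sipCallTo": phone_number}).encode()
    return request.Request(
        "https://api.videosdk.live/v2/sip/call",
        data=body,
        headers={"Authorization": VIDEOSDK_TOKEN, "Content-Type": "application/json"},
        method="POST",
    )

req = build_outbound_call("gw_123456789", "+14155550123")
print(req.get_method(), req.full_url)
# request.urlopen(req)  # uncomment with real credentials to actually place the call
```

The commented `urlopen` line is what actually sends the request; everything before it only constructs the call, so you can inspect the payload safely before dialing.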
:::tip
**Keep Your Agent Running**
Make sure your AI agent is running locally before configuring the telephony settings. The agent must be active to receive incoming calls.
:::

### Making an Inbound Call

- Using any phone, dial the SIP number you configured.
- Your local Python agent will automatically answer.
- You'll hear the greeting: "Hello! I'm your real-time assistant. How can I help you today?"
- Start talking! The agent will listen and respond in real-time.

### Making an Outbound Call

You can trigger an outbound call from your agent using a simple API request. Use `curl` or any API client to make a `POST` request to the VideoSDK API. Replace `YOUR_VIDEOSDK_TOKEN` and the `gatewayId` with your own.

```bash
curl --request POST \
  --url https://api.videosdk.live/v2/sip/call \
  --header 'Authorization: YOUR_VIDEOSDK_TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
    "gatewayId": "gw_123456789",
    "sipCallTo": "+14155550123"
  }'
```

This will command your `MyTelephonyAgent` to dial the specified number and start a conversation.

:::tip
**Geographic Optimization**
For optimal performance, run your agent in the same geographic region as your SIP provider (e.g., US East for Twilio, US West for Telnyx, Europe for Plivo). This reduces latency and improves call quality.
:::

## Next Steps

Congratulations! You've built and deployed a sophisticated AI telephony agent. You've seen how to run it locally and connect it to the global phone network for both inbound and outbound communication.

- [Deploy Your Agent](/ai_agents/deployments/introduction): Learn how to deploy your AI agent to production
- [Explore Telephony Docs](/telephony/introduction): Comprehensive telephony documentation and guides
- [Provider Integrations](/telephony/integrations/twilio-sip-integration): SIP provider setup guides (Twilio, Vonage, etc.)

---

# Why do we use a JWT-based token?

Token-based authentication allows users to verify their identity with a generated API key and secret.
We use JWT tokens because token-based authentication is **widely used** in modern web applications and APIs and offers several benefits over traditional authentication. For example, it can **reduce the risk of credentials being misused**, and it allows **more fine-grained control** over access to resources. Additionally, tokens can be easily revoked or expired, making it easier to manage access rights.

## How to generate a token

To secure communication, every participant that connects to the meeting needs an access token. You can generate this token using your `apiKey` and `secret-key`, which you can get from the [VideoSDK Dashboard](https://app.videosdk.live/api-keys).

### 1. Generating a token from the Dashboard

For **testing or development purposes**, you can generate a temporary token from the [VideoSDK Dashboard's API section](https://app.videosdk.live/api-keys).
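As a concrete sketch of the backend signing step described in this section: HS256 is simply HMAC-SHA256 over the base64url-encoded header and payload, so a token can be produced with the standard library alone. In practice a library such as PyJWT does this for you; the key and secret below are placeholders.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    # JWTs use base64url encoding with the padding stripped
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def generate_token(api_key: str, secret: str, ttl_seconds: int = 3600) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    payload = {
        "apikey": api_key,              # MANDATORY
        "permissions": ["allow_join"],  # MANDATORY - lets the agent join directly
        "iat": now,
        "exp": now + ttl_seconds,       # keep this short in production
    }
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(payload).encode())
    signature = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)

token = generate_token("YOUR_API_KEY", "YOUR_SECRET_KEY")
print(token)  # three dot-separated base64url segments
```

In a real backend you would read the key and secret from environment variables rather than hard-coding them.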
:::tip
The best practice is to generate the token from your backend server, which helps **keep your credentials safe**.
:::

### 2. Generating a token in your backend

- Your server generates an access token using your API key and secret.
- While generating a token, you can provide the **expiration time, permissions and roles** discussed later in this section.
- Your client obtains the token from your backend server.
- The client passes this token to the VideoSDK server for validation.
- The VideoSDK server only allows entry into the meeting if the token is valid.

![img2.png](/img/authentication-and-token.png)

Follow our official example repositories to set up a token API: [videosdk-rtc-api-server-examples](https://github.com/videosdk-live/videosdk-rtc-api-server-examples)

### Payload while generating a token

For AI Agent authentication, the payload is simplified to include only the essential parameters:

```js
{
  apikey: API_KEY,               // MANDATORY
  permissions: ["allow_join"],   // MANDATORY
}
```

- **`apikey` (Mandatory)**: This must be the API key generated from the VideoSDK Dashboard. You can get it from [here](https://app.videosdk.live/api-keys).
- **`permissions` (Mandatory)**: For AI agents, typically use `allow_join` so the agent can join meetings directly. Available permissions for AI agents:
  - **`allow_join`**: The AI agent is **allowed to join** the meeting directly.
  - **`ask_join`**: The AI agent is required to **ask for permission to join** the meeting.

Then, sign this payload with your **`SECRET KEY`** and `jwt` options using the **`HS256` algorithm**.

### Expiration time

You can set any expiration time on the token. In a **production environment**, however, it is recommended to generate tokens with a **short expiration time**, so that if someone gets hold of a token, it won't be valid for long.

### What happens if the token is expired?
If your token is expired, the user won't be able to join the meeting, and all API calls will return an error with the message `Token is invalid or expired`.

:::note
The token is validated only once, while joining the meeting. If a person joins the meeting and the token expires afterwards, the current meeting is unaffected.
:::

## How to check the validity of a token?

1. After generating the token, visit [jwt.io](https://jwt.io) and paste your token in the given area.
2. You will see the payload you passed while generating the token, along with the expiration and creation times.

![img1.png](/img/validate-token.png)

## API Settings & Controls

The VideoSDK Dashboard provides a powerful [API Settings](https://app.videosdk.live/api-keys) section where you can configure various services and controls for your meetings. These settings let you customize the behavior of different features directly from the dashboard, without code changes.

## Audio/Video Configurations

:::note
Runtime configuration from the dashboard has higher precedence than static SDK configurations. If a value is modified in the dashboard, it will override any previously defined SDK settings for that session.

Example: Code: Video = HD → Dashboard: Video = Full HD → Applied: Full HD
:::

### Maximum Send Bitrate Per Participant

Control the maximum total upload bandwidth for all media streams from a single participant.

![Maximum Send Bitrate](https://assets.videosdk.live/images/Screenshot%202026-01-09%20at%2012.00.11%E2%80%AFPM.png)

**Available Options:**

- **Auto**: Removes the upload bitrate limit (recommended for best quality)
- **SD (800kbps)**: Standard definition bitrate
- **HD (1500kbps)**: High definition bitrate
- **Full HD (3000kbps)**: Full high definition bitrate

:::tip
Setting this to **Auto** provides the best video quality as it removes upload bitrate constraints, allowing adaptive streaming based on network conditions.
:::

## Recording & Streaming Configuration

:::note
Recording and HLS follow a precedence-based configuration model:

- API/SDK configuration → Highest priority
- Dashboard configuration → Used only when no API values are defined

Any parameters explicitly set via the API will not be overridden by dashboard changes.
:::

### Recording Configuration

Configure recording settings for your meetings, including layout, quality, and auto-start options.

![Recording Settings](https://assets.videosdk.live/images/Screenshot%202026-01-09%20at%2012.01.00%E2%80%AFPM.png)

**Recording Controls:**

- **Enabled**: Toggle to enable/disable recording functionality
- **Enabled Modes**: Choose between Video & Audio or Audio Only recording
- **Start Recording Automatically**: Auto-start recording when meeting begins
- **Layout Style**: Select Grid layout for recording composition
- **Maximum Tiles In Grid**: Set the number of participants visible in grid (1-25)
- **Who to Prioritize**: Choose Active Speaker or Pinned Participant
- **Theme**: Select System Default, Light, or Dark theme for recordings
- **Video Orientation**: Choose between Landscape or Portrait mode
- **Recording Quality**: Select quality level (e.g., HD Medium)
- **Recording Mode**: Choose between Video & Audio or Audio Only

### HLS Streaming Configuration

Configure HTTP Live Streaming (HLS) settings for broadcasting your meetings.

![HLS Streaming Settings](https://assets.videosdk.live/images/Screenshot%202026-01-09%20at%2012.02.18%E2%80%AFPM.png)

**HLS Streaming Controls:**

- **Enabled**: Toggle to enable/disable HLS streaming
- **Enabled Modes**: Choose between Video & Audio or Audio Only streaming
- **Auto Start HLS**: Automatically start HLS when meeting begins
- **Record HLS Stream**: Enable recording of the HLS stream
- **Layout Style**: Select Grid layout for stream composition
- **Maximum Tiles In Grid**: Set the number of participants visible (1-25)
- **Who to Prioritize**: Choose Active Speaker or Pinned Participant
- **Theme**: Select System Default, Light, or Dark theme
- **HLS Orientation**: Choose between Landscape or Portrait mode
- **HLS Quality**: Select streaming quality (e.g., HD Medium)
- **HLS Streaming Mode**: Choose between Video & Audio or Audio Only

### Additional Service Controls

Enable or disable additional services for your API key.

![Service Controls](https://assets.videosdk.live/images/Screenshot%202026-01-09%20at%2012.03.08%E2%80%AFPM.png)

**Available Services:**

- **RTMP Output**: Enable RTMP streaming to external platforms
- **SIP Integration**: Enable SIP (Session Initiation Protocol) integration for telephony
- **Realtime Translation**: Enable real-time language translation in meetings

### Transcription & Summary

Configure transcription and domain whitelisting for your API key.

![Transcription Settings](https://assets.videosdk.live/images/Screenshot%202026-01-09%20at%2012.03.50%E2%80%AFPM.png)

**Transcription Controls:**

- **Enabled**: Toggle to enable/disable transcription and summary features
- **Whitelist Domain**: Add domain restrictions for API key usage
  - Enter domain names in the format `https://domain.name`
  - Click "Add Domain" to whitelist specific domains
  - Only whitelisted domains can use this API key
  - When no domains are whitelisted, the API key can be used from any domain

---

# Console Mode for AI Agents

Console mode allows you to interact with your AI agent directly through the terminal without joining a VideoSDK meeting room. This is particularly useful for:

- Quick testing of agent functionality
- Local development and debugging
- Testing function tools and MCP integrations
- Validating pipeline configurations

## How It Works

When running your agent script in console mode:

1. The agent runs in a terminal-based environment
2. Your microphone input is captured directly through the terminal
3. Agent responses are played through your system audio
4. The full `Cascading Pipeline` or `RealTime Pipeline` runs locally without connecting to a meeting
5. Function tools, MCP integrations, and other features remain fully functional

This makes it easier to verify that audio flows, agent logic, and response generation are working correctly before deploying into a live session.

## Using Console Mode

To use console mode, simply add the `console` argument when running your agent script:

```bash
python main.py console
```
The console will display:

- Agent speech output
- User speech input
- Various latency metrics (STT, TTS, LLM, EOU)
- Pipeline processing information

This flexibility allows you to use the same agent code for both development and production environments.

---

# Agent Session

The `AgentSession` is the central orchestrator that integrates the `Agent`, `Pipeline`, and optional `ConversationFlow` into a cohesive workflow. It manages the complete lifecycle of an agent's interaction within a VideoSDK meeting, handling initialization, execution, and cleanup.

![Agent Session](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_agent_session.png)

## Core Features

- **Component Orchestration:** Unifies agent, pipeline, and conversation flow components.
- **Lifecycle Management:** Handles session start, execution, and cleanup.

## State Management

The `AgentSession` provides comprehensive state tracking for both users and agents, automatically emitting state change events for real-time monitoring.

:::tip Version Requirement
The state management features and enhanced methods (`reply()`, `interrupt()`) are available in versions above v0.0.35.
:::

### User States

- **IDLE** - User is not actively speaking or listening
- **SPEAKING** - User is currently speaking
- **LISTENING** - User is actively listening to the agent

### Agent States

- **STARTING** - Agent is initializing
- **IDLE** - Agent is ready and waiting
- **SPEAKING** - Agent is currently generating speech
- **LISTENING** - Agent is processing user input
- **THINKING** - Agent is processing and generating a response
- **CLOSING** - Agent is shutting down

### State Event Monitoring

State changes are automatically emitted as events that you can listen to:

```python title="main.py"
def on_user_state_changed(data):
    print("User state:", data)

def on_agent_state_changed(data):
    print("Agent state:", data)

session.on("user_state_changed", on_user_state_changed)
session.on("agent_state_changed", on_agent_state_changed)
```

## Constructor Parameters

```python
AgentSession(
    agent: Agent,
    pipeline: Pipeline,
    conversation_flow: Optional[ConversationFlow] = None,
    wake_up: Optional[int] = None
)
```

- [Agent](/ai_agents/core-components/agent): Your custom agent implementation
- Pipeline: The configured processing pipeline (`RealTimePipeline` or `CascadingPipeline`)
- [ConversationFlow](/ai_agents/core-components/conversation-flow): Optional conversation state management

### Wake-Up Call

A wake-up call automatically triggers an action when users are inactive for a specified period of time, helping maintain engagement.

```python title="main.py"
# Configure wake-up timer
session = AgentSession(
    agent=MyAgent(),
    pipeline=pipeline,
    wake_up=10  # Trigger after 10 seconds of inactivity
)

# Set callback function
async def on_wake_up():
    await session.say("Are you still there? How can I help?")

session.on_wake_up = on_wake_up
```

:::note
Important: If a `wake_up` time is provided, you must set a callback function before starting the session. If no `wake_up` time is specified, no timer or callback will be activated.
:::

## Basic Usage

To get an agent running, you initialize an `AgentSession` with your custom `Agent` and a configured `Pipeline`.
The session handles the underlying connection and data flow.

### Example Implementation:

**Real-time Pipeline:**

```python title="main.py"
from videosdk.agents import AgentSession, Agent, WorkerJob, JobContext, RoomOptions
from videosdk.plugins.openai import OpenAIRealtime
from videosdk.agents import RealTimePipeline

class MyAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful meeting assistant.")

    async def on_enter(self):
        await self.session.say("Hello! How can I help you today?")

    def setup_state_monitoring(self):
        def on_user_state_changed(data):
            print(f"User state changed to: {data['state']}")

        def on_agent_state_changed(data):
            print(f"Agent state changed to: {data['state']}")

        self.session.on("user_state_changed", on_user_state_changed)
        self.session.on("agent_state_changed", on_agent_state_changed)

async def start_session(ctx: JobContext):
    model = OpenAIRealtime(model="gpt-4o-realtime-preview")
    pipeline = RealTimePipeline(model=model)
    session = AgentSession(
        agent=MyAgent(),
        pipeline=pipeline
    )

    await ctx.connect()
    await session.start()
    # Session runs until manually stopped or meeting ends

def make_context():
    return JobContext(
        room_options=RoomOptions(
            room_id="your-room-id",
            auth_token="your-auth-token",
            name="Assistant Bot"
        )
    )

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

---

**Cascading Pipeline:**

```python title="main.py"
from videosdk.agents import AgentSession, Agent, WorkerJob, JobContext, RoomOptions
from videosdk.plugins.openai import OpenAISTT, OpenAITTS, OpenAILLM
from videosdk.agents import CascadingPipeline

class MyAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful meeting assistant.")

    async def on_enter(self):
        await self.session.say("Hello! How can I help you today?")

    def setup_state_monitoring(self):
        def on_user_state_changed(data):
            print(f"User state changed to: {data['state']}")

        def on_agent_state_changed(data):
            print(f"Agent state changed to: {data['state']}")

        self.session.on("user_state_changed", on_user_state_changed)
        self.session.on("agent_state_changed", on_agent_state_changed)

async def start_session(ctx: JobContext):
    # Configure individual components
    stt = OpenAISTT(model="whisper-1")
    llm = OpenAILLM(model="gpt-4")
    tts = OpenAITTS(model="tts-1", voice="alloy")

    pipeline = CascadingPipeline(
        stt=stt,
        llm=llm,
        tts=tts
    )
    session = AgentSession(
        agent=MyAgent(),
        pipeline=pipeline
    )

    await ctx.connect()
    await session.start()
    # Session runs until manually stopped or meeting ends

def make_context():
    return JobContext(
        room_options=RoomOptions(
            room_id="your-room-id",
            auth_token="your-auth-token",
            name="Assistant Bot"
        )
    )

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

## Development and Testing Features

The `AgentSession` supports several modes for development, testing, and user engagement:

### Playground Mode

Playground mode provides a web-based interface for testing your agent without building a separate client application.

#### Usage

To activate playground mode, simply set `playground: True` in your RoomOptions for JobContext.
```python title="main.py"
from videosdk.agents import RoomOptions, JobContext, WorkerJob

async def entrypoint(ctx: JobContext):
    # Your agent implementation here
    # This is where you create your pipeline, agent, and session
    pass

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",
        name="Test Agent",
        playground=True  # Enable playground mode
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
```

When enabled, the playground URL is automatically displayed in your terminal for easy access.

:::note
Note: Playground mode is designed for development and testing purposes. For production deployments, ensure playground mode is disabled to maintain security and performance.
:::

### Console Mode

Console mode allows you to test your agent directly in the terminal using your microphone and speakers, without joining a VideoSDK meeting.

#### Usage

To use console mode, simply add the `console` argument when running your agent script:

```bash
python main.py console
```
The console will display:

- Agent speech output
- User speech input
- Various latency metrics (STT, TTS, LLM, EOU)
- Pipeline processing information

This flexibility allows you to use the same agent code for both development and production environments.

## Session Lifecycle Management

The `AgentSession` provides methods to control the agent's presence and behavior in the meeting.

- [start(**kwargs)](https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=to%20the%20agent.-,async%20def%20start,self%2C%20**kwargs%3A%C2%A0Any)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE): Initializes and starts the agent session. Sets up MCP tools, metrics collection, and the pipeline, and calls the agent's `on_enter()` hook.
- [say(message: str)](https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=the%20agent%20session.-,async%20def%20say,self%2C%20message%3A%C2%A0str)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE): Sends a message from the agent to the meeting participants, allowing the agent to communicate with users in the meeting.
- [close()](https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=Methods-,async%20def%20close,self)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE): Gracefully shuts down the session. Finalizes metrics collection, cancels the wake-up timer, and calls the agent's `on_exit()` hook.
- [leave()](https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=the%20agent%20session.-,async%20def%20leave,self)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE): Leaves the meeting without full session cleanup, providing a quick exit option while maintaining session state.
- [reply(instructions, wait_for_playback)](https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=the%20agent%20session.-,async%20def%20reply,self%2C%20instructions%3A%C2%A0str%2C%20wait_for_playback%3A%C2%A0bool%C2%A0%3D%C2%A0True)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE): Generates an agent response using the given instructions and the current chat context. Includes playback control and prevents concurrent calls.
- [interrupt()](https://docs.videosdk.live/agent-sdk-reference/agents/agent_session#:~:text=the%20agent%20session.-,async%20def%20interrupt,self)%20%E2%80%91%3E%C2%A0None,-EXPAND%20SOURCE%20CODE): Immediately interrupts the agent's current operation, stopping speech generation and LLM processing for emergency stops or user interruptions.

### Example of Managing the Lifecycle:

```python title="main.py"
import asyncio
from videosdk.agents import AgentSession, Agent, WorkerJob, JobContext, RoomOptions, function_tool
from videosdk.plugins.openai import OpenAIRealtime
from videosdk.agents import RealTimePipeline

class MyAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful meeting assistant.")

    # LIFECYCLE: Agent entry point - called when session starts
    async def on_enter(self):
        await self.session.say("Hello! How can I help you today?")

    # LIFECYCLE: Agent exit point - called when session ends
    async def on_exit(self):
        print("Agent is leaving the session")

    @function_tool
    async def provide_summary(self) -> str:
        """Provide a conversation summary using the reply method"""
        await self.session.reply("Let me summarize our conversation so far...")
        return "Summary provided"

    @function_tool
    async def stop_speaking(self) -> str:
        """Emergency stop functionality"""
        await self.session.interrupt()
        return "Agent stopped successfully"

async def run_agent_session(ctx: JobContext):
    # LIFECYCLE STAGE 1: Session Creation
    model = OpenAIRealtime(model="gpt-4o-realtime-preview")
    pipeline = RealTimePipeline(model=model)
    session = AgentSession(agent=MyAgent(), pipeline=pipeline)

    try:
        # LIFECYCLE STAGE 2: Connection Establishment
        await ctx.connect()
        # LIFECYCLE STAGE 3: Session Start
        await session.start()
        # LIFECYCLE STAGE 4: Session Running
        await asyncio.Event().wait()
    finally:
        # LIFECYCLE STAGE 5: Session Cleanup
        await session.close()
        # LIFECYCLE STAGE 6: Context Shutdown
        await ctx.shutdown()

# LIFECYCLE STAGE 0: Context Creation
def make_context() -> JobContext:
    room_options = RoomOptions(room_id="your-room-id", auth_token="your-token")
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    # LIFECYCLE ORCHESTRATION: Worker Job Management
    # Creates and starts the worker job that manages the entire lifecycle
    job = WorkerJob(entrypoint=run_agent_session, jobctx=make_context)
    job.start()
```

## Examples - Try Out Yourself

We have examples to get you started. Try them out, talk to the agents, and customize them to your needs.
- [Wakeup call Example](https://github.com/videosdk-live/agents/blob/main/examples/wakeup_call.py): Agent Session with wakeup call - [Agent to agent Example](https://github.com/videosdk-live/agents/tree/main/examples/a2a): Agent Session with Customer and Loan agent --- # Agent The `Agent` class is the base class for defining AI agent behavior and capabilities. It provides the foundation for creating intelligent conversational agents with support for function tools, MCP servers, and advanced lifecycle management. ![Agent](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_agent.png) ## Basic Usage ### Simple Agent This is how you can initialize a simple agent with the `Agent` class, where `instructions` defines how the agent should behave. ```python title="main.py" from videosdk.agents import Agent class MyAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful assistant." ) ``` ## Agent with Function Tools Function tools allow your agent to perform actions and interact with external services, extending its capabilities beyond simple conversation. You can register tools that are defined either outside or inside your agent class. ### External Tools External tools are defined as standalone functions and are passed into the agent's constructor via the tools list. This is useful for sharing common tools across multiple agents. 
```python title="main.py" from videosdk.agents import Agent, function_tool # External tool defined outside the class @function_tool(description="Get weather information") def get_weather(location: str) -> str: """Get weather information for a specific location.""" # Weather logic here return f"Weather in {location}: Sunny, 72°F" class WeatherAgent(Agent): def __init__(self): super().__init__( instructions="You are a weather assistant.", tools=[get_weather] # Register the external tool ) ``` ### Internal Tools Internal tools are defined as methods within your agent class and are decorated with `@function_tool`. This is useful for logic that is specific to the agent and needs access to its internal state (`self`). ```python title="main.py" from videosdk.agents import Agent, function_tool class FinanceAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful financial assistant." ) self.portfolio = {"AAPL": 10, "GOOG": 5} @function_tool def get_portfolio_value(self) -> dict: """Get the current value of the user's stock portfolio.""" # In a real scenario, you'd fetch live stock prices # This is a simplified example return {"total_value": 5000, "holdings": self.portfolio} ``` ## Agent with MCP Server `MCPServerStdio` enables your agent to communicate with external processes via standard input/output streams. This is ideal for integrating complex, standalone Python scripts or other local executables as tools. 
```python title="main.py" import sys from pathlib import Path from videosdk.agents import Agent, MCPServerStdio # Path to your external Python script that runs the MCP server mcp_server_path = Path(__file__).parent / "mcp_server_script.py" class MCPAgent(Agent): def __init__(self): super().__init__( instructions="You are an assistant that can leverage external tools via MCP.", mcp_servers=[ MCPServerStdio( executable_path=sys.executable, process_arguments=[str(mcp_server_path)], session_timeout=30 ) ] ) ``` ## Agent Lifecycle and Methods The `Agent` class provides lifecycle hooks and methods to manage state and behavior at critical points in the agent's session. ### Lifecycle Hooks These methods are designed to be overridden in your custom agent class to implement specific behaviors. - `async def on_enter(self) -> None`: Called once when the agent successfully joins the meeting. This is the ideal place for introductions or initial actions, such as greeting participants. - `async def on_exit(self) -> None`: Called when the agent is about to exit the meeting. Use this for cleanup tasks or for saying goodbye. ```python title="main.py" from videosdk.agents import Agent class LifecycleAgent(Agent): async def on_enter(self): print("Agent has entered the meeting.") await self.session.say("Hello everyone! I'm here to help.") async def on_exit(self): print("Agent is exiting the meeting.") await self.session.say("It was a pleasure assisting you. Goodbye!") ``` ## Human in the Loop (HITL) Human in the Loop enables AI agents to escalate specific queries to human operators for review and approval. This implementation uses Discord as the human interface through an MCP server, allowing seamless handoffs between AI automation and human oversight. 
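The escalation decision itself is ordinary application logic that runs before any Discord round-trip. A minimal, SDK-independent sketch of such a policy (the keyword list and confidence threshold below are illustrative assumptions, not part of the framework):

```python
# Hypothetical escalation policy: hand off to a human when the topic is
# sensitive or the model's self-reported confidence falls below a threshold.
ESCALATION_KEYWORDS = {"discount", "refund", "pricing", "legal"}

def needs_human(query: str, confidence: float, threshold: float = 0.6) -> bool:
    """Return True when the query should be escalated to a human operator."""
    words = {w.strip(".,!?").lower() for w in query.split()}
    return confidence < threshold or bool(words & ESCALATION_KEYWORDS)
```

In the Discord-based flow, a check like this would gate the call to the `ask_human` MCP tool.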
### Use Cases - **Discount Requests**: AI escalates pricing queries to human sales agents - **Complex Support**: Technical issues requiring human expertise - **Policy Decisions**: Requests that need human approval or clarification - **Escalation Scenarios**: Situations where AI confidence is low ### Implementation The HITL pattern combines the Agent's MCP server capability with a Discord-based human interface: **Agent Configuration:** ```python title="main.py" from videosdk.agents import Agent, MCPServerStdio, CascadingPipeline, AgentSession, JobContext, RoomOptions, WorkerJob from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.anthropic import AnthropicLLM from videosdk.plugins.google import GoogleTTS from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector import pathlib import sys import os from typing import Optional class CustomerAgent(Agent): def __init__(self, ctx: Optional[JobContext] = None): current_dir = pathlib.Path(__file__).parent discord_mcp_server_path = current_dir / "discord_mcp_server.py" super().__init__( instructions="You are a customer-facing agent for VideoSDK. You have access to various tools to assist with customer inquiries, provide support, and handle tasks. When a user asks for a discount percentage, always use the appropriate tool to retrieve and provide the accurate answer from your superior human agent.", mcp_servers=[ MCPServerStdio( executable_path=sys.executable, process_arguments=[str(discord_mcp_server_path)], session_timeout=30 ), ] ) self.ctx = ctx async def on_enter(self) -> None: """Called when the agent first joins the meeting""" await self.session.say("Hi! I'm your VideoSDK customer support agent. How can I help you today?") async def on_exit(self) -> None: """Called when the agent exits the meeting""" await self.session.say("Thank you for contacting VideoSDK support. 
Have a great day!") # Pipeline configuration integrated into the main setup def create_pipeline() -> CascadingPipeline: """Create and configure the cascading pipeline with all components""" return CascadingPipeline( stt=DeepgramSTT(api_key=os.getenv("DEEPGRAM_API_KEY")), llm=AnthropicLLM(api_key=os.getenv("ANTHROPIC_API_KEY")), tts=GoogleTTS(api_key=os.getenv("GOOGLE_API_KEY")), vad=SileroVAD(), turn_detector=TurnDetector(threshold=0.8) ) async def start_session(ctx: JobContext): """Main entry point that creates agent, pipeline, and starts the session""" # Create the pipeline pipeline = create_pipeline() # Create the agent with context agent = CustomerAgent(ctx=ctx) # Create the agent session session = AgentSession( agent=agent, pipeline=pipeline ) try: # Connect to the room await ctx.connect() # Start the agent session await session.start() # Keep running until interrupted import asyncio await asyncio.Event().wait() finally: # Clean up resources await session.close() await ctx.shutdown() def make_context() -> JobContext: """Create the job context with room configuration""" room_options = RoomOptions( room_id=os.getenv("VIDEOSDK_ROOM_ID", "your-room-id"), auth_token=os.getenv("VIDEOSDK_AUTH_TOKEN"), name="VideoSDK Customer Agent", playground=True ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` --- **Discord MCP Server:** ```python title="discord_mcp_server.py" import asyncio import os from mcp.server.fastmcp import FastMCP import discord from discord.ext import commands class DiscordHuman: def __init__(self, user_id: int, channel_id: int, bot_token: str): self.user_id = user_id self.channel_id = channel_id self.bot_token = bot_token self.bot = commands.Bot(command_prefix="!", intents=discord.Intents.all()) self.response_future = None self.setup_bot_events() def setup_bot_events(self): @self.bot.event async def on_ready(): print(f'{self.bot.user} has connected to 
Discord!') @self.bot.event async def on_message(message): if (message.author.id == self.user_id and message.channel.id in [thread.id for thread in self.bot.get_all_channels() if hasattr(thread, 'parent')] and self.response_future and not self.response_future.done()): self.response_future.set_result(message.content) async def start_bot(self): """Start the Discord bot""" await self.bot.start(self.bot_token) async def ask(self, question: str) -> str: if not self.bot.is_ready(): return "❌ Discord bot is not ready" try: channel = self.bot.get_channel(self.channel_id) if not channel: return "❌ Channel not found" thread = await channel.create_thread( name=question[:100], type=discord.ChannelType.public_thread ) await thread.send(f"<@{self.user_id}> {question}") self.response_future = asyncio.get_event_loop().create_future() try: response = await asyncio.wait_for(self.response_future, timeout=600) return response except asyncio.TimeoutError: return "⏱️ Timed out waiting for a human response" except Exception as e: return f"❌ Error: {str(e)}" # Initialize Discord human instance discord_human = DiscordHuman( user_id=int(os.getenv("DISCORD_USER_ID")), channel_id=int(os.getenv("DISCORD_CHANNEL_ID")), bot_token=os.getenv("DISCORD_TOKEN") ) # MCP Server Setup mcp = FastMCP("HumanInTheLoopServer") @mcp.tool(description="Ask a human agent via Discord for a specific user query such as discount percentage, etc.") async def ask_human(question: str) -> str: """Ask a human agent via Discord for assistance""" return await discord_human.ask(question) async def main(): """Main function to start both the Discord bot and MCP server""" # Start Discord bot in background bot_task = asyncio.create_task(discord_human.start_bot()) # Wait a moment for bot to initialize await asyncio.sleep(2) # Start MCP server await mcp.run() if __name__ == "__main__": asyncio.run(main()) ``` --- **Environment Variables:** Set the following environment variables: ```bash title=".env" 
DISCORD_TOKEN=your_discord_bot_token
DISCORD_USER_ID=human_operator_user_id
DISCORD_CHANNEL_ID=channel_id_for_escalations
DEEPGRAM_API_KEY=your_deepgram_key
ANTHROPIC_API_KEY=your_anthropic_key
GOOGLE_API_KEY=your_google_key
VIDEOSDK_AUTH_TOKEN=your_videosdk_token
VIDEOSDK_ROOM_ID=your_room_id
```

The Discord MCP server provides the `ask_human` tool, which creates Discord threads for human operator responses. This leverages the same MCP integration pattern shown in the previous section.

A complete implementation with full source code, setup instructions, and configuration examples is available in the [VideoSDK Agents GitHub repository](https://github.com/videosdk-live/agents/tree/main/examples/human_in_the_loop).

## Examples - Try Out Yourself

Check out the examples of function tool usage and MCP server integration:

- [Function Tool](https://github.com/videosdk-live/agents/blob/main/examples/test_cascading_pipeline.py): Implement an agent with internal and external tools
- [MCP Server](https://github.com/videosdk-live/agents/blob/main/examples/mcp_example.py): Implement an agent with MCP server integration
- [Human in the Loop](https://github.com/videosdk-live/agents/tree/main/examples/human_in_the_loop): Escalate queries to human operators via Discord

---

# Avatar

Avatars add a visual, human-like presence to your AI agents, creating more engaging and natural interactions. The VideoSDK Agents framework supports virtual avatars through the Simli integration, providing lifelike video representations that sync with your agent's speech.
![Avatar](https://cdn.videosdk.live/website-resources/docs-resources/voice_agent_avatar.png) ## Overview Avatar functionality enables your AI agents to: - **Visual Presence**: Display a human-like avatar that represents your agent - **Lip Sync**: Automatically synchronize avatar mouth movements with speech - **Real-time Rendering**: Generate avatar video in real-time during conversations - **Customizable Appearance**: Choose from different avatar faces and styles - **Seamless Integration**: Works with both CascadingPipeline and RealTimePipeline ## What Avatars Enable With avatar capabilities, your agents can: - Provide a more human and approachable interface - Increase user engagement through visual interaction - Create branded agent personalities with custom appearances - Enhance accessibility through visual communication cues - Build trust through consistent visual representation ## Simli Avatar Integration ### Basic Setup The Simli avatar integration provides high-quality virtual avatars with real-time lip synchronization. 
```python title="main.py" from videosdk.plugins.simli import SimliAvatar, SimliConfig # Configure your avatar avatar_config = SimliConfig( apiKey="your-simli-api-key", faceId="0c2b8b04-5274-41f1-a21c-d5c98322efa9", # Default face syncAudio=True, handleSilence=True ) avatar = SimliAvatar(config=avatar_config) ``` ### Avatar Configuration Options Customize your avatar's behavior and appearance: ```python title="main.py" from videosdk.plugins.simli import SimliConfig config = SimliConfig( apiKey="your-simli-api-key", faceId="your-custom-face-id", # Choose avatar appearance syncAudio=True, # Enable lip sync handleSilence=True, # Manage silent periods maxSessionLength=1800, # 30 minutes max session maxIdleTime=300 # 5 minutes idle timeout ) ``` ## Pipeline Integration **Cascading Pipeline:** Add avatar to your cascading pipeline setup: ```python title="main.py" from videosdk.agents import CascadingPipeline, AgentSession from videosdk.plugins.simli import SimliAvatar, SimliConfig # Configure avatar avatar_config = SimliConfig(apiKey="your-simli-api-key") avatar = SimliAvatar(config=avatar_config) # Create pipeline with avatar pipeline = CascadingPipeline( stt=your_stt_provider, llm=your_llm_provider, tts=your_tts_provider, avatar=avatar # Add avatar to pipeline ) ``` --- **Real-time Pipeline:** Integrate avatar with real-time models: ```python title="main.py" from videosdk.agents import RealTimePipeline from videosdk.plugins.simli import SimliAvatar, SimliConfig from videosdk.plugins.openai import OpenAIRealtime # Configure avatar avatar = SimliAvatar(config=SimliConfig(apiKey="your-api-key")) # Configure real-time model model = OpenAIRealtime(model="gpt-4o-realtime-preview") # Create pipeline with avatar pipeline = RealTimePipeline( model=model, avatar=avatar ) ``` :::info You can also specify the avatar in your room configuration: ```python title="main.py" from videosdk.agents import JobContext, RoomOptions def make_context(): avatar = 
SimliAvatar(config=SimliConfig(apiKey="your-api-key")) return JobContext( room_options=RoomOptions( room_id="your-room-id", auth_token="your-auth-token", name="Avatar Agent", avatar=avatar # Specify avatar in room options ) ) ``` ::: ## Example - Try Out Yourself - [AI Avatar Example](https://github.com/videosdk-live/agents/tree/main/examples/avatar): Complete avatar implementation with weather assistant functionality --- # Background Audio The Background Audio feature enables voice agents to play audio during conversations, enhancing user experience with ambient sounds and processing feedback. There are two ways to set the audio: 1. **Thinking Audio:** Plays automatically during agent processing (e.g., keyboard typing sounds) 2. **Background Audio:** Plays on-demand for ambient music or soundscapes **Thinking Audio:** ![Thinking Audio](https://assets.videosdk.live/images/thinking_audio.png) --- **Background Audio:** ![Background Audio](https://assets.videosdk.live/images/background-audio.png) ## Getting Started ### Enable Background Audio ```python from videosdk.agents import RoomOptions, JobContext room_options = RoomOptions( room_id="your-room-id", name="My Agent", # highlight-start background_audio=True # Enable background audio support #highlight-end ) context = JobContext(room_options=room_options) ``` ### Agent Methods **1. Set Thinking Audio** `set_thinking_audio()`: Configures audio that plays automatically while the agent processes responses. **Parameters:** - `file (str, optional)`: Path to custom WAV audio file. If not provided, uses built-in `agent_keyboard.wav` - `volume (float, optional)`: Volume of the audio. Default: `0.3` **Example:** ```python class MyAgent(Agent): def __init__(self): super().__init__(instructions="...") # highlight-start # Use default keyboard sound self.set_thinking_audio() # Or use custom audio # self.set_thinking_audio(file="path/to/custom.wav") # highlight-end ``` **2. 
Play Background Audio** `play_background_audio()`: Starts playing background audio during the conversation. **Parameters:** - `file (str, optional)`: Path to custom WAV audio file. If not provided, uses built-in `classical.wav` - `looping (bool, optional)`: Whether to loop the audio. Default: `False` - `override_thinking (bool, optional)`: Whether to stop thinking audio when background audio starts. Default: `True` - `volume (float, optional)`: Volume of the audio. Default: `1.0` **Example:** ```python @function_tool async def play_music(self): """Plays background music""" # highlight-start await self.play_background_audio( looping=True, override_thinking=False ) # highlight-end return "Music started" ``` **3. Stop Background Audio** `stop_background_audio()`: Stops currently playing background audio. **Example:** ```python @function_tool async def stop_music(self): """Stops background music""" # highlight-start await self.stop_background_audio() # highlight-end return "Music stopped" ``` ## Complete Example ```python title="main.py" from videosdk.agents import ( Agent, AgentSession, CascadingPipeline, WorkerJob, ConversationFlow, JobContext, RoomOptions, function_tool ) from videosdk.plugins.openai import OpenAILLM, OpenAITTS from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector class MusicAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful assistant. Use control_music to play or stop background music." ) #highlight-start # Enable thinking audio with default keyboard sound self.set_thinking_audio() #highlight-end async def on_enter(self): await self.session.say("Hello! Ask me to play music.") async def on_exit(self): await self.session.say("Goodbye! Hope you enjoyed the music.") @function_tool async def control_music(self, action: str): """ Controls background music. 
:param action: 'play' to start music, 'stop' to end it """ if action == "play": #highlight-start await self.play_background_audio( override_thinking=True, looping=True ) #highlight-end return "Music started" elif action == "stop": #highlight-start await self.stop_background_audio() #highlight-end return "Music stopped" return "Invalid action" async def entrypoint(ctx: JobContext): agent = MusicAgent() pipeline = CascadingPipeline( stt=DeepgramSTT(), llm=OpenAILLM(), tts=OpenAITTS(), vad=SileroVAD(), turn_detector=TurnDetector() ) session = AgentSession( agent=agent, pipeline=pipeline, conversation_flow=ConversationFlow(agent) ) await ctx.run_until_shutdown(session=session) def make_context(): return JobContext( room_options=RoomOptions( room_id="", name="Music Agent", #highlight-start background_audio=True # Required! #highlight-end ) ) if __name__ == "__main__": job = WorkerJob(entrypoint=entrypoint, jobctx=make_context) job.start() ``` ## Pipeline Support Background audio works with both pipeline types: ### Cascading Pipeline - Thinking audio plays automatically during LLM processing - Background audio can be controlled via agent methods - Audio stops automatically when agent speaks ### RealTime Pipeline - Full background audio support with streaming models - Automatic lifecycle management during conversation turns ## Audio Behavior | Feature | Thinking Audio | Background Audio | |---------|---------------|------------------| | **Trigger** | Automatic during processing | Manual via `play_background_audio()` | | **Default File** | `agent_keyboard.wav` | `classical.wav` | | **Typical Duration** | Short (during LLM call) | Long/continuous | | **Looping** | Optional | Recommended (`looping=True`) | | **User Control** | No | Yes (via function tools) | | **Stops When** | Agent speaks | Agent speaks or `stop_background_audio()` | ## Audio File Requirements - **Format:** WAV (`.wav`) - **Recommended:** 16-bit PCM, 16kHz sample rate, mono channel - **Built-in files:** - 
`agent_keyboard.wav`: Default thinking sound
- `classical.wav`: Default background music

## Best Practices

1. **Always enable in RoomOptions:** Set `background_audio=True` before using audio methods
2. **Use `override_thinking=True`:** When playing music, to avoid overlapping sounds
3. **Loop background audio:** Set `looping=True` for continuous ambient sounds
4. **Control via function tools:** Let users control music through natural language
5. **Clean audio files:** Use high-quality WAV files to avoid distortion

## Common Use Cases

- **Music player agent:** Control playback through conversation
- **Ambient soundscapes:** Create atmosphere during interactions
- **Processing feedback:** Custom thinking sounds for different agent personalities
- **Hold music:** Play audio while the agent performs long operations

## Example - Try It Yourself

- [Background Audio example](https://github.com/videosdk-live/agents/blob/main/examples/background_audio.py): Implement and experience the background audio functionality yourself

## FAQs

### Troubleshooting

| Issue | Solution |
|-------|----------|
| Audio not playing | Verify `background_audio=True` in `RoomOptions` |
| Audio quality issues | Use WAV format with 16-bit PCM encoding |
| Audio doesn't stop | Ensure `stop_background_audio()` is called properly |
| Overlapping sounds | Use `override_thinking=True` when playing background audio |

---

# Call Transfer

Call Transfer lets your AI Agent move an ongoing SIP call to another phone number without ending the current session. Instead of making the caller hang up and dial a new number, the agent can automatically route the call.

## How Call Transfer Works

- The agent evaluates the user's intent to determine when a call transfer is required and then triggers the function tool.
- When the function tool is triggered, it tells the system to move the call to another phone number.
- The ongoing SIP call is forwarded to the new number instantly, without disconnecting or redialing.
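In the running system, the intent-evaluation step above is performed by the LLM. Purely as an illustration of the decision it makes, it can be approximated by a simple keyword check (the phrase list here is an assumption for the example, not part of the SDK):

```python
# Illustrative stand-in for the LLM's intent evaluation that decides
# when the transfer_call function tool should be invoked.
TRANSFER_PHRASES = ("transfer", "speak to a human", "another department", "connect me")

def wants_transfer(utterance: str) -> bool:
    """Return True when the caller's utterance signals a transfer request."""
    text = utterance.lower()
    return any(phrase in text for phrase in TRANSFER_PHRASES)
```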
## Trigger Call Transfer

To set up incoming call handling, outbound calling, and routing rules, check out the [Quick Start Example](https://docs.videosdk.live/telephony/ai-telephony-agent-quick-start#part-2-connect-your-agent-to-the-phone-network).

```python title="main.py"
import os

from videosdk.agents import Agent, function_tool

class CallTransferAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a call transfer agent that helps transfer an ongoing call to a new number. "
                "Use the transfer_call tool to transfer the call."
            ),
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello, how can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye, thank you for calling!")

    @function_tool
    async def transfer_call(self) -> None:
        """Transfer the call to the configured number"""
        token = os.getenv("VIDEOSDK_AUTH_TOKEN")
        transfer_to = os.getenv("CALL_TRANSFER_TO")
        return await self.session.call_transfer(token, transfer_to)
```

## Example - Try It Yourself

- [Call Transfer Example](https://github.com/videosdk-live/agents/blob/main/examples/call_transfer.py): Check out and implement the Call Transfer example from GitHub.

---

# Cascading Pipeline

The `Cascading Pipeline` component provides a flexible, modular approach to building AI agents by allowing you to mix and match different components for Speech-to-Text (STT), Large Language Models (LLM), Text-to-Speech (TTS), Voice Activity Detection (VAD), and Turn Detection.

## Core Architecture

The pipeline is composed of five key stages that work in sequence to handle a conversation:

- **VAD (Voice Activity Detection)** - Detects the presence of human speech in the audio stream to know when to start processing.
- **STT (Speech-to-Text)** - Converts the detected speech from audio into a text transcript.
- **LLM (Large Language Model)** - Takes the text transcript as input, processes it, and generates a meaningful response.
- **TTS (Text-to-Speech)** - Converts the LLM's text response back into audible speech.
- **Turn Detection** - Manages the back-and-forth of the conversation, determining when one speaker has finished and another can begin.

![Cascading Pipeline Architecture](https://cdn.videosdk.live/website-resources/docs-resources/videosdk_casading_pipeline.png)

## Basic Usage

### Simple Pipeline

Here is the most basic setup, combining STT, LLM, and TTS components. The SDK will use default configurations if no specific settings are provided.

```python title="main.py"
from videosdk.agents import CascadingPipeline
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

pipeline = CascadingPipeline(
    stt=DeepgramSTT(),
    llm=OpenAILLM(),
    tts=ElevenLabsTTS(),
    vad=SileroVAD(),
    turn_detector=TurnDetector()
)
```

## Key Features:

- **Modular Component Selection** - Choose different providers for each component
- **Flexible Configuration** - Mix and match STT, LLM, TTS, VAD, and Turn Detection
- **Custom Processing** - Add custom processing for STT and LLM outputs
- **Provider Agnostic** - Support for multiple AI service providers
- **Advanced Control** - Fine-tune each component independently

## Advanced Configuration

You can fine-tune the behavior of each component by passing specific parameters during initialization.
```python title="main.py"
from videosdk.agents import CascadingPipeline
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

stt = DeepgramSTT(
    model="nova-2",
    language="en",
    punctuate=True,
    diarize=True
)
llm = OpenAILLM(
    model="gpt-4o",
    temperature=0.7,
    max_tokens=1000
)
tts = ElevenLabsTTS(
    model="eleven_flash_v2_5",
    voice_id="21m00Tcm4TlvDq8ikWAM"
)
vad = SileroVAD(
    threshold=0.35,
    min_silence_duration=0.5
)
turn_detector = TurnDetector(
    threshold=0.8,
    min_turn_duration=1.0
)

pipeline = CascadingPipeline(stt=stt, llm=llm, tts=tts, vad=vad, turn_detector=turn_detector)
```

## Dynamic Component Changes

The pipeline supports runtime component swapping:

```python
# Change components during runtime
await pipeline.change_component(
    stt=new_stt_provider,
    llm=new_llm_provider,
    tts=new_tts_provider
)
```

## Plugin Ecosystem

There are multiple plugins available for STT, LLM, and TTS. Check them out here:

- [STT](https://docs.videosdk.live/ai_agents/plugins/llm/openai): Learn more about other STT plugins

## Plugin Development

### Creating Custom Plugins

To create custom plugins, follow the [plugin development guide ↗](https://github.com/videosdk-live/agents/blob/main/BUILD_YOUR_OWN_PLUGIN.md). Key requirements include:

- Inherit from the correct base class (`STT`, `LLM`, or `TTS`)
- Implement all abstract methods
- Handle errors consistently using `self.emit("error", message)`
- Clean up resources in the `aclose()` method

## Plugin Installation

Install additional plugins as needed:

```bash
# Install specific provider plugins
pip install videosdk-plugins-openai
pip install videosdk-plugins-elevenlabs
pip install videosdk-plugins-deepgram
```

## Best Practices

1. **Component Selection:** Choose providers based on your specific requirements (latency, quality, cost)
2. **Error Handling:** Implement proper error handling and fallback strategies
3. **Resource Management:** Use the `cleanup()` method to properly close components
4. **Configuration Monitoring:** Use `get_component_configs()` for debugging and monitoring
5. **Audio Format:** Ensure your custom plugins handle the 48kHz audio format correctly

## Key Benefits

The Cascading Pipeline offers several advantages over integrated solutions:

- **Multi-language Support** - Use specialized STT for different languages
- **Cost Optimization** - Mix premium and cost-effective services
- **Custom Voice Processing** - Add domain-specific processing logic
- **Performance Optimization** - Choose the fastest providers for each component
- **Compliance Requirements** - Use specific providers for regulatory compliance

## Comparison with Realtime Pipeline

| Feature | Cascading Pipeline | Realtime Pipeline |
| ------------- | ----------------------------------- | ----------------------------- |
| Control | Maximum control over each component | Integrated model control |
| Flexibility | Mix different providers | Single model provider |
| Latency | Higher due to sequential processing | Lower with streaming |
| Customization | Extensive customization options | Limited to model capabilities |
| Complexity | More complex configuration | Simpler setup |

The `Cascading Pipeline` is ideal when you need maximum flexibility and control over each processing stage, while the `Realtime Pipeline` is better for low-latency applications with integrated model providers.

## Examples - Try Out Yourself

We have examples to get you started. Try them out, talk to the agents, and customize them to your needs.

- [Basic Implementation](https://github.com/videosdk-live/agents/blob/main/examples/test_cascading_pipeline.py): Check out the cascading pipeline implementation

---

# Conversation Flow

The `Conversation Flow` component manages the turn-taking logic in AI agent conversations, ensuring smooth and natural interactions.
It is an inheritable class that allows you to inject custom logic into the `Cascading Pipeline`, enabling advanced capabilities like context preservation, dynamic adaptation, and Retrieval-Augmented Generation (RAG) before the final LLM call.

![Conversation Flow](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_conversation_flow.png)

:::note
Conversation Flow is a powerful feature that currently works exclusively with the [Cascading Pipeline ↗](https://docs.videosdk.live/ai_agents/core-components/cascading-pipeline).
:::

## Core Features

The key methods allow you to inject custom logic at different stages of the conversation flow, enabling sophisticated AI agent behaviors while maintaining clean separation of concerns:

### **Core Capabilities**

- **Turn-taking Management:** Control the flow and timing of agent and user turns
- **Context Preservation:** Maintain conversation history and user data across turns (handled automatically)
- **Advanced Flow Control:** Build stateful conversations that can adapt to user input
- **Performance Optimization:** Fine-tune conversation processing for speed and efficiency
- **Error Handling:** Implement robust error recovery and fallback mechanisms

### **Advanced Use Cases**

- **RAG Implementation:** Retrieve relevant documents and context before LLM processing
- **Memory Management:** Store and recall conversation history across sessions
- **Content Filtering:** Apply safety checks and content moderation on input/output
- **Analytics & Logging:** Track conversation metrics and user behavior patterns (built-in metrics integration)
- **Business Logic Integration:** Add domain-specific processing and validation rules
- **Multi-step Workflows:** Implement complex conversation flows with state management
- **Function Tool Execution:** Automatic execution of function tools when requested by the LLM
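The content-filtering use case above can be sketched independently of the SDK: a flow wraps the LLM's streamed chunks and redacts disallowed terms before yielding them onward (e.g., to TTS). Everything in this sketch — `BLOCKED_TERMS`, `mock_llm_stream` — is an illustrative stand-in, not part of the VideoSDK API.

```python
import asyncio
import re
from typing import AsyncIterator

# Illustrative block list -- in practice this might come from a moderation service
BLOCKED_TERMS = ["password", "ssn"]

async def mock_llm_stream() -> AsyncIterator[str]:
    """Stand-in for an LLM response stream: yields chunks as they arrive."""
    for chunk in ["Sure, ", "your password ", "is safe with me."]:
        yield chunk

async def filtered_stream(source: AsyncIterator[str]) -> AsyncIterator[str]:
    """Redact blocked terms from each chunk before yielding it downstream."""
    pattern = re.compile("|".join(map(re.escape, BLOCKED_TERMS)), re.IGNORECASE)
    async for chunk in source:
        yield pattern.sub("[redacted]", chunk)

async def main() -> str:
    # Consume the filtered stream exactly as a pipeline would
    return "".join([c async for c in filtered_stream(mock_llm_stream())])

print(asyncio.run(main()))  # Sure, your [redacted] is safe with me.
```

In a real flow, the same wrapping would happen inside an overridden `run()` around `process_with_llm()`, as shown in the custom-flow examples later in this section.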
## Basic Usage

### Complete Setup with CascadingPipeline

The recommended approach is to use `ConversationFlow` with a `CascadingPipeline`, which handles component configuration automatically:

```python title="main.py"
from videosdk.agents import ConversationFlow, Agent, CascadingPipeline

# First, define your agent
class MyAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant."
        )

    async def on_enter(self):
        # Initialize agent state
        pass

    async def on_exit(self):
        # Cleanup resources
        pass

# Create pipeline and conversation flow
pipeline = CascadingPipeline(stt=my_stt, llm=my_llm, tts=my_tts)
conversation_flow = ConversationFlow(MyAgent())

# Pipeline automatically configures all components
pipeline.set_conversation_flow(conversation_flow)
```

### Constructor Parameters

The ConversationFlow constructor accepts comprehensive configuration options:

```python
ConversationFlow(
    agent: Agent,
    stt: STT | None = None,
    llm: LLM | None = None,
    tts: TTS | None = None,
    vad: VAD | None = None,
    turn_detector: EOU | None = None,
    denoise: Denoise | None = None
)
```

To add custom behavior, you inherit from `ConversationFlow` and override its methods.

## Built-in Methods

### Core Processing Methods

- `process_with_llm()`: Processes the current chat context with the LLM and handles function tool execution automatically.
- `say(message: str)`: Direct TTS synthesis for agent responses.
- `process_text_input(text: str)`: Handle text input for A2A communication, bypassing STT.
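The methods above share one contract: a flow's `run()` and `process_with_llm()` are async generators, and a custom flow composes by iterating one inside the other. The miniature, SDK-free analogue below makes that contract concrete — `fake_llm` is a hypothetical stand-in for `process_with_llm()`, not a real API.

```python
import asyncio
from typing import AsyncIterator

async def fake_llm(prompt: str) -> AsyncIterator[str]:
    """Stand-in for process_with_llm(): streams the response in chunks."""
    for word in f"Echo: {prompt}".split():
        yield word + " "

async def run(transcript: str) -> AsyncIterator[str]:
    """A custom flow forwards (and may transform) each chunk as it arrives."""
    async for chunk in fake_llm(transcript):
        yield chunk.upper()  # per-chunk custom processing

async def main() -> str:
    # Downstream consumers (TTS, in the real pipeline) read chunk by chunk
    return "".join([c async for c in run("hello agent")])

print(asyncio.run(main()))
```

Because chunks are yielded as soon as they are produced, the downstream stage can start speaking before the full response exists — the property the lifecycle hooks and custom flows below all rely on.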
### Lifecycle Hooks

Override these methods to add custom behavior at specific conversation points:

```python
class CustomFlow(ConversationFlow):
    async def on_turn_start(self, transcript: str) -> None:
        """Called when a user turn begins."""
        print(f"User said: {transcript}")

    async def on_turn_end(self) -> None:
        """Called when a user turn ends."""
        print("Turn completed")
```

## Automatic Features

- **Context Management**: The conversation flow automatically manages the agent's chat context. Do not manually add user messages as this will create duplicates.
- **Audio Processing**: Audio data is automatically processed through `send_audio_delta()`, handling denoising, STT, and VAD processing.
- **Interruption Handling**: The system includes sophisticated interruption logic that gracefully handles user interruptions during agent responses.

## Custom Conversation Flows

### RAG (Retrieval-Augmented Generation) Integration

Enhance your agent's knowledge by integrating RAG to retrieve relevant information from external documents and databases.

**Benefits:**

- Access external documents and FAQs
- Reduce hallucinations with real data
- Dynamic context retrieval

```python title="rag_example.py"
class RAGConversationFlow(ConversationFlow):
    async def run(self, transcript: str) -> AsyncIterator[str]:
        # Retrieve relevant context
        context = await self.agent.retrieve_relevant_documents(transcript)

        # Add context to conversation
        if context:
            self.agent.chat_context.add_message(
                role="system",
                content=f"Use this information: {context}"
            )

        # Generate response with enhanced context
        async for response in self.process_with_llm():
            yield response
```

See our [RAG Integration Documentation](../core-components/rag) for complete implementation.

### Implementing Custom Flows

You can create a custom flow by inheriting from `ConversationFlow` and overriding the `run` method. This allows you to intercept the user's transcript, modify it, manage state, and even change the response from the LLM.
```python title="main.py"
from typing import AsyncIterator

from videosdk.agents import ConversationFlow, Agent, ChatRole

class CustomConversationFlow(ConversationFlow):
    def __init__(self, agent):
        super().__init__(agent)
        self.turn_count = 0

    async def run(self, transcript: str) -> AsyncIterator[str]:
        """Override the main conversation loop to add custom logic."""
        self.turn_count += 1

        # You can access and add to the agent's chat context before calling the LLM
        self.agent.chat_context.add_message(role=ChatRole.USER, content=transcript)

        # Process with the standard LLM call
        async for response_chunk in self.process_with_llm():
            # Apply custom processing to the response
            processed_chunk = await self.apply_custom_processing(response_chunk)
            yield processed_chunk

    async def apply_custom_processing(self, chunk: str) -> str:
        """A helper method to modify the LLM's output."""
        if self.turn_count == 1:
            # Prepend a greeting on the first turn
            return f"Hello! {chunk}"
        elif self.turn_count > 5:
            # Offer to summarize after many turns
            return f"This is an interesting topic. To summarize: {chunk}"
        else:
            return chunk
```

### Advanced Turn-Taking Logic

For more complex interactions, you can implement a state machine within your conversation flow to manage different states of the conversation.

```python title="main.py"
class AdvancedTurnTakingFlow(ConversationFlow):
    def __init__(self, agent):
        super().__init__(agent)
        self.conversation_state = "listening"  # Initial state

    async def run(self, transcript: str) -> AsyncIterator[str]:
        """A state-driven conversation loop."""
        if self.conversation_state == "listening":
            # If we were listening, we now process the user's input
            # and transition to the responding state.
            await self.process_user_input(transcript)
            self.conversation_state = "responding"

            async for response_chunk in self.process_with_llm():
                yield response_chunk

            # Once done responding, go back to listening
            self.conversation_state = "listening"

        elif self.conversation_state == "waiting_for_confirmation":
            # Handle a confirmation state
            if "yes" in transcript.lower():
                yield "Great! Proceeding."
                self.conversation_state = "listening"
            else:
                yield "Okay, cancelling."
                self.conversation_state = "listening"

    async def process_user_input(self, transcript: str):
        """Custom logic for processing user input."""
        print(f"Processing user input in state: {self.conversation_state}")
        # Add logic here, e.g., check if the user is asking a question that needs confirmation
        if "delete my account" in transcript.lower():
            self.conversation_state = "waiting_for_confirmation"
```

### Context-Aware Conversations

Maintain conversation history and user preferences to create a personalized and context-aware experience.

```python title="main.py"
import time

class ContextAwareFlow(ConversationFlow):
    def __init__(self, agent):
        super().__init__(agent)
        self.conversation_history = []
        self.current_topic = "general"

    async def run(self, transcript: str) -> AsyncIterator[str]:
        # First, update the context with the new transcript
        await self.update_context(transcript)

        # The agent's chat_context (automatically managed) will be
        # used by process_with_llm() to generate a context-aware response.
        async for response_chunk in self.process_with_llm():
            yield response_chunk

    async def update_context(self, transcript: str):
        """Update history and identify the topic before calling the LLM."""
        self.conversation_history.append({
            'role': 'user',
            'content': transcript,
            'timestamp': time.time()
        })
        await self.identify_topic(transcript)

        # Add topic-specific context (system messages are safe to add)
        if hasattr(self, 'current_topic'):
            self.agent.chat_context.add_message(
                role=ChatRole.SYSTEM,
                content=f"System note: The user is asking about {self.current_topic}."
            )

    async def identify_topic(self, transcript: str):
        """A simple topic identification logic."""
        if "weather" in transcript.lower():
            self.current_topic = "weather"
        elif "finance" in transcript.lower():
            self.current_topic = "finance"
```

## Performance Optimization

To ensure the best user experience, consider the following optimization strategies:

- **Efficient Context**: Keep the context provided to the LLM concise. Summarize earlier parts of the conversation to reduce token count and improve LLM response time.
- **Asynchronous Operations**: When performing RAG or calling external APIs for data, ensure the operations are fully asynchronous (async/await) to avoid blocking the event loop.
- **Caching**: Cache frequently accessed data (e.g., from a database or RAG store) to reduce lookup latency on subsequent turns.
- **Streaming**: The `run` method returns an `AsyncIterator`. Process and yield response chunks as soon as they are available from the LLM to minimize perceived latency for the user.

## Examples - Try Out Yourself

We have examples to get you started. Try them out, talk to the agent, and customize them to your needs.

- [Understand conversation flow](https://youtu.be/ZGqmIu-tE18?feature=shared): YouTube video explaining conversation flow

---

# Conversational Graph

The **Conversational Graph** is a powerful tool that allows you to define complex, structured conversation flows for your AI agents.
Instead of relying solely on an LLM's inherent reasoning, which can sometimes be unpredictable, you can use a graph-based approach to guide the conversation through specific states and transitions.

## Installation

To use the Conversational Graph, you need to install the `videosdk-conversational-graph` package.

```bash
pip install videosdk-conversational-graph
```

:::note
Check out the latest version of [videosdk-conversational-graph](https://pypi.org/project/videosdk-conversational-graph/) on PyPI.
:::

## Core Concepts

The Conversational Graph is built around a few key concepts:

1. **ConversationalGraph**: The main object that manages the states and transitions.
2. **State**: A specific point in the conversation (e.g., "Greeting", "Asking for Name"). Each state has instructions for the agent.
3. **Transition**: Logic that dictates how the agent moves from one state to another based on user input or collected data.

## Example: Loan Application

Let's walk through a complete example of building a Loan Application agent. This agent will guide the user through selecting a loan type (Personal, Home, or Car) and collecting the necessary details.

![Conversational Graph Loan Application](https://assets.videosdk.live/images/conversational-Graph.png)

### Step 1: Define the Data Model

First, define the data you want to collect using `ConversationalDataModel`. This ensures the agent knows exactly what information to extract.

```python title="main.py"
from pydantic import Field
from conversational_graph import ConversationalGraph, ConversationalDataModel

class LoanFlow(ConversationalDataModel):
    loan_type: str = Field(None, description="Type of loan: personal, home, car")
    annual_income: int = Field(None, description="Annual income of the applicant in INR")
    credit_score: int = Field(None, description="Credit score of the applicant. Must be between 300 and 850")
    property_value: int = Field(None, description="Value of the property for home loan in INR")
    vehicle_price: int = Field(None, description="Price of the vehicle for car loan in INR")
    loan_amount: int = Field(None, description="Desired loan amount in INR. MUST be greater than ₹11 lakh for approval")
```

### Step 2: Initialize the Graph

Create an instance of `ConversationalGraph` and pass your data model.

```python title="main.py"
loan_application = ConversationalGraph(
    name="Loan Application",
    DataModel=LoanFlow,
    off_topic_threshold=5
)
```

### Step 3: Define States

Define the various states of your conversation. Each state has a `name` and `instruction` that tells the agent what to do in that state. You can also define specific `tools` available to the agent within that state.

```python title="main.py"
# Start Greeting
q0 = loan_application.state(
    name="Greeting",
    instruction="Welcome user and start the conversation about loan application. Ask if they are ready to apply for a loan.",
)

# Loan Type Selection
q1 = loan_application.state(
    name="Loan Type Selection",
    instruction="Ask user to select loan type. We only offer personal loan, home loan, and car loan at the moment.",
)

# highlight-start
# Tool for state q2
submit_loan_application = PreDefinedTool().http_tool(HttpToolRequest(
    name="submit_loan_application",
    description="Called when loan request is approved and submitted.",
    url="https://videosdk.free.beeceptor.com/apply",
    method="POST"
    )
)

q2 = loan_application.state(
    name="Review and Confirm",
    instruction="Review all loan details with the user and get confirmation.",
    tool=submit_loan_application
)
# highlight-end

# Master / Off-topic handler
q_master = loan_application.state(
    name="Off-topic Handler",
    instruction="Handle off-topic or inappropriate inputs respectfully and end the call politely",
    master=True
)
```

### Step 4: Define Transitions

Now, link the states together using transitions.
You specify the `from_state`, `to_state`, and a `condition` that must be met to trigger the transition.

```python title="main.py"
# Greeting → Loan Type Selection
loan_application.transition(
    from_state=q0,
    to_state=q1,
    condition="User ready to apply for loan"
)

# Branch from Loan Type Selection
# (q1a is a personal-loan details state, defined like the states in Step 3)
loan_application.transition(
    from_state=q1,
    to_state=q1a,
    condition="User wants personal loan"
)

# Merge all branches → Loan Amount Collection
loan_application.transition(
    from_state=q1a,
    to_state=q2,
    condition="Personal loan details collected and verified"
)

# ... More Transitions ...
```

### Step 5: Integrate with the Agent Pipeline

Finally, pass the `conversational_graph` to your `CascadingPipeline`.

```python title="main.py"
class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can assist users with their loan applications")

    async def on_enter(self) -> None:
        await self.session.say("Hello, I am here to help with your loan application. How can I help you today?", interruptible=False)

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")

async def entrypoint(ctx: JobContext):
    agent = VoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # highlight-start
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(),
        llm=OpenAILLM(),
        tts=GoogleTTS(),
        vad=SileroVAD(),
        turn_detector=TurnDetector(),
        conversational_graph=loan_application
    )
    # highlight-end

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )
    await session.start(wait_for_participant=True, run_until_shutdown=True)

def make_context() -> JobContext:
    room_options = RoomOptions(room_id="", name="Workflow Agent", playground=True)
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
```

## Working Example

- [Conversational Graph Example](https://github.com/videosdk-live/agents/blob/main/examples/test_workflow_pipeline.py): Check out the complete working example of a Loan Application agent using Conversational Graph.

---

# De-noise

De-noise improves audio quality in your AI agent conversations by filtering out background noise. This creates more professional and engaging interactions, especially in noisy environments.
## Overview

The VideoSDK Agents framework provides real-time audio denoising capabilities via the `RNNoise` plugin, which can:

- **Remove Background Noise**: Filters out ambient sounds, keyboard typing, air conditioning, and other distractions
- **Enhance Voice Clarity**: Improves speech intelligibility and quality
- **Work in Real-time**: Processes audio with minimal latency during live conversations
- **Integrate Seamlessly**: Works with both `CascadingPipeline` and `RealTimePipeline` architectures

## What De-noise Solves

Without noise removal, your agents may struggle with:

- Poor audio quality affecting transcription accuracy
- Background noise interfering with conversation flow
- Unprofessional sound quality in business applications
- Difficulty understanding users in noisy environments

With De-noise, you get:

- Crystal clear audio for better user experience
- Improved speech-to-text accuracy
- Professional-grade audio quality
- Better performance in various acoustic environments

## RNNoise Implementation

`RNNoise` is a real-time noise suppression library that uses deep learning to distinguish between speech and noise, providing effective background noise removal.
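Conceptually, a denoiser is a per-frame transform applied to raw audio before STT and VAD ever see it. The SDK-free sketch below shows only that placement; `suppress_noise` is a deliberately trivial amplitude gate used as a stand-in, not RNNoise's actual deep-learning model.

```python
from typing import Callable, List

Frame = List[float]  # one frame of PCM samples (illustrative representation)

def suppress_noise(frame: Frame, floor: float = 0.1) -> Frame:
    """Toy stand-in for a denoiser: zero out samples below a noise floor."""
    return [s if abs(s) >= floor else 0.0 for s in frame]

def audio_path(frames: List[Frame], denoise: Callable[[Frame], Frame]) -> List[Frame]:
    """Raw audio -> denoise -> clean frames (which STT/VAD would then consume)."""
    return [denoise(f) for f in frames]

# Low-amplitude hiss (0.02, -0.03) is gated out; speech-level samples pass through
clean = audio_path([[0.02, 0.5, -0.03, -0.8]], suppress_noise)
print(clean)  # [[0.0, 0.5, 0.0, -0.8]]
```

In the real pipeline, passing `denoise=RNNoise()` inserts exactly this kind of hook at the front of the audio path, as the integration examples below show.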
### Key Features

- **Real-time Processing**: Low-latency noise removal suitable for live conversations
- **Adaptive Filtering**: Automatically adjusts to different types of background noise
- **Speech Preservation**: Maintains voice quality while removing unwanted sounds
- **Lightweight**: Efficient processing with minimal computational overhead

### Basic Setup

```python
from videosdk.plugins.rnnoise import RNNoise

# Initialize noise removal
denoise = RNNoise()
```

## Pipeline Integration

**Cascading Pipeline:** Add noise removal to your cascading pipeline:

```python title="main.py"
from videosdk.agents import Agent, CascadingPipeline, AgentSession
# highlight-start
from videosdk.plugins.rnnoise import RNNoise
# highlight-end

# Add your preferred providers
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD

class EnhancedVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a professional assistant with crystal-clear audio quality. Help users with their questions while maintaining excellent conversation flow."
        )

    async def on_enter(self):
        await self.session.say("Hello! I'm here with enhanced audio quality for our conversation.")

    async def on_exit(self):
        await self.session.say("Goodbye! It was great talking with you.")

# Set up pipeline with noise removal
pipeline = CascadingPipeline(
    stt=DeepgramSTT(api_key="your-deepgram-key"),
    llm=OpenAILLM(api_key="your-openai-key", model="gpt-4"),
    tts=ElevenLabsTTS(api_key="your-elevenlabs-key", voice_id="your-voice-id"),
    vad=SileroVAD(),
    # highlight-start
    denoise=RNNoise()  # Enable noise removal
    # highlight-end
)

# Create and start session
async def main():
    session = AgentSession(agent=EnhancedVoiceAgent(), pipeline=pipeline)
    await session.start()

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
```

---

**Real-time Pipeline:** Integrate with real-time models:

```python title="main.py"
from videosdk.agents import Agent, RealTimePipeline, AgentSession
# highlight-start
from videosdk.plugins.rnnoise import RNNoise
# highlight-end
from videosdk.plugins.openai import OpenAIRealtime

class EnhancedRealtimeAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a professional assistant with crystal-clear audio quality. Engage in natural, real-time conversations while providing helpful responses."
        )

    async def on_enter(self):
        await self.session.say("Hello! I'm ready for a real-time conversation with enhanced audio quality.")

    async def on_exit(self):
        await self.session.say("Thank you for the conversation! Take care.")

# Set up real-time model
model = OpenAIRealtime(
    model="gpt-4o-realtime-preview",
    api_key="your-openai-key",
    voice="alloy"  # Choose from: alloy, echo, fable, onyx, nova, shimmer
)

# Set up pipeline with noise removal
pipeline = RealTimePipeline(
    model=model,
    # highlight-start
    denoise=RNNoise()  # Enable noise removal
    # highlight-end
)

# Create and start session
async def main():
    session = AgentSession(agent=EnhancedRealtimeAgent(), pipeline=pipeline)
    await session.start()

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
```

## Audio Processing Flow

When noise removal is enabled, audio processing follows this flow:

1. **Raw Audio Input:** Microphone captures audio with background noise
2. **Noise Removal:** `RNNoise` filters out unwanted sounds
3. **Enhanced Audio:** Clean audio is passed to speech processing
4. **Improved Results:** Better transcription and conversation quality

## Example - Try Out Yourself

- [Enhanced Pronunciation Example](https://github.com/videosdk-live/agents/blob/main/examples/enhanced_pronounciation.py): Check out an example with enhanced voice and noise removal

---

# DTMF Events

DTMF (Dual-Tone Multi-Frequency) events happen when a caller presses keys (0–9, *, #) on their phone or SIP device during a call. AI agents can listen for these events to capture user input, run specific actions, or respond to the caller based on the key they pressed. DTMF provides a simple and reliable way for users to interact with the agent during a call.

## How It Works

- **DTMF Event Detection**: The agent detects key presses (0–9, *, #) from the caller during a call session.
- **Real-Time Processing**: Each key press generates a DTMF event that is delivered to the agent immediately.
- **Callback Integration**: A user-defined callback function handles incoming DTMF events.
- **Action Execution**: The agent executes actions or triggers workflows based on the received DTMF input, such as building IVR flows, collecting user input, or triggering actions in your application.

## How to enable DTMF Events

### Step 1: Activate DTMF Detection

DTMF event detection can be enabled in two ways:

**Dashboard:** When creating an Inbound SIP gateway in the VideoSDK dashboard, enable the `DTMF` option.

![dtmf-event](https://assets.videosdk.live/images/DTMF-events.png)

---

**API:** Set the `enableDtmf` parameter to `true` when creating or updating a SIP gateway using the API.
```bash
curl -H "Authorization: $YOUR_TOKEN" \
     -H 'Content-Type: application/json' \
     -d '{
       "name" : "Twilio Inbound Gateway",
       "enableDtmf" : "true",
       "numbers" : ["+0123456789"]
     }' \
     -XPOST https://api.videosdk.live/v2/sip/inbound-gateways
```

### Step 2: Implementation

To set up inbound calls, outbound calls, and routing rules, check out the [Quick Start Example](https://docs.videosdk.live/telephony/managing-calls/making-outbound-calls).

```python title="main.py"
from videosdk.agents import AgentSession, DTMFHandler

async def entrypoint(ctx: JobContext):
    async def dtmf_callback(digit: int):
        if digit == 1:
            agent.instructions = "You are a Sales Representative. Your goal is to sell our products"
            await agent.session.say(
                "Routing you to Sales. Hi, I'm from Sales. How can I help you today?"
            )
        elif digit == 2:
            agent.instructions = "You are a Support Specialist. Your goal is to help customers with technical issues."
            await agent.session.say(
                "Routing you to Support. Hi, I'm from Support. What issue are you facing?"
            )
        else:
            await agent.session.say(
                "Invalid input. Press 1 for Sales or 2 for Support."
            )

    # highlight-start
    dtmf_handler = DTMFHandler(dtmf_callback)
    # highlight-end

    session = AgentSession(
        # highlight-start
        dtmf_handler=dtmf_handler,
        # highlight-end
    )
```

## Example - Try It Yourself

- [DTMF Event Example](https://github.com/videosdk-live/agents-quickstart/blob/main/DTMF%20Handler/dtmf_handler.py): Check out the full working example of DTMF event handling

---

# Fallback Adapter

The `Fallback Adapter` provides automatic failover between multiple STT, LLM, or TTS providers. If a provider becomes unavailable, the system automatically switches to the next configured provider without interrupting the session.

## Features

- **Automatic Fallback**: Switches to lower-priority providers if the primary provider fails.
- **Cooldown-based Retry**: Implements a cooldown period before retrying a failed provider, preventing immediate repeated failures.
- **Auto-Recovery**: Automatically switches back to a higher-priority provider once it becomes healthy again.
- **Permanent Disable**: Permanently disables a provider after a configured number of failed recovery attempts.

## Example Usage

Here is how you can implement fallback providers for STT, LLM, and TTS in your agent configuration.

```python
from videosdk.agents import FallbackSTT, FallbackLLM, FallbackTTS
from videosdk.plugins.openai import OpenAISTT, OpenAILLM, OpenAITTS
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.cerebras import CerebrasLLM
from videosdk.plugins.cartesia import CartesiaTTS

# Configure Fallback STT
stt_provider = FallbackSTT(
    [OpenAISTT(), DeepgramSTT()],
    temporary_disable_sec=30.0,
    permanent_disable_after_attempts=3
)

# Configure Fallback LLM
llm_provider = FallbackLLM(
    [OpenAILLM(model="gpt-4o-mini"), CerebrasLLM()],
    temporary_disable_sec=30.0,
    permanent_disable_after_attempts=3
)

# Configure Fallback TTS
tts_provider = FallbackTTS(
    [OpenAITTS(voice="alloy"), CartesiaTTS()],
    temporary_disable_sec=30.0,
    permanent_disable_after_attempts=3
)
```

## Configuration Options

You can configure the fallback behavior using the following parameters:

| Parameter | Description |
| :--- | :--- |
| `temporary_disable_sec` | The duration (in seconds) to wait before retrying a failed provider. |
| `permanent_disable_after_attempts` | The maximum number of recovery attempts allowed before a provider is permanently disabled. |

## Examples - Try Out Yourself

- [Fallback Adapter](https://github.com/videosdk-live/agents/blob/main/examples/fallback_recovery.py): Check out the full implementation on GitHub

---

# Memory

Give your AI agents the ability to remember past interactions and user preferences. By integrating a memory provider, your agent can move beyond the limits of its immediate context window to deliver truly personalized and context-aware conversations.
## How Memory Enhances Conversations

A standard LLM's memory is limited to its context window. A dedicated memory provider solves this by creating a persistent, intelligent storage layer that recalls information across different sessions.

![Memory-enabled Conversation Flow](https://assets.videosdk.live/images/voice-agent-memory-manager.png)

As the diagram shows, the agent intelligently stores key facts and retrieves them in later conversations to provide a personalized, efficient interaction.

## Implementation with Mem0

This guide demonstrates how to implement long-term memory using [**Mem0**](https://mem0.ai/), an open-source platform designed to give AI agents a persistent memory layer. This example creates a "Concierge Agent" that remembers returning users. We will break down the implementation into logical steps.

:::note
The following sections outline the steps you might follow. For a complete working example, see the GitHub repository:

- https://github.com/videosdk-live/agents-quickstart/tree/main/Memory
:::

### Prerequisites

- A Mem0 API key, available from the [Mem0 dashboard](https://app.mem0.ai/).
- Ensure your agent environment is set up per the [AI Voice Agent Quickstart](/ai_agents/voice-agent-quick-start). This is the baseline app where we'll implement the memory features in the steps below.

### Step 1: Create a Dedicated Memory Manager

Start by creating a memory manager class that abstracts your chosen memory provider's API. This class should handle three core operations: storing memories, retrieving memories, and deciding what to remember. The key is to implement a `should_store` method that intelligently determines which conversations are worth remembering based on keywords, user intent, or other criteria you define.
```python title="memory_utils.py" from mem0.client.main import AsyncMemoryClient class Mem0MemoryManager: """Handles all interactions with the Mem0 API.""" def __init__(self, api_key: str, user_id: str): self.user_id = user_id self._client = AsyncMemoryClient(api_key=api_key) async def fetch_recent_memories(self, limit: int = 5) -> list[str]: """Retrieves the most recent memories for the user.""" try: response = await self._client.get_all(filters={"user_id": self.user_id}, limit=limit) return [entry.get("memory", "") for entry in response] except Exception as e: print(f"Error fetching memories: {e}") return [] def should_store(self, user_message: str) -> bool: """Determines if a message contains keywords indicating a fact to remember.""" keywords = ("remember", "preference", "my name is", "likes", "dislike") return any(keyword in user_message.lower() for keyword in keywords) async def record_memory(self, user_message: str, assistant_message: str | None = None): """Stores a conversational turn in Mem0.""" # ... implementation to call self._client.add() ``` ### Step 2: Accessing Memory to Personalize the Agent Implement memory retrieval at session startup to personalize your agent's behavior. Create a function that fetches relevant user memories and injects them into your agent's system prompt or context. Consider how you want to use retrieved memories: for personalized greetings, context-aware responses, or maintaining conversation continuity across sessions. ```python title="main.py" class MemoryAgent(Agent): def __init__(self, instructions: str, remembered_facts: list[str] | None = None): self._remembered_facts = remembered_facts or [] super().__init__(instructions=instructions) async def on_enter(self): # Use the retrieved facts for a personalized greeting if self._remembered_facts: top_fact = "; ".join(self._remembered_facts[:2]) await self.session.say(f"Welcome back! I remember that {top_fact}. What can I help you with?") else: await self.session.say("Hello! 
How can I help today?") # This helper function runs at the start of the session async def build_agent_instructions(memory_manager: Mem0MemoryManager | None) -> tuple[str, list[str]]: base_instructions = "You are a helpful voice concierge..." if not memory_manager: return base_instructions, [] # Fetches memories and adds them to the system prompt remembered_facts = await memory_manager.fetch_recent_memories() if not remembered_facts: return base_instructions, [] memory_lines = "\n".join(f"- {fact}" for fact in remembered_facts) enriched_instructions = f"{base_instructions}\n\nKnown details about this caller:\n{memory_lines}" return enriched_instructions, remembered_facts ``` ### Step 3: Storing New Memories with a Custom Conversation Flow Extend your conversation flow to capture and store new memories during interactions. If you are new to flows, review the core concepts in the [Conversation Flow](./conversation-flow.md) guide. Override the conversation flow's main processing method to evaluate each user message after the agent responds. The goal is to identify valuable information (user preferences, personal details, important facts) and store it without impacting response latency. You can implement this as a post-processing step or integrate it into your existing conversation handling logic. :::tip Want a deeper dive or to run this locally? - Review core concepts in the [Conversation Flow](./conversation-flow.md) guide. - To run your agent, follow the [AI Voice Agent Quickstart](/ai_agents/voice-agent-quick-start). 
:::

```python title="memory_utils.py"
from videosdk.agents import Agent, ConversationFlow

class Mem0ConversationFlow(ConversationFlow):
    """A custom flow that records memories after each turn."""

    def __init__(self, agent: Agent, memory_manager: Mem0MemoryManager, **kwargs):
        super().__init__(agent=agent, **kwargs)
        self._memory_manager = memory_manager
        self._pending_user_message: str | None = None

    async def run(self, transcript: str):
        self._pending_user_message = transcript
        # First, let the standard conversation turn happen, forwarding each
        # response chunk downstream while accumulating the full response
        chunks: list[str] = []
        async for chunk in super().run(transcript):
            chunks.append(chunk)
            yield chunk
        full_response = "".join(chunks)
        # After the response, decide if the turn should be stored in memory
        if self._pending_user_message and self._memory_manager.should_store(self._pending_user_message):
            await self._memory_manager.record_memory(self._pending_user_message, full_response or None)
        self._pending_user_message = None
```

### Step 4: Assembling the Agent Session

Integrate all components in your main application entry point. Initialize your memory manager, use it to build personalized agent instructions, and configure your session with the enhanced conversation flow. This is where you connect the memory system to your agent's lifecycle, ensuring memories are loaded at startup and new information is captured during conversations.

```python title="main.py"
async def start_session(context: JobContext):
    # 1. Setup memory manager
    memory_manager = Mem0MemoryManager(api_key=os.getenv("MEM0_API_KEY"), user_id="demo-user")

    # 2. Build agent with personalized instructions
    instructions, facts = await build_agent_instructions(memory_manager)
    agent = MemoryAgent(instructions=instructions, remembered_facts=facts)

    # 3. Setup conversation flow with memory capabilities
    conversation_flow = Mem0ConversationFlow(agent=agent, memory_manager=memory_manager, ...)

    # 4. Create the session with the custom flow
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,  # your pipeline
        conversation_flow=conversation_flow
    )
    # ... 
rest of your session and job context setup
```

This creates a powerful feedback loop where each interaction enriches the agent's knowledge, leading to smarter and more personalized conversations over time.

### Step 5: Run the Agent

Start your worker process and connect the agent to a room using a `JobContext`. This boots your agent and keeps it running.

```python title="main.py"
from videosdk.agents import WorkerJob, JobContext, RoomOptions

def make_context() -> JobContext:
    return JobContext(room_options=RoomOptions(name="Concierge Agent", playground=True))

if __name__ == "__main__":
    WorkerJob(entrypoint=start_session, jobctx=make_context).start()
```

This will initialize the session using your `start_session` function from Step 4 and keep the worker alive.

## Example - Try It Yourself

Explore our complete, runnable example on GitHub to see how to integrate a memory provider into a VideoSDK AI Agent.

- [Memory Agent Example](https://github.com/videosdk-live/agents-quickstart/tree/main/Memory): A complete example demonstrating how to use Mem0 to give your AI agent long-term memory.

---

Multi-agent switching allows you to break a complex workflow into multiple specialized agents, each responsible for a specific domain or task. Instead of relying on a single agent to manage every tool and decision, you can coordinate smaller agents that operate independently.

### Context Inheritance

When switching agents, you can control whether the new agent should be aware of the previous conversation using the `inherit_context` flag.

- **`inherit_context=True`**: The new agent receives the full chat context. This is ideal for maintaining continuity, so the user doesn't have to repeat information.
- **`inherit_context=False`** (Default): The new agent starts with a fresh state. This is useful when switching to a completely unrelated task.

## How It Works

- The primary VideoSDK agent identifies whether specialized assistance is needed based on the user's intent.
- It invokes a `function tool` to switch to the appropriate specialized agent.
- Control automatically shifts to the new agent, which has access to the previous chat context because `inherit_context=True` was passed.
- The specialized agent handles the user’s request and completes the interaction.

### Implementation

```python title="main.py"
from videosdk.agents import Agent, function_tool

class TravelAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a travel assistant. Help users with general travel inquiries and guide them to booking when needed.""",
        )

    async def on_enter(self) -> None:
        await self.session.reply(instructions="Greet the user and ask how you can help with their travel plans.")

    async def on_exit(self) -> None:
        await self.session.say("Safe travels!")

    @function_tool()
    async def transfer_to_booking(self) -> Agent:
        """Transfer the user to a booking specialist for reservations and scheduling."""
        return BookingAgent(inherit_context=True)

class BookingAgent(Agent):
    def __init__(self, inherit_context: bool = False):
        super().__init__(
            instructions="""You are a booking specialist. Help users book or modify flights, hotels, and travel reservations.""",
            inherit_context=inherit_context
        )

    async def on_enter(self) -> None:
        await self.session.say("I'm a booking specialist. What would you like to book or modify today?")

    async def on_exit(self) -> None:
        await self.session.say("Your booking request is complete. Have a great trip!")
```

## Example - Try It Yourself

- [Travel Agent Example](https://github.com/videosdk-live/agents-quickstart/tree/main/Multi%20Agent%20Switch/Travel%20Agent): Check out the full working implementation of the Travel Agent example
- [Health Care Agent Example](https://github.com/videosdk-live/agents-quickstart/blob/main/Multi%20Agent%20Switch/Health%20Care%20agent/): A working implementation of the Health Care Agent example

---

The VideoSDK AI Agent SDK provides a powerful framework for building AI agents that can participate in real-time conversations. This guide explains the core components and demonstrates how to create a complete agentic workflow. The SDK serves as a real-time bridge between AI models and your users, facilitating seamless voice and media interactions.

## Architecture

The Agent Session orchestrates the entire workflow, combining the Agent with a Pipeline for real-time communication. You can use a direct Realtime Pipeline for speech-to-speech, or a Cascading Pipeline with a Conversation Flow for modular STT-LLM-TTS control.

![Overview](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_overview.png)

1. **Agent** - This is the base class for defining your agent's identity and behavior. Here, you can configure custom instructions, manage its state, and register function tools.
2. **Pipeline** - This component manages the real-time flow of audio and data between the user and the AI models. The SDK offers two types of pipelines:
   - **Realtime Pipeline** - A direct speech-to-speech pipeline: there is no intermediate speech-to-text or text-to-speech conversion, and no separate LLM to configure.
   - **Cascading Pipeline** - The traditional STT-LLM-TTS pipeline, which lets you mix and match different providers for Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS).
3. 
**Agent Session** - This component brings together the agent, pipeline, and conversation flow to manage the agent's lifecycle within a VideoSDK meeting. 4. **Conversation Flow** - This inheritable class works with the CascadingPipeline to let you define custom turn-taking logic and preprocess transcripts. ## Supporting Components These components work behind the scenes to support the core functionality of the AI Agent SDK: - Execution & Lifecycle Management - **JobContext** - Provides the execution environment and lifecycle management for AI agents. It encapsulates the context in which an agent job is running. - **WorkerJob** - Manages the execution of jobs and worker processes using Python's multiprocessing, allowing for concurrent agent operations. - Configuration & Settings - **RoomOptions** - This allows you to configure the behavior of the session, including room settings and other advanced features for the agent's interaction within a meeting. - **Options** - This is used to configure the behavior of the worker, including logging and other execution settings. - External Integration - **MCP Servers** - These enable the integration of external tools through either stdio or HTTP transport. - **MCPServerStdio** - Facilitates direct process communication for local Python scripts. - **MCPServerHTTP** - Enables HTTP-based communication for remote servers and services. 
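To make the roles of these components concrete, here is a minimal sketch that wires them together into a Cascading Pipeline agent, mirroring the full examples later in this guide. The provider choices (Deepgram, OpenAI, ElevenLabs) and the room name are illustrative only; any supported plugins can be substituted.

```python
from videosdk.agents import (
    Agent, AgentSession, CascadingPipeline, ConversationFlow,
    JobContext, RoomOptions, WorkerJob,
)
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.openai import OpenAILLM

class GreeterAgent(Agent):
    def __init__(self):
        # The Agent defines identity and behavior via instructions
        super().__init__(instructions="You are a concise, friendly assistant.")

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

async def start_session(context: JobContext):
    # The Pipeline wires the STT, LLM, and TTS providers together
    agent = GreeterAgent()
    pipeline = CascadingPipeline(stt=DeepgramSTT(), llm=OpenAILLM(), tts=ElevenLabsTTS())

    # The Agent Session ties agent, pipeline, and conversation flow to the meeting
    session = AgentSession(agent=agent, pipeline=pipeline, conversation_flow=ConversationFlow(agent))
    try:
        await context.connect()
        await session.start()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    # RoomOptions configures the meeting the agent joins
    return JobContext(room_options=RoomOptions(name="Component Demo", playground=True))

if __name__ == "__main__":
    # The WorkerJob runs the session in a worker process
    WorkerJob(entrypoint=start_session, jobctx=make_context).start()
```

Each of the sections that follow expands on one of these pieces in isolation.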
## Advanced Features

The AI Agent SDK includes a range of advanced features for building sophisticated conversational agents:

- [Session Management](https://docs.videosdk.live/ai_agents/core-components/agent-session): Control session timeouts and configure agents to auto-end conversations
- [Playground Mode](https://docs.videosdk.live/ai_agents/core-components/agent-session#playground-mode): A testing environment to experiment with different agent configurations
- [Vision Integration](https://docs.videosdk.live/ai_agents/core-components/vision-and-multi-modality): Enable agents to receive and process video input from the meeting
- [Recording Capabilities](https://docs.videosdk.live/ai_agents/core-components/recording): Record agent sessions for analysis and quality assurance
- [A2A Communication](https://docs.videosdk.live/ai_agents/a2a/overview): Allows for seamless collaboration between specialized AI agents
- [MCP Server Integration](https://docs.videosdk.live/ai_agents/mcp-integration): Connect agents to external tools and data sources

## Examples - Try Out Yourself

We have examples to get you started. Go ahead, try them out, talk to the agent, and customize them to your needs.
- [Avatar Integration](https://github.com/videosdk-live/agents/tree/main/examples/avatar): Enhance user experience with realistic, lip-synced virtual avatars
- [Human in the Loop](https://github.com/videosdk-live/agents/tree/main/examples/human_in_the_loop): Implement human intervention capabilities in AI agent conversations for better control and oversight
- [Enhanced Pronunciation](https://github.com/videosdk-live/agents/blob/main/examples/enhanced_pronounciation.py): Improve speech quality and pronunciation accuracy for better user experience and communication clarity
- [PubSub Messaging](https://github.com/videosdk-live/agents/blob/main/examples/pubsub_example.py): Facilitates real-time messaging between agent and client

---

# Preemptive Response

Preemptive Response is a feature that allows the Speech-to-Text (STT) engine to produce **partial, low-latency text output** while the user is still speaking. This is crucial for building highly responsive conversational AI agents. By enabling preemptive response, your agent can begin processing the user's intent and formulating a response before the full utterance is completed, significantly reducing the perceived latency.

## How It Works

![preemptive-response](https://assets.videosdk.live/images/preemptive-response.png)

- User audio is streamed to the STT, which generates partial transcripts.
- These partial transcripts are immediately sent to the LLM to enable preemptive (early) responses.
- The LLM output is then passed to the TTS to generate the spoken response.

## Prerequisites

Ensure you have the required packages installed:

```bash
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"
```

:::tip
Currently, preemptive response generation is limited to Deepgram’s STT implementation and is available only in the Flux model.
::: ## Enabling Preemptive Generation To enable this feature, set the `enable_preemptive_generation` flag to `True` when initializing your STT plugin (e.g., `DeepgramSTTV2`). ```python from videosdk.plugins.deepgram import DeepgramSTTV2 stt = DeepgramSTTV2( enable_preemptive_generation=True ) ``` ## Full Working Example The following example demonstrates how to build a voice agent with preemptive transcription enabled. This setup uses Deepgram for STT, OpenAI for LLM, and ElevenLabs for TTS. ```python import asyncio import os from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector, pre_download_model from videosdk.plugins.deepgram import DeepgramSTTV2 from videosdk.plugins.openai import OpenAILLM from videosdk.plugins.elevenlabs import ElevenLabsTTS # Pre-download the Turn Detector model to avoid delays during startup pre_download_model() class MyVoiceAgent(Agent): def __init__(self): super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.") async def on_enter(self): await self.session.say("Hello! How can I help you today?") async def on_exit(self): await self.session.say("Goodbye!") async def start_session(context: JobContext): # 1. Create the agent and conversation flow agent = MyVoiceAgent() conversation_flow = ConversationFlow(agent) # 2. Define the pipeline with Preemptive Generation enabled pipeline = CascadingPipeline( stt=DeepgramSTTV2( model="flux-general-en", enable_preemptive_generation=True # Enable low-latency partials ), llm=OpenAILLM(model="gpt-4o"), tts=ElevenLabsTTS(model="eleven_flash_v2_5"), vad=SileroVAD(threshold=0.35), turn_detector=TurnDetector(threshold=0.8) ) # 3. 
Initialize the session session = AgentSession( agent=agent, pipeline=pipeline, conversation_flow=conversation_flow ) try: await context.connect() await session.start() # Keep the session running await asyncio.Event().wait() finally: # Clean up resources await session.close() await context.shutdown() def make_context() -> JobContext: room_options = RoomOptions( name="VideoSDK Cascaded Agent", playground=True ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` --- # Pub/Sub Messaging Pub/Sub (Publish/Subscribe) messaging enables real-time, bidirectional communication between your AI agent and client applications within a VideoSDK meeting. This allows you to build interactive experiences where the client can send commands or data to the agent, and the agent can push updates or notifications back to the client, all without relying on voice. ![Pub/Sub Architecture Diagram](https://strapi.videosdk.live/uploads/user_agent_pubsub_chat_e62ce8f209.png) ## Key Features - **Send Messages**: Agents can publish messages to any specified Pub/Sub topic, which can be received by any participant (including client applications) subscribed to that topic. - **Receive Messages**: Agents can subscribe to topics to receive messages published by client applications or other participants. - **Bidirectional Flow**: Communication is not one-way. Both the agent and the client can publish and subscribe, enabling a fully interactive loop. - **Decoupled Communication**: The client and agent do not need to know about each other's existence directly. They communicate through shared topics, which simplifies the architecture. ## Implementation Implementing Pub/Sub involves two main parts: subscribing to topics to receive messages and publishing messages. Subscribing is typically the first step on both the agent and client side. 
Use the tabs below to see how to subscribe to a Pub/Sub topic across the AI Agent and client SDKs. **ai-agent:** ```python title="Subscribe on Room Context" from videosdk import PubSubSubscribeConfig def on_client_message(message): print(f"Received: {message}") await ctx.room.subscribe_to_pubsub(PubSubSubscribeConfig( topic="CHAT", cb=on_client_message )) ``` --- **javascript:** ```js title="Subscribe on meeting join" // Subscribe to CHAT meeting.on("meeting-joined", () => { meeting.pubSub.subscribe("CHAT", (data) => { const { message, senderId, senderName, timestamp } = data; console.log("Client command:", message); }); }); ``` --- **react:** ```js title="usePubSub hook" function ClientCommands() { usePubSub("CHAT", { onMessageReceived: ({ message, senderId }) => { console.log("Client command:", message); }, }); return null; } ``` --- **react-native:** ```js title="usePubSub hook" function ClientCommands() { const { messages } = usePubSub("CHAT", { onMessageReceived: (message) => { console.log("Client command:", message.message); }, }); return null; } ``` --- **ios:** ```swift title="Subscribe with listener" class ClientCommandsListener: PubSubMessageListener { func onMessageReceived(message: PubSubMessage) { print("Client command: \(message.message)") } } let listener = ClientCommandsListener() meeting?.pubsub.subscribe(topic: "CHAT", forListener: listener) ``` --- **android:** ```kotlin title="Subscribe with listener" val listener = PubSubMessageListener { message -> Log.d("PubSub", "Client command: ${message.message}") } meeting?.pubSub?.subscribe("CHAT", listener) ``` --- **flutter:** ```dart title="Subscribe with handler" void messageHandler(PubSubMessage message) { print("Client command: ${message.message}"); } final messages = await room.pubSub.subscribe( "CHAT", messageHandler, ); ``` The most effective way for an agent to publish messages is by exposing a `function_tool`. This allows the LLM to decide when to send a message based on the conversation. 
To publish, you use `PubSubPublishConfig` and call the `publish_to_pubsub` method on the `JobContext` room object. ```python from videosdk import PubSubPublishConfig from videosdk.agents import Agent, function_tool, JobContext class MyPubSubAgent(Agent): def __init__(self, ctx: JobContext): super().__init__( instructions="You can send messages to the client using the send_message tool." ) self.ctx = ctx @function_tool async def send_message_to_client(self, message: str): """Sends a text message to the client application on the 'CHAT' topic.""" publish_config = PubSubPublishConfig( topic="CHAT", message=message ) await self.ctx.room.publish_to_pubsub(publish_config) return f"Message '{message}' sent to client." ``` To receive messages, the agent must subscribe to a topic using `PubSubSubscribeConfig` and the `subscribe_to_pubsub` method, which registers a callback function to handle incoming messages. This setup is typically done in your main `entrypoint` function after connecting to the room. ```python import asyncio from videosdk import PubSubSubscribeConfig from videosdk.agents import JobContext # Define the callback function that will process incoming messages def on_client_message(message): print(f"Received message from client: {message}") # Add your logic here to process the message. # For example, you could pass it to the agent's pipeline. async def entrypoint(ctx: JobContext): # ... (agent and session setup) try: await ctx.connect() await ctx.room.wait_for_participant() # Configure the subscription subscribe_config = PubSubSubscribeConfig( topic="CHAT", cb=on_client_message ) # Subscribe to the topic await ctx.room.subscribe_to_pubsub(subscribe_config) # Start the agent session await session.start() await asyncio.Event().wait() finally: await session.close() await ctx.shutdown() ``` ## Best Practices - **Topic Naming Conventions**: Use clear and consistent topic names (e.g., `CHAT`, `AGENT_STATUS`) to keep your application organized. 
- **Structured Data**: Use JSON for your message payloads. This makes messages easy to parse and allows for sending complex data structures.
- **Error Handling**: Your callback function should gracefully handle malformed or unexpected messages to prevent crashes.
- **Asynchronous Callbacks**: If your callback function performs long-running tasks, make sure it is `async` and consider running tasks in the background with `asyncio.create_task()` to avoid blocking the main event loop.

## Example - Try Out Yourself

Check out our quickstart repository for a complete, runnable example of an agent using Pub/Sub.

- [Pub/Sub Example](https://github.com/videosdk-live/agents-quickstart/tree/main/Pubsub): A complete, runnable example demonstrating how to send and receive Pub/Sub messages with an AI agent.

---

# RAG (Retrieval-Augmented Generation) Integration

**RAG** helps your AI agent find relevant information from documents to give better answers. It searches through your knowledge base and uses that context to respond more accurately.

## Architecture

![RAG](https://cdn.videosdk.live/website-resources/docs-resources/voice_agent_rag.png)

The RAG pipeline flow:

1. **Voice Input** → Deepgram STT converts speech to text
2. **Retrieval** → Query embeddings fetch relevant documents from ChromaDB
3. **Augmentation** → Retrieved context is injected into the prompt
4. **Generation** → OpenAI LLM generates a grounded response
5. **Voice Output** → ElevenLabs TTS converts response to speech

## Managed RAG

With Managed RAG, you can upload knowledge bases from the VideoSDK dashboard and attach them to your agent to enhance responses using retrieval-augmented generation.

#### Step 1: Upload Knowledge Base on the dashboard

#### Step 2: Configure it in Cascading Pipeline

After uploading, the Knowledge Base is assigned a unique ID (as shown in Step 1), which you can use to load it so the agent can fetch relevant information during conversations.
```python title="main.py"
import os
from videosdk.agents import KnowledgeBase, KnowledgeBaseConfig

# Initialize Knowledge Base with ID from Dashboard
kb_id = os.getenv("KNOWLEDGE_BASE_ID")
config = KnowledgeBaseConfig(id=kb_id, top_k=3)

# Load Knowledge Base and pass it to the agent
agent = VoiceAgent(
    knowledge_base=KnowledgeBase(config)
)
```

## Custom RAG

### Prerequisites

- Install VideoSDK agents with all dependencies:

```bash
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"
pip install chromadb openai numpy
```

- Set API keys in environment:

```shell title=".env"
DEEPGRAM_API_KEY="Your Deepgram API Key"
OPENAI_API_KEY="Your OpenAI API Key"
ELEVENLABS_API_KEY="Your ElevenLabs API Key"
VIDEOSDK_AUTH_TOKEN="VideoSDK Auth token"
```

:::tip
For a complete working example with all the code integrated together, check out our GitHub repository: [RAG Implementation Example](https://github.com/videosdk-live/agents-quickstart/blob/main/RAG/rag.py)
:::

## Implementation

### Step 1: Custom Voice Agent with RAG

Create a custom agent class that extends `Agent` and adds retrieval capabilities:

```python title="main.py"
import os

import chromadb
from openai import AsyncOpenAI
from videosdk.agents import Agent

class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a helpful voice assistant that answers questions based on provided context.
            Use the retrieved documents to ground your answers. If no relevant context is found, say so.
            Be concise and conversational."""
        )
        # Initialize OpenAI client for embeddings
        self.openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

        # Define your knowledge base
        self.documents = [
            "What is VideoSDK? VideoSDK is a comprehensive video calling and live streaming platform...",
            "How do I authenticate with VideoSDK? 
Use JWT tokens generated with your API key...", # Add more documents ] # Set up ChromaDB self.chroma_client = chromadb.Client() # In-memory # For persistence: chromadb.PersistentClient(path="./chroma_db") self.collection = self.chroma_client.create_collection( name="videosdk_faq_collection" ) # Generate embeddings and populate database self._initialize_knowledge_base() def _initialize_knowledge_base(self): """Generate embeddings and store documents.""" embeddings = [self._get_embedding_sync(doc) for doc in self.documents] self.collection.add( documents=self.documents, embeddings=embeddings, ids=[f"doc_{i}" for i in range(len(self.documents))] ) ``` ### Step 2: Embedding Generation Implement both synchronous (for initialization) and asynchronous (for runtime) embedding methods: ```python title="main.py" def _get_embedding_sync(self, text: str) -> list[float]: """Synchronous embedding for initialization.""" import openai client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY")) response = client.embeddings.create( input=text, model="text-embedding-ada-002" ) return response.data[0].embedding async def get_embedding(self, text: str) -> list[float]: """Async embedding for runtime queries.""" response = await self.openai_client.embeddings.create( input=text, model="text-embedding-ada-002" ) return response.data[0].embedding ``` ### Step 3: Retrieval Method Add semantic search capability: ```python title="main.py" async def retrieve(self, query: str, k: int = 2) -> list[str]: """Retrieve top-k most relevant documents from vector database.""" # Generate query embedding query_embedding = await self.get_embedding(query) # Query ChromaDB results = self.collection.query( query_embeddings=[query_embedding], n_results=k ) # Return matching documents return results["documents"][0] if results["documents"] else [] ``` ### Step 4: Agent Lifecycle Hooks Define agent behavior on entry and exit: ```python title="main.py" async def on_enter(self) -> None: """Called when agent session 
starts.""" await self.session.say("Hello! I'm your VideoSDK assistant. How can I help you today?") async def on_exit(self) -> None: """Called when agent session ends.""" await self.session.say("Thank you for using VideoSDK. Goodbye!") ``` ### Step 5: Custom Conversation Flow Override the conversation flow to inject retrieved context: ```python title="main.py" class RAGConversationFlow(ConversationFlow): async def run(self, transcript: str) -> AsyncIterator[str]: """ Process user input with RAG pipeline. Args: transcript: User's speech transcribed to text Yields: Generated response chunks """ # Step 1: Retrieve relevant documents context_docs = await self.agent.retrieve(transcript) # Step 2: Format context if context_docs: context_str = "\n\n".join([f"Document {i+1}: {doc}" for i, doc in enumerate(context_docs)]) else: context_str = "No relevant context found." # Step 3: Inject context into conversation self.agent.chat_context.add_message( role="system", content=f"Retrieved Context:\n{context_str}\n\nUse this context to answer the user's question." 
) # Step 4: Generate response with LLM async for response_chunk in self.process_with_llm(): yield response_chunk ``` ### Step 6: Session and Pipeline Setup Configure the agent session and start the job: ```python title="main.py" async def entrypoint(ctx: JobContext): agent = VoiceAgent() conversation_flow = RAGConversationFlow( agent=agent, ) session = AgentSession( agent=agent, pipeline=CascadingPipeline( stt=DeepgramSTT(), llm=OpenAILLM(), tts=ElevenLabsTTS(), vad=SileroVAD(), turn_detector=TurnDetector() ), conversation_flow=conversation_flow, ) # Register cleanup ctx.add_shutdown_callback(lambda: session.close()) # Start agent try: await ctx.connect() print("Waiting for participant...") await ctx.room.wait_for_participant() print("Participant joined - starting session") await session.start() await asyncio.Event().wait() except KeyboardInterrupt: print("\nShutting down gracefully...") finally: await session.close() await ctx.shutdown() def make_context() -> JobContext: room_options = RoomOptions(name="RAG Voice Assistant", playground=True) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=entrypoint, jobctx=make_context) job.start() ``` ## Advanced Features ### Dynamic Document Updates Add documents at runtime: ```python title="main.py" async def add_document(self, document: str, metadata: dict = None): """Add a new document to the knowledge base.""" doc_id = f"doc_{len(self.documents)}" embedding = await self.get_embedding(document) self.collection.add( documents=[document], embeddings=[embedding], ids=[doc_id], metadatas=[metadata] if metadata else None ) self.documents.append(document) ``` ### Document Chunking Split large documents for better retrieval: ```python title="main.py" def chunk_document(self, document: str, chunk_size: int = 500, overlap: int = 50) -> list[str]: """Split document into overlapping chunks.""" words = document.split() chunks = [] for i in range(0, len(words), chunk_size - overlap): 
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Use when adding documents
for doc in large_documents:
    chunks = self.chunk_document(doc)
    for chunk in chunks:
        self.documents.append(chunk)
```

#### Best Practices

1. Document Quality: Use clear, well-structured documents with specific information
2. Chunk Size: Keep chunks between 300-800 words for optimal retrieval
3. Retrieval Count: Start with k=2-3, adjust based on response quality and latency
4. Context Window: Ensure retrieved context fits within LLM token limits
5. Persistent Storage: Use PersistentClient in production to save embeddings
6. Error Handling: Always handle retrieval failures gracefully
7. Testing: Test with diverse queries to ensure good coverage

#### Common Issues

| Issue              | Solution                                                                          |
| ------------------ | --------------------------------------------------------------------------------- |
| Slow responses     | Reduce retrieval count (k), use a faster embedding model, or cache embeddings      |
| Irrelevant results | Improve document quality, adjust the chunking strategy, or use metadata filtering  |
| Out of memory      | Use PersistentClient instead of the in-memory Client                               |

---

# RealTime Pipeline

The `Realtime Pipeline` provides direct speech-to-speech processing with minimal latency. It uses unified models that handle the entire audio processing pipeline in a single step, offering the fastest possible response times for conversational AI.

![Realtime Pipeline Architecture](https://cdn.videosdk.live/website-resources/docs-resources/videosdk_realtime_pipeline.png)

:::tip
The `RealTimePipeline` is specifically designed for real-time AI models that provide end-to-end speech processing, which makes it a better fit for conversational agents. For use cases requiring more granular control over individual components (STT, LLM, TTS) for context support and response control, the [CascadingPipeline ↗](https://docs.videosdk.live/ai_agents/core-components/cascading-pipeline) would be more appropriate.
:::

## Basic Usage

Setting up a `RealTimePipeline` is straightforward. You simply need to initialize your chosen real-time model and pass it to the pipeline's constructor.

```python title="main.py"
from videosdk.agents import RealTimePipeline
from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig

# Initialize the desired real-time model
model = OpenAIRealtime(
    model="gpt-4o-realtime-preview",
    config=OpenAIRealtimeConfig(
        voice="alloy",
        response_modalities=["AUDIO"]
    )
)

# Create the pipeline with the model
pipeline = RealTimePipeline(model=model)
```

In addition to [OpenAI ↗](https://docs.videosdk.live/ai_agents/plugins/realtime/openai), the Realtime Pipeline also supports other advanced models such as [Google Gemini (Live API) ↗](https://docs.videosdk.live/ai_agents/plugins/realtime/google-live-api) and [AWS Nova Sonic ↗](https://docs.videosdk.live/ai_agents/plugins/realtime/aws-nova-sonic), each offering unique features for building high-performance conversational agents. Check their pages for advanced configuration options.

- [OpenAI](https://docs.videosdk.live/ai_agents/plugins/realtime/openai): More about the OpenAI Realtime Plugin
- [Google Gemini](https://docs.videosdk.live/ai_agents/plugins/realtime/google-live-api): More about the Gemini Realtime Plugin
- [AWS Nova Sonic](https://docs.videosdk.live/ai_agents/plugins/realtime/aws-nova-sonic): More about the AWS Nova Sonic Realtime Plugin
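As a sketch of how little changes when you swap providers, here is the same pipeline built around the Gemini Live model instead of OpenAI. The class names, model name, and voice below are assumptions drawn from the Google plugin's documentation rather than guarantees; verify them on the Gemini plugin page linked above.

```python
from videosdk.agents import RealTimePipeline
# Assumed import path and class names; confirm against the Gemini plugin docs
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig

# Only the model object changes; the pipeline construction is identical
model = GeminiRealtime(
    model="gemini-2.0-flash-live-001",  # assumed model identifier
    config=GeminiLiveConfig(
        voice="Leda",  # assumed voice name
        response_modalities=["AUDIO"]
    )
)

pipeline = RealTimePipeline(model=model)
```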
:::tip - Choose a model based on its optimal audio sample rate (OpenAI/Nova Sonic: 16kHz, Gemini: 24kHz) to best fit your needs. - For cloud providers like AWS, select the server region closest to your users to minimize network latency. ::: ## Custom Model Integration To integrate a custom real-time model, you need to implement the `RealtimeBaseModel` interface, which requires implementing methods like `connect()`, `handle_audio_input()`, `send_message()`, and `interrupt()`. ```python title="main.py" from videosdk.agents import RealtimeBaseModel, Agent, CustomAudioStreamTrack from typing import Literal, Optional import asyncio class CustomRealtime(RealtimeBaseModel[Literal["user_speech_started", "error"]]): """Custom real-time AI model implementation""" def __init__(self, model_name: str, api_key: str): super().__init__() self.model_name = model_name self.api_key = api_key self.audio_track: Optional[CustomAudioStreamTrack] = None self._instructions = "" self._tools = [] self._connected = False def set_agent(self, agent: Agent) -> None: """Set agent instructions and tools""" self._instructions = agent.instructions self._tools = agent.tools async def connect(self) -> None: """Initialize connection to your AI service""" # Your connection logic here self._connected = True print(f"Connected to {self.model_name}") async def handle_audio_input(self, audio_data: bytes) -> None: """Process incoming audio from user""" if not self._connected: return # Process audio and generate response # Your audio processing logic here # Emit user speech detection self.emit("user_speech_started", {"type": "detected"}) # Generate and play response audio if self.audio_track: response_audio = b"your_generated_audio_bytes" await self.audio_track.add_new_bytes(response_audio) async def send_message(self, message: str) -> None: """Send text message to model""" # Your text processing logic here pass async def interrupt(self) -> None: """Interrupt current response""" if self.audio_track: 
self.audio_track.interrupt() async def aclose(self) -> None: """Cleanup resources""" self._connected = False if self.audio_track: await self.audio_track.cleanup() ``` ## Comparison with Cascading Pipeline The key architectural difference is that `RealTimePipeline` uses integrated models that handle the entire speech-to-speech pipeline internally, while cascading pipelines coordinate separate STT, LLM, and TTS components. | Feature | Realtime Pipeline | Cascading Pipeline | | :-------------- | :--- | :--- | | **Latency** | Significantly lower latency, ideal for highly interactive, real-time conversations. | Higher latency due to coordinating separate STT, LLM, and TTS components. | | **Control** | Less granular control; tools are handled directly by the integrated model. | Granular control over each step (STT, LLM, TTS), allowing for more complex logic. | | **Flexibility** | Limited to the capabilities of the single, chosen real-time model. | Allows mixing and matching different providers for each component (e.g., Google STT, OpenAI LLM). | | **Complexity** | Simpler to configure as it involves a single, unified model. | More complex to set up due to the coordination of multiple separate components. | | **Cost** | Varies depending on the chosen real-time model and usage patterns. | Varies depending on the combination of providers and usage for each component. | ## Examples - Try Out Yourself We have examples to get you started. Try them out, talk to the agent, and customize them to your needs.
- [Basic Implementation](https://github.com/videosdk-live/agents/blob/main/examples/test_realtime_pipeline.py): Check out the realtime pipeline implementation - [OpenAI](https://github.com/videosdk-live/agents-quickstart/tree/main/OpenAI): Implement with OpenAI - [Google Gemini (LiveAPI)](https://github.com/videosdk-live/agents-quickstart/tree/main/Google%20Gemini%20(LiveAPI)): Implement with Google Gemini (LiveAPI) - [AWS Nova Sonic](https://github.com/videosdk-live/agents-quickstart/tree/main/AWS%20Nova%20Sonic): Implement with AWS Nova Sonic --- # Recording Recording capabilities in VideoSDK Agents allow you to capture and store meeting conversations, enabling features like conversation analysis, compliance documentation, and quality assurance. VideoSDK provides three distinct recording approaches, each suited for different use cases and requirements. ## Recording Types Overview VideoSDK offers three types of recording functionality: 1. **Participant Recording** - Built-in automatic recording managed by the agent framework 2. **Track Recording** - Individual audio/video track recording with granular control 3. **Meeting Recording** - Complete meeting session recording with composite output ## 1. Participant Recording (Built-in) Participant recording is the simplest approach, automatically managed by the VideoSDK Agents framework when you enable the `recording` parameter. ### How It Works When `recording=True` is set in `RoomOptions`, the system automatically: - Starts recording when the agent joins the meeting. - Starts recording for each participant as they join. - Stops and merges recordings when the session ends. ### Basic Setup ```python title="main.py" from videosdk.agents import JobContext, RoomOptions def make_context(): return JobContext( room_options=RoomOptions( room_id="your-room-id", auth_token="your-auth-token", name="Recording Agent", #highlight-start recording=True # Enable automatic participant recording #highlight-end ) ) ``` ## 2.
Track Recording Track recording provides granular control over individual audio and video tracks, allowing you to record specific streams with custom configurations. ### When to Use Track Recording - Need to record specific audio/video tracks separately - Require custom recording configurations per track - Want to control recording start/stop timing manually - Need different quality settings for different tracks ### Key Features - **Individual Control**: Start/stop recording for specific tracks - **Custom Configuration**: Set different recording parameters per track - **Flexible Output**: Choose output formats and quality settings - **Manual Management**: Full control over recording lifecycle ### API References for Track Recording - [Start Track Recording](https://docs.videosdk.live/api-reference/realtime-communication/start-track-recording): This API lets you record a participant's track in your room by passing roomId, participantId, and kind as body parameters. - [Stop Track Recording](https://docs.videosdk.live/api-reference/realtime-communication/stop-track-recording): This API lets you stop recording a participant's track by passing roomId, participantId, and kind as body parameters. - [Fetch a Track Recording](https://docs.videosdk.live/api-reference/realtime-communication/fetch-a-track-recording): This API lets you fetch a particular track recording's info by passing trackRecordingId as a parameter. - [Fetch All Track Recordings](https://docs.videosdk.live/api-reference/realtime-communication/fetch-all-track-recordings): This API lets you fetch details of your track recordings by passing roomId, sessionId, and participantId as query parameters. - [Delete A Track Recording](https://docs.videosdk.live/api-reference/realtime-communication/delete-track-recording): This API lets you delete a particular track recording by passing trackRecordingId as a parameter. ## 3.
Meeting Recording Meeting recording captures the entire meeting session as a single composite recording, including all participants and their interactions. ### When to Use Meeting Recording - Need a single recording file for the entire meeting - Want automatic mixing of all audio/video streams - Require meeting-level recording controls - Need simplified post-processing workflow ### Key Features - **Composite Output**: Single recording file with all participants - **Automatic Mixing**: Audio/video streams automatically combined - **Meeting-level Control**: Start/stop recording for entire meeting - **Simplified Management**: One recording per meeting session ### API References for Meeting Recording - [Start Recording](https://docs.videosdk.live/api-reference/realtime-communication/start-recording): This API lets you record your room by passing roomId and a config object as body parameters. - [Stop Recording](https://docs.videosdk.live/api-reference/realtime-communication/stop-recording): This API lets you stop the recording of your room by passing roomId as a body parameter. - [Fetch Recordings](https://docs.videosdk.live/api-reference/realtime-communication/fetch-recordings): This API lets you fetch details of your recordings by passing roomId and sessionId as query parameters. - [List all Recordings](https://docs.videosdk.live/api-reference/realtime-communication/fetch-recording-using-recordingId): This API lets you fetch a particular recording's info by passing recordingId as a parameter. - [Delete a Recording](https://docs.videosdk.live/api-reference/realtime-communication/delete-recording): This API lets you delete a particular recording by passing recordingId as a parameter.
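The meeting-recording endpoints above are plain REST calls. As a minimal sketch, the helper below builds the request for the Start Recording API; the base URL, endpoint path, and body shape here are assumptions inferred from the linked references, so verify them against the API documentation before use:

```python
import json

# Assumed base URL for the VideoSDK REST API; confirm in the API reference.
API_BASE = "https://api.videosdk.live/v2"

def build_start_recording_request(room_id: str, auth_token: str) -> tuple[str, dict, str]:
    """Build (url, headers, body) for a hypothetical Start Recording call."""
    url = f"{API_BASE}/recordings/start"
    headers = {
        "Authorization": auth_token,  # VideoSDK auth token
        "Content-Type": "application/json",
    }
    body = json.dumps({"roomId": room_id})  # assumed body shape
    return url, headers, body

url, headers, body = build_start_recording_request("abcd-1234", "YOUR_TOKEN")
# You would then POST this with the HTTP client of your choice, e.g.:
#   requests.post(url, headers=headers, data=body)
```

Keeping request construction in one helper makes it easy to add the optional `config` object mentioned in the Start Recording reference later, and to unit-test the payload without touching the network.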
## Choosing the Right Recording Type | Use Case | Recommended Type | Reason | | :--- | :--- | :--- | | Agent conversations with automatic management | Participant Recording | Built-in automation and channel separation | | Custom recording workflows | Track Recording | Granular control over individual streams | | Simple meeting archival | Meeting Recording | Single composite file for entire meeting | | Compliance and audit trails | Participant Recording | Automatic lifecycle management | | Advanced post-processing | Track Recording | Individual track access and control | ## Best Practices ### Recording Management - Choose the appropriate recording type based on your use case - Ensure proper authentication tokens for recording API access - Monitor recording status and handle errors gracefully - Plan for adequate storage capacity ### Privacy and Compliance - Inform participants that sessions are being recorded - Implement proper data retention and deletion policies - Ensure compliance with local privacy regulations - Use appropriate recording type for your compliance requirements --- # RoomOptions `RoomOptions` is a configuration class that defines how an AI agent connects to and behaves within a VideoSDK meeting room. It serves as the primary interface for customizing agent behavior, meeting connection parameters, and session management settings.
## Introduction The `RoomOptions` class is the central configuration point for VideoSDK AI agents, providing comprehensive control over how agents join meetings, interact with participants, and manage their sessions. This configuration is passed to the `JobContext` during agent initialization and influences all aspects of the agent's behavior within the meeting environment. ## Core Features - **Meeting Connection**: Configure room ID and authentication for VideoSDK meetings - **Agent Identity**: Set display name and visual representation - **Session Management**: Control automatic session termination and timeouts - **Media Capabilities**: Enable vision processing and meeting recording - **Development Tools**: Playground mode for testing and development - **Error Handling**: Custom error handling callbacks - **Avatar Integration**: Support for virtual avatars ## Basic Example ```python title="main.py" from videosdk.agents import RoomOptions, JobContext # Basic configuration room_options = RoomOptions( room_id="your-meeting-id", name="My AI Agent", playground=True ) # Create job context context = JobContext(room_options=room_options) ``` ## Parameters Parameters that you can pass with `RoomOptions`: | Parameter | Type | Default | Description | | :--- | :--- | :--- | :--- | | `room_id` | Optional[str] | None | Unique identifier for the VideoSDK meeting | | `auth_token` | Optional[str] | None | VideoSDK authentication token | | `name` | Optional[str] | "Agent" | Display name of the agent in the meeting | | `playground` | bool | True | Enable playground mode for easy testing | | `vision` | bool | False | Enable video processing capabilities | | `recording` | bool | False | Enable meeting recording | | `avatar` | Optional[Any] | None | Virtual avatar for visual representation | | `join_meeting` | Optional[bool] | True | Whether agent should join the meeting | | `on_room_error` | Optional[Callable] | None | Error handling callback function | | `auto_end_session` | bool | 
True | Automatically end session when participants leave | | `session_timeout_seconds` | Optional[int] | 30 | Timeout for automatic session termination | | `signaling_base_url` | Optional[str] | None | Custom VideoSDK signaling server URL | ## Additional Resources - [Playground Mode](https://docs.videosdk.live/ai_agents/core-components/agent-session#playground-mode): A testing environment to experiment with different agent configurations - [Vision Integration](https://docs.videosdk.live/ai_agents/core-components/vision-and-multi-modality): Enable agents to receive and process video input from the meeting - [Recording Capabilities](https://docs.videosdk.live/ai_agents/core-components/recording): Record agent sessions for analysis and quality assurance - [Avatar](https://docs.videosdk.live/ai_agents/core-components/avatar): Use Avatar for visually engaging AI Voice Agent --- # Speech Handle Speech control in VideoSDK agents operates through two complementary layers: **session-level** methods for initiating speech and **utterance-level** handles for managing speech lifecycle. This document covers both aspects of controlling agent speech output. ## Session-Level Speech Control The `AgentSession` provides three primary methods for controlling agent speech output **1. Say** `say(message: str, interruptible: bool = True)`: Sends a direct message from the agent to meeting participants with interruption control. **Parameters:** - `message`: The message to be spoken. - `interruptible`: When `True`, the agent’s speech can be interrupted. When `False`, the agent will continue speaking until the message is fully delivered. Default is `True`. ```python # Basic usage # highlight-start await session.say("Critical update!", interruptible=False) # highlight-end # In agent lifecycle hooks class MyAgent(Agent): async def on_enter(self): # highlight-start await self.session.say("Welcome to the meeting!") # highlight-end ``` **2. 
Reply** `reply(instructions: str, wait_for_playback: bool = True, interruptible: bool = True)`: Generates agent responses dynamically using custom instructions with interruption control. **Parameters:** - `instructions`: Custom instructions for generating the response - `wait_for_playback`: When `True`, prevents user interruptions until playback completes - `interruptible`: When `True`, the agent’s response can be interrupted. When `False`, the agent will continue speaking without interruption. Default is `True`. ```python # Generate immediate response # highlight-start await session.reply(instructions="Please summarize the conversation so far", interruptible=False) # highlight-end # Wait for complete playback before allowing new inputs # highlight-start await session.reply( instructions="Explain the next steps", wait_for_playback=True ) # highlight-end # Practical example in function tools class MyAgent(Agent): @function_tool async def get_summary(self) -> str: #highlight-start await self.session.reply( instructions="Based on our conversation, let me provide a summary..." ) #highlight-end return "Summary generated" ``` **3. Interrupt** `interrupt()`: Immediately stops the agent's current speech operation. ```python # Emergency stop during agent response # highlight-start session.interrupt() # highlight-end # User interruption handling class InteractiveAgent(Agent): async def handle_user_input(self, user_input: str): if "stop" in user_input.lower(): #highlight-start self.session.interrupt() #highlight-end await self.session.reply(instructions="How can I help you instead?") @function_tool async def emergency_stop(self) -> str: """Stop current agent operation immediately""" # highlight-start self.session.interrupt() # highlight-end return "Agent stopped successfully" ``` ## Utterance-Level Management `UtteranceHandle` manages individual agent utterances, preventing overlapping speech and enabling graceful interruption handling. 
### Core Concepts - **Lifecycle Management** - Each `UtteranceHandle` tracks a single utterance from creation through completion. - **Completion States** An utterance can complete in two ways: 1. **Natural Completion:** The TTS finishes playing the audio 2. **User Interruption:** The user starts speaking during playback - **Awaitable Pattern** - The handle supports Python's async/await syntax for sequential speech control. ### API Reference | Property/Method | Return Type | Description | |------------------|--------------|--------------| | `id` | `str` | Unique identifier for the utterance | | `done()` | `bool` | Returns `True` if utterance is complete | | `interrupted` | `bool` | Returns `True` if user interrupted | | `interrupt()` | `None` | Manually marks utterance as interrupted | | `__await__()` | `Generator` | Enables awaiting the handle | ### Usage Patterns - **Sequential Speech** To prevent overlapping TTS, await each handle before starting the next utterance: ```python # Correct approach handle1 = self.session.say(f"The current temperature is {temperature}°C.") await handle1 # Wait for first utterance to complete handle2 = self.session.say("Do you live in this city?") await handle2 # Wait for second utterance to complete ``` - **Checking Interruption Status** Access the current utterance handle via `self.session.current_utterance` to detect interruptions: ```python utterance: UtteranceHandle | None = self.session.current_utterance # In long-running operations, check periodically for i in range(10): if utterance and utterance.interrupted: logger.info("Task was interrupted by the user.") return "The task was cancelled because you interrupted me." 
await asyncio.sleep(1) ``` ### Best Practices - **Sequential Speech:** Always await handles when you need sequential speech to prevent audio overlap - **Interruption Handling:** Check `interrupted` status in long-running operations to enable graceful cancellation - **Handle References:** Store handle references if you need to check status later in your function - **Avoid Concurrent Tasks:** Don't use `create_task()` for speech that should play sequentially ### Common Use Cases - **Multi-part responses:** When function tools need to speak multiple sentences in sequence - **Long-running operations:** Tasks that should be cancellable when users interrupt - **Conversational flows:** Scenarios requiring precise timing between utterances ## Example - Try It Yourself - [Utterance handle example](https://github.com/videosdk-live/agents/blob/main/examples/utterance_handle_agent.py): Check out the interruption handling implementation via the utterance handle functionality ## FAQs ### Troubleshooting | Issue | Solution | |--------|-----------| | Overlapping speech | Use `await` on handles instead of `create_task()` | | Tasks not cancelling on interruption | Check `utterance.interrupted` in loops | | Handle is None | Only available during function tool execution via `session.current_utterance` | ### Correct Usage Pattern #### ✅ Correct: Sequential Speech Await each handle to prevent overlapping TTS. ```python handle1 = session.say("First") await handle1 handle2 = session.say("Second") await handle2 ``` --- #### ❌ Incorrect: Concurrent Speech Using `create_task()` causes audio overlap.
```python asyncio.create_task(session.say("First")) asyncio.create_task(session.say("Second")) ``` --- # Testing and Evaluation The VideoSDK Agent SDK provides a structured evaluation framework that allows you to run controlled tests on individual agent components: Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS) and collect performance metrics such as latency, accuracy, and stability. ## Evaluation Components To test your agent, use the `Evaluation` class. This allows you to define different scenarios (called "turns") and run them to see how your agent performs. Key components include: - **`Evaluation`**: Runs all your test scenarios. - **`EvalTurn`**: Represents a single conversational turn, one complete exchange where the user gives input and the agent processes it to provide a response. - **`EvalMetric`**: Measurements like `STT_LATENCY`, `LLM_LATENCY`, etc. - **`LLMAsJudge`**: Uses an LLM to "judge" the quality of your agent's response. These are the criteria the Judge can use to evaluate the agent: | Metric | Description | | :--- | :--- | | **REASONING** | Explains *why* the agent responded in a certain way. Useful for debugging logic. | | **RELEVANCE** | Checks if the response actually answers the user's question. | | **CLARITY** | Checks if the response is easy to understand. | | **SCORE** | Gives a numerical rating (0-10) for the quality of the response. | ## Implementation The following steps explain how to set up a test for your agent. ### 1. Import Libraries First, import the necessary modules from the SDK. ```python import logging import aiohttp from videosdk.agents import ( Evaluation, EvalTurn, EvalMetric, LLMAsJudgeMetric, LLMAsJudge, STTEvalConfig, LLMEvalConfig, TTSEvalConfig, STTComponent, LLMComponent, TTSComponent, function_tool ) # Set up logging to see the output logging.basicConfig(level=logging.INFO) ``` ### 2. 
Define Tools If your agent uses tools (like checking the weather), you need to define them here so the evaluation can use them. ```python @function_tool async def get_weather( latitude: str, longitude: str, ): """ Called when the user asks about the weather. Returns the weather for the given location. Args: latitude: The latitude of the location longitude: The longitude of the location """ print("### Getting weather for", latitude, longitude) url = f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m" weather_data = {} try: async with aiohttp.ClientSession() as session: async with session.get(url) as response: if response.status == 200: data = await response.json() print("Weather data", data) weather_data = { "temperature": data["current"]["temperature_2m"], "temperature_unit": "Celsius", } else: print(f"Failed to get weather data, status code: {response.status}") raise Exception(f"Failed to get weather data, status code: {response.status}") except Exception as e: print(f"Exception in get_weather tool: {e}") raise e return weather_data ``` ### 3. Setup Evaluation Create an `Evaluation` instance. You can specify which metrics you want to track. ```python eval = Evaluation( name="basic-agent-eval", include_context=False, metrics=[ EvalMetric.STT_LATENCY, EvalMetric.LLM_LATENCY, EvalMetric.TTS_LATENCY, EvalMetric.END_TO_END_LATENCY ], output_dir="./reports" ) ``` **Parameters:** | Parameter | Type | Description | | :--- | :--- | :--- | | `name` | `str` | Name of the evaluation suite. | | `include_context` | `bool` | Whether to include conversation context. | | `metrics` | `list` | List of metrics to calculate (e.g., `EvalMetric.STT_LATENCY`). | | `output_dir` | `str` | Directory to save the evaluation reports. | ### 4. Add Test Scenarios (Turns) Add "turns" to your evaluation. A turn simulates a single complete interaction loop (Input -> Processing -> Response) between the user and the agent.
You can mix and match mock inputs (text) and real inputs (audio files). #### Scenario 1: Complex Interaction Here, we test the full pipeline: 1. **STT**: Transcribes an audio file (`sample.wav`). 2. **LLM**: Receives a mock text input (overriding the STT output for this test) and uses the `get_weather` tool. 3. **TTS**: Generates speech from a mock text string. 4. **Judge**: An LLM reviews the answer to see if it is relevant. :::warning Note Only `.wav` files are supported for STT evaluation. Please ensure your audio files are in this format. ::: ```python eval.add_turn( EvalTurn( stt=STTComponent.deepgram( STTEvalConfig(file_path="./sample.wav") ), llm=LLMComponent.google( LLMEvalConfig( model="gemini-2.5-flash-lite", use_stt_output=False, mock_input="write one paragraph about Water and get weather of Delhi", tools=[get_weather] ) ), tts=TTSComponent.google( TTSEvalConfig( model="en-US-Standard-A", use_llm_output=False, mock_input="Peter Piper picked a peck of pickled peppers" ) ), judge=LLMAsJudge.google( model="gemini-2.5-flash-lite", prompt="Can you evaluate the agent's response based on the following criteria: Is it relevant to the user input?", checks=[LLMAsJudgeMetric.REASONING, LLMAsJudgeMetric.SCORE] ) ) ) ``` **Configuration Parameters:** **STTEvalConfig:** | Parameter | Type | Description | | :--- | :--- | :--- | | `file_path` | `str` | Path to the audio file. **Note:** Only `.wav` files are supported. | --- **LLMEvalConfig:** | Parameter | Type | Description | | :--- | :--- | :--- | | `model` | `str` | The LLM model to use (e.g., `gemini-2.5-flash-lite`). | | `use_stt_output` | `bool` | If `True`, uses the output from the STT stage as input. | | `mock_input` | `str` | Text input to use if `use_stt_output` is `False`. | | `tools` | `list` | List of tool functions available to the LLM. | --- **TTSEvalConfig:** | Parameter | Type | Description | | :--- | :--- | :--- | | `model` | `str` | The TTS model to use. 
| | `use_llm_output` | `bool` | If `True`, uses the output from the LLM stage as input. | | `mock_input` | `str` | Text input to use if `use_llm_output` is `False`. | --- **LLMAsJudge:** | Parameter | Type | Description | | :--- | :--- | :--- | | `model` | `str` | The LLM model to use for judging. | | `prompt` | `str` | The prompt/criteria for the judge. | | `checks` | `list` | List of metrics to check (e.g., `LLMAsJudgeMetric.REASONING`, `LLMAsJudgeMetric.SCORE`). | #### Scenario 2: End-to-End Flow This scenario uses the output from one step as the input for the next. The STT output is fed into the LLM, and the LLM output is fed into the TTS. ```python eval.add_turn( EvalTurn( stt=STTComponent.deepgram( STTEvalConfig(file_path="./Sports.wav") ), llm=LLMComponent.google( LLMEvalConfig( model="gemini-2.5-flash-lite", use_stt_output=True, # Use the text from STT ) ), tts=TTSComponent.google( TTSEvalConfig( model="en-US-Standard-A", use_llm_output=True # Use the text from LLM ) ), judge=LLMAsJudge.google( model="gemini-2.5-flash-lite", prompt="Is the response relevant?", checks=[LLMAsJudgeMetric.REASONING, LLMAsJudgeMetric.SCORE] ) ) ) ``` #### Scenario 3: Individual Component Testing You can also test components in isolation. **STT Only:** ```python eval.add_turn( EvalTurn( stt=STTComponent.deepgram( STTEvalConfig(file_path="./Sports.wav") ) ) ) ``` --- **LLM Only:** ```python eval.add_turn( EvalTurn( llm=LLMComponent.google( LLMEvalConfig( model="gemini-2.5-flash-lite", use_stt_output=False, mock_input="write one paragraph about trees", ) ), ) ) ``` --- **TTS Only:** ```python eval.add_turn( EvalTurn( tts=TTSComponent.google( TTSEvalConfig( model="en-US-Standard-A", use_llm_output=False, mock_input="A big black bug bit a big black bear, made the big black bear bleed blood." ) ) ) ) ``` ### 5. Run and Save Results Finally, run the evaluation and save the report. The report will be saved to the `output_dir`. 
```python results = eval.run() results.save() ``` ## Examples - Try It Out Yourself - [Evaluation Example](https://github.com/videosdk-live/agents/blob/main/examples/eval.py): A complete example of setting up and running evaluations for your agent. --- # Turn Detection and Voice Activity Detection In conversational AI, timing is everything. Traditional voice agents rely on simple silence-based timers (Voice Activity Detection or VAD) to guess when a user has finished speaking. This often leads to awkward interruptions or unnatural pauses. To solve this, VideoSDK created **Namo-v1**: an open-source, high-performance turn-detection model that understands the _meaning_ of the conversation, not just the silence. ![Namo Turn Detection](https://strapi.videosdk.live/uploads/namo_v1_turn_detection_12e042c6ec.png) ## From Silence Detection to Speech Understanding Namo shifts from basic audio analysis to sophisticated Natural Language Understanding (NLU), allowing your agent to know when a user is truly finished speaking versus just pausing to think. | Traditional VAD (Silence-Based) | Namo Turn Detector (Semantic-Based) | | :---------------------------------------------- | :------------------------------------------------------- | | **Listens for silence.** | **Understands words and context.** | | Relies on a fixed timer (e.g., 800ms). | Uses a transformer model to predict intent. | | Often interrupts or lags. | Knows when to wait and when to respond instantly. | | Struggles with natural pauses and filler words. | Distinguishes between a brief pause and a true endpoint. | This semantic understanding enables AI agents to respond quicker and more naturally, creating a fluid, human-like conversational experience. :::tip Learn More For a deep dive into Namo's architecture, performance benchmarks, and how to use it as a standalone model, check out the dedicated [**Namo Turn Detector plugin page**](/ai_agents/plugins/namo-turn-detector). 
::: ## Implementation For the most robust setup, you can use VAD and Namo together. VAD acts as a basic speech detector, while Namo intelligently decides if the turn is over. ### 1. Voice Activity Detection (VAD) First, configure VAD to detect the presence of speech. This helps manage interruptions and acts as a first-pass filter. ```python from videosdk.plugins.silero import SileroVAD # Configure VAD to detect speech activity vad = SileroVAD( threshold=0.5, # Sensitivity to speech (0.3-0.8) min_speech_duration=0.1, # Ignore very brief sounds min_silence_duration=0.75 # Wait time before considering speech ended ) ``` ### 2. Namo Turn Detection Next, add the `NamoTurnDetectorV1` plugin to analyze the content of the speech and predict the user's intent. #### Multilingual Model If your agent needs to support multiple languages, use the default multilingual model. It's a single, powerful model that works across more than 20 languages. ```python from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model # Pre-download the multilingual model to avoid runtime delays pre_download_namo_turn_v1_model() # Initialize the multilingual Turn Detector turn_detector = NamoTurnDetectorV1( threshold=0.7 # Confidence level for triggering a response ) ``` The table below lists all supported languages with their performance metrics and language codes. #### Language-Specific Models For maximum performance and accuracy in a single language, use a specialized model. These models are faster and have a smaller memory footprint. 
```python
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Pre-download a specific language model (e.g., German)
pre_download_namo_turn_v1_model(language="de")

# Initialize the Turn Detector for German
turn_detector = NamoTurnDetectorV1(
    language="de",
    threshold=0.7
)
```

:::note
To see all available models for different languages, along with their benchmarks and accuracy, please visit our [Hugging Face models page](https://huggingface.co/videosdk-live/models).
:::

### 3. Adaptive End-of-Utterance (EOU) Handling

The **Adaptive EOU** mode dynamically adjusts the speech-wait timeout based on confidence scores. The agent waits longer when the user is hesitant and responds faster when the user's intent is clear, creating a more natural conversational flow. You can configure this by setting the `eou_config` in your pipeline options:

```python
pipeline = CascadingPipeline(
    # ... other config
    eou_config=EOUConfig(
        mode='ADAPTIVE',  # or 'DEFAULT'
        min_max_speech_wait_timeout=[0.5, 0.8]  # Min 0.5s, Max 0.8s wait
    )
)
```

#### Configuration Parameters

| Parameter | Type | Description |
| :--- | :--- | :--- |
| `mode` | `str` | **DEFAULT**: uses a fixed timeout value. **ADAPTIVE**: dynamically adjusts the timeout based on confidence scores. |
| `min_max_speech_wait_timeout` | `list[float]` | The minimum and maximum wait time (in seconds). |

##### Example

| Mode | User Input | Agent Reaction | Wait Time | Example |
|------|------------|----------------|-----------|---------|
| **DEFAULT** | Speaks clearly | Responds immediately | ~0.5s | `“Book a meeting for tomorrow at 10.”` |
| **DEFAULT** | Pauses or hesitates mid-sentence | Waits slightly longer | ~0.8s | `“Book a meeting for… um… tomorrow…”` |
| **ADAPTIVE** | Varies | Adjusts based on speech clarity | Scaled between min/max | `“Remind me to call… uh… John later.”` |

### 4. Interruption Detection (VAD + STT)

Interruption Detection controls when the system should treat user speech as an intentional interruption. It evaluates both voice activity and recognized speech content to avoid triggering interruptions from short noises, filler words, or background audio. The agent only stops or responds when the user clearly intends to speak.

#### Configuration Example (HYBRID mode)

```python
pipeline = CascadingPipeline(
    # ... other config
    interrupt_config=InterruptConfig(
        mode="HYBRID",
        interrupt_min_duration=0.2,  # 200ms of continuous speech
        interrupt_min_words=2,       # At least 2 words recognized
    )
)
```

#### VAD_ONLY mode

```python
pipeline = CascadingPipeline(
    # ... other config
    interrupt_config=InterruptConfig(
        mode="VAD_ONLY",
        interrupt_min_duration=0.2,  # 200ms of continuous speech
    )
)
```

#### STT_ONLY mode

```python
pipeline = CascadingPipeline(
    # ... other config
    interrupt_config=InterruptConfig(
        mode="STT_ONLY",
        interrupt_min_words=2,  # At least 2 words recognized
    )
)
```

#### Configuration Parameters

| Parameter | Type | Description |
| :--- | :--- | :--- |
| `mode` | `str` | **HYBRID**: combines VAD and STT; requires both audio detection and recognized words to trigger an interruption. **VAD_ONLY**: uses only raw speech activity detection; faster, but may be triggered by background noise. **STT_ONLY**: relies only on recognized words from the transcript; slower, but ensures the speech is intelligible. |
| `interrupt_min_duration` | `float` | Minimum duration (in seconds) of continuous speech required to trigger an interruption. |
| `interrupt_min_words` | `int` | Minimum number of words that must be recognized (used in `HYBRID` and `STT_ONLY` modes). |

### 5. False-Interruption Recovery

The **False-Interruption Recovery** feature detects accidental or brief user noises and allows the agent to automatically resume speaking when an interruption is not genuine.

#### Configuration Example

```python
pipeline = CascadingPipeline(
    # ... other config
    interrupt_config=InterruptConfig(
        false_interrupt_pause_duration=2.0,  # Wait 2 seconds to confirm interruption
        resume_on_false_interrupt=True,      # Auto-resume if interruption is brief
    )
)
```

#### Configuration Parameters

| Parameter | Type | Description |
| :--- | :--- | :--- |
| `false_interrupt_pause_duration` | `float` | Duration (in seconds) to wait after detecting an interruption before considering it false. If the user doesn't continue speaking within this time, the interruption is considered accidental and the agent resumes. |
| `resume_on_false_interrupt` | `bool` | If `True`, the agent automatically resumes speaking after a false interruption. If `False`, the agent remains paused even after brief interruptions. |

## Pipeline Integration

Combine VAD and Namo in your `CascadingPipeline` to bring it all together.
```python from videosdk.agents import CascadingPipeline from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model # Pre-download the model you intend to use pre_download_namo_turn_v1_model(language="en") pipeline = CascadingPipeline( stt=your_stt_provider, llm=your_llm_provider, tts=your_tts_provider, # highlight-start vad=SileroVAD(threshold=0.5), turn_detector=NamoTurnDetectorV1(language="en", threshold=0.7) # highlight-end ) ``` :::tip The `RealTimePipeline` for providers like OpenAI includes built-in turn detection, so external VAD and Turn Detector components are not required. ::: ## Example Implementation Here’s a complete example showing Namo in a conversational agent. ```python title="main.py" from videosdk.agents import Agent, CascadingPipeline, AgentSession from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model from your_providers import your_stt_provider, your_llm_provider, your_tts_provider class ConversationalAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful assistant that waits for users to finish speaking before responding." ) async def on_enter(self): await self.session.say("Hello! I'm listening and will respond when you're ready.") # 1. Pre-download the model to ensure fast startup pre_download_namo_turn_v1_model(language="en") # 2. Set up the pipeline with Namo for intelligent turn detection pipeline = CascadingPipeline( stt=your_stt_provider, llm=your_llm_provider, tts=your_tts_provider, # highlight-start vad=SileroVAD(threshold=0.5), turn_detector=NamoTurnDetectorV1(language="en", threshold=0.7) # highlight-end ) # 3. Create and start the session session = AgentSession(agent=ConversationalAgent(), pipeline=pipeline) # ... 
# connect to your call transport
```

## Examples - Try It Yourself

- [Namo Quickstart](https://github.com/videosdk-live/agents-quickstart/tree/main/Namo%20Turn%20Detector): A quickstart guide to get you started with the Namo Turn Detector.
- [Cascading Pipeline](https://github.com/videosdk-live/agents/blob/main/examples/test_cascading_pipeline.py): Turn detection and VAD with the cascading pipeline

---

# Utterance Handle

`UtteranceHandle` is a lifecycle management class for agent utterances in the videosdk-agents framework. It solves two critical problems:

- preventing overlapping text-to-speech (TTS) output
- enabling graceful interruption handling when users speak during agent responses

This is essential for creating natural conversational experiences where agents can generate multiple sequential speech outputs without audio overlap.

## Core Concepts

### Lifecycle Management

Each `UtteranceHandle` instance tracks a single utterance from creation through completion. The handle manages state transitions automatically as the conversation progresses.

### Completion States

An utterance can complete in two ways:

1. **Natural Completion:** The TTS finishes playing the audio to completion
2. **User Interruption:** The user starts speaking, triggering an interruption

### Awaitable Pattern

The handle is compatible with Python's async/await syntax. This allows you to write sequential speech code that waits for each utterance to complete before starting the next one.
## API Reference ### Properties | Property/Method | Return Type | Description | |----------------|-------------|-------------| | id | str | Unique identifier for the utterance | | done() | bool | Returns True if utterance is complete | | interrupted | bool | Returns True if user interrupted | | interrupt() | None | Manually marks utterance as interrupted | | __await__() | Generator | Enables awaiting the handle | ### Methods - `interrupt()`: Manually marks the utterance as interrupted - `__await__()`: Enables awaiting the handle to wait for completion ## Usage Patterns ### Sequential Speech To prevent overlapping TTS, await each handle before starting the next utterance: ```python # Correct approach handle1 = self.session.say(f"The current temperature is {temperature}°C.") await handle1 # Wait for first utterance to complete handle2 = self.session.say("Do you live in this city?") await handle2 # Wait for second utterance to complete ``` ### Checking Interruption Status Access the current utterance handle via `self.session.current_utterance` in function tools to detect interruptions: ```python utterance: UtteranceHandle | None = self.session.current_utterance # In long-running operations, check periodically for i in range(10): if utterance and utterance.interrupted: logger.info("Task was interrupted by the user.") return "The task was cancelled because you interrupted me." await asyncio.sleep(1) ``` ## Anti-Pattern: Concurrent Speech Never use `asyncio.create_task()` for speech that should be sequential, as this causes overlapping audio: ```python # INCORRECT - causes overlapping speech asyncio.create_task(self.session.say(f"The current temperature is {temperature}°C.")) asyncio.create_task(self.session.say("Do you live in this city?")) ``` ## Integration with AgentSession The `session.say()` method returns an `UtteranceHandle` instance. During function tool execution, the current utterance is accessible via `self.session.current_utterance`. 
The handle's lifecycle is managed automatically by the session, with completion and interruption states updated as the conversation progresses.

### Complete Example

```python
@function_tool
async def get_weather(self, latitude: str, longitude: str) -> dict:
    utterance: UtteranceHandle | None = self.session.current_utterance

    # Fetch weather data
    temperature = await fetch_temperature(latitude, longitude)

    # Sequential speech with await
    handle1 = self.session.say(f"The current temperature is {temperature}°C.")
    await handle1

    handle2 = self.session.say("Do you live in this city?")
    await handle2

    # Check if user interrupted
    if utterance and utterance.interrupted:
        return {"response": "Weather request cancelled due to user interruption."}

    return {"response": f"The temperature is {temperature}°C."}
```

## Best Practices

1. Always await handles when you need sequential speech, to prevent audio overlap
2. Check `interrupted` status in long-running operations to enable graceful cancellation
3. Store handle references if you need to check status later in your function
4. Avoid `create_task()` for speech that should play sequentially

## Common Use Cases

- **Multi-part responses:** When function tools need to speak multiple sentences in sequence
- **Long-running operations:** Tasks that should be cancellable when users interrupt
- **Conversational flows:** Scenarios requiring precise timing between utterances

## Example - Try It Yourself

- [Utterance handle example](https://github.com/videosdk-live/agents/blob/main/examples/utterance_handle_agent.py): Check out the interruption handling implementation via the utterance handle functionality

## FAQs

### Troubleshooting

| Issue | Solution |
|--------|-----------|
| Overlapping speech | Use `await` on handles instead of `create_task()` |
| Tasks not cancelling on interruption | Check `utterance.interrupted` in loops |
| Handle is `None` | Only available during function tool execution via `session.current_utterance` |

### Correct Usage Pattern

#### ✅ Correct: Sequential Speech

Await each handle to prevent overlapping TTS.

```python
handle1 = session.say("First")
await handle1

handle2 = session.say("Second")
await handle2
```

---

#### ❌ Incorrect: Concurrent Speech

Using `create_task()` causes audio overlap.

```python
asyncio.create_task(session.say("First"))
asyncio.create_task(session.say("Second"))
```

---

# Vision & Multi-modality

Vision and multi-modal capabilities enable your AI agents to process and understand visual content alongside text and audio. This creates richer, more interactive experiences where agents can analyze images, respond to visual cues, and engage in conversations about what they see. The VideoSDK Agents framework supports vision capabilities through two distinct pipeline architectures, each with different capabilities and use cases.
## Pipeline Architecture Overview

The framework provides two pipeline types with different vision support:

| Pipeline Type | Vision Capabilities | Supported Models | Use Cases |
|---|---|---|---|
| CascadingPipeline | Live frame capture & static images | OpenAI, Anthropic, Google | On-demand frame analysis, document analysis, visual Q&A |
| RealTimePipeline | Continuous live video streaming | Google Gemini Live only | Real-time visual interactions, live video commentary |

## Cascading Pipeline Vision

The CascadingPipeline supports vision through two approaches: capturing live video frames from participants, or processing static images. This works with all supported LLM providers (OpenAI, Anthropic, Google).

### Enabling Vision

Enable vision capabilities by setting `vision=True` in RoomOptions:

```python
from videosdk.agents import JobContext, RoomOptions

room_options = RoomOptions(
    room_id="your-room-id",
    name="Vision Agent",
    #highlight-start
    vision=True  # Enable vision capabilities
    #highlight-end
)
job_context = JobContext(room_options=room_options)
```

### Live Frame Capture

Capture video frames from meeting participants on-demand using `agent.capture_frames()`:

```python
import asyncio

from videosdk.agents import Agent, AgentSession, CascadingPipeline, ConversationFlow, JobContext
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.google import GoogleLLM
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

class VisionAgent(Agent):
    def __init__(self, ctx: JobContext):
        super().__init__(
            instructions="You are a helpful assistant that can analyze images."
        )
        self.ctx = ctx

async def entrypoint(ctx: JobContext):
    agent = VisionAgent(ctx)
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(),
        llm=GoogleLLM(),
        tts=ElevenLabsTTS(),
        vad=SileroVAD(),
        turn_detector=TurnDetector()
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )

    shutdown_event = asyncio.Event()

    #highlight-start
    async def on_pubsub_message(message):
        print("Pubsub message received:", message)
        if isinstance(message, dict) and message.get("message") == "capture_frames":
            print("Capturing frame....")
            try:
                frames = agent.capture_frames(num_of_frames=1)
                if frames:
                    print(f"Captured {len(frames)} frame(s)")
                    await session.reply(
                        "Please analyze this frame and describe what you see in detail within one line.",
                        frames=frames
                    )
                else:
                    print("No frames available. Make sure vision is enabled in RoomOptions.")
            except ValueError as e:
                print(f"Error: {e}")

    def on_pubsub_message_wrapper(message):
        asyncio.create_task(on_pubsub_message(message))
    #highlight-end

    # rest of the code..
```

:::tip
The `capture_frames` function returns a list of frames; `num_of_frames` can be at most 5 (`num_of_frames <= 5`).
:::

**Key Features:**

- **On-Demand Capture:** Capture frames only when needed, triggered by events or user requests
- **Event-Driven:** Use PubSub or other triggers to capture frames at the right moment
- **Flexible Analysis:** Send custom instructions along with frames for specific analysis tasks

### Silent Capture (Saving Captured Frames)

You can save captured video frames to disk for later analysis or debugging. The frames returned by `agent.capture_frames()` are `av.VideoFrame` objects that can be converted to JPEG images.
This is called *silent* capture because it doesn't invoke any agent speech announcing that the image is being captured, unless you explicitly set it to do so.

```python title="main.py"
import io

from av import VideoFrame
from PIL import Image

def save_frame_as_jpeg(frame: VideoFrame, filename: str) -> None:
    """Save a video frame as a JPEG file."""
    img = frame.to_image()  # Convert to PIL Image
    img.save(filename, format="JPEG")

# In your agent code
frames = agent.capture_frames(num_of_frames=1)
if frames:
    # Save the first frame
    save_frame_as_jpeg(frames[0], "captured_frame.jpg")

    # Or save as bytes for uploading/processing
    buffer = io.BytesIO()
    frames[0].to_image().save(buffer, format="JPEG")
    jpeg_bytes = buffer.getvalue()
```

**Use Cases:**

- **Debugging:** Save frames to verify what the agent is seeing
- **Logging:** Archive frames for audit trails or quality assurance
- **Preprocessing:** Save frames before sending to external vision APIs
- **Thumbnails:** Generate preview images for user interfaces

### Static Image Processing

For pre-existing images or URLs, use the `ImageContent` class:

```python
from videosdk.agents import ChatRole, ImageContent

# Add image from URL
agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[ImageContent(image="https://example.com/image.jpg")]
)

# Add image with custom settings
image_content = ImageContent(
    image="https://example.com/document.png",
    inference_detail="high"  # "auto", "high", or "low"
)
agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[image_content]
)
```

### Provider Support

All major LLM providers support vision in `CascadingPipeline`:

| Provider | Vision Models | Capabilities |
|-----------|----------------|---------------|
| OpenAI | GPT-4 Vision models | Configurable detail levels, URL & base64 support |
| Anthropic | Claude 3 models | Advanced image understanding, document analysis |
| Google | Gemini models | Comprehensive visual analysis, multi-image support |

### Best Practices

- **Frame Timing:** Capture frames at meaningful
moments (e.g., when the user asks "what do you see?")
- **Error Handling:** Always check if frames are available before processing
- **Vision Enablement:** Ensure `vision=True` is set in `RoomOptions` for frame capture
- **Image Quality:** Use appropriate resolutions for your use case (1024x1024 is recommended for detailed analysis)

*Here is an example you can try out: [Cascading Pipeline Vision Example](https://github.com/videosdk-live/agents/blob/main/examples/vision_cascading_pipeline.py)*

---

## RealTime Pipeline Vision

The `RealTimePipeline` enables continuous live video processing for real-time visual interactions. Video frames are automatically streamed to the model as they arrive.

### Live Video Processing

Live video input is enabled through the `vision` parameter in `RoomOptions` and requires Google's Gemini Live model.

```python title="main.py"
from videosdk.agents import Agent, AgentSession, RealTimePipeline, WorkerJob, JobContext, RoomOptions
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig

async def start_session(context: JobContext):
    # Initialize Gemini with vision capabilities
    model = GeminiRealtime(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        config=GeminiLiveConfig(
            voice="Leda",
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)
    agent = VisionAgent()

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
    )
    await session.start(wait_for_participant=True, run_until_shutdown=True)

# Enable live video processing
def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",
        name="Sandbox Agent",
        playground=True,
        #highlight-start
        vision=True
        #highlight-end
    )
    return JobContext(
        room_options=room_options
    )
```

### Video Processing Flow

When vision is enabled, the system automatically does the following:

1. **Continuous Capture**: Captures video frames from meeting participants
2. **Frame Processing**: Processes frames at optimal intervals (throttled to 0.5 seconds)
3. **Model Integration**: Sends visual data to the Gemini Live model
4. **Context Integration**: Integrates visual understanding with conversation context

### RealTimePipeline Limitations

- **Model Restriction**: Only works with the `GeminiRealtime` model
- **Network Requirements**: Requires a stable network connection for optimal performance
- **Frame Rate**: Automatically throttled to prevent overwhelming the model

*Here is an example you can try out: [**Realtime Pipeline Vision Example**](https://github.com/charu1603/test-realtime-vision)*

## Choosing the Right Approach

| Use Case | Recommended Pipeline | Why |
|-----------|----------------------|-----|
| On-demand frame analysis | CascadingPipeline | Capture frames only when needed, works with all LLM providers |
| Document/image Q&A | CascadingPipeline | Process static images with custom instructions |
| Real-time video commentary | RealTimePipeline | Continuous streaming for live visual interactions |
| Multi-provider support | CascadingPipeline | Works with OpenAI, Anthropic, and Google |
| Lowest latency | RealTimePipeline | Direct streaming to the Gemini Live model |

## Examples - Try Out Yourself

Check out examples of the Realtime and Cascading vision functionality:

- [Cascading Pipeline Vision](https://github.com/videosdk-live/agents/blob/main/examples/vision_cascading_pipeline.py): On-demand frame capture and static image processing
- [Realtime Pipeline Vision](https://github.com/charu1603/test-realtime-vision): Continuous video streaming with the Gemini Realtime API

## Frequently Asked Questions

### Can I use vision with any LLM provider?

CascadingPipeline vision works with OpenAI, Anthropic, and Google LLMs. RealTimePipeline vision only works with Google's Gemini Live model.

### How do I capture frames at specific moments?

Use event-driven triggers like PubSub messages or user speech to call `agent.capture_frames()` at the right time. See the example code above for implementation details.
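As a compact illustration of a speech-driven trigger, here is a hypothetical helper that decides when a user's transcript warrants a frame capture. The function name and trigger phrases are illustrative, not part of the framework:

```python
def should_capture(transcript: str) -> bool:
    """Heuristic trigger: capture a frame when the user asks about what the agent sees."""
    triggers = ("what do you see", "look at this", "can you see")
    text = transcript.lower()
    return any(phrase in text for phrase in triggers)

print(should_capture("Hey, what do you see on my screen?"))  # True
print(should_capture("Book a meeting for tomorrow."))        # False
```

In an agent, you would call something like `agent.capture_frames(num_of_frames=1)` only when such a check passes, keeping vision inference cost limited to moments where it is actually useful.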
### What's the difference between frame capture and continuous streaming?

Frame capture (CascadingPipeline) captures frames on-demand when you call `capture_frames()`. Continuous streaming (RealTimePipeline) automatically sends video frames to the model in real-time.

---

# Voice Mail Detection

Voice Mail Detection allows you to automatically handle voicemail scenarios when making outbound calls with a VideoSDK AI agent. When an outbound call is forwarded to a voicemail system, the detector triggers a callback so your agent can take action, such as leaving a voicemail message or ending the call.

## What Problem This Solves

In outbound calling workflows, unanswered calls are often routed to voicemail systems. Without detection, agents may continue speaking or wait unnecessarily. Voice Mail Detection lets you:

- Detect voicemail systems automatically
- Control how your agent responds
- End calls cleanly after voicemail handling

:::info
To set up outbound calling and routing rules, check out the [Quick Start Example](https://docs.videosdk.live/telephony/managing-calls/making-outbound-calls).
:::

## Enabling Voice Mail Detection

To use voicemail detection, import and add `VoiceMailDetector` to your agent configuration and register a callback that defines how voicemail should be handled.

```python
from videosdk.agents import VoiceMailDetector
from videosdk.plugins.openai import OpenAILLM

async def voice_mail_callback(message):
    print("Voice Mail message received:", message)

# highlight-start
voicemail = VoiceMailDetector(
    llm=OpenAILLM(),
    duration=5,
    callback=voice_mail_callback,
)
# highlight-end

session = AgentSession(
    # ... other session config
    # highlight-start
    voice_mail_detector=voicemail
    # highlight-end
)
```

## Parameters

| Parameter | Description |
|----------|-------------|
| `llm` | LLM used to process the detected voicemail. |
| `duration` | The minimum period of silence (in seconds) that triggers voicemail detection. |
| `callback` | A function called whenever a voicemail is detected, allowing custom actions such as hanging up or leaving a message. |

## Example - Try It Yourself

- [Voice Mail Detection](https://github.com/videosdk-live/agents-quickstart/blob/main/Voice%20Mail%20Detector/voice_mail_detector.py): Check out a full working example of Voice Mail Detection

---

# Worker

This document covers the `worker` and `job` execution system that manages `agent` processes, handles backend registration, and coordinates job assignment and execution. This system provides the foundation for running VideoSDK agents either locally or as part of a distributed backend infrastructure.

## Architecture Overview

The `worker` and `job` system consists of three primary components that work together to execute agent code:

- **WorkerJob**: The main entry point that configures and starts agent execution
- **Worker**: Manages process pools, backend communication, and job lifecycle
- **JobContext**: Provides runtime context and resources to agent entrypoint functions

![Worker](https://cdn.videosdk.live/website-resources/docs-resources/build_agent_section_worker.png)

## Core Components

### Worker Class

The `Worker` class manages the complete lifecycle of agent execution, including process management, backend communication, and job coordination.

**Core Responsibilities:**

- Process pool management and lifecycle
- Backend registry communication
- Job assignment and execution coordination
- Resource monitoring and cleanup
- Error handling and recovery

### WorkerJob

The `WorkerJob` class serves as the primary entry point for creating and running agents. It accepts an `entrypoint function` and `configuration options`, then delegates to the Worker class for execution.
```python from videosdk.agents import WorkerJob, Options, JobContext, RoomOptions # Configure worker options options = Options( agent_id="MyAgent", max_processes=5, register=True, # Registers worker with backend for job scheduling ) # Set up room configuration room_options = RoomOptions( name="My Agent", ) # Create job context job_context = JobContext(room_options=room_options) # Define your agent entrypoint async def your_agent_function(ctx: JobContext): # Your agent logic here await ctx.connect() # Agent implementation... # Create and start the worker job # highlight-start job = WorkerJob( entrypoint=your_agent_function, jobctx=lambda: job_context, options=options, ) job.start() # highlight-end ``` - **Entrypoint:** An async function that serves as your agent's main execution logic. This function receives a `JobContext` parameter and contains your agent implementation. - **JobContext:** Provides the runtime environment for your agent, managing room connections and VideoSDK integration. It handles room setup, authentication, and cleanup operations. - **Options:** Configuration settings for worker execution including process management, authentication, and backend registration. You can find worker options [here ↗](https://docs.videosdk.live/ai_agents/deployments/self-hosting/worker-configuration#worker-options-explained). **Key Methods:** - `start()`: Initiates worker execution based on configuration ## Deployments Choose how to deploy your VideoSDK agents based on your infrastructure needs and requirements. - [Agent Cloud](https://docs.videosdk.live/ai_agents/deployments/agent-cloud): Deploy your agents to VideoSDK's managed cloud infrastructure - [Self Hosting with Worker](https://docs.videosdk.live/ai_agents/deployments/self-hosting/understanding-worker): Learn how to deploy workers for self-hosted agent infrastructure ## Examples - Try Out Yourself We have examples to get you started. 
Go ahead, try them out, talk to the agent, and customize them to your needs.

- [Worker Example](https://github.com/videosdk-live/agents/blob/main/examples/new_worker.py): Check out the worker implementation

---

# Deploy Your Agents

This guide shows you how to deploy AI Agents with the [videosdk-agents](https://pypi.org/project/videosdk-agents/) Python package. Once your AI Agent is ready to use, you need to create an AI Deployment. The AI Deployment is responsible for running your AI Agent. Before proceeding, ensure you have completed the steps under **Prerequisites**.

## Prerequisites

To deploy your AI Deployment, make sure you have:

- Created an AI Deployment using the [Create AI Deployment API](/api-reference/agent-cloud/create-deployment).
- A VideoSDK authentication token (generate one from the [VideoSDK Dashboard](https://app.videosdk.live))

## YAML Configuration

Create a `videosdk.yaml` file with the following structure:

```
version: "1.0"
deployment:
  id: your_ai_deployment_id
  entry:
    path: entry_point_for_deployment
env: # Optional, to run your agent locally
  path: "./.env"
secrets:
  VIDEOSDK_AUTH_TOKEN: your_auth_token
deploy:
  cloud: true
```

### Field Descriptions

| Field | Description |
| --- | --- |
| `deployment.id` | The `deploymentId` obtained from the [Create AI Deployment API](/api-reference/agent-cloud/create-deployment) |
| `deployment.entry.path` | Path to the entry point script for your AI Deployment. |
| `env.path` | Path to your `.env` file, used only when running the agent locally. |
| `secrets.VIDEOSDK_AUTH_TOKEN` | Your VideoSDK auth token (required for deployment). |
| `deploy.cloud` | Set to `true` to allow deploying to VideoSDK Cloud when using the deploy command. Use `false` to avoid accidental deploys. |

## CLI Commands

- ###### Run the AI Deployment locally for testing.
```
videosdk run
```

- ###### Deploy the AI Deployment.

```
videosdk deploy
```

## Next Steps

After deploying your AI Deployment, you can start using it by:

1. Creating a new session using the [Start Session API](/api-reference/agent-cloud/start-session)
2. Ending the session using the [End Session API](/api-reference/agent-cloud/end-session)

---

# Deployments

### Overview

The VideoSDK Agents framework provides multiple deployment options to run your AI agents in production environments. Understanding these options helps you choose the right deployment strategy for your specific use case. VideoSDK Agents supports two primary deployment modes:

1. **Agent Cloud (Managed)** - Fully managed deployment hosted on VideoSDK infrastructure
2. **Self-Hosting** - Self-managed deployment on your own infrastructure (EC2, Docker, Kubernetes, etc.)

### [Agent Cloud (Hosted on Our Infrastructure)](./agent-cloud.md)

Agent Cloud is a fully managed service that handles the deployment, scaling, and maintenance of your AI agents. When you deploy to Agent Cloud:

- **Zero Infrastructure Management**: No need to manage servers, containers, or scaling
- **Automatic Scaling**: Built-in load balancing and auto-scaling capabilities
- **High Availability**: Redundant infrastructure with automatic failover
- **Managed Updates**: Automatic security patches and framework updates
- **Global Distribution**: Agents deployed across multiple regions for low latency
- **Built-in Monitoring**: Integrated metrics, logging, and health monitoring

**Best for**: Teams that want to focus on agent development rather than infrastructure management, or applications with variable traffic patterns.
### [Self-Hosting (EC2, Docker, or Custom Infrastructure)](./self-hosting/understanding-worker.md) Self-hosting gives you complete control over your deployment environment and infrastructure. When self-hosting: - **Full Control**: Complete control over hardware, networking, and configuration - **Custom Integrations**: Ability to integrate with existing infrastructure and tools - **Cost Optimization**: Potential cost savings for high-volume, predictable workloads - **Compliance**: Meet specific security, compliance, or data residency requirements - **Custom Scaling**: Implement your own scaling strategies and resource management **Best for**: Organizations with existing infrastructure, specific compliance requirements, or predictable high-volume workloads. ### When to Choose Agent Cloud vs Self-Hosting #### Choose Agent Cloud when: - You want to get started quickly without infrastructure setup - You have variable or unpredictable traffic patterns - You need global distribution and low latency - You want automatic scaling and high availability - You prefer a managed service with built-in monitoring #### Choose Self-Hosting when: - You need to meet specific compliance or security requirements - You have predictable, high-volume workloads where cost optimization is important - You require custom integrations with existing systems - You need complete control over the deployment environment ### Common Terminology Understanding these key terms will help you navigate the deployment documentation: | Term | Definition | | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | **Agent** | Your AI application built using the VideoSDK Agents framework. An agent can handle voice conversations, process audio, and respond with synthesized speech. | | **Worker** | A runtime component that executes your agent code. 
Workers can run in different environments (Agent Cloud or self-hosted) and handle job assignments from the backend registry system. |
| **Backend Registry** | The central service that manages worker registration, job assignment, and load balancing. Workers connect to this registry to receive job assignments and report their status. |
| **Job** | A single execution instance of your agent. When a user starts a conversation, the backend registry assigns a job to an available worker. |
| **JobContext** | The execution context for a job, containing room configuration, pipeline setup, and session management. This is the main interface your agent code interacts with. |
| **Worker Registration** | The process by which self-hosted workers register themselves with the VideoSDK backend registry to receive job assignments. |
| **Load Threshold** | A configuration parameter that determines when a worker is considered "at capacity" and should not receive new job assignments. |
| **Health Check** | Regular monitoring of worker status to ensure they're available and functioning correctly. Workers provide health endpoints for monitoring. |
| **Resource Management** | The system for managing worker resources including process/thread allocation, memory limits, and concurrent job handling. |
| **Session Management** | Handles the lifecycle of agent sessions including automatic session ending, timeouts, and cleanup. |
| **Horizontal Scaling** | The process of manually deploying additional worker instances to handle increased load. |
| **Vertical Scaling** | The automatic scaling within a single worker up to its configured maximum capacity (`max_processes`). |
| **Dispatch API** | A REST API endpoint that allows you to dynamically dispatch agents to meetings on-demand. |
| **AI Deployment** | The deployment configuration that runs your AI agent, either in Agent Cloud or self-hosted environments.
| This terminology will be referenced throughout the deployment documentation as we explore specific deployment scenarios and configurations. --- # Dispatch Agents Dynamically assign your AI agents to meetings using the VideoSDK dispatch API. This API supports dispatching for both self-hosted agents created with the Agents SDK and agents managed through the VideoSDK dashboard (Agent Runtime). ## How It Works 1. **Your app** calls the dispatch API 2. **VideoSDK backend** finds an available server 3. **Server spawns a job/process** to join the meeting 4. **Agent starts** and begins processing in the meeting ## API Usage ### Endpoint ```bash POST https://api.videosdk.live/v2/agent/dispatch ``` ### Request Body Parameters | Parameter | Type | Required | Description | | :---------- | :----- | :------- | :--------------------------------------------------------------------------------------------------------------------------------------- | | meetingId | string | Yes | The ID of the meeting to which the agent should be dispatched. | | agentId | string | Yes | The ID of the agent to dispatch. | | metadata | object | No | Optional metadata to pass to the agent, such as variables. | | versionId | string | No | The specific version of a dashboard-managed agent to dispatch. If omitted, the latest deployed version is used. Not for self-hosted agents. | ### Example Request ```bash curl -X POST "https://api.videosdk.live/v2/agent/dispatch" \ -H "Authorization: YOUR_VIDEOSDK_AUTH_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "meetingId": "xxxx-xxxx-xxxx", "agentId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", "metadata": { "variables":[ { "name":"fname", "value":"ankit" } ] }, "versionId":"abcd-abcd-abcd-abcd" }' ``` ### Responses **On Success** A successful request will return a confirmation that the dispatch has been initiated. 
```json
{
  "message": "Agent dispatch requested successfully.",
  "data": {
    "success": true,
    "status": "assigned",
    "roomId": "xxxx-xxxx-xxxx",
    "agentId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  }
}
```

**On Error**

If the dispatch fails, you will receive one of the following error messages:

**no-workers:** This error occurs when no servers or agents are configured to handle the request.

```json
{ "message": "No workers available" }
```

---

**no-workers-registered:** This error is specific to **self-hosted (Agents SDK) agents**. It means that while the `agentId` is valid, no running server has registered with that specific `agentId`.

```json
{ "message": "No workers have registered with agentId 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'" }
```

---

**agent-not-deployed:** This error is specific to **dashboard-managed agents**. It indicates that the agent exists but has no deployed versions available for dispatch, or that the specific version requested is not deployed.

```json
{ "message": "No agent is deployed with agentId 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'" }
```

## Dispatching Your Agent

The prerequisites for dispatching an agent depend on how it was created.

### For Self-Hosted Agents (Agents SDK)

If you created your agent using the Python Agents SDK, you are responsible for hosting it. Your server must be:

1. **Registered**: The server must be configured with `register=True`.
2. **Connected**: The server must be running and connected to the VideoSDK backend.
3. **Available**: The server must have the capacity to handle new jobs.

The `versionId` parameter is not applicable in this scenario.

**Server Configuration Example**

```python
from videosdk.agents import Options

options = Options(
    agent_id="MyAgent",  # Must match agentId in API call
    register=True,       # Required for dispatch
    max_processes=10,
    load_threshold=0.75,
)
```

### For Dashboard-Managed Agents (Agent Runtime)

If you created your agent using the dashboard interface, VideoSDK manages the hosting for you.
The only prerequisite is that your agent must be **deployed**. - You can deploy your agent via the dashboard. - You can use the optional `versionId` parameter in your dispatch request to specify which deployed version of the agent to use. - If `versionId` is not provided, the **latest deployed version** will be dispatched by default. ## Code Examples **python:** ```python import requests def dispatch_agent(auth_token, meeting_id, agent_id, metadata=None, version_id=None): url = "https://api.videosdk.live/v2/agent/dispatch" headers = { "Authorization": auth_token, "Content-Type": "application/json" } payload = { "meetingId": meeting_id, "agentId": agent_id, } if metadata: payload["metadata"] = metadata if version_id: payload["versionId"] = version_id response = requests.post(url, headers=headers, json=payload) return response.json() # Usage result = dispatch_agent("your-token", "room-123", "MyAgent") ``` --- **javascript:** ```javascript async function dispatchAgent(authToken, meetingId, agentId, metadata, versionId) { const url = "https://api.videosdk.live/v2/agent/dispatch"; const headers = { Authorization: authToken, "Content-Type": "application/json", }; const body = { meetingId, agentId, }; if (metadata) { body.metadata = metadata; } if (versionId) { body.versionId = versionId; } const response = await fetch(url, { method: "POST", headers, body: JSON.stringify(body), }); return response.json(); } // Usage dispatchAgent("your-token", "room-123", "MyAgent"); ``` --- # AWS EC2 Deploy your VideoSDK AI Agent Worker on AWS EC2 instances. ## Prerequisites - AWS account - SSH key pair - VideoSDK authentication token ## Quick Setup ### 1. Launch EC2 Instance ```bash aws ec2 run-instances \ --image-id ami-0c02fb55956c7d316 \ --instance-type t3.medium \ --key-name your-key-pair \ --security-group-ids sg-xxxxxxxxx \ --user-data file://user-data.sh ``` ### 2. 
User Data Script ```bash #!/bin/bash yum update -y yum install -y python3 python3-pip git # Clone private repository with token git clone https://YOUR_TOKEN@github.com/your-org/your-agent.git /opt/agent cd /opt/agent # Install dependencies pip3 install -r requirements.txt # Create systemd service cat > /etc/systemd/system/agent-worker.service << EOF [Unit] Description=VideoSDK Agent Worker After=network.target [Service] Type=simple User=ec2-user WorkingDirectory=/opt/agent Environment=VIDEOSDK_AUTH_TOKEN=your_auth_token ExecStart=/usr/bin/python3 main.py Restart=always [Install] WantedBy=multi-user.target EOF # Start the service systemctl enable agent-worker systemctl start agent-worker ``` ### 3. Security Group Configure your security group with these rules: - **SSH (22)**: Your IP - **Custom TCP (8081)**: Your IP (for health checks) - **HTTPS (443)**: 0.0.0.0/0 (for VideoSDK API) ## Deploy Updates ```bash # Connect to your instance ssh -i your-key.pem ec2-user@your-instance-ip # Update your agent cd /opt/agent git pull systemctl restart agent-worker ``` ## Monitor ```bash # Check service status systemctl status agent-worker # View logs journalctl -u agent-worker -f ``` ## Scaling > To support more concurrent agents, you can spin up additional EC2 instances using the same process. Each instance will register with the VideoSDK backend registry and automatically receive job assignments. The backend will distribute the load across all available workers. **To add more instances:** 1. Use the same user data script 2. Launch additional EC2 instances 3. Each instance will automatically join the worker pool 4. 
The VideoSDK backend will handle load balancing **Example:** ```bash # Launch multiple instances aws ec2 run-instances \ --image-id ami-0c02fb55956c7d316 \ --instance-type t3.medium \ --key-name your-key-pair \ --security-group-ids sg-xxxxxxxxx \ --user-data file://user-data.sh \ --count 3 ``` --- # Docker Deploy your VideoSDK AI Agent Worker using Docker containers. ## Prerequisites - Docker installed - VideoSDK authentication token ## Quick Setup ### 1. Create Dockerfile ```dockerfile FROM python:3.11-slim WORKDIR /app # Install system dependencies RUN apt-get update && apt-get install -y \ gcc \ && rm -rf /var/lib/apt/lists/* # Copy requirements and install Python dependencies COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt # Copy application code COPY . . # Expose debug port EXPOSE 8081 # Run the worker CMD ["python", "main.py"] ``` ### 2. Build and Run ```bash # Build the image docker build -t my-agent-worker . # Run the container docker run -d \ --name my-agent-worker \ -p 8081:8081 \ -e VIDEOSDK_AUTH_TOKEN="your_auth_token" \ my-agent-worker ``` ### 3. Docker Compose (Optional) Create `docker-compose.yml`: ```yaml title="docker-compose.yml" version: "3.8" services: agent-worker: build: . ports: - "8081:8081" environment: - VIDEOSDK_AUTH_TOKEN=${VIDEOSDK_AUTH_TOKEN} restart: unless-stopped ``` Run with: ```bash docker-compose up -d ``` ## Deploy Updates ```bash # Stop container docker stop my-agent-worker # Remove old container docker rm my-agent-worker # Build new image docker build -t my-agent-worker . # Run new container docker run -d \ --name my-agent-worker \ -p 8081:8081 \ -e VIDEOSDK_AUTH_TOKEN="your_auth_token" \ my-agent-worker ``` ## Monitor ```bash # Check container status docker ps # View logs docker logs my-agent-worker # Execute commands in container docker exec -it my-agent-worker bash ``` ## Scaling > To support more concurrent agents, you can run multiple containers using the same image. 
Each container will register with the VideoSDK backend registry and automatically receive job assignments. **Run multiple containers:** ```bash # Run additional containers docker run -d \ --name my-agent-worker-2 \ -p 8082:8081 \ -e VIDEOSDK_AUTH_TOKEN="your_auth_token" \ my-agent-worker docker run -d \ --name my-agent-worker-3 \ -p 8083:8081 \ -e VIDEOSDK_AUTH_TOKEN="your_auth_token" \ my-agent-worker ``` **Or scale with Docker Compose:** ```bash docker-compose up -d --scale agent-worker=3 ``` --- # Kubernetes Deploy your VideoSDK AI Agent Worker on Kubernetes clusters. ## Prerequisites - Kubernetes cluster (EKS, GKE, or self-hosted) - kubectl configured - Docker image of your agent ## Quick Setup ### 1. Create Namespace ```bash kubectl create namespace agent-workers ``` ### 2. Create Secret ```bash kubectl create secret generic agent-secrets \ --from-literal=VIDEOSDK_AUTH_TOKEN=your_auth_token \ --namespace agent-workers ``` ### 3. Deploy Agent ```yaml title="deployment.yaml" apiVersion: apps/v1 kind: Deployment metadata: name: agent-worker namespace: agent-workers spec: replicas: 3 selector: matchLabels: app: agent-worker template: metadata: labels: app: agent-worker spec: containers: - name: agent-worker image: your-registry/agent-worker:latest ports: - containerPort: 8081 env: - name: VIDEOSDK_AUTH_TOKEN valueFrom: secretKeyRef: name: agent-secrets key: VIDEOSDK_AUTH_TOKEN resources: requests: memory: "512Mi" cpu: "250m" limits: memory: "1Gi" cpu: "500m" ``` Apply the deployment: ```bash kubectl apply -f deployment.yaml ``` ## Monitor ```bash # Check deployment status kubectl get deployments -n agent-workers # Check pods kubectl get pods -n agent-workers # View logs kubectl logs -f deployment/agent-worker -n agent-workers ``` ## Deploy Updates ```bash # Update image kubectl set image deployment/agent-worker agent-worker=your-registry/agent-worker:latest -n agent-workers # Check rollout status kubectl rollout status deployment/agent-worker -n agent-workers ``` 
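Because each worker already serves the `/health` endpoint on port 8081 (see the Monitoring APIs section), you can point Kubernetes liveness and readiness probes at it so that crashed pods are restarted and pods only receive work once healthy. A hedged sketch of the additions to the container spec in `deployment.yaml` above; the probe timings are illustrative defaults, not values prescribed by the framework:

```yaml
# Add under spec.template.spec.containers[0] in deployment.yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
```

The endpoint returns a plain `OK` with HTTP 200, which is all `httpGet` probes need to consider the container healthy.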
## Scaling > To support more concurrent agents, you can scale the deployment by increasing the number of replicas. Each pod will register with the VideoSDK backend registry and automatically receive job assignments. **Scale the deployment:** ```bash # Scale to 5 replicas kubectl scale deployment agent-worker --replicas=5 -n agent-workers # Or use HPA for automatic scaling kubectl autoscale deployment agent-worker --cpu-percent=70 --min=2 --max=10 -n agent-workers ``` **Check scaling:** ```bash # View current replicas kubectl get deployment agent-worker -n agent-workers # View HPA status kubectl get hpa -n agent-workers ``` ## Cleanup ```bash # Delete deployment kubectl delete deployment agent-worker -n agent-workers # Delete namespace (removes everything) kubectl delete namespace agent-workers ``` --- # Monitoring APIs Monitor your worker status and performance using HTTP endpoints. All endpoints are available at `http://localhost:8081`. ## Available Endpoints - **`/health`** - Basic health check - **`/worker`** - Worker status - **`/stats`** - Detailed statistics - **`/debug`** - Configuration info - **`/`** - Web dashboard ## Quick Health Check ```bash curl http://localhost:8081/health ``` **Response:** ``` OK ``` ## Worker Status ```bash curl http://localhost:8081/worker ``` **Response:** ```json { "agent_id": "MyAgent", "active_jobs": 3, "connected": true, "worker_id": "worker-123", "worker_load": 0.3 } ``` ## Detailed Statistics ```bash curl http://localhost:8081/stats ``` **Response:** ```json { "worker_load": 0.3, "current_jobs": 3, "max_processes": 10, "agent_id": "MyAgent", "backend_connected": true, "resource_stats": { "total_resources": 10, "available_resources": 7, "active_resources": 3 } } ``` ## Web Dashboard Open `http://localhost:8081/` in your browser for a visual interface showing: - Real-time worker status - Resource utilization - Active jobs - Performance metrics ## Integration Examples **python:** ```python import requests def 
check_worker_health(): response = requests.get("http://localhost:8081/health") return response.status_code == 200 def get_worker_stats(): response = requests.get("http://localhost:8081/stats") return response.json() # Usage if check_worker_health(): stats = get_worker_stats() print(f"Active jobs: {stats['current_jobs']}") ``` --- **javascript:** ```javascript async function checkWorkerHealth() { const response = await fetch("http://localhost:8081/health"); return response.ok; } async function getWorkerStats() { const response = await fetch("http://localhost:8081/stats"); return response.json(); } // Usage if (await checkWorkerHealth()) { const stats = await getWorkerStats(); console.log(`Active jobs: ${stats.current_jobs}`); } ``` ## Common Use Cases - **Health monitoring**: Use `/health` for load balancer checks - **Performance tracking**: Use `/stats` for resource monitoring - **Debugging**: Use `/debug` to verify configuration - **Visual monitoring**: Use web dashboard for real-time overview --- # Understanding the Worker The **Worker** is the runtime engine that executes your AI agents in production. Think of it as the "server" that runs your agent code and handles multiple conversations simultaneously. ![AI Agent Worker](https://cdn.videosdk.live/website-resources/docs-resources/ai_agent_worker.png) ## What the Worker Does The Worker manages the lifecycle of your AI agents by: - **Executing** your agent code when users start conversations - **Managing** multiple concurrent conversations efficiently - **Connecting** to VideoSDK's backend to receive job assignments - **Monitoring** health and performance automatically - **Scaling** up or down based on demand ## Why Use the Built-in Worker? The VideoSDK Agents framework includes a production-ready Worker that handles all the complex infrastructure concerns, so you can focus on building your AI agent logic. 
**Key Benefits:** - **Production-Ready**: Built for real-world workloads with proper error handling - **Auto-Scaling**: Automatically handles multiple conversations within a single worker - **Health Monitoring**: Built-in health checks and status reporting - **Zero-Downtime**: Graceful shutdown and deployment capabilities --- # Worker Configuration Workers are the execution engines that run your **AI Agent jobs**. Think of them as the bridge between your **agent logic** and the **VideoSDK runtime**. This guide walks you through how to configure and tune a Worker for different environments — from local dev to production. ## Quick Start: Minimal Worker Here’s the simplest Worker setup to get going: ```python from videosdk.agents import WorkerJob, Options, JobContext, RoomOptions options = Options( agent_id="MyAgent", max_processes=5, register=True, # Registers worker with the backend for job scheduling ) room_options = RoomOptions( name="My Agent", ) job_context = JobContext(room_options=room_options) job = WorkerJob( entrypoint=your_agent_function, jobctx=lambda: job_context, options=options, ) job.start() ``` That’s enough to start processing jobs locally or in staging. 
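Once the worker is running, you can confirm it is connected and has spare capacity by polling the local `/worker` endpoint described under Monitoring APIs. A minimal sketch: the endpoint URL and response fields are the ones documented in this guide, while the `worker_is_available` helper and its threshold default (matching the documented `load_threshold` default of `0.75`) are illustrative:

```python
import json
from urllib.request import urlopen

def worker_is_available(status: dict, load_threshold: float = 0.75) -> bool:
    """Return True when the worker is connected and below its load threshold."""
    return bool(status.get("connected")) and status.get("worker_load", 1.0) < load_threshold

def fetch_worker_status(base_url: str = "http://localhost:8081") -> dict:
    """Fetch the /worker status JSON from a locally running worker."""
    with urlopen(f"{base_url}/worker") as response:
        return json.load(response)

# Sample payload shaped like the documented /worker response
sample = {
    "agent_id": "MyAgent",
    "active_jobs": 3,
    "connected": True,
    "worker_id": "worker-123",
    "worker_load": 0.3,
}
print(worker_is_available(sample))  # prints True: connected, and 0.3 < 0.75
```

In a real deployment you would call `worker_is_available(fetch_worker_status())` from a readiness script or a monitoring job rather than against a hard-coded sample.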
## Worker Options Explained The `Options` class gives you fine-grained control over Worker behavior: | Option | Purpose | Example | | -------------------- | ------------------------------------------- | ------------------------------- | | `agent_id` | Unique identifier for your agent | `"SupportBot01"` | | `max_processes` | Maximum concurrent jobs | `10` | | `num_idle_processes` | Pre-warmed processes for faster startup | `2` | | `load_threshold` | Max CPU/Load tolerance before refusing jobs | `0.75` | | `register` | Whether to register with backend | `True` (prod) / `False` (local) | | `log_level` | Logging verbosity | `"DEBUG"`, `"INFO"`, `"ERROR"` | | `host`, `port` | Bind address for health/status endpoints | `"0.0.0.0"`, `8081` | | `memory_warn_mb` | Trigger warning logs at this usage | `500.0` | | `memory_limit_mb` | Hard memory cap (`0` = unlimited) | `1000.0` | | `ping_interval` | Heartbeat interval in seconds | `30.0` | | `max_retry` | Max connection retries before giving up | `16` | ## Example Configurations **standard-production:** **Standard Production** configuration for typical deployments: ```python options = Options( agent_id="StandardAgent", max_processes=5, register=True, log_level="INFO", ) ``` This configuration is suitable for: - Standard production deployments - Moderate traffic loads - Most business applications --- **high-scale-production:** **High-Scale Production** configuration for enterprise workloads: ```python options = Options( agent_id="EnterpriseAgent", max_processes=20, num_idle_processes=5, load_threshold=0.8, memory_limit_mb=2000.0, register=True, log_level="DEBUG", ) ``` This configuration is optimized for: - Enterprise-scale deployments - High concurrent user loads - Advanced monitoring requirements --- **local-development:** **Local Development** configuration for development: ```python options = Options( agent_id="DevAgent", max_processes=1, register=False, # Don't register with backend log_level="DEBUG", host="localhost", 
port=8081, ) ``` This configuration is ideal for: - Local development and testing - Debugging agent behavior - Isolated development environments ## Hosting Environments ## Scaling Your Workers Workers can scale both **vertically** (more power per instance) and **horizontally** (more instances). - **Vertical Scaling** → Increase `max_processes` to run more jobs per worker. - **Horizontal Scaling** → Deploy multiple workers; the backend registry will balance load. - **Idle Processes** → Use `num_idle_processes` to reduce cold start latency. - **Load Threshold** → Tune `load_threshold` (default `0.75`) to prevent overload. - **Memory Safety** → Use `memory_warn_mb` and `memory_limit_mb` to keep processes healthy. ## Pro Tips - **Start small** → Begin with `max_processes=5` and adjust as you observe metrics. - **Log smart** → Use `DEBUG` in dev, but `INFO` or `WARN` in prod to reduce noise. - **Monitor & Auto-Scale** → Pair with metrics (Prometheus, Grafana, CloudWatch, etc.) to auto-scale horizontally. - **Keep processes warm** → Set at least `num_idle_processes=1` in production for faster first-response times. --- # Function Tools Function tools allow your AI agent to perform actions and interact with external services, extending its capabilities beyond simple conversation. By registering function tools, you enable your agent to execute custom logic, call APIs, access databases, and perform various tasks based on user requests. ## Overview Function tools are Python functions decorated with `@function_tool` that your agent can call during conversations. The LLM automatically decides when to use these tools based on the user's request and the tool's description. ## External Tools External tools are defined as standalone functions and passed into the agent's constructor via the `tools` parameter. This approach is useful for sharing common tools across multiple agents. 
```python title="main.py" from videosdk.agents import Agent, function_tool # External tool defined outside the class @function_tool(description="Get weather information for a location") def get_weather(location: str) -> str: """Get weather information for a specific location.""" # Weather logic here return f"Weather in {location}: Sunny, 72°F" class WeatherAgent(Agent): def __init__(self): super().__init__( instructions="You are a weather assistant.", tools=[get_weather] # Register the external tool ) ``` ## Internal Tools Internal tools are defined as methods within your agent class and decorated with `@function_tool`. This approach is useful for logic that is specific to the agent and needs access to its internal state (`self`). ```python title="main.py" from videosdk.agents import Agent, function_tool class FinanceAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful financial assistant." ) self.portfolio = {"AAPL": 10, "GOOG": 5} @function_tool def get_portfolio_value(self) -> dict: """Get the current value of the user's stock portfolio.""" # Access agent state via self return {"total_value": 5000, "holdings": self.portfolio} ``` ## Async Function Tools Function tools can be asynchronous, which is essential for making HTTP requests, performing I/O operations, or integrating with async VideoSDK features. ```python title="main.py" import aiohttp from videosdk.agents import Agent, function_tool class WeatherAgent(Agent): def __init__(self): super().__init__( instructions="You are a weather assistant that can fetch real-time weather data." 
        )

    @function_tool
    async def get_weather_async(self, location: str) -> dict:
        """Fetch real-time weather data from an API."""
        async with aiohttp.ClientSession() as session:
            async with session.get(f"https://api.weather.com/{location}") as response:
                data = await response.json()
                return {
                    "location": location,
                    "temperature": data.get("temp"),
                    "condition": data.get("condition")
                }
```

:::note
**Sarvam AI LLM**: When using Sarvam AI as the LLM option, function tool calls and MCP tools will not work. Consider using alternative LLM providers if you need function tool support.
:::

## Examples - Try Out Yourself

- [Function Tool Example](https://github.com/videosdk-live/agents/blob/main/examples/test_cascading_pipeline.py): Complete example demonstrating internal and external function tools
- [Real-life Use Case](https://github.com/videosdk-live/agents/blob/ee3ced912078c3be9dd62c7576c95c1bbe227bae/examples/a2a/agents/customer_agent.py#L22): A banking customer service agent that uses a function tool to forward loan queries to a specialist agent

---

# Human in the Loop

Human in the Loop (HITL) enables AI agents to escalate specific queries to human operators for review and approval. This implementation uses Discord as the human interface, allowing seamless handoffs between AI automation and human oversight.

## Overview

The HITL system allows AI agents to:

- Handle routine customer inquiries autonomously
- Escalate specific queries (like discount requests) to human operators via Discord
- Receive human responses and relay them back to customers
- Maintain conversation flow while waiting for human input

## Use Cases

- **Discount Requests**: AI escalates pricing queries to human sales agents
- **Complex Support**: Technical issues requiring human expertise
- **Policy Decisions**: Requests that need human approval or clarification
- **Escalation Scenarios**: Situations where AI confidence is low

## Example Overview

The implementation consists of two main components:

1. **Customer Agent**: VideoSDK AI agent that handles customer interactions and escalates specific queries
2. **Discord MCP Server**: MCP server that creates Discord threads for human operator responses

## Example Implementation

### Customer Agent Setup

```python
import pathlib
import sys
from typing import Optional

from videosdk.agents import Agent, JobContext, MCPServerStdio

class CustomerAgent(Agent):
    def __init__(self, ctx: Optional[JobContext] = None):
        current_dir = pathlib.Path(__file__).parent
        discord_mcp_server_path = current_dir / "discord_mcp_server.py"
        super().__init__(
            instructions="You are a customer-facing agent for VideoSDK. You have access to various tools to assist with customer inquiries, provide support, and handle tasks. When a user asks for a discount percentage, always use the appropriate tool to retrieve and provide the accurate answer from your superior human agent.",
            mcp_servers=[
                MCPServerStdio(
                    executable_path=sys.executable,
                    process_arguments=[str(discord_mcp_server_path)],
                    session_timeout=30
                ),
            ]
        )
        self.ctx = ctx
```

### Discord MCP Server

```python
import asyncio
import os

import discord
from discord.ext import commands
from mcp.server.fastmcp import FastMCP

class DiscordHuman:
    def __init__(self, user_id: int, channel_id: int):
        self.user_id = user_id
        self.channel_id = channel_id
        self.bot = commands.Bot(command_prefix="!", intents=discord.Intents.all())
        self.response_future = None

    async def ask(self, question: str) -> str:
        channel = self.bot.get_channel(self.channel_id)
        thread = await channel.create_thread(
            name=question[:100],
            type=discord.ChannelType.public_thread
        )
        await thread.send(f"<@{self.user_id}> {question}")
        # The bot's message handler (see the full example linked below)
        # resolves this future with the human operator's reply.
        self.response_future = asyncio.get_running_loop().create_future()
        try:
            return await asyncio.wait_for(self.response_future, timeout=600)
        except asyncio.TimeoutError:
            return "⏱️ Timed out waiting for a human response"

# MCP Server Setup
mcp = FastMCP("HumanInTheLoopServer")

# Instantiate with the operator and escalation-channel IDs
# (see the environment variables section below)
discord_human = DiscordHuman(
    user_id=int(os.getenv("DISCORD_USER_ID", "0")),
    channel_id=int(os.getenv("DISCORD_CHANNEL_ID", "0")),
)

@mcp.tool(description="Ask a human agent via Discord for a specific user query such as discount percentage, etc.")
async def ask_human(question: str) -> str:
    return await discord_human.ask(question)
```

### Pipeline Configuration

```python
pipeline = CascadingPipeline(
    stt=DeepgramSTT(api_key=os.getenv("DEEPGRAM_API_KEY")),
    llm=AnthropicLLM(api_key=os.getenv("ANTHROPIC_API_KEY")),
    tts=GoogleTTS(api_key=os.getenv("GOOGLE_API_KEY")),
    vad=SileroVAD(),
    turn_detector=TurnDetector(threshold=0.8)
)
```

### Environment Variables

Set the following environment variables:

```bash
DISCORD_TOKEN=your_discord_bot_token
DISCORD_USER_ID=human_operator_user_id
DISCORD_CHANNEL_ID=channel_id_for_escalations
DEEPGRAM_API_KEY=your_deepgram_key
ANTHROPIC_API_KEY=your_anthropic_key
GOOGLE_API_KEY=your_google_key
```

### Example Link

Complete implementation with full source code, setup instructions, and configuration examples is available in the [VideoSDK Agents GitHub repository](https://github.com/videosdk-live/agents/tree/main/examples/human_in_the_loop).

---

# AI Voice Agents

The VideoSDK AI Agent SDK is a powerful Python framework for developers to seamlessly integrate intelligent, real-time voice agents into any application. Bridge the gap between advanced AI models and human interaction, creating natural, engaging, and responsive conversational experiences.
- [AI Voice Agent Quickstart](/ai_agents/voice-agent-quick-start): Build an AI Voice Agent in less than 10 minutes - [AI Telephony Agent Quickstart](/ai_agents/ai-phone-agent-quick-start): Build an AI Telephony Agent in less than 10 minutes - [Github Repository](https://github.com/videosdk-live/agents): The videosdk agent code and examples - [SDK Reference](https://docs.videosdk.live/agent-sdk-reference/agents/): Reference docs for agents framework ## The Architecture The VideoSDK AI Agents framework connects four key components to enable seamless AI voice interactions: - Your **Infrastructure** hosts the agent management system - The **Agent Worker** creates and manages AI sessions - The **VideoSDK Room** handles real-time meeting operations - **User Devices** connect through web, mobile apps, or phone calls to interact with intelligent agents that can listen, understand, and respond naturally in real-time conversations. ![Introduction](https://assets.videosdk.live/images/agent-architecture.png) ## Use Cases Here are some real-world applications where VideoSDK AI Agents can be deployed to create intelligent, voice-enabled experiences across different industries and scenarios. You can use this, or refer this to create your customized agent. - [Multi-Agent System](https://github.com/videosdk-live/agents/tree/main/examples/a2a): See the Agent-to-Agent Protocol in action, where there is a general customer care agent that transfers queries to Loan Specialist Agent for Loan related queries. - [Agent with MCP Server](https://github.com/videosdk-live/agents-quickstart/blob/main/MCP/mcp_stdio_server.py): Integrate MCP (Model Context Protocol) Server with VideoSDK Agents. - [RAG Agent](https://github.com/videosdk-live/agents-quickstart/tree/main/RAG): An example of Retrieval-Augmented Generation (RAG) based agent for knowledge-grounded conversations. 
- [Translator Agent](https://github.com/videosdk-live/agents/blob/main/examples/translator_agent.py): A translator assistant that can speak to users in their own language.
- [WhatsApp Call Agent](https://docs.videosdk.live/ai_agents/whatsapp-voice-agent-quick-start): Streamline your business with a WhatsApp AI call agent for inbound queries or outbound calls.
- [Agent with Wakeup Call](https://github.com/videosdk-live/agents-quickstart/tree/main/Wakeup%20Call): An agent that maintains engagement by triggering automatically after a specified period of inactivity.
- [Virtual Avatar Agent](https://github.com/videosdk-live/agents-quickstart/tree/main/Virtual%20Avatar): Bring your AI agent to life with a virtual avatar that can interact visually during conversations.

## The Building Blocks

Our SDK is built on four primary, modular components that work together to create powerful and customizable agents. Understand these concepts, and you're ready to build.

- [Agent Capabilities](/ai_agents/core-components/overview): Build sophisticated agents with function tools, vision, human-in-the-loop, and agent-to-agent (A2A) communication.
- [Deployment Options](/ai_agents/deployments/introduction): Deploy your agent in the cloud or self-host it on your own infrastructure.
- [Observability](/ai_agents/tracing-observability/session-analytics): Monitor and debug with confidence using our built-in session analytics, latency tracking, and detailed traces.
- [Plugin Ecosystem](/ai_agents/plugins/realtime/openai): Integrate with dozens of providers like OpenAI, Google, Anthropic, and ElevenLabs for STT, LLM, and TTS.

## Need Help?

If you have any queries, please feel free to reach out to us using one of the following methods:

- [Discord](https://discord.com/invite/Gpmj6eCq5u): Join our Discord Community.
- [GitHub](https://github.com/videosdk-live/agents/issues): Ask your questions on GitHub.
- [Support](https://www.videosdk.live/contact): Talk to an expert, book a demo, or talk to sales.

## Frequently Asked Questions

### What programming language and version are required?

The AI Agent SDK is built in Python. You'll need Python 3.12 or higher to use the SDK.

### Can my agent answer phone calls?

Yes. By integrating with our SIP/telephony services, your AI agent can join a room initiated by a standard phone call. This allows you to build powerful IVR systems, automated appointment schedulers, AI-powered call centers, and more.

### What AI models are supported?

The SDK supports various AI models, including:

- **Real-time Models**: OpenAI, Google Gemini, AWS Nova Sonic
- **LLM Providers**: OpenAI, Google Gemini, Anthropic Claude, Sarvam AI, Cerebras
- **TTS Providers**: ElevenLabs, OpenAI, Google, AWS Polly, Cartesia, and many more
- **STT Providers**: OpenAI Whisper, Deepgram, Google, AssemblyAI, and others

### Can I use my own custom models?

Absolutely! The SDK's modular architecture allows you to create custom plugins for any AI provider. Check our [plugin development guide](https://github.com/videosdk-live/agents/blob/main/BUILD_YOUR_OWN_PLUGIN.md) for detailed instructions.

### How is pricing handled for the AI Agent SDK?

VideoSDK offers a free tier with limited usage. The AI Agent SDK itself is open-source, but you'll need API keys for the AI services you choose to use (OpenAI, Google, etc.). Check the [pricing page](https://www.videosdk.live/pricing) for VideoSDK usage limits.

### Can agents handle more than just voice?

Absolutely! Agents support multimodal interactions including vision processing, data messages, and real-time video streams. They can also use function tools to interact with external systems and APIs.

### Is the SDK production-ready?

Yes, the AI Agent SDK is stable and production-ready. It is designed to be self-hosted on your own infrastructure for full control and scalability, from a single server to a Kubernetes cluster.
It includes comprehensive error handling, metrics collection, and deployment flexibility.
---

The Model Context Protocol (MCP) is an open standard that enables AI assistants to securely connect to data sources and tools. With VideoSDK's AI Agents, you can seamlessly integrate MCP servers to extend your agent's capabilities with external services, applications, databases, and APIs.

## MCP Server Types

VideoSDK supports two transport methods for MCP servers:

### 1. STDIO Transport

- Direct process communication
- Local Python scripts
- Best for custom tools and functions
- Ideal for server-side integrations

### 2. HTTP Transport (Streamable HTTP or SSE)

- Network-based communication
- External MCP services
- Best for third-party integrations
- Supports remote MCP servers

## How It Works with VideoSDK's AI Agent

MCP tools are automatically discovered and made available to your agent, which intelligently chooses which tools to use based on user requests. When a user asks for information that requires external data, the agent will:

- Identify the need for external data based on the user's request
- Select appropriate tools from available MCP servers
- Execute the tools with relevant parameters
- Process the results and provide a natural language response

This seamless integration allows your voice agent to access real-time data and external services while maintaining a natural conversational flow.

## Creating an MCP Server

### Basic MCP Server Structure

A simple MCP server using STDIO to return the current time.
First, install the required package:

```bash
pip install fastmcp
```

```python title="mcp_stdio_example.py"
from mcp.server.fastmcp import FastMCP
import datetime

# Create the MCP server
mcp = FastMCP("CurrentTimeServer")

@mcp.tool()
def get_current_time() -> str:
    """Get the current time in the user's location"""
    # Get current time
    now = datetime.datetime.now()
    # Return formatted time string
    return f"The current time is {now.strftime('%H:%M:%S')} on {now.strftime('%Y-%m-%d')}"

if __name__ == "__main__":
    # Run the server with STDIO transport
    mcp.run(transport="stdio")
```

## Integrating MCP with VideoSDK Agent

Now we'll see how to integrate MCP servers with your VideoSDK AI Agent:

```python title="main.py"
import asyncio
import sys
from pathlib import Path
from videosdk.agents import Agent, AgentSession, RealTimePipeline, MCPServerStdio, MCPServerHTTP
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig

class MyVoiceAgent(Agent):
    def __init__(self):
        # Define paths to your MCP servers
        mcp_script = Path(__file__).parent.parent / "MCP_Example" / "mcp_stdio_example.py"
        super().__init__(
            instructions="""You are a helpful assistant with access to real-time data.
            You can provide current time information.
            Always be conversational and helpful in your responses.""",
            mcp_servers=[
                # STDIO MCP Server (local Python script for time)
                MCPServerStdio(
                    executable_path=sys.executable,  # Use current Python interpreter
                    process_arguments=[str(mcp_script)],
                    session_timeout=30
                ),
                # HTTP MCP Server (external service, e.g., Zapier)
                MCPServerHTTP(
                    endpoint_url="https://your-mcp-service.com/api/mcp",
                    session_timeout=30
                )
            ]
        )

    async def on_enter(self) -> None:
        await self.session.say("Hi there! How can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Thank you for using the assistant. 
Goodbye!")

async def main(context: dict):
    # Configure Gemini Realtime model
    model = GeminiRealtime(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        config=GeminiLiveConfig(
            voice="Leda",  # Available voices: Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, Zephyr
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)
    agent = MyVoiceAgent()
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        context=context
    )

    try:
        # Start the session
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()

if __name__ == "__main__":
    def make_context():
        # When VIDEOSDK_AUTH_TOKEN is set in .env - DON'T include videosdk_auth
        return {
            "meetingId": "your_actual_meeting_id_here",  # Replace with actual meeting ID
            "name": "AI Voice Agent",
            "videosdk_auth": "your_videosdk_auth_token_here"  # Replace with actual token
        }

    # Run the agent with the context defined above
    asyncio.run(main(context=make_context()))
```

:::tip
Get started quickly with the [Quick Start Example](https://github.com/videosdk-live/agents-quickstart/tree/main/MCP) for the VideoSDK AI Agent SDK With MCP — everything you need to build your first AI agent fast.
:::

---

# Agents Playground

The Agents Playground provides an interactive testing environment where you can directly communicate with your AI agents during development. This feature enables rapid prototyping, testing, and debugging of your voice AI implementations without needing a separate client application.

## Overview

Playground mode creates a web-based interface that connects directly to your agent session, allowing you to:

- Test your agent in real-time
- Demonstrate agent capabilities to stakeholders

## Enabling Playground Mode

To activate playground mode, simply set `playground: True` in your RoomOptions for JobContext.
### Basic Implementation

```python
from videosdk.agents import RoomOptions, JobContext, WorkerJob

async def entrypoint(ctx: JobContext):
    # Your agent implementation here
    # This is where you create your pipeline, agent, and session
    pass

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",
        name="Test Agent",
        playground=True  # Enable playground mode
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
```

## Accessing the Playground

Once your agent session starts, the playground URL will be displayed in your terminal:

```
Agent started in playground mode
Interact with agent here at:
https://playground.videosdk.live?token={auth_token}&meetingId={meeting_id}
```

### URL Structure

The playground URL follows this format:

```
https://playground.videosdk.live?token={auth_token}&meetingId={meeting_id}
```

Where:

- `auth_token`: The `videosdk_auth` token provided in the session context or `.env` file.
- `meeting_id`: The meeting ID specified in the session context.

**Note**: Playground mode is designed for development and testing purposes. For production deployments, ensure playground mode is disabled to maintain security and performance.

---

# Simli Avatar

The Simli Avatar plugin allows you to integrate a real-time, lip-synced AI avatar into your VideoSDK agent. This creates a more engaging and interactive experience for users by providing a visual representation of the AI agent.

Simli offers two avatar types: Legacy (30 FPS) and Trinity (25 FPS). When creating a `SimliAvatar`, set `is_trinity_avatar=True` if you're using a Trinity avatar (the default is `False`). Always select the correct `faceId` from the Simli dashboard.

## Installation

Install the Simli-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-simli"
```

## Authentication

The Simli plugin requires a [Simli API key](https://app.simli.com/apikey).
Set `SIMLI_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.simli import SimliAvatar, SimliConfig ``` ## Setup Credentials To use Simli, you need an API key. You can get one from the [Simli Dashboard](https://app.simli.com/profile). Set up your credentials by exporting them as an environment variable: ```bash export SIMLI_API_KEY="YOUR_SIMLI_API_KEY" ``` You can also provide a `faceId` if you have a custom one. ```bash export SIMLI_FACE_ID="YOUR_FACE_ID" ``` ## Example Usage Here's how you can integrate the Simli Avatar with both `CascadingPipeline` and `RealTimePipeline`. ### Cascading Pipeline This example shows how to add the Simli Avatar to a `CascadingPipeline`. ```python import os from videosdk.agents import CascadingPipeline from videosdk.plugins.simli import SimliAvatar, SimliConfig # Import other necessary components like STT, LLM, TTS # 1. Initialize SimliConfig simli_config = SimliConfig( apiKey=os.getenv("SIMLI_API_KEY"), faceId=os.getenv("SIMLI_FACE_ID"), # This is optional and has a default value ) # 2. Create a SimliAvatar instance # For Legacy avatars (default) simli_avatar = SimliAvatar(config=simli_config) # For Trinity avatars # simli_avatar = SimliAvatar( # config=simli_config, # is_trinity_avatar=True, # ) # 3. Add the avatar to the pipeline pipeline = CascadingPipeline( # ... stt=stt, llm=llm, tts=tts avatar=simli_avatar ) ``` ### Real-time Pipeline This example shows how to add the Simli Avatar to a `RealTimePipeline`. ```python import os from videosdk.agents import RealTimePipeline from videosdk.plugins.simli import SimliAvatar, SimliConfig # from videosdk.plugins.google import GeminiRealtime # Example model # 1. Initialize SimliConfig simli_config = SimliConfig( apiKey=os.getenv("SIMLI_API_KEY"), ) # 2. 
Create a SimliAvatar instance # For Legacy avatars (default) simli_avatar = SimliAvatar(config=simli_config) # For Trinity avatars # simli_avatar = SimliAvatar( # config=simli_config, # is_trinity_avatar=True, # ) # 3. Add the avatar to the pipeline pipeline = RealTimePipeline( model=your_realtime_model, # e.g., GeminiRealtime() avatar=simli_avatar ) ``` :::note When using an environment variable for credentials, you should still load it in your code using `os.getenv("SIMLI_API_KEY")` and pass it to `SimliConfig`. ::: ## Configuration Options You can customize the avatar's behavior using the `SimliConfig` and `SimliAvatar` classes. ### `SimliConfig` - `faceId`: (str, optional) The ID for the avatar face. You can find available faces in the [Simli Docs](https://docs.simli.com/api-reference/available-faces) or create your own. Defaults to `"0c2b8b04-5274-41f1-a21c-d5c98322efa9"`. - `maxSessionLength`: (int, optional) A hard time limit in seconds after which the session will disconnect. Defaults to `1800` (30 minutes). - `maxIdleTime`: (int, optional) A soft time limit in seconds that disconnects the session after a period of not sending data. Defaults to `300` (5 minutes). ### `SimliAvatar` - `config`: (`SimliConfig`) A `SimliConfig` object with your desired settings. - `is_trinity_avatar`: (bool, optional) Set to `True` when using Trinity avatars. Defaults to `False` for Legacy avatars. ## Additional Resources The following resources provide more information about using Simli with VideoSDK Agents SDK. - **[Simli docs](https://docs.simli.com/overview)**: Simli docs. --- # Denoise The RNNoise plugin enhances audio quality by removing background noise from your audio input, resulting in improved speech-to-text (STT) accuracy and better overall audio processing performance. 
RNNoise is a real-time noise suppression library powered by a recurrent neural network that intelligently filters out environmental noise such as air conditioning, computer fans, and other stationary background sounds while preserving the clarity and quality of speech.

## Installation

Install the RNNoise denoising plugin for the VideoSDK Agents package:

```bash
pip install "videosdk-plugins-rnnoise"
```

## Importing

```python
from videosdk.plugins.rnnoise import RNNoise
```

## Example Usage

```python
from videosdk.plugins.rnnoise import RNNoise
from videosdk.agents import CascadingPipeline

# Initialize the RNNoise plugin
rnnoise = RNNoise()

# Add the denoise plugin to a cascading pipeline
pipeline = CascadingPipeline(denoise=rnnoise)
```

It also works with [`RealTimePipeline`](/ai_agents/core-components/realtime-pipeline.md).

## Example Usage in RealTime Pipeline

```python
from videosdk.plugins.rnnoise import RNNoise
from videosdk.agents import RealTimePipeline

# Initialize the RNNoise plugin
rnnoise = RNNoise()

# Add the denoise plugin to a realtime pipeline
pipeline = RealTimePipeline(denoise=rnnoise)
```

## Benefits

- **Enhanced STT Accuracy**: Cleaner audio input leads to more accurate speech-to-text transcription
- **Real-time Processing**: Processes audio streams with minimal latency for a seamless user experience
- **Intelligent Noise Reduction**: Effectively removes background noise while preserving speech clarity

## Additional Resources

The following resources provide more information about using RNNoise with the VideoSDK Agents SDK.

- **[RNNoise project](https://github.com/xiph/rnnoise)**: The open-source RNNoise library that powers the VideoSDK RNNoise plugin.

---

# Cerebras LLM

The Cerebras AI LLM provider enables your agent to use Cerebras AI's language models for text-based conversations and processing.
## Installation

Install the Cerebras-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-cerebras"
```

## Importing

```python
from videosdk.plugins.cerebras import CerebrasLLM
```

## Authentication

The Cerebras plugin requires a [Cerebras API key](https://cloud.cerebras.ai/). Set `CEREBRAS_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.cerebras import CerebrasLLM
from videosdk.agents import CascadingPipeline

# Initialize the Cerebras LLM model
llm = CerebrasLLM(
    model="llama3.3-70b",
    temperature=0.7,
    max_completion_tokens=1024,
)

# Add llm to cascading pipeline
pipeline = CascadingPipeline(llm=llm)
```

:::note
When using a .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `model`: (str) The Cerebras model to use (default: `"llama3.3-70b"`). Supported models include: `llama3.3-70b`, `llama3.1-8b`, `llama-4-scout-17b-16e-instruct`, `qwen-3-32b`, `deepseek-r1-distill-llama-70b` (private preview)
- `api_key`: (str) Your Cerebras API key. Can also be set via the `CEREBRAS_API_KEY` environment variable.
- `temperature`: (float) Sampling temperature for response randomness (default: `0.7`).
- `tool_choice`: (ToolChoice) Tool selection mode (`"auto"`, `"required"`, `"none"`) (default: `"auto"`).
- `max_completion_tokens`: (int) Maximum number of tokens to generate in the response (optional).
- `top_p`: (float) Nucleus sampling probability (optional).
- `seed`: (int) Random seed for reproducible completions (optional).
- `stop`: (str) Stop sequence that halts generation when encountered (optional).
- `user`: (str) Identifier for the end user triggering the request (optional).

## Additional Resources

The following resources provide more information about using Cerebras with the VideoSDK Agents SDK.
- **[Cerebras docs](https://inference-docs.cerebras.ai/introduction)**: Cerebras documentation. --- # Anthropic LLM The Anthropic AI LLM provider enables your agent to use Anthropic AI's language models for text-based conversations and processing. It also supports vision input capabilities, allowing your agent to analyze and respond to images alongside text with the [supported](https://docs.anthropic.com/en/docs/about-claude/models/overview) models. ## Installation Install the Anthropic-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-anthropic" ``` ## Importing ```python from videosdk.plugins.anthropic import AnthropicLLM ``` ## Authentication The Anthropic plugin requires an [Anthropic API key](https://console.anthropic.com/dashboard). Set `ANTHROPIC_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.anthropic import AnthropicLLM from videosdk.agents import CascadingPipeline # Initialize the Anthropic LLM model llm = AnthropicLLM( model="claude-sonnet-4-20250514", temperature=0.7, max_tokens=1024, ) # Add llm to cascading pipeline pipeline = CascadingPipeline(llm=llm) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `model`: (str) The Anthropic model to use (default: `"claude-sonnet-4-20250514"`). - `api_key`: (str) Your Anthropic API key. Can also be set via the `ANTHROPIC_API_KEY` environment variable. - `base_url`: (str) Optional custom base URL for Claude API (default: `None`). - `temperature`: (float) Sampling temperature for response randomness (default: `0.7`). - `tool_choice`: (ToolChoice) Tool selection mode (`"auto"`, `"required"`, `"none"`) (default: `"auto"`). - `max_tokens`: (int) Maximum number of tokens in the response (default: `1024`). 
- `top_p`: (float) Nucleus sampling probability (optional).
- `top_k`: (int) Top-k sampling parameter (optional).

## Additional Resources

The following resources provide more information about using Anthropic with the VideoSDK Agents SDK.

- **[Anthropic docs](https://docs.anthropic.com/en/docs/intro)**: Anthropic documentation.

---

# Azure OpenAI LLM

The Azure OpenAI LLM provider enables your agent to use Azure OpenAI's language models (like GPT-4o) for text-based conversations and processing. It also supports vision input capabilities, allowing your agent to analyze and respond to images alongside text with the [supported](https://platform.openai.com/docs/models) models.

## Installation

Install the Azure OpenAI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-openai"
```

## Importing

```python
from videosdk.plugins.openai import OpenAILLM
```

## Authentication

The Azure OpenAI plugin requires an [Azure OpenAI API key](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/create-resource?pivots=web-portal). Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, and `OPENAI_API_VERSION` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.openai import OpenAILLM
from videosdk.agents import CascadingPipeline

# Initialize the Azure OpenAI LLM model
llm = OpenAILLM.azure(
    azure_deployment="gpt-4o",
    temperature=0.7,
)

# Add llm to cascading pipeline
pipeline = CascadingPipeline(llm=llm)
```

:::note
When using a .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code.
:::

## Configuration Options

- `azure_deployment`: The Azure OpenAI deployment ID to use (defaults to the model name, e.g., `"gpt-4o"`, `"gpt-4o-mini"`)
- `api_key`: Your Azure OpenAI API key (can also be set via environment variable)
- `azure_endpoint`: Your Azure OpenAI deployment endpoint URL (can also be set via environment variable)
- `api_version`: Your Azure OpenAI API version (can also be set via environment variable)
- `temperature`: (float) Sampling temperature for response randomness (0.0 to 2.0, default: 0.7)
- `tool_choice`: Tool selection mode (e.g., `"auto"`, `"none"`, or a specific tool)
- `max_completion_tokens`: (int) Maximum number of tokens in the completion response

## Additional Resources

The following resources provide more information about using Azure OpenAI with the VideoSDK Agents SDK.

---

# Google LLM

The Google LLM provider enables your agent to use Google's Gemini family of language models for text-based conversations and processing. It also supports vision input capabilities, allowing your agent to analyze and respond to images alongside text with the [supported](https://ai.google.dev/gemini-api/docs/models) models.

## Installation

Install the Google-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-google"
```

## Importing

```python
from videosdk.plugins.google import GoogleLLM
```

## Authentication

The Google plugin requires a [Gemini API key](https://aistudio.google.com/apikey). Set `GOOGLE_API_KEY` in your `.env` file.
## Example Usage ```python from videosdk.plugins.google import GoogleLLM from videosdk.agents import CascadingPipeline # Initialize the Google LLM model llm = GoogleLLM( model="gemini-3-flash-preview", # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-google-api-key", temperature=0.7, tool_choice="auto", max_output_tokens=1000 ) # Add llm to cascading pipeline pipeline = CascadingPipeline(llm=llm) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Vertex AI Integration You can also use Google's Gemini models through Vertex AI. This requires a different authentication and configuration setup. ### Authentication for Vertex AI For Vertex AI, you need to set up Google Cloud credentials. Create a service account, download the JSON key file, and set the path to this file in your environment. ```bash export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/keyfile.json" ``` You should also configure your project ID and location. These can be set as environment variables or directly in the code. If not set, the `project_id` is inferred from the credentials file and the `location` defaults to `us-central1`. ```bash export GOOGLE_CLOUD_PROJECT="your-gcp-project-id" export GOOGLE_CLOUD_LOCATION="your-gcp-location" ``` ### Example Usage with Vertex AI To use Vertex AI, set `vertexai=True` when initializing `GoogleLLM`. You can configure the project and location using `VertexAIConfig`, which will take precedence over environment variables. 
```python
from videosdk.plugins.google import GoogleLLM, VertexAIConfig
from videosdk.agents import CascadingPipeline
# Import other necessary components like STT and TTS
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Initialize GoogleLLM with Vertex AI configuration
llm = GoogleLLM(
    vertexai=True,
    vertexai_config=VertexAIConfig(
        project_id="videosdk",
        location="us-central1"
    )
)

# Add llm to a cascading pipeline
pipeline = CascadingPipeline(
    stt=DeepgramSTT(),  # Example STT
    llm=llm,
    tts=ElevenLabsTTS()  # Example TTS
)
```

## Configuration Options

- `model`: (str) The Google model to use (e.g., `"gemini-3-flash-preview"`, `"gemini-3-pro-preview"`, `"gemini-2.0-flash-001"`) (default: `"gemini-2.0-flash-001"`).
- `api_key`: (str) Your Google API key. Can also be set via the `GOOGLE_API_KEY` environment variable.
- `temperature`: (float) Sampling temperature for response randomness (default: `0.7`).
- `tool_choice`: (ToolChoice) Tool selection mode (`"auto"`, `"required"`, `"none"`) (default: `"auto"`).
- `max_output_tokens`: (int) Maximum number of tokens in the completion response (optional).
- `top_p`: (float) The nucleus sampling probability (optional).
- `top_k`: (int) The top-k sampling parameter (optional).
- `presence_penalty`: (float) Penalizes new tokens based on whether they appear in the text so far (optional).
- `frequency_penalty`: (float) Penalizes new tokens based on their existing frequency in the text so far (optional).

## Additional Resources

The following resources provide more information about using Google with the VideoSDK Agents SDK.

- **[Gemini docs](https://ai.google.dev/gemini-api/docs/models)**: Google Gemini documentation.

---

# OpenAI LLM

The OpenAI LLM provider enables your agent to use OpenAI's language models (like GPT-4o) for text-based conversations and processing.
It also supports vision input capabilities, allowing your agent to analyze and respond to images alongside text with the [supported](https://platform.openai.com/docs/models) models. ## Installation Install the OpenAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-openai" ``` ## Importing ```python from videosdk.plugins.openai import OpenAILLM ``` ## Authentication The OpenAI plugin requires an [OpenAI API key](https://platform.openai.com/api-keys). Set `OPENAI_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.openai import OpenAILLM from videosdk.agents import CascadingPipeline # Initialize the OpenAI LLM model llm = OpenAILLM( model="gpt-4o", # When OPENAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-openai-api-key", temperature=0.7, tool_choice="auto", max_completion_tokens=1000 ) # Add llm to cascading pipeline pipeline = CascadingPipeline(llm=llm) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `model`: The OpenAI model to use (e.g., `"gpt-4o"`, `"gpt-4o-mini"`, `"gpt-3.5-turbo"`) - `api_key`: Your OpenAI API key (can also be set via environment variable) - `base_url`: Custom base URL for OpenAI API (optional) - `temperature`: (float) Sampling temperature for response randomness (0.0 to 2.0, default: 0.7) - `tool_choice`: Tool selection mode (e.g., `"auto"`, `"none"`, or specific tool) - `max_completion_tokens`: (int) Maximum number of tokens in the completion response ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. - **[OpenAI docs](https://platform.openai.com/docs/)**: OpenAI documentation. 
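For a fuller picture, here is a minimal sketch of an end-to-end cascading setup that pairs `OpenAILLM` with STT and TTS plugins, following the same pattern shown on the Google LLM page. The Deepgram and ElevenLabs plugins are illustrative choices, and all API keys are assumed to be set via environment variables:

```python
from videosdk.agents import CascadingPipeline
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Credentials are read from the environment (e.g., OPENAI_API_KEY),
# so no api_key arguments are passed here.
llm = OpenAILLM(
    model="gpt-4o",
    temperature=0.7,
    tool_choice="auto",
    max_completion_tokens=1000,
)

# Wire the LLM into a full speech-to-speech cascading pipeline
pipeline = CascadingPipeline(
    stt=DeepgramSTT(),    # illustrative STT choice
    llm=llm,
    tts=ElevenLabsTTS(),  # illustrative TTS choice
)
```

Swapping providers later only means replacing the `stt`, `llm`, or `tts` argument; the rest of the pipeline stays unchanged.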
---

# Sarvam AI LLM

The Sarvam AI LLM provider enables your agent to use Sarvam AI's language models for text-based conversations and processing.

## Installation

Install the Sarvam AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-sarvamai"
```

## Importing

```python
from videosdk.plugins.sarvamai import SarvamAILLM
```

:::note
When using Sarvam AI as the LLM option, function tool calls and MCP tools will not work.
:::

## Authentication

The Sarvam plugin requires a [Sarvam API key](https://dashboard.sarvam.ai/key-management). Set `SARVAMAI_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.sarvamai import SarvamAILLM
from videosdk.agents import CascadingPipeline

# Initialize the Sarvam AI LLM model
llm = SarvamAILLM(
    model="sarvam-m",
    # When SARVAMAI_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-sarvam-ai-api-key",
    temperature=0.7,
    tool_choice="auto",
    max_completion_tokens=1000
)

# Add llm to cascading pipeline
pipeline = CascadingPipeline(llm=llm)
```

:::note
When using a .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `model`: (str) The Sarvam AI model to use (default: `"sarvam-m"`).
- `api_key`: (str) Your Sarvam AI API key. Can also be set via the `SARVAMAI_API_KEY` environment variable.
- `temperature`: (float) Sampling temperature for response randomness (default: `0.7`).
- `tool_choice`: (ToolChoice) Tool selection mode (default: `"auto"`).
- `max_completion_tokens`: (int) Maximum number of tokens in the completion response (optional).

## Additional Resources

The following resources provide more information about using Sarvam AI with the VideoSDK Agents SDK.

- **[Sarvam docs](https://docs.sarvam.ai/)**: Sarvam's full docs site.
--- # Namo Turn Detector The Namo Turn Detector v1 utilizes a custom fine-tuned model from VideoSDK to accurately determine whether a user has finished speaking. This allows for precise management of conversation flow, especially in cascading pipeline setups. It can operate as a multilingual model or be configured for a specific language for optimized performance. ## Installation Install the Turn Detector-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-turn-detector" ``` ## Importing ```python from videosdk.plugins.turn_detector import NamoTurnDetectorV1 ``` ## Example Usage **1. For a specific language (e.g., English):** ```python from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model from videosdk.agents import CascadingPipeline # Pre-download the English model to avoid delays pre_download_namo_turn_v1_model(language="en") # Initialize the Turn Detector for English turn_detector = NamoTurnDetectorV1( language="en", threshold=0.7 ) # Add the Turn Detector to a cascading pipeline pipeline = CascadingPipeline(turn_detector=turn_detector) ``` **2. For multilingual support:** If you don't specify a language, the detector will default to the multilingual model, which can handle various languages. ```python from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model from videosdk.agents import CascadingPipeline # Pre-download the multilingual model pre_download_namo_turn_v1_model() # Initialize the multilingual Turn Detector turn_detector = NamoTurnDetectorV1( threshold=0.7 ) # Add the Turn Detector to a cascading pipeline pipeline = CascadingPipeline(turn_detector=turn_detector) ``` ## Configuration Options - `language`: (Optional, `str`): Specifies the language for the turn detection model. If left as `None` (the default), it loads a multilingual model capable of handling all supported languages. 
- `threshold`: (float) Confidence threshold for turn completion detection (0.0 to 1.0, default: `0.7`)

## Supported Languages

The `NamoTurnDetectorV1` supports a wide range of languages when you specify the corresponding language code. If no language is specified, the multilingual model will be used. Here is a list of the supported languages and their codes:

| Language | Code |
| :--- | :--- |
| Arabic | `ar` |
| Bengali | `bn` |
| Chinese | `zh` |
| Danish | `da` |
| Dutch | `nl` |
| English | `en` |
| Finnish | `fi` |
| French | `fr` |
| German | `de` |
| Hindi | `hi` |
| Indonesian | `id` |
| Italian | `it` |
| Japanese | `ja` |
| Korean | `ko` |
| Marathi | `mr` |
| Norwegian | `no` |
| Polish | `pl` |
| Portuguese | `pt` |
| Russian | `ru` |
| Spanish | `es` |
| Turkish | `tr` |
| Ukrainian | `uk` |
| Vietnamese | `vi` |

## Pre-downloading Model

To avoid delays during agent initialization, you can pre-download the Hugging Face model before the agent runs.

Pre-download a specific language model:

```python
from videosdk.plugins.turn_detector import pre_download_namo_turn_v1_model

# Download the English model before the agent runs
pre_download_namo_turn_v1_model(language="en")
```

Or pre-download the multilingual model:

```python
from videosdk.plugins.turn_detector import pre_download_namo_turn_v1_model

# Download the multilingual model
pre_download_namo_turn_v1_model()
```

---

# AWS Nova Sonic

The AWS Nova Sonic provider enables your agent to use Amazon's Nova Sonic model for real-time, speech-to-speech AI interactions.

## Prerequisites

Before you start using AWS Nova Sonic with the VideoSDK AI Agent, ensure the following:

- `AWS Account`: You have an active AWS account with permissions to access Amazon Bedrock.
- `Model Access`: You've requested and obtained access to the Amazon Nova Sonic model via the Amazon Bedrock console.
- `Region Selection`: You're operating in the US East (N. Virginia) (`us-east-1`) region, as model access is region-specific.
- `AWS Credentials`: Your AWS credentials (`aws_access_key_id` and `aws_secret_access_key`) are configured, either through environment variables or your preferred credential management method.

## Installation

Install the AWS-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-aws"
```

## Authentication

The Amazon Nova Sonic plugin requires [AWS credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html). Set the following environment variables in your `.env` file:

```shell
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_DEFAULT_REGION=
```

## Importing

```python
from videosdk.plugins.aws import NovaSonicRealtime, NovaSonicConfig
```

## Example Usage

```python
from videosdk.plugins.aws import NovaSonicRealtime, NovaSonicConfig
from videosdk.agents import RealTimePipeline

# Initialize the Nova Sonic real-time model
model = NovaSonicRealtime(
    model="amazon.nova-sonic-v1:0",
    # When AWS credentials and region are set in .env - DON'T pass credential parameters
    region="us-east-1",  # Currently, only "us-east-1" is supported for Amazon Nova Sonic.
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
    config=NovaSonicConfig(
        voice="tiffany",  # "tiffany", "matthew", "amy"
        temperature=0.7,
        top_p=0.9,
        max_tokens=1024
    )
)

# Create the pipeline with the model
pipeline = RealTimePipeline(model=model)
```

:::note
When using a .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key`, `videosdk_auth`, and other credential parameters from your code.
:::

:::note
To initiate a conversation with Amazon Nova Sonic, the user must speak first. The model listens for user input to begin the interaction.
::: ## See it in Action Explore a complete, end-to-end implementation of an agent using this provider in our [AI Agent Quickstart Guide](https://docs.videosdk.live/ai_agents/voice-agent-quick-start). ## Configuration Options - `model`: The Amazon Nova Sonic model to use (e.g., "amazon.nova-sonic-v1:0"). - `region`: AWS region where the model is hosted (e.g., "us-east-1"). - `aws_access_key_id`: Your AWS access key ID. - `aws_secret_access_key`: Your AWS secret access key. - `config`: A NovaSonicConfig object for advanced options: - `voice`: (str or None) The voice to use for audio output (e.g., "matthew", "tiffany", "amy"). - `temperature`: (float or None) Sampling temperature for response randomness. - `top_p`: (float or None) Nucleus sampling probability. - `max_tokens`: (int or None) Maximum number of tokens in the output ## Additional Resources The following resources provide more information about using AWS Nova Sonic with VideoSDK Agents SDK. - **[Plugin quickstart](https://github.com/videosdk-live/agents-quickstart/blob/main/Realtime%20Pipeline/AWS%20Nova%20Sonic/aws_novasonic_agent_quickstart.py)**: Quickstart for the AWS Nova Sonic API plugin. - **[AWS Nova Sonic docs](https://docs.aws.amazon.com/nova/latest/userguide/speech.html)**: AWS Nova Sonic documentation. --- # Azure Voice Live API (Beta) The Azure Voice Live API provider enables your agent to use Microsoft's comprehensive speech-to-speech solution for low-latency, high-quality voice interactions. This unified API eliminates the need to manually orchestrate multiple components by integrating speech recognition, generative AI, and text-to-speech into a single interface. :::note Preview Feature This feature is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. 
For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/voice-live). ::: ## Installation Install the Azure-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-azure" ``` ## Authentication The Azure Voice Live plugin requires an Azure AI Services resource with Cognitive Services endpoint. **Setup Steps:** 1. Create an AI Services resource for Speech in the [Azure portal](https://portal.azure.com) or from [Azure AI Foundry](https://ai.azure.com/foundryProject/overview) 2. Get the AI Services resource endpoint and primary key. After your resource is deployed, select "Go to resource" to view and manage keys Set `AZURE_VOICE_LIVE_ENDPOINT` and `AZURE_VOICE_LIVE_API_KEY` in your `.env` file: ```bash AZURE_VOICE_LIVE_ENDPOINT=your-azure-ai-service-endpoint AZURE_VOICE_LIVE_API_KEY=your-azure-ai-service-primary-key ``` ## Importing ```python from videosdk.plugins.azure import AzureVoiceLive, AzureVoiceLiveConfig from videosdk.agents import RealTimePipeline ``` ## Example Usage ```python from videosdk.plugins.azure import AzureVoiceLive, AzureVoiceLiveConfig from videosdk.agents import RealTimePipeline # Configure the Voice Live API settings config = AzureVoiceLiveConfig( voice="en-US-EmmaNeural", # Azure neural voice temperature=0.7, turn_detection_timeout=1000, enable_interruption=True ) # Initialize the Azure Voice Live model model = AzureVoiceLive( # When environment variables are set in .env - DON'T pass credentials # api_key="your-azure-speech-key", model="gpt-4o-realtime-preview", config=config ) # Create the pipeline with the model pipeline = RealTimePipeline(model=model) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key`, `speech_region`, and other credential parameters from your code. 
::: :::note To initiate a conversation with Azure Voice Live, the user must speak first. The model listens for user input to begin the interaction. ::: ## Configuration Options - `model`: The Voice Live model to use (e.g., `"gpt-4o-realtime-preview"`, `"gpt-4o-mini-realtime-preview"`) - `api_key`: Your Azure Speech API key (can also be set via environment variable) - `speech_region`: Your Azure Speech region (can also be set via environment variable) - `credential`: Azure DefaultAzureCredential for authentication (alternative to API key) - `config`: An `AzureVoiceLiveConfig` object for advanced options: - `voice`: (str) The Azure neural voice to use (e.g., `"en-US-EmmaNeural"`, `"hi-IN-AnanyaNeural"`) - `temperature`: (float) Sampling temperature for response randomness (default: 0.7) - `turn_detection_timeout`: (int) Timeout for turn detection in milliseconds - `enable_interruption`: (bool) Allow users to interrupt the agent during speech - `noise_suppression`: (bool) Enable noise suppression for clearer audio - `echo_cancellation`: (bool) Enable echo cancellation - `phrase_list`: (List[str]) Custom phrases for improved recognition accuracy ## See it in Action Explore a complete, end-to-end implementation of an agent using this provider in our [AI Agent Quickstart Guide](https://docs.videosdk.live/ai_agents/voice-agent-quick-start). ## Additional Resources The following resources provide more information about using Azure Voice Live with VideoSDK Agents SDK. - **[Azure Voice Live API Documentation](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/voice-live)**: Complete Azure Voice Live API documentation. - **[Azure Speech Service Overview](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/overview)**: Overview of Azure Speech services. --- # Google Gemini (LiveAPI) The Google Gemini (Live API) provider allows your agent to leverage Google's Gemini models for real-time, multimodal AI interactions. 
## Installation

Install the Gemini-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-google"
```

## Authentication

The Google plugin requires a [Gemini API key](https://aistudio.google.com/apikey). Set `GOOGLE_API_KEY` in your `.env` file.

## Importing

```python
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
```

## Example Usage

```python
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from videosdk.agents import RealTimePipeline

# Initialize the Gemini real-time model
model = GeminiRealtime(
    model="gemini-2.5-flash-native-audio-preview-12-2025",
    # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-google-api-key",
    config=GeminiLiveConfig(
        voice="Leda",  # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr.
        response_modalities=["AUDIO"]
    )
)

# Create the pipeline with the model
pipeline = RealTimePipeline(model=model)
```

:::note
When using a .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key`, `videosdk_auth`, and other credential parameters from your code.
:::

## Vertex AI Integration

You can also use Google's Gemini models through Vertex AI. This requires a different authentication and configuration setup.

### Authentication for Vertex AI

For Vertex AI, you need to set up Google Cloud credentials. Create a service account, download the JSON key file, and set the path to this file in your environment.

```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/keyfile.json"
```

You should also configure your project ID and location. These can be set as environment variables or directly in the code. If not set, the `project_id` is inferred from the credentials file and the `location` defaults to `us-central1`.
```bash
export GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
export GOOGLE_CLOUD_LOCATION="your-gcp-location"
```

### Example Usage with Vertex AI

To use Vertex AI, set `vertexai=True` when initializing `GeminiRealtime`. You can configure the project and location using `VertexAIConfig`, which will take precedence over environment variables.

```python
from videosdk.plugins.google import GeminiRealtime, VertexAIConfig
from videosdk.agents import RealTimePipeline

# Initialize the Gemini real-time model with Vertex AI configuration
model = GeminiRealtime(
    model="gemini-live-2.5-flash-native-audio",
    vertexai=True,
    vertexai_config=VertexAIConfig(
        project_id="videosdk",
        location="us-central1"
    )
)

# Create the pipeline with the model
pipeline = RealTimePipeline(model=model)
```

## Vision Support

Google Gemini Live can also accept a video stream directly from the VideoSDK room. To enable this, turn on your camera and set the `vision` flag to `True` in the room options. Once that's done, start your agent as usual; no additional changes are required in the pipeline.

```python
from videosdk.agents import AgentSession, JobContext, RealTimePipeline, RoomOptions

pipeline = RealTimePipeline(model=model)
session = AgentSession(
    agent=my_agent,
    pipeline=pipeline,
)
job_context = JobContext(
    room_options=RoomOptions(
        room_id="YOUR_ROOM_ID",
        name="Agent",
        vision=True
    )
)
```

- `vision` (bool, room options): when `True`, forwards the video stream from the VideoSDK room to Gemini's Live API (defaults to `False`).

## See it in Action

Explore a complete, end-to-end implementation of an agent using this provider in our [AI Agent Quickstart Guide](https://docs.videosdk.live/ai_agents/voice-agent-quick-start).

## Configuration Options

- `model`: The Gemini model to use (e.g., `"gemini-2.5-flash-native-audio-preview-12-2025"`). Other supported models include: `"gemini-2.5-flash-preview-native-audio-dialog"` and `"gemini-2.5-flash-exp-native-audio-thinking-dialog"`.
- `api_key`: Your Google API key (can also be set via environment variable)
- `config`: A `GeminiLiveConfig` object for advanced options:
  - `voice`: (str or None) The voice to use for audio output (e.g., `"Puck"`).
  - `language_code`: (str or None) The language code for the conversation (e.g., `"en-US"`).
  - `temperature`: (float or None) Sampling temperature for response randomness.
  - `top_p`: (float or None) Nucleus sampling probability.
  - `top_k`: (float or None) Top-k sampling for response diversity.
  - `candidate_count`: (int or None) Number of candidate responses to generate.
  - `max_output_tokens`: (int or None) Maximum number of tokens in the output.
  - `presence_penalty`: (float or None) Penalty for introducing new topics.
  - `frequency_penalty`: (float or None) Penalty for repeating tokens.
  - `response_modalities`: (List[str] or None) List of enabled output modalities (e.g., `["TEXT"]` or `["AUDIO"]`, one at a time).
  - `output_audio_transcription`: (`AudioTranscriptionConfig` or None) Configuration for audio output transcription.

## Additional Resources

The following resources provide more information about using Google with VideoSDK Agents SDK.

- **[Gemini docs](https://ai.google.dev/gemini-api/docs/live)**: Gemini Live API documentation.

---

# OpenAI

The OpenAI provider enables your agent to use OpenAI's real-time models (like GPT-4o) for text and audio interactions.

## Installation

Install the OpenAI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-openai"
```

## Authentication

The OpenAI plugin requires an [OpenAI API key](https://platform.openai.com/api-keys). Set `OPENAI_API_KEY` in your `.env` file.
## Importing ```python from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig ``` ## Example Usage ```python from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig from videosdk.agents import RealTimePipeline from openai.types.beta.realtime.session import TurnDetection # Initialize the OpenAI real-time model model = OpenAIRealtime( model="gpt-realtime-2025-08-28", # When OPENAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-openai-api-key", config=OpenAIRealtimeConfig( voice="alloy", # alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, and verse modalities=["text", "audio"], turn_detection=TurnDetection( type="server_vad", threshold=0.5, prefix_padding_ms=300, silence_duration_ms=200, ), tool_choice="auto" ) ) # Create the pipeline with the model pipeline = RealTimePipeline(model=model) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## See it in Action Explore a complete, end-to-end implementation of an agent using this provider in our [AI Agent Quickstart Guide](https://docs.videosdk.live/ai_agents/voice-agent-quick-start). ## Configuration Options - `model`: The OpenAI model to use (e.g., `"gpt-realtime-2025-08-28"`) - `api_key`: Your OpenAI API key (can also be set via environment variable) - `config`: An `OpenAIRealtimeConfig` object for advanced options: - `voice`: (str) The voice to use for audio output (e.g., `"alloy"`). - `temperature`: (float) Sampling temperature for response randomness. - `turn_detection`: (`TurnDetection` or None) Configure how the agent detects when a user has finished speaking. - `input_audio_transcription`: (`InputAudioTranscription` or None) Configure audio-to-text (e.g., Whisper). - `tool_choice`: (str or None) Tool selection mode (e.g., `"auto"`). 
- `modalities`: (list[str]) List of enabled modalities (e.g., `["text", "audio"]`). ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. - **[Plugin quickstart](https://github.com/videosdk-live/agents-quickstart/tree/main/Realtime%20Pipeline/OpenAI)**: Quickstart for the OpenAI Realtime API plugin. - **[OpenAI docs](https://platform.openai.com/docs/guides/realtime)**: OpenAI Realtime API documentation. --- # Ultravox The Ultravox provider enables your agent to use Ultravox's models for real-time, conversational AI interactions. ## Installation Install the Ultravox-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-ultravox" ``` ## Authentication The Ultravox plugin requires an [Ultravox API key](https://app.ultravox.ai/). Set the `ULTRAVOX_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.ultravox import UltravoxRealtime, UltravoxLiveConfig ``` ## Example Usage ```python from videosdk.plugins.ultravox import UltravoxRealtime, UltravoxLiveConfig from videosdk.agents import RealTimePipeline # Initialize the Ultravox real-time model model = UltravoxRealtime( model="fixie-ai/ultravox", # When ULTRAVOX_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-ultravox-api-key", config=UltravoxLiveConfig( voice="54ebeae1-88df-4d66-af13-6c41283b4332" ) ) # Create the pipeline with the model pipeline = RealTimePipeline(model=model) ``` :::note When using a `.env` file for credentials, you do not need to pass the `api_key` as an argument to the model instance; the SDK reads it automatically. ::: ## Key Features - **Real-time Interactions**: Utilize Ultravox's powerful models for low-latency voice conversations. - **Function Calling**: Empower your agent to perform actions like retrieving weather data or calling external APIs. - **Custom Agent Behaviors**: Define a unique personality and interaction style for your agent through system prompts. 
- **Call Control**: Agents can manage the conversation flow and gracefully terminate calls.
- **MCP Integration**: Connect to external tools and data sources using the Model Context Protocol (MCP) via `MCPServerStdio` for local processes or `MCPServerHTTP` for remote services.

## Configuration Options

- `model`: The Ultravox model to use (e.g., `"fixie-ai/ultravox"`).
- `api_key`: Your Ultravox API key (can also be set via the `ULTRAVOX_API_KEY` environment variable).
- `config`: An `UltravoxLiveConfig` object for advanced options:
  - `voice`: (str) The voice ID for the synthesized speech.
  - `language_hint`: (str) A hint for the conversation's language (e.g., `"en"`).
  - `temperature`: (float) Controls the randomness of responses (0.0 to 1.0).
  - `vad_turn_endpoint_delay`: (int) Delay in milliseconds for voice activity detection to determine the end of a turn.
  - `vad_minimum_turn_duration`: (int) The minimum duration in milliseconds for a valid speech turn.

---

# xAI (Grok)

The xAI (Grok) provider enables your agent to use xAI's powerful Grok models for real-time, multimodal AI interactions.

## Installation

Install the xAI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-xai"
```

## Authentication

The xAI plugin requires an [xAI API key](https://console.x.ai). Set `XAI_API_KEY` in your `.env` file.
## Importing ```python from videosdk.plugins.xai import XAIRealtime, XAIRealtimeConfig ``` ## Example Usage ```python from videosdk.plugins.xai import XAIRealtime, XAIRealtimeConfig from videosdk.agents import RealTimePipeline # Initialize the xAI Grok real-time model model = XAIRealtime( model="grok-4-1-fast-non-reasoning", # When XAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-xai-api-key", config=XAIRealtimeConfig( voice="Eve", # collection_id="your-collection-id" # Optional ) ) # Create the pipeline with the model pipeline = RealTimePipeline(model=model) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit the `api_key` parameter from your code. ::: ## Key Features - **Multi-modal Interactions**: Utilize xAI's powerful Grok models for voice and text. - **Function Calling**: Define custom tools to retrieve weather data, interact with external APIs, or perform other actions. - **Web Search**: Enable real-time web search capabilities by setting `enable_web_search=True`. - **X Search**: Access X (formerly Twitter) content by setting `enable_x_search=True` and providing `allowed_x_handles`. ## Configuration Options - `model`: The Grok model to use (e.g., `"grok-4-1-fast-non-reasoning"`). - `api_key`: Your xAI API key (can also be set via the `XAI_API_KEY` environment variable). - `config`: An `XAIRealtimeConfig` object for advanced options: - `voice`: (str) The voice to use for audio output (e.g., `"Eve"`, `"Ara"`, `"Rex"`, `"Sal"`, `"Leo"`). - `enable_web_search`: (bool) Enable or disable web search capabilities. - `enable_x_search`: (bool) Enable or disable search on X (Twitter). - `allowed_x_handles`: (List[str]) A list of allowed X handles to search within. - `collection_id`: (str, optional) The ID of a custom collection from your xAI Console storage to provide additional context. 
- `turn_detection`: Configuration for detecting when a user has finished speaking.

## Collection Storage

xAI Grok supports using "collections" to provide additional context to your agent, grounding its responses in your own documents or data. To use a collection:

1. **Navigate to xAI Console**: Go to your [console.x.ai](https://console.x.ai) dashboard.
2. **Access Storage**: Click on the **Storage** section in the sidebar.
3. **Create New Collection**: Click the "Create New Collection" button.
4. **Upload Files**: Upload your relevant documents or data files to the new collection.
5. **Get Collection ID**: Once the collection is created, copy its **Collection ID**.
6. **Use in Config**: Pass the copied ID to your agent's configuration:

```python
config=XAIRealtimeConfig(
    voice="Eve",
    collection_id="your-collection-id-from-console",
    # ... other config options
)
```

The agent will now use the content of this collection to inform its responses.

---

# Silero VAD

The Silero VAD (Voice Activity Detection) provider enables your agent to detect when users start and stop speaking. When added to a cascading pipeline, it automatically enables interrupt functionality, allowing users to interrupt the agent mid-response.
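The interrupt behavior rests on three knobs: `threshold` decides which audio frames count as speech, while `min_speech_duration` and `min_silence_duration` add hysteresis so brief blips don't open or close a speech segment. The toy detector below operates on per-frame speech probabilities in plain Python; it illustrates how these parameters interact and is not Silero's neural model:

```python
def detect_segments(probs, frame_s=0.05, threshold=0.3,
                    min_speech_duration=0.1, min_silence_duration=0.75):
    """Toy VAD hysteresis: a segment opens once voiced frames cover
    min_speech_duration and closes only after min_silence_duration
    of unvoiced frames. Returns (start, end) times in seconds."""
    segments, state, run, start = [], "silence", 0, None
    for i, p in enumerate(probs):
        voiced = p >= threshold
        if state == "silence":
            run = run + 1 if voiced else 0
            if run * frame_s >= min_speech_duration:
                state, start, run = "speech", (i - run + 1) * frame_s, 0
        else:
            run = run + 1 if not voiced else 0
            if run * frame_s >= min_silence_duration:
                segments.append((start, (i - run + 1) * frame_s))
                state, run = "silence", 0
    if state == "speech":  # close a segment still open at end of audio
        segments.append((start, len(probs) * frame_s))
    return segments

# A lone voiced frame (0.05 s) is ignored; four voiced frames (0.2 s) count.
print(detect_segments([0.9] + [0.0] * 20))      # → []
print(detect_segments([0.9] * 4 + [0.0] * 20))  # one segment from 0.0 s to 0.2 s
```

Raising `min_silence_duration` makes the agent wait longer before treating a pause as the end of the user's turn, which trades responsiveness for fewer false interruptions.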
## Installation Install the Silero VAD-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-silero" ``` ## Importing ```python from videosdk.plugins.silero import SileroVAD ``` ## Example Usage ```python from videosdk.plugins.silero import SileroVAD from videosdk.agents import CascadingPipeline # Initialize the Silero VAD vad = SileroVAD( input_sample_rate=48000, model_sample_rate=16000, threshold=0.3, min_speech_duration=0.1, min_silence_duration=0.75, prefix_padding_duration=0.3 ) # Add VAD to cascading pipeline - automatically enables interrupts pipeline = CascadingPipeline(vad=vad) ``` ## Configuration Options - `input_sample_rate`: (int) Sample rate of input audio in Hz (default: `48000`) - `model_sample_rate`: (Literal[8000, 16000]) Model's expected sample rate (default: `16000`) - `threshold`: (float) Voice activity detection sensitivity (0.0 to 1.0, default: `0.3`) - `min_speech_duration`: (float) Minimum speech duration to trigger detection in seconds (default: `0.1`) - `min_silence_duration`: (float) Minimum silence duration to end speech detection in seconds (default: `0.75`) - `max_buffered_speech`: (float) Maximum speech buffer duration in seconds (default: `60.0`) - `force_cpu`: (bool) Force CPU usage instead of GPU acceleration (default: `True`) - `prefix_padding_duration`: (float) Audio padding before speech detection in seconds (default: `0.3`) ## Additional Resources The following resources provide more information about using Silero VAD with VideoSDK Agents SDK. - **[Silero VAD project](https://github.com/snakers4/silero-vad)**: The open source VAD model that powers the VideoSDK Silero VAD plugin. --- # AssemblyAI STT The AssemblyAI STT provider enables your agent to use AssemblyAI's real-time WebSocket API for fast and accurate speech-to-text conversion. 
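Real-time STT services of this kind typically emit interim (partial) hypotheses that are later replaced by final transcripts. The sketch below shows the usual pattern for assembling text from such a stream; the event dicts are hypothetical illustrations, not AssemblyAI's actual payload format:

```python
def collect_transcript(events):
    """Assemble a transcript from a stream of hypothetical STT events.
    Interim events overwrite the current partial; final events are committed."""
    committed, partial = [], ""
    for ev in events:
        if ev["type"] == "final":
            committed.append(ev["text"])
            partial = ""
        else:  # "interim": best guess so far, may be revised
            partial = ev["text"]
    return " ".join(committed + ([partial] if partial else []))

events = [
    {"type": "interim", "text": "hello"},
    {"type": "interim", "text": "hello wor"},
    {"type": "final", "text": "hello world"},
    {"type": "interim", "text": "how are"},
]
print(collect_transcript(events))  # → hello world how are
```

The key point is that interim results are displayed but never accumulated; only final results are appended, which is why interim transcripts can be safely shown to users and then revised.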
## Installation

Install the AssemblyAI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-assemblyai"
```

## Authentication

The AssemblyAI plugin requires an [AssemblyAI API key](https://www.assemblyai.com/dashboard/docs/your-api-key). Set `ASSEMBLYAI_API_KEY` in your `.env` file.

## Importing

```python
from videosdk.plugins.assemblyai import AssemblyAISTT
```

## Example Usage

```python
from videosdk.plugins.assemblyai import AssemblyAISTT
from videosdk.agents import CascadingPipeline

# Initialize the AssemblyAI STT model
stt = AssemblyAISTT(
    # When ASSEMBLYAI_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-assemblyai-api-key",
    language_code="en_us"
)

# Add stt to cascading pipeline
pipeline = CascadingPipeline(stt=stt)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `api_key`: Your AssemblyAI API key (required, can also be set via the `ASSEMBLYAI_API_KEY` environment variable).
- `language_code`: The language code for transcription (e.g., `"en_us"`, `"es"`).

## Additional Resources

The following resources provide more information about using AssemblyAI with the VideoSDK Agents SDK.

- **[AssemblyAI Docs](https://www.assemblyai.com/docs/guides/speech-to-text/real-time-streaming-transcription)**: AssemblyAI's official real-time streaming transcription documentation.

---

# Azure STT

The Azure STT provider enables your agent to use Microsoft Azure's advanced speech-to-text models for high-accuracy, real-time audio transcription with support for multiple languages and custom phrase lists.

## Installation

Install the Azure-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-azure"
```

## Importing

```python
from videosdk.plugins.azure import AzureSTT
```

## Authentication

The Azure STT plugin requires an Azure AI Speech Service resource.

**Setup Steps:** 1.
Create an AI Services resource for Speech in the [Azure portal](https://portal.azure.com) or from [Azure AI Foundry](https://ai.azure.com/foundryProject/overview) 2. Get the Speech resource key and region. After your Speech resource is deployed, select "Go to resource" to view and manage keys Set `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` in your `.env` file: ```bash AZURE_SPEECH_KEY=your-azure-speech-key AZURE_SPEECH_REGION=your-azure-region ``` ## Example Usage ```python from videosdk.plugins.azure import AzureSTT from videosdk.agents import CascadingPipeline # Initialize the Azure STT model stt = AzureSTT( language="en-US", sample_rate=16000, enable_phrase_list=True, phrase_list=["VideoSDK", "artificial intelligence", "machine learning"] ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using environment variables for credentials, don't pass the `speech_key` and `speech_region` as arguments to the model instance. The SDK automatically reads the environment variables. ::: ## Configuration Options - `speech_key`: (Optional[str]) Azure Speech API key. Uses `AZURE_SPEECH_KEY` environment variable if not provided. - `speech_region`: (Optional[str]) Azure Speech region (e.g., `"eastus"`, `"westus2"`). Uses `AZURE_SPEECH_REGION` environment variable if not provided. - `language`: (str) The language code for transcription (default: `"en-US"`). See [supported languages](https://learn.microsoft.com/en-us/globalization/locale/standard-locale-names). - `sample_rate`: (int) The target audio sample rate in Hz for transcription (default: `16000`). The input audio at 48000Hz will be resampled to this rate. - `enable_phrase_list`: (bool) Whether to enable phrase list for better recognition accuracy (default: `False`). - `phrase_list`: (Optional[List[str]]) List of phrases to boost recognition for domain-specific terms (default: `None`). 
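The `sample_rate` option above notes that 48000 Hz input audio is resampled to the target rate. Conceptually, 48 kHz to 16 kHz is a 3:1 rate change; the toy decimator below illustrates why an integer ratio is convenient. It is not the SDK's resampler, which would also low-pass filter to avoid aliasing:

```python
def decimate(samples, in_rate=48000, out_rate=16000):
    """Naive decimation illustrating a 48 kHz -> 16 kHz rate change.
    (A real resampler low-pass filters first to avoid aliasing.)"""
    if in_rate % out_rate != 0:
        raise ValueError("integer ratio required for plain decimation")
    step = in_rate // out_rate  # 48000 // 16000 == 3: keep every 3rd sample
    return samples[::step]

one_ms = list(range(48))      # 1 ms of audio at 48 kHz
print(len(decimate(one_ms)))  # → 16
```

Non-integer ratios (e.g., 44100 Hz to 16000 Hz) require interpolation, which is one reason pipelines standardize on 48 kHz capture.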
## Additional Resources

The following resources provide more information about using Azure with VideoSDK Agents SDK.

- **[Azure Speech Service Overview](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/overview)**: Complete overview of Azure Speech services.
- **[Azure STT docs](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/index-speech-to-text)**: Azure Speech-to-Text documentation.
- **[Getting Started Guide](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speech-to-text?tabs=macos&pivots=programming-language-python#prerequisites)**: Azure STT setup and prerequisites.

---

# Azure OpenAI STT

The Azure OpenAI STT provider enables your agent to use Azure OpenAI's speech-to-text models (like Whisper) for converting audio input to text.

## Installation

Install the Azure OpenAI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-openai"
```

## Authentication

The Azure OpenAI plugin requires an [Azure OpenAI API key](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/create-resource?pivots=web-portal). Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, and `OPENAI_API_VERSION` in your `.env` file.

## Importing

```python
from videosdk.plugins.openai import OpenAISTT
```

## Example Usage

```python
from videosdk.plugins.openai import OpenAISTT
from videosdk.agents import CascadingPipeline

# Initialize the Azure OpenAI STT model
stt = OpenAISTT.azure(
    azure_deployment="gpt-4o-transcribe",
    language="en",
)

# Add stt to cascading pipeline
pipeline = CascadingPipeline(stt=stt)
```

:::note
When using a .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key`, `videosdk_auth`, and other credential parameters from your code.
:::

## Configuration Options

- `azure_deployment`: The Azure OpenAI deployment ID to use (defaults to the model name, e.g., `"gpt-4o-mini-transcribe"`, `"gpt-4o-transcribe"`)
- `api_key`: Your Azure OpenAI API key (can also be set via environment variable)
- `azure_endpoint`: Your Azure OpenAI deployment endpoint URL (can also be set via environment variable)
- `api_version`: Your Azure OpenAI API version (can also be set via environment variable)
- `language`: (str) Language code for transcription (default: `"en"`)

---

# Cartesia STT

The Cartesia STT provider enables your agent to use Cartesia's advanced speech-to-text models for high-accuracy, real-time audio transcription.

## Installation

Install the Cartesia-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-cartesia"
```

## Importing

```python
from videosdk.plugins.cartesia import CartesiaSTT
```

## Authentication

The Cartesia plugin requires a [Cartesia API key](https://play.cartesia.ai/keys). Set `CARTESIA_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.cartesia import CartesiaSTT
from videosdk.agents import CascadingPipeline

# Initialize the Cartesia STT model
stt = CartesiaSTT(
    # When CARTESIA_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-cartesia-api-key",
    language="en-US",
    model="ink-whisper",
)

# Add stt to cascading pipeline
pipeline = CascadingPipeline(stt=stt)
```

:::note
When using an environment variable for credentials, don't pass the `api_key` as an argument to the model instance. The SDK automatically reads the environment variable.
:::

## Configuration Options

- `api_key`: (str) Your Cartesia API key. Can also be set via the `CARTESIA_API_KEY` environment variable.
- `model`: (str) The Cartesia STT model to use (e.g., `"ink-whisper"`). Defaults to `"ink-whisper"`.
- `language`: (str) Language code for transcription (default: `"en"`). ## Additional resources The following resources provide more information about using Cartesia with VideoSDK Agents. - **[Cartesia docs](https://docs.cartesia.ai/build-with-cartesia/models/stt)**: Cartesia STT docs. --- # Deepgram STT The Deepgram STT provider enables your agent to use Deepgram's advanced speech-to-text models for high-accuracy, real-time audio transcription. ## Installation Install the Deepgram-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-deepgram" ``` ## Authentication The Deepgram plugin requires a [Deepgram API key](https://console.deepgram.com/). Set `DEEPGRAM_API_KEY` in your `.env` file. ## Importing **DeepgramSTT:** ```python from videosdk.plugins.deepgram import DeepgramSTT ``` --- **DeepgramSTTV2:** ```python from videosdk.plugins.deepgram import DeepgramSTTV2 ``` ## Example Usage **DeepgramSTT:** ```python from videosdk.plugins.deepgram import DeepgramSTT from videosdk.agents import CascadingPipeline # Initialize the Deepgram STT model stt = DeepgramSTT( # When DEEPGRAM_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-deepgram-api-key", model="nova-2", language="en-US", interim_results=True, punctuate=True, smart_format=True ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` --- **DeepgramSTTV2:** ```python from videosdk.plugins.deepgram import DeepgramSTTV2 from videosdk.agents import CascadingPipeline # Initialize the Deepgram STT V2 model with Flux stt = DeepgramSTTV2( # When DEEPGRAM_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-deepgram-api-key", model="flux-general-en", eager_eot_threshold=0.6, eot_threshold=0.8, eot_timeout_ms=7000 ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. 
The SDK automatically reads environment variables, so omit `api_key`, `videosdk_auth`, and other credential parameters from your code. ::: ## Configuration Options **DeepgramSTT:** - `api_key`: Your Deepgram API key (can also be set via the `DEEPGRAM_API_KEY` environment variable) - `model`: The Deepgram model to use (e.g., `"nova-2"`, `"nova-3"`, `"whisper-large"`) (default: `"nova-2"`) - `language`: (str) Language code for transcription (e.g., `"en-US"`, `"es"`, `"fr"`) (default: `"en-US"`) - `interim_results`: (bool) Enable real-time partial transcription results (default: `True`) - `punctuate`: (bool) Add punctuation to transcription (default: `True`) - `smart_format`: (bool) Apply intelligent formatting to output (default: `True`) - `filler_words`: (bool) Include filler words like "uh" and "um" in transcription (default: `True`) - `sample_rate`: (int) Audio sample rate in Hz (default: `48000`) - `endpointing`: (int) Silence detection threshold in milliseconds (default: `50`) - `base_url`: (str) WebSocket endpoint URL (default: `"wss://api.deepgram.com/v1/listen"`) --- **DeepgramSTTV2:** - `api_key`: Your Deepgram API key (can also be set via the `DEEPGRAM_API_KEY` environment variable) - `model`: The Flux model to use; the language is embedded in the model name (default: `"flux-general-en"`; currently only English is available) - `input_sample_rate`: (int) Input audio sample rate in Hz (default: `48000`) - `target_sample_rate`: (int) Target sample rate for Deepgram processing (default: `16000`) - `eager_eot_threshold`: Confidence threshold for early end-of-turn detection, range 0.0-1.0 (default: `0.6`) - Lower values = more aggressive early detection - Higher values = wait for higher confidence before early turn end - `eot_threshold`: Standard end-of-turn confidence threshold, range 0.0-1.0 (default: `0.8`) - Controls when a turn is definitively ended - `eot_timeout_ms`: Timeout in milliseconds before forcing end-of-turn (default: `7000`) - Maximum silence duration before automatically ending the turn - `base_url`: (str) WebSocket endpoint URL (default: `"wss://api.deepgram.com/v2/listen"`) ## Additional Resources The following resources provide more information about using Deepgram with VideoSDK Agents SDK. - **[Deepgram docs V1](https://developers.deepgram.com/docs/live-streaming-audio)**: Deepgram's STT V1 docs - **[Deepgram docs V2](https://developers.deepgram.com/docs/flux/quickstart)**: Deepgram's STT V2 docs - **[Github URL V1](https://github.com/videosdk-live/agents/blob/main/videosdk-plugins/videosdk-plugins-deepgram/videosdk/plugins/deepgram/stt.py)**: Deepgram STT Plugin Source Code - **[Github URL V2](https://github.com/videosdk-live/agents/blob/main/videosdk-plugins/videosdk-plugins-deepgram/videosdk/plugins/deepgram/stt_v2.py)**: Deepgram STT V2 Plugin Source Code --- # ElevenLabs STT The ElevenLabs STT provider enables your agent to use ElevenLabs' advanced speech-to-text models for high-accuracy, real-time audio transcription with advanced voice activity detection. ## Installation Install the ElevenLabs-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-elevenlabs" ``` ## Importing ```python from videosdk.plugins.elevenlabs import ElevenLabsSTT ``` ## Authentication The ElevenLabs plugin requires an [ElevenLabs API key](https://elevenlabs.io/app/settings/api-keys). Set `ELEVENLABS_API_KEY` in your `.env` file.
## Example Usage ```python from videosdk.plugins.elevenlabs import ElevenLabsSTT from videosdk.agents import CascadingPipeline # Initialize the ElevenLabs STT model stt = ElevenLabsSTT( # When ELEVENLABS_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-elevenlabs-api-key", model_id="scribe_v2_realtime", language_code="en", commit_strategy="vad", vad_silence_threshold_secs=0.8, vad_threshold=0.4, min_speech_duration_ms=50, min_silence_duration_ms=50 ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your ElevenLabs API key (can also be set via ELEVENLABS_API_KEY environment variable) - `model_id`: (str) STT model identifier (default: `"scribe_v2_realtime"`) - `language_code`: (str) Language code for transcription (default: `"en"`) - `sample_rate`: (int) Sample rate of input audio in Hz (default: `48000`) - `commit_strategy`: (str) Strategy for committing transcripts (default: `"vad"`) - `"vad"` - Voice Activity Detection based commit strategy - `vad_silence_threshold_secs`: (float) Duration of silence in seconds to detect end-of-speech (default: `0.8`) - `vad_threshold`: (float) Threshold for detecting voice activity (default: `0.4`) - `min_speech_duration_ms`: (int) Minimum duration in milliseconds for a speech segment (default: `50`) - `min_silence_duration_ms`: (int) Minimum duration in milliseconds of silence to consider end-of-speech (default: `50`) ## Additional Resources The following resources provide more information about using ElevenLabs with VideoSDK Agents SDK. - **[ElevenLabs docs](https://elevenlabs.io/docs)**: ElevenLabs STT docs. 
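With `commit_strategy="vad"`, a transcript segment is committed only after the detector has heard speech lasting at least `min_speech_duration_ms` followed by silence lasting at least `vad_silence_threshold_secs`. A minimal pure-Python sketch of that decision logic (illustrative only; the plugin implements this internally):

```python
def should_commit(speech_ms: float, silence_secs: float,
                  min_speech_duration_ms: float = 50,
                  vad_silence_threshold_secs: float = 0.8) -> bool:
    """Commit a segment only when enough speech was captured and the
    speaker has then been silent long enough (illustrative logic)."""
    long_enough = speech_ms >= min_speech_duration_ms
    silent_enough = silence_secs >= vad_silence_threshold_secs
    return long_enough and silent_enough

# A 1.2 s utterance followed by 0.9 s of silence commits;
# a 30 ms blip or a 0.3 s pause does not.
print(should_commit(1200, 0.9))  # True
print(should_commit(30, 0.9))    # False
print(should_commit(1200, 0.3))  # False
```

Lowering `vad_silence_threshold_secs` reduces end-of-speech latency but increases the chance of splitting one sentence across two commits.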
--- # Gladia STT The Gladia STT provider enables your agent to use Gladia's fast and accurate speech-to-text models for real-time audio transcription with support for multiple languages and code-switching. ## Installation Install the Gladia-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-gladia" ``` ## Authentication The Gladia plugin requires a [Gladia API key](https://app.gladia.io/signup). Set `GLADIA_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.gladia import GladiaSTT ``` ## Example Usage ```python from videosdk.plugins.gladia import GladiaSTT from videosdk.agents import CascadingPipeline # Initialize the Gladia STT model stt = GladiaSTT( # When GLADIA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-gladia-api-key", languages=["en"], code_switching=True, receive_partial_transcripts=True ) # Add stt to a cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using a `.env` file for credentials, you do not need to pass the `api_key` as an argument to the model instance; the SDK reads it automatically. ::: ## Configuration Options - `api_key`: (str, optional) Your Gladia API key. Can also be set via the `GLADIA_API_KEY` environment variable. - `model`: (str, optional) The model to use. Defaults to `"solaria-1"`. - `languages`: (List[str], optional) A list of language codes to detect (e.g., `["en", "fr"]`). Defaults to `["en"]`. - `code_switching`: (bool, optional) Enables automatic language switching between the provided languages. Defaults to `True`. - `input_sample_rate`: (int, optional) The sample rate of the incoming audio. Defaults to `48000`. - `output_sample_rate`: (int, optional) The sample rate Gladia should process. Defaults to `16000`. - `encoding`: (str, optional) The audio encoding format. Defaults to `"wav/pcm"`. - `bit_depth`: (int, optional) The bit depth of the audio. Defaults to `16`. - `channels`: (int, optional) The number of audio channels. 
Defaults to `1` (mono). - `receive_partial_transcripts`: (bool, optional) Set to `True` to receive interim transcription results for lower latency. Defaults to `False`. --- # Google STT The Google STT provider enables your agent to use Google's advanced speech-to-text models for high-accuracy, real-time audio transcription. ## Installation Install the Google-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-google" ``` ## Importing ```python from videosdk.plugins.google import GoogleSTT ``` ## Authentication To use Google STT, set up your Google Cloud credentials by pointing the `GOOGLE_APPLICATION_CREDENTIALS` environment variable at your service account key file: ```bash export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/keyfile.json" ``` Alternatively, pass the path to the key file directly to the `GoogleSTT` constructor via the `api_key` parameter, or set `GOOGLE_APPLICATION_CREDENTIALS` in your `.env` file. ## Example Usage ```python from videosdk.plugins.google import GoogleSTT from videosdk.agents import CascadingPipeline # Initialize the Google STT model stt = GoogleSTT( # If GOOGLE_APPLICATION_CREDENTIALS is set, you can omit api_key api_key="/path/to/your/keyfile.json", languages="en-US", model="latest_long", interim_results=True, punctuate=True ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using an environment variable for credentials, don't pass the `api_key` as an argument to the model instance. The SDK automatically reads the environment variable. ::: ## Configuration Options - `api_key`: (str) Path to your Google Cloud service account JSON file. This can also be set via the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. - `languages`: (Union[str, list[str]]) Language code or a list of language codes for transcription (default: `"en-US"`).
- `model`: (str) The Google STT model to use (e.g., `"latest_long"`, `"telephony"`) (default: `"latest_long"`). - `sample_rate`: (int) The target audio sample rate in Hz for transcription (default: `16000`). The input audio at 48000Hz will be resampled to this rate. - `interim_results`: (bool) Enable real-time partial transcription results (default: `True`). - `punctuate`: (bool) Add punctuation to transcription (default: `True`). - `min_confidence_threshold`: (float) The minimum confidence level for a transcription result to be considered valid (default: `0.1`). - `location`: (str) The Google Cloud location to use for the STT service (default: `"global"`). ## Additional Resources The following resources provide more information about using Google with VideoSDK Agents SDK. - **[Google STT docs](https://cloud.google.com/speech-to-text/docs)**: Google Cloud STT documentation. --- # Navana STT The Navana STT provider enables your agent to use Navana's Bodhi speech-to-text models, which are highly optimized for a variety of Indian languages and accents. ## Installation Install the Navana-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-navana" ``` ## Authentication The Navana plugin requires a **Customer ID** and an **API Key** from your [Navana Bodhi account](https://bodhi.navana.ai/). Set both `NAVANA_API_KEY` and `NAVANA_CUSTOMER_ID` in your `.env` file. ## Importing ```python from videosdk.plugins.navana import NavanaSTT ``` ## Example Usage ```python from videosdk.plugins.navana import NavanaSTT from videosdk.agents import CascadingPipeline # Initialize the Navana STT model stt = NavanaSTT( api_key="your-navana-api-key", customer_id="your-navana-customer-id", model="en-in-general-v2-8khz", language="en-IN" ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances. 
The SDK automatically reads environment variables, so omit `api_key`, `customer_id`, and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your Navana API key (required, can also be set via `NAVANA_API_KEY` environment variable). - `customer_id`: Your Navana Customer ID (required, can also be set via `NAVANA_CUSTOMER_ID` environment variable). - `model`: The Navana STT model to use (e.g., `"en-in-general-v2-8khz"`, `"hi-general-v2-8khz"`). - `language`: The language code for transcription (e.g., `"en-IN"`, `"hi-IN"`). ## Additional Resources The following resources provide more information about using Navana with the VideoSDK Agents SDK. - **[Navana Docs](https://navana.gitbook.io/bodhi/streaming-asr/streaming-websocket)**: Navana's official streaming API documentation. --- # Nvidia STT The Nvidia STT provider enables your agent to use Nvidia's Riva speech-to-text models for high-performance, low-latency speech recognition. ## Installation Install the Nvidia-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-nvidia" ``` ## Authentication The Nvidia plugin requires an Nvidia API key. Set `NVIDIA_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.nvidia import NvidiaSTT ``` ## Example Usage ```python from videosdk.plugins.nvidia import NvidiaSTT from videosdk.agents import CascadingPipeline # Initialize the Nvidia STT model stt = NvidiaSTT( # When NVIDIA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-nvidia-api-key", model="parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer", language_code="en-US", profanity_filter=False, automatic_punctuation=True ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key and other credential parameters from your code. 
::: ## Configuration Options - `api_key`: Your Nvidia API key (required, can also be set via environment variable) - `model`: The Nvidia Riva model to use (default: `"parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer"`) - `server`: The Nvidia Riva server address (default: `"grpc.nvcf.nvidia.com:443"`) - `function_id`: The specific function ID for the service (default: `"1598d209-5e27-4d3c-8079-4751568b1081"`) - `language_code`: Language code for transcription (default: `"en-US"`) - `sample_rate`: Audio sample rate in Hz (default: `16000`) - `profanity_filter`: (bool) Enable or disable profanity filtering (default: `False`) - `automatic_punctuation`: (bool) Enable or disable automatic punctuation (default: `True`) - `use_ssl`: (bool) Enable SSL connection (default: `True`) ## Additional Resources The following resources provide more information about using Nvidia Riva with VideoSDK Agents SDK. - **[Nvidia Riva docs](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html)**: Nvidia Riva documentation. --- # OpenAI STT The OpenAI STT provider enables your agent to use OpenAI's speech-to-text models (like Whisper) for converting audio input to text. ## Installation Install the OpenAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-openai" ``` ## Authentication The OpenAI plugin requires an [OpenAI API key](https://platform.openai.com/api-keys). Set `OPENAI_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.openai import OpenAISTT ``` ## Example Usage ```python from videosdk.plugins.openai import OpenAISTT from videosdk.agents import CascadingPipeline # Initialize the OpenAI STT model stt = OpenAISTT( # When OPENAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-openai-api-key", model="whisper-1", language="en", prompt="Transcribe this audio with proper punctuation and formatting." 
) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key`, `videosdk_auth`, and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your OpenAI API key (required, can also be set via environment variable) - `model`: The OpenAI STT model to use (e.g., `"whisper-1"`, `"gpt-4o-mini-transcribe"`) - `base_url`: Custom base URL for OpenAI API (optional) - `prompt`: (str) Custom prompt to guide transcription style and format - `language`: (str) Language code for transcription (default: `"en"`) - `turn_detection`: (dict) Configuration for detecting conversation turns ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. - **[OpenAI docs](https://platform.openai.com/docs/guides/speech-to-text)**: OpenAI STT API documentation. --- # Sarvam AI STT The Sarvam AI STT provider enables your agent to use Sarvam AI's speech-to-text models for transcription. This provider uses Voice Activity Detection (VAD) to send audio chunks for transcription after a period of silence. ## Installation Install the Sarvam AI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-sarvamai" ``` ## Importing ```python from videosdk.plugins.sarvamai import SarvamAISTT ``` ## Authentication The Sarvam plugin requires a [Sarvam API key](https://dashboard.sarvam.ai/key-management). Set `SARVAMAI_API_KEY` in your `.env` file.
## Example Usage ```python from videosdk.plugins.sarvamai import SarvamAISTT from videosdk.agents import CascadingPipeline # Initialize the Sarvam AI STT model stt = SarvamAISTT( # When SARVAMAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-sarvam-ai-api-key", model="saarika:v2", language="en-IN" ) # Add stt to cascading pipeline pipeline = CascadingPipeline(stt=stt) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `api_key`: (str) Your Sarvam AI API key. Can also be set via the `SARVAMAI_API_KEY` environment variable. - `model`: (str) The Sarvam AI model to use (default: `"saarika:v2"`). - `language`: (str) Language code for transcription (default: `"en-IN"`). - `input_sample_rate`: (int) The sample rate of the audio from the source in Hz (default: `48000`). - `output_sample_rate`: (int) The sample rate to which the audio is resampled before sending for transcription (default: `16000`). - `silence_threshold`: (float) The normalized amplitude threshold for silence detection (default: `0.01`). - `silence_duration`: (float) The duration of silence in seconds that triggers the end of a speech segment for transcription (default: `0.8`). ## Additional Resources The following resources provide more information about using Sarvam AI with VideoSDK Agents SDK. - **[Sarvam docs](https://docs.sarvam.ai/)**: Sarvam's full docs site. --- # AWS Polly TTS The AWS Polly TTS provider enables your agent to use AWS Polly's high-quality text-to-speech models for generating natural-sounding voice output. 
## Installation Install the AWS Polly-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-aws" ``` ## Importing ```python from videosdk.plugins.aws import AWSPollyTTS ``` ## Authentication - **AWS Account**: You have an active AWS account with permissions to access Amazon Polly. - **Region Selection**: You're operating in the US East (N. Virginia) (`us-east-1`) region, as model access is region-specific. - **AWS Credentials**: Your AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`) are configured, either through environment variables or your preferred credential management method. ## Example Usage ```python from videosdk.plugins.aws import AWSPollyTTS from videosdk.agents import CascadingPipeline # Initialize the AWS Polly TTS model tts = AWSPollyTTS( voice="Joanna", engine="neural", speed=1.2, pitch=0.1, ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `aws_access_key_id`, `aws_secret_access_key`, and other credential parameters from your code. ::: ## Configuration Options - `voice`: (str) Voice ID for the TTS output (default: `"Joanna"`). - `engine`: (str) Polly engine type: `"standard"` or `"neural"` (default: `"neural"`). - `region`: (str) AWS region for Polly service (default: `"us-east-1"` or from `AWS_DEFAULT_REGION`). - `aws_access_key_id`: (str) AWS access key ID (optional; can be set via environment variable). - `aws_secret_access_key`: (str) AWS secret access key (optional; can be set via environment variable). - `aws_session_token`: (str) Optional AWS session token for temporary credentials. - `speed`: (float) Speech rate multiplier (e.g., `1.0` is normal speed, `1.5` is 50% faster). - `pitch`: (float) Pitch adjustment multiplier (e.g., `0.0` is normal, `0.2` raises pitch).
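Amazon Polly expresses rate and pitch through SSML `<prosody>` tags. As a rough sketch of how multiplier-style `speed` and `pitch` values could translate to SSML percentages (this mapping is an assumption for illustration, not the plugin's actual implementation):

```python
def to_prosody_ssml(text: str, speed: float = 1.0, pitch: float = 0.0) -> str:
    """Wrap text in an SSML <prosody> tag, mapping a speed multiplier to a
    rate percentage and a pitch offset to a signed percentage (hypothetical
    mapping for illustration)."""
    rate = f"{round(speed * 100)}%"              # 1.2 -> "120%"
    sign = "+" if pitch >= 0 else ""             # negative values keep their own sign
    pitch_pct = f"{sign}{round(pitch * 100)}%"   # 0.1 -> "+10%"
    return f'<prosody rate="{rate}" pitch="{pitch_pct}">{text}</prosody>'

print(to_prosody_ssml("Hello there", speed=1.2, pitch=0.1))
# <prosody rate="120%" pitch="+10%">Hello there</prosody>
```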
## Additional Resources The following resources provide more information about using AWS Polly with VideoSDK Agents SDK. - **[AWS Polly docs](https://docs.aws.amazon.com/polly/latest/dg/what-is.html)**: AWS Polly documentation. --- # Azure TTS The Azure TTS provider enables your agent to use Microsoft Azure's high-quality text-to-speech models for generating natural-sounding voice output with advanced voice tuning and expressive speaking styles. ## Installation Install the Azure-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-azure" ``` ## Importing ```python from videosdk.plugins.azure import AzureTTS, VoiceTuning, SpeakingStyle ``` ## Authentication The Azure TTS plugin requires an Azure AI Speech Service resource. **Setup Steps:** 1. Create an AI Services resource for Speech in the [Azure portal](https://portal.azure.com) or from [Azure AI Foundry](https://ai.azure.com/foundryProject/overview) 2. Get the Speech resource key and region. After your Speech resource is deployed, select "Go to resource" to view and manage keys Set `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` in your `.env` file: ```bash AZURE_SPEECH_KEY=your-azure-speech-key AZURE_SPEECH_REGION=your-azure-region ``` ## Example Usage ```python from videosdk.plugins.azure import AzureTTS, VoiceTuning, SpeakingStyle from videosdk.agents import CascadingPipeline # Configure voice tuning for prosody control voice_tuning = VoiceTuning( rate="fast", volume="loud", pitch="high" ) # Configure speaking style for expressive speech speaking_style = SpeakingStyle( style="cheerful", degree=1.5 ) # Initialize the Azure TTS model tts = AzureTTS( voice="en-US-EmmaNeural", language="en-US", tuning=voice_tuning, style=speaking_style ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. 
The SDK automatically reads environment variables, so omit `speech_key`, `speech_region`, and other credential parameters from your code. ::: ## Configuration Options - `speech_key`: (Optional[str]) Azure Speech API key. Uses `AZURE_SPEECH_KEY` environment variable if not provided. - `speech_region`: (Optional[str]) Azure Speech region (e.g., `"eastus"`, `"westus2"`). Uses `AZURE_SPEECH_REGION` environment variable if not provided. - `speech_endpoint`: (Optional[str]) Custom endpoint URL. Uses `AZURE_SPEECH_ENDPOINT` environment variable if not provided. - `voice`: (str) Voice name to use for audio output (default: `"en-US-EmmaNeural"`). Get available voices using the [Azure voices API](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis?tabs=browserjs%2Cterminal&pivots=programming-language-python#select-synthesis-language-and-voice). - `language`: (str) Language code (optional, inferred from voice if not specified). - `tuning`: (`VoiceTuning`) Voice tuning object for rate, volume, and pitch control: - `rate`: (str) Speaking rate (`"x-slow"`, `"slow"`, `"medium"`, `"fast"`, `"x-fast"` or percentage like `"50%"`) - `volume`: (str) Speaking volume (`"silent"`, `"x-soft"`, `"soft"`, `"medium"`, `"loud"`, `"x-loud"` or percentage) - `pitch`: (str) Voice pitch (`"x-low"`, `"low"`, `"medium"`, `"high"`, `"x-high"` or frequency like `"+50Hz"`) - `style`: (`SpeakingStyle`) Speaking style object for expressive speech: - `style`: (str) Speaking style (e.g., `"cheerful"`, `"sad"`, `"angry"`, `"excited"`, `"friendly"`) - `degree`: (float) Style intensity from 0.01 to 2.0 (default: 1.0) - `deployment_id`: (str) Custom deployment ID for custom models. - `speech_auth_token`: (str) Authorization token for authentication. 
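The `VoiceTuning` and `SpeakingStyle` options map onto Azure Speech SSML: prosody settings become a `<prosody>` element and speaking styles become `<mstts:express-as>`. The plugin builds this markup for you; the sketch below only shows what the settings from the example above correspond to:

```python
def build_azure_ssml(text: str, voice: str = "en-US-EmmaNeural",
                     rate: str = "fast", volume: str = "loud", pitch: str = "high",
                     style: str = "cheerful", degree: float = 1.5) -> str:
    """Assemble Azure Speech SSML with prosody and expressive style
    (illustrative; the AzureTTS plugin generates equivalent markup internally)."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}" styledegree="{degree}">'
        f'<prosody rate="{rate}" volume="{volume}" pitch="{pitch}">{text}</prosody>'
        '</mstts:express-as></voice></speak>'
    )

print(build_azure_ssml("Great to see you!"))
```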
## Voice Selection You can find available voices using the Azure Voices List API: ```bash curl --location --request GET 'https://eastus2.tts.speech.microsoft.com/cognitiveservices/voices/list' \ --header 'Ocp-Apim-Subscription-Key: YOUR_SPEECH_KEY' ``` Popular voice options include: - `en-US-EmmaNeural` (Female, neutral) - `en-US-BrianNeural` (Male, neutral) - `en-US-AriaNeural` (Female, cheerful) - `en-GB-SoniaNeural` (Female, British) ## Additional Resources The following resources provide more information about using Azure with VideoSDK Agents SDK. - **[Azure Speech Service Overview](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/overview)**: Complete overview of Azure Speech services. - **[Azure TTS docs](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/index-text-to-speech)**: Azure Text-to-Speech documentation. - **[Voice Selection Guide](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis?tabs=browserjs%2Cterminal&pivots=programming-language-python#select-synthesis-language-and-voice)**: Guide for selecting synthesis language and voice. - **[Speech Synthesis Markup](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice#adjust-prosody)**: Learn about prosody adjustments and voice tuning. --- # Azure OpenAI TTS The Azure OpenAI TTS provider enables your agent to use Azure OpenAI's text-to-speech models for converting text responses to natural-sounding audio output. ## Installation Install the Azure OpenAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-openai" ``` ## Importing ```python from videosdk.plugins.openai import OpenAITTS ``` ## Authentication The Azure OpenAI plugin requires an [Azure OpenAI API key](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/create-resource?pivots=web-portal). Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, and `OPENAI_API_VERSION` in your `.env` file.
## Example Usage ```python from videosdk.plugins.openai import OpenAITTS from videosdk.agents import CascadingPipeline # Initialize the Azure OpenAI TTS model tts = OpenAITTS.azure( azure_deployment="gpt-4o-mini-tts", speed=1.0, response_format="pcm" ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key`, `videosdk_auth`, and other credential parameters from your code. ::: ## Configuration Options - `azure_deployment`: The OpenAI deployment ID to use (defaults to the model name, e.g., `"gpt-4o-mini-tts"`) - `api_key`: Your Azure OpenAI API key (can also be set via environment variable) - `azure_endpoint`: Your Azure OpenAI Deployment Endpoint URL (can also be set via environment variable) - `api_version`: Your Azure OpenAI API version (can also be set via environment variable) - `voice`: (str) Voice to use for audio output (e.g., `"alloy"`, `"echo"`, `"fable"`, `"onyx"`, `"nova"`, `"shimmer"`) - `speed`: (float) Speed of the generated audio (0.25 to 4.0, default: 1.0) ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. --- # Cartesia TTS The Cartesia TTS provider enables your agent to use Cartesia's high-quality, low-latency text-to-speech models for generating natural-sounding voice output. ## Installation Install the Cartesia-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-cartesia" ``` ## Importing ```python from videosdk.plugins.cartesia import CartesiaTTS ``` ## Authentication The Cartesia plugin requires a [Cartesia API key](https://play.cartesia.ai/keys). Set `CARTESIA_API_KEY` in your `.env` file.
## Example Usage ```python from videosdk.plugins.cartesia import CartesiaTTS from videosdk.agents import CascadingPipeline # Initialize the Cartesia TTS model tts = CartesiaTTS( # When CARTESIA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-cartesia-api-key", model="sonic-2", voice_id="794f9389-aac1-45b6-b726-9d9369183238", language="en" ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `api_key`: (str) Your Cartesia API key. Can also be set via the `CARTESIA_API_KEY` environment variable. - `model`: (str) The Cartesia TTS model to use (e.g., `"sonic-2"`, `"sonic-turbo"`). Defaults to `"sonic-2"`. - `voice_id`: (str) The ID of the voice to use for generating speech. - `language`: (str) The language of the voice (e.g., `"en"`, `"fr"`). Defaults to `"en"`. ## Additional Resources The following resources provide more information about using Cartesia with VideoSDK Agents. - **[Cartesia docs](https://docs.cartesia.ai/build-with-cartesia/models/tts)**: Cartesia TTS docs. --- # Deepgram TTS The Deepgram TTS provider enables your agent to use Deepgram's high-quality text-to-speech models for generating natural, expressive voice output with advanced voice capabilities. ## Installation Install the Deepgram-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-deepgram" ``` ## Importing ```python from videosdk.plugins.deepgram import DeepgramTTS ``` ## Authentication The Deepgram plugin requires a [Deepgram API key](https://developers.deepgram.com/docs/create-additional-api-keys). Set `DEEPGRAM_API_KEY` in your `.env` file.
## Example Usage ```python from videosdk.plugins.deepgram import DeepgramTTS from videosdk.agents import CascadingPipeline # Initialize the Deepgram TTS model tts = DeepgramTTS( # When DEEPGRAM_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-deepgram-api-key", model="aura-asteria-en", encoding="linear16", # linear16, mulaw, alaw, opus, mp3, flac, aac sample_rate=24000 ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using a `.env` file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key`, `videosdk_auth`, and other credential parameters from your code. ::: ## Configuration Options - `model`: The Deepgram model to use (e.g., `"aura-asteria-en"`, `"aura-luna-en"`) - `api_key`: Your Deepgram API key (can also be set via environment variable) - `encoding`: (str) Encoding allows you to specify the expected encoding of your audio output (default: `"linear16"`) - `sample_rate`: (int) Sample rate for output (default: `24000`) ## Additional Resources The following resources provide more information about using Deepgram with VideoSDK Agents SDK. - **[Deepgram docs](https://developers.deepgram.com/reference/text-to-speech-api/speak-streaming)**: Deepgram TTS docs. --- # ElevenLabs TTS The ElevenLabs TTS provider enables your agent to use ElevenLabs' high-quality text-to-speech models for generating natural, expressive voice output with advanced voice cloning capabilities. ## Installation Install the ElevenLabs-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-elevenlabs" ``` ## Importing ```python from videosdk.plugins.elevenlabs import ElevenLabsTTS, VoiceSettings ``` ## Authentication The ElevenLabs plugin requires an [ElevenLabs API key](https://elevenlabs.io/app/settings/api-keys). Set `ELEVENLABS_API_KEY` in your `.env` file.
## Example Usage ```python from videosdk.plugins.elevenlabs import ElevenLabsTTS, VoiceSettings from videosdk.agents import CascadingPipeline # Configure voice settings voice_settings = VoiceSettings( stability=0.71, similarity_boost=0.5, style=0.0, use_speaker_boost=True ) # Initialize the ElevenLabs TTS model tts = ElevenLabsTTS( # When ELEVENLABS_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-elevenlabs-api-key", model="eleven_flash_v2_5", voice="your-voice-id", speed=1.0, response_format="pcm_24000", voice_settings=voice_settings, enable_streaming=True ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `model`: The ElevenLabs model to use (e.g., `"eleven_flash_v2_5"`, `"eleven_multilingual_v2"`) - `voice`: (str) Voice ID to use for audio output (get from ElevenLabs dashboard) - `speed`: (float) Speed of the generated audio (default: 1.0) - `api_key`: Your ElevenLabs API key (can also be set via environment variable) - `response_format`: (str) Audio format for output (default: `"pcm_24000"`) - `voice_settings`: (`VoiceSettings`) Advanced voice configuration options: - `stability`: (float) Voice stability (0.0 to 1.0, default: 0.71) - `similarity_boost`: (float) Voice similarity enhancement (0.0 to 1.0, default: 0.5) - `style`: (float) Voice style exaggeration (0.0 to 1.0, default: 0.0) - `use_speaker_boost`: (bool) Enable speaker boost for clarity (default: `True`) - `base_url`: (str) Custom base URL for ElevenLabs API (optional) - `enable_streaming`: (bool) Enable real-time audio streaming (default: `False`) ## Additional Resources The following resources provide more information about using ElevenLabs with VideoSDK Agents SDK. 
- **[ElevenLabs docs](https://elevenlabs.io/docs)**: ElevenLabs TTS docs.

---

# Google TTS

The Google TTS provider enables your agent to use Google's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the Google-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-google"
```

## Importing

```python
from videosdk.plugins.google import GoogleTTS, GoogleVoiceConfig
```

## Authentication

The Google plugin requires a [Gemini API key](https://aistudio.google.com/apikey).

Set `GOOGLE_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.google import GoogleTTS, GoogleVoiceConfig
from videosdk.agents import CascadingPipeline

# Configure voice settings
voice_config = GoogleVoiceConfig(
    languageCode="en-US",
    name="en-US-Chirp3-HD-Aoede",
    ssmlGender="FEMALE"
)

# Initialize the Google TTS model
tts = GoogleTTS(
    # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-google-api-key",
    speed=1.0,
    pitch=0.0,
    voice_config=voice_config
)

# Add tts to cascading pipeline
pipeline = CascadingPipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `api_key`: (str) Your Google Cloud TTS API key. Can also be set via the `GOOGLE_API_KEY` environment variable.
- `speed`: (float) The speaking rate of the generated audio (default: `1.0`).
- `pitch`: (float) The pitch of the generated audio. Can be between -20.0 and 20.0 (default: `0.0`).
- `response_format`: (str) The format of the audio response. Currently only supports `"pcm"` (default: `"pcm"`).
- `voice_config`: (`GoogleVoiceConfig`) Configuration for the voice to be used.
  - `languageCode`: (str) The language code of the voice (e.g., `"en-US"`, `"en-GB"`) (default: `"en-US"`).
  - `name`: (str) The name of the voice to use (e.g., `"en-US-Chirp3-HD-Aoede"`, `"en-US-News-N"`) (default: `"en-US-Chirp3-HD-Aoede"`).
  - `ssmlGender`: (str) The gender of the voice (`"MALE"`, `"FEMALE"`, `"NEUTRAL"`) (default: `"FEMALE"`).

## Additional Resources

The following resources provide more information about using Google with VideoSDK Agents SDK.

- **[Google TTS docs](https://cloud.google.com/text-to-speech/docs)**: Google Cloud TTS documentation.

---

# Groq TTS

The Groq TTS provider enables your agent to use Groq's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the Groq-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-groq"
```

## Importing

```python
from videosdk.plugins.groq import GroqTTS
```

## Authentication

The Groq plugin requires a [Groq API key](https://console.groq.com/keys).

Set `GROQ_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.groq import GroqTTS
from videosdk.agents import CascadingPipeline

# Initialize the Groq AI TTS model
tts = GroqTTS(
    model="playai-tts",
    voice="Fritz-PlayAI",
)

# Add tts to cascading pipeline
pipeline = CascadingPipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `model` (str): The TTS model to use. Default: `"playai-tts"`
- `voice` (str): The voice to use. Default: `"Fritz-PlayAI"`
- `speed` (float): Speed of speech (0.5 to 5.0). Default: `1.0`
- `api_key` (str, optional): Groq API key. If not provided, uses the `GROQ_API_KEY` environment variable

## Additional Resources

The following resources provide more information about using Groq with VideoSDK Agents SDK.

- **[Groq docs](https://console.groq.com/docs/text-to-speech)**: Groq TTS docs.
---

# Hume AI TTS

The Hume AI TTS provider enables your agent to use Hume AI's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the Hume AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-humeai"
```

## Importing

```python
from videosdk.plugins.humeai import HumeAITTS
```

## Authentication

The Hume plugin requires a [Hume API key](https://platform.hume.ai/settings/keys).

Set `HUMEAI_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.humeai import HumeAITTS
from videosdk.agents import CascadingPipeline

# Initialize the Hume AI TTS model
tts = HumeAITTS(
    voice="Serene Assistant",
    instant_mode=True,
)

# Add tts to cascading pipeline
pipeline = CascadingPipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `instant_mode`: (bool) Whether to use instant mode synthesis (default: `True`). Instant mode requires specifying a voice.
- `voice`: (str) Voice name to use (default: `"Serene Assistant"`). Required when `instant_mode` is `True`.
- `speed`: (float) Speaking rate multiplier (default: `1.0`). Values >1.0 increase speed.
- `api_key`: (str) Hume AI API key. Can also be set via the `HUMEAI_API_KEY` environment variable.

## Additional Resources

The following resources provide more information about using Hume with VideoSDK Agents SDK.

- **[Hume AI docs](https://dev.hume.ai/docs/text-to-speech-tts)**: Hume AI docs.

---

# Inworld AI TTS

The Inworld AI TTS provider enables your agent to use Inworld AI's high-quality text-to-speech models for generating natural-sounding voice output.
## Installation

Install the Inworld AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-inworldai"
```

## Importing

```python
from videosdk.plugins.inworldai import InworldAITTS
```

## Authentication

The Inworld plugin requires an [Inworld API key](https://studio.inworld.ai/login).

Set `INWORLD_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.inworldai import InworldAITTS
from videosdk.agents import CascadingPipeline

# Initialize the Inworld AI TTS model
tts = InworldAITTS(
    # When INWORLD_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-api-key",
    voice_id="Hades",
    model_id="inworld-tts-1"
)

# Add tts to cascading pipeline
pipeline = CascadingPipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `model_id`: (str) Inworld TTS model identifier (default: `"inworld-tts-1"`).
- `voice_id`: (str) Voice identifier to use (default: `"Hades"`).
- `temperature`: (float) Sampling temperature for variation in prosody (default: `0.8`).
- `api_key`: (str) Inworld API key. Can also be set via the `INWORLD_API_KEY` environment variable.

## Additional Resources

The following resources provide more information about using Inworld with VideoSDK Agents SDK.

- **[Inworld AI docs](https://docs.inworld.ai/docs/introduction)**: Inworld AI docs.

---

# LMNT AI TTS

The LMNT AI TTS provider enables your agent to use LMNT AI's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the LMNT AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-lmnt"
```

## Importing

```python
from videosdk.plugins.lmnt import LMNTTTS
```

## Authentication

The LMNT plugin requires an [LMNT API key](https://app.lmnt.com/account).

Set `LMNT_API_KEY` in your `.env` file.
## Example Usage ```python from videosdk.plugins.lmnt import LMNTTTS from videosdk.agents import CascadingPipeline # Initialize the LMNT TTS model tts = LMNTTTS( voice="ava", model="blizzard", language="auto", ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your LMNT API key (can also be set via `LMNT_API_KEY` environment variable) - `voice`: Voice ID to use for synthesis (required) - `model`: Model to use for synthesis (default: "blizzard") - `language`: Language code for synthesis (default: "auto") ## Additional Resources The following resources provide more information about using LMNT with VideoSDK Agents SDK. - **[LMNT docs](https://docs.lmnt.com/)**: LMNT API docs. --- # Murf AI TTS The Murf AI TTS provider enables your agent to use Murf AI's high-quality text-to-speech models for generating natural, expressive voice output with advanced voice customization. ## Installation Install the Murf AI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-murfai" ``` ## Importing ```python from videosdk.plugins.murfai import MurfAITTS, MurfAIVoiceSettings ``` ## Authentication The Murf AI plugin requires a [Murf AI API key](https://murf.ai/). Set `MURFAI_API_KEY` in your `.env` file. 
## Example Usage ```python from videosdk.plugins.murfai import MurfAITTS, MurfAIVoiceSettings from videosdk.agents import CascadingPipeline # Configure voice settings voice_settings = MurfAIVoiceSettings( pitch=0, rate=0, style="Conversational", variation=1, multi_native_locale=None ) # Initialize the Murf AI TTS model tts = MurfAITTS( # When MURFAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-murfai-api-key", region="US_EAST", model="Falcon", voice="en-US-natalie", voice_settings=voice_settings, enable_streaming=True ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your Murf AI API key (can also be set via MURFAI_API_KEY environment variable) - `region`: (str) The region code for API deployment (default: `"US_EAST"`) - Available regions: `"GLOBAL"`, `"US_EAST"`, `"US_WEST"`, `"INDIA"`, `"CANADA"`, `"SOUTH_KOREA"`, `"UAE"`, `"JAPAN"`, `"AUSTRALIA"`, `"EU_CENTRAL"`, `"UK"`, `"SOUTH_AFRICA"` - `model`: (str) The Murf AI model to use (default: `"Falcon"`) - Available models: `"Gen2"`, `"Falcon"` - `voice`: (str) Voice ID to use for audio output (default: `"en-US-natalie"`) - `voice_settings`: (`MurfAIVoiceSettings`) Advanced voice configuration options: - `pitch`: (int) Voice pitch adjustment, range varies by voice (default: `0`) - `rate`: (int) Speech rate adjustment, range varies by voice (default: `0`) - `style`: (str) Voice style/emotion (default: `"Conversational"`) - `variation`: (int) Voice variation for diversity (default: `1`) - `multi_native_locale`: (str) Optional locale for multi-native voices (default: `None`) - `enable_streaming`: (bool) Enable WebSocket streaming for low latency. 
  When `False`, uses HTTP chunked transfer (default: `True`)

## Additional Resources

The following resources provide more information about using Murf AI with VideoSDK Agents SDK.

- **[Murf AI docs](https://murf.ai/api/docs/introduction/overview)**: Murf AI TTS docs.

---

# Neuphonic TTS

The Neuphonic TTS provider enables your agent to use Neuphonic's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the Neuphonic-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-neuphonic"
```

## Importing

```python
from videosdk.plugins.neuphonic import NeuphonicTTS
```

## Authentication

The Neuphonic plugin requires a [Neuphonic API key](https://app.neuphonic.com/apikey).

Set `NEUPHONIC_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.neuphonic import NeuphonicTTS
from videosdk.agents import CascadingPipeline

# Initialize the Neuphonic AI TTS model
tts = NeuphonicTTS(
    lang_code="en",
    voice_id="8e9c4bc8-3979-48ab-8626-df53befc2090",
    speed=1.0,
)

# Add tts to cascading pipeline
pipeline = CascadingPipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `api_key`: Your Neuphonic API key (can also be set via `NEUPHONIC_API_KEY` environment variable)
- `lang_code`: Language code for the desired language (e.g., `'en'`, `'es'`, `'de'`, `'nl'`, `'hi'`)
- `voice_id`: The voice ID for the desired voice
- `speed`: Playback speed of the audio (range: 0.7-2.0, default: 1.0)

## Additional Resources

The following resources provide more information about using Neuphonic with VideoSDK Agents SDK.

- **[Neuphonic AI docs](https://docs.neuphonic.com/)**: Neuphonic docs.
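Since `speed` is only documented for the 0.7-2.0 range, it can be worth clamping user-supplied values before constructing the model. A small sketch under that assumption (`clamp_speed` is an illustrative helper, not part of the plugin):

```python
# Illustrative helper (not part of the Neuphonic plugin): keep playback
# speed inside the documented 0.7-2.0 range before passing it to NeuphonicTTS.
def clamp_speed(speed: float, low: float = 0.7, high: float = 2.0) -> float:
    return max(low, min(high, speed))

print(clamp_speed(1.0))  # in range, unchanged
print(clamp_speed(3.5))  # clamped down to 2.0
print(clamp_speed(0.2))  # clamped up to 0.7
```

The same pattern applies to any provider that documents a bounded speed range (e.g., Groq's 0.5 to 5.0).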
--- # Nvidia TTS The Nvidia TTS provider enables your agent to use Nvidia's Riva text-to-speech models for converting text responses to natural-sounding audio output with low latency. ## Installation Install the Nvidia-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-nvidia" ``` ## Authentication The Nvidia plugin requires an Nvidia API key. Set `NVIDIA_API_KEY` in your `.env` file. ## Importing ```python from videosdk.plugins.nvidia import NvidiaTTS ``` ## Example Usage ```python from videosdk.plugins.nvidia import NvidiaTTS from videosdk.agents import CascadingPipeline # Initialize the Nvidia TTS model tts = NvidiaTTS( # When NVIDIA_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-nvidia-api-key", voice_name="Magpie-Multilingual.EN-US.Aria", language_code="en-US", sample_rate=24000 ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key and other credential parameters from your code. ::: ## Configuration Options - `api_key`: Your Nvidia API key (required, can also be set via environment variable) - `server`: The Nvidia Riva server address (default: `"grpc.nvcf.nvidia.com:443"`) - `function_id`: The specific function ID for the service (default: `"877104f7-e885-42b9-8de8-f6e4c6303969"`) - `voice_name`: (str) The voice to use (default: `"Magpie-Multilingual.EN-US.Aria"`) - `language_code`: (str) Language code for synthesis (default: `"en-US"`) - `sample_rate`: (int) Audio sample rate in Hz (default: `24000`) - `use_ssl`: (bool) Enable SSL connection (default: `True`) ## Additional Resources The following resources provide more information about using Nvidia Riva with VideoSDK Agents SDK. - **[Nvidia Riva docs](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html)**: Nvidia Riva documentation. 
--- # OpenAI TTS The OpenAI TTS provider enables your agent to use OpenAI's text-to-speech models for converting text responses to natural-sounding audio output. ## Installation Install the OpenAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-openai" ``` ## Importing ```python from videosdk.plugins.openai import OpenAITTS ``` ## Authentication The OpenAI plugin requires an [OpenAI API key](https://platform.openai.com/api-keys). Set `OPENAI_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.openai import OpenAITTS from videosdk.agents import CascadingPipeline # Initialize the OpenAI TTS model tts = OpenAITTS( # When OPENAI_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-openai-api-key", model="tts-1", voice="alloy", speed=1.0, response_format="pcm" ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code. ::: ## Configuration Options - `model`: The OpenAI TTS model to use (e.g., `"tts-1"`, `"tts-1-hd"`) - `voice`: (str) Voice to use for audio output (e.g., `"alloy"`, `"echo"`, `"fable"`, `"onyx"`, `"nova"`, `"shimmer"`) - `speed`: (float) Speed of the generated audio (0.25 to 4.0, default: 1.0) - `instructions`: (str) Custom instructions to guide speech synthesis style - `api_key`: Your OpenAI API key (can also be set via environment variable) - `base_url`: Custom base URL for OpenAI API (optional) - `response_format`: (str) Audio format for output (default: `"pcm"`) ## Additional Resources The following resources provide more information about using OpenAI with VideoSDK Agents SDK. - **[OpenAI docs](https://platform.openai.com/docs/guides/text-to-speech)**: OpenAI TTS API documentation. 
---

# Papla Media TTS

The Papla Media TTS provider enables your agent to use Papla Media's text-to-speech service for converting text responses into spoken audio.

## Installation

Install the Papla Media-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-papla"
```

## Importing

```python
from videosdk.plugins.papla import PaplaTTS
```

## Authentication

The Papla Media plugin requires an API key, which you can generate from your app dashboard.

Set `PAPLA_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.papla import PaplaTTS
from videosdk.agents import CascadingPipeline

# Initialize the Papla Media TTS service
tts = PaplaTTS(
    # When PAPLA_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-papla-media-api-key",
)

# Add tts to a cascading pipeline
pipeline = CascadingPipeline(tts=tts)
```

:::note
When using a `.env` file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so you should omit the `api_key` parameter from your code.
:::

## Configuration Options

### Initialization Parameters

These are the options you can set when creating an instance of `PaplaTTS`.

- `model_id` (str): The TTS model to use. Defaults to `"papla_p1"`.
- `api_key` (str, optional): Your Papla Media API key. It's recommended to set this via the `PAPLA_API_KEY` environment variable instead.
- `base_url` (str, optional): Custom base URL for the Papla Media API. Defaults to `"https://api.papla.media/v1"`.

## Additional Resources

The following resources provide more information about using Papla Media with the VideoSDK Agent Framework.

- **[Papla Media API Docs](https://api.papla.media/docs)**: Papla Media's official API documentation.

---

# Resemble AI TTS

The Resemble AI TTS provider enables your agent to use Resemble AI's high-quality text-to-speech models for generating natural-sounding voice output.
## Installation

Install the Resemble AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-resemble"
```

## Importing

```python
from videosdk.plugins.resemble import ResembleTTS
```

## Authentication

The Resemble plugin requires a [Resemble API key](https://app.resemble.ai/account/api).

Set `RESEMBLE_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.resemble import ResembleTTS
from videosdk.agents import CascadingPipeline

# Initialize the Resemble AI TTS model
tts = ResembleTTS(
    # When RESEMBLE_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-resemble-api-key",
    voice_uuid="55592656"
)

# Add tts to cascading pipeline
pipeline = CascadingPipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `api_key`: (str) Your Resemble AI API key. Can also be set via the `RESEMBLE_API_KEY` environment variable.
- `voice_uuid`: (str) The UUID of the voice to use for synthesis (default: `"55592656"`).

## Additional Resources

The following resources provide more information about using Resemble with VideoSDK Agents SDK.

- **[Resemble AI docs](https://docs.app.resemble.ai)**: Resemble AI docs.

---

# Rime AI TTS

The Rime AI TTS provider enables your agent to use Rime AI's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the Rime AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-rime"
```

## Importing

```python
from videosdk.plugins.rime import RimeTTS
```

## Authentication

The Rime plugin requires a [Rime API key](https://rime.ai/).

Set `RIME_API_KEY` in your `.env` file.
## Example Usage

```python
from videosdk.plugins.rime import RimeTTS
from videosdk.agents import CascadingPipeline

# Initialize the Rime AI TTS model
tts = RimeTTS(
    speaker="river",
    model_id="mist",
    lang="eng",
    speed_alpha=1.0
)

# Add tts to cascading pipeline
pipeline = CascadingPipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `speaker`: (str) Voice ID to use (default: `"river"`). Must match the model's available speakers.
- `model_id`: (str) Rime model identifier (default: `"mist"`). Supported: `"mist"`, `"mistv2"`.
- `lang`: (str) Language code for the voice (default: `"eng"`).
- `speed_alpha`: (float) Controls speaking rate (`1.0` is normal speed).
- `reduce_latency`: (bool) Whether to minimize streaming delay (default: `False`).
- `pause_between_brackets`: (bool) Insert pauses around bracketed text (default: `False`).
- `phonemize_between_brackets`: (bool) Use phonemes for bracketed text (default: `False`).
- `inline_speed_alpha`: (str) Optional per-word speed override (e.g., `"1.2,1.0,0.8"`).
- `api_key`: (str) Rime API key. Can also be set via the `RIME_API_KEY` environment variable.

## Additional Resources

The following resources provide more information about using Rime with VideoSDK Agents SDK.

- **[Rime AI docs](https://docs.rime.ai/)**: Rime AI docs.

---

# Sarvam AI TTS

The Sarvam AI TTS provider enables your agent to use Sarvam AI's text-to-speech models for generating voice output.

## Installation

Install the Sarvam AI-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-sarvamai"
```

## Importing

```python
from videosdk.plugins.sarvamai import SarvamAITTS
```

## Authentication

The Sarvam plugin requires a [Sarvam API key](https://dashboard.sarvam.ai/key-management).
Set `SARVAMAI_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.sarvamai import SarvamAITTS
from videosdk.agents import CascadingPipeline

# Initialize the Sarvam AI TTS model
tts = SarvamAITTS(
    # When SARVAMAI_API_KEY is set in .env - DON'T pass api_key parameter
    api_key="your-sarvam-ai-api-key",
    model="bulbul:v2",
    speaker="anushka",
    target_language_code="en-IN",
    pitch=0.0,
    pace=1.0,
    loudness=1.2
)

# Add tts to cascading pipeline
pipeline = CascadingPipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `api_key`: (str) Your Sarvam AI API key. Can also be set via the `SARVAMAI_API_KEY` environment variable.
- `model`: (str) The Sarvam AI model to use (default: `"bulbul:v2"`).
- `speaker`: (str) The speaker voice to use (default: `"anushka"`).
- `target_language_code`: (str) The language code for the generated audio (default: `"en-IN"`).
- `pitch`: (float) The pitch of the generated audio (default: `0.0`).
- `pace`: (float) The pace or speed of the generated audio (default: `1.0`).
- `loudness`: (float) The loudness of the generated audio (default: `1.2`).
- `enable_preprocessing`: (bool) Whether to enable text preprocessing on the server (default: `True`).

## Additional Resources

The following resources provide more information about using Sarvam AI with VideoSDK Agents SDK.

- **[Sarvam docs](https://docs.sarvam.ai/)**: Sarvam's full docs site.

---

# SmallestAI TTS

The SmallestAI TTS provider enables your agent to use SmallestAI's high-quality text-to-speech models for generating voice output.
## Installation Install the SmallestAI-enabled VideoSDK Agents package: ```bash pip install "videosdk-plugins-smallestai" ``` ## Importing ```python from videosdk.plugins.smallestai import SmallestAITTS ``` ## Authentication The Smallest AI plugin requires a [Smallest AI API key](https://console.smallest.ai/apikeys). Set `SMALLEST_API_KEY` in your `.env` file. ## Example Usage ```python from videosdk.plugins.smallestai import SmallestAITTS from videosdk.agents import CascadingPipeline # Initialize the SmallestAI TTS model tts = SmallestAITTS( # When SMALLEST_API_KEY is set in .env - DON'T pass api_key parameter api_key="your-smallestai-api-key", model="lightning", voice_id="emily" ) # Add tts to cascading pipeline pipeline = CascadingPipeline(tts=tts) ``` :::note When using .env file for credentials, don't pass them as arguments to model instances. The SDK automatically reads environment variables, so omit `api_key` from your code. ::: ## Configuration Options - `api_key`: (str) Your SmallestAI API key. Can also be set via the `SMALLEST_API_KEY` environment variable. - `model`: (str) The TTS model to use (e.g., `"lightning"`, `"lightning-large"`). Defaults to `"lightning"`. - `voice_id`: (str) The ID of the voice to use. Defaults to `"emily"`. - `speed`: (float) Speech speed multiplier. Defaults to `1.0`. - `consistency`: (float) Controls word repetition and skipping. Only supported in `lightning-large` model. Defaults to `0.5`. - `similarity`: (float) Controls similarity to the reference audio. Only supported in `lightning-large` model. Defaults to `0.0`. - `enhancement`: (bool) Enhances speech quality at the cost of increased latency. Only supported in `lightning-large` model. Defaults to `False`. ## Additional Resources The following resources provide more information about using Smallest AI with VideoSDK Agents SDK. - **[Smallest AI docs](https://waves-docs.smallest.ai/v3.0.1/content/introduction/introduction)**: Smallest AI docs. 
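Since `consistency`, `similarity`, and `enhancement` are only supported by the `lightning-large` model, one way to guard against silently ignored options is to strip them for other models before constructing the TTS instance. A hedged sketch (`filter_options` is a hypothetical helper, not part of the plugin):

```python
# Hypothetical helper: drop options that only the "lightning-large" model
# supports, so they are not passed to SmallestAITTS with other models.
LARGE_ONLY = {"consistency", "similarity", "enhancement"}

def filter_options(model: str, options: dict) -> dict:
    if model == "lightning-large":
        return dict(options)
    return {k: v for k, v in options.items() if k not in LARGE_ONLY}

print(filter_options("lightning", {"speed": 1.2, "consistency": 0.5}))
# only the "speed" option survives for the base "lightning" model
```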
---

# Speechify TTS

The Speechify TTS provider enables your agent to use Speechify's high-quality text-to-speech models for generating natural-sounding voice output.

## Installation

Install the Speechify-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-speechify"
```

## Importing

```python
from videosdk.plugins.speechify import SpeechifyTTS
```

## Authentication

The Speechify plugin requires a [Speechify API key](https://console.sws.speechify.com/).

Set `SPEECHIFY_API_KEY` in your `.env` file.

## Example Usage

```python
from videosdk.plugins.speechify import SpeechifyTTS
from videosdk.agents import CascadingPipeline

# Initialize the Speechify TTS model
tts = SpeechifyTTS(
    voice_id="kristy",
    model="simba-english"
)

# Add tts to cascading pipeline
pipeline = CascadingPipeline(tts=tts)
```

:::note
When using .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit `api_key` and other credential parameters from your code.
:::

## Configuration Options

- `voice_id`: (str) The Speechify voice to use (default: `"kristy"`).
- `api_key`: (str) Speechify API key. Can also be set via the `SPEECHIFY_API_KEY` environment variable.
- `model`: (str) The model variant to use (`"simba-base"`, `"simba-english"`, `"simba-multilingual"`, `"simba-turbo"`). Default: `"simba-english"`.
- `language`: (str) Optional ISO language code for multilingual models (e.g., `"en"`, `"es"`).

## Additional Resources

The following resources provide more information about using Speechify with VideoSDK Agents SDK.

- **[Speechify AI docs](https://docs.sws.speechify.com/v1/docs)**: Speechify AI docs.

---

# Turn Detector

The Turn Detector uses a Hugging Face model to determine whether a user's turn is completed or not, enabling precise conversation flow management in cascading pipelines.
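Conceptually, the detector reduces to comparing the model's end-of-turn confidence against a configurable threshold. A simplified sketch of that decision rule (illustrative only, not the actual Hugging Face model code):

```python
# Simplified decision rule (illustrative, not the actual model): a confidence
# score at or above the threshold marks the user's turn as complete.
def is_turn_complete(confidence: float, threshold: float = 0.7) -> bool:
    return confidence >= threshold

print(is_turn_complete(0.85))  # True - the user has likely finished speaking
print(is_turn_complete(0.40))  # False - keep listening
```

Raising the threshold makes the agent wait longer before responding; lowering it makes turn-taking snappier but risks interrupting the user.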
## Installation

Install the Turn Detector-enabled VideoSDK Agents package:

```bash
pip install "videosdk-plugins-turn-detector"
```

## Importing

```python
from videosdk.plugins.turn_detector import TurnDetector
```

## Example Usage

```python
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.agents import CascadingPipeline

# Pre-download the model (optional but recommended)
pre_download_model()

# Initialize the Turn Detector
turn_detector = TurnDetector(
    threshold=0.7
)

# Add Turn Detector to cascading pipeline
pipeline = CascadingPipeline(turn_detector=turn_detector)
```

## Configuration Options

- `threshold`: (float) Confidence threshold for turn completion detection (0.0 to 1.0, default: `0.7`)

## Pre-downloading Model

To avoid delays during agent initialization, you can pre-download the Hugging Face model:

```python
from videosdk.plugins.turn_detector import pre_download_model

# Download model before running the agent
pre_download_model()
```

---

The AI Agent SDK now supports session recordings, which can be enabled with a simple configuration. When enabled, all interactions between the user and the agent are recorded. These recordings can be played back directly from the dashboard with autoscrolling transcripts and precise timestamps, and you can also download them for offline review and analysis.

## Enabling Recording

To enable recording for an AI agent session, set the `recording` flag to `true` in the session context. Once that's done, start your agent as usual; no additional changes are required in the pipeline. By default, the recording flag is set to `false`.
```python job_context = JobContext( room_options = RoomOptions( room_id = "YOUR_ROOM_ID", name = "Agent", recording = True ) ) ``` --- The worker system provides a robust way to run AI agent instances using Python's multiprocessing. It offers process isolation, proper lifecycle management, and a clean separation between agent logic and infrastructure concerns. ## Key Components ### 1. WorkerJob `WorkerJob` is the main class that defines an agent task to be executed in a separate process. It takes two parameters: - `entrypoint`: An async function that accepts a JobContext parameter - `jobctx`: A JobContext object or a callable that returns a JobContext ```python job = WorkerJob(entrypoint=my_function, jobctx=my_context) ``` ### 2. JobContext `JobContext` provides the runtime environment for your agent, including: - **Room Management**: Handles VideoSDK room connections - **Shutdown Callbacks**: Allows cleanup operations - **Process Isolation**: Each job runs in its own process ### 3. Worker `Worker` manages the execution of jobs in separate processes, providing: - Process isolation for each agent instance - Automatic cleanup on shutdown - Error handling and logging ## Usage Example Here's a complete example of how to use the worker system with a voice agent: ```python import asyncio import aiohttp from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig from videosdk.agents import Agent, AgentSession, RealTimePipeline, WorkerJob, JobContext, RoomOptions class MyVoiceAgent(Agent): def __init__(self): super().__init__( instructions="You are a helpful voice assistant that can answer questions and help with tasks.", ) async def on_enter(self) -> None: await self.session.say("Hello, how can I help you today?") async def on_exit(self) -> None: await self.session.say("Goodbye!") async def entrypoint(ctx: JobContext): model = GeminiRealtime( model="gemini-2.5-flash-native-audio-preview-12-2025", config=GeminiLiveConfig( voice="Leda", 
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)
    agent = MyVoiceAgent()  # MyVoiceAgent's __init__ takes no arguments
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
    )

    async def cleanup_session():
        print("Cleaning up session...")

    ctx.add_shutdown_callback(cleanup_session)

    try:
        # connect to the room
        await ctx.connect()
        await ctx.room.wait_for_participant()
        await session.start()
        await asyncio.Event().wait()
    except KeyboardInterrupt:
        print("Shutting down...")
    finally:
        await session.close()
        await ctx.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="",
        name="Sandbox Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
```

## Configuration Options

### RoomOptions

- `room_id`: The VideoSDK meeting ID
- `auth_token`: Authentication token (or use the `VIDEOSDK_AUTH_TOKEN` env var)
- `name`: Agent name displayed in the meeting
- `playground`: Enable playground mode for testing
- `vision`: Enable vision capabilities
- `avatar`: Use virtual avatars from available providers

## Best Practices

1. **Always use cleanup callbacks**: Register shutdown callbacks to ensure proper resource cleanup
2. **Handle exceptions gracefully**: Use try-finally blocks to ensure cleanup happens
3. **Use playground mode for testing**: Set `playground=True` for easy testing and debugging
4. **Set environment variables**: Use `VIDEOSDK_AUTH_TOKEN` for authentication
5. **Wait for participants**: Use `wait_for_participant()` to ensure the agent waits for a participant

The worker system provides a production-ready way to deploy AI agents with proper isolation, lifecycle management, and error handling.

---

# VideoSDK AI SIP Framework

A production-ready framework for creating AI-powered voice agents using VideoSDK and various SIP providers (e.g., Twilio).
This framework enables you to build and deploy sophisticated conversational AI agents that can handle both inbound and outbound phone calls with natural language processing. ## How It Works The framework simplifies a complex process into a manageable workflow. Here’s a high-level overview of the architecture: 1. **Phone Call**: A user calls a phone number you have acquired from a SIP provider (like Twilio, Plivo, etc.). 2. **SIP Provider**: The provider receives the call and sends a webhook notification to your application server. 3. **Your Application Server**: This is the application you build using this framework. * It receives the webhook. * It uses the `SIPManager` to create a secure VideoSDK room for the call. * It launches your custom AI Agent. * It responds to the SIP provider with instructions (e.g., TwiML) to forward the call's audio into the VideoSDK room. 4. **VideoSDK & AI Agent**: Your AI Agent joins the room, receives the live audio from the phone call, processes it using your chosen AI models (for speech-to-text, language understanding, and text-to-speech), and responds in real-time to create a seamless, interactive conversation. --- ## Prerequisites Before you get started, ensure you have the following: ### System Requirements - **Python**: 3.11 or higher - **Network**: Public internet access for webhook delivery ### Required Credentials - **VideoSDK Credentials**: Sign up at [app.videosdk.live](https://app.videosdk.live/) to get your token and SIP credentials. ![VideoSDK SIP Credentials](https://strapi.videosdk.live/uploads/sip_dashboard_screenshot_8025aba2ec.png) - **SIP Provider Account**: Obtain provider-specific credentials. - **AI Model Provider**: An account with Google, OpenAI, or another supported provider. --- ## Get Started ### 1. 
Installation

Create and activate a virtual environment

**macOS/Linux:**

```bash
python3 -m venv venv
source venv/bin/activate
```

---

**Windows:**

```bash
python -m venv venv
venv\Scripts\activate
```

Install the core framework

```bash
pip install videosdk-plugins-sip
```

Install plugins for your chosen AI services (e.g., Google)

```bash
pip install videosdk-plugins-google
```

### 2. Environment Configuration

Your agent requires credentials for both VideoSDK and your chosen SIP provider. You can provide these through environment variables (recommended) or directly in your code. Create a `.env` file in your project's root directory and edit it with your credentials.

#### **VideoSDK Credentials (Required)**

These are essential for the framework to function.

```ini
VIDEOSDK_AUTH_TOKEN=your_videosdk_jwt_token
VIDEOSDK_SIP_USERNAME=your_videosdk_sip_username
VIDEOSDK_SIP_PASSWORD=your_videosdk_sip_password
```

#### **AI Model Credentials (Required)**

Add the API key for your chosen AI provider.

```ini
GOOGLE_API_KEY=your_google_api_key_here
```

#### **SIP Provider Credentials**

Fill in the details for the provider you will be using. The framework will automatically use the correct variables based on the `SIP_PROVIDER` you set.

**Twilio:** Get your credentials from the [Twilio console](https://console.twilio.com/dashboard).

```ini
SIP_PROVIDER=twilio
TWILIO_ACCOUNT_SID=ACxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TWILIO_AUTH_TOKEN=your_auth_token
TWILIO_PHONE_NUMBER=+1234567890
```

Copy the example environment file and populate it with your credentials.
```bash cp env.example .env ``` Now, edit the `.env` file: ```ini # VideoSDK Configuration VIDEOSDK_AUTH_TOKEN=your_videosdk_jwt_token VIDEOSDK_SIP_USERNAME=your_videosdk_sip_username VIDEOSDK_SIP_PASSWORD=your_videosdk_sip_password # AI Model Configuration (Example for Google Gemini) GOOGLE_API_KEY=your_google_api_key # Provider Selection (currently, 'twilio' is supported) SIP_PROVIDER=twilio # Twilio Configuration TWILIO_ACCOUNT_SID=your_twilio_account_sid TWILIO_AUTH_TOKEN=your_twilio_auth_token TWILIO_PHONE_NUMBER=+1234567890 ``` ## AI Agent and SIP Setup Here’s how to structure your application. ### Step 1: Initialize the SIP Manager The `create_sip_manager` function is the main entry point. It establishes the connection to your SIP provider by reading the environment variables you configured. ```python import os from dotenv import load_dotenv from videosdk.plugins.sip import create_sip_manager # Load variables from the .env file load_dotenv() # This function reads your .env variables and configures the correct provider sip_manager = create_sip_manager( provider=os.getenv("SIP_PROVIDER"), videosdk_token=os.getenv("VIDEOSDK_AUTH_TOKEN"), # The provider_config dictionary passes provider-specific environment variables. provider_config={ # Twilio "account_sid": os.getenv("TWILIO_ACCOUNT_SID"), "auth_token": os.getenv("TWILIO_AUTH_TOKEN"), "phone_number": os.getenv("TWILIO_PHONE_NUMBER"), } ) ``` ### Step 2: Define Your Agent's Pipeline The pipeline defines which AI models your agent uses. Here, we are using Google's Gemini for a [Real-time Pipeline](https://docs.videosdk.live/ai_agents/core-components/realtime-pipeline). You could also use a [Cascading Pipeline](https://docs.videosdk.live/ai_agents/core-components/cascading-pipeline). 
```python
from videosdk.agents import RealTimePipeline
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig

def create_agent_pipeline():
    """This creates the AI model pipeline for our agent."""
    model = GeminiRealtime(
        api_key=os.getenv("GOOGLE_API_KEY"),
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        config=GeminiLiveConfig(
            voice="Leda",  # Choose your desired voice
            response_modalities=["AUDIO"],  # We want the agent to speak back
        ),
    )
    return RealTimePipeline(model=model)
```

### Step 3: Define Your Agent's Personality and Tools

The `Agent` class defines the system prompt (instructions), personality, and custom [function tools](https://docs.videosdk.live/ai_agents/core-components/agent) and [MCP Servers](https://docs.videosdk.live/ai_agents/mcp-integration) that your agent can use.

```python
import asyncio
from videosdk.agents import Agent, function_tool, JobContext
from typing import Optional

class SIPAIAgent(Agent):
    """An AI agent for handling voice calls."""

    def __init__(self, ctx: Optional[JobContext] = None):
        super().__init__(
            instructions="You are a friendly and helpful voice assistant. Keep your responses concise.",
            tools=[self.end_call],
            # You can also integrate other function tools and MCP Servers here.
        )
        self.ctx = ctx
        self.greeting_message = "Hello! Thank you for calling. How can I assist you today?"

    @function_tool
    async def end_call(self) -> None:
        """End the call when the user asks to hang up."""
        await self.session.say("Thank you for calling. Goodbye!")
        await asyncio.sleep(1)
        # Assumed hang-up call; adjust to your SDK version's session API.
        await self.session.leave()

    async def on_enter(self) -> None:
        pass

    async def greet_user(self) -> None:
        """Greets the user with the message defined above."""
        await self.session.say(self.greeting_message)

    async def on_exit(self) -> None:
        pass
```

## Server Setup and Deployment

Your application must be accessible from the public internet so that your SIP provider can send it webhooks. You have two main options for this.

**Local:** For testing on your local machine, `ngrok` is the perfect tool. It creates a secure, public URL that tunnels directly to your local server. The `lifespan` manager in our example code handles this for you automatically.
When you start the server, it will generate a unique URL and automatically configure the `SIPManager` with it. **Code Snippet (FastAPI Lifespan Manager):** ```python import os import logging from contextlib import asynccontextmanager from fastapi import FastAPI from pyngrok import ngrok logger = logging.getLogger(__name__) @asynccontextmanager async def lifespan(app: FastAPI): """Lifespan manager for FastAPI app startup and shutdown.""" port = int(os.getenv("PORT", 8000)) try: ngrok.kill() ngrok_auth_token = os.getenv("NGROK_AUTHTOKEN") if ngrok_auth_token: ngrok.set_auth_token(ngrok_auth_token) tunnel = ngrok.connect(port, "http") # The Base URL is generated here sip_manager.set_base_url(tunnel.public_url) logger.info(f"NGROK TUNNEL CREATED: {tunnel.public_url}") except Exception as e: logger.error(f"Failed to start ngrok tunnel: {e}") yield try: ngrok.kill() logger.info("Ngrok tunnel closed") except Exception as e: logger.error(f"Error closing ngrok tunnel: {e}") app = FastAPI(title="SIP AI Agent", lifespan=lifespan) ``` --- **Cloud/Server:** For a live application, you will deploy your code to a cloud server (e.g., AWS EC2, Google Cloud Run, Heroku) that has a permanent public IP address or domain name. In this case, you should **not** use the `ngrok` `lifespan` manager. Instead, set the base URL directly in your code. **Code Snippet (Cloud Server Setup):** ```python from fastapi import FastAPI # Your FastAPI app for production app = FastAPI(title="SIP AI Agent") # IMPORTANT: Set your server's public URL before starting the app. # This should be the actual domain where your service is hosted. PUBLIC_URL = "https://api.your-public-url.com" sip_manager.set_base_url(PUBLIC_URL) ``` :::note You must configure your SIP provider's webhook to point to `https://your-public-or-ngrok-url.com/webhook/incoming`. ::: ## API Endpoint Guide Your application server, powered by the `sip` framework, exposes a set of endpoints for controlling and monitoring calls. 
---

### `POST /webhook/incoming`

This is the **most important endpoint for handling inbound calls**. When a user calls your SIP provider's phone number, the provider sends an HTTP request (a webhook) to this URL.

* **Purpose**: To serve as the primary entry point for all incoming phone calls.
* **Provider Configuration**: You **must** configure this full URL in your SIP provider's dashboard for your phone number.
* **Core Process**:
    1. Receives the webhook from the SIP provider.
    2. Creates a new VideoSDK room for the call.
    3. Launches your `SIPAIAgent` in a separate process, which then waits in the room.
    4. Responds to the provider with instructions (XML-based TwiML/ExoML) detailing how to forward the call's audio stream to the newly created room's SIP address.

---

### `POST /call/make`

This endpoint allows you to **programmatically initiate an outbound call** from your agent to a user's phone number.

```bash
# Replace with the destination phone number
curl -X POST "http://localhost:8000/call/make?to_number=+1234567890"
```

* **Purpose**: To start new conversations with users. Ideal for automated reminders, lead qualification, or proactive support.
* **Query Parameters**:

| Parameter | Type | Description | Required |
| :--- | :--- | :--- | :--- |
| `to_number` | `string` | The full phone number to call, in E.164 format (e.g., `+15551234567`). | Yes |

* **Core Process (Outbound Call Flow)**:
    1. Your request hits the endpoint.
    2. The `SIPManager` creates a VideoSDK room and immediately launches your `SIPAIAgent`. The agent then waits in the room.
    3. The manager sends an API request to your SIP provider (e.g., Twilio), instructing it to call the `to_number`.
    4. Crucially, it provides the SIP provider with a unique webhook URL for this specific call: `https://<your-base-url>/sip/answer/{room_id}`.
    5. When the user answers their phone, the SIP provider sends a webhook to that unique answer URL to connect the user to the waiting agent.
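For quick scripting, the same outbound-call request can be issued from Python using only the standard library. The helper below is illustrative (not part of the framework) and assumes the server from this guide is reachable at the given base URL:

```python
import urllib.parse
import urllib.request

def make_outbound_call(base_url: str, to_number: str) -> bytes:
    """POST to /call/make, returning the raw response body."""
    # /call/make expects the destination as a query parameter; the leading
    # "+" of an E.164 number must be percent-encoded (+1555... -> %2B1555...).
    query = urllib.parse.urlencode({"to_number": to_number})
    request = urllib.request.Request(f"{base_url}/call/make?{query}", method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read()

# Example (requires the server to be running):
# make_outbound_call("http://localhost:8000", "+15551234567")
```

Note that passing the number through `urlencode` avoids the common pitfall of the `+` sign being decoded as a space on the server side.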
--- ### `POST /sip/answer/{room_id}` This is an **internal-facing endpoint** designed to complete the outbound call loop. You will not call this endpoint directly. * **Purpose**: To serve as the dynamic "answer URL" for outbound calls. * **Path Parameters**: | Parameter | Type | Description | | :--- | :--- | :--- | | `room_id` | `string` | The unique ID of the VideoSDK room where the agent is waiting. | * **Core Process**: 1. This endpoint is called by the SIP provider *only after* the user answers an outbound call initiated by `/call/make`. 2. It uses the `room_id` to find the correct SIP address for the room where the agent is waiting. 3. It returns a simple TwiML/XML response that tells the provider how to bridge the just-answered call with the agent. --- ### `GET /sessions` A simple utility endpoint for **monitoring the health and status** of your service. * **Purpose**: To see how many calls are currently active. * **Core Process**: 1. Receives a simple `GET` request. 2. Checks the `SIPManager`'s internal state. 3. Returns a count of active sessions and a list of their corresponding room IDs. --- :::tip If you experience high latency when connecting a call, it may be due to a mismatch between the geographical region of your VideoSDK meeting server (which defaults to the nearest server region to you) and your SIP provider's region. To reduce latency, upgrade to an enterprise plan and set `VIDEOSDK_REGION=sip_provider_region` in your `.env` file for a low-latency experience. ::: --- VideoSDK's AI Agent framework offers powerful **Tracing and Observability** tools, providing deep insights into your AI agent's performance and behavior. These tools, accessible from the VideoSDK dashboard, allow you to monitor sessions, analyze interactions, and debug issues with precision. 
## Prerequisites

To view Tracing and Observability on the VideoSDK dashboard, make sure to install the VideoSDK AI Agent package using pip:

```bash
pip install videosdk-agents==0.0.23
```

:::note
Tracing and Observability support was added in version 0.0.23, which is why this version (or later) is required.
:::

## Sessions

The Sessions dashboard provides a comprehensive list of all interactions with your AI agents. Each session is a unique conversation between a user and an agent, identified by a `Session ID` and associated with a `Room ID`.
### Key Metrics For each session, you can monitor the following key metrics at a glance: - **Session ID**: A unique identifier for the session. - **Room ID**: The identifier of the room where the session took place. - **TTFW (Time to First Word)**: The time it takes for the agent to utter its first word after the user has finished speaking. This metric is crucial for measuring the responsiveness of your agent. - **P50, P90, P95**: These are percentile metrics for latency, providing a statistical distribution of response times. For example, P90 indicates that 90% of the responses were faster than the specified value. - **Interruption**: The number of times the agent was interrupted by the user. - **Duration**: The total duration of the session. - **Recording**: Indicates whether the session was recorded. You can play back the recording directly from the dashboard. - **Created At**: The timestamp of when the session was created. - **Actions**: From here, you can navigate to the detailed analytics view for the session. ## Session View By clicking on "View Analytics" for a specific session, you are taken to the Session View. This view provides a complete transcript of the conversation, along with timestamps and speaker identification (Caller or Agent).
If the session was recorded, you can play back the audio and follow along with the transcript, which automatically scrolls as the conversation progresses. This is an invaluable tool for understanding the user experience and identifying areas for improvement. By analyzing these metrics, you can quickly identify underperforming agents, diagnose latency issues, and gain a holistic view of the user experience. The next section will delve into the detailed session and trace views, where you can explore individual conversations and their underlying processes. --- The real power of VideoSDK's Tracing and Observability tools lies in the detailed session and trace views. These views provide a granular breakdown of each conversation, allowing you to analyze every turn, inspect component latencies, and understand the agent's decision-making process. ## Trace View The Trace View offers an even deeper level of insight, breaking down the entire session into a hierarchical structure of traces and spans.
### Session Configuration At the top level, you'll find the **Session Configuration**, which details all the parameters the agent was initialized with. This includes the models used for STT, LLM, and TTS, as well as any function tools or MCP tools that were configured. This information is crucial for reproducing and debugging specific agent behaviors. ### User & Agent Turns The core of the Trace View is the breakdown of the conversation into **User & Agent Turns**. Each turn represents a single exchange between the user and the agent.
Within each turn, you can see a detailed timeline of the underlying processes, including: - **STT (Speech-to-Text) Processing**: The time it took to transcribe the user's speech. - **EOU (End-of-Utterance) Detection**: The time taken to detect that the user has finished speaking. - **LLM Processing**: The time the Large Language Model took to process the input and generate a response. - **TTS (Text-to-Speech) Processing**: The time it took to convert the LLM's text response into speech. - **Time to First Byte**: The initial delay before the agent starts speaking. - **User Input Speech**: The duration of the user's speech. - **Agent Output Speech**: The duration of the agent's spoken response. ### Turn Properties For each turn, you can inspect the properties of the components involved. This includes the transcript of the user's input, the response from the LLM, and any errors that may have occurred.
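For intuition, in a cascading turn the components above run largely in sequence, so the user-perceived delay before the agent starts speaking is roughly their sum. A back-of-the-envelope calculation with illustrative numbers (not real trace data):

```python
# Illustrative per-turn component latencies in milliseconds (made-up values).
turn = {
    "stt_ms": 180,            # transcribing the user's speech
    "eou_ms": 120,            # detecting end of utterance
    "llm_ms": 450,            # generating the response
    "tts_first_byte_ms": 150, # first audio byte from TTS
}

# Approximate time-to-first-word: the sequential stages before audio starts.
ttfw_ms = sum(turn.values())
print(ttfw_ms)  # 900
```

This is why trimming any single stage (for example, a faster STT model or streaming TTS) directly lowers the TTFW metric shown on the Sessions dashboard.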
By leveraging the detailed information in the Trace View, you can pinpoint performance bottlenecks, debug errors, and gain a comprehensive understanding of your AI agent's inner workings. ### Tool Calls When an LLM invokes a tool, the Trace View provides specific details about the tool call, including the tool's name and the parameters it was called with. This is essential for debugging integrations and ensuring that your agent's tools are functioning as expected.
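The P50/P90/P95 latency metrics shown on the Sessions dashboard are standard percentiles. As a quick reference, here is the nearest-rank method for computing them (the dashboard may interpolate slightly differently; the latencies below are illustrative, not real session data):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest sample at or above p% of the distribution."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Illustrative per-response latencies in milliseconds.
latencies_ms = [420, 380, 510, 950, 610, 470, 700, 530, 820, 490]

print(percentile(latencies_ms, 50))  # 510
print(percentile(latencies_ms, 90))  # 820
print(percentile(latencies_ms, 95))  # 950
```

A large gap between P50 and P95, as in this sample, usually points to an occasional slow component (for example, a cold model or a long LLM generation) rather than uniformly slow turns.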
---

# Vision

For supported LLM providers ([OpenAI](/ai_agents/plugins/llm/openai.md), [Anthropic](/ai_agents/plugins/llm/anthropic.md), [Google](/ai_agents/plugins/llm/google.md)), you can add images to their chat context to leverage their full capabilities. You can add images as URLs or base64-encoded data, either from your frontend or directly in your agent code. Additionally, you can use live video with a realtime model such as [Gemini Live](/ai_agents/plugins/realtime/google.md).

## Image Input (Cascading Pipeline)

The agent's chat context supports both image and text input. You can add multiple images in a given session, although larger chat contexts may lead to slower response times. To add an image, simply pass an image URL or base64-encoded image data to the `ImageContent` class. Below is an example of adding an image:

```python
self.agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[ImageContent(image="YOUR_IMAGE_URL")]
)
```

Sample code for adding image context from a conversation flow:

```python
from typing import AsyncIterator

from videosdk.agents import ChatRole, ConversationFlow, ImageContent

class MyConversationFlow(ConversationFlow):
    def __init__(self, agent, stt=None, llm=None, tts=None):
        super().__init__(agent, stt, llm, tts)

    async def run(self, transcript: str) -> AsyncIterator[str]:
        await self.on_turn_start(transcript)

        # Add image context
        self.agent.chat_context.add_message(
            role=ChatRole.USER,
            content=[ImageContent(image="YOUR_IMAGE_URL")]
        )

        async for response_chunk in self.process_with_llm():
            yield response_chunk

        await self.on_turn_end()

    async def on_turn_start(self, transcript: str) -> None:
        self.is_turn_active = True

    async def on_turn_end(self) -> None:
        self.is_turn_active = False
```

### Inference Detail

If your LLM provider supports it, you can set the `inference_detail` parameter to `"high"` or `"low"` to control token usage and inference quality. The default is `"auto"`, which uses the provider's default setting.

:::info
Inference detail is currently only supported by OpenAI.
:::

## Live Video Input (Realtime Pipeline)

Set the `vision` parameter to `True` in `RoomOptions` to enable live video input. This feature is only supported with the [Gemini Live](/ai_agents/plugins/realtime/google.md) model.

```python
job_context = JobContext(
    room_options = RoomOptions(
        room_id = "YOUR_ROOM_ID",
        name = "Agent",
        vision = True
    )
)
```

---

# AI Voice Agent Quick Start

Get started with VideoSDK Agents in minutes. This guide covers both Realtime (speech-to-speech) and Cascaded (STT-LLM-TTS) pipeline implementations.

## Prerequisites

Before you begin, ensure you have:

- A VideoSDK authentication token (generate from [app.videosdk.live](https://app.videosdk.live)); follow the guide to [generate a VideoSDK token](/ai_agents/authentication-and-token)
- A VideoSDK meeting ID (you can generate one using the [Create Room API](https://docs.videosdk.live/api-reference/realtime-communication/create-room) or through the VideoSDK dashboard)
- Python 3.12 or higher

## Understanding the Architecture

Before diving into implementation, let's understand the two main pipeline architectures available:

**real-time-pipeline:**

**Realtime Pipeline** provides direct speech-to-speech processing with minimal latency:

![Realtime Pipeline Architecture](https://cdn.videosdk.live/website-resources/docs-resources/videosdk_realtime_pipeline.png)

The realtime pipeline processes audio directly through a unified model that handles:

- **User Voice Input** → **Speech to Speech model** → **Agent Voice Output**

This approach offers the fastest response times and is ideal for real-time conversations.
---

**cascading-pipeline:**

**Cascading Pipeline** processes audio through distinct stages for maximum control:

![Cascading Pipeline Architecture](https://cdn.videosdk.live/website-resources/docs-resources/videosdk_casading_pipeline.png)

The cascading pipeline processes audio through three sequential stages:

- **User Voice Input** → **STT (Speech-to-Text)** → **LLM (Large Language Model)** → **TTS (Text-to-Speech)** → **Agent Voice Output**

This approach provides better control over each processing stage and supports more complex AI reasoning.

## Installation

Create and activate a virtual environment with Python 3.12 or higher:

**macOS/Linux:**

```bash
python3.12 -m venv venv
source venv/bin/activate
```

---

**Windows:**

```bash
python -m venv venv
venv\Scripts\activate
```

**cascading-pipeline:**

```bash
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"
```

> Want to use a different provider? Check out our plugins for [STT](https://docs.videosdk.live/ai_agents/plugins/stt/openai), [LLM](https://docs.videosdk.live/ai_agents/plugins/llm/openai), and [TTS](https://docs.videosdk.live/ai_agents/plugins/tts/eleven-labs).

---

**real-time-pipeline:**

```bash
pip install videosdk-agents

# Choose your real-time provider:
# For OpenAI
pip install "videosdk-plugins-openai"

# For Gemini (LiveAPI)
pip install "videosdk-plugins-google"

# For AWS Nova
pip install "videosdk-plugins-aws"
```

## Environment Setup

It's recommended to use environment variables for secure storage of API keys, secret tokens, and authentication tokens.
Create a `.env` file in your project root:

**cascading-pipeline:**

```shell title=".env"
DEEPGRAM_API_KEY="Your Deepgram API Key"
OPENAI_API_KEY="Your OpenAI API Key"
ELEVENLABS_API_KEY="Your ElevenLabs API Key"
VIDEOSDK_AUTH_TOKEN="VideoSDK Auth token"
```

> **API Keys** - Get API keys from [Deepgram ↗](https://console.deepgram.com/), [OpenAI ↗](https://platform.openai.com/api-keys), [ElevenLabs ↗](https://elevenlabs.io/app/settings/api-keys) & the [VideoSDK Dashboard ↗](https://app.videosdk.live/api-keys); follow the guide to [generate a VideoSDK token](/ai_agents/authentication-and-token)

---

**real-time-pipeline:**

```shell title=".env"
VIDEOSDK_AUTH_TOKEN="VideoSDK Auth token"
OPENAI_API_KEY="Your OpenAI API Key"

# For Google Live API
# GOOGLE_API_KEY="Google Live API Key"

# For AWS Nova API
# AWS_ACCESS_KEY_ID="AWS Key Id"
# AWS_SECRET_ACCESS_KEY="AWS Secret Key"
# AWS_DEFAULT_REGION="AWS Region"
```

> **API Keys** - Get API keys from [OpenAI ↗](https://platform.openai.com/api-keys), [Gemini ↗](https://aistudio.google.com/app/apikey), or [AWS Nova Sonic ↗](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html) & the [VideoSDK Dashboard ↗](https://app.videosdk.live/api-keys); follow the guide to [generate a VideoSDK token](/ai_agents/authentication-and-token)

### Step 1: Creating a Custom Agent

First, let's create a custom voice agent by inheriting from the base `Agent` class:

**cascading-pipeline:**

```python title="main.py"
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from typing import AsyncIterator

# Pre-downloading the Turn Detector model
pre_download_model()

class MyVoiceAgent(Agent):
    def
__init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```

---

**real-time-pipeline:**

```python title="main.py"
import asyncio, os
from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.openai import OpenAIRealtime, OpenAIRealtimeConfig
from openai.types.beta.realtime.session import TurnDetection

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```

This code defines a basic voice agent with:

- Custom instructions that define the agent's personality and capabilities
- An entry greeting when joining a meeting and a farewell message on exit

### Step 2: Assembling and Starting the Agent Session

The pipeline connects your agent to an AI model.
**cascading-pipeline:** ```python title="main.py" async def start_session(context: JobContext): # Create agent and conversation flow agent = MyVoiceAgent() conversation_flow = ConversationFlow(agent) # Create pipeline pipeline = CascadingPipeline( stt=DeepgramSTT(model="nova-2", language="en"), llm=OpenAILLM(model="gpt-4o"), tts=ElevenLabsTTS(model="eleven_flash_v2_5"), vad=SileroVAD(threshold=0.35), turn_detector=TurnDetector(threshold=0.8) ) session = AgentSession( agent=agent, pipeline=pipeline, conversation_flow=conversation_flow ) try: await context.connect() await session.start() # Keep the session running until manually terminated await asyncio.Event().wait() finally: # Clean up resources when done await session.close() await context.shutdown() def make_context() -> JobContext: room_options = RoomOptions( # room_id="YOUR_MEETING_ID", # Set to join a pre-created room; omit to auto-create name="VideoSDK Cascaded Agent", playground=True ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` --- **real-time-pipeline:** ```python title="main.py" async def start_session(context: JobContext): # Initialize Model model = OpenAIRealtime( model="gpt-realtime-2025-08-28", config=OpenAIRealtimeConfig( voice="alloy", # Available voices:alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, and verse modalities=["text", "audio"], turn_detection=TurnDetection( type="server_vad", threshold=0.5, prefix_padding_ms=300, silence_duration_ms=200, ) ) ) # Create pipeline pipeline = RealTimePipeline( model=model ) session = AgentSession( agent=MyVoiceAgent(), pipeline=pipeline ) try: await context.connect() await session.start() # Keep the session running until manually terminated await asyncio.Event().wait() finally: # Clean up resources when done await session.close() await context.shutdown() def make_context() -> JobContext: room_options = RoomOptions( # 
room_id="YOUR_MEETING_ID", # Set to join a pre-created room; omit to auto-create name="VideoSDK Realtime Agent", playground=True ) return JobContext(room_options=room_options) if __name__ == "__main__": job = WorkerJob(entrypoint=start_session, jobctx=make_context) job.start() ``` ### Step 3: Running the Project Once you have completed the setup, you can run your AI Voice Agent project using Python. Make sure your `.env` file is properly configured and all dependencies are installed. **console-mode:** ```bash python main.py console ``` Want to see the magic instantly? Try console mode to interact with your agent directly through the terminal! No need to join a meeting room - just speak and listen through your local system. Perfect for quick testing and development. ![Console Mode](https://cdn.videosdk.live/website-resources/docs-resources/ai_agents_console_mode_image.png) Learn more about [Console Mode](/ai_agents/console_mode). --- **room:** ```bash python main.py ``` Once you run this command, a playground URL will appear in your terminal. You can use this URL to interact with your AI agent. ### Step 4: Connecting with VideoSDK Client Applications When working with a Client SDK, make sure to create the room first using the [Create Room API](https://docs.videosdk.live/api-reference/realtime-communication/create-room) . Then, simply pass the generated `room id` in both your client SDK and the `RoomOptions` for your AI Agent so they connect to the same session. :::tip Get started quickly with the [Quick Start Example](https://github.com/videosdk-live/agents-quickstart/) for the VideoSDK AI Agent SDK — everything you need to build your first AI agent fast. ::: --- # Wake Up Call Wake Up Call enables AI agents to automatically trigger actions when users remain inactive for a specified duration. This feature helps maintain user engagement and provides proactive assistance during conversation sessions. 
## Overview The Wake Up Call system allows AI agents to: - Monitor user inactivity periods during conversations - Automatically trigger custom callback functions after specified timeouts - Re-engage users with proactive messages or actions ## Key Components ### 1. Wake Up Configuration Set the inactivity timeout duration in the `AgentSession` constructor using the `wake_up` parameter: ```python session = AgentSession( agent=agent, pipeline=pipeline, conversation_flow=conversation_flow, wake_up=10 # seconds ) ``` **Important**: If a `wake_up` time is provided, you must set a callback function before starting the session. If no `wake_up` time is specified, no timer or callback will be activated. ### 2. Callback Function Define a custom async function that will be executed when the inactivity threshold is reached: ```python async def on_wake_up(): print("Wake up triggered - user inactive for 10 seconds") session.say("Hello, how can I help you today?") # Assign the callback function to the session session.on_wake_up = on_wake_up ``` :::tip Get started quickly with the [Wake Up Call Example](https://github.com/videosdk-live/agents/tree/main/examples/wakeup_call.py) — everything you need to implement inactivity detection in your AI agents. ::: --- This quickstart guide will walk you through creating a powerful AI voice agent that can answer calls made to your WhatsApp Business number. We will achieve this using a direct SIP integration between the Meta Business Platform and VideoSDK, which simplifies the architecture and removes the need for a third-party telephony provider. ### **Architecture Overview - Call Flow** The diagram below illustrates the end-to-end call flow we are building. A call initiated by a **WhatsApp User** is received by the **Meta Business Platform**, which then forwards it directly via SIP to the **VideoSDK SIP Gateway**. From there, **Routing Rules** direct the call to our **AI Agent**. 
![Whats Voice Agent Call Flow](https://assets.videosdk.live/images/whatsapp-ai-voice-agent-call-flow.png) ### **Prerequisites for Meta Configuration** This guide assumes you have already completed the initial setup of your business presence on the Meta platform. - A **[Meta (Facebook) Business Manager Account](https://business.facebook.com/)** that is verified. - A **Phone number** that has been added and verified in your WhatsApp Business Account (WABA). - A **[Meta Developer App](https://developers.facebook.com/)** with the `whatsapp_business_management` permission enabled. - A **[Permanent User Access Token](https://developers.facebook.com/docs/graph-api/get-started#step-2--generate-an-access-token)** for meta graph api endpoint. :::tip **Essential: Meta Graph API Setup** Integrating inbound/outbound WhatsApp calls requires updating your number's settings via the Meta Graph API. This guide covers the process in [Part 3: Enable WhatsApp SIP Forwarding](#part-3-enable-whatsapp-sip-forwarding). For a deeper understanding of the API, refer to the [official Meta Graph API overview](https://developers.facebook.com/docs/graph-api/overview). ::: ## Part 1: Build and Run Your Custom Voice Agent First, we'll create the AI agent that will handle the conversation logic. This agent will run on your local machine for testing. ### Step 1: Project Setup Create a directory for your project and add the following files: - `.env`: To store your secret credentials. - `requirements.txt`: To list the Python dependencies. - `main.py`: The main script for your AI agent. ### Step 2: Add Credentials and Dependencies In your `.env` file, add the necessary API keys. 
**realtime-pipeline:** ```bash title=".env" VIDEOSDK_AUTH_TOKEN="your_videosdk_token_here" GOOGLE_API_KEY="your_google_api_key_here" ``` > **API Keys**: Get your [Google API Key](https://aistudio.google.com/app/apikey) and create a [VideoSDK Account](https://app.videosdk.live/api-keys) to [generate your token ](/ai_agents/authentication-and-token). --- **cascading-pipeline:** ```bash title=".env" VIDEOSDK_AUTH_TOKEN="your_videosdk_token_here" DEEPGRAM_API_KEY="your_deepgram_api_key_here" OPENAI_API_KEY="your_openai_api_key_here" ELEVENLABS_API_KEY="your_elevenlabs_api_key_here" ``` > **API Keys**: Get keys from [Deepgram](https://console.deepgram.com/), [OpenAI](https://platform.openai.com/api-keys), [ElevenLabs](https://elevenlabs.io/app/settings/api-keys), and [VideoSDK Account](https://app.videosdk.live/api-keys) to [generate videosdk token ](/ai_agents/authentication-and-token). In `requirements.txt`, add the dependencies. **realtime-pipeline:** ```text title="requirements.txt" videosdk-agents==0.0.45 videosdk-plugins-google==0.0.45 python-dotenv==1.1.1 ``` --- **cascading-pipeline:** ```text title="requirements.txt" videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]==0.0.45 python-dotenv==1.1.1 ``` > **Latest Version**: Check the latest [videosdk-agents version on PyPI](https://pypi.org/project/videosdk-agents/) for the most recent release. ### Step 3: Create the Agent Logic Paste the following code into `main.py`. This defines the agent's personality and sets it up to be discoverable by VideoSDK's telephony service. 
**realtime-pipeline:** ```python title="main.py" import asyncio, os, traceback, logging from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, WorkerJob, Options from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig from dotenv import load_dotenv logging.basicConfig(level=logging.INFO) load_dotenv() # Define the agent's behavior and personality class MyWhatsappAgent(Agent): def __init__(self): super().__init__( instructions="You are a friendly and helpful assistant answering WhatsApp calls. Keep your responses concise and clear.", ) async def on_enter(self) -> None: await self.session.say("Hello! You've reached the VideoSDK assistant. How can I help you today?") async def on_exit(self) -> None: await self.session.say("Thank you for calling. Goodbye!") async def start_session(context: JobContext): model = GeminiRealtime( model="gemini-2.5-flash-native-audio-preview-12-2025", api_key=os.getenv("GOOGLE_API_KEY"), config=GeminiLiveConfig(voice="Leda", response_modalities=["AUDIO"]) ) pipeline = RealTimePipeline(model=model) session = AgentSession(agent=MyWhatsappAgent(), pipeline=pipeline) try: await context.connect() await session.start() await asyncio.Event().wait() finally: await session.close() await context.shutdown() if __name__ == "__main__": try: options = Options( agent_id="agent1", # CRITICAL: Unique ID for routing register=True, # REQUIRED: Register with VideoSDK for telephony max_processes=10, ) job = WorkerJob(entrypoint=start_session, options=options) job.start() except Exception as e: traceback.print_exc() ``` --- **cascading-pipeline:** ```python title="main.py" import asyncio, os, traceback, logging from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, WorkerJob, Options, ConversationFlow from videosdk.plugins.silero import SileroVAD from videosdk.plugins.turn_detector import TurnDetector, pre_download_model from videosdk.plugins.deepgram import DeepgramSTT from videosdk.plugins.openai 
import OpenAILLM from videosdk.plugins.elevenlabs import ElevenLabsTTS from dotenv import load_dotenv logging.basicConfig(level=logging.INFO) load_dotenv() pre_download_model() # Define the agent's behavior and personality class MyWhatsappAgent(Agent): def __init__(self): super().__init__( instructions="You are a friendly and helpful assistant answering WhatsApp calls. Keep your responses concise and clear.", ) async def on_enter(self) -> None: await self.session.say("Hello! You've reached the VideoSDK assistant. How can I help you today?") async def on_exit(self) -> None: await self.session.say("Thank you for calling. Goodbye!") async def start_session(context: JobContext): # Reuse a single agent instance for both the session and its conversation flow agent = MyWhatsappAgent() pipeline = CascadingPipeline( stt=DeepgramSTT(model="nova-2", language="en"), llm=OpenAILLM(model="gpt-4o"), tts=ElevenLabsTTS(model="eleven_flash_v2_5"), vad=SileroVAD(threshold=0.35), turn_detector=TurnDetector(threshold=0.8) ) session = AgentSession( agent=agent, pipeline=pipeline, conversation_flow=ConversationFlow(agent) ) try: await context.connect() await session.start() await asyncio.Event().wait() finally: await session.close() await context.shutdown() if __name__ == "__main__": try: options = Options( agent_id="agent1", # CRITICAL: Unique ID for routing register=True, # REQUIRED: Register with VideoSDK for telephony max_processes=10, ) job = WorkerJob(entrypoint=start_session, options=options) job.start() except Exception as e: traceback.print_exc() ``` ### Step 4: Install Dependencies and Run the Agent ```bash title="CLI Commands" # Create and activate a virtual environment python3 -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate # Install packages pip install -r requirements.txt # Run the agent python main.py ``` Your agent is now running and waiting for connections. Keep the terminal open. ## Part 2: Configure VideoSDK Gateways and Routing Next, we need to tell VideoSDK how to handle incoming calls and where to send them. 
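The gateway and routing-rule setup in this part can also be scripted against the same REST endpoints shown in the API tabs. A minimal sketch using Python's standard library — the helper function names are mine, but the endpoints and JSON fields mirror this guide's cURL examples:

```python
import json
import urllib.request

API_BASE = "https://api.videosdk.live/v2/sip"

def inbound_gateway_payload(name, numbers):
    # Matches the body of the Create Inbound Gateway cURL example.
    return {"name": name, "numbers": numbers}

def routing_rule_payload(gateway_id, name, numbers, agent_id):
    # Matches the Create Routing Rule body: dispatch to a self-hosted agent.
    return {
        "gatewayId": gateway_id,
        "name": name,
        "numbers": numbers,
        "dispatch": "agent",
        "agentType": "self_hosted",
        "agentId": agent_id,
    }

def build_post(path, token, payload):
    # Builds the authenticated request; urllib.request.urlopen(req) sends it.
    return urllib.request.Request(
        f"{API_BASE}/{path}",
        data=json.dumps(payload).encode(),
        headers={"Authorization": token, "Content-Type": "application/json"},
        method="POST",
    )

# The same configuration as the dashboard steps that follow:
gateway = inbound_gateway_payload("WhatsApp Gateway", ["+1234567890"])
rule = routing_rule_payload(
    "your_inbound_gateway_id", "WhatsApp Call Routing", ["+1234567890"], "agent1"
)
req = build_post("routing-rules", "YOUR_VIDEOSDK_TOKEN", rule)
```

Note that `agentId` here must match the `agent_id` in your `main.py`, exactly as in the dashboard flow.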
### Step 1: Configure an Inbound Gateway This is the entry point for calls coming from WhatsApp into VideoSDK. **dashboard:** Go to **Telephony > Inbound Gateways** in the [VideoSDK Dashboard](https://app.videosdk.live/telephony/inbound-gateways) and click **Add**. Give your gateway a name (e.g., "WhatsApp Gateway") and enter your WhatsApp Business phone number.
--- **api:** ```bash title="cURL" curl --request POST \ --url https://api.videosdk.live/v2/sip/inbound-gateways \ --header 'Authorization: YOUR_VIDEOSDK_TOKEN' \ --header 'Content-Type: application/json' \ --data '{ "name": "WhatsApp Gateway", "numbers": ["+1234567890"] }' ``` > **API Reference**: [Create Inbound Gateway](/api-reference/realtime-communication/sip/inbound-gateway/create-inbound-gateway) ### Step 2: Configure an Outbound Gateway This is the exit point for calls your agent makes to the phone network. **dashboard:** Go to **Telephony > Outbound Gateways** and click **Add**. Give it a name and provide the SIP details from your provider. For WhatsApp, this step is for enabling agent-initiated outbound calls. To get the `username` and `password`, use the Meta Graph API as shown in the **Via API** tab.
--- **api:** **Get SIP Credentials from Meta** First, you need to get the SIP credentials from the Meta Graph API. ```bash title="cURL" curl --location --globoff 'https://graph.facebook.com/v17.0/{{phone_id}}/settings?include_sip_credentials=true' \ --header 'Authorization: Bearer {{access_token}}' \ --header 'Content-Type: application/json' ``` The API response will look something like this: ```json title="Response" { "calling": { "status": "ENABLED", "call_icon_visibility": "DEFAULT", "callback_permission_status": "DISABLED", "srtp_key_exchange_protocol": "DTLS", "sip": { "status": "ENABLED", "servers": [ { "app_id": 1300814931425659, "hostname": "9WXXXXXXXX.sip.videosdk.live", "sip_user_password": "v18yo4xxxxxxxxxxxx" } ] } }, "storage_configuration": { "status": "DEFAULT" } } ``` **Create VideoSDK Outbound Gateway** Now, use the `sip_user_password` from the previous step to create an outbound gateway in VideoSDK. ```bash title="cURL" curl --request POST \ --url https://api.videosdk.live/v2/sip/outbound-gateways \ --header 'Authorization: YOUR_VIDEOSDK_TOKEN' \ --header 'Content-Type: application/json' \ --data '{ "name": "My Outbound Gateway", "numbers": ["+1234567890"], "address": "9WXXXXXXXX.sip.videosdk.live", "auth": { "username": "your_whatsapp_number", "password": "v18yo40Lhxxxxxx" } }' ``` > **API Reference**: [Create Outbound Gateway](/api-reference/realtime-communication/sip/outbound-gateway/create-outbound-gateway) ### Step 3: Create a Routing Rule This rule connects the Inbound Gateway to your specific AI agent. **dashboard:** Go to **Telephony > Routing Rules** and click **Add**. Configure the rule: - **Gateway**: Select the "WhatsApp Gateway" you just created. - **Numbers**: Add your WhatsApp Business phone number. - **Dispatch**: Choose **Agent**. - **Agent Type**: Set to `Self Hosted`. - **Agent ID**: Enter `agent1`. This **must exactly match** the `agent_id` in your `main.py` script. Click **Create**.
--- **api:** ```bash title="cURL" curl --request POST \ --url https://api.videosdk.live/v2/sip/routing-rules \ --header 'Authorization: YOUR_VIDEOSDK_TOKEN' \ --header 'Content-Type: application/json' \ --data '{ "gatewayId": "your_inbound_gateway_id", "name": "WhatsApp Call Routing", "numbers": ["+1234567890"], "dispatch": "agent", "agentType": "self_hosted", "agentId": "agent1" }' ``` > **API Reference**: [Create Routing Rule](/api-reference/realtime-communication/sip/routing-rules/create-routing-rule) --- ## Part 3: Enable WhatsApp SIP Forwarding Now, we'll instruct Meta to forward incoming WhatsApp calls to your VideoSDK Inbound Gateway. This is done via the **Meta Graph API**. ### Step 1: API Request Use the following `curl` command to update your WhatsApp phone number's settings. ```bash title="cURL" curl --location 'https://graph.facebook.com/v19.0/{{phone_number_id}}/settings' \ --header 'Authorization: Bearer {{access_token}}' \ --header 'Content-Type: application/json' \ --data '{ "calling": { "status": "ENABLED", "sip": { "status": "ENABLED", "servers": [ { "hostname": "9WXXXXXXX.sip.videosdk.live" } ] }, "srtp_key_exchange_protocol": "DTLS" } }' ``` **Replace the placeholders:** - `{{phone_number_id}}`: Your WhatsApp Business Phone Number ID from the Meta dashboard. - `{{access_token}}`: A valid User or System User access token with `whatsapp_business_management` permission. ### Step 2: API Response A successful request will return: ```json title="Response" { "success": true } ``` Your integration is now complete! Meta will forward all incoming voice calls to your WhatsApp number to VideoSDK, which will then route them to your running agent. --- ## Time to Talk! Test Your Agent :::tip **Keep Your Agent Running** Make sure your `main.py` script is still running locally before making or receiving calls. The agent must be active to handle any communication. ::: ### Receive an Inbound Call 1. Ensure your `main.py` script is still running locally. 2. 
Using a different WhatsApp account, place a voice call to your WhatsApp Business number. 3. Your local agent will answer, and you'll hear its greeting. Start a conversation! ### Make an Outbound Call To have your agent initiate a call to a WhatsApp number, use the VideoSDK SIP Call API. ```bash title="cURL" curl --request POST \ --url https://api.videosdk.live/v2/sip/call \ --header 'Authorization: YOUR_VIDEOSDK_TOKEN' \ --header 'Content-Type: application/json' \ --data '{ "gatewayId": "your_outbound_gateway_id", "sipCallTo": "whatsapp_number_to_call" }' ``` This commands your agent to dial out through your configured outbound gateway. :::tip **Geographic Optimization** For optimal performance, run your agent in the same geographic region as your SIP provider. This reduces latency and improves call quality. ::: ## Next Steps Congratulations! You've built and deployed a sophisticated AI telephony agent. You've seen how to run it locally and connect it to the global phone network for both inbound and outbound communication. - [Deploy Your Agent](/ai_agents/deployments/introduction): Learn how to deploy your AI agent to production - [Explore Telephony Docs](/telephony/introduction): Comprehensive telephony documentation and guides - [Provider Integrations](/telephony/integrations/twilio-sip-integration): SIP provider setup guides (Twilio, Vonage, etc.) --- ## Custom Video Track - Android - You can create a Video Track using `createCameraVideoTrack()` method of `VideoSDK`. - This method can be used to create video track using different encoding parameters, camera facing mode, bitrateMode, maxLayer and optimization mode. ### Parameters - **encoderConfig**: - type: `String` - required: `true` - default: `h480p_w720p` - You can choose from the below mentioned list of values for the encoder config. 
| Config | Resolution | Frame Rate | Optimized (kbps) | Balanced (kbps) | High Quality (kbps) | | :---------------- | :--------: | :--------: | :--------------: | :-------------: | :-----------------: | | h144p_w176p | 176x144 | 15 fps | 60 | 100 | 150 | | h240p_w320p | 320x240 | 15 fps | 80 | 150 | 300 | | h360p_w640p | 640x360 | 25 fps | 200 | 400 | 800 | | h480p_w640p | 640x480 | 25 fps | 300 | 600 | 1000 | | h480p_w720p | 720x480 | 30 fps | 400 | 700 | 1100 | | h720p_w960p | 960x720 | 30 fps | 800 | 1300 | 1800 | | h720p_w1280p | 1280x720 | 30 fps | 1000 | 1600 | 2400 | | h1080p_w1440p | 1440x1080 | 30 fps | 2000 | 2500 | 3500 | :::note The above encoder configurations are valid for both landscape and portrait mode. ::: - **facingMode**: - type: `String` - required: `true` - Allowed values: `front` | `back` - It will specify whether to use the front or back camera for the video track. - **optimizationMode** - type: `CustomStreamTrack.VideoMode` - required: `true` - Allowed values: `motion` | `text` | `detail` - It will specify the optimization mode for the video track being generated. - **multiStream**: - type: `boolean` - required: `true` - It will specify if the stream should send multiple resolution layers or a single resolution layer. - **context**: - type: `Context` - required: `true` - Pass the Android Context for this parameter. - **observer**: - type: `CapturerObserver` - required: `false` - If you want to use a video filter from an external SDK (e.g., [Banuba](https://www.banuba.com/)), then pass an instance of `CapturerObserver` in this parameter. - **videoDeviceInfo**: - type: `VideoDeviceInfo` - required: `false` - If you want to specify a camera device to be used in the meeting, pass it in this parameter. - **bitrateMode**: - type: `BitrateMode` - required: `false` - Allowed values: `BitrateMode.BANDWIDTH_OPTIMIZED` | `BitrateMode.BALANCED` | `BitrateMode.HIGH_QUALITY` - Controls the video quality and bandwidth consumption. 
You can choose between `BitrateMode.HIGH_QUALITY` for the best picture, `BitrateMode.BANDWIDTH_OPTIMIZED` to save data, or `BitrateMode.BALANCED` for a mix of both. Defaults to `BitrateMode.BALANCED`. - **maxLayer**: - type: `Integer` - required: `false` - Allowed values: `2` | `3` - Specifies the maximum number of simulcast layers to publish. This parameter only has an effect if `multiStream` is set to true. :::note For Banuba integration with VideoSDK, please visit [Banuba Integration with VideoSDK](/android/guide/video-and-audio-calling-api-sdk/video-processor/banuba-integration) ::: :::info - To learn more about optimizations and best practices for using custom video tracks, [follow this guide](/android/guide/video-and-audio-calling-api-sdk/render-media/optimize-video-track). ::: #### Returns - `CustomStreamTrack` ### Example **Kotlin:** ```javascript val videoCustomTrack: CustomStreamTrack = VideoSDK.createCameraVideoTrack("h720p_w960p", "front", CustomStreamTrack.VideoMode.MOTION, true, this, 2, BitrateMode.HIGH_QUALITY) ``` --- **Java:** ```javascript CustomStreamTrack customStreamTrack = VideoSDK.createCameraVideoTrack("h720p_w960p", "front", CustomStreamTrack.VideoMode.MOTION, true, this, 2, BitrateMode.HIGH_QUALITY); ``` ## Custom Audio Track - Android - You can create an Audio Track using the `createAudioTrack()` method of `VideoSDK`. - This method can be used to create an audio track with different encoding parameters. ### Parameters - **encoderConfig**: - type: `String` - required: `true` - default: `speech_standard` - You can choose from the below mentioned list of values for the encoder config. 
| Encoder Config | Bitrate | Auto Gain | Echo Cancellation | Noise Suppression | | ------------------- | :------: | :-------: | :---------------: | :---------------: | | speech_low_quality | 16 kbps | TRUE | TRUE | TRUE | | speech_standard | 24 kbps | TRUE | TRUE | TRUE | | music_standard | 32 kbps | FALSE | FALSE | FALSE | | standard_stereo | 64 kbps | FALSE | FALSE | FALSE | | high_quality | 128 kbps | FALSE | FALSE | FALSE | | high_quality_stereo | 192 kbps | FALSE | FALSE | FALSE | - **context** - type: `Context` - required: `true` - Pass the Android Context for this parameter. #### Returns - `CustomStreamTrack` ### Example **Kotlin:** ```js val audioCustomTrack: CustomStreamTrack = VideoSDK.createAudioTrack("speech_standard", this) ``` --- **Java:** ```js CustomStreamTrack audioCustomTrack = VideoSDK.createAudioTrack("speech_standard", this); ``` ## Custom Screen Share Track - Android - You can create a Screen Share track using the `createScreenShareVideoTrack()` method of `VideoSDK`. - This method can be used to create a screen share track with different encoding parameters. ### Parameters - **encoderConfig**: - type: `String` - required: `true` - default: `h720p_15fps` - You can choose from the below mentioned list of values for the encoder config. | Encoder Config | Resolution | Frame Rate | Bitrate | | -------------- | :--------: | :--------: | :----------: | | h360p_30fps | 640x360 | 30 fps | 200000 bps | | h720p_5fps | 1280x720 | 5 fps | 400000 bps | | h720p_15fps | 1280x720 | 15 fps | 1000000 bps | | h1080p_15fps | 1920x1080 | 15 fps | 1500000 bps | | h1080p_30fps | 1920x1080 | 30 fps | 1000000 bps | :::note The above encoder configurations are valid for both landscape and portrait mode. ::: - **data** - type: `Intent` - required: `true` - It is the Intent received from onActivityResult when the user grants permission for ScreenShare. - **context** - type: `Context` - required: `true` - Pass the Android Context for this parameter. 
- **listener** - type: `CustomTrackListener` - required: `true` - Callback to this listener will be made when track is ready with CustomTrack as parameter. ### Example **Kotlin:** ```javascript // data is received from onActivityResult method. VideoSDK.createScreenShareVideoTrack("h720p_15fps", data, this) { track -> meeting!!.enableScreenShare(track) } ``` --- **Java:** ```javascript // data is received from onActivityResult method. VideoSDK.createScreenShareVideoTrack("h720p_15fps", data, this, (track)->{ meeting.enableScreenShare(track); }); ``` --- # Meeting Error Codes - Android If you encounter any of the errors listed below, refer to the [Developer Experience Guide](../../guide/best-practices/developer-experience.md#listen-for-error-events), which offers recommended solutions based on common error categories. ### 1. Errors associated with Organization This table lists errors that occur due to the configuration and limitations within your organization’s account. These include account status, participant limits, and add-on service availability. | Type | Code | Message | |----------------------------------------|-------|-----------------------------------------------------| | ACCOUNT_DEACTIVATED | 4006 | Your account has been deactivated. Please contact VideoSDK support. | | ACCOUNT_DISCONTINUED | 4007 | Your account has been discontinued. Please reach out to support for more details. | | MAX_PARTICIPANT_REACHED | 4009 | You have reached the maximum participant limit for this meeting. | | MAX_SPEAKER_REACHED | 4010 | The maximum number of speakers for this meeting has been reached. | | MAX_SPEAKER_LIMIT_REACHED_ON_ORGANIZATION | 4026 | Your organization has reached the maximum number of speakers allowed. | | MAX_VIEWER_LIMIT_REACHED_ON_ORGANIZATION | 4027 | Your organization has reached the maximum number of viewers allowed. | | ADD_ON_SERVICES_DISABLED | 4021 | Add-On services have been disabled. Please contact support to enable them. 
| | MAX_RECORDING_LIMIT_REACHED_ON_ORGANIZATION | 4028 | You have reached max limit of recording on organization. To increase contact at support@videosdk.live | | MAX_HLS_LIMIT_REACHED_ON_ORGANIZATION | 4029 | You have reached max limit of hls on organization. To increase contact at support@videosdk.live | | MAX_LIVESTREAM_LIMIT_REACHED_ON_ORGANIZATION | 4030 | You have reached max limit of livestream on organization. To increase contact at support@videosdk.live | ### 2. Errors associated with Token Errors listed here are related to issues with the API key, authentication tokens, or permissions assigned to users. These errors can occur when tokens are missing, expired, or improperly configured. | Type | Code | Message | |-----------------------------------------|-------|--------------------------------------------------------| | INVALID_API_KEY | 4001 | The provided API key is either missing or invalid. | | INVALID_TOKEN | 4002 | The provided token is empty, invalid, or has expired. | | INVALID_PERMISSIONS | 4008 | The permissions in the token are incorrect. Please verify them. | | UNAUTHORIZED_MEETING_ID | 4022 | The provided token is not authorized for this meeting ID. | | UNAUTHORIZED_PARTICIPANT_ID | 4023 | The provided token is not authorized for this participant ID. | | UNAUTHORIZED_ROLE | 4024 | The role specified in the token is not valid for joining this meeting. | | UNAUTHORIZED_REQUEST | 4025 | Your request does not match the security configuration. | ### 3. Errors associated with Meeting and Participant This table lists errors resulting from invalid or missing meeting or participant details, including cases where a participant attempts to join with a duplicate ID. | Type | Code | Message | |-------------------------|-------|-------------------------------------------------| | INVALID_MEETING_ID | 4003 | The meeting ID provided is either invalid or missing. | | INVALID_PARTICIPANT_ID | 4004 | The participant ID is either invalid or missing. 
| | DUPLICATE_PARTICIPANT | 4005 | This participant has joined from another device. | ### 4. Errors associated with Add-on Service This section addresses errors that occur while using VideoSDK's add-on services such as recording, livestreaming, HLS streaming, and transcription. ### Recording-related errors | Type | Code | Message | |-----------------------------|-------|--------------------------------------------------| | START_RECORDING_FAILED | 4011 | Failed to start recording. Please try again. | | STOP_RECORDING_FAILED | 4012 | Failed to stop recording. Please try again. | | RECORDING_FAILED | 5001 | Recording was stopped due to an error. | | START_PARTICIPANT_RECORDING_FAILED | 4035 | Failed to start participant recording. Please check the request. | | STOP_PARTICIPANT_RECORDING_FAILED | 4036 | Failed to stop participant recording. | | PARTICIPANT_RECORDING_FAILED | 5012 | Participant recording has stopped due to an error. | | START_TRACK_RECORDING_FAILED | 4037 | Failed to start track recording. | | STOP_TRACK_RECORDING_FAILED | 4038 | Failed to stop track recording. | | TRACK_RECORDING_FAILED | 5013 | Track recording stopped due to an unknown error. | | PREV_RECORDING_PROCESSING | 4018 | Previous recording session is being processed, please try again after few seconds. | | RECORDING_DURATION_LIMIT_REACHED | 5004 | Recording stopped due to maximum duration being reached. | ### Livestream-related errors | Type | Code | Message | |-----------------------------|-------|--------------------------------------------------| | INVALID_LIVESTREAM_CONFIG | 4015 | Livestream 'outputs' configuration provided was invalid | | START_LIVESTREAM_FAILED | 4013 | Failed to start livestream. Please try again. | | STOP_LIVESTREAM_FAILED | 4014 | Failed to stop livestream. Please try again. | | LIVESTREAM_FAILED | 5002 | Livestream stopped due to an error. 
| | PREV_RTMP_RECORDING_PROCESSING | 4019 | Previous RTMP recording session is being processed, please try again after few seconds! | | LIVESTREAM_DURATION_LIMIT_REACHED | 5005 | Livestream stopped due to maximum duration being reached. | ### HLS-related errors | Type | Code | Message | |-----------------------------|-------|--------------------------------------------------| | START_HLS_FAILED | 4016 | Failed to start HLS stream. | | STOP_HLS_FAILED | 4017 | Failed to stop HLS stream. | | HLS_FAILED | 5003 | HLS streaming stopped due to an error. | | PREV_HLS_STREAMING_PROCESSING | 4020 | Previous HLS streaming session is still being processed. | | HLS_DURATION_LIMIT_REACHED | 5006 | HLS stream stopped due to maximum duration being reached. | ### Transcription-related errors | Type | Code | Message | |-----------------------------|-------|--------------------------------------------------| | START_TRANSCRIPTION_FAILED | 4031 | Failed to start transcription. Please try again. | | STOP_TRANSCRIPTION_FAILED | 4032 | Failed to stop transcription. | | TRANSCRIPTION_FAILED | 5007 | Transcription stopped due to an error. | ### 5.Errors associated with Media These errors involve media access, device availability, or permission-related issues affecting camera, microphone, and screen sharing. ### Device access-related errors | Type | Code | Message | |-------------------------------------------|-------|--------------------------------------------------| | ERROR_CAMERA_ACCESS | 3002 | Something went wrong. Unable to access camera. | | ERROR_MIC_ACCESS_DENIED | 3003 | It seems like microphone access was denied or dismissed. To proceed, kindly grant access through your device's settings. | | ERROR_CAMERA_ACCESS_DENIED | 3004 | It seems like camera access was denied or dismissed. To proceed, kindly grant access through your device's settings. 
| ### 6.Errors associated with Track These errors occur when there are issues with video or audio tracks, such as disconnections or invalid custom tracks. | Type | Code | Message | |-----------------------------|-------|--------------------------------------------------| | ERROR_CUSTOM_SCREEN_SHARE_TRACK_ENDED | 3005 | The provided custom track is in an ended state. Please try again with new custom track. | | ERROR_CUSTOM_SCREEN_SHARE_TRACK_DISPOSED | 3006 | The provided custom track was disposed. Please try again with new custom track. | | ERROR_CHANGE_WEBCAM | 3007 | Something went wrong, and the camera could not be changed. Please try again. | ### 7.Errors associated with Actions Below error is triggered when an action is attempted before joining a meeting. | Type | Code | Message | |-----------------------------|-------|--------------------------------------------------| | ERROR_ACTION_PERFORMED_BEFORE_MEETING_JOINED | 3001 | Oops! Something went wrong. The room was in a connecting state, and during that time, an action encountered an issue. Please try again after joining a meeting. | --- # Initializing a Meeting - Android
## initialize() To initialize the meeting, first you have to initialize the `VideoSDK`. You can initialize the `VideoSDK` using the `initialize()` method provided by the SDK. #### Parameters - **context**: Context #### Returns - _`void`_ ```js title="initialize" VideoSDK.initialize(Context context) ``` --- ## config() Now, you have to set the `token` property of the `VideoSDK` class. You can do this using the `config()` method. Please refer to this [documentation](/api-reference/realtime-communication/intro/) to generate a token. #### Parameters - **token**: String #### Returns - _`void`_ ```js title="config" VideoSDK.config(String token) ``` --- ## initMeeting() - Now, you can initialize the meeting using a factory method provided by the SDK called `initMeeting()`. - `initMeeting()` will generate a new [`Meeting`](./meeting-class/introduction.md) instance and return it. ```js title="initMeeting" VideoSDK.initMeeting( Context context, String meetingId, String name, boolean micEnabled, boolean webcamEnabled, String participantId, String mode, boolean multiStream, Map<String, CustomStreamTrack> customTracks, JSONObject metaData, String signalingBaseUrl, PreferredProtocol preferredProtocol ) ``` ## Parameters ### multiStream - It will specify if the stream should send multiple resolution layers or a single resolution layer. - type: `boolean` - `REQUIRED` ### customTracks - If you want to use custom tracks from the start of the meeting, you can pass a map of custom tracks in this parameter. - type: `Map<String, CustomStreamTrack>` or `null` - `REQUIRED` Please refer to this [documentation](../../guide/video-and-audio-calling-api-sdk/features/custom-track/custom-video-track) to know more about CustomTrack. ### metaData - If you want to provide additional details about a user joining a meeting, such as their profile image, you can pass that information in this parameter. - type: `JSONObject` - `REQUIRED` ### signalingBaseUrl - If you want to use a proxy server with the VideoSDK, you can specify your baseURL here. 
- type: `String`
- `OPTIONAL`

:::note
If you intend to use a proxy server with the VideoSDK, please inform us in advance at support@videosdk.live
:::

### preferredProtocol

- If you want to specify a preferred network protocol for communication, you can do so with `PreferredProtocol`, with options including `UDP_ONLY`, `UDP_OVER_TCP`, and `TCP_ONLY`.
- type: `PreferredProtocol`
- `OPTIONAL`

## Returns

### meeting

- After initializing the meeting, `initMeeting()` will return a new [`Meeting`](./meeting-class/introduction.md) instance.

---

## Example

**Kotlin:**

```kotlin title="initMeeting"
VideoSDK.initialize(applicationContext)

// Configure the token
VideoSDK.config(token) // pass the token generated from VideoSDK Dashboard

// Initialize the meeting
var meeting = VideoSDK.initMeeting(
    this@MainActivity, "abc-1234-xyz", "John Doe",
    true, true, null, null, false, null, null
)
```

---

**Java:**

```java title="initMeeting"
VideoSDK.initialize(getApplicationContext());

// Configure the token
VideoSDK.config(token); // pass the token generated from VideoSDK Dashboard

// Initialize the meeting
Meeting meeting = VideoSDK.initMeeting(
    MainActivity.this, "abc-1234-xyz", "John Doe",
    true, true, null, null, false, null, null, null
);
```
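Since the `meetingId` and `name` arguments in the examples above are required strings, it can help to validate user-supplied values before calling `initMeeting()`. The sketch below is illustrative only; the `MeetingInitValidator` class and its rules are our own and not part of the SDK.

```java
public class MeetingInitValidator {
    // Illustrative pre-flight check: the meeting id and display name must be
    // non-null and non-empty before being handed to VideoSDK.initMeeting().
    public static boolean isValidInitInput(String meetingId, String name) {
        return meetingId != null && !meetingId.trim().isEmpty()
                && name != null && !name.trim().isEmpty();
    }
}
```

A check like this keeps an invalid id from surfacing later as a join-time error.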
--- # MediaEffects library - Android
## Introduction

- The `MediaEffects` library enhances video applications with advanced media effects, including virtual backgrounds. It supports real-time processing and is optimized for Android devices.
- The `MediaEffects` library offers three classes to customize your video background: using a custom image, applying a blur effect, or choosing a solid color.

:::info
The Virtual Background feature in VideoSDK can be utilized regardless of the meeting environment, including the pre-call screen.
:::

## 1. BackgroundImageProcessor

- `BackgroundImageProcessor` sets a specified image as the background in a video stream, allowing you to customize the visual appearance of the video.
- The `BackgroundImageProcessor` class provides the following method.
  - The `setBackgroundSource()` method updates the virtual background by setting the new image the user wants to switch to.
  - **Parameters**: `Uri`: an image URI for the background image.
  - **Return Type**: `void`

**Kotlin:**

```kotlin
val uri = Uri.parse("https://st.depositphotos.com/2605379/52364/i/450/depositphotos_523648932-stock-photo-concrete-rooftop-night-city-view.jpg")
val backgroundImageProcessor = BackgroundImageProcessor(uri) // Sets the background image

val newUri = Uri.parse("https://img.freepik.com/free-photo/plant-against-blue-wall-mockup_53876-96052.jpg?size=626&ext=jpg&ga=GA1.1.2008272138.1723420800&semt=ais_hybrid")
backgroundImageProcessor.setBackgroundSource(newUri) // Changes the background image
```

---

**Java:**

```java
Uri uri = Uri.parse("https://st.depositphotos.com/2605379/52364/i/450/depositphotos_523648932-stock-photo-concrete-rooftop-night-city-view.jpg");
BackgroundImageProcessor backgroundImageProcessor = new BackgroundImageProcessor(uri); // Sets the background image

Uri newUri = Uri.parse("https://img.freepik.com/free-photo/plant-against-blue-wall-mockup_53876-96052.jpg?size=626&ext=jpg&ga=GA1.1.2008272138.1723420800&semt=ais_hybrid");
backgroundImageProcessor.setBackgroundSource(newUri);
// Changes the background image
```

## 2. BackgroundBlurProcessor

- `BackgroundBlurProcessor` applies a blur effect to the video background, with the intensity controlled by a float value, creating a softened visual effect.
- The `BackgroundBlurProcessor` class provides the following method.
  - The `setBlurRadius()` method adjusts the blur effect on the video background, with the blur strength controlled by the specified float value.
  - **Parameters**: `Float`: the blur strength; higher values mean stronger blur. The supported range is 0 to 25.
  - **Return Type**: `void`

**Kotlin:**

```kotlin
val backgroundBlurProcessor = BackgroundBlurProcessor(25f, this) // Applies a blur with intensity 25
backgroundBlurProcessor.setBlurRadius(17f) // Changes the blur intensity to 17
```

---

**Java:**

```java
BackgroundBlurProcessor backgroundBlurProcessor = new BackgroundBlurProcessor(25f, this); // Applies a blur with intensity 25
backgroundBlurProcessor.setBlurRadius(17f); // Changes the blur intensity to 17
```

## 3. BackgroundColorProcessor

- `BackgroundColorProcessor` sets a solid color as the video background using a `Color` object, enabling you to create a uniform color backdrop for your video.
- The `BackgroundColorProcessor` class provides the following method.
  - The `setBackgroundColor()` method sets the color the user wants to switch to for the virtual background effect.
  - **Parameters**: `Integer`: specifies the color for the virtual background.
  - **Return Type**: `void`

**Kotlin:**

```kotlin
val backgroundColorProcessor = BackgroundColorProcessor(Color.BLUE) // Sets the background color to blue
backgroundColorProcessor.setBackgroundColor(Color.CYAN) // Changes the background color to cyan
```

---

**Java:**

```java
BackgroundColorProcessor backgroundColorProcessor = new BackgroundColorProcessor(Color.BLUE); // Sets the background color to blue
backgroundColorProcessor.setBackgroundColor(Color.CYAN); // Changes the background color to cyan
```
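Because `setBlurRadius()` documents a supported range of 0 to 25, callers taking a blur strength from user input may want to clamp it first. A minimal sketch, assuming you want silent clamping rather than an error; the `BlurRadiusUtil` helper is our own and not part of the MediaEffects library.

```java
public class BlurRadiusUtil {
    public static final float MIN_BLUR = 0f;
    public static final float MAX_BLUR = 25f; // documented upper bound for setBlurRadius()

    // Clamp a requested blur strength into the supported 0..25 range
    // before forwarding it to BackgroundBlurProcessor.setBlurRadius().
    public static float clampBlurRadius(float requested) {
        return Math.max(MIN_BLUR, Math.min(MAX_BLUR, requested));
    }
}
```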
--- # Video SDK Meeting Class - Android
## Introduction

The `Meeting` class includes the properties, methods, and meeting event listener class for managing a meeting, participants, video, audio and share streams, messaging, and UI customization.

## Meeting Properties
- [getMeetingId()](./properties#getmeetingid)
- [getLocalParticipant()](./properties#getlocalparticipant)
- [getConnectionState()](./properties#getconnectionstate)
- [getParticipants()](./properties#getparticipants)
- [pubSub](./properties#pubsub)
- [LeaveReason Enum](./properties#leavereason-enum)
## Meeting Methods
- [join()](./methods#join)
- [leave()](./methods#leave)
- [end()](./methods#end)
- [enableWebcam()](./methods#enablewebcam)
- [disableWebcam()](./methods#disablewebcam)
- [unmuteMic()](./methods#unmutemic)
- [muteMic()](./methods#mutemic)
- [enableScreenShare()](./methods#enablescreenshare)
- [disableScreenShare()](./methods#disablescreenshare)
- [startRecording()](./methods#startrecording)
- [stopRecording()](./methods#stoprecording)
- [startLiveStream()](./methods#startlivestream)
- [stopLiveStream()](./methods#stoplivestream)
- [startHls()](./methods#starthls)
- [stopHls()](./methods#stophls)
- [startTranscription()](./methods#starttranscription)
- [stopTranscription()](./methods#stoptranscription)
- [changeMode()](./methods#changemode)
- [getMics()](./methods#getmics)
- [changeMic()](./methods#changemic)
- [setAudioDeviceChangeListener()](./methods#setaudiodevicechangelistener)
- [changeWebcam()](./methods#changewebcam)
- [uploadBase64File()](./methods#uploadbase64file)
- [fetchBase64File()](./methods#fetchbase64file)
- [addEventListener()](./methods#addeventlistener)
- [removeEventListener()](./methods#removeeventlistener)
- [removeAllListeners()](./methods#removealllisteners)
- [startWhiteboard()](./methods#startwhiteboard)
- [stopWhiteboard()](./methods#stopwhiteboard)
- [pauseAllStreams()](./methods#pauseallstreams)
- [resumeAllStreams()](./methods#resumeallstreams)
- [requestMediaRelay()](./methods#requestmediarelay)
- [stopMediaRelay()](./methods#stopmediarelay)
- [switchTo()](./methods#switchto)
## Meeting Events
- [onMeetingJoined](./meeting-event-listener-class#onmeetingjoined)
- [onMeetingLeft](./meeting-event-listener-class#onmeetingleft)
- [onParticipantJoined](./meeting-event-listener-class#onparticipantjoined)
- [onParticipantLeft](./meeting-event-listener-class#onparticipantleft)
- [onSpeakerChanged](./meeting-event-listener-class#onspeakerchanged)
- [onPresenterChanged](./meeting-event-listener-class#onpresenterchanged)
- [onEntryRequested](./meeting-event-listener-class#onentryrequested)
- [onEntryResponded](./meeting-event-listener-class#onentryresponded)
- [onWebcamRequested](./meeting-event-listener-class#onwebcamrequested)
- [onMicRequested](./meeting-event-listener-class#onmicrequested)
- [onRecordingStateChanged](./meeting-event-listener-class#onrecordingstatechanged)
- [onRecordingStarted](./meeting-event-listener-class#onrecordingstarted)
- [onRecordingStopped](./meeting-event-listener-class#onrecordingstopped)
- [onLivestreamStateChanged](./meeting-event-listener-class#onlivestreamstatechanged)
- [onLivestreamStarted](./meeting-event-listener-class#onlivestreamstarted)
- [onLivestreamStopped](./meeting-event-listener-class#onlivestreamstopped)
- [onHlsStateChanged](./meeting-event-listener-class#onhlsstatechanged)
- [onTranscriptionStateChanged](./meeting-event-listener-class#ontranscriptionstatechanged)
- [onTranscriptionText](./meeting-event-listener-class#ontranscriptiontext)
- [onExternalCallStarted](./meeting-event-listener-class#onexternalcallstarted)
- [onMeetingStateChanged](./meeting-event-listener-class#onmeetingstatechanged)
- [onParticipantModeChanged](./meeting-event-listener-class#onparticipantmodechanged)
- [onPinStateChanged()](./meeting-event-listener-class#onpinstatechanged)
- [onWhiteboardStarted()](./meeting-event-listener-class#onwhiteboardstarted)
- [onWhiteboardStopped()](./meeting-event-listener-class#onwhiteboardstopped)
- [onExternalCallRinging()](./meeting-event-listener-class#onexternalcallringing)
- [onExternalCallStarted()](./meeting-event-listener-class#onexternalcallstarted-1)
- [onExternalCallHangup()](./meeting-event-listener-class#onexternalcallhangup)
- [onPausedAllStreams()](./meeting-event-listener-class#onpausedallstreams)
- [onResumedAllStreams()](./meeting-event-listener-class#onresumedallstreams)
- [onMediaRelayRequestReceived()](./meeting-event-listener-class#onmediarelayrequestreceived)
- [onMediaRelayRequestResponse()](./meeting-event-listener-class#onmediarelayrequestresponse)
- [onMediaRelayStarted()](./meeting-event-listener-class#onmediarelaystarted)
- [onMediaRelayStopped()](./meeting-event-listener-class#onmediarelaystopped)
- [onMediaRelayError()](./meeting-event-listener-class#onmediarelayerror)
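The events listed above arrive as callbacks on a `MeetingEventListener` registered via `addEventListener()`. When several parts of an app care about the same event, a common pattern is to fan callbacks out through a small dispatcher. The sketch below is a generic, illustrative dispatcher of our own; it is not part of the SDK, and the event names are just the strings from the list above.

```java
import java.util.HashMap;
import java.util.Map;

public class EventDispatcher {
    private final Map<String, Runnable> handlers = new HashMap<>();

    // Register a handler for a named meeting event, e.g. "onMeetingJoined".
    public void on(String event, Runnable handler) {
        handlers.put(event, handler);
    }

    // Invoke the registered handler; returns false if none was registered.
    public boolean dispatch(String event) {
        Runnable h = handlers.get(event);
        if (h == null) return false;
        h.run();
        return true;
    }
}
```

In practice you would call `dispatch("onMeetingJoined")` from inside the corresponding listener override.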
--- # MeetingEventListener Class - Android
---

### implementation

- You can implement all the methods of the `MeetingEventListener` abstract class and add the listener to the `Meeting` class using the `addEventListener()` method of the `Meeting` class.

#### Example

**Kotlin:**

```kotlin
private val meetingEventListener = object : MeetingEventListener() {
    override fun onMeetingJoined() {
        Log.d("#meeting", "onMeetingJoined()")
    }
}
```

---

**Java:**

```java
private final MeetingEventListener meetingEventListener = new MeetingEventListener() {
    @Override
    public void onMeetingJoined() {
        Log.d("#meeting", "onMeetingJoined()");
    }
};
```

---

### onMeetingJoined()

- This event will be emitted when the [localParticipant](./properties#getlocalparticipant) has successfully joined the meeting.

#### Example

**Kotlin:**

```kotlin
override fun onMeetingJoined() {
    Log.d("#meeting", "onMeetingJoined()")
}
```

---

**Java:**

```java
@Override
public void onMeetingJoined() {
    Log.d("#meeting", "onMeetingJoined()");
}
```

---

### onMeetingLeft()

- This event will be emitted when the [localParticipant](./properties#getlocalparticipant) has left the meeting.
- For more context on why the participant left, use the overloaded `onMeetingLeft(LeaveReason reason)` method.

#### Example

**Kotlin:**

```kotlin
override fun onMeetingLeft() {
    Log.d("#meeting", "onMeetingLeft()")
}
```

---

**Java:**

```java
@Override
public void onMeetingLeft() {
    Log.d("#meeting", "onMeetingLeft()");
}
```

---

### onMeetingLeft(LeaveReason reason)

- This event is an overload of `onMeetingLeft()` and provides the specific reason why the local participant left the meeting.

#### Event callback parameters

- **reason**: `LeaveReason` - The reason why the participant left. See the [LeaveReason Enum](./properties#leavereason-enum) for all possible values.
#### Example **Kotlin:** ```javascript override fun onMeetingLeft(reason: LeaveReason) { Log.d("#meeting", "onMeetingLeft() :: reason: ${reason.message}") } ``` --- **Java:** ```javascript @Override public void onMeetingLeft(LeaveReason reason) { Log.d("#meeting", "onMeetingLeft() :: reason: " + reason.getMessage()); } ``` --- ### onParticipantJoined() - This event will be emitted when a new [participant](../participant-class/introduction) joined the meeting. #### Event callback parameters - **participant**: [Participant](../participant-class/introduction) #### Example **Kotlin:** ```javascript override fun onParticipantJoined(participant: Participant) { Log.d("#meeting", participant.displayName + " joined"); } ``` --- **Java:** ```javascript @Override public void onParticipantJoined(Participant participant) { Log.d("#meeting", participant.getDisplayName() + " joined"); } ``` --- ### onParticipantLeft(Participant participant) - This event will be emitted when a joined [participant](../participant-class/introduction) left the meeting. - For more context on why the participant left, use the overloaded `onParticipantLeft(Participant participant, LeaveReason reason)` method. #### Event callback parameters - **participant**: [Participant](../participant-class/introduction) #### Example **Kotlin:** ```javascript override fun onParticipantLeft(participant: Participant) { Log.d("#meeting", participant.displayName + " left"); } ``` --- **Java:** ```javascript @Override public void onParticipantLeft(Participant participant) { Log.d("#meeting", participant.getDisplayName() + " left"); } ``` --- ### onParticipantLeft(Participant participant, LeaveReason reason) - This event is an overload for `onParticipantLeft()` and provides the specific reason why a remote participant left the meeting. #### Event callback parameters - **participant**: [Participant](../participant-class/introduction) - **reason**: `LeaveReason` - The reason why the participant left. 
See the [LeaveReason Enum](./properties#leavereason-enum) for all possible values.

#### Example

**Kotlin:**

```kotlin
override fun onParticipantLeft(participant: Participant, reason: LeaveReason) {
    Log.d("#meeting", "${participant.displayName} left :: reason: ${reason.message}")
}
```

---

**Java:**

```java
@Override
public void onParticipantLeft(Participant participant, LeaveReason reason) {
    Log.d("#meeting", participant.getDisplayName() + " left :: reason: " + reason.getMessage());
}
```

---

### onSpeakerChanged()

- This event will be emitted when the active speaker changes.
- If you want to know which participant is actively speaking, use this event.
- If no participant is actively speaking, this event will pass `null` as the event callback parameter.

#### Event callback parameters

- **participantId**: String

#### Example

**Kotlin:**

```kotlin
override fun onSpeakerChanged(participantId: String?) {
    //
}
```

---

**Java:**

```java
@Override
public void onSpeakerChanged(String participantId) {
    //
}
```

---

### onPresenterChanged()

- This event will be emitted when any [participant](../participant-class/introduction) starts or stops screen sharing.
- It will pass `participantId` as an event callback parameter.
- If a participant stops screen sharing, this event will pass `null` as the event callback parameter.

#### Event callback parameters

- **participantId**: String

#### Example

**Kotlin:**

```kotlin
override fun onPresenterChanged(participantId: String?) {
    //
}
```

---

**Java:**

```java
@Override
public void onPresenterChanged(String participantId) {
    //
}
```

---

### onEntryRequested()

- This event will be emitted when a new [participant](../participant-class/introduction) who is trying to join the meeting has the permission **`ask_join`** in their token.
- This event will only be emitted to the [participants](./properties#getparticipants) in the meeting who have the permission **`allow_join`** in their token.
- This event will pass the following event parameters: the `participantId` and `name` of the new participant who is trying to join the meeting, and the `allow()` and `deny()` methods to take the required action.

#### Event callback parameters

- **peerId**: String
- **name**: String

#### Example

**Kotlin:**

```kotlin
override fun onEntryRequested(id: String?, name: String?) {
    //
}
```

---

**Java:**

```java
@Override
public void onEntryRequested(String id, String name) {
    //
}
```

---

### onEntryResponded()

- This event will be emitted when the `join()` request has been responded to.
- This event will be emitted to the [participants](./properties#getparticipants) in the meeting who have the permission **`allow_join`** in their token.
- This event will also be emitted to the [participant](../participant-class/introduction) who requested to join the meeting.

#### Event callback parameters

- **participantId**: _String_
- **decision**: _"allowed"_ | _"denied"_

#### Example

**Kotlin:**

```kotlin
override fun onEntryResponded(id: String?, decision: String?) {
    //
}
```

---

**Java:**

```java
@Override
public void onEntryResponded(String id, String decision) {
    //
}
```

---

### onWebcamRequested()

- This event will be emitted to participant `B` when any other participant `A` requests to enable the webcam of participant `B`.
- On accepting the request, the webcam of participant `B` will be enabled.
#### Event callback parameters - **participantId**: String - **listener**: WebcamRequestListener \{ **accept**: Method; **reject**: Method } #### Example **Kotlin:** ```javascript override fun onWebcamRequested(participantId: String, listener: WebcamRequestListener) { // if accept request listener.accept() // if reject request listener.reject() } ``` --- **Java:** ```javascript @Override public void onWebcamRequested(String participantId, WebcamRequestListener listener) { // if accept request listener.accept(); // if reject request listener.reject(); } ``` ### onMicRequested() - This event will be emitted to the participant `B` when any other participant `A` requests to enable mic of participant `B`. - On accepting the request, mic of participant `B` will be enabled. #### Event callback parameters - **participantId**: String - **listener**: MicRequestListener \{ **accept**: Method; **reject**: Method } #### Example **Kotlin:** ```javascript override fun onMicRequested(participantId: String, listener: MicRequestListener) { // if accept request listener.accept() // if reject request listener.reject() } ``` --- **Java:** ```javascript @Override public void onMicRequested(String participantId, MicRequestListener listener) { // if accept request listener.accept(); // if reject request listener.reject(); } ``` --- ### onRecordingStateChanged() - This event will be emitted when the meeting's recording status changed. #### Event callback parameters - **recordingState**: String `recordingState` has following values - `RECORDING_STARTING` - Recording is in starting phase and hasn't started yet. - `RECORDING_STARTED` - Recording has started successfully. - `RECORDING_STOPPING` - Recording is in stopping phase and hasn't stopped yet. - `RECORDING_STOPPED` - Recording has stopped successfully. 
#### Example **Kotlin:** ```javascript override fun onRecordingStateChanged(recordingState: String) { when (recordingState) { "RECORDING_STARTING" -> { Log.d("onRecordingStateChanged", "Meeting recording is starting") } "RECORDING_STARTED" -> { Log.d("onRecordingStateChanged", "Meeting recording is started") } "RECORDING_STOPPING" -> { Log.d("onRecordingStateChanged", "Meeting recording is stopping") } "RECORDING_STOPPED" -> { Log.d("onRecordingStateChanged", "Meeting recording is stopped") } } } ``` --- **Java:** ```javascript @Override public void onRecordingStateChanged(String recordingState) { switch (recordingState) { case "RECORDING_STARTING": Log.d("onRecordingStateChanged", "Meeting recording is starting"); break; case "RECORDING_STARTED": Log.d("onRecordingStateChanged", "Meeting recording is started"); break; case "RECORDING_STOPPING": Log.d("onRecordingStateChanged", "Meeting recording is stopping"); break; case "RECORDING_STOPPED": Log.d("onRecordingStateChanged", "Meeting recording is stopped"); break; } } ``` --- ### onRecordingStarted() _`This event will be deprecated soon`_ - This event will be emitted when recording of the meeting is started. #### Example **Kotlin:** ```javascript override fun onRecordingStarted() { // } ``` --- **Java:** ```javascript @Override public void onRecordingStarted() { // } ``` --- ### onRecordingStopped() _`This event will be deprecated soon`_ - This event will be emitted when recording of the meeting is stopped. #### Example **Kotlin:** ```javascript override fun onRecordingStopped() { // } ``` --- **Java:** ```javascript @Override public void onRecordingStopped() { // } ``` --- ### onLivestreamStateChanged() - This event will be emitted when the meeting's livestream status changed. #### Event callback parameters - **livestreamState**: String `livestreamState` has following values - `LIVESTREAM_STARTING` - Livestream is in starting phase and hasn't started yet. 
- `LIVESTREAM_STARTED` - Livestream has started successfully. - `LIVESTREAM_STOPPING` - Livestream is in stopping phase and hasn't stopped yet. - `LIVESTREAM_STOPPED` - Livestream has stopped successfully. #### Example **Kotlin:** ```javascript override fun onLivestreamStateChanged(livestreamState: String?) { when (livestreamState) { "LIVESTREAM_STARTING" -> Log.d( "LivestreamStateChanged", "Meeting livestream is starting" ) "LIVESTREAM_STARTED" -> Log.d( "LivestreamStateChanged", "Meeting livestream is started" ) "LIVESTREAM_STOPPING" -> Log.d("LivestreamStateChanged", "Meeting livestream is stopping" ) "LIVESTREAM_STOPPED" -> Log.d("LivestreamStateChanged", "Meeting livestream is stopped" ) } } ``` --- **Java:** ```javascript @Override public void onLivestreamStateChanged(String livestreamState) { switch (livestreamState) { case "LIVESTREAM_STARTING": Log.d("LivestreamStateChanged", "Meeting livestream is starting"); break; case "LIVESTREAM_STARTED": Log.d("LivestreamStateChanged", "Meeting livestream is started"); break; case "LIVESTREAM_STOPPING": Log.d("LivestreamStateChanged", "Meeting livestream is stopping"); break; case "LIVESTREAM_STOPPED": Log.d("LivestreamStateChanged", "Meeting livestream is stopped"); break; } } ``` --- ### onLivestreamStarted() _`This event will be deprecated soon`_ - This event will be emitted when `RTMP` live stream of the meeting is started. #### Example **Kotlin:** ```javascript override fun onLivestreamStarted() { // } ``` --- **Java:** ```javascript @Override public void onLivestreamStarted() { // } ``` --- ### onLivestreamStopped() _`This event will be deprecated soon`_ - This event will be emitted when `RTMP` live stream of the meeting is stopped. #### Example **Kotlin:** ```javascript override fun onLivestreamStopped() { // } ``` --- **Java:** ```javascript @Override public void onLivestreamStopped() { // } ``` --- ### onHlsStateChanged() - This event will be emitted when the meeting's HLS(Http Livestreaming) status changed. 
#### Event callback parameters

- **HlsState**: \{ **status**: String }

- `status` has the following values:
  - `HLS_STARTING` - HLS is in the starting phase and hasn't started yet.
  - `HLS_STARTED` - HLS has started successfully.
  - `HLS_PLAYABLE` - HLS is playable now.
  - `HLS_STOPPING` - HLS is in the stopping phase and hasn't stopped yet.
  - `HLS_STOPPED` - HLS has stopped successfully.
- When you receive the `HLS_PLAYABLE` status, you will receive 2 URLs in the response:
  - `playbackHlsUrl` - Live HLS with playback support
  - `livestreamUrl` - Live HLS without playback support

:::note
`downstreamUrl` is now deprecated. Use `playbackHlsUrl` or `livestreamUrl` in place of `downstreamUrl`.
:::

#### Example

**Kotlin:**

```kotlin
override fun onHlsStateChanged(HlsState: JSONObject) {
    when (HlsState.getString("status")) {
        "HLS_STARTING" -> Log.d("onHlsStateChanged", "Meeting hls is starting")
        "HLS_STARTED" -> Log.d("onHlsStateChanged", "Meeting hls is started")
        "HLS_PLAYABLE" -> {
            Log.d("onHlsStateChanged", "Meeting hls is playable now")
            // on hls playable you will receive playbackHlsUrl and livestreamUrl
            val playbackHlsUrl = HlsState.getString("playbackHlsUrl")
            val livestreamUrl = HlsState.getString("livestreamUrl")
        }
        "HLS_STOPPING" -> Log.d("onHlsStateChanged", "Meeting hls is stopping")
        "HLS_STOPPED" -> Log.d("onHlsStateChanged", "Meeting hls is stopped")
    }
}
```

---

**Java:**

```java
@Override
public void onHlsStateChanged(JSONObject HlsState) {
    switch (HlsState.getString("status")) {
        case "HLS_STARTING":
            Log.d("onHlsStateChanged", "Meeting hls is starting");
            break;
        case "HLS_STARTED":
            Log.d("onHlsStateChanged", "Meeting hls is started");
            break;
        case "HLS_PLAYABLE":
            Log.d("onHlsStateChanged", "Meeting hls is playable now");
            // on hls playable you will receive playbackHlsUrl and livestreamUrl
            String playbackHlsUrl = HlsState.getString("playbackHlsUrl");
            String livestreamUrl = HlsState.getString("livestreamUrl");
            break;
        case "HLS_STOPPING":
            Log.d("onHlsStateChanged", "Meeting hls is stopping");
            break;
        case "HLS_STOPPED":
            Log.d("onHlsStateChanged", "Meeting hls is stopped");
            break;
    }
}
```

---

### onTranscriptionStateChanged()

- This event will be triggered whenever the state of the realtime transcription changes.

#### Event callback parameters

- **data**: \{ **status**: String, **id**: String }
  - **status**: String
  - **id**: String

- `status` has the following values:
  - `TRANSCRIPTION_STARTING` - Realtime transcription is in the starting phase and hasn't started yet.
  - `TRANSCRIPTION_STARTED` - Realtime transcription has started successfully.
  - `TRANSCRIPTION_STOPPING` - Realtime transcription is in the stopping phase and hasn't stopped yet.
  - `TRANSCRIPTION_STOPPED` - Realtime transcription has stopped successfully.

#### Example

**Kotlin:**

```kotlin
override fun onTranscriptionStateChanged(data: JSONObject) {
    // Status can be :: TRANSCRIPTION_STARTING
    // Status can be :: TRANSCRIPTION_STARTED
    // Status can be :: TRANSCRIPTION_STOPPING
    // Status can be :: TRANSCRIPTION_STOPPED
    val status = data.getString("status")
    Log.d("MeetingActivity", "Transcription status: $status")
}
```

---

**Java:**

```java
@Override
public void onTranscriptionStateChanged(JSONObject data) {
    // Status can be :: TRANSCRIPTION_STARTING
    // Status can be :: TRANSCRIPTION_STARTED
    // Status can be :: TRANSCRIPTION_STOPPING
    // Status can be :: TRANSCRIPTION_STOPPED
    String status = data.getString("status");
    Log.d("MeetingActivity", "Transcription status: " + status);
}
```

---

### onTranscriptionText()

- This event will be emitted when text for the running realtime transcription is received.
#### Event callback parameters

- **data**: TranscriptionText
  - **TranscriptionText.participantId**: String
  - **TranscriptionText.participantName**: String
  - **TranscriptionText.text**: String
  - **TranscriptionText.timestamp**: int
  - **TranscriptionText.type**: String

#### Example

**Kotlin:**

```kotlin
override fun onTranscriptionText(data: TranscriptionText) {
    val participantId = data.participantId
    val participantName = data.participantName
    val text = data.text
    val timestamp = data.timestamp
    val type = data.type
    Log.d("MeetingActivity", "$participantName: $text $timestamp")
}
```

---

**Java:**

```java
@Override
public void onTranscriptionText(TranscriptionText data) {
    String participantId = data.getParticipantId();
    String participantName = data.getParticipantName();
    String text = data.getText();
    int timestamp = data.getTimestamp();
    String type = data.getType();
    Log.d("MeetingActivity", participantName + ": " + text + " " + timestamp);
}
```

---

### onWhiteboardStarted()

- This event will be triggered when the whiteboard is successfully started.

#### Event callback parameters

- **url**: String

#### Example

**Kotlin:**

```kotlin
override fun onWhiteboardStarted(url: String) {
    super.onWhiteboardStarted(url)
    //...
}
```

---

**Java:**

```java
@Override
public void onWhiteboardStarted(String url) {
    super.onWhiteboardStarted(url);
    //...
}
```

---

### onWhiteboardStopped()

- This event will be triggered when the whiteboard session is successfully terminated.

#### Example

**Kotlin:**

```kotlin
override fun onWhiteboardStopped() {
    super.onWhiteboardStopped()
    //...
}
```

---

**Java:**

```java
@Override
public void onWhiteboardStopped() {
    super.onWhiteboardStopped();
    //...
}
```

---

### onExternalCallStarted()

- This event will be emitted when the local participant receives an incoming call.
#### Example

**Kotlin:**

```kotlin
override fun onExternalCallStarted() {
    //
}
```

---

**Java:**

```java
@Override
public void onExternalCallStarted() {
    //
}
```

---

### onMeetingStateChanged()

- This event will be emitted when the state of the meeting changes.
- It will pass **`state`** as an event callback parameter, which indicates the current state of the meeting.
- All available states are `CONNECTING`, `CONNECTED`, `RECONNECTING`, and `DISCONNECTED`.

#### Event callback parameters

- **state**: ConnectionState

#### Example

**Kotlin:**

```kotlin
override fun onMeetingStateChanged(state: ConnectionState) {
    super.onMeetingStateChanged(state)
    Log.d("TAG", "onMeetingStateChanged: $state")
}
```

---

**Java:**

```java
@Override
public void onMeetingStateChanged(ConnectionState state) {
    super.onMeetingStateChanged(state);
    Log.d("TAG", "onMeetingStateChanged: " + state);
}
```

---

### onExternalCallRinging()

This callback is triggered when the user's phone starts ringing, whether it's a traditional phone call or a VoIP call (e.g., WhatsApp). This event allows us to detect when the user is receiving an external call.

#### Example

**Kotlin:**

```kotlin
override fun onExternalCallRinging() {
    Log.d("#meeting", "onExternalCallRinging: User phone is ringing")
}
```

---

**Java:**

```java
@Override
public void onExternalCallRinging() {
    Log.d("#meeting", "onExternalCallRinging: User phone is ringing");
}
```

---

### onExternalCallStarted()

This callback is triggered when the user answers an external phone call, whether it's a traditional phone call or a VoIP call (e.g., WhatsApp). This event allows us to detect when the user has started a call.
#### Example **Kotlin:** ```javascript override fun onExternalCallStarted() { Log.d("#meeting", "onExternalCallAnswered: User call is answered") } ``` --- **Java:** ```javascript @Override public void onExternalCallStarted() { Log.d("#meeting", "onExternalCallAnswered: User call is answered"); } ``` --- ### onExternalCallHangup() This callback is triggered when an external call ends, whether it’s a traditional phone call or a VoIP call (e.g., WhatsApp). This event detects when a call has ended #### Example **Kotlin:** ```javascript override fun onExternalCallHangup() { Log.d("#meeting", "onExternalCallAnswered: User call ends") } ``` --- **Java:** ```javascript @Override public void onExternalCallHangup() { Log.d("#meeting", "onExternalCallAnswered: User call ends"); } ``` --- ### onPausedAllStreams() - This callback is triggered when all or specified media streams within the meeting are successfully paused #### Parameters - **`kind`**: Specifies the type of media stream that was paused. - **Type**: `String` - **Possible values**: - `"audio"`: Indicates that audio streams have been paused. - `"video"`: Indicates that video streams have been paused. - `"share"`: Indicates that screen-sharing video streams have been paused #### Example **Kotlin:** ```javascript override fun onPausedAllStreams(kind: String) { Log.d("TAG", "onPausedAllStreams: $kind") super.onPausedAllStreams(kind) } ``` --- **Java:** ```javascript @Override public void onPausedAllStreams(String kind) { Log.d("TAG", "onPausedAllStreams: " + kind); super.onPausedAllStreams(kind); } ``` --- ### onResumedAllStreams() - This callback is triggered when all or specified media streams within the meeting are successfully resumed #### Parameters - **`kind`**: Specifies the type of media stream that was resumed. - **Type**: `String` - **Possible values**: - `"audio"`: Indicates that audio streams have been resumed. - `"video"`: Indicates that video streams have been resumed. 
    - `"share"`: Indicates that screen-sharing video streams have been resumed.

#### Example

**Kotlin:**

```kotlin
override fun onResumedAllStreams(kind: String) {
    Log.d("TAG", "onResumedAllStreams: $kind")
    super.onResumedAllStreams(kind)
}
```

---

**Java:**

```java
@Override
public void onResumedAllStreams(String kind) {
    Log.d("TAG", "onResumedAllStreams: " + kind);
    super.onResumedAllStreams(kind);
}
```

---

### onParticipantModeChanged()

This event is triggered when a participant's mode is updated. It passes `data` as an event callback parameter, which includes the mode:

- **`SEND_AND_RECV`**: Both audio and video streams will be produced and consumed.
- **`SIGNALLING_ONLY`**: Audio and video streams will not be produced or consumed. It is used solely for signaling.
- **`RECV_ONLY`**: Only audio and video streams will be consumed without producing any.

#### Event Callback Parameters

- **data**: `{ mode: String, participantId: String }`
  - **mode**: `String`
  - **participantId**: `String`

#### Example

**Kotlin:**

```kotlin
override fun onParticipantModeChanged(data: JSONObject?) {
    //...
}
```

---

**Java:**

```java
@Override
public void onParticipantModeChanged(JSONObject data) {
    //...
}
```

---

### onPinStateChanged()

- This event will be triggered when any participant gets pinned or unpinned by any participant.

#### Event callback parameters

- **pinStateData**: \{ **peerId**: String, **state**: JSONObject, **pinnedBy**: String }
  - **peerId**: String
  - **state**: JSONObject
  - **pinnedBy**: String

#### Example

**Kotlin:**

```kotlin
override fun onPinStateChanged(pinStateData: JSONObject?)
{
    Log.d("onPinStateChanged: ", pinStateData!!.getString("peerId")) // id of the participant who was pinned
    Log.d("onPinStateChanged: ", pinStateData.getJSONObject("state").toString()) // { cam: true, share: true }
    Log.d("onPinStateChanged: ", pinStateData.getString("pinnedBy")) // id of the participant who pinned that participant
}
```

---

**Java:**

```java
@Override
public void onPinStateChanged(JSONObject pinStateData) {
    Log.d("onPinStateChanged: ", pinStateData.getString("peerId")); // id of the participant who was pinned
    Log.d("onPinStateChanged: ", pinStateData.getJSONObject("state").toString()); // { cam: true, share: true }
    Log.d("onPinStateChanged: ", pinStateData.getString("pinnedBy")); // id of the participant who pinned that participant
}
```

---

### onMediaRelayRequestReceived()

- This callback is triggered when a request is received for media relay in the destination meeting.

#### Event callback parameters

- **`participantId - (String)`**: Specifies the participantId who requested the media relay.
- **`meetingId - (String)`**: Specifies the meeting from where the media relay request was made.
- **`listener - (RelayRequestListener)`**: A callback interface with the following methods:
  - **accept()**: Call this to approve the media relay request.
  - **reject()**: Call this to deny the media relay request.

#### Example

**Kotlin:**

```kotlin
override fun onMediaRelayRequestReceived(participantId: String, liveStreamId: String, listener: RelayRequestListener) {
    // If accepting the request
    listener.accept()
    // If rejecting the request
    listener.reject()
}
```

---

**Java:**

```java
@Override
public void onMediaRelayRequestReceived(String participantId, String liveStreamId, RelayRequestListener listener) {
    // If accepting the request
    listener.accept();
    // If rejecting the request
    listener.reject();
}
```

---

### onMediaRelayRequestResponse()

- This callback is triggered when a response is received for a media relay request in the source meeting.
#### Event callback parameters

- **`participantId - (String)`**: Specifies the participantId who responded to the request for the media relay.
- **`decision - (String)`**: Specifies whether the request for media relay was accepted or not.

#### Example

**Kotlin:**

```kotlin
override fun onMediaRelayRequestResponse(participantId: String, decision: String) {
    super.onMediaRelayRequestResponse(participantId, decision)
    Log.d("MediaRelay", "Participant ID: $participantId, Decision: $decision")
}
```

---

**Java:**

```java
@Override
public void onMediaRelayRequestResponse(String participantId, String decision) {
    super.onMediaRelayRequestResponse(participantId, decision);
    Log.d("MediaRelay", "Participant ID: " + participantId + ", Decision: " + decision);
}
```

---

### onMediaRelayStarted()

- This callback is triggered when the media relay to the destination meeting successfully starts.

#### Parameters

- **`meetingId - (String)`**: Specifies the meeting where the media relay started.

#### Example

**Kotlin:**

```kotlin
override fun onMediaRelayStarted(relayMeetingId: String) {
    super.onMediaRelayStarted(relayMeetingId)
    Log.d("MediaRelay", "Media relay started to meeting ID: $relayMeetingId")
}
```

---

**Java:**

```java
@Override
public void onMediaRelayStarted(String relayMeetingId) {
    super.onMediaRelayStarted(relayMeetingId);
    Log.d("MediaRelay", "Media relay started to meeting ID: " + relayMeetingId);
}
```

---

### onMediaRelayStopped()

- This callback is triggered when the media relay to the destination meeting stops for any reason.

#### Parameters

- **`meetingId - (String)`**: Specifies the meeting where the media relay stopped.
- **`reason - (String)`**: Specifies the reason why the media relay stopped.

#### Example

**Kotlin:**

```kotlin
override fun onMediaRelayStopped(meetingId: String, reason: String) {
    super.onMediaRelayStopped(meetingId, reason)
    Log.d("MediaRelay", "Media relay stopped for meeting ID: $meetingId, Reason: $reason")
}
```

---

**Java:**

```java
@Override
public void onMediaRelayStopped(String meetingId, String reason) {
    super.onMediaRelayStopped(meetingId, reason);
    Log.d("MediaRelay", "Media relay stopped for meeting ID: " + meetingId + ", Reason: " + reason);
}
```

---

### onMediaRelayError()

- This callback is triggered when an error occurs during media relay to the destination meeting.

#### Parameters

- **`meetingId - (String)`**: Specifies the meeting where the media relay error occurred.
- **`error - (String)`**: Specifies the error that occurred.

#### Example

**Kotlin:**

```kotlin
override fun onMediaRelayError(meetingId: String, error: String) {
    super.onMediaRelayError(meetingId, error)
    Log.e("MediaRelay", "Media relay error for meeting ID: $meetingId, Error: $error")
}
```

---

**Java:**

```java
@Override
public void onMediaRelayError(String meetingId, String error) {
    super.onMediaRelayError(meetingId, error);
    Log.e("MediaRelay", "Media relay error for meeting ID: " + meetingId + ", Error: " + error);
}
```

---
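The `accept()`/`reject()` contract of `onMediaRelayRequestReceived()` can be sketched without the SDK. In the sketch below, only the two listener method names come from the documentation above; the `RelayRequestDemo` class, the `decide()` helper, and the allow-list policy are hypothetical stand-ins used purely to illustrate the decision flow:

```java
import java.util.Set;

public class RelayRequestDemo {

    // Mirrors the accept()/reject() callback contract documented for
    // onMediaRelayRequestReceived(); the real interface lives in the SDK.
    interface RelayRequestListener {
        void accept();
        void reject();
    }

    // Hypothetical policy: approve relay requests only from an allow-list
    // of participant ids, and report which branch was taken.
    static String decide(Set<String> allowed, String participantId,
                         RelayRequestListener listener) {
        if (allowed.contains(participantId)) {
            listener.accept();
            return "accepted";
        }
        listener.reject();
        return "rejected";
    }

    public static void main(String[] args) {
        RelayRequestListener log = new RelayRequestListener() {
            public void accept() { System.out.println("relay accepted"); }
            public void reject() { System.out.println("relay rejected"); }
        };
        decide(Set.of("host-1"), "host-1", log);  // prints "relay accepted"
        decide(Set.of("host-1"), "guest-9", log); // prints "relay rejected"
    }
}
```

In a real app, the same branch would run inside the `onMediaRelayRequestReceived()` override shown above, with the SDK supplying the listener.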
--- # Meeting Class Methods - Android
### join()

- It is used to join a meeting.
- After meeting initialization by [`initMeeting()`](../initMeeting), a new instance of [Meeting](./introduction) is returned. However, by default, it will not automatically join the meeting. Hence, to join the meeting you should call `join()`.

#### Events associated with `join()`:

- The local participant will receive a [`onMeetingJoined`](./meeting-event-listener-class#onmeetingjoined) event when successfully joined.
- Remote participants will receive a [`onParticipantJoined`](./meeting-event-listener-class#onparticipantjoined) event with the newly joined [`Participant`](../participant-class/introduction) object from the event callback.

#### Participant having `ask_join` permission inside token

- If a token contains the permission `ask_join`, then the participant will not join the meeting directly after calling `join()`. Instead, an event called [`onEntryRequested`](./meeting-event-listener-class#onentryrequested) will be emitted to the participant having the permission `allow_join`.
- After the decision from the remote participant, an event called [`onEntryResponded`](./meeting-event-listener-class#onentryresponded) will be emitted to the participant. This event will contain the decision made by the remote participant.

#### Participant having `allow_join` permission inside token

- If a token contains the permission `allow_join`, then the participant will join the meeting directly after calling `join()`.

#### Returns

- _`void`_

---

### leave()

- It is used to leave the current meeting.

#### Events associated with `leave()`:

- The local participant will receive a [`onMeetingLeft`](./meeting-event-listener-class#onmeetingleft) event.
- All remote participants will receive a [`onParticipantLeft`](./meeting-event-listener-class#onparticipantleft) event with `participantId`.

#### Returns

- _`void`_

---

### end()

- It is used to end the current running session.
- By calling `end()`, all joined [participants](properties#getparticipants), including the [localParticipant](./properties.md#getlocalparticipant) of that session, will leave the meeting.

#### Events associated with `end()`:

- All [participants](./properties.md#getparticipants) and the [localParticipant](./properties.md#getlocalparticipant) will receive the [`onMeetingLeft`](./meeting-event-listener-class#onmeetingleft) event.

#### Returns

- _`void`_

---

### enableWebcam()

- It is used to enable the self camera.
- [`onStreamEnabled`](../participant-class/participant-event-listener-class#onstreamenabled) event of `ParticipantEventListener` will be emitted with [`stream`](../stream-class/introduction) object from the event callback.

#### Returns

- _`void`_

---

### disableWebcam()

- It is used to disable the self camera.
- [`onStreamDisabled`](../participant-class/participant-event-listener-class#onstreamdisabled) event of `ParticipantEventListener` will be emitted with [`stream`](../stream-class/introduction) object from the event callback.

#### Returns

- _`void`_

---

### unmuteMic()

- It is used to enable the self microphone.
- [`onStreamEnabled`](../participant-class/participant-event-listener-class#onstreamenabled) event of `ParticipantEventListener` will be emitted with [`stream`](../stream-class/introduction) object from the event callback.

#### Returns

- _`void`_

---

### muteMic()

- It is used to disable the self microphone.
- [`onStreamDisabled`](../participant-class/participant-event-listener-class#onstreamdisabled) event of `ParticipantEventListener` will be emitted with [`stream`](../stream-class/introduction) object from the event callback.

#### Returns

- _`void`_

---

### enableScreenShare()

- It is used to enable screen-sharing.
- [`onStreamEnabled`](../participant-class/participant-event-listener-class#onstreamenabled) event of `ParticipantEventListener` will be emitted with [`stream`](../stream-class/introduction) object from the event callback.
- [`onPresenterChanged()`](./meeting-event-listener-class#onpresenterchanged) event will be triggered for all participants with `participantId`.

#### Parameters

- **data**: Intent

#### Returns

- _`void`_

---

### disableScreenShare()

- It is used to disable screen-sharing.
- [`onStreamDisabled`](../participant-class/participant-event-listener-class#onstreamdisabled) event of `ParticipantEventListener` will be emitted with [`stream`](../stream-class/introduction) object from the event callback.
- [`onPresenterChanged()`](./meeting-event-listener-class#onpresenterchanged) event will be triggered for all participants with `null`.

#### Returns

- _`void`_

---

### uploadBase64File()

- It is used to upload your file to VideoSDK's temporary storage.
- `base64Data`: convert your file to base64 and pass it here.
- `token`: pass your VideoSDK token. Read more about token [here](/android/guide/video-and-audio-calling-api-sdk/authentication-and-token)
- `fileName`: provide your fileName with extension.
- `TaskCompletionListener` will handle the result of the upload operation.
  - When the upload is complete, the `onComplete()` method of `TaskCompletionListener` will provide the corresponding `fileUrl`, which can be used to retrieve the uploaded file.
  - If an error occurs during the upload process, the `onError()` method of `TaskCompletionListener` will provide the error details.
#### Parameters

- **base64Data**: String
- **token**: String
- **fileName**: String
- **listener**: TaskCompletionListener

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
private fun uploadFile() {
    val base64Data = "<your-base64-data>" // convert your file to base64 and pass it here
    val token = "<your-videosdk-token>"
    val fileName = "myImage.jpeg" // provide the fileName with extension

    meeting!!.uploadBase64File(base64Data, token, fileName, object : TaskCompletionListener<String?, String?> {
        override fun onComplete(data: String?) {
            Log.d("VideoSDK", "Uploaded file url: $data")
        }

        override fun onError(error: String?) {
            Log.d("VideoSDK", "Error in file upload: $error")
        }
    })
}
```

---

### changeMode()

- It is used to change the mode of the participant.

#### Parameters

- **mode**: `String`

#### Returns

- _`void`_

#### Events associated with `changeMode()`:

- Every participant will receive a callback on [`onParticipantModeChanged()`](./meeting-event-listener-class#onparticipantmodechanged)

#### Example

**Kotlin:**

```kotlin
meeting!!.changeMode("SIGNALLING_ONLY")
```

---

**Java:**

```java
meeting.changeMode("SIGNALLING_ONLY");
```

---

### getMics()

- It will return all connected mic devices.

#### Returns

- `Set`

#### Example

**Kotlin:**

```kotlin
val mics = meeting!!.mics
var mic: String
for (i in mics.indices) {
    mic = mics.toTypedArray()[i].toString()
    Toast.makeText(this, "Mic : $mic", Toast.LENGTH_SHORT).show()
}
```

---

**Java:**

```java
Set mics = meeting.getMics();
String mic;
for (int i = 0; i < mics.size(); i++) {
    mic = mics.toArray()[i].toString();
    Toast.makeText(this, "Mic : " + mic, Toast.LENGTH_SHORT).show();
}
```

---

### changeMic()

- It is used to change the mic device.
- If multiple mic devices are connected, by using `changeMic()` one can change the mic device.

#### Parameters

- **device**: AppRTCAudioManager.AudioDevice

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
meeting!!.changeMic(AppRTCAudioManager.AudioDevice.BLUETOOTH)
```

---

**Java:**

```java
meeting.changeMic(AppRTCAudioManager.AudioDevice.BLUETOOTH);
```

---

### changeWebcam()

- It is used to change the camera device.
- If multiple camera devices are connected, by using `changeWebcam()`, one can change the camera device with its respective device id.
- You can get a list of connected video devices using [`VideoSDK.getVideoDevices()`](../videosdk-class/methods#getvideodevices)

#### Parameters

- **deviceId**:
  - The `deviceId` represents the unique identifier of the camera device you wish to switch to. If no deviceId is provided, the facing mode will toggle: from the back camera to the front camera if the back camera is currently in use, or from the front camera to the back camera if the front camera is currently in use.
  - type : String
  - `OPTIONAL`

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
meeting!!.changeWebcam()
```

---

**Java:**

```java
meeting.changeWebcam();
```

---

### pauseAllStreams()

This method pauses active media streams within the meeting.

#### Parameters

- **kind**: Specifies the type of media stream to be paused. If this parameter is omitted, all media streams (audio, video, and screen share) will be paused.
  - **Type**: `String`
  - **Optional**: Yes
  - Possible values:
    - `"audio"`: Pauses audio streams.
    - `"video"`: Pauses video streams.
    - `"share"`: Pauses screen-sharing video streams.

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
meeting!!.pauseAllStreams()
```

---

**Java:**

```java
meeting.pauseAllStreams();
```

---

### resumeAllStreams()

This method resumes media streams that have been paused.

#### Parameters

- **kind**: Specifies the type of media stream to be resumed. If this parameter is omitted, all media streams (audio, video, and screen share) will be resumed.
  - **Type**: `String`
  - **Optional**: Yes
  - Possible values:
    - `"audio"`: Resumes audio streams.
    - `"video"`: Resumes video streams.
    - `"share"`: Resumes screen-sharing video streams.
#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
meeting!!.resumeAllStreams()
```

---

**Java:**

```java
meeting.resumeAllStreams();
```

---

### requestMediaRelay()

This method starts relaying selected media streams (like camera video, microphone audio, screen share) from the current meeting to a specified destination meeting.

#### Parameters

- **destinationMeetingId (String) – Required**: ID of the target meeting where media should be relayed.
- **token (String) – Optional**: Authentication token for the destination meeting.
  - If you pass `null`, the SDK will use the existing authentication token.
- **kinds (Array of Strings) – Optional**: Array of media types to relay.
  - Possible values:
    - `"audio"`: Relays audio streams.
    - `"video"`: Relays video streams.
    - `"share"`: Relays screen-sharing video streams.
  - If you pass `null`, all media types (audio, video, share) will be relayed by default.

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
meeting.requestMediaRelay("", null, null)
```

---

**Java:**

```java
meeting.requestMediaRelay("", null, null);
```

---

### stopMediaRelay()

This method stops the ongoing media relay to a specific destination meeting.

#### Parameters

- **destinationMeetingId (String) – Required**: ID of the destination meeting where the media relay should be stopped.

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
meeting.stopMediaRelay("")
```

---

**Java:**

```java
meeting.stopMediaRelay("");
```

---

### switchTo()

This method enables a seamless transition from the current meeting to another, without needing to disconnect and reconnect manually.

#### Parameters

- **meetingId (String) – Required**: ID of the new meeting to switch to.
- **token (String) – Optional**: Authentication token for the new meeting.
#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
meeting.switchTo("")
//or
meeting.switchTo("", token)
```

---

**Java:**

```java
meeting.switchTo("");
//or
meeting.switchTo("", token);
```

---

### setAudioDeviceChangeListener()

- When the local participant changes the mic, `AppRTCAudioManager.AudioManagerEvents()` is triggered, which can be set by using the `setAudioDeviceChangeListener()` method.

#### Parameters

- **audioManagerEvents**: AppRTCAudioManager.AudioManagerEvents

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
meeting!!.setAudioDeviceChangeListener(object : AudioManagerEvents {
    override fun onAudioDeviceChanged(
        selectedAudioDevice: AppRTCAudioManager.AudioDevice,
        availableAudioDevices: Set<AppRTCAudioManager.AudioDevice>
    ) {
        when (selectedAudioDevice) {
            AppRTCAudioManager.AudioDevice.BLUETOOTH ->
                Toast.makeText(this@MainActivity, "Selected AudioDevice: BLUETOOTH", Toast.LENGTH_SHORT).show()
            AppRTCAudioManager.AudioDevice.WIRED_HEADSET ->
                Toast.makeText(this@MainActivity, "Selected AudioDevice: WIRED_HEADSET", Toast.LENGTH_SHORT).show()
            AppRTCAudioManager.AudioDevice.SPEAKER_PHONE ->
                Toast.makeText(this@MainActivity, "Selected AudioDevice: SPEAKER_PHONE", Toast.LENGTH_SHORT).show()
            AppRTCAudioManager.AudioDevice.EARPIECE ->
                Toast.makeText(this@MainActivity, "Selected AudioDevice: EARPIECE", Toast.LENGTH_SHORT).show()
        }
    }
})
```

---

**Java:**

```java
meeting.setAudioDeviceChangeListener(new AppRTCAudioManager.AudioManagerEvents() {
    @Override
    public void onAudioDeviceChanged(AppRTCAudioManager.AudioDevice selectedAudioDevice,
                                     Set<AppRTCAudioManager.AudioDevice> availableAudioDevices) {
        switch (selectedAudioDevice) {
            case BLUETOOTH:
                Toast.makeText(MainActivity.this, "Selected AudioDevice: BLUETOOTH", Toast.LENGTH_SHORT).show();
                break;
            case WIRED_HEADSET:
                Toast.makeText(MainActivity.this, "Selected AudioDevice: WIRED_HEADSET", Toast.LENGTH_SHORT).show();
                break;
            case SPEAKER_PHONE:
                Toast.makeText(MainActivity.this, "Selected AudioDevice: SPEAKER_PHONE", Toast.LENGTH_SHORT).show();
                break;
            case EARPIECE:
                Toast.makeText(MainActivity.this, "Selected AudioDevice: EARPIECE", Toast.LENGTH_SHORT).show();
                break;
        }
    }
});
```

---

### addEventListener()

#### Parameters

- **listener**: MeetingEventListener

#### Returns

- _`void`_

---

### removeEventListener()

#### Parameters

- **listener**: MeetingEventListener

#### Returns

- _`void`_

---

### removeAllListeners()

#### Returns

- _`void`_
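The listener-management methods above (`addEventListener()`, `removeEventListener()`, `removeAllListeners()`) follow a standard observer pattern. Here is a minimal, SDK-free sketch of that pattern; `FakeMeeting` and the two-event listener interface are hypothetical stand-ins, since the real `Meeting` class and `MeetingEventListener` expose many more events:

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerDemo {

    // Cut-down stand-in for MeetingEventListener, with just two of its events.
    interface MeetingEventListener {
        void onMeetingJoined();
        void onMeetingLeft();
    }

    // Hypothetical stand-in for the Meeting class's listener management.
    static class FakeMeeting {
        private final List<MeetingEventListener> listeners = new ArrayList<>();

        void addEventListener(MeetingEventListener l) { listeners.add(l); }
        void removeEventListener(MeetingEventListener l) { listeners.remove(l); }
        void removeAllListeners() { listeners.clear(); }
        int listenerCount() { return listeners.size(); }

        // join() notifies listeners via onMeetingJoined, as documented for the real SDK.
        void join() { for (MeetingEventListener l : List.copyOf(listeners)) l.onMeetingJoined(); }
        void leave() { for (MeetingEventListener l : List.copyOf(listeners)) l.onMeetingLeft(); }
    }

    public static void main(String[] args) {
        FakeMeeting meeting = new FakeMeeting();
        List<String> events = new ArrayList<>();
        meeting.addEventListener(new MeetingEventListener() {
            public void onMeetingJoined() { events.add("joined"); }
            public void onMeetingLeft() { events.add("left"); }
        });
        meeting.join();
        meeting.leave();
        meeting.removeAllListeners();
        meeting.join(); // no listeners remain, so nothing is recorded
        System.out.println(events); // [joined, left]
    }
}
```

The real SDK works the same way: a listener registered with `addEventListener()` receives every subsequent event until it is removed, and `removeAllListeners()` detaches everything at once.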
--- # Meeting Class Properties - Android
## getmeetingId()

- type: `String`
- `getmeetingId()` will return the `meetingId`, which is the unique id of the meeting the participant has joined.

---

## getLocalParticipant()

- type: [Participant](../participant-class/introduction)
- It will be the instance of the [Participant](../participant-class/introduction) class for the local participant (you) who joined the meeting.

---

## getMeetingState()

- type: `MeetingState`
- `getMeetingState()` will return the `MeetingState`, which is the current connection state of the meeting.

#### Example

**Kotlin:**

```kotlin
meeting!!.getMeetingState()
```

---

**Java:**

```java
meeting.getMeetingState();
```

---

## getParticipants()

- type: [`Map`](https://developer.android.com/reference/java/util/Map) of [Participant](../participant-class/introduction)
  - `Map` - Map{'<'}`participantId`, [Participant](../participant-class/introduction)>
- It will contain all joined participants in the meeting except the `localParticipant`.
- This [`Map`](https://developer.android.com/reference/java/util/Map) contains all participants, keyed by the id of each participant.

**Kotlin:**

```kotlin
val remoteParticipantId = "ajf897"
val participant = meeting!!.participants[remoteParticipantId]
```

---

**Java:**

```java
String remoteParticipantId = "ajf897";
Participant participant = meeting.getParticipants().get(remoteParticipantId);
```

---

## pubSub

- type: [`PubSub`](../pubsub-class/introduction)
- It is used to enable the Publisher-Subscriber feature in the [`meeting`](introduction) class. Learn more about `PubSub` [here](../pubsub-class/introduction)

---

## `LeaveReason` Enum

The `LeaveReason` enum provides detailed context for why a meeting or participant departure event was triggered. The following table lists all possible values:

| Code | Reason Constant | Description |
| :--- | :--- | :--- |
| 1001 | `WEBSOCKET_DISCONNECTED` | Socket disconnected. |
| 1002 | `REMOVE_PEER` | Participant was removed from the meeting. |
| 1003 | `REMOVE_PEER_VIEWER_MODE_CHANGED` | Participant removed because the viewer mode was changed. |
| 1004 | `REMOVE_PEER_MEDIA_RELAY_STOP` | Participant removed because the media relay was stopped. |
| 1005 | `SWITCH_ROOM` | Participant switched to a different room. |
| 1006 | `ROOM_CLOSE` | The meeting has been closed. |
| 1007 | `UNKNOWN` | Participant disconnected due to an unknown reason. |
| 1008 | `REMOVE_ALL` | All participants were removed from the meeting. |
| 1009 | `MEETING_END_API` | Meeting ended. |
| 1010 | `REMOVE_PEER_API` | Participant removed from the meeting. |
| 1011 | `DUPLICATE_PARTICIPANT` | Leaving the meeting, since this participantId joined from another device. |
| 1101 | `MANUAL_LEAVE_CALLED` | Participant manually called the leave() method to exit the meeting. |
| 1102 | `WEBSOCKET_CONNECTION_ATTEMPTS_EXHAUSTED` | Meeting left after multiple failed websocket connection attempts. |
| 1103 | `JOIN_ROOM_FAILED` | Meeting left due to an error while joining the room. |
| 1104 | `SWITCH_ROOM_FAILED` | Meeting left due to an error while switching rooms. |
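When handling a raw leave code (for example, from logs or analytics), the table above can be restated as a small lookup. The enum constants and codes below come directly from the table; the `LeaveReasonDemo` class and its `fromCode()` helper are hypothetical, not part of the SDK:

```java
public class LeaveReasonDemo {

    // Restates the documented LeaveReason codes from the table above.
    enum LeaveReason {
        WEBSOCKET_DISCONNECTED(1001), REMOVE_PEER(1002),
        REMOVE_PEER_VIEWER_MODE_CHANGED(1003), REMOVE_PEER_MEDIA_RELAY_STOP(1004),
        SWITCH_ROOM(1005), ROOM_CLOSE(1006), UNKNOWN(1007), REMOVE_ALL(1008),
        MEETING_END_API(1009), REMOVE_PEER_API(1010), DUPLICATE_PARTICIPANT(1011),
        MANUAL_LEAVE_CALLED(1101), WEBSOCKET_CONNECTION_ATTEMPTS_EXHAUSTED(1102),
        JOIN_ROOM_FAILED(1103), SWITCH_ROOM_FAILED(1104);

        final int code;
        LeaveReason(int code) { this.code = code; }

        // Hypothetical helper: resolve a numeric code to its constant.
        static LeaveReason fromCode(int code) {
            for (LeaveReason r : values()) {
                if (r.code == code) return r;
            }
            return UNKNOWN; // fall back to the documented catch-all
        }
    }

    public static void main(String[] args) {
        System.out.println(LeaveReason.fromCode(1006)); // ROOM_CLOSE
        System.out.println(LeaveReason.fromCode(9999)); // UNKNOWN
    }
}
```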
---

# Meeting Class

## Using the Meeting Class

The `Meeting Class` includes methods and events for managing meetings, participants, video & audio streams, data channels and UI customization.

## Constructor

### Meeting(String meetingId, Participant localParticipant)

## Properties

### getmeetingId()

- `getmeetingId()` will return the `meetingId`, which represents the meetingId of the current meeting
- return type : `String`

### getLocalParticipant()

- `getLocalParticipant()` will return the local participant
- return type : `Participant`

### getParticipants()

- `getParticipants()` will return all remote participants
- return type : `Map<String, Participant>`

### pubSub()

- `pubSub()` will return an object of the `PubSub` class
- return type : `PubSub`

### Events

### Methods

---

# MeetingEventListener Class

## Using the MeetingEventListener Class

The `MeetingEventListener Class` is responsible for listening to all the events that are related to the `Meeting Class`.

### Listeners

---

# Video SDK Participant Class - Android
Participant class includes all the properties, methods and events related to all the participants joined in a particular meeting.

## Get local and remote participants

You can get the local participant's streams and metadata from `meeting.getLocalParticipant()`, and a `Map` of joined participants is always available via `meeting.getParticipants()`.

**Kotlin:**

```kotlin
val localParticipant = meeting!!.getLocalParticipant()
val participants = meeting!!.getParticipants()
```

---

**Java:**

```java
Participant localParticipant = meeting.getLocalParticipant();
Map<String, Participant> participants = meeting.getParticipants();
```

## Participant Properties
- [getId()](./properties#getid)
- [getDisplayName()](./properties#getdisplayname)
- [getQuality()](./properties#getquality)
- [isLocal()](./properties#islocal)
- [getStreams()](./properties#getstreams)
- [getMode()](./properties#getmode)
- [getMetaData()](./properties#getmetadata)
## Participant Methods
- [enableWebcam()](./methods#enablewebcam)
- [disableWebcam()](./methods#disablewebcam)
- [enableMic()](./methods#enablemic)
- [disableMic()](./methods#disablemic)
- [remove()](./methods#remove)
- [setQuality()](./methods#setquality)
- [setViewPort()](./methods#setviewport)
- [captureImage()](./methods#captureimage)
## Participant Events
- [onStreamEnabled](./participant-event-listener-class#onstreamenabled)
- [onStreamDisabled](./participant-event-listener-class#onstreamdisabled)
--- # Participant Class Methods - Android
### enableWebcam()

- `enableWebcam()` is used to enable the participant's camera.

#### Events associated with `enableWebcam()` :

- First the participant will get a callback on [onWebcamRequested()](../meeting-class/meeting-event-listener-class#onwebcamrequested) and once the participant accepts the request, the webcam will be enabled.
- Every Participant will receive a `streamEnabled` event of the `ParticipantEventListener` Class with the `stream` object.

#### Returns

- `void`

---

### disableWebcam()

- `disableWebcam()` is used to disable the participant's camera.

#### Events associated with `disableWebcam()` :

- Every Participant will receive a `streamDisabled` event of the `ParticipantEventListener` Class with the `stream` object.

#### Returns

- `void`

---

### enableMic()

- `enableMic()` is used to enable the participant's microphone.

#### Events associated with `enableMic()` :

- First the participant will get a callback on [onMicRequested()](../meeting-class/meeting-event-listener-class#onmicrequested) and once the participant accepts the request, the mic will be enabled.
- Every Participant will receive a `streamEnabled` event of the `ParticipantEventListener` Class with the `stream` object.

#### Returns

- `void`

---

### disableMic()

- `disableMic()` is used to disable the participant's microphone.

#### Events associated with `disableMic()`:

- Every Participant will receive a `streamDisabled` event of the `ParticipantEventListener` Class with the `stream` object.

#### Returns

- `void`

---

### pin()

- It is used to set the pin state of the participant. You can use it to pin the screen share, camera or both of the participant. It accepts a parameter of type `String`. The default is `SHARE_AND_CAM`.

#### Parameters

- **pinType**: `SHARE_AND_CAM` | `CAM` | `SHARE`

---

### unpin()

- It is used to unpin the participant. You can use it to unpin the screen share, camera or both of the participant. It accepts a parameter of type `String`.
Default is `SHARE_AND_CAM` #### Parameters - **pinType**: `SHARE_AND_CAM` | `CAM` | `SHARE` --- ### setQuality() - `setQuality()` is used to set the quality of the participant's video stream. #### Parameters - `quality`: low | med | high #### Returns - `void` --- ### setViewPort() - `setViewPort()` is used to set the quality of the participant's video stream based on the viewport height and width. #### Parameters - **width**: int - **height**: int #### Returns - `void` :::info MultiStream is not supported by the Android SDK. Use `customTrack` rather than `setQuality()` and `setViewPort()` if you want to change participant's quality who joined using our Android SDK. To know more about customTrack visit [here](/android/guide/video-and-audio-calling-api-sdk/features/custom-track/custom-video-track) ::: --- ### remove() - `remove()` is used to remove the participant from the meeting. #### Events associated with `remove()` : - Local participant will receive a [`onMeetingLeft`](../meeting-class/meeting-event-listener-class.md#onmeetingleft) event. - All remote participants will receive a [`onParticipantLeft`](../meeting-class/meeting-event-listener-class.md#onparticipantleft) event with `participantId`. #### Returns - `void` --- ### captureImage() - It is used to capture image of local participant's current videoStream. - You need to pass an implementation of `TaskCompletionListener` as a parameter. This listener will handle the result of the image capture task. - When the image capture task is complete, the `onComplete()` method will provide the image in the form of a `base64` string. If an error occurs, the `onError()` method will provide the error details. 
#### Parameters

- **height**: int
- **width**: int
- **listener**: TaskCompletionListener

#### Returns

- _`void`_

---

### getVideoStats()

- `getVideoStats()` will return a JSONArray containing details regarding the participant's critical video metrics, such as **Jitter**, **Packet Loss**, **Quality Score**, etc.

#### Returns

- `JSONArray`
  - `jitter` : It represents the distortion in the stream.
  - `bitrate` : It represents the bitrate of the stream which is being transmitted.
  - `totalPackets` : It represents the total packet count which was transmitted for that particular stream.
  - `packetsLost` : It represents the total packets lost during the transmission of the stream.
  - `rtt` : It represents the round-trip time between the client and the server in milliseconds (ms).
  - `codec` : It represents the codec used for the stream.
  - `network` : It represents the network used to transmit the stream.
  - `size` : It is an object containing the height, width and frame rate of the stream.

**Kotlin:**

```kotlin
val videoStats = participant.getVideoStats() // will return all possible stream layers in a JSONArray
val videoStat = videoStats.getJSONObject(0) // will return the first stream layer in a JSONObject
```

---

**Java:**

```java
JSONArray videoStats = participant.getVideoStats(); // will return all possible stream layers in a JSONArray
JSONObject videoStat = videoStats.getJSONObject(0); // will return the first stream layer in a JSONObject
```

:::note
getVideoStats() will return the metrics for the participant at that given point of time and not the average data of the complete meeting. To view the metrics for the complete meeting, use the stats API documented [here](/api-reference/realtime-communication/fetch-session-quality-stats).
:::

:::info
If you are getting `rtt` greater than 300ms, try using a different region which is nearest to your user. To know more about changing region [visit here](/api-reference/realtime-communication/create-room).
If you are getting high packet loss, try using the `customTrack` for a better experience. To know more about customTrack [visit here](/android/guide/video-and-audio-calling-api-sdk/features/custom-track/custom-video-track)
:::

---

### getAudioStats()

- `getAudioStats()` will return a JSONObject containing details regarding the participant's critical audio metrics, such as **Jitter**, **Packet Loss**, **Quality Score**, etc.

#### Returns

- `JSONObject`
  - `jitter` : It represents the distortion in the stream.
  - `bitrate` : It represents the bitrate of the stream which is being transmitted.
  - `totalPackets` : It represents the total packet count which was transmitted for that particular stream.
  - `packetsLost` : It represents the total packets lost during the transmission of the stream.
  - `rtt` : It represents the round-trip time between the client and the server in milliseconds (ms).
  - `codec` : It represents the codec used for the stream.
  - `network` : It represents the network used to transmit the stream.

**Kotlin:**

```kotlin
val audioStat = participant.getAudioStats()
```

---

**Java:**

```java
JSONObject audioStat = participant.getAudioStats();
```

:::note
getAudioStats() will return the metrics for the participant at that given point of time and not the average data of the complete meeting. To view the metrics for the complete meeting, use the stats API documented [here](/api-reference/realtime-communication/fetch-session-quality-stats).
:::

:::info
If you are getting `rtt` greater than 300ms, try using a different region which is nearest to your user. To know more about changing region [visit here](/api-reference/realtime-communication/create-room).
:::

### getShareStats()

- `getShareStats()` will return a JSONObject containing details regarding the participant's critical screen-share metrics, such as **Jitter**, **Packet Loss**, **Quality Score**, etc.

#### Returns

- `JSONObject`
  - `jitter` : It represents the distortion in the stream.
  - `bitrate`: It represents the bitrate of the stream which is being transmitted.
  - `totalPackets`: It represents the total packet count transmitted for that particular stream.
  - `packetsLost`: It represents the total packets lost during the transmission of the stream.
  - `rtt`: It represents the round-trip time taken by the stream between the server and the client, in milliseconds (ms).
  - `codec`: It represents the codec used for the stream.
  - `network`: It represents the network used to transmit the stream.
  - `size`: It is an object containing the height, width and frame rate of the stream.

**Kotlin:**

```kotlin
val shareStat = participant.getShareStats()
```

---

**Java:**

```java
JSONObject shareStat = participant.getShareStats();
```

:::note
getShareStats() returns the metrics for the participant at that given point of time, not the average data of the complete meeting. To view the metrics for the complete meeting, use the stats API documented [here](/api-reference/realtime-communication/fetch-session-quality-stats).
:::

:::info
If you are getting `rtt` greater than 300ms, try using a different region which is nearest to your user. To know more about changing region [visit here](/api-reference/realtime-communication/create-room).
:::

### addEventListener()

#### Parameters

- **listener**: ParticipantEventListener

#### Returns

- _`void`_

---

### removeEventListener()

#### Parameters

- **listener**: ParticipantEventListener

#### Returns

- _`void`_

---

### removeAllListeners()

#### Returns

- _`void`_
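The info notes above give concrete guidance: switch regions when `rtt` exceeds 300ms, and consider a custom track when packet loss is high. As a plain-Java illustration of how an app might act on one stats entry returned by these methods, here is a small helper. The class name `StatsAdvisor` and the 5% loss threshold are illustrative assumptions, not part of the SDK; the field names mirror the keys listed under "Returns".

```java
import java.util.Map;

class StatsAdvisor {
    // Illustrative helper, NOT part of the VideoSDK: applies the rtt and
    // packet-loss guidance from the notes above to one stats entry.
    static String advise(Map<String, ? extends Number> stat) {
        Number rttN = stat.get("rtt");
        Number lostN = stat.get("packetsLost");
        Number totalN = stat.get("totalPackets");

        double rtt = rttN == null ? 0 : rttN.doubleValue();
        long lost = lostN == null ? 0 : lostN.longValue();
        long total = totalN == null ? 0 : totalN.longValue();
        double lossRatio = total > 0 ? (double) lost / total : 0.0;

        if (rtt > 300) {
            return "High rtt: try a region nearer to the user";
        }
        if (lossRatio > 0.05) { // 5% is an assumed threshold for "high" loss
            return "High packet loss: consider a customTrack";
        }
        return "OK";
    }
}
```

In a real app you would build the map from the `JSONObject` returned by `getVideoStats()`/`getAudioStats()`/`getShareStats()` and surface the advice in your UI or logs.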
--- # ParticipantEventListener Class - Android
### Implementation

- You can implement all the methods of the `ParticipantEventListener` abstract class and add the listener to the `Participant` class using the `addEventListener()` method of the `Participant` class.

---

### onStreamEnabled()

- `onStreamEnabled()` is a callback which gets triggered whenever a participant's video, audio or screen-share stream is enabled.

#### Event callback parameters

- **stream**: [Stream](../stream-class/introduction.md)

---

### onStreamDisabled()

- `onStreamDisabled()` is a callback which gets triggered whenever a participant's video, audio or screen-share stream is disabled.

#### Event callback parameters

- **stream**: [Stream](../stream-class/introduction.md)

---

### onStreamPaused()

- This event will be emitted when any participant pauses consuming or producing a stream of any type.

---

### onStreamResumed()

- This event will be emitted when any participant resumes consuming or producing a stream of any type.

---

### onE2eeStateChanged()

- This event will be emitted when a participant's E2EE state changes.

---

### Example

**Kotlin:**

```kotlin
meeting!!.localParticipant.addEventListener(object : ParticipantEventListener() {
    override fun onStreamEnabled(stream: Stream) {
        //
    }

    override fun onStreamDisabled(stream: Stream) {
        //
    }

    override fun onStreamPaused(kind: String, reason: String) {
        //
    }

    override fun onStreamResumed(kind: String, reason: String) {
        //
    }

    override fun onE2eeStateChanged(state: E2EEState, stream: Stream) {
        //
    }
})
```

---

**Java:**

```java
participant.addEventListener(new ParticipantEventListener() {
    @Override
    public void onStreamEnabled(Stream stream) {
        //
    }

    @Override
    public void onStreamDisabled(Stream stream) {
        //
    }

    @Override
    public void onStreamPaused(String kind, String reason) {
        //
    }

    @Override
    public void onStreamResumed(String kind, String reason) {
        //
    }

    @Override
    public void onE2eeStateChanged(E2EEState state, Stream stream) {
        //
    }
});
```
--- # Participant Class Properties - Android
### getId()

- type: `String`
- `getId()` will return the unique id of the participant who has joined the meeting.

---

### getDisplayName()

- type: `String`
- It will return the `displayName` of the participant who has joined the meeting.

---

### getQuality()

- type: `String`
- `getQuality()` will return the quality of the participant's stream. The stream could be `audio`, `video` or `share`.

---

### isLocal()

- type: `boolean`
- `isLocal()` will return `true` if the participant is local, `false` otherwise.

---

### getStreams()

- type: `Map`
- It represents the streams for that particular participant who has joined the meeting. Streams could be `audio`, `video` or `share`.

---

### getMode()

- type: `String`
- `getMode()` will return the mode of the participant.

---

### getMetaData()

- type: `JSONObject`
- `getMetaData()` will return the additional information that you have passed in `initMeeting()`.
---

# Participant Class

## Introduction

The `Participant Class` includes methods and events for participants and their associated video & audio streams, data channels and UI customization.

## Properties

### getId()

- `getId()` will return the participant's id
- return type: `String`

### getDisplayName()

- `getDisplayName()` will return the name of the participant
- return type: `String`

### getQuality()

- `getQuality()` will return the quality of the participant's video stream
- return type: `String`

### isLocal()

- `isLocal()` will return `true` if the participant is local, `false` otherwise
- return type: `boolean`

### getStreams()

- `getStreams()` will return the streams of the participant
- return type: `Map`
- The Map contains `streamId` as key and `stream` as value

## Events

### addEventListener(ParticipantEventListener listener)

- By using `addEventListener(ParticipantEventListener listener)`, we can add a listener to the list of `ParticipantEventListener`
- return type: `void`

### removeEventListener(ParticipantEventListener listener)

- By using `removeEventListener(ParticipantEventListener listener)`, we can remove a listener from the list of `ParticipantEventListener`
- return type: `void`

### removeAllListeners()

- By using `removeAllListeners()`, we can remove all listeners from the list
- return type: `void`

## Methods

### enableMic()

- By using the `enableMic()` function, a participant can enable the mic of any particular remote participant
- When `enableMic()` is called,
  - the local participant will receive a callback on `streamEnabled()` of the `ParticipantEventListener` class
  - the remote participant will receive a callback for `onMicRequested()` and once the remote participant accepts the request, the mic will be enabled for that participant
- return type: `void`

### disableMic()

- By using the `disableMic()` function, a participant can disable the mic of any particular remote participant
- When `disableMic()` is called,
  - the local participant will receive a callback on `streamDisabled()` of the `ParticipantEventListener` class
  - the remote participant will receive a callback on `streamDisabled()` of the `ParticipantEventListener` class
- return type: `void`

### enableWebcam()

- By using the `enableWebcam()` function, a participant can enable the webcam of any particular remote participant
- When `enableWebcam()` is called,
  - the local participant will receive a callback on `streamEnabled()` of the `ParticipantEventListener` class
  - the remote participant will receive a callback for `webcamRequested()` and once the remote participant accepts the request, the webcam will be enabled for that participant
- return type: `void`

### disableWebcam()

- By using the `disableWebcam()` function, a participant can disable the webcam of any particular remote participant
- When `disableWebcam()` is called,
  - the local participant will receive a callback on `streamDisabled()` of the `ParticipantEventListener` class
  - the remote participant will receive a callback on `streamDisabled()` of the `ParticipantEventListener` class
- return type: `void`

### remove()

- By using the `remove()` function, a participant can remove any particular remote participant
- When `remove()` is called,
  - the local participant will receive a callback on `meetingLeft()`
  - the remote participant will receive a callback on `participantLeft()`
- return type: `void`

### setQuality()

- By using `setQuality()`, you can set the quality of the participant's video stream
- return type: `void`

---

# ParticipantEventListener Class

## Using the ParticipantEventListener Class

The `ParticipantEventListener Class` is responsible for listening to all the events that are related to the `Participant Class`.

### Listeners

---

# Video SDK PubSub Class - Android
## Introduction

The PubSub class provides the methods to implement the Publisher-Subscriber feature in your application.

## PubSub Methods
- [publish()](methods#publish)
- [subscribe()](methods#subscribe)
- [unsubscribe()](methods#unsubscribe)
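Before the method-by-method reference, it may help to see the topic model these methods expose in miniature. The sketch below is purely illustrative plain Java, not the SDK's implementation: subscribers register per topic, `publish()` fans a message out to that topic's listeners, and `unsubscribe()` removes a single listener.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class MiniPubSub {
    // topic -> listeners; a toy model of the SDK's PubSub methods below
    private final Map<String, List<Consumer<String>>> topics = new HashMap<>();

    void subscribe(String topic, Consumer<String> listener) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(listener);
    }

    void unsubscribe(String topic, Consumer<String> listener) {
        List<Consumer<String>> listeners = topics.get(topic);
        if (listeners != null) listeners.remove(listener);
    }

    void publish(String topic, String message) {
        // deliver only to listeners of this topic; other topics are untouched
        for (Consumer<String> l : topics.getOrDefault(topic, List.of())) {
            l.accept(message);
        }
    }
}
```

The real SDK adds persistence (`persist`), targeted delivery (`sendOnly`) and server-side fan-out across participants, but the topic-to-listeners mapping is the same mental model.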
--- # PubSub Class Methods - Android
### publish()

- `publish()` is used to publish messages on a specified topic in the meeting.

#### Parameters

- topic
  - type: `String`
  - This is the name of the topic for which the message will be published.
- message
  - type: `String`
  - This is the actual message.
- options
  - type: [`PubSubPublishOptions`](pubsub-publish-options-class)
  - This specifies the options for publish.
- payload
  - type: `JSONObject`
  - `OPTIONAL`
  - If you need to include additional information along with a message, you can pass it here as a `JSONObject`.

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
// Publish message for 'CHAT' topic
val publishOptions = PubSubPublishOptions()
publishOptions.isPersist = true
meeting!!.pubSub.publish("CHAT", "Hello from Android", publishOptions)
```

---

**Java:**

```java
// Publish message for 'CHAT' topic
PubSubPublishOptions publishOptions = new PubSubPublishOptions();
publishOptions.setPersist(true);
meeting.pubSub.publish("CHAT", "Hello from Android", publishOptions);
```

---

### subscribe()

- `subscribe()` is used to subscribe to a particular topic to get all the messages of that particular topic in the meeting.

#### Parameters

- topic:
  - type: `String`
  - Participants can listen to messages on that particular topic.
- listener:
  - type: [`PubSubMessageListener`](pubsub-message-listener-class)

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
val pubSubMessageListener = object : PubSubMessageListener {
    override fun onMessageReceived(message: PubSubMessage) {
        Log.d("#message", "onMessageReceived: ${message.message}")
    }

    override fun onOldMessagesReceived(messages: List<PubSubMessage>) {
        Log.d("#message", "onOldMessagesReceived: $messages")
    }
}

// Subscribe for 'CHAT' topic
meeting!!.pubSub.subscribe("CHAT", pubSubMessageListener)
```

---

**Java:**

```java
PubSubMessageListener pubSubMessageListener = new PubSubMessageListener() {
    @Override
    public void onMessageReceived(PubSubMessage message) {
        Log.d("#message", "onMessageReceived: " + message.getMessage());
    }

    @Override
    public void onOldMessagesReceived(List<PubSubMessage> messages) {
        Log.d("#message", "onOldMessagesReceived: " + messages);
    }
};

// Subscribe for 'CHAT' topic
meeting.pubSub.subscribe("CHAT", pubSubMessageListener);
```

---

### unsubscribe()

- `unsubscribe()` is used to unsubscribe from a particular topic which you have subscribed to previously.

#### Parameters

- topic:
  - type: `String`
  - This is the name of the topic to be unsubscribed.
- listener:
  - type: [`PubSubMessageListener`](pubsub-message-listener-class)

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
// Unsubscribe for 'CHAT' topic
meeting!!.pubSub.unsubscribe("CHAT", pubSubMessageListener)
```

---

**Java:**

```java
// Unsubscribe for 'CHAT' topic
meeting.pubSub.unsubscribe("CHAT", pubSubMessageListener);
```
---

# PubSubMessage Class Properties - Android
### getId()

- type: `String`
- `getId()` will return the unique id of the pubsub message.

---

### getMessage()

- type: `String`
- `getMessage()` will return the message that has been published on the specific topic.

---

### getTopic()

- type: `String`
- `getTopic()` will return the name of the message topic.

---

### getSenderId()

- type: `String`
- `getSenderId()` will return the id of the participant who has sent the message.

---

### getSenderName()

- type: `String`
- `getSenderName()` will return the name of the participant who has sent the pubsub message.

---

### getTimestamp()

- type: `long`
- `getTimestamp()` will return the timestamp at which the pubsub message was sent.

---

### getPayload()

- type: `JSONObject`
- `getPayload()` will return the data that you have sent with the message.
--- # PubSubMessageListener Class - Android
---

#### onMessageReceived()

- This event will be emitted whenever any pubsub message is received.

#### onOldMessagesReceived()

- This event will be emitted with the history of all the old messages on the subscribed topic.

#### Example

**Kotlin:**

```kotlin
val pubSubMessageListener = object : PubSubMessageListener {
    override fun onMessageReceived(message: PubSubMessage) {
        Log.d("#message", "onMessageReceived: ${message.message}")
    }

    override fun onOldMessagesReceived(messages: List<PubSubMessage>) {
        Log.d("#message", "onOldMessagesReceived: $messages")
    }
}
```

---

**Java:**

```java
PubSubMessageListener pubSubMessageListener = new PubSubMessageListener() {
    @Override
    public void onMessageReceived(PubSubMessage message) {
        Log.d("#message", "onMessageReceived: " + message.getMessage());
    }

    @Override
    public void onOldMessagesReceived(List<PubSubMessage> messages) {
        Log.d("#message", "onOldMessagesReceived: " + messages);
    }
};
```
--- # PubSubPublishOptions Class - Android
## Properties

### persist

- type: `boolean`
- defaultValue: `false`
- This property specifies whether to store messages on the server for upcoming participants.
- If the value of this property is `true`, the server will store the pubsub messages for upcoming participants.

---

### sendOnly

- type: `String[]`
- defaultValue: `null`
- If you want to send a message to specific participants, you can pass their respective `participantId` here. If you don't provide any IDs or pass a `null` value, the message will be sent to all participants by default.

:::note
Make sure that the participants whose IDs are present in the array are subscribed to that specific topic.
:::
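The delivery rule described for `sendOnly` can be pinned down with a small plain-Java sketch. This is an illustration of the documented behaviour, not SDK code; it assumes that an empty array behaves the same as `null` (no IDs provided).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SendOnlyFilter {
    // Illustrative delivery rule, NOT SDK code: a null or empty sendOnly
    // array means "deliver to every subscriber"; otherwise deliver only
    // to subscribers whose participantId is listed.
    static List<String> recipients(List<String> subscriberIds, String[] sendOnly) {
        if (sendOnly == null || sendOnly.length == 0) {
            return new ArrayList<>(subscriberIds);
        }
        Set<String> allowed = new HashSet<>(Arrays.asList(sendOnly));
        List<String> out = new ArrayList<>();
        for (String id : subscriberIds) {
            if (allowed.contains(id)) out.add(id);
        }
        return out;
    }
}
```

Note that, as the admonition above says, an ID in `sendOnly` only matters if that participant is actually subscribed to the topic; the filter runs over subscribers, not over all participants.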
---

# PubSub Class

## Using the PubSub Class

The `PubSub` class includes the methods for publish-subscribe messaging.

### Methods

---

# PubSubMessage Class

## Using the PubSubMessage Class

The `PubSubMessage` class includes the properties of a PubSub message.

### Properties

---

# PubSubPublishOptions Class

## Using the PubSubPublishOptions Class

The `PubSubPublishOptions` class includes the properties of PubSub publish options.

### Properties

### Methods

---

# RealtimeStore Methods - Android
### `set()` - `set()` is used to store or update data in the RealtimeStore. If a key already exists, it will be overwritten with the new value. Passing `null` as the value deletes the key. #### Parameters - key - type: `String` - The unique key to store the data under. - value - type: `String?` - The string value to be stored. Pass `null` to delete the key. - callback - type: `RealtimeStoreCallback` - A callback that reports the success or failure of the operation. #### Returns - _`void`_ #### Example **Kotlin:** ```kotlin meeting.realtimeStore.set("YOUR_KEY", message, object : RealtimeStoreCallback { override fun onSuccess(value: String) { Log.d("VideoSDK", "Value: $value") } override fun onError(error: String) { Log.e("VideoSDK", "Error: $error") } }) ``` --- **Java:** ```java meeting.realtimeStore.set("YOUR_KEY", message, new RealtimeStoreCallback() { @Override public void onSuccess(String value) { Log.d("VideoSDK", "Value: " + value); } @Override public void onError(String error) { Log.e("VideoSDK", "Error: " + error); } }); ``` --- ### `get()` - `get()` retrieves the current value associated with a given key. #### Parameters - key: - type: `String` - The key whose value you want to retrieve. - callback: - type: `RealtimeStoreCallback` - A callback that provides the value on success or an error on failure. #### Returns - _`void`_ #### Example **Kotlin:** ```kotlin meeting.realtimeStore.get("YOUR_KEY", object : RealtimeStoreCallback { override fun onSuccess(value: String) { Log.d("VideoSDK", "Value: $value") } override fun onError(error: String) { Log.e("VideoSDK", "Error: $error") } }) ``` --- **Java:** ```java meeting.realtimeStore.get("YOUR_KEY", new RealtimeStoreCallback() { @Override public void onSuccess(String value) { Log.d("VideoSDK", "value: " + value); } @Override public void onError(String error) { Log.e("VideoSDK", "Error: " + error); } }); ``` --- ### `observe()` - `observe()` subscribes to real-time updates for a given key. 
When the key’s value changes, the `onValueChanged` method on your `RealtimeStoreListener` is triggered.

#### Parameters

- key:
  - type: `String`
  - The key to observe.
- listener:
  - type: `RealtimeStoreListener`
  - An object that will receive notifications about value changes.

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
val realtimeStoreListener = object : RealtimeStoreListener {
    override fun onValueChanged(newValue: String, updatedBy: Participant?) {
        Log.d("VideoSDK", "onValueChange: Received update from: ${updatedBy?.displayName}")
        Log.d("VideoSDK", "onValueChange: New Value -> $newValue")
    }
}

meeting.realtimeStore.observe(KEY_CHAT_HISTORY, realtimeStoreListener)
```

---

**Java:**

```java
private RealtimeStoreListener realtimeStoreListener;

realtimeStoreListener = new RealtimeStoreListener() {
    @Override
    public void onValueChanged(String newValue, Participant updatedBy) {
        Log.d("VideoSDK", "onValueChange: Received update from: " + updatedBy.getDisplayName());
        Log.d("VideoSDK", "onValueChange: New Value -> " + newValue);
    }
};

meeting.realtimeStore.observe(KEY_CHAT_HISTORY, realtimeStoreListener);
```

---

### `stopObserving()`

- `stopObserving()` stops receiving updates for a specific key and listener combination.

#### Parameters

- key:
  - type: `String`
  - The key you want to stop observing.
- listener:
  - type: `RealtimeStoreListener`
  - The listener that should be removed.

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
meeting.realtimeStore.stopObserving("YOUR_KEY", realtimeStoreListener)
```

---

**Java:**

```java
meeting.realtimeStore.stopObserving("YOUR_KEY", realtimeStoreListener);
```
---

# RealtimeStore

The `RealtimeStore` class allows you to store, update, retrieve, and observe custom key-value data within a meeting in real time. It acts as a shared data layer across all connected participants, ensuring synchronized state throughout the session.

## RealtimeStore Methods
- [set()](methods#set)
- [get()](methods#get)
- [observe()](methods#observe)
- [stopObserving()](methods#stopobserving)
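The semantics described above (set overwrites an existing key, a `null` value deletes it, and observers are notified on every change) can be modelled in a few lines of plain Java. This sketch, including the class name `MiniStore`, is illustrative only; it is not the SDK's implementation, which synchronizes the store across participants via the server.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class MiniStore {
    // Toy model of the RealtimeStore semantics, NOT SDK code.
    private final Map<String, String> data = new HashMap<>();
    private final Map<String, List<Consumer<String>>> observers = new HashMap<>();

    void set(String key, String value) {
        // overwrite on existing key; null deletes the key
        if (value == null) data.remove(key); else data.put(key, value);
        // notify every observer of this key with the new value
        for (Consumer<String> o : observers.getOrDefault(key, List.of())) {
            o.accept(value); // null signals deletion, mirroring set(key, null)
        }
    }

    String get(String key) {
        return data.get(key);
    }

    void observe(String key, Consumer<String> listener) {
        observers.computeIfAbsent(key, k -> new ArrayList<>()).add(listener);
    }

    void stopObserving(String key, Consumer<String> listener) {
        List<Consumer<String>> l = observers.get(key);
        if (l != null) l.remove(listener);
    }
}
```

In the real SDK the observer additionally receives the `Participant` who made the update, and results are delivered asynchronously through `RealtimeStoreCallback`.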
--- # `RealtimeStoreCallback` Class - Android A callback interface used for asynchronous operations in `RealtimeStore`.
---

### `onSuccess()`

- This event will be emitted when the operation completes successfully.

#### Parameters

- value
  - type: `String`
  - The result of the operation.

---

### `onError()`

- This event will be emitted when the operation fails.

#### Parameters

- error
  - type: `String`
  - A message detailing what went wrong.
--- # RealtimeStoreListener Class - Android
---

#### onValueChanged()

- This event will be emitted whenever the observed key's value changes.

#### Example

**Kotlin:**

```kotlin
val realtimeStoreListener = object : RealtimeStoreListener {
    override fun onValueChanged(newValue: String, updatedBy: Participant?) {
        Log.d("VideoSDK", "onValueChange: Received update from: ${updatedBy?.displayName}")
        Log.d("VideoSDK", "onValueChange: New Value -> $newValue")
    }
}
```

---

**Java:**

```java
private RealtimeStoreListener realtimeStoreListener;

realtimeStoreListener = new RealtimeStoreListener() {
    @Override
    public void onValueChanged(String newValue, Participant updatedBy) {
        Log.d("VideoSDK", "onValueChange: Received update from: " + updatedBy.getDisplayName());
        Log.d("VideoSDK", "onValueChange: New Value -> " + newValue);
    }
};
```
---

# Setup - Android

## Setting up the Android SDK

The Android SDK is a client for real-time communication on Android devices. It follows the same terminology as all the other SDKs.

## Minimum OS/SDK versions

It supports the following OS/SDK versions.

### Android: minSdkVersion >= 21

## Installation

### Step 1: Add the repository

If your Android Studio version is older than Android Studio Bumblebee, add the repository to the project's `build.gradle` file.
If you are using Android Studio Bumblebee or a newer version, add the repository to the `settings.gradle` file.

:::note
You can use imports with Maven Central after rtc-android-sdk version `0.1.12`. Whether on Maven or Jitpack, the same version numbers always refer to the same SDK.
:::

**Maven Central:**

```groovy title="build.gradle"
allprojects {
  repositories {
    // ...
    google()
    mavenCentral()
    maven { url "https://maven.aliyun.com/repository/jcenter" }
  }
}
```

```groovy title="settings.gradle"
dependencyResolutionManagement {
  repositories {
    // ...
    google()
    mavenCentral()
    maven { url "https://maven.aliyun.com/repository/jcenter" }
  }
}
```

---

**Jitpack:**

```groovy title="build.gradle"
allprojects {
  repositories {
    // ...
    google()
    maven { url 'https://jitpack.io' }
    mavenCentral()
    maven { url "https://maven.aliyun.com/repository/jcenter" }
  }
}
```

```groovy title="settings.gradle"
dependencyResolutionManagement {
  repositories {
    // ...
    google()
    maven { url 'https://jitpack.io' }
    mavenCentral()
    maven { url "https://maven.aliyun.com/repository/jcenter" }
  }
}
```

### Step 2: Add the following dependency in your app's `app/build.gradle`.

```groovy title="app/build.gradle"
dependencies {
  implementation 'live.videosdk:rtc-android-sdk:0.3.0'

  // library to perform network calls to generate a meeting id
  implementation 'com.amitshekhar.android:android-networking:1.0.2'

  // other app dependencies
}
```

:::important
The Android SDK is compatible with the armeabi-v7a, arm64-v8a and x86_64 architectures. If you want to run the application in an emulator, choose ABI x86_64 when creating a device.
:::

## Integration

### Step 1: Add the following permissions in `AndroidManifest.xml`.

```xml title="AndroidManifest.xml"

```

### Step 2: Create a `MainApplication` class which extends `android.app.Application`.
**Kotlin:**

```kotlin title="MainApplication.kt"
package live.videosdk.demo

import android.app.Application

import live.videosdk.android.VideoSDK

class MainApplication : Application() {
    override fun onCreate() {
        super.onCreate()
        VideoSDK.initialize(applicationContext)
    }
}
```

---

**Java:**

```java title="MainApplication.java"
package live.videosdk.demo;

import android.app.Application;

import live.videosdk.android.VideoSDK;

public class MainApplication extends Application {
    @Override
    public void onCreate() {
        super.onCreate();
        VideoSDK.initialize(getApplicationContext());
    }
}
```

### Step 3: Add `MainApplication` to `AndroidManifest.xml`.

```xml title="AndroidManifest.xml"

```

### Step 4: In your `JoinActivity`, add the following code in the `onCreate()` method.

**Kotlin:**

```kotlin title="JoinActivity.kt"
override fun onCreate(savedInstanceState: Bundle?) {
    super.onCreate(savedInstanceState)
    setContentView(R.layout.activity_join)

    val meetingId = ""
    val participantName = "John Doe"
    var micEnabled = true
    var webcamEnabled = true

    // generate the jwt token from your api server and add it here
    VideoSDK.config("JWT TOKEN GENERATED FROM SERVER")

    // create a new meeting instance
    val meeting = VideoSDK.initMeeting(
        this@JoinActivity, meetingId, participantName,
        micEnabled, webcamEnabled, null, null, false, null, null)

    // get permissions and join the meeting with meeting.join();
    // checkPermissionAndJoinMeeting();
}
```

---

**Java:**

```java title="JoinActivity.java"
@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_join);

    final String meetingId = "";
    final String participantName = "John Doe";
    final boolean micEnabled = true;
    final boolean webcamEnabled = true;

    // generate the jwt token from your api server and add it here
    VideoSDK.config("JWT TOKEN GENERATED FROM SERVER");

    // create a new meeting instance
    Meeting meeting = VideoSDK.initMeeting(
            JoinActivity.this, meetingId, participantName,
            micEnabled, webcamEnabled, null, null, false, null, null);

    // get permissions and join the meeting with meeting.join();
    // checkPermissionAndJoinMeeting();
}
```

All set! Here is the link to the complete sample code on [Github](https://github.com/videosdk-live/videosdk-rtc-android-java-sdk-example).

Please refer to the [documentation](initMeeting) for a full list of available methods, events and features of the SDK.

---

# Video SDK Stream Class - Android
The Stream class is responsible for handling audio, video and screen-sharing streams. It defines the instance of the audio, video or shared-screen stream of a participant.

## Stream Properties
- [getId()](./properties#getid)
- [getKind()](./properties#getkind)
- [getTrack()](./properties#gettrack)
## Stream Methods
- [resume()](methods#resume)
- [pause()](./methods#pause)
--- # Stream Class Methods - Android
### resume()

- By using `resume()`, a participant can resume the stream of a remote participant.

#### Returns

- `void`

---

### pause()

- By using `pause()`, a participant can pause the stream of a remote participant.

#### Returns

- `void`
--- # Stream Class Properties - Android
### getId()

- type: `String`
- `getId()` will return the id of that stream.

---

### getKind()

- type: `String`
- `getKind()` will return the `kind`, which represents the type of the stream; it could be `audio`, `video` or `share`.

---

### getTrack()

- type: `MediaStreamTrack`
- `getTrack()` will return the MediaStreamTrack object stored in the MediaStream object.
---

# Stream Class

## Introduction

The `Stream Class` includes the methods and events of video & audio streams.

## Properties

### getId()

- `getId()` will return the id of the stream
- return type: `String`

### getKind()

- `getKind()` will return the kind of the stream, which can be `audio`, `video` or `share`
- return type: `String`

### getTrack()

- `getTrack()` will return the MediaStreamTrack object stored in the MediaStream object
- return type: `MediaStreamTrack`

## Methods

### pause()

- By using the `pause()` function, a participant can pause the stream of a remote participant
- return type: `void`

### resume()

- By using the `resume()` function, a participant can resume the stream of a remote participant
- return type: `void`

---

# Terminology - Android

When integrating VideoSDK, you will come across multiple terms like meeting, room, session and many more. Let's discuss each term in detail so that you don't get confused while integrating VideoSDK.

### Meeting / Room

- A Meeting or Room is a VideoSDK object which represents a real-time audio, video, and/or screen-share session; it is the basic building block for media sharing among participants and is returned by the VideoSDK once a successful connection is established.
- A Meeting or Room can be uniquely identified by `meetingId` or `roomId`.

:::note
Meeting and Room are the same thing. Similarly, `meetingId` and `roomId` are the same.
:::

### Session

- A Session is an instance of an ongoing meeting/room which has one or more participants in it. A single room or meeting can have multiple sessions.
- Each session can be uniquely identified by `sessionId`.
- All the sessions are associated with a `meetingId`/`roomId` and can be listed using it.

### Participant

- A Participant is a VideoSDK object which represents each user/client in the meeting or room and can share audio/video media with others.
- There can be multiple participants in a meeting/room and each participant can be uniquely identified by `participantId`.

### MediaStream

- A MediaStream is the collection of audio and video tracks which holds the segments of audio/video media that are sent and received by all the participants in the meeting.

### RTMP Live Stream

- RTMP live streaming is used to live stream your video conferencing apps to platforms like YouTube, Twitch, Facebook, etc. by providing the platform-specific `streamKey` and `streamUrl`.

### Interactive Live Streaming (HLS)

- Interactive live streaming is used to live stream your video conferencing apps within your own platform to allow a larger number of viewers who cannot be accommodated in a real-time conference.

---

# VideoSDK Class Events - Android
### onAudioDeviceChanged()

- This event will be emitted when an audio device is connected to or removed from the device.

#### Example

**Kotlin:**

```kotlin
VideoSDK.setAudioDeviceChangeListener(object : VideoSDK.AudioDeviceChangeEvent {
    override fun onAudioDeviceChanged(
        selectedAudioDevice: AudioDeviceInfo?,
        audioDevices: MutableSet<AudioDeviceInfo>?
    ) {
        Log.d("VideoSDK", "Selected Audio Device: " + selectedAudioDevice?.label)
        audioDevices?.forEach { audioDevice ->
            Log.d("VideoSDK", "Audio Device: " + audioDevice.label)
        }
    }
})
```

---

**Java:**

```java
VideoSDK.setAudioDeviceChangeListener(new VideoSDK.AudioDeviceChangeEvent() {
    @Override
    public void onAudioDeviceChanged(AudioDeviceInfo selectedAudioDevice, Set<AudioDeviceInfo> audioDevices) {
        Log.d("VideoSDK", "Selected Audio Device: " + selectedAudioDevice.getLabel());
        for (AudioDeviceInfo audioDevice : audioDevices) {
            Log.d("VideoSDK", "Audio Device: " + audioDevice.getLabel());
        }
    }
});
```

---
--- # VideoSDK Class - Android
## Introduction

The `VideoSDK` class includes properties, methods and events for creating and configuring a meeting, and managing media devices.

## VideoSDK Properties
- [getSelectedAudioDevice()](./properties.md#getselectedaudiodevice)
- [getSelectedVideoDevice()](./properties#getselectedvideodevice)
## VideoSDK Methods
- [initialize()](./methods#initialize)
- [config()](./methods#config)
- [initMeeting()](./methods#initmeeting)
- [getDevices()](./methods#getdevices)
- [getVideoDevices()](./methods#getvideodevices)
- [getAudioDevices()](./methods#getaudiodevices)
- [checkPermissions()](./methods#checkpermissions)
- [setAudioDeviceChangeListener()](./methods#setaudiodevicechangelistener)
- [setSelectedAudioDevice()](./methods#setselectedaudiodevice)
- [setSelectedVideoDevice()](./methods#setselectedvideodevice)
## VideoSDK Events
- [onAudioDeviceChanged](./events.md#onaudiodevicechanged)
--- # VideoSDK Class Methods - Android
### initialize()

To initialize the meeting, first you have to initialize the `VideoSDK`. You can initialize the `VideoSDK` using the `initialize()` method provided by the SDK.

#### Parameters

- **context**: Context

#### Returns

- _`void`_

```java title="initialize"
VideoSDK.initialize(Context context)
```

---

### config()

By using the `config()` method, you can set the `token` property of the `VideoSDK` class. Please refer to this [documentation](/api-reference/realtime-communication/intro/) to generate a token.

#### Parameters

- **token**: String

#### Returns

- _`void`_

```java title="config"
VideoSDK.config(String token)
```

---

### initMeeting()

- Initialize the meeting using a factory method provided by the SDK called `initMeeting()`.
- `initMeeting()` will generate a new [`Meeting`](../meeting-class/introduction.md) class and the initiated meeting will be returned.

```java title="initMeeting"
VideoSDK.initMeeting(
    Context context,
    String meetingId,
    String name,
    boolean micEnabled,
    boolean webcamEnabled,
    String participantId,
    String mode,
    boolean multiStream,
    Map<String, CustomStreamTrack> customTracks,
    JSONObject metaData,
    String signalingBaseUrl,
    PreferredProtocol preferredProtocol
)
```

- Please refer to this [documentation](../initMeeting.md#initmeeting) to know more about `initMeeting()`.

---

### getDevices()

- The `getDevices()` method returns a list of the currently available media devices, such as microphones, cameras, headsets, and so forth. The method returns a list of `DeviceInfo` objects describing the devices.
- The `DeviceInfo` class has four properties:
  1. `DeviceInfo.deviceId` - Returns a string that is an identifier for the represented device, persisted across sessions.
  2. `DeviceInfo.label` - Returns a string describing this device (for example `BLUETOOTH`).
  3. `DeviceInfo.kind` - Returns an enumerated value that is either `video` or `audio`.
  4. `DeviceInfo.facingMode` - Returns a value of type `FacingMode` indicating which camera device is in use (front or back).
#### Returns

- `Set<DeviceInfo>`

#### Example

**Kotlin:**

```kotlin
val devices: Set<DeviceInfo> = VideoSDK.getDevices()
for (deviceInfo in devices) {
    Log.d("VideoSDK", "Device's DeviceId " + deviceInfo.deviceId)
    Log.d("VideoSDK", "Device's Label " + deviceInfo.label)
    Log.d("VideoSDK", "Device's Kind " + deviceInfo.kind)
    Log.d("VideoSDK", "Device's Facing Mode " + deviceInfo.facingMode) // value will be null for audio devices
}
```

---

**Java:**

```java
Set<DeviceInfo> devices = VideoSDK.getDevices();
for (DeviceInfo deviceInfo : devices) {
    Log.d("VideoSDK", "Device's DeviceId " + deviceInfo.getDeviceId());
    Log.d("VideoSDK", "Device's Label " + deviceInfo.getLabel());
    Log.d("VideoSDK", "Device's Kind " + deviceInfo.getKind());
    Log.d("VideoSDK", "Device's Facing Mode " + deviceInfo.getFacingMode()); // value will be null for audio devices
}
```

---

### getVideoDevices()

- The `getVideoDevices()` method returns a list of the currently available video devices. The method returns a list of `VideoDeviceInfo` objects describing the video devices.
- The `VideoDeviceInfo` class has four properties:
  1. `VideoDeviceInfo.deviceId` - Returns a string that is an identifier for the represented device, persisted across sessions.
  2. `VideoDeviceInfo.label` - Returns a string describing this device (for example `BLUETOOTH`).
  3. `VideoDeviceInfo.kind` - Returns an enumerated value that is `video`.
  4. `VideoDeviceInfo.facingMode` - Returns a value of type `FacingMode` indicating which camera device is in use (front or back).
#### Returns

- `Set<VideoDeviceInfo>`

#### Example

**Kotlin:**

```kotlin
val videoDevices: Set<VideoDeviceInfo> = VideoSDK.getVideoDevices()
for (videoDevice in videoDevices) {
    Log.d("VideoSDK", "Video Device's DeviceId " + videoDevice.deviceId)
    Log.d("VideoSDK", "Video Device's Label " + videoDevice.label)
    Log.d("VideoSDK", "Video Device's Kind " + videoDevice.kind)
}
```

---

**Java:**

```java
Set<VideoDeviceInfo> videoDevices = VideoSDK.getVideoDevices();
for (VideoDeviceInfo videoDevice : videoDevices) {
    Log.d("VideoSDK", "Video Device's DeviceId " + videoDevice.getDeviceId());
    Log.d("VideoSDK", "Video Device's Label " + videoDevice.getLabel());
    Log.d("VideoSDK", "Video Device's Kind " + videoDevice.getKind());
}
```

---

### getAudioDevices()

- The `getAudioDevices()` method returns a set of currently available audio devices. The method returns a set of `AudioDeviceInfo` objects describing the audio devices.

- The `AudioDeviceInfo` class has three properties:

  1. `AudioDeviceInfo.deviceId` - Returns a string that is an identifier for the represented device, persisted across sessions.
  2. `AudioDeviceInfo.label` - Returns a string describing this device (for example `BLUETOOTH`).
  3. `AudioDeviceInfo.kind` - Returns an enumerated value that is `audio`.
#### Returns

- `Set<AudioDeviceInfo>`

#### Example

**Kotlin:**

```kotlin
val audioDevices: Set<AudioDeviceInfo> = VideoSDK.getAudioDevices()
for (audioDevice in audioDevices) {
    Log.d("VideoSDK", "Audio Device's DeviceId " + audioDevice.deviceId)
    Log.d("VideoSDK", "Audio Device's Label " + audioDevice.label)
    Log.d("VideoSDK", "Audio Device's Kind " + audioDevice.kind)
}
```

---

**Java:**

```java
Set<AudioDeviceInfo> audioDevices = VideoSDK.getAudioDevices();
for (AudioDeviceInfo audioDevice : audioDevices) {
    Log.d("VideoSDK", "Audio Device's DeviceId " + audioDevice.getDeviceId());
    Log.d("VideoSDK", "Audio Device's Label " + audioDevice.getLabel());
    Log.d("VideoSDK", "Audio Device's Kind " + audioDevice.getKind());
}
```

---

### setAudioDeviceChangeListener()

- The `AudioDeviceChangeEvent` is emitted when an audio device is connected to or removed from the device. This listener can be set using the `setAudioDeviceChangeListener()` method.

#### Parameters

- **audioDeviceChangeEvent**: AudioDeviceChangeEvent

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
VideoSDK.setAudioDeviceChangeListener { selectedAudioDevice: AudioDeviceInfo, audioDevices: Set<AudioDeviceInfo> ->
    Log.d("VideoSDK", "Selected Audio Device: " + selectedAudioDevice.label)
    for (audioDevice in audioDevices) {
        Log.d("VideoSDK", "Audio Devices " + audioDevice.label)
    }
}
```

---

**Java:**

```java
VideoSDK.setAudioDeviceChangeListener((selectedAudioDevice, audioDevices) -> {
    Log.d("VideoSDK", "Selected Audio Device: " + selectedAudioDevice.getLabel());
    for (AudioDeviceInfo audioDevice : audioDevices) {
        Log.d("VideoSDK", "Audio Devices " + audioDevice.getLabel());
    }
});
```

---

### checkPermissions()

- The `checkPermissions()` method verifies whether permissions to access camera and microphone devices have been granted. If the required permissions are not granted, the method will proceed to request these permissions from the user.

#### Parameters

- context
  - type: `Context`
  - `REQUIRED`
  - The Android context.
- permission
  - type: `List<Permission>`
  - `REQUIRED`
  - The permissions to be requested.

- permissionHandler
  - type: `PermissionHandler`
  - `REQUIRED`
  - The permission handler object for handling callbacks of various user actions such as permission granted, permission denied, etc.

- rationale
  - type: `String`
  - `OPTIONAL`
  - Explanation to be shown to the user if they have denied the permission earlier. If this parameter is not provided, permissions will be requested without showing the rationale dialog.

- options
  - type: `Permissions.Options`
  - `OPTIONAL`
  - The options object for setting the title and description of the dialog box that prompts users to manually grant permissions by navigating to device settings. If this parameter is not provided, the default title and description will be used for the dialog box.

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
private val permissionHandler: PermissionHandler = object : PermissionHandler() {
    override fun onGranted() {}

    override fun onBlocked(
        context: Context,
        blockedList: java.util.ArrayList<Permission>
    ): Boolean {
        for (blockedPermission in blockedList) {
            Log.d("VideoSDK Permission", "onBlocked: $blockedPermission")
        }
        return super.onBlocked(context, blockedList)
    }

    override fun onDenied(
        context: Context,
        deniedPermissions: java.util.ArrayList<Permission>
    ) {
        for (deniedPermission in deniedPermissions) {
            Log.d("VideoSDK Permission", "onDenied: $deniedPermission")
        }
        super.onDenied(context, deniedPermissions)
    }

    override fun onJustBlocked(
        context: Context,
        justBlockedList: java.util.ArrayList<Permission>,
        deniedPermissions: java.util.ArrayList<Permission>
    ) {
        for (justBlockedPermission in justBlockedList) {
            Log.d("VideoSDK Permission", "onJustBlocked: $justBlockedPermission")
        }
        super.onJustBlocked(context, justBlockedList, deniedPermissions)
    }
}

val permissionList: MutableList<Permission> = ArrayList()
permissionList.add(Permission.audio)
permissionList.add(Permission.video)
permissionList.add(Permission.bluetooth)

val rationale = "Please provide permissions"
val options =
    Permissions.Options().setRationaleDialogTitle("Info").setSettingsDialogTitle("Warning")

// If you wish to disable the dialog box that prompts
// users to manually grant permissions by navigating to device settings,
// you can set options.sendDontAskAgainToSettings(false)

VideoSDK.checkPermissions(this, permissionList, rationale, options, permissionHandler)
```

---

**Java:**

```java
private final PermissionHandler permissionHandler = new PermissionHandler() {
    @Override
    public void onGranted() { }

    @Override
    public boolean onBlocked(Context context, ArrayList<Permission> blockedList) {
        for (Permission blockedPermission : blockedList) {
            Log.d("VideoSDK Permission", "onBlocked: " + blockedPermission);
        }
        return super.onBlocked(context, blockedList);
    }

    @Override
    public void onDenied(Context context, ArrayList<Permission> deniedPermissions) {
        for (Permission deniedPermission : deniedPermissions) {
            Log.d("VideoSDK Permission", "onDenied: " + deniedPermission);
        }
        super.onDenied(context, deniedPermissions);
    }

    @Override
    public void onJustBlocked(Context context, ArrayList<Permission> justBlockedList, ArrayList<Permission> deniedPermissions) {
        for (Permission justBlockedPermission : justBlockedList) {
            Log.d("VideoSDK Permission", "onJustBlocked: " + justBlockedPermission);
        }
        super.onJustBlocked(context, justBlockedList, deniedPermissions);
    }
};

List<Permission> permissionList = new ArrayList<>();
permissionList.add(Permission.audio);
permissionList.add(Permission.video);
permissionList.add(Permission.bluetooth);

String rationale = "Please provide permissions";
Permissions.Options options =
    new Permissions.Options().setRationaleDialogTitle("Info").setSettingsDialogTitle("Warning");

// If you wish to disable the dialog box that prompts
// users to manually grant permissions by navigating to device settings,
// you can set options.sendDontAskAgainToSettings(false)

VideoSDK.checkPermissions(this, permissionList, rationale, options, permissionHandler);
```

---

### setSelectedAudioDevice()

- It sets the selected audio device, allowing the user to
specify which audio device to use in the meeting.

#### Parameters

- **selectedAudioDevice**: AudioDeviceInfo

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
val audioDevices: Set<AudioDeviceInfo> = VideoSDK.getAudioDevices()
val audioDeviceInfo: AudioDeviceInfo = audioDevices.toTypedArray()[0]
VideoSDK.setSelectedAudioDevice(audioDeviceInfo)
```

---

**Java:**

```java
Set<AudioDeviceInfo> audioDevices = VideoSDK.getAudioDevices();
AudioDeviceInfo audioDeviceInfo = (AudioDeviceInfo) audioDevices.toArray()[0];
VideoSDK.setSelectedAudioDevice(audioDeviceInfo);
```

---

### setSelectedVideoDevice()

- It sets the selected video device, allowing the user to specify which video device to use in the meeting.

#### Parameters

- **selectedVideoDevice**: VideoDeviceInfo

#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
val videoDevices: Set<VideoDeviceInfo> = VideoSDK.getVideoDevices()
val videoDeviceInfo: VideoDeviceInfo = videoDevices.toTypedArray()[1]
VideoSDK.setSelectedVideoDevice(videoDeviceInfo)
```

---

**Java:**

```java
Set<VideoDeviceInfo> videoDevices = VideoSDK.getVideoDevices();
VideoDeviceInfo videoDeviceInfo = (VideoDeviceInfo) videoDevices.toArray()[1];
VideoSDK.setSelectedVideoDevice(videoDeviceInfo);
```

---

### applyVideoProcessor()

- This method allows users to dynamically apply a virtual background to their video stream during a live session.

#### Parameters

- videoFrameProcessor
  - type: `VideoFrameProcessor`
  - This is an object of the `VideoFrameProcessor` class, which overrides the `onFrameCaptured(VideoFrame videoFrame)` method.
#### Returns

- _`void`_

#### Example

**Kotlin:**

```kotlin
val uri = Uri.parse("https://st.depositphotos.com/2605379/52364/i/450/depositphotos_523648932-stock-photo-concrete-rooftop-night-city-view.jpg")
val backgroundImageProcessor = BackgroundImageProcessor(uri)
VideoSDK.applyVideoProcessor(backgroundImageProcessor)
```

---

**Java:**

```java
Uri uri = Uri.parse("https://st.depositphotos.com/2605379/52364/i/450/depositphotos_523648932-stock-photo-concrete-rooftop-night-city-view.jpg");
BackgroundImageProcessor backgroundImageProcessor = new BackgroundImageProcessor(uri);
VideoSDK.applyVideoProcessor(backgroundImageProcessor);
```

---

### removeVideoProcessor()

- This method provides users with a convenient way to revert their video background to its original state, removing any previously applied virtual background.

#### Returns

- _`void`_

#### Example

```java
VideoSDK.removeVideoProcessor();
```
--- # VideoSDK Class Properties - Android
### getSelectedAudioDevice() - type: `AudioDeviceInfo` - The `getSelectedAudioDevice()` method will return the object of the audio device, which is currently in use. --- ### getSelectedVideoDevice() - type: `VideoDeviceInfo` - The `getSelectedVideoDevice()` method will return the object of the video device, which is currently in use.
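As a small illustrative sketch (assuming the returned objects expose the same accessors as the `DeviceInfo` classes documented above, and that a device has already been selected via `setSelectedAudioDevice()`/`setSelectedVideoDevice()`), reading both selected devices together might look like:

```java
// Hedged sketch: log whichever devices are currently in use.
// The null checks are a defensive assumption, not documented behavior.
AudioDeviceInfo selectedAudioDevice = VideoSDK.getSelectedAudioDevice();
VideoDeviceInfo selectedVideoDevice = VideoSDK.getSelectedVideoDevice();

if (selectedAudioDevice != null) {
    Log.d("VideoSDK", "Selected Audio Device: " + selectedAudioDevice.getLabel());
}
if (selectedVideoDevice != null) {
    Log.d("VideoSDK", "Selected Video Device: " + selectedVideoDevice.getLabel());
}
```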
---

# VideoSDK Class

The entry point into the real-time communication SDK.

## Using the VideoSDK Class

The `VideoSDK` class includes methods and events to initialize and configure the SDK. It is a factory class.

### Parameters

### Methods

## Example

**Kotlin:**

```kotlin title="initMeeting"
// Configure the token
VideoSDK.config(token)

// Initialize the meeting
val meeting = VideoSDK.initMeeting(
    context,
    meetingId,     // required
    name,          // required
    micEnabled,    // required
    webcamEnabled, // required
    null,
    null,
    null
)
```

---

**Java:**

```java title="initMeeting"
// Configure the token
VideoSDK.config(token);

// Initialize the meeting
Meeting meeting = VideoSDK.initMeeting(
    context,
    meetingId,     // required
    name,          // required
    micEnabled,    // required
    webcamEnabled, // required
    null,
    null,
    null
);
```

---

# App Size Optimization - Android

This guide is designed to help developers optimize app size, enhancing performance and efficiency across different devices. By following these best practices, you can reduce load times, minimize storage requirements, and deliver a more seamless experience to users, all while preserving essential functionality.

## Deliver Leaner Apps with App Bundles

Using Android App Bundles (AAB) is an effective way to optimize the size of your application, making it lighter and more efficient for users to download and install. App Bundles allow Google Play to dynamically generate APKs tailored to each device, so users only download the resources and code relevant to their specific configuration. This approach reduces app size significantly, leading to faster installs and conserving storage space on users’ devices.

Recommended Practices:

- `Enable App Bundles`: Configure your build to use the App Bundle format instead of APKs. This will allow Google Play to optimize your app for each device type automatically.
- `Organize Resources by Device Type`: Ensure that resources (like images and layouts) are organized by device type (such as screen density or language) to maximize the benefits of App Bundles. - `Test Modularization`: If your app contains large, optional features, use dynamic feature modules to let users download them on demand. This reduces the initial download size and provides features only as needed. - `Monitor Size Reductions`: Regularly analyze your app size to see where the most savings occur, and make sure that App Bundle optimizations are effectively reducing your app size across different device configurations. ### Optimize Libraries for a Leaner App Experience Managing dependencies carefully is essential for minimizing app size and improving performance. Every library or dependency included in your app adds to its overall size, so it’s crucial to only incorporate what’s necessary. Optimizing dependencies helps streamline your app, reduce load times, and enhance maintainability. Recommended Practices: - `Use Only Essential Libraries`: Review all libraries and dependencies, removing any that are not critical to your app’s functionality. This helps avoid unnecessary bloat. - `Leverage Lightweight Alternatives`: Whenever possible, choose lightweight libraries or modularized versions of larger ones. For example, opt for a specific feature module rather than including an entire library. - `Monitor Library Updates`: Regularly update your dependencies to take advantage of any optimizations or size reductions made by the library maintainers. Newer versions are often more efficient. - `Minimize Native Libraries`: If your app uses native libraries, ensure they’re essential and compatible across platforms, as they can significantly increase app size. - `Analyze Dependency Tree`: Use tools like Gradle’s dependency analyzer to identify unnecessary or redundant dependencies, ensuring your app’s dependency tree is as lean as possible. 
### Optimize with ProGuard

ProGuard is a powerful tool for shrinking, optimizing, and obfuscating your code, which can significantly reduce your app's size and improve performance. By removing unused code and reducing the size of classes, fields, and methods, ProGuard helps to minimize the footprint of your app without sacrificing functionality. Additionally, ProGuard’s obfuscation feature enhances security by making reverse engineering more difficult. You can refer to the official [documentation](https://developer.android.com/build/shrink-code) for more information.

Recommended Practices:

- `Enable ProGuard`: To enable ProGuard in your project, ensure that your `proguard-rules.pro` file is properly configured, and add the following lines to your `build.gradle` file:

```groovy title="build.gradle"
buildTypes {
    release {
        minifyEnabled true
        proguardFiles getDefaultProguardFile('proguard-android-optimize.txt'), 'proguard-rules.pro'
    }
}
```

- `Customize ProGuard Rules`: Carefully review and customize ProGuard rules in the `proguard-rules.pro` file to avoid stripping essential code. For example, to keep a specific class, add:

```
-keep class com.example.myapp.MyClass { *; }
```

If you encounter an issue after enabling ProGuard rules, refer to our [known issues section](https://docs.videosdk.live/android/guide/video-and-audio-calling-api-sdk/known-issues).

---

# Developer Experience Guidelines - Android

This section provides best practices for creating a smooth and efficient development process when working with VideoSDK. From handling errors gracefully to managing resources and event subscriptions, these guidelines help developers build more reliable and maintainable applications. Following these practices can simplify troubleshooting, prevent common pitfalls, and improve overall application performance.
### Initiate Key Features After Meeting Join Event

To provide a seamless and reliable meeting experience, initiate specific features **only** after the [onMeetingJoined()](https://docs.videosdk.live/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onmeetingjoined) event has been triggered.

- **Trigger Key Actions After Joining the Meeting**: Initiating crucial actions after the `onMeetingJoined()` event helps avoid errors and optimizes the meeting setup, ensuring a smoother experience for participants. If your application utilizes any of the following features, or you want to perform any action as soon as the meeting is joined, it's recommended to call them only after the meeting has successfully started:

- `Chat Subscription`: To enable in-meeting chat functionality, subscribe to the chat topic after the `onMeetingJoined()` event is triggered. This ensures that messages are reliably received by participants.
- `Device Management`: If you need users to use specific audio or video devices when the meeting is first joined, you can utilize the [`setSelectedAudioDevice()`](https://docs.videosdk.live/android/api/sdk-reference/videosdk-class/methods#setselectedaudiodevice) and [`setSelectedVideoDevice()`](https://docs.videosdk.live/android/api/sdk-reference/videosdk-class/methods#setselectedvideodevice) methods of `VideoSDK` class. - `Recording and Transcription`: To automatically start recording or transcription as soon as the meeting begins, configure the `autoStartConfig` in the `createMeeting` API. For detailed information, refer to the documentation [here](https://docs.videosdk.live/api-reference/realtime-communication/create-room#autoCloseConfig). ### Dispose Custom Tracks When Necessary Proper disposal of custom tracks is essential for managing system resources and ensuring a smooth experience. In most scenarios, tracks are automatically disposed of by the SDK, ensuring efficient resource management. However, in specific cases outlined below, you will need to dispose of custom tracks explicitly: 1. **When Enabling/Disabling the Camera on a Precall Screen**: - If your application includes a precall screen and you want to ensure that the device's camera is not used when the camera is disabled, you must dispose of the custom video track. Otherwise, the device’s camera will continue to be used even when the camera is off. - Additionally, remember to create a new track when the user enables the camera again. - If you don’t need to manage the camera's usage on the device level (i.e., you’re okay with the camera being used whether it’s enabled or disabled), you can skip this step. 
- Here's how you can manage a custom track on a precall screen:

**Kotlin:**

```kotlin
import live.videosdk.rtc.android.CustomStreamTrack
import live.videosdk.rtc.android.VideoSDK
import live.videosdk.rtc.android.VideoView

class JoinActivity : AppCompatActivity() {
    private var videoTrack: CustomStreamTrack? = null
    private var joinView: VideoView? = null
    private var isWebcamEnabled = false

    private fun toggleWebcam(videoDevice: VideoDeviceInfo?) {
        if (isWebcamEnabled) {
            // Check if the track state is LIVE
            if (videoTrack?.track?.state()?.equals("LIVE") == true) {
                videoTrack?.track?.dispose() // Dispose the current video track
                videoTrack?.track?.setEnabled(false) // Disable the track
            }
            videoTrack = null
            joinView!!.removeTrack() // Remove the video track from the view
            joinView!!.releaseSurfaceViewRenderer()
            joinView!!.visibility = View.INVISIBLE
        } else {
            // Re-enable the webcam by creating a new track
            videoTrack = VideoSDK.createCameraVideoTrack(
                "h720p_w960p",
                "front",
                CustomStreamTrack.VideoMode.TEXT,
                true,
                this,
                videoDevice // The VideoDeviceInfo object of the user's selected device
            )
            // Display in the local view
            joinView!!.addTrack(videoTrack!!.track as VideoTrack?)
            joinView!!.visibility = View.VISIBLE
        }
        isWebcamEnabled = !isWebcamEnabled // Toggle webcam state
    }
}
```

---

**Java:**

```java
import live.videosdk.rtc.android.CustomStreamTrack;
import live.videosdk.rtc.android.VideoSDK;
import live.videosdk.rtc.android.VideoView;

public class JoinActivity extends AppCompatActivity {
    private CustomStreamTrack videoTrack = null;
    private VideoView joinView = null;
    private boolean isWebcamEnabled = false;

    private void toggleWebcam(VideoDeviceInfo videoDevice) {
        if (isWebcamEnabled) {
            // Check if the track state is LIVE
            if (videoTrack != null && "LIVE".equals(videoTrack.getTrack().state())) {
                videoTrack.getTrack().dispose(); // Dispose the current video track
                videoTrack.getTrack().setEnabled(false); // Disable the track
            }
            videoTrack = null;
            joinView.removeTrack(); // Remove the video track from the view
            joinView.releaseSurfaceViewRenderer();
            joinView.setVisibility(View.INVISIBLE);
        } else {
            // Re-enable the webcam by creating a new track
            videoTrack = VideoSDK.createCameraVideoTrack(
                "h720p_w960p",
                "front",
                CustomStreamTrack.VideoMode.TEXT,
                true,
                this,
                videoDevice // The VideoDeviceInfo object of the user's selected device
            );
            // Display in the local view
            joinView.addTrack((VideoTrack) videoTrack.getTrack());
            joinView.setVisibility(View.VISIBLE);
        }
        isWebcamEnabled = !isWebcamEnabled; // Toggle webcam state
    }
}
```

## Listen for Error Events

Listening to error events enables your application to handle unexpected issues efficiently, providing users with clear feedback and potential solutions. Error codes pinpoint specific problems, whether from configuration settings, account restrictions, permission limitations, or device constraints. Here are recommended solutions based on common error categories:

1.
[Errors associated with Organization](../../api/sdk-reference/error-codes.md#1-errors-associated-with-organization): If you encounter errors related to your organization (e.g., account status or participant limits), reach out to support at support@videosdk.live or reach out to us on [Discord](https://discord.com/invite/Qfm8j4YAUJ) for assistance. 2. [Errors associated with Token](../../api/sdk-reference/error-codes#2-errors-associated-with-token): For errors related to authentication tokens, ensure the token is valid and hasn’t expired, then try the request again. 3. [Errors associated with Meeting and Participant](../../api/sdk-reference/error-codes#3-errors-associated-with-meeting-and-participant): Check that meetingId and participantId are correctly passed and valid. Also, ensure each participant has a unique participantId to avoid duplicate entries. 4. [Errors associated with Add-on Service](../../api/sdk-reference/error-codes#4-errors-associated-with-add-on-service): If you encounter errors with add-on services (such as recording or streaming), try restarting the service after receiving a failure event. For example, if a `START_RECORDING_FAILED` error event occurs, attempt to call the `startRecording()` method again. If you're using webhooks, you can also retry on [recording-failed](https://docs.videosdk.live/api-reference/realtime-communication/user-webhooks#recording-failed) hook. 5. [Errors associated with Media](../../api/sdk-reference/error-codes#5errors-associated-with-media): Inform the user about media access issues, such as microphone or camera permissions. Design the UI to clearly indicate what is preventing the mic or camera from enabling, helping the user understand the problem. 6. [Errors associated with Track](../../api/sdk-reference/error-codes#6errors-associated-with-track): Ensure that the track you’ve created and passed to enable the mic or camera methods meets the required specifications. 7. 
[Errors associated with Actions](../../api/sdk-reference/error-codes#7errors-associated-with-actions): If you need to perform actions as soon as a meeting is joined, only initiate them after receiving the [onMeetingJoined()](https://docs.videosdk.live/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onmeetingjoined) event; otherwise, they will not take effect reliably.

- Here's how to listen for the error event:

**Kotlin:**

```kotlin
private val meetingEventListener: MeetingEventListener = object : MeetingEventListener() {
    //..
    override fun onError(error: JSONObject) {
        try {
            val errorCodes: JSONObject = VideoSDK.getErrorCodes()
            val code = error.getInt("code")
            Log.d("#error", "Error is: " + error["message"])
        } catch (e: Exception) {
            e.printStackTrace()
        }
    }
}
```

---

**Java:**

```java
private final MeetingEventListener meetingEventListener = new MeetingEventListener() {
    //..
    @Override
    public void onError(JSONObject error) {
        try {
            JSONObject errorCodes = VideoSDK.getErrorCodes();
            int code = error.getInt("code");
            Log.d("#error", "Error is: " + error.get("message"));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
};
```

---

# Handle Large Rooms - Android

Managing large meetings requires specific strategies to ensure performance, stability, and a seamless user experience. This section provides best practices for optimizing VideoSDK applications to handle high participant volumes effectively. By implementing these recommendations, you can reduce lag, maintain video and audio quality, and provide a smooth experience even in large rooms.

## User Interface Optimization

When hosting large meetings, an optimized UI helps manage participant visibility and ensures smooth performance.

Recommended Practices:

- `Limit Visible Participants`: Display only a limited number of participants on screen at any given time, adapting the view based on screen size. Use pagination to allow users to browse or switch between additional participants seamlessly.
For example, you could display only users whose video stream is enabled, or you could choose to display all active speakers. This approach helps manage screen space efficiently, ensuring that the most relevant participants are visible without overwhelming the interface. - `Prioritize Active Speakers`: Ensure all active speakers are displayed on the screen to highlight who is currently talking, helping participants stay engaged and aware of ongoing discussions. To identify which participant is speaking, you can use the [onSpeakerChanged()](https://docs.videosdk.live/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onspeakerchanged) event. ## Optimizing Media Streams In large video calls, it’s important to manage media streams effectively to optimize system resources while maintaining a smooth user experience. Recommended Practices: - `Pause Streams for Non-Visible Participants`: To optimize performance, pause the video and/or audio streams of participants who are not currently visible on the screen. This reduces unnecessary resource consumption. - `Resume Streams When Visible`: Once a participant comes into view, resume their stream to provide an uninterrupted experience as they appear on the screen. For detailed setup instructions on how to achieve this, check out our in-depth documentation [here](https://docs.videosdk.live/android/guide/video-and-audio-calling-api-sdk/render-media/layout-and-grid-management#pauseresume-stream). ## Media Stream Quality Adjustment In large meetings, managing media stream quality is essential to balance performance and user experience. Recommended Practices: - `High Quality for Active Speakers`: For all active speakers, set the video stream quality to a higher level using the setQuality method (e.g., `setQuality("high")`). This ensures that participants will receive higher-quality video for active speakers, providing a clearer and more engaging experience. 
- `Lower Quality for Non-Speaking Participants`: For other participants who are not actively speaking, set their video stream quality to a lower level (e.g., `setQuality("low")`). This helps conserve bandwidth and system resources while maintaining overall meeting performance.

Check out the documentation for the `setQuality()` method [here](https://docs.videosdk.live/android/api/sdk-reference/participant-class/methods#setquality).

---

# User Experience Guidelines - Android

This guide aims to help developers optimize the user experience and functionality of video conferencing applications with VideoSDK. By following these best practices, you can create smoother interactions, minimize common issues, and deliver a more reliable experience for users.

Here are our recommended best practices to enhance the user experience in your application:

| **Section** | **Description** |
|--------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Configure Precall for Effortless Meeting Join](#configure-precall-for-effortless-meeting-join) | Users may enter meetings unprepared due to device or connection issues. A Precall setup can help them configure devices and settings beforehand for a smooth start. |
| [Monitor Key Events for Optimal User Experience](#monitor-key-events-for-optimal-user-experience) | Users can feel lost without real-time updates on meeting status, events, and errors. Event monitoring and notifications keep them informed and engaged. |
| [Handling Media Devices](#handling-media-devices) | Users may want to change their audio or video setup mid-meeting but struggle to manage device controls. Providing easy device switching enhances control and flexibility.
|
| [Monitoring Real-Time Participant Statistics](#monitoring-real-time-participant-statistics) | Poor video or audio quality without real-time feedback leaves users frustrated. Real-time stats let them assess connection quality and troubleshoot issues actively. |

## Configure Precall for Effortless Meeting Join

A Precall step is crucial for ensuring users are set up correctly and have no device issues before joining a meeting. This step allows users to configure their devices and settings before entering a meeting, leading to a smoother experience and minimizing technical issues once the call begins.

Recommended Practices:

- `Request Permissions`: Prompt users to grant microphone and camera permissions before entering the meeting, ensuring seamless access to their devices.
- `Device Selection`: Allow users to select their preferred camera and microphone, giving them control over their setup from the start.
- `Entry Preferences`: Provide options to join with the microphone and camera either on or off, letting users choose their level of engagement upon entry.
- `Camera Preview`: Show a live camera preview, allowing users to adjust angles and lighting to ensure they appear clearly and professionally.
- `Virtual Backgrounds`: Allow users to choose from different virtual backgrounds or enter with a virtual background enabled, enhancing privacy and creating a more polished appearance.

For detailed setup instructions on each of these features, check out our in-depth documentation [here](https://docs.videosdk.live/android/guide/video-and-audio-calling-api-sdk/setup-call/precall).
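The precall practices above can be sketched with the `VideoSDK` methods documented earlier on this page (`checkPermissions()`, `getAudioDevices()`, `getVideoDevices()`, `setSelectedAudioDevice()`, `setSelectedVideoDevice()`). Treat this as a minimal, hedged outline rather than a complete precall screen:

```java
// Minimal precall sketch (Java). "permissionHandler" is a PermissionHandler
// like the one shown in the checkPermissions() example; UI wiring is omitted.
List<Permission> permissionList = new ArrayList<>();
permissionList.add(Permission.audio);
permissionList.add(Permission.video);

// 1. Request mic and camera permissions before joining.
VideoSDK.checkPermissions(this, permissionList, "Please provide permissions",
        new Permissions.Options(), permissionHandler);

// 2. List devices so the user can pick one (taking the first of each here
//    is a placeholder for a real device-selection UI).
Set<AudioDeviceInfo> audioDevices = VideoSDK.getAudioDevices();
Set<VideoDeviceInfo> videoDevices = VideoSDK.getVideoDevices();

// 3. Remember the user's choice for the meeting.
VideoSDK.setSelectedAudioDevice(audioDevices.iterator().next());
VideoSDK.setSelectedVideoDevice(videoDevices.iterator().next());
```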
## Monitor Key Events for Optimal User Experience Listening for crucial events is vital for providing users with a responsive and engaging experience in your application. By effectively managing state changes and user notifications, you can keep participants informed and enhance their overall experience during meetings. Recommended Practices: - `Monitor State Change Events`: Listen for state change events, such as `onMeetingStateChanged` and `onRecordingStateChanged`, and notify users promptly about these transitions. Keeping users informed helps them understand the current state of the meeting.
- `UI Handling on Event Trigger`: Update the user interface only in response to specific events. For instance, display that the meeting is recording only when the `onRecordingStateChanged` event with the status `RECORDING_STARTED` is received, rather than when the record button is clicked. This ensures users receive accurate and timely updates.
- `Notify Participants of Join/Leave Events`: Keep users informed about participant activity by notifying them when someone joins or leaves the meeting. This fosters a sense of presence and awareness of who is currently available to engage.
- `Listen for Error Events`: It is crucial to monitor error events and notify users promptly when issues arise. Clear communication about errors can help users troubleshoot and address problems quickly, minimizing disruptions to the meeting.

## Handling Media Devices

Providing seamless control over devices enhances user convenience and allows participants to adjust their setup for the best meeting experience. Proper device management within the UI also helps users stay informed about their current settings and troubleshoot issues effectively.

Recommended Practices:

- `Allow Device Switching`: Provide users with the option to switch between available microphone and camera devices during the meeting. This flexibility is essential, especially if users want to adjust their setup mid-call.
- `Display Selected Devices`: Ensure the UI shows users which microphone and camera devices are currently selected. Clear device labeling in the interface can reduce confusion and help users verify their setup at a glance.
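For mid-meeting switching, one hedged way to wire a device menu to the SDK (using only the `VideoSDK` methods documented above; the `targetLabel` variable is a hypothetical stand-in for the user's menu choice) is:

```java
// Hedged sketch: switch the active microphone when the user picks a
// device from a menu. "targetLabel" is hypothetical UI state.
String targetLabel = "BLUETOOTH";
for (AudioDeviceInfo device : VideoSDK.getAudioDevices()) {
    if (targetLabel.equals(device.getLabel())) {
        VideoSDK.setSelectedAudioDevice(device);
        break;
    }
}
```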
## Monitoring Real-Time Participant Statistics

Providing real-time insights into stream quality allows participants to monitor and optimize their connection for the best experience. With detailed metrics on video, audio, and screen sharing, users can assess and troubleshoot quality issues, ensuring smooth and uninterrupted meetings.

To display these statistics, you can use the [getVideoStats()](https://docs.videosdk.live/android/api/sdk-reference/participant-class/methods#getvideostats), [getAudioStats()](https://docs.videosdk.live/android/api/sdk-reference/participant-class/methods#getaudiostats), and [getShareStats()](https://docs.videosdk.live/android/api/sdk-reference/participant-class/methods#getsharestats) methods.
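As a rough illustration of turning those stats into a user-facing display, here is a sketch in JavaScript. The shape of the objects returned by `getVideoStats()`/`getAudioStats()` is an assumption for illustration (the mocked `participant` below is not the SDK's type); consult the linked method references for the actual fields.

```javascript
// Format participant stats for display.
// Assumed (hypothetical) stats shape: [{ size: { width, height, framerate }, bitrate }]
function formatStats(participant) {
  const [video] = participant.getVideoStats();
  const [audio] = participant.getAudioStats();
  return {
    video: `${video.size.width}x${video.size.height} @ ${video.size.framerate}fps, ${Math.round(video.bitrate / 1000)} kbps`,
    audio: `${Math.round(audio.bitrate / 1000)} kbps, jitter ${audio.jitter} ms`,
  };
}

// Example with mocked stats (stand-in for a real participant object):
const participant = {
  getVideoStats: () => [
    { size: { width: 1280, height: 720, framerate: 30 }, bitrate: 1442000 },
  ],
  getAudioStats: () => [{ bitrate: 64000, jitter: 12 }],
};
console.log(formatStats(participant));
```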

:::note
To show the popup dialog for the participant's realtime stats, you can [refer to this function](https://github.com/videosdk-live/videosdk-rtc-android-kotlin-sdk-example/blob/main/app/src/main/java/live/videosdk/rtc/android/kotlin/Common/Utils/HelperClass.kt#L91).
:::

---

# Face Match API

This API verifies whether the two provided images show the same person. It returns a boolean value indicating whether the faces match.

:::important
This API is available in the Enterprise plan only.
:::

## HTTP Method and Endpoint

**POST** | `https://api.videosdk.live/ai/v1/face-verification/verify`

## Headers Parameter

### Authorization

**values** : YOUR_TOKEN_WITHOUT_ANY_PREFIX

This will be a JWT token generated using your VideoSDK ApiKey and Secret.

_Note_ : the token must not include any prefix such as "Basic " or "Bearer ". Just pass the token as the value. You can generate a new token by referring to this guide: [Generate Auth token](https://docs.videosdk.live/api-reference/realtime-communication/intro)

### Content-Type

**values** : `application/json`

This tells VideoSDK servers that the incoming request body is a JSON string.

## Data Parameter

#### Base64 Encoding Format

The images should be converted from binary format to Base64 strings in the following format:

```
data:image/jpeg;base64,${Base64data}
```

#### Request Format

The request body should contain two Base64-encoded images in the following format:

```js
const data = {
  img1: "data:image/jpeg;base64,${Base64data}", // Base64-encoded image 1
  img2: "data:image/jpeg;base64,${Base64data}", // Base64-encoded image 2
};
```

## Sending Image Comparison Request

The request, shown in the code snippet below, sends the encoded images to the specified URL for comparison.
## Code Snippet for API Integration

Below is a code snippet demonstrating how to use the Face Match API (BETA) with Node.js:

```js
import dotenv from "dotenv";
import path from "path";
import axios from "axios";
import { readFileSync } from "fs";

// Load environment variables from .env file
dotenv.config();

// Get the directory of the current file
const __dirname = path.resolve();

// Function to read image files and convert them to base64
function getBase64Image(imageName) {
  const filePath = path.join(__dirname, "image", imageName);
  console.log(`Resolved file path: ${filePath}`);
  try {
    const image = readFileSync(filePath);
    const base64Data = Buffer.from(image).toString("base64");
    return `data:image/jpeg;base64,${base64Data}`;
  } catch (error) {
    console.error(`Error reading image file: ${filePath}`, error.message);
    throw error;
  }
}

// Function to perform Face Match
async function faceMatch(ovdImageName, selfieImageName) {
  const url = "https://api.videosdk.live/ai/v1/face-verification/verify";
  const headers = {
    Authorization: `${process.env.API_KEY}`,
    "Content-Type": "application/json",
  };
  const data = {
    img1: getBase64Image(ovdImageName),
    img2: getBase64Image(selfieImageName),
  };
  try {
    const response = await axios.post(url, data, { headers });
    console.log("Face Match Result:", response.data);
  } catch (error) {
    console.error(
      "Error during face match:",
      error.response ? error.response.data : error.message
    );
  }
}

// Main function to run the API calls
async function main() {
  const ovdImageName = "img1.jpg";
  const selfieImageName = "img2.jpg";
  await faceMatch(ovdImageName, selfieImageName);
}

// Run the main function
main().catch((error) =>
  console.error("Error in main function:", error.message)
);
```

## Examples of Face Match API (BETA) Usage

### Example 1 : Matching Images of the Same Person

In this scenario, we take two different pictures of the same individual. For instance:
Image 1

Image 2

When these images are sent to the Face Match API (BETA), the expected output would be:

```
Face Match Result: { "verified": true }
```

This result indicates that despite differences in lighting, angle, or expression, the API successfully recognizes that both images depict the same individual.

### Example 2 : Matching Images of Different Persons

In this case, we compare two photos of different individuals. For example:
Image 1

Image 2

When these images are processed through the Face Match API (BETA), the expected output would be:

```
Face Match Result: { "verified": false }
```

This result demonstrates the API's ability to differentiate between distinct individuals accurately.

---

# Face Spoof Detection

This API verifies whether an image is original or spoofed. It returns a boolean value `spoof_detected` in the output.

:::important
This API is available in the Enterprise plan only.
:::

## HTTP Method and Endpoint

**POST** | `https://api.videosdk.live/ai/v1/face-verification/detect-spoof`

## Headers Parameter

### Authorization

**values** : YOUR_TOKEN_WITHOUT_ANY_PREFIX

This will be a JWT token generated using your VideoSDK ApiKey and Secret.

_Note_ : the token must not include any prefix such as "Basic " or "Bearer ". Just pass the token as the value. You can generate a new token by referring to this guide: [Generate Auth token](https://docs.videosdk.live/api-reference/realtime-communication/intro)

### Content-Type

**values** : `application/json`

This tells VideoSDK servers that the incoming request body is a JSON string.

## Data Parameter

#### Base64 Encoding Format

The image should be converted from binary format to a Base64 string in the following format:

```
data:image/jpeg;base64,${Base64data}
```

#### Request Format

The request body should contain a single Base64-encoded image in the following format:

```js
const data = {
  img: "data:image/jpeg;base64,${Base64data}", // Base64-encoded image
};
```

## Sending Spoof Detection Request

The request, shown in the code snippet below, sends the encoded image to the specified URL for spoof detection.
## Code Snippet for API Integration

Below is a code snippet demonstrating how to use the Face Spoof Detection API (BETA) with Node.js:

```js
import dotenv from "dotenv";
import path from "path";
import axios from "axios";
import { readFileSync } from "fs";

// Load environment variables from .env file
dotenv.config();

// Get the directory of the current file
const __dirname = path.resolve();

// Function to read image files and convert them to base64
function getBase64Image(imageName) {
  const filePath = path.join(__dirname, "image", imageName);
  console.log(`Resolved file path: ${filePath}`);
  try {
    const image = readFileSync(filePath);
    const base64Data = Buffer.from(image).toString("base64");
    return `data:image/jpeg;base64,${base64Data}`;
  } catch (error) {
    console.error(`Error reading image file: ${filePath}`, error.message);
    throw error;
  }
}

async function spoofDetection(spoofImg) {
  const url = "https://api.videosdk.live/ai/v1/face-verification/detect-spoof";
  const headers = {
    Authorization: `${process.env.API_KEY}`,
    "Content-Type": "application/json",
  };
  const data = {
    img: getBase64Image(spoofImg),
  };
  try {
    const response = await axios.post(url, data, { headers });
    console.log(response.data);
  } catch (error) {
    console.error(
      "Error during spoof detection:",
      error.response ? error.response.data : error.message
    );
  }
}

// Main function to run the API calls
async function main() {
  const spoofImg = "img.jpg";
  await spoofDetection(spoofImg);
}

// Run the main function
main().catch((error) =>
  console.error("Error in main function:", error.message)
);
```

##### Given below is a test run of the example.

**Input Image** :
Image 1
When this image is sent to the Face Spoof Detection API (BETA), the response will include a boolean value `spoof_detected`:

```js
{
  spoof_detected: true, // true if a spoof is detected in the image, else false
  accuracy: 0.9899068176746368, // accuracy of the spoof detection
}
```

---

# Number of Face Detection

This API checks how many faces are present in the image and returns an integer value `number_of_faces` in the output.

:::important
This API is available in the Enterprise plan only.
:::

## HTTP Method and Endpoint

**POST** | `https://api.videosdk.live/ai/v1/face-verification/detect-faces`

## Headers Parameter

### Authorization

**values** : YOUR_TOKEN_WITHOUT_ANY_PREFIX

This will be a JWT token generated using your VideoSDK ApiKey and Secret.

_Note_ : the token must not include any prefix such as "Basic " or "Bearer ". Just pass the token as the value. You can generate a new token by referring to this guide: [Generate Auth token](https://docs.videosdk.live/api-reference/realtime-communication/intro)

### Content-Type

**values** : `application/json`

This tells VideoSDK servers that the incoming request body is a JSON string.

## Data Parameter

#### Base64 Encoding Format

The image should be converted from binary format to a Base64 string in the following format:

```
data:image/jpeg;base64,${Base64data}
```

#### Request Format

The request body should contain a single Base64-encoded image in the following format:

```js
const data = {
  img: "data:image/jpeg;base64,${Base64data}", // Base64-encoded image
};
```

## Sending Face Detection Request

The request, shown in the code snippet below, sends the encoded image to the specified URL for face detection.
## Code Snippet for API Integration

Below is a code snippet demonstrating how to use the Number of Face Detection API with Node.js:

```js
import dotenv from "dotenv";
import path from "path";
import axios from "axios";
import { readFileSync } from "fs";

// Load environment variables from .env file
dotenv.config();

// Get the directory of the current file
const __dirname = path.resolve();

// Function to read image files and convert them to base64
function getBase64Image(imageName) {
  const filePath = path.join(__dirname, "image", imageName);
  console.log(`Resolved file path: ${filePath}`);
  try {
    const image = readFileSync(filePath);
    const base64Data = Buffer.from(image).toString("base64");
    return `data:image/jpeg;base64,${base64Data}`;
  } catch (error) {
    console.error(`Error reading image file: ${filePath}`, error.message);
    throw error;
  }
}

async function noOfFaces(getImg) {
  const url = "https://api.videosdk.live/ai/v1/face-verification/detect-faces";
  const headers = {
    Authorization: `${process.env.API_KEY}`,
    "Content-Type": "application/json",
  };
  const data = {
    img: getBase64Image(getImg),
  };
  try {
    const response = await axios.post(url, data, { headers });
    console.log(response.data);
  } catch (error) {
    console.error(
      "Error during number of face detection:",
      error.response ? error.response.data : error.message
    );
  }
}

// Main function to run the API calls
async function main() {
  const getImg = "img1.jpg";
  await noOfFaces(getImg);
}

// Run the main function
main().catch((error) =>
  console.error("Error in main function:", error.message)
);
```

##### Given below is a test run of the example.

**Input Image** : group_photo.jpg - A photo showing a group of five people.
Image 1
When this image is sent to the Face Detection API, the response will include an integer value `number_of_faces`:

```js
{
  number_of_faces: 5, // number of faces detected
}
```

---

# OCR API

This API takes the front and back parts of a document as input, scans the document, and returns all the available fields of the document in a JSON-formatted response. The fields in the response can vary depending on the type of document.

:::important
This API is available in the Enterprise plan only.
:::

## HTTP Method and Endpoint

**POST** | `https://api.videosdk.live/ai/v1/ocr`

## Headers Parameter

### Authorization

**values** : YOUR_TOKEN_WITHOUT_ANY_PREFIX

This will be a JWT token generated using your VideoSDK ApiKey and Secret.

_Note_ : the token must not include any prefix such as "Basic " or "Bearer ". Just pass the token as the value. You can generate a new token by referring to this guide: [Generate Auth token](https://docs.videosdk.live/api-reference/realtime-communication/intro)

### Content-Type

**values** : `application/json`

This tells VideoSDK servers that the incoming request body is a JSON string.
## Data Parameter

#### Base64 Encoding Format

The images should be converted from binary format to Base64 strings in the following format:

```
data:image/jpeg;base64,${Base64data}
```

#### Request Format

The request body should contain two Base64-encoded images in the following format:

```js
const data = {
  frontPart: "data:image/jpeg;base64,${Base64data}", // Base64-encoded front image
  backPart: "data:image/jpeg;base64,${Base64data}", // Base64-encoded back image
};
```

## Sending OCR Request

The request, shown in the code snippet below, sends the encoded images to the specified URL for OCR.

## Response

The response will return all the available fields of the document in a JSON-formatted response:

```javascript
{
  idType: "", // document type
  idNumber: "", // document number
  name: "", // name
  dateOfBirth: "",
  address: "",
  gender: "",
  mobileNumber: "",
};
```

## Code Snippet for API Integration

Below is a code snippet demonstrating how to use the OCR API (BETA) with Node.js:

```javascript
import dotenv from "dotenv";
import path from "path";
import axios from "axios";
import { readFileSync } from "fs";

// Load environment variables from .env file
dotenv.config();

// Get the directory of the current file
const __dirname = path.resolve();

// Function to read image files and convert them to base64
function getBase64Image(imageName) {
  const filePath = path.join(__dirname, "image", imageName);
  console.log(`Resolved file path: ${filePath}`);
  try {
    const image = readFileSync(filePath);
    const base64Data = Buffer.from(image).toString("base64");
    return `data:image/jpeg;base64,${base64Data}`;
  } catch (error) {
    console.error(`Error reading image file: ${filePath}`, error.message);
    throw error;
  }
}

async function ocr(frontImg, backImg) {
  const url = "https://api.videosdk.live/ai/v1/ocr";
  const headers = {
    Authorization: `${process.env.API_KEY}`,
    "Content-Type": "application/json",
  };
  const data = {
    frontPart: getBase64Image(frontImg),
    backPart: getBase64Image(backImg),
  };
  try {
    const response = await axios.post(url, data, { headers });
    console.log(response.data);
  } catch (error) {
    console.error(
      "Error during OCR:",
      error.response ? error.response.data : error.message
    );
  }
}

// Main function to run the API calls
async function main() {
  const frontImg = "img4.jpg";
  const backImg = "img5.jpg";
  await ocr(frontImg, backImg);
}

// Run the main function
main().catch((error) =>
  console.error("Error in main function:", error.message)
);
```

---

# Understanding Analytics Dashboard - Android

Welcome to the world of actionable insights and empowered decision-making. VideoSDK's session analytics dashboard is your gateway to understanding, optimizing, and elevating every aspect of your sessions.

## Accessing Analytics Made Easy

Navigating through session data is a breeze with VideoSDK. Simply head to your session page at https://app.videosdk.live/meetings/sessions, where a treasure trove of session information awaits.

### How to Access Analytics?

Open Session Analytics effortlessly by following these steps:

1. **Click on Meeting-ID:**
   - Directly access analytics by clicking on the meeting-ID within the session table.
2. **Hover and Click View Analytics:**
   - Hover over a specific meeting row to reveal the **View Analytics** button in the Actions Column.
   - Click on **View Analytics** to open the Session Overview sidebar, unlocking a wealth of insights.

![Access Analytics](https://cdn.videosdk.live/website-resources/docs-resources/access_analytics.png)

---

There are three tabs available within the session analytics view:

## **1. Session Overview**

This tab provides an overview of the session, including its duration and participant details. You can explore individual participant statistics to understand their session performance better.

## **2. Errors**

In this tab, you can find information about any errors encountered during the session. It helps you identify and address issues like network problems or technical glitches promptly.

## **3. Session Stats**

Explore the data and metrics sent and received by participants to measure the performance of the session in this tab. It offers insights into data exchange among participants, including metrics on jitter, RTT, packet loss, resolution, and FPS, aiding in assessing communication efficiency.

Let's dive deeper into each of these tabs to gain a better understanding of session analytics.

# Session Overview

The session overview page is your compass in the sea of data. Here's what you'll uncover:

- **Meeting ID:** Unique identifier for the meeting in the format of `abcd-efgh-ijkl`.
- **Session ID:** Unique identifier for each session, uniquely identified by its `sessionId`.
- **Session Status:** Indicates if the session is ongoing or ended.
- **Session Initiating Time:** Time taken by the first participant to establish the connection.
- **Start and End Time:** Marks the start and end of the session.
- **Total Unique Participants Joined:** Total number of unique participants in the session.
- **Total Session Duration:** Overall duration of the session from start to end.
- **Total Participant Minutes:** The sum of all participant durations.
- **Recording, HLS, RTMP:** Additional services used in the session.

### Participant Table List

- **Participant ID:** Unique identifier for each participant.
- **Participant Name:** Personalized identification for each participant.
- **Join Time:** Time taken by the participant to establish the connection.
- **Duration:** Total time spent by this participant in the session.

### Efficient Session Management with Actions

Enhance your session management with streamlined actions:

- **Kick Out Participants:** Effortlessly remove participants from ongoing sessions.
- **Detailed Participant Analytics:**
  - Hover over a specific participant row to reveal the `View Stats` button at the end of the row.
  - Click on `View Stats` to open the Participant Overview sidebar.
![View Stats](https://cdn.videosdk.live/website-resources/docs-resources/view_stats.png)

## Explore Participant Insights

Discover valuable participant data that provides a clear view of engagement and experience:

- **Participant ID:** Unique identifier for each participant.
- **Participant Name:** Personalized identification for better interaction.
- **Joined At:** Indicates the precise moment when the participant connects to the session.
- **Left At:** Indicates the precise moment when the participant left the session.
- **Total Duration:** Total duration of the participant within that session.
- **Joining Time:** Time taken by this participant to establish the connection.
- **Location:** Approximate geographic location from where the participant joined.
- **Platform:** Specifies whether participants are using a desktop or mobile device.
- **Device Info:** Offers details regarding the participant's device.
- **OS:** Provides information about the participant's OS.
- **Browser:** Provides specifics about the participant's browser.
- **SDK Version:** Indicates the version of the SDK used by participants.

### Understanding Participant Call Health

We've developed call health to offer a rapid assessment of participant performance during calls. This feature highlights the performance of audio, video, and screen-sharing audio and video separately. We've utilized color theory, with green indicating good performance, orange for moderate, and red for poor performance, to enhance clarity. For detailed insights, simply hover over the bars.

![Participant Call Health](https://cdn.videosdk.live/website-resources/docs-resources/participant_call_health.png)

## Participant Session Stats

Dive into Participant Session Stats for valuable insights into your session experience! With just a click on `View Session Stats` at the bottom of the page, unlock a treasure trove of data crucial for understanding audio and video experiences.
Within this section, you can observe quality metrics for the selected participant, comparing them to others. Additionally, the `Session Stats` tab covers a quality stats comparison for every participant. Keep reading to explore more about Session Stats.

**Visualizing Session Performance from Both Sides**

This section provides a two-sided view of your session's metrics.

- **Left Side: Sender Participant Graph** This section displays graphs representing the metrics sent by the sender participant. Here, you can see how the various factors impacted the data you transmitted.
- **Right Side: Receiver Participant Graph** On the right side, you'll find a dropdown menu where you can select a specific participant. Choosing a receiver will display graphs showcasing the metrics **received by that participant**. This allows you to compare the sending experience (left side) with the receiving experience (right side) for different participants.

![VideoSDK Jitter Graph](https://cdn.videosdk.live/website-resources/docs-resources/video_jitter_graph.png)

**See How Your Session Performed**

These metrics give you a clear picture of your session's quality. Understanding them helps you spot any issues and keep things running smoothly. Let's dive into what each metric means!

**Jitter:** Imagine your internet connection like a bumpy road. Jitter is how much those bumps cause your signal to bounce around. Less jitter means a smoother ride for your data (audio and video, threshold ≤ 30ms).

**RTT (Latency):** This is how long it takes for data to travel between you and the server you're connected to. Think of it like the time it takes for a message to get delivered – a lower RTT means a faster delivery (affects both audio and video, threshold ≤ 300ms).

**Bitrate:** This measures the amount of data flowing through your connection per second. Imagine it like the width of a water pipe – a higher bitrate allows more data to flow, which can be good for high-quality audio/video or fast downloads.
**Packet Loss:** Think of data traveling in tiny packets. Packet loss is when some of those packets get dropped along the way. More packet loss means information might be missing, affecting things like audio/video quality or lag in games (threshold ≤ 5%).

**Resolution:** This refers to the sharpness and detail of an image or video. Think of it like the number of pixels on your screen – a higher resolution means a crisper picture (**Video Only**).

**FPS (Frames Per Second):** This measures how many images (frames) are displayed on your screen every second. Imagine it like a flipbook – a higher FPS creates a smoother and more fluid animation or video experience (**Video Only**).

![VideoSDK FPS Graph](https://cdn.videosdk.live/website-resources/docs-resources/video_fps_graph.png)

![VideoSDK Resolution Graph](https://cdn.videosdk.live/website-resources/docs-resources/video_resolution_graph.png)

---

# **Investigate Session Errors**

Get to the root of smoother sessions by addressing errors directly. Use the Errors tab to explore details of errors encountered during your session.

Note: Session errors are visible with JS SDK v0.0.82 or higher, React SDK v0.1.85 or higher, and React Native v0.1.6 or newer versions.

- **Participant Name & ID:** Quickly identify the participant associated with each error for swift resolution.
- **Error Types:** Understand the different types of errors encountered, such as network issues or connection disruptions.
- **Detailed Descriptions:** Access clear explanations of errors to take actionable steps towards a solution.

![VideoSDK Error Details](https://cdn.videosdk.live/website-resources/docs-resources/error_new.png)

# **Analyse Session Stats**

Similar to `Participant Session Stats`, this tab covers quality statistics sent by individual participants. You can select any participant to compare effectively with others.
Choose a sender participant from the dropdown menu on the left and select a recipient on the right to compare data over different time frames. This allows you to identify and explore issues within specific durations.

This tab covers the same metrics as covered in the `Participant Session Stats`: Jitter, RTT (Latency), Bitrate, Packet-loss, Resolution (Video only), and FPS (Video only).

![VideoSDK Session Stats](https://cdn.videosdk.live/website-resources/docs-resources/session_stats.png)

---

# Understanding Call Quality - Android

When developing a video call app, customer satisfaction heavily depends on the app's video and audio quality and its fluctuation.

## Call Quality

From the user's perspective, good video quality is defined as smooth and clear video, along with crystal clear audio. Developers consider good video quality as high-resolution, high-frame-rate video with minimal latency and high-bitrate audio with minimal audio loss.

## Factors affecting Quality

When measuring video and audio quality, several variables come into play. The common factors affecting quality are as follows:

### `1. Network Bandwidth`

- Network bandwidth is the measure of a user's network capacity, indicating how much data can be received and sent.
- If the participant's bandwidth is low, they are likely to experience pixelated or frozen video, along with robotic voice or minor audio interruptions.
- Frequent changes in network providers can also lead to significant fluctuations in internet bandwidth, resulting in poor video quality.

### `2. Latency`

- Latency refers to the time it takes for data to transfer from one machine to another.
- If the meeting is hosted in the `Ohio` region and users are joining from the `Singapore` region, this can result in a long delay for data to transfer between machines. Therefore, it is advisable to choose a server based on your user base.
- With VideoSDK, you can specify the server during the creation of a Meeting/Room.

### `3. CPU Usage`

- CPU usage is also a determining factor, as all the audio and video streams going out and coming in need to be encoded or decoded, requiring a significant amount of computation.
- The higher the resolution and frame rate of the videos, the greater the computation required, which can lead to a bottleneck in delivering good-quality video.
- If the CPU is heavily consumed, it can also result in choppy or robotic audio.

### `4. Camera and Mic Quality`

- The camera and microphone should capture high-quality streams to ensure that remote users don't receive a low-quality stream even if they have bad network bandwidth.

## Identifying various issues related to Quality

In order to identify potential issues, VideoSDK collects various audio and video-related metrics that can help pinpoint quality concerns. Take a look at these metrics and understand what they indicate.

### `1. Resolution and Framerate`

- Resolution and frame rate serve as crucial metrics for video quality in a video call app. Resolution indicates the number of pixels in a video image, while frame rate denotes the number of frames displayed per second.
- While higher resolutions and frame rates can enhance video quality, they also demand more bandwidth and processing power. It's essential to optimize these metrics based on the devices and network conditions of your users.
- For instance, if the majority of your users are on mobile devices, opting for a lower resolution and frame rate may be more suitable to ensure smooth playback and minimize bandwidth usage.
- If your user base comprises both mobile and desktop devices, adopting higher resolution for desktops and mid-resolutions for mobile devices can contribute to improved performance and quality while conserving bandwidth on mobile devices.

### `2. Bitrate`

- Bitrate represents the number of bits per second transmitted or received during the transmission of audio or video streams.
It is a crucial parameter for assessing the quality of audio or video streams and should be adjusted for each resolution to strike a balance between performance and bandwidth utilization.
- In scenarios with excellent network conditions, a higher bitrate can lead to significantly improved video quality. However, it's essential to be cautious when dealing with very high bitrates on mobile devices, as they may result in heating issues due to the substantial computational requirements for encoding and decoding videos.

##### Example

We used the same phone **(iPhone 14)** for both participants, but there were differences in resolution and bitrate. The first participant had a resolution of **`1280x720`** with a bitrate of **`1442 kbps`**, while the second participant had a resolution of **`960x540`** with a bitrate of **`642 kbps`**. Surprisingly, both participants' videos appeared to be of equal quality despite variations in resolution and bitrate.

![resolution-and-bitrate](/img/resolution-and-bitrate.png)

### `3. Packet Loss`

- Packet loss is a metric that reveals the number of lost data packets during transmission from the sender to the receiver. It can happen due to network congestion, hardware or software issues, or network latency. Increased packet loss can lead to degraded video and audio quality, as the absence of packets may cause gaps or distortions in the media stream.
### `4. Jitter`

- The audio and video packets are sent out at random intervals over a specified time frame as they travel between the server and client. Jitter occurs when there is a variation in transmitting or receiving these data packets due to a faulty network connection.
- When data packets experience delays during transmission to the participant, usually because the network is busy, they may result in pixelated video during a video call or sound that is choppy, distorted, or robotic in a voice call upon arrival. This creates jitter, with packets arriving at random intervals.

### `5. Round Trip Time (Latency)`

- Round trip time refers to the duration it takes for data packets to be transmitted from the user's device to the server and back. If the servers are located far from the user's location, users may experience high latency (delay).
- With VideoSDK, this factor is addressed as we automatically choose the nearest available server for participants. However, if you are geofencing to a specific region, ensure that you choose the server nearest to your users.
![resolution-and-bitrate](/img/rtt.png)
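The thresholds mentioned earlier (jitter ≤ 30 ms, RTT ≤ 300 ms, packet loss ≤ 5%) can be folded into a simple health check. This is an illustrative sketch: the shape of the `stats` object below is an assumption, not the SDK's actual return type.

```javascript
// Thresholds from this guide: jitter <= 30 ms, RTT <= 300 ms, packet loss <= 5%.
const THRESHOLDS = { jitter: 30, rtt: 300, packetLossPercent: 5 };

// `stats` is a hypothetical shape: { jitter, rtt, packetsLost, packetsTotal }.
function assessQuality(stats) {
  const issues = [];
  if (stats.jitter > THRESHOLDS.jitter) issues.push("high jitter");
  if (stats.rtt > THRESHOLDS.rtt) issues.push("high latency");
  const lossPercent = (stats.packetsLost / stats.packetsTotal) * 100;
  if (lossPercent > THRESHOLDS.packetLossPercent) issues.push("packet loss");
  return { healthy: issues.length === 0, issues };
}

// 2% loss with low jitter and RTT is within all thresholds:
console.log(assessQuality({ jitter: 12, rtt: 140, packetsLost: 20, packetsTotal: 1000 }));
// 45 ms jitter, 350 ms RTT, and 8% loss breach all three:
console.log(assessQuality({ jitter: 45, rtt: 350, packetsLost: 80, packetsTotal: 1000 }));
```

A UI could map the result onto the green/orange/red call-health colors described in the dashboard section.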
## Checking Realtime Statistics

VideoSDK provides methods to check the realtime statistics of audio and video for all the participants.

### `getVideoStats()`

- The `getVideoStats()` method, available on the `Participant` class, returns an object containing the different quality parameters for the video stream.
- This object will contain values for the specific participant's video stream, including resolution, frame rate, bitrate, jitter, round trip time, and packet loss.
### `getAudioStats()`

- The `getAudioStats()` method, available on the `Participant` class, returns an object containing the different quality parameters for the audio stream.
- This object will contain values for the specific participant's audio stream, including bitrate, jitter, round trip time, and packet loss.

### `getShareStats()`

- The `getShareStats()` method, available on the `Participant` class, returns an object containing the different quality parameters for the screen share stream.
- This object will contain values for the specific participant's screen share stream, including resolution, frame rate, bitrate, jitter, round trip time, and packet loss.

### `getShareAudioStats()`

- The `getShareAudioStats()` method, available on the `Participant` class, returns an object containing the different quality parameters for the screen share audio stream.
- This object will contain values for the specific participant's screen share audio stream, including bitrate, jitter, round trip time, and packet loss.

:::note
To show the popup dialog for the participant's realtime stats, you can [refer to this component](https://github.com/videosdk-live/videosdk-rtc-react-sdk-example/blob/main/src/utils/common.js#L142).
:::

## Quality analysis Graphs

For all sessions conducted using VideoSDK, you can access quality analysis graphs from the [VideoSDK Dashboard](https://app.videosdk.live/meetings/sessions). These graphs help you visualize real-time data and identify spikes in certain parameters during calls, aiding in understanding the reasons for quality issues.

![quality analysis](/img/quality-analysis.png)

## API Reference

The API references for all the methods and events utilized in this guide are provided below.
- [getVideoStats()](/android/api/sdk-reference/participant-class/methods#getvideostats)
- [getAudioStats()](/android/api/sdk-reference/participant-class/methods#getaudiostats)
- [getShareStats()](/android/api/sdk-reference/participant-class/methods#getsharestats)

---

# Change Mode - Android

In a live stream, audience members usually join in `RECV_ONLY` mode, meaning they can only view and listen to the hosts. However, if a host invites an audience member to actively participate (e.g., to speak or present), the audience member can switch their mode to `SEND_AND_RECV` using the `changeMode()` method.

This guide explains how to use the `changeMode()` method and walks through a sample implementation where a host invites an audience member to become a host using PubSub.

### `changeMode()`

- The `changeMode()` method from the `Meeting` class allows a participant to switch between modes during a live stream, for example from audience to host.

#### Example

**Kotlin:**

```kotlin
class LiveStreamActivity : AppCompatActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_live_stream)

        // initialize the meeting
        liveStream = VideoSDK.initMeeting(...)
        ...

        // join meeting
        liveStream!!.join()

        // Button to change mode
        val changeModeBtn: Button = findViewById(R.id.btnChangeMode)
        changeModeBtn.setOnClickListener {
            liveStream!!.changeMode(MeetingMode.SEND_AND_RECV)
        }
    }
}
```

---

**Java:**

```java
public class LiveStreamActivity extends AppCompatActivity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_live_stream);

        // initialize the meeting
        Meeting liveStream = VideoSDK.initMeeting(...);
        ...
        // join meeting
        liveStream.join();

        // Button to change mode
        Button changeModeBtn = findViewById(R.id.btnChangeMode);
        changeModeBtn.setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View v) {
                liveStream.changeMode(MeetingMode.SEND_AND_RECV);
            }
        });
    }
}
```

#### Implementation Guide

#### Step 1: Create a PubSub Topic

- Set up a PubSub topic to send a mode change request from the host to a specific audience member.

**Kotlin:**

```kotlin
fun sendInvite(livestream: Meeting, participantId: String) {
    val pubSubPublishOptions = PubSubPublishOptions().apply {
        isPersist = false
    }
    livestream.pubSub.publish(
        "REQUEST_TO_JOIN_AS_HOST_$participantId", // PubSub topic specific to participant
        "SEND_AND_RECV", // message
        pubSubPublishOptions
    )
}
```

---

**Java:**

```java
void sendInvite(Meeting livestream, String participantId) {
    PubSubPublishOptions pubSubPublishOptions = new PubSubPublishOptions();
    pubSubPublishOptions.setPersist(false);
    livestream.pubSub.publish(
        "REQUEST_TO_JOIN_AS_HOST_" + participantId, // PubSub topic specific to participant
        "SEND_AND_RECV", // message
        pubSubPublishOptions
    );
}
```

#### Step 2: Create an Invite Button

- Add an "Invite on Stage" button for each audience member. When clicked, it publishes a PubSub message with the mode "SEND_AND_RECV" to that participant.
**Kotlin:**

```kotlin
// In ParticipantListAdapter.kt, inside showPopup method
if (participant!!.mode == "RECV_ONLY") {
    popup.menu.add("Add as a co-host")
    popup.setOnMenuItemClickListener { item: MenuItem ->
        if (item.toString() == "Add as a co-host") {
            sendInvite(liveStream, participantId)
            holder.requestedIndicator.visibility = View.VISIBLE
            holder.btnParticipantMoreOptions.isEnabled = false
            return@setOnMenuItemClickListener true
        }
        false
    }
}
```

---

**Java:**

```java
// In ParticipantListAdapter.java, inside showPopup method
if ("RECV_ONLY".equals(participant.getMode())) {
    popup.getMenu().add("Add as a co-host");
    popup.setOnMenuItemClickListener(new PopupMenu.OnMenuItemClickListener() {
        @Override
        public boolean onMenuItemClick(MenuItem item) {
            if ("Add as a co-host".equals(item.toString())) {
                sendInvite(liveStream, participantId);
                holder.requestedIndicator.setVisibility(View.VISIBLE);
                holder.btnParticipantMoreOptions.setEnabled(false);
                return true;
            }
            return false;
        }
    });
}
```

#### Step 3: Create a Listener to Change the Mode

- On the audience side, subscribe to the specific PubSub topic. When a mode change request is received, update the participant's mode using `changeMode()`.

**Kotlin:**

```kotlin
// In your class where you define coHostListener
val coHostListener = object : PubSubMessageListener {
    override fun onMessageReceived(pubSubMessage: PubSubMessage) {
        showCoHostRequestDialog()
    }

    override fun onOldMessagesReceived(messages: List<PubSubMessage>) {
        // Persisted message list
    }
}

liveStream.pubSub.subscribe(
    "REQUEST_TO_JOIN_AS_HOST_${liveStream.localParticipant.id}",
    coHostListener
)

// In the showCoHostRequestDialog method
private fun showCoHostRequestDialog() {
    // Dialog setup code...
    acceptBtn.setOnClickListener {
        liveStream.changeMode(MeetingMode.SEND_AND_RECV)
    }
    // Rest of the dialog code...
}
```

---

**Java:**

```java
// In your class where you define coHostListener
PubSubMessageListener coHostListener = new PubSubMessageListener() {
    @Override
    public void onMessageReceived(PubSubMessage pubSubMessage) {
        showCoHostRequestDialog();
    }

    @Override
    public void onOldMessagesReceived(List<PubSubMessage> messages) {
        // Persisted message list
    }
};

liveStream.pubSub.subscribe("REQUEST_TO_JOIN_AS_HOST_" + liveStream.getLocalParticipant().getId(), coHostListener);

// In the showCoHostRequestDialog method
private void showCoHostRequestDialog() {
    // Dialog setup code...
    acceptBtn.setOnClickListener(new View.OnClickListener() {
        @Override
        public void onClick(View v) {
            liveStream.changeMode(MeetingMode.SEND_AND_RECV);
        }
    });
    // Rest of the dialog code...
}
```

## API Reference

The API references for all the methods and events utilized in this guide are provided below.

- [changeMode()](/android/api/sdk-reference/meeting-class/methods#changemode)
- [Participant](/android/api/sdk-reference/participant-class/properties)
- [Meeting](/android/api/sdk-reference/meeting-class/properties)
- [pubSub()](/android/api/sdk-reference/pubsub-class/introduction)

---

# Remove Participant - Android

When hosting a live stream, it's essential for the host to have the capability to remove a participant from the live stream. This can be useful in various scenarios where a participant is causing a disturbance, behaving inappropriately, or not following the guidelines. This guide focuses on this very aspect of removing a participant from the live stream.

VideoSDK provides three ways to do so:

1. [Using SDK](#1-using-sdk)
2. [Using VideoSDK Dashboard](#2-using-videosdk-dashboard)
3. [Using REST API](#3-using-rest-api)

## 1. Using SDK

### `remove()`

The `remove()` method allows for the removal of a participant during an ongoing session. This can be helpful when moderation is required in a particular live stream.

**Kotlin:**

```kotlin
btnRemoveParticipant!!.setOnClickListener { _: View?
    ->
    val remoteParticipantId = ""

    // Get specific participant instance
    val participant = meeting!!.participants[remoteParticipantId]

    // Remove participant from active session
    // This will emit an event called "onParticipantLeft" for that particular participant
    participant!!.remove()
}
```

---

**Java:**

```java
btnRemoveParticipant.setOnClickListener(new View.OnClickListener() {
    @Override
    public void onClick(View v) {
        String remoteParticipantId = "";

        // Get specific participant instance
        Participant participant = meeting.getParticipants().get(remoteParticipantId);

        // Remove participant from active session
        // This will emit an event called "onParticipantLeft" for that particular participant
        participant.remove();
    }
});
```

### Events associated with remove()

The following callbacks are received when a participant is removed from the meeting.

- The participant who was removed from the meeting will receive a callback on the [`onMeetingLeft`](/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onmeetingleft) event of the `Meeting` class.
- All other [remote participants](/android/guide/video-and-audio-calling-api-sdk/concept-and-architecture#2-participant) will receive a callback on [`onParticipantLeft`](/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onparticipantleft) with the `Participant` object.

## 2. Using VideoSDK Dashboard

- To remove a participant using the VideoSDK Dashboard, navigate to the session page on the [VideoSDK Dashboard](https://app.videosdk.live/meetings/sessions). Select the specific session, and from the list of participants, choose the participant you wish to remove. Utilize the provided options to remove the selected participant from the session.
## 3. Using REST API

- You can also remove a particular participant from the live stream [using the REST API](/api-reference/realtime-communication/remove-participant).
- To employ this method, you need the `sessionId` of the live stream and the `participantId` of the individual you intend to remove.

## API Reference

The API references for all the methods and events utilized in this guide are provided below.

- [remove()](/android/api/sdk-reference/participant-class/methods#remove)
- [onMeetingLeft()](/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onmeetingleft)
- [onParticipantLeft()](/android/api/sdk-reference/meeting-class/meeting-event-listener-class#onparticipantleft)

---

# Audience Polls during Live Stream - Android

Interactive polls are a great way to increase engagement during livestreams. Using VideoSDK's PubSub mechanism, you can easily implement real-time audience polling, where viewers can vote and see live results instantly. This guide walks you through how to create, send, and visualize poll results during a livestream.

## Step 1: Creating and Publishing a Poll

To initiate a poll, use the `PubSub` class with a `POLL` topic. The poll structure should include a question and multiple options. This message will be published to all participants.

**Kotlin:**

```kotlin
class CreatePollDialog() {
    companion object {
        const val POLL_TOPIC = "POLL"
    }

    fun show() {
        // ... UI setup code ...
        dialog.getButton(AlertDialog.BUTTON_POSITIVE).setOnClickListener {
            // Validate inputs

            // Create poll object
            val options = mutableListOf(option1, option2)
            if (option3.isNotEmpty()) options.add(option3)
            if (option4.isNotEmpty()) options.add(option4)
            val poll = SimplePoll(question, options)

            // Publish poll to all participants
            val pubSubPublishOptions = PubSubPublishOptions()
            pubSubPublishOptions.isPersist = true
            liveStream.pubSub.publish(POLL_TOPIC, poll.toJsonString(), pubSubPublishOptions)

            // Show results dialog to host
            PollResultsDialog(context, liveStream).show(poll)
            dialog.dismiss()
        }
    }
}
```

---

**Java:**

```java
public class CreatePollDialog {
    public static final String POLL_TOPIC = "POLL";

    public CreatePollDialog() {
        // default constructor
    }

    public void show() {
        // ... UI setup code ...
        dialog.getButton(AlertDialog.BUTTON_POSITIVE).setOnClickListener(v -> {
            // Validate inputs

            // Create poll object
            List<String> options = new ArrayList<>();
            options.add(option1);
            options.add(option2);
            if (!option3.isEmpty()) options.add(option3);
            if (!option4.isEmpty()) options.add(option4);
            SimplePoll poll = new SimplePoll(question, options);

            // Publish poll to all participants
            PubSubPublishOptions pubSubPublishOptions = new PubSubPublishOptions();
            pubSubPublishOptions.setPersist(true);
            liveStream.pubSub.publish(POLL_TOPIC, poll.toJsonString(), pubSubPublishOptions);

            // Show results dialog to host
            new PollResultsDialog(context, liveStream).show(poll);
            dialog.dismiss();
        });
    }
}
```

## Step 2: Subscribing to Polls and Displaying Options

Participants can listen to the `POLL` topic and render voting options dynamically based on the incoming data.

**Kotlin:**

```kotlin
class PollVotingDialog() {
    companion object {
        const val POLL_RESPONSE_TOPIC = "POLL_RESPONSE"
    }

    fun show(poll: SimplePoll) {
        // ... UI setup code ...
        // Create option buttons dynamically
        poll.options.forEach { option ->
            val button = Button(context)
            button.text = option
            button.setOnClickListener {
                // Submit vote
                val response = SimplePollResponse(
                    pollId = poll.id,
                    option = option,
                    participantId = liveStream.localParticipant.id,
                    participantName = liveStream.localParticipant.displayName
                )
                liveStream.pubSub.publish(POLL_RESPONSE_TOPIC, response.toJsonString())

                // Disable all buttons after voting
                for (i in 0 until optionsContainer.childCount) {
                    optionsContainer.getChildAt(i).isEnabled = false
                }
            }
            optionsContainer.addView(button)
        }
    }
}
```

---

**Java:**

```java
public class PollVotingDialog {
    public static final String POLL_RESPONSE_TOPIC = "POLL_RESPONSE";

    public void show(SimplePoll poll) {
        // ... UI setup code ...

        // Create option buttons dynamically
        for (String option : poll.getOptions()) {
            Button button = new Button(context);
            button.setText(option);
            button.setOnClickListener(v -> {
                // Submit vote
                SimplePollResponse response = new SimplePollResponse(
                    poll.getId(),
                    option,
                    liveStream.getLocalParticipant().getId(),
                    liveStream.getLocalParticipant().getDisplayName()
                );
                liveStream.pubSub.publish(POLL_RESPONSE_TOPIC, response.toJsonString());

                // Disable all buttons after voting
                for (int i = 0; i < optionsContainer.getChildCount(); i++) {
                    optionsContainer.getChildAt(i).setEnabled(false);
                }
            });
            optionsContainer.addView(button);
        }
    }
}
```

## Step 3: Aggregating and Displaying Poll Results

The host can subscribe to the `POLL_RESPONSE` topic to collect responses and render the results in real time.

**Kotlin:**

```kotlin
class PollResultsDialog() {
    private val pollResults = ConcurrentHashMap<String, Int>()
    private var totalVotes = 0

    companion object {
        const val POLL_RESPONSE_TOPIC = "POLL_RESPONSE"
    }

    fun show(poll: SimplePoll) {
        // ... UI setup code ...
        // Initialize results map with 0 votes for each option
        poll.options.forEach { option ->
            pollResults[option] = 0
        }

        // Create initial result bars
        updateResultBars()

        // Listen for poll responses
        responsesListener = object : PubSubMessageListener {
            override fun onMessageReceived(pubSubMessage: PubSubMessage) {
                try {
                    val response = SimplePollResponse.fromJsonString(pubSubMessage.message)
                    val option = response.option

                    // Update vote count
                    pollResults[option] = (pollResults[option] ?: 0) + 1
                    totalVotes++

                    // Update UI on main thread
                    (context as? android.app.Activity)?.runOnUiThread {
                        updateResultBars()
                    }
                } catch (e: Exception) {
                    e.printStackTrace()
                }
            }

            override fun onOldMessagesReceived(messages: List<PubSubMessage>) {
                // Persisted message list
                Log.d("VideoSDK", "messages: $messages")
            }
        }

        // Subscribe to poll responses
        liveStream.pubSub.subscribe(POLL_RESPONSE_TOPIC, responsesListener)
    }

    private fun updateResultBars() {
        // Clear previous results
        optionsContainer?.removeAllViews()

        // Create result bars for each option
        pollResults.forEach { (option, votes) ->
            val percentage = if (totalVotes > 0) (votes * 100) / totalVotes else 0
            // Create progress bar UI showing percentage
            // ... UI code to display results ...
        }
    }
}
```

---

**Java:**

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.Map;

public class PollResultsDialog {
    private ConcurrentHashMap<String, Integer> pollResults = new ConcurrentHashMap<>();
    private int totalVotes = 0;

    public static final String POLL_RESPONSE_TOPIC = "POLL_RESPONSE";

    public void show(SimplePoll poll) {
        // ... UI setup code ...
        // Initialize results map with 0 votes for each option
        for (String option : poll.getOptions()) {
            pollResults.put(option, 0);
        }

        // Create initial result bars
        updateResultBars();

        // Listen for poll responses
        responsesListener = new PubSubMessageListener() {
            @Override
            public void onMessageReceived(PubSubMessage pubSubMessage) {
                try {
                    SimplePollResponse response = SimplePollResponse.fromJsonString(pubSubMessage.getMessage());
                    String option = response.getOption();

                    // Update vote count
                    pollResults.put(option, pollResults.getOrDefault(option, 0) + 1);
                    totalVotes++;

                    // Update UI on main thread
                    if (context instanceof android.app.Activity) {
                        ((android.app.Activity) context).runOnUiThread(new Runnable() {
                            @Override
                            public void run() {
                                updateResultBars();
                            }
                        });
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }

            @Override
            public void onOldMessagesReceived(List<PubSubMessage> messages) {
                // Persisted message list
                Log.d("VideoSDK", "onOldMessagesReceived: " + messages);
            }
        };

        // Subscribe to poll responses
        liveStream.pubSub.subscribe(POLL_RESPONSE_TOPIC, responsesListener);
    }

    private void updateResultBars() {
        // Clear previous results
        if (optionsContainer != null) {
            optionsContainer.removeAllViews();
        }

        // Create result bars for each option
        for (Map.Entry<String, Integer> entry : pollResults.entrySet()) {
            String option = entry.getKey();
            int votes = entry.getValue();
            int percentage = (totalVotes > 0) ? (votes * 100) / totalVotes : 0;
            // Create progress bar UI showing percentage
            // ... UI code to display results ...
        }
    }
}
```

### API Reference

The API references for all the methods and events utilized in this guide are provided below.

- [pubSub()](/android/api/sdk-reference/pubsub-class/introduction)

---


### Why are we using a JWT-based Token?

Token-based authentication allows users to verify their identity by providing a generated API key and secret. We use JWT tokens for authentication because token-based authentication is **widely used** in modern web applications and APIs and offers several benefits over traditional authentication. For example, it can **reduce the risk of credentials being misused**, and it allows for **more fine-grained control** over access to resources. Additionally, tokens can be easily revoked or expired, making it easier to manage access rights.

### How to generate a Token?

To manage secured communication, every participant that connects to the meeting needs an access token. You can easily generate this token by using your `apiKey` and `secret-key`, which you can get from the [VideoSDK Dashboard](https://app.videosdk.live/api-keys).

#### 1. Generating a token from the Dashboard

If you are looking to do **testing or development**, you can generate a temporary token from the [VideoSDK Dashboard's API section](https://app.videosdk.live/api-keys).
:::tip
The best practice for getting a token is to generate it from your backend server, which helps in **keeping your credentials safe**.
:::

#### 2. Generating a token in your backend

- Your server generates an access token using your API key and secret.
- While generating a token, you can provide **expiration time, permissions and roles**, which are discussed later in this section.
- Your client obtains the token from your backend server.
- For token validation, the client passes this token to the VideoSDK server.
- The VideoSDK server will only allow entry into the meeting if the token is valid.

![img2.png](../../static/img/authentication-and-token.png)

Follow our official example repositories to set up the token API: [videosdk-rtc-api-server-examples](https://github.com/videosdk-live/videosdk-rtc-api-server-examples)

### Payload while generating token

```js
{
  apikey: API_KEY, // MANDATORY
  permissions: [`allow_join`], // `ask_join` || `allow_mod` // MANDATORY
  version: 2, // OPTIONAL
  roomId: ROOM_ID, // OPTIONAL
  participantId: PARTICIPANT_ID, // OPTIONAL
  roles: ['crawler', 'rtc'], // OPTIONAL
}
```

- **`apikey` (Mandatory)**: This must be the API Key generated from the VideoSDK Dashboard. You can get it from [here](https://app.videosdk.live/api-keys).
- **`permissions` (Mandatory)**: By providing permissions, you can control what a participant can do in the meeting and whether they can join the meeting directly. Available permissions are:
  - **`allow_join`**: The participant is **allowed to join** the meeting directly.
  - **`ask_join`**: The participant is required to **ask for permission to join** the meeting. A participant with the `allow_join` permission can accept or reject a participant whenever someone wants to join.
  - **`allow_mod`**: The participant is **allowed to toggle** the webcam & mic of other participants.
- **`version` (optional)**: For accessing the [v2 API](/api-reference/realtime-communication/intro), you need to provide `2` as the version value.
  - For passing the `roomId`, `participantId` or `roles` parameters in the payload, it is essential to set the version value to `2`.
- **`roomId` (optional)**: To provide customised access control, you can make the token applicable to a particular room by including the `roomId` in the payload.
- **`participantId` (optional)**: You can include the `participantId` in the payload to limit the token's access to a particular participant.
- **`roles` (optional)**:
  - **`crawler`**: This role is only for accessing the [v2 API](/api-reference/realtime-communication/intro); you cannot use this token for running the `Meeting`/`Room`.
  - **`rtc`**: This role is only allowed for running the `Meeting`/`Room`; with this role you cannot use the [server-side APIs](/api-reference/realtime-communication/intro).

Then, you have to sign this payload with your **`SECRET KEY`** and `jwt` options using the **`HS256 algorithm`**.

### Expiration time

You can set any expiration time on the token. However, in a **production environment**, it is recommended to generate a token with a **short expiration time**, so that if someone gets hold of the token, it won't be valid for long.

### What happens if the token is expired?

If your token is expired, the user won't be able to join the meeting, and all API calls will return an error with the message `Token is invalid or expired`.

:::note
The token is validated only once, while joining the meeting. So if a person joins the meeting and the token expires after that, there won't be any issue in the current meeting.
:::

## How to check the validity of a token?

1. After generating the token, visit [jwt.io](https://jwt.io) and paste your token in the given area.
2. You will be able to see the payload you passed while generating the token, as well as the expiration time and token creation time.
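The HS256 signing step described above can be sketched in plain Java using only the standard library (`javax.crypto` and `java.util.Base64`). This is a minimal illustration, not production code: the `TokenSketch` class name and the key/payload values are placeholders, and in practice you would use a maintained JWT library on your backend and never expose the secret to clients:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Minimal HS256 JWT signer: base64url(header) + "." + base64url(payload),
// signed with the secret using HMAC-SHA256.
public class TokenSketch {

    public static String sign(String payloadJson, String secret) throws Exception {
        Base64.Encoder b64 = Base64.getUrlEncoder().withoutPadding();

        // Fixed JWT header declaring the HS256 algorithm
        String header = b64.encodeToString(
                "{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8));
        String payload = b64.encodeToString(payloadJson.getBytes(StandardCharsets.UTF_8));
        String signingInput = header + "." + payload;

        // HMAC-SHA256 over "header.payload" with the secret key
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
        String signature = b64.encodeToString(mac.doFinal(signingInput.getBytes(StandardCharsets.UTF_8)));

        return signingInput + "." + signature;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder payload mirroring the mandatory fields from the section above.
        String payload = "{\"apikey\":\"YOUR_API_KEY\",\"permissions\":[\"allow_join\"]}";
        System.out.println(sign(payload, "YOUR_SECRET")); // three dot-separated base64url segments
    }
}
```

Pasting the resulting token into jwt.io, as described above, should show the header and payload you signed.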
![img1.png](../../static/img/validate-token.png)

## API Settings & Controls

The VideoSDK Dashboard provides a powerful [API Settings](https://app.videosdk.live/api-keys) section where you can configure various services and controls for your meetings. These settings allow you to customize the behavior of different features directly from the dashboard without requiring code changes.

## Audio/Video Configurations

:::note
Runtime configuration from the dashboard has higher precedence than static SDK configurations. If a value is modified in the dashboard, it will override any previously defined SDK settings for that session.

Example: Code: Video = HD → Dashboard: Video = Full HD → Applied: Full HD
:::

### Maximum Send Bitrate Per Participant

Control the maximum total upload bandwidth for all media streams from a single participant.

![Maximum Send Bitrate](https://assets.videosdk.live/images/Screenshot%202026-01-09%20at%2012.00.11%E2%80%AFPM.png)

**Available Options:**

- **Auto**: Removes the upload bitrate limit (recommended for best quality)
- **SD (800kbps)**: Standard definition bitrate
- **HD (1500kbps)**: High definition bitrate
- **Full HD (3000kbps)**: Full high definition bitrate

:::tip
Setting this to **Auto** provides the best video quality as it removes upload bitrate constraints, allowing adaptive streaming based on network conditions.
:::

## Recording & Streaming Configuration

:::note
Recording and HLS follow a precedence-based configuration model:

- API/SDK configuration → Highest priority
- Dashboard configuration → Used only when no API values are defined

Any parameters explicitly set via the API will not be overridden by dashboard changes.
:::

### Recording Configuration

Configure recording settings for your meetings, including layout, quality, and auto-start options.
![Recording Settings](https://assets.videosdk.live/images/Screenshot%202026-01-09%20at%2012.01.00%E2%80%AFPM.png)

**Recording Controls:**

- **Enabled**: Toggle to enable/disable recording functionality
- **Enabled Modes**: Choose between Video & Audio or Audio Only recording
- **Start Recording Automatically**: Auto-start recording when the meeting begins
- **Layout Style**: Select Grid layout for recording composition
- **Maximum Tiles In Grid**: Set the number of participants visible in the grid (1-25)
- **Who to Prioritize**: Choose Active Speaker or Pinned Participant
- **Theme**: Select System Default, Light, or Dark theme for recordings
- **Video Orientation**: Choose between Landscape or Portrait mode
- **Recording Quality**: Select quality level (e.g., HD Medium)
- **Recording Mode**: Choose between Video & Audio or Audio Only

### HLS Streaming Configuration

Configure HTTP Live Streaming (HLS) settings for broadcasting your meetings.

![HLS Streaming Settings](https://assets.videosdk.live/images/Screenshot%202026-01-09%20at%2012.02.18%E2%80%AFPM.png)

**HLS Streaming Controls:**

- **Enabled**: Toggle to enable/disable HLS streaming
- **Enabled Modes**: Choose between Video & Audio or Audio Only streaming
- **Auto Start HLS**: Automatically start HLS when the meeting begins
- **Record HLS Stream**: Enable recording of the HLS stream
- **Layout Style**: Select Grid layout for stream composition
- **Maximum Tiles In Grid**: Set the number of participants visible (1-25)
- **Who to Prioritize**: Choose Active Speaker or Pinned Participant
- **Theme**: Select System Default, Light, or Dark theme
- **HLS Orientation**: Choose between Landscape or Portrait mode
- **HLS Quality**: Select streaming quality (e.g., HD Medium)
- **HLS Streaming Mode**: Choose between Video & Audio or Audio Only

### Additional Service Controls

Enable or disable additional services for your API key.
![Service Controls](https://assets.videosdk.live/images/Screenshot%202026-01-09%20at%2012.03.08%E2%80%AFPM.png)

**Available Services:**

- **RTMP Output**: Enable RTMP streaming to external platforms
- **SIP Integration**: Enable SIP (Session Initiation Protocol) integration for telephony
- **Realtime Translation**: Enable real-time language translation in meetings

### Transcription & Summary

Configure transcription and domain whitelisting for your API key.

![Transcription Settings](https://assets.videosdk.live/images/Screenshot%202026-01-09%20at%2012.03.50%E2%80%AFPM.png)

**Transcription Controls:**

- **Enabled**: Toggle to enable/disable transcription and summary features
- **Whitelist Domain**: Add domain restrictions for API key usage
  - Enter domain names in the format `https://domain.name`
  - Click "Add Domain" to whitelist specific domains
  - Only whitelisted domains can use this API key
  - When no domains are whitelisted, the API key can be used from any domain

---

# Developer Experience Guidelines - Android

---

# Handle Large Rooms - Android

---

# User Experience Guidelines - Android

---

# Chat during Live Stream - Android

Enhance your live stream experience by enabling real-time audience chat using VideoSDK's `PubSub` class. Whether you're streaming a webinar, online event, or an interactive session, integrating a chat system lets your viewers engage, ask questions, and react instantly. This guide shows how to build a group or private chat interface for a live stream using the Publish-Subscribe (PubSub) mechanism.

This guide focuses on using PubSub to implement chat functionality. If you are not familiar with the PubSub mechanism and the `PubSub` class, you can [follow this guide](/android/guide/video-and-audio-calling-api-sdk/collaboration-in-meeting/pubsub).

## Implementing Chat

### `Group Chat`

1. The first step in creating a group chat is choosing the topic which all the participants will publish and subscribe to, in order to send and receive messages.
   We will be using `CHAT` as the topic for this one.

2. On the send button, publish the message that the sender typed in the `EditText` field.

**Kotlin:**

```kotlin
import androidx.appcompat.app.AppCompatActivity
import android.os.Bundle
import android.view.View
import android.widget.EditText
import android.widget.Toast
import androidx.appcompat.widget.Toolbar
import live.videosdk.rtc.android.Meeting
import live.videosdk.rtc.android.listeners.PubSubMessageListener
import live.videosdk.rtc.android.model.PubSubPublishOptions

class ChatActivity : AppCompatActivity() {
    // Meeting
    var liveStream: Meeting? = null

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_chat)

        /**
         * Here, we have created a 'MainApplication' class, which extends the android.app.Application class.
         * It has a Meeting property with getter and setter methods.
         * In your Android manifest, you must declare the class implementing android.app.Application
         * (add the android:name=".MainApplication" attribute to the existing application tag).
         * In MainActivity.kt, we have set the Meeting property.
         *
         * For Example: (MainActivity.kt)
         * val meeting = VideoSDK.initMeeting(context, meetingId, participantName, micEnabled, webcamEnabled, participantId, mode, multiStream, customTrack, metaData, signalingBaseUrl)
         * (this.application as MainApplication).meeting = meeting
         */

        // Get Meeting
        liveStream = (this.application as MainApplication).meeting

        findViewById<View>(R.id.btnSend).setOnClickListener { sendMessage() }
    }

    private fun sendMessage() {
        // get message from EditText
        val message: String = etmessage.getText().toString()
        if (!TextUtils.isEmpty(message)) {
            val publishOptions = PubSubPublishOptions()
            publishOptions.setPersist(true)

            // Sending the Message using the publish method
            //highlight-next-line
            liveStream!!.pubSub.publish("CHAT", message, publishOptions)

            // Clearing the message input
            etmessage.setText("")
        } else {
            Toast.makeText(
                this@ChatActivity, "Please Enter Message",
                Toast.LENGTH_SHORT
            ).show()
        }
    }
}
```

---

**Java:**

```java
import androidx.appcompat.app.AppCompatActivity;
import android.os.Bundle;
import java.util.List;
import live.videosdk.rtc.android.Meeting;
import live.videosdk.rtc.android.lib.PubSubMessage;
import live.videosdk.rtc.android.listeners.PubSubMessageListener;
import live.videosdk.rtc.android.model.PubSubPublishOptions;

public class ChatActivity extends AppCompatActivity {
    // Meeting
    Meeting liveStream;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_chat);

        /**
         * Here, we have created a 'MainApplication' class, which extends the android.app.Application class.
         * It has a Meeting property with getter and setter methods.
         * In your Android manifest, you must declare the class implementing android.app.Application
         * (add the android:name=".MainApplication" attribute to the existing application tag).
         * In MainActivity.java, we have set the Meeting property.
         *
         * For Example: (MainActivity.java)
         * Meeting meeting = VideoSDK.initMeeting(context, meetingId, participantName, micEnabled, webcamEnabled, participantId, mode, multiStream, customTrack, metaData, signalingBaseUrl, preferredProtocol);
         * ((MainApplication) this.getApplication()).setMeeting(meeting);
         */

        // Get Meeting
        liveStream = ((MainApplication) this.getApplication()).getMeeting();

        findViewById(R.id.btnSend).setOnClickListener(view -> sendMessage());
    }

    private void sendMessage() {
        // get message from EditText
        String message = etmessage.getText().toString();
        if (!message.equals("")) {
            PubSubPublishOptions publishOptions = new PubSubPublishOptions();
            publishOptions.setPersist(true);

            // Sending the Message using the publish method
            //highlight-next-line
            liveStream.pubSub.publish("CHAT", message, publishOptions);

            // Clearing the message input
            etmessage.setText("");
        } else {
            Toast.makeText(ChatActivity.this, "Please Enter Message", Toast.LENGTH_SHORT).show();
        }
    }
}
```

3. The next step is to display the messages others send. For this, `subscribe` to the topic, i.e. `CHAT`, and display all the messages.

**Kotlin:**

```kotlin
class ChatActivity : AppCompatActivity() {
    // PubSubMessageListener
    //highlight-start
    val pubSubMessageListener = object : PubSubMessageListener {
        override fun onMessageReceived(message: PubSubMessage) {
            Log.d("#message", "onMessageReceived: ${message.message}")
        }

        override fun onOldMessagesReceived(messages: List<PubSubMessage>) {
            // Persisted message list
            Log.d("#message", "onOldMessagesReceived: $messages")
        }
    }
    //highlight-end

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_chat)
        //...
// Subscribe for 'CHAT' topic
        //highlight-next-line
        liveStream!!.pubSub.subscribe("CHAT", pubSubMessageListener)
    }
}
```

---

**Java:**

```js
public class ChatActivity extends AppCompatActivity {

    // PubSubMessageListener
    //highlight-start
    PubSubMessageListener pubSubMessageListener = new PubSubMessageListener() {
        @Override
        public void onMessageReceived(PubSubMessage message) {
            Log.d("#message", "onMessageReceived: " + message.getMessage());
        }

        @Override
        public void onOldMessagesReceived(List<PubSubMessage> messages) {
            // Persisted message list
            Log.d("#message", "onOldMessagesReceived: " + messages);
        }
    };
    //highlight-end

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_chat);
        //..

        // Subscribe for 'CHAT' topic
        //highlight-next-line
        liveStream.pubSub.subscribe("CHAT", pubSubMessageListener);
    }
}
```

4. The final step in the group chat is to `unsubscribe` from the topic you previously subscribed to but no longer need. Here, we unsubscribe from the `CHAT` topic when the activity is destroyed.

**Kotlin:**

```js
class ChatActivity : AppCompatActivity() {

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_chat)
        //...
    }

    override fun onDestroy() {
        // Unsubscribe for 'CHAT' topic
        //highlight-next-line
        liveStream!!.pubSub.unsubscribe("CHAT", pubSubMessageListener)
        super.onDestroy()
    }
}
```

---

**Java:**

```js
public class ChatActivity extends AppCompatActivity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_chat);
        //..
    }

    @Override
    protected void onDestroy() {
        // Unsubscribe for 'CHAT' topic
        //highlight-next-line
        liveStream.pubSub.unsubscribe("CHAT", pubSubMessageListener);
        super.onDestroy();
    }
}
```

### `Private Chat` (1:1 between Host and Viewer)

Private messaging is ideal when a host or moderator needs to respond directly to a viewer’s question.
This can be achieved using the `sendOnly` property.

**Kotlin:**

```js
class ChatActivity : AppCompatActivity() {
    //..
    private fun sendMessage() {
        // Get the message from the EditText
        val message: String = etmessage.text.toString()
        if (!TextUtils.isEmpty(message)) {
            val publishOptions = PubSubPublishOptions()
            publishOptions.setPersist(true)

            //highlight-start
            // Pass the participantId of the participant to whom you want to send the message.
            val sendOnly: Array<String> = arrayOf("xyz")
            publishOptions.setSendOnly(sendOnly)
            //highlight-end

            // Send the message using the publish method
            //highlight-next-line
            liveStream!!.pubSub.publish("CHAT", message, publishOptions)

            // Clear the message input
            etmessage.setText("")
        } else {
            Toast.makeText(
                this@ChatActivity, "Please Enter Message",
                Toast.LENGTH_SHORT
            ).show()
        }
    }
}
```

---

**Java:**

```js
public class ChatActivity extends AppCompatActivity {
    //...
    private void sendMessage() {
        // Get the message from the EditText
        String message = etmessage.getText().toString();
        if (!message.equals("")) {
            PubSubPublishOptions publishOptions = new PubSubPublishOptions();
            publishOptions.setPersist(true);

            //highlight-start
            // Pass the participantId of the participant to whom you want to send the message.
            String[] sendOnly = { "xyz" };
            publishOptions.setSendOnly(sendOnly);
            //highlight-end

            // Send the message using the publish method
            //highlight-next-line
            liveStream.pubSub.publish("CHAT", message, publishOptions);

            // Clear the message input
            etmessage.setText("");
        } else {
            Toast.makeText(ChatActivity.this, "Please Enter Message", Toast.LENGTH_SHORT).show();
        }
    }
}
```

### Downloading Chat Messages

All the PubSub messages that were published with `persist : true` can be downloaded as a `.csv` file. This file is available in the VideoSDK dashboard as well as through the [Sessions API](/api-reference/realtime-communication/fetch-session-using-sessionid).
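The PubSub mechanics used above (publish with `persist`, old messages delivered on `subscribe`, fan-out to listeners, and delivery narrowed with `sendOnly`) can be sketched as a tiny in-memory model. This is an illustration only; the class and method shapes below are invented for the sketch and are not part of the VideoSDK API, whose real PubSub runs on VideoSDK's servers.

```kotlin
// Minimal in-memory model of the PubSub flow described in this guide.
class MiniPubSub {
    private data class Subscriber(val participantId: String, val onMessage: (String) -> Unit)

    private val subscribers = mutableMapOf<String, MutableList<Subscriber>>()
    private val persisted = mutableMapOf<String, MutableList<String>>()

    // Subscribing returns the persisted history, mirroring onOldMessagesReceived.
    fun subscribe(topic: String, participantId: String, onMessage: (String) -> Unit): List<String> {
        subscribers.getOrPut(topic) { mutableListOf() }.add(Subscriber(participantId, onMessage))
        return persisted[topic].orEmpty()
    }

    // Publish fans out to every subscriber of the topic; `sendOnly` narrows
    // delivery to specific participantIds, and `persist` stores the message.
    fun publish(topic: String, message: String, persist: Boolean = false, sendOnly: List<String> = emptyList()) {
        if (persist) persisted.getOrPut(topic) { mutableListOf() }.add(message)
        subscribers[topic].orEmpty()
            .filter { sendOnly.isEmpty() || it.participantId in sendOnly }
            .forEach { it.onMessage(message) }
    }

    fun unsubscribe(topic: String, participantId: String) {
        subscribers[topic]?.removeAll { it.participantId == participantId }
    }
}
```

In this model, the persisted list plays the role of the message history that backs both `onOldMessagesReceived` and the downloadable `.csv` export.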
### API Reference

The API references for all the methods and events utilised in this guide are provided below.

- [pubSub()](/android/api/sdk-reference/pubsub-class/introduction)

---

# Cloud Proxy - Android

Our SDK features a Cloud Proxy that allows you to manage how your streaming content is routed through different network paths. This feature ensures compliance with various regional internet regulations by restricting traffic to specified proxy servers, based on your needs and the geographical location of users.

This capability is crucial for adhering to local data protection laws and optimizes the streaming experience by overcoming geographical network constraints. By directing user connections to appropriate regional servers, Cloud Proxy enhances overall service performance and reliability.

Cloud Proxy offers three straightforward operating modes to fit different business and firewall requirements:

- **UDP_OVER_TCP (Default)**: In this mode, the connection starts by attempting to establish a UDP connection for media transmission. If that fails, it automatically shifts to a secure TCP connection, which is compatible with most firewalls.
- **Force UDP**: This mode uses only UDP to send media, ensuring high-quality streams. It's ideal for environments where media quality is critical and the firewall can be configured accordingly.
- **Force TCP**: Only uses TCP for secure media transmission, suitable for strict firewall settings that only permit TCP traffic over certain ports. This might require additional firewall configuration and could affect media quality under poor network conditions.

### In this sequence diagram:

- Client to Proxy Server: The client sends a request to the proxy server, which processes and forwards it based on predefined rules.
- Proxy Server to Destination Server: The proxy server sends the request to the destination server, receives the response, and relays it back to the client.
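The three modes boil down to a simple transport decision. The sketch below captures that decision, including the fallback behavior of the default mode; the enum and function names are illustrative only, not the SDK's API (the real negotiation happens inside the SDK).

```kotlin
// Illustrative-only sketch of the transport choice each Cloud Proxy mode implies.
enum class ProxyMode { UDP_OVER_TCP, FORCE_UDP, FORCE_TCP }

fun chooseTransport(mode: ProxyMode, udpReachable: Boolean): String = when (mode) {
    // Default: try UDP for media quality, fall back to firewall-friendly TCP.
    ProxyMode.UDP_OVER_TCP -> if (udpReachable) "UDP" else "TCP"
    // Force UDP: never falls back; the connection fails if UDP is blocked.
    ProxyMode.FORCE_UDP -> if (udpReachable) "UDP" else "FAILED"
    // Force TCP: always TCP, for strict firewalls that only allow TCP ports.
    ProxyMode.FORCE_TCP -> "TCP"
}
```

The trade-off is visible in the branches: only the default mode both prefers UDP's quality and survives a UDP-blocking firewall.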
:::info
Cloud Proxy is only available under the Enterprise Plan. [Contact Sales](https://www.videosdk.live/contact) for more information.
:::

## Implementation

**Kotlin:**

```js
val liveStream: Meeting = VideoSDK.initMeeting(
    this@MainActivity, "meetingId", "John Doe",
    true, true, null, null, false, null, null, "proxy.yourwebsite.com",
    VideoSDK.PreferredProtocol.UDP_ONLY
)
```

---

**Java:**

```java
Meeting liveStream = VideoSDK.initMeeting(
    MainActivity.this, "meetingId", "John Doe", true, true, null, null, false, null, null, "proxy.yourwebsite.com",
    VideoSDK.PreferredProtocol.UDP_ONLY
);
```

### Parameters

- preferredProtocol:
  - UDP_OVER_TCP (default): Initially, a connection is attempted using UDP; if that fails, it automatically switches to the TCP protocol.
  - UDP_ONLY: Force the UDP protocol.
  - TCP_ONLY: Force the TCP protocol.
- signalingBaseUrl: Proxy URL through which signaling and media are routed to the origin.

## API Reference

The API references for all the methods and events utilized in this guide are provided below.

- [initMeeting()](/android/api/sdk-reference/initMeeting)

---

# Concept and Architecture - Android

Before diving into the concepts, let's understand VideoSDK. VideoSDK is a software development kit that offers tools and APIs for creating apps based on video and audio. It typically includes features such as video and audio calls, chat, cloud recording, simulcasting (RTMP), interactive live streaming (HLS), and many more across a wide range of platforms and devices.

## Concepts

![img.png](../../../../static/img/room-concept.png)

### `1. Meeting / Room`

- The Meeting or Room object in VideoSDK provides a virtual place for participants to interact and engage in real-time voice, video, and screen-sharing sessions. The object is in charge of handling media streams and participant communication.
- A Meeting or Room can be uniquely identified by `meetingId` or `roomId`.

### `2.
Participant`

- A Participant is a VideoSDK object that represents each user/client in the meeting or room and allows them to share audio/video assets.
- `2.1 Local Participant` : The local participant is the one that runs on the user's device. The local participant has control over their own media streams, including the ability to start and stop audio and video.
  - The local participant in a meeting/room can also connect with other participants by transmitting and receiving audio and video streams, exchanging chat messages, and more.
- `2.2 Remote Participant` : The remote participant receives audio and video streams from the local participant and other remote participants, and also has the ability to exchange audio, video, and chat messages with the local participant.
- Each participant in VideoSDK can be uniquely identified by `participantId`.

### `3. MediaStream & Track`

- A MediaStream is a collection of audio and video tracks that can be transmitted between participants in real time.
- A track is a continuous flow of audio or video data and can be thought of as a stream of media frames.
- A MediaStream can contain multiple tracks: one video track for the video feed from the camera and one audio track for the audio feed from the microphone. These tracks can be transmitted between participants in a VideoSDK Meeting / Room.

### `4. Events / Notifications`

- Events / Notifications can be used to inform users about various activities happening in a Meeting / Room, including participant join/leave and new messages. They can also be used to alert users about any SDK-level errors that occur during a call.

### `5. Session`

- A Session is an instance of an ongoing meeting/room which has one or more participants in it. A single room or meeting can have multiple sessions.
- Each session can be uniquely identified by `sessionId`.

![img.png](../../../../static/img/meeting-session.jpg)

---

![img.png](../../../../static/img/recording-hls-rtmp.png)

### `6.
Cloud Recording`

- Cloud recording in VideoSDK refers to the process of recording audio or video content and storing it on a remote server or the VideoSDK server.

### `7. Simulcasting (RTMP)`

- RTMP is a popular protocol for live streaming video content from VideoSDK to platforms such as YouTube, Twitch, Facebook, and others.
- By providing the platform-specific `stream key` and `stream URL`, VideoSDK can connect to the platform's RTMP server and transmit the live video stream.

### `8. HTTP Live Streaming (HLS)`

- Interactive live streaming (HLS) refers to a type of live streaming where viewers can actively engage with the content being streamed and with other viewers in real time.
- In an interactive live stream (HLS), viewers can take part in a variety of activities like live polling, Q&A sessions, and even sending virtual gifts to the content creator or each other.

## Architecture

This diagram demonstrates the end-to-end flow to implement video & audio calls, record calls, and go live on social media.

![Video-sdk-architecture!](/img/video-sdk-archietecture.svg)

---

# Custom Audio Sources - Android

For a high-quality streaming experience, fine-tuning audio tracks becomes essential, especially when delivering content to a broader live audience. To enhance your live audio pipeline, we've introduced the capability to provide a custom audio track for a host's stream both before and during a live session.

## Custom Audio Track

This feature allows you to integrate advanced audio layers like background noise suppression, echo cancellation, and more, so your stream sounds polished and professional to every viewer.

### `How to Create Custom Audio Track ?`

- You can create an audio track using the `createAudioTrack()` method of `VideoSDK`.
- This method can be used to create an audio track using different encoding parameters.
#### Example

**Kotlin:**

```js
val audioCustomTrack: CustomStreamTrack = VideoSDK.createAudioTrack("speech_standard", this)
// `high_quality` | `music_standard`, Default : `speech_standard`
```

---

**Java:**

```js
CustomStreamTrack audioCustomTrack = VideoSDK.createAudioTrack("speech_standard", this);
// `high_quality` | `music_standard`, Default : `speech_standard`
```

- `speech_standard` : This config is optimised for normal voice communication.
- `high_quality` : This config is used for getting RAW audio, on which you can apply your own `noiseConfig`.
- `music_standard` : This config is optimised for communication where sharing musical notes, such as songs or instrumental sounds, is important.

### `How to Setup Custom Audio Track ?`

The custom track can be set up both before and after the initialization of the meeting.

1. [Setting up a Custom Track during the initialization of a meeting](/android/guide/video-and-audio-calling-api-sdk/render-media/optimize-audio-track#1-setting-up-a-custom-track-during-the-initialization-of-a-meeting)
2. [Setting up a Custom Track with methods](/android/guide/video-and-audio-calling-api-sdk/render-media/optimize-audio-track#2-setting-up-a-custom-track-with-methods)

##### 1. Setup during live stream initialization

If you're starting the stream with the mic enabled `(micEnabled: true)` and wish to use a custom track from the beginning, pass it in the config of `initMeeting()`.

:::caution
A custom track will not apply with the `micEnabled: false` configuration.
:::

##### Example

**Kotlin:**

```js
override fun onCreate(savedInstanceState: Bundle?) {
    //..
    val customTracks: MutableMap<String, CustomStreamTrack> = HashMap()

    //highlight-start
    val audioCustomTrack: CustomStreamTrack = VideoSDK.createAudioTrack("high_quality", this)
    customTracks["mic"] = audioCustomTrack // Key must be "mic"
    //highlight-end

    // create a new meeting instance
    val liveStream = VideoSDK.initMeeting(
        this@MainActivity, meetingId, participantName,
        // MicEnabled, if true, it will use the passed custom track to turn the mic on
        true,
        // WebcamEnabled
        true,
        // ParticipantId
        null,
        // Mode
        null,
        // MultiStream
        false,
        // Pass the custom tracks here
        //highlight-next-line
        customTracks,
        // MetaData
        null
    )
}
```

---

**Java:**

```js
@Override
protected void onCreate(Bundle savedInstanceState) {
    //..

    Map<String, CustomStreamTrack> customTracks = new HashMap<>();

    //highlight-start
    CustomStreamTrack audioCustomTrack = VideoSDK.createAudioTrack("high_quality", this);
    customTracks.put("mic", audioCustomTrack); // Key must be "mic"
    //highlight-end

    // create a new meeting instance
    Meeting liveStream = VideoSDK.initMeeting(
        MainActivity.this, meetingId, participantName,
        // MicEnabled, if true, it will use the passed custom track to turn the mic on
        true,
        // WebcamEnabled
        true,
        // ParticipantId
        null,
        // Mode
        null,
        // MultiStream
        false,
        // Pass the custom tracks here
        //highlight-next-line
        customTracks,
        // MetaData
        null
    );
}
```

#### 2. Setup dynamically using methods

During the live stream, you can update the audio source by passing the `CustomStreamTrack` to the `unmuteMic()` method of `Meeting`. You can also pass a custom track to the `changeMic()` method of `Meeting`.

:::tip
Make sure to call the `muteMic()` method before you create a new track, as not doing so may lead to unexpected behavior.
::: ##### Example **Kotlin:** ```js try { val audioCustomTrack: CustomStreamTrack = VideoSDK.createAudioTrack("high_quality", this) liveStream!!.unmuteMic(audioCustomTrack) //or liveStream!!.changeMic(AppRTCAudioManager.AudioDevice.BLUETOOTH, audioCustomTrack) } catch (e: JSONException) { e.printStackTrace() } ``` --- **Java:** ```js try { CustomStreamTrack audioCustomTrack = VideoSDK.createAudioTrack("high_quality", this); liveStream.unmuteMic(audioCustomTrack); //or liveStream.changeMic(AppRTCAudioManager.AudioDevice.BLUETOOTH,audioCustomTrack); }catch (JSONException e) { e.printStackTrace(); } ``` ## API Reference The API references for all the methods and events utilised in this guide are provided below. - [Custom Audio Track](/android/api/sdk-reference/custom-tracks#custom-audio-track---android) --- # Custom ScreenShare Sources - Android To deliver high-quality livestreams, it's essential to fine-tune screen share tracks being broadcasted. Whether you’re hosting a webinar, or going live with a presentation, using custom media tracks gives you better control over stream quality and performance. ## Custom Screen Share Track This feature enables the customization of screenshare streams with enhanced optimization modes and predefined encoder configuration (resolution + FPS) for specific use cases, which can then be sent to other hosts and audience members. ### `How to Create Custom Screen Share Track ?` - You can create a Screen Share track using `createScreenShareVideoTrack()` method of `VideoSDK`. - This method can be used to create video track using different encoding parameters and optimization mode. #### Example **Kotlin:** ```javascript // data is received from onActivityResult method. VideoSDK.createScreenShareVideoTrack( //highlight-next-line // This will accept the height & FPS of video you want to capture. 
"h720p_15fps", // `h360p_30fps` | `h1080p_30fps` // Default : `h720p_15fps` //highlight-next-line // It is Intent received from onActivityResult when user provide permission for ScreenShare. data, //highlight-next-line // Pass Conext this) //highlight-next-line //Callback to this listener will be made when track is ready with CustomTrack as parameter { track -> meeting!!.enableScreenShare(track) } ``` --- **Java:** ```javascript // data is received from onActivityResult method. VideoSDK.createScreenShareVideoTrack( //highlight-next-line // This will accept the height & FPS of video you want to capture. "h720p_15fps", // `h360p_30fps` | `h1080p_30fps` // Default : `h720p_15fps` /highlight-next-line // It is Intent received from onActivityResult when user provide permission for ScreenShare data, //highlight-next-line // Pass Conext this, //highlight-next-line //Callback to this listener will be made when track is ready with CustomTrack as parameter (track)->{meeting.enableScreenShare(track);} ); ``` ### `How to Setup Custom Screen Share Track ?` In order to switch tracks during the meeting, you have to pass the `CustomStreamTrack` in the `enableScreenShare()` method of `Meeting`. :::note Make sure to call `disableScreenShare()` before you create a new track as it may lead to unexpected behavior. ::: ##### Example **Kotlin:** ```javascript @TargetApi(21) private fun askPermissionForScreenShare() { val mediaProjectionManager = application.getSystemService( Context.MEDIA_PROJECTION_SERVICE ) as MediaProjectionManager startActivityForResult( mediaProjectionManager.createScreenCaptureIntent(), CAPTURE_PERMISSION_REQUEST_CODE ) } @RequiresApi(api = Build.VERSION_CODES.LOLLIPOP) override fun onActivityResult(requestCode: Int, resultCode: Int, data: Intent?) 
{
    super.onActivityResult(requestCode, resultCode, data)
    if (requestCode != CAPTURE_PERMISSION_REQUEST_CODE) return
    if (resultCode == RESULT_OK) {
        //highlight-start
        VideoSDK.createScreenShareVideoTrack("h720p_15fps", data, this) { track ->
            liveStream!!.enableScreenShare(track)
        }
        //highlight-end
    }
}
```

---

**Java:**

```javascript
@TargetApi(21)
private void askPermissionForScreenShare() {
    MediaProjectionManager mediaProjectionManager =
        (MediaProjectionManager) getApplication().getSystemService(
            Context.MEDIA_PROJECTION_SERVICE);
    startActivityForResult(
        mediaProjectionManager.createScreenCaptureIntent(),
        CAPTURE_PERMISSION_REQUEST_CODE);
}

@RequiresApi(api = Build.VERSION_CODES.LOLLIPOP)
@Override
public void onActivityResult(int requestCode, int resultCode, Intent data) {
    super.onActivityResult(requestCode, resultCode, data);
    if (requestCode != CAPTURE_PERMISSION_REQUEST_CODE) return;
    if (resultCode == Activity.RESULT_OK) {
        //highlight-start
        VideoSDK.createScreenShareVideoTrack("h720p_15fps", data, this, (track) -> {
            liveStream.enableScreenShare(track);
        });
        //highlight-end
    }
}
```

## API Reference

The API references for all the methods and events utilised in this guide are provided below.

- [Custom Video Track](/android/api/sdk-reference/custom-tracks#custom-video-track---android)
- [Custom Screen Share Track](/android/api/sdk-reference/custom-tracks#custom-screen-share-track---android)

---

# Custom Video Sources - Android

To deliver high-quality livestreams, it's essential to fine-tune the video tracks being broadcasted. Whether you’re hosting a webinar or streaming an event, using custom video tracks gives you better control over stream quality and performance.

## Custom Video Track

This feature can be used to add custom video encoder configurations, an optimization mode (whether you want to focus on motion, text, or detail of the video), and background removal & video filters from an external SDK (e.g., Banuba), and send the resulting track to other participants.
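Throughout these guides, encoder configurations are passed as compact strings such as `h720p_w960p` or `h720p_15fps`. As a mental model only, such a string bundles a capture height, an optional width, and an optional frame rate. The parser below is purely illustrative and is not part of the VideoSDK API:

```kotlin
// Illustrative parser for encoder-config strings like "h720p_w960p" or
// "h720p_15fps". Not VideoSDK API; just a mental model of what they encode.
data class EncoderConfig(val height: Int, val width: Int? = null, val fps: Int? = null)

fun parseEncoderConfig(spec: String): EncoderConfig {
    var height = 0
    var width: Int? = null
    var fps: Int? = null
    for (part in spec.split("_")) {
        when {
            // "h720p" -> capture height of 720 pixels
            part.startsWith("h") && part.endsWith("p") ->
                height = part.removePrefix("h").removeSuffix("p").toInt()
            // "w960p" -> capture width of 960 pixels
            part.startsWith("w") && part.endsWith("p") ->
                width = part.removePrefix("w").removeSuffix("p").toInt()
            // "15fps" -> frame rate of 15 frames per second
            part.endsWith("fps") ->
                fps = part.removeSuffix("fps").toInt()
        }
    }
    return EncoderConfig(height, width, fps)
}
```

Reading the strings this way makes the trade-off explicit: camera configs pin height and width, while screen-share configs pin height and FPS.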
### `How to Create a Custom Video Track ?`

- You can create a video track using the `createCameraVideoTrack()` method of `VideoSDK`.
- This method can be used to create a video track using different encoding parameters, camera facing mode, and optimization mode, and it returns a `CustomStreamTrack`.

#### Example

**Kotlin:**

```javascript
val videoCustomTrack: CustomStreamTrack = VideoSDK.createCameraVideoTrack(
    // highlight-next-line
    // This will accept the resolution (height x width) of video you want to capture.
    "h720p_w960p", // "h720p_w960p" | "h720p_w1280p" ... // Default : "h480p_w640p"
    // highlight-next-line
    // It will specify whether to use the front or back camera for the video track.
    "front", // "back", Default : "front"
    // highlight-next-line
    // We will discuss this parameter in next step.
    CustomStreamTrack.VideoMode.MOTION, // CustomStreamTrack.VideoMode.TEXT, CustomStreamTrack.VideoMode.DETAIL, Default : CustomStreamTrack.VideoMode.MOTION
    // highlight-next-line
    // multiStream - we will discuss this parameter in next step.
    false, // true
    // highlight-next-line
    // Pass Context
    this,
    // highlight-next-line
    // This is an optional parameter. We will discuss this parameter in next step.
    observer)
```

---

**Java:**

```javascript
CustomStreamTrack customStreamTrack = VideoSDK.createCameraVideoTrack(
    // highlight-next-line
    // This will accept the resolution (height x width) of video you want to capture.
    "h480p_w640p", // "h720p_w960p" | "h720p_w1280p" ... // Default : "h480p_w640p"
    // highlight-next-line
    // It will specify whether to use the front or back camera for the video track.
    "front", // "back", Default : "front"
    // highlight-next-line
    // We will discuss this parameter in next step.
    CustomStreamTrack.VideoMode.MOTION, // CustomStreamTrack.VideoMode.TEXT, CustomStreamTrack.VideoMode.DETAIL, Default : CustomStreamTrack.VideoMode.MOTION
    // highlight-next-line
    // multiStream - we will discuss this parameter in next step.
    false, // true
    // highlight-next-line
    // Pass Context
    this,
    // highlight-next-line
    // This is an optional parameter. We will discuss this parameter in next step.
    observer);
```

:::caution
The behavior of custom track configurations is influenced by the capabilities of the device. For example, if you set the encoder configuration to 1080p but the webcam only supports 720p, the encoder configuration will automatically adjust to the highest resolution that the device can handle, which in this case is 720p.
:::

##### What is `optimizationMode`?

- This parameter specifies the optimization mode for the video track being generated.

- `motion` : This type of track focuses more on motion video, such as webcam video, movies, or video games.
  - It will degrade `resolution` in order to maintain `frame rate`.
- `text` : This type of track focuses on significant sharp edges and areas of consistent color that can change frequently, such as presentations or web pages with text content.
  - It will degrade `frame rate` in order to maintain `resolution`.
- `detail` : This type of track focuses more on the details of the video, such as presentations, paintings, or line art.
  - It will degrade `frame rate` in order to maintain `resolution`.

##### What is `multiStream`?

- By enabling multiStream, your livestream will broadcast multiple resolutions (e.g., 720p, 480p, 360p), allowing viewers to receive the best stream quality based on their network.

The **`multiStream : true`** configuration indicates that VideoSDK, by default, sends multiple resolution video streams to the server. For example, if a user's device capability is 720p, VideoSDK sends streams in 720p, 640p, and 480p resolution. This enables VideoSDK to deliver the appropriate stream to each participant based on their network bandwidth.
![Multi Stream True](/img/multistream_true.png)
Setting **`multiStream : false`** restricts VideoSDK to send only one stream, helping to maintain quality by focusing on a single resolution.
![Multi Stream False](/img/multistream_false.png)
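To see why multiStream matters, consider how a server picks which simulcast layer to forward: with several encoded resolutions available, it can serve each viewer the best layer their downlink can sustain, while `multiStream: false` leaves only one layer to offer everyone. The sketch below illustrates this selection; the layer list and bitrates are made-up example numbers, not VideoSDK values, and the real selection happens on VideoSDK's servers.

```kotlin
// Illustrative simulcast-layer selection with invented example bitrates.
data class Layer(val name: String, val kbps: Int)

// `layers` must be ordered from highest to lowest quality; pick the best
// layer whose bitrate fits the viewer's available bandwidth, else the lowest.
fun pickLayer(layers: List<Layer>, availableKbps: Int): Layer =
    layers.firstOrNull { it.kbps <= availableKbps } ?: layers.last()
```

With multiple layers published, a viewer with a 1200 kbps downlink can be given a 1000 kbps 480p layer instead of a stalling 720p stream; with a single published layer, that one stream is all the server can send.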
:::danger
The `setQuality` parameter will not have any effect if multiStream is set to `false`.
:::

### `How to Setup a Custom Video Track ?`

You can plug in your custom video track either before going live or dynamically while the session is ongoing.

1. [Setup during live stream initialization](/android/guide/video-and-audio-calling-api-sdk/render-media/optimize-video-track#1-setup-during-live-stream-initialization)
2. [Setup dynamically using methods](/android/guide/video-and-audio-calling-api-sdk/render-media/optimize-video-track#2-setup-dynamically-using-methods)

##### 1. Setting up a Custom Track during the initialization of a liveStream

If you're starting the stream with the webcam enabled `(webcamEnabled: true)` and wish to use a custom track from the beginning, pass it through the config of initMeeting as shown below.

:::caution
A custom track will not apply with the `webcamEnabled: false` configuration.
:::

##### Example

**Kotlin:**

```js
override fun onCreate(savedInstanceState: Bundle?) {
    //..

    val customTracks: MutableMap<String, CustomStreamTrack> = HashMap()

    //highlight-start
    val videoCustomTrack: CustomStreamTrack = VideoSDK.createCameraVideoTrack(
        "h720p_w960p", "front", CustomStreamTrack.VideoMode.MOTION, false, this)
    customTracks["video"] = videoCustomTrack // Key must be "video"
    //highlight-end

    // create a new meeting instance
    val liveStream = VideoSDK.initMeeting(
        this@MainActivity, meetingId, participantName,
        // MicEnabled
        true,
        // WebcamEnabled, if true, it will use the passed custom track to turn the webcam on
        true,
        // ParticipantId
        null,
        // Mode
        null,
        // MultiStream
        false,
        // Pass the custom tracks here
        //highlight-next-line
        customTracks,
        // MetaData
        null
    )
}
```

#### 2. Setup dynamically using methods

During the live stream, you can update the video source by passing the `CustomStreamTrack` to the `enableWebcam()` method of `Meeting`. Make sure to call `disableWebcam()` before you create a new track, as not doing so may lead to unexpected behavior.

## API Reference

The API references for all the methods and events utilised in this guide are provided below.

- [Custom Video Track](/android/api/sdk-reference/custom-tracks#custom-video-track---android)

---

# Customized Live Stream - Android

VideoSDK is a platform that offers a range of video streaming tools and solutions for content creators, publishers, and developers.

### Custom Template

- A custom template is a template for live streams that allows users to add real-time graphics to their streams.
- With custom templates, users can create unique and engaging video experiences by overlaying graphics, text, images, and animations onto their live streams. These graphics can be customized to match the branding.
- Custom templates enable users to create engaging video content with real-time graphics. With live scoreboards, social media feeds, and other customizations, users can easily create unique and visually appealing streams that stand out from the crowd.

:::note
Custom templates can be used with the recording and RTMP services provided by VideoSDK as well.
:::

### What you can do with Custom Template

Using a custom template, you can create a variety of different modes. Here are a few of the more well-known modes that you can create.

- **`PK Host:`** The host can organise a player vs. player battle. The below image is an example of a gaming battle.
- **`Watermark:`** The host can add and update a watermark anywhere in the template. In the below image, we have added the VideoSDK watermark in the top-right corner of the screen.
- **`News Mode:`** The host can add dynamic text in a lower-third banner. In the below image, we have added some sample text in the bottom-left corner of the screen.

![Mobile Custom Template ](https://cdn.videosdk.live/website-resources/docs-resources/mobile_custom_template.png)

## Custom template with VideoSDK

In this section, we will discuss how Custom Templates work with VideoSDK.

- **`Host`**: The host is responsible for starting the live stream by passing the `templateURL`. The `templateURL` is the URL of the hosted template webpage. The host is also responsible for managing the template, such as changing text, logos, and switching the template layout, among other things.
- **`VideoSDK Template Engine`** : The VideoSDK Template Engine accepts and opens the templateURL in the browser. It listens to all the events performed by the Host and enables customization of the template according to the Host's preferences.
- **`Viewer`**: The viewer can stream the content. They can watch the live stream with the added real-time graphics, which makes for a unique and engaging viewing experience.
![custom template](https://cdn.videosdk.live/website-resources/docs-resources/custom_template.png)

### Understanding Template URL

The template URL is the webpage that the VideoSDK Template Engine will open while composing the live stream. The template URL will appear as shown below.

![template url](https://cdn.videosdk.live/website-resources/docs-resources/custom_template_url.png)

The Template URL consists of two parts:

- Your actual page URL, which will look something like `https://example.com/videosdk-template`.
- Query parameters, which will allow the VideoSDK Template Engine to join the meeting when the URL is opened.

There are a total of three query parameters:

- `token`: This will be your token, which will be used to join the meeting.
- `meetingId`: This will be the meeting ID that will be joined by the VideoSDK Template Engine.
- `participantId`: This will be the participant ID of the VideoSDK Template Engine, which should be passed while joining the template engine in your template so that the template engine participant is not visible to other participants. **This parameter will be added by VideoSDK.**

:::info
The above-mentioned query parameters are mandatory. Apart from these parameters, you can pass any other extra parameters which are required according to your use-case.
:::

### **Creating Template**

**`Step 1:`** Create a new React App using the below command

```js
npx create-react-app videosdk-custom-template
```

:::note
You can use VideoSDK's React or JavaScript SDK to create a custom template. The following is an example of building a custom template with the React SDK.
:::

**`Step 2:`** Install the VideoSDK using the below-mentioned npm command. Make sure you are in your react app directory before you run this command.
```js npm install "@videosdk.live/react-sdk" //For the Participants Video npm install "react-player" ``` ###### App Architecture ![template architechture](https://cdn.videosdk.live/website-resources/docs-resources/custom_template_arch.png) ###### Structure of the Project ```jsx title="Project Structure" root ├── node_modules ├── public ├── src │ ├── components │ ├── MeetingContainer.js │ ├── ParticipantsAudioPlayer.js │ ├── ParticipantsView.js │ ├── Notification.js │ ├── icons │ ├── App.js │ ├── index.js ├── package.json . ``` **`Step 3:`** Next we will fetch the query parameters, from the URL which we will later use to initialize the meeting ```js title=App.js function App() { const { meetingId, token, participantId } = useMemo(() => { //highlight-start const location = window.location; const urlParams = new URLSearchParams(location.search); const paramKeys = { meetingId: "meetingId", token: "token", participantId: "participantId", }; Object.keys(paramKeys).forEach((key) => { paramKeys[key] = urlParams.get(key) ? decodeURIComponent(urlParams.get(key)) : null; }); return paramKeys; //highlight-end }, []); } ``` **`Step 4:`** Now we will initialize the meeting with the parameters we extracted from the URL. Make sure `joinWithoutUserInteraction` is specified, so that the template engine is able to join directly into the meeting, on the page load. ```js title=App.js function App(){ //highlight-next-line ... return meetingId && token && participantId ? (
    <MeetingProvider
      config={{
        meetingId,
        name: "Template",
        micEnabled: false,
        webcamEnabled: false,
        participantId,
      }}
      token={token}
      joinWithoutUserInteraction
    >
      <MeetingContainer />
    </MeetingProvider>
  ) : null;
}
```

**`Step 5:`** Let us create the `MeetingContainer`, which will render the meeting view for us.

- It will also listen to the PubSub messages from the `CHANGE_BACKGROUND` topic, which will change the background color of the meeting.
- It will have a `Notification` component which will show any messages shared by the Host.

:::note
We will be using the PubSub mechanism to communicate with the template. You can learn more about [PubSub from here](../video-and-audio-calling-api-sdk/collaboration-in-meeting/pubsub).
:::

```js title=MeetingContainer.js
export const MeetingContainer = () => {
  const { isMeetingJoined, participants, localParticipant } = useMeeting();
  //highlight-next-line
  const { messages } = usePubSub("CHANGE_BACKGROUND");

  const remoteSpeakers = [...participants.values()].filter((participant) => {
    return (
      participant.mode == Constants.modes.SEND_AND_RECV && !participant.local
    );
  });

  return isMeetingJoined ? (
    <div
      style={{
        display: "flex",
        flexDirection: "column",
        height: "100vh",
        //highlight-start
        backgroundColor:
          messages.length > 0 ? messages.at(messages.length - 1).message : "#fff",
        //highlight-end
      }}
    >
      //highlight-next-line
      <div
        style={{
          display: "grid",
          gridTemplateColumns: remoteSpeakers.length > 1 ? "1fr 1fr" : "1fr",
          flex: 1,
          maxHeight: `100vh`,
          overflowY: "auto",
          gap: "20px",
          padding: "20px",
          alignItems: "center",
          justifyItems: "center",
        }}
      >
        {[...remoteSpeakers].map((participant) => {
          return (
            //highlight-start
            <ParticipantView
              key={participant.id}
              participantId={participant.id}
            />
            //highlight-end
          );
        })}
      </div>
      //highlight-next-line
      <ParticipantsAudioPlayer />
      <Notification />
    </div>
  ) : (
    <div>Joining the meeting...</div>
  );
};
```

**`Step 6:`** Let us create the `ParticipantView` and `ParticipantsAudioPlayer`, which will render the video and audio of the participants respectively.

```js title=ParticipantView.js
export const ParticipantView = (props) => {
  const { webcamStream, webcamOn, displayName, micOn } = useParticipant(
    props.participantId
  );

  const videoStream = useMemo(() => {
    if (webcamOn && webcamStream) {
      const mediaStream = new MediaStream();
      mediaStream.addTrack(webcamStream.track);
      return mediaStream;
    }
  }, [webcamStream, webcamOn]);

  return (
    <div
      style={{
        position: "relative",
        width: "100%",
        height: "100%",
        overflow: "hidden",
      }}
    >
      {webcamOn && webcamStream ? (
        <ReactPlayer
          playsinline
          playing
          muted
          controls={false}
          url={videoStream}
          height="100%"
          width="100%"
        />
      ) : (
        <div
          style={{
            display: "flex",
            alignItems: "center",
            justifyContent: "center",
            width: "100%",
            height: "100%",
          }}
        >
          {String(displayName).charAt(0).toUpperCase()}
        </div>
      )}
      <div style={{ position: "absolute", bottom: "8px", left: "8px" }}>
        {displayName}{" "}
        {/* mic-off icon component, e.g. from the project's `icons` folder */}
        {!micOn && <MicOffIcon />}
      </div>
    </div>
); }; ``` ```js title=ParticipantsAudioPlayer.js const ParticipantAudio = ({ participantId }) => { const { micOn, micStream, isLocal } = useParticipant(participantId); const audioPlayer = useRef(); useEffect(() => { if (!isLocal && audioPlayer.current && micOn && micStream) { const mediaStream = new MediaStream(); mediaStream.addTrack(micStream.track); audioPlayer.current.srcObject = mediaStream; audioPlayer.current.play().catch((err) => {}); } else { audioPlayer.current.srcObject = null; } }, [micStream, micOn, isLocal, participantId]); return