Version: 1.0.x

Overview

The VideoSDK AI Agent SDK provides a powerful framework for building AI agents that can participate in real-time conversations. This guide explains the core components and demonstrates how to create a complete agentic workflow. The SDK serves as a real-time bridge between AI models and your users, facilitating seamless voice and media interactions.

Architecture

The Agent Session orchestrates the entire workflow, combining the Agent with a Pipeline for real-time communication. The unified Pipeline automatically detects the best mode based on the components you provide — whether that's a full cascade STT-LLM-TTS setup, a realtime speech-to-speech model, or a hybrid of both.

Core Components

  1. Agent - This is the base class for defining your agent's identity and behavior. Here, you can configure custom instructions, manage its state, and register function tools.
  2. Pipeline - This unified component manages the real-time flow of audio and data between the user and the AI models. It auto-detects the optimal mode based on the components you provide:
    • Cascade Mode - Provide STT, LLM, TTS, VAD, and Turn Detector for maximum flexibility and control over each processing stage.
    • Realtime Mode - Provide a realtime model (e.g., OpenAI Realtime, Google Gemini Live, AWS Nova Sonic) for lowest-latency speech-to-speech processing.
    • Hybrid Mode - Combine a realtime model with an external STT (for knowledge base support) or external TTS (for custom voice support).
  3. Agent Session - This component brings together the agent and pipeline to manage the agent's lifecycle within a VideoSDK meeting.
  4. Pipeline Hooks - A middleware system for intercepting and processing data at any stage of the pipeline. Use hooks for custom STT/TTS processing, observing or modifying LLM output, lifecycle events, and more.
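The auto-detection described above can be pictured as a simple selection rule. The sketch below is illustrative only — the function and component names are hypothetical and not the SDK's actual API — but the unified Pipeline applies the same kind of logic when you construct it:

```python
def detect_pipeline_mode(stt=None, llm=None, tts=None, realtime_model=None):
    """Illustrative mode selection; parameter names are hypothetical."""
    if realtime_model is not None:
        # A realtime model combined with an external STT or TTS is the hybrid case
        if stt is not None or tts is not None:
            return "hybrid"
        return "realtime"
    if stt is not None and llm is not None and tts is not None:
        return "cascade"
    raise ValueError("provide either a realtime model or a full STT-LLM-TTS set")

# Providing all three cascade components selects cascade mode
mode = detect_pipeline_mode(stt="deepgram", llm="gpt-4o", tts="elevenlabs")
print(mode)  # cascade
```

The practical upshot: you never name the mode explicitly — you choose it implicitly through which components you hand to the Pipeline.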

Supporting Components

These components work behind the scenes to support the core functionality of the AI Agent SDK:

  • Execution & Lifecycle Management

    • JobContext - Provides the execution environment and lifecycle management for AI agents. It encapsulates the context in which an agent job is running.

    • WorkerJob - Manages the execution of jobs and worker processes using Python's multiprocessing, allowing for concurrent agent operations.

  • Configuration & Settings

    • RoomOptions - This allows you to configure the behavior of the session, including room settings and other advanced features for the agent's interaction within a meeting.

    • Options - This is used to configure the behavior of the worker, including logging and other execution settings.

  • External Integration

    • MCP Servers - These enable the integration of external tools through either stdio or HTTP transport.
      • MCPServerStdio - Facilitates direct process communication for local Python scripts.
      • MCPServerHTTP - Enables HTTP-based communication for remote servers and services.
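MCP messages are JSON-RPC 2.0 objects; with the stdio transport, each message is typically written as a single JSON object per line on the tool process's stdin/stdout. The sketch below shows the general shape of framing a tool-call request — the tool name and arguments are made up for illustration, and the SDK's MCP classes handle this framing for you:

```python
import json

def frame_mcp_request(request_id, method, params):
    # JSON-RPC 2.0 envelope; the stdio transport sends one JSON object per line
    message = {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    return json.dumps(message) + "\n"

# Hypothetical tool call: "get_weather" is not a real tool, just an example
line = frame_mcp_request(1, "tools/call", {"name": "get_weather", "arguments": {"city": "Paris"}})
```

MCPServerHTTP exchanges the same JSON-RPC payloads, only over HTTP instead of a pipe, which is why the two transports are interchangeable from the agent's point of view.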

Advanced Features

The AI Agent SDK also includes a range of advanced features for building sophisticated conversational agents.

Examples - Try Out Yourself

We have examples to get you started. Try them out, talk to the agent, and customize it to your needs.

Got a question? Ask us on Discord.