VideoSDK Inference

VideoSDK Inference provides a unified gateway to AI models for Speech-to-Text (STT), Large Language Models (LLM), Text-to-Speech (TTS), and real-time multimodal communication.

With VideoSDK Inference, you don't need to provide your own API keys for individual AI providers (such as Sarvam AI or Google Gemini). VideoSDK handles authentication and API connections through its unified gateway, so you can get started right away. Usage is charged to your VideoSDK account balance.

Installation

The Inference plugin is part of the core VideoSDK Agents SDK. You can install it using pip:

pip install videosdk-agents

Importing

You can import the STT, LLM, TTS, Denoise, and Realtime classes from the videosdk.agents.inference module.

from videosdk.agents.inference import STT, LLM, TTS, Denoise, Realtime

Setup Authentication

Authentication for the Inference gateway is handled via the VIDEOSDK_AUTH_TOKEN environment variable.

VIDEOSDK_AUTH_TOKEN="your-videosdk-auth-token"
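
Since authentication depends entirely on this variable, it can help to fail fast at startup when it is missing. The following is a minimal sketch of such a check; `require_auth_token` is a hypothetical helper written for illustration and is not part of the VideoSDK Agents SDK.

```python
import os


def require_auth_token(env=None) -> str:
    """Return VIDEOSDK_AUTH_TOKEN, or raise a clear error if it is unset or blank.

    Hypothetical helper for illustration; not part of the VideoSDK Agents SDK.
    """
    # Allow a custom mapping to be passed in (useful for tests); default to os.environ.
    env = os.environ if env is None else env
    token = env.get("VIDEOSDK_AUTH_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "VIDEOSDK_AUTH_TOKEN is not set. Export it before starting the agent."
        )
    return token
```

One way to use it is to call `require_auth_token()` at the top of the `if __name__ == "__main__":` block, before the job is started.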

Cascading Pipeline

In a CascadingPipeline, VideoSDK Inference can provide speech recognition, the LLM, speech synthesis, and noise suppression. This example uses Sarvam AI for STT and TTS, Google Gemini for the LLM, and Sanas for denoising, all through the VideoSDK gateway.

Example Usage

import logging
from videosdk.agents import (
    Agent,
    AgentSession,
    CascadingPipeline,
    ConversationFlow,
    JobContext,
    RoomOptions,
    WorkerJob,
)
from videosdk.agents.inference import STT, LLM, TTS, Denoise
from videosdk.plugins.silero import SileroVAD

# Minimal logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)

class SimpleAgent(Agent):
    """Simple voice agent for testing inference STT."""

    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
        )

    async def on_enter(self) -> None:
        await self.session.say(
            "Hello! I'm using VideoSDK Inference for speech recognition. How can I help you?"
        )

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")


async def entrypoint(ctx: JobContext):
    """Main entrypoint for the agent."""

    agent = SimpleAgent()
    conversation_flow = ConversationFlow(agent)

    # Create the pipeline with Inference STT, LLM, TTS, and Denoise (via the VideoSDK gateway)
    pipeline = CascadingPipeline(
        stt=STT.sarvam(model_id="saarika:v2.5", language="en-IN"),
        llm=LLM.google(model_id="gemini-2.5-flash"),
        tts=TTS.sarvam(model_id="bulbul:v2", speaker="anushka", language="en-IN"),
        denoise=Denoise.sanas(),
        vad=SileroVAD(),
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )

    await session.start(wait_for_participant=True, run_until_shutdown=True)

def make_context() -> JobContext:
    """Create job context for playground mode."""
    room_options = RoomOptions(
        name="Inference Test Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()

Realtime Pipeline

The RealTimePipeline uses the VideoSDK Inference Gateway to connect to multimodal models such as Gemini Live 2.5 Flash Native Audio. The gateway manages the model connection for you, which helps reduce latency.

Example Usage

import logging
from videosdk.agents import (
    Agent,
    AgentSession,
    RealTimePipeline,
    ConversationFlow,
    JobContext,
    RoomOptions,
    WorkerJob,
)
from videosdk.agents.inference import Realtime

# Minimal logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)

class SimpleAgent(Agent):
    """Simple voice agent for testing inference realtime."""

    def __init__(self):
        super().__init__(
            instructions="""You are a helpful and friendly voice assistant. 
You speak in a natural, conversational tone. Keep your responses concise but informative.""",
        )

    async def on_enter(self) -> None:
        await self.session.say(
            "Hello! I'm using the VideoSDK Inference Gateway with Gemini. How can I help you today?"
        )

    async def on_exit(self) -> None:
        await self.session.say("Goodbye! Have a great day!")


async def entrypoint(ctx: JobContext):
    """Main entrypoint for the agent."""

    agent = SimpleAgent()
    conversation_flow = ConversationFlow(agent)

    # Create RealTimePipeline with Inference Realtime (Gemini)
    pipeline = RealTimePipeline(
        model=Realtime.gemini(
            model_id="gemini-2.5-flash-native-audio-preview-12-2025",
            voice="Puck",
            language_code="en-US",
            response_modalities=["AUDIO"],
            temperature=0.7
        ),
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )

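    await session.start(wait_for_participant=True, run_until_shutdown=True)

def make_context() -> JobContext:
    """Create job context for playground mode."""
    room_options = RoomOptions(
        name="Inference Realtime Test Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()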

Configuration Options

STT Configuration

`STT.sarvam()`

model_id: (str) The specific Sarvam model ID (e.g., "saarika:v2.5").
language: (str) Language code for transcription (e.g., "en-IN").

`STT.google()`

model_id: (str) The Google model ID (e.g., "chirp_3").
language: (str) Language code for transcription (default: "en-US").

LLM Configuration

`LLM.google()`

model_id: (str) The Gemini model version (e.g., "gemini-2.5-flash").
temperature: (float) Sampling temperature for response randomness (default: 0.7).

TTS Configuration

`TTS.sarvam()`

model_id: (str) The Sarvam model ID (e.g., "bulbul:v2").
speaker: (str) The speaker name (e.g., "anushka").
language: (str) Language code (e.g., "en-IN").

`TTS.google()`

model_id: (str) The Google model ID (e.g., "Chirp3-HD").
voice_id: (str) The voice ID (e.g., "Achernar").
language: (str) Language code (e.g., "en-US").
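
Because the STT, LLM, and TTS options above are plain keyword arguments, the provider choice can be kept in one place and swapped without touching the rest of the agent. The following is a minimal sketch under that assumption: `PROVIDER_SETTINGS`, `LLM_SETTINGS`, and `build_components` are hypothetical names, the values are the examples listed above, and the SDK import is deferred so the settings can be inspected without the package installed.

```python
# Hypothetical settings table; the values repeat the examples listed above.
PROVIDER_SETTINGS = {
    "sarvam": {
        "stt": {"model_id": "saarika:v2.5", "language": "en-IN"},
        "tts": {"model_id": "bulbul:v2", "speaker": "anushka", "language": "en-IN"},
    },
    "google": {
        "stt": {"model_id": "chirp_3", "language": "en-US"},
        "tts": {"model_id": "Chirp3-HD", "voice_id": "Achernar", "language": "en-US"},
    },
}

LLM_SETTINGS = {"model_id": "gemini-2.5-flash", "temperature": 0.7}


def build_components(provider: str) -> dict:
    """Return stt/llm/tts keyword arguments for a CascadingPipeline."""
    if provider not in PROVIDER_SETTINGS:
        raise ValueError(f"Unknown provider: {provider!r}")
    # Imported lazily so the settings table can be used without the SDK installed.
    from videosdk.agents.inference import STT, LLM, TTS

    settings = PROVIDER_SETTINGS[provider]
    stt_factory = getattr(STT, provider)  # STT.sarvam or STT.google
    tts_factory = getattr(TTS, provider)  # TTS.sarvam or TTS.google
    return {
        "stt": stt_factory(**settings["stt"]),
        "llm": LLM.google(**LLM_SETTINGS),
        "tts": tts_factory(**settings["tts"]),
    }
```

With this helper, the pipeline from the cascading example could be built as `CascadingPipeline(**build_components("google"), vad=SileroVAD())`.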

Denoise Configuration

`Denoise.sanas()`

Integrates Sanas for real-time speech enhancement and noise suppression.

Realtime Configuration

`Realtime.gemini()`

model_id: (str) The Gemini model version (e.g., "gemini-2.5-flash-native-audio-preview-12-2025").
voice: (str) The voice to use (e.g., "Puck", "Charon", "Kore", "Fenrir", "Aoede").
language_code: (str) Language code (e.g., "en-US").
response_modalities: (list) List of modalities, e.g., ["AUDIO"] or ["TEXT", "AUDIO"].
temperature: (float) Sampling temperature (default: 0.7).
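
The realtime options can likewise be collected into one dictionary and unpacked into `Realtime.gemini()`. This is a minimal sketch: `REALTIME_DEFAULTS` and `realtime_settings` are hypothetical names, the defaults simply repeat the example values above, and the helper only knows the five options listed here, so it rejects any other name as a likely typo.

```python
# Defaults repeat the example values from the option list above.
REALTIME_DEFAULTS = {
    "model_id": "gemini-2.5-flash-native-audio-preview-12-2025",
    "voice": "Puck",
    "language_code": "en-US",
    "response_modalities": ["AUDIO"],
    "temperature": 0.7,
}


def realtime_settings(**overrides) -> dict:
    """Merge overrides into the defaults, rejecting unknown option names."""
    unknown = set(overrides) - set(REALTIME_DEFAULTS)
    if unknown:
        raise TypeError(f"Unknown Realtime.gemini option(s): {sorted(unknown)}")
    settings = {**REALTIME_DEFAULTS, **overrides}
    # Copy the list so callers cannot mutate the shared default.
    settings["response_modalities"] = list(settings["response_modalities"])
    return settings
```

For example, `Realtime.gemini(**realtime_settings(voice="Kore"))` would change only the voice and keep the remaining values.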

Additional Resources

The following resources provide more information about VideoSDK Inference.

Inference Pricing: Detailed per-provider pricing

SDK Reference

GitHub Repository

Python Package

Got a question? Ask us on Discord.
