VideoSDK Inference

VideoSDK Inference provides a unified gateway to access AI models for Speech-to-Text (STT), Large Language Models (LLM), Text-to-Speech (TTS), and Realtime multimodal communication.

With VideoSDK Inference, you don't need to provide your own API keys for individual AI providers (such as Sarvam AI or Google Gemini). VideoSDK handles authentication and API connections through its unified gateway, so you can get started instantly. Usage is billed against your VideoSDK account balance.

Installation

The Inference plugin is part of the core VideoSDK Agents SDK. You can install it using pip:

pip install videosdk-agents

Importing

You can import the STT, LLM, TTS, and Realtime classes from the videosdk.agents.inference module.

from videosdk.agents.inference import STT, LLM, TTS, Realtime

Setup Authentication

Authentication for the Inference gateway is handled via the VIDEOSDK_AUTH_TOKEN environment variable.

VIDEOSDK_AUTH_TOKEN="your-videosdk-auth-token"
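If you want to fail fast when the token is missing, you can validate the environment variable at startup. The helper below is a sketch, not part of the SDK; it simply reads the variable shown above:

```python
import os

# Sketch only: get_auth_token is not part of the SDK. It reads the
# VIDEOSDK_AUTH_TOKEN variable and fails fast with a clear error instead
# of an opaque gateway failure later.
def get_auth_token() -> str:
    token = os.environ.get("VIDEOSDK_AUTH_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "VIDEOSDK_AUTH_TOKEN is not set; export it before starting the agent."
        )
    return token

os.environ["VIDEOSDK_AUTH_TOKEN"] = "your-videosdk-auth-token"  # placeholder
print(get_auth_token())
```

Calling the check once at the top of your entrypoint surfaces a misconfigured environment before any room is created.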

In a CascadingPipeline, you can use VideoSDK Inference for speech recognition, language generation, and speech synthesis. The example below uses Sarvam AI and Google Gemini models via the VideoSDK gateway.

Example Usage

import logging
from videosdk.agents import (
    Agent,
    AgentSession,
    CascadingPipeline,
    ConversationFlow,
    JobContext,
    RoomOptions,
    WorkerJob,
)
from videosdk.agents.inference import STT, LLM, TTS
from videosdk.plugins.silero import SileroVAD

# Minimal logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


class SimpleAgent(Agent):
    """Simple voice agent for testing inference STT."""

    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
        )

    async def on_enter(self) -> None:
        await self.session.say(
            "Hello! I'm using VideoSDK Inference for speech recognition. How can I help you?"
        )

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")


async def entrypoint(ctx: JobContext):
    """Main entrypoint for the agent."""
    agent = SimpleAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline with Inference STT, LLM & TTS (via VideoSDK Gateway)
    pipeline = CascadingPipeline(
        stt=STT.sarvam(model_id="saarika:v2.5", language="en-IN"),
        llm=LLM.google(model_id="gemini-2.5-flash"),
        tts=TTS.sarvam(model_id="bulbul:v2", speaker="anushka", language="en-IN"),
        vad=SileroVAD(),
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )

    await session.start(wait_for_participant=True, run_until_shutdown=True)


def make_context() -> JobContext:
    """Create job context for playground mode."""
    room_options = RoomOptions(
        name="Inference Test Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()

Configuration Options

STT Configuration

STT.sarvam()

  • model_id: (str) The specific Sarvam model ID (e.g., "saarika:v2.5").
  • language: (str) Language code for transcription (e.g., "en-IN").

STT.google()

  • model_id: (str) The Google model ID.
  • language: (str) Language code for transcription (default: "en-US").

LLM Configuration

LLM.google()

  • model_id: (str) The Gemini model version (e.g., "gemini-2.5-flash").
  • temperature: (float) Sampling temperature for response randomness (default: 0.7).

TTS Configuration

TTS.sarvam()

  • model_id: (str) The Sarvam model ID (e.g., "bulbul:v2").
  • speaker: (str) The speaker name (e.g., "anushka").
  • language: (str) Language code (e.g., "en-IN").

TTS.google()

  • model_id: (str) The Google model ID (e.g., "Chirp3-HD").
  • voice_id: (str) The voice ID (e.g., "Achernar").
  • language: (str) Language code (e.g., "en-US").
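The cascading example above uses Sarvam for TTS; as a sketch, the same slot can be filled with Google TTS using the parameters listed here (the values shown are the example values from this page, not an exhaustive list):

```python
from videosdk.agents.inference import TTS

# Google TTS via the VideoSDK gateway; example values from this page.
tts = TTS.google(
    model_id="Chirp3-HD",
    voice_id="Achernar",
    language="en-US",
)
```

Pass the resulting object as the `tts=` argument of `CascadingPipeline` in place of the `TTS.sarvam(...)` call.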

Realtime Configuration

Realtime.gemini()

  • model_id: (str) The Gemini model version (e.g., "gemini-2.5-flash-native-audio-preview-12-2025").
  • voice: (str) The voice to use (e.g., "Puck", "Charon", "Kore", "Fenrir", "Aoede").
  • language_code: (str) Language code (e.g., "en-US").
  • response_modalities: (list) List of modalities, e.g., ["AUDIO"] or ["TEXT", "AUDIO"].
  • temperature: (float) Sampling temperature (default: 0.7).
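The Realtime class does not appear in the cascading example above; constructing the model itself is a one-liner using the parameters listed here (values shown are the example values from this page; wiring it into a session follows the same pattern as the cascading example):

```python
from videosdk.agents.inference import Realtime

# Realtime multimodal model via the VideoSDK gateway.
# Parameter values are the examples listed above.
model = Realtime.gemini(
    model_id="gemini-2.5-flash-native-audio-preview-12-2025",
    voice="Puck",
    language_code="en-US",
    response_modalities=["AUDIO"],
    temperature=0.7,
)
```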
