Version: 1.0.x

VideoSDK Inference

VideoSDK Inference provides a unified gateway to AI models for Speech-to-Text (STT), Large Language Models (LLM), Text-to-Speech (TTS), and real-time multimodal communication.

With VideoSDK Inference, you don't need to provide your own API keys for individual AI providers (such as Deepgram, Google Gemini, or OpenAI). VideoSDK handles authentication and API connections through its unified gateway, so you can get started immediately. Usage is billed against your VideoSDK account balance.

Installation

The Inference plugin is part of the core VideoSDK Agents SDK. You can install it using pip:

pip install videosdk-agents

Importing

You can import the provider-specific inference classes (e.g. SarvamAISTT, GoogleLLM, SarvamAITTS, SanasDenoise, GeminiRealtime, and TurnV2) from the videosdk.agents.inference module.

from videosdk.agents.inference import SarvamAISTT, GoogleLLM, SarvamAITTS, SanasDenoise, GeminiRealtime, TurnV2

Setup Authentication

Authentication for the Inference gateway is handled via the VIDEOSDK_AUTH_TOKEN environment variable.

VIDEOSDK_AUTH_TOKEN="your-videosdk-auth-token"
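
A startup check can catch a missing token before the agent tries to connect. The sketch below is illustrative only: `require_auth_token` is a hypothetical helper, not part of videosdk-agents, and it assumes the token is read from the process environment as described above.

```python
import os


def require_auth_token(env=None) -> str:
    """Return the VideoSDK auth token, or raise a clear error if it is missing.

    Hypothetical helper for illustration; not part of videosdk-agents.
    """
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    token = env.get("VIDEOSDK_AUTH_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "VIDEOSDK_AUTH_TOKEN is not set; export it before starting the agent."
        )
    return token
```

Calling `require_auth_token()` at the top of the `__main__` block would fail fast with a readable message instead of an authentication error later in the session.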

Example Usage

In cascading mode, VideoSDK Inference can serve every stage of the pipeline: speech recognition, the LLM, speech synthesis, denoising, and turn detection. This example uses Sarvam AI's STT and TTS models and Google's Gemini LLM through the VideoSDK gateway.

import logging
from videosdk.agents import (
    Agent,
    AgentSession,
    Pipeline,
    JobContext,
    RoomOptions,
    WorkerJob,
)
from videosdk.agents.inference import SarvamAISTT, GoogleLLM, SarvamAITTS, SanasDenoise, TurnV2
from videosdk.agents.plugins import SileroVAD

# Minimal logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


class SimpleAgent(Agent):
    """Simple voice agent for testing inference STT."""

    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
        )

    async def on_enter(self) -> None:
        await self.session.say(
            "Hello! I'm using VideoSDK Inference for speech recognition. How can I help you?"
        )

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")


async def entrypoint(ctx: JobContext):
    """Main entrypoint for the agent."""

    agent = SimpleAgent()

    # Create pipeline with Inference STT, LLM, TTS & Denoise (via VideoSDK Gateway)
    pipeline = Pipeline(
        stt=SarvamAISTT(model_id="saarika:v2.5", language="en-IN"),
        llm=GoogleLLM(model_id="gemini-2.5-flash"),
        tts=SarvamAITTS(model_id="bulbul:v2", speaker="anushka", language="en-IN"),
        denoise=SanasDenoise(),
        # Turn detection (via VideoSDK Gateway): echo-small (default, fastest)
        turn_detector=TurnV2.echo_small(),
        vad=SileroVAD(),
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
    )

    await session.start(wait_for_participant=True, run_until_shutdown=True)


def make_context() -> JobContext:
    """Create job context for playground mode."""
    room_options = RoomOptions(
        name="Inference Test Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()

Supported Models

The following models are available through the VideoSDK Inference Gateway.

Speech-to-Text (STT)

| Provider | Model Name | Model ID |
| --- | --- | --- |
| AssemblyAI | Universal Streaming English | universal-streaming-english |
| AssemblyAI | Universal Streaming Multilingual | universal-streaming-multilingual |
| Deepgram | Flux General (English) | flux-general-en |
| Deepgram | Nova 2 | nova-2 |
| Deepgram | Nova 3 General | nova-3-general |
| Google | Chirp 2 | chirp_2 |
| Google | Chirp 3 | chirp_3 |
| Sarvam AI | Saaras V3 | saaras:v3 |

Large Language Models (LLM)

| Provider | Model Name | Model ID |
| --- | --- | --- |
| Google | Gemini 2.5 Flash | gemini-2.5-flash |
| Google | Gemini 2.5 Flash Lite | gemini-2.5-flash-lite |
| Google | Gemini 3 Flash Preview | gemini-3-flash-preview |
| Google | Gemini 3.1 Flash Lite Preview | gemini-3.1-flash-lite-preview |
| Google | Gemini 3.1 Pro Preview | gemini-3.1-pro-preview |
| Sarvam | Sarvam 30B | sarvam-30b |
| Sarvam | Sarvam 105B | sarvam-105b |

Text-to-Speech (TTS)

| Provider | Model Name | Model ID |
| --- | --- | --- |
| Cartesia | Sonic 3 | sonic-3 |
| Deepgram | Aura 2 | aura-2 |
| Google | Chirp 3 HD | Chirp3-HD |
| Google | Gemini 2.5 Flash TTS | gemini-2.5-flash-tts |
| Google | Gemini 2.5 Flash Preview TTS | gemini-2.5-flash-preview-tts |
| Google | Gemini 2.5 Pro TTS | gemini-2.5-pro-tts |
| Google | Gemini 2.5 Pro Preview TTS | gemini-2.5-pro-preview-tts |
| Google | Gemini 3.1 Flash TTS Preview | gemini-3.1-flash-tts-preview |
| Sarvam AI | Bulbul V2 | bulbul:v2 |
| Sarvam AI | Bulbul V3 | bulbul:v3 |

Turn Detector

| Provider | Model Name | Model ID |
| --- | --- | --- |
| VideoSDK | Echo Small | echo-small |
| VideoSDK | Echo Large | echo-large |
| TurnSense | Turn Sense | latishab/turnsense |

Realtime

| Provider | Model Name | Model ID |
| --- | --- | --- |
| Google | Gemini 3.1 Flash Live Preview | gemini-3.1-flash-live-preview |
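
A small lookup can flag a mistyped model ID before it reaches the gateway. The sketch below copies the model IDs from the tables above into a plain dictionary; `LISTED_MODEL_IDS` and `is_listed` are hypothetical names, not part of the SDK, and the snapshot will go stale as models are added or retired.

```python
# Snapshot of the model IDs listed in the tables above; it may go stale.
LISTED_MODEL_IDS = {
    "stt": {
        "universal-streaming-english",
        "universal-streaming-multilingual",
        "flux-general-en",
        "nova-2",
        "nova-3-general",
        "chirp_2",
        "chirp_3",
        "saaras:v3",
    },
    "llm": {
        "gemini-2.5-flash",
        "gemini-2.5-flash-lite",
        "gemini-3-flash-preview",
        "gemini-3.1-flash-lite-preview",
        "gemini-3.1-pro-preview",
        "sarvam-30b",
        "sarvam-105b",
    },
    "tts": {
        "sonic-3",
        "aura-2",
        "Chirp3-HD",
        "gemini-2.5-flash-tts",
        "gemini-2.5-flash-preview-tts",
        "gemini-2.5-pro-tts",
        "gemini-2.5-pro-preview-tts",
        "gemini-3.1-flash-tts-preview",
        "bulbul:v2",
        "bulbul:v3",
    },
    "turn_detector": {"echo-small", "echo-large", "latishab/turnsense"},
    "realtime": {"gemini-3.1-flash-live-preview"},
}


def is_listed(kind: str, model_id: str) -> bool:
    """Return True if model_id appears in the table for the given kind."""
    return model_id in LISTED_MODEL_IDS.get(kind, set())
```

A model ID missing from this snapshot is not necessarily rejected by the gateway, so a check like this is better suited to a warning than a hard failure.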

Configuration Options

STT Configuration

SarvamAISTT()

  • model_id: (str) The Sarvam model ID (default: "saaras:v3").
  • language: (str) Language code for transcription (default: "en-IN"). Supports Indian languages.
  • input_sample_rate: (int) Input audio sample rate (default: 48000).
  • output_sample_rate: (int) Output sample rate for processing (default: 16000).
  • enable_streaming: (bool) Enable streaming mode (default: True).
  • base_url: (str) Custom inference gateway URL.
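
The two sample-rate defaults imply a 3:1 reduction between the incoming audio and the rate used for processing. The arithmetic below is only a sketch: the 20 ms frame length is an assumption chosen for illustration, and how the SDK resamples internally is not described on this page.

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of samples in one frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000


INPUT_RATE = 48000   # documented default for input_sample_rate
OUTPUT_RATE = 16000  # documented default for output_sample_rate
FRAME_MS = 20        # assumed frame length, for illustration only

ratio = INPUT_RATE // OUTPUT_RATE                         # 3
input_samples = samples_per_frame(INPUT_RATE, FRAME_MS)   # 960
output_samples = samples_per_frame(OUTPUT_RATE, FRAME_MS)  # 320
```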

GoogleSTT()

  • model_id: (str) The Google model ID (default: "chirp_3"). Options: "chirp_3", "chirp_2", "latest_long", "latest_short".
  • language: (str) Primary language code for transcription (default: "en-US").
  • languages: (list) List of languages for auto-detection (default: [language]).
  • interim_results: (bool) Return interim transcription results (default: True).
  • punctuate: (bool) Add punctuation to transcripts (default: True).
  • location: (str) Google Cloud region (default: "asia-south1").
  • input_sample_rate: (int) Input audio sample rate (default: 48000).
  • output_sample_rate: (int) Output sample rate for processing (default: 16000).
  • enable_streaming: (bool) Enable streaming mode (default: True).
  • base_url: (str) Custom inference gateway URL.

LLM Configuration

GoogleLLM()

  • model_id: (str) The Gemini model version (default: "gemini-2.5-flash").
  • temperature: (float) Sampling temperature for response randomness, 0.0 to 1.0 (default: 0.7).
  • tool_choice: (str) Tool calling mode: "auto", "required", or "none" (default: "auto").
  • max_output_tokens: (int) Maximum tokens in model responses (default: None).
  • top_p: (float) Nucleus sampling parameter, 0.0 to 1.0 (default: None).
  • top_k: (int) Limits tokens considered for each generation step (default: None).
  • presence_penalty: (float) Penalizes token presence, -2.0 to 2.0 (default: None).
  • frequency_penalty: (float) Penalizes token frequency, -2.0 to 2.0 (default: None).
  • base_url: (str) Custom inference gateway URL.
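
The documented ranges can be checked before the options reach the model. The helper below is a minimal sketch: `google_llm_options` is a hypothetical function, not part of the SDK, and it only enforces the ranges stated in the list above.

```python
from typing import Any, Dict, Optional


def google_llm_options(
    model_id: str = "gemini-2.5-flash",
    temperature: float = 0.7,
    tool_choice: str = "auto",
    max_output_tokens: Optional[int] = None,
    top_p: Optional[float] = None,
    presence_penalty: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
) -> Dict[str, Any]:
    """Validate the documented ranges and return keyword arguments.

    Hypothetical helper for illustration; not part of videosdk-agents.
    """
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    if tool_choice not in {"auto", "required", "none"}:
        raise ValueError("tool_choice must be 'auto', 'required', or 'none'")
    if top_p is not None and not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be between 0.0 and 1.0")
    for name, value in (
        ("presence_penalty", presence_penalty),
        ("frequency_penalty", frequency_penalty),
    ):
        if value is not None and not -2.0 <= value <= 2.0:
            raise ValueError(f"{name} must be between -2.0 and 2.0")

    options = {
        "model_id": model_id,
        "temperature": temperature,
        "tool_choice": tool_choice,
        "max_output_tokens": max_output_tokens,
        "top_p": top_p,
        "presence_penalty": presence_penalty,
        "frequency_penalty": frequency_penalty,
    }
    # Leave unset options out so the constructor's own defaults apply.
    return {key: value for key, value in options.items() if value is not None}
```

The resulting dictionary could then be passed as `GoogleLLM(**google_llm_options(temperature=0.3))`, assuming the constructor accepts these names as keyword arguments, as the list above suggests.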

TTS Configuration

SarvamAITTS()

  • model_id: (str) The Sarvam model ID (default: "bulbul:v3").
  • speaker: (str) The speaker voice (default: "shubh").
  • language: (str) Language code (default: "en-IN").
  • sample_rate: (int) Audio sample rate (default: 24000).
  • enable_streaming: (bool) Enable streaming mode (default: True).
  • base_url: (str) Custom inference gateway URL.

GoogleTTS()

  • model_id: (str) The Google Cloud TTS model ID (default: "Chirp3-HD").
  • voice_id: (str) The voice name (default: "Achernar"). Options: Achernar, Aoede, Charon, Fenrir, Kore, Leda, Orus, Puck, Zephyr.
  • language: (str) Language code (default: "en-US").
  • speed: (float) Speech speed (default: 1.0).
  • pitch: (float) Voice pitch (default: 0.0).
  • sample_rate: (int) Audio sample rate (default: 24000).
  • enable_streaming: (bool) Enable streaming mode (default: True).
  • base_url: (str) Custom inference gateway URL.
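
The two TTS classes name their voice parameter differently: `speaker` for SarvamAITTS and `voice_id` for GoogleTTS. The sketch below shows one way to hide that difference behind a single function; `tts_options` and its provider keys are hypothetical, and the values are the defaults documented above.

```python
from typing import Any, Dict, Optional


def tts_options(provider: str, voice: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for the chosen TTS class.

    Hypothetical helper for illustration; not part of videosdk-agents.
    """
    if provider == "sarvam":
        # SarvamAITTS calls the voice parameter `speaker`.
        return {
            "model_id": "bulbul:v3",
            "speaker": voice or "shubh",
            "language": "en-IN",
            "sample_rate": 24000,
        }
    if provider == "google":
        # GoogleTTS calls the voice parameter `voice_id`.
        return {
            "model_id": "Chirp3-HD",
            "voice_id": voice or "Achernar",
            "language": "en-US",
            "sample_rate": 24000,
        }
    raise ValueError(f"unknown TTS provider: {provider!r}")
```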

Denoise Configuration

SanasDenoise()

Integrates Sanas for real-time speech enhancement and noise suppression.

  • model_id: (str) The Sanas model ID (default: "VI_G_NC3.0").
  • sample_rate: (int) Audio sample rate in Hz (default: 16000).
  • channels: (int) Number of audio channels (default: 1 for mono).
  • base_url: (str) Custom inference gateway URL.
  • max_connection_attempts: (int) Maximum connection retry attempts (default: 5).

Turn Detection Configuration

Echo is VideoSDK's recommended turn detector for natural, real-time conversations. It is a multilingual model supporting 12 languages in total, spanning both Indian and international languages, and it comes in two variants: echo-small for the lowest latency and echo-large for the highest accuracy.

TurnV2 is a server-hosted End-of-Utterance (EOU) detector served via the VideoSDK Inference Gateway. It analyzes the latest user transcript and classifies the turn into one of four states, letting the agent decide whether the user has finished speaking before responding:

| State | Meaning |
| --- | --- |
| Complete | The user has finished their turn. |
| Incomplete | The user is still mid-sentence or has not finished yet. |
| Backchannel | A short acknowledgement from the user (e.g. "uh-huh", "okay okay"). |
| Wait | The user wants the agent to stop speaking immediately (e.g. "wait a minute", "hold on", "stop for a moment"). |
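
One way to picture how an agent might act on the four states is a simple mapping from state to action. The sketch below is an assumption for illustration: `TurnState`, `AgentAction`, and `next_action` are hypothetical names, and this page does not show the types TurnV2 actually returns or how the pipeline handles each state internally.

```python
from enum import Enum


class TurnState(Enum):
    """The four states described in the table above (hypothetical type)."""

    COMPLETE = "Complete"
    INCOMPLETE = "Incomplete"
    BACKCHANNEL = "Backchannel"
    WAIT = "Wait"


class AgentAction(Enum):
    """Illustrative reactions; these names are assumptions, not SDK API."""

    RESPOND = "respond"
    KEEP_LISTENING = "keep_listening"
    CONTINUE = "continue"
    STOP_SPEAKING = "stop_speaking"


def next_action(state: TurnState) -> AgentAction:
    """Map a classified turn state to what the agent should do next."""
    if state is TurnState.COMPLETE:
        return AgentAction.RESPOND          # user finished: generate a reply
    if state is TurnState.INCOMPLETE:
        return AgentAction.KEEP_LISTENING   # user is mid-sentence: wait
    if state is TurnState.BACKCHANNEL:
        return AgentAction.CONTINUE         # acknowledgement: not a new turn
    return AgentAction.STOP_SPEAKING        # Wait: interrupt agent speech
```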

TurnV2.echo_small()

  • The default, lowest-latency model optimized for the fastest possible turn detection.
  • Best when responsiveness matters most.

TurnV2.echo_large()

  • A higher-accuracy model that trades a little latency for better classification.
  • Best when accuracy matters more than raw speed.

Which model should I use?

Use echo-small for the lowest latency (the default), or echo-large when accuracy matters more than raw speed.

The echo-small and echo-large models are supported in videosdk-agents 1.0.18 and later.

pip install "videosdk-agents>=1.0.18"

from videosdk.agents.inference import TurnV2

# Fastest, lowest latency (default)
turn_detector = TurnV2.echo_small()

# Higher accuracy
turn_detector = TurnV2.echo_large()
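
Because the Echo models depend on a minimum SDK version, a startup check can report an outdated install clearly. The sketch below uses only the standard library; `supports_echo` and `installed_supports_echo` are hypothetical helpers, and the simple parser ignores pre-release suffixes, so a stricter check would more likely use `packaging.version.Version`.

```python
import re
from importlib import metadata

MIN_ECHO_VERSION = (1, 0, 18)  # minimum version stated above


def parse_version(version: str) -> tuple:
    """Parse 'X.Y.Z' into a tuple of ints, ignoring any pre-release suffix."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def supports_echo(version: str) -> bool:
    """Return True if the given version string is at least 1.0.18."""
    return parse_version(version) >= MIN_ECHO_VERSION


def installed_supports_echo() -> bool:
    """Check the installed videosdk-agents package, if there is one."""
    try:
        return supports_echo(metadata.version("videosdk-agents"))
    except metadata.PackageNotFoundError:
        return False
```

Comparing integer tuples rather than strings matters here: as strings, "1.0.9" would sort after "1.0.18".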

Realtime Configuration

GeminiRealtime()

  • model: (str) The Gemini model version (default: "gemini-3.1-flash-live-preview").
  • voice: (str) The voice to use (default: "Puck"). Options: "Puck", "Charon", "Kore", "Fenrir", "Aoede".
  • language_code: (str) Language code for speech synthesis (default: "en-US").
  • response_modalities: (list) Response types, e.g., ["AUDIO"] or ["TEXT", "AUDIO"] (default: ["AUDIO"]).
  • temperature: (float) Sampling temperature, 0.0 to 1.0 (default: None).
  • top_p: (float) Nucleus sampling parameter, 0.0 to 1.0 (default: None).
  • top_k: (float) Limits tokens considered for each generation step (default: None).
  • candidate_count: (int) Number of response candidates (default: 1).
  • max_output_tokens: (int) Maximum tokens in model responses (default: None).
  • presence_penalty: (float) Penalizes token presence, -2.0 to 2.0 (default: None).
  • frequency_penalty: (float) Penalizes token frequency, -2.0 to 2.0 (default: None).
  • base_url: (str) Custom inference gateway URL.
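
Unlike the other classes, GeminiRealtime takes `model` rather than `model_id`. The dataclass below is a sketch that gathers a subset of the documented defaults in one place and checks the documented 0.0 to 1.0 ranges; `RealtimeOptions` is a hypothetical type, not part of the SDK.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class RealtimeOptions:
    """Documented GeminiRealtime defaults (hypothetical container type)."""

    model: str = "gemini-3.1-flash-live-preview"  # note: `model`, not `model_id`
    voice: str = "Puck"
    language_code: str = "en-US"
    response_modalities: List[str] = field(default_factory=lambda: ["AUDIO"])
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    candidate_count: int = 1
    max_output_tokens: Optional[int] = None

    def __post_init__(self) -> None:
        # Enforce only the ranges stated in the list above.
        for name in ("temperature", "top_p"):
            value = getattr(self, name)
            if value is not None and not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0")

    def to_kwargs(self) -> Dict[str, Any]:
        """Return keyword arguments, omitting options left unset."""
        return {k: v for k, v in asdict(self).items() if v is not None}
```

The result of `RealtimeOptions(voice="Kore").to_kwargs()` could be passed as `GeminiRealtime(**kwargs)`, assuming the constructor accepts these names as keyword arguments.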
