VideoSDK Inference
VideoSDK Inference provides a unified gateway to AI models for Speech-to-Text (STT), Large Language Models (LLM), Text-to-Speech (TTS), and realtime multimodal communication.
With VideoSDK Inference, you don't need to provide your own API keys for individual AI providers (such as Deepgram, Google Gemini, or OpenAI). VideoSDK handles authentication and API connections through its unified gateway, so you can get started immediately. Usage is billed against your VideoSDK account balance.
Installation
The Inference plugin is part of the core VideoSDK Agents SDK. You can install it using pip:
pip install videosdk-agents
Importing
You can import the provider-specific inference classes (e.g. SarvamAISTT, GoogleLLM, SarvamAITTS, SanasDenoise, GeminiRealtime, and TurnV2) from the videosdk.agents.inference module.
from videosdk.agents.inference import SarvamAISTT, GoogleLLM, SarvamAITTS, SanasDenoise, GeminiRealtime, TurnV2
Setup Authentication
Authentication for the Inference gateway is handled via the VIDEOSDK_AUTH_TOKEN environment variable.
VIDEOSDK_AUTH_TOKEN="your-videosdk-auth-token"
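Because the gateway reads the token from the environment, it can help to fail fast when the variable is missing. The following is a minimal sketch using only the Python standard library; the helper name `require_auth_token` is hypothetical and is not part of the VideoSDK Agents SDK.

```python
import os


def require_auth_token(env_var: str = "VIDEOSDK_AUTH_TOKEN") -> str:
    """Return the token from the environment, or raise if it is missing or blank.

    Hypothetical helper for illustration; not an SDK function.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set; export it before starting the agent.")
    return token
```

Calling `require_auth_token()` at the top of your script, before the job is created, surfaces a missing token as a clear error rather than a failed gateway connection later on.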
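VideoSDK Inference supports two pipeline modes: cascading and realtime.
Cascading Mode
In cascading mode, the pipeline chains separate STT, LLM, and TTS models, with optional denoising and turn detection, all served through VideoSDK Inference. This example uses Sarvam AI's models for speech recognition and synthesis and Google Gemini as the LLM via the VideoSDK gateway.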
Example Usage
import logging

from videosdk.agents import (
    Agent,
    AgentSession,
    Pipeline,
    JobContext,
    RoomOptions,
    WorkerJob,
)
from videosdk.agents.inference import SarvamAISTT, GoogleLLM, SarvamAITTS, SanasDenoise, TurnV2
from videosdk.agents.plugins import SileroVAD

# Minimal logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


class SimpleAgent(Agent):
    """Simple voice agent for testing inference STT."""

    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
        )

    async def on_enter(self) -> None:
        await self.session.say(
            "Hello! I'm using VideoSDK Inference for speech recognition. How can I help you?"
        )

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")


async def entrypoint(ctx: JobContext):
    """Main entrypoint for the agent."""
    agent = SimpleAgent()

    # Create pipeline with Inference STT, LLM, TTS & Denoise (via VideoSDK Gateway)
    pipeline = Pipeline(
        # Inference STT, LLM, TTS, Denoise (via VideoSDK Gateway)
        stt=SarvamAISTT(model_id="saarika:v2.5", language="en-IN"),
        llm=GoogleLLM(model_id="gemini-2.5-flash"),
        tts=SarvamAITTS(model_id="bulbul:v2", speaker="anushka", language="en-IN"),
        denoise=SanasDenoise(),
        # Turn detection (via VideoSDK Gateway): echo-small (default, fastest)
        turn_detector=TurnV2.echo_small(),
        vad=SileroVAD(),
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
    )

    await session.start(wait_for_participant=True, run_until_shutdown=True)


def make_context() -> JobContext:
    """Create job context for playground mode."""
    room_options = RoomOptions(
        name="Inference Test Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
Realtime Mode
In realtime mode, the Pipeline connects to a multimodal model, such as Gemini 3.1 Flash Live Preview, through the VideoSDK Inference Gateway. The gateway manages the connection to the model and helps reduce latency.
Example Usage
import logging

from videosdk.agents import (
    Agent,
    AgentSession,
    Pipeline,
    JobContext,
    RoomOptions,
    WorkerJob,
)
from videosdk.agents.inference import GeminiRealtime

# Minimal logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


class SimpleAgent(Agent):
    """Simple voice agent for testing inference realtime."""

    def __init__(self):
        super().__init__(
            instructions="""You are a helpful and friendly voice assistant.
            You speak in a natural, conversational tone. Keep your responses concise but informative.""",
        )

    async def on_enter(self) -> None:
        await self.session.say(
            "Hello! I'm using the VideoSDK Inference Gateway with Gemini. How can I help you today?"
        )

    async def on_exit(self) -> None:
        await self.session.say("Goodbye! Have a great day!")


async def entrypoint(ctx: JobContext):
    """Main entrypoint for the agent."""
    agent = SimpleAgent()

    # Create Pipeline with Inference Realtime (Gemini)
    pipeline = Pipeline(
        llm=GeminiRealtime(
            model="gemini-3.1-flash-live-preview",
            voice="Puck",
            language_code="en-US",
            response_modalities=["AUDIO"],
            temperature=0.7,
        ),
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
    )

    await session.start(wait_for_participant=True, run_until_shutdown=True)


def make_context() -> JobContext:
    """Create job context for playground mode."""
    room_options = RoomOptions(
        name="Inference Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
Supported Models
The following models are available through the VideoSDK Inference Gateway.
Speech-to-Text (STT)
| Provider | Model Name | Model ID |
|---|---|---|
| AssemblyAI | Universal Streaming English | universal-streaming-english |
| AssemblyAI | Universal Streaming Multilingual | universal-streaming-multilingual |
| Deepgram | Flux General (English) | flux-general-en |
| Deepgram | Nova 2 | nova-2 |
| Deepgram | Nova 3 General | nova-3-general |
| Google | Chirp 2 | chirp_2 |
| Google | Chirp 3 | chirp_3 |
| Sarvam AI | Saaras V3 | saaras:v3 |
Large Language Models (LLM)
| Provider | Model Name | Model ID |
|---|---|---|
| Google | Gemini 2.5 Flash | gemini-2.5-flash |
| Google | Gemini 2.5 Flash Lite | gemini-2.5-flash-lite |
| Google | Gemini 3 Flash Preview | gemini-3-flash-preview |
| Google | Gemini 3.1 Flash Lite Preview | gemini-3.1-flash-lite-preview |
| Google | Gemini 3.1 Pro Preview | gemini-3.1-pro-preview |
| Sarvam AI | Sarvam 30B | sarvam-30b |
| Sarvam AI | Sarvam 105B | sarvam-105b |
Text-to-Speech (TTS)
| Provider | Model Name | Model ID |
|---|---|---|
| Cartesia | Sonic 3 | sonic-3 |
| Deepgram | Aura 2 | aura-2 |
| Google | Chirp 3 HD | Chirp3-HD |
| Google | Gemini 2.5 Flash TTS | gemini-2.5-flash-tts |
| Google | Gemini 2.5 Flash Preview TTS | gemini-2.5-flash-preview-tts |
| Google | Gemini 2.5 Pro TTS | gemini-2.5-pro-tts |
| Google | Gemini 2.5 Pro Preview TTS | gemini-2.5-pro-preview-tts |
| Google | Gemini 3.1 Flash TTS Preview | gemini-3.1-flash-tts-preview |
| Sarvam AI | Bulbul V2 | bulbul:v2 |
| Sarvam AI | Bulbul V3 | bulbul:v3 |
Turn Detector
| Provider | Model Name | Model ID |
|---|---|---|
| VideoSDK | Echo Small | echo-small |
| VideoSDK | Echo Large | echo-large |
| TurnSense | Turn Sense | latishab/turnsense |
Realtime
| Provider | Model Name | Model ID |
|---|---|---|
| Google | Gemini 3.1 Flash Live Preview | gemini-3.1-flash-live-preview |
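To catch typos in a model ID before a session starts, the tables above can be mirrored in a small lookup. This is an illustrative sketch only: the `SUPPORTED_MODELS` dictionary and the `is_supported` helper are hypothetical names, not SDK APIs, and the gateway itself remains the source of truth for which model IDs it accepts.

```python
# Model IDs copied from the tables on this page; illustrative, not an SDK API.
SUPPORTED_MODELS = {
    "stt": {
        "universal-streaming-english", "universal-streaming-multilingual",
        "flux-general-en", "nova-2", "nova-3-general",
        "chirp_2", "chirp_3", "saaras:v3",
    },
    "llm": {
        "gemini-2.5-flash", "gemini-2.5-flash-lite", "gemini-3-flash-preview",
        "gemini-3.1-flash-lite-preview", "gemini-3.1-pro-preview",
        "sarvam-30b", "sarvam-105b",
    },
    "tts": {
        "sonic-3", "aura-2", "Chirp3-HD",
        "gemini-2.5-flash-tts", "gemini-2.5-flash-preview-tts",
        "gemini-2.5-pro-tts", "gemini-2.5-pro-preview-tts",
        "gemini-3.1-flash-tts-preview", "bulbul:v2", "bulbul:v3",
    },
    "turn_detector": {"echo-small", "echo-large", "latishab/turnsense"},
    "realtime": {"gemini-3.1-flash-live-preview"},
}


def is_supported(category: str, model_id: str) -> bool:
    """Return True if model_id is listed under category in the tables above."""
    return model_id in SUPPORTED_MODELS.get(category, set())
```

For example, `is_supported("stt", "nova-3-general")` is true, while a TTS model ID checked against the `"stt"` category is not.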
Configuration Options
STT Configuration
SarvamAISTT()
- model_id: (str) The Sarvam model ID (default: "saaras:v3").
- language: (str) Language code for transcription (default: "en-IN"). Supports Indian languages.
- input_sample_rate: (int) Input audio sample rate (default: 48000).
- output_sample_rate: (int) Output sample rate for processing (default: 16000).
- enable_streaming: (bool) Enable streaming mode (default: True).
- base_url: (str) Custom inference gateway URL.
GoogleSTT()
- model_id: (str) The Google model ID (default: "chirp_3"). Options: "chirp_3", "chirp_2", "latest_long", "latest_short".
- language: (str) Primary language code for transcription (default: "en-US").
- languages: (list) List of languages for auto-detection (default: [language]).
- interim_results: (bool) Return interim transcription results (default: True).
- punctuate: (bool) Add punctuation to transcripts (default: True).
- location: (str) Google Cloud region (default: "asia-south1").
- input_sample_rate: (int) Input audio sample rate (default: 48000).
- output_sample_rate: (int) Output sample rate for processing (default: 16000).
- enable_streaming: (bool) Enable streaming mode (default: True).
- base_url: (str) Custom inference gateway URL.
LLM Configuration
GoogleLLM()
- model_id: (str) The Gemini model version (default: "gemini-2.5-flash").
- temperature: (float) Sampling temperature for response randomness, 0.0 to 1.0 (default: 0.7).
- tool_choice: (str) Tool calling mode: "auto", "required", or "none" (default: "auto").
- max_output_tokens: (int) Maximum tokens in model responses (default: None).
- top_p: (float) Nucleus sampling parameter, 0.0 to 1.0 (default: None).
- top_k: (int) Limits tokens considered for each generation step (default: None).
- presence_penalty: (float) Penalizes token presence, -2.0 to 2.0 (default: None).
- frequency_penalty: (float) Penalizes token frequency, -2.0 to 2.0 (default: None).
- base_url: (str) Custom inference gateway URL.
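The documented ranges for the sampling parameters can be checked before the values reach the constructor. The sketch below is a hypothetical helper (`google_llm_options` is not part of the SDK) that validates values against the ranges listed above and returns keyword arguments; whether the SDK performs its own validation is not stated here.

```python
from typing import Optional


def google_llm_options(
    model_id: str = "gemini-2.5-flash",
    temperature: float = 0.7,
    top_p: Optional[float] = None,
    presence_penalty: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
) -> dict:
    """Validate values against the documented ranges and return keyword arguments."""

    def check(name: str, value: Optional[float], low: float, high: float) -> None:
        # None means "leave the constructor default in place", so skip the check.
        if value is not None and not (low <= value <= high):
            raise ValueError(f"{name} must be between {low} and {high}, got {value}")

    check("temperature", temperature, 0.0, 1.0)
    check("top_p", top_p, 0.0, 1.0)
    check("presence_penalty", presence_penalty, -2.0, 2.0)
    check("frequency_penalty", frequency_penalty, -2.0, 2.0)

    options = {
        "model_id": model_id,
        "temperature": temperature,
        "top_p": top_p,
        "presence_penalty": presence_penalty,
        "frequency_penalty": frequency_penalty,
    }
    # Drop unset values so the constructor's own defaults apply.
    return {key: value for key, value in options.items() if value is not None}
```

The result can then be unpacked into the constructor, for example `GoogleLLM(**google_llm_options(temperature=0.3))`.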
TTS Configuration
SarvamAITTS()
- model_id: (str) The Sarvam model ID (default: "bulbul:v3").
- speaker: (str) The speaker voice (default: "shubh").
- language: (str) Language code (default: "en-IN").
- sample_rate: (int) Audio sample rate (default: 24000).
- enable_streaming: (bool) Enable streaming mode (default: True).
- base_url: (str) Custom inference gateway URL.
GoogleTTS()
- model_id: (str) The Google Cloud TTS model ID (default: "Chirp3-HD").
- voice_id: (str) The voice name (default: "Achernar"). Options: Achernar, Aoede, Charon, Fenrir, Kore, Leda, Orus, Puck, Zephyr.
- language: (str) Language code (default: "en-US").
- speed: (float) Speech speed (default: 1.0).
- pitch: (float) Voice pitch (default: 0.0).
- sample_rate: (int) Audio sample rate (default: 24000).
- enable_streaming: (bool) Enable streaming mode (default: True).
- base_url: (str) Custom inference gateway URL.
Denoise Configuration
SanasDenoise()
Integrates Sanas for real-time speech enhancement and noise suppression.
- model_id: (str) The Sanas model ID (default: "VI_G_NC3.0").
- sample_rate: (int) Audio sample rate in Hz (default: 16000).
- channels: (int) Number of audio channels (default: 1 for mono).
- base_url: (str) Custom inference gateway URL.
- max_connection_attempts: (int) Maximum connection retry attempts (default: 5).
Turn Detection Configuration
Echo is VideoSDK's recommended turn detector for natural, real-time conversations. It is a multilingual model supporting 12 languages in total, spanning both Indian and international languages, and it comes in two variants: echo-small for the lowest latency and echo-large for the highest accuracy.
TurnV2 is a server-hosted End-of-Utterance (EOU) detector served via the VideoSDK Inference Gateway. It analyzes the latest user transcript and classifies the turn into one of four states, letting the agent decide whether the user has finished speaking before responding:
| State | Meaning |
|---|---|
| Complete | The user has finished their turn. |
| Incomplete | The user is still mid-sentence or has not finished yet. |
| Backchannel | A short acknowledgement from the user (e.g. "uh-huh", "okay okay"). |
| Wait | The user wants the agent to stop speaking immediately (e.g. "wait a minute", "hold on", "stop for a moment"). |
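One way to picture how an agent might react to each state is a simple mapping from state to action. This is a conceptual sketch: the `TurnState` enum, the action names, and `next_action` are all hypothetical, and this page does not describe how TurnV2 exposes its classification or how the SDK handles each state internally.

```python
from enum import Enum


class TurnState(Enum):
    """The four turn states from the table above (hypothetical representation)."""
    COMPLETE = "Complete"
    INCOMPLETE = "Incomplete"
    BACKCHANNEL = "Backchannel"
    WAIT = "Wait"


# Illustrative policy: what an agent could do for each classification.
ACTIONS = {
    TurnState.COMPLETE: "respond",           # user finished, agent may reply
    TurnState.INCOMPLETE: "keep_listening",  # user is mid-sentence
    TurnState.BACKCHANNEL: "ignore",         # acknowledgement, not a new turn
    TurnState.WAIT: "stop_speaking",         # user asked the agent to pause
}


def next_action(state: TurnState) -> str:
    """Return the illustrative action for a classified turn state."""
    return ACTIONS[state]
```

The key distinction is that only Complete should trigger a reply, while Wait interrupts the agent's own speech and Backchannel changes nothing.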
TurnV2.echo_small()
- The default, lowest-latency model optimized for the fastest possible turn detection.
- Best when responsiveness matters most.
TurnV2.echo_large()
- A higher-accuracy model that trades a little latency for better classification.
- Best when accuracy matters more than raw speed.
Use echo-small for the lowest latency (the default), or echo-large when accuracy matters more than raw speed.
The echo-small and echo-large models require videosdk-agents 1.0.18 or later.
pip install "videosdk-agents>=1.0.18"
from videosdk.agents.inference import TurnV2
# Fastest, lowest latency (default)
turn_detector = TurnV2.echo_small()
# Higher accuracy
turn_detector = TurnV2.echo_large()
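Since the Echo models require videosdk-agents 1.0.18 or later, a script can check the installed version before selecting a turn detector. The sketch below uses only the standard library; `meets_minimum` and `echo_models_available` are hypothetical helper names, and the comparison assumes plain numeric release versions (any pre-release suffix is ignored).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def meets_minimum(installed: str, minimum: str = "1.0.18") -> bool:
    """Compare the first three numeric components of two version strings."""

    def parse(value: str) -> tuple:
        return tuple(int(number) for number in re.findall(r"\d+", value)[:3])

    return parse(installed) >= parse(minimum)


def echo_models_available() -> bool:
    """Return True if an installed videosdk-agents meets the minimum version."""
    try:
        return meets_minimum(version("videosdk-agents"))
    except PackageNotFoundError:
        # The package is not installed in this environment.
        return False
```

If `echo_models_available()` returns False, upgrading with the pip command above is the documented fix.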
Realtime Configuration
GeminiRealtime()
- model: (str) The Gemini model version (default: "gemini-3.1-flash-live-preview").
- voice: (str) The voice to use (default: "Puck"). Options: "Puck", "Charon", "Kore", "Fenrir", "Aoede".
- language_code: (str) Language code for speech synthesis (default: "en-US").
- response_modalities: (list) Response types, e.g., ["AUDIO"] or ["TEXT", "AUDIO"] (default: ["AUDIO"]).
- temperature: (float) Sampling temperature, 0.0 to 1.0 (default: None).
- top_p: (float) Nucleus sampling parameter, 0.0 to 1.0 (default: None).
- top_k: (float) Limits tokens considered for each generation step (default: None).
- candidate_count: (int) Number of response candidates (default: 1).
- max_output_tokens: (int) Maximum tokens in model responses (default: None).
- presence_penalty: (float) Penalizes token presence, -2.0 to 2.0 (default: None).
- frequency_penalty: (float) Penalizes token frequency, -2.0 to 2.0 (default: None).
- base_url: (str) Custom inference gateway URL.
Additional Resources
The following resources provide more information about using VideoSDK Inference.
- Inference Pricing: Detailed provider-wise pricing

