VideoSDK Inference
VideoSDK Inference provides a unified gateway to access AI models for Speech-to-Text (STT), Large Language Models (LLM), Text-to-Speech (TTS), and real-time multimodal communication.
With VideoSDK Inference, you don't need to provide your own API keys for individual AI providers (like Sarvam AI, Google Gemini, etc.). VideoSDK handles authentication and API connections through its unified gateway, so you can get started instantly. Usage is billed against your VideoSDK account balance.
Installation
The Inference plugin is part of the core VideoSDK Agents SDK. You can install it using pip:
pip install videosdk-agents
Importing
You can import the STT, LLM, TTS, and Realtime classes from the videosdk.agents.inference module.
from videosdk.agents.inference import STT, LLM, TTS, Realtime
Setup Authentication
Authentication for the Inference gateway is handled via the VIDEOSDK_AUTH_TOKEN environment variable.
VIDEOSDK_AUTH_TOKEN="your-videosdk-auth-token"
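If you keep the token in a local .env file, you can load it before the agent starts. A minimal sketch, assuming the python-dotenv package is installed:
import os

from dotenv import load_dotenv

# Load VIDEOSDK_AUTH_TOKEN (and any other settings) from a local .env file
load_dotenv()

if not os.getenv("VIDEOSDK_AUTH_TOKEN"):
    raise RuntimeError("VIDEOSDK_AUTH_TOKEN is not set")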
Cascading Pipeline
In a CascadingPipeline, you can use VideoSDK Inference for speech recognition, LLM responses, and speech synthesis. The example below uses Sarvam AI's STT and TTS models and Google's Gemini LLM via the VideoSDK gateway.
Example Usage
import logging

from videosdk.agents import (
    Agent,
    AgentSession,
    CascadingPipeline,
    ConversationFlow,
    JobContext,
    RoomOptions,
    WorkerJob,
)
from videosdk.agents.inference import STT, LLM, TTS
from videosdk.plugins.silero import SileroVAD

# Minimal logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


class SimpleAgent(Agent):
    """Simple voice agent for testing inference STT."""

    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
        )

    async def on_enter(self) -> None:
        await self.session.say(
            "Hello! I'm using VideoSDK Inference for speech recognition. How can I help you?"
        )

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")


async def entrypoint(ctx: JobContext):
    """Main entrypoint for the agent."""
    agent = SimpleAgent()
    conversation_flow = ConversationFlow(agent)

    # Inference STT, LLM & TTS, all routed through the VideoSDK gateway
    pipeline = CascadingPipeline(
        stt=STT.sarvam(model_id="saarika:v2.5", language="en-IN"),
        llm=LLM.google(model_id="gemini-2.5-flash"),
        tts=TTS.sarvam(model_id="bulbul:v2", speaker="anushka", language="en-IN"),
        vad=SileroVAD(),
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )
    await session.start(wait_for_participant=True, run_until_shutdown=True)


def make_context() -> JobContext:
    """Create job context for playground mode."""
    room_options = RoomOptions(
        name="Inference Test Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()
Realtime Pipeline
A RealTimePipeline routes multimodal models such as Gemini Live 2.5 Flash Native Audio through the VideoSDK Inference Gateway, which manages the model connection for you and reduces latency.
Example Usage
import logging

from videosdk.agents import (
    Agent,
    AgentSession,
    RealTimePipeline,
    ConversationFlow,
    JobContext,
    RoomOptions,
    WorkerJob,
)
from videosdk.agents.inference import Realtime

# Minimal logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


class SimpleAgent(Agent):
    """Simple voice agent for testing inference realtime."""

    def __init__(self):
        super().__init__(
            instructions="""You are a helpful and friendly voice assistant.
            You speak in a natural, conversational tone. Keep your responses concise but informative.""",
        )

    async def on_enter(self) -> None:
        await self.session.say(
            "Hello! I'm using the VideoSDK Inference Gateway with Gemini. How can I help you today?"
        )

    async def on_exit(self) -> None:
        await self.session.say("Goodbye! Have a great day!")


async def entrypoint(ctx: JobContext):
    """Main entrypoint for the agent."""
    agent = SimpleAgent()
    conversation_flow = ConversationFlow(agent)

    # Create RealTimePipeline with Inference Realtime (Gemini)
    pipeline = RealTimePipeline(
        model=Realtime.gemini(
            model_id="gemini-2.5-flash-native-audio-preview-12-2025",
            voice="Puck",
            language_code="en-US",
            response_modalities=["AUDIO"],
            temperature=0.7,
        ),
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )
    await session.start(wait_for_participant=True, run_until_shutdown=True)
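The snippet above omits the worker bootstrap, which is why RoomOptions and WorkerJob appear in its imports. A minimal sketch that mirrors the cascading example (the room name here is illustrative):
def make_context() -> JobContext:
    """Create job context for playground mode."""
    room_options = RoomOptions(
        name="Inference Realtime Agent",  # illustrative room name
        playground=True,
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
    job.start()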
Configuration Options
STT Configuration
STT.sarvam()
- model_id: (str) The specific Sarvam model ID (e.g., "saarika:v2.5").
- language: (str) Language code for transcription (e.g., "en-IN").
STT.google()
- model_id: (str) The Google model ID.
- language: (str) Language code for transcription (default: "en-US").
LLM Configuration
LLM.google()
- model_id: (str) The Gemini model version (e.g., "gemini-2.5-flash").
- temperature: (float) Sampling temperature for response randomness (default: 0.7).
TTS Configuration
TTS.sarvam()
- model_id: (str) The Sarvam model ID (e.g., "bulbul:v2").
- speaker: (str) The speaker name (e.g., "anushka").
- language: (str) Language code (e.g., "en-IN").
TTS.google()
- model_id: (str) The Google model ID (e.g., "Chirp3-HD").
- voice_id: (str) The voice ID (e.g., "Achernar").
- language: (str) Language code (e.g., "en-US").
Realtime Configuration
Realtime.gemini()
- model_id: (str) The Gemini model version (e.g., "gemini-2.5-flash-native-audio-preview-12-2025").
- voice: (str) The voice to use (e.g., "Puck", "Charon", "Kore", "Fenrir", "Aoede").
- language_code: (str) Language code (e.g., "en-US").
- response_modalities: (list) List of modalities, e.g., ["AUDIO"] or ["TEXT", "AUDIO"].
- temperature: (float) Sampling temperature (default: 0.7).
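Because each pipeline component is configured independently, providers can be mixed. A minimal sketch pairing Sarvam STT with Google's Gemini LLM and Google TTS, using only the example values documented above:
from videosdk.agents import CascadingPipeline
from videosdk.agents.inference import STT, LLM, TTS
from videosdk.plugins.silero import SileroVAD

# Mixed-provider pipeline: Sarvam STT, Gemini LLM, Google TTS.
# All parameter values below are the documented examples from this page.
pipeline = CascadingPipeline(
    stt=STT.sarvam(model_id="saarika:v2.5", language="en-IN"),
    llm=LLM.google(model_id="gemini-2.5-flash", temperature=0.7),
    tts=TTS.google(model_id="Chirp3-HD", voice_id="Achernar", language="en-US"),
    vad=SileroVAD(),
)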
Additional Resources
The following resources provide more information about using VideoSDK Inference.
- Inference Pricing: Detailed per-provider pricing

