Vision & Multi-modality
Vision and multi-modal capabilities enable your AI agents to process and understand visual content alongside text and audio. This creates richer, more interactive experiences where agents can analyze images, respond to visual cues, and engage in conversations about what they see.
The VideoSDK Agents framework supports vision capabilities through two distinct pipeline architectures, each with different capabilities and use cases.
Pipeline Architecture Overview
The framework provides two pipeline types with different vision support:
| Pipeline Type | Vision Capabilities | Supported Models | Use Cases |
|---|---|---|---|
| CascadingPipeline | Live frame capture & static images | OpenAI, Anthropic, Google | On-demand frame analysis, document analysis, visual Q&A |
| RealTimePipeline | Continuous live video streaming | Google Gemini Live only | Real-time visual interactions, live video commentary |
Cascading Pipeline Vision
The CascadingPipeline supports vision through two approaches: capturing live video frames from participants, or processing static images. This works with all supported LLM providers (OpenAI, Anthropic, Google).
Enabling Vision
Enable vision capabilities by setting vision=True in RoomOptions:
```python
from videosdk.agents import JobContext, RoomOptions

room_options = RoomOptions(
    room_id="your-room-id",
    name="Vision Agent",
    vision=True  # Enable vision capabilities
)

job_context = JobContext(room_options=room_options)
```
Live Frame Capture
Capture video frames from meeting participants on-demand using agent.capture_frames():
```python
import asyncio

from videosdk.agents import (
    Agent,
    AgentSession,
    CascadingPipeline,
    ConversationFlow,
    JobContext,
)
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.google import GoogleLLM
# VAD and turn-detector import paths follow the videosdk-plugins-silero and
# videosdk-plugins-turn-detector packages; verify against your installed versions.
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector


class VisionAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant that can analyze images."
        )


async def entrypoint(ctx: JobContext):
    agent = VisionAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(),
        llm=GoogleLLM(),
        tts=ElevenLabsTTS(),
        vad=SileroVAD(),
        turn_detector=TurnDetector()
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )

    shutdown_event = asyncio.Event()

    async def on_pubsub_message(message):
        print("Pubsub message received:", message)
        if isinstance(message, dict) and message.get("message") == "capture_frames":
            print("Capturing frame....")
            try:
                frames = agent.capture_frames(num_of_frames=1)
                if frames:
                    print(f"Captured {len(frames)} frame(s)")
                    await session.reply(
                        "Please analyze this frame and describe what you see in detail within one line.",
                        frames=frames
                    )
                else:
                    print("No frames available. Make sure vision is enabled in RoomOptions.")
            except ValueError as e:
                print(f"Error: {e}")

    def on_pubsub_message_wrapper(message):
        asyncio.create_task(on_pubsub_message(message))

    # rest of the code..
```
The capture_frames function returns a list of frames, and you can request at most 5 frames per call (num_of_frames <= 5).
Key Features:
- On-Demand Capture: Capture frames only when needed, triggered by events or user requests
- Event-Driven: Use PubSub or other triggers to capture frames at the right moment
- Flexible Analysis: Send custom instructions along with frames for specific analysis tasks (see the sketch below)
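For example, you can capture several frames at once and pair them with a task-specific instruction. The snippet below is a minimal sketch that reuses the agent and session from the example above; the prompt text and frame count are illustrative.

```python
# Run inside an async handler (for example, the PubSub callback shown earlier).
# Assumes `agent` and `session` are set up as in the example above.
frames = agent.capture_frames(num_of_frames=3)  # must be <= 5
if frames:
    await session.reply(
        "Compare these frames and tell me whether anything on screen has changed.",
        frames=frames
    )
else:
    print("No frames available. Make sure vision=True is set in RoomOptions.")
```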
Silent Capture (Saving Captured Frames)
You can save captured video frames to disk for later analysis or debugging. The frames returned by agent.capture_frames() are av.VideoFrame objects that can be converted to JPEG images.
(This is a silent capture: it does not trigger any agent speech announcing that an image is being captured unless you explicitly set it to do so.)
```python
import io

from av import VideoFrame
from PIL import Image


def save_frame_as_jpeg(frame: VideoFrame, filename: str) -> None:
    """Save a video frame as a JPEG file."""
    img = frame.to_image()  # Convert to PIL Image
    img.save(filename, format="JPEG")


# In your agent code
frames = agent.capture_frames(num_of_frames=1)
if frames:
    # Save the first frame
    save_frame_as_jpeg(frames[0], "captured_frame.jpg")

    # Or save as bytes for uploading/processing
    buffer = io.BytesIO()
    frames[0].to_image().save(buffer, format="JPEG")
    jpeg_bytes = buffer.getvalue()
```
Use Cases:
- Debugging: Save frames to verify what the agent is seeing
- Logging: Archive frames for audit trails or quality assurance
- Preprocessing: Save frames before sending to external vision APIs (see the sketch below)
- Thumbnails: Generate preview images for user interfaces
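For the preprocessing case, you can also encode a captured frame in memory instead of writing it to disk. This is a minimal sketch working on the same av.VideoFrame objects returned by agent.capture_frames(); frame_to_base64_jpeg is a hypothetical helper name, and whether a downstream API expects raw base64 or a data URL depends on that API.

```python
import base64
import io

from av import VideoFrame


def frame_to_base64_jpeg(frame: VideoFrame) -> str:
    """Encode a captured frame as a base64 JPEG string (no file I/O)."""
    buffer = io.BytesIO()
    frame.to_image().save(buffer, format="JPEG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


# In your agent code
frames = agent.capture_frames(num_of_frames=1)
if frames:
    jpeg_b64 = frame_to_base64_jpeg(frames[0])
    # Pass jpeg_b64 to an external vision API, store it, or build a data URL
    # (f"data:image/jpeg;base64,{jpeg_b64}") if your target accepts one.
```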
Static Image Processing
For pre-existing images or URLs, use the ImageContent class:
```python
from videosdk.agents import ChatRole, ImageContent

# Add image from URL
agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[ImageContent(image="https://example.com/image.jpg")]
)

# Add image with custom settings
image_content = ImageContent(
    image="https://example.com/document.png",
    inference_detail="high"  # "auto", "high", or "low"
)
agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[image_content]
)
```
Provider Support
All major LLM providers support vision in CascadingPipeline:
| Provider | Vision Models | Capabilities |
|---|---|---|
| OpenAI | GPT-4 Vision models | Configurable detail levels, URL & base64 support |
| Anthropic | Claude 3 models | Advanced image understanding, document analysis |
| Google | Gemini models | Comprehensive visual analysis, multi-image support |
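Because CascadingPipeline vision is provider-agnostic, switching providers only means changing the llm argument of the pipeline. A minimal sketch, assuming the videosdk-plugins-openai and videosdk-plugins-anthropic packages are installed and that their LLM class names follow the same pattern as GoogleLLM (verify against your installed plugin versions):

```python
# Swap the llm= argument of the CascadingPipeline shown earlier; frame capture
# and ImageContent behave the same way regardless of provider.
from videosdk.plugins.anthropic import AnthropicLLM  # assumed class name (videosdk-plugins-anthropic)
from videosdk.plugins.google import GoogleLLM
from videosdk.plugins.openai import OpenAILLM        # assumed class name (videosdk-plugins-openai)

LLM_PROVIDERS = {
    "openai": OpenAILLM,
    "anthropic": AnthropicLLM,
    "google": GoogleLLM,
}

# Pick the provider you want and pass it as llm= when building CascadingPipeline.
llm = LLM_PROVIDERS["anthropic"]()
```

The dictionary is only an illustration; in practice you will usually construct the one provider you need directly, exactly as GoogleLLM() is used in the pipeline above.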
Best Practices
- Frame Timing: Capture frames at meaningful moments (e.g., when user asks "what do you see?")
- Error Handling: Always check if frames are available before processing (see the helper sketch after this list)
- Vision Enablement: Ensure vision=True is set in RoomOptions for frame capture
- Image Quality: Use appropriate resolutions for your use case (1024x1024 recommended for detailed analysis)
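A small helper can combine the Error Handling and Vision Enablement practices above. This is a sketch that wraps the capture-and-reply flow from the earlier example; capture_and_describe is a hypothetical name and the fallback messages are illustrative.

```python
async def capture_and_describe(agent, session, instruction: str, num_of_frames: int = 1):
    """Capture frames defensively and send them to the LLM with an instruction."""
    try:
        frames = agent.capture_frames(num_of_frames=min(num_of_frames, 5))  # API limit is 5
    except ValueError as e:
        print(f"Frame capture failed: {e}")
        return

    if not frames:
        # Usually means vision=True was not set in RoomOptions, or no
        # participant video is available yet.
        print("No frames available; check RoomOptions(vision=True) and participant video.")
        return

    await session.reply(instruction, frames=frames)
```

Call it from any trigger, for example: await capture_and_describe(agent, session, "What do you see right now?").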
Here is an example you can try out: Cascading Pipeline Vision Example
RealTime Pipeline Vision
The RealTimePipeline enables continuous live video processing for real-time visual interactions. Video frames are automatically streamed to the model as they arrive.
Live Video Processing
Live video input is enabled through the vision parameter in RoomOptions and requires Google's Gemini Live model.
```python
from videosdk.agents import Agent, AgentSession, RealTimePipeline, WorkerJob, JobContext, RoomOptions
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig


async def start_session(context: JobContext):
    # Initialize Gemini with vision capabilities
    model = GeminiRealtime(
        model="gemini-2.0-flash-live-001",
        config=GeminiLiveConfig(
            voice="Leda",
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)

    # VisionAgent is the same Agent subclass defined in the CascadingPipeline example above.
    agent = VisionAgent()

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
    )

    await session.start(wait_for_participant=True, run_until_shutdown=True)


def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="<room_id>",
        name="Sandbox Agent",
        playground=True,
        vision=True  # Enable live video processing
    )
    return JobContext(room_options=room_options)
```
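The snippet above imports WorkerJob but stops before wiring it up. Below is a minimal sketch of how the entrypoint and the context factory are typically connected; the keyword names entrypoint and jobctx are assumed from common videosdk-agents examples, so verify them against your installed version.

```python
# Wire the entrypoint and the context factory into a WorkerJob and start it.
# Keyword names (entrypoint, jobctx) are assumptions; check your SDK version.
if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```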
Video Processing Flow
When vision is enabled, the system automatically does the following:
- Continuous Capture: Captures video frames from meeting participants
- Frame Processing: Processes frames at optimal intervals (throttled to 0.5 seconds)
- Model Integration: Sends visual data to the Gemini Live model
- Context Integration: Integrates visual understanding with conversation context
RealTimePipeline Limitations
- Model Restriction: Only works with the GeminiRealtime model
- Network Requirements: Requires stable network connections for optimal performance
- Frame Rate: Automatically throttled to prevent overwhelming the model
Here is an example you can try out: Realtime Pipeline Vision Example
Choosing the Right Approach
| Use Case | Recommended Pipeline | Why |
|---|---|---|
| On-demand frame analysis | CascadingPipeline | Capture frames only when needed, works with all LLM providers |
| Document/image Q&A | CascadingPipeline | Process static images with custom instructions |
| Real-time video commentary | RealTimePipeline | Continuous streaming for live visual interactions |
| Multi-provider support | CascadingPipeline | Works with OpenAI, Anthropic, and Google |
| Lowest latency | RealTimePipeline | Direct streaming to Gemini Live model |
Examples - Try It Yourself
Check out examples of using the Realtime and Cascading Pipeline vision functionality:
Cascading Pipeline Vision
On-demand frame capture and static image processing
Realtime Pipeline Vision
Continuous video streaming with Gemini Realtime API
Frequently Asked Questions
Can I use vision with any LLM provider?
CascadingPipeline vision works with OpenAI, Anthropic, and Google LLMs. RealTimePipeline vision only works with Google's Gemini Live model.
How do I capture frames at specific moments?
Use event-driven triggers like PubSub messages or user speech to call agent.capture_frames() at the right time. See the example code above for implementation details.
What's the difference between frame capture and continuous streaming?
Frame capture (CascadingPipeline) captures frames on-demand when you call capture_frames(). Continuous streaming (RealTimePipeline) automatically sends video frames to the model in real-time.
Got a Question? Ask us on Discord

