Vision & Multi-modality

Vision and multi-modal capabilities enable your AI agents to process and understand visual content alongside text and audio. This creates richer, more interactive experiences where agents can analyze images, respond to visual cues, and engage in conversations about what they see.

The VideoSDK Agents framework supports vision capabilities through two distinct pipeline architectures, each with different capabilities and use cases.

Pipeline Architecture Overview

The framework provides two pipeline types with different vision support:

| Pipeline Type | Vision Capabilities | Supported Models | Use Cases |
|---|---|---|---|
| CascadingPipeline | Live frame capture & static images | OpenAI, Anthropic, Google | On-demand frame analysis, document analysis, visual Q&A |
| RealTimePipeline | Continuous live video streaming | Google Gemini Live only | Real-time visual interactions, live video commentary |

Cascading Pipeline Vision

The CascadingPipeline supports vision through two approaches: capturing live video frames from participants, or processing static images. This works with all supported LLM providers (OpenAI, Anthropic, Google).

Enabling Vision

Enable vision capabilities by setting vision=True in RoomOptions:

from videosdk.agents import JobContext, RoomOptions

room_options = RoomOptions(
    room_id="your-room-id",
    name="Vision Agent",
    vision=True  # Enable vision capabilities
)

job_context = JobContext(room_options=room_options)

Live Frame Capture

Capture video frames from meeting participants on-demand using agent.capture_frames():

import asyncio

from videosdk.agents import Agent, AgentSession, CascadingPipeline, ConversationFlow, JobContext
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.google import GoogleLLM
# VAD and turn-detector plugins, installed via the corresponding videosdk plugin packages
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

class VisionAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant that can analyze images."
        )

async def entrypoint(ctx: JobContext):
    agent = VisionAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(),
        llm=GoogleLLM(),
        tts=ElevenLabsTTS(),
        vad=SileroVAD(),
        turn_detector=TurnDetector()
    )
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )

    shutdown_event = asyncio.Event()

    async def on_pubsub_message(message):
        print("Pubsub message received:", message)
        if isinstance(message, dict) and message.get("message") == "capture_frames":
            print("Capturing frame....")
            try:
                frames = agent.capture_frames(num_of_frames=1)
                if frames:
                    print(f"Captured {len(frames)} frame(s)")
                    await session.reply(
                        "Please analyze this frame and describe what you see in detail, within one line.",
                        frames=frames
                    )
                else:
                    print("No frames available. Make sure vision is enabled in RoomOptions.")
            except ValueError as e:
                print(f"Error: {e}")

    def on_pubsub_message_wrapper(message):
        asyncio.create_task(on_pubsub_message(message))

    # rest of the code..
tip

capture_frames() returns a list of frames, and you can request at most 5 frames per call (num_of_frames <= 5).
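
For example, here is a minimal sketch, reusing the agent and session objects from the example above, that requests several frames at once and sends them all in a single reply:

# Minimal sketch: request up to 5 frames at once and send them in a single reply.
async def analyze_recent_frames():
    frames = agent.capture_frames(num_of_frames=3)  # num_of_frames must be <= 5
    if frames:
        await session.reply(
            "Compare these frames and summarize any changes you notice.",
            frames=frames
        )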

Key Features:

  • On-Demand Capture: Capture frames only when needed, triggered by events or user requests
  • Event-Driven: Use PubSub or other triggers to capture frames at the right moment
  • Flexible Analysis: Send custom instructions along with frames for specific analysis tasks
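
To illustrate the flexible-analysis point, here is a small sketch, again reusing the agent and session objects from the example above, that picks a different instruction depending on the requested task. The task names and prompts are only placeholders:

# Placeholder task-to-prompt mapping; adjust the prompts to your own use case.
ANALYSIS_PROMPTS = {
    "describe": "Describe what you see in this frame in one sentence.",
    "read_text": "Read out any text visible in this frame.",
    "count_people": "How many people are visible in this frame?",
}

async def analyze_frame(task: str):
    frames = agent.capture_frames(num_of_frames=1)
    if frames:
        prompt = ANALYSIS_PROMPTS.get(task, "Describe this frame briefly.")
        await session.reply(prompt, frames=frames)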

Silent Capture (Saving Captured Frames)

You can save captured video frames to disk for later analysis or debugging. The frames returned by agent.capture_frames() are av.VideoFrame objects that can be converted to JPEG images.
This is called "silent" capture because it does not trigger any agent speech announcing that an image is being captured, unless you explicitly make it do so.

main.py
import io

from av import VideoFrame
from PIL import Image

def save_frame_as_jpeg(frame: VideoFrame, filename: str) -> None:
    """Save a video frame as a JPEG file."""
    img = frame.to_image()  # Convert to a PIL Image
    img.save(filename, format="JPEG")

# In your agent code
frames = agent.capture_frames(num_of_frames=1)
if frames:
    # Save the first frame
    save_frame_as_jpeg(frames[0], "captured_frame.jpg")

    # Or save as bytes for uploading/processing
    buffer = io.BytesIO()
    frames[0].to_image().save(buffer, format="JPEG")
    jpeg_bytes = buffer.getvalue()

Use Cases:

  • Debugging: Save frames to verify what the agent is seeing
  • Logging: Archive frames for audit trails or quality assurance (see the archiving sketch after this list)
  • Preprocessing: Save frames before sending to external vision APIs
  • Thumbnails: Generate preview images for user interfaces
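
As an example of the logging use case, here is a small sketch that builds on the save_frame_as_jpeg helper shown above and writes each captured frame to a timestamped file in an archive directory:

import os
from datetime import datetime, timezone

def archive_frames(frames, directory: str = "frame_archive") -> list:
    """Save each captured frame as a timestamped JPEG and return the file paths."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for index, frame in enumerate(frames):
        timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
        path = os.path.join(directory, f"frame_{timestamp}_{index}.jpg")
        save_frame_as_jpeg(frame, path)  # helper defined above
        paths.append(path)
    return paths

# In your agent code
frames = agent.capture_frames(num_of_frames=3)
if frames:
    saved = archive_frames(frames)
    print(f"Archived {len(saved)} frame(s): {saved}")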

Static Image Processing

For pre-existing images or URLs, use the ImageContent class:

from videosdk.agents import ChatRole, ImageContent

# Add image from URL
agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[ImageContent(image="https://example.com/image.jpg")]
)

# Add image with custom settings
image_content = ImageContent(
    image="https://example.com/document.png",
    inference_detail="high"  # "auto", "high", or "low"
)

agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[image_content]
)
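
If the image lives on disk rather than behind a URL, one possible approach is to encode it yourself as a base64 data URL. This is only a sketch: it assumes ImageContent also accepts data URLs, in line with the base64 support noted in the provider table below.

import base64

from videosdk.agents import ChatRole, ImageContent

# Read a local image and wrap it in a base64 data URL (assumed to be accepted by ImageContent).
with open("local_photo.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/jpeg;base64,{encoded}"

agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[ImageContent(image=data_url)]
)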

Provider Support

All major LLM providers support vision in CascadingPipeline:

| Provider | Vision Models | Capabilities |
|---|---|---|
| OpenAI | GPT-4 Vision models | Configurable detail levels, URL & base64 support |
| Anthropic | Claude 3 models | Advanced image understanding, document analysis |
| Google | Gemini models | Comprehensive visual analysis, multi-image support |

Best Practices

  • Frame Timing: Capture frames at meaningful moments (e.g., when user asks "what do you see?")
  • Error Handling: Always check if frames are available before processing
  • Vision Enablement: Ensure vision=True is set in RoomOptions for frame capture
  • Image Quality: Use appropriate resolutions for your use case (1024x1024 recommended for detailed analysis)
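
For the image-quality point above, here is a minimal sketch that downscales a captured frame with Pillow (already used for frame.to_image()) before saving it or sending it to an external service:

from av import VideoFrame

def downscale_frame(frame: VideoFrame, max_size: int = 1024):
    """Return a PIL image no larger than max_size on either side."""
    img = frame.to_image()
    img.thumbnail((max_size, max_size))  # resizes in place, preserving aspect ratio
    return img

# In your agent code
frames = agent.capture_frames(num_of_frames=1)
if frames:
    small = downscale_frame(frames[0])
    small.save("captured_frame_1024.jpg", format="JPEG")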

Here is an example you can try out: Cascading Pipeline Vision Example


RealTime Pipeline Vision

The RealTimePipeline enables continuous live video processing for real-time visual interactions. Video frames are automatically streamed to the model as they arrive.

Live Video Processing

Live video input is enabled through the vision parameter in RoomOptions and requires Google's Gemini Live model.

main.py
from videosdk.agents import Agent, AgentSession, RealTimePipeline, WorkerJob, JobContext, RoomOptions
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig

# VisionAgent is the same Agent subclass defined in the Cascading Pipeline example above.

async def start_session(context: JobContext):
    # Initialize Gemini with vision capabilities
    model = GeminiRealtime(
        model="gemini-2.0-flash-live-001",
        config=GeminiLiveConfig(
            voice="Leda",
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)
    agent = VisionAgent()

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
    )

    await session.start(wait_for_participant=True, run_until_shutdown=True)

# Enable live video processing
def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="<room_id>",
        name="Sandbox Agent",
        playground=True,
        vision=True
    )
    return JobContext(room_options=room_options)
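
The WorkerJob imported above is what actually runs the entrypoint. Here is a minimal sketch of wiring it up, assuming WorkerJob accepts the entrypoint coroutine and a context factory, as in other VideoSDK agent examples:

# Sketch only: run the realtime vision agent as a worker job.
if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()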

Video Processing Flow

When vision is enabled, the system automatically does the following:

  1. Continuous Capture: Captures video frames from meeting participants
  2. Frame Processing: Processes frames at optimal intervals (throttled to 0.5 seconds)
  3. Model Integration: Sends visual data to the Gemini Live model
  4. Context Integration: Integrates visual understanding with conversation context

RealTimePipeline Limitations

  • Model Restriction: Only works with GeminiRealtime model
  • Network Requirements: Requires stable network connections for optimal performance
  • Frame Rate: Automatically throttled to prevent overwhelming the model

Here is an example you can try out: Realtime Pipeline Vision Example

Choosing the Right Approach

| Use Case | Recommended Pipeline | Why |
|---|---|---|
| On-demand frame analysis | CascadingPipeline | Capture frames only when needed, works with all LLM providers |
| Document/image Q&A | CascadingPipeline | Process static images with custom instructions |
| Real-time video commentary | RealTimePipeline | Continuous streaming for live visual interactions |
| Multi-provider support | CascadingPipeline | Works with OpenAI, Anthropic, and Google |
| Lowest latency | RealTimePipeline | Direct streaming to the Gemini Live model |

Examples - Try Out Yourself

Check out examples of the Realtime and Cascading Pipeline vision functionality.

Frequently Asked Questions

Can I use vision with any LLM provider?

CascadingPipeline vision works with OpenAI, Anthropic, and Google LLMs. RealTimePipeline vision only works with Google's Gemini Live model.

How do I capture frames at specific moments?

Use event-driven triggers like PubSub messages or user speech to call agent.capture_frames() at the right time. See the example code above for implementation details.

What's the difference between frame capture and continuous streaming?

Frame capture (CascadingPipeline) captures frames on-demand when you call capture_frames(). Continuous streaming (RealTimePipeline) automatically sends video frames to the model in real-time.

Got a question? Ask us on Discord.