Vision & Multi-modality
Vision and multi-modal capabilities enable your AI agents to process and understand visual content alongside text and audio. This creates richer, more interactive experiences where agents can analyze images, respond to visual cues, and engage in conversations about what they see.
The VideoSDK Agents framework supports vision capabilities through two distinct pipeline architectures, each with different capabilities and use cases.
Pipeline Architecture Overview
The framework provides two pipeline types with different vision support:
| Pipeline Type | Vision Capabilities | Supported Models | Use Cases | 
|---|---|---|---|
| RealTimePipeline | Live video processing | Google Gemini Live only | Real-time visual interactions, live video commentary | 
| CascadingPipeline | Static image processing | OpenAI, Anthropic, Google | Document analysis, image description, visual Q&A | 
RealTimePipeline Vision
The RealTimePipeline enables live video processing for real-time visual interactions. This pipeline processes video frames as they arrive and integrates visual understanding directly into the conversation flow.
Live Video Processing
Live video input is enabled through the vision parameter in RoomOptions and requires Google's Gemini Live model.
from videosdk.agents import JobContext, RoomOptions, AgentSession, RealTimePipeline  
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig  
  
# Enable live video processing  
job_context = JobContext(  
    room_options=RoomOptions(  
        room_id="your-room-id",  
        name="Visual Agent",  
        vision=True  # Enable live video processing  
    )  
)  
  
# Configure Gemini with vision capabilities  
model = GeminiRealtime(  
    model="gemini-2.0-flash-live-001",  
    config=GeminiLiveConfig(  
        response_modalities=["TEXT", "AUDIO"]  
    )  
)  
  
pipeline = RealTimePipeline(model=model)
Video Processing Flow
When vision is enabled, the system automatically does the following:
- Video Frame Capture: Captures video frames from meeting participants
- Frame Processing: Processes frames at optimal intervals (throttled to 0.5 seconds)
- Model Integration: Sends visual data to the Gemini Live model
- Context Integration: Integrates visual understanding with conversation context
The video processing is handled through the handle_video_input method.
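The throttling step above can be pictured with a small sketch. It does not show the SDK's internal code: `FrameThrottler` and `should_process` are hypothetical names, and it assumes "throttled to 0.5 seconds" means that at most one frame is forwarded per 0.5-second interval.

```python
import time
from typing import Callable, Optional


class FrameThrottler:
    """Hypothetical helper: lets a frame through at most once per interval."""

    def __init__(
        self,
        min_interval: float = 0.5,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_sent: Optional[float] = None

    def should_process(self) -> bool:
        """Return True if enough time has passed since the last forwarded frame."""
        now = self._clock()
        if self._last_sent is None or now - self._last_sent >= self._min_interval:
            self._last_sent = now
            return True
        return False
```

A frame handler would call `should_process()` for each incoming frame and drop the frame when it returns False, so the model receives a steady, bounded stream regardless of the camera's frame rate.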
RealTimePipeline Limitations
- Model Restriction: Only works with the GeminiRealtime model
- Network Requirements: Requires stable network connections for optimal performance
- Frame Rate: Automatically throttled to prevent overwhelming the model
An example you can try out: Realtime Pipeline Vision Example
CascadingPipeline Vision
The CascadingPipeline supports static image processing through the ImageContent class, enabling agents to analyze and discuss images shared by users. This works with all supported LLM providers.
Static Image Processing
Images are added to the conversation context using the ImageContent class:
from videosdk.agents import ChatRole, ImageContent, EncodeOptions, ResizeOptions  
  
# Add image from URL (run inside a method where self.agent is available, e.g. in a ConversationFlow)
self.agent.chat_context.add_message(  
    role=ChatRole.USER,  
    content=[ImageContent(image="https://example.com/image.jpg")]  
)  
  
# Add image with custom settings  
image_content = ImageContent(  
    image="https://example.com/document.png",  
    inference_detail="high",  # "auto", "high", or "low"  
    encode_options=EncodeOptions(  
        format="JPEG",  
        quality=90,  
        resize_options=ResizeOptions(width=1024, height=768)  
    )  
)  
  
self.agent.chat_context.add_message(  
    role=ChatRole.USER,  
    content=[image_content]  
)
Image Configuration Options
The ImageContent class provides extensive configuration for image processing.
| Parameter | Options | Description | 
|---|---|---|
| inference_detail | "auto", "high", "low" | Detail level for LLM image analysis | 
| encode_options.format | "JPEG", "PNG" | Image encoding format | 
| encode_options.quality | 1-100 | JPEG quality setting | 
| resize_options | ResizeOptions(width, height) | Image resizing configuration | 
Note: inference_detail applies only to OpenAI models.
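When choosing width and height values for ResizeOptions, it can help to compute dimensions that fit a target box without distorting the image. The sketch below is a standalone calculation, not part of the SDK: `fit_within` is a hypothetical helper, and whether ResizeOptions itself preserves aspect ratio is not stated here, so computing the values up front is only a precaution.

```python
from typing import Tuple


def fit_within(
    width: int,
    height: int,
    max_width: int = 1024,
    max_height: int = 768,
) -> Tuple[int, int]:
    """Scale (width, height) down to fit the box, keeping the aspect ratio.

    Images that already fit are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 4000x3000 photo maps to 1024x768, while an 800x600 image is left as it is. The resulting pair could then be passed as `ResizeOptions(width=..., height=...)`.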
Integration with Conversation Flow
from videosdk.agents import ConversationFlow, ChatRole, ImageContent  
from typing import AsyncIterator  
  
class VisualConversationFlow(ConversationFlow):  
    async def run(self, transcript: str) -> AsyncIterator[str]:  
        await self.on_turn_start(transcript)  
          
        # Add visual context when user mentions images  
        if "look at this" in transcript.lower():  
            self.agent.chat_context.add_message(  
                role=ChatRole.USER,  
                content=[ImageContent(image="user_shared_image_url")]  
            )  
          
        async for response_chunk in self.process_with_llm():  
            yield response_chunk  
              
        await self.on_turn_end()
Provider Support
| OpenAI Vision | Anthropic Claude | Google Gemini | 
|---|---|---|
| Supports GPT-4 Vision models | Advanced image understanding capabilities | Static image processing through chat interface | 
| Configurable inference detail levels | Document and diagram analysis | Comprehensive visual analysis | 
| Handles both URLs and base64 images | Multi-image conversations | Document understanding | 
An example you can try out: Cascading Pipeline Vision Example
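The provider table mentions base64 images alongside URLs. A common way to carry base64 image data in a single string is a data URL, which the standard library can build. In the sketch below, `to_data_url` is a hypothetical helper; whether ImageContent accepts a data URL in its `image` parameter is an assumption to verify against the SDK reference.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Typical usage would read a local file with `open(path, "rb").read()` and pass the bytes, together with the matching MIME type, to `to_data_url`.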
Best Practices
RealTimePipeline Best Practices
- Network Stability: Ensure stable network connections for optimal live video processing.
- Model Selection: Use only the GeminiRealtime model for live video capabilities.
- Performance: Frame processing is automatically optimized and throttled.
CascadingPipeline Best Practices
- Image Quality: Use appropriate resolutions (1024x1024 recommended for detailed analysis).
- Inference Detail: Choose "high" for detailed analysis, "low" for quick processing.
- Token Management: Monitor token usage with high-detail image processing.
- Provider Selection: Choose LLM provider based on specific vision capabilities needed.
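To make the token management point concrete, here is a rough estimator for image token cost. It is a sketch that follows the tile-based accounting OpenAI has published for GPT-4-class vision models (a flat 85 tokens at low detail; at high detail, 170 tokens per 512 px tile plus 85). The function name is hypothetical, exact figures vary by model, and other providers count differently, so treat the result as an estimate only.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate token cost of one image under tile-based accounting."""
    if detail == "low":
        return 85  # flat cost, independent of image size

    # Step 1: shrink to fit within a 2048 x 2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count the 512 px tiles needed to cover the image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

Under these assumptions, a 1024x1024 image at high detail costs about 765 tokens, against 85 at low detail, which is why the detail level is worth choosing deliberately.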
General Multi-modal Design
- Context Integration: Provide clear context for image analysis requests.
- Fallback Handling: Handle cases where visual processing isn't available.
- User Experience: Combine visual and text inputs naturally in conversation flow.
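One way to approach fallback handling is to decide, before calling the LLM, whether an image can be attached at all, and to tell the model when it cannot. The sketch below is framework-independent: `UserTurn` and `build_user_turn` are hypothetical names, and the `vision_available` flag stands in for whatever capability check your setup provides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserTurn:
    """Hypothetical container for one user turn: text plus optional images."""
    text: str
    image_urls: List[str] = field(default_factory=list)


def build_user_turn(
    text: str,
    image_url: Optional[str],
    vision_available: bool,
) -> UserTurn:
    """Attach the image when vision is available; otherwise fall back to text."""
    if image_url and vision_available:
        return UserTurn(text=text, image_urls=[image_url])
    if image_url:
        # Keep the conversation going and let the model explain the limitation.
        note = "(An image was shared, but image analysis is unavailable in this session.)"
        return UserTurn(text=f"{text}\n{note}")
    return UserTurn(text=text)
```

The returned turn can then be translated into the content list your pipeline expects, for instance text plus ImageContent entries in a cascading pipeline.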
Examples - Try Out Yourself
Check out examples of the Realtime and Cascading vision functionality:
Realtime Pipeline Vision
Realtime Vision functionality with Gemini Realtime API
Cascading Pipeline Vision
Vision functionality via Chat in cascading pipeline
Frequently Asked Questions
Does Realtime Pipeline have chat context functionality?
Chat context is currently maintained only in the Cascading Pipeline. Realtime Pipeline models maintain their own chat context internally, so it is not managed explicitly.

