Vision & Multi-modality

Vision and multi-modal capabilities enable your AI agents to process and understand visual content alongside text and audio. This creates richer, more interactive experiences where agents can analyze images, respond to visual cues, and engage in conversations about what they see.

The VideoSDK Agents framework supports vision capabilities through two distinct pipeline architectures, each with different capabilities and use cases.

Pipeline Architecture Overview

The framework provides two pipeline types with different vision support:

| Pipeline Type | Vision Capabilities | Supported Models | Use Cases |
|---|---|---|---|
| CascadingPipeline | Live frame capture & static images | OpenAI, Anthropic, Google | On-demand frame analysis, document analysis, visual Q&A |
| RealTimePipeline | Continuous live video streaming | Google Gemini Live only | Real-time visual interactions, live video commentary |

Cascading Pipeline Vision

The CascadingPipeline supports vision through two approaches: capturing live video frames from participants, or processing static images. This works with all supported LLM providers (OpenAI, Anthropic, Google).

Enabling Vision

Enable vision capabilities by setting vision=True in RoomOptions:

from videosdk.agents import JobContext, RoomOptions

room_options = RoomOptions(
    room_id="your-room-id",
    name="Vision Agent",
    vision=True  # Enable vision capabilities
)

job_context = JobContext(room_options=room_options)

Live Frame Capture

Capture video frames from meeting participants on-demand using agent.capture_frames():

import asyncio

from videosdk.agents import Agent, AgentSession, CascadingPipeline, ConversationFlow, JobContext
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.google import GoogleLLM
# VAD and turn-detector plugins, installed via the corresponding videosdk plugin packages
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

class VisionAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant that can analyze images."
        )

async def entrypoint(ctx: JobContext):
    agent = VisionAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(),
        llm=GoogleLLM(),
        tts=ElevenLabsTTS(),
        vad=SileroVAD(),
        turn_detector=TurnDetector()
    )
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )

    shutdown_event = asyncio.Event()

    async def on_pubsub_message(message):
        print("Pubsub message received:", message)
        if isinstance(message, dict) and message.get("message") == "capture_frames":
            print("Capturing frame....")
            try:
                frames = agent.capture_frames(num_of_frames=1)
                if frames:
                    print(f"Captured {len(frames)} frame(s)")
                    await session.reply(
                        "Please analyze this frame and describe what you see in detail, within one line.",
                        frames=frames
                    )
                else:
                    print("No frames available. Make sure vision is enabled in RoomOptions.")
            except ValueError as e:
                print(f"Error: {e}")

    def on_pubsub_message_wrapper(message):
        asyncio.create_task(on_pubsub_message(message))

    # rest of the code..
tip

capture_frames() returns a list of frames, and you can request at most 5 frames per call (num_of_frames <= 5).
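
For example, here is a minimal sketch, reusing the agent and session objects from the example above, that requests several frames at once and sends them all in a single reply:

# Minimal sketch: request up to 5 frames at once and send them in a single reply.
async def analyze_recent_frames():
    frames = agent.capture_frames(num_of_frames=3)  # num_of_frames must be <= 5
    if frames:
        await session.reply(
            "Compare these frames and summarize any changes you notice.",
            frames=frames
        )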

Key Features:

  • On-Demand Capture: Capture frames only when needed, triggered by events or user requests
  • Event-Driven: Use PubSub or other triggers to capture frames at the right moment
  • Flexible Analysis: Send custom instructions along with frames for specific analysis tasks
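
To illustrate the flexible-analysis point, here is a small sketch, again reusing the agent and session objects from the example above, that picks a different instruction depending on the requested task. The task names and prompts are only placeholders:

# Placeholder task-to-prompt mapping; adjust the prompts to your own use case.
ANALYSIS_PROMPTS = {
    "describe": "Describe what you see in this frame in one sentence.",
    "read_text": "Read out any text visible in this frame.",
    "count_people": "How many people are visible in this frame?",
}

async def analyze_frame(task: str):
    frames = agent.capture_frames(num_of_frames=1)
    if frames:
        prompt = ANALYSIS_PROMPTS.get(task, "Describe this frame briefly.")
        await session.reply(prompt, frames=frames)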

Silent Capture (Saving Captured Frames)

You can save captured video frames to disk for later analysis or debugging. The frames returned by agent.capture_frames() are av.VideoFrame objects that can be converted to JPEG images.
This is called "silent" capture because it does not trigger any agent speech announcing that an image is being captured, unless you explicitly make it do so.

main.py
import io

from av import VideoFrame
from PIL import Image

def save_frame_as_jpeg(frame: VideoFrame, filename: str) -> None:
    """Save a video frame as a JPEG file."""
    img = frame.to_image()  # Convert to a PIL Image
    img.save(filename, format="JPEG")

# In your agent code
frames = agent.capture_frames(num_of_frames=1)
if frames:
    # Save the first frame
    save_frame_as_jpeg(frames[0], "captured_frame.jpg")

    # Or save as bytes for uploading/processing
    buffer = io.BytesIO()
    frames[0].to_image().save(buffer, format="JPEG")
    jpeg_bytes = buffer.getvalue()

Use Cases:

  • Debugging: Save frames to verify what the agent is seeing
  • Logging: Archive frames for audit trails or quality assurance (see the archiving sketch after this list)
  • Preprocessing: Save frames before sending to external vision APIs
  • Thumbnails: Generate preview images for user interfaces
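
As an example of the logging use case, here is a small sketch that builds on the save_frame_as_jpeg helper shown above and writes each captured frame to a timestamped file in an archive directory:

import os
from datetime import datetime, timezone

def archive_frames(frames, directory: str = "frame_archive") -> list:
    """Save each captured frame as a timestamped JPEG and return the file paths."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for index, frame in enumerate(frames):
        timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
        path = os.path.join(directory, f"frame_{timestamp}_{index}.jpg")
        save_frame_as_jpeg(frame, path)  # helper defined above
        paths.append(path)
    return paths

# In your agent code
frames = agent.capture_frames(num_of_frames=3)
if frames:
    saved = archive_frames(frames)
    print(f"Archived {len(saved)} frame(s): {saved}")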

Static Image Processing

For pre-existing images or URLs, use the ImageContent class:

from videosdk.agents import ChatRole, ImageContent

# Add image from URL
agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[ImageContent(image="https://example.com/image.jpg")]
)

# Add image with custom settings
image_content = ImageContent(
    image="https://example.com/document.png",
    inference_detail="high"  # "auto", "high", or "low"
)

agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[image_content]
)
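
If the image lives on disk rather than behind a URL, one possible approach is to encode it yourself as a base64 data URL. This is only a sketch: it assumes ImageContent also accepts data URLs, in line with the base64 support noted in the provider table below.

import base64

from videosdk.agents import ChatRole, ImageContent

# Read a local image and wrap it in a base64 data URL (assumed to be accepted by ImageContent).
with open("local_photo.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/jpeg;base64,{encoded}"

agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[ImageContent(image=data_url)]
)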

Provider Support

All major LLM providers support vision in CascadingPipeline:

| Provider | Vision Models | Capabilities |
|---|---|---|
| OpenAI | GPT-4 Vision models | Configurable detail levels, URL & base64 support |
| Anthropic | Claude 3 models | Advanced image understanding, document analysis |
| Google | Gemini models | Comprehensive visual analysis, multi-image support |

Best Practices

  • Frame Timing: Capture frames at meaningful moments (e.g., when user asks "what do you see?")
  • Error Handling: Always check if frames are available before processing
  • Vision Enablement: Ensure vision=True is set in RoomOptions for frame capture
  • Image Quality: Use appropriate resolutions for your use case (1024x1024 recommended for detailed analysis)
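
For the image-quality point above, here is a minimal sketch that downscales a captured frame with Pillow (already used for frame.to_image()) before saving it or sending it to an external service:

from av import VideoFrame

def downscale_frame(frame: VideoFrame, max_size: int = 1024):
    """Return a PIL image no larger than max_size on either side."""
    img = frame.to_image()
    img.thumbnail((max_size, max_size))  # resizes in place, preserving aspect ratio
    return img

# In your agent code
frames = agent.capture_frames(num_of_frames=1)
if frames:
    small = downscale_frame(frames[0])
    small.save("captured_frame_1024.jpg", format="JPEG")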

Here is an example you can try out: Cascading Pipeline Vision Example


RealTime Pipeline Vision

The RealTimePipeline enables continuous live video processing for real-time visual interactions. Video frames are automatically streamed to the model as they arrive.

Live Video Processing

Live video input is enabled through the vision parameter in RoomOptions and requires Google's Gemini Live model.

main.py
from videosdk.agents import Agent, AgentSession, RealTimePipeline, WorkerJob, JobContext, RoomOptions
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig

# VisionAgent is the same Agent subclass defined in the Cascading Pipeline example above.

async def start_session(context: JobContext):
    # Initialize Gemini with vision capabilities
    model = GeminiRealtime(
        model="gemini-2.0-flash-live-001",
        config=GeminiLiveConfig(
            voice="Leda",
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)
    agent = VisionAgent()

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
    )

    await session.start(wait_for_participant=True, run_until_shutdown=True)

# Enable live video processing
def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="<room_id>",
        name="Sandbox Agent",
        playground=True,
        vision=True
    )
    return JobContext(room_options=room_options)
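
The WorkerJob imported above is what actually runs the entrypoint. Here is a minimal sketch of wiring it up, assuming WorkerJob accepts the entrypoint coroutine and a context factory, as in other VideoSDK agent examples:

# Sketch only: run the realtime vision agent as a worker job.
if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()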

Video Processing Flow

When vision is enabled, the system automatically does the following:

  1. Continuous Capture: Captures video frames from meeting participants
  2. Frame Processing: Processes frames at optimal intervals (throttled to 0.5 seconds)
  3. Model Integration: Sends visual data to the Gemini Live model
  4. Context Integration: Integrates visual understanding with conversation context

RealTimePipeline Limitations

  • Model Restriction: Only works with GeminiRealtime model
  • Network Requirements: Requires stable network connections for optimal performance
  • Frame Rate: Automatically throttled to prevent overwhelming the model

Here is an example you can try out: Realtime Pipeline Vision Example

Choosing the Right Approach

| Use Case | Recommended Pipeline | Why |
|---|---|---|
| On-demand frame analysis | CascadingPipeline | Capture frames only when needed, works with all LLM providers |
| Document/image Q&A | CascadingPipeline | Process static images with custom instructions |
| Real-time video commentary | RealTimePipeline | Continuous streaming for live visual interactions |
| Multi-provider support | CascadingPipeline | Works with OpenAI, Anthropic, and Google |
| Lowest latency | RealTimePipeline | Direct streaming to the Gemini Live model |

Examples - Try Out Yourself

Check out examples of the Realtime and Cascading Pipeline vision functionality.

Frequently Asked Questions

Can I use vision with any LLM provider?

CascadingPipeline vision works with OpenAI, Anthropic, and Google LLMs. RealTimePipeline vision only works with Google's Gemini Live model.

How do I capture frames at specific moments?

Use event-driven triggers like PubSub messages or user speech to call agent.capture_frames() at the right time. See the example code above for implementation details.

What's the difference between frame capture and continuous streaming?

Frame capture (CascadingPipeline) captures frames on-demand when you call capture_frames(). Continuous streaming (RealTimePipeline) automatically sends video frames to the model in real-time.

Got a question? Ask us on Discord.