
Vision & Multi-modality

Vision and multi-modal capabilities enable your AI agents to process and understand visual content alongside text and audio. This creates richer, more interactive experiences where agents can analyze images, respond to visual cues, and engage in conversations about what they see.

The VideoSDK Agents framework supports vision capabilities through two distinct pipeline architectures, each with different capabilities and use cases.

Pipeline Architecture Overview

The framework provides two pipeline types with different vision support:

| Pipeline Type | Vision Capabilities | Supported Models | Use Cases |
| --- | --- | --- | --- |
| RealTimePipeline | Live video processing | Google Gemini Live only | Real-time visual interactions, live video commentary |
| CascadingPipeline | Static image processing | OpenAI, Anthropic, Google | Document analysis, image description, visual Q&A |

RealTimePipeline Vision

The RealTimePipeline enables live video processing for real-time visual interactions. This pipeline processes video frames as they arrive and integrates visual understanding directly into the conversation flow.

Live Video Processing

Live video input is enabled through the vision parameter in RoomOptions and requires Google's Gemini Live model.

main.py
from videosdk.agents import JobContext, RoomOptions, AgentSession, RealTimePipeline
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig

# Enable live video processing
job_context = JobContext(
    room_options=RoomOptions(
        room_id="your-room-id",
        name="Visual Agent",
        vision=True  # Enable live video processing
    )
)

# Configure Gemini with vision capabilities
model = GeminiRealtime(
    model="gemini-2.0-flash-live-001",
    config=GeminiLiveConfig(
        response_modalities=["TEXT", "AUDIO"]
    )
)

pipeline = RealTimePipeline(model=model)
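
To run the visual agent, the pipeline is attached to an agent session. The sketch below continues the snippet above and assumes the usual VideoSDK Agents pattern of an Agent subclass passed to AgentSession, with ctx.connect() and session.start() in the entrypoint; treat the exact names and startup calls as assumptions and adapt them to your agent.

from videosdk.agents import Agent, AgentSession

class VisualAgent(Agent):
    # Hypothetical agent subclass for illustration
    def __init__(self):
        super().__init__(
            instructions="Watch the participant's video and discuss what you see."
        )

async def entrypoint(ctx: JobContext):
    # Reuses `pipeline` from the snippet above (assumed API shape)
    session = AgentSession(agent=VisualAgent(), pipeline=pipeline)
    await ctx.connect()
    await session.start()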

Video Processing Flow

When vision is enabled, the system automatically does the following:

  1. Video Frame Capture: Captures video frames from meeting participants
  2. Frame Processing: Processes frames at optimal intervals (throttled to one frame every 0.5 seconds)
  3. Model Integration: Sends visual data to the Gemini Live model
  4. Context Integration: Integrates visual understanding with conversation context

The video processing is handled through the handle_video_input method.
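
You don't manage these internals yourself, but the throttling step is easy to picture. The sketch below is illustrative only; FrameThrottler is a hypothetical name, not a framework class. It simply drops frames that arrive less than 0.5 seconds after the last frame forwarded to the model.

import time

class FrameThrottler:
    """Illustrative only: mimics the 0.5-second throttling described above."""

    def __init__(self, interval: float = 0.5):
        self.interval = interval
        self._last_sent = 0.0

    def should_send(self) -> bool:
        # Forward a frame only if the interval has elapsed since the last one
        now = time.monotonic()
        if now - self._last_sent >= self.interval:
            self._last_sent = now
            return True
        return False

throttler = FrameThrottler()
frames = ["frame-1", "frame-2", "frame-3"]  # stand-ins for live video frames
forwarded = [f for f in frames if throttler.should_send()]
print(forwarded)  # ['frame-1'] -- the others arrive within the 0.5 s window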

RealTimePipeline Limitations

  • Model Restriction: Only works with GeminiRealtime model
  • Network Requirements: Requires stable network connections for optimal performance
  • Frame Rate: Automatically throttled to prevent overwhelming the model

Here is an example you can try out: Realtime Pipeline Vision Example

CascadingPipeline Vision

The CascadingPipeline supports static image processing through the ImageContent class, enabling agents to analyze and discuss images shared by users. This works with all supported LLM providers.

Static Image Processing

Images are added to the conversation context using the ImageContent class:

main.py
from videosdk.agents import ChatRole, ImageContent, EncodeOptions, ResizeOptions

# Inside your agent or conversation flow, add an image from a URL
self.agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[ImageContent(image="https://example.com/image.jpg")]
)

# Add an image with custom settings
image_content = ImageContent(
    image="https://example.com/document.png",
    inference_detail="high",  # "auto", "high", or "low"
    encode_options=EncodeOptions(
        format="JPEG",
        quality=90,
        resize_options=ResizeOptions(width=1024, height=768)
    )
)

self.agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[image_content]
)

Image Configuration Options

The ImageContent class provides several options for configuring image processing:

| Parameter | Options | Description |
| --- | --- | --- |
| inference_detail | "auto", "high", "low" | Detail level for LLM image analysis |
| encode_options.format | "JPEG", "PNG" | Image encoding format |
| encode_options.quality | 1-100 | JPEG quality setting |
| resize_options | ResizeOptions(width, height) | Image resizing configuration |

Note: inference_detail only works with OpenAI models.

Integration with Conversation Flow

main.py
from videosdk.agents import ConversationFlow, ChatRole, ImageContent
from typing import AsyncIterator

class VisualConversationFlow(ConversationFlow):
    async def run(self, transcript: str) -> AsyncIterator[str]:
        await self.on_turn_start(transcript)

        # Add visual context when the user mentions images
        if "look at this" in transcript.lower():
            self.agent.chat_context.add_message(
                role=ChatRole.USER,
                content=[ImageContent(image="user_shared_image_url")]
            )

        async for response_chunk in self.process_with_llm():
            yield response_chunk

        await self.on_turn_end()
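
In this sketch, the hard-coded "look at this" check and the user_shared_image_url placeholder stand in for your own logic for detecting when a participant has shared an image and resolving its actual URL.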

Provider Support

| OpenAI Vision | Anthropic Claude | Google Gemini |
| --- | --- | --- |
| Supports GPT-4 Vision models | Advanced image understanding capabilities | Static image processing through chat interface |
| Configurable inference detail levels | Document and diagram analysis | Comprehensive visual analysis |
| Handles both URLs and base64 images | Multi-image conversations | Document understanding |
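
Because OpenAI-backed pipelines accept base64 images as well as URLs, a local file can be inlined as a data URL. This is a sketch that assumes ImageContent's image field accepts a data: URL string the same way it accepts an HTTP URL; photo.jpg is a placeholder path.

import base64

from videosdk.agents import ChatRole, ImageContent

# Read a local image and inline it as a base64 data URL (assumed to be
# accepted by ImageContent's `image` field, like an HTTP URL)
with open("photo.jpg", "rb") as f:  # placeholder path
    encoded = base64.b64encode(f.read()).decode("utf-8")

# Inside your agent or conversation flow, as in the snippets above
self.agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[ImageContent(image=f"data:image/jpeg;base64,{encoded}")]
)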

Here is an example you can try out: Cascading Pipeline Vision Example

Best Practices

RealTimePipeline Best Practices

  • Network Stability: Ensure stable network connections for optimal live video processing.
  • Model Selection: Only use GeminiRealtime model for live video capabilities.
  • Performance: Frame processing is automatically optimized and throttled.

CascadingPipeline Best Practices

  • Image Quality: Use appropriate resolutions (1024x1024 recommended for detailed analysis).
  • Inference Detail: Choose "high" for detailed analysis, "low" for quick processing.
  • Token Management: Monitor token usage with high-detail image processing.
  • Provider Selection: Choose LLM provider based on specific vision capabilities needed.

General Multi-modal Design

  • Context Integration: Provide clear context for image analysis requests.
  • Fallback Handling: Handle cases where visual processing isn't available (see the sketch after this list).
  • User Experience: Combine visual and text inputs naturally in conversation flow.
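
For fallback handling, one simple pattern is to attempt the visual path and degrade to plain text when it fails. The helper below is a hypothetical sketch, not a framework API, and passing a plain string as content is an assumption:

from videosdk.agents import ChatRole, ImageContent

def add_image_with_fallback(agent, image_url: str) -> None:
    """Hypothetical helper: attach an image, or fall back to a text note."""
    try:
        agent.chat_context.add_message(
            role=ChatRole.USER,
            content=[ImageContent(image=image_url)]
        )
    except Exception:
        # Visual processing unavailable (e.g. unsupported model or a fetch
        # failure): keep the conversation going with a text-only message
        agent.chat_context.add_message(
            role=ChatRole.USER,
            content="The user shared an image, but it could not be processed."
        )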

Examples - Try It Yourself

Check out the examples of using the Realtime and Cascading Pipeline vision functionality.

Frequently Asked Questions

Does the Realtime Pipeline have chat context functionality?

Chat context functionality is currently maintained only in the Cascading Pipeline. Realtime Pipeline models maintain their own internal context, which is not explicitly managed by the framework.
