
Vision

For supported LLM providers (OpenAI, Anthropic, Google), you can add images to the agent's chat context to leverage their full vision capabilities. Images can be added as URLs or as base64-encoded data, either forwarded from your frontend or supplied directly in your agent code. Additionally, you can use live video with a realtime model such as Gemini Live.

Image Input (Cascading Pipeline)

The agent's chat context supports both image and text input. You can add multiple images in a given session, although larger chat contexts may lead to slower response times.

To add an image, pass an image URL or base64-encoded image data to the ImageContent class. For example:

self.agent.chat_context.add_message(
    role=ChatRole.USER, content=[ImageContent(image="YOUR_IMAGE_URL")]
)
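
If the image is not hosted at a URL, you can base64-encode it yourself. A minimal sketch, assuming ImageContent accepts a data: URI for base64 input (the exact accepted format may vary by SDK version):

import base64

# Read a local image and base64-encode it. The data: URI form shown here
# is an assumption; check your SDK version for the accepted format.
with open("photo.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

self.agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[ImageContent(image=f"data:image/jpeg;base64,{encoded}")],
)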

Sample code for adding image context from a conversation flow:

from typing import AsyncIterator

# Adjust import paths to your SDK version
from videosdk.agents import ChatRole, ConversationFlow, ImageContent

class MyConversationFlow(ConversationFlow):
    def __init__(self, agent, stt=None, llm=None, tts=None):
        super().__init__(agent, stt, llm, tts)

    async def run(self, transcript: str) -> AsyncIterator[str]:
        await self.on_turn_start(transcript)
        # Add image context before running LLM inference
        self.agent.chat_context.add_message(
            role=ChatRole.USER, content=[ImageContent(image="YOUR_IMAGE_URL")]
        )
        async for response_chunk in self.process_with_llm():
            yield response_chunk
        await self.on_turn_end()

    async def on_turn_start(self, transcript: str) -> None:
        self.is_turn_active = True

    async def on_turn_end(self) -> None:
        self.is_turn_active = False
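
To put the flow to work, pass an instance of it into your agent session. A minimal sketch, assuming the AgentSession and CascadingPipeline setup from the quickstart (agent, pipeline, and the constructor arguments shown are illustrative):

# Illustrative wiring; constructor arguments may differ across SDK versions.
session = AgentSession(
    agent=agent,
    pipeline=pipeline,  # e.g. CascadingPipeline(stt=..., llm=..., tts=...)
    conversation_flow=MyConversationFlow(agent),
)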

Inference Detail

If your LLM provider supports it, you can set the inference_detail parameter to "high" or "low" to control token usage and inference quality. The default is "auto", which uses the provider's default setting.

info

Inference detail is currently only supported by OpenAI.
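
For example, a high-detail request could look like the following sketch, which assumes inference_detail is passed as a keyword argument to ImageContent:

# Assumed usage: inference_detail as an ImageContent keyword argument.
self.agent.chat_context.add_message(
    role=ChatRole.USER,
    content=[ImageContent(image="YOUR_IMAGE_URL", inference_detail="high")],
)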

Live Video Input (Realtime Pipeline)

Set the vision parameter to True in RoomOptions to enable live video input. This feature is only supported with the Gemini Live model.

job_context = JobContext(
    room_options=RoomOptions(
        room_id="YOUR_ROOM_ID",
        name="Agent",
        vision=True,
    )
)
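
As a sketch of where these options fit, the job context is typically built in a context factory and consumed by an entrypoint; the WorkerJob wiring below follows the SDK's common quickstart pattern and may differ in your version:

from videosdk.agents import JobContext, RoomOptions, WorkerJob

def make_context() -> JobContext:
    # vision=True enables live video input (Gemini Live only)
    return JobContext(
        room_options=RoomOptions(
            room_id="YOUR_ROOM_ID",
            name="Agent",
            vision=True,
        )
    )

async def entrypoint(ctx: JobContext):
    # Build your realtime pipeline / agent session here, then connect.
    await ctx.connect()

job = WorkerJob(entrypoint=entrypoint, jobctx=make_context)
job.start()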
