Version: 1.0.x

Context Window

Context Window provides automatic conversation history management for the Pipeline. Instead of manually trimming messages, you configure a ContextWindow with token and item budgets — it automatically compresses old turns into summaries and truncates excess items before each LLM call, ensuring your agent maintains long-term memory without exceeding context limits.


tip

Context Window replaces manual context management. All token budgeting, history compression, and truncation is now handled automatically through a single configuration object.

How Context Window Works

Context Window is configured on a Pipeline instance via the context_window parameter. It runs a two-step management cycle before every LLM call:

| Step | Action | Purpose |
| --- | --- | --- |
| 1. Compress | Summarize old conversation turns via LLM | Preserve long-term memory without keeping every message |
| 2. Truncate | Remove oldest non-protected items | Enforce hard token and item count limits |

Three item types are always protected and never removed:

| Protected Item | Reason |
| --- | --- |
| System message | Agent instructions must persist |
| Summary message | Compressed history is the agent's long-term memory |
| Last user message | LLMs require conversation to end with a user turn |
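The protection rules above can be sketched as a simple predicate. This is illustrative only: the dict shape of a history item and the `is_protected` name are assumptions, not the SDK's internals.

```python
def is_protected(item: dict, items: list) -> bool:
    """Illustrative protection check: system messages, summary messages,
    and the final user message are never removed."""
    return (
        item["role"] == "system"
        or item.get("summary") is True
        or (item is items[-1] and item["role"] == "user")
    )
```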
main.py
from videosdk.agents import Agent, Pipeline, AgentSession, JobContext, RoomOptions, WorkerJob, ContextWindow
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.cartesia import CartesiaTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

pipeline = Pipeline(
    stt=DeepgramSTT(),
    llm=OpenAILLM(),
    tts=CartesiaTTS(),
    vad=SileroVAD(),
    turn_detector=TurnDetector(),

    # Configure context window management
    context_window=ContextWindow(
        max_tokens=4000,
        max_context_items=20,
        keep_recent_turns=3,
        max_tool_calls_per_turn=10,
    ),
)

Configuration Parameters

max_tokens

Maximum estimated token budget for the entire conversation history. When exceeded, old turns are compressed then truncated.

Type: int | None
Default: None (no token limit)

Example:

main.py
context_window=ContextWindow(
    max_tokens=4000,  # ~5 city plans + conversation with 3 tools each
)

max_context_items

Maximum number of items (messages + tool calls + tool results) in the context. Either limit can trigger compression/truncation.

Type: int | None
Default: None (no item limit)

Example:

main.py
context_window=ContextWindow(
    max_context_items=20,  # Keep context compact
)

keep_recent_turns

Number of recent user-assistant exchanges kept verbatim during compression. Everything older gets summarized by the LLM.

Type: int
Default: 3

Example:

main.py
context_window=ContextWindow(
    keep_recent_turns=5,  # Keep last 5 exchanges word-for-word
)

max_tool_calls_per_turn

Maximum number of tool calls allowed in a single user turn. This is a safety limit to prevent infinite loops where the LLM keeps requesting tools without ever producing a text response.

Type: int
Default: 10

Example:

main.py
context_window=ContextWindow(
    max_tool_calls_per_turn=10,  # Allow up to 10 sequential tool calls
)
note

For multi-city queries like "Plan for Dubai AND Mumbai", each city requires 3 tool calls. Setting this to 10 gives headroom for 2-3 cities plus any redundant LLM calls.

summary_llm

Optional separate LLM for generating summaries. If not set, the agent's main LLM is used automatically. Use a smaller/cheaper model to reduce costs.

Type: LLM | None
Default: None (uses agent's main LLM)

Example:

main.py
from videosdk.plugins.openai import OpenAILLM

context_window=ContextWindow(
    max_tokens=4000,
    keep_recent_turns=3,
    summary_llm=OpenAILLM(model="gpt-4o-mini"),  # Cheaper model for summaries
)

The manage() Cycle

The manage() method runs automatically before each LLM call. It performs two steps in order:

Step 1: Compress

When the context exceeds the token or item budget and there are enough old turns to compress (more than keep_recent_turns + 1), compression kicks in:

  1. Split — Separate items into old turns and recent turns (keeping the last N user exchanges)
  2. Render — Convert old items into human-readable text for the summarization prompt
  3. Summarize — Call the LLM (or summary_llm) to generate a concise summary
  4. Replace — Remove all old items and insert the summary as an assistant message marked {"summary": True}

What the summary preserves:

  • Key facts, names, and numbers
  • Decisions made and their reasoning
  • Tool/function call results and outcomes
  • Commitments or promises the assistant made
  • User objectives, preferences, and unresolved tasks
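The four compression steps can be sketched roughly as follows. This is a simplified illustration under assumed item shapes: `split_turns`, `render_for_summary`, and `compress` are hypothetical names, not the SDK's actual internals, and the real cycle also checks budgets and protected items before running.

```python
def split_turns(items: list, keep_recent_turns: int):
    """Split history into (old, recent), keeping the last N user exchanges."""
    user_indices = [i for i, it in enumerate(items) if it["role"] == "user"]
    if len(user_indices) <= keep_recent_turns:
        return [], items  # nothing old enough to compress
    cut = user_indices[-keep_recent_turns]
    return items[:cut], items[cut:]

def render_for_summary(old_items: list) -> str:
    """Render old items as human-readable text for the summarization prompt."""
    return "\n".join(f'{it["role"]}: {it.get("content", "")}' for it in old_items)

def compress(items: list, keep_recent_turns: int, summarize) -> list:
    """Replace old turns with a single assistant message marked as a summary."""
    old, recent = split_turns(items, keep_recent_turns)
    if not old:
        return items
    summary_text = summarize(render_for_summary(old))  # LLM (or summary_llm) call
    summary = {"role": "assistant", "content": summary_text, "summary": True}
    return [summary] + recent
```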

Step 2: Truncate

After compression (or if compression wasn't needed), truncation enforces hard limits:

  1. Remove the oldest non-protected items one at a time
  2. Function call/output pairs are removed together to avoid orphaned tool calls
  3. Continue until both max_tokens and max_context_items are satisfied
  4. If only protected items remain, stop even if still over budget
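The truncation rules can be sketched like this. Again a hedged illustration: the item shapes, the `call_id` field used to pair a function call with its output, and the callback signatures are assumptions for the sketch, not the library's API.

```python
def truncate(items: list, max_items, max_tokens, estimate_tokens, is_protected) -> list:
    """Drop oldest non-protected items until both budgets are satisfied.
    A function call and its output share a call_id and are dropped together."""
    items = list(items)

    def over_budget() -> bool:
        return (max_items is not None and len(items) > max_items) or (
            max_tokens is not None and estimate_tokens(items) > max_tokens
        )

    while over_budget():
        victim = next((it for it in items if not is_protected(it)), None)
        if victim is None:
            break  # only protected items remain: stop even if still over budget
        call_id = victim.get("call_id")
        if call_id is not None:
            # remove the call/output pair together to avoid orphaned tool calls
            items = [it for it in items if it.get("call_id") != call_id]
        else:
            items.remove(victim)
    return items
```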

How Tool Chaining Works

Context Window integrates seamlessly with tool chaining. Here's the lifecycle of a multi-tool turn:

  1. User says "Plan for Dubai" → LLM returns get_weather(Dubai)
  2. Tool executes → result added to context → LLM called again
  3. LLM returns get_clothing_advice(22°C) → execute → call LLM again
  4. LLM returns get_activity_suggestion(22°C, "jacket") → execute → call LLM
  5. LLM returns text "Dubai is 22°C, wear a jacket, go hiking!" → spoken by TTS

That's 3 tool calls + 1 text response = 4 rounds, well within max_tool_calls_per_turn=10.

note

Some LLMs (Anthropic Claude, OpenAI GPT-4o) can return multiple tool calls in a single response. These are collected and executed in parallel using asyncio.gather, then all results are added to context before the next LLM call. Google Gemini sends one tool call at a time (always sequential).
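The parallel case can be sketched with `asyncio.gather`. The tool-call dict shape and the `execute_tool_calls` name are illustrative assumptions, not the SDK's API:

```python
import asyncio

async def execute_tool_calls(tool_calls: list, tools: dict) -> list:
    """Run all tool calls from one LLM response concurrently, preserving order."""
    async def run(call):
        fn = tools[call["name"]]
        return {"call_id": call["id"], "output": await fn(**call["arguments"])}
    return list(await asyncio.gather(*(run(c) for c in tool_calls)))

async def main():
    async def get_weather(city):
        return {"city": city, "temperature": 22}
    calls = [
        {"id": "1", "name": "get_weather", "arguments": {"city": "Dubai"}},
        {"id": "2", "name": "get_weather", "arguments": {"city": "Mumbai"}},
    ]
    return await execute_tool_calls(calls, {"get_weather": get_weather})
```

`asyncio.gather` returns results in the same order as the calls, so all outputs can be appended to the context before the next LLM round.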


Complete Example

Here's a full example combining Context Window with tool chaining for a production-ready travel assistant:

main.py
import aiohttp
from videosdk.agents import Agent, AgentSession, Pipeline, function_tool, JobContext, RoomOptions, WorkerJob, ContextWindow
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.cartesia import CartesiaTTS
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

@function_tool
async def get_weather(city: str) -> dict:
    """Get the current weather temperature for a given city."""
    city_coords = {
        "dubai": (25.2048, 55.2708),
        "mumbai": (19.0760, 72.8777),
        "new york": (40.7128, -74.0060),
    }
    lat, lon = city_coords.get(city.lower(), (25.2048, 55.2708))
    url = f"https://api.open-meteo.com/v1/forecast?latitude={lat}&longitude={lon}&current=temperature_2m"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            if response.status == 200:
                data = await response.json()
                temp = data["current"]["temperature_2m"]
                return {"city": city, "temperature": temp, "unit": "Celsius"}
            else:
                return {"city": city, "temperature": 25, "unit": "Celsius", "note": "fallback"}


@function_tool
async def get_clothing_advice(temperature: float) -> dict:
    """Get clothing recommendation based on temperature."""
    if temperature > 35:
        advice = "Very light breathable clothes, hat, and sunscreen."
    elif temperature > 25:
        advice = "Light clothes like t-shirt and shorts."
    elif temperature > 15:
        advice = "Light jacket or sweater with comfortable pants."
    elif temperature > 5:
        advice = "Warm coat, scarf, and layered clothing."
    else:
        advice = "Heavy winter coat, gloves, hat, and thermal layers."
    return {"temperature": temperature, "clothing_advice": advice}


class TravelAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a helpful travel assistant. When a user asks what to do in a city:\n"
                "1. FIRST call get_weather to get the temperature\n"
                "2. THEN call get_clothing_advice with that temperature\n"
                "3. Combine results into a natural spoken response (2-3 sentences max)."
            ),
            tools=[get_weather, get_clothing_advice],
        )

    async def on_enter(self) -> None:
        await self.session.say("Hi! I'm your travel assistant. Ask me about any city!")

    async def on_exit(self) -> None:
        pass


async def start_session(context: JobContext):
    agent = TravelAgent()

    pipeline = Pipeline(
        stt=DeepgramSTT(),
        llm=OpenAILLM(),
        tts=CartesiaTTS(),
        vad=SileroVAD(),
        turn_detector=TurnDetector(),

        # ── Context Window Configuration ───────────────────────────
        #
        # max_tokens: Token budget for the conversation.
        #   With 2 tools per city, each city adds ~150 tokens.
        #   4000 tokens fits ~8 city plans + conversation.
        #
        # max_context_items: Maximum messages + tool calls.
        #   Either limit can trigger compression/truncation.
        #
        # keep_recent_turns: Recent exchanges kept verbatim.
        #   Everything older gets summarized by the LLM.
        #
        # max_tool_calls_per_turn: Safety limit per turn.
        #   Prevents infinite tool-call loops.
        #
        context_window=ContextWindow(
            max_tokens=4000,
            max_context_items=20,
            keep_recent_turns=3,
            max_tool_calls_per_turn=10,
        ),
    )

    session = AgentSession(agent=agent, pipeline=pipeline)
    await session.start(wait_for_participant=True, run_until_shutdown=True)


def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="<room_id>",
        name="Travel Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Token Estimation

Context Window uses a lightweight heuristic for token counting (~4 characters per token). This is fast enough for real-time budget decisions but is not a replacement for provider-reported usage.

| Item Type | Estimation Method |
| --- | --- |
| Text message | `len(text) // 4` |
| Image content | Fixed 300 tokens |
| Function call | `len(name) // 4 + len(arguments) // 4 + 5` |
| Function output | `len(name) // 4 + len(output) // 4 + 5` |
| Per-item overhead | 4 tokens |
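The heuristic in the table above can be written out as a small function. The item dict shape is an assumption for illustration; the constants mirror the table.

```python
def estimate_tokens(item: dict) -> int:
    """Heuristic token estimate (~4 chars per token) matching the table above.
    The item shape is illustrative, not the SDK's internal representation."""
    overhead = 4  # flat per-item overhead
    kind = item.get("type", "text")
    if kind == "text":
        return len(item["text"]) // 4 + overhead
    if kind == "image":
        return 300 + overhead  # fixed cost for image content
    if kind in ("function_call", "function_output"):
        payload = item.get("arguments", item.get("output", ""))
        return len(item["name"]) // 4 + len(payload) // 4 + 5 + overhead
    return overhead
```

For billing-accurate counts, rely on the provider-reported usage instead; this heuristic exists only to make fast budget decisions in real time.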

Parameter Reference

| Parameter | Type | Default | Purpose |
| --- | --- | --- | --- |
| `max_tokens` | `int \| None` | `None` | Token budget for conversation history |
| `max_context_items` | `int \| None` | `None` | Maximum items (messages + tool calls) |
| `keep_recent_turns` | `int` | `3` | Recent exchanges kept verbatim |
| `max_tool_calls_per_turn` | `int` | `10` | Safety limit for tool calls per turn |
| `summary_llm` | `LLM \| None` | `None` | Optional dedicated LLM for summaries |
