Speech Handle
Speech control in VideoSDK agents operates through two complementary layers: session-level methods for initiating speech and utterance-level handles for managing speech lifecycle. This document covers both aspects of controlling agent speech output.
Session-Level Speech Control
The AgentSession provides three primary methods for controlling agent speech output:
1. Say
say(message: str):
Sends a direct message from the agent to meeting participants.
# Basic usage
await session.say("Hello! How can I help you today?")

# In agent lifecycle hooks
class MyAgent(Agent):
    async def on_enter(self):
        await self.session.say("Welcome to the meeting!")
2. Reply
reply(instructions: str, wait_for_playback: bool = True):
Generates agent responses dynamically using custom instructions while maintaining conversation context.
Parameters:
- instructions: Custom instructions for generating the response
- wait_for_playback: When True, prevents user interruptions until playback completes
# Generate immediate response
await session.reply(instructions="Please summarize the conversation so far")

# Wait for complete playback before allowing new inputs
await session.reply(
    instructions="Explain the next steps",
    wait_for_playback=True
)

# Practical example in function tools
class MyAgent(Agent):
    @function_tool
    async def get_summary(self) -> str:
        await self.session.reply(
            instructions="Based on our conversation, let me provide a summary..."
        )
        return "Summary generated"
3. Interrupt
interrupt():
Immediately stops the agent's current speech operation.
# Emergency stop during agent response
session.interrupt()

# User interruption handling
class InteractiveAgent(Agent):
    async def handle_user_input(self, user_input: str):
        if "stop" in user_input.lower():
            self.session.interrupt()
            await self.session.reply(instructions="How can I help you instead?")

    @function_tool
    async def emergency_stop(self) -> str:
        """Stop current agent operation immediately"""
        self.session.interrupt()
        return "Agent stopped successfully"
Utterance-Level Management
UtteranceHandle manages individual agent utterances, preventing overlapping speech and enabling graceful interruption handling.
Core Concepts
- Lifecycle Management
  - Each UtteranceHandle tracks a single utterance from creation through completion.
- Completion States
  An utterance can complete in two ways:
  - Natural Completion: The TTS finishes playing the audio
  - User Interruption: The user starts speaking during playback
- Awaitable Pattern
  - The handle supports Python's async/await syntax for sequential speech control.
API Reference
| Property/Method | Return Type | Description |
|---|---|---|
| id | str | Unique identifier for the utterance |
| done() | bool | Returns True if the utterance is complete |
| interrupted | bool | Returns True if the user interrupted the utterance |
| interrupt() | None | Manually marks the utterance as interrupted |
| __await__() | Generator | Enables awaiting the handle |
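
To make the table above concrete, here is a minimal sketch of an object with the same surface (id, done(), interrupted, interrupt(), __await__()). It is a hypothetical stand-in, not the VideoSDK implementation: the class name SketchUtteranceHandle, the mark_finished() helper, and the use of asyncio.Event internally are all assumptions made for illustration.

```python
import asyncio
import uuid


class SketchUtteranceHandle:
    """Hypothetical stand-in mirroring the table above; not the VideoSDK class."""

    def __init__(self) -> None:
        self.id = f"utt_{uuid.uuid4().hex[:8]}"  # unique identifier
        self._done = asyncio.Event()             # set when the utterance ends
        self._interrupted = False

    def done(self) -> bool:
        # True once the utterance has ended, for either reason
        return self._done.is_set()

    @property
    def interrupted(self) -> bool:
        return self._interrupted

    def interrupt(self) -> None:
        # Interruption also completes the utterance, so awaiters wake up
        self._interrupted = True
        self._done.set()

    def mark_finished(self) -> None:
        # Assumed helper: what a TTS pipeline might call on natural completion
        self._done.set()

    def __await__(self):
        # Delegating to Event.wait() makes `await handle` block until done
        return self._done.wait().__await__()


async def demo() -> tuple:
    natural = SketchUtteranceHandle()
    cut_off = SketchUtteranceHandle()
    loop = asyncio.get_running_loop()
    loop.call_later(0.01, natural.mark_finished)  # simulated end of playback
    loop.call_later(0.01, cut_off.interrupt)      # simulated user interruption
    await natural
    await cut_off
    return natural.interrupted, cut_off.interrupted


if __name__ == "__main__":
    print(asyncio.run(demo()))  # (False, True)
```

Both handles report done() as True after being awaited; only the interrupted flag distinguishes natural completion from a user interruption.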
Usage Patterns
- Sequential Speech
  To prevent overlapping TTS, await each handle before starting the next utterance:

  # Correct approach
  handle1 = self.session.say(f"The current temperature is {temperature}°C.")
  await handle1  # Wait for first utterance to complete
  handle2 = self.session.say("Do you live in this city?")
  await handle2  # Wait for second utterance to complete

- Checking Interruption Status
  Access the current utterance handle via self.session.current_utterance to detect interruptions:

  utterance: UtteranceHandle | None = self.session.current_utterance
  # In long-running operations, check periodically
  for i in range(10):
      if utterance and utterance.interrupted:
          logger.info("Task was interrupted by the user.")
          return "The task was cancelled because you interrupted me."
      await asyncio.sleep(1)
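
The interruption check above can be tried without a live session. The following is a sketch under stated assumptions: FakeUtterance is a hypothetical stand-in that exposes only an interrupted flag, and long_running_task is an illustrative name, not part of the VideoSDK API. It shows the polling pattern only: check the flag once per step and return early when it is set.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeUtterance:
    """Hypothetical stand-in exposing only the `interrupted` flag."""
    interrupted: bool = False


async def long_running_task(
    utterance: Optional[FakeUtterance],
    steps: int = 10,
    step_seconds: float = 0.01,
) -> str:
    for step in range(steps):
        # Poll once per step; a missing handle (None) is treated as "not interrupted"
        if utterance is not None and utterance.interrupted:
            return f"cancelled at step {step}"
        await asyncio.sleep(step_seconds)  # placeholder for one unit of real work
    return "completed"


async def demo() -> tuple:
    # Case 1: nobody interrupts, so the task runs to the end
    uninterrupted = await long_running_task(FakeUtterance())

    # Case 2: the flag flips partway through, simulating the user speaking
    utterance = FakeUtterance()
    asyncio.get_running_loop().call_later(
        0.025, setattr, utterance, "interrupted", True
    )
    interrupted = await long_running_task(utterance)
    return uninterrupted, interrupted


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The shorter each step, the sooner the task notices the interruption; a single long blocking call between checks would delay cancellation by its full duration.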
Best Practices
- Sequential Speech: Always await handles when you need sequential speech to prevent audio overlap
- Interruption Handling: Check interrupted status in long-running operations to enable graceful cancellation
- Handle References: Store handle references if you need to check status later in your function
- Avoid Concurrent Tasks: Don't use create_task() for speech that should play sequentially
Common Use Cases
- Multi-part responses: When function tools need to speak multiple sentences in sequence
- Long-running operations: Tasks that should be cancellable when users interrupt
- Conversational flows: Scenarios requiring precise timing between utterances
FAQs
Troubleshooting
| Issue | Solution |
|---|---|
| Overlapping speech | Use await on handles instead of create_task() |
| Tasks not cancelling on interruption | Check utterance.interrupted in loops |
| Handle is None | session.current_utterance is only available during function tool execution; read it there |
Correct Usage Pattern
✅ Correct: Sequential Speech
Await each handle to prevent overlapping TTS.
handle1 = session.say("First")
await handle1
handle2 = session.say("Second")
await handle2
❌ Incorrect: Concurrent Speech
Using create_task() causes audio overlap.
asyncio.create_task(session.say("First"))
asyncio.create_task(session.say("Second"))
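
The ordering difference between the two patterns can be observed with plain asyncio. This sketch does not call VideoSDK: fake_say is a hypothetical coroutine that stands in for speech playback by logging when it starts and ends, whereas the real say() is described above as returning a handle. Only the scheduling behaviour is being illustrated.

```python
import asyncio


async def fake_say(text: str, log: list, seconds: float = 0.02) -> None:
    """Hypothetical stand-in for playback: records start and end of 'speech'."""
    log.append(f"start {text}")
    await asyncio.sleep(seconds)  # simulated playback time
    log.append(f"end {text}")


async def sequential() -> list:
    log = []
    # Each utterance finishes before the next one begins
    await fake_say("First", log)
    await fake_say("Second", log)
    return log


async def concurrent() -> list:
    log = []
    # Both tasks are scheduled immediately, so "Second" starts
    # while "First" is still playing
    tasks = [
        asyncio.create_task(fake_say("First", log)),
        asyncio.create_task(fake_say("Second", log)),
    ]
    await asyncio.gather(*tasks)
    return log


if __name__ == "__main__":
    print(asyncio.run(sequential()))
    print(asyncio.run(concurrent()))
```

In the sequential log, "end First" appears before "start Second". In the concurrent log, both start entries come before either end entry, which is the overlap the troubleshooting table warns about.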

