
Speech Handle

Speech control in VideoSDK agents operates through two complementary layers: session-level methods for initiating speech and utterance-level handles for managing speech lifecycle. This document covers both aspects of controlling agent speech output.

Session-Level Speech Control

The AgentSession provides three primary methods for controlling agent speech output:

1. Say

say(message: str): Sends a direct message from the agent to meeting participants.

# Basic usage
await session.say("Hello! How can I help you today?")

# In agent lifecycle hooks
class MyAgent(Agent):
    async def on_enter(self):
        await self.session.say("Welcome to the meeting!")

2. Reply

reply(instructions: str, wait_for_playback: bool = True): Generates agent responses dynamically using custom instructions while maintaining conversation context.

Parameters:

  • instructions: Custom instructions for generating the response
  • wait_for_playback: When True, prevents user interruptions until playback completes

# Generate immediate response
await session.reply(instructions="Please summarize the conversation so far")

# Wait for complete playback before allowing new inputs
await session.reply(
    instructions="Explain the next steps",
    wait_for_playback=True
)

# Practical example in function tools
class MyAgent(Agent):
    @function_tool
    async def get_summary(self) -> str:
        await self.session.reply(
            instructions="Based on our conversation, let me provide a summary..."
        )
        return "Summary generated"

3. Interrupt

interrupt(): Immediately stops the agent's current speech operation.

# Emergency stop during agent response
session.interrupt()

# User interruption handling
class InteractiveAgent(Agent):
    async def handle_user_input(self, user_input: str):
        if "stop" in user_input.lower():
            self.session.interrupt()
            await self.session.reply(instructions="How can I help you instead?")

    @function_tool
    async def emergency_stop(self) -> str:
        """Stop current agent operation immediately"""
        self.session.interrupt()
        return "Agent stopped successfully"

Utterance-Level Management

UtteranceHandle manages individual agent utterances, preventing overlapping speech and enabling graceful interruption handling.

Core Concepts

  • Lifecycle Management: Each UtteranceHandle tracks a single utterance from creation through completion.
  • Completion States: An utterance can complete in two ways:

    1. Natural Completion: The TTS finishes playing the audio
    2. User Interruption: The user starts speaking during playback
  • Awaitable Pattern: The handle supports Python's async/await syntax for sequential speech control (see the sketch below).
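
A minimal sketch of the awaitable pattern, branching on how the utterance completed. It assumes the code runs inside an Agent method where self.session is available:

handle = self.session.say("Let me walk you through the results.")
await handle  # resolves on natural completion or user interruption

if handle.interrupted:
    # The user started speaking during playback
    await self.session.reply(instructions="Acknowledge the interruption and ask what the user needs.")
else:
    # The TTS finished playing the audio
    await self.session.say("That covers everything. Any questions?")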

API Reference

Property/Method   Return Type   Description
id                str           Unique identifier for the utterance
done()            bool          Returns True if the utterance is complete
interrupted       bool          Returns True if the user interrupted
interrupt()       None          Manually marks the utterance as interrupted
__await__()       Generator     Enables awaiting the handle
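
A brief sketch that exercises these members from inside an Agent method; the logger here is assumed to be a standard logging.Logger:

handle = self.session.say("Processing your request now.")
logger.info(f"Started utterance {handle.id}")  # unique identifier

await handle  # __await__ lets the handle be awaited directly
if handle.done() and not handle.interrupted:
    logger.info("Utterance played to completion.")

# To stop an utterance programmatically instead, call:
# handle.interrupt()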

Usage Patterns

  • Sequential Speech

    To prevent overlapping TTS, await each handle before starting the next utterance:

    # Correct approach  
    handle1 = self.session.say(f"The current temperature is {temperature}°C.")
    await handle1 # Wait for first utterance to complete

    handle2 = self.session.say("Do you live in this city?")
    await handle2 # Wait for second utterance to complete
  • Checking Interruption Status

    Access the current utterance handle via self.session.current_utterance to detect interruptions:

    utterance: UtteranceHandle | None = self.session.current_utterance

    # In long-running operations, check periodically
    for i in range(10):
        if utterance and utterance.interrupted:
            logger.info("Task was interrupted by the user.")
            return "The task was cancelled because you interrupted me."

        await asyncio.sleep(1)

Best Practices

  • Sequential Speech: Always await handles when you need sequential speech to prevent audio overlap
  • Interruption Handling: Check interrupted status in long-running operations to enable graceful cancellation
  • Handle References: Store handle references if you need to check status later in your function (see the sketch after this list)
  • Avoid Concurrent Tasks: Don't use create_task() for speech that should play sequentially
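
As an illustration of storing a handle reference, a function tool can play filler speech while it does slow work and check the handle's status before continuing. This is a sketch only; WeatherAgent and _lookup_forecast are hypothetical names:

class WeatherAgent(Agent):
    @function_tool
    async def fetch_forecast(self) -> str:
        # Store the handle so its status can be checked later in this function
        handle = self.session.say("Fetching the forecast, one moment...")

        forecast = await self._lookup_forecast()  # hypothetical slow lookup

        await handle  # let the filler line finish before speaking again
        if handle.interrupted:
            return "The lookup was cancelled because you interrupted me."

        await self.session.say(f"Here is the forecast: {forecast}")
        return "Forecast delivered"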

Common Use Cases

  • Multi-part responses: When function tools need to speak multiple sentences in sequence
  • Long-running operations: Tasks that should be cancellable when users interrupt
  • Conversational flows: Scenarios requiring precise timing between utterances

FAQs

Troubleshooting

Issue                                   Solution
Overlapping speech                      Use await on handles instead of create_task()
Tasks not cancelling on interruption    Check utterance.interrupted in loops
Handle is None                          session.current_utterance is only available during function tool execution

Correct Usage Pattern

✅ Correct: Sequential Speech

Await each handle to prevent overlapping TTS.

handle1 = session.say("First")
await handle1
handle2 = session.say("Second")
await handle2

❌ Incorrect: Concurrent Speech

Using create_task() causes audio overlap.

asyncio.create_task(session.say("First"))
asyncio.create_task(session.say("Second"))
