
Speech Handle

Speech control in VideoSDK agents operates through two complementary layers: session-level methods for initiating speech and utterance-level handles for managing speech lifecycle. This document covers both aspects of controlling agent speech output.

Session-Level Speech Control

The AgentSession provides three primary methods for controlling agent speech output:

1. Say

say(message: str): Sends a direct message from the agent to meeting participants.

# Basic usage
await session.say("Hello! How can I help you today?")

# In agent lifecycle hooks
class MyAgent(Agent):
    async def on_enter(self):
        await self.session.say("Welcome to the meeting!")

2. Reply

reply(instructions: str, wait_for_playback: bool = True): Generates agent responses dynamically using custom instructions while maintaining conversation context.

Parameters:

  • instructions: Custom instructions for generating the response
  • wait_for_playback: When True, prevents user interruptions until playback completes

# Generate immediate response
await session.reply(instructions="Please summarize the conversation so far")

# Wait for complete playback before allowing new inputs
await session.reply(
    instructions="Explain the next steps",
    wait_for_playback=True
)

# Practical example in function tools
class MyAgent(Agent):
    @function_tool
    async def get_summary(self) -> str:
        await self.session.reply(
            instructions="Based on our conversation, let me provide a summary..."
        )
        return "Summary generated"

3. Interrupt

interrupt(): Immediately stops the agent's current speech operation.

# Emergency stop during agent response
session.interrupt()

# User interruption handling
class InteractiveAgent(Agent):
    async def handle_user_input(self, user_input: str):
        if "stop" in user_input.lower():
            self.session.interrupt()
            await self.session.reply(instructions="How can I help you instead?")

    @function_tool
    async def emergency_stop(self) -> str:
        """Stop current agent operation immediately"""
        self.session.interrupt()
        return "Agent stopped successfully"

Utterance-Level Management

UtteranceHandle manages individual agent utterances, preventing overlapping speech and enabling graceful interruption handling.

Core Concepts

  • Lifecycle Management: Each UtteranceHandle tracks a single utterance from creation through completion.
  • Completion States: An utterance can complete in two ways:

    1. Natural Completion: The TTS finishes playing the audio
    2. User Interruption: The user starts speaking during playback
  • Awaitable Pattern: The handle supports Python's async/await syntax for sequential speech control (see the sketch below).
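
A minimal sketch of the awaitable pattern, branching on how the utterance completed. It assumes the code runs inside an Agent method where self.session is available:

handle = self.session.say("Let me walk you through the results.")
await handle  # resolves on natural completion or user interruption

if handle.interrupted:
    # The user started speaking during playback
    await self.session.reply(instructions="Acknowledge the interruption and ask what the user needs.")
else:
    # The TTS finished playing the audio
    await self.session.say("That covers everything. Any questions?")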

API Reference

Property/Method   Return Type   Description
id                str           Unique identifier for the utterance
done()            bool          Returns True if the utterance is complete
interrupted       bool          Returns True if the user interrupted
interrupt()       None          Manually marks the utterance as interrupted
__await__()       Generator     Enables awaiting the handle
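
A brief sketch that exercises these members from inside an Agent method; the logger here is assumed to be a standard logging.Logger:

handle = self.session.say("Processing your request now.")
logger.info(f"Started utterance {handle.id}")  # unique identifier

await handle  # __await__ lets the handle be awaited directly
if handle.done() and not handle.interrupted:
    logger.info("Utterance played to completion.")

# To stop an utterance programmatically instead, call:
# handle.interrupt()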

Usage Patterns

  • Sequential Speech

    To prevent overlapping TTS, await each handle before starting the next utterance:

    # Correct approach  
    handle1 = self.session.say(f"The current temperature is {temperature}°C.")
    await handle1 # Wait for first utterance to complete

    handle2 = self.session.say("Do you live in this city?")
    await handle2 # Wait for second utterance to complete
  • Checking Interruption Status

    Access the current utterance handle via self.session.current_utterance to detect interruptions:

    utterance: UtteranceHandle | None = self.session.current_utterance

    # In long-running operations, check periodically
    for i in range(10):
        if utterance and utterance.interrupted:
            logger.info("Task was interrupted by the user.")
            return "The task was cancelled because you interrupted me."

        await asyncio.sleep(1)

Best Practices

  • Sequential Speech: Always await handles when you need sequential speech to prevent audio overlap
  • Interruption Handling: Check interrupted status in long-running operations to enable graceful cancellation
  • Handle References: Store handle references if you need to check status later in your function (see the sketch after this list)
  • Avoid Concurrent Tasks: Don't use create_task() for speech that should play sequentially
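
As an illustration of storing a handle reference, a function tool can play filler speech while it does slow work and check the handle's status before continuing. This is a sketch only; WeatherAgent and _lookup_forecast are hypothetical names:

class WeatherAgent(Agent):
    @function_tool
    async def fetch_forecast(self) -> str:
        # Store the handle so its status can be checked later in this function
        handle = self.session.say("Fetching the forecast, one moment...")

        forecast = await self._lookup_forecast()  # hypothetical slow lookup

        await handle  # let the filler line finish before speaking again
        if handle.interrupted:
            return "The lookup was cancelled because you interrupted me."

        await self.session.say(f"Here is the forecast: {forecast}")
        return "Forecast delivered"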

Common Use Cases

  • Multi-part responses: When function tools need to speak multiple sentences in sequence
  • Long-running operations: Tasks that should be cancellable when users interrupt
  • Conversational flows: Scenarios requiring precise timing between utterances

FAQs

Troubleshooting

Issue                                   Solution
Overlapping speech                      Use await on handles instead of create_task()
Tasks not cancelling on interruption    Check utterance.interrupted in loops
Handle is None                          session.current_utterance is only available during function tool execution

Correct Usage Pattern

✅ Correct: Sequential Speech

Await each handle to prevent overlapping TTS.

handle1 = session.say("First")
await handle1
handle2 = session.say("Second")
await handle2

❌ Incorrect: Concurrent Speech

Using create_task() causes audio overlap.

asyncio.create_task(session.say("First"))
asyncio.create_task(session.say("Second"))
