Turn Detection and Voice Activity Detection
Turn detection enables your AI agents to have natural, human-like conversations by knowing when to listen and when to respond. It prevents agents from interrupting users mid-sentence and ensures they respond at appropriate moments in the conversation.
Overview
- Voice Activity Detection (VAD): Detects when speech starts and stops in the audio stream to monitor voice activity and handle interruptions.
- Turn Detection/End-of-Utterance Detection (EOU): Analyzes conversation context to determine if the user expects a response.
Together, these create a smooth conversation flow: your agent waits for users to finish their thoughts before responding, and stops speaking if the user interrupts mid-response.
Voice Activity Detection (VAD)
VAD monitors voice activity to detect when speech begins and ends in the audio stream, helping your agent know when someone is talking.
Basic Setup
from videosdk.plugins.silero import SileroVAD
# Configure VAD for your environment
vad = SileroVAD(
    threshold=0.5,              # Sensitivity (0.3-0.8)
    min_speech_duration=0.1,    # Ignore very brief sounds
    min_silence_duration=0.75   # Wait time before considering speech ended
)
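To build intuition for how these three parameters interact, here is a minimal sketch of a segmenter that works on per-frame speech probabilities. It is not the Silero or VideoSDK implementation: `VadParams`, `find_speech_segments`, the fixed frame length, and the assumption that both durations are in seconds are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VadParams:
    threshold: float = 0.5              # a frame counts as speech at or above this probability
    min_speech_duration: float = 0.1    # assumed seconds of speech needed to open a segment
    min_silence_duration: float = 0.75  # assumed seconds of silence needed to close a segment


def find_speech_segments(
    probs: List[float], frame_s: float, p: VadParams
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for each detected speech segment."""
    # Durations are rounded to whole frames, with a minimum of one frame.
    min_speech = max(1, round(p.min_speech_duration / frame_s))
    min_silence = max(1, round(p.min_silence_duration / frame_s))

    segments = []
    seg_start = None  # frame index where the open segment began, if any
    speech_run = 0    # consecutive speech frames while no segment is open
    silence_run = 0   # consecutive silent frames while a segment is open

    for i, prob in enumerate(probs):
        is_speech = prob >= p.threshold
        if seg_start is None:
            # Short blips never reach min_speech frames, so they are ignored.
            speech_run = speech_run + 1 if is_speech else 0
            if speech_run >= min_speech:
                seg_start = i - speech_run + 1
                silence_run = 0
        else:
            # Short pauses never reach min_silence frames, so the segment stays open.
            silence_run = 0 if is_speech else silence_run + 1
            if silence_run >= min_silence:
                end = i - silence_run + 1  # index of the first silent frame
                segments.append((seg_start * frame_s, end * frame_s))
                seg_start = None
                speech_run = 0

    if seg_start is not None:
        # The stream ended mid-segment: close it at the last speech frame.
        end = len(probs) - silence_run
        segments.append((seg_start * frame_s, end * frame_s))
    return segments
```

In this sketch, raising `threshold` makes fewer frames count as speech, while raising `min_silence_duration` lets longer pauses stay inside one segment at the cost of a slower end-of-speech decision.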
Configuration Guidelines
Environment | Threshold | Use Case |
---|---|---|
Quiet Office | 0.3-0.4 | High sensitivity for soft speech |
Normal Room | 0.5-0.6 | Balanced detection |
Noisy Environment | 0.7-0.8 | Reduce false triggers |
For better voice activity detection, use the Denoise component (background noise removal) alongside VAD.
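If you select the threshold programmatically, the table above can be encoded as a small lookup. The sketch below is an illustration only: the dictionary keys and the `clamp_threshold` helper are hypothetical names, not part of the SDK.

```python
# Recommended ranges copied from the table above; the dictionary keys are
# hypothetical labels chosen for this sketch.
RECOMMENDED_THRESHOLDS = {
    "quiet_office": (0.3, 0.4),
    "normal_room": (0.5, 0.6),
    "noisy": (0.7, 0.8),
}


def clamp_threshold(environment: str, threshold: float) -> float:
    """Pull a requested threshold into the recommended range for the environment."""
    low, high = RECOMMENDED_THRESHOLDS[environment]
    return min(max(threshold, low), high)
```

The returned value could then be passed as `threshold` when constructing `SileroVAD`.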
Turn Detection
Turn detection, also called end-of-utterance (EOU) detection, analyzes conversation context to determine whether the user expects a response, distinguishing between statements, questions, and incomplete thoughts.
Basic Setup
from videosdk.plugins.turn_detector import TurnDetector
# Configure EOU detection
turn_detector = TurnDetector(
    threshold=0.7  # Confidence level for response triggers
)
What EOU Detects
- Questions: "What's the weather like?" → Agent responds
- Commands: "Set a reminder for 3 PM" → Agent responds
- Incomplete thoughts: "I was thinking about..." → Agent waits
- Pauses: "Let me see... actually..." → Agent waits
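One way to read the `threshold` setting against the cases above is as a cutoff on an end-of-utterance score. The sketch below assumes the detector produces a probability between 0 and 1 and that a response is triggered when the score reaches the threshold; both points are assumptions for illustration, `should_respond` is a hypothetical helper, and the scores are made up rather than measured.

```python
def should_respond(eou_probability: float, threshold: float = 0.7) -> bool:
    """Respond only when the end-of-utterance score reaches the threshold."""
    return eou_probability >= threshold


# Illustrative scores only; a real detector derives them from conversation context.
examples = {
    "What's the weather like?": 0.95,
    "I was thinking about...": 0.20,
}
for text, score in examples.items():
    print(text, "->", "respond" if should_respond(score) else "wait")
```

Under this reading, a higher threshold makes the agent more patient, and a lower one makes it respond sooner at the risk of cutting in on an unfinished thought.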
Pipeline Integration
Add turn detection to your existing cascading pipeline:
from videosdk.agents import CascadingPipeline
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector
pipeline = CascadingPipeline(
    stt=your_stt_provider,
    llm=your_llm_provider,
    tts=your_tts_provider,
    vad=SileroVAD(threshold=0.5),
    turn_detector=TurnDetector(threshold=0.7)
)
The OpenAI Realtime API has built-in turn detection, so external VAD and turn detection components are not needed with RealTimePipeline.
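For the cascading pipeline, the division of labor between the two components can be pictured as a small decision function: VAD reports who is speaking, and the turn detector reports whether the user's last utterance was complete. This is a conceptual sketch, not the pipeline's actual internals; `Action` and `next_action` are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"            # keep listening
    RESPOND = "respond"      # start the agent's reply
    INTERRUPT = "interrupt"  # stop the agent's current speech


def next_action(user_speaking: bool, agent_speaking: bool, turn_complete: bool) -> Action:
    """Decide what the agent should do from VAD and turn-detector signals."""
    if user_speaking and agent_speaking:
        return Action.INTERRUPT  # the user cut in while the agent was talking
    if user_speaking:
        return Action.WAIT       # the user is still talking
    if turn_complete and not agent_speaking:
        return Action.RESPOND    # the user finished a complete turn
    return Action.WAIT           # silence, but the thought looks unfinished
```

For example, `next_action(user_speaking=False, agent_speaking=False, turn_complete=False)` yields `Action.WAIT`, which corresponds to the "incomplete thoughts" and "pauses" cases.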
Example Implementation
Here's a complete example showing turn detection in action:
from videosdk.agents import Agent, CascadingPipeline, AgentSession
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector
class ConversationalAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant that waits for users to finish speaking before responding."
        )
    async def on_enter(self):
        await self.session.say("Hello! I'm listening and will respond when you're ready.")
# Set up pipeline with turn detection
pipeline = CascadingPipeline(
    stt=your_stt_provider,
    llm=your_llm_provider,
    tts=your_tts_provider,
    vad=SileroVAD(threshold=0.5),
    turn_detector=TurnDetector(threshold=0.7)
)
# Create and start session
session = AgentSession(agent=ConversationalAgent(), pipeline=pipeline)
.
.
.