
Turn Detection and Voice Activity Detection

Turn detection enables your AI agents to have natural, human-like conversations by knowing when to listen and when to respond. It prevents agents from interrupting users mid-sentence and ensures they respond at appropriate moments in the conversation.


Overview

  1. Voice Activity Detection (VAD): Detects when speech starts and stops in the audio stream to monitor voice activity and handle interruptions.
  2. Turn Detection/End-of-Utterance Detection (EOU): Analyzes conversation context to determine if the user expects a response.

Together, these create a smooth conversation flow where your agent waits for users to finish their thoughts before responding and stops speaking if the user interrupts mid-response.

Voice Activity Detection (VAD)

VAD monitors voice activity to detect when speech begins and ends in the audio stream, helping your agent know when someone is talking.

Basic Setup

from videosdk.plugins.silero import SileroVAD  

# Configure VAD for your environment
vad = SileroVAD(
    threshold=0.5,              # Sensitivity (0.3-0.8)
    min_speech_duration=0.1,    # Ignore very brief sounds
    min_silence_duration=0.75   # Wait time before considering speech ended
)

Configuration Guidelines

Environment          Threshold   Use Case
Quiet Office         0.3-0.4     High sensitivity for soft speech
Normal Room          0.5-0.6     Balanced detection
Noisy Environment    0.7-0.8     Reduce false triggers
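
For example, an agent deployed in a noisy environment might use a higher threshold and a longer minimum speech duration to avoid false triggers. The values below are illustrative starting points, not prescriptions; tune them against your own audio:

from videosdk.plugins.silero import SileroVAD

# Illustrative settings for a noisy environment
noisy_vad = SileroVAD(
    threshold=0.75,             # Less sensitive, fewer false triggers
    min_speech_duration=0.2,    # Ignore short bursts of background noise
    min_silence_duration=0.75   # Wait before treating speech as finished
)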
tip

For better voice activity detection, use the Denoise component (background noise removal) alongside VAD.

Turn Detection

Turn Detection, also called End-of-Utterance (EOU) detection, analyzes conversation context to determine whether the user expects a response, distinguishing between statements, questions, and incomplete thoughts.

Basic Setup

from videosdk.plugins.turn_detector import TurnDetector  

# Configure EOU detection
turn_detector = TurnDetector(
    threshold=0.7  # Confidence level required before triggering a response
)

What EOU Detects

  • Questions: "What's the weather like?" → Agent responds
  • Commands: "Set a reminder for 3 PM" → Agent responds
  • Incomplete thoughts: "I was thinking about..." → Agent waits
  • Pauses: "Let me see... actually..." → Agent waits
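
Conceptually, the detector scores each pause with a confidence that the user has finished their turn, and the agent only responds when that score clears the configured threshold. The sketch below is purely illustrative; the confidence values are made up and are not produced by the TurnDetector API:

# Hypothetical illustration of how an EOU confidence threshold gates responses
THRESHOLD = 0.7

example_scores = {
    "What's the weather like?": 0.92,   # Complete question -> respond
    "Set a reminder for 3 PM": 0.88,    # Complete command -> respond
    "I was thinking about...": 0.35,    # Incomplete thought -> keep listening
}

for utterance, confidence in example_scores.items():
    action = "respond" if confidence >= THRESHOLD else "wait"
    print(f"{utterance!r}: confidence={confidence} -> {action}")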

Pipeline Integration

Add turn detection to your existing cascading pipeline:

from videosdk.agents import CascadingPipeline  
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

pipeline = CascadingPipeline(
    stt=your_stt_provider,
    llm=your_llm_provider,
    tts=your_tts_provider,
    vad=SileroVAD(threshold=0.5),
    turn_detector=TurnDetector(threshold=0.7)
)
tip

The OpenAI Realtime API has built-in turn detection, so external VAD/turn detection components are not needed with the RealTimePipeline.
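
For reference, a realtime setup might look like the following. This is a minimal sketch that assumes a RealTimePipeline class and an OpenAIRealtime model plugin are available; check the RealTimePipeline documentation for the exact import paths and parameters in your installed version:

from videosdk.agents import RealTimePipeline
# Assumed import path for the OpenAI realtime model plugin
from videosdk.plugins.openai import OpenAIRealtime

# No external VAD or TurnDetector: the realtime model handles turn detection itself
pipeline = RealTimePipeline(model=OpenAIRealtime())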

Example Implementation

Here's a complete example showing turn detection in action:

main.py
from videosdk.agents import Agent, CascadingPipeline, AgentSession  
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

class ConversationalAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant that waits for users to finish speaking before responding."
        )

    async def on_enter(self):
        await self.session.say("Hello! I'm listening and will respond when you're ready.")

# Set up pipeline with turn detection
pipeline = CascadingPipeline(
    stt=your_stt_provider,
    llm=your_llm_provider,
    tts=your_tts_provider,
    vad=SileroVAD(threshold=0.5),
    turn_detector=TurnDetector(threshold=0.7)
)

# Create and start session
session = AgentSession(agent=ConversationalAgent(), pipeline=pipeline)
.
.
.

Examples - Try It Out Yourself

Got a question? Ask us on Discord.