Turn Detection and Voice Activity Detection
In conversational AI, timing is everything. Traditional voice agents rely on Voice Activity Detection (VAD) with a fixed silence timer to guess when a user has finished speaking, which often leads to awkward interruptions or unnatural pauses.
To solve this, VideoSDK created Namo-v1: an open-source, high-performance turn-detection model that understands the meaning of the conversation, not just the silence.

From Silence Detection to Speech Understanding
Namo shifts from basic audio analysis to sophisticated Natural Language Understanding (NLU), allowing your agent to know when a user is truly finished speaking versus just pausing to think.
| Traditional VAD (Silence-Based) | Namo Turn Detector (Semantic-Based) |
|---|---|
| Listens for silence. | Understands words and context. |
| Relies on a fixed timer (e.g., 800ms). | Uses a transformer model to predict intent. |
| Often interrupts or lags. | Knows when to wait and when to respond instantly. |
| Struggles with natural pauses and filler words. | Distinguishes between a brief pause and a true endpoint. |
This semantic understanding enables AI agents to respond more quickly and more naturally, creating a fluid, human-like conversational experience.
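To make the distinction concrete, here's a toy sketch of semantic endpoint prediction. This is purely illustrative and is not the Namo model or its API: Namo scores the transcript with a transformer classifier, while this stand-in just treats trailing fillers and conjunctions as signs the speaker isn't done.

```python
# Toy stand-in for a semantic end-of-turn check -- NOT the Namo model.
# A real detector scores the full transcript with a transformer; this
# heuristic only flags trailing fillers/conjunctions as "still talking".
INCOMPLETE_ENDINGS = {"and", "but", "so", "um", "uh", "because"}

def looks_finished(transcript: str) -> bool:
    words = transcript.rstrip(" .?!").split()
    return bool(words) and words[-1].lower() not in INCOMPLETE_ENDINGS

print(looks_finished("Book a meeting for tomorrow at 10."))  # True  -> respond now
print(looks_finished("Book a meeting for, um"))              # False -> keep waiting
```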
For a deep dive into Namo's architecture, performance benchmarks, and how to use it as a standalone model, check out the dedicated Namo Turn Detector plugin page.
Implementation
For the most robust setup, you can use VAD and Namo together. VAD acts as a basic speech detector, while Namo intelligently decides if the turn is over.
1. Voice Activity Detection (VAD)
First, configure VAD to detect the presence of speech. This helps manage interruptions and acts as a first-pass filter.
```python
from videosdk.plugins.silero import SileroVAD

# Configure VAD to detect speech activity
vad = SileroVAD(
    threshold=0.5,              # Sensitivity to speech (0.3-0.8)
    min_speech_duration=0.1,    # Ignore very brief sounds
    min_silence_duration=0.75   # Wait time before considering speech ended
)
```
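As a rough guide: raising threshold makes the detector less likely to fire on background noise but more likely to miss quiet speech, while a longer min_silence_duration reduces premature cutoffs at the cost of a slower hand-off to the turn detector.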
2. Namo Turn Detection
Next, add the NamoTurnDetectorV1 plugin to analyze the content of the speech and predict the user's intent.
Multilingual Model
If your agent needs to support multiple languages, use the default multilingual model. It's a single, powerful model that works across more than 20 languages.
```python
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Pre-download the multilingual model to avoid runtime delays
pre_download_namo_turn_v1_model()

# Initialize the multilingual turn detector
turn_detector = NamoTurnDetectorV1(
    threshold=0.7  # Confidence level for triggering a response
)
```
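Here, threshold=0.7 means Namo must be at least 70% confident that the utterance is complete before it ends the turn; lower values make the agent respond more eagerly, while higher values make it hold out for a clearer endpoint.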
The table below lists supported languages along with the multilingual model's performance metrics for each.
| Language | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| 🇸🇦 Arabic | 0.849 | 0.7965 | 0.9439 | 0.8639 |
| 🇮🇳 Bengali | 0.794 | 0.7874 | 0.7939 | 0.7907 |
| 🇨🇳 Chinese | 0.9164 | 0.8859 | 0.9608 | 0.9219 |
…plus 19 more languages. See the Namo Turn Detector plugin page for the complete benchmark table.
Language-Specific Models
For maximum performance and accuracy in a single language, use a specialized model. These models are faster and have a smaller memory footprint.
```python
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Pre-download a specific language model (e.g., German)
pre_download_namo_turn_v1_model(language="de")

# Initialize the turn detector for German
turn_detector = NamoTurnDetectorV1(
    language="de",
    threshold=0.7
)
```
| Language | Code | Model Link | Accuracy |
|---|---|---|---|
| 🇰🇷 Korean | ko | Namo-v1-Korean | 97.3% |
| 🇹🇷 Turkish | tr | Namo-v1-Turkish | 96.8% |
| 🇯🇵 Japanese | ja | Namo-v1-Japanese | 93.5% |
…plus 20 more languages.
To see all available models for different languages, along with their benchmarks and accuracy, please visit our Hugging Face models page.
3. Adaptive End-of-Utterance (EOU) Handling
The Adaptive EOU mode dynamically adjusts the speech-wait timeout based on the confidence scores. This ensures that the agent waits longer when the user is hesitant and responds faster when the user's intent is clear, creating a more natural conversational flow.
You can configure this by setting the eou_config in your pipeline options:
```python
pipeline = CascadingPipeline(
    # ... other config
    eou_config=EOUConfig(
        mode='ADAPTIVE',  # or 'DEFAULT'
        min_max_speech_wait_timeout=[0.5, 0.8]  # Min 0.5s, max 0.8s wait
    )
)
```
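One way to picture ADAPTIVE mode: the wait timeout is scaled between the configured minimum and maximum according to how confident the detector is that the turn has ended. The linear mapping below is an assumption for illustration only, not necessarily the library's exact formula.

```python
# Illustrative sketch of a confidence-to-timeout mapping (assumed, not the
# library's internal formula): high end-of-turn confidence waits near the
# minimum, low confidence (user may still be talking) waits near the maximum.
def adaptive_wait_timeout(eou_confidence: float,
                          min_wait: float = 0.5,
                          max_wait: float = 0.8) -> float:
    eou_confidence = max(0.0, min(1.0, eou_confidence))  # clamp to [0, 1]
    return max_wait - eou_confidence * (max_wait - min_wait)

print(adaptive_wait_timeout(0.95))  # ~0.52s: clear sentence, respond quickly
print(adaptive_wait_timeout(0.20))  # ~0.74s: hesitant speech, wait longer
```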
Configuration Parameters
| Parameter | Type | Description |
|---|---|---|
| mode | str | • DEFAULT: Uses a fixed timeout value. • ADAPTIVE: Dynamically adjusts the timeout based on confidence scores. |
| min_max_speech_wait_timeout | list[float] | Defines the minimum and maximum wait time (in seconds). |
Example
| Mode | User Input | Agent Reaction | Wait Time | Example |
|---|---|---|---|---|
| DEFAULT | Speaks clearly | Responds immediately | ~0.5s | “Book a meeting for tomorrow at 10.” |
| DEFAULT | Pauses or hesitates mid-sentence | Waits slightly longer | ~0.8s | “Book a meeting for… um… tomorrow…” |
| ADAPTIVE | Mixes clear and hesitant speech | Adjusts based on speech clarity | Scaled between min and max | “Remind me to call… uh… John later.” |
4. Interruption Detection (VAD + STT)
Interruption Detection controls when the system should treat user speech as an intentional interruption. It evaluates both voice activity and recognized speech content to avoid triggering interruptions from short noises, filler words, or background audio. The agent only stops or responds when the user clearly intends to speak.
Configuration Example (HYBRID mode)
```python
pipeline = CascadingPipeline(
    # ... other config
    interrupt_config=InterruptConfig(
        mode="HYBRID",
        interrupt_min_duration=0.2,  # 200ms of continuous speech
        interrupt_min_words=2,       # At least 2 words recognized
    )
)
```
VAD_ONLY mode
```python
pipeline = CascadingPipeline(
    # ... other config
    interrupt_config=InterruptConfig(
        mode="VAD_ONLY",
        interrupt_min_duration=0.2,  # 200ms of continuous speech
    )
)
```
STT_ONLY mode
```python
pipeline = CascadingPipeline(
    # ... other config
    interrupt_config=InterruptConfig(
        mode="STT_ONLY",
        interrupt_min_words=2,  # At least 2 words recognized
    )
)
```
Configuration Parameters
| Parameter | Type | Description |
|---|---|---|
| mode | str | • HYBRID: Combines VAD and STT; requires both audio detection and recognized words to trigger an interruption. • VAD_ONLY: Uses only raw speech-activity detection; faster, but may be triggered by background noise. • STT_ONLY: Relies only on recognized words from the transcript; slower, but ensures the speech is intelligible. |
| interrupt_min_duration | float | Minimum duration (in seconds) of continuous speech required to trigger an interruption. |
| interrupt_min_words | int | Minimum number of words that must be recognized (used in HYBRID and STT_ONLY modes). |
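As a mental model, HYBRID mode behaves like an AND gate over the two signals: both the VAD duration gate and the STT word-count gate must pass. The function below sketches that gating with assumed names; it is not the library's internal implementation.

```python
# Sketch of HYBRID interruption gating (assumed semantics, not library code):
# fire only when speech lasted long enough AND produced enough real words,
# which filters out coughs, short noises, and stray filler sounds.
def should_interrupt(speech_duration: float,
                     recognized_words: int,
                     min_duration: float = 0.2,
                     min_words: int = 2) -> bool:
    return speech_duration >= min_duration and recognized_words >= min_words

print(should_interrupt(0.35, 3))  # True:  sustained speech with real words
print(should_interrupt(0.25, 0))  # False: noise with no transcript
print(should_interrupt(0.10, 2))  # False: too brief, likely a blip
```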
5. False-Interruption Recovery
The False-Interruption Recovery feature detects accidental or brief user noises and allows the agent to automatically resume speaking when interruptions are not genuine.
Configuration Example
```python
pipeline = CascadingPipeline(
    # ... other config
    interrupt_config=InterruptConfig(
        false_interrupt_pause_duration=2.0,  # Wait 2 seconds to confirm interruption
        resume_on_false_interrupt=True,      # Auto-resume if interruption is brief
    )
)
```
Configuration Parameters
| Parameter | Type | Description |
|---|---|---|
| false_interrupt_pause_duration | float | Duration (in seconds) to wait after detecting an interruption before classifying it as false. If the user doesn't continue speaking within this window, the interruption is treated as accidental and the agent resumes. |
| resume_on_false_interrupt | bool | If True, the agent automatically resumes speaking after a false interruption. If False, the agent remains paused even after brief interruptions. |
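The recovery logic amounts to a short grace period after the agent is paused. The asyncio sketch below illustrates that behavior under assumed semantics; the event name and return strings are hypothetical, not part of the VideoSDK API.

```python
# Sketch of false-interruption recovery timing (assumed behavior, not
# library internals): after an interruption pauses the agent, wait
# false_interrupt_pause_duration; if the user never continued speaking,
# treat it as accidental and resume.
import asyncio

async def handle_interrupt(user_kept_speaking: asyncio.Event,
                           pause_duration: float = 2.0,
                           resume_on_false: bool = True) -> str:
    try:
        await asyncio.wait_for(user_kept_speaking.wait(), timeout=pause_duration)
        return "real interruption: yield the floor to the user"
    except asyncio.TimeoutError:
        if resume_on_false:
            return "false interruption: resume speaking where we left off"
        return "false interruption: stay paused"

# e.g. a brief cough with no follow-up speech -> agent resumes after 2s
print(asyncio.run(handle_interrupt(asyncio.Event())))
```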
Pipeline Integration
Combine VAD and Namo in your CascadingPipeline to bring it all together.
```python
from videosdk.agents import CascadingPipeline
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Pre-download the model you intend to use
pre_download_namo_turn_v1_model(language="en")

pipeline = CascadingPipeline(
    stt=your_stt_provider,
    llm=your_llm_provider,
    tts=your_tts_provider,
    vad=SileroVAD(threshold=0.5),
    turn_detector=NamoTurnDetectorV1(language="en", threshold=0.7)
)
```
The RealTimePipeline for providers like OpenAI includes built-in turn detection, so external VAD and Turn Detector components are not required.
Example Implementation
Here’s a complete example showing Namo in a conversational agent.
```python
from videosdk.agents import Agent, CascadingPipeline, AgentSession
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model
from your_providers import your_stt_provider, your_llm_provider, your_tts_provider


class ConversationalAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful assistant that waits for users to finish speaking before responding."
        )

    async def on_enter(self):
        await self.session.say("Hello! I'm listening and will respond when you're ready.")


# 1. Pre-download the model to ensure fast startup
pre_download_namo_turn_v1_model(language="en")

# 2. Set up the pipeline with Namo for intelligent turn detection
pipeline = CascadingPipeline(
    stt=your_stt_provider,
    llm=your_llm_provider,
    tts=your_tts_provider,
    vad=SileroVAD(threshold=0.5),
    turn_detector=NamoTurnDetectorV1(language="en", threshold=0.7)
)

# 3. Create and start the session
session = AgentSession(agent=ConversationalAgent(), pipeline=pipeline)
# ... connect to your call transport
```
Got a question? Ask us on Discord.

