Version: 1.0.x

Turn Detection

In conversational AI, timing is everything. Traditional voice agents rely on simple silence-based timers (Voice Activity Detection or VAD) to guess when a user has finished speaking. This often leads to awkward interruptions or unnatural pauses.

Turn detection solves this by understanding the meaning of what the user said, not just the silence after it, so your agent knows when to respond instantly and when to keep listening.

From Silence to Speech Understanding

Semantic turn detection shifts from basic audio analysis to Natural Language Understanding (NLU), letting your agent tell the difference between a user who is truly finished and one who is just pausing to think.

| Traditional VAD (Silence-Based) | Semantic Turn Detection |
| --- | --- |
| Listens for silence. | Understands words and context. |
| Relies on a fixed timer (e.g., 800 ms). | Uses a transformer model to predict intent. |
| Often interrupts or lags. | Knows when to wait and when to respond instantly. |
| Struggles with natural pauses and filler words. | Distinguishes between a brief pause and a true endpoint. |
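
To make the left-hand column concrete, here is a minimal sketch of a fixed silence timer. It is not VideoSDK code: the function name `silence_endpoint`, the 20 ms frame size, and the per-frame boolean input are all assumptions chosen for illustration. It shows how an 800 ms timer ends the turn during a 900 ms thinking pause, even though the user goes on to finish the sentence.

```python
from typing import List, Optional


def silence_endpoint(
    is_speech: List[bool], frame_ms: int = 20, timeout_ms: int = 800
) -> Optional[int]:
    """Return the frame index at which a fixed silence timer ends the turn.

    `is_speech` holds one VAD decision per audio frame. Returns None if the
    timer never fires.
    """
    frames_needed = timeout_ms // frame_ms
    silent_run = 0
    heard_speech = False
    for i, speaking in enumerate(is_speech):
        if speaking:
            heard_speech = True
            silent_run = 0  # any speech resets the timer
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i  # timer expired: the turn is declared over
    return None


# 1000 ms of speech, a 900 ms thinking pause, then 500 ms more speech.
frames = [True] * 50 + [False] * 45 + [True] * 25
cut = silence_endpoint(frames)
print(cut)  # 89: the timer fires before the user resumes at frame 95
```

A semantic turn detector is meant to avoid exactly this case by looking at the transcript ("I'd like to book a flight to...") rather than only at the length of the silence.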

VideoSDK offers two ways to add semantic turn detection to a cascade pipeline:

  • Echo (Inference): server-hosted, lowest-setup option. Choose echo-small for speed or echo-large for accuracy.
  • TurnSense: runs in the cloud or fully on-device.

In every case, VAD detects that speech is happening; the turn detector decides when the turn is over.

Echo Turn Detection (Inference)

Echo is VideoSDK's server-hosted turn detector, exposed through the VideoSDK Inference Gateway via the TurnV2 class. No model is downloaded or loaded on your machine; authentication requires VIDEOSDK_AUTH_TOKEN.

How It Works

[Figure: Echo (TurnV2) architecture]

As the user speaks, VAD detects the speech and STT produces a transcript. After each user utterance, the latest transcript is sent to the Inference Gateway, where the selected Echo model (echo-small or echo-large) classifies the turn into one of four states:

| State | Meaning |
| --- | --- |
| Complete | The user has finished their turn. |
| Incomplete | The user is still mid-sentence or not finished yet. |
| Backchannel | A short acknowledgement (e.g. "uh-huh", "okay okay"). |
| Wait | The user wants the agent to hold (e.g. "wait a minute", "hold on"). |
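
The sketch below shows one way an agent loop might act on the four states. It is illustrative only: `TurnState`, `AgentAction`, and `next_action` are hypothetical names, not part of the VideoSDK API, and this page does not document how TurnV2 surfaces its result to your code. The mapping for Backchannel (carry on without treating it as a new turn) is an assumption; the other three follow from the meanings in the table.

```python
from enum import Enum


class TurnState(Enum):
    """The four states listed in the table above (hypothetical enum)."""
    COMPLETE = "complete"
    INCOMPLETE = "incomplete"
    BACKCHANNEL = "backchannel"
    WAIT = "wait"


class AgentAction(Enum):
    """Illustrative agent behaviours, not SDK constants."""
    RESPOND = "respond"
    KEEP_LISTENING = "keep_listening"
    CARRY_ON = "carry_on"
    HOLD = "hold"


def next_action(state: TurnState) -> AgentAction:
    """Map a classified turn state to what the agent does next."""
    if state is TurnState.COMPLETE:
        return AgentAction.RESPOND         # the user is done, so reply now
    if state is TurnState.INCOMPLETE:
        return AgentAction.KEEP_LISTENING  # mid-sentence: do not interrupt
    if state is TurnState.BACKCHANNEL:
        return AgentAction.CARRY_ON        # assumption: "uh-huh" is not a new turn
    return AgentAction.HOLD                # WAIT: the user asked the agent to hold


print(next_action(TurnState.WAIT).value)  # hold
```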

Models

Both models share the same four-state classification; they differ only in the latency/accuracy trade-off:

| Provider | Model Name | Model ID |
| --- | --- | --- |
| VideoSDK | Echo Small | echo-small |
| VideoSDK | Echo Large | echo-large |

TurnV2.echo_small()

  • The default, lowest-latency model optimized for the fastest possible turn detection.
  • Best when responsiveness matters most.

TurnV2.echo_large()

  • A higher-accuracy model that trades a little latency for better classification.
  • Best when accuracy matters more than raw speed.

Supported Languages

Both echo-small and echo-large support the following languages:

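| # | Language | # | Language |
| --- | --- | --- | --- |
| 1 | English | 7 | Urdu |
| 2 | Hindi | 8 | Bengali |
| 3 | Gujarati | 9 | French |
| 4 | Marathi | 10 | German |
| 5 | Tamil | 11 | Italian |
| 6 | Telugu | 12 | Spanish |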

Usage

Set your auth token, then import and configure TurnV2:

# Environment (for example, in your .env file)
VIDEOSDK_AUTH_TOKEN="your-videosdk-auth-token"

# Python
from videosdk.agents.inference import TurnV2

# Pick one of the two models.

# Option 1: fastest, lowest latency (default)
turn_detector = TurnV2.echo_small()

# Option 2: higher accuracy
turn_detector = TurnV2.echo_large()
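
Because Echo runs on the Inference Gateway, a missing token is a likely cause of start-up failures. The helper below is a small optional sketch using only the standard library; `require_auth_token` is a hypothetical name, not an SDK function, and it assumes the token reaches your process as an environment variable (whether the SDK loads a .env file for you is not stated on this page).

```python
import os
from typing import Mapping, Optional


def require_auth_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return VIDEOSDK_AUTH_TOKEN, or raise a clear error if it is unset."""
    env = os.environ if env is None else env
    token = env.get("VIDEOSDK_AUTH_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "VIDEOSDK_AUTH_TOKEN is not set; Echo turn detection needs it "
            "to authenticate with the Inference Gateway."
        )
    return token
```

Calling `require_auth_token()` once before constructing `TurnV2.echo_small()` or `TurnV2.echo_large()` turns a confusing runtime failure into an explicit configuration error.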

Performance

Both Echo models were benchmarked on the TURNS2K dataset against a leading third-party turn-detection model, referred to here as Baseline. Each sample is labeled Complete (the user has finished speaking) or Incomplete (the user is still speaking).

| Metric | Echo-Small | Echo-Large | Baseline |
| --- | --- | --- | --- |
| Accuracy | 93.60% | 96.20% | 61.13% |
| Recall (Complete) | 97.31% | 96.50% | 32.83% |
| Specificity | 88.91% | 95.81% | 96.83% |
| F1 Score (Complete) | 0.9443 | 0.9659 | 0.4851 |
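
This page does not spell out how the four metrics are computed. The sketch below assumes the standard binary-classification definitions with Complete as the positive class, so recall measures how often a finished turn is recognized and specificity measures how often an unfinished turn is correctly left alone. The counts are made up for illustration and are not the benchmark's confusion matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # Complete samples predicted Complete
    fn: int  # Complete samples predicted Incomplete (missed turn endings)
    tn: int  # Incomplete samples predicted Incomplete
    fp: int  # Incomplete samples predicted Complete (premature responses)


def _ratio(numerator: float, denominator: float) -> float:
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def accuracy(c: Confusion) -> float:
    return _ratio(c.tp + c.tn, c.tp + c.fn + c.tn + c.fp)


def recall(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: Confusion) -> float:
    return _ratio(c.tn, c.tn + c.fp)


def precision(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fp)


def f1(c: Confusion) -> float:
    p, r = precision(c), recall(c)
    return _ratio(2 * p * r, p + r)


# Illustrative counts only (NOT the TURNS2K results).
example = Confusion(tp=90, fn=10, tn=80, fp=20)
print(f"accuracy={accuracy(example):.4f}")        # accuracy=0.8500
print(f"recall={recall(example):.4f}")            # recall=0.9000
print(f"specificity={specificity(example):.4f}")  # specificity=0.8000
print(f"f1={f1(example):.4f}")                    # f1=0.8571
```

Read this way, the Baseline's high specificity combined with low recall describes a detector that rarely interrupts but often fails to notice that the user has finished.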

Echo-Large

On 2,000 English conversational samples, Echo-Large achieved 96.2% accuracy in detecting whether a speaker had finished their turn, substantially outperforming the Baseline under the same conditions. The largest difference was in turn-completion detection: Echo-Large correctly identified 96.5% of completed turns versus 32.8% for the Baseline, resulting in far fewer missed responses.

For every 100 times a user finished speaking, Echo-Large responded correctly about 96.5 times, missing roughly 3.5 turn completions; the Baseline responded correctly about 33 times.

Echo-Small

Echo-Small is optimized for responsive voice interactions where recognizing completed speech quickly is critical. On 2,000 labeled English conversational samples, Echo-Small achieved 93.60% accuracy, 97.31% recall on completed turns, 88.91% specificity, and an F1 score of 0.9443.

For every 100 times a user finished speaking, Echo-Small responded correctly 97.3 times versus 32.8 for the Baseline, about 25× fewer missed turn endings in this benchmark. Echo-Small is designed for applications where fast conversational turn-taking is a priority, while maintaining strong overall classification performance.
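
The "missed turn endings" figures follow directly from the Recall (Complete) row of the table: missed turns per 100 completed turns is 100 minus recall. A short check of that arithmetic, using only the recall values published above (`missed_per_100` is just a helper name for this sketch):

```python
def missed_per_100(recall_pct: float) -> float:
    """Completed turns missed per 100, given recall as a percentage."""
    return 100.0 - recall_pct


# Recall (Complete) values from the Performance table.
recall_complete = {"Echo-Small": 97.31, "Echo-Large": 96.50, "Baseline": 32.83}

baseline_missed = missed_per_100(recall_complete["Baseline"])  # 67.17
for name in ("Echo-Small", "Echo-Large"):
    missed = missed_per_100(recall_complete[name])
    ratio = baseline_missed / missed
    print(f"{name}: {missed:.2f} missed per 100, {ratio:.1f}x fewer than Baseline")
# Echo-Small: 2.69 missed per 100, 25.0x fewer than Baseline
# Echo-Large: 3.50 missed per 100, 19.2x fewer than Baseline
```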

Note: Results are measured on the benchmark dataset described above, on samples labeled Complete or Incomplete. Performance may vary depending on language, deployment configuration, user behavior, and application requirements.

TurnSense

TurnSense (SmolLM2-135M, English) is an alternative turn detector exposed through the TurnDetector class. It can run through the Inference Gateway or fully on-device.

| Backend | Class | Languages | Deployment |
| --- | --- | --- | --- |
| TurnSense (SmolLM2-135M) | TurnDetector | English | Cloud or local |

Import TurnDetector from videosdk.agents.inference for HTTP-based end-of-utterance (EOU) detection through the VideoSDK Inference Gateway; no local model download is required.

from videosdk.agents.inference import TurnDetector

# TurnSense / SmolLM2-135M (English)
turn_detector = TurnDetector(threshold=0.7)

Note: Cloud turn detection requires VIDEOSDK_AUTH_TOKEN. See Authentication and Tokens.
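
The `threshold=0.7` argument in the TurnSense example is not explained on this page. The sketch below assumes it is a cut-off on a model-predicted end-of-utterance probability, which is a common design for this kind of detector; `is_end_of_turn` is a hypothetical helper, and the exact comparison the SDK uses (`>=` versus `>`) is also an assumption.

```python
def is_end_of_turn(eou_probability: float, threshold: float = 0.7) -> bool:
    """Decide whether the turn is over, given an EOU probability in [0, 1]."""
    if not 0.0 <= eou_probability <= 1.0:
        raise ValueError("eou_probability must be between 0 and 1")
    # Assumption: the turn ends when the probability reaches the threshold.
    return eou_probability >= threshold


print(is_end_of_turn(0.65))                 # False: keep listening
print(is_end_of_turn(0.65, threshold=0.6))  # True: a lower threshold responds sooner
```

Under that assumption, raising the threshold makes the agent more patient at the cost of slower responses, and lowering it does the opposite.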

For plugin-specific setup, see Turn Detector.
