Version: 1.0.x

Silero VAD

The Silero VAD (Voice Activity Detection) provider lets your agent detect when users start and stop speaking. Adding it to a pipeline automatically enables interruptions, so users can cut in while the agent is mid-response.

Installation

Install the Silero VAD plugin for VideoSDK Agents:

pip install "videosdk-plugins-silero"

Importing

from videosdk.agents.plugins import SileroVAD

Example Usage

from videosdk.agents.plugins import SileroVAD
from videosdk.agents import Pipeline

# Initialize the Silero VAD
vad = SileroVAD(
    input_sample_rate=48000,
    model_sample_rate=16000,
    threshold=0.3,
    min_speech_duration=0.1,
    min_silence_duration=0.75,
    padding_duration=0.3,
)

# Add VAD to pipeline - automatically enables interrupts
pipeline = Pipeline(vad=vad)
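
To get a feel for the durations used in the example above, it can help to convert them into sample counts. The sketch below is plain arithmetic and not part of the plugin: the helper name `seconds_to_samples` is hypothetical, and it assumes durations are counted at the model sample rate, which this page does not confirm.

```python
def seconds_to_samples(duration_s: float, sample_rate_hz: int) -> int:
    """Convert a duration in seconds to a whole number of audio samples."""
    return round(duration_s * sample_rate_hz)


INPUT_SAMPLE_RATE = 48000   # input_sample_rate from the example
MODEL_SAMPLE_RATE = 16000   # model_sample_rate from the example

# Number of input samples per model sample when downsampling.
downsample_ratio = INPUT_SAMPLE_RATE // MODEL_SAMPLE_RATE

# Durations from the example, expressed in samples at the model rate
# (assumption: the plugin measures them at the model rate).
min_speech_samples = seconds_to_samples(0.1, MODEL_SAMPLE_RATE)
min_silence_samples = seconds_to_samples(0.75, MODEL_SAMPLE_RATE)
padding_samples = seconds_to_samples(0.3, MODEL_SAMPLE_RATE)

print(downsample_ratio, min_speech_samples, min_silence_samples, padding_samples)
```

Under these assumptions, a user has to pause for 0.75 seconds (12000 samples at 16 kHz) before the end of speech is reported, which is why a larger `min_silence_duration` makes the agent slower to respond but less likely to cut in on a brief pause.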

Configuration Options

  • input_sample_rate: (int) Sample rate of input audio in Hz (default: 48000)
  • model_sample_rate: (Literal[8000, 16000]) Model's expected sample rate (default: 16000)
  • threshold: (float) Voice activity detection sensitivity (0.0 to 1.0, default: 0.5)
  • start_threshold: (float) Probability threshold above which speech is considered to have started (default: 0.4)
  • end_threshold: (float) Probability threshold below which speech is considered to have ended (default: 0.25)
  • min_speech_duration: (float) Minimum speech duration to trigger detection in seconds (default: 0.05)
  • min_silence_duration: (float) Minimum silence duration to end speech detection in seconds (default: 0.4)
  • padding_duration: (float) Audio padding before speech detection in seconds (default: 0.5)
  • max_buffered_speech: (float) Maximum speech buffer duration in seconds (default: 60.0)
  • force_cpu: (bool) Force CPU usage instead of GPU acceleration (default: True)
  • onnx_model_path: (str | Path, optional) Path to a custom ONNX VAD model file. When None, the bundled model is used (default: None)
  • max_speech_duration: (float, optional) Maximum continuous speech duration in seconds before forcing a split. When None, no maximum is enforced (default: None)
  • min_silence_at_split: (float) Minimum silence in seconds required at a split point when enforcing max_speech_duration (default: 0.098)
  • energy_filter_enabled: (bool) Enable an energy-based pre-filter to skip inference on low-energy frames (default: False)
  • energy_silence_threshold: (float) Energy level below which a frame is treated as silence when the energy filter is enabled (default: 0.001)
  • smoothing_strategy: (Literal["ema", "moving_average", "none"]) Strategy used to smooth raw VAD probabilities (default: "ema")
  • smoothing_factor: (float) Smoothing factor for the EMA filter (default: 0.35)
  • smoothing_window: (int) Window size for the moving-average filter (default: 5)
  • min_volume: (float) Minimum audio volume required for a frame to be considered (default: 0.0)
  • probability_history_size: (int) Size of the probability ring buffer kept for history. 0 disables history tracking (default: 0)
  • offload_inference: (bool) Run model inference on a dedicated thread pool to offload it from the event loop (default: False)
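
The interaction between `smoothing_factor`, `start_threshold`, and `end_threshold` can be easier to follow in code. The following is a minimal sketch of EMA smoothing combined with start/end hysteresis, not the plugin's actual implementation: the class `HysteresisDetector` and its `update` method are hypothetical, and it assumes the smoothing factor is the weight given to the newest probability.

```python
from dataclasses import dataclass


@dataclass
class HysteresisDetector:
    """Illustrative only: EMA smoothing followed by two-threshold hysteresis."""

    start_threshold: float = 0.4    # default listed for start_threshold
    end_threshold: float = 0.25     # default listed for end_threshold
    smoothing_factor: float = 0.35  # default listed for smoothing_factor
    smoothed: float = 0.0
    speaking: bool = False

    def update(self, raw_probability: float) -> bool:
        # Exponential moving average (assumption: the factor weights the
        # newest sample, the remainder weights the previous smoothed value).
        self.smoothed = (
            self.smoothing_factor * raw_probability
            + (1.0 - self.smoothing_factor) * self.smoothed
        )
        # Hysteresis: speech starts at or above the higher threshold and
        # ends only once the smoothed value drops below the lower one.
        if not self.speaking and self.smoothed >= self.start_threshold:
            self.speaking = True
        elif self.speaking and self.smoothed < self.end_threshold:
            self.speaking = False
        return self.speaking


detector = HysteresisDetector()
states = [detector.update(p) for p in [0.9, 0.9, 0.0, 0.0]]
print(states)  # [False, True, True, False]
```

In this run, the third frame has a smoothed probability of about 0.34, which is below `start_threshold` but above `end_threshold`, so the detector stays in the speaking state. Keeping a gap between the two thresholds is what prevents the state from flickering when the probability hovers around a single cutoff.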

Additional Resources

The following resources provide more information about using Silero VAD with the VideoSDK Agents SDK.

  • Silero VAD project: The open-source VAD model that powers the VideoSDK Silero VAD plugin.
