Silero VAD
The Silero VAD (Voice Activity Detection) provider lets your agent detect when a user starts and stops speaking. Adding it to a pipeline automatically enables interruptions, so users can cut in while the agent is mid-response.
Installation
Install the Silero VAD plugin for VideoSDK Agents:
pip install "videosdk-plugins-silero"
Importing
from videosdk.agents.plugins import SileroVAD
Example Usage
from videosdk.agents.plugins import SileroVAD
from videosdk.agents import Pipeline
# Initialize the Silero VAD
vad = SileroVAD(
    input_sample_rate=48000,
    model_sample_rate=16000,
    threshold=0.3,
    min_speech_duration=0.1,
    min_silence_duration=0.75,
    padding_duration=0.3,
)
# Add VAD to pipeline - automatically enables interrupts
pipeline = Pipeline(vad=vad)
Configuration Options
- input_sample_rate: (int) Sample rate of input audio in Hz (default: 48000)
- model_sample_rate: (Literal[8000, 16000]) Model's expected sample rate (default: 16000)
- threshold: (float) Voice activity detection sensitivity (0.0 to 1.0, default: 0.5)
- start_threshold: (float) Probability threshold above which speech is considered to have started (default: 0.4)
- end_threshold: (float) Probability threshold below which speech is considered to have ended (default: 0.25)
- min_speech_duration: (float) Minimum speech duration to trigger detection in seconds (default: 0.05)
- min_silence_duration: (float) Minimum silence duration to end speech detection in seconds (default: 0.4)
- padding_duration: (float) Audio padding before speech detection in seconds (default: 0.5)
- max_buffered_speech: (float) Maximum speech buffer duration in seconds (default: 60.0)
- force_cpu: (bool) Force CPU usage instead of GPU acceleration (default: True)
- onnx_model_path: (str | Path, optional) Path to a custom ONNX VAD model file. When None, the bundled model is used (default: None)
- max_speech_duration: (float, optional) Maximum continuous speech duration in seconds before forcing a split. When None, no maximum is enforced (default: None)
- min_silence_at_split: (float) Minimum silence in seconds required at a split point when enforcing max_speech_duration (default: 0.098)
- energy_filter_enabled: (bool) Enable an energy-based pre-filter to skip inference on low-energy frames (default: False)
- energy_silence_threshold: (float) Energy level below which a frame is treated as silence when the energy filter is enabled (default: 0.001)
- smoothing_strategy: (Literal["ema", "moving_average", "none"]) Strategy used to smooth raw VAD probabilities (default: "ema")
- smoothing_factor: (float) Smoothing factor for the EMA filter (default: 0.35)
- smoothing_window: (int) Window size for the moving-average filter (default: 5)
- min_volume: (float) Minimum audio volume required for a frame to be considered (default: 0.0)
- probability_history_size: (int) Size of the probability ring buffer kept for history. 0 disables history tracking (default: 0)
- offload_inference: (bool) Run model inference on a dedicated thread pool to offload it from the event loop (default: False)
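To build intuition for how smoothing_factor, start_threshold, and end_threshold might interact, here is a minimal, standalone sketch in plain Python. It does not use the plugin and is not its actual implementation. The helper names ema_smooth and hysteresis are hypothetical, and it assumes that smoothing_factor is the weight given to the newest probability in the EMA.

```python
from typing import Iterable, List, Optional


def ema_smooth(probs: Iterable[float], factor: float = 0.35) -> List[float]:
    """Exponential moving average over raw speech probabilities.

    Assumption: `factor` is the weight of the newest sample, so a
    larger factor reacts faster and smooths less.
    """
    smoothed: List[float] = []
    prev: Optional[float] = None
    for p in probs:
        # The first sample seeds the average; later samples are blended in.
        prev = p if prev is None else factor * p + (1.0 - factor) * prev
        smoothed.append(prev)
    return smoothed


def hysteresis(
    probs: Iterable[float],
    start_threshold: float = 0.4,
    end_threshold: float = 0.25,
) -> List[bool]:
    """Turn probabilities into speaking / not-speaking states.

    Speech starts when a probability rises above start_threshold and
    ends only when it falls below end_threshold. Values between the
    two thresholds keep the current state.
    """
    speaking = False
    states: List[bool] = []
    for p in probs:
        if not speaking and p > start_threshold:
            speaking = True
        elif speaking and p < end_threshold:
            speaking = False
        states.append(speaking)
    return states


raw = [0.1, 0.9, 0.9, 0.3, 0.3, 0.1, 0.1]
print(hysteresis(raw))
# [False, True, True, True, True, False, False]
print(hysteresis(ema_smooth(raw)))
```

In the unsmoothed run, the two 0.3 frames sit between the thresholds, so the state stays "speaking" rather than flickering. Keeping end_threshold below start_threshold is what provides this stability.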
Additional Resources
The following resources provide more information about using Silero VAD with the VideoSDK Agents SDK.
- Silero VAD project: The open source VAD model that powers the VideoSDK Silero VAD plugin.

