Version: 1.0.x

Silero VAD

The Silero VAD (Voice Activity Detection) provider lets your agent detect when a user starts and stops speaking. Adding it to a pipeline automatically turns on interruption handling, so users can interrupt the agent mid-response.

Installation

Install the Silero VAD plugin package for VideoSDK Agents:

pip install "videosdk-plugins-silero"

Importing

from videosdk.plugins.silero import SileroVAD

Example Usage

from videosdk.plugins.silero import SileroVAD
from videosdk.agents import Pipeline

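# Initialize the Silero VAD (any option left out keeps its default)
vad = SileroVAD(
    input_sample_rate=48000,    # sample rate of the incoming audio, in Hz
    model_sample_rate=16000,    # rate the model expects: 8000 or 16000
    threshold=0.3,              # below the 0.5 default, so detection is more sensitive
    min_speech_duration=0.1,    # seconds of speech needed to trigger detection
    min_silence_duration=0.75,  # seconds of silence needed to end detection
    padding_duration=0.3,       # seconds of audio padding before speech
)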

# Add VAD to pipeline - automatically enables interrupts
pipeline = Pipeline(vad=vad)
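
The duration arguments in the example are given in seconds, while a VAD model scores audio in fixed-size windows. As a rough illustration of how those values scale, the sketch below converts each duration into a window count. The helper seconds_to_windows is hypothetical, and the 512-sample window at 16 kHz is an assumption; the window size the plugin actually uses is not stated here.

```python
def seconds_to_windows(duration_s, sample_rate=16000, window_samples=512):
    """Number of model windows needed to cover `duration_s` seconds (rounded up).

    `window_samples=512` is an assumed window size, not a documented plugin value.
    """
    samples = round(duration_s * sample_rate)
    return -(-samples // window_samples)  # integer ceiling division


# Durations taken from the example configuration above.
for name, seconds in [
    ("min_speech_duration", 0.1),
    ("min_silence_duration", 0.75),
    ("padding_duration", 0.3),
]:
    print(name, seconds_to_windows(seconds))
# min_speech_duration 4
# min_silence_duration 24
# padding_duration 10
```

Under these assumptions each window covers 32 ms, so raising min_silence_duration from the 0.4 default to 0.75 roughly doubles the number of consecutive quiet windows needed before speech is considered finished.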

Configuration Options

input_sample_rate: (int) Sample rate of input audio in Hz (default: 48000)
model_sample_rate: (Literal[8000, 16000]) Model's expected sample rate (default: 16000)
threshold: (float) Voice activity detection sensitivity (0.0 to 1.0, default: 0.5)
start_threshold: (float) Probability threshold above which speech is considered to have started (default: 0.4)
end_threshold: (float) Probability threshold below which speech is considered to have ended (default: 0.25)
min_speech_duration: (float) Minimum speech duration to trigger detection in seconds (default: 0.05)
min_silence_duration: (float) Minimum silence duration to end speech detection in seconds (default: 0.4)
padding_duration: (float) Audio padding before speech detection in seconds (default: 0.5)
max_buffered_speech: (float) Maximum speech buffer duration in seconds (default: 60.0)
force_cpu: (bool) Force CPU usage instead of GPU acceleration (default: True)
onnx_model_path: (str | Path, optional) Path to a custom ONNX VAD model file. When None, the bundled model is used (default: None)
max_speech_duration: (float, optional) Maximum continuous speech duration in seconds before forcing a split. When None, no maximum is enforced (default: None)
min_silence_at_split: (float) Minimum silence in seconds required at a split point when enforcing max_speech_duration (default: 0.098)
energy_filter_enabled: (bool) Enable an energy-based pre-filter to skip inference on low-energy frames (default: False)
energy_silence_threshold: (float) Energy level below which a frame is treated as silence when the energy filter is enabled (default: 0.001)
smoothing_strategy: (Literal["ema", "moving_average", "none"]) Strategy used to smooth raw VAD probabilities (default: "ema")
smoothing_factor: (float) Smoothing factor for the EMA filter (default: 0.35)
smoothing_window: (int) Window size for the moving-average filter (default: 5)
min_volume: (float) Minimum audio volume required for a frame to be considered (default: 0.0)
probability_history_size: (int) Size of the probability ring buffer kept for history. 0 disables history tracking (default: 0)
offload_inference: (bool) Run model inference on a dedicated thread pool to offload it from the event loop (default: False)
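
To make the interaction between the smoothing and threshold options more concrete, here is a minimal standalone sketch. It is not the plugin's implementation: the functions ema, moving_average, and detect_segments are hypothetical, the 32 ms frame length is assumed, and smoothing_factor is assumed to be the weight given to the newest probability. It shows how a separate start_threshold and end_threshold, combined with the minimum durations, keep a detector from flickering when probabilities hover between the two thresholds.

```python
import math
from collections import deque


def ema(probs, factor=0.35):
    """Exponential moving average; `factor` is assumed to weight the newest value."""
    out, prev = [], None
    for p in probs:
        prev = p if prev is None else factor * p + (1.0 - factor) * prev
        out.append(prev)
    return out


def moving_average(probs, window=5):
    """Mean of the last `window` values (fewer at the start of the stream)."""
    recent, out = deque(maxlen=window), []
    for p in probs:
        recent.append(p)
        out.append(sum(recent) / len(recent))
    return out


def detect_segments(probs, frame_s=0.032, start_threshold=0.4, end_threshold=0.25,
                    min_speech_duration=0.05, min_silence_duration=0.4):
    """Return [start, end) frame-index pairs found with two-threshold hysteresis."""
    # Convert the second-based minimums into whole frames (rounded up).
    need_speech = max(1, math.ceil(round(min_speech_duration / frame_s, 6)))
    need_silence = max(1, math.ceil(round(min_silence_duration / frame_s, 6)))

    segments, speaking = [], False
    speech_run = silence_run = start = 0
    for i, p in enumerate(probs):
        if not speaking:
            # Speech starts only after enough consecutive frames at or above start_threshold.
            speech_run = speech_run + 1 if p >= start_threshold else 0
            if speech_run >= need_speech:
                speaking, start, silence_run = True, i - speech_run + 1, 0
        else:
            # Frames between the two thresholds keep the current "speaking" state.
            silence_run = silence_run + 1 if p < end_threshold else 0
            if silence_run >= need_silence:
                segments.append((start, i - silence_run + 1))
                speaking, speech_run = False, 0
    if speaking:
        segments.append((start, len(probs)))  # stream ended mid-speech
    return segments


probs = [0.1, 0.5, 0.6, 0.3, 0.3, 0.1, 0.1, 0.1, 0.1]
print(detect_segments(probs, frame_s=0.1,
                      min_speech_duration=0.2, min_silence_duration=0.3))
# [(1, 5)]
```

In this run, frames 3 and 4 score 0.3, which is below start_threshold but above end_threshold, so they stay part of the speech segment. Passing ema(probs) or moving_average(probs) to detect_segments instead of the raw values mimics the "ema" and "moving_average" smoothing strategies.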