ElevenLabs STT

The ElevenLabs STT provider lets your agent use ElevenLabs' speech-to-text models for high-accuracy, real-time audio transcription with built-in voice activity detection (VAD).

Installation

Install the ElevenLabs-enabled VideoSDK Agents package:

pip install "videosdk-plugins-elevenlabs"

Importing

from videosdk.plugins.elevenlabs import ElevenLabsSTT

Authentication

The ElevenLabs plugin requires an ElevenLabs API key.

Set ELEVENLABS_API_KEY in your .env file.
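
A missing key is easier to diagnose before the agent starts. The sketch below is a minimal startup check, assuming the values in your .env file have already been loaded into the process environment. The helper name require_api_key is hypothetical and not part of the SDK.

```python
import os


def require_api_key(name="ELEVENLABS_API_KEY"):
    """Return the API key from the environment, or raise if it is missing or empty."""
    # Hypothetical helper: it only inspects the process environment and
    # does not parse the .env file itself.
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file or export it")
    return key
```

Calling require_api_key() once at startup turns a missing key into an immediate, readable error.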

Example Usage

from videosdk.plugins.elevenlabs import ElevenLabsSTT
from videosdk.agents import CascadingPipeline

# Initialize the ElevenLabs STT model
stt = ElevenLabsSTT(
    # Omit api_key when ELEVENLABS_API_KEY is set in your .env file; the SDK reads it automatically
    api_key="your-elevenlabs-api-key",
    model_id="scribe_v2_realtime",
    language_code="en",
    commit_strategy="vad",
    vad_silence_threshold_secs=0.8,
    vad_threshold=0.4,
    min_speech_duration_ms=50,
    min_silence_duration_ms=50,
    include_language_detection=False
)

# Add stt to cascading pipeline
pipeline = CascadingPipeline(stt=stt)

Note

When using a .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK automatically reads environment variables, so omit api_key, videosdk_auth, and other credential parameters from your code.

Configuration Options

api_key: (str) Your ElevenLabs API key (can also be set via the ELEVENLABS_API_KEY environment variable)
model_id: (str) STT model identifier (default: "scribe_v2_realtime")
language_code: (str) Language code for transcription (default: "en")
sample_rate: (int) Sample rate of input audio in Hz (default: 48000)
commit_strategy: (str) Strategy for committing transcripts (default: "vad")
- "vad" - Voice Activity Detection based commit strategy
vad_silence_threshold_secs: (float) Duration of silence in seconds to detect end-of-speech (default: 0.8)
vad_threshold: (float) Threshold for detecting voice activity (default: 0.4)
min_speech_duration_ms: (int) Minimum duration in milliseconds for a speech segment (default: 50)
min_silence_duration_ms: (int) Minimum duration in milliseconds of silence to consider end-of-speech (default: 50)
include_language_detection: (bool) Whether to include language detection in the transcription (default: False)
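
One way to keep these options in a single place is sketched below. The default values are copied from the list above, but the helper build_stt_options and its type checks are hypothetical, not part of the plugin. The sketch also assumes that ElevenLabsSTT accepts each listed option as a keyword argument.

```python
# Defaults copied from the Configuration Options list; the helper is hypothetical.
DEFAULTS = {
    "model_id": "scribe_v2_realtime",
    "language_code": "en",
    "sample_rate": 48000,
    "commit_strategy": "vad",
    "vad_silence_threshold_secs": 0.8,
    "vad_threshold": 0.4,
    "min_speech_duration_ms": 50,
    "min_silence_duration_ms": 50,
    "include_language_detection": False,
}


def build_stt_options(**overrides):
    """Merge overrides onto the documented defaults, rejecting unknown or mistyped options."""
    options = dict(DEFAULTS)
    for name, value in overrides.items():
        if name not in DEFAULTS:
            raise KeyError(f"unknown STT option: {name}")
        expected = type(DEFAULTS[name])
        is_bool = isinstance(value, bool)
        # Accept a plain int where a float is documented (e.g. 1 instead of 1.0).
        if expected is float and isinstance(value, int) and not is_bool:
            value = float(value)
        # bool is a subclass of int, so reject it explicitly for int options.
        if not isinstance(value, expected) or (expected is int and is_bool):
            raise TypeError(f"{name} expects {expected.__name__}, got {type(value).__name__}")
        options[name] = value
    return options


options = build_stt_options(language_code="es", vad_silence_threshold_secs=1)
# stt = ElevenLabsSTT(**options)  # assumes the plugin is installed and the key is in the environment
```

Rejecting unknown names catches typos such as a misspelled option before they reach the provider.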

Additional Resources

The following resources provide more information about using ElevenLabs with the VideoSDK Agents SDK.

ElevenLabs docs: ElevenLabs STT docs.

SDK Reference

GitHub Repository

Python Package
