Google STT

The Google STT provider enables your agent to use Google's advanced speech-to-text models for high-accuracy, real-time audio transcription.

Installation

Install the Google-enabled VideoSDK Agents package:

pip install "videosdk-plugins-google"

Importing

from videosdk.plugins.google import GoogleSTT, VoiceActivityConfig

Setup Credentials/Authentication

To use Google STT, you need to set up your Google Cloud credentials. You can do this by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your service account key file.

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/keyfile.json"

Alternatively, you can set GOOGLE_APPLICATION_CREDENTIALS in your .env file, or pass the path to the key file directly to the GoogleSTT constructor via the api_key parameter.
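Because the key file path can come from either the api_key parameter or the environment variable, it can help to check the path before constructing GoogleSTT. The following is a minimal sketch using only the Python standard library. The helper name resolve_credentials_path is hypothetical and is not part of the VideoSDK plugin. It assumes that an explicitly passed path should take precedence over the environment variable, which is a choice made for this example.

```python
import json
import os
from pathlib import Path
from typing import Optional

ENV_VAR = "GOOGLE_APPLICATION_CREDENTIALS"


def resolve_credentials_path(explicit_path: Optional[str] = None) -> Path:
    """Hypothetical helper (not part of the plugin): find and sanity-check the key file.

    An explicit path wins over the environment variable; that precedence is an
    assumption of this sketch.
    """
    candidate = explicit_path or os.environ.get(ENV_VAR)
    if not candidate:
        raise RuntimeError(f"Set {ENV_VAR} or pass a key file path explicitly.")

    path = Path(candidate).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Key file not found: {path}")

    # A service account key file is JSON; json.load raises
    # json.JSONDecodeError (a ValueError subclass) if the file is not valid JSON.
    with path.open(encoding="utf-8") as handle:
        json.load(handle)

    return path
```

If no path is passed, the helper falls back to the environment variable, and it fails early with a clear error when neither source yields a readable JSON file.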

Example Usage

from videosdk.plugins.google import GoogleSTT, VoiceActivityConfig
from videosdk.agents import CascadingPipeline

voice_activity_timeout = VoiceActivityConfig(
    speech_start_timeout=1.0,
    speech_end_timeout=5.0,
)

# Initialize the Google STT model
stt = GoogleSTT(
    # If GOOGLE_APPLICATION_CREDENTIALS is set, you can omit api_key
    api_key="/path/to/your/keyfile.json",
    languages="en-US",
    model="latest_long",
    interim_results=True,
    punctuate=True,
    profanity_filter=False,
    voice_activity_timeout=voice_activity_timeout,
)

# Add stt to cascading pipeline
pipeline = CascadingPipeline(stt=stt)

Note

When credentials come from the GOOGLE_APPLICATION_CREDENTIALS environment variable, do not pass api_key to GoogleSTT. The SDK reads the environment variable automatically.

Configuration Options

  • api_key: (str) Path to your Google Cloud service account JSON file. This can also be set via the GOOGLE_APPLICATION_CREDENTIALS environment variable.
  • languages: (Union[str, list[str]]) Language code or a list of language codes for transcription (default: "en-US").
  • model: (str) The Google STT model to use (e.g., "latest_long", "telephony") (default: "latest_long").
  • sample_rate: (int) The target audio sample rate in Hz for transcription (default: 16000). Input audio at 48000 Hz is resampled to this rate.
  • interim_results: (bool) Enable real-time partial transcription results (default: True).
  • punctuate: (bool) Add punctuation to transcription (default: True).
  • min_confidence_threshold: (float) The minimum confidence level for a transcription result to be considered valid (default: 0.1).
  • location: (str) The Google Cloud location to use for the STT service (default: "global").
  • profanity_filter: (bool) Detect profane words and replace all but the first letter of each with asterisks in the transcript (default: False).
  • voice_activity_timeout: (VoiceActivityConfig) Configure speech activity timeouts (default: None).
    • speech_start_timeout: (float) Seconds to wait for speech to begin before timing out. Minimum 0.5 (default: 1.0).
    • speech_end_timeout: (float) Seconds of silence after speech before ending. Minimum 0.1 (default: 5.0).
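
The minimum values listed for speech_start_timeout and speech_end_timeout can be checked before the values are handed to VoiceActivityConfig. The sketch below is illustrative only: the function check_voice_activity_timeouts is a hypothetical helper, not part of the plugin, and whether the plugin itself rejects out-of-range values is not stated here. The limits and defaults are the ones documented in the list above.

```python
# Documented minimums for the two timeouts, in seconds.
MIN_SPEECH_START_TIMEOUT = 0.5
MIN_SPEECH_END_TIMEOUT = 0.1


def check_voice_activity_timeouts(
    speech_start_timeout: float = 1.0,
    speech_end_timeout: float = 5.0,
) -> dict:
    """Hypothetical helper: validate timeouts against the documented minimums.

    Returns the values as keyword arguments suitable for VoiceActivityConfig.
    """
    if speech_start_timeout < MIN_SPEECH_START_TIMEOUT:
        raise ValueError(
            f"speech_start_timeout must be >= {MIN_SPEECH_START_TIMEOUT}, "
            f"got {speech_start_timeout}"
        )
    if speech_end_timeout < MIN_SPEECH_END_TIMEOUT:
        raise ValueError(
            f"speech_end_timeout must be >= {MIN_SPEECH_END_TIMEOUT}, "
            f"got {speech_end_timeout}"
        )
    return {
        "speech_start_timeout": float(speech_start_timeout),
        "speech_end_timeout": float(speech_end_timeout),
    }
```

The returned dictionary can then be unpacked into the config, for example VoiceActivityConfig(**check_voice_activity_timeouts(1.0, 5.0)).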

Additional Resources

The following resources provide more information about using Google with the VideoSDK Agents SDK.
