Azure TTS

The Azure TTS provider enables your agent to use Microsoft Azure's high-quality text-to-speech models for generating natural-sounding voice output with advanced voice tuning and expressive speaking styles.

Installation

Install the Azure-enabled VideoSDK Agents package:

pip install "videosdk-plugins-azure"

Importing

from videosdk.plugins.azure import AzureTTS, VoiceTuning, SpeakingStyle

Authentication

The Azure TTS plugin requires an Azure AI Speech Service resource.

Setup Steps:

  1. Create an AI Services resource for Speech in the Azure portal or from Azure AI Foundry
  2. Get the Speech resource key and region. After your Speech resource is deployed, select "Go to resource" to view and manage keys

Set AZURE_SPEECH_KEY and AZURE_SPEECH_REGION in your .env file:

AZURE_SPEECH_KEY=your-azure-speech-key
AZURE_SPEECH_REGION=your-azure-region

Example Usage

from videosdk.plugins.azure import AzureTTS, VoiceTuning, SpeakingStyle
from videosdk.agents import CascadingPipeline

# Configure voice tuning for prosody control
voice_tuning = VoiceTuning(
    rate="fast",
    volume="loud",
    pitch="high",
)

# Configure speaking style for expressive speech
speaking_style = SpeakingStyle(
    style="cheerful",
    degree=1.5,
)

# Initialize the Azure TTS model
tts = AzureTTS(
    voice="en-US-EmmaNeural",
    language="en-US",
    tuning=voice_tuning,
    style=speaking_style,
)

# Add the TTS instance to a cascading pipeline
pipeline = CascadingPipeline(tts=tts)
Note

When using a .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK reads the environment variables automatically, so omit speech_key, speech_region, and other credential parameters from your code.
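As a minimal sketch of the environment-variable flow described in the note (assuming AZURE_SPEECH_KEY and AZURE_SPEECH_REGION are already set in the process environment, e.g. loaded from your .env file):

```python
from videosdk.plugins.azure import AzureTTS

# No speech_key/speech_region arguments -- the plugin picks up
# AZURE_SPEECH_KEY and AZURE_SPEECH_REGION from the environment.
tts = AzureTTS(voice="en-US-EmmaNeural")
```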

Configuration Options

  • speech_key: (Optional[str]) Azure Speech API key. Uses AZURE_SPEECH_KEY environment variable if not provided.
  • speech_region: (Optional[str]) Azure Speech region (e.g., "eastus", "westus2"). Uses AZURE_SPEECH_REGION environment variable if not provided.
  • speech_endpoint: (Optional[str]) Custom endpoint URL. Uses AZURE_SPEECH_ENDPOINT environment variable if not provided.
  • voice: (str) Voice name to use for audio output (default: "en-US-EmmaNeural"). Get available voices using the Azure voices API.
  • language: (str) Language code (optional, inferred from voice if not specified).
  • tuning: (VoiceTuning) Voice tuning object for rate, volume, and pitch control:
    • rate: (str) Speaking rate ("x-slow", "slow", "medium", "fast", "x-fast" or percentage like "50%")
    • volume: (str) Speaking volume ("silent", "x-soft", "soft", "medium", "loud", "x-loud" or percentage)
    • pitch: (str) Voice pitch ("x-low", "low", "medium", "high", "x-high" or frequency like "+50Hz")
  • style: (SpeakingStyle) Speaking style object for expressive speech:
    • style: (str) Speaking style (e.g., "cheerful", "sad", "angry", "excited", "friendly")
    • degree: (float) Style intensity from 0.01 to 2.0 (default: 1.0)
  • deployment_id: (str) Custom deployment ID for custom models.
  • speech_auth_token: (str) Authorization token for authentication.
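The percentage and frequency forms listed above can be mixed with the named levels. A sketch combining them (the voice and the specific values are illustrative choices, not recommendations):

```python
from videosdk.plugins.azure import AzureTTS, SpeakingStyle, VoiceTuning

# Slightly slower, softer delivery with a small pitch lift,
# using the numeric forms from the option list above
tuning = VoiceTuning(rate="85%", volume="soft", pitch="+20Hz")

# A mild "friendly" style; degree below 1.0 tones the style down
style = SpeakingStyle(style="friendly", degree=0.8)

tts = AzureTTS(voice="en-US-AriaNeural", tuning=tuning, style=style)
```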

Voice Selection

You can find available voices using the Azure Voices List API:

curl --location --request GET 'https://eastus2.tts.speech.microsoft.com/cognitiveservices/voices/list' \
--header 'Ocp-Apim-Subscription-Key: YOUR_SPEECH_KEY'
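The same request can be made from Python with only the standard library. The helper names below (voices_url, list_voices) are illustrative, not part of the SDK, and the region/key values are placeholders:

```python
import json
import urllib.request


def voices_url(region: str) -> str:
    # Regional endpoint matching the curl example above
    return f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list"


def list_voices(region: str, key: str) -> list:
    # Returns the JSON array of voice descriptors for the region
    req = urllib.request.Request(
        voices_url(region),
        headers={"Ocp-Apim-Subscription-Key": key},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

Each entry in the returned array includes fields such as the voice's short name, which is the value to pass as the voice parameter.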

Popular voice options include:

  • en-US-EmmaNeural (Female, neutral)
  • en-US-BrianNeural (Male, neutral)
  • en-US-AriaNeural (Female, cheerful)
  • en-GB-SoniaNeural (Female, British)
