Azure TTS
The Azure TTS provider enables your agent to use Microsoft Azure's high-quality text-to-speech models for generating natural-sounding voice output with advanced voice tuning and expressive speaking styles.
Installation
Install the Azure-enabled VideoSDK Agents package:
pip install "videosdk-plugins-azure"
Importing
from videosdk.plugins.azure import AzureTTS, VoiceTuning, SpeakingStyle
Authentication
The Azure TTS plugin requires an Azure AI Speech Service resource.
Setup Steps:
- Create an AI Services resource for Speech in the Azure portal or from Azure AI Foundry
- Get the Speech resource key and region. After your Speech resource is deployed, select "Go to resource" to view and manage keys
Set AZURE_SPEECH_KEY and AZURE_SPEECH_REGION in your .env file:
AZURE_SPEECH_KEY=your-azure-speech-key
AZURE_SPEECH_REGION=your-azure-region
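If you load these variables at process startup (shown here with python-dotenv as an assumption; any mechanism that places them in the environment works), the plugin can pick them up without any credentials appearing in code:
from dotenv import load_dotenv

# Load AZURE_SPEECH_KEY and AZURE_SPEECH_REGION from the .env file
# into the process environment before constructing AzureTTS.
load_dotenv()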
Example Usage
from videosdk.plugins.azure import AzureTTS, VoiceTuning, SpeakingStyle
from videosdk.agents import CascadingPipeline
# Configure voice tuning for prosody control
voice_tuning = VoiceTuning(
    rate="fast",
    volume="loud",
    pitch="high"
)
# Configure speaking style for expressive speech
speaking_style = SpeakingStyle(
    style="cheerful",
    degree=1.5
)
# Initialize the Azure TTS model
tts = AzureTTS(
    voice="en-US-EmmaNeural",
    language="en-US",
    tuning=voice_tuning,
    style=speaking_style
)
# Add tts to cascading pipeline
pipeline = CascadingPipeline(tts=tts)
When using a .env file for credentials, don't pass them as arguments to model instances or context objects. The SDK reads the environment variables automatically, so omit speech_key, speech_region, and other credential parameters from your code.
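If you do need to pass credentials explicitly (for example, when they come from a secrets manager rather than the environment), the constructor accepts them directly. A minimal sketch; the variable names are hypothetical:
import os
from videosdk.plugins.azure import AzureTTS

# Explicit credentials, read here from hypothetical variable names purely to
# illustrate the parameters -- prefer the .env approach described above.
tts = AzureTTS(
    voice="en-US-EmmaNeural",
    speech_key=os.getenv("MY_VAULT_SPEECH_KEY"),        # hypothetical variable name
    speech_region=os.getenv("MY_VAULT_SPEECH_REGION"),  # hypothetical variable name
)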
Configuration Options
- speech_key: (Optional[str]) Azure Speech API key. Uses the AZURE_SPEECH_KEY environment variable if not provided.
- speech_region: (Optional[str]) Azure Speech region (e.g., "eastus", "westus2"). Uses the AZURE_SPEECH_REGION environment variable if not provided.
- speech_endpoint: (Optional[str]) Custom endpoint URL. Uses the AZURE_SPEECH_ENDPOINT environment variable if not provided.
- voice: (str) Voice name to use for audio output (default: "en-US-EmmaNeural"). Get available voices using the Azure voices API.
- language: (str) Language code (optional, inferred from voice if not specified).
- tuning: (VoiceTuning) Voice tuning object for rate, volume, and pitch control:
  - rate: (str) Speaking rate ("x-slow", "slow", "medium", "fast", "x-fast", or a percentage like "50%")
  - volume: (str) Speaking volume ("silent", "x-soft", "soft", "medium", "loud", "x-loud", or a percentage)
  - pitch: (str) Voice pitch ("x-low", "low", "medium", "high", "x-high", or a frequency offset like "+50Hz")
- style: (SpeakingStyle) Speaking style object for expressive speech:
  - style: (str) Speaking style (e.g., "cheerful", "sad", "angry", "excited", "friendly")
  - degree: (float) Style intensity from 0.01 to 2.0 (default: 1.0)
- deployment_id: (str) Custom deployment ID for custom models.
- speech_auth_token: (str) Authorization token for authentication.
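As a combined illustration of these options, a configuration using percentage- and frequency-based prosody values might look like the sketch below. The values are placeholders, not recommendations, and the commented-out endpoint URL is a placeholder as well:
from videosdk.plugins.azure import AzureTTS, VoiceTuning, SpeakingStyle

# Percentage-based rate and frequency-offset pitch, per the prosody values listed above
tuning = VoiceTuning(rate="85%", volume="soft", pitch="+20Hz")

# Mildly friendly speaking style
style = SpeakingStyle(style="friendly", degree=1.2)

tts = AzureTTS(
    voice="en-US-AriaNeural",
    language="en-US",
    tuning=tuning,
    style=style,
    # speech_endpoint="https://your-custom-endpoint.example",  # optional custom endpoint (placeholder)
)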
Voice Selection
You can find available voices using the Azure Voices List API (replace eastus2 with your resource's region):
curl --location --request GET 'https://eastus2.tts.speech.microsoft.com/cognitiveservices/voices/list' \
--header 'Ocp-Apim-Subscription-Key: YOUR_SPEECH_KEY'
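Equivalently, here is a small Python sketch using the requests library (an assumption; any HTTP client works) against the same endpoint:
import os
import requests

# Query the voices list endpoint for your region (replace eastus2 as needed)
response = requests.get(
    "https://eastus2.tts.speech.microsoft.com/cognitiveservices/voices/list",
    headers={"Ocp-Apim-Subscription-Key": os.environ["AZURE_SPEECH_KEY"]},
)
response.raise_for_status()

# Each entry includes fields such as ShortName and Gender
for voice in response.json():
    print(voice["ShortName"], "-", voice["Gender"])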
Popular voice options include:
- en-US-EmmaNeural (Female, neutral)
- en-US-BrianNeural (Male, neutral)
- en-US-AriaNeural (Female, cheerful)
- en-GB-SoniaNeural (Female, British)
Additional Resources
The following resources provide more information about using Azure with the VideoSDK Agents SDK.
- Azure Speech Service Overview: Complete overview of Azure Speech services.
- Azure TTS docs: Azure Text-to-Speech documentation.
- Voice Selection Guide: Guide for selecting synthesis language and voice.
- Speech Synthesis Markup: Learn about prosody adjustments and voice tuning.