Version: 1.0.x

Cartesia TTS

The Cartesia TTS provider enables your agent to use Cartesia's high-quality, low-latency text-to-speech models for generating natural-sounding voice output.

Installation

Install the Cartesia-enabled VideoSDK Agents package:

pip install "videosdk-plugins-cartesia"

Importing

from videosdk.agents.plugins import CartesiaTTS

Authentication

The Cartesia plugin requires a Cartesia API key.

Set CARTESIA_API_KEY in your .env file.
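
As a quick pre-flight check, the sketch below confirms that the key is visible to the Python process before the agent starts. The helper name `require_cartesia_key` is hypothetical and not part of the plugin; it only reads `os.environ` (or a mapping you pass in). How your `.env` file gets loaded into the environment is not covered on this page, so treat that step as an assumption about your setup.

```python
import os


def require_cartesia_key(env=None):
    """Return the Cartesia API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of the VideoSDK plugin.
    """
    # Fall back to the real process environment when no mapping is supplied.
    env = os.environ if env is None else env
    key = env.get("CARTESIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "CARTESIA_API_KEY is not set; add it to your .env file "
            "or export it in your shell."
        )
    return key
```

Calling `require_cartesia_key()` at startup turns a missing key into an immediate, readable error instead of a failure on the first synthesis request.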

Example Usage

from videosdk.agents.plugins import CartesiaTTS
from videosdk.agents import Pipeline

# Initialize the Cartesia TTS model
tts = CartesiaTTS(
    # Pass api_key only if CARTESIA_API_KEY is NOT set in .env.
    # When the environment variable is set, omit this argument.
    api_key="your-cartesia-api-key",
    model="sonic-2",
    voice_id="794f9389-aac1-45b6-b726-9d9369183238",
    language="en",
    pronunciation_dict_id=None,
    max_buffer_delay_ms=None,  # deprecated; this value is no longer forwarded
    word_timestamps=True,
)

# Add tts to pipeline
pipeline = Pipeline(tts=tts)
Note

When credentials are stored in a .env file, do not also pass them as arguments to model instances. The SDK reads environment variables automatically, so omit api_key and other credential parameters from your code.

Configuration Options

  • api_key: (str) Your Cartesia API key. Can also be set via the CARTESIA_API_KEY environment variable.
  • model: (str) The Cartesia TTS model to use (e.g., "sonic-2", "sonic-turbo"). Defaults to "sonic-2".
  • voice_id: (str | list[float]) Either a Cartesia voice ID (str) or a voice embedding (list of floats).
  • language: (str) The language of the voice (e.g., "en", "fr"). Defaults to "en".
  • base_url: (str) Cartesia base URL. Defaults to "https://api.cartesia.ai".
  • generation_config: (GenerationConfig) Voice generation parameters (sonic-3 only; only fields you set are forwarded). Defaults to None:
    • speed: (float) Speaking speed (optional)
    • emotion: (str) Emotion control (optional)
    • volume: (float) Output volume (optional)
  • pronunciation_dict_id: (str) The ID of the pronunciation dictionary to use for generating speech.
  • max_buffer_delay_ms: (int) Deprecated. Sentence-paced flushing now drives buffer behavior; this value is no longer forwarded. Defaults to None.
  • word_timestamps: (bool) Enable word-level timestamps in the TTS output. Defaults to False.
  • max_connection_age_sec: (float) Refresh the WebSocket after this many seconds to avoid hitting Cartesia's idle/session limits.
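
The `generation_config` entry above says that only the fields you set are forwarded. One way to picture that behavior is the sketch below. `GenerationSettings` and `forwarded_fields` are hypothetical stand-ins written for this illustration; they are not the plugin's `GenerationConfig` class, whose import path and constructor are not shown on this page.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GenerationSettings:
    """Stand-in for the documented optional fields: speed, emotion, volume."""

    speed: Optional[float] = None
    emotion: Optional[str] = None
    volume: Optional[float] = None


def forwarded_fields(settings: GenerationSettings) -> dict:
    """Keep only the fields that were explicitly set (anything not None)."""
    return {
        name: value
        for name, value in asdict(settings).items()
        if value is not None
    }


# Only speed was set, so only speed would be sent along.
print(forwarded_fields(GenerationSettings(speed=1.1)))  # {'speed': 1.1}
```

Under this reading, leaving a field unset means the corresponding default on the Cartesia side stays in effect rather than being overridden.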

Additional resources

The following resources provide more information about using Cartesia with VideoSDK Agents.
