Version: 1.0.x

Avatar Server

When you add an avatar to your agent, a second participant — the Avatar Server — joins the VideoSDK room alongside the agent. Your agent continues handling the conversation (STT → LLM → TTS) and streams its TTS audio output to the Avatar Server over VideoSDK's built-in data channel. The Avatar Server receives that audio, renders video frames synchronized to the speech, and publishes both audio and video tracks directly back into the room.

The end user only ever sees and hears the Avatar Server's output. The agent participant publishes silence and no video.



How the Audio Gets There — Data Channel

The agent and Avatar Server communicate entirely over VideoSDK's built-in data channels. No external message broker, queue, or WebSocket is needed.

| Message | Direction | Reliability | Purpose |
|---|---|---|---|
| PCM audio chunk | Agent → Avatar Server | Unreliable | Raw TTS audio, chunked at ≤15 KB |
| `segment_end` | Agent → Avatar Server | Reliable | TTS turn has finished |
| `INTERRUPT` | Agent → Avatar Server | Reliable | Stop playback immediately |
| `stream_ended` | Avatar Server → Agent | Reliable | Playback-complete acknowledgment |

Audio chunks use unreliable delivery — a dropped packet is better than a delayed one. Control signals use reliable delivery so that a missed interrupt or segment boundary never permanently desyncs state.
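As a rough sketch of the wire format above (not the framework's actual API — `chunk_pcm` and `make_control` are illustrative helpers), the sender splits raw PCM into ≤15 KB pieces for the unreliable channel, while control messages are tiny JSON payloads sent reliably:

```python
# Illustrative sketch of the message split described above.
# chunk_pcm / make_control are hypothetical names, not framework APIs.
import json

MAX_CHUNK_BYTES = 15 * 1024  # the documented 15 KB ceiling per audio message

def chunk_pcm(pcm: bytes, max_size: int = MAX_CHUNK_BYTES) -> list[bytes]:
    """Split raw PCM bytes into data-channel-sized chunks (unreliable delivery)."""
    return [pcm[i : i + max_size] for i in range(0, len(pcm), max_size)]

def make_control(kind: str) -> bytes:
    """Encode a reliable control message: segment_end, INTERRUPT, or stream_ended."""
    return json.dumps({"type": kind}).encode()

# One second of 16-bit mono PCM at 24 kHz is 48,000 bytes -> 4 chunks.
pcm = bytes(24_000 * 2)
chunks = chunk_pcm(pcm)
```

A dropped audio chunk under this scheme costs at most ~0.3 s of speech, while the small reliable messages keep turn boundaries and interrupts intact.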


Two Ways to Run an Avatar Server

There are two paths depending on whether you want the framework to handle A/V synchronization for you or whether you are building your own rendering backend.

Path 1 - Local Avatar

Local Avatar Server

When using a local avatar, the framework's built-in components handle receiving audio from the data channel, orchestrating your renderer, and pacing frames into the room. You only need to implement the visual rendering logic itself.

AvatarAudioIn — runs inside your Avatar Server process. It listens on the data channel, reassembles the PCM stream, handles interrupts (clearing its buffer with a 0.3 s cooldown to drop any in-flight chunks), and exposes a clean async iterator of audio frames and segment markers.
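The interrupt cooldown can be modeled as a small state machine: after an `INTERRUPT`, the buffer is cleared and any chunk arriving within the cooldown window is treated as an in-flight leftover from the cancelled turn and dropped. This is a toy model of that behavior, not the real `AvatarAudioIn` internals:

```python
# Toy model of the interrupt-cooldown logic described above; the real
# AvatarAudioIn implementation and its internals may differ.
from typing import Optional

class AudioBuffer:
    COOLDOWN_S = 0.3  # matches the documented 0.3 s cooldown

    def __init__(self) -> None:
        self._frames: list[bytes] = []
        self._interrupted_at: Optional[float] = None

    def on_interrupt(self, now: float) -> None:
        """INTERRUPT received: discard buffered audio, start cooldown."""
        self._frames.clear()
        self._interrupted_at = now

    def on_chunk(self, chunk: bytes, now: float) -> None:
        """Drop chunks that were already in flight when the interrupt fired."""
        if self._interrupted_at is not None and now - self._interrupted_at < self.COOLDOWN_S:
            return  # stale chunk from the interrupted turn
        self._frames.append(chunk)

buf = AudioBuffer()
buf.on_chunk(b"a", now=0.0)
buf.on_interrupt(now=1.0)
buf.on_chunk(b"stale", now=1.1)   # inside cooldown: dropped
buf.on_chunk(b"fresh", now=1.5)   # cooldown elapsed: accepted
```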

AvatarServer — the orchestrator. It drains AvatarAudioIn, feeds each frame into your AvatarRenderer, forwards the rendered output into AvatarSynchronizer, and sends a stream_ended acknowledgment back to the agent at the end of each TTS turn.

AvatarSynchronizer — paces audio and video frames into their respective custom tracks at the configured FPS. At 30 FPS and 24 kHz, each video frame corresponds to exactly 800 audio samples. It sleeps between frames if the renderer runs faster than real time.
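The pacing arithmetic is worth making explicit: 24,000 samples/s ÷ 30 frames/s = 800 samples per frame, and frame *n* is due at *n*/FPS seconds. A minimal sketch of that bookkeeping (not the `AvatarSynchronizer` API itself):

```python
# Back-of-envelope for the pacing described above; illustrative helpers only.
SAMPLE_RATE = 24_000  # Hz, mono PCM
FPS = 30

def samples_per_frame(sample_rate: int = SAMPLE_RATE, fps: int = FPS) -> int:
    """Audio samples that accompany one video frame."""
    assert sample_rate % fps == 0, "sample rate should divide evenly by FPS"
    return sample_rate // fps

def frame_deadline(frame_index: int, fps: int = FPS) -> float:
    """Offset in seconds at which frame_index should be published; a
    synchronizer sleeps until this deadline if rendering ran faster."""
    return frame_index / fps
```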

AvatarRenderer — the only thing you implement. For each audio frame you receive, produce one video frame and yield them in order (video first, then audio). The framework wires everything else.
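A renderer implementation might look like the following. The base-class signature shown here is a hypothetical stand-in so the sketch runs on its own — the framework's real `AvatarRenderer` contract may differ — but the shape matches the description above: consume audio frames, yield one (video, audio) pair per frame, video first.

```python
# Hypothetical sketch of the renderer contract; the framework's actual
# AvatarRenderer base class and method signatures may differ.
from abc import ABC, abstractmethod
from typing import AsyncIterator, Tuple

class AvatarRenderer(ABC):  # stand-in for the framework's base class
    @abstractmethod
    async def render(
        self, audio: AsyncIterator[bytes]
    ) -> AsyncIterator[Tuple[bytes, bytes]]:
        """Yield (video_frame, audio_frame) pairs, video first."""

class GlowRenderer(AvatarRenderer):
    """Trivial visual: brightness driven by peak audio amplitude."""

    async def render(self, audio):
        async for audio_frame in audio:
            level = max(audio_frame) if audio_frame else 0
            # A real renderer would rasterize pixels; 4 bytes stand in
            # for a video frame here.
            video_frame = bytes([level]) * 4
            yield video_frame, audio_frame
```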

A small dispatcher (POST /launch) runs as a separate HTTP service and spawns one Avatar Server process per room on demand.


Comparison

|  | Local Avatar | Cloud / Custom Backend |
|---|---|---|
| Who runs the Avatar Server | You, on your own machine or server | Your backend or a cloud provider |
| Audio received via | Framework's `AvatarAudioIn` (data channel) | Your own data channel subscriber |
| A/V synchronization | Framework's `AvatarSynchronizer` | Your own engine or provider's |
| What you implement | `AvatarRenderer` (one class) | Full backend service |
| Best for | Custom visuals, full control, prototyping | Production lip-sync, managed infrastructure |
| Examples | Circular glow, waveform visualizer | Any cloud avatar provider |

Key Components

| Component | Description |
|---|---|
| `AvatarAudioOut` | Agent-side component that dispatches the Avatar Server and streams TTS audio in chunks over the data channel |
| `AvatarAudioIn` | Service-side component that receives data channel messages and provides an async iterator of audio frames |
| `AvatarServer` | Service-side orchestrator that connects `AvatarAudioIn` → renderer → synchronizer → media tracks |
| `AvatarSynchronizer` | Handles timing and pacing of audio + video frames based on configured FPS |
| `AvatarRenderer` | Abstract base class — implement this to define avatar visuals and rendering logic |
| `AvatarSettings` | Configuration object for resolution, FPS, and audio sample rate |
| `generate_avatar_credentials` | Utility to sign a VideoSDK JWT for authenticating the Avatar Server participant |
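For intuition on what `generate_avatar_credentials` produces, here is a stdlib-only HS256 JWT signer. The exact claims the real utility emits are not specified here, so `apikey` and `permissions` below are assumptions for illustration:

```python
# Illustrative stand-in for generate_avatar_credentials: signing an
# HS256 JWT with the stdlib. The real utility's claims may differ.
import base64, hashlib, hmac, json, time

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_token(api_key: str, secret: str, ttl_s: int = 3600) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "apikey": api_key,              # assumed claim name
        "permissions": ["allow_join"],  # the avatar participant only needs to join
        "exp": int(time.time()) + ttl_s,
    }
    signing_input = (
        f"{_b64url(json.dumps(header).encode())}"
        f".{_b64url(json.dumps(payload).encode())}"
    )
    sig = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(sig)}"
```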
