
AI Translation Agent using OpenAI Realtime API and VideoSDK

Real-time AI translation bridges language barriers in video calls by automatically processing spoken audio, translating it, and speaking the translation in another language. In this project, we leverage VideoSDK for real-time audio/video conferencing and OpenAI’s Realtime API for integrated speech-to-text (STT), translation, and text-to-speech (TTS).

The AI voice agent joins a meeting as a "translator," capturing participants’ audio streams, sending them to OpenAI’s streaming endpoint, and injecting the synthesized translated speech back into the meeting. This full-stack system combines a React frontend, a FastAPI backend, and a Python-based AI agent to facilitate seamless multilingual communication.

Get the Code on GitHub

This project integrates VideoSDK and the OpenAI Realtime API to create a powerful AI Translator Agent. Below are detailed setup instructions to help you get started quickly. Be sure to check out the GitHub repository for a fully functional working example.

Step-by-Step Video Guide

This webinar dives into building real-time conversational AI agents using VideoSDK for robust audio/video infrastructure and OpenAI's Realtime API, with a focus on speech-to-speech functionality. We explore the architecture of an AI Translator Agent, covering how the client, Python backend, audio processing, and OpenAI integration work together. Discover the potential of AI in real-time communication and find the complete code example on our GitHub repository. New users can also get 10,000 free minutes every month on VideoSDK to start building their own agents.

System Architecture

Below is the architecture diagram illustrating this flow:

Architecture Diagram

As highlighted in the diagram above, the AI Translation Agent is a distributed system comprising three core components working in synergy:

  1. React Frontend: The user interface built with React and VideoSDK's React SDK, enabling users to create/join meetings and interact visually.
  2. FastAPI Backend: A Python server acting as an intermediary to launch the AI Agent process upon user request from the frontend.
  3. Python AI Agent: The intelligent core, using the VideoSDK Python SDK to participate in the meeting and the OpenAI Realtime API for integrated speech processing and translation.

The flow begins with users interacting with the React Frontend to manage their meeting session. When a user decides to add the translation capability, the frontend signals the FastAPI Backend. The backend's role is simple but crucial: it spins up the separate Python AI Agent process. This agent then uses its own VideoSDK client (via the Python SDK) to join the same meeting as a participant. Once inside, it listens to the audio streams of other participants, processes them, sends them to OpenAI for transcription and translation, receives the translated speech back, and injects that audio stream into the meeting for everyone to hear.

Clone This Project

Follow the steps below to get started with the AI Translation Agent project.

Step 1: Clone the Repository

Start by cloning the GitHub repository to get the project files.
Clone Repository
git clone https://github.com/videosdk-community/videosdk-openai-realtime-translator.git
Navigate to Project Directory
cd videosdk-openai-realtime-translator
Step 2: Client Setup

Navigate to the client folder and set up the environment variables.
Move to Client Directory
cd client
Copy Environment File
cp .env.example .env
Update .env File
VITE_APP_VIDEOSDK_TOKEN=your_videosdk_auth_token_here
Generate your VideoSDK auth token from the VideoSDK dashboard and update it in the .env file.
Install Dependencies
npm install
Step 3: Server Setup (Python FastAPI)

Return to the project root, then create a virtual environment and install the Python dependencies.
Create Virtual Environment
python -m venv .venv
Activate Virtual Environment (Mac/Linux)
source .venv/bin/activate
Activate Virtual Environment (Windows)
.\.venv\Scripts\activate
Install Dependencies
pip install -r requirements.txt
Copy the environment file and update required API keys.
Copy .env File
cp .env.example .env
Update .env File
OPENAI_API_KEY=your_openai_key_here
Generate your OpenAI API key from OpenAI Platform and update it in the .env file.
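As a quick sanity check, the snippet below shows one way the backend could load this key at startup. This is a minimal sketch assuming python-dotenv is used to read the .env file; the repository's own loading code may differ.
verify_env.py (hypothetical helper, not part of the repository)
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads OPENAI_API_KEY from the .env file in the working directory

if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is missing; check your .env file")
print("OPENAI_API_KEY loaded")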
Step 4: Running the Application

Start the backend server using Uvicorn.
Run Backend Server
uvicorn main:app --reload
In a separate terminal, start the frontend client from the client directory.
Run Frontend
npm run dev
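
Once both processes are running, you can also invite the agent without the UI button by calling the backend directly. The sketch below uses the requests library and assumes the default local ports; replace the placeholders with a real meeting ID and VideoSDK token.
invite_agent.py (hypothetical convenience script)
import requests

resp = requests.post(
    "http://localhost:8000/join-player",  # FastAPI backend started with uvicorn
    json={
        "meeting_id": "your-meeting-id",  # meeting created from the React client
        "token": "your_videosdk_auth_token_here",
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # expected: {"message": "AI agent joined"}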

Technical Deep Dive

Key Components and Code Walkthrough

Let's break down the code in each major part of the application to understand how this real-time translation happens.

1. React Frontend

The frontend is responsible for the user experience – letting them join a meeting, setting their name and language, and visualizing the participants. It also handles the UI logic for inviting the AI agent and managing microphone state.

  • client/src/App.tsx: This is the main entry point for the client application.
    • It manages the meeting state (meetingId), user name (userName), and selected language (selectedLanguage).
    • It uses the MeetingProvider from @videosdk.live/react-sdk to set up the meeting context, passing the meetingId, user name, and importantly, their preferredLanguage in the metaData. This metadata is later accessed by the AI Agent.
    • It handles the UI for creating or joining a meeting and gathering user information.
client/src/App.tsx (within App component)
<MeetingProvider
  config={{
    meetingId,
    micEnabled: false, // Mic is initially muted, uses push-to-talk
    webcamEnabled: true,
    name: userName,
    debugMode: true,
    metaData: {
      // Sending preferred language as metadata
      preferredLanguage: selectedLanguage,
    },
  }}
  token={token}
  joinWithoutUserInteraction // Automatically join when meetingId is set
>
  {/* ... MeetingView component handles the meeting UI ... */}
</MeetingProvider>
  • The MeetingView component (also in App.tsx) displays the participants. It iterates through the participants map provided by useMeeting, identifying the AI participant based on display name and rendering ParticipantCard for each.

  • client/src/components/ParticipantCard.tsx: This component renders individual participant tiles (or a large view for the AI).

    • It uses the useParticipant hook to get participant-specific data like displayName, micStream, webcamStream, isActiveSpeaker, and isLocal.
    • It includes logic for a "Hold to Speak" (spacebar push-to-talk) button for local human participants. This is a common pattern to control when audio is sent, crucial for managing potential feedback in calls with agents.
client/src/components/ParticipantCard.tsx (Push-to-Talk useEffect)
React.useEffect(() => {
  if (!isLocal || isAI) return;
  // ... add keydown/keyup listeners for spacebar ...
  const handleKeyDown = (e: KeyboardEvent) => {
    if (e.code === "Space" && !isSpeaking) {
      e.preventDefault();
      setIsSpeaking(true);
      unmuteMic(); // Unmute mic when spacebar is held
    }
  };
  const handleKeyUp = (e: KeyboardEvent) => {
    if (e.code === "Space") {
      e.preventDefault();
      setIsSpeaking(false);
      muteMic(); // Mute mic when spacebar is released
    }
  };
  // ... add and remove event listeners ...
}, [isLocal, isAI, isSpeaking, unmuteMic, muteMic]);
  • It has a key useEffect hook that detects when the AI participant becomes the isActiveSpeaker. When this happens and the current participant is the local human user, it includes logic to automatically mute the local microphone. This is a critical step to prevent the human's microphone from picking up the AI's translated speech and creating a feedback loop or confusing the AI's STT.
client/src/components/ParticipantCard.tsx (Auto-Mute useEffect)
React.useEffect(() => {
  if (isLocal && !isAI && isActiveSpeaker) {
    // If THIS card is the local human AND they are active speaker...
    const aiParticipants = document.querySelectorAll(
      '[data-ai-participant="true"]' // Find AI participants
    );
    aiParticipants.forEach((ai) => {
      // Check if ANY AI participant is marked as currently speaking by the UI
      if (ai.getAttribute("data-is-speaking") === "true") {
        muteMic(); // If an AI is speaking, mute the local human mic
        setIsSpeaking(false);
      }
    });
  }
}, [isLocal, isAI, isActiveSpeaker, muteMic]); // Depend on mic state changes
  • It displays video (<video ref={videoRef}>) or an avatar (<Bot /> or <User />) and the participant's name.

  • client/src/components/MeetingControls.tsx: Contains basic meeting controls.

    • The "Invite AI Translator" button triggers a fetch request to the backend's /join-player endpoint, sending the current meetingId and token.
client/src/components/MeetingControls.tsx (inviteAI)
const inviteAI = async () => {
  try {
    const response = await fetch("http://localhost:8000/join-player", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ meeting_id: meetingId, token }),
    });
    if (!response.ok) throw new Error("Failed to invite AI");
    toast.success("AI Translator joined successfully");
    setAiJoined(true); // Update state to reflect AI joined
  } catch (error) {
    toast.error("Failed to invite AI Translator");
    console.error("Error inviting AI:", error);
  }
};

2. FastAPI Backend

The backend is minimal, serving primarily to launch the AI agent process when requested by the frontend.

  • main.py:
    • Sets up a basic FastAPI app with CORS enabled.
    • Defines the /join-player endpoint that accepts the meeting ID and token.
    • It uses FastAPI's BackgroundTasks to run the server_operations function. This is important because the ai_agent.join() call is a long-running asynchronous process that would otherwise block the HTTP request.
main.py
@app.post("/join-player")
async def join_player(req: MeetingReqConfig, bg_tasks: BackgroundTasks):
    # Add the server_operations async function to background tasks
    bg_tasks.add_task(server_operations, req)
    return {"message": "AI agent joined"}


async def server_operations(req: MeetingReqConfig):
    global ai_agent
    # Initialize and join the AI agent in the background
    ai_agent = AIAgent(req.meeting_id, req.token, "AI")
    try:
        await ai_agent.join()
        # Keep the background task running to keep the agent active
        while True:
            await asyncio.sleep(1)
    except Exception as ex:
        print(f"[ERROR]: either joining or running bg tasks: {ex}")
    finally:
        # Ensure the agent leaves the meeting cleanly if the task stops
        if ai_agent:
            ai_agent.leave()


# ... uvicorn main execution block ...

3. Python AI Agent

This is where the AI Agent lives and interacts with both VideoSDK and OpenAI.

  • ai_agent.py:
    • Initializes the VideoSDK meeting client using VideoSDK.init_meeting. Notice the custom_microphone_audio_track is set to an instance of CustomAudioStreamTrack. This is how the agent will output its translated speech into the meeting.
agent/ai_agent.py (within AIAgent.__init__)
self.audio_track = CustomAudioStreamTrack(
    loop=self.loop, handle_interruption=True
)
self.meeting_config = MeetingConfig(
    # ... other config ...
    mic_enabled=True,  # Agent needs mic enabled to speak
    webcam_enabled=False,
    custom_microphone_audio_track=self.audio_track,  # Inject custom audio here
)
self.agent = VideoSDK.init_meeting(**self.meeting_config)
# Add event listeners for meeting and participant events
self.agent.add_event_listener(MeetingHandler(...))
  • The on_participant_joined callback (part of MeetingHandler) is triggered when a user joins. It extracts the preferredLanguage from the participant's meta_data that was sent from the frontend.
  • When two human participants have joined (besides the AI), the agent can dynamically update the OpenAI session instructions using self.intelligence.update_session_instructions. This tells OpenAI the languages to translate between based on the detected languages of the human participants.
agent/ai_agent.py (within on_participant_joined)
peer_name = participant.display_name
native_lang = participant.meta_data["preferredLanguage"]  # Get language from frontend metadata
self.participants_data[participant.id] = {
    "name": peer_name,
    "lang": native_lang,
}
print("Participant joined:", peer_name)
print("Native language :", native_lang)

# When we have 2 human participants, set the translator instructions
if len(self.participants_data) == 2:
    participant_ids = list(self.participants_data.keys())
    p1 = self.participants_data[participant_ids[0]]
    p2 = self.participants_data[participant_ids[1]]

    translator_instructions = f"""
    You are a real-time translator bridging a conversation between:
    - {p1['name']} (speaks {p1['lang']})
    - {p2['name']} (speaks {p2['lang']})

    You have to listen and speak those exactly word in different language
    eg. when {p1['lang']} is spoken then say that exact in language {p2['lang']}
    similar when {p2['lang']} is spoken then say that exact in language {p1['lang']}
    Keep in account who speaks what and use
    NOTE -
    Your job is to translate, from one language to another, don't engage in any conversation
    """
    # Asynchronously update the OpenAI session
    asyncio.create_task(
        self.intelligence.update_session_instructions(translator_instructions)
    )
# ... stream handling logic ...
  • For each participant, a ParticipantHandler is attached. When a participant's audio stream is enabled (on_stream_enabled), the agent starts an asynchronous task (add_audio_listener) to process that stream; a sketch of this wiring follows the audio-processing snippet below.
  • The add_audio_listener method reads audio frames (stream.track.recv()), processes the raw audio data (converting from VideoSDK's internal format to a NumPy array, resampling from 48kHz to 16kHz, and converting to 16-bit PCM bytes), and then sends these processed bytes to the OpenAIIntelligence instance.
agent/ai_agent.py (within add_audio_listener)
while True:
    try:
        # await asyncio.sleep(0.01)  # Small sleep to prevent tight loop
        if not self.intelligence.ws:
            continue  # Wait for OpenAI WS to be ready

        frame = await stream.track.recv()  # Receive audio frame from VideoSDK
        audio_data = frame.to_ndarray()[0]
        audio_data_float = (  # Convert to float32
            audio_data.astype(np.float32) / np.iinfo(np.int16).max
        )
        audio_mono = librosa.to_mono(audio_data_float.T)  # Ensure mono
        audio_resampled = librosa.resample(  # Resample to 16kHz
            audio_mono, orig_sr=48000, target_sr=16000
        )
        pcm_frame = (  # Convert back to 16-bit PCM bytes
            (audio_resampled * np.iinfo(np.int16).max)
            .astype(np.int16)
            .tobytes()
        )

        # Send processed audio to OpenAI Intelligence module
        await self.intelligence.send_audio_data(pcm_frame)

    except Exception as e:
        print("Audio processing error:", e)
        break  # Exit loop on error
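
The wiring that starts this listener isn't shown above. Below is an illustrative sketch of how a participant handler might launch and cancel the audio task; the class name, callback names, and stream attributes follow the walkthrough, but the actual VideoSDK handler base class and signatures in the repository may differ.
agent/ai_agent.py (illustrative sketch of the stream wiring)
import asyncio

from videosdk import ParticipantEventHandler  # assumed base class; may differ in the repo


class ParticipantHandler(ParticipantEventHandler):
    def __init__(self, agent: "AIAgent", participant_id: str):
        super().__init__()
        self.agent = agent
        self.participant_id = participant_id
        self.audio_task = None

    def on_stream_enabled(self, stream):
        # Start forwarding this participant's audio to OpenAI once their mic stream is live
        if stream.kind == "audio":
            self.audio_task = asyncio.create_task(self.agent.add_audio_listener(stream))

    def on_stream_disabled(self, stream):
        # Stop forwarding when the participant mutes or leaves
        if stream.kind == "audio" and self.audio_task:
            self.audio_task.cancel()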

4. OpenAI Integration

This module handles the communication with the OpenAI Realtime API over a WebSocket.

  • openai_intelligence.py:
    • The OpenAIIntelligence class establishes and manages the WebSocket connection to the OpenAI Realtime API endpoint (wss://api.openai.com/v1/realtime).
    • The connect method sets up the WebSocket and initiates the receive_message_handler task; a rough sketch of this connection follows the snippet below.
    • The send_audio_data method takes the 16kHz PCM audio bytes received from the AIAgent, Base64 encodes them, and sends them as an input_audio_buffer.append message over the WebSocket.
intelligence/openai/openai_intelligence.py (send_audio_data)
async def send_audio_data(self, audio_data: bytes):
    """audio_data is assumed to be pcm16 24kHz mono little-endian"""
    # Base64 encode the raw audio bytes
    base64_audio_data = base64.b64encode(audio_data).decode("utf-8")
    # Create and send the append message
    message = InputAudioBufferAppend(audio=base64_audio_data)
    await self.send_request(message)  # send_request just sends the JSON string over WS
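
For completeness, here is a rough sketch of what the connect method looks like with aiohttp: it opens the WebSocket to the Realtime endpoint with the required auth headers and then spawns the receive task. The model name, attribute names, and exact structure are illustrative and may differ from the repository's implementation.
intelligence/openai/openai_intelligence.py (illustrative sketch of connect)
import asyncio
import aiohttp


async def connect(self):
    self.session = aiohttp.ClientSession()
    # The Realtime API is reached over a WebSocket; the model is passed as a query parameter
    self.ws = await self.session.ws_connect(
        "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",  # model name is illustrative
        headers={
            "Authorization": f"Bearer {self.api_key}",
            "OpenAI-Beta": "realtime=v1",  # required while the Realtime API is in beta
        },
    )
    # Consume server events (session.created, response.audio.delta, ...) in the background
    self.receive_task = asyncio.create_task(self.receive_message_handler())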
  • The receive_message_handler asynchronously processes incoming messages from OpenAI. It parses the JSON messages and handles different event types.
  • A critical event it handles is EventType.RESPONSE_AUDIO_DELTA, which contains Base64 encoded chunks of the synthesized translated speech. When this event is received, the audio data is decoded and passed to the AI agent's custom audio track using self.on_audio_response.
intelligence/openai/openai_intelligence.py (within receive_message_handler)
async for response in self.ws:
    try:
        if response.type == aiohttp.WSMsgType.TEXT:
            message = json.loads(response.data)
            match message["type"]:
                case EventType.RESPONSE_AUDIO_DELTA:
                    # Decode the received Base64 audio chunk
                    audio_chunk = base64.b64decode(message["delta"])
                    # Pass the audio chunk to the custom audio track buffer
                    self.on_audio_response(audio_chunk)
                case EventType.ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED:
                    # Log the transcription of the input audio
                    print(f"Client Transcription: {message['transcript']}")
                # ... handle other events like SESSION_CREATED, ERROR, etc. ...
    except Exception as e:
        traceback.print_exc()
        print("Error in receiving message:", e)
  • The update_session_instructions method, called by the AI agent, updates the system prompt for the OpenAI session, informing it of the languages to translate between.
intelligence/openai/openai_intelligence.py (update_session_instructions)
async def update_session_instructions(self, new_instructions: str):
    """Dynamically update the system instructions for the OpenAI session."""
    if self.ws is None:
        self.pending_instructions = new_instructions  # Queue if not connected yet
        return
    # Update the session parameters object and send the update message
    self.session_update_params.instructions = new_instructions
    await self.update_session(self.session_update_params)


async def update_session(self, session: SessionUpdateParams):
    # Send the session.update message over the WebSocket
    await self.send_request(
        SessionUpdate(
            event_id=generate_event_id(),
            session=session,
        )
    )

5. Custom Audio Stream Track

This class acts as a virtual microphone for the AI Agent, allowing it to inject audio received from OpenAI back into the VideoSDK meeting.

  • agent/audio_stream_track.py:
    • Inherits from videosdk.CustomAudioTrack, requiring implementation of the recv method.
    • It maintains a frame_buffer which stores av.AudioFrame objects.
    • The add_new_bytes method is called by OpenAIIntelligence when it receives translated audio chunks. It queues the raw audio bytes for a background processing loop (_process_audio), which slices them into fixed-size chunks, converts them into av.AudioFrame objects, and appends them to the frame_buffer.
agent/audio_stream_track.py (add_new_bytes and _process_audio)
async def add_new_bytes(self, audio_data_stream: Iterator[bytes]):
    # This method receives audio bytes from the intelligence module.
    # It puts the stream into a queue to be processed in a separate thread.
    await self._process_audio_task_queue.put(audio_data_stream)


async def _process_audio(self):
    # This runs in a separate thread to process incoming audio bytes
    while True:
        try:
            audio_data_stream = asyncio.run_coroutine_threadsafe(
                self._process_audio_task_queue.get(), self.loop  # Get stream from queue
            ).result()
            for audio_data in audio_data_stream:
                # Convert raw bytes to AudioFrame and append to buffer
                self.audio_data_buffer += audio_data
                while len(self.audio_data_buffer) > self.chunk_size:
                    chunk = self.audio_data_buffer[: self.chunk_size]
                    self.audio_data_buffer = self.audio_data_buffer[self.chunk_size :]
                    audio_frame = self.buildAudioFrames(chunk)  # Converts bytes to AudioFrame
                    self.frame_buffer.append(audio_frame)  # Add frame to buffer
        except Exception as e:
            print("Error while process audio", e)
  • The recv method is the core of the CustomAudioTrack. VideoSDK's Python SDK calls this method whenever it needs the next audio frame from this custom track to send to the meeting.
  • recv checks the frame_buffer. If there are frames (translated audio from OpenAI), it pops one and returns it. If the buffer is empty, it returns a silence frame to maintain the audio stream integrity.
agent/audio_stream_track.py (recv method)
async def recv(self) -> AudioFrame:
    try:
        # ... timestamp logic ...

        if len(self.frame_buffer) > 0:
            # If there's buffered audio from OpenAI TTS, return the next frame
            frame = self.frame_buffer.pop(0)
        else:
            # If buffer is empty, return a silence frame
            frame = AudioFrame(format="s16", layout="mono", samples=self.samples)
            for p in frame.planes:
                p.update(bytes(p.buffer_size))  # Fill with silence

        # Set presentation timestamp (pts) for the frame
        frame.pts = pts
        frame.time_base = time_base
        frame.sample_rate = self.sample_rate
        return frame
    except Exception as e:
        traceback.print_exc()
        print("error while creating tts->rtc frame", e)
        # Raise MediaStreamError if the stream is done/failed
        raise MediaStreamError

The Real-Time Translation Flow

Let's trace the path of audio through the system when a participant speaks:

  1. Audio Capture and Transmission: A human participant speaks. Their client application (React Frontend) captures the audio and sends it via VideoSDK's real-time infrastructure.
  2. Agent Receives Audio: The Python AI Agent, connected to the meeting, receives the participant's audio stream through the VideoSDK Python SDK.
  3. Audio Processing & Forwarding: The Agent processes the raw audio (resampling, formatting) and sends it continuously over a WebSocket connection to the OpenAI Realtime API.
  4. OpenAI STT, Translation, and TTS: OpenAI receives the audio stream, performs Speech-to-Text (STT) to transcribe it, translates the transcription based on the participant languages defined in the instructions, and then performs Text-to-Speech (TTS) to synthesize the translated audio.
  5. OpenAI Streams Translated Audio Back: OpenAI streams the synthesized, translated audio back to the Agent over the same WebSocket connection in real-time chunks.
  6. Agent Receives & Buffers Translation: The Agent receives these audio chunks and buffers them within its CustomAudioStreamTrack.
  7. Agent Injects Audio into Meeting: The VideoSDK Python SDK continuously requests audio frames from the Agent's CustomAudioStreamTrack. The Agent provides the buffered translated audio frames.
  8. VideoSDK Broadcasts Translation: VideoSDK broadcasts the audio received from the Agent's custom track to all other participants in the meeting.
  9. Participants Hear Translation & UI Updates: Other participants' clients receive and play the translated audio. The frontend UI includes logic to detect the AI speaking and can automatically mute the local participant's microphone to help prevent echoes.

This streamlined flow demonstrates how audio seamlessly moves from one speaker, through the AI agent and OpenAI, and back into the meeting for others to hear in their language.

Conclusion

This project effectively demonstrates how to build sophisticated real-time AI agents that participate directly in video conferencing using VideoSDK and the OpenAI Realtime API. By treating the AI as a peer participant with a custom audio input, we can seamlessly integrate external AI processing into the live conversation.

The use of VideoSDK provides the robust real-time audio and video infrastructure, while the OpenAI Realtime API handles the complex integrated speech-to-text, translation, and text-to-speech tasks with low latency. The Python agent orchestrates this process, intelligently managing audio streams and dynamically instructing the AI based on the meeting context (like participant languages).

This architecture is not limited to translation. The same pattern could be applied to build agents for transcription, summarization, real-time sentiment analysis, information retrieval, or any task that involves processing meeting audio and providing real-time output back to participants.

We encourage you to clone the project from GitHub, follow the setup steps detailed above, and experiment with this powerful combination of real-time communication and cutting-edge AI. Unlock new possibilities for interactive and intelligent online meetings!

Got a question? Ask us on Discord.