AI Agent with JavaScript - Quick Start
VideoSDK empowers you to seamlessly integrate AI agents with real-time voice interaction into your JavaScript application within minutes.
In this quickstart, you'll learn how to create an AI agent that joins a meeting room and interacts with users through voice, using the Google Gemini Live API.
Prerequisites
Before proceeding, ensure that your development environment meets the following requirements:
- VideoSDK developer account (if you don't have one, sign up on the VideoSDK Dashboard)
- Node.js and Python 3.12+ installed on your machine
- Google API key with Gemini Live API access
You need a VideoSDK token and a Google API key for the Gemini Live API. Visit the VideoSDK dashboard to generate the token, and Google AI Studio to create the API key.
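Since the agent requires Python 3.12+, a quick sanity check before installing dependencies can save a confusing failure later. This is a minimal sketch (the version floor comes from the prerequisites above; the helper name is illustrative):

```python
import sys

def meets_requirement(version_info=None, minimum=(3, 12)):
    """Return True if the given (or running) interpreter version satisfies minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum

if __name__ == "__main__":
    if not meets_requirement():
        raise SystemExit(f"Python 3.12+ required, found {sys.version.split()[0]}")
    print("Python version OK")
```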
Project Structure
First, create an empty project folder for the JavaScript frontend (mkdir folder_name) in your preferred location. Your final project structure should look like this:
root
├── index.html
├── config.js
├── index.js
├── agent-js.py
└── .env
You will be working on the following files:
- index.html: Creates a basic UI for joining the meeting.
- config.js: Stores the token and room ID for the JavaScript frontend.
- index.js: Renders the meeting view and handles the audio functionality.
- agent-js.py: The Python agent using the Google Gemini Live API.
- .env: Environment variables for the Python agent's API keys.
1. Building the JavaScript Frontend
Step 1: Install VideoSDK
Import VideoSDK using a <script> tag, or install it with npm or yarn:
<html>
<head>
<!--.....-->
</head>
<body>
<!--.....-->
<script src="https://sdk.videosdk.live/js-sdk/0.3.6/videosdk.js"></script>
</body>
</html>
npm install @videosdk.live/js-sdk
yarn add @videosdk.live/js-sdk
Step 2: Design the User Interface
Create an index.html file containing a join-screen and a grid-screen for audio-only interaction.
<!DOCTYPE html>
<html>
<head> </head>
<body>
<div id="join-screen">
<button id="createMeetingBtn">Join Agent Meeting</button>
</div>
<div id="textDiv"></div>
<div id="grid-screen" style="display: none">
<h3 id="meetingIdHeading"></h3>
<button id="leaveBtn">Leave</button>
<button id="toggleMicBtn">Toggle Mic</button>
<div id="audioContainer"></div>
</div>
<script src="https://sdk.videosdk.live/js-sdk/0.3.6/videosdk.js"></script>
<script src="config.js"></script>
<script src="index.js"></script>
</body>
</html>
Step 3: Configure the Frontend
Create a meeting room using the VideoSDK API:
curl -X POST https://api.videosdk.live/v2/rooms \
-H "Authorization: YOUR_JWT_TOKEN_HERE" \
-H "Content-Type: application/json"
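If you prefer to script room creation, the same call can be made from Python. This sketch mirrors the curl request above (the endpoint, headers, and the roomId response field come from this section; the helper names and error handling are illustrative assumptions):

```python
import json
import urllib.request

API_URL = "https://api.videosdk.live/v2/rooms"

def build_room_request(token: str) -> urllib.request.Request:
    """Build the POST request that creates a new VideoSDK room."""
    return urllib.request.Request(
        API_URL,
        method="POST",
        headers={"Authorization": token, "Content-Type": "application/json"},
    )

def create_room(token: str) -> str:
    """Create a room and return its roomId from the JSON response."""
    with urllib.request.urlopen(build_room_request(token)) as resp:
        return json.load(resp)["roomId"]

if __name__ == "__main__":
    # Requires a valid VideoSDK token; prints the new room's ID.
    print(create_room("YOUR_JWT_TOKEN_HERE"))
```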
Copy the roomId from the response and set it in config.js for the JavaScript frontend:
const TOKEN = "your_videosdk_auth_token_here";
const ROOM_ID = "YOUR_MEETING_ID"; // Static room ID shared between the frontend and the agent
Step 4: Implement Meeting Logic
In index.js, retrieve the DOM elements, declare variables, and add the core meeting functionality.
// Get elements from the DOM
const leaveButton = document.getElementById("leaveBtn");
const toggleMicButton = document.getElementById("toggleMicBtn");
const createButton = document.getElementById("createMeetingBtn");
const audioContainer = document.getElementById("audioContainer");
const textDiv = document.getElementById("textDiv");
// Declare variables
let meeting = null;
let meetingId = "";
let isMicOn = false;
// Join Agent Meeting Button Event Listener
createButton.addEventListener("click", async () => {
document.getElementById("join-screen").style.display = "none";
textDiv.textContent = "Please wait, we are joining the meeting";
meetingId = ROOM_ID;
initializeMeeting();
});
// Initialize meeting
function initializeMeeting() {
window.VideoSDK.config(TOKEN);
meeting = window.VideoSDK.initMeeting({
meetingId: meetingId,
name: "C.V.Raman",
micEnabled: true,
webcamEnabled: false,
});
meeting.join();
meeting.localParticipant.on("stream-enabled", (stream) => {
if (stream.kind === "audio") {
setAudioTrack(stream, meeting.localParticipant, true);
}
});
meeting.on("meeting-joined", () => {
textDiv.textContent = null;
document.getElementById("grid-screen").style.display = "block";
document.getElementById("meetingIdHeading").textContent = `Meeting Id: ${meetingId}`;
});
meeting.on("meeting-left", () => {
audioContainer.innerHTML = "";
});
meeting.on("participant-joined", (participant) => {
let audioElement = createAudioElement(participant.id);
participant.on("stream-enabled", (stream) => {
if (stream.kind === "audio") {
setAudioTrack(stream, participant, false);
audioContainer.appendChild(audioElement);
}
});
});
meeting.on("participant-left", (participant) => {
let aElement = document.getElementById(`a-${participant.id}`);
if (aElement) aElement.remove();
});
}
// Create a hidden audio element for a participant
function createAudioElement(pId) {
  const audioElement = document.createElement("audio");
  // Boolean attributes such as autoplay and controls are enabled by their mere
  // presence (even with the value "false"), so they are omitted here; playback
  // is started explicitly in setAudioTrack via audioElement.play().
  audioElement.setAttribute("playsinline", "true");
  audioElement.setAttribute("id", `a-${pId}`);
  audioElement.style.display = "none";
  return audioElement;
}
// Set audio track
function setAudioTrack(stream, participant, isLocal) {
if (stream.kind === "audio") {
if (isLocal) {
isMicOn = true;
} else {
const audioElement = document.getElementById(`a-${participant.id}`);
if (audioElement) {
const mediaStream = new MediaStream();
mediaStream.addTrack(stream.track);
audioElement.srcObject = mediaStream;
audioElement.play().catch((err) => console.error("audioElem.play() failed", err));
}
}
}
}
// Implement controls
leaveButton.addEventListener("click", async () => {
meeting?.leave();
document.getElementById("grid-screen").style.display = "none";
document.getElementById("join-screen").style.display = "block";
});
toggleMicButton.addEventListener("click", async () => {
if (isMicOn) meeting?.muteMic();
else meeting?.unmuteMic();
isMicOn = !isMicOn;
});
2. Building the Python Agent
Step 1: Configure the Agent
Create a .env file to store your API keys securely for the Python agent:
# Google API Key for Gemini Live API
GOOGLE_API_KEY=your_google_api_key_here
# VideoSDK Authentication Token
VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token_here
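The agent expects these keys at runtime, and failing fast with a clear message when one is missing saves debugging time. A minimal standard-library sketch (require_env is a hypothetical helper; the variable names match the .env file above):

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

if __name__ == "__main__":
    for key in ("GOOGLE_API_KEY", "VIDEOSDK_AUTH_TOKEN"):
        require_env(key)
    print("All required keys are set")
```

If your process doesn't load .env automatically, a loader such as python-dotenv can populate os.environ before these checks run.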
Step 2: Create the Python Agent
Create the Python agent (agent-js.py) that joins the same meeting room and interacts with users through voice.
from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
import logging

logging.getLogger().setLevel(logging.INFO)

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a high-energy game-show host guiding the caller to guess a secret number from 1 to 100 to win $1,000,000.",
        )

    async def on_enter(self) -> None:
        await self.session.say("Welcome to VideoSDK's AI Agent game show! I'm your host, and we're about to play for $1,000,000. Are you ready to play?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    model = GeminiRealtime(
        model="gemini-2.0-flash-live-001",
        # When GOOGLE_API_KEY is set in .env, DON'T pass the api_key parameter
        # api_key="AIXXXXXXXXXXXXXXXXXXXX",
        config=GeminiLiveConfig(
            voice="Leda",  # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr
            response_modalities=["AUDIO"]
        )
    )
    pipeline = RealTimePipeline(model=model)
    session = AgentSession(agent=agent, pipeline=pipeline)

    def on_transcription(data: dict):
        role = data.get("role")
        text = data.get("text")
        print(f"[TRANSCRIPT][{role}]: {text}")

    pipeline.on("realtime_model_transcription", on_transcription)

    await context.run_until_shutdown(session=session, wait_for_participant=True)

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="YOUR_MEETING_ID",  # Replace with your actual room_id
        name="Gemini Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
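The on_transcription handler above just prints each event; if you want a persistent record of the conversation, you can buffer the events instead. A minimal sketch (the {"role", "text"} payload shape is taken from the handler above; TranscriptLog is a hypothetical helper, not part of the VideoSDK API):

```python
class TranscriptLog:
    """Collect transcription events and render them as readable lines."""

    def __init__(self):
        self.entries = []

    def on_transcription(self, data: dict) -> None:
        # Same payload shape the pipeline handler receives above.
        self.entries.append((data.get("role"), data.get("text")))

    def render(self) -> str:
        return "\n".join(f"[{role}]: {text}" for role, text in self.entries)

log = TranscriptLog()
# Register in place of the print-based handler:
# pipeline.on("realtime_model_transcription", log.on_transcription)
log.on_transcription({"role": "user", "text": "Is it 50?"})
log.on_transcription({"role": "agent", "text": "Higher!"})
print(log.render())
```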
3. Run the Application
Step 1: Start the Frontend
Once you have completed all the steps, serve your frontend files:
# Using Python's built-in server
python3 -m http.server 8000
# Or using Node.js http-server
npx http-server -p 8000
Open http://localhost:8000 in your web browser.
Step 2: Start the Python Agent
Open a new terminal and run the Python agent:
# Install Python dependencies
pip install "videosdk-plugins-google"
pip install videosdk-agents
# Run the Python agent (reads GOOGLE_API_KEY from .env file)
python agent-js.py
Step 3: Connect and Interact
1. Join the meeting from the frontend:
   - Click the "Join Agent Meeting" button in your browser.
   - Allow microphone permissions when prompted.
2. Agent connection:
   - Once you join, the Python agent will detect your participation.
   - You should see "Participant joined" in the terminal.
   - The AI agent will greet you and initiate the game.
3. Start playing:
   - The agent will guide you through a number-guessing game (1-100).
   - Use your microphone to interact with the AI host.
Troubleshooting
Common Issues:
- "Waiting for participant..." but no connection:
  - Ensure both the frontend and the agent are running.
  - Check that the ROOM_ID matches in config.js and agent-js.py.
  - Verify your VideoSDK token is valid.
- Audio not working:
  - Check browser permissions for microphone access.
  - Ensure your Google API key has Gemini Live API access enabled.
- Agent not responding:
  - Verify your GOOGLE_API_KEY is set in the .env file.
  - Check that the Gemini Live API is enabled in your Google Cloud Console.
Next Steps
Clone the repo for a quick implementation.
Got a question? Ask us on Discord.