
AI Agent with JavaScript - Quick Start

VideoSDK empowers you to seamlessly integrate AI agents with real-time voice interaction into your JavaScript application within minutes.

In this quickstart, you'll create an AI agent that joins a meeting room and interacts with users through voice, powered by the Google Gemini Live API.

Prerequisites

Before proceeding, ensure that your development environment meets the following requirements:

  • A VideoSDK developer account (if you don't have one, sign up on the VideoSDK Dashboard)
  • Node.js and Python 3.12+ installed on your machine
  • Google API Key with Gemini Live API access
Important

You need a VideoSDK account to generate a token and a Google API key for the Gemini Live API. Visit the VideoSDK dashboard to generate the token, and Google AI Studio to create the API key.

Project Structure

First, create an empty project folder (mkdir folder_name) in your preferred location for the JavaScript frontend. Your final project structure should look like this:

Project Structure
root
├── index.html
├── config.js
├── index.js
├── agent-js.py
└── .env

You will be working on the following files:

  • index.html: Responsible for creating a basic UI for joining the meeting.
  • config.js: Responsible for storing the token and room ID for the JavaScript frontend.
  • index.js: Responsible for rendering the meeting view and audio functionality.
  • agent-js.py: The Python agent using Google Gemini Live API.
  • .env: Environment variables for the Python agent's API keys.

1. Building the JavaScript Frontend

Step 1: Install VideoSDK

Import VideoSDK using a <script> tag, or install it via npm/yarn (a sketch of the npm route follows the snippet below).

<html>
  <head>
    <!--.....-->
  </head>
  <body>
    <!--.....-->
    <script src="https://sdk.videosdk.live/js-sdk/0.3.6/videosdk.js"></script>
  </body>
</html>
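If you prefer a bundler workflow over the CDN script, you can install the SDK from npm instead; a minimal sketch, assuming the published package name @videosdk.live/js-sdk (yarn add works the same way):

npm install @videosdk.live/js-sdk

With this route you would import the SDK in your module (import { VideoSDK } from "@videosdk.live/js-sdk";) and use VideoSDK directly instead of window.VideoSDK in the code below.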

Step 2: Design the User Interface

Create an index.html file containing a join screen and a grid screen for audio-only interaction. Load videosdk.js before config.js and index.js so that window.VideoSDK, TOKEN, and ROOM_ID are already defined when the meeting logic runs.

index.html
<!DOCTYPE html>
<html>
  <head> </head>
  <body>
    <div id="join-screen">
      <button id="createMeetingBtn">Join Agent Meeting</button>
    </div>
    <div id="textDiv"></div>
    <div id="grid-screen" style="display: none">
      <h3 id="meetingIdHeading"></h3>
      <button id="leaveBtn">Leave</button>
      <button id="toggleMicBtn">Toggle Mic</button>
      <div id="audioContainer"></div>
    </div>
    <script src="https://sdk.videosdk.live/js-sdk/0.3.6/videosdk.js"></script>
    <script src="config.js"></script>
    <script src="index.js"></script>
  </body>
</html>

Step 3: Configure the Frontend

Create a meeting room using the VideoSDK API:

curl -X POST https://api.videosdk.live/v2/rooms \
-H "Authorization: YOUR_JWT_TOKEN_HERE" \
-H "Content-Type: application/json"

Copy the roomId from the response and configure it in config.js for the JavaScript frontend:

config.js
TOKEN = "your_videosdk_auth_token_here";
ROOM_ID = "YOUR_MEETING_ID"; // Static room ID shared between frontend and agent

Step 4: Implement Meeting Logic

In index.js, retrieve DOM elements, declare variables, and add the core meeting functionalities.

index.js
// Get elements from the DOM
const leaveButton = document.getElementById("leaveBtn");
const toggleMicButton = document.getElementById("toggleMicBtn");
const createButton = document.getElementById("createMeetingBtn");
const audioContainer = document.getElementById("audioContainer");
const textDiv = document.getElementById("textDiv");

// Declare variables
let meeting = null;
let meetingId = "";
let isMicOn = false;

// "Join Agent Meeting" button: hide the join screen and join the room
createButton.addEventListener("click", async () => {
  document.getElementById("join-screen").style.display = "none";
  textDiv.textContent = "Please wait, we are joining the meeting";
  meetingId = ROOM_ID;
  initializeMeeting();
});

// Initialize meeting
function initializeMeeting() {
  window.VideoSDK.config(TOKEN);

  meeting = window.VideoSDK.initMeeting({
    meetingId: meetingId,
    name: "C.V.Raman",
    micEnabled: true,
    webcamEnabled: false,
  });

  meeting.join();

  meeting.localParticipant.on("stream-enabled", (stream) => {
    if (stream.kind === "audio") {
      setAudioTrack(stream, meeting.localParticipant, true);
    }
  });

  meeting.on("meeting-joined", () => {
    textDiv.textContent = null;
    document.getElementById("grid-screen").style.display = "block";
    document.getElementById("meetingIdHeading").textContent = `Meeting Id: ${meetingId}`;
  });

  meeting.on("meeting-left", () => {
    audioContainer.innerHTML = "";
  });

  meeting.on("participant-joined", (participant) => {
    let audioElement = createAudioElement(participant.id);
    participant.on("stream-enabled", (stream) => {
      if (stream.kind === "audio") {
        setAudioTrack(stream, participant, false);
        audioContainer.appendChild(audioElement);
      }
    });
  });

  meeting.on("participant-left", (participant) => {
    let aElement = document.getElementById(`a-${participant.id}`);
    if (aElement) aElement.remove();
  });
}

// Create a hidden audio element for a remote participant
function createAudioElement(pId) {
  const audioElement = document.createElement("audio");
  // autoplay and controls are boolean attributes: setting them to the
  // string "false" would still turn them on, so assign the properties directly
  audioElement.autoplay = false;
  audioElement.playsInline = true;
  audioElement.controls = false;
  audioElement.id = `a-${pId}`;
  audioElement.style.display = "none";
  return audioElement;
}

// Set audio track
function setAudioTrack(stream, participant, isLocal) {
  if (stream.kind === "audio") {
    if (isLocal) {
      isMicOn = true;
    } else {
      const audioElement = document.getElementById(`a-${participant.id}`);
      if (audioElement) {
        const mediaStream = new MediaStream();
        mediaStream.addTrack(stream.track);
        audioElement.srcObject = mediaStream;
        audioElement.play().catch((err) => console.error("audioElem.play() failed", err));
      }
    }
  }
}

// Implement controls
leaveButton.addEventListener("click", async () => {
  meeting?.leave();
  document.getElementById("grid-screen").style.display = "none";
  document.getElementById("join-screen").style.display = "block";
});

toggleMicButton.addEventListener("click", async () => {
  if (isMicOn) meeting?.muteMic();
  else meeting?.unmuteMic();
  isMicOn = !isMicOn;
});
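The agent joins as an ordinary participant, so the handlers above already play its audio. If you also want the UI to react when the agent connects, you can check the participant's display name inside initializeMeeting(); a small optional sketch, assuming displayName carries the name the agent registers with ("Gemini Agent" in the Python code below):

// Show a status message when the agent participant joins
meeting.on("participant-joined", (participant) => {
  if (participant.displayName === "Gemini Agent") {
    textDiv.textContent = "AI agent connected";
  }
});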

2. Building the Python Agent

Step 1: Configure the Agent

Create a .env file to store your API keys securely for the Python agent:

.env
# Google API Key for Gemini Live API
GOOGLE_API_KEY=your_google_api_key_here

# VideoSDK Authentication Token
VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token_here

Step 2: Create the Python Agent

Create the Python agent (agent-js.py) that will join the same meeting room and interact with users through voice.

agent-js.py
from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
import logging

logging.getLogger().setLevel(logging.INFO)

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a high-energy game-show host guiding the caller to guess a secret number from 1 to 100 to win $1,000,000.",
        )

    async def on_enter(self) -> None:
        await self.session.say("Welcome to VideoSDK's AI Agent game show! I'm your host, and we're about to play for $1,000,000. Are you ready to play?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    model = GeminiRealtime(
        model="gemini-2.0-flash-live-001",
        # When GOOGLE_API_KEY is set in .env, DON'T pass the api_key parameter
        # api_key="AIXXXXXXXXXXXXXXXXXXXX",
        config=GeminiLiveConfig(
            voice="Leda",  # Available voices: Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, Zephyr
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)
    session = AgentSession(agent=agent, pipeline=pipeline)

    # Log both user and agent transcripts as they arrive
    def on_transcription(data: dict):
        role = data.get("role")
        text = data.get("text")
        print(f"[TRANSCRIPT][{role}]: {text}")

    pipeline.on("realtime_model_transcription", on_transcription)

    await context.run_until_shutdown(session=session, wait_for_participant=True)

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="YOUR_MEETING_ID",  # Replace with your actual room_id; must match ROOM_ID in config.js
        name="Gemini Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

3. Run the Application

Step 1: Start the Frontend

Once you have completed all the steps, serve your frontend files:

# Using Python's built-in server
python3 -m http.server 8000

# Or using Node.js http-server
npx http-server -p 8000

Open http://localhost:8000 in your web browser.
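Serving the files over HTTP matters here: browsers generally block microphone access for pages opened directly from the filesystem (file://), while http://localhost is treated as a secure context.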

Step 2: Start the Python Agent

Open a new terminal and run the Python agent:

# Install Python dependencies
pip install "videosdk-plugins-google"
pip install videosdk-agents

# Run the Python agent (reads GOOGLE_API_KEY from .env file)
python agent-js.py
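Optionally, keep these dependencies isolated in a virtual environment (standard Python tooling, nothing VideoSDK-specific); create and activate one before running the pip install commands above:

python3 -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate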

Step 3: Connect and Interact

  1. Join the meeting from the frontend:

    • Click the "Join Agent Meeting" button in your browser.
    • Allow microphone permissions when prompted.
  2. Agent connection:

    • Once you join, the Python agent will detect your participation.
    • You should see "Participant joined" in the terminal.
    • The AI agent will greet you and initiate the game.
  3. Start playing:

    • The agent will guide you through a number guessing game (1-100).
    • Use your microphone to interact with the AI host.

Troubleshooting

Common Issues:

  1. "Waiting for participant..." but no connection:

    • Ensure both frontend and the agent are running.
    • Check that the ROOM_ID matches in config.js and agent-js.py.
    • Verify that your VideoSDK token is valid (you can check the room with the curl command after this list).
  2. Audio not working:

    • Check browser permissions for microphone access.
    • Ensure your Google API key has Gemini Live API access enabled.
  3. Agent not responding:

    • Verify your GOOGLE_API_KEY is set in the .env file.
    • Check that the Gemini Live API is enabled in your Google Cloud Console.
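To rule out a token or room mismatch, you can validate the room directly against the REST API; a quick check, assuming the v2 validate endpoint of the same API used earlier (a 200 response means the room exists and the token is accepted):

curl -X GET https://api.videosdk.live/v2/rooms/validate/YOUR_MEETING_ID \
-H "Authorization: YOUR_JWT_TOKEN_HERE"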

Next Steps

Clone the example repo for a quick implementation.

Got a question? Ask us on Discord.