
AI Agent with JavaScript - Quick Start

VideoSDK empowers you to seamlessly integrate AI agents with real-time voice interaction into your JavaScript application within minutes.

In this quickstart, you'll create an AI agent that joins a meeting room and interacts with users through voice, powered by the Google Gemini Live API.

Prerequisites

Before proceeding, ensure that your development environment meets the following requirements:

  • A VideoSDK developer account (if you don't have one, sign up on the VideoSDK Dashboard)
  • Node.js and Python 3.12+ installed on your machine
  • Google API Key with Gemini Live API access
Important

You need a VideoSDK account to generate a token and a Google API key for the Gemini Live API. Visit the VideoSDK dashboard to generate the token, and Google AI Studio to create the API key.

Project Structure

First, create an empty project folder (mkdir folder_name) in your preferred location for the JavaScript frontend. Your final project structure should look like this:

Project Structure
root
├── index.html
├── config.js
├── index.js
├── agent-js.py
└── .env

You will be working on the following files:

  • index.html: Responsible for creating a basic UI for joining the meeting.
  • config.js: Responsible for storing the token and room ID for the JavaScript frontend.
  • index.js: Responsible for rendering the meeting view and audio functionality.
  • agent-js.py: The Python agent using Google Gemini Live API.
  • .env: Environment variables for the Python agent's API keys.

1. Building the JavaScript Frontend

Step 1: Install VideoSDK

Import VideoSDK using a <script> tag, or install it via npm/yarn (a sketch of the npm route follows the snippet below).

<html>
  <head>
    <!--.....-->
  </head>
  <body>
    <!--.....-->
    <script src="https://sdk.videosdk.live/js-sdk/0.3.6/videosdk.js"></script>
  </body>
</html>
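If you prefer a bundler workflow over the CDN script, you can install the SDK from npm instead; a minimal sketch, assuming the published package name @videosdk.live/js-sdk (yarn add works the same way):

npm install @videosdk.live/js-sdk

With this route you would import the SDK in your module (import { VideoSDK } from "@videosdk.live/js-sdk";) and use VideoSDK directly instead of window.VideoSDK in the code below.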

Step 2: Design the User Interface

Create an index.html file containing a join screen and a grid screen for audio-only interaction. Load videosdk.js before config.js and index.js so that window.VideoSDK, TOKEN, and ROOM_ID are already defined when the meeting logic runs.

index.html
<!DOCTYPE html>
<html>
  <head> </head>
  <body>
    <div id="join-screen">
      <button id="createMeetingBtn">Join Agent Meeting</button>
    </div>
    <div id="textDiv"></div>
    <div id="grid-screen" style="display: none">
      <h3 id="meetingIdHeading"></h3>
      <button id="leaveBtn">Leave</button>
      <button id="toggleMicBtn">Toggle Mic</button>
      <div id="audioContainer"></div>
    </div>
    <script src="https://sdk.videosdk.live/js-sdk/0.3.6/videosdk.js"></script>
    <script src="config.js"></script>
    <script src="index.js"></script>
  </body>
</html>

Step 3: Configure the Frontend

Create a meeting room using the VideoSDK API:

curl -X POST https://api.videosdk.live/v2/rooms \
-H "Authorization: YOUR_JWT_TOKEN_HERE" \
-H "Content-Type: application/json"

Copy the roomId from the response and configure it in config.js for the JavaScript frontend:

config.js
TOKEN = "your_videosdk_auth_token_here";
ROOM_ID = "YOUR_MEETING_ID"; // Static room ID shared between frontend and agent

Step 4: Implement Meeting Logic

In index.js, retrieve DOM elements, declare variables, and add the core meeting functionalities.

index.js
// Get elements from the DOM
const leaveButton = document.getElementById("leaveBtn");
const toggleMicButton = document.getElementById("toggleMicBtn");
const createButton = document.getElementById("createMeetingBtn");
const audioContainer = document.getElementById("audioContainer");
const textDiv = document.getElementById("textDiv");

// Declare variables
let meeting = null;
let meetingId = "";
let isMicOn = false;

// "Join Agent Meeting" button: hide the join screen and join the room
createButton.addEventListener("click", async () => {
  document.getElementById("join-screen").style.display = "none";
  textDiv.textContent = "Please wait, we are joining the meeting";
  meetingId = ROOM_ID;
  initializeMeeting();
});

// Initialize meeting
function initializeMeeting() {
  window.VideoSDK.config(TOKEN);

  meeting = window.VideoSDK.initMeeting({
    meetingId: meetingId,
    name: "C.V.Raman",
    micEnabled: true,
    webcamEnabled: false,
  });

  meeting.join();

  meeting.localParticipant.on("stream-enabled", (stream) => {
    if (stream.kind === "audio") {
      setAudioTrack(stream, meeting.localParticipant, true);
    }
  });

  meeting.on("meeting-joined", () => {
    textDiv.textContent = null;
    document.getElementById("grid-screen").style.display = "block";
    document.getElementById("meetingIdHeading").textContent = `Meeting Id: ${meetingId}`;
  });

  meeting.on("meeting-left", () => {
    audioContainer.innerHTML = "";
  });

  meeting.on("participant-joined", (participant) => {
    let audioElement = createAudioElement(participant.id);
    participant.on("stream-enabled", (stream) => {
      if (stream.kind === "audio") {
        setAudioTrack(stream, participant, false);
        audioContainer.appendChild(audioElement);
      }
    });
  });

  meeting.on("participant-left", (participant) => {
    let aElement = document.getElementById(`a-${participant.id}`);
    if (aElement) aElement.remove();
  });
}

// Create a hidden audio element for a remote participant
function createAudioElement(pId) {
  const audioElement = document.createElement("audio");
  // autoplay and controls are boolean attributes: setting them to the
  // string "false" would still turn them on, so assign the properties directly
  audioElement.autoplay = false;
  audioElement.playsInline = true;
  audioElement.controls = false;
  audioElement.id = `a-${pId}`;
  audioElement.style.display = "none";
  return audioElement;
}

// Set audio track
function setAudioTrack(stream, participant, isLocal) {
  if (stream.kind === "audio") {
    if (isLocal) {
      isMicOn = true;
    } else {
      const audioElement = document.getElementById(`a-${participant.id}`);
      if (audioElement) {
        const mediaStream = new MediaStream();
        mediaStream.addTrack(stream.track);
        audioElement.srcObject = mediaStream;
        audioElement.play().catch((err) => console.error("audioElem.play() failed", err));
      }
    }
  }
}

// Implement controls
leaveButton.addEventListener("click", async () => {
  meeting?.leave();
  document.getElementById("grid-screen").style.display = "none";
  document.getElementById("join-screen").style.display = "block";
});

toggleMicButton.addEventListener("click", async () => {
  if (isMicOn) meeting?.muteMic();
  else meeting?.unmuteMic();
  isMicOn = !isMicOn;
});
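The agent joins as an ordinary participant, so the handlers above already play its audio. If you also want the UI to react when the agent connects, you can check the participant's display name inside initializeMeeting(); a small optional sketch, assuming displayName carries the name the agent registers with ("Gemini Agent" in the Python code below):

// Show a status message when the agent participant joins
meeting.on("participant-joined", (participant) => {
  if (participant.displayName === "Gemini Agent") {
    textDiv.textContent = "AI agent connected";
  }
});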

2. Building the Python Agent

Step 1: Configure the Agent

Create a .env file to store your API keys securely for the Python agent:

.env
# Google API Key for Gemini Live API
GOOGLE_API_KEY=your_google_api_key_here

# VideoSDK Authentication Token
VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token_here

Step 2: Create the Python Agent

Create the Python agent (agent-js.py) that will join the same meeting room and interact with users through voice.

agent-js.py
from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
import logging

logging.getLogger().setLevel(logging.INFO)

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a high-energy game-show host guiding the caller to guess a secret number from 1 to 100 to win $1,000,000.",
        )

    async def on_enter(self) -> None:
        await self.session.say("Welcome to VideoSDK's AI Agent game show! I'm your host, and we're about to play for $1,000,000. Are you ready to play?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    model = GeminiRealtime(
        model="gemini-2.0-flash-live-001",
        # When GOOGLE_API_KEY is set in .env, DON'T pass the api_key parameter
        # api_key="AIXXXXXXXXXXXXXXXXXXXX",
        config=GeminiLiveConfig(
            voice="Leda",  # Available voices: Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, Zephyr
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)
    session = AgentSession(agent=agent, pipeline=pipeline)

    # Log both user and agent transcripts as they arrive
    def on_transcription(data: dict):
        role = data.get("role")
        text = data.get("text")
        print(f"[TRANSCRIPT][{role}]: {text}")

    pipeline.on("realtime_model_transcription", on_transcription)

    await context.run_until_shutdown(session=session, wait_for_participant=True)

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="YOUR_MEETING_ID",  # Replace with your actual room_id; must match ROOM_ID in config.js
        name="Gemini Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

3. Run the Application

Step 1: Start the Frontend

Once you have completed all the steps, serve your frontend files:

# Using Python's built-in server
python3 -m http.server 8000

# Or using Node.js http-server
npx http-server -p 8000

Open http://localhost:8000 in your web browser.
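Serving the files over HTTP matters here: browsers generally block microphone access for pages opened directly from the filesystem (file://), while http://localhost is treated as a secure context.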

Step 2: Start the Python Agent

Open a new terminal and run the Python agent:

# Install Python dependencies
pip install "videosdk-plugins-google"
pip install videosdk-agents

# Run the Python agent (reads GOOGLE_API_KEY from .env file)
python agent-js.py
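Optionally, keep these dependencies isolated in a virtual environment (standard Python tooling, nothing VideoSDK-specific); create and activate one before running the pip install commands above:

python3 -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate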

Step 3: Connect and Interact

  1. Join the meeting from the frontend:

    • Click the "Join Agent Meeting" button in your browser.
    • Allow microphone permissions when prompted.
  2. Agent connection:

    • Once you join, the Python agent will detect your participation.
    • You should see "Participant joined" in the terminal.
    • The AI agent will greet you and initiate the game.
  3. Start playing:

    • The agent will guide you through a number guessing game (1-100).
    • Use your microphone to interact with the AI host.

Troubleshooting

Common Issues:

  1. "Waiting for participant..." but no connection:

    • Ensure both frontend and the agent are running.
    • Check that the ROOM_ID matches in config.js and agent-js.py.
    • Verify that your VideoSDK token is valid (you can check the room with the curl command after this list).
  2. Audio not working:

    • Check browser permissions for microphone access.
    • Ensure your Google API key has Gemini Live API access enabled.
  3. Agent not responding:

    • Verify your GOOGLE_API_KEY is set in the .env file.
    • Check that the Gemini Live API is enabled in your Google Cloud Console.
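To rule out a token or room mismatch, you can validate the room directly against the REST API; a quick check, assuming the v2 validate endpoint of the same API used earlier (a 200 response means the room exists and the token is accepted):

curl -X GET https://api.videosdk.live/v2/rooms/validate/YOUR_MEETING_ID \
-H "Authorization: YOUR_JWT_TOKEN_HERE"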

Next Steps

Clone the example repo for a quick implementation.

Got a question? Ask us on Discord.