Published 2026-03-13 by TechNet New England
Voice AI is everywhere, but most implementations assume you're building for phone systems with SIP trunks, RTP streams, and telephony infrastructure. What if you just want a voice assistant that works in a web browser—like talking to someone on a video call?
This guide covers how to build a browser-based voice agent that feels like a natural phone conversation, without any telephony complexity. Audio capture and playback run client-side using Web Audio APIs, while transcription, LLM, and TTS processing happen server-side.
Why Browser-Based Instead of SIP/RTP?
Traditional voice AI systems require:
- SIP trunks and VoIP infrastructure
- RTP stream handling and codecs
- Telephony providers (Twilio, Vonage, etc.)
- Complex NAT traversal and firewall configuration
- Per-minute telephony costs
Browser-based voice agents eliminate all of that:
- No telephony infrastructure — Uses the browser's built-in microphone and speaker
- No per-minute costs — Only pay for AI API calls (transcription, LLM, TTS)
- Works anywhere — Desktop, mobile, embedded in any web app
- Simpler architecture — HTTP/WebSocket instead of RTP streams
- Better for internal tools — Perfect for dashboards, admin panels, support interfaces
Architecture Overview
The system works like this:
User speaks → Microphone captures audio
→ Silence detection (RMS energy analysis)
→ Audio sent to transcription API
→ Text processed by LLM
→ Response sent to TTS API
→ Audio played back to user
The entire conversation feels natural because of careful state management and "barge-in" support (user can interrupt the AI while it's speaking).
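As a rough sketch of this flow, one conversational turn can be modeled as a chain of async steps. The `transcribe`, `generateReply`, `synthesize`, and `play` functions are hypothetical placeholders for whichever transcription, LLM, TTS, and playback code you use; they are not part of any specific SDK.

```javascript
// One conversational turn: audio in -> transcript -> reply text -> audio out.
// All four service functions are injected placeholders (assumptions), so the
// flow can be exercised without calling real APIs.
async function runTurn(recordedAudio, services) {
  const { transcribe, generateReply, synthesize, play } = services;

  const userText = await transcribe(recordedAudio);
  if (!userText.trim()) {
    // Nothing intelligible was said; skip the LLM and TTS calls.
    return { userText: "", replyText: "" };
  }

  const replyText = await generateReply(userText);
  const replyAudio = await synthesize(replyText);
  await play(replyAudio);

  return { userText, replyText };
}
```

This version runs the steps strictly in sequence for clarity. A production version would likely stream between steps so that playback can begin before the full reply is generated.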
The Five States
A well-designed voice agent cycles through these states:
| State | What's Happening | UI Indicator |
|---|---|---|
| IDLE | Waiting for user to start speaking | Microphone ready |
| LISTENING | Recording user speech, detecting silence | Waveform animation |
| PROCESSING | Sending audio to transcription | Processing indicator |
| THINKING | LLM generating response | Thinking animation |
| SPEAKING | Playing TTS audio | Speaker animation |
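One way to keep these states honest is a small state machine that rejects unexpected transitions. The transition table below is an assumption made for illustration (the table above lists the states but not which transitions are legal), and `createAgentState` is a hypothetical helper name.

```javascript
// Allowed transitions between the five states. This table is an assumption
// for illustration; adjust it to match your own flow.
const TRANSITIONS = {
  IDLE: ["LISTENING"],
  LISTENING: ["PROCESSING", "IDLE"],
  PROCESSING: ["THINKING", "IDLE"],
  THINKING: ["SPEAKING", "IDLE"],
  SPEAKING: ["IDLE", "LISTENING"], // LISTENING here is the barge-in path
};

function createAgentState(onChange = () => {}) {
  let current = "IDLE";
  return {
    get current() {
      return current;
    },
    // Move to the next state, or throw if the move is not in the table.
    transition(next) {
      if (!TRANSITIONS[current].includes(next)) {
        throw new Error(`Invalid transition: ${current} -> ${next}`);
      }
      const previous = current;
      current = next;
      onChange(next, previous); // e.g. update the UI indicator
    },
  };
}
```

The `onChange` callback is a natural place to swap the UI indicator for each state.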
Key Technical Components
1. Silence Detection (When to Stop Recording)
You can't record forever; you need to detect when the user stops talking. A simple and widely used method is RMS energy analysis:
// Calculate RMS (Root Mean Square) energy of audio buffer
function calculateRMS(audioData) {
let sum = 0;
for (let i = 0; i < audioData.length; i++) {
sum += audioData[i] * audioData[i];
}
return Math.sqrt(sum / audioData.length);
}
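// If RMS stays below SILENCE_THRESHOLD for SILENCE_DURATION_MS, stop recording.
// 1500 ms tolerates mid-sentence pauses; the latency budget below assumes
// roughly 500 ms, so lower this value if you need to meet that target.
const SILENCE_THRESHOLD = 0.01;
const SILENCE_DURATION_MS = 1500;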
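A fixed energy threshold is itself a basic form of voice activity detection (VAD). It is simpler and more predictable than a model-based VAD, which makes it a good starting point for conversational AI, though it can misfire in noisy environments.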
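To show how the threshold and duration fit together, here is a minimal sketch of a detector that is fed one audio frame at a time. The `createSilenceDetector` name and its `process(frame, nowMs)` interface are assumptions for illustration, not an existing API.

```javascript
// Root mean square energy of one frame of samples in the range [-1, 1].
function calculateRMS(audioData) {
  let sum = 0;
  for (let i = 0; i < audioData.length; i++) {
    sum += audioData[i] * audioData[i];
  }
  return Math.sqrt(sum / audioData.length);
}

// Tracks how long the input has been continuously quiet.
function createSilenceDetector({ threshold = 0.01, durationMs = 1500 } = {}) {
  let silenceStartedAt = null;
  return {
    // Feed one frame plus its timestamp in milliseconds. Returns true once
    // the signal has stayed below the threshold for at least durationMs.
    process(frame, nowMs) {
      if (calculateRMS(frame) >= threshold) {
        silenceStartedAt = null; // speech resets the timer
        return false;
      }
      if (silenceStartedAt === null) silenceStartedAt = nowMs;
      return nowMs - silenceStartedAt >= durationMs;
    },
  };
}
```

In the browser, frames could come from an `AnalyserNode` by calling `getFloatTimeDomainData` on a timer or animation frame, with `performance.now()` as the timestamp.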
2. Speech-to-Text Options
For transcription, two services work particularly well for real-time voice:
- Groq Whisper — Extremely fast (sub-second latency), good accuracy
- Deepgram Nova-2 — Streaming support, excellent for real-time
In practice, both tend to return results faster than OpenAI's hosted Whisper API, which matters for real-time use cases.
3. Voice-Optimized LLM Prompts
Voice responses need to be different from text responses:
You are a voice assistant. Keep responses concise and conversational.
Rules:
- Respond in 1-3 sentences maximum
- Never use bullet points, lists, or formatting
- Don't say "I'd be happy to help" or similar filler
- Speak naturally, as if on a phone call
- Ask clarifying questions instead of long explanations
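Models do not always follow these rules, so it can help to clean the reply before sending it to TTS. The sketch below is an optional safeguard that is not part of the prompt above: a hypothetical `sanitizeForSpeech` helper that strips common markdown markers and caps the number of sentences. The regular expressions are deliberately simple and will not cover every case.

```javascript
// Strip list markers and markdown symbols, then keep at most `maxSentences`.
function sanitizeForSpeech(text, maxSentences = 3) {
  const plain = text
    .replace(/^\s*(?:[-*•]|\d+\.)\s+/gm, "") // leading bullet or numbered-list markers
    .replace(/[*_`#]/g, "") // emphasis, inline code, and heading symbols
    .replace(/\s+/g, " ") // collapse newlines and repeated spaces
    .trim();

  // Split after sentence-ending punctuation that is followed by whitespace,
  // so decimals such as "3.5" stay intact.
  const sentences = plain.split(/(?<=[.!?])\s+/);
  return sentences.slice(0, maxSentences).join(" ");
}
```

Truncating by sentence count is blunt; if replies are routinely cut off, tightening the prompt is usually the better fix.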
4. Text-to-Speech
Deepgram Aura is currently the best option for low-latency TTS:
- Sub-200ms time to first byte
- Natural-sounding voices
- Streaming audio support
ElevenLabs has better voice quality but higher latency. For conversational AI, speed matters more than perfect audio quality.
5. Barge-In (Interruption Handling)
Users expect to interrupt the AI mid-sentence, just like a real conversation. Implementation:
When the user starts speaking while the AI is talking:
1. Stop TTS playback immediately
2. Clear any queued audio
3. Transition to LISTENING state
4. Begin recording new user input
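The four steps above can be sketched as a single handler. The `player` and `recorder` objects are hypothetical wrappers around your own playback and recording code (they are not browser APIs), and `getState`/`setState` stand in for whatever state management you use.

```javascript
// Builds a handler to call whenever speech is detected on the microphone.
// `player` must expose stop() and clearQueue(); `recorder` must expose start().
// Both are assumed wrappers, injected so the logic can be tested in isolation.
function createBargeInHandler({ player, recorder, getState, setState }) {
  return function onUserSpeechDetected() {
    if (getState() !== "SPEAKING") return false; // nothing to interrupt

    player.stop(); // 1. stop TTS playback immediately
    player.clearQueue(); // 2. drop audio chunks that have not played yet
    setState("LISTENING"); // 3. transition to LISTENING
    recorder.start(); // 4. begin recording the new user input
    return true;
  };
}
```

The handler returns `true` only when it actually interrupted playback, which makes it easy to log how often users barge in.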
Conversation Memory
For multi-turn conversations, you need to maintain context:
// Store conversation history
const messages = [
{ role: "system", content: systemPrompt },
{ role: "user", content: "What's the weather like?" },
{ role: "assistant", content: "I don't have access to weather data..." },
{ role: "user", content: "Okay, what can you help with?" },
// ... continue conversation
];
// Trim history if it gets too long (keep last N turns)
const MAX_HISTORY = 20;
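A minimal sketch of the trimming step, assuming `MAX_HISTORY` counts individual messages rather than user/assistant pairs, and that system messages should always be kept. The `trimHistory` name is hypothetical.

```javascript
const MAX_HISTORY = 20;

// Returns a new array with every system message plus the most recent
// `maxMessages` user/assistant messages. The input array is not modified.
function trimHistory(messages, maxMessages = MAX_HISTORY) {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  const recent = maxMessages > 0 ? rest.slice(-maxMessages) : [];
  return [...system, ...recent];
}
```

Call it before each LLM request, for example `const context = trimHistory(messages);`. If your LLM API requires the history to begin with a user message, drop a leading assistant message after trimming.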
Recommended Tech Stack
| Component | Recommended | Alternative |
|---|---|---|
| Frontend | React + Web Audio API | Vanilla JS, Vue, Svelte |
| Transcription | Groq Whisper | Deepgram Nova-2 |
| LLM | Claude 3.5 Sonnet | GPT-4o, Llama 3 |
| TTS | Deepgram Aura | ElevenLabs, OpenAI TTS |
| Backend | Node.js/Express | Python/FastAPI |
Latency Budget
For a conversation to feel natural, the total delay from the moment the user stops speaking to the first audio of the reply should stay under 2 seconds:
| Step | Target |
|---|---|
| Silence detection | ~500ms after user stops |
| Transcription | <500ms |
| LLM response | <800ms (streaming) |
| TTS generation | <200ms to first byte |
| Total | <2 seconds |
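To check real sessions against these targets, you could time each step and compare the results with the budget. The sketch below uses the figures from the table; the stage names and the `checkLatency` helper are assumptions for illustration.

```javascript
// Per-stage targets in milliseconds, taken from the table above.
const LATENCY_BUDGET_MS = {
  silenceDetection: 500,
  transcription: 500,
  llm: 800,
  ttsFirstByte: 200,
};

// Compares measured timings with the budget. Missing stages count as 0 ms.
function checkLatency(measuredMs, budgetMs = LATENCY_BUDGET_MS) {
  const stages = Object.keys(budgetMs);
  const total = stages.reduce((sum, s) => sum + (measuredMs[s] ?? 0), 0);
  const overBudget = stages.filter((s) => (measuredMs[s] ?? 0) > budgetMs[s]);
  return { total, overBudget };
}
```

Note that the per-stage targets add up to 2000 ms, so every stage needs to come in at or under its target for the total to stay within 2 seconds.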
When to Use This vs. Traditional Telephony
Use browser-based voice when:
- Building internal tools, dashboards, or admin interfaces
- Creating voice features for existing web apps
- Prototyping voice AI quickly
- You don't need to receive inbound phone calls
- Users are already in a browser context
Use SIP/RTP telephony when:
- You need a phone number customers can call
- Building IVR or call center automation
- Integrating with existing phone systems
- Users won't have browser access
Common Pitfalls
- Microphone permissions — Always handle the case where users deny mic access
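- Echo cancellation — Browsers can apply this to the microphone stream, but test with speakers (not headphones), since the agent's own voice can otherwise trigger barge-in
- Mobile Safari — Has quirks with the Web Audio API (for example, audio generally cannot start until the user has interacted with the page); test thoroughly
- Rate limiting — Throttle requests on the client, and enforce limits on the server as well, since client-side throttling alone can be bypassed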
- Long silences — Add a timeout to prevent infinite recording
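For the microphone-permission case, a minimal sketch might look like the following. `getUserMedia` and the error names are standard browser behavior; the `describeMicError` and `requestMicrophone` helpers and the message wording are assumptions. The `mediaDevices` object is passed in (in a browser, `navigator.mediaDevices`) so the logic can be tested outside a browser.

```javascript
// Maps the error names that getUserMedia rejects with to user-facing text.
function describeMicError(error) {
  switch (error && error.name) {
    case "NotAllowedError":
      return "Microphone access was denied. Allow it in your browser settings to use voice.";
    case "NotFoundError":
      return "No microphone was found on this device.";
    case "NotReadableError":
      return "The microphone is unavailable. Another application may be using it.";
    default:
      return "Could not start the microphone.";
  }
}

// Requests the microphone with echo cancellation enabled and never throws:
// the caller gets either a stream or a readable error message.
async function requestMicrophone(mediaDevices) {
  try {
    const stream = await mediaDevices.getUserMedia({
      audio: { echoCancellation: true },
    });
    return { stream, error: null };
  } catch (err) {
    return { stream: null, error: describeMicError(err) };
  }
}
```

In a browser this would be called as `requestMicrophone(navigator.mediaDevices)`, ideally in response to a button click so the permission prompt has clear context.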
Getting Started
The simplest starting point:
- Use the browser's MediaRecorder API to capture audio
- Send audio chunks to Groq Whisper for transcription
- Pass transcription to your preferred LLM
- Send LLM response to Deepgram Aura for TTS
- Play the audio using the Web Audio API
From there, add silence detection, state management, and conversation memory to create a polished experience.
Need Help Building Voice AI?
Voice interfaces are becoming essential for modern applications—from customer support to internal tools. If you're considering adding voice capabilities to your business applications, we can help architect and implement a solution that fits your needs.