Voice AI

The Next Evolution in Voice Assistants: AI-Powered Turn Detection

Learn how turn detection takes voice conversations beyond simple speech recognition. Discover why understanding conversational context matters more than ever.

Elliot Chen · November 5, 2025

The Problem with Traditional Voice Assistants

You're asking your voice assistant a complex question:

"Tell me about the best restaurants in Portland. Oh, and can you also show me options that have outdoor seating?"

A traditional voice assistant with basic Voice Activity Detection (VAD) faces a dilemma:

  • Does the pause after "Portland" mean you're done talking?
  • Is "Oh, and..." a continuation, or a new request?
  • Should it respond immediately or keep listening?

Simple timing-based solutions fail here. You need something smarter: turn detection.

What is Turn Detection?

Turn detection is the contextual evolution beyond Voice Activity Detection. While VAD answers "Is this speech or noise?", turn detection answers the deeper question: "Has the speaker finished their complete thought?"

The Three States of Conversation

Turn detection classifies each transcription segment into one of three states:

State       Meaning                               Assistant Action
Finished    Speaker has completed their thought   Process and respond
Unfinished  Speaker is mid-sentence or thinking   Keep listening
Wait        Ambiguous; waiting for clarity        Hold briefly, then time out

This isn't based on silence duration or audio amplitude. Instead, it uses AI reasoning to understand conversational context.
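
As a rough Python sketch of that dispatch logic (the state names mirror the table above; handle_segment is a hypothetical helper, not a TEN Framework API):

from enum import Enum

class TurnState(Enum):
    FINISHED = "finished"      # complete thought: process and respond
    UNFINISHED = "unfinished"  # mid-sentence or thinking: keep listening
    WAIT = "wait"              # ambiguous: hold briefly, then time out

def handle_segment(state: TurnState, buffer: list, segment: str):
    """Hypothetical dispatcher: returns the full utterance when a turn ends."""
    buffer.append(segment)
    if state is TurnState.FINISHED:
        utterance = " ".join(buffer)
        buffer.clear()
        return utterance   # hand off to the response LLM
    return None            # UNFINISHED / WAIT: keep collecting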

How It Works: The Architecture

Turn detection in the TEN Framework uses a fine-tuned large language model (LLM) deployed on GPU infrastructure. Here's the flow:

User speaks continuously

Real-time STT (Deepgram) streams transcription chunks

Each chunk → Turn Detection Model (Fine-tuned LLM)

Model analyzes: "Is this thought complete?"
    ├─ finished → Send to LLM for response
    ├─ unfinished → Continue collecting audio
    └─ wait → Hold with timeout

Once finished: LLM generates response → TTS → User hears it
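
Here's a minimal asyncio sketch of that loop. All four callables are stand-ins for the real stages (stt_chunks for Deepgram streaming, classify_turn for the turn model, generate_reply for the response LLM, speak for TTS); none of them are actual TEN APIs:

import asyncio

WAIT_TIMEOUT_S = 1.5  # assumed hold time for ambiguous "wait" classifications

async def conversation_loop(stt_chunks, classify_turn, generate_reply, speak):
    """Sketch: collect transcription chunks until the turn model says the turn ended."""
    buffer = []
    async for chunk in stt_chunks():                    # streaming STT transcription
        buffer.append(chunk)
        state = await classify_turn(" ".join(buffer))   # fine-tuned LLM on GPU
        if state == "unfinished":
            continue                                    # keep collecting audio
        if state == "wait":
            await asyncio.sleep(WAIT_TIMEOUT_S)         # hold, then treat as finished
        reply = await generate_reply(" ".join(buffer))  # response LLM
        await speak(reply)                              # TTS -> user hears it
        buffer.clear()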

Why Use an LLM for Turn Detection?

You might wonder: Can't we just use heuristics or simple machine learning?

The answer is no—not reliably. Consider these examples:

Example 1: Natural Pauses

"I'm thinking about... taking a vacation next month."

A simple silence detector would interrupt at the pause. Turn detection recognizes that the trailing "thinking about..." leaves the thought incomplete, so it keeps listening.

Example 2: Compound Requests

"Book me a flight to NYC. Actually, make that Boston."

Turn detection recognizes this as a single turn with a self-correction, not two separate requests.

Example 3: Question-Answer Sequences

"What's the weather in London? And what about Paris?"

These are contextually related—turn detection groups them together rather than responding after "London".

An LLM, especially one fine-tuned on conversational data, understands these patterns naturally.
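
One way to picture the classification step is a constrained chat-completion call. This is illustrative only; the prompt and placeholder model name below are assumptions, not the internals of TEN_Turn_Detection:

from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works here

SYSTEM = (
    "Classify whether the user's transcribed speech is a complete thought. "
    "Answer with exactly one word: finished, unfinished, or wait."
)

def classify_turn(transcript: str) -> str:
    """Illustrative turn classification via a chat completion."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; TEN uses its fine-tuned model
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": transcript},
        ],
        max_tokens=4,
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(classify_turn("I'm thinking about..."))        # likely "unfinished"
print(classify_turn("Book me a flight to Boston."))  # likely "finished"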

GPU Deployment with Cerebrium

To make turn detection fast enough for real-time conversations, the TEN Framework uses Cerebrium, a platform for deploying AI models on GPUs.

Here's why this matters:

  • Speed: GPU inference for LLMs drops response time to milliseconds
  • Scalability: Handle thousands of concurrent conversations
  • Cost-effective: Pay only for compute you use
  • OpenAI-compatible API: Drop-in integration with your existing tools

Deployment Process

# 1. Create Cerebrium account at https://www.cerebrium.ai/

# 2. Install CLI and login
pip install cerebrium
cerebrium login

# 3. Deploy the turn detection model
cd agents/examples/voice-assistant-with-turn-detection/cerebrium
cerebrium deploy

# 4. Get your credentials
# Base URL: https://api.cortex.cerebrium.ai/v4/p-xxxxx/ten-turn-detection-project/run
# API Key: Your Cerebrium token

The deployment loads the TEN_Turn_Detection model (fine-tuned for conversation understanding) on an NVIDIA A10 GPU and exposes it via OpenAI-compatible API.
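
Because the endpoint is OpenAI-compatible, the standard OpenAI client can point straight at it. A sketch using the TTD_BASE_URL and TTD_API_KEY credentials from step 4 (the model id below is an assumption about what the deployment exposes):

import os
from openai import OpenAI

# Credentials from the Cerebrium deployment (step 4 above)
client = OpenAI(
    base_url=os.environ["TTD_BASE_URL"],
    api_key=os.environ["TTD_API_KEY"],
)

resp = client.chat.completions.create(
    model="TEN_Turn_Detection",  # assumed model id; check your deployment
    messages=[{"role": "user", "content": "Tell me about the best restaurants in"}],
)
print(resp.choices[0].message.content)  # expected: an "unfinished" classification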

Real-World Benefits

1. Natural Conversation Flow

Users can speak in their natural style without awkward pauses or artificial phrasing. The assistant understands when to listen and when to respond.

Before: "Hello." awkward pause "How. Are. You." waits for response

After: "Hello, how are you? I'm looking for a restaurant." assistant waits for full thought, then responds

2. Better Interruption Handling

What happens if a user interrupts themselves?

"I want to book a flight to—wait, actually, I need to check my calendar first."

Turn detection recognizes this as a single turn with a course correction, not two conflicting requests.

3. Improved Context Understanding

When the LLM processes the user's input, it sees the complete thought, not fragmented pieces. This leads to more accurate responses and better handling of complex queries.

4. Reduced Latency

Despite adding an extra LLM call, turn detection can reduce overall latency because:

  • It avoids responding prematurely and being interrupted
  • It batches fragments into complete thoughts
  • The final LLM response is more accurate (fewer clarifications needed)

Architecture: The Full Stack

Here's how everything works together in a Voice Assistant with Turn Detection:

┌─────────────────────────────────────────────────────┐
│           User Speaking (Microphone)                │
└──────────────────┬──────────────────────────────────┘
                   ▼
┌─────────────────────────────────────────────────────┐
│     Audio Streaming (Agora RTC)                     │
└──────────────────┬──────────────────────────────────┘
                   ▼
┌─────────────────────────────────────────────────────┐
│   Speech-to-Text (Deepgram)                         │
│   Produces real-time transcription chunks           │
└──────────────────┬──────────────────────────────────┘
                   ▼
┌─────────────────────────────────────────────────────┐
│   Turn Detection (Fine-tuned LLM on Cerebrium GPU)  │
│   Classification: finished/unfinished/wait          │
└──────────────────┬──────────────────────────────────┘
                   ▼
            [finished?]
           /            \
        YES              NO
         ↓                ↓
    LLM Response    Continue listening
         ↓
    Text-to-Speech (ElevenLabs)
         ↓
    Audio Output (User hears response)

Each layer is independently configurable:

  • Audio: Switch from Agora to another RTC provider
  • STT: Use Azure, Speechmatics, or AssemblyAI instead of Deepgram
  • Turn Detection: Adjust sensitivity or model parameters
  • LLM: Switch from OpenAI to Claude, Llama, or Coze
  • TTS: Use Azure TTS, Cartesia, or Fish Audio instead of ElevenLabs
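
Conceptually, the stack is a graph of swappable nodes. As a purely illustrative sketch of that wiring (this is not the actual TEN graph schema or its property names):

# Illustrative only: not the real TEN Framework configuration format
PIPELINE = {
    "rtc":            {"provider": "agora"},       # or another RTC provider
    "stt":            {"provider": "deepgram"},    # or azure, speechmatics, assemblyai
    "turn_detection": {"provider": "cerebrium",    # fine-tuned LLM on GPU
                       "sensitivity": "balanced"},
    "llm":            {"provider": "openai"},      # or claude, llama, coze
    "tts":            {"provider": "elevenlabs"},  # or azure, cartesia, fish audio
}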

Getting Started

# Prerequisites: Complete Cerebrium GPU deployment (see above)

# Install dependencies
task install

# Configure environment variables
# Set: TTD_BASE_URL, TTD_API_KEY, DEEPGRAM_API_KEY, OPENAI_API_KEY, ELEVENLABS_TTS_KEY

# Run the assistant
task run

# Access:
# - Frontend: http://localhost:3000
# - API: http://localhost:8080
# - Designer: http://localhost:49483

Customization with TMAN Designer

The visual designer at http://localhost:49483 lets you:

  • Fine-tune turn detection sensitivity — More aggressive vs. lenient
  • Adjust timeout thresholds — How long to wait in ambiguous states
  • Add tools and integrations — Weather, calendar, database lookups
  • Change prompts — Custom system prompts for different use cases
  • Add capabilities — Plugin new features without code

You can build complex agent behaviors by chaining components together, all visually.

Advanced Configurations

Multi-Language Support

Turn detection works across languages. Configure Deepgram for your target language, and the turn detection model adapts automatically:

# Support English and Spanish
DEEPGRAM_LANGUAGE=en,es

Custom System Prompts

Modify the LLM's behavior by customizing its system prompt in the designer:

"You are a helpful restaurant reservation assistant.
You ask clarifying questions about party size,
preferred cuisine, and dietary restrictions before
confirming bookings."
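
Wired into an OpenAI-compatible call, that prompt slots in as the system message. A minimal sketch (the designer handles this wiring for you; the model id is a placeholder):

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a helpful restaurant reservation assistant. "
    "You ask clarifying questions about party size, preferred cuisine, "
    "and dietary restrictions before confirming bookings."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model id
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "I'd like a table for tonight."},
    ],
)
print(resp.choices[0].message.content)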

Interrupt Handling

Turn detection supports graceful interruption:

User: "Tell me about Tokyo"
       [3 seconds of content]
Assistant starts: "Tokyo is a vibrant city..."
User: "Wait, I meant Bangkok!"
       [Turn detection flags as new turn]
Assistant pivots: "Bangkok is known for..."
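
Under the hood, that pivot amounts to cancelling the in-flight response when a new turn is flagged. A hedged asyncio sketch (speak_async and the task handling are hypothetical, not the framework's internals):

import asyncio

async def respond(speak_async, text: str):
    """Speak a reply; cancellable if the user starts a new turn."""
    await speak_async(text)  # streaming TTS playback

async def on_new_turn(current_task, speak_async, reply: str) -> asyncio.Task:
    """When turn detection flags a new turn, cancel any in-flight response."""
    if current_task is not None and not current_task.done():
        current_task.cancel()            # stop "Tokyo is a vibrant city..."
        try:
            await current_task
        except asyncio.CancelledError:
            pass                         # playback interrupted cleanly
    return asyncio.create_task(respond(speak_async, reply))  # pivot to Bangkok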

Comparison: Turn Detection vs. VAD

Aspect                    Voice Activity Detection (VAD)            Turn Detection
Question answered         "Is this speech?"                         "Is this thought complete?"
Method                    Audio analysis (spectrograms)             Semantic analysis (LLM)
Handles pauses            Sometimes error-prone                     Highly accurate
Handles complex requests  Not context-aware                         Understands intent
Latency                   ~50ms                                     ~200-500ms (including LLM)
Best for                  Basic speech detection, noise filtering   Natural conversation, complex queries
GPU required              No                                        Yes

When to Use Each

Use VAD if:

  • You need ultra-low latency (IoT, embedded systems)
  • Your interactions are simple commands and responses
  • You have limited infrastructure for GPU deployment

Use Turn Detection if:

  • You're building multi-turn conversations
  • Users make complex, compound requests
  • Natural conversation flow is critical
  • You have GPU infrastructure

Many applications use both: VAD first to filter noise, then turn detection for semantic understanding.
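
A hedged sketch of that two-stage pattern, using the real webrtcvad package as the cheap noise gate (transcribe and classify_turn are stand-ins for the STT and semantic stages):

import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; 2 is a common middle ground

SAMPLE_RATE = 16000
FRAME_MS = 30           # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

def process_frame(frame: bytes, transcribe, classify_turn):
    """Stage 1: cheap VAD filters noise. Stage 2: semantic turn detection."""
    if not vad.is_speech(frame, SAMPLE_RATE):  # fast audio-level check
        return None                            # silence/noise: skip the LLM
    text = transcribe(frame)                   # STT on speech frames only
    return classify_turn(text)                 # finished / unfinished / wait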

Performance Considerations

Latency Breakdown

For a typical request:

  • Audio capture: ~20ms
  • STT (Deepgram streaming): ~100-300ms
  • Turn detection (GPU LLM): ~200-300ms
  • LLM response generation: ~500-1500ms (depending on complexity)
  • TTS synthesis: ~500-1000ms
  • Total: 1-3 seconds for a complete response

This feels natural to users—similar to a real person thinking and responding.
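
Summing the per-stage budgets above gives a quick sanity check on that 1-3 second figure:

# Per-stage latency budgets from the breakdown above, in milliseconds
STAGES = {
    "audio_capture":  (20, 20),
    "stt_streaming":  (100, 300),
    "turn_detection": (200, 300),
    "llm_response":   (500, 1500),
    "tts_synthesis":  (500, 1000),
}

best = sum(lo for lo, _ in STAGES.values())   # 1320 ms
worst = sum(hi for _, hi in STAGES.values())  # 3120 ms
print(f"end-to-end: {best/1000:.1f}s - {worst/1000:.1f}s")  # ~1.3s - 3.1s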

Cost Optimization

Cerebrium provides different tier options:

  • Free tier: 50,000 requests/month (testing)
  • Starter: $99/month for 1M+ requests
  • Enterprise: Custom scaling for high-volume deployments

The fine-tuned turn detection model is smaller and faster than a full LLM, so costs are reasonable for most applications.

What's Next?

Turn detection is part of a broader ecosystem:

  • Voice Assistant with VAD — Foundational speech detection
  • Voice Assistant with Live2D — Add visual avatars
  • Voice Assistant with Real-time — Ultra-low-latency variants
  • Voice Assistant with SIP/Twilio — Integrate with phone systems

Wrapping Up

Turn detection represents the next evolution in voice AI. It moves beyond simple speech detection into true conversational understanding. Users get natural, responsive interactions. Developers get a modular, customizable foundation.

The TEN Framework makes deploying production-grade turn detection accessible. Whether you're building customer service bots, accessibility tools, or voice-first applications, turn detection helps you create assistants that feel genuinely intelligent.

Ready to build smarter conversations? Get started with Voice Assistant with Turn Detection today.
