Natural Conversations Through Intelligent Turn-Taking
Multimodal LLM-based controllers for better latency-interruption tradeoffs than existing endpointing methods
- Turn-Taking States: 4-State
- Status: In Development
- Paper Target: Interspeech 2026
- Feb 25, 2026
Executive Summary
What we built
Turn-taking controllers for real-time voice agents, comparing classic endpointing (VAD/acoustic/semantic) to multimodal LLM-based approaches that consume audio and partial transcripts to decide when to respond.
Why it matters
Turn-taking is the single largest driver of perceived naturalness in voice agents. It governs responsiveness (users hate dead air) and politeness (users hate being cut off). Common stacks rely on brittle VAD heuristics, but humans use both acoustics and semantics — models should too.
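The brittleness of fixed-threshold VAD endpointing can be seen in a minimal sketch. This is an illustrative baseline, not Anyreach's implementation; the threshold and timing values are assumptions chosen for clarity.

```python
def vad_endpoint(frame_energies, energy_threshold=0.01,
                 silence_ms_needed=700, frame_ms=20):
    """Classic VAD heuristic: declare end-of-turn after a fixed run of
    low-energy frames. `frame_energies` is per-frame RMS energy.
    Returns the frame index where the endpointer fires, or None."""
    silent_frames_needed = silence_ms_needed // frame_ms
    run = 0
    for i, energy in enumerate(frame_energies):
        run = run + 1 if energy < energy_threshold else 0
        if run >= silent_frames_needed:
            # Fires on any sufficiently long pause -- including a
            # mid-sentence thinking pause, because it sees no semantics.
            return i
    return None
```

Because the rule only sees energy, a user who pauses 700ms to think gets cut off, while lowering the threshold just trades that error for dead air, which is exactly the latency-interruption tradeoff the multimodal approach targets.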
Results
- RouterLLM v1 achieves ~400ms decision latency with Cerebras inference
- 4-state controller architecture: COMPLETE/INCOMPLETE/BACKCHANNEL/WAIT
- Systematic comparison framework across VAD, semantic, acoustic, fusion, and multimodal approaches
- Research targeting an Interspeech 2026 publication
Best for
- Voice assistants requiring natural conversation flow
- Customer service bots handling complex dialogues
- Healthcare appointment scheduling with empathetic interactions
Limitations
- Research in progress — not yet production deployed
- Full-duplex speech-to-speech models not yet practical for production
- Multimodal training requires specialized datasets
The Problem
Turn-taking fails in four distinct ways. Each has different causes and different costs.
| Symptom | Cause | Business Cost |
|---|---|---|
| Premature responses | Fixed thresholds, no semantics | |
| Hesitation confusion | Can't distinguish thinking pauses | |
| Late responses | ASR latency (0.5-2+ seconds) | |
| Backchannel failures | Can't detect 'uh-huh' vs completion | |
How It Works
A two-layer detection system where each covers the other's weaknesses.
Speech Encoder
Processes real-time audio stream for acoustic features
- Prosody analysis
- Pause detection
- Energy patterns
Text Encoder
Processes partial transcripts for semantic understanding
- Utterance completeness
- Intent detection
- Sentence boundaries
Multimodal Fusion
Combines audio and text signals with conversation history
- Joint signal analysis
- Context integration
- State prediction
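The fusion stage above can be sketched as a late fusion of per-state scores from the two encoders. The weighting scheme and softmax here are illustrative assumptions, not the trained fusion layer.

```python
import math

STATES = ["COMPLETE", "INCOMPLETE", "BACKCHANNEL", "WAIT"]

def fuse(audio_logits, text_logits, w_audio=0.4, w_text=0.6):
    """Combine two 4-dim logit vectors (speech encoder, text encoder)
    into a probability per turn-taking state via weighted sum + softmax.
    Weights are hypothetical placeholders."""
    fused = [w_audio * a + w_text * t
             for a, t in zip(audio_logits, text_logits)]
    peak = max(fused)                       # subtract max for stability
    exp = [math.exp(x - peak) for x in fused]
    total = sum(exp)
    return {name: v / total for name, v in zip(STATES, exp)}
```

In a real system the fusion would also condition on conversation history; this sketch shows only the joint-signal step, where a confident semantic "complete" score can override an ambiguous acoustic pause (or vice versa).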
Turn-Taking Classifier
Outputs 4-state decision: COMPLETE, INCOMPLETE, BACKCHANNEL, WAIT
- COMPLETE → SHIFT (take floor)
- INCOMPLETE → HOLD (wait)
- BACKCHANNEL → IGNORE
- WAIT → YIELD
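The state-to-action mapping above is a fixed lookup, which a controller could encode as:

```python
from enum import Enum

class TurnState(Enum):
    COMPLETE = "COMPLETE"        # user finished their turn
    INCOMPLETE = "INCOMPLETE"    # user is mid-utterance
    BACKCHANNEL = "BACKCHANNEL"  # 'uh-huh', 'mm-hm' -- not a turn yield
    WAIT = "WAIT"                # deliberately leave the floor open

# Classifier output -> floor action, per the mapping in the text.
ACTION = {
    TurnState.COMPLETE: "SHIFT",      # take the floor and respond
    TurnState.INCOMPLETE: "HOLD",     # keep listening
    TurnState.BACKCHANNEL: "IGNORE",  # continue current agent turn
    TurnState.WAIT: "YIELD",          # stay silent, cede the floor
}

def act(state: TurnState) -> str:
    return ACTION[state]
```

Names like `act` are illustrative; the point is that once the classifier emits one of the four states, the downstream floor decision is deterministic.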
Benchmark Results
Comparison of Anyreach to baseline methods across datasets and metrics.
Baseline Comparison

| Name | Region | Accuracy | Precision | Recall | F1 | Latency | Samples |
|---|---|---|---|---|---|---|---|
| RouterLLM v1 | All | 0% | 0% | 0% | 0 | 400ms | 0 |
Frequently Asked Questions
Common questions about our turn-taking system.
Methodology
How we built, trained, and evaluated this model.
Dataset
Comprehensive evaluation across 4-state turn-taking, audio-only VAD, text-only detection, and behavioral tests.
Labeling
Human annotators label turn boundaries, backchannel events, and floor-transfer points.
Evaluation Protocol
Metrics: Precision/Recall/F1 per class, Decision Latency, Floor-Transfer-Offset (FTO), Turn-Over Rate, Jensen-Shannon Divergence from human FTO, False Cut-in Rate, A/B preference.
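One of the less standard metrics above, Jensen-Shannon divergence from the human floor-transfer-offset (FTO) distribution, can be computed over histogrammed FTOs. This is a generic JSD sketch under the assumption that both distributions are discretized onto the same bins; it is not the paper's evaluation code.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete
    distributions over the same bins; 0 = identical, 1 = disjoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability bins.
        return sum(ai * math.log2(ai / bi)
                   for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A controller whose FTO histogram matches human timing scores near 0; one that always answers too early or too late concentrates mass in different bins and scores toward 1.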
Known Limitations
- Research in progress — metrics will be updated post-publication
- Phone-call A/B evaluation planned for final phase
Ready to see this in action?
Book a technical walkthrough with our team to see how this research applies to your use case.
