
Natural Conversations Through Intelligent Turn-Taking

Multimodal LLM-based controllers that achieve better latency-interruption tradeoffs than existing endpointing methods

Turn-taking is the single largest driver of perceived naturalness in voice agents.

  • Research: Paper in Progress
  • Real-Time: Streaming
  • Multimodal: Audio + Text
  • Benchmark: Suite
  • Compliance: RouterLLM latency ~400ms (< 500ms target)

  • Turn-Taking States: 4-State (In Development)
  • Paper Target: Interspeech 2026 (Feb 25, 2026)

Executive Summary

What we built

Turn-taking controllers for real-time voice agents, comparing classic endpointing (VAD/acoustic/semantic) to multimodal LLM-based approaches that consume audio and partial transcripts to decide when to respond.

Why it matters

Turn-taking is the single largest driver of perceived naturalness in voice agents. It governs responsiveness (users hate dead air) and politeness (users hate being cut off). Common stacks rely on brittle VAD heuristics, but humans use both acoustics and semantics — models should too.

Results

  • RouterLLM v1 achieving ~400ms latency with Cerebras inference
  • 4-state controller architecture: COMPLETE/INCOMPLETE/BACKCHANNEL/WAIT
  • Systematic comparison framework across VAD, semantic, acoustic, fusion, and multimodal approaches
  • Research targeting Interspeech 2026 publication

Best for

  • Voice assistants requiring natural conversation flow
  • Customer service bots handling complex dialogues
  • Healthcare appointment scheduling with empathetic interactions

Limitations

  • Research in progress — not yet production deployed
  • Full-duplex speech-to-speech models not yet practical for production
  • Multimodal training requires specialized datasets

The Problem

Turn-taking detection fails in four distinct ways. Each has different causes and different costs.

Symptom                 Cause                                  Business Cost
Premature responses     Fixed thresholds, no semantics         Cuts off users mid-sentence
Hesitation confusion    Can't distinguish thinking pauses      Responds during "um..." moments
Late responses          ASR latency (0.5-2+ seconds)           Awkward gaps in conversation
Backchannel failures    Can't detect 'uh-huh' vs completion    Inappropriate floor-taking

How It Works

A two-stream detection system in which the acoustic and semantic signals cover each other's weaknesses.

Speech Encoder

Processes real-time audio stream for acoustic features

  • Prosody analysis
  • Pause detection
  • Energy patterns
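As a rough illustration of the acoustic cues above, the sketch below computes per-frame energy and the length of the trailing pause from a raw audio buffer. It is not the production Speech Encoder; the function name, frame size, and silence threshold are assumptions made for illustration.

```python
# Minimal sketch (not the production Speech Encoder): per-frame energy and
# trailing-pause length from a normalized float PCM buffer. Frame size and the
# silence threshold below are illustrative assumptions.
import numpy as np

def acoustic_cues(pcm: np.ndarray, sample_rate: int = 16000,
                  frame_ms: int = 20, silence_db: float = -40.0) -> dict:
    """Return per-frame energy (dB) and the duration of trailing silence in seconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(pcm) // frame_len
    frames = pcm[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    energy_db = 20 * np.log10(rms)

    # Count consecutive sub-threshold frames at the end of the buffer (the pause).
    trailing = 0
    for e in energy_db[::-1]:
        if e < silence_db:
            trailing += 1
        else:
            break
    return {"energy_db": energy_db, "trailing_pause_s": trailing * frame_ms / 1000.0}
```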

Text Encoder

Processes partial transcripts for semantic understanding

  • Utterance completeness
  • Intent detection
  • Sentence boundaries
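The text stream's role can be shown with a toy completeness check over a partial transcript. The real Text Encoder is a learned model; the filler list and punctuation rules below are assumptions that only demonstrate the cue types listed above.

```python
# Toy completeness heuristic over a partial ASR transcript (illustration only;
# the real Text Encoder is a learned model). Filler list and rules are assumptions.
import re

TRAILING_FILLERS = {"um", "uh", "erm", "hmm", "so", "and", "but", "because", "like"}

def looks_complete(partial_transcript: str) -> bool:
    text = partial_transcript.strip().lower()
    if not text:
        return False
    # Sentence-final punctuation from the ASR is a strong completeness cue.
    if text.endswith((".", "?", "!")):
        return True
    # Utterances trailing off on a filler or conjunction are usually incomplete.
    last_word = re.sub(r"[^\w'-]", "", text.split()[-1])
    return last_word not in TRAILING_FILLERS
```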

Multimodal Fusion

Combines audio and text signals with conversation history

  • Joint signal analysis
  • Context integration
  • State prediction
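A minimal sketch of what fusion could look like, assuming precomputed speech, text, and conversation-context embeddings. The dimensions, concatenation strategy, and module name are illustrative, not the actual architecture.

```python
# Sketch of a fusion module (assumed dimensions and architecture), in PyTorch.
import torch
import torch.nn as nn

class TurnTakingFusion(nn.Module):
    """Combine speech, text, and conversation-context embeddings into one vector."""

    def __init__(self, speech_dim: int = 512, text_dim: int = 768,
                 ctx_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim + text_dim + ctx_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, speech_emb: torch.Tensor, text_emb: torch.Tensor,
                ctx_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([speech_emb, text_emb, ctx_emb], dim=-1)
        return self.proj(fused)  # joint representation fed to the turn-taking classifier
```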

Turn-Taking Classifier

Outputs 4-state decision: COMPLETE, INCOMPLETE, BACKCHANNEL, WAIT

  • COMPLETE → SHIFT (take floor)
  • INCOMPLETE → HOLD (wait)
  • BACKCHANNEL → IGNORE
  • WAIT → YIELD
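The state-to-action mapping above can be written down directly. The sketch below is a hypothetical controller shim: the state labels and floor actions come from the list, everything else is assumed.

```python
# Hypothetical controller shim: map 4-state classifier logits to a floor action.
# State labels and actions follow the list above; the function itself is illustrative.
from enum import Enum

class TurnState(Enum):
    COMPLETE = 0
    INCOMPLETE = 1
    BACKCHANNEL = 2
    WAIT = 3

ACTIONS = {
    TurnState.COMPLETE: "SHIFT",      # take the floor and respond
    TurnState.INCOMPLETE: "HOLD",     # keep listening, the user is mid-thought
    TurnState.BACKCHANNEL: "IGNORE",  # acknowledgement only, do not take the floor
    TurnState.WAIT: "YIELD",          # explicitly leave the floor with the user
}

def decide(logits: list[float]) -> str:
    """Pick the highest-scoring state and return its floor action."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return ACTIONS[TurnState(best)]
```

For example, logits of [2.1, 0.3, -1.0, 0.5] would select COMPLETE and return SHIFT.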

Benchmark Results

Comparison of Anyreach to baseline methods across datasets and metrics.

Baseline Comparison

Name           Region   Accuracy   Precision   Recall   F1   Latency   Samples
RouterLLM v1   All      0%         0%          0%       0    400ms     0

(Metrics are placeholders and will be updated post-publication; see Known Limitations.)


Methodology

How we built, trained, and evaluated this model.

Dataset

Name: Turn-Taking Benchmark Suite
Size: Multiple datasets (Easy Turn, Smart Turn, TEN, TURNS-2K, Full-Duplex-Bench)

Comprehensive evaluation across 4-state turn-taking, audio-only VAD, text-only detection, and behavioral tests.

Labeling

Human annotators label turn boundaries, backchannel events, and floor-transfer points.

Evaluation Protocol

Metrics: Precision/Recall/F1 per class, Decision Latency, Floor-Transfer-Offset (FTO), Turn-Over Rate, Jensen-Shannon Divergence from human FTO, False Cut-in Rate, A/B preference.
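A sketch of how some of these metrics could be computed from per-turn timestamps, assuming FTO is defined as agent speech onset minus user speech offset. The binning, cut-in tolerance, and function names are assumptions rather than the protocol's exact settings.

```python
# Illustrative metric computations (assumed definitions, tolerance, and binning).
import numpy as np
from scipy.spatial.distance import jensenshannon

def floor_transfer_offsets(user_end_s, agent_start_s):
    """FTO per turn: agent speech onset minus user speech offset (negative = cut-in)."""
    return np.asarray(agent_start_s) - np.asarray(user_end_s)

def false_cut_in_rate(ftos, tolerance_s: float = 0.0) -> float:
    """Fraction of turns where the agent started before the user had finished."""
    return float(np.mean(np.asarray(ftos) < -tolerance_s))

def fto_js_divergence(model_ftos, human_ftos,
                      bins=np.linspace(-1.0, 2.0, 31)) -> float:
    """Jensen-Shannon divergence between model and human FTO distributions."""
    p, _ = np.histogram(model_ftos, bins=bins)
    q, _ = np.histogram(human_ftos, bins=bins)
    p = (p + 1e-9) / (p + 1e-9).sum()
    q = (q + 1e-9) / (q + 1e-9).sum()
    return float(jensenshannon(p, q) ** 2)  # jensenshannon returns the JS distance
```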

Known Limitations

  • Research in progress — metrics will be updated post-publication
  • Phone-call A/B evaluation planned for final phase

Evaluation Details

Last Evaluated: 2026-01-13
Model Version: routerllm-v1

Ready to see this in action?

Book a technical walkthrough with our team to see how this research applies to your use case.