
Natural Conversations Through Intelligent Turn-Taking

Multimodal LLM-based controllers that achieve better latency-interruption tradeoffs than existing endpointing methods

Turn-taking is the single largest driver of perceived naturalness in voice agents.

  • Research: Paper in Progress
  • Real-Time: Streaming
  • Multimodal: Audio + Text
  • Benchmark: Suite
  • Compliance: RouterLLM latency ~400ms (< 500ms target)

  • Turn-Taking States: 4-State (In Development)
  • Paper Target: Interspeech 2026 (Feb 25, 2026)

Executive Summary

What we built

Turn-taking controllers for real-time voice agents, comparing classic endpointing (VAD/acoustic/semantic) to multimodal LLM-based approaches that consume audio and partial transcripts to decide when to respond.

Why it matters

Turn-taking is the single largest driver of perceived naturalness in voice agents. It governs responsiveness (users hate dead air) and politeness (users hate being cut off). Common stacks rely on brittle VAD heuristics, but humans use both acoustics and semantics — models should too.

Results

  • RouterLLM v1 achieving ~400ms latency with Cerebras inference
  • 4-state controller architecture: COMPLETE/INCOMPLETE/BACKCHANNEL/WAIT
  • Systematic comparison framework across VAD, semantic, acoustic, fusion, and multimodal approaches
  • Research targeting Interspeech 2026 publication

Best for

  • Voice assistants requiring natural conversation flow
  • Customer service bots handling complex dialogues
  • Healthcare appointment scheduling with empathetic interactions

Limitations

  • Research in progress — not yet production deployed
  • Full-duplex speech-to-speech models not yet practical for production
  • Multimodal training requires specialized datasets

The Problem

Turn-taking detection fails in four distinct ways. Each has different causes and different costs.

Symptom                 Cause                                  Business Cost
Premature responses     Fixed thresholds, no semantics         Cuts off users mid-sentence
Hesitation confusion    Can't distinguish thinking pauses      Responds during "um..." moments
Late responses          ASR latency (0.5-2+ seconds)           Awkward gaps in conversation
Backchannel failures    Can't detect 'uh-huh' vs completion    Inappropriate floor-taking

How It Works

A two-stream detection system in which the acoustic and semantic signals cover each other's weaknesses.

Speech Encoder

Processes real-time audio stream for acoustic features

  • Prosody analysis
  • Pause detection
  • Energy patterns
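As a rough illustration of the acoustic cues above, the sketch below computes per-frame energy and the length of the trailing pause from a raw audio buffer. It is not the production Speech Encoder; the function name, frame size, and silence threshold are assumptions made for illustration.

```python
# Minimal sketch (not the production Speech Encoder): per-frame energy and
# trailing-pause length from a normalized float PCM buffer. Frame size and the
# silence threshold below are illustrative assumptions.
import numpy as np

def acoustic_cues(pcm: np.ndarray, sample_rate: int = 16000,
                  frame_ms: int = 20, silence_db: float = -40.0) -> dict:
    """Return per-frame energy (dB) and the duration of trailing silence in seconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(pcm) // frame_len
    frames = pcm[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    energy_db = 20 * np.log10(rms)

    # Count consecutive sub-threshold frames at the end of the buffer (the pause).
    trailing = 0
    for e in energy_db[::-1]:
        if e < silence_db:
            trailing += 1
        else:
            break
    return {"energy_db": energy_db, "trailing_pause_s": trailing * frame_ms / 1000.0}
```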

Text Encoder

Processes partial transcripts for semantic understanding

  • Utterance completeness
  • Intent detection
  • Sentence boundaries
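The text stream's role can be shown with a toy completeness check over a partial transcript. The real Text Encoder is a learned model; the filler list and punctuation rules below are assumptions that only demonstrate the cue types listed above.

```python
# Toy completeness heuristic over a partial ASR transcript (illustration only;
# the real Text Encoder is a learned model). Filler list and rules are assumptions.
import re

TRAILING_FILLERS = {"um", "uh", "erm", "hmm", "so", "and", "but", "because", "like"}

def looks_complete(partial_transcript: str) -> bool:
    text = partial_transcript.strip().lower()
    if not text:
        return False
    # Sentence-final punctuation from the ASR is a strong completeness cue.
    if text.endswith((".", "?", "!")):
        return True
    # Utterances trailing off on a filler or conjunction are usually incomplete.
    last_word = re.sub(r"[^\w'-]", "", text.split()[-1])
    return last_word not in TRAILING_FILLERS
```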

Multimodal Fusion

Combines audio and text signals with conversation history

  • Joint signal analysis
  • Context integration
  • State prediction
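A minimal sketch of what fusion could look like, assuming precomputed speech, text, and conversation-context embeddings. The dimensions, concatenation strategy, and module name are illustrative, not the actual architecture.

```python
# Sketch of a fusion module (assumed dimensions and architecture), in PyTorch.
import torch
import torch.nn as nn

class TurnTakingFusion(nn.Module):
    """Combine speech, text, and conversation-context embeddings into one vector."""

    def __init__(self, speech_dim: int = 512, text_dim: int = 768,
                 ctx_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim + text_dim + ctx_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, speech_emb: torch.Tensor, text_emb: torch.Tensor,
                ctx_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([speech_emb, text_emb, ctx_emb], dim=-1)
        return self.proj(fused)  # joint representation fed to the turn-taking classifier
```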

Turn-Taking Classifier

Outputs 4-state decision: COMPLETE, INCOMPLETE, BACKCHANNEL, WAIT

  • COMPLETE → SHIFT (take floor)
  • INCOMPLETE → HOLD (wait)
  • BACKCHANNEL → IGNORE
  • WAIT → YIELD
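The state-to-action mapping above can be written down directly. The sketch below is a hypothetical controller shim: the state labels and floor actions come from the list, everything else is assumed.

```python
# Hypothetical controller shim: map 4-state classifier logits to a floor action.
# State labels and actions follow the list above; the function itself is illustrative.
from enum import Enum

class TurnState(Enum):
    COMPLETE = 0
    INCOMPLETE = 1
    BACKCHANNEL = 2
    WAIT = 3

ACTIONS = {
    TurnState.COMPLETE: "SHIFT",      # take the floor and respond
    TurnState.INCOMPLETE: "HOLD",     # keep listening, the user is mid-thought
    TurnState.BACKCHANNEL: "IGNORE",  # acknowledgement only, do not take the floor
    TurnState.WAIT: "YIELD",          # explicitly leave the floor with the user
}

def decide(logits: list[float]) -> str:
    """Pick the highest-scoring state and return its floor action."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return ACTIONS[TurnState(best)]
```

For example, logits of [2.1, 0.3, -1.0, 0.5] would select COMPLETE and return SHIFT.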

Benchmark Results

Comparison of Anyreach to baseline methods across datasets and metrics.

Baseline Comparison

Name           Region   Accuracy   Precision   Recall   F1   Latency   Samples
RouterLLM v1   All      0%         0%          0%       0    400ms     0

(Metrics are placeholders and will be updated post-publication; see Known Limitations.)


Methodology

How we built, trained, and evaluated this model.

Dataset

Name: Turn-Taking Benchmark Suite
Size: Multiple datasets (Easy Turn, Smart Turn, TEN, TURNS-2K, Full-Duplex-Bench)

Comprehensive evaluation across 4-state turn-taking, audio-only VAD, text-only detection, and behavioral tests.

Labeling

Human annotators label turn boundaries, backchannel events, and floor-transfer points.

Evaluation Protocol

Metrics: Precision/Recall/F1 per class, Decision Latency, Floor-Transfer-Offset (FTO), Turn-Over Rate, Jensen-Shannon Divergence from human FTO, False Cut-in Rate, A/B preference.
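A sketch of how some of these metrics could be computed from per-turn timestamps, assuming FTO is defined as agent speech onset minus user speech offset. The binning, cut-in tolerance, and function names are assumptions rather than the protocol's exact settings.

```python
# Illustrative metric computations (assumed definitions, tolerance, and binning).
import numpy as np
from scipy.spatial.distance import jensenshannon

def floor_transfer_offsets(user_end_s, agent_start_s):
    """FTO per turn: agent speech onset minus user speech offset (negative = cut-in)."""
    return np.asarray(agent_start_s) - np.asarray(user_end_s)

def false_cut_in_rate(ftos, tolerance_s: float = 0.0) -> float:
    """Fraction of turns where the agent started before the user had finished."""
    return float(np.mean(np.asarray(ftos) < -tolerance_s))

def fto_js_divergence(model_ftos, human_ftos,
                      bins=np.linspace(-1.0, 2.0, 31)) -> float:
    """Jensen-Shannon divergence between model and human FTO distributions."""
    p, _ = np.histogram(model_ftos, bins=bins)
    q, _ = np.histogram(human_ftos, bins=bins)
    p = (p + 1e-9) / (p + 1e-9).sum()
    q = (q + 1e-9) / (q + 1e-9).sum()
    return float(jensenshannon(p, q) ** 2)  # jensenshannon returns the JS distance
```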

Known Limitations

  • Research in progress — metrics will be updated post-publication
  • Phone-call A/B evaluation planned for final phase

Evaluation Details

Last Evaluated: 2026-01-13
Model Version: routerllm-v1

Ready to see this in action?

Book a technical walkthrough with our team to see how this research applies to your use case.