
Human-Like Speech for Voice Agents

AnyreachTTS: Natural, low-latency text-to-speech with backchanneling and voice cloning

Generic TTS solutions lack voice agent-specific features.

Voice Cloning Ready
Real-Time Streaming
Interruption Handling
Backchannel Support

Model Architecture: Orpheus LLaMA 3.2 3B
First Chunk Latency: Competitive vs ElevenLabs
Backchanneling: Native, production ready

Executive Summary

What we built

AnyreachTTS — a custom text-to-speech system built on Orpheus LLaMA 3.2 3B, optimized for voice agent use cases with features like backchanneling, interruption handling, and voice adaptation.

Why it matters

Generic TTS solutions lack voice agent-specific features. Latency to first audio chunk is critical for perceived responsiveness. Voice consistency across long conversations requires specialized training. Backchanneling creates natural dialogue flow.

Results

  • First-chunk latency competitive with ElevenLabs
  • Native backchanneling support ("uh-huh", "I see")
  • Multi-speaker training for tone consistency (July 2025)
  • Baseten + NVIDIA TensorRT deployment for simplified infrastructure

Best for

  • Real-time voice agents requiring low-latency responses
  • Branded voice experiences with cloned voices
  • Multi-turn conversations needing consistent tone
  • Applications requiring natural backchannel behavior

Limitations

  • Multi-speaker training still in progress for GA
  • Optimal results require LoRA fine-tuning on roleplay calls

The Problem

Generic TTS fails voice agents in five distinct ways. Each has different causes and different costs.

Symptom: Slow first audio chunk
Cause: Inefficient model inference or streaming pipeline
Business Cost: Awkward pauses and user drops; the agent feels unresponsive

Symptom: TTS continues during user speech
Cause: Missing flush/clear mechanisms
Business Cost: Overlapping audio and a confusing conversation flow

Symptom: Random shifts in speaking style
Cause: Single-speaker models with no fine-tuning
Business Cost: Unnatural, jarring experience

Symptom: Agent sounds robotic
Cause: Traditional TTS without conversation awareness
Business Cost: Unnatural conversation flow

Symptom: Generic preset voices
Cause: No LoRA fine-tuning or professional cloning
Business Cost: Lack of brand identity

How It Works

A three-stage pipeline streamed end to end: a text tokenizer feeds a LLaMA 3.2 3B speech model, whose speech tokens are decoded into streaming audio.

Text Tokenizer

Converts text input to tokens for the language model

  • Text preprocessing
  • Token generation
  • Special character handling

LLaMA 3.2 3B

Native LLM-based speech generation with voice embeddings

  • Speech token generation
  • Voice embedding integration
  • Backchannel triggers

Audio Decoder

Converts speech tokens to streaming audio output

  • Real-time audio generation
  • Streaming WebSocket output
  • Quality optimization
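
To make the three stages concrete, the sketch below wires a tokenizer, a chunk-wise speech-token generator, and an audio decoder into one streaming loop. All function names and internals are illustrative placeholders, not the AnyreachTTS API; the point is that audio is emitted as soon as the first chunk of speech tokens is decoded, which is what keeps first-chunk latency low.

```python
# Illustrative sketch of the tokenizer -> LLM -> decoder pipeline.
# Function names and internals are placeholders, not the AnyreachTTS API.
from typing import Iterator

def tokenize_text(text: str) -> list[int]:
    """Stage 1: convert raw text into token IDs for the speech LLM."""
    # A real tokenizer also handles normalization, special characters,
    # and backchannel/control tags.
    return [ord(c) for c in text]

def generate_speech_tokens(text_tokens: list[int], voice_id: str) -> Iterator[list[int]]:
    """Stage 2: the LLaMA-based model yields speech tokens chunk by chunk.
    voice_id would select a voice embedding; it is unused in this stub."""
    chunk: list[int] = []
    for t in text_tokens:
        chunk.append(t)          # stand-in for autoregressive decoding
        if len(chunk) == 16:     # emit small chunks to keep latency low
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def decode_to_audio(speech_tokens: list[int]) -> bytes:
    """Stage 3: turn a chunk of speech tokens into PCM audio bytes."""
    return bytes(len(speech_tokens) * 2)  # placeholder silence

def synthesize_stream(text: str, voice_id: str) -> Iterator[bytes]:
    """End-to-end streaming synthesis: audio is yielded as soon as the
    first chunk of speech tokens has been decoded."""
    text_tokens = tokenize_text(text)
    for chunk in generate_speech_tokens(text_tokens, voice_id):
        yield decode_to_audio(chunk)

if __name__ == "__main__":
    for audio_chunk in synthesize_stream("Hello, how can I help?", voice_id="brand-voice"):
        print(f"got {len(audio_chunk)} bytes of audio")
```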

Product Features

Ready for production with enterprise-grade reliability.

Streaming Output

WebSocket-based real-time audio delivery for immediate playback
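
A minimal client sketch for the streaming output, assuming a WebSocket endpoint that accepts a JSON request and streams back binary audio frames; the URL, field names, and end-of-stream event are assumptions for illustration, not the documented protocol.

```python
# Hypothetical streaming client. The endpoint URL and JSON fields
# ("text", "voice_id", "event") are assumptions for illustration only.
import asyncio
import json

import websockets  # pip install websockets

def play_audio_chunk(chunk: bytes) -> None:
    """Placeholder audio sink (speaker, telephony leg, or LiveKit track)."""
    print(f"playing {len(chunk)} bytes")

async def stream_tts(text: str, voice_id: str) -> None:
    async with websockets.connect("wss://example.anyreach.ai/tts") as ws:
        await ws.send(json.dumps({"text": text, "voice_id": voice_id}))
        async for message in ws:
            if isinstance(message, bytes):
                # Binary frames carry audio; hand them to the playback buffer.
                play_audio_chunk(message)
            elif json.loads(message).get("event") == "done":
                # Text frames carry control events, e.g. end-of-stream.
                break

if __name__ == "__main__":
    asyncio.run(stream_tts("Thanks for calling!", voice_id="brand-voice"))
```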

Interruption Handling

Flush/clear on user speech to prevent audio overlap
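
A sketch of the client side of interruption handling, assuming a local playback queue and a hypothetical "clear" control message; the real flush protocol may differ.

```python
# Sketch: when the VAD reports user speech, stop playback immediately and
# ask the server to flush queued synthesis. The "clear" control message
# is an assumed convention, not the documented protocol.
import asyncio
import json

class PlaybackQueue:
    def __init__(self) -> None:
        self._queue: asyncio.Queue[bytes] = asyncio.Queue()

    async def put(self, chunk: bytes) -> None:
        await self._queue.put(chunk)

    def clear(self) -> None:
        # Drop any audio that has not been played yet so the agent
        # stops talking as soon as the user starts.
        while not self._queue.empty():
            self._queue.get_nowait()

async def on_user_speech_started(ws, playback: PlaybackQueue) -> None:
    playback.clear()                               # local flush
    await ws.send(json.dumps({"event": "clear"}))  # server-side flush
```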

Backchanneling

Natural "uh-huh", "I see" responses during conversation

Voice Cloning

LoRA fine-tuning + professional cloning for branded voices
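
A minimal LoRA setup sketch using Hugging Face PEFT against a base LLaMA 3.2 3B checkpoint; the model ID, target modules, and hyperparameters are placeholders, and the actual cloning pipeline trains on speech-token sequences derived from roleplay recordings of the target voice.

```python
# Sketch of a LoRA adapter setup with Hugging Face PEFT. The model ID,
# target modules, and hyperparameters are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint ID; gated models require authentication.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

lora_config = LoraConfig(
    r=16,                 # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model

# From here, train on paired (text, speech-token) sequences built from the
# roleplay recordings, then merge or serve the adapter for the cloned voice.
```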

Multi-Speaker

Consistent tone across conversations (July 2025 GA)

LiveKit Integration

Native plugin support for voice agent stack

Integration Details

Runs On

Baseten + NVIDIA TensorRT

Latency Budget

Competitive with ElevenLabs

Providers

LiveKit, WebSocket API, Baseten

Implementation

Voice cloning: 1-2 weeks with roleplay recordings

Frequently Asked Questions

Common questions about our text-to-speech system.

Ready to see this in action?

Book a technical walkthrough with our team to see how this research applies to your use case.