Human-Like Speech for Voice Agents
AnyreachTTS: Natural, low-latency text-to-speech with backchanneling and voice cloning
Generic TTS solutions lack voice agent-specific features.
First Chunk Latency: Competitive vs ElevenLabs
Backchanneling: Native, production ready
Executive Summary
What we built
AnyreachTTS is a custom text-to-speech system built on Orpheus LLaMA 3.2 3B and optimized for voice agent use cases, with features like backchanneling, interruption handling, and voice adaptation.
Why it matters
Generic TTS solutions lack voice agent-specific features. Latency to first audio chunk is critical for perceived responsiveness. Voice consistency across long conversations requires specialized training. Backchanneling creates natural dialogue flow.
Results
- First-chunk latency competitive with ElevenLabs
- Native backchanneling support ("uh-huh", "I see")
- Multi-speaker training for tone consistency (July 2025)
- Baseten + NVIDIA TensorRT deployment for simplified infrastructure
Best for
- Real-time voice agents requiring low-latency responses
- Branded voice experiences with cloned voices
- Multi-turn conversations needing consistent tone
- Applications requiring natural backchannel behavior
Limitations
- Multi-speaker training still in progress for GA
- Optimal results require LoRA fine-tuning on roleplay calls
The Problem
Generic TTS fails voice agents in five distinct ways. Each has a different cause and a different cost.
Symptom: Slow first audio chunk
Cause: Inefficient model inference or streaming pipeline
Business Cost:
Symptom: TTS continues during user speech
Cause: Missing flush/clear mechanisms
Business Cost:
Symptom: Random shifts in speaking style
Cause: Single-speaker models, no fine-tuning
Business Cost:
Symptom: Agent sounds robotic
Cause: Traditional TTS without conversation awareness
Business Cost:
Symptom: Generic preset voices
Cause: No LoRA fine-tuning or professional cloning
Business Cost:
How It Works
A three-stage pipeline: a text tokenizer feeds a LLaMA 3.2 3B speech model, whose speech tokens an audio decoder turns into streaming audio.
Text Tokenizer
Converts text input to tokens for the language model
- Text preprocessing
- Token generation
- Special character handling
LLaMA 3.2 3B
Native LLM-based speech generation with voice embeddings
- Speech token generation
- Voice embedding integration
- Backchannel triggers
Audio Decoder
Converts speech tokens to streaming audio output
- Real-time audio generation
- Streaming WebSocket output
- Quality optimization
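The three stages above can be sketched as a chain of generators, so audio chunks leave the decoder while the model is still producing speech tokens. This is a minimal illustration, not the AnyreachTTS implementation: the function names (`tokenize`, `generate_speech_tokens`, `decode_stream`, `synthesize`), the `voice_id` parameter, and the toy arithmetic standing in for the model and decoder are all assumptions made for the example.

```python
from typing import Iterator, List


def tokenize(text: str) -> List[int]:
    """Stage 1 (stand-in): collapse whitespace and map characters to token ids."""
    cleaned = " ".join(text.split())
    return [ord(ch) for ch in cleaned]


def generate_speech_tokens(text_tokens: List[int], voice_id: int) -> Iterator[int]:
    """Stage 2 (stand-in for the LLM): lazily emit one speech token per text token.

    A real model would condition on a voice embedding; here voice_id just
    shifts the token value so different voices give different output.
    """
    for tok in text_tokens:
        yield (tok + voice_id) % 4096


def decode_stream(speech_tokens: Iterator[int], tokens_per_chunk: int = 4) -> Iterator[bytes]:
    """Stage 3 (stand-in): emit an audio chunk as soon as enough tokens arrive."""
    buf: List[int] = []
    for tok in speech_tokens:
        buf.append(tok)
        if len(buf) == tokens_per_chunk:
            yield b"".join(t.to_bytes(2, "little") for t in buf)
            buf = []
    if buf:  # emit the final partial chunk
        yield b"".join(t.to_bytes(2, "little") for t in buf)


def synthesize(text: str, voice_id: int = 0) -> Iterator[bytes]:
    """Compose the three stages; nothing runs until the caller iterates."""
    return decode_stream(generate_speech_tokens(tokenize(text), voice_id))


if __name__ == "__main__":
    for chunk in synthesize("Hi there"):
        print(len(chunk), "bytes")
```

Because every stage is lazy, the first chunk is available after only `tokens_per_chunk` speech tokens. This is the property that keeps first-chunk latency low in a streaming design.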
Product Features
Ready for production with enterprise-grade reliability.
Streaming Output
WebSocket-based real-time audio delivery for immediate playback
Interruption Handling
Flush/clear on user speech to prevent audio overlap
Backchanneling
Natural "uh-huh", "I see" responses during conversation
Voice Cloning
LoRA fine-tuning + professional cloning for branded voices
Multi-Speaker
Consistent tone across conversations (July 2025 GA)
LiveKit Integration
Native plugin support for voice agent stack
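The flush/clear behavior described under Interruption Handling can be illustrated with a small playback queue. This is a sketch under assumptions, not the product's API: the `PlaybackQueue` class and its generation counter are hypothetical, and the counter is just one possible way to make sure chunks from an interrupted utterance are discarded even if they arrive after the flush.

```python
from collections import deque
from typing import Deque, Optional


class PlaybackQueue:
    """Holds synthesized audio chunks until the player consumes them."""

    def __init__(self) -> None:
        self._chunks: Deque[bytes] = deque()
        self._generation = 0  # bumped on every new utterance and every flush

    def start_utterance(self) -> int:
        """Begin a new utterance and return the id its chunks must carry."""
        self._generation += 1
        return self._generation

    def push(self, chunk: bytes, generation: int) -> bool:
        """Queue a chunk; reject it if its utterance was already interrupted."""
        if generation != self._generation:
            return False
        self._chunks.append(chunk)
        return True

    def pop(self) -> Optional[bytes]:
        """Return the next chunk to play, or None if nothing is queued."""
        return self._chunks.popleft() if self._chunks else None

    def flush(self) -> int:
        """Call when user speech is detected: drop pending audio, invalidate stragglers."""
        dropped = len(self._chunks)
        self._chunks.clear()
        self._generation += 1
        return dropped


if __name__ == "__main__":
    queue = PlaybackQueue()
    gen = queue.start_utterance()
    queue.push(b"chunk-1", gen)
    queue.push(b"chunk-2", gen)
    print(queue.flush())                 # user starts speaking
    print(queue.push(b"chunk-3", gen))   # late chunk from the old utterance
```

Clearing the buffer alone is not enough when synthesis and playback run concurrently, because a chunk generated before the interruption can still arrive afterwards. Tagging chunks with an utterance id is one simple guard against that overlap.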
Integration Details
Runs On: Baseten + NVIDIA TensorRT
Latency Budget: Competitive with ElevenLabs
Providers: LiveKit, WebSocket API, Baseten
Implementation: Voice cloning takes 1-2 weeks with roleplay recordings
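Since the latency budget is stated relative to another provider, it helps to measure first-chunk latency the same way for every streaming source. The helper below is a hypothetical sketch: `measure_stream` and `StreamTiming` are illustrative names, and `stream` is any iterable of audio chunks, such as a generator wrapping a WebSocket receive loop. It assumes the stream is lazy, so the request is effectively issued when iteration begins.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    first_chunk_s: Optional[float]  # None if the stream produced no audio
    total_s: float
    chunks: int


def measure_stream(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a chunk stream and record time to first chunk and total time."""
    start = clock()
    first: Optional[float] = None
    count = 0
    for _ in stream:
        if first is None:
            first = clock() - start  # latency the listener actually perceives
        count += 1
    return StreamTiming(first_chunk_s=first, total_s=clock() - start, chunks=count)


if __name__ == "__main__":
    def fake_tts():
        # Stand-in for a real provider: 50 ms to first chunk, then 10 ms per chunk.
        time.sleep(0.05)
        yield b"\x00" * 320
        for _ in range(3):
            time.sleep(0.01)
            yield b"\x00" * 320

    print(measure_stream(fake_tts()))
```

Running the same helper against each provider's stream, with identical text and network conditions, gives directly comparable first-chunk numbers.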
Frequently Asked Questions
Common questions about AnyreachTTS.
