Voicemail Detection That Actually Delivers
When Your Brand Speaks, Make Sure It Lands
Beep Detection: 96.1% (vs 89.9% DSP)
CPU Latency: 27.6ms (~50x faster)
Executive Summary
What we built
A combined semantic + acoustic voicemail detection system that handles both classification ("Is this voicemail?") AND timing ("When do I start speaking?").
Why it matters
34.4% of outbound calls hit voicemail. Failures waste ASR/LLM/TTS spend, damage brand perception, and mean your message—that appointment reminder, payment alert, or urgent callback—never gets delivered.
Results
- 96.1% beep detection accuracy (vs 89.9% DSP, 81.2% Gemini)
- 27.6ms latency on CPU (~50x faster than multimodal LLMs)
- Call success rates improved from 83.2% to 94.8% on live calls
Best for
- High-volume outbound campaigns
- Appointment reminders & payment alerts
- Automated callback systems
Limitations
- Performance varies on carriers with non-standard voicemail greetings
- Model trained on English greetings; other languages require fine-tuning
The Problem
Voicemail detection fails in three distinct ways. Each has different causes and different costs.
| Symptom | Cause | Business Cost |
|---|---|---|
| Bot doesn't recognize it's talking to a recording | Semantic detection fails on ambiguous greetings ("Hello?"), ASR lag (0.5-2s), or silent voicemails with no transcript to analyze | |
| Message gets clipped or incomplete | Spoke before the beep: classification was correct but timing was wrong. Transcripts lag 0.5-2s behind real-time audio. | |
| No detection signal available | No recorded greeting, just silence followed by a beep. Semantic detection has nothing to work with. The beep is the only signal. | |
How It Works
A two-layer detection system in which each layer covers the other's weaknesses.
Semantic Layer
Provides classification confidence: "This is voicemail"
- Analyzes transcript for voicemail cues
- "you've reached...", "please leave a message..."
- Detects long monologue with no turn-taking
Acoustic Layer (Beep Detector)
Provides precise timing: "Start speaking now"
- Learns acoustic signature of beep tones
- Not fixed frequency thresholds—trained on diverse recordings
- Handles carrier/region variation automatically
Combined Policy
Each covers the other's weaknesses
- Semantic catches clear greeting patterns
- Beep detection handles silent voicemails
- Beep detection solves timing problem transcripts can't
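The combined policy can be sketched as a small decision function. This is a minimal illustration, not the production logic: the `Signals` and `Action` types and the `decide` function are hypothetical names, and the 0.70 cutoff is borrowed from the low-confidence guardrail described under Safety Guardrails.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    DELIVER_MESSAGE = "deliver_message"            # beep heard: start speaking now
    WAIT_FOR_BEEP = "wait_for_beep"                # classified as voicemail: stay silent
    HOLD_FOR_MORE_SIGNAL = "hold_for_more_signal"  # unsure: keep listening conservatively


@dataclass
class Signals:
    vm_confidence: Optional[float]  # semantic score in [0, 1]; None if no transcript yet
    beep_detected: bool             # acoustic layer output for the latest audio window


# Assumed cutoff, taken from the "<70%" guardrail; tune per deployment.
VM_CONFIDENCE_CUTOFF = 0.70


def decide(signals: Signals) -> Action:
    """Combine the semantic and acoustic layers into one action."""
    # The beep is the timing signal, and the only signal on silent voicemails.
    if signals.beep_detected:
        return Action.DELIVER_MESSAGE
    # No transcript yet (ASR lag or silence): nothing to classify, keep listening.
    if signals.vm_confidence is None:
        return Action.HOLD_FOR_MORE_SIGNAL
    # Clear greeting pattern: treat as voicemail, but do not speak before the beep.
    if signals.vm_confidence >= VM_CONFIDENCE_CUTOFF:
        return Action.WAIT_FOR_BEEP
    # Ambiguous greeting such as "Hello?": wait for a beep or a live reply.
    return Action.HOLD_FOR_MORE_SIGNAL
```

With these assumptions, a silent voicemail (`vm_confidence=None`, `beep_detected=True`) goes straight to message delivery, while an ambiguous greeting scored at 0.32 holds for more signal instead of speaking over a possible live pickup.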
Benchmark Results
Interactive explorer comparing Anyreach to baseline methods across datasets and metrics.
Baseline Comparison: Accuracy (%)
| Name | Region | Accuracy | Precision | Recall | F1 | Latency | Samples |
|---|---|---|---|---|---|---|---|
| Anyreach Beep Detector | All | 96.1% | 96.5% | 95.4% | 95.9 | 27.6ms | 125,000±0.3 |
| Anyreach (GPU) | All | 96.1% | 96.5% | 95.4% | 95.9 | 2.5ms | 10,000±0.5 |
| DSP (Signal Processing) | All | 89.9% | 100% | 82.8% | 90.6 | 10ms | 125,000±0.7 |
| Gemini (Multimodal LLM) | All | 81.2% | 76.3% | 88.2% | 77.8 | 1320ms | 125,000±0.8 |
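As a quick check on how the columns relate, the sketch below recomputes F1 as the harmonic mean of precision and recall, and the latency ratio behind the "~50x" figure. It assumes the table uses the standard F1 definition; the function names are illustrative.

```python
def f1_score(precision_pct: float, recall_pct: float) -> float:
    """Harmonic mean of precision and recall, both given in percent."""
    if precision_pct + recall_pct == 0:
        return 0.0
    return 2 * precision_pct * recall_pct / (precision_pct + recall_pct)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


print(round(f1_score(96.5, 95.4), 1))   # Anyreach row -> 95.9
print(round(f1_score(100.0, 82.8), 1))  # DSP row -> 90.6
print(round(speedup(1320, 27.6), 1))    # Gemini vs Anyreach on CPU -> 47.8
```

The Anyreach and DSP rows reproduce exactly. The Gemini row's listed F1 (77.8) does not follow from its listed precision and recall under this formula, so it may have been averaged differently.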
Ablation Studies
We tested each approach in isolation to understand what works and why.
Key Takeaways
1. Classification alone isn't enough: 28% of failures are timing problems
2. DSP misses too many beeps (82.8% recall); LLMs are too slow (1,320ms)
3. Combined semantic + acoustic achieves +11.6 pts success rate improvement
Semantic detection only
Hypothesis: Text-based classification is sufficient for voicemail detection
Result: Fails on silent voicemails and on timing, because of ASR lag (0.5-2s)
DSP-only detection
Hypothesis: FFT-based beep detection at ~1kHz is sufficient
Result: 82.8% recall; misses too many beeps due to codec/carrier variation
Multimodal LLM (Gemini)
Hypothesis: Audio-capable LLMs can detect beeps
Result: 81.2% accuracy, 1,320ms latency; impractical for real-time
Combined semantic + acoustic
Hypothesis: A multi-signal approach lets each layer cover the other's weaknesses
Result: Production model; success rate on live calls rose from 83.2% to 94.8%
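To make the DSP-only baseline concrete, here is a minimal fixed-frequency tone check. It is not the benchmarked baseline: it swaps the full FFT for the Goertzel algorithm (a single-bin DFT), and the 8 kHz sample rate, 20 ms frame, and 0.5 energy-ratio threshold are all assumptions chosen for illustration.

```python
import math
from typing import List


def goertzel_power(samples: List[float], sample_rate: int, target_hz: float) -> float:
    """Squared magnitude of the DFT bin nearest target_hz (Goertzel recurrence)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)  # index of the nearest DFT bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def looks_like_beep(samples: List[float], sample_rate: int = 8000,
                    target_hz: float = 1000.0, ratio_threshold: float = 0.5) -> bool:
    """Flag a frame when most of its energy sits in the target frequency bin."""
    n = len(samples)
    total_energy = sum(x * x for x in samples)
    if n == 0 or total_energy == 0.0:
        return False  # empty or silent frame: no tone present
    # For a pure tone centred on the bin, bin power equals total_energy * n / 2,
    # so this fraction is close to 1.0 for an on-target tone and near 0 otherwise.
    tone_fraction = goertzel_power(samples, sample_rate, target_hz) / (total_energy * n / 2.0)
    return tone_fraction >= ratio_threshold


# Two 20 ms test frames at 8 kHz: one on target, one at an arbitrary other pitch.
frame_1000 = [math.sin(2 * math.pi * 1000 * i / 8000) for i in range(160)]
frame_1400 = [math.sin(2 * math.pi * 1400 * i / 8000) for i in range(160)]
print(looks_like_beep(frame_1000))  # True
print(looks_like_beep(frame_1400))  # False
```

The second frame is a clean tone that the check rejects only because its pitch differs from the hard-coded target. That is one way a fixed-frequency detector can reach high precision while missing beeps that vary by carrier or codec.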
Audio Examples
See how the model handles different voicemail scenarios.
Standard Voicemail (Success)
Transcript
[Voicemail] Please leave your message. [Agent] Good morning, my name is Grace. I'm calling from Mary's Center Dental Department. This message is for Lula H. I was calling to confirm your Cleaning on Thursday, August seventh at 12 noon...
Model Decision
Voicemail detected (semantic), waited for beep, full message delivered successfully
Silent Voicemail (Success)
Transcript
[No greeting - silent voicemail] [Agent] Good morning, my name is Grace. I'm calling from Mary's Center Dental Department to confirm an upcoming appointment for Kedste T...
Model Decision
No transcript available—beep detection triggered message delivery
Timing Failure (Clipped)
Transcript
[Voicemail] Please leave your message for— [Agent] Good morning, my name is Grace. I'm calling from Mary's Center Dental Department. This message is for Denis S...
Model Decision
Agent started speaking before greeting completed—first words overlapped with voicemail
Ambiguous Greeting
Transcript
Hello?
Model Decision
Low VM confidence (0.32), waited for additional signal—live pickup confirmed
Reliability & Rollout
How we safely deployed to production with continuous monitoring.
Rollout Timeline
Shadow Mode
Model runs in parallel, no production impact
Canary Rollout
5% of traffic with automatic rollback triggers
Full Rollout
100% of traffic with enhanced monitoring
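One common way to route a fixed share of traffic to a canary is deterministic hash bucketing, sketched below. This is an assumption about how such a split could be done, not a description of the actual rollout tooling; `in_canary` and the call ID format are hypothetical.

```python
import hashlib

CANARY_PERCENT = 5  # share of traffic in the canary stage


def in_canary(call_id: str, percent: int = CANARY_PERCENT) -> bool:
    """Assign a call to the canary group based on a stable hash of its ID."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# The same call always lands in the same group, so retries stay consistent.
sample_ids = [f"call-{i}" for i in range(10_000)]
share = sum(in_canary(cid) for cid in sample_ids) / len(sample_ids)
print(f"canary share: {share:.1%}")
```

Hashing the call ID rather than drawing a random number keeps assignment reproducible, which makes it easier to compare canary and control calls after the fact.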
Live Monitoring
Call Success Rate
Voicemail calls with successful message delivery
94.8%
Threshold: > 90%
Classification Accuracy
Correct voicemail vs live pickup detection
96.1%
Threshold: > 95%
P99 Latency
99th percentile detection latency (CPU)
48ms
Threshold: < 100ms
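The three monitoring thresholds above can be expressed as a simple health check, for example as the input to an automatic rollback trigger. The sketch below is illustrative only: the function names and the nearest-rank percentile method are assumptions, and only the threshold values come from this page.

```python
from typing import List

MIN_SUCCESS_RATE_PCT = 90.0  # Call Success Rate threshold: > 90%
MIN_ACCURACY_PCT = 95.0      # Classification Accuracy threshold: > 95%
MAX_P99_LATENCY_MS = 100.0   # P99 Latency threshold: < 100ms


def p99(latencies_ms: List[float]) -> float:
    """99th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def breached_thresholds(success_rate_pct: float, accuracy_pct: float,
                        latencies_ms: List[float]) -> List[str]:
    """Return the names of any metrics that violate their threshold."""
    breaches = []
    if success_rate_pct <= MIN_SUCCESS_RATE_PCT:
        breaches.append("call_success_rate")
    if accuracy_pct <= MIN_ACCURACY_PCT:
        breaches.append("classification_accuracy")
    if p99(latencies_ms) >= MAX_P99_LATENCY_MS:
        breaches.append("p99_latency_ms")
    return breaches
```

An empty list means all three metrics are within bounds. During a canary stage, any non-empty result could be treated as a rollback signal.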
Safety Guardrails
- Low confidence (<70%) triggers conservative wait-for-beep mode
- Beep detection fallback for silent voicemails with no transcript
- Carrier-specific thresholds for known edge cases
Product Features
Ready for production with enterprise-grade reliability.
~50x Faster Than LLMs
27.6ms on CPU vs 1,320ms for multimodal LLMs—fast enough for real-time decisions
Solves Both Problems
Classification AND timing in one system—no more clipped messages
Handles Silent Voicemails
Beep detection works when there's no transcript to analyze
Higher Recall Than DSP
95.4% recall vs 82.8%—catches beeps that signal processing misses
No GPU Required
Production-grade accuracy on CPU. GPU available for 2.5ms latency.
Carrier Variation Handled
Trained on diverse recordings across carriers, regions, and edge cases
Integration Details
Runs On
Edge (WebAssembly) or Server (Docker)
Latency Budget
<50ms P99 recommended
Providers
Twilio, Vonage, SIP, Custom WebSocket
Implementation
1-2 days typical
Methodology
How we built, trained, and evaluated this model.
Dataset
34.4% of outbound calls hit voicemail. Diverse carrier, region, and edge case coverage.
Labeling
Automated via carrier metadata + human review for edge cases. Failure mode analysis: 55.6% classification failures, 28.2% timing failures.
Evaluation Protocol
Before/after comparison on live calls. Success rate: 83.2% → 94.8% (+11.6 pts).
Known Limitations
- Model trained on English greetings; other languages require fine-tuning
- Performance varies on carriers with non-standard voicemail greetings
- Silent voicemail detection depends on beep presence
Ready to see this in action?
Book a technical walkthrough with our team to see how this research applies to your use case.
