Voicemail Detection That Actually Delivers

When Your Brand Speaks, Make Sure It Lands


+11.6 pts Success Rate

96.1% Accuracy

27.6ms CPU Latency

34.4% Calls Hit Voicemail


Compliance: SOC 2, HIPAA

Successful Deliveries: 94.8% (+11.6 pts)

Beep Detection: 96.1% (vs 89.9% DSP)

CPU Latency: 27.6ms (~50x faster than multimodal LLMs)

Executive Summary

What we built

A combined semantic + acoustic voicemail detection system that handles both classification ("Is this voicemail?") AND timing ("When do I start speaking?").

Why it matters

34.4% of outbound calls hit voicemail. Failures waste ASR/LLM/TTS spend, damage brand perception, and mean your message—that appointment reminder, payment alert, or urgent callback—never gets delivered.

Results

96.1% beep detection accuracy (vs 89.9% DSP, 81.2% Gemini)
27.6ms latency on CPU (~50x faster than multimodal LLMs)
Call success rates improved from 83.2% to 94.8% on live calls

Best for

→High-volume outbound campaigns
→Appointment reminders & payment alerts
→Automated callback systems

Limitations

While we achieved a 94.8% success rate, we're not at 100% yet; some edge cases remain.

The Problem

Current solutions fall short in three distinct ways, each with its own cause and business cost.

Failure Volume: 55.6%

Symptom

Bot doesn't recognize it's talking to a recording

Cause

Semantic detection fails on ambiguous greetings ("Hello?"), ASR lag (0.5-2s), or silent voicemails with no transcript to analyze

Business Cost

Pathological dialogue loops

Wasted ASR/LLM/TTS spend

Bot sounds broken; customers perceive it as a robocall

Failure Volume: 28.2%

Symptom

Message gets clipped or incomplete

Cause

The agent spoke before the beep: classification was correct, but timing was wrong. Transcripts lag 0.5-2s behind real-time audio.

Business Cost

First seconds of the message lost under the greeting

Customer receives garbled message

Incomplete information delivery or nothing at all

Failure Volume: 16.2%

Symptom

No transcript-based detection signal available

Cause

No recorded greeting—just silence followed by a beep. Semantic detection has nothing to work with. The beep is the only signal.

Business Cost

Full call duration wasted

Complete message delivery failure: no message delivered at all

How It Works

The system combines two complementary detection layers under a single policy. Here's what each contributes.

Semantic Layer

Provides classification confidence: "This is voicemail"

Analyzes transcript for voicemail cues
"you've reached...", "please leave a message..."
Detects long monologues with no turn-taking
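As a rough illustration of the transcript cues listed above, the sketch below scores a transcript by matching voicemail phrases and checking for a long remote monologue with no agent turn. The cue patterns, weights, the `Turn` type, and the 15-word monologue cutoff are all hypothetical; this is not the production semantic classifier, which this page does not describe.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical cue patterns modeled on the examples above; a production
# system would more likely use a trained text classifier.
VOICEMAIL_CUES = [
    r"you'?ve reached",
    r"please leave (a|your) message",
    r"after the (beep|tone)",
]


@dataclass
class Turn:
    speaker: str  # "remote" (callee side) or "agent"
    text: str


def semantic_vm_score(turns: List[Turn], monologue_words: int = 15) -> float:
    """Return a rough 0..1 voicemail score from transcript turns (illustrative)."""
    remote_text = " ".join(t.text.lower() for t in turns if t.speaker == "remote")
    if not remote_text.strip():
        return 0.0  # silent voicemail: nothing for the semantic layer to analyze
    cue_hits = sum(1 for pattern in VOICEMAIL_CUES if re.search(pattern, remote_text))
    # A long stretch of remote speech with no agent turn suggests a recording.
    agent_spoke = any(t.speaker == "agent" for t in turns)
    monologue = len(remote_text.split()) >= monologue_words and not agent_spoke
    return min(1.0, 0.4 * cue_hits + (0.3 if monologue else 0.0))
```

A bare "Hello?" scores 0.0 here, while "You've reached... please leave a message after the beep" saturates at 1.0. An empty transcript also scores 0.0, which is exactly the silent-voicemail gap the acoustic layer has to cover.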

Acoustic Layer (Beep Detector)

Provides precise timing: "Start speaking now"

Learns the acoustic signature of beep tones
Trained on diverse recordings rather than fixed frequency thresholds
Handles carrier/region variation automatically

Combined Policy

Each covers the other's weaknesses

Semantic catches clear greeting patterns
Beep detection handles silent voicemails
Beep detection solves the timing problem that transcripts can't
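The combined policy can be sketched as a small decision function. The 0.70 cutoff comes from the safety guardrails listed under Reliability; the function name, its inputs (including the hypothetical `live_signal` flag), and the action names are assumptions, and the real policy may weigh the signals differently.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    WAIT_FOR_BEEP = "wait_for_beep"  # stay quiet; do not talk over a greeting
    CONVERSE = "converse"            # live pickup: hand off to the dialogue agent
    LEAVE_MESSAGE = "leave_message"  # beep heard: start the voicemail message now


# Low-confidence cutoff from the guardrail "<70% triggers wait-for-beep mode".
CONFIDENCE_FLOOR = 0.70


def decide(vm_confidence: Optional[float], beep_detected: bool, live_signal: bool) -> Action:
    """Combine semantic confidence with acoustic timing (illustrative sketch).

    vm_confidence: semantic voicemail confidence, or None when there is no transcript.
    beep_detected: output of the acoustic beep detector.
    live_signal: hypothetical evidence of a live person, e.g. a reply to the agent.
    """
    # The beep is the timing signal, so it wins even for silent voicemails.
    if beep_detected:
        return Action.LEAVE_MESSAGE
    # Confident voicemail classification: suppress dialogue and wait for the beep.
    if vm_confidence is not None and vm_confidence >= CONFIDENCE_FLOOR:
        return Action.WAIT_FOR_BEEP
    # Low or missing confidence: engage only once a live person is confirmed.
    if live_signal:
        return Action.CONVERSE
    # Otherwise stay conservative and keep listening for the beep.
    return Action.WAIT_FOR_BEEP
```

Under these rules, an ambiguous "Hello?" at confidence 0.32 first yields WAIT_FOR_BEEP and switches to CONVERSE once a live signal arrives. A silent voicemail (no transcript) waits until the beep triggers LEAVE_MESSAGE.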

Benchmark Results

Comparison of Anyreach to baseline methods across metrics.

Baseline Comparison

Model	Accuracy	Precision	Recall	F1 Score	Latency
Anyreach Beep Detector	96.1%	96.5%	95.4%	95.90	27.6ms
DSP (Signal Processing)	89.9%	100%	82.8%	90.60	10ms
Gemini (Multimodal LLM)	81.2%	76.3%	88.2%	77.80	1320ms
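The F1 column is the harmonic mean of precision and recall. The short check below recomputes it from the table's precision and recall values for the Anyreach and DSP rows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs from the baseline comparison table.
ROWS = {
    "Anyreach Beep Detector": (96.5, 95.4),
    "DSP (Signal Processing)": (100.0, 82.8),
}

for name, (p, r) in ROWS.items():
    print(f"{name}: F1 = {f1_score(p, r):.1f}")
```

This prints 95.9 and 90.6, which match the table. Applying the same formula to the Gemini row's precision and recall gives about 81.8 rather than the listed 77.80. The page does not say how that figure was computed.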

Ablation Studies

We tested each approach in isolation to understand what works and why.

Key Takeaways

1. Classification alone isn't enough: 28% of failures are timing problems
2. DSP misses too many beeps (82.8% recall); multimodal LLMs are too slow (1,320ms)
3. Combined semantic + acoustic detection improves success rate by 11.6 pts

Semantic detection only

Hypothesis: text-based classification is sufficient for voicemail detection

-8.2 F1

-5.4 Recall

Fails on silent voicemails and suffers timing issues from ASR lag (0.5-2s)

DSP-only detection

Hypothesis: FFT-based beep detection at ~1 kHz is sufficient

-5.3 F1

-12.6 Recall

82.8% recall—misses too many beeps due to codec/carrier variation
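For contrast with the learned detector, here is a minimal sketch of the kind of fixed-threshold FFT baseline this ablation tests. It is not the benchmarked DSP implementation. The 8 kHz sample rate, 900-1100 Hz band, energy-ratio threshold, and 5-frame persistence rule are all assumptions.

```python
from typing import Optional

import numpy as np

SAMPLE_RATE = 8000         # narrowband telephony audio (assumed)
FRAME = 256                # samples per analysis frame, 32 ms at 8 kHz
BAND_HZ = (900.0, 1100.0)  # fixed search band around ~1 kHz (assumed)
RATIO_THRESHOLD = 0.6      # share of frame energy that must fall in the band
MIN_FRAMES = 5             # consecutive tonal frames (~160 ms) to confirm a beep


def band_energy_ratio(frame: np.ndarray) -> float:
    """Fraction of a Hann-windowed frame's spectral energy inside BAND_HZ."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / SAMPLE_RATE)
    total = spectrum.sum()
    if total == 0.0:
        return 0.0  # digital silence
    in_band = spectrum[(freqs >= BAND_HZ[0]) & (freqs <= BAND_HZ[1])].sum()
    return float(in_band / total)


def detect_beep(audio: np.ndarray) -> Optional[float]:
    """Return the time in seconds at which a ~1 kHz tone is confirmed, else None."""
    run = 0
    for start in range(0, len(audio) - FRAME + 1, FRAME):
        if band_energy_ratio(audio[start:start + FRAME]) >= RATIO_THRESHOLD:
            run += 1
            if run == MIN_FRAMES:
                return (start + FRAME) / SAMPLE_RATE
        else:
            run = 0
    return None
```

Because the band and thresholds are fixed, any beep that falls outside the assumed band, or that a codec distorts below the energy ratio, is never confirmed. This is the kind of rigidity the results above associate with DSP's lower recall.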

Multimodal LLM (Gemini)

Hypothesis: audio-capable LLMs can detect beeps

-18.1 F1

+1292.4ms Latency

81.2% accuracy, 1,320ms latency—impractical for real-time

Combined semantic + acoustic

Winner

Multi-signal approach with each covering the other's weaknesses

+5.3 F1 (vs DSP-only)

+12.6 Recall (vs DSP-only)

Production model: 83.2% → 94.8% success rate on live calls

Audio Examples

See how the model handles different voicemail scenarios.

Standard Voicemail (Success)

Status: Delivered (standard)

Duration: 58.0s


Transcript

[Voicemail] Please leave your message. [Agent] Good morning, my name is Grace. I'm calling from Mary's Center Dental Department. This message is for Lula H. I was calling to confirm your Cleaning on Thursday, August seventh at 12 noon...

Model Decision

Voicemail detected (semantic), waited for beep, full message delivered successfully

Silent Voicemail (Success)

Status: Delivered (silent, edge case)

Duration: 62.0s


Transcript

[No greeting - silent voicemail] [Agent] Good morning, my name is Grace. I'm calling from Mary's Center Dental Department to confirm an upcoming appointment for Kedste T...

Model Decision

No transcript available—beep detection triggered message delivery

Timing Failure (Clipped)

Status: Clipped (timing failure)

Duration: 55.0s


Transcript

[Voicemail] Please leave your message for— [Agent] Good morning, my name is Grace. I'm calling from Mary's Center Dental Department. This message is for Denis S...

Model Decision

Agent started speaking before the greeting completed, so its first words overlapped with the voicemail

Ambiguous Greeting

Status: Delivered (ambiguous, live pickup)

Duration: 4.1s

Audio not available

Transcript

Hello?

Model Decision

Low voicemail confidence (0.32): waited for an additional signal, then confirmed a live pickup

Reliability

Production monitoring and guardrails ensuring consistent performance.

Live Monitoring

Call Success Rate

Voicemail calls with successful message delivery

94.8%

Threshold: > 90%

healthy

Classification Accuracy

Correct voicemail vs live pickup detection

96.1%

Threshold: > 95%

healthy

P99 Latency

99th percentile detection latency (CPU)

48ms

Threshold: < 100ms

healthy
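The three monitoring thresholds translate directly into a health check. This sketch assumes per-decision latencies are collected as a list and uses the nearest-rank definition of P99. The metric keys and the "alert" label are hypothetical.

```python
from typing import Dict, List, Tuple

# Thresholds copied from the live-monitoring cards above.
THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "call_success_rate": (">", 90.0),        # percent
    "classification_accuracy": (">", 95.0),  # percent
    "p99_latency_ms": ("<", 100.0),          # milliseconds
}


def p99(latencies_ms: List[float]) -> float:
    """Nearest-rank 99th percentile of per-decision latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) using integer math
    return ordered[rank - 1]


def health(metrics: Dict[str, float]) -> Dict[str, str]:
    """Label each metric 'healthy' or 'alert' against its threshold."""
    status = {}
    for name, (op, limit) in THRESHOLDS.items():
        value = metrics[name]
        ok = value > limit if op == ">" else value < limit
        status[name] = "healthy" if ok else "alert"
    return status
```

Feeding in the published values (94.8, 96.1, 48ms) marks every metric healthy. A P99 of 120ms would flip only the latency check to "alert".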

Safety Guardrails

Low confidence (<70%) triggers conservative wait-for-beep mode
Beep detection fallback for silent voicemails with no transcript
Carrier-specific thresholds for known edge cases

Product Features

Ready for production with enterprise-grade reliability.

~50x Faster Than LLMs

27.6ms on CPU vs 1,320ms for multimodal LLMs—fast enough for real-time decisions

Solves Both Problems

Classification AND timing in one system, so fewer messages get clipped

Handles Silent Voicemails

Beep detection works when there's no transcript to analyze

Higher Recall Than DSP

95.4% recall vs 82.8%—catches beeps that signal processing misses

No GPU Required

Production-grade accuracy on CPU; optional GPU inference cuts latency to 2.5ms.

Carrier Variation Handled

Trained on diverse recordings across carriers, regions, and edge cases

Integration Details

Runs On

Edge (WebAssembly) or Server (Docker)

Latency Budget

<50ms P99 recommended

Providers

Twilio, Vonage, SIP, Custom WebSocket

Implementation

1-2 days typical
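On the Twilio path, Media Streams delivers call audio over a WebSocket as JSON messages. Each `media` event carries base64-encoded 8 kHz G.711 μ-law audio in `media.payload`, so a detector that expects linear PCM would need to decode it first. The sketch below covers only that decoding step. The function names are hypothetical, and how samples are then handed to the detector is not described on this page.

```python
import base64
import json
from typing import List, Optional


def ulaw_to_linear(byte: int) -> int:
    """Decode one G.711 mu-law byte into a 16-bit linear PCM sample."""
    u = ~byte & 0xFF  # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84  # remove the bias
    return -sample if sign else sample


def pcm_from_media_message(raw: str) -> Optional[List[int]]:
    """Extract PCM samples from a Twilio Media Streams 'media' message."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None  # 'connected', 'start', 'stop', etc. carry no audio
    payload = base64.b64decode(msg["media"]["payload"])
    return [ulaw_to_linear(b) for b in payload]
```

Decoding byte by byte in pure Python keeps the sketch dependency-free. A production service would more likely use a lookup table or a vectorized decoder to stay within the latency budget.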


Ready to see this in action?

Book a technical walkthrough with our team to see how this research applies to your use case.

Book a Technical Walkthrough