
anyreach-asr: Speech Recognition for Voice Agents

Sub-300ms streaming transcription across 50+ languages with domain-adaptive accuracy

Voice agents live or die by transcription quality.

  • 5.26% WER (batch) -- 47% lower than competitors
  • 6.84% WER (streaming)
  • <300ms time-to-first-token (TTFT)
  • 50+ languages, with regional dialects

Compliance: SOC 2, HIPAA

Executive Summary

What we built

anyreach-asr is a high-accuracy, low-latency speech recognition engine powering Anyreach's real-time voice agent pipeline. It handles noisy telephony audio, accented speech, domain-specific vocabulary, and multilingual code-switching out of the box.

Why it matters

Voice agents live or die by transcription quality. A misheard medication name, a garbled email address, or a 500ms+ ASR delay creates cascading failures downstream -- wrong LLM responses, broken conversation flow, lost customer trust. Every word matters, every millisecond counts.

Results

  • 5.26% median WER on batch (47.4% lower than competitors)
  • 6.84% median WER on streaming (54.3% lower than competitors)
  • Sub-300ms time-to-first-token for real-time streaming
  • Real-time multilingual code-switching across 10 languages
  • Up to 90% improvement in domain-specific term recognition via keyterm prompting
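For reference, word error rate is the word-level edit distance between a hypothesis and a reference transcript, divided by the reference length. A minimal self-contained sketch (not the benchmark harness used for the numbers above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of five reference words -> 0.2 (20% WER)
print(wer("please refill my clindamycin prescription",
          "please refill my cinnamon prescription"))  # 0.2
```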

Best for

  • High-volume outbound/inbound voice agent calls
  • Multilingual customer support (code-switching callers)
  • Healthcare, finance, and legal transcription requiring domain accuracy
  • Real-time captioning and compliance recording

Limitations

  • Streaming PII redaction currently English-only (batch supports all languages)
  • Audio Intelligence features (sentiment, summarization) English-only for now

The Problem

Current solutions fall short. Each failure mode has a distinct cause and a distinct business cost.

Symptom: High word error rates in noisy telephony
Cause: Background noise, codec artifacts, and far-field microphones cause generic ASR models to produce garbage transcripts.
Business cost:
  • Wasted compute on wrong transcripts
  • The LLM receives wrong input and generates wrong responses

Symptom: Name and email spelling failures
Cause: A caller spells out "B as in Bravo, R-O-W-N at gmail dot com" and generic ASR outputs "brown at gmail.com" or, worse, "be our own a gmail com." Proper nouns, email addresses, and alphanumeric sequences are among the hardest recognition tasks.
Business cost:
  • Customer service callbacks to fix data
  • Wrong contact info = lost customer

Symptom: Accent and dialect misrecognition
Cause: Indian English, Australian English, Swiss German, regional Arabic dialects -- generic models struggle with non-standard pronunciation. A caller pronouncing "schedule" as "shed-yool" (British) vs. "sked-jool" (American), or Arabic speakers from Egypt vs. Morocco, produce different phonetic patterns.
Business cost:
  • Repeat requests and clarifications
  • WER degrades significantly on models not designed for dialectal variation

Symptom: Multilingual caller confusion
Cause: The caller switches between Spanish and English mid-sentence. Monolingual ASR produces gibberish for the non-English segments.
Business cost:
  • Escalations to human agents
  • Common in US healthcare, customer support, and sales contexts

Symptom: Latency-induced conversation breakdown
Cause: ASR taking 500ms+ creates awkward "walkie-talkie" pauses, so the voice agent can't respond naturally. Competitors like OpenAI Whisper have 500ms+ TTFT and lack native streaming entirely.
Business cost:
  • Callers hang up due to delays
  • Robotic, unnatural conversation flow

Symptom: Filler words polluting LLM input
Cause: "Um, so, uh, I wanted to, uh, check on my, um, appointment" -- if filler words aren't handled, the LLM receives noisy input that degrades response quality.
Business cost:
  • Lower-quality LLM responses
  • Stripping fillers unconditionally loses information needed for verbatim legal/compliance transcripts

How It Works

The pipeline runs in four stages, from raw audio in to structured, LLM-ready transcripts out.

Audio Ingestion

Accepts raw audio from telephony/WebSocket

  • Works directly on unprocessed audio without noise reduction preprocessing
  • Preserves acoustic cues critical for accuracy
  • Handles codec variations and telephony artifacts

Core Recognition Engine

Transformer-based architecture with latent space audio embedding

  • Handles accents, dialects, and noisy conditions natively
  • Supports 50+ languages with regional variants
  • Sub-300ms streaming latency with interim results

Language & Formatting

Smart formatting and domain adaptation

  • Punctuation, capitalization, paragraph breaks
  • Entity detection (50+ types)
  • Filler word handling and keyterm prompting

Output & Intelligence

Speaker diarization and PII protection

  • Word-level timestamps with speaker labels
  • Real-time PII/PHI/PCI redaction
  • Sentiment analysis, topic detection, summarization

Product Features

Ready for production with enterprise-grade reliability.

Sub-300ms Streaming Latency

200-300ms time-to-first-token, with partial/interim transcripts delivered while the caller is still speaking. Endpointing (voice activity detection) is configurable from 10ms to 500ms+ to tune for chatbot-style short utterances vs natural conversation. Competitors like Whisper lack native streaming entirely (500ms+ TTFT). This eliminates "walkie-talkie" delays that make voice agents feel robotic.
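Endpointing trades responsiveness for completeness: a short silence threshold finalizes utterances quickly for chatbot-style turns, a long one tolerates natural pauses. A toy simulation of a silence-based endpointer (illustrative only; class name and frame handling are assumptions, not the Anyreach implementation):

```python
class Endpointer:
    """Finalizes the current utterance after `endpoint_ms` of continuous silence."""

    def __init__(self, endpoint_ms: int = 300):
        self.endpoint_ms = endpoint_ms
        self.silence_ms = 0

    def feed(self, frame_is_speech: bool, frame_ms: int = 10) -> bool:
        """Feed one VAD frame; returns True when the utterance should be finalized."""
        if frame_is_speech:
            self.silence_ms = 0  # any speech resets the silence timer
            return False
        self.silence_ms += frame_ms
        return self.silence_ms >= self.endpoint_ms

ep = Endpointer(endpoint_ms=300)
frames = [True] * 50 + [False] * 40  # 500 ms of speech, then silence
fired_at = next(i for i, f in enumerate(frames) if ep.feed(f))
print(fired_at)  # 79: fires on the 30th silence frame, i.e. after 300 ms of silence
```

Lowering `endpoint_ms` toward 10ms makes the agent snappier but risks cutting callers off mid-thought.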

50+ Languages, 10 Simultaneous

Supports languages from English to Arabic (17 dialects) to South Asian languages. Real-time code-switching handles multilingual callers across English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch simultaneously -- without explicit language detection or routing.

Domain-Adaptive Vocabulary (Keyterm Prompting)

Accepts up to 100 custom terms per request. Domain-specific words like "Clindamycin" improve from 71% to 96% confidence instantly, no model retraining required. Supports proper nouns (company names, person names), product names, medical/legal/financial jargon. One customer saw 625% improvement in veterinary term recognition.
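Keyterm prompting is configured per request. A sketch of building such a request, enforcing the 100-term cap (the field names here are illustrative, not the actual Anyreach API schema):

```python
def build_keyterm_request(audio_url: str, keyterms: list[str]) -> dict:
    """Attach domain vocabulary to a transcription request.

    Hypothetical payload shape for illustration; consult the real API
    reference for actual field names.
    """
    if len(keyterms) > 100:
        raise ValueError("keyterm prompting accepts at most 100 terms per request")
    return {
        "audio_url": audio_url,
        "keyterms": keyterms,   # boosts recognition, no retraining needed
        "smart_format": True,
    }

req = build_keyterm_request(
    "https://example.com/call.wav",
    ["Clindamycin", "Anyreach", "prior authorization"],
)
```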

Name & Email Spelling Accuracy

Entity detection identifies 50+ entity types including person names, email addresses, phone numbers, and SSNs in real-time. Combined with keyterm prompting for expected proper nouns, and smart formatting that automatically structures emails/URLs/phone numbers. Handles letter-by-letter spelling dictation in voice agent workflows.
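To make the spelling-dictation problem concrete, here is a toy heuristic that collapses letter-by-letter dictation into a literal address. This is a simplified illustration of the idea, not the production entity-detection pipeline:

```python
import re

def normalize_spelled_email(utterance: str) -> str:
    """Turn 'B as in Bravo, R-O-W-N at gmail dot com' into 'brown@gmail.com'.

    Toy heuristic for illustration only.
    """
    text = utterance.lower()
    # "x as in word" -> just the letter x
    text = re.sub(r"\b([a-z]) as in \w+", r"\1", text)
    # spoken separators -> symbols
    text = text.replace(" at ", " @ ").replace(" dot ", " . ")
    # hyphenated letter runs: "r-o-w-n" -> "r o w n"
    text = text.replace("-", " ").replace(",", " ")
    return "".join(text.split())

print(normalize_spelled_email("B as in Bravo, R-O-W-N at gmail dot com"))
# brown@gmail.com
```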

Accent & Dialect Recognition

Not just language support but explicit dialect handling: 5 English variants (US, AU, GB, IN, NZ), Swiss German (de-CH), Flemish (nl-BE), Canadian French (fr-CA), Brazilian vs European Portuguese (pt-BR, pt-PT), Latin American Spanish (es-419), and 17 Arabic regional variants (Egypt, Morocco, Saudi, UAE, etc.). Handles regional phonetic shifts and non-standardized pronunciation patterns.

Filler Word Handling

Recognizes "um", "uh", "uh-huh", "mhmm", and "nuh-uh." Default behavior strips "um" and "uh" for clean LLM-ready transcripts. Verbatim mode (`filler_words=true`) preserves all disfluencies with consistent spelling normalization regardless of spoken duration. Essential for sales coaching (measuring confidence), legal transcription (verbatim record), public speaking analysis, and language instruction.
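The default-vs-verbatim behavior can be sketched in a few lines. This post-processing regex is an assumption for illustration; the actual engine handles fillers during decoding:

```python
import re

def strip_fillers(text: str, filler_words: bool = False) -> str:
    """Strip 'um'/'uh' for LLM-ready output; verbatim mode keeps everything.

    The (?!-) lookahead keeps backchannels like 'uh-huh' intact, since
    only standalone 'um' and 'uh' are stripped by default.
    """
    if filler_words:
        return text  # verbatim mode: preserve all disfluencies
    return re.sub(r"\b(?:um|uh)\b(?!-),?\s*", "", text, flags=re.IGNORECASE).strip()

print(strip_fillers("Um, so, uh, I wanted to, uh, check on my, um, appointment"))
# so, I wanted to, check on my, appointment
```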

Smart Formatting

Automatic punctuation, capitalization, and paragraph breaks. For English: dates, times, currency amounts, phone numbers, email addresses, and URLs are formatted correctly. Works across all languages with broadest support for English.

Speaker Diarization

Word-level speaker labels with precise start/end timestamps and confidence scores. Identifies who said what and when. Works in both streaming (speaker IDs) and batch (IDs + confidence scores) modes.
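A common consumer-side step is grouping word-level diarized output into conversation turns. A sketch, assuming a word-list shape like the one below (field names are illustrative, not the exact response schema):

```python
from itertools import groupby

# Illustrative shape of word-level diarized output
words = [
    {"word": "hi",    "speaker": 0, "start": 0.00, "end": 0.21},
    {"word": "there", "speaker": 0, "start": 0.21, "end": 0.48},
    {"word": "hello", "speaker": 1, "start": 0.60, "end": 0.95},
]

def to_turns(words):
    """Group consecutive same-speaker words into turns with span timestamps."""
    turns = []
    for spk, grp in groupby(words, key=lambda w: w["speaker"]):
        grp = list(grp)
        turns.append({
            "speaker": spk,
            "text": " ".join(w["word"] for w in grp),
            "start": grp[0]["start"],
            "end": grp[-1]["end"],
        })
    return turns

for t in to_turns(words):
    print(f'[{t["start"]:.2f}-{t["end"]:.2f}] Speaker {t["speaker"]}: {t["text"]}')
# [0.00-0.48] Speaker 0: hi there
# [0.60-0.95] Speaker 1: hello
```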

Real-Time PII Redaction

Supports 50+ entity types across PII (names, locations, SSNs), PHI (medical conditions, drugs, blood types), and PCI (credit card numbers, CVV, expiration). HIPAA-compliant. Granular control -- choose specific entity types to redact or use category groups.
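Conceptually, redaction replaces detected entity spans with typed placeholders. A minimal sketch of the masking step, assuming the detector has already produced character-offset spans (the tuple shape is an assumption, not the real output format):

```python
def redact(text: str, entities, categories={"pii", "phi", "pci"}) -> str:
    """Replace entity spans with [TYPE] placeholders.

    `entities` mimics detector output as (start, end, type, category)
    character-offset tuples. Spans are applied right-to-left so earlier
    offsets remain valid after each replacement.
    """
    for start, end, etype, cat in sorted(entities, key=lambda e: -e[0]):
        if cat in categories:
            text = text[:start] + f"[{etype.upper()}]" + text[end:]
    return text

msg = "My SSN is 123-45-6789 and card 4111 1111 1111 1111"
ents = [(10, 21, "ssn", "pii"), (31, 50, "credit_card", "pci")]
print(redact(msg, ents))
# My SSN is [SSN] and card [CREDIT_CARD]
```

Passing a narrower `categories` set mirrors the granular control described above, e.g. redacting PII while leaving PCI intact.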

Noise Robustness

Works directly on raw, unprocessed audio rather than relying on noise reduction preprocessing (which can actually degrade accuracy by removing acoustic cues). Handles significant speaker-to-microphone distance, overlapping speech, background noise, codec artifacts. Proven in air traffic control, drive-thru, call center, and clinical environments.

Integration Details

Runs On

Anyreach Cloud or Self-Hosted (AWS, GCP, Azure)

Latency Budget

<300ms TTFT streaming

Interfaces

REST API, WebSocket Streaming, Python SDK, Node.js SDK, .NET SDK

Implementation

1-2 days typical

Frequently Asked Questions

Common questions about our speech recognition system.

Ready to see this in action?

Book a technical walkthrough with our team to see how this research applies to your use case.