Real-Time Multilingual Voice Translation

Automatic Speech Translation with Sub-Second Latency

Healthcare and enterprise clients require real-time multilingual support.

  • Spanish BLEU: 45.09 (CoVoST2)
  • Arabic BLEU: 48.19 (CoVoST2)
  • S2T latency: 0.763s (Seamless)
  • Language pairs: 4+
  • Metrics: BLEU, COMET
  • Compliance: SOC2, HIPAA

Executive Summary

What we built

A comprehensive Automatic Speech Translation evaluation and deployment framework comparing open-source and proprietary models across multiple language pairs and datasets.

Why it matters

Healthcare and enterprise clients require real-time multilingual support. Model selection requires empirical benchmarks on domain-relevant data, and latency vs. quality tradeoffs must be quantified for production decisions.

Results

  • Spanish BLEU 45.09 with Voxtral-Small-24B on CoVoST2
  • Arabic BLEU 48.19 with Seamless-m4t-v2-large
  • Sub-second latency (0.763s) for S2T translation
  • Benchmarked Spanish, Arabic, Russian, and Cantonese

Best for

  • Healthcare multilingual support
  • Real-time voice translation
  • Enterprise global deployments
  • Low-latency translation applications

Limitations

  • Cantonese support still developing (Seamless has generation errors)
  • Proprietary models have higher latency (1.8-2.3s)
  • Quality varies by model and language pair

How It Works

Three translation architectures, each trading latency against modularity and control.

Speech-to-Text (S2T)

Direct speech to translated text

  • Lower latency than the traditional pipeline
  • Best for applications that add a separate TTS step
  • Seamless-m4t-v2-large achieves 0.763s

Speech-to-Speech (S2S)

End-to-end speech translation

  • Lowest latency possible
  • Preserves speaker characteristics
  • Less controllable output

Traditional Pipeline

ASR → MT/LLM → TTS

  • Modular and debuggable
  • Higher latency due to multiple steps
  • Errors can propagate through the pipeline
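
The cascaded ASR → MT → TTS flow can be sketched as a chain of stages with per-stage timing, which shows why total latency is the sum of the steps and why an ASR mistake reaches every later stage. This is a minimal sketch: `fake_asr`, `fake_mt`, `fake_tts`, and `run_pipeline` are hypothetical stand-ins, not the deployed services.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-ins for real ASR, MT, and TTS services (assumptions).
def fake_asr(audio: bytes) -> str:
    return "hola mundo"

def fake_mt(text: str) -> str:
    # Unknown input passes through unchanged, so an ASR error propagates.
    return {"hola mundo": "hello world"}.get(text, text)

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

Stage = Tuple[str, Callable[[Any], Any]]

def run_pipeline(audio: bytes, stages: List[Stage]) -> Tuple[Any, Dict[str, float]]:
    """Run the stages in order; return the final output and seconds per stage."""
    timings: Dict[str, float] = {}
    value: Any = audio
    for name, stage in stages:
        start = time.perf_counter()
        value = stage(value)
        timings[name] = time.perf_counter() - start
    return value, timings

STAGES: List[Stage] = [("asr", fake_asr), ("mt", fake_mt), ("tts", fake_tts)]
output, timings = run_pipeline(b"\x00\x01", STAGES)
total_latency = sum(timings.values())  # end-to-end latency is the sum of stages
```

Because each stage is a separate callable, any one of them can be swapped or logged in isolation, which is the modularity the list above refers to.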

Benchmark Results

Benchmark comparison of models across languages, datasets, and metrics.

Baseline Comparison: Accuracy (%)

Name                   Region   accuracy  precision  recall  f1    latency  Samples
Seamless-m4t-v2-large  Russian  53.96%    88.94%     87%     87.9  763ms    10,000 ±0.7
Seamless-m4t-v2-large  Arabic   48.19%    88.9%      86%     87.4  763ms    10,000 ±0.6
Voxtral-Small-24B      Spanish  45.09%    87.67%     85%     86.3  1034ms   10,000 ±0.6
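
The latency vs. quality tradeoff in these rows can be made concrete with a small selection helper. The sketch below assumes the first score column holds BLEU, since its values match the BLEU figures in the Results section; `BenchmarkRow` and `pick_model` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class BenchmarkRow:
    model: str
    language: str
    bleu: float        # assumption: first score column of the table is BLEU
    latency_ms: int

# Values copied from the benchmark rows above.
ROWS: List[BenchmarkRow] = [
    BenchmarkRow("Seamless-m4t-v2-large", "Russian", 53.96, 763),
    BenchmarkRow("Seamless-m4t-v2-large", "Arabic", 48.19, 763),
    BenchmarkRow("Voxtral-Small-24B", "Spanish", 45.09, 1034),
]

def pick_model(rows: List[BenchmarkRow], language: str, budget_ms: int) -> Optional[BenchmarkRow]:
    """Return the highest-BLEU row for a language within the latency budget, or None."""
    candidates = [r for r in rows if r.language == language and r.latency_ms <= budget_ms]
    return max(candidates, key=lambda r: r.bleu, default=None)

arabic_choice = pick_model(ROWS, "Arabic", budget_ms=1000)
spanish_strict = pick_model(ROWS, "Spanish", budget_ms=1000)
spanish_relaxed = pick_model(ROWS, "Spanish", budget_ms=2000)
```

With a 1000 ms budget, the Spanish row (1034 ms) is filtered out, so `spanish_strict` is `None`; relaxing the budget to 2000 ms selects Voxtral-Small-24B.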

Product Features

Ready for production with enterprise-grade reliability.

Sub-Second Latency

Seamless-m4t-v2-large achieves 0.763s average latency for speech-to-text translation.

Multiple Language Pairs

Benchmarked Spanish, Arabic, Russian, and Cantonese with production-ready metrics.

Model Flexibility

Choose among models optimized for quality (Voxtral), latency (Seamless), or cost (Canary).

Comprehensive Metrics

BLEU for word n-gram overlap with reference translations, COMET for learned, model-based semantic quality, and chrF++ for character-level (plus short word n-gram) overlap.
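
To illustrate what a character-level metric measures, here is a simplified character n-gram F-score in the spirit of chrF. It is a sketch, not the official implementation and not necessarily how the figures on this page were computed: it omits the word n-grams that chrF++ adds, and the defaults (n-grams up to 6, beta of 2, whitespace ignored) are assumptions modeled on common chrF settings.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score on a 0-100 scale (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

exact = simple_chrf("hello world", "hello world")
partial = simple_chrf("the cat sat", "the cat sat down")
disjoint = simple_chrf("abc", "xyz")
```

An exact match scores 100, strings with no shared characters score 0, and a hypothesis that drops a word from the reference lands in between because recall falls while precision stays high.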

Baseten Deployment

Production-ready infrastructure with Seamless, Canary, and Voxtral models deployed.

Concurrency Support

Canary-1b-v2 handles 32 concurrent requests at a throughput of 5.09 requests/s.
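
On the client side, a concurrency ceiling like the one quoted above can be respected with a semaphore. The sketch below is an illustration under stated assumptions: `translate_stub` is a hypothetical placeholder for a real model call, and the limit of 32 is taken from the figure above rather than from any documented API setting.

```python
import asyncio
from typing import Iterable, List, Tuple

MAX_CONCURRENCY = 32  # concurrency figure quoted above

async def translate_stub(clip_id: int) -> str:
    """Hypothetical placeholder for a real translation request."""
    await asyncio.sleep(0.01)
    return f"translation-{clip_id}"

async def translate_all(clip_ids: Iterable[int], limit: int = MAX_CONCURRENCY) -> Tuple[List[str], int]:
    """Translate all clips with at most `limit` requests in flight; also report the peak."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def worker(clip_id: int) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # blocks while `limit` requests are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await translate_stub(clip_id)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(worker(c) for c in clip_ids))
    return list(results), peak

results, peak = asyncio.run(translate_all(range(100)))
```

If the throughput figure is read as completed requests per second with all 32 slots busy, Little's law suggests a mean time in system of roughly 32 / 5.09 ≈ 6.3 s per request under that load; this is an inference from the two quoted numbers, not a measured value.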

Integration Details

Runs On

Baseten Infrastructure

Latency Budget

<1s for S2T, <2s for full pipeline

Providers

Seamless-m4t-v2-large, Canary-1b-v2, Voxtral-Small-24B

Implementation

1-2 weeks for production deployment

Ready to see this in action?

Book a technical walkthrough with our team to see how this research applies to your use case.