A Unified Quality Score for Voice Agents
Anyreach Voice Metric (AVM)
No universal voice agent quality metric exists today.
Reference Source: Top 1% (best reps)
Scoring Method: LLM Judge (automated)
Executive Summary
What we built
The Anyreach Voice Metric (AVM) is a unified call-handling score that uses top 1% human representative calls as the gold standard, generating simulations and LLM-judged comparisons for automated quality assessment.
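As a rough sketch of that flow (every function and field name below is an illustrative placeholder, not the production API), a scenario is simulated, paired with a curated human reference call, and scored by the LLM judge before being rolled up into a single AVM value:

```python
# Illustrative end-to-end flow for producing one AVM score.
# All callables are injected; names and the 1-5 rubric scale are assumptions.

from dataclasses import dataclass
from typing import Callable


@dataclass
class AVMResult:
    scenario_id: str
    dimension_scores: dict[str, float]  # e.g. {"relevance": 4.0, "task_success": 5.0}
    overall: float                      # normalized 0-1 AVM score


def score_scenario(
    scenario_id: str,
    run_simulation: Callable[[str], str],           # returns the AI agent transcript
    fetch_reference: Callable[[str], str],          # returns the matched top-1% human transcript
    judge: Callable[[str, str], dict[str, float]],  # rubric scores, assumed 1-5 scale
) -> AVMResult:
    transcript = run_simulation(scenario_id)
    reference = fetch_reference(scenario_id)
    dims = judge(transcript, reference)
    overall = sum(dims.values()) / (len(dims) * 5.0)  # collapse dimensions to a 0-1 score
    return AVMResult(scenario_id, dims, overall)
```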
Why it matters
No universal voice agent quality metric exists today. Manual QA doesn't scale and is inconsistent. AVM enables objective, reproducible quality measurement for CI/CD gates and continuous improvement.
Results
- Target ≥0.8 correlation with human QA raters
- Reference-based comparison to top 1% human calls
- Automated LLM-as-Judge scoring
- CI/CD quality gates for deployments
Best for
- Quality measurement and tracking
- Deployment gating decisions
- Version comparison and regression detection
- Continuous improvement feedback loops
Limitations
- Requires top 1% human reference calls
- LLM judge calibration process needed
- Reference-free evaluation less reliable
How It Works
A reference-based evaluation pipeline: curate top 1% human calls as the benchmark, score AI transcripts against them with an LLM judge, and calibrate the judge against human raters.
Reference Dataset
Top 1% human representative calls
- Select best human conversations as gold standard
- Combination of outcome metrics and qualitative review
- Curated benchmark for comparison
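A minimal sketch of the selection step, assuming each call record carries a handful of outcome metrics (the metric names and weights below are illustrative); the top percentile by composite score then goes to qualitative review before entering the benchmark:

```python
# Select candidate reference calls: rank by a composite outcome score and
# keep the top 1% for human qualitative review. Metric names and weights
# are illustrative assumptions, not the production definition.

from dataclasses import dataclass


@dataclass
class CallRecord:
    call_id: str
    task_success: float       # 0 or 1
    csat: float               # customer satisfaction, normalized to 0-1
    handle_time_score: float  # 0-1, higher = closer to target handle time


def composite(call: CallRecord) -> float:
    return 0.5 * call.task_success + 0.3 * call.csat + 0.2 * call.handle_time_score


def top_percentile(calls: list[CallRecord], pct: float = 0.01) -> list[CallRecord]:
    ranked = sorted(calls, key=composite, reverse=True)
    k = max(1, int(len(ranked) * pct))
    return ranked[:k]  # candidates for qualitative review before entering the benchmark
```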
LLM-as-Judge
Automated comparison engine
- Compare AI transcripts to human references
- Apply standardized scoring rubric
- Generate scores across multiple dimensions
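A sketch of the comparison call, assuming an OpenAI-compatible chat client with JSON-mode output; the model name and the abbreviated rubric text are placeholders:

```python
# Sketch of the judge call: compare an AI transcript against a human reference
# using a rubric and return per-dimension scores as JSON. Assumes an
# OpenAI-compatible client; model name and rubric text are placeholders.

import json
from openai import OpenAI

RUBRIC = """Score the AI transcript against the human reference on a 1-5 scale for:
relevance, tone, timing, task_success, satisfaction.
Return JSON: {"relevance": n, "tone": n, "timing": n, "task_success": n, "satisfaction": n}."""

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(ai_transcript: str, reference_transcript: str) -> dict[str, float]:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"HUMAN REFERENCE CALL:\n{reference_transcript}\n\n"
                f"AI AGENT CALL:\n{ai_transcript}"
            )},
        ],
    )
    return json.loads(response.choices[0].message.content)
```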
Calibration Process
Human-LLM alignment validation
- Human baseline scoring on sample calls
- LLM scoring same calls for correlation
- Iterate rubric until ≥0.8 correlation
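The correlation check itself can be as simple as the sketch below (Pearson correlation via NumPy; the production calibration may use additional statistics):

```python
# Calibration check: correlate LLM-judge scores with human-rater scores on the
# same sample of calls; the rubric is iterated until correlation reaches 0.8.

import numpy as np


def calibration_check(human_scores: list[float], llm_scores: list[float],
                      target: float = 0.8) -> tuple[float, bool]:
    r = float(np.corrcoef(human_scores, llm_scores)[0, 1])
    return r, r >= target


# Example: scores for the same six sample calls from both raters (made-up values).
human = [4.0, 3.5, 5.0, 2.0, 4.5, 3.0]
llm = [4.2, 3.0, 4.8, 2.5, 4.4, 3.2]
r, passed = calibration_check(human, llm)
print(f"Pearson r = {r:.2f}, calibrated: {passed}")
```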
Reliability & Rollout
How we safely deployed to production with continuous monitoring.
Rollout Timeline
Phase 1: Define Rubrics
Define rubrics and scoring criteria
Phase 2: Build LLM Judge
Build LLM judge prompts
Phase 3: Calibrate
Calibrate against human raters
Phase 4: Integrate
Integrate with simulation pipeline
Phase 5: Deploy Gates
Deploy CI/CD gates
Product Features
Ready for production with enterprise-grade reliability.
Human-Reference Comparison
Compare AI agent performance against top 1% human representative calls as the gold standard.
LLM-as-Judge Automation
Scalable automated scoring using an LLM judge with validated rubrics.
Multi-Dimensional Assessment
Turn-level (relevance, tone, timing) and conversation-level (task success, satisfaction) metrics.
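One way the two levels might be represented and rolled up into a single score; the dimension names come from the description above, while the weights and the 1-5 scale are illustrative assumptions:

```python
# Two-level score schema: turn-level dimensions averaged across turns, then
# combined with conversation-level dimensions. Weights and scale are illustrative.

from dataclasses import dataclass
from statistics import mean


@dataclass
class TurnScore:
    relevance: float  # 1-5
    tone: float       # 1-5
    timing: float     # 1-5


@dataclass
class ConversationScore:
    task_success: float  # 1-5
    satisfaction: float  # 1-5


def aggregate(turns: list[TurnScore], conv: ConversationScore) -> float:
    turn_avg = mean(mean([t.relevance, t.tone, t.timing]) for t in turns)
    conv_avg = mean([conv.task_success, conv.satisfaction])
    # 40% turn-level, 60% conversation-level, normalized to 0-1 (assumed weights).
    return (0.4 * turn_avg + 0.6 * conv_avg) / 5.0
```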
CI/CD Quality Gates
Block deployments that fall below quality thresholds or show significant regression.
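A minimal sketch of the gate logic, assuming the CI job has a batch of AVM scores for the candidate build and for the deployed baseline; the threshold and regression margin are illustrative:

```python
# CI/CD gate: fail the deployment if the candidate's mean AVM score falls below
# an absolute threshold or regresses against the deployed baseline by more than
# a set margin. Threshold and margin values are illustrative.

import sys
from statistics import mean


def quality_gate(candidate: list[float], baseline: list[float],
                 threshold: float = 0.75, max_regression: float = 0.05) -> bool:
    cand, base = mean(candidate), mean(baseline)
    if cand < threshold:
        print(f"FAIL: mean AVM {cand:.3f} below threshold {threshold}")
        return False
    if base - cand > max_regression:
        print(f"FAIL: regression of {base - cand:.3f} vs baseline {base:.3f}")
        return False
    print(f"PASS: mean AVM {cand:.3f} (baseline {base:.3f})")
    return True


if __name__ == "__main__":
    # In CI this would load scores from the simulation run; the exit code gates the deploy.
    ok = quality_gate(candidate=[0.82, 0.79, 0.85], baseline=[0.80, 0.81, 0.83])
    sys.exit(0 if ok else 1)
```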
Continuous Calibration
Periodic re-calibration ensures scores remain aligned with human judgment.
ReSim Lab Integration
Part of broader simulation harness with AI judges, scenario generator, and A/B pipelines.
Integration Details
Runs On
Cloud (LLM Judge) + Simulation Pipeline
Latency Budget
Batch processing, not real-time
Providers
Bot-to-Bot Simulation, LLM Evaluation
Implementation
Multi-phase rollout
Frequently Asked Questions
Common questions about the Anyreach Voice Metric.
Ready to see this in action?
Book a technical walkthrough with our team to see how this research applies to your use case.
