
Anyreach Voice Metric (AVM)

A Unified Quality Score for Voice Agents

No universal voice agent quality metric exists today.

  • Human-Reference Comparison
  • LLM-as-Judge Automation
  • Validated Scoring Rubric
  • Continuous Calibration
  • Compliance: SOC 2, HIPAA
  • Human QA Correlation Target: ≥0.8
  • Reference Source: Top 1% best reps
  • Scoring Method: Automated LLM Judge
Executive Summary

What we built

The Anyreach Voice Metric (AVM) is a unified call-handling quality score. It uses the top 1% of human representative calls as the gold standard, then generates simulations and LLM-judged comparisons against those references for automated quality assessment.

Why it matters

No universal voice agent quality metric exists today. Manual QA doesn't scale and is inconsistent. AVM enables objective, reproducible quality measurement for CI/CD gates and continuous improvement.

Results

  • Target ≥0.8 correlation with human QA raters
  • Reference-based comparison to top 1% human calls
  • Automated LLM-as-Judge scoring
  • CI/CD quality gates for deployments

Best for

  • Quality measurement and tracking
  • Deployment gating decisions
  • Version comparison and regression detection
  • Continuous improvement feedback loops

Limitations

  • Requires top 1% human reference calls
  • LLM judge calibration process needed
  • Reference-free evaluation less reliable

How It Works

A reference-based scoring pipeline: a curated set of top human calls, an automated LLM judge, and a calibration loop that keeps the judge aligned with human QA.

Reference Dataset

Top 1% human representative calls

  • Select best human conversations as gold standard
  • Combination of outcome metrics and qualitative review
  • Curated benchmark for comparison
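
A minimal sketch of how the reference set might be assembled, assuming each human call already carries outcome metrics and a qualitative review flag; the field names (csat, resolved, reviewer_approved) and the 1% cutoff logic are illustrative, not Anyreach's actual schema.

```python
from dataclasses import dataclass

@dataclass
class HumanCall:
    call_id: str
    transcript: str
    resolved: bool           # did the rep complete the caller's task?
    csat: float              # post-call satisfaction score, 1-5
    reviewer_approved: bool  # passed qualitative QA review

def select_reference_calls(calls: list[HumanCall], top_fraction: float = 0.01) -> list[HumanCall]:
    """Keep the best human calls: outcome-filtered, review-gated, ranked by satisfaction."""
    # Hard filters first: only resolved calls that a human reviewer approved.
    candidates = [c for c in calls if c.resolved and c.reviewer_approved]
    # Rank the survivors by satisfaction and keep roughly the top 1%.
    candidates.sort(key=lambda c: c.csat, reverse=True)
    k = max(1, int(len(candidates) * top_fraction))
    return candidates[:k]
```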

LLM-as-Judge

Automated comparison engine

  • Compare AI transcripts to human references
  • Apply standardized scoring rubric
  • Generate scores across multiple dimensions
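
A minimal sketch of the judging step, assuming an OpenAI-style chat completions client; the model name, rubric dimensions, and prompt wording are illustrative stand-ins for the validated production rubric.

```python
import json
from openai import OpenAI  # assumption: an OpenAI-compatible client is available

client = OpenAI()

DIMENSIONS = ["relevance", "tone", "timing", "task_success", "satisfaction"]  # illustrative rubric

SYSTEM_PROMPT = (
    "You are a call-quality judge. Compare the AI agent transcript to the gold-standard "
    "human transcript for the same scenario. Score the AI transcript from 1 to 5 on each "
    "dimension: {dims}. Respond with JSON: {{\"scores\": {{dimension: score}}, \"rationale\": \"...\"}}."
)

def judge_call(ai_transcript: str, reference_transcript: str) -> dict:
    """Score one AI call against its human reference using the LLM judge."""
    response = client.chat.completions.create(
        model="gpt-4o",                           # illustrative model choice
        temperature=0,                            # deterministic scoring for reproducibility
        response_format={"type": "json_object"},  # force parseable output
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(dims=", ".join(DIMENSIONS))},
            {"role": "user", "content": f"HUMAN REFERENCE:\n{reference_transcript}\n\n"
                                        f"AI TRANSCRIPT:\n{ai_transcript}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```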

Calibration Process

Human-LLM alignment validation

  • Human baseline scoring on sample calls
  • LLM scoring same calls for correlation
  • Iterate rubric until ≥0.8 correlation
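
A minimal sketch of the calibration check, assuming paired human and LLM scores for the same sample of calls; Pearson correlation via scipy is one reasonable choice of statistic, and the 0.8 threshold mirrors the target above.

```python
from scipy.stats import pearsonr  # assumption: scipy is available; Spearman would also be reasonable

TARGET_CORRELATION = 0.8

def check_calibration(human_scores: list[float], llm_scores: list[float]) -> tuple[float, bool]:
    """Correlate LLM judge scores with human QA scores on the same calls."""
    r, _p = pearsonr(human_scores, llm_scores)
    return r, r >= TARGET_CORRELATION

# If the correlation falls short, revise the rubric and judge prompt, then re-score.
r, aligned = check_calibration([4.5, 3.0, 4.0, 2.5, 3.5], [4.2, 3.1, 3.8, 2.9, 3.6])
print(f"human-LLM correlation: {r:.2f}, aligned: {aligned}")
```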

Reliability & Rollout

How we roll out safely to production with continuous monitoring.

Rollout Timeline

Phase 1: Define Rubrics (Planning, pending)

Define rubrics and scoring criteria

Phase 2: Build LLM Judge (Development, pending)

Build LLM judge prompts

Phase 3: Calibrate (Validation, pending)

Calibrate against human raters

Phase 4: Integrate (Integration, pending)

Integrate with simulation pipeline

Phase 5: Deploy Gates (Production, pending)

Deploy CI/CD gates


Product Features

Designed for production use with enterprise-grade reliability.

Human-Reference Comparison

Compare AI agent performance against top 1% human representative calls as the gold standard.

LLM-as-Judge Automation

Scalable automated scoring using an LLM judge with validated rubrics.

Multi-Dimensional Assessment

Turn-level (relevance, tone, timing) and conversation-level (task success, satisfaction) metrics.
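
A minimal sketch of how the two score levels might be represented; the dimension names come from the description above, while the 50/50 roll-up into a single AVM number is an illustrative assumption rather than the production weighting.

```python
from dataclasses import dataclass

@dataclass
class TurnScores:
    relevance: float   # 1-5
    tone: float        # 1-5
    timing: float      # 1-5

@dataclass
class ConversationScores:
    task_success: float  # 1-5
    satisfaction: float  # 1-5

@dataclass
class AVMResult:
    turns: list[TurnScores]
    conversation: ConversationScores

    def overall(self) -> float:
        """Illustrative roll-up: mean turn score blended 50/50 with the conversation score."""
        turn_avg = sum((t.relevance + t.tone + t.timing) / 3 for t in self.turns) / len(self.turns)
        conv_avg = (self.conversation.task_success + self.conversation.satisfaction) / 2
        return 0.5 * turn_avg + 0.5 * conv_avg
```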

CI/CD Quality Gates

Block deployments that fall below quality thresholds or show significant regression.
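
A minimal sketch of such a gate, assuming batches of AVM scores for the candidate build and the currently deployed baseline; the quality floor and regression margin are illustrative thresholds.

```python
import statistics
import sys

MIN_AVM = 4.0          # illustrative absolute quality floor on a 1-5 scale
MAX_REGRESSION = 0.15  # illustrative allowed drop versus the deployed baseline

def gate(candidate_scores: list[float], baseline_scores: list[float]) -> bool:
    """Return True if the candidate build may ship, False to block the deployment."""
    candidate = statistics.mean(candidate_scores)
    baseline = statistics.mean(baseline_scores)
    if candidate < MIN_AVM:
        print(f"FAIL: mean AVM {candidate:.2f} is below the floor of {MIN_AVM}")
        return False
    if baseline - candidate > MAX_REGRESSION:
        print(f"FAIL: regression of {baseline - candidate:.2f} versus baseline {baseline:.2f}")
        return False
    print(f"PASS: mean AVM {candidate:.2f} (baseline {baseline:.2f})")
    return True

if __name__ == "__main__":
    # In CI these lists would come from the simulation + judging pipeline for each build.
    sys.exit(0 if gate([4.3, 4.1, 4.4], [4.2, 4.2, 4.3]) else 1)
```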

Continuous Calibration

Periodic re-calibration ensures scores remain aligned with human judgment.

ReSim Lab Integration

Part of a broader simulation harness with AI judges, a scenario generator, and A/B pipelines.

Integration Details

Runs On: Cloud (LLM Judge) + Simulation Pipeline
Latency Budget: Batch processing, not real-time
Providers: Bot-to-Bot Simulation, LLM Evaluation
Implementation: Multi-phase rollout

Frequently Asked Questions

Common questions about the Anyreach Voice Metric.

Ready to see this in action?

Book a technical walkthrough with our team to see how this research applies to your use case.