
Quality & Evaluation Framework

AI-Powered Testing, Simulation, and Assessment at Scale

Traditional QA samples only 1-5% of calls with expensive human reviewers.

100% Coverage
Human-Calibrated Accuracy
Real-Time Dashboards
Automated Testing
Compliance: SOC 2, HIPAA
Call Coverage
100% (AI assessment)

Human Correlation
≥0.8 (calibrated)

Simulation Throughput
1,000/hr (target)

Executive Summary

What we built

A comprehensive quality and evaluation framework combining automated testing (bot-to-bot simulation), AI-powered assessment (LLM judges), production monitoring (Voice QA), and model validation protocols — enabling quality assurance from development through production.

Why it matters

Traditional QA samples only 1-5% of calls with expensive human reviewers. Manual testing cannot scale with rapid iteration cycles. Without systematic model validation, assessment accuracy silently degrades. Our framework provides 100% call coverage, repeatable testing, and human-calibrated accuracy.

Results

  • 100% call coverage vs 1-5% sampling with traditional QA
  • ≥0.8 correlation with human expert scores when properly calibrated
  • Text simulation 10-100x faster than voice calls
  • Target 1,000 simulations/hour for regression testing
  • ≥90% of blocking issues caught pre-production
  • Real-time dashboards with customizable metrics

Best for

  • Enterprise voice agent deployments
  • CI/CD quality gates and regression testing
  • Continuous quality improvement
  • A/B testing and version comparison
  • Compliance monitoring and benchmarking

Limitations

  • Complex assessments take minutes post-call
  • Human calibration required for accuracy
  • Voice simulation still in progress
  • Eval datasets must be refreshed as workflows evolve

The Problem

Quality assurance for voice agents fails in three distinct ways. Each has different causes and different costs.

Failure Mode 1

Symptom

"Everything is green" in reporting, but humans observe failures rising

Cause

Assessment model changed without validation — the measurement tool itself degraded

Business Cost

Delayed detection of quality issues
Wasted resources optimizing for incorrect metrics
You can't improve what you can't measure — bad assessments produce bad decisions
Failure Mode 2

Symptom

Critical issues in production not caught by QA sampling

Cause

Traditional 1-5% sampling misses edge cases and distributed failures

Business Cost

Emergency fixes and customer remediation
Customer complaints about issues QA "missed"
Failure Mode 3

Symptom

Testing bottleneck prevents rapid improvement cycles

Cause

Voice calls take too long for comprehensive testing

Business Cost

Delayed feature releases and bug fixes
Engineering time spent waiting for test results

How It Works

A two-layer detection system where each covers the other's weaknesses.

Bot-to-Bot Simulation

Automated testing through agent-vs-agent conversations

  • Text simulation 10-100x faster than voice
  • Repeatable scenarios for regression detection
  • Target 1,000 simulations/hour throughput
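The text layer can be as simple as a turn loop between the agent under test and a persona bot. Below is a minimal sketch of that idea; `agent_reply` and `persona_reply` are hypothetical callables standing in for the production agent and the simulated customer, not part of a documented anyreach API.

    # Minimal sketch of a text-only bot-to-bot simulation loop (assumed design,
    # not the actual anyreach harness). `agent_reply` and `persona_reply` each
    # take the transcript so far and return the next message.
    from dataclasses import dataclass, field

    @dataclass
    class SimulationRun:
        scenario_id: str
        transcript: list = field(default_factory=list)

    def run_text_simulation(scenario_id, opening, agent_reply, persona_reply, max_turns=20):
        """Alternate persona and agent turns until a goodbye or the turn cap."""
        run = SimulationRun(scenario_id=scenario_id)
        user_msg = opening
        for _ in range(max_turns):
            run.transcript.append({"role": "user", "text": user_msg})
            agent_msg = agent_reply(run.transcript)      # agent under test
            run.transcript.append({"role": "agent", "text": agent_msg})
            if "goodbye" in agent_msg.lower():           # crude end-of-call heuristic
                break
            user_msg = persona_reply(run.transcript)     # simulated customer bot
        return run

Because the loop is deterministic apart from the two bots, the same scenario can be replayed against every new agent version for regression detection.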

LLM Evaluation

AI judges for scalable quality assessment

  • Turn-level: relevance, tone, informativeness
  • Conversation-level: task success, satisfaction
  • Reference-based and scenario-based testing
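A minimal sketch of how the turn-level and conversation-level judges could be prompted. The rubric wording, the `call_llm` helper, and the JSON schema are illustrative assumptions, not the framework's actual prompts.

    # Sketch of turn-level and conversation-level LLM judging. `call_llm` is a
    # hypothetical helper that sends a rubric prompt to the configured judge
    # model and returns its raw text response.
    import json

    TURN_RUBRIC = (
        "Score the agent's turn from 1-5 on relevance, tone, and informativeness. "
        'Return JSON: {"relevance": int, "tone": int, "informativeness": int}.'
    )

    CONVERSATION_RUBRIC = (
        "Given the full transcript, return JSON with task_success (true/false) "
        "and satisfaction (1-5)."
    )

    def judge_turn(call_llm, context, agent_turn):
        raw = call_llm(system=TURN_RUBRIC,
                       user=f"Context:\n{context}\n\nAgent turn:\n{agent_turn}")
        return json.loads(raw)   # e.g. {"relevance": 5, "tone": 4, "informativeness": 4}

    def judge_conversation(call_llm, transcript):
        raw = call_llm(system=CONVERSATION_RUBRIC, user=transcript)
        return json.loads(raw)   # e.g. {"task_success": true, "satisfaction": 4}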

Voice QA

Production monitoring with 100% call coverage

  • 100% call assessment vs 1-5% sampling
  • Real-time dashboards in Metabase
  • Customizable metrics per use case
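Conceptually, the Voice QA layer iterates over every completed call rather than a sample, scores it, and writes the result to the table the dashboards read. A sketch under those assumptions, with `fetch_unscored_calls`, `score_call`, and `store_assessment` as hypothetical placeholders for the data-access and judging layers:

    # Sketch of the production Voice QA loop: every completed call is scored
    # and the result lands in the table the Metabase dashboards read from.
    def assess_all_calls(fetch_unscored_calls, score_call, store_assessment):
        """Assess 100% of completed calls instead of a 1-5% sample."""
        scored = 0
        for call in fetch_unscored_calls():
            assessment = score_call(call)              # per-metric LLM-judge scores
            store_assessment(call["id"], assessment)   # row feeds real-time dashboards
            scored += 1
        return scored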

Assessment Model Validation

Eval gates for model selection and changes

  • Model registry with approved versions
  • Eval harness with consistent test sets
  • Canary rollout with automatic rollback
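One way such an eval gate could look: a candidate judge model must clear an absolute agreement bar on the labeled eval set and stay close to the currently approved model before it is eligible for canary rollout. The thresholds and the `run_eval_harness` helper below are illustrative assumptions, not the framework's actual values.

    # Sketch of an eval gate for assessment-model changes (illustrative
    # thresholds and helper names).
    MIN_AGREEMENT = 0.90          # candidate vs. human labels on the fixed eval set
    MAX_AGREEMENT_DROP = 0.05     # allowed drop relative to the approved model

    def gate_assessment_model(candidate, approved, run_eval_harness):
        cand = run_eval_harness(candidate)    # e.g. {"agreement": 0.93}
        base = run_eval_harness(approved)
        if cand["agreement"] < MIN_AGREEMENT:
            return False
        if base["agreement"] - cand["agreement"] > MAX_AGREEMENT_DROP:
            return False
        return True  # eligible for canary rollout with automatic rollback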

Ablation Studies

We tested each approach in isolation to understand what works and why.

Key Takeaways

  1. 100% call assessment catches issues that 1-5% sampling misses
  2. Text simulation enables 10-100x faster testing cycles for rapid iteration
  3. Human calibration is essential: ≥0.8 correlation requires clear rubrics
  4. Assessment models need eval gates; "faster/cheaper" often means lower accuracy
  5. Multi-layer approach: simulation (pre-prod) + evaluation (CI/CD) + monitoring (production)

Traditional QA sampling (1-5%)
Hypothesis: Manual sampling is sufficient for quality assurance.
Result: -95 recall. Misses 95-99% of calls; cannot detect distributed failures or edge cases.

Voice-only testing
Hypothesis: Full voice simulation is necessary for all testing.
Result: +100ms latency. 10-100x slower than text; impractical for rapid iteration.

Faster/cheaper assessment model
Hypothesis: General-purpose models work for assessment tasks.
Result: Higher rate of incorrect assessments on edge cases; rejected for production.

Combined Framework (Winner)
Multi-layer approach with simulation, evaluation, and monitoring.
Result: +85 F1, +90 recall. Pre-prod simulation + LLM evaluation + 100% production monitoring = comprehensive coverage.

Reliability & Rollout

How we safely deployed to production with continuous monitoring.

Rollout Timeline

Text Simulation (Completed)
CLI-based bot-to-bot conversations for fast iteration
Throughput: 10-100x faster
Repeatability: 100%

LLM Evaluation (Completed)
Turn-level and conversation-level AI judges
Human Correlation: ≥0.8
Scale: 50-100+ evals

Voice QA (Active)
100% production call assessment with dashboards
Coverage: 100%
Dashboards: Real-time

Voice Simulation (In Progress)
Full TTS and latency testing at scale
Target Throughput: 1,000/hr
Audio QA: Full stack

Live Monitoring

  • Customer Satisfaction (post-interaction rating): ≥4.5/5; threshold: > 4.0; status: healthy
  • Task Completion Rate (goals successfully achieved): >90%; threshold: > 85%; status: healthy
  • Containment Rate (handled without human escalation): >85%; threshold: > 80%; status: healthy
  • Simulation Throughput (tests per hour): target 1,000/hr; threshold: > 500/hr; status: healthy

Safety Guardrails

  • Human calibration required for all assessment dimensions (≥0.8 correlation target)
  • Eval harness gates for all assessment model changes
  • Canary rollout with automatic rollback on disagreement spikes
  • Reference datasets refreshed quarterly as workflows evolve
  • CI/CD gates: min_task_success, min_satisfaction, max_error_rate
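A minimal sketch of the CI/CD gate check named in the last guardrail above. Only the gate names (min_task_success, min_satisfaction, max_error_rate) come from that list; the numeric thresholds shown are placeholders.

    # Minimal sketch of the CI/CD quality gate; threshold values are illustrative.
    GATES = {"min_task_success": 0.90, "min_satisfaction": 4.0, "max_error_rate": 0.02}

    def release_allowed(batch_metrics):
        """Block a release if any simulation-batch metric violates its gate."""
        return (
            batch_metrics["task_success"] >= GATES["min_task_success"]
            and batch_metrics["satisfaction"] >= GATES["min_satisfaction"]
            and batch_metrics["error_rate"] <= GATES["max_error_rate"]
        )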

Product Features

Ready for production with enterprise-grade reliability.

100% Call Assessment

AI evaluates every call, not just 1-5% sample, for complete quality visibility and edge case detection.

Bot-to-Bot Simulation

Repeatable agent-vs-agent conversations 10-100x faster than voice for rapid iteration and regression testing.

LLM Judges

Turn-level and conversation-level AI evaluation for relevance, tone, task success, and policy adherence.

Human-Calibrated Accuracy

≥0.8 correlation with human expert scores through continuous calibration and clear rubrics.
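The calibration check behind the ≥0.8 target can be expressed as a rank correlation between judge scores and human expert scores on the same calls. The sketch below uses Spearman correlation via SciPy as one reasonable choice; the exact correlation measure used in production is not specified here.

    # Sketch of a human-calibration check: rank-correlate LLM-judge scores with
    # human expert scores on the same set of calls.
    from scipy.stats import spearmanr

    def is_calibrated(judge_scores, human_scores, threshold=0.8):
        rho, _p_value = spearmanr(judge_scores, human_scores)
        return rho >= threshold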

Real-Time Dashboards

Metabase visualizations with drill-down, filtering, and automated alerting for production monitoring.

Model Validation Protocol

Eval gates prevent silent regression when assessment models change or providers are swapped.

Reference & Scenario Testing

Compare AI to actual human conversations or test specific behaviors (angry users, edge cases).
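For illustration, the two test styles could be declared as simple records: a reference test points at a real human conversation to compare against, while a scenario test specifies a persona and the behavior to check. The field names here are assumptions, not the framework's actual schema.

    # Illustrative declarations for reference-based and scenario-based tests.
    from dataclasses import dataclass

    @dataclass
    class ReferenceTest:
        call_id: str              # real human conversation to compare against
        min_similarity: float     # judge-scored similarity the AI transcript must reach

    @dataclass
    class ScenarioTest:
        persona: str              # e.g. "angry caller demanding a refund"
        goal: str                 # behavior the agent must exhibit
        must_include: list        # phrases or actions the judge checks for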

CI/CD Integration

Quality gates block releases that fail minimum thresholds for success, satisfaction, or error rate.

Customizable Metrics

Universal metrics plus use-case-specific measures for healthcare, lead gen, service, and more.

Multi-Layer Coverage

Pre-production simulation, CI/CD evaluation, and production monitoring create comprehensive quality assurance.

Integration Details

Runs On

LiveKit Rooms (simulation) + Cloud LLM Judges + Supabase + Metabase (monitoring)

Latency Budget

Real-time for production QA, batch for simulation/evaluation

Providers

LiveKit, Supabase, Metabase, AVM Framework, LLM Evaluation

Implementation

2-3 weeks for full pipeline setup

Frequently Asked Questions

Common questions about our quality and evaluation framework.

Methodology

How we built, calibrated, and evaluated this framework.

Dataset

Name: Production Calls + Simulation Corpus + Assessment Eval Dataset
Size: 100,000+ production calls, 1,000+ simulation templates, labeled eval sets

Comprehensive coverage of real conversations, simulated edge cases, and validated assessment test cases. Datasets refreshed quarterly as workflows evolve.

Labeling

Automated via carrier metadata + human expert scores for calibration. Assessment ground truth includes successes, failures, and ambiguous interactions.

Evaluation Protocol

Multi-layer validation: Simulation repeatability, LLM judge correlation with humans, production QA accuracy, assessment model disagreement rate monitoring.
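The disagreement-rate monitoring mentioned above could be as simple as re-scoring a sample of calls with the approved assessment model and comparing labels with the live model; a spike beyond a tolerance triggers rollback. The 5% tolerance and function names below are illustrative assumptions.

    # Sketch of a disagreement-rate monitor for assessment models.
    def disagreement_rate(live_labels, reference_labels):
        pairs = list(zip(live_labels, reference_labels))
        return sum(a != b for a, b in pairs) / max(len(pairs), 1)

    def should_roll_back(live_labels, reference_labels, tolerance=0.05):
        return disagreement_rate(live_labels, reference_labels) > tolerance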

Known Limitations

  • Complex assessments take minutes post-call
  • Human calibration required for each evaluation dimension
  • Voice simulation still in progress
  • Eval datasets must be refreshed as workflows evolve
  • Multi-party rooms (>2 participants) not yet supported

Evaluation Details

Last Evaluated: 2026-01-16
Model Version: quality-eval-v1.0
Eval Run: eval-20260116-combined
Commit: c8f5e2b

Ready to see this in action?

Book a technical walkthrough with our team to see how this research applies to your use case.