
Quality & Evaluation Framework

AI-Powered Testing, Simulation, and Assessment at Scale

Traditional QA samples only 1-5% of calls with expensive human reviewers.

100% Coverage
Human-Calibrated Accuracy
Real-Time Dashboards
Automated Testing
Compliance: SOC 2, HIPAA
Call Coverage
100% (AI assessment)

Human Correlation
≥0.8 (calibrated)

Simulation Throughput
1,000/hr (target)

Executive Summary

What we built

A comprehensive quality and evaluation framework combining automated testing (bot-to-bot simulation), AI-powered assessment (LLM judges), production monitoring (Voice QA), and model validation protocols — enabling quality assurance from development through production.

Why it matters

Traditional QA samples only 1-5% of calls with expensive human reviewers. Manual testing cannot scale with rapid iteration cycles. Without systematic model validation, assessment accuracy silently degrades. Our framework provides 100% call coverage, repeatable testing, and human-calibrated accuracy.

Results

  • 100% call coverage vs 1-5% sampling with traditional QA
  • ≥0.8 correlation with human expert scores when properly calibrated
  • Text simulation 10-100x faster than voice calls
  • Target 1,000 simulations/hour for regression testing
  • ≥90% of blocking issues caught pre-production
  • Real-time dashboards with customizable metrics

Best for

  • Enterprise voice agent deployments
  • CI/CD quality gates and regression testing
  • Continuous quality improvement
  • A/B testing and version comparison
  • Compliance monitoring and benchmarking

Limitations

  • Complex assessments take minutes post-call
  • Human calibration required for accuracy
  • Voice simulation still in progress
  • Eval datasets must be refreshed as workflows evolve

The Problem

Quality assurance for voice agents fails in three distinct ways. Each has different causes and different costs.

Failure Mode 1

Symptom

"Everything is green" in reporting, but humans observe failures rising

Cause

Assessment model changed without validation — the measurement tool itself degraded

Business Cost

Delayed detection of quality issues
Wasted resources optimizing for incorrect metrics
You can't improve what you can't measure — bad assessments produce bad decisions
Failure Mode 2

Symptom

Critical issues in production not caught by QA sampling

Cause

Traditional 1-5% sampling misses edge cases and distributed failures

Business Cost

Emergency fixes and customer remediation
Customer complaints about issues QA "missed"
Failure Mode 3

Symptom

Testing bottleneck prevents rapid improvement cycles

Cause

Voice calls take too long for comprehensive testing

Business Cost

Delayed feature releases and bug fixes
Engineering time spent waiting for test results

How It Works

A two-layer detection system where each covers the other's weaknesses.

Bot-to-Bot Simulation

Automated testing through agent-vs-agent conversations

  • Text simulation 10-100x faster than voice
  • Repeatable scenarios for regression detection
  • Target 1,000 simulations/hour throughput
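The text layer can be as simple as a turn loop between the agent under test and a persona bot. Below is a minimal sketch of that idea; `agent_reply` and `persona_reply` are hypothetical callables standing in for the production agent and the simulated customer, not part of a documented anyreach API.

    # Minimal sketch of a text-only bot-to-bot simulation loop (assumed design,
    # not the actual anyreach harness). `agent_reply` and `persona_reply` each
    # take the transcript so far and return the next message.
    from dataclasses import dataclass, field

    @dataclass
    class SimulationRun:
        scenario_id: str
        transcript: list = field(default_factory=list)

    def run_text_simulation(scenario_id, opening, agent_reply, persona_reply, max_turns=20):
        """Alternate persona and agent turns until a goodbye or the turn cap."""
        run = SimulationRun(scenario_id=scenario_id)
        user_msg = opening
        for _ in range(max_turns):
            run.transcript.append({"role": "user", "text": user_msg})
            agent_msg = agent_reply(run.transcript)      # agent under test
            run.transcript.append({"role": "agent", "text": agent_msg})
            if "goodbye" in agent_msg.lower():           # crude end-of-call heuristic
                break
            user_msg = persona_reply(run.transcript)     # simulated customer bot
        return run

Because the loop is deterministic apart from the two bots, the same scenario can be replayed against every new agent version for regression detection.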

LLM Evaluation

AI judges for scalable quality assessment

  • Turn-level: relevance, tone, informativeness
  • Conversation-level: task success, satisfaction
  • Reference-based and scenario-based testing
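A minimal sketch of how the turn-level and conversation-level judges could be prompted. The rubric wording, the `call_llm` helper, and the JSON schema are illustrative assumptions, not the framework's actual prompts.

    # Sketch of turn-level and conversation-level LLM judging. `call_llm` is a
    # hypothetical helper that sends a rubric prompt to the configured judge
    # model and returns its raw text response.
    import json

    TURN_RUBRIC = (
        "Score the agent's turn from 1-5 on relevance, tone, and informativeness. "
        'Return JSON: {"relevance": int, "tone": int, "informativeness": int}.'
    )

    CONVERSATION_RUBRIC = (
        "Given the full transcript, return JSON with task_success (true/false) "
        "and satisfaction (1-5)."
    )

    def judge_turn(call_llm, context, agent_turn):
        raw = call_llm(system=TURN_RUBRIC,
                       user=f"Context:\n{context}\n\nAgent turn:\n{agent_turn}")
        return json.loads(raw)   # e.g. {"relevance": 5, "tone": 4, "informativeness": 4}

    def judge_conversation(call_llm, transcript):
        raw = call_llm(system=CONVERSATION_RUBRIC, user=transcript)
        return json.loads(raw)   # e.g. {"task_success": true, "satisfaction": 4}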

Voice QA

Production monitoring with 100% call coverage

  • 100% call assessment vs 1-5% sampling
  • Real-time dashboards in Metabase
  • Customizable metrics per use case
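Conceptually, the Voice QA layer iterates over every completed call rather than a sample, scores it, and writes the result to the table the dashboards read. A sketch under those assumptions, with `fetch_unscored_calls`, `score_call`, and `store_assessment` as hypothetical placeholders for the data-access and judging layers:

    # Sketch of the production Voice QA loop: every completed call is scored
    # and the result lands in the table the Metabase dashboards read from.
    def assess_all_calls(fetch_unscored_calls, score_call, store_assessment):
        """Assess 100% of completed calls instead of a 1-5% sample."""
        scored = 0
        for call in fetch_unscored_calls():
            assessment = score_call(call)              # per-metric LLM-judge scores
            store_assessment(call["id"], assessment)   # row feeds real-time dashboards
            scored += 1
        return scored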

Assessment Model Validation

Eval gates for model selection and changes

  • Model registry with approved versions
  • Eval harness with consistent test sets
  • Canary rollout with automatic rollback
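One way such an eval gate could look: a candidate judge model must clear an absolute agreement bar on the labeled eval set and stay close to the currently approved model before it is eligible for canary rollout. The thresholds and the `run_eval_harness` helper below are illustrative assumptions, not the framework's actual values.

    # Sketch of an eval gate for assessment-model changes (illustrative
    # thresholds and helper names).
    MIN_AGREEMENT = 0.90          # candidate vs. human labels on the fixed eval set
    MAX_AGREEMENT_DROP = 0.05     # allowed drop relative to the approved model

    def gate_assessment_model(candidate, approved, run_eval_harness):
        cand = run_eval_harness(candidate)    # e.g. {"agreement": 0.93}
        base = run_eval_harness(approved)
        if cand["agreement"] < MIN_AGREEMENT:
            return False
        if base["agreement"] - cand["agreement"] > MAX_AGREEMENT_DROP:
            return False
        return True  # eligible for canary rollout with automatic rollback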

Ablation Studies

We tested each approach in isolation to understand what works and why.

Key Takeaways

  1. 100% call assessment catches issues that 1-5% sampling misses
  2. Text simulation enables 10-100x faster testing cycles for rapid iteration
  3. Human calibration is essential: ≥0.8 correlation requires clear rubrics
  4. Assessment models need eval gates; "faster/cheaper" often means lower accuracy
  5. Multi-layer approach: simulation (pre-prod) + evaluation (CI/CD) + monitoring (production)

Traditional QA sampling (1-5%)
Hypothesis: Manual sampling is sufficient for quality assurance.
Result: -95 recall. Misses 95-99% of calls; cannot detect distributed failures or edge cases.

Voice-only testing
Hypothesis: Full voice simulation is necessary for all testing.
Result: +100ms latency. 10-100x slower than text; impractical for rapid iteration.

Faster/cheaper assessment model
Hypothesis: General-purpose models work for assessment tasks.
Result: Higher rate of incorrect assessments on edge cases; rejected for production.

Combined Framework (Winner)
Multi-layer approach with simulation, evaluation, and monitoring.
Result: +85 F1, +90 recall. Pre-prod simulation + LLM evaluation + 100% production monitoring = comprehensive coverage.

Reliability & Rollout

How we safely deployed to production with continuous monitoring.

Rollout Timeline

Text Simulation (Completed)
CLI-based bot-to-bot conversations for fast iteration
Throughput: 10-100x faster
Repeatability: 100%

LLM Evaluation (Completed)
Turn-level and conversation-level AI judges
Human Correlation: ≥0.8
Scale: 50-100+ evals

Voice QA (Active)
100% production call assessment with dashboards
Coverage: 100%
Dashboards: Real-time

Voice Simulation (In Progress)
Full TTS and latency testing at scale
Target Throughput: 1,000/hr
Audio QA: Full stack

Live Monitoring

  • Customer Satisfaction (post-interaction rating): ≥4.5/5; threshold: > 4.0; status: healthy
  • Task Completion Rate (goals successfully achieved): >90%; threshold: > 85%; status: healthy
  • Containment Rate (handled without human escalation): >85%; threshold: > 80%; status: healthy
  • Simulation Throughput (tests per hour): target 1,000/hr; threshold: > 500/hr; status: healthy

Safety Guardrails

  • Human calibration required for all assessment dimensions (≥0.8 correlation target)
  • Eval harness gates for all assessment model changes
  • Canary rollout with automatic rollback on disagreement spikes
  • Reference datasets refreshed quarterly as workflows evolve
  • CI/CD gates: min_task_success, min_satisfaction, max_error_rate
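A minimal sketch of the CI/CD gate check named in the last guardrail above. Only the gate names (min_task_success, min_satisfaction, max_error_rate) come from that list; the numeric thresholds shown are placeholders.

    # Minimal sketch of the CI/CD quality gate; threshold values are illustrative.
    GATES = {"min_task_success": 0.90, "min_satisfaction": 4.0, "max_error_rate": 0.02}

    def release_allowed(batch_metrics):
        """Block a release if any simulation-batch metric violates its gate."""
        return (
            batch_metrics["task_success"] >= GATES["min_task_success"]
            and batch_metrics["satisfaction"] >= GATES["min_satisfaction"]
            and batch_metrics["error_rate"] <= GATES["max_error_rate"]
        )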

Product Features

Ready for production with enterprise-grade reliability.

100% Call Assessment

AI evaluates every call, not just 1-5% sample, for complete quality visibility and edge case detection.

Bot-to-Bot Simulation

Repeatable agent-vs-agent conversations 10-100x faster than voice for rapid iteration and regression testing.

LLM Judges

Turn-level and conversation-level AI evaluation for relevance, tone, task success, and policy adherence.

Human-Calibrated Accuracy

≥0.8 correlation with human expert scores through continuous calibration and clear rubrics.
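The calibration check behind the ≥0.8 target can be expressed as a rank correlation between judge scores and human expert scores on the same calls. The sketch below uses Spearman correlation via SciPy as one reasonable choice; the exact correlation measure used in production is not specified here.

    # Sketch of a human-calibration check: rank-correlate LLM-judge scores with
    # human expert scores on the same set of calls.
    from scipy.stats import spearmanr

    def is_calibrated(judge_scores, human_scores, threshold=0.8):
        rho, _p_value = spearmanr(judge_scores, human_scores)
        return rho >= threshold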

Real-Time Dashboards

Metabase visualizations with drill-down, filtering, and automated alerting for production monitoring.

Model Validation Protocol

Eval gates prevent silent regression when assessment models change or providers are swapped.

Reference & Scenario Testing

Compare AI to actual human conversations or test specific behaviors (angry users, edge cases).
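For illustration, the two test styles could be declared as simple records: a reference test points at a real human conversation to compare against, while a scenario test specifies a persona and the behavior to check. The field names here are assumptions, not the framework's actual schema.

    # Illustrative declarations for reference-based and scenario-based tests.
    from dataclasses import dataclass

    @dataclass
    class ReferenceTest:
        call_id: str              # real human conversation to compare against
        min_similarity: float     # judge-scored similarity the AI transcript must reach

    @dataclass
    class ScenarioTest:
        persona: str              # e.g. "angry caller demanding a refund"
        goal: str                 # behavior the agent must exhibit
        must_include: list        # phrases or actions the judge checks for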

CI/CD Integration

Quality gates block releases that fail minimum thresholds for success, satisfaction, or error rate.

Customizable Metrics

Universal metrics plus use-case-specific measures for healthcare, lead gen, service, and more.

Multi-Layer Coverage

Pre-production simulation, CI/CD evaluation, and production monitoring create comprehensive quality assurance.

Integration Details

Runs On

LiveKit Rooms (simulation) + Cloud LLM Judges + Supabase + Metabase (monitoring)

Latency Budget

Real-time for production QA, batch for simulation/evaluation

Providers

LiveKit, Supabase, Metabase, AVM Framework, LLM Evaluation

Implementation

2-3 weeks for full pipeline setup

Frequently Asked Questions

Common questions about our quality and evaluation framework.

Methodology

How we built, calibrated, and evaluated this framework.

Dataset

Name: Production Calls + Simulation Corpus + Assessment Eval Dataset
Size: 100,000+ production calls, 1,000+ simulation templates, labeled eval sets

Comprehensive coverage of real conversations, simulated edge cases, and validated assessment test cases. Datasets refreshed quarterly as workflows evolve.

Labeling

Automated via carrier metadata + human expert scores for calibration. Assessment ground truth includes successes, failures, and ambiguous interactions.

Evaluation Protocol

Multi-layer validation: Simulation repeatability, LLM judge correlation with humans, production QA accuracy, assessment model disagreement rate monitoring.
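The disagreement-rate monitoring mentioned above could be as simple as re-scoring a sample of calls with the approved assessment model and comparing labels with the live model; a spike beyond a tolerance triggers rollback. The 5% tolerance and function names below are illustrative assumptions.

    # Sketch of a disagreement-rate monitor for assessment models.
    def disagreement_rate(live_labels, reference_labels):
        pairs = list(zip(live_labels, reference_labels))
        return sum(a != b for a, b in pairs) / max(len(pairs), 1)

    def should_roll_back(live_labels, reference_labels, tolerance=0.05):
        return disagreement_rate(live_labels, reference_labels) > tolerance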

Known Limitations

  • Complex assessments take minutes post-call
  • Human calibration required for each evaluation dimension
  • Voice simulation still in progress
  • Eval datasets must be refreshed as workflows evolve
  • Multi-party rooms (>2 participants) not yet supported

Evaluation Details

Last Evaluated: 2026-01-16
Model Version: quality-eval-v1.0
Eval Run: eval-20260116-combined
Commit: c8f5e2b

Ready to see this in action?

Book a technical walkthrough with our team to see how this research applies to your use case.