Quality & Evaluation Framework
AI-Powered Testing, Simulation, and Assessment at Scale
Human Correlation: ≥0.8 (calibrated)
Simulation Throughput: 1,000/hr (target)
Executive Summary
What we built
A comprehensive quality and evaluation framework combining automated testing (bot-to-bot simulation), AI-powered assessment (LLM judges), production monitoring (Voice QA), and model validation protocols — enabling quality assurance from development through production.
Why it matters
Traditional QA samples only 1-5% of calls with expensive human reviewers. Manual testing cannot scale with rapid iteration cycles. Without systematic model validation, assessment accuracy silently degrades. Our framework provides 100% call coverage, repeatable testing, and human-calibrated accuracy.
Results
- 100% call coverage vs 1-5% sampling with traditional QA
- ≥0.8 correlation with human expert scores when properly calibrated
- Text simulation 10-100x faster than voice calls
- Target 1,000 simulations/hour for regression testing
- ≥90% of blocking issues caught pre-production
- Real-time dashboards with customizable metrics
Best for
- Enterprise voice agent deployments
- CI/CD quality gates and regression testing
- Continuous quality improvement
- A/B testing and version comparison
- Compliance monitoring and benchmarking
Limitations
- Complex assessments take minutes post-call
- Human calibration required for accuracy
- Voice simulation still in progress
- Eval datasets must be refreshed as workflows evolve
The Problem
Voice agent quality assurance fails in three distinct ways. Each has different causes and different costs.
Symptom: "Everything is green" in reporting, but humans observe failures rising
Cause: Assessment model changed without validation — the measurement tool itself degraded

Symptom: Critical issues in production not caught by QA sampling
Cause: Traditional 1-5% sampling misses edge cases and distributed failures

Symptom: Testing bottleneck prevents rapid improvement cycles
Cause: Voice calls take too long for comprehensive testing
How It Works
A multi-layer quality system in which each layer covers the others' weaknesses.
Bot-to-Bot Simulation
Automated testing through agent-vs-agent conversations
- Text simulation 10-100x faster than voice
- Repeatable scenarios for regression detection
- Target 1,000 simulations/hour throughput
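To make the simulation loop concrete, here is a minimal sketch of a bot-to-bot text run. It assumes a generic chat-completion callable (`llm`) plus illustrative prompt and field names; the actual CLI harness may differ.

```python
# Minimal bot-to-bot text simulation sketch. The `llm` callable, prompts, and
# turn limit are illustrative assumptions, not the production harness.
from dataclasses import dataclass, field

@dataclass
class SimTurn:
    role: str   # "agent" or "customer"
    text: str

@dataclass
class Simulation:
    agent_prompt: str            # system prompt of the agent under test
    persona_prompt: str          # simulated-customer persona, e.g. "frustrated caller"
    transcript: list[SimTurn] = field(default_factory=list)

    def run(self, llm, max_turns: int = 12) -> list[SimTurn]:
        """Alternate agent/customer turns until max_turns or the agent says goodbye."""
        self.transcript.append(SimTurn("customer", "Hello?"))  # opening line
        for _ in range(max_turns):
            agent_reply = llm(self.agent_prompt, history=self.transcript)
            self.transcript.append(SimTurn("agent", agent_reply))
            if "goodbye" in agent_reply.lower():
                break
            customer_reply = llm(self.persona_prompt, history=self.transcript)
            self.transcript.append(SimTurn("customer", customer_reply))
        return self.transcript
```

Because no audio is synthesized or transcribed, the same scenario can be replayed and re-judged on every change, which is what makes regression detection repeatable.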
LLM Evaluation
AI judges for scalable quality assessment
- Turn-level: relevance, tone, informativeness
- Conversation-level: task success, satisfaction
- Reference-based and scenario-based testing
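A minimal sketch of a turn-level judge is shown below; the rubric wording, JSON schema, and judge callable are illustrative assumptions rather than the production prompt.

```python
# Turn-level LLM-judge sketch: scores relevance, tone, and informativeness on a
# 1-5 rubric. Prompt wording and JSON keys are illustrative assumptions.
import json

JUDGE_PROMPT = """You are a strict QA judge for voice-agent transcripts.
Given the conversation so far and the agent's latest turn, return JSON:
{"relevance": 1-5, "tone": 1-5, "informativeness": 1-5, "rationale": "..."}"""

def judge_turn(llm, history: str, agent_turn: str) -> dict:
    raw = llm(JUDGE_PROMPT, user=f"History:\n{history}\n\nAgent turn:\n{agent_turn}")
    scores = json.loads(raw)
    # Clamp numeric scores to the rubric range so a malformed reply cannot skew aggregates.
    return {k: (v if k == "rationale" else min(5, max(1, int(v))))
            for k, v in scores.items()}
```

Conversation-level judges work the same way but read the full transcript and score task success and satisfaction instead of per-turn qualities.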
Voice QA
Production monitoring with 100% call coverage
- 100% call assessment vs 1-5% sampling
- Real-time dashboards in Metabase
- Customizable metrics per use case
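The sketch below illustrates the 100%-coverage loop, assuming supabase-py for storage and hypothetical `calls` / `call_qa_scores` tables that the Metabase dashboards would read; the production schema may differ.

```python
# Production Voice QA sketch: assess every completed call, not a sample.
# Table and column names are illustrative assumptions; supabase-py is shown as
# one possible storage client.
from supabase import create_client

def assess_all_calls(supabase_url: str, supabase_key: str, judge) -> None:
    db = create_client(supabase_url, supabase_key)
    # Pull every call that has finished but has no QA scores yet.
    pending = db.table("calls").select("id, transcript").eq("qa_done", False).execute()
    for call in pending.data:
        scores = judge(call["transcript"])  # per-call LLM-judge scores (dict of metrics)
        db.table("call_qa_scores").insert({"call_id": call["id"], **scores}).execute()
        db.table("calls").update({"qa_done": True}).eq("id", call["id"]).execute()
```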
Assessment Model Validation
Eval gates for model selection and changes
- Model registry with approved versions
- Eval harness with consistent test sets
- Canary rollout with automatic rollback
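A simplified sketch of the eval gate follows; the in-memory registry, agreement metric, and 0.90 threshold are illustrative assumptions, standing in for the real harness that scores candidate judges against a frozen, human-labeled test set.

```python
# Eval-gate sketch: a candidate assessment model must match or beat the approved
# model on a fixed labeled eval set before the registry entry is updated.
APPROVED_MODELS = {"qa_judge": "judge-v3"}      # simplified model registry
MIN_AGREEMENT_WITH_HUMANS = 0.90                # illustrative acceptance bar

def passes_eval_gate(candidate_scores: list[int], human_labels: list[int]) -> bool:
    """Fraction of eval items where the candidate judge agrees with human labels."""
    agreement = sum(c == h for c, h in zip(candidate_scores, human_labels)) / len(human_labels)
    return agreement >= MIN_AGREEMENT_WITH_HUMANS

def promote(model_name: str, candidate_scores: list[int], human_labels: list[int]) -> None:
    if passes_eval_gate(candidate_scores, human_labels):
        APPROVED_MODELS["qa_judge"] = model_name
    else:
        raise RuntimeError(
            f"{model_name} failed the eval gate; keeping {APPROVED_MODELS['qa_judge']}"
        )
```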
Ablation Studies
We tested each approach in isolation to understand what works and why.
Key Takeaways
1. 100% call assessment catches issues that 1-5% sampling misses
2. Text simulation enables 10-100x faster testing cycles for rapid iteration
3. Human calibration is essential — ≥0.8 correlation requires clear rubrics
4. Assessment models need eval gates; "faster/cheaper" often means lower accuracy
5. Multi-layer approach: simulation (pre-prod) + evaluation (CI/CD) + monitoring (production)
Traditional QA sampling (1-5%)
Hypothesis: Manual sampling is sufficient for quality assurance
Finding: Misses 95-99% of calls; cannot detect distributed failures or edge cases

Voice-only testing
Hypothesis: Full voice simulation is necessary for all testing
Finding: 10-100x slower than text; impractical for rapid iteration

Faster/cheaper assessment model
Hypothesis: General-purpose models work for assessment tasks
Finding: Higher rate of incorrect assessments on edge cases; rejected for production

Combined Framework
Approach: Multi-layer approach with simulation, evaluation, and monitoring
Finding: Pre-prod simulation + LLM evaluation + 100% production monitoring = comprehensive coverage
Reliability & Rollout
How we safely deployed to production with continuous monitoring.
Rollout Timeline
Text Simulation
CLI-based bot-to-bot conversations for fast iteration
LLM Evaluation
Turn-level and conversation-level AI judges
Voice QA
100% production call assessment with dashboards
Voice Simulation
Full TTS and latency testing at scale
Live Monitoring
- Customer Satisfaction (post-interaction rating): target ≥4.5/5, threshold > 4.0
- Task Completion Rate (goals successfully achieved): target >90%, threshold > 85%
- Containment Rate (handled without human escalation): target >85%, threshold > 80%
- Simulation Throughput (tests per hour): target 1,000/hr, threshold > 500/hr
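As an illustration, a threshold check over these metrics might look like the sketch below; the metric keys and alerting hook are assumptions, with the floors taken from the thresholds above.

```python
# Threshold-alert sketch for the live monitoring metrics listed above.
THRESHOLDS = {
    "customer_satisfaction": 4.0,   # rating out of 5
    "task_completion_rate": 0.85,
    "containment_rate": 0.80,
    "simulations_per_hour": 500,
}

def check_thresholds(current: dict[str, float], alert) -> None:
    """Compare current metric values against their floors and alert on any breach."""
    for metric, floor in THRESHOLDS.items():
        value = current.get(metric)
        if value is not None and value < floor:
            alert(f"{metric} dropped to {value}, below threshold {floor}")
```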
Safety Guardrails
- Human calibration required for all assessment dimensions (≥0.8 correlation target)
- Eval harness gates for all assessment model changes
- Canary rollout with automatic rollback on disagreement spikes
- Reference datasets refreshed quarterly as workflows evolve
- CI/CD gates: min_task_success, min_satisfaction, max_error_rate
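A minimal sketch of the CI/CD gate is below, using the three gate names from the last guardrail; the exact limits and result keys are illustrative.

```python
# CI/CD quality-gate sketch: a non-zero exit blocks the release in the pipeline.
import sys

GATES = {"min_task_success": 0.90, "min_satisfaction": 4.0, "max_error_rate": 0.02}

def enforce_gates(results: dict[str, float]) -> None:
    failures = []
    if results["task_success"] < GATES["min_task_success"]:
        failures.append("task_success below minimum")
    if results["satisfaction"] < GATES["min_satisfaction"]:
        failures.append("satisfaction below minimum")
    if results["error_rate"] > GATES["max_error_rate"]:
        failures.append("error_rate above maximum")
    if failures:
        print("Quality gate failed: " + "; ".join(failures))
        sys.exit(1)  # fail the CI job, blocking the release
```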
Product Features
Ready for production with enterprise-grade reliability.
100% Call Assessment
AI evaluates every call, not just a 1-5% sample, for complete quality visibility and edge-case detection.
Bot-to-Bot Simulation
Repeatable agent-vs-agent conversations 10-100x faster than voice for rapid iteration and regression testing.
LLM Judges
Turn-level and conversation-level AI evaluation for relevance, tone, task success, and policy adherence.
Human-Calibrated Accuracy
≥0.8 correlation with human expert scores through continuous calibration and clear rubrics.
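One way to run this calibration check is a rank correlation between judge and human scores on the same calls; the sketch below uses scipy's Spearman correlation with the 0.8 target as the floor and is illustrative rather than the exact production procedure.

```python
# Calibration-check sketch: LLM-judge scores must correlate with human expert
# scores at >= 0.8 before the judge is trusted for that dimension.
from scipy.stats import spearmanr

def is_calibrated(judge_scores: list[float], human_scores: list[float],
                  floor: float = 0.8) -> bool:
    rho, _p_value = spearmanr(judge_scores, human_scores)
    return rho >= floor
```

Spearman is used here because rubric scores are ordinal; Pearson correlation is a reasonable alternative if scores are treated as continuous.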
Real-Time Dashboards
Metabase visualizations with drill-down, filtering, and automated alerting for production monitoring.
Model Validation Protocol
Eval gates prevent silent regression when assessment models change or providers are swapped.
Reference & Scenario Testing
Compare AI to actual human conversations or test specific behaviors (angry users, edge cases).
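A scenario definition might look like the sketch below; the `Scenario` fields and the angry-caller persona are illustrative assumptions consumed by the simulation harness and the conversation-level judge.

```python
# Scenario-based test sketch: a persona plus the outcome the judge should verify.
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    persona_prompt: str    # drives the simulated customer in bot-to-bot runs
    expected_outcome: str  # what the conversation-level judge checks for
    max_turns: int = 12

ANGRY_REFUND_CALLER = Scenario(
    name="angry_refund_caller",
    persona_prompt=("You are an angry customer demanding a refund. "
                    "Interrupt, repeat yourself, and escalate if ignored."),
    expected_outcome="Agent stays polite, acknowledges frustration, and offers a refund path.",
)
```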
CI/CD Integration
Quality gates block releases that fail minimum thresholds for success, satisfaction, or error rate.
Customizable Metrics
Universal metrics plus use-case-specific measures for healthcare, lead gen, service, and more.
Multi-Layer Coverage
Pre-production simulation, CI/CD evaluation, and production monitoring create comprehensive quality assurance.
Integration Details
Runs On
LiveKit Rooms (simulation) + Cloud LLM Judges + Supabase + Metabase (monitoring)
Latency Budget
Real-time for production QA, batch for simulation/evaluation
Providers
LiveKit, Supabase, Metabase, AVM Framework, LLM Evaluation
Implementation
2-3 weeks for full pipeline setup
Frequently Asked Questions
Common questions about our quality and evaluation framework.
Methodology
How we built, calibrated, and evaluated this framework.
Dataset
Comprehensive coverage of real conversations, simulated edge cases, and validated assessment test cases. Datasets refreshed quarterly as workflows evolve.
Labeling
Automated via carrier metadata + human expert scores for calibration. Assessment ground truth includes successes, failures, and ambiguous interactions.
Evaluation Protocol
Multi-layer validation: Simulation repeatability, LLM judge correlation with humans, production QA accuracy, assessment model disagreement rate monitoring.
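For the disagreement-rate monitoring, the sketch below assumes the candidate assessment model is scored side by side with the approved one during a canary; the 10% limit and rollback hook are illustrative.

```python
# Canary disagreement-monitoring sketch: a spike in disagreement between the
# approved and candidate judges triggers rollback to the approved model.
def disagreement_rate(approved: list[int], candidate: list[int]) -> float:
    return sum(a != c for a, c in zip(approved, candidate)) / max(len(approved), 1)

def monitor_canary(approved_scores: list[int], candidate_scores: list[int],
                   rollback, limit: float = 0.10) -> None:
    rate = disagreement_rate(approved_scores, candidate_scores)
    if rate > limit:
        rollback(f"Disagreement rate {rate:.1%} exceeded {limit:.0%}; reverting to approved judge")
```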
Known Limitations
- Complex assessments take minutes post-call
- Human calibration required for each evaluation dimension
- Voice simulation still in progress
- Eval datasets must be refreshed as workflows evolve
- Multi-party rooms (>2 participants) not yet supported
Ready to see this in action?
Book a technical walkthrough with our team to see how this research applies to your use case.
