Automated Testing Through Agent-vs-Agent Conversations
Bot-to-Bot Simulation
Voice calls take too long for rapid iteration.
Voice Simulation: Full TTS (In Progress)
Throughput Target: 1,000 simulations/hour (Scale testing)
Executive Summary
What we built
A simulation framework that runs agent-vs-agent conversations, returning full JSON transcripts for comparison, regression testing, and evaluation tooling.
Why it matters
Voice calls take too long for rapid iteration. Text simulation enables faster testing cycles. Reproducible scenarios catch regressions. Automated scoring reduces manual QA burden.
Results
- Text simulation 10-100x faster than voice
- Target 1,000 simulations/hour throughput
- ≥90% blocking issues caught pre-production
- Full JSON transcript capture for analysis
Best for
- Prompt/version A/B comparison
- Regression testing
- Sandboxed experiments
- CI/CD quality gates
Limitations
- Voice simulation still in progress
- Audio recording storage out of scope
- Multi-party rooms (>2) not yet supported
How It Works
A two-layer simulation system, fast text runs and full voice runs, where each layer covers the other's weaknesses.
Text Simulation
CLI-based fast iteration
- 10-100x faster than voice calls
- Multilingual support
- Full JSON transcript output
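The core loop is small enough to sketch. Below, `run_text_simulation` and the `LLMFn` hook are illustrative names, not the shipped CLI internals; real runs would plug provider chat calls into the hooks:

```python
import json
import uuid
from typing import Callable

# Hypothetical LLM hook: swap in your provider's chat-completion call.
LLMFn = Callable[[str, list[dict]], str]

def run_text_simulation(
    agent_llm: LLMFn,
    user_llm: LLMFn,
    agent_prompt: str,
    user_prompt: str,
    max_turns: int = 10,
) -> dict:
    """Alternate user-simulator and agent turns; return a JSON-able transcript."""
    transcript: list[dict] = []
    for _ in range(max_turns):
        user_msg = user_llm(user_prompt, transcript)
        transcript.append({"role": "user", "content": user_msg})
        agent_msg = agent_llm(agent_prompt, transcript)
        transcript.append({"role": "assistant", "content": agent_msg})
    return {"run_id": f"sim_{uuid.uuid4().hex[:8]}", "transcript": transcript}

# Smoke test with canned responses; real runs plug in actual LLM calls.
canned: LLMFn = lambda prompt, history: f"turn {len(history)}"
result = run_text_simulation(canned, canned, "agent prompt", "angry-user prompt", max_turns=2)
print(json.dumps(result, indent=2))  # the full JSON transcript named above
```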
Voice Simulation
Full TTS and latency testing
- Audio QA and timing validation
- Target 1,000 sims/hour scale
- ≥90% blocking audio issues detected
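On top of the transcript, a voice run asserts on timing. Here is a minimal sketch of a per-turn latency audit; the `Turn` fields and the 1.5 s budget are assumptions for illustration:

```python
from dataclasses import dataclass

LATENCY_BUDGET_S = 1.5  # assumed per-turn budget; tune per deployment

@dataclass
class Turn:
    user_speech_end: float    # seconds since call start
    agent_audio_start: float  # when the agent's TTS audio begins

def audit_latency(turns: list[Turn]) -> list[int]:
    """Return indices of turns whose response latency exceeds the budget."""
    return [
        i for i, t in enumerate(turns)
        if (t.agent_audio_start - t.user_speech_end) > LATENCY_BUDGET_S
    ]

# Example: turn 1 responds 2.1 s after the user stops speaking, so it is flagged.
turns = [Turn(3.0, 3.8), Turn(9.4, 11.5)]
assert audit_latency(turns) == [1]
```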
LiveKit Rooms
2-participant conversation space
- Create sim_<run_id> rooms
- Dispatch agent + user simulator
- Capture session.history for each conversation
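The room flow maps directly onto the livekit-api Python SDK. A minimal sketch, assuming explicit agent dispatch and placeholder agent names (`agent-under-test`, `user-simulator`):

```python
import asyncio
from livekit import api

async def start_simulation(run_id: str) -> None:
    # Reads LIVEKIT_URL / LIVEKIT_API_KEY / LIVEKIT_API_SECRET from the environment.
    lkapi = api.LiveKitAPI()
    room_name = f"sim_{run_id}"
    await lkapi.room.create_room(api.CreateRoomRequest(name=room_name))
    # Dispatch both participants into the same 2-party room; names are placeholders.
    for agent_name in ("agent-under-test", "user-simulator"):
        await lkapi.agent_dispatch.create_dispatch(
            api.CreateAgentDispatchRequest(agent_name=agent_name, room=room_name)
        )
    await lkapi.aclose()
    # Each worker then captures session.history and serializes it into the
    # run's JSON transcript when the conversation ends.

asyncio.run(start_simulation("a1b2c3"))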
Reliability & Rollout
How we safely deployed to production with continuous monitoring.
Rollout Timeline
- Text Simulation: CLI-based commands, multilingual support
- Voice Simulation: Full TTS, latency, and audio processing
- Scale Testing: 1,000 simulations/hour throughput
- Live Monitoring
- Safety Guardrails
Product Features
Ready for production with enterprise-grade reliability.
Repeatable Tests
Run the same scenario multiple times for consistent regression detection.
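For example (helper names hypothetical), regression detection can be as simple as fingerprinting transcripts from repeated runs of one scenario:

```python
import hashlib
import json

def transcript_fingerprint(result: dict) -> str:
    """Hash role/content pairs only, so timestamps and IDs don't break comparisons."""
    turns = [[t["role"], t["content"]] for t in result["transcript"]]
    return hashlib.sha256(json.dumps(turns).encode()).hexdigest()

def detect_drift(results: list[dict]) -> bool:
    """True if identical scenario runs produced diverging transcripts."""
    return len({transcript_fingerprint(r) for r in results}) > 1

# Usage: replay one scenario five times and compare.
#   results = [run_scenario("refund-request") for _ in range(5)]  # hypothetical runner
#   if detect_drift(results): inspect the transcript diffs
```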
Fast Iteration
Text simulation is 10-100x faster than actual voice calls.
Automated Scoring
Integration with AVM for unified quality scoring.
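This page doesn't show the AVM interface, so treat the following as a loose sketch: assume a scoring endpoint that accepts the JSON transcript and returns a unified score (URL, payload, and response shape are all invented):

```python
import json
import urllib.request

def score_with_avm(result: dict) -> float:
    """Hypothetical AVM call; endpoint and schema are placeholders, not the real API."""
    req = urllib.request.Request(
        "https://avm.internal/score",  # placeholder URL
        data=json.dumps(result).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["score"]  # assumed response field
```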
Scenario-Based Testing
Prompt the User LLM to adopt specific behaviors, such as angry or demanding callers.
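As a concrete sketch (persona names and wording are illustrative), behavior steering is just a system prompt on the user-simulator side:

```python
PERSONAS = {
    "angry": (
        "You are a frustrated customer whose issue was not fixed last time. "
        "Interrupt, escalate, and demand a supervisor if the agent stalls."
    ),
    "demanding": (
        "You want a refund today. Reject workarounds and push for firm commitments."
    ),
}

def user_simulator_prompt(persona: str) -> str:
    """Build the User LLM system prompt for a scenario run."""
    return f"{PERSONAS[persona]} Stay in character for the entire conversation."
```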
Reference-Based Testing
Provide actual human conversation transcripts for realistic testing.
Partial Conversations
Start mid-flow with partial_conversation to test specific scenarios.
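A sketch of the seeding step, which also pairs naturally with reference-based testing: take the opening turns of a real call and hand them to the simulator. The file name and helper are illustrative; partial_conversation is the parameter named above, and its exact shape is assumed:

```python
import json

def load_partial_conversation(path: str, turns: int) -> list[dict]:
    """Take the opening turns of a real (anonymized) call as the starting state."""
    with open(path) as f:
        return json.load(f)["transcript"][:turns]

# Resume just after the greeting exchange, then let the simulators take over.
seed = load_partial_conversation("reference_call.json", turns=4)
# e.g. run_text_simulation(..., partial_conversation=seed)
```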
Integration Details
Runs On
LiveKit Rooms + Simulation Templates
Latency Budget
Batch processing for evaluation; no interactive latency requirement
Providers
LiveKit, AVM Scoring, LLM Evaluation
Implementation
1-2 weeks for pipeline setup
Frequently Asked Questions
Common questions about our agent-vs-agent simulation framework.
Ready to see this in action?
Book a technical walkthrough with our team to see how this research applies to your use case.
