Automated Testing Through Agent-vs-Agent Conversations
Bot-to-Bot Simulation
Voice calls take too long for rapid iteration.
Voice Simulation: Full TTS (In Progress)
Throughput Target: 1,000 simulations/hour (Scale testing)
Executive Summary
What we built
A simulation framework that runs agent-vs-agent conversations, returning full JSON transcripts for comparison, regression testing, and evaluation tooling.
Why it matters
Voice calls take too long for rapid iteration. Text simulation enables faster testing cycles. Reproducible scenarios catch regressions. Automated scoring reduces manual QA burden.
Results
- Text simulation 10-100x faster than voice
- Target 1,000 simulations/hour throughput
- ≥90% blocking issues caught pre-production
- Full JSON transcript capture for analysis
Best for
- Prompt/version A/B comparison
- Regression testing
- Sandboxed experiments
- CI/CD quality gates
Limitations
- Voice simulation still in progress
- Audio recording storage out of scope
- Multi-party rooms (>2) not yet supported
How It Works
A two-layer simulation system, fast text runs and full voice runs, where each layer covers the other's weaknesses.
Text Simulation
CLI-based fast iteration
- 10-100x faster than voice calls
- Multilingual support
- Full JSON transcript output
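The core loop is small enough to sketch. Below, `run_text_simulation` and the `LLMFn` hook are illustrative names, not the shipped CLI internals; real runs would plug provider chat calls into the hooks:

```python
import json
import uuid
from typing import Callable

# Hypothetical LLM hook: swap in your provider's chat-completion call.
LLMFn = Callable[[str, list[dict]], str]

def run_text_simulation(
    agent_llm: LLMFn,
    user_llm: LLMFn,
    agent_prompt: str,
    user_prompt: str,
    max_turns: int = 10,
) -> dict:
    """Alternate user-simulator and agent turns; return a JSON-able transcript."""
    transcript: list[dict] = []
    for _ in range(max_turns):
        user_msg = user_llm(user_prompt, transcript)
        transcript.append({"role": "user", "content": user_msg})
        agent_msg = agent_llm(agent_prompt, transcript)
        transcript.append({"role": "assistant", "content": agent_msg})
    return {"run_id": f"sim_{uuid.uuid4().hex[:8]}", "transcript": transcript}

# Smoke test with canned responses; real runs plug in actual LLM calls.
canned: LLMFn = lambda prompt, history: f"turn {len(history)}"
result = run_text_simulation(canned, canned, "agent prompt", "angry-user prompt", max_turns=2)
print(json.dumps(result, indent=2))  # the full JSON transcript named above
```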
Voice Simulation
Full TTS and latency testing
- Audio QA and timing validation
- Target 1,000 sims/hour scale
- ≥90% blocking audio issues detected
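On top of the transcript, a voice run asserts on timing. Here is a minimal sketch of a per-turn latency audit; the `Turn` fields and the 1.5 s budget are assumptions for illustration:

```python
from dataclasses import dataclass

LATENCY_BUDGET_S = 1.5  # assumed per-turn budget; tune per deployment

@dataclass
class Turn:
    user_speech_end: float    # seconds since call start
    agent_audio_start: float  # when the agent's TTS audio begins

def audit_latency(turns: list[Turn]) -> list[int]:
    """Return indices of turns whose response latency exceeds the budget."""
    return [
        i for i, t in enumerate(turns)
        if (t.agent_audio_start - t.user_speech_end) > LATENCY_BUDGET_S
    ]

# Example: turn 1 responds 2.1 s after the user stops speaking, so it is flagged.
turns = [Turn(3.0, 3.8), Turn(9.4, 11.5)]
assert audit_latency(turns) == [1]
```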
LiveKit Rooms
2-participant conversation space
- Create sim_<run_id> rooms
- Dispatch agent + user simulator
- Capture session.history for each conversation
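The room flow maps directly onto the livekit-api Python SDK. A minimal sketch, assuming explicit agent dispatch and placeholder agent names (`agent-under-test`, `user-simulator`):

```python
import asyncio
from livekit import api

async def start_simulation(run_id: str) -> None:
    # Reads LIVEKIT_URL / LIVEKIT_API_KEY / LIVEKIT_API_SECRET from the environment.
    lkapi = api.LiveKitAPI()
    room_name = f"sim_{run_id}"
    await lkapi.room.create_room(api.CreateRoomRequest(name=room_name))
    # Dispatch both participants into the same 2-party room; names are placeholders.
    for agent_name in ("agent-under-test", "user-simulator"):
        await lkapi.agent_dispatch.create_dispatch(
            api.CreateAgentDispatchRequest(agent_name=agent_name, room=room_name)
        )
    await lkapi.aclose()
    # Each worker then captures session.history and serializes it into the
    # run's JSON transcript when the conversation ends.

asyncio.run(start_simulation("a1b2c3"))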
Reliability & Rollout
How we safely deployed to production with continuous monitoring.
Rollout Timeline
- Text Simulation: CLI-based commands, multilingual support
- Voice Simulation: Full TTS, latency, and audio processing
- Scale Testing: 1,000 simulations/hour throughput
- Live Monitoring
- Safety Guardrails
Product Features
Ready for production with enterprise-grade reliability.
Repeatable Tests
Run the same scenario multiple times for consistent regression detection.
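For example (helper names hypothetical), regression detection can be as simple as fingerprinting transcripts from repeated runs of one scenario:

```python
import hashlib
import json

def transcript_fingerprint(result: dict) -> str:
    """Hash role/content pairs only, so timestamps and IDs don't break comparisons."""
    turns = [[t["role"], t["content"]] for t in result["transcript"]]
    return hashlib.sha256(json.dumps(turns).encode()).hexdigest()

def detect_drift(results: list[dict]) -> bool:
    """True if identical scenario runs produced diverging transcripts."""
    return len({transcript_fingerprint(r) for r in results}) > 1

# Usage: replay one scenario five times and compare.
#   results = [run_scenario("refund-request") for _ in range(5)]  # hypothetical runner
#   if detect_drift(results): inspect the transcript diffs
```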
Fast Iteration
Text simulation is 10-100x faster than actual voice calls.
Automated Scoring
Integration with AVM for unified quality scoring.
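This page doesn't show the AVM interface, so treat the following as a loose sketch: assume a scoring endpoint that accepts the JSON transcript and returns a unified score (URL, payload, and response shape are all invented):

```python
import json
import urllib.request

def score_with_avm(result: dict) -> float:
    """Hypothetical AVM call; endpoint and schema are placeholders, not the real API."""
    req = urllib.request.Request(
        "https://avm.internal/score",  # placeholder URL
        data=json.dumps(result).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["score"]  # assumed response field
```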
Scenario-Based Testing
Prompt the User LLM to adopt specific behaviors, such as angry or demanding callers.
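As a concrete sketch (persona names and wording are illustrative), behavior steering is just a system prompt on the user-simulator side:

```python
PERSONAS = {
    "angry": (
        "You are a frustrated customer whose issue was not fixed last time. "
        "Interrupt, escalate, and demand a supervisor if the agent stalls."
    ),
    "demanding": (
        "You want a refund today. Reject workarounds and push for firm commitments."
    ),
}

def user_simulator_prompt(persona: str) -> str:
    """Build the User LLM system prompt for a scenario run."""
    return f"{PERSONAS[persona]} Stay in character for the entire conversation."
```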
Reference-Based Testing
Provide actual human conversation transcripts for realistic testing.
Partial Conversations
Start mid-flow with partial_conversation to test specific scenarios.
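A sketch of the seeding step, which also pairs naturally with reference-based testing: take the opening turns of a real call and hand them to the simulator. The file name and helper are illustrative; partial_conversation is the parameter named above, and its exact shape is assumed:

```python
import json

def load_partial_conversation(path: str, turns: int) -> list[dict]:
    """Take the opening turns of a real (anonymized) call as the starting state."""
    with open(path) as f:
        return json.load(f)["transcript"][:turns]

# Resume just after the greeting exchange, then let the simulators take over.
seed = load_partial_conversation("reference_call.json", turns=4)
# e.g. run_text_simulation(..., partial_conversation=seed)
```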
Integration Details
Runs On
LiveKit Rooms + Simulation Templates
Latency Budget
Batch processing for evaluation; no interactive latency requirement
Providers
LiveKit, AVM Scoring, LLM Evaluation
Implementation
1-2 weeks for pipeline setup
Frequently Asked Questions
Common questions about our agent-vs-agent simulation framework.
Ready to see this in action?
Book a technical walkthrough with our team to see how this research applies to your use case.
