anyreach logo

Automated Testing Through Agent-vs-Agent Conversations

Bot-to-Bot Simulation

Voice calls take too long for rapid iteration.

RepeatableTests
FastIteration
AutomatedScoring
RegressionDetection
Compliance:
SOC2
HIPAA
Text SimulationFast iteration
CLI-based

Voice Simulation

Full TTS

In Progress

Throughput Target

1,000/hr

Scale testing

Executive Summary

What we built

A simulation framework that runs agent-vs-agent conversations, returning full JSON transcripts for comparison, regression testing, and evaluation tooling.

Why it matters

Voice calls take too long for rapid iteration. Text simulation enables faster testing cycles. Reproducible scenarios catch regressions. Automated scoring reduces manual QA burden.

Results

  • Text simulation 10-100x faster than voice
  • Target 1,000 simulations/hour throughput
  • ≥90% blocking issues caught pre-production
  • Full JSON transcript capture for analysis

Best for

  • Prompt/version A/B comparison
  • Regression testing
  • Sandboxed experiments
  • CI/CD quality gates

Limitations

  • Voice simulation still in progress
  • Audio recording storage out of scope
  • Multi-party rooms (>2) not yet supported

How It Works

A two-layer detection system where each covers the other's weaknesses.

Text Simulation

CLI-based fast iteration

  • 10-100x faster than voice calls
  • Multilingual support
  • Full JSON transcript output

Voice Simulation

Full TTS and latency testing

  • Audio QA and timing validation
  • Target 1,000 sims/hour scale
  • ≥90% blocking audio issues detected

LiveKit Rooms

2-participant conversation space

  • Create sim_<run_id> rooms
  • Dispatch agent + user simulator
  • Capture session.history for each conversation

Reliability & Rollout

How we safely deployed to production with continuous monitoring.

Rollout Timeline

completed

Text Simulation

CLI-based commands, multilingual support

Completed
active

Voice Simulation

Full TTS, latency, audio processing

In Progress
pending

Scale Testing

1,000 simulations/hour throughput

Target

Live Monitoring

Safety Guardrails

    Product Features

    Ready for production with enterprise-grade reliability.

    Repeatable Tests

    Run the same scenario multiple times for consistent regression detection.

    Fast Iteration

    Text simulation is 10-100x faster than actual voice calls.

    Automated Scoring

    Integration with AVM for unified quality scoring.

    Scenario-Based Testing

    Prompt User LLM to follow specific behaviors like angry or demanding users.

    Reference-Based Testing

    Provide actual human conversation transcripts for realistic testing.

    Partial Conversations

    Start mid-flow with partial_conversation to test specific scenarios.

    Integration Details

    Runs On

    LiveKit Rooms + Simulation Templates

    Latency Budget

    Batch processing for evaluation

    Providers

    LiveKit, AVM Scoring, LLM Evaluation

    Implementation

    1-2 weeks for pipeline setup

    Frequently Asked Questions

    Common questions about our voicemail detection system.

    Ready to see this in action?

    Book a technical walkthrough with our team to see how this research applies to your use case.