
Domain-Specific Voice Agents Through Custom Training

Custom LLM Fine-Tuning Pipeline

Generic LLMs lack domain-specific knowledge and tone.

Dual-Channel Processing
Adapter-Based Efficiency
Domain-Specific Optimization
Human-Reference Evaluation
Compliance:
SOC 2
HIPAA
Training Data

Dual-channel (agent + user separated)

Adaptation Method

LoRA

50-600 calls

Infrastructure

Baseten

NVIDIA GPUs

Executive Summary

What we built

An end-to-end pipeline that converts human agent call recordings into training data for domain-specific LLM fine-tuning, enabling AI agents that replicate the best human representatives.

Why it matters

Generic LLMs lack domain-specific knowledge and tone. Human agents have institutional knowledge worth capturing. Adapter-based training enables client-specific customization at viable unit economics.

Results

  • 50-600 calls sufficient with LoRA adapters
  • Fraction of full fine-tuning cost
  • Per-client models economically viable
  • Multi-output: LLM, TTS, turn-taking, tool calling

Best for

  • Client-specific voice agents
  • Domain knowledge capture
  • Tone and style replication
  • Multi-use case deployments

Limitations

  • Requires quality dual-channel recordings
  • Transcription accuracy affects training quality
  • RL fine-tuning is still an advanced approach

How It Works

An end-to-end pipeline that turns dual-channel call recordings into multiple training datasets.

Data Pipeline

Dual-channel call processing

  • Channel separation (agent + user)
  • VAD segmentation with Silero
  • ASR transcription with AssemblyAI
  • Turn alignment and dataset creation
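The turn-alignment step above can be sketched as follows. The `Segment` shape, timestamps, and dialogue are illustrative stand-ins for what VAD segmentation and ASR actually emit, not the pipeline's real data model:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # "agent" or "user" (which channel it came from)
    start: float   # start time in seconds (from VAD segmentation)
    text: str      # ASR transcript of the segment

def align_turns(agent_segs, user_segs):
    """Interleave the two channels by start time and collapse
    consecutive segments from the same speaker into one turn."""
    merged = sorted(agent_segs + user_segs, key=lambda s: s.start)
    turns = []
    for seg in merged:
        if turns and turns[-1]["speaker"] == seg.speaker:
            turns[-1]["text"] += " " + seg.text
        else:
            turns.append({"speaker": seg.speaker, "text": seg.text})
    return turns

agent = [Segment("agent", 0.0, "Hi, thanks for calling."),
         Segment("agent", 1.8, "How can I help?")]
user = [Segment("user", 4.2, "I'd like to check my order status.")]
for turn in align_turns(agent, user):
    print(turn["speaker"], "->", turn["text"])
```

Because the channels are recorded separately, overlapping speech never corrupts either transcript; ordering by timestamp is enough to recover clean turns.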

LoRA Adapters

Efficient adapter-based training

  • 50-600 calls vs thousands for full fine-tuning
  • Fraction of GPU cost
  • Per-client models viable
  • Swap adapters at runtime
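A back-of-envelope calculation shows why adapters change the unit economics. The hidden size and rank below are typical values for a 7B-class model, not our production configuration:

```python
# LoRA trains two low-rank factors (d_out x r and r x d_in) in place of
# updating the full d_out x d_in weight matrix.
def lora_params(d_out, d_in, rank):
    return d_out * rank + rank * d_in

def full_params(d_out, d_in):
    return d_out * d_in

d = 4096   # hidden size of a 7B-class transformer layer (illustrative)
r = 16     # a commonly used LoRA rank (illustrative)
fraction = lora_params(d, d, r) / full_params(d, d)
print(f"Trainable fraction per adapted matrix: {fraction:.4%}")  # under 1%
```

Training well under 1% of the weights is what makes per-client adapters affordable, and since the base model stays frozen, adapters can be swapped at runtime without reloading it.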

Multi-Output Training

Multiple datasets from same recordings

  • LLM fine-tuning for response generation
  • TTS fine-tuning for voice cloning
  • Turn-taking model training
  • Tool calling dataset creation
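A minimal sketch of deriving two of these datasets from the same aligned turns. Function names and record shapes are hypothetical; turn-taking and tool-calling datasets follow the same pattern of slicing the shared turn list:

```python
def build_llm_examples(turns):
    """Each agent turn becomes a generation target, with all prior
    turns as conversational context."""
    return [{"context": turns[:i], "target": t["text"]}
            for i, t in enumerate(turns)
            if t["speaker"] == "agent" and i > 0]

def build_tts_examples(turns, n_ctx=4):
    """Agent utterances paired with the last n_ctx turns, so the TTS
    model can learn tone and pacing from conversational context."""
    return [{"context": turns[max(0, i - n_ctx):i], "text": t["text"]}
            for i, t in enumerate(turns)
            if t["speaker"] == "agent"]

turns = [{"speaker": "user", "text": "Hi, is my order shipped?"},
         {"speaker": "agent", "text": "Let me check that for you."},
         {"speaker": "user", "text": "Thanks."},
         {"speaker": "agent", "text": "Yes, it shipped yesterday."}]
print(len(build_llm_examples(turns)), len(build_tts_examples(turns)))  # 2 2
```

The key design choice is that every dataset is a view over the same aligned turns, so one round of recording and transcription amortizes across all downstream models.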

Product Features

Ready for production with enterprise-grade reliability.

Dual-Channel Processing

Separate agent and user audio for clean transcription and turn alignment.

LoRA Adapter Efficiency

50-600 calls sufficient, fraction of full fine-tuning cost, per-client models viable.

Context-Grounded Training

The last 4 turns of context ground both TTS and LLM training, so models learn appropriate tone and pacing from the conversation.

Multi-Output Datasets

Same recordings produce LLM, TTS, turn-taking, and tool calling training data.

Human Closeness Evaluation

Measure how close AI performance is to human agents on the same conversations.
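One way to sketch such a metric: score the AI's reply against the human agent's actual reply at the same point in the conversation. Token-level F1 is a simple stand-in here; a production evaluation might use embedding similarity or LLM judges instead:

```python
def token_f1(prediction, reference):
    """Token-overlap F1 between an AI reply and the human reference."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    counts = {}
    for w in ref:
        counts[w] = counts.get(w, 0) + 1
    common = 0
    for w in pred:
        if counts.get(w, 0) > 0:  # count each reference token at most once
            common += 1
            counts[w] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

human = "Your order shipped yesterday and should arrive Friday."
ai = "Your order shipped yesterday; it should arrive by Friday."
print(f"closeness: {token_f1(ai, human):.2f}")
```

Averaging such a score over held-out conversations gives a single "human closeness" number to track across fine-tuning runs.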

Baseten + TensorRT Deployment

NVIDIA GPU training with TensorRT-optimized inference served via LiveKit.

Integration Details

Runs On

Baseten + NVIDIA GPUs, TensorRT inference

Latency Budget

Training runs in batch; inference is real-time.

Providers

HuggingFace Hub, Baseten, LiveKit, AssemblyAI

Implementation

2-4 weeks for full pipeline

Frequently Asked Questions

Common questions about our custom fine-tuning pipeline.

Ready to see this in action?

Book a technical walkthrough with our team to see how this research applies to your use case.