Domain-Specific Voice Agents Through Custom Training
Custom LLM Fine-Tuning Pipeline
Generic LLMs lack domain-specific knowledge and tone.
Adaptation Method
LoRA (50-600 calls)
Infrastructure
Baseten + NVIDIA GPUs
Executive Summary
What we built
An end-to-end pipeline that converts human agent call recordings into training data for domain-specific LLM fine-tuning, enabling AI agents that replicate the best human representatives.
Why it matters
Generic LLMs lack domain-specific knowledge and tone. Human agents have institutional knowledge worth capturing. Adapter-based training enables client-specific customization at viable unit economics.
Results
- 50-600 calls sufficient with LoRA adapters
- Fraction of full fine-tuning cost
- Per-client models economically viable
- Multi-output: LLM, TTS, turn-taking, tool calling
Best for
- Client-specific voice agents
- Domain knowledge capture
- Tone and style replication
- Multi-use case deployments
Limitations
- Requires quality dual-channel recordings
- Transcription accuracy affects training quality
- RL fine-tuning is still an advanced approach
How It Works
A three-stage pipeline: process recordings into aligned turns, train LoRA adapters, and fan the same data out into multiple model outputs.
Data Pipeline
Dual-channel call processing (sketched after this list)
- Channel separation (agent + user)
- VAD segmentation with Silero
- ASR transcription with AssemblyAI
- Turn alignment and dataset creation
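A minimal sketch of the first two stages, assuming stereo WAV recordings with the agent on channel 0 and the user on channel 1; the file path, the 16 kHz target rate, and the helper name segment_call are illustrative, not the production code:

```python
# Split a dual-channel recording and segment each speaker with Silero VAD.
import torch
import torchaudio

vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

def segment_call(path: str, sample_rate: int = 16000) -> dict:
    waveform, sr = torchaudio.load(path)  # [2, samples] for dual-channel audio
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
    channels = {"agent": waveform[0], "user": waveform[1]}
    # Each segment is {"start": sample_idx, "end": sample_idx}; these clips
    # are what get sent to ASR and later aligned into conversation turns.
    return {
        role: get_speech_timestamps(audio, vad_model, sampling_rate=sample_rate)
        for role, audio in channels.items()
    }
```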
LoRA Adapters
Efficient adapter-based training (see the sketch after this list)
- 50-600 calls vs thousands for full fine-tuning
- Fraction of GPU cost
- Per-client models viable
- Swap adapters at runtime
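A hedged sketch of the adapter setup using Hugging Face PEFT; the base model, hyperparameters, and adapter path below are illustrative defaults, not the production configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Train small low-rank matrices on the attention projections instead of
# the full weights; this is why a few hundred calls can be enough.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of base weights

# Adapters are small files, so per-client variants can be loaded and
# swapped at runtime without reloading the base model.
model.load_adapter("client-b/adapter", adapter_name="client_b")  # illustrative path
model.set_adapter("client_b")
```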
Multi-Output Training
Multiple datasets from the same recordings (illustrated after this list)
- LLM fine-tuning for response generation
- TTS fine-tuning for voice cloning
- Turn-taking model training
- Tool calling dataset creation
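A simplified sketch of the fan-out, assuming turns have already been aligned; the Turn fields and output schemas are illustrative, not the production format, and the tool-calling rows (built similarly from turns that triggered backend actions) are omitted for brevity:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str        # "agent" or "user"
    text: str        # ASR transcript of this turn
    audio_path: str  # clip cut from that speaker's channel
    gap_ms: int      # silence before this turn began

def build_datasets(turns: list[Turn]):
    llm_rows, tts_rows, turn_rows = [], [], []
    for i, turn in enumerate(turns):
        history = [
            {"role": "assistant" if t.role == "agent" else "user", "content": t.text}
            for t in turns[:i]
        ]
        if turn.role == "agent":
            # LLM dataset: predict the agent's reply from the dialogue so far
            llm_rows.append({"messages": history + [{"role": "assistant", "content": turn.text}]})
            # TTS dataset: pair the agent's audio clip with its transcript
            tts_rows.append({"text": turn.text, "audio": turn.audio_path})
        # Turn-taking dataset: how long each speaker waited before talking
        turn_rows.append({"context": history, "gap_ms": turn.gap_ms, "speaker": turn.role})
    return llm_rows, tts_rows, turn_rows
```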
Product Features
Ready for production with enterprise-grade reliability.
Dual-Channel Processing
Separate agent and user audio for clean transcription and turn alignment.
LoRA Adapter Efficiency
50-600 calls are sufficient at a fraction of full fine-tuning cost, making per-client models viable.
Context-Grounded Training
The last 4 turns of context are included for TTS and LLM training, so models learn appropriate tone and pacing from the conversation.
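For example, a context-grounded TTS row might carry the preceding turns as conditioning text (a sketch reusing the Turn records from the fan-out example above; the window size of 4 matches the description, everything else is assumed):

```python
def tts_row(turns, i, k=4):
    # Conditioning on the last k turns lets the TTS model pick up the
    # tone and pacing the human agent used at that point in the call.
    context = " ".join(t.text for t in turns[max(0, i - k):i])
    return {"context": context, "text": turns[i].text, "audio": turns[i].audio_path}
```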
Multi-Output Datasets
Same recordings produce LLM, TTS, turn-taking, and tool calling training data.
Human Closeness Evaluation
Measure how close AI performance is to human agents on the same conversations.
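One hedged way to compute such a score: replay each conversation context to the fine-tuned model and compare its reply to the human agent's actual reply, e.g. with embedding cosine similarity (the metric and embedding model below are assumptions, not the production evaluator):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def human_closeness(human_replies: list[str], model_replies: list[str]) -> float:
    h = embedder.encode(human_replies, convert_to_tensor=True)
    m = embedder.encode(model_replies, convert_to_tensor=True)
    # Compare each model reply with the human reply for the same turn,
    # then average across the conversation.
    return util.cos_sim(h, m).diagonal().mean().item()
```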
Baseten + TensorRT Deployment
Training on NVIDIA GPUs with TensorRT-optimized inference served via LiveKit.
Integration Details
Runs On
Baseten + NVIDIA GPUs, TensorRT inference
Latency Budget
Batch for training, real-time for inference
Providers
HuggingFace Hub, Baseten, LiveKit, AssemblyAI
Implementation
2-4 weeks for full pipeline
Frequently Asked Questions
Common questions about our custom training pipeline.
Ready to see this in action?
Book a technical walkthrough with our team to see how this research applies to your use case.
