Domain-Specific Voice Agents Through Custom Training
Custom LLM Fine-Tuning Pipeline
Generic LLMs lack domain-specific knowledge and tone.
Adaptation Method
LoRA (50-600 calls)
Infrastructure
Baseten + NVIDIA GPUs
Executive Summary
What we built
An end-to-end pipeline that converts human agent call recordings into training data for domain-specific LLM fine-tuning, enabling AI agents that replicate the best human representatives.
Why it matters
Generic LLMs lack domain-specific knowledge and tone. Human agents have institutional knowledge worth capturing. Adapter-based training enables client-specific customization at viable unit economics.
Results
- 50-600 calls sufficient with LoRA adapters
- Fraction of full fine-tuning cost
- Per-client models economically viable
- Multi-output: LLM, TTS, turn-taking, tool calling
Best for
- Client-specific voice agents
- Domain knowledge capture
- Tone and style replication
- Multi-use case deployments
Limitations
- Requires quality dual-channel recordings
- Transcription accuracy affects training quality
- RL fine-tuning is still an advanced approach
How It Works
A three-stage pipeline: process recordings into aligned turns, train LoRA adapters, and fan the same data out into multiple model outputs.
Data Pipeline
Dual-channel call processing (sketched after this list)
- Channel separation (agent + user)
- VAD segmentation with Silero
- ASR transcription with AssemblyAI
- Turn alignment and dataset creation
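A minimal sketch of the first two stages, assuming stereo WAV recordings with the agent on channel 0 and the user on channel 1; the file path, the 16 kHz target rate, and the helper name segment_call are illustrative, not the production code:

```python
# Split a dual-channel recording and segment each speaker with Silero VAD.
import torch
import torchaudio

vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

def segment_call(path: str, sample_rate: int = 16000) -> dict:
    waveform, sr = torchaudio.load(path)  # [2, samples] for dual-channel audio
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
    channels = {"agent": waveform[0], "user": waveform[1]}
    # Each segment is {"start": sample_idx, "end": sample_idx}; these clips
    # are what get sent to ASR and later aligned into conversation turns.
    return {
        role: get_speech_timestamps(audio, vad_model, sampling_rate=sample_rate)
        for role, audio in channels.items()
    }
```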
LoRA Adapters
Efficient adapter-based training (see the sketch after this list)
- 50-600 calls vs thousands for full fine-tuning
- Fraction of GPU cost
- Per-client models viable
- Swap adapters at runtime
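A hedged sketch of the adapter setup using Hugging Face PEFT; the base model, hyperparameters, and adapter path below are illustrative defaults, not the production configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Train small low-rank matrices on the attention projections instead of
# the full weights; this is why a few hundred calls can be enough.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of base weights

# Adapters are small files, so per-client variants can be loaded and
# swapped at runtime without reloading the base model.
model.load_adapter("client-b/adapter", adapter_name="client_b")  # illustrative path
model.set_adapter("client_b")
```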
Multi-Output Training
Multiple datasets from the same recordings (illustrated after this list)
- LLM fine-tuning for response generation
- TTS fine-tuning for voice cloning
- Turn-taking model training
- Tool calling dataset creation
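A simplified sketch of the fan-out, assuming turns have already been aligned; the Turn fields and output schemas are illustrative, not the production format, and the tool-calling rows (built similarly from turns that triggered backend actions) are omitted for brevity:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str        # "agent" or "user"
    text: str        # ASR transcript of this turn
    audio_path: str  # clip cut from that speaker's channel
    gap_ms: int      # silence before this turn began

def build_datasets(turns: list[Turn]):
    llm_rows, tts_rows, turn_rows = [], [], []
    for i, turn in enumerate(turns):
        history = [
            {"role": "assistant" if t.role == "agent" else "user", "content": t.text}
            for t in turns[:i]
        ]
        if turn.role == "agent":
            # LLM dataset: predict the agent's reply from the dialogue so far
            llm_rows.append({"messages": history + [{"role": "assistant", "content": turn.text}]})
            # TTS dataset: pair the agent's audio clip with its transcript
            tts_rows.append({"text": turn.text, "audio": turn.audio_path})
        # Turn-taking dataset: how long each speaker waited before talking
        turn_rows.append({"context": history, "gap_ms": turn.gap_ms, "speaker": turn.role})
    return llm_rows, tts_rows, turn_rows
```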
Product Features
Ready for production with enterprise-grade reliability.
Dual-Channel Processing
Separate agent and user audio for clean transcription and turn alignment.
LoRA Adapter Efficiency
50-600 calls are sufficient at a fraction of full fine-tuning cost, making per-client models viable.
Context-Grounded Training
The last 4 turns of context are included for TTS and LLM training, so models learn appropriate tone and pacing from the conversation.
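For example, a context-grounded TTS row might carry the preceding turns as conditioning text (a sketch reusing the Turn records from the fan-out example above; the window size of 4 matches the description, everything else is assumed):

```python
def tts_row(turns, i, k=4):
    # Conditioning on the last k turns lets the TTS model pick up the
    # tone and pacing the human agent used at that point in the call.
    context = " ".join(t.text for t in turns[max(0, i - k):i])
    return {"context": context, "text": turns[i].text, "audio": turns[i].audio_path}
```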
Multi-Output Datasets
Same recordings produce LLM, TTS, turn-taking, and tool calling training data.
Human Closeness Evaluation
Measure how close AI performance is to human agents on the same conversations.
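One hedged way to compute such a score: replay each conversation context to the fine-tuned model and compare its reply to the human agent's actual reply, e.g. with embedding cosine similarity (the metric and embedding model below are assumptions, not the production evaluator):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def human_closeness(human_replies: list[str], model_replies: list[str]) -> float:
    h = embedder.encode(human_replies, convert_to_tensor=True)
    m = embedder.encode(model_replies, convert_to_tensor=True)
    # Compare each model reply with the human reply for the same turn,
    # then average across the conversation.
    return util.cos_sim(h, m).diagonal().mean().item()
```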
Baseten + TensorRT Deployment
Training on NVIDIA GPUs with TensorRT-optimized inference served via LiveKit.
Integration Details
Runs On
Baseten + NVIDIA GPUs, TensorRT inference
Latency Budget
Batch for training, real-time for inference
Providers
HuggingFace Hub, Baseten, LiveKit, AssemblyAI
Implementation
2-4 weeks for full pipeline
Frequently Asked Questions
Common questions about our custom training pipeline.
Ready to see this in action?
Book a technical walkthrough with our team to see how this research applies to your use case.
