Assessment Model Selection: The Day a "Small Swap" Broke Accuracy
Why assessment models need eval gates, not vibes
Assessment is infrastructure.
Solution: Eval gates before rollout
Accuracy: Restored with the stronger model
Executive Summary
What we built
A lightweight but strict model-selection protocol for assessment: candidate models must pass a standard evaluation harness before they can be used in production assessments.
Why it matters
Assessment is infrastructure. If assessments drift, you lose visibility into quality — your dashboards look "fine" while customer experience degrades. Switching assessment models without explicit evals can silently multiply error rates.
Results
- Prevented silent regression by validating model choice against a consistent test set
- Established a repeatable process for future model retirements and provider changes
- Stronger model restored expected error behavior after initial regression
Best for
- Teams using LLMs to score calls, classify outcomes, or extract structured fields
- Any system where the assessor is more critical than the agent
Limitations
- Requires maintaining an eval dataset (and refreshing it as workflows evolve)
- Some regressions are distributional (only appear on new call types)
The Problem
This failure mode is quiet and easy to miss, because the thing that broke is the thing you use to measure whether anything is broken.
Symptom
"Everything is green" in reporting, but humans observe failures rising
Cause
The assessment model is a hidden dependency; changing it changes the measurement tool
Business Cost
Dashboards stay green while real quality degrades; you lose visibility into the customer experience
How It Works
Four components that cover each other's weaknesses: a model registry, an eval harness, a canary rollout, and a fallback path.
Model Registry
List of approved models + version metadata (sketched below)
- Approved model versions
- Provider tracking
- Rollback targets
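As an illustration only, a registry like this could start as a small in-code table; the model keys, versions, and field names below are assumptions, not our actual registry.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RegisteredModel:
    """One approved assessment model plus the metadata needed to roll back to it."""
    name: str                           # provider's model identifier (illustrative)
    provider: str                       # which provider serves it
    version: str                        # pinned version or snapshot
    approved: bool                      # passed the eval gate
    rollback_target: str | None = None  # key of the last known-good model

# Hypothetical entries; real ones would only be added after passing the eval gate.
MODEL_REGISTRY = {
    "assessor-current": RegisteredModel(
        name="stronger-model", provider="provider-a", version="v2",
        approved=True, rollback_target="assessor-previous"),
    "assessor-previous": RegisteredModel(
        name="baseline-model", provider="provider-a", version="v1",
        approved=True),
}
```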
Eval Harness
Fixed dataset + scoring scripts + threshold gates (gate sketch below)
- Consistent test set
- Scoring metrics
- Pass/fail thresholds
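A minimal sketch of the threshold gate, assuming the candidate's outputs and the human labels are aligned lists; the 2% disagreement threshold is an illustrative number, not our production setting.

```python
def passes_eval_gate(candidate_outputs, human_labels, max_disagreement=0.02):
    """True if the candidate's disagreement with human labels on the fixed
    eval set stays at or below the threshold (the 2% default is illustrative)."""
    assert len(candidate_outputs) == len(human_labels) > 0
    disagreements = sum(o != h for o, h in zip(candidate_outputs, human_labels))
    return disagreements / len(human_labels) <= max_disagreement

# 1 disagreement out of 50 labels is exactly 2%, so this candidate just passes the gate.
print(passes_eval_gate(["voicemail"] * 49 + ["answered"], ["voicemail"] * 50))  # True
```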
Canary Rollout
Small % of calls scored by new model + monitored disagreement rate (monitor sketch below)
- Traffic splitting
- Disagreement monitoring
- Rollback triggers
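A sketch of the canary split and disagreement monitor, assuming both models expose an assess(call) method; the 5% traffic share, 10% rollback trigger, and sample minimum are all illustrative.

```python
import random

CANARY_SHARE = 0.05        # fraction of calls also scored by the candidate (illustrative)
ROLLBACK_THRESHOLD = 0.10  # disagreement rate that trips a rollback (illustrative)

def score_with_canary(call, current_model, candidate_model, agreement_log):
    """Score every call with the current model; on a small slice, also score it
    with the candidate and record whether the two assessments agree."""
    result = current_model.assess(call)  # assumed interface, not a real API
    if random.random() < CANARY_SHARE:
        agreement_log.append(result == candidate_model.assess(call))
    return result

def should_roll_back(agreement_log, min_samples=200):
    """Trip the rollback trigger once enough canary samples disagree too often."""
    if len(agreement_log) < min_samples:
        return False
    disagreement_rate = 1 - sum(agreement_log) / len(agreement_log)
    return disagreement_rate > ROLLBACK_THRESHOLD
```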
Fallback
Revert to last known-good model or alternate provider (rollback sketch below)
- Automatic rollback
- Provider switching
- Incident response
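A sketch of the rollback step, assuming registry entries carry an approved flag and a rollback_target key as in the registry sketch above; switching providers would follow the same shape.

```python
def fall_back(active_key, registry):
    """Return the key of the last known-good model to revert to.

    `registry` maps model keys to dicts with "approved" and "rollback_target"
    fields (same shape as the registry sketch above, as plain dicts here)."""
    target_key = registry[active_key].get("rollback_target")
    if target_key is None or not registry[target_key]["approved"]:
        raise RuntimeError(f"no approved rollback target for {active_key}")
    return target_key

# Illustrative use: revert after the canary monitor trips its rollback trigger.
registry = {
    "assessor-current": {"approved": True, "rollback_target": "assessor-previous"},
    "assessor-previous": {"approved": True, "rollback_target": None},
}
print(fall_back("assessor-current", registry))  # "assessor-previous"
```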
Ablation Studies
We tested each candidate model against the same eval set to understand what works and why.
Key Takeaways
- 1."Faster/cheaper" models underperform on nuanced assessment tasks
- 2.Significant disagreement increase on edge cases (multi-turn confirmations, incomplete calls)
- 3.Model changes require eval approval + staged rollout
"Fast" general model
Faster/cheaper models work for assessment tasks
Higher incorrect assessments observed — rejected for assessment use
"Stronger" model
More capable model maintains assessment accuracy
Error rate consistent with baseline — approved for production
Frequently Asked Questions
Common questions about the assessment model-selection protocol.
Methodology
How we built the eval dataset and validated candidate assessment models.
Dataset
Dataset must represent current workflows; otherwise you overfit to history.
Labeling
Humans assign outcome + key-field correctness
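One plausible shape for a labeled record, assuming an outcome label plus per-field correctness flags; the field names and outcome values are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class LabeledCall:
    """Human ground truth for one call in the eval set (illustrative fields)."""
    call_id: str
    outcome: str                                   # e.g. "voicemail", "answered"
    key_fields_correct: dict[str, bool] = field(default_factory=dict)

example = LabeledCall(
    call_id="call-0001",
    outcome="voicemail",
    key_fields_correct={"callback_number": True, "caller_name": False},
)
```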
Evaluation Protocol
Compare automated outputs vs human labels; compute disagreement by category
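A sketch of the disagreement computation, assuming model outputs and human labels are keyed by call ID and carry a category (e.g. call type); this is not the exact scoring script.

```python
from collections import defaultdict

def disagreement_by_category(model_outputs, human_labels):
    """Compare automated assessments to human labels and return the
    disagreement rate per category.

    Both arguments map call_id -> {"category": ..., "label": ...} (assumed shape)."""
    counts = defaultdict(lambda: [0, 0])  # category -> [disagreements, total]
    for call_id, human in human_labels.items():
        model = model_outputs.get(call_id)
        if model is None:
            continue  # calls the model never scored are tracked separately
        category = human["category"]
        counts[category][1] += 1
        if model["label"] != human["label"]:
            counts[category][0] += 1
    return {cat: dis / total for cat, (dis, total) in counts.items()}
```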
Known Limitations
- Dataset must be refreshed as workflows evolve
Evaluation Details
Ready to see this in action?
Book a technical walkthrough with our team to see how this research applies to your use case.
