Assessment Model Selection: The Day a "Small Swap" Broke Accuracy
Why assessment models need eval gates, not vibes
Assessment is infrastructure.
Solution: Eval gates before rollout
Accuracy: Restored with the stronger model
Executive Summary
What we built
A lightweight but strict model-selection protocol for assessment: candidate models must pass a standard evaluation harness before they can be used in production assessments.
Why it matters
Assessment is infrastructure. If assessments drift, you lose visibility into quality — your dashboards look "fine" while customer experience degrades. Switching assessment models without explicit evals can silently multiply error rates.
Results
- Prevented silent regression by validating model choice against a consistent test set
- Established a repeatable process for future model retirements and provider changes
- Stronger model restored expected error behavior after initial regression
Best for
- Teams using LLMs to score calls, classify outcomes, or extract structured fields
- Any system where the assessor is more critical than the agent
Limitations
- Requires maintaining an eval dataset (and refreshing it as workflows evolve)
- Some regressions are distributional (only appear on new call types)
The Problem
This failure mode is quiet and easy to miss, because the thing that broke is the thing you use to measure whether anything is broken.
Symptom
"Everything is green" in reporting, but humans observe failures rising
Cause
The assessment model is a hidden dependency; changing it changes the measurement tool
Business Cost
Dashboards stay green while real quality degrades; you lose visibility into the customer experience
How It Works
Four components that cover each other's weaknesses: a model registry, an eval harness, a canary rollout, and a fallback path.
Model Registry
List of approved models + version metadata (sketched below)
- Approved model versions
- Provider tracking
- Rollback targets
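As an illustration only, a registry like this could start as a small in-code table; the model keys, versions, and field names below are assumptions, not our actual registry.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RegisteredModel:
    """One approved assessment model plus the metadata needed to roll back to it."""
    name: str                           # provider's model identifier (illustrative)
    provider: str                       # which provider serves it
    version: str                        # pinned version or snapshot
    approved: bool                      # passed the eval gate
    rollback_target: str | None = None  # key of the last known-good model

# Hypothetical entries; real ones would only be added after passing the eval gate.
MODEL_REGISTRY = {
    "assessor-current": RegisteredModel(
        name="stronger-model", provider="provider-a", version="v2",
        approved=True, rollback_target="assessor-previous"),
    "assessor-previous": RegisteredModel(
        name="baseline-model", provider="provider-a", version="v1",
        approved=True),
}
```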
Eval Harness
Fixed dataset + scoring scripts + threshold gates (gate sketch below)
- Consistent test set
- Scoring metrics
- Pass/fail thresholds
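A minimal sketch of the threshold gate, assuming the candidate's outputs and the human labels are aligned lists; the 2% disagreement threshold is an illustrative number, not our production setting.

```python
def passes_eval_gate(candidate_outputs, human_labels, max_disagreement=0.02):
    """True if the candidate's disagreement with human labels on the fixed
    eval set stays at or below the threshold (the 2% default is illustrative)."""
    assert len(candidate_outputs) == len(human_labels) > 0
    disagreements = sum(o != h for o, h in zip(candidate_outputs, human_labels))
    return disagreements / len(human_labels) <= max_disagreement

# 1 disagreement out of 50 labels is exactly 2%, so this candidate just passes the gate.
print(passes_eval_gate(["voicemail"] * 49 + ["answered"], ["voicemail"] * 50))  # True
```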
Canary Rollout
Small % of calls scored by new model + monitored disagreement rate (monitor sketch below)
- Traffic splitting
- Disagreement monitoring
- Rollback triggers
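A sketch of the canary split and disagreement monitor, assuming both models expose an assess(call) method; the 5% traffic share, 10% rollback trigger, and sample minimum are all illustrative.

```python
import random

CANARY_SHARE = 0.05        # fraction of calls also scored by the candidate (illustrative)
ROLLBACK_THRESHOLD = 0.10  # disagreement rate that trips a rollback (illustrative)

def score_with_canary(call, current_model, candidate_model, agreement_log):
    """Score every call with the current model; on a small slice, also score it
    with the candidate and record whether the two assessments agree."""
    result = current_model.assess(call)  # assumed interface, not a real API
    if random.random() < CANARY_SHARE:
        agreement_log.append(result == candidate_model.assess(call))
    return result

def should_roll_back(agreement_log, min_samples=200):
    """Trip the rollback trigger once enough canary samples disagree too often."""
    if len(agreement_log) < min_samples:
        return False
    disagreement_rate = 1 - sum(agreement_log) / len(agreement_log)
    return disagreement_rate > ROLLBACK_THRESHOLD
```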
Fallback
Revert to last known-good model or alternate provider (rollback sketch below)
- Automatic rollback
- Provider switching
- Incident response
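A sketch of the rollback step, assuming registry entries carry an approved flag and a rollback_target key as in the registry sketch above; switching providers would follow the same shape.

```python
def fall_back(active_key, registry):
    """Return the key of the last known-good model to revert to.

    `registry` maps model keys to dicts with "approved" and "rollback_target"
    fields (same shape as the registry sketch above, as plain dicts here)."""
    target_key = registry[active_key].get("rollback_target")
    if target_key is None or not registry[target_key]["approved"]:
        raise RuntimeError(f"no approved rollback target for {active_key}")
    return target_key

# Illustrative use: revert after the canary monitor trips its rollback trigger.
registry = {
    "assessor-current": {"approved": True, "rollback_target": "assessor-previous"},
    "assessor-previous": {"approved": True, "rollback_target": None},
}
print(fall_back("assessor-current", registry))  # "assessor-previous"
```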
Ablation Studies
We tested each candidate model against the same eval set to understand what works and why.
Key Takeaways
- 1."Faster/cheaper" models underperform on nuanced assessment tasks
- 2.Significant disagreement increase on edge cases (multi-turn confirmations, incomplete calls)
- 3.Model changes require eval approval + staged rollout
"Fast" general model
Faster/cheaper models work for assessment tasks
Higher incorrect assessments observed — rejected for assessment use
"Stronger" model
More capable model maintains assessment accuracy
Error rate consistent with baseline — approved for production
Frequently Asked Questions
Common questions about the assessment model-selection protocol.
Methodology
How we built the eval dataset and validated candidate assessment models.
Dataset
Dataset must represent current workflows; otherwise you overfit to history.
Labeling
Humans assign outcome + key-field correctness
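One plausible shape for a labeled record, assuming an outcome label plus per-field correctness flags; the field names and outcome values are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class LabeledCall:
    """Human ground truth for one call in the eval set (illustrative fields)."""
    call_id: str
    outcome: str                                   # e.g. "voicemail", "answered"
    key_fields_correct: dict[str, bool] = field(default_factory=dict)

example = LabeledCall(
    call_id="call-0001",
    outcome="voicemail",
    key_fields_correct={"callback_number": True, "caller_name": False},
)
```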
Evaluation Protocol
Compare automated outputs vs human labels; compute disagreement by category
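A sketch of the disagreement computation, assuming model outputs and human labels are keyed by call ID and carry a category (e.g. call type); this is not the exact scoring script.

```python
from collections import defaultdict

def disagreement_by_category(model_outputs, human_labels):
    """Compare automated assessments to human labels and return the
    disagreement rate per category.

    Both arguments map call_id -> {"category": ..., "label": ...} (assumed shape)."""
    counts = defaultdict(lambda: [0, 0])  # category -> [disagreements, total]
    for call_id, human in human_labels.items():
        model = model_outputs.get(call_id)
        if model is None:
            continue  # calls the model never scored are tracked separately
        category = human["category"]
        counts[category][1] += 1
        if model["label"] != human["label"]:
            counts[category][0] += 1
    return {cat: dis / total for cat, (dis, total) in counts.items()}
```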
Known Limitations
- Dataset must be refreshed as workflows evolve
Evaluation Details
Ready to see this in action?
Book a technical walkthrough with our team to see how this research applies to your use case.
