
Assessment Model Selection: The Day a "Small Swap" Broke Accuracy

Why assessment models need eval gates, not vibes

Assessment is infrastructure.

EvalHarness · CanaryRollout · ModelRegistry · Compliance

  • Risk without evals: Silent regression
  • Solution: Eval gates before rollout
  • Accuracy: Restored with the stronger model
Executive Summary

What we built

A lightweight but strict model-selection protocol for assessment: candidate models must pass a standard evaluation harness before they can be used in production assessments.
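
As a minimal sketch of that gate (in Python; the metric names and threshold values here are illustrative assumptions, not the production configuration), the decision reduces to a pass/fail check over a fixed set of eval metrics:

```python
from dataclasses import dataclass

# Illustrative thresholds; the real gate values are not published in this write-up.
THRESHOLDS = {
    "outcome_accuracy": 0.95,    # agreement with human outcome labels (higher is better)
    "key_field_accuracy": 0.97,  # agreement on extracted key fields (higher is better)
}

@dataclass
class GateDecision:
    approved: bool
    failures: list[str]

def eval_gate(metrics: dict[str, float]) -> GateDecision:
    """A candidate assessment model is approved only if every metric clears its threshold."""
    failures = []
    for name, minimum in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from eval run")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} below required {minimum:.3f}")
    return GateDecision(approved=not failures, failures=failures)
```

The useful property is that the decision is binary and happens before rollout: a candidate that misses any threshold never scores production calls.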

Why it matters

Assessment is infrastructure. If assessments drift, you lose visibility into quality — your dashboards look "fine" while customer experience degrades. Switching assessment models without explicit evals can silently multiply error rates.

Results

  • Prevented silent regression by validating model choice against a consistent test set
  • Established a repeatable process for future model retirements and provider changes
  • Stronger model restored expected error behavior after initial regression

Best for

  • Teams using LLMs to score calls, classify outcomes, or extract structured fields
  • Any system where the assessor is more critical than the agent

Limitations

  • Requires maintaining an eval dataset (and refreshing it as workflows evolve)
  • Some regressions are distributional (only appear on new call types)

The Problem

Assessment regressions rarely announce themselves. Here is how the failure shows up, why it happens, and what it costs.

  • Symptom: "Everything is green" in reporting, but humans observe failures rising
  • Cause: The assessment model is a hidden dependency; changing it changes the measurement tool
  • Business cost: You can't improve what you can't measure; bad assessments produce bad decisions

How It Works

Four layers (model registry, eval harness, canary rollout, fallback), each covering the others' weaknesses.

Model Registry

List of approved models + version metadata

  • Approved model versions
  • Provider tracking
  • Rollback targets
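
A sketch of what the registry might hold; the field names and methods are hypothetical, chosen to illustrate the three bullets above:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class RegisteredModel:
    model_id: str        # e.g. "assessment-eval-v1"
    provider: str        # which vendor/API serves the model
    approved_on: date    # when it passed the eval gate
    retired: bool = False

@dataclass
class ModelRegistry:
    models: list[RegisteredModel] = field(default_factory=list)

    def approve(self, model: RegisteredModel) -> None:
        """Only gate-approved versions are ever added."""
        self.models.append(model)

    def current(self) -> RegisteredModel:
        """Newest approved, non-retired model: the default for production assessments."""
        return [m for m in self.models if not m.retired][-1]

    def rollback_target(self) -> RegisteredModel | None:
        """Previously approved model, kept as the last known-good fallback."""
        active = [m for m in self.models if not m.retired]
        return active[-2] if len(active) >= 2 else None
```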

Eval Harness

Fixed dataset + scoring scripts + threshold gates

  • Consistent test set
  • Scoring metrics
  • Pass/fail thresholds
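
A sketch of the harness loop, assuming the fixed test set is stored as JSONL and the candidate model is wrapped in a score_call callable; both the file layout and the function name are assumptions:

```python
import json
from typing import Callable

def run_eval(dataset_path: str, score_call: Callable[[str], dict]) -> dict[str, float]:
    """Run the candidate over the fixed test set and return the metrics the gate checks."""
    outcome_hits = field_hits = total = 0
    with open(dataset_path) as f:
        for line in f:
            # Each record: {"transcript": ..., "label": {"outcome": ..., "fields": {...}}}
            example = json.loads(line)
            predicted = score_call(example["transcript"])
            total += 1
            outcome_hits += predicted["outcome"] == example["label"]["outcome"]
            field_hits += predicted.get("fields") == example["label"]["fields"]
    if total == 0:
        raise ValueError("empty eval dataset")
    return {
        "outcome_accuracy": outcome_hits / total,
        "key_field_accuracy": field_hits / total,
    }
```

The returned metrics feed the gate sketched earlier; because every candidate runs over the same file, scores stay comparable across providers and model versions.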

Canary Rollout

Small % of calls scored by new model + monitored disagreement rate

  • Traffic splitting
  • Disagreement monitoring
  • Rollback triggers
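
A sketch of a deterministic traffic split plus a disagreement-based rollback trigger; the 5% canary share, 3% threshold, and minimum sample size are illustrative assumptions:

```python
import hashlib

CANARY_SHARE = 0.05        # fraction of calls also scored by the candidate (assumed)
MAX_DISAGREEMENT = 0.03    # disagreement rate that triggers rollback (assumed)
MIN_CANARY_CALLS = 500     # don't act on noise from tiny samples (assumed)

def in_canary(call_id: str) -> bool:
    """Route a small, stable slice of calls to the candidate by hashing the call ID."""
    bucket = int(hashlib.sha256(call_id.encode()).hexdigest(), 16) % 10_000
    return bucket < CANARY_SHARE * 10_000

def should_roll_back(disagreements: int, canary_calls: int) -> bool:
    """Fire the rollback trigger once the candidate diverges too often from the incumbent."""
    if canary_calls < MIN_CANARY_CALLS:
        return False
    return disagreements / canary_calls > MAX_DISAGREEMENT
```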

Fallback

Revert to last known-good model or alternate provider

  • Automatic rollback
  • Provider switching
  • Incident response
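
Continuing the hypothetical registry sketch above, the fallback itself is deliberately simple: if the trigger fires, production scoring reverts to the last known-good model (which may sit with a different provider); otherwise the newest approved model stays in place.

```python
def select_production_model(registry: "ModelRegistry", rollback_triggered: bool) -> "RegisteredModel":
    """Decide which approved model scores production calls right now."""
    if rollback_triggered:
        fallback = registry.rollback_target()  # last known-good model, possibly another provider
        if fallback is not None:
            return fallback
    return registry.current()                  # newest gate-approved model
```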

Ablation Studies

We tested each approach in isolation to understand what works and why.

Key Takeaways

  • 1."Faster/cheaper" models underperform on nuanced assessment tasks
  • 2.Significant disagreement increase on edge cases (multi-turn confirmations, incomplete calls)
  • 3.Model changes require eval approval + staged rollout

"Fast" general model

Faster/cheaper models work for assessment tasks

Higher incorrect assessments observed — rejected for assessment use

"Stronger" model

Winner

More capable model maintains assessment accuracy

Error rate consistent with baseline — approved for production


Methodology

How we built the eval dataset and evaluated candidate models.

Dataset

Name: Assessment Eval Dataset
Composition: Labeled calls including successes, failures, and partial/ambiguous interactions

Dataset must represent current workflows; otherwise you overfit to history.

Labeling

Humans assign outcome + key-field correctness

Evaluation Protocol

Compare automated outputs vs human labels; compute disagreement by category
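
A sketch of that comparison, assuming each record pairs the automated assessment with its human label and a call category; the field names are illustrative:

```python
from collections import defaultdict

def disagreement_by_category(records: list[dict]) -> dict[str, float]:
    """Fraction of calls, per category, where the automated outcome differs from the human label."""
    disagree = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        category = r["category"]  # e.g. "success", "failure", "partial"
        total[category] += 1
        disagree[category] += r["model_outcome"] != r["human_outcome"]
    return {cat: disagree[cat] / total[cat] for cat in total}
```

Breaking disagreement out by category is what surfaces distributional regressions: a candidate can match the incumbent on clean successes while drifting on partial or ambiguous calls.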

Known Limitations

  • Dataset must be refreshed as workflows evolve

Evaluation Details

Last Evaluated: 2026-01-13
Model Version: assessment-eval-v1
