
Context Engineering: When the System Prompt Became "Evidence"

How we stopped assessments from hallucinating entire call flows

Assessment accuracy depends as much on context boundaries as on model choice.

  • Context: contract-defined
  • Evidence: citations required
  • Guardrails: required
  • Root cause: context boundaries, not model choice
  • Solution: evidence-only context, no prompt leakage
  • False positives: eliminated ("perfect hallucinations")

Executive Summary

What we built

A strict assessment context contract: the assessor receives only post-call artifacts (and optionally a compact rubric), never the agent's generative instructions.

Why it matters

Assessment accuracy depends as much on context boundaries as on model choice. Including the agent's system prompt inside the assessor context caused the assessor to "invent" a detailed, plausible call that never happened. Prompt leakage produces the most dangerous failure: an assessment that looks confidently correct and masks broken recordings or failed calls.

Results

  • Eliminated a class of "perfect hallucination" assessments
  • Improved investigator velocity by removing misleading narrative outputs
  • Established the design principle that, for assessments, the model must see only observable artifacts, never "what should have happened"

Best for

  • Any pipeline where an LLM judges another LLM/agent
  • Any system where recordings/transcripts can fail or be missing

Limitations

  • If transcripts are low quality, you may need a retrieval layer to provide structured evidence
  • Rubrics must be carefully phrased so they don't become "story seeds"
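To illustrate the second limitation: a rubric item phrased as a narrative invites the assessor to complete the story, while one phrased as an evidence check does not. A minimal sketch; both items below are hypothetical, not our production rubric:

```python
# Story seed: narrates the intended flow, which the assessor may replay as fact.
RUBRIC_STORY_SEED = (
    "The agent greets the patient, confirms the appointment, and closes politely."
)

# Evidence check: asks only whether the artifacts support a specific claim.
RUBRIC_EVIDENCE_CHECK = (
    "Does any transcript span show the appointment being confirmed? "
    "Cite the span, or answer 'insufficient evidence'."
)
```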

The Problem

Assessment hallucination is a single, subtle failure mode with an outsized cost: the assessor confidently narrates a call that never happened.

Symptom

Assessments describe a full call (patient name, confirmed appointment) even when recordings are blank or calls failed

Cause

The assessor treats the system prompt as ground truth and "fills in" what it expects to find

Business Cost

Debugging goes in the wrong direction; failures persist longer; trust erodes

How It Works

Three mechanisms work together, each covering the others' weaknesses: a context contract, evidence-first prompting, and explicit missing-artifact handling.

Context Contract

Defines what the assessor can and cannot see

  • Allowed: call metadata, transcript segments, extracted fields, event logs
  • Forbidden: system prompt, "intended call flow", internal agent instructions
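A minimal sketch of such a contract in Python. The field names and artifact types are illustrative, not our production schema:

```python
from dataclasses import dataclass

@dataclass
class AssessmentContext:
    """Only observable, post-call artifacts; never generative instructions."""
    call_metadata: dict        # call id, duration, disposition, etc.
    transcript_segments: list  # transcript spans; may be empty if the call failed
    extracted_fields: dict     # structured fields pulled from the artifacts
    event_logs: list           # telephony / pipeline events

FORBIDDEN_KEYS = {"system_prompt", "intended_call_flow", "agent_instructions"}

def build_assessment_context(raw: dict) -> AssessmentContext:
    """Build the assessor's context, failing loudly if forbidden keys leak in."""
    leaked = FORBIDDEN_KEYS & raw.keys()
    if leaked:
        raise ValueError(f"forbidden keys in assessment context: {sorted(leaked)}")
    return AssessmentContext(
        call_metadata=raw.get("call_metadata", {}),
        transcript_segments=raw.get("transcript_segments", []),
        extracted_fields=raw.get("extracted_fields", []) or {},
        event_logs=raw.get("event_logs", []),
    )
```

Failing loudly on forbidden keys turns prompt leakage into a build-time error instead of a silent hallucination downstream.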

Evidence-First Prompting

Require citations to transcript spans / event logs when making claims

  • Citation requirements
  • Evidence linking
  • Claim validation
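One way to enforce this, sketched below with hypothetical names: every claim the assessor emits must carry citations that resolve to real artifact ids, and uncited claims are rejected before the assessment is accepted.

```python
ASSESSOR_INSTRUCTIONS = """\
Assess the call using ONLY the evidence provided below.
Every claim must cite the id of a transcript segment or event-log entry
that supports it. If no artifact supports a claim, do not make it.
If the artifacts are missing or empty, answer: insufficient evidence.
"""

def rejected_claims(claims: list, valid_ids: set) -> list:
    """Return claims that are uncited or cite ids not present in the artifacts."""
    bad = []
    for claim in claims:
        cited = set(claim.get("citations", []))
        if not cited or not cited <= valid_ids:
            bad.append(claim)
    return bad
```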

Missing-Artifact Handling

Explicit "insufficient evidence" outcome if recordings/transcripts are absent

  • Artifact presence checks
  • Fallback outcomes
  • Explicit uncertainty
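A sketch of the presence check, reusing the AssessmentContext above; the outcome names are illustrative:

```python
def assessment_outcome(ctx) -> str:
    """Map artifact presence to an explicit outcome instead of guessing."""
    if not ctx.transcript_segments and not ctx.event_logs:
        return "insufficient_evidence"   # nothing observable: abstain
    if not ctx.transcript_segments:
        return "events_only"             # assess only what event logs support
    return "proceed"                     # full evidence available
```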

Ablation Studies

We tested each approach in isolation to understand what works and why.

Key Takeaways

  1. Removing generative instructions from the context reduces narrative hallucinations
  2. The assessor needs the rubric, not the generative instructions; the call flow must be inferred from evidence
  3. Prompt instructions help, but context control is what prevents the failure class

Configuration A: Context includes the system prompt
Hypothesis: The assessor needs to know the intended call flow
Result: Hallucinated call flow; the assessor "filled in" missing evidence from the prompt

Configuration B (winner): Context excludes the prompt; evidence-only
Hypothesis: The assessor should see only observable artifacts
Result: Explicit "insufficient evidence" when artifacts are missing; false positives removed


Methodology

How we built and evaluated this approach.

Dataset

Name: Context Engineering Eval Dataset
Composition: Failed calls, blank recordings, partial transcripts, and normal calls


Labeling

Humans mark whether evidence exists for each required field
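A hypothetical label record, to make the task concrete; the ids and field names are illustrative:

```python
label = {
    "call_id": "c-0192",                 # hypothetical id
    "evidence_present": {
        "greeting_delivered": False,     # blank recording: no span shows a greeting
        "patient_name": False,           # no transcript span mentions a name
        "appointment_confirmed": False,  # call failed before confirmation
    },
}
```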

Evaluation Protocol

Measure false positives: cases where the assessor asserts details that are absent from the artifacts.
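A minimal sketch of the measurement, under the assumption that assessor outputs and human labels are keyed by field name:

```python
def false_positive_rate(assessments: list, labels: list) -> float:
    """Share of asserted fields that human labels say have no supporting evidence."""
    asserted = false_positives = 0
    for assessment, label in zip(assessments, labels):
        for field_name, claimed in assessment["fields"].items():
            if claimed:
                asserted += 1
                if not label["evidence_present"].get(field_name, False):
                    false_positives += 1
    return false_positives / asserted if asserted else 0.0
```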

Known Limitations

  • Artifact quality still bounds assessment quality

Evaluation Details

Last Evaluated: 2026-01-13
Model Version: context-eng-v1

Ready to see this in action?

Book a technical walkthrough with our team to see how this research applies to your use case.