Context Engineering: When the System Prompt Became "Evidence"
How we stopped assessments from hallucinating entire call flows
Assessment accuracy depends as much on context boundaries as on model choice.
Solution: Evidence-only assessor context, no prompt leakage
False positives: Eliminated ("perfect hallucination" assessments)
Executive Summary
What we built
A strict assessment context contract: the assessor receives only post-call artifacts (and optionally a compact rubric), never the agent's generative instructions.
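Below is a minimal sketch of how such a contract can be assembled in a Python pipeline; the type and field names (PostCallArtifacts, build_assessor_context, transcript_segments, and so on) are illustrative assumptions, not a description of our production API.

```python
# Minimal sketch (hypothetical field names): the assessor's input is built
# exclusively from post-call artifacts plus an optional rubric. The agent's
# system prompt is never part of this structure.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PostCallArtifacts:
    call_id: str
    metadata: dict                  # timestamps, duration, dialed number, outcome codes
    transcript_segments: list[str]  # empty if the recording was blank or the call failed
    extracted_fields: dict          # structured values pulled from the transcript
    event_logs: list[dict]          # telephony / agent runtime events

def build_assessor_context(artifacts: PostCallArtifacts,
                           rubric: Optional[str] = None) -> dict:
    """Assemble everything the assessor is allowed to see. By construction,
    there is no parameter through which the agent's generative instructions
    could be passed in."""
    context = {
        "metadata": artifacts.metadata,
        "transcript_segments": artifacts.transcript_segments,
        "extracted_fields": artifacts.extracted_fields,
        "event_logs": artifacts.event_logs,
    }
    if rubric:
        context["rubric"] = rubric  # compact scoring criteria only, not a call flow
    return context
```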
Why it matters
Assessment accuracy depends as much on context boundaries as on model choice. Including the agent's system prompt inside the assessor context caused the assessor to "invent" a detailed, plausible call that never happened. Prompt leakage produces the most dangerous failure: an assessment that looks confidently correct and masks broken recordings or failed calls.
Results
- Eliminated a class of "perfect hallucination" assessments
- Improved investigator velocity by removing misleading narrative outputs
- Established a durable rule: for assessments, the model sees only observable artifacts, never "what should have happened"
Best for
- Any pipeline where an LLM judges another LLM/agent
- Any system where recordings/transcripts can fail or be missing
Limitations
- If transcripts are low quality, you may need a retrieval layer to provide structured evidence
- Rubrics must be carefully phrased so they don't become "story seeds"
The Problem
Letting the agent's system prompt into the assessor context produced one specific, expensive failure mode. Its symptom, cause, and business cost are summarized below.
Symptom
Assessments describe a full call (patient name, confirmed appointment) even when recordings are blank or calls failed
Cause
The assessor treats the system prompt as ground truth and "fills in" what it expects to find
Business Cost
Confident but fictional assessments masked broken recordings and failed calls, sending investigators chasing calls that never happened
How It Works
Three complementary mechanisms, each covering the others' weaknesses.
Context Contract
Defines what the assessor can and cannot see
- Allowed: call metadata, transcript segments, extracted fields, event logs
- Forbidden: system prompt, "intended call flow", internal agent instructions
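An illustrative enforcement check for this allow/forbid split, assuming the assessor context is passed around as a plain dict; the key names (system_prompt, intended_call_flow, agent_instructions) are placeholders for this sketch.

```python
# Illustrative enforcement of the context contract (key names are assumptions).
# Anything outside the allow-list is rejected before the assessor sees it,
# so a system prompt cannot leak in through an unreviewed field.
ALLOWED_KEYS = {"metadata", "transcript_segments", "extracted_fields",
                "event_logs", "rubric"}
FORBIDDEN_KEYS = {"system_prompt", "intended_call_flow", "agent_instructions"}

def enforce_context_contract(context: dict) -> dict:
    """Raise if the context contains forbidden or unrecognized keys."""
    leaked = context.keys() & FORBIDDEN_KEYS
    if leaked:
        raise ValueError(f"Forbidden context keys present: {sorted(leaked)}")
    unknown = context.keys() - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"Unknown context keys (not in allow-list): {sorted(unknown)}")
    return context
```

Rejecting unknown keys, not just the known-forbidden ones, is the design choice that matters: new fields cannot quietly widen what the assessor sees.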
Evidence-First Prompting
Require citations to transcript spans / event logs when making claims
- Citation requirements
- Evidence linking
- Claim validation
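One way the claim-validation step could look in code: the assessor is asked to cite transcript segment indices for every claim, and claims whose citations do not resolve to real, non-empty segments are flagged as ungrounded. The claim schema (claim, evidence_segments) is a hypothetical shape used for this sketch.

```python
# Sketch of claim validation: keep a claim only if every cited transcript
# segment exists and contains text; otherwise mark it as ungrounded.
def validate_claims(claims: list[dict], transcript_segments: list[str]) -> list[dict]:
    """Each claim is expected to look like:
        {"claim": "Appointment confirmed for Tuesday",
         "evidence_segments": [3, 4]}
    """
    validated = []
    for claim in claims:
        cited = claim.get("evidence_segments", [])
        grounded = bool(cited) and all(
            0 <= i < len(transcript_segments) and transcript_segments[i].strip()
            for i in cited
        )
        validated.append({**claim, "grounded": grounded})
    return validated
```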
Missing-Artifact Handling
Explicit "insufficient evidence" outcome if recordings/transcripts are absent
- Artifact presence checks
- Fallback outcomes
- Explicit uncertainty
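A hypothetical pre-check that runs before the assessor is invoked at all: if the artifacts needed to judge the call are absent, the pipeline returns an explicit insufficient-evidence outcome instead of asking the model to assess nothing. The key names mirror the illustrative contract above.

```python
from typing import Optional

def precheck_artifacts(context: dict) -> Optional[dict]:
    """Return an explicit insufficient-evidence outcome if key artifacts are
    missing; otherwise return None so the evidence-only assessor runs."""
    missing = []
    if not context.get("transcript_segments"):
        missing.append("transcript")
    if not context.get("event_logs"):
        missing.append("event_logs")
    if missing:
        return {
            "outcome": "insufficient_evidence",
            "missing_artifacts": missing,
            "note": "Assessment not attempted; no observable evidence to judge.",
        }
    return None
```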
Ablation Studies
We tested each approach in isolation to understand what works and why.
Key Takeaways
1. Removing generative instructions from context reduces narrative hallucinations
2. The assessor needs the rubric, not the generative instructions; call flow must be inferred from evidence
3. Prompted instructions help, but context control is what prevents the failure class
| Configuration | Hypothesis | Result |
| --- | --- | --- |
| Context includes system prompt | Assessor needs to know the intended call flow | Hallucinated call flow: the assessor "filled in" missing evidence from the prompt |
| Context excludes prompt (evidence-only) | Assessor should only see observable artifacts | "Insufficient evidence" when artifacts are missing; false positives removed |
Methodology
How we built and evaluated this assessment contract.
Labeling
Humans mark whether evidence exists for each required field
Evaluation Protocol
Measure false positives where assessor asserts details absent from artifacts
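As a sketch of how that measurement could be computed, assuming human labels are recorded per required field as a boolean (evidence exists or not) and assessor output is reduced to per-field assertions; the function and argument names are illustrative.

```python
# Illustrative false-positive metric: a false positive is an assessor
# assertion for a field that human labelers marked as having no evidence.
def false_positive_rate(assessor_claims: dict[str, bool],
                        human_evidence_labels: dict[str, bool]) -> float:
    """assessor_claims[field] is True if the assessor asserted a value;
    human_evidence_labels[field] is True if labelers found supporting evidence."""
    fields = human_evidence_labels.keys()
    false_positives = sum(
        1 for f in fields
        if assessor_claims.get(f, False) and not human_evidence_labels[f]
    )
    unsupported = sum(1 for f in fields if not human_evidence_labels[f])
    return false_positives / unsupported if unsupported else 0.0
```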
Known Limitations
- Artifact quality still bounds assessment quality
