Eval playbook
Master the eval loop
A short, practical guide to building evals that generalize beyond the traces you see.
Why evals matter
- Evals are your test suite for model behavior, not just accuracy metrics.
- They turn vague quality goals into concrete pass/fail checks.
- A model that passes strong evals is safer to ship and easier to improve.
The three gulfs
- Specification: what you intend vs. what your prompt actually asks for; unstated expectations go unenforced.
- Comprehension: the system produces more outputs than you can manually review, so failures hide in unread traces.
- Generalization: behavior that looks fine on the traces you inspected breaks on new topics or phrasing.
Analyze → Measure → Improve
- Analyze real traces to find recurring failure modes.
- Measure those failures with explicit rules or judge rubrics.
- Improve the prompt or model, then re-run the evals to verify the fix; a minimal measurement loop is sketched below.
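A minimal sketch of the Measure step, assuming traces are plain dicts with an `output` field and each failure mode found during analysis gets its own checker function. The names here (`measure`, `no_unsupported_refund_promise`) are hypothetical, not a prescribed API.

```python
from typing import Callable

# Hypothetical trace shape: {"id": ..., "input": ..., "output": ...}
Trace = dict
Check = Callable[[Trace], bool]  # True = pass, False = fail


def measure(traces: list[Trace], checks: dict[str, Check]) -> dict[str, float]:
    """Run every check over every trace; report a pass rate per failure mode."""
    results = {}
    for name, check in checks.items():
        passes = sum(1 for t in traces if check(t))
        results[name] = passes / len(traces) if traces else 0.0
    return results


# A deterministic check for one failure mode surfaced during analysis.
def no_unsupported_refund_promise(trace: Trace) -> bool:
    return "guaranteed refund" not in trace["output"].lower()


if __name__ == "__main__":
    traces = [
        {"id": 1, "input": "Can I get my money back?", "output": "I can check your order status."},
        {"id": 2, "input": "Refund?", "output": "A guaranteed refund is on its way!"},
    ]
    print(measure(traces, {"refund_promise": no_unsupported_refund_promise}))
    # Re-run the same measurement after every prompt or model change to verify improvement.
```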
Error analysis checklist
- Collect representative traces that cover real user variation.
- Open-code failures first: write free-form notes on what went wrong before deciding on eval criteria.
- Group failures into a clear taxonomy and prioritize by impact; one lightweight way to do this is sketched after this list.
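A lightweight way to structure open coding, sketched with hypothetical names: record a free-form note per failing trace, then map the notes into a small taxonomy and count occurrences as a rough impact signal.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Annotation:
    trace_id: int
    note: str                     # free-form observation written during open coding
    category: str | None = None   # assigned later, once a taxonomy emerges


annotations = [
    Annotation(1, "cited a policy that does not exist"),
    Annotation(2, "ignored the user's second question"),
    Annotation(3, "made up a discount code"),
]

# After open coding, collapse the notes into a small taxonomy and count frequency
# as a rough proxy for impact.
taxonomy = {1: "hallucinated_policy", 2: "ignored_question", 3: "hallucinated_policy"}
for ann in annotations:
    ann.category = taxonomy[ann.trace_id]

print(Counter(a.category for a in annotations).most_common())
# [('hallucinated_policy', 2), ('ignored_question', 1)]
```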
Rules vs. LLM judge
- Use deterministic rules for clear, mechanical requirements.
- Use LLM judges for nuance: tone, policy, relevance, correctness.
- Keep outputs binary: pass/fail with evidence tied to the trace, as in the example below.
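A sketch of the split in code, where `call_llm(prompt) -> str` is a hypothetical stand-in for whatever model client you use: a deterministic rule handles the mechanical requirement, an LLM judge handles the nuanced one, and both return a binary verdict with evidence.

```python
import json


def rule_has_order_id(output: str) -> dict:
    """Deterministic rule: a mechanical requirement, no model needed."""
    passed = "order #" in output.lower()
    return {"pass": passed, "evidence": "order id present" if passed else output}


JUDGE_PROMPT = """You are grading one assistant reply.
Criterion: the reply must not promise actions the assistant cannot take.
Answer with JSON: {{"pass": true or false, "evidence": "quote the exact sentence that decided it"}}

Reply to grade:
{output}
"""


def judge_no_overpromising(output: str, call_llm) -> dict:
    """LLM judge for a nuanced criterion; call_llm is a hypothetical client
    that takes a prompt string and returns the model's text."""
    raw = call_llm(JUDGE_PROMPT.format(output=output))
    verdict = json.loads(raw)  # keep the verdict binary: pass/fail plus evidence
    return {"pass": bool(verdict["pass"]), "evidence": verdict["evidence"]}
```

Returning the same pass/evidence shape from both kinds of check keeps aggregation and review uniform.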
Strong rubric essentials
- State explicit fail conditions tied to the contract.
- Define scope: when the rule applies and when it doesn’t.
- Require evidence that points to the exact message turn; see the rubric sketch below for one way to encode all three.
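One possible encoding of these essentials, with hypothetical names and fields: a small rubric object that carries its fail conditions, scope, and evidence requirement, and renders itself into a judge prompt.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    name: str
    fail_conditions: list[str]   # explicit fail conditions tied to the contract
    scope: str                   # when the rule applies and when it doesn't
    evidence_instruction: str = "Quote the exact message turn that triggered the failure."

    def to_judge_prompt(self, transcript: str) -> str:
        conditions = "\n".join(f"- {c}" for c in self.fail_conditions)
        return (
            f"Rubric: {self.name}\n"
            f"Scope: {self.scope}\n"
            f"Fail if any of the following hold:\n{conditions}\n"
            f"{self.evidence_instruction}\n"
            'Return JSON: {"pass": true or false, "evidence": "..."}\n\n'
            f"Transcript:\n{transcript}"
        )


refund_rubric = Rubric(
    name="No unauthorized refund promises",
    fail_conditions=[
        "The assistant promises a refund without an order lookup.",
        "The assistant quotes a refund window not present in the policy context.",
    ],
    scope="Applies only to turns where the user asks about refunds; ignore unrelated turns.",
)
print(refund_rubric.to_judge_prompt("user: Can I get a refund?\nassistant: ..."))
```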