Eval playbook

Master the eval loop

A short, practical guide to building evals that generalize beyond the traces you see.

Why evals matter

  • Evals are your test suite for model behavior, not just accuracy metrics.
  • They turn vague quality goals into concrete pass/fail checks.
  • A model that passes strong evals is safer to ship and easier to improve.

The three gulfs

  • Specification: the gap between what you intend and what your prompt actually asks for.
  • Comprehension: more outputs than you can manually review.
  • Generalization: the model breaks on new topics or phrasing.

Analyze → Measure → Improve

  • Analyze real traces to find recurring failure modes.
  • Measure those failures with explicit rules or judge rubrics.
  • Improve the model or prompt, then re-run evals to verify; a minimal measure loop is sketched below.
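
A minimal sketch of the measure step in Python, assuming each trace is a dict with "input" and "output" fields and each failure mode has already been turned into a pass/fail check; the check names and thresholds here are illustrative, not a fixed API.

  from typing import Callable

  Trace = dict                     # assumed shape: {"input": str, "output": str, ...}
  Check = Callable[[Trace], bool]  # True means the trace passes this check

  def measure(traces: list[Trace], checks: dict[str, Check]) -> dict[str, float]:
      """Run every check over every trace; report the failure rate per failure mode."""
      rates = {}
      for name, check in checks.items():
          failures = sum(1 for trace in traces if not check(trace))
          rates[name] = failures / len(traces) if traces else 0.0
      return rates

  # Checks derived from observed failure modes (illustrative examples).
  checks = {
      "mentions_refund_policy": lambda t: "refund policy" in t["output"].lower(),
      "within_length_limit": lambda t: len(t["output"].split()) <= 150,
  }

  # After each prompt or model change, re-run measure() and compare the rates.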

Error analysis checklist

  • Collect representative traces that cover real user variation.
  • Open-code failures before deciding on eval criteria.
  • Group failures into a clear taxonomy and prioritize by impact (a counting sketch follows this list).
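
To make the open-coding step concrete, here is a small Python sketch (with made-up labels) of turning free-form failure notes into a taxonomy ranked by frequency, a first proxy for impact.

  from collections import Counter

  # Free-form open codes written while reviewing individual traces (illustrative).
  open_codes = [
      "ignored refund policy",
      "wrong currency in total",
      "ignored refund policy",
      "tone too casual",
      "wrong currency in total",
      "ignored refund policy",
  ]

  # Group the codes into a taxonomy and rank failure modes by how often they occur.
  taxonomy = Counter(open_codes)
  for failure_mode, count in taxonomy.most_common():
      print(f"{failure_mode}: {count}")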

Rules vs. LLM judge

  • Use deterministic rules for clear, mechanical requirements.
  • Use LLM judges for nuance: tone, policy, relevance, correctness.
  • Keep outputs binary: pass/fail with evidence tied to the trace; both check types are sketched after this list.
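
The sketch below shows one deterministic rule and one LLM-judge check side by side. It assumes a call_llm(prompt) -> str helper that you wire to your own provider; both checks return a binary verdict plus evidence pulled from the trace.

  import re

  def rule_no_email_leak(trace: dict) -> dict:
      """Deterministic rule: fail if the reply leaks an email address."""
      match = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", trace["output"])
      return {"pass": match is None,
              "evidence": match.group(0) if match else None}

  JUDGE_PROMPT = """You are grading one assistant reply.
  Question: does the reply actually answer the user's request?
  Answer with exactly PASS or FAIL on the first line, then one line quoting
  the sentence from the reply that justifies your verdict.

  User: {input}
  Assistant: {output}"""

  def judge_relevance(trace: dict, call_llm) -> dict:
      """LLM judge for nuance; call_llm(prompt) -> str is an assumed helper you supply."""
      raw = call_llm(JUDGE_PROMPT.format(input=trace["input"], output=trace["output"]))
      lines = raw.strip().splitlines()
      verdict = lines[0].strip().upper() if lines else "FAIL"
      return {"pass": verdict == "PASS",
              "evidence": lines[1].strip() if len(lines) > 1 else ""}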

Strong rubric essentials

  • State explicit fail conditions tied to the contract.
  • Define scope: when the rule applies and when it doesn’t.
  • Require evidence that points to the exact message turn (an example rubric follows).
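
One way to apply these essentials is to write the rubric directly into the judge prompt. The template below is an illustrative Python constant with placeholder policy details, not a canonical rubric.

  # Illustrative judge rubric; the refund policy details are placeholders.
  REFUND_RUBRIC = """Criterion: the reply must follow the refund policy in the system prompt.

  Scope: applies only when the user asks about refunds or cancellations;
  for any other request, return PASS without further analysis.

  Fail conditions:
  - The reply promises a refund outside the stated 30-day window.
  - The reply skips the required escalation step for disputed charges.

  Evidence: quote verbatim the exact message turn that triggered your verdict.
  Output: first line PASS or FAIL, second line the quoted evidence."""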