Eval playbook

Master the eval loop

A short, practical guide to building evals that generalize beyond the traces you see.

Why evals matter

  • Evals are your test suite for model behavior, not just accuracy metrics.
  • They turn vague quality goals into concrete pass/fail checks.
  • A model that passes strong evals is safer to ship and easier to improve.

The three gulfs

  • Specification: the gap between what you intend and what your prompt actually asks for.
  • Comprehension: more outputs than you can manually review.
  • Generalization: the model breaks on new topics or phrasing.

Analyze → Measure → Improve

  • Analyze real traces to find recurring failure modes.
  • Measure those failures with explicit rules or judge rubrics.
  • Improve the model or prompt, then re-run evals to verify; a minimal measure loop is sketched below.
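
A minimal sketch of the measure step in Python, assuming each trace is a dict with "input" and "output" fields and each failure mode has already been turned into a pass/fail check; the check names and thresholds here are illustrative, not a fixed API.

  from typing import Callable

  Trace = dict                     # assumed shape: {"input": str, "output": str, ...}
  Check = Callable[[Trace], bool]  # True means the trace passes this check

  def measure(traces: list[Trace], checks: dict[str, Check]) -> dict[str, float]:
      """Run every check over every trace; report the failure rate per failure mode."""
      rates = {}
      for name, check in checks.items():
          failures = sum(1 for trace in traces if not check(trace))
          rates[name] = failures / len(traces) if traces else 0.0
      return rates

  # Checks derived from observed failure modes (illustrative examples).
  checks = {
      "mentions_refund_policy": lambda t: "refund policy" in t["output"].lower(),
      "within_length_limit": lambda t: len(t["output"].split()) <= 150,
  }

  # After each prompt or model change, re-run measure() and compare the rates.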

Error analysis checklist

  • Collect representative traces that cover real user variation.
  • Open-code failures before deciding on eval criteria.
  • Group failures into a clear taxonomy and prioritize by impact (a counting sketch follows this list).
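
To make the open-coding step concrete, here is a small Python sketch (with made-up labels) of turning free-form failure notes into a taxonomy ranked by frequency, a first proxy for impact.

  from collections import Counter

  # Free-form open codes written while reviewing individual traces (illustrative).
  open_codes = [
      "ignored refund policy",
      "wrong currency in total",
      "ignored refund policy",
      "tone too casual",
      "wrong currency in total",
      "ignored refund policy",
  ]

  # Group the codes into a taxonomy and rank failure modes by how often they occur.
  taxonomy = Counter(open_codes)
  for failure_mode, count in taxonomy.most_common():
      print(f"{failure_mode}: {count}")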

Rules vs. LLM judge

  • Use deterministic rules for clear, mechanical requirements.
  • Use LLM judges for nuance: tone, policy, relevance, correctness.
  • Keep outputs binary: pass/fail with evidence tied to the trace; both check types are sketched after this list.
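
The sketch below shows one deterministic rule and one LLM-judge check side by side. It assumes a call_llm(prompt) -> str helper that you wire to your own provider; both checks return a binary verdict plus evidence pulled from the trace.

  import re

  def rule_no_email_leak(trace: dict) -> dict:
      """Deterministic rule: fail if the reply leaks an email address."""
      match = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", trace["output"])
      return {"pass": match is None,
              "evidence": match.group(0) if match else None}

  JUDGE_PROMPT = """You are grading one assistant reply.
  Question: does the reply actually answer the user's request?
  Answer with exactly PASS or FAIL on the first line, then one line quoting
  the sentence from the reply that justifies your verdict.

  User: {input}
  Assistant: {output}"""

  def judge_relevance(trace: dict, call_llm) -> dict:
      """LLM judge for nuance; call_llm(prompt) -> str is an assumed helper you supply."""
      raw = call_llm(JUDGE_PROMPT.format(input=trace["input"], output=trace["output"]))
      lines = raw.strip().splitlines()
      verdict = lines[0].strip().upper() if lines else "FAIL"
      return {"pass": verdict == "PASS",
              "evidence": lines[1].strip() if len(lines) > 1 else ""}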

Strong rubric essentials

  • State explicit fail conditions tied to the contract.
  • Define scope: when the rule applies and when it doesn’t.
  • Require evidence that points to the exact message turn (an example rubric follows).
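
One way to apply these essentials is to write the rubric directly into the judge prompt. The template below is an illustrative Python constant with placeholder policy details, not a canonical rubric.

  # Illustrative judge rubric; the refund policy details are placeholders.
  REFUND_RUBRIC = """Criterion: the reply must follow the refund policy in the system prompt.

  Scope: applies only when the user asks about refunds or cancellations;
  for any other request, return PASS without further analysis.

  Fail conditions:
  - The reply promises a refund outside the stated 30-day window.
  - The reply skips the required escalation step for disputed charges.

  Evidence: quote verbatim the exact message turn that triggered your verdict.
  Output: first line PASS or FAIL, second line the quoted evidence."""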