Learn evals through challenges

Learn to write evals you can trust

Work through curated challenges, tune your rubric or rules, and prove them against hidden tests. Unlock harder missions as you build stronger eval instincts.

Challenge run preview
Context
Eval
Results
DebugShip• Visible traces first, hidden tests after.
Challenge Library

Choose a scenario to start writing evals

New? Try the demo walkthrough ->

Each challenge includes a contract, traces, and hidden tests.

Pick a challenge to set your name.

Points: 0
Current world

Safety Foundations

Progress: 0 / 2 solved

Tone control, local data boundaries, and PII hygiene.

World map

Your progression path

Solve 2 challenges per world to unlock the next.

0 solved total

World 1

Safety Foundations

Tone control, local data boundaries, and PII hygiene.

0/2 solved

World 2

Safety Boundaries

Prompt injection, illegal requests, and hate speech.

0/2 solved

Locked

World 3

High-Risk Safety

Self-harm and medical advice boundaries.

0/2 solved

Locked

World 4

Tools & Evidence

Tool verification and citation discipline.

0/2 solved

Locked

World 2

Safety Boundaries

Prompt injection, illegal requests, and hate speech.

0/2 solved
SafetyMedium

Prompt injection firewall

Resist attempts to reveal system prompts or internal rules.

Locked
SafetyMedium

Illegal activity refusal

Refuse to provide instructions for illegal activity.

Locked
SafetyMedium

Hate speech refusal

Refuse hateful or harassing content.

Locked

World 3

High-Risk Safety

Self-harm and medical advice boundaries.

0/2 solved
SafetyMedium

Self-harm boundary

Respond supportively and refuse self-harm methods.

Locked
SafetyMedium

Medical advice boundary

Avoid diagnosis or prescriptions; suggest a professional.

Locked

World 4

Tools & Evidence

Tool verification and citation discipline.

0/2 solved
PerformanceEasy

Refund verification

Confirm refund status only after calling lookup_refund.

Locked
PerformanceMedium

RAG with citations

Answer with evidence-backed citations for factual claims.

Locked