Skill

eval-first

Bakes evaluation thinking into model output work. Every plan names what success looks like, where things can go wrong, and proposes simple checks validating key assumptions.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/eval-first:eval-first

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

ReadGrepGlobWriteEditBash

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You invoked this skill because the change is non-trivial. The description's guidance applies but you also want concrete patterns. Here's the full protocol.

Supporting Files

references/anchor-patterns.mdreferences/cheap-checks.mdreferences/comparison-templates.md

SKILL.md

46 lines · ~650 tokens

Stats

Stars0

MaintenanceGood

Last CommitMay 14, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Eval-First — full protocol

You invoked this skill because the change is non-trivial. The description's guidance applies but you also want concrete patterns. Here's the full protocol.

Plan must include

Success looks like: 1-2 sentences naming the criterion. What does a correct output of this changed system look like?
Things go wrong when: 2-4 specific failure modes. For any constraint change (do X / don't Y), include at least one negative case (the over-refusal / drift guard).
Simple evals validating these assumptions: 3-5 concrete cases (input + expected behavior + cheapest fitting check). Generate them yourself from context. Never ask the user to produce them.

Run the evals as part of doing the work. Show verdict first (one line: "3 of 3 cases match, 1 differs"), evidence (diff or table) second.

Pick the right pattern for each step

Generating anchor cases → references/anchor-patterns.md
Cheapest fitting check per failure type → references/cheap-checks.md
Verdict-first comparison templates → references/comparison-templates.md

Two-phase for high-stakes judgments

When grading subjective output (tone, coherence, "did the change land?"), run two passes:

Cold-read — answer 3-5 naive questions about the output WITHOUT the rubric in context. (What's it claiming? Did it confuse you? Could you summarize the argument?)
Apply rubric — score using the cold-read answers as evidence.

Single-pass judges confabulate — they pattern-match the rubric instead of reading the content.

Persist when load-bearing

If the eval cases would be useful for the next change to the same code, save them to evals/<feature>.md in the current project. Format in references/comparison-templates.md. Opportunistic, not mandatory.

Anti-rules

Never ask the user to label or produce test cases — generate from context.
Never output Phase 0 / Phase 1 / Phase 2 headers — that's ceremony.
Never refuse the work because evals aren't set up.
For constraint changes (do X / don't Y), always include the negative case.
Deterministic before model-based: a regex beats an LLM judge call.
Binary pass/fail, not 1-5 scalars.

eval-first

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

eval-first

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

Eval-First — full protocol

Plan must include

Pick the right pattern for each step

Two-phase for high-stakes judgments

Persist when load-bearing

Anti-rules

Similar Skills

Eval-First — full protocol

Plan must include

Pick the right pattern for each step

Two-phase for high-stakes judgments

Persist when load-bearing

Anti-rules

Similar Skills