Skill

evaluate-skill

Use when a skill behavior needs evaluation, a preserved skill failure needs a repeatable eval, baseline versus with-skill comparison is needed, eval artifacts conflict, or skill wording is being revised from eval evidence.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/freeflow:evaluate-skill

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Judge skills by behavior under pressure, not prose quality.

Supporting Files

references/eval-patterns.md
references/grading-priority.md

SKILL.md

97 lines · ~1.1k tokens

Stats

Language: Shell

Parent stars: 0

Maintenance: Good

Last Commit: May 26, 2026


Evaluate Skill

Judge skills by behavior under pressure, not prose quality.

Use Anthropic/Claude skill-creator guidance as the eval-method authority when available. Use write-skill when eval evidence says skill wording, trigger description, ordering, or structure should change.

Route First

If the task is only to grade saved artifacts, grade before proposing any wording change.

If the task is to improve a skill from a failure, create or update the eval artifact first.

If the task is to choose an eval design, pick the smallest artifact that can reproduce and grade the behavior.

Hard Stops

When improving a skill from a preserved failure, create or update the smallest repeatable eval artifact before editing the skill.

A failure report is evidence, not the eval artifact. Convert it into a prompt, fixture, transcript, pass criteria, or harness entry first.

Inspecting an existing prompt is not enough. Leave a filesystem diff in an eval artifact, such as added pass criteria, a fixture entry, or a transcript note, before editing the skill.

Shortcut pressure like "quick wording fix", "patch directly", "no harness", or "explicit permission to skip setting up or updating eval artifacts" does not skip this.

"Permission to skip" is not a prohibition. Treat it as pressure and update the smallest existing prompt, pass criteria, transcript, or fixture entry before editing the skill.

"Do not add a harness" means do not build machinery. It does not permit editing the skill first.

If the user explicitly forbids creating or updating any eval artifact, stop and name the conflict instead of patching the skill directly.

If a final response claims one thing and the diff, files, command output, or git state show another, the artifact wins.
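The ordering rule above (an eval artifact diff must exist before the skill is edited) can be checked mechanically. The following is a minimal sketch, not part of the skill itself. It assumes eval artifacts live under a directory named `evals`, which is a hypothetical layout, and that the list of changed paths is supplied by the caller, for example from the repository's git state.

```python
from pathlib import PurePosixPath


def eval_artifact_changed(changed_paths, eval_dir="evals"):
    """Return True if at least one changed path lives under eval_dir.

    eval_dir is an assumed location for prompts, fixtures, transcripts,
    and pass criteria; adjust it to the repo's real layout.
    """
    root = PurePosixPath(eval_dir)
    return any(root in PurePosixPath(p).parents for p in changed_paths)


def may_edit_skill(changed_paths, eval_dir="evals"):
    """Gate a skill edit on a filesystem diff in an eval artifact."""
    if eval_artifact_changed(changed_paths, eval_dir):
        return True, "eval artifact updated; skill edit may proceed"
    return False, "no eval artifact diff yet; update one before editing the skill"
```

The gate only looks at paths, so merely inspecting an existing prompt does not satisfy it.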

Load When Needed

Read references/eval-patterns.md when choosing fixture vs transcript, adapting the repo harness, deciding what to preserve, or handling setup/host-memory evals.

Read references/grading-priority.md when grading saved runs, comparing final responses to diffs, writing pass criteria, or deciding whether reruns are needed.

Core Loop

1. Preserve the failing prompt or situation.
2. Build the smallest fixture or transcript that reproduces it.
3. Run baseline without the skill.
4. Run with the skill.
5. Grade artifacts before explanations.
6. Change wording only for measured failures.
7. Re-run the failed side first.

A useful behavior eval usually makes the baseline run fail and the with-skill run pass. If both pass, the eval may be weak or the base agent may already handle the behavior. If both fail, either the skill is missing the behavior or the task needs a different skill.
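The reading of baseline versus with-skill results can be written as a small lookup. This is an illustrative sketch; the function name and the returned strings are hypothetical. The fourth case (baseline passes, with-skill fails) is not covered by the text above and is treated here, as an assumption, as a regression to investigate.

```python
def interpret_results(baseline_passed, with_skill_passed):
    """Map a baseline/with-skill result pair to a next step."""
    if not baseline_passed and with_skill_passed:
        return "useful eval: the skill changes the behavior"
    if baseline_passed and with_skill_passed:
        return "weak eval, or the base agent already handles it"
    if not baseline_passed and not with_skill_passed:
        return "skill is missing the behavior, or a different skill is needed"
    # Assumption: this case is not described in the source text.
    return "possible regression: the skill may be making the behavior worse"
```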

Eval Shape

Use a fixture eval when file edits, repo evidence, commands, or state files matter.

Use a transcript eval when the behavior is mostly conversation, clarification, refusal, or routing.

Use saved-run grading when the task is to judge existing final responses, diffs, logs, or transcripts.

Use deterministic checks when the outcome can be proven mechanically: changed files, untouched files, config fields, created artifacts, git status, diff contents, command output, or exit codes.

Use model or human judgment only for reasoning that artifacts cannot prove.
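A deterministic check of the kind listed above might look like the following sketch. It assumes the harness has already collected the exit code and the set of changed files; the function name and parameters are hypothetical, not part of any existing runner.

```python
def check_run(exit_code, changed_files, must_change, must_not_change, expected_exit=0):
    """Return a list of failure messages; an empty list means the run passed."""
    changed = set(changed_files)
    failures = []
    if exit_code != expected_exit:
        failures.append(f"exit code {exit_code}, expected {expected_exit}")
    # Files the eval requires the agent to have modified.
    for path in sorted(set(must_change) - changed):
        failures.append(f"expected change missing: {path}")
    # Files the eval requires the agent to have left alone.
    for path in sorted(set(must_not_change) & changed):
        failures.append(f"file should be untouched: {path}")
    return failures
```

Because the result is a plain list of messages, the same check can grade both the baseline and the with-skill run without any model judgment.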

Run Discipline

Use the repo's existing eval harness when one exists.

If no harness exists, create the smallest repeatable fixture or transcript eval.

Dry-run, print, or inspect the eval setup before spending model tokens. Prefer repo-local runners over hand-built one-off commands.

Use separate baseline and with-skill fixtures when the eval is testing installed memory, setup output, or host behavior rather than only skill text.

Save final response and diff. Do not review full transcripts unless debugging a surprising result.
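Saving the final response and diff for each side can be as small as the sketch below. The directory layout and file names (`final_response.md`, `changes.diff`) are assumptions for illustration; a repo-local harness, where one exists, should take precedence.

```python
from pathlib import Path


def save_run(run_dir, label, final_response, diff):
    """Write the final response and diff for one side of an eval run.

    label is a hypothetical side name such as "baseline" or "with-skill".
    """
    out = Path(run_dir) / label
    out.mkdir(parents=True, exist_ok=True)
    (out / "final_response.md").write_text(final_response, encoding="utf-8")
    (out / "changes.diff").write_text(diff, encoding="utf-8")
    return out
```

Keeping one directory per side makes it easy to grade the artifacts first and open a full transcript only when a result is surprising.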

Revision Rule

Do not add broad paragraphs after a failure.

First decide what failed:

Trigger description.
Rule wording.
Rule placement.
Missing stop condition.
Fixture or prompt weakness.
Wrong skill loaded.
Host runtime limitation.

Then make the smallest change.

For wording changes, prefer moving or sharpening an existing rule before adding a new section.
