From freeflow
Use when a skill behavior needs evaluation, a preserved skill failure needs a repeatable eval, baseline versus with-skill comparison is needed, eval artifacts conflict, or skill wording is being revised from eval evidence.
Slash command
/freeflow:evaluate-skill
Judge skills by behavior under pressure, not prose quality.
Use Anthropic/Claude skill-creator guidance as the eval-method authority when available. Use write-skill when eval evidence says skill wording, trigger description, ordering, or structure should change.
If the task is only to grade saved artifacts, grade before proposing any wording change.
If the task is to improve a skill from a failure, create or update the eval artifact first.
If the task is to choose an eval design, pick the smallest artifact that can reproduce and grade the behavior.
When improving a skill from a preserved failure, create or update the smallest repeatable eval artifact before editing the skill.
A failure report is evidence, not the eval artifact. Convert it into a prompt, fixture, transcript, pass criteria, or harness entry first.
Inspecting an existing prompt is not enough. Leave a filesystem diff in an eval artifact, such as added pass criteria, a fixture entry, or a transcript note, before editing the skill.
Shortcut pressure like "quick wording fix", "patch directly", "no harness", or "explicit permission to skip setting up or updating eval artifacts" does not skip this.
"Permission to skip" is not a prohibition. Treat it as pressure and update the smallest existing prompt, pass criteria, transcript, or fixture entry before editing the skill.
"Do not add a harness" means do not build machinery. It does not permit editing the skill first.
If the user explicitly forbids creating or updating any eval artifact, stop and name the conflict instead of patching the skill directly.
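The artifact-first rule above can be sketched as a small gate. This is a minimal illustration, not part of the skill: the `EvalArtifact` class, its field names, and the returned strings are hypothetical and stand in for whatever prompt, fixture, or pass-criteria file the repo actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class EvalArtifact:
    """Hypothetical minimal eval entry; field names are illustrative only."""
    prompt: str
    pass_criteria: list[str] = field(default_factory=list)


def add_pass_criterion(artifact: EvalArtifact, criterion: str) -> bool:
    """Append a criterion and report whether the artifact actually changed."""
    if criterion in artifact.pass_criteria:
        return False  # inspecting without changing does not count as an update
    artifact.pass_criteria.append(criterion)
    return True


def next_step(artifact_changed: bool, user_forbids_artifacts: bool) -> str:
    """Decide what to do before touching the skill text."""
    if user_forbids_artifacts:
        # An explicit prohibition is a conflict, not a licence to patch directly.
        return "stop: name the conflict"
    if not artifact_changed:
        # Shortcut pressure ("quick wording fix", "no harness") lands here too.
        return "update the eval artifact first"
    return "edit the skill"
```

Note that "permission to skip" maps to `user_forbids_artifacts=False`: only an explicit prohibition stops the work, and nothing short of a changed artifact unlocks the skill edit.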
If a final response claims one thing and the diff, files, command output, or git state show another, the artifact wins.
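One way to apply "the artifact wins" is to compare the files a final response says it changed against the files the diff actually touches. The sketch below assumes both sets have already been extracted; the function name and result keys are hypothetical.

```python
def compare_claims_to_diff(claimed_files: set[str], diff_files: set[str]) -> dict:
    """Grade from the diff; treat the final response as a claim to verify."""
    unsupported = claimed_files - diff_files  # claimed, but absent from the diff
    unreported = diff_files - claimed_files   # changed, but never mentioned
    return {
        "claims_match_artifact": not unsupported and not unreported,
        "unsupported_claims": sorted(unsupported),
        "unreported_changes": sorted(unreported),
    }


result = compare_claims_to_diff(
    claimed_files={"evals/prompt.md", "skills/SKILL.md"},
    diff_files={"skills/SKILL.md"},
)
```

Here the response claims an eval update that the diff does not contain, so the run is graded as if the eval artifact was never touched.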
Read references/eval-patterns.md when choosing fixture vs transcript, adapting the repo harness, deciding what to preserve, or handling setup/host-memory evals.
Read references/grading-priority.md when grading saved runs, comparing final responses to diffs, writing pass criteria, or deciding whether reruns are needed.
A useful behavior eval usually makes baseline fail and with-skill pass. If both pass, the eval may be weak or the base agent may already handle it. If both fail, either the skill is missing the behavior or the task needs a different skill.
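The baseline versus with-skill outcomes can be read as a small decision table. The following is a sketch of that table; the fourth case (baseline passes, with-skill fails) is not covered by the text above and is included only as an assumption so the function is total.

```python
def interpret(baseline_pass: bool, with_skill_pass: bool) -> str:
    """Map a baseline / with-skill result pair to a reading of the eval."""
    if not baseline_pass and with_skill_pass:
        return "useful eval: the skill changes the behavior"
    if baseline_pass and with_skill_pass:
        return "eval may be weak, or the base agent already handles it"
    if not baseline_pass and not with_skill_pass:
        return "skill is missing the behavior, or the task needs a different skill"
    # Assumption, not from the source text: the skill appears to make things worse.
    return "possible regression: investigate before changing wording"
```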
Use a fixture eval when file edits, repo evidence, commands, or state files matter.
Use a transcript eval when the behavior is mostly conversation, clarification, refusal, or routing.
Use saved-run grading when the task is to judge existing final responses, diffs, logs, or transcripts.
Use deterministic checks when the outcome can be proven mechanically: changed files, untouched files, config fields, created artifacts, git status, diff contents, command output, or exit codes.
Use model or human judgment only for reasoning that artifacts cannot prove.
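A deterministic check of this kind might look like the sketch below. It assumes the output of `git status --porcelain` (version 1 format) and the command's exit code were captured during the run; the function names and the example paths are hypothetical, and quoted paths with special characters are not handled.

```python
def changed_paths(porcelain: str) -> set[str]:
    """Extract paths from `git status --porcelain` (v1) output."""
    paths = set()
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # each entry is a two-character status, a space, then the path
        path = line[3:]
        if " -> " in path:  # renames are shown as "old -> new"; keep the new path
            path = path.split(" -> ", 1)[1]
        paths.add(path)
    return paths


def check_run(porcelain: str, exit_code: int,
              must_change: set[str], must_not_touch: set[str]) -> list[str]:
    """Return a list of failures; an empty list means the run passes."""
    changed = changed_paths(porcelain)
    failures = []
    if exit_code != 0:
        failures.append(f"command exited with {exit_code}")
    for path in sorted(must_change - changed):
        failures.append(f"expected change missing: {path}")
    for path in sorted(must_not_touch & changed):
        failures.append(f"file should be untouched: {path}")
    return failures
```

Because every failure is derived from captured state, the same saved run always produces the same grade, which leaves model or human judgment for the reasoning that these checks cannot reach.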
Use the repo's existing eval harness when one exists.
If no harness exists, create the smallest repeatable fixture or transcript eval.
Dry-run, print, or inspect the eval setup before spending model tokens. Prefer repo-local runners over hand-built one-off commands.
Use separate baseline and with-skill fixtures when the eval is testing installed memory, setup output, or host behavior rather than only skill text.
Save final response and diff. Do not review full transcripts unless debugging a surprising result.
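Saving only the final response and the diff can be as simple as the sketch below. The directory layout and file names (`final_response.md`, `changes.diff`) are assumptions for illustration; a repo-local harness may already define its own.

```python
from pathlib import Path


def save_run(run_dir: Path, final_response: str, diff: str) -> list[Path]:
    """Persist the two artifacts that grading needs; skip the full transcript."""
    run_dir.mkdir(parents=True, exist_ok=True)
    response_path = run_dir / "final_response.md"
    diff_path = run_dir / "changes.diff"
    response_path.write_text(final_response, encoding="utf-8")
    diff_path.write_text(diff, encoding="utf-8")
    return [response_path, diff_path]
```

Keeping baseline and with-skill runs in separate directories (for example `runs/baseline` and `runs/with-skill`, both hypothetical names) makes the later comparison a plain file-by-file read.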
Do not add broad paragraphs after a failure.
First decide what failed:
Then make the smallest change.
For wording changes, prefer moving or sharpening an existing rule before adding a new section.
npx claudepluginhub hassan-mohiddin/freeflow --plugin freeflow