From swatkinson-toolkit
Retrospective self-evaluation of a skill you just ran. Reconstructs the skill's intended goals, reconstructs what ACTUALLY happened from the session transcript (errors, retries, stalls, hangs, user interventions, deviations), scores the run against a 6-dimension rubric, root-causes each friction point, and proposes concrete edits to the skill file plus ways the operator could run it better. Use at the end of a skill run, when the user invokes /skill-evaluate, or asks "how well did that skill go", "how could this skill be improved", "evaluate that run", "what slowed that down".
How this skill is triggered — by the user, by Claude, or both
Slash command
/swatkinson-toolkit:skill-evaluateThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You just ran a skill (or are wrapping one up). Turn the critical eye on the **skill itself**: did it do what it was built to do, where did it create friction, and how should its instructions change so the next run is smoother, faster, and more robust. Produce the grounded, evidence-cited retrospective a careful engineer writes after an incident — not a self-congratulatory recap.
You just ran a skill (or are wrapping one up). Turn the critical eye on the skill itself: did it do what it was built to do, where did it create friction, and how should its instructions change so the next run is smoother, faster, and more robust. Produce the grounded, evidence-cited retrospective a careful engineer writes after an incident — not a self-congratulatory recap.
Core stance — three rules that make this useful instead of noise:
The rubric, the evidence-gathering technique, and the report template live in REFERENCE.md — load it in Phase 2.
/skill-evaluate handle-it) → that one. Else infer the skill that dominated this session. Multiple skills ran or it's ambiguous → ask which (or offer to evaluate each).SKILL.md and everything it bundles — REFERENCE.md, any agent files it spawns (~/.claude/agents/*.md), sub-skills it delegates to. Extract its spec: stated purpose, intended outcomes / success criteria, the phase/step flow, hard rules + invariants, and the human gates it intends to have. This is what you grade the run against.Your working memory of the run is recency-biased and self-serving — it forgets your own wasted steps and smooths over stalls. Back it with the transcript.
~/.claude/projects/<project-slug>/, most-recently-modified; or use the session-management search tools if available). See REFERENCE → Gathering evidence.Write the report using the template in REFERENCE → Report template: verdict + scorecard, what went well (codify-worthy), the findings table, proposed skill edits (file + section + concrete new/changed text, ready to apply), and operator tips (how the user can run it better next time).
Editing a skill is a real change — don't silently rewrite it. Group the proposed skill edits by confidence (high-confidence/mechanical vs. judgment-call) and offer to apply them, waiting for the user's go-ahead. Apply only what they approve; leave operator tips and environment notes as advice. If the evaluated skill is the one that's currently running, prefer to hand back the report and let the user run the edits as a separate pass rather than editing mid-flight.
Guides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.
npx claudepluginhub swatkinson/swatkinson-cc-toolkit --plugin swatkinson-toolkit