From skill-bench
Use when authoring new Claude Code skills or refining existing ones. New skills: 5-phase workflow (Design → Plan → Build → Validate → Finalize). Existing skills: enter at Refine Mode for validation and targeted refinement.
Slash command: /skill-bench:skill-bench
Interactive workbench for creating, testing, and refining Claude Code skills through structured conversation.
Five phases: Design (brainstorming) → Plan (writing-plans) → Build & Test (skill-creator) → Validate (consistency-tester) → Finalize (lint + promote).
Refining an existing skill? → Jump to Refine Mode — skips to validation and refinement.
Before starting, verify required dependencies:
- superpowers:brainstorming skill is available via the Skill tool
  - If it is missing, tell the user to run "claude plugin install claude-plugins-official/superpowers" and stop
- skill-creator skill is available
  - If it is missing, copy skills/skill-creator/ from https://github.com/anthropics/skills to ~/.claude/skills/skill-creator/
  - Verify that ~/.claude/skills/skill-creator/SKILL.md and its internal agents are accessible

Phase 1: Design
REQUIRED SUB-SKILL: Use superpowers:brainstorming
Before invoking brainstorming:
- Tell brainstorming to save the spec to ".skillbench/specs/{skill-name}-design.md"
- Read references/skill-format.md and summarize for brainstorming:
  - Skill files combine frontmatter (name, description) and markdown content
  - name must be kebab-case: ^[a-z0-9]+(-[a-z0-9]+)*$
  - description must start with "Use when", max 1024 chars, trigger-oriented (not workflow-summarizing)
  - Size budgets: split long content into references/; estimate tokens as lines * 6
  - Companion agents (.md files in agents/)
- Glob skills/**/SKILL.md and agents/**/*.md for existing names. Warn about shadowing.

Then invoke brainstorming. It handles: clarifying questions, 2-3 approaches with trade-offs, sectioned design for approval, spec writing and commit.
Exit gate: User-approved design spec exists in .skillbench/specs/.
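The name and description rules summarized for brainstorming can be checked mechanically. The following is a minimal sketch, assuming the two fields have already been parsed out of the frontmatter; `check_frontmatter` is a hypothetical helper name, not part of skill-bench. The "trigger-oriented" requirement is a judgment call and is not checked here.

```python
import re

# Pattern and limit are taken from the rules above; the helper name is hypothetical.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
MAX_DESCRIPTION_CHARS = 1024


def check_frontmatter(name: str, description: str) -> list[str]:
    """Return human-readable problems; an empty list means both fields pass."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append(f"name {name!r} is not kebab-case")
    if not description.startswith("Use when"):
        problems.append('description must start with "Use when"')
    if len(description) > MAX_DESCRIPTION_CHARS:
        problems.append(
            f"description is {len(description)} chars (max {MAX_DESCRIPTION_CHARS})"
        )
    return problems
```

For example, `check_frontmatter("skill-bench", "Use when authoring skills")` returns an empty list, while a name such as `Skill_Bench` produces a kebab-case problem.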

Phase 2: Plan
REQUIRED SUB-SKILL: Use superpowers:writing-plans
Before invoking writing-plans:
- Tell writing-plans to save the plan to ".skillbench/plans/{skill-name}-plan.md"
- Read references/skill-authoring-plan-template.md and provide it as context so writing-plans produces skill-authoring tasks (define behavior → baseline test → write section → simulate → pressure test) instead of code-oriented tasks.
- Point writing-plans at the spec: .skillbench/specs/{skill-name}-design.md. If the spec file does not exist, return to Phase 1 before proceeding.

Then invoke writing-plans. It handles: file structure mapping, bite-sized task creation, self-review, plan writing and commit.
Exit gate: User-approved plan exists in .skillbench/plans/.

Phase 3: Build & Test
Initialize project config (first time only). If .skillbench/config.json doesn't exist, create it:
{
  "drafts_dir": "skills/drafts",
  "evals_dir": ".skillbench/evals",
  "test_model": "claude-opus-4-6",
  "context_files": []
}
Read the config if it exists. Use drafts_dir and evals_dir from config.
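The create-if-missing step can be sketched as follows. This is an illustration only: `load_or_create_config` is a hypothetical helper, and the defaults are copied from the config shown above.

```python
import json
from pathlib import Path

# Default values copied from the config shown above.
DEFAULT_CONFIG = {
    "drafts_dir": "skills/drafts",
    "evals_dir": ".skillbench/evals",
    "test_model": "claude-opus-4-6",
    "context_files": [],
}


def load_or_create_config(project_root: Path) -> dict:
    """Create .skillbench/config.json on first use, then return its contents."""
    config_path = project_root / ".skillbench" / "config.json"
    if not config_path.exists():
        config_path.parent.mkdir(parents=True, exist_ok=True)
        config_path.write_text(
            json.dumps(DEFAULT_CONFIG, indent=2) + "\n", encoding="utf-8"
        )
    # An existing file is never overwritten; it is read back as-is.
    return json.loads(config_path.read_text(encoding="utf-8"))
```

Because an existing file is returned untouched, any user edits to drafts_dir or evals_dir survive later runs.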
Update .gitignore — If .gitignore exists and doesn't mention .skillbench, offer to add:
# Skill Bench artifacts
.skillbench/test-history/
.skillbench/workspace/
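The .gitignore condition is easy to get subtly wrong, so here is a small sketch of the decision. It only returns the text to offer and leaves the actual prompt and write to the caller; `gitignore_addition` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional

# The block to offer, copied from the lines above.
GITIGNORE_BLOCK = (
    "# Skill Bench artifacts\n"
    ".skillbench/test-history/\n"
    ".skillbench/workspace/\n"
)


def gitignore_addition(project_root: Path) -> Optional[str]:
    """Return the block to offer, or None when nothing should be offered."""
    gitignore = project_root / ".gitignore"
    if not gitignore.exists():
        return None  # no .gitignore: the step does not apply
    if ".skillbench" in gitignore.read_text(encoding="utf-8"):
        return None  # already mentioned: do not offer again
    return GITIGNORE_BLOCK
```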
Create directories:
- {drafts_dir}/{skill-name}/
- {evals_dir}/{skill-name}/
- .skillbench/workspace/{skill-name}/

REQUIRED: Invoke the skill-creator skill with structured handoff context.
Before invoking:
- Read the spec at .skillbench/specs/{skill-name}-design.md. If missing, return to Phase 1.
- Read the plan at .skillbench/plans/{skill-name}-plan.md. If missing, return to Phase 2.
- Summarize references/skill-format.md for skill-creator: frontmatter schema, size budgets, reference splitting rules.
- Read references/writing-skills-summary.md and include as context.
- Tell skill-creator to write the skill to "{drafts_dir}/{skill-name}/"
- Tell skill-creator to save evals to "{evals_dir}/{skill-name}/evals.json"
- Tell skill-creator to use ".skillbench/workspace/{skill-name}/" as its workspace

Skill-creator runs its eval loop: write skill → create evals with assertions → run with-skill + without-skill → grade → launch eval viewer → iterate → optimize description → return handoff summary.
Size tracking: After skill-creator exits, report {filename}: {lines} lines (~{lines * 6} tokens) for each file in the draft. At 200+ lines suggest splitting to references/. At 300+ lines strongly recommend.
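The size report can be sketched as below. The 6-tokens-per-line estimate and the 200/300 line thresholds come from the text; `size_report` and the bracketed advice wording are hypothetical.

```python
from pathlib import Path

TOKENS_PER_LINE = 6        # rough estimate used in the text: lines * 6
SUGGEST_SPLIT_AT = 200     # lines
RECOMMEND_SPLIT_AT = 300   # lines


def size_report(draft_dir: Path) -> list[str]:
    """Return one report line per file in the draft, with split advice where relevant."""
    report = []
    for path in sorted(p for p in draft_dir.rglob("*") if p.is_file()):
        text = path.read_text(encoding="utf-8", errors="replace")
        lines = len(text.splitlines())
        name = path.relative_to(draft_dir).as_posix()
        entry = f"{name}: {lines} lines (~{lines * TOKENS_PER_LINE} tokens)"
        if lines >= RECOMMEND_SPLIT_AT:
            entry += " [strongly recommend splitting to references/]"
        elif lines >= SUGGEST_SPLIT_AT:
            entry += " [suggest splitting to references/]"
        report.append(entry)
    return report
```

A 10-line SKILL.md would be reported as `SKILL.md: 10 lines (~60 tokens)` with no advice attached.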
Artifacts: Skill-creator writes workspace artifacts to .skillbench/workspace/{skill-name}/. Eval definitions are preserved in {evals_dir}/{skill-name}/evals.json.
Exit gate: Skill-creator returns handoff summary with final skill path and eval results.

Phase 4: Validate
Multirun consistency testing: run the skill draft against real test cases multiple times, compare outputs, collect user judgment, and refine.
Check for test case library: look for .skillbench/test-cases/{skill-name}.json
- If it is missing and {evals_dir}/{skill-name}/evals.json exists (from Phase 3), offer to convert: "I can seed your test case library from the Phase 3 evals. Convert?"
  - If the user accepts, read evals.json, map each eval case to a test case (eval prompt → test prompt, eval assertions → test description), and write to .skillbench/test-cases/{skill-name}.json
- Otherwise, tell the user: "Create your test case library at .skillbench/test-cases/{skill-name}.json. Format:"

{
  "run_count": 5,
  "cases": [
    {
      "name": "case-name",
      "prompt": "input prompt for this case",
      "reference_files": ["path/to/reference.pdf"],
      "description": "what this case tests"
    }
  ]
}
Wait for the user to create the file before proceeding.
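Once the file exists, a quick structural check catches malformed libraries before any runs are spent on them. The sketch below assumes that name, prompt, and description are required and that reference_files is optional; the source shows the format but does not say which keys are mandatory. `load_test_cases` is a hypothetical helper.

```python
import json
from pathlib import Path

# Assumption: these keys are required; reference_files is treated as optional.
REQUIRED_CASE_KEYS = {"name", "prompt", "description"}


def load_test_cases(path: Path) -> dict:
    """Load a test case library; raise ValueError if it does not match the format."""
    library = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(library, dict):
        raise ValueError("library must be a JSON object")
    run_count = library.get("run_count")
    if isinstance(run_count, bool) or not isinstance(run_count, int) or run_count < 1:
        raise ValueError("run_count must be a positive integer")
    cases = library.get("cases")
    if not isinstance(cases, list) or not cases:
        raise ValueError("cases must be a non-empty list")
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            raise ValueError(f"case {index} must be a JSON object")
        missing = REQUIRED_CASE_KEYS - set(case)
        if missing:
            raise ValueError(f"case {index} is missing: {sorted(missing)}")
    return library
```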
Verify skill draft exists — Confirm {drafts_dir}/{skill-name}/SKILL.md exists from Phase 3. If not, return to Phase 3.
Use the Agent tool to spawn the consistency-tester agent with:
- The skill draft: {drafts_dir}/{skill-name}/SKILL.md
- The test case library: .skillbench/test-cases/{skill-name}.json
- The test history directory: .skillbench/test-history/{skill-name}/

The consistency-tester handles the full validation loop: run collection → consistency summary → user judgment → failure analysis → skill refinement → re-run recommendations. It loops until the user declares validation complete.
Exit gate: User declares validation complete.

Refine Mode
Entry point for improving an existing skill. Skips Phases 1-3.
See references/refine-mode.md for the full import procedure.
Execute Phase 4 above using the imported skill's draft path. All Phase 4 mechanics apply: test case library, consistency-tester, user judgment loop, skill-refiner.
Execute Phase 5 below. If the skill was copied from a published location, promotion offers to overwrite the original or place at a new path.

Phase 5: Finalize
Validate the skill and promote to its final location. Load references/anti-patterns.md for the lint checklist.
Frontmatter validation:
- name matches ^[a-z0-9]+(-[a-z0-9]+)*$
- description starts with "Use when" and is ≤ 1024 chars
- model is valid enum or omitted

Content checks:

Reference validation:
- skill: `name` references: Glob to verify each exists
- **name** agent references: Glob to verify each exists

Description quality (CSO verification):
- Across **/SKILL.md files, check for description overlap

Token efficiency:
- Word count (wc -w) and line count

Eval coverage:
- {evals_dir}/{skill-name}/evals.json exists

Flowchart review (if skill contains digraph blocks):
Present findings as Blocking (must fix), Warning (should fix), Info (optional).
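One way to present the three severity levels is to group findings and print the most severe group first. This is a sketch only: `Finding` and `render_findings` are hypothetical names, and the source does not say which lint check maps to which severity, so that mapping is left to the caller.

```python
from dataclasses import dataclass

# Severity labels come from the text, ordered from most to least severe.
SEVERITY_ORDER = ("Blocking", "Warning", "Info")


@dataclass
class Finding:
    severity: str  # one of SEVERITY_ORDER
    message: str


def render_findings(findings: list[Finding]) -> str:
    """Group findings by severity, most severe first, skipping empty groups."""
    sections = []
    for severity in SEVERITY_ORDER:
        messages = [f.message for f in findings if f.severity == severity]
        if messages:
            sections.append(f"{severity} ({len(messages)}):")
            sections.extend(f"  - {message}" for message in messages)
    return "\n".join(sections)
```

Listing Blocking items first keeps must-fix problems visible even when the Info list is long.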
Promote:
- Choose the target: skills/{skill-name}/, plugin directory, or custom path
- Copy {drafts_dir}/{skill-name}/ to target. Include references, scripts, companion agents.
- Commit with the message: feat: add {skill-name} skill

Install the plugin: npx claudepluginhub yiminnn/skill-bench-plugin --plugin skill-bench