By Yiminnn
Interactive skill authoring bench — create, test, and refine Claude Code skills through conversation
consistency-tester: Use when validating a skill draft through multirun testing. Takes a test case library, runs the skill-tester agent N times per case, summarizes consistency across runs, collects user pass/fail judgments with annotations, analyzes failure patterns across cases, proposes targeted skill edits, and manages the re-run loop until the user declares validation complete.
skill-explorer: Use when you need to browse, list, or summarize existing skill drafts and their test history. Scans the drafts directory, reads frontmatter, and reports on draft status.
skill-refiner: Use when a skill draft has been tested against multiple cases and needs refinement based on test results and user feedback. Takes skill content, N test results with thinking traces, and user annotations identifying which results failed and why. Analyzes failure patterns across runs and proposes targeted skill edits.
skill-tester: Use when testing a skill draft via simulated execution. Takes a skill's SKILL.md content and a sample input, reasons through what the skill would produce, and returns a structured evaluation with pass/fail status, issues, thinking trace, and suggested next test cases.
Uses power tools
Uses Bash, Write, or Edit tools
Interactive skill authoring for Claude Code — create, test, and refine skills through structured conversation.
claude plugin marketplace add https://github.com/Yiminnn/skill-bench-plugin
claude plugin install skill-bench
/skill-bench
| Phase | What Happens | Powered By |
|---|---|---|
| 1. Design | Brainstorm approaches, produce design spec | superpowers:brainstorming |
| 2. Plan | Generate implementation tasks with TDD steps | superpowers:writing-plans |
| 3. Build & Test | Build skill, eval with baseline comparison, iterate | skill-creator |
| 4. Validate | Multirun consistency testing, user judgment, refinement | consistency-tester + skill-refiner |
| 5. Finalize | Lint, validate references, promote | built-in |
Already have a skill? Skip straight to validation:
/skill-bench
> refine path/to/my-skill/SKILL.md
| You want to... | Say... |
|---|---|
| Create a new skill | /skill-bench |
| Refine an existing skill | /skill-bench then refine path/to/skill |
| Approve a step | y or looks good |
| Edit the draft yourself | Edit in your editor, then say I edited it |
| Run quick test | yes (when offered sample run) |
| Run thorough testing | full validation |
| Mark a run as failed | run 3 failed — [what went wrong] |
| Approve proposed fixes | approve all or approve fix 1 and 3 |
| Finish testing | validation complete |
| Check existing drafts | show me my skill drafts |
| Component | Type | Model | Purpose |
|---|---|---|---|
| skill-bench | Skill | n/a | 5-phase workflow orchestrator |
| skill-tester | Agent | Opus | Simulates skill execution, returns structured eval with thinking trace |
| consistency-tester | Agent | Opus | Multirun validation: run N times, compare, collect judgment, refine |
| skill-refiner | Agent | Opus | Dual-lens failure analysis (cross-run + per-run), proposes targeted edits |
| skill-explorer | Agent | Haiku | Read-only scanner for drafts and test history |
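To make the "run N times, compare" idea concrete, here is a minimal sketch of how per-case consistency could be summarized. It is an illustration only, not the plugin's actual logic: the `RunResult` shape, the `summarize_consistency` function, and the "agreement" metric are all assumptions introduced for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RunResult:
    """One simulated run of a skill on a test case (hypothetical shape)."""
    case_id: str
    status: str  # "pass" or "fail"


def summarize_consistency(results: list[RunResult]) -> dict[str, dict]:
    """Group runs by test case and report how often the runs agree."""
    by_case: dict[str, list[str]] = {}
    for result in results:
        by_case.setdefault(result.case_id, []).append(result.status)

    summary = {}
    for case_id, statuses in by_case.items():
        # The most common status and the share of runs that produced it.
        majority_status, majority_count = Counter(statuses).most_common(1)[0]
        summary[case_id] = {
            "runs": len(statuses),
            "majority": majority_status,
            "agreement": majority_count / len(statuses),
        }
    return summary


# Example: three runs of case-a (one disagrees), two runs of case-b.
runs = [
    RunResult("case-a", "pass"),
    RunResult("case-a", "pass"),
    RunResult("case-a", "fail"),
    RunResult("case-b", "pass"),
    RunResult("case-b", "pass"),
]
summary = summarize_consistency(runs)
```

A case whose agreement is below 1.0 is one where repeated runs diverged, which is the kind of result a user would likely want to inspect and annotate before approving any proposed fixes.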
On first use, creates .skillbench/config.json in your project:
{
  "drafts_dir": "skills/drafts",
  "evals_dir": ".skillbench/evals",
  "test_model": "claude-opus-4-6",
  "context_files": []
}
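If you want to read this file from your own tooling, a minimal sketch might look like the following. The `load_config` helper is hypothetical, and merging a partial file over defaults is an assumption about sensible behavior, not a description of what the plugin does; the default values are simply the ones shown above.

```python
import copy
import json
import tempfile
from pathlib import Path

# Defaults copied from the config shown above.
DEFAULTS = {
    "drafts_dir": "skills/drafts",
    "evals_dir": ".skillbench/evals",
    "test_model": "claude-opus-4-6",
    "context_files": [],
}


def load_config(project_root: Path) -> dict:
    """Read .skillbench/config.json, falling back to DEFAULTS for missing keys."""
    config = copy.deepcopy(DEFAULTS)
    config_path = project_root / ".skillbench" / "config.json"
    if config_path.exists():
        config.update(json.loads(config_path.read_text(encoding="utf-8")))
    return config


# Example: a project that overrides only the drafts directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".skillbench").mkdir()
    (root / ".skillbench" / "config.json").write_text(
        json.dumps({"drafts_dir": "my/drafts"}), encoding="utf-8"
    )
    loaded = load_config(root)
```

Here `loaded` keeps the default `evals_dir`, `test_model`, and `context_files` while picking up the overridden `drafts_dir`.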
| Path | Purpose | Tracked? |
|---|---|---|
| .skillbench/config.json | Project settings | Yes |
| .skillbench/specs/ | Design specs | Yes |
| .skillbench/plans/ | Implementation plans | Yes |
| .skillbench/evals/ | Eval definitions | Yes |
| .skillbench/test-cases/ | Test case libraries | Yes |
| .skillbench/workspace/ | Skill-creator iterations | No |
| .skillbench/test-history/ | Test results and refinements | No |
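The "Tracked?" column can be turned into ignore rules with a small script. The sketch below assumes that "No" means the path should be excluded from version control; whether the plugin writes a `.gitignore` for you is not stated here, and the `gitignore_entries` helper is hypothetical.

```python
# Paths and their "Tracked?" values, copied from the table above.
PATHS = {
    ".skillbench/config.json": True,
    ".skillbench/specs/": True,
    ".skillbench/plans/": True,
    ".skillbench/evals/": True,
    ".skillbench/test-cases/": True,
    ".skillbench/workspace/": False,
    ".skillbench/test-history/": False,
}


def gitignore_entries(paths: dict[str, bool]) -> list[str]:
    """Return the paths marked as untracked, in table order."""
    return [path for path, tracked in paths.items() if not tracked]


entries = gitignore_entries(PATHS)
print("\n".join(entries))
```

The printed lines can be appended to a project's `.gitignore` by hand if they are not already present.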
Requires two dependencies (both auto-installed on first use):
Manual install if needed:
claude plugin install claude-plugins-official/superpowers
MIT
npx claudepluginhub yiminnn/skill-bench-plugin --plugin skill-bench