From base
Creates, evaluates, and improves Claude skills through iterative testing and benchmarking. Use when a user wants to create a new skill, refine an existing one, test skill behavior, benchmark performance, improve triggering accuracy, or turn a workflow into a reusable skill. Triggers on phrases like "make this into a skill", "improve my skill", "test this skill", "why isn't my skill triggering", or "package this skill".
How this skill is triggered — by the user, by Claude, or both
Slash command
/base:skill-creatorThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Create, evaluate, and iteratively improve skills through a structured draft → test → review → improve cycle.
LICENSE.txtNOTICEagents/analyzer.mdagents/comparator.mdagents/grader.mdassets/eval_review.htmleval-viewer/generate_review.pyeval-viewer/viewer.htmlreferences/description-optimization.mdreferences/environments.mdreferences/eval-workflow.mdreferences/packaging.mdreferences/schemas.mdscripts/__init__.pyscripts/aggregate_benchmark.pyscripts/generate_report.pyscripts/improve_description.pyscripts/package_skill.pyscripts/quick_validate.pyscripts/run_eval.pyCreate, evaluate, and iteratively improve skills through a structured draft → test → review → improve cycle.
Determine the user's intent and follow only the relevant path:
Most sessions involve Create or Improve, which both follow the core workflow below.
The process follows an iterative loop:
eval-viewer/generate_review.py to present results for human review, and show quantitative metricsDetermine where the user is in this process and help them progress. If they say "I want to make a skill for X", help narrow intent, draft, write test cases, evaluate, and iterate. If they already have a draft, jump straight to eval/iterate.
Remain flexible — if the user says "I don't need evaluations, just vibe with me", adapt accordingly.
After the skill is finalized, offer to run the description optimizer to improve triggering accuracy.
Skill creation attracts users across a wide range of technical familiarity. Pay attention to context cues and adjust vocabulary accordingly. Terms like "evaluation" and "benchmark" are generally safe. For terms like "JSON" or "assertion", look for cues that the user is comfortable with them, or briefly explain when in doubt.
Start by understanding what the user wants. The current conversation may already contain a workflow to capture (e.g., "turn this into a skill"). If so, extract answers from the conversation history first — tools used, sequence of steps, corrections made, input/output formats observed. Confirm with the user before proceeding.
Key questions to resolve:
Proactively ask about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until this is solid.
Check available MCPs — if useful for research (searching docs, finding similar skills, looking up best practices), research in parallel via subagents if available, otherwise inline.
Based on the interview, fill in:
skill-name/
├── SKILL.md (required)
│ ├── YAML frontmatter (name, description required)
│ └── Markdown instructions
└── Bundled resources (optional)
├── scripts/ - Executable code for deterministic/repetitive tasks
├── references/ - Docs loaded into context as needed
└── assets/ - Files used in output (templates, icons, fonts)
Skills use a three-level loading system:
Keep SKILL.md under 500 lines. When approaching this limit, split into reference files with clear pointers about when to read them. For large reference files (300+ lines), include a table of contents at the top.
Skills must not contain malware, exploit code, or security-compromising content. A skill's contents should not surprise the user in their intent. Do not create misleading skills or skills designed for unauthorized access or data exfiltration. Roleplay-style skills are acceptable.
Use the imperative form in instructions.
Defining output formats:
## Report structure
ALWAYS use this exact template:
# [Title]
## Executive summary
## Key findings
## Recommendations
Examples pattern — include input/output pairs to guide style and expectations:
## Commit message format
**Example 1:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
Explain to the model why things matter rather than relying on heavy-handed directives. Use theory of mind and aim for general instructions that aren't over-fitted to specific examples. Draft first, then review with fresh eyes and improve.
After drafting the skill, create 2–3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user for confirmation, then run them.
Save test cases to evals/evals.json. Start with just the prompts — assertions come later. See references/schemas.md for the full schema including the assertions field.
Treat these steps as a continuous workflow — pausing mid-sequence risks losing timing data from subagent notifications, which aren't persisted elsewhere.
For the detailed step-by-step procedure (spawning runs, drafting assertions, grading, launching the viewer), see references/eval-workflow.md.
This is the heart of the loop. Test cases have run, the user has reviewed results, and now the skill needs to get better.
Generalize from the feedback. The goal is a skill that works across many different prompts, not just the test examples. Rather than adding narrow, overfitting fixes or oppressively constrictive directives, try different metaphors or recommend different working patterns when something is stubborn. It's cheap to experiment.
Keep the prompt lean. Remove anything that isn't pulling its weight. Read the transcripts, not just final outputs — if the skill makes the model waste time on unproductive steps, remove those instructions and observe what happens.
Explain the why. Convey the reasoning behind every instruction. Today's LLMs have strong theory of mind and, given good context, go beyond rote instructions to deliver excellent results. If you find yourself writing ALWAYS or NEVER in all caps, that's a signal to reframe: explain the reasoning so the model understands the importance. That approach is more effective and more durable than rigid directives.
Look for repeated work across test cases. Read transcripts and notice when all subagents independently wrote similar helper scripts or took the same multistep approach. If every test case resulted in a create_docx.py or build_chart.py, bundle that script in scripts/ and tell the skill to use it.
This work has high impact — invest thinking time and iterate on your draft revision before finalizing.
After improving the skill:
iteration-<N+1>/, including baselines. For new skills, the baseline is always no-skill. For existing skills, use your judgment — the original version or the previous iteration--previous-workspace pointing at the previous iterationStop when the user is satisfied, feedback is all empty, or improvements plateau.
For rigorous comparison between two skill versions, use the blind comparison system. Read agents/comparator.md and agents/analyzer.md for details. An independent agent judges two outputs without knowing which skill produced which, then analyzes why the winner won.
This is optional, requires subagents, and most users won't need it. The human review loop is usually sufficient.
After a skill is finalized, optimizing its description improves triggering accuracy. For the full procedure, see references/description-optimization.md.
The core workflow applies everywhere. For Claude.ai and Cowork-specific adaptations, see references/environments.md.
When the skill is ready to distribute, see references/packaging.md.
Agents — read when spawning the relevant subagent:
References — read when entering the relevant phase:
Core loop reminder:
Task Progress:
- [ ] Understand what the skill is about
- [ ] Draft or edit the skill
- [ ] Run Claude-with-access-to-the-skill on test prompts
- [ ] Evaluate outputs with the user (see references/eval-workflow.md)
- [ ] Iterate until user is satisfied
- [ ] Package the final skill (see references/packaging.md)
Add these steps to your TodoList if available to ensure nothing is missed.
npx claudepluginhub kevin-rm/claude-code --plugin baseCreates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.