From nsls2-skills
Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit or optimize an existing skill, test a skill against realistic prompts, or iterate on skill quality. Also use when someone says "turn this into a skill" or asks to capture a workflow as a reusable skill.
Slash command: /nsls2-skills:skill-creator
A skill for creating new skills and iteratively improving them.
The process of creating a skill:
Your job is to figure out where the user is in this process and help them progress. Maybe they want to make a skill from scratch — help them narrow scope, write a draft, test it, and iterate. Maybe they already have a draft — go straight to eval/iterate. Be flexible. If the user says "I don't need evaluations, just vibe with me," do that instead.
Skill creation attracts users across a wide range of technical familiarity. Pay attention to context cues to understand how to phrase your communication.
Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history — the tools used, the sequence of steps, corrections the user made, input/output formats observed. The user may need to fill gaps and should confirm before proceeding.
Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until you've ironed this out.
If useful tools are available for research (searching docs, finding similar skills, looking up best practices), use them. Come prepared with context to reduce burden on the user.
Based on the user interview, fill in these components:
skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter (name, description required)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/    - Executable code for deterministic/repetitive tasks
    ├── references/ - Docs loaded into context as needed
    └── assets/     - Files used in output (templates, icons, fonts)
Skills use a three-level loading system:
Key patterns:
Domain organization — when a skill supports multiple domains/frameworks, organize by variant:
cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md
The agent reads only the relevant reference file.
Prefer using the imperative form in instructions.
Defining output formats:
## Report structure
ALWAYS use this exact template:
# [Title]
## Executive summary
## Key findings
## Recommendations
Examples pattern:
## Commit message format
**Example 1:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
Explain to the model why things are important rather than using heavy-handed MUSTs. Use theory of mind and make the skill general rather than overly narrow to specific examples.
If you find yourself writing ALWAYS or NEVER in all caps, or using super rigid structures, that's a yellow flag — reframe and explain the reasoning so the model understands why the thing you're asking for is important. That's a more effective approach than brute-force instruction.
Start by writing a draft, then look at it with fresh eyes and improve it.
Every skill should end with a self-improvement section that tells the agent to update the skill when it encounters gaps, failures, or new patterns. This ensures the skill gets better over time. Example:
## Self-Improvement
After using this skill:
1. **If something failed or was wrong**: update the relevant section or add a new gotcha.
2. **If a new pattern emerged**: add it or create a new section.
3. **If a workaround was needed**: document it inline where the original guidance was.
After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user: "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?"
Save test cases to evals/evals.json in the skill's workspace directory:
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "assertions": [],
      "files": []
    }
  ]
}
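Before launching any runs, it can help to sanity-check the file. The following is a minimal sketch, assuming the field names shown in the example above; the function name `load_evals` and the set of required keys are illustrative choices, not part of the skill itself.

```python
import json
from pathlib import Path

# Keys taken from the example entry above; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = {"id", "prompt", "expected_output", "assertions", "files"}


def load_evals(path):
    """Load evals.json and check that each entry has the expected fields."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if "skill_name" not in data:
        raise ValueError("missing 'skill_name'")
    for entry in data.get("evals", []):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(
                f"eval {entry.get('id')!r} is missing {sorted(missing)}"
            )
    return data
```

Calling `load_evals("evals/evals.json")` returns the parsed dictionary, or raises `ValueError` naming the first incomplete entry.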
Put results in <skill-name>-workspace/ as a sibling to the skill directory. Within the workspace, organize results by iteration (iteration-1/, iteration-2/, etc.) and within that, each test case gets a directory (eval-0/, eval-1/, etc.). Create directories as you go.
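The layout described above can be sketched with `pathlib`. This is an illustrative helper under the stated naming scheme (a sibling `<skill-name>-workspace/` directory containing `iteration-N/eval-ID/`); the function name `eval_dir` is hypothetical.

```python
from pathlib import Path


def eval_dir(skill_dir, iteration, eval_id):
    """Create and return <skill-name>-workspace/iteration-N/eval-ID/.

    The workspace is placed next to the skill directory, as a sibling.
    """
    skill_dir = Path(skill_dir)
    workspace = skill_dir.parent / f"{skill_dir.name}-workspace"
    target = workspace / f"iteration-{iteration}" / f"eval-{eval_id}"
    # Create directories lazily; rerunning is harmless.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

For example, `eval_dir("skills/my-skill", 1, 0)` creates `skills/my-skill-workspace/iteration-1/eval-0/`.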
For each test case, run two versions — one with the skill, one without. Launch them in parallel using your subagent/task tool if available. If subagents aren't available, run them sequentially.
With-skill run: Execute the task prompt while following the skill's instructions. Save outputs to <workspace>/iteration-N/eval-ID/with_skill/outputs/.
Baseline run (same prompt, no skill loaded):
Save baseline outputs to without_skill/outputs/, or to old_skill/outputs/ when the baseline is the previous version of an existing skill.
Write an eval_metadata.json for each test case. Give each eval a descriptive name based on what it's testing.
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
While the runs are in progress, use the time productively: draft quantitative assertions for each test case and explain them to the user. Good assertions are objectively verifiable and have descriptive names. Subjective skills (writing style, design quality) are better evaluated qualitatively, so don't force assertions onto things that need human judgment.
Update the eval_metadata.json files and evals/evals.json with the assertions.
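One way to keep the two files in sync is a small helper that writes the same assertions into either kind of file. This is a sketch only: the name `set_assertions` is hypothetical, and representing each assertion as a plain string is an assumption, since the examples here show only empty lists.

```python
import json
from pathlib import Path


def set_assertions(json_path, eval_id, assertions):
    """Attach assertions to one eval in evals.json or eval_metadata.json.

    Returns the number of entries that were updated.
    """
    path = Path(json_path)
    data = json.loads(path.read_text(encoding="utf-8"))
    if "evals" in data:
        # evals.json: entries are keyed by "id".
        targets = [e for e in data["evals"] if e.get("id") == eval_id]
    else:
        # eval_metadata.json: a single object keyed by "eval_id".
        targets = [data] if data.get("eval_id") == eval_id else []
    for target in targets:
        target["assertions"] = list(assertions)
    path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
    return len(targets)
```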
When each run completes, save timing data to timing.json in the run directory:
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
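A minimal sketch of writing that file, assuming the token count and duration are already known when the run finishes. The helper name `save_timing` is hypothetical; the seconds value is derived from the milliseconds so the two fields cannot drift apart.

```python
import json
from pathlib import Path


def save_timing(run_dir, total_tokens, duration_ms):
    """Write timing.json into the run directory and return its contents."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # Derived field, rounded to one decimal place as in the example.
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    path = Path(run_dir) / "timing.json"
    path.write_text(json.dumps(timing, indent=2) + "\n", encoding="utf-8")
    return timing
```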
Once all runs are done:
Grade each run against its assertions and save the results to grading.json:
{
  "eval_id": 0,
  "expectations": [
    {"text": "Output contains valid JSON", "passed": true, "evidence": "Parsed successfully"},
    {"text": "Includes all required fields", "passed": false, "evidence": "Missing 'description' field"}
  ]
}
Aggregate results — compute pass rates for with-skill vs baseline. Note time and token differences.
Present to the user — show a summary of each test case: the prompt, key outputs, assertion results, and any observations. Highlight patterns the aggregate stats might hide (assertions that always pass regardless of skill, high-variance results, time/token tradeoffs).
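The aggregation step can be sketched as follows. This assumes each grading.json sits in its run directory (for example `eval-0/with_skill/grading.json`), which the text does not state explicitly; the function names are hypothetical.

```python
import json
from pathlib import Path


def pass_rate(grading_files):
    """Fraction of expectations marked passed across the given files."""
    passed = total = 0
    for path in grading_files:
        grading = json.loads(Path(path).read_text(encoding="utf-8"))
        for expectation in grading["expectations"]:
            total += 1
            passed += bool(expectation["passed"])
    return passed / total if total else 0.0


def compare_pass_rates(iteration_dir, baseline="without_skill"):
    """Pass rate for with-skill runs versus the baseline in one iteration."""
    iteration_dir = Path(iteration_dir)
    return {
        variant: pass_rate(
            sorted(iteration_dir.glob(f"eval-*/{variant}/grading.json"))
        )
        for variant in ("with_skill", baseline)
    }
```

Pass `baseline="old_skill"` when comparing against an earlier version of the skill. An assertion that passes in both columns for every eval is not telling you anything about the skill.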
Ask the user to review the results. Empty feedback means they thought it was fine. Focus improvements on test cases where the user had specific complaints.
Generalize from the feedback. You're iterating on a few examples to move fast, but the skill will be used across many different prompts. Rather than putting in fiddly overfitty changes or oppressively constrictive MUSTs, try branching out with different metaphors or recommending different patterns.
Keep the prompt lean. Remove things that aren't pulling their weight. Read the transcripts, not just the final outputs — if the skill makes the model waste time on unproductive steps, trim those parts.
Explain the why. Try hard to explain the reasoning behind everything you're asking the model to do. Even if the feedback is terse, understand the task and transmit that understanding into the instructions.
Look for repeated work across test cases. If all test runs independently wrote similar helper scripts or took the same multi-step approach, the skill should bundle that script. Write it once, put it in scripts/, and reference it from the skill.
Take your time. Write a draft revision, then look at it fresh and improve it.
After improving the skill:
Rerun all test cases into a new iteration-N+1/ directory, including baseline runs.
Keep going until:
When adding a skill to the NSLS2 skills repo (NSLS2/skills):
Create the skill at skills/<skill-name>/SKILL.md
Register it in .claude-plugin/marketplace.json:
{
  "skills": [
    "./skills/existing-skill",
    "./skills/new-skill"
  ]
}
In README.md, add a row to the Available Skills table.
After using this skill to create or improve another skill:
Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub nsls2/frontend-skills --plugin nsls2-skills