From skill-creator-pro
Use when creating new skills, iterating on existing skills, auditing skill quality, bumping skill versions, or optimizing skill trigger descriptions. Pro-grade lifecycle covering scaffold, baseline testing, evals, lint, semver, CHANGELOG, and description optimization. Triggers on "create a skill", "new skill", "迭代 skill", "优化 skill", "skill 版本", "skill lint", "review this skill", "write a skill for X", "turn this into a skill", or whenever the user starts or edits a SKILL.md.
How this skill is triggered — by the user, by Claude, or both
Slash command
/skill-creator-pro:skill-creator-proThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are about to create or change a skill. That is a high-leverage act: one
You are about to create or change a skill. That is a high-leverage act: one skill lives in the system prompt of every future conversation and gets invoked tens, hundreds, or thousands of times. Treat it like production code.
The lifecycle is one pipeline: RED → GREEN → EVAL → LINT → BUMP → OPTIMIZE-DESC → PACKAGE / SYNC. Each step exists because skipping it has produced a specific failure mode in real use.
Understand intent
│
▼
RED: Run 1-2 baseline subagent scenarios WITHOUT the skill
│ (watch it fail; record rationalizations verbatim)
▼
GREEN: scripts/init_skill.py <name> → fill SKILL.md addressing those failures
│ (description = "Use when ..."; body explains WHY, not just WHAT)
▼
EVAL (optional, recommended for non-trivial skills):
│ write evals.json → spawn with-skill + baseline subagents → benchmark.json
│ → eval-viewer → user feedback → improve → re-run
▼
REFACTOR: close new rationalizations surfaced by evals
│
▼
LINT: scripts/lint_skill.py <name> → fix all errors, then warns
│
▼
BUMP: scripts/bump_version.py <name> --type <patch|minor|major> -m "..."
│ (creates/updates CHANGELOG.md, badges, frontmatter.metadata.version)
▼
OPTIMIZE TRIGGER DESC: (when accuracy matters)
│ iterate the description against 20 trigger queries (10 should-trigger, 10 near-miss)
▼
PACKAGE / SYNC: deploy to ~/.claude/skills/ — or ship a multi-platform library
(npx skills add + .claude/.codex/.cursor plugin manifests)
Do not skip RED. The biggest failure mode of hand-written skills is that they solve problems the model never had — you assumed a rationalization that never appears, and missed the one that does.
Not every skill earns the full pipeline. Pick a track before you start:
| Track | When | Skip |
|---|---|---|
| Lightweight | Personal utility, narrow scope, pure-reference, one user | RED, EVAL, description optimization |
| Full | Discipline skill (TDD-like), shipped to others, >1 use case, high stakes | Nothing; run the whole loop |
| Audit-only | Existing skill the user wants polished | Start at LINT, skip RED/GREEN |
Announce the track to the user in one line: "Going with the lightweight track because X." Let them redirect.
A skill has one of two shapes. Pick before scaffolding — retrofitting is costly.
| Shape | Structure | Use when |
|---|---|---|
| Single-file | one SKILL.md (+ references/, scripts/) | the skill does one thing, or a few tightly-related things sharing one procedure |
| Router + workflows | thin router SKILL.md indexing workflows/*.md; agent loads only the selected one | ≥3 distinct user intents, body would exceed ~400 lines of independently-invoked sections, or the capability set will grow |
The router shape is how LLMQuant/skills ships 18 category skills and how
meta+companion families (firefly / cancer-buddy / beacon) scale. A router is
progressive disclosure: the model reads a ~40-line index + the one relevant
workflow instead of skimming a 2000-line monolith on every near-miss. Less
irrelevant text in context is a correctness lever, not just a cost one.
Scaffold a router with init_skill.py <name> --router (creates the router,
workflows/, a sample workflow, and multi-platform plugin manifests). Full
guide: references/router-and-workflows.md. Templates:
templates/SKILL_ROUTER_TEMPLATE.md + templates/WORKFLOW_TEMPLATE.md.
Every top-tier skill is one of two schools. Classify before writing — they have different body structures, trigger styles, and verification methods, and blending them produces an incoherent skill.
| Capability / reference | Discipline / process | |
|---|---|---|
| PRD says | "extract / convert / generate / analyze X" | "always / never / enforce / stop the agent from Z" |
| Body | table-first, ✅/❌ code recipes, scripts, QA loop | flowchart + one Iron Law + rationalization table + red-flags |
| Trigger | exhaustive keyword enumeration + Do NOT clause | minimal "Use when " |
| Verify | mandatory "assume-failure" QA loop (fresh-eyes subagent) | RED-GREEN-REFACTOR under multi-pressure scenarios |
Full taxonomy + the distilled quality bar from the top-50 starred skill repos:
references/top-skills-playbook.md.
When the user hands you a PRD / spec and says "build the skill", run the whole pipeline driven by the PRD. Do not stop at a draft — a PRD asks for a finished, top-tier skill.
AskUserQuestion); both gaps produce a bad skill.Full procedure + the PRD intake template: references/prd-to-skill.md.
Ask, using AskUserQuestion when available:
If the user says "turn this conversation into a skill", mine the transcript instead of asking: extract the steps they corrected, the tools used, the unobvious judgment calls. Then confirm the gaps.
Before writing a single line of the skill, run one pressure scenario with a subagent without the skill. Use the Agent tool:
Execute this task: [the realistic user prompt]. Report what you did and how you decided, verbatim. No skill is available.
Record: what did it do wrong? what rationalizations did it use? These become the specific things your skill must counter. If the baseline succeeds effortlessly, the skill is probably unnecessary — tell the user.
See references/tdd-for-skills.md for pressure-scenario patterns, pressure
types (time, sunk cost, authority, exhaustion), and a catalog of
rationalizations.
python scripts/init_skill.py <name> --dest ~/.claude/skills # single-file
python scripts/init_skill.py <name> --dest ~/.claude/skills --router # router + workflows
The default creates <dest>/<name>/ with:
SKILL.md — frontmatter template + TODO placeholdersREADME.md — human-readable, version badge, install snippet, options tablescripts/ — empty; create if the skill has deterministic helpersreferences/ — empty; create for heavy reference material (>100 lines)--router instead creates a router SKILL.md, a workflows/ dir with a sample
workflow, and .claude-plugin/ + .codex-plugin/ + .cursor-plugin/ manifests
for multi-platform install. See references/router-and-workflows.md.
Now fill SKILL.md:
Frontmatter (see references/frontmatter-spec.md for the full spec):
---
name: <kebab-case-name>
description: Use when <specific triggering conditions>. Triggers on
<concrete phrases>.
license: MIT
metadata:
author: <you>
version: "0.1.0"
tags: <space-separated>
---
name and description are the only hard requirements; the rest improves
discoverability, versioning, and lint signal.
Body (≤500 lines; split to references/ beyond that):
Evidence Contract (any skill that touches external facts — strongly recommended for medical / research / financial skills). Declare three things so the model never re-litigates "can I just make this up?" under pressure:
Match the contract to the domain's "things dangerous to guess". See
references/router-and-workflows.md § The Evidence Contract pattern.
README.md — humans read this on GitHub or in the file browser. Keep the version badge, install command, and a minimal usage example.
Write 2-3 realistic test prompts in evals/evals.json, spawn with-skill
and baseline subagents in the same turn, collect outputs into
<skill>-workspace/iteration-1/, grade against assertions, aggregate into
benchmark.json, review the side-by-side output, iterate.
See references/evals-and-benchmark.md for the schemas, workspace layout, and
what a benchmark review should surface.
When you read the transcripts (not just outputs) from the with-skill runs, look for:
scripts/.python scripts/lint_skill.py <name> --path <skill-dir>
# or: python scripts/lint_skill.py <name> --dest ~/.claude/skills --json
Checks (see references/lint-rules.md for the full list):
SKILL.md present, optional README.md, CHANGELOG.md.name + description required; description starts with "Use
when"; metadata.version is semver; tags present; description ≥ 80
chars (below = almost never enough trigger info).TODO placeholders left, ≤500 lines, interactive skills mention
AskUserQuestion (or explicitly declare non-interactive).argparse, no pip --break-system-packages in
production code paths.Fix error-severity findings before shipping. Treat warn as a checklist.
python scripts/bump_version.py <name> --path <skill-dir> \
--type patch \
--message "primary change summary" \
--change "additional bullet" \
--change "additional bullet"
| Bump | Use for |
|---|---|
patch | Bug fix, wording, frontmatter tweak, CJK rendering fix |
minor | New CLI flag, new option, new reference doc, expanded scope |
major | Breaking CLI change, removed option, renamed skill |
Stay in 0.x until the skill has stabilized across real use.
The script updates README.md version badge, SKILL.md frontmatter
metadata.version, and prepends a dated entry to CHANGELOG.md
(auto-created in Keep-a-Changelog format if missing).
Build a 20-query eval set (≈10 should-trigger, ≈10 near-miss should-not-trigger) and iterate the description against it. Split 60/40 train/test and pick the winner by test score, not train, to avoid overfitting.
See references/description-optimization.md for the "when-to-use-only" rule,
a worked set of good/bad description examples, and how to write realistic
near-miss queries.
~/.claude/skills/<name>/. Verify with a fresh conversation.skills/<name>/… and ship plugin manifests so one command
installs everywhere:
npx skills add <owner>/<repo> # auto-detects CC / Codex / Cursor / OpenCode / …
npx skills add <owner>/<repo> -g --all
Manifests: .claude-plugin/plugin.json + marketplace.json, .codex-plugin/,
.cursor-plugin/. Add CONTRIBUTING.md (required contract + PR checklist +
Conventional Commits) and a parallel README.zh-CN.md. Full guide:
references/shipping-a-skill-library.md.~/.claude/skills/<name>,
and rsync outward to any distribution repo. Never let copies drift — an
installer must get the same bytes as your local source.If you skipped RED, write one sentence in the skill's commit message explaining why: "Lightweight track, pure reference, verified by eyeballing the rendered output." Forcing yourself to justify it in writing prevents lazy skipping.
Anthropic's official shape is <what it does>. Use when <triggers/keywords>. —
state the capability AND the triggering conditions, third person. The red line
is the workflow: when the description summarizes the steps, the model
follows the description instead of reading the body. A description saying
"performs code review between tasks" (a how) caused the model to do ONE review;
the skill required TWO. "Reviews code for X. Use when executing plans with
independent tasks" (what + when) fixed it.
So: name what it does and when to fire — never recite first/then/finally. The
lint's FM_DESC_WORKFLOW_SUMMARY enforces the how red line, not the what.
See references/top-skills-playbook.md § The description is the product.
If the skill says "run lint before bump", running bump first and lint after "because it's faster" is a violation. The letter exists because the spirit isn't self-enforcing. When tempted to bend a rule, update the skill instead so the rule reflects your current judgment.
Push back on the user — politely — if:
CLAUDE.md, not in a globally
loaded skill.| Excuse | Reality |
|---|---|
| "Skipping RED because the skill is obvious" | Obvious to you ≠ obvious under pressure. Takes 3 minutes. Run it. |
| "I'll polish the description later" | Description is the only thing that makes the skill discoverable. Polish it before shipping, not "later". |
| "Just this once I'll summarize the workflow in the description" | The model will shortcut past the body. This is the single most tested failure mode. |
| "The lint warning isn't a real problem" | If you're right, fix the lint rule. If the rule is right, fix the skill. Don't suppress. |
| "I'll bump version later" | Every change without a CHANGELOG entry is a loss of history. Bump now. |
| "It's working for me" | A skill's job is to work for future Claude in conversations you won't see. Test from that stance. |
references/top-skills-playbook.md — the distilled quality bar from the
top-50 starred skill repos: two schools, description craft, progressive-
disclosure hard rules, degrees of freedom, anti-patterns, pre-ship checklist.references/prd-to-skill.md — PRD → finished top-tier skill: ingest/classify,
derive evals from acceptance criteria, the PRD intake template.references/router-and-workflows.md — the two skill shapes, when to use a
router, anatomy of routers/workflows, the Evidence Contract pattern.references/shipping-a-skill-library.md — multi-platform packaging (npx skills add, plugin manifests), CONTRIBUTING contract, Conventional Commits.references/frontmatter-spec.md — every field, when to set it, common
mistakes.references/tdd-for-skills.md — RED pressure scenarios, subagent prompts,
rationalization capture.references/description-optimization.md — the "when-to-use-only" rule, good
vs bad examples, iteration loop mechanics.references/evals-and-benchmark.md — evals.json schema, workspace layout,
benchmark review, grading assertions.references/lint-rules.md — full list of lint checks, severities, and the
reasoning behind each.Provides a checklist for code reviews covering functionality, security, performance, maintainability, tests, and quality. Use for pull requests, audits, team standards, and developer training.
npx claudepluginhub zwbao/skill-creator-pro --plugin skill-creator-pro