Skill

skill-creator-pro

Use when creating new skills, iterating on existing skills, auditing skill quality, bumping skill versions, or optimizing skill trigger descriptions. Pro-grade lifecycle covering scaffold, baseline testing, evals, lint, semver, CHANGELOG, and description optimization. Triggers on "create a skill", "new skill", "迭代 skill", "优化 skill", "skill 版本", "skill lint", "review this skill", "write a skill for X", "turn this into a skill", or whenever the user starts or edits a SKILL.md.

Slash command: /skill-creator-pro:skill-creator-pro


SKILL.md: 418 lines · ~4.7k tokens

Stats

Language: Python
Stars: 1
Forks: 1
Maintenance: Good
Last Commit: Jun 1, 2026

skill-creator-pro — A pro-grade skill lifecycle

You are about to create or change a skill. That is a high-leverage act: one skill lives in the system prompt of every future conversation and gets invoked tens, hundreds, or thousands of times. Treat it like production code.

The lifecycle is one pipeline: RED → GREEN → EVAL → REFACTOR → LINT → BUMP → OPTIMIZE-DESC → PACKAGE / SYNC. Each step exists because skipping it has produced a specific failure mode in real use.

The Core Loop

Understand intent
  │
  ▼
RED: Run 1-2 baseline subagent scenarios WITHOUT the skill
  │  (watch it fail; record rationalizations verbatim)
  ▼
GREEN: scripts/init_skill.py <name>  →  fill SKILL.md addressing those failures
  │  (description = "Use when ..."; body explains WHY, not just WHAT)
  ▼
EVAL (optional, recommended for non-trivial skills):
  │  write evals.json → spawn with-skill + baseline subagents → benchmark.json
  │  → eval-viewer → user feedback → improve → re-run
  ▼
REFACTOR: close new rationalizations surfaced by evals
  │
  ▼
LINT: scripts/lint_skill.py <name>  →  fix all errors, then warns
  │
  ▼
BUMP: scripts/bump_version.py <name> --type <patch|minor|major> -m "..."
  │  (creates/updates CHANGELOG.md, badges, frontmatter.metadata.version)
  ▼
OPTIMIZE TRIGGER DESC: (when accuracy matters)
  │  iterate the description against 20 trigger queries (10 should-trigger, 10 near-miss)
  ▼
PACKAGE / SYNC: deploy to ~/.claude/skills/ — or ship a multi-platform library
                (npx skills add + .claude/.codex/.cursor plugin manifests)

Do not skip RED. The biggest failure mode of hand-written skills is that they solve problems the model never had — you assumed a rationalization that never appears, and missed the one that does.

Decide the Track Up Front

Not every skill earns the full pipeline. Pick a track before you start:

| Track | When | Skip |
| --- | --- | --- |
| Lightweight | Personal utility, narrow scope, pure-reference, one user | RED, EVAL, description optimization |
| Full | Discipline skill (TDD-like), shipped to others, >1 use case, high stakes | Nothing; run the whole loop |
| Audit-only | Existing skill the user wants polished | Start at LINT, skip RED/GREEN |

Announce the track to the user in one line: "Going with the lightweight track because X." Let them redirect.
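The track table can be read as a lookup from track to the steps that actually run. The following is a minimal sketch, not part of the skill's scripts: `PIPELINE`, `SKIPPED`, and `steps_for` are hypothetical names, and dropping REFACTOR on the lightweight track is an assumption (it depends on EVAL output).

```python
# Hypothetical helper mirroring the track table; not one of the skill's scripts.
PIPELINE = ["RED", "GREEN", "EVAL", "REFACTOR", "LINT", "BUMP", "OPTIMIZE-DESC", "PACKAGE"]

# Assumption: REFACTOR is dropped with EVAL on the lightweight track,
# because REFACTOR works from eval transcripts.
SKIPPED = {
    "lightweight": {"RED", "EVAL", "REFACTOR", "OPTIMIZE-DESC"},
    "full": set(),
}


def steps_for(track: str) -> list[str]:
    """Return the pipeline steps a track runs, in pipeline order."""
    if track == "audit-only":
        # "Start at LINT": everything before LINT is skipped.
        return PIPELINE[PIPELINE.index("LINT"):]
    return [step for step in PIPELINE if step not in SKIPPED[track]]


if __name__ == "__main__":
    for name in ("lightweight", "full", "audit-only"):
        print(name, "->", " → ".join(steps_for(name)))
```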

Decide the Skill Shape (orthogonal to track)

A skill has one of two shapes. Pick before scaffolding — retrofitting is costly.

| Shape | Structure | Use when |
| --- | --- | --- |
| Single-file | one `SKILL.md` (+ `references/`, `scripts/`) | the skill does one thing, or a few tightly-related things sharing one procedure |
| Router + workflows | thin router `SKILL.md` indexing `workflows/*.md`; agent loads only the selected one | ≥3 distinct user intents, body would exceed ~400 lines of independently-invoked sections, or the capability set will grow |

The router shape is how LLMQuant/skills ships 18 category skills and how meta+companion families (firefly / cancer-buddy / beacon) scale. A router is progressive disclosure: the model reads a ~40-line index + the one relevant workflow instead of skimming a 2000-line monolith on every near-miss. Less irrelevant text in context is a correctness lever, not just a cost one.

Scaffold a router with init_skill.py <name> --router (creates the router, workflows/, a sample workflow, and multi-platform plugin manifests). Full guide: references/router-and-workflows.md. Templates: templates/SKILL_ROUTER_TEMPLATE.md + templates/WORKFLOW_TEMPLATE.md.
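The shape decision reduces to three checks from the table. A minimal sketch, assuming a hypothetical `choose_shape` helper and treating "~400 lines" as a hard threshold, which the table itself leaves approximate:

```python
def choose_shape(distinct_intents: int, body_lines: int, will_grow: bool) -> str:
    """Pick a skill shape from the three router criteria in the table.

    Hypothetical helper: the 400-line cutoff is the table's approximate
    figure, used here as an exact threshold for illustration only.
    """
    if distinct_intents >= 3 or body_lines > 400 or will_grow:
        return "router"
    return "single-file"


if __name__ == "__main__":
    print(choose_shape(distinct_intents=1, body_lines=150, will_grow=False))
    print(choose_shape(distinct_intents=4, body_lines=150, will_grow=False))
```

Any one criterion is enough to pick the router, which matches the "or" in the table row.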

Classify the School (capability vs discipline)

Every top-tier skill is one of two schools. Classify before writing — they have different body structures, trigger styles, and verification methods, and blending them produces an incoherent skill.

| | Capability / reference | Discipline / process |
| --- | --- | --- |
| PRD says | "extract / convert / generate / analyze X" | "always / never / enforce / stop the agent from Z" |
| Body | table-first, ✅/❌ code recipes, scripts, QA loop | flowchart + one Iron Law + rationalization table + red-flags |
| Trigger | exhaustive keyword enumeration + `Do NOT` clause | minimal "Use when ..." |
| Verify | mandatory "assume-failure" QA loop (fresh-eyes subagent) | RED-GREEN-REFACTOR under multi-pressure scenarios |

Full taxonomy + the distilled quality bar from the top-50 starred skill repos: references/top-skills-playbook.md.

PRD-Driven Generation (input a PRD → finished skill)

When the user hands you a PRD / spec and says "build the skill", run the whole pipeline driven by the PRD. Do not stop at a draft — a PRD asks for a finished, top-tier skill.

1. Ingest & classify: extract the PRD intake (school, shape, trigger phrasings, inputs/outputs, dangerous-to-guess values, acceptance criteria, target models). Missing trigger phrasings or acceptance criteria → ask the user once (AskUserQuestion); both gaps produce a bad skill.
2. Derive evals from the PRD: each acceptance criterion → ≥1 checkable assertion (≥3 scenarios). Eval-first: tests before docs.
3. RED → scaffold → GREEN: run the baseline, then fill SKILL.md by school, applying the playbook.
4. Verify → LINT → OPTIMIZE-DESC → BUMP → PACKAGE: to completion.
5. Report against the PRD: list every requirement and whether the shipped skill meets it, with eval evidence; be honest about anything deferred.

Full procedure + the PRD intake template: references/prd-to-skill.md.
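The "derive evals from the PRD" step can be sketched as a coverage check: every acceptance criterion needs at least one assertion, and there must be at least three scenarios. The eval dictionary layout below is an assumption for illustration; the real schema lives in references/evals-and-benchmark.md and may differ.

```python
def evals_from_criteria(criteria: list[str]) -> list[dict]:
    """Create one eval stub per acceptance criterion (hypothetical layout)."""
    return [
        {"id": f"eval-{i}", "prompt": None, "assertions": [criterion]}
        for i, criterion in enumerate(criteria, start=1)
    ]


def coverage_problems(evals: list[dict], criteria: list[str], min_scenarios: int = 3) -> list[str]:
    """List criteria with no assertion, plus a note if there are too few scenarios."""
    covered = {assertion for ev in evals for assertion in ev["assertions"]}
    problems = [f"no assertion covers: {c}" for c in criteria if c not in covered]
    if len(evals) < min_scenarios:
        problems.append(f"only {len(evals)} scenario(s); need at least {min_scenarios}")
    return problems


if __name__ == "__main__":
    criteria = ["output is valid JSON", "no invented citations", "asks before deleting files"]
    print(coverage_problems(evals_from_criteria(criteria), criteria))
```

The stubs leave `prompt` empty on purpose: the realistic user prompt still has to be written by hand.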

Step-by-Step

1. Understand intent (2 min)

Ask, using AskUserQuestion when available:

- What should this skill enable a future Claude to do?
- What would a user actually type that should trigger it? (Get 2-3 real phrasings, not abstract intents.)
- Input → output: concrete shape, formats, file types.
- One sentence on why it doesn't already work without a skill; this is the gap the skill fills.

If the user says "turn this conversation into a skill", mine the transcript instead of asking: extract the steps they corrected, the tools used, the unobvious judgment calls. Then confirm the gaps.

2. RED — baseline subagent test (full track only)

Before writing a single line of the skill, run one or two pressure scenarios in a subagent that does not have the skill. Use the Agent tool:

Execute this task: [the realistic user prompt]. Report what you did and how you decided, verbatim. No skill is available.

Record what it did wrong and which rationalizations it used. These become the specific things your skill must counter. If the baseline succeeds effortlessly, the skill is probably unnecessary; tell the user.

See references/tdd-for-skills.md for pressure-scenario patterns, pressure types (time, sunk cost, authority, exhaustion), and a catalog of rationalizations.
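One way to keep the RED findings structured is a small record per baseline run. This is a sketch under assumptions: `BaselineRun` and its fields are hypothetical, and only the prompt wording comes from the text above.

```python
from dataclasses import dataclass, field

# Prompt wording taken from the RED step; everything else here is illustrative.
BASELINE_TEMPLATE = (
    "Execute this task: {task}. Report what you did and how you decided, "
    "verbatim. No skill is available."
)


@dataclass
class BaselineRun:
    """Findings from one subagent run made WITHOUT the skill."""

    task: str
    mistakes: list[str] = field(default_factory=list)
    rationalizations: list[str] = field(default_factory=list)  # keep these verbatim

    @property
    def prompt(self) -> str:
        return BASELINE_TEMPLATE.format(task=self.task)

    @property
    def skill_needed(self) -> bool:
        # A baseline that succeeds cleanly suggests the skill is unnecessary.
        return bool(self.mistakes or self.rationalizations)


if __name__ == "__main__":
    run = BaselineRun("rename the release notes")
    run.rationalizations.append("The change is small, so I skipped the changelog.")
    print(run.prompt)
    print("skill needed:", run.skill_needed)
```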

3. GREEN — scaffold + write

python scripts/init_skill.py <name> --dest ~/.claude/skills          # single-file
python scripts/init_skill.py <name> --dest ~/.claude/skills --router # router + workflows

The default creates <dest>/<name>/ with:

- SKILL.md: frontmatter template + TODO placeholders
- README.md: human-readable, version badge, install snippet, options table
- scripts/: empty; create if the skill has deterministic helpers
- references/: empty; create for heavy reference material (>100 lines)

--router instead creates a router SKILL.md, a workflows/ dir with a sample workflow, and .claude-plugin/ + .codex-plugin/ + .cursor-plugin/ manifests for multi-platform install. See references/router-and-workflows.md.

Now fill SKILL.md:

Frontmatter (see references/frontmatter-spec.md for the full spec):

---
name: <kebab-case-name>
description: Use when <specific triggering conditions>. Triggers on
  <concrete phrases>.
license: MIT
metadata:
  author: <you>
  version: "0.1.0"
  tags: <space-separated>
---

name and description are the only hard requirements; the rest improves discoverability, versioning, and lint signal.
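A quick way to confirm the two hard requirements before running the real linter is to read the top-level keys out of the frontmatter. The sketch below is a deliberately naive reader, not a YAML parser and not the skill's `lint_skill.py`; `top_level_fields`, `missing_required`, and the kebab-case pattern are assumptions for illustration.

```python
import re

# Assumed pattern for a kebab-case name: lowercase words joined by single hyphens.
KEBAB = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def top_level_fields(skill_md: str) -> dict[str, str]:
    """Collect top-level `key: value` pairs from the leading frontmatter block.

    Indented lines are folded into the previous key, which is enough for the
    template above but does not handle general YAML.
    """
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    current = None
    for line in lines[1:]:
        if line.strip() == "---":
            break
        if line[:1].isspace() and current:
            fields[current] = (fields[current] + " " + line.strip()).strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    return fields


def missing_required(fields: dict[str, str]) -> list[str]:
    """Return the hard-required fields that are absent or empty."""
    return [key for key in ("name", "description") if not fields.get(key)]


if __name__ == "__main__":
    sample = "---\nname: pdf-tools\ndescription: Use when splitting PDFs.\n---\nBody"
    fields = top_level_fields(sample)
    print(missing_required(fields), bool(KEBAB.match(fields["name"])))
```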

Body (≤500 lines; split to references/ beyond that):
- One-paragraph overview.
- "When to Use" bullet list of symptoms, and "When NOT to Use" for adjacent cases.
- Workflow steps (imperative: "Do X", not "You should do X").
- A quick-reference table for scanning.
- Common mistakes with fixes.
- Explain the WHY for every non-obvious rule. Today's models have theory of mind; rigid MUSTs without reasoning produce brittle compliance.

Evidence Contract (any skill that touches external facts; strongly recommended for medical / research / financial skills). Declare three things so the model never re-litigates "can I just make this up?" under pressure:
- Sources of record: the API / DB / files / tools that supply ground truth. Prefer live retrieval over the model's memory.
- Fallback: if a required input is missing, name the exact missing input and continue only with retrieved or user-provided evidence; never fill the gap from memory.
- Never fabricate: the value types that are dangerous to guess (quotes, prices, citations, lab values, dosages, trial IDs). A model synthesizing these is the failure the contract exists to stop.

Match the contract to the domain's "things dangerous to guess". See references/router-and-workflows.md § The Evidence Contract pattern.

README.md: humans read this on GitHub or in the file browser. Keep the version badge, install command, and a minimal usage example.
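The Evidence Contract's fallback rule can be illustrated in a few lines: take each required input from retrieved or user-provided evidence, and name whatever is still missing instead of guessing. `resolve_inputs` is a hypothetical helper written for this illustration, not something the skill ships.

```python
def resolve_inputs(
    required: list[str],
    retrieved: dict[str, object],
    user_provided: dict[str, object],
) -> tuple[dict[str, object], list[str]]:
    """Split required inputs into evidence we actually hold and names still missing.

    Retrieved values win over user-provided ones in this sketch (an assumption);
    nothing is ever filled in from a default or from memory.
    """
    evidence: dict[str, object] = {}
    missing: list[str] = []
    for name in required:
        if name in retrieved:
            evidence[name] = retrieved[name]
        elif name in user_provided:
            evidence[name] = user_provided[name]
        else:
            missing.append(name)
    return evidence, missing


if __name__ == "__main__":
    evidence, missing = resolve_inputs(
        ["price", "dosage"], retrieved={"price": 12.5}, user_provided={}
    )
    print("evidence:", evidence)
    print("ask the user for:", missing)
```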

4. EVAL — objective benchmark (full track)

Write 2-3 realistic test prompts in evals/evals.json, then spawn with-skill and baseline subagents in the same turn. Collect the outputs into <skill>-workspace/iteration-1/, grade them against the assertions, and aggregate the grades into benchmark.json. Review the side-by-side output and iterate.

See references/evals-and-benchmark.md for the schemas, workspace layout, and what a benchmark review should surface.
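The aggregation step boils down to comparing pass rates between the with-skill and baseline runs. The sketch below assumes a hypothetical graded-result layout (`{"assertions": {name: passed}}` per run); the actual benchmark.json schema is defined in references/evals-and-benchmark.md and may look different.

```python
def pass_rate(runs: list[dict]) -> float:
    """Fraction of graded assertions that passed across all runs."""
    graded = [passed for run in runs for passed in run["assertions"].values()]
    return sum(graded) / len(graded) if graded else 0.0


def benchmark(with_skill: list[dict], baseline: list[dict]) -> dict[str, float]:
    """Summarize both arms and the lift attributable to the skill."""
    skill_rate = pass_rate(with_skill)
    base_rate = pass_rate(baseline)
    return {"with_skill": skill_rate, "baseline": base_rate, "delta": skill_rate - base_rate}


if __name__ == "__main__":
    with_skill = [{"assertions": {"valid_json": True, "cites_source": True}}]
    baseline = [{"assertions": {"valid_json": True, "cites_source": False}}]
    print(benchmark(with_skill, baseline))
```

A delta near zero is the same signal as an effortless baseline in RED: the skill may not be earning its place.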

5. REFACTOR — close new rationalizations

When you read the transcripts (not just outputs) from the with-skill runs, look for:

- Repeated helper scripts the subagent wrote independently: bundle one into scripts/.
- New rationalizations the skill didn't counter: add them to a rationalization table in the body, and keep the why attached.
- Instructions the model ignored: they're likely too abstract or too rigid; rewrite as a principle with a concrete example.

6. LINT — automated audit

python scripts/lint_skill.py <name> --path <skill-dir>
# or:  python scripts/lint_skill.py <name> --dest ~/.claude/skills --json

Checks (see references/lint-rules.md for the full list):

- Structure: SKILL.md present, optional README.md, CHANGELOG.md.
- Frontmatter: name + description required; description starts with "Use when"; metadata.version is semver; tags present; description ≥ 80 chars (below = almost never enough trigger info).
- Description anti-pattern: warns if the description reads like a workflow summary ("then ... next ... finally ..."). When the description recites the steps, the model follows the description and skips the body. Keep the how out of the description.
- Body: no TODO placeholders left, ≤500 lines, interactive skills mention AskUserQuestion (or explicitly declare non-interactive).
- Scripts: Python CLIs use argparse, no pip --break-system-packages in production code paths.

Fix error-severity findings before shipping. Treat warn as a checklist.
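To make the description checks concrete, here is a rough sketch of three of them. It is not the skill's `lint_skill.py`: the severities, the finding codes other than FM_DESC_WORKFLOW_SUMMARY, and the "two or more sequencing words" heuristic are all assumptions.

```python
import re

# Assumed heuristic: two or more sequencing words suggest a workflow summary.
SEQUENCING = re.compile(r"\b(first|then|next|finally)\b", re.IGNORECASE)


def lint_description(description: str) -> list[tuple[str, str]]:
    """Return (severity, code) findings for a skill description."""
    findings: list[tuple[str, str]] = []
    if not description.startswith("Use when"):
        findings.append(("error", "DESC_PREFIX"))  # hypothetical code
    if len(description) < 80:
        findings.append(("warn", "DESC_TOO_SHORT"))  # hypothetical code
    if len(SEQUENCING.findall(description)) >= 2:
        findings.append(("warn", "FM_DESC_WORKFLOW_SUMMARY"))
    return findings


if __name__ == "__main__":
    print(lint_description("Runs lint, then bumps the version, finally packages."))
```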

7. BUMP — semver + CHANGELOG

python scripts/bump_version.py <name> --path <skill-dir> \
    --type patch \
    --message "primary change summary" \
    --change "additional bullet" \
    --change "additional bullet"

| Bump | Use for |
| --- | --- |
| `patch` | Bug fix, wording, frontmatter tweak, CJK rendering fix |
| `minor` | New CLI flag, new option, new reference doc, expanded scope |
| `major` | Breaking CLI change, removed option, renamed skill |

Stay in 0.x until the skill has stabilized across real use.

The script updates the README.md version badge and the SKILL.md frontmatter metadata.version, and prepends a dated entry to CHANGELOG.md (auto-created in Keep-a-Changelog format if missing).
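The version arithmetic behind a bump is small enough to show directly. This is a sketch of plain `MAJOR.MINOR.PATCH` bumping and a simplified changelog entry, not the behavior of `bump_version.py` itself; the function names are hypothetical and the entry omits the change-type subsections that Keep a Changelog recommends.

```python
def bump(version: str, kind: str) -> str:
    """Bump a plain MAJOR.MINOR.PATCH version string (no pre-release tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    if kind == "major":
        return f"{major + 1}.0.0"
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    if kind == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump type: {kind!r}")


def changelog_entry(version: str, date: str, message: str, changes: tuple[str, ...] = ()) -> str:
    """Render one dated entry: a version heading followed by bullet lines."""
    lines = [f"## [{version}] - {date}", "", f"- {message}"]
    lines += [f"- {change}" for change in changes]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    new_version = bump("0.1.0", "patch")
    print(changelog_entry(new_version, "2026-06-01", "primary change summary"))
```

Lower-order parts reset to zero on a minor or major bump, which is why `0.1.4` with `minor` becomes `0.2.0`.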

8. OPTIMIZE trigger description (optional, full track)

Build a 20-query eval set (≈10 should-trigger, ≈10 near-miss should-not-trigger) and iterate the description against it. Split 60/40 train/test and pick the winner by test score, not train, to avoid overfitting.

See references/description-optimization.md for the "when-to-use-only" rule, a worked set of good/bad description examples, and how to write realistic near-miss queries.
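The train/test discipline can be sketched as follows: with 20 queries, a 60/40 split gives 12 for iterating and 8 held out for choosing the winner. Everything here is illustrative. In particular, each candidate description is represented by a stand-in predicate, whereas in practice the trigger decision comes from running the model against the description.

```python
import random

Query = tuple[str, bool]  # (query text, should the skill trigger?)


def split_queries(queries: list[Query], train_fraction: float = 0.6, seed: int = 0):
    """Shuffle deterministically and split into (train, test)."""
    shuffled = queries[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(triggers, queries: list[Query]) -> float:
    """Share of queries where the trigger decision matches the expectation."""
    hits = sum(triggers(text) == expected for text, expected in queries)
    return hits / len(queries)


def pick_winner(candidates: dict, held_out: list[Query]) -> str:
    """Choose the candidate with the best score on the held-out set only."""
    return max(candidates, key=lambda name: accuracy(candidates[name], held_out))


if __name__ == "__main__":
    queries = [(f"write a skill for task {i}", True) for i in range(10)]
    queries += [(f"write a unit test for task {i}", False) for i in range(10)]
    train, test = split_queries(queries)
    candidates = {"broad": lambda q: True, "narrow": lambda q: "skill" in q}
    print(len(train), len(test), pick_winner(candidates, test))
```

A plain shuffle can leave the held-out set lopsided between should-trigger and near-miss queries; splitting each group separately would be a reasonable refinement.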

9. PACKAGE / SYNC

Single-location deploy (most users): the skill is already live in ~/.claude/skills/<name>/. Verify with a fresh conversation.
Multi-platform library (shipping several skills to other people / agents): lay the repo out as skills/<name>/… and ship plugin manifests so one command installs everywhere:
```
npx skills add <owner>/<repo>          # auto-detects CC / Codex / Cursor / OpenCode / …
npx skills add <owner>/<repo> -g --all
```
Manifests: .claude-plugin/plugin.json + marketplace.json, .codex-plugin/, .cursor-plugin/. Add CONTRIBUTING.md (required contract + PR checklist + Conventional Commits) and a parallel README.zh-CN.md. Full guide: references/shipping-a-skill-library.md.
Source-of-truth sync: keep one source repo, symlink ~/.claude/skills/<name>, and rsync outward to any distribution repo. Never let copies drift — an installer must get the same bytes as your local source.
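"Same bytes" is checkable. A minimal sketch, assuming a hypothetical `tree_digest` helper, that hashes every file's relative path and contents so a source tree and a distributed copy can be compared after syncing:

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """SHA-256 over relative paths and file bytes, in sorted path order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def in_sync(source: Path, copy: Path) -> bool:
    """True when both trees hold identical files with identical contents."""
    return tree_digest(source) == tree_digest(copy)


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 3:
        print("in sync" if in_sync(Path(sys.argv[1]), Path(sys.argv[2])) else "DRIFT")
```

Empty directories and file permissions are ignored in this sketch; only paths and contents are compared.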

The Three Iron Laws

1. No skill without a failing test (or a documented reason why)

If you skipped RED, write one sentence in the skill's commit message explaining why: "Lightweight track, pure reference, verified by eyeballing the rendered output." Forcing yourself to justify it in writing prevents lazy skipping.

2. Description = what + when, never how

Anthropic's official shape is `<what it does>. Use when <triggers/keywords>.`: state the capability AND the triggering conditions, in the third person. The red line is the workflow: when the description summarizes the steps, the model follows the description instead of reading the body. A description saying "performs code review between tasks" (a how) caused the model to do ONE review; the skill required TWO. "Reviews code for X. Use when executing plans with independent tasks" (what + when) fixed it.

So: name what it does and when to fire — never recite first/then/finally. The lint's FM_DESC_WORKFLOW_SUMMARY enforces the how red line, not the what. See references/top-skills-playbook.md § The description is the product.

3. Violating the letter = violating the spirit

If the skill says "run lint before bump", running bump first and lint after "because it's faster" is a violation. The letter exists because the spirit isn't self-enforcing. When tempted to bend a rule, update the skill instead so the rule reflects your current judgment.

When NOT to create a skill

Push back on the user — politely — if:

- The task is a one-off; skills are for repeated patterns.
- The rule is mechanically enforceable (regex, linter, type system). A skill documents judgment calls; automate the rest.
- The project-specific convention belongs in CLAUDE.md, not in a globally loaded skill.
- There's already a skill doing this; improve the existing one rather than forking.

Rationalization Table — watch yourself for these

| Excuse | Reality |
| --- | --- |
| "Skipping RED because the skill is obvious" | Obvious to you ≠ obvious under pressure. Takes 3 minutes. Run it. |
| "I'll polish the description later" | Description is the only thing that makes the skill discoverable. Polish it before shipping, not "later". |
| "Just this once I'll summarize the workflow in the description" | The model will shortcut past the body. This is the single most tested failure mode. |
| "The lint warning isn't a real problem" | If you're right, fix the lint rule. If the rule is right, fix the skill. Don't suppress. |
| "I'll bump version later" | Every change without a CHANGELOG entry is a loss of history. Bump now. |
| "It's working for me" | A skill's job is to work for future Claude in conversations you won't see. Test from that stance. |
Where to go next

- references/top-skills-playbook.md: the distilled quality bar from the top-50 starred skill repos: two schools, description craft, progressive-disclosure hard rules, degrees of freedom, anti-patterns, pre-ship checklist.
- references/prd-to-skill.md: PRD → finished top-tier skill: ingest/classify, derive evals from acceptance criteria, the PRD intake template.
- references/router-and-workflows.md: the two skill shapes, when to use a router, anatomy of routers/workflows, the Evidence Contract pattern.
- references/shipping-a-skill-library.md: multi-platform packaging (npx skills add, plugin manifests), CONTRIBUTING contract, Conventional Commits.
- references/frontmatter-spec.md: every field, when to set it, common mistakes.
- references/tdd-for-skills.md: RED pressure scenarios, subagent prompts, rationalization capture.
- references/description-optimization.md: the "when-to-use-only" rule, good vs bad examples, iteration loop mechanics.
- references/evals-and-benchmark.md: evals.json schema, workspace layout, benchmark review, grading assertions.
- references/lint-rules.md: full list of lint checks, severities, and the reasoning behind each.
