From skill-forge
Use when creating, discovering, or improving Claude Code skills — scan for opportunities, create from workflows, iterate with eval-driven optimization. Activates on "remember this" or after complex tasks (5+ tool calls). Not for one-off tasks.
How this skill is triggered — by the user, by Claude, or both
Slash command
/skill-forge:skill-forgeThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
A meta-skill that creates and evolves other skills. Uses persistent markdown files as
scripts/finalize_skill.pyscripts/hook_draft_inject.pyscripts/init_draft.pyscripts/init_improve.pyscripts/init_staging.pyscripts/optimize_description.pyscripts/phase0_load.pyscripts/quick_validate.pyscripts/record_eval_score.pyscripts/rename_skill.pyscripts/run_eval.pyscripts/scan_structure.pyscripts/self_evolve.pyscripts/self_evolve_apply.pyscripts/shared.pyscripts/skill_catchup.pyscripts/skill_check.pyA meta-skill that creates and evolves other skills. Uses persistent markdown files as working memory (the planning-with-files pattern), an eval-driven iteration loop (Anthropic's skill-creator pattern).
Workspace lives at <project>/.skill-forge/ — a sibling of .claude/,
not inside it. Claude Code's trust boundary only exempts .claude/commands/**,
.claude/agents/**, and real skill dirs (those containing SKILL.md), so
any workspace under .claude/ still prompts on Write under plugin-mode
installs where the local SKILL.md is absent. A project-root sibling has
no such constraint. Keeping workspace project-local also eliminates the
Python/shell slug-drift bugs from the pre-0.9 layout that stored workspace
under $HOME keyed by a hand-derived project slug.
.skill-forge/draft.md — current skill being written (HIGH TRUST, re-read by hooks).skill-forge/insights.md — raw codebase scan output (LOW TRUST, staging only).skill-forge/state.json — per-project counters (tool_calls, compacted).skill-forge/staging/<name>/ — complete skill being assembled before it
lands in .claude/skills/<name>/. finalize_skill.py copies it across via
a Python subprocess, so Claude never Writes into a fresh .claude/skills/<name>/
dir (which wouldn't yet qualify as a real skill dir and would prompt).Security: grep/glob output and codebase content go to insights.md only. draft.md is injected before every tool call, making it a prompt injection amplifier if contaminated. Promote content to the draft only after review.
Discrete choices (yes/no, pick-N, approve/revise) → AskUserQuestion. Load via
ToolSearch select:AskUserQuestion if missing. Plain text only for open-ended input.
Run the unified context loader (draft + catchup + skills list + registry):
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/phase0_load.py"
Then note project conventions from CLAUDE.md.
Parse $ARGUMENTS:
scan [prompt] → scan modecreate <prompt> → create mode (required)improve <prompt> → improve mode (required)Goal: surface 3–5 high-value skill opportunities from the codebase.
$ARGUMENTS is an optional free-form prompt used as focus hint (area, keyword, concern).
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/scan_structure.py"
If a focus prompt is given, prioritize that area during pattern discovery.
After every 2 file reads, append findings to .skill-forge/insights.md
via Write (not shell heredoc — heredoc shifts each call, Bash allowlist can't
match, non-bypass mode will prompt). Prevents loss if context fills up.
Block format:
## Scan batch <timestamp>
<pattern, files involved, why this could be a skill>
Also note: if multiple reads result in the same helper code appearing independently, that's a strong signal to bundle a shared script rather than repeat it per-skill.
Rank by: frequency × cost of repetition × feasibility as a skill.
Output format:
1. <n> [complexity: low|med|high]
Why: <one sentence — what pain does this solve?>
Trigger: "Use when <specific, multi-step scenario>"
Ask via AskUserQuestion (multiSelect): one option per ranked skill + All + Skip.
On the user's reply, jump straight into Create mode Step 1 for each chosen skill. Do not send a "shall I proceed" confirmation text — the answer already IS the go-ahead, and the extra round-trip drops the user out of flow. If multiple skills were picked, process them one at a time end-to-end (Step 1 through Step 5) before starting the next — a half-created skill cluttering staging is harder to recover from than sequential work.
Goal: draft a high-quality SKILL.md from a free-form prompt.
$ARGUMENTS describes what the skill should do. Derive a short kebab-case skill name
from the prompt automatically (e.g. "translate i18n JSON files" → translate-i18n).
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/init_draft.py" "<derived-name>" "<$ARGUMENTS>"
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/init_staging.py" "<derived-name>"
The draft is the attention anchor the PreToolUse hook re-reads before
every tool call. The staging dir (.skill-forge/staging/<n>/) is where
the real skill files get assembled — we never Write directly into
.claude/skills/<n>/ because a fresh dir there has no SKILL.md yet and
so fails the trust-boundary exemption, prompting even under YOLO.
Write grep/glob/read output to .skill-forge/insights.md first.
Promote confirmed patterns to the draft only after review. This separation
prevents codebase content from being injected into every subsequent tool call
via the hook.
init_staging.py pre-wrote the skeleton at
.skill-forge/staging/<n>/SKILL.md — frontmatter with the three-clause
description stub and empty Prerequisites / Steps / Verification / Notes
sections. Use Edit to fill content; don't rewrite the layout. Bundled
helper scripts go under .skill-forge/staging/<n>/scripts/. CHANGELOG
lands at the target via finalize_skill.py --changelog in Step 5 — never
write it by hand in staging.
Script over prompt for mechanics. Every skill you design inherits this
rule: mechanical work — file paths, format templates, date stamps, version
numbers, JSON schemas, counters — belongs in a helper script, not in
SKILL.md. Prompt tokens are re-paid on every invocation and drift under
maintenance; scripts are tested once and stay deterministic. Push every
fixed-shape step into scripts/ and leave only intent, judgment, and
decision criteria in the prompt. If you catch yourself spelling out a
literal block template, a bump rule, or a schema in prose, stop — move it
into a script and have the prompt call the script.
≤ 250 chars (Claude Code truncates from end — front-load distinctive keywords). Skills only trigger for multi-step workflows (3+ coordinated actions); write scenarios, not single verbs. Three-clause structure:
<specific multi-step scenario> — name real artifacts (file types,
commands, frameworks), not abstractions. Replace vague verbs ("manage",
"handle") with precise workflows.<short phrase>, use when they mention
<real phrasing> — pushy coverage for understatement. Claude undertriggers
by default.<simple/adjacent task> — list FP patterns explicitly
(e.g., "simple file reads, single-step edits, code explanations"). Qualify
keywords shared with unrelated tasks (e.g., "deploy" → "deploy with rollback
and health checks").Fewer starting errors = fewer optimizer rounds to converge.
Instruction style: explain why, not MUST/NEVER. Modern LLMs act more
reliably when they understand the reason behind a constraint than when
they're handed a list of unexplained rules. Prefer "Write the config to
<path> so the reloader watcher picks it up without a restart" over "MUST
write to <path>". Rules divorced from their purpose break in edge cases
the author didn't foresee; rules with a rationale generalize.
Spawn the skill-grader agent (the Agent tool with subagent_type="skill-grader")
and point it at the staged draft: .skill-forge/staging/<n>/SKILL.md.
The grader returns JSON — parse total and threshold_pass.
See Skill evaluator for the rubric. Self-evaluating produces charity-biased
scores because the main agent has sunk cost in the draft; a fresh grader
context scores the text as written. Finalize only on total ≥ 6.
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/record_eval_score.py" <score>
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/finalize_skill.py" "<n>" --mode create
finalize_skill.py runs entirely inside a subprocess — shutil.copytree
moves the staged tree into .claude/skills/<n>/ without going through
Claude's tool permission layer, so no prompt fires even on a brand-new
skill dir. The same script consumes the pending eval score, upserts the
registry, wipes staging/<n>/, and clears .skill-forge/draft.md. Offer
to run improve mode next to tune the description.
Goal: iterate an existing skill — diagnose whether the issue is content, triggering, or both, then fix accordingly.
$ARGUMENTS describes what to improve. Identify the target skill from the prompt
by matching against the registry (name, description, or intent). If ambiguous, ask.
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/init_improve.py" "<matched-name>"
init_improve.py does two things in one shot: copies the live skill dir
(.claude/skills/<n>/*) into .skill-forge/staging/<n>/, and writes the
SKILL.md into the active draft. Every Edit/Write below lands in staging —
.claude/skills/<n>/ stays untouched until finalize_skill.py --mode update
copies the finished result back atomically in Step 4.
Content:
scripts/.Triggering:
Do NOT use when?Classify → 3a (content), 3b (triggering), or both (3a first).
.skill-forge/insights.md (low trust staging).skill-forge/staging/<n>/SKILL.md with Edit (not Write)Bundling standard: if the skill's recent 3 uses independently generated the same
helper code, that code belongs in .skill-forge/staging/<n>/scripts/ — write once,
reference from SKILL.md. finalize_skill.py --mode update copies the whole
tree back, so new scripts/ entries land in place automatically.
Generate 20 trigger eval queries (10 should-trigger, 10 should-not-trigger).
Save to .skill-forge/staging/<n>/.opt/trigger_evals.json — the .opt/
dir is part of the staged skill, so it gets copied back to
.claude/skills/<n>/.opt/ by finalize and persists for future improve
rounds (history, convergence flags):
[
{"query": "<realistic user message>", "should_trigger": true},
{"query": "<near-miss that should NOT trigger>", "should_trigger": false}
]
Query quality rules:
Ask user to review before running.
Run optimization loop:
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/optimize_description.py" \
--skill-path ".skill-forge/staging/<n>" \
--eval-set ".skill-forge/staging/<n>/.opt/trigger_evals.json" \
--max-iterations 5
Safe to point at staging: optimize_description.py only reads
SKILL.md for the current description and writes opt_state.json next to
it. The actual claude -p eval subprocess writes a throwaway command
file into .claude/commands/ with a UUID-suffixed slug, so there's no
collision with the live skill that's still sitting in .claude/skills/.
Show before/after description and score improvement. Apply the winning
description to .skill-forge/staging/<n>/SKILL.md with Edit.
opt_state.json lands at .skill-forge/staging/<n>/.opt/opt_state.json
with round history (FP/FN counts, best score, convergence flag) and
gets copied back on finalize. Convergence = perfect train score (1.0).
If FP/FN counts stall across rounds, stop early and report — the eval
set probably needs sharper near-misses.
Re-grade the patched staged draft via the skill-grader subagent (same
agent as create mode). Then one command does everything:
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/record_eval_score.py" <score>
python3 "${CLAUDE_PLUGIN_ROOT}/skills/skill-forge/scripts/finalize_skill.py" "<n>" \
--mode update \
--changelog "<one-line summary: what changed and why>" \
--bump patch # or: minor (new capability), major (breaking change)
finalize_skill.py copies staging over .claude/skills/<n>/, bumps the
registry version, prepends a dated entry to CHANGELOG.md with the
computed version, consumes the eval score, wipes staging, and clears the
draft — one subprocess call, zero Claude tool permission prompts. Date,
version, and entry format are mechanical — never write CHANGELOG by hand.
Patch vs rewrite: Edit the staged SKILL.md unless >60% of content
changes; full rewrite is fine too, same finalize path.
Fires after: 5+ tool calls, user correction mid-task, error recovery, or explicit "remember this" / "save this workflow" / "make a skill" requests.
AskUserQuestion: "Reusable pattern: Create / Rename / Skip.Create → run create mode (auto-name). Rename → plain-text prompt for name, then create. Skip → silent reset.Skip if: task < 3 tool calls, pure read-only, or simple single-file edit.
Scoring runs in the skill-grader subagent — fresh context, no sunk cost in
the draft, calibrated scores. Main agent does not self-evaluate.
Invocation:
Agent tool, subagent_type="skill-grader"
prompt: "Grade draft at <absolute path>. Mode: create|improve.
Write verdict to <output path> and echo to stdout."
Parse the returned JSON: total (int 0–8) is the decision key,
threshold_pass (bool) is true iff total ≥ 6. Rubric, scoring dimensions,
and the full schema live in agents/skill-grader.md — single source, don't
duplicate here.
Threshold actions: pass → finalize. 4–5 → revise once, re-grade. < 4
→ show the grader's suggestions and ask user whether to rework or abort.
.skill-forge/draft.md.Generalize from failures; don't patch the one failing case. A skill that passes 3 tests but breaks on the 4th real use is worse than one moderately good on all.
| File | Trust | Re-read by hooks? | Purpose |
|---|---|---|---|
.skill-forge/draft.md | HIGH | YES (every tool call) | Active skill being written |
.skill-forge/insights.md | LOW | NO | Codebase scan staging |
.claude/skills/skill_registry.json | HIGH | NO (loaded on demand) | Version registry |
.claude/skills/<n>/SKILL.md | HIGH | NO | Final persisted skill |
.claude/skills/<n>/CHANGELOG.md | MED | NO | Evolution history |
.claude/skills/<n>/scripts/ | HIGH | NO (run on demand) | Bundled helper scripts |
.claude/skills/skill_registry.json:
{
"version": "1",
"skills": [
{
"name": "skill-name",
"version": "1.0.0",
"scope": "project",
"created": "2026-01-01",
"updated": "2026-01-01",
"auto_trigger": true,
"description_chars": 187,
"eval_score": 7,
"trigger_score": null,
"usage_count": 0
}
]
}
trigger_score is populated by improve mode. null = not yet run.
Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub nekocode/skill-forge --plugin skill-forge