From skill-forge
Audits Claude Code skills for structure compliance, triggering accuracy, instruction quality, and best practices. Scores 0-100 with prioritized improvement recommendations.
How this skill is triggered — by the user, by Claude, or both
Slash command
/skill-forge:skill-forge-reviewThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Accept input as:
Accept input as:
~/.claude/skills/)Read all .md files, scripts, and asset files.
Run python scripts/validate_skill.py <path> for programmatic checks.
Manual verification:
name field| Check | Pass Criteria |
|---|---|
| Name format | kebab-case, 1-64 chars, no leading/trailing hyphens |
| Description present | Non-empty, 1-1024 characters |
| Description has WHAT | Explains capabilities |
| Description has WHEN | Includes trigger phrases |
| Description has keywords | Domain-specific terms included |
| No XML tags | No < or > characters |
| Optional fields valid | license, compatibility (<500 chars), metadata |
Assess the description for activation quality:
Under-triggering risks:
Over-triggering risks:
Generate test queries:
| Criterion | Score (0-10) |
|---|---|
| Specificity | Are instructions actionable? (not "validate properly") |
| Completeness | All workflows covered? |
| Error handling | Common failures addressed? |
| Examples | Concrete examples provided? |
| Progressive disclosure | Detailed docs in references/ not SKILL.md? |
| Length | Under 500 lines / 5000 tokens? |
| Cross-references | Clear links to references/scripts? |
For skills with sub-skills:
parent-child conventionScoring methodology (0-100):
| Category | Weight | Checks |
|---|---|---|
| Frontmatter Quality | 25% | Name, description, format |
| Trigger Accuracy | 20% | WHAT + WHEN + keywords |
| Instruction Quality | 25% | Specificity, completeness, examples |
| Structure Compliance | 15% | File naming, organization, references |
| Script Quality | 10% | If applicable (full marks if no scripts needed) |
| Progressive Disclosure | 5% | Proper use of 3-level system |
After reviewing, generate a structured trigger eval set for ongoing testing:
python scripts/generate_eval_set.py <path> to auto-generate a starter setevals/evals.json in the skill directoryGood queries are realistic and specific (include file paths, context, domain details). Bad queries are overly generic ("format this data") or obviously irrelevant.
python scripts/optimize_description.py <path> --eval-set evals/evals.json
to score the current description and get improvement suggestions/skill-forge eval <path> for full functional evaluation# Skill Review: [name]
## Health Score: [X]/100
## Summary
[2-3 sentence assessment]
## Scores by Category
| Category | Score | Notes |
|----------|-------|-------|
| Frontmatter | X/25 | [issues] |
| Triggering | X/20 | [issues] |
| Instructions | X/25 | [issues] |
| Structure | X/15 | [issues] |
| Scripts | X/10 | [issues] |
| Disclosure | X/5 | [issues] |
## Critical Issues (fix immediately)
- [issue 1]
- [issue 2]
## High Priority (fix within 1 week)
- [issue 1]
## Recommendations
- [suggestion 1]
- [suggestion 2]
## Suggested Test Queries
### Should Trigger
1. [query]
2. [query]
3. [query]
### Should NOT Trigger
1. [query]
2. [query]
3. [query]
npx claudepluginhub agricidaniel/skill-forgeEvaluates Claude skills across description quality, content organization, writing style, and structural integrity. Generates weighted scores, letter grades, and improvement plans.
Audits Claude skills against Anthropic prompting best practices including positive framing, motivation, and XML structure. Use after creation/modification, before release, or for inconsistent results.
Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.