From skill-reviewer
Evaluate a skill's design quality against best practices from Anthropic's skill-building guidance and real-world patterns from high-quality skills.
How this skill is triggered — by the user, by Claude, or both
Slash command
/skill-reviewer:skill-reviewerThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Evaluate a skill's design quality against best practices from Anthropic's skill-building guidance and real-world patterns from high-quality skills.
Evaluate a skill's design quality against best practices from Anthropic's skill-building guidance and real-world patterns from high-quality skills.
| /reflect | /skill-reviewer | |
|---|---|---|
| Input | Current conversation history | Skill files on disk |
| Evaluates | What worked/failed during usage | Design, structure, documentation quality |
| When | After using a skill | Any time, even before first use |
| Output | Conversation learnings, SKILL.md patches | Scored evaluation with actionable recommendations |
/skill-reviewer publisher-lookup → Review by skill name (auto-discovers path)
/skill-reviewer ./skills/my-skill/ → Review by path
/skill-reviewer --fix publisher-lookup → Review + apply quick fixes automatically
/skill-reviewer --compare reflect taboolar → Side-by-side comparison of two skills
$ARGUMENTS = [--fix] <skill-path-or-name> OR --compare <left-skill> <right-skill>
Parse the argument before starting:
--compare, expect exactly two skill names following it. Run comparison mode (Phase 7).--fix, the next token is the skill name. Run normal evaluation then apply fixes (Phase 6).Step 1 - Locate the skill. Search these paths in order, stop at first match:
./, /, ~/)~/.claude/skills/<name>/SKILL.md~/.claude/plugins/*/skills/<name>/SKILL.md~/Code/taboola-pm-skills/plugins/<name>/skills/<name>/SKILL.md.claude/skills/<name>/SKILL.md**/skills/<name>/SKILL.md (exclude .git/, node_modules/, .venv/)If no SKILL.md found: Stop. Report "Skill not found: . Searched [list paths checked]. Try providing the full path."
If multiple matches: List all candidates and ask "Which skill did you mean?" before proceeding.
Step 2 - Resolve symlinks. If the located path is a symlink, resolve it once to the real path. Work from the real path for all subsequent steps. Do not follow further symlinks inside the skill directory.
Step 3 - Inventory the skill root. Check for these standard owned subfolders only:
SKILL.md (required)LEARNINGS.mdCLAUDE.mdreferences/ - list filesscripts/ - list filestemplates/ - list filesdata/ - list filesexamples/ - list filesDo NOT recurse into nested plugins/, .git/, node_modules/, vendor/, .venv/, or additional SKILL.md files beyond the root unless the main SKILL.md explicitly delegates to them.
Step 4 - Read SKILL.md fully. Extract frontmatter fields and note body structure (sections, line count).
Step 5 - Read supporting files referenced by SKILL.md or needed to score a dimension. Do not read files speculatively.
Step 6 - Count metrics: SKILL.md line count, number of files per subfolder.
Before scoring, check if this skill has been reviewed before:
python3 scripts/history.py --skill <skill-name>
If prior reviews exist, note the previous score and grade. A re-review should explain what changed since the last evaluation.
Read references/evaluation-framework.md for the 9 categories. Use the signal words table as hints only — confirm classification semantically from the surrounding text, not keyword presence alone.
A skill should fit ONE category cleanly. If evidence is mixed, mark confidence as low and explain why. Do not force a category.
Read references/evaluation-framework.md for the full rubric with scoring anchors.
For each dimension, use keyword patterns as hints — confirm semantically before assigning a score. When evidence is ambiguous, round down and note what would raise the score.
D1: Progressive Disclosure (0-3) - Does it use the file system for context engineering?
D2: Description Quality (0-3) - Is the frontmatter description trigger-focused?
D3: Gotchas Section (0-3) - Is there a section documenting known failures?
D4: Don't State the Obvious (0-3) - Is content non-obvious to Claude?
D5: Avoid Railroading (0-3) - Is there flexibility for Claude to adapt?
D6: Setup & Configuration (0-3) - Are user-specific values configurable?
/Users/specific-name/, hardcoded emails or IDs baked into the skillD7: Memory & Data Storage (0-3) - Does it retain useful state across runs?
D8: Scripts & Composable Code (0-3) - Does it provide reusable scripts?
D9: Frontmatter Quality (0-3) - Is the YAML frontmatter well-configured?
Bash without subcommand scope, no argument-hintD10: Hooks Integration (0-2) - Does it use on-demand hooks?
score.py auto-detects the structural anti-patterns (persona-only, script-wrapper, monolithic, hardcoded-paths, no-error-handling, stale-references, overpermissioned unscoped Bash, no-output-format). Trust its output for those.
The following three require your semantic judgment — score.py does not detect them:
| Anti-Pattern | How to Detect (Claude only) |
|---|---|
| Redundant knowledge | Read the content — is this generic best-practice Claude already knows, or org-specific? |
| Category straddling | After classifying in Phase 2, does the skill pull equally hard in 2+ directions? |
| Overpermissioned (unused tools) | Do the listed allowed-tools all appear in the workflow? Score.py only catches unscoped Bash. |
Read references/report-template.md and fill it completely. Every section is required except "Comparison to Gold Standard" (see benchmarks.md for when to include it).
Score total: D1+D2+D3+D4+D5+D6+D7+D8+D9 (each 0-3) + D10 (0-2) = max 29 Grade: 25-29=A, 20-24=B, 15-19=C, 10-14=D, 0-9=F
Only auto-apply safe, non-destructive improvements. Show each change before applying.
Auto-apply:
argument-hint, tighten allowed-tools scope## Common Mistakes\n\n<!-- TODO: document gotchas from real usage -->references/ directory structure if SKILL.md > 150 lines and no subfolders existLEARNINGS.md only if the skill produces outputs where cross-run retention makes sense (not for all skills)Never auto-apply:
Run Phase 1-4 for each skill independently. Then produce a side-by-side table:
## Comparison: [left-skill] vs [right-skill]
| Dimension | [left] | [right] | Winner |
|-----------|--------|---------|--------|
| D1 | X | Y | ... |
...
| **Total** | X/29 | Y/29 | ... |
## What [left] does better
...
## What [right] does better
...
## What each could adopt from the other
...
Run this before Phase 3, not after. It gives D1, D2, D3, D6, D7, D8, D9, D10 plus anti-pattern detection in seconds:
python3 scripts/score.py <skill-directory> --json
Use --json so the output is machine-parseable. Feed the structural scores directly into the report table. Only D4 and D5 require your semantic judgment — add them to the structural total for the full score.
Adjust a structural score only if you have specific semantic evidence that contradicts it, and note the reason.
Read config.json at startup. Key settings:
history_log — path to history log (default: ~/.claude/plugin-data/skill-reviewer/history.log)default_verbose — show evidence in table outputbenchmarks_check — whether to attempt benchmark comparison (default: true)auto_fix_threshold — only offer --fix mode if score is below this (default: 15)After generating a report, append a one-line entry using the history script:
python3 scripts/history.py --append "YYYY-MM-DD | <skill-name> | <score>/29 | <grade> | <top-issue>"
To view past evaluations for a skill:
python3 scripts/history.py --skill <name>
This allows tracking improvement over time when a skill is reviewed repeatedly.
After every review session, check if anything new was discovered that isn't already in LEARNINGS.md:
score.pyIf yes, append it to LEARNINGS.md under the relevant section (Gotchas, Patterns That Work, or Version History). Keep entries specific: include the skill name, what happened, and the fix or insight.
This trigger is the mechanism that makes D3 improve over time — LEARNINGS.md feeds back into the gotchas section as the skill accumulates real failure data.
See examples/publisher-lookup-review.md for a complete example output to calibrate report quality.
references/evaluation-framework.md — 10-dimension scoring rubric with anchorsreferences/report-template.md — output format templatereferences/benchmarks.md — gold standard skills and their known pathsscripts/score.py — structural scoring script (D1-D3, D6-D10 + anti-patterns)scripts/history.py — view and append to the evaluation history logexamples/publisher-lookup-review.md — complete example report for calibrationconfig.json — configurable history path and behaviour settingsLEARNINGS.md — captured gotchas and patterns from real usageSymlink resolution: Resolve the symlink once, work from the real path. Do not chase further symlinks inside the skill. Some skills in taboola-pm-skills are symlinked 2-3 levels deep.
Nested plugins: Some repos have nested plugins/*/skills/*/plugins/ structures. Do not recurse into them — they are artifacts, not skill substructures.
Multiple matches: Glob searches will find the same skill in both the plugin source and its symlinked location. Prefer the real path (plugin source) over the symlink.
Stateless skills are not broken: A thin skill that wraps a script and has no LEARNINGS.md may score D7=0 correctly. Don't inflate the score or auto-add LEARNINGS.md unless retention genuinely helps.
Keyword classification is unreliable: "PR" appears in Runbooks and CI/CD skills. "template" appears in Scaffolding and Business Process skills. Always read the surrounding text before classifying.
npx claudepluginhub omriariav/omri-cc-stuff --plugin skill-reviewerEvaluates Claude skills across description quality, content organization, writing style, and structural integrity. Generates weighted scores, letter grades, and improvement plans.
Audits Claude Code skills for structure compliance, triggering accuracy, instruction quality, and best practices. Scores 0-100 with prioritized improvement recommendations.
Evaluates SKILL.md design quality against official specs and best practices. Provides multi-dimensional scoring, knowledge categorization, and actionable improvement suggestions.