From skill-battlefield
Stress-test a Claude Code skill across diverse scenarios. Generates test scenarios (easy/medium/hard/adversarial), evaluates via simulated execution, proposes targeted improvements, and produces an evidence report. Use when: user wants to test a skill, evaluate skill quality, find weaknesses, sharpen a skill, or run skill diagnostics. Trigger on: 'battle', 'test this skill', 'evaluate my skill', 'sharpen', 'find weaknesses', 'skill quality'. DO NOT use for: writing new skills from scratch, general code review, non-skill markdown files.
How this skill is triggered — by the user, by Claude, or both
Slash command
/skill-battlefield:battle <skill-path> [--scenarios 20] [--iterations 5]
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Stress-test any Claude Code skill: generate scenarios, evaluate, sharpen, and report.
Parse $ARGUMENTS:
- skill_path (required)
- --scenarios N = num_scenarios (default 20)
- --iterations N = max_iterations (default 5)
If no skill_path provided, print usage and stop:
Usage: /battle <skill-path> [--scenarios 20] [--iterations 5]
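The argument handling above can be sketched in Python. This is only an illustration under stated assumptions: the function name `parse_battle_args` is hypothetical, the raw `$ARGUMENTS` value is assumed to arrive as a single shell-style string, and the usage text follows the slash-command argument hint shown earlier.

```python
import shlex

# Usage text, following the argument hint of the slash command.
USAGE = "Usage: /battle <skill-path> [--scenarios 20] [--iterations 5]"


def parse_battle_args(arguments: str) -> dict:
    """Split a raw $ARGUMENTS string into skill_path, num_scenarios, max_iterations.

    Hypothetical helper: the skill itself only describes the expected values.
    """
    tokens = shlex.split(arguments)
    parsed = {"skill_path": None, "num_scenarios": 20, "max_iterations": 5}
    flags = {"--scenarios": "num_scenarios", "--iterations": "max_iterations"}

    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in flags:
            if i + 1 >= len(tokens):
                raise ValueError(USAGE)  # flag given without a value
            parsed[flags[token]] = int(tokens[i + 1])
            i += 2
        elif parsed["skill_path"] is None:
            parsed["skill_path"] = token  # first positional token is the skill path
            i += 1
        else:
            raise ValueError(USAGE)  # unexpected extra positional argument

    if parsed["skill_path"] is None:
        raise ValueError(USAGE)  # skill_path is required
    return parsed
```

Calling `parse_battle_args("my-skill/SKILL.md --scenarios 8")` keeps the default of 5 iterations and overrides only the scenario count.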
Resolve ~ in skill_path to the absolute home directory path.
Validate that skill_path exists and is readable. If invalid, print an error message describing the problem and stop.
- Derive skill_name from the parent directory name (e.g., my-skill/SKILL.md -> my-skill)
- Set skill_dir to the parent directory of skill_path
- Generate a timestamp: date +%Y%m%d-%H%M%S
- Create run_dir: ~/.skill-battlefield/runs/{skill_name}/{timestamp}/
- Create the {run_dir}/iterations/ subdirectory
- Copy the skill file to {run_dir}/skill-original.md
- Print: Starting battle for {skill_name}. Run directory: {run_dir}
Set phases_dir to the phases/ directory relative to this skill file.
Set refs_dir to the references/ directory relative to this skill file.
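The run setup (skill name, timestamp, run directory layout) could be computed along these lines. This is a sketch, not the skill's own code: `plan_run` and its dictionary keys are hypothetical names, the home directory and clock are passed in so the result is deterministic, and nothing is written to disk.

```python
from datetime import datetime
from pathlib import Path


def plan_run(skill_path: str, home: Path, now: datetime) -> dict:
    """Compute the names and paths used by the setup steps (no filesystem writes)."""
    # Resolve a leading ~ against the given home directory.
    if skill_path == "~" or skill_path.startswith("~/"):
        resolved = home / skill_path[2:]
    else:
        resolved = Path(skill_path)

    skill_dir = resolved.parent           # parent directory of skill_path
    skill_name = skill_dir.name           # my-skill/SKILL.md -> my-skill
    timestamp = now.strftime("%Y%m%d-%H%M%S")  # same format as date +%Y%m%d-%H%M%S
    run_dir = home / ".skill-battlefield" / "runs" / skill_name / timestamp

    return {
        "skill_name": skill_name,
        "skill_dir": skill_dir,
        "timestamp": timestamp,
        "run_dir": run_dir,
        "iterations_dir": run_dir / "iterations",
        "original_copy": run_dir / "skill-original.md",
    }
```

To materialize the plan, one would create `iterations_dir` with `mkdir(parents=True, exist_ok=True)` and copy the skill file to `original_copy`, for example with `shutil.copy`.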
Read {phases_dir}/01-analyze.md and {refs_dir}/contracts.md.
Dispatch a subagent with:
- 01-analyze.md as instructions
- contracts.md (analysis.json schema)
- run_dir and skill_dir paths
HARD GATE: Verify {run_dir}/analysis.json exists after completion. If missing, stop with error.
Print: Phase 01 complete: analysis.json written
Read {phases_dir}/02-generate.md, {refs_dir}/contracts.md, {refs_dir}/taxonomy.md, and {refs_dir}/rubric-guide.md.
Dispatch a subagent with:
- 02-generate.md as instructions
- contracts.md, taxonomy.md, and rubric-guide.md
- run_dir, skill_dir, and num_scenarios
HARD GATE: Verify {run_dir}/scenarios.jsonl exists. Count lines and confirm it matches num_scenarios. If invalid, stop with error.
Print: Phase 02 complete: {num_scenarios} scenarios generated
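The Phase 02 hard gate (file exists, line count equals num_scenarios) might look like the following. The function names are hypothetical, and two details are assumptions beyond the text: blank lines are not counted as scenarios, and each line is required to parse as a JSON object.

```python
import json
from pathlib import Path


def count_valid_scenarios(jsonl_text: str) -> int:
    """Count non-empty lines, requiring each one to parse as a JSON object."""
    count = 0
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are not scenarios
        record = json.loads(line)  # raises json.JSONDecodeError on malformed JSON
        if not isinstance(record, dict):
            raise ValueError(f"line {number} is not a JSON object")
        count += 1
    return count


def check_scenarios_gate(run_dir: Path, num_scenarios: int) -> None:
    """Raise if scenarios.jsonl is missing or its scenario count is wrong."""
    path = run_dir / "scenarios.jsonl"
    if not path.is_file():
        raise FileNotFoundError(f"HARD GATE failed: {path} is missing")
    found = count_valid_scenarios(path.read_text(encoding="utf-8"))
    if found != num_scenarios:
        raise ValueError(
            f"HARD GATE failed: expected {num_scenarios} scenarios, found {found}"
        )
```

The same existence check applies to the other gates (analysis.json, baseline-evals.jsonl, skill-current.md, report.md); only this one also compares a count.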
Read {phases_dir}/03-baseline.md, {refs_dir}/contracts.md, and {refs_dir}/rubric-guide.md.
Dispatch a subagent with:
- 03-baseline.md as instructions
- contracts.md and rubric-guide.md
- run_dir, skill_dir, and skill_version = "v0"
HARD GATE: Verify {run_dir}/baseline-evals.jsonl exists. If missing, stop with error.
Calculate baseline score: mean of all score fields from baseline-evals.jsonl.
Print: Phase 03 complete: baseline score = {baseline_score}
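As an illustration of the baseline calculation, the mean of the score fields can be taken directly from the JSONL text. The helper name `mean_score` is hypothetical; the only field assumed is `score`, as named above, and the full record schema lives in contracts.md.

```python
import json


def mean_score(jsonl_text: str) -> float:
    """Return the mean of the `score` field across all non-empty JSONL lines."""
    scores = [
        json.loads(line)["score"]
        for line in jsonl_text.splitlines()
        if line.strip()
    ]
    if not scores:
        raise ValueError("no evaluation records found")
    return sum(scores) / len(scores)
```

For two records scoring 4 and 7, the baseline would be 5.5.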
Read {phases_dir}/04-sharpen.md, {refs_dir}/contracts.md, {refs_dir}/taxonomy.md, and {refs_dir}/rubric-guide.md.
Dispatch a subagent with:
- 04-sharpen.md as instructions
- contracts.md, taxonomy.md, and rubric-guide.md
- run_dir, skill_dir, and max_iterations
HARD GATE: Verify {run_dir}/skill-current.md exists. If missing, stop with error.
Read {run_dir}/progress.jsonl. Extract the final score from the last entry. Compute delta from baseline.
Print: Phase 04 complete: score {baseline_score} -> {final_score} (delta {delta})
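Reading the final score and delta from progress.jsonl could be sketched as below. The key name `score` for progress entries is an assumption (the actual schema is defined in contracts.md), so it is exposed as a parameter; `final_score_and_delta` is a hypothetical helper name.

```python
import json


def final_score_and_delta(progress_text: str, baseline_score: float,
                          score_key: str = "score") -> tuple:
    """Return (final_score, delta), taking final_score from the last JSONL entry.

    `score_key` is an assumed field name, not confirmed by the skill text.
    """
    entries = [json.loads(line) for line in progress_text.splitlines() if line.strip()]
    if not entries:
        raise ValueError("progress.jsonl has no entries")
    final_score = entries[-1][score_key]
    return final_score, final_score - baseline_score
```

A positive delta means the sharpened skill scored above the baseline in simulation.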
Select 3-5 scenarios from {run_dir}/scenarios.jsonl for live validation:
Read the final {run_dir}/skill-current.md.
For each selected scenario:
- Run claude -p with the skill content prepended to the user_message
- Evaluate the output against expected_behaviors using the same binary check rubric
- Compare simulated_score vs real_score, compute delta
- Classify: |delta| <= 1 = OK, delta > 1 = SIMULATION-OPTIMISTIC, delta < -1 = SIMULATION-PESSIMISTIC
Write {run_dir}/spot-check.json conforming to the schema in references/contracts.md.
Print average divergence: Spot-check: avg divergence = {avg_delta}, {num_ok}/{total} checks within tolerance
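The spot-check classification can be illustrated as follows. Two points are assumptions rather than statements from the skill: delta is taken as simulated_score minus real_score (consistent with a positive delta being labelled optimistic), and the average divergence is the mean of the signed deltas. Both function names are hypothetical.

```python
def classify_divergence(simulated_score: float, real_score: float) -> tuple:
    """Return (delta, label) using the +/-1 tolerance from the spot-check rules."""
    delta = simulated_score - real_score  # assumption: simulated minus real
    if abs(delta) <= 1:
        return delta, "OK"
    if delta > 1:
        return delta, "SIMULATION-OPTIMISTIC"
    return delta, "SIMULATION-PESSIMISTIC"


def summarize_spot_checks(pairs: list) -> tuple:
    """Return (avg_delta, num_ok, total) for (simulated_score, real_score) pairs."""
    if not pairs:
        raise ValueError("no spot checks to summarize")
    results = [classify_divergence(sim, real) for sim, real in pairs]
    avg_delta = sum(delta for delta, _ in results) / len(results)
    num_ok = sum(1 for _, label in results if label == "OK")
    return avg_delta, num_ok, len(results)
```

Note that a signed mean lets optimistic and pessimistic checks cancel out; averaging `abs(delta)` instead would report the magnitude of divergence.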
Read {phases_dir}/05-report.md, {refs_dir}/contracts.md, and {refs_dir}/taxonomy.md.
Dispatch a subagent with:
- 05-report.md as instructions
- contracts.md and taxonomy.md
- run_dir and skill_dir
HARD GATE: Verify {run_dir}/report.md and {run_dir}/skill-sharpened.md both exist. If either is missing, stop with error.
After all phases complete, read {run_dir}/report.md and extract the top findings.
Print to user:
=== Battle Complete ===
Skill: {skill_name}
Score: {baseline_score} -> {final_score} ({delta})
Top findings:
1. {finding_1}
2. {finding_2}
...
Report: {run_dir}/report.md
Sharpened skill: {run_dir}/skill-sharpened.md
Diff: diff {run_dir}/skill-original.md {run_dir}/skill-sharpened.md
npx claudepluginhub waynewangyuxuan/skill-battlefield