From agent-harness-kit
Run Mini SWE-bench-style harness regression tasks and A/B comparisons to measure harness improvement objectively.
Slash command
/agent-harness-kit:benchmark-suite
Use this when evaluating whether a harness change improved or regressed behavior.
node .harness/scripts/bench-runner.mjs --variant=current
node .harness/scripts/bench-runner.mjs --variant=candidate
node .harness/scripts/bench-compare.mjs
### Benchmark Suite
### Tasks: <n>
### Pass rate: <percent>
### Avg score: <score>
### A/B delta: <delta or n/a>
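As a rough illustration of how the summary fields above could be derived, here is a minimal sketch in JavaScript. It does not show how `bench-runner.mjs` or `bench-compare.mjs` actually work. The per-task record shape (`id`, `passed`, `score`), the function names `summarize`, `compare`, and `formatReport`, and the choice to define the A/B delta as candidate average score minus current average score are all assumptions made for the example.

```javascript
// Hypothetical per-task result shape: { id: string, passed: boolean, score: number }.
// The real output format of bench-runner.mjs is not documented here.

// Reduce a list of task results to the three headline numbers.
function summarize(results) {
  const tasks = results.length;
  if (tasks === 0) {
    return { tasks: 0, passRate: 0, avgScore: 0 };
  }
  const passed = results.filter((r) => r.passed).length;
  const totalScore = results.reduce((sum, r) => sum + r.score, 0);
  return {
    tasks,
    passRate: (passed / tasks) * 100, // percentage of tasks that passed
    avgScore: totalScore / tasks,
  };
}

// Assumed definition of the A/B delta: candidate avg score minus current avg score.
// Returns "n/a" when either side is missing or empty.
function compare(current, candidate) {
  if (!current || !candidate || current.tasks === 0 || candidate.tasks === 0) {
    return "n/a";
  }
  return candidate.avgScore - current.avgScore;
}

// Render the summary in the report layout shown above.
function formatReport(summary, delta) {
  const deltaText =
    typeof delta === "number" ? (delta >= 0 ? "+" : "") + delta.toFixed(2) : "n/a";
  return [
    "### Benchmark Suite",
    `### Tasks: ${summary.tasks}`,
    `### Pass rate: ${summary.passRate.toFixed(1)}%`,
    `### Avg score: ${summary.avgScore.toFixed(2)}`,
    `### A/B delta: ${deltaText}`,
  ].join("\n");
}
```

Keeping the summary step separate from the comparison step means a single-variant run can still produce a report, with the delta line falling back to `n/a`.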
npx claudepluginhub tuanle96/agent-harness-kit --plugin agent-harness-kit