From agent-harness-kit
Run Mini SWE-bench-style harness regression tasks and A/B comparisons to measure harness improvement objectively.
Slash command: /agent-harness-kit:SKILL
templates/.claude/skills/benchmark-suite/
# Benchmark Suite

Use this when evaluating whether a harness change improved or regressed behavior.

## Commands
node .harness/scripts/bench-runner.mjs --variant=current
node .harness/scripts/bench-runner.mjs --variant=candidate
node .harness/scripts/bench-compare.mjs
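
The three commands above are meant to run in order: baseline, candidate, then comparison. A minimal sketch of a wrapper that enforces that order and stops at the first failure might look like the following. The wrapper, the names `STEPS`, `nodeExec`, and `runBenchmark`, and the idea that the scripts signal failure through a non-zero exit code are all assumptions for illustration; only the three script invocations come from the skill.

```javascript
// Only these three invocations come from the skill; the wrapper itself is hypothetical.
const STEPS = [
  ['.harness/scripts/bench-runner.mjs', '--variant=current'],
  ['.harness/scripts/bench-runner.mjs', '--variant=candidate'],
  ['.harness/scripts/bench-compare.mjs'],
];

// Default executor (CommonJS): runs `node <args>` and returns the exit code.
// child_process is required lazily so callers that inject their own executor never load it.
function nodeExec(args) {
  const { spawnSync } = require('node:child_process');
  const result = spawnSync(process.execPath, args, { stdio: 'inherit' });
  return result.status ?? 1; // status is null when the child was killed by a signal
}

// Runs the steps in order and stops at the first non-zero exit code.
// Assumption: the benchmark scripts report failure via their exit code.
function runBenchmark(exec = nodeExec) {
  for (const args of STEPS) {
    const code = exec(args);
    if (code !== 0) {
      return { ok: false, failedStep: args.join(' '), code };
    }
  }
  return { ok: true };
}
```

Calling `runBenchmark()` from the repository root would invoke the real scripts. Passing a stub executor instead lets the sequencing be checked without them.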
## Output contract

### Benchmark Suite
### Tasks: <n>
### Pass rate: <percent>
### Avg score: <score>
### A/B delta: <delta or n/a>
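
To make the contract concrete, here is a rough sketch of how the five lines could be produced from per-task results. Everything about the input is an assumption: the result shape `{ passed, score }`, the function names `summarize` and `formatReport`, the number formatting, and the choice to report the A/B delta as the difference in average score (candidate minus current). The actual scripts may compute and format these values differently.

```javascript
// Hypothetical result shape: [{ passed: boolean, score: number }, ...].
function summarize(results) {
  const tasks = results.length;
  const passed = results.filter((r) => r.passed).length;
  const total = results.reduce((sum, r) => sum + r.score, 0);
  return {
    tasks,
    passRate: tasks ? (passed / tasks) * 100 : 0,
    avgScore: tasks ? total / tasks : 0,
  };
}

// Renders the contract lines for the `current` run.
// `candidate` is optional; without it the delta is reported as "n/a".
function formatReport(current, candidate) {
  const cur = summarize(current);
  let delta = 'n/a';
  if (candidate) {
    // Assumption: delta = candidate average score minus current average score.
    const d = summarize(candidate).avgScore - cur.avgScore;
    delta = (d >= 0 ? '+' : '') + d.toFixed(2);
  }
  return [
    '### Benchmark Suite',
    `### Tasks: ${cur.tasks}`,
    `### Pass rate: ${cur.passRate.toFixed(1)}%`,
    `### Avg score: ${cur.avgScore.toFixed(2)}`,
    `### A/B delta: ${delta}`,
  ].join('\n');
}
```

Falling back to `n/a` when no candidate run is supplied mirrors the `<delta or n/a>` placeholder in the contract.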
npx claudepluginhub tuanle96/agent-harness-kit --plugin agent-harness-kit