From claude-commands
Runs a single benchmark script comparing three pair executors: pairv2 (LangGraph), pair via Claude Teams, and pair via direct Python. Useful for regression testing or performance comparisons across executor modes.
Slash command
/claude-commands:pair-benchmark-all-executors
Use one script to benchmark all pair executors:
pairv2 (.claude/pair/pair_execute_v2.py)
/pair via Claude Teams prompt path
/pair via direct Python (.claude/pair/pair_execute.py)
Script: python .claude/pair/benchmark_pair_executors.py (or venv/bin/python when using a virtualenv)
Run all three executors in one benchmark:
python .claude/pair/benchmark_pair_executors.py \
--executor-set all \
--benchmark-iterations 1 \
--timeout-seconds 180 \
--teams-command "claude --print" \
--artifact /tmp/{repo}/{branch}/pair/all_three.json
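When the benchmark is launched from another Python tool, the same invocation can be assembled as an argument list. The sketch below is only an illustration: the helper name build_benchmark_command and its parameters are hypothetical, and it assumes the {repo} and {branch} placeholders in the artifact path are filled in by the caller rather than expanded by the script.

```python
import shlex
import sys

# Script path taken from the command above.
SCRIPT = ".claude/pair/benchmark_pair_executors.py"


def build_benchmark_command(
    repo,
    branch,
    executor_set="all",
    iterations=1,
    timeout_seconds=180,
    teams_command="claude --print",
    artifact_name="all_three.json",
    python=sys.executable,
):
    """Return the benchmark invocation as an argv list (hypothetical helper)."""
    # Assumption: the caller substitutes {repo} and {branch} itself.
    artifact = f"/tmp/{repo}/{branch}/pair/{artifact_name}"
    return [
        python,
        SCRIPT,
        "--executor-set", executor_set,
        "--benchmark-iterations", str(iterations),
        "--timeout-seconds", str(timeout_seconds),
        "--teams-command", teams_command,
        "--artifact", artifact,
    ]


if __name__ == "__main__":
    # Print a shell-quoted form of the command for inspection.
    print(shlex.join(build_benchmark_command("my-repo", "main")))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root. Keeping the teams command as a single list element avoids shell-quoting problems with the embedded space.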
Run a small, named Amazon-style ecommerce preset:
python .claude/pair/benchmark_pair_executors.py \
--task-preset amazon_clone \
--pairv2-max-cycles 2 \
--timeout-seconds 1200 \
--artifact /tmp/{repo}/{branch}/pair/amazon_clone_benchmark.json
For historical benchmark prompts, including alternate Amazon-style variants, see
testing_llm/pair/benchmark_tasks.json.
Run all three in parallel:
python .claude/pair/benchmark_pair_executors.py \
--executor-set all \
--parallel \
--benchmark-iterations 1 \
--timeout-seconds 180 \
--teams-command "claude --print" \
--artifact /tmp/{repo}/{branch}/pair/all_three_parallel.json
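How --parallel is implemented inside the script is not described here. As a rough mental model only, running several executors concurrently and timing each one could look like the sketch below. The names timed_run and run_parallel are hypothetical, and the executors are stand-in callables rather than the real pair executors.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_run(name, fn):
    """Run one stand-in executor and record its status and wall time."""
    start = time.perf_counter()
    try:
        fn()
        status = "ok"
    except Exception:
        # A failing executor is recorded, not allowed to abort the others.
        status = "error"
    return name, {"status": status, "seconds": time.perf_counter() - start}


def run_parallel(executors):
    """Run all callables in `executors` (name -> callable) concurrently."""
    workers = max(1, len(executors))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(timed_run, n, f) for n, f in executors.items()]
        return dict(f.result() for f in futures)


if __name__ == "__main__":
    # Stand-ins that just sleep; real executors would invoke the pair scripts.
    demo = {
        "pair_python": lambda: time.sleep(0.2),
        "pair_teams": lambda: time.sleep(0.2),
        "pairv2": lambda: time.sleep(0.2),
    }
    for name, result in run_parallel(demo).items():
        print(name, result["status"], round(result["seconds"], 2))
```

In a parallel run the executors compete for CPU, network, and API rate limits, so per-executor timings may not be directly comparable with those from a sequential run.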
Executor sets:
--executor-set all: pair python + pair teams + pairv2
--executor-set legacy-vs-v2: pair python vs pairv2
--executor-set teams-vs-v2: pair teams vs pairv2
--executor-set pairv2-only: pairv2 only
Result keys:
pair_python, pair_python_status_counts, pair_python_raw
pair_claude_teams, pair_teams_status_counts, pair_teams_raw
pairv2_langgraph, pairv2_status_counts, pairv2_raw
speedup_ratio_pair_python_over_v2
speedup_ratio_pair_teams_over_v2
speedup_ratio_pair_python_over_teams
Defaults:
--executor-set: pairv2-only
--coder-cli: minimax
--verifier-cli: minimax
--task: hello world script + pytest test (default); overridden by --task-preset when provided
--timeout-seconds: 600
Run with no args to use these defaults:
venv/bin/python .claude/pair/benchmark_pair_executors.py
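The executor sets and the speedup_ratio_* keys can be pictured with a small sketch. Everything in it is an assumption made for illustration: the short executor names, the helper functions, and especially the reading of "X_over_Y" as the elapsed time of X divided by the elapsed time of Y. Check the script itself before relying on that definition.

```python
# Which executors each --executor-set value selects, per the list above.
# The short names on the right are illustrative labels, not script identifiers.
EXECUTOR_SETS = {
    "all": ["pair_python", "pair_teams", "pairv2"],
    "legacy-vs-v2": ["pair_python", "pairv2"],
    "teams-vs-v2": ["pair_teams", "pairv2"],
    "pairv2-only": ["pairv2"],
}

# Ratio key -> (numerator executor, denominator executor).
RATIO_PAIRS = {
    "speedup_ratio_pair_python_over_v2": ("pair_python", "pairv2"),
    "speedup_ratio_pair_teams_over_v2": ("pair_teams", "pairv2"),
    "speedup_ratio_pair_python_over_teams": ("pair_python", "pair_teams"),
}


def speedup_ratio(numerator_seconds, denominator_seconds):
    """Assumed definition: time of the first executor over time of the second."""
    if not numerator_seconds or not denominator_seconds:
        return None  # executor was not run, or its timing is unusable
    return numerator_seconds / denominator_seconds


def speedup_ratios(mean_seconds):
    """Compute every ratio that the available timings allow."""
    return {
        key: speedup_ratio(mean_seconds.get(a), mean_seconds.get(b))
        for key, (a, b) in RATIO_PAIRS.items()
    }


if __name__ == "__main__":
    # Made-up timings, only to show the shape of the result.
    print(speedup_ratios({"pair_python": 30.0, "pairv2": 10.0}))
```

Under this assumed definition, a value above 1 for speedup_ratio_pair_python_over_v2 would mean pairv2 finished faster than the direct Python executor.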
Notes:
pairv2 is LangGraph-based (.claude/pair/pair_execute_v2.py).
The artifact is written to /tmp/{repo}/{branch}/pair/<timestamp>/pair_executor_benchmark.json when --artifact is omitted.
Use --coder-cli and --verifier-cli to align model/runtime choices across runs.
minimax is not a CLI binary; it runs claude with the MiniMax API endpoint. See pairv2-usage.md for details.
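The default artifact location follows the pattern /tmp/{repo}/{branch}/pair/<timestamp>/pair_executor_benchmark.json. The sketch below shows one way such a path could be assembled. The helper name default_artifact_path is hypothetical, and the timestamp format is a guess, since the script's actual format is not documented here.

```python
from datetime import datetime, timezone


def default_artifact_path(repo, branch, now=None):
    """Build a path matching the documented default pattern (illustrative)."""
    now = now or datetime.now(timezone.utc)
    # Assumption: a compact UTC timestamp; the real script may format it differently.
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return f"/tmp/{repo}/{branch}/pair/{stamp}/pair_executor_benchmark.json"


if __name__ == "__main__":
    print(default_artifact_path("my-repo", "main"))
```

Because each run gets its own timestamped directory, repeated runs without --artifact should not overwrite one another. Passing --artifact explicitly gives a stable path that is easier to compare across runs.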