From just-useful-plugin
Use when measuring agent task performance in a codebase, evaluating environment setup quality for AI agents, benchmarking agent resource efficiency, or running A/B comparisons of documentation/context configurations
Slash command: /just-useful-plugin:agent-benchmark
Measure how well a codebase environment supports AI agent task performance. Compares resource efficiency (tokens, time, backtracking) across environment configurations using A/B benchmarking.
Use isolation: "worktree" on every Agent dispatch (basic mode) or pre-configured worktrees (A/B mode). Never let agents modify the actual repo.

Phase 1: Repo Analysis        Phase 2: Agent Execution       Phase 3: Report
& Task Generation             & Hook Capture                 & Cleanup
─────────────────────         ─────────────────────          ─────────────────
[1] Scan repo structure       [4] Setup hooks (JSONL log)    [7] Parse logs (by session_id)
[2] Extract code elements     [5] Run agents in parallel     [8] Calculate metrics
[3] Generate dynamic tasks    [6] Capture tool calls         [9] Generate report
                                                             [10] Cleanup
Use AskUserQuestion at skill start to confirm execution mode and options. Skip items the user has already specified.
Items to confirm:
AskUserQuestion (adapt to user's language): "Please choose a benchmark mode:
1. Basic — single benchmark run on current environment
2. A/B — compare two environment configurations
Any preference on task count or focus category? (Default: 1 per category, 4–8 total)"
Collect the structural fingerprint of the target repository.
# Directory structure (capped at the first 200 files)
find . -type f -not -path './.git/*' | head -200
# Documentation listing
ls -la docs/ CLAUDE.md AGENTS.md README.md 2>/dev/null
# Recent change history
git log --oneline -20
git log --oneline --diff-filter=ADR --name-status --since="4 weeks ago"
Use Glob to map file patterns: **/*.ts, **/*.py, **/*.md, etc.
Extract key code elements that will seed task generation:
| Element | How to Extract | Purpose |
|---|---|---|
| Entry points | Glob for main.*, index.*, app.*, server.* | Navigation tasks |
| Modules/packages | Glob for directory patterns with __init__.py, package.json | Architecture tasks |
| Import/export graph | Grep for import, require, export | Dependency tasks |
| Error patterns | Grep for throw, raise, catch, Error, custom error classes | Debugging tasks |
| Config files | Glob for *.config.*, *.json, *.yaml, *.toml | Setup tasks |
| Test files | Glob for *test*, *spec*, __tests__/ | Verification tasks |
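
The file-name rows of the table above can be sketched as a small classifier. This is only an illustration, assuming a list of repo-relative paths has already been collected; `ELEMENT_PATTERNS` and `classify_paths` are hypothetical names, not part of the skill, and the Grep-based rows (imports, error patterns) are not covered.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical pattern table mirroring the Glob rows above (illustrative only).
ELEMENT_PATTERNS = {
    "entry_point": ["main.*", "index.*", "app.*", "server.*"],
    "module_marker": ["__init__.py", "package.json"],
    "config": ["*.config.*", "*.json", "*.yaml", "*.toml"],
    "test": ["*test*", "*spec*"],
}


def classify_paths(paths):
    """Group repo-relative paths by the element kinds their file names match."""
    found = {kind: [] for kind in ELEMENT_PATTERNS}
    for path in paths:
        parsed = PurePosixPath(path)
        # Files under a __tests__/ directory count as tests regardless of name.
        in_tests_dir = "__tests__" in parsed.parts
        for kind, patterns in ELEMENT_PATTERNS.items():
            name_matches = any(fnmatchcase(parsed.name, p) for p in patterns)
            if name_matches or (kind == "test" and in_tests_dir):
                found[kind].append(path)
    return found
```

A path can land in more than one bucket (for example, `package.json` is both a module marker and a config file), which suits seeding several task categories from the same scan.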
Generate 4–8 tasks per repo using the extracted code elements. See references/task-templates.md for task category definitions and templates.
Task categories (minimum 1 per category):
| Category | Example | Measures |
|---|---|---|
| Discovery | "Find the entry point for the API server" | Token/time efficiency for navigation |
| Comprehension | "List all modules that depend on the auth package" | Token/time efficiency for understanding |
| Diagnosis | "Find where ValidationError is thrown and trace its handler" | Token/time efficiency for search |
| Modification | "Add a new field to the User model" (in worktree) | Token/time efficiency for end-to-end task |
Each generated task must include:
Expected files determination: The benchmark runner solves the task first (or uses repo analysis results) to establish the relevant files list. This is used to calculate Backtrack Rate (unique files accessed vs total file accesses).
After Phase 1, show generated tasks to the user and use AskUserQuestion to confirm before proceeding:
AskUserQuestion (adapt to user's language): "[Phase 1 complete] N tasks generated:
1. [Discovery] Find the entry point for the API server
2. [Comprehension] List all modules that depend on auth
3. [Diagnosis] Trace where ValidationError is thrown
4. [Modification] Add a new field to the User model
Proceed with benchmark? Let me know if you want to modify/add/remove any tasks."
Proceed to Phase 2 only after user approval.
Create a temporary JSONL log file for capturing all tool calls:
/tmp/agent-benchmark-{TIMESTAMP}.jsonl
Hook installation via settings.local.json:
Claude Code hooks receive JSON on stdin containing tool information. The tool_name field identifies which tool was called. Merge the hooks config into the project's existing .claude/settings.local.json (read the file first, add the hooks key, write back — do not overwrite other keys like permissions or enabledPlugins):
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "jq -c '{phase: \"pre\", tool: .tool_name, file_path: (.tool_input.file_path // .tool_input.path // \"\"), session_id: .session_id, timestamp: (now | strftime(\"%Y-%m-%dT%H:%M:%SZ\"))}' >> /tmp/agent-benchmark-TIMESTAMP.jsonl"
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "jq -c '{phase: \"post\", tool: .tool_name, session_id: .session_id, timestamp: (now | strftime(\"%Y-%m-%dT%H:%M:%SZ\"))}' >> /tmp/agent-benchmark-TIMESTAMP.jsonl"
          }
        ]
      }
    ]
  }
}
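
The read, merge, and write-back step for settings.local.json can be sketched as follows. This is a minimal sketch, not the skill's own implementation: `merge_hooks` and `remove_hooks` are hypothetical helper names, and the code assumes the file has no pre-existing `hooks` key worth keeping (an existing one would be overwritten and later dropped).

```python
import json
from pathlib import Path


def merge_hooks(settings_path, hooks_config):
    """Add a "hooks" key to a settings file while keeping every other key."""
    path = Path(settings_path)
    settings = json.loads(path.read_text()) if path.exists() else {}
    # ASSUMPTION: any existing "hooks" value is replaced, not merged.
    settings["hooks"] = hooks_config
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings


def remove_hooks(settings_path):
    """Drop the "hooks" key after the run, leaving other keys untouched."""
    path = Path(settings_path)
    settings = json.loads(path.read_text())
    settings.pop("hooks", None)
    path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings
```

Calling `merge_hooks` before Phase 2 and `remove_hooks` during cleanup keeps keys such as permissions or enabledPlugins intact across the run.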
Important:
- Replace TIMESTAMP in the file path with the actual benchmark run timestamp before writing the settings file.
- After the benchmark, remove the hooks key from settings.local.json to restore normal operation (preserve other keys).
- Use session_id to correlate log entries to specific tasks: each subagent runs in its own session, so the benchmark runner records each subagent's session_id and maps it to a task during Phase 3 log parsing.

Stdin JSON structure (provided by Claude Code):
- session_id: Session identifier (unique per subagent; used to correlate logs to tasks)
- tool_name: Tool identifier (e.g., "Read", "Grep", "Bash", "Glob", "Edit", "Write", "Agent")
- tool_input: Tool parameters object (contains file_path, pattern, command, etc. depending on tool)
- tool_use_id: Unique tool call identifier
- tool_response: (PostToolUse only) Result returned by the tool

Each log entry records:
{
  "phase": "pre|post",
  "tool": "Grep|Read|Bash|...",
  "file_path": "/path/to/file",
  "session_id": "session-abc123",
  "timestamp": "ISO-8601"
}
Note: Each subagent dispatch gets a unique session_id. The benchmark runner records the mapping of session_id → task_id from each Agent dispatch, then uses this mapping to correlate JSONL log entries to tasks during Phase 3 log parsing. This enables parallel task execution without log entry ambiguity.
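
As an illustration of the session_id → task_id correlation described above, the following sketch buckets interleaved log lines by task. The function name `group_by_task` and the shape of the mapping (a plain dict) are assumptions made for the example.

```python
import json
from collections import defaultdict


def group_by_task(jsonl_lines, session_to_task):
    """Bucket hook log entries by task, using the recorded session_id mapping."""
    by_task = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        entry = json.loads(line)
        task_id = session_to_task.get(entry.get("session_id"))
        if task_id is not None:  # ignore sessions that are not benchmark tasks
            by_task[task_id].append(entry)
    return dict(by_task)
```

Because grouping depends only on session_id, entries written concurrently by different subagents can be interleaved in the file without affecting the result.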
Parallel execution: dispatch all tasks concurrently as subagents for maximum speed.
- Always use isolation: "worktree" on the Agent tool, even for a single task. This creates a temporary git worktree automatically; do not use manual git worktree add.
- Use the Agent tool with isolation: "worktree" for every task.
- Record the agentId (= session_id) for each task to correlate logs in Phase 3.

# Dispatch ALL tasks in parallel in a single message
# Each gets its own isolated worktree — no conflicts
Agent(prompt="<task 1>", isolation="worktree", description="benchmark task 1")
Agent(prompt="<task 2>", isolation="worktree", description="benchmark task 2")
Agent(prompt="<task 3>", isolation="worktree", description="benchmark task 3")
Agent(prompt="<task 4>", isolation="worktree", description="benchmark task 4")
# ... all in one message block
Main Agent
───────────
[Generate all tasks]
    │
    ├──dispatch task 1 (isolation: "worktree")──→ Subagent 1 (worktree-1)
    ├──dispatch task 2 (isolation: "worktree")──→ Subagent 2 (worktree-2)
    ├──dispatch task 3 (isolation: "worktree")──→ Subagent 3 (worktree-3)
    ├──dispatch task 4 (isolation: "worktree")──→ Subagent 4 (worktree-4)
    │                                              (all running concurrently)
    │
    ├── all complete ← collect results + agentIds
    │
[Map agentId → task for log correlation]
Session-to-task mapping: After all agents complete, record the agentId returned by each Agent dispatch. This agentId corresponds to the session_id in JSONL log entries, enabling accurate per-task log correlation even with concurrent execution.
For comparing two environment configurations (e.g., with vs without CLAUDE.md, different docs structures):
- Create worktrees manually (git worktree add): one per condition, then apply configuration differences (e.g., add/remove CLAUDE.md). A/B mode requires manual worktree setup because each condition needs different file modifications applied before agent execution.
- Subagents do not use isolation: "worktree" in A/B mode; they run directly in the pre-configured worktree paths instead.
- Use session_id to correlate entries to tasks.

                   ┌── Task 1-A (worktree-a) ──→ log-a.jsonl
                   ├── Task 2-A (worktree-a) ──→ log-a.jsonl
All dispatched     ├── Task 1-B (worktree-b) ──→ log-b.jsonl
concurrently ──────├── Task 2-B (worktree-b) ──→ log-b.jsonl
                   ├── ...
                   └── Task N-B (worktree-b) ──→ log-b.jsonl
                                │
                      All complete → Compare
                                │
                       Comparison Report
Note: Tasks within the same worktree run concurrently. Discovery, Comprehension, and Diagnosis tasks only read files, so concurrent access is safe. Modification tasks do write files, but each subagent operates on different files as determined by its task, which minimizes conflict risk.
Read the JSONL log file(s) and extract per-task data using the session_id → task mapping recorded during Phase 2:
- Group log entries by session_id, then map each group to its task using the recorded mapping

Calculate metrics as defined in references/metrics.md:
- Time: last_tool_call_timestamp - first_tool_call_timestamp
- Backtrack Rate: (S - N) / S

Only successful tasks are included in summary statistics.
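
The two formulas above can be sketched against the log entry format produced by the hooks. The authoritative definitions live in references/metrics.md; this sketch assumes that S is the total number of file accesses and N the number of unique files accessed (following the "unique files accessed vs total file accesses" description earlier), and the function names are hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # matches the strftime pattern used by the hooks


def elapsed_seconds(entries):
    """Last tool-call timestamp minus first tool-call timestamp, in seconds."""
    stamps = [datetime.strptime(e["timestamp"], TS_FORMAT) for e in entries]
    return (max(stamps) - min(stamps)).total_seconds()


def backtrack_rate(entries):
    """(S - N) / S, ASSUMING S = total file accesses and N = unique files."""
    accessed = [
        e["file_path"]
        for e in entries
        if e.get("phase") == "pre" and e.get("file_path")
    ]
    if not accessed:
        return 0.0  # ASSUMPTION: no file accesses means no backtracking
    return (len(accessed) - len(set(accessed))) / len(accessed)
```

Only "pre" entries are counted for file accesses because the PostToolUse hook shown earlier does not record file_path.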
For each successful task present in both conditions, compute per-metric ratios:
token_ratio = tokens_A / tokens_B
time_ratio = time_A / time_B
backtrack_ratio = backtrack_A / backtrack_B
Compute mean ratios across tasks for the summary. See references/metrics.md §3.2 for edge cases.
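
A minimal sketch of the ratio summary, assuming per-task metrics have already been computed and filtered to successful tasks. The dictionary layout and the name `mean_ratios` are assumptions for illustration, and skipping zero denominators is a placeholder choice, not the edge-case rule from references/metrics.md §3.2.

```python
from statistics import mean


def mean_ratios(metrics_a, metrics_b):
    """Average A/B ratios per metric over tasks present in both conditions.

    Both arguments map task_id -> {"tokens": ..., "time": ..., "backtrack": ...}
    and are expected to contain successful tasks only.
    """
    shared = sorted(set(metrics_a) & set(metrics_b))
    summary = {}
    for metric in ("tokens", "time", "backtrack"):
        ratios = [
            metrics_a[t][metric] / metrics_b[t][metric]
            for t in shared
            if metrics_b[t][metric] != 0  # ASSUMPTION: skip zero denominators
        ]
        summary[f"{metric}_ratio"] = mean(ratios) if ratios else None
    return summary
```

A mean ratio below 1 indicates that condition A used less of that resource than condition B on average.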
Output a text status update when Phase 2 finishes (adapt to user's language):
[Phase 2 complete] N tasks finished. Starting log parsing and report generation.
Generate the report in the terminal using the format defined in references/report-format.md.
Report sections:
After displaying the report, use AskUserQuestion to check for follow-up actions:
AskUserQuestion (adapt to user's language): "Would you like to export the report as JSON or Markdown? (Say 'no' if not needed)"
Export formats (only when user requests):
After report generation:
- isolation: "worktree" auto-cleans worktrees with no changes. For Modification tasks where files were changed, the Agent tool returns the worktree path; run git worktree remove <path> to clean up.
- In A/B mode, run git worktree remove for both condition worktrees.
- Log files remain in /tmp/ (user can delete manually).

Tools used by this skill:
- AskUserQuestion: Mode selection, task confirmation, report export (any point requiring a user decision)
- Glob: Repo structure scan, file pattern search, test/config file discovery
- Grep: Import/export graph collection, error pattern search, code element extraction
- Read: File content verification, JSONL log file reading, task answer verification
- Bash: git log for change history, hooks setup/teardown, worktree creation and cleanup, log file management
- Agent: Benchmark task execution subagents, repo analysis assistance

| Thought | Reality |
|---|---|
| "Repo is too small to benchmark" | Small repos can still vary in environment setup quality |
| "One task is enough" | Minimum 1 per category, 4 total required |
| "Agent can self-report tool calls" | Self-reporting can miss calls. Use hooks for external capture |
| "Run modification tasks directly on the repo" | Worktree isolation is mandatory. Never pollute the actual codebase |
| "Only one task, no need for isolation" | Even single tasks require isolation: "worktree". Modifications can pollute main |
| "Low score means the model is bad" | This benchmark measures environment quality, not model performance |
| "Estimate without hook logs" | Estimation is not benchmarking. Accurate data collection is the point |
| "Run tasks sequentially" | All tasks are independent. Use parallel dispatch to save time |

npx claudepluginhub identity16/just-useful-plugin --plugin just-useful-plugin