From claude-code-stack
Run monthly to evaluate how each subagent has actually performed over the last 30 days. Reads subagent_runs table, identifies patterns of failure/success, proposes prompt refinements or model changes. Distinct from /model-audit (which looks outside at benchmarks) — this looks inside at actual outcomes. Targets the assumption that current configuration is still optimal.
How this skill is triggered — by the user, by Claude, or both
Slash command
/claude-code-stack:agent-performance-reviewThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Look at what subagents actually did. Find what's not working.
Look at what subagents actually did. Find what's not working.
Source: ~/.claude/logs/subagent-runs.jsonl (append-only, written by the PreToolUse Agent hook ~/.claude/hooks/subagent-log.sh). Each row has ts, agent, desc, model, project, session_start.
CUTOFF=$(date -u -v-30d +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)
jq -r --arg c "$CUTOFF" 'select(.ts >= $c) | .agent' ~/.claude/logs/subagent-runs.jsonl \
| sort | uniq -c | sort -rn
Group by agent (optionally filter .project == "<path>" for per-repo view). For each: invocation count, model mix, sample of desc strings, most recent run.
Not yet in log (skip the corresponding Step 2 checks until the log schema grows): success/failure outcome, cost, wall time, escalation events. Today this skill produces a usage + benching picture only; revisit when outcome fields are added.
For each subagent:
Cross-cutting:
main-thread, agent-teams, and hybrid modes. If one mode is meaningfully outperforming for a given task class, surface it. If agent-teams is consistently underperforming main-thread for a task class, propose restricting it. This is how the stack learns whether the experimental orchestration is actually worth using.For each issue, propose:
docs/agent-perf/<YYYY-MM-DD>.md:
# Agent performance review: <date>
## At a glance
- Total invocations: <N>
- Total cost: $<X>
- Average success rate: <%>
- Top subagent by usage: <name> (<N> invocations)
- Top subagent by cost: <name> ($<X>)
- Top failure cause: <pattern>
## Per-subagent
### architect
- Invocations: <N> | Success: <%> | Avg cost: $<X> | p95 wall: <Y>min
- Notable: <pattern>
- Proposal: <none / refinement / model change>
### implementer
...
## Cross-cutting patterns
- <pattern>: observed in <N> sessions, suggested action: <...>
## Proposals summary
- <change>: <evidence> — apply? Y/N
After approval: librarian implements the prompt refinements (PRs to claude-code-stack).
npx claudepluginhub bschonbrun/claude-code-stack --plugin claude-code-stackSearches MemPalace before answering questions about past work, people, projects, or prior decisions. Returns verbatim stored content instead of guessing from model memory.
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Implements vector databases with Pinecone, Weaviate, Qdrant, Milvus, pgvector for semantic search, RAG, recommendations, and similarity systems. Optimizes embeddings, indexing, and hybrid search.