Skill

agent-review-loop

Full multi-model codebase review. Dispatches Codex, Gemini, and Copilot in parallel to find bugs, security issues, tech debt, and gaps. Deduplicates findings via Gemini, classifies into structured backlogs under .planning/, and auto-routes findings to agent-remediate-loop by severity tier. Outputs review artifacts to .planning/reviews/YYYYMMDD-<agent>-findings.md. Trigger: "review codebase", "run triage", "full review", "find issues", "/agent-review-loop".

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/buymeagoat-skills:agent-review-loop

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are running the **agent-review-loop** pipeline.

SKILL.md

475 lines · ~3.8k tokens

Stats

Stars: 0

Maintenance: Good

Last Commit: May 14, 2026


Agent Roles

Agent	Role in this skill
Claude Code	Orchestrates — builds prompts, routes results, classifies findings, writes state
Codex	Line-level review — bugs, logic errors, dead code, null checks, type mismatches
Gemini	Cross-file review — architecture, schema drift, system-level security, data flow
Copilot	Security + quality review — antipatterns, dependency hygiene, test gaps, API design

Claude Code owns all state mutations (backlogs, active-context). Reviewers only produce output.

LLM Handoff Optimization

Codex

Codex reviews best when given explicit read authorization and a strict output contract. Key directives to include:

"you are authorized to inspect source files broadly" (prevents false refusals)
"write exactly one review artifact to <path>" (prevents stdout dumping)
"do not edit application code" (prevents accidental mutations)
Output format: one finding per line, **[SEVERITY]** `file:line` — **[CATEGORY]** — Description. Suggested fix.
Use a git worktree for isolation: prevents any accidental writes to working tree

Codex failure modes:

Missing planned-work preamble → flags in-progress items as bugs
No output path specified → dumps to stdout, lost
Large repo without file scope → times out; re-dispatch with explicit file list

Gemini

Gemini's edge is large-context cross-file reasoning. Prime it to use that advantage:

Open with explicit read directive: "read all .py, .ts, .tsx, .json, .yaml files recursively"
Instruct it NOT to duplicate line-level findings (Codex handles those)
Request JSON array output — structured output reduces parse failures
Use timeout_seconds=1200 for large repos
Do NOT specify model — MCP bridge picks best available

Fallback chain if MCP fails: gemini -m gemini-2.5-pro → gemini-2.5-flash → gemini-1.5-pro

Gemini failure modes:

MODEL_CAPACITY_EXHAUSTED — use fallback chain, never block the pipeline
Prose output instead of JSON — parse best-effort, log degraded mode

Copilot

Copilot adds a confidence dimension other reviewers lack. Use it:

Request "confidence": "high|medium|low" on every finding
Request "fingerprint" field (slug of file:line:category) for cross-run dedup
NO_FINDINGS sentinel means clean — write that to file, don't treat as failure
Copilot returns empty on large unscoped repos — scope to high-risk paths if needed
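A minimal sketch of how such a fingerprint slug could be derived, assuming the `file:line:category` shape described above. The `fingerprint` helper and its normalization rules are illustrative assumptions, not part of the skill itself.

```python
import re


def fingerprint(file: str, line: int, category: str) -> str:
    """Build a stable slug from file, line, and category (hypothetical helper)."""
    raw = f"{file}:{line}:{category.lower()}"
    # Collapse whitespace and other unusual characters into a single dash
    # so the slug is safe to grep for in a backlog file.
    return re.sub(r"[^A-Za-z0-9./:_-]+", "-", raw).strip("-")
```

Because the slug depends only on location and category, the same defect reported in two runs maps to the same string, as long as the line number has not shifted.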

Gemini (dedup pass)

After all three reviews complete, use Gemini again for cross-source deduplication. This is Gemini's highest-value use in this pipeline — large context lets it read all three review files simultaneously and merge findings without truncation. Claude Code doing this inline would burn context; Gemini does it in one call.

Claude Code (classifier)

After dedup, Claude Code classifies and writes to backlogs incrementally. Write each finding immediately after classifying — do not batch. Batching risks losing state if the session is interrupted.

Phase 0 — Pre-flight

REPO=$(basename "$(git rev-parse --show-toplevel)")
DATE=$(date +%Y%m%d)
mkdir -p .planning/reviews .planning/archive
# -s is true only when the file exists and is non-empty
[ -s .claude/model-routing.md ] && echo "routing: ok" || echo "ERROR: .claude/model-routing.md missing or empty"
echo "REPO=$REPO DATE=$DATE"

If .claude/model-routing.md is missing or empty: stop and create it first. Default route for unknown task types: Gemini.

Extract planned-work preamble to prevent false positives on in-progress items:

grep -A 30 "## Next Phase\|## In Progress\|## Planned" active-context.md 2>/dev/null | head -40

Store as $PLANNED_WORK. If nothing found: "No planned-work context available."
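For readers who prefer to do the extraction outside the shell, the grep above can be approximated as follows. This is a sketch under assumptions: `planned_work` is a hypothetical helper, and unlike `grep -A` it does not emit `--` separators between non-adjacent groups.

```python
import re

# Same three headings the grep pattern looks for.
HEADINGS = re.compile(r"## (Next Phase|In Progress|Planned)")


def planned_work(text: str, after: int = 30, limit: int = 40) -> str:
    """Return each matching heading plus up to `after` following lines, capped at `limit`."""
    lines = text.splitlines()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if HEADINGS.search(line):
            keep.update(range(i, min(i + after + 1, len(lines))))
    selected = [lines[i] for i in sorted(keep)][:limit]
    return "\n".join(selected) if selected else "No planned-work context available."
```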

Announce: agent-review-loop start — repo: $REPO — date: $DATE

Phase 1 — Craft Prompts

Build all three prompts. After building, verify DATE and REPO are substituted — print resolved Codex output path:

Codex output path: .planning/reviews/$DATE-codex-findings.md

If path contains literal DATE or REPO: stop and fix substitution before proceeding.
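The substitution check can be made mechanical. The sketch below assumes prompts carry `$DATE`-style variables and the bracketed `[SEVERITY RUBRIC]` / `[EXCLUDE PATTERNS]` markers used later in this document; `resolve` is a hypothetical helper name.

```python
import re
from string import Template


def resolve(template: str, values: dict[str, str]) -> str:
    """Substitute $NAME placeholders; raise if anything is left unresolved."""
    # Template.substitute raises KeyError when a placeholder has no value.
    resolved = Template(template).substitute(values)
    # Catch bracketed markers that were never replaced with the shared blocks.
    leftover = re.findall(r"\[(?:SEVERITY RUBRIC|EXCLUDE PATTERNS)\]", resolved)
    if leftover:
        raise ValueError(f"unresolved placeholders: {leftover}")
    return resolved
```

Note that `string.Template` also rejects a stray `$` that is not followed by an identifier, so a prompt containing shell syntax such as `$(...)` would need those dollar signs escaped as `$$`.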

Shared Severity Rubric (inject verbatim into all three prompts)

SEVERITY definitions:
CRITICAL: data loss, auth bypass, crash, security breach
HIGH: broken feature, wrong output, missing validation at boundary
MEDIUM: degraded behavior, performance regression, missing error handling
LOW: code quality, style, dead code, minor UX friction

Shared Exclude Patterns (inject verbatim into all three prompts)

Exclude: .planning/archive/, keep/, node_modules/, __pycache__/, dist/, build/,
*.lock, uploads/, generated assets, .claude/, AGENTS.md, CLAUDE.md, GEMINI.md

Codex Prompt

For this review task only, you are explicitly authorized to inspect source files broadly
and write exactly one review artifact under .planning/reviews/. Do not edit application
code, backlog files, active-context.md, or any other docs.

Scope: source code, config, tests, API boundaries, security-sensitive paths, persistence,
auth/session logic, migrations, scripts, frontend user flows, integration calls.

[SEVERITY RUBRIC]

[EXCLUDE PATTERNS]

The following are intentional, planned, or in-progress — do not flag as issues:
$PLANNED_WORK

Findings must be:
- Actionable and tied to an existing file and line number.
- Reproducible or strongly evidenced — include a concrete fix.
- If no line-specific evidence exists, omit the finding.
- Speculative architecture opinions: LOW only, or omit.
- Prefer fewer high-signal findings over exhaustive noise.

Output format — one finding per line, no prose, no headers, no summaries:
**[SEVERITY]** `file:line` — **[CATEGORY]** — Description. Suggested fix.

CATEGORY options: BUG | SECURITY | TECH-DEBT | FEATURE-GAP | UX | PERFORMANCE | ARCHITECTURE | DEPENDENCY | TEST-GAP

Write ALL findings to: .planning/reviews/$DATE-codex-findings.md
Confirm: absolute path + finding count only. Do not return findings in your response.
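The one-finding-per-line contract makes the Codex artifact easy to parse. The following is a sketch, assuming severities and categories appear in bold with optional square brackets; the `parse_codex_line` name and the returned dictionary shape are illustrative, chosen to line up with the JSON fields the other reviewers return.

```python
import re

# \u2014 is the long dash used as the field separator in the output format.
FINDING = re.compile(
    r"^\*\*\[?(?P<severity>CRITICAL|HIGH|MEDIUM|LOW)\]?\*\*\s+"
    r"`(?P<file>[^`]+):(?P<line>\d+)`\s+\u2014\s+"
    r"(?:\*\*)?\[?(?P<category>[A-Z-]+)\]?(?:\*\*)?\s+\u2014\s+"
    r"(?P<text>.+)$"
)


def parse_codex_line(line: str):
    """Return a finding dict, or None if the line does not follow the contract."""
    m = FINDING.match(line.strip())
    if m is None:
        return None
    return {
        "severity": m["severity"],
        "category": m["category"],
        "file": m["file"],
        "line": int(m["line"]),
        "issue": m["text"],
        "source": "codex",
    }
```

Lines that return `None` are candidates for the quality filter rather than hard errors.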

Gemini Prompt

You have access to filesystem tools. Use them to read all source files before beginning.
Do not skip files due to length or count.

Repository root: <project_root>
Read all .py, .ts, .tsx, .json, .yaml, .toml files recursively. Do not skip config files.

[SEVERITY RUBRIC]

[EXCLUDE PATTERNS]

Exhaustive cross-file architectural audit.

Your primary value is cross-file reasoning that file-by-file tools miss. Prioritize:
- Cross-file inconsistencies: schema/contract drift between modules
- Data flow bugs spanning multiple files
- Architecture violations and coherence failures
- Security vulnerabilities at system level (auth flow, session handling, data exposure)
- Missing abstractions or boundary violations

Also cover: dependency hygiene, performance at architectural level, API surface design.
Do NOT duplicate line-level bug findings — Codex handles those.

Return findings as a JSON array. Each object:
{"severity": "", "category": "", "file": "", "line": 0, "issue": "", "fix": ""}

If no findings: return []
No prose, no section headers, no summaries outside the JSON array.

Copilot Prompt

Code security and quality review.

Scope: security antipatterns, dependency hygiene, test coverage gaps, API design issues,
frontend component quality, error handling exposed to users, external service calls,
webhook handlers, ancillary services.

[SEVERITY RUBRIC]

[EXCLUDE PATTERNS]

Return findings as a JSON array. Each object:
{
  "severity": "",
  "category": "",
  "file": "",
  "line": 0,
  "issue": "",
  "fix": "",
  "confidence": "high|medium|low",
  "fingerprint": ""
}

fingerprint: short slug of "file:line:category:brief-issue" — used for cross-run dedup.

If no findings: return the exact token NO_FINDINGS
No prose, no section headers, no summaries outside the JSON array.

Phase 2 — Parallel Dispatch

Before dispatch:

Verify no literal DATE or REPO remain in prompts.
Pre-create output directory: mkdir -p .planning/reviews
Set up Codex worktree:

git worktree add /tmp/agent-review-codex HEAD 2>/dev/null && echo "worktree: ok" || echo "worktree exists"

In a single response turn, dispatch all three simultaneously:

1. Codex — Bash, timeout: 300000:

cd /tmp/agent-review-codex && codex exec --dangerously-bypass-approvals-and-sandbox - <<'EOF'
<codex prompt with all substitutions applied>
EOF

2. Gemini — MCP bridge (same turn):

mcp__agent-bridge__call_gemini(prompt="<gemini prompt>", cwd="<project_root>", timeout_seconds=1200)

3. Copilot — MCP bridge (same turn):

mcp__agent-bridge__call_copilot(prompt="<copilot prompt>", cwd="<project_root>")

Announce: 3 reviewers dispatched in parallel — waiting

Phase 3 — Collect Results

Codex

wc -l .planning/reviews/$DATE-codex-findings.md 2>/dev/null || echo "codex output: missing"
git worktree remove /tmp/agent-review-codex --force 2>/dev/null; echo "worktree cleaned"

If missing after timeout: identify high-risk files and re-dispatch with explicit file list. If second attempt fails: skip Codex, announce codex: skipped (both attempts failed), continue.

Gemini

Valid JSON array → write to .planning/reviews/$DATE-gemini-findings.md
[] → write NO_FINDINGS
MCP failure → fallback chain:

gemini -m gemini-2.5-pro --approval-mode yolo - <<'PROMPT'
<gemini prompt>
PROMPT

If that fails → gemini-2.5-flash → gemini-1.5-pro. Write first successful output to file. All fallbacks fail → skip Gemini, announce gemini: skipped (all models at capacity).
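The fallback chain amounts to "first model that returns non-empty output wins". A minimal sketch, assuming the caller supplies a `run(model, prompt)` function that wraps the CLI call shown above and raises on failure; `first_success` and `run` are hypothetical names.

```python
from typing import Callable, Optional

# Order taken from the fallback chain described in this document.
FALLBACK_MODELS = ["gemini-2.5-pro", "gemini-2.5-flash", "gemini-1.5-pro"]


def first_success(
    prompt: str,
    run: Callable[[str, str], str],
    models: list[str] = FALLBACK_MODELS,
) -> Optional[tuple[str, str]]:
    """Try each model in order; return (model, output) or None if all fail."""
    for model in models:
        try:
            output = run(model, prompt)
        except Exception:
            # Any failure (capacity, timeout, non-zero exit) moves to the next model.
            continue
        if output.strip():
            return model, output
    return None
```

A `None` result corresponds to the "skip Gemini" branch: the pipeline continues with the remaining reviewers.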

Copilot

Write to .planning/reviews/$DATE-copilot-findings.md. NO_FINDINGS → write sentinel. Failure → skip, announce, continue.
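The handling rules for reviewer output (valid JSON array, empty array, `NO_FINDINGS` sentinel, or prose) could be folded into one function. This is a sketch under assumptions: `parse_review` and its mode labels `clean`, `json`, and `degraded` are illustrative, not names the skill defines.

```python
import json


def parse_review(raw: str):
    """Return (findings, mode) where mode is 'clean', 'json', or 'degraded'."""
    text = raw.strip()
    if text == "NO_FINDINGS":
        return [], "clean"
    try:
        data = json.loads(text)
        if isinstance(data, list):
            return data, ("json" if data else "clean")
    except json.JSONDecodeError:
        pass
    # Best effort: look for the outermost [...] span embedded in prose.
    start, end = text.find("["), text.rfind("]")
    if 0 <= start < end:
        try:
            data = json.loads(text[start:end + 1])
            if isinstance(data, list):
                return data, "degraded"
        except json.JSONDecodeError:
            pass
    return [], "degraded"
```

When the mode is `degraded`, the raw text is still worth writing to the review file so nothing is lost before the dedup pass.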

Verify

wc -l .planning/reviews/$DATE-*.md

Announce: reviews complete — codex: N lines, gemini: N lines, copilot: N lines

Phase 4 — Dedup + Classify

Step 1 — Gemini Dedup Pass

mcp__agent-bridge__call_gemini(prompt="""
Read these review files and return a single deduplicated consolidated findings list.

Files:
- .planning/reviews/$DATE-codex-findings.md
- .planning/reviews/$DATE-gemini-findings.md
- .planning/reviews/$DATE-copilot-findings.md

Skip any file containing only NO_FINDINGS or absent.

Dedup rules:
- Same file + lines within ±3 + same CATEGORY = same finding. Keep most detailed.
- Same underlying defect at different line numbers = same finding. Merge.
- List all source models for each merged finding.

Return one consolidated JSON array:
{"severity": "", "category": "", "file": "", "line": 0, "issue": "", "fix": "", "sources": []}

No prose. JSON only. If nothing: []
""", cwd="<project_root>")

JSON parse failure → fall back to line-by-line processing of each file. Log: dedup: fallback.
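If the fallback path is taken, the first dedup rule (same file, lines within ±3, same category) is simple enough to apply locally. The sketch below assumes findings are dictionaries with the fields from the consolidated JSON shape; `same_finding` and `dedup` are hypothetical helpers, and "most detailed" is approximated crudely by description length. The second rule, same defect at different line numbers, needs judgment and is not covered here.

```python
def same_finding(a: dict, b: dict, window: int = 3) -> bool:
    """Same file, same category, and line numbers within the window."""
    return (
        a["file"] == b["file"]
        and a["category"] == b["category"]
        and abs(a["line"] - b["line"]) <= window
    )


def dedup(findings: list[dict]) -> list[dict]:
    merged: list[dict] = []
    for f in findings:
        for kept in merged:
            if same_finding(kept, f):
                sources = sorted(set(kept["sources"]) | set(f.get("sources", [])))
                # Keep the more detailed description, measured by length.
                if len(f.get("issue", "")) > len(kept.get("issue", "")):
                    kept.update(f)
                kept["sources"] = sources
                break
        else:
            merged.append({**f, "sources": list(f.get("sources", []))})
    return merged
```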

Step 2 — Existence Verification

Before classifying each finding:

rg --files | grep -F "<cited file>"

Drop findings citing non-existent files. Log as invalid.
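The same check can be run over the whole consolidated list in one pass. A sketch, assuming each finding has a `file` path relative to the repository root; `split_valid` is a hypothetical helper. Unlike `rg --files`, it does not honor ignore rules; it only tests whether the path exists.

```python
from pathlib import Path


def split_valid(findings: list[dict], root: str) -> tuple[list[dict], list[dict]]:
    """Partition findings into (valid, invalid) by whether the cited file exists."""
    base = Path(root)
    valid, invalid = [], []
    for f in findings:
        target = valid if (base / f["file"]).is_file() else invalid
        target.append(f)
    return valid, invalid
```

The length of the second list feeds the "dropped: N invalid" figure in the count report.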

Step 3 — Classify + Write Incrementally

Run ID: $DATE-$(date +%H%M)

Bucket	File	Categories
Bugs & Security	`.planning/open-issues.md`	BUG, SECURITY
Code Health	`.planning/tech-debt.md`	TECH-DEBT, ARCHITECTURE, DEPENDENCY, TEST-GAP
Missing Capability	`.planning/feature-backlog.md`	FEATURE-GAP
User Experience	`.planning/ux-backlog.md`	UX

PERFORMANCE: user-facing → ux-backlog.md, internal → tech-debt.md.
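The routing table and the PERFORMANCE rule can be expressed as a lookup. In this sketch the `user_facing` flag is an assumption: the document does not say how user-facing impact is determined, so the caller has to decide it.

```python
BUCKETS = {
    "BUG": ".planning/open-issues.md",
    "SECURITY": ".planning/open-issues.md",
    "TECH-DEBT": ".planning/tech-debt.md",
    "ARCHITECTURE": ".planning/tech-debt.md",
    "DEPENDENCY": ".planning/tech-debt.md",
    "TEST-GAP": ".planning/tech-debt.md",
    "FEATURE-GAP": ".planning/feature-backlog.md",
    "UX": ".planning/ux-backlog.md",
}


def bucket_for(category: str, user_facing: bool = False) -> str:
    """Map a finding category to its backlog file."""
    if category == "PERFORMANCE":
        # PERFORMANCE splits by who feels the impact.
        return ".planning/ux-backlog.md" if user_facing else ".planning/tech-debt.md"
    return BUCKETS[category]  # KeyError signals an unknown category
```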

Drop before writing: vague findings, findings citing deleted code, non-actionable findings. Normalize severity against rubric before writing.

Cross-run dedup: grep the existing backlog for **Fingerprint**: `<fingerprint>` before appending. If a match is found: skip.

Output format per item:

## [SEVERITY] Short title (max 10 words)
- **Location**: `file:line`
- **Source**: codex | gemini | copilot
- **Run**: $RUN_ID
- **Fingerprint**: `file:line:category-slug`
- **Issue**: One sentence.
- **Fix**: One sentence.

---

Sort within each file: CRITICAL → HIGH → MEDIUM → LOW.

Write header if file empty:

# [Filename] — Last updated: $DATE

Write incrementally — append each finding immediately. Do not batch.
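Rendering one entry and appending it only when its fingerprint is new might look like the sketch below. The `render` and `append_finding` helpers and the `title` and `sources` keys are assumptions; header creation and severity ordering are left out to keep the example short.

```python
from pathlib import Path


def render(f: dict, run_id: str) -> str:
    """Format one finding in the per-item layout shown above."""
    return (
        f"## [{f['severity']}] {f['title']}\n"
        f"- **Location**: `{f['file']}:{f['line']}`\n"
        f"- **Source**: {' | '.join(f['sources'])}\n"
        f"- **Run**: {run_id}\n"
        f"- **Fingerprint**: `{f['fingerprint']}`\n"
        f"- **Issue**: {f['issue']}\n"
        f"- **Fix**: {f['fix']}\n"
        "\n---\n\n"
    )


def append_finding(backlog: Path, f: dict, run_id: str) -> bool:
    """Append immediately; return False if the fingerprint is already present."""
    existing = backlog.read_text(encoding="utf-8") if backlog.exists() else ""
    if f"**Fingerprint**: `{f['fingerprint']}`" in existing:
        return False
    with backlog.open("a", encoding="utf-8") as fh:
        fh.write(render(f, run_id))
    return True
```

Writing one entry per call means an interrupted session loses at most the finding currently being classified.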

Step 4 — Count Report

Announce: classified: N open-issues, N tech-debt, N feature-backlog, N ux-backlog | dropped: N invalid, N dupes, N quality-filtered

Phase 4B — Severity-Tiered Dispatch to agent-remediate-loop

After Phase 4 writes all findings, scan each backlog for entries with **Run**: $RUN_ID. Group by severity.

CRITICAL / HIGH

Present numbered list:

CRITICAL/HIGH findings this run — approve each for agent-remediate-loop:
[1] CRITICAL: <title> — <file:line>
    Issue: <issue>
[2] HIGH: <title> — <file:line>
    Issue: <issue>
Approve items (e.g. "1 3 5"), "all", or "none":

Wait for user response. Collect fingerprints of approved items. If approved: invoke agent-remediate-loop scoped to those fingerprints. Announce: Routing N CRITICAL/HIGH item(s) to agent-remediate-loop. Skipped items remain in backlog.
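The approval reply is free text, so it helps to normalize it before collecting fingerprints. A sketch, assuming replies follow the prompt's three forms; `parse_approval` is a hypothetical helper, and treating an empty reply as "none" is an assumption.

```python
def parse_approval(response: str, count: int) -> list[int]:
    """Turn "1 3 5", "all", or "none" into a list of 1-based item numbers."""
    text = response.strip().lower()
    if text == "all":
        return list(range(1, count + 1))
    if text in ("", "none"):
        return []
    picked: list[int] = []
    for token in text.replace(",", " ").split():
        if not token.isdigit():
            raise ValueError(f"not an item number: {token!r}")
        n = int(token)
        if not 1 <= n <= count:
            raise ValueError(f"item {n} is out of range 1..{count}")
        if n not in picked:
            picked.append(n)
    return picked
```

Raising on malformed input lets the orchestrator re-ask instead of silently routing the wrong items.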

MEDIUM

MEDIUM findings this run (N items):
  - <title> (<file:line>)
  ...
Run agent-remediate-loop on all N MEDIUM findings? [y / n / list]

list → show full issue + fix for each, then re-ask. y → collect fingerprints → invoke agent-remediate-loop. n → leave in backlog.

LOW

No user gate. Collect all LOW fingerprints this run. Announce: Auto-routing N LOW findings to agent-remediate-loop. Invoke agent-remediate-loop scoped to those fingerprints.

Phase 5 — Archive Reviews

mv .planning/reviews/$DATE-*-findings.md .planning/archive/ 2>/dev/null && echo "archived" || echo "archive: nothing moved"

Phase 6 — Update active-context.md

Validate length first: wc -l active-context.md. If over 150 lines: strip completed-phase details before writing the new section.

Replace ## Last Review section (or append if absent):

## Last Review — $DATE
Reviewers: [list which ran — note skipped and reason]
Findings classified into:
- .planning/open-issues.md (N items)
- .planning/tech-debt.md (N items)
- .planning/feature-backlog.md (N items)
- .planning/ux-backlog.md (N items)
Next: work through backlogs by severity. Start with open-issues CRITICAL/HIGH.
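The replace-or-append rule for the Last Review section can be sketched as a single text transformation. This assumes sections are delimited by level-2 `## ` headings; `upsert_last_review` is a hypothetical helper name.

```python
import re

# From the "## Last Review" heading up to the next "## " heading or end of file.
SECTION = re.compile(r"^## Last Review.*?(?=^## |\Z)", re.DOTALL | re.MULTILINE)


def upsert_last_review(text: str, section: str) -> str:
    """Replace the existing Last Review section, or append one if absent."""
    block = section.rstrip("\n") + "\n"

    def _replace(m: re.Match) -> str:
        # Keep a blank line before the following heading, if there is one.
        return block if m.end() == len(text) else block + "\n"

    if SECTION.search(text):
        return SECTION.sub(_replace, text, count=1)
    if not text.strip():
        return block
    return text.rstrip("\n") + "\n\n" + block
```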

Phase 7 — Done

agent-review-loop complete.
Reviews: [list reviewers that produced output — note skipped and reason]
Received findings: N
Invalid (non-existent files) dropped: N
Duplicates merged: N
Quality-filtered dropped: N
Final unique findings: N
  open-issues: N items
  tech-debt: N items
  feature-backlog: N items
  ux-backlog: N items
active-context.md updated.
