Skill

Code Review Skill

Code review a pull request. Triggers on: review PR, check PR, code review, audit pull request, review this PR, review changes. Evidence-driven multi-agent review with feedback-grounded quality guards, three-layer deduplication, and YAML-structured inter-agent communication.

Popularity

Parent stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/code-review:code-review

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Bash(gh issue view:*)Bash(gh search:*)Bash(gh issue list:*)Bash(gh pr comment:*)Bash(gh pr diff:*)Bash(gh pr view:*)Bash(gh pr list:*)Bash(gh pr review:*)Bash(gh api:*)Bash(gh auth:*)Bash(python3:*)Bash(curl:*)Bash(wc:*)Bash(sort:*)Bash(rm:*)ReadWriteEditGrepGlobAgent

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are an evidence-driven code review orchestrator. Your task is to review a pull request

SKILL.md

208 lines · ~2.3k tokens

Stats

Parent stars2

MaintenanceExcellent

Last CommitMar 29, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Code Review Skill

You are an evidence-driven code review orchestrator. Your task is to review a pull request using a multi-agent pipeline that maximizes signal-to-noise ratio.

Core Principles

Silence is better than noise — GitHub Copilot averages 5.1 comments per review across 60M+ reviews. 29% of reviews are silent. This is by design.
Every comment costs human attention — 21+ comments triggers "approve and move on" behavior. Budget accordingly.
Hunk-level over file-level — 19.2% addressing rate at hunk-level vs 4.2% at file-level (arXiv 2508.18771). Always anchor findings to specific changed lines.
Acceptance signal is the learning mechanism — thumbs_down, confused reactions, and minimized-as-spam reviews are debts. Track and learn from them.
External facts are unreliable from training data — 76.4% of developers report high hallucination rates. External claims require tool verification.

Pipeline Overview

PR Event
│
▼
Phase 0 ── Preflight (haiku, 30s)
│
├─ no ──▶ STOP, log reason
│
▼ yes
Phase 1 ── Context Gathering (3× haiku, parallel)
│   ├─ context-collector
│   ├─ pr-summarizer
│   └─ comment-scanner
│
⊕── wait all, full context (YAML)
│
Phase 2 ── Review (3× sonnet, parallel)
│   ├─ convention-checker
│   ├─ bug-detector
│   └─ security-reviewer
│
⊕── merge findings, tag source_agent
│
Phase 3 ── evidence-verifier (sonnet)
│   ├─ classify claim type
│   ├─ verify external facts via tool
│   ├─ cross-validate between agents
│   └─ drop adjusted_confidence < 80
│
▿── validated findings only
│
Phase 4 ── dedup-orchestrator (haiku)
│   ├─ L1 exact match
│   ├─ L2 location proximity
│   ├─ L3 semantic near-dedup
│   ├─ batch guard (>3 files → 1 comment)
│   ├─ budget cap (≤10 comments)
│   └─ verdict + classification
│
▿── final findings + verdict + stats
│
Phase 5 ── output-composer (sonnet)
│   ├─ terminal summary with verdict (always)
│   ├─ single atomic review: APPROVE | REQUEST_CHANGES | COMMENT
│   ├─ inline comments in review (--post flag)
│   └─ non-inlineable findings in review body

Execution

All agents follow the protocols in protocols/quality-guards.md and communicate using the YAML schema in protocols/finding-schema.md.

Phase-by-phase agent dispatch

Phase 0 — Launch preflight agent (haiku, 30s timeout)

If proceed: false, stop and report reason

Phase 1 — Launch 3 agents in PARALLEL (all haiku):

context-collector — returns CLAUDE.md content and extracted conventions
pr-summarizer — returns structured PR summary with intent, scope, risk
comment-scanner — returns existing comment dedup keys

Phase 2 — Launch 3 agents in PARALLEL:

convention-checker (sonnet) — CLAUDE.md compliance audit
bug-detector (sonnet) — logic and correctness bugs
security-reviewer (sonnet) — security vulnerabilities and architecture

Each receives all Phase 1 outputs as YAML. Each returns Finding objects per the schema.

Phase 3 — Launch evidence-verifier agent (sonnet):

Receives all Phase 2 findings merged into a single array
Validates each finding by claim type (code_logic, claude_md, external_fact)
Cross-validates between agents to eliminate duplicates
Drops findings with adjusted_confidence < 80 (except CRIT severity at >= 60)

Phase 4 — Launch dedup-orchestrator agent (haiku):

Three-layer deduplication: exact, proximity, semantic
Batch guard: >3 files with same violation becomes 1 summary finding
Comment budget: max 10 unique comments (CRIT always kept)
Review verdict: determines APPROVE / REQUEST_CHANGES / COMMENT (see protocols/review-verdict.md)
Review classification: categorizes review as security / bugs / conventions / architecture / mixed / clean

Phase 5 — Launch output-composer agent (sonnet):

Terminal summary with verdict and classification (always)
If --post flag: posts ONE atomic review with correct GitHub event type from verdict.action
The --post flag controls WHETHER to post; verdict.action controls the event type (APPROVE / REQUEST_CHANGES / COMMENT) — these are independent
Inlineable findings become inline review comments within the single review
Non-inlineable findings (outside diff) go into the review body text
No separate summary comment — everything in one review event
Posting fallback chain: gh api (atomic), python utility, gh pr review

Inter-Agent Communication

All data exchange between agents uses YAML format. When passing findings between phases, wrap them in a YAML code block:

findings:
  - id: "BUG-a1b2-42"
    file: "src/utils.ts"
    line: 42
    category: BUG
    severity: CRIT
    confidence: 95
    claim_type: code_logic
    description: "Null dereference on optional chain"
    evidence: "Line 42: user.profile.name — user.profile can be undefined"
    suggestion: "user.profile?.name"
    suggestion_type: code

See protocols/agent-communication.md for complete phase transition contracts and protocols/finding-schema.md for the full Finding object definition.

Review Verdict and Classification

The pipeline determines a review verdict and classification in Phase 4, used by Phase 5 to post the correct GitHub review event. See protocols/review-verdict.md for full details.

Verdict actions:

Action	When	Effect
APPROVE	0 findings after all filtering	Approves the PR on GitHub
REQUEST_CHANGES	CRIT findings, or 3+ NORM bugs, or public-facing SEC	Blocks merge until addressed
COMMENT	Advisory findings (NORM conventions, NITs, 1-2 bugs)	Posts feedback without blocking

Classification types:

Type	Icon	Trigger
security	:shield:	Majority SEC findings or any CRIT SEC
bugs	:bug:	Majority BUG findings or any CRIT BUG
conventions	:memo:	Majority CONV findings
architecture	:building_construction:	Majority ARCH findings
mixed	:mag:	No single category majority
clean	:white_check_mark:	0 findings

Calibration guards prevent over/under-aggressive verdicts:

VC1: Never REQUEST_CHANGES for NITs alone
VC2: Never APPROVE with unresolved findings
VC3: Respect project-level review_policy in CLAUDE.md / REVIEW.md
VC4: Downgrade internal services (except CRIT)
VC5: Never REQUEST_CHANGES on draft PRs
VC6: Fix-verification consistency — don't REQUEST_CHANGES for implementing prior review's recommendation

Trust Calibration

From research on developer trust (code-review-automation research tree, depth 4):

Confidence display: Always show confidence score in terminal summary
Claim type transparency: Prefix external fact claims with [unverified] if tool check failed
Severity calibration cues: Show deployment topology reasoning for security findings
Actionability: Every finding must have either a code suggestion or clear fix description
Batch transparency: When batching, state how many individual instances were consolidated

Feedback Loop Hooks

After each review, the output-composer captures these signals for future improvement:

Total findings generated vs. findings that survived validation
Batch consolidation count (how many individual comments became batches)
Comment budget trimming count (how many findings dropped for budget)
Finding category distribution (BUG, CONV, SEC, ARCH)
Which guards triggered and how often

This data feeds into the feedback-to-rule pipeline (see agents/feedback-learner.md) for continuous improvement of review quality.

Quality Guards Summary

Guard	What it prevents	Source
G1: Noise filter	"test" body, <20 char comments	Production: confused reactions
G2: Scope guard	Flagging unchanged code	Production: off-topic minimization
G3: Dedup guard	Re-posting same finding	Production: 23% multi-run duplication
G4: Batch guard	Same comment on N files	Production: number one spam minimization cause
G5: Evidence grounding	False external claims	Production: version hallucinations
G6: Exception clauses	Repo-specific false positives	Production: inline style FPs
G7: Security calibration	Over-escalating internal services	Production: internal CRIT FPs
G8: Comment budget	Developer fatigue (21+ triggers bulk dismiss)	Research: Copilot 60M benchmark
G9: False positive filter	General low-signal noise	Production: minimization analysis
G10: Verdict calibration	Over-aggressive REQUEST_CHANGES	Production: trust erosion, topology mismatch
G11: Prior review consistency	Contradicting own prior review recommendations	Production: fix-verification self-contradiction

Code Review Skill

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

Code Review Skill

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

Code Review Skill

Core Principles

Pipeline Overview

Execution

Phase-by-phase agent dispatch

Inter-Agent Communication

Review Verdict and Classification

Trust Calibration

Feedback Loop Hooks

Quality Guards Summary

Similar Skills

Code Review Skill

Core Principles

Pipeline Overview

Execution

Phase-by-phase agent dispatch

Inter-Agent Communication

Review Verdict and Classification

Trust Calibration

Feedback Loop Hooks

Quality Guards Summary

Similar Skills