From wystack-agent-kit
Run a multi-angle static code review with parallel lens reviewers, domain specialists, and configured work-item/doc context. Tunable effort (low/medium/high/ultra) trades recall vs precision; optional --comment posts findings inline on the PR; optional --fix applies MUSTs to the working tree. Use when the user asks to review code, review a branch, review changes, check regressions, or get feedback before merge. Also invoked by `wystack-agent-kit:full-review` and `wystack-agent-kit:finish-task`.
How this skill is triggered — by the user, by Claude, or both
Slash command
/wystack-agent-kit:code-reviewThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Static review from multiple expert perspectives along two orthogonal axes — **lenses** (what question) and **specialists** (what code area). No code written or executed by default — findings only. Optional flags can post findings to the PR or apply MUSTs to the working tree.
Static review from multiple expert perspectives along two orthogonal axes — lenses (what question) and specialists (what code area). No code written or executed by default — findings only. Optional flags can post findings to the PR or apply MUSTs to the working tree.
$ARGUMENTS — positional then flags. Positional: empty (current branch vs the base branch), a PR URL/number, or file paths. Flags:
--effort=<low|medium|high|ultra> — recall vs precision dial; default medium. See Effort dial.--lens=<list> — comma-separated lens names to override auto-selection (e.g. --lens=security,correctness). See Lens roster.--comment — post each in-diff MUST as an inline PR comment after the report. --comment=all also posts in-diff SUGGESTs. Out-of-diff findings are never posted — they are observations on lines the PR did not modify, and posting them adds noise on unrelated code. They still appear in the chat-session report and in the durable record. No-op if no open PR.--fix — after the report, skip the action menu and apply MUSTs to the working tree (test-edits still gated by docs/testing-philosophy.md). --fix=all also applies SUGGESTs. Findings persist before any edit.Run the pipeline in order; each stage gates the next.
Eligibility. Cheap precheck on the standard tier before any heavy work (judgment call — see Model assignment). If the target is a PR/MR, bail when it is closed, merged, draft, an automated bot bump (Dependabot, Renovate, version-only), already reviewed by this skill on this diff_sha (check the configured record store when present; skip this sub-check when no workspace is loaded), or trivially small with no behavioral change (lockfile-only, copy-only, regenerated-snapshot-only). On bail, surface a one-line reason and stop. The user can override with an explicit re-run. No PR target → skip the gate; local branch reviews always proceed.
Scope. Resolve the base branch — the PR's base, or the repo default (git symbolic-ref refs/remotes/origin/HEAD); never assume main. git diff --name-only $(git merge-base HEAD <base>)..HEAD, classify against the roster to pick specialists. Best-effort load wystack-agent-kit:workspace — its presence routes the roster (cached vs inline) and gates the record write; the review itself never gates on it. Also enumerate CLAUDE.md paths — walk every ancestor directory of every touched file, deduplicated, up to and including the repo root (light tier, paths only — never contents). For src/a/b/c.ts that means checking src/a/b/CLAUDE.md, src/a/CLAUDE.md, src/CLAUDE.md, CLAUDE.md, including only the ones that exist. The list passes to reviewers as the local rule source.
PR snapshot. If a PR/MR URL or number is given, or the branch has an open one, capture metadata via the workspace's prView capability — number, url, title, body, head/base refs, changed-files count, additions, deletions, author, review decision, merge-state status, status-check rollup. The capability resolves to the host's CLI (see docs/storage-contract.md). When prView is unavailable (cli: manual, no host CLI, network failure), synthesize the same shape from branch name, commits, diff stats. Feeds the report narrative, not severity.
Context. Hard gate. Invoke the wystack-agent-kit:engineering-context skill (see Context gathering). Skipping this is the #1 cause of false-positive reviews. Pass the returned block verbatim to every reviewer — not paraphrased.
Preflight. Run the project's configured preflight checks from storage.json (quality.preflight) when a workspace is loaded; otherwise use the smallest obvious read-only baseline for the repo (for example typecheck/lint/test scripts already present). Any fail → stop; reviewing broken code produces findings about breakage, not design.
Parallel reviews. Single message, all reviewers — both axes spawn together:
--lens override). Each scoped to its single question per the Lens roster.wystack-agent-kit:perspective with review intent when configured providers are available, unless effort is low (which skips it). Same-runtime reviewers can share blind spots.review and support observe.records for the current diff; normalize per docs/extension-contract.md.Each reviewer gets the PR snapshot, changed files, branch, base, the verbatim context block, its typed role (lens or domain), and the reviewer brief. At ultra effort, lens + specialist reviewers run twice with cross-comparison (LLM-sampling insurance, see Effort dial). External review output (perspective, extensions) enters triage as claims, not facts. If providers/extensions are unavailable, note the gap and continue.
Dedup + triage. Orchestrator partitions findings by Scope first:
ultra doesn't dump dozens of low-confidence observations into the bucket. The user can file follow-up tickets on explicit pick; the review controller never auto-files. Persist these alongside in-diff findings at step 8 with a scope: "out-of-diff" field so retro/audit can see what was deferred.Recommendations only: a follow-up ticket is filed at step 9 on the user's explicit pick, never here.
Report. Deliver per REPORT-FORMAT.md. Single message — lead with the recommendation, decision needed, PR read (summary, architecture, risk, verification), then findings. Not a chronological work log. Append the workspace footer per Record.
Record. Write durable state if a workspace is loaded — one finding file per triaged item (.wystack/findings/<findingId>.json) plus one immutable pass record (.wystack/reviews/REV-*.json) linking finding_ids. Skip with a one-line setup suggestion in the report footer if not. See Record.
Execute based on flags:
--comment. Before posting, re-run the step-0 eligibility precheck (standard tier — same judgment call) — PR state can change during a long review (merged, closed, marked draft). If ineligible now, skip the post with a one-line note in the report footer. Otherwise post each MUST as an inline PR comment per the Comment format. Today the storage contract only defines read capabilities for PR comments (prCommentsInline, prCommentsTop); the write side falls back to the host CLI directly (gh pr review --comment or gh api …/pulls/{pr}/comments). SUGGEST excluded unless --comment=all. No-op with a one-line note if no open PR or no host CLI. Doesn't suppress the action menu. A prReviewCreate/prCommentInlineCreate capability is a future addition to docs/storage-contract.md; until then, the skill carries the CLI invocation as a documented exception.--fix. Skip the menu; apply every MUST as an edit. SUGGEST untouched unless --fix=all. Findings must already be recorded (step 8) before any edit — durable audit comes first. Test changes still gate on docs/testing-philosophy.md (see below); skip and report any finding that fails the gate. After edits, report which findings were applied vs skipped.--comment --fix posts inline first, then applies MUSTs.For test changes anywhere in this step, apply the strategic test gate in docs/testing-philosophy.md — delegate to test-writer or a tdd skill only when the test protects a spec contract, regression, hidden edge case, or boundary; pass the anchor explicitly.
Reviewers hunt findings with maximum recall. The review controller owns triage, severity translation, filing discipline, and round termination — don't bake scope filters into reviewer prompts.
Composition scales with change type:
| Change type | Reviewers |
|---|---|
| Features, refactors, architectural work | lens panel + wystack-agent-kit:principal + wystack-agent-kit:qa + domain specialists |
| Bug fixes (scoped, observed bug) | lens panel + wystack-agent-kit:qa |
| Polish, docs, comments | lens panel alone |
| Security-sensitive | Full composition + ultra sampling (run reviews twice; LLM sampling insurance) |
The lens panel is whichever lenses the effort dial selects (or --lens overrides). It replaces what was historically a single generic code-reviewer subagent — lenses ask the same code-quality questions, but scoped and parallelized. Don't spawn a separate code-reviewer on top of the lens panel; that double-spawns the correctness/simplify questions and contradicts the role-scope clause in the reviewer brief.
Domain specialists — pick by matching the diff's touched domains to whichever roster is available. Universal roles (principal, qa) are subagents; specialists are brief-driven, whether the brief is cached or generated.
agents.specialists from storage.json. Each entry declares name, domain, and a brief path. Pick the entries whose domain matches the changed files; multiple join when the diff spans domains. A specialist runs as a general-purpose reviewer spawned with its cached brief as the role prompt. If the diff touches a domain not covered by the cached roster, fall through to the inline path for that domain and note the gap in the report — recommend wystack-agent-kit:identify-specialists to make the addition durable.wystack-agent-kit:identify-specialists step 1 (Analyze) to the changed paths only, not the whole tree — manifest files, top-level boundaries, signal files. For each detected domain, spawn an ephemeral specialist with a brief synthesized from identify-specialists/ANALYSIS-HEURISTICS.md and shaped to identify-specialists/BRIEF-TEMPLATE.md. No persistence — the brief lives only for this run. A workspace setup later codifies these picks via wystack-agent-kit:identify-specialists.With no domain matches detected either way, principal carries the domain perspective alone — an empty specialist panel is a valid outcome.
wystack-agent-kit:perspective runs as an advisory reviewer on every composition when configured providers are present. It can use internal agents, teammate briefs, external tools, or configured extensions; it does not replace the main reviewer roster. The review controller triages its claims alongside the rest.
Extension participation is explicit: the skill asks which enabled extensions
participate in review and can observe.records for the diff. No extension
runs as an implicit lifecycle hook. Any extension-proposed action such as
verify_record or apply_fix is only surfaced as an available action; the main
agent or user decides whether to invoke it later.
Lenses are the what-question axis, orthogonal to specialists (the what-area axis). Each lens is a scoped reviewer — one question, high recall on that question, no cross-commentary. Run in parallel with specialists at step 5; the review controller triages all output together at step 6.
| Lens | The question | Auto-include trigger |
|---|---|---|
| correctness | Bugs, AC violations, broken invariants, off-by-ones | Always (every effort level) |
| simplify | Dead code, over-abstraction, duplicated logic, premature generality | Always at medium+ |
| testing | Spec-grounded test gaps, waste tests encoding current shape, missing regressions | Diff touches test files or shipped behavior at medium+ |
| security | Auth/authz, injection, secret handling, unsafe deserialization, SSRF/XXE | Diff touches auth, network, parsing, crypto, env, or user input |
| performance | Hot paths, N+1, unnecessary allocations, sync work that should be async | Diff touches request handlers, loops over collections, DB queries |
| api-contract | Breaking changes, boundary leaks, undocumented surface changes | Diff touches public exports, route handlers, schemas, types |
| docs | Comments-as-code drift, stale docs, missing rationale on non-obvious calls | Always at high+ |
Selection — --lens=<list> overrides auto-selection entirely. Otherwise the effort dial sets the baseline; auto-include triggers add lenses on top. A specialist and a lens can coexist on the same diff (e.g. security lens + auth specialist) — they ask different questions; collapsing them loses recall.
Each lens runs as a general-purpose reviewer spawned with the lens name as its typed role; the reviewer brief carries the scope clause.
--effort trades recall vs precision, coverage vs cost, and single-pass vs sampling-insurance. Default medium. Affects step 5 (composition) and step 6 (triage threshold) only — pipeline shape and severity model are unchanged.
| Effort | Lenses | Specialists | Perspective | Confidence floor | Sampling | When |
|---|---|---|---|---|---|---|
low | correctness, simplify | none | off | ≥ high | single pass | Quick sanity check on a small or low-stakes diff |
medium (default) | correctness, simplify, testing + auto-include by diff signal | matched roster | on if available | ≥ medium | single pass | Standard pre-merge review |
high | all lenses with any signal + docs | matched roster (broader) | on if available | ≥ low (broader recall, may include uncertain) | single pass | Risky, security-sensitive, or large diffs |
ultra | all lenses unconditionally | matched roster (broader) | on, all configured providers | ≥ low | reviewers run twice, cross-compared for stability | Pre-release, paid-tier exhaustive review, or LLM-sampling-sensitive verdicts |
Confidence floor is applied at step 6: a reviewer-reported Confidence: low finding is dropped at low/medium, surfaced as SUGGEST at high, and surfaced normally at ultra. The floor never overrides a Severity: critical correctness finding — a low-confidence bug claim still surfaces; it's just labeled "needs verification" in the report.
Sampling at ultra — each lens and specialist runs twice in parallel; findings that appear in both passes are promoted in confidence, findings in only one pass are demoted one rung. This is the "LLM sampling insurance" from the specialist roster generalized: security-sensitive changes already trigger it implicitly at any effort level, but ultra makes it the rule.
Reviewer tier per effort — the effort dial also escalates reviewer tier where it pays off. low and medium run all reviewers (except principal) on standard. high escalates the security and correctness lenses to deep because false negatives there are most costly; everyone else stays standard. ultra escalates all reviewers to deep. principal is always deep whenever it runs — its job is the architectural read. See Model assignment and docs/model-tiers.md.
Cost note — surface the effort level in the report header so the user can see what they paid for. If the user passes --effort=ultra on a one-file polish diff, propose medium first and wait for confirmation; effort should match the diff's stakes.
Callers in a convergence loop. wystack-agent-kit:full-review and wystack-agent-kit:finish-task invoke this skill without args, so they get medium by default. The convergence loop (docs/review-loop.md) runs N rounds — cost multiplies. Callers iterating toward clean may pass --effort=low on later rounds once the high-signal findings have converged; the final round before merge can step back up to medium or high as a precision check.
Before invoking the skill, extract every ticket ID from the branch name and last ~20 commit messages (case-insensitive): task-{id} / task/{id} / task_{id}; {PROJECT}-{id} (2–6 letter prefix + dash + digits); bare leading digits {id}-{slug}; path-style feat/{id}-*, fix/{id}-*, {user}/{id}-*.
Pass "<branch-name> <id1> <id2> ... review" as the skill arguments. If a ticket ID was found, instruct context that task-manager MUST be dispatched with that ID — title-only freshness is unreliable.
If engineering-context reports gaps (no PRD, unresolved open questions affecting the diff), pause and ask the user before dispatching reviewers. If the skill is unavailable, gather the same block manually from ticket/spec links and label it fallback context — never reach step 5 with no context block.
Each reviewer gets: changed files, branch, base; the PR snapshot; the step-3 context block verbatim; the output format from REPORT-FORMAT.md; its typed role (a lens name from the Lens roster or a domain from the Specialist roster); and:
Scope: in-diff | out-of-diff based on whether the cited line was modified by the PR. Out-of-diff findings are not dropped — they go into a separate bucket (see step 6) so we can still capture latent bugs the reviewer noticed in adjacent code. Flag them, mark them, move on.docs/testing-philosophy.md: recommend tests only for hidden edge cases, spec contracts, regressions, or boundaries; flag waste tests that only encode current code shape. Missing-test findings need the same gate. A fix that requires modifying an existing test is a signal — that test may have encoded the bug; check its spec anchor first.Reviewers keep their native severity labels (Critical/High/…) as input signal — the review controller re-classifies at step 6. Don't force reviewers to pre-triage; it reduces recall.
When --comment posts a finding to the PR, the body is short, not the full report. Brevity matters — inline comments live forever in the PR thread; a wall of prose ages badly.
Per-finding body:
<one-line description of the issue> (<source>: "<verbatim quote of the rule>")
<permalink to the code with full sha + L-range>
<one-line recommendation>
<path>/CLAUDE.md, spec: <decision-id>, lens: <name>, or git history. Always quote the rule verbatim if it came from a file; never paraphrase.HEAD, not a branch name) and an L<start>-L<end> range with at least one line of context above and below the cited line. Bash interpolation like $(git rev-parse HEAD) does not work — comments render as static markdown. Format: https://<host>/<owner>/<repo>/blob/<full-sha>/<path>#L<start>-L<end>.Aggregate header before the per-finding list:
### Code review (<effort> effort, <N> findings)
Out-of-diff findings are never posted as PR comments — they would land on lines the PR did not touch and read as noise. They stay in the chat-session report and the durable record only.
If zero MUSTs and --comment was passed, post a single line: ### Code review — no issues found in the diff (<effort> effort). Don't post on --comment runs with zero findings and zero MUSTs unless the user explicitly opts in — silent success is fine.
The full report (per REPORT-FORMAT.md) still goes to the user's chat session; --comment posts the short form to the PR. Don't dump the full report inline on the PR.
Match the tier to the work, not to the step number. Tier vocabulary and per-harness mapping are owned by docs/model-tiers.md; skills name tiers, never provider models.
light — mechanical, no judgment: CLAUDE.md path enumeration (step 1), PR snapshot synthesis from branch/commits when prView is unavailable (step 2 fallback).standard — default for almost everything else: eligibility precheck (step 0) and recheck (step 9 --comment), lens reviewers, domain specialists, qa, perspective. Sonnet-class models handle these well; reaching higher by default burns cost without lifting quality.deep — reserved for architectural reasoning and the highest-effort runs: principal always, and at --effort=high the security and correctness lenses also escalate (where false negatives are most costly). At --effort=ultra all reviewers escalate to deep and run the double-sampling pass from the Effort dial.The rule: the lowest tier is for no-judgment work; deep is for work where reasoning is the bottleneck. Most review at medium effort is pattern-matching against established norms — standard is enough. See docs/model-tiers.md for the sharp tests that decide tier picks.
Reviewers are shared role briefs, not custom Codex agent types — universal roles live in agents/*.md, specialists at their configured agents.specialists brief path. Claude can consume those files as native agent definitions. Codex uses its built-in transports (explorer, worker, default) with the role brief injected into the prompt; the reviewer name belongs in the prompt/report, not in a claimed custom transport. Don't claim plugin parity unless engineering skills are actually exposed; if wystack-agent-kit:engineering-context is unavailable, say so and fall back.
When wystack-agent-kit:workspace resolved a root, write one immutable
review run-record per pass through the configured record.write binding from
storage.json. If no record-store extension is configured, or the write fails,
fall back to .wystack/reviews/REV-{unix-ts}-{short-sha}.json and record the
fallback reason per docs/run-record.md.
Default: persist every triaged finding as project state (clawpatch model) —
one file per finding under .wystack/findings/, evidence-backed, individually
triageable; the pass record links finding_ids and denormalized counts. Shape,
field definitions, and write rules: RECORD-FORMAT.md.
When no workspace resolved, skip the write and append one line to the report footer: "No workspace record store — review record skipped. Run wystack-agent-kit:setup-agent-kit to enable per-pass tracking." Once per report, not per finding.
wystack-agent-kit:full-review writes its unified verdict to the same configured
record store with "skill": "full-review" and merged findings from all lenses.
wystack-agent-kit:finish-task resume reads the configured record store first,
then .wystack/reviews/, as evidence of a converged pass on this diff_sha
(verdict + diff_sha only — findings are for audit and retro, not the skip gate).
wystack-agent-kit:full-review.--comment flips to ineligible → keep the report; skip the post; note the reason in the report footer.ultra, if one of the two sampling passes fails, treat the remaining pass as a high single-pass result and note the degraded confidence.wystack-agent-kit:identify-specialists to make the addition durable.--lens names an unknown lens → ignore the unknown name with a one-line note, run the recognized lenses; never fail the review.--fix on a test edit that fails the strategic test gate → skip that finding's edit, keep it in the report as a finding the user must decide on. Don't silently apply.--fix with no MUSTs → no-op, report says "no MUSTs to apply" and proceeds as if the flag were absent.--comment with no open PR → skip with a one-line note in the report footer; report itself is unaffected.--comment with no host CLI available → skip with a one-line note recommending the user install gh/glab (or the configured host CLI).--effort=ultra on a trivial diff → propose medium and wait for confirmation before running; effort should match stakes.docs/review-loop.md: round structure, the zero-MUST gate, scope-drift signal, round budget, stall handling.npx claudepluginhub youhaowei/wystack-agent-kit --plugin wystack-agent-kitGuides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.