From ngmeyer-skills
Audit a web codebase for security, performance, correctness, and refactoring improvements WITHOUT changing any outward-facing behavior. Fans out parallel read-only reviewers, scores findings on two axes (severity × confidence), suppresses predictable false positives, validates survivors with an INDEPENDENT wave (not self-recheck), classifies safe vs. gated, and writes a report. Applies only behavior-preserving fixes, and only on request. Use when: 'rigorous review', 'hardening audit', 'security and performance audit', 'internal audit', 'harden the codebase', 'audit for security/perf/refactoring', 'tech-debt audit', 'review this codebase without changing behavior'.
Slash command
/ngmeyer-skills:rigorous-review [path to scope] [--effort low|medium|high|max] [apply-safe]
Investigate a codebase for security, performance, correctness, and refactoring improvements under one hard constraint: no observable change to any outward-facing page or API. Rendered HTML of public routes, URL structure, JSON response shapes, auth flows, and admin UI behavior must all be byte-for-byte equivalent for legitimate callers after any fix.
The deliverable is a report. Fixes are applied only when the user asks, and only the
ones classified safe.
- Every finding cites file:line + a code quote. Lead the evidence with the observable consequence
  (what a user, attacker, or operator experiences), not the code structure. No findings
  from names or assumptions.
- safe: behavior-preserving for legitimate callers; no schema change, no infra.
  Authorization fixes are safe: rejecting an unauthorized caller is the intent,
  not a regression. Same for a guard legitimate callers already satisfy.
- gated: any risk of observable change, a schema/migration change, a data-semantics
  change, or new infra. Report-only; never apply silently.
- Production is read-only: no drizzle-kit push, no script runs against
  prod, no destructive commands. Schema/index recommendations go in the report only.

Apply after synthesis, before the report:
- Effort (--effort, default medium): low/medium report
  high-confidence findings only (≥75), so fewer, surer findings. high/max widen recall (surface
  gated-50s for triage, run the validator wave on more findings). Match depth to the request.

Full anchor definitions, the dedup fingerprint, agreement promotion, the validator-wave protocol, and the precedents table: references/scoring-gating-validation.md.
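The confidence gate and effort dial described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the definition in references/scoring-gating-validation.md: confidence is taken to be a 0-100 score, the names `ScoredFinding`, `GateResult`, and `gate` are hypothetical, the P0-at-50 exception is assumed to hold at every effort level, and "gated-50s" is read as findings at or above the 50 anchor that the gate would otherwise hold back.

```typescript
type Severity = "P0" | "P1" | "P2";
type Effort = "low" | "medium" | "high" | "max";

// Hypothetical record: only the fields the gate needs.
interface ScoredFinding {
  id: string;
  severity: Severity;
  confidence: number; // assumed 0-100 scale
}

type GateResult = "report" | "triage" | "suppress";

const REPORT_THRESHOLD = 75; // "high-confidence only (>=75)"
const P0_EXCEPTION_THRESHOLD = 50; // "suppress <75 except P0-at-50"

function gate(finding: ScoredFinding, effort: Effort = "medium"): GateResult {
  // Anything at or above the main bar is reported at every effort level.
  if (finding.confidence >= REPORT_THRESHOLD) return "report";

  // A P0 is too costly to miss, so it survives at the lower anchor.
  if (finding.severity === "P0" && finding.confidence >= P0_EXCEPTION_THRESHOLD) {
    return "report";
  }

  // Assumption: high/max surface the held-back 50s for triage instead of dropping them.
  const widened = effort === "high" || effort === "max";
  if (widened && finding.confidence >= P0_EXCEPTION_THRESHOLD) return "triage";

  return "suppress";
}
```

Keeping the gate a pure function of (severity, confidence, effort) makes the suppression decision easy to unit-test and to explain in the report.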
- Is auth enforced in middleware.ts / a central auth layer, or does each route/action guard itself? State the
  answer up front, because it changes how the security pass reads every endpoint.
  (grep -r "middleware" at the app root; check the framework's auth entry points.)
- Inventory the attack surface; an Explore agent is good for this. Produce a finding for
  any surface element with no corresponding guard (the attack-surface-inventory rule).
- Read the project invariants in CLAUDE.md/AGENTS.md (tenancy scoping, PII rules, money/
  units conventions). Pass these verbatim to reviewers as "violating this is a P0."
- Set the effort level (default medium) and assign model tiers: the security
  and correctness reviewers inherit the session model (high-stakes, miss-cost high); the
  performance and refactoring reviewers may run a mid-tier model (~3–4× cheaper, no
  quality loss on lower-stakes lanes). State the assignment.

Dispatch four Agents in one message (run_in_background: true), each READ-ONLY, each
told to: verify every finding against the full code path, score it on both axes, classify
safe/gated, and apply its lane's do-NOT-flag list before emitting. Each returns
findings (id, severity, confidence, title, file:line, evidence quote + observable
consequence, fix sketch, class, lane) plus a short inventory with a one-word verdict per item.
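One way to picture what each reviewer returns is a typed record plus an evidence check. The field list follows the paragraph above; the exact field names, the `Finding` interface, and the `evidenceProblems` helper are assumptions made for illustration, and the severity set is limited to the tiers this document mentions.

```typescript
type Lane = "security" | "performance" | "correctness" | "refactoring";

// Hypothetical shape for one reviewer finding.
interface Finding {
  id: string;
  severity: "P0" | "P1" | "P2";
  confidence: number; // assumed 0-100 scale
  title: string;
  file: string;
  line: number;
  evidence: {
    quote: string; // the code quote
    consequence: string; // what a user, attacker, or operator experiences
  };
  fixSketch: string;
  class: "safe" | "gated";
  lane: Lane;
}

// Returns the reasons a finding fails the evidence rule; an empty list means it passes.
function evidenceProblems(finding: Finding): string[] {
  const problems: string[] = [];
  const hasLocation =
    finding.file.trim() !== "" && Number.isInteger(finding.line) && finding.line >= 1;
  if (!hasLocation) problems.push("missing file:line");
  if (finding.evidence.quote.trim() === "") problems.push("missing code quote");
  if (finding.evidence.consequence.trim() === "") {
    problems.push("missing observable consequence");
  }
  return problems;
}
```

An orchestrator could drop or bounce any finding for which `evidenceProblems` is non-empty before synthesis, which enforces "no findings from names or assumptions" mechanically.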
The four lanes, with full checklists in references/reviewer-lanes.md:
- Security: … (dangerouslySetInnerHTML, exec interpolation, eval/new Function); map each
  finding to OWASP Top 10:2025 + CWE.
- Performance: … Promise.all/join; confirm the loop is real first); missing indexes (gated);
  over-fetching; Core Web Vitals (LCP/INP/CLS, image/font, barrel imports, RSC
  serialization); serverless traps (in-process caches broken across invocations, ephemeral
  FS writes, unauthenticated cron). Project impact at 10×/100×/1000× data volume.
- Correctness: … "undefined"/NaN;
  error-masking fallbacks (empty array instead of propagating a failed query); TOCTOU and
  half-updated state; race conditions.
- Refactoring: … knip/ts-prune/ruff F401/ast-grep, accounting for barrel files, dynamic import(),
  framework exports, not bare grep); drift-prone duplication (P1 if the copies already
  disagree); Fowler 5-family smell taxonomy; the Deletion Test and two-adapter
  seam rule (don't recommend a single-use abstraction); TS type-safety
  (noUncheckedIndexedAccess, Result-vs-throw, discriminated unions). Deletion recs are gated.

Plus an API-contract check that guards the core invariant directly (additive-vs-mutative,
silent-semantics-change like "count used to include deleted rows, now it doesn't"):
references/behavior-preservation.md.
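The additive-vs-mutative half of the API-contract check can be approximated by diffing the shape of a response captured before and after a change. The sketch below is an assumption about how such a check might look, not the procedure in references/behavior-preservation.md; `Json`, `ContractChange`, `diffShape`, and `isAdditiveOnly` are hypothetical names, and arrays are compared by their first element only.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface ContractChange {
  path: string;
  kind: "added" | "removed" | "type-changed";
}

function kindOf(value: Json): string {
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  return typeof value;
}

// Walks two JSON samples and reports structural differences, ignoring values.
function diffShape(before: Json, after: Json, path: string = "$"): ContractChange[] {
  const beforeKind = kindOf(before);
  if (beforeKind !== kindOf(after)) return [{ path, kind: "type-changed" }];

  if (beforeKind === "array") {
    const b = before as Json[];
    const a = after as Json[];
    // Simplification: the first element stands in for the whole array.
    if (b.length === 0 || a.length === 0) return [];
    return diffShape(b[0] as Json, a[0] as Json, `${path}[]`);
  }

  if (beforeKind === "object") {
    const b = before as { [key: string]: Json };
    const a = after as { [key: string]: Json };
    const changes: ContractChange[] = [];
    for (const key of Object.keys(b)) {
      if (!(key in a)) changes.push({ path: `${path}.${key}`, kind: "removed" });
      else changes.push(...diffShape(b[key] as Json, a[key] as Json, `${path}.${key}`));
    }
    for (const key of Object.keys(a)) {
      if (!(key in b)) changes.push({ path: `${path}.${key}`, kind: "added" });
    }
    return changes;
  }

  return []; // same primitive kind; values are out of scope for a shape diff
}

// Additive: fields were only added. Anything removed or retyped is mutative.
function isAdditiveOnly(changes: ContractChange[]): boolean {
  return changes.every((change) => change.kind === "added");
}
```

A shape diff cannot see a silent-semantics change: a `count` that stops including deleted rows has the same shape before and after, so that half of the check still needs a reviewer reading the query.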
Do not just concatenate the four outputs. In order:
- Dedup by fingerprint normalize(file) + line_bucket(line, ±3) + normalize(title): the
  same bug flagged by two lanes is one finding, not two.
- … residual risks / advisory tier in the
  report, not the main tables.

Do not re-verify your own synthesis: the orchestrator that merged the findings is not an
independent second opinion (it catches a wrong fact but not its own bias). Instead, for every
surviving P0 and P1, spawn a fresh validator Agent with no commitment to the
finding ("False positives are common; do not feel pressure to confirm"). One validator per
finding (a single batched validator recreates the bias). The validator reads the code path
cold and returns confirm / downgrade / reject with its own evidence.
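The synthesis dedup and the choice of which findings go to the validator wave can be sketched together. Everything here is an illustrative assumption: the real fingerprint lives in references/scoring-gating-validation.md, `line_bucket(line, ±3)` is read as "lines within 3 of each other match", and `LaneFinding`, `MergedFinding`, `dedupe`, and `validationTargets` are hypothetical names.

```typescript
type Severity = "P0" | "P1" | "P2";
type Effort = "low" | "medium" | "high" | "max";

interface LaneFinding {
  severity: Severity;
  title: string;
  file: string;
  line: number;
  lane: string;
}

interface MergedFinding extends LaneFinding {
  lanes: string[]; // every lane that flagged this bug
}

function normalizeFile(file: string): string {
  return file.trim().replace(/\\/g, "/").replace(/^\.\//, "");
}

function normalizeTitle(title: string): string {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Assumed reading of the fingerprint: same file, same title, lines within 3.
function isSameFinding(a: LaneFinding, b: LaneFinding): boolean {
  return (
    normalizeFile(a.file) === normalizeFile(b.file) &&
    normalizeTitle(a.title) === normalizeTitle(b.title) &&
    Math.abs(a.line - b.line) <= 3
  );
}

// Greedy merge against the first-seen copy; its severity and line are kept.
function dedupe(findings: LaneFinding[]): MergedFinding[] {
  const merged: MergedFinding[] = [];
  for (const finding of findings) {
    const existing = merged.find((m) => isSameFinding(m, finding));
    if (existing === undefined) {
      merged.push({ ...finding, lanes: [finding.lane] });
    } else if (existing.lanes.indexOf(finding.lane) === -1) {
      existing.lanes.push(finding.lane);
    }
  }
  return merged;
}

// Each returned finding gets its own fresh validator; nothing is batched.
function validationTargets(findings: MergedFinding[], effort: Effort): MergedFinding[] {
  const wanted: Severity[] =
    effort === "high" || effort === "max" ? ["P0", "P1", "P2"] : ["P0", "P1"];
  return findings.filter((f) => wanted.indexOf(f.severity) !== -1);
}
```

Recording `lanes` on the merged finding keeps the information an agreement-promotion step would need, without this sketch guessing at how promotion is scored.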
- At low/medium effort, validate P0/P1; at high/max, also validate P2.

Write the report to docs/audits/YYYY-MM-DD-rigorous-review.md.
If the user passes apply-safe (or asks afterward):
- Apply only safe findings, highest-confidence first. One commit per concern,
  conventional message. Never push.
- Never apply a gated item. If a "safe" fix turns out to risk observable change once you're
  in the code, stop and re-classify it as gated.
- … gated, not a quiet cleanup. Same for any API-contract
  shift.

First release of rigorous-review. Distilled from a working prototype (internal-hardening-audit)
that was forged by running it for real on signupspark and swim-records (Fable/Opus — good
results; it surfaced a committed DB credential the dedicated security pass had missed), then
hardened with a 3-agent research synthesis across CE reviewer agents, the built-in
code-review/security-review, Vercel skills, Anthropic security-review, Matt Pocock's
architecture skill, OWASP Top 10:2025, Fowler smells, and Core Web Vitals. The design, each piece
traceable to a source:
- Two-axis scoring with a gate (ce-code-review 5-anchor confidence; Anthropic's ≥0.8
  report threshold): severity alone can't express "critical but unverified"; the gate is the
  main precision lever (suppress <75 except P0-at-50).
- A correctness lane: code-review and Anthropic both
  centre correctness.
- Checklists and protocols live in references/ to keep SKILL.md lean.

Empirical precision/recall validation is pending the next real audit run.
tests/eval.sh asserts the design contract structurally (two axes + gate, P0-at-50 exception,
suppression lists, independent validator wave, dedup + promotion, four lanes incl. correctness,
API-contract check, OWASP/CWE + SSRF/deserialization, Core Web Vitals, real dead-code tooling,
effort dial, model-tiering, behavior-preservation verifier, prod-read-only, safe-vs-gated).
Behavioral validation is the next real audit run (re-audit signupspark/swim-records and compare
finding precision/recall against the prototype baseline).
npx claudepluginhub colorprint/my-thinking-skills --plugin ngmeyer-skills