Run an exhaustive multi-wave codebase audit with an overlapping pipeline, auto-slicing, early flagging, cross-cutting verification, and a graded scorecard. Use when the user asks to audit, review, assess, grade, or evaluate the quality of a codebase, project, or setup. Also use when the user asks "how good is this", "what should I fix", "what's broken", "rate my setup", or any variant of wanting a brutally honest assessment. Produces both terminal output and a report file in .claude/reports/. This skill is aggressive — it reads everything before judging anything. Always use this over a quick surface-level review.
Slash command: /parallel-hardening-loop:deep-audit
An overlapping-pipeline codebase audit that auto-slices the repo, reads every file with canary-gated parallel agents, verifies claims before readers finish, and produces a graded scorecard with prioritized action items and a persistent report file.
Surface-level audits lie. Module-level readers make claims without verifying code. In production runs, readers claimed ~10 critical bugs — ~20% were false positives that an independent verifier caught. The only way to get truth is: read everything first, verify with targeted agents second, grade with isolated judges third.
v2 adds overlapping execution — verification starts after 3 readers complete instead of waiting for all — cutting wall-clock time by ~30-40% while preserving the verification discipline that makes findings trustworthy.
This skill burns tokens aggressively and intentionally. Every finding is verified before it reaches the user.
Phase 0: Pre-Flight Telemetry (inline, ~20-45s)
|--- runs 9 static analysis tools in parallel via scripts/preflight.sh
|--- merges outputs into .audit/preflight/findings.json (unified schema)
|--- prints tool summary to terminal
v
Phase 0.5: Baseline Health (inline, ~30s)
|
v
Reading Wave (6-8 background agents, auto-sliced)
|--- each reader receives per-file telemetry dossier from Phase 0
|--- as each reader completes: surface [UNVERIFIED] findings to terminal
|--- after 3+ readers complete: trigger verification wave
v
Verification Wave (9-12 background agents, overlapping with late readers)
|--- each verifier receives category-filtered telemetry as ground truth
|--- Bug Verifier receives late reader findings via SendMessage
|--- as each verifier completes: print confirmed/rejected findings
v
Judges (2 blind judges, after all verifiers complete)
|--- judges receive full telemetry summary in evidence packet
v
Report (inline synthesis — terminal summary + markdown file)
v
Codex Sprint Plan (1 background agent — reviews findings, generates sprint doc)
|--- print full sprint plan to terminal when complete
v
Sprint Plan Review (1 background agent — adversarial review of the plan)
|--- print review verdict to terminal: READY TO EXECUTE or NEEDS REVISION
Wall-clock improvement: ~30-40% faster than v1 due to wave overlap.
Run deterministic static analysis tools before any AI agent starts. This produces ground truth that agents consume — they never need to rediscover what tools already proved.
Portability note (plugin version): The reference scripts/preflight.sh ships with claude-super-setup and assumes a TypeScript project. When this skill runs in another repo:
- If scripts/preflight.sh exists in the project, run it.
- Otherwise, run whichever of these tools are installed and write their output to .audit/preflight/: semgrep --config=auto --json, gitleaks detect --report-format json, trivy fs --format json, npm audit --json / pip-audit -f json. Skip silently when a tool is absent.
- If no tooling is installed, record pre-flight: skipped (no tooling installed) in the report, and let the LLM-driven reading + verification waves carry the load. Do not block the audit on missing telemetry.
When preflight runs, it executes up to 9 tools in parallel.
The script merges all outputs into .audit/preflight/findings.json, a single report in a unified schema.
Print the telemetry summary to terminal:
[preflight] 9 tools completed in 45s
[preflight] 1731 findings: 38 critical, 11 high, 924 medium, 755 low
[preflight] Type coverage: 100% | Dead code: 217 unused exports, 497 unused types | Duplication: 0.53%
[preflight] Architecture: 0 circular deps, 0 boundary violations
[preflight] Telemetry written to .audit/preflight/findings.json
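The severity line in the summary above can be derived from the merged findings. The following is a minimal sketch, assuming each finding is a dict with a `severity` field and that findings.json holds them under a top-level `findings` key; both names are assumptions, since the actual schema is not shown here.

```python
import json
from collections import Counter
from pathlib import Path

SEVERITIES = ("critical", "high", "medium", "low")


def summarize_findings(findings):
    """Build the one-line severity summary from a list of finding dicts."""
    counts = Counter(f.get("severity") for f in findings)
    parts = ", ".join(f"{counts.get(s, 0)} {s}" for s in SEVERITIES)
    return f"[preflight] {len(findings)} findings: {parts}"


def load_findings(path=".audit/preflight/findings.json"):
    """Load findings from disk; the "findings" key is a hypothetical schema detail."""
    p = Path(path)
    if not p.exists():
        return []  # pre-flight was skipped, so there is nothing to summarize
    return json.loads(p.read_text()).get("findings", [])
```

Calling `print(summarize_findings(load_findings()))` would print the summary line, or a zero-count line when pre-flight was skipped.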
Read .audit/preflight/findings.json and hold it in context for injection into downstream agents.
Run the baseline health check inline (no agents). Adapt commands to the project's stack; detect the stack from files in the repo root:
- Typecheck: npm run typecheck (Node), mypy . / ruff check . (Python), cargo check (Rust), go vet ./... (Go). Run whichever exists; tail last 10 lines.
- Tests: npm test, pytest -x, cargo test, go test ./.... Tail last 10 lines.
- Source file count: find <src-dir> -type f \( -name '*.ts' -o -name '*.py' -o -name '*.rs' -o -name '*.go' \) ! -name '*.test.*' ! -name '*_test.*' | wc -l. Detect <src-dir> from src/, lib/, api/, project layout.
- Test file count: find filtered for *.test.* / *_test.* / tests/ dir.
- Git history: git log --oneline | wc -l and git log --format="%ai" --reverse | head -1.
- Dependencies: package.json, pyproject.toml, requirements.txt, Cargo.toml, go.mod (top 40 lines).
- Project docs: CLAUDE.md, AGENTS.md, README.md (top 60 lines), first one found.
This gives you the baseline numbers to share with the judges.
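Stack detection from repo-root files can be sketched as a lookup from marker file to commands. This is an illustrative sketch only: the `STACKS` table and `detect_commands` are hypothetical names, the commands are the ones named above, and picking the first marker found is an assumption about how mixed-stack repos would be handled.

```python
from pathlib import Path

# Marker file in the repo root -> (typecheck command, test command).
STACKS = {
    "package.json": ("npm run typecheck", "npm test"),
    "pyproject.toml": ("mypy .", "pytest -x"),
    "requirements.txt": ("mypy .", "pytest -x"),
    "Cargo.toml": ("cargo check", "cargo test"),
    "go.mod": ("go vet ./...", "go test ./..."),
}


def detect_commands(root="."):
    """Return (typecheck, test) for the first marker file present, else None."""
    for marker, commands in STACKS.items():
        if (Path(root) / marker).is_file():
            return commands
    return None  # unknown stack: skip the health commands rather than guess
```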
Before launching readers, dynamically partition the codebase:
- Count .ts files per directory under src/ (recursive, excluding .test.ts)
- If a ## Project Hints section exists at the bottom of this file, use the Module Groups to guide grouping and keep listed groups together
Slice  Directories                          Files
─────  ───────────────────────────────────  ─────
1      core/, peer-runtime.ts               28
2      cli/, wizard/                        22
3      swarm/                               45
4      warroom/, adversarial/               38
5      mesh/, hive/, group/                 31
6      intel/, docs/, crowdsource/, share/  19
7      generator/, templates/, handoff/     35
8      guardian/, health/, plugin/, ...     27
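One way to produce a partition like the table above is greedy balancing: place the largest unit into the currently lightest slice. The sketch below is an assumption about how auto-slicing could work, not the skill's actual algorithm; `slice_directories` and its parameters are hypothetical names.

```python
import heapq


def slice_directories(file_counts, keep_together=(), num_slices=8):
    """Partition directories into slices with roughly equal file counts.

    file_counts: {directory: number of source files}
    keep_together: directory groups (e.g. from Module Groups) that must share a slice
    """
    units, grouped = [], set()
    for group in keep_together:
        members = [d for d in group if d in file_counts and d not in grouped]
        if members:
            units.append(members)
            grouped.update(members)
    units.extend([d] for d in file_counts if d not in grouped)

    def size(unit):
        return sum(file_counts[d] for d in unit)

    # Largest units first, each into the slice with the fewest files so far.
    units.sort(key=size, reverse=True)
    num_slices = max(1, min(num_slices, len(units)))
    heap = [(0, i) for i in range(num_slices)]
    slices = [[] for _ in range(num_slices)]
    for unit in units:
        total, i = heapq.heappop(heap)
        slices[i].extend(unit)
        heapq.heappush(heap, (total + size(unit), i))
    return slices
```

Greedy balancing does not guarantee an optimal split, but it keeps any single reader from receiving a disproportionate share of files.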
Launch 6-8 feature-dev:code-explorer agents in background, one per slice.
Canary pattern: Launch 1 reader first on the smallest slice. If it returns in <30s with garbage, abort the audit. If healthy, launch the remaining readers.
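The canary gate can be sketched as two small helpers. What counts as "garbage" is not defined here, so the heuristic below (empty output, or no file:line reference) is purely an assumption, and `smallest_slice`, `looks_like_garbage`, and `canary_passes` are hypothetical names.

```python
import re


def smallest_slice(slices, file_counts):
    """Pick the slice with the fewest files for the canary reader."""
    return min(slices, key=lambda s: sum(file_counts[d] for d in s))


def looks_like_garbage(output):
    """Assumed heuristic: no output, or no file:line reference anywhere in it."""
    return not output.strip() or re.search(r"\S+:\d+", output) is None


def canary_passes(elapsed_seconds, output, min_seconds=30.0):
    """Abort only when the reader returned fast AND its output is garbage."""
    return not (elapsed_seconds < min_seconds and looks_like_garbage(output))
```

If `canary_passes` returns False, the audit stops before the remaining readers spend any tokens.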
Reader prompt template:
## Pre-Flight Telemetry for Your Slice
The following deterministic findings were detected by static analysis tools BEFORE you started reading. These are ground truth — do not re-discover them. Instead:
- CONFIRM or REFUTE each finding based on your contextual reading
- Look for issues that static tools CANNOT find (business logic errors, semantic bugs, race conditions, incorrect error handling)
- Use the metrics to prioritize which files need deepest reading
[INJECT: fileDossiers for this slice's files from findings.json]
---
Read EVERY file in [directories] thoroughly — full files, not just tops.
[If Project Hints contain Known Complexity for this slice:]
Pay extra attention to: [hint text]
For each file report:
1. What it actually does (not what it claims to do)
2. Dead code, unused exports, stubs that don't do anything
3. Bugs, logic errors, missing edge cases
4. Integration gaps — functions that exist but nothing calls them
5. Missing error handling or silent failures
6. What would break if used in production
Be ruthlessly honest. I need gaps, not praise. Provide file:line references for every finding.
Early flagging: As each reader completes:
- Surface its findings to the terminal with an [UNVERIFIED] tag
Progress bar:
Reading: [####------] 4/8 complete | 12 unverified findings surfaced
Trigger: 3+ readers completed. Do NOT wait for all readers.
Launch verification agents as soon as the trigger fires. Late reader findings are handled via SendMessage.
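The overlap rule can be expressed as a small planner that turns reader completions into coordinator actions. This is a sketch under assumptions: the action names are hypothetical labels, and the real coordinator would launch agents and call SendMessage rather than return tuples.

```python
def plan_overlap(reader_completions, trigger_after=3):
    """Map an ordered list of (reader_id, findings) to coordinator actions."""
    actions = []
    verification_started = False
    for done, (reader_id, findings) in enumerate(reader_completions, start=1):
        # Every completed reader is surfaced immediately as unverified.
        actions.append(("surface_unverified", reader_id, len(findings)))
        if verification_started:
            # Late reader: forward its findings to the running Bug Verifier.
            actions.append(("send_to_bug_verifier", reader_id))
        elif done >= trigger_after:
            actions.append(("launch_verification",))
            verification_started = True
    return actions
```

With four readers and the default trigger, verification launches right after the third completes and the fourth is forwarded as a late reader.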
Mandatory agents (always launch):
| # | Agent | Type | What It Does |
|---|---|---|---|
| 1 | Bug Verifier | voltagent-qa-sec:debugger | Top 10 claims from readers. CONFIRMED / FALSE POSITIVE / PARTIALLY CONFIRMED. Receives late reader findings via SendMessage. |
| 2 | Security Audit | voltagent-qa-sec:penetration-tester | Command injection, path traversal, secret leakage, prototype pollution, supply chain, credential storage |
| 3 | Architecture Scanner | Architecture Scanner | Import graph, circular deps, layer violations, god files |
| 4 | Dead Code Scanner | Dead Code Scanner | Verify claimed dead modules by grepping for imports |
| 5 | Race Condition Audit | voltagent-qa-sec:chaos-engineer | TOCTOU, concurrent access, orphaned locks, signal handlers |
| 6 | Error Handling Audit | voltagent-qa-sec:error-detective | Swallowed errors, unhandled rejections, partial state on failure |
| 7 | Test Gap Scan | Test Gap Scanner | Untested modules, vacuous tests, rule violations |
| 8 | Duplicate Code Scan | Simplifier Scanner | Cross-module duplicated logic |
| 9 | Performance Scan | Performance Scanner | Sync I/O in async, O(n^2), unbounded memory |
| 10 | Config Consistency | feature-dev:code-explorer | Hardcoded model IDs, timeouts, env vars, cwd vs projectPath |
Optional agents (triggered by Project Hints Optional Audits section):
| # | Agent | Trigger |
|---|---|---|
| 11 | Hook System Audit | hooks: yes in hints |
| 12 | Token/Budget Audit | token-budget: yes in hints |
| 13 | Plugin System Audit | plugin: yes in hints |
Telemetry injection for verifiers: Each verification agent receives category-filtered findings from Phase 0 as ground truth:
| Verifier | Telemetry Categories |
|---|---|
| Bug Verifier | All categories (cross-references reader claims against tool findings) |
| Security Audit | security, secrets, vulnerability, supply-chain |
| Architecture Scanner | architecture (circular deps, boundary violations from dep-cruiser) |
| Dead Code Scanner | dead-code (unused exports/files/deps from Knip) |
| Duplicate Code Scan | duplication (clone pairs from jscpd) |
| Performance Scan | complexity (cognitive/cyclomatic from Oxlint, FTA scores) |
| Config Consistency | custom-rule (hardcoded model IDs from ast-grep) |
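The table above maps directly onto a category filter. A minimal sketch, assuming each finding is a dict with a `category` field (the field name is an assumption) and that verifiers not listed in the table receive no telemetry:

```python
VERIFIER_CATEGORIES = {
    "Security Audit": {"security", "secrets", "vulnerability", "supply-chain"},
    "Architecture Scanner": {"architecture"},
    "Dead Code Scanner": {"dead-code"},
    "Duplicate Code Scan": {"duplication"},
    "Performance Scan": {"complexity"},
    "Config Consistency": {"custom-rule"},
}


def telemetry_for(verifier, findings):
    """Return the Phase 0 findings a given verifier should receive."""
    if verifier == "Bug Verifier":
        return list(findings)  # cross-references reader claims against everything
    wanted = VERIFIER_CATEGORIES.get(verifier, set())
    return [f for f in findings if f.get("category") in wanted]
```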
Verifiers use telemetry as a starting point, not a ceiling. Their job is to:
Late reader handling: When a reader finishes after verification has started, send its findings to the Bug Verifier via SendMessage. Other still-running verifiers also receive them via SendMessage. Already-completed verifiers miss late data, which is acceptable because the Bug Verifier is the longest-running agent and catches straggler findings.
Progress bar:
Verifying: [######----] 7/12 complete | 8 confirmed | 2 false positives | 3 pending
Trigger: All verification agents complete.
Launch 2 blind voltagent-qa-sec:architect-reviewer agents in parallel. Each gets the same evidence packet assembled from pre-flight numbers and verified findings only. Judges do NOT see each other's scores. Judges never read source code — they score from the evidence packet only.
Evidence packet:
Rubric (10 dimensions, each 1-10):
| Dimension | What It Measures |
|---|---|
| Architecture | Module boundaries, dependency direction, layer discipline |
| Code Quality | Error handling, type safety, algorithmic correctness |
| Test Suite | Coverage, assertion quality, integration tests |
| Security Posture | Input validation, secret handling, sandboxing |
| CI/Infra | Build pipeline, quality gates, deployment readiness |
| Dogfooding | Does the system use its own tools? |
| Production Readiness | Ship tomorrow? Error recovery, monitoring, rollback |
| DX / Onboarding | New developer productive in 1 day? |
| Ambition | Scope, novelty, difficulty relative to team size |
| Debt Ratio | Dead code, stubs, orphaned features |
Scoring: Median of 2 judges per dimension. Divergence of 3+ points on any dimension = "disputed" flag.
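The scoring rule can be sketched as follows. With two judges the median is the midpoint of their scores. The `overall` helper applies the 2x weighting on architecture and code quality from the judge prompt to the combined medians, which is an assumption; the function names are hypothetical.

```python
from statistics import median

DOUBLE_WEIGHT = {"Architecture", "Code Quality"}


def combine_scores(judge_a, judge_b, dispute_delta=3):
    """Per dimension: median of the two judges plus a disputed flag."""
    combined = {}
    for dim in judge_a:
        a, b = judge_a[dim], judge_b[dim]
        combined[dim] = {
            "score": median([a, b]),
            "disputed": abs(a - b) >= dispute_delta,
        }
    return combined


def overall(combined):
    """Weighted mean of the combined scores, rounded to one decimal."""
    total = weight_sum = 0
    for dim, entry in combined.items():
        weight = 2 if dim in DOUBLE_WEIGHT else 1
        total += weight * entry["score"]
        weight_sum += weight
    return round(total / weight_sum, 1)
```

For example, scores of 4 and 8 on Test Suite combine to 6.0 and are flagged as disputed because the judges differ by more than 3 points.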
Benchmark comparison prompt for each judge:
You are grading this codebase on a rubric. Score each dimension 1-10.
For context, here is how comparable setups typically score on these dimensions:
- A senior hedge fund quant's personal trading system: 9-10 on security, 8-9 on tests, 7-8 on DX
- A YC-funded startup's core product at Series A: 7-8 on architecture, 6-7 on tests, 8-9 on ambition
- A top open-source CLI tool (e.g., mise, biome, turborepo): 9-10 on DX, 8-9 on CI, 7-8 on architecture
- A power Claude Code user's personal setup (4000+ sessions): 8-9 on dogfooding, 7-8 on ambition, 6-7 on production readiness
- A FAANG-tier internal developer tool: 9-10 on CI, 8-9 on tests, 7-8 on security
Score honestly. A 5 is average. A 7 is good. A 9 means best-in-class. Do not inflate.
For each dimension, provide:
- Score (1-10)
- One sentence justification
- One specific improvement that would raise the score by 1 point
Also provide:
- Overall weighted score (architecture and code quality weighted 2x)
- Percentile estimate: where does this codebase sit relative to all codebases you've seen of similar scope?
- Top 3 things to work on (from the evidence, not generic advice)
Produced inline (no synthesis agent).
Terminal output:
Report file: Write to .claude/reports/deep-audit-YYYY-MM-DD.md. Contains everything in terminal output plus:
Terminal format:
# Deep Audit Report
## Scorecard
| Dimension | Score | Benchmark |
| ------------ | ----- | ---------------- |
| Architecture | 7 | Top OSS CLI: 7-8 |
| Code Quality | 6 | FAANG tool: 7-8 |
| ... | ... | ... |
## Overall Grade: X.X/10 (Xth percentile)
## Tier 1: Fix Now
- [BUG] description — file:line — verifier: CONFIRMED
## Tier 2: High-Impact Work
- [ARCH] description — affected modules
## Tier 3: Structural Cleanup
- [DEBT] description — scope estimate
## Tier 4: Wire or Kill
- [DEAD] description — recommendation
## What To Do In Order
1. Most impactful item
2. Second most impactful
3. ...
Report file format (.claude/reports/deep-audit-YYYY-MM-DD.md):
# Deep Audit Report — YYYY-MM-DD
## Baseline
- Source files: N | Test files: N | Commits: N
- Typecheck: PASS/FAIL | Tests: PASS/FAIL
- Dependencies: N | First commit: YYYY-MM-DD
## Pre-Flight Telemetry
- Tools ran: [list] | Tools skipped: [list]
- Total findings: N | By severity: critical N, high N, medium N, low N
- Type coverage: N% (N any types)
- Duplication: N% (N clones)
- Architecture: N circular deps, N boundary violations
- Dead code: N unused files, N unused exports, N unused deps, N unused types
- Complexity hotspots: [files with cognitive > 15]
- Supply chain: N alerts (if Socket ran)
## Scorecard
[Full table with scores, justifications, disputed flags]
## Overall Grade: X.X/10 (Xth percentile)
## Confirmed Findings
### Tier 1: Fix Now
- **[BUG]** description
- File: path:line
- Code: `quoted snippet`
- Verifier: CONFIRMED
- Reader: Agent N | Verifier: Agent N
### Tier 2-4: ...
## False Positives
- **[CLEARED]** description
- Claimed by: Agent N
- Checked by: Bug Verifier
- Why cleared: explanation
## Judge Scorecards
### Judge 1
[Per-dimension scores and justifications]
### Judge 2
[Per-dimension scores and justifications]
### Disputed Dimensions
- Dimension: Judge 1 scored X, Judge 2 scored Y (delta: Z)
## Agent Roster
| Agent | Type | Slice/Scope | Status |
| ----- | ---- | ----------- | ------ |
| ... | ... | ... | ... |
## Unverified Late Findings
[Findings from late readers that were not verified]
- Reader findings are tagged [UNVERIFIED]. Verification confirms or rejects. Never present unverified findings as confirmed.
- The report file (.claude/reports/deep-audit-YYYY-MM-DD.md) is the reference artifact. Terminal output is the action summary.
- Launch agents with run_in_background: true. Launch everything in parallel that can be parallel.
The plugin version of this skill ships with no built-in hints. The host project supplies them by creating .claude/audit-hints.md with the template below. If the file exists, treat its Module Groups, Known Complexity, and Optional Audits sections as additional context for slice grouping, reader prompts, and verifier triggers.
If no hints file exists, infer module groups from top-level directory layout and skip the optional audits.
### Module Groups
- <dir1>, <dir2> — short description of what these modules do
- <dir3>, <dir4> — another group
### Known Complexity
- path/to/file.ext is N lines — why it's complex / what to watch for
### Optional Audits
- hooks: yes|no
- token-budget: yes|no
- plugin: yes|no
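A hints file that follows the template above could be read with a small parser. This is a sketch, assuming the section headings and the `key: yes|no` lines appear exactly as in the template; `parse_hints` is a hypothetical helper, not part of the plugin.

```python
def parse_hints(text):
    """Parse the audit-hints template into groups, complexity notes, and audit flags."""
    sections = {"Module Groups": [], "Known Complexity": [], "Optional Audits": []}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            current = line.lstrip("#").strip()
        elif line.startswith("- ") and current in sections:
            sections[current].append(line[2:].strip())

    # "dir1, dir2 <dash> description": keep only the directory list.
    groups = [
        [d.strip() for d in item.split("\u2014")[0].split(",")]
        for item in sections["Module Groups"]
    ]
    audits = {}
    for item in sections["Optional Audits"]:
        key, _, value = item.partition(":")
        audits[key.strip()] = value.strip().lower() == "yes"
    return {
        "module_groups": groups,
        "known_complexity": sections["Known Complexity"],
        "optional_audits": audits,
    }
```

The `module_groups` value has the shape a slicer would need to keep listed directories together, and `optional_audits` decides which of the optional verifiers to launch.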
npx claudepluginhub griosai/griosai-marketplace --plugin parallel-hardening-loop