Skill

deep-audit

Run an exhaustive multi-wave codebase audit with an overlapping pipeline, auto-slicing, early flagging, cross-cutting verification, and a graded scorecard. Use when the user asks to audit, review, assess, grade, or evaluate the quality of a codebase, project, or setup. Also use when the user asks "how good is this", "what should I fix", "what's broken", "rate my setup", or any variant of wanting a brutally honest assessment. Produces both terminal output and a report file in .claude/reports/. This skill is aggressive — it reads everything before judging anything. Always use this over a quick surface-level review.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/parallel-hardening-loop:deep-audit

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

An overlapping-pipeline codebase audit that auto-slices the repo, reads every file with canary-gated parallel agents, verifies claims before readers finish, and produces a graded scorecard with prioritized action items and a persistent report file.

Supporting Files

deep-audit-hints.md

SKILL.md

486 lines · ~5.9k tokens (exceeds 5k compaction limit)

Stats

Parent stars: 0

Maintenance: Good

Last Commit: May 3, 2026


Deep Audit v2

Why This Exists

Surface-level audits lie. Module-level readers make claims without verifying code. In production runs, readers claimed ~10 critical bugs — ~20% were false positives that an independent verifier caught. The only way to get truth is: read everything first, verify with targeted agents second, grade with isolated judges third.

v2 adds overlapping execution — verification starts after 3 readers complete instead of waiting for all — cutting wall-clock time by ~30-40% while preserving the verification discipline that makes findings trustworthy.

This skill burns tokens aggressively and intentionally. Every finding is verified before it reaches the user.

When to Use

- User asks to audit, review, assess, or grade any codebase
- User asks "how good is this" or "what should I work on"
- User wants a brutally honest assessment of code quality
- After a major milestone to check overall health
- When comparing a setup against industry best practices

Pipeline Architecture

Phase 0: Pre-Flight Telemetry (inline, ~20-45s)
    |--- runs 9 static analysis tools in parallel via scripts/preflight.sh
    |--- merges outputs into .audit/preflight/findings.json (unified schema)
    |--- prints tool summary to terminal
    v
Phase 0.5: Baseline Health (inline, ~30s)
    |
    v
Reading Wave (6-8 background agents, auto-sliced)
    |--- each reader receives per-file telemetry dossier from Phase 0
    |--- as each reader completes: surface [UNVERIFIED] findings to terminal
    |--- after 3+ readers complete: trigger verification wave
    v
Verification Wave (10-13 background agents, overlapping with late readers)
    |--- each verifier receives category-filtered telemetry as ground truth
    |--- Bug Verifier receives late reader findings via SendMessage
    |--- as each verifier completes: print confirmed/rejected findings
    v
Judges (2 blind judges, after all verifiers complete)
    |--- judges receive full telemetry summary in evidence packet
    v
Report (inline synthesis — terminal summary + markdown file)
    v
Codex Sprint Plan (1 background agent — reviews findings, generates sprint doc)
    |--- print full sprint plan to terminal when complete
    v
Sprint Plan Review (1 background agent — adversarial review of the plan)
    |--- print review verdict to terminal: READY TO EXECUTE or NEEDS REVISION

Wall-clock improvement: ~30-40% faster than v1 due to wave overlap.

Execution

Phase 0: Pre-Flight Telemetry (optional — skip gracefully if unavailable)

Run deterministic static analysis tools before any AI agent starts. This produces ground truth that agents consume — they never need to rediscover what tools already proved.

Portability note (plugin version): The reference scripts/preflight.sh ships with claude-super-setup and assumes a TypeScript project. When this skill runs in another repo:

- If scripts/preflight.sh exists in the project, run it.
- Otherwise, run whichever of the following are installed and write per-tool JSON to .audit/preflight/: `semgrep --config=auto --json`, `gitleaks detect --report-format json`, `trivy fs --format json .` (trivy fs needs a scan target), `npm audit --json` / `pip-audit -f json`. Skip silently when a tool is absent.
- If no tools are available at all, skip Phase 0 entirely, set pre-flight: skipped (no tooling installed) in the report, and let the LLM-driven reading + verification waves carry the load. Do not block the audit on missing telemetry.
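A minimal sketch of the fallback selection described above, assuming the orchestration happens in Python. `FALLBACK_TOOLS`, `select_tools`, and `preflight_status` are hypothetical names, not part of the skill. The command lines come from the portability note, with `.` added as the trivy scan target as an assumption.

```python
import shutil
from typing import Callable, Optional

# Binary name -> command line, taken from the portability note.
# The trailing "." for trivy is an assumption: `trivy fs` expects a target path.
FALLBACK_TOOLS = {
    "semgrep": ["semgrep", "--config=auto", "--json"],
    "gitleaks": ["gitleaks", "detect", "--report-format", "json"],
    "trivy": ["trivy", "fs", "--format", "json", "."],
    "npm": ["npm", "audit", "--json"],
    "pip-audit": ["pip-audit", "-f", "json"],
}


def select_tools(which: Callable[[str], Optional[str]] = shutil.which) -> dict:
    """Return the commands whose binary is on PATH; absent tools are skipped silently."""
    return {name: cmd for name, cmd in FALLBACK_TOOLS.items() if which(name)}


def preflight_status(selected: dict) -> str:
    """Status line for the report, covering the 'no tooling installed' case."""
    if not selected:
        return "pre-flight: skipped (no tooling installed)"
    return "pre-flight: ran " + ", ".join(sorted(selected))
```

Passing `which` as a parameter keeps the selection logic testable without the tools installed. Each selected command would then be run and its JSON written under .audit/preflight/.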

When preflight runs, it executes up to 9 tools in parallel:

| Tool | Detects | Output |
| --- | --- | --- |
| Semgrep | SAST (OWASP, injection, XSS) | JSON |
| Gitleaks | Secrets (API keys, tokens, passwords) | JSON |
| Trivy | Dependency CVEs + license compliance | JSON |
| Knip | Dead code, unused exports/deps/types | JSON |
| dependency-cruiser | Architecture rules, circular deps | JSON |
| jscpd | Code duplication (Rabin-Karp) | JSON |
| Oxlint | Complexity, quality, 720+ rules | JSON |
| type-coverage | % explicit types vs any | JSON |
| FTA | Per-file maintainability index | structured text |

The script merges all outputs into .audit/preflight/findings.json, a unified report with:
- findings[]: every finding with tool, category, severity, file, line, ruleId, message
- fileDossiers{}: per-file dossiers grouping all findings + metrics for that file
- summary: aggregate counts by severity, category, and tool
- duplications[]: cross-file clone pairs
- circularDependencies[][]: circular dependency chains

Print the telemetry summary to terminal:

[preflight] 9 tools completed in 45s
[preflight] 1731 findings: 38 critical, 11 high, 924 medium, 755 low
[preflight] Type coverage: 100% | Dead code: 217 unused exports, 497 unused types | Duplication: 0.53%
[preflight] Architecture: 0 circular deps, 0 boundary violations
[preflight] Telemetry written to .audit/preflight/findings.json

Load the telemetry: Read .audit/preflight/findings.json and hold it in context for injection into downstream agents.
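To make the shape of findings.json concrete, here is a rough sketch of how `fileDossiers` and `summary` could be derived from `findings[]`, and how the dossiers for one reader slice could be picked out. The field names follow the schema listed above. `build_report` and `dossiers_for_slice` are hypothetical helpers, and the dossiers here hold findings only (the per-file metrics are omitted).

```python
from collections import Counter, defaultdict


def build_report(findings: list[dict]) -> dict:
    """Group findings per file and count them by severity, category, and tool."""
    dossiers: dict[str, list[dict]] = defaultdict(list)
    for finding in findings:
        dossiers[finding["file"]].append(finding)
    summary = {
        key: dict(Counter(f[key] for f in findings))
        for key in ("severity", "category", "tool")
    }
    return {"findings": findings, "fileDossiers": dict(dossiers), "summary": summary}


def dossiers_for_slice(report: dict, slice_dirs: list[str]) -> dict:
    """Select the dossiers whose file path falls under one of the slice's directories."""
    prefixes = tuple(d.rstrip("/") + "/" for d in slice_dirs)
    return {
        path: items
        for path, items in report["fileDossiers"].items()
        if path.startswith(prefixes)
    }
```

The output of `dossiers_for_slice` is what would be injected into a reader prompt for that slice.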

Phase 0.5: Baseline Health

Run inline (no agents). Adapt commands to the project's stack — detect via files in the repo root:

- Type/lint check: npm run typecheck (Node), mypy . / ruff check . (Python), cargo check (Rust), go vet ./... (Go). Run whichever exists; tail last 10 lines.
- Tests: npm test, pytest -x, cargo test, go test ./.... Tail last 10 lines.
- Source file count: find <src-dir> -type f \( -name '*.ts' -o -name '*.py' -o -name '*.rs' -o -name '*.go' \) ! -name '*.test.*' ! -name '*_test.*' | wc -l. Detect <src-dir> from src/, lib/, api/, project layout.
- Test file count: same find filtered for *.test.* / *_test.* / tests/ dir.
- Git stats: git log --oneline | wc -l and git log --format="%ai" --reverse | head -1.
- Manifests: read whichever exist: package.json, pyproject.toml, requirements.txt, Cargo.toml, go.mod (top 40 lines).
- Project context: read the first one found of CLAUDE.md, AGENTS.md, README.md (top 60 lines).

This gives you the baseline numbers to share with the judges.
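The stack detection can be sketched as a lookup from manifest files to commands. This is an illustration only: `STACKS` and `detect_commands` are hypothetical names, and picking `mypy .` over `ruff check .` for Python is an arbitrary choice for the example.

```python
# Manifest file -> (type/lint check, test command), taken from the list above.
# For Python the list offers mypy or ruff; mypy is used here only as an example.
STACKS = {
    "package.json": ("npm run typecheck", "npm test"),
    "pyproject.toml": ("mypy .", "pytest -x"),
    "requirements.txt": ("mypy .", "pytest -x"),
    "Cargo.toml": ("cargo check", "cargo test"),
    "go.mod": ("go vet ./...", "go test ./..."),
}


def detect_commands(root_files: set[str]) -> list[tuple[str, str]]:
    """Return de-duplicated (check, test) pairs for every stack whose manifest is present."""
    found: list[tuple[str, str]] = []
    for manifest, commands in STACKS.items():
        if manifest in root_files and commands not in found:
            found.append(commands)
    return found
```

A caller would pass the repo root's file names, for example `{p.name for p in Path(".").iterdir()}`, then run each command and keep the last 10 lines of output.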

Auto-Slicing

Before launching readers, dynamically partition the codebase:

1. List all directories under src/.
2. Count .ts files per directory (recursive, excluding .test.ts).
3. Group into 6-8 balanced slices by file count.
4. If Project Hints are available (see the Project Hints section at the bottom of this file), use their Module Groups to guide grouping and keep listed groups together.
5. Apply soft targets: no slice >50 source files, no slice <10.
6. If a single directory has 60+ files, it gets its own slice.
7. Print the slice table before launching readers:

Slice  Directories                          Files
─────  ───────────────────────────────────  ─────
  1    core/, peer-runtime.ts                 28
  2    cli/, wizard/                          22
  3    swarm/                                 45
  4    warroom/, adversarial/                 38
  5    mesh/, hive/, group/                   31
  6    intel/, docs/, crowdsource/, share/    19
  7    generator/, templates/, handoff/       35
  8    guardian/, health/, plugin/, ...       27
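One way to produce a table like the one above is a greedy partition: directories with 60+ files get their own slice, and the rest are placed largest-first into whichever slice is currently lightest. This is a sketch under that assumption, not the skill's actual algorithm. `auto_slice` is a hypothetical helper, it ignores Module Groups, and it does not enforce the soft targets.

```python
import heapq


def auto_slice(dir_counts: dict[str, int], n_slices: int = 8, solo_threshold: int = 60) -> list[dict]:
    """Greedy partition: largest directories first, each into the currently lightest slice."""
    solo = [(d, c) for d, c in dir_counts.items() if c >= solo_threshold]
    rest = sorted(
        ((d, c) for d, c in dir_counts.items() if c < solo_threshold),
        key=lambda dc: (-dc[1], dc[0]),
    )
    # Oversized directories each become their own slice.
    slices = [{"dirs": [d], "files": c} for d, c in solo]
    n_shared = max(1, n_slices - len(slices))
    # Heap entries: (file count so far, slice index, directory list).
    heap = [(0, i, []) for i in range(min(n_shared, len(rest)))]
    for d, c in rest:
        total, i, dirs = heapq.heappop(heap)
        dirs.append(d)
        heapq.heappush(heap, (total + c, i, dirs))
    slices += [{"dirs": dirs, "files": total} for total, _, dirs in sorted(heap, key=lambda h: h[1])]
    return slices
```

Largest-first placement keeps slice sizes close to each other without needing an exact solver, which is enough when the targets are soft.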

Reading Wave

Launch 6-8 feature-dev:code-explorer agents in background, one per slice.

Canary pattern: Launch 1 reader first on the smallest slice. If it returns in <30s with garbage, abort the audit. If healthy, launch the remaining readers.

Reader prompt template:

## Pre-Flight Telemetry for Your Slice

The following deterministic findings were detected by static analysis tools BEFORE you started reading. These are ground truth — do not re-discover them. Instead:
- CONFIRM or REFUTE each finding based on your contextual reading
- Look for issues that static tools CANNOT find (business logic errors, semantic bugs, race conditions, incorrect error handling)
- Use the metrics to prioritize which files need deepest reading

[INJECT: fileDossiers for this slice's files from findings.json]

---

Read EVERY file in [directories] thoroughly — full files, not just tops.

[If Project Hints contain Known Complexity for this slice:]
Pay extra attention to: [hint text]

For each file report:
1. What it actually does (not what it claims to do)
2. Dead code, unused exports, stubs that don't do anything
3. Bugs, logic errors, missing edge cases
4. Integration gaps — functions that exist but nothing calls them
5. Missing error handling or silent failures
6. What would break if used in production

Be ruthlessly honest. I need gaps, not praise. Provide file:line references for every finding.

Early flagging: As each reader completes:

1. Scan output for CRITICAL / "would break in production" claims.
2. Print them to terminal with the [UNVERIFIED] tag.
3. Add the full output to the findings queue.
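The scan step can be approximated with a line-based pattern match. This sketch assumes each reader finding sits on its own line; `CRITICAL_PATTERN` and `flag_unverified` are hypothetical names.

```python
import re

# Markers taken from the early-flagging rule; matching is case-insensitive.
CRITICAL_PATTERN = re.compile(r"\bCRITICAL\b|would break in production", re.IGNORECASE)


def flag_unverified(reader_output: str) -> list[str]:
    """Return the lines that carry a critical claim, each prefixed with [UNVERIFIED]."""
    return [
        f"[UNVERIFIED] {line.strip()}"
        for line in reader_output.splitlines()
        if CRITICAL_PATTERN.search(line)
    ]
```

The tag stays on a finding until a verifier confirms or rejects it.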

Progress bar:

Reading: [#####-----] 4/8 complete | 12 unverified findings surfaced

Verification Wave (Overlapping)

Trigger: 3+ readers completed. Do NOT wait for all readers.

Launch verification agents as soon as the trigger fires. Late reader findings are handled via SendMessage.
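The overlap can be illustrated with a small asyncio simulation: readers finish at different times, verification starts once three are done, and any reader finishing later is forwarded to the Bug Verifier. The sleeps stand in for background agents and the event strings are hypothetical. The real pipeline uses agent launches and SendMessage rather than anything shown here.

```python
import asyncio


async def run_pipeline(reader_delays: dict[str, float], trigger_after: int = 3) -> list[str]:
    """Simulate readers finishing at different times; start verification after `trigger_after`."""
    events: list[str] = []

    async def reader(name: str, delay: float) -> str:
        await asyncio.sleep(delay)  # stands in for a background reader agent
        return name

    tasks = [asyncio.create_task(reader(n, d)) for n, d in reader_delays.items()]
    done_count = 0
    verification_started = False
    for finished in asyncio.as_completed(tasks):
        name = await finished
        done_count += 1
        if verification_started:
            # Late reader: forward its findings to the running Bug Verifier.
            events.append(f"send-to-bug-verifier:{name}")
        else:
            events.append(f"queued:{name}")
        if not verification_started and done_count >= trigger_after:
            verification_started = True
            events.append("verification-started")
    return events
```

Running it with four readers, for example `asyncio.run(run_pipeline({"r1": 0.01, "r2": 0.03, "r3": 0.05, "r4": 0.15}))`, shows the fourth reader arriving after verification has begun.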

Mandatory agents (always launch):

| # | Agent | Type | What It Does |
| --- | --- | --- | --- |
| 1 | Bug Verifier | `voltagent-qa-sec:debugger` | Top 10 claims from readers. CONFIRMED / FALSE POSITIVE / PARTIALLY CONFIRMED. Receives late reader findings via SendMessage. |
| 2 | Security Audit | `voltagent-qa-sec:penetration-tester` | Command injection, path traversal, secret leakage, prototype pollution, supply chain, credential storage |
| 3 | Architecture Scanner | `Architecture Scanner` | Import graph, circular deps, layer violations, god files |
| 4 | Dead Code Scanner | `Dead Code Scanner` | Verify claimed dead modules by grepping for imports |
| 5 | Race Condition Audit | `voltagent-qa-sec:chaos-engineer` | TOCTOU, concurrent access, orphaned locks, signal handlers |
| 6 | Error Handling Audit | `voltagent-qa-sec:error-detective` | Swallowed errors, unhandled rejections, partial state on failure |
| 7 | Test Gap Scan | `Test Gap Scanner` | Untested modules, vacuous tests, rule violations |
| 8 | Duplicate Code Scan | `Simplifier Scanner` | Cross-module duplicated logic |
| 9 | Performance Scan | `Performance Scanner` | Sync I/O in async, O(n^2), unbounded memory |
| 10 | Config Consistency | `feature-dev:code-explorer` | Hardcoded model IDs, timeouts, env vars, cwd vs projectPath |

Optional agents (triggered by Project Hints Optional Audits section):

| # | Agent | Trigger |
| --- | --- | --- |
| 11 | Hook System Audit | `hooks: yes` in hints |
| 12 | Token/Budget Audit | `token-budget: yes` in hints |
| 13 | Plugin System Audit | `plugin: yes` in hints |

Telemetry injection for verifiers: Each verification agent receives category-filtered findings from Phase 0 as ground truth:

| Verifier | Telemetry Categories |
| --- | --- |
| Bug Verifier | All categories (cross-references reader claims against tool findings) |
| Security Audit | security, secrets, vulnerability, supply-chain |
| Architecture Scanner | architecture (circular deps, boundary violations from dep-cruiser) |
| Dead Code Scanner | dead-code (unused exports/files/deps from Knip) |
| Duplicate Code Scan | duplication (clone pairs from jscpd) |
| Performance Scan | complexity (cognitive/cyclomatic from Oxlint, FTA scores) |
| Config Consistency | custom-rule (hardcoded model IDs from ast-grep) |

Verifiers use telemetry as a starting point, not a ceiling. Their job is to:

- Validate tool findings in context (false positive rate for tools is ~5-15%)
- Find issues tools missed (semantic bugs, cross-module interactions, timing issues)
- Synthesize tool findings with reader findings into coherent assessments
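The category filtering from the telemetry table can be sketched as a lookup. The category names come from that table. `VERIFIER_CATEGORIES` and `telemetry_for` are hypothetical, and returning nothing for verifiers absent from the table is an assumption of this sketch.

```python
# Verifier -> telemetry categories, from the telemetry table; None means all categories.
VERIFIER_CATEGORIES = {
    "Bug Verifier": None,
    "Security Audit": {"security", "secrets", "vulnerability", "supply-chain"},
    "Architecture Scanner": {"architecture"},
    "Dead Code Scanner": {"dead-code"},
    "Duplicate Code Scan": {"duplication"},
    "Performance Scan": {"complexity"},
    "Config Consistency": {"custom-rule"},
}


def telemetry_for(verifier: str, findings: list[dict]) -> list[dict]:
    """Return the Phase 0 findings a verifier receives as its starting point."""
    if verifier not in VERIFIER_CATEGORIES:
        return []  # assumption: verifiers not in the table get no telemetry
    categories = VERIFIER_CATEGORIES[verifier]
    if categories is None:
        return list(findings)
    return [f for f in findings if f["category"] in categories]
```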

Late reader handling: When a reader finishes after verification has started, send its findings to the Bug Verifier via SendMessage. Other still-running verifiers also receive via SendMessage. Already-completed verifiers miss late data — acceptable since the Bug Verifier is the longest-running agent and catches straggler findings.

Progress bar:

Verifying: [######----] 7/12 complete | 8 confirmed | 2 false positives | 3 pending
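A progress line in this format could be rendered as follows. The 10-character bar width and the round-to-nearest fill rule are assumptions, and `progress_bar` and `verification_line` are hypothetical helpers.

```python
def progress_bar(done: int, total: int, width: int = 10) -> str:
    """Render a fixed-width bar; the filled part is rounded to the nearest character."""
    filled = round(width * done / total) if total else 0
    return "[" + "#" * filled + "-" * (width - filled) + "]"


def verification_line(done: int, total: int, confirmed: int, false_pos: int, pending: int) -> str:
    """Format the verification progress line printed to the terminal."""
    return (
        f"Verifying: {progress_bar(done, total)} {done}/{total} complete"
        f" | {confirmed} confirmed | {false_pos} false positives | {pending} pending"
    )
```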

Judges

Trigger: All verification agents complete.

Launch 2 blind voltagent-qa-sec:architect-reviewer agents in parallel. Each gets the same evidence packet assembled from pre-flight numbers and verified findings only. Judges do NOT see each other's scores. Judges never read source code — they score from the evidence packet only.

Evidence packet:

- Pre-flight telemetry summary (total findings by tool, severity, category)
- Type coverage percentage
- Duplication percentage
- Architecture violation count
- Dead code volume (unused exports, files, deps)
- Complexity hotspots (files with cognitive complexity > 15)
- Pre-flight baseline numbers (files, tests, commits, dependencies, age)
- Typecheck and test pass/fail status
- Count of confirmed bugs by severity
- Count of security findings by severity
- Architecture violations found
- Test coverage gaps found
- Dead code volume
- Duplicate code volume
- Performance issues found

Rubric (10 dimensions, each 1-10):

| Dimension | What It Measures |
| --- | --- |
| Architecture | Module boundaries, dependency direction, layer discipline |
| Code Quality | Error handling, type safety, algorithmic correctness |
| Test Suite | Coverage, assertion quality, integration tests |
| Security Posture | Input validation, secret handling, sandboxing |
| CI/Infra | Build pipeline, quality gates, deployment readiness |
| Dogfooding | Does the system use its own tools? |
| Production Readiness | Ship tomorrow? Error recovery, monitoring, rollback |
| DX / Onboarding | New developer productive in 1 day? |
| Ambition | Scope, novelty, difficulty relative to team size |
| Debt Ratio | Dead code, stubs, orphaned features |

Scoring: Per dimension, take the median of the 2 judges' scores (with two judges this equals their mean). A divergence of 3+ points on any dimension sets a "disputed" flag.
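As a worked example of the scoring rule, the sketch below combines two judges' scorecards. `combine_scores` is a hypothetical helper and assumes both judges scored the same set of dimensions.

```python
from statistics import median


def combine_scores(judge_a: dict[str, int], judge_b: dict[str, int], dispute_delta: int = 3) -> dict:
    """Per dimension: median of the two scores, plus a disputed flag at 3+ points apart."""
    combined = {}
    for dim in judge_a:
        a, b = judge_a[dim], judge_b[dim]
        combined[dim] = {
            "score": median([a, b]),
            "disputed": abs(a - b) >= dispute_delta,
        }
    return combined
```

With scores of 7 and 8 the combined score is 7.5 and undisputed. With 4 and 7 it is 5.5 and flagged as disputed.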

Benchmark comparison prompt for each judge:

You are grading this codebase on a rubric. Score each dimension 1-10.

For context, here are typical score ranges for comparable setups:
- A senior hedge fund quant's personal trading system: 9-10 on security, 8-9 on tests, 7-8 on DX
- A YC-funded startup's core product at Series A: 7-8 on architecture, 6-7 on tests, 8-9 on ambition
- A top open-source CLI tool (e.g., mise, biome, turborepo): 9-10 on DX, 8-9 on CI, 7-8 on architecture
- A power Claude Code user's personal setup (4000+ sessions): 8-9 on dogfooding, 7-8 on ambition, 6-7 on production readiness
- A FAANG-tier internal developer tool: 9-10 on CI, 8-9 on tests, 7-8 on security

Score honestly. A 5 is average. A 7 is good. A 9 means best-in-class. Do not inflate.

For each dimension, provide:
- Score (1-10)
- One sentence justification
- One specific improvement that would raise the score by 1 point

Also provide:
- Overall weighted score (architecture and code quality weighted 2x)
- Percentile estimate: where does this codebase sit relative to all codebases you've seen of similar scope?
- Top 3 things to work on (from the evidence, not generic advice)
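The overall weighted score can be read as a weighted arithmetic mean in which Architecture and Code Quality count twice. The prompt only states the 2x weighting, so the mean itself is an assumption, and `overall_grade` is a hypothetical helper.

```python
DOUBLE_WEIGHT = {"Architecture", "Code Quality"}


def overall_grade(scores: dict[str, float]) -> float:
    """Weighted mean of dimension scores; Architecture and Code Quality count twice."""
    weights = {dim: 2 if dim in DOUBLE_WEIGHT else 1 for dim in scores}
    total = sum(scores[dim] * weights[dim] for dim in scores)
    return round(total / sum(weights.values()), 1)
```

For ten dimensions the weights sum to 12. With Architecture at 7, Code Quality at 6, and the other eight dimensions at 5, the total is 14 + 12 + 40 = 66, giving 66 / 12 = 5.5.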

Report

Produced inline (no synthesis agent).

Terminal output:

- Scorecard table with median scores per dimension
- Overall grade (weighted, architecture and code quality at 2x)
- Tiered findings:
  - Tier 1: Fix Now (confirmed runtime bugs with file:line)
  - Tier 2: High-Impact Work (architecture and infrastructure items)
  - Tier 3: Structural Cleanup (debt, duplicates, dead code)
  - Tier 4: Wire or Kill (dead infrastructure decisions)
- Ordered action list: what to do, most impactful first

Report file: Write to .claude/reports/deep-audit-YYYY-MM-DD.md. Contains everything in terminal output plus:

- Full evidence for every confirmed finding (file:line, quoted code, verifier verdict)
- Full evidence for every false positive (what was checked and cleared)
- Judge scorecards with per-dimension justifications
- Disputed dimensions flagged
- All agents launched with types and coverage areas
- Pre-flight baseline numbers
- Unverified findings from late readers that were never verified
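A small sketch of writing the dated report file, assuming the YYYY-MM-DD placeholder is the ISO date of the run. `report_path` and `write_report` are hypothetical helpers.

```python
from datetime import date
from pathlib import Path


def report_path(root: str, today: date) -> Path:
    """Build .claude/reports/deep-audit-YYYY-MM-DD.md under the given project root."""
    return Path(root) / ".claude" / "reports" / f"deep-audit-{today.isoformat()}.md"


def write_report(root: str, today: date, body: str) -> Path:
    """Write the report body, creating .claude/reports/ on first run."""
    path = report_path(root, today)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")
    return path
```

Two audits on the same day would overwrite each other under this naming scheme.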

Output Format

Terminal format:

# Deep Audit Report

## Scorecard

| Dimension    | Score | Benchmark        |
| ------------ | ----- | ---------------- |
| Architecture | 7     | Top OSS CLI: 7-8 |
| Code Quality | 6     | FAANG tool: 7-8  |
| ...          | ...   | ...              |

## Overall Grade: X.X/10 (Xth percentile)

## Tier 1: Fix Now

- [BUG] description — file:line — verifier: CONFIRMED

## Tier 2: High-Impact Work

- [ARCH] description — affected modules

## Tier 3: Structural Cleanup

- [DEBT] description — scope estimate

## Tier 4: Wire or Kill

- [DEAD] description — recommendation

## What To Do In Order

1. Most impactful item
2. Second most impactful
3. ...

Report file format (.claude/reports/deep-audit-YYYY-MM-DD.md):

# Deep Audit Report — YYYY-MM-DD

## Baseline

- Source files: N | Test files: N | Commits: N
- Typecheck: PASS/FAIL | Tests: PASS/FAIL
- Dependencies: N | First commit: YYYY-MM-DD

## Pre-Flight Telemetry

- Tools ran: [list] | Tools skipped: [list]
- Total findings: N | By severity: critical N, high N, medium N, low N
- Type coverage: N% (N any types)
- Duplication: N% (N clones)
- Architecture: N circular deps, N boundary violations
- Dead code: N unused files, N unused exports, N unused deps, N unused types
- Complexity hotspots: [files with cognitive > 15]
- Supply chain: N alerts (if Socket ran)

## Scorecard

[Full table with scores, justifications, disputed flags]

## Overall Grade: X.X/10 (Xth percentile)

## Confirmed Findings

### Tier 1: Fix Now

- **[BUG]** description
  - File: path:line
  - Code: `quoted snippet`
  - Verifier: CONFIRMED
  - Reader: Agent N | Verifier: Agent N

### Tier 2-4: ...

## False Positives

- **[CLEARED]** description
  - Claimed by: Agent N
  - Checked by: Bug Verifier
  - Why cleared: explanation

## Judge Scorecards

### Judge 1

[Per-dimension scores and justifications]

### Judge 2

[Per-dimension scores and justifications]

### Disputed Dimensions

- Dimension: Judge 1 scored X, Judge 2 scored Y (delta: Z)

## Agent Roster

| Agent | Type | Slice/Scope | Status |
| ----- | ---- | ----------- | ------ |
| ...   | ...  | ...         | ...    |

## Unverified Late Findings

[Findings from late readers that were not verified]

Rules

1. Read first, judge second. Never score a module you haven't read.
2. Every confirmed finding must have exact file:line and quoted code.
3. Reader findings are UNVERIFIED. Print them with [UNVERIFIED]. Verification confirms or rejects. Never present unverified findings as confirmed.
4. The Bug Verifier is mandatory. It exists because readers are wrong ~20% of the time on critical claims.
5. Judges see the evidence packet, not each other's scores.
6. Verification starts after 3+ readers complete, not after all readers complete.
7. Late reader findings go to Bug Verifier via SendMessage.
8. Canary the first reader on the smallest slice before launching the rest.
9. Report file (.claude/reports/deep-audit-YYYY-MM-DD.md) is the reference artifact. Terminal output is the action summary.
10. No praise in output. No "impressive." No "genuinely." No hedging. Findings, scores, actions.
11. Burn tokens aggressively. This skill is not for saving money. It is for finding truth.
12. Phase 0 telemetry is deterministic ground truth. Agents must not contradict tool findings without explicit evidence. If a reader says "no dead code" but Knip found 217 unused exports, the reader is wrong unless it can explain why Knip's analysis is incorrect (e.g., dynamic imports).
13. All reader and verification agents run in background with run_in_background: true. Launch everything in parallel that can be parallel.
14. The user's time is the bottleneck, not tokens. Minimize user wait time by maximizing parallelism and streaming progress bars.

Project Hints

The plugin version of this skill ships with no built-in hints. The host project supplies them by creating .claude/audit-hints.md with the template below. If the file exists, treat its Module Groups, Known Complexity, and Optional Audits sections as additional context for slice grouping, reader prompts, and verifier triggers.

If no hints file exists, infer module groups from top-level directory layout and skip the optional audits.

Hints file template

### Module Groups
- <dir1>, <dir2> — short description of what these modules do
- <dir3>, <dir4> — another group

### Known Complexity
- path/to/file.ext is N lines — why it's complex / what to watch for

### Optional Audits
- hooks: yes|no
- token-budget: yes|no
- plugin: yes|no
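A hints file in this template could be read with a few lines of parsing. This is a sketch that assumes the file uses exactly the `###` headings and `- ` bullets shown in the template; `parse_hints` and `optional_audits` are hypothetical helpers.

```python
def parse_hints(text: str) -> dict[str, list[str]]:
    """Collect the bullet items under each '### ' heading of an audit-hints file."""
    sections: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("### "):
            current = line[4:].strip()
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            sections[current].append(line[2:].strip())
    return sections


def optional_audits(sections: dict[str, list[str]]) -> set[str]:
    """Return the audit keys set to 'yes' under Optional Audits (e.g. hooks, plugin)."""
    enabled = set()
    for item in sections.get("Optional Audits", []):
        key, _, value = item.partition(":")
        if value.strip().lower() == "yes":
            enabled.add(key.strip())
    return enabled
```

The enabled keys map onto the optional verifiers: `hooks` to the Hook System Audit, `token-budget` to the Token/Budget Audit, and `plugin` to the Plugin System Audit.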
