Skill

codebase-audit

Use when the user invokes /codebase-audit to run a language-agnostic codebase quality audit measuring up to 12 quality criteria + development velocity with industry benchmarks, grading, and actionable recommendations.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/joesys-skills:codebase-audit

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Run a comprehensive, language-agnostic codebase quality audit. Measures up to 12 core quality criteria + development velocity across 6 parallel collection agents, displays graded metrics on console, and optionally writes a full analysis report with industry benchmarks and actionable recommendations.

Supporting Files

SKILL.md

466 lines · ~6k tokens (exceeds 5k compaction limit)

Stats

Language: Python

Stars: 1

Maintenance: Excellent

Last Commit: Jun 17, 2026

Codebase Audit Skill

Out of Scope

This skill MUST NOT:

Modify the audited code. Both observation and tool runs are read-only — even if a typo or obvious bug is noticed, do not edit.
Run write-mode tools (formatter --fix, linter --fix, code generators) as part of measurement. Read-only invocations only. If a tool only runs in fix mode, skip it.
Cite benchmarks, comparisons, or "industry standards" without naming the source. Unsourced numbers read as fabrication.
Continue past the safety gate when live commands were declined. Static-only fallback is mandatory in that case.
Grade a criterion without at least one measured metric backing it. If a helper script failed and no metric is available, the grade is "Not measured" — never "Good" or "—" by default.
Inflate grades to be encouraging or deflate them to be alarming. The grade reflects measured evidence, not the reaction it should produce.
Claim "no issues found" without actually measuring. Absence of evidence is not evidence of absence.
Skip the live-command safety gate. Even on /codebase-audit metrics, live commands need approval.
Pretend a tool ran when it didn't. If a tool was unavailable, configured-but-unavailable, or timed out, the report says so — it does not fold a missing tool into the "no findings" pile.

Reference Files

This skill uses progressive disclosure — read reference files only when needed:

File	Contents	When to read
`references/agent-prompts.md`	Full prompt templates for all 6 collection agents + Phase 4 author agent	Before dispatching agents in Phase 1 or Phase 4
`references/output-schemas.md`	metrics.json schema, metrics.md template, codebase-audit.md preferences template, execution flows	Before writing output files in Phase 5
`references/detection-defaults.md`	Language marker files, language defaults, path auto-detection, polyglot rules, config file format	During Phase 0 detection

Invocation

Parse the user's /codebase-audit arguments:

Invocation	Mode
`/codebase-audit`	Full pipeline (all 12 criteria + velocity)
`/codebase-audit metrics`	Collect + display only, write metrics.json + metrics.md
`/codebase-audit analysis`	Re-analyze from most recent metrics.json
`/codebase-audit delta`	Compare two most recent audits
`/codebase-audit maintainability performance`	Only specified criteria
`/codebase-audit velocity`	Just development velocity
`/codebase-audit --static-only`	No live commands (no test run, no dep audit)

Parsing Rules

Reserved phase words: metrics, analysis, delta
Everything else: Treated as criterion names, validated against the 13 valid names
Flags: --static-only — skip all live commands
Invalid names: Print error listing valid options, stop

Valid Criterion Names

Argument	Criterion	Category
`maintainability`	1. Maintainability	Core
`evolvability`	2. Evolvability	Core
`correctness`	3. Correctness	Core
`testability`	4. Testability	Core
`reliability`	5. Reliability	Core
`performance`	6. Performance	Core
`readability`	7. Readability	Core
`modularity`	8. Modularity	Core
`consistency`	9. Consistency	Core
`operability`	10. Operability	Core
`security`	11. Security	Core
`story-readability`	12. Story Readability	Core
`velocity`	13. Development Velocity	Extended
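
The parsing rules and the criterion table above can be sketched as a small function. This is a minimal illustration, not the skill's actual parser: the function name `parse_audit_args`, the mode labels `"full"` and `"scoped"`, and the choice to take the first phase word when several are given are assumptions made for the example.

```python
RESERVED_PHASES = {"metrics", "analysis", "delta"}
VALID_CRITERIA = [
    "maintainability", "evolvability", "correctness", "testability",
    "reliability", "performance", "readability", "modularity",
    "consistency", "operability", "security", "story-readability",
    "velocity",
]


def parse_audit_args(args):
    """Split raw arguments into (mode, criteria, static_only).

    Hypothetical helper: names and return shape are illustrative only.
    """
    static_only = "--static-only" in args
    words = [a for a in args if a != "--static-only"]

    # Reserved phase words select a pipeline mode; everything else
    # is treated as a criterion name and validated.
    phases = [w for w in words if w in RESERVED_PHASES]
    criteria = [w for w in words if w not in RESERVED_PHASES]

    invalid = [c for c in criteria if c not in VALID_CRITERIA]
    if invalid:
        raise ValueError(
            f"Invalid criterion: {', '.join(invalid)}. "
            f"Valid options: {', '.join(VALID_CRITERIA)}"
        )

    if phases:
        mode = phases[0]  # assumption: first phase word wins
    elif criteria:
        mode = "scoped"
    else:
        mode = "full"
    return mode, criteria, static_only
```

Raising on an invalid name mirrors the "print error listing valid options, stop" rule; a caller would catch the exception, print its message, and end the run.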

Phase 0 — Parse, Detect & Route

Read references/detection-defaults.md for language marker files, language defaults, path detection rules, and config file format.

Detection Steps

1. Parse arguments: determine the invocation mode (full, metrics, analysis, delta, or scoped).
2. Load user preferences: read shared/skill-context.md for the full protocol. Load .claude/skill-context/preferences.md (shared) and .claude/skill-context/codebase-audit.md (skill-specific). If no shared preferences exist, invoke /preferences (streamlined mode). Shared preferences supply project phase, team size, and business priority; Phase 3 skips questions already answered here.
3. Load config: read .claude/audit.yaml if it exists (all fields optional).
4. Auto-detect language: check marker files in priority order (see reference).
5. Apply language defaults: function patterns, test runner, extension (see reference).
6. Detect static analysis tooling: read shared/tooling-registry.md and per-language profiles from shared/tooling/. Classify tools as available, configured-but-unavailable, or absent. Build gap recommendations for absent tools.
7. Auto-detect paths: source, test, and exclude paths (see reference).
8. Polyglot detection: if a secondary language exceeds 10% of source files, add it to additional.
9. Auto-detect test runner: check framework configs, package.json scripts, language default.
10. Domain inference: read README, package manifests, scan key imports, check directory names. Use WebSearch for comparable projects if available.
11. Prerequisites check: verify Python 3 is available for helper scripts. If not, warn and offer qualitative-only mode.
12. Scope size check & tier selection: count source files and classify. Default threshold: large tier at >1000 source files. Override via .claude/skill-context/codebase-audit.md key large_tier_threshold_files. Large tier activates Phase 1 module decomposition (§ Large Repo Decomposition) and Phase 2 heat-map-driven deep dive (§ Heat-Map-Driven Deep Dive). Below threshold: whole-repo behavior throughout.
13. Merge config: auto-detected defaults ← config overrides (config always wins).
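
The polyglot rule and the tier threshold from the detection steps can be illustrated as follows. This is a sketch under stated assumptions: the function names, the `ext_to_language` mapping, and the label `"standard"` for the below-threshold tier are hypothetical, and only files with a recognized extension are counted as source files.

```python
from collections import Counter
from pathlib import PurePosixPath

POLYGLOT_SHARE = 0.10  # secondary language must exceed 10% of source files


def detect_additional_languages(primary, source_files, ext_to_language):
    """Return secondary languages whose share of source files exceeds 10%.

    ext_to_language is a hypothetical mapping such as {".py": "python"}.
    """
    counts = Counter()
    for path in source_files:
        language = ext_to_language.get(PurePosixPath(path).suffix)
        if language is not None:
            counts[language] += 1

    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(
        lang for lang, n in counts.items()
        if lang != primary and n / total > POLYGLOT_SHARE
    )


def select_tier(source_file_count, threshold=1000):
    """Large tier applies strictly above the threshold (default 1000 files)."""
    return "large" if source_file_count > threshold else "standard"
```

The `threshold` parameter stands in for the `large_tier_threshold_files` override; a repo with exactly 1000 source files stays below the large tier because the rule is ">1000".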

Output of Phase 0

A project context block passed to all agents:

Language: {primary} (+{additional})
Source paths: {paths}
Test paths: {paths}
Exclude: {patterns}
Test runner: {runner}
Domain: {summary}
Engine: {if detected}

If --static-only was passed, skip tool execution in Phase 1 but still detect and classify tools.

Phase 1 — Parallel Collection

MUST spawn 6 measurement agents in parallel via the Agent tool — all 6 in a single response. Each uses model: "opus". Sequential dispatch is a defect. Read references/agent-prompts.md for the full prompt template for each agent.

Agent Roster

#	Agent	Key Metrics	Helper Script
1	Structural	LOC, file/function lengths, nesting, comment density	`helpers/compute_structure.py`
2	Quality	Cyclomatic complexity, naming, magic numbers, duplication, secrets	`helpers/compute_complexity.py`
3	Architecture	Coupling, circular deps, CI/CD, dependency health, tooling adoption	— (Grep/Read)
4	Git/Velocity	Churn, commit frequency, bus factor, knowledge concentration	`helpers/compute_churn.py`
5	Performance	Algorithm issues, N+1, blocking I/O, memory leaks	— (Grep/Read)
6	Tests	Pass rate, test ratio, assertion density, test quality	Test runner (if approved)

Scoped Invocations

For scoped criteria, launch only the required agents:

Criterion	Required Agents
Maintainability	Structural, Quality
Evolvability	Structural, Architecture
Correctness	Tests, Structural
Testability	Tests, Structural, Architecture
Reliability	Architecture, Structural
Performance	Performance, Architecture
Readability	Structural, Quality
Modularity	Architecture
Consistency	Quality, Architecture
Operability	Architecture, Structural
Security	Architecture, Quality
Story Readability	Structural, Quality
Velocity	Git/Velocity
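
The table above translates directly into a lookup. The sketch below assumes that when several criteria are requested, the agents to launch are the de-duplicated union of each criterion's required agents; the source does not spell this out, and `agents_for` is a hypothetical name.

```python
REQUIRED_AGENTS = {
    "maintainability": ["Structural", "Quality"],
    "evolvability": ["Structural", "Architecture"],
    "correctness": ["Tests", "Structural"],
    "testability": ["Tests", "Structural", "Architecture"],
    "reliability": ["Architecture", "Structural"],
    "performance": ["Performance", "Architecture"],
    "readability": ["Structural", "Quality"],
    "modularity": ["Architecture"],
    "consistency": ["Quality", "Architecture"],
    "operability": ["Architecture", "Structural"],
    "security": ["Architecture", "Quality"],
    "story-readability": ["Structural", "Quality"],
    "velocity": ["Git/Velocity"],
}


def agents_for(criteria):
    """Union of required agents, de-duplicated, in first-seen order."""
    selected = []
    for criterion in criteria:
        for agent in REQUIRED_AGENTS[criterion]:
            if agent not in selected:
                selected.append(agent)
    return selected
```

For `/codebase-audit maintainability performance` this yields four agents (Structural, Quality, Performance, Architecture), so the Git/Velocity and Tests agents are not launched.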

Live Command Safety Gate

MUST present all live commands for approval before dispatching agents:

The following live commands will be executed during collection:

{test_runner} (Tests agent)

{audit_command} (Architecture agent)

{tool_command} (Tooling — {tool_name})

Options: Run all | Static only | Select

Read-only commands (helper scripts, git log, Glob/Grep/Read, tool detection) do not need approval.

Large Repo Decomposition (Large Tier Only)

Activates when Phase 0 step 12 classified the repo as large tier.

When the repo exceeds the large-tier threshold, the three qualitative agents (Architecture, Performance, Security) switch from whole-repo to per-module dispatch. Statistical agents (Structural, Quality, Git/Velocity, Tests) continue to run whole-repo — their helper scripts aggregate without reading files, so repo size doesn't degrade them.

Step 1 — Module detection. Treat each top-level directory under the detected source paths as a module. Example: src/api/, src/payment/, src/ui/, src/worker/ → four modules.

Merge top-level dirs that contain fewer than 20 source files into a single misc module to avoid agent spam.
If one module holds >50% of source files, subdivide it one level deeper (e.g., src/core/ → src/core/auth/, src/core/data/).
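
The module detection rules in Step 1 can be sketched like this. It is an illustration only: paths are assumed to be relative to a detected source path and to use `/` separators, the function names are hypothetical, and sending undersized sub-modules of a split module to `misc` is a choice made for the example, not something the rules above state.

```python
from collections import defaultdict

MIN_MODULE_FILES = 20    # dirs below this size are merged into "misc"
MAX_MODULE_SHARE = 0.50  # a module above this share is split one level deeper


def group_by_depth(files, depth):
    """Group paths by their first `depth` directory components."""
    groups = defaultdict(list)
    for path in files:
        directories = path.split("/")[:-1]
        # Files with fewer directory levels keep the levels they have;
        # files directly in the source root are grouped under ".".
        key = "/".join(directories[:depth]) or "."
        groups[key].append(path)
    return groups


def detect_modules(files):
    """Return {module_name: [paths]} following the Step 1 rules."""
    if not files:
        return {}
    total = len(files)

    modules = {}
    for name, members in group_by_depth(files, 1).items():
        if len(members) / total > MAX_MODULE_SHARE:
            # Oversized module: subdivide one level deeper.
            modules.update(group_by_depth(members, 2))
        else:
            modules[name] = members

    # Merge small modules into a single "misc" module.
    merged, misc = {}, []
    for name, members in modules.items():
        if len(members) < MIN_MODULE_FILES:
            misc.extend(members)
        else:
            merged[name] = members
    if misc:
        merged["misc"] = misc
    return merged
```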

Step 2 — Per-module dispatch. For each module, dispatch one Architecture + one Performance + one Security agent in parallel. MUST fire all modules × 3 agents in the same parallel batch, alongside the statistical agents that run whole-repo.

Each qualitative agent receives only its module's files plus the shared project context block.

Step 3 — Roll-up. Per-module findings carry a module tag so the heat map and console display can reference them. The Phase 2 grade for Architecture/Performance/Security is the weighted average across modules (weighted by source file count).
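
A file-count-weighted roll-up could look like the sketch below. The numeric values in `GRADE_POINTS` and the "nearest grade" conversion back to a letter are assumptions for illustration; the source only says the roll-up is a weighted average and does not define a numeric scale.

```python
# Hypothetical grade-point scale; not defined by the skill itself.
GRADE_POINTS = {"A+": 4.3, "A": 4.0, "B+": 3.3, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def roll_up(module_grades):
    """Combine per-module grades into one grade, weighted by source file count.

    module_grades: list of (grade, source_file_count) for analyzed modules.
    """
    total_files = sum(count for _, count in module_grades)
    if total_files == 0:
        # Nothing was analyzed, so there is no evidence to grade.
        return "Not measured"

    score = sum(GRADE_POINTS[grade] * count for grade, count in module_grades) / total_files
    # Map the weighted score back to the closest letter grade.
    return min(GRADE_POINTS, key=lambda grade: abs(GRADE_POINTS[grade] - score))
```

Only modules whose agents succeeded would be passed in, which matches the "N of M modules analyzed" note used when some modules fail.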

This is the "structural breadth" half of large-tier analysis. The "risk depth" half runs in Phase 2 (see § Heat-Map-Driven Deep Dive).

Failure Handling

Agent timeout: 60s default. Proceed with available data, note missing agent.
Helper script failure: Agent falls back to qualitative-only. Metrics marked "Not measured."
No test runner: Tests agent does static analysis only.
Live commands declined: Static analysis fallback. Mark as "Skipped (live commands declined)."
Large tier — partial module failure: If some modules' qualitative agents fail, proceed with successful modules. Roll-up grade notes "N of M modules analyzed."

Phase 2 — Display & Gate

Assemble Results & Grade

Collect structured JSON from each agent. For each criterion, compute a grade using the principle file rubric + benchmark data.

Audit Confidence Model: Each criterion gets a confidence level (high, medium, low). Append ~ to grades with low confidence (e.g., "B~"). Overall confidence = lowest among all criteria.
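
As a minimal sketch of the confidence model, with hypothetical function names:

```python
CONFIDENCE_ORDER = ["low", "medium", "high"]


def display_grade(grade, confidence):
    """Append "~" to a grade when its confidence is low."""
    return f"{grade}~" if confidence == "low" else grade


def overall_confidence(confidences):
    """Overall confidence is the lowest level among all criteria."""
    return min(confidences, key=CONFIDENCE_ORDER.index)
```

With criteria at high, medium, and high confidence, the overall confidence is medium; a single low-confidence criterion pulls the overall level down to low.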

Tooling Impact on Grades:

Criterion	Positive Signal	Negative Signal
Security	Scanner present + clean	No scanner, or vulnerabilities found
Consistency	Formatter + linter clean	Violations, or no formatter/linter
Operability	Analysis tooling present, CI-integrated	No tooling at all
Maintainability	Static analyzer clean	Analyzer found issues
Correctness	Type checker clean	Type errors found

Risk Heat Map

Cross-reference complexity (Quality agent) with churn (Git/Velocity agent):

              High Churn
                  │
   ┌──────────────┼──────────────┐
   │  Refactor    │  Danger Zone │
   │  candidates  │  (act now)   │
───┼──────────────┼──────────────┼─── High Complexity
   │  Stable      │  Monitor     │
   │  (leave it)  │  (watch)     │
   └──────────────┼──────────────┘
              Low Churn
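
The quadrants in the diagram can be expressed as a simple classification. The thresholds below (complexity above 10, more than 10 changes) are placeholders chosen for the example; the skill resolves "high" against its benchmarks, and the function names are hypothetical.

```python
def classify_quadrant(complexity, churn, complexity_threshold=10, churn_threshold=10):
    """Place one file in a heat-map quadrant. Thresholds are illustrative."""
    high_complexity = complexity > complexity_threshold
    high_churn = churn > churn_threshold
    if high_complexity and high_churn:
        return "Danger Zone"
    if high_churn:
        return "Refactor candidates"
    if high_complexity:
        return "Monitor"
    return "Stable"


def danger_zone(files, **thresholds):
    """List Danger Zone paths by name.

    files: mapping of path -> (complexity, churn).
    """
    return sorted(
        path for path, (complexity, churn) in files.items()
        if classify_quadrant(complexity, churn, **thresholds) == "Danger Zone"
    )
```

Returning the paths themselves, not just a count, keeps the output usable for reports that have to name each file.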

"Danger Zone" files (high complexity + high churn) MUST be named explicitly.

Heat-Map-Driven Deep Dive (Large Tier Only)

Activates when Phase 0 step 12 classified the repo as large tier.

Runs after the heat map is computed, in addition to the Phase 1 per-module dispatch. Module decomposition gave breadth — every top-level dir got a reviewer. Heat-map deep dive gives depth on actual risk.

Risk clusters to deep-dive:

All files in the "Danger Zone" quadrant (high complexity × high churn)
Any module whose Phase 1 qualitative grade was C or below
Cross-module concerns surfaced by Phase 1 (e.g., coupling or shared-type issues that span modules)

Dispatch. For each risk cluster, dispatch one Architecture + one Performance + one Security agent in parallel. Each agent receives:

The specific files in the cluster
The heat map context (complexity + churn for each file)
Phase 1 findings for those files, so the deep dive builds on rather than repeats

Deep-dive findings merge into their criteria grades. Mark each deep-dive finding with source: heat-map-deep-dive so the methodology section can cite the two-pass structure.

Skip condition: if the heat map is clean (no Danger Zone files) and every module graded B+ or above in Phase 1, skip this step — no risk to dive into.

Grading Scale

Grade	Meaning
A+	Exceeds industry best practice
A	Meets best practice
B	Acceptable, minor improvements possible
C	Below average, attention needed
D	Significant issues, action required
F	Critical deficiencies

Grading is relative to resolved benchmarks (language-specific → general fallback).

Console Display

Print a summary table directly in the conversation:

╔══════════════════════════════════════════════════════════════╗
║  CODEBASE AUDIT — {Project Name}                           ║
║  {Domain Summary} · {Language} · {Date}                    ║
╠══════════════════════════════════════════════════════════════╣
║  Overall Grade: {GRADE} (confidence: {CONFIDENCE})         ║
╠══════════════════════════════════════════════════════════════╣
║  #  Criterion        Grade  Key Metric         Benchmark   ║
║  ── ──────────────── ────── ────────────────── ─────────── ║
║   1 Maintainability    B    CC avg: 8.2        ≤ 10        ║
║   2 Evolvability       B+   Fan-out avg: 3.1   ≤ 5         ║
║  ...                                                       ║
║  12 Story Readability  B+   Narr: 8, Chunk: 6  ≥ 7 avg     ║
║  ── ──────────────── ────── ────────────────── ─────────── ║
║  13 Velocity           —    +2.1k lines/30d    —           ║
╠══════════════════════════════════════════════════════════════╣
║  Top Risk: {criterion} ({grade}) — {reason}                ║
║  Top Strength: {criterion} ({grade}) — {reason}            ║
║  Danger Zone: {file} (CC:{N}, {N} changes)                 ║
╚══════════════════════════════════════════════════════════════╝

Dynamically generated — only measured criteria appear. Failed/skipped agents show "—" with a note.

Gate

After displaying the table:

Metrics collected. What would you like to do?

1. Write both: metrics.json + metrics.md + full analysis

2. Metrics only: write metrics.json + metrics.md (numbers, no commentary)

3. Done: just the console display, no files

Routing rules:

/codebase-audit metrics → skip gate, write metrics files directly
/codebase-audit analysis → skip Phase 1, load most recent metrics.json, proceed to Phase 3
User selects option 1 → proceed to Phase 3
User selects option 2 → skip to Phase 5 (write metrics only)
User selects option 3 → stop

Phase 3 — User Context Interview

Gathers context the code alone can't reveal. Uses the shared preferences system (shared/skill-context.md) to avoid re-asking questions.

Context Sources (checked in order)

Shared preferences (.claude/skill-context/preferences.md) — loaded in Phase 0 step 2. Contains project phase, team size, business priority.
Audit-specific preferences (.claude/skill-context/codebase-audit.md) — deployment cadence, known trade-offs.
Legacy profile (docs/reports/codebase-audit/project-context.md) — if this exists but no shared preferences file does, migrate its contents into the shared system.

First Audit — Build the Profile

Check what's already known from shared preferences. MUST only ask questions whose answers are not already captured:

Question	Skip if already in...
Project phase	shared preferences → "Project phase"
Team size	shared preferences → "Team size"
Deployment cadence	shared preferences → "Deployment cadence" or audit-specific preferences
Business priority	shared preferences → "Business priority"
Known trade-offs	audit-specific preferences → "Known trade-offs"

If shared preferences exist and cover project phase, team size, and business priority, the only new questions are deployment cadence (if missing), known trade-offs, and informed questions based on Phase 1 findings (e.g., "I noticed zero tests — intentional for now?").

If no shared preferences exist at all, /preferences was already invoked in Phase 0 step 2 — those answers are now available. Ask only the audit-specific questions: deployment cadence, known trade-offs, and informed questions.

Save audit-specific answers to .claude/skill-context/codebase-audit.md.

Returning Audits — Confirm the Profile

Check for existing audit-specific preferences at .claude/skill-context/codebase-audit.md. If found, present the combined profile (shared + audit-specific) and ask if anything has changed. This keeps repeat audits quick.

Legacy Migration

If docs/reports/codebase-audit/project-context.md exists but .claude/skill-context/preferences.md does not:

Read the legacy file
Extract project phase, team size, deployment cadence, business priority
Write shared fields to .claude/skill-context/preferences.md
Write audit-specific fields (known trade-offs, audit history) to .claude/skill-context/codebase-audit.md
Inform the user: "Migrated your audit profile to the shared preferences system."
The legacy file remains in place (existing reports may reference it) but is no longer the source of truth.

How User Context Shapes the Analysis

User Context	Analysis Effect
Solo + Prototype	Lighter on process, heavier on "what to invest in first"
Team of 10 + Mature	Heavier on consistency, modularity, onboarding friction
"Speed to market" priority	Recommendations framed as "do this now" vs. "before scaling"
"Low test coverage intentional"	Testability acknowledges trade-off rather than flagging as surprise

Phase 4 — Analysis Writing

A single author agent writes the full analysis in one pass. MUST use model: "opus". Read references/agent-prompts.md for the full author agent prompt.

The author receives: assembled metrics JSON, project context, user context, risk heat map, and previous audit data (if any).

Dynamic Criteria Weighting

The author assigns a priority rank (1–12) and weight (High/Medium/Low) to each criterion based on language + domain expertise. This affects priority order, overall grade, analysis depth, and recommended actions. Users can override via criteria_priority in audit.yaml.

Phase 5 — Write & Output

Read references/output-schemas.md for the full schemas and templates.

Output Files

File	Content	When written
`metrics.json`	Machine-readable metrics with grades, benchmarks, methodology	Always (options 1 & 2)
`metrics.md`	Human-readable metrics table	Always (options 1 & 2)
`analysis.md`	Full qualitative report per `templates/analysis-template.md`	Option 1 only
`.claude/skill-context/codebase-audit.md`	Audit-specific preferences (trade-offs, cadence, history)	Updated each audit

Output directory: docs/reports/codebase-audit/YYYYMMDD/
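
Because each audit lands in a date-named directory, the analysis and delta modes can find earlier runs by sorting directory names. The sketch below is one possible approach, not the skill's implementation: `latest_audits` is a hypothetical helper, and it assumes each dated directory holds a `metrics.json` at its top level.

```python
import re
from pathlib import Path

DATE_DIR = re.compile(r"^\d{8}$")  # YYYYMMDD directory names


def latest_audits(report_root, count=2):
    """Return up to `count` most recent audit directories, newest first."""
    root = Path(report_root)
    if not root.is_dir():
        return []
    dated = sorted(
        (
            d for d in root.iterdir()
            if d.is_dir() and DATE_DIR.match(d.name) and (d / "metrics.json").is_file()
        ),
        key=lambda d: d.name,  # YYYYMMDD sorts chronologically as text
        reverse=True,
    )
    return dated[:count]
```

A delta run would call `latest_audits("docs/reports/codebase-audit")` and report "Need at least 2 audits for delta comparison" when fewer than two directories come back.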

Cleanup

Remove temp files. Report output paths:

Audit complete.

Metrics: docs/reports/codebase-audit/{DATE}/metrics.md

Analysis: docs/reports/codebase-audit/{DATE}/analysis.md

Overall grade: {GRADE}

Notify completion (cross-platform):

if command -v powershell.exe &>/dev/null; then
  powershell.exe -c "[Console]::Beep(800, 300)"
elif command -v afplay &>/dev/null; then
  afplay /System/Library/Sounds/Glass.aiff &
elif command -v paplay &>/dev/null; then
  paplay /usr/share/sounds/freedesktop/stereo/complete.oga &
else
  printf '\a'
fi

Guardrails

Never present a benchmark without a source citation. Unsourced benchmarks look made up.
Never grade a criterion without at least one measured metric. Grades based on vibes are worthless.
Never claim "no issues found" without having actually measured. Absence of evidence ≠ evidence of absence.
If a helper script fails, mark affected metrics as "Not measured" — not "Good."
Always show delta direction (improved/declined/stable) when comparing.
Domain inference must be stated explicitly so the user can challenge it.
Criteria priority rationale must be shown in the report.
Effort estimates must reference codebase size and team context.
All Danger Zone files must be named explicitly.
User context trade-offs must be acknowledged in relevant criterion sections.

Graceful Degradation

Situation	Behavior
1–2 agents fail or time out	Proceed with available data. Note missing agents. Offer to retry.
All agents fail	Report failure. Suggest retrying or narrowing scope.
No git history	Git/Velocity agent skips. Churn/bus factor marked "No git history."
No test runner detected	Tests agent does static analysis only.
Helper script fails	Agent falls back to qualitative-only. Metrics marked "Not measured."
No internet (WebSearch unavailable)	Use cached benchmarks. Note "cached benchmarks only" in methodology.
Unknown language	Use general benchmarks. Extension-count fallback.
Massive repo (large tier)	Phase 1 module decomposition (per top-level dir) + Phase 2 heat-map-driven deep dive fire automatically. See § Large Repo Decomposition and § Heat-Map-Driven Deep Dive.
No `.claude/audit.yaml`	Fully auto-detected. Note in methodology.
Python not available	Qualitative-only for helper-dependent metrics.
No previous audit for delta	"Need at least 2 audits for delta comparison."
Live commands declined	Static analysis fallback. Mark as "Skipped (live commands declined)."
No static analysis tools detected	Gap recommendations included. Criteria graded without tool input.
Tool configured but not installed	Graded as absent. Config noted in analysis.
Tool execution fails	Skip tool, proceed with remaining tools. Note failure.
`--static-only` with tools detected	Tools detected and classified but not executed.

Error Handling

Error	Behavior
Invalid criterion name	Print valid names, stop
`.claude/audit.yaml` is malformed	Report parse error, proceed with auto-detection
No source files found	"No source files found in detected paths. Check project structure."
metrics.json not found for `analysis` mode	"No previous metrics found. Run `/codebase-audit` first."
<2 metrics.json for `delta` mode	"Need at least 2 audits for delta comparison. Found {N}."
Output directory creation fails	Report error, suggest alternative path
Agent returns malformed JSON	Use what's parseable, note the issue
Tool binary not found	Classify as `configured-but-unavailable`, skip, continue
Tool output unparseable	Report raw summary, skip structured parsing, continue
Tool timeout	Kill process, skip tool, continue with remaining tools
