From forge
Evaluates a codebase across 12 pillars using 3 parallel evaluator agents, producing a scored assessment for targeted remediation.
How this skill is triggered — by the user, by Claude, or both
Slash command
/forge:repo-eval
This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You coordinate a 3-evaluator hiring panel assessment of a codebase. Each evaluator runs as a separate agent with its own context window.
$ARGUMENTS is optional context — the repo path, role level being evaluated, or specific concerns. If empty, evaluate the current working directory.
Ask scoping questions one at a time, preferring multiple choice. Wait for each answer before asking the next.
The code evaluation runs 3 evaluator agents in parallel, each scoring 4 pillars (12 total). These questions calibrate the evaluation.
Question 1 — Known pain points give the evaluators a starting hypothesis instead of scanning cold:
Are there parts of the codebase you already know are problematic?
Things that keep breaking, areas you dread touching, modules that slow down every PR.
A) Yes (tell me which areas and what's wrong)
B) No — scan everything with fresh eyes
Question 2 — Role level sets the scoring bar:
What role level should I evaluate this codebase against?
A) Junior Developer — fundamentals: readability, basic error handling, test presence
B) Mid-Level Developer — patterns: separation of concerns, consistent conventions, test coverage
C) Senior Developer — production: defensive coding, observability, performance awareness, type rigor
D) Staff+ / Principal — systems: architectural coherence, scalability, operational excellence
Question 3 — Focus areas weight what evaluators pay extra attention to (they still score all 12 pillars):
Any specific concerns the evaluators should weight more heavily?
A) Performance — hot paths, algorithmic complexity, resource management
B) Security — input validation, auth patterns, secrets handling
C) Testing — coverage quality, test architecture, edge cases
D) Architecture — separation of concerns, modularity, coupling
E) Multiple (tell me which)
F) None — balanced evaluation across all pillars
Question 4 — Scope and exclusions:
What should the evaluators look at?
A) Full repo, standard exclusions (vendor, generated, node_modules, __pycache__)
B) Full repo, no exclusions
C) Specific directories only (tell me which to include or exclude)
Question 5 — Pillar overrides. By default, /pipeline remediates until all 12 pillars hit 9/10. Some pillars may not be improvable through code changes. The 12 pillars are:
Hire: Problem-Solution Fit, Architecture, Code Quality, Creativity
Stress: Pragmatism, Defensiveness, Performance, Type Rigor
Day 2: Test Value, Reproducibility, Git Hygiene, Onboarding
Any pillars to accept below the default 9/10 threshold?
A) None — require 9/10 on all 12 pillars
B) Specific overrides (tell me which pillars and target scores, e.g., "Creativity: 7, Git Hygiene: accept")
Record overrides in the eval.md frontmatter.
Generate the directory name: YYYY-MM-DD-eval-slug (e.g., eval-ragstack, eval-billing-api).
Create the directory at docs/plans/YYYY-MM-DD-eval-slug/.
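A minimal sketch of the naming step, assuming the slug is derived from the repo name by lowercasing and hyphenating it. That slug rule and the function names `eval_dir_name` and `create_eval_dir` are illustrative assumptions, not part of the skill itself.

```python
import re
from datetime import date
from pathlib import Path


def eval_dir_name(repo_name: str, today: date) -> str:
    """Build a YYYY-MM-DD-eval-slug directory name from a repo name."""
    # Hypothetical slug rule: lowercase, runs of non-alphanumerics become "-".
    slug = re.sub(r"[^a-z0-9]+", "-", repo_name.lower()).strip("-")
    return f"{today.isoformat()}-eval-{slug}"


def create_eval_dir(repo_root: Path, repo_name: str, today: date) -> Path:
    """Create docs/plans/<name>/ under repo_root and return its path."""
    target = repo_root / "docs" / "plans" / eval_dir_name(repo_name, today)
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Passing the date in explicitly, rather than reading the clock inside the function, keeps the name reproducible when the same value is reused later in the frontmatter and the run log.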
You (the orchestrator) must read the role prompt files and embed their contents in each agent's prompt. Agents cannot access skill directory files.
skills/pipeline/eval-hire.md: store contents as HIRE_PROMPT
skills/pipeline/eval-stress.md: store contents as STRESS_PROMPT
skills/pipeline/eval-day2.md: store contents as DAY2_PROMPT
Then spawn 3 Agents in parallel:
<role_prompt>
[Contents of eval-hire.md]
</role_prompt>
<task>
Evaluate the codebase in the current working directory.
Role level: [from Step 1]
Focus areas: [from Step 1]
Exclusions: [from Step 1]
</task>
<role_prompt>
[Contents of eval-stress.md]
</role_prompt>
<task>
Evaluate the codebase in the current working directory.
Role level: [from Step 1]
Focus areas: [from Step 1]
Exclusions: [from Step 1]
</task>
<role_prompt>
[Contents of eval-day2.md]
</role_prompt>
<task>
Evaluate the codebase in the current working directory.
Role level: [from Step 1]
Focus areas: [from Step 1]
Exclusions: [from Step 1]
</task>
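The three prompts above differ only in the embedded role prompt, so they could be assembled from one template. The sketch below covers only the text assembly, not how agents are spawned. The `skill_root` parameter and both function names are assumptions made for illustration.

```python
from pathlib import Path

# Prompt files named in the source, keyed by the variable they are stored in.
ROLE_PROMPT_FILES = {
    "HIRE_PROMPT": "skills/pipeline/eval-hire.md",
    "STRESS_PROMPT": "skills/pipeline/eval-stress.md",
    "DAY2_PROMPT": "skills/pipeline/eval-day2.md",
}


def build_agent_prompt(role_prompt: str, role_level: str,
                       focus_areas: str, exclusions: str) -> str:
    """Embed a role prompt and the Step 1 answers in one agent prompt."""
    return (
        "<role_prompt>\n"
        f"{role_prompt}\n"
        "</role_prompt>\n"
        "<task>\n"
        "Evaluate the codebase in the current working directory.\n"
        f"Role level: {role_level}\n"
        f"Focus areas: {focus_areas}\n"
        f"Exclusions: {exclusions}\n"
        "</task>"
    )


def build_all_prompts(skill_root: Path, role_level: str,
                      focus_areas: str, exclusions: str) -> dict[str, str]:
    """Read each role prompt file and return one full prompt per evaluator.

    skill_root is a hypothetical base directory containing skills/pipeline/.
    """
    return {
        name: build_agent_prompt(
            (skill_root / rel_path).read_text(encoding="utf-8").strip(),
            role_level, focus_areas, exclusions,
        )
        for name, rel_path in ROLE_PROMPT_FILES.items()
    }
```

Reading the files in the orchestrator and passing their contents inline reflects the constraint stated earlier: agents cannot access skill directory files themselves.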
Verify each evaluator's output contains its completion signal before proceeding:
EVAL_HIRE_COMPLETE
EVAL_STRESS_COMPLETE
EVAL_DAY2_COMPLETE
If any signal is missing, the agent's output may have been truncated. Report the incomplete evaluator to the user and do NOT write eval.md with partial data.
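The completion check can be sketched as a plain substring test per evaluator. The short keys `hire`, `stress`, and `day2` and the function name `incomplete_evaluators` are assumptions chosen for this example; only the signal strings come from the text above.

```python
# Completion signals from the source, keyed by a hypothetical evaluator name.
COMPLETION_SIGNALS = {
    "hire": "EVAL_HIRE_COMPLETE",
    "stress": "EVAL_STRESS_COMPLETE",
    "day2": "EVAL_DAY2_COMPLETE",
}


def incomplete_evaluators(outputs: dict[str, str]) -> list[str]:
    """Return the evaluators whose output lacks its completion signal.

    An evaluator with no output at all is treated as incomplete.
    """
    return [
        name
        for name, signal in COMPLETION_SIGNALS.items()
        if signal not in outputs.get(name, "")
    ]
```

An empty result means all three signals were found; any other result names the evaluators to report to the user instead of writing eval.md.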
If all signals are present, write docs/plans/YYYY-MM-DD-eval-slug/eval.md:
---
type: repo-eval
target: 9
role_level: [from Step 1]
date: YYYY-MM-DD
pillar_overrides:
# Pillars with custom thresholds (omit for default 9)
# creativity: 7
# git_hygiene: accept
---
# Repo Evaluation: [repo name]
## Configuration
- **Role Level:** [Junior | Mid | Senior | Staff+]
- **Focus Areas:** [list]
- **Exclusions:** [list]
## Combined Scorecard
| # | Lens | Pillar | Score | Target | Status |
|---|------|--------|-------|--------|--------|
| 1 | Hire | Problem-Solution Fit | X/10 | 9 | [PASS if ≥ target, NEEDS WORK if < target] |
| 2 | Hire | Architecture | X/10 | ... |
| 3 | Hire | Code Quality | X/10 | ... |
| 4 | Hire | Creativity | X/10 | ... |
| 5 | Stress | Pragmatism | X/10 | ... |
| 6 | Stress | Defensiveness | X/10 | ... |
| 7 | Stress | Performance | X/10 | ... |
| 8 | Stress | Type Rigor | X/10 | ... |
| 9 | Day 2 | Test Value | X/10 | ... |
| 10 | Day 2 | Reproducibility | X/10 | ... |
| 11 | Day 2 | Git Hygiene | X/10 | ... |
| 12 | Day 2 | Onboarding | X/10 | ... |
**Pillars at target (≥9):** N/12
**Pillars needing work (<9):** M/12
## Hire Evaluation — The Pragmatist
[Full evaluator output]
## Stress Evaluation — The Oncall Engineer
[Full evaluator output]
## Day 2 Evaluation — The Team Lead
[Full evaluator output]
## Consolidated Remediation Targets
[Merged and deduplicated targets from all 3 evaluators, prioritized by:
1. Lowest score first
2. Highest complexity last
3. Overlapping findings consolidated]
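The Status column and the "lowest score first" ordering in the template above could be computed along these lines. This is a sketch under stated assumptions: an override of `accept` is taken to pass at any score, override keys are taken to match the snake_case names in the frontmatter, and complexity is left out of the ordering because the template does not define how it is measured.

```python
DEFAULT_TARGET = 9  # default threshold from the eval.md frontmatter


def pillar_status(score: int, override=None) -> str:
    """Return PASS or NEEDS WORK for one pillar."""
    # Assumption: an override of "accept" passes the pillar at any score.
    if override == "accept":
        return "PASS"
    target = DEFAULT_TARGET if override is None else int(override)
    return "PASS" if score >= target else "NEEDS WORK"


def summarize(scores: dict[str, int], overrides: dict[str, object]):
    """Return (count at target, pillars needing work sorted lowest score first)."""
    needs_work = [
        (name, score)
        for name, score in scores.items()
        if pillar_status(score, overrides.get(name)) == "NEEDS WORK"
    ]
    needs_work.sort(key=lambda item: item[1])  # lowest score first
    return len(scores) - len(needs_work), needs_work
```

For instance, with `creativity: 7` and `git_hygiene: accept` as overrides, a Creativity score of 7 and any Git Hygiene score would count as at target, while an 8 on a pillar without an override would still need work.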
Append an entry to .claude/skill-runs.json in the repo root. If the file does not exist, create it with an empty array first.
{
"skill": "repo-eval",
"date": "YYYY-MM-DD",
"plan": "YYYY-MM-DD-eval-slug"
}
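A minimal sketch of the append step, assuming the file holds a single JSON array of entries shaped like the one above. The function name `append_skill_run` and its parameters are illustrative.

```python
import json
from pathlib import Path


def append_skill_run(repo_root: Path, run_date: str, plan: str) -> list[dict]:
    """Append a repo-eval entry to .claude/skill-runs.json and return all entries."""
    log_path = repo_root / ".claude" / "skill-runs.json"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with an empty array first if it does not exist.
    if not log_path.exists():
        log_path.write_text("[]\n", encoding="utf-8")
    runs = json.loads(log_path.read_text(encoding="utf-8"))
    runs.append({"skill": "repo-eval", "date": run_date, "plan": plan})
    log_path.write_text(json.dumps(runs, indent=2) + "\n", encoding="utf-8")
    return runs
```

The file is read and rewritten as a whole, which preserves entries written by earlier runs but is not safe against two skills writing at the same moment.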
Evaluation complete: docs/plans/YYYY-MM-DD-eval-slug/eval.md
Scores: [N]/12 pillars at target (≥9)
Lowest: [pillar] at [X]/10
To remediate and bring all pillars to 9/10, run:
/pipeline YYYY-MM-DD-eval-slug
/pipeline after all remediation is complete.