Scores a target against a rubric and iteratively improves it until all axes score 8.0 or higher. Each loop re-scores from scratch, selects the highest-leverage axis, attacks it, and verifies.
Slash command: /citadel:improve
Use when: Scoring a target against a rubric and iteratively improving it. Rubric required at .planning/rubrics/{target}.md (Phase 0 creates one if missing).
Don't use when: Refactoring without a rubric (use /refactor), one-time code review (use /review), or debugging a specific bug (use /systematic-debugging).
/improve {target} # Loop until plateau or all axes >= 8.0
/improve {target} --n=3 # Run exactly N loops then stop
/improve {target} --axis={name} # Force-attack a specific axis (skips scoring)
/improve {target} --score-only # Score and report, no attack
/improve {target} --continue # Resume from campaign state (used by daemon)
/improve citadel # Targets Citadel itself
target is a slug that maps to .planning/rubrics/{target}.md.
If no rubric exists, run Phase 0 first.
When invoked with --n or --continue, improve operates in campaign mode and maintains a campaign file that daemon can attach to.
Campaign file: .planning/campaigns/improve-{target}.md, created automatically on the first invocation with --n (full template: docs/QUALITY_LOOPS.md#campaign-file-template).
Frontmatter: version, id (improve-{target}-{ISO-date-slug}), status: active, type: improve, target, total_loops ({n} or unlimited), completed_loops: 0, current_level (from rubric frontmatter), estimated_cost_per_loop: 12, started.
Body: status and direction lines, a Loop History table (Loop | Axis Attacked | Outcome | Score Movement), and a Continuation State block (next_loop, last_scorecard_log, last_outcome, phase_within_loop, level_up_triggered).
Update phase_within_loop at each phase: scoring → selected-{axis} → attacking-{axis} → verifying → not-started.
On loop complete: increment completed_loops, update next_loop/last_scorecard_log/last_outcome, append Loop History row.
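The loop-complete bookkeeping above can be sketched as a small pure function. This is a minimal illustration, not Citadel's implementation: the `state` dict, the `complete_loop` helper, and the rule that `next_loop` equals `completed_loops + 1` are assumptions made for the example. Only the field names come from the campaign file description.

```python
from copy import deepcopy


def complete_loop(state, history, axis, outcome, movement, scorecard_log):
    """Return an updated (state, history) pair after one finished loop.

    `state` is a hypothetical dict mirroring the Continuation State fields;
    `history` is a list of Loop History rows. Inputs are not mutated.
    """
    new_state = deepcopy(state)
    new_state["completed_loops"] += 1
    # Assumption: loops are numbered from 1, so the next loop follows the count.
    new_state["next_loop"] = new_state["completed_loops"] + 1
    new_state["last_scorecard_log"] = scorecard_log
    new_state["last_outcome"] = outcome
    # A finished loop leaves the phase marker at its resting value.
    new_state["phase_within_loop"] = "not-started"
    row = {
        "Loop": new_state["completed_loops"],
        "Axis Attacked": axis,
        "Outcome": outcome,
        "Score Movement": movement,
    }
    return new_state, history + [row]
```

Returning new objects rather than editing in place keeps the previous state available if the write to the campaign file fails.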
On the --continue flag:
- Read .planning/campaigns/improve-{target}.md; error if it is missing or its status is not active.
- If completed_loops >= total_loops: mark completed, exit.
- If phase_within_loop is not not-started: restart the current loop from Phase 1 (it was interrupted mid-loop).
- Load last_scorecard_log for delta comparison, then run Phase 1 onwards.
Phase 0 runs only when .planning/rubrics/{target}.md does not exist.
- Read .planning/research/ if available.
- Run /research --parallel to survey comparable products if no research exists.
- Record the approval answer as rubric_approved: {answer}.
- Write the rubric to .planning/rubrics/{target}.md.
In Phase 1, score every axis in the rubric. No shortcuts. No cached scores from the previous loop.
Execute the programmatic verification steps from the rubric. A programmatic failure caps that axis at 5 regardless of evaluator scores. Record raw results: which checks passed, which failed, what the failure was.
Execute the structural checks from each axis's verification spec: file path existence, frontmatter schema consistency, benchmark coverage ratios, link rot, and cross-reference accuracy (check descriptions: docs/QUALITY_LOOPS.md#structural-check-types).
Spawn three evaluator agents in parallel. Each receives the rubric with all axis definitions and anchors, read access to the target, its persona (A/B/C as defined in the rubric's Scoring Protocol), and the instruction to score every axis 0-10 with a one-sentence justification per axis (input list: docs/QUALITY_LOOPS.md#evaluator-panel).
Each evaluator scores independently. Axes on which the evaluators disagree are flagged needs-refinement; they are logged but still scored. Do not halt on evaluator disagreement.
Compile a table with columns Axis | A | B | C | Prog | Final | Delta | Flag (layout: docs/QUALITY_LOOPS.md#scorecard-format).
Final = min(A, B, C), then apply programmatic cap (sets Flag=cap). Delta = current − prior loop score (empty on loop 1).
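As a rough sketch of the Final and Delta rules above: the function name, argument names, and return shape below are hypothetical. The example also assumes that the cap flag is set whenever the programmatic check failed, even if the evaluator minimum was already at or below 5.

```python
PROGRAMMATIC_CAP = 5.0  # a programmatic failure caps the axis at 5 (from the text)


def final_score(a, b, c, prog_passed, prior=None):
    """Combine three evaluator scores into one scorecard row.

    Final is the minimum of the three scores, capped at 5 when the
    programmatic check failed. Delta is None on loop 1 (no prior score).
    """
    final = min(a, b, c)
    flag = ""
    if not prog_passed:
        final = min(final, PROGRAMMATIC_CAP)
        flag = "cap"
    delta = None if prior is None else round(final - prior, 2)
    return {"Final": final, "Delta": delta, "Flag": flag}
```

Taking the minimum rather than the mean means a single sceptical evaluator holds the axis down, which matches the instruction to use the minimum score on disagreement.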
Choose the single axis to attack this loop.
Selection formula:
score(axis) = (10 - current_score) × weight × effort_multiplier × recency_penalty
effort_multiplier: low = 1.0, medium = 0.7, high = 0.4
recency_penalty: 0.5 if attacked in the previous 2 loops, otherwise 1.0
If the --axis flag was set, skip selection and attack the specified axis.
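The selection formula can be written out directly. In this sketch the multipliers and the recency penalty come from the text; the `axes` dict layout, the helper names, and the example weights are assumptions for illustration only.

```python
EFFORT_MULTIPLIER = {"low": 1.0, "medium": 0.7, "high": 0.4}


def selection_score(current_score, weight, effort, attacked_recently):
    """(10 - current_score) x weight x effort_multiplier x recency_penalty."""
    recency_penalty = 0.5 if attacked_recently else 1.0
    return (10 - current_score) * weight * EFFORT_MULTIPLIER[effort] * recency_penalty


def select_axis(axes, forced_axis=None):
    """Pick the axis with the highest selection score.

    `axes` maps axis name -> keyword arguments for selection_score.
    `forced_axis` models the --axis flag, which skips selection entirely.
    """
    if forced_axis is not None:
        return forced_axis
    return max(axes, key=lambda name: selection_score(**axes[name]))


# Hypothetical scores and weights, for illustration only.
axes = {
    "documentation_accuracy": {
        "current_score": 6.0, "weight": 1.0, "effort": "low", "attacked_recently": True,
    },
    "error_recovery": {
        "current_score": 7.0, "weight": 1.5, "effort": "medium", "attacked_recently": False,
    },
}
chosen = select_axis(axes)
```

Here documentation_accuracy has the larger gap to 10, but the recency penalty halves its score, so error_recovery is selected.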
Announce the selection:
Selected: {axis_name} (score: {n}/10, weight: {w}, effort: {e}, selection score: {s})
Rationale: {one sentence on why this axis now, not another}
Execute the improvement. Dispatch strategy depends on the axis category (expanded per-category playbooks: docs/QUALITY_LOOPS.md#attack-dispatch-strategies).
ISOLATION MANDATE: When dispatching to /experiment, /fleet, or /research --parallel, always use the Agent tool with isolation: "worktree". Sub-agents in worktrees get their own context windows; the orchestrator only receives their HANDOFF results.
| Category | Dispatch | Verification |
|---|---|---|
| technical | /experiment with before/after comparison; speculative worktrees (Agent + isolation: "worktree") for approaches that might conflict | node scripts/run-with-timeout.js 300 node scripts/test-all.js as the oracle |
| documentation | direct: read current docs, fix specific gaps; cross-reference every claim against source | structural verification before committing |
| experience | structural fixes + doc updates; run the actual install flow in a clean temp dir; inject synthetic failures per the programmatic spec | /qa |
| positioning | /research to verify the competitive landscape is accurate, then update README/FAQ/demo copy | /qa confirms the updated page renders |
| presentation | targeted changes per rubric anchors (no rewrites unless score is below 3) | /live-preview or /qa confirms visual changes render |
| security | read the specific hooks/scripts involved, make targeted code changes | run the rubric's programmatic verification steps directly |
Artifact archiving: when the attack tried multiple approaches, write a decision record to the loop log: APPROACH COMPARISON: [approach A] vs [approach B] — winner: [A] because [reason].
After the attack, re-score only the targeted axis (not full re-score).
Run the four verification tiers from the rubric for the targeted axis:
/do command.
onboarding_friction, error_recovery, documentation_accuracy, command_discoverability
PASS {wall_time} or FAIL at step {n}: {what broke}
(visual_coherence, api_surface_consistency)
Regression check (run on all axes, not just targeted):
On abort: revert the changes, log the failure, treat as "no improvement this loop".
On pass: commit the changes with a descriptive message.
Write the loop log. Always. Even on abort.
Log path: .planning/improvement-logs/{target}/loop-{n}.md
Required sections (full template: docs/QUALITY_LOOPS.md#loop-log-template):
- APPROACH COMPARISON record if multiple approaches were tried
- Behavioral simulation: PASS {wall_time} | FAIL at step {n}: {reason} | SKIPPED
- PROPOSED AXIS: {name} | Rationale | Category | Weight | Anchors: 0=... 5=... 10=... (or: None proposed this loop.)
All proposals go to .planning/rubrics/{target}-proposals.md. Never to the live rubric.
Exit conditions (check in order):
- --n flag was set and N loops have completed: exit, report scorecard
On Level-Up: do not exit. Escalate. See Level-Up Protocol section.
On ceiling (all >= 8.0): report the final scorecard and recommend a Level-Up run.
On normal loop: return to Phase 1. Re-score everything from scratch.
Campaign mode exit handling:
- status: completed, move to completed/
- status: completed, move to completed/
- status: level-up-pending (daemon will pause, not retry)
- status: parked
- status: parked with reason
- status: paused
The Level-Up Protocol triggers when no axis improved by more than 0.5 in the last 2 consecutive loops, no programmatic cap is active, and at least 3 loops have completed.
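The trigger condition reads naturally as a predicate over recent loop results. The following is a sketch under stated assumptions: `loop_deltas` is a hypothetical list of per-loop dicts mapping axis name to score delta, oldest first, and `cap_active` stands for "any programmatic cap is currently applied".

```python
def level_up_triggered(loop_deltas, cap_active, threshold=0.5, window=2, min_loops=3):
    """Return True when the level-up trigger conditions all hold.

    Conditions from the text: at least 3 loops completed, no programmatic
    cap active, and no axis improved by more than 0.5 in the last 2 loops.
    """
    if len(loop_deltas) < min_loops or cap_active:
        return False
    recent = loop_deltas[-window:]
    return all(delta <= threshold for loop in recent for delta in loop.values())
```

A delta of exactly 0.5 does not count as improvement here, because the rule is "improved > 0.5".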
Step 1: Freeze the snapshot
Write .planning/rubrics/{target}-level-{n}-final.md with: date, loops completed, final scorecard, axes at ceiling (≥9.0 — their 10 anchors become Level {n+1}'s 5 anchors), and axes that plateaued below 9.0 with why.
Step 2: Write proposals
For each axis: propose Level {n+1} re-anchoring (current 10 → new 5, propose new 10). For plateaued axes: re-anchor, replace with measurable proxy, or retire.
Auto-include these three process axes if not already in the rubric: decomposition_quality, scope_appropriateness, verification_depth.
Write to .planning/rubrics/{target}-proposals.md: re-anchored axes (current 10 anchor, proposed 0/5/10), proposed new axes, axes proposed for retirement.
Step 3: Halt -- human approval required
Do not self-approve. Do not continue looping.
In campaign mode: set status: level-up-pending, set level_up_triggered: true, and write awaiting: human approval of level-up proposals to Continuation State.
Report: what was achieved at this level (scorecard summary), the proposals file location, and what the expected new gains look like at the next level.
The loop resumes only when the human edits the live rubric with approved proposals
and sets the campaign status back to active. Level {n+1} loops continue incrementing
the loop number (they do not reset to 1).
Step 4: Historical context for future evaluators
When the loop resumes after a level-up, every evaluator in Phase 1c receives the level-{n}-final.md snapshot as a reference baseline, plus the instruction: "Scores from the previous level are the floor. A score of 5 at Level 2 means you have reached what was the ceiling at Level 1."
- needs-refinement, use minimum score, continue.
- --continue + no campaign file: error, suggest --n.
- --continue + level-up-pending: halt, point to proposals file, require human approval then status: active.
- --continue + completed: do not resume, report final scorecard.
- --n + existing active campaign: treat as --continue. If completed/parked: new campaign, incremented slug.
- .planning/rubrics/{target}-proposals.md only. Human approval required.
- status: level-up-pending, not parked or active.
Disclosure: State loop count, target, per-loop cost (~$12), total estimate. For --continue: loops remaining and spend so far. For unlimited: state exit conditions (plateau or all axes >= 8.0).
Reversibility: Green = --score-only | Amber = standard loops (each commits separately) | Red = level-up (rewrites rubric anchors permanently). Red requires explicit confirmation.
Proportionality: No rubric + no explicit request → suggest /review. All axes > 8.0 + --n=1 → suggest --axis. Cost > $50 → confirm.
Trust gating: Novice (0-4): --score-only / --n=1 only. Familiar (5-19): up to --n=5. Trusted (20+): no cap; confirm unlimited or cost > $50.
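The trust tiers and the $50 confirmation threshold combine into a simple gate. The text does not say what the 0-4 / 5-19 / 20+ numbers count, so `trust_count` below is a hypothetical parameter. The `gate` function, its return values, and the choice to deny unlimited runs for the Familiar tier are likewise assumptions for this sketch.

```python
COST_PER_LOOP = 12  # per-loop estimate in dollars (from the text)
CONFIRM_ABOVE = 50  # cost above which confirmation is required (from the text)


def gate(trust_count, n_loops=None, score_only=False):
    """Return 'allow', 'confirm', or 'deny' for a requested run.

    n_loops=None models an unlimited run (no --n flag).
    """
    if score_only:
        return "allow"  # --score-only is open to every tier
    if trust_count <= 4:  # Novice: --score-only or --n=1 only
        return "allow" if n_loops == 1 else "deny"
    if trust_count <= 19:  # Familiar: up to --n=5
        if n_loops is None or n_loops > 5:
            return "deny"
    if n_loops is None:  # Trusted, unlimited: always confirm
        return "confirm"
    return "confirm" if n_loops * COST_PER_LOOP > CONFIRM_ABOVE else "allow"
```

Under these assumptions a Familiar user asking for --n=5 is within the tier limit but still gets a confirmation prompt, since 5 loops at about $12 each exceeds $50.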
---HANDOFF---
- Target: {target} — Loop {n} of {n_total or "∞"} — Level {current_level}
- Outcome: {improved | plateau | ceiling | aborted | n-complete | level-up-triggered}
- Score movement: {axis} {before} → {after} (+{delta})
- Behavioral simulation: {PASS {wall_time} | FAIL | SKIPPED}
- Proposed rubric additions: {count} — written to .planning/rubrics/{target}-proposals.md
- Loop log: .planning/improvement-logs/{target}/loop-{n}.md
- Reversibility: amber -- each loop commits separately, revert individual loops with git revert
- Next recommended axis: {axis_name} (if not exiting)
- Level-up snapshot: .planning/rubrics/{target}-level-{n}-final.md (if level-up triggered)
---
npx claudepluginhub sethgammon/citadel --plugin citadel