Skill

supervise-loop

This skill should be used when the user invokes "/supervise-loop", or wants an autonomous supervisor/worker iteration loop — phrased as "supervise a loop", "delegate this and review until it's perfect", "iterate until done without asking me", "evaluator-optimizer loop", "critic loop", "boss agent that supervises a worker agent", "loop until the supervisor is satisfied", or in Chinese "主 Agent 监督子 Agent 反复做到完美 / 一直循环到没问题 / 别问我自己决定 / 一个子Agent干活另一个独立子Agent当评委审 / 评估器-优化器循环 / 定验收标准反复迭代到通过". The main agent acts as a SUPERVISOR that never does the work itself: it turns the goal into a checkable acceptance rubric, delegates to worker sub-agent(s), grades the result with mechanical checks plus a SEPARATE critic sub-agent, and re-dispatches with feedback round after round — WITHOUT asking the human between rounds — until the rubric passes or a hard cap is hit. Use this — NOT the built-in /loop (which only re-runs one command on a fixed time interval, no review) and NOT ralph-loop (which re-feeds the SAME prompt to ONE agent in-session with no independent critic). The distinguishing feature is a SEPARATE evaluator agent grading against an explicit rubric.

Popularity

Parent stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/supervise-loop:supervise-loop

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Run a goal through an autonomous **supervisor → worker → critic** iteration loop. The user states one goal once; the loop drives it to "done" across as many rounds as needed, with no human prompting between rounds.

Supporting Files

references/prompts.mdreferences/rubric-guide.mdreferences/termination.mdscripts/init-run.sh

SKILL.md

100 lines · ~3.1k tokens

Stats

LanguageShell

Parent stars1

MaintenanceGood

Last CommitJun 16, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Supervise Loop

Run a goal through an autonomous supervisor → worker → critic iteration loop. The user states one goal once; the loop drives it to "done" across as many rounds as needed, with no human prompting between rounds.

This is the evaluator-optimizer pattern (Anthropic, Building Effective Agents) fused with orchestrator-workers, hardened with the anti-thrash guardrails that grassroots loops (Ralph) lack. The load-bearing idea: the agent that judges the work is never the agent that did the work.

The one rule that makes this work

The supervisor (the main agent reading this skill) does NOT produce the deliverable. It only: builds the rubric, dispatches workers, runs mechanical checks, runs a critic, decides pass/revise/stop, and folds feedback forward. If tempted to "just fix it myself," stop — that collapses the separation that prevents the model from rubber-stamping its own output. (The one allowed exception is the state/checkpoint operation in Round 0 step 5 and regression handling — committing or reverting files is bookkeeping, not authoring.)

Execution model — how the loop actually runs

There is no external runtime driving rounds. You, the top-level agent, ARE the loop. Execute every round sequentially within this single turn — call the worker, run the mechanical checks, call the critic, decide, then call the next worker — and do not yield the turn until a terminal state is reached (APPROVED / MAX_ROUNDS_REACHED / NOT_VERIFIED / STUCK). Do not run one round and stop as if something else will continue it; nothing will.

Preflight (before Round 0): confirm a sub-agent dispatch tool is available — the Agent tool (with subagent_type: general-purpose) in a normal Claude Code session, or Task. This loop must run from the top-level session, because a sub-agent cannot itself spawn the worker/critic sub-agents (dispatch is one level deep). If no dispatch tool exists, STOP and tell the user — do not silently collapse the roles and do the work yourself (that breaks the one rule). Fallback when no in-session dispatch tool is available: run each worker/critic as a headless claude -p '<filled prompt>' subprocess via Bash and capture stdout.

Roles

Role	Who	Job	Must NOT
Supervisor	Main agent (you)	Route, decompose, grade, decide, loop, report	Write the deliverable; judge by vibes
Worker	Sub-agent via Agent tool	Do the actual work, persist to files, report what changed	Declare itself "done"
Critic	SEPARATE sub-agent via Agent tool	Adversarially score the deliverable against the rubric	Contribute fixes; be lenient

One level deep only — workers and critics never spawn their own sub-agents.

The loop

Round 0 — Setup (do this once):

Parse the goal and any flags (see Flags).
Operationalize "done" into a rubric — a numbered list of discrete, individually checkable items. Tag each [mech] (maps to a real command: test / typecheck / lint / build / run) or [judge] (subjective, scored by the critic). This is the crux; see references/rubric-guide.md. Without a concrete rubric the loop can never terminate honestly.
Decide worker mode: a single worker for a focused task, or decompose into parallel sub-tasks (multiple Agent calls in one message) when the goal has independent parts. Default: judge per task.
Initialize the run state so progress is visible and each round starts from facts, not memory. Run the bundled scripts/init-run.sh "<short-slug>" from this skill's own scripts/ directory (resolve its absolute path first — ${CLAUDE_PLUGIN_ROOT}/skills/supervise-loop/scripts/ when installed as a plugin, or the personal-skill dir; if the path is unknown, just Write the STATE.md / RUBRIC.md / MEMORY.md files directly instead). The script prints an absolute run-dir path. Store that path and reuse it for every STATE.md / RUBRIC.md / MEMORY.md edit (cwd is not stable between Bash calls — always use the absolute path). Tell the user once: "watch progress live in <abs-run-dir>/STATE.md". Record in STATE.md: goal, rubric, round=0, mode, terminal_state=RUNNING.
Establish a checkpoint baseline so regressions can be undone. If cwd is a git repo, note the current commit (or git stash-clean state). If it is not a git repo (common — verify with git rev-parse), do NOT assume git: either git init a scratch repo for the work, or snapshot the files the worker will touch into <run-dir>/snapshots/round-0/. "Revert to last good checkpoint" later means restoring from whichever mechanism was set up here.

Each round N:

Dispatch the worker(s) via the Agent tool. Pass ONLY: the goal, the rubric (as the target), the latest critique, and an accumulated short memory of what prior rounds already tried (so mistakes aren't repeated). Strip prior chatter — irrelevant context degrades reliability. Use the template in references/prompts.md.
Verify mechanically — the supervisor itself runs every [mech] rubric item as a real command (Bash). This is the primary ground truth. A [mech] item that fails is an automatic REVISE regardless of what anyone claims. A [mech] item that cannot run at all (missing tool/env — distinct from failing) is neither pass nor fail: if the worker cannot install/provide it within budget, terminate NOT_VERIFIED immediately, naming the check — do not keep looping blind.
Critique — spawn a SEPARATE critic sub-agent (fresh, adversarial) with the rubric as its scoring sheet and the worker's changed files/output. It returns a structured verdict PASS | REVISE | FAIL plus per-item findings (what's wrong / why / impact / fix). Template in references/prompts.md. Skip only if --no-critic.
Record the round in STATE.md — fill the row: round number, items_pass/total as an integer count, verdict, the must-fix findings, and a one-line iteration-specific progress note. The integer count and the supervisor-computed change fingerprint (next step) are what the anti-thrash detectors compare across rounds, so they must actually be written down each round.
Decide (see references/termination.md for the full logic):
- All [mech] pass AND critic == PASS → APPROVED. Exit loop.
- Otherwise → fold findings into the next round's feedback and continue — but first check the backstops below.

Backstops (checked before every new round — full logic in references/termination.md):

Hard cap — --max-rounds is the global round count (default 5). On cap, return the best partial result, not nothing.
No-improvement / plateau — compare the recorded items_pass integers: if the count hasn't risen across the plateau window (default max(2, max-rounds − 2) rounds), treat as stuck. The supervisor computes consecutive-round output similarity itself (diff the changed files — do NOT trust the worker's self-described "iteration-specific changes"). Track which specific failing items moved, not just the aggregate, so toggling an easy item doesn't mask a stuck hard one.
Persistent FAIL — if the critic returns FAIL (fundamentally wrong approach) for ≥2 consecutive rounds, escalate; don't keep re-dispatching the same dead approach.
Regression — regressions have a single owner: the [mech] No regressions rubric item makes the round FAIL through the normal pass/fail machinery, and the supervisor restores the broken file from the Round-0 checkpoint and tells the next worker "do not touch X." Two consecutive revert cycles on the same item count as stuck.
Pivot-on-stuck — on the first stuck trigger (any cause), set has_pivoted=true and give the next worker a one-shot ~2-round diagnosis-first pivot: DIAGNOSE the root cause, then try a fundamentally different approach. The pivot round is exempt from auto-revert (compare net item delta instead).
Escalate — if still stuck after the pivot (or on persistent FAIL), stop and hand the user the best partial + a diagnosis of the ceiling. (With --gate, pause here for a human decision.)

Terminate — always report one explicit terminal state: APPROVED · MAX_ROUNDS_REACHED · NOT_VERIFIED (changes made, a [mech] check couldn't run) · STUCK (escalated). Then deliver: the final artifact, the rubric scorecard (every item + pass/fail), and the round log.

Operationalizing "done"

"Perfect" / "no problems" is not checkable until it is decomposed. Turn the goal into rubric items that are each either a command that exits 0, or a yes/no a critic can score against a stated standard. Grade each [judge] dimension with its OWN critic question, not one blob score. Full method + cross-domain examples (code, a web app, a video script): references/rubric-guide.md.

If the user supplied acceptance criteria, use them verbatim as the rubric. If not, the supervisor drafts the rubric and proceeds autonomously (the whole point is not pestering the user) — unless --gate is set, in which case show the rubric once for approval before round 1.

Anti-patterns to refuse

No placeholder / stub deliverables. Models drift toward "it compiles" minimalism. Make "full implementation, no TODOs/stubs" a top rubric item, and have the critic re-detect stubs and turn them back into open items rather than trusting a "done" claim.
No self-grading. The worker never gets to declare completion; only mechanical checks + the separate critic do.
No infinite polishing. The critic flags only gaps that affect correctness or a stated rubric item; everything else is "optional, do not block."
No mid-loop questions. Default is fully autonomous. Decide and proceed; surface choices only at the end or when --gate / escalation triggers.

Flags

--max-rounds N — hard ceiling (default 5).
--workers single|auto|N — dispatch mode (default auto: supervisor decides).
--gate — human approval at phase boundaries (rubric, and before final sign-off). Default OFF.
--no-critic — mechanical checks only, skip the LLM critic (faster, weaker; use when the rubric is fully [mech]).
--rubric "..." — supply acceptance criteria directly.

Quick start

/supervise-loop Build a working CLI todo app: add/list/done/delete, persists to JSON, all commands have passing tests, README with usage. Don't ask me — loop until it's solid.

/supervise-loop --max-rounds 8 --gate Rewrite my landing page copy until it's punchy, on-brand, and every claim is backed. Show me the rubric first.

The supervisor builds the rubric, delegates, grades, and keeps re-dispatching until the scorecard is green or it hits the cap — then reports the terminal state and the artifact.

Resources

references/rubric-guide.md — turn a vague goal into a checkable, mechanically-grounded rubric (with examples).
references/prompts.md — copy-paste worker-dispatch and critic prompt templates + the verdict contract.
references/termination.md — full pass/revise/stop decision logic, anti-thrash math, terminal states, per-phase caps.
scripts/init-run.sh — scaffold a run directory with STATE.md / RUBRIC.md / MEMORY.md templates.

supervise-loop

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

supervise-loop

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

Supervise Loop

The one rule that makes this work

Execution model — how the loop actually runs

Roles

The loop

Operationalizing "done"

Anti-patterns to refuse

Flags

Quick start

Resources

Similar Skills

Supervise Loop

The one rule that makes this work

Execution model — how the loop actually runs

Roles

The loop

Operationalizing "done"

Anti-patterns to refuse

Flags

Quick start

Resources

Similar Skills