From supervise-loop
This skill should be used when the user invokes "/supervise-loop", or wants an autonomous supervisor/worker iteration loop — phrased as "supervise a loop", "delegate this and review until it's perfect", "iterate until done without asking me", "evaluator-optimizer loop", "critic loop", "boss agent that supervises a worker agent", "loop until the supervisor is satisfied", or in Chinese "主 Agent 监督子 Agent 反复做到完美 / 一直循环到没问题 / 别问我自己决定 / 一个子Agent干活另一个独立子Agent当评委审 / 评估器-优化器循环 / 定验收标准反复迭代到通过". The main agent acts as a SUPERVISOR that never does the work itself: it turns the goal into a checkable acceptance rubric, delegates to worker sub-agent(s), grades the result with mechanical checks plus a SEPARATE critic sub-agent, and re-dispatches with feedback round after round — WITHOUT asking the human between rounds — until the rubric passes or a hard cap is hit. Use this — NOT the built-in /loop (which only re-runs one command on a fixed time interval, no review) and NOT ralph-loop (which re-feeds the SAME prompt to ONE agent in-session with no independent critic). The distinguishing feature is a SEPARATE evaluator agent grading against an explicit rubric.
Slash command: `/supervise-loop:supervise-loop`
Run a goal through an autonomous **supervisor → worker → critic** iteration loop. The user states one goal once; the loop drives it to "done" across as many rounds as needed, with no human prompting between rounds.
This is the evaluator-optimizer pattern (Anthropic, Building Effective Agents) fused with orchestrator-workers, hardened with the anti-thrash guardrails that grassroots loops (Ralph) lack. The load-bearing idea: the agent that judges the work is never the agent that did the work.
The supervisor (the main agent reading this skill) does NOT produce the deliverable. It only: builds the rubric, dispatches workers, runs mechanical checks, runs a critic, decides pass/revise/stop, and folds feedback forward. If tempted to "just fix it myself," stop — that collapses the separation that prevents the model from rubber-stamping its own output. (The one allowed exception is the state/checkpoint operation in Round 0 step 5 and regression handling — committing or reverting files is bookkeeping, not authoring.)
There is no external runtime driving rounds. You, the top-level agent, ARE the loop. Execute every round sequentially within this single turn — call the worker, run the mechanical checks, call the critic, decide, then call the next worker — and do not yield the turn until a terminal state is reached (APPROVED / MAX_ROUNDS_REACHED / NOT_VERIFIED / STUCK). Do not run one round and stop as if something else will continue it; nothing will.
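The "you are the loop" rule above can be pictured as a plain sequential driver. This is a minimal sketch, not the skill's implementation: `drive_loop`, the `run_round` callable, and the non-terminal `"REVISE"` marker are hypothetical names introduced only for illustration; the four terminal states are the ones named in the text.

```python
from typing import Callable

# Terminal states named in this skill. "REVISE" (used below) is a hypothetical
# non-terminal marker meaning "run another round".
TERMINAL_STATES = {"APPROVED", "MAX_ROUNDS_REACHED", "NOT_VERIFIED", "STUCK"}


def drive_loop(run_round: Callable[[int], str], max_rounds: int = 5) -> tuple[str, int]:
    """Run rounds back to back until a terminal state is reached.

    run_round(n) stands in for one full round (worker, mechanical checks,
    critic, decision) and returns either a terminal state or "REVISE".
    """
    for round_no in range(1, max_rounds + 1):
        outcome = run_round(round_no)
        if outcome in TERMINAL_STATES:
            return outcome, round_no
    # The cap was hit without approval: report that explicitly.
    return "MAX_ROUNDS_REACHED", max_rounds


if __name__ == "__main__":
    # Toy round function that approves on the third round.
    print(drive_loop(lambda n: "APPROVED" if n >= 3 else "REVISE"))
```

The point of the sketch is that nothing outside the function resumes the loop: it either returns a terminal state or keeps going.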
Preflight (before Round 0): confirm a sub-agent dispatch tool is available — the Agent tool (with subagent_type: general-purpose) in a normal Claude Code session, or Task. This loop must run from the top-level session, because a sub-agent cannot itself spawn the worker/critic sub-agents (dispatch is one level deep). If no dispatch tool exists, STOP and tell the user — do not silently collapse the roles and do the work yourself (that breaks the one rule). Fallback when no in-session dispatch tool is available: run each worker/critic as a headless claude -p '<filled prompt>' subprocess via Bash and capture stdout.
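The headless fallback described above could look roughly like this when driven from Python rather than Bash. It assumes the `claude` CLI is on PATH and accepts `-p '<filled prompt>'` as the text states; the function name `run_headless`, the `base_cmd` parameter, and the timeout value are illustrative assumptions, not part of the skill.

```python
import subprocess
import sys
from typing import Sequence


def run_headless(
    prompt: str,
    base_cmd: Sequence[str] = ("claude", "-p"),  # CLI form taken from the text above
    timeout_s: float = 600.0,                    # assumed budget, not from the skill
) -> str:
    """Run one worker or critic as a subprocess and return its stdout."""
    result = subprocess.run(
        [*base_cmd, prompt],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,
    )
    if result.returncode != 0:
        raise RuntimeError(f"sub-agent exited {result.returncode}: {result.stderr.strip()}")
    return result.stdout


if __name__ == "__main__":
    # Stand-in command so the sketch runs without the real CLI installed:
    # a child Python process that simply echoes the prompt back.
    echo_cmd = (sys.executable, "-c", "import sys; print(sys.argv[1])")
    print(run_headless("filled worker prompt", base_cmd=echo_cmd))
```

Passing the prompt as a separate argv element avoids shell quoting problems with long, multi-line prompts.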
| Role | Who | Job | Must NOT |
|---|---|---|---|
| Supervisor | Main agent (you) | Route, decompose, grade, decide, loop, report | Write the deliverable; judge by vibes |
| Worker | Sub-agent via Agent tool | Do the actual work, persist to files, report what changed | Declare itself "done" |
| Critic | SEPARATE sub-agent via Agent tool | Adversarially score the deliverable against the rubric | Contribute fixes; be lenient |
One level deep only — workers and critics never spawn their own sub-agents.
Round 0 — Setup (do this once):
- Tag every rubric item as `[mech]` (maps to a real command: test / typecheck / lint / build / run) or `[judge]` (subjective, scored by the critic). This is the crux; see `references/rubric-guide.md`. Without a concrete rubric the loop can never terminate honestly.
- Run `scripts/init-run.sh "<short-slug>"` from this skill's own `scripts/` directory. Resolve its absolute path first: `${CLAUDE_PLUGIN_ROOT}/skills/supervise-loop/scripts/` when installed as a plugin, or the personal-skill dir; if the path is unknown, just Write the `STATE.md` / `RUBRIC.md` / `MEMORY.md` files directly instead. The script prints an absolute run-dir path. Store that path and reuse it for every `STATE.md` / `RUBRIC.md` / `MEMORY.md` edit (cwd is not stable between Bash calls, so always use the absolute path). Tell the user once: "watch progress live in `<abs-run-dir>/STATE.md`". Record in `STATE.md`: goal, rubric, round=0, mode, terminal_state=RUNNING.
- Checkpoint the starting state (for example, a `git stash`-clean state). If it is not a git repo (common; verify with `git rev-parse`), do NOT assume git: either `git init` a scratch repo for the work, or snapshot the files the worker will touch into `<run-dir>/snapshots/round-0/`. "Revert to last good checkpoint" later means restoring from whichever mechanism was set up here.

Each round N:
- Dispatch the worker. Template in `references/prompts.md`.
- Run every `[mech]` rubric item as a real command (Bash). This is the primary ground truth. A `[mech]` item that fails is an automatic REVISE regardless of what anyone claims. A `[mech]` item that cannot run at all (missing tool/env, which is distinct from failing) is neither pass nor fail: if the worker cannot install/provide it within budget, terminate NOT_VERIFIED immediately, naming the check. Do not keep looping blind.
- Dispatch the critic, which returns `PASS | REVISE | FAIL` plus per-item findings (what's wrong / why / impact / fix). Template in `references/prompts.md`. Skip only if `--no-critic`.
- Record the round: `items_pass/total` as an integer count, verdict, the must-fix findings, and a one-line iteration-specific progress note. The integer count and the supervisor-computed change fingerprint (next step) are what the anti-thrash detectors compare across rounds, so they must actually be written down each round.
- Decide (see `references/termination.md` for the full logic):
  - All `[mech]` pass AND critic == PASS → APPROVED. Exit loop.

Backstops (checked before every new round; full logic in `references/termination.md`):
- `--max-rounds` is the global round count (default 5). On cap, return the best partial result, not nothing.
- Compare `items_pass` integers: if the count hasn't risen across the plateau window (default max(2, max-rounds − 2) rounds), treat as stuck. The supervisor computes consecutive-round output similarity itself (diff the changed files; do NOT trust the worker's self-described "iteration-specific changes"). Track which specific failing items moved, not just the aggregate, so toggling an easy item doesn't mask a stuck hard one.
- If the critic returns `FAIL` (fundamentally wrong approach) for ≥2 consecutive rounds, escalate; don't keep re-dispatching the same dead approach.
- A failing `[mech]` "No regressions" rubric item makes the round FAIL through the normal pass/fail machinery, and the supervisor restores the broken file from the Round-0 checkpoint and tells the next worker "do not touch X." Two consecutive revert cycles on the same item count as stuck.
- On the first stuck signal, set `has_pivoted=true` and give the next worker a one-shot ~2-round diagnosis-first pivot: DIAGNOSE the root cause, then try a fundamentally different approach. The pivot round is exempt from auto-revert (compare net item delta instead). (With `--gate`, pause here for a human decision.)

Terminate. Always report one explicit terminal state:
APPROVED · MAX_ROUNDS_REACHED · NOT_VERIFIED (changes made, a [mech] check couldn't run) · STUCK (escalated). Then deliver: the final artifact, the rubric scorecard (every item + pass/fail), and the round log.
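The per-round decision and the plateau backstop can be sketched as two small functions. This is one reading of the rules above, not the logic in `references/termination.md`: the status strings `"pass"` / `"fail"` / `"cannot_run"`, the function names, and the exact plateau comparison are assumptions, and the sketch skips the "within budget" retry for checks that cannot run as well as the STUCK and pivot paths.

```python
from typing import Mapping, Sequence


def decide(mech: Mapping[str, str], critic_verdict: str, round_no: int, max_rounds: int = 5) -> str:
    """Map one round's results to APPROVED, NOT_VERIFIED, MAX_ROUNDS_REACHED or REVISE.

    mech maps each [mech] item name to "pass", "fail" or "cannot_run"
    (these three status strings are an assumption of this sketch).
    """
    # A check that cannot run at all is neither pass nor fail.
    if any(status == "cannot_run" for status in mech.values()):
        return "NOT_VERIFIED"
    all_mech_pass = all(status == "pass" for status in mech.values())
    if all_mech_pass and critic_verdict == "PASS":
        return "APPROVED"
    if round_no >= max_rounds:
        return "MAX_ROUNDS_REACHED"
    # Any failing [mech] item, or a non-PASS critic verdict, means another round.
    return "REVISE"


def plateau_window(max_rounds: int) -> int:
    """Default window from the text: max(2, max-rounds - 2)."""
    return max(2, max_rounds - 2)


def is_plateaued(items_pass_history: Sequence[int], max_rounds: int = 5) -> bool:
    """True when items_pass has not risen above its pre-window value."""
    window = plateau_window(max_rounds)
    if len(items_pass_history) <= window:
        return False  # not enough rounds recorded yet
    baseline = items_pass_history[-(window + 1)]
    return max(items_pass_history[-window:]) <= baseline


if __name__ == "__main__":
    print(decide({"tests": "pass", "lint": "fail"}, "PASS", round_no=2))
    print(is_plateaued([3, 3, 3, 3]))
```

Both functions take only values the supervisor has written down each round, which is why the integer `items_pass` count has to be recorded rather than inferred.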
"Perfect" / "no problems" is not checkable until it is decomposed. Turn the goal into rubric items that are each either a command that exits 0, or a yes/no a critic can score against a stated standard. Grade each [judge] dimension with its OWN critic question, not one blob score. Full method + cross-domain examples (code, a web app, a video script): references/rubric-guide.md.
If the user supplied acceptance criteria, use them verbatim as the rubric. If not, the supervisor drafts the rubric and proceeds autonomously (the whole point is not pestering the user) — unless --gate is set, in which case show the rubric once for approval before round 1.
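A rubric in the sense described above can be held as a small list of items, each either a command that must exit 0 or a single yes/no question for the critic. The `RubricItem` shape, its field names, and `run_mech` are hypothetical; the real templates live in `references/rubric-guide.md`, which this sketch does not reproduce.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RubricItem:
    name: str
    kind: str                                 # "mech" or "judge"
    command: Optional[Sequence[str]] = None   # argv for [mech] items
    question: Optional[str] = None            # one yes/no question per [judge] item


def run_mech(item: RubricItem) -> str:
    """Return "pass", "fail", or "cannot_run" for one [mech] item."""
    if item.kind != "mech" or not item.command:
        raise ValueError(f"{item.name} is not a runnable [mech] item")
    try:
        completed = subprocess.run(list(item.command), capture_output=True, text=True)
    except FileNotFoundError:
        # The tool is missing, which is distinct from a check that ran and failed.
        return "cannot_run"
    return "pass" if completed.returncode == 0 else "fail"


if __name__ == "__main__":
    rubric = [
        # Stand-in command; a real rubric would name the project's test runner.
        RubricItem("tests pass", "mech", command=[sys.executable, "-c", "raise SystemExit(0)"]),
        RubricItem("README explains usage", "judge",
                   question="Does the README show how to run each command?"),
    ]
    for item in rubric:
        if item.kind == "mech":
            print(item.name, run_mech(item))
        else:
            print(item.name, "-> critic:", item.question)
```

Keeping one question per `[judge]` item mirrors the advice to grade each dimension separately instead of asking for one blob score.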
… `--gate` / escalation triggers.

- `--max-rounds N`: hard ceiling (default 5).
- `--workers single|auto|N`: dispatch mode (default auto: supervisor decides).
- `--gate`: human approval at phase boundaries (rubric, and before final sign-off). Default OFF.
- `--no-critic`: mechanical checks only, skip the LLM critic (faster, weaker; use when the rubric is fully `[mech]`).
- `--rubric "..."`: supply acceptance criteria directly.

/supervise-loop Build a working CLI todo app: add/list/done/delete, persists to JSON, all commands have passing tests, README with usage. Don't ask me — loop until it's solid.
/supervise-loop --max-rounds 8 --gate Rewrite my landing page copy until it's punchy, on-brand, and every claim is backed. Show me the rubric first.
The supervisor builds the rubric, delegates, grades, and keeps re-dispatching until the scorecard is green or it hits the cap — then reports the terminal state and the artifact.
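To pin down the flag defaults listed above, here is a hedged sketch that parses an invocation with `argparse`. In the skill itself the supervisor reads the flags from the prompt, so this parser is purely illustrative; `build_parser` and `parse_invocation` are made-up names.

```python
import argparse
from typing import Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="supervise-loop")
    parser.add_argument("--max-rounds", type=int, default=5)    # hard ceiling
    parser.add_argument("--workers", default="auto")            # single | auto | N
    parser.add_argument("--gate", action="store_true")          # default OFF
    parser.add_argument("--no-critic", action="store_true")     # mechanical checks only
    parser.add_argument("--rubric", default=None)               # user-supplied criteria
    parser.add_argument("goal", nargs="+")                      # the goal, stated once
    return parser


def parse_invocation(argv: Sequence[str]) -> argparse.Namespace:
    args = build_parser().parse_args(list(argv))
    workers_is_count = args.workers.isdigit() and int(args.workers) > 0
    if args.workers not in ("single", "auto") and not workers_is_count:
        raise ValueError("--workers must be single, auto, or a positive integer")
    args.goal = " ".join(args.goal)
    return args


if __name__ == "__main__":
    demo = parse_invocation(["--max-rounds", "8", "--gate", "Rewrite", "my", "landing", "page", "copy"])
    print(demo.max_rounds, demo.gate, demo.workers, demo.no_critic, demo.goal)
```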
- `references/rubric-guide.md`: turn a vague goal into a checkable, mechanically-grounded rubric (with examples).
- `references/prompts.md`: copy-paste worker-dispatch and critic prompt templates + the verdict contract.
- `references/termination.md`: full pass/revise/stop decision logic, anti-thrash math, terminal states, per-phase caps.
- `scripts/init-run.sh`: scaffold a run directory with `STATE.md` / `RUBRIC.md` / `MEMORY.md` templates.
npx claudepluginhub wdzhwsh4067/supervise-loop --plugin supervise-loop