Use when designing any LLM-as-Judge, Critic, or Evaluator node. Covers input structure, output schema, chain-of-thought ordering, single-pass vs multi-stage tradeoffs, and known failure modes. Prevents the most common design mistakes that cause Critic nodes to be unreliable.
Slash command: /build-reliable-agents:critic-judge-design
> **Terminology:** This skill uses "Critic", "Judge", and "Evaluator" interchangeably. Pick one term for your codebase and use it consistently.
A Critic node has one primary job: judge whether something meets a quality bar. Everything else it does (generating feedback, planning follow-ups, classifying failure types) is secondary.
The most common mistake: making the Critic do too much in one pass. When a node simultaneously judges quality AND generates structured follow-up plans, the two tasks interfere with each other — and the primary job (judgment) suffers.
Before designing any Critic, answer:
The cleaner the primary job definition, the more reliable the Critic.
This is the most important decision. Input structure determines reasoning patterns more than prompt instructions do.
If you organize input by sub-question:
SQ1: "Find the patch window" → 0 rows
SQ2: "Find rollback artifacts" → 3 rows (contains patch window)
The LLM will evaluate each sub-question against its own text. It will conclude "patch window not found" from SQ1 even if SQ2's rows contain the answer. The input structure primes per-sub-question reasoning.
Instead of organizing by sub-question, present evidence as a flat pool:
Original question: ...
Retrieved evidence:
- Row 1: {customer: "X", patch_window: "2026-03-24 02:00", ...}
- Row 2: {customer: "X", rollback_cmd: "orchestrator rollback ...", ...}
This forces the LLM to evaluate evidence against the original question, not against sub-question text.
Tradeoff: you lose sub-question provenance (you can no longer say "SQ2 found this"). In exchange, the Critic reasons over all the evidence at once.
Keep the per-sub-question structure when the Critic's job IS per-sub-question evaluation (e.g., "did this specific sub-question return useful data?"). Remove it when the Critic's job is holistic sufficiency (e.g., "can the original question be answered from all evidence combined?").
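The flattening step can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's implementation: the function name `build_flat_evidence_prompt` and the `{"text": ..., "rows": [...]}` shape of each sub-question result are assumptions made for the example.

```python
import json


def build_flat_evidence_prompt(question, sub_question_results):
    """Merge rows from every sub-question into one pool, dropping SQ labels.

    Assumed (hypothetical) input shape: each result is a dict with the
    sub-question "text" and the list of "rows" it retrieved.
    """
    rows = []
    for result in sub_question_results:
        rows.extend(result["rows"])  # keep the rows, discard result["text"]

    lines = [f"Original question: {question}", "Retrieved evidence:"]
    if not rows:
        lines.append("- (no rows retrieved)")
    for i, row in enumerate(rows, start=1):
        lines.append(f"- Row {i}: {json.dumps(row, sort_keys=True)}")
    return "\n".join(lines)


results = [
    {"text": "Find the patch window", "rows": []},
    {"text": "Find rollback artifacts",
     "rows": [{"customer": "X", "patch_window": "2026-03-24 02:00"}]},
]
prompt = build_flat_evidence_prompt("When is customer X's patch window?", results)
print(prompt)
```

Because the sub-question text never reaches the prompt, the empty result for "Find the patch window" cannot prime a "not found" conclusion; the Critic only sees the question and the pooled rows.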
Chain-of-thought ordering matters. The LLM reasons in the order it writes. Put the reasoning BEFORE the judgment, not after.
{
  "is_sufficient": true,
  "judgment": "Evidence covers the patch window...",
  "confidence": 90
}
The LLM commits to is_sufficient before writing the reasoning. The reasoning becomes post-hoc justification.
{
  "evidence_present": ["patch_window", "rollback_cmd"],
  "evidence_missing": [],
  "judgment": "Both required fields are present in the retrieved rows...",
  "confidence": 90,
  "is_sufficient": true
}
The LLM must enumerate what's present and missing before committing to a verdict. Because the verdict is generated last, it is conditioned on that enumeration, which tends to produce more accurate judgments.
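One way to keep the ordering from silently regressing is to check it when parsing the Critic's output. The sketch below assumes the Critic returns a raw JSON string with the field names from the example above; `FIELD_ORDER`, `schema_instruction`, and `parse_critic_output` are hypothetical names. Note that this only detects a wrong order after the fact. The ordering itself still has to be requested in the prompt or schema you give the model.

```python
import json

# Reasoning fields first, verdict last.
FIELD_ORDER = [
    "evidence_present",
    "evidence_missing",
    "judgment",
    "confidence",
    "is_sufficient",
]


def schema_instruction():
    """Prompt fragment asking the model to emit keys in reasoning-first order."""
    return (
        "Respond with a JSON object whose keys appear in exactly this order: "
        + ", ".join(FIELD_ORDER)
    )


def parse_critic_output(raw):
    """Parse the Critic's JSON and reject outputs that put the verdict early."""
    data = json.loads(raw)  # json.loads keeps keys in document order
    keys = list(data.keys())
    if keys != FIELD_ORDER:
        raise ValueError(f"unexpected field order: {keys}")
    return data
```

Rejecting mis-ordered output is a deliberately strict choice for the sketch; a more lenient variant could log the violation and continue.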
Use an intermediate score (1-4) and derive a binary from it:
coverage_score: 1-4
is_sufficient: derived from coverage_score >= 3
This gives the LLM room to express nuance, and keeps the routing logic deterministic (binary derived from score, not LLM-generated bool).
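A minimal sketch of the derivation, assuming the Critic's parsed output is a dict containing `coverage_score`. The helper name `derive_is_sufficient` and the decision to raise on out-of-range scores are assumptions for illustration.

```python
SUFFICIENT_THRESHOLD = 3  # coverage_score >= 3 counts as sufficient


def derive_is_sufficient(critic_output):
    """Derive the binary verdict from coverage_score.

    Any "is_sufficient" key the LLM may have emitted is ignored on purpose.
    """
    score = critic_output.get("coverage_score")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"coverage_score must be an integer, got {score!r}")
    if not 1 <= score <= 4:
        raise ValueError(f"coverage_score must be in 1..4, got {score}")
    return score >= SUFFICIENT_THRESHOLD
```

Validating the range here means a malformed score fails loudly instead of quietly routing the wrong way.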
Use when:
Risk: As feedback complexity grows, judgment quality degrades because the two tasks interfere.
Use when:
Structure:
Stage 1 (Judgment node):
  Input: original question + flat evidence pool
  Output: is_sufficient, retrieval_outcome, confidence, brief judgment
Stage 2 (Feedback Planner node, only runs if insufficient):
  Input: original question + evidence + judgment from Stage 1
  Output: gap_type, missing_evidence, suggested_tables, dependency_hints
Stage 3 (Follow-up Parser, already exists):
  Input: feedback from Stage 2
  Output: new sub-questions
Cost: One extra LLM call on insufficient cases only. No cost on sufficient cases.
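The first two stages can be wired together roughly as follows. This is a sketch under assumptions: `judge` and `plan_feedback` are hypothetical callables standing in for the two LLM calls, and the stub versions below return canned values (including the made-up `gap_type` value `"missing_field"`) purely so the control flow can be exercised.

```python
def run_critic_stages(question, evidence, judge, plan_feedback):
    """Stage 1 always runs; Stage 2 runs only when Stage 1 says insufficient."""
    judgment = judge(question, evidence)
    if judgment["is_sufficient"]:
        return {"judgment": judgment, "feedback": None}  # no extra LLM call
    feedback = plan_feedback(question, evidence, judgment)
    return {"judgment": judgment, "feedback": feedback}


# Stubs standing in for real LLM calls (hypothetical, for illustration only).
planner_calls = []


def fake_judge(question, evidence):
    found = any("patch_window" in row for row in evidence)
    return {
        "is_sufficient": found,
        "retrieval_outcome": "sufficient" if found else "needs_more_retrieval",
        "confidence": 90,
        "judgment": "stub judgment",
    }


def fake_planner(question, evidence, judgment):
    planner_calls.append(question)  # record that Stage 2 actually ran
    return {
        "gap_type": "missing_field",  # made-up value for the sketch
        "missing_evidence": ["patch_window"],
        "suggested_tables": [],
        "dependency_hints": [],
    }
```

Calling `run_critic_stages` with evidence that already contains `patch_window` returns `feedback` as `None` and leaves `planner_calls` empty, which is the "no cost on sufficient cases" property.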
Always derive routing from deterministic fields, not LLM-generated booleans.
# Fragile: the LLM generates the bool
if state["is_sufficient"]:
    next_node = "synthesizer"

# Robust: derive the decision from the score
if state["coverage_score"] >= 3:
    next_node = "synthesizer"
For multi-outcome routing (sufficient / needs_more_retrieval / no_evidence / unsupported_premise):
match state["retrieval_outcome"]:
    case "sufficient":
        next_node = "synthesizer"
    case "needs_more_retrieval":
        # loop back only while under the round cap
        next_node = "follow_up_parser" if rounds < max_rounds else "synthesizer"
    case "no_evidence":
        next_node = "synthesizer"
    case "unsupported_premise":
        next_node = "synthesizer"
    case _:
        next_node = "synthesizer"  # safe default
Always have a safe default that prevents infinite loops.
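As a self-contained sketch, the routing above can be written as a pure function. The node names, the `critic_round_count` state key, and the default of three rounds are assumptions for the example. Plain `if` statements are used instead of `match` so the snippet also runs on Python versions older than 3.10.

```python
SYNTHESIZER = "synthesizer"
FOLLOW_UP_PARSER = "follow_up_parser"


def route(state, max_rounds=3):
    """Return the next node name for a Critic state dict."""
    outcome = state.get("retrieval_outcome")
    rounds = state.get("critic_round_count", 0)

    # The only path that loops back, and only while under the round cap.
    if outcome == "needs_more_retrieval" and rounds < max_rounds:
        return FOLLOW_UP_PARSER

    # "sufficient", "no_evidence", "unsupported_premise", a capped-out
    # "needs_more_retrieval", and any unrecognized or missing value end here.
    return SYNTHESIZER
```

Since every branch except one terminates at the synthesizer, a typo or a new outcome value from the LLM cannot create a loop.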
Every Critic-driven loop MUST have a hard exit condition independent of Critic judgment:
if critic_round_count >= max_rounds:
    next_node = "synthesizer"  # forced, regardless of what the Critic says
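A small driver loop shows the guard in context. This is a sketch only: `critic` and `retrieve_more` are hypothetical callables (the Critic returns an outcome string, the retriever returns new rows), and the default of three rounds is an assumption.

```python
def run_critic_loop(critic, retrieve_more, evidence, max_rounds=3):
    """Run the Critic until it stops asking for more or the round cap is hit.

    Returns the final evidence list and the number of Critic rounds used.
    """
    evidence = list(evidence)  # do not mutate the caller's list
    critic_round_count = 0
    while True:
        outcome = critic(evidence)
        critic_round_count += 1
        if outcome != "needs_more_retrieval":
            break  # the Critic is done, go to the synthesizer
        if critic_round_count >= max_rounds:
            break  # hard exit, regardless of what the Critic says
        evidence.extend(retrieve_more(evidence))
    return evidence, critic_round_count
```

Even with a Critic that always answers `"needs_more_retrieval"`, the loop ends after `max_rounds` Critic calls.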
Also add sub-question deduplication to prevent the Critic from requesting the same sub-question repeatedly:
seen_signatures = set()

# Before adding to the pending queue, check the signature
if sub_question_signature not in seen_signatures:
    pending_queue.append(sub_question)
    seen_signatures.add(sub_question_signature)
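How the signature is computed is left open above. One simple possibility, shown here as an assumption rather than the skill's actual method, is to normalize case, punctuation, and whitespace so that trivially rephrased sub-questions collide. The names `make_signature` and `enqueue_unique` are hypothetical.

```python
import re


def make_signature(sub_question):
    """Lowercase, drop punctuation, and collapse whitespace."""
    words = re.findall(r"[a-z0-9]+", sub_question.lower())
    return " ".join(words)


def enqueue_unique(sub_question, pending_queue, seen_signatures):
    """Append the sub-question unless an equivalent one was already queued.

    Returns True if it was added, False if it was a duplicate.
    """
    signature = make_signature(sub_question)
    if signature in seen_signatures:
        return False
    pending_queue.append(sub_question)
    seen_signatures.add(signature)
    return True
```

This normalization only catches surface-level repeats. Paraphrases such as "Locate the patch window" would still pass, so the hard round cap remains the real safety net.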
| Failure | Symptom | Root Cause | Fix |
|---|---|---|---|
| Per-sub-question reasoning | Critic judges each SQ independently, misses cross-SQ evidence | Input organized by sub-question | Flatten evidence pool |
| Post-hoc justification | Critic commits to verdict, writes reasoning to match | Judgment field before reasoning in schema | Reorder: reasoning → score → verdict |
| Ghost rules | Critical rules ignored, less important rules followed | Prompt too long, rules competing for attention | Simplify prompt, put critical rules first |
| Infinite loops | Critic keeps saying insufficient even when evidence exists | No loop guard, or loop guard too permissive | Hard exit at max_rounds regardless of Critic |
| Follow-up drift | Follow-up sub-questions drift from original question scope | Critic generates follow-ups without entity anchoring | Require Critic to scan completed SQs for entity names before generating feedback |