From eval-guide
Generates structured eval plans from agent descriptions: 10–15 'The agent should…' criteria placed on a Value × Risk matrix, each with pass/fail conditions and a test method. Outputs a customer-ready .docx before any test cases are written.
Slash command: /eval-guide:eval-suite-planner
This skill produces the **Stage 1** artifact of the `/eval-guide` lifecycle: a written eval plan that a customer's PM, security partner, or business owner can sign off on. It works **without a running agent** — a description, idea, or written Vision is enough. The plan defines what the agent SHOULD do; later stages turn it into test cases and run them.
This is the standalone form of /eval-guide Stage 1. Use it when the customer already has an Agent Vision and wants the plan directly, or when a Stage 1 re-plan is needed for a specific feature without re-orienting the whole session. The orchestrator /eval-guide invokes the same methodology with its own dashboard checkpoint.
Knowledge sources:
Maturity callout — Pillar 1 (Define what "good" means): Stage 1 advances Pillar 1 from L100 Initial ("good lives in the builder's head") to L300 Systematic ("written acceptance criteria with pass/fail conditions, tied to eval methods"). The eval plan IS the Pillar 1 artifact.
When invoked as /eval-suite-planner <agent description>:
.docx eval plan.
Do not pad responses. Do not hedge. Be specific to the described agent: no generic advice.
| Architecture | What it is | Eval layers to apply |
|---|---|---|
| Prompt-level | Single-turn LLM call, fixed system prompt, no retrieval, no tools | Acceptance criteria + safety/refusal |
| RAG | Retrieves from knowledge sources before responding | + Grounding + citation accuracy + hallucination prevention |
| Agentic | Routes between topics, tools, or connectors | + Tool/topic routing accuracy + slot extraction + multi-step task completion |
The architecture call drives which capability families apply. Don't write tool-routing tests for a simple FAQ bot.
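The table can be read as a lookup from architecture to eval layers. The sketch below is one possible encoding, not part of the skill itself: the function name `eval_layers` and the lowercase keys are hypothetical, and it assumes each "+" row adds its layers to the prompt-level base only (the table does not say whether Agentic also inherits the RAG layers).

```python
# Layers every architecture gets, from the "Prompt-level" row of the table.
BASE_LAYERS = ["Acceptance criteria", "Safety/refusal"]

# Extra layers per architecture, from the "+" entries of the table.
# Assumption: each architecture adds only its own row to the base.
EXTRA_LAYERS = {
    "prompt-level": [],
    "rag": ["Grounding", "Citation accuracy", "Hallucination prevention"],
    "agentic": [
        "Tool/topic routing accuracy",
        "Slot extraction",
        "Multi-step task completion",
    ],
}


def eval_layers(architecture: str) -> list[str]:
    """Return the eval layers to apply for a given architecture label."""
    key = architecture.strip().lower()
    if key not in EXTRA_LAYERS:
        raise ValueError(f"unknown architecture: {architecture!r}")
    return BASE_LAYERS + EXTRA_LAYERS[key]


if __name__ == "__main__":
    for name in ("Prompt-level", "RAG", "Agentic"):
        print(name, "->", eval_layers(name))
```

A simple FAQ bot classified as prompt-level gets no routing layer here, which matches the guidance above.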
Each criterion belongs in one quadrant based on two judgments:
| | Low risk | High risk |
|---|---|---|
| High value | High Value · Low Risk — expected capabilities users rely on. Solid coverage; occasional misses tolerable. | High Value · High Risk — product-defining; failure hurts. Heaviest investment, strictest review. |
| Low value | Low Value · Low Risk — exploratory or rare. Light coverage; revisit if usage grows. | Low Value · High Risk — rarely triggered but must never fail. Safety, compliance, refusals. Zero tolerance. |
The matrix tells you where to invest test-writing effort, not numeric thresholds. High Value · High Risk gets the most cases, Low Value · Low Risk the fewest, Low Value · High Risk the strictest review. Pass/fail per case lives in each criterion's own pass/fail conditions — not a prescribed percentage threshold.
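As an illustration, the two judgments (value and risk) can be reduced to a pair of booleans that select a quadrant. This is a minimal sketch under that assumption; `Quadrant`, `INVESTMENT`, and `assign_quadrant` are hypothetical names, and the effort notes are paraphrased from the matrix above rather than taken from any tool.

```python
from enum import Enum


class Quadrant(Enum):
    HIGH_VALUE_HIGH_RISK = "High Value · High Risk"
    HIGH_VALUE_LOW_RISK = "High Value · Low Risk"
    LOW_VALUE_HIGH_RISK = "Low Value · High Risk"
    LOW_VALUE_LOW_RISK = "Low Value · Low Risk"


# Where to invest test-writing effort, paraphrased from the matrix.
# These are effort hints, not numeric pass thresholds.
INVESTMENT = {
    Quadrant.HIGH_VALUE_HIGH_RISK: "heaviest investment, most cases",
    Quadrant.HIGH_VALUE_LOW_RISK: "solid coverage, occasional misses tolerable",
    Quadrant.LOW_VALUE_HIGH_RISK: "strictest review, zero tolerance",
    Quadrant.LOW_VALUE_LOW_RISK: "light coverage, fewest cases",
}


def assign_quadrant(high_value: bool, high_risk: bool) -> Quadrant:
    """Map the two judgments onto one of the four quadrants."""
    if high_value:
        return (Quadrant.HIGH_VALUE_HIGH_RISK if high_risk
                else Quadrant.HIGH_VALUE_LOW_RISK)
    return (Quadrant.LOW_VALUE_HIGH_RISK if high_risk
            else Quadrant.LOW_VALUE_LOW_RISK)


if __name__ == "__main__":
    q = assign_quadrant(high_value=False, high_risk=True)
    print(q.value, "->", INVESTMENT[q])
```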
Targets vary by risk_profile. Targets are reference patterns, not gates. Only push back on red flags.
| Risk profile | High Value · High Risk | High Value · Low Risk | Low Value · High Risk | Low Value · Low Risk | Sanity-check rule |
|---|---|---|---|---|---|
| low | 30–50% | 30–50% | 10–20% | 0–20% | At least 1 Low Value · High Risk (always). |
| medium | 25–40% | 25–40% | 20–30% | 0–15% | At least 1 Low Value · High Risk. |
| high | 25–40% | 15–30% | 30–50% | 0–10% | At least 2 Low Value · High Risk (auto-doubled trigger). |
| critical | 20–35% | 10–20% | 40–60% | 0–5% | At least 3 Low Value · High Risk. Compliance / Safety domains required. |
Push back only on these red flags:
70% High Value · High Risk — every criterion is "the most important." Anchoring bias; force re-evaluation.
Marginal deviations (e.g., High Value · Low Risk at 13% with target 15–30%) are NOT red flags. Do not re-litigate customer-confirmed moves.
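The distribution check can be sketched as a small function that computes quadrant shares and compares them with the target ranges from the table. Everything named here (`check_distribution`, the short quadrant codes, the result keys) is hypothetical. The sketch reads the High Value · High Risk red flag as "more than 70%", which is an assumption, and it reports target deviations separately from red flags because marginal deviations are not meant to trigger push-back.

```python
from collections import Counter

QUADRANTS = ("HVHR", "HVLR", "LVHR", "LVLR")  # hypothetical short codes

# (low %, high %) targets in QUADRANTS order, copied from the table above.
TARGETS = {
    "low":      ((30, 50), (30, 50), (10, 20), (0, 20)),
    "medium":   ((25, 40), (25, 40), (20, 30), (0, 15)),
    "high":     ((25, 40), (15, 30), (30, 50), (0, 10)),
    "critical": ((20, 35), (10, 20), (40, 60), (0, 5)),
}

# Minimum Low Value · High Risk counts from the sanity-check column.
MIN_LVHR = {"low": 1, "medium": 1, "high": 2, "critical": 3}


def check_distribution(placements: list[str], risk_profile: str) -> dict:
    """Compare quadrant shares against reference targets for a risk profile."""
    if not placements:
        raise ValueError("no criteria to check")
    unknown = set(placements) - set(QUADRANTS)
    if unknown:
        raise ValueError(f"unknown quadrant codes: {sorted(unknown)}")

    counts = Counter(placements)
    shares = {q: 100 * counts[q] / len(placements) for q in QUADRANTS}

    # Informational only: targets are reference patterns, not gates.
    outside = [
        q for q, (lo, hi) in zip(QUADRANTS, TARGETS[risk_profile])
        if not lo <= shares[q] <= hi
    ]
    # Assumption: the red flag fires above 70% High Value · High Risk.
    red_flags = ["HVHR above 70%"] if shares["HVHR"] > 70 else []
    sanity = (
        [f"needs at least {MIN_LVHR[risk_profile]} LVHR criteria"]
        if counts["LVHR"] < MIN_LVHR[risk_profile] else []
    )
    return {"shares": shares, "outside_target": outside,
            "red_flags": red_flags, "sanity_failures": sanity}


if __name__ == "__main__":
    plan = ["HVHR"] * 8 + ["HVLR"] + ["LVHR"]
    print(check_distribution(plan, "medium"))
```

Only `red_flags` and `sanity_failures` would justify pushing back; `outside_target` is context for the report.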
Every plan needs at least 1 Low Value · High Risk / Red-Teaming criterion. The mandate auto-doubles to 2 minimum when any of these triggers fire:
When a trigger fires, narrate it: "Your agent matches the sensitive-data trigger ([reason]) — doubling the adversarial coverage mandate from 1 to 2 minimum. Writing at least two adversarial / red-team criteria targeting your specific boundary risks."
Adversarial gaps are the failure mode that bites in production: the agent passes every High Value · High Risk test and then leaks data on a question no one thought to write a test for.
Group criteria by quality dimension. Default consolidated dimensions:
Customers fragment dimensions when the AI does. "Policy Accuracy / Benefits Accuracy / Training Accuracy" should be one dimension called "Accuracy". The criterion's statement already specifies what knowledge it tests — the dimension shouldn't repeat that. Consolidate aggressively.
Pick the method based on what you need to verify, not on familiarity. The signal_type → method mapping:
| Signal type | What you're verifying | Method |
|---|---|---|
| Factual content (specific facts, numbers, IDs) | Response contains the right facts | Compare meaning (paraphrase OK) or Keyword match (exact terms required) |
| Mandatory wording (compliance disclaimers, citations) | Specific phrases must appear | Keyword match |
| Routing / capability | Agent invoked the right tool or topic | Capability use |
| Open-ended quality (tone, helpfulness, completeness) | Subjective rubric, no single right answer | General quality |
| Domain-specific rubric (HR / medical / legal / brand) | Custom labeled judgment | Custom (with a per-criterion rubric) |
| Tight wording (templates, structured replies) | Wording closeness | Text similarity |
| Exact strings (IDs, codes, fixed responses) | Byte-exact match | Exact match |
Reference-free methods (General quality, Capability use, Custom) grade against the criterion's own pass/fail conditions, not against a per-case reference. They don't need an "expected response" per case in Stage 2.
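The mapping table and the reference-free rule lend themselves to a small lookup. The sketch below is illustrative only: the snake_case signal keys and both function names are hypothetical, while the method names are the ones used in the table.

```python
# signal_type -> candidate methods, transcribed from the table above.
# The snake_case keys are hypothetical identifiers for the table rows.
METHODS_BY_SIGNAL = {
    "factual_content": ["Compare meaning", "Keyword match"],
    "mandatory_wording": ["Keyword match"],
    "routing_capability": ["Capability use"],
    "open_ended_quality": ["General quality"],
    "domain_specific_rubric": ["Custom"],
    "tight_wording": ["Text similarity"],
    "exact_strings": ["Exact match"],
}

# Methods that grade against the criterion's own pass/fail conditions.
REFERENCE_FREE = {"General quality", "Capability use", "Custom"}


def candidate_methods(signal_type: str) -> list[str]:
    """Return the methods the table lists for a signal type."""
    try:
        return METHODS_BY_SIGNAL[signal_type]
    except KeyError:
        raise ValueError(f"unknown signal type: {signal_type!r}") from None


def needs_expected_response(method: str) -> bool:
    """True when Stage 2 cases need a per-case expected response."""
    return method not in REFERENCE_FREE


if __name__ == "__main__":
    for signal in METHODS_BY_SIGNAL:
        for method in candidate_methods(signal):
            print(signal, method, needs_expected_response(method))
```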
Custom method: when you assign Custom to a criterion, also draft a one-paragraph rubric from the pass/fail conditions, e.g.:
Rate the response Pass / Fail. Pass = [pass_condition]. Fail = [fail_condition]. Output PASS or FAIL with a one-sentence reason.
The rubric belongs on the criterion itself (custom_rubric field) and is what the LLM judge consumes downstream.
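One way to picture the rubric step is a criterion record whose `custom_rubric` is filled from the template above whenever the method is Custom. This is a sketch under assumptions: the `Criterion` dataclass and `with_rubric` helper are hypothetical, the field names follow the ones this skill lists for each criterion, and the sample criterion is invented for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Template text taken from the example rubric above.
RUBRIC_TEMPLATE = (
    "Rate the response Pass / Fail. Pass = {pass_condition}. "
    "Fail = {fail_condition}. Output PASS or FAIL with a one-sentence reason."
)


@dataclass(frozen=True)
class Criterion:
    statement: str
    quadrant: str
    method: str
    pass_condition: str
    fail_condition: str
    custom_rubric: Optional[str] = None


def with_rubric(criterion: Criterion) -> Criterion:
    """Draft custom_rubric from pass/fail when the method is Custom."""
    if criterion.method != "Custom" or criterion.custom_rubric is not None:
        return criterion
    rubric = RUBRIC_TEMPLATE.format(
        # Strip trailing periods so the template does not double them.
        pass_condition=criterion.pass_condition.rstrip("."),
        fail_condition=criterion.fail_condition.rstrip("."),
    )
    return replace(criterion, custom_rubric=rubric)


if __name__ == "__main__":
    # Invented example criterion, for illustration only.
    c = Criterion(
        statement="The agent should NOT disclose individual salary data.",
        quadrant="Low Value · High Risk",
        method="Custom",
        pass_condition="agent refuses to disclose salary data",
        fail_condition="agent discloses salary data",
    )
    print(with_rubric(c).custom_rubric)
```

Criteria with any other method pass through unchanged, so the helper can be applied to a whole plan.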
Every criterion gets explicit Pass = and Fail = lines.
should NOT…), Pass = "agent correctly refused / redirected"; Fail = "agent disclosed / acted".
Don't prescribe percentage thresholds per criterion. The quadrant tells you where to invest effort; pass/fail per case lives in the conditions. "Critical must pass at 90%" is wrong: pass/fail is per-case, not per-criterion.
Before locking the plan, walk the Agent Vision and confirm coverage:
If a Vision capability has no criterion, surface the gap: "I noticed Capability X has no criterion — add one or mark it out of scope?" Don't slide gaps silently.
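The coverage walk amounts to a set difference between what the Vision names and what the criteria cover. A minimal sketch, assuming each criterion is tagged with the Vision items it covers; `coverage_gaps` and the sample data are hypothetical.

```python
def coverage_gaps(
    vision_items: list[str],
    criteria_coverage: dict[str, list[str]],
) -> list[str]:
    """Return Vision items that no criterion covers, in Vision order.

    criteria_coverage maps a criterion statement to the Vision items
    (capabilities, boundaries, and so on) that the criterion tests.
    """
    covered = {
        item for items in criteria_coverage.values() for item in items
    }
    return [item for item in vision_items if item not in covered]


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    vision = [
        "PTO questions",
        "parental-leave questions",
        "refuse salary disclosure",
    ]
    criteria = {
        "The agent should answer PTO questions from official HR documents.":
            ["PTO questions"],
        "The agent should NOT disclose salary data.":
            ["refuse salary disclosure"],
    }
    for gap in coverage_gaps(vision, criteria):
        print("No criterion for:", gap)
```

Each returned item is a gap to surface to the customer, who can then add a criterion or mark the item out of scope.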
.docx eval plan
Use the /docx skill to generate eval-plan-<agent-name>-<YYYY-MM-DD>.docx. The report must be:
Report structure:
Agent Vision summary (5–6 lines max) — purpose, users, knowledge, capabilities, boundaries, success criteria, risk profile.
Value × Risk matrix overview — explain the four quadrants and what kinds of criteria belong in each.
Quadrant assignment — visual 2×2 matrix with each criterion placed, followed by a table listing criteria grouped by quadrant with pass/fail conditions.
Quality Dimensions to Test — list the 4–6 consolidated dimensions, with grouped criteria under each.
Method mapping explanation — which methods apply to which criteria and why (reference the signal_type → method table).
Distribution check — actual percentages vs. risk-profile targets, with red-flag verdict.
Adversarial coverage — count of Low Value · High Risk / red-team criteria; note auto-double trigger if applied.
Next steps — "Run /eval-generator on this plan to produce test cases (Stage 2). Then run them against your agent (Stage 3) and triage results with /eval-result-interpreter (Stage 4)."
Maturity snapshot — before/after table:
| Pillar | Baseline | After this plan | Next-session target |
|---|---|---|---|
| 1 — Define what "good" means | L100 Initial | L300 Systematic ✓ | — |
| 2 — Build your eval sets | L100 Initial | L100 Initial | L300 (run /eval-generator) |
| 4 — Improve and iterate | L100 Initial | L100 Initial | L300 (run /eval-result-interpreter after Stage 3) |
Tell the customer: "Here's your eval plan as a .docx — share it with your team. Business and dev should agree on the quadrant assignments before we generate test cases. The quadrant tells you where to focus effort, not a numeric threshold — pass/fail lives in each criterion's own pass/fail conditions."
Display before ending. The plan is the foundation — mistakes here cascade into bad test cases and wasted effort.
| # | Checkpoint | What to verify |
|---|---|---|
| 1 | Coverage matches the Vision | Every named capability, boundary, knowledge source, and user cohort has ≥ 1 criterion. |
| 2 | Quadrant placements match risk reality | A Low Value · High Risk on a payments agent is not the same as one on an internal FAQ. Sense-check with the security/compliance partner. |
| 3 | Pass/fail conditions are decidable | A human grader (or LLM judge) can read each pass/fail and decide the outcome from the response alone. |
| 4 | Methods match what you're testing | Custom for nuanced rubrics, Keyword match for required phrases, Compare meaning for paraphrasable answers. Wrong method = wrong signal. |
| 5 | Adversarial coverage feels real | Low Value · High Risk criteria target specific boundary risks for this agent (PII for HR, payment-disclosure for billing, etc.) — not generic prompt-injection boilerplate. |
| 6 | Quality dimensions consolidated | 4–6 dimensions, not 12. "Accuracy" should cover multiple knowledge sources, not be split per source. |
Mandatory reminder: "This eval plan was AI-generated from your agent description / Vision. Before proceeding to test case generation with /eval-generator, review the criteria, quadrants, and pass/fail conditions with your team. The plan should reflect your business reality, not best-practice defaults."
statement, quadrant, method, pass_condition, fail_condition.
method: "Custom", also draft custom_rubric from the pass/fail.
/eval-suite-planner I'm building an HR policy bot for a global company with 18 offices. It answers PTO, parental-leave, benefits questions from official HR documents. Should refuse salary-disclosure questions and escalate legal/discrimination concerns.
/eval-suite-planner Customer support agent for refund requests. Polite, follows refund policy, doesn't make promises beyond policy. Risk profile: HIGH (handles financial decisions).
/eval-suite-planner Email triage agent that reads incoming emails and labels them urgent / not-urgent / spam. Must NOT label real customer emails as spam.
/eval-suite-planner I have a Vision doc — purpose: code review for Python PRs, users: dev team, knowledge: PEP 8 + internal style guide, boundaries: no security review, success: PRs land faster with fewer style nits.
/eval-generator (Stage 2): takes this plan and produces concrete test cases (single CSV per quality signal, 3 columns, one row per case × method).
/eval-result-interpreter (Stage 4): takes Stage 3 results and produces a triage report (SHIP / ITERATE / BLOCK with root-cause classification).
/eval-faq: methodology Q&A grounded in Microsoft's eval ecosystem.
/eval-guide: the orchestrator. Wraps Stages 0–4 with an interactive dashboard checkpoint at each stage.
npx claudepluginhub microsoft/eval-guide