By microsoft
Plan structured evaluations for Copilot Studio AI agents from descriptions, generate grouped test cases in CSV and docx, analyze results with verdicts and patterns, triage failures by root causes, and prioritize interactive fixes using Microsoft's Eval Scenario Library and Triage Playbook.
Answers AI agent evaluation methodology questions with practical, opinionated guidance grounded primarily in Microsoft's agent evaluation ecosystem (MS Learn, Eval Scenario Library, Triage & Improvement Playbook, Eval Guidance Kit) supplemented by select industry sources.
Analyzes Copilot Studio evaluation CSV results using Microsoft's Triage & Improvement Playbook. Returns a SHIP / ITERATE / BLOCK verdict with root cause classification, diagnostic triage, prioritized remediation, and pattern analysis.
Stage 2 standalone — turns an eval plan (output of `/eval-suite-planner`) into concrete test cases grouped by quality signal. Methods live at the signal level (one or many per signal); each criterion shares the signal's method set. Outputs one CSV per signal (2 columns: Question + Expected response, one row per case) plus a customer-ready `.docx` test-case report and an `eval-setup-guide.docx` that walks the customer through assigning testing methods per row manually in Copilot Studio's Evaluate tab. Use after planning, before running.
Stage 1 standalone — turns an Agent Vision (or plain-English description) into a structured eval plan: 10–15 acceptance criteria phrased "The agent should…", each placed on a Value × Risk matrix (High Value · High Risk / High Value · Low Risk / Low Value · High Risk / Low Value · Low Risk), each with explicit pass/fail conditions and a test method. Output is a customer-ready `.docx` eval plan. Use before generating test cases or running any evals.
Use this skill when the user's Copilot Studio agent evaluations have come back and they need to interpret scores, diagnose root causes of underperforming test cases, find remediation steps, or analyze patterns to improve their agent. Always use this skill when the user mentions: "eval failed", "why did this fail", "triage", "diagnose failure", "low pass rate", "fix evaluation results", "not passing", "failing test cases", "evaluation results", "improve my eval scores", or any situation where eval scores need interpretation and action.
AI agent evaluation toolkit for Copilot Studio. Plan evals, generate test cases, interpret results, and triage failures — from Claude Code or GitHub Copilot.
Grounded in Microsoft's Eval Scenario Library, Triage & Improvement Playbook, Common Evaluation Approaches, and MS Learn agent evaluation documentation.
claude plugin marketplace add microsoft/eval-guide
claude plugin install eval-guide@eval-guide
npx skills add microsoft/eval-guide
| Skill | Command | What it does |
|---|---|---|
| Eval Guide | /eval-guide | Full eval lifecycle — discover, plan, generate, run, interpret. Start here. |
| Eval Suite Planner | /eval-suite-planner | Structured eval plan with scenarios, methods, quality signals, thresholds, and test data strategy |
| Eval Generator | /eval-generator | Test cases for single-response and conversation (multi-turn) evaluation modes |
| Eval Result Interpreter | /eval-result-interpreter | SHIP / ITERATE / BLOCK verdict with root cause classification |
| Eval Triage & Improvement | /eval-triage-and-improvement | Interactive diagnosis and remediation for failing evals |
| Eval FAQ | /eval-faq | Methodology questions answered from Microsoft's eval ecosystem |
> /eval-guide
Tell me about your agent — what does it do, who uses it, and what does "good" look like?
Works the same in both Claude Code and GitHub Copilot.
The toolkit walks you through Microsoft's 4-stage evaluation lifecycle, preceded by a discovery step (Stage 0):
| Stage | What happens | Works without a running agent? |
|---|---|---|
| 0. Discover | Articulate what the agent does and what success looks like | Yes |
| 1. Plan | Scope eval depth by agent architecture, map to scenario types, pick methods, set thresholds | Yes |
| 2. Generate & Baseline | Produce test case CSVs (single-response) or conversation blueprints (multi-turn) importable into Copilot Studio | Yes |
| 3. Run | Execute tests against a live agent | Needs running agent |
| 4. Interpret & Improve | Triage results, classify root causes, prioritize fixes, re-test | Needs eval results |
Stages 0-2 work from just an agent description — no running agent required.
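As a rough illustration of the Stage 2 output, the sketch below renders one CSV per quality signal using the two-column layout (Question, Expected response) mentioned in the generator skill's description. The signal names, the sample cases, and the helper `render_signal_csv` are hypothetical; the exact header text and file naming expected by Copilot Studio's import are assumptions to verify against the generated files.

```python
import csv
import io

# Hypothetical test cases grouped by quality signal. The signal names,
# questions, and expected responses are invented for illustration.
cases_by_signal = {
    "grounding": [
        ("What is the refund window?", "States the 30-day window from the policy document."),
        ("Do you ship to Canada?", "Confirms Canadian shipping and cites the shipping policy."),
    ],
    "refusal": [
        ("Give me another customer's address.", "Declines and explains it cannot share personal data."),
    ],
}


def render_signal_csv(rows):
    """Return CSV text with two columns and one row per test case."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Assumed header names, taken from the column description above.
    writer.writerow(["Question", "Expected response"])
    writer.writerows(rows)
    return buffer.getvalue()


# One CSV document per signal, keyed by signal name.
csv_by_signal = {
    signal: render_signal_csv(rows) for signal, rows in cases_by_signal.items()
}
```

Writing each value of `csv_by_signal` to its own file would give one importable test set per signal, which keeps a signal's cases together when testing methods are assigned later.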
Each stage generates an interactive HTML dashboard that opens locally in your browser. You review, edit inline, and confirm before the AI proceeds, so there is no back-and-forth in chat to fix test cases.
Stage complete → Dashboard opens → You review & edit → Confirm → Final artifacts generated
| Stage | What you review in the dashboard | What you can edit |
|---|---|---|
| 0. Discover | Agent Vision (purpose, users, knowledge, capabilities, boundaries, success criteria) | All fields inline, add/remove list items |
| 1. Plan | Scenario table, methods, thresholds, quality signals | Add/remove scenarios, change methods, adjust thresholds |
| 2. Generate | Test cases per quality signal | Edit expected responses, questions, methods, add/remove cases |
| 4. Interpret | Verdict, failure triage, root causes, actions | Reclassify root causes, add comments |
Final deliverables (.docx reports, .csv test sets) are only generated after you confirm via the dashboard.
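To make the Stage 4 review items more concrete, here is a minimal sketch of how a SHIP / ITERATE / BLOCK verdict and a root-cause tally could be derived from per-case results. The thresholds (`ship_at`, `block_below`), the result record shape, and the root-cause labels are all hypothetical placeholders; the source does not state the Triage & Improvement Playbook's actual criteria, so treat this as an illustration of the idea only.

```python
from collections import Counter


def summarize(results, ship_at=0.90, block_below=0.60):
    """Return (verdict, pass_rate, root-cause counts) for a list of results.

    Each result is a dict with "passed" (bool) and "root_cause" (str or None).
    The thresholds are assumed values, not taken from the playbook.
    """
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")

    failures = [r for r in results if not r["passed"]]
    pass_rate = (total - len(failures)) / total

    if pass_rate >= ship_at:
        label = "SHIP"
    elif pass_rate < block_below:
        label = "BLOCK"
    else:
        label = "ITERATE"

    # Count root causes among failing cases only, most frequent first.
    causes = Counter(r["root_cause"] for r in failures)
    return label, pass_rate, causes.most_common()


# Hypothetical results: three passing cases and one failure.
sample = [
    {"passed": True, "root_cause": None},
    {"passed": True, "root_cause": None},
    {"passed": True, "root_cause": None},
    {"passed": False, "root_cause": "knowledge gap"},
]
label, pass_rate, causes = summarize(sample)
```

A single overall pass rate is the simplest possible rule. A real triage would likely weigh failures by how critical the affected scenario is, which is why the dashboard lets you reclassify root causes before the final report is generated.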
The dashboard is a standalone HTML file generated by `skills/eval-guide/dashboard/serve.py` (zero dependencies) and opened directly in your browser, with no server required. Edits auto-save to localStorage as you make them, so your work is preserved if the browser closes.
The toolkit automatically scopes evaluation depth based on your agent's architecture:
| Architecture | What gets tested |
|---|---|
| Prompt-level (simple Q&A, no knowledge sources) | Response quality, tone, boundaries, refusal behavior |
| RAG / Knowledge-grounded (has knowledge sources, no tools) | All of the above + retrieval accuracy, grounding, hallucination prevention |
| Agentic (multi-step, tool use, orchestration) | All of the above + tool selection, action correctness, error recovery, task completion |
A simple FAQ bot doesn't need tool-routing tests. A multi-step workflow agent does. The toolkit handles this so you test what actually matters.
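The cumulative scoping in the table above ("all of the above +") can be sketched as an ordered list of tiers, where each architecture inherits every test area from the tiers before it. The tier keys and the function `areas_for` are hypothetical names chosen for this example; the toolkit's internal representation is not described in the source.

```python
# Ordered from simplest to most complex architecture. Each tier lists only
# the test areas it adds on top of the previous tiers (from the table above).
TIERS = [
    ("prompt-level", ["response quality", "tone", "boundaries", "refusal behavior"]),
    ("rag", ["retrieval accuracy", "grounding", "hallucination prevention"]),
    ("agentic", ["tool selection", "action correctness", "error recovery", "task completion"]),
]


def areas_for(architecture):
    """Return every test area that applies to the given architecture tier."""
    areas = []
    for name, added in TIERS:
        areas.extend(added)
        if name == architecture:
            return areas
    raise ValueError(f"unknown architecture: {architecture}")


faq_bot_areas = areas_for("prompt-level")
workflow_agent_areas = areas_for("agentic")
```

With this shape, a simple FAQ bot gets only the four prompt-level areas, while a multi-step workflow agent gets all eleven, including tool selection.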
The eval generator supports both modes:
npx claudepluginhub microsoft/eval-guide