eval-guide

Name: eval-guide
Author: microsoft

Plan structured evaluations for Copilot Studio AI agents from descriptions, generate grouped test cases in CSV and docx, analyze results with verdicts and patterns, triage failures by root causes, and prioritize interactive fixes using Microsoft's Eval Scenario Library and Triage Playbook.

ai-ml

testing

npx claudepluginhub microsoft/eval-guide

Popularity

Stars

Top 25%

Med: 0·Avg: 285

Installs

Med: 0·Avg: 1

What's Inside

Skills (5)

eval-faq

/eval-faq

Answers AI agent evaluation methodology questions with practical, opinionated guidance grounded primarily in Microsoft's agent evaluation ecosystem (MS Learn, Eval Scenario Library, Triage & Improvement Playbook, Eval Guidance Kit) supplemented by select industry sources.

eval-result-interpreter

/eval-result-interpreter

Analyzes Copilot Studio evaluation CSV results using Microsoft's Triage & Improvement Playbook. Returns a SHIP / ITERATE / BLOCK verdict with root cause classification, diagnostic triage, prioritized remediation, and pattern analysis.

eval-generator

/eval-generator

Stage 2 standalone — turns an eval plan (output of `/eval-suite-planner`) into concrete test cases grouped by quality signal. Methods live at the signal level (one or many per signal); each criterion shares the signal's method set. Outputs one CSV per signal (2 columns: Question + Expected response, one row per case) plus a customer-ready `.docx` test-case report and an `eval-setup-guide.docx` that walks the customer through assigning testing methods per row manually in Copilot Studio's Evaluate tab. Use after planning, before running.

eval-suite-planner

/eval-suite-planner

Stage 1 standalone — turns an Agent Vision (or plain-English description) into a structured eval plan: 10–15 acceptance criteria phrased "The agent should…", each placed on a Value × Risk matrix (High Value · High Risk / High Value · Low Risk / Low Value · High Risk / Low Value · Low Risk), each with explicit pass/fail conditions and a test method. Output is a customer-ready `.docx` eval plan. Use before generating test cases or running any evals.

eval-triage-and-improvement

/eval-triage-and-improvement

Use this skill when the user's Copilot Studio agent evaluations have come back and they need to interpret scores, diagnose root causes of underperforming test cases, find remediation steps, or analyze patterns to improve their agent. Always use this skill when the user mentions: "eval failed", "why did this fail", "triage", "diagnose failure", "low pass rate", "fix evaluation results", "not passing", "failing test cases", "evaluation results", "improve my eval scores", or any situation where eval scores need interpretation and action.
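The eval plan described for `/eval-suite-planner` above (10–15 criteria phrased "The agent should…", each placed on the Value × Risk matrix with pass/fail conditions and a method) can be pictured as a small data structure. This is a minimal sketch, not the plugin's own format: the names `Criterion`, `Level`, and `validate_plan` are hypothetical, and the method field is free text because the page does not list the actual method names.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    HIGH = "High"
    LOW = "Low"


@dataclass(frozen=True)
class Criterion:
    """One acceptance criterion in a hypothetical eval plan."""
    statement: str        # phrased "The agent should..."
    value: Level          # position on the Value axis
    risk: Level           # position on the Risk axis
    pass_condition: str
    fail_condition: str
    method: str           # free-text label; real method names are not given here

    @property
    def quadrant(self) -> str:
        # Matches the quadrant labels used in the skill description.
        return f"{self.value.value} Value · {self.risk.value} Risk"


def validate_plan(criteria: list[Criterion]) -> list[str]:
    """Return a list of problems; an empty list means the plan looks well-formed."""
    problems = []
    if not 10 <= len(criteria) <= 15:
        problems.append(f"expected 10-15 criteria, got {len(criteria)}")
    for c in criteria:
        if not c.statement.startswith("The agent should"):
            problems.append(f"not phrased as 'The agent should...': {c.statement!r}")
    return problems
```

A check like `validate_plan` is only a structural sanity test. Whether a criterion is worth testing is still a judgment made during planning.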

Stats

Version: 1.0.0

Language: HTML

Stars: 8

Forks: 4

Maintenance: Excellent

License: MIT

Last Commit: May 6, 2026

Added: Mar 30, 2026

README

eval-guide

AI agent evaluation toolkit for Copilot Studio. Plan evals, generate test cases, interpret results, and triage failures — from Claude Code or GitHub Copilot.

Grounded in Microsoft's Eval Scenario Library, Triage & Improvement Playbook, Common Evaluation Approaches, and MS Learn agent evaluation documentation.

Install

Claude Code

claude plugin marketplace add microsoft/eval-guide
claude plugin install eval-guide@eval-guide

GitHub Copilot

npx skills add microsoft/eval-guide

Skills

| Skill | Command | What it does |
| --- | --- | --- |
| Eval Guide | `/eval-guide` | Full eval lifecycle: discover, plan, generate, run, interpret. Start here. |
| Eval Suite Planner | `/eval-suite-planner` | Structured eval plan with scenarios, methods, quality signals, thresholds, and test data strategy |
| Eval Generator | `/eval-generator` | Test cases for single-response and conversation (multi-turn) evaluation modes |
| Eval Result Interpreter | `/eval-result-interpreter` | SHIP / ITERATE / BLOCK verdict with root cause classification |
| Eval Triage & Improvement | `/eval-triage-and-improvement` | Interactive diagnosis and remediation for failing evals |
| Eval FAQ | `/eval-faq` | Methodology questions answered from Microsoft's eval ecosystem |
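To make the SHIP / ITERATE / BLOCK verdict and root cause classification concrete, here is a minimal sketch that thresholds an overall pass rate and ranks root-cause labels by frequency. The thresholds `ship_at` and `block_below` and the example labels are illustrative assumptions. The actual decision rules live in Microsoft's Triage & Improvement Playbook and are not reproduced on this page.

```python
from collections import Counter


def verdict(results: list[bool], ship_at: float = 0.9, block_below: float = 0.6) -> str:
    """Map pass/fail results to SHIP / ITERATE / BLOCK.

    ship_at and block_below are illustrative thresholds, not values taken
    from the Triage & Improvement Playbook.
    """
    if not results:
        raise ValueError("no results to judge")
    pass_rate = sum(results) / len(results)
    if pass_rate >= ship_at:
        return "SHIP"
    if pass_rate < block_below:
        return "BLOCK"
    return "ITERATE"


def failure_patterns(root_causes: list[str]) -> list[tuple[str, int]]:
    """Rank root-cause labels of failed cases, most frequent first."""
    return Counter(root_causes).most_common()
```

For example, `verdict([True] * 7 + [False] * 3)` falls between the two assumed thresholds and yields `"ITERATE"`. A real verdict would likely also weigh which criteria failed, not just how many.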

Quick start

> /eval-guide

Tell me about your agent — what does it do, who uses it, and what does "good" look like?

Works the same in both Claude Code and GitHub Copilot.

The toolkit walks you through Microsoft's 4-stage evaluation lifecycle, preceded by a discovery step (Stage 0):

| Stage | What happens | Works without a running agent? |
| --- | --- | --- |
| 0. Discover | Articulate what the agent does and what success looks like | Yes |
| 1. Plan | Scope eval depth by agent architecture, map to scenario types, pick methods, set thresholds | Yes |
| 2. Generate & Baseline | Produce test case CSVs (single-response) or conversation blueprints (multi-turn) importable into Copilot Studio | Yes |
| 3. Run | Execute tests against a live agent | Needs running agent |
| 4. Interpret & Improve | Triage results, classify root causes, prioritize fixes, re-test | Needs eval results |

Stages 0-2 work from just an agent description — no running agent required.
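The Stage 2 output for single-response mode is described elsewhere on this page as one CSV per quality signal with two columns, Question and Expected response, one row per case. A minimal sketch of writing files in that shape follows. The function name, the file-naming scheme, and the exact header spelling are assumptions, so check them against what Copilot Studio's import actually expects.

```python
import csv
from pathlib import Path


def write_signal_csvs(
    cases_by_signal: dict[str, list[tuple[str, str]]], out_dir: Path
) -> list[Path]:
    """Write one two-column CSV per quality signal and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for signal, cases in cases_by_signal.items():
        # Hypothetical naming scheme: lower-case signal name with dashes.
        path = out_dir / f"{signal.lower().replace(' ', '-')}.csv"
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            # Header spelling is assumed from the skill description.
            writer.writerow(["Question", "Expected response"])
            writer.writerows(cases)  # one row per test case
        written.append(path)
    return written
```

Using the `csv` module rather than joining strings by hand keeps questions that contain commas or quotes correctly escaped.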

Interactive dashboard review

Stages 0, 1, 2, and 4 each generate an interactive HTML dashboard that opens locally in your browser. You review, edit inline, and confirm before the AI proceeds, so there is no back-and-forth in chat to fix test cases.

Stage complete → Dashboard opens → You review & edit → Confirm → Final artifacts generated

| Stage | What you review in the dashboard | What you can edit |
| --- | --- | --- |
| 0. Discover | Agent Vision (purpose, users, knowledge, capabilities, boundaries, success criteria) | All fields inline, add/remove list items |
| 1. Plan | Scenario table, methods, thresholds, quality signals | Add/remove scenarios, change methods, adjust thresholds |
| 2. Generate | Test cases per quality signal | Edit expected responses, questions, methods, add/remove cases |
| 4. Interpret | Verdict, failure triage, root causes, actions | Reclassify root causes, add comments |

Final deliverables (.docx reports, .csv test sets) are only generated after you confirm via the dashboard.

The dashboard is a standalone HTML file generated by skills/eval-guide/dashboard/serve.py (zero dependencies) and opened directly in your browser — no server required. Feedback auto-saves as you edit via localStorage — if the browser closes, your work is preserved.
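The standalone-file-plus-localStorage pattern described above can be sketched with the Python standard library alone. This is not the contents of `serve.py`. The page template, the storage key, and the function names are all hypothetical, and the sketch shows only the idea: generate one self-contained HTML file, open it in the browser, and let inline JavaScript auto-save edits.

```python
import json
import tempfile
import webbrowser
from pathlib import Path

PAGE = """<!doctype html>
<meta charset="utf-8">
<title>Eval review (sketch)</title>
<textarea id="notes" rows="10" cols="80"></textarea>
<script>
  const KEY = __KEY__;  // localStorage key for this review
  const box = document.getElementById("notes");
  // Restore a saved draft if one exists, else start from the generated text.
  box.value = localStorage.getItem(KEY) ?? __INITIAL__;
  // Auto-save on every edit so closing the browser does not lose work.
  box.addEventListener("input", () => localStorage.setItem(KEY, box.value));
</script>
"""


def _js_literal(text: str) -> str:
    # json.dumps yields a valid JavaScript string literal; escaping "</"
    # keeps the text from closing the <script> element early.
    return json.dumps(text).replace("</", "<\\/")


def build_dashboard(initial_text: str, storage_key: str = "eval-review-draft") -> str:
    """Return a self-contained HTML page with the initial text embedded."""
    page = PAGE.replace("__KEY__", _js_literal(storage_key))
    return page.replace("__INITIAL__", _js_literal(initial_text))


def open_dashboard(html: str) -> Path:
    """Write the page to a temporary file and open it in the default browser."""
    path = Path(tempfile.mkdtemp()) / "dashboard.html"
    path.write_text(html, encoding="utf-8")
    webbrowser.open(path.as_uri())
    return path
```

Calling `open_dashboard(build_dashboard("Draft eval plan"))` would open the page without starting a server. How browsers scope localStorage for `file://` pages can vary, so a real tool would need to test this on the browsers it targets.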

Architecture-aware eval scoping

The toolkit automatically scopes evaluation depth based on your agent's architecture:

| Architecture | What gets tested |
| --- | --- |
| Prompt-level (simple Q&A, no knowledge sources) | Response quality, tone, boundaries, refusal behavior |
| RAG / Knowledge-grounded (has knowledge sources, no tools) | All of the above + retrieval accuracy, grounding, hallucination prevention |
| Agentic (multi-step, tool use, orchestration) | All of the above + tool selection, action correctness, error recovery, task completion |

A simple FAQ bot doesn't need tool-routing tests. A multi-step workflow agent does. The toolkit handles this so you test what actually matters.
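The cumulative scoping in the table above ("all of the above + ...") can be expressed as a tiered lookup. The sketch below is illustrative only: the tier keys, `classify`, and `eval_scope` are hypothetical names, and the two-flag classification is a simplification of however the toolkit actually decides the architecture.

```python
# Tiers ordered from simplest to most capable; each adds to the ones before it.
ARCHITECTURES = ["prompt-level", "rag", "agentic"]

ADDED_AREAS = {
    "prompt-level": ["response quality", "tone", "boundaries", "refusal behavior"],
    "rag": ["retrieval accuracy", "grounding", "hallucination prevention"],
    "agentic": ["tool selection", "action correctness", "error recovery", "task completion"],
}


def classify(has_knowledge_sources: bool, uses_tools: bool) -> str:
    """Pick a tier from two coarse facts about the agent (a simplification)."""
    if uses_tools:
        return "agentic"
    if has_knowledge_sources:
        return "rag"
    return "prompt-level"


def eval_scope(architecture: str) -> list[str]:
    """Return every test area for a tier, including those of lower tiers."""
    tier = ARCHITECTURES.index(architecture)  # raises ValueError if unknown
    scope = []
    for name in ARCHITECTURES[: tier + 1]:
        scope.extend(ADDED_AREAS[name])
    return scope
```

With this layout, `eval_scope(classify(has_knowledge_sources=False, uses_tools=False))` returns only the four prompt-level areas, which mirrors the point that a simple FAQ bot does not need tool-routing tests.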

Single-response and conversation evaluation

The eval generator supports both modes: single-response and conversation (multi-turn). The full README continues on GitHub.

Similar Plugins

fullstack-dev-skills

claude-md-management

godot-skills

nature-skills

skill-creator

unity-dev-toolkit

More by microsoft

webwright

agt-governance

azure-sdk-java

azure-sdk-rust

azure-sdk-python
