Skill

evaluate

Run the MCP evaluation harness against a user's MCP server using a QA-pair XML file. Wraps scripts/evaluation.py (stdio/http/sse transports) and returns a scored markdown report. Invoke with `/mcp-builder:evaluate <eval.xml> <server-description>` — server-description can be natural language describing how to launch or reach the server.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/mcp-builder:evaluate

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

BashReadWrite

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are wrapping `${CLAUDE_PLUGIN_ROOT}/scripts/evaluation.py` for the user. They have provided a QA XML file and a description of their MCP server. Your job: parse the description, build the right invocation, run it, and summarize the report.

SKILL.md

68 lines · ~929 tokens

Stats

LanguagePython

Stars0

MaintenanceExcellent

Last CommitApr 19, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

/mcp-builder:evaluate

You are wrapping ${CLAUDE_PLUGIN_ROOT}/scripts/evaluation.py for the user. They have provided a QA XML file and a description of their MCP server. Your job: parse the description, build the right invocation, run it, and summarize the report.

Procedure

Verify prerequisites. Run these in parallel:
- test -f "${CLAUDE_PLUGIN_ROOT}/scripts/evaluation.py" && echo OK (script exists)
- python3 -c "import anthropic, mcp" 2>&1 (deps installed)
- test -n "$ANTHROPIC_API_KEY" && echo OK || echo MISSING (API key set)
If deps are missing, tell the user to run:
```
pip install -r "${CLAUDE_PLUGIN_ROOT}/scripts/requirements.txt"
```
If ANTHROPIC_API_KEY is unset, stop and ask the user to export it before retrying. Do not prompt them for the key value.

Parse the user's arguments. The first argument is the path to the eval XML file. Everything else describes the server. Map the description to transport flags:

User says	Transport	Flags
"stdio: node dist/index.js"	`stdio`	`-t stdio -c node -a dist/index.js`
"stdio: python server.py with API_KEY=xyz"	`stdio`	`-t stdio -c python -a server.py -e API_KEY=xyz`
"http://host/mcp" or "https://..."	`http`	`-t http -u <url>`
"http://host/mcp with header Authorization: Bearer X"	`http`	`-t http -u <url> -H "Authorization: Bearer X"`
"sse://..." or "SSE endpoint ..."	`sse`	`-t sse -u <url>`

When the user's description is ambiguous, ask one targeted clarifying question rather than guessing.

Verify the eval file exists and preview it:
```
test -f "<eval.xml>" && head -30 "<eval.xml>"
```
If it's missing, stop and tell the user the path didn't resolve.
Run the evaluation. Allocate a tempfile with mktemp (so concurrent invocations don't clobber each other), then write the report there:
```
REPORT=$(mktemp -t mcp-eval-report.XXXXXX.md)
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/evaluation.py" \
  <flags-from-step-2> \
  "<eval.xml>" \
  -o "$REPORT"
echo "Report: $REPORT"
```
Stream with a long timeout (evals take minutes). If the user passed -m <model> in their arguments, forward it; otherwise accept the script default.
Summarize the report. Read $REPORT and produce:
- Accuracy (correct / total)
- Average duration and tool calls per task
- Names of any failed tasks and a one-line reason each (from their <feedback> block)
- Path to the full report
Do not paste the whole report — it's long. Offer to show specific failed tasks if the user asks.

Notes

The harness is Python 3 and must find mcp + anthropic on the active interpreter. If the user's system has multiple Pythons, use python3 explicitly.
For stdio transport, the harness launches the server itself — never start it separately first.
For http/sse, the user must start their server before invoking this skill.
Scoring is exact string match on <response> tags. A "failed" task may just mean the answer format differs — surface the actual vs. expected so the user can judge.

evaluate

Invocation

Tool Access

Context Preview

SKILL.md

evaluate

Invocation

Tool Access

Context Preview

SKILL.md

/mcp-builder:evaluate

Procedure

Notes

Similar Skills

/mcp-builder:evaluate

Procedure

Notes

Similar Skills