From agent-eval-harness
Generate eval.yaml configuration for the agent eval harness by analyzing a skill's SKILL.md, sub-skills, scripts, and test cases. Useful for setting up evaluation, testing, quality checks, and benchmarking skills.
Slash command: /agent-eval-harness:eval-analyze
You analyze a target skill and produce `eval.yaml` — the configuration that `/eval-run` needs. You read the skill deeply (including sub-skills it invokes), explore existing test cases, and generate everything: dataset schema, output descriptions, judges, and thresholds.
The core principle: observe, don't assume. Every field name, file pattern, and directory path in the generated eval.yaml must come from reading actual files. If you can't point to a specific file or field you observed, don't put it in the config.
| Argument | Required | Default | Description |
|---|---|---|---|
| `--skill <name>` | no | auto-detect | Which skill to analyze |
| `--config <path>` | no | auto-discover | Output path for the config |
| `--update` | no | false | Fill in missing sections only, preserve user edits |
mkdir -p tmp
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py init tmp/analyze-config.yaml \
skill=<skill> config=<config> update=<true/false>
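As a rough illustration of the argument table above, the three flags could be parsed as follows. This is only a sketch: the function name `parse_analyze_args` is hypothetical, and using `None` to stand for "auto-detect" and "auto-discover" is an assumption, not something the harness is known to do.

```python
import argparse


def parse_analyze_args(argv):
    """Parse the three documented flags (hypothetical helper, not part of the harness)."""
    parser = argparse.ArgumentParser(prog="eval-analyze")
    # None stands in for "auto-detect" / "auto-discover" (an assumption of this sketch).
    parser.add_argument("--skill", default=None, help="Which skill to analyze")
    parser.add_argument("--config", default=None, help="Output path for the config")
    parser.add_argument("--update", action="store_true",
                        help="Fill in missing sections only, preserve user edits")
    return parser.parse_args(argv)


args = parse_analyze_args(["--skill", "my-skill", "--update"])
print(args.skill, args.config, args.update)
```

The parsed values map directly onto the `skill=`, `config=`, and `update=` pairs passed to `state.py init`.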
If --config was explicitly provided, use that path directly (skip discovery).
Otherwise, discover existing eval configs:
python3 ${CLAUDE_SKILL_DIR}/../../scripts/discover.py
Based on discovery results:
- No existing config: use eval.yaml at the project root (simple default for first eval).
- A config exists and --skill targets a different eval than the existing one: offer to reorganize into eval/ layout. If the user accepts, run the reorganization script (see Phase 7). If declined, ask where to put the new config.
- Several configs already exist: use eval/<skill-name>/eval.yaml (nested) or place the config alongside the existing flat configs.
- --config provided: use the explicit path, bypass layout logic.

Set the resolved config path as <config> for all subsequent steps. Set <eval_md_path> to the same directory as <config>, with filename eval.md.
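The relationship between `<config>` and `<eval_md_path>` is simple enough to show in code. The helper below is a hypothetical sketch of that derivation; the harness scripts may compute it differently.

```python
from pathlib import Path


def eval_md_path_for(config_path):
    """Return the eval.md path that sits in the same directory as the config file."""
    return Path(config_path).with_name("eval.md")


# Nested layout and root-level layout both keep eval.md next to the config.
print(eval_md_path_for("eval/my-skill/eval.yaml").as_posix())
print(eval_md_path_for("eval.yaml").as_posix())
```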
If --skill was provided, locate its SKILL.md:
python3 ${CLAUDE_SKILL_DIR}/scripts/find_skills.py --name <skill>
If not provided, list all project skills:
python3 ${CLAUDE_SKILL_DIR}/scripts/find_skills.py
This reads .claude-plugin/plugin.json for custom skill paths, falls back to .claude/skills/ and skills/, and excludes eval harness skills. If only one skill is found, use it automatically. If multiple, ask the user which to analyze. If none are found, tell the user — they may need to check their skill directory paths or create a skill first.
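The one/many/none rule for discovered skills can be expressed as a small decision function. This is a sketch only: `choose_skill` and its action labels are hypothetical names, and it assumes the discovery step has already produced a plain list of skill names.

```python
def choose_skill(found):
    """Map discovery results to the next action as an (action, payload) pair.

    `found` is a list of skill names; the action labels are illustrative only.
    """
    if not found:
        # Nothing discovered: the user should check skill paths or create a skill.
        return ("report_none", None)
    if len(found) == 1:
        # Exactly one skill: use it without asking.
        return ("use", found[0])
    # Several skills: the user has to pick one.
    return ("ask_user", sorted(found))
```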
If --update and eval.yaml already has a skill field: use that skill. If --skill is also provided and differs, ask the user which they mean — don't silently overwrite.
If the resolved <config> already exists and --update was not set:
test -f <config> && echo "CONFIG_EXISTS" || echo "NO_CONFIG"
If it exists, validate it:
python3 ${CLAUDE_SKILL_DIR}/scripts/validate_eval.py config <config>
Then check if eval.md (the cached analysis) is still fresh — meaning the SKILL.md hasn't changed since the last analysis:
python3 ${CLAUDE_SKILL_DIR}/scripts/validate_eval.py memory <eval_md_path>
If FRESH and eval.yaml has a non-empty dataset.schema, at least one outputs entry with a schema, at least one judge, and models.skill set, report that config is up to date and exit. No work needed. (An INCOMPLETE config — empty sections, or missing models.skill from a pre-restructure eval.yaml — still needs analysis.)
If STALE, NO_CONFIG, or --update was set, proceed to full analysis.
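The exit-or-analyze decision above can be summarized in a few lines. The sketch below assumes the config has been loaded into a dict, that judges live under a top-level `judges` key, and that `outputs` is a list of dicts; those shapes and both function names are assumptions made for illustration.

```python
def is_complete(config):
    """True when every section required by the up-to-date check is filled in."""
    dataset_schema = (config.get("dataset") or {}).get("schema")
    outputs = config.get("outputs") or []
    judges = config.get("judges") or []  # key name assumed for this sketch
    skill_model = (config.get("models") or {}).get("skill")
    has_output_schema = any(isinstance(o, dict) and o.get("schema") for o in outputs)
    return bool(dataset_schema) and has_output_schema and bool(judges) and bool(skill_model)


def next_action(status, update, config):
    """Return "exit" only for a fresh, complete config when --update was not set."""
    if status == "FRESH" and not update and is_complete(config):
        return "exit"
    # STALE, NO_CONFIG, --update, or an incomplete config all lead to full analysis.
    return "analyze"
```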
This is the most important step — the quality of everything downstream depends on how thoroughly you understand the skill.
Launch an Explore agent to do the analysis:
- Read ${CLAUDE_SKILL_DIR}/prompts/analyze-skill.md to get the analysis instructions.
- Launch the agent with subagent_type="Explore".

The analysis is recursive: the agent follows sub-skill chains (Skill tool calls, /skill-name references) until it finds the skills that produce the final artifacts (typically 2-5 levels, capped at 5 so that circular references cannot loop forever), reading each sub-skill's SKILL.md to trace the full pipeline. The outputs section must describe what the entire pipeline produces, not just the top-level orchestrator.
The agent returns structured YAML with: purpose, inputs, outputs, sub_skills, flags, pipeline, quality_criteria, and suggested_judges. See ${CLAUDE_SKILL_DIR}/prompts/analyze-skill.md for the full schema.
Verify the response: check that outputs reference actual directories and file patterns (not placeholders like <output-dir>), that sub_skills lists real skill names, and that suggested_judges include working code snippets. If anything looks fabricated, ask the agent to re-examine specific files.
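Part of this verification, the check for leftover placeholders, lends itself to a mechanical test. The sketch below assumes each `outputs` entry is a dict with a `path` key; that shape and the helper name are hypothetical, and the full schema in analyze-skill.md takes precedence.

```python
import re

# Matches tokens such as <output-dir> or <name>; glob patterns like *.md do not match.
PLACEHOLDER = re.compile(r"<[A-Za-z][\w-]*>")


def find_placeholders(outputs):
    """Return the output paths that still contain <placeholder> tokens."""
    return [o["path"] for o in outputs if PLACEHOLDER.search(o.get("path", ""))]


suspect = find_placeholders([
    {"path": "<output-dir>/report.md"},
    {"path": "artifacts/*.md"},
])
print(suspect)
```

Any path returned here is a cue to ask the agent to re-examine the files it based that entry on.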
First check if eval.yaml already has a dataset.path (from a previous run or --update):
ls <dataset_path>/ 2>/dev/null | head -20
If not set or doesn't exist, search the project (relative to <config> directory) for test case directories using the Glob tool:
Glob: **/cases/ or **/test-cases/ or **/fixtures/ or **/examples/ or **/dataset/ or **/eval/ or **/tests/data/
Exclude .venv/, .git/, node_modules/ from results.
If nothing found, ask the user where their test cases are (or will be).
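Outside the Glob tool, the same search can be approximated with `pathlib`. This is a sketch under the assumption that matching on directory names is good enough; `find_case_dirs` is a hypothetical helper, not one of the harness scripts.

```python
from pathlib import Path

CANDIDATE_NAMES = {"cases", "test-cases", "fixtures", "examples", "dataset", "eval"}
EXCLUDED_PARTS = {".venv", ".git", "node_modules"}


def find_case_dirs(root):
    """List candidate test-case directories under root, as sorted relative paths."""
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        if not path.is_dir():
            continue
        rel = path.relative_to(root)
        # Skip anything inside an excluded directory.
        if EXCLUDED_PARTS & set(rel.parts):
            continue
        # tests/data is the only two-segment pattern in the list.
        if path.name in CANDIDATE_NAMES or rel.parts[-2:] == ("tests", "data"):
            hits.append(rel.as_posix())
    return sorted(hits)
```

The walk still visits excluded directories before filtering them, so on a large repository the Glob tool is likely the faster option.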
If a cases directory exists, read one complete sample case — every file in it. Note:
This is what you'll describe in dataset.schema. If you didn't read the actual files, your schema description will be wrong — and downstream judges will fail because they expect fields that don't exist.
If no test cases exist, note this clearly and suggest running /eval-dataset to generate them. Describe the expected case structure in dataset.schema anyway — eval-dataset uses that description to create matching cases.
Combine the skill analysis (Step 3) and dataset exploration (Step 4) into a complete eval.yaml. Read the full template and writing guidance at ${CLAUDE_SKILL_DIR}/references/eval-yaml-template.md.
Key points:
- Set execution.mode from the skill analysis (Step 3). If the analyzer returned ASK_USER, ask the user which mode to use: explain what the analyzer observed and let them decide. Do not default to case without evidence; a skill that processes collections of items internally (batch-size controls, multi-item iteration, multi-agent fan-out, result aggregation) is batch even if it also accepts a single item. See eval-yaml-template.md for the full mode selection guidance.
- Set execution.arguments. For case mode, build a template with {field} placeholders matching the input.yaml fields you observed in Step 4 (e.g., "{strat_key} {adr_file?}"). For batch mode, use the literal arguments string (e.g., "--input batch.yaml --headless").
- runner.type: claude-code is the default and almost always correct. Only change it if the user has explicitly mentioned another harness.
- Set models.skill to claude-opus-4-6 (the default for eval runs). Set models.judge to claude-opus-4-6; LLM and pairwise judges need a strong model for accurate scoring. If the skill uses AskUserQuestion interactively (not --headless), set models.hook to claude-sonnet-4-6 for LLM-based question answering (fast enough for picking options, cheaper than Opus). CLI flags override.
- Set mlflow.experiment to <project>-eval (or leave blank; it falls back to the top-level name).
- The dataset.schema and outputs[*].schema fields drive the entire pipeline. Be specific and reference the actual file and field names you observed.
- If the skill depends on external systems (for example, it reads environment variables such as JIRA_SERVER), annotate the affected fields in dataset.schema with [EXTERNAL: System] markers (e.g., 'project_key' ([EXTERNAL: Jira] — must be a real project key)). This tells /eval-dataset not to fabricate values for these fields. See eval-yaml-template.md for the convention.
- If the skill's allowed-tools frontmatter includes Skill (meaning it invokes sub-skills), add "Skill" to permissions.allow. The Skill tool requires explicit permission in headless mode; without it, nested skill calls fail silently and the pipeline degrades.
- If the skill needs environment variables (e.g., JIRA_SERVER for a jira-emulator, API keys for test instances), add execution.env entries. Use $VAR syntax for values that should be resolved from the caller's environment (e.g., $JIRA_TOKEN), or literal values for test-only endpoints (e.g., http://localhost:8080).
- If tool calls need to be intercepted during evaluation, add inputs.tools entries. Use match to describe what to intercept in natural language (e.g., "any Jira interaction via MCP or scripts"), and prompt for how to handle it. The AskUserQuestion hook uses 3-tier answer resolution: exact match from case_overrides, then an LLM call (using models.hook) with the case's input.yaml and answers.yaml as context, then fallback to the first option. If the skill asks domain-specific questions (e.g., "is this a duplicate?"), suggest the user create answers.yaml files per case with guidance for the LLM answerer.
- Judges can read outputs["annotations"], the parsed annotations.yaml from the dataset case. Use this for outcome-aware scoring where the expected result depends on the test case (e.g., annotations.get("dedup_is_duplicate") determines whether producing no output is correct).
- Builtin judges live in agent_eval/judges/. Use builtin: instead of writing inline code. Discover available builtins: python3 ${CLAUDE_SKILL_DIR}/scripts/list_builtins.py. See the template for examples.
- arguments: all judge types support an arguments: dict. Use it instead of hardcoding values in check code or prompt text. For inline checks, arguments is passed as the second parameter. For LLM prompts, use {{ arguments.key }} (Jinja2 rendered).
- Judge mix: builtin judges + 2-3 inline check judges + 1-2 LLM prompt judges. Start lean.
- With --update: preserve everything already in the file, only add missing top-level keys (e.g., add a models: block if the user is upgrading from an older config that lacked it). Check existing inline check judges: if any use the old (outputs) signature (single parameter), update them to (outputs, arguments) (the current contract). Also check LLM judge prompts for literal {{ }} that isn't a template variable; all prompts are now Jinja2 rendered.

After writing eval.yaml to the resolved <config> path, validate that all references are correct:
python3 ${CLAUDE_SKILL_DIR}/scripts/validate_eval.py config <config>
This checks dataset path exists (resolved relative to the config file's directory), output paths are relative, judge prompt_file/context/module references resolve, and runner.settings exists.
Errors (exit code 1): fix before proceeding — broken file references, absolute paths, missing modules.
Warnings (exit code 0): may be expected — empty dataset (user hasn't created cases yet), missing judges (will be added later). Report them to the user but don't block.
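The error/warning split above maps onto the validator's exit code, which a wrapper could check like this. The sketch uses a stand-in command so it runs anywhere; `run_validator` is a hypothetical helper, and treating all captured output as the message to report is an assumption.

```python
import subprocess
import sys


def run_validator(command):
    """Run a validator command and return (ok, combined_output).

    ok is True for exit code 0 (clean or warnings only) and False otherwise.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    output = (result.stdout + result.stderr).strip()
    return result.returncode == 0, output


# Stand-in for a failing validator run (exit code 1), used only for illustration.
ok, output = run_validator(
    [sys.executable, "-c", "import sys; print('broken ref'); sys.exit(1)"]
)
print(ok, output)
```

In real use the command would be the `validate_eval.py config <config>` invocation shown earlier; when `ok` is False, fix the reported references before continuing, and when it is True, pass any printed warnings on to the user.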
The eval.md caches the skill analysis so it doesn't need to be repeated. Write it to <eval_md_path> (same directory as the config file). The hash tracks only the top-level SKILL.md — if sub-skills change, the user should run /eval-analyze --update to refresh. Compute the skill hash:
python3 -c "import hashlib; from pathlib import Path; print(hashlib.sha256(Path('<skill-path>/SKILL.md').read_bytes()).hexdigest()[:12])"
Read the template at ${CLAUDE_SKILL_DIR}/prompts/generate-eval-md.md. Write eval.md with YAML frontmatter (skill, analyzed_at, skill_hash) and a markdown narrative of the analysis.
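For reference, the hash one-liner and the three frontmatter fields can be combined as below. This is a sketch: the ISO 8601 UTC timestamp format for `analyzed_at` is an assumption, both function names are hypothetical, and the template in generate-eval-md.md decides the real layout.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def skill_hash(skill_md):
    """First 12 hex characters of the SHA-256 of SKILL.md, matching the one-liner."""
    return hashlib.sha256(Path(skill_md).read_bytes()).hexdigest()[:12]


def render_frontmatter(skill, digest, analyzed_at=None):
    """Build the YAML frontmatter block for eval.md."""
    # Timestamp format is an assumption of this sketch.
    stamp = analyzed_at or datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"---\nskill: {skill}\nanalyzed_at: {stamp}\nskill_hash: {digest}\n---\n"
```

Because only the top-level SKILL.md is hashed, a changed sub-skill leaves the digest untouched, which is why a manual refresh with --update is needed in that case.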
Tell the user what was generated:
- Dataset: <path> (M cases found)
- eval.md written (skill hash: <hash>)
- Run /eval-dataset to generate test cases (required before eval-run)
- Run /eval-run --model <model> to execute the evaluation

If validation produced warnings, list them so the user knows what's incomplete.
$ARGUMENTS
npx claudepluginhub opendatahub-io/agent-eval-harness --plugin agent-eval-harness