From agent-eval-harness
Generates evaluation test cases for skills by analyzing skill config and metadata. Bootstraps datasets or expands existing ones for /eval-run.
Slash command: /agent-eval-harness:eval-dataset
You generate evaluation test cases for a skill. You read the skill analysis (eval.md) and eval config (eval.yaml) to understand what the skill does, then create realistic test cases that match the dataset schema. The goal is giving `/eval-run` something meaningful to test against.
| Argument | Required | Default | Description |
|---|---|---|---|
| `--config <path>` | no | auto-discover | Path to eval config |
| `--count <N>` | no | 5 | Number of cases to generate |
| `--strategy <type>` | no | bootstrap | Generation strategy (see Step 3) |
| `--run-id <id>` | no | - | Previous eval run to learn from (used with expand) |
| `--harbor` | no | - | Also generate Harbor task packages (Step 8) |
| `--image <image>` | with --harbor | - | Container image for Harbor task packages |
If --config was explicitly provided, use that path directly. Otherwise, auto-discover:
python3 ${CLAUDE_SKILL_DIR}/../../scripts/discover.py

If no <config> is found, run /eval-analyze first.

Read eval.yaml and eval.md to understand:
- execution.mode (case or batch) and execution.arguments (the argument template). In case mode, {field} placeholders in the arguments are resolved per case from input.yaml, so every field referenced in the template (e.g., {strat_key}, {prompt}) must exist in the generated input.yaml files.
- dataset.schema describes the case structure (files, fields, formats).
- outputs[*].schema describes what the skill produces (informs what reference outputs look like).
- builtin judges have predefined criteria; list them with python3 ${CLAUDE_SKILL_DIR}/../eval-analyze/scripts/list_builtins.py. Use the builtin's known behavior to inform case design (e.g., cost_budget → include a large-input case that tests cost scaling; output_completeness → include a case with many requirements).
- check snippets reveal exact validation logic: what fields are accessed, what thresholds are used, what conditions trigger pass/fail.
- prompt / prompt_file text describes quality dimensions (completeness, accuracy, etc.).
- description summarizes what each judge evaluates.

Build a list of judge-driven requirements: the concrete things judges will check. Each test case should be designed to exercise at least one of these requirements. For example:
- len(content) >= 100 → include a case with minimal input that might produce short output

If eval.yaml doesn't exist, ask the user which skill to evaluate, then invoke /eval-analyze to create the config:
Use the Skill tool to invoke /eval-analyze --skill <skill-name>
Wait for the analysis to complete, then re-read eval.yaml. If /eval-analyze fails or the user skips it, you cannot generate meaningful cases — stop and explain why.
If eval.md doesn't exist, you can still work from eval.yaml's schema descriptions, but the cases will be less targeted.
After reading the skill analysis and judges, estimate whether --count is sufficient. Count the skill's distinct execution paths (branches, modes, optional steps), the number of judges, and the number of conditional judges. A rough guideline: you need at least one case per execution path, plus enough variety for each judge to have both passing and failing examples. If the skill has 4 execution paths and 6 judges, 5 cases may be thin — suggest a higher count to the user ("This skill has N distinct paths and M judges — consider --count 12 for better coverage").
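The sizing guideline above can be turned into a rough calculation. The sketch below is a hypothetical heuristic, not part of the harness: it assumes one case per execution path and two cases per judge (one expected to pass, one expected to fail), and takes the larger of the two because a single case can cover a path and a judge at once. The function name and the factor of two are assumptions, and conditional judges are not modeled separately.

```python
def suggest_case_count(paths: int, judges: int, requested: int) -> tuple[int, bool]:
    """Return (suggested_count, requested_is_sufficient).

    Hypothetical heuristic: at least one case per execution path, and at
    least two cases per judge so each judge can see a pass and a fail.
    """
    minimum = max(paths, 2 * judges)
    return max(minimum, requested), requested >= minimum


suggested, enough = suggest_case_count(paths=4, judges=6, requested=5)
if not enough:
    print(f"This skill has 4 distinct paths and 6 judges; consider --count {suggested}")
```

With 4 paths and 6 judges this reproduces the suggestion of 12 from the example above; treat the number as a starting point for the conversation with the user, not a hard rule.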
Read dataset.schema and extract a concrete checklist:
1. Required files: what files each case directory must contain (e.g., input.yaml, reference.md)
2. Required fields per file: for structured files like YAML/JSON, which fields are mandatory
3. Optional fields: fields described with "optionally" or "if available". Vary these across cases (include in some, omit in others) to test the skill's handling of missing optional context
4. Field semantics: what kind of content each field expects (e.g., "problem statement", "clarifying context", "priority level"). Use these descriptions to generate realistic content, not generic placeholders
5. Naming patterns: any file naming conventions mentioned (e.g., "named NNN-slug.md")
6. Argument fields: if execution.mode is case, parse execution.arguments for {field} placeholders. Every placeholder must appear as a required field in input.yaml. Cross-check against items 1-2 above; if {strat_key} is in the arguments but not in the schema, add it as a required field.
7. External-state fields: look for fields marked with [EXTERNAL: System] in the schema description. These reference real resources in external systems (Jira projects, GitHub repos, API endpoints) that must exist at execution time. Do NOT invent values for these fields: fabricated values (e.g., a Jira project key derived from the repo directory name) cause silent failures when the skill queries the external system and gets zero results. Mark these in your generation template as requiring TODO_ placeholder values (see Step 5).
This checklist is your generation template. Every case must satisfy items 1-2 and 6. Items 3-4 guide content variety.
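The argument-field cross-check in the checklist can be sketched in a few lines. This is an illustration only: the function names and the example template string are hypothetical, and it assumes placeholders are simple {name} tokens made of letters, digits, and underscores.

```python
import re

# Matches {field} placeholders such as {strat_key} or {prompt}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def required_input_fields(arguments: str, schema_fields: set[str]) -> set[str]:
    """Schema-required fields plus every placeholder used in the template."""
    return schema_fields | set(PLACEHOLDER.findall(arguments))


def missing_fields(arguments: str, case_input: dict) -> set[str]:
    """Placeholders the template references that one case's input lacks."""
    return set(PLACEHOLDER.findall(arguments)) - set(case_input)


# Hypothetical template; a real one comes from execution.arguments in eval.yaml.
template = "--strategy-key {strat_key} --prompt {prompt}"
print(sorted(required_input_fields(template, {"prompt"})))
print(sorted(missing_fields(template, {"prompt": "Summarize the ticket"})))
```

Running the same check again after generation catches cases where a placeholder field was dropped from an individual input.yaml.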
Check what already exists:
ls <dataset_path>/ 2>/dev/null | head -20
Count existing cases and read one to understand the current structure. Note:
bootstrap (default) — Generate N cases from scratch. Use this when starting from zero or when fewer than 5 cases exist.
Design cases to cover:
expand — Read existing cases, identify gaps, generate cases that fill them. Use this when cases exist but coverage is thin.
Read each existing case's input file to understand what's already covered. Then look for gaps by comparing against:
If --run-id was provided, also read the eval results to target empirical failure patterns:
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<run-id>/summary.yaml
Use the results to prioritize new cases:
Avoid duplicating existing scenarios — each new case should test something distinct that isn't already covered. Number new cases continuing from the highest existing case number.
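Continuing the numbering can be sketched as follows, assuming case directories follow a case-NNN-description pattern. The helper name is hypothetical, and the input is a plain list of directory names so the logic stays independent of the filesystem.

```python
import re

# Assumes directory names start with "case-" followed by a number.
CASE_DIR = re.compile(r"^case-(\d+)")


def next_case_number(existing_dirs: list[str]) -> int:
    """Return the number for the next new case (1 if none exist yet)."""
    numbers = [
        int(match.group(1))
        for name in existing_dirs
        if (match := CASE_DIR.match(name))
    ]
    return max(numbers, default=0) + 1


print(next_case_number(["case-001-simple-basic-input", "case-007-long-input", "README.md"]))
```

Using the highest existing number rather than the count of directories keeps numbering stable when earlier cases have been deleted.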
from-traces — Extract real inputs from MLflow traces and turn them into test cases. Use this when the skill has been used in production and traces are available.
Run the extraction script:
python3 ${CLAUDE_SKILL_DIR}/../eval-mlflow/scripts/from_traces.py \
--config <config> \
--count <N>
This outputs YAML with extracted trace inputs (prompt text, tool interactions). Read the output and create case directories following the generation template from Step 2. The trace inputs give you realistic content for the input fields — but you still need to structure the files according to dataset.schema.
If the script exits with code 2 (no traces found) or MLflow is not configured, tell the user and fall back to expand strategy.
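The fallback rule can be expressed as a small decision function. This is a sketch under assumptions: the function name is hypothetical, the caller is assumed to already know whether MLflow is configured, and treating any other non-zero exit code as a hard error is a choice made here, not something the script documents.

```python
NO_TRACES_EXIT_CODE = 2  # from the text above: exit code 2 means no traces were found


def strategy_after_extraction(returncode: int, mlflow_configured: bool = True) -> str:
    """Pick the strategy to continue with after attempting trace extraction."""
    if not mlflow_configured or returncode == NO_TRACES_EXIT_CODE:
        return "expand"  # fall back, and tell the user why
    if returncode != 0:
        # Assumption: any other non-zero code is a real error, not a fallback case.
        raise RuntimeError(f"trace extraction failed with exit code {returncode}")
    return "from-traces"


print(strategy_after_extraction(2))  # prints "expand"
print(strategy_after_extraction(0))  # prints "from-traces"
```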
For each case, create a directory under dataset.path following the structure described in dataset.schema.
Naming: Use descriptive directory names that indicate what the case tests:
case-001-simple-basic-input/
case-002-complex-multi-requirement/
case-003-edge-empty-context/
case-004-long-detailed-input/
case-005-ambiguous-phrasing/
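Directory names in this style can be derived from a case number and a short scenario description. The helper below is a hypothetical sketch; it assumes a three-digit zero-padded number and lowercase ASCII slugs, matching the examples above.

```python
import re


def case_dir_name(number: int, description: str) -> str:
    """Build a name like case-003-edge-empty-context from a short description."""
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"case-{number:03d}-{slug}"


print(case_dir_name(3, "Edge: empty context"))  # prints "case-003-edge-empty-context"
```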
Content: Use the generation template from Step 2. Every case must include all required files and fields. Vary optional fields across cases — include them in some, omit in others. Use the field semantics to generate realistic content appropriate to each field's purpose.
Realism: Cases should look like something a real user would encounter. Don't generate lorem ipsum or obviously synthetic inputs. Use realistic names, scenarios, and domain language appropriate to the skill.
External-state placeholders: For fields marked [EXTERNAL: System] in the schema, use TODO_<SYSTEM>_<FIELD> as the value (e.g., project_key: "TODO_JIRA_PROJECT_KEY"). If you want to show a plausible real value, put it in a YAML comment (e.g., # replace with real key, such as MYPROJECT). The TODO_ prefix signals that this must be replaced with a real value from the target system before execution. List all placeholders in Step 7 so the user knows what needs manual review.
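The placeholder convention lends itself to a simple build-and-scan pair: one helper to produce TODO_<SYSTEM>_<FIELD> values, one to collect them for the final report. Both helper names are hypothetical, and the scan assumes placeholders are written in uppercase with underscores, as in the example above.

```python
import re

# Matches values such as TODO_JIRA_PROJECT_KEY.
TODO_PATTERN = re.compile(r"\bTODO_[A-Z0-9]+(?:_[A-Z0-9]+)*\b")


def external_placeholder(system: str, field: str) -> str:
    """Build a TODO_<SYSTEM>_<FIELD> placeholder value."""
    return f"TODO_{system}_{field}".upper()


def find_placeholders(case_files: dict[str, str]) -> list[tuple[str, str]]:
    """Return (file path, placeholder) pairs found in the given file contents."""
    found = []
    for path, text in sorted(case_files.items()):
        for placeholder in TODO_PATTERN.findall(text):
            found.append((path, placeholder))
    return found


files = {
    "case-001/input.yaml": 'project_key: "TODO_JIRA_PROJECT_KEY"  # replace with real key, such as MYPROJECT\n',
}
print(external_placeholder("jira", "project_key"))
print(find_placeholders(files))
```

The scan takes file contents as strings so it can be pointed at whatever the generation step produced; the plausible value in the YAML comment is not picked up because it lacks the TODO_ prefix.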
Answers, annotations, companion files, and reference outputs: See ${CLAUDE_SKILL_DIR}/references/case-generation.md for answers.yaml (interactive skills), annotations.yaml (outcome-aware judges with if conditions), companion files, and reference output guidance.
After generating, verify the cases:
- Does each case contain the files that dataset.schema describes?
- If execution.mode is case, verify that input.yaml contains all fields referenced by {field} placeholders in execution.arguments.
- If any judge has if conditions referencing annotations, verify that the generated cases cover both branches: at least one case where the condition is true and one where it's false. Warn if any conditional judge would never run (or always run) across the entire dataset.

ls <dataset_path>/case-001-*/
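The branch-coverage check for conditional judges can be sketched as follows. The harness's actual condition syntax is not shown here, so this example makes two assumptions: each case's annotations are available as a dictionary, and each judge's if condition is modeled as a Python predicate over that dictionary. The judge name and annotation key are hypothetical.

```python
from typing import Callable

Annotations = dict[str, object]
Condition = Callable[[Annotations], bool]


def branch_coverage_warnings(
    conditions: dict[str, Condition],
    cases: dict[str, Annotations],
) -> list[str]:
    """Warn about conditional judges whose condition never varies across cases."""
    warnings = []
    for judge, condition in conditions.items():
        outcomes = {bool(condition(annotations)) for annotations in cases.values()}
        if outcomes == {True}:
            warnings.append(f"{judge}: condition is true for every case (always runs)")
        elif outcomes == {False}:
            warnings.append(f"{judge}: condition is false for every case (never runs)")
    return warnings


cases = {
    "case-001": {"expects_error": True},
    "case-002": {"expects_error": True},
}
conditions = {"error_message_quality": lambda a: a.get("expects_error") is True}
for warning in branch_coverage_warnings(conditions, cases):
    print(warning)
```

Adding one case with expects_error set to False would clear the warning, which is the signal that both branches are now exercised.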
Tell the user what was created:
- Where the cases were written: <path>
- If TODO_ placeholder values were generated, list each one with which case it's in, which external system it references, and what kind of value is needed (e.g., "case-001/input.yaml TODO_JIRA_PROJECT_KEY: needs a real Jira project key from your test instance"). These MUST be replaced with real values before running /eval-run.
- Next steps (add --config <config> if a non-default config was used):
  - /eval-run --model <model> to test the skill against these cases
  - /eval-run --model <model> --gold to generate gold references from the best outputs
  - /eval-dataset --strategy expand --count 10 to add more cases later

Step 8 (--harbor): Emit Harbor task packages

If --harbor was passed, generate self-contained task packages for containerized execution. Run:
python3 ${CLAUDE_SKILL_DIR}/scripts/harbor.py \
--config <config> --out <dataset_dir>/../harbor-tasks --image <image> \
[--judge-model <model>] [--verifier-timeout 900] [--agent-timeout 3600]
See ${CLAUDE_SKILL_DIR}/references/case-generation.md for details.
- If dataset.schema says "input.yaml with a 'prompt' field", create input.yaml with a prompt field. Not input.json, not query.yaml.
- case-003-edge-empty-context is better than case-003. The name should indicate what scenario is being tested.

$ARGUMENTS
npx claudepluginhub opendatahub-io/agent-eval-harness --plugin agent-eval-harness