From agent-eval-harness
Executes skill evaluations against test cases, scores outputs with judges, and reports results. Use when testing a skill, benchmarking, detecting regressions, or verifying changes.
How this skill is triggered — by the user, by Claude, or both
Slash command
/agent-eval-harness:eval-runThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are an evaluation executor. You run a skill against test cases, score the outputs with judges, and report results. You orchestrate by calling scripts — never duplicate their work.
prompts/analyze-results.mdprompts/comparison-judge.mdreferences/data-pipeline.mdreferences/execution-monitoring.mdreferences/tool-interception.mdscripts/agent_evalscripts/collect.pyscripts/execute.pyscripts/preflight.pyscripts/report.pyscripts/score.pyscripts/tools.pyscripts/workspace.pyscripts/workspace_files.pyYou are an evaluation executor. You run a skill against test cases, score the outputs with judges, and report results. You orchestrate by calling scripts — never duplicate their work.
For the full data flow (dataset → workspace → execution → collection → scoring), see ${CLAUDE_SKILL_DIR}/references/data-pipeline.md. For tool interception mechanics, see ${CLAUDE_SKILL_DIR}/references/tool-interception.md.
Parse $ARGUMENTS:
| Argument | Required | Default | Description |
|---|---|---|---|
--config <path> | no | auto-discover | Path to eval config |
--model <model> | no | models.skill from config | Skill model. Required if models.skill is unset in eval.yaml. |
--subagent-model <model> | no | models.subagent → falls back to skill model | Model for subagents (e.g., claude-sonnet-4-6 while main is claude-opus-4-7) |
--skill <name> | no | from config | Override the skill to test |
--run-id <id> | no | YYYY-MM-DD-<model> | Identifier for this run |
--cases <id> [<id> ...] | no | all cases | Exact case IDs to run |
--baseline <run-id> | no | — | Previous run to compare against |
--no-llm-judges | no | false | Skip LLM judges (prompt, prompt_file, LLM builtins). Run deterministic judges (check, Python builtins, external code). |
--gold | no | false | Save outputs as gold references after run |
--effort <level> | no | runner.effort from config | Claude Code reasoning effort (Claude Code only; ignored by other runners) |
--runner <type> | no | local | local (default Steps 1–8) or harbor (containerized — skips to Harbor runner section) |
--env <name> | no | kubernetes | Harbor execution environment: podman, kubernetes, openshift (only with --runner harbor) |
If --runner harbor: after config discovery, skip to the Harbor runner section below. Steps 2–6 are replaced by one run.py call.
If --config was explicitly provided, use that path directly.
Otherwise, auto-discover eval configs:
python3 ${CLAUDE_SKILL_DIR}/../../scripts/discover.py
<config>After selecting a config, read its skill field to set <eval-name> (used in $AGENT_EVAL_RUNS_DIR/<eval-name>/<id> paths below).
Check if the resolved config file exists:
test -f <config> && echo "CONFIG_EXISTS" || echo "NO_CONFIG"
If config is missing: invoke eval-analyze to bootstrap:
Use the Skill tool to invoke /eval-analyze [--skill <skill>]
Once config exists, read it to understand the eval setup — the skill under test, runner, dataset, outputs, judges, models, and any tool interception. The downstream scripts read the same config; you don't need to pass these fields through, just confirm they're present and warn the user about anything missing or surprising.
If inputs.tools has entries but the skill uses AskUserQuestion or external APIs, verify the handlers cover those tools. Warn the user if a tool the skill uses isn't intercepted — headless execution may hang.
Persist parsed flags:
mkdir -p tmp ${AGENT_EVAL_RUNS_DIR:-eval/runs}
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py init tmp/eval-config.yaml \
model=<model> skill=<skill> run_id=<id> baseline=<baseline> \
gold=<true/false> no_llm_judges=<true/false>
Read dataset.path from eval.yaml. Verify the directory exists and contains at least one case subdirectory:
ls <dataset_path>/ | head -20
If --cases was specified, pass the IDs to workspace.py as --cases <id> <id> ....
If no cases found, stop and tell the user clearly:
/eval-dataset to generate test cases, or /eval-analyze --update to reconfigure the dataset pathBefore setting up the workspace, verify the project's artifact directories are clean. Skills write to the project directory (not the workspace), so stale artifacts from previous runs contaminate results — wrong IDs, stale run reports, inflated file counts.
python3 ${CLAUDE_SKILL_DIR}/scripts/preflight.py \
--config <config> \
[--run-id <id>] \
[--baseline <baseline-id>]
The script checks tmp/ state files, whether $AGENT_EVAL_RUNS_DIR/<eval-name>/<id> already has results from a previous run, and (if --baseline is given) that the baseline run-id exists under the same eval-name directory.
CLEAN: proceed to workspace setup.DIRTY: report the findings to the user and ask what to do:
preflight.py --clean --force to delete all stale artifacts, then proceed.2026-04-11-opus-v2) and re-check. This avoids overwriting previous run results but still requires cleaning project artifacts — re-run preflight with --clean and the new run-id.MISSING_BASELINE (exit 2): the baseline run-id wasn't found at $AGENT_EVAL_RUNS_DIR/<eval-name>/<baseline>. The script lists nearby run-ids — confirm the correct one with the user (typo, or did they mean a different date/variant?) before retrying.Create an isolated workspace with the test cases and output directories:
python3 ${CLAUDE_SKILL_DIR}/scripts/workspace.py \
--config <config> \
--run-id <id> \
[--cases <id> [<id> ...]]
The script prints WORKSPACE: <path>, CASES: <count>, BATCH: <path>. Report these to the user. If inputs.tools is configured, it also prints HOOKS: N tool interceptors configured.
If the case count is 0, stop — the filter matched nothing.
inputs.tools configured)If eval.yaml has inputs.tools entries, this step is mandatory. workspace.py emits a skeleton in tool_handlers.yaml; you must resolve each handler's prompt into concrete runtime checks (input_filters, env_checks, case_overrides). Do not skip this even when the eval.yaml is unchanged — the workspace is created fresh each time.
Read ${CLAUDE_SKILL_DIR}/references/tool-interception.md for the full format, field reference, and resolution examples. Then read <workspace>/tool_handlers.yaml and for each handler:
input_filters for Bash, env_checks for services, case_overrides for deterministic answersinput_filters — without them the handler is non-functionalWrite the resolved handlers back to tool_handlers.yaml.
Run the skill headlessly against test cases. In case mode (default), execute.py runs the skill once per case with case-specific arguments and workspace — each case gets its own stdout.log and subagent transcripts. In batch mode, all cases run in a single invocation via batch.yaml.
If hooks: is configured in eval.yaml, execute.py automatically runs lifecycle hooks at the appropriate points: before_all before any case executes, before_each/after_each around each case execution, and after_all after all cases complete (guaranteed, even on failure). Hook logs are written to $AGENT_EVAL_RUNS_DIR/<id>/hooks/.
python3 ${CLAUDE_SKILL_DIR}/scripts/execute.py \
--config <config> \
--workspace <workspace_path> \
--skill <skill_name> \
--skill-args "<skill arguments>" \
--model <model> \
--output $AGENT_EVAL_RUNS_DIR/<eval-name>/<id> \
--run-id <id> \
[--agent <runner>] \
[--subagent-model <model>] \
[--mlflow-experiment <name>] \
[--effort <level>] \
[--parallelism <n>]
Launch with run_in_background: true (no pipes). Monitor via tail -20 <output_file>. After completion, check run_result.json — if exit_code is non-zero, report the failure and stop. See ${CLAUDE_SKILL_DIR}/references/execution-monitoring.md for CLI flag fallbacks, monitoring patterns, and problem detection.
Distribute workspace outputs into per-case directories so judges can score each case independently:
python3 ${CLAUDE_SKILL_DIR}/scripts/collect.py \
--config <config> \
--workspace <workspace_path> \
--output $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>
Read the collection summary (JSON file — read it with cat or jq, not state.py):
cat $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/collection.json
Report per-case counts. If any case has 0 artifacts, warn — the skill may not have produced output for that case.
Run all configured judges against the collected outputs. Four judge types are supported: builtin (reusable from the harness library), inline check (Python snippets), LLM prompt/prompt_file (Jinja2 rendered), and external module/function. All support optional arguments: for parameterization.
If --no-llm-judges was specified, skip judges that make LLM API calls (prompt, prompt_file, and LLM builtins). Run deterministic judges only (check, Python builtins, external code).
python3 ${CLAUDE_SKILL_DIR}/scripts/score.py judges \
--run-id <id> \
--config <config> \
--workspace <workspace_path> \
--model <model>
If hooks.before_scoring is configured in eval.yaml, score.py runs those hooks before judge execution. Pass --workspace and --model so hook environment variables are populated.
Judges receive a record dict with:
outputs["files"], outputs["<dir>_content"]outputs["exit_code"], outputs["duration_s"], outputs["cost_usd"], outputs["num_turns"] (if traces.metrics enabled)outputs["tool_calls"] (if outputs has tool: entries)outputs["stdout"], outputs["stderr"] (if traces.stdout/stderr enabled)outputs["annotations"] — parsed annotations.yaml from the dataset case directory (always present, empty dict if no file). Use for outcome-aware scoring where expected results depend on the test case.This means judges can check output quality, execution efficiency, AND expected outcomes from annotations.
If --baseline was specified, also run pairwise comparison:
python3 ${CLAUDE_SKILL_DIR}/scripts/score.py pairwise \
--run-id <id> \
--baseline <baseline_id> \
--config <config>
Read the full results:
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/summary.yaml
summary.yaml has three sections: judges (per-judge mean and pass_rate), per_case (per-case {value, rationale} per judge), and pairwise (only if --baseline was used: run_a, run_b, wins_a, wins_b, ties).
Read the summary and analyze the results. Read ${CLAUDE_SKILL_DIR}/prompts/analyze-results.md for the full analysis framework and output format. Lead with Recommendation (self-contained — many readers only read this section). Save the analysis to $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/analysis.md with YAML frontmatter (agent, model, date) — see the prompt file for the template.
Generate HTML report:
python3 ${CLAUDE_SKILL_DIR}/scripts/report.py \
--run-id <id> \
--config <config> \
[--baseline <baseline_id>] \
--open
Tell the user the report is at $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/report.html.
If --gold flag: After scoring, copy collected artifacts to dataset case dirs as reference files. Report which cases were saved.
Suggest next steps (include --config <config> if a non-default config was used):
/eval-review --run-id <id> for interactive human review of the results/eval-optimize --model <model> for automated improvement based on failures/eval-mlflow --run-id <id> to log results to MLflowIf mlflow.experiment is configured in eval.yaml:
Use the Skill tool to invoke /eval-mlflow --action log-results --run-id <id> --config <config>
--runner harbor)When --runner harbor is specified, skip Steps 2–6 and call run.py instead —
it handles task generation (or reuse), harbor run, per-case judging (in-container),
result mapping, and report generation in one call:
PYTHONPATH="$(pwd)" python3 -m agent_eval.harbor.run \
--config <config> --model <model> \
--output $AGENT_EVAL_RUNS_DIR/<eval-name>/<run-id> \
--tasks-dir <tasks-dir> --jobs-dir <tmp-jobs> \
[--image <image>] [--agent <agent>] [--n-concurrent N] \
[--env kubernetes]
Cluster-specific config (namespace, credentials secret) is read from a
.env file in the project root. Create it with AGENT_EVAL_K8S_NAMESPACE
and AGENT_EVAL_K8S_CREDENTIALS_SECRET — run.py loads it automatically.
Tasks come from /eval-dataset (which emits Harbor task packages via
scripts/harbor.py). If --tasks-dir already has them, run.py reuses them; if
empty, it generates on the fly (needs --image). The output is a standard
run_result.json + summary.yaml + report.html — then continue with Step 6
pairwise (if --baseline) and Step 8 (MLflow) as normal. See
deploy/harbor/README.md for image build, credentials, and environment setup.
--runner evalhub)When --runner evalhub is specified, skip Steps 2–8 and call the EvalHub
runner instead — it creates ConfigMaps, submits a job to EvalHub, polls for
completion, and maps the results back:
python3 -m agent_eval.evalhub.runner \
--config <config> --model <model> \
--output $AGENT_EVAL_RUNS_DIR/<eval-name>/<run-id> \
[--evalhub-url <url>] [--namespace <ns>] [--project-dir <path>]
The runner creates K8s ConfigMaps for the eval config and project resources
(via k8s_resources), submits the job to EvalHub (which creates a Job pod
running the adapter in-process), polls until completion, and maps the results
into the standard summary.yaml + report.html. No image rebuild needed —
ConfigMaps carry the project-specific content.
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py so flags and results survive context compression.$ARGUMENTS
npx claudepluginhub opendatahub-io/agent-eval-harness --plugin agent-eval-harnessGenerate eval.yaml configuration for the agent eval harness by analyzing a skill's SKILL.md, sub-skills, scripts, and test cases. Useful for setting up evaluation, testing, quality checks, and benchmarking skills.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Tests and benchmarks Claude Code skills empirically via evaluation-driven development. Compares skill vs baseline performance using pass rates, timing, token metrics in quick workflow or 7-phase full pipeline.