From agent-eval-harness
Automated skill improvement loop that runs evals, diagnoses judge failures from traces and rationale, edits SKILL.md to fix issues, re-runs, and checks for regressions. Use when improving a skill based on eval results without manual iteration.
How this skill is triggered — by the user, by Claude, or both
Slash command
/agent-eval-harness:eval-optimizeThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are an automated skill improver. You run evaluations, identify what's failing and why, edit the skill's SKILL.md to fix the issues, re-run to verify, and check for regressions. You iterate until judges pass or you hit the max iteration limit.
You are an automated skill improver. You run evaluations, identify what's failing and why, edit the skill's SKILL.md to fix the issues, re-run to verify, and check for regressions. You iterate until judges pass or you hit the max iteration limit.
The key difference from /eval-review: you act autonomously. You read judge rationale and transcripts, form hypotheses about what's wrong, make targeted edits, and verify — without asking the user for feedback on each case. The user sets the goal ("make this pass") and you work toward it.
| Argument | Required | Default | Description |
|---|---|---|---|
--config <path> | no | auto-discover | Path to eval config |
--model <model> | no | models.skill from eval.yaml | Model to use for eval runs (overrides config default) |
--max-iterations <N> | no | 3 | Stop after N improvement cycles |
--run-id <id> | no | auto-generated | Base run ID (iterations append -iter-N) |
--target-judge <name> | no | all judges | Focus on a specific failing judge |
If --config was explicitly provided, use that path directly. Otherwise, auto-discover:
python3 ${CLAUDE_SKILL_DIR}/../../scripts/discover.py
<config>/eval-analyze firstAfter selecting a config, read its skill field to set <eval-name> (used in $AGENT_EVAL_RUNS_DIR/<eval-name>/<id> paths below).
mkdir -p tmp
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py init tmp/optimize-config.yaml \
model=<model> max_iterations=<N> run_id=<id> target_judge=<judge>
If no recent eval results exist, run the eval suite first:
Use the Skill tool to invoke /eval-run --run-id <id>-iter-0 --config <config> [--model <model>]
Pass --model only if the user provided one — otherwise let /eval-run fall back to models.skill from eval.yaml. Pass the same model on every iteration for comparable results.
If results already exist (the user just ran /eval-run), skip this and use the existing run.
Read the results:
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>-iter-0/summary.yaml
If all judges pass, report success and exit — nothing to improve.
From summary.yaml, identify:
Also check for human feedback — these catch things judges miss:
test -f $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/review.yaml && echo "REVIEW_EXISTS" || echo "NO_REVIEW"
If review.yaml exists, read its feedback section (human feedback from /eval-review) and mlflow_feedback section (annotations pulled from MLflow UI). Human feedback is higher-signal than judge rationale — prioritize issues the user flagged.
If --target-judge was specified, focus only on that judge's failures.
Build a failure map, noting each judge's type (judge_type in results: builtin, check, llm, code) — the type determines what you can do about it:
agent_eval/judges/. Don't edit their code — suggest adjusting arguments: in eval.yaml (e.g., raising max_cost_usd for cost_budget)judge_name (type) → [case_id, case_id, ...] → rationale for each
human_review → [case_id, ...] → user comment for each
For each failure pattern, investigate why the skill produces bad output:
Read the skill's SKILL.md — locate it via eval.yaml's skill field:
python3 ${CLAUDE_SKILL_DIR}/../eval-analyze/scripts/find_skills.py --name <skill>
Read transcripts (if available) — transcripts can be very large, so delegate to an Agent. Check run_result.json for execution_mode: in case mode, each case has its own transcript at $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/cases/<case>/stdout.log; in batch mode, there's one at $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/stdout.log. Focus on the failing cases.
Include the specific judge failure in the prompt so the agent traces the causal chain rather than producing a generic summary:
Agent tool, subagent_type="Explore": "Read the transcript at <path>.
The judge '<judge_name>' (<judge_type>) failed this case with rationale:
'<rationale from summary.yaml>'
Find evidence explaining WHY this failure happened:
- Where in the transcript did the skill handle (or skip) the relevant task?
- What instructions from SKILL.md led to this behavior?
- Did the skill attempt the right thing but produce wrong output, or skip it entirely?
- If it tried multiple approaches, which one stuck and why?"
In batch mode, failures across cases may interact — ask the agent to check whether earlier cases' processing affected later ones.
Read failing case outputs — use an Explore agent to examine the actual output files. Include what the judge expected so the comparison is targeted:
Agent tool, subagent_type="Explore": "Read the outputs in $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/cases/<failing_case>/.
The judge '<judge_name>' failed with: '<rationale>'.
Compare the actual output against this expectation — what specifically is missing or wrong?"
Form hypotheses — connect the judge rationale + transcript evidence + output examination to specific parts of the SKILL.md. Be specific: "The judge says the output is missing acceptance criteria. The transcript shows the skill skipped Step 4 of the pipeline. Step 4 in SKILL.md says 'optionally add acceptance criteria' — the word 'optionally' is the problem."
Apply targeted fixes to the SKILL.md. For each edit:
Show each edit before applying. If the change is risky (could affect passing cases), note it.
Execution mode context: check execution.mode in eval.yaml. In case mode, each case runs in its own isolated workspace with all case files copied in — the skill receives case-specific arguments resolved from input.yaml. In batch mode, all cases are in one workspace via batch.yaml. Your edits must work for the configured mode.
Re-run eval with the baseline flag. If only a subset of cases failed, target them with --cases for faster verification. Once targeted cases pass, do a final full run (all cases) to check for regressions.
Consider --no-llm-judges when you only need to verify structural fixes — it skips LLM API calls and runs only deterministic judges (check, Python builtins), which is faster and cheaper.
# Targeted re-run (failing cases only)
Use the Skill tool to invoke /eval-run --run-id <id>-iter-<N> --cases <failing-case-id> [<failing-case-id> ...] --baseline <id>-iter-<N-1> --config <config> [--model <model>]
# Full re-run (all cases) — use for final verification
Use the Skill tool to invoke /eval-run --run-id <id>-iter-<N> --baseline <id>-iter-<N-1> --config <config> [--model <model>]
Read the new results:
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>-iter-<N>/summary.yaml
Check:
If the fix caused regressions (previously passing cases now fail):
/eval-review --run-id <id> for human input on the tricky cases.If failures remain and iterations < max:
If all judges pass:
If max iterations reached with failures remaining:
/eval-review --run-id <final-id> for human assessment of the remaining issues/eval-dataset --strategy expand if failures suggest missing test coverageIn all cases (include --config <config> if a non-default config was used):
/eval-mlflow --run-id <final-id> to log the optimization results to MLflow for tracking.agent_eval/judges/) are versioned and shared — never edit their code. If a builtin judge's behavior needs adjustment, suggest changing its arguments: in eval.yaml instead. For inline check or LLM prompt judges, suggest improvements to the user but don't edit eval.yaml yourself.$ARGUMENTS
npx claudepluginhub opendatahub-io/agent-eval-harness --plugin agent-eval-harnessAutonomously optimizes Claude Code skills by iteratively running them on test inputs, scoring against binary evals, reflecting on failures to mutate prompts, and archiving improvements. Invoke via /auto-optimize for skill enhancement or autoresearch.
Iterates on a skill by running it against a behavioral spec with and without proposed changes via parallel-eval harness, then ships the winning version.
Analyzes skill executions from conversation friction, file diffs, user feedback, diagnostics, and lessons to propose concrete improvements to SKILL.md files for efficiency.