Skill

eval-optimize

Automated skill improvement loop that runs evals, diagnoses judge failures from traces and rationale, edits SKILL.md to fix issues, re-runs, and checks for regressions. Use when improving a skill based on eval results without manual iteration.

Python

GitHub Actions

automation

developer-tools

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/agent-eval-harness:eval-optimize

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

ReadWriteEditBashGlobGrepAgentSkillAskUserQuestion

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are an automated skill improver. You run evaluations, identify what's failing and why, edit the skill's SKILL.md to fix the issues, re-run to verify, and check for regressions. You iterate until judges pass or you hit the max iteration limit.

Supporting Files

scripts/agent_eval

SKILL.md

203 lines · ~2.9k tokens

Stats

LanguagePython

Stars28

Forks28

MaintenanceExcellent

Last CommitJun 18, 2026

Actions

View Source View Plugin View on GitHub View README

Step 0: Parse Arguments

Argument	Required	Default	Description
`--config <path>`	no	auto-discover	Path to eval config
`--model <model>`	no	`models.skill` from eval.yaml	Model to use for eval runs (overrides config default)
`--max-iterations <N>`	no	3	Stop after N improvement cycles
`--run-id <id>`	no	auto-generated	Base run ID (iterations append `-iter-N`)
`--target-judge <name>`	no	all judges	Focus on a specific failing judge

Config Discovery

If --config was explicitly provided, use that path directly. Otherwise, auto-discover:

python3 ${CLAUDE_SKILL_DIR}/../../scripts/discover.py

1 config found: auto-select it as <config>
Multiple configs found: present the list and ask the user which eval to optimize
No configs found: suggest running /eval-analyze first

After selecting a config, read its skill field to set <eval-name> (used in $AGENT_EVAL_RUNS_DIR/<eval-name>/<id> paths below).

mkdir -p tmp
python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py init tmp/optimize-config.yaml \
  model=<model> max_iterations=<N> run_id=<id> target_judge=<judge>

Step 1: Initial Eval Run

If no recent eval results exist, run the eval suite first:

Use the Skill tool to invoke /eval-run --run-id <id>-iter-0 --config <config> [--model <model>]

Pass --model only if the user provided one — otherwise let /eval-run fall back to models.skill from eval.yaml. Pass the same model on every iteration for comparable results.

If results already exist (the user just ran /eval-run), skip this and use the existing run.

Read the results:

python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>-iter-0/summary.yaml

If all judges pass, report success and exit — nothing to improve.

Step 2: Identify Failures

From summary.yaml, identify:

Which judges failed — and on which cases
Failure rationale — what did each judge say about why it failed?
Failure patterns — does one judge fail everywhere (systematic) or only on specific cases (input-dependent)?

Also check for human feedback — these catch things judges miss:

test -f $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/review.yaml && echo "REVIEW_EXISTS" || echo "NO_REVIEW"

If review.yaml exists, read its feedback section (human feedback from /eval-review) and mlflow_feedback section (annotations pulled from MLflow UI). Human feedback is higher-signal than judge rationale — prioritize issues the user flagged.

If --target-judge was specified, focus only on that judge's failures.

Build a failure map, noting each judge's type (judge_type in results: builtin, check, llm, code) — the type determines what you can do about it:

builtin: versioned, shared judges from agent_eval/judges/. Don't edit their code — suggest adjusting arguments: in eval.yaml (e.g., raising max_cost_usd for cost_budget)
check: inline Python in eval.yaml. Read the snippet to understand exactly what's checked — failures are deterministic and reproducible
llm: LLM prompt judges. Read the prompt to understand scoring criteria — the failure may be in the skill output or in an overly strict prompt
code: external Python module. Read the function to understand the validation logic

judge_name (type) → [case_id, case_id, ...] → rationale for each
human_review → [case_id, ...] → user comment for each

Step 3: Analyze Root Causes

For each failure pattern, investigate why the skill produces bad output:

Read the skill's SKILL.md — locate it via eval.yaml's skill field:

python3 ${CLAUDE_SKILL_DIR}/../eval-analyze/scripts/find_skills.py --name <skill>

Read transcripts (if available) — transcripts can be very large, so delegate to an Agent. Check run_result.json for execution_mode: in case mode, each case has its own transcript at $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/cases/<case>/stdout.log; in batch mode, there's one at $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/stdout.log. Focus on the failing cases.

Include the specific judge failure in the prompt so the agent traces the causal chain rather than producing a generic summary:
```
Agent tool, subagent_type="Explore": "Read the transcript at <path>.
The judge '<judge_name>' (<judge_type>) failed this case with rationale:
'<rationale from summary.yaml>'

Find evidence explaining WHY this failure happened:
- Where in the transcript did the skill handle (or skip) the relevant task?
- What instructions from SKILL.md led to this behavior?
- Did the skill attempt the right thing but produce wrong output, or skip it entirely?
- If it tried multiple approaches, which one stuck and why?"
```
In batch mode, failures across cases may interact — ask the agent to check whether earlier cases' processing affected later ones.

Read failing case outputs — use an Explore agent to examine the actual output files. Include what the judge expected so the comparison is targeted:

Agent tool, subagent_type="Explore": "Read the outputs in $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>/cases/<failing_case>/.
The judge '<judge_name>' failed with: '<rationale>'.
Compare the actual output against this expectation — what specifically is missing or wrong?"

Form hypotheses — connect the judge rationale + transcript evidence + output examination to specific parts of the SKILL.md. Be specific: "The judge says the output is missing acceptance criteria. The transcript shows the skill skipped Step 4 of the pipeline. Step 4 in SKILL.md says 'optionally add acceptance criteria' — the word 'optionally' is the problem."

Step 4: Edit the Skill

Apply targeted fixes to the SKILL.md. For each edit:

Ground it in evidence — cite the judge name, failing cases, and transcript evidence
Be surgical — change the minimum needed. Don't rewrite sections that are working.
Explain the why — use the Skill Creator's principle: explain to the model why the change matters, rather than adding rigid MUSTs
Don't overfit — if only 1 of 20 cases fails, the fix should be general enough to help without breaking the other 19

Show each edit before applying. If the change is risky (could affect passing cases), note it.

Execution mode context: check execution.mode in eval.yaml. In case mode, each case runs in its own isolated workspace with all case files copied in — the skill receives case-specific arguments resolved from input.yaml. In batch mode, all cases are in one workspace via batch.yaml. Your edits must work for the configured mode.

Step 5: Re-Run and Verify

Re-run eval with the baseline flag. If only a subset of cases failed, target them with --cases for faster verification. Once targeted cases pass, do a final full run (all cases) to check for regressions.

Consider --no-llm-judges when you only need to verify structural fixes — it skips LLM API calls and runs only deterministic judges (check, Python builtins), which is faster and cheaper.

# Targeted re-run (failing cases only)
Use the Skill tool to invoke /eval-run --run-id <id>-iter-<N> --cases <failing-case-id> [<failing-case-id> ...] --baseline <id>-iter-<N-1> --config <config> [--model <model>]

# Full re-run (all cases) — use for final verification
Use the Skill tool to invoke /eval-run --run-id <id>-iter-<N> --baseline <id>-iter-<N-1> --config <config> [--model <model>]

Read the new results:

python3 ${CLAUDE_SKILL_DIR}/scripts/agent_eval/state.py read $AGENT_EVAL_RUNS_DIR/<eval-name>/<id>-iter-<N>/summary.yaml

Check:

Fixed: did the targeted failures pass now?
Regressions: did any previously passing cases/judges now fail?
Score improvement: did aggregate scores improve?

Step 6: Handle Regressions

If the fix caused regressions (previously passing cases now fail):

Assess severity — is the regression worse than the original failure?
If minor — the fix is still a net positive, continue
If major — revert the edit and try a different approach. The skill's instructions may need a different framing rather than a different rule.
If stuck — report to the user what you tried and why it didn't work. Suggest /eval-review --run-id <id> for human input on the tricky cases.

Step 7: Iterate or Report

If failures remain and iterations < max:

Go back to Step 2 with the new results
Each iteration should target different failures or try different approaches for persistent ones

If all judges pass:

Report success: which edits fixed which failures, how many iterations it took
Show the final summary.yaml scores

If max iterations reached with failures remaining:

Report what was fixed and what couldn't be fixed
For persistent failures, explain what you tried and why it didn't work
Suggest /eval-review --run-id <final-id> for human assessment of the remaining issues
Suggest /eval-dataset --strategy expand if failures suggest missing test coverage

In all cases (include --config <config> if a non-default config was used):

Suggest /eval-mlflow --run-id <final-id> to log the optimization results to MLflow for tracking.

Rules

Never make broad, generic changes — every edit must be grounded in a specific failure with evidence from judges and transcripts
Check for regressions after every edit — a fix that breaks other cases is not a fix
Stop after max iterations — don't loop forever. Report what couldn't be fixed.
Don't modify test cases or judges — the eval harness is the ground truth. If you think a judge is wrong, report it but don't change it. Builtin judges (from agent_eval/judges/) are versioned and shared — never edit their code. If a builtin judge's behavior needs adjustment, suggest changing its arguments: in eval.yaml instead. For inline check or LLM prompt judges, suggest improvements to the user but don't edit eval.yaml yourself.
Don't modify eval.yaml — your job is to improve the skill, not the evaluation config. If judges or arguments need updating, suggest it to the user.
Try different approaches — if the same type of edit fails twice, try a fundamentally different framing. Explain why instead of adding more rules.

$ARGUMENTS

eval-optimize

Popularity

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

eval-optimize

Popularity

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

Step 0: Parse Arguments

Config Discovery

Step 1: Initial Eval Run

Step 2: Identify Failures

Step 3: Analyze Root Causes

Step 4: Edit the Skill

Step 5: Re-Run and Verify

Step 6: Handle Regressions

Step 7: Iterate or Report

Rules

Similar Skills

Step 0: Parse Arguments

Config Discovery

Step 1: Initial Eval Run

Step 2: Identify Failures

Step 3: Analyze Root Causes

Step 4: Edit the Skill

Step 5: Re-Run and Verify

Step 6: Handle Regressions

Step 7: Iterate or Report

Rules

Similar Skills