eval

Harness Evaluation

Run the evaluation suite to measure harness quality.

Deterministic checks

python3 "${CLAUDE_PLUGIN_ROOT}/scripts/eval_runner.py" --eval-dir "${CLAUDE_PLUGIN_ROOT}/eval-tasks" --cwd . 2>&1 || echo "Eval runner not available"

Instructions

Review the deterministic check results above.
For each eval task that has llm_judge criteria, evaluate the criteria manually:
- Read the relevant files
- Assess whether each criterion is met
- Record evidence for your judgment
Compute the final score: 0.6 * deterministic + 0.4 * llm_judge
Present results using the Meta-Harness output format.
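
The weighting in the score step can be sketched in a few lines of Python. This is only an illustration: the function name `final_score` is hypothetical, and the assumption that both component scores are normalized to the range 0 to 1 is not stated in the instructions above.

```python
def final_score(deterministic: float, llm_judge: float) -> float:
    """Blend the two component scores as 0.6 * deterministic + 0.4 * llm_judge.

    Assumption (not from the source): both inputs are fractions in [0, 1].
    """
    for name, value in (("deterministic", deterministic), ("llm_judge", llm_judge)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return 0.6 * deterministic + 0.4 * llm_judge


if __name__ == "__main__":
    # All deterministic checks pass, half of the llm_judge criteria are met.
    print(round(final_score(1.0, 0.5), 2))
```

Because the deterministic checks carry the larger weight, a candidate that fails them cannot reach a high final score on llm_judge results alone.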

If a specific run_id was provided as $ARGUMENTS, evaluate that candidate's artifacts in runs/{run_id}/. Otherwise, evaluate the current harness state.
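
The choice of evaluation target might be expressed as follows. This is a sketch under assumptions: `resolve_eval_target` is a hypothetical helper, and treating "the current harness state" as the working directory is a guess based on the `--cwd .` flag in the command above.

```python
from pathlib import Path


def resolve_eval_target(arguments: str, cwd: Path = Path(".")) -> Path:
    """Return the directory to evaluate.

    A non-empty argument is treated as a run_id and mapped to runs/{run_id}/.
    An empty argument falls back to cwd (assumed to hold the current harness).
    """
    run_id = arguments.strip()
    if run_id:
        return cwd / "runs" / run_id
    return cwd


if __name__ == "__main__":
    print(resolve_eval_target("candidate-01"))  # hypothetical run_id
    print(resolve_eval_target(""))
```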
