Skill

harness:certify

Verifies the Evolver agent's score stability by running 3 evaluations on the current code, computing mean ± std from the 3 combined_scores, and reporting a STABLE, MARGINAL, or UNSTABLE verdict.

Python

Bash

testing

ai-ml

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/harness-evolver:certify

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read, Bash, Glob

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Verify score stability by running evaluation multiple times and reporting statistical confidence.

SKILL.md

64 lines · ~500 tokens

Stats

Language: Python

Stars: 21

Forks: 2

Maintenance: Excellent

Last Commit: Apr 18, 2026

/harness:certify

Verify score stability by running evaluation multiple times and reporting statistical confidence.

Resolve Tool Path

TOOLS="${EVOLVER_TOOLS:-$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")}"
EVOLVER_PY="${EVOLVER_PY:-$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")}"
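For readers who find the nested shell fallbacks hard to scan, the same resolution order can be sketched in Python. This is only an illustration of the two lines above, not part of the skill: the function names `resolve_tools` and `resolve_python` are hypothetical, and the `env`, `cwd`, and `home` parameters exist only so the logic can be exercised without touching the real environment.

```python
import os
from pathlib import Path


def resolve_tools(env=None, cwd=None, home=None):
    """Mirror the TOOLS fallback: env override, project-local dir, home dir."""
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    home = Path.home() if home is None else Path(home)
    override = env.get("EVOLVER_TOOLS")
    if override:  # ${VAR:-default} treats unset and empty the same way
        return override
    if (cwd / ".evolver" / "tools").is_dir():
        return ".evolver/tools"
    return str(home / ".evolver" / "tools")


def resolve_python(env=None, home=None):
    """Mirror the EVOLVER_PY fallback: env override, venv python, python3."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    override = env.get("EVOLVER_PY")
    if override:
        return override
    venv_python = home / ".evolver" / "venv" / "bin" / "python"
    if venv_python.is_file():
        return str(venv_python)
    return "python3"
```

In both cases an explicitly set environment variable wins, so a project can pin its own tools or interpreter without editing the skill.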

What To Do

Read .evolver.json to get the best experiment and dataset.

Run evaluation 3 times on the current code (not a worktree — the best code is already merged):

for i in 1 2 3; do
    $EVOLVER_PY $TOOLS/run_eval.py \
        --config .evolver.json \
        --worktree-path "." \
        --experiment-prefix "certify-run-$i"
done

After all 3 runs complete, read results and compute statistics:

$EVOLVER_PY $TOOLS/read_results.py --experiments "certify-run-1-{suffix},certify-run-2-{suffix},certify-run-3-{suffix}" --config .evolver.json --format summary

Calculate mean and standard deviation from the 3 combined_scores.
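The calculation can be sketched with Python's standard library. The skill does not say whether the sample or the population standard deviation is intended; this sketch assumes the sample form (`statistics.stdev`, dividing by n - 1). The function name `summarize` and the scores below are placeholders for illustration, not real results.

```python
import statistics


def summarize(scores):
    """Return mean, std, min, and max for a list of combined_scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate spread")
    return {
        "mean": statistics.fmean(scores),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "std": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Placeholder scores for illustration only.
stats = summarize([0.80, 0.82, 0.84])
print(f"Mean: {stats['mean']:.3f}  Std: {stats['std']:.3f}")
print(f"Range: {stats['min']:.3f} to {stats['max']:.3f}")
```

With only 3 runs the choice matters: the sample standard deviation is about 1.22 times the population value (`statistics.pstdev`), which can move a borderline result across a verdict threshold.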

Report

CERTIFICATION REPORT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Runs:  3
Mean:  {mean:.3f}
Std:   {std:.3f}
Range: {min:.3f} — {max:.3f}

Verdict: {STABLE|MARGINAL|UNSTABLE}

STABLE (std < 0.05): Score is reliable. The agent performs consistently.

MARGINAL (0.05 <= std < 0.10): Score varies moderately. Consider adding rubrics to reduce judge variance.

UNSTABLE (std >= 0.10): Score is unreliable. The LLM judge interprets criteria differently across runs. Add few-shot examples or tighter rubrics.
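The three thresholds map onto a small function. This is a minimal sketch; the name `verdict` is hypothetical, while the cut-offs are the ones stated above.

```python
def verdict(std):
    """Map a standard deviation of combined_scores to a certification verdict."""
    if std < 0.05:
        return "STABLE"
    if std < 0.10:
        return "MARGINAL"
    return "UNSTABLE"


# Boundary values fall into the higher band, matching the >= in the rules.
for value in (0.02, 0.05, 0.10):
    print(f"std={value:.2f} -> {verdict(value)}")
```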

After Certification

If STABLE: suggest /harness:deploy to finalize. If MARGINAL: suggest adding rubrics to reduce judge variance. If UNSTABLE: suggest adding rubrics to dataset examples, or running /harness:evolve with heavy mode for more thorough evaluation.
