Skill

tune-harness

Use when harness output has plateaued, the evaluator's grading disagrees with a human spot-check, the same criterion keeps failing across runs, or quality is drifting between iterations. Triggers on "/tune-harness", "tune the harness", "the evaluator is wrong", "the generator keeps missing X", "improve the harness". This is the trace-reading loop — the primary engineering work after instantiation.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/harness:tune-harness

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

This is where the real harness engineering happens. It is a discipline, not a magic trick. The pattern: read traces, find divergence, update one prompt, re-run, compare. Repeat.

SKILL.md

63 lines · ~856 tokens

Stats

Language: Python

Stars: 0

Maintenance: Good

Last Commit: Jun 1, 2026


tune-harness

This is where the real harness engineering happens. It is a discipline, not a magic trick. The pattern: read traces, find divergence, update one prompt, re-run, compare. Repeat.

When to invoke

- Output quality has plateaued across N iterations.
- A human spot-check disagrees with an evaluator verdict.
- The same criterion keeps failing across unrelated runs.
- Generator output reads as "AI slop" despite passing the rubric.
- A pivot fires repeatedly on the same criterion.
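
The "same criterion keeps failing" trigger can be checked mechanically once verdicts are on disk. The following is a minimal sketch, not part of the skill itself. It assumes each verdict is a dict with a hypothetical `criteria` mapping from criterion id to `"pass"` or `"fail"`; the real verdict schema is not specified here, so adapt the field names.

```python
from collections import Counter


def recurring_failures(verdicts, min_runs=3):
    """Return criterion ids that failed in at least `min_runs` verdicts.

    Assumes each verdict looks like {"criteria": {"c1": "pass", "c2": "fail"}}.
    This shape is hypothetical; adapt it to the real verdict files.
    """
    failures = Counter()
    for verdict in verdicts:
        for criterion, result in verdict.get("criteria", {}).items():
            if result == "fail":
                failures[criterion] += 1
    return sorted(c for c, n in failures.items() if n >= min_runs)


verdicts = [
    {"criteria": {"c1": "pass", "c2": "fail"}},
    {"criteria": {"c1": "fail", "c2": "fail"}},
    {"criteria": {"c1": "pass", "c2": "fail"}},
]
print(recurring_failures(verdicts))  # ['c2']
```

A non-empty result is a signal to start a tuning session, not a diagnosis: the trace still has to be read to decide which file to edit.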

Protocol

1. Open the most recent N traces under state/traces/ and the matching verdicts under state/verdicts/.
2. Find the first place your judgment diverges from the agent's. Quote the exact excerpt.
3. Decide which artifact to update:
   - The planner under-specified or over-specified → edit planner.md.
   - The generator missed a deliverable shape → edit generator.md.
   - The evaluator approved weak work → edit evaluator.md (most common).
   - The criterion itself was vague → edit rubric.json.
   - Calibration was missing a failure mode → add a .harness/calibration/ example.
4. Edit one file, not several. Otherwise you can't attribute a change in verdicts to a specific edit.
5. Re-run the same iteration (or a held-out test prompt) and compare verdicts before/after.
6. Log the change. Append a line to .harness/state/progress.jsonl:

   {"t":"...","event":"prompt_edit","file":"evaluator.md","reason":"approved iter 7 despite c2 obviously failing on line 14"}

7. Commit. Tune-cycle changes belong in version control with a one-line rationale.
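
The logging step can be wrapped in a small helper so every edit is recorded in the same shape. This is a sketch under stated assumptions: the function name `log_prompt_edit` is hypothetical, and because the source elides the format of the `"t"` field, the ISO 8601 UTC timestamp used here is a guess rather than the harness's actual convention.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_prompt_edit(file, reason, log_path=".harness/state/progress.jsonl"):
    """Append one prompt_edit event as a single JSON line and return it."""
    entry = {
        # ISO 8601 UTC is an assumption; the source does not show the "t" format.
        "t": datetime.now(timezone.utc).isoformat(),
        "event": "prompt_edit",
        "file": file,
        "reason": reason,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier entries intact: one event per line (JSON Lines).
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

A call such as `log_prompt_edit("evaluator.md", "approved iter 7 despite c2 obviously failing on line 14")` produces an entry with the same keys as the example line in the protocol. The `reason` string can be reused as the commit message.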

Heuristics for "which file"

| Symptom | First file to inspect |
| --- | --- |
| Iterations look good individually but drift across runs | `evaluator.md`: calibration not anchoring |
| Generator builds the wrong thing | `planner.md`: spec was too vague |
| Evaluator passes obviously broken artifacts | `evaluator.md`: hedge list or "skeptical" framing missing |
| Same criterion keeps failing without progress | `rubric.json`: criterion too coarse, or add a calibration example |
| Generator self-approves in handoff | `generator.md`: role-as-builder not enforced |
| Verdicts are vague ("could be better") | `evaluator.md`: granularity requirement missing |
| Two criteria contradict each other | `rubric.json`: merge or redefine |
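
If the triage step is scripted, the table above reduces to a lookup. The sketch below is illustrative only: the symptom keys are hypothetical short labels invented here, and only the file names come from the table.

```python
# Keys are hypothetical labels for the symptoms in the table above;
# the values are the "first file to inspect" entries from that table.
FIRST_FILE = {
    "drift_across_runs": "evaluator.md",
    "wrong_thing_built": "planner.md",
    "broken_artifacts_pass": "evaluator.md",
    "criterion_stuck": "rubric.json",
    "generator_self_approves": "generator.md",
    "vague_verdicts": "evaluator.md",
    "contradictory_criteria": "rubric.json",
}


def first_file_to_inspect(symptom):
    """Return the first file to inspect, or None for an unlisted symptom."""
    return FIRST_FILE.get(symptom)


print(first_file_to_inspect("vague_verdicts"))  # evaluator.md
```

The lookup only names a starting point. The trace excerpt still decides whether that file is the one to edit.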

What good looks like

A tuning session that:

- Quotes specific trace excerpts.
- Touches one file.
- Has a one-line rationale.
- Is followed by a re-run with a measurably different verdict.
- Is committed with that rationale in the message.
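
"Measurably different" is easier to judge with a per-criterion diff of the verdicts before and after the edit. This is a minimal sketch, assuming each verdict can be reduced to a hypothetical `{criterion_id: result}` mapping; the actual verdict format is not given here.

```python
def verdict_diff(before, after):
    """Return {criterion: (old, new)} for every criterion whose result changed.

    Both arguments are hypothetical {criterion_id: result} mappings. A
    criterion missing from one side is reported with None on that side.
    """
    changed = {}
    for criterion in sorted(set(before) | set(after)):
        old, new = before.get(criterion), after.get(criterion)
        if old != new:
            changed[criterion] = (old, new)
    return changed


before = {"c1": "pass", "c2": "pass", "c3": "fail"}
after = {"c1": "pass", "c2": "fail", "c3": "fail"}
print(verdict_diff(before, after))  # {'c2': ('pass', 'fail')}
```

An empty diff means the edit had no observable effect on the verdict, which is itself worth recording in the rationale.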

Anti-patterns

- Editing multiple files at once. You lose attribution. One change, one re-run, then iterate.
- "Just make the evaluator harsher." Vague. Edit specific phrases or add specific calibration examples.
- Tuning without re-running. The change might have made things worse. Always validate.
- Letting traces accumulate without reading them. They're the input to this skill. The skill is dead without them.
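
The first anti-pattern can be guarded against before committing by checking the list of changed paths. This is a sketch, not an existing harness feature. The file names come from the protocol, but treating any path with a `calibration` directory component as a calibration example is an assumption, and both function names are hypothetical.

```python
from pathlib import PurePosixPath

# File names come from the protocol. Counting anything under a
# calibration/ directory as a tunable artifact is an assumption.
PROMPT_FILES = {"planner.md", "generator.md", "evaluator.md", "rubric.json"}


def tuned_artifacts(changed_paths):
    """Return the changed paths that are harness artifacts eligible for tuning."""
    hits = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        if path.name in PROMPT_FILES or "calibration" in path.parts:
            hits.append(raw)
    return hits


def is_single_edit(changed_paths):
    """True when exactly one tunable artifact changed."""
    return len(tuned_artifacts(changed_paths)) == 1


# The progress log is bookkeeping, so it does not count as a second edit.
print(is_single_edit([".harness/evaluator.md", ".harness/state/progress.jsonl"]))  # True
print(is_single_edit(["evaluator.md", "rubric.json"]))  # False
```

The list of changed paths could come from version control, for example the output of `git diff --name-only`.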
