Skill

evaluator-discovery

Build a trustworthy, anchored evaluator for an agent whose output you want to optimize, when no reliable metric/judge/gold exists yet. It learns exactly what the agent produces, gives the evaluator the SAME inputs the agent saw (building + testing those surfaces), drafts the evaluator as an EXTERNAL markdown rubric, has a DIVERSE expert panel (incl. an adversary) harden it, and calibrates it against a real anchor while hill-climbing self-consistency — then registers it so gepa can optimize the agent against it. Use whenever the user wants to optimize / improve / "make better" an agent's prompt or output and the way to MEASURE quality is missing or weak — e.g. "set up a judge for my extraction agent", "how do I score whether my summarizer is good", "optimize this prompt but I have no gold labels", "build an evaluator for <agent>", or BEFORE any `gepa run` whose metric is absent or untrustworthy. NOT for laying out the repo (that's gepa-init) or driving the optimization loop (that's gepa-run); this is only the build-the-metric step. The evaluator is the bottleneck on every optimization — do not skip building one well.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/gepa-anywhere:evaluator-discovery

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

A prompt optimizer hill-climbs whatever you measure, so a weak evaluator yields

Supporting Files

references/calibration.md

scripts/calibrate.py

SKILL.md

203 lines · ~3.3k tokens

Stats

Language: Python

Stars: 0

Maintenance: Excellent

Last Commit: Jun 9, 2026

evaluator-discovery

A prompt optimizer hill-climbs whatever you measure, so a weak evaluator yields confident garbage (Goodhart). This skill builds the evaluator first, well: a calibrated, input-grounded, externally-stored rubric that gepa run can then optimize an agent against. Run it before optimizing any agent whose quality signal is missing or shaky.

Deliverable for an agent <name>, under .gepa/agents/<name>/: target.md (what the output is), surfaces/ (the inputs the agent saw + a parity test), evaluator.md (the external rubric — never inlined), anchors.jsonl (the correctness anchor), config.yaml (its metric points the judge at evaluator.md), and calibration/report.md (with an evidence-gated trust level). After this, gepa run --config .gepa/agents/<name>/config.yaml optimizes the agent against the calibrated evaluator.
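
The deliverable list above can be checked mechanically before handing off. The following is a minimal sketch, assuming the file names given in the text; the helper `missing_deliverables` is a hypothetical name and is not part of gepa.

```python
from pathlib import Path

# Deliverables named in the text, relative to .gepa/agents/<name>/.
REQUIRED_FILES = (
    "target.md",
    "evaluator.md",
    "anchors.jsonl",
    "config.yaml",
    "calibration/report.md",
)
REQUIRED_DIRS = ("surfaces",)


def missing_deliverables(agent_dir: Path) -> list[str]:
    """Return the deliverables that are absent under agent_dir (hypothetical helper)."""
    missing = [f for f in REQUIRED_FILES if not (agent_dir / f).is_file()]
    missing += [d + "/" for d in REQUIRED_DIRS if not (agent_dir / d).is_dir()]
    return missing
```

An empty list only means the files exist; it says nothing about whether the evaluator is calibrated.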

Run gepa-init first if the repo has no .gepa/agents/ layout. Work the six stages in order — each depends on the last. Announce the stage you're on.

Honest contract (read first): this does NOT evaluate from nothing. It minimizes ground truth to a small anchor and reports how much it trusts the result. Two pieces, evaluator.md and surfaces/, are wired by naming a literal path inside the judge prompt (there is no {evaluator}/{surfaces} token today; see Stage 3), and Stage 5's calibration runs a real but explicit second gepa config (see references/calibration.md). Don't imply more automation than exists.

Stage 1 — Identify the target output

You can't evaluate what you can't name. Read the agent's prompt/spec/code and sample 3–5 real outputs (the prompt's intent and the actual output often diverge). Capture in target.md: the purpose, the output contract (exact schema, and what "one output" is), the success criteria (in the user's terms), and the known failure modes (each becomes a rubric line-item). Also list the agent's input surfaces and a hidden-conditioning inventory (chain-of-thought, tool calls/retrieval, sampling randomness) — Stage 2 needs both. If purpose/criteria are genuinely ambiguous and the samples don't settle it, ask the user the one or two questions that actually disambiguate.

Stage 2 — Input parity (the evaluator must see what the agent saw)

This is the stage people skip, and skipping it silently breaks everything. A judge without the agent's inputs hallucinates grounding. The evaluator must read the same inputs the agent saw for each output. There is no {surfaces} token in the judge prompt today, so a surface reaches the judge only because its path is named literally in the rubric prompt (Stage 3); keep that in mind while building them.

Branch by where the input comes from:

Case A — the input is the dataset input file (no runtime fetch). Parity is automatic: the judge reads the same input path the rollout received. The "test" is a one-line assertion that the path the judge will read equals the example's input. Done.
Case B — the agent fetched/retrieved/computed context at runtime. Build a capture that writes that bundle to surfaces/<id>/… from the same invocation that produced the output being judged (bind it by hash/timestamp — a re-fetch can differ from what the agent saw, and a parity test on a re-fetch passes vacuously). Parity test = byte-equal diff (or a documented, named projection) between the captured bundle and the judge-time loader, for ≥1 example; exit non-zero on mismatch; it lives at surfaces/parity_test.sh.

For each item in the hidden-conditioning inventory, either expose it as a surface or record in target.md that the rubric cannot judge that dimension (e.g. faithfulness-to-reasoning, tool-correctness). A green parity test that ignores un-exposed conditioning is false confidence: fail closed when an enumerated surface is marked captured but isn't bound to the judged output.
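
A minimal sketch of the Case B parity check, assuming the capture wrote its bundle to one directory and the judge-time loader materializes its view in another. The names `bundle_digest`, `check_parity`, and `main` are hypothetical. The skill places the real test at surfaces/parity_test.sh, which could call a script like this and propagate its exit code.

```python
import hashlib
import sys
from pathlib import Path


def bundle_digest(root: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_parity(captured: Path, judge_view: Path) -> list[str]:
    """Return human-readable mismatches; an empty list means byte-equal."""
    a, b = bundle_digest(captured), bundle_digest(judge_view)
    problems = []
    if not a:
        # Fail closed: an empty capture would otherwise pass vacuously.
        problems.append(f"no captured files under {captured}")
    for name in sorted(a.keys() | b.keys()):
        if name not in a:
            problems.append(f"judge sees extra file: {name}")
        elif name not in b:
            problems.append(f"judge is missing file: {name}")
        elif a[name] != b[name]:
            problems.append(f"content differs: {name}")
    return problems


def main(argv: list[str]) -> int:
    """Exit code for a wrapper script: 0 on parity, 1 on any mismatch."""
    issues = check_parity(Path(argv[0]), Path(argv[1]))
    for issue in issues:
        print(issue, file=sys.stderr)
    return 1 if issues else 0
```

Hashing the captured bundle also gives a natural key for binding a surface to the output it produced, which is what keeps a later re-fetch from passing the test by accident.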

Stage 3 — Draft the evaluator as an external markdown rubric

Write .gepa/agents/<name>/evaluator.md. It must be an external file, because (a) it is the artifact you hill-climb in Stage 5 and (b) the panel and humans must read+edit it directly.

Wiring (concrete — there is no rubric/surfaces token today). Set the agent's metric to a subagent whose prompt names the paths literally, so the judge subagent reads them with its own file tools:

metric:
  mode: subagent
  subagent:
    prompt: |
      Read the rubric at .gepa/agents/<name>/evaluator.md and the input surfaces in
      .gepa/agents/<name>/surfaces/. Score the agent outputs in {outputs} strictly per the
      rubric, grounding every judgement in those surfaces. Write {"score": <0..1>,
      "feedback": "<specific errors, naming the surface each is grounded in>"}.

Only {outputs} and {gold} are rendered; evaluator.md and surfaces/ reach the judge purely via these literal paths.
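
Because the wiring depends entirely on literal paths, a small lint can catch a prompt that silently drops one. This is a sketch under the assumptions stated in the text (only {outputs} and {gold} are rendered); `lint_judge_prompt` and its regular expression are illustrative, not a gepa feature.

```python
import re

# Per the text, these are the only tokens the judge prompt renders today.
RENDERED_TOKENS = {"outputs", "gold"}
# Matches {identifier} but not JSON-like braces such as {"score": ...}.
TOKEN_RE = re.compile(r"\{([A-Za-z_]\w*)\}")


def lint_judge_prompt(prompt: str, agent_dir: str) -> list[str]:
    """Return wiring problems found in a judge prompt (hypothetical helper)."""
    problems = []
    for required in (f"{agent_dir}/evaluator.md", f"{agent_dir}/surfaces/"):
        if required not in prompt:
            problems.append(f"literal path missing: {required}")
    for token in sorted(set(TOKEN_RE.findall(prompt)) - RENDERED_TOKENS):
        problems.append(f"token will not be rendered: {{{token}}}")
    if "{outputs}" not in prompt:
        problems.append("prompt never references {outputs}")
    return problems
```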

Write the rubric to the SAME standard it enforces (Stage 4 checks this; the standard below is itself ordered most-important-first — the rubric must obey its own §3):

1. Grounded + specific feedback (primary signal). Every criterion references the surfaces ("a claim not supported by surfaces/source.txt is a fabrication") and demands specific NL feedback ("name the fabricated claim"), never a bare number, because the optimizer reads the feedback to improve the agent.
2. Deterministic defaults for borderline-but-judgeable calls. When the output is present and the call is close, apply a single mandated default, so the same output scores identically every run. (Cannot-judge cases are governed by abstention, #5, not here.)
3. Per-criterion partial credit. Each criterion states what partial satisfaction earns, so the optimizer feels small improvements instead of a flat all-or-nothing middle.
4. No logical overlap. Each criterion has exactly one home; overlaps double-count and split run-to-run. Prefer few sharp criteria over many fuzzy ones.
5. Abstention on ungroundable input. Define what the evaluator does when a surface is missing/insufficient: a mandated default (e.g. score that criterion 0, say "ungroundable: "), never a guess. A silent invented score is the Stage-2 hallucination re-entering.
6. Position/order neutrality (any comparative judging). Judge each output against the surfaces independently before comparing; presentation order must not affect the verdict. (Stage 5's swap-symmetry test verifies this, so the rubric must encode it.)
7. Bounded score mapping. Define score ∈ [0,1] explicitly as a small weighted sum of the criteria above (what earns 1.0, what earns 0).
8. Importance-ordered directives. Open the rubric by stating its directives are listed most-important-first, and order them that way, so a judge that stops early still weighs what matters most.
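
The partial-credit, abstention, and bounded-mapping directives can be sketched as one scoring function. The criterion names and weights below are hypothetical placeholders, and the rubric itself remains the source of truth; this only illustrates how a weighted sum stays in [0, 1] while an ungroundable criterion earns the mandated 0 instead of a guess.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    criterion: str
    credit: Optional[float]  # fraction in [0, 1]; None means a surface was missing
    feedback: str


# Hypothetical criteria and weights, listed most-important-first; weights sum to 1.
WEIGHTS = {"grounded": 0.5, "complete": 0.3, "format": 0.2}


def score(verdicts: list[Verdict]) -> dict:
    """Combine per-criterion verdicts into {"score": float, "feedback": str}."""
    by_name = {v.criterion: v for v in verdicts}
    if len(verdicts) != len(WEIGHTS) or set(by_name) != set(WEIGHTS):
        raise ValueError("each criterion needs exactly one verdict")
    total, notes = 0.0, []
    for name, weight in WEIGHTS.items():
        v = by_name[name]
        if v.credit is None:
            # Abstention: mandated default of 0 for this criterion, never a guess.
            notes.append(f"{name}: ungroundable: {v.feedback}")
            continue
        if not 0.0 <= v.credit <= 1.0:
            raise ValueError(f"credit out of range for {name}")
        total += weight * v.credit
        notes.append(f"{name}: {v.feedback}")
    return {"score": round(total, 6), "feedback": "; ".join(notes)}
```

Keeping the feedback string alongside the number matters more than the arithmetic, since the optimizer learns from the text.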

Stage 4 — A DIVERSE expert panel hardens the rubric (headcount ≠ a panel)

Same model + same persona × N is one reviewer sampled N times — it shares self-preference bias and blind spots, and "debate" among clones converges to a shared prior, not validation. Dispatch a heterogeneous team (Agent tool, parallel) with conflicting mandates:

a strictness reviewer (hunt false-positives / over-credit),
a leniency reviewer (hunt false-negatives / over-penalization),
a ground-truth skeptic (trusts only the surfaces),
a red-team adversary whose job is to construct an output that scores high but is bad; any success becomes a new anchor pair (Stage 5).

Where feasible, use ≥2 distinct model tiers (e.g. Opus + Sonnet/Haiku).

Each reviewer checks every §3 directive, in §3's order (don't re-copy the list — a copied checklist drifts from the standard). Then debate-until-correct, eliminate nits, and re-review the revised rubric (review → review until a clean pass). Apply the synthesis to evaluator.md. Record whether the judge model differs from the agent model; if they're the same, note the self-preference risk — it caps trust below "high" (Stage 6).
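
A small sketch of the heterogeneity check implied above, assuming each reviewer is recorded as a (mandate, model) pair. The mandate labels and the function `panel_issues` are hypothetical names chosen for illustration; the model strings are free-form labels, not API identifiers.

```python
# Mandate labels are illustrative shorthand for the four roles in the text.
REQUIRED_MANDATES = {"strictness", "leniency", "ground-truth-skeptic", "red-team"}


def panel_issues(
    reviewers: list[tuple[str, str]], judge_model: str, agent_model: str
) -> list[str]:
    """Return reasons the panel is not yet heterogeneous (hypothetical helper)."""
    mandates = {mandate for mandate, _ in reviewers}
    models = {model for _, model in reviewers}
    issues = [f"missing mandate: {m}" for m in sorted(REQUIRED_MANDATES - mandates)]
    if len(models) < 2:
        # The text asks for two tiers "where feasible", so treat this as a warning.
        issues.append("warning: single model tier across the panel")
    if judge_model == agent_model:
        issues.append("self-preference risk: judge and agent share a model")
    return issues
```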

Stage 5 — Calibrate against an anchor while hill-climbing self-consistency

A beautiful rubric can still be inconsistent (noisy) or consistently wrong. Calibrate it: treat evaluator.md as the artifact and run a real second gepa config whose metric blends self-consistency with a correctness anchor. The full concrete wiring (the calibration config.yaml, the subagent rollout that applies the candidate rubric ×N, and the calibration_metric.py that computes variance + anchor-agreement) is in references/calibration.md — read it before this stage; gepa-anywhere has no built-in consistency/anchor metric, so this config is what makes it real.

Consistency tests (computed by re-running the judge under perturbation via replicas): repeat-agreement (low variance), paraphrase-invariance, swap-symmetry, self-compare neutrality and near-paraphrase neutrality (the realistic leak is preferring text stylistically like the judge's own).
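
Two of these tests reduce to short computations once the judge has been re-run. The sketch below assumes a pairwise `compare(first, second)` callable that returns "first", "second", or "tie"; that interface is an assumption made for illustration, and the real wiring lives in references/calibration.md.

```python
from statistics import pvariance
from typing import Callable, Sequence


def repeat_variance(scores: Sequence[float]) -> float:
    """Population variance of repeated scores for the same output (lower is better)."""
    return pvariance(scores)


def swap_symmetry(
    compare: Callable[[str, str], str], pairs: Sequence[tuple[str, str]]
) -> float:
    """Fraction of pairs whose verdict survives swapping the presentation order."""
    flip = {"first": "second", "second": "first", "tie": "tie"}
    consistent = sum(compare(a, b) == flip[compare(b, a)] for a, b in pairs)
    return consistent / len(pairs)
```

A judge that always prefers whichever output it sees first scores 0.0 on `swap_symmetry`, which is exactly the position bias the rubric's neutrality directive is meant to rule out.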

The anchor is the only correctness signal — make it real, or the Goodhart detector is decorative:

≥15–20 pairs, scored as a continuous agreement fraction (3 pairs saturate at 1.0 and the divergence test never fires).
Boundary-difficulty, not gross corruptions: clean vs. subtly wrong (a plausibly-paraphrased-but-unsupported claim, an off-by-one level), because that's where optimization pushes.
Partly independent: not 100% self-constructed by the same model that wrote the rubric (shared blind spot). Include human-confirmed or externally-sourced pairs; the report states the independent fraction, which caps trust.
Anchor is a GATE, not a summed term: a candidate rubric that scores below the anchor floor (e.g. agreement < 0.9) is rejected before consistency is even compared. Consistency is necessary, not sufficient — otherwise the cheapest way to raise consistency is to make the rubric insensitive (collapse the dynamic range), which is a consistently-wrong judge.
Anti-degeneracy: flag "consistency rose because score-variance collapsed" — the rubric must keep discriminative range on the set.
Stop rule: halt if consistency rises while anchor-agreement is flat and saturated (the proxy–anchor gap, the universal Goodhart detector) — and treat "anchor at ceiling from step 0" as an invalid (too-easy) anchor, not a pass.
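
The gate, the anti-degeneracy flag, and the stop rule above can be sketched as follows. The 0.9 floor is the example from the text; `MIN_SPREAD`, the function names, and the reading of "flat and saturated" as "unchanged at 1.0" are assumptions made for illustration.

```python
from statistics import pstdev

ANCHOR_FLOOR = 0.9  # example floor from the text
MIN_SPREAD = 0.05   # hypothetical lower bound on score spread across the anchor set


def agreement(pairs: list[tuple[float, float]]) -> float:
    """Fraction of (clean, subtly_wrong) score pairs where clean scores strictly higher."""
    return sum(clean > wrong for clean, wrong in pairs) / len(pairs)


def gate(
    pairs: list[tuple[float, float]], consistency: float, incumbent_consistency: float
) -> str:
    """Apply the anchor as a gate first; compare consistency only afterwards."""
    if agreement(pairs) < ANCHOR_FLOOR:
        return "reject: below anchor floor"
    if pstdev([s for pair in pairs for s in pair]) < MIN_SPREAD:
        return "reject: score range collapsed"
    if consistency <= incumbent_consistency:
        return "reject: no consistency gain"
    return "accept"


def anchor_status(history: list[tuple[float, float]]) -> str:
    """history holds (consistency, anchor_agreement) per step, oldest first."""
    if history[0][1] >= 1.0:
        return "invalid anchor: at ceiling from step 0"
    if len(history) >= 2:
        (c_prev, a_prev), (c_now, a_now) = history[-2], history[-1]
        if c_now > c_prev and a_now >= 1.0 and a_now == a_prev:
            return "halt: consistency rising while anchor is flat and saturated"
    return "continue"
```

Ordering the checks this way means a rubric can never buy acceptance with consistency alone.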

Write calibration/report.md: consistency before/after, anchor-agreement (+ independent fraction), panel sign-off, and the trust level (Stage 6).

Stage 6 — Register, gate trust, hand off

Add the agent to .gepa/registry.yaml (a convention these skills maintain by hand; gepa itself only consumes config.yaml). Trust is evidence-gated, not prose: high requires ≥N independently-verified anchors, anchor-agreement above floor with headroom, repeat-variance below bound, swap-symmetry passing, and the heterogeneous-panel sign-off; cap at medium if the anchor is fully self-constructed or the judge and agent share a base model. The registry carries that trust into every downstream run, so don't inflate it.
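
One way to keep trust evidence-gated is to compute the blockers for "high" explicitly instead of asserting a level in prose. The sketch below assumes numeric thresholds that the text leaves open (N, the floor, the headroom, the variance bound), so every default is a hypothetical placeholder, as are the names `Evidence` and `high_trust_blockers`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    independent_anchors: int
    anchor_agreement: float
    repeat_variance: float
    swap_symmetry_passed: bool
    panel_signed_off: bool
    anchor_fully_self_constructed: bool
    judge_shares_agent_base_model: bool


def high_trust_blockers(
    e: Evidence,
    *,
    min_independent: int = 5,     # the text's N; value is a placeholder
    floor: float = 0.9,           # example floor from Stage 5
    headroom: float = 0.03,       # placeholder
    variance_bound: float = 0.01  # placeholder
) -> list[str]:
    """Return every reason trust cannot be 'high'; an empty list means it can."""
    blockers = []
    if e.independent_anchors < min_independent:
        blockers.append("too few independently-verified anchors")
    if e.anchor_agreement < floor + headroom:
        blockers.append("anchor-agreement lacks headroom above floor")
    if e.repeat_variance >= variance_bound:
        blockers.append("repeat-variance not below bound")
    if not e.swap_symmetry_passed:
        blockers.append("swap-symmetry failing")
    if not e.panel_signed_off:
        blockers.append("no heterogeneous-panel sign-off")
    if e.anchor_fully_self_constructed:
        blockers.append("cap at medium: anchor is fully self-constructed")
    if e.judge_shares_agent_base_model:
        blockers.append("cap at medium: judge and agent share a base model")
    return blockers
```

Recording the returned list in calibration/report.md gives downstream runs the reasons behind the level, not only the level.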

The agent is now ready: gepa run --config .gepa/agents/<name>/config.yaml optimizes its prompt; gepa frontier inspects/promotes. During the agent run, periodically re-check the held-out anchor (and spot-check at promotion): if top agent candidates diverge from the anchor, the evaluator is being hacked — halt and re-calibrate. Re-run this skill per agent; each gets its own grounded, calibrated, external evaluator, so a repo can hill-climb arbitrarily many.

Why each stage exists (skip one → this breaks)

Stage 2 → the judge hallucinates grounding. Stage 3 external file → you can't hill-climb or review it. Stage 4 diversity → clones rubber-stamp their own bias. Stage 5 gated anchor → the optimizer hill-climbs evaluator noise, or a consistently-wrong judge passes. Evidence-gated trust → downstream runs inherit unjustified confidence.
