From gepa-anywhere
Build a trustworthy, anchored evaluator for an agent whose output you want to optimize, when no reliable metric/judge/gold exists yet. It learns exactly what the agent produces, gives the evaluator the SAME inputs the agent saw (building + testing those surfaces), drafts the evaluator as an EXTERNAL markdown rubric, has a DIVERSE expert panel (incl. an adversary) harden it, and calibrates it against a real anchor while hill-climbing self-consistency — then registers it so gepa can optimize the agent against it. Use whenever the user wants to optimize / improve / "make better" an agent's prompt or output and the way to MEASURE quality is missing or weak — e.g. "set up a judge for my extraction agent", "how do I score whether my summarizer is good", "optimize this prompt but I have no gold labels", "build an evaluator for <agent>", or BEFORE any `gepa run` whose metric is absent or untrustworthy. NOT for laying out the repo (that's gepa-init) or driving the optimization loop (that's gepa-run); this is only the build-the-metric step. The evaluator is the bottleneck on every optimization — do not skip building one well.
Slash command: /gepa-anywhere:evaluator-discovery
A prompt optimizer hill-climbs whatever you measure, so a weak evaluator yields
confident garbage (Goodhart). This skill builds the evaluator first, well: a
calibrated, input-grounded, externally-stored rubric that gepa run can then optimize an
agent against. Run it before optimizing any agent whose quality signal is missing or shaky.
Deliverable for an agent <name>, under .gepa/agents/<name>/: target.md (what the
output is), surfaces/ (the inputs the agent saw + a parity test), evaluator.md (the
external rubric — never inlined), anchors.jsonl (the correctness anchor), config.yaml
(its metric points the judge at evaluator.md), and calibration/report.md (with an
evidence-gated trust level). After this, gepa run --config .gepa/agents/<name>/config.yaml
optimizes the agent against the calibrated evaluator.
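The deliverable list above can be checked mechanically before handing off to gepa run. The following is a minimal sketch, not part of gepa-anywhere: the helper name `missing_deliverables` is hypothetical, and the paths are simply the ones listed above.

```python
from pathlib import Path

# Deliverables named in the text, relative to .gepa/agents/<name>/.
DELIVERABLES = [
    "target.md",
    "surfaces",
    "evaluator.md",
    "anchors.jsonl",
    "config.yaml",
    "calibration/report.md",
]


def missing_deliverables(repo_root: str, agent_name: str) -> list[str]:
    """Return the deliverables that do not exist yet for one agent (hypothetical helper)."""
    agent_dir = Path(repo_root) / ".gepa" / "agents" / agent_name
    return [rel for rel in DELIVERABLES if not (agent_dir / rel).exists()]
```

An empty list means every artifact is in place; it says nothing about whether the contents are any good, which is what the stages themselves establish.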
Run gepa-init first if the repo has no .gepa/agents/ layout. Work the six stages in order;
each depends on the last. Announce the stage you're on.
Honest contract (read first): this does NOT evaluate from nothing. It reduces the ground
truth you need to a small anchor and reports how much it trusts the result. Two pieces, the
rubric and the surfaces, are wired by naming a literal path inside the judge prompt (there is
no {evaluator}/{surfaces} token today; see Stage 3), and Stage 5's calibration runs a real but
explicit second gepa config (see references/calibration.md). Don't imply more automation than
exists.
Stage 1
You can't evaluate what you can't name. Read the agent's prompt/spec/code and sample 3–5
real outputs (the prompt's intent and the actual output often diverge). Capture in
target.md: the purpose, the output contract (exact schema, and what "one output"
is), the success criteria (in the user's terms), and the known failure modes (each
becomes a rubric line-item). Also list the agent's input surfaces and a
hidden-conditioning inventory (chain-of-thought, tool calls/retrieval, sampling
randomness) — Stage 2 needs both. If purpose/criteria are genuinely ambiguous and the
samples don't settle it, ask the user the one or two questions that actually disambiguate.
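A quick completeness check on target.md can catch a missing item before Stage 2 depends on it. This is a rough sketch under assumptions: the topic strings are taken from the paragraph above, but the text does not prescribe section names, so the substring match and the helper name `missing_topics` are hypothetical.

```python
# Topics the text says target.md should capture. Matching on these exact
# phrases is an assumption; adapt them to however the file is actually titled.
REQUIRED_TOPICS = [
    "purpose",
    "output contract",
    "success criteria",
    "failure modes",
    "input surfaces",
    "hidden-conditioning",
]


def missing_topics(target_md_text: str) -> list[str]:
    """Return required topics that are never mentioned in target.md (case-insensitive)."""
    lowered = target_md_text.lower()
    return [topic for topic in REQUIRED_TOPICS if topic not in lowered]
```

A mention is not the same as a useful entry, so this only flags omissions; the content still needs a human or panel read.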
Stage 2
This is the stage people skip, and skipping it silently breaks everything. A judge without the agent's
inputs hallucinates grounding. The evaluator must read the same inputs the agent saw for
each output. There is no {surfaces} token in the judge prompt today, so a surface reaches
the judge only because its path is named literally in the rubric prompt (Stage 3) — keep
that in mind while building them.
Branch by where the input comes from:
- Input file (no runtime fetch). Parity is automatic: the judge reads the same input path
  the rollout received. The "test" is a one-line assertion that the path the judge will read
  equals the example's input. Done.
- Runtime-fetched input. Capture it to surfaces/<id>/… from the same invocation that produced
  the output being judged (bind it by hash/timestamp: a re-fetch can differ from what the
  agent saw, and a parity test on a re-fetch passes vacuously). Parity test = byte-equal
  diff (or a documented, named projection) between the captured bundle and the judge-time
  loader, for ≥1 example; exit non-zero on mismatch; it lives at surfaces/parity_test.sh.
For each item in the hidden-conditioning inventory, either expose it as a surface or
record in target.md that the rubric cannot judge that dimension (e.g. faithfulness-to-
reasoning, tool-correctness). A green parity test that ignores un-exposed conditioning is
false confidence — fail closed when an enumerated surface is marked captured but isn't bound
to the judged output.
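The parity rule for captured surfaces can be sketched as a small check that surfaces/parity_test.sh might call. Everything here is an illustrative assumption rather than gepa-anywhere code: the function names, the use of SHA-256 as the binding hash, and the three-argument calling convention.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def parity_ok(captured: Path, judge_view: Path, bound_sha256: str) -> bool:
    """True only if the capture is bound to the judged output and byte-equal to the judge's view.

    bound_sha256 is the hash recorded by the invocation that produced the output
    (hypothetical convention). Any missing file fails closed.
    """
    if not captured.is_file() or not judge_view.is_file():
        return False  # fail closed: a surface marked captured but absent is a failure
    if sha256_of(captured) != bound_sha256:
        return False  # capture is not the one bound to this output (e.g. a re-fetch)
    return captured.read_bytes() == judge_view.read_bytes()


def main(argv: list[str]) -> int:
    """Exit status for a shell wrapper: 0 on parity, 1 on mismatch, 2 on bad usage."""
    if len(argv) != 3:
        return 2
    captured, judge_view, bound = argv
    return 0 if parity_ok(Path(captured), Path(judge_view), bound) else 1
```

A wrapper script would end with `sys.exit(main(sys.argv[1:]))` so that a mismatch exits non-zero, as the text requires. Checking the bound hash first is what stops a re-fetched surface from passing vacuously.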
Stage 3
Write .gepa/agents/<name>/evaluator.md. It must be an external file, because (a) it is
the artifact you hill-climb in Stage 5 and (b) the panel and humans must read+edit it directly.
Wiring (concrete — there is no rubric/surfaces token today). Set the agent's metric to a
subagent whose prompt names the paths literally, so the judge subagent reads them with its
own file tools:
    metric:
      mode: subagent
      subagent:
        prompt: |
          Read the rubric at .gepa/agents/<name>/evaluator.md and the input surfaces in
          .gepa/agents/<name>/surfaces/. Score the agent outputs in {outputs} strictly per the
          rubric, grounding every judgement in those surfaces. Write {"score": <0..1>,
          "feedback": "<specific errors, naming the surface each is grounded in>"}.
Only {outputs} and {gold} are rendered; evaluator.md and surfaces/ reach the judge
purely via these literal paths.
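Because the judge is asked to write a {"score", "feedback"} object, it may help to validate that shape before trusting a run. The sketch below is an assumption about how one could do that; `parse_judgement` is a hypothetical helper, and gepa's own handling of judge output may differ.

```python
import json


def parse_judgement(raw: str) -> dict:
    """Parse judge output; raise ValueError unless score is in [0, 1] and feedback is non-empty text."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge output is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")

    feedback = data.get("feedback")
    if not isinstance(feedback, str) or not feedback.strip():
        raise ValueError("feedback must be non-empty text, never a bare number")

    return {"score": float(score), "feedback": feedback}
```

Rejecting empty feedback matters as much as range-checking the score, since the optimizer reads the feedback to improve the agent.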
Write the rubric to the SAME standard it enforces (Stage 4 checks this; the standard below is itself ordered most-important-first — the rubric must obey its own §3):
surfaces/source.txt is a fabrication") and demands specific
NL feedback ("name the fabricated claim"), never a bare number, because the optimizer reads the
feedback to improve the agent.
score ∈ [0,1] explicitly as a small weighted sum of the
criteria above (what earns 1.0, what earns 0).
Stage 4
Same model + same persona × N is one reviewer sampled N times: it shares self-preference bias and blind spots, and "debate" among clones converges to a shared prior, not validation. Dispatch a heterogeneous team (Agent tool, parallel) with conflicting mandates:
Each reviewer checks every §3 directive, in §3's order (don't re-copy the list — a copied
checklist drifts from the standard). Then debate-until-correct, eliminate nits, and re-review
the revised rubric (review → review until a clean pass). Apply the synthesis to evaluator.md.
Record whether the judge model differs from the agent model; if they're the same, note the
self-preference risk — it caps trust below "high" (Stage 6).
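The clone problem and the self-preference note can be made explicit with a small pre-dispatch check. This is a sketch under assumptions: the `Reviewer` record, its fields, and `panel_problems` are hypothetical, and the only rules encoded are the ones stated in the text (no duplicated model+persona, an adversary on the panel, and a recorded risk when judge and agent models match).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reviewer:
    model: str
    persona: str
    adversarial: bool = False


def panel_problems(panel: list[Reviewer], judge_model: str, agent_model: str) -> list[str]:
    """Return human-readable problems with a review panel; an empty list means none found."""
    problems = []
    distinct = {(r.model, r.persona) for r in panel}
    if len(distinct) < len(panel):
        problems.append("duplicate model+persona: clones count as one reviewer sampled N times")
    if not any(r.adversarial for r in panel):
        problems.append("no adversarial reviewer on the panel")
    if judge_model == agent_model:
        problems.append("judge and agent share a model: self-preference risk, trust capped below high")
    return problems
```

The last item is a note to carry into the calibration report, not a reason to stop: the text caps trust in that case, it does not forbid the setup.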
Stage 5
A beautiful rubric can still be inconsistent (noisy) or consistently wrong. Calibrate it:
treat evaluator.md as the artifact and run a real second gepa config whose metric blends
self-consistency with a correctness anchor. The full concrete wiring (the calibration
config.yaml, the subagent rollout that applies the candidate rubric ×N, and the
calibration_metric.py that computes variance + anchor-agreement) is in
references/calibration.md — read it before this stage; gepa-anywhere has no built-in
consistency/anchor metric, so this config is what makes it real.
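To make the idea of a blended metric concrete, here is an illustrative sketch of combining self-consistency with anchor agreement. It is not the calibration_metric.py from references/calibration.md: the function names, the variance normalisation, the mean-absolute-error agreement measure, and the `weight` parameter are all assumptions chosen for clarity.

```python
from statistics import pvariance


def consistency(replica_scores: list[float]) -> float:
    """1.0 when all replicas agree, falling to 0.0 at maximal disagreement.

    Scores are assumed to lie in [0, 1], where the largest possible
    population variance is 0.25, so variance is normalised by that bound.
    """
    return 1.0 - min(pvariance(replica_scores) / 0.25, 1.0)


def anchor_agreement(judge_scores: list[float], anchor_scores: list[float]) -> float:
    """1 minus the mean absolute error between judge scores and anchor scores."""
    if not judge_scores or len(judge_scores) != len(anchor_scores):
        raise ValueError("need equally long, non-empty score lists")
    mae = sum(abs(j - a) for j, a in zip(judge_scores, anchor_scores)) / len(judge_scores)
    return 1.0 - mae


def calibration_score(replicas_per_example: list[list[float]],
                      anchor_scores: list[float],
                      weight: float = 0.5) -> float:
    """Blend mean consistency with anchor agreement; weight is the consistency share."""
    mean_scores = [sum(r) / len(r) for r in replicas_per_example]
    mean_consistency = sum(consistency(r) for r in replicas_per_example) / len(replicas_per_example)
    return weight * mean_consistency + (1.0 - weight) * anchor_agreement(mean_scores, anchor_scores)
```

Keeping the anchor term in the blend is the point: a rubric tuned on consistency alone can become perfectly repeatable and still be wrong.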
Consistency tests (computed by re-running the judge under perturbation via replicas):
repeat-agreement (low variance), paraphrase-invariance, swap-symmetry, self-compare neutrality
and near-paraphrase neutrality (the realistic leak is preferring text stylistically like
the judge's own).
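The consistency tests listed above can be expressed as small gap functions over a judge. This sketch assumes two hypothetical judge interfaces that the text does not define: a pointwise judge returning a score in [0, 1], and a pairwise judge returning the preference for its first argument in [0, 1]. Real judges would be subagent calls; plain callables stand in for them here.

```python
from statistics import pvariance
from typing import Callable

Judge = Callable[[str], float]           # assumed: score for one output, in [0, 1]
PairJudge = Callable[[str, str], float]  # assumed: preference for first over second, in [0, 1]


def repeat_variance(judge: Judge, output: str, n: int = 5) -> float:
    """Repeat-agreement: variance of n independent judgements of the same output."""
    return pvariance([judge(output) for _ in range(n)])


def paraphrase_gap(judge: Judge, output: str, paraphrase: str) -> float:
    """Paraphrase-invariance: score change when only the wording changes."""
    return abs(judge(output) - judge(paraphrase))


def swap_symmetry_gap(pair_judge: PairJudge, a: str, b: str) -> float:
    """Swap-symmetry: an order-insensitive judge has p(a, b) + p(b, a) == 1."""
    return abs(pair_judge(a, b) + pair_judge(b, a) - 1.0)


def self_compare_gap(pair_judge: PairJudge, a: str) -> float:
    """Self-compare neutrality: an output compared with itself should score 0.5."""
    return abs(pair_judge(a, a) - 0.5)
```

Each function returns a gap where 0.0 is ideal, so a calibration run can compare every gap against its own bound; the bounds themselves are left to the calibration config.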
The anchor is the only correctness signal — make it real, or the Goodhart detector is decorative:
Write calibration/report.md: consistency before/after, anchor-agreement (+ independent
fraction), panel sign-off, and the trust level (Stage 6).
Stage 6
Add the agent to .gepa/registry.yaml (a convention these skills maintain by hand; gepa
itself only consumes config.yaml). Trust is evidence-gated, not prose: high requires
≥N independently-verified anchors, anchor-agreement above floor with headroom, repeat-
variance below bound, swap-symmetry passing, and the heterogeneous-panel sign-off; cap at
medium if the anchor is fully self-constructed or the judge and agent share a base model.
The registry carries that trust into every downstream run, so don't inflate it.
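The evidence gate can be written as a pure function so the trust level is computed rather than asserted. This is a sketch under stated assumptions: the text leaves N, the floor, the headroom and the variance bound unspecified, so they are parameters here; the record and function names are hypothetical; and the rule that any failed core gate yields "low" is this sketch's choice, since the text defines only "high" and the cap at "medium".

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    independent_anchors: int
    anchor_agreement: float
    repeat_variance: float
    swap_symmetry_passed: bool
    panel_signed_off: bool
    anchor_fully_self_constructed: bool
    judge_shares_base_model: bool


@dataclass
class Thresholds:
    # Values are not given in the text; callers must choose them.
    min_independent_anchors: int
    agreement_floor: float
    agreement_headroom: float
    variance_bound: float


def trust_level(e: Evidence, t: Thresholds) -> str:
    """Return 'high', 'medium' or 'low' from evidence alone."""
    core_gates = (
        e.anchor_agreement >= t.agreement_floor + t.agreement_headroom
        and e.repeat_variance < t.variance_bound
        and e.swap_symmetry_passed
        and e.panel_signed_off
    )
    if not core_gates:
        return "low"  # assumption: the text does not define this tier
    capped = e.anchor_fully_self_constructed or e.judge_shares_base_model
    if capped or e.independent_anchors < t.min_independent_anchors:
        return "medium"
    return "high"
```

Writing the inputs into calibration/report.md next to the resulting level lets a later reader recompute it instead of taking the label on faith.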
The agent is now ready: gepa run --config .gepa/agents/<name>/config.yaml optimizes its
prompt; gepa frontier inspects/promotes. During the agent run, periodically re-check the
held-out anchor (and spot-check at promotion): if top agent candidates diverge from the
anchor, the evaluator is being hacked — halt and re-calibrate. Re-run this skill per agent;
each gets its own grounded, calibrated, external evaluator, so a repo can hill-climb
arbitrarily many.
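One way to read "top agent candidates diverge from the anchor" is that the judge rates the best candidates well above what the held-out anchor supports. The sketch below encodes that reading only; the pairing of a judge score with an anchor-checked score per candidate, the `top_k` and `tolerance` parameters, and the function name are all assumptions.

```python
def should_halt(candidates: list[tuple[float, float]],
                top_k: int = 3,
                tolerance: float = 0.1) -> bool:
    """Return True when the judge overrates its top candidates relative to the anchor.

    Each candidate is (judge_score, anchor_score), both in [0, 1]. The top_k
    candidates by judge score are compared; a mean gap above tolerance suggests
    the optimizer is exploiting the evaluator rather than improving the agent.
    """
    if not candidates:
        return False
    top = sorted(candidates, key=lambda c: c[0], reverse=True)[:top_k]
    mean_gap = sum(judge - anchor for judge, anchor in top) / len(top)
    return mean_gap > tolerance
```

A True result is the cue to halt and re-calibrate; a False result only means this particular symptom was not observed on the examples checked.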
What breaks if you skip a piece:
Stage 2 → the judge hallucinates grounding.
Stage 3 external file → you can't hill-climb or review it.
Stage 4 diversity → clones rubber-stamp their own bias.
Stage 5 gated anchor → the optimizer hill-climbs evaluator noise, or a consistently-wrong judge passes.
Evidence-gated trust → downstream runs inherit unjustified confidence.
npx claudepluginhub evanfabry/gepa-anywhere --plugin gepa-anywhere