From dspy-agent-skills
Optimizes DSPy programs using the dspy.GEPA reflective/evolutionary optimizer for complex tasks with rich-feedback metrics. Requires a DSPy module, metric returning dspy.Prediction, trainset, and reflection LM.
Slash command
/dspy-agent-skills:dspy-gepa-optimizer
When to use
User asks to optimize/compile/tune a DSPy program, mentions GEPA or reflective optimization, or has a working program with a non-trivial metric and wants to improve it.
GEPA (Genetic-Pareto) is a reflective optimizer: it mutates a program's instructions and few-shots using an LM that reads your metric's **textual feedback** and proposes improvements. It maintains a Pareto frontier across validation tasks and is the default recommendation for complex DSPy workloads in 2026.
The expansion "Genetic-Evolutionary Prompt Adaptation" that appears in some AI-generated summaries is an LLM-hallucinated backronym. The paper defines GEPA as Genetic-Pareto; the "Pareto" is load-bearing (GEPA keeps a frontier of candidates rather than collapsing to one).
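To make the "frontier rather than a single winner" idea concrete, here is a minimal sketch in plain Python. It is not GEPA's implementation: the candidate names and scores are made up, and `pareto_candidates` is a hypothetical helper that keeps every candidate achieving the top score on at least one validation task.

```python
def pareto_candidates(scores: dict[str, list[float]]) -> set[str]:
    """Return candidates that achieve the top score on at least one task.

    `scores` maps a candidate name to its per-task validation scores.
    Illustrative only: a simplified stand-in for Pareto-style selection.
    """
    num_tasks = len(next(iter(scores.values())))
    frontier = set()
    for task in range(num_tasks):
        best = max(task_scores[task] for task_scores in scores.values())
        frontier.update(name for name, s in scores.items() if s[task] == best)
    return frontier


def best_on_average(scores: dict[str, list[float]]) -> str:
    """Collapse to a single winner by mean score, for comparison."""
    return max(scores, key=lambda name: sum(scores[name]) / len(scores[name]))


# Hypothetical per-task scores for three candidate programs.
scores = {
    "cand_a": [1.0, 0.2, 0.6],
    "cand_b": [0.4, 0.9, 0.6],
    "cand_c": [0.5, 0.5, 0.5],
}
frontier = pareto_candidates(scores)
winner = best_on_average(scores)
```

In this toy data, selecting by average keeps only one candidate, while the frontier also retains the candidate that is strongest on the first task, so later mutations can still build on it.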
- A dspy.Module that runs end-to-end (see dspy-fundamentals).
- A metric returning dspy.Prediction(score=float, feedback=str) (see dspy-evaluation-harness). A float-only metric makes GEPA no better than MIPRO. A dict with the same fields still crashes dspy.Evaluate under DSPy 3.2.1, so use dspy.Prediction.
- A trainset and a separate valset. For GEPA, maximize training examples and keep validation just large enough to represent the downstream distribution; do not reuse the same examples for both.
- reflection_lm: a strong LM (often the same as or stronger than the task LM) set to temperature=1.0 for creative proposals. Current DSPy docs use a GPT-5-class reflection model with a large output budget.

import dspy

# Task LM used by the program being optimized.
dspy.configure(lm=dspy.LM("openai/gpt-5-mini"))
# Reflection LM that reads metric feedback and proposes new instructions.
reflection_lm = dspy.LM("openai/gpt-5", temperature=1.0, max_tokens=32000)

# rich_metric, program, trainset, and valset are defined elsewhere.
optimizer = dspy.GEPA(
    metric=rich_metric,
    auto="medium",                           # "light" / "medium" / "heavy"
    reflection_lm=reflection_lm,
    reflection_minibatch_size=3,
    candidate_selection_strategy="pareto",   # or "current_best"
    skip_perfect_score=True,
    use_merge=True,
    num_threads=8,
    track_stats=True,
    track_best_outputs=True,                 # enables inference-time best-of selection
    log_dir="./gepa_logs",                   # resume/checkpoint
    seed=0,
)

optimized = optimizer.compile(
    student=program,
    trainset=trainset,
    valset=valset,
)

# Inspect aggregate validation scores per candidate (detailed_results requires track_stats=True).
val_scores = optimized.detailed_results.val_aggregate_scores
print("Top aggregate validation scores:", sorted(val_scores, reverse=True)[:5])

optimized.save("optimized_program.json", save_program=False)
Either import path works; use the top-level dspy.GEPA in new code:
import dspy
dspy.GEPA(...) # preferred
# equivalently:
from dspy.teleprompt import GEPA
import dspy

def rich_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    score = ...     # 0.0..1.0
    feedback = ...  # detailed natural-language critique
    return dspy.Prediction(score=score, feedback=feedback)
Return dspy.Prediction, not a dict. Some upstream GEPA prose describes score/feedback as a dict-like shape, but dspy.Evaluate in DSPy 3.2.1 still crashes on a literal dict metric (TypeError: unsupported operand type(s) for +: 'int' and 'dict'). GEPA uses dspy.Evaluate internally for candidate scoring, so a dict return can fail inside GEPA too, not just in your explicit Evaluate(...) calls.
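As a rough illustration of what "rich feedback" can look like, the sketch below computes a score and a critique for a simple token-overlap task using only the standard library. `score_with_feedback` is a hypothetical helper, not part of DSPy; in a real metric you would wrap its result in dspy.Prediction(score=..., feedback=...). The last few lines show one plausible reading of the quoted TypeError, assuming scores are aggregated with a plain sum, which is an assumption about the internals rather than something the source confirms.

```python
def score_with_feedback(gold_answer: str, pred_answer: str) -> tuple[float, str]:
    """Token-overlap score in [0, 1] plus a critique naming what went wrong."""
    gold_tokens = set(gold_answer.lower().split())
    pred_tokens = set(pred_answer.lower().split())
    if not gold_tokens:
        return 0.0, "Gold answer is empty; cannot score this example."
    missing = sorted(gold_tokens - pred_tokens)
    extra = sorted(pred_tokens - gold_tokens)
    score = len(gold_tokens & pred_tokens) / len(gold_tokens)
    parts = []
    if missing:
        parts.append(f"Missing expected terms: {', '.join(missing)}.")
    if extra:
        parts.append(f"Unexpected extra terms: {', '.join(extra)}.")
    feedback = " ".join(parts) or "Answer matches the gold answer."
    return score, feedback


score, feedback = score_with_feedback("paris france", "Paris Texas")
# In a real metric: return dspy.Prediction(score=score, feedback=feedback)

# Summing starts from the int 0, so a list of dicts cannot be added up.
try:
    sum([{"score": score, "feedback": feedback}])
    dict_error = None
except TypeError as exc:
    dict_error = str(exc)
```

The feedback string names the specific missing and unexpected terms, which gives a reflection LM something concrete to act on, unlike a bare number.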
pred_name / pred_trace are set during reflection on a specific predictor inside your module, so write per-predictor feedback when possible (credit assignment). If you cannot localize feedback, return program-level feedback rather than a vague score-only critique.
Use either auto=... or an explicit budget, not both.
| Mode | Rough rollouts | When to use |
|---|---|---|
| auto="light" | ~20–40 full evals | Sanity-check that GEPA works on your metric |
| auto="medium" | ~80–150 full evals | Everyday optimization |
| auto="heavy" | ~300–600 full evals | Final run before shipping |
| max_full_evals=N | Explicit | Deterministic budget |
| max_metric_calls=N | Explicit | Hard cap on metric invocations (more predictable cost) |
Each "full eval" ≈ len(valset) metric calls. Budget accordingly for cost.
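For a quick cost estimate, the approximation above can be turned into a few lines of arithmetic. This is a back-of-the-envelope sketch: the ranges are copied from the table, the helper names are hypothetical, and the valset size of 50 is an arbitrary example value.

```python
# Rough full-eval ranges per auto mode, copied from the table in this section.
AUTO_FULL_EVALS = {"light": (20, 40), "medium": (80, 150), "heavy": (300, 600)}


def estimated_metric_calls(full_evals: int, valset_size: int) -> int:
    """Approximate metric calls, treating one full eval as len(valset) calls."""
    return full_evals * valset_size


def budget_range(mode: str, valset_size: int) -> tuple[int, int]:
    """Low and high metric-call estimates for an auto mode."""
    low, high = AUTO_FULL_EVALS[mode]
    return (
        estimated_metric_calls(low, valset_size),
        estimated_metric_calls(high, valset_size),
    )


medium_calls = budget_range("medium", valset_size=50)
```

Multiplying the result by your per-call LM cost gives a rough spend estimate before you launch a run.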
dspy.GEPA(
    metric,                                  # required
    auto=None,                               # Literal["light","medium","heavy"] | None
    max_full_evals=None,
    max_metric_calls=None,
    reflection_minibatch_size=3,
    candidate_selection_strategy="pareto",   # or "current_best"
    reflection_lm=None,                      # required in practice
    skip_perfect_score=True,
    add_format_failure_as_feedback=False,
    instruction_proposer=None,               # custom ProposalFn
    component_selector="round_robin",        # or a callable
    use_merge=True,
    max_merge_invocations=5,
    num_threads=None,
    failure_score=0.0,
    perfect_score=1.0,
    log_dir=None,
    track_stats=False,
    use_wandb=False,
    wandb_api_key=None,                      # overrides WANDB_API_KEY env var
    wandb_init_kwargs=None,                  # dict forwarded to wandb.init(...)
    track_best_outputs=False,
    warn_on_score_mismatch=True,
    use_mlflow=False,
    seed=0,
    gepa_kwargs=None,                        # e.g. {"use_cloudpickle": True} for dynamic signatures
)
.compile(student, *, trainset, valset=None, teacher=None) — teacher is not currently used.
DSPy's general prompt-optimizer docs often recommend a validation-heavy split, such as 20% train / 80% validation, because small prompt optimizers can overfit tiny trainsets. GEPA is different: maximize the training set and reserve only enough validation examples to represent downstream behavior. The Pareto frontier still needs a real valset, but GEPA learns from traces and textual feedback on training examples, so starving trainset hurts.
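A train-heavy, non-overlapping split can be sketched with the standard library alone. The 20% validation fraction below is an assumed starting point rather than a number from the GEPA docs, and `train_heavy_split` is a hypothetical helper; adjust the fraction until the valset represents your downstream distribution.

```python
import random


def train_heavy_split(examples, val_fraction=0.2, seed=0):
    """Shuffle and split into disjoint (trainset, valset), favoring train."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    val_size = max(1, round(len(shuffled) * val_fraction))
    return shuffled[val_size:], shuffled[:val_size]


# Placeholder items; in practice these would be dspy.Example objects.
examples = [f"example-{i}" for i in range(100)]
trainset, valset = train_heavy_split(examples)
```

Slicing one shuffled list guarantees that no example lands in both sets.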
If you want a multi-stage optimizer loop, DSPy 3.2.0's BetterTogether now accepts arbitrary named optimizers instead of the older fixed prompt_optimizer / weight_optimizer pair:
optimizer = dspy.BetterTogether(
    metric=rich_metric,
    bootstrap=dspy.BootstrapFewShotWithRandomSearch(metric=rich_metric),
    gepa=dspy.GEPA(metric=rich_metric, auto="light", reflection_lm=reflection_lm),
)

optimized = optimizer.compile(
    student=program,
    trainset=trainset,
    valset=valset,
    strategy="bootstrap -> gepa",
)
Pass strategy= explicitly when you use named stages like bootstrap=... and gepa=.... DSPy 3.2.0's default strategy is still "p -> w -> p", which only works if your optimizer keys are literally p and w.
Keep plain GEPA as the default first pass. Reach for BetterTogether only when you have a specific reason to chain optimizers and want the valset to pick the best intermediate program.
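The relationship between the strategy string and the optimizer keys can be illustrated with a small validator. This is not BetterTogether's code, and the error DSPy actually raises may differ; `parse_strategy` is a hypothetical helper that only shows why "p -> w -> p" cannot resolve when the registered names are bootstrap and gepa.

```python
def parse_strategy(strategy: str, optimizer_names: set[str]) -> list[str]:
    """Split "a -> b" into stage names and check each one was registered."""
    stages = [stage.strip() for stage in strategy.split("->")]
    unknown = [stage for stage in stages if stage not in optimizer_names]
    if unknown:
        raise ValueError(
            f"Unknown optimizer stage(s): {unknown}; known: {sorted(optimizer_names)}"
        )
    return stages


names = {"bootstrap", "gepa"}
stages = parse_strategy("bootstrap -> gepa", names)

# The default-style strategy refers to keys that were never registered.
try:
    parse_strategy("p -> w -> p", names)
    default_ok = True
except ValueError:
    default_ok = False
```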
dspy.MIPROv2.
dspy.SIMBA is a lighter reflective optimizer. Try it when you want a cheaper reflective pass than GEPA, your program is simple, or you need quick exploration before committing to a full GEPA run. Keep GEPA as the default for multi-predictor programs where per-predictor feedback and Pareto candidate selection matter.
log_dir writes candidate programs + scores per round. To resume an interrupted run, point log_dir at the same directory — GEPA picks up from the last checkpoint. Inspect <log_dir>/candidates/ to see every proposed program.
track_best_outputs
With track_best_outputs=True, GEPA records, per task, the best prediction seen across all candidates. At inference time on held-out data, you can ensemble or select among the top-Pareto programs for robustness. Access via optimized.detailed_results.best_outputs_valset.
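The bookkeeping described here, keeping the best output per task across candidates, can be sketched in plain Python. The data layout is invented for illustration, the structure DSPy actually stores in best_outputs_valset may differ, and `best_outputs_per_task` is a hypothetical helper.

```python
def best_outputs_per_task(runs):
    """Keep, for each task index, the highest-scoring output seen so far.

    `runs` maps candidate name -> list of (score, output), one per task.
    Returns {task_index: (score, output, candidate_name)}.
    """
    best = {}
    for candidate, results in runs.items():
        for task, (score, output) in enumerate(results):
            # Strict ">" keeps the earliest candidate on ties.
            if task not in best or score > best[task][0]:
                best[task] = (score, output, candidate)
    return best


# Hypothetical scores and outputs for two candidates on two tasks.
runs = {
    "cand_a": [(0.9, "A0"), (0.3, "A1")],
    "cand_b": [(0.5, "B0"), (0.8, "B1")],
}
best = best_outputs_per_task(runs)
```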
- reflection_lm = small model: it can't critique; use the strongest LM you can afford for this role.
- auto="heavy" on an untested metric: you burn money to learn the metric was bugged. Run auto="light" first.
- No log_dir: losing a 4-hour run to a disconnect is very painful.
reflection_lm is required at construction, not compile
dspy.GEPA(...) asserts that reflection_lm is not None (or that a custom instruction_proposer is supplied) at init time, so you cannot defer it to .compile(). If you see
AssertionError: GEPA requires a reflection language model...
add reflection_lm=dspy.LM("openai/gpt-5", temperature=1.0, max_tokens=32000) to the constructor, or substitute the strongest instruction-following model available on your provider. dspy.LM(...) is a cheap stub until you actually call it, so constructing one doesn't hit the network.
dspy-evaluation-harness.
dspy-advanced-workflow.
npx claudepluginhub intertwine/dspy-agent-skills --plugin dspy-agent-skills
Sequences DSPy prompt and weight optimizers (e.g., GEPA, BootstrapFinetune) into evaluated strategies like "p -> w -> p" and returns the best candidate program.
Orchestrates full DSPy 3.2.x pipeline: spec → program → metric → baseline → GEPA optimize → export → deploy. Delegates to companion skills for each step.
Optimizes project's target file using GEPA algorithm: proposes candidates, evaluates in isolated git worktrees with benchmarks and gates until budget or stall.