Skill

dspy-gepa-reflective

Optimizes agentic systems with DSPy's GEPA optimizer using LLM reflection on execution traces and Pareto-based search. For multi-step agents needing textual feedback metrics.

ai-ml

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/dspy-skills:dspy-gepa-reflective

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read, Write, Glob, Grep

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Optimize complex agentic systems using LLM reflection on full execution traces with Pareto-based evolutionary search.

SKILL.md

206 lines · ~1.8k tokens

Stats

Language: Python

Stars: 78

Forks: 10

Maintenance: Good

Last Commit: Jun 1, 2026

DSPy GEPA Optimizer

Goal

Optimize complex agentic systems using LLM reflection on full execution traces with Pareto-based evolutionary search.

When to Use

Agentic systems with tool use
When you have rich textual feedback on failures
Complex multi-step workflows
When only instructions (not few-shot demos) need to be optimized

Related Skills

For non-agentic programs: dspy-miprov2-optimizer, dspy-bootstrap-fewshot
Measure improvements: dspy-evaluation-suite

Inputs

Input	Type	Description
`program`	`dspy.Module`	Agent or complex program
`trainset`	`list[dspy.Example]`	Training examples
`metric`	`callable`	Accepts five arguments and returns `dspy.Prediction(score=..., feedback=...)`
`reflection_lm`	`dspy.LM`	Strong LM used for reflection (e.g., GPT-4o)
`auto`	`str`	Optimization budget preset: "light", "medium", or "heavy"

Outputs

Output	Type	Description
`compiled_program`	`dspy.Module`	Reflectively optimized program

Workflow

Phase 1: Define Feedback Metric

GEPA reflects on textual feedback, so the metric should return a score together with a feedback string:

import dspy

def gepa_metric(example, pred, trace=None, pred_name=None, pred_trace=None):
    """Return score and actionable feedback for GEPA reflection."""
    is_correct = example.answer.lower() in pred.answer.lower()
    
    if is_correct:
        feedback = "Correct. The answer accurately addresses the question."
    else:
        feedback = f"Incorrect. Expected '{example.answer}' but got '{pred.answer}'. The model may have misunderstood the question or retrieved irrelevant information."
    
    return dspy.Prediction(score=float(is_correct), feedback=feedback)
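Feedback tends to help the reflection LM most when it says what was wrong, not only that something was wrong. The sketch below shows one possible way to build such a message in plain Python. `describe_mismatch` is a hypothetical helper, not part of DSPy; its two return values would be wrapped in `dspy.Prediction(score=..., feedback=...)` inside a metric like the one above.

```python
def describe_mismatch(expected: str, actual: str) -> tuple[float, str]:
    """Return (score, feedback) naming missing and unexpected terms.

    Hypothetical helper for illustration; not a DSPy API.
    """
    expected_norm = expected.lower().strip()
    actual_norm = actual.lower().strip()

    # Same containment check as the metric above, guarded against empty input.
    if expected_norm and expected_norm in actual_norm:
        return 1.0, "Correct. The answer contains the expected text."

    expected_terms = set(expected_norm.split())
    actual_terms = set(actual_norm.split())
    missing = sorted(expected_terms - actual_terms)
    extra = sorted(actual_terms - expected_terms)

    parts = [f"Incorrect. Expected '{expected}' but got '{actual}'."]
    if missing:
        parts.append(f"Missing terms: {', '.join(missing)}.")
    if extra:
        parts.append(f"Unexpected terms: {', '.join(extra)}.")
    return 0.0, " ".join(parts)


score, feedback = describe_mismatch("Marie Curie", "Pierre Curie")
print(score, feedback)
```

Naming the missing and unexpected terms gives the reflection step something specific to act on when it proposes a revised instruction.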

Phase 2: Setup Agent

import dspy

def search(query: str) -> list[str]:
    """Search knowledge base for relevant information."""
    rm = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
    results = rm(query, k=3)
    # ColBERTv2 returns passage records; return their text so the tool yields list[str]
    return [r["text"] for r in results]

def calculate(expression: str) -> float:
    """Safely evaluate mathematical expressions."""
    with dspy.PythonInterpreter() as interp:
        return interp(expression)

agent = dspy.ReAct("question -> answer", tools=[search, calculate])

Phase 3: Optimize with GEPA

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

optimizer = dspy.GEPA(
    metric=gepa_metric,
    reflection_lm=dspy.LM("openai/gpt-4o"),  # Strong model for reflection
    auto="medium"
)

compiled_agent = optimizer.compile(agent, trainset=trainset)
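The snippet above assumes a `trainset` already exists. A minimal sketch of preparing one follows, assuming raw question/answer pairs and a held-out validation split. `split_examples` and the toy `raw_pairs` data are hypothetical. The commented lines show how the pieces would likely plug into DSPy (`dspy.Example(...).with_inputs(...)` and the `valset` argument of `compile`), which should be checked against the installed version.

```python
import random


def split_examples(items: list, val_fraction: float = 0.2, seed: int = 0) -> tuple[list, list]:
    """Shuffle deterministically and split into (train, val)."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = items[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Toy data for illustration only.
raw_pairs = [
    {"question": f"What is {i} + {i}?", "answer": str(2 * i)} for i in range(10)
]
train_pairs, val_pairs = split_examples(raw_pairs, val_fraction=0.2, seed=42)
print(len(train_pairs), len(val_pairs))

# With DSPy installed, each pair would then become an example, e.g.:
#   trainset = [dspy.Example(**p).with_inputs("question") for p in train_pairs]
#   valset = [dspy.Example(**p).with_inputs("question") for p in val_pairs]
#   compiled_agent = optimizer.compile(agent, trainset=trainset, valset=valset)
```

A fixed seed keeps the split reproducible, so baseline and optimized scores are compared on the same held-out examples.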

Production Example

import dspy
from dspy.evaluate import Evaluate
import logging

logger = logging.getLogger(__name__)

class ResearchAgent(dspy.Module):
    def __init__(self):
        super().__init__()  # initialize dspy.Module state before registering submodules
        self.react = dspy.ReAct(
            "question -> answer",
            tools=[self.search, self.summarize]
        )
    
    def search(self, query: str) -> list[str]:
        """Search for relevant documents."""
        rm = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
        results = rm(query, k=5)
        # ColBERTv2 returns passage records; return their text so the tool yields list[str]
        return [r["text"] for r in results]
    
    def summarize(self, text: str) -> str:
        """Summarize long text into key points."""
        summarizer = dspy.Predict("text -> summary")
        return summarizer(text=text).summary
    
    def forward(self, question):
        return self.react(question=question)

def detailed_feedback_metric(example, pred, trace=None, pred_name=None, pred_trace=None):
    """Rich feedback for GEPA reflection."""
    expected = example.answer.lower().strip()
    actual = pred.answer.lower().strip() if pred.answer else ""
    
    # Exact match
    if expected == actual:
        return dspy.Prediction(score=1.0, feedback="Perfect match. Answer is correct and concise.")
    
    # Partial match
    # Require a non-empty answer: "" is a substring of every string
    if actual and (expected in actual or actual in expected):
        return dspy.Prediction(score=0.7, feedback=f"Partial match. Expected '{example.answer}', got '{pred.answer}'. Answer contains correct info but may be verbose or incomplete.")
    
    # Check for key terms
    expected_terms = set(expected.split())
    actual_terms = set(actual.split())
    overlap = len(expected_terms & actual_terms) / max(len(expected_terms), 1)
    
    if overlap > 0.5:
        return dspy.Prediction(score=0.5, feedback=f"Some overlap. Expected '{example.answer}', got '{pred.answer}'. Key terms present but answer structure differs.")
    
    return dspy.Prediction(score=0.0, feedback=f"Incorrect. Expected '{example.answer}', got '{pred.answer}'. The agent may need better search queries or reasoning.")

def optimize_research_agent(trainset, devset):
    """Full GEPA optimization pipeline."""
    
    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
    
    agent = ResearchAgent()
    
    # Convert metric for evaluation (just score)
    def eval_metric(example, pred, trace=None):
        return detailed_feedback_metric(example, pred, trace).score
    
    evaluator = Evaluate(devset=devset, num_threads=8, metric=eval_metric)
    baseline = evaluator(agent)
    # In DSPy 3.x, Evaluate returns a result object; .score is a percentage (0-100)
    logger.info("Baseline: %.2f%%", baseline.score)
    
    # GEPA optimization
    optimizer = dspy.GEPA(
        metric=detailed_feedback_metric,
        reflection_lm=dspy.LM("openai/gpt-4o"),
        auto="medium"
    )
    
    compiled = optimizer.compile(agent, trainset=trainset)
    optimized = evaluator(compiled)
    logger.info("Optimized: %.2f%%", optimized.score)
    
    compiled.save("research_agent_gepa.json")
    return compiled
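To make the scoring tiers in `detailed_feedback_metric` concrete, the sketch below re-implements only the score logic in plain Python so it can be run without an LM. `tier_score` is a hypothetical stand-in, not part of the skill or of DSPy. It guards against empty predictions, since an empty string is a substring of every string and would otherwise land in the partial-match tier.

```python
def tier_score(expected: str, actual: str) -> float:
    """Score-only version of the tiered metric: 1.0, 0.7, 0.5, or 0.0."""
    expected = expected.lower().strip()
    actual = (actual or "").lower().strip()

    # Tier 1: exact match after normalization.
    if expected == actual:
        return 1.0

    # Tier 2: one string contains the other (non-empty answers only).
    if actual and (expected in actual or actual in expected):
        return 0.7

    # Tier 3: more than half of the expected terms appear in the answer.
    expected_terms = set(expected.split())
    overlap = len(expected_terms & set(actual.split())) / max(len(expected_terms), 1)
    return 0.5 if overlap > 0.5 else 0.0


for answer in ["Paris", "The capital is Paris", "", "London"]:
    print(repr(answer), tier_score("Paris", answer))
```

Running a handful of hand-picked answers through the score logic like this is a cheap way to check that each tier fires where intended before spending LM calls on optimization.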

Metric Contract

GEPA metrics must accept (gold, pred, trace, pred_name, pred_trace). Return dspy.Prediction(score=..., feedback=...) when textual feedback is available. Do not pass enable_tool_optimization; it is not a DSPy 3.2.1 GEPA constructor argument.
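Since a metric written for GEPA may return a prediction-like object carrying `score` and `feedback`, evaluation code that only needs a number can normalize the result. The following is a minimal sketch under that assumption. `as_score_metric` and `toy_metric` are hypothetical names, and `SimpleNamespace` stands in for `dspy.Prediction` and `dspy.Example` so the example runs without DSPy.

```python
from types import SimpleNamespace


def as_score_metric(feedback_metric):
    """Wrap a five-argument feedback metric as a three-argument score-only metric."""

    def score_only(example, pred, trace=None):
        result = feedback_metric(example, pred, trace, None, None)
        # Accept either a plain number or an object exposing `.score`.
        return float(getattr(result, "score", result))

    return score_only


def toy_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Stand-in metric following the five-argument contract."""
    ok = gold.answer == pred.answer
    return SimpleNamespace(score=float(ok), feedback="Correct." if ok else "Incorrect.")


gold = SimpleNamespace(answer="42")
eval_metric = as_score_metric(toy_metric)
print(eval_metric(gold, SimpleNamespace(answer="42")))
print(eval_metric(gold, SimpleNamespace(answer="41")))
```

Keeping one feedback metric and deriving the score-only version from it avoids the two drifting apart.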

Best Practices

Rich feedback - Specific, actionable feedback gives the reflection LM more to work with
Strong reflection LM - Use a strong model such as GPT-4o or Claude for reflection
Agentic focus - Best suited to ReAct agents and multi-tool systems
Trace analysis - GEPA reflects on full execution trajectories, not only final answers

Limitations

Requires custom feedback metrics (not just scores)
Expensive: uses strong LM for reflection
Newer optimizer, less battle-tested than MIPROv2
Best for instruction optimization, less for demos

Official Documentation

DSPy Documentation: https://dspy.ai/
DSPy GitHub: https://github.com/stanfordnlp/dspy
GEPA Optimizer: https://dspy.ai/api/optimizers/GEPA/
Agents Guide: https://dspy.ai/tutorials/agents/
