
dspy-gepa-reflective

npx machina-cli add skill OmidZamani/dspy-skills/dspy-gepa-reflective --openclaw

DSPy GEPA Optimizer

Goal

Optimize complex agentic systems using LLM reflection on full execution traces with Pareto-based evolutionary search.

When to Use

  • Agentic systems with tool use
  • When you have rich textual feedback on failures
  • Complex multi-step workflows
  • Instruction-only optimization needed

Inputs

| Input | Type | Description |
| --- | --- | --- |
| program | dspy.Module | Agent or complex program |
| trainset | list[dspy.Example] | Training examples |
| metric | callable | Must return a (score, feedback) tuple |
| reflection_lm | dspy.LM | Strong LM for reflection (e.g., GPT-4o) |
| auto | str | "light", "medium", or "heavy" |

Outputs

| Output | Type | Description |
| --- | --- | --- |
| compiled_program | dspy.Module | Reflectively optimized program |

Workflow

Phase 1: Define Feedback Metric

GEPA requires metrics that return textual feedback:

def gepa_metric(example, pred, trace=None):
    """Must return (score, feedback) tuple."""
    is_correct = example.answer.lower() in pred.answer.lower()
    
    if is_correct:
        feedback = "Correct. The answer accurately addresses the question."
    else:
        feedback = f"Incorrect. Expected '{example.answer}' but got '{pred.answer}'. The model may have misunderstood the question or retrieved irrelevant information."
    
    return float(is_correct), feedback  # numeric score + textual feedback
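As a quick sanity check, the metric can be exercised with lightweight stand-in objects. The SimpleNamespace stand-ins are only for illustration; in practice GEPA passes real dspy.Example and prediction objects:

```python
from types import SimpleNamespace

def gepa_metric(example, pred, trace=None):
    """Must return (score, feedback) tuple."""
    is_correct = example.answer.lower() in pred.answer.lower()
    if is_correct:
        feedback = "Correct. The answer accurately addresses the question."
    else:
        feedback = f"Incorrect. Expected '{example.answer}' but got '{pred.answer}'."
    return float(is_correct), feedback

# Stand-ins exposing the .answer attribute the metric reads
example = SimpleNamespace(answer="Paris")
good = SimpleNamespace(answer="The capital of France is Paris.")
bad = SimpleNamespace(answer="Lyon")

print(gepa_metric(example, good))  # (1.0, 'Correct. ...')
print(gepa_metric(example, bad))   # (0.0, "Incorrect. ...")
```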

Phase 2: Setup Agent

import dspy

def search(query: str) -> list[str]:
    """Search knowledge base for relevant information."""
    rm = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
    results = rm(query, k=3)
    return results if isinstance(results, list) else [results]

def calculate(expression: str) -> float:
    """Safely evaluate mathematical expressions."""
    with dspy.PythonInterpreter() as interp:
        return interp(expression)

agent = dspy.ReAct("question -> answer", tools=[search, calculate])

Phase 3: Optimize with GEPA

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

optimizer = dspy.GEPA(
    metric=gepa_metric,
    reflection_lm=dspy.LM("openai/gpt-4o"),  # Strong model for reflection
    auto="medium"
)

compiled_agent = optimizer.compile(agent, trainset=trainset)

Production Example

import dspy
from dspy.evaluate import Evaluate
import logging

logger = logging.getLogger(__name__)

class ResearchAgent(dspy.Module):
    def __init__(self):
        super().__init__()  # initialize dspy.Module internals
        self.react = dspy.ReAct(
            "question -> answer",
            tools=[self.search, self.summarize]
        )
    
    def search(self, query: str) -> list[str]:
        """Search for relevant documents."""
        rm = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
        results = rm(query, k=5)
        return results if isinstance(results, list) else [results]
    
    def summarize(self, text: str) -> str:
        """Summarize long text into key points."""
        summarizer = dspy.Predict("text -> summary")
        return summarizer(text=text).summary
    
    def forward(self, question):
        return self.react(question=question)

def detailed_feedback_metric(example, pred, trace=None):
    """Rich feedback for GEPA reflection."""
    expected = example.answer.lower().strip()
    actual = pred.answer.lower().strip() if pred.answer else ""
    
    # Exact match
    if expected == actual:
        return 1.0, "Perfect match. Answer is correct and concise."
    
    # Partial match
    if expected in actual or actual in expected:
        return 0.7, f"Partial match. Expected '{example.answer}', got '{pred.answer}'. Answer contains correct info but may be verbose or incomplete."
    
    # Check for key terms
    expected_terms = set(expected.split())
    actual_terms = set(actual.split())
    overlap = len(expected_terms & actual_terms) / max(len(expected_terms), 1)
    
    if overlap > 0.5:
        return 0.5, f"Some overlap. Expected '{example.answer}', got '{pred.answer}'. Key terms present but answer structure differs."
    
    return 0.0, f"Incorrect. Expected '{example.answer}', got '{pred.answer}'. The agent may need better search queries or reasoning."
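The key-term tier above is a plain bag-of-words overlap ratio; it can be checked standalone without DSPy (the helper name term_overlap is introduced here for illustration, not part of the skill):

```python
def term_overlap(expected: str, actual: str) -> float:
    """Fraction of expected terms that also appear in the actual answer."""
    expected_terms = set(expected.lower().split())
    actual_terms = set(actual.lower().split())
    # Same ratio detailed_feedback_metric compares against the 0.5 threshold
    return len(expected_terms & actual_terms) / max(len(expected_terms), 1)

print(term_overlap("the battle of hastings", "battle of hastings in 1066"))  # 0.75
print(term_overlap("paris", "lyon"))  # 0.0
```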

def optimize_research_agent(trainset, devset):
    """Full GEPA optimization pipeline."""
    
    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
    
    agent = ResearchAgent()
    
    # Convert metric for evaluation (just score)
    def eval_metric(example, pred, trace=None):
        score, _ = detailed_feedback_metric(example, pred, trace)
        return score
    
    evaluator = Evaluate(devset=devset, num_threads=8, metric=eval_metric)
    baseline = evaluator(agent)
    logger.info(f"Baseline: {baseline:.1f}%")  # dspy.Evaluate returns a 0-100 score
    
    # GEPA optimization
    optimizer = dspy.GEPA(
        metric=detailed_feedback_metric,
        reflection_lm=dspy.LM("openai/gpt-4o"),
        auto="medium",
        enable_tool_optimization=True  # Also optimize tool descriptions
    )
    
    compiled = optimizer.compile(agent, trainset=trainset)
    optimized = evaluator(compiled)
    logger.info(f"Optimized: {optimized:.1f}%")
    
    compiled.save("research_agent_gepa.json")
    return compiled

Tool Optimization

GEPA can jointly optimize predictor instructions AND tool descriptions:

optimizer = dspy.GEPA(
    metric=gepa_metric,
    reflection_lm=dspy.LM("openai/gpt-4o"),
    auto="medium",
    enable_tool_optimization=True  # Optimize tool docstrings too
)

Best Practices

  1. Rich feedback - the more detailed the feedback, the better the reflection
  2. Strong reflection LM - use GPT-4o or Claude for reflection
  3. Agentic focus - best suited to ReAct and multi-tool systems
  4. Trace analysis - GEPA analyzes full execution trajectories

Limitations

  • Requires custom feedback metrics (not just scores)
  • Expensive: uses strong LM for reflection
  • Newer optimizer, less battle-tested than MIPROv2
  • Best for instruction optimization, less for demos

Source

git clone https://github.com/OmidZamani/dspy-skills
# skill file: skills/dspy-gepa-reflective/SKILL.md

Overview

This skill enables optimizing complex agentic systems by applying GEPA, a Pareto-based evolutionary search guided by LLM-backed reflection on full execution traces. It leverages textual feedback metrics to steer improvements in multi-step agent workflows, such as ReAct-based agents with tool use. The approach supports rich failure analysis and Pareto tradeoffs to produce a more capable, reflectively optimized program.

How This Skill Works

Define a feedback metric that returns a (score, feedback) pair for each trial. Build a ReAct-like agent with required tools (e.g., search, calculation). Configure GEPA with a reflection LM and an auto setting, then run the optimizer to produce a compiled_program that embodies reflective improvements.

When to Use It

  • Agentic systems with tool use
  • Rich textual feedback on failures
  • Complex multi-step workflows
  • Instruction-only optimization needed
  • When optimizing for tradeoffs using Pareto-based search

Quick Start

  1. Define a textual feedback metric that returns (score, feedback).
  2. Build your ReAct-based agent with the tools it needs, such as a search function and a calculator.
  3. Configure GEPA with a reflection LM and an auto setting, then compile: compiled_agent = optimizer.compile(agent, trainset=trainset)

Best Practices

  • Define a clear textual feedback metric that always returns (score, feedback).
  • Use a strong reflection LM (e.g., GPT-4o) to extract actionable insights from traces.
  • Start with a balanced auto setting (e.g., auto = 'medium') to manage cost and exploration.
  • Provide varied trainset data including edge cases and failure modes.
  • Validate the optimized program on held-out scenarios and compare Pareto fronts before/after.

Example Use Cases

  • A ResearchAgent built on ReAct with search and summarize tools, refined by GEPA to improve factuality and responsiveness.
  • Using rich textual feedback to fix failures in a multi-step QA workflow.
  • Optimizing a pipeline that joins search results with a summarizer to improve key-point extraction.
  • Evolving planning and action selection components via reflective guidance from a strong LM.
  • Producing a production-ready compiled_agent after Phase 3 GEPA optimization and evaluation.
