
dspy-gepa-reflective

npx machina-cli add skill OmidZamani/dspy-skills/dspy-gepa-reflective --openclaw

DSPy GEPA Optimizer

Goal

Optimize complex agentic systems using LLM reflection on full execution traces with Pareto-based evolutionary search.

When to Use

  • Agentic systems with tool use
  • When you have rich textual feedback on failures
  • Complex multi-step workflows
  • Instruction-only optimization needed

Inputs

| Input | Type | Description |
| --- | --- | --- |
| program | dspy.Module | Agent or complex program |
| trainset | list[dspy.Example] | Training examples |
| metric | callable | Must return a (score, feedback) tuple |
| reflection_lm | dspy.LM | Strong LM for reflection (e.g., GPT-4o) |
| auto | str | "light", "medium", or "heavy" |

Outputs

| Output | Type | Description |
| --- | --- | --- |
| compiled_program | dspy.Module | Reflectively optimized program |

Workflow

Phase 1: Define Feedback Metric

GEPA requires metrics that return textual feedback:

def gepa_metric(example, pred, trace=None):
    """Must return (score, feedback) tuple."""
    is_correct = example.answer.lower() in pred.answer.lower()
    
    if is_correct:
        feedback = "Correct. The answer accurately addresses the question."
    else:
        feedback = f"Incorrect. Expected '{example.answer}' but got '{pred.answer}'. The model may have misunderstood the question or retrieved irrelevant information."
    
    return float(is_correct), feedback  # numeric score + textual feedback
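As a quick sanity check, the metric can be exercised with lightweight stand-in objects. The SimpleNamespace stand-ins are only for illustration; in practice GEPA passes real dspy.Example and prediction objects:

```python
from types import SimpleNamespace

def gepa_metric(example, pred, trace=None):
    """Must return (score, feedback) tuple."""
    is_correct = example.answer.lower() in pred.answer.lower()
    if is_correct:
        feedback = "Correct. The answer accurately addresses the question."
    else:
        feedback = f"Incorrect. Expected '{example.answer}' but got '{pred.answer}'."
    return float(is_correct), feedback

# Stand-ins exposing the .answer attribute the metric reads
example = SimpleNamespace(answer="Paris")
good = SimpleNamespace(answer="The capital of France is Paris.")
bad = SimpleNamespace(answer="Lyon")

print(gepa_metric(example, good))  # (1.0, 'Correct. ...')
print(gepa_metric(example, bad))   # (0.0, "Incorrect. ...")
```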

Phase 2: Setup Agent

import dspy

def search(query: str) -> list[str]:
    """Search knowledge base for relevant information."""
    rm = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
    results = rm(query, k=3)
    return results if isinstance(results, list) else [results]

def calculate(expression: str) -> float:
    """Safely evaluate mathematical expressions."""
    with dspy.PythonInterpreter() as interp:
        return interp(expression)

agent = dspy.ReAct("question -> answer", tools=[search, calculate])

Phase 3: Optimize with GEPA

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

optimizer = dspy.GEPA(
    metric=gepa_metric,
    reflection_lm=dspy.LM("openai/gpt-4o"),  # Strong model for reflection
    auto="medium"
)

compiled_agent = optimizer.compile(agent, trainset=trainset)

Production Example

import dspy
from dspy.evaluate import Evaluate
import logging

logger = logging.getLogger(__name__)

class ResearchAgent(dspy.Module):
    def __init__(self):
        super().__init__()  # initialize dspy.Module internals
        self.react = dspy.ReAct(
            "question -> answer",
            tools=[self.search, self.summarize]
        )
    
    def search(self, query: str) -> list[str]:
        """Search for relevant documents."""
        rm = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
        results = rm(query, k=5)
        return results if isinstance(results, list) else [results]
    
    def summarize(self, text: str) -> str:
        """Summarize long text into key points."""
        summarizer = dspy.Predict("text -> summary")
        return summarizer(text=text).summary
    
    def forward(self, question):
        return self.react(question=question)

def detailed_feedback_metric(example, pred, trace=None):
    """Rich feedback for GEPA reflection."""
    expected = example.answer.lower().strip()
    actual = pred.answer.lower().strip() if pred.answer else ""
    
    # Exact match
    if expected == actual:
        return 1.0, "Perfect match. Answer is correct and concise."
    
    # Partial match
    if expected in actual or actual in expected:
        return 0.7, f"Partial match. Expected '{example.answer}', got '{pred.answer}'. Answer contains correct info but may be verbose or incomplete."
    
    # Check for key terms
    expected_terms = set(expected.split())
    actual_terms = set(actual.split())
    overlap = len(expected_terms & actual_terms) / max(len(expected_terms), 1)
    
    if overlap > 0.5:
        return 0.5, f"Some overlap. Expected '{example.answer}', got '{pred.answer}'. Key terms present but answer structure differs."
    
    return 0.0, f"Incorrect. Expected '{example.answer}', got '{pred.answer}'. The agent may need better search queries or reasoning."
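The key-term tier above is a plain bag-of-words overlap ratio; it can be checked standalone without DSPy (the helper name term_overlap is introduced here for illustration, not part of the skill):

```python
def term_overlap(expected: str, actual: str) -> float:
    """Fraction of expected terms that also appear in the actual answer."""
    expected_terms = set(expected.lower().split())
    actual_terms = set(actual.lower().split())
    # Same ratio detailed_feedback_metric compares against the 0.5 threshold
    return len(expected_terms & actual_terms) / max(len(expected_terms), 1)

print(term_overlap("the battle of hastings", "battle of hastings in 1066"))  # 0.75
print(term_overlap("paris", "lyon"))  # 0.0
```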

def optimize_research_agent(trainset, devset):
    """Full GEPA optimization pipeline."""
    
    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
    
    agent = ResearchAgent()
    
    # Convert metric for evaluation (just score)
    def eval_metric(example, pred, trace=None):
        score, _ = detailed_feedback_metric(example, pred, trace)
        return score
    
    evaluator = Evaluate(devset=devset, num_threads=8, metric=eval_metric)
    baseline = evaluator(agent)
    logger.info(f"Baseline: {baseline:.1f}%")  # dspy.Evaluate returns a 0-100 score
    
    # GEPA optimization
    optimizer = dspy.GEPA(
        metric=detailed_feedback_metric,
        reflection_lm=dspy.LM("openai/gpt-4o"),
        auto="medium",
        enable_tool_optimization=True  # Also optimize tool descriptions
    )
    
    compiled = optimizer.compile(agent, trainset=trainset)
    optimized = evaluator(compiled)
    logger.info(f"Optimized: {optimized:.1f}%")
    
    compiled.save("research_agent_gepa.json")
    return compiled

Tool Optimization

GEPA can jointly optimize predictor instructions AND tool descriptions:

optimizer = dspy.GEPA(
    metric=gepa_metric,
    reflection_lm=dspy.LM("openai/gpt-4o"),
    auto="medium",
    enable_tool_optimization=True  # Optimize tool docstrings too
)

Best Practices

  1. Rich feedback - the more detailed the feedback, the better the reflection
  2. Strong reflection LM - use GPT-4o or Claude for reflection
  3. Agentic focus - best suited to ReAct and multi-tool systems
  4. Trace analysis - GEPA analyzes full execution trajectories

Limitations

  • Requires custom feedback metrics (not just scores)
  • Expensive: uses strong LM for reflection
  • Newer optimizer, less battle-tested than MIPROv2
  • Best for instruction optimization, less for demos

Source

git clone https://github.com/OmidZamani/dspy-skills
# skill file: skills/dspy-gepa-reflective/SKILL.md

Overview

This skill enables optimizing complex agentic systems by applying GEPA, a Pareto-based evolutionary search guided by LLM-backed reflection on full execution traces. It leverages textual feedback metrics to steer improvements in multi-step agent workflows, such as ReAct-based agents with tool use. The approach supports rich failure analysis and Pareto tradeoffs to produce a more capable, reflectively optimized program.

How This Skill Works

Define a feedback metric that returns a (score, feedback) pair for each trial. Build a ReAct-like agent with required tools (e.g., search, calculation). Configure GEPA with a reflection LM and an auto setting, then run the optimizer to produce a compiled_program that embodies reflective improvements.

When to Use It

  • Agentic systems with tool use
  • Rich textual feedback on failures
  • Complex multi-step workflows
  • Instruction-only optimization needed
  • When optimizing for tradeoffs using Pareto-based search

Quick Start

  1. Define a textual feedback metric that returns (score, feedback).
  2. Build your ReAct-based agent with the tools it needs, such as a search function and a calculator.
  3. Configure GEPA with a reflection LM and an auto setting, then compile: compiled_agent = optimizer.compile(agent, trainset=trainset)

Best Practices

  • Define a clear textual feedback metric that always returns (score, feedback).
  • Use a strong reflection LM (e.g., GPT-4o) to extract actionable insights from traces.
  • Start with a balanced auto setting (e.g., auto = 'medium') to manage cost and exploration.
  • Provide varied trainset data including edge cases and failure modes.
  • Validate the optimized program on held-out scenarios and compare Pareto fronts before/after.

Example Use Cases

  • A ResearchAgent built on ReAct with search and summarize tools, refined by GEPA to improve factuality and responsiveness.
  • Using rich textual feedback to fix failures in a multi-step QA workflow.
  • Optimizing a pipeline that joins search results with a summarizer to improve key-point extraction.
  • Evolving planning and action selection components via reflective guidance from a strong LM.
  • Producing a production-ready compiled_agent after Phase 3 GEPA optimization and evaluation.
