
DSPy Evaluation Suite

Goal

Systematically evaluate DSPy programs using built-in and custom metrics with parallel execution.

When to Use

  • Measuring program performance before/after optimization
  • Comparing different program variants
  • Establishing baselines
  • Validating production readiness


Inputs

Input         Type                 Description
program       dspy.Module          Program to evaluate
devset        list[dspy.Example]   Evaluation examples
metric        callable             Scoring function
num_threads   int                  Parallel threads

Outputs

Output     Type    Description
score      float   Average metric score
results    list    Per-example results
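
The devset is just a list of dspy.Example objects. A minimal construction sketch, assuming a QA task; the question/answer field names are placeholders chosen to match the metrics below:

import dspy

# Hypothetical QA examples; with_inputs() marks which fields the program receives
devset = [
    dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
    dspy.Example(question="What is the capital of Japan?", answer="Tokyo").with_inputs("question"),
]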

Workflow

Phase 1: Setup Evaluator

from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=my_metric,
    num_threads=8,
    display_progress=True
)

Phase 2: Run Evaluation

result = evaluator(my_program)
print(f"Score: {result.score:.2f}%")
# Access individual results: (example, prediction, score) tuples
for example, pred, score in result.results[:3]:
    print(f"Example: {example.question[:50]}... Score: {score}")

Built-in Metrics

answer_exact_match

import dspy

# Normalized, case-insensitive comparison
metric = dspy.evaluate.answer_exact_match
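
A usage sketch wiring the built-in metric into Evaluate, assuming my_program is a dspy.Module whose prediction exposes an answer field:

from dspy.evaluate import Evaluate
import dspy

evaluator = Evaluate(devset=devset, metric=dspy.evaluate.answer_exact_match, num_threads=4)
result = evaluator(my_program)  # compares example.answer against pred.answer after normalization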

SemanticF1

LLM-based semantic evaluation:

from dspy.evaluate import SemanticF1

semantic = SemanticF1()
score = semantic(example, prediction)
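
SemanticF1 can also be passed directly as the metric to Evaluate; a sketch (note that every scored example triggers additional LM calls, per the Limitations below):

from dspy.evaluate import Evaluate, SemanticF1

semantic = SemanticF1()
evaluator = Evaluate(devset=devset, metric=semantic, num_threads=4, display_progress=True)
result = evaluator(my_program)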

Custom Metrics

Basic Metric

def exact_match(example, pred, trace=None):
    """Returns bool, int, or float."""
    return example.answer.lower().strip() == pred.answer.lower().strip()

Multi-Factor Metric

def quality_metric(example, pred, trace=None):
    """Score based on multiple factors."""
    score = 0.0
    
    # Correctness (50%)
    if example.answer.lower() in pred.answer.lower():
        score += 0.5
    
    # Conciseness (25%)
    if len(pred.answer.split()) <= 20:
        score += 0.25
    
    # Has reasoning (25%)
    if hasattr(pred, 'reasoning') and pred.reasoning:
        score += 0.25
    
    return score

GEPA-Compatible Metric

def feedback_metric(example, pred, trace=None):
    """Returns (score, feedback) for GEPA optimizer."""
    correct = example.answer.lower() in pred.answer.lower()
    
    if correct:
        return 1.0, "Correct answer provided."
    else:
        return 0.0, f"Expected '{example.answer}', got '{pred.answer}'"
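
Evaluate aggregates a single numeric score per example, so a tuple-returning metric like this one needs a thin wrapper if you want to reuse it outside the optimizer; a minimal sketch (the wrapper name is ours):

def feedback_metric_score_only(example, pred, trace=None):
    # Keep only the scalar score; drop the feedback string meant for GEPA
    score, _feedback = feedback_metric(example, pred, trace)
    return score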

Production Example

import dspy
from dspy.evaluate import Evaluate, SemanticF1
import json
import logging
from typing import Optional
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class EvaluationResult:
    score: float
    num_examples: int
    correct: int
    incorrect: int
    errors: int

def comprehensive_metric(example, pred, trace=None) -> float:
    """Multi-dimensional evaluation metric."""
    scores = []
    
    # 1. Correctness
    if hasattr(example, 'answer') and hasattr(pred, 'answer'):
        correct = example.answer.lower().strip() in pred.answer.lower().strip()
        scores.append(1.0 if correct else 0.0)
    
    # 2. Completeness (answer not empty or error)
    if hasattr(pred, 'answer'):
        complete = len(pred.answer.strip()) > 0 and "error" not in pred.answer.lower()
        scores.append(1.0 if complete else 0.0)
    
    # 3. Reasoning quality (if available)
    if hasattr(pred, 'reasoning'):
        has_reasoning = len(str(pred.reasoning)) > 20
        scores.append(1.0 if has_reasoning else 0.5)
    
    return sum(scores) / len(scores) if scores else 0.0

class EvaluationSuite:
    def __init__(self, devset, num_threads=8):
        self.devset = devset
        self.num_threads = num_threads
    
    def evaluate(self, program, metric=None) -> EvaluationResult:
        """Run full evaluation with detailed results."""
        metric = metric or comprehensive_metric

        evaluator = Evaluate(
            devset=self.devset,
            metric=metric,
            num_threads=self.num_threads,
            display_progress=True
        )

        eval_result = evaluator(program)

        # Extract individual scores from results
        scores = [score for example, pred, score in eval_result.results]
        correct = sum(1 for s in scores if s >= 0.5)
        errors = sum(1 for s in scores if s == 0)

        return EvaluationResult(
            score=eval_result.score,
            num_examples=len(self.devset),
            correct=correct,
            incorrect=len(self.devset) - correct - errors,
            errors=errors
        )
    
    def compare(self, programs: dict, metric=None) -> dict:
        """Compare multiple programs."""
        results = {}
        
        for name, program in programs.items():
            logger.info(f"Evaluating: {name}")
            results[name] = self.evaluate(program, metric)
        
        # Rank by score
        ranked = sorted(results.items(), key=lambda x: x[1].score, reverse=True)
        
        print("\n=== Comparison Results ===")
        for rank, (name, result) in enumerate(ranked, 1):
            print(f"{rank}. {name}: {result.score:.2f}%")
        
        return results
    
    def export_report(self, program, output_path: str, metric=None):
        """Export detailed evaluation report."""
        result = self.evaluate(program, metric)
        
        report = {
            "summary": {
                "score": result.score,
                "total": result.num_examples,
                "correct": result.correct,
                "accuracy": result.correct / result.num_examples
            },
            "config": {
                "num_threads": self.num_threads,
                "num_examples": len(self.devset)
            }
        }
        
        with open(output_path, 'w') as f:
            json.dump(report, f, indent=2)
        
        logger.info(f"Report saved to {output_path}")
        return report

# Usage
suite = EvaluationSuite(devset, num_threads=8)

# Single evaluation
result = suite.evaluate(my_program)
print(f"Score: {result.score:.2f}%")

# Compare variants
results = suite.compare({
    "baseline": baseline_program,
    "optimized": optimized_program,
    "finetuned": finetuned_program
})
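
The report export is not exercised in the usage block above; a short sketch (the output path is arbitrary):

# Write a JSON summary of the evaluation to disk and inspect it
report = suite.export_report(my_program, "eval_report.json")
print(report["summary"])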

Best Practices

  1. Hold out test data - Never optimize on the evaluation set (see the split sketch after this list)
  2. Multiple metrics - Combine correctness, quality, efficiency
  3. Statistical significance - Use enough examples (100+)
  4. Track over time - Version control evaluation results
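
A hold-out split keeps optimization and evaluation data separate; a minimal sketch (the 60/20/20 ratios and the all_examples name are assumptions, adjust for your dataset):

import random

examples = list(all_examples)          # all_examples is your full labeled set (placeholder)
random.Random(0).shuffle(examples)     # fixed seed for a reproducible split

n = len(examples)
trainset = examples[: int(0.6 * n)]               # used by optimizers
devset   = examples[int(0.6 * n): int(0.8 * n)]   # used for iteration and evaluation
testset  = examples[int(0.8 * n):]                # held out until final validation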

Limitations

  • Metrics are task-specific; no universal measure
  • SemanticF1 requires LLM calls (cost)
  • Parallel evaluation can hit rate limits
  • Edge cases may not be captured

Official Documentation

DSPy documentation: https://dspy.ai

Source

git clone https://github.com/OmidZamani/dspy-skills.git
# Skill file: skills/dspy-evaluation-suite/SKILL.md

Overview

This skill systematically evaluates DSPy programs using built-in and custom metrics with parallel execution. It guides you through setup with Evaluate, runs your module across a devset, and returns a score plus per-example results for benchmarking performance and quality.

How This Skill Works

You supply a devset, a metric, and an optional number of threads. Instantiate Evaluate with these inputs, then call the evaluator on your dspy.Module. The result exposes a final score and per-example results for deeper analysis.
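
A compact sketch of that flow, reusing the placeholders from the Workflow section above:

from dspy.evaluate import Evaluate

evaluator = Evaluate(devset=devset, metric=my_metric, num_threads=8, display_progress=True)
result = evaluator(my_program)

print(result.score)                                 # aggregate score
for example, pred, score in result.results[:5]:     # per-example results for deeper analysis
    print(score)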

When to Use It

  • Measuring performance before and after optimization
  • Comparing different program variants
  • Establishing baselines for production readiness
  • Validating DSPy modules with built-in or custom metrics
  • Testing metric choices like answer_exact_match or SemanticF1 on RAG pipelines

Quick Start

  1. Define a devset, metric, and number of threads
  2. Create Evaluate(devset, metric, num_threads, display_progress=True)
  3. Run evaluator(my_program) and inspect result.score and result.results

Best Practices

  • Use a representative devset that mirrors real tasks
  • Choose metrics that align with your success criteria
  • Run with adequate num_threads and monitor progress
  • Compare like-for-like variants and document baselines
  • Inspect per-example results to diagnose failures and bias

Example Use Cases

  • Benchmark a DSPy program before and after optimization to quantify gains
  • Compare two module variants to select the better performing one
  • Establish a production baseline and track changes over time
  • Evaluate RAG pipelines with SemanticF1 to assess semantic quality
  • Use a GEPA-compatible metric when working with GEPA optimizers
