
DSPy Evaluation Suite

Goal

Systematically evaluate DSPy programs using built-in and custom metrics with parallel execution.

When to Use

  • Measuring program performance before/after optimization
  • Comparing different program variants
  • Establishing baselines
  • Validating production readiness


Inputs

Input         Type                 Description
program       dspy.Module          Program to evaluate
devset        list[dspy.Example]   Evaluation examples
metric        callable             Scoring function
num_threads   int                  Parallel threads

Outputs

Output     Type    Description
score      float   Average metric score
results    list    Per-example results
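
The devset is just a list of dspy.Example objects. A minimal construction sketch, assuming a QA task; the question/answer field names are placeholders chosen to match the metrics below:

import dspy

# Hypothetical QA examples; with_inputs() marks which fields the program receives
devset = [
    dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
    dspy.Example(question="What is the capital of Japan?", answer="Tokyo").with_inputs("question"),
]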

Workflow

Phase 1: Setup Evaluator

from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=my_metric,
    num_threads=8,
    display_progress=True
)

Phase 2: Run Evaluation

result = evaluator(my_program)
print(f"Score: {result.score:.2f}%")
# Access individual results: (example, prediction, score) tuples
for example, pred, score in result.results[:3]:
    print(f"Example: {example.question[:50]}... Score: {score}")

Built-in Metrics

answer_exact_match

import dspy

# Normalized, case-insensitive comparison
metric = dspy.evaluate.answer_exact_match
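
A usage sketch wiring the built-in metric into Evaluate, assuming my_program is a dspy.Module whose prediction exposes an answer field:

from dspy.evaluate import Evaluate
import dspy

evaluator = Evaluate(devset=devset, metric=dspy.evaluate.answer_exact_match, num_threads=4)
result = evaluator(my_program)  # compares example.answer against pred.answer after normalization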

SemanticF1

LLM-based semantic evaluation:

from dspy.evaluate import SemanticF1

semantic = SemanticF1()
score = semantic(example, prediction)
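
SemanticF1 can also be passed directly as the metric to Evaluate; a sketch (note that every scored example triggers additional LM calls, per the Limitations below):

from dspy.evaluate import Evaluate, SemanticF1

semantic = SemanticF1()
evaluator = Evaluate(devset=devset, metric=semantic, num_threads=4, display_progress=True)
result = evaluator(my_program)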

Custom Metrics

Basic Metric

def exact_match(example, pred, trace=None):
    """Returns bool, int, or float."""
    return example.answer.lower().strip() == pred.answer.lower().strip()

Multi-Factor Metric

def quality_metric(example, pred, trace=None):
    """Score based on multiple factors."""
    score = 0.0
    
    # Correctness (50%)
    if example.answer.lower() in pred.answer.lower():
        score += 0.5
    
    # Conciseness (25%)
    if len(pred.answer.split()) <= 20:
        score += 0.25
    
    # Has reasoning (25%)
    if hasattr(pred, 'reasoning') and pred.reasoning:
        score += 0.25
    
    return score

GEPA-Compatible Metric

def feedback_metric(example, pred, trace=None):
    """Returns (score, feedback) for GEPA optimizer."""
    correct = example.answer.lower() in pred.answer.lower()
    
    if correct:
        return 1.0, "Correct answer provided."
    else:
        return 0.0, f"Expected '{example.answer}', got '{pred.answer}'"
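
Evaluate aggregates a single numeric score per example, so a tuple-returning metric like this one needs a thin wrapper if you want to reuse it outside the optimizer; a minimal sketch (the wrapper name is ours):

def feedback_metric_score_only(example, pred, trace=None):
    # Keep only the scalar score; drop the feedback string meant for GEPA
    score, _feedback = feedback_metric(example, pred, trace)
    return score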

Production Example

import dspy
from dspy.evaluate import Evaluate, SemanticF1
import json
import logging
from typing import Optional
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class EvaluationResult:
    score: float
    num_examples: int
    correct: int
    incorrect: int
    errors: int

def comprehensive_metric(example, pred, trace=None) -> float:
    """Multi-dimensional evaluation metric."""
    scores = []
    
    # 1. Correctness
    if hasattr(example, 'answer') and hasattr(pred, 'answer'):
        correct = example.answer.lower().strip() in pred.answer.lower().strip()
        scores.append(1.0 if correct else 0.0)
    
    # 2. Completeness (answer not empty or error)
    if hasattr(pred, 'answer'):
        complete = len(pred.answer.strip()) > 0 and "error" not in pred.answer.lower()
        scores.append(1.0 if complete else 0.0)
    
    # 3. Reasoning quality (if available)
    if hasattr(pred, 'reasoning'):
        has_reasoning = len(str(pred.reasoning)) > 20
        scores.append(1.0 if has_reasoning else 0.5)
    
    return sum(scores) / len(scores) if scores else 0.0

class EvaluationSuite:
    def __init__(self, devset, num_threads=8):
        self.devset = devset
        self.num_threads = num_threads
    
    def evaluate(self, program, metric=None) -> EvaluationResult:
        """Run full evaluation with detailed results."""
        metric = metric or comprehensive_metric

        evaluator = Evaluate(
            devset=self.devset,
            metric=metric,
            num_threads=self.num_threads,
            display_progress=True
        )

        eval_result = evaluator(program)

        # Extract individual scores from results
        scores = [score for example, pred, score in eval_result.results]
        correct = sum(1 for s in scores if s >= 0.5)
        errors = sum(1 for s in scores if s == 0)

        return EvaluationResult(
            score=eval_result.score,
            num_examples=len(self.devset),
            correct=correct,
            incorrect=len(self.devset) - correct - errors,
            errors=errors
        )
    
    def compare(self, programs: dict, metric=None) -> dict:
        """Compare multiple programs."""
        results = {}
        
        for name, program in programs.items():
            logger.info(f"Evaluating: {name}")
            results[name] = self.evaluate(program, metric)
        
        # Rank by score
        ranked = sorted(results.items(), key=lambda x: x[1].score, reverse=True)
        
        print("\n=== Comparison Results ===")
        for rank, (name, result) in enumerate(ranked, 1):
            print(f"{rank}. {name}: {result.score:.2f}%")
        
        return results
    
    def export_report(self, program, output_path: str, metric=None):
        """Export detailed evaluation report."""
        result = self.evaluate(program, metric)
        
        report = {
            "summary": {
                "score": result.score,
                "total": result.num_examples,
                "correct": result.correct,
                "accuracy": result.correct / result.num_examples
            },
            "config": {
                "num_threads": self.num_threads,
                "num_examples": len(self.devset)
            }
        }
        
        with open(output_path, 'w') as f:
            json.dump(report, f, indent=2)
        
        logger.info(f"Report saved to {output_path}")
        return report

# Usage
suite = EvaluationSuite(devset, num_threads=8)

# Single evaluation
result = suite.evaluate(my_program)
print(f"Score: {result.score:.2f}%")

# Compare variants
results = suite.compare({
    "baseline": baseline_program,
    "optimized": optimized_program,
    "finetuned": finetuned_program
})
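
The report export is not exercised in the usage block above; a short sketch (the output path is arbitrary):

# Write a JSON summary of the evaluation to disk and inspect it
report = suite.export_report(my_program, "eval_report.json")
print(report["summary"])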

Best Practices

  1. Hold out test data - Never optimize on the evaluation set (see the split sketch after this list)
  2. Multiple metrics - Combine correctness, quality, efficiency
  3. Statistical significance - Use enough examples (100+)
  4. Track over time - Version control evaluation results
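
A hold-out split keeps optimization and evaluation data separate; a minimal sketch (the 60/20/20 ratios and the all_examples name are assumptions, adjust for your dataset):

import random

examples = list(all_examples)          # all_examples is your full labeled set (placeholder)
random.Random(0).shuffle(examples)     # fixed seed for a reproducible split

n = len(examples)
trainset = examples[: int(0.6 * n)]               # used by optimizers
devset   = examples[int(0.6 * n): int(0.8 * n)]   # used for iteration and evaluation
testset  = examples[int(0.8 * n):]                # held out until final validation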

Limitations

  • Metrics are task-specific; no universal measure
  • SemanticF1 requires LLM calls (cost)
  • Parallel evaluation can hit rate limits
  • Edge cases may not be captured

Official Documentation

DSPy documentation: https://dspy.ai

Source

git clone https://github.com/OmidZamani/dspy-skills.git
# Skill file: skills/dspy-evaluation-suite/SKILL.md

Overview

This skill systematically evaluates DSPy programs using built-in and custom metrics with parallel execution. It guides you through setup with Evaluate, runs your module across a devset, and returns a score plus per-example results for benchmarking performance and quality.

How This Skill Works

You supply a devset, a metric, and an optional number of threads. Instantiate Evaluate with these inputs, then call the evaluator on your dspy.Module. The result exposes a final score and per-example results for deeper analysis.
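
A compact sketch of that flow, reusing the placeholders from the Workflow section above:

from dspy.evaluate import Evaluate

evaluator = Evaluate(devset=devset, metric=my_metric, num_threads=8, display_progress=True)
result = evaluator(my_program)

print(result.score)                                 # aggregate score
for example, pred, score in result.results[:5]:     # per-example results for deeper analysis
    print(score)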

When to Use It

  • Measuring performance before and after optimization
  • Comparing different program variants
  • Establishing baselines for production readiness
  • Validating DSPy modules with built-in or custom metrics
  • Testing metric choices like answer_exact_match or SemanticF1 on RAG pipelines

Quick Start

  1. Define a devset, metric, and number of threads
  2. Create Evaluate(devset, metric, num_threads, display_progress=True)
  3. Run evaluator(my_program) and inspect result.score and result.results

Best Practices

  • Use a representative devset that mirrors real tasks
  • Choose metrics that align with your success criteria
  • Run with adequate num_threads and monitor progress
  • Compare like-for-like variants and document baselines
  • Inspect per-example results to diagnose failures and bias

Example Use Cases

  • Benchmark a DSPy program before and after optimization to quantify gains
  • Compare two module variants to select the better performing one
  • Establish a production baseline and track changes over time
  • Evaluate RAG pipelines with SemanticF1 to assess semantic quality
  • Use a GEPA-compatible metric when working with GEPA optimizers
