
eval-harness

npx machina-cli add skill a5c-ai/babysitter/eval-harness --openclaw

Eval Harness

Overview

Evaluation harness methodology adapted from the Everything Claude Code project. Provides structured frameworks for benchmarking agent performance, testing skill quality, and running regression suites.

Evaluation Types

1. Agent Performance Benchmark

  • Define test cases with known-correct outputs
  • Run agent against each test case
  • Score: accuracy, completeness, relevance
  • Compare against baseline performance
  • Track performance over time
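The benchmark loop above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the `TestCase` shape, `run_benchmark` helper, and exact-match scoring are assumptions standing in for the richer accuracy/completeness/relevance scoring described below.

```python
# Benchmark sketch: run an "agent" (any callable) against test cases with
# known-correct outputs and report the fraction that match exactly.
# TestCase and run_benchmark are illustrative names, not a published API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class TestCase:
    prompt: str
    expected: str


def run_benchmark(agent: Callable[[str], str], cases: list[TestCase]) -> float:
    """Return the fraction of cases where the agent's output matches exactly."""
    passed = sum(1 for c in cases if agent(c.prompt) == c.expected)
    return passed / len(cases)


# Toy agent and cases for demonstration only.
cases = [TestCase("2+2", "4"), TestCase("3+3", "6"), TestCase("5+5", "10")]
toy_agent = lambda prompt: str(eval(prompt))  # stand-in for a real agent
score = run_benchmark(toy_agent, cases)
```

Tracking `score` per commit gives the baseline comparison and over-time trend the list calls for.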

2. Skill Quality Testing

  • Verify skill instructions produce expected outcomes
  • Test edge cases and boundary conditions
  • Measure consistency across multiple runs
  • Check for harmful or incorrect outputs
  • Validate against ground truth

3. Regression Suite

  • Collection of previously-passing test cases
  • Run after any agent/skill modification
  • Flag regressions with before/after comparison
  • Maintain pass rate threshold (>= 95%)
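A regression check with a before/after comparison might look like the sketch below. The report shape and case dictionaries are hypothetical; the >= 95% pass-rate threshold comes from this skill's spec.

```python
# Regression-suite sketch: rerun previously-passing cases after a change,
# flag any case that flipped from pass to fail, and enforce the pass-rate bar.
PASS_THRESHOLD = 0.95


def regression_report(before: dict[str, bool], after: dict[str, bool]) -> dict:
    """before/after map test-case name -> pass status."""
    regressions = [n for n, ok in after.items() if before.get(n) and not ok]
    pass_rate = sum(after.values()) / len(after)
    return {
        "regressions": regressions,
        "pass_rate": pass_rate,
        "ok": pass_rate >= PASS_THRESHOLD and not regressions,
    }


before = {"t1": True, "t2": True, "t3": True}
after = {"t1": True, "t2": False, "t3": True}
report = regression_report(before, after)
```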

4. Process Verification

  • End-to-end process execution with known inputs
  • Verify each phase produces expected outputs
  • Check task ordering and dependency satisfaction
  • Measure total execution time
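End-to-end verification can be sketched as a phase pipeline: each phase's output is checked against an expected value, implicitly verifying ordering and dependencies, while total wall-clock time is measured. The `verify_process` helper and phase tuples are illustrative assumptions.

```python
# Process-verification sketch: run phases in order with a known input,
# check each phase's output, and measure total execution time.
import time


def verify_process(phases, initial):
    """phases: list of (name, fn, expected_output). Returns (ok, elapsed_s)."""
    start = time.perf_counter()
    value, ok = initial, True
    for name, fn, expected in phases:
        value = fn(value)
        if value != expected:  # a phase produced the wrong output
            ok = False
            break
    return ok, time.perf_counter() - start


phases = [
    ("parse", lambda s: s.split(","), ["a", "b"]),
    ("count", len, 2),
]
ok, elapsed = verify_process(phases, "a,b")
```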

Quality Scoring

Accuracy Score (0-100)

  • Correctness of output vs expected
  • Partial credit for partially correct outputs
  • Penalty for hallucinated or fabricated content

Completeness Score (0-100)

  • Coverage of required output elements
  • Missing sections flagged and scored
  • Bonus for useful additional context
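One way to realize this scoring, sketched under assumptions: coverage of required sections sets the base score, missing sections are returned explicitly, and a small capped bonus rewards extra context. The section names and bonus size are illustrative, not part of the skill.

```python
# Completeness-score sketch (0-100): coverage of required output sections,
# with missing sections flagged and a capped bonus for useful extras.
def completeness(output_sections, required, bonus_per_extra=2.0):
    missing = [s for s in required if s not in output_sections]
    base = 100.0 * (len(required) - len(missing)) / len(required)
    extras = sum(1 for s in output_sections if s not in required)
    return min(100.0, base + bonus_per_extra * extras), missing


score, missing = completeness({"summary", "steps", "refs"}, ["summary", "steps"])
```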

Consistency Score (0-100)

  • Run same input 3 times
  • Compare outputs for semantic similarity
  • Flag inconsistencies
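The three-run consistency check could be sketched like this. `difflib`'s character-level ratio is a crude stand-in for the semantic similarity the spec intends; a real harness would likely compare embeddings. Names here are illustrative.

```python
# Consistency-score sketch (0-100): run the same input three times and
# average pairwise output similarity. SequenceMatcher.ratio is a rough
# textual proxy for semantic similarity.
from difflib import SequenceMatcher
from itertools import combinations


def consistency(agent, prompt, runs=3):
    outputs = [agent(prompt) for _ in range(runs)]
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(outputs, 2)]
    return 100.0 * sum(sims) / len(sims)


stable = lambda p: "same answer"  # deterministic toy agent
score = consistency(stable, "q")
```

Outputs scoring below some similarity bar would be flagged as inconsistent.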

Composite Score

  • (accuracy * 0.4 + completeness * 0.3 + consistency * 0.3)
  • Threshold: 80 to pass
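The composite formula and pass threshold above translate directly into code; the weights (0.4/0.3/0.3) and the 80-point bar come from this skill, while the function names are illustrative.

```python
# Composite-score sketch using this skill's weights and pass threshold.
def composite(accuracy, completeness, consistency):
    return accuracy * 0.4 + completeness * 0.3 + consistency * 0.3


def passes(accuracy, completeness, consistency, threshold=80.0):
    return composite(accuracy, completeness, consistency) >= threshold


score = composite(90, 80, 70)  # 36 + 24 + 21 = 81
```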

When to Use

  • After creating new agents or skills
  • After modifying existing agents or skills
  • Periodic quality audits
  • Before promoting skills to production

Agents Used

  • Used by process-level evaluation orchestrators
  • No specific agent dependency (evaluates other agents)

Source

https://github.com/a5c-ai/babysitter/blob/main/plugins/babysitter/skills/babysit/process/methodologies/everything-claude-code/skills/eval-harness/SKILL.md


How This Skill Works

It defines test cases with known-correct outputs, runs the agent against each one, and scores accuracy, completeness, and relevance. It also tests skill quality against edge cases, runs a regression suite that flags regressions with before/after comparisons, and verifies end-to-end process outputs and execution time.


Quick Start

  1. Define test cases with known-correct outputs and baseline expectations
  2. Run the agent against each test case and collect scores (accuracy, completeness, consistency)
  3. Review the composite score, pass rate, and any regressions; iterate as needed

Best Practices

  • Define test cases with known-correct outputs and baselines
  • Maintain a pass rate threshold of >= 95% for the regression suite
  • Run end-to-end process verification and measure total execution time
  • Test edge cases and boundary conditions for skill quality
  • Flag and address harmful, incorrect, or hallucinated outputs; validate against ground truth

Example Use Cases

  • Benchmark a coding assistant by running a set of known-correct programming tasks and scoring accuracy, completeness, and relevance.
  • After updating a data extraction skill, execute the regression suite to detect regressions and verify output stability.
  • Perform skill quality testing to ensure instructions produce expected results across multiple runs.
  • Conduct end-to-end verification on a multi-step data pipeline to confirm correct task ordering and timing.
  • Aggregate composite scores to decide whether a skill is ready for production promotion.
