
eval-harness

npx machina-cli add skill a5c-ai/babysitter/eval-harness --openclaw

Eval Harness

Overview

Evaluation harness methodology adapted from the Everything Claude Code project. Provides structured frameworks for benchmarking agent performance, testing skill quality, and running regression suites.

Evaluation Types

1. Agent Performance Benchmark

  • Define test cases with known-correct outputs
  • Run agent against each test case
  • Score: accuracy, completeness, relevance
  • Compare against baseline performance
  • Track performance over time
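The benchmark loop above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the `TestCase` shape, `run_benchmark` helper, and exact-match scoring are assumptions standing in for the richer accuracy/completeness/relevance scoring described below.

```python
# Benchmark sketch: run an "agent" (any callable) against test cases with
# known-correct outputs and report the fraction that match exactly.
# TestCase and run_benchmark are illustrative names, not a published API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class TestCase:
    prompt: str
    expected: str


def run_benchmark(agent: Callable[[str], str], cases: list[TestCase]) -> float:
    """Return the fraction of cases where the agent's output matches exactly."""
    passed = sum(1 for c in cases if agent(c.prompt) == c.expected)
    return passed / len(cases)


# Toy agent and cases for demonstration only.
cases = [TestCase("2+2", "4"), TestCase("3+3", "6"), TestCase("5+5", "10")]
toy_agent = lambda prompt: str(eval(prompt))  # stand-in for a real agent
score = run_benchmark(toy_agent, cases)
```

Tracking `score` per commit gives the baseline comparison and over-time trend the list calls for.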

2. Skill Quality Testing

  • Verify skill instructions produce expected outcomes
  • Test edge cases and boundary conditions
  • Measure consistency across multiple runs
  • Check for harmful or incorrect outputs
  • Validate against ground truth

3. Regression Suite

  • Collection of previously-passing test cases
  • Run after any agent/skill modification
  • Flag regressions with before/after comparison
  • Maintain pass rate threshold (>= 95%)
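A regression check with a before/after comparison might look like the sketch below. The report shape and case dictionaries are hypothetical; the >= 95% pass-rate threshold comes from this skill's spec.

```python
# Regression-suite sketch: rerun previously-passing cases after a change,
# flag any case that flipped from pass to fail, and enforce the pass-rate bar.
PASS_THRESHOLD = 0.95


def regression_report(before: dict[str, bool], after: dict[str, bool]) -> dict:
    """before/after map test-case name -> pass status."""
    regressions = [n for n, ok in after.items() if before.get(n) and not ok]
    pass_rate = sum(after.values()) / len(after)
    return {
        "regressions": regressions,
        "pass_rate": pass_rate,
        "ok": pass_rate >= PASS_THRESHOLD and not regressions,
    }


before = {"t1": True, "t2": True, "t3": True}
after = {"t1": True, "t2": False, "t3": True}
report = regression_report(before, after)
```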

4. Process Verification

  • End-to-end process execution with known inputs
  • Verify each phase produces expected outputs
  • Check task ordering and dependency satisfaction
  • Measure total execution time
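End-to-end verification can be sketched as a phase pipeline: each phase's output is checked against an expected value, implicitly verifying ordering and dependencies, while total wall-clock time is measured. The `verify_process` helper and phase tuples are illustrative assumptions.

```python
# Process-verification sketch: run phases in order with a known input,
# check each phase's output, and measure total execution time.
import time


def verify_process(phases, initial):
    """phases: list of (name, fn, expected_output). Returns (ok, elapsed_s)."""
    start = time.perf_counter()
    value, ok = initial, True
    for name, fn, expected in phases:
        value = fn(value)
        if value != expected:  # a phase produced the wrong output
            ok = False
            break
    return ok, time.perf_counter() - start


phases = [
    ("parse", lambda s: s.split(","), ["a", "b"]),
    ("count", len, 2),
]
ok, elapsed = verify_process(phases, "a,b")
```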

Quality Scoring

Accuracy Score (0-100)

  • Correctness of output vs expected
  • Partial credit for partially correct outputs
  • Penalty for hallucinated or fabricated content

Completeness Score (0-100)

  • Coverage of required output elements
  • Missing sections flagged and scored
  • Bonus for useful additional context
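One way to realize this scoring, sketched under assumptions: coverage of required sections sets the base score, missing sections are returned explicitly, and a small capped bonus rewards extra context. The section names and bonus size are illustrative, not part of the skill.

```python
# Completeness-score sketch (0-100): coverage of required output sections,
# with missing sections flagged and a capped bonus for useful extras.
def completeness(output_sections, required, bonus_per_extra=2.0):
    missing = [s for s in required if s not in output_sections]
    base = 100.0 * (len(required) - len(missing)) / len(required)
    extras = sum(1 for s in output_sections if s not in required)
    return min(100.0, base + bonus_per_extra * extras), missing


score, missing = completeness({"summary", "steps", "refs"}, ["summary", "steps"])
```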

Consistency Score (0-100)

  • Run same input 3 times
  • Compare outputs for semantic similarity
  • Flag inconsistencies
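The three-run consistency check could be sketched like this. `difflib`'s character-level ratio is a crude stand-in for the semantic similarity the spec intends; a real harness would likely compare embeddings. Names here are illustrative.

```python
# Consistency-score sketch (0-100): run the same input three times and
# average pairwise output similarity. SequenceMatcher.ratio is a rough
# textual proxy for semantic similarity.
from difflib import SequenceMatcher
from itertools import combinations


def consistency(agent, prompt, runs=3):
    outputs = [agent(prompt) for _ in range(runs)]
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(outputs, 2)]
    return 100.0 * sum(sims) / len(sims)


stable = lambda p: "same answer"  # deterministic toy agent
score = consistency(stable, "q")
```

Outputs scoring below some similarity bar would be flagged as inconsistent.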

Composite Score

  • (accuracy * 0.4 + completeness * 0.3 + consistency * 0.3)
  • Threshold: 80 to pass
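The composite formula and pass threshold above translate directly into code; the weights (0.4/0.3/0.3) and the 80-point bar come from this skill, while the function names are illustrative.

```python
# Composite-score sketch using this skill's weights and pass threshold.
def composite(accuracy, completeness, consistency):
    return accuracy * 0.4 + completeness * 0.3 + consistency * 0.3


def passes(accuracy, completeness, consistency, threshold=80.0):
    return composite(accuracy, completeness, consistency) >= threshold


score = composite(90, 80, 70)  # 36 + 24 + 21 = 81
```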

When to Use

  • After creating new agents or skills
  • After modifying existing agents or skills
  • Periodic quality audits
  • Before promoting skills to production

Agents Used

  • Used by process-level evaluation orchestrators
  • No specific agent dependency (evaluates other agents)

Source

https://github.com/a5c-ai/babysitter/blob/main/plugins/babysitter/skills/babysit/process/methodologies/everything-claude-code/skills/eval-harness/SKILL.md


How This Skill Works

It defines test cases with known-correct outputs, runs the agent against each one, and scores accuracy, completeness, and relevance. It also tests skill quality against edge cases, runs a regression suite that flags regressions with before/after comparisons, and verifies end-to-end process outputs and execution time.


Quick Start

  1. Define test cases with known-correct outputs and baseline expectations
  2. Run the agent against each test case and collect scores (accuracy, completeness, consistency)
  3. Review the composite score, pass rate, and any regressions; iterate as needed

Best Practices

  • Define test cases with known-correct outputs and baselines
  • Maintain a pass rate threshold of >= 95% for the regression suite
  • Run end-to-end process verification and measure total execution time
  • Test edge cases and boundary conditions for skill quality
  • Flag and address harmful, incorrect, or hallucinated outputs; validate against ground truth

Example Use Cases

  • Benchmark a coding assistant by running a set of known-correct programming tasks and scoring accuracy, completeness, and relevance.
  • After updating a data extraction skill, execute the regression suite to detect regressions and verify output stability.
  • Perform skill quality testing to ensure instructions produce expected results across multiple runs.
  • Conduct end-to-end verification on a multi-step data pipeline to confirm correct task ordering and timing.
  • Aggregate composite scores to decide whether a skill is ready for production promotion.
