Where is the eval history stored?

The eval history is stored at .shinchan-docs/eval-history.jsonl.

What dimensions are tracked and how are they scored?

Four dimensions are tracked: correctness, efficiency, compliance, and quality, each scored on a 1-5 scale. The moving average uses the last 5 evaluations per agent.

What does --regression show?

--regression shows only agents with detected regressions. For each one it reports the dimension, latest score, moving average, and delta.

team-shinchan:eval

npx machina-cli add skill seokan-jeong/team-shinchan/eval --openclaw


Eval Skill

View agent evaluations, detect regressions, and compare performance.

Usage

/team-shinchan:eval                     # All agents summary
/team-shinchan:eval --agent bo          # Single agent detail
/team-shinchan:eval --regression        # Regression report only
/team-shinchan:eval --compare           # Side-by-side comparison

Arguments

Arg              | Default | Description
`--agent {name}` | (all)   | Show evaluation for a specific agent
`--regression`   | false   | Show only agents with detected regressions
`--compare`      | false   | Side-by-side comparison of all agents
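The argument table above could be turned into an options object roughly as follows. This is a minimal sketch: `parseEvalArgs` is a hypothetical helper, not part of the skill, and it assumes that a missing `--agent` means "all agents".

```javascript
// Hypothetical helper: parse the flags from the Arguments table.
// agent === null is assumed to mean "all agents" (the documented default).
function parseEvalArgs(argv) {
  const options = { agent: null, regression: false, compare: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--agent") {
      const name = argv[i + 1];
      if (name === undefined || name.startsWith("--")) {
        throw new Error("--agent requires a name");
      }
      options.agent = name;
      i++; // skip the value that was just consumed
    } else if (arg === "--regression") {
      options.regression = true;
    } else if (arg === "--compare") {
      options.compare = true;
    } else {
      throw new Error(`Unknown argument: ${arg}`);
    }
  }
  return options;
}
```

For example, `parseEvalArgs(["--agent", "bo"])` yields an object with `agent` set to `"bo"` and both boolean flags left at `false`.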

Process

Step 1: Run Regression Detection

Execute node src/regression-detect.js .shinchan-docs/eval-history.jsonl --format table

If --agent is provided, add --agent {name}.
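One way to assemble that command line programmatically is sketched below. The script path, history path, and flags come from this document; `buildDetectArgs` is a hypothetical name introduced only for illustration.

```javascript
const HISTORY_PATH = ".shinchan-docs/eval-history.jsonl";
const DETECT_SCRIPT = "src/regression-detect.js";

// Hypothetical helper: build the argument list passed to `node`.
// The --agent pair is appended only when an agent name is given.
function buildDetectArgs(agent) {
  const args = [DETECT_SCRIPT, HISTORY_PATH, "--format", "table"];
  if (agent) {
    args.push("--agent", agent);
  }
  return args;
}

// Usage (runs the real script, so it only works inside the repository):
// const { execFileSync } = require("child_process");
// const output = execFileSync("node", buildDetectArgs("bo"), { encoding: "utf8" });
```

Passing the arguments as an array to `execFileSync` avoids shell quoting issues if an agent name ever contains unusual characters.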

If the file does not exist or is empty, display:

No evaluation history found.
Evaluations are recorded automatically during auto-retrospective.
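The missing-or-empty check might look like the sketch below. It treats a file containing only blank lines as empty, which is an assumption; `hasHistory` is a hypothetical helper name.

```javascript
const fs = require("fs");

const NO_HISTORY_MESSAGE =
  "No evaluation history found.\n" +
  "Evaluations are recorded automatically during auto-retrospective.";

// Hypothetical helper: true when the history file exists and contains
// at least one non-blank line.
function hasHistory(path) {
  if (!fs.existsSync(path)) return false;
  const text = fs.readFileSync(path, "utf8");
  return text.split("\n").some((line) => line.trim() !== "");
}

// Usage:
// if (!hasHistory(".shinchan-docs/eval-history.jsonl")) {
//   console.log(NO_HISTORY_MESSAGE);
// }
```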

Step 2: Display Results

Default (all agents):

Evaluation Summary
  Agent       | Evals | Correctness | Efficiency | Compliance | Quality
  bo          |    12 |     4.2     |    4.5     |    4.0     |   4.3
  aichan      |     8 |     4.0     |    3.8     |    4.2     |   4.1
  ...

--agent (single): Show full history with trend arrows and latest notes.
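Trend arrows could be derived by comparing each score with the one before it, as in this sketch. The arrow symbols and both function names are assumptions; the skill does not say which characters it prints.

```javascript
// Hypothetical arrow choice: up, down, or flat relative to the previous score.
function trendArrow(previous, current) {
  if (current > previous) return "↑";
  if (current < previous) return "↓";
  return "→";
}

// Annotate a chronological list of scores (oldest first) with arrows.
// The first score has no predecessor, so it gets no arrow.
function withTrends(scores) {
  return scores.map((score, i) =>
    i === 0 ? `${score}` : `${score} ${trendArrow(scores[i - 1], score)}`
  );
}
```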

--regression: Filter to only agents where has_regression is true. Show dimension, latest score, moving average, and delta.
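The regression filter can be sketched as below. Only the `has_regression` flag is named in this document; the rest of the report shape (`agent`, a `regressions` array with `dimension`, `latest`, and `average`) is assumed for illustration, as is the convention that delta equals latest minus moving average.

```javascript
// Hypothetical report shape:
// { agent, has_regression, regressions: [{ dimension, latest, average }] }
function regressionRows(reports) {
  const rows = [];
  for (const report of reports) {
    if (!report.has_regression) continue; // keep only flagged agents
    for (const r of report.regressions) {
      rows.push({
        agent: report.agent,
        dimension: r.dimension,
        latest: r.latest,
        average: r.average,
        // Assumed convention: a negative delta means the latest score
        // fell below the moving average.
        delta: Number((r.latest - r.average).toFixed(2)),
      });
    }
  }
  return rows;
}
```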

--compare:

Agent Comparison (last 5 evaluations)
  Dimension    | bo   | aichan | bunta | masao
  correctness  | 4.2  | 4.0    | 3.8   | 4.5
  efficiency   | 4.5  | 3.8    | 4.2   | 4.0
  compliance   | 4.0  | 4.2    | 4.0   | 3.9
  quality      | 4.3  | 4.1    | 4.3   | 4.2
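A comparison layout like the one above amounts to pivoting per-agent averages into one row per dimension. The sketch below assumes the averages are already available as a plain object keyed by agent name; `compareTable` and the column width are hypothetical.

```javascript
const DIMENSIONS = ["correctness", "efficiency", "compliance", "quality"];

// averages: { bo: { correctness: 4.2, ... }, aichan: { ... }, ... }
// Returns a text table with one row per dimension and one column per agent.
function compareTable(averages) {
  const agents = Object.keys(averages);
  const header = ["Dimension", ...agents];
  const rows = DIMENSIONS.map((dim) => [
    dim,
    ...agents.map((agent) => averages[agent][dim].toFixed(1)),
  ]);
  return [header, ...rows]
    .map((cells) => cells.map((c) => c.padEnd(12)).join(" | ").trimEnd())
    .join("\n");
}
```

Columns appear in the insertion order of the input object's keys, so the caller controls the agent ordering.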

Step 3: Warnings

If any regressions are detected, display:

!! Regression detected for {agent} in {dimension}
   Latest: {score} | Avg: {avg} | Delta: {delta}
   Action: Review recent {agent} outputs and adjust prompts.
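Filling in the warning template is plain string interpolation. In this sketch the placeholder names (`agent`, `dimension`, `score`, `avg`, `delta`) mirror the template above, while `formatWarning` itself is a hypothetical helper.

```javascript
// Hypothetical helper: render the three-line warning from the template.
function formatWarning({ agent, dimension, score, avg, delta }) {
  return [
    `!! Regression detected for ${agent} in ${dimension}`,
    `   Latest: ${score} | Avg: ${avg} | Delta: ${delta}`,
    `   Action: Review recent ${agent} outputs and adjust prompts.`,
  ].join("\n");
}
```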

Important

Eval history: .shinchan-docs/eval-history.jsonl
Detection script: src/regression-detect.js
Dimensions: correctness, efficiency, compliance, quality (1-5 scale)
Moving average window: last 5 evaluations per agent
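To make the window concrete, the sketch below averages each dimension over an agent's last 5 records. The record shape (an `agent` field plus one numeric field per dimension, stored oldest first) is an assumption; the actual JSONL schema is not described here, and `movingAverages` is a hypothetical name.

```javascript
const DIMENSIONS = ["correctness", "efficiency", "compliance", "quality"];
const WINDOW = 5; // last 5 evaluations per agent

// Assumed record shape:
// { agent: "bo", correctness: 4, efficiency: 5, compliance: 4, quality: 4 }
// Records are assumed to be in chronological order (oldest first).
function movingAverages(records, agent) {
  const recent = records.filter((r) => r.agent === agent).slice(-WINDOW);
  if (recent.length === 0) return null; // no evaluations for this agent
  const result = {};
  for (const dim of DIMENSIONS) {
    const sum = recent.reduce((acc, r) => acc + r[dim], 0);
    result[dim] = sum / recent.length;
  }
  return result;
}
```

With fewer than 5 records the function averages whatever is available, which is one possible policy rather than documented behavior.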

Source

https://github.com/seokan-jeong/team-shinchan/blob/main/skills/eval/SKILL.md

Overview

The Eval Skill lets you view agent evaluation histories, see trend data, and identify performance regressions. It supports per-agent detail, regression-only views, and side-by-side comparisons, helping teams spot declines quickly and take action.

How This Skill Works

Evaluation data is read from the eval-history.jsonl file and analyzed with the regression-detect.js script. The tool presents a summary, optional regression filters, and comparisons, with outputs designed for quick decision making. Use --agent, --regression, or --compare to tailor the results.

When to Use It

After a feature rollout, verify overall agent performance
Identify agents with potential regressions in any dimension
Inspect full evaluation history for a single agent
Generate a regression-only report for leadership
Compare multiple agents side-by-side to benchmark performance

Quick Start

Step 1: Run Regression Detection: node src/regression-detect.js .shinchan-docs/eval-history.jsonl --format table
Step 2: Display results with the desired mode (--agent {name}, --regression, or --compare)
Step 3: Review any regression warnings and take corrective action

Best Practices

Rely on the last 5 evaluations for moving-average stability
Use --regression to surface only regressions
Review latest notes and trends before taking action
Check that the history file exists, and handle an empty history gracefully
Combine with per-agent history for context when deciding prompts

Example Use Cases

QA team checks bo and aichan after a major update
Drill into a single agent's history with --agent bo
Generate a regression report to identify underperformers
Run --compare to benchmark correctness across agents
Respond to a regression alert by reviewing recent outputs and prompts

Frequently Asked Questions
