team-shinchan:eval
npx machina-cli add skill seokan-jeong/team-shinchan/eval --openclaw
Eval Skill
View agent evaluations, detect regressions, and compare performance.
Usage
/team-shinchan:eval # All agents summary
/team-shinchan:eval --agent bo # Single agent detail
/team-shinchan:eval --regression # Regression report only
/team-shinchan:eval --compare # Side-by-side comparison
Arguments
| Arg | Default | Description |
|---|---|---|
| --agent {name} | (all) | Show evaluation for a specific agent |
| --regression | false | Show only agents with detected regressions |
| --compare | false | Side-by-side comparison of all agents |
Process
Step 1: Run Regression Detection
Execute node src/regression-detect.js .shinchan-docs/eval-history.jsonl --format table
If --agent is provided, add --agent {name}.
If file does not exist or is empty:
No evaluation history found.
Evaluations are recorded automatically during auto-retrospective.
Step 2: Display Results
Default (all agents):
Evaluation Summary
Agent | Evals | Correctness | Efficiency | Compliance | Quality
bo | 12 | 4.2 | 4.5 | 4.0 | 4.3
aichan | 8 | 4.0 | 3.8 | 4.2 | 4.1
...
--agent (single): Show full history with trend arrows and latest notes.
--regression:
Filter to only agents where has_regression is true.
Show dimension, latest score, moving average, and delta.
--compare:
Agent Comparison (last 5 evaluations)
Dimension | bo | aichan | bunta | masao
correctness | 4.2 | 4.0 | 3.8 | 4.5
efficiency | 4.5 | 3.8 | 4.2 | 4.0
compliance | 4.0 | 4.2 | 4.0 | 3.9
quality | 4.3 | 4.1 | 4.3 | 4.2
Step 3: Warnings
If any regressions detected, display:
!! Regression detected for {agent} in {dimension}
Latest: {score} | Avg: {avg} | Delta: {delta}
Action: Review recent {agent} outputs and adjust prompts.
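The regression check behind this warning (and behind the --regression filter) can be sketched as: compare the latest score in a dimension against the moving average of the preceding evaluations, using the 5-evaluation window described below. The 0.5 drop threshold is an assumption for illustration, not a value taken from regression-detect.js.

```javascript
// Hypothetical sketch of per-dimension regression detection.
// `scores` is the chronological list of a single agent's scores in one dimension.
// A regression is flagged when the latest score falls more than `threshold`
// below the moving average of the previous `window` evaluations (assumed values).
function detectRegression(scores, window = 5, threshold = 0.5) {
  if (scores.length < 2) return { has_regression: false };
  const latest = scores[scores.length - 1];
  const prior = scores.slice(-(window + 1), -1); // up to `window` earlier scores
  const avg = prior.reduce((a, b) => a + b, 0) / prior.length;
  const delta = latest - avg;
  return { has_regression: delta < -threshold, latest, avg, delta };
}
```

The returned `latest`, `avg`, and `delta` fields map directly onto the warning line format shown above.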
Important
- Eval history: .shinchan-docs/eval-history.jsonl
- Detection script: src/regression-detect.js
- Dimensions: correctness, efficiency, compliance, quality (1-5 scale)
- Moving average window: last 5 evaluations per agent
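Given the dimensions and history file above, the per-agent summary from Step 2 could be derived as sketched below. The JSONL record shape shown is an assumption; the real eval-history.jsonl field names may differ.

```javascript
// Hypothetical eval-history.jsonl records: one JSON object per line,
// keyed by agent with one score per dimension (assumed schema).
const sample = [
  '{"agent":"bo","correctness":4.2,"efficiency":4.5,"compliance":4.0,"quality":4.3}',
  '{"agent":"bo","correctness":4.0,"efficiency":4.3,"compliance":4.2,"quality":4.1}',
  '{"agent":"aichan","correctness":4.0,"efficiency":3.8,"compliance":4.2,"quality":4.1}',
].join("\n");

// Group records by agent and average each dimension for the summary table.
function summarize(jsonl) {
  const byAgent = {};
  for (const line of jsonl.split("\n").filter(Boolean)) {
    const rec = JSON.parse(line);
    (byAgent[rec.agent] ??= []).push(rec);
  }
  const dims = ["correctness", "efficiency", "compliance", "quality"];
  return Object.fromEntries(
    Object.entries(byAgent).map(([agent, recs]) => {
      const row = { evals: recs.length };
      for (const d of dims) {
        row[d] = recs.reduce((sum, r) => sum + r[d], 0) / recs.length;
      }
      return [agent, row];
    })
  );
}
```

In the real skill the averages would be restricted to the last 5 evaluations per agent, per the moving-average window noted above.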
Source
git clone https://github.com/seokan-jeong/team-shinchan
Skill definition: https://github.com/seokan-jeong/team-shinchan/blob/main/skills/eval/SKILL.md
Overview
The Eval Skill lets you view agent evaluation histories, see trend data, and identify performance regressions. It supports per-agent detail, regression-only views, and side-by-side comparisons, helping teams spot declines quickly and take action.
How This Skill Works
Evaluation data is read from the eval-history.jsonl file and analyzed with the regression-detect.js script. The tool presents a summary, optional regression filters, and comparisons, with outputs designed for quick decision making. Use --agent, --regression, or --compare to tailor the results.
When to Use It
- After a feature rollout, verify overall agent performance
- Identify agents with potential regressions in any dimension
- Inspect full evaluation history for a single agent
- Generate a regression-only report for leadership
- Compare multiple agents side-by-side to benchmark performance
Quick Start
- Step 1: Run Regression Detection: node src/regression-detect.js .shinchan-docs/eval-history.jsonl --format table
- Step 2: Display results with the desired mode (--agent {name}, --regression, or --compare)
- Step 3: Review any regression warnings and take corrective action
Best Practices
- Rely on the last 5 evaluations for moving-average stability
- Use --regression to surface only regressions
- Review latest notes and trends before taking action
- Validate data existence; handle empty history gracefully
- Combine with per-agent history for context when deciding prompts
Example Use Cases
- QA team checks bo and aichan after a major update
- Drill into a single agent's history with --agent bo
- Generate a regression report to identify underperformers
- Run --compare to benchmark correctness across agents
- Respond to a regression alert by reviewing recent outputs and prompts