
eval-audit

npx machina-cli add skill Goodeye-Labs/truesight-mcp-skills/eval-audit --openclaw

Eval Audit

Audit LLM evaluation practice and route gaps to the right skills.

Interactive Q&A protocol (mandatory)

Ask one question at a time with lettered options whenever practical.

Example:

What should this audit prioritize first?
A) Live evaluation quality and coverage
B) Error analysis maturity
C) Review and promotion loop health
D) End-to-end process health

Rules:

  • One question per message.
  • Prefer lettered options.
  • Ask one follow-up only if ambiguity remains.
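
The protocol is simple enough to pin down in code. Below is a minimal Python sketch of rendering one question as a single message with lettered options; the Question structure and format_question helper are illustrative names, not part of the skill:

```python
from dataclasses import dataclass
from string import ascii_uppercase


@dataclass
class Question:
    """One audit question, sent as a single message with lettered options."""
    prompt: str
    options: list[str]


def format_question(q: Question) -> str:
    # Render the prompt, then one "A) ..." style option per line.
    lines = [q.prompt]
    lines += [f"{letter}) {text}" for letter, text in zip(ascii_uppercase, q.options)]
    return "\n".join(lines)


# Reproduces the example above: one question per message, lettered options.
print(format_question(Question(
    prompt="What should this audit prioritize first?",
    options=[
        "Live evaluation quality and coverage",
        "Error analysis maturity",
        "Review and promotion loop health",
        "End-to-end process health",
    ],
)))
```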

Inputs and evidence

Collect available evidence from Truesight first:

  • datasets and dataset rows
  • live evaluations
  • evaluation runs/results
  • review queue items
  • existing evaluation criteria and deployment patterns

If evidence is missing, record that as a finding.
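
A sketch of what that evidence-gathering pass could look like, assuming a generic call_tool callable into the Truesight MCP server; the source names below are placeholders for whatever tools the server actually exposes:

```python
def gather_evidence(call_tool) -> tuple[dict, list[str]]:
    """Collect what Truesight can provide; record anything missing as a finding."""
    # Placeholder source names; substitute the tools the MCP server exposes.
    sources = [
        "datasets",
        "live_evaluations",
        "evaluation_runs",
        "review_queue_items",
        "evaluation_criteria",
    ]
    evidence, missing = {}, []
    for source in sources:
        records = call_tool(source) or []  # assume a list, possibly empty
        if records:
            evidence[source] = records
        else:
            # Per the skill: absent evidence is itself a finding, not a dead end.
            missing.append(f"No {source.replace('_', ' ')} available in Truesight")
    return evidence, missing
```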

Diagnostic areas

  1. Evaluation coverage and quality dimensions
  2. Error analysis practice and category quality
  3. Review and promotion workflow discipline
  4. Template usage versus custom needs
  5. Operational hygiene (verification, reruns, iteration cadence)

Report format (mandatory)

For each finding, include:

### <Finding title>
Status: Problem exists | OK | Cannot determine
Evidence: <specific evidence from Truesight context>
Severity: critical | high | medium | low
Recommended skill: <one of current skill set>
Next command: <concrete instruction to run next>

Order findings by severity and impact.

Severity rubric

  • critical: likely causes incorrect go/no-go decisions or severe user harm
  • high: frequent quality failures or missing control loops
  • medium: meaningful process weakness with moderate impact
  • low: optimization opportunity, documentation, or ergonomics issue
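
Putting the report format and rubric together, a minimal sketch of representing, ordering, and rendering findings; the Finding dataclass and render_report helper are illustrative, though the fields and severity ranks come straight from the format and rubric above. If impact is tracked separately, it can be added as a secondary sort key:

```python
from dataclasses import dataclass

# Rank taken from the severity rubric above; lower values sort first.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Finding:
    """One audit finding, mirroring the mandated report fields."""
    title: str
    status: str             # "Problem exists" | "OK" | "Cannot determine"
    evidence: str           # specific evidence from Truesight context
    severity: str           # key into SEVERITY_RANK
    recommended_skill: str  # one of the current skill set
    next_command: str       # concrete instruction to run next


def render_report(findings: list[Finding]) -> str:
    """Order findings by severity, then emit one mandated block per finding."""
    ordered = sorted(findings, key=lambda f: SEVERITY_RANK[f.severity])
    return "\n\n".join(
        f"### {f.title}\n"
        f"Status: {f.status}\n"
        f"Evidence: {f.evidence}\n"
        f"Severity: {f.severity}\n"
        f"Recommended skill: {f.recommended_skill}\n"
        f"Next command: {f.next_command}"
        for f in ordered
    )
```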

Handoff map

  • Missing or weak failure taxonomy -> error-analysis
  • Missing live evaluation coverage -> create-evaluation or bootstrap-template-evaluation
  • Review backlog or low judgment throughput -> review-and-promote-traces
  • Unclear starting path -> truesight-workflows
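
The map transcribes directly into a lookup table. A sketch, with gap keys paraphrased from the bullets above:

```python
# Direct transcription of the handoff map; gap keys are paraphrases,
# skill names are the ones listed above.
HANDOFF_MAP = {
    "missing_or_weak_failure_taxonomy": ["error-analysis"],
    "missing_live_evaluation_coverage": ["create-evaluation", "bootstrap-template-evaluation"],
    "review_backlog_or_low_throughput": ["review-and-promote-traces"],
    "unclear_starting_path": ["truesight-workflows"],
}


def route(gap: str) -> list[str]:
    """Return the skill(s) to hand a detected gap to; default to the workflows guide."""
    return HANDOFF_MAP.get(gap, ["truesight-workflows"])
```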

Guardrails

  • Keep scope within current Truesight MCP capabilities.

Source

git clone https://github.com/Goodeye-Labs/truesight-mcp-skills

The skill definition lives at skills/eval-audit/SKILL.md in the repository.

Overview

Audits the existing evaluation workflow, surfaces gaps, and ranks findings by severity with concrete next actions. It’s designed for teams inheriting an eval setup, diagnosing quality regressions, or assessing LLM evaluation maturity. The process relies on Truesight evidence (datasets, live evaluations, runs, review queues, criteria, and deployment patterns) and outputs a structured report that maps each finding to an actionable next step and a related skill.

How This Skill Works

It guides the auditor through an interactive Q&A protocol (mandatory) to surface priorities with one question per message and lettered options. It collects evidence from Truesight sources such as datasets, live evaluations, evaluation runs, review queues, and existing criteria. For each finding it records the status, evidence, severity, recommended skill, and a concrete next command, and then orders findings by severity and impact.

When to Use It

  • Inheriting an evaluation setup
  • Diagnosing sudden quality regressions in evaluation
  • Assessing LLM evaluation process maturity
  • Verifying evidence availability and coverage in Truesight
  • Assessing review and promotion workflow health

Quick Start

  1. Gather evidence from Truesight (datasets, live evaluations, evaluation runs, review queues, criteria/deployment patterns).
  2. Run the interactive Q&A protocol to surface gaps and capture findings with status, evidence, and severity.
  3. Compile findings in severity order and specify a concrete next command to close each gap.

Best Practices

  • Follow the mandatory Interactive Q&A protocol (one question at a time, with lettered options)
  • Collect evidence from Truesight first and record missing data as findings
  • Rank findings by severity and business impact when compiling the report
  • Use the handoff map to route gaps to the appropriate supporting skills
  • Maintain the mandated report format and field integrity for consistency

Example Use Cases

  • Missing or weak live evaluation coverage detected in Truesight; route to create-evaluation or bootstrap-template-evaluation.
  • Error-analysis maturity is shallow and evaluation results lack granular failure categories; route to error-analysis.
  • Review backlog or low judgment throughput is evident; escalate via review-and-promote-traces.
  • The starting path for a new team is unclear; route through truesight-workflows.
  • Template usage dominates testing without addressing custom needs; initiate a custom evaluation plan.

