
eval-audit

npx machina-cli add skill Goodeye-Labs/truesight-mcp-skills/eval-audit --openclaw

Eval Audit

Audit LLM evaluation practice and route gaps to the right skills.

Interactive Q&A protocol (mandatory)

Ask one question at a time with lettered options whenever practical.

Example:

What should this audit prioritize first?
A) Live evaluation quality and coverage
B) Error analysis maturity
C) Review and promotion loop health
D) End-to-end process health

Rules:

  • One question per message.
  • Prefer lettered options.
  • Ask one follow-up only if ambiguity remains.
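
The protocol is simple enough to pin down in code. Below is a minimal Python sketch of rendering one question as a single message with lettered options; the Question structure and format_question helper are illustrative names, not part of the skill:

```python
from dataclasses import dataclass
from string import ascii_uppercase


@dataclass
class Question:
    """One audit question, sent as a single message with lettered options."""
    prompt: str
    options: list[str]


def format_question(q: Question) -> str:
    # Render the prompt, then one "A) ..." style option per line.
    lines = [q.prompt]
    lines += [f"{letter}) {text}" for letter, text in zip(ascii_uppercase, q.options)]
    return "\n".join(lines)


# Reproduces the example above: one question per message, lettered options.
print(format_question(Question(
    prompt="What should this audit prioritize first?",
    options=[
        "Live evaluation quality and coverage",
        "Error analysis maturity",
        "Review and promotion loop health",
        "End-to-end process health",
    ],
)))
```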

Inputs and evidence

Collect available evidence from Truesight first:

  • datasets and dataset rows
  • live evaluations
  • evaluation runs/results
  • review queue items
  • existing evaluation criteria and deployment patterns

If evidence is missing, record that as a finding.
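
A sketch of what that evidence-gathering pass could look like, assuming a generic call_tool callable into the Truesight MCP server; the source names below are placeholders for whatever tools the server actually exposes:

```python
def gather_evidence(call_tool) -> tuple[dict, list[str]]:
    """Collect what Truesight can provide; record anything missing as a finding."""
    # Placeholder source names; substitute the tools the MCP server exposes.
    sources = [
        "datasets",
        "live_evaluations",
        "evaluation_runs",
        "review_queue_items",
        "evaluation_criteria",
    ]
    evidence, missing = {}, []
    for source in sources:
        records = call_tool(source) or []  # assume a list, possibly empty
        if records:
            evidence[source] = records
        else:
            # Per the skill: absent evidence is itself a finding, not a dead end.
            missing.append(f"No {source.replace('_', ' ')} available in Truesight")
    return evidence, missing
```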

Diagnostic areas

  1. Evaluation coverage and quality dimensions
  2. Error analysis practice and category quality
  3. Review and promotion workflow discipline
  4. Template usage versus custom needs
  5. Operational hygiene (verification, reruns, iteration cadence)

Report format (mandatory)

For each finding, include:

### <Finding title>
Status: Problem exists | OK | Cannot determine
Evidence: <specific evidence from Truesight context>
Severity: critical | high | medium | low
Recommended skill: <one of current skill set>
Next command: <concrete instruction to run next>

Order findings by severity and impact.

Severity rubric

  • critical: likely causes incorrect go/no-go decisions or severe user harm
  • high: frequent quality failures or missing control loops
  • medium: meaningful process weakness with moderate impact
  • low: optimization opportunity, documentation, or ergonomics issue
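
Putting the report format and rubric together, a minimal sketch of representing, ordering, and rendering findings; the Finding dataclass and render_report helper are illustrative, though the fields and severity ranks come straight from the format and rubric above. If impact is tracked separately, it can be added as a secondary sort key:

```python
from dataclasses import dataclass

# Rank taken from the severity rubric above; lower values sort first.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Finding:
    """One audit finding, mirroring the mandated report fields."""
    title: str
    status: str             # "Problem exists" | "OK" | "Cannot determine"
    evidence: str           # specific evidence from Truesight context
    severity: str           # key into SEVERITY_RANK
    recommended_skill: str  # one of the current skill set
    next_command: str       # concrete instruction to run next


def render_report(findings: list[Finding]) -> str:
    """Order findings by severity, then emit one mandated block per finding."""
    ordered = sorted(findings, key=lambda f: SEVERITY_RANK[f.severity])
    return "\n\n".join(
        f"### {f.title}\n"
        f"Status: {f.status}\n"
        f"Evidence: {f.evidence}\n"
        f"Severity: {f.severity}\n"
        f"Recommended skill: {f.recommended_skill}\n"
        f"Next command: {f.next_command}"
        for f in ordered
    )
```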

Handoff map

  • Missing or weak failure taxonomy -> error-analysis
  • Missing live evaluation coverage -> create-evaluation or bootstrap-template-evaluation
  • Review backlog or low judgment throughput -> review-and-promote-traces
  • Unclear starting path -> truesight-workflows
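
The map transcribes directly into a lookup table. A sketch, with gap keys paraphrased from the bullets above:

```python
# Direct transcription of the handoff map; gap keys are paraphrases,
# skill names are the ones listed above.
HANDOFF_MAP = {
    "missing_or_weak_failure_taxonomy": ["error-analysis"],
    "missing_live_evaluation_coverage": ["create-evaluation", "bootstrap-template-evaluation"],
    "review_backlog_or_low_throughput": ["review-and-promote-traces"],
    "unclear_starting_path": ["truesight-workflows"],
}


def route(gap: str) -> list[str]:
    """Return the skill(s) to hand a detected gap to; default to the workflows guide."""
    return HANDOFF_MAP.get(gap, ["truesight-workflows"])
```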

Guardrails

  • Keep scope within current Truesight MCP capabilities.

Source

git clone https://github.com/Goodeye-Labs/truesight-mcp-skills

The skill definition lives at skills/eval-audit/SKILL.md in the repository.

Overview

Audits the existing evaluation workflow, surfaces gaps, and ranks findings by severity with concrete next actions. It’s designed for teams inheriting an eval setup, diagnosing quality regressions, or assessing LLM evaluation maturity. The process relies on Truesight evidence (datasets, live evaluations, runs, review queues, criteria, and deployment patterns) and outputs a structured report that maps each finding to an actionable next step and a related skill.

How This Skill Works

It guides the auditor through an interactive Q&A protocol (mandatory) to surface priorities with one question per message and lettered options. It collects evidence from Truesight sources such as datasets, live evaluations, evaluation runs, review queues, and existing criteria. For each finding it records the status, evidence, severity, recommended skill, and a concrete next command, and then orders findings by severity and impact.

When to Use It

  • Inheriting an evaluation setup
  • Diagnosing sudden quality regressions in evaluation
  • Assessing LLM evaluation process maturity
  • Verifying evidence availability and coverage in Truesight
  • Assessing review and promotion workflow health

Quick Start

  1. Gather evidence from Truesight (datasets, live evaluations, evaluation runs, review queues, criteria/deployment patterns).
  2. Run the interactive Q&A protocol to surface gaps and capture findings with status, evidence, and severity.
  3. Compile findings in severity order and specify a concrete next command to close each gap.

Best Practices

  • Follow the mandatory Interactive Q&A protocol (one question at a time, with lettered options)
  • Collect evidence from Truesight first and record missing data as findings
  • Rank findings by severity and business impact when compiling the report
  • Use the handoff map to route gaps to the appropriate supporting skills
  • Maintain the mandated report format and field integrity for consistency

Example Use Cases

  • Missing or weak live evaluation coverage detected in Truesight; route to create-evaluation or bootstrap-template-evaluation.
  • Error-analysis maturity is shallow and evaluation results lack granular failure categories; route to error-analysis.
  • Review backlog or low judgment throughput is evident; escalate via review-and-promote-traces.
  • The starting path for a new team is unclear; route through truesight-workflows.
  • Template usage dominates testing without addressing custom needs; initiate a custom evaluation plan.

