error-analysis

npx machina-cli add skill Goodeye-Labs/truesight-mcp-skills/error-analysis --openclaw
Files (1): SKILL.md (2.4 KB)

Error Analysis

Guide the user through trace-grounded failure analysis and dataset labeling.

Interactive Q&A protocol (mandatory)

Ask one question at a time with lettered options whenever practical.

Example:

Which data source should we analyze first?
A) Existing Truesight dataset
B) New dataset to upload
C) Unsure, list datasets first

Rules:

  • One question per message during setup.
  • Prefer lettered options.
  • Ask one follow-up if response is ambiguous.
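The one-question, lettered-options rule is mechanical enough to sketch. A minimal helper, assuming nothing about the real skill runtime (`format_question` and its signature are illustrative, not part of the skill):

```python
import string

def format_question(prompt: str, options: list) -> str:
    # Render one setup question with lettered options (A, B, C, ...),
    # matching the protocol's one-question-per-message rule.
    lines = [prompt]
    lines += [f"{letter}) {opt}"
              for letter, opt in zip(string.ascii_uppercase, options)]
    return "\n".join(lines)
```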

Core workflow

  1. Select or create a dataset:
    • If a suitable dataset already exists, find it with list_datasets.
    • Otherwise, create one with upload_dataset.
  2. Collect representative traces:
    • Target approximately 100 traces when possible.
    • Use random plus stratified coverage when volume is high.
  3. Analyze row by row:
    • Use get_dataset_rows with pagination.
    • For each row, call suggest_error_notes.
  4. Persist annotations:
    • Save _ts_error_notes and _ts_error_category with update_dataset_row.
  5. Consolidate categories:
    • Run consolidate_error_categories.
    • Review mapping proposals, then apply with apply_category_mappings.
  6. Prioritize fixes:
    • Report most frequent categories first.
    • Recommend next skill based on failure type:
      • create-evaluation for new evaluation coverage
      • review-and-promote-traces for judgment backlog
      • eval-audit for broader process gaps
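Steps 3 and 4 above can be sketched as one pagination-and-annotation loop. The `get_rows`, `suggest`, and `update_row` callables stand in for the get_dataset_rows, suggest_error_notes, and update_dataset_row tools; their signatures and the shape of the suggestion result are assumptions, not the real MCP schema:

```python
from typing import Callable

def annotate_dataset(
    get_rows: Callable[[str, int, int], list],
    suggest: Callable[[dict], dict],
    update_row: Callable[[str, str, dict], None],
    dataset_id: str,
    page_size: int = 50,
) -> int:
    # Page through the dataset, attach suggested error notes, and
    # persist the two annotation columns on each row.
    labeled = 0
    offset = 0
    while True:
        rows = get_rows(dataset_id, offset, page_size)
        if not rows:
            break
        for row in rows:
            notes = suggest(row)  # assumed shape: {"notes": ..., "category": ...}
            update_row(dataset_id, row["id"], {
                "_ts_error_notes": notes["notes"],
                "_ts_error_category": notes["category"],
            })
            labeled += 1
        offset += page_size
    return labeled
```

Injecting the tool calls as callables keeps the loop testable without a live Truesight connection.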

Analysis heuristics

  • Focus on first root failure in each trace, not every downstream symptom.
  • Let categories emerge from observed traces, not pre-baked labels.
  • Revisit category definitions after about 20 traces, then relabel earlier rows for consistency.
  • Stop when recent traces no longer reveal new failure categories.
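The stopping rule in the last bullet amounts to a saturation check over a sliding window. A sketch (`saturated` is a hypothetical helper; the 20-trace window mirrors the iteration heuristic above):

```python
def saturated(recent_categories: list, known: set, window: int = 20) -> bool:
    # Stopping heuristic: labeling can stop once a full window of recent
    # traces introduces no category outside the already-known set.
    if len(recent_categories) < window:
        return False  # not enough evidence yet to stop
    return all(c in known for c in recent_categories[-window:])
```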

Anti-patterns

  • Defining categories before reading traces.
  • Treating output quality labels as generic scores without concrete failure modes.
  • Skipping relabel after category definitions change.
  • Building new evaluators before fixing obvious prompt/tooling/engineering gaps.

Scopes reference

  • list_datasets, get_dataset_rows require datasets:read
  • upload_dataset, update_dataset_row, apply_category_mappings require datasets:write
  • suggest_error_notes, consolidate_error_categories require error-analysis:execute
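This mapping can be checked up front so a run with missing grants fails fast. A sketch that transcribes the table above into code (`missing_scopes` is illustrative, not part of the Truesight API):

```python
# Tool-to-scope mapping transcribed from the scopes reference above.
REQUIRED_SCOPES = {
    "list_datasets": "datasets:read",
    "get_dataset_rows": "datasets:read",
    "upload_dataset": "datasets:write",
    "update_dataset_row": "datasets:write",
    "apply_category_mappings": "datasets:write",
    "suggest_error_notes": "error-analysis:execute",
    "consolidate_error_categories": "error-analysis:execute",
}

def missing_scopes(planned_tools: list, granted: set) -> set:
    # Return the scopes still needed before the planned tool calls.
    return {REQUIRED_SCOPES[t] for t in planned_tools} - granted
```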

Source

git clone https://github.com/Goodeye-Labs/truesight-mcp-skills
The skill file lives at skills/error-analysis/SKILL.md.

Overview

Error Analysis guides trace-grounded failure analysis and dataset labeling using Truesight datasets and error-analysis tools. It emphasizes an interactive Q&A setup and a structured workflow that identifies root causes, consolidates failure categories, and prioritizes fixes for quality issues or drift.

How This Skill Works

Start by selecting an existing dataset via list_datasets or upload_dataset if needed. Collect about 100 representative traces using random sampling with stratified coverage when volume is high. Analyze each row with get_dataset_rows and suggest_error_notes, persisting findings with update_dataset_row. Then consolidate error categories with consolidate_error_categories, apply mappings with apply_category_mappings, and prioritize fixes by category frequency, guiding next steps such as create-evaluation, review-and-promote-traces, or eval-audit. The process follows a mandatory Interactive Q&A protocol during setup, asking one question at a time with lettered options.
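The sampling step described above, random draws with stratified coverage at high volume, might look like this. The `stratify_key` field is an assumption; use whatever attribute partitions your traffic:

```python
import random
from collections import defaultdict

def sample_traces(traces: list, target: int = 100,
                  stratify_key: str = "route", seed: int = 0) -> list:
    # Random sampling with stratified coverage: guarantee at least one
    # trace per stratum, then fill the remainder uniformly at random.
    if len(traces) <= target:
        return list(traces)
    rng = random.Random(seed)
    strata = defaultdict(list)
    for i, t in enumerate(traces):
        strata[t.get(stratify_key)].append(i)
    # One guaranteed pick per stratum (strata partition the indices).
    chosen = {rng.choice(idxs) for idxs in strata.values()}
    pool = [i for i in range(len(traces)) if i not in chosen]
    chosen |= set(rng.sample(pool, max(0, target - len(chosen))))
    return [traces[i] for i in sorted(chosen)]
```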

When to Use It

  • Quality issues in evaluated traces are unclear or ambiguous
  • After major pipeline changes to identify failure modes
  • Incidents indicating drift in evaluation data or outputs
  • When backlog of judgments requires labeling and categorization
  • Before implementing fixes or expanding evaluation coverage, to surface gaps first

Quick Start

  1. Select an existing dataset with list_datasets, or create a new one with upload_dataset.
  2. Collect ~100 representative traces (random + stratified as needed), run get_dataset_rows + suggest_error_notes, and persist annotations with update_dataset_row.
  3. Run consolidate_error_categories and apply_category_mappings, then prioritize fixes by category frequency.
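For the prioritization in the final step, a frequency ranking over the consolidated category labels is enough. A minimal sketch:

```python
from collections import Counter

def prioritize(categories: list) -> list:
    # Rank consolidated failure categories by frequency so the most
    # common failure mode is addressed first.
    return Counter(categories).most_common()
```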

Best Practices

  • Start by selecting an existing dataset or uploading a new one via the setup flow
  • Aim for about 100 traces with random sampling and stratified coverage when volume is high
  • Focus on the first root failure in each trace and avoid labeling every downstream symptom
  • Let failure categories emerge from the traces rather than predefining labels
  • Iterate after about 20 traces and relabel for consistency; consolidate categories and apply mappings

Example Use Cases

  • After a pipeline update, identify dominant error categories causing ambiguous quality
  • Across datasets, surface drift-related failures and track changes over time
  • Consolidate error categories to reveal top failure modes affecting evaluation
  • Apply category mappings to standardize labels across teams and tools
  • Prioritize fixes by most frequent categories to inform the next skill choice (e.g., create-evaluation)
