error-analysis
npx machina-cli add skill Goodeye-Labs/truesight-mcp-skills/error-analysis --openclaw
Error Analysis
Guide the user through trace-grounded failure analysis and dataset labeling.
Interactive Q&A protocol (mandatory)
Ask one question at a time with lettered options whenever practical.
Example:
Which data source should we analyze first?
A) Existing Truesight dataset
B) New dataset to upload
C) Unsure, list datasets first
Rules:
- One question per message during setup.
- Prefer lettered options.
- Ask one follow-up if the response is ambiguous.
Core workflow
- Select or create dataset:
  - If a dataset already exists, use list_datasets.
  - If not, use upload_dataset.
- Collect representative traces:
- Target approximately 100 traces when possible.
- Use random plus stratified coverage when volume is high.
- Analyze row by row:
  - Use get_dataset_rows with pagination.
  - For each row, call suggest_error_notes.
- Persist annotations:
  - Save _ts_error_notes and _ts_error_category with update_dataset_row.
- Consolidate categories:
  - Run consolidate_error_categories.
  - Review mapping proposals, then apply with apply_category_mappings.
- Prioritize fixes:
- Report most frequent categories first.
- Recommend next skill based on failure type:
  - create-evaluation for new evaluation coverage
  - review-and-promote-traces for judgment backlog
  - eval-audit for broader process gaps
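The workflow above (paginate, suggest notes, persist, consolidate, apply) can be sketched as one loop. The tool names are this skill's; the `client` object, parameter names (`dataset_id`, `page`, `row_id`, `values`), and return shapes are all assumptions for illustration, not the real MCP signatures:

```python
# Sketch of the core workflow loop. Tool names come from this skill;
# the client interface, parameter names, and return shapes are assumed.
def run_error_analysis(client, dataset_id, page_size=50):
    """Page through a dataset, suggest error notes per row, persist labels,
    then consolidate and apply category mappings."""
    page = 0
    while True:
        rows = client.call("get_dataset_rows",
                           dataset_id=dataset_id, page=page, page_size=page_size)
        if not rows:
            break
        for row in rows:
            suggestion = client.call("suggest_error_notes",
                                     dataset_id=dataset_id, row_id=row["id"])
            client.call("update_dataset_row",
                        dataset_id=dataset_id, row_id=row["id"],
                        values={"_ts_error_notes": suggestion["notes"],
                                "_ts_error_category": suggestion["category"]})
        page += 1
    # Consolidate once every row is labeled; review proposals before applying.
    proposals = client.call("consolidate_error_categories", dataset_id=dataset_id)
    client.call("apply_category_mappings", dataset_id=dataset_id, mappings=proposals)
```

In practice the consolidation proposals should get a human review pass, per the workflow, before apply_category_mappings runs.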
Analysis heuristics
- Focus on first root failure in each trace, not every downstream symptom.
- Let categories emerge from observed traces, not pre-baked labels.
- Revisit category definitions after about 20 traces, then relabel earlier traces for consistency.
- Stop when recent traces no longer reveal new failure categories.
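The stopping heuristic can be made concrete: stop once a window of recent traces introduces no category that earlier traces hadn't already surfaced. This helper is illustrative only and not part of the skill's tooling:

```python
def saturated(labels, window=20):
    """Return True when the last `window` labeled traces introduced no new
    failure category relative to everything labeled before them.

    `labels` lists the category assigned to each trace, in analysis order.
    """
    if len(labels) <= window:
        return False  # too few traces to judge saturation
    seen_before = set(labels[:-window])
    return all(label in seen_before for label in labels[-window:])
```

Checking `saturated` after each batch of rows gives a principled point to end the row-by-row pass.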
Anti-patterns
- Defining categories before reading traces.
- Treating output quality labels as generic scores without concrete failure modes.
- Skipping relabel after category definitions change.
- Building new evaluators before fixing obvious prompt/tooling/engineering gaps.
Scopes reference
- list_datasets and get_dataset_rows require datasets:read
- upload_dataset, update_dataset_row, and apply_category_mappings require datasets:write
- suggest_error_notes and consolidate_error_categories require error-analysis:execute
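The scope requirements above can be checked up front before starting a session. The mapping below copies the reference; the helper function itself is an illustrative sketch, not part of the skill:

```python
# Tool-to-scope mapping copied from the scopes reference above.
REQUIRED_SCOPES = {
    "list_datasets": "datasets:read",
    "get_dataset_rows": "datasets:read",
    "upload_dataset": "datasets:write",
    "update_dataset_row": "datasets:write",
    "apply_category_mappings": "datasets:write",
    "suggest_error_notes": "error-analysis:execute",
    "consolidate_error_categories": "error-analysis:execute",
}

def missing_scopes(tools, granted):
    """Return the scopes still needed to call every tool in `tools`."""
    return sorted({REQUIRED_SCOPES[t] for t in tools} - set(granted))
```

For example, a session granted only `datasets:read` cannot persist annotations or run the analysis tools, so it fails before the workflow's fourth step.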
Source
https://github.com/Goodeye-Labs/truesight-mcp-skills/blob/main/skills/error-analysis/SKILL.md
Overview
Error Analysis guides trace-grounded failure analysis and dataset labeling using Truesight datasets and error-analysis tools. It emphasizes an interactive Q&A setup and a structured workflow to identify root causes, consolidate categories, and prioritize fixes for quality issues or drift.
How This Skill Works
Start by selecting an existing dataset via list_datasets or upload_dataset if needed. Collect about 100 representative traces using random sampling with stratified coverage when volume is high. Analyze each row with get_dataset_rows and suggest_error_notes, persisting findings with update_dataset_row. Then consolidate error categories with consolidate_error_categories, apply mappings with apply_category_mappings, and prioritize fixes by category frequency, guiding next steps such as create-evaluation, review-and-promote-traces, or eval-audit. The process follows a mandatory Interactive Q&A protocol during setup, asking one question at a time with lettered options.
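The sampling guidance above ("about 100 traces, random plus stratified when volume is high") can be sketched as grouping traces by a stratum key and sampling each group proportionally. Everything here is an assumption for illustration: traces as dicts, a caller-chosen stratum key, and a fixed seed for reproducibility:

```python
import random

def sample_traces(traces, key, target=100, seed=0):
    """Proportional stratified sample of ~`target` traces, grouped by `key`."""
    rng = random.Random(seed)
    if len(traces) <= target:
        return list(traces)  # small volume: take everything
    strata = {}
    for trace in traces:
        strata.setdefault(trace[key], []).append(trace)
    sample = []
    for group in strata.values():
        # At least one trace per stratum keeps rare failure modes visible.
        n = max(1, round(target * len(group) / len(traces)))
        sample.extend(rng.sample(group, min(n, len(group))))
    return sample[:target]
```

Stratifying by something like model version, route, or time bucket is what keeps high-volume sampling representative rather than dominated by the busiest slice.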
When to Use It
- Quality issues in evaluated traces are unclear or ambiguous
- After major pipeline changes to identify failure modes
- Incidents indicating drift in evaluation data or outputs
- When backlog of judgments requires labeling and categorization
- Before implementing fixes or expanding evaluation coverage, to reveal gaps first
Quick Start
- Step 1: Select an existing dataset with list_datasets, or create a new one with upload_dataset.
- Step 2: Collect ~100 representative traces (random + stratified as needed) and run get_dataset_rows + suggest_error_notes; persist with update_dataset_row.
- Step 3: Run consolidate_error_categories and apply_category_mappings, then prioritize fixes based on category frequency.
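The prioritization in Step 3 is a frequency count over the persisted _ts_error_category values. A minimal sketch, assuming rows as dicts with that column:

```python
from collections import Counter

def prioritize(rows):
    """Rank error categories by frequency, most common first."""
    counts = Counter(row["_ts_error_category"] for row in rows
                     if row.get("_ts_error_category"))
    return counts.most_common()
```

The top entries of the ranking drive the next-skill recommendation: frequent uncovered failure modes point to create-evaluation, a labeling backlog to review-and-promote-traces, and systemic gaps to eval-audit.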
Best Practices
- Start by selecting an existing dataset or uploading a new one via the setup flow
- Aim for about 100 traces with random sampling and stratified coverage when volume is high
- Focus on the first root failure in each trace and avoid labeling every downstream symptom
- Let failure categories emerge from the traces rather than predefining labels
- Iterate after about 20 traces and relabel for consistency; consolidate categories and apply mappings
Example Use Cases
- After a pipeline update, identify dominant error categories causing ambiguous quality
- Across datasets, surface drift-related failures and track changes over time
- Consolidate error categories to reveal top failure modes affecting evaluation
- Apply category mappings to standardize labels across teams and tools
- Prioritize fixes by most frequent categories to inform the next skill choice (e.g., create-evaluation)