error-analysis
npx machina-cli add skill Goodeye-Labs/truesight-mcp-skills/error-analysis --openclaw
Error Analysis
Guide the user through trace-grounded failure analysis and dataset labeling.
Interactive Q&A protocol (mandatory)
Ask one question at a time with lettered options whenever practical.
Example:
Which data source should we analyze first?
A) Existing Truesight dataset
B) New dataset to upload
C) Unsure, list datasets first
Rules:
- One question per message during setup.
- Prefer lettered options.
- Ask one follow-up if the response is ambiguous.
Core workflow
- Select or create dataset:
  - If a dataset already exists, use list_datasets.
  - If not, use upload_dataset.
- Collect representative traces:
- Target approximately 100 traces when possible.
- Use random plus stratified coverage when volume is high.
- Analyze row by row:
  - Use get_dataset_rows with pagination.
  - For each row, call suggest_error_notes.
- Persist annotations:
  - Save _ts_error_notes and _ts_error_category with update_dataset_row.
- Consolidate categories:
  - Run consolidate_error_categories.
  - Review mapping proposals, then apply with apply_category_mappings.
- Prioritize fixes:
- Report most frequent categories first.
- Recommend next skill based on failure type:
  - create-evaluation for new evaluation coverage
  - review-and-promote-traces for judgment backlog
  - eval-audit for broader process gaps
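The workflow above (paginate, suggest notes, persist, consolidate, apply) can be sketched as one loop. The tool names are this skill's; the `client` object, parameter names (`dataset_id`, `page`, `row_id`, `values`), and return shapes are all assumptions for illustration, not the real MCP signatures:

```python
# Sketch of the core workflow loop. Tool names come from this skill;
# the client interface, parameter names, and return shapes are assumed.
def run_error_analysis(client, dataset_id, page_size=50):
    """Page through a dataset, suggest error notes per row, persist labels,
    then consolidate and apply category mappings."""
    page = 0
    while True:
        rows = client.call("get_dataset_rows",
                           dataset_id=dataset_id, page=page, page_size=page_size)
        if not rows:
            break
        for row in rows:
            suggestion = client.call("suggest_error_notes",
                                     dataset_id=dataset_id, row_id=row["id"])
            client.call("update_dataset_row",
                        dataset_id=dataset_id, row_id=row["id"],
                        values={"_ts_error_notes": suggestion["notes"],
                                "_ts_error_category": suggestion["category"]})
        page += 1
    # Consolidate once every row is labeled; review proposals before applying.
    proposals = client.call("consolidate_error_categories", dataset_id=dataset_id)
    client.call("apply_category_mappings", dataset_id=dataset_id, mappings=proposals)
```

In practice the consolidation proposals should get a human review pass, per the workflow, before apply_category_mappings runs.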
Analysis heuristics
- Focus on first root failure in each trace, not every downstream symptom.
- Let categories emerge from observed traces, not pre-baked labels.
- Revisit category definitions after about 20 traces, then relabel earlier traces for consistency.
- Stop when recent traces no longer reveal new failure categories.
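The stopping heuristic can be made concrete: stop once a window of recent traces introduces no category that earlier traces hadn't already surfaced. This helper is illustrative only and not part of the skill's tooling:

```python
def saturated(labels, window=20):
    """Return True when the last `window` labeled traces introduced no new
    failure category relative to everything labeled before them.

    `labels` lists the category assigned to each trace, in analysis order.
    """
    if len(labels) <= window:
        return False  # too few traces to judge saturation
    seen_before = set(labels[:-window])
    return all(label in seen_before for label in labels[-window:])
```

Checking `saturated` after each batch of rows gives a principled point to end the row-by-row pass.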
Anti-patterns
- Defining categories before reading traces.
- Treating output quality labels as generic scores without concrete failure modes.
- Skipping relabel after category definitions change.
- Building new evaluators before fixing obvious prompt/tooling/engineering gaps.
Scopes reference
- list_datasets and get_dataset_rows require datasets:read
- upload_dataset, update_dataset_row, and apply_category_mappings require datasets:write
- suggest_error_notes and consolidate_error_categories require error-analysis:execute
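The scope requirements above can be checked up front before starting a session. The mapping below copies the reference; the helper function itself is an illustrative sketch, not part of the skill:

```python
# Tool-to-scope mapping copied from the scopes reference above.
REQUIRED_SCOPES = {
    "list_datasets": "datasets:read",
    "get_dataset_rows": "datasets:read",
    "upload_dataset": "datasets:write",
    "update_dataset_row": "datasets:write",
    "apply_category_mappings": "datasets:write",
    "suggest_error_notes": "error-analysis:execute",
    "consolidate_error_categories": "error-analysis:execute",
}

def missing_scopes(tools, granted):
    """Return the scopes still needed to call every tool in `tools`."""
    return sorted({REQUIRED_SCOPES[t] for t in tools} - set(granted))
```

For example, a session granted only `datasets:read` cannot persist annotations or run the analysis tools, so it fails before the workflow's fourth step.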
Source
https://github.com/Goodeye-Labs/truesight-mcp-skills/blob/main/skills/error-analysis/SKILL.md
Overview
Error Analysis guides trace-grounded failure analysis and dataset labeling using Truesight datasets and error-analysis tools. It emphasizes an interactive Q&A setup and a structured workflow to identify root causes, consolidate categories, and prioritize fixes for quality issues or drift.
How This Skill Works
Start by selecting an existing dataset via list_datasets or upload_dataset if needed. Collect about 100 representative traces using random sampling with stratified coverage when volume is high. Analyze each row with get_dataset_rows and suggest_error_notes, persisting findings with update_dataset_row. Then consolidate error categories with consolidate_error_categories, apply mappings with apply_category_mappings, and prioritize fixes by category frequency, guiding next steps such as create-evaluation, review-and-promote-traces, or eval-audit. The process follows a mandatory Interactive Q&A protocol during setup, asking one question at a time with lettered options.
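The sampling guidance above ("about 100 traces, random plus stratified when volume is high") can be sketched as grouping traces by a stratum key and sampling each group proportionally. Everything here is an assumption for illustration: traces as dicts, a caller-chosen stratum key, and a fixed seed for reproducibility:

```python
import random

def sample_traces(traces, key, target=100, seed=0):
    """Proportional stratified sample of ~`target` traces, grouped by `key`."""
    rng = random.Random(seed)
    if len(traces) <= target:
        return list(traces)  # small volume: take everything
    strata = {}
    for trace in traces:
        strata.setdefault(trace[key], []).append(trace)
    sample = []
    for group in strata.values():
        # At least one trace per stratum keeps rare failure modes visible.
        n = max(1, round(target * len(group) / len(traces)))
        sample.extend(rng.sample(group, min(n, len(group))))
    return sample[:target]
```

Stratifying by something like model version, route, or time bucket is what keeps high-volume sampling representative rather than dominated by the busiest slice.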
When to Use It
- Quality issues in evaluated traces are unclear or ambiguous
- After major pipeline changes to identify failure modes
- Incidents indicating drift in evaluation data or outputs
- When backlog of judgments requires labeling and categorization
- Before implementing fixes or expanding evaluation coverage, to reveal gaps first
Quick Start
- Step 1: Select an existing dataset with list_datasets, or create a new one with upload_dataset.
- Step 2: Collect ~100 representative traces (random + stratified as needed) and run get_dataset_rows + suggest_error_notes; persist with update_dataset_row.
- Step 3: Run consolidate_error_categories and apply_category_mappings, then prioritize fixes based on category frequency.
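The prioritization in Step 3 is a frequency count over the persisted _ts_error_category values. A minimal sketch, assuming rows as dicts with that column:

```python
from collections import Counter

def prioritize(rows):
    """Rank error categories by frequency, most common first."""
    counts = Counter(row["_ts_error_category"] for row in rows
                     if row.get("_ts_error_category"))
    return counts.most_common()
```

The top entries of the ranking drive the next-skill recommendation: frequent uncovered failure modes point to create-evaluation, a labeling backlog to review-and-promote-traces, and systemic gaps to eval-audit.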
Best Practices
- Start by selecting an existing dataset or uploading a new one via the setup flow
- Aim for about 100 traces with random sampling and stratified coverage when volume is high
- Focus on the first root failure in each trace and avoid labeling every downstream symptom
- Let failure categories emerge from the traces rather than predefining labels
- Iterate after about 20 traces and relabel for consistency; consolidate categories and apply mappings
Example Use Cases
- After a pipeline update, identify dominant error categories causing ambiguous quality
- Across datasets, surface drift-related failures and track changes over time
- Consolidate error categories to reveal top failure modes affecting evaluation
- Apply category mappings to standardize labels across teams and tools
- Prioritize fixes by most frequent categories to inform the next skill choice (e.g., create-evaluation)