review-and-promote-traces
npx machina-cli add skill Goodeye-Labs/truesight-mcp-skills/review-and-promote-traces --openclaw

Review and Promote Traces
Use this skill for the judgment and promotion loop after trace evaluation.
Interactive Q&A protocol (mandatory)
If context is unclear, ask one question at a time with lettered options.
Example:
What do you want to review first?
A) One specific run
B) Pending queue triage across runs
Rules:
- Ask one question per message.
- Use lettered options whenever practical.
- Ask one follow-up when needed, then continue.
Workflow
- If needed, flag the evaluated run:
  - Use `flag_review_item` with `run_id`.
- Retrieve review items:
  - Use `list_review_items` with pagination.
- Collect human judgments:
  - Present review items to the user and ask for judgment one item at a time when needed.
  - Capture `judgment_value` and optional `notes` for each item.
  - Present lettered options that match the live evaluation setup:
    - Binary example:
      - A) Pass
      - B) Fail
    - Categorical example:
      - A) <option 1>
      - B) <option 2>
      - C) <option 3>
    - Continuous example:
      - A) Enter numeric value (within configured range)
      - B) Skip this item for now
- If valid options are unclear, look them up before asking:
  - Inspect the `list_review_items` payload for item result details and expected-value hints.
  - Use `get_result(run_id)` for run-level context and chain outputs.
  - If an evaluation id is available in run metadata, call `get_evaluation(evaluation_id)` and read the config/judgment criteria to derive valid judgment options.
  - When options remain ambiguous after lookup, ask one clarification question with lettered options before proceeding.
- Submit judgments:
  - For each target item, call `judge_review_item`.
  - Pass the user-provided `judgment_value` and optional `notes` into `judge_review_item`.
  - Include `notes` when judgment context matters.
- Promote judged outputs:
  - Use `add_reviewed_items_to_dataset` with `run_id`.
- Report result:
  - number of items judged
  - promotion status and row counts
  - any skipped or blocked items
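Put together, the workflow above can be sketched as a single loop. The tool names (`flag_review_item`, `list_review_items`, `judge_review_item`, `add_reviewed_items_to_dataset`) come from this skill; the `call_tool` client, the `ask_user` callback, and the payload shapes are assumptions made only to make the sequence concrete.

```python
# Sketch of the flag -> list -> judge -> promote loop (payloads are assumed).
def review_and_promote(call_tool, run_id, ask_user):
    # Flag the evaluated run if it is not already in the review queue.
    call_tool("flag_review_item", {"run_id": run_id})

    # Retrieve review items (pagination elided in this sketch).
    items = call_tool("list_review_items", {"run_id": run_id})

    judged = 0
    for item in items:
        # Collect a judgment one item at a time, with optional notes.
        judgment_value, notes = ask_user(item)
        if judgment_value is None:
            continue  # user skipped this item
        payload = {"item_id": item["id"], "judgment_value": judgment_value}
        if notes:
            payload["notes"] = notes
        call_tool("judge_review_item", payload)
        judged += 1

    # Promote judged outputs back into the dataset and report the outcome.
    promotion = call_tool("add_reviewed_items_to_dataset", {"run_id": run_id})
    return {"judged": judged, "promotion": promotion}
```

The return value carries the three report fields this section asks for: items judged, promotion result, and (by difference) skipped items.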
Queue-wide triage guidance
- Group by run and status first.
- Prioritize high-impact runs or oldest pending runs.
- Keep an audit trail of judgment rationale in notes.
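One minimal way to implement the grouping and prioritization above: bucket pending items by run, then visit runs whose oldest pending item is oldest first. The field names (`run_id`, `status`, `created_at`) are assumptions about the item payload, not documented fields.

```python
# Order runs for queue-wide triage: oldest pending work first.
from collections import defaultdict

def triage_order(items):
    """Return run_ids that have pending items, oldest pending item first."""
    pending_by_run = defaultdict(list)
    for item in items:
        if item.get("status") == "pending":
            pending_by_run[item["run_id"]].append(item)
    # Sort runs by the age of their oldest pending item.
    return sorted(
        pending_by_run,
        key=lambda run: min(i["created_at"] for i in pending_by_run[run]),
    )
```

Swapping the sort key for a priority field would implement the "high-impact runs first" variant instead.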
Scopes reference
- `list_review_items` requires `review:read`
- `flag_review_item`, `judge_review_item`, and `add_reviewed_items_to_dataset` require `review:write`
If a scope error occurs, ask the user to create a key with the missing scope in Truesight Settings.
Source
git clone https://github.com/Goodeye-Labs/truesight-mcp-skills.git
View on GitHub: https://github.com/Goodeye-Labs/truesight-mcp-skills/blob/main/skills/review-and-promote-traces/SKILL.md
Overview
Handles the human-judgment loop after trace evaluation. It supports an interactive Q&A protocol to collect judgments on review items, then promotes approved traces back into the dataset. This ensures only vetted traces enrich the dataset and keeps an audit trail via optional notes.
How This Skill Works
This skill sequences flagging, retrieval, and human judgments: you may flag a run, retrieve review items with list_review_items, and present items to the user one at a time using lettered options. For each item, capture judgment_value and optional notes, submit via judge_review_item, then promote approved items with add_reviewed_items_to_dataset and report counts.
When to Use It
- When an evaluation run requires human judgment to determine correctness
- When there are items in the review queue across multiple runs awaiting triage
- When you need to convert judged outputs into dataset promotions for model training
- When you must preserve an audit trail with notes for justification
- When you need to derive valid judgment options from run or evaluation context via get_result and the evaluation config
Quick Start
- Step 1: Flag the run if needed and retrieve review items with list_review_items
- Step 2: Present each review item to the user, capturing judgment_value and optional notes using the lettered options
- Step 3: Submit judgments with judge_review_item for each item and promote with add_reviewed_items_to_dataset, then review the outcome
Best Practices
- Follow the one-question-per-message protocol; avoid multi-question prompts
- Use list_review_items with pagination to manage large queues
- Capture both judgment_value and optional notes to preserve context
- Group review items by run and status for efficient triage
- Verify required scopes (review:read/write) before performing actions
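For the pagination practice above, a hedged sketch of draining a paginated `list_review_items` queue: the `cursor` parameter and the `{"items": ..., "next_cursor": ...}` response shape are assumptions about the payload, not documented behavior, so adapt them to the real tool.

```python
# Drain all pages of review items (cursor-based pagination assumed).
def all_review_items(call_tool, run_id):
    items, cursor = [], None
    while True:
        payload = {"run_id": run_id}
        if cursor:
            payload["cursor"] = cursor
        page = call_tool("list_review_items", payload)
        items.extend(page["items"])
        cursor = page.get("next_cursor")
        if not cursor:
            return items
```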
Example Use Cases
- Flag a long-running evaluation run, list review items, judge a batch of items as Pass, then promote 10 items to dataset
- Triage pending items across runs, using get_result for run context to decide the right options
- Promote all judged items to dataset via add_reviewed_items_to_dataset and check promotion counts
- Add notes explaining why a trace was promoted or rejected
- Generate an audit trail of judgments to satisfy compliance requirements