
review-and-promote-traces

npx machina-cli add skill Goodeye-Labs/truesight-mcp-skills/review-and-promote-traces --openclaw
Files (1)
SKILL.md
2.7 KB

Review and Promote Traces

Use this skill for the judgment and promotion loop after trace evaluation.

Interactive Q&A protocol (mandatory)

If context is unclear, ask one question at a time with lettered options.

Example:

What do you want to review first?
A) One specific run
B) Pending queue triage across runs

Rules:

  • Ask one question per message.
  • Use lettered options whenever practical.
  • Ask one follow-up when needed, then continue.
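
The one-question-at-a-time protocol above can be sketched as two small helpers, one that renders a question with lettered options and one that validates the reply. This is an illustrative sketch, not part of the skill; the prompt text and function names are assumptions:

```python
def format_question(prompt: str, options: list[str]) -> str:
    """Render one question with lettered options (A, B, C, ...)."""
    letters = [chr(ord("A") + i) for i in range(len(options))]
    lines = [prompt] + [f"{letter}) {text}" for letter, text in zip(letters, options)]
    return "\n".join(lines)

def parse_answer(reply: str, options: list[str]) -> str:
    """Map a single-letter reply back to the chosen option text."""
    letter = reply.strip().upper()
    if len(letter) != 1:
        raise ValueError(f"expected a single letter, got {reply!r}")
    index = ord(letter) - ord("A")
    if not 0 <= index < len(options):
        raise ValueError(f"expected a letter A-{chr(ord('A') + len(options) - 1)}, got {reply!r}")
    return options[index]
```

Keeping formatting and parsing separate makes it easy to re-ask the same question after an invalid reply.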

Workflow

  1. If needed, flag the evaluated run:
    • Use flag_review_item with run_id.
  2. Retrieve review items:
    • Use list_review_items with pagination.
  3. Collect human judgments:
    • Present review items to the user and ask for judgment one item at a time when needed.
    • Capture judgment_value and optional notes for each item.
    • Present lettered options that match the live evaluation setup:
      • Binary example:
        • A) Pass
        • B) Fail
      • Categorical example:
        • A) <option 1>
        • B) <option 2>
        • C) <option 3>
      • Continuous example:
        • A) Enter numeric value (within configured range)
        • B) Skip this item for now
    • If valid options are unclear, look them up before asking:
      1. Inspect list_review_items payload for item result details and expected value hints.
      2. Use get_result(run_id) for run-level context and chain outputs.
      3. If an evaluation ID is available in run metadata, call get_evaluation(evaluation_id) and read the config/judgment criteria to derive valid judgment options.
    • When options remain ambiguous after lookup, ask one clarification question with lettered options before proceeding.
  4. Submit judgments:
    • For each target item, call judge_review_item.
    • Pass the user-provided judgment_value and optional notes into judge_review_item.
    • Include notes when judgment context matters.
  5. Promote judged outputs:
    • Use add_reviewed_items_to_dataset with run_id.
  6. Report result:
    • number of items judged
    • promotion status and row counts
    • any skipped or blocked items
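
Steps 1–6 can be sketched as a single loop. The tool names below come from the skill itself, but the `call_tool` client wrapper, the `get_judgment` callback, and payload field names such as `items` and `next_cursor` are assumptions; check the actual MCP server schemas before relying on them:

```python
def run_review_loop(call_tool, run_id, get_judgment):
    """Sketch of the workflow: flag, list, judge, promote, report.

    call_tool(name, **args) stands in for your MCP client;
    get_judgment(item) asks the human and returns (judgment_value, notes).
    """
    call_tool("flag_review_item", run_id=run_id)  # step 1 (if needed)

    judged = 0
    cursor = None
    while True:  # step 2: paginate through review items
        page = call_tool("list_review_items", run_id=run_id, cursor=cursor)
        for item in page["items"]:
            value, notes = get_judgment(item)  # step 3: one item at a time
            call_tool("judge_review_item", item_id=item["id"],
                      judgment_value=value, notes=notes)  # step 4
            judged += 1
        cursor = page.get("next_cursor")
        if not cursor:
            break

    promo = call_tool("add_reviewed_items_to_dataset", run_id=run_id)  # step 5
    return {"items_judged": judged, "promotion": promo}  # step 6
```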

Queue-wide triage guidance

  • Group by run and status first.
  • Prioritize high-impact runs or oldest pending runs.
  • Keep an audit trail of judgment rationale in notes.
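
As a sketch of the grouping above, assuming each review item carries `run_id`, `status`, and a sortable `created_at` field (adjust to the real list_review_items payload):

```python
from collections import defaultdict

def triage_queue(items):
    """Group review items by (run_id, status), oldest run first."""
    groups = defaultdict(list)
    for item in items:
        groups[(item["run_id"], item["status"])].append(item)
    # Surface the oldest pending work first.
    return sorted(groups.items(),
                  key=lambda kv: min(i["created_at"] for i in kv[1]))
```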

Scopes reference

  • list_review_items requires review:read
  • flag_review_item, judge_review_item, and add_reviewed_items_to_dataset require review:write

If a scope error occurs, ask the user to create a key with the missing scope in Truesight Settings.
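
A minimal pre-flight check for the scope table above. The tool-to-scope mapping comes from the skill; how you obtain the key's granted scopes is an assumption, so adapt it to however your Truesight key exposes them:

```python
REQUIRED_SCOPES = {
    "list_review_items": "review:read",
    "flag_review_item": "review:write",
    "judge_review_item": "review:write",
    "add_reviewed_items_to_dataset": "review:write",
}

def missing_scopes(planned_tools, key_scopes):
    """Return the scopes a key lacks for the planned tool calls."""
    needed = {REQUIRED_SCOPES[t] for t in planned_tools if t in REQUIRED_SCOPES}
    return sorted(needed - set(key_scopes))
```

Running this before the workflow lets you ask the user for a new key once, instead of failing mid-loop.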

Source

git clone https://github.com/Goodeye-Labs/truesight-mcp-skills

The skill file lives at skills/review-and-promote-traces/SKILL.md in that repository.

Overview

Handles the human-judgment loop after trace evaluation. It supports an interactive Q&A protocol to collect judgments on review items, then promotes approved traces back into the dataset. This ensures only vetted traces enrich the dataset and keeps an audit trail via optional notes.

How This Skill Works

This skill sequences flagging, retrieval, and human judgments: you may flag a run, retrieve review items with list_review_items, and present items to the user one at a time using lettered options. For each item, capture judgment_value and optional notes, submit via judge_review_item, then promote approved items with add_reviewed_items_to_dataset and report counts.

When to Use It

  • When an evaluation run requires human judgment to determine correctness
  • When there are items in the review queue across multiple runs awaiting triage
  • When you need to convert judged outputs into dataset promotions for model training
  • When you must preserve an audit trail with notes for justification
  • When you need to derive or validate judgment criteria via get_result or the evaluation config

Quick Start

  1. Flag the run if needed and retrieve review items with list_review_items
  2. Present each review item to the user, capturing judgment_value and optional notes using the lettered options
  3. Submit judgments with judge_review_item for each item, promote with add_reviewed_items_to_dataset, then review the outcome

Best Practices

  • Follow the one-question-per-message protocol; avoid multi-question prompts
  • Use list_review_items with pagination to manage large queues
  • Capture both judgment_value and optional notes to preserve context
  • Group review items by run and status for efficient triage
  • Verify required scopes (review:read/write) before performing actions

Example Use Cases

  • Flag a long-running evaluation run, list review items, judge a batch of items as Pass, then promote 10 items to the dataset
  • Triage pending items across runs, using get_result for run context to decide the right options
  • Promote all judged items to the dataset via add_reviewed_items_to_dataset and check promotion counts
  • Add notes explaining why a trace was promoted or rejected
  • Generate an audit trail of judgments to satisfy compliance requirements
