
review-and-promote-traces

npx machina-cli add skill Goodeye-Labs/truesight-mcp-skills/review-and-promote-traces --openclaw
Files (1)
SKILL.md
2.7 KB

Review and Promote Traces

Use this skill for the judgment and promotion loop after trace evaluation.

Interactive Q&A protocol (mandatory)

If context is unclear, ask one question at a time with lettered options.

Example:

What do you want to review first?
A) One specific run
B) Pending queue triage across runs

Rules:

  • Ask one question per message.
  • Use lettered options whenever practical.
  • Ask one follow-up when needed, then continue.
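
The one-question-at-a-time protocol above can be sketched as two small helpers, one that renders a question with lettered options and one that validates the reply. This is an illustrative sketch, not part of the skill; the prompt text and function names are assumptions:

```python
def format_question(prompt: str, options: list[str]) -> str:
    """Render one question with lettered options (A, B, C, ...)."""
    letters = [chr(ord("A") + i) for i in range(len(options))]
    lines = [prompt] + [f"{letter}) {text}" for letter, text in zip(letters, options)]
    return "\n".join(lines)

def parse_answer(reply: str, options: list[str]) -> str:
    """Map a single-letter reply back to the chosen option text."""
    letter = reply.strip().upper()
    if len(letter) != 1:
        raise ValueError(f"expected a single letter, got {reply!r}")
    index = ord(letter) - ord("A")
    if not 0 <= index < len(options):
        raise ValueError(f"expected a letter A-{chr(ord('A') + len(options) - 1)}, got {reply!r}")
    return options[index]
```

Keeping formatting and parsing separate makes it easy to re-ask the same question after an invalid reply.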

Workflow

  1. If needed, flag the evaluated run:
    • Use flag_review_item with run_id.
  2. Retrieve review items:
    • Use list_review_items with pagination.
  3. Collect human judgments:
    • Present review items to the user and ask for judgment one item at a time when needed.
    • Capture judgment_value and optional notes for each item.
    • Present lettered options that match the live evaluation setup:
      • Binary example:
        • A) Pass
        • B) Fail
      • Categorical example:
        • A) <option 1>
        • B) <option 2>
        • C) <option 3>
      • Continuous example:
        • A) Enter numeric value (within configured range)
        • B) Skip this item for now
    • If valid options are unclear, look them up before asking:
      1. Inspect list_review_items payload for item result details and expected value hints.
      2. Use get_result(run_id) for run-level context and chain outputs.
      3. If an evaluation ID is available in run metadata, call get_evaluation(evaluation_id) and read the config/judgment criteria to derive valid judgment options.
    • When options remain ambiguous after lookup, ask one clarification question with lettered options before proceeding.
  4. Submit judgments:
    • For each target item, call judge_review_item.
    • Pass the user-provided judgment_value and optional notes into judge_review_item.
    • Include notes when judgment context matters.
  5. Promote judged outputs:
    • Use add_reviewed_items_to_dataset with run_id.
  6. Report result:
    • number of items judged
    • promotion status and row counts
    • any skipped or blocked items
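
Steps 1–6 can be sketched as a single loop. The tool names below come from the skill itself, but the `call_tool` client wrapper, the `get_judgment` callback, and payload field names such as `items` and `next_cursor` are assumptions; check the actual MCP server schemas before relying on them:

```python
def run_review_loop(call_tool, run_id, get_judgment):
    """Sketch of the workflow: flag, list, judge, promote, report.

    call_tool(name, **args) stands in for your MCP client;
    get_judgment(item) asks the human and returns (judgment_value, notes).
    """
    call_tool("flag_review_item", run_id=run_id)  # step 1 (if needed)

    judged = 0
    cursor = None
    while True:  # step 2: paginate through review items
        page = call_tool("list_review_items", run_id=run_id, cursor=cursor)
        for item in page["items"]:
            value, notes = get_judgment(item)  # step 3: one item at a time
            call_tool("judge_review_item", item_id=item["id"],
                      judgment_value=value, notes=notes)  # step 4
            judged += 1
        cursor = page.get("next_cursor")
        if not cursor:
            break

    promo = call_tool("add_reviewed_items_to_dataset", run_id=run_id)  # step 5
    return {"items_judged": judged, "promotion": promo}  # step 6
```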

Queue-wide triage guidance

  • Group by run and status first.
  • Prioritize high-impact runs or oldest pending runs.
  • Keep an audit trail of judgment rationale in notes.
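
As a sketch of the grouping above, assuming each review item carries `run_id`, `status`, and a sortable `created_at` field (adjust to the real list_review_items payload):

```python
from collections import defaultdict

def triage_queue(items):
    """Group review items by (run_id, status), oldest run first."""
    groups = defaultdict(list)
    for item in items:
        groups[(item["run_id"], item["status"])].append(item)
    # Surface the oldest pending work first.
    return sorted(groups.items(),
                  key=lambda kv: min(i["created_at"] for i in kv[1]))
```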

Scopes reference

  • list_review_items requires review:read
  • flag_review_item, judge_review_item, and add_reviewed_items_to_dataset require review:write

If a scope error occurs, ask the user to create a key with the missing scope in Truesight Settings.
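
A minimal pre-flight check for the scope table above. The tool-to-scope mapping comes from the skill; how you obtain the key's granted scopes is an assumption, so adapt it to however your Truesight key exposes them:

```python
REQUIRED_SCOPES = {
    "list_review_items": "review:read",
    "flag_review_item": "review:write",
    "judge_review_item": "review:write",
    "add_reviewed_items_to_dataset": "review:write",
}

def missing_scopes(planned_tools, key_scopes):
    """Return the scopes a key lacks for the planned tool calls."""
    needed = {REQUIRED_SCOPES[t] for t in planned_tools if t in REQUIRED_SCOPES}
    return sorted(needed - set(key_scopes))
```

Running this before the workflow lets you ask the user for a new key once, instead of failing mid-loop.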

Source

git clone https://github.com/Goodeye-Labs/truesight-mcp-skills

The skill file lives at skills/review-and-promote-traces/SKILL.md in that repository.

Overview

Handles the human-judgment loop after trace evaluation. It supports an interactive Q&A protocol to collect judgments on review items, then promotes approved traces back into the dataset. This ensures only vetted traces enrich the dataset and keeps an audit trail via optional notes.

How This Skill Works

This skill sequences flagging, retrieval, and human judgments: you may flag a run, retrieve review items with list_review_items, and present items to the user one at a time using lettered options. For each item, capture judgment_value and optional notes, submit via judge_review_item, then promote approved items with add_reviewed_items_to_dataset and report counts.

When to Use It

  • When an evaluation run requires human judgment to determine correctness
  • When there are items in the review queue across multiple runs awaiting triage
  • When you need to convert judged outputs into dataset promotions for model training
  • When you must preserve an audit trail with notes for justification
  • When you need to derive or validate judgment criteria via get_result or the evaluation config

Quick Start

  1. Flag the run if needed and retrieve review items with list_review_items
  2. Present each review item to the user, capturing judgment_value and optional notes using the lettered options
  3. Submit judgments with judge_review_item for each item, promote with add_reviewed_items_to_dataset, then review the outcome

Best Practices

  • Follow the one-question-per-message protocol; avoid multi-question prompts
  • Use list_review_items with pagination to manage large queues
  • Capture both judgment_value and optional notes to preserve context
  • Group review items by run and status for efficient triage
  • Verify required scopes (review:read/write) before performing actions

Example Use Cases

  • Flag a long-running evaluation run, list review items, judge a batch of items as Pass, then promote 10 items to the dataset
  • Triage pending items across runs, using get_result for run context to decide the right options
  • Promote all judged items to the dataset via add_reviewed_items_to_dataset and check promotion counts
  • Add notes explaining why a trace was promoted or rejected
  • Generate an audit trail of judgments to satisfy compliance requirements
