evaluate-trace

npx machina-cli add skill Goodeye-Labs/truesight-mcp-skills/evaluate-trace --openclaw
Files (1): SKILL.md (1.8 KB)

Evaluate Trace

Use this skill when the user wants to evaluate traces against an existing live evaluation endpoint.

Interactive Q&A protocol (mandatory)

If context does not make scope clear, ask one question at a time with lettered options.

Example:

Do you want to evaluate one trace or a batch?
A) One trace now
B) Small batch (up to 25)
C) Full batch loop

Rules:

  • Ask exactly one clarifying question per message.
  • Prefer lettered options.
  • Ask a single follow-up if needed, then proceed.

Workflow

  1. Identify target live evaluation:
    • If live evaluation id is unknown, call list_live_evaluations.
    • Select public_id and verify required input_columns.
  2. Prepare inputs:
    • Ensure inputs keys exactly match input_columns.
    • Include media_url for multimodal evaluations when needed.
  3. Execute evaluation:
    • Use the run_eval tool with live_evaluation_id and inputs for each trace.
  4. Return useful outputs:
    • run_id
    • per-judgment scores/outcomes
    • brief interpretation for next action
  5. Optional handoff:
    • If human judgment is needed, route to review-and-promote-traces.

Batch mode guidance

  • Use deterministic trace ordering and log run_id for each input.
  • Apply retries with stable idempotency context in caller logic if needed.
  • Summarize failures by category or threshold, then propose review handoff.
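One way to implement the batch guidance in caller logic is sketched below; run_eval is stubbed (with a simulated failure mode), and the idempotency-key scheme and field names are assumptions:

```python
import hashlib

def run_eval(live_evaluation_id, inputs, idempotency_key=None):
    # Stub standing in for the run_eval tool; it fails traces whose id ends
    # in "9" so the failure summary below has something to report.
    if inputs["trace_id"].endswith("9"):
        raise RuntimeError("judge_timeout")
    return {"run_id": f"run-{inputs['trace_id']}"}

def batch_evaluate(traces, live_evaluation_id, max_retries=2):
    results, failures = [], {}
    # Deterministic ordering: sort by trace_id before executing.
    for trace in sorted(traces, key=lambda t: t["trace_id"]):
        # Stable idempotency context derived from the trace itself, so a
        # retried call carries the same key as the original attempt.
        key = hashlib.sha256(
            f"{live_evaluation_id}:{trace['trace_id']}".encode()).hexdigest()
        for attempt in range(max_retries + 1):
            try:
                out = run_eval(live_evaluation_id, trace, idempotency_key=key)
                results.append((trace["trace_id"], out["run_id"]))  # log run_id
                break
            except RuntimeError as err:
                if attempt == max_retries:
                    # Summarize failures by category for the review handoff.
                    failures.setdefault(str(err), []).append(trace["trace_id"])
    return results, failures

ok, failed = batch_evaluate(
    [{"trace_id": "t2"}, {"trace_id": "t9"}, {"trace_id": "t1"}], "le_123")
```

The failure dict groups trace ids by error category, which maps directly onto the "summarize failures, then propose review handoff" step.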

Scopes reference

  • list_live_evaluations requires live-evaluations:read
  • run_eval requires live-evaluations:execute

If a scope error occurs, ask the user to create an API key with the missing scope in Truesight Settings.
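The scope table above can be encoded as a simple preflight check. The scope strings come from this section; the helper and the idea of checking before calling are illustrative assumptions:

```python
# Required scopes per tool, taken from the Scopes reference above.
REQUIRED_SCOPES = {
    "list_live_evaluations": "live-evaluations:read",
    "run_eval": "live-evaluations:execute",
}

def missing_scopes(tool, granted_scopes):
    """Return the scopes the API key still needs before calling `tool`."""
    needed = REQUIRED_SCOPES.get(tool)
    return [] if needed in granted_scopes else [needed]

# A key with read-only access cannot execute evaluations:
print(missing_scopes("run_eval", {"live-evaluations:read"}))
# → ['live-evaluations:execute']
```

When the check fails, the skill would ask the user to create an API key with the missing scope in Truesight Settings, per the note above.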

Source

git clone https://github.com/Goodeye-Labs/truesight-mcp-skills

View on GitHub: https://github.com/Goodeye-Labs/truesight-mcp-skills/blob/main/skills/evaluate-trace/SKILL.md

Overview

This skill evaluates one or more traces against an existing Truesight live evaluation endpoint. It returns run IDs, per-trace scores, and a brief interpretation, with an optional handoff to review-and-promote-traces for human review.

How This Skill Works

Identify the target live evaluation by listing evaluations if the id is unknown, then select the public_id and verify the required input_columns. Prepare inputs so their keys exactly match input_columns, and include media_url for multimodal evaluations when needed. Execute the evaluation using run_eval for each trace, collect run_id and per-judgment results, and provide a brief action-oriented interpretation; optionally route to review-and-promote-traces for human review.

When to Use It

  • You already have a deployed live evaluation and want to run trace outputs against it
  • Evaluating a batch of traces (up to 25) to compare against the live evaluation
  • You need per-trace scores/outcomes with a brief interpretation for next actions
  • You want an optional handoff to human review and promotion after evaluation
  • You need to confirm or retrieve the input_columns for the live evaluation before running inputs

Quick Start

  1. Step 1: Identify the target live evaluation with list_live_evaluations and select public_id
  2. Step 2: Prepare inputs to exactly match input_columns; add media_url for multimodal traces
  3. Step 3: Run evaluation with run_eval for each trace, capture run_id and outputs, and decide on review-and-promote-traces if needed

Best Practices

  • Identify the target live evaluation with list_live_evaluations before running inputs
  • Ensure inputs keys exactly match input_columns to avoid errors
  • Include media_url for multimodal evaluations when required
  • Use deterministic trace ordering and log run_id for traceability
  • Implement idempotent retries and document failures by category for review
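The "keys exactly match input_columns" practice is easy to enforce before calling run_eval. input_columns comes from the live evaluation; the helper name and example values are illustrative:

```python
def check_inputs(inputs, input_columns):
    """Report key mismatches that would make the evaluation reject the payload."""
    missing = sorted(set(input_columns) - set(inputs))
    extra = sorted(set(inputs) - set(input_columns))
    return missing, extra

# An input payload with a missing required column and a stray key:
missing, extra = check_inputs(
    {"question": "What is the refund policy?", "timestamp": "2024-01-01"},
    ["question", "answer"])
print(missing, extra)
# → ['answer'] ['timestamp']
```

Running this check per trace before execution surfaces schema drift early instead of surfacing it as per-trace failures mid-batch.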

Example Use Cases

  • Evaluate a single customer trace against a live risk evaluation and review the run_id
  • Batch evaluate 20 recent sessions against a live sentiment evaluation and prepare for promotion
  • Audit traces for model behavior with per-judgment scores and interpretation
  • Verify model outputs before handoff to a data science reviewer
  • Route evaluation results to review-and-promote-traces for human approval
