evaluate-trace

npx machina-cli add skill Goodeye-Labs/truesight-mcp-skills/evaluate-trace --openclaw
Files (1): SKILL.md (1.8 KB)

Evaluate Trace

Use this skill when the user wants to evaluate traces against an existing live evaluation endpoint.

Interactive Q&A protocol (mandatory)

If context does not make scope clear, ask one question at a time with lettered options.

Example:

Do you want to evaluate one trace or a batch?
A) One trace now
B) Small batch (up to 25)
C) Full batch loop

Rules:

  • Ask exactly one clarifying question per message.
  • Prefer lettered options.
  • Ask a single follow-up if needed, then proceed.

Workflow

  1. Identify target live evaluation:
    • If live evaluation id is unknown, call list_live_evaluations.
    • Select public_id and verify required input_columns.
  2. Prepare inputs:
    • Ensure inputs keys exactly match input_columns.
    • Include media_url for multimodal evaluations when needed.
  3. Execute evaluation:
    • Use the run_eval tool with live_evaluation_id and inputs for each trace.
  4. Return useful outputs:
    • run_id
    • per-judgment scores/outcomes
    • brief interpretation for next action
  5. Optional handoff:
    • If human judgment is needed, route to review-and-promote-traces.

Batch mode guidance

  • Use deterministic trace ordering and log run_id for each input.
  • Apply retries with stable idempotency context in caller logic if needed.
  • Summarize failures by category or threshold, then propose review handoff.
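One way to implement the batch guidance in caller logic is sketched below; run_eval is stubbed (with a simulated failure mode), and the idempotency-key scheme and field names are assumptions:

```python
import hashlib

def run_eval(live_evaluation_id, inputs, idempotency_key=None):
    # Stub standing in for the run_eval tool; it fails traces whose id ends
    # in "9" so the failure summary below has something to report.
    if inputs["trace_id"].endswith("9"):
        raise RuntimeError("judge_timeout")
    return {"run_id": f"run-{inputs['trace_id']}"}

def batch_evaluate(traces, live_evaluation_id, max_retries=2):
    results, failures = [], {}
    # Deterministic ordering: sort by trace_id before executing.
    for trace in sorted(traces, key=lambda t: t["trace_id"]):
        # Stable idempotency context derived from the trace itself, so a
        # retried call carries the same key as the original attempt.
        key = hashlib.sha256(
            f"{live_evaluation_id}:{trace['trace_id']}".encode()).hexdigest()
        for attempt in range(max_retries + 1):
            try:
                out = run_eval(live_evaluation_id, trace, idempotency_key=key)
                results.append((trace["trace_id"], out["run_id"]))  # log run_id
                break
            except RuntimeError as err:
                if attempt == max_retries:
                    # Summarize failures by category for the review handoff.
                    failures.setdefault(str(err), []).append(trace["trace_id"])
    return results, failures

ok, failed = batch_evaluate(
    [{"trace_id": "t2"}, {"trace_id": "t9"}, {"trace_id": "t1"}], "le_123")
```

The failure dict groups trace ids by error category, which maps directly onto the "summarize failures, then propose review handoff" step.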

Scopes reference

  • list_live_evaluations requires live-evaluations:read
  • run_eval requires live-evaluations:execute

If a scope error occurs, ask the user to create an API key with the missing scope in Truesight Settings.
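The scope table above can be encoded as a simple preflight check. The scope strings come from this section; the helper and the idea of checking before calling are illustrative assumptions:

```python
# Required scopes per tool, taken from the Scopes reference above.
REQUIRED_SCOPES = {
    "list_live_evaluations": "live-evaluations:read",
    "run_eval": "live-evaluations:execute",
}

def missing_scopes(tool, granted_scopes):
    """Return the scopes the API key still needs before calling `tool`."""
    needed = REQUIRED_SCOPES.get(tool)
    return [] if needed in granted_scopes else [needed]

# A key with read-only access cannot execute evaluations:
print(missing_scopes("run_eval", {"live-evaluations:read"}))
# → ['live-evaluations:execute']
```

When the check fails, the skill would ask the user to create an API key with the missing scope in Truesight Settings, per the note above.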

Source

git clone https://github.com/Goodeye-Labs/truesight-mcp-skills

View on GitHub: https://github.com/Goodeye-Labs/truesight-mcp-skills/blob/main/skills/evaluate-trace/SKILL.md

Overview

This skill evaluates one or more traces against an existing Truesight live evaluation endpoint. It returns run IDs, per-trace scores, and a brief interpretation, with an optional handoff to review-and-promote-traces for human review.

How This Skill Works

Identify the target live evaluation by listing evaluations if the id is unknown, then select the public_id and verify the required input_columns. Prepare inputs so their keys exactly match input_columns, and include media_url for multimodal evaluations when needed. Execute the evaluation using run_eval for each trace, collect run_id and per-judgment results, and provide a brief action-oriented interpretation; optionally route to review-and-promote-traces for human review.

When to Use It

  • You already have a deployed live evaluation and want to run trace outputs against it
  • Evaluating a batch of traces (up to 25) to compare against the live evaluation
  • You need per-trace scores/outcomes with a brief interpretation for next actions
  • You want an optional handoff to human review and promotion after evaluation
  • You need to confirm or retrieve the input_columns for the live evaluation before running inputs

Quick Start

  1. Step 1: Identify the target live evaluation with list_live_evaluations and select public_id
  2. Step 2: Prepare inputs to exactly match input_columns; add media_url for multimodal traces
  3. Step 3: Run evaluation with run_eval for each trace, capture run_id and outputs, and decide on review-and-promote-traces if needed

Best Practices

  • Identify the target live evaluation with list_live_evaluations before running inputs
  • Ensure inputs keys exactly match input_columns to avoid errors
  • Include media_url for multimodal evaluations when required
  • Use deterministic trace ordering and log run_id for traceability
  • Implement idempotent retries and document failures by category for review
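The "keys exactly match input_columns" practice is easy to enforce before calling run_eval. input_columns comes from the live evaluation; the helper name and example values are illustrative:

```python
def check_inputs(inputs, input_columns):
    """Report key mismatches that would make the evaluation reject the payload."""
    missing = sorted(set(input_columns) - set(inputs))
    extra = sorted(set(inputs) - set(input_columns))
    return missing, extra

# An input payload with a missing required column and a stray key:
missing, extra = check_inputs(
    {"question": "What is the refund policy?", "timestamp": "2024-01-01"},
    ["question", "answer"])
print(missing, extra)
# → ['answer'] ['timestamp']
```

Running this check per trace before execution surfaces schema drift early instead of surfacing it as per-trace failures mid-batch.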

Example Use Cases

  • Evaluate a single customer trace against a live risk evaluation and review the run_id
  • Batch evaluate 20 recent sessions against a live sentiment evaluation and prepare for promotion
  • Audit traces for model behavior with per-judgment scores and interpretation
  • Verify model outputs before handoff to a data science reviewer
  • Route evaluation results to review-and-promote-traces for human approval
