evaluate-trace
npx machina-cli add skill Goodeye-Labs/truesight-mcp-skills/evaluate-trace --openclaw
Evaluate Trace
Use this skill when the user wants to evaluate traces with an existing live evaluation endpoint.
Interactive Q&A protocol (mandatory)
If context does not make scope clear, ask one question at a time with lettered options.
Example:
Do you want to evaluate one trace or a batch?
A) One trace now
B) Small batch (up to 25)
C) Full batch loop
Rules:
- Ask exactly one clarifying question per message.
- Prefer lettered options.
- Ask a single follow-up if needed, then proceed.
Workflow
- Identify target live evaluation:
  - If the live evaluation id is unknown, call `list_live_evaluations`.
  - Select the `public_id` and verify the required `input_columns`.
- Prepare inputs:
  - Ensure `inputs` keys exactly match `input_columns`.
  - Include `media_url` for multimodal evaluations when needed.
- Execute evaluation:
  - Use the `run_eval` tool with `live_evaluation_id` and `inputs` for each trace.
- Return useful outputs:
  - `run_id`
  - per-judgment scores/outcomes
  - brief interpretation for next action
- Optional handoff:
  - If human judgment is needed, route to `review-and-promote-traces`.
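The workflow above can be sketched roughly as follows. Here `call_tool` is a stand-in for however your MCP client invokes Truesight tools, and the stubbed responses are invented illustrative data, not real API output:

```python
# Illustrative sketch only: replace call_tool with your real MCP client's
# tool-invocation function. The stubbed returns below are made-up examples.
def call_tool(name, args):
    if name == "list_live_evaluations":
        return [{"public_id": "le_123", "input_columns": ["question", "answer"]}]
    if name == "run_eval":
        return {"run_id": "run_456", "judgments": [{"score": 0.9, "outcome": "pass"}]}
    raise ValueError(f"unknown tool: {name}")

# Step 1: identify the target live evaluation.
evals = call_tool("list_live_evaluations", {})
target = evals[0]

# Step 2: prepare inputs whose keys exactly match input_columns.
inputs = {"question": "What is the refund policy?", "answer": "30 days."}
assert set(inputs) == set(target["input_columns"])

# Step 3: execute the evaluation and capture run_id plus per-judgment results.
result = call_tool("run_eval", {
    "live_evaluation_id": target["public_id"],
    "inputs": inputs,
})
print(result["run_id"], result["judgments"][0]["outcome"])
```

In practice the skill performs these calls through the MCP tool interface; the structure above only mirrors the ordering of the workflow steps.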
Batch mode guidance
- Use deterministic trace ordering and log `run_id` for each input.
- Apply retries with a stable idempotency context in caller logic if needed.
- Summarize failures by category or threshold, then propose review handoff.
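A caller-side batch loop following this guidance might look like the sketch below. `call_tool` is again a stand-in for your MCP client, and the `idempotency_key` field is a hypothetical convention for stable retry context, not a documented `run_eval` parameter:

```python
import hashlib

def evaluate_batch(call_tool, live_evaluation_id, traces, max_retries=3):
    """Run each trace against a live evaluation in deterministic order.

    call_tool stands in for your MCP client's tool-invocation function.
    """
    results, failures = [], []
    # Deterministic ordering: sort traces by a stable per-trace key.
    for trace in sorted(traces, key=lambda t: t["trace_id"]):
        key = hashlib.sha256(
            f"{live_evaluation_id}:{trace['trace_id']}".encode()
        ).hexdigest()
        for attempt in range(max_retries):
            try:
                out = call_tool("run_eval", {
                    "live_evaluation_id": live_evaluation_id,
                    "inputs": trace["inputs"],
                    "idempotency_key": key,  # hypothetical field; check your server's contract
                })
                # Log run_id per input for traceability.
                results.append({"trace_id": trace["trace_id"], "run_id": out["run_id"]})
                break
            except Exception as exc:
                if attempt == max_retries - 1:
                    failures.append({"trace_id": trace["trace_id"], "error": str(exc)})
    return results, failures
```

The returned `failures` list can then be grouped by error category or score threshold to decide whether a review handoff is warranted.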
Scopes reference
- `list_live_evaluations` requires `live-evaluations:read`
- `run_eval` requires `live-evaluations:execute`
If a scope error occurs, ask the user to create an API key with the missing scope in Truesight Settings.
Source
https://github.com/Goodeye-Labs/truesight-mcp-skills/blob/main/skills/evaluate-trace/SKILL.md
Overview
This skill evaluates one or more traces against an existing Truesight live evaluation endpoint. It returns run IDs, per-trace scores, and a brief interpretation, with an optional handoff to review and promote traces for human review.
How This Skill Works
Identify the target live evaluation by listing evaluations if the id is unknown, then select the `public_id` and verify the required `input_columns`. Prepare inputs so their keys exactly match `input_columns`, and include `media_url` for multimodal evaluations when needed. Execute the evaluation using `run_eval` for each trace, collect `run_id` and per-judgment results, and provide a brief action-oriented interpretation; optionally route to `review-and-promote-traces` for human review.
When to Use It
- You already have a deployed live evaluation and want to run trace outputs against it
- Evaluating a batch of traces (up to 25) to compare against the live evaluation
- You need per-trace scores/outcomes with a brief interpretation for next actions
- You want an optional handoff to human review and promotion after evaluation
- You need to confirm or retrieve the input_columns for the live evaluation before running inputs
Quick Start
- Step 1: Identify the target live evaluation with list_live_evaluations and select public_id
- Step 2: Prepare inputs to exactly match input_columns; add media_url for multimodal traces
- Step 3: Run evaluation with run_eval for each trace, capture run_id and outputs, and decide on review-and-promote-traces if needed
Best Practices
- Identify the target live evaluation with list_live_evaluations before running inputs
- Ensure inputs keys exactly match input_columns to avoid errors
- Include media_url for multimodal evaluations when required
- Use deterministic trace ordering and log run_id for traceability
- Implement idempotent retries and document failures by category for review
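One way to enforce the "keys exactly match" practice before calling `run_eval` is a small pre-flight check. This is a generic helper sketch, not part of the Truesight API:

```python
def validate_inputs(inputs, input_columns):
    """Return a list of problems if inputs keys don't exactly match input_columns.

    An empty list means the payload is safe to send.
    """
    missing = set(input_columns) - set(inputs)
    extra = set(inputs) - set(input_columns)
    problems = []
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    return problems
```

Running this before each `run_eval` call turns a server-side input error into a clear local message the agent can surface to the user.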
Example Use Cases
- Evaluate a single customer trace against a live risk evaluation and review the run_id
- Batch evaluate 20 recent sessions against a live sentiment evaluation and prepare for promotion
- Audit traces for model behavior with per-judgment scores and interpretation
- Verify model outputs before handoff to a data science reviewer
- Route evaluation results to review-and-promote-traces for human approval