
create-evaluation

npx machina-cli add skill Goodeye-Labs/truesight-mcp-skills/create-evaluation --openclaw
Files (1)
SKILL.md
8.3 KB

Create Evaluation

Run this skill when a user asks to create evals for a task, workflow, or output type.

Outcome

Produce all of the following in one flow:

  1. Scoped evaluation dimensions with clear pass/fail boundaries
  2. Deployed live eval endpoints
  3. Full runnable cURL per endpoint (must include exact live eval ID and exact API key)
  4. A generated companion skill that explains how to use the evals in the user's workflow

Default behavior

  • Prioritize non-technical scoping first.
  • Use binary evaluations by default.
  • Create separate evals per dimension by default.
  • Avoid asking implementation-detail questions unless they change product intent.
  • Infer technical defaults and execute.

Interactive Q&A protocol (mandatory)

<HARD-GATE> Do NOT call template provisioning tools, create datasets, deploy evaluations, generate cURLs, or produce a companion skill until scoping is complete and the user explicitly approves the scoped evaluation design. </HARD-GATE>

Anti-pattern: "This is obvious, skip questions"

Do not skip the interactive scoping loop, even when the use case seems simple. Fast assumption-heavy execution creates weak criteria and poor downstream behavior. Keep the dialogue short when possible, but do not skip it.

Checklist (complete in order)

You MUST complete each item in order:

  1. Initial framing -- restate the use case and intended operator outcome.
  2. Clarifying dialogue -- ask one question at a time; prefer multiple-choice when possible.
  3. Approach options -- propose 2-3 decomposition options with trade-offs and recommendation.
  4. Design approval loop -- present these sections and get approval after each section:
    • Quality dimensions
    • Pass/fail boundaries and strictness
    • Operational usage pattern (gate, rank, revise loop, monitor)
  5. Build authorization checkpoint -- ask for explicit go-ahead before any MCP build or deploy action.
  6. Implementation and verification -- execute from-scratch flow, verify, then deliver artifacts.

Dialogue rules

  • Ask exactly one clarifying question per message during scoping.
  • Prefer structured multiple-choice prompts with lettered options when practical; use open-ended questions only when needed.
  • If the user response is ambiguous, ask one follow-up question before moving forward.
  • Keep questions focused on quality intent, failure cost, and decision thresholds.

Quick trial redirect

If the user wants a quick trial or does not yet have a strong evaluation concept, route to bootstrap-template-evaluation instead of running this skill.

Use create-evaluation for from-scratch evaluation design and deployment.


Scoping workflow (high-information questions only)

Ask questions that define quality, not plumbing. Cover:

  • What is being evaluated
  • What "good" and "bad" look like
  • Highest-cost failure modes
  • Strictness preference (precision vs recall)
  • How results should be used (gating, ranking, revision loop, monitoring, etc.)

Do not ask about dataset schema, API structure, key storage, or endpoint wiring unless the user explicitly wants custom handling.

Criterion quality standard

For each proposed quality dimension:

  • Make it atomic: one dimension per criterion.
  • Use strict binary pass/fail boundaries by default.
  • Define explicit fail conditions, not just pass intent.
  • Include at least one borderline example in scoping discussion when ambiguity risk is high.
  • Prefer code-based checks for objective constraints and reserve LLM judgment for interpretive criteria.

Avoid holistic criteria like "is this good?" or "is this helpful?" without concrete boundaries.
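A concrete contrast may help. Both criterion texts below are illustrative examples, not part of the skill:

```json
[{"judgment_column": "quality", "judgment_type": "binary",
  "criterion": "Is this response good?"}]
```

Holistic: there is no boundary a judge can apply consistently. An atomic rewrite with an explicit fail condition:

```json
[{"judgment_column": "refund_policy_accuracy", "judgment_type": "binary",
  "criterion": "Pass if every refund amount and deadline stated matches the policy document. Fail if any amount, deadline, or eligibility condition is misstated, or omitted when relevant."}]
```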

Real traces first, synthetic fallback only when needed

Default to real traces from user workflows whenever available.

Only propose synthetic data generation if real traces are missing or too sparse to scope quality dimensions.

When synthetic fallback is needed:

  1. Define 2-4 dimensions of variation tied to expected failure modes.
  2. Draft tuple combinations and confirm realism with the user.
  3. Generate additional tuples and convert each to natural-language traces.
  4. Filter unrealistic traces before using them for scoping.

Synthetic traces are a bootstrap aid, not a replacement for production traces.
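The tuple-drafting steps above can be sketched as a small script. The variation dimensions and values here are hypothetical examples; the real ones come out of the scoping dialogue:

```python
from itertools import product

# Hypothetical variation dimensions tied to expected failure modes.
dimensions = {
    "question_type": ["billing", "technical", "account"],
    "ambiguity": ["clear", "vague"],
    "tone": ["calm", "frustrated"],
}

# Draft every tuple combination; show the user a sample to confirm realism
# before converting tuples into natural-language traces.
tuples = [dict(zip(dimensions, combo)) for combo in product(*dimensions.values())]

for t in tuples[:3]:
    print(t)
print(f"{len(tuples)} candidate tuples")
```

Each confirmed tuple then becomes the seed for one natural-language trace; unrealistic combinations are filtered out before scoping.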

Synthesis step

After scoping, return:

  • Proposed eval dimensions
  • Recommended number of evals and why
  • Criterion text for each eval with explicit pass/fail boundary
  • Intended usage pattern for eval outputs in downstream workflow

Get explicit user approval on the scoped design before build.

Build step (Truesight MCP)

Use Truesight MCP to implement approved evals.

For each eval:

  1. Create/upload dataset with upload_dataset or create_dataset
    • Pass input_columns and judgment_configs inline to avoid separate configure calls
    • Use idempotency_key for safe retries in agentic loops
  2. Deploy using create_and_deploy_evaluation(dataset_id)
    • CRITICAL: the full api_key is ONLY returned at creation -- capture and store it immediately
    • The live evaluation public_id is also needed for run_eval calls
  3. Verify endpoint works with a real call
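The build loop above might look like the following sketch. The client object and call signatures are hypothetical stand-ins for the Truesight MCP tools (stubbed here so the example is self-contained); the point is the shape of the loop and capturing `api_key` at creation time:

```python
import uuid

class StubMCP:
    """Hypothetical stand-in for the Truesight MCP tool interface."""

    def upload_dataset(self, rows, input_columns, judgment_configs, idempotency_key):
        return {"dataset_id": "ds_demo"}

    def create_and_deploy_evaluation(self, dataset_id):
        # The full api_key is ONLY returned by this call.
        return {"public_id": "ev_demo", "api_key": f"sk-{uuid.uuid4().hex[:8]}"}

mcp = StubMCP()

ds = mcp.upload_dataset(
    rows=[{"response": "example output"}],
    input_columns=["response"],
    judgment_configs=[{"judgment_column": "quality", "judgment_type": "binary",
                       "criterion": "Pass if the response is factually correct."}],
    idempotency_key="create-eval-quality-v1",  # safe retries in agentic loops
)
deploy = mcp.create_and_deploy_evaluation(dataset_id=ds["dataset_id"])

# Capture both values immediately; the api_key is not retrievable later.
manifest = {"dataset_id": ds["dataset_id"],
            "public_id": deploy["public_id"],
            "api_key": deploy["api_key"]}
```

The manifest dict is what later feeds the cURL construction and the deployment manifest in the final delivery.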

judgment_configs reference

Each judgment_configs entry defines one scoring dimension. Pass as a list to upload_dataset or create_dataset.

Binary (pass/fail) -- most common:

[{
  "judgment_column": "quality",
  "judgment_type": "binary",
  "criterion": "The response fully addresses the user's question without factual errors. Pass if it does, Fail if it does not."
}]

Categorical (multiple labels):

[{
  "judgment_column": "tone",
  "judgment_type": "categorical",
  "options": ["professional", "neutral", "unprofessional"],
  "criterion": "Classify the tone of the response."
}]

Continuous (numeric score):

[{
  "judgment_column": "relevance",
  "judgment_type": "continuous",
  "min_value": 0,
  "max_value": 10,
  "criterion": "Score how relevant the response is to the question, from 0 (irrelevant) to 10 (perfectly relevant)."
}]

Multiple dimensions in one dataset:

[
  {"judgment_column": "accuracy", "judgment_type": "binary", "criterion": "..."},
  {"judgment_column": "tone", "judgment_type": "categorical", "options": ["formal", "casual"], "criterion": "..."}
]

Optional fields per config:

  • notes_column (str): column where the judge writes its reasoning text. Highly recommended so the reasoning behind each judgment is captured alongside the verdict.

cURL requirement (mandatory)

For every deployed eval, construct and store the full runnable cURL using:

  • Live eval endpoint ID (public_id)
  • Its corresponding API key (api_key)

Template:

curl -sS -X POST "https://api.truesight.goodeyelabs.com/api/eval/<public_id>" \
  -H "Authorization: Bearer <api_key>" \
  -H "Content-Type: application/json" \
  -d '{"inputs": { ... }}'

You must preserve exact endpoint IDs and keys returned from deployment. No placeholders in final delivered skill unless user asked for placeholders.

Verification requirement (mandatory)

  • Execute the exact cURL written into the companion skill for each eval.
  • Confirm successful response and extractable judgment fields.
  • Report verification evidence before claiming completion.
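Extracting the judgment fields from a verification response might look like the sketch below. The response shape is an assumption for illustration; confirm the actual Truesight response format against a real call:

```python
import json

# Hypothetical example response body; real field names may differ.
raw = '{"judgments": {"quality": "pass"}, "notes": {"quality": "Addresses the question fully."}}'

body = json.loads(raw)
verdict = body["judgments"]["quality"]
reasoning = body["notes"]["quality"]

print(verdict, "-", reasoning)
```

Verification evidence should include both the extracted verdict and the judge's reasoning text, not just an HTTP 200.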

Companion skill generation

Generate a new usage skill tailored to the scoped workflow.

The companion skill must include:

  • Clear trigger description: what the eval suite does and when to use it
  • Input contract: what inputs must be provided
  • Eval execution instructions aligned to scoped usage (not hardcoded to one pattern)
  • Output parsing guidance: how to read pass/fail and reasoning
  • Full cURL blocks for every eval endpoint
  • Operator loop logic for the approved usage pattern (for example: revise-until-pass, gate-on-fail, or monitor-only)

Final delivery format

Return:

  1. Scoping summary
  2. Eval catalog (dimension + criterion + pass/fail boundary)
  3. Deployment manifest (dataset IDs, eval IDs, live eval IDs, API keys)
  4. Companion skill path
  5. Verification results for every cURL

If any verification fails, stop and return a concrete fix plan instead of marking done.

Source

git clone https://github.com/Goodeye-Labs/truesight-mcp-skills

The skill file lives at skills/create-evaluation/SKILL.md in that repository.

Overview

The Create Evaluation skill helps you scope quality criteria for an AI task, convert them into actionable binary evaluations, deploy them through Truesight MCP, and generate a companion skill that teaches you how to apply the evals in your workflow. It’s used when you want to create new evals, quality checks, guardrails, or pass/fail criteria for AI outputs.

How This Skill Works

You start by defining atomic quality dimensions with clear pass/fail boundaries. The system then provisions live eval endpoints, generates runnable cURL commands for each endpoint (with exact live eval IDs and API keys), and creates a companion skill that explains how to apply the evals in your workflow. The process enforces an interactive scoping loop before any deployment, ensuring precise criteria and alignment with product intent.

When to Use It

  • You want to create new evals for a task, workflow, or output type.
  • You need quality checks or guardrails to constrain AI outputs.
  • You require explicit pass/fail criteria and strict thresholds for evaluation.
  • You want live eval endpoints and reproducible cURL commands for audits.
  • You want a companion skill that shows how to apply the evals in your workflow.

Quick Start

  1. Step 1: Provide the task description and desired evaluation scope to define quality dimensions.
  2. Step 2: Review and approve the scoped dimensions; the system then prepares live eval endpoints and cURLs.
  3. Step 3: Retrieve deployed endpoints, runnable cURLs (with live eval IDs and API keys), and the companion skill for workflow integration.

Best Practices

  • Start with non-technical scoping first to frame intent and risk.
  • Make each criterion atomic and strictly binary with explicit fail conditions.
  • Create separate evals per dimension by default to avoid conflation.
  • Include borderline examples to reduce ambiguity in scoring.
  • Obtain explicit go-ahead before any MCP build or deployment actions.

Example Use Cases

  • Factual correctness eval for a customer support chatbot, with pass if answers match authoritative sources.
  • Content safety/guardrail evaluation to prevent disallowed topics or unsafe outputs.
  • Hallucination rate check for a summarization task, with pass when factual accuracy crosses a threshold.
  • Privacy leakage detection in data-extraction outputs, with pass for no leakage indicators.
  • Brand voice and tone compliance evaluation to ensure outputs match approved style guidelines.
