creating-eval-scenarios

npx machina-cli add skill pantheon-org/tekhne/creating-eval-scenarios --openclaw

Creating Eval Scenarios

Generate evaluation scenarios that measure whether agents follow instructions from skills.

Prerequisites

Skills must be packaged in a Tessl tile (a directory with tile.json plus skill folders). If not, use the converting-skill-to-tessl-tile skill first. Ask the user where to put the tile if it's not specified.

It's possible for a tile to contain multiple skills. In that case, first split the tile into multiple tiles, one per skill.

Quick Start

Read references/scenario-generation.md before starting. It will guide you through the workflow of researching the tile and creating all the expected files in the correct formats.

Output Structure

<tile>/evals/
├── instructions.json          # JSON list of all instructions in the skill
├── summary.json               # Feasible scenarios
├── summary_infeasible.json    # Infeasible capabilities (no folders)
└── scenario-N/
    ├── task.md                # Goal description (may include inlined inputs)
    ├── criteria.json          # Scoring rubric (must sum to 100)
    └── capability.txt         # Single line: capability being tested
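A hypothetical criteria.json might look like the following. The field names and shape here are illustrative, not a confirmed schema; the one documented requirement is that the weights sum to 100:

```json
{
  "criteria": [
    { "description": "Output file is created at the expected path", "weight": 40 },
    { "description": "Generated JSON is syntactically valid", "weight": 30 },
    { "description": "All required fields are present", "weight": 30 }
  ]
}
```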

Eval Constraints

The eval is one-shot and file-based:

  • No proprietary software, API keys, or additional files are preinstalled
  • Agent receives task → produces output files in its working directory
  • Scorer reviews just the files the agent created (source code, outputs, etc.)
  • NO observation of agent process during execution
  • NO interactive back-and-forth

The eval will time out if the scenario takes too long to complete or leaves large files on disk at the end.

Mark capabilities as infeasible if they won't work in this sandbox.
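These constraints can be sanity-checked with a small script. This is a minimal sketch, not part of the skill itself: the file names come from the Output Structure section above, but the criteria.json schema (a `"criteria"` list of entries with `"weight"` fields) is an assumption for illustration.

```python
import json
from pathlib import Path

def validate_scenario(folder: Path) -> list[str]:
    """Return a list of problems found in one scenario-N folder."""
    problems = []
    # Every scenario folder needs exactly these three files.
    for name in ("task.md", "criteria.json", "capability.txt"):
        if not (folder / name).is_file():
            problems.append(f"missing {name}")
    criteria_path = folder / "criteria.json"
    if criteria_path.is_file():
        rubric = json.loads(criteria_path.read_text())
        # Assumed schema: {"criteria": [{"description": ..., "weight": ...}]}
        total = sum(item["weight"] for item in rubric.get("criteria", []))
        if total != 100:
            problems.append(f"rubric weights sum to {total}, expected 100")
    capability = folder / "capability.txt"
    if capability.is_file():
        # capability.txt must contain a single line.
        if len(capability.read_text().strip().splitlines()) != 1:
            problems.append("capability.txt is not a single line")
    return problems
```

Running this over every scenario-N folder before triggering an eval run catches rubric and layout mistakes early.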

Running Evals

Once ready, you can trigger the eval run on the Tessl platform:

tessl eval run <path/to/tile>
tessl eval view-status <status_id> --json
tessl eval list

Source

https://github.com/pantheon-org/tekhne/blob/main/.tessl/tiles/tessl-labs/tessl-skill-eval-scenarios/creating-eval-scenarios/SKILL.md (View on GitHub)

Overview

Generates evaluation content to measure how well agents follow skill instructions in a Tessl tile. It builds an inventory of all instructions, identifies feasible and infeasible capabilities, and creates scenario files that support objective scoring. This structured approach enables repeatable testing and easier publishing.

How This Skill Works

The tool parses the skill's content to enumerate every instruction, then emits instructions.json. It then creates summary.json for feasible scenarios and summary_infeasible.json for infeasible ones, and scaffolds per-scenario folders with task.md, criteria.json, and capability.txt. Finally, it ensures each scoring rubric sums to 100 and that the scenarios adhere to the one-shot, file-based eval constraints.

When to Use It

  • When you need to generate evals for a Tessl tile's skill
  • When you want to create evaluation scenarios for testing coverage
  • When you are testing this skill to validate instruction-following
  • When you want to measure skill value and publish-ready artifacts
  • When preparing for a Tessl publish and need structured evals

Quick Start

  1. Identify the target tile and skill; ensure tile.json exists
  2. Run the evaluator to generate instructions.json and the base summary
  3. Build scenario folders (scenario-N) with task.md, criteria.json, and capability.txt
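The scaffolding step can be sketched as a short script. The file layout matches the Output Structure section, but the shape of each scenario record (`task`, `criteria`, `capability` keys) and the summary.json contents are hypothetical, chosen here only for illustration:

```python
import json
from pathlib import Path

def scaffold_scenarios(tile: Path, scenarios: list[dict]) -> None:
    """Create <tile>/evals/ with one scenario-N folder per feasible scenario.

    Each entry in `scenarios` is assumed to carry 'task', 'criteria',
    and 'capability' keys (a hypothetical shape, not a confirmed schema).
    """
    evals = tile / "evals"
    evals.mkdir(parents=True, exist_ok=True)
    # summary.json records the feasible scenarios (illustrative format).
    (evals / "summary.json").write_text(
        json.dumps([s["capability"] for s in scenarios], indent=2)
    )
    for n, s in enumerate(scenarios, start=1):
        folder = evals / f"scenario-{n}"
        folder.mkdir(exist_ok=True)
        (folder / "task.md").write_text(s["task"])
        (folder / "criteria.json").write_text(json.dumps(s["criteria"], indent=2))
        (folder / "capability.txt").write_text(s["capability"] + "\n")
```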

Best Practices

  • Ensure the target tile is a proper Tessl tile (tile.json + skill folders)
  • Read references/scenario-generation.md before starting
  • Enumerate all instructions to avoid missing coverage in instructions.json
  • Label infeasible capabilities in summary_infeasible.json with clear reasoning
  • Keep evals lean: one-shot, file-based, no external dependencies

Example Use Cases

  • Evaluating a data-cleaning Tessl skill tile to verify coverage of common data patterns
  • Creating evals for a natural language task tile to confirm instruction adherence
  • Preparing a Tessl publish by generating all scenario docs and summaries
  • Measuring edge-case handling in a telemetry parsing tile
  • Validating multi-skill tiles by splitting into per-skill tiles and generating separate evals
