creating-eval-scenarios

npx machina-cli add skill pantheon-org/tekhne/creating-eval-scenarios --openclaw

Creating Eval Scenarios

Generate evaluation scenarios that measure whether agents follow instructions from skills.

Prerequisites

Skills must be packaged in a Tessl tile (a directory with tile.json plus skill folders). If not, use the converting-skill-to-tessl-tile skill first. Ask the user where to put the tile if it's not specified.

It's possible for a tile to contain multiple skills. In that case, first split the tile into multiple tiles, one per skill.

Quick Start

Read references/scenario-generation.md before starting. It will guide you through the workflow of researching the tile and creating all the expected files in the correct formats.

Output Structure

<tile>/evals/
├── instructions.json          # JSON list of all instructions in the skill
├── summary.json               # Feasible scenarios
├── summary_infeasible.json    # Infeasible capabilities (no folders)
└── scenario-N/
    ├── task.md                # Goal description (may include inlined inputs)
    ├── criteria.json          # Scoring rubric (must sum to 100)
    └── capability.txt         # Single line: capability being tested
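A hypothetical criteria.json might look like the following. The field names and shape here are illustrative, not a confirmed schema; the one documented requirement is that the weights sum to 100:

```json
{
  "criteria": [
    { "description": "Output file is created at the expected path", "weight": 40 },
    { "description": "Generated JSON is syntactically valid", "weight": 30 },
    { "description": "All required fields are present", "weight": 30 }
  ]
}
```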

Eval Constraints

The eval is one-shot and file-based:

  • No proprietary software, API keys, or additional files are preinstalled
  • Agent receives task → produces output files in its working directory
  • Scorer reviews just the files the agent created (source code, outputs, etc.)
  • NO observation of agent process during execution
  • NO interactive back-and-forth

The eval will time out if the scenario takes too long to complete or leaves large files on disk at the end.

Mark capabilities as infeasible if they won't work in this sandbox.
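These constraints can be sanity-checked with a small script. This is a minimal sketch, not part of the skill itself: the file names come from the Output Structure section above, but the criteria.json schema (a `"criteria"` list of entries with `"weight"` fields) is an assumption for illustration.

```python
import json
from pathlib import Path

def validate_scenario(folder: Path) -> list[str]:
    """Return a list of problems found in one scenario-N folder."""
    problems = []
    # Every scenario folder needs exactly these three files.
    for name in ("task.md", "criteria.json", "capability.txt"):
        if not (folder / name).is_file():
            problems.append(f"missing {name}")
    criteria_path = folder / "criteria.json"
    if criteria_path.is_file():
        rubric = json.loads(criteria_path.read_text())
        # Assumed schema: {"criteria": [{"description": ..., "weight": ...}]}
        total = sum(item["weight"] for item in rubric.get("criteria", []))
        if total != 100:
            problems.append(f"rubric weights sum to {total}, expected 100")
    capability = folder / "capability.txt"
    if capability.is_file():
        # capability.txt must contain a single line.
        if len(capability.read_text().strip().splitlines()) != 1:
            problems.append("capability.txt is not a single line")
    return problems
```

Running this over every scenario-N folder before triggering an eval run catches rubric and layout mistakes early.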

Running Evals

Once ready, you can trigger the eval run on the Tessl platform:

tessl eval run <path/to/tile>
tessl eval view-status <status_id> --json
tessl eval list

Source

https://github.com/pantheon-org/tekhne/blob/main/.tessl/tiles/tessl-labs/tessl-skill-eval-scenarios/creating-eval-scenarios/SKILL.md (View on GitHub)

Overview

Generates evaluation content to measure how well agents follow skill instructions in a Tessl tile. It builds an inventory of all instructions, identifies feasible and infeasible capabilities, and creates scenario files that support objective scoring. This structured approach enables repeatable testing and easier publishing.

How This Skill Works

The tool parses the skill's content to enumerate every instruction, then emits instructions.json. It then creates summary.json for feasible scenarios and summary_infeasible.json for infeasible ones, and scaffolds per-scenario folders with task.md, criteria.json, and capability.txt. Finally, it ensures each scoring rubric sums to 100 and that the scenarios adhere to the one-shot, file-based eval constraints.

When to Use It

  • When you need to generate evals for a Tessl tile's skill
  • When you want to create evaluation scenarios for testing coverage
  • When you are testing this skill to validate instruction-following
  • When you want to measure skill value and publish-ready artifacts
  • When preparing for a Tessl publish and need structured evals

Quick Start

  1. Identify the target tile and skill; ensure tile.json exists
  2. Run the evaluator to generate instructions.json and the base summary
  3. Build scenario folders (scenario-N) with task.md, criteria.json, and capability.txt
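The scaffolding step can be sketched as a short script. The file layout matches the Output Structure section, but the shape of each scenario record (`task`, `criteria`, `capability` keys) and the summary.json contents are hypothetical, chosen here only for illustration:

```python
import json
from pathlib import Path

def scaffold_scenarios(tile: Path, scenarios: list[dict]) -> None:
    """Create <tile>/evals/ with one scenario-N folder per feasible scenario.

    Each entry in `scenarios` is assumed to carry 'task', 'criteria',
    and 'capability' keys (a hypothetical shape, not a confirmed schema).
    """
    evals = tile / "evals"
    evals.mkdir(parents=True, exist_ok=True)
    # summary.json records the feasible scenarios (illustrative format).
    (evals / "summary.json").write_text(
        json.dumps([s["capability"] for s in scenarios], indent=2)
    )
    for n, s in enumerate(scenarios, start=1):
        folder = evals / f"scenario-{n}"
        folder.mkdir(exist_ok=True)
        (folder / "task.md").write_text(s["task"])
        (folder / "criteria.json").write_text(json.dumps(s["criteria"], indent=2))
        (folder / "capability.txt").write_text(s["capability"] + "\n")
```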

Best Practices

  • Ensure the target tile is a proper Tessl tile (tile.json + skill folders)
  • Read references/scenario-generation.md before starting
  • Enumerate all instructions to avoid missing coverage in instructions.json
  • Label infeasible capabilities in summary_infeasible.json with clear reasoning
  • Keep evals lean: one-shot, file-based, no external dependencies

Example Use Cases

  • Evaluating a data-cleaning Tessl skill tile to verify coverage of common data patterns
  • Creating evals for a natural language task tile to confirm instruction adherence
  • Preparing a Tessl publish by generating all scenario docs and summaries
  • Measuring edge-case handling in a telemetry parsing tile
  • Validating multi-skill tiles by splitting into per-skill tiles and generating separate evals
