
evaluate

npx machina-cli add skill Q00/ouroboros/evaluate --openclaw
Files (1): SKILL.md (2.6 KB)

/ouroboros:evaluate

Evaluate an execution session using the three-stage verification pipeline.

Usage

/ouroboros:evaluate <session_id> [artifact]

Trigger keywords: "evaluate this", "3-stage check"
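As a sketch, the slash-command arguments above could be parsed like this (`parse_invocation` is a hypothetical helper for illustration; the real skill handles this parsing internally):

```python
def parse_invocation(command: str):
    """Split '/ouroboros:evaluate <session_id> [artifact]' into its parts.

    Hypothetical helper; session_id and artifact come back as None
    when omitted, matching the optional [artifact] in the usage line.
    """
    parts = command.split()
    if not parts or parts[0] != "/ouroboros:evaluate":
        raise ValueError("not an /ouroboros:evaluate invocation")
    session_id = parts[1] if len(parts) > 1 else None
    artifact = parts[2] if len(parts) > 2 else None
    return session_id, artifact
```

For example, `parse_invocation("/ouroboros:evaluate sess-abc-123")` yields the session ID with no artifact.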

How It Works

The evaluation pipeline runs three progressive stages:

  1. Stage 1: Mechanical Verification ($0 cost)

    • Lint checks, build validation, test execution
    • Static analysis, coverage measurement
    • Fails fast if mechanical checks don't pass
  2. Stage 2: Semantic Evaluation (Standard tier)

    • AC compliance assessment
    • Goal alignment scoring
    • Drift measurement
    • Reasoning explanation
  3. Stage 3: Multi-Model Consensus (Frontier tier, optional)

    • Multiple models vote on approval
    • Only triggered by uncertainty or manual request
    • Majority ratio determines outcome

Instructions

When the user invokes this skill:

  1. Determine what to evaluate:

    • If session_id provided: Use it directly
    • If no session_id: Check conversation for recent execution session IDs
  2. Gather the artifact to evaluate:

    • If user specifies a file: Read it with Read tool
    • If recent execution output exists in conversation: Use that
    • Ask user if unclear what to evaluate
  3. Call the ouroboros_evaluate MCP tool:

    Tool: ouroboros_evaluate
    Arguments:
      session_id: <session ID>
      artifact: <the code/output to evaluate>
      seed_content: <original seed YAML, if available>
      acceptance_criterion: <specific AC to check, optional>
      artifact_type: "code"  (or "docs", "config")
      trigger_consensus: false  (true if user requests Stage 3)
    
  4. Present results clearly:

    • Show each stage's pass/fail status
    • Highlight the final approval decision
    • If rejected, explain the failure reason
    • Suggest fixes if evaluation fails

Fallback (No MCP Server)

If the MCP server is not available, use the ouroboros:evaluator agent to perform a prompt-based evaluation:

  1. Delegate to ouroboros:evaluator agent
  2. The agent performs qualitative evaluation based on the seed spec
  3. Results are advisory (no numerical scoring without Python core)
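The fallback amounts to a try-the-server-first dispatch. Here is a sketch with stand-in callables; `call_mcp_tool` and `delegate_to_agent` are hypothetical placeholders for the real integrations, not actual APIs.

```python
def evaluate_with_fallback(call_mcp_tool, delegate_to_agent, payload: dict) -> dict:
    """Prefer the MCP tool; if the server is unreachable, fall back to the
    prompt-based ouroboros:evaluator agent. Agent results are advisory only."""
    try:
        result = call_mcp_tool("ouroboros_evaluate", payload)
        return {"source": "mcp", "advisory": False, "result": result}
    except ConnectionError:
        # No Python core available: qualitative evaluation, no numerical scores.
        result = delegate_to_agent("ouroboros:evaluator", payload)
        return {"source": "agent", "advisory": True, "result": result}
```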

Example

User: /ouroboros:evaluate sess-abc-123

Evaluation Results
============================================================
Final Approval: APPROVED
Highest Stage Completed: 2

Stage 1: Mechanical Verification
  [PASS] lint: No issues found
  [PASS] build: Build successful
  [PASS] test: 12/12 tests passing

Stage 2: Semantic Evaluation
  Score: 0.85
  AC Compliance: YES
  Goal Alignment: 0.90
  Drift Score: 0.08
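A report in the style shown above could be rendered with a small formatter. This is purely illustrative; the layout is inferred from the example output, not taken from the skill's actual code.

```python
def format_stage(title: str, checks: list) -> str:
    """Render one stage's pass/fail lines in the report style shown above.

    `checks` is a list of (name, passed, detail) tuples; the exact
    layout is an assumption based on the example report."""
    lines = [title]
    for name, passed, detail in checks:
        mark = "PASS" if passed else "FAIL"
        lines.append(f"  [{mark}] {name}: {detail}")
    return "\n".join(lines)
```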

Source

View on GitHub: https://github.com/Q00/ouroboros/blob/main/skills/evaluate/SKILL.md

Overview

Evaluate an execution session with a three-stage verification pipeline that ensures reliability and adherence to standards. It first validates mechanical integrity, then semantic alignment, and finally consults multiple models for consensus when needed.

How This Skill Works

The pipeline runs three stages. Stage 1, Mechanical Verification, checks lint, build validity, tests, static analysis, and coverage. Stage 2, Semantic Evaluation, assesses AC compliance, goal alignment, and drift, and explains its reasoning. Stage 3, triggered only by uncertainty or a manual request, puts the decision to a multi-model consensus vote decided by majority ratio. You invoke the pipeline through the ouroboros_evaluate tool, supplying session_id and artifact, plus the optional seed_content and acceptance_criterion.

When to Use It

  • To fail fast on cheap mechanical checks before investing in deeper evaluation
  • When you need AC compliance, goal alignment and drift measurement on an artifact
  • When results are uncertain and you want a multi-model consensus
  • When you have a known session_id and an artifact to evaluate
  • When the MCP server is unavailable and you fall back to a prompt-based evaluator

Quick Start

  1. Step 1: Determine what to evaluate by using the session_id or checking recent execution session IDs
  2. Step 2: Gather the artifact to evaluate, either by reading a file with the Read tool or by using recent execution output from the conversation
  3. Step 3: Call ouroboros_evaluate with session_id, artifact, and optional seed_content and acceptance_criterion; set artifact_type and trigger_consensus as needed

Best Practices

  • Always provide a valid session_id when available to anchor the evaluation
  • Specify the artifact explicitly or rely on the most recent execution output
  • If needed, include acceptance_criterion to focus the semantic checks
  • Review Stage 1 results first before proceeding to Stage 2
  • Enable Stage 3 only if uncertainty exists or the user explicitly requests it

Example Use Cases

  • Developer runs /ouroboros:evaluate sess-abc-123 after a code commit
  • QA team evaluates a docs artifact for AC compliance and drift
  • Data science pipeline triggers Stage 2 and Stage 3 when drift is detected
  • MCP server is offline; evaluator agent performs a prompt-based evaluation
  • Stage 1 passes but Stage 2 flags misalignment and triggers remediation notes
