# evaluate

```shell
npx machina-cli add skill Q00/ouroboros/evaluate --openclaw/ouroboros:evaluate
```
Evaluate an execution session using the three-stage verification pipeline.
## Usage

```
/ouroboros:evaluate <session_id> [artifact]
```
Trigger keywords: "evaluate this", "3-stage check"
## How It Works

The evaluation pipeline runs three progressive stages:

1. **Stage 1: Mechanical Verification** ($0 cost)
   - Lint checks, build validation, test execution
   - Static analysis, coverage measurement
   - Fails fast if mechanical checks don't pass

2. **Stage 2: Semantic Evaluation** (Standard tier)
   - AC compliance assessment
   - Goal alignment scoring
   - Drift measurement
   - Reasoning explanation

3. **Stage 3: Multi-Model Consensus** (Frontier tier, optional)
   - Multiple models vote on approval
   - Only triggered by uncertainty or manual request
   - Majority ratio determines outcome
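The fail-fast control flow across the three stages can be sketched as follows. This is a minimal illustration only: the function names, the 0.7 pass threshold, and the 0.6–0.8 uncertainty band are hypothetical stand-ins, not the actual Ouroboros implementation.

```python
from dataclasses import dataclass, field

# Stand-ins for the real checks. In practice these would run lint/build/test,
# call a model-based scorer, and poll multiple models. All names here are
# hypothetical and used only to show the staged control flow.
def run_mechanical_checks(artifact: str) -> bool:
    return "FIXME" not in artifact           # toy lint rule

def score_semantics(artifact: str) -> float:
    return 0.85                              # placeholder semantic score

def collect_votes(artifact: str) -> list:
    return [True, True, False]               # placeholder model votes

@dataclass
class StageResult:
    name: str
    passed: bool
    details: dict = field(default_factory=dict)

def run_pipeline(artifact: str, trigger_consensus: bool = False):
    """Run the three stages, failing fast if mechanical checks fail."""
    results = []

    # Stage 1: mechanical verification ($0 cost) gates everything else.
    mech = StageResult("mechanical", run_mechanical_checks(artifact))
    results.append(mech)
    if not mech.passed:
        return results                       # fail fast: skip the paid stages

    # Stage 2: semantic evaluation (standard model tier).
    score = score_semantics(artifact)
    results.append(StageResult("semantic", score >= 0.7, {"score": score}))

    # Stage 3: multi-model consensus, only on uncertainty or explicit request.
    if trigger_consensus or 0.6 <= score < 0.8:
        votes = collect_votes(artifact)
        approved = sum(votes) / len(votes) > 0.5   # majority ratio decides
        results.append(StageResult("consensus", approved, {"votes": votes}))
    return results
```

The key property the sketch preserves is that a Stage 1 failure short-circuits the run, so no model-tier cost is incurred for artifacts that don't even build.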
## Instructions

When the user invokes this skill:

1. **Determine what to evaluate:**
   - If `session_id` is provided: use it directly
   - If no `session_id`: check the conversation for recent execution session IDs

2. **Gather the artifact to evaluate:**
   - If the user specifies a file: read it with the Read tool
   - If recent execution output exists in the conversation: use that
   - Ask the user if it is unclear what to evaluate

3. **Call the `ouroboros_evaluate` MCP tool:**

   ```
   Tool: ouroboros_evaluate
   Arguments:
     session_id: <session ID>
     artifact: <the code/output to evaluate>
     seed_content: <original seed YAML, if available>
     acceptance_criterion: <specific AC to check, optional>
     artifact_type: "code" (or "docs", "config")
     trigger_consensus: false (true if user requests Stage 3)
   ```

4. **Present results clearly:**
   - Show each stage's pass/fail status
   - Highlight the final approval decision
   - If rejected, explain the failure reason
   - Suggest fixes if evaluation fails
## Fallback (No MCP Server)

If the MCP server is not available, use the `ouroboros:evaluator` agent to perform a prompt-based evaluation:

- Delegate to the `ouroboros:evaluator` agent
- The agent performs a qualitative evaluation based on the seed spec
- Results are advisory (no numerical scoring without the Python core)
## Example

```
User: /ouroboros:evaluate sess-abc-123

Evaluation Results
============================================================
Final Approval: APPROVED
Highest Stage Completed: 2

Stage 1: Mechanical Verification
  [PASS] lint: No issues found
  [PASS] build: Build successful
  [PASS] test: 12/12 tests passing

Stage 2: Semantic Evaluation
  Score: 0.85
  AC Compliance: YES
  Goal Alignment: 0.90
  Drift Score: 0.08
```
## Overview

Evaluate an execution session with a three-stage verification pipeline that checks both reliability and standards compliance. It first validates mechanical integrity, then semantic alignment, and finally consults multiple models for consensus when needed.
## How This Skill Works

The pipeline runs three stages. Stage 1 (Mechanical Verification) checks lint, build validity, tests, static analysis, and coverage. Stage 2 (Semantic Evaluation) assesses AC compliance, goal alignment, and drift, with an accompanying reasoning explanation. Stage 3, triggered only by uncertainty or a manual request, uses a multi-model consensus vote decided by majority ratio. You invoke the pipeline with the `ouroboros_evaluate` tool, supplying `session_id`, `artifact`, and optional `seed_content` and `acceptance_criterion`.
## When to Use It
- Before deeper evaluation, to fail fast if mechanical checks fail
- When you need AC compliance, goal alignment and drift measurement on an artifact
- When results are uncertain and you want a multi-model consensus
- When you have a known session_id and an artifact to evaluate
- When the MCP server is unavailable and you fall back to a prompt-based evaluator
## Quick Start
- Step 1: Determine what to evaluate by using the session_id or checking recent execution session IDs
- Step 2: Gather the artifact to evaluate either by Reading a file or using recent conversation output
- Step 3: Call ouroboros_evaluate with session_id, artifact, and optional seed_content and acceptance_criterion; set artifact_type and trigger_consensus as needed
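As an illustration of Step 3, the tool's argument payload could be assembled like this. The field names follow the `ouroboros_evaluate` arguments listed earlier; the helper function itself is hypothetical, not part of the skill.

```python
def build_evaluate_args(session_id, artifact, seed_content=None,
                        acceptance_criterion=None, artifact_type="code",
                        trigger_consensus=False):
    """Assemble the ouroboros_evaluate argument payload (hypothetical helper),
    omitting optional fields that were not supplied."""
    args = {
        "session_id": session_id,
        "artifact": artifact,
        "artifact_type": artifact_type,          # "code", "docs", or "config"
        "trigger_consensus": trigger_consensus,  # True forces Stage 3
    }
    if seed_content is not None:
        args["seed_content"] = seed_content      # original seed YAML
    if acceptance_criterion is not None:
        args["acceptance_criterion"] = acceptance_criterion
    return args

# Minimal call with only the required fields:
args = build_evaluate_args("sess-abc-123", "def add(a, b): return a + b")
```

Leaving unset optional fields out of the payload, rather than sending nulls, keeps the request close to what the Usage line implies: only `session_id` is mandatory, with `artifact` resolved from the conversation when omitted.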
## Best Practices
- Always provide a valid session_id when available to anchor the evaluation
- Specify the artifact explicitly or rely on the most recent execution output
- If needed, include acceptance_criterion to focus the semantic checks
- Review Stage 1 results first before proceeding to Stage 2
- Enable Stage 3 only if uncertainty exists or the user explicitly requests it
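The last practice, gating Stage 3 on uncertainty or an explicit request, can be written as a one-line decision rule. A sketch only: the 0.6–0.8 uncertainty band is an assumed threshold, not taken from the source.

```python
def should_run_consensus(semantic_score, user_requested=False,
                         uncertainty_band=(0.6, 0.8)):
    """Trigger Stage 3 only when the user asked for it or the Stage 2
    score falls in an uncertain band (hypothetical thresholds)."""
    lo, hi = uncertainty_band
    return user_requested or (lo <= semantic_score < hi)
```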
## Example Use Cases
- Developer runs /ouroboros:evaluate sess-abc-123 after a code commit
- QA team evaluates a docs artifact for AC compliance and drift
- Data science pipeline triggers Stage 2 and Stage 3 when drift is detected
- MCP server is offline; evaluator agent performs a prompt-based evaluation
- Stage 1 passes but Stage 2 flags misalignment and triggers remediation notes