# evaluate

```shell
npx machina-cli add skill Q00/ouroboros/evaluate --openclaw/ouroboros:evaluate
```
Evaluate an execution session using the three-stage verification pipeline.
## Usage

```
/ouroboros:evaluate <session_id> [artifact]
```
Trigger keywords: "evaluate this", "3-stage check"
## How It Works

The evaluation pipeline runs three progressive stages:

1. **Stage 1: Mechanical Verification** ($0 cost)
   - Lint checks, build validation, test execution
   - Static analysis, coverage measurement
   - Fails fast if mechanical checks don't pass

2. **Stage 2: Semantic Evaluation** (Standard tier)
   - AC compliance assessment
   - Goal alignment scoring
   - Drift measurement
   - Reasoning explanation

3. **Stage 3: Multi-Model Consensus** (Frontier tier, optional)
   - Multiple models vote on approval
   - Only triggered by uncertainty or manual request
   - Majority ratio determines outcome
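The fail-fast control flow across the three stages can be sketched as follows. This is a minimal illustration only: the function names, the 0.7 pass threshold, and the 0.6–0.8 uncertainty band are hypothetical stand-ins, not the actual Ouroboros implementation.

```python
from dataclasses import dataclass, field

# Stand-ins for the real checks. In practice these would run lint/build/test,
# call a model-based scorer, and poll multiple models. All names here are
# hypothetical and used only to show the staged control flow.
def run_mechanical_checks(artifact: str) -> bool:
    return "FIXME" not in artifact           # toy lint rule

def score_semantics(artifact: str) -> float:
    return 0.85                              # placeholder semantic score

def collect_votes(artifact: str) -> list:
    return [True, True, False]               # placeholder model votes

@dataclass
class StageResult:
    name: str
    passed: bool
    details: dict = field(default_factory=dict)

def run_pipeline(artifact: str, trigger_consensus: bool = False):
    """Run the three stages, failing fast if mechanical checks fail."""
    results = []

    # Stage 1: mechanical verification ($0 cost) gates everything else.
    mech = StageResult("mechanical", run_mechanical_checks(artifact))
    results.append(mech)
    if not mech.passed:
        return results                       # fail fast: skip the paid stages

    # Stage 2: semantic evaluation (standard model tier).
    score = score_semantics(artifact)
    results.append(StageResult("semantic", score >= 0.7, {"score": score}))

    # Stage 3: multi-model consensus, only on uncertainty or explicit request.
    if trigger_consensus or 0.6 <= score < 0.8:
        votes = collect_votes(artifact)
        approved = sum(votes) / len(votes) > 0.5   # majority ratio decides
        results.append(StageResult("consensus", approved, {"votes": votes}))
    return results
```

The key property the sketch preserves is that a Stage 1 failure short-circuits the run, so no model-tier cost is incurred for artifacts that don't even build.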
## Instructions

When the user invokes this skill:

1. **Determine what to evaluate:**
   - If `session_id` is provided: use it directly
   - If no `session_id`: check the conversation for recent execution session IDs

2. **Gather the artifact to evaluate:**
   - If the user specifies a file: read it with the Read tool
   - If recent execution output exists in the conversation: use that
   - Ask the user if it is unclear what to evaluate

3. **Call the `ouroboros_evaluate` MCP tool:**

   ```
   Tool: ouroboros_evaluate
   Arguments:
     session_id: <session ID>
     artifact: <the code/output to evaluate>
     seed_content: <original seed YAML, if available>
     acceptance_criterion: <specific AC to check, optional>
     artifact_type: "code" (or "docs", "config")
     trigger_consensus: false (true if user requests Stage 3)
   ```

4. **Present results clearly:**
   - Show each stage's pass/fail status
   - Highlight the final approval decision
   - If rejected, explain the failure reason
   - Suggest fixes if evaluation fails
## Fallback (No MCP Server)

If the MCP server is not available, use the `ouroboros:evaluator` agent to perform a prompt-based evaluation:

- Delegate to the `ouroboros:evaluator` agent
- The agent performs a qualitative evaluation based on the seed spec
- Results are advisory (no numerical scoring without the Python core)
## Example

```
User: /ouroboros:evaluate sess-abc-123

Evaluation Results
============================================================
Final Approval: APPROVED
Highest Stage Completed: 2

Stage 1: Mechanical Verification
  [PASS] lint: No issues found
  [PASS] build: Build successful
  [PASS] test: 12/12 tests passing

Stage 2: Semantic Evaluation
  Score: 0.85
  AC Compliance: YES
  Goal Alignment: 0.90
  Drift Score: 0.08
```
## Overview

Evaluate an execution session with a three-stage verification pipeline that checks both reliability and standards compliance. It first validates mechanical integrity, then semantic alignment, and finally consults multiple models for consensus when needed.
## How This Skill Works

The pipeline runs three stages. Stage 1 (Mechanical Verification) checks lint, build validity, tests, static analysis, and coverage. Stage 2 (Semantic Evaluation) assesses AC compliance, goal alignment, and drift, with an accompanying reasoning explanation. Stage 3, triggered only by uncertainty or a manual request, uses a multi-model consensus vote decided by majority ratio. You invoke the pipeline with the `ouroboros_evaluate` tool, supplying `session_id`, `artifact`, and optional `seed_content` and `acceptance_criterion`.
## When to Use It
- Before deeper evaluation, to fail fast if mechanical checks fail
- When you need AC compliance, goal alignment and drift measurement on an artifact
- When results are uncertain and you want a multi-model consensus
- When you have a known session_id and an artifact to evaluate
- When the MCP server is unavailable and you fall back to a prompt-based evaluator
## Quick Start
- Step 1: Determine what to evaluate by using the session_id or checking recent execution session IDs
- Step 2: Gather the artifact to evaluate either by Reading a file or using recent conversation output
- Step 3: Call ouroboros_evaluate with session_id, artifact, and optional seed_content and acceptance_criterion; set artifact_type and trigger_consensus as needed
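As an illustration of Step 3, the tool's argument payload could be assembled like this. The field names follow the `ouroboros_evaluate` arguments listed earlier; the helper function itself is hypothetical, not part of the skill.

```python
def build_evaluate_args(session_id, artifact, seed_content=None,
                        acceptance_criterion=None, artifact_type="code",
                        trigger_consensus=False):
    """Assemble the ouroboros_evaluate argument payload (hypothetical helper),
    omitting optional fields that were not supplied."""
    args = {
        "session_id": session_id,
        "artifact": artifact,
        "artifact_type": artifact_type,          # "code", "docs", or "config"
        "trigger_consensus": trigger_consensus,  # True forces Stage 3
    }
    if seed_content is not None:
        args["seed_content"] = seed_content      # original seed YAML
    if acceptance_criterion is not None:
        args["acceptance_criterion"] = acceptance_criterion
    return args

# Minimal call with only the required fields:
args = build_evaluate_args("sess-abc-123", "def add(a, b): return a + b")
```

Leaving unset optional fields out of the payload, rather than sending nulls, keeps the request close to what the Usage line implies: only `session_id` is mandatory, with `artifact` resolved from the conversation when omitted.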
## Best Practices
- Always provide a valid session_id when available to anchor the evaluation
- Specify the artifact explicitly or rely on the most recent execution output
- If needed, include acceptance_criterion to focus the semantic checks
- Review Stage 1 results first before proceeding to Stage 2
- Enable Stage 3 only if uncertainty exists or the user explicitly requests it
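The last practice, gating Stage 3 on uncertainty or an explicit request, can be written as a one-line decision rule. A sketch only: the 0.6–0.8 uncertainty band is an assumed threshold, not taken from the source.

```python
def should_run_consensus(semantic_score, user_requested=False,
                         uncertainty_band=(0.6, 0.8)):
    """Trigger Stage 3 only when the user asked for it or the Stage 2
    score falls in an uncertain band (hypothetical thresholds)."""
    lo, hi = uncertainty_band
    return user_requested or (lo <= semantic_score < hi)
```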
## Example Use Cases
- Developer runs /ouroboros:evaluate sess-abc-123 after a code commit
- QA team evaluates a docs artifact for AC compliance and drift
- Data science pipeline triggers Stage 2 and Stage 3 when drift is detected
- MCP server is offline; evaluator agent performs a prompt-based evaluation
- Stage 1 passes but Stage 2 flags misalignment and triggers remediation notes