What is the evaluation-framework?

A generic pattern for weighted scoring and threshold-based decision making to evaluate artifacts against configurable criteria.

How do I define criteria?

Specify each criterion with a name, a weight, a description, and a scoring_guide that maps score ranges to outcomes.

How are decisions made?

Compute the total weighted score from all criteria and apply the predefined thresholds to determine the outcome (e.g., Accept, Review, or Reject).

evaluation-framework


npx machina-cli add skill athola/claude-night-market/evaluation-framework --openclaw


Overview
When to Use
Core Pattern
1. Define Criteria
2. Score Each Criterion
3. Calculate Weighted Total
4. Apply Decision Thresholds
Quick Start
Define Your Evaluation
Example: Code Review Evaluation
Evaluation Workflow
Common Use Cases
Integration Pattern
Detailed Resources
Exit Criteria

Evaluation Framework

Overview

A generic framework for weighted scoring and threshold-based decision making. Provides reusable patterns for evaluating any artifact against configurable criteria with consistent scoring methodology.

This framework abstracts the common pattern of: define criteria → assign weights → score against criteria → apply thresholds → make decisions.

When To Use

Implementing quality gates or evaluation rubrics
Building scoring systems for artifacts, proposals, or submissions
Applying a consistent evaluation methodology across different domains
Automating threshold-based decision making
Creating assessment tools with weighted criteria

When NOT To Use

Simple pass/fail checks that do not need scoring

Core Pattern

1. Define Criteria

criteria:
  - name: criterion_name
    weight: 0.30          # 30% of total score
    description: What this measures
    scoring_guide:
      90-100: Exceptional
      70-89: Strong
      50-69: Acceptable
      30-49: Weak
      0-29: Poor

Verification: Run the command with --help flag to verify availability.
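The criterion definition above can also be held in code. The following is a minimal sketch, assuming integer scores from 0 to 100; the `Criterion` class, `label_for`, and `validate_weights` are hypothetical names chosen for illustration and are not part of the framework itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Hypothetical in-code mirror of one entry in the criteria list."""
    name: str
    weight: float  # fraction of the total score, e.g. 0.30 for 30%
    description: str
    # (low, high, label) bands copied from the scoring_guide above
    scoring_guide: tuple = (
        (90, 100, "Exceptional"),
        (70, 89, "Strong"),
        (50, 69, "Acceptable"),
        (30, 49, "Weak"),
        (0, 29, "Poor"),
    )

    def label_for(self, score: int) -> str:
        """Return the scoring-guide label for an integer score."""
        for low, high, label in self.scoring_guide:
            if low <= score <= high:
                return label
        raise ValueError(f"score {score} is outside 0-100")


def validate_weights(criteria, tolerance=1e-9):
    """Raise ValueError unless the weights sum to 1.0 (within tolerance)."""
    total = sum(c.weight for c in criteria)
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"weights sum to {total}, expected 1.0")


criteria = [
    Criterion("criterion_1", 0.30, "What this measures"),
    Criterion("criterion_2", 0.40, "Placeholder description"),
    Criterion("criterion_3", 0.30, "Placeholder description"),
]
validate_weights(criteria)
print(criteria[0].label_for(85))  # Strong
```

Checking the weight sum up front keeps a mistyped weight from silently skewing every later total.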

2. Score Each Criterion

scores = {
    "criterion_1": 85,  # Out of 100
    "criterion_2": 92,
    "criterion_3": 78,
}


3. Calculate Weighted Total

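weights = {"criterion_1": 0.30, "criterion_2": 0.40, "criterion_3": 0.30}  # must sum to 1.0
total = sum(score * weights[criterion] for criterion, score in scores.items())
# Example: (85 × 0.30) + (92 × 0.40) + (78 × 0.30) = 25.5 + 36.8 + 23.4 = 85.7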
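A slightly more defensive version of the same calculation might look like the sketch below. The `weighted_total` function name and its validation rules are assumptions made for illustration; the framework itself only specifies the weighted sum.

```python
import math


def weighted_total(scores, weights):
    """Return sum(score * weight), after basic sanity checks."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")
    for name, score in scores.items():
        if not 0 <= score <= 100:
            raise ValueError(f"{name}: score {score} is outside 0-100")
    return sum(scores[name] * weights[name] for name in scores)


scores = {"criterion_1": 85, "criterion_2": 92, "criterion_3": 78}
weights = {"criterion_1": 0.30, "criterion_2": 0.40, "criterion_3": 0.30}

total = weighted_total(scores, weights)
print(round(total, 1))  # 85.7
```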


4. Apply Decision Thresholds

thresholds:
  80-100: Accept with priority
  60-79: Accept with conditions
  40-59: Review required
  20-39: Reject with feedback
  0-19: Reject
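Weighted totals are often fractional (for example 79.5), which falls between the integer ranges listed above. One way to handle this, sketched here as an assumption rather than the framework's prescribed behavior, is to treat each band's lower bound as inclusive and look the total up with `bisect`. The names `LOWER_BOUNDS`, `DECISIONS`, and `decide` are hypothetical.

```python
from bisect import bisect_right

# Lower bound of each band in ascending order, paired with its decision.
LOWER_BOUNDS = [0, 20, 40, 60, 80]
DECISIONS = [
    "Reject",
    "Reject with feedback",
    "Review required",
    "Accept with conditions",
    "Accept with priority",
]


def decide(total):
    """Map a 0-100 total to a decision; each lower bound is inclusive."""
    if not 0 <= total <= 100:
        raise ValueError(f"total {total} is outside 0-100")
    # bisect_right counts how many lower bounds are <= total.
    return DECISIONS[bisect_right(LOWER_BOUNDS, total) - 1]


print(decide(85.7))  # Accept with priority
print(decide(79.5))  # Accept with conditions
```

Under this reading, 79.5 stays in the 60-79 band; a team that prefers rounding before lookup would make a different choice here.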


Quick Start

Define Your Evaluation

Identify criteria: What aspects matter for your domain?
Assign weights: Which criteria are most important? (Weights must sum to 1.0.)
Create scoring guides: What does each score range mean?
Set thresholds: What total scores trigger which decisions?

Example: Code Review Evaluation

criteria:
  correctness: {weight: 0.40, description: "Does the code work as intended?"}
  maintainability: {weight: 0.25, description: "Is it readable and easy to change?"}
  performance: {weight: 0.20, description: "Does it meet performance needs?"}
  testing: {weight: 0.15, description: "Are the tests thorough?"}

thresholds:
  85-100: Approve immediately
  70-84: Approve with minor feedback
  50-69: Request changes
  0-49: Reject, major issues

Verification: Run pytest -v to verify tests pass.
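To see the code review configuration in action, the sketch below runs it end to end on one set of scores. The sample scores are invented for illustration, and `review_decision` is a hypothetical helper; only the weights and threshold actions come from the configuration above.

```python
# Weights copied from the code review criteria above.
WEIGHTS = {
    "correctness": 0.40,
    "maintainability": 0.25,
    "performance": 0.20,
    "testing": 0.15,
}

# (minimum total, action) pairs, highest band first.
THRESHOLDS = [
    (85, "Approve immediately"),
    (70, "Approve with minor feedback"),
    (50, "Request changes"),
    (0, "Reject, major issues"),
]


def review_decision(scores):
    """Return (weighted total, action) for a dict of 0-100 scores."""
    total = sum(scores[name] * weight for name, weight in WEIGHTS.items())
    for minimum, action in THRESHOLDS:
        if total >= minimum:
            return total, action
    raise ValueError("total score is below 0")


# Hypothetical scores for a single pull request.
sample_scores = {
    "correctness": 90,
    "maintainability": 80,
    "performance": 70,
    "testing": 60,
}

total, action = review_decision(sample_scores)
print(f"{total:.1f} -> {action}")  # 79.0 -> Approve with minor feedback
```

Here the contributions are 36 + 20 + 14 + 9 = 79, which lands in the 70-84 band.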

Evaluation Workflow

1. Review artifact against each criterion
2. Assign 0-100 score for each criterion
3. Calculate: total = Σ(score × weight)
4. Compare total to thresholds
5. Take action based on threshold range


Common Use Cases

Quality Gates: Code review, PR approval, release readiness
Content Evaluation: Document quality, knowledge intake, skill assessment
Resource Allocation: Backlog prioritization, investment decisions, triage

Integration Pattern

# In your skill's frontmatter
dependencies: [leyline:evaluation-framework]


Then customize the framework for your domain:

Define domain-specific criteria
Set appropriate weights for your context
Establish meaningful thresholds
Document what each score range means

Detailed Resources

Scoring Patterns: See modules/scoring-patterns.md for detailed methodology
Decision Thresholds: See modules/decision-thresholds.md for threshold design

Exit Criteria

Criteria defined with clear descriptions
Weights assigned and sum to 1.0
Scoring guides documented for each criterion
Thresholds mapped to specific actions
Evaluation process documented and reproducible

Troubleshooting

Common Issues

Command not found: Ensure all dependencies are installed and in PATH.

Permission errors: Check file permissions and run with appropriate privileges.

Unexpected behavior: Enable verbose logging with the --verbose flag.

Source

https://github.com/athola/claude-night-market/blob/master/plugins/leyline/skills/evaluation-framework/SKILL.md

Overview

Evaluation Framework provides a generic pattern for scoring artifacts against configurable criteria with weighted scores and threshold-based decisions. It enables consistent evaluation across domains by defining criteria, weights, scoring guides, and automated outcomes. Use it to build quality gates, rubrics, and decision frameworks.

How This Skill Works

Define criteria with weights, score each criterion, calculate a weighted total, and apply predefined thresholds to determine outcomes. The framework supports a clear pattern (define criteria → assign weights → score → compute total → apply thresholds) with practical examples like code review evaluations and YAML-based criteria definitions.

Quick Start

Step 1: Define criteria and assign weights (ensure they sum to 1.0)
Step 2: Create scoring guides that map score ranges to qualitative labels
Step 3: Set thresholds and run the evaluation to generate a decision

Best Practices

Define clear criteria and assign weights that sum to 1.0
Provide explicit scoring guides for each criterion with score ranges
Align thresholds with the impact of decisions and edge cases
Validate the framework with sample evaluations and revisions over time
Version criteria definitions and reuse templates across projects
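One edge case worth checking when validating a framework is whether the threshold bands cover the whole 0-100 range without gaps or overlaps. The following is a small sketch of such a check, assuming integer band boundaries like those used in the examples; `check_threshold_coverage` is a hypothetical helper, not part of the skill.

```python
def check_threshold_coverage(bands, low=0, high=100):
    """Return a list of problems found in integer (start, end) bands."""
    if not bands:
        return ["no bands defined"]
    problems = []
    ordered = sorted(bands)
    if ordered[0][0] != low:
        problems.append(f"lowest band starts at {ordered[0][0]}, not {low}")
    # Compare each band with the next one in ascending order.
    for (_, end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start <= end:
            problems.append(f"overlap between {end} and {next_start}")
        elif next_start > end + 1:
            problems.append(f"gap between {end} and {next_start}")
    top = max(end for _, end in ordered)
    if top != high:
        problems.append(f"highest band ends at {top}, not {high}")
    return problems


# Bands from the code review example: no problems expected.
print(check_threshold_coverage([(85, 100), (70, 84), (50, 69), (0, 49)]))  # []

# A deliberately broken set with scores 70-74 unassigned.
print(check_threshold_coverage([(0, 49), (50, 69), (75, 100)]))
# ['gap between 69 and 75']
```

Fractional totals between two integer bands (such as 84.5) still need a stated rule, for example treating each lower bound as inclusive.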

Example Use Cases

Code Review Evaluation using weighted criteria (correctness, maintainability, performance, testing) with explicit weights
Artifact evaluation for proposals or submissions against a rubric
Quality gates in CI pipelines that auto-approve or flag items based on scores
Rubric design for research or feature proposals
Decision frameworks for gating feature flags or deployments

