evaluation-framework
npx machina-cli add skill athola/claude-night-market/evaluation-framework --openclaw
Table of Contents
- Overview
- When to Use
- Core Pattern
  - 1. Define Criteria
  - 2. Score Each Criterion
  - 3. Calculate Weighted Total
  - 4. Apply Decision Thresholds
- Quick Start
  - Define Your Evaluation
  - Example: Code Review Evaluation
- Evaluation Workflow
- Common Use Cases
- Integration Pattern
- Detailed Resources
- Exit Criteria
Evaluation Framework
Overview
A generic framework for weighted scoring and threshold-based decision making. Provides reusable patterns for evaluating any artifact against configurable criteria with consistent scoring methodology.
This framework abstracts the common pattern of: define criteria → assign weights → score against criteria → apply thresholds → make decisions.
When To Use
- Implementing quality gates or evaluation rubrics
- Building scoring systems for artifacts, proposals, or submissions
- Need consistent evaluation methodology across different domains
- Want threshold-based automated decision making
- Creating assessment tools with weighted criteria
When NOT To Use
- Simple pass/fail without scoring needs
Core Pattern
1. Define Criteria
```yaml
criteria:
  - name: criterion_name
    weight: 0.30  # 30% of total score
    description: What this measures
    scoring_guide:
      90-100: Exceptional
      70-89: Strong
      50-69: Acceptable
      30-49: Weak
      0-29: Poor
```
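To make the structure concrete, here is a minimal Python sketch (not taken from the skill itself) that holds criteria in a plain dict and guards against the pitfall called out later: weights must sum to 1.0. The `validate_weights` helper and the criterion names are illustrative assumptions.

```python
# Illustrative only: criteria held as a plain dict keyed by criterion name.
criteria = {
    "criterion_a": {"weight": 0.30, "description": "What this measures"},
    "criterion_b": {"weight": 0.40, "description": "A second aspect"},
    "criterion_c": {"weight": 0.30, "description": "A third aspect"},
}

def validate_weights(criteria: dict, tolerance: float = 1e-9) -> None:
    """Raise if criterion weights do not sum to 1.0 (within float tolerance)."""
    total = sum(c["weight"] for c in criteria.values())
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"criterion weights sum to {total}, expected 1.0")

validate_weights(criteria)  # passes: 0.30 + 0.40 + 0.30 totals 1.0
```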
2. Score Each Criterion
```python
scores = {
    "criterion_1": 85,  # out of 100
    "criterion_2": 92,
    "criterion_3": 78,
}
```
3. Calculate Weighted Total
```python
# weights maps each criterion name to its weight from step 1
total = sum(score * weights[criterion] for criterion, score in scores.items())
# Example: (85 × 0.30) + (92 × 0.40) + (78 × 0.30) = 85.7
```
4. Apply Decision Thresholds
```yaml
thresholds:
  80-100: Accept with priority
  60-79: Accept with conditions
  40-59: Review required
  20-39: Reject with feedback
  0-19: Reject
```
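As a sketch of how steps 3 and 4 combine, the snippet below keeps thresholds as (lower bound, decision) pairs and returns the first band the total reaches. The representation and the `decide` helper are assumptions for illustration, not the framework's own API; the example total comes from step 3.

```python
# Thresholds as (lower_bound, decision) pairs mirroring the ranges above,
# checked from the highest band down.
THRESHOLDS = [
    (80, "Accept with priority"),
    (60, "Accept with conditions"),
    (40, "Review required"),
    (20, "Reject with feedback"),
    (0,  "Reject"),
]

def decide(total: float, thresholds=THRESHOLDS) -> str:
    """Return the decision for the first band the total meets or exceeds."""
    for lower_bound, decision in thresholds:
        if total >= lower_bound:
            return decision
    return thresholds[-1][1]  # defensive fallback for totals below 0

print(decide(85.7))  # -> Accept with priority (weighted total from step 3)
```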
Quick Start
Define Your Evaluation
- Identify criteria: What aspects matter for your domain?
- Assign weights: Which criteria are most important? (sum to 1.0)
- Create scoring guides: What does each score range mean?
- Set thresholds: What total scores trigger which decisions?
Example: Code Review Evaluation
```yaml
criteria:
  correctness: {weight: 0.40, description: Does code work as intended?}
  maintainability: {weight: 0.25, description: Is it readable?}
  performance: {weight: 0.20, description: Meets performance needs?}
  testing: {weight: 0.15, description: Are tests thorough?}

thresholds:
  85-100: Approve immediately
  70-84: Approve with minor feedback
  50-69: Request changes
  0-49: Reject, major issues
```
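Run end to end, this example might look like the sketch below; the review scores are hypothetical, chosen only to show the arithmetic and the threshold lookup.

```python
# Weights and thresholds from the code review example; scores are made up.
weights = {"correctness": 0.40, "maintainability": 0.25,
           "performance": 0.20, "testing": 0.15}
scores = {"correctness": 90, "maintainability": 80,
          "performance": 75, "testing": 70}

total = sum(scores[name] * weight for name, weight in weights.items())
# 36.0 + 20.0 + 15.0 + 10.5 = 81.5

if total >= 85:
    decision = "Approve immediately"
elif total >= 70:
    decision = "Approve with minor feedback"
elif total >= 50:
    decision = "Request changes"
else:
    decision = "Reject, major issues"

print(f"{total:.1f} -> {decision}")  # 81.5 -> Approve with minor feedback
```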
Evaluation Workflow
1. Review artifact against each criterion
2. Assign 0-100 score for each criterion
3. Calculate: total = Σ(score × weight)
4. Compare total to thresholds
5. Take action based on threshold range
Common Use Cases
- Quality Gates: Code review, PR approval, release readiness
- Content Evaluation: Document quality, knowledge intake, skill assessment
- Resource Allocation: Backlog prioritization, investment decisions, triage
Integration Pattern
```yaml
# In your skill's frontmatter
dependencies: [leyline:evaluation-framework]
```
Then customize the framework for your domain:
- Define domain-specific criteria
- Set appropriate weights for your context
- Establish meaningful thresholds
- Document what each score range means
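For instance, a customization for document-quality review (one of the use cases above) might look like the following; the criteria names, weights, and thresholds here are illustrative assumptions rather than part of the framework.

```python
# Hypothetical domain customization: document-quality evaluation.
DOC_REVIEW = {
    "criteria": {
        "accuracy":     {"weight": 0.40, "description": "Is the content factually correct?"},
        "clarity":      {"weight": 0.30, "description": "Is it easy to follow?"},
        "completeness": {"weight": 0.20, "description": "Are the key topics covered?"},
        "formatting":   {"weight": 0.10, "description": "Does it follow style conventions?"},
    },
    # (lower bound, action) pairs, checked from the highest band down.
    "thresholds": [
        (80, "Publish"),
        (60, "Publish after light edits"),
        (40, "Return for revision"),
        (0,  "Rework from an outline"),
    ],
}
```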
Detailed Resources
- Scoring Patterns: See `modules/scoring-patterns.md` for detailed methodology
- Decision Thresholds: See `modules/decision-thresholds.md` for threshold design
Exit Criteria
- Criteria defined with clear descriptions
- Weights assigned and sum to 1.0
- Scoring guides documented for each criterion
- Thresholds mapped to specific actions
- Evaluation process documented and reproducible
Troubleshooting
Common Issues
- **Command not found**: Ensure all dependencies are installed and in PATH
- **Permission errors**: Check file permissions and run with appropriate privileges
- **Unexpected behavior**: Enable verbose logging with the `--verbose` flag
Source
git clone https://github.com/athola/claude-night-market
View on GitHub: https://github.com/athola/claude-night-market/blob/master/plugins/leyline/skills/evaluation-framework/SKILL.md
Best Practices
- Define clear criteria and assign weights that sum to 1.0
- Provide explicit scoring guides for each criterion with score ranges
- Align thresholds with the impact of decisions and edge cases
- Validate the framework with sample evaluations and revise it over time
- Version criteria definitions and reuse templates across projects
Example Use Cases
- Code Review Evaluation using weighted criteria (correctness, maintainability, performance, testing) with explicit weights
- Artifact evaluation for proposals or submissions against a rubric
- Quality gates in CI pipelines that auto-approve or flag items based on scores
- Rubric design for research or feature proposals
- Decision frameworks for gating feature flags or deployments