What is the Skill Evaluator?

A tool that audits a skill's SKILL.md against Anthropic's best practices, producing a structured evaluation with scores and actionable recommendations.

How are scores determined?

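Scores follow six weighted dimensions: Naming (10%), Description (20%), Content Quality (30%), Structure & Organization (25%), Degrees of Freedom (10%), and Anti-Pattern Check (5%). Evaluators assign 1–5 per dimension, with the anti-pattern score reached by deducting points for each anti-pattern found, and the weighted sum gives the overall score.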

Where can I run it or use its workflow?

Run it within your skill repository: follow the Quick Start steps to run automated validation, then perform the manual evaluation.

skill-evaluator

npx machina-cli add skill gotalab/skillport/skill-evaluator --openclaw

Skill Evaluator (WIP)

Evaluates skills against Anthropic's official best practices for agent skill authoring. Produces structured evaluation reports with scores and actionable recommendations.

Quick Start

Read the skill's SKILL.md and understand its purpose
Run automated validation: scripts/validate_skill.py <path/to/skill>
Perform manual evaluation against criteria below
Generate evaluation report with scores and recommendations

Evaluation Workflow

Step 1: Automated Validation

Run the validation script first:

scripts/validate_skill.py <path/to/skill>

This checks:

SKILL.md exists with valid YAML frontmatter
Name follows conventions (lowercase, hyphens, max 64 chars)
Description is present and under 1024 chars
Body is under 500 lines
File references are one-level deep
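
The checks listed above can be sketched in a few lines of Python. This is only an illustration of the kind of logic involved, not the actual contents of scripts/validate_skill.py: the function names `split_frontmatter` and `check_skill_text` are hypothetical, the frontmatter parser handles only flat `key: value` lines rather than full YAML, the limits are treated as exclusive because the list says "under", and the file-reference depth check is left out.

```python
import re

MAX_NAME_CHARS = 64
MAX_DESCRIPTION_CHARS = 1024
MAX_BODY_LINES = 500


def split_frontmatter(text):
    """Return (metadata dict, body text), or (None, text) if no frontmatter block is found.

    Only flat `key: value` lines are parsed; this is not a full YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, text
    # Find the closing '---' delimiter after the opening one.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return None, text
    meta = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])


def check_skill_text(text):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    meta, body = split_frontmatter(text)
    if meta is None:
        return ["SKILL.md has no valid frontmatter block"]
    problems = []
    name = meta.get("name", "")
    if not re.fullmatch(r"[a-z0-9-]+", name) or len(name) > MAX_NAME_CHARS:
        problems.append("name must use lowercase letters, digits, and hyphens (max 64 chars)")
    description = meta.get("description", "")
    if not description:
        problems.append("description is missing")
    elif len(description) >= MAX_DESCRIPTION_CHARS:
        problems.append("description is not under 1024 chars")
    if len(body.splitlines()) >= MAX_BODY_LINES:
        problems.append("body is not under 500 lines")
    return problems
```

Returning a list of problems instead of stopping at the first failure lets a single run surface every issue for the report.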

Step 2: Manual Evaluation

Evaluate each dimension and assign a score (1-5):

A. Naming (Weight: 10%)

Score	Criteria
5	Gerund form (-ing), clear purpose, memorable
4	Descriptive, follows conventions
3	Acceptable but could be clearer
2	Vague or misleading
1	Violates naming rules

Rules: Max 64 chars, lowercase + numbers + hyphens only, no reserved words (anthropic, claude), no XML tags.

Good: processing-pdfs, analyzing-spreadsheets, building-dashboards

Bad: pdf, my-skill, ClaudeHelper, anthropic-tools
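
The hard naming rules are mechanical enough to check in code, while the 1–5 score remains a judgment call. The sketch below is one possible way to separate the two; `name_rule_violations` and `looks_like_gerund` are hypothetical helper names, and the gerund test is a rough heuristic (first hyphen-separated word ends in "ing"), not a linguistic check.

```python
import re

RESERVED_WORDS = ("anthropic", "claude")
MAX_NAME_CHARS = 64


def name_rule_violations(name):
    """Return the hard-rule violations for a skill name (empty list if none)."""
    violations = []
    if len(name) > MAX_NAME_CHARS:
        violations.append("longer than 64 characters")
    # Anything outside a-z, 0-9, and '-' fails, which also rules out XML tags.
    if not re.fullmatch(r"[a-z0-9-]+", name):
        violations.append("only lowercase letters, numbers, and hyphens are allowed")
    lowered = name.lower()
    for word in RESERVED_WORDS:
        if word in lowered:
            violations.append(f"contains reserved word '{word}'")
    return violations


def looks_like_gerund(name):
    """Heuristic only: does the first hyphen-separated word end in 'ing'?"""
    return name.split("-")[0].endswith("ing")


for candidate in ["processing-pdfs", "pdf", "my-skill", "ClaudeHelper", "anthropic-tools"]:
    print(candidate, name_rule_violations(candidate), looks_like_gerund(candidate))
```

Note that pdf and my-skill pass the hard rules; they are listed as bad because they are vague, which is what the lower scores in the table capture.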

B. Description (Weight: 20%)

Score	Criteria
5	Clear functionality + specific activation triggers + third person
4	Good description with some triggers
3	Adequate but missing triggers or vague
2	Too brief or unclear purpose
1	Missing or unhelpful

Must include: What the skill does AND when to use it.

Good: "Extracts text from PDFs. Use when working with PDF documents for text extraction, form parsing, or content analysis."

Bad: "A skill for PDFs." or "Helps with documents."
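
A quick pre-screen can flag descriptions that probably lack a "when to use it" clause before the manual review. The sketch below is a heuristic under stated assumptions: the function name `description_flags` and both regular expressions are illustrative choices, and a description that passes still needs a human judgment for the 1–5 score.

```python
import re

# Assumed trigger phrasing such as "Use when ..." or "Use this skill for ...".
TRIGGER_PATTERN = re.compile(r"\buse (this )?(skill )?(when|for|if)\b", re.IGNORECASE)
# First- or second-person pronouns suggest the description is not in third person.
PERSON_PATTERN = re.compile(r"\b(I|[Ww]e|[Yy]ou|[Yy]our|[Mm]y)\b")


def description_flags(description):
    """Return heuristic warnings for a skill description (empty list if none)."""
    text = description.strip()
    if not text:
        return ["description is missing"]
    flags = []
    if not TRIGGER_PATTERN.search(text):
        flags.append("no activation trigger found")
    if PERSON_PATTERN.search(text):
        flags.append("may not be written in third person")
    return flags


print(description_flags("Extracts text from PDFs. Use when working with PDF documents."))
print(description_flags("A skill for PDFs."))
```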

C. Content Quality (Weight: 30%)

Score	Criteria
5	Concise, assumes Claude intelligence, actionable instructions
4	Generally good, minor verbosity
3	Some unnecessary explanations or redundancy
2	Overly verbose or confusing
1	Bloated, explains obvious concepts

Ask: "Does Claude really need this explanation?" Remove anything Claude already knows.

D. Structure & Organization (Weight: 25%)

Score	Criteria
5	Excellent progressive disclosure, clear navigation, optimal length
4	Good organization, appropriate file splits
3	Acceptable but could be better organized
2	Poor organization, missing references, or bloated SKILL.md
1	No structure, everything dumped in SKILL.md

Check:

SKILL.md under 500 lines
References are one-level deep (no nested chains)
Long reference files (>100 lines) have table of contents
Uses forward slashes in all paths
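
Several of these structure checks can be approximated mechanically. The following is a minimal sketch, assuming references are written as Markdown links and that the caller has already loaded each reference file into a dict; `structure_issues` is a hypothetical name, and treating any link from a reference file to another .md file as a nested chain is a simplification.

```python
import re

# Captures the target of Markdown links like [label](path/to/file.md).
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def structure_issues(skill_md, reference_files):
    """Check SKILL.md text plus a {path: text} dict of the files it references."""
    issues = []
    if len(skill_md.splitlines()) >= 500:
        issues.append("SKILL.md is not under 500 lines")
    for target in LINK_PATTERN.findall(skill_md):
        if "\\" in target:
            issues.append(f"{target}: uses backslashes instead of forward slashes")
    for path, text in reference_files.items():
        lowered = text.lower()
        has_toc = "table of contents" in lowered or "## contents" in lowered
        if len(text.splitlines()) > 100 and not has_toc:
            issues.append(f"{path}: over 100 lines without a table of contents")
        # A reference file that links to further local .md files is a nested chain.
        nested = [t for t in LINK_PATTERN.findall(text)
                  if t.endswith(".md") and not t.startswith("http")]
        if nested:
            issues.append(f"{path}: links to further files (more than one level deep)")
    return issues
```

Passing file contents in as a dict keeps the function free of filesystem access, so it can be exercised with small in-memory fixtures.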

E. Degrees of Freedom (Weight: 10%)

Score	Criteria
5	Perfect match: high freedom for flexible tasks, low for fragile operations
4	Generally appropriate freedom levels
3	Acceptable but could be better calibrated
2	Mismatched: too rigid or too loose
1	Completely wrong freedom level for the task type

Guideline:

High freedom (text): Multiple valid approaches, context-dependent
Medium freedom (parameterized): Preferred pattern exists, some variation OK
Low freedom (specific scripts): Fragile operations, exact sequence required

F. Anti-Pattern Check (Weight: 5%)

Deduct points for each anti-pattern found:

Too many options without clear recommendation (-1)
Time-sensitive information with date conditionals (-1)
Inconsistent terminology (-1)
Windows-style paths (backslashes) (-1)
Deeply nested references (more than one level) (-2)
Scripts that punt error handling to Claude (-1)
Magic numbers without justification (-1)
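
The deductions above do not say what score they are subtracted from or what happens when they add up to more than the scale allows. The sketch below makes two assumptions explicit: the starting score is 5 and the result is floored at 1, so it stays on the same 1–5 scale as the other dimensions. It also counts each anti-pattern type once, however many times it occurs. The dictionary keys and the function name `anti_pattern_score` are hypothetical labels.

```python
# Deduction per anti-pattern type, taken from the list above.
DEDUCTIONS = {
    "too_many_options": 1,
    "time_sensitive_info": 1,
    "inconsistent_terminology": 1,
    "windows_paths": 1,
    "nested_references": 2,
    "punts_error_handling": 1,
    "magic_numbers": 1,
}


def anti_pattern_score(found, start=5, floor=1):
    """Subtract deductions for each distinct anti-pattern found, never going below `floor`.

    `start=5` and `floor=1` are assumptions, not values stated in the rubric.
    """
    total_deduction = sum(DEDUCTIONS[name] for name in set(found))
    return max(floor, start - total_deduction)


print(anti_pattern_score([]))                                      # 5
print(anti_pattern_score(["windows_paths", "nested_references"]))  # 2
```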

Step 3: Generate Report

Use this template:

# Skill Evaluation Report: [skill-name]

## Summary
- **Overall Score**: X.X/5.0
- **Recommendation**: [Ready for publication / Needs minor improvements / Needs major revision]

## Dimension Scores
| Dimension | Score | Weight | Weighted |
|-----------|-------|--------|----------|
| Naming | X/5 | 10% | X.XX |
| Description | X/5 | 20% | X.XX |
| Content Quality | X/5 | 30% | X.XX |
| Structure | X/5 | 25% | X.XX |
| Degrees of Freedom | X/5 | 10% | X.XX |
| Anti-Patterns | X/5 | 5% | X.XX |
| **Total** | | 100% | **X.XX** |

## Strengths
- [List 2-3 things done well]

## Areas for Improvement
- [List specific issues with actionable fixes]

## Anti-Patterns Found
- [List any anti-patterns detected]

## Recommendations
1. [Priority 1 fix]
2. [Priority 2 fix]
3. [Priority 3 fix]

## Pre-Publication Checklist
- [ ] Description is specific with activation triggers
- [ ] SKILL.md under 500 lines
- [ ] One-level-deep file references
- [ ] Forward slashes in all paths
- [ ] No time-sensitive information
- [ ] Consistent terminology
- [ ] Concrete examples provided
- [ ] Scripts handle errors explicitly
- [ ] All configuration values justified
- [ ] Required packages listed
- [ ] Tested with Haiku, Sonnet, Opus
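
The Weighted column and Total row of the template's Dimension Scores table follow from multiplying each score by its weight and summing. A minimal sketch of that arithmetic, where `weighted_total` is a hypothetical helper and rounding to two decimals is an assumption based on the X.XX placeholders:

```python
# Weights as listed in the Dimension Scores table of the template.
WEIGHTS = {
    "Naming": 0.10,
    "Description": 0.20,
    "Content Quality": 0.30,
    "Structure": 0.25,
    "Degrees of Freedom": 0.10,
    "Anti-Patterns": 0.05,
}


def weighted_total(scores):
    """Return the weighted overall score for a {dimension: score} dict."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return round(sum(scores[dim] * weight for dim, weight in WEIGHTS.items()), 2)


example = {
    "Naming": 4,
    "Description": 5,
    "Content Quality": 4,
    "Structure": 3,
    "Degrees of Freedom": 4,
    "Anti-Patterns": 5,
}
# 0.40 + 1.00 + 1.20 + 0.75 + 0.40 + 0.25 = 4.00
print(weighted_total(example))
```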

Score Interpretation

Score Range	Rating	Action
4.5 - 5.0	Excellent	Ready for publication
4.0 - 4.4	Good	Minor improvements recommended
3.0 - 3.9	Acceptable	Several improvements needed
2.0 - 2.9	Needs Work	Major revision required
1.0 - 1.9	Poor	Fundamental redesign needed
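
The table above can be turned into a lookup. Because the ranges are written to one decimal place, a two-decimal total such as 4.45 falls between rows; the sketch below resolves this by treating each row's lower bound as the threshold, which is an assumption rather than something the table states. The name `interpret_score` is hypothetical.

```python
# (lower bound, rating, action), ordered from highest band to lowest.
BANDS = [
    (4.5, "Excellent", "Ready for publication"),
    (4.0, "Good", "Minor improvements recommended"),
    (3.0, "Acceptable", "Several improvements needed"),
    (2.0, "Needs Work", "Major revision required"),
    (1.0, "Poor", "Fundamental redesign needed"),
]


def interpret_score(score):
    """Map an overall score in [1.0, 5.0] to its (rating, action) pair."""
    if not 1.0 <= score <= 5.0:
        raise ValueError("score must be between 1.0 and 5.0")
    for lower_bound, rating, action in BANDS:
        if score >= lower_bound:
            return rating, action


print(interpret_score(4.45))  # ('Good', 'Minor improvements recommended')
```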

References

references/evaluation-criteria.md - Detailed evaluation criteria with examples
references/scoring-rubric.md - Complete scoring rubric and edge cases

Examples

See evaluations/ for example evaluation scenarios.

Source

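git clone https://github.com/gotalab/skillport

The skill is located at .skills/experimental/skill-evaluator/SKILL.md in that repository.

View on GitHub: https://github.com/gotalab/skillport/blob/main/.skills/experimental/skill-evaluator/SKILL.md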

Overview

The Skill Evaluator audits a skill’s SKILL.md against Anthropic’s best-practice criteria, producing structured evaluation reports with scores and actionable recommendations. It analyzes naming, description quality, content organization, and potential anti-patterns to guide quality improvements. This helps teams maintain consistent, high-quality skill documentation and behavior.

How This Skill Works

It reads the skill’s SKILL.md, runs automated validation (scripts/validate_skill.py), and then performs a manual evaluation across six labeled dimensions (A–F) with weighted scores. The output is a formal evaluation report that includes scores and concrete recommendations for improvement.

When to Use It

When asked to review or audit a skill for quality and correctness
When validating that SKILL.md follows naming, structure, and length conventions
When identifying anti-patterns and improvement opportunities
When preparing formal evaluation reports with scores and recommendations
When checking that references are one level deep and that SKILL.md stays under the 500-line limit

Quick Start

Step 1: Read the skill's SKILL.md and understand its purpose
Step 2: Run automated validation: scripts/validate_skill.py <path/to/skill>
Step 3: Perform manual evaluation against criteria below and generate the report

Best Practices

Run automated validation (scripts/validate_skill.py) before starting manual review
Enforce naming conventions: lowercase, hyphens, max 64 chars, avoid reserved words
Ensure the description explains what the skill does and when to use it (activation triggers)
Keep SKILL.md under 500 lines; ensure references are one-level deep; add a TOC if long
Provide clear scores and actionable recommendations in the evaluation report

Example Use Cases

Audited 'data-cleaning-scripts' skill; corrected naming, enhanced description triggers, and added TOC for long references
Validated 'report-generator' skill; trimmed content to fit limits and clarified when to use
Reviewed 'csv-analytics' skill; replaced vague wording with explicit actions and outcomes
Assessed 'quiz-generator' skill; ensured path references are safe and consolidated references to one level
Evaluated 'monitoring-dashboards' skill; produced actionable recommendations to improve structure and reduce redundancy
