skill-evaluator
Scannednpx machina-cli add skill gotalab/skillport/skill-evaluator --openclawSkill Evaluator (WIP)
Evaluates skills against Anthropic's official best practices for agent skill authoring. Produces structured evaluation reports with scores and actionable recommendations.
Quick Start
- Read the skill's SKILL.md and understand its purpose
- Run automated validation:
scripts/validate_skill.py <skill-path> - Perform manual evaluation against criteria below
- Generate evaluation report with scores and recommendations
Evaluation Workflow
Step 1: Automated Validation
Run the validation script first:
scripts/validate_skill.py <path/to/skill>
This checks:
- SKILL.md exists with valid YAML frontmatter
- Name follows conventions (lowercase, hyphens, max 64 chars)
- Description is present and under 1024 chars
- Body is under 500 lines
- File references are one-level deep
Step 2: Manual Evaluation
Evaluate each dimension and assign a score (1-5):
A. Naming (Weight: 10%)
| Score | Criteria |
|---|---|
| 5 | Gerund form (-ing), clear purpose, memorable |
| 4 | Descriptive, follows conventions |
| 3 | Acceptable but could be clearer |
| 2 | Vague or misleading |
| 1 | Violates naming rules |
Rules: Max 64 chars, lowercase + numbers + hyphens only, no reserved words (anthropic, claude), no XML tags.
Good: processing-pdfs, analyzing-spreadsheets, building-dashboards
Bad: pdf, my-skill, ClaudeHelper, anthropic-tools
B. Description (Weight: 20%)
| Score | Criteria |
|---|---|
| 5 | Clear functionality + specific activation triggers + third person |
| 4 | Good description with some triggers |
| 3 | Adequate but missing triggers or vague |
| 2 | Too brief or unclear purpose |
| 1 | Missing or unhelpful |
Must include: What the skill does AND when to use it. Good: "Extracts text from PDFs. Use when working with PDF documents for text extraction, form parsing, or content analysis." Bad: "A skill for PDFs." or "Helps with documents."
C. Content Quality (Weight: 30%)
| Score | Criteria |
|---|---|
| 5 | Concise, assumes Claude intelligence, actionable instructions |
| 4 | Generally good, minor verbosity |
| 3 | Some unnecessary explanations or redundancy |
| 2 | Overly verbose or confusing |
| 1 | Bloated, explains obvious concepts |
Ask: "Does Claude really need this explanation?" Remove anything Claude already knows.
D. Structure & Organization (Weight: 25%)
| Score | Criteria |
|---|---|
| 5 | Excellent progressive disclosure, clear navigation, optimal length |
| 4 | Good organization, appropriate file splits |
| 3 | Acceptable but could be better organized |
| 2 | Poor organization, missing references, or bloated SKILL.md |
| 1 | No structure, everything dumped in SKILL.md |
Check:
- SKILL.md under 500 lines
- References are one-level deep (no nested chains)
- Long reference files (>100 lines) have table of contents
- Uses forward slashes in all paths
E. Degrees of Freedom (Weight: 10%)
| Score | Criteria |
|---|---|
| 5 | Perfect match: high freedom for flexible tasks, low for fragile operations |
| 4 | Generally appropriate freedom levels |
| 3 | Acceptable but could be better calibrated |
| 2 | Mismatched: too rigid or too loose |
| 1 | Completely wrong freedom level for the task type |
Guideline:
- High freedom (text): Multiple valid approaches, context-dependent
- Medium freedom (parameterized): Preferred pattern exists, some variation OK
- Low freedom (specific scripts): Fragile operations, exact sequence required
F. Anti-Pattern Check (Weight: 5%)
Deduct points for each anti-pattern found:
- Too many options without clear recommendation (-1)
- Time-sensitive information with date conditionals (-1)
- Inconsistent terminology (-1)
- Windows-style paths (backslashes) (-1)
- Deeply nested references (more than one level) (-2)
- Scripts that punt error handling to Claude (-1)
- Magic numbers without justification (-1)
Step 3: Generate Report
Use this template:
# Skill Evaluation Report: [skill-name]
## Summary
- **Overall Score**: X.X/5.0
- **Recommendation**: [Ready for publication / Needs minor improvements / Needs major revision]
## Dimension Scores
| Dimension | Score | Weight | Weighted |
|-----------|-------|--------|----------|
| Naming | X/5 | 10% | X.XX |
| Description | X/5 | 20% | X.XX |
| Content Quality | X/5 | 30% | X.XX |
| Structure | X/5 | 25% | X.XX |
| Degrees of Freedom | X/5 | 10% | X.XX |
| Anti-Patterns | X/5 | 5% | X.XX |
| **Total** | | 100% | **X.XX** |
## Strengths
- [List 2-3 things done well]
## Areas for Improvement
- [List specific issues with actionable fixes]
## Anti-Patterns Found
- [List any anti-patterns detected]
## Recommendations
1. [Priority 1 fix]
2. [Priority 2 fix]
3. [Priority 3 fix]
## Pre-Publication Checklist
- [ ] Description is specific with activation triggers
- [ ] SKILL.md under 500 lines
- [ ] One-level-deep file references
- [ ] Forward slashes in all paths
- [ ] No time-sensitive information
- [ ] Consistent terminology
- [ ] Concrete examples provided
- [ ] Scripts handle errors explicitly
- [ ] All configuration values justified
- [ ] Required packages listed
- [ ] Tested with Haiku, Sonnet, Opus
Score Interpretation
| Score Range | Rating | Action |
|---|---|---|
| 4.5 - 5.0 | Excellent | Ready for publication |
| 4.0 - 4.4 | Good | Minor improvements recommended |
| 3.0 - 3.9 | Acceptable | Several improvements needed |
| 2.0 - 2.9 | Needs Work | Major revision required |
| 1.0 - 1.9 | Poor | Fundamental redesign needed |
References
- references/evaluation-criteria.md - Detailed evaluation criteria with examples
- references/scoring-rubric.md - Complete scoring rubric and edge cases
Examples
See evaluations/ for example evaluation scenarios.
Source
git clone https://github.com/gotalab/skillport/blob/main/.skills/experimental/skill-evaluator/SKILL.mdView on GitHub Overview
The Skill Evaluator audits a skill’s SKILL.md against Anthropic’s best-practice criteria, producing structured evaluation reports with scores and actionable recommendations. It analyzes naming, description quality, content organization, and potential anti-patterns to guide quality improvements. This helps teams maintain consistent, high-quality skill documentation and behavior.
How This Skill Works
It reads the skill’s SKILL.md, runs automated validation (per the referenced validation script), and then performs a manual evaluation across labeled dimensions (A–E) with weighted scores. The output is a formal evaluation report that includes scores and concrete recommendations for improvement.
When to Use It
- When asked to review or audit a skill for quality and correctness
- When validating that SKILL.md follows naming, structure, and length conventions
- When identifying anti-patterns and improvement opportunities
- When preparing formal evaluation reports with scores and recommendations
- When ensuring references are one-level deep and total lines adhere to limits
Quick Start
- Step 1: Read the skill's SKILL.md and understand its purpose
- Step 2: Run automated validation: scripts/validate_skill.py <path/to/skill>
- Step 3: Perform manual evaluation against criteria below and generate the report
Best Practices
- Run automated validation (scripts/validate_skill.py) before starting manual review
- Enforce naming conventions: lowercase, hyphens, max 64 chars, avoid reserved words
- Ensure the description explains what the skill does and when to use it (activation triggers)
- Keep SKILL.md under 500 lines; ensure references are one-level deep; add a TOC if long
- Provide clear scores and actionable recommendations in the evaluation report
Example Use Cases
- Audited 'data-cleaning-scripts' skill; corrected naming, enhanced description triggers, and added TOC for long references
- Validated 'report-generator' skill; trimmed content to fit limits and clarified when to use
- Reviewed 'csv-analytics' skill; replaced vague wording with explicit actions and outcomes
- Assessed 'quiz-generator' skill; ensured path references are safe and consolidated references to one level
- Evaluated 'monitoring-dashboards' skill; produced actionable recommendations to improve structure and reduce redundancy