skills-eval
npx machina-cli add skill athola/claude-night-market/skills-eval --openclaw
Skills Evaluation and Improvement
Overview
This framework audits Claude skills against quality standards to improve performance and reduce token consumption. Automated tools analyze skill structure, measure context usage, and identify specific technical improvements. Run verification commands after each audit to confirm fixes work correctly.
The skills-auditor provides structural analysis, while the improvement-suggester ranks fixes by impact. Compliance is verified through the compliance-checker. Runtime efficiency is monitored by tool-performance-analyzer and token-usage-tracker.
Quick Start
Basic Audit
Run a full audit of all skills or target a specific file to identify structural issues.
# Audit all skills
make audit-all
# Audit specific skill
make audit-skill TARGET=path/to/skill/SKILL.md
Analysis and Optimization
Use skill_analyzer.py for complexity checks and token_estimator.py to verify the context budget.
make analyze-skill TARGET=path/to/skill/SKILL.md
make estimate-tokens TARGET=path/to/skill/SKILL.md
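The implementation of token_estimator.py is not shown in this document. As a rough illustration of what a context-budget check involves, here is a minimal sketch that assumes the common rule of thumb of about four characters per token for English text. The names `estimate_tokens` and `within_budget`, and any budget value you pass in, are hypothetical and not part of the project.

```python
import math

# Assumption: ~4 characters per token, a rule of thumb for English text.
# This is NOT a real tokenizer and not the logic of token_estimator.py.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def within_budget(text: str, budget: int) -> bool:
    """Return True if the estimate does not exceed the (hypothetical) budget."""
    return estimate_tokens(text) <= budget
```

A character-count heuristic is only useful for spotting skills that are far over budget; for exact numbers, rely on the project's own estimate-tokens target.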
Improvements
Generate a prioritized plan and verify standards compliance using improvement_suggester.py and compliance_checker.py.
make improve-skill TARGET=path/to/skill/SKILL.md
make check-compliance TARGET=path/to/skill/SKILL.md
Evaluation Workflow
Start with make audit-all to inventory skills and identify high-priority targets. For each skill requiring attention, run analysis with analyze-skill to map complexity. Generate an improvement plan, apply fixes, and run check-compliance to verify the skill meets project standards. Finalize by checking the token budget for efficiency.
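The workflow above can be sketched as a small driver script. This is an illustration only: the make targets come from this document, but the function names `build_workflow` and `run_workflow` and the dry-run default are assumptions. Applying fixes is a manual step between improve-skill and check-compliance, so the sketch prints the commands by default instead of running them back to back.

```python
import subprocess
from typing import List

# Make targets named in this document, in the order the workflow describes.
# Fixes are applied by hand between improve-skill and check-compliance.
PER_SKILL_TARGETS = [
    "analyze-skill",
    "improve-skill",
    "check-compliance",
    "estimate-tokens",
]


def build_workflow(skill_paths: List[str]) -> List[List[str]]:
    """Return the make invocations for one evaluation pass."""
    commands = [["make", "audit-all"]]
    for path in skill_paths:
        for target in PER_SKILL_TARGETS:
            commands.append(["make", target, f"TARGET={path}"])
    return commands


def run_workflow(skill_paths: List[str], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_workflow(skill_paths):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pass at the first failing target.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_workflow(["path/to/skill/SKILL.md"])
```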
Evaluation and Optimization
Quality assessments use the skills-auditor and improvement-suggester to generate detailed reports. Performance analysis focuses on token efficiency through the token-usage-tracker and tool performance via tool-performance-analyzer. For standards compliance, the compliance-checker automates common fixes for structural issues.
Scoring and Prioritization
We evaluate skills across five dimensions: structure compliance, content quality, token efficiency, activation reliability, and tool integration. Scores above 90 represent production-ready skills, while scores below 50 indicate critical issues requiring immediate attention.
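To make the thresholds concrete, here is a minimal sketch of how a score could be mapped to a status. The document does not say how the five dimensions are combined, so the equal-weight average is an assumption, as are the function names and the "needs-improvement" label for the unnamed middle band.

```python
from typing import Dict

# The five dimensions listed above, as hypothetical dictionary keys.
DIMENSIONS = (
    "structure_compliance",
    "content_quality",
    "token_efficiency",
    "activation_reliability",
    "tool_integration",
)


def overall_score(scores: Dict[str, float]) -> float:
    """Average the five dimension scores (equal weighting is an assumption)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


def classify(score: float) -> str:
    """Map a score to a status using the thresholds stated in the text."""
    if score > 90:
        return "production-ready"
    if score < 50:
        return "critical"
    # The document does not name this band; the label is hypothetical.
    return "needs-improvement"
```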
Improvements are prioritized by impact. Critical issues include security vulnerabilities or broken functionality. High-priority items cover structural flaws that hinder discoverability. Medium and low priorities focus on best practices and minor optimizations.
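One way to represent this ordering in code is an ordered enum plus a stable sort. The `Priority` and `Finding` types below are hypothetical and only illustrate the four levels described above; they do not reflect the data model used by improvement_suggester.py.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Priority(IntEnum):
    """Lower value means the finding is handled first."""
    CRITICAL = 0  # security vulnerabilities, broken functionality
    HIGH = 1      # structural flaws that hinder discoverability
    MEDIUM = 2    # best practices
    LOW = 3       # minor optimizations


@dataclass
class Finding:
    skill: str
    description: str
    priority: Priority


def prioritize(findings: List[Finding]) -> List[Finding]:
    """Sort by priority; sorted() is stable, so ties keep their input order."""
    return sorted(findings, key=lambda f: f.priority)
```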
Structural Patterns
Deprecated: skills/shared/modules/ directories. Shared modules must be relocated into the consuming skill's own modules/ directory. The evaluator flags any remaining skills/shared/ directory as a structural warning.
Current: Each skill owns its modules at skills/<skill-name>/modules/. Cross-skill references use relative paths (e.g., ../skill-authoring/modules/anti-rationalization.md).
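A check for the deprecated layout could look like the sketch below. It only illustrates the rule; the evaluator's actual detection logic is not shown in this document, and `find_shared_dirs` is a hypothetical name.

```python
from pathlib import Path
from typing import List


def find_shared_dirs(root: Path) -> List[Path]:
    """Return every directory named 'shared' that sits directly under 'skills'."""
    return sorted(
        p
        for p in root.rglob("shared")
        if p.is_dir() and p.parent.name == "skills"
    )


if __name__ == "__main__":
    # Report each deprecated directory found below the current directory.
    for path in find_shared_dirs(Path(".")):
        print(f"structural warning: deprecated directory {path}")
```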
Resources
Shared Modules: Cross-Skill Patterns
- Anti-Rationalization Patterns: See anti-rationalization.md
- Enforcement Language: See enforcement-language.md
- Trigger Patterns: See trigger-patterns.md
Skill-Specific Modules
- Trigger Isolation Analysis: See modules/trigger-isolation-analysis.md
- Skill Authoring Best Practices: See modules/skill-authoring-best-practices.md
- Authoring Checklist: See modules/authoring-checklist.md
- Evaluation Workflows: See modules/evaluation-workflows.md
- Quality Metrics: See modules/quality-metrics.md
- Advanced Tool Use Analysis: See modules/advanced-tool-use-analysis.md
- Evaluation Framework: See modules/evaluation-framework.md
- Integration Patterns: See modules/integration.md
- Troubleshooting: See modules/troubleshooting.md
- Pressure Testing: See modules/pressure-testing.md
- Integration Testing: See modules/integration-testing.md
- Multi-Metric Evaluation: See modules/multi-metric-evaluation-methodology.md
- Performance Benchmarking: See modules/performance-benchmarking.md
Tools and Automation
- Tools: Executable analysis utilities in the scripts/ directory.
- Automation: Setup and validation scripts in scripts/automation/.
Source
git clone https://github.com/athola/claude-night-market/blob/master/plugins/abstract/skills/skills-eval/SKILL.md
Overview
Skills-eval audits Claude skills against established quality standards to improve performance and reduce token consumption. It leverages automated tools to assess structure, metadata quality, token efficiency, and tool integration, guiding improvement planning before production.
How This Skill Works
Run an audit (make audit-all or make audit-skill) with skills-auditor to map structure and context usage. The improvement-suggester ranks fixes by impact, while compliance-checker and token-usage-tracker verify changes and monitor runtime efficiency, culminating in a production-ready skill.
When to Use It
- Reviewing skill quality during QA or pre-release
- Preparing skills for production and shipping
- Auditing existing skills to identify token waste and performance gaps
- Generating improvement plans and validation for modular-design or integration tests
- Compliance reporting and performance benchmarking
Quick Start
- Step 1: Run a full audit with make audit-all, or audit a single skill with make audit-skill TARGET=path/to/skill/SKILL.md
- Step 2: Analyze complexity and estimate tokens with make analyze-skill TARGET=path/to/skill/SKILL.md and make estimate-tokens TARGET=path/to/skill/SKILL.md
- Step 3: Generate an improvement plan and verify with make improve-skill TARGET=path/to/skill/SKILL.md and make check-compliance TARGET=path/to/skill/SKILL.md
Best Practices
- Run a full audit with make audit-all before any release
- Use analyze-skill and estimate-tokens to understand complexity and token budgets
- Prioritize fixes with improvement-suggester based on impact
- Verify fixes with check-compliance and token-usage-tracker
- Re-audit and document results, including performance metrics
Example Use Cases
- Audited a large Claude skill and cut token usage by reorganizing context and modularizing prompts.
- Applied top fixes from improvement-suggester and achieved higher structure compliance.
- Validated integration with tools and SDK using the compliance-checker.
- Generated a compliance report and fixed structural issues flagged by the auditor.
- Benchmarked performance with tool-performance-analyzer and token-usage-tracker after fixes.