ml-model-eval-benchmark
npx machina-cli add skill 0x-Professor/Agent-Skills-Hub/ml-model-eval-benchmark --openclaw
ML Model Eval Benchmark
Overview
Produce consistent model ranking outputs from metric-weighted evaluation inputs.
Workflow
- Define metric weights and accepted metric ranges.
- Ingest model metrics for each candidate.
- Compute weighted score and ranking.
- Export leaderboard and promotion recommendation.
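The workflow above can be sketched in a few lines. This is a minimal illustration, not the bundled script's actual implementation; the metric names, weights, and candidate values are all assumptions.

```python
# Illustrative weights; in practice these come from your evaluation config
# and should sum to 1.0 so scores stay on the metrics' shared scale.
WEIGHTS = {"accuracy": 0.5, "f1": 0.3, "latency_score": 0.2}

# Ingested metrics for each candidate (hypothetical values).
candidates = {
    "model_a": {"accuracy": 0.91, "f1": 0.88, "latency_score": 0.70},
    "model_b": {"accuracy": 0.89, "f1": 0.90, "latency_score": 0.85},
}

def weighted_score(metrics: dict) -> float:
    """Sum of each metric value multiplied by its weight."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

# Rank candidates by descending weighted score to form the leaderboard.
leaderboard = sorted(
    candidates, key=lambda m: weighted_score(candidates[m]), reverse=True
)
```

Here model_b leads (0.885 vs. 0.859) despite lower accuracy, because the weights reward its latency advantage.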
Use Bundled Resources
- Run scripts/benchmark_models.py to generate benchmark outputs.
- Read references/benchmarking-guide.md for weighting and tie-break guidance.
Guardrails
- Keep metric names and scales consistent across candidates.
- Record weighting assumptions in the output.
Source
git clone https://github.com/0x-Professor/Agent-Skills-Hub
View on GitHub: https://github.com/0x-Professor/Agent-Skills-Hub/blob/main/skills/ml-model-eval-benchmark/SKILL.md
Overview
ML Model Eval Benchmark produces consistent model rankings by applying metric weights to candidate evaluation inputs. It supports benchmark leaderboards and promotion decisions with transparent, score-based outputs.
How This Skill Works
Start by defining metric weights and accepted ranges, then ingest each candidate’s metrics. The system computes a weighted score and a deterministic ranking, then exports a leaderboard and a promotion recommendation.
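One way to make the ranking deterministic is to break score ties on a stable key such as the candidate name, so repeated runs always yield the same order. The specific tie-break rule below is an assumption; align it with references/benchmarking-guide.md.

```python
# Hypothetical weighted scores with a tie between model_a and model_c.
scores = {"model_c": 0.90, "model_a": 0.90, "model_b": 0.85}

# Sort by descending score, then ascending name: ties resolve identically
# on every run, which keeps the exported leaderboard reproducible.
ranked = sorted(scores, key=lambda name: (-scores[name], name))
```

With this rule, model_a is listed before model_c even though their scores are equal.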
When to Use It
- Benchmark multiple model candidates for a public or internal leaderboard.
- Decide on model promotion by comparing weighted performance across metrics.
- Provide a transparent, auditable evaluation flow for governance and stakeholders.
- Experiment with different metric weights and study their impact on rankings.
- Publish standardized results with both raw metrics and weighted scores for cross-team review.
Quick Start
- Step 1: Define metric weights and accepted ranges.
- Step 2: Ingest model metrics and run scripts/benchmark_models.py to generate outputs.
- Step 3: Review the generated leaderboard, weighted scores, and promotion recommendation.
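Between Steps 1 and 2, candidates whose metrics fall outside the accepted ranges should be flagged before scoring. A minimal sketch of that check, with illustrative ranges (not the bundled script's schema):

```python
# Accepted ranges defined in Step 1 (hypothetical values).
ACCEPTED_RANGES = {"accuracy": (0.0, 1.0), "f1": (0.0, 1.0)}

def validate(metrics: dict) -> list:
    """Return human-readable violations; an empty list means the candidate is valid."""
    problems = []
    for name, (lo, hi) in ACCEPTED_RANGES.items():
        if name not in metrics:
            problems.append(f"missing metric: {name}")
        elif not lo <= metrics[name] <= hi:
            problems.append(f"{name}={metrics[name]} outside [{lo}, {hi}]")
    return problems
```

Rejecting invalid candidates up front keeps out-of-range or missing metrics from silently distorting the weighted ranking.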
Best Practices
- Keep metric names and scales consistent across all candidates.
- Document all weighting assumptions and accepted metric ranges in outputs.
- Use the bundled script scripts/benchmark_models.py to generate outputs.
- Verify and align tie-break rules with references/benchmarking-guide.md.
- Publish both raw metrics and weighted scores for full transparency.
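Documenting weighting assumptions can be as simple as embedding the weights and raw metrics in the exported report, so reviewers can reproduce every rank from the output alone. The field names below are illustrative, not the bundled script's actual schema.

```python
import json

# A self-describing export: weights, raw metrics, and weighted scores
# travel together, so the ranking is auditable without the input files.
report = {
    "weights": {"accuracy": 0.6, "f1": 0.4},
    "leaderboard": [
        {"model": "model_a", "raw": {"accuracy": 0.91, "f1": 0.88}, "score": 0.898},
        {"model": "model_b", "raw": {"accuracy": 0.89, "f1": 0.90}, "score": 0.894},
    ],
}

# sort_keys keeps the serialized output stable across runs.
output = json.dumps(report, indent=2, sort_keys=True)
```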
Example Use Cases
- A team ranks image-classification models for a public leaderboard to decide the winner and finalists.
- A production vendor is selected based on a weighted benchmark that reflects deployment goals.
- Governance teams review benchmark outputs with documented weights and tie-break rules.
- Teams re-run benchmarks after metric changes to assess the impact on rankings and promotions.
- Organizations maintain audit logs of weighting choices for regulatory review.