
ml-model-eval-benchmark

npx machina-cli add skill 0x-Professor/Agent-Skills-Hub/ml-model-eval-benchmark --openclaw

ML Model Eval Benchmark

Overview

Produce consistent model ranking outputs from metric-weighted evaluation inputs.

Workflow

  1. Define metric weights and accepted metric ranges.
  2. Ingest model metrics for each candidate.
  3. Compute weighted score and ranking.
  4. Export leaderboard and promotion recommendation.
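
The workflow above can be sketched in Python. The metric names, weights, and values are illustrative assumptions, not part of the skill itself:

```python
# Step 1: define metric weights and accepted ranges (names are illustrative).
weights = {"accuracy": 0.5, "latency_score": 0.3, "robustness": 0.2}
accepted_ranges = {m: (0.0, 1.0) for m in weights}  # all metrics normalized to [0, 1]

# Step 2: ingest candidate metrics.
candidates = {
    "model-a": {"accuracy": 0.91, "latency_score": 0.70, "robustness": 0.80},
    "model-b": {"accuracy": 0.88, "latency_score": 0.85, "robustness": 0.78},
}

def weighted_score(metrics):
    # Validate each metric against its accepted range before scoring.
    for name, value in metrics.items():
        lo, hi = accepted_ranges[name]
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} outside accepted range [{lo}, {hi}]")
    return sum(weights[name] * metrics[name] for name in weights)

# Step 3: compute weighted scores and a deterministic ranking
# (descending score, then alphabetical name as a tie-break).
leaderboard = sorted(
    ((name, weighted_score(m)) for name, m in candidates.items()),
    key=lambda item: (-item[1], item[0]),
)
```

Here `model-b` ranks first (0.851 vs. 0.825) because its latency advantage outweighs the accuracy gap under these weights.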

Use Bundled Resources

  • Run scripts/benchmark_models.py to generate benchmark outputs.
  • Read references/benchmarking-guide.md for weighting and tie-break guidance.

Guardrails

  • Keep metric names and scales consistent across candidates.
  • Record weighting assumptions in the output.

Source

git clone https://github.com/0x-Professor/Agent-Skills-Hub.git

The skill file lives at skills/ml-model-eval-benchmark/SKILL.md in the repository.

Overview

ML Model Eval Benchmark produces consistent model rankings by applying metric weights to candidate evaluation inputs. It supports benchmark leaderboards and promotion decisions with transparent, score-based outputs.

How This Skill Works

Start by defining metric weights and accepted ranges, then ingest each candidate’s metrics. The system computes a weighted score and a deterministic ranking, then exports a leaderboard and a promotion recommendation.
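
A minimal sketch of what the promotion-recommendation step might look like, assuming a simple margin-over-incumbent rule. The rule, threshold, and field names are assumptions for illustration, not the skill's documented logic:

```python
# Hypothetical promotion rule: promote the top-ranked candidate only if
# it beats the incumbent's score by at least `margin`.
def promotion_recommendation(leaderboard, incumbent_score, margin=0.02):
    top_name, top_score = leaderboard[0]
    if top_score >= incumbent_score + margin:
        return {"promote": True, "candidate": top_name, "score": top_score}
    return {"promote": False, "candidate": None, "score": top_score}

# Example leaderboard: (name, weighted score) pairs, already ranked.
leaderboard = [("model-b", 0.851), ("model-a", 0.825)]
rec = promotion_recommendation(leaderboard, incumbent_score=0.80)
```

With an incumbent score of 0.80 and a 0.02 margin, `model-b` clears the bar and is recommended for promotion.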

When to Use It

  • Benchmark multiple model candidates for a public or internal leaderboard.
  • Decide on model promotion by comparing weighted performance across metrics.
  • Provide a transparent, auditable evaluation flow for governance and stakeholders.
  • Experiment with different metric weights and study their impact on rankings.
  • Publish standardized results with both raw metrics and weighted scores for cross-team review.

Quick Start

  1. Define metric weights and accepted metric ranges.
  2. Ingest model metrics and run scripts/benchmark_models.py to generate outputs.
  3. Review the generated leaderboard, weighted scores, and promotion recommendation.

Best Practices

  • Keep metric names and scales consistent across all candidates.
  • Document all weighting assumptions and accepted metric ranges in outputs.
  • Use the bundled script scripts/benchmark_models.py to generate outputs.
  • Verify and align tie-break rules with references/benchmarking-guide.md.
  • Publish both raw metrics and weighted scores for full transparency.
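
One way to publish both raw metrics and weighted scores together, per the practice above, is a single structured record. The field names here are hypothetical, not the skill's output schema:

```python
import json

# Hypothetical published record pairing raw metrics with the weighting
# assumptions and the resulting weighted score, for auditability.
record = {
    "model": "model-a",
    "raw_metrics": {"accuracy": 0.91, "latency_score": 0.70},
    "weights": {"accuracy": 0.6, "latency_score": 0.4},
    "weighted_score": round(0.6 * 0.91 + 0.4 * 0.70, 4),
}
report = json.dumps(record, indent=2, sort_keys=True)
```

Because the weights ship alongside the raw metrics, reviewers can recompute the score and study alternative weightings without re-running the benchmark.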

Example Use Cases

  • A team ranks image-classification models for a public leaderboard to decide the winner and finalists.
  • A production vendor is selected based on a weighted benchmark that reflects deployment goals.
  • Governance teams review benchmark outputs with documented weights and tie-break rules.
  • Re-run benchmarks after metric changes to assess impact on rankings and promotions.
  • Organizations maintain audit logs of weighting choices for regulatory review.

