
evaluate-model

npx machina-cli add skill xvirobotics/metaskill/evaluate-model --openclaw
Files (1)
SKILL.md
4.8 KB

You are running model evaluation for this project. Your goal is to load a trained model checkpoint, evaluate it on the held-out test set, compute comprehensive metrics, and generate a structured report.

Dynamic Context

Current branch: !git branch --show-current
Available checkpoints: !ls checkpoints/*.pt checkpoints/*.pth 2>/dev/null || echo "No checkpoints found"
Test data: !ls data/processed/test* data/features/test* 2>/dev/null || echo "No test data found"
Latest metrics: !ls -t reports/*.json experiments/*.json 2>/dev/null | head -3 || echo "No previous metrics found"
Config files: !ls configs/*.yaml configs/*.toml 2>/dev/null || echo "No configs found"

Checkpoint Selection

If the user provided a checkpoint path as an argument, use it: $ARGUMENTS

Otherwise, find the latest checkpoint:

  1. Look for checkpoints/best_model.pt or checkpoints/best_model.pth
  2. If not found, find the most recently modified .pt or .pth file in checkpoints/
  3. If no checkpoints exist, report the error and stop
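
The selection priority above can be sketched in shell (a minimal sketch, assuming the checkpoints/ layout used elsewhere in this skill):

```shell
# Sketch of the checkpoint selection priority described above.
select_checkpoint() {
    if [ -f checkpoints/best_model.pt ]; then
        echo "checkpoints/best_model.pt"
    elif [ -f checkpoints/best_model.pth ]; then
        echo "checkpoints/best_model.pth"
    else
        # Fall back to the most recently modified .pt/.pth file, if any
        ls -t checkpoints/*.pt checkpoints/*.pth 2>/dev/null | head -1
    fi
}

CHECKPOINT_PATH="$(select_checkpoint)"
if [ -z "$CHECKPOINT_PATH" ]; then
    echo "No checkpoints found in checkpoints/" >&2
fi
```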

Evaluation Process

Step 1: Load and Verify Checkpoint

Verify the checkpoint file exists and can be loaded:

python3 -c "
import torch
ckpt = torch.load('$CHECKPOINT_PATH', map_location='cpu', weights_only=False)
print('Checkpoint keys:', list(ckpt.keys()))
print('Epoch:', ckpt.get('epoch', 'unknown'))
print('Best metric:', ckpt.get('best_metric', 'unknown'))
print('Config:', ckpt.get('config', 'not stored'))
"

Report the checkpoint metadata: epoch, stored metric, config used.

Step 2: Run Evaluation Script

Execute the evaluation:

python3 -m src.models.evaluation.evaluate \
    --checkpoint $CHECKPOINT_PATH \
    --data-dir data/features/ \
    --output-dir reports/ \
    --config configs/experiment.yaml

Alternative patterns to try if the above fails:

  • python3 src/evaluation/evaluate.py --checkpoint $CHECKPOINT_PATH
  • python3 evaluate.py --checkpoint $CHECKPOINT_PATH --test-data data/features/test.parquet

Step 3: Collect Metrics

After evaluation completes, read the metrics output. Look for the metrics JSON file:

cat reports/metrics.json 2>/dev/null || cat reports/evaluation_metrics.json 2>/dev/null

If no JSON file was generated, parse metrics from the script's stdout.
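
A small regex pass can recover metrics from stdout. The line format below ("name: value" or "name = value" pairs) is an assumption; adjust the pattern to whatever your evaluation script actually prints:

```python
import re

# Assumed stdout format: one "name: value" or "name = value" pair per line,
# e.g. "accuracy: 0.9312" or "f1 (macro) = 0.8870".
METRIC_RE = re.compile(r"^\s*([A-Za-z0-9_ ()-]+?)\s*[:=]\s*([0-9]*\.?[0-9]+)\s*$")

def parse_metrics(stdout: str) -> dict:
    """Extract numeric metrics from evaluation script output."""
    metrics = {}
    for line in stdout.splitlines():
        m = METRIC_RE.match(line)
        if m:
            metrics[m.group(1).strip().lower()] = float(m.group(2))
    return metrics

print(parse_metrics("accuracy: 0.9312\nf1 (macro) = 0.8870\ndone"))
```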

Step 4: Generate Confusion Matrix

If the evaluation script did not generate a confusion matrix plot, create one:

python3 -c "
import json
import numpy as np
from pathlib import Path

# Load metrics that include confusion matrix data
metrics_path = Path('reports/metrics.json')
if metrics_path.exists():
    metrics = json.loads(metrics_path.read_text())
    if 'confusion_matrix' in metrics:
        cm = np.array(metrics['confusion_matrix'])
        print('Confusion Matrix:')
        print(cm)
        print()
        # Per-class metrics (rows = true labels, columns = predicted labels)
        for i, row in enumerate(cm):
            precision = row[i] / max(cm[:, i].sum(), 1)
            recall = row[i] / max(row.sum(), 1)
            print(f'Class {i}: Precision={precision:.4f}, Recall={recall:.4f}')
"

Step 5: Compare with Baseline

If previous metrics exist, load and compare:

  1. Find the most recent previous metrics file (excluding the one just generated)
  2. Compute deltas for each metric
  3. Flag any metric regressions (where current is worse than previous)
  4. Highlight improvements
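
The comparison in steps 1-4 can be sketched as follows. The helper is hypothetical and assumes both metrics files are flat {"metric_name": number} JSON where higher is better for every metric:

```python
# Hypothetical baseline comparison; assumes flat {"metric": number} dicts
# loaded from the previous and current metrics JSON files, higher = better.
def compare_metrics(previous: dict, current: dict) -> dict:
    """Return per-metric deltas, flagging regressions."""
    report = {}
    for name in sorted(set(previous) & set(current)):
        prev, curr = previous[name], current[name]
        report[name] = {
            "previous": prev,
            "current": curr,
            "delta": round(curr - prev, 6),
            "regression": curr < prev,
        }
    return report

prev = {"accuracy": 0.91, "f1_macro": 0.88}
curr = {"accuracy": 0.93, "f1_macro": 0.86}
for name, row in compare_metrics(prev, curr).items():
    flag = "REGRESSION" if row["regression"] else "improved"
    print(f"{name}: {row['previous']} -> {row['current']} ({row['delta']:+.4f}, {flag})")
```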

Step 6: Generate Summary Report

Produce a structured evaluation report:

## Model Evaluation Report

### Checkpoint
- Path: [checkpoint path]
- Epoch: [epoch number]
- Training config: [config file used]

### Test Set Metrics
| Metric | Value |
|--------|-------|
| Accuracy | X.XXXX |
| Precision (macro) | X.XXXX |
| Recall (macro) | X.XXXX |
| F1 (macro) | X.XXXX |
| AUC-ROC | X.XXXX |

### Confusion Matrix
[confusion matrix table or reference to plot]

### Comparison with Previous Run
| Metric | Previous | Current | Delta |
|--------|----------|---------|-------|
| ... | ... | ... | +/- ... |

### Observations
- [Key findings about model performance]
- [Any concerning patterns in errors]
- [Recommendations for improvement]

Write this report to reports/evaluation_report.md.

Error Handling

  • If checkpoint cannot be loaded: check for PyTorch version mismatch, report the error
  • If test data is missing: report which files are expected and where to find them
  • If CUDA is not available: run evaluation on CPU (will be slower but should work)
  • If metrics computation fails: report the specific error and which metric caused it
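
For the CUDA fallback in particular, a common PyTorch pattern looks like this (the Linear layer is a hypothetical stand-in for the real checkpointed model):

```python
import torch

# Device fallback: prefer CUDA when present, otherwise run on CPU
# (slower, but evaluation still works).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the real model; replace with your checkpoint load.
model = torch.nn.Linear(4, 2).to(device).eval()

with torch.no_grad():
    batch = torch.randn(8, 4, device=device)
    logits = model(batch)
print(f"Ran on {device}; output shape {tuple(logits.shape)}")
```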

Source

View on GitHub: https://github.com/xvirobotics/metaskill/blob/main/examples/data-science/.claude/skills/evaluate-model/SKILL.md

Overview

Loads a model checkpoint, runs evaluation on the test set, and outputs a structured metrics report with a confusion matrix. This is essential after training to quantify performance or to re-evaluate a specific checkpoint.

How This Skill Works

If a checkpoint path is provided, it loads that file; otherwise it selects the latest checkpoint following the priority: best_model.pt/.pth first, then the most recently modified .pt/.pth file. It runs the evaluation script to produce reports/metrics.json (or reports/evaluation_metrics.json) and, if needed, creates a confusion matrix visualization.

When to Use It

  • After finishing model training to quantify performance on the test set
  • When re-evaluating a specific checkpoint during hyperparameter tuning
  • When you need a standardized metrics report for stakeholder reviews
  • When you want a confusion matrix to inspect class-wise errors
  • When preparing results for baseline or comparison against previous runs

Quick Start

  1. Provide a checkpoint path, or rely on the latest one, e.g. checkpoints/best_model.pt
  2. Run evaluation with data and config: python3 -m src.models.evaluation.evaluate --checkpoint $CHECKPOINT_PATH --data-dir data/features/ --output-dir reports/ --config configs/experiment.yaml
  3. Inspect results: cat reports/metrics.json or cat reports/evaluation_metrics.json; open the confusion matrix plot if one was generated

Best Practices

  • Ensure test data paths (data/processed/test* or data/features/test*) exist and are consistent with the evaluation script
  • Provide a valid checkpoint or rely on the latest one following the priority rules
  • Verify the evaluation config (configs/experiment.yaml) matches the desired experiment setup
  • Check for reports/metrics.json or reports/evaluation_metrics.json to confirm output is generated
  • Store metrics files under descriptive filenames in reports/ to facilitate baselining and comparisons

Example Use Cases

  • Evaluate the latest checkpoint after training on dataset X and generate a metrics report in reports/
  • Re-evaluate checkpoints/best_model.pt to compare against the current run
  • Create a confusion matrix plot from reports/metrics.json to assess class balance
  • Compare current metrics against a saved baseline JSON in experiments/ and highlight regressions
  • Generate a concise evaluation summary for a project pitch or publication
