evaluate-model
npx machina-cli add skill xvirobotics/metaskill/evaluate-model --openclaw

You are running model evaluation for this project. Your goal is to load a trained model checkpoint, evaluate it on the held-out test set, compute comprehensive metrics, and generate a structured report.
Dynamic Context
Current branch: !git branch --show-current
Available checkpoints: !ls checkpoints/*.pt checkpoints/*.pth 2>/dev/null || echo "No checkpoints found"
Test data: !ls data/processed/test* data/features/test* 2>/dev/null || echo "No test data found"
Latest metrics: !ls -t reports/*.json experiments/*.json 2>/dev/null | head -3 || echo "No previous metrics found"
Config files: !ls configs/*.yaml configs/*.toml 2>/dev/null || echo "No configs found"
Checkpoint Selection
If the user provided a checkpoint path as an argument, use it: $ARGUMENTS
Otherwise, find the latest checkpoint:
- Look for `checkpoints/best_model.pt` or `checkpoints/best_model.pth`
- If not found, find the most recently modified `.pt` or `.pth` file in `checkpoints/`
- If no checkpoints exist, report the error and stop
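The selection priority above can be sketched as a small helper. This is a minimal sketch assuming the directory layout described in this skill (`checkpoints/` with `.pt`/`.pth` files); `select_checkpoint` is an illustrative name, not part of the project:

```python
from pathlib import Path


def select_checkpoint(ckpt_dir="checkpoints", explicit=None):
    """Pick a checkpoint: explicit argument first, then best_model.pt/.pth,
    then the most recently modified .pt/.pth file."""
    if explicit:
        return Path(explicit)
    d = Path(ckpt_dir)
    # Priority 1: a designated best checkpoint
    for name in ("best_model.pt", "best_model.pth"):
        cand = d / name
        if cand.exists():
            return cand
    # Priority 2: newest checkpoint by modification time
    candidates = sorted(
        list(d.glob("*.pt")) + list(d.glob("*.pth")),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    if not candidates:
        raise FileNotFoundError(f"No .pt/.pth checkpoints in {d}/")
    return candidates[0]
```

Raising `FileNotFoundError` when nothing matches mirrors the "report the error and stop" rule above.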
Evaluation Process
Step 1: Load and Verify Checkpoint
Verify the checkpoint file exists and can be loaded:
python3 -c "
import torch
ckpt = torch.load('$CHECKPOINT_PATH', map_location='cpu', weights_only=False)
print('Checkpoint keys:', list(ckpt.keys()))
print('Epoch:', ckpt.get('epoch', 'unknown'))
print('Best metric:', ckpt.get('best_metric', 'unknown'))
print('Config:', ckpt.get('config', 'not stored'))
"
Report the checkpoint metadata: epoch, stored metric, config used.
Step 2: Run Evaluation Script
Execute the evaluation:
python3 -m src.models.evaluation.evaluate \
--checkpoint $CHECKPOINT_PATH \
--data-dir data/features/ \
--output-dir reports/ \
--config configs/experiment.yaml
Alternative patterns to try if the above fails:
python3 src/evaluation/evaluate.py --checkpoint $CHECKPOINT_PATH
python3 evaluate.py --checkpoint $CHECKPOINT_PATH --test-data data/features/test.parquet
Step 3: Collect Metrics
After evaluation completes, read the metrics output. Look for the metrics JSON file:
cat reports/metrics.json 2>/dev/null || cat reports/evaluation_metrics.json 2>/dev/null
If no JSON file was generated, parse metrics from the script's stdout.
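A stdout fallback can be a simple line parser. This sketch assumes the script prints metrics as `name: value` or `name = value` lines, which is an assumption about the evaluation script's output format:

```python
import re

# Matches lines like "accuracy: 0.9312" or "f1_macro = 0.8801"
METRIC_LINE = re.compile(
    r"^\s*(?P<name>[A-Za-z_][\w\- ]*?)\s*[:=]\s*(?P<value>[0-9]*\.?[0-9]+)\s*$"
)


def parse_metrics(stdout: str) -> dict:
    """Extract metric name/value pairs from evaluation stdout."""
    metrics = {}
    for line in stdout.splitlines():
        m = METRIC_LINE.match(line)
        if m:
            metrics[m.group("name").strip().lower()] = float(m.group("value"))
    return metrics
```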
Step 4: Generate Confusion Matrix
If the evaluation script did not generate a confusion matrix plot, create one:
python3 -c "
import json
import numpy as np
from pathlib import Path

# Load metrics that include confusion matrix data
metrics_path = Path('reports/metrics.json')
if metrics_path.exists():
    metrics = json.loads(metrics_path.read_text())
    if 'confusion_matrix' in metrics:
        cm = np.array(metrics['confusion_matrix'])
        print('Confusion Matrix:')
        print(cm)
        print()
        # Per-class metrics (rows = true labels, columns = predictions)
        for i, row in enumerate(cm):
            recall = row[i] / max(row.sum(), 1)
            precision = row[i] / max(cm[:, i].sum(), 1)
            print(f'Class {i}: Precision={precision:.4f}, Recall={recall:.4f}')
"
Step 5: Compare with Baseline
If previous metrics exist, load and compare:
- Find the most recent previous metrics file (excluding the one just generated)
- Compute deltas for each metric
- Flag any metric regressions (where current is worse than previous)
- Highlight improvements
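The comparison steps above reduce to a small delta computation. This sketch assumes both runs expose flat `{name: value}` metrics and that higher is better for every metric (invert the regression check for loss-style metrics); `compare_metrics` is an illustrative name:

```python
def compare_metrics(previous: dict, current: dict) -> dict:
    """Compute per-metric deltas between two runs and flag regressions.

    Assumes higher-is-better metrics: a negative delta is a regression.
    """
    report = {}
    for name in sorted(set(previous) & set(current)):
        delta = current[name] - previous[name]
        report[name] = {
            "previous": previous[name],
            "current": current[name],
            "delta": round(delta, 6),
            "regressed": delta < 0,
        }
    return report
```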
Step 6: Generate Summary Report
Produce a structured evaluation report:
## Model Evaluation Report
### Checkpoint
- Path: [checkpoint path]
- Epoch: [epoch number]
- Training config: [config file used]
### Test Set Metrics
| Metric | Value |
|--------|-------|
| Accuracy | X.XXXX |
| Precision (macro) | X.XXXX |
| Recall (macro) | X.XXXX |
| F1 (macro) | X.XXXX |
| AUC-ROC | X.XXXX |
### Confusion Matrix
[confusion matrix table or reference to plot]
### Comparison with Previous Run
| Metric | Previous | Current | Delta |
|--------|----------|---------|-------|
| ... | ... | ... | +/- ... |
### Observations
- [Key findings about model performance]
- [Any concerning patterns in errors]
- [Recommendations for improvement]
Write this report to reports/evaluation_report.md.
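Rendering the metrics table portion of the report is a straightforward formatting step; a minimal sketch (metric names passed in are whatever the evaluation produced):

```python
def metrics_table(metrics: dict) -> str:
    """Render a metrics dict as the markdown table used in the report."""
    lines = ["| Metric | Value |", "|--------|-------|"]
    for name, value in metrics.items():
        lines.append(f"| {name} | {value:.4f} |")
    return "\n".join(lines)
```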
Error Handling
- If checkpoint cannot be loaded: check for PyTorch version mismatch, report the error
- If test data is missing: report which files are expected and where to find them
- If CUDA is not available: run evaluation on CPU (will be slower but should work)
- If metrics computation fails: report the specific error and which metric caused it
Source
https://github.com/xvirobotics/metaskill/blob/main/examples/data-science/.claude/skills/evaluate-model/SKILL.md
Overview
Loads a model checkpoint, runs evaluation on the test set, and outputs a structured metrics report with a confusion matrix. This is essential after training to quantify performance or to re-evaluate a specific checkpoint.
How This Skill Works
If a checkpoint path is provided, it loads that file; otherwise it selects the latest checkpoint following the priority: best_model.pt/pth, then the most recently modified pt/pth. It runs the evaluation script to produce reports/metrics.json (or reports/evaluation_metrics.json) and, if needed, creates a confusion matrix visualization.
When to Use It
- After finishing model training to quantify performance on the test set
- When re-evaluating a specific checkpoint during hyperparameter tuning
- When you need a standardized metrics report for stakeholder reviews
- When you want a confusion matrix to inspect class-wise errors
- When preparing results for baseline or comparison against previous runs
Quick Start
- Step 1: Provide a checkpoint path (e.g. checkpoints/best_model.pt) or let the skill select the latest one
- Step 2: Run evaluation with data and config: python3 -m src.models.evaluation.evaluate --checkpoint $CHECKPOINT_PATH --data-dir data/features/ --output-dir reports/ --config configs/experiment.yaml
- Step 3: Inspect results: cat reports/metrics.json or cat reports/evaluation_metrics.json; open confusion matrix if generated
Best Practices
- Ensure test data paths (data/processed/test* or data/features/test*) exist and are consistent with the evaluation script
- Provide a valid checkpoint or rely on the latest one following the priority rules
- Verify the evaluation config (configs/experiment.yaml) matches the desired experiment setup
- Check for reports/metrics.json or reports/evaluation_metrics.json to confirm output is generated
- Store metrics files under reports/ with consistent filenames to facilitate baselining and comparisons
Example Use Cases
- Evaluate the latest checkpoint after training on dataset X and generate a metrics report in reports/
- Re-evaluate checkpoints/best_model.pt to compare against the current run
- Create a confusion matrix plot from reports/metrics.json to assess class balance
- Compare current metrics against a saved baseline JSON in experiments/ and highlight regressions
- Generate a concise evaluation summary for a project pitch or publication