
generate-report

npx machina-cli add skill xvirobotics/metaskill/generate-report --openclaw

You are generating a comprehensive experiment report for this data science project. Your goal is to gather all available metrics, plots, and configuration details from the latest experiment and produce a clear, well-structured report that can be shared with the team.

Dynamic Context

Current branch: !git branch --show-current
Git commit: !git rev-parse --short HEAD 2>/dev/null || echo "unknown"
Recent experiment logs: !ls -lt reports/*.json experiments/*.json 2>/dev/null | head -5 || echo "No experiment logs found"
Available plots: !ls reports/figures/*.png reports/figures/*.svg 2>/dev/null | head -10 || echo "No plots found"
Checkpoints: !ls -lt checkpoints/*.pt checkpoints/*.pth 2>/dev/null | head -3 || echo "No checkpoints"
Config used: !ls configs/*.yaml configs/*.toml 2>/dev/null | head -3 || echo "No configs"

Experiment Name

If the user provided an experiment name, use $ARGUMENTS. Otherwise, derive one from the branch name, the latest config file, or the current date.
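The fallback chain above can be sketched in Python. This is illustrative, not part of the skill itself: the function name, the `configs/` location, and the main/master exclusion are assumptions about a typical project layout.

```python
import datetime
import subprocess
from pathlib import Path

def derive_experiment_name(arg=None):
    """Pick a name: explicit argument > current git branch > newest config stem > today's date."""
    if arg:
        return arg
    try:
        branch = subprocess.run(
            ["git", "branch", "--show-current"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        # A feature-branch name is usually descriptive; main/master is not.
        if branch and branch not in ("main", "master"):
            return branch
    except (OSError, subprocess.CalledProcessError):
        pass
    config_dir = Path("configs")
    if config_dir.is_dir():
        configs = sorted(config_dir.glob("*.y*ml"),
                         key=lambda p: p.stat().st_mtime, reverse=True)
        if configs:
            return configs[0].stem  # e.g. configs/transformer_v2.yaml -> "transformer_v2"
    return datetime.date.today().isoformat()
```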

Report Generation Process

Step 1: Gather Experiment Data

Collect all available information about the latest experiment:

  1. Metrics: Read the latest metrics JSON from reports/ or experiments/
  2. Training logs: Look for training output logs, MLflow run data, or W&B run summaries
  3. Configuration: Read the experiment config file (YAML/TOML)
  4. Checkpoint metadata: Load the best checkpoint and extract epoch, metric, config
  5. Dataset statistics: Look for data profiling outputs or read from data validation logs
# Find and read latest metrics
METRICS_FILE=$(ls -t reports/*.json experiments/*.json 2>/dev/null | head -1)
if [ -n "$METRICS_FILE" ]; then
    echo "=== Latest Metrics ==="
    cat "$METRICS_FILE"
fi

# Find config used
CONFIG_FILE=$(ls -t configs/*.yaml configs/*.toml 2>/dev/null | head -1)
if [ -n "$CONFIG_FILE" ]; then
    echo "=== Configuration ==="
    cat "$CONFIG_FILE"
fi

Step 2: Gather Baseline Data

Look for baseline metrics to compare against:

  1. Check for a reports/baseline_metrics.json or experiments/baseline.json
  2. Check git history for previous metrics files: git log --oneline --all -- reports/*.json
  3. If MLflow is configured, query for the baseline run
  4. If no baseline exists, note this in the report
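A minimal sketch of that lookup order, assuming the two candidate paths named above (the function name is illustrative):

```python
import json
from pathlib import Path

# Checked in order; first hit wins. Paths mirror the layout described above.
BASELINE_CANDIDATES = (
    Path("reports/baseline_metrics.json"),
    Path("experiments/baseline.json"),
)

def load_baseline():
    """Return (path, metrics dict) for the first baseline file found, else (None, None)."""
    for candidate in BASELINE_CANDIDATES:
        if candidate.exists():
            return candidate, json.loads(candidate.read_text())
    return None, None  # caller should note the missing baseline in the report
```

If neither file exists, the report should state explicitly that no baseline comparison was possible rather than silently omitting the section.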

Step 3: Generate Visualizations

If plots do not already exist, generate them:

python3 -c "
import json
from pathlib import Path

# Check if visualization script exists
viz_script = Path('src/evaluation/visualize.py')
if viz_script.exists():
    print('Visualization script found')
else:
    print('No visualization script found -- will generate basic plots')
"

Key visualizations to include:

  • Training curves: loss and metric over epochs (train vs. validation)
  • Confusion matrix: if classification task
  • Metric comparison bar chart: current vs. baseline
  • Feature importance: if available from the model or analysis
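As a fallback when no project visualization script exists, a basic training-curve plot can be produced with matplotlib. The history-dict keys (`train_loss`, `val_loss`) are an assumption about how your training script logs losses; adjust them to your metrics schema.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend so this also works in CI
import matplotlib.pyplot as plt

def plot_training_curves(history, out_path="reports/figures/training_loss.png"):
    """Plot per-epoch train/val loss from a dict like
    {"train_loss": [...], "val_loss": [...]} and return the output path."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    epochs = range(1, len(history["train_loss"]) + 1)
    plt.figure(figsize=(6, 4))
    plt.plot(epochs, history["train_loss"], label="train")
    if "val_loss" in history:
        plt.plot(epochs, history["val_loss"], label="validation")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.legend()
    plt.tight_layout()
    plt.savefig(out)
    plt.close()
    return out
```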

Step 4: Write the Report

Generate the report as a Markdown file at reports/experiment_report.md:

# Experiment Report: [Experiment Name]

**Date:** [current date]
**Branch:** [git branch]
**Commit:** [git commit hash]
**Author:** [generated by /generate-report skill]

---

## Executive Summary

[2-3 sentences: what was the experiment, what was the key result, and is it better than baseline?]

## Experiment Configuration

| Parameter | Value |
|-----------|-------|
| Model architecture | [from config] |
| Learning rate | [from config] |
| Batch size | [from config] |
| Epochs | [from config] |
| Optimizer | [from config] |
| Scheduler | [from config] |
| Random seed | [from config] |
| Dataset version | [from config or DVC] |

## Dataset Summary

| Split | Samples | Features | Classes |
|-------|---------|----------|---------|
| Train | [count] | [count] | [count or N/A] |
| Validation | [count] | [count] | [count or N/A] |
| Test | [count] | [count] | [count or N/A] |

## Results

### Final Metrics

| Metric | Value |
|--------|-------|
| [metric 1] | [value] |
| [metric 2] | [value] |
| ... | ... |

### Comparison with Baseline

| Metric | Baseline | Current | Delta | Improvement? |
|--------|----------|---------|-------|-------------|
| [metric 1] | [value] | [value] | [+/- value] | [Yes/No] |
| ... | ... | ... | ... | ... |

### Training Curves

![Training Loss](figures/training_loss.png)
![Validation Metric](figures/validation_metric.png)

### Confusion Matrix

![Confusion Matrix](figures/confusion_matrix.png)

## Analysis

### Key Findings
- [Finding 1: most important result]
- [Finding 2: notable pattern or observation]
- [Finding 3: any concerning behavior]

### Error Analysis
- [What types of errors does the model make?]
- [Are errors concentrated in specific classes or data subsets?]

### Comparison with Previous Experiments
- [How does this compare to previous runs?]
- [What changed and what impact did it have?]

## Recommendations

### Next Steps
1. [Actionable recommendation 1]
2. [Actionable recommendation 2]
3. [Actionable recommendation 3]

### Potential Improvements
- [Idea for model improvement]
- [Idea for data improvement]
- [Idea for training procedure improvement]

## Artifacts

| Artifact | Path |
|----------|------|
| Best checkpoint | checkpoints/best_model.pt |
| Metrics JSON | reports/metrics.json |
| Config file | configs/experiment.yaml |
| Training logs | experiments/[run-id]/ |
| Figures | reports/figures/ |

---

*Report generated automatically by the /generate-report skill.*
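The baseline-comparison rows in the template above can be filled programmatically. This sketch assumes flat metric dicts and a caller-supplied tuple of higher-is-better metric names; both are assumptions, not part of the skill.

```python
def comparison_rows(baseline, current, higher_is_better=("accuracy", "f1", "auc")):
    """Build markdown table rows comparing baseline vs. current metrics.

    Only metrics present in both dicts are compared; the improvement flag
    flips direction for loss-like metrics not listed in higher_is_better.
    """
    rows = []
    for name in sorted(set(baseline) & set(current)):
        delta = current[name] - baseline[name]
        improved = delta > 0 if name in higher_is_better else delta < 0
        rows.append(
            f"| {name} | {baseline[name]:.4f} | {current[name]:.4f} "
            f"| {delta:+.4f} | {'Yes' if improved else 'No'} |"
        )
    return rows
```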

Step 5: Verify Report Quality

After writing the report:

  1. Read it back and verify all placeholders are filled with actual data
  2. Verify all referenced figure paths exist
  3. Verify metrics values are reasonable (not NaN, not obviously wrong)
  4. Ensure the executive summary accurately reflects the detailed results
  5. Check that recommendations are specific and actionable, not generic
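The placeholder and figure checks can be partially automated. The placeholder patterns below match this template's bracket conventions (`[from config]`, `[value]`, etc.) and would need adjusting for a different template:

```python
import re
from pathlib import Path

def verify_report(report_path="reports/experiment_report.md"):
    """Return a list of quality problems found in the generated report."""
    report = Path(report_path)
    text = report.read_text()
    problems = []
    # Unfilled template placeholders look like [from config], [value], [count], ...
    for placeholder in re.findall(
        r"\[(?:from config|value|count|current date)[^\]]*\]", text
    ):
        problems.append(f"unfilled placeholder: {placeholder}")
    # Every referenced figure should exist relative to the report's directory.
    for fig in re.findall(r"!\[[^\]]*\]\(([^)]+)\)", text):
        if not (report.parent / fig).exists():
            problems.append(f"missing figure: {fig}")
    if "NaN" in text or "nan" in text.split():
        problems.append("possible NaN metric value")
    return problems
```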

Report the path to the generated report file when complete.

Source

View on GitHub: https://github.com/xvirobotics/metaskill/blob/main/examples/data-science/.claude/skills/generate-report/SKILL.md

Overview

This skill compiles a comprehensive summary of the latest experiment, collecting metrics, plots, and configuration details. It compares results against the baseline and outputs a Markdown report that can be shared with the team. Use it after training and evaluation to communicate results clearly.

How This Skill Works

It scans reports and experiments folders for the most recent run, extracts metrics, training logs, and config files, and loads the best checkpoint metadata. It then generates a Markdown report at reports/experiment_report.md, embedding key plots and a baseline comparison, ready for sharing.

When to Use It

  • After finishing a training and evaluation run to summarize results for the team.
  • When you need to compare current performance to the baseline.
  • Before sharing results with stakeholders or reviewers.
  • When multiple experiments exist and you want a consolidated report.
  • To produce a repeatable report format for CI or ML ops.

Quick Start

  1. Provide the experiment name as an argument, e.g. `generate-report transformer-v2-lr-sweep`.
  2. The skill collects metrics, config, and plots from reports/ and experiments/ and builds reports/experiment_report.md.
  3. Open reports/experiment_report.md and share it with your team.

Best Practices

  • Ensure the latest experiment logs exist in reports/ or experiments/.
  • Verify there's a corresponding baseline to compare against.
  • Standardize metric names across experiments for consistent comparisons.
  • Include at least one relevant plot (training curves, confusion matrix) in the report.
  • Review the generated Markdown for accuracy before sharing.

Example Use Cases

  • Transformer-v2-lr-sweep: generate a report to share with the team.
  • CNN-augmentation-warmup: produce a report after evaluation to compare to baseline.
  • bert-finetune-epoch10: create a summary report for leadership.
  • lstm-hyperparam-search: capture metrics, plots, and baseline comparison.
  • random-forest-baselineCheck: generate report highlighting baseline alignment.
