What is ml-experiment?

A framework for managing ML experiments with config driven setups, a central registry, and tools for comparison and analysis

How does automatic tracking work?

A callback logs metrics, hyperparameters and tags to the experiment registry so you can review results later

Which tools are supported?

WandB, TensorBoard, and MLflow are supported via the logger configuration in the experiment YAML

ml-experiment

Scanned

npx machina-cli add skill nishide-dev/claude-code-ml-research/ml-experiment --openclaw

Files (1)

SKILL.md

11.4 KB

ML Experiment Management

Systematic experiment tracking, comparison, and analysis for machine learning research.

Quick Start

Directory Structure:

logs/
├── 2026-02-22/
│   ├── 14-30-22/              # Timestamp of run
│   │   ├── .hydra/
│   │   │   ├── config.yaml    # Full resolved config
│   │   │   ├── overrides.yaml # CLI overrides
│   │   │   └── hydra.yaml
│   │   ├── checkpoints/
│   │   ├── metrics.csv
│   │   └── train.log
│   └── 15-45-10/
└── experiment_registry.json   # Central registry

1. Create Experiment Config

Interactive Setup - Ask User:

Experiment name and description
Base configuration to extend
Key parameters to modify
Tags (baseline, ablation, optimization, etc.)
Expected runtime/compute requirements

Generate: configs/experiment/<name>.yaml

# @package _global_

# Metadata
name: "vit_imagenet_finetuning"
description: "Fine-tune Vision Transformer on ImageNet subset"
tags: ["vision-transformer", "transfer-learning", "imagenet"]

# Compose from existing configs
defaults:
  - override /model: vit_base
  - override /data: imagenet
  - override /trainer: gpu_multi
  - override /logger: wandb

# Seed
seed: 42

# Model overrides
model:
  pretrained: true
  freeze_backbone: false
  num_classes: 1000
  optimizer:
    lr: 0.001

# Data overrides
data:
  batch_size: 256
  num_workers: 8
  image_size: 224

# Trainer overrides
trainer:
  max_epochs: 50
  precision: "16-mixed"
  devices: 4
  strategy: "ddp"

# Callbacks
callbacks:
  model_checkpoint:
    monitor: "val/acc"
    mode: "max"
    save_top_k: 3

  early_stopping:
    monitor: "val/loss"
    patience: 10
    mode: "min"

# Logger
logger:
  wandb:
    project: "imagenet-classification"
    tags: ${tags}
    notes: ${description}

Run experiment:

python src/train.py experiment=vit_imagenet_finetuning

See templates/experiment-templates.yaml for common experiment types.

2. Track Experiment Results

Automatic Tracking with Callbacks

# In LightningModule
def on_train_end(self):
    # Log experiment to registry
    from scripts.experiment_registry import log_experiment

    log_experiment(
        name=self.hparams.experiment_name,
        config_path=self.hparams.config_path,
        metrics={
            "best_val_acc": self.trainer.callback_metrics["val/acc"].item(),
            "best_val_loss": self.trainer.checkpoint_callback.best_model_score.item(),
            "epochs_trained": self.trainer.current_epoch,
        },
        hyperparameters={
            "lr": self.hparams.optimizer.lr,
            "batch_size": self.hparams.data.batch_size,
            "optimizer": self.hparams.optimizer._target_,
        },
        tags=self.hparams.tags,
    )

Experiment Registry Format

logs/experiment_registry.json:

{
  "experiments": [
    {
      "id": "exp_001",
      "name": "baseline_resnet50",
      "timestamp": "2026-02-22T14:30:22",
      "config": "configs/experiment/baseline.yaml",
      "status": "completed",
      "metrics": {
        "best_val_acc": 0.876,
        "best_val_loss": 0.324,
        "final_train_loss": 0.145,
        "epochs_trained": 45
      },
      "hyperparameters": {
        "lr": 0.001,
        "batch_size": 128,
        "optimizer": "AdamW"
      },
      "runtime": "2h 34m",
      "gpu_count": 2,
      "tags": ["baseline", "resnet"]
    }
  ]
}

See scripts/experiment_registry.py for implementation.

3. Compare Experiments

Compare specific experiments:

python scripts/compare_experiments.py exp_001 exp_002 exp_003

Output:

ID      Name              Val Acc  Val Loss  LR      Batch  Runtime
exp_001 baseline_resnet50 0.876    0.324     0.001   128    2h 34m
exp_002 resnet50_tuned    0.892    0.298     0.005   256    3h 12m
exp_003 resnet50_dropout  0.884    0.312     0.001   128    2h 45m

Comparison plot:

# Generates logs/experiment_comparison.png
# - Bar charts for accuracy and loss
# - Side-by-side comparison

See scripts/compare_experiments.py for full implementation.

4. Experiment Templates

Baseline Experiment

# configs/experiment/baseline.yaml
name: "baseline"
description: "Baseline with default hyperparameters"
tags: ["baseline"]

# Use defaults from model/data/trainer
model: {}
data: {}
trainer:
  max_epochs: 100

Ablation Study

# configs/experiment/ablation_dropout.yaml
name: "ablation_dropout"
description: "Effect of dropout rate"
tags: ["ablation", "regularization"]

# Run with: --multirun model.dropout=0.0,0.1,0.2,0.3,0.4,0.5
model:
  dropout: 0.3

Hyperparameter Optimization

# configs/experiment/hp_optimization.yaml
name: "hp_optimization"
description: "Hyperparameter optimization with Optuna"
tags: ["optimization", "tuning"]

defaults:
  - override hydra/sweeper: optuna

hydra:
  sweeper:
    n_trials: 100
    direction: maximize
    study_name: "model_optimization"
    params:
      model.hidden_dims:
        type: categorical
        choices: [[512,256], [1024,512,256]]
      model.optimizer.lr:
        type: float
        low: 0.0001
        high: 0.01
        log: true

optimized_metric: "val/acc"

See templates/ for more experiment types.

5. Experiment Reproduction

Save Full Environment

# Save package versions
pixi list > logs/exp_001/environment.txt
# or
uv pip freeze > logs/exp_001/requirements.txt

# Save git commit
git rev-parse HEAD > logs/exp_001/commit_hash.txt

# Save system info
python -c "import torch; print(f'PyTorch: {torch.__version__}\nCUDA: {torch.version.cuda}')" > logs/exp_001/system_info.txt

Reproduce Experiment

# Checkout exact code
git checkout $(cat logs/exp_001/commit_hash.txt)

# Restore environment
pixi install
# or
uv pip install -r logs/exp_001/requirements.txt

# Run with exact config
python src/train.py \
  --config-path ../logs/exp_001/.hydra \
  --config-name config

Reproducibility Checklist:

Set random seeds (Python, NumPy, PyTorch, Lightning)
Save exact package versions
Document hardware (GPU type, CUDA version)
Save data splits/preprocessing
Commit code before experiment
Don't modify code during run

6. Experiment Analysis

Analyze Single Experiment

python scripts/analyze_experiment.py logs/2026-02-22/14-30-22/

Generates:

analysis.png - Training curves (loss, accuracy, LR)
Summary statistics (best metrics, epochs, final LR)

Example:

Experiment Summary:
Best Val Acc:    0.8921
Best Val Loss:   0.2984
Epochs Trained:  45
Final LR:        0.000123

Multi-Experiment Analysis

# List all experiments
python scripts/list_experiments.py

# Filter by tags
python scripts/list_experiments.py --tags baseline ablation

# Export to CSV
python scripts/export_results.py --output results.csv

# Generate markdown report
python scripts/generate_report.py --format markdown --output report.md

See examples/experiment-analysis.md for detailed analysis workflows.

7. W&B Integration

Query W&B Runs

import wandb

api = wandb.Api()
runs = api.runs("my-project")

# Filter runs
runs = api.runs("my-project", filters={"tags": "baseline"})

# Get metrics
for run in runs:
    print(f"{run.name}: val_acc={run.summary['val/acc']:.4f}")

# Download artifacts
best_run = runs[0]
best_run.file("model.pt").download()

Compare Runs in W&B

# Open workspace
wandb workspace

# Generate report
wandb reports create --title "Experiment Comparison"

W&B Sweeps

# Initialize sweep
wandb sweep configs/sweep/bayesian_optimization.yaml

# Run sweep agent
wandb agent <sweep-id>

See examples/wandb-integration.md for complete guide.

8. Experiment Best Practices

Naming Conventions

Descriptive names: vit_large_imagenet_pretrained
Include date for long runs: exp_2026_02_baseline
Use prefixes: ablation_, optimization_, baseline_

Documentation

Always add description and notes
Tag for easy filtering
Document unexpected results
Track compute resources

Version Control

Commit code before long experiments
Save git hash with experiment
Don't modify code during run
Use branches for experimental features

Organization

configs/experiment/
├── baselines/
│   ├── resnet_baseline.yaml
│   └── vit_baseline.yaml
├── ablations/
│   ├── ablation_dropout.yaml
│   └── ablation_lr.yaml
└── optimizations/
    └── hp_optimization.yaml

9. Common Experiment Types

A. Baseline Experiment

Purpose: Establish reference performance.

name: "baseline"
tags: ["baseline"]
model: {}  # Use defaults

B. Ablation Study

Purpose: Isolate effect of single component.

name: "ablation_batch_norm"
tags: ["ablation"]
model:
  use_batch_norm: false  # Remove batch norm

C. Hyperparameter Tuning

Purpose: Find optimal hyperparameters.

name: "hp_tuning"
tags: ["optimization"]
# Use with --multirun or Optuna sweeper

D. Transfer Learning

Purpose: Fine-tune pretrained model.

name: "transfer_learning"
tags: ["transfer-learning"]
model:
  pretrained: true
  freeze_backbone: true  # Freeze early layers

E. Architecture Search

Purpose: Compare different architectures.

# Run multiple architectures
python src/train.py --multirun \
  experiment=architecture_search \
  model=resnet18,resnet50,vit_base

10. Experiment Commands

# Create new experiment
python src/train.py experiment=<name>

# List experiments
python scripts/list_experiments.py

# Compare experiments
python scripts/compare_experiments.py exp_001 exp_002 exp_003

# Analyze experiment
python scripts/analyze_experiment.py logs/2026-02-22/14-30-22/

# Clean old experiments (keep best 5)
python scripts/clean_experiments.py --keep-best 5

# Export results
python scripts/export_results.py --output results.csv

# Generate report
python scripts/generate_report.py --format markdown --output report.md

Troubleshooting

Experiment registry not updating:

Check permissions on logs/experiment_registry.json
Verify on_train_end callback is called
Check for JSON syntax errors

Can't reproduce results:

Verify exact package versions match
Check random seeds are set
Confirm same hardware (GPU model affects results)
Validate data preprocessing matches

W&B runs not logging:

Check WANDB_API_KEY is set
Verify project name matches
Check network connectivity
Try wandb login again

Metrics not saving:

Verify log_every_n_steps is set
Check disk space
Confirm metrics are logged in training_step

Success Criteria

Experiment registry tracks all runs
Configs saved with each experiment
Metrics logged consistently
Easy comparison between experiments
Reproducible results
Clear documentation
Training curves visualized
Summary statistics computed

Experiments are well-organized and easily comparable!

Source

git clone https://github.com/nishide-dev/claude-code-ml-research/blob/main/skills/ml-experiment/SKILL.mdView on GitHub

Overview

ML Experiment Management provides systematic tracking, comparison, and analysis of machine learning runs. It supports creating experiment configs, maintaining a central registry, and analyzing results with W&B, TensorBoard, and MLflow. It organizes outputs under a logs directory and a central experiment_registry.json for reproducibility.

How This Skill Works

Start by interactively creating an experiment config that is saved under configs/experiment/<name>.yaml, optionally composing base configs with defaults. During training, a callback logs metrics, hyperparameters, and tags to the central registry at logs/experiment_registry.json. External loggers like W&B, TensorBoard, or MLflow integrate via the logger section of the config, enabling cross tool analysis and easy comparisons with scripts like compare_experiments.py.

When to Use It

When starting a new ML experiment and you need a structured, extendable config
When you want automatic tracking of metrics and hyperparameters to a central registry
When you need to compare multiple experiments or runs side by side
When analyzing results with W&B, TensorBoard, or MLflow
When organizing experiments with metadata like tags and runtimes for reproducibility

Quick Start

Step 1: Use interactive setup to define the experiment name, description, base config and overrides; this generates configs/experiment/<name>.yaml
Step 2: Run the experiment with your training script, for example python src/train.py experiment=your_experiment_name
Step 3: Check the central registry at logs/experiment_registry.json and use compare_experiments.py to compare runs with other experiments

Best Practices

Use descriptive experiment names and consistent tags to enable filtering
Rely on a stable defaults based config to improve reproducibility
Log key metrics and hyperparameters consistently across runs
Leverage the central registry to compare and reproduce results
Regularly back up the registry and associated logs to avoid data loss

Example Use Cases

vit_imagenet_finetuning on ImageNet subset
baseline_resnet50 experiment with registry entry
ablation study changing freezing backbone
ddp multiGPU training scaling study
lr sweep comparing learning rate schedules

Frequently Asked Questions

Add this skill to your agents