ml-experiment
Scannednpx machina-cli add skill nishide-dev/claude-code-ml-research/ml-experiment --openclawML Experiment Management
Systematic experiment tracking, comparison, and analysis for machine learning research.
Quick Start
Directory Structure:
logs/
├── 2026-02-22/
│ ├── 14-30-22/ # Timestamp of run
│ │ ├── .hydra/
│ │ │ ├── config.yaml # Full resolved config
│ │ │ ├── overrides.yaml # CLI overrides
│ │ │ └── hydra.yaml
│ │ ├── checkpoints/
│ │ ├── metrics.csv
│ │ └── train.log
│ └── 15-45-10/
└── experiment_registry.json # Central registry
1. Create Experiment Config
Interactive Setup - Ask User:
- Experiment name and description
- Base configuration to extend
- Key parameters to modify
- Tags (baseline, ablation, optimization, etc.)
- Expected runtime/compute requirements
Generate: configs/experiment/<name>.yaml
# @package _global_
# Metadata
name: "vit_imagenet_finetuning"
description: "Fine-tune Vision Transformer on ImageNet subset"
tags: ["vision-transformer", "transfer-learning", "imagenet"]
# Compose from existing configs
defaults:
- override /model: vit_base
- override /data: imagenet
- override /trainer: gpu_multi
- override /logger: wandb
# Seed
seed: 42
# Model overrides
model:
pretrained: true
freeze_backbone: false
num_classes: 1000
optimizer:
lr: 0.001
# Data overrides
data:
batch_size: 256
num_workers: 8
image_size: 224
# Trainer overrides
trainer:
max_epochs: 50
precision: "16-mixed"
devices: 4
strategy: "ddp"
# Callbacks
callbacks:
model_checkpoint:
monitor: "val/acc"
mode: "max"
save_top_k: 3
early_stopping:
monitor: "val/loss"
patience: 10
mode: "min"
# Logger
logger:
wandb:
project: "imagenet-classification"
tags: ${tags}
notes: ${description}
Run experiment:
python src/train.py experiment=vit_imagenet_finetuning
See templates/experiment-templates.yaml for common experiment types.
2. Track Experiment Results
Automatic Tracking with Callbacks
# In LightningModule
def on_train_end(self):
# Log experiment to registry
from scripts.experiment_registry import log_experiment
log_experiment(
name=self.hparams.experiment_name,
config_path=self.hparams.config_path,
metrics={
"best_val_acc": self.trainer.callback_metrics["val/acc"].item(),
"best_val_loss": self.trainer.checkpoint_callback.best_model_score.item(),
"epochs_trained": self.trainer.current_epoch,
},
hyperparameters={
"lr": self.hparams.optimizer.lr,
"batch_size": self.hparams.data.batch_size,
"optimizer": self.hparams.optimizer._target_,
},
tags=self.hparams.tags,
)
Experiment Registry Format
logs/experiment_registry.json:
{
"experiments": [
{
"id": "exp_001",
"name": "baseline_resnet50",
"timestamp": "2026-02-22T14:30:22",
"config": "configs/experiment/baseline.yaml",
"status": "completed",
"metrics": {
"best_val_acc": 0.876,
"best_val_loss": 0.324,
"final_train_loss": 0.145,
"epochs_trained": 45
},
"hyperparameters": {
"lr": 0.001,
"batch_size": 128,
"optimizer": "AdamW"
},
"runtime": "2h 34m",
"gpu_count": 2,
"tags": ["baseline", "resnet"]
}
]
}
See scripts/experiment_registry.py for implementation.
3. Compare Experiments
Compare specific experiments:
python scripts/compare_experiments.py exp_001 exp_002 exp_003
Output:
ID Name Val Acc Val Loss LR Batch Runtime
exp_001 baseline_resnet50 0.876 0.324 0.001 128 2h 34m
exp_002 resnet50_tuned 0.892 0.298 0.005 256 3h 12m
exp_003 resnet50_dropout 0.884 0.312 0.001 128 2h 45m
Comparison plot:
# Generates logs/experiment_comparison.png
# - Bar charts for accuracy and loss
# - Side-by-side comparison
See scripts/compare_experiments.py for full implementation.
4. Experiment Templates
Baseline Experiment
# configs/experiment/baseline.yaml
name: "baseline"
description: "Baseline with default hyperparameters"
tags: ["baseline"]
# Use defaults from model/data/trainer
model: {}
data: {}
trainer:
max_epochs: 100
Ablation Study
# configs/experiment/ablation_dropout.yaml
name: "ablation_dropout"
description: "Effect of dropout rate"
tags: ["ablation", "regularization"]
# Run with: --multirun model.dropout=0.0,0.1,0.2,0.3,0.4,0.5
model:
dropout: 0.3
Hyperparameter Optimization
# configs/experiment/hp_optimization.yaml
name: "hp_optimization"
description: "Hyperparameter optimization with Optuna"
tags: ["optimization", "tuning"]
defaults:
- override hydra/sweeper: optuna
hydra:
sweeper:
n_trials: 100
direction: maximize
study_name: "model_optimization"
params:
model.hidden_dims:
type: categorical
choices: [[512,256], [1024,512,256]]
model.optimizer.lr:
type: float
low: 0.0001
high: 0.01
log: true
optimized_metric: "val/acc"
See templates/ for more experiment types.
5. Experiment Reproduction
Save Full Environment
# Save package versions
pixi list > logs/exp_001/environment.txt
# or
uv pip freeze > logs/exp_001/requirements.txt
# Save git commit
git rev-parse HEAD > logs/exp_001/commit_hash.txt
# Save system info
python -c "import torch; print(f'PyTorch: {torch.__version__}\nCUDA: {torch.version.cuda}')" > logs/exp_001/system_info.txt
Reproduce Experiment
# Checkout exact code
git checkout $(cat logs/exp_001/commit_hash.txt)
# Restore environment
pixi install
# or
uv pip install -r logs/exp_001/requirements.txt
# Run with exact config
python src/train.py \
--config-path ../logs/exp_001/.hydra \
--config-name config
Reproducibility Checklist:
- Set random seeds (Python, NumPy, PyTorch, Lightning)
- Save exact package versions
- Document hardware (GPU type, CUDA version)
- Save data splits/preprocessing
- Commit code before experiment
- Don't modify code during run
6. Experiment Analysis
Analyze Single Experiment
python scripts/analyze_experiment.py logs/2026-02-22/14-30-22/
Generates:
analysis.png- Training curves (loss, accuracy, LR)- Summary statistics (best metrics, epochs, final LR)
Example:
Experiment Summary:
Best Val Acc: 0.8921
Best Val Loss: 0.2984
Epochs Trained: 45
Final LR: 0.000123
Multi-Experiment Analysis
# List all experiments
python scripts/list_experiments.py
# Filter by tags
python scripts/list_experiments.py --tags baseline ablation
# Export to CSV
python scripts/export_results.py --output results.csv
# Generate markdown report
python scripts/generate_report.py --format markdown --output report.md
See examples/experiment-analysis.md for detailed analysis workflows.
7. W&B Integration
Query W&B Runs
import wandb
api = wandb.Api()
runs = api.runs("my-project")
# Filter runs
runs = api.runs("my-project", filters={"tags": "baseline"})
# Get metrics
for run in runs:
print(f"{run.name}: val_acc={run.summary['val/acc']:.4f}")
# Download artifacts
best_run = runs[0]
best_run.file("model.pt").download()
Compare Runs in W&B
# Open workspace
wandb workspace
# Generate report
wandb reports create --title "Experiment Comparison"
W&B Sweeps
# Initialize sweep
wandb sweep configs/sweep/bayesian_optimization.yaml
# Run sweep agent
wandb agent <sweep-id>
See examples/wandb-integration.md for complete guide.
8. Experiment Best Practices
Naming Conventions
- Descriptive names:
vit_large_imagenet_pretrained - Include date for long runs:
exp_2026_02_baseline - Use prefixes:
ablation_,optimization_,baseline_
Documentation
- Always add description and notes
- Tag for easy filtering
- Document unexpected results
- Track compute resources
Version Control
- Commit code before long experiments
- Save git hash with experiment
- Don't modify code during run
- Use branches for experimental features
Organization
configs/experiment/
├── baselines/
│ ├── resnet_baseline.yaml
│ └── vit_baseline.yaml
├── ablations/
│ ├── ablation_dropout.yaml
│ └── ablation_lr.yaml
└── optimizations/
└── hp_optimization.yaml
9. Common Experiment Types
A. Baseline Experiment
Purpose: Establish reference performance.
name: "baseline"
tags: ["baseline"]
model: {} # Use defaults
B. Ablation Study
Purpose: Isolate effect of single component.
name: "ablation_batch_norm"
tags: ["ablation"]
model:
use_batch_norm: false # Remove batch norm
C. Hyperparameter Tuning
Purpose: Find optimal hyperparameters.
name: "hp_tuning"
tags: ["optimization"]
# Use with --multirun or Optuna sweeper
D. Transfer Learning
Purpose: Fine-tune pretrained model.
name: "transfer_learning"
tags: ["transfer-learning"]
model:
pretrained: true
freeze_backbone: true # Freeze early layers
E. Architecture Search
Purpose: Compare different architectures.
# Run multiple architectures
python src/train.py --multirun \
experiment=architecture_search \
model=resnet18,resnet50,vit_base
10. Experiment Commands
# Create new experiment
python src/train.py experiment=<name>
# List experiments
python scripts/list_experiments.py
# Compare experiments
python scripts/compare_experiments.py exp_001 exp_002 exp_003
# Analyze experiment
python scripts/analyze_experiment.py logs/2026-02-22/14-30-22/
# Clean old experiments (keep best 5)
python scripts/clean_experiments.py --keep-best 5
# Export results
python scripts/export_results.py --output results.csv
# Generate report
python scripts/generate_report.py --format markdown --output report.md
Troubleshooting
Experiment registry not updating:
- Check permissions on
logs/experiment_registry.json - Verify
on_train_endcallback is called - Check for JSON syntax errors
Can't reproduce results:
- Verify exact package versions match
- Check random seeds are set
- Confirm same hardware (GPU model affects results)
- Validate data preprocessing matches
W&B runs not logging:
- Check
WANDB_API_KEYis set - Verify project name matches
- Check network connectivity
- Try
wandb loginagain
Metrics not saving:
- Verify
log_every_n_stepsis set - Check disk space
- Confirm metrics are logged in training_step
Success Criteria
- Experiment registry tracks all runs
- Configs saved with each experiment
- Metrics logged consistently
- Easy comparison between experiments
- Reproducible results
- Clear documentation
- Training curves visualized
- Summary statistics computed
Experiments are well-organized and easily comparable!
Source
git clone https://github.com/nishide-dev/claude-code-ml-research/blob/main/skills/ml-experiment/SKILL.mdView on GitHub Overview
ML Experiment Management provides systematic tracking, comparison, and analysis of machine learning runs. It supports creating experiment configs, maintaining a central registry, and analyzing results with W&B, TensorBoard, and MLflow. It organizes outputs under a logs directory and a central experiment_registry.json for reproducibility.
How This Skill Works
Start by interactively creating an experiment config that is saved under configs/experiment/<name>.yaml, optionally composing base configs with defaults. During training, a callback logs metrics, hyperparameters, and tags to the central registry at logs/experiment_registry.json. External loggers like W&B, TensorBoard, or MLflow integrate via the logger section of the config, enabling cross tool analysis and easy comparisons with scripts like compare_experiments.py.
When to Use It
- When starting a new ML experiment and you need a structured, extendable config
- When you want automatic tracking of metrics and hyperparameters to a central registry
- When you need to compare multiple experiments or runs side by side
- When analyzing results with W&B, TensorBoard, or MLflow
- When organizing experiments with metadata like tags and runtimes for reproducibility
Quick Start
- Step 1: Use interactive setup to define the experiment name, description, base config and overrides; this generates configs/experiment/<name>.yaml
- Step 2: Run the experiment with your training script, for example python src/train.py experiment=your_experiment_name
- Step 3: Check the central registry at logs/experiment_registry.json and use compare_experiments.py to compare runs with other experiments
Best Practices
- Use descriptive experiment names and consistent tags to enable filtering
- Rely on a stable defaults based config to improve reproducibility
- Log key metrics and hyperparameters consistently across runs
- Leverage the central registry to compare and reproduce results
- Regularly back up the registry and associated logs to avoid data loss
Example Use Cases
- vit_imagenet_finetuning on ImageNet subset
- baseline_resnet50 experiment with registry entry
- ablation study changing freezing backbone
- ddp multiGPU training scaling study
- lr sweep comparing learning rate schedules