ml-config-manager
ML Configuration Management
Generate and manage Hydra configuration files for machine learning experiments with PyTorch Lightning.
Quick Reference
Template Files Available:
- templates/model-config.yaml - Model architecture configuration
- templates/data-config.yaml - Dataset and DataLoader configuration
- templates/trainer-config.yaml - PyTorch Lightning Trainer configuration
- templates/experiment-config.yaml - Complete experiment composition
- templates/gnn-config.yaml - PyTorch Geometric GNN configuration
- templates/sweep-config.yaml - Hyperparameter sweep with Optuna
Configuration Types
1. Model Configuration
Location: configs/model/<name>.yaml
What to ask:
- Model architecture (CNN, Transformer, GNN, MLP, etc.)
- Input/output dimensions
- Hidden dimensions and layers
- Activation functions
- Normalization layers (batch norm, layer norm)
- Dropout rates
- Optimizer type and parameters
- Learning rate scheduler
Generate from template:
_target_: src.models.<name>.Model

# Architecture
input_dim: 784
hidden_dims: [512, 256, 128]
output_dim: 10
activation: relu
dropout: 0.2
batch_norm: true

# Optimizer
optimizer:
  _target_: torch.optim.AdamW
  lr: 0.001
  weight_decay: 0.0001
  betas: [0.9, 0.999]

# Scheduler
scheduler:
  _target_: torch.optim.lr_scheduler.CosineAnnealingLR
  T_max: 100
  eta_min: 1e-6
Common optimizers:
- torch.optim.AdamW - Default choice, good for most tasks
- torch.optim.Adam - Classic optimizer
- torch.optim.SGD - With momentum for vision tasks
- torch.optim.RMSprop - Good for recurrent networks
Common schedulers:
- CosineAnnealingLR - Smooth cosine decay
- ReduceLROnPlateau - Reduce on metric plateau
- OneCycleLR - Super-convergence with 1cycle policy
- StepLR - Step decay at fixed intervals
- ExponentialLR - Exponential decay
See templates/model-config.yaml for complete template.
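In practice, the optimizer and scheduler sub-configs above are instantiated inside the LightningModule, because the optimizer needs the model's parameters at call time. A minimal sketch, assuming the module keeps this config on self.cfg (the attribute name and class layout are illustrative, not part of the templates):

import hydra
import pytorch_lightning as pl

class Model(pl.LightningModule):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg  # the model config shown above

    def configure_optimizers(self):
        # Parameters are supplied at call time; lr, weight_decay, betas come from the config.
        optimizer = hydra.utils.instantiate(self.cfg.optimizer, params=self.parameters())
        scheduler = hydra.utils.instantiate(self.cfg.scheduler, optimizer=optimizer)
        return {"optimizer": optimizer, "lr_scheduler": scheduler}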
2. Data Configuration
Location: configs/data/<name>.yaml
What to ask:
- Dataset name/type
- Batch size
- Number of workers
- Data augmentation strategy
- Train/val/test split ratios
- Preprocessing requirements
Generate from template:
_target_: src.data.<name>.DataModule

# Dataset
dataset_name: "mnist"
data_dir: "data/"
download: true

# DataLoader
batch_size: 128
num_workers: 4
pin_memory: true
persistent_workers: true
prefetch_factor: 2

# Splits
train_val_test_split: [0.8, 0.1, 0.1]
shuffle_train: true

# Augmentation
augmentation:
  random_crop: true
  horizontal_flip: true
  normalize: true
  mean: [0.485, 0.456, 0.406]
  std: [0.229, 0.224, 0.225]
See templates/data-config.yaml for complete template with all augmentation options.
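The _target_ above must point at a LightningDataModule whose constructor accepts the YAML keys as keyword arguments. A minimal skeleton, assuming a module at src/data/<name>.py (the class body is illustrative; only the constructor signature has to match the config):

import pytorch_lightning as pl
from torch.utils.data import DataLoader

class DataModule(pl.LightningDataModule):
    # Each top-level key in the YAML arrives here as a keyword argument.
    def __init__(self, dataset_name, data_dir, download, batch_size, num_workers,
                 pin_memory, persistent_workers, prefetch_factor,
                 train_val_test_split, shuffle_train, augmentation):
        super().__init__()
        self.save_hyperparameters()

    def setup(self, stage=None):
        ...  # build self.train_set / self.val_set / self.test_set here

    def train_dataloader(self):
        return DataLoader(self.train_set,
                          batch_size=self.hparams.batch_size,
                          num_workers=self.hparams.num_workers,
                          shuffle=self.hparams.shuffle_train,
                          pin_memory=self.hparams.pin_memory)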
3. Trainer Configuration
Location: configs/trainer/<name>.yaml
What to ask:
- Max epochs
- Precision (32, 16-mixed, bf16-mixed)
- Accelerator (auto, gpu, cpu, mps)
- Number of devices (GPUs)
- Strategy (auto, ddp, fsdp, deepspeed)
- Gradient clipping value
- Accumulation steps
- Validation frequency
Generate from template:
_target_: pytorch_lightning.Trainer

# Training duration
max_epochs: 100
min_epochs: 1

# Hardware
accelerator: gpu
devices: 1
precision: 16-mixed

# Distributed
strategy: auto

# Optimization
gradient_clip_val: 1.0
accumulate_grad_batches: 1

# Validation
check_val_every_n_epoch: 1
num_sanity_val_steps: 2

# Callbacks
callbacks:
  - _target_: pytorch_lightning.callbacks.ModelCheckpoint
    monitor: "val/loss"
    mode: "min"
    save_top_k: 3
    save_last: true
  - _target_: pytorch_lightning.callbacks.EarlyStopping
    monitor: "val/loss"
    patience: 10
    mode: "min"
  - _target_: pytorch_lightning.callbacks.LearningRateMonitor
    logging_interval: "step"
  - _target_: pytorch_lightning.callbacks.RichProgressBar
See templates/trainer-config.yaml for complete template with all callbacks and options.
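Because hydra.utils.instantiate resolves nested _target_ entries recursively by default, the callbacks list above comes back as a list of constructed callback objects; no manual loop is needed. A minimal sketch, assuming cfg is the loaded Hydra config:

import hydra

# One call builds the Trainer and every _target_ nested under it,
# including the four callbacks listed above.
trainer = hydra.utils.instantiate(cfg.trainer)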
4. Logger Configuration
Location: configs/logger/<name>.yaml
W&B (Recommended):
_target_: pytorch_lightning.loggers.WandbLogger
project: "my-ml-project"
name: null # Auto-generated from config
save_dir: "logs/"
log_model: true
save_code: true
TensorBoard:
_target_: pytorch_lightning.loggers.TensorBoardLogger
save_dir: "logs/"
name: null
version: null
log_graph: true
default_hp_metric: false
MLflow:
_target_: pytorch_lightning.loggers.MLFlowLogger
experiment_name: "my-ml-project"
tracking_uri: "file:./mlruns"
log_model: true
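Whichever logger you pick, it is instantiated like any other component and can be handed to the Trainer as a call-time override. A minimal sketch, assuming cfg.logger holds one of the configs above:

import hydra

logger = hydra.utils.instantiate(cfg.logger)  # e.g. a WandbLogger
trainer = hydra.utils.instantiate(cfg.trainer, logger=logger)  # kwarg overrides the config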
5. Experiment Configuration
Location: configs/experiment/<name>.yaml
Compose complete experiments from existing configs:
# @package _global_

# Experiment metadata
name: "resnet50_imagenet_baseline"
description: "Baseline ResNet-50 on ImageNet with standard augmentation"
tags: ["baseline", "resnet", "imagenet"]

# Compose from existing configs
defaults:
  - override /model: resnet50
  - override /data: imagenet
  - override /trainer: gpu_ddp
  - override /logger: wandb

# Seed for reproducibility
seed: 42

# Model overrides (specific to this experiment)
model:
  hidden_dims: [2048, 1024, 512]
  dropout: 0.3
  optimizer:
    lr: 0.001

# Data overrides
data:
  batch_size: 256
  num_workers: 8

# Trainer overrides
trainer:
  max_epochs: 200
  devices: 4

# Logger configuration
logger:
  wandb:
    project: "imagenet-experiments"
    tags: ${tags}
    notes: ${description}
See templates/experiment-config.yaml for complete template.
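Experiment configs are normally selected on the command line (experiment=<name>), but they can also be composed programmatically, which is useful in notebooks and tests. A minimal sketch using Hydra's compose API (config_path is relative to the calling file):

from hydra import compose, initialize

with initialize(config_path="configs", version_base=None):
    cfg = compose(config_name="config",
                  overrides=["experiment=resnet50_imagenet_baseline"])

print(cfg.trainer.max_epochs)  # 200, from the experiment's trainer overrides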
6. Hyperparameter Sweep Configuration
Location: configs/sweep/<name>.yaml
What to ask:
- Parameters to sweep (model, data, training params)
- Search strategy (grid, random, bayesian/Optuna)
- Search space (ranges, choices)
- Optimization metric and direction
- Number of trials
Generate from template (Optuna Bayesian Optimization):
defaults:
  - override hydra/sweeper: optuna

hydra:
  sweeper:
    _target_: hydra_plugins.hydra_optuna_sweeper.optuna_sweeper.OptunaSweeper
    direction: minimize  # minimize or maximize
    n_trials: 50
    n_jobs: 1
    study_name: "mlp_optimization"
    storage: null  # null for in-memory, or "sqlite:///optuna.db"

    # Sampler
    sampler:
      _target_: optuna.samplers.TPESampler
      seed: 42
      n_startup_trials: 10

    # Search space
    params:
      model.hidden_dims:
        type: categorical
        choices:
          - [512, 256]
          - [1024, 512, 256]
          - [2048, 1024, 512]
      model.dropout:
        type: float
        low: 0.0
        high: 0.5
        step: 0.05
      model.optimizer.lr:
        type: float
        low: 0.0001
        high: 0.01
        log: true
      data.batch_size:
        type: categorical
        choices: [64, 128, 256]

# Metric to optimize
optimized_metric: "val/loss"
Run sweep:
python src/train.py --multirun --config-name sweep/<name>
Alternative: Grid search (no Optuna needed):
python src/train.py --multirun \
model.hidden_dims="[512,256],[1024,512,256]" \
model.dropout=0.1,0.2,0.3,0.4 \
model.optimizer.lr=0.001,0.01,0.1
See templates/sweep-config.yaml for complete template with all options.
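For the Optuna sweeper to drive the search, the task function must return the optimized metric as a float; the sweeper minimizes or maximizes that return value across trials. A minimal sketch, assuming the standard train entry point and the "val/loss" metric configured above:

import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="configs", config_name="config")
def train(cfg: DictConfig) -> float:
    model = hydra.utils.instantiate(cfg.model)
    datamodule = hydra.utils.instantiate(cfg.data)
    trainer = hydra.utils.instantiate(cfg.trainer)
    trainer.fit(model, datamodule=datamodule)
    # The sweeper optimizes this return value (direction: minimize).
    return trainer.callback_metrics["val/loss"].item()

if __name__ == "__main__":
    train()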
7. PyTorch Geometric Specific Configs
Location: configs/model/gnn/<name>.yaml
For Graph Neural Networks with PyTorch Geometric:
_target_: src.models.gnn.GNNModel
# GNN architecture
conv_type: GCNConv # GCNConv, GATConv, SAGEConv, GINConv, TransformerConv
num_layers: 3
hidden_channels: 128
out_channels: 64
# GNN-specific
aggr: "add" # add, mean, max
normalize: true
dropout: 0.2
jk_mode: null # null, cat, max, lstm
# Attention (for GAT/TransformerConv)
heads: 8
concat_heads: true
# Global pooling (for graph-level tasks)
global_pool: "mean" # mean, max, add, attention, set2set
# Task configuration
task: "node_classification" # node_classification, graph_classification, link_prediction
num_classes: 7
# Optimizer
optimizer:
  _target_: torch.optim.AdamW
  lr: 0.01
  weight_decay: 5e-4  # Higher weight decay for GNNs
Corresponding graph data config:
_target_: src.data.graph_datamodule.GraphDataModule
dataset_name: "Cora"
data_dir: "data/graphs/"
batch_size: 32
# Sampling (for large graphs)
use_sampling: false
num_neighbors: [15, 10, 5]
See templates/gnn-config.yaml for complete template.
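Since conv_type is stored as a string, the model code has to map it onto a PyTorch Geometric layer class. One common pattern is a getattr lookup on torch_geometric.nn; a sketch, assuming the config keys shown above:

import torch_geometric.nn as gnn

conv_cls = getattr(gnn, cfg.conv_type)  # "GCNConv" -> torch_geometric.nn.GCNConv
conv = conv_cls(cfg.hidden_channels, cfg.hidden_channels)  # one hidden-to-hidden layer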
Configuration Best Practices
1. Naming Conventions
- Use descriptive names: resnet50_pretrained.yaml, not model1.yaml
- Hierarchical naming: trainer/gpu_single.yaml, trainer/gpu_multi.yaml
- Domain prefixes: data/vision/, data/nlp/, data/graph/
2. Modularity
- Create reusable components
- Use defaults composition for experiments
- Override only what's necessary
- Keep configs DRY (Don't Repeat Yourself)
3. Documentation
Add comments explaining non-obvious parameters:
# Use higher weight decay for GNNs to prevent overfitting
weight_decay: 5e-4
# TPESampler needs startup trials for warm-up
n_startup_trials: 10
4. Type Safety
Use Pydantic models for validation:
from pydantic import BaseModel


class ModelConfig(BaseModel):
    input_dim: int
    hidden_dims: list[int]
    output_dim: int
    dropout: float
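To enforce the schema at load time, convert the OmegaConf node to a plain dict and construct the Pydantic model; a ValidationError is raised on missing or mistyped fields. A minimal sketch, assuming cfg is the loaded Hydra config:

from omegaconf import OmegaConf

raw = OmegaConf.to_container(cfg.model, resolve=True)  # plain dict, interpolations resolved
validated = ModelConfig(**raw)  # raises pydantic.ValidationError on bad input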
5. Versioning
- Track all config changes in git
- Tag configs with experiment versions
- Use meaningful commit messages for config changes
6. Sensible Defaults
- Provide defaults for all optional parameters
- Use industry-standard hyperparameters as defaults
- Document why defaults were chosen
Validation Checklist
After generating configs, validate:
- YAML syntax is valid (no tabs, correct indentation)
- All _target_ paths exist and are importable (see the helper below)
- No circular dependencies in defaults
- Config loads without errors: python src/train.py --cfg job
- Print resolved config: python src/train.py --cfg job --resolve
- All file paths are correct (data_dir, save_dir, etc.)
- Hyperparameter ranges are reasonable
- Batch size fits in GPU memory
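The _target_ check can be automated with a small helper that attempts to import each dotted path (a sketch; walk your config for _target_ keys and call it on each value):

import importlib

def target_is_importable(path: str) -> bool:
    """Return True if a dotted path like 'torch.optim.AdamW' resolves to an attribute."""
    module_name, _, attr = path.rpartition(".")
    try:
        return hasattr(importlib.import_module(module_name), attr)
    except ImportError:
        return False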
Common CLI Overrides
# Override single parameter
python src/train.py model.lr=0.01
# Override multiple parameters
python src/train.py model.lr=0.01 data.batch_size=256
# Use specific experiment config
python src/train.py experiment=resnet50_imagenet
# Override nested parameters
python src/train.py model.optimizer.weight_decay=1e-4
# Use different config group
python src/train.py model=transformer data=wikitext
# Multirun (grid search)
python src/train.py --multirun model.dropout=0.1,0.2,0.3
# Print resolved config
python src/train.py --cfg job --resolve
# Print config with overrides
python src/train.py model.lr=0.01 --cfg job
Debugging Config Issues
Config doesn't load:
# Check YAML syntax
python -c "import yaml; yaml.safe_load(open('configs/model/mymodel.yaml'))"
# Validate with Hydra
python src/train.py --cfg job --config-name myconfig
Import errors:
- Verify _target_ paths are correct
- Check all modules are importable
- Use absolute imports: src.models.mlp.MLP, not models.mlp.MLP
Defaults not resolving:
- Check defaults order (later overrides earlier)
- Use the override keyword for conflicting groups
- Verify group names match directory structure
Variable interpolation not working:
- Use ${var} syntax for interpolation
- Check the variable exists in the config
- Use ${oc.env:VAR} for environment variables
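Interpolations are resolved lazily by OmegaConf when a value is accessed, which you can verify in isolation before debugging a full config tree. A minimal sketch:

from omegaconf import OmegaConf

cfg = OmegaConf.create({"name": "run1", "log_dir": "logs/${name}"})
print(cfg.log_dir)                           # "logs/run1" -- resolved on access
print(OmegaConf.to_yaml(cfg, resolve=True))  # fully resolved dump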
Integration with Training
Once configs are created, use them in training:
import hydra
from omegaconf import DictConfig


@hydra.main(version_base=None, config_path="configs", config_name="config")
def train(cfg: DictConfig):
    # Instantiate components from config
    model = hydra.utils.instantiate(cfg.model)
    datamodule = hydra.utils.instantiate(cfg.data)
    trainer = hydra.utils.instantiate(cfg.trainer)

    # Train
    trainer.fit(model, datamodule=datamodule)


if __name__ == "__main__":
    train()
Success Criteria
- Configuration files created in correct locations
- YAML syntax is valid
- All _target_ paths resolve correctly
- Config loads without errors
- Overrides work as expected
- Documentation updated (if needed)
- Tested with actual training run
Configuration is ready for experimentation!
Source
https://github.com/nishide-dev/claude-code-ml-research/blob/main/skills/ml-config-manager/SKILL.md
Overview
ml-config-manager generates and manages Hydra YAML configs for machine learning experiments. It supports creating new configs for model, data, trainer, logger, and sweep setups, organizes hierarchical config structures, and enables Optuna-driven hyperparameter sweeps.
How This Skill Works
The tool uses predefined templates (model-config.yaml, data-config.yaml, trainer-config.yaml, experiment-config.yaml, gnn-config.yaml, sweep-config.yaml) to generate YAML configs with Hydra _target_ references. You fill in fields like architecture, dataset, and training parameters, and you can configure Optuna sweeps for hyperparameters within the sweep templates.
When to Use It
- Setting up a new experiment with a fresh config hierarchy and templates
- Defining model architecture, data processing, and data loaders in a structured way
- Configuring PyTorch Lightning trainer settings (precision, devices, max epochs)
- Organizing experiment configs and logger/experiment composition for reuse
- Setting up Optuna hyperparameter sweeps to tune learning rate, dropout, and more
Quick Start
- Step 1: Pick a template (model, data, trainer, etc.) and create a YAML under configs/
- Step 2: Fill required fields (architecture, dataset, training parameters) to match your task
- Step 3: Run your training script or the config generator to produce Hydra configs and optional sweep configs
Best Practices
- Start from a base template and extend it for specialized experiments
- Keep configs under configs/<type>/<name>.yaml to maintain organization
- Reuse templates to ensure consistency across models, data, and trainers
- Validate _target_ references and file paths before running experiments
- Document field dependencies (e.g., batch size vs. workers) and how changes affect other components
Example Use Cases
- Create a CNN model config and a corresponding data config for CIFAR-10, then link them in an experiment config
- Define a PyTorch Lightning trainer config with mixed precision and 1 GPU for ImageNet fine-tuning
- Set up a sweep-config using Optuna to tune learning rate and dropout for an MLP on MNIST
- Compose an end-to-end experiment config that combines model, data, trainer, and logger modules
- Configure a GNN with a dedicated gnn-config.yaml and integrate it with a graph dataset