ml-debug

npx machina-cli add skill nishide-dev/claude-code-ml-research/ml-debug --openclaw

ML Training Debugging

Systematic debugging guide for machine learning training issues with PyTorch Lightning.

Quick Diagnosis

Identify your problem category:

| Symptom | Category | Quick Check |
|---|---|---|
| NaN or Inf in loss | Loss Issues | Check learning rate, gradient clipping |
| Validation loss >> training loss | Overfitting | Add regularization, data augmentation |
| Both losses high | Underfitting | Increase model capacity, train longer |
| GPU utilization <80% | Data Loading | Increase num_workers, use faster storage |
| CUDA out of memory | Memory Issues | Reduce batch_size, use gradient checkpointing |
| Loss plateau | Convergence Issues | Adjust learning rate, try different optimizer |
| Wrong predictions | Data Issues | Check labels, verify preprocessing |

1. Loss Issues

NaN or Inf Loss

Diagnosis:

# Check training logs for NaN/Inf (-w matches whole words, so "INFO" is not a hit)
grep -iwE "nan|inf" logs/train.log

# Test with lower learning rate
python src/train.py model.optimizer.lr=0.0001
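
A plain substring search for inf also matches words such as INFO or inference. The following is a minimal Python sketch of a stricter scan. The helper name `find_nonfinite_lines` and the sample log lines are assumptions made for illustration; the real log format of your project may differ.

```python
import re

# Match "nan" or "inf" only when not embedded in a longer word,
# so "INFO" and "inference" are ignored but "loss=nan" and "-inf" are caught.
NONFINITE = re.compile(r"(?<![A-Za-z])(nan|inf)(?![A-Za-z])", re.IGNORECASE)


def find_nonfinite_lines(lines):
    """Return (line_number, text) pairs for lines that mention NaN or Inf."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if NONFINITE.search(line)
    ]


# Hypothetical log lines, for illustration only
sample_log = [
    "INFO step=1 loss=0.52",
    "INFO step=2 loss=nan",
    "inference done",
    "grad_norm=-inf",
]
for number, text in find_nonfinite_lines(sample_log):
    print(f"line {number}: {text}")
```

To scan a real file, pass an open file handle, for example `find_nonfinite_lines(open("logs/train.log"))`. The first hit usually marks the step where training started to diverge.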

Solutions (try in order):

  1. Reduce learning rate
model:
  optimizer:
    lr: 0.0001  # Start small
  2. Enable gradient clipping
trainer:
  gradient_clip_val: 1.0
  gradient_clip_algorithm: "norm"
  3. Fix numerical stability
# In model forward pass
# Bad: Division without epsilon
output = x / y

# Good: Add epsilon
output = x / (y + 1e-8)

# Bad: Manual log(softmax)
output = torch.log(F.softmax(x, dim=-1))

# Good: Use stable version
output = F.log_softmax(x, dim=-1)
  4. Try full precision
python src/train.py trainer.precision=32
  5. Check for inf/nan in data
# In DataModule.setup()
sample = self.train_dataset[0]
assert not torch.isnan(sample[0]).any(), "NaN in input data"
assert not torch.isinf(sample[0]).any(), "Inf in input data"

See examples/nan-loss-debugging.md for detailed guide.
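
To see why a fused log-softmax is safer than taking the log of a softmax, here is a dependency-free sketch of the log-sum-exp trick. It illustrates the idea only and is not PyTorch's implementation; both function names are made up for this example.

```python
import math


def naive_log_softmax(logits):
    """log(softmax(x)) computed literally; exp() overflows for large logits."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]


def stable_log_softmax(logits):
    """Subtract the max before exponentiating (log-sum-exp trick)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


logits = [1000.0, 1000.0]
try:
    naive_log_softmax(logits)
except OverflowError:
    print("naive version overflowed")
print(stable_log_softmax(logits))
```

The stable version never exponentiates a number larger than zero, so it stays finite for inputs that break the naive one.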

Exploding Gradients

Diagnosis:

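# Enable gradient tracking.
# Trainer(track_grad_norm=...) was removed in Lightning 2.0, so log norms from the LightningModule.
from lightning.pytorch.utilities import grad_norm

# In your LightningModule
def on_before_optimizer_step(self, optimizer):
    self.log_dict(grad_norm(self, norm_type=2))  # Log L2 norm

trainer = Trainer(gradient_clip_val=1.0, log_every_n_steps=10)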

Solutions:

  • If gradient norm >100: reduce the learning rate or clip more aggressively (lower gradient_clip_val)
  • If gradients explode in specific layer: check weight initialization
  • Use gradient accumulation to reduce per-step updates
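
As a rough picture of what norm-based clipping does, the sketch below rescales a flat list of gradient values so their global L2 norm does not exceed a threshold. It is a simplified illustration and not the library routine (PyTorch's own version also adds a small epsilon); `clip_by_global_norm` is a hypothetical helper.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Return (clipped_grads, original_norm) for a flat list of gradient values."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the limit: leave the gradients untouched
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(f"norm before clipping: {norm}, clipped: {clipped}")
```

Clipping keeps the direction of the update and only shrinks its length, which is why it is usually tried before more invasive changes.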

Vanishing Gradients

Symptoms: Gradients close to zero, no learning in early layers.

Solutions:

# 1. Use ReLU instead of Sigmoid/Tanh
self.activation = nn.ReLU()

# 2. Add skip connections (ResNet-style)
def forward(self, x):
    identity = x
    out = self.layer(x)
    return out + identity  # Skip connection

# 3. Use batch normalization
self.bn = nn.BatchNorm1d(hidden_dim)

# 4. Better initialization
for layer in self.layers:
    nn.init.kaiming_normal_(layer.weight, mode='fan_out', nonlinearity='relu')
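
The reason ReLU helps can be shown with simple arithmetic: the sigmoid derivative is at most 0.25, so a chain of sigmoid layers multiplies the gradient by a factor of 0.25 or less per layer, while an active ReLU unit passes it through unchanged. The sketch below ignores weights entirely and uses hypothetical helper names.

```python
import math


def sigmoid_grad(x):
    """Derivative of the sigmoid; peaks at 0.25 when x == 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU: 1 for active units, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_grad(local_grad, pre_activation, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return local_grad(pre_activation) ** depth


for depth in (1, 5, 10, 20):
    print(depth, chained_grad(sigmoid_grad, 0.0, depth), chained_grad(relu_grad, 1.0, depth))
```

Even in the best case for sigmoid (pre-activation of zero), the chained factor shrinks geometrically with depth.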

2. Performance Issues

Overfitting (val_loss >> train_loss)

Diagnosis:

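# Plot losses
import matplotlib.pyplot as plt
import pandas as pd

metrics = pd.read_csv("logs/metrics.csv")
# If the file comes from Lightning's CSVLogger, a metric is NaN on rows
# where it was not logged, so drop those rows per column
train_loss = metrics["train_loss"].dropna()
val_loss = metrics["val_loss"].dropna()
plt.plot(train_loss.index, train_loss, label="Train")
plt.plot(val_loss.index, val_loss, label="Val")
plt.legend()
plt.savefig("overfit_analysis.png")

# Overfit ratio
ratio = val_loss.iloc[-1] / train_loss.iloc[-1]
if ratio > 1.5:
    print("⚠️  SEVERE OVERFITTING")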

Solutions:

# 1. Add regularization
model:
  dropout: 0.3  # Increase dropout
  optimizer:
    weight_decay: 0.0001  # L2 regularization

# 2. Data augmentation
data:
  augmentation:
    random_crop: true
    horizontal_flip: true
    mixup_alpha: 0.2

# 3. Early stopping
callbacks:
  early_stopping:
    monitor: "val/loss"  # Must match the metric name passed to self.log()
    patience: 10
    mode: "min"

# 4. Reduce model size
model:
  hidden_dims: [256, 128]  # Smaller model
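
The patience setting used for early stopping can be read as "stop after this many consecutive validation checks without improvement". A minimal sketch of that rule follows; it is an illustration and not Lightning's callback code, and `early_stop_epoch` and `min_delta` are names chosen for this example.

```python
def early_stop_epoch(val_losses, patience, min_delta=0.0):
    """Return the epoch index at which training would stop, or None if it never does."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


history = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
print(early_stop_epoch(history, patience=3))
print(early_stop_epoch(history, patience=10))
```

A small patience reacts quickly to overfitting but can stop on a noisy plateau; a larger one tolerates noise at the cost of extra epochs.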

Underfitting (both losses high)

Solutions:

# 1. Increase model capacity
model:
  hidden_dims: [2048, 1024, 512, 256]
  num_layers: 6

# 2. Train longer
trainer:
  max_epochs: 200

# 3. Increase learning rate
model:
  optimizer:
    lr: 0.01

# 4. Remove excessive regularization
model:
  dropout: 0.1
  optimizer:
    weight_decay: 0.00001

Slow Convergence

Try different optimizers:

# AdamW (a good default for most models)
optimizer = torch.optim.AdamW(params, lr=0.001, weight_decay=0.01)

# SGD with momentum (good for vision)
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9, nesterov=True)

# RMSprop (good for RNNs)
optimizer = torch.optim.RMSprop(params, lr=0.001)

Add learning rate scheduler:

model:
  scheduler:
    # Cosine annealing
    _target_: torch.optim.lr_scheduler.CosineAnnealingLR
    T_max: 100
    eta_min: 1e-6

    # Or OneCycle (often fastest convergence)
    # _target_: torch.optim.lr_scheduler.OneCycleLR
    # max_lr: 0.01
    # total_steps: ${trainer.max_steps}  # Set trainer.max_steps explicitly; Lightning's default is -1
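
To get a feel for what T_max and eta_min control, the closed-form cosine annealing curve (without restarts) can be evaluated by hand. The function below is a standalone sketch of that formula, not the scheduler class itself, and the base learning rate of 0.001 is an assumed value.

```python
import math


def cosine_annealing_lr(step, base_lr, t_max, eta_min=0.0):
    """Learning rate at `step` on a cosine curve from base_lr down to eta_min."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * step / t_max))


for step in (0, 25, 50, 75, 100):
    print(step, cosine_annealing_lr(step, base_lr=0.001, t_max=100, eta_min=1e-6))
```

The rate starts at the base value, passes the midpoint between base_lr and eta_min at T_max / 2, and reaches eta_min at T_max, so T_max should normally match the number of scheduler steps in the run.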

3. Speed Issues

Profile Training

# Use Lightning profiler
python src/train.py \
  trainer.profiler="advanced" \
  trainer.max_epochs=1

# Check GPU utilization
watch -n 1 nvidia-smi

Data Loading Bottleneck

Symptoms: GPU utilization <80%, slow batch iteration.

Solutions:

data:
  num_workers: 8  # Increase workers
  pin_memory: true
  persistent_workers: true
  prefetch_factor: 2

# Use faster data formats
# HDF5, LMDB, or preprocessed tensors

Slow Forward/Backward

Solutions:

# 1. Use torch.compile (PyTorch 2.0+)
def configure_model(self):
    self.model = torch.compile(self.model, mode="reduce-overhead")

# 2. Avoid Python loops - use vectorization
# Bad
for i in range(batch_size):
    result[i] = self.layer(x[i])

# Good
result = self.layer(x)

# 3. Mixed precision
# trainer.precision = "16-mixed"

# 4. Gradient accumulation
# trainer.accumulate_grad_batches = 4

See scripts/profile_training.py for profiling script.

4. Memory Issues

Out of Memory (OOM)

Diagnosis:

# Check GPU memory
nvidia-smi

# Profile memory usage
python src/train.py trainer.profiler="pytorch" trainer.max_epochs=1

Solutions (in order):

# 1. Reduce batch size
data:
  batch_size: 32  # Was 128

# 2. Mixed precision
trainer:
  precision: "16-mixed"

# 3. Gradient accumulation (maintains effective batch size)
trainer:
  accumulate_grad_batches: 4  # Effective: 32 * 4 = 128

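# 4. Gradient checkpointing (in model)
# Hugging Face Transformers models: self.model.gradient_checkpointing_enable()
# Other PyTorch modules: wrap blocks with torch.utils.checkpoint.checkpoint(...)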

# 5. Reduce model size
model:
  hidden_dims: [512, 256]
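
When the batch size is cut to fit in memory, the accumulation factor that restores the original effective batch size is simple arithmetic. A small sketch, with `accumulation_steps` as a hypothetical helper name:

```python
import math


def accumulation_steps(target_batch, micro_batch, num_devices=1):
    """Number of micro-batches to accumulate so the effective batch reaches the target."""
    per_step = micro_batch * num_devices
    return max(1, math.ceil(target_batch / per_step))


# Matches the config above: 32 * 4 = 128
print(accumulation_steps(target_batch=128, micro_batch=32))
```

The result is rounded up, so the effective batch size may slightly exceed the target when the numbers do not divide evenly.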

Memory Leak

Common causes:

# Bad: Keeps computation graph
loss_history = []
for batch in dataloader:
    loss = model(batch)
    loss_history.append(loss)  # ❌

# Good: Only store scalar
loss_history = []
for batch in dataloader:
    loss = model(batch)
    loss_history.append(loss.item())  # ✅

# Release cached GPU memory periodically
# (returns unused cached blocks to the driver; it does not fix a leak)
if self.global_step % 100 == 0:
    torch.cuda.empty_cache()

Checkpoint management:

callbacks:
  model_checkpoint:
    save_top_k: 3  # Only keep 3 best

5. Data Issues

Check Data Shapes

# In DataModule.setup()
def setup(self, stage=None):
    self.train_dataset = ...

    # Validate
    sample = self.train_dataset[0]
    print(f"Input shape: {sample[0].shape}")
    print(f"Label shape: {sample[1].shape}")
    print(f"Label range: [{sample[1].min()}, {sample[1].max()}]")

    # Check for NaN
    assert not torch.isnan(sample[0]).any(), "NaN in input"

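    # Label distribution (requires `import pandas as pd`)
    # int() turns 0-dim tensors into plain ints so value_counts() can group them
    labels = [int(self.train_dataset[i][1]) for i in range(min(1000, len(self.train_dataset)))]
    print(f"Distribution: {pd.Series(labels).value_counts()}")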
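
The label checks can also be done without pandas. The sketch below assumes integer class labels and reports counts, out-of-range values, and classes that never appear; `summarize_labels` is a hypothetical helper, not part of the skill.

```python
from collections import Counter


def summarize_labels(labels, num_classes):
    """Count labels and flag values outside 0..num_classes-1 or classes with no samples."""
    counts = Counter(int(y) for y in labels)
    out_of_range = sorted(c for c in counts if not 0 <= c < num_classes)
    missing = [c for c in range(num_classes) if c not in counts]
    return {"counts": dict(counts), "out_of_range": out_of_range, "missing": missing}


report = summarize_labels([0, 1, 1, 3, -1], num_classes=3)
print(report)
```

Out-of-range labels are a common cause of device-side assert errors in classification losses, and missing classes often point at a broken split.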

Visualize Data

# Show a batch
import matplotlib.pyplot as plt

dm = MyDataModule()
dm.setup()
loader = dm.train_dataloader()
batch = next(iter(loader))
inputs, labels = batch

fig, axes = plt.subplots(4, 4, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
    ax.imshow(inputs[i].permute(1, 2, 0))  # CHW -> HWC
    ax.set_title(f"Label: {labels[i].item()}")
    ax.axis('off')
plt.savefig("data_samples.png")

6. PyTorch Geometric Specific Issues

Over-smoothing in GNNs

Symptoms: All node representations become similar after many layers.

Solutions:

# 1. Reduce layers
model:
  num_layers: 2  # Instead of 5+

# 2. Add skip connections
# In model forward:
# x = conv(x, edge_index) + x_orig

# 3. Use jumping knowledge
model:
  jk_mode: "cat"  # or "max", "lstm"
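
Over-smoothing can be reproduced on a toy graph without any GNN library. The sketch below applies plain mean aggregation (each node averages itself and its neighbors, with no learned weights) and tracks how far apart the node values remain. The graph, the features, and the function names are all made up for illustration.

```python
def mean_aggregate(features, neighbors):
    """One round of mean aggregation over each node and its neighbors."""
    return [
        (features[node] + sum(features[n] for n in nbrs)) / (1 + len(nbrs))
        for node, nbrs in enumerate(neighbors)
    ]


def spread_after(features, neighbors, num_layers):
    """Max minus min node value after `num_layers` rounds of aggregation."""
    for _ in range(num_layers):
        features = mean_aggregate(features, neighbors)
    return max(features) - min(features)


# Path graph 0 - 1 - 2 - 3 with two distinct groups of nodes
path_graph = [[1], [0, 2], [1, 3], [2]]
initial = [0.0, 0.0, 1.0, 1.0]
for layers in (0, 2, 5, 30):
    print(layers, round(spread_after(initial, path_graph, layers), 4))
```

As the number of rounds grows, the spread shrinks toward zero and the two groups become indistinguishable, which is the effect that fewer layers, skip connections, and jumping knowledge are meant to counter.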

Large Graph OOM

# Use mini-batch training with sampling
data:
  use_sampling: true
  num_neighbors: [15, 10, 5]  # Per-layer sampling
  batch_size: 1024  # Mini-batches of nodes
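
The fan-out list bounds how large each sampled subgraph can get, which is worth checking before picking a batch size. The sketch below computes the worst case, assuming every hop samples the full fan-out and no node is reached twice; `max_sampled_nodes` is a hypothetical helper.

```python
def max_sampled_nodes(num_neighbors, batch_size=1):
    """Upper bound on nodes in a sampled subgraph (seed nodes included, no deduplication)."""
    total = 1      # the seed node itself
    frontier = 1   # nodes added at the current hop
    for fanout in num_neighbors:
        frontier *= fanout
        total += frontier
    return total * batch_size


print(max_sampled_nodes([15, 10, 5]))
print(max_sampled_nodes([15, 10, 5], batch_size=1024))
```

Real subgraphs are usually smaller because neighborhoods overlap, but the bound shows how quickly deeper or wider sampling multiplies memory use.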

See examples/gnn-debugging.md for GNN-specific guide.

Debugging Checklist

Data:

  • Loads without errors
  • Shapes are correct
  • No NaN or Inf values
  • Labels in correct range (0 to num_classes-1)
  • Augmentation works
  • Splits are correct

Model:

  • Forward pass works with dummy data
  • Output shape matches expected
  • Reasonable number of parameters
  • Gradients flow through all layers

Training:

  • Loss decreases in first epoch
  • Validation runs correctly
  • Checkpoints save
  • Metrics logged
  • GPU utilization high (>80%)

Config:

  • Learning rate appropriate (0.0001-0.01)
  • Batch size fits in memory
  • Enough epochs

Common Error Messages

| Error | Solution |
|---|---|
| CUDA out of memory | Reduce batch_size, enable gradient checkpointing, use fp16 |
| Expected all tensors on same device | x = x.to(self.device) in forward |
| Target size must match input size | Check loss function, verify output dims |
| Sizes of tensors must match | Check batch dimensions |

Generate Debug Report

Use the debug report script:

python scripts/debug_report.py --log-dir logs/

See scripts/debug_report.py for implementation.

Success Criteria

  • Problem identified and categorized
  • Root cause determined
  • Solution applied
  • Training resumes successfully
  • Metrics improve
  • No errors in logs

Debugging complete - training is back on track!

Source

https://github.com/nishide-dev/claude-code-ml-research/blob/main/skills/ml-debug/SKILL.md

Overview

ML Training Debugging is a systematic guide for diagnosing and solving common ML training issues in PyTorch Lightning. It covers NaN/Inf losses, CUDA OOM, data loading bottlenecks, convergence problems, and overfitting, with practical checks and concrete fixes.

How This Skill Works

Symptoms are categorized into Loss Issues, Memory Issues, Convergence, and Data problems, then paired with prioritized tests and remedies. The guide provides concrete code snippets, config tweaks, and sanity checks, applied in order from quick wins to deeper fixes in real training runs.

When to Use It

  • NaN or Inf loss during training
  • CUDA out of memory (OOM) error
  • Loss plateau or very slow convergence
  • Overfitting or large generalization gap (val_loss >> train_loss)
  • Data loading bottlenecks or underutilized GPU (low GPU utilization)

Quick Start

  1. Run a quick diagnosis: check the logs for NaN/Inf and CUDA OOM signals (grep logs/train.log)
  2. Apply fixes in order: reduce the learning rate, enable gradient clipping, and ensure numerically stable ops
  3. Re-run training with sanity checks: verify data cleanliness, log gradient norms, and test full precision (trainer.precision=32) if needed

Best Practices

  • Start by reducing the learning rate and enabling gradient clipping to stabilize updates
  • Use numerically stable operations (epsilon in divisions, log_softmax instead of a manual log(softmax))
  • Verify data quality with NaN/Inf checks before training begins
  • Apply regularization and data augmentation to improve generalization
  • For memory issues, reduce batch size and consider gradient checkpointing to fit larger models

Example Use Cases

  • NaN loss in a PyTorch Lightning run: lowered learning rate and added gradient clipping (gradient_clip_val) to stabilize training
  • CUDA OOM resolved by reducing batch_size and enabling gradient checkpointing for long sequences
  • Exploding gradients diagnosed via gradient norm tracking; reduced LR and enabled clipping to fix
  • Vanishing gradients addressed by switching to ReLU, adding skip connections, and using batch normalization
  • Overfitting mitigated with dropout, data augmentation, and weight decay adjustments
