A systematic debugging guide for ML training issues using PyTorch Lightning with practical checks and fixes.

How do I fix NaN loss?

Lower the learning rate, enable gradient clipping, ensure numerical stability (epsilon, stable softmax), and verify input data for NaN/Inf before training.

How can I address CUDA OOM?

Reduce batch size, enable gradient checkpointing, and consider precision tweaks or distributed strategies to fit memory constraints.

ml-debug

Scanned

npx machina-cli add skill nishide-dev/claude-code-ml-research/ml-debug --openclaw

Files (1)

SKILL.md

10.3 KB

ML Training Debugging

Systematic debugging guide for machine learning training issues with PyTorch Lightning.

Quick Diagnosis

Identify your problem category:

Symptom	Category	Quick Check
`NaN` or `Inf` in loss	Loss Issues	Check learning rate, gradient clipping
Training loss >> validation loss	Overfitting	Add regularization, data augmentation
Both losses high	Underfitting	Increase model capacity, train longer
GPU utilization <80%	Data Loading	Increase `num_workers`, use faster storage
CUDA out of memory	Memory Issues	Reduce `batch_size`, use gradient checkpointing
Loss plateau	Convergence Issues	Adjust learning rate, try different optimizer
Wrong predictions	Data Issues	Check labels, verify preprocessing

1. Loss Issues

NaN or Inf Loss

Diagnosis:

# Check training logs for NaN
grep -i "nan\|inf" logs/train.log

# Test with lower learning rate
python src/train.py model.optimizer.lr=0.0001

Solutions (try in order):

Reduce learning rate

model:
  optimizer:
    lr: 0.0001  # Start small

Enable gradient clipping

trainer:
  gradient_clip_val: 1.0
  gradient_clip_algorithm: "norm"

Fix numerical stability

# In model forward pass
# Bad: Division without epsilon
output = x / y

# Good: Add epsilon
output = x / (y + 1e-8)

# Bad: Manual log(softmax)
output = torch.log(F.softmax(x, dim=-1))

# Good: Use stable version
output = F.log_softmax(x, dim=-1)

Try full precision

python src/train.py trainer.precision=32

Check for inf/nan in data

# In DataModule.setup()
sample = self.train_dataset[0]
assert not torch.isnan(sample[0]).any(), "NaN in input data"
assert not torch.isinf(sample[0]).any(), "Inf in input data"

See examples/nan-loss-debugging.md for detailed guide.

Exploding Gradients

Diagnosis:

# Enable gradient tracking
trainer = Trainer(
    gradient_clip_val=1.0,
    track_grad_norm=2,  # Log L2 norm
    log_every_n_steps=10
)

Solutions:

If gradient norm >100: reduce learning rate or increase clipping
If gradients explode in specific layer: check weight initialization
Use gradient accumulation to reduce per-step updates

Vanishing Gradients

Symptoms: Gradients close to zero, no learning in early layers.

Solutions:

# 1. Use ReLU instead of Sigmoid/Tanh
self.activation = nn.ReLU()

# 2. Add skip connections (ResNet-style)
def forward(self, x):
    identity = x
    out = self.layer(x)
    return out + identity  # Skip connection

# 3. Use batch normalization
self.bn = nn.BatchNorm1d(hidden_dim)

# 4. Better initialization
for layer in self.layers:
    nn.init.kaiming_normal_(layer.weight, mode='fan_out', nonlinearity='relu')

2. Performance Issues

Overfitting (val_loss > train_loss)

Diagnosis:

# Plot losses
import matplotlib.pyplot as plt
import pandas as pd

metrics = pd.read_csv("logs/metrics.csv")
plt.plot(metrics["train_loss"], label="Train")
plt.plot(metrics["val_loss"], label="Val")
plt.legend()
plt.savefig("overfit_analysis.png")

# Overfit ratio
ratio = metrics["val_loss"].iloc[-1] / metrics["train_loss"].iloc[-1]
if ratio > 1.5:
    print("⚠️  SEVERE OVERFITTING")

Solutions:

# 1. Add regularization
model:
  dropout: 0.3  # Increase dropout
  optimizer:
    weight_decay: 0.0001  # L2 regularization

# 2. Data augmentation
data:
  augmentation:
    random_crop: true
    horizontal_flip: true
    mixup_alpha: 0.2

# 3. Early stopping
callbacks:
  early_stopping:
    monitor: "val/loss"
    patience: 10
    mode: "min"

# 4. Reduce model size
model:
  hidden_dims: [256, 128]  # Smaller model

Underfitting (both losses high)

Solutions:

# 1. Increase model capacity
model:
  hidden_dims: [2048, 1024, 512, 256]
  num_layers: 6

# 2. Train longer
trainer:
  max_epochs: 200

# 3. Increase learning rate
model:
  optimizer:
    lr: 0.01

# 4. Remove excessive regularization
model:
  dropout: 0.1
  optimizer:
    weight_decay: 0.00001

Slow Convergence

Try different optimizers:

# AdamW (default, good for most)
optimizer = torch.optim.AdamW(params, lr=0.001, weight_decay=0.01)

# SGD with momentum (good for vision)
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9, nesterov=True)

# RMSprop (good for RNNs)
optimizer = torch.optim.RMSprop(params, lr=0.001)

Add learning rate scheduler:

model:
  scheduler:
    # Cosine annealing
    _target_: torch.optim.lr_scheduler.CosineAnnealingLR
    T_max: 100
    eta_min: 1e-6

    # Or OneCycle (often fastest convergence)
    # _target_: torch.optim.lr_scheduler.OneCycleLR
    # max_lr: 0.01
    # total_steps: ${trainer.max_steps}

3. Speed Issues

Profile Training

# Use Lightning profiler
python src/train.py \
  trainer.profiler="advanced" \
  trainer.max_epochs=1

# Check GPU utilization
watch -n 1 nvidia-smi

Data Loading Bottleneck

Symptoms: GPU utilization <80%, slow batch iteration.

Solutions:

data:
  num_workers: 8  # Increase workers
  pin_memory: true
  persistent_workers: true
  prefetch_factor: 2

# Use faster data formats
# HDF5, LMDB, or preprocessed tensors

Slow Forward/Backward

Solutions:

# 1. Use torch.compile (PyTorch 2.0+)
def configure_model(self):
    self.model = torch.compile(self.model, mode="reduce-overhead")

# 2. Avoid Python loops - use vectorization
# Bad
for i in range(batch_size):
    result[i] = self.layer(x[i])

# Good
result = self.layer(x)

# 3. Mixed precision
# trainer.precision = "16-mixed"

# 4. Gradient accumulation
# trainer.accumulate_grad_batches = 4

See scripts/profile_training.py for profiling script.

4. Memory Issues

Out of Memory (OOM)

Diagnosis:

# Check GPU memory
nvidia-smi

# Profile memory usage
python src/train.py trainer.profiler="pytorch" trainer.max_epochs=1

Solutions (in order):

# 1. Reduce batch size
data:
  batch_size: 32  # Was 128

# 2. Mixed precision
trainer:
  precision: "16-mixed"

# 3. Gradient accumulation (maintains effective batch size)
trainer:
  accumulate_grad_batches: 4  # Effective: 32 * 4 = 128

# 4. Gradient checkpointing (in model)
# self.model.gradient_checkpointing_enable()

# 5. Reduce model size
model:
  hidden_dims: [512, 256]

Memory Leak

Common causes:

# Bad: Keeps computation graph
loss_history = []
for batch in dataloader:
    loss = model(batch)
    loss_history.append(loss)  # ❌

# Good: Only store scalar
loss_history = []
for batch in dataloader:
    loss = model(batch)
    loss_history.append(loss.item())  # ✅

# Clear cache periodically
if self.global_step % 100 == 0:
    torch.cuda.empty_cache()

Checkpoint management:

callbacks:
  model_checkpoint:
    save_top_k: 3  # Only keep 3 best

5. Data Issues

Check Data Shapes

# In DataModule.setup()
def setup(self, stage=None):
    self.train_dataset = ...

    # Validate
    sample = self.train_dataset[0]
    print(f"Input shape: {sample[0].shape}")
    print(f"Label shape: {sample[1].shape}")
    print(f"Label range: [{sample[1].min()}, {sample[1].max()}]")

    # Check for NaN
    assert not torch.isnan(sample[0]).any(), "NaN in input"

    # Label distribution
    labels = [self.train_dataset[i][1] for i in range(min(1000, len(self.train_dataset)))]
    print(f"Distribution: {pd.Series(labels).value_counts()}")

Visualize Data

# Show a batch
import matplotlib.pyplot as plt

dm = MyDataModule()
dm.setup()
loader = dm.train_dataloader()
batch = next(iter(loader))
inputs, labels = batch

fig, axes = plt.subplots(4, 4, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
    ax.imshow(inputs[i].permute(1, 2, 0))  # CHW -> HWC
    ax.set_title(f"Label: {labels[i].item()}")
    ax.axis('off')
plt.savefig("data_samples.png")

6. PyTorch Geometric Specific Issues

Over-smoothing in GNNs

Symptoms: All node representations become similar after many layers.

Solutions:

# 1. Reduce layers
model:
  num_layers: 2  # Instead of 5+

# 2. Add skip connections
# In model forward:
# x = conv(x, edge_index) + x_orig

# 3. Use jumping knowledge
model:
  jk_mode: "cat"  # or "max", "lstm"

Large Graph OOM

# Use mini-batch training with sampling
data:
  use_sampling: true
  num_neighbors: [15, 10, 5]  # Per-layer sampling
  batch_size: 1024  # Mini-batches of nodes

See examples/gnn-debugging.md for GNN-specific guide.

Debugging Checklist

Data:

Model:

Forward pass works with dummy data
Output shape matches expected
Reasonable number of parameters
Gradients flow through all layers

Training:

Config:

Learning rate appropriate (0.0001-0.01)
Batch size fits in memory
Enough epochs

Common Error Messages

Error	Solution
`CUDA out of memory`	Reduce batch_size, enable gradient checkpointing, use fp16
`Expected all tensors on same device`	`x = x.to(self.device)` in forward
`Target size must match input size`	Check loss function, verify output dims
`Sizes of tensors must match`	Check batch dimensions

Generate Debug Report

Use the debug report script:

python scripts/debug_report.py --log-dir logs/

See scripts/debug_report.py for implementation.

Success Criteria

Debugging complete - training is back on track!

Source

git clone https://github.com/nishide-dev/claude-code-ml-research/blob/main/skills/ml-debug/SKILL.mdView on GitHub

Overview

ML Training Debugging is a systematic guide for diagnosing and solving common ML training issues in PyTorch Lightning. It covers NaN/Inf losses, CUDA OOM, data loading bottlenecks, convergence problems, and overfitting, with practical checks and concrete fixes.

How This Skill Works

Symptoms are categorized intoLoss Issues, Memory Issues, Convergence, and Data problems, then paired with prioritized tests and remedies. The guide provides concrete code snippets, config tweaks, and sanity checks to apply in a logical order from quick wins to deeper fixes in real training runs.

When to Use It

NaN or Inf loss during training
CUDA out of memory (OOM) error
Loss plateau or very slow convergence
Overfitting or large generalization gap (val_loss >> train_loss)
Data loading bottlenecks or underutilized GPU (low GPU utilization)

Quick Start

Step 1: Run quick diagnosis by checking logs for NaN/Inf and CUDA OOM signals (grep logs/train.log)
Step 2: Apply fixes in order: reduce learning rate, enable gradient clipping, and ensure numerically stable ops
Step 3: Re-run training with sanity checks: verify data cleanliness, enable gradient norms, and test lower precision if needed

Best Practices

Start by reducing the learning rate and enabling gradient clipping to stabilize updates
Use numerically stable operations (epsilon in divisions, log_softmax instead of manual log)
Verify data quality with NaN/Inf checks before training begins
Apply regularization and data augmentation to improve generalization
For memory issues, reduce batch size and consider gradient checkpointing to fit larger models

Example Use Cases

NaN loss in a PyTorch Lightning run: lowered learning rate and added gradient clipping (gradient_clip_val) to stabilize training
CUDA OOM resolved by reducing batch_size and enabling gradient checkpointing for long sequences
Exploding gradients diagnosed via gradient norm tracking; reduced LR and enabled clipping to fix
Vanishing gradients addressed by switching to ReLU, adding skip connections, and using batch normalization
Overfitting mitigated with dropout, data augmentation, and weight decay adjustments

Frequently Asked Questions

Add this skill to your agents

ml-debug

ML Training Debugging

Quick Diagnosis

1. Loss Issues

NaN or Inf Loss

Exploding Gradients

Vanishing Gradients

2. Performance Issues

Overfitting (val_loss > train_loss)

Underfitting (both losses high)

Slow Convergence

3. Speed Issues

Profile Training

Data Loading Bottleneck

Slow Forward/Backward

4. Memory Issues

Out of Memory (OOM)

Memory Leak

5. Data Issues

Check Data Shapes

Visualize Data

6. PyTorch Geometric Specific Issues

Over-smoothing in GNNs

Large Graph OOM

Debugging Checklist

Common Error Messages

Generate Debug Report

Success Criteria

Source

Overview

How This Skill Works

When to Use It

Quick Start

Best Practices

Example Use Cases

Frequently Asked Questions

What is ml-debug?

How do I fix NaN loss?

How can I address CUDA OOM?