PyTorch Deep Learning Framework
Overview
PyTorch is an open-source machine learning library developed by Facebook's AI Research lab. It provides tensor computation with GPU acceleration and deep neural networks built on a tape-based automatic differentiation system.
Key Features
Dynamic Computation Graphs
PyTorch uses dynamic computational graphs that are built on-the-fly, making debugging easier and enabling more flexible model architectures.
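A minimal illustration of this: because the graph is built as operations execute, data-dependent branching is just ordinary Python control flow, and autograd records whichever path was actually taken (the values here are made up for illustration):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# The graph is constructed on-the-fly, so a plain Python `if`
# decides which operations enter the graph at runtime.
if x.item() > 0:
    y = x * 3
else:
    y = x - 1

y.backward()
print(x.grad)  # gradient of the branch actually taken: tensor(3.)
```

Standard debuggers and print statements work inside the forward pass for the same reason.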
GPU Acceleration
Seamless CUDA integration for GPU-accelerated computing:
```python
import torch

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor = torch.randn(1000, 1000).to(device)
```
Automatic Differentiation
Autograd system for automatic computation of gradients:
```python
x = torch.randn(3, requires_grad=True)
y = x * 2
while y.norm() < 1000:
    y = y * 2
# y is non-scalar, so grad() needs grad_outputs (the vector in the
# vector-Jacobian product); a scalar loss would not need this
gradients = torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y))
```
Model Design Patterns
Basic Model Structure
```python
import torch.nn as nn
import torch.nn.functional as F

class NeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out
```
Convolutional Neural Networks
```python
class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        # 64 * 8 * 8 assumes 32x32 inputs (e.g. CIFAR-10), halved twice by pooling
        self.fc1 = nn.Linear(64 * 8 * 8, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * 8 * 8)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
```
Transfer Learning
```python
import torchvision.models as models

# Load pretrained model (torchvision >= 0.13 uses the weights API
# in place of the deprecated pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze early layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer (num_classes = number of classes in the new task)
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, num_classes)
```
Training Best Practices
Training Loop Template
```python
def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs, device):
    model = model.to(device)
    best_val_loss = float('inf')
    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        # Validation phase
        model.eval()
        val_loss = 0.0
        correct = 0
        total = 0
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                val_loss += loss.item()
                _, predicted = outputs.max(1)
                total += labels.size(0)
                correct += predicted.eq(labels).sum().item()

        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best_model.pth')

        print(f'Epoch {epoch+1}/{num_epochs}')
        print(f'Train Loss: {train_loss/len(train_loader):.4f}')
        print(f'Val Loss: {val_loss/len(val_loader):.4f}')
        print(f'Val Acc: {100.*correct/total:.2f}%')
    return model
```
Optimizer Choice
- Adam: Default choice for most tasks (lr=0.001)
- AdamW: Better for transformers (lr=1e-4)
- SGD with Momentum: Better generalization (lr=0.1, momentum=0.9)
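As a sketch, constructing each of these with the learning rates above (the `model` is a placeholder, and the AdamW weight decay value is an illustrative default):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model

adam = torch.optim.Adam(model.parameters(), lr=0.001)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```

AdamW differs from Adam by decoupling weight decay from the gradient update, which is why it pairs well with transformer training recipes.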
Learning Rate Scheduling
```python
# Reduce on plateau
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=5
)

# Cosine annealing
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs
)

# One-cycle policy
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, epochs=num_epochs, steps_per_epoch=len(train_loader)
)
```
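These schedulers are stepped at different points, which is a common source of bugs: `ReduceLROnPlateau.step()` takes the monitored metric and is called once per epoch, `CosineAnnealingLR.step()` is called once per epoch with no argument, and `OneCycleLR.step()` is called once per batch. A runnable sketch of the plateau case, using a placeholder model and a deliberately flat validation loss:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=5
)

# Step once per epoch WITH the monitored metric; a loss that never
# improves for more than `patience` epochs triggers a reduction.
for epoch in range(8):
    val_loss = 1.0  # pretend validation loss has plateaued
    scheduler.step(val_loss)

print(optimizer.param_groups[0]['lr'])  # reduced from 0.1 to ~0.01
```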
Data Loading
Custom Dataset
```python
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, data, targets, transform=None):
        self.data = data
        self.targets = targets
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        label = self.targets[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample, label

# Create data loaders
train_dataset = CustomDataset(train_data, train_labels, transform=train_transform)
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True
)
```
Data Augmentation
```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
```
Performance Optimization
Mixed Precision Training
```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, labels in train_loader:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()
    # Run the forward pass in mixed precision; backward stays outside autocast
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
Gradient Accumulation
```python
accumulation_steps = 4
optimizer.zero_grad()
for i, (inputs, labels) in enumerate(train_loader):
    inputs, labels = inputs.to(device), labels.to(device)
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels) / accumulation_steps
    scaler.scale(loss).backward()
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```
Gradient Clipping
```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
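When clipping is combined with mixed precision, gradients must be unscaled before clipping so the threshold applies to true-magnitude gradients. A runnable sketch with a placeholder model (the autocast/scaler calls are disabled automatically on CPU-only machines):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 1).to(device)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")

inputs = torch.randn(4, 8, device=device)
targets = torch.randn(4, 1, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device.type, enabled=device.type == "cuda"):
    loss = nn.functional.mse_loss(model(inputs), targets)
scaler.scale(loss).backward()

# Unscale first so clip_grad_norm_ sees true-magnitude gradients
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

scaler.step(optimizer)
scaler.update()
```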
Checkpointing
Save Checkpoint
```python
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pth')
```
Load Checkpoint
```python
# map_location lets a checkpoint saved on GPU load on any device
checkpoint = torch.load('checkpoint.pth', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
start_epoch = checkpoint['epoch'] + 1
```
Common Issues and Solutions
Out of Memory
- Reduce batch size
- Use gradient accumulation
- Enable gradient checkpointing (e.g. `model.gradient_checkpointing_enable()` on Hugging Face models, or `torch.utils.checkpoint` in plain PyTorch)
- Clear the CUDA cache: `torch.cuda.empty_cache()`
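The `torch.utils.checkpoint` route can be sketched as follows: a checkpointed segment discards its intermediate activations and recomputes them during backward, trading compute for memory (the `block` here is a placeholder module):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

x = torch.randn(8, 64, requires_grad=True)
# Activations inside `block` are not stored; they are recomputed on backward
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()  # gradients match the non-checkpointed forward
```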
Slow Training
- Use `pin_memory=True` in the DataLoader
- Increase `num_workers` in the DataLoader
- Enable mixed precision training
- Use multiple GPUs with `DataParallel` or `DistributedDataParallel`
Overfitting
- Add data augmentation
- Use dropout: `nn.Dropout(0.5)`
- Add L2 regularization via weight decay in the optimizer
- Early stopping based on validation loss
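The early-stopping item above is plain bookkeeping over validation losses; a sketch with made-up loss values:

```python
# Illustrative early stopping on a fake sequence of validation losses
val_losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.64, 0.67, 0.7, 0.71]

patience = 3
best_val_loss = float('inf')
bad_epochs = 0
stopped_at = None

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        bad_epochs = 0
        # a real loop would also save a checkpoint here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            stopped_at = epoch
            break

print(stopped_at)  # 5: three epochs in a row without improving on 0.6
```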
Best Practices Summary
- Always use `model.eval()` for inference and `model.train()` for training
- Use the `torch.no_grad()` context manager during inference
- Pin memory (`pin_memory=True`) for faster GPU transfer
- Use mixed precision training on modern GPUs
- Save checkpoints regularly with validation metrics
- Use learning rate schedulers instead of manual decay
- Normalize data using dataset statistics
- Set random seeds for reproducibility: `torch.manual_seed(42)` and `torch.cuda.manual_seed_all(42)`
Integration Points
- Vector Databases: Store trained embeddings
- Hugging Face: Load pretrained transformers
- MLflow: Track experiments and metrics
- SageMaker: Distributed training
- FastAPI: Model serving endpoints
Source
https://github.com/muhammederem/chief/blob/main/.claude/skills/ml-ai/pytorch/SKILL.md
How This Skill Works
PyTorch uses dynamic computation graphs that are built on-the-fly, enabling flexible model designs and easier debugging. It integrates seamless CUDA support for GPU acceleration and includes the Autograd system for automatic gradient computation.
When to Use It
- Prototype ideas quickly with dynamic graphs when model architectures are experimental.
- Train models on GPUs to speed up computation with CUDA integration.
- Fine-tune pretrained models via transfer learning to adapt to new tasks.
- Build common architectures like feedforward nets and CNNs using nn modules.
- Iterate with a clear training loop using a reusable template.
Quick Start
- Step 1: Install PyTorch and set up CUDA if available.
- Step 2: Define a model using nn.Module (for example a simple NeuralNetwork).
- Step 3: Write a training loop with a loss function, optimizer, and device management.
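The three steps above in one runnable sketch, using random tensors as a stand-in for a real dataset (sizes and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

# Step 1: pick a device (CUDA if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Step 2: define a model
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3)).to(device)

# Step 3: loss, optimizer, and a tiny training loop on random data
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

inputs = torch.randn(128, 20, device=device)
labels = torch.randint(0, 3, (128,), device=device)

for step in range(50):
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()

print(loss.item())  # decreases as the model fits the fixed random batch
```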
Best Practices
- Leverage dynamic graphs to debug and iterate model designs efficiently.
- Move data and models to the correct device (CPU/GPU) for optimal performance.
- Use the Autograd system to compute gradients and call backward() or autograd.grad.
- Define models with nn.Module and explicit forward methods for readability.
- Adopt a structured training loop and save the best model during training.
Example Use Cases
- Implement a simple feedforward neural network using nn.Module and a forward pass.
- Create a CNN with Conv2d, ReLU, pooling, and fully connected layers for image tasks.
- Fine-tune a pretrained ResNet50 by freezing early layers and replacing the final layer.
- Use a training loop template to train and validate a model with proper device handling.
- Switch between CPU and CUDA contexts by moving tensors and models to the target device.