PyTorch Deep Learning Framework
Overview
PyTorch is an open-source machine learning library developed by Facebook's AI Research lab. It provides tensor computation with GPU acceleration and deep neural networks built on a tape-based automatic differentiation system.
Key Features
Dynamic Computation Graphs
PyTorch uses dynamic computational graphs that are built on-the-fly, making debugging easier and enabling more flexible model architectures.
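A minimal illustration of this: because the graph is built as operations execute, data-dependent branching is just ordinary Python control flow, and autograd records whichever path was actually taken (the values here are made up for illustration):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# The graph is constructed on-the-fly, so a plain Python `if`
# decides which operations enter the graph at runtime.
if x.item() > 0:
    y = x * 3
else:
    y = x - 1

y.backward()
print(x.grad)  # gradient of the branch actually taken: tensor(3.)
```

Standard debuggers and print statements work inside the forward pass for the same reason.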
GPU Acceleration
Seamless CUDA integration for GPU-accelerated computing:
```python
import torch

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor = torch.randn(1000, 1000).to(device)
```
Automatic Differentiation
Autograd system for automatic computation of gradients:
```python
x = torch.randn(3, requires_grad=True)
y = x * 2
while y.norm() < 1000:
    y = y * 2
# y is non-scalar, so grad() needs grad_outputs (the vector in the
# vector-Jacobian product); a scalar loss would not need this
gradients = torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y))
```
Model Design Patterns
Basic Model Structure
```python
import torch.nn as nn
import torch.nn.functional as F

class NeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out
```
Convolutional Neural Networks
```python
class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        # 64 * 8 * 8 assumes 32x32 inputs (e.g. CIFAR-10), halved twice by pooling
        self.fc1 = nn.Linear(64 * 8 * 8, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * 8 * 8)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
```
Transfer Learning
```python
import torchvision.models as models

# Load pretrained model (torchvision >= 0.13 uses the weights API
# in place of the deprecated pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze early layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer (num_classes = number of classes in the new task)
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, num_classes)
```
Training Best Practices
Training Loop Template
```python
def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs, device):
    model = model.to(device)
    best_val_loss = float('inf')
    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        # Validation phase
        model.eval()
        val_loss = 0.0
        correct = 0
        total = 0
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                val_loss += loss.item()
                _, predicted = outputs.max(1)
                total += labels.size(0)
                correct += predicted.eq(labels).sum().item()

        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best_model.pth')

        print(f'Epoch {epoch+1}/{num_epochs}')
        print(f'Train Loss: {train_loss/len(train_loader):.4f}')
        print(f'Val Loss: {val_loss/len(val_loader):.4f}')
        print(f'Val Acc: {100.*correct/total:.2f}%')
    return model
```
Optimizer Choice
- Adam: Default choice for most tasks (lr=0.001)
- AdamW: Better for transformers (lr=1e-4)
- SGD with Momentum: Better generalization (lr=0.1, momentum=0.9)
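As a sketch, constructing each of these with the learning rates above (the `model` is a placeholder, and the AdamW weight decay value is an illustrative default):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model

adam = torch.optim.Adam(model.parameters(), lr=0.001)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```

AdamW differs from Adam by decoupling weight decay from the gradient update, which is why it pairs well with transformer training recipes.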
Learning Rate Scheduling
```python
# Reduce on plateau
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=5
)

# Cosine annealing
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs
)

# One-cycle policy
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, epochs=num_epochs, steps_per_epoch=len(train_loader)
)
```
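These schedulers are stepped at different points, which is a common source of bugs: `ReduceLROnPlateau.step()` takes the monitored metric and is called once per epoch, `CosineAnnealingLR.step()` is called once per epoch with no argument, and `OneCycleLR.step()` is called once per batch. A runnable sketch of the plateau case, using a placeholder model and a deliberately flat validation loss:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=5
)

# Step once per epoch WITH the monitored metric; a loss that never
# improves for more than `patience` epochs triggers a reduction.
for epoch in range(8):
    val_loss = 1.0  # pretend validation loss has plateaued
    scheduler.step(val_loss)

print(optimizer.param_groups[0]['lr'])  # reduced from 0.1 to ~0.01
```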
Data Loading
Custom Dataset
```python
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, data, targets, transform=None):
        self.data = data
        self.targets = targets
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        label = self.targets[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample, label

# Create data loaders
train_dataset = CustomDataset(train_data, train_labels, transform=train_transform)
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True
)
```
Data Augmentation
```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
```
Performance Optimization
Mixed Precision Training
```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for inputs, labels in train_loader:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()
    # Run the forward pass in mixed precision; backward stays outside autocast
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
Gradient Accumulation
```python
accumulation_steps = 4
optimizer.zero_grad()
for i, (inputs, labels) in enumerate(train_loader):
    inputs, labels = inputs.to(device), labels.to(device)
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels) / accumulation_steps
    scaler.scale(loss).backward()
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```
Gradient Clipping
```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
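When clipping is combined with mixed precision, gradients must be unscaled before clipping so the threshold applies to true-magnitude gradients. A runnable sketch with a placeholder model (the autocast/scaler calls are disabled automatically on CPU-only machines):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 1).to(device)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")

inputs = torch.randn(4, 8, device=device)
targets = torch.randn(4, 1, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device.type, enabled=device.type == "cuda"):
    loss = nn.functional.mse_loss(model(inputs), targets)
scaler.scale(loss).backward()

# Unscale first so clip_grad_norm_ sees true-magnitude gradients
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

scaler.step(optimizer)
scaler.update()
```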
Checkpointing
Save Checkpoint
```python
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pth')
```
Load Checkpoint
```python
# map_location lets a checkpoint saved on GPU load on any device
checkpoint = torch.load('checkpoint.pth', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
start_epoch = checkpoint['epoch'] + 1
```
Common Issues and Solutions
Out of Memory
- Reduce batch size
- Use gradient accumulation
- Enable gradient checkpointing (e.g. `model.gradient_checkpointing_enable()` on Hugging Face models, or `torch.utils.checkpoint` in plain PyTorch)
- Clear the CUDA cache: `torch.cuda.empty_cache()`
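The `torch.utils.checkpoint` route can be sketched as follows: a checkpointed segment discards its intermediate activations and recomputes them during backward, trading compute for memory (the `block` here is a placeholder module):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

x = torch.randn(8, 64, requires_grad=True)
# Activations inside `block` are not stored; they are recomputed on backward
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()  # gradients match the non-checkpointed forward
```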
Slow Training
- Use `pin_memory=True` in the DataLoader
- Increase `num_workers` in the DataLoader
- Enable mixed precision training
- Use multiple GPUs with `DataParallel` or `DistributedDataParallel`
Overfitting
- Add data augmentation
- Use dropout: `nn.Dropout(0.5)`
- Add L2 regularization via weight decay in the optimizer
- Early stopping based on validation loss
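The early-stopping item above is plain bookkeeping over validation losses; a sketch with made-up loss values:

```python
# Illustrative early stopping on a fake sequence of validation losses
val_losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.64, 0.67, 0.7, 0.71]

patience = 3
best_val_loss = float('inf')
bad_epochs = 0
stopped_at = None

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        bad_epochs = 0
        # a real loop would also save a checkpoint here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            stopped_at = epoch
            break

print(stopped_at)  # 5: three epochs in a row without improving on 0.6
```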
Best Practices Summary
- Always use `model.eval()` for inference and `model.train()` for training
- Use the `torch.no_grad()` context manager during inference
- Pin memory (`pin_memory=True`) for faster GPU transfer
- Use mixed precision training on modern GPUs
- Save checkpoints regularly with validation metrics
- Use learning rate schedulers instead of manual decay
- Normalize data using dataset statistics
- Set random seeds for reproducibility: `torch.manual_seed(42)` and `torch.cuda.manual_seed_all(42)`
Integration Points
- Vector Databases: Store trained embeddings
- Hugging Face: Load pretrained transformers
- MLflow: Track experiments and metrics
- SageMaker: Distributed training
- FastAPI: Model serving endpoints
Source
https://github.com/muhammederem/chief/blob/main/.claude/skills/ml-ai/pytorch/SKILL.md
How This Skill Works
PyTorch uses dynamic computation graphs that are built on-the-fly, enabling flexible model designs and easier debugging. It integrates seamless CUDA support for GPU acceleration and includes the Autograd system for automatic gradient computation.
When to Use It
- Prototype ideas quickly with dynamic graphs when model architectures are experimental.
- Train models on GPUs to speed up computation with CUDA integration.
- Fine-tune pretrained models via transfer learning to adapt to new tasks.
- Build common architectures like feedforward nets and CNNs using nn modules.
- Iterate with a clear training loop using a reusable template.
Quick Start
- Step 1: Install PyTorch and set up CUDA if available.
- Step 2: Define a model using nn.Module (for example a simple NeuralNetwork).
- Step 3: Write a training loop with a loss function, optimizer, and device management.
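The three steps above in one runnable sketch, using random tensors as a stand-in for a real dataset (sizes and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

# Step 1: pick a device (CUDA if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Step 2: define a model
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3)).to(device)

# Step 3: loss, optimizer, and a tiny training loop on random data
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

inputs = torch.randn(128, 20, device=device)
labels = torch.randint(0, 3, (128,), device=device)

for step in range(50):
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()

print(loss.item())  # decreases as the model fits the fixed random batch
```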
Best Practices
- Leverage dynamic graphs to debug and iterate model designs efficiently.
- Move data and models to the correct device (CPU/GPU) for optimal performance.
- Use the Autograd system to compute gradients and call backward() or autograd.grad.
- Define models with nn.Module and explicit forward methods for readability.
- Adopt a structured training loop and save the best model during training.
Example Use Cases
- Implement a simple feedforward neural network using nn.Module and a forward pass.
- Create a CNN with Conv2d, ReLU, pooling, and fully connected layers for image tasks.
- Fine-tune a pretrained ResNet50 by freezing early layers and replacing the final layer.
- Use a training loop template to train and validate a model with proper device handling.
- Switch between CPU and CUDA contexts by moving tensors and models to the target device.