pytorch-fsdp2
Skill: Use PyTorch FSDP2 (fully_shard) correctly in a training script
This skill teaches a coding agent how to add PyTorch FSDP2 to a training loop with correct initialization, sharding, mixed precision/offload configuration, and checkpointing.
FSDP2 in PyTorch is exposed primarily via `torch.distributed.fsdp.fully_shard` and the `FSDPModule` methods it adds in place to modules. See: `references/pytorch_fully_shard_api.md`, `references/pytorch_fsdp2_tutorial.md`.
When to use this skill
Use FSDP2 when:
- Your model doesn’t fit on one GPU (parameters + gradients + optimizer state).
- You want eager-mode, DTensor-based per-parameter sharding, which is more inspectable and yields simpler sharded state dicts than FSDP1's flat-parameter approach.
- You may later compose DP with Tensor Parallel using DeviceMesh.
Avoid (or be careful) if:
- You need strict backwards-compatible checkpoints across PyTorch versions (DCP warns against this).
- You’re forced onto older PyTorch versions without the FSDP2 stack.
Alternatives (when FSDP2 is not the best fit)
- DistributedDataParallel (DDP): Use the standard data-parallel wrapper when you want classic distributed data parallel training.
- FullyShardedDataParallel (FSDP1): Use the original FSDP wrapper for parameter sharding across data-parallel workers.
Reference: references/pytorch_ddp_notes.md, references/pytorch_fsdp1_api.md.
Contract the agent must follow
- Launch with `torchrun` and set the CUDA device per process (usually via `LOCAL_RANK`).
- Apply `fully_shard()` bottom-up, i.e., shard submodules (e.g., Transformer blocks) before the root module.
- Call `model(input)`, not `model.forward(input)`, so the FSDP2 hooks run (unless you explicitly `unshard()` or register the forward method).
- Create the optimizer after sharding and make sure it is built on the DTensor parameters (post-`fully_shard`).
- Checkpoint using Distributed Checkpoint (DCP) or the distributed-state-dict helpers, not naïve `torch.save(model.state_dict())`, unless you deliberately gather to full tensors.
(Each of these rules is directly described in the official API docs/tutorial; see references.)
Step-by-step procedure
0) Version & environment sanity
- Prefer a recent stable PyTorch release whose docs cover FSDP2 (`fully_shard`) and up-to-date DCP.
- Use `torchrun --nproc_per_node <gpus_per_node> ...` and ensure `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` are visible.
Reference: references/pytorch_fsdp2_tutorial.md (launch commands and setup), references/pytorch_fully_shard_api.md (user contract).
1) Initialize distributed and set device
Minimal, correct pattern:
- `dist.init_process_group(backend="nccl")`
- `torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))`
- Optionally create a `DeviceMesh` to describe the data-parallel group(s).
Reference: references/pytorch_device_mesh_tutorial.md (why DeviceMesh exists & how it manages process groups).
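The pattern above can be sketched as follows; this assumes a `torchrun` launch with NCCL-capable GPUs, and the `local_rank` / `init_distributed` helper names are ours, not a PyTorch API:

```python
import os

import torch
import torch.distributed as dist


def local_rank() -> int:
    # torchrun exports LOCAL_RANK per process; default to 0 for
    # single-process (non-torchrun) runs.
    return int(os.environ.get("LOCAL_RANK", "0"))


def init_distributed() -> None:
    # NCCL backend for GPU collectives; torchrun supplies
    # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT in the environment.
    dist.init_process_group(backend="nccl")
    # Pin this process to its GPU before any CUDA allocations happen.
    torch.cuda.set_device(local_rank())

# Call init_distributed() once at the top of your training script.
```

Setting the device before any allocation matters: otherwise all ranks may default to GPU 0.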
2) Build model on meta device (recommended for very large models)
For big models, initialize on meta, apply sharding, then materialize weights on GPU:
- `with torch.device("meta"): model = ...`
- apply `fully_shard(...)` on submodules, then `fully_shard(model)`
- `model.to_empty(device="cuda")`
- `model.reset_parameters()` (or your init routine)
Reference: references/pytorch_fsdp2_tutorial.md (migration guide shows this flow explicitly).
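A sketch of the meta-device step, with a hypothetical `Block` class standing in for your transformer layer; the `fully_shard` and materialization calls are left as comments because they need an initialized process group and a GPU:

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Hypothetical stand-in for a transformer block."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


def build_model_meta(dim: int = 16, n_layers: int = 2) -> nn.Module:
    # Meta tensors carry shape/dtype but no storage, so construction is
    # essentially free even for models that would not fit in host memory.
    with torch.device("meta"):
        model = nn.Sequential(*[Block(dim) for _ in range(n_layers)])
    return model

# In the real flow you would now apply fully_shard(...) bottom-up, then:
#   model.to_empty(device="cuda")   # allocate (sharded) storage on GPU
#   ...and run reset_parameters() or your own init routine per module.
```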
3) Apply fully_shard() bottom-up (wrapping policy = “apply where needed”)
Do not only call fully_shard on the topmost module.
Recommended sharding pattern for transformer-like models:
- iterate modules: `if isinstance(m, TransformerBlock): fully_shard(m, ...)`
- then `fully_shard(model, ...)`
Why: `fully_shard` forms "parameter groups" for collective efficiency and excludes params already grouped by earlier calls. Bottom-up application gives better overlap and lower peak memory.
Reference: references/pytorch_fully_shard_api.md (bottom-up requirement and why).
4) Configure reshard_after_forward for memory/perf trade-offs
Default behavior:
`None` means `True` for non-root modules and `False` for the root module (a good default).
Heuristics:
- If you're memory-bound: keep defaults or force `True` on many blocks.
- If you're throughput-bound and can afford the memory: consider keeping unsharded params longer (root often `False`).
- Advanced: use an `int` to reshard to a smaller mesh after forward (e.g., intra-node) if it is a meaningful divisor.
Reference: references/pytorch_fully_shard_api.md (full semantics).
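The default resolution rule above can be made concrete with a tiny helper; `effective_reshard` is our illustrative name, not a PyTorch function:

```python
from typing import Union


def effective_reshard(value: Union[None, bool, int], is_root: bool) -> Union[bool, int]:
    """Resolve reshard_after_forward per the documented default:
    None means True for non-root modules and False for the root."""
    if value is None:
        return not is_root
    # An explicit bool, or an int (reshard to a smaller mesh after
    # forward), passes through unchanged.
    return value
```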
5) Mixed precision & offload (optional but common)
FSDP2 uses:
- `mp_policy=MixedPrecisionPolicy(param_dtype=..., reduce_dtype=..., output_dtype=..., cast_forward_inputs=...)`
- `offload_policy=CPUOffloadPolicy()` if you want CPU offload
Rules of thumb:
- Start with BF16 parameters/reductions on H100/A100-class GPUs (if numerically stable for your model).
- Keep `reduce_dtype` aligned with your gradient reduction expectations.
- If you use CPU offload, budget for PCIe/NVLink traffic and runtime overhead.
Reference: references/pytorch_fully_shard_api.md (MixedPrecisionPolicy / OffloadPolicy classes).
6) Optimizer, gradient clipping, accumulation
- Create the optimizer after sharding so it holds DTensor params.
- If you need gradient accumulation / `no_sync`-style behavior, use the FSDP2 mechanism (`set_requires_gradient_sync`) instead of FSDP1's `no_sync()`.
Gradient clipping:
- Use the approach shown in the FSDP2 tutorial (“Gradient Clipping and Optimizer with DTensor”), because parameters/gradients are DTensors.
Reference: references/pytorch_fsdp2_tutorial.md.
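A sketch of an accumulation-aware step, assuming `model` is an FSDP2-sharded module (so it exposes `set_requires_gradient_sync`); the `train_step` name and the placeholder loss are ours:

```python
import torch


def train_step(model, optimizer, micro_batches, max_norm: float = 1.0):
    last = len(micro_batches) - 1
    for i, batch in enumerate(micro_batches):
        # FSDP2 replacement for FSDP1's no_sync(): skip the gradient
        # reduce-scatter on every micro-batch except the last one.
        model.set_requires_gradient_sync(i == last)
        loss = model(batch).mean()  # placeholder loss for this sketch
        loss.backward()
    # Per the FSDP2 tutorial, clip_grad_norm_ handles DTensor gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    optimizer.zero_grad()
```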
7) Checkpointing: prefer DCP or distributed state dict helpers
Two recommended approaches:
A) Distributed Checkpoint (DCP) — best default
- DCP saves/loads from multiple ranks in parallel and supports load-time resharding.
- DCP produces multiple files (often at least one per rank) and operates “in place”.
B) Distributed state dict helpers
- `get_model_state_dict` / `set_model_state_dict` with `StateDictOptions(full_state_dict=True, cpu_offload=True, broadcast_from_rank0=True, ...)`
- For the optimizer: `get_optimizer_state_dict` / `set_optimizer_state_dict`
Avoid:
- Saving DTensor state dicts with plain `torch.save`, unless you intentionally convert with `DTensor.full_tensor()` and manage memory carefully.
References:
- `references/pytorch_dcp_overview.md` (DCP behavior and caveats)
- `references/pytorch_dcp_recipe.md` and `references/pytorch_dcp_async_recipe.md` (end-to-end usage)
- `references/pytorch_fsdp2_tutorial.md` (DTensor vs DCP state-dict flows)
- `references/pytorch_examples_fsdp2.md` (working checkpoint scripts)
Workflow checklists (copy-paste friendly)
Workflow A: Retrofit FSDP2 into an existing training script
- Launch with `torchrun` and initialize the process group.
- Set the CUDA device from `LOCAL_RANK`; create a `DeviceMesh` if you need multi-dim parallelism.
- Build the model (use `meta` if needed), apply `fully_shard` bottom-up, then `fully_shard(model)`.
- Create the optimizer after sharding so it captures DTensor parameters.
- Use `model(inputs)` so hooks run; use `set_requires_gradient_sync` for accumulation.
- Add DCP save/load via `torch.distributed.checkpoint` helpers.
Reference: references/pytorch_fsdp2_tutorial.md, references/pytorch_fully_shard_api.md, references/pytorch_device_mesh_tutorial.md, references/pytorch_dcp_recipe.md.
Workflow B: Add DCP save/load (minimal pattern)
- Wrap state in `Stateful` or assemble state via `get_state_dict`.
- Call `dcp.save(...)` from all ranks to a shared path.
- Call `dcp.load(...)` and restore with `set_state_dict`.
- Validate any resharding assumptions when loading into a different mesh.
Reference: references/pytorch_dcp_recipe.md.
Debug checklist (what the agent should check first)
- All ranks on distinct GPUs? If not, verify `torch.cuda.set_device(LOCAL_RANK)` and your `torchrun` flags.
- Did you accidentally call `forward()` directly? Use `model(input)` or explicitly `unshard()` / register the forward method.
- Is `fully_shard()` applied bottom-up? If only the root is sharded, expect worse memory/perf and possible confusion.
- Optimizer created at the right time? It must be built on DTensor parameters after sharding.
- Checkpointing path consistent?
  - If using DCP, don't mix with ad-hoc `torch.save` unless you understand the conversions.
  - Be mindful of PyTorch-version compatibility warnings for DCP.
Common issues and fixes
- Forward hooks not running → call `model(inputs)` (or `unshard()` explicitly) instead of `model.forward(...)`.
- Optimizer sees non-DTensor params → create the optimizer after all `fully_shard` calls.
- Only root module sharded → apply `fully_shard` bottom-up on submodules before the root.
- Memory spikes after forward → set `reshard_after_forward=True` for more modules.
- Gradient accumulation desync → use `set_requires_gradient_sync` instead of FSDP1's `no_sync()`.
Reference: references/pytorch_fully_shard_api.md, references/pytorch_fsdp2_tutorial.md.
Minimal reference implementation outline (agent-friendly)
The coding agent should implement a script with these labeled blocks:
- `init_distributed()`: init process group, set device
- `build_model_meta()`: model on meta, apply `fully_shard`, materialize weights
- `build_optimizer()`: optimizer created after sharding
- `train_step()`: forward/backward/step with `model(inputs)` and DTensor-aware patterns
- `checkpoint_save/load()`: DCP or distributed state dict helpers
Concrete examples live in references/pytorch_examples_fsdp2.md and the official tutorial reference.
References
- `references/pytorch_fsdp2_tutorial.md`
- `references/pytorch_fully_shard_api.md`
- `references/pytorch_ddp_notes.md`
- `references/pytorch_fsdp1_api.md`
- `references/pytorch_device_mesh_tutorial.md`
- `references/pytorch_tp_tutorial.md`
- `references/pytorch_dcp_overview.md`
- `references/pytorch_dcp_recipe.md`
- `references/pytorch_dcp_async_recipe.md`
- `references/pytorch_examples_fsdp2.md`
- `references/torchtitan_fsdp_notes.md` (optional, production notes)
- `references/ray_train_fsdp2_example.md` (optional, integration example)
Source
https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/08-distributed-training/pytorch-fsdp2/SKILL.md
Overview
This skill explains how to integrate PyTorch FSDP2 (`fully_shard`) into a training script, including correct initialization, per-parameter sharding, mixed precision/offload configuration, and distributed checkpointing. It targets models that exceed single-GPU memory and enables DTensor-based sharding with DeviceMesh for scalable training.
How This Skill Works
FSDP2 uses fully_shard() to shard submodules bottom-up before the root module and relies on DTensor-capable parameter handling. Training is launched with torchrun, the optimizer is created after sharding on DTensor parameters, and checkpoints are saved with Distributed Checkpoint (DCP) rather than raw state_dict saves.
When to Use It
- Your model doesn’t fit on a single GPU (parameters, gradients, and optimizer state exceed memory).
- You want DTensor-based per-parameter sharding for clearer, sharded state dicts compared to FSDP1.
- You plan to compose Data Parallel with Tensor Parallel using DeviceMesh.
- You need distributed checkpointing (DCP) to resume training across nodes or after failures.
- You require mixed precision with offload to manage memory footprint.
Quick Start
- Step 1: Launch the job with torchrun --nproc_per_node <gpus_per_node> and ensure RANK/WORLD_SIZE/LOCAL_RANK are in the environment.
- Step 2: Initialize distributed and set the CUDA device; optionally create a DeviceMesh to describe the data-parallel groups.
- Step 3: Build the model on meta device, apply fully_shard() to submodules (then fully_shard(model)), materialize on CUDA, create the optimizer after sharding, and enable DCP-based checkpointing.
Best Practices
- Launch with torchrun and ensure LOCAL_RANK, RANK, and WORLD_SIZE are visible in the environment.
- Apply fully_shard() bottom-up to submodules before the root module.
- Call model(input) (not model.forward(input)) so FSDP2 hooks run unless unsharded or forward overridden.
- Create the optimizer after sharding so it’s built on DTensor parameters post-shard.
- Checkpoint with Distributed Checkpoint (DCP) or distributed-state-dict helpers instead of torch.save(state_dict) unless you need full tensors.
Example Use Cases
- Train a 2B-parameter Transformer model across 16 GPUs with DTensor per-parameter sharding and a DeviceMesh.
- Fine-tune a large Vision Transformer across 8 GPUs using FSDP2 with mixed precision and model offload.
- Scale a language model across 32 GPUs, leveraging DTensor sharding, block-wise sharding, and DCP for robust checkpoints.
- Combine DP with Tensor Parallel on a multi-branch model using DeviceMesh to distribute workload across 24 GPUs.
- Resume training from distributed checkpoints after a node outage using DCP across multiple nodes.
Related Skills (all in Orchestra-Research/AI-Research-SKILLs)
- tensorboard: Visualize training metrics, debug models with histograms, compare experiments, visualize model graphs, and profile performance with TensorBoard, Google's ML visualization toolkit.
- huggingface-accelerate: Simplest distributed training API; four lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP, automatic device placement, mixed precision (FP16/BF16/FP8), interactive config, single launch command. HuggingFace ecosystem standard.
- deepspeed: Expert guidance for distributed training with DeepSpeed: ZeRO optimization stages, pipeline parallelism, FP16/BF16/FP8, 1-bit Adam, sparse attention.
- optimizing-attention-flash: Optimizes transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Use when training/running transformers with long sequences (>512 tokens), hitting GPU memory limits in attention, or needing faster inference. Supports PyTorch native SDPA, the flash-attn library, H100 FP8, and sliding-window attention.
- ray-train: Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes, with built-in hyperparameter tuning via Ray Tune, fault tolerance, and elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.
- ray-data: Scalable data processing for ML workloads. Streaming execution across CPU/GPU; supports Parquet/CSV/JSON/images; integrates with Ray Train, PyTorch, and TensorFlow; scales from a single machine to 100s of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.