What is HuggingFace Accelerate?

A minimal API to add distributed support to PyTorch scripts with a unified interface for DeepSpeed, FSDP, Megatron and DDP, plus automatic device placement and mixed-precision options.

How do I enable mixed precision?

Create an Accelerator with mixed_precision='fp16', 'bf16', or 'fp8' (and optionally use accelerator.autocast()). Then wrap your components with accelerator.prepare and use accelerator.backward for gradients.

How do I run on multiple GPUs or machines?

Launch your training with accelerate launch train.py and use the appropriate flags (e.g., --multi_gpu, --num_processes, --num_machines, etc.). For larger setups, supply a deepspeed_plugin for ZeRO-2 as needed.

huggingface-accelerate


npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/accelerate --openclaw


HuggingFace Accelerate - Unified Distributed Training

Quick start

Accelerate adds distributed training to an existing PyTorch script with about 4 changed lines of code.

Installation:

pip install accelerate

Convert PyTorch script (4 lines):

import torch
+ from accelerate import Accelerator

+ accelerator = Accelerator()

  model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())
  dataloader = torch.utils.data.DataLoader(dataset)

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  for batch in dataloader:
      optimizer.zero_grad()
      loss = model(batch)
-     loss.backward()
+     accelerator.backward(loss)
      optimizer.step()

Run (single command):

accelerate launch train.py

Common workflows

Workflow 1: From single GPU to multi-GPU

Original script:

# train.py
import torch

model = torch.nn.Linear(10, 2).to('cuda')
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

for epoch in range(10):
    for batch in dataloader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch).mean()
        loss.backward()
        optimizer.step()

With Accelerate (4 lines added):

# train.py
import torch
from accelerate import Accelerator  # +1

accelerator = Accelerator()  # +2

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # +3

for epoch in range(10):
    for batch in dataloader:
        # No .to('cuda') needed - automatic!
        optimizer.zero_grad()
        loss = model(batch).mean()
        accelerator.backward(loss)  # +4
        optimizer.step()

Configure (interactive):

accelerate config

Questions:

Which machine? (single/multi GPU/TPU/CPU)
How many machines? (1)
Mixed precision? (no/fp16/bf16/fp8)
DeepSpeed? (no/yes)

Launch (works on any setup):

# Single GPU
accelerate launch train.py

# Multi-GPU (8 GPUs)
accelerate launch --multi_gpu --num_processes 8 train.py

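# Multi-node (2 machines, 16 processes in total; run on every machine,
# changing --machine_rank to 0, 1, ... for each one)
accelerate launch --multi_gpu --num_processes 16 \
  --num_machines 2 --machine_rank 0 \
  --main_process_ip $MASTER_ADDR \
  train.py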
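
For multi-node jobs the same command is issued on each machine with only `--machine_rank` changing. A small sketch of generating those commands is shown here. It uses only the flags that appear above; the helper name `build_launch_command` and the IP address are hypothetical.

```python
import shlex


def build_launch_command(script, num_processes, num_machines=1,
                         machine_rank=0, main_process_ip=None):
    """Return the argv list for `accelerate launch` on one machine."""
    cmd = ["accelerate", "launch"]
    if num_processes > 1:
        # --num_processes is the total across all machines.
        cmd += ["--multi_gpu", "--num_processes", str(num_processes)]
    if num_machines > 1:
        if main_process_ip is None:
            raise ValueError("multi-node launches need main_process_ip")
        cmd += ["--num_machines", str(num_machines),
                "--machine_rank", str(machine_rank),
                "--main_process_ip", main_process_ip]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Print the command for each of two machines (placeholder IP address).
    for rank in range(2):
        print(shlex.join(build_launch_command("train.py", 16, 2, rank, "10.0.0.1")))
```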

Workflow 2: Mixed precision training

Enable FP16/BF16:

from accelerate import Accelerator

# Pick ONE mode; create a single Accelerator per process.

# FP16 (with gradient scaling)
accelerator = Accelerator(mixed_precision='fp16')

# BF16 (no scaling, more stable)
# accelerator = Accelerator(mixed_precision='bf16')

# FP8 (H100+)
# accelerator = Accelerator(mixed_precision='fp8')

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Everything else is automatic!
for batch in dataloader:
    with accelerator.autocast():  # Optional, done automatically
        loss = model(batch)
    accelerator.backward(loss)
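
Why FP16 needs gradient scaling while BF16 does not can be seen with plain arithmetic: FP16 has a narrow exponent range, so very small gradients round to zero. The sketch below uses only the standard library's half-precision packing to illustrate the idea. It is not how Accelerate implements scaling, and the scale factor of 2**16 is an assumed example value.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def scaled_round_trip(grad: float, scale: float = 2.0 ** 16) -> float:
    """Scale before the FP16 cast, then unscale in higher precision."""
    return to_fp16(grad * scale) / scale


if __name__ == "__main__":
    tiny_grad = 1e-8
    print("unscaled FP16 value:", to_fp16(tiny_grad))        # underflows to 0.0
    print("scaled then unscaled:", scaled_round_trip(tiny_grad))
```

Multiplying the loss by a scale factor shifts gradients into FP16's representable range, and dividing afterwards restores their magnitude.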

Workflow 3: DeepSpeed ZeRO integration

Enable DeepSpeed ZeRO-2:

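from accelerate import Accelerator, DeepSpeedPlugin

# deepspeed_plugin expects a DeepSpeedPlugin object, not a plain dict
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,  # ZeRO-2
    offload_optimizer_device="none",  # keep optimizer state on the GPU
    gradient_accumulation_steps=4,
)

accelerator = Accelerator(
    mixed_precision='bf16',
    deepspeed_plugin=deepspeed_plugin,
)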

# Same code as before!
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Or via config:

accelerate config
# Select: DeepSpeed → ZeRO-2

deepspeed_config.json:

{
    "fp16": {"enabled": false},
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
        "allgather_bucket_size": 5e8,
        "reduce_bucket_size": 5e8
    }
}

Launch:

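# --config_file expects an Accelerate YAML config; pass the DeepSpeed JSON separately
accelerate launch --use_deepspeed --deepspeed_config_file deepspeed_config.json train.py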
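
If the DeepSpeed JSON needs to vary between runs, it can be generated rather than edited by hand. The following is a minimal sketch that uses only the keys shown in the file above; the helper name `make_zero_config` and its parameters are hypothetical, and a real job may need further DeepSpeed settings.

```python
import json


def make_zero_config(stage=2, use_bf16=True, offload_optimizer_to_cpu=True):
    """Build a DeepSpeed-style config dict using only the keys shown above."""
    zero = {
        "stage": stage,
        "allgather_bucket_size": 5e8,
        "reduce_bucket_size": 5e8,
    }
    if offload_optimizer_to_cpu:
        zero["offload_optimizer"] = {"device": "cpu"}
    return {
        # Exactly one of fp16/bf16 is enabled.
        "fp16": {"enabled": not use_bf16},
        "bf16": {"enabled": use_bf16},
        "zero_optimization": zero,
    }


if __name__ == "__main__":
    # Print the JSON; redirect to deepspeed_config.json to save it.
    print(json.dumps(make_zero_config(), indent=4))
```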

Workflow 4: FSDP (Fully Sharded Data Parallel)

Enable FSDP:

from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy="FULL_SHARD",  # ZeRO-3 equivalent
    auto_wrap_policy="transformer_based_wrap",  # or "size_based_wrap" / "no_wrap"
    cpu_offload=False
)

accelerator = Accelerator(
    mixed_precision='bf16',
    fsdp_plugin=fsdp_plugin
)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Or via config:

accelerate config
# Select: FSDP → Full Shard → No CPU Offload
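
To get a feel for when full sharding is worth it compared with ZeRO-2 or plain DDP, a back-of-the-envelope estimate helps. The sketch below assumes the accounting used in the ZeRO paper for mixed-precision Adam (2 bytes per parameter for weights, 2 for gradients, 12 for optimizer state) and ignores activations, buffers, and fragmentation, so treat the numbers as rough lower bounds. The function name is hypothetical.

```python
def model_state_gb(num_params, num_gpus, strategy):
    """Approximate per-GPU memory (GB) for weights, gradients, and Adam state."""
    weights = 2 * num_params   # half-precision weights
    grads = 2 * num_params     # half-precision gradients
    optim = 12 * num_params    # FP32 master weights, momentum, variance
    if strategy == "ddp":            # everything replicated on each GPU
        total = weights + grads + optim
    elif strategy == "zero2":        # gradients and optimizer state sharded
        total = weights + (grads + optim) / num_gpus
    elif strategy == "full_shard":   # ZeRO-3 / FSDP FULL_SHARD: all sharded
        total = (weights + grads + optim) / num_gpus
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return total / 1e9


if __name__ == "__main__":
    for name in ("ddp", "zero2", "full_shard"):
        print(f"{name:>10}: {model_state_gb(7e9, 8, name):.2f} GB per GPU")
```

For a 7B-parameter model on 8 GPUs this estimate gives 112 GB per GPU for DDP, 26.25 GB for ZeRO-2, and 14 GB for full sharding.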

Workflow 5: Gradient accumulation

Accumulate gradients:

from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):  # Handles accumulation
        loss = model(batch)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()  # after step(), so accumulated gradients are kept

Effective batch size: per-device batch_size * num_gpus * gradient_accumulation_steps
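
The formula above can be turned into a quick calculation, including the inverse question of how many accumulation steps are needed to reach a target batch size. This is a plain-arithmetic sketch; the function names are hypothetical and `batch_size` is taken to be the per-device DataLoader batch size.

```python
import math


def effective_batch_size(per_device_batch_size, num_gpus, gradient_accumulation_steps=1):
    """Samples contributing to each optimizer update."""
    return per_device_batch_size * num_gpus * gradient_accumulation_steps


def accumulation_steps_for(target_batch_size, per_device_batch_size, num_gpus):
    """Smallest number of accumulation steps that reaches the target."""
    return math.ceil(target_batch_size / (per_device_batch_size * num_gpus))


if __name__ == "__main__":
    print(effective_batch_size(32, 8, 4))       # 32 * 8 * 4 = 1024
    print(accumulation_steps_for(1024, 32, 8))  # 1024 / 256 = 4
```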

When to use vs alternatives

Use Accelerate when:

Want simplest distributed training
Need single script for any hardware
Use HuggingFace ecosystem
Want flexibility (DDP/DeepSpeed/FSDP/Megatron)
Need quick prototyping

Key advantages:

4 lines: Minimal code changes
Unified API: Same code for DDP, DeepSpeed, FSDP, Megatron
Automatic: Device placement, mixed precision, sharding
Interactive config: No manual launcher setup
Single launch: Works everywhere

Use alternatives instead:

PyTorch Lightning: Need callbacks, high-level abstractions
Ray Train: Multi-node orchestration, hyperparameter tuning
DeepSpeed: Direct API control, advanced features
Raw DDP: Maximum control, minimal abstraction

Common issues

Issue: Wrong device placement

Don't manually move to device:

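# WRONG
batch = batch.to('cuda')

# CORRECT
# Batches from a prepared dataloader are already on the right device.
# For tensors you create yourself, use accelerator.device:
# tensor = tensor.to(accelerator.device)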

Issue: Gradient accumulation not working

Use context manager:

# CORRECT
with accelerator.accumulate(model):
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()  # after step(), so accumulated gradients are kept
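
The position of `zero_grad()` matters inside the accumulation context. The toy model below assumes that a prepared optimizer ignores both `step()` and `zero_grad()` on non-sync steps and acts only on every N-th step; `ToyOptimizer` is a hypothetical stand-in written for illustration, not Accelerate's implementation. Each micro-batch contributes a gradient of 1.0, so a correct update should see N accumulated units.

```python
class ToyOptimizer:
    """Hypothetical stand-in for a prepared optimizer: acts only on sync steps."""

    def __init__(self):
        self.grad = 0.0
        self.sync = False
        self.applied = []  # gradient totals actually used for updates

    def zero_grad(self):
        if self.sync:
            self.grad = 0.0

    def step(self):
        if self.sync:
            self.applied.append(self.grad)


def run(num_batches, accumulation_steps, zero_grad_first):
    """Return the accumulated gradient seen by each optimizer update."""
    opt = ToyOptimizer()
    for i in range(1, num_batches + 1):
        opt.sync = (i % accumulation_steps == 0)
        if zero_grad_first:
            opt.zero_grad()
        opt.grad += 1.0  # stands in for accelerator.backward(loss)
        opt.step()
        if not zero_grad_first:
            opt.zero_grad()
    return opt.applied


if __name__ == "__main__":
    print("zero_grad after step:", run(8, 4, zero_grad_first=False))
    print("zero_grad first:     ", run(8, 4, zero_grad_first=True))
```

In this model, calling `zero_grad()` first wipes the accumulated gradients on exactly the step where they are about to be used, so each update sees only one micro-batch.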

Issue: Checkpointing in distributed

Use accelerator methods:

# Call save_state on ALL processes; Accelerate decides which ranks write.
# Guarding it with is_main_process can hang sharded setups (DeepSpeed, FSDP).
accelerator.save_state('checkpoint/')

# Load on all processes
accelerator.load_state('checkpoint/')

Issue: Different results with FSDP

Ensure same random seed:

from accelerate.utils import set_seed
set_seed(42)

Advanced topics

Megatron integration: See references/megatron-integration.md for tensor parallelism, pipeline parallelism, and sequence parallelism setup.

Custom plugins: See references/custom-plugins.md for creating custom distributed plugins and advanced configuration.

Performance tuning: See references/performance.md for profiling, memory optimization, and best practices.

Hardware requirements

CPU: Works (slow)
Single GPU: Works
Multi-GPU: DDP (default), DeepSpeed, or FSDP
Multi-node: DDP, DeepSpeed, FSDP, Megatron
TPU: Supported
Apple MPS: Supported

Launcher requirements:

DDP: torch.distributed.run (built-in)
DeepSpeed: deepspeed (pip install deepspeed)
FSDP: PyTorch 1.12+ (built-in)
Megatron: Custom setup

Resources

Docs: https://huggingface.co/docs/accelerate
GitHub: https://github.com/huggingface/accelerate
Version: 1.11.0+
Tutorial: "Accelerate your scripts"
Examples: https://github.com/huggingface/accelerate/tree/main/examples
Used by: HuggingFace Transformers, TRL, PEFT, all HF libraries

Source

https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/08-distributed-training/accelerate/SKILL.md

Overview

HuggingFace Accelerate provides a minimal 4-line integration to add distributed support to any PyTorch script. It offers a unified API across DeepSpeed, FSDP, Megatron, and DDP, with automatic device placement and support for mixed precision (FP16, BF16, FP8). It also includes an interactive config and a single launch command to simplify distributed training within the HuggingFace ecosystem.

How This Skill Works

Instantiate an Accelerator, then wrap your components with accelerator.prepare(model, optimizer, dataloader) to enable distributed execution. Use accelerator.backward(loss) for gradients and leverage mixed-precision features via accelerator (including optional autocast). Training is launched with accelerate launch train.py, which works across single/multi-GPU and multi-node setups.

When to Use It

You’re starting with a single-GPU script and want to scale to multi-GPU with minimal changes.
You need mixed precision training (fp16, bf16, or fp8) to improve performance or fit memory constraints.
You want DeepSpeed ZeRO-2 integration via a deepspeed_plugin configuration.
You want automatic device placement so you don’t manually move tensors to CUDA.
You require a quick, single-launch solution to run across CPU, single GPU, or multi-node clusters in the HuggingFace ecosystem.

Quick Start

Step 1: pip install accelerate
Step 2: Wrap your script with Accelerator and call accelerator.prepare on model, optimizer, and dataloader
Step 3: Run with accelerate launch train.py

Best Practices

Add the 4-line integration exactly as shown: import Accelerator, create it, then accelerator.prepare.
Always call accelerator.prepare on model, optimizer, and dataloader to ensure proper sharding and device placement.
Use accelerator.backward(loss) instead of loss.backward() so that gradient scaling under mixed precision is applied for you.
Configure mixed precision (fp16, bf16, or fp8) via Accelerate to match your hardware capabilities.
Prefer accelerate launch for consistent behavior across CPU, single-GPU, and multi-node environments.

Example Use Cases

Upgrade a single-GPU PyTorch script to multi-GPU with four-line integration and accelerator.prepare.
Enable FP16 or BF16 mixed precision to reduce memory usage and increase throughput.
Integrate DeepSpeed ZeRO-2 using a deepspeed_plugin configuration for larger models.
Launch distributed training across multiple machines using accelerate launch with multi-machine flags.
Use accelerator.autocast and accelerator.backward for robust precision handling in HuggingFace training flows.