# Modal GPU Reference
Detailed reference for GPU acceleration on Modal.
## Available GPUs
| GPU | VRAM | Max Count | Best For |
|---|---|---|---|
| T4 | 16 GB | 8 | Budget inference, light training |
| L4 | 24 GB | 8 | Inference |
| A10 | 24 GB | 4 | Inference, light training |
| A100-40GB | 40 GB | 8 | Training, large model inference |
| A100-80GB | 80 GB | 8 | Large models, distributed training |
| L40S | 48 GB | 8 | Best cost/performance ratio |
| H100 | 80 GB | 8 | High-performance training |
| H200 | 141 GB | 8 | Very large models |
| B200 | 192 GB | 8 | Largest models, cutting-edge |
## GPU Selection

### Single GPU
@app.function(gpu="A100")
def train():
import torch
assert torch.cuda.is_available()
### Multiple GPUs
```python
# 8 H100s for large model training
@app.function(gpu="H100:8")
def train_large_model():
    import torch
    device_count = torch.cuda.device_count()  # 8
```
### GPU Fallbacks
Request multiple GPU types in priority order:
@app.function(gpu=["H100", "A100-80GB", "A100-40GB:2"])
def flexible_function():
# Will try H100 first, then A100-80GB, then 2x A100-40GB
...
### Specific GPU Variants
```python
# Specific A100 variant
@app.function(gpu="A100-40GB")
def smaller_a100():
    ...

@app.function(gpu="A100-80GB")
def larger_a100():
    ...

# Prevent H100 → H200 auto-upgrade
@app.function(gpu="H100!")
def strict_h100():
    ...
```
## GPU Selection Guidelines

### For Inference
| Model Size | Recommended GPU |
|---|---|
| < 7B params | L4, T4 |
| 7B-13B params | L40S, A100-40GB |
| 13B-70B params | A100-80GB, H100 |
| > 70B params | H100:2+, H200, B200 |
### For Training
| Training Type | Recommended GPU |
|---|---|
| Fine-tuning (LoRA) | A100-40GB, L40S |
| Full fine-tuning | A100-80GB, H100 |
| Pre-training | H100:8, H200:8 |
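For example, the table's "Fine-tuning (LoRA)" row maps to a single A100-40GB request; the body below is a placeholder for a PEFT/LoRA training loop, not a working trainer:

```python
# Sketch: GPU request matching the LoRA fine-tuning recommendation.
@app.function(gpu="A100-40GB", timeout=3600)
def finetune_lora():
    # e.g. load base model, attach LoRA adapters, run the trainer
    ...
```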
## Multi-GPU Training

### PyTorch DDP
@app.function(gpu="A100:4")
def train_ddp():
import subprocess
import sys
subprocess.run(
["torchrun", "--nproc_per_node=4", "train.py"],
check=True
)
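The `train.py` entrypoint itself is not shown in the source; a minimal sketch of what torchrun expects might look like this (the training loop is elided):

```python
# Hypothetical train.py skeleton for the torchrun launch above.
import os

import torch
import torch.distributed as dist

def main():
    # torchrun sets LOCAL_RANK plus the process-group env vars
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```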
### Multi-Node Clusters (Beta)
```python
import modal.experimental

@app.function(gpu="H100:8", timeout=86400)
@modal.experimental.clustered(size=4, rdma=True)  # 4 nodes × 8 GPUs = 32 GPUs
def train_distributed():
    cluster_info = modal.experimental.get_cluster_info()
    from torch.distributed.run import run, parse_args
    run(parse_args([
        "--nnodes=4",
        f"--node-rank={cluster_info.rank}",
        f"--master-addr={cluster_info.container_ips[0]}",
        "--nproc-per-node=8",
        "--master-port=1234",
        "train.py",
    ]))
```
## CUDA Setup

### Using Pre-installed Drivers
Modal provides CUDA drivers by default. Many libraries work with just pip:
```python
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="A100", image=image)
def cuda_function():
    import torch
    print(torch.cuda.is_available())  # True
```
### Full CUDA Toolkit
For libraries requiring nvcc or CUDA headers:
```python
image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.8.1-devel-ubuntu24.04",
        add_python="3.12",
    )
    .entrypoint([])
    .pip_install("flash-attn", "triton")
)

@app.function(gpu="H100", image=image)
def advanced_cuda():
    ...
```
## GPU Metrics
Access GPU metrics from within your function:
@app.function(gpu="A100")
def check_gpu():
import subprocess
result = subprocess.run(
["nvidia-smi", "--query-gpu=utilization.gpu,memory.used,memory.total", "--format=csv"],
capture_output=True, text=True
)
print(result.stdout)
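If PyTorch is already installed, a lighter-weight alternative (an assumption, not something the source shows) is to query memory directly from the framework:

```python
# Alternative sketch: read GPU memory from PyTorch instead of nvidia-smi.
@app.function(gpu="A100")
def check_gpu_torch():
    import torch
    free, total = torch.cuda.mem_get_info()  # bytes
    print(f"free={free / 1e9:.1f} GB / total={total / 1e9:.1f} GB")
```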
## Cost Optimization
- Right-size your GPU: Start with smaller GPUs and scale up
- Use container idle timeout: Keep containers warm for repeated requests
- Batch requests: Use `@modal.batched` for throughput
- Consider GPU fallbacks: Accept any available GPU type
```python
@app.cls(
    gpu="L40S",
    container_idle_timeout=300,  # Keep warm for 5 min
)
class InferenceService:
    @modal.enter()
    def load_model(self):
        self.model = load_model()

    @modal.batched(max_batch_size=32, wait_ms=100)
    async def predict(self, inputs: list[str]) -> list[str]:
        return self.model.batch_predict(inputs)
```
## Source

[skills/gpus/SKILL.md](https://github.com/samarth777/modal-skills/blob/main/skills/gpus/SKILL.md)

## Overview
This skill is a detailed reference for using GPUs on Modal. It covers available GPUs, how to select single or multiple GPUs, multi-node setups, CUDA driver options, and cost optimization strategies.
## How This Skill Works
You request GPUs by decorating functions with `gpu=` or `gpu=[...]`, and Modal provisions the requested hardware. You can request multiple GPUs (e.g., `H100:8`) for data-parallel training, or list GPU fallbacks in priority order. For CUDA, you can rely on the pre-installed drivers or supply your own image with the full CUDA toolkit; the examples above show PyTorch usage and simple GPU checks. A compact sketch combining these pieces follows.
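This is a minimal sketch, not from the source, combining a GPU count with a fallback list (the function body is illustrative):

```python
# Sketch: prefer 2x A100-80GB, fall back to 4x A100-40GB if unavailable.
@app.function(gpu=["A100-80GB:2", "A100-40GB:4"])
def flexible_train():
    import torch
    print(torch.cuda.device_count())  # 2 or 4, depending on what was granted
```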
## When to Use It
- Budget-friendly inference or light training on T4 or L4 GPUs.
- Inference for mid-to-large models using L4, L40S, A100-40GB, or A100-80GB.
- Large-model training or high-performance training with H100, H200, or B200.
- Distributed training across multiple GPUs or nodes (e.g., A100:4, H100:8) using PyTorch DDP.
- CUDA environment setup: choose pre-installed drivers or full CUDA toolkit depending on libraries.
## Quick Start
- Step 1: Choose a GPU type (e.g., `@app.function(gpu="A100")`) or a multi-GPU count (e.g., `@app.function(gpu="H100:8")`).
- Step 2: For multi-GPU or multi-node training, follow the PyTorch DDP or beta multi-node cluster examples above to configure your setup.
- Step 3: Validate CUDA availability inside your function (`torch.cuda.is_available()`) and optionally query GPU metrics with `nvidia-smi`; a complete minimal example follows this list.
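Putting the three steps together, a minimal end-to-end sketch (the app name and image choice are illustrative):

```python
import modal

# Assumed setup: a slim image with torch installed; adjust to your needs.
image = modal.Image.debian_slim().pip_install("torch")
app = modal.App("gpu-quickstart", image=image)

@app.function(gpu="A100")
def check_cuda():
    import torch
    assert torch.cuda.is_available()  # Step 3: validate CUDA in the container
    print(torch.cuda.get_device_name(0))

@app.local_entrypoint()
def main():
    check_cuda.remote()
```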
## Best Practices
- Match model size to the GPU guidance tables and start with the smallest GPU that meets your throughput needs (see the sketch after this list).
- Use explicit GPU variants (e.g., `A100-80GB`, `H100`) when you know the requirement, and consider GPU fallbacks for resilience.
- Use multi-GPU syntax (e.g., `H100:8`) for scalable training, and use the strict flag (`H100!`) to prevent auto-upgrades.
- Explore multi-node training with PyTorch DDP and the beta multi-node features when scaling beyond a single node.
- Test the CUDA setup early: verify driver and toolkit availability with simple checks like `torch.cuda.is_available()` and `nvidia-smi`.
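One way to combine right-sizing with fallbacks, sketched under the assumption that any of these types fits the model, is to list acceptable GPUs smallest-first so cheaper hardware is preferred:

```python
# Sketch: smallest acceptable GPU first, larger types only as fallbacks.
@app.function(gpu=["L40S", "A100-40GB", "A100-80GB"])
def right_sized_inference():
    ...
```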
## Example Use Cases
- Single-GPU training on an A100 to validate a model training run on Modal.
- 8 H100s configured for large model training using @app.function(gpu="H100:8").
- Distributed training across 4 nodes of H100:8 using the beta @modal.experimental.clustered decorator.
- PyTorch DDP example using @app.function(gpu="A100:4") to run parallel training.
- Check GPU health and utilization inside a function using nvidia-smi and a simple CUDA test.