# Modal GPU Reference
Detailed reference for GPU acceleration on Modal.
## Available GPUs
| GPU | VRAM | Max Count | Best For |
|---|---|---|---|
| T4 | 16 GB | 8 | Budget inference, light training |
| L4 | 24 GB | 8 | Inference |
| A10 | 24 GB | 4 | Inference, light training |
| A100-40GB | 40 GB | 8 | Training, large model inference |
| A100-80GB | 80 GB | 8 | Large models, distributed training |
| L40S | 48 GB | 8 | Best cost/performance ratio |
| H100 | 80 GB | 8 | High-performance training |
| H200 | 141 GB | 8 | Very large models |
| B200 | 192 GB | 8 | Largest models, cutting-edge |
## GPU Selection

### Single GPU
@app.function(gpu="A100")
def train():
import torch
assert torch.cuda.is_available()
### Multiple GPUs
```python
# 8 H100s for large model training
@app.function(gpu="H100:8")
def train_large_model():
    import torch
    device_count = torch.cuda.device_count()  # 8
```
### GPU Fallbacks
Request multiple GPU types in priority order:
@app.function(gpu=["H100", "A100-80GB", "A100-40GB:2"])
def flexible_function():
# Will try H100 first, then A100-80GB, then 2x A100-40GB
...
### Specific GPU Variants
```python
# Specific A100 variant
@app.function(gpu="A100-40GB")
def smaller_a100():
    ...

@app.function(gpu="A100-80GB")
def larger_a100():
    ...

# Prevent H100 → H200 auto-upgrade
@app.function(gpu="H100!")
def strict_h100():
    ...
```
## GPU Selection Guidelines

### For Inference
| Model Size | Recommended GPU |
|---|---|
| < 7B params | L4, T4 |
| 7B-13B params | L40S, A100-40GB |
| 13B-70B params | A100-80GB, H100 |
| > 70B params | H100:2+, H200, B200 |
### For Training
| Training Type | Recommended GPU |
|---|---|
| Fine-tuning (LoRA) | A100-40GB, L40S |
| Full fine-tuning | A100-80GB, H100 |
| Pre-training | H100:8, H200:8 |
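For example, the table's "Fine-tuning (LoRA)" row maps to a single A100-40GB request; the body below is a placeholder for a PEFT/LoRA training loop, not a working trainer:

```python
# Sketch: GPU request matching the LoRA fine-tuning recommendation.
@app.function(gpu="A100-40GB", timeout=3600)
def finetune_lora():
    # e.g. load base model, attach LoRA adapters, run the trainer
    ...
```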
## Multi-GPU Training

### PyTorch DDP
@app.function(gpu="A100:4")
def train_ddp():
import subprocess
import sys
subprocess.run(
["torchrun", "--nproc_per_node=4", "train.py"],
check=True
)
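The `train.py` entrypoint itself is not shown in the source; a minimal sketch of what torchrun expects might look like this (the training loop is elided):

```python
# Hypothetical train.py skeleton for the torchrun launch above.
import os

import torch
import torch.distributed as dist

def main():
    # torchrun sets LOCAL_RANK plus the process-group env vars
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```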
### Multi-Node Clusters (Beta)
```python
import modal.experimental

@app.function(gpu="H100:8", timeout=86400)
@modal.experimental.clustered(size=4, rdma=True)  # 4 nodes × 8 GPUs = 32 GPUs
def train_distributed():
    cluster_info = modal.experimental.get_cluster_info()
    from torch.distributed.run import run, parse_args
    run(parse_args([
        "--nnodes=4",
        f"--node-rank={cluster_info.rank}",
        f"--master-addr={cluster_info.container_ips[0]}",
        "--nproc-per-node=8",
        "--master-port=1234",
        "train.py",
    ]))
```
## CUDA Setup

### Using Pre-installed Drivers
Modal provides CUDA drivers by default. Many libraries work with just pip:
```python
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="A100", image=image)
def cuda_function():
    import torch
    print(torch.cuda.is_available())  # True
```
### Full CUDA Toolkit
For libraries requiring nvcc or CUDA headers:
```python
image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.8.1-devel-ubuntu24.04",
        add_python="3.12",
    )
    .entrypoint([])
    .pip_install("flash-attn", "triton")
)

@app.function(gpu="H100", image=image)
def advanced_cuda():
    ...
```
## GPU Metrics
Access GPU metrics from within your function:
@app.function(gpu="A100")
def check_gpu():
import subprocess
result = subprocess.run(
["nvidia-smi", "--query-gpu=utilization.gpu,memory.used,memory.total", "--format=csv"],
capture_output=True, text=True
)
print(result.stdout)
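If PyTorch is already installed, a lighter-weight alternative (an assumption, not something the source shows) is to query memory directly from the framework:

```python
# Alternative sketch: read GPU memory from PyTorch instead of nvidia-smi.
@app.function(gpu="A100")
def check_gpu_torch():
    import torch
    free, total = torch.cuda.mem_get_info()  # bytes
    print(f"free={free / 1e9:.1f} GB / total={total / 1e9:.1f} GB")
```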
## Cost Optimization
- Right-size your GPU: Start with smaller GPUs and scale up
- Use container idle timeout: Keep containers warm for repeated requests
- Batch requests: Use `@modal.batched` for throughput
- Consider GPU fallbacks: Accept any available GPU type
```python
@app.cls(
    gpu="L40S",
    container_idle_timeout=300,  # Keep warm for 5 min
)
class InferenceService:
    @modal.enter()
    def load_model(self):
        self.model = load_model()

    @modal.batched(max_batch_size=32, wait_ms=100)
    async def predict(self, inputs: list[str]) -> list[str]:
        return self.model.batch_predict(inputs)
```
## Source

[skills/gpus/SKILL.md](https://github.com/samarth777/modal-skills/blob/main/skills/gpus/SKILL.md)

## Overview
This skill is a detailed reference for using GPUs on Modal. It covers available GPUs, how to select single or multiple GPUs, multi-node setups, CUDA driver options, and cost optimization strategies.
## How This Skill Works
You request GPUs by decorating functions with `gpu=` or `gpu=[...]`, and Modal provisions the requested hardware. You can request multiple GPUs (e.g., `H100:8`) for data-parallel training, or list GPU fallbacks in priority order. For CUDA, you can rely on the pre-installed drivers or supply your own image with the full CUDA toolkit; the examples above show PyTorch usage and simple GPU checks. A compact sketch combining these pieces follows.
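This is a minimal sketch, not from the source, combining a GPU count with a fallback list (the function body is illustrative):

```python
# Sketch: prefer 2x A100-80GB, fall back to 4x A100-40GB if unavailable.
@app.function(gpu=["A100-80GB:2", "A100-40GB:4"])
def flexible_train():
    import torch
    print(torch.cuda.device_count())  # 2 or 4, depending on what was granted
```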
## When to Use It
- Budget-friendly inference or light training on T4 or L4 GPUs.
- Inference for mid-to-large models using L4, L40S, A100-40GB, or A100-80GB.
- Large-model training or high-performance training with H100, H200, or B200.
- Distributed training across multiple GPUs or nodes (e.g., A100:4, H100:8) using PyTorch DDP.
- CUDA environment setup: choose pre-installed drivers or full CUDA toolkit depending on libraries.
## Quick Start
- Step 1: Choose a GPU type (e.g., `@app.function(gpu="A100")`) or a multi-GPU count (e.g., `@app.function(gpu="H100:8")`).
- Step 2: For multi-GPU or multi-node training, follow the PyTorch DDP or beta multi-node cluster examples above to configure your setup.
- Step 3: Validate CUDA availability inside your function (`torch.cuda.is_available()`) and optionally query GPU metrics with `nvidia-smi`; a complete minimal example follows this list.
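Putting the three steps together, a minimal end-to-end sketch (the app name and image choice are illustrative):

```python
import modal

# Assumed setup: a slim image with torch installed; adjust to your needs.
image = modal.Image.debian_slim().pip_install("torch")
app = modal.App("gpu-quickstart", image=image)

@app.function(gpu="A100")
def check_cuda():
    import torch
    assert torch.cuda.is_available()  # Step 3: validate CUDA in the container
    print(torch.cuda.get_device_name(0))

@app.local_entrypoint()
def main():
    check_cuda.remote()
```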
## Best Practices
- Match model size to the GPU guidance tables and start with the smallest GPU that meets your throughput needs (see the sketch after this list).
- Use explicit GPU variants (e.g., `A100-80GB`, `H100`) when you know the requirement, and consider GPU fallbacks for resilience.
- Use multi-GPU syntax (e.g., `H100:8`) for scalable training, and use the strict flag (`H100!`) to prevent auto-upgrades.
- Explore multi-node training with PyTorch DDP and the beta multi-node features when scaling beyond a single node.
- Test the CUDA setup early: verify driver and toolkit availability with simple checks like `torch.cuda.is_available()` and `nvidia-smi`.
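One way to combine right-sizing with fallbacks, sketched under the assumption that any of these types fits the model, is to list acceptable GPUs smallest-first so cheaper hardware is preferred:

```python
# Sketch: smallest acceptable GPU first, larger types only as fallbacks.
@app.function(gpu=["L40S", "A100-40GB", "A100-80GB"])
def right_sized_inference():
    ...
```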
## Example Use Cases
- Single-GPU training on an A100 to validate a model training run on Modal.
- 8 H100s configured for large model training using @app.function(gpu="H100:8").
- Distributed training across 4 nodes of H100:8 using the beta @modal.experimental.clustered decorator.
- PyTorch DDP example using @app.function(gpu="A100:4") to run parallel training.
- Check GPU health and utilization inside a function using nvidia-smi and a simple CUDA test.