
GPUs

npx machina-cli add skill samarth777/modal-skills/gpus --openclaw

Modal GPU Reference

Detailed reference for GPU acceleration on Modal.

Available GPUs

| GPU | VRAM | Max Count | Best For |
| --- | --- | --- | --- |
| T4 | 16 GB | 8 | Budget inference, light training |
| L4 | 24 GB | 8 | Inference |
| A10 | 24 GB | 4 | Inference, light training |
| A100-40GB | 40 GB | 8 | Training, large model inference |
| A100-80GB | 80 GB | 8 | Large models, distributed training |
| L40S | 48 GB | 8 | Best cost/performance ratio |
| H100 | 80 GB | 8 | High-performance training |
| H200 | 141 GB | 8 | Very large models |
| B200 | 192 GB | 8 | Largest models, cutting-edge |

GPU Selection

Single GPU

@app.function(gpu="A100")
def train():
    import torch
    assert torch.cuda.is_available()

Multiple GPUs

# 8 H100s for large model training
@app.function(gpu="H100:8")
def train_large_model():
    import torch
    device_count = torch.cuda.device_count()  # 8

GPU Fallbacks

Request multiple GPU types in priority order:

@app.function(gpu=["H100", "A100-80GB", "A100-40GB:2"])
def flexible_function():
    # Will try H100 first, then A100-80GB, then 2x A100-40GB
    ...
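Each entry in a fallback list is a plain TYPE or TYPE:COUNT string. As an illustration of how such a string decomposes (this parser is a sketch for clarity, not part of Modal's SDK):

```python
def parse_gpu_spec(spec: str) -> tuple[str, int]:
    """Split a GPU request string like "A100-40GB:2" into (type, count)."""
    gpu_type, _, count = spec.partition(":")
    # A missing ":COUNT" suffix means a single GPU.
    return gpu_type, int(count) if count else 1

parse_gpu_spec("H100:8")     # ("H100", 8)
parse_gpu_spec("A100-40GB")  # ("A100-40GB", 1)
```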

Specific GPU Variants

# Specific A100 variant
@app.function(gpu="A100-40GB")
def smaller_a100():
    ...

@app.function(gpu="A100-80GB")
def larger_a100():
    ...

# Prevent H100 → H200 auto-upgrade
@app.function(gpu="H100!")
def strict_h100():
    ...

GPU Selection Guidelines

For Inference

| Model Size | Recommended GPU |
| --- | --- |
| < 7B params | L4, T4 |
| 7B-13B params | L40S, A100-40GB |
| 13B-70B params | A100-80GB, H100 |
| > 70B params | H100:2+, H200, B200 |

For Training

| Training Type | Recommended GPU |
| --- | --- |
| Fine-tuning (LoRA) | A100-40GB, L40S |
| Full fine-tuning | A100-80GB, H100 |
| Pre-training | H100:8, H200:8 |
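The inference table above can be collapsed into a small lookup helper. The thresholds and GPU names come from the table; the function itself is hypothetical, shown only to make the guidance concrete:

```python
def recommend_inference_gpu(params_billions: float) -> list[str]:
    """Suggest GPU types for inference, following the model-size table above."""
    if params_billions < 7:
        return ["L4", "T4"]
    if params_billions <= 13:
        return ["L40S", "A100-40GB"]
    if params_billions <= 70:
        return ["A100-80GB", "H100"]
    return ["H100:2", "H200", "B200"]

# The returned list can be passed straight to gpu=[...] as a fallback chain.
recommend_inference_gpu(8)  # ["L40S", "A100-40GB"]
```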

Multi-GPU Training

PyTorch DDP

@app.function(gpu="A100:4")
def train_ddp():
    import subprocess
    import sys
    subprocess.run(
        ["torchrun", "--nproc_per_node=4", "train.py"],
        check=True
    )

Multi-Node Clusters (Beta)

import modal.experimental

@app.function(gpu="H100:8", timeout=86400)
@modal.experimental.clustered(size=4, rdma=True)  # 4 nodes × 8 GPUs = 32 GPUs
def train_distributed():
    cluster_info = modal.experimental.get_cluster_info()
    
    from torch.distributed.run import run, parse_args
    run(parse_args([
        "--nnodes=4",
        f"--node-rank={cluster_info.rank}",
        f"--master-addr={cluster_info.container_ips[0]}",
        "--nproc-per-node=8",
        "--master-port=1234",
        "train.py",
    ]))
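Under the hood, torchrun derives each worker's global rank from its node rank and local rank. The arithmetic, shown here as a standalone sketch:

```python
def global_rank(node_rank: int, local_rank: int, nproc_per_node: int = 8) -> int:
    """Global rank of a worker given its node rank and per-node local rank."""
    return node_rank * nproc_per_node + local_rank

# Worker 3 on node 2 of the 4-node x 8-GPU cluster above:
global_rank(2, 3)  # 19
```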

CUDA Setup

Using Pre-installed Drivers

Modal provides CUDA drivers by default, so many libraries work with a plain pip install:

image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="A100", image=image)
def cuda_function():
    import torch
    print(torch.cuda.is_available())  # True

Full CUDA Toolkit

For libraries requiring nvcc or CUDA headers:

image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.8.1-devel-ubuntu24.04",
        add_python="3.12"
    )
    .entrypoint([])
    .pip_install("flash-attn", "triton")
)

@app.function(gpu="H100", image=image)
def advanced_cuda():
    ...

GPU Metrics

Access GPU metrics from within your function:

@app.function(gpu="A100")
def check_gpu():
    import subprocess
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used,memory.total", "--format=csv"],
        capture_output=True, text=True
    )
    print(result.stdout)
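The CSV output is easy to post-process. A sketch of parsing it into per-GPU dicts; the sample string mimics nvidia-smi's --format=csv layout rather than being captured output:

```python
import csv
import io

def parse_gpu_metrics(csv_text: str) -> list[dict[str, str]]:
    """Turn nvidia-smi --format=csv output into one dict per GPU."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header = [h.strip() for h in rows[0]]
    return [dict(zip(header, (v.strip() for v in row))) for row in rows[1:]]

sample = """utilization.gpu [%], memory.used [MiB], memory.total [MiB]
97 %, 35123 MiB, 40960 MiB"""
parse_gpu_metrics(sample)[0]["memory.used [MiB]"]  # '35123 MiB'
```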

Cost Optimization

  1. Right-size your GPU: Start with smaller GPUs and scale up
  2. Use container idle timeout: Keep containers warm for repeated requests
  3. Batch requests: Use @modal.batched for throughput
  4. Consider GPU fallbacks: Accept any available GPU type

Example combining a warm container with request batching:
@app.cls(
    gpu="L40S",
    container_idle_timeout=300,  # Keep warm for 5 min
)
class InferenceService:
    @modal.enter()
    def load_model(self):
        self.model = load_model()
    
    @modal.batched(max_batch_size=32, wait_ms=100)
    async def predict(self, inputs: list[str]) -> list[str]:
        return self.model.batch_predict(inputs)
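Why batching pays off: at a steady request rate, grouping calls divides the number of model invocations by the achieved batch size. A rough back-of-the-envelope model (the rates here are made-up inputs, not measurements, and it ignores arrival jitter):

```python
def effective_batch_size(req_per_sec: float, wait_ms: int, max_batch_size: int) -> float:
    """Requests accumulated during the wait window, capped at the batch limit."""
    return min(max(req_per_sec * wait_ms / 1000, 1.0), float(max_batch_size))

# At 200 req/s with a 100 ms window, batches fill 20 of 32 slots,
# so the forward pass runs roughly 20x less often than per-request calls.
effective_batch_size(200, wait_ms=100, max_batch_size=32)  # 20.0
```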

Source

git clone https://github.com/samarth777/modal-skills
The skill file is at skills/gpus/SKILL.md.

Overview

This skill is a detailed reference for using GPUs on Modal. It covers available GPUs, how to select single or multiple GPUs, multi-node setups, CUDA driver options, and cost optimization strategies.

How This Skill Works

You request GPUs by decorating functions with gpu= or gpu=[...], and Modal provisions the requested hardware. You can request multiple GPUs (e.g., H100:8) for data-parallel training, or pass a list of GPU types as a priority-ordered fallback. CUDA support can rely on the pre-installed drivers or on user-supplied images with the full CUDA toolkit; the examples show PyTorch usage and simple GPU checks.

When to Use It

  • Budget-friendly inference or light training on T4 or L4 GPUs.
  • Inference for mid-to-large models using L4, L40S, A100-40GB, or A100-80GB.
  • Large-model training or high-performance training with H100, H200, or B200.
  • Distributed training across multiple GPUs or nodes (e.g., A100:4, H100:8) using PyTorch DDP.
  • CUDA environment setup: choose pre-installed drivers or full CUDA toolkit depending on libraries.

Quick Start

  1. Choose a GPU type (e.g., @app.function(gpu="A100")) or multi-GPU (e.g., @app.function(gpu="H100:8")).
  2. For multi-GPU or multi-node training, follow the PyTorch DDP or beta multi-node cluster examples to configure your setup.
  3. Validate CUDA availability inside your function (torch.cuda.is_available()) and optionally query GPU metrics with nvidia-smi.

Best Practices

  • Match model size against the GPU guidance tables and start with the smallest GPU that meets throughput needs.
  • Use explicit GPU variants (e.g., A100-80GB, H100) when you know the requirement, and consider GPU fallbacks for resilience.
  • Use multi-GPU syntax (e.g., H100:8) for scalable training, and leverage strict GPU flags (H100!) to prevent auto-upgrade.
  • Explore multi-node training with PyTorch DDP and the beta multi-node features when scaling beyond a single node.
  • Test CUDA setup early: verify drivers and CUDA toolkit availability with simple checks like torch.cuda.is_available() and nvidia-smi.

Example Use Cases

  • Single-GPU training on an A100 to validate a training run on Modal before scaling up.
  • 8 H100s configured for large model training using @app.function(gpu="H100:8").
  • Distributed training across 4 nodes of H100:8 using the beta @modal.experimental.clustered decorator.
  • PyTorch DDP example using @app.function(gpu="A100:4") to run parallel training.
  • Check GPU health and utilization inside a function using nvidia-smi and a simple CUDA test.
