Modal Development Guide
Overview
Modal is a serverless cloud platform for running Python code with minimal configuration. It excels at:
- GPU-accelerated AI/ML inference (supports T4, L4, A10, A100, L40S, H100, H200, B200)
- Serverless web APIs and endpoints
- Scheduled jobs (cron)
- High-performance batch processing
- Sandboxed code execution
Key benefits:
- Pay only for resources used (billed per second)
- Containers spin up in seconds
- No infrastructure management required
- Built-in autoscaling from zero to thousands of containers
Quick Start
Installation & Setup
```bash
pip install modal
modal setup  # Authenticate with Modal
```
Basic App Structure
```python
import modal

# Create an App (groups Functions for deployment)
app = modal.App("my-app")

# Define a container image with dependencies
image = modal.Image.debian_slim(python_version="3.12").pip_install("numpy", "pandas")

# Create a serverless function
@app.function(image=image)
def process_data(x: int) -> int:
    import numpy as np
    return int(np.square(x))

# Local entrypoint for running the app
@app.local_entrypoint()
def main():
    result = process_data.remote(5)
    print(f"Result: {result}")
```
Running Your App
```bash
# Run ephemeral (for development/testing)
modal run my_app.py

# Deploy persistently
modal deploy my_app.py

# Serve with hot-reload (for web endpoints)
modal serve my_app.py
```
Core Concepts
Apps and Functions
An App groups related Functions for atomic deployment. A Function is an independent unit that scales up/down automatically.
```python
import modal

app = modal.App("my-app")

@app.function()
def hello(name: str) -> str:
    return f"Hello, {name}!"

@app.function()
def goodbye(name: str) -> str:
    return f"Goodbye, {name}!"
```
Container Images
Define custom environments using method chaining:
```python
image = (
    modal.Image.debian_slim(python_version="3.12")
    .apt_install("git", "ffmpeg")            # System packages
    .pip_install("torch", "transformers")    # Python packages (pip)
    .uv_pip_install("numpy", "pandas")       # Python packages (uv - faster)
    .env({"MY_VAR": "value"})                # Environment variables
    .run_commands("echo 'setup complete'")   # Shell commands
    .add_local_python_source("my_module")    # Local Python code
)

@app.function(image=image)
def my_function():
    ...
```
Using Existing Images
```python
# From Docker registry
image = modal.Image.from_registry("pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime")

# From Dockerfile
image = modal.Image.from_dockerfile("./Dockerfile")

# From NVIDIA CUDA
image = modal.Image.from_registry("nvidia/cuda:12.8.1-devel-ubuntu24.04", add_python="3.12")
```
Handling Remote-Only Imports
When packages exist only in the container:
```python
image = modal.Image.debian_slim().pip_install("pandas")

# Option 1: Import inside function body
@app.function(image=image)
def my_function():
    import pandas as pd  # Import inside function
    return pd.DataFrame()

# Option 2: Use Image.imports() context manager
with image.imports():
    import pandas as pd
    import numpy as np

@app.function(image=image)
def my_function():
    return pd.DataFrame()
```
GPU Acceleration
Requesting GPUs
```python
# Single GPU
@app.function(gpu="A100")
def train_model():
    import torch
    assert torch.cuda.is_available()
    ...

# Multiple GPUs (same machine)
@app.function(gpu="H100:8")
def train_large_model():
    ...

# GPU fallbacks (tries in order)
@app.function(gpu=["H100", "A100-80GB", "A100-40GB"])
def flexible_inference():
    ...
```
Available GPU Types
| GPU | Memory | Best For |
|---|---|---|
| T4 | 16 GB | Budget inference |
| L4 | 24 GB | Inference |
| A10 | 24 GB | Inference (up to 4x) |
| A100-40GB | 40 GB | Training/Inference |
| A100-80GB | 80 GB | Large models |
| L40S | 48 GB | Best cost/performance |
| H100 | 80 GB | High-performance training |
| H200 | 141 GB | Large models |
| B200 | 192 GB | Largest models |
GPU Image Setup
For libraries requiring CUDA toolkit:
```python
image = (
    modal.Image.from_registry("nvidia/cuda:12.8.1-devel-ubuntu24.04", add_python="3.12")
    .entrypoint([])  # Clear the base image's default entrypoint
    .pip_install("torch", "transformers")
)

@app.function(gpu="A100", image=image)
def gpu_function():
    ...
```
Web Endpoints
FastAPI Endpoints
```python
import modal

image = modal.Image.debian_slim().pip_install("fastapi[standard]")
app = modal.App("web-app", image=image)

@app.function()
@modal.fastapi_endpoint()
def hello(name: str = "World") -> dict:
    return {"message": f"Hello, {name}!"}

# With custom configuration
@app.function()
@modal.fastapi_endpoint(method="POST", docs=True)
def process(data: dict) -> dict:
    return {"processed": data}
```
Full ASGI/WSGI Apps
```python
# ASGI (FastAPI, Starlette)
@app.function()
@modal.asgi_app()
def fastapi_app():
    from fastapi import FastAPI

    web_app = FastAPI()

    @web_app.get("/")
    def root():
        return {"status": "ok"}

    return web_app

# WSGI (Flask, Django)
@app.function()
@modal.wsgi_app()
def flask_app():
    from flask import Flask

    web_app = Flask(__name__)

    @web_app.route("/")
    def root():
        return {"status": "ok"}

    return web_app
```
Web Server (Custom Ports)
```python
@app.function()
@modal.web_server(port=8080)
def custom_server():
    import subprocess

    # Start the server in the background: the decorated function must return
    # while the server keeps listening on the declared port.
    subprocess.Popen(["python", "-m", "http.server", "8080"])
```
Custom Domains
```python
@app.function()
@modal.fastapi_endpoint(custom_domains=["api.example.com"])
def my_api():
    return {"message": "Hello from custom domain!"}
```
Persistent Storage
Volumes (High-Performance Storage)
```python
# Create or reference a Volume
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

@app.function(volumes={"/data": volume})
def save_data():
    with open("/data/output.txt", "w") as f:
        f.write("Hello, Volume!")

@app.function(volumes={"/data": volume})
def read_data():
    with open("/data/output.txt", "r") as f:
        return f.read()
```
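Volume changes are committed in the background, so a container that is already running may briefly see a stale view. When cross-function reads must be deterministic, you can commit and reload explicitly; a minimal sketch:

```python
@app.function(volumes={"/data": volume})
def save_then_publish():
    with open("/data/output.txt", "w") as f:
        f.write("Hello, Volume!")
    volume.commit()  # Flush changes so other containers can fetch them

@app.function(volumes={"/data": volume})
def read_latest():
    volume.reload()  # Pick up the latest committed state
    with open("/data/output.txt", "r") as f:
        return f.read()
```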
Volume Operations
```python
# Upload files
with volume.batch_upload() as batch:
    batch.put_file("local.txt", "/remote.txt")
    batch.put_directory("./local_dir", "/remote_dir")

# Read files
for chunk in volume.read_file("output.txt"):
    print(chunk)

# List contents
for entry in volume.listdir("/"):
    print(entry)
```
Cloud Bucket Mounts
```python
bucket_mount = modal.CloudBucketMount(
    bucket_name="my-s3-bucket",
    secret=modal.Secret.from_name("aws-secret"),
)

@app.function(volumes={"/s3": bucket_mount})
def process_s3_data():
    import os

    files = os.listdir("/s3")
    ...
```
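If a function only ever reads from the bucket, mounting it read-only guards against accidental writes; a sketch using the `read_only` flag:

```python
readonly_mount = modal.CloudBucketMount(
    bucket_name="my-s3-bucket",
    secret=modal.Secret.from_name("aws-secret"),
    read_only=True,  # Reject writes at the mount level
)
```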
Secrets Management
Creating Secrets
```python
# From Modal dashboard (recommended for production)
secret = modal.Secret.from_name("my-secret")

# From dictionary (inline)
secret = modal.Secret.from_dict({"API_KEY": "xxx", "DB_PASSWORD": "yyy"})

# From .env file
secret = modal.Secret.from_dotenv()
```
Using Secrets
@app.function(secrets=[modal.Secret.from_name("openai-secret")])
def call_openai():
import os
api_key = os.environ["OPENAI_API_KEY"]
...
Scheduling (Cron Jobs)
```python
# Run every hour
@app.function(schedule=modal.Period(hours=1))
def hourly_task():
    print("Running hourly task")

# Run daily at a specific time
@app.function(schedule=modal.Cron("0 9 * * *"))  # 9 AM UTC daily
def daily_report():
    print("Generating daily report")

# Run every 5 minutes
@app.function(schedule=modal.Period(minutes=5))
def frequent_check():
    print("Checking...")
```
Deploy with modal deploy to activate schedules.
Parallel Processing
Using .map() for Batch Processing
```python
@app.function()
def process_item(item: int) -> int:
    return item * 2

@app.local_entrypoint()
def main():
    items = list(range(1000))
    # Process all items in parallel
    results = list(process_item.map(items))
    print(f"Processed {len(results)} items")
```
Using .starmap() for Multiple Arguments
```python
@app.function()
def add(x: int, y: int) -> int:
    return x + y

@app.local_entrypoint()
def main():
    pairs = [(1, 2), (3, 4), (5, 6)]
    results = list(add.starmap(pairs))
```
Fire-and-Forget with .spawn()
```python
@app.function()
def background_task(task_id: int):
    # Long-running task
    ...

@app.local_entrypoint()
def main():
    # Spawn without waiting for results
    for i in range(100):
        background_task.spawn(i)
    print("All tasks spawned")
```
Sandboxes (Dynamic Code Execution)
Execute arbitrary code in isolated containers:
```python
import modal

app = modal.App.lookup("sandbox-app", create_if_missing=True)

# Create a sandbox
sb = modal.Sandbox.create(
    image=modal.Image.debian_slim().pip_install("numpy"),
    app=app,
)

# Execute commands
p = sb.exec("python", "-c", "import numpy; print(numpy.__version__)")
print(p.stdout.read())

# Clean up
sb.terminate()
```
LLM Code Execution
```python
@app.function(
    restrict_modal_access=True,   # Security: restrict Modal API access
    single_use_containers=True,   # Fresh container per request
    timeout=30,
    block_network=True,           # No network access
)
def run_untrusted_code(code: str):
    try:
        exec_globals = {}
        exec(code, exec_globals)
        return exec_globals.get("result")
    except Exception as e:
        return f"Error: {e}"
```
Classes with Lifecycle
Use @app.cls() for stateful services with initialization:
@app.cls(gpu="A100", image=image)
class ModelService:
@modal.enter()
def load_model(self):
# Runs once when container starts
from transformers import pipeline
self.model = pipeline("text-generation", model="gpt2", device="cuda")
@modal.method()
def generate(self, prompt: str) -> str:
return self.model(prompt, max_length=100)[0]["generated_text"]
@modal.exit()
def cleanup(self):
# Runs when container shuts down
del self.model
# Usage
@app.local_entrypoint()
def main():
service = ModelService()
result = service.generate.remote("Hello, world!")
print(result)
Resource Configuration
CPU and Memory
```python
@app.function(
    cpu=4.0,      # 4 CPU cores
    memory=8192,  # 8 GB RAM
)
def heavy_computation():
    ...

# With limits
@app.function(
    cpu=(1.0, 4.0),       # Request 1 core, limit 4
    memory=(2048, 8192),  # Request 2 GB, limit 8 GB
)
def flexible_function():
    ...
```
Timeouts and Retries
```python
@app.function(
    timeout=600,  # 10 minute timeout
    retries=modal.Retries(
        max_retries=3,
        initial_delay=1.0,
        backoff_coefficient=2.0,
    ),
)
def reliable_function():
    ...
```
Container Configuration
```python
@app.function(
    concurrency_limit=10,        # Max containers
    allow_concurrent_inputs=5,   # Inputs per container
    container_idle_timeout=300,  # Keep warm for 5 min
)
def optimized_function():
    ...
```
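Note that recent Modal SDK releases renamed these parameters: `concurrency_limit` became `max_containers`, `container_idle_timeout` became `scaledown_window`, and per-container concurrency moved to the `@modal.concurrent` decorator. A sketch of the newer spelling (check which names your installed version supports):

```python
@app.function(
    max_containers=10,     # Formerly concurrency_limit
    scaledown_window=300,  # Formerly container_idle_timeout
)
@modal.concurrent(max_inputs=5)  # Formerly allow_concurrent_inputs
def optimized_function_v2():
    ...
```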
Common Patterns
AI Model Inference Service
```python
import modal

app = modal.App("llm-service")

image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("vllm", "torch")
)

# Web endpoints need FastAPI available in their container
web_image = modal.Image.debian_slim().pip_install("fastapi[standard]")

@app.cls(gpu="A100", image=image, container_idle_timeout=300)
class LLMService:
    @modal.enter()
    def load(self):
        from vllm import LLM

        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

    @modal.method()
    def generate(self, prompt: str) -> str:
        from vllm import SamplingParams

        params = SamplingParams(temperature=0.7, max_tokens=256)
        outputs = self.llm.generate([prompt], params)
        return outputs[0].outputs[0].text

@app.function(image=web_image)
@modal.fastapi_endpoint(method="POST")
def inference(body: dict) -> dict:
    service = LLMService()
    result = service.generate.remote(body["prompt"])
    return {"response": result}
```
Data Processing Pipeline
```python
import modal

app = modal.App("data-pipeline")
volume = modal.Volume.from_name("pipeline-data", create_if_missing=True)
image = modal.Image.debian_slim().pip_install("pandas", "pyarrow")

@app.function(image=image, volumes={"/data": volume})
def extract(source: str) -> str:
    import pandas as pd

    df = pd.read_csv(source)
    output_path = f"/data/extracted_{source.split('/')[-1]}"
    df.to_parquet(output_path)
    return output_path

@app.function(image=image, volumes={"/data": volume})
def transform(input_path: str) -> str:
    import pandas as pd

    df = pd.read_parquet(input_path)
    # Transform logic
    df["processed"] = True
    output_path = input_path.replace("extracted", "transformed")
    df.to_parquet(output_path)
    return output_path

@app.function(image=image, volumes={"/data": volume})
def load(input_path: str):
    import pandas as pd

    df = pd.read_parquet(input_path)
    # Load to destination
    print(f"Loaded {len(df)} rows")

@app.local_entrypoint()
def run_pipeline():
    sources = ["data1.csv", "data2.csv", "data3.csv"]
    # Extract in parallel
    extracted = list(extract.map(sources))
    # Transform in parallel
    transformed = list(transform.map(extracted))
    # Load in parallel
    list(load.map(transformed))
```
Job Queue Pattern
```python
import modal

app = modal.App("job-queue")

@app.function()
def process_job(data: dict) -> dict:
    # Long-running processing
    import time

    time.sleep(10)
    return {"status": "complete", "data": data}

# Submit job and get call ID
def submit_job(data: dict) -> str:
    process_job_fn = modal.Function.from_name("job-queue", "process_job")
    call = process_job_fn.spawn(data)
    return call.object_id

# Poll for result
def get_result(call_id: str):
    call = modal.FunctionCall.from_id(call_id)
    try:
        return call.get(timeout=0)  # Non-blocking
    except TimeoutError:
        return {"status": "pending"}
```
Deployment Best Practices
Project Structure
```text
my-modal-app/
├── app.py            # Main Modal app
├── models/           # ML models
│   └── inference.py
├── utils/            # Shared utilities
│   └── helpers.py
├── requirements.txt
└── Dockerfile        # Optional
```
Image Optimization
# Order layers by change frequency (least changing first)
image = (
modal.Image.debian_slim(python_version="3.12")
.apt_install("ffmpeg") # Rarely changes
.pip_install("torch==2.1.0") # Pin versions
.pip_install("transformers==4.35.0") # Pin versions
.add_local_python_source("utils") # Changes more often
)
CI/CD Integration
```yaml
# .github/workflows/deploy.yml
name: Deploy to Modal

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
      MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install modal
      - run: modal deploy app.py
```
Debugging & Monitoring
View Logs
```bash
# Stream logs from a deployed app
modal app logs my-app

# View logs for a specific container
modal container logs <container-id>
```
Interactive Debugging
```python
# Add a breakpoint for debugging
@app.function()
def debug_function():
    import pdb; pdb.set_trace()  # Run with `modal run -i` so the breakpoint gets an interactive console
    ...
```
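For ad-hoc inspection you can also open an interactive shell in a container built from a function's image and environment; a sketch:

```bash
# Open a shell with the same image, mounts, and secrets as the function
modal shell app.py::debug_function
```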
Force Image Rebuild
```bash
# Rebuild all images
MODAL_FORCE_BUILD=1 modal run app.py

# Ignore cache (doesn't break cache for others)
MODAL_IGNORE_CACHE=1 modal run app.py
```
Reference
CLI Commands
| Command | Description |
|---|---|
| `modal run app.py` | Run ephemeral app |
| `modal deploy app.py` | Deploy persistently |
| `modal serve app.py` | Serve with hot-reload |
| `modal app list` | List deployed apps |
| `modal app stop <name>` | Stop deployed app |
| `modal secret list` | List secrets |
| `modal volume list` | List volumes |
Pricing
- CPU: ~$0.192/core/hour
- Memory: ~$0.024/GB/hour
- GPUs: Varies by type (see modal.com/pricing)
- Billed per second, only for resources used
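As a rough worked example at the approximate rates above: a function running for 90 seconds with 2 CPU cores and 8 GB of RAM costs about 2 × $0.192/3600 × 90 + 8 × $0.024/3600 × 90 ≈ $0.0096 + $0.0048 ≈ $0.014, with any GPU charge added on top.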
Links
- Documentation: https://modal.com/docs
- Examples: https://github.com/modal-labs/modal-examples
- Pricing: https://modal.com/pricing
- Dashboard: https://modal.com/apps
Source
```bash
git clone https://github.com/samarth777/modal-skills
```
The skill file itself lives at skills/main/SKILL.md in that repository.

Overview
Modal provides a serverless cloud platform to run Python code with minimal configuration. It excels at GPU-accelerated AI/ML inference, serverless web APIs, cron jobs, and high-performance batch processing, all without infrastructure management.
How This Skill Works
Create an App to group Functions, define a container image with dependencies, and run or deploy them on Modal's serverless infrastructure. Functions scale automatically and can request GPUs; code runs in sandboxed containers and can be exposed as web endpoints or scheduled jobs.
When to Use It
- Deploy Python code to Modal's infrastructure
- Run GPU-accelerated AI/ML inference
- Create serverless web APIs and endpoints
- Process data at scale with batch processing
- Build AI-driven applications
Quick Start
- Step 1: Install Modal and authenticate: pip install modal; modal setup
- Step 2: Create an App, define a container image, and add a function
- Step 3: Run, deploy, or serve your app: modal run/deploy/serve my_app.py
Best Practices
- Structure apps with Apps and Functions, keeping functions small and cohesive.
- Choose a dedicated Image per function and pin dependencies with pip_install/uv_pip_install.
- Request GPUs explicitly for heavy workloads using @app.function(gpu=...).
- Use modal run for development and modal deploy/serve for production.
- Leverage environment variables, run_commands, and imports() to manage setup and code locality.
Example Use Cases
- GPU-accelerated model inference in a function using @app.function(gpu='A100') with PyTorch.
- A serverless web API endpoint served via modal serve my_app.py.
- Scheduled cron jobs running data pipelines on a defined schedule.
- High-throughput batch processing of large datasets with autoscaling containers.
- Sandboxed code execution for user-submitted AI tasks without managing infrastructure.