
LLM Inference

npx machina-cli add skill samarth777/modal-skills/llm-inference --openclaw

LLM Inference Service Example

A complete example of deploying an LLM inference service on Modal using vLLM.

import modal

# --- Configuration ---
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
GPU_TYPE = "A100"

# --- Image Definition ---
image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "vllm==0.6.0",
        "torch==2.4.0",
        "transformers",
        "huggingface_hub[hf_transfer]",
    )
    .env({
        "HF_HUB_ENABLE_HF_TRANSFER": "1",
        "VLLM_ATTENTION_BACKEND": "FLASH_ATTN",
    })
)

app = modal.App("llm-inference", image=image)

# --- Model Cache Volume ---
model_volume = modal.Volume.from_name("llm-model-cache", create_if_missing=True)
MODEL_PATH = "/models"

# --- Download Model (Build Step) ---
@app.function(
    volumes={MODEL_PATH: model_volume},
    secrets=[modal.Secret.from_name("huggingface-secret")],
    timeout=3600,
)
def download_model():
    import os

    from huggingface_hub import snapshot_download
    
    snapshot_download(
        MODEL_NAME,
        local_dir=f"{MODEL_PATH}/{MODEL_NAME}",
        token=os.environ["HF_TOKEN"],
    )
    model_volume.commit()

# --- Inference Service ---
@app.cls(
    gpu=GPU_TYPE,
    volumes={MODEL_PATH: model_volume},
    container_idle_timeout=300,  # Keep warm for 5 minutes
    allow_concurrent_inputs=10,
)
class LLMService:
    @modal.enter()
    def load_model(self):
        from vllm import LLM
        
        self.llm = LLM(
            model=f"{MODEL_PATH}/{MODEL_NAME}",
            tensor_parallel_size=1,
            gpu_memory_utilization=0.9,
        )
    
    @modal.method()
    def generate(
        self,
        prompt: str,
        max_tokens: int = 256,
        temperature: float = 0.7,
    ) -> str:
        from vllm import SamplingParams
        
        params = SamplingParams(
            temperature=temperature,
            max_tokens=max_tokens,
        )
        
        outputs = self.llm.generate([prompt], params)
        return outputs[0].outputs[0].text
    
    @modal.method()
    def generate_batch(
        self,
        prompts: list[str],
        max_tokens: int = 256,
        temperature: float = 0.7,
    ) -> list[str]:
        from vllm import SamplingParams
        
        params = SamplingParams(
            temperature=temperature,
            max_tokens=max_tokens,
        )
        
        outputs = self.llm.generate(prompts, params)
        return [o.outputs[0].text for o in outputs]

# --- Web API ---
@app.function()
@modal.fastapi_endpoint(method="POST", docs=True)
def generate(body: dict) -> dict:
    service = LLMService()
    
    result = service.generate.remote(
        prompt=body["prompt"],
        max_tokens=body.get("max_tokens", 256),
        temperature=body.get("temperature", 0.7),
    )
    
    return {"response": result}

# --- Streaming Endpoint ---
@app.function()
@modal.fastapi_endpoint(method="POST")
async def generate_stream(body: dict):
    from fastapi.responses import StreamingResponse
    
    # For streaming, you'd use vLLM's async engine
    # This is a simplified example
    service = LLMService()
    result = service.generate.remote(
        prompt=body["prompt"],
        max_tokens=body.get("max_tokens", 256),
    )
    
    async def stream():
        # In production, use vLLM's streaming
        for token in result.split():
            yield f"data: {token}\n\n"
    
    return StreamingResponse(stream(), media_type="text/event-stream")

# --- CLI ---
@app.local_entrypoint()
def main(prompt: str = "Explain quantum computing in simple terms."):
    print(f"Prompt: {prompt}\n")
    
    service = LLMService()
    response = service.generate.remote(prompt)
    
    print(f"Response:\n{response}")

Usage

# Download model first
modal run llm_service.py::download_model

# Test locally
modal run llm_service.py --prompt "What is the meaning of life?"

# Deploy
modal deploy llm_service.py

# Call API
curl -X POST https://your-workspace--llm-inference-generate.modal.run \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "max_tokens": 100}'
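The curl call above can also be made from Python. A minimal client sketch using only the standard library; the endpoint URL is the same placeholder as in the curl example and must be replaced with your own deployment's URL:

```python
import json
import urllib.request

# Placeholder: substitute the URL printed by `modal deploy` for your workspace.
ENDPOINT = "https://your-workspace--llm-inference-generate.modal.run"

def build_payload(prompt: str, max_tokens: int = 256, temperature: float = 0.7) -> dict:
    """Mirror the JSON body schema the generate endpoint expects."""
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}

def call_generate(prompt: str, **kwargs) -> str:
    """POST a prompt to the deployed endpoint and return the generated text."""
    data = json.dumps(build_payload(prompt, **kwargs)).encode()
    req = urllib.request.Request(
        ENDPOINT,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Usage: `call_generate("Hello, how are you?", max_tokens=100)` performs the same request as the curl command and returns the `response` field of the JSON reply.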

Source

https://github.com/samarth777/modal-skills/blob/main/skills/llm-inference/SKILL.md

Overview

This skill demonstrates a complete end-to-end LLM inference service on Modal using vLLM. It covers image setup, a persistent model cache, a loadable LLM service, and a web API with generation endpoints for single and batch prompts.

How This Skill Works

The solution builds a Modal image that installs vLLM, torch, and transformers, downloads the model into a shared volume, and then loads it with vLLM in a service class. It exposes generate and generate_batch methods for prompt-based generation and a FastAPI endpoint for REST access, with an optional streaming path.
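The streaming endpoint emits server-sent events of the form `data: <token>` separated by blank lines. A small client-side parsing helper (a sketch, not part of the skill itself) shows how to recover the tokens from the raw event stream:

```python
def parse_sse_tokens(raw: str) -> list[str]:
    """Extract the payload of each `data: ...` event from an SSE stream."""
    tokens = []
    for event in raw.split("\n\n"):
        event = event.strip()
        if event.startswith("data: "):
            tokens.append(event[len("data: "):])
    return tokens

# Events in the format generate_stream yields them:
stream = "data: Quantum\n\ndata: computing\n\ndata: explained\n\n"
# parse_sse_tokens(stream) -> ["Quantum", "computing", "explained"]
```

A production client would read the response body incrementally (e.g. with httpx streaming) rather than buffering the whole stream, but the event format is the same.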

When to Use It

  • You want a hosted LLM inference service on Modal using a pre-downloaded model such as Llama 3.1 8B Instruct
  • You need fast cold starts via persistent model storage on a shared volume
  • You require a REST API for single and batch prompt generation
  • You want a CLI entrypoint for local testing before deploying to production
  • You need an end-to-end example covering model download, loading, and inference

Quick Start

  1. Build an image that installs vLLM, torch, and transformers, and enables HF transfer
  2. Create the model cache volume and run the download_model build step to populate /models
  3. Deploy the app and call the API, or test locally with the CLI prompt

Best Practices

  • Pin exact versions for vLLM, torch, and transformers to avoid compatibility issues
  • Use a persistent model volume and commit it after the model download to speed up cold starts
  • Tune tensor_parallel_size and gpu_memory_utilization for your GPU constraints
  • Store HF token in a Modal secret and enable HF transfer in the environment
  • Test both single and batch generation and consider streaming for latency-sensitive apps
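The web endpoints above pass user-supplied `max_tokens` and `temperature` straight into `SamplingParams`. A small clamping helper (hypothetical, not part of the skill) illustrates one way to keep those values in safe ranges before they reach the model:

```python
def clamp_params(body: dict, max_tokens_limit: int = 2048) -> dict:
    """Clamp user-supplied generation parameters to safe ranges.

    The limits here (1..2048 tokens, 0.0..2.0 temperature) are
    illustrative defaults, not values prescribed by the skill.
    """
    max_tokens = int(body.get("max_tokens", 256))
    temperature = float(body.get("temperature", 0.7))
    return {
        "prompt": body["prompt"],
        "max_tokens": min(max(max_tokens, 1), max_tokens_limit),
        "temperature": min(max(temperature, 0.0), 2.0),
    }

# clamp_params({"prompt": "hi", "max_tokens": 999999})
# -> {"prompt": "hi", "max_tokens": 2048, "temperature": 0.7}
```

Applying such a helper at the top of the generate endpoint bounds GPU time per request without changing the endpoint's body schema.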

Example Use Cases

  • LLMService.load_model loads the model from the shared volume with vLLM
  • generate handles a single prompt with configurable max_tokens and temperature
  • generate_batch processes multiple prompts and returns a list of texts
  • Web API endpoint generate exposes generation via a FastAPI function
  • Streaming endpoint generate_stream demonstrates streaming-style responses
