
LLM Inference

npx machina-cli add skill samarth777/modal-skills/llm-inference --openclaw

LLM Inference Service Example

A complete example of deploying an LLM inference service on Modal using vLLM.

import modal

# --- Configuration ---
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
GPU_TYPE = "A100"

# --- Image Definition ---
image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "vllm==0.6.0",
        "torch==2.4.0",
        "transformers",
        "huggingface_hub[hf_transfer]",
    )
    .env({
        "HF_HUB_ENABLE_HF_TRANSFER": "1",
        "VLLM_ATTENTION_BACKEND": "FLASH_ATTN",
    })
)

app = modal.App("llm-inference", image=image)

# --- Model Cache Volume ---
model_volume = modal.Volume.from_name("llm-model-cache", create_if_missing=True)
MODEL_PATH = "/models"

# --- Download Model (Build Step) ---
@app.function(
    volumes={MODEL_PATH: model_volume},
    secrets=[modal.Secret.from_name("huggingface-secret")],
    timeout=3600,
)
def download_model():
    import os

    from huggingface_hub import snapshot_download
    
    snapshot_download(
        MODEL_NAME,
        local_dir=f"{MODEL_PATH}/{MODEL_NAME}",
        token=os.environ["HF_TOKEN"],
    )
    model_volume.commit()

# --- Inference Service ---
@app.cls(
    gpu=GPU_TYPE,
    volumes={MODEL_PATH: model_volume},
    container_idle_timeout=300,  # Keep warm for 5 minutes
    allow_concurrent_inputs=10,
)
class LLMService:
    @modal.enter()
    def load_model(self):
        from vllm import LLM
        
        self.llm = LLM(
            model=f"{MODEL_PATH}/{MODEL_NAME}",
            tensor_parallel_size=1,
            gpu_memory_utilization=0.9,
        )
    
    @modal.method()
    def generate(
        self,
        prompt: str,
        max_tokens: int = 256,
        temperature: float = 0.7,
    ) -> str:
        from vllm import SamplingParams
        
        params = SamplingParams(
            temperature=temperature,
            max_tokens=max_tokens,
        )
        
        outputs = self.llm.generate([prompt], params)
        return outputs[0].outputs[0].text
    
    @modal.method()
    def generate_batch(
        self,
        prompts: list[str],
        max_tokens: int = 256,
        temperature: float = 0.7,
    ) -> list[str]:
        from vllm import SamplingParams
        
        params = SamplingParams(
            temperature=temperature,
            max_tokens=max_tokens,
        )
        
        outputs = self.llm.generate(prompts, params)
        return [o.outputs[0].text for o in outputs]

# --- Web API ---
@app.function()
@modal.fastapi_endpoint(method="POST", docs=True)
def generate(body: dict) -> dict:
    service = LLMService()
    
    result = service.generate.remote(
        prompt=body["prompt"],
        max_tokens=body.get("max_tokens", 256),
        temperature=body.get("temperature", 0.7),
    )
    
    return {"response": result}

# --- Streaming Endpoint ---
@app.function()
@modal.fastapi_endpoint(method="POST")
async def generate_stream(body: dict):
    from fastapi.responses import StreamingResponse
    
    # For streaming, you'd use vLLM's async engine
    # This is a simplified example
    service = LLMService()
    result = service.generate.remote(
        prompt=body["prompt"],
        max_tokens=body.get("max_tokens", 256),
    )
    
    async def stream():
        # In production, use vLLM's streaming
        for token in result.split():
            yield f"data: {token}\n\n"
    
    return StreamingResponse(stream(), media_type="text/event-stream")

# --- CLI ---
@app.local_entrypoint()
def main(prompt: str = "Explain quantum computing in simple terms."):
    print(f"Prompt: {prompt}\n")
    
    service = LLMService()
    response = service.generate.remote(prompt)
    
    print(f"Response:\n{response}")

Usage

# Download model first
modal run llm_service.py::download_model

# Test locally
modal run llm_service.py --prompt "What is the meaning of life?"

# Deploy
modal deploy llm_service.py

# Call API
curl -X POST https://your-workspace--llm-inference-generate.modal.run \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "max_tokens": 100}'
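The curl call above can also be made from Python. A minimal client sketch using only the standard library; the endpoint URL is the same placeholder as in the curl example and must be replaced with your own deployment's URL:

```python
import json
import urllib.request

# Placeholder: substitute the URL printed by `modal deploy` for your workspace.
ENDPOINT = "https://your-workspace--llm-inference-generate.modal.run"

def build_payload(prompt: str, max_tokens: int = 256, temperature: float = 0.7) -> dict:
    """Mirror the JSON body schema the generate endpoint expects."""
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}

def call_generate(prompt: str, **kwargs) -> str:
    """POST a prompt to the deployed endpoint and return the generated text."""
    data = json.dumps(build_payload(prompt, **kwargs)).encode()
    req = urllib.request.Request(
        ENDPOINT,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Usage: `call_generate("Hello, how are you?", max_tokens=100)` performs the same request as the curl command and returns the `response` field of the JSON reply.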

Source

https://github.com/samarth777/modal-skills/blob/main/skills/llm-inference/SKILL.md

Overview

This skill demonstrates a complete end-to-end LLM inference service on Modal using vLLM. It covers image setup, a persistent model cache, a loadable LLM service, and a web API with generation endpoints for single and batch prompts.

How This Skill Works

The solution builds a Modal image that installs vLLM, torch, and transformers, downloads the model into a shared volume, and then loads it with vLLM in a service class. It exposes generate and generate_batch methods for prompt-based generation and a FastAPI endpoint for REST access, with an optional streaming path.
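The streaming endpoint emits server-sent events of the form `data: <token>` separated by blank lines. A small client-side parsing helper (a sketch, not part of the skill itself) shows how to recover the tokens from the raw event stream:

```python
def parse_sse_tokens(raw: str) -> list[str]:
    """Extract the payload of each `data: ...` event from an SSE stream."""
    tokens = []
    for event in raw.split("\n\n"):
        event = event.strip()
        if event.startswith("data: "):
            tokens.append(event[len("data: "):])
    return tokens

# Events in the format generate_stream yields them:
stream = "data: Quantum\n\ndata: computing\n\ndata: explained\n\n"
# parse_sse_tokens(stream) -> ["Quantum", "computing", "explained"]
```

A production client would read the response body incrementally (e.g. with httpx streaming) rather than buffering the whole stream, but the event format is the same.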

When to Use It

  • You want a hosted LLM inference service on Modal using a pre-downloaded model such as Llama 3.1 8B Instruct
  • You need fast cold starts via persistent model storage on a shared volume
  • You require a REST API for single and batch prompt generation
  • You want a CLI entrypoint for local testing before deploying to production
  • You need an end-to-end example covering model download, loading, and inference

Quick Start

  1. Build an image that installs vLLM, torch, and transformers, and enables HF transfer
  2. Create the model cache volume and run the download_model build step to populate /models
  3. Deploy the app and call the API, or test locally with the CLI prompt

Best Practices

  • Pin exact versions for vLLM, torch, and transformers to avoid compatibility issues
  • Use a persistent model volume and commit it after the model download to speed up cold starts
  • Tune tensor_parallel_size and gpu_memory_utilization for your GPU constraints
  • Store HF token in a Modal secret and enable HF transfer in the environment
  • Test both single and batch generation and consider streaming for latency-sensitive apps
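The web endpoints above pass user-supplied `max_tokens` and `temperature` straight into `SamplingParams`. A small clamping helper (hypothetical, not part of the skill) illustrates one way to keep those values in safe ranges before they reach the model:

```python
def clamp_params(body: dict, max_tokens_limit: int = 2048) -> dict:
    """Clamp user-supplied generation parameters to safe ranges.

    The limits here (1..2048 tokens, 0.0..2.0 temperature) are
    illustrative defaults, not values prescribed by the skill.
    """
    max_tokens = int(body.get("max_tokens", 256))
    temperature = float(body.get("temperature", 0.7))
    return {
        "prompt": body["prompt"],
        "max_tokens": min(max(max_tokens, 1), max_tokens_limit),
        "temperature": min(max(temperature, 0.0), 2.0),
    }

# clamp_params({"prompt": "hi", "max_tokens": 999999})
# -> {"prompt": "hi", "max_tokens": 2048, "temperature": 0.7}
```

Applying such a helper at the top of the generate endpoint bounds GPU time per request without changing the endpoint's body schema.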

Example Use Cases

  • LLMService.load_model loads the model from the shared volume with vLLM
  • generate handles a single prompt with configurable max_tokens and temperature
  • generate_batch processes multiple prompts and returns a list of texts
  • Web API endpoint generate exposes generation via a FastAPI function
  • Streaming endpoint generate_stream demonstrates streaming-style responses
