LLM Inference
LLM Inference Service Example
A complete example of deploying an LLM inference service on Modal using vLLM.
```python
import os

import modal

# --- Configuration ---
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
GPU_TYPE = "A100"

# --- Image Definition ---
image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "vllm==0.6.0",
        "torch==2.4.0",
        "transformers",
        "huggingface_hub[hf_transfer]",
    )
    .env({
        "HF_HUB_ENABLE_HF_TRANSFER": "1",
        "VLLM_ATTENTION_BACKEND": "FLASH_ATTN",
    })
)

app = modal.App("llm-inference", image=image)

# --- Model Cache Volume ---
model_volume = modal.Volume.from_name("llm-model-cache", create_if_missing=True)
MODEL_PATH = "/models"

# --- Download Model (Build Step) ---
@app.function(
    volumes={MODEL_PATH: model_volume},
    secrets=[modal.Secret.from_name("huggingface-secret")],
    timeout=3600,
)
def download_model():
    from huggingface_hub import snapshot_download

    snapshot_download(
        MODEL_NAME,
        local_dir=f"{MODEL_PATH}/{MODEL_NAME}",
        token=os.environ["HF_TOKEN"],
    )
    model_volume.commit()

# --- Inference Service ---
@app.cls(
    gpu=GPU_TYPE,
    volumes={MODEL_PATH: model_volume},
    container_idle_timeout=300,  # Keep warm for 5 minutes
    allow_concurrent_inputs=10,
)
class LLMService:
    @modal.enter()
    def load_model(self):
        from vllm import LLM

        self.llm = LLM(
            model=f"{MODEL_PATH}/{MODEL_NAME}",
            tensor_parallel_size=1,
            gpu_memory_utilization=0.9,
        )

    @modal.method()
    def generate(
        self,
        prompt: str,
        max_tokens: int = 256,
        temperature: float = 0.7,
    ) -> str:
        from vllm import SamplingParams

        params = SamplingParams(
            temperature=temperature,
            max_tokens=max_tokens,
        )
        outputs = self.llm.generate([prompt], params)
        return outputs[0].outputs[0].text

    @modal.method()
    def generate_batch(
        self,
        prompts: list[str],
        max_tokens: int = 256,
        temperature: float = 0.7,
    ) -> list[str]:
        from vllm import SamplingParams

        params = SamplingParams(
            temperature=temperature,
            max_tokens=max_tokens,
        )
        outputs = self.llm.generate(prompts, params)
        return [o.outputs[0].text for o in outputs]

# --- Web API ---
@app.function()
@modal.fastapi_endpoint(method="POST", docs=True)
def generate(body: dict) -> dict:
    service = LLMService()
    result = service.generate.remote(
        prompt=body["prompt"],
        max_tokens=body.get("max_tokens", 256),
        temperature=body.get("temperature", 0.7),
    )
    return {"response": result}

# --- Streaming Endpoint ---
@app.function()
@modal.fastapi_endpoint(method="POST")
async def generate_stream(body: dict):
    from fastapi.responses import StreamingResponse

    # For true token streaming you'd use vLLM's async engine;
    # this simplified example generates the full response first.
    service = LLMService()
    result = service.generate.remote(
        prompt=body["prompt"],
        max_tokens=body.get("max_tokens", 256),
    )

    async def stream():
        # In production, stream tokens from vLLM instead of splitting the result
        for token in result.split():
            yield f"data: {token}\n\n"

    return StreamingResponse(stream(), media_type="text/event-stream")

# --- CLI ---
@app.local_entrypoint()
def main(prompt: str = "Explain quantum computing in simple terms."):
    print(f"Prompt: {prompt}\n")
    service = LLMService()
    response = service.generate.remote(prompt)
    print(f"Response:\n{response}")
```
Usage

```bash
# Download model first
modal run llm_service.py::download_model

# Test locally
modal run llm_service.py --prompt "What is the meaning of life?"

# Deploy
modal deploy llm_service.py

# Call API
curl -X POST https://your-workspace--llm-inference-generate.modal.run \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "max_tokens": 100}'
```
Source
https://github.com/samarth777/modal-skills/blob/main/skills/llm-inference/SKILL.md
Overview
This skill demonstrates a complete end-to-end LLM inference service on Modal using vLLM. It covers image setup, a persistent model cache, a loadable LLM service, and a web API with generation endpoints for single and batch prompts.
How This Skill Works
The solution builds a Modal app image that installs vLLM, Torch, and Transformers, downloads the model into a shared volume, and loads it with vLLM in a service class. The class exposes generate and generate_batch methods for prompt-based generation, plus a FastAPI endpoint for REST access and an optional streaming path.
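The REST endpoint reads its parameters from the request body with defaults, so only the prompt is required. A minimal sketch of that parsing logic (the parse_body helper is illustrative, not part of the skill itself):

```python
def parse_body(body: dict) -> tuple[str, int, float]:
    """Mirror the endpoint's body handling: only 'prompt' is required;
    max_tokens and temperature fall back to the service defaults."""
    return (
        body["prompt"],
        body.get("max_tokens", 256),
        body.get("temperature", 0.7),
    )

# A request that omits optional fields uses the defaults
prompt, max_tokens, temperature = parse_body({"prompt": "Hello"})
```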
When to Use It
- You want a hosted LLM inference service on Modal using a pre-downloaded model such as Llama 3.1 8B Instruct
- You need fast cold starts via persistent model storage on a shared volume
- You require a REST API for single and batch prompt generation
- You want a CLI harness for local testing before production
- You need an end-to-end example covering model download, loading, and inference
Quick Start
- Step 1: Build an image that installs vLLM, torch, transformers and enables HF transfer
- Step 2: Create the model cache volume and run the download_model build step to populate /models
- Step 3: Deploy the app and call the API or test locally with the CLI prompt
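Once deployed, the web endpoint is reachable at a predictable URL of the shape shown in the curl example under Usage (workspace--app-function.modal.run). A tiny hypothetical helper for building that URL, assuming the default label scheme:

```python
def endpoint_url(workspace: str, app_name: str, function_name: str) -> str:
    """Build a Modal web endpoint URL following the
    <workspace>--<app>-<function>.modal.run shape from the Usage example.
    (Illustrative helper; not part of the skill.)"""
    return f"https://{workspace}--{app_name}-{function_name}.modal.run"

url = endpoint_url("your-workspace", "llm-inference", "generate")
```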
Best Practices
- Pin exact versions for vLLM, torch, and transformers to avoid compatibility issues
- Use a persistent model volume and commit after model download to speed cold starts
- Tune tensor_parallel_size and gpu_memory_utilization for your GPU constraints
- Store HF token in a Modal secret and enable HF transfer in the environment
- Test both single and batch generation and consider streaming for latency-sensitive apps
Example Use Cases
- LLMService.load_model loads the model from the shared volume with vLLM
- generate handles a single prompt with configurable max_tokens and temperature
- generate_batch processes multiple prompts and returns a list of texts
- Web API endpoint generate exposes generation via a FastAPI function
- Streaming endpoint generate_stream demonstrates streaming-style responses
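The streaming endpoint frames each token as a server-sent event (`data: <token>` followed by a blank line). A minimal sketch of that framing and of how a client could parse the stream back into tokens (both functions are illustrative, not part of the skill):

```python
def to_sse_events(tokens: list[str]) -> str:
    # Same per-token framing the generate_stream endpoint yields
    return "".join(f"data: {t}\n\n" for t in tokens)

def parse_sse_events(payload: str) -> list[str]:
    # Recover the tokens a client would read from the event stream
    return [
        line[len("data: "):]
        for line in payload.split("\n")
        if line.startswith("data: ")
    ]

events = to_sse_events(["Hello", "world"])
```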