model-serving
Purpose
Deploy LLM and ML models for production inference with optimized serving engines, streaming response patterns, and orchestration frameworks. Focuses on self-hosted model serving, GPU optimization, and integration with frontend applications.
When to Use
- Deploying LLMs for production (self-hosted Llama, Mistral, Qwen)
- Building AI APIs with streaming responses
- Serving traditional ML models (scikit-learn, XGBoost, PyTorch)
- Implementing RAG pipelines with vector databases
- Optimizing inference throughput and latency
- Integrating LLM serving with frontend chat interfaces
Model Serving Selection
LLM Serving Engines
vLLM (Recommended Primary)
- PagedAttention memory management (20-30x throughput improvement)
- Continuous batching for dynamic request handling
- OpenAI-compatible API endpoints
- Use for: Most self-hosted LLM deployments
TensorRT-LLM
- Maximum GPU efficiency (2-8x faster than vLLM)
- Requires model conversion and optimization
- Use for: Production workloads needing absolute maximum throughput
Ollama
- Local development without GPUs
- Simple CLI interface
- Use for: Prototyping, laptop development, educational purposes
Decision Framework:
Self-hosted LLM deployment needed?
├─ Yes, need maximum throughput → vLLM
├─ Yes, need absolute max GPU efficiency → TensorRT-LLM
├─ Yes, local development only → Ollama
└─ No, use managed API (OpenAI, Anthropic) → No serving layer needed
ML Model Serving (Non-LLM)
BentoML (Recommended)
- Python-native, easy deployment
- Adaptive batching for throughput
- Multi-framework support (scikit-learn, PyTorch, XGBoost)
- Use for: Most traditional ML model deployments
Triton Inference Server
- Multi-model serving on same GPU
- Model ensembles (chain multiple models)
- Use for: NVIDIA GPU optimization, serving 10+ models
LLM Orchestration
LangChain
- General-purpose workflows, agents, RAG
- 100+ integrations (LLMs, vector DBs, tools)
- Use for: Most RAG and agent applications
LlamaIndex
- RAG-focused with advanced retrieval strategies
- 100+ data connectors (PDF, Notion, web)
- Use for: Applications where RAG is the primary use case
Quick Start Examples
vLLM Server Setup
# Install
pip install vllm
# Serve a model (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--dtype auto \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--port 8000
Key Parameters:
- --dtype: Model precision (auto, float16, bfloat16)
- --max-model-len: Context window size
- --gpu-memory-utilization: GPU memory fraction (0.8-0.95)
- --tensor-parallel-size: Number of GPUs for model parallelism
Streaming Responses (SSE Pattern)
Backend (FastAPI):
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel
import json

app = FastAPI()
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

class ChatRequest(BaseModel):
    message: str

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):  # JSON body: {"message": "..."}
    def generate():
        stream = client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": request.message}],
            stream=True,
            max_tokens=512,
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                token = chunk.choices[0].delta.content
                yield f"data: {json.dumps({'token': token})}\n\n"
        yield f"data: {json.dumps({'done': True})}\n\n"
    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache"},
    )
Frontend (React):
// Integration with ai-chat skill
const sendMessage = async (message: string) => {
  const response = await fetch('/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message })
  })
  const reader = response.body!.getReader()
  const decoder = new TextDecoder()
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    const chunk = decoder.decode(value)
    const lines = chunk.split('\n\n')
    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = JSON.parse(line.slice(6))
        if (data.token) {
          setResponse(prev => prev + data.token)
        }
      }
    }
  }
}
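The SSE wire format above can be exercised without a browser. As a sketch, the same parsing the React loop performs (splitting on blank lines, stripping the `data: ` prefix) in Python; the `parse_sse_chunk` helper is ours, not part of any library:

```python
import json

def parse_sse_chunk(chunk: str) -> list[dict]:
    """Parse a raw SSE text chunk into the JSON payloads of its data events."""
    events = []
    for block in chunk.split("\n\n"):
        for line in block.splitlines():
            if line.startswith("data: "):
                events.append(json.loads(line[len("data: "):]))
    return events

# Two token events followed by the done sentinel, as the backend emits them
raw = 'data: {"token": "Hel"}\n\ndata: {"token": "lo"}\n\ndata: {"done": true}\n\n'
print(parse_sse_chunk(raw))  # [{'token': 'Hel'}, {'token': 'lo'}, {'done': True}]
```

Note that real reads may split an event across chunks; a production client buffers partial lines, which the browser-side loop above glosses over.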
BentoML Service
import bentoml
import numpy as np

@bentoml.service(
    resources={"cpu": "2", "memory": "4Gi"},
    traffic={"timeout": 10}
)
class IrisClassifier:
    model_ref = bentoml.models.get("iris_classifier:latest")

    def __init__(self):
        self.model = bentoml.sklearn.load_model(self.model_ref)

    @bentoml.api(batchable=True, max_batch_size=32)
    def classify(self, features: list[dict]) -> list[str]:
        labels = ['setosa', 'versicolor', 'virginica']
        X = np.array([[f['sepal_length'], f['sepal_width'],
                       f['petal_length'], f['petal_width']] for f in features])
        predictions = self.model.predict(X)
        return [labels[p] for p in predictions]
LangChain RAG Pipeline
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Qdrant
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load and chunk documents (`documents` loaded earlier, e.g. via a document loader)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    url="http://localhost:6333",
    collection_name="docs"
)

# Create retrieval chain
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Query
result = qa_chain.invoke({"query": "What is PagedAttention?"})
Performance Optimization
GPU Memory Estimation
Rule of thumb for LLMs:
GPU Memory (GB) = Model Parameters (B) × Precision (bytes) × 1.2
Examples:
- Llama-3.1-8B (FP16): 8B × 2 bytes × 1.2 = 19.2 GB
- Llama-3.1-70B (FP16): 70B × 2 bytes × 1.2 = 168 GB (requires 2-4 A100s)
Quantization reduces memory:
- FP16: 2 bytes per parameter
- INT8: 1 byte per parameter (2x memory reduction)
- INT4: 0.5 bytes per parameter (4x memory reduction)
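The rule of thumb above translates directly into a small helper; the function name and the ~20% overhead factor are taken from the formula, not from any library:

```python
def estimate_gpu_memory_gb(params_billion: float, bytes_per_param: float,
                           overhead: float = 1.2) -> float:
    """GPU memory = parameters x precision x ~20% overhead (KV cache, activations)."""
    return params_billion * bytes_per_param * overhead

# FP16 Llama-3.1-8B: 8 x 2 x 1.2
print(estimate_gpu_memory_gb(8, 2))    # 19.2
# INT4 (e.g. AWQ) cuts the same model to 8 x 0.5 x 1.2
print(estimate_gpu_memory_gb(8, 0.5))  # 4.8
```

This is a planning estimate only; actual usage also depends on context length and batch size (vLLM preallocates KV-cache space up to --gpu-memory-utilization).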
vLLM Optimization
# Enable quantization (AWQ for 4-bit)
vllm serve TheBloke/Llama-3.1-8B-AWQ \
--quantization awq \
--gpu-memory-utilization 0.9
# Multi-GPU deployment (tensor parallelism)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9
Batching Strategies
Continuous batching (vLLM default):
- Dynamically adds/removes requests from batch
- Higher throughput than static batching
- No configuration needed
Adaptive batching (BentoML):
@bentoml.api(
    batchable=True,
    max_batch_size=32,
    max_latency_ms=1000  # Wait max 1s to fill batch
)
def predict(self, inputs: list[np.ndarray]) -> list[float]:
    # BentoML automatically batches requests
    return self.model.predict(np.array(inputs))
Production Deployment
Kubernetes Deployment
See examples/k8s-vllm-deployment/ for complete YAML manifests.
Key considerations:
- GPU resource requests: nvidia.com/gpu: 1
- Health checks: /health endpoint
- Horizontal Pod Autoscaling based on queue depth
- Persistent volume for model caching
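A minimal sketch of those considerations as a container spec; the image tag, port, and cache path are illustrative placeholders, and the bundled example has the full manifests:

```yaml
containers:
  - name: vllm
    image: vllm/vllm-openai:latest  # pin a specific version in production
    ports:
      - containerPort: 8000
    resources:
      limits:
        nvidia.com/gpu: 1
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
    volumeMounts:
      - name: model-cache
        mountPath: /root/.cache/huggingface  # avoids re-downloading weights on restart
```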
API Gateway Pattern
For production, add rate limiting, authentication, and monitoring:
Kong Configuration:
services:
  - name: vllm-service
    url: http://vllm-llama-8b:8000
    plugins:
      - name: rate-limiting
        config:
          minute: 60  # 60 requests per minute per API key
      - name: key-auth
      - name: prometheus
Monitoring Metrics
Essential LLM metrics:
- Tokens per second (throughput)
- Time to first token (TTFT)
- Inter-token latency
- GPU utilization and memory
- Queue depth
Prometheus instrumentation:
import time
from prometheus_client import Counter, Histogram

requests_total = Counter('llm_requests_total', 'Total requests')
tokens_generated = Counter('llm_tokens_generated', 'Total tokens')
request_duration = Histogram('llm_request_duration_seconds', 'Request duration')

@app.post("/chat")
async def chat(request):
    requests_total.inc()
    start = time.time()
    response = await generate(request)
    tokens_generated.inc(len(response.tokens))
    request_duration.observe(time.time() - start)
    return response
Integration Patterns
Frontend (ai-chat) Integration
This skill provides the backend serving layer for the ai-chat skill.
Flow:
Frontend (React) → API Gateway → vLLM Server → GPU Inference
        ↑                                               │
        └──────────── SSE Stream (tokens) ──────────────┘
See references/streaming-sse.md for complete implementation patterns.
RAG with Vector Databases
Architecture:
User Query → LangChain
├─> Vector DB (Qdrant) for retrieval
├─> Combine context + query
└─> LLM (vLLM) for generation
See references/langchain-orchestration.md and examples/langchain-rag-qdrant/ for complete patterns.
Async Inference Queue
For batch processing or non-real-time inference:
Client → API → Message Queue (Celery) → Workers (vLLM) → Results DB
Useful for:
- Batch document processing
- Background summarization
- Non-interactive workflows
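Before committing to Celery, the queue pattern can be sketched in-process with the standard library; `run_workers` and the stubbed `infer` callable are ours, standing in for a broker and an HTTP call to vLLM:

```python
import queue
import threading

def run_workers(jobs, infer, num_workers=2):
    """Fan (job_id, prompt) jobs out to worker threads; collect results by id."""
    q = queue.Queue()
    results = {}
    for job in jobs:
        q.put(job)

    def worker():
        while True:
            try:
                job_id, prompt = q.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            results[job_id] = infer(prompt)  # production: POST to the vLLM endpoint
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# str.upper stubs the inference call so the pattern runs anywhere
out = run_workers([(1, "summarize doc A"), (2, "summarize doc B")], infer=str.upper)
print(out[1])  # SUMMARIZE DOC A
```

Celery adds persistence, retries, and multi-host workers on top of this shape; the control flow is the same.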
Benchmarking
Use scripts/benchmark_inference.py to measure the deployment:
python scripts/benchmark_inference.py \
--endpoint http://localhost:8000/v1/chat/completions \
--model meta-llama/Llama-3.1-8B-Instruct \
--concurrency 32 \
--requests 1000
Outputs:
- Requests per second
- P50/P95/P99 latency
- Tokens per second
- GPU memory usage
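The latency percentiles in that report can be reproduced from raw samples with the standard library; the helper name is ours:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict:
    """Compute P50/P95/P99 from a list of request latencies in milliseconds."""
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

# A toy sample with two slow outliers, as tail-latency reports typically show
latencies = [10, 12, 11, 13, 50, 12, 11, 10, 14, 200]
print(latency_percentiles(latencies))  # {'p50': 12.0, 'p95': 132.5, 'p99': 186.5}
```

Note how two outliers barely move P50 but dominate P95/P99, which is why LLM dashboards track all three.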
Bundled Resources
Detailed Guides:
- references/vllm.md - vLLM setup, PagedAttention, optimization
- references/tgi.md - Text Generation Inference patterns
- references/bentoml.md - BentoML deployment patterns
- references/langchain-orchestration.md - LangChain RAG and agents
- references/inference-optimization.md - Quantization, batching, GPU tuning
Working Examples:
- examples/vllm-serving/ - Complete vLLM + FastAPI streaming setup
- examples/ollama-local/ - Local development with Ollama
- examples/langchain-agents/ - LangChain agent patterns
Utility Scripts:
- scripts/benchmark_inference.py - Throughput and latency benchmarking
- scripts/validate_model_config.py - Validate deployment configurations
Common Patterns
Migration from OpenAI API
vLLM provides OpenAI-compatible endpoints for easy migration:
# Before (OpenAI)
from openai import OpenAI
client = OpenAI(api_key="sk-...")
# After (vLLM)
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed"
)
# Same API calls work!
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello"}]
)
Multi-Model Serving
Route requests to different models based on task:
MODEL_ROUTING = {
    "small": "meta-llama/Llama-3.1-8B-Instruct",   # Fast, cheap
    "large": "meta-llama/Llama-3.1-70B-Instruct",  # Accurate, expensive
    "code": "codellama/CodeLlama-34b-Instruct"     # Code-specific
}

@app.post("/chat")
async def chat(message: str, task: str = "small"):
    model = MODEL_ROUTING[task]
    # Route to appropriate vLLM instance
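One way to complete that routing is a static map from model name to the vLLM instance serving it; the internal host names below are hypothetical, and `resolve_endpoint` is our helper:

```python
MODEL_ROUTING = {
    "small": "meta-llama/Llama-3.1-8B-Instruct",
    "large": "meta-llama/Llama-3.1-70B-Instruct",
    "code": "codellama/CodeLlama-34b-Instruct",
}

# Hypothetical per-model vLLM instances behind internal host names
INSTANCE_URLS = {
    "meta-llama/Llama-3.1-8B-Instruct": "http://vllm-8b:8000/v1",
    "meta-llama/Llama-3.1-70B-Instruct": "http://vllm-70b:8000/v1",
    "codellama/CodeLlama-34b-Instruct": "http://vllm-code:8000/v1",
}

def resolve_endpoint(task: str) -> tuple[str, str]:
    """Map a task label to (model, base_url), defaulting to the small tier."""
    model = MODEL_ROUTING.get(task, MODEL_ROUTING["small"])
    return model, INSTANCE_URLS[model]

print(resolve_endpoint("code")[1])  # http://vllm-code:8000/v1
```

The handler then builds an OpenAI client with the resolved base_url; defaulting unknown tasks to the small tier keeps cost predictable.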
Cost Optimization
Track token usage:
import tiktoken

def estimate_cost(text: str, model: str, price_per_1k: float):
    encoding = tiktoken.encoding_for_model(model)
    tokens = len(encoding.encode(text))
    return (tokens / 1000) * price_per_1k

# Compare costs
openai_cost = estimate_cost(text, "gpt-4o", 0.005)  # $5 per 1M tokens
self_hosted_cost = 0  # Fixed GPU cost, unlimited tokens
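To make the self-hosting comparison concrete, a break-even sketch: the monthly GPU figure below is an assumption for illustration, not a quoted price.

```python
def breakeven_tokens_per_month(gpu_cost_per_month: float, api_price_per_1k: float) -> float:
    """Monthly token volume at which a fixed GPU cost matches per-token API billing."""
    return gpu_cost_per_month / api_price_per_1k * 1000

# Assuming ~$1200/month for a rented GPU vs $0.005 per 1K tokens
print(f"{breakeven_tokens_per_month(1200, 0.005):,.0f}")  # 240,000,000
```

Below that volume the managed API is cheaper; above it, the fixed GPU cost amortizes in your favor (ignoring ops overhead, which cuts the other way).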
Troubleshooting
Out of GPU memory:
- Reduce --max-model-len
- Lower --gpu-memory-utilization (try 0.8)
- Enable quantization (--quantization awq)
- Use smaller model variant

Low throughput:
- Increase --gpu-memory-utilization (try 0.95)
- Enable continuous batching (vLLM default)
- Check GPU utilization (should be >80%)
- Consider tensor parallelism for multi-GPU

High latency:
- Reduce batch size if using static batching
- Check network latency to GPU server
- Profile with scripts/benchmark_inference.py
Next Steps
- Local Development: Start with examples/ollama-local/ for GPU-free testing
- Production Setup: Deploy vLLM with examples/vllm-serving/
- RAG Integration: Add vector DB with examples/langchain-rag-qdrant/
- Kubernetes: Scale with examples/k8s-vllm-deployment/
- Monitoring: Add metrics with Prometheus and Grafana
Source
Repository: https://github.com/ancoleman/ai-design-components (skills/model-serving/SKILL.md)
Overview
Model-serving enables self-hosted deployment of LLMs and traditional ML models for production inference with optimized engines, streaming responses, and orchestration. It covers vLLM, TensorRT-LLM, Ollama, BentoML, and Triton, plus RAG tooling with LangChain and LlamaIndex to build AI APIs and scalable pipelines.
How This Skill Works
Choose a serving engine based on workload: vLLM for high-throughput LLM serving, TensorRT-LLM for maximum GPU efficiency, or Ollama for local development. Deploy models behind APIs, enable streaming responses, and connect with orchestration and RAG tools like LangChain and LlamaIndex to compose multi-model pipelines.
When to Use It
- Deploying LLMs in production with self-hosted infrastructure
- Building AI APIs or chat interfaces with streaming responses
- Serving traditional ML models (scikit-learn, PyTorch, XGBoost)
- Implementing RAG pipelines with vector databases and retrieval
- Optimizing inference throughput and latency for multi-model workloads
Quick Start
- Step 1: Install and pick a serving engine (e.g., vLLM for LLMs, BentoML for traditional models)
- Step 2: Start the server with a sample model, e.g., vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype auto --max-model-len 4096 --gpu-memory-utilization 0.9 --port 8000
- Step 3: Connect your frontend/API and test streaming responses or RAG workflows with LangChain or LlamaIndex
Best Practices
- Match the engine to the workload: vLLM for throughput, TensorRT-LLM for peak GPU efficiency, Ollama for local prototyping
- Leverage BentoML for Python-native deployment across frameworks
- Use streaming responses to reduce perceived latency in AI APIs
- Combine LangChain and LlamaIndex for RAG and agent workflows
- Monitor latency, throughput, and resource utilization; test with realistic workloads
Example Use Cases
- Deploy a self-hosted LLM with vLLM serving an OpenAI-compatible endpoint
- Achieve maximum throughput on GPU using TensorRT-LLM for production workloads
- Prototype locally with Ollama on laptops or without GPUs
- Deploy traditional ML models (scikit-learn, XGBoost, PyTorch) via BentoML
- Build RAG pipelines using LangChain/LlamaIndex with vector databases and multi-model ensembles