serving-llms-vllm
Scannednpx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/vllm --openclawvLLM - High-Performance LLM Serving
Quick start
vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).
Installation:
pip install vllm
Basic offline inference:
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
OpenAI-compatible server:
vllm serve meta-llama/Llama-3-8B-Instruct
# Query with OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
model='meta-llama/Llama-3-8B-Instruct',
messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
Common workflows
Workflow 1: Production API deployment
Copy this checklist and track progress:
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics
Step 1: Configure server settings
Choose configuration based on your model size:
# For 7B-13B models on single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--port 8000
# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--quantization awq \
--port 8000
# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching \
--enable-metrics \
--metrics-port 9090 \
--port 8000 \
--host 0.0.0.0
Step 2: Test with limited traffic
Run load test before production:
# Install load testing tool
pip install locust
# Create test_load.py with sample requests
# Run: locust -f test_load.py --host http://localhost:8000
Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
Step 3: Enable monitoring
vLLM exposes Prometheus metrics on port 9090:
curl http://localhost:9090/metrics | grep vllm
Key metrics to monitor:
vllm:time_to_first_token_seconds- Latencyvllm:num_requests_running- Active requestsvllm:gpu_cache_usage_perc- KV cache utilization
Step 4: Deploy to production
Use Docker for consistent deployment:
# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching
Step 5: Verify performance metrics
Check that deployment meets targets:
- TTFT < 500ms (for short prompts)
- Throughput > target req/sec
- GPU utilization > 80%
- No OOM errors in logs
Workflow 2: Offline batch inference
For processing large datasets without server overhead.
Copy this checklist:
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results
Step 1: Prepare input data
# Load prompts from file
prompts = []
with open("prompts.txt") as f:
prompts = [line.strip() for line in f]
print(f"Loaded {len(prompts)} prompts")
Step 2: Configure LLM engine
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3-8B-Instruct",
tensor_parallel_size=2, # Use 2 GPUs
gpu_memory_utilization=0.9,
max_model_len=4096
)
sampling = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=512,
stop=["</s>", "\n\n"]
)
Step 3: Run batch inference
vLLM automatically batches requests for efficiency:
# Process all prompts in one call
outputs = llm.generate(prompts, sampling)
# vLLM handles batching internally
# No need to manually chunk prompts
Step 4: Process results
# Extract generated text
results = []
for output in outputs:
prompt = output.prompt
generated = output.outputs[0].text
results.append({
"prompt": prompt,
"generated": generated,
"tokens": len(output.outputs[0].token_ids)
})
# Save to file
import json
with open("results.jsonl", "w") as f:
for result in results:
f.write(json.dumps(result) + "\n")
print(f"Processed {len(results)} prompts")
Workflow 3: Quantized model serving
Fit large models in limited GPU memory.
Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy
Step 1: Choose quantization method
- AWQ: Best for 70B models, minimal accuracy loss
- GPTQ: Wide model support, good compression
- FP8: Fastest on H100 GPUs
Step 2: Find or create quantized model
Use pre-quantized models from HuggingFace:
# Search for AWQ models
# Example: TheBloke/Llama-2-70B-AWQ
Step 3: Launch with quantization flag
# Using pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95
# Results: 70B model in ~40GB VRAM
Step 4: Verify accuracy
Test outputs match expected quality:
# Compare quantized vs non-quantized responses
# Verify task-specific performance unchanged
When to use vs alternatives
Use vLLM when:
- Deploying production LLM APIs (100+ req/sec)
- Serving OpenAI-compatible endpoints
- Limited GPU memory but need large models
- Multi-user applications (chatbots, assistants)
- Need low latency with high throughput
Use alternatives instead:
- llama.cpp: CPU/edge inference, single-user
- HuggingFace transformers: Research, prototyping, one-off generation
- TensorRT-LLM: NVIDIA-only, need absolute maximum performance
- Text-Generation-Inference: Already in HuggingFace ecosystem
Common issues
Issue: Out of memory during model loading
Reduce memory usage:
vllm serve MODEL \
--gpu-memory-utilization 0.7 \
--max-model-len 4096
Or use quantization:
vllm serve MODEL --quantization awq
Issue: Slow first token (TTFT > 1 second)
Enable prefix caching for repeated prompts:
vllm serve MODEL --enable-prefix-caching
For long prompts, enable chunked prefill:
vllm serve MODEL --enable-chunked-prefill
Issue: Model not found error
Use --trust-remote-code for custom models:
vllm serve MODEL --trust-remote-code
Issue: Low throughput (<50 req/sec)
Increase concurrent sequences:
vllm serve MODEL --max-num-seqs 512
Check GPU utilization with nvidia-smi - should be >80%.
Issue: Inference slower than expected
Verify tensor parallelism uses power of 2 GPUs:
vllm serve MODEL --tensor-parallel-size 4 # Not 3
Enable speculative decoding for faster generation:
vllm serve MODEL --speculative-model DRAFT_MODEL
Advanced topics
Server deployment patterns: See references/server-deployment.md for Docker, Kubernetes, and load balancing configurations.
Performance optimization: See references/optimization.md for PagedAttention tuning, continuous batching details, and benchmark results.
Quantization guide: See references/quantization.md for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
Troubleshooting: See references/troubleshooting.md for detailed error messages, debugging steps, and performance diagnostics.
Hardware requirements
- Small models (7B-13B): 1x A10 (24GB) or A100 (40GB)
- Medium models (30B-40B): 2x A100 (40GB) with tensor parallelism
- Large models (70B+): 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ
Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
Resources
- Official docs: https://docs.vllm.ai
- GitHub: https://github.com/vllm-project/vllm
- Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
- Community: https://discuss.vllm.ai
Source
git clone https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/12-inference-serving/vllm/SKILL.mdView on GitHub Overview
Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Ideal for production LLM APIs, optimizing inference latency and throughput, or running memory-constrained models on limited GPUs. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
How This Skill Works
Utilizes vLLM's block-based KV cache (PagedAttention) and continuous batching to accelerate inference by mixing prefill and decode requests. Exposes OpenAI-compatible servers, supports quantization strategies (GPTQ, AWQ, FP8), and distributes models across tensor-parallel workers to fit larger architectures on available GPUs.
When to Use It
- Deploy production LLM APIs with high throughput and low latency
- Optimize latency/throughput for streaming or batch prompts
- Serve large models on GPUs with limited memory using quantization and tensor parallelism
- Provide OpenAI-compatible endpoints for existing clients and SDKs
- Monitor performance with built-in caching and Prometheus metrics in production
Quick Start
- Step 1: pip install vllm
- Step 2: Offline inference: from vllm import LLM, SamplingParams; llm = LLM(model="meta-llama/Llama-3-8B-Instruct"); sampling = SamplingParams(temperature=0.7, max_tokens=256); outputs = llm.generate(["Explain quantum computing"], sampling); print(outputs[0].outputs[0].text)
- Step 3: OpenAI-compatible server: vllm serve meta-llama/Llama-3-8B-Instruct; then query with OpenAI SDK to verify responses
Best Practices
- Tune gpu-memory-utilization and max-model-len to match model size and memory constraints
- Leverage quantization (awq, GPTQ, FP8) to reduce memory footprint when needed
- Enable tensor-parallel-size for very large models and distribute across GPUs
- Monitor TTFT, throughput, and metrics (Prometheus) to ensure targets are met
- Dockerize deployment for consistent production environments and easy scaling
Example Use Cases
- 7B-13B models on a single GPU with --gpu-memory-utilization 0.9 and max-model-len 8192
- 30B-70B models with tensor-parallel-size 4 and quantization awq
- OpenAI-compatible server using meta-llama/Llama-3-8B-Instruct and OpenAI SDK
- Production deployment with caching, metrics, and --enable-prefix-caching
- Docker-based deployment using vllm-openai:latest for consistent infra
Frequently Asked Questions
Related Skills
awq-quantization
Orchestra-Research/AI-Research-SKILLs
Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.
gptq
Orchestra-Research/AI-Research-SKILLs
Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.
hqq-quantization
Orchestra-Research/AI-Research-SKILLs
Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.
langchain
Orchestra-Research/AI-Research-SKILLs
Framework for building LLM-powered applications with agents, chains, and RAG. Supports multiple providers (OpenAI, Anthropic, Google), 500+ integrations, ReAct agents, tool calling, memory management, and vector store retrieval. Use for building chatbots, question-answering systems, autonomous agents, or RAG applications. Best for rapid prototyping and production deployments.
qdrant-vector-search
Orchestra-Research/AI-Research-SKILLs
High-performance vector similarity search engine for RAG and semantic search. Use when building production RAG systems requiring fast nearest neighbor search, hybrid search with filtering, or scalable vector storage with Rust-powered performance.
sentence-transformers
Orchestra-Research/AI-Research-SKILLs
Framework for state-of-the-art sentence, text, and image embeddings. Provides 5000+ pre-trained models for semantic similarity, clustering, and retrieval. Supports multilingual, domain-specific, and multimodal models. Use for generating embeddings for RAG, semantic search, or similarity tasks. Best for production embedding generation.