awq-quantization
npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/awq --openclaw
AWQ (Activation-aware Weight Quantization)
4-bit weight quantization that preserves salient weights based on activation patterns, achieving a ~2.5-3x inference speedup over FP16 with minimal accuracy loss.
When to use AWQ
Use AWQ when:
- Need 4-bit quantization with <5% accuracy loss
- Deploying instruction-tuned or chat models (AWQ generalizes better)
- Want ~2.5-3x inference speedup over FP16
- Using vLLM for production serving
- Have Ampere+ GPUs (A100, H100, RTX 40xx) for Marlin kernel support
Use GPTQ instead when:
- Need maximum ecosystem compatibility (more tools support GPTQ)
- Working with ExLlamaV2 backend specifically
- Have older GPUs without Marlin support
Use bitsandbytes instead when:
- Need zero calibration overhead (quantize on-the-fly)
- Want to fine-tune with QLoRA
- Prefer simpler integration
Quick start
Installation
# Default (Triton kernels)
pip install autoawq
# With optimized CUDA kernels + Flash Attention
pip install autoawq[kernels]
# Intel CPU/XPU optimization
pip install autoawq[cpu]
Requirements: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+
Load pre-quantized model
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
model = AutoAWQForCausalLM.from_quantized(
    model_name,
    fuse_layers=True  # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Generate
inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Quantize your own model
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "mistralai/Mistral-7B-Instruct-v0.2"
# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantization config
quant_config = {
    "zero_point": True,   # Use zero-point quantization
    "q_group_size": 128,  # Group size (128 recommended)
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM"     # GEMM for batch, GEMV for single-token
}
# Quantize (uses pileval dataset by default)
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")
Timing: ~10-15 min for 7B, ~1 hour for 70B models.
AWQ vs GPTQ vs bitsandbytes
| Feature | AWQ | GPTQ | bitsandbytes |
|---|---|---|---|
| Speedup (4-bit) | ~2.5-3x | ~2x | ~1.5x |
| Accuracy loss | <5% | ~5-10% | ~5-15% |
| Calibration | Minimal (128-1K tokens) | More extensive | None |
| Overfitting risk | Low | Higher | N/A |
| Best for | Production inference | GPU inference | Easy integration |
| vLLM support | Native | Yes | Limited |
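Key insight: not all weights are equally important. AWQ uses activation patterns to identify the ~1% of weights that are most salient and protects them by scaling them before quantization. This reduces quantization error while keeping every weight in 4-bit, so there is no mixed-precision overhead.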
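The idea can be illustrated with a toy example in plain Python. This is a minimal sketch, not AutoAWQ's implementation: the function names, the symmetric round-to-nearest quantizer, and the hand-picked scale of 2.0 are all assumptions made for illustration (the real method searches for the scales using calibration data).

```python
def quantize_symmetric(weights, n_bits=4):
    """Round-to-nearest symmetric quantization of one weight group (toy quantizer)."""
    qmax = 2 ** (n_bits - 1) - 1  # 7 for 4-bit
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]


def output_error(weights, dequantized, activations):
    """Absolute error of the dot product introduced by quantization."""
    return abs(sum((d - w) * a for w, d, a in zip(weights, dequantized, activations)))


weights = [0.7, 0.1, -0.3, 0.26]
activations = [1.0, 1.0, 1.0, 20.0]  # channel 3 sees large activations, so it is salient

# Plain quantization: the salient weight 0.26 is rounded to 0.3.
plain = quantize_symmetric(weights)
err_plain = output_error(weights, plain, activations)

# Activation-aware: scale the salient channel up before quantizing, undo it afterwards.
scales = [1.0, 1.0, 1.0, 2.0]  # hand-picked for this example (assumption)
scaled = quantize_symmetric([w * s for w, s in zip(weights, scales)])
restored = [q / s for q, s in zip(scaled, scales)]
err_scaled = output_error(weights, restored, activations)

print(f"plain: {err_plain:.2f}, scaled: {err_scaled:.2f}")
# plain: 0.80, scaled: 0.20
```

All four weights are still stored on the same 4-bit grid; only the scale applied to the salient channel changes, which is why no mixed-precision storage is needed.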
Kernel backends
GEMM (default, batch inference)
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"  # Best for batch sizes > 1
}
GEMV (single-token generation)
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMV"  # 20% faster for batch_size=1
}
Limitation: GEMV only supports batch size 1 and is not well suited to long contexts.
Marlin (Ampere+ GPUs)
from transformers import AwqConfig, AutoModelForCausalLM
config = AwqConfig(
    bits=4,
    version="marlin"  # 2x faster on A100/H100
)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-AWQ",
    quantization_config=config
)
Requirements: Compute Capability 8.0+ (A100, H100, RTX 40xx)
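The compute-capability thresholds stated in this document (7.5+ for AWQ in general, 8.0+ for Marlin) can be checked up front. The helper below is a hypothetical sketch, not part of AutoAWQ or Transformers. On a CUDA machine the `(major, minor)` pair can be read with `torch.cuda.get_device_capability()`; the function itself is kept free of dependencies.

```python
def available_awq_kernels(major, minor):
    """Return the kernel backends usable at a given CUDA compute capability.

    Thresholds follow this document: 7.5+ for AWQ, 8.0+ for Marlin.
    Hypothetical helper for illustration only.
    """
    capability = (major, minor)
    if capability < (7, 5):
        return []  # below the stated minimum requirement
    kernels = ["GEMM", "GEMV"]
    if capability >= (8, 0):
        kernels.append("Marlin")
    return kernels


print(available_awq_kernels(8, 9))  # RTX 4090
print(available_awq_kernels(7, 5))  # T4
print(available_awq_kernels(7, 0))  # V100
```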
ExLlamaV2 (AMD compatible)
from transformers import AwqConfig
config = AwqConfig(
    bits=4,
    version="exllama"  # Faster prefill, AMD GPU support
)
HuggingFace Transformers integration
Direct loading
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
Fused modules (recommended)
from transformers import AwqConfig, AutoModelForCausalLM
config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Max sequence length for fusing
    do_fuse=True           # Enable fused attention/MLP
)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization_config=config
)
Note: Fused modules cannot be combined with FlashAttention2.
vLLM integration
from vllm import LLM, SamplingParams
# vLLM can detect AWQ from the model config; quantization="awq" makes it explicit
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half"
)
sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain AI"], sampling)
Performance benchmarks
Memory reduction
| Model | FP16 | AWQ 4-bit | Reduction |
|---|---|---|---|
| Mistral 7B | 14 GB | 5.5 GB | 2.5x |
| Llama 2-13B | 26 GB | 10 GB | 2.6x |
| Llama 2-70B | 140 GB | 35 GB | 4x |
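For a rough sanity check of figures like these, weight storage can be estimated from the parameter count alone. This is a back-of-the-envelope sketch: the function names are hypothetical, and the per-group overhead (one 16-bit scale and one 4-bit zero point per group) is an assumption about the storage layout. It covers weights only, so measured numbers such as those in the table will differ.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def awq_bits_per_weight(w_bit=4, q_group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight including per-group metadata.

    scale_bits and zero_bits are assumptions, not values taken from AutoAWQ.
    """
    return w_bit + (scale_bits + zero_bits) / q_group_size


fp16 = weight_memory_gb(7e9, 16)
awq = weight_memory_gb(7e9, awq_bits_per_weight())
print(f"FP16: {fp16:.1f} GB, AWQ 4-bit: {awq:.2f} GB, ratio: {fp16 / awq:.2f}x")
# FP16: 14.0 GB, AWQ 4-bit: 3.64 GB, ratio: 3.85x
```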
Inference speed (RTX 4090)
| Model | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|
| Mistral 7B GEMM | 3,897 | 114 | 5.55 GB |
| TinyLlama 1B GEMV | 5,179 | 431 | 2.10 GB |
| Llama 2-13B GEMM | 2,279 | 74 | 10.28 GB |
Accuracy (perplexity)
| Model | FP16 | AWQ 4-bit | Degradation |
|---|---|---|---|
| Llama 3 8B | 8.20 | 8.48 | +3.4% |
| Mistral 7B | 5.25 | 5.42 | +3.2% |
| Qwen2 72B | 4.85 | 4.95 | +2.1% |
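The degradation column is the relative increase in perplexity over the FP16 baseline. A short sketch that recomputes it from the table values and compares it with the <5% budget mentioned earlier (the function name is hypothetical, and reading the 5% budget as a perplexity threshold is an assumption):

```python
def perplexity_degradation(fp16_ppl, quant_ppl):
    """Relative perplexity increase of the quantized model, in percent."""
    return (quant_ppl - fp16_ppl) / fp16_ppl * 100.0


# (FP16 perplexity, AWQ 4-bit perplexity) taken from the table above
results = {
    "Llama 3 8B": (8.20, 8.48),
    "Mistral 7B": (5.25, 5.42),
    "Qwen2 72B": (4.85, 4.95),
}

for name, (fp16_ppl, awq_ppl) in results.items():
    degradation = perplexity_degradation(fp16_ppl, awq_ppl)
    status = "ok" if degradation < 5.0 else "over budget"
    print(f"{name}: +{degradation:.1f}% ({status})")
# Llama 3 8B: +3.4% (ok)
# Mistral 7B: +3.2% (ok)
# Qwen2 72B: +2.1% (ok)
```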
Custom calibration data
# Use custom dataset for domain-specific models
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="wikitext",   # Or custom list of strings
    max_calib_samples=256,   # More samples = better accuracy
    max_calib_seq_len=512    # Sequence length
)
# Or provide your own samples
calib_samples = [
    "Your domain-specific text here...",
    "More examples from your use case...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)
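When the samples come from raw domain text, it can help to clean them before passing the list as `calib_data`. The helper below is a hypothetical sketch, not an AutoAWQ function; the `min_chars` threshold and the deduplication step are assumptions about what makes a reasonable calibration set.

```python
def prepare_calib_samples(texts, max_samples=256, min_chars=32):
    """Normalize whitespace, drop very short or duplicate texts, and cap the count."""
    seen = set()
    samples = []
    for text in texts:
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        if len(cleaned) < min_chars or cleaned in seen:
            continue
        seen.add(cleaned)
        samples.append(cleaned)
        if len(samples) == max_samples:
            break
    return samples


raw_texts = [
    "  Patient presents with   acute chest pain and shortness of breath. ",
    "too short",
    "Patient presents with acute chest pain and shortness of breath.",
    "Contract clause 4.2 limits liability to direct damages only.",
]
calib_samples = prepare_calib_samples(raw_texts)
print(len(calib_samples))  # 2
```

The resulting list can then be used in place of the hand-written `calib_samples` above.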
Multi-GPU deployment
from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-AWQ",
    device_map="auto",  # Auto-split across GPUs
    max_memory={0: "40GB", 1: "40GB"}
)
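The `max_memory` mapping can be built programmatically when the GPU count varies between machines. This is a hypothetical helper; reserving a few GB per GPU for activations and the KV cache is an assumption, not a documented requirement.

```python
def build_max_memory(n_gpus, gpu_gb, reserve_gb=4):
    """Build a max_memory dict, leaving reserve_gb free on each GPU (assumed headroom)."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    if reserve_gb >= gpu_gb:
        raise ValueError("reserve_gb must be smaller than gpu_gb")
    return {gpu_index: f"{gpu_gb - reserve_gb}GB" for gpu_index in range(n_gpus)}


print(build_max_memory(2, 40, reserve_gb=0))  # {0: '40GB', 1: '40GB'}
print(build_max_memory(2, 40))                # {0: '36GB', 1: '36GB'}
```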
Supported models
35+ architectures including:
- Llama family: Llama 2/3, Code Llama, Mistral, Mixtral
- Qwen: Qwen, Qwen2, Qwen2.5-VL
- Others: Falcon, MPT, Phi, Yi, DeepSeek, Gemma
- Multimodal: LLaVA, LLaVA-Next, Qwen2-VL
Common issues
CUDA OOM during quantization:
# Reduce the number of calibration samples
model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=64)
Slow inference:
# Enable fused layers
model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)
AMD GPU support:
# Use ExLlama backend
config = AwqConfig(bits=4, version="exllama")
Deprecation notice
AutoAWQ is officially deprecated. For new projects, consider:
- vLLM llm-compressor: https://github.com/vllm-project/llm-compressor
- MLX-LM: For Mac devices with Apple Silicon
Existing quantized models remain usable.
References
- Paper: AWQ: Activation-aware Weight Quantization (arXiv:2306.00978) - MLSys 2024 Best Paper
- GitHub: https://github.com/casper-hansen/AutoAWQ
- MIT Han Lab: https://github.com/mit-han-lab/llm-awq
- Models: https://huggingface.co/models?library=awq
Source
https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/10-optimization/awq/SKILL.md
Overview
AWQ quantizes large language models to 4-bit weights, using activation-based saliency to minimize accuracy loss. It delivers roughly a 2.5-3x inference speedup over FP16 and reduces memory enough to run 7B–70B models on memory-constrained GPUs. The approach is particularly effective for instruction-tuned and multimodal models, and the AWQ paper won the MLSys 2024 Best Paper Award.
How This Skill Works
AWQ uses activation patterns to identify salient weights and protects them during quantization, so all weights can be stored in 4 bits while the most important information is preserved. It exposes a configurable quant_config (zero_point, q_group_size, w_bit, version) and supports GEMM (the default, for batch inference) and GEMV (for single-token generation) kernels. You can either quantize a pre-trained model yourself or load a pre-quantized one, and the result integrates with vLLM and standard transformers pipelines.
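To make the quant_config fields concrete, here is a minimal pure-Python sketch of what zero_point, q_group_size, and w_bit control: each row of weights is split into groups, and every group gets its own scale and zero point. The function names are hypothetical and the code only illustrates the arithmetic; it is not AutoAWQ's kernel code.

```python
def split_into_groups(row, q_group_size):
    """Split one row of weights into consecutive groups of q_group_size."""
    return [row[i:i + q_group_size] for i in range(0, len(row), q_group_size)]


def quantize_group_zero_point(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one group to unsigned w_bit integers."""
    qmax = 2 ** w_bit - 1  # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0  # fall back to 1.0 for a constant group
    zero = min(qmax, max(0, round(-w_min / scale)))
    q = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map the integers back to approximate floating-point weights."""
    return [(value - zero) * scale for value in q]


row = [-0.30, -0.05, 0.10, 0.45, -0.15, 0.0, 0.15, 0.60]
groups = split_into_groups(row, q_group_size=4)  # real configs use 128

for group in groups:
    q, scale, zero = quantize_group_zero_point(group, w_bit=4)
    print(q, round(scale, 4), zero)
# [0, 5, 8, 15] 0.05 6
# [0, 3, 6, 15] 0.05 3
```

Smaller groups track the local weight range more closely but store more scales and zero points, which is the trade-off behind the q_group_size=128 default.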
When to Use It
- Deploying large models (7B–70B) on GPUs with limited memory
- Need faster inference than GPTQ while preserving accuracy
- Working with instruction-tuned or multimodal/chat models (AWQ generalizes well)
- Production serving using vLLM for scalable, low-latency inference
- Have Ampere+ GPUs (A100, H100, RTX 40xx) to leverage Marlin kernel support
Quick Start
- Step 1: Install autoawq (default Triton kernels) with pip install autoawq
- Step 2: Load a pre-quantized model (e.g., TheBloke/Mistral-7B-Instruct-v0.2-AWQ) via AutoAWQForCausalLM and run a generation example
- Step 3: Quantize your own model (e.g., mistralai/Mistral-7B-Instruct-v0.2) with a quant_config and save the quantized model; timings: ~10–15 min for 7B, ~1 hour for 70B
Best Practices
- Verify that quantized models maintain <5% accuracy loss compared with FP16/float32 baselines
- Start with w_bit=4 and q_group_size=128; tune as needed based on calibration data
- Use GEMM for batch inference (batch size > 1) and GEMV for single-token generation (batch size 1)
- Enable zero_point quantization and calibrate with a small dataset (128–1K tokens)
- Leverage Ampere+ GPUs and Marlin kernels; ensure dependencies (vLLM, transformers>=4.45.0) are met
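The defaults recommended in this list can be captured in a small config builder. This is a hypothetical convenience function, not part of AutoAWQ; it simply encodes the starting values (w_bit=4, q_group_size=128, zero_point enabled) and the GEMM/GEMV rule of thumb.

```python
def make_quant_config(batch_size=1, w_bit=4, q_group_size=128):
    """Build a quant_config dict using the starting values suggested above."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # GEMV targets single-token generation; GEMM targets batched inference.
    version = "GEMV" if batch_size == 1 else "GEMM"
    return {
        "zero_point": True,
        "q_group_size": q_group_size,
        "w_bit": w_bit,
        "version": version,
    }


print(make_quant_config(batch_size=8))
# {'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}
```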
Example Use Cases
- Load the pre-quantized TheBloke/Mistral-7B-Instruct-v0.2-AWQ model with AutoAWQForCausalLM, or serve it through vLLM
- Quantize mistralai/Mistral-7B-Instruct-v0.2 yourself using a 4-bit quant_config and save as mistral-7b-awq for fast deployment
- Benchmark AWQ on Ampere+ GPUs showing ~2.5–3x speedups over FP16 with Marlin kernels
- Apply AWQ to instruction-tuned chat or multimodal models to preserve accuracy better than alternatives
- Calibrate quantization with 128–1K tokens to minimize overhead and avoid overfitting