npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/awq --openclaw

AWQ (Activation-aware Weight Quantization)

4-bit quantization that preserves salient weights based on activation patterns, achieving 3x speedup with minimal accuracy loss.

When to use AWQ

Use AWQ when:

  • Need 4-bit quantization with <5% accuracy loss
  • Deploying instruction-tuned or chat models (AWQ generalizes better)
  • Want ~2.5-3x inference speedup over FP16
  • Using vLLM for production serving
  • Have Ampere+ GPUs (A100, H100, RTX 40xx) for Marlin kernel support

Use GPTQ instead when:

  • Need maximum ecosystem compatibility (more tools support GPTQ)
  • Working with ExLlamaV2 backend specifically
  • Have older GPUs without Marlin support

Use bitsandbytes instead when:

  • Need zero calibration overhead (quantize on-the-fly)
  • Want to fine-tune with QLoRA
  • Prefer simpler integration
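As a rough rule of thumb, the guidance in the three lists above can be encoded in a small helper (a sketch only; the function name and flags are illustrative, not part of any library):

```python
def pick_quant_method(need_qlora=False, zero_calibration=False,
                      use_exllamav2=False, no_marlin_gpu=False):
    """Encode the decision lists above: bitsandbytes for QLoRA or
    calibration-free use, GPTQ for ExLlamaV2 or GPUs without Marlin
    support, otherwise AWQ."""
    if need_qlora or zero_calibration:
        return "bitsandbytes"
    if use_exllamav2 or no_marlin_gpu:
        return "gptq"
    return "awq"
```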

Quick start

Installation

# Default (Triton kernels)
pip install autoawq

# With optimized CUDA kernels + Flash Attention
pip install autoawq[kernels]

# Intel CPU/XPU optimization
pip install autoawq[cpu]

Requirements: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+
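The GPU requirement can be checked with a short sketch. On a live machine you would read the capability from torch.cuda.get_device_capability(); the lookup table here is illustrative, using well-known NVIDIA compute capabilities:

```python
# Illustrative compute capabilities for common NVIDIA GPUs.
KNOWN_CC = {
    "T4": (7, 5),
    "A100": (8, 0),
    "RTX 4090": (8, 9),
    "H100": (9, 0),
}

def supports_awq(cc):
    """AutoAWQ requires Compute Capability 7.5+."""
    return cc >= (7, 5)

def supports_marlin(cc):
    """Marlin kernels require Ampere or newer (8.0+)."""
    return cc >= (8, 0)
```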

Load pre-quantized model

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    model_name,
    fuse_layers=True  # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quantize your own model

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,      # Use zero-point quantization
    "q_group_size": 128,     # Group size (128 recommended)
    "w_bit": 4,              # 4-bit weights
    "version": "GEMM"        # GEMM for batch, GEMV for single-token
}

# Quantize (uses pileval dataset by default)
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")

Timing: ~10-15 min for 7B, ~1 hour for 70B models.

AWQ vs GPTQ vs bitsandbytes

| Feature          | AWQ                      | GPTQ                 | bitsandbytes     |
|------------------|--------------------------|----------------------|------------------|
| Speedup (4-bit)  | ~2.5-3x                  | ~2x                  | ~1.5x            |
| Accuracy loss    | <5%                      | ~5-10%               | ~5-15%           |
| Calibration      | Minimal (128-1K tokens)  | More extensive       | None             |
| Overfitting risk | Low                      | Higher               | N/A              |
| Best for         | Production inference     | GPU inference        | Easy integration |
| vLLM support     | Native                   | Yes                  | Limited          |

Key insight: AWQ assumes not all weights are equally important. It protects ~1% of salient weights identified by activation patterns, reducing quantization error without mixed-precision overhead.
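To make the insight concrete, here is a toy sketch of the saliency step (pure Python, illustrative only: real AWQ computes per-channel activation statistics over calibration tensors and then folds a protective scale into the weights before rounding):

```python
def salient_channels(activations, frac=0.01):
    """Rank input channels by mean |activation|; the top `frac` are 'salient'.

    activations: list of token rows, each a list of per-channel values.
    AWQ protects these channels by scaling them up before 4-bit rounding
    (and scaling the matching activations down), shrinking their error.
    """
    n_ch = len(activations[0])
    importance = [
        sum(abs(row[c]) for row in activations) / len(activations)
        for c in range(n_ch)
    ]
    k = max(1, int(frac * n_ch))
    ranked = sorted(range(n_ch), key=lambda c: importance[c], reverse=True)
    return set(ranked[:k])
```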

Kernel backends

GEMM (default, batch inference)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"  # Best for batch sizes > 1
}

GEMV (single-token generation)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMV"  # ~20% faster than GEMM at batch_size=1
}

Limitation: GEMV supports only batch size 1 and is a poor fit for long contexts.

Marlin (Ampere+ GPUs)

from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    version="marlin"  # 2x faster on A100/H100
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-AWQ",
    quantization_config=config
)

Requirements: Compute Capability 8.0+ (A100, H100, RTX 40xx)

ExLlamaV2 (AMD compatible)

config = AwqConfig(
    bits=4,
    version="exllama"  # Faster prefill, AMD GPU support
)

HuggingFace Transformers integration

Direct loading

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")

Fused modules (recommended)

from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Max sequence length for fusing
    do_fuse=True           # Enable fused attention/MLP
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization_config=config
)

Note: Fused modules cannot be combined with FlashAttention-2.

vLLM integration

from vllm import LLM, SamplingParams

# vLLM auto-detects AWQ models
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half"
)

sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain AI"], sampling)

Performance benchmarks

Memory reduction

| Model       | FP16   | AWQ 4-bit | Reduction |
|-------------|--------|-----------|-----------|
| Mistral 7B  | 14 GB  | 5.5 GB    | 2.5x      |
| Llama 2-13B | 26 GB  | 10 GB     | 2.6x      |
| Llama 2-70B | 140 GB | 35 GB     | 4x        |
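The reductions follow from simple arithmetic. Here is a rough estimator (a sketch: it counts packed 4-bit weights plus per-group FP16 scales and zero-points, and ignores unquantized embeddings, norms, and runtime buffers, which is why real footprints such as 5.5 GB for Mistral 7B come out higher):

```python
def fp16_memory_gb(n_params):
    """FP16 weights: 2 bytes per parameter."""
    return n_params * 2 / 1e9

def awq_memory_gb(n_params, w_bit=4, group_size=128):
    """Lower bound: packed w_bit weights plus one FP16 scale and one
    FP16 zero-point per group of `group_size` weights."""
    weight_bytes = n_params * w_bit / 8
    group_overhead = n_params / group_size * 4  # fp16 scale + fp16 zero
    return (weight_bytes + group_overhead) / 1e9
```

For a 7B model this gives 14 GB at FP16 and roughly 3.7 GB of quantized weight storage before the unquantized layers and runtime overhead are added.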

Inference speed (RTX 4090)

| Model               | Prefill (tok/s) | Decode (tok/s) | Memory   |
|---------------------|-----------------|----------------|----------|
| Mistral 7B (GEMM)   | 3,897           | 114            | 5.55 GB  |
| TinyLlama 1B (GEMV) | 5,179           | 431            | 2.10 GB  |
| Llama 2-13B (GEMM)  | 2,279           | 74             | 10.28 GB |

Accuracy (perplexity)

| Model      | FP16 | AWQ 4-bit | Degradation |
|------------|------|-----------|-------------|
| Llama 3 8B | 8.20 | 8.48      | +3.4%       |
| Mistral 7B | 5.25 | 5.42      | +3.2%       |
| Qwen2 72B  | 4.85 | 4.95      | +2.1%       |
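Perplexity is the exponential of the mean per-token negative log-likelihood, and the degradation column is the relative increase over the FP16 baseline. A minimal sketch of both calculations:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def degradation_pct(ppl_baseline, ppl_quantized):
    """Relative perplexity increase, as reported in the table above."""
    return 100 * (ppl_quantized - ppl_baseline) / ppl_baseline
```

For example, degradation_pct(8.20, 8.48) reproduces the +3.4% shown for Llama 3 8B.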

Custom calibration data

# Use custom dataset for domain-specific models
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="wikitext",       # Or custom list of strings
    max_calib_samples=256,       # More samples = better accuracy
    max_calib_seq_len=512        # Sequence length
)

# Or provide your own samples
calib_samples = [
    "Your domain-specific text here...",
    "More examples from your use case...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)
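Before passing domain text to calib_data, it often pays to filter out fragments too short to carry useful activation statistics. A hypothetical helper (not part of AutoAWQ) might look like:

```python
def prepare_calib_data(texts, min_chars=200, max_samples=256):
    """Filter raw domain texts into a calibration list for model.quantize.

    Drops very short fragments (little signal for activation statistics)
    and caps the sample count to keep calibration fast.
    """
    cleaned = [t.strip() for t in texts if len(t.strip()) >= min_chars]
    return cleaned[:max_samples]
```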

Multi-GPU deployment

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-AWQ",
    device_map="auto",  # Auto-split across GPUs
    max_memory={0: "40GB", 1: "40GB"}
)
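When splitting across GPUs, it helps to leave each device some headroom for activations and the KV cache rather than budgeting its full VRAM. A small hypothetical helper (not part of AutoAWQ) for building the max_memory map:

```python
def max_memory_map(n_gpus, per_gpu_gb, reserve_gb=2):
    """Build a max_memory dict for device_map="auto", reserving
    `reserve_gb` of headroom per GPU for activations and KV cache."""
    return {i: f"{per_gpu_gb - reserve_gb}GB" for i in range(n_gpus)}
```

For two 42 GB budgets this yields the {0: "40GB", 1: "40GB"} map used above.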

Supported models

35+ architectures including:

  • Llama family: Llama 2/3, Code Llama, Mistral, Mixtral
  • Qwen: Qwen, Qwen2, Qwen2.5-VL
  • Others: Falcon, MPT, Phi, Yi, DeepSeek, Gemma
  • Multimodal: LLaVA, LLaVA-Next, Qwen2-VL

Common issues

CUDA OOM during quantization:

# Reduce the number of calibration samples
model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=64)

Slow inference:

# Enable fused layers
model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)

AMD GPU support:

# Use ExLlama backend
config = AwqConfig(bits=4, version="exllama")

Deprecation notice

AutoAWQ is officially deprecated and no longer maintained. For new projects, consider an actively maintained successor such as vLLM's llm-compressor.

Existing quantized models remain usable.

References

Source

https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/10-optimization/awq/SKILL.md

Overview

AWQ quantizes large language models to 4-bit weights, using activation-pattern-based saliency to minimize accuracy loss. It delivers roughly a 3x inference speedup on memory-constrained GPUs, enabling 7B-70B models to run efficiently. The approach is particularly effective for instruction-tuned and multimodal models, and it won the MLSys 2024 Best Paper Award.

How This Skill Works

AWQ identifies salient weights via activation patterns and quantizes the rest to 4 bits, preserving crucial information while reducing compute. It exposes a configurable quant_config (zero_point, q_group_size, w_bit, version) and supports GEMM (default, for batch inference) and GEMV (single-token generation) kernels. The workflow covers both quantizing pre-trained models and loading pre-quantized ones, and integrates with vLLM and standard transformers pipelines.

When to Use It

  • Deploying large models (7B–70B) on GPUs with limited memory
  • Need faster inference than GPTQ while preserving accuracy
  • Working with instruction-tuned or multimodal/chat models (AWQ generalizes well)
  • Production serving using vLLM for scalable, low-latency inference
  • Have Ampere+ GPUs (A100, H100, RTX 40xx) to leverage Marlin kernel support

Quick Start

  1. Install autoawq (default Triton kernels): pip install autoawq
  2. Load a pre-quantized model (e.g., TheBloke/Mistral-7B-Instruct-v0.2-AWQ) via AutoAWQForCausalLM and run a generation example
  3. Quantize your own model (e.g., mistralai/Mistral-7B-Instruct-v0.2) with a quant_config and save the result; timing: ~10-15 min for 7B, ~1 hour for 70B

Best Practices

  • Verify that quantized models maintain <5% accuracy loss compared with FP16/float32 baselines
  • Start with w_bit=4 and q_group_size=128; tune as needed based on calibration data
  • Use GEMM for batch inference (>1) and GEMV for single-token generation
  • Enable zero_point quantization and calibrate with a small dataset (128–1K tokens)
  • Leverage Ampere+ GPUs and Marlin kernels; ensure dependencies (vLLM, transformers>=4.45.0) are met

Example Use Cases

  • Deploy a pre-quantized TheBloke/Mistral-7B-Instruct-v0.2-AWQ model with AutoAWQForCausalLM and run generation via vLLM
  • Quantize mistralai/Mistral-7B-Instruct-v0.2 yourself using a 4-bit quant_config and save as mistral-7b-awq for fast deployment
  • Benchmark AWQ on Ampere+ GPUs showing ~2.5–3x speedups over FP16 with Marlin kernels
  • Apply AWQ to instruction-tuned chat or multimodal models to preserve accuracy better than alternatives
  • Calibrate quantization with 128–1K tokens to minimize overhead and avoid overfitting



Related Skills

quantizing-models-bitsandbytes

Orchestra-Research/AI-Research-SKILLs

Quantizes LLMs to 8-bit or 4-bit for 50-75% memory reduction with minimal accuracy loss. Use when GPU memory is limited, need to fit larger models, or want faster inference. Supports INT8, NF4, FP4 formats, QLoRA training, and 8-bit optimizers. Works with HuggingFace Transformers.

sglang

Orchestra-Research/AI-Research-SKILLs

Fast structured generation and serving for LLMs with RadixAttention prefix caching. Use for JSON/regex outputs, constrained decoding, agentic workflows with tool calls, or when you need 5× faster inference than vLLM with prefix sharing. Powers 300,000+ GPUs at xAI, AMD, NVIDIA, and LinkedIn.

deepspeed

Orchestra-Research/AI-Research-SKILLs

Expert guidance for distributed training with DeepSpeed - ZeRO optimization stages, pipeline parallelism, FP16/BF16/FP8, 1-bit Adam, sparse attention

unsloth

Orchestra-Research/AI-Research-SKILLs

Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization

optimizing-attention-flash

Orchestra-Research/AI-Research-SKILLs

Optimizes transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Use when training/running transformers with long sequences (>512 tokens), encountering GPU memory issues with attention, or need faster inference. Supports PyTorch native SDPA, flash-attn library, H100 FP8, and sliding window attention.

gptq

Orchestra-Research/AI-Research-SKILLs

Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.
