
GPTQ (Generative Pre-trained Transformer Quantization)

Post-training quantization method that compresses LLMs to 4-bit with minimal accuracy loss using group-wise quantization.

When to use GPTQ

Use GPTQ when:

  • Need to fit large models (70B+) on limited GPU memory
  • Want 4× memory reduction with <2% accuracy loss
  • Deploying on consumer GPUs (RTX 4090, 3090)
  • Need faster inference (3-4× speedup vs FP16)

Use AWQ instead when:

  • Need slightly better accuracy (<1% loss)
  • Have newer GPUs (Ampere, Ada)
  • Want Marlin kernel support (2× faster on some GPUs)

Use bitsandbytes instead when:

  • Need simple integration with transformers
  • Want 8-bit quantization (less compression, better quality)
  • Don't need pre-quantized model files

Quick start

Installation

# Install AutoGPTQ
pip install auto-gptq

# With Triton (Linux only, faster)
pip install auto-gptq[triton]

# With CUDA extensions (faster)
pip install auto-gptq --no-build-isolation

# Full installation
pip install auto-gptq transformers accelerate

Load pre-quantized model

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load quantized model from HuggingFace
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=False  # Set True on Linux for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
prompt = "Explain quantum computing"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))

Quantize your own model

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset

# Load model
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # Group size (recommended: 128)
    desc_act=False,      # Activation order (False for CUDA kernel)
    damp_percent=0.01    # Dampening factor
)

# Load model for quantization
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config
)

# Prepare calibration data
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_data = [
    tokenizer(example["text"], truncation=True, max_length=512)  # keep input_ids + attention_mask
    for example in dataset.take(128)
]

# Quantize
model.quantize(calibration_data)

# Save quantized model
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")

# Push to HuggingFace
model.push_to_hub("username/llama-2-7b-gptq")

Group-wise quantization

How GPTQ works:

  1. Group weights: Divide each weight matrix into groups (typically 128 elements)
  2. Quantize per-group: Each group has its own scale/zero-point
  3. Minimize error: Uses Hessian information to minimize quantization error
  4. Result: 4-bit weights with near-FP16 accuracy

Group size trade-off:

| Group Size | Model Size | Accuracy | Speed | Recommendation |
|---|---|---|---|---|
| -1 (per-column) | Smallest | Lowest | Fastest | Research only |
| 32 | Largest | Best | Slower | High accuracy needed |
| 128 | Medium | Good | Fast | Recommended default |
| 256 | Smaller | Lower | Faster | Speed critical |
| 1024 | Near smallest | Lowest | Fastest | Not recommended |

(Smaller groups store more scales/zero-points, so the model is slightly larger and the kernel slightly slower, but accuracy improves.)

Example:

Weight matrix: [1024, 4096] = 4.2M elements

Group size = 128:
- Groups: 4.2M / 128 = 32,768 groups
- Each group: its own scale + zero-point (the weights themselves are stored as 4-bit integers)
- Result: Better granularity → better accuracy
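
To make the grouping concrete, here is a minimal PyTorch sketch of per-group asymmetric quantization. It is illustrative only: it shows the per-group scale/zero-point bookkeeping, not AutoGPTQ's kernels or the Hessian-based error correction that GPTQ adds on top.

import torch

def quantize_groupwise(weight: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Toy per-group asymmetric quantization (not the real GPTQ solver)."""
    out_features, in_features = weight.shape
    qmax = 2 ** bits - 1  # 15 for 4-bit

    # Split each row into groups of `group_size` consecutive input weights
    w = weight.reshape(out_features, in_features // group_size, group_size)

    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / qmax   # one scale per group
    zero = torch.round(-w_min / scale)               # one zero-point per group

    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)  # 4-bit integer codes
    w_hat = (q - zero) * scale                               # what inference kernels reconstruct
    return q.reshape(out_features, in_features), w_hat.reshape(out_features, in_features)

w = torch.randn(1024, 4096)          # the [1024, 4096] matrix from the example above
q, w_hat = quantize_groupwise(w)     # 32,768 groups of 128 weights
print("mean abs reconstruction error:", (w - w_hat).abs().mean().item())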

Quantization configurations

Standard 4-bit (recommended)

from auto_gptq import BaseQuantizeConfig

config = BaseQuantizeConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # Standard group size
    desc_act=False,      # Faster CUDA kernel
    damp_percent=0.01    # Dampening factor
)

Performance:

  • Memory: 4× reduction (70B model: 140GB → 35GB)
  • Accuracy: ~1.5% perplexity increase
  • Speed: 3-4× faster than FP16

Higher compression (3-bit)

config = BaseQuantizeConfig(
    bits=3,              # 3-bit (more compression)
    group_size=128,      # Keep standard group size
    desc_act=True,       # Better accuracy (slower)
    damp_percent=0.01
)

Trade-off:

  • Memory: 5× reduction
  • Accuracy: ~3% perplexity increase
  • Speed: 5× faster (but less accurate)

Maximum accuracy (4-bit with small groups)

config = BaseQuantizeConfig(
    bits=4,
    group_size=32,       # Smaller groups (better accuracy)
    desc_act=True,       # Activation reordering
    damp_percent=0.005   # Lower dampening
)

Trade-off:

  • Memory: 3.5× reduction (slightly larger)
  • Accuracy: ~0.8% perplexity increase (best)
  • Speed: 2-3× faster (kernel overhead)
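
If you want to switch between the three presets above programmatically, a small helper like the following can wrap them. The profile names and the helper itself are illustrative conveniences, not part of the AutoGPTQ API:

from auto_gptq import BaseQuantizeConfig

def make_config(profile: str = "standard") -> BaseQuantizeConfig:
    profiles = {
        # balanced default ("Standard 4-bit")
        "standard":     dict(bits=4, group_size=128, desc_act=False, damp_percent=0.01),
        # "Higher compression (3-bit)"
        "compressed":   dict(bits=3, group_size=128, desc_act=True,  damp_percent=0.01),
        # "Maximum accuracy"
        "max_accuracy": dict(bits=4, group_size=32,  desc_act=True,  damp_percent=0.005),
    }
    return BaseQuantizeConfig(**profiles[profile])

quantize_config = make_config("standard")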

Kernel backends

ExLlamaV2 (default, fastest)

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_exllama=True,      # Use ExLlamaV2
    exllama_config={"version": 2}
)

Performance: 1.5-2× faster than Triton

Marlin (Ampere+ GPUs)

# Quantize with Marlin format
config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False  # Required for Marlin
)

model.quantize(calibration_data, use_marlin=True)

# Load with Marlin
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_marlin=True  # 2× faster on A100/H100
)

Requirements:

  • NVIDIA Ampere or newer (A100, H100, RTX 40xx)
  • Compute capability ≥ 8.0
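
To check whether the current GPU satisfies that requirement before enabling Marlin, a quick PyTorch check (illustrative) is:

import torch

major, minor = torch.cuda.get_device_capability(0)
use_marlin = major >= 8  # Ampere (compute capability 8.0) or newer
print(f"Compute capability {major}.{minor} -> Marlin supported: {use_marlin}")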

Triton (Linux only)

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=True  # Linux only
)

Performance: 1.2-1.5× faster than CUDA backend

Integration with transformers

Direct transformers usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load quantized model (transformers auto-detects GPTQ)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-GPTQ",
    device_map="auto",
    trust_remote_code=False
)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-GPTQ")

# Use like any transformers model
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)

QLoRA fine-tuning (GPTQ + LoRA)

from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Load GPTQ model
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)

# Prepare for LoRA training
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)

# Fine-tune (memory efficient!)
# 70B model trainable on single A100 80GB
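
A minimal training loop on top of the PEFT model above might look like the following. This is a hedged sketch: the dataset (Abirate/english_quotes) and the hyperparameters are placeholders, and you would normally substitute your own data pipeline.

from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
tokenizer.pad_token = tokenizer.eos_token

# Placeholder dataset; replace with your own instruction/text data
data = load_dataset("Abirate/english_quotes", split="train")
data = data.map(lambda x: tokenizer(x["quote"], truncation=True, max_length=256))

trainer = Trainer(
    model=model,  # the get_peft_model(...) result from above
    args=TrainingArguments(
        output_dir="llama2-gptq-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
    ),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()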

Performance benchmarks

Memory reduction

| Model | FP16 | GPTQ 4-bit | Reduction |
|---|---|---|---|
| Llama 2-7B | 14 GB | 3.5 GB | 4× |
| Llama 2-13B | 26 GB | 6.5 GB | 4× |
| Llama 2-70B | 140 GB | 35 GB | 4× |
| Llama 3.1-405B | 810 GB | 203 GB | ~4× |

Enables:

  • 70B on single A100 80GB (vs 2× A100 needed for FP16)
  • 405B on 3× A100 80GB (vs 11× A100 needed for FP16)
  • 13B on RTX 4090 24GB (vs OOM with FP16)
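
The ~4× figure follows from simple arithmetic: 16-bit weights take 2 bytes each, 4-bit weights take 0.5 bytes plus a small per-group scale/zero-point overhead. A back-of-the-envelope estimate (illustrative; ignores activations and KV cache):

def approx_weight_memory_gb(params_billion, bits_per_weight, group_size=128, scale_bits=16):
    weights = params_billion * 1e9 * bits_per_weight / 8           # packed weights, in bytes
    overhead = params_billion * 1e9 / group_size * scale_bits / 8  # one FP16 scale per group
    return (weights + overhead) / 1e9

print(approx_weight_memory_gb(70, 16, scale_bits=0))  # FP16 baseline: ~140 GB
print(approx_weight_memory_gb(70, 4))                 # GPTQ 4-bit, g=128: ~36 GB (close to the 35 GB above)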

Inference speed (Llama 2-7B, A100)

| Precision | Tokens/sec | vs FP16 |
|---|---|---|
| FP16 | 25 tok/s | 1× |
| GPTQ 4-bit (CUDA) | 85 tok/s | 3.4× |
| GPTQ 4-bit (ExLlama) | 105 tok/s | 4.2× |
| GPTQ 4-bit (Marlin) | 120 tok/s | 4.8× |

Accuracy (perplexity on WikiText-2)

| Model | FP16 | GPTQ 4-bit (g=128) | Degradation |
|---|---|---|---|
| Llama 2-7B | 5.47 | 5.55 | +1.5% |
| Llama 2-13B | 4.88 | 4.95 | +1.4% |
| Llama 2-70B | 3.32 | 3.38 | +1.8% |

Quality is well preserved: less than 2% perplexity degradation across model sizes.

Common patterns

Multi-GPU deployment

# Automatic device mapping
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",
    device_map="auto",  # Automatically split across GPUs
    max_memory={0: "40GB", 1: "40GB"}  # Limit per GPU
)

# Manual device mapping
device_map = {
    "model.embed_tokens": 0,
    **{f"model.layers.{i}": 0 for i in range(40)},      # first 40 layers on GPU 0
    **{f"model.layers.{i}": 1 for i in range(40, 80)},  # last 40 layers on GPU 1
    "model.norm": 1,
    "lm_head": 1
}

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device_map=device_map
)

CPU offloading

# Offload some layers to CPU (for very large models)
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-405B-GPTQ",
    device_map="auto",
    max_memory={
        0: "80GB",  # GPU 0
        1: "80GB",  # GPU 1
        2: "80GB",  # GPU 2
        "cpu": "200GB"  # Offload overflow to CPU
    }
)

Batch inference

# Process multiple prompts efficiently
prompts = [
    "Explain AI",
    "Explain ML",
    "Explain DL"
]

tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default
tokenizer.padding_side = "left"            # left-pad for decoder-only generation
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id
)

for i, output in enumerate(outputs):
    print(f"Prompt {i}: {tokenizer.decode(output, skip_special_tokens=True)}")

Finding pre-quantized models

TheBloke and other community maintainers publish pre-quantized GPTQ models on HuggingFace:

Search:

# Find GPTQ models on HuggingFace
https://huggingface.co/models?library=gptq

Download:

from auto_gptq import AutoGPTQForCausalLM

# Automatically downloads from HuggingFace
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-Chat-GPTQ",
    device="cuda:0"
)

Supported models

  • LLaMA family: Llama 2, Llama 3, Code Llama
  • Mistral: Mistral 7B, Mixtral 8x7B, 8x22B
  • Qwen: Qwen, Qwen2, QwQ
  • DeepSeek: V2, V3
  • Phi: Phi-2, Phi-3
  • Yi, Falcon, BLOOM, OPT
  • 100+ models on HuggingFace


Source

git clone https://github.com/Orchestra-Research/AI-Research-SKILLs

View on GitHub: https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/10-optimization/gptq/SKILL.md

Overview

GPTQ is a post-training method that compresses large LLMs to 4-bit with minimal accuracy loss. It enables roughly 4× memory reduction and 3–4× faster inference on consumer GPUs, and integrates with Transformers and PEFT for QLoRA fine-tuning.

How This Skill Works

GPTQ uses group-wise quantization: it splits each weight matrix into groups (commonly 128 elements), quantizes each group with its own scale and zero-point, and minimizes quantization error using Hessian information to achieve near-FP16 accuracy.

When to Use It

  • Deploy very large models (70B+) on consumer GPUs with limited VRAM.
  • Achieve ~4× memory reduction with <2% perplexity degradation.
  • Need 3–4× faster inference vs FP16 on GPUs like RTX 4090/3090.
  • Quantize models to enable QLoRA/PEFT fine-tuning with Transformers.
  • Prefer group-wise 4-bit quantization for strong compression with near-FP16 quality.

Quick Start

  1. Install the required packages (auto-gptq, transformers, accelerate; optionally Triton and CUDA extensions).
  2. Load a pre-quantized model with AutoGPTQForCausalLM.from_quantized(model_name, device='cuda:0', use_triton=False).
  3. Quantize your own model by configuring BaseQuantizeConfig (bits=4, group_size=128, etc.), then call model.quantize(calibration_data) and save_quantized.

Best Practices

  • Use a group_size of 128 (default) for a good balance of accuracy and speed.
  • Quantize with representative calibration data to minimize domain drift.
  • Validate quantized model accuracy against the FP16 baseline after quantization (see the perplexity sketch after this list).
  • Leverage AutoGPTQ with CUDA extensions or Triton for maximum speed.
  • Follow the QLoRA/PEFT workflow when preparing models for fine-tuning.
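
For the baseline-comparison bullet above, a rough perplexity check might look like this. It is a hedged sketch: the WikiText-2 slice, window size, and stride are illustrative, and a proper evaluation would use a full sliding-window protocol.

import torch
from datasets import load_dataset

def rough_perplexity(model, tokenizer, n_tokens=4096, stride=512):
    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :n_tokens].to(model.device)
    losses = []
    for start in range(0, ids.size(1) - stride + 1, stride):
        chunk = ids[:, start:start + stride]
        with torch.no_grad():
            out = model(input_ids=chunk, labels=chunk)
        losses.append(out.loss)
    return torch.exp(torch.stack(losses).mean()).item()

# Compare quantized vs FP16 (load each model separately, then):
# print(rough_perplexity(quantized_model, tokenizer), rough_perplexity(fp16_model, tokenizer))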

Example Use Cases

  • Quantize Llama-2-7B-Chat with GPTQ on an RTX 4090 to fit memory constraints.
  • Quantize a 70B+ model to run on consumer GPUs with 4× memory reduction.
  • Prepare a quantized model for QLoRA fine-tuning using PEFT utilities.
  • Deploy a quantized LLM via HuggingFace hub for quick sharing and reuse.
  • Use a 128-group quantization config to balance accuracy and speed on GPUs.

Add this skill to your agents

npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/gptq --openclaw

Related Skills

quantizing-models-bitsandbytes

Orchestra-Research/AI-Research-SKILLs

Quantizes LLMs to 8-bit or 4-bit for 50-75% memory reduction with minimal accuracy loss. Use when GPU memory is limited, need to fit larger models, or want faster inference. Supports INT8, NF4, FP4 formats, QLoRA training, and 8-bit optimizers. Works with HuggingFace Transformers.

sglang

Orchestra-Research/AI-Research-SKILLs

Fast structured generation and serving for LLMs with RadixAttention prefix caching. Use for JSON/regex outputs, constrained decoding, agentic workflows with tool calls, or when you need 5× faster inference than vLLM with prefix sharing. Powers 300,000+ GPUs at xAI, AMD, NVIDIA, and LinkedIn.

axolotl

Orchestra-Research/AI-Research-SKILLs

Expert guidance for fine-tuning LLMs with Axolotl - YAML configs, 100+ models, LoRA/QLoRA, DPO/KTO/ORPO/GRPO, multimodal support

awq-quantization

Orchestra-Research/AI-Research-SKILLs

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

unsloth

Orchestra-Research/AI-Research-SKILLs

Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization

hqq-quantization

Orchestra-Research/AI-Research-SKILLs

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.
