What memory reduction can I expect?

About 50% with 8-bit and up to 75% with 4-bit, with minimal accuracy loss.

Which quantization formats are supported?

INT8, NF4, and FP4 formats are supported.

Can I train or fine-tune with QLoRA or 8-bit optimizers?

Yes. Both QLoRA training and 8-bit optimizers are supported when bitsandbytes is used with HuggingFace Transformers.

quantizing-models-bitsandbytes

Optimization Bitsandbytes Quantization 8-Bit 4-Bit Memory Optimization QLoRA NF4 INT8 HuggingFace

npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/bitsandbytes --openclaw

bitsandbytes - LLM Quantization

Quick start

bitsandbytes reduces LLM weight memory by about 50% (8-bit) or 75% (4-bit) relative to FP16, typically with 0-2% accuracy degradation.

Installation:

pip install bitsandbytes transformers accelerate

8-bit quantization (50% memory reduction):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)

# Memory: 14GB → 7GB

4-bit quantization (75% memory reduction):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)

# Memory: 14GB → 3.5GB

Common workflows

Workflow 1: Load large model in limited GPU memory

Copy this checklist:

Quantization Loading:
- [ ] Step 1: Calculate memory requirements
- [ ] Step 2: Choose quantization level (4-bit or 8-bit)
- [ ] Step 3: Configure quantization
- [ ] Step 4: Load and verify model

Step 1: Calculate memory requirements

Estimate model memory:

FP16 memory (GB) = Parameters × 2 bytes / 1e9
INT8 memory (GB) = Parameters × 1 byte / 1e9
INT4 memory (GB) = Parameters × 0.5 bytes / 1e9

Example (Llama 2 7B):
FP16: 7B × 2 / 1e9 = 14 GB
INT8: 7B × 1 / 1e9 = 7 GB
INT4: 7B × 0.5 / 1e9 = 3.5 GB
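
The estimates above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `weight_memory_gb` and the `BYTES_PER_PARAM` table are illustrative, not part of bitsandbytes. It counts weights only, so activations, the KV cache, and quantization constants come on top.

```python
# Bytes needed to store one parameter at each precision (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return weight-only memory in GB, using 1 GB = 1e9 bytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Llama 2 7B, treated as exactly 7e9 parameters as in the example above.
for precision in ("fp16", "int8", "int4"):
    print(f"{precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```

Measured usage after loading will be somewhat higher than these figures, so treat them as a lower bound.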

Step 2: Choose quantization level

GPU VRAM	Model Size	Recommended
8 GB	3B	4-bit
12 GB	7B	4-bit
16 GB	7B	8-bit or 4-bit
24 GB	13B	8-bit or 4-bit
40+ GB	70B	4-bit (about 35 GB of weights)
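
The choice can also be made from the Step 1 formulas. The sketch below is a rough heuristic, not a bitsandbytes API: `recommend_quantization` and the 20% headroom default are assumptions chosen for illustration. It prefers 8-bit when the weights fit and falls back to 4-bit otherwise.

```python
from typing import Optional


def recommend_quantization(
    vram_gb: float, params_billions: float, headroom: float = 0.2
) -> Optional[str]:
    """Pick the highest-precision level whose weights fit in VRAM.

    headroom is the fraction of VRAM reserved for activations, the KV
    cache, and framework overhead (an assumed value, tune for your case).
    """
    usable_gb = vram_gb * (1.0 - headroom)
    # 1 byte/param for 8-bit, 0.5 bytes/param for 4-bit (weights only).
    for label, bytes_per_param in (("8-bit", 1.0), ("4-bit", 0.5)):
        if params_billions * bytes_per_param <= usable_gb:
            return label
    return None  # does not fit even in 4-bit without offloading


print(recommend_quantization(24, 13))  # 13 GB of 8-bit weights vs 19.2 GB usable
print(recommend_quantization(24, 70))  # 35 GB of 4-bit weights vs 19.2 GB usable
```

The table is more conservative than this heuristic on small GPUs, because long prompts and large batches need extra room beyond the weights.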

Step 3: Configure quantization

For 8-bit (better accuracy):

from transformers import BitsAndBytesConfig
import torch

config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # Outlier threshold
    llm_int8_has_fp16_weight=False
)

For 4-bit (maximum memory savings):

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16
    bnb_4bit_quant_type="nf4",  # NormalFloat4 (recommended)
    bnb_4bit_use_double_quant=True  # Nested quantization
)
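
To see what nested (double) quantization buys, the sketch below estimates the per-parameter overhead of the quantization constants. It assumes the block sizes described in the QLoRA paper (64 weights per block, 256 constants per second-level block); the function name is hypothetical and the numbers are a back-of-the-envelope estimate, not a measurement.

```python
def constant_overhead_bits(
    block_size: int = 64,
    double_quant: bool = False,
    second_block_size: int = 256,
) -> float:
    """Bits per parameter spent on quantization constants.

    Without double quantization, each block stores one FP32 constant.
    With it, constants are stored in 8 bits plus one FP32 constant per
    second-level block (assumed sizes, taken from the QLoRA paper).
    """
    if not double_quant:
        return 32 / block_size
    return 8 / block_size + 32 / (block_size * second_block_size)


plain = constant_overhead_bits()
nested = constant_overhead_bits(double_quant=True)
saved_gb_7b = (plain - nested) * 7e9 / 8 / 1e9  # bits -> bytes -> GB

print(f"without double quant: {plain:.3f} bits/param")
print(f"with double quant:    {nested:.3f} bits/param")
print(f"saving on a 7B model: {saved_gb_7b:.2f} GB")
```

Under these assumptions, double quantization is a memory optimization: it shrinks the constants and leaves the 4-bit weight values themselves unchanged.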

Step 4: Load and verify model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=config,
    device_map="auto",  # Automatic device placement
    torch_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

# Test inference (send inputs to the device the model was placed on)
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))

# Check memory
print(f"Memory allocated: {torch.cuda.memory_allocated()/1e9:.2f}GB")

Workflow 2: Fine-tune with QLoRA (4-bit training)

QLoRA enables fine-tuning large models on consumer GPUs.

Copy this checklist:

QLoRA Fine-tuning:
- [ ] Step 1: Install dependencies
- [ ] Step 2: Configure 4-bit base model
- [ ] Step 3: Add LoRA adapters
- [ ] Step 4: Train with standard Trainer

Step 1: Install dependencies

pip install bitsandbytes transformers peft accelerate datasets

Step 2: Configure 4-bit base model

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

Step 3: Add LoRA adapters

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare model for training
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected output (approximate): trainable params: 16.8M || all params: 6.7B || trainable%: 0.25
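
The trainable-parameter count for a LoRA configuration can be checked by hand. The sketch below assumes the Llama 2 7B shapes (hidden size 4096, 32 decoder layers, square attention projections); `lora_trainable_params` is an illustrative helper, not a PEFT function.

```python
def lora_trainable_params(rank, layer_shapes, num_layers):
    """Count LoRA parameters: each adapted (d_in, d_out) matrix adds
    an A matrix of rank x d_in and a B matrix of d_out x rank."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * num_layers


hidden = 4096                                   # assumed Llama 2 7B hidden size
attention_projections = [(hidden, hidden)] * 4  # q_proj, k_proj, v_proj, o_proj
trainable = lora_trainable_params(16, attention_projections, num_layers=32)

print(f"{trainable:,} trainable parameters")
print(f"{100 * trainable / 6.7e9:.2f}% of a ~6.7B-parameter base model")
```

Halving the rank or adapting only two of the four projections halves the count, which is a quick way to trade adapter capacity against memory.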

Step 4: Train with standard Trainer

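from transformers import AutoTokenizer, Trainer, TrainingArguments

# train_dataset is assumed to be a tokenized dataset prepared elsewhere (not shown)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")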

training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)

trainer.train()

# Save only the LoRA adapters (tens of MB, not the full model)
model.save_pretrained("./qlora-adapters")
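
The batch settings in `TrainingArguments` combine into an effective batch size. The sketch below works through that arithmetic for the values used above. The helper names and the 10,000-example dataset size are hypothetical, and the step count is an approximation because Trainer's exact rounding may differ.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Examples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_optimizer_steps(num_examples, effective_batch, epochs):
    """Approximate number of optimizer updates over the whole run."""
    return math.ceil(num_examples / effective_batch) * epochs


batch = effective_batch_size(per_device_batch=4, grad_accum_steps=4)
steps = approx_optimizer_steps(num_examples=10_000, effective_batch=batch, epochs=3)

print(f"effective batch size: {batch}")
print(f"approximate optimizer steps: {steps}")
```

If the run goes out of memory, lowering `per_device_train_batch_size` and raising `gradient_accumulation_steps` by the same factor keeps the effective batch size unchanged.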

Workflow 3: 8-bit optimizer for memory-efficient training

Use 8-bit Adam/AdamW to reduce optimizer memory by 75%.

8-bit Optimizer Setup:
- [ ] Step 1: Replace standard optimizer
- [ ] Step 2: Configure training
- [ ] Step 3: Monitor memory savings

Step 1: Replace standard optimizer

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("model-name")

# Select the 8-bit optimizer through TrainingArguments instead of building torch.optim.AdamW

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=8,
    optim="paged_adamw_8bit",  # 8-bit optimizer
    learning_rate=5e-5
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

trainer.train()

Manual optimizer usage:

import bitsandbytes as bnb

optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-8
)

# Training loop
for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Step 2: Configure training

Compare memory:

Standard AdamW optimizer memory = model_params × 8 bytes (states)
8-bit AdamW memory = model_params × 2 bytes
Savings = 75% optimizer memory

Example (Llama 2 7B):
Standard: 7B × 8 = 56 GB
8-bit: 7B × 2 = 14 GB
Savings: 42 GB
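
The comparison above follows from Adam keeping two state tensors per parameter (first and second moment). A minimal sketch of that arithmetic is below; `adam_state_memory_gb` is an illustrative name, and the estimate ignores the small per-block quantization constants that an 8-bit optimizer also stores.

```python
def adam_state_memory_gb(num_params, bytes_per_state, num_states=2):
    """Optimizer-state memory in GB for Adam-style optimizers,
    which keep num_states values (momentum and variance) per parameter."""
    return num_params * bytes_per_state * num_states / 1e9


standard = adam_state_memory_gb(7e9, bytes_per_state=4)   # FP32 states
eight_bit = adam_state_memory_gb(7e9, bytes_per_state=1)  # 8-bit states

print(f"standard AdamW: {standard:.0f} GB")
print(f"8-bit AdamW:    {eight_bit:.0f} GB")
print(f"savings:        {standard - eight_bit:.0f} GB ({1 - eight_bit / standard:.0%})")
```

These figures cover optimizer state only. Weights, gradients, and activations are separate and are not reduced by switching the optimizer.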

Step 3: Monitor memory savings

import torch

before = torch.cuda.memory_allocated()

# First training step: optimizer state is allocated lazily on this call
optimizer.step()

after = torch.cuda.memory_allocated()
print(f"Optimizer state allocated: {(after-before)/1e9:.2f}GB")

When to use vs alternatives

Use bitsandbytes when:

GPU memory limited (need to fit larger model)
Training with QLoRA (fine-tune 70B on single GPU)
Inference only (50-75% memory reduction)
Using HuggingFace Transformers
Acceptable 0-2% accuracy degradation

Use alternatives instead:

GPTQ/AWQ: Production serving (faster inference than bitsandbytes)
GGUF: CPU inference (llama.cpp)
FP8: H100 GPUs (hardware FP8 faster)
Full precision: Accuracy critical, memory not constrained

Common issues

Issue: CUDA error during loading

Check the CUDA setup and reinstall bitsandbytes:

# Check CUDA version
nvcc --version

# Reinstall bitsandbytes without a cached wheel
pip install bitsandbytes --no-cache-dir

# Run the built-in diagnostic
python -m bitsandbytes

Issue: Model loading slow

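For large models, cap per-device memory so that layers which do not fit on the GPU are placed on the CPU. This trades speed for capacity, and with 8-bit loading the config may also need llm_int8_enable_fp32_cpu_offload=True: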

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=config,
    device_map="auto",
    max_memory={0: "20GB", "cpu": "30GB"}  # Offload to CPU
)

Issue: Lower accuracy than expected

Try 8-bit instead of 4-bit:

config = BitsAndBytesConfig(load_in_8bit=True)
# 8-bit has <0.5% accuracy loss vs 1-2% for 4-bit

Or, if you stay with 4-bit, use NF4 rather than FP4:

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # Better than fp4
    bnb_4bit_use_double_quant=True  # Saves extra memory; does not improve accuracy
)

Issue: OOM even with 4-bit

Enable disk offload:

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=config,
    device_map="auto",
    offload_folder="offload",  # Disk offload
    offload_state_dict=True
)

Advanced topics

QLoRA training guide: See references/qlora-training.md for complete fine-tuning workflows, hyperparameter tuning, and multi-GPU training.

Quantization formats: See references/quantization-formats.md for INT8, NF4, FP4 comparison, double quantization, and custom quantization configs.

Memory optimization: See references/memory-optimization.md for CPU offloading strategies, gradient checkpointing, and memory profiling.

Hardware requirements

GPU: NVIDIA with compute capability 7.0+ (Volta, Turing, Ampere, Ada, Hopper)
VRAM: Depends on model and quantization
- 4-bit Llama 2 7B: 4GB
- 4-bit Llama 2 13B: 8GB
- 4-bit Llama 2 70B: about 35GB of weights, so a 40GB+ GPU
CUDA: 11.1+ (12.0+ recommended)
PyTorch: 2.0+

Supported platforms: NVIDIA GPUs (primary), AMD ROCm, Intel GPUs (experimental)

Resources

GitHub: https://github.com/bitsandbytes-foundation/bitsandbytes
HuggingFace docs: https://huggingface.co/docs/transformers/quantization/bitsandbytes
QLoRA paper: "QLoRA: Efficient Finetuning of Quantized LLMs" (2023)
LLM.int8() paper: "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022)

Source

https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/10-optimization/bitsandbytes/SKILL.md

Overview

Quantizes LLMs to 8-bit or 4-bit, reducing weight memory by 50-75% with minimal accuracy loss. Use this when GPU memory is limited or you need to fit larger models. It supports INT8, NF4, and FP4 quantization, QLoRA training, and 8-bit optimizers, and it works with HuggingFace Transformers.

How This Skill Works

bitsandbytes provides quantization of model weights to 8-bit or 4-bit representations, lowering memory usage during loading and inference. You configure a BitsAndBytesConfig (load_in_8bit or load_in_4bit) and optional compute dtype and quantization type, then load the pretrained model with quantization_config via transformers. This enables reduced memory footprint while preserving performance for large models.
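
As a toy illustration of what storing weights in fewer bits means, the sketch below applies simple absmax scaling to map floats to 8-bit integer codes and back. It is not the bitsandbytes implementation, which works on blocks of weights with outlier handling and, for 4-bit, an NF4 or FP4 codebook. All names here are illustrative.

```python
def quantize_absmax_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"largest round-trip error: {max_error:.5f} (bounded by scale/2 = {scale / 2:.5f})")
```

Each code fits in one byte instead of the two bytes of an FP16 value, which is where the roughly 50% saving for 8-bit comes from. The rounding error per weight is bounded by half the scale.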

When to Use It

GPU memory is limited and you need to run large LLMs
You can accept some inference-speed trade-off in exchange for memory savings (GPTQ/AWQ are faster for production serving)
You need to fit larger models on consumer GPUs
You plan to fine-tune using QLoRA or 8-bit optimizers
You’re using HuggingFace Transformers and require quantization support

Quick Start

Step 1: Install bitsandbytes, transformers, accelerate
Step 2: Load an 8-bit model with BitsAndBytesConfig(load_in_8bit=True) and from_pretrained(..., quantization_config=config)
Step 3: For 4-bit, configure 4-bit settings (compute dtype, NF4, double quant) and load the model with quantization_config

Best Practices

Estimate memory requirements using the guide’s memory equations before choosing 8-bit vs 4-bit
Match the quantization level to your GPU VRAM and model size (4-bit for maximum saving, 8-bit for better accuracy)
Configure BitsAndBytesConfig carefully (load_in_8bit or load_in_4bit, compute dtype, quant_type, and double quant as applicable)
Load and verify the quantized model on your device (device_map, torch_dtype) and test inference
Monitor memory usage and adjust quantization and precision if needed to balance speed and accuracy

Example Use Cases

Run a 7B LLM on a 16GB GPU using 8-bit quantization to halve memory usage
Quantize a 13B model to 4-bit with NF4 weights to fit consumer hardware
Use NF4 4-bit quantization to balance memory savings and precision for inference workloads
Fine-tune with QLoRA on a 4-bit base model on consumer GPUs
Deploy quantized models in HuggingFace Transformers for scalable inference services