gguf-quantization
GGUF - Quantization Format for llama.cpp
GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
When to use GGUF
Use GGUF when:
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Running CPU inference without a GPU
- Choosing among flexible quantization levels (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)
Key advantages:
- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD support
- No Python runtime: Pure C/C++ inference
- Flexible quantization: 2-8 bit with various methods (K-quants)
- Ecosystem support: LM Studio, Ollama, koboldcpp, and more
- imatrix: Importance matrix for better low-bit quality
Use alternatives instead:
- AWQ/GPTQ: Maximum accuracy with calibration on NVIDIA GPUs
- HQQ: Fast calibration-free quantization for HuggingFace
- bitsandbytes: Simple integration with transformers library
- TensorRT-LLM: Production NVIDIA deployment with maximum speed
Quick start
Installation
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build (CPU)
make
# Build with CUDA (NVIDIA)
make GGML_CUDA=1
# Build with Metal (Apple Silicon)
make GGML_METAL=1
# Install Python bindings (optional)
pip install llama-cpp-python
Convert model to GGUF
# Install requirements
pip install -r requirements.txt
# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf
# Or specify output type
python convert_hf_to_gguf.py ./path/to/model \
--outfile model-f16.gguf \
--outtype f16
Quantize model
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
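After quantizing, you can sanity-check the output file by reading the fixed GGUF header: a 4-byte `GGUF` magic, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key-value count (per the GGUF spec). A minimal sketch:

```python
import struct

def read_gguf_header(path):
    """Read the fixed GGUF header: magic, version, tensor count, metadata KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, = struct.unpack("<I", f.read(4))     # format version (little-endian)
        n_tensors, = struct.unpack("<Q", f.read(8))   # number of tensors in the file
        n_kv, = struct.unpack("<Q", f.read(8))        # number of metadata key-value pairs
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Example: read_gguf_header("model-q4_k_m.gguf")
```

A truncated or corrupted download will typically fail at the magic check before you waste time loading it.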
Run inference
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"
# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive
# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
Quantization types
K-quant methods (recommended)
| Type | Bits | Size (7B) | Quality | Use Case |
|---|---|---|---|---|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
Legacy methods
| Type | Description |
|---|---|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |
Recommendation: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
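The table's file sizes follow roughly from parameters × bits-per-weight / 8, plus a few percent of overhead for metadata and non-quantized tensors. A rough estimator (bits-per-weight values taken from the table above; the 5% overhead factor is an assumption, not an exact figure):

```python
# Approximate bits per weight for common GGUF quant types (from the table above)
BPW = {"Q2_K": 2.5, "Q3_K_S": 3.0, "Q3_K_M": 3.3, "Q4_K_S": 4.0, "Q4_K_M": 4.5,
       "Q5_K_S": 5.0, "Q5_K_M": 5.5, "Q6_K": 6.0, "Q8_0": 8.0, "F16": 16.0}

def estimate_size_gb(n_params: float, quant: str, overhead: float = 1.05) -> float:
    """Rough file size: params * bits-per-weight / 8, with ~5% assumed overhead."""
    return n_params * BPW[quant] / 8 * overhead / 1e9

print(f"7B @ Q4_K_M ≈ {estimate_size_gb(7e9, 'Q4_K_M'):.1f} GB")  # → 7B @ Q4_K_M ≈ 4.1 GB
```

This is handy for checking whether a given quantization will fit in RAM or VRAM before you run llama-quantize.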
Conversion workflows
Workflow 1: HuggingFace to GGUF
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
--outfile llama-3.1-8b-f16.gguf \
--outtype f16
# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
Workflow 2: With importance matrix (better quality)
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF
# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
-f calibration.txt \
--chunk 512 \
-o model.imatrix \
-ngl 35 # GPU layers if available
# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
model-f16.gguf \
model-q4_k_m.gguf \
Q4_K_M
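The heredoc in step 2 is only a placeholder; imatrix quality depends on feeding diverse text. A small sketch that assembles calibration.txt by sampling non-empty lines from your own corpus files (the source file names are placeholders):

```python
import random
from pathlib import Path

def build_calibration(sources, out_path="calibration.txt", max_lines=2000, seed=0):
    """Sample non-empty lines from several corpora so the calibration text stays diverse."""
    lines = []
    for src in sources:
        text = Path(src).read_text(encoding="utf-8", errors="ignore")
        lines.extend(l.strip() for l in text.splitlines() if l.strip())
    random.Random(seed).shuffle(lines)  # mix domains so no single source dominates
    Path(out_path).write_text("\n".join(lines[:max_lines]), encoding="utf-8")
    return min(len(lines), max_lines)

# Example (file names are placeholders for your own corpora):
# build_calibration(["wiki.txt", "code_samples.txt", "chat_logs.txt"])
```

Mixing domains (prose, code, dialogue) in the calibration text tends to help the imatrix generalize across prompts.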
Workflow 3: Multiple quantizations
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"
# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
Python usage
llama-cpp-python
from llama_cpp import Llama
# Load model
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096, # Context window
n_gpu_layers=35, # GPU offload (0 for CPU only)
n_threads=8 # CPU threads
)
# Generate
output = llm(
"What is machine learning?",
max_tokens=256,
temperature=0.7,
stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
Chat completion
from llama_cpp import Llama
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
chat_format="llama-3" # Or "chatml", "mistral", etc.
)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"}
]
response = llm.create_chat_completion(
messages=messages,
max_tokens=256,
temperature=0.7
)
print(response["choices"][0]["message"]["content"])
Streaming
from llama_cpp import Llama
llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)
# Stream tokens
for chunk in llm(
"Explain quantum computing:",
max_tokens=256,
stream=True
):
print(chunk["choices"][0]["text"], end="", flush=True)
Server mode
Start OpenAI-compatible server
# Start server
./llama-server -m model-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 35 \
-c 4096
# Or with Python bindings
python -m llama_cpp.server \
--model model-q4_k_m.gguf \
--n_gpu_layers 35 \
--host 0.0.0.0 \
--port 8080
Use with OpenAI client
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256
)
print(response.choices[0].message.content)
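If you would rather avoid the openai dependency, the server's OpenAI-compatible endpoint can be called with the standard library alone. A minimal sketch (the `/v1/chat/completions` path follows the OpenAI API convention the server emulates):

```python
import json
import urllib.request

def build_chat_payload(prompt, max_tokens=256):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": "local-model",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:8080/v1", max_tokens=256):
    """POST a chat completion to a running llama-server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt, max_tokens)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# print(chat("Hello!"))  # requires a running llama-server on port 8080
```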
Hardware optimization
Apple Silicon (Metal)
# Build with Metal
make clean && make GGML_METAL=1
# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
# Python with Metal
llm = Llama(
model_path="model.gguf",
n_gpu_layers=99, # Offload all layers
n_threads=1 # Metal handles parallelism
)
NVIDIA CUDA
# Build with CUDA
make clean && make GGML_CUDA=1
# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"
# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
CPU optimization
# Build with AVX2/AVX512
make clean && make
# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"
# Python CPU config
llm = Llama(
model_path="model.gguf",
n_gpu_layers=0, # CPU only
n_threads=8, # Match physical cores
n_batch=512 # Batch size for prompt processing
)
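To follow the "match physical cores" advice programmatically, one common heuristic is that with SMT/hyper-threading enabled the logical core count is roughly twice the physical count. A sketch under that assumption (it does not hold on all CPUs, e.g. many Apple Silicon and efficiency-core designs):

```python
import os

def suggested_threads(smt: bool = True) -> int:
    """Heuristic thread count for llama.cpp: physical cores, not logical.
    Assumes SMT doubles the logical core count when smt=True."""
    logical = os.cpu_count() or 1
    return max(1, logical // 2 if smt else logical)

# llm = Llama(model_path="model.gguf", n_gpu_layers=0, n_threads=suggested_threads())
```

When in doubt, benchmark a short prompt at a few thread counts; oversubscribing threads usually slows generation.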
Integration with tools
Ollama
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
# Create Ollama model
ollama create mymodel -f Modelfile
# Run
ollama run mymodel "Hello!"
LM Studio
- Place the GGUF file in ~/.cache/lm-studio/models/
- Open LM Studio and select the model
- Configure context length and GPU offload
- Start inference
text-generation-webui
# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/
# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
Best practices
- Use K-quants: Q4_K_M offers best quality/size balance
- Use imatrix: Always use importance matrix for Q4 and below
- GPU offload: Offload as many layers as VRAM allows
- Context length: Start with 4096, increase if needed
- Thread count: Match physical CPU cores, not logical
- Batch size: Increase n_batch for faster prompt processing
Common issues
Model loads slowly:
# Memory-mapped loading (mmap) is enabled by default; make sure --no-mmap is not set
./llama-cli -m model.gguf -p "Hello"
Out of memory:
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20 # Reduce from 35
# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
Poor quality at low bits:
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
References
- Advanced Usage - Batching, speculative decoding, custom builds
- Troubleshooting - Common issues, debugging, benchmarks
Resources
- Repository: https://github.com/ggml-org/llama.cpp
- Python Bindings: https://github.com/abetlen/llama-cpp-python
- Pre-quantized Models: https://huggingface.co/TheBloke
- GGUF Converter: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- License: MIT
Source
https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/10-optimization/gguf/SKILL.md
Overview
GGUF is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization. This skill covers when to use GGUF, how to convert models, and how to apply 2-8 bit quantization (Q2_K through Q8_0) without requiring a GPU.
How This Skill Works
GGUF packs quantized weights and model metadata into a single file, covering 2-8 bit quantization methods (K-quants); an importance matrix (imatrix) computed on calibration text can guide quantization to improve low-bit quality. You convert a HuggingFace model to GGUF, quantize with llama-quantize (optionally using an imatrix), and run inference with llama.cpp on CPU, Apple Silicon, or CUDA backends depending on your build.
When to Use It
- Deploying on consumer hardware like laptops/desktops
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Running CPU inference without a GPU
- Needing flexible 2-8 bit quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)
Quick Start
- Step 1: Install and build llama.cpp for your target hardware (CPU, CUDA, or Metal).
- Step 2: Convert a HuggingFace model to GGUF (FP16) using convert_hf_to_gguf.py to produce a .gguf file.
- Step 3: Quantize the GGUF model (for example to Q4_K_M) with llama-quantize, optionally creating an imatrix for better quality.
Best Practices
- Start with Q4_K_M (or Q5_K_M) for the best balance between quality and model size.
- Enable an imatrix when quantizing to Q4_K_M and below, where the quality gains are largest.
- Test performance on your target hardware (latency and memory) before deciding on a lower or higher bit precision.
- Match your build flags to your hardware (GGML_METAL for Apple Silicon, GGML_CUDA for NVIDIA, or CPU otherwise) before quantization.
- Leverage local tools and ecosystems (LM Studio, Ollama, koboldcpp, text-generation-webui) to simplify deployment.
Example Use Cases
- Offline chatbot on a consumer laptop using CPU-only inference with a Q4_K_M GGUF.
- MacBook with Apple Silicon running a quantized model via Metal acceleration in LM Studio or Ollama.
- Local inference for a small-scale assistant inside text-generation-webui.
- Ollama deployment of a 7B GGUF model on a desktop or laptop for offline use.
- koboldcpp-based experiment with a GGUF-quantized model for rapid prototyping.
Related Skills
quantizing-models-bitsandbytes
Orchestra-Research/AI-Research-SKILLs
Quantizes LLMs to 8-bit or 4-bit for 50-75% memory reduction with minimal accuracy loss. Use when GPU memory is limited, need to fit larger models, or want faster inference. Supports INT8, NF4, FP4 formats, QLoRA training, and 8-bit optimizers. Works with HuggingFace Transformers.
deepspeed
Orchestra-Research/AI-Research-SKILLs
Expert guidance for distributed training with DeepSpeed - ZeRO optimization stages, pipeline parallelism, FP16/BF16/FP8, 1-bit Adam, sparse attention
awq-quantization
Orchestra-Research/AI-Research-SKILLs
Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.
unsloth
Orchestra-Research/AI-Research-SKILLs
Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization
optimizing-attention-flash
Orchestra-Research/AI-Research-SKILLs
Optimizes transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Use when training/running transformers with long sequences (>512 tokens), encountering GPU memory issues with attention, or need faster inference. Supports PyTorch native SDPA, flash-attn library, H100 FP8, and sliding window attention.
gptq
Orchestra-Research/AI-Research-SKILLs
Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.