npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/peft --openclaw

PEFT (Parameter-Efficient Fine-Tuning)

Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.

When to use PEFT

Use PEFT/LoRA when:

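  • Fine-tuning 7B-70B models on limited hardware (a consumer RTX 4090 for the smaller models, A100-class GPUs for the larger ones)
  • Need to train <1% of parameters (adapter files of tens of MB vs ~16GB for a full 8B model in 16-bit)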
  • Want fast iteration with multiple task-specific adapters
  • Deploying multiple fine-tuned variants from one base model

Use QLoRA (PEFT + quantization) when:

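  • Fine-tuning models whose 16-bit weights do not fit in GPU memory (a 4-bit 70B model still needs roughly 35GB for weights alone, so plan for a 48GB GPU or several 24GB GPUs)
  • Memory is the primary constraint
  • Can accept a small quality trade-off vs full fine-tuning (about 1 MMLU point in the benchmarks below)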

Use full fine-tuning instead when:

  • Training small models (<1B parameters)
  • Need maximum quality and have compute budget
  • Significant domain shift requires updating all weights

Quick start

Installation

# Basic installation
pip install peft

# With quantization support (recommended)
pip install peft bitsandbytes

# Full stack
pip install peft transformers accelerate bitsandbytes datasets

LoRA fine-tuning (standard)

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank (8-64, higher = more capacity)
    lora_alpha=32,                 # Scaling factor (typically 2*r)
    lora_dropout=0.05,             # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    bias="none"                    # Don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output: trainable params: 13,631,488 || all params: ~8.04B || trainable%: ~0.17

# Prepare dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# Training
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,  # matches the bf16 checkpoint; switch to fp16=True on GPUs without bf16 support
    logging_steps=10,
    save_strategy="epoch"
)

# The tokenized dataset holds Python lists, so torch.stack would fail on it.
# This collator builds tensors and copies input_ids into labels (pad positions become -100).
from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()

# Save adapter only (tens of MB vs ~16GB for the full model)
model.save_pretrained("./lora-llama-adapter")
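As a rough check on adapter size, the saved file is approximately the trainable-parameter count times the bytes stored per parameter. The sketch below uses the 13,631,488 trainable parameters reported above; the helper name `adapter_size_mb` is hypothetical, and file-format overhead is ignored.

```python
def adapter_size_mb(trainable_params: int, bytes_per_param: int) -> float:
    """Approximate adapter checkpoint size in megabytes (10^6 bytes)."""
    return trainable_params * bytes_per_param / 1e6

TRAINABLE = 13_631_488  # from print_trainable_parameters() above

size_16bit = adapter_size_mb(TRAINABLE, 2)  # fp16 / bf16 storage
size_32bit = adapter_size_mb(TRAINABLE, 4)  # fp32 storage
print(f"16-bit: {size_16bit:.1f} MB, 32-bit: {size_32bit:.1f} MB")
```

Either way the adapter is hundreds of times smaller than the base checkpoint, which is what makes frequent saving and per-task adapters cheap.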

QLoRA fine-tuning (memory-efficient)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype="bfloat16",   # Compute in bf16
    bnb_4bit_use_double_quant=True       # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training (enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA config for QLoRA
lora_config = LoraConfig(
    r=64,                              # Higher rank for 70B
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
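# Only the 4-bit base weights (~35GB for 70B) and the small adapters are held in memory.
# That fits a 48GB GPU; with smaller cards, device_map="auto" shards it across several GPUs.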
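A back-of-the-envelope estimate helps decide whether a quantized model can fit a given GPU. The sketch below counts weight storage only; the helper name `weight_memory_gb` is hypothetical, and it deliberately ignores quantization constants, activations, adapter weights, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for model weights only, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for name, params in [("8B", 8e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(params, 16)
    nf4 = weight_memory_gb(params, 4)
    print(f"{name}: {fp16:.0f} GB in 16-bit, {nf4:.0f} GB in 4-bit")
```

By this estimate an 8B model in 4-bit leaves ample headroom on a 24GB card, while a 70B model in 4-bit already exceeds 24GB before any training overhead is added.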

LoRA parameter selection

Rank (r) - capacity vs efficiency

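| Rank | Trainable Params | Memory | Quality | Use Case |
|------|------------------|--------|---------|----------|
| 4 | ~3M | Minimal | Lower | Simple tasks, prototyping |
| 8 | ~7M | Low | Good | Recommended starting point |
| 16 | ~14M | Medium | Better | General fine-tuning |
| 32 | ~27M | Higher | High | Complex tasks |
| 64 | ~54M | High | Highest | Domain adaptation, 70B models |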
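The parameter counts in this table can be reproduced by hand: each adapted linear layer adds a rank x in matrix and an out x rank matrix. The sketch below assumes Llama 3.1 8B shapes (hidden size 4096, 32 layers, 1024-dimensional key/value projections) and the four attention projections used in the quick start; the names `ATTN_SHAPES` and `lora_params` are illustrative.

```python
# Assumed Llama 3.1 8B shapes: hidden size 4096, 32 layers,
# and 1024-dim k/v projections (8 KV heads x 128 head dim).
HIDDEN, KV_DIM, LAYERS = 4096, 1024, 32

# (in_features, out_features) of each targeted attention projection
ATTN_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

def lora_params(rank: int) -> int:
    """A is (rank x in), B is (out x rank): rank * (in + out) per module."""
    per_layer = sum(rank * (i + o) for i, o in ATTN_SHAPES.values())
    return per_layer * LAYERS

for r in (4, 8, 16, 32, 64):
    print(f"r={r:<2} -> {lora_params(r):,} trainable parameters")
```

The count grows linearly with rank, so doubling r doubles adapter size and the optimizer state kept for it. Targeting the MLP projections as well would add further terms to `ATTN_SHAPES`.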

Alpha (lora_alpha) - scaling factor

# The adapter update is scaled by lora_alpha / r. Rule of thumb: alpha = 2 * rank
LoraConfig(r=16, lora_alpha=32)  # Standard (scale 2.0)
LoraConfig(r=16, lora_alpha=16)  # Conservative (scale 1.0, weaker adapter update)
LoraConfig(r=16, lora_alpha=64)  # Aggressive (scale 4.0, stronger adapter update)
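To make the effect of `lora_alpha` concrete, the sketch below computes the multiplier applied to the low-rank update under standard LoRA scaling (alpha divided by rank). The helper name `lora_scaling` is hypothetical, and variants such as rank-stabilized LoRA use a different formula, which is not covered here.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Multiplier applied to the low-rank update B @ A (standard LoRA)."""
    return lora_alpha / r

for alpha in (16, 32, 64):
    print(f"r=16, lora_alpha={alpha}: update scaled by {lora_scaling(16, alpha)}")

# Keeping alpha = 2 * r holds the scale constant as the rank changes
scales = [lora_scaling(r, 2 * r) for r in (8, 16, 32, 64)]
print(scales)
```

Because the scale stays fixed when alpha tracks the rank, changing r under the 2x rule mainly changes capacity rather than the effective step size of the update.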

Target modules by architecture

# Llama / Mistral / Qwen
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# GPT-2 (GPT-Neo names its attention projections q_proj, k_proj, v_proj, out_proj)
target_modules = ["c_attn", "c_proj", "c_fc"]

# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# BLOOM
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Auto-detect all linear layers
target_modules = "all-linear"  # PEFT 0.8.0+
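For an architecture not listed here, candidate `target_modules` values can be derived from the model's own layer names. The sketch below is a minimal, framework-free illustration: it assumes you already have the qualified names of the linear layers (for example, collected from `model.named_modules()` while filtering on `torch.nn.Linear`), and the helper name and sample names are hypothetical.

```python
def candidate_target_modules(linear_names, exclude=("lm_head",)):
    """Return the unique last path components of linear-layer names."""
    suffixes = {name.rsplit(".", 1)[-1] for name in linear_names}
    return sorted(suffixes - set(exclude))

# Hypothetical names, shaped like the output of:
#   [n for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]
targets = candidate_target_modules(names)
print(targets)
```

The output head is excluded by default in this sketch because adapting it is usually a separate decision from adapting the transformer blocks.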

Loading and merging adapters

Load trained adapter

from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM

# Option 1: Load with PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")

# Option 2: Load directly (recommended)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-llama-adapter",
    device_map="auto"
)

Merge adapter into base model

# Merge for deployment (no adapter overhead)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")

# Push to Hub
merged_model.push_to_hub("username/llama-finetuned")
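Conceptually, merging folds the scaled low-rank update into the frozen weight, so inference needs one matrix multiply instead of a base path plus an adapter path. The sketch below shows this on tiny hand-written matrices in plain Python; it illustrates the arithmetic only and is not how the library implements `merge_and_unload()`. All values and helper names are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]

r, lora_alpha = 1, 2
W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (out x in)
B = [[0.5], [-1.0]]            # LoRA B (out x r)
A = [[2.0, 0.0]]               # LoRA A (r x in)
x = [[1.0], [1.0]]             # one input column vector

delta = scale(matmul(B, A), lora_alpha / r)

# Unmerged: base path plus adapter path, computed separately
unmerged = add(matmul(W, x), matmul(delta, x))

# Merged: fold the update into the weight once, then a single matmul
W_merged = add(W, delta)
merged = matmul(W_merged, x)

print(W_merged, merged)
```

Both paths give the same output, which is why a merged model has no adapter overhead at inference time. The cost is that the adapter can no longer be swapped or disabled without reloading the base weights.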

Multi-adapter serving

from peft import AutoPeftModelForCausalLM

# Load base with first adapter, naming it so it can be selected later
model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1", adapter_name="task1")

# Load additional adapters
model.load_adapter("./adapter-task2", adapter_name="task2")
model.load_adapter("./adapter-task3", adapter_name="task3")

# Switch between adapters at runtime
model.set_adapter("task1")  # Use task1 adapter
output1 = model.generate(**inputs)

model.set_adapter("task2")  # Switch to task2
output2 = model.generate(**inputs)

# Disable adapters (use base model)
with model.disable_adapter():
    base_output = model.generate(**inputs)

PEFT methods comparison

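| Method | Trainable % | Memory | Speed | Best For |
|--------|-------------|--------|-------|----------|
| LoRA | 0.1-1% | Low | Fast | General fine-tuning |
| QLoRA | 0.1-1% | Very Low | Medium | Memory-constrained |
| AdaLoRA | 0.1-1% | Low | Medium | Automatic rank selection |
| IA3 | 0.01% | Minimal | Fastest | Few-shot adaptation |
| Prefix Tuning | 0.1% | Low | Medium | Generation control |
| Prompt Tuning | 0.001% | Minimal | Fast | Simple task adaptation |
| P-Tuning v2 | 0.1% | Low | Medium | NLU tasks |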

IA3 (minimal parameters)

from peft import IA3Config

ia3_config = IA3Config(
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
    feedforward_modules=["down_proj"]
)
model = get_peft_model(model, ia3_config)
# Trains only 0.01% of parameters!

Prefix Tuning

from peft import PrefixTuningConfig

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,      # Prepended tokens
    prefix_projection=True       # Use MLP projection
)
model = get_peft_model(model, prefix_config)

Integration patterns

With TRL (SFTTrainer)

from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", max_seq_length=512),  # newer TRL releases may name this max_length
    train_dataset=dataset,
    peft_config=lora_config,  # Pass LoRA config directly
)
trainer.train()

With Axolotl (YAML config)

# axolotl config.yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_target_linear: true  # Target all linear layers

With vLLM (inference)

from vllm import LLM
from vllm.lora.request import LoRARequest

# Load base model with LoRA support
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

# Serve with adapter
outputs = llm.generate(
    prompts,
    lora_request=LoRARequest("adapter1", 1, "./lora-adapter")
)

Performance benchmarks

Memory usage (Llama 3.1 8B)

| Method | GPU Memory | Trainable Params |
|--------|------------|------------------|
| Full fine-tuning | 60+ GB | 8B (100%) |
| LoRA r=16 | 18 GB | 14M (0.17%) |
| QLoRA r=16 | 6 GB | 14M (0.17%) |
| IA3 | 16 GB | 800K (0.01%) |

Training speed (A100 80GB)

| Method | Tokens/sec | vs Full FT |
|--------|------------|------------|
| Full FT | 2,500 | 1x |
| LoRA | 3,200 | 1.3x |
| QLoRA | 2,100 | 0.84x |

Quality (MMLU benchmark)

| Model | Full FT | LoRA | QLoRA |
|-------|---------|------|-------|
| Llama 2-7B | 45.3 | 44.8 | 44.1 |
| Llama 2-13B | 54.8 | 54.2 | 53.5 |
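To put the quality gap in relative terms, the sketch below computes the percentage drop of LoRA and QLoRA against full fine-tuning, using only the MMLU scores in the table above. The helper name `relative_drop` is hypothetical.

```python
# MMLU scores copied from the table above: (full FT, LoRA, QLoRA)
scores = {
    "Llama 2-7B": (45.3, 44.8, 44.1),
    "Llama 2-13B": (54.8, 54.2, 53.5),
}

def relative_drop(full: float, other: float) -> float:
    """Percentage drop of `other` relative to the full fine-tuning score."""
    return round((full - other) / full * 100, 1)

for model_name, (full, lora, qlora) in scores.items():
    print(f"{model_name}: LoRA -{relative_drop(full, lora)}%, "
          f"QLoRA -{relative_drop(full, qlora)}%")
```

On these figures the relative drop is about 1% for LoRA and under 3% for QLoRA, though a single benchmark says little about behavior on other tasks.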

Common issues

CUDA OOM during training

# Solution 1: Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size + increase accumulation
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16
)

# Solution 3: Use QLoRA
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
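When trading batch size for accumulation steps (Solution 2), it helps to confirm that the effective batch size per optimizer step is unchanged, so the learning rate does not need retuning. A minimal sketch, with a hypothetical helper name:

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_gpus: int = 1) -> int:
    """Number of sequences contributing to each optimizer step."""
    return per_device * accumulation_steps * num_gpus

quick_start = effective_batch_size(4, 4)   # settings from the LoRA quick start
low_memory = effective_batch_size(1, 16)   # settings from Solution 2
print(quick_start, low_memory)
```

Both configurations update on 16 sequences per step; the low-memory one holds activations for a single sequence at a time, at the cost of more forward and backward passes per step.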

Adapter not applying

# Verify adapter is active
print(model.active_adapters)  # Should show adapter name

# Check trainable parameters
model.print_trainable_parameters()

# Ensure model in training mode
model.train()

Quality degradation

# Increase rank
LoraConfig(r=32, lora_alpha=64)

# Target more modules
target_modules = "all-linear"

# Use more training data and epochs
TrainingArguments(num_train_epochs=5)

# Lower learning rate
TrainingArguments(learning_rate=1e-4)

Best practices

  1. Start with r=8-16; increase if quality is insufficient
  2. Use alpha = 2 * rank as starting point
  3. Target attention + MLP layers for best quality/efficiency
  4. Enable gradient checkpointing for memory savings
  5. Save adapters frequently (small files, easy rollback)
  6. Evaluate on held-out data before merging
  7. Use QLoRA when a model's 16-bit weights do not fit in GPU memory (e.g., 70B-class models)

Source

https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/03-fine-tuning/peft/SKILL.md

Overview

PEFT enables fine-tuning large language models by training less than 1% of parameters using LoRA, QLoRA and 25+ adapter methods. It targets big models (7B–70B) on memory-constrained hardware, supports multi-adapter workflows, and integrates with HuggingFace transformers.

How This Skill Works

Adapters are added to a base model and the original weights remain frozen. LoRA trains low-rank updates to attention projections (e.g., q_proj, v_proj, k_proj, o_proj), while QLoRA adds quantization to fit larger models on smaller GPUs. The peft library manages applying, training, and saving these adapters within the transformers ecosystem.

When to Use It

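  • Fine-tuning 7B–70B models on limited hardware (e.g., a consumer RTX 4090 for the smaller models, A100-class GPUs for the larger ones)
  • Need to train <1% of parameters (e.g., adapter files of tens of MB vs ~16GB for a full 8B model in 16-bit)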
  • Want fast iteration with multiple task-specific adapters
  • Deploying multiple fine-tuned variants from one base model
  • Memory-constrained tuning of very large models (use QLoRA when needed)

Quick Start

  1. Install the library and dependencies:
     pip install peft
     # With quantization support (recommended)
     pip install peft bitsandbytes
     # Full stack
     pip install peft transformers accelerate bitsandbytes datasets
  2. LoRA fine-tuning (standard): load the base model, set LoraConfig (r, lora_alpha, lora_dropout, target_modules), apply it with get_peft_model, train with Trainer, and monitor the trainable parameters (conceptual summary; see SKILL.md for details)
  3. Save the adapter only: after training, save just the adapter as a lightweight artifact:
     model.save_pretrained("./lora-llama-adapter")

Best Practices

  • Prefer PEFT methods like LoRA for very large models on limited GPU memory by freezing base weights and training adapters
  • Tune LoRA hyperparameters (r, lora_alpha, lora_dropout) and specify target_modules (e.g., attention projections)
  • Train multiple task-specific adapters to enable quick switching between tasks
  • Save and load only adapters when possible to minimize storage (adapters are small artifacts)
  • Ensure dependencies align with the ecosystem: peft>=0.13.0, transformers>=4.45.0, torch>=2.0.0, bitsandbytes>=0.43.0

Example Use Cases

  • LoRA fine-tuning of Llama-3.1-8B on an RTX 4090; adapters cover ~0.17% of total parameters (~13.6M out of ~8B) and are saved as a small adapter file
  • Creating multiple task-specific adapters from one base model to support different environments without retraining the full model
  • QLoRA setup for very large models (70B) when memory is the primary constraint; the 4-bit weights alone take about 35GB, so this needs a 48GB GPU or several 24GB GPUs
  • Saving only the adapter after training to a dedicated path (e.g., ./lora-llama-adapter) to keep storage footprint small
  • Full stack setup via HuggingFace/PEFT ecosystem: installing peft, transformers, accelerate, bitsandbytes and datasets for end-to-end fine-tuning workflows


