implementing-llms-litgpt
npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/litgpt --openclaw
LitGPT - Clean LLM Implementations
Quick start
LitGPT provides 20+ pretrained LLM implementations with clean, readable code and production-ready training workflows.
Installation:
pip install 'litgpt[extra]'
Load and use any model:
from litgpt import LLM
# Load pretrained model
llm = LLM.load("microsoft/phi-2")
# Generate text
result = llm.generate(
"What is the capital of France?",
max_new_tokens=50,
temperature=0.7
)
print(result)
List available models:
litgpt download list
Common workflows
Workflow 1: Fine-tune on custom dataset
Copy this checklist:
Fine-Tuning Setup:
- [ ] Step 1: Download pretrained model
- [ ] Step 2: Prepare dataset
- [ ] Step 3: Configure training
- [ ] Step 4: Run fine-tuning
Step 1: Download pretrained model
# Download Llama 3 8B
litgpt download meta-llama/Meta-Llama-3-8B
# Download Phi-2 (smaller, faster)
litgpt download microsoft/phi-2
# Download Gemma 2B
litgpt download google/gemma-2b
Models are saved to the checkpoints/ directory.
Step 2: Prepare dataset
LitGPT supports multiple formats:
Alpaca format (instruction-response):
[
{
"instruction": "What is the capital of France?",
"input": "",
"output": "The capital of France is Paris."
},
{
"instruction": "Translate to Spanish: Hello, how are you?",
"input": "",
"output": "Hola, ¿cómo estás?"
}
]
Save as data/my_dataset.json.
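If you build the dataset programmatically, a minimal sketch along these lines (the records shown are placeholders for your own instruction/response pairs) writes the file in the expected layout:
import json
from pathlib import Path
# Placeholder records; substitute your own instruction/response pairs.
records = [
    {"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."},
    {"instruction": "Translate to Spanish: Hello, how are you?", "input": "", "output": "Hola, ¿cómo estás?"},
]
# Each entry needs all three Alpaca keys, even when "input" is empty.
for record in records:
    assert {"instruction", "input", "output"} <= record.keys()
Path("data").mkdir(exist_ok=True)
with open("data/my_dataset.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)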
Step 3: Configure training
# Full fine-tuning (requires 40GB+ GPU for 7B models)
litgpt finetune \
meta-llama/Meta-Llama-3-8B \
--data JSON \
--data.json_path data/my_dataset.json \
--train.max_steps 1000 \
--train.learning_rate 2e-5 \
--train.micro_batch_size 1 \
--train.global_batch_size 16
# LoRA fine-tuning (efficient, 16GB GPU)
litgpt finetune_lora \
microsoft/phi-2 \
--data JSON \
--data.json_path data/my_dataset.json \
--lora_r 16 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--train.max_steps 1000 \
--train.learning_rate 1e-4
Step 4: Run fine-tuning
Training saves checkpoints to out/finetune/ automatically.
Monitor training:
# View logs
tail -f out/finetune/logs.txt
# TensorBoard (if using --train.logger_name tensorboard)
tensorboard --logdir out/finetune/lightning_logs
Workflow 2: LoRA fine-tuning on single GPU
Most memory-efficient option.
LoRA Training:
- [ ] Step 1: Choose base model
- [ ] Step 2: Configure LoRA parameters
- [ ] Step 3: Train with LoRA
- [ ] Step 4: Merge LoRA weights (optional)
Step 1: Choose base model
For limited GPU memory (12-16GB):
- Phi-2 (2.7B) - Best quality/size tradeoff
- Llama 3.2 1B - Smallest, fastest
- Gemma 2B - Good reasoning
Step 2: Configure LoRA parameters
# lora_r: rank (8-64, higher = more capacity); lora_alpha: scaling (typically 2×r);
# lora_dropout: guards against overfitting.
# LoRA is applied to the query, value, and output projections below;
# the key, MLP, and head projections are usually not needed.
litgpt finetune_lora \
microsoft/phi-2 \
--data JSON \
--data.json_path data/my_dataset.json \
--lora_r 16 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--lora_query true \
--lora_key false \
--lora_value true \
--lora_projection true \
--lora_mlp false \
--lora_head false
LoRA rank guide:
- r=8: Lightweight, 2-4MB adapters
- r=16: Standard, good quality
- r=32: High capacity, use for complex tasks
- r=64: Maximum quality, 4× larger adapters
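For intuition on those sizes: each adapted weight matrix of shape (d_out, d_in) adds r × (d_in + d_out) trainable parameters, so adapter size grows linearly with r. A rough estimate (the hidden size, layer count, and projection choices below are illustrative; actual file sizes depend on which projections you enable and on the checkpoint dtype):
# Rough LoRA size estimate: each adapted (d_out x d_in) matrix adds
# r * (d_in + d_out) trainable parameters.
def lora_params(r, d_in, d_out, n_layers, matrices_per_layer):
    return r * (d_in + d_out) * n_layers * matrices_per_layer
# Illustrative dimensions (not exact Phi-2 values): hidden size 2560,
# 32 layers, LoRA on the query, value, and output projections.
for r in (8, 16, 32, 64):
    n = lora_params(r, d_in=2560, d_out=2560, n_layers=32, matrices_per_layer=3)
    print(f"r={r}: ~{n / 1e6:.1f}M trainable params (~{2 * n / 2**20:.0f} MiB in bf16)")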
Step 3: Train with LoRA
litgpt finetune_lora \
microsoft/phi-2 \
--data JSON \
--data.json_path data/my_dataset.json \
--lora_r 16 \
--train.epochs 3 \
--train.learning_rate 1e-4 \
--train.micro_batch_size 4 \
--train.global_batch_size 32 \
--out_dir out/phi2-lora
# Memory usage: ~8-12GB for Phi-2 with LoRA
Step 4: Merge LoRA weights (optional)
Merge LoRA adapters into base model for deployment:
litgpt merge_lora \
out/phi2-lora/final \
--out_dir out/phi2-merged
Now use merged model:
from litgpt import LLM
llm = LLM.load("out/phi2-merged")
Workflow 3: Pretrain from scratch
Train a new model on your domain data.
Pretraining:
- [ ] Step 1: Prepare pretraining dataset
- [ ] Step 2: Configure model architecture
- [ ] Step 3: Set up multi-GPU training
- [ ] Step 4: Launch pretraining
Step 1: Prepare pretraining dataset
LitGPT expects tokenized data. Use prepare_dataset.py:
python scripts/prepare_dataset.py \
--source_path data/my_corpus.txt \
--checkpoint_dir checkpoints/tokenizer \
--destination_path data/pretrain \
--split train,val
Step 2: Configure model architecture
Edit config file or use existing:
# config/pythia-160m.yaml
model_name: pythia-160m
block_size: 2048
vocab_size: 50304
n_layer: 12
n_head: 12
n_embd: 768
rotary_percentage: 0.25
parallel_residual: true
bias: true
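As a sanity check on a config like this, you can approximate the parameter count from the hyperparameters alone (a standard GPT-style rule of thumb, not LitGPT's exact accounting; assumes untied input/output embeddings):
# Rough GPT-style parameter count: ~12 * n_layer * n_embd^2 for the
# transformer blocks, plus input and output embedding tables.
def approx_params(n_layer, n_embd, vocab_size, tied_embeddings=False):
    blocks = 12 * n_layer * n_embd ** 2
    embeddings = vocab_size * n_embd * (1 if tied_embeddings else 2)
    return blocks + embeddings
# Values from the pythia-160m config above.
total = approx_params(n_layer=12, n_embd=768, vocab_size=50304)
print(f"~{total / 1e6:.0f}M parameters")  # lands near the nominal 160M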
Step 3: Set up multi-GPU training
# Single GPU
litgpt pretrain \
--config config/pythia-160m.yaml \
--data.data_dir data/pretrain \
--train.max_tokens 10_000_000_000
# Multi-GPU with FSDP
litgpt pretrain \
--config config/pythia-1b.yaml \
--data.data_dir data/pretrain \
--devices 8 \
--train.max_tokens 100_000_000_000
Step 4: Launch pretraining
For large-scale pretraining on cluster:
# Using SLURM
sbatch --nodes=8 --gpus-per-node=8 \
pretrain_script.sh
# pretrain_script.sh content:
litgpt pretrain \
--config config/pythia-1b.yaml \
--data.data_dir /shared/data/pretrain \
--devices 8 \
--num_nodes 8 \
--train.global_batch_size 512 \
--train.max_tokens 300_000_000_000
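Before launching, it helps to relate the token budget to a step count: each optimizer step consumes global_batch_size × block_size tokens (assuming a 2048-token block size), so the run length follows directly.
# Back-of-the-envelope step count for the pretraining run above.
global_batch_size = 512            # sequences per optimizer step
block_size = 2048                  # tokens per sequence (assumed)
max_tokens = 300_000_000_000       # --train.max_tokens budget
tokens_per_step = global_batch_size * block_size
steps = max_tokens / tokens_per_step
print(f"{tokens_per_step:,} tokens/step -> ~{steps:,.0f} optimizer steps")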
Workflow 4: Convert and deploy model
Export LitGPT models for production.
Model Deployment:
- [ ] Step 1: Test inference locally
- [ ] Step 2: Quantize model (optional)
- [ ] Step 3: Convert to GGUF (for llama.cpp)
- [ ] Step 4: Deploy with API
Step 1: Test inference locally
from litgpt import LLM
llm = LLM.load("out/phi2-lora/final")
# Single generation
print(llm.generate("What is machine learning?"))
# Streaming
for token in llm.generate("Explain quantum computing", stream=True):
print(token, end="", flush=True)
# Batch inference
prompts = ["Hello", "Goodbye", "Thank you"]
results = [llm.generate(p) for p in prompts]
Step 2: Quantize model (optional)
Reduce model size with minimal quality loss:
# 8-bit quantization (50% size reduction)
litgpt convert_lit_checkpoint \
out/phi2-lora/final \
--dtype bfloat16 \
--quantize bnb.int8
# 4-bit quantization (75% size reduction)
litgpt convert_lit_checkpoint \
out/phi2-lora/final \
--quantize bnb.nf4-dq # Double quantization
Step 3: Convert to GGUF (for llama.cpp)
python scripts/convert_lit_checkpoint.py \
--checkpoint_path out/phi2-lora/final \
--output_path models/phi2.gguf \
--model_name microsoft/phi-2
Step 4: Deploy with API
from fastapi import FastAPI
from litgpt import LLM
app = FastAPI()
llm = LLM.load("out/phi2-lora/final")
@app.post("/generate")
def generate(prompt: str, max_tokens: int = 100):
result = llm.generate(
prompt,
max_new_tokens=max_tokens,
temperature=0.7
)
return {"response": result}
# Run: uvicorn api:app --host 0.0.0.0 --port 8000
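A minimal client sketch for the service above (assuming it runs on localhost:8000; because prompt and max_tokens are plain function arguments, FastAPI expects them as query parameters rather than a JSON body):
import requests
# Call the /generate endpoint defined above.
resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "What is machine learning?", "max_tokens": 100},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["response"])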
When to use vs alternatives
Use LitGPT when:
- Want to understand LLM architectures (clean, readable code)
- Need production-ready training recipes
- Educational purposes or research
- Prototyping new model ideas
- Lightning ecosystem user
Use alternatives instead:
- Axolotl/TRL: More fine-tuning features, YAML configs
- Megatron-Core: Maximum performance for >70B models
- HuggingFace Transformers: Broadest model support
- vLLM: Inference-only (no training)
Common issues
Issue: Out of memory during fine-tuning
Use LoRA instead of full fine-tuning:
# Instead of litgpt finetune (requires 40GB+)
litgpt finetune_lora # Only needs 12-16GB
Or reduce per-step memory with gradient accumulation:
litgpt finetune_lora \
... \
--train.gradient_accumulation_iters 4 # Accumulate gradients
Issue: Training too slow
Enable Flash Attention (built-in, automatic on compatible hardware):
# Already enabled by default on Ampere+ GPUs (A100, RTX 30/40 series)
# No configuration needed
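If you are unsure whether your GPU qualifies, checking the CUDA compute capability is a rough heuristic (Ampere and newer report major version 8 or higher):
import torch
# Rough heuristic: fused flash-attention kernels are generally available
# on GPUs with compute capability >= 8.0 (Ampere and newer).
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}")
    print("Flash attention likely available" if major >= 8 else "Pre-Ampere GPU: expect slower attention")
else:
    print("No CUDA GPU detected")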
Use smaller micro-batch and accumulate:
--train.micro_batch_size 1 \
--train.global_batch_size 32 \
--train.gradient_accumulation_iters 32 # Effective batch=32
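The arithmetic behind those flags is easy to check by hand; a sketch assuming the conventional relationship effective batch = micro-batch × accumulation steps × devices:
# Conventional effective-batch arithmetic; LitGPT derives the accumulation
# count from the global and micro batch sizes internally.
micro_batch_size = 1
devices = 1
global_batch_size = 32
gradient_accumulation_iters = global_batch_size // (micro_batch_size * devices)
effective_batch = micro_batch_size * gradient_accumulation_iters * devices
print(f"accumulate {gradient_accumulation_iters} micro-batches -> effective batch {effective_batch}")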
Issue: Model not loading
Check model name:
# List all available models
litgpt download list
# Download if not exists
litgpt download meta-llama/Meta-Llama-3-8B
Verify checkpoints directory:
ls checkpoints/
# Should see: meta-llama/Meta-Llama-3-8B/
Issue: LoRA adapters too large
Reduce LoRA rank:
--lora_r 8 # Instead of 16 or 32
Apply LoRA to fewer projections (disable the output projection and MLP):
--lora_query true \
--lora_value true \
--lora_projection false \
--lora_mlp false
Advanced topics
Supported architectures: See references/supported-models.md for complete list of 20+ model families with sizes and capabilities.
Training recipes: See references/training-recipes.md for proven hyperparameter configurations for pretraining and fine-tuning.
FSDP configuration: See references/distributed-training.md for multi-GPU training with Fully Sharded Data Parallel.
Custom architectures: See references/custom-models.md for implementing new model architectures in LitGPT style.
Hardware requirements
- GPU: NVIDIA (CUDA 11.8+), AMD (ROCm), Apple Silicon (MPS)
- Memory:
- Inference (Phi-2): 6GB
- LoRA fine-tuning (7B): 16GB
- Full fine-tuning (7B): 40GB+
- Pretraining (1B): 24GB
- Storage: 5-50GB per model (depending on size)
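Those figures line up with a simple rule of thumb: 16-bit weights need about 2 bytes per parameter, with activations, KV cache, and optimizer state on top. A hedged back-of-the-envelope sketch:
# Rule-of-thumb weight memory (ignores activations, KV cache, and
# optimizer state, so treat these as lower bounds).
def weight_gb(params_billions, bytes_per_param=2):
    return params_billions * bytes_per_param  # billions of params * bytes ~= GB
print(f"Phi-2 (2.7B) bf16 weights: ~{weight_gb(2.7):.1f} GB")  # vs. the ~6GB inference figure above
print(f"7B model bf16 weights:     ~{weight_gb(7.0):.1f} GB")  # base memory before LoRA overhead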
Resources
- GitHub: https://github.com/Lightning-AI/litgpt
- Docs: https://lightning.ai/docs/litgpt
- Tutorials: https://lightning.ai/docs/litgpt/tutorials
- Model zoo: 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral, Mixtral, Falcon, etc.)
Source
https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/01-model-architecture/litgpt/SKILL.md
Overview
Implements and trains LLMs using Lightning AI's LitGPT across 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral). Ideal for clean, educational architecture exploration or production fine-tuning with LoRA/QLoRA, all via single-file implementations without abstraction layers.
How This Skill Works
LitGPT offers readable, production-ready training workflows for multiple pretrained LLMs. You load a model with LLM.load, then perform full fine-tuning or LoRA/QLoRA-based fine-tuning, saving checkpoints automatically; you can download models, prepare data in Alpaca JSON format, and monitor training with logs or TensorBoard.
When to Use It
- You want clean, single-file LLM implementations for study or quick prototyping.
- You need educational understanding of architectures like Llama, Gemma, Phi, Qwen, or Mistral.
- You plan production fine-tuning with LoRA/QLoRA on a selected model.
- You want to experiment across 20+ pretrained architectures with consistent workflows.
- You require end-to-end guidance from loading a model to training and evaluation with minimal abstraction.
Quick Start
- Step 1: Install: pip install 'litgpt[extra]'
- Step 2: Load a model and generate: from litgpt import LLM; llm = LLM.load("microsoft/phi-2"); llm.generate("What is the capital of France?", max_new_tokens=50, temperature=0.7)
- Step 3: List models or run fine-tuning: litgpt download list or litgpt finetune_lora microsoft/phi-2 --data JSON --data.json_path data/my_dataset.json ...
Best Practices
- Start with LoRA fine-tuning for memory efficiency on limited GPUs (e.g., 12-16GB).
- Use Alpaca-format datasets and JSON as the standard data format for fine-tuning.
- Select base models (Phi-2, Llama-3, Gemma) based on your GPU capacity and quality/size needs.
- Leverage existing checkpoints in checkpoints/ and review out/finetune/ for results.
- Monitor training with logs and TensorBoard to catch overfitting and convergence issues.
Example Use Cases
- Load a Phi-2 model and perform LoRA fine-tuning on a custom JSON dataset.
- Fine-tune Llama-3-8B with full fine-tuning on a sizable dataset, monitoring out/finetune/ outputs.
- Download and compare different models using litgpt download list to select best trade-off.
- Prepare Alpaca-format data and run litgpt finetune to evaluate instruction-following performance.
- Execute LoRA training steps with explicit lora_r and lora_alpha values to tune capacity and scaling.
Related Skills
quantizing-models-bitsandbytes
Orchestra-Research/AI-Research-SKILLs
Quantizes LLMs to 8-bit or 4-bit for 50-75% memory reduction with minimal accuracy loss. Use when GPU memory is limited, need to fit larger models, or want faster inference. Supports INT8, NF4, FP4 formats, QLoRA training, and 8-bit optimizers. Works with HuggingFace Transformers.
unsloth
Orchestra-Research/AI-Research-SKILLs
Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization
llama-factory
Orchestra-Research/AI-Research-SKILLs
Expert guidance for fine-tuning LLMs with LLaMA-Factory - WebUI no-code, 100+ models, 2/3/4/5/6/8-bit QLoRA, multimodal support
axolotl
Orchestra-Research/AI-Research-SKILLs
Expert guidance for fine-tuning LLMs with Axolotl - YAML configs, 100+ models, LoRA/QLoRA, DPO/KTO/ORPO/GRPO, multimodal support
gptq
Orchestra-Research/AI-Research-SKILLs
Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.
mamba-architecture
Orchestra-Research/AI-Research-SKILLs
State-space model with O(n) complexity vs Transformers' O(n²). 5× faster inference, million-token sequences, no KV cache. Selective SSM with hardware-aware design. Mamba-1 (d_state=16) and Mamba-2 (d_state=128, multi-head). Models 130M-2.8B on HuggingFace.