sparse-autoencoder-training
npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/saelens --openclaw
SAELens: Sparse Autoencoders for Mechanistic Interpretability
SAELens is the primary library for training and analyzing Sparse Autoencoders (SAEs), a technique for decomposing polysemantic neural network activations into sparse, interpretable features. It builds on Anthropic's research on monosemanticity.
GitHub: jbloomAus/SAELens (1,100+ stars)
The Problem: Polysemanticity & Superposition
Individual neurons in neural networks are polysemantic - they activate in multiple, semantically distinct contexts. This happens because models use superposition to represent more features than they have neurons, making interpretability difficult.
SAEs solve this by decomposing dense activations into sparse, monosemantic features - typically only a small number of features activate for any given input, and each feature corresponds to an interpretable concept.
When to Use SAELens
Use SAELens when you need to:
- Discover interpretable features in model activations
- Understand what concepts a model has learned
- Study superposition and feature geometry
- Perform feature-based steering or ablation
- Analyze safety-relevant features (deception, bias, harmful content)
Consider alternatives when:
- You need basic activation analysis → Use TransformerLens directly
- You want causal intervention experiments → Use pyvene or TransformerLens
- You need production steering → Consider direct activation engineering
Installation
pip install sae-lens
Requirements: Python 3.10+, transformer-lens>=2.0.0
Core Concepts
What SAEs Learn
SAEs are trained to reconstruct model activations through a sparse bottleneck:
Input Activation → Encoder → Sparse Features → Decoder → Reconstructed Activation
   (d_model)          ↓     (d_sae >> d_model)    ↓            (d_model)
                  sparsity                  reconstruction
                  penalty                        loss
Loss Function: MSE(original, reconstructed) + L1_coefficient × L1(features)
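The pipeline and loss above can be sketched in plain Python so it runs without PyTorch. This is a toy illustration, not SAELens's implementation: the function names `sae_forward` and `sae_loss`, the random weights, and the choice to average the MSE over dimensions are all assumptions made for the example.

```python
import random

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x (length d_model) into ReLU features (length d_sae), then decode."""
    d_model, d_sae = len(x), len(b_enc)
    # Encoder: features = ReLU(x @ W_enc + b_enc), with W_enc shaped [d_model][d_sae]
    features = [
        max(0.0, sum(x[i] * W_enc[i][j] for i in range(d_model)) + b_enc[j])
        for j in range(d_sae)
    ]
    # Decoder: x_hat = features @ W_dec + b_dec, with W_dec shaped [d_sae][d_model]
    x_hat = [
        sum(features[j] * W_dec[j][i] for j in range(d_sae)) + b_dec[i]
        for i in range(d_model)
    ]
    return features, x_hat

def sae_loss(x, x_hat, features, l1_coefficient):
    """MSE(original, reconstructed) + l1_coefficient * L1(features)."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coefficient * l1

# Toy sizes: d_sae is larger than d_model, as in the diagram above.
random.seed(0)
d_model, d_sae = 4, 16
W_enc = [[random.gauss(0, 0.5) for _ in range(d_sae)] for _ in range(d_model)]
W_dec = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_sae)]
b_enc, b_dec = [0.0] * d_sae, [0.0] * d_model
x = [random.gauss(0, 1) for _ in range(d_model)]

features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, features, l1_coefficient=8e-5)
print(f"active features: {sum(f > 0 for f in features)} / {d_sae}, loss: {loss:.4f}")
```

With untrained random weights the reconstruction is poor; training adjusts the weights so that few features fire while the reconstruction stays close to the input.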
Key Validation (Anthropic Research)
In "Towards Monosemanticity", human evaluators found 70% of SAE features genuinely interpretable. Features discovered include:
- DNA sequences, legal language, HTTP requests
- Hebrew text, nutrition statements, code syntax
- Sentiment, named entities, grammatical structures
Workflow 1: Loading and Analyzing Pre-trained SAEs
Step-by-Step
from transformer_lens import HookedTransformer
from sae_lens import SAE

# 1. Load model and pre-trained SAE
model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda"
)

# 2. Get model activations
tokens = model.to_tokens("The capital of France is Paris")
_, cache = model.run_with_cache(tokens)
activations = cache["resid_pre", 8]  # [batch, pos, d_model]

# 3. Encode to SAE features
sae_features = sae.encode(activations)  # [batch, pos, d_sae]
print(f"Active features: {(sae_features > 0).sum()}")

# 4. Find top features for each position
for pos in range(tokens.shape[1]):
    top_features = sae_features[0, pos].topk(5)
    token = model.to_str_tokens(tokens[0, pos:pos+1])[0]
    print(f"Token '{token}': features {top_features.indices.tolist()}")

# 5. Reconstruct activations
reconstructed = sae.decode(sae_features)
reconstruction_error = (activations - reconstructed).norm()
Available Pre-trained SAEs
| Release | Model | Layers |
|---|---|---|
| gpt2-small-res-jb | GPT-2 Small | Multiple residual streams |
| gemma-2b-res | Gemma 2B | Residual streams |
| Various on HuggingFace | Search tag saelens | Various |
Checklist
- Load model with TransformerLens
- Load matching SAE for target layer
- Encode activations to sparse features
- Identify top-activating features per token
- Validate reconstruction quality
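For the last checklist item, a raw error norm is hard to interpret on its own, so a scale-free measure such as explained variance is often easier to read. The sketch below is a dependency-free illustration on nested lists; the function name `explained_variance` and the toy data are assumptions, and the variance is taken around the per-dimension mean over all rows.

```python
def explained_variance(originals, reconstructions):
    """Return 1 - (residual sum of squares / total sum of squares).

    Both arguments are lists of equal-length vectors (one row per token).
    The total sum of squares is measured around the per-dimension mean.
    """
    n, d = len(originals), len(originals[0])
    means = [sum(row[i] for row in originals) / n for i in range(d)]
    residual = sum(
        (o - r) ** 2
        for orig_row, rec_row in zip(originals, reconstructions)
        for o, r in zip(orig_row, rec_row)
    )
    total = sum((o - means[i]) ** 2 for row in originals for i, o in enumerate(row))
    return 1.0 - residual / total

originals = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
perfect = [row[:] for row in originals]        # exact reconstruction
mean_only = [[3.0, 5.0] for _ in originals]    # always predicts the mean

print(explained_variance(originals, perfect))    # 1.0
print(explained_variance(originals, mean_only))  # 0.0
```

A value near 1 means the SAE reproduces almost all of the variation in the activations; a value near 0 means it does no better than predicting the mean.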
Workflow 2: Training a Custom SAE
Step-by-Step
from sae_lens import SAE, LanguageModelSAERunnerConfig, SAETrainingRunner
# 1. Configure training
cfg = LanguageModelSAERunnerConfig(
    # Model
    model_name="gpt2-small",
    hook_name="blocks.8.hook_resid_pre",
    hook_layer=8,
    d_in=768,  # Model dimension
    # SAE architecture
    architecture="standard",  # or "gated", "topk"
    d_sae=768 * 8,  # Expansion factor of 8
    activation_fn="relu",
    # Training
    lr=4e-4,
    l1_coefficient=8e-5,  # Sparsity penalty
    l1_warm_up_steps=1000,
    train_batch_size_tokens=4096,
    training_tokens=100_000_000,
    # Data
    dataset_path="monology/pile-uncopyrighted",
    context_size=128,
    # Logging
    log_to_wandb=True,
    wandb_project="sae-training",
    # Checkpointing
    checkpoint_path="checkpoints",
    n_checkpoints=5,
)
# 2. Train
trainer = SAETrainingRunner(cfg)
sae = trainer.run()
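# 3. Evaluate
# NOTE: the `metrics` attribute and its key names are as given in this guide and
# may differ between SAELens versions; with log_to_wandb=True, check the W&B run too.
print(f"L0 (avg active features): {trainer.metrics['l0']}")
print(f"CE Loss Recovered: {trainer.metrics['ce_loss_score']}")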
Key Hyperparameters
| Parameter | Typical Value | Effect |
|---|---|---|
| d_sae | 4-16× d_model | More features, higher capacity |
| l1_coefficient | 5e-5 to 1e-4 | Higher = sparser, less accurate |
| lr | 1e-4 to 1e-3 | Standard optimizer LR |
| l1_warm_up_steps | 500-2000 | Prevents early feature death |
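To make the warm-up row concrete, here is a minimal sketch of what an L1 warm-up schedule can look like. It assumes a linear ramp from zero to the final coefficient; the actual schedule SAELens uses may differ, and the function name `l1_coefficient_at` is made up for this example.

```python
def l1_coefficient_at(step, final_coefficient=8e-5, warm_up_steps=1000):
    """Linearly ramp the L1 coefficient from 0 to its final value (assumed schedule)."""
    if warm_up_steps <= 0:
        # No warm-up: the full sparsity penalty applies from the first step.
        return final_coefficient
    return final_coefficient * min(1.0, step / warm_up_steps)

for step in (0, 250, 500, 1000, 5000):
    print(step, l1_coefficient_at(step))
```

Early in training the penalty is weak, so features get a chance to learn useful directions before sparsity pressure pushes rarely used ones toward zero.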
Evaluation Metrics
| Metric | Target | Meaning |
|---|---|---|
| L0 | 50-200 | Average active features per token |
| CE Loss Score | 80-95% | Cross-entropy recovered vs original |
| Dead Features | <5% | Features that never activate |
| Explained Variance | >90% | Reconstruction quality |
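The first three metrics in the table can be computed from quantities you already have. The sketch below is a plain-Python illustration under stated assumptions: feature activations are a list of rows (one per token), a feature counts as dead if it never fires in the sample, and the CE loss score uses a common definition (fraction of the loss gap between zero-ablating the layer and the clean model that the SAE recovers), which may not match the exact SAELens computation. All function names are hypothetical.

```python
def l0_per_token(feature_acts):
    """Average number of active (> 0) features per token."""
    return sum(sum(1 for a in row if a > 0) for row in feature_acts) / len(feature_acts)

def dead_feature_fraction(feature_acts):
    """Fraction of features that never activate on any token in the sample."""
    d_sae = len(feature_acts[0])
    fired = [any(row[j] > 0 for row in feature_acts) for j in range(d_sae)]
    return 1.0 - sum(fired) / d_sae

def ce_loss_score(ce_original, ce_with_sae, ce_zero_ablation):
    """Share of the zero-ablation loss gap recovered when the SAE output is spliced in."""
    return (ce_zero_ablation - ce_with_sae) / (ce_zero_ablation - ce_original)

# Toy activations: 3 tokens, 4 features (feature 2 never fires).
acts = [
    [0.0, 1.2, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.3, 0.0, 0.7],
]
print(f"L0: {l0_per_token(acts):.2f}")
print(f"Dead features: {dead_feature_fraction(acts):.0%}")
print(f"CE loss score: {ce_loss_score(3.0, 3.2, 8.0):.0%}")
```

On a real run, estimate dead features over many batches; a feature that is silent on a small sample may simply be rare.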
Checklist
- Choose target layer and hook point
- Set expansion factor (d_sae = 4-16× d_model)
- Tune L1 coefficient for desired sparsity
- Enable L1 warm-up to prevent dead features
- Monitor metrics during training (W&B)
- Validate L0 and CE loss recovery
- Check dead feature ratio
Workflow 3: Feature Analysis and Steering
Analyzing Individual Features
from transformer_lens import HookedTransformer
from sae_lens import SAE
import torch

model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda"
)

# Find what activates a specific feature
feature_idx = 1234
test_texts = [
    "The scientist conducted an experiment",
    "I love chocolate cake",
    "The code compiles successfully",
    "Paris is beautiful in spring",
]
for text in test_texts:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    features = sae.encode(cache["resid_pre", 8])
    activation = features[0, :, feature_idx].max().item()
    print(f"{activation:.3f}: {text}")
Feature Steering
def steer_with_feature(model, sae, prompt, feature_idx, strength=5.0):
    """Add SAE feature direction to residual stream."""
    tokens = model.to_tokens(prompt)
    # Get feature direction from decoder
    feature_direction = sae.W_dec[feature_idx].detach()  # [d_model]

    def steering_hook(activation, hook):
        # Add scaled feature direction at all positions
        return activation + strength * feature_direction

    # Generate with steering. HookedTransformer.generate() does not take a
    # fwd_hooks argument, so attach the hook with the model.hooks() context manager.
    with model.hooks(fwd_hooks=[("blocks.8.hook_resid_pre", steering_hook)]):
        output = model.generate(tokens, max_new_tokens=50)
    return model.to_string(output[0])
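Steering adds a feature direction; ablation is the mirror image and removes a feature's current contribution. The arithmetic is shown below on plain lists as a toy sketch. The helper names `steer` and `ablate_feature` are hypothetical, and in practice the same operations would run on tensors inside a hook like the one above.

```python
def steer(activation, decoder_row, strength):
    """Add strength * decoder_row to one activation vector."""
    return [a + strength * w for a, w in zip(activation, decoder_row)]

def ablate_feature(activation, feature_act, decoder_row):
    """Remove one feature's contribution (feature_act * decoder_row) from the vector."""
    return [a - feature_act * w for a, w in zip(activation, decoder_row)]

activation = [1.0, 2.0, 3.0]    # toy residual-stream vector (d_model = 3)
decoder_row = [0.0, 1.0, 0.0]   # toy decoder direction for one feature
feature_act = 2.0               # that feature's activation at this position

print(steer(activation, decoder_row, strength=5.0))          # [1.0, 7.0, 3.0]
print(ablate_feature(activation, feature_act, decoder_row))  # [1.0, 0.0, 3.0]
```

Ablation needs the feature's activation at each position (from the encoder), whereas steering only needs the decoder direction and a chosen strength.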
Feature Attribution
# Which features most affect a specific output?
tokens = model.to_tokens("The capital of France is")
_, cache = model.run_with_cache(tokens)
# Get features at final position
features = sae.encode(cache["resid_pre", 8])[0, -1] # [d_sae]
# Get logit attribution per feature
# Feature contribution = feature_activation × decoder_weight × unembedding
W_dec = sae.W_dec # [d_sae, d_model]
W_U = model.W_U # [d_model, vocab]
# Contribution to "Paris" logit
paris_token = model.to_single_token(" Paris")
feature_contributions = features * (W_dec @ W_U[:, paris_token])
top_features = feature_contributions.topk(10)
print("Top features for 'Paris' prediction:")
for idx, val in zip(top_features.indices, top_features.values):
    print(f"  Feature {idx.item()}: {val.item():.3f}")
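The attribution formula above is linear, so the per-feature contributions should sum to the logit you would get by decoding the features and projecting onto the unembedding column. The toy check below verifies that identity in plain Python. It ignores the decoder bias, the final LayerNorm, and every layer after the hook point, so treat the numbers as a direct-path approximation; the helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def feature_logit_contributions(features, W_dec, unembed_column):
    """contribution_i = feature_i * (W_dec[i] . unembed_column)."""
    return [f * dot(row, unembed_column) for f, row in zip(features, W_dec)]

# Toy sizes: d_sae = 3, d_model = 2
features = [0.0, 2.0, 0.5]
W_dec = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # [d_sae][d_model]
unembed_column = [0.5, -1.0]                   # W_U[:, token] for one token

contributions = feature_logit_contributions(features, W_dec, unembed_column)

# Decode first, then project: (features @ W_dec) . unembed_column
decoded = [sum(f * row[i] for f, row in zip(features, W_dec)) for i in range(2)]
direct_logit = dot(decoded, unembed_column)

print(contributions)                      # [0.0, -2.0, -0.5]
print(sum(contributions), direct_logit)   # -2.5 -2.5
```

Inactive features contribute exactly zero, which is why only a handful of features need inspecting at any position.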
Common Issues & Solutions
Issue: High dead feature ratio
# WRONG: No warm-up, features die early
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=1e-4,
    l1_warm_up_steps=0,  # Bad!
)
# RIGHT: Warm-up L1 penalty
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=8e-5,
    l1_warm_up_steps=1000,  # Gradually increase
    use_ghost_grads=True,  # Revive dead features
)
Issue: Poor reconstruction (low CE recovery)
# Reduce sparsity penalty
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=5e-5,  # Lower = better reconstruction
    d_sae=768 * 16,  # More capacity
)
Issue: Features not interpretable
# Increase sparsity (higher L1)
cfg = LanguageModelSAERunnerConfig(
    l1_coefficient=1e-4,  # Higher = sparser, more interpretable
)
# Or use TopK architecture
cfg = LanguageModelSAERunnerConfig(
    architecture="topk",
    activation_fn_kwargs={"k": 50},  # Exactly 50 active features
)
Issue: Memory errors during training
cfg = LanguageModelSAERunnerConfig(
    train_batch_size_tokens=2048,  # Reduce batch size
    store_batch_size_prompts=4,  # Fewer prompts in buffer
    n_batches_in_buffer=8,  # Smaller activation buffer
)
Integration with Neuronpedia
Browse pre-trained SAE features at neuronpedia.org:
# Features are indexed by SAE ID
# Example: gpt2-small layer 8 feature 1234
# → neuronpedia.org/gpt2-small/8-res-jb/1234
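The URL pattern in the comment above can be wrapped in a small helper. This is a sketch that assumes the `model/sae-id/feature-index` path layout shown in the example and an `https://` scheme; the function name `neuronpedia_url` is made up here and is not part of SAELens.

```python
def neuronpedia_url(model_id, sae_id, feature_idx):
    """Build a Neuronpedia feature URL from the pattern shown above (assumed layout)."""
    return f"https://neuronpedia.org/{model_id}/{sae_id}/{feature_idx}"

print(neuronpedia_url("gpt2-small", "8-res-jb", 1234))
```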
Key Classes Reference
| Class | Purpose |
|---|---|
| SAE | Sparse Autoencoder model |
| LanguageModelSAERunnerConfig | Training configuration |
| SAETrainingRunner | Training loop manager |
| ActivationsStore | Activation collection and batching |
| HookedSAETransformer | TransformerLens + SAE integration |
Reference Documentation
For detailed API documentation, tutorials, and advanced usage, see the references/ folder:
| File | Contents |
|---|---|
| references/README.md | Overview and quick start guide |
| references/api.md | Complete API reference for SAE, TrainingSAE, configurations |
| references/tutorials.md | Step-by-step tutorials for training, analysis, steering |
External Resources
Papers
- Towards Monosemanticity - Anthropic (2023)
- Scaling Monosemanticity - Anthropic (2024)
- Sparse Autoencoders Find Highly Interpretable Features - Cunningham et al. (ICLR 2024)
Official Documentation
- SAELens Docs
- Neuronpedia - Feature browser
SAE Architectures
| Architecture | Description | Use Case |
|---|---|---|
| Standard | ReLU + L1 penalty | General purpose |
| Gated | Learned gating mechanism | Better sparsity control |
| TopK | Exactly K active features | Consistent sparsity |
# TopK SAE (exactly 50 features active)
cfg = LanguageModelSAERunnerConfig(
    architecture="topk",
    activation_fn="topk",
    activation_fn_kwargs={"k": 50},
)
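To show what the TopK activation does to a vector of encoder pre-activations, here is a minimal plain-Python sketch. It assumes the k largest values are kept and passed through a ReLU while everything else is zeroed; SAELens's own implementation may differ in details, and `topk_activation` is a hypothetical name.

```python
def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations (then ReLU), and zero out the rest."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [max(0.0, v) if i in keep else 0.0 for i, v in enumerate(pre_acts)]

pre_acts = [0.1, 3.0, -1.0, 2.0, 0.5]
print(topk_activation(pre_acts, k=2))  # [0.0, 3.0, 0.0, 2.0, 0.0]
```

Because sparsity is fixed by k rather than by a penalty, there is no L1 coefficient to tune. Under the ReLU assumption used here, fewer than k features can be active when some of the top k pre-activations are not positive.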
Source
https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/04-mechanistic-interpretability/saelens/SKILL.md
Overview
SAELens trains Sparse Autoencoders (SAEs) to decompose dense neural activations into sparse, interpretable features. This helps reveal monosemantic representations and analyze how models use superposition to encode multiple concepts.
How This Skill Works
An SAE encodes model activations into a wide but sparse feature vector (d_sae >> d_model, with only a few features active for any input) and decodes it back to reconstruct the original activation. Training minimizes MSE(original, reconstructed) plus an L1 penalty on the features to enforce sparsity. You typically load a pre-trained SAE, pass activations through the encoder to obtain SAE features, and inspect the top features per position to interpret the concepts the model has learned.
Quick Start
- Step 1: Load a model and a pre-trained SAE from SAELens (e.g., HookedTransformer.from_pretrained and SAE.from_pretrained)
- Step 2: Run the model on input text to obtain activations (e.g., model.run_with_cache on tokens) and extract the target layer
- Step 3: Encode activations with sae.encode(activations) to get SAE features and inspect top features per position; optionally reconstruct to validate
Best Practices
- Use SAELens with compatible pre-trained SAEs and model checkpoints (e.g., gpt2-small with gpt2-small-res-jb)
- Monitor sparsity by checking active features per position and tune the L1 coefficient or SAE size accordingly
- Validate interpretability by inspecting top features for representative tokens (e.g., DNA sequences, legal language, HTTP requests)
- Compare SAE features across layers and inputs to study monosemanticity and feature geometry
- Be cautious with safety-related features; corroborate findings with additional analyses
Example Use Cases
- Identify features corresponding to DNA sequences, legal language, or HTTP requests activated in a model
- Find sparse, human-interpretable features that respond to Hebrew text, nutrition statements, or code syntax
- Capture sentiment signals and named entities as monosemantic SAE features
- Analyze grammatical structures or syntactic patterns through SAE activations
- Investigate bias or deception-related activations by tracing sparse features to concepts
Related Skills
transformer-lens-interpretability
Orchestra-Research/AI-Research-SKILLs
Provides guidance for mechanistic interpretability research using TransformerLens to inspect and manipulate transformer internals via HookPoints and activation caching. Use when reverse-engineering model algorithms, studying attention patterns, or performing activation patching experiments.
nnsight-remote-interpretability
Orchestra-Research/AI-Research-SKILLs
Provides guidance for interpreting and manipulating neural network internals using nnsight with optional NDIF remote execution. Use when needing to run interpretability experiments on massive models (70B+) without local GPU resources, or when working with any PyTorch architecture.