npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/mamba --openclaw

Mamba - Selective State Space Models

Quick start

Mamba is a selective state-space model (SSM) architecture for sequence modeling whose compute scales linearly, O(n), with sequence length n.

Installation:

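# Install causal-conv1d (optional, for efficiency)
# Quote the requirement so the shell does not treat ">=" as an output redirect
pip install "causal-conv1d>=1.4.0"

# Install Mamba
pip install mamba-ssm
# Or both together (quoted so shells such as zsh do not expand the brackets)
pip install "mamba-ssm[causal-conv1d]"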

Prerequisites: Linux, NVIDIA GPU, PyTorch 1.12+, CUDA 11.6+
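A quick way to see which of these prerequisites are present is a small check script. This is a minimal sketch using only the standard library; the helper name check_mamba_prerequisites is hypothetical, and it only tests the operating system and whether the packages are importable, not the GPU, CUDA, or PyTorch versions.

```python
import importlib.util
import platform


def check_mamba_prerequisites():
    """Report which prerequisites appear to be satisfied on this machine."""
    def installed(module_name):
        # find_spec returns None when a top-level module cannot be found.
        return importlib.util.find_spec(module_name) is not None

    return {
        "linux": platform.system() == "Linux",
        "torch": installed("torch"),
        "mamba_ssm": installed("mamba_ssm"),
        "causal_conv1d": installed("causal_conv1d"),  # optional dependency
    }


for name, ok in check_mamba_prerequisites().items():
    print(f"{name:>14}: {'ok' if ok else 'missing'}")
```

If torch is reported as present, torch.cuda.is_available() and torch.version.cuda can be checked next to confirm the GPU side.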

Basic usage (Mamba block):

import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")

model = Mamba(
    d_model=dim,      # Model dimension
    d_state=16,       # SSM state dimension
    d_conv=4,         # Conv1d kernel size
    expand=2          # Expansion factor
).to("cuda")

y = model(x)  # O(n) complexity!
assert y.shape == x.shape
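The linear cost comes from evaluating the model as a recurrence: each step updates a fixed-size state once. The following is a toy sketch of that idea for a single channel with a scalar state, in pure Python. It is not the mamba_ssm kernel; the function name and the scalar parameters a, b, c are illustrative assumptions, and the per-step deltas stand in for the input-dependent step size that makes the model selective.

```python
import math


def selective_scan_1d(xs, deltas, a=-1.0, b=1.0, c=1.0):
    """Single-channel, scalar-state SSM recurrence evaluated in one O(n) pass.

    h_t = exp(delta_t * a) * h_{t-1} + delta_t * b * x_t
    y_t = c * h_t
    """
    h = 0.0
    ys = []
    for x, delta in zip(xs, deltas):
        a_bar = math.exp(delta * a)  # discretised state transition
        b_bar = delta * b            # simplified (Euler) input discretisation
        h = a_bar * h + b_bar * x    # one constant-cost update per token
        ys.append(c * h)
    return ys


# A step size of 0 leaves the state untouched, so the second input is ignored.
print(selective_scan_1d([1.0, 5.0], [1.0, 0.0]))
```

A larger delta lets the current input overwrite more of the state, while a delta near zero carries the previous state forward, which is the intuition behind selectivity.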

Common workflows

Workflow 1: Language model with Mamba-2

Complete LM with generation:

from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
from mamba_ssm.models.config_mamba import MambaConfig
import torch

# Configure Mamba-2 LM
config = MambaConfig(
    d_model=1024,           # Hidden dimension
    n_layer=24,             # Number of layers
    vocab_size=50277,       # Vocabulary size
    ssm_cfg=dict(
        layer="Mamba2",     # Use Mamba-2
        d_state=128,        # Larger state for Mamba-2
        headdim=64,         # Head dimension
        ngroups=1           # Number of groups
    )
)

model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)

# Generate text
input_ids = torch.randint(0, 1000, (1, 20), device="cuda", dtype=torch.long)
output = model.generate(
    input_ids=input_ids,
    max_length=100,
    temperature=0.7,
    top_p=0.9
)

Workflow 2: Use pretrained Mamba models

Load from HuggingFace:

import torch
from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Load pretrained model
model_name = "state-spaces/mamba-2.8b"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # Use compatible tokenizer
model = MambaLMHeadModel.from_pretrained(model_name, device="cuda", dtype=torch.float16)

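# Generate (assumption to verify against your installed version: mamba_ssm's
# generate() defaults to top_k=1, i.e. greedy decoding, in which case
# temperature and top_p have no effect unless top_k is changed)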
prompt = "The future of AI is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
output_ids = model.generate(
    input_ids=input_ids,
    max_length=200,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2
)
generated_text = tokenizer.decode(output_ids[0])
print(generated_text)
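To make the temperature and top_p arguments concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a list of logits. It illustrates the general technique only and is not the sampling code used by mamba_ssm; the function name sample_top_p is hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling and top-p filtering."""
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw one index; choices() renormalises the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


print(sample_top_p([2.0, 1.0, 0.5, -1.0], rng=random.Random(0)))
```

Lower top_p values shrink the candidate set, so output becomes more conservative; top_p=1.0 keeps every token.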

Available models:

  • state-spaces/mamba-130m
  • state-spaces/mamba-370m
  • state-spaces/mamba-790m
  • state-spaces/mamba-1.4b
  • state-spaces/mamba-2.8b

Workflow 3: Mamba-1 vs Mamba-2

Mamba-1 (smaller state):

from mamba_ssm import Mamba

model = Mamba(
    d_model=256,
    d_state=16,      # Smaller state dimension
    d_conv=4,
    expand=2
).to("cuda")

Mamba-2 (multi-head, larger state):

from mamba_ssm import Mamba2

model = Mamba2(
    d_model=256,
    d_state=128,     # Larger state dimension
    d_conv=4,
    expand=2,
    headdim=64,      # Head dimension for multi-head
    ngroups=1        # Parallel groups
).to("cuda")

Key differences:

  • State size: Mamba-1 (d_state=16) vs Mamba-2 (d_state=128)
  • Architecture: Mamba-2 has multi-head structure
  • Normalization: Mamba-2 uses RMSNorm
  • Distributed: Mamba-2 supports tensor parallelism
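The multi-head structure ties the Mamba-2 parameters together: the inner width is split into heads of size headdim. The sketch below computes the implied head count, assuming the inner width is expand * d_model and must divide evenly by headdim, as in the reference implementation. The helper name mamba2_head_count is hypothetical.

```python
def mamba2_head_count(d_model, expand=2, headdim=64):
    """Return the number of SSM heads implied by a Mamba-2 style configuration."""
    d_inner = expand * d_model  # assumed inner width of the block
    if d_inner % headdim != 0:
        raise ValueError(
            f"expand * d_model = {d_inner} is not divisible by headdim = {headdim}"
        )
    return d_inner // headdim


print(mamba2_head_count(256))   # the Mamba2 block shown above
print(mamba2_head_count(1024))  # the d_model used in Workflow 1
```

Checking this before constructing a model can surface an incompatible d_model/headdim pair early.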

Workflow 4: Benchmark vs Transformers

Generation speed comparison:

# Benchmark Mamba
python benchmarks/benchmark_generation_mamba_simple.py \
  --model-name "state-spaces/mamba-2.8b" \
  --prompt "The future of machine learning is" \
  --topp 0.9 --temperature 0.7 --repetition-penalty 1.2

# Benchmark Transformer
python benchmarks/benchmark_generation_mamba_simple.py \
  --model-name "EleutherAI/pythia-2.8b" \
  --prompt "The future of machine learning is" \
  --topp 0.9 --temperature 0.7 --repetition-penalty 1.2

Expected results:

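  • Mamba: up to about 5× higher generation throughput than a similar-size Transformer (the figure reported by the Mamba authors; results vary with hardware, batch size, and sequence length)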
  • Memory: No KV cache needed
  • Scaling: Linear with sequence length
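To confirm these numbers on your own hardware, a small timing harness is enough. This is a generic sketch, not part of the mamba repository's benchmark script; tokens_per_second and its parameters are hypothetical names. For GPU models the callable should synchronise (for example with torch.cuda.synchronize()) before returning, otherwise the timing only covers kernel launches.

```python
import time


def tokens_per_second(generate_fn, n_tokens, warmup=1, repeats=3):
    """Time a generation callable and return the best observed tokens/second."""
    for _ in range(warmup):
        generate_fn()  # warm-up run so one-time setup costs are excluded
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        generate_fn()
        best = min(best, time.perf_counter() - start)
    return n_tokens / max(best, 1e-9)  # guard against a zero elapsed time


# Stand-in workload; replace the lambda with a call to model.generate(...).
rate = tokens_per_second(lambda: sum(range(100_000)), n_tokens=100)
print(f"{rate:.0f} tokens/s")
```

Running the same harness with the same prompt and max_length for both models keeps the comparison fair.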

When to use vs alternatives

Use Mamba when:

  • Need long sequences (100K+ tokens)
  • Want faster inference than Transformers
  • Memory-constrained (no KV cache)
  • Building streaming applications
  • Linear scaling important

Advantages:

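  • O(n) complexity: compute grows linearly with sequence length, vs quadratically for self-attention
  • Fast inference: up to about 5× higher generation throughput than a similar-size Transformer, as reported by the Mamba authors
  • No KV cache: Lower memory usage during generation
  • Million-token sequences: the authors report quality continuing to improve on real data up to million-length sequences
  • Streaming: fixed-size state, so memory stays constant as tokens arrive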
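The memory advantage can be estimated with back-of-the-envelope arithmetic. The sketch below compares the per-sequence KV cache of a standard multi-head attention decoder with the fixed recurrent state of a Mamba-1 style model. Both formulas are simplifying assumptions (no grouped-query attention, state stored in 2-byte precision), the function names are hypothetical, and the layer counts and widths in the usage lines are illustrative values rather than measurements.

```python
def kv_cache_bytes(n_layer, d_model, seq_len, bytes_per_elem=2):
    """Keys plus values for one sequence in a multi-head attention decoder."""
    return 2 * n_layer * seq_len * d_model * bytes_per_elem


def mamba_state_bytes(n_layer, d_model, d_state=16, d_conv=4, expand=2,
                      bytes_per_elem=2):
    """SSM state plus conv window for one sequence; independent of seq_len."""
    d_inner = expand * d_model
    return n_layer * d_inner * (d_state + d_conv) * bytes_per_elem


for seq_len in (2_048, 32_768, 1_000_000):
    kv = kv_cache_bytes(n_layer=32, d_model=2560, seq_len=seq_len)
    ssm = mamba_state_bytes(n_layer=64, d_model=2560)
    print(f"seq_len={seq_len:>9}: KV cache {kv / 2**20:10.1f} MiB, "
          f"Mamba state {ssm / 2**20:6.1f} MiB")
```

The KV cache grows with every token while the Mamba state does not, which is why the gap widens at long context lengths.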

Use alternatives instead:

  • Transformers: Need best-in-class performance, have compute
  • RWKV: Want RNN+Transformer hybrid
  • RetNet: Need retention-based architecture
  • Hyena: Want convolution-based approach

Common issues

Issue: CUDA out of memory

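Reduce the batch size or sequence length. When training, gradient checkpointing also trades compute for memory:

model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)
# gradient_checkpointing_enable() is a Hugging Face Transformers method; the native
# mamba_ssm model may not provide it, so guard the call
if hasattr(model, "gradient_checkpointing_enable"):
    model.gradient_checkpointing_enable()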
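Reducing the batch size can also be automated. The following is a minimal sketch of a retry loop that halves the batch size whenever a step runs out of memory. The names run_with_backoff and fake_step are hypothetical; with PyTorch you would pass the framework's own out-of-memory exception (recent versions expose torch.cuda.OutOfMemoryError, an assumption to verify for your version) and likely free cached memory between attempts.

```python
def run_with_backoff(step_fn, batch_size, min_batch_size=1,
                     oom_errors=(MemoryError,)):
    """Call step_fn(batch_size), halving batch_size after each OOM error."""
    while True:
        try:
            return batch_size, step_fn(batch_size)
        except oom_errors:
            if batch_size <= min_batch_size:
                raise  # nothing smaller left to try
            batch_size = max(min_batch_size, batch_size // 2)


def fake_step(batch_size):
    # Hypothetical stand-in for a real forward pass: pretend that anything
    # above 4 samples exhausts memory.
    if batch_size > 4:
        raise MemoryError("simulated out-of-memory")
    return f"ran with batch_size={batch_size}"


print(run_with_backoff(fake_step, batch_size=32))
```

The returned batch size can then be reused for the rest of the run instead of probing again on every step.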

Issue: Slow installation

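A slow install usually means pip is compiling the CUDA extensions from source. The --no-build-isolation flag does not select a binary wheel; it makes the build use the PyTorch already installed in your environment instead of a fresh isolated one, which also avoids PyTorch version conflicts: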

pip install mamba-ssm --no-build-isolation

Issue: Missing causal-conv1d

Install separately:

pip install "causal-conv1d>=1.4.0"

Issue: Model not loading from HuggingFace

Use MambaLMHeadModel.from_pretrained (not AutoModel):

from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-2.8b")

Advanced topics

Selective SSM: See references/selective-ssm.md for the mathematical formulation, the state-space equations, and how the selective scan keeps O(n) complexity while making the parameters input-dependent.
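For orientation, the selective SSM recurrence can be summarised as below. This is the standard formulation from the Mamba paper, written from memory rather than copied from references/selective-ssm.md; the step size, input matrix, and output matrix carry a subscript t because they are computed from the input at each position, which is what "selective" refers to.

```latex
\begin{aligned}
h_t &= \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \\
y_t &= C_t\, h_t, \\
\bar{A}_t &= \exp(\Delta_t A), \qquad
\bar{B}_t = (\Delta_t A)^{-1}\bigl(\exp(\Delta_t A) - I\bigr)\,\Delta_t B_t .
\end{aligned}
```

Each position costs one fixed-size state update, so a sequence of length n takes O(n) work.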

Mamba-2 architecture: See references/mamba2-details.md for multi-head structure, tensor parallelism, and distributed training setup.

Performance optimization: See references/performance.md for hardware-aware design, CUDA kernels, and memory efficiency techniques.

Hardware requirements

  • GPU: NVIDIA with CUDA 11.6+
  • VRAM:
    • 130M model: 2GB
    • 370M model: 4GB
    • 790M model: 8GB
    • 1.4B model: 14GB
    • 2.8B model: 28GB (FP16)

Performance (vs Transformers):

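  • Speed: up to about 5× higher generation throughput than a similar-size Transformer, as reported by the Mamba authors; measure on your own hardware
  • Memory: lower at long sequence lengths because there is no KV cache (the source cites about 50% less, without stating the setup)
  • Scaling: linear vs quadratic in sequence length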
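As a sanity check on VRAM planning, the memory taken by the parameters alone is simply parameter count times bytes per parameter. The sketch below computes that weights-only lower bound; the helper name weight_memory_gb is hypothetical. The VRAM figures listed under hardware requirements are higher than this bound, and the source does not state what else they include, so treat the two as separate estimates.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory needed for the parameters alone, in gigabytes (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Nominal parameter counts taken from the model names.
for name, n_params in [("130M", 130e6), ("370M", 370e6), ("790M", 790e6),
                       ("1.4B", 1.4e9), ("2.8B", 2.8e9)]:
    print(f"{name}: {weight_memory_gb(n_params):.2f} GB of weights in FP16")
```

Activations, framework overhead, and (when training) gradients and optimizer state come on top of this number.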

Resources

Source: https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/01-model-architecture/mamba/SKILL.md

Overview

Mamba is a selective state-space model architecture that achieves O(n) linear complexity for sequence modeling. It enables faster long-context inference and can process very long sequences (up to the million-token range) without a KV cache. Two variants are available, Mamba-1 (typically d_state=16) and Mamba-2 (typically d_state=128, multi-head), and pretrained checkpoints on HuggingFace range from 130M to 2.8B parameters.

How This Skill Works

Mamba uses a state-space formulation to compute sequence representations in O(n) time, carrying a fixed-size recurrent state instead of a KV cache. Its parameters are input-dependent (selective), and it relies on hardware-aware kernels (optionally accelerated by causal-conv1d). There are two variants: Mamba-1 (d_state=16) and Mamba-2 (d_state=128, multi-head). The implementation is provided by the mamba-ssm library.

When to Use It

  • When modeling long-context sequences or documents where the quadratic cost of Transformer attention is a bottleneck (up to million-token scenarios).
  • When deploying on hardware where you need faster inference and efficient memory usage (linear complexity).
  • When you want to avoid KV caches during generation to reduce latency and memory overhead.
  • When deciding between Mamba-1 and Mamba-2 to balance state size, speed, and multi-head capabilities.
  • When leveraging pretrained models on HuggingFace (130M–2.8B variants) for scalable deployment.

Quick Start

  1. Install dependencies: pip install "causal-conv1d>=1.4.0" (optional), then pip install mamba-ssm
  2. Create a Mamba model: instantiate Mamba or Mamba2 with d_model, d_state, d_conv, expand (and headdim for Mamba-2).
  3. Run inference: y = model(x) or use generation utilities to produce long-context outputs.

Best Practices

  • Install optional acceleration with causal-conv1d>=1.4.0 to boost throughput.
  • Choose Mamba-1 (d_state=16) for smaller tasks and Mamba-2 (d_state=128) for longer contexts or multi-head needs.
  • Leverage RMSNorm in Mamba-2 setups to stabilize training and inference.
  • Use the HuggingFace variants (130m–2.8b) as starting points for your domain.
  • Benchmark against Transformer baselines to validate linear-time gains and memory usage on your hardware.

Example Use Cases

  • Long-form document generation and summarization with million-token contexts.
  • Code generation across large files where context spans thousands of lines.
  • Extended-dialogue agents with long memory across many turns.
  • Scientific literature review across multi-document corpora.
  • Enterprise chatbots handling extended customer histories.


