huggingface-tokenizers
npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/huggingface-tokenizers --openclaw
HuggingFace Tokenizers - Fast Tokenization for NLP
Fast, production-ready tokenizers with Rust performance and Python ease-of-use.
When to use HuggingFace Tokenizers
Use HuggingFace Tokenizers when:
- Need extremely fast tokenization (<20s per GB of text)
- Training custom tokenizers from scratch
- Want alignment tracking (token → original text position)
- Building production NLP pipelines
- Need to tokenize large corpora efficiently
Performance:
- Speed: <20 seconds to tokenize 1GB on CPU
- Implementation: Rust core with Python/Node.js bindings
- Efficiency: 10-100× faster than pure Python implementations
Use alternatives instead:
- SentencePiece: Language-independent, used by T5/ALBERT
- tiktoken: OpenAI's BPE tokenizer for GPT models
- transformers AutoTokenizer: When you only need to load a pretrained tokenizer (its fast tokenizers use this library internally)
Quick start
Installation
# Install tokenizers
pip install tokenizers
# With transformers integration
pip install tokenizers transformers
Load pretrained tokenizer
from tokenizers import Tokenizer
# Load from HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
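# Encode text (the pretrained BERT post-processor adds [CLS] and [SEP])
output = tokenizer.encode("Hello, how are you?")
print(output.tokens) # ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']
print(output.ids) # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
# Decode back (special tokens are skipped by default)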
text = tokenizer.decode(output.ids)
print(text) # "hello, how are you?"
Train custom BPE tokenizer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# Initialize tokenizer with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2
)
# Train on files
files = ["train.txt", "validation.txt"]
tokenizer.train(files, trainer)
# Save
tokenizer.save("my-tokenizer.json")
Training time: ~1-2 minutes for 100MB corpus, ~10-20 minutes for 1GB
Batch encoding with padding
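# Enable padding (pad_id must be the ID of pad_token in this tokenizer's vocabulary)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")
# Encode batch
texts = ["Hello world", "This is a longer sentence"]
encodings = tokenizer.encode_batch(texts)
for encoding in encodings:
    print(encoding.ids)
# Example output with bert-base-uncased, where [PAD] has ID 0:
# [101, 7592, 2088, 102, 0, 0, 0]
# [101, 2023, 2003, 1037, 2936, 6251, 102]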
Tokenization algorithms
BPE (Byte-Pair Encoding)
How it works:
- Start with character-level vocabulary
- Find most frequent character pair
- Merge into new token, add to vocabulary
- Repeat until vocabulary size reached
Used by: GPT-2, GPT-3, RoBERTa, BART, DeBERTa
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()
trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],
    min_frequency=2,
    initial_alphabet=ByteLevel.alphabet()  # include all 256 byte symbols, even if unseen in training
)
tokenizer.train(files=["data.txt"], trainer=trainer)
Advantages:
- Handles OOV words well (breaks into subwords)
- Flexible vocabulary size
- Good for morphologically rich languages
Trade-offs:
- Tokenization depends on merge order
- May split common words unexpectedly
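The merge loop described under "How it works" can be sketched in plain Python. This is a toy illustration, not the library's Rust implementation: the function name `train_bpe`, the word list, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the lexicographically larger pair
        # (an arbitrary choice made here only to keep the result deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, replacing the best pair with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, corpus = train_bpe(words, num_merges=3)
print(merges)  # [('s', 't'), ('e', 'st'), ('o', 'w')]
```

The learned merges are applied in the same order at encoding time, which is why the trade-off about merge order matters.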
WordPiece
How it works:
- Start with character vocabulary
- Score each candidate pair: frequency(pair) / (frequency(first) × frequency(second))
- Merge highest scoring pair
- Repeat until vocabulary size reached
Used by: BERT, DistilBERT, MobileBERT
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import BertNormalizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
Advantages:
- Prioritizes pairs that occur together more often than their individual frequencies would predict, rather than pairs that are merely frequent
- Used successfully in BERT (state-of-the-art results)
Trade-offs:
- Unknown words become [UNK] if no subword match
- Saves vocabulary, not merge rules (larger files)
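To make the scoring formula concrete, here is a small worked example in plain Python. The toy corpus and the helper name `wordpiece_scores` are assumptions for illustration; the formula is the one given in the list above.

```python
from collections import Counter

def wordpiece_scores(corpus):
    """Score adjacent symbol pairs as freq(pair) / (freq(first) * freq(second))."""
    symbol_freq, pair_freq = Counter(), Counter()
    for symbols, freq in corpus.items():
        for symbol in symbols:
            symbol_freq[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += freq
    return {
        pair: count / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

# Toy corpus: words split into characters, "##" marks word-internal pieces.
corpus = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}
scores = wordpiece_scores(corpus)
best = max(scores, key=scores.get)
print(best, scores[best])  # ('##g', '##s') 0.05
```

The most frequent pair here is ("##u", "##g") with 20 occurrences, but it scores only 1/36 because "##u" appears almost everywhere. The rarer pair ("##g", "##s") wins, which is where WordPiece differs from BPE.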
Unigram
How it works:
- Start with large vocabulary (all substrings)
- Compute loss for corpus with current vocabulary
- Remove tokens with minimal impact on loss
- Repeat until vocabulary size reached
Used by: ALBERT, T5, mBART, XLNet (via SentencePiece)
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
tokenizer = Tokenizer(Unigram())
trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>"
)
tokenizer.train(files=["data.txt"], trainer=trainer)
Advantages:
- Probabilistic (finds most likely tokenization)
- Works well for languages without word boundaries
- Handles diverse linguistic contexts
Trade-offs:
- Computationally expensive to train
- More hyperparameters to tune
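"Finds most likely tokenization" can be illustrated with a short Viterbi search over a unigram vocabulary. This is a sketch under stated assumptions: the token probabilities are invented for the example, and `viterbi_segment` is a hypothetical helper, not a library function.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the most probable segmentation of `text` under a unigram model."""
    n = len(text)
    # best[i] = (score of best segmentation of text[:i], start index of its last token)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1], best[n][0]

# Hypothetical token probabilities, chosen only for illustration.
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
         "hu": 0.10, "ug": 0.20, "hug": 0.15, "gs": 0.05}
log_probs = {token: math.log(p) for token, p in probs.items()}
tokens, score = viterbi_segment("hugs", log_probs)
print(tokens)  # ['hug', 's']
```

Training repeats this kind of scoring over the whole corpus to decide which tokens can be removed with the least loss, which is part of why Unigram is more expensive to train.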
Tokenization pipeline
Complete pipeline: Normalization → Pre-tokenization → Model → Post-processing
Normalization
Clean and standardize text:
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence
tokenizer.normalizer = Sequence([
    NFD(), # Unicode normalization (decompose)
    Lowercase(), # Convert to lowercase
    StripAccents() # Remove accents
])
# Input: "Héllo WORLD"
# After normalization: "hello world"
Common normalizers:
- NFD, NFC, NFKD, NFKC - Unicode normalization forms
- Lowercase() - Convert to lowercase
- StripAccents() - Remove accents (é → e)
- Strip() - Remove leading and trailing whitespace
- Replace(pattern, content) - Replace a string or regex pattern
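To see why NFD comes before accent stripping, the same three steps can be approximated with the standard library. This is a rough equivalent for illustration only; the `normalize` helper is an assumption, and the library's normalizers additionally track offsets into the original text.

```python
import unicodedata

def normalize(text):
    """Approximate NFD -> Lowercase -> StripAccents with the standard library."""
    # NFD splits "é" into "e" plus a separate combining accent character.
    decomposed = unicodedata.normalize("NFD", text)
    lowered = decomposed.lower()
    # Dropping combining marks (Unicode category Mn) is what removes the accents.
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")

print(normalize("Héllo WORLD"))  # hello world
```

Without the decomposition step, a precomposed "é" is a single letter with no separate mark to strip.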
Pre-tokenization
Split text into word-like units:
from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel
# Split on whitespace and punctuation
tokenizer.pre_tokenizer = Sequence([
    Whitespace(),
    Punctuation()
])
# Input: "Hello, world!"
# After pre-tokenization: ["Hello", ",", "world", "!"]
Common pre-tokenizers:
- Whitespace() - Split on spaces, tabs, newlines
- ByteLevel() - GPT-2 style byte-level splitting
- Punctuation() - Isolate punctuation
- Digits(individual_digits=True) - Split digits individually
- Metaspace() - Replace spaces with ▁ (SentencePiece style)
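A pre-tokenizer returns each piece together with its character offsets, which is what later makes alignment tracking possible. The sketch below imitates that with the regular expression `\w+|[^\w\s]+`, the pattern the library documents for Whitespace(). The `pre_tokenize` helper itself is an assumption for illustration.

```python
import re

# \w+ matches runs of word characters; [^\w\s]+ matches runs of punctuation.
PATTERN = re.compile(r"\w+|[^\w\s]+")

def pre_tokenize(text):
    """Split text into (piece, (start, end)) tuples, keeping character offsets."""
    return [(match.group(), match.span()) for match in PATTERN.finditer(text)]

pieces = pre_tokenize("Hello, world!")
print(pieces)
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```

The model stage then tokenizes each piece independently, so no learned token can span two pieces.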
Post-processing
Add special tokens for model input:
from tokenizers.processors import TemplateProcessing
# BERT-style: [CLS] sentence [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
Common patterns:
# GPT-2: sentence <|endoftext|>
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)]
)
# RoBERTa: <s> sentence </s>
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)]
)
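Conceptually, a template is expanded by substituting the tokenized sequences for `$A` and `$B` and keeping everything else as a literal special token. The sketch below shows only that idea; `apply_template` is a hypothetical helper and ignores features of the real TemplateProcessing such as type IDs.

```python
def apply_template(template, a_tokens, b_tokens=None):
    """Expand a template such as "[CLS] $A [SEP] $B [SEP]" into a token list."""
    out = []
    for piece in template.split():
        if piece == "$A":
            out.extend(a_tokens)
        elif piece == "$B":
            out.extend(b_tokens or [])
        else:
            out.append(piece)  # anything else is a literal special token
    return out

single = apply_template("[CLS] $A [SEP]", ["hello", "world"])
pair = apply_template("<s> $A </s> </s> $B </s>", ["a"], ["b"])
print(single)  # ['[CLS]', 'hello', 'world', '[SEP]']
print(pair)    # ['<s>', 'a', '</s>', '</s>', 'b', '</s>']
```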
Alignment tracking
Track token positions in original text:
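text = "Hello, world!"
# Skip special tokens so every token maps to a span of the input
output = tokenizer.encode(text, add_special_tokens=False)
# Get token offsets
for token, offset in zip(output.tokens, output.offsets):
    start, end = offset
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")
# Output (with a lowercasing tokenizer such as bert-base-uncased):
# hello      → [ 0,  5): 'Hello'
# ,          → [ 5,  6): ','
# world      → [ 7, 12): 'world'
# !          → [12, 13): '!'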
Use cases:
- Named entity recognition (map predictions back to text)
- Question answering (extract answer spans)
- Token classification (align labels to original positions)
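For the use cases above, the usual step is mapping a character span (for example an annotated entity) to the tokens that cover it. The sketch below does this with plain offsets. `char_span_to_tokens` is a hypothetical helper, and the offsets are copied from the example output above rather than produced by a live tokenizer.

```python
def char_span_to_tokens(offsets, span_start, span_end):
    """Return indices of tokens whose character range overlaps [span_start, span_end)."""
    return [
        index
        for index, (start, end) in enumerate(offsets)
        # end > start skips empty (0, 0) offsets, which special tokens typically get
        if end > start and start < span_end and end > span_start
    ]

text = "Hello, world!"
offsets = [(0, 5), (5, 6), (7, 12), (12, 13)]  # offsets from the example above
entity = (7, 12)                               # character span of "world"
token_indices = char_span_to_tokens(offsets, *entity)
print(token_indices)  # [2]
```

Labels for token classification can then be assigned to exactly those token indices.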
Integration with transformers
Load with AutoTokenizer
from transformers import AutoTokenizer
# AutoTokenizer automatically uses fast tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Check if using fast tokenizer
print(tokenizer.is_fast) # True
# Access underlying tokenizers.Tokenizer
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer)) # <class 'tokenizers.Tokenizer'>
Convert custom tokenizer to transformers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast
# Train custom tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# ... train tokenizer ...
tokenizer.save("my-tokenizer.json")
# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)
# Use like any transformers tokenizer
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"  # requires PyTorch to be installed
)
Common patterns
Train from iterator (large datasets)
from datasets import load_dataset
# Load dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
# Create batch iterator
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]
# Train tokenizer
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset) # For progress bar
)
Performance: Processes 1GB in ~10-20 minutes
Enable truncation and padding
# Enable truncation
tokenizer.enable_truncation(max_length=512)
# Enable padding
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512 # Fixed length, or None for batch max
)
# Encode with both
output = tokenizer.encode("This is a long sentence that will be truncated...")
print(len(output.ids)) # 512
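The difference between a fixed `length` and `length=None` (pad to the longest sequence in the batch) can be sketched on plain ID lists. This is a simplification for illustration: `pad_and_truncate` is a hypothetical helper, and unlike the library it cuts the tail without preserving a trailing special token.

```python
def pad_and_truncate(batch, pad_id, max_length, length=None):
    """Truncate each ID list to max_length, then right-pad to a common length."""
    truncated = [ids[:max_length] for ids in batch]
    # length=None pads to the longest sequence in the batch; otherwise to a fixed size.
    target = length if length is not None else max(len(ids) for ids in truncated)
    return [ids + [pad_id] * (target - len(ids)) for ids in truncated]

batch = [[101, 7592, 2088, 102], [101, 2023, 2003, 1037, 2936, 6251, 102]]
dynamic = pad_and_truncate(batch, pad_id=0, max_length=6)
fixed = pad_and_truncate(batch, pad_id=0, max_length=6, length=8)
print([len(ids) for ids in dynamic])  # [6, 6]
print([len(ids) for ids in fixed])    # [8, 8]
```

Padding to the batch maximum wastes fewer positions, while a fixed length gives every batch the same shape.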
Multi-processing
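from multiprocessing import Pool
from tokenizers import Tokenizer
# Load tokenizer at module level so each worker process has its own copy
tokenizer = Tokenizer.from_file("tokenizer.json")
def encode_chunk(texts):
    # Return plain ID lists, which are cheap to send back to the parent process
    return [encoding.ids for encoding in tokenizer.encode_batch(texts)]
if __name__ == "__main__":
    # Placeholder corpus: replace with your own list of strings
    corpus = ["Hello world"] * 10_000
    # Split corpus into chunks
    chunk_size = 1000
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]
    # Encode in parallel
    with Pool(8) as pool:
        results = pool.map(encode_chunk, chunks)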
Speedup: 5-8× with 8 cores
Performance benchmarks
Training speed
| Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
|---|---|---|---|
| 10 MB | 15 sec | 18 sec | 25 sec |
| 100 MB | 1.5 min | 2 min | 4 min |
| 1 GB | 15 min | 20 min | 40 min |
Hardware: 16-core CPU, tested on English Wikipedia
Tokenization speed
| Implementation | 1 GB corpus | Throughput |
|---|---|---|
| Pure Python | ~20 minutes | ~50 MB/min |
| HF Tokenizers | ~15 seconds | ~4 GB/min |
| Speedup | 80× | 80× |
Test: English text, average sentence length 20 words
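The throughput and speedup columns follow from the timings by simple arithmetic. The check below assumes 1 GB = 1000 MB, which is what the table's round figures imply.

```python
def throughput_mb_per_min(size_mb, seconds):
    """Convert a corpus size and wall-clock time into MB per minute."""
    return size_mb * 60 / seconds

pure_python = throughput_mb_per_min(1000, 20 * 60)  # 1 GB in ~20 minutes
hf_tokenizers = throughput_mb_per_min(1000, 15)     # 1 GB in ~15 seconds
speedup = hf_tokenizers / pure_python

print(pure_python)    # 50.0  (MB/min)
print(hf_tokenizers)  # 4000.0 (MB/min, i.e. ~4 GB/min)
print(speedup)        # 80.0
```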
Memory usage
| Task | Memory |
|---|---|
| Load tokenizer | ~10 MB |
| Train BPE (30k vocab) | ~200 MB |
| Encode 1M sentences | ~500 MB |
Supported models
Pre-trained tokenizers available via from_pretrained():
BERT family:
- bert-base-uncased, bert-large-cased
- distilbert-base-uncased
- roberta-base, roberta-large
GPT family:
- gpt2, gpt2-medium, gpt2-large
- distilgpt2
T5 family:
- t5-small, t5-base, t5-large
- google/flan-t5-xxl
Other:
- facebook/bart-base, facebook/mbart-large-cc25
- albert-base-v2, albert-xlarge-v2
- xlm-roberta-base, xlm-roberta-large
Browse all: https://huggingface.co/models?library=tokenizers
References
- Training Guide - Train custom tokenizers, configure trainers, handle large datasets
- Algorithms Deep Dive - BPE, WordPiece, Unigram explained in detail
- Pipeline Components - Normalizers, pre-tokenizers, post-processors, decoders
- Transformers Integration - AutoTokenizer, PreTrainedTokenizerFast, special tokens
Resources
- Docs: https://huggingface.co/docs/tokenizers
- GitHub: https://github.com/huggingface/tokenizers ⭐ 9,000+
- Version: 0.20.0+
- Course: https://huggingface.co/learn/nlp-course/chapter6/1
- Paper: BPE (Sennrich et al., 2016), WordPiece (Schuster & Nakajima, 2012)
Source
https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/02-tokenization/huggingface-tokenizers/SKILL.md
Overview
HuggingFace Tokenizers offers fast, production-ready tokenization with a Rust core and Python/Node.js bindings. It supports the BPE, WordPiece, and Unigram algorithms and lets you train custom vocabularies, track alignments, and manage padding and truncation. It integrates with transformers, making it ideal for high-performance NLP pipelines.
How This Skill Works
The library is built around a Rust core for speed, exposed to Python/Node.js for easy use. You can train new vocabularies with a BpeTrainer, apply pre-tokenizers, and enable alignment tracking to map tokens to original text positions, then load or save tokenizers for production workflows.
When to Use It
- When you need ultra-fast tokenization (<20 seconds per GB) on CPU for large corpora.
- When you want to train a tokenizer from scratch using BPE, WordPiece, or Unigram.
- When maintaining token-to-text alignment is important for downstream tasks.
- When building production NLP pipelines that require robust padding and truncation handling.
- When integrating with transformers and other HuggingFace tooling for scalable workflows.
Quick Start
- Step 1: Install tokenizers with pip install tokenizers.
- Step 2: Load a pretrained tokenizer from the HuggingFace Hub and encode sample text.
- Step 3: Train a custom BPE tokenizer on your corpus using a BPE trainer, then save the tokenizer.
Best Practices
- Benchmark on your own hardware to check whether you reach the <20s per GB figure.
- Choose the right model: BPE for flexible subwords, WordPiece or Unigram as appropriate.
- Enable alignment tracking when you need token position mapping.
- Configure padding and truncation correctly and use batch encoding for throughput.
- Train vocabularies on representative data and validate with downstream models; leverage the Rust core for production.
Example Use Cases
- Tokenize 1GB of text in under 20 seconds on CPU.
- Train a custom BPE tokenizer with a 30k vocab and special tokens.
- Load a pretrained tokenizer from the HuggingFace Hub (e.g., bert-base-uncased) and encode text.
- Encode a batch of sentences with padding to fixed length sequences.
- Use alignment tracking to align token IDs back to the source text in a dataset.