How fast is tokenization on CPU?

Typically under 20 seconds per GB, thanks to the Rust core.

How do I train a custom tokenizer?

Use a BpeTrainer (or WordPieceTrainer / UnigramTrainer for the other models) to train on your files, and specify the vocabulary size and special tokens. Character offsets for alignment tracking are returned with every encoding.

huggingface-tokenizers

Tokenization HuggingFace BPE WordPiece Unigram Fast Tokenization Rust Custom Tokenizer Alignment Tracking Production

npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/huggingface-tokenizers --openclaw

HuggingFace Tokenizers - Fast Tokenization for NLP

Fast, production-ready tokenizers with Rust performance and Python ease-of-use.

When to use HuggingFace Tokenizers

Use HuggingFace Tokenizers when you:

Need extremely fast tokenization (<20s per GB of text)
Are training custom tokenizers from scratch
Want alignment tracking (token → original text position)
Are building production NLP pipelines
Need to tokenize large corpora efficiently

Performance:

Speed: <20 seconds to tokenize 1GB on CPU
Implementation: Rust core with Python/Node.js bindings
Efficiency: 10-100× faster than pure Python implementations

Consider alternatives instead:

SentencePiece: when you need language-independent tokenization of raw text (used by T5/ALBERT)
tiktoken: when you need OpenAI's BPE encodings for GPT models
transformers AutoTokenizer: when you only need to load pretrained tokenizers (it uses this library internally)

Quick start

Installation

# Install tokenizers
pip install tokenizers

# With transformers integration
pip install tokenizers transformers

Load pretrained tokenizer

from tokenizers import Tokenizer

# Load from HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode text
output = tokenizer.encode("Hello, how are you?")
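# The pretrained BERT tokenizer adds [CLS] and [SEP] in its post-processor
print(output.tokens)  # ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']
print(output.ids)     # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]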

# Decode back
text = tokenizer.decode(output.ids)
print(text)  # "hello, how are you?"

Train custom BPE tokenizer

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize tokenizer with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2
)

# Train on files
files = ["train.txt", "validation.txt"]
tokenizer.train(files, trainer)

# Save
tokenizer.save("my-tokenizer.json")

Training time: ~1-2 minutes for 100MB corpus, ~10-20 minutes for 1GB

Batch encoding with padding

# Enable padding (look up the pad ID rather than hard-coding it)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")

# Encode batch
texts = ["Hello world", "This is a longer sentence"]
encodings = tokenizer.encode_batch(texts)

for encoding in encodings:
    print(encoding.ids)
# Example output with bert-base-uncased, where [PAD] has ID 0:
# [101, 7592, 2088, 102, 0, 0, 0]
# [101, 2023, 2003, 1037, 2936, 6251, 102]

Tokenization algorithms

BPE (Byte-Pair Encoding)

How it works:

Start with character-level vocabulary
Find most frequent character pair
Merge into new token, add to vocabulary
Repeat until vocabulary size reached
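
The four steps above can be sketched in pure Python. This is a toy illustration, not the library's Rust implementation: the word counts in `toy_corpus` are made up, and the helper names (`pair_counts`, `merge_pair`, `train_bpe`) are hypothetical.

```python
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts

def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Start from characters and apply `num_merges` most-frequent-pair merges."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties go to the first pair seen
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Hypothetical word frequencies, for illustration only
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
bpe_merges, bpe_words = train_bpe(toy_corpus, num_merges=3)
print(bpe_merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

A real trainer stops at a target vocabulary size rather than a fixed merge count, and the learned merge list is what makes tokenization depend on merge order.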

Used by: GPT-2, GPT-3, RoBERTa, BART, DeBERTa

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

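tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],
    # Seed the vocabulary with every byte symbol so bytes that never
    # appear in the training data can still be encoded
    initial_alphabet=ByteLevel.alphabet(),
    min_frequency=2
)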

tokenizer.train(files=["data.txt"], trainer=trainer)

Advantages:

Handles OOV words well (breaks into subwords)
Flexible vocabulary size
Good for morphologically rich languages

Trade-offs:

Tokenization depends on merge order
May split common words unexpectedly

WordPiece

How it works:

Start with character vocabulary
Score merge pairs: frequency(pair) / (frequency(first) × frequency(second))
Merge highest scoring pair
Repeat until vocabulary size reached
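
A small worked example of the scoring formula, as a sketch only. The word counts are hypothetical, the words are assumed to be pre-split into a first symbol plus `##`-prefixed continuation symbols, and `wordpiece_scores` is an illustrative helper rather than a library function.

```python
from collections import Counter

def wordpiece_scores(words):
    """Score each adjacent pair as freq(pair) / (freq(first) * freq(second))."""
    symbol_freq, pair_freq = Counter(), Counter()
    for symbols, freq in words.items():
        for symbol in symbols:
            symbol_freq[symbol] += freq
        for left, right in zip(symbols, symbols[1:]):
            pair_freq[(left, right)] += freq
    return {
        pair: count / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

# Hypothetical corpus: word (as symbols) -> frequency
toy_words = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}
scores = wordpiece_scores(toy_words)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])  # ('##g', '##s') 0.05
```

The most frequent pair here is ("##u", "##g") with 20 occurrences, which is what BPE would merge first. WordPiece picks ("##g", "##s") instead, because "##s" almost never appears without "##g" in front of it.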

Used by: BERT, DistilBERT, MobileBERT

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import BertNormalizer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)

Advantages:

Prioritizes pairs whose parts rarely occur apart (a high score means the two symbols mostly appear together)
Used successfully in BERT (state-of-the-art results)

Trade-offs:

A whole word becomes [UNK] if any part of it has no matching subword
Stores only the final vocabulary, not merge rules; encoding uses greedy longest-match lookup

Unigram

How it works:

Start with large vocabulary (all substrings)
Compute loss for corpus with current vocabulary
Remove tokens with minimal impact on loss
Repeat until vocabulary size reached

Used by: ALBERT, T5, mBART, XLNet (via SentencePiece)

from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())

trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>"
)

tokenizer.train(files=["data.txt"], trainer=trainer)

Advantages:

Probabilistic (finds most likely tokenization)
Works well for languages without word boundaries
Handles diverse linguistic contexts
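
To make "finds most likely tokenization" concrete, here is a minimal Viterbi sketch over a fixed unigram vocabulary. The token probabilities are invented for illustration (and not normalized), and `viterbi_segment` is a hypothetical helper, not the library's API.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-scoring segmentation of `text` and its log-probability."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-prob of text[:end]
    best_start = [0] * (n + 1)          # start index of the last piece
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to the start
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]

# Hypothetical token probabilities, for illustration only
toy_probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
             "hu": 0.1, "ug": 0.2, "hug": 0.3, "gs": 0.05}
toy_log_probs = {token: math.log(p) for token, p in toy_probs.items()}
pieces, score = viterbi_segment("hugs", toy_log_probs)
print(pieces)  # ['hug', 's']
```

During training, the same per-token probabilities are used to compute the corpus loss, and the tokens whose removal hurts that loss least are pruned.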

Trade-offs:

Computationally expensive to train
More hyperparameters to tune

Tokenization pipeline

Complete pipeline: Normalization → Pre-tokenization → Model → Post-processing

Normalization

Clean and standardize text:

from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

tokenizer.normalizer = Sequence([
    NFD(),           # Unicode normalization (decompose)
    Lowercase(),     # Convert to lowercase
    StripAccents()   # Remove accents
])

# Input: "Héllo WORLD"
# After normalization: "hello world"
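
For intuition, the same three steps can be approximated with the standard library. This is a sketch, assuming that dropping nonspacing marks (Unicode category Mn) after NFD decomposition is a close enough stand-in for StripAccents; `normalize` is a hypothetical helper.

```python
import unicodedata

def normalize(text):
    """Approximate NFD -> Lowercase -> StripAccents with the standard library."""
    decomposed = unicodedata.normalize("NFD", text)  # "é" -> "e" + combining accent
    lowered = decomposed.lower()
    # Drop nonspacing combining marks, which is where the accents now live
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")

normalized = normalize("Héllo WORLD")
print(normalized)  # hello world
```

The order matters: accents can only be stripped as separate characters after decomposition, which is why NFD comes first in the sequence.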

Common normalizers:

NFD, NFC, NFKD, NFKC - Unicode normalization forms
Lowercase() - Convert to lowercase
StripAccents() - Remove accents (é → e)
Strip() - Remove whitespace
Replace(pattern, content) - Regex replacement

Pre-tokenization

Split text into word-like units:

from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel

# Split on whitespace and punctuation
tokenizer.pre_tokenizer = Sequence([
    Whitespace(),
    Punctuation()
])

# Input: "Hello, world!"
# After pre-tokenization: ["Hello", ",", "world", "!"]
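
The Whitespace pre-tokenizer is documented as splitting with the regex `\w+|[^\w\s]+`. A rough standard-library equivalent, which also records character offsets, might look like this (`pre_tokenize` is a hypothetical helper, and Python's `\w` may differ from the library's on some Unicode edge cases):

```python
import re

# Runs of word characters, or runs of characters that are neither word nor space
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

def pre_tokenize(text):
    """Return (piece, (start, end)) tuples for each word-like unit."""
    return [(m.group(), m.span()) for m in WORD_OR_PUNCT.finditer(text)]

pre_tokens = pre_tokenize("Hello, world!")
print(pre_tokens)
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```

Keeping the spans at this stage is what later allows each token to be mapped back to its position in the original text.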

Common pre-tokenizers:

Whitespace() - Split on word boundaries using the regex \w+|[^\w\s]+ (use WhitespaceSplit() to split on whitespace only)
ByteLevel() - GPT-2 style byte-level splitting
Punctuation() - Isolate punctuation
Digits(individual_digits=True) - Split digits individually
Metaspace() - Replace spaces with ▁ (SentencePiece style)

Post-processing

Add special tokens for model input:

from tokenizers.processors import TemplateProcessing

# BERT-style: [CLS] sentence [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",  # :1 assigns type ID 1 to the second segment
    special_tokens=[
        ("[CLS]", 1),  # IDs must match the tokenizer's vocabulary
        ("[SEP]", 2),
    ],
)

Common patterns:

# GPT-2: sentence <|endoftext|>
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)]
)

# RoBERTa: <s> sentence </s>
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)]
)
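
What a template does to the token IDs can be sketched in a few lines. This is a simplified illustration: `apply_template` is a hypothetical helper, the IDs are made up, and it ignores type IDs (the `$B:1` syntax) and offsets, which the real post-processor also handles.

```python
def apply_template(template, token_to_id, ids_a, ids_b=None):
    """Expand a template such as "[CLS] $A [SEP]" into a flat list of IDs."""
    out = []
    for piece in template.split():
        if piece == "$A":
            out.extend(ids_a)
        elif piece == "$B":
            if ids_b is None:
                raise ValueError("template uses $B but no second sequence was given")
            out.extend(ids_b)
        else:
            out.append(token_to_id[piece])  # a special token named in the template
    return out

special_ids = {"[CLS]": 1, "[SEP]": 2}  # hypothetical IDs
single = apply_template("[CLS] $A [SEP]", special_ids, [10, 11])
pair = apply_template("[CLS] $A [SEP] $B [SEP]", special_ids, [10, 11], [20])
print(single)  # [1, 10, 11, 2]
print(pair)    # [1, 10, 11, 2, 20, 2]
```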

Alignment tracking

Track token positions in original text:

text = "Hello, world!"
output = tokenizer.encode(text)

# Get token offsets
for token, offset in zip(output.tokens, output.offsets):
    start, end = offset
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")

# Output (assuming a lowercasing tokenizer with no special tokens added):
# hello      → [ 0,  5): 'Hello'
# ,          → [ 5,  6): ','
# world      → [ 7, 12): 'world'
# !          → [12, 13): '!'

Use cases:

Named entity recognition (map predictions back to text)
Question answering (extract answer spans)
Token classification (align labels to original positions)
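
For the use cases above, the usual operation is going from a character span (an entity or an answer) to the tokens that cover it. A minimal sketch, using offsets in the same `(start, end)` form as `output.offsets`; `tokens_for_span` is a hypothetical helper, and the offsets are written out by hand so the snippet runs on its own.

```python
def tokens_for_span(offsets, span_start, span_end):
    """Indices of tokens whose [start, end) offsets overlap the character span."""
    return [
        i for i, (start, end) in enumerate(offsets)
        # end > start skips zero-width offsets, which special tokens typically have
        if end > start and start < span_end and end > span_start
    ]

text = "Hello, world!"
offsets = [(0, 5), (5, 6), (7, 12), (12, 13)]  # one (start, end) pair per token

# Character span of the word "world" in the original text
span_start = text.index("world")
span_end = span_start + len("world")
world_tokens = tokens_for_span(offsets, span_start, span_end)
print(world_tokens)  # [2]
```

The same lookup in reverse (token index to `text[start:end]`) recovers the original, un-normalized surface form of a predicted token.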

Integration with transformers

Load with AutoTokenizer

from transformers import AutoTokenizer

# AutoTokenizer automatically uses fast tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if using fast tokenizer
print(tokenizer.is_fast)  # True

# Access underlying tokenizers.Tokenizer
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>

Convert custom tokenizer to transformers

from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast

# Train custom tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# ... train tokenizer ...
tokenizer.save("my-tokenizer.json")

# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)

# Use like any transformers tokenizer
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

Common patterns

Train from iterator (large datasets)

from datasets import load_dataset

# Load dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Create batch iterator
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset)  # For progress bar
)

Performance: Processes 1GB in ~10-20 minutes

Enable truncation and padding

# Enable truncation
tokenizer.enable_truncation(max_length=512)

# Enable padding
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512  # Fixed length, or None for batch max
)

# Encode with both
output = tokenizer.encode("This is a long sentence that will be truncated...")
print(len(output.ids))  # 512
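
The effect of these two settings on a batch of ID lists can be sketched as follows. This assumes right-side truncation and padding, and `pad_batch` is a hypothetical helper operating on plain lists, not on the library's Encoding objects.

```python
def pad_batch(batch, pad_id, max_length=None, length=None):
    """Truncate each sequence to max_length, then right-pad to a common length.

    length=None pads to the longest sequence in the batch.
    """
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]
    target = length if length is not None else max(len(ids) for ids in batch)
    padded = [ids + [pad_id] * (target - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding
    masks = [[1] * len(ids) + [0] * (target - len(ids)) for ids in batch]
    return padded, masks

batch = [[7, 8, 9], [7, 8, 9, 10, 11]]  # made-up token IDs
padded, masks = pad_batch(batch, pad_id=0)
print(padded)  # [[7, 8, 9, 0, 0], [7, 8, 9, 10, 11]]
print(masks)   # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]

fixed, _ = pad_batch(batch, pad_id=0, max_length=4, length=6)
print(fixed)   # [[7, 8, 9, 0, 0, 0], [7, 8, 9, 10, 0, 0]]
```

Padding to the batch maximum wastes less compute than a fixed length of 512, while a fixed length gives uniformly shaped tensors.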

Multi-processing

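from multiprocessing import Pool
from tokenizers import Tokenizer

# Loaded at module level so every worker process has its own copy.
# Note: encode_batch already uses multiple threads internally.
tokenizer = Tokenizer.from_file("tokenizer.json")

def encode_chunk(texts):
    # Return plain ID lists, which are cheap to send back to the parent
    return [encoding.ids for encoding in tokenizer.encode_batch(texts)]

if __name__ == "__main__":
    # corpus: a list of strings ("corpus.txt" is a placeholder path)
    with open("corpus.txt", encoding="utf-8") as f:
        corpus = f.read().splitlines()

    # Split corpus into chunks
    chunk_size = 1000
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]

    # Encode in parallel across 8 worker processes
    with Pool(8) as pool:
        results = pool.map(encode_chunk, chunks)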

Speedup: 5-8× with 8 cores

Performance benchmarks

Training speed

Corpus Size	BPE (30k vocab)	WordPiece (30k)	Unigram (8k)
10 MB	15 sec	18 sec	25 sec
100 MB	1.5 min	2 min	4 min
1 GB	15 min	20 min	40 min

Hardware: 16-core CPU, tested on English Wikipedia

Tokenization speed

Implementation	1 GB corpus	Throughput
Pure Python	~20 minutes	~50 MB/min
HF Tokenizers	~15 seconds	~4 GB/min
Speedup	80×	80×

Test: English text, average sentence length 20 words
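
Figures like these depend heavily on hardware and text, so it is worth measuring on your own data. Below is a minimal timing harness, as a sketch: `measure_throughput` is a hypothetical helper that accepts any batch-encoding callable, and a trivial whitespace splitter stands in so the snippet runs without the library. To benchmark the library itself, pass `tokenizer.encode_batch` instead.

```python
import time

def measure_throughput(encode_batch, texts, batch_size=1000):
    """Return throughput in MB/s (UTF-8 bytes of input per second)."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        encode_batch(texts[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return total_bytes / 1e6 / elapsed if elapsed > 0 else float("inf")

def whitespace_encode_batch(batch):
    """Stand-in encoder used only to make this example self-contained."""
    return [text.split() for text in batch]

sample = ["the quick brown fox jumps over the lazy dog"] * 10_000
mb_per_s = measure_throughput(whitespace_encode_batch, sample)
print(f"{mb_per_s:.1f} MB/s")
```

For reference, the table's ~4 GB/min corresponds to roughly 67 MB/s, and ~50 MB/min to under 1 MB/s.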

Memory usage

Task	Memory
Load tokenizer	~10 MB
Train BPE (30k vocab)	~200 MB
Encode 1M sentences	~500 MB

Supported models

Pre-trained tokenizers available via from_pretrained():

BERT family:

bert-base-uncased, bert-large-cased
distilbert-base-uncased
roberta-base, roberta-large

GPT family:

gpt2, gpt2-medium, gpt2-large
distilgpt2

T5 family:

t5-small, t5-base, t5-large
google/flan-t5-xxl

Other:

facebook/bart-base, facebook/mbart-large-cc25
albert-base-v2, albert-xlarge-v2
xlm-roberta-base, xlm-roberta-large

Browse all: https://huggingface.co/models?library=tokenizers

References

Training Guide - Train custom tokenizers, configure trainers, handle large datasets
Algorithms Deep Dive - BPE, WordPiece, Unigram explained in detail
Pipeline Components - Normalizers, pre-tokenizers, post-processors, decoders
Transformers Integration - AutoTokenizer, PreTrainedTokenizerFast, special tokens

Resources

Docs: https://huggingface.co/docs/tokenizers
GitHub: https://github.com/huggingface/tokenizers ⭐ 9,000+
Version: 0.20.0+
Course: https://huggingface.co/learn/nlp-course/chapter6/1
Paper: BPE (Sennrich et al., 2016), WordPiece (Schuster & Nakajima, 2012)

Source

git clone https://github.com/Orchestra-Research/AI-Research-SKILLs.git

The skill file is at 02-tokenization/huggingface-tokenizers/SKILL.md in the cloned repository.

Overview

HuggingFace Tokenizers offers fast, production-ready tokenization with a Rust core and Python/Node.js bindings. It supports the BPE, WordPiece, and Unigram algorithms and lets you train custom vocabularies, track alignments, and manage padding and truncation. It integrates with transformers, making it ideal for high-performance NLP pipelines.

How This Skill Works

The library is built around a Rust core for speed, exposed to Python/Node.js for easy use. You can train new vocabularies with a BpeTrainer, apply pre-tokenizers, and use the offsets returned with each encoding to map tokens to original text positions, then load or save tokenizers for production workflows.

When to Use It

When you need ultra-fast tokenization (<20 seconds per GB) on CPU for large corpora.
When you want to train a tokenizer from scratch using BPE, WordPiece, or Unigram.
When maintaining token-to-text alignment is important for downstream tasks.
When building production NLP pipelines that require robust padding and truncation handling.
When integrating with transformers and other HuggingFace tooling for scalable workflows.

Quick Start

Step 1: Install tokenizers with pip install tokenizers.
Step 2: Load a pretrained tokenizer from the HuggingFace Hub and encode sample text.
Step 3: Train a custom BPE tokenizer on your corpus using a BPE trainer, then save the tokenizer.

Best Practices

Benchmark on your own hardware and data to check whether you reach the <20s per GB figure.
Choose the right model: BPE for flexible subwords, WordPiece or Unigram as appropriate.
Enable alignment tracking when you need token position mapping.
Configure padding and truncation correctly and use batch encoding for throughput.
Train vocabularies on representative data and validate with downstream models; leverage the Rust core for production.

Example Use Cases

Tokenize 1GB of text in under 20 seconds on CPU.
Train a custom BPE tokenizer with a 30k vocab and special tokens.
Load a pretrained tokenizer from the HuggingFace Hub (e.g., bert-base-uncased) and encode text.
Encode a batch of sentences with padding to fixed length sequences.
Use alignment tracking to align token IDs back to the source text in a dataset.