npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/sentencepiece --openclaw

SentencePiece - Language-Independent Tokenization

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

When to use SentencePiece

Use SentencePiece when:

  • Building multilingual models (no language-specific rules)
  • Working with CJK languages (Chinese, Japanese, Korean)
  • Needing reproducible tokenization (deterministic vocabulary)
  • Training on raw text (no pre-tokenization needed)
  • Requiring lightweight deployment (~6MB memory, ~50k sentences/sec)

Performance:

  • Speed: 50,000 sentences/sec
  • Memory: ~6MB for loaded model
  • Languages: All (language-independent)

Use alternatives instead:

  • HuggingFace Tokenizers: Faster training, more flexibility
  • tiktoken: OpenAI models (GPT-3.5/4)
  • BERT WordPiece: English-centric tasks

Quick start

Installation

# Python
pip install sentencepiece

# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install

Train model

# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe

# Python API
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)

Training time: ~1-2 minutes for 100MB corpus

Encode and decode

import sentencepiece as spm

# Load model
sp = spm.SentencePieceProcessor(model_file='m.model')

# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces)  # e.g. ['▁This', '▁is', '▁a', '▁test'] (the split depends on the trained model)

# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids)  # e.g. [284, 47, 11, 1243] (IDs depend on the trained vocabulary)

# Decode
text = sp.decode(ids)
print(text)  # "This is a test"

Language-independent design

Whitespace as symbol (▁)

text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces)  # ['▁Hello', '▁world']

# Decode preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded)  # "Hello world"

Key principle: Treat text as raw Unicode, whitespace = ▁ (meta symbol)
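
The whitespace handling can be illustrated without the library. The following is a minimal pure-Python sketch, not SentencePiece's implementation: the helper names `escape_whitespace` and `detokenize` are hypothetical, and the three-piece split is hand-picked for illustration. It shows why decoding is lossless: concatenating the pieces and mapping ▁ back to a space recovers the input.

```python
META = "\u2581"  # '▁' (LOWER ONE EIGHTH BLOCK), the whitespace meta symbol

def escape_whitespace(text: str) -> str:
    """Replace each space with the meta symbol and prepend one to the text."""
    return META + text.replace(" ", META)

def detokenize(pieces: list[str]) -> str:
    """Concatenate pieces, map meta symbols back to spaces, drop the leading one."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text

escaped = escape_whitespace("Hello world")

# Hand-picked split of the escaped string; a trained model chooses its own pieces.
pieces = [META + "Hello", META + "wor", "ld"]
restored = detokenize(pieces)

print(escaped)
print(restored)
```

Because the space is an ordinary symbol inside the pieces, no language-specific detokenizer is needed to decide where spaces belong.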

Tokenization algorithms

BPE (Byte-Pair Encoding)

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='bpe_model',
    vocab_size=16000,
    model_type='bpe'
)

Used by: mBART
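
For intuition about what BPE training does, here is a toy sketch of the classic merge-learning loop in pure Python. It is an illustration under simplifying assumptions (a hand-written word-frequency table, a hypothetical function `learn_bpe_merges`), not the optimized trainer inside SentencePiece: it repeatedly counts adjacent symbol pairs and merges the most frequent one.

```python
from collections import Counter

def learn_bpe_merges(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    # Every word starts out as a tuple of single characters.
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        merged_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_vocab[key] = merged_vocab.get(key, 0) + freq
        vocab = merged_vocab
    return merges

# Toy counts; the leading ▁ marks the start of a word.
toy_counts = {"▁low": 5, "▁lower": 2, "▁newest": 6, "▁widest": 3}
merges = learn_bpe_merges(toy_counts, num_merges=3)
print(merges)
```

The learned merge list, applied in order, is what turns characters into subword pieces at encoding time.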

Unigram (default)

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='unigram_model',
    vocab_size=8000,
    model_type='unigram'
)

Used by: T5, ALBERT, XLNet
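
A Unigram model assigns each piece a probability and picks the segmentation with the highest total probability. The sketch below shows that search as a small Viterbi pass in pure Python. The probabilities in `toy_log_probs` are invented for illustration and `viterbi_segment` is a hypothetical helper, not part of the SentencePiece API.

```python
import math

def viterbi_segment(text: str, log_probs: dict[str, float]) -> list[str]:
    """Return the segmentation of `text` with the highest summed log-probability."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best score of any segmentation of text[:i]
    back = [0] * (n + 1)                # start index of the last piece in that segmentation
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece not in log_probs:
                continue
            score = best_score[start] + log_probs[piece]
            if score > best_score[end]:
                best_score[end] = score
                back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Invented piece probabilities, for illustration only.
toy_log_probs = {
    "▁token": math.log(0.05),
    "ization": math.log(0.02),
    "▁tok": math.log(0.01),
    "en": math.log(0.03),
    "iz": math.log(0.005),
    "ation": math.log(0.02),
}
best = viterbi_segment("▁tokenization", toy_log_probs)
print(best)
```

With these numbers the two-piece split wins because its product of probabilities is larger than that of any longer split.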

Training configuration

Essential parameters

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=0.9995,  # library default; the SentencePiece README suggests 1.0 for small character sets
    user_defined_symbols=['[SEP]', '[CLS]'],
    unk_piece='<unk>',
    num_threads=16
)

Character coverage

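Language type                          Coverage  Rationale
Small character set (e.g. English)     1.0       Every character fits in the vocabulary
Rich character set (Chinese, Japanese) 0.9995    Very rare characters map to <unk> instead of taking vocabulary slots
Multilingual                           0.9995    Balance between coverage and vocabulary size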
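
To make the setting concrete: character_coverage is the fraction of character occurrences in the corpus that the model must be able to represent, and characters outside that set end up as unknown. The following pure-Python sketch approximates the idea with a hypothetical helper, `covered_characters`. The library's own selection logic may differ in detail.

```python
from collections import Counter

def covered_characters(corpus: str, coverage: float) -> set[str]:
    """Smallest set of most-frequent characters whose counts reach `coverage`."""
    counts = Counter(ch for ch in corpus if not ch.isspace())
    total = sum(counts.values())
    kept, running = set(), 0
    for ch, count in counts.most_common():
        if running / total >= coverage:
            break  # the requested share of occurrences is already covered
        kept.add(ch)
        running += count
    return kept

# Synthetic corpus of 10,000 characters: one very rare character among common ones.
corpus = "a" * 9990 + "b" * 9 + "\u9f98"
kept_default = covered_characters(corpus, 0.9995)
kept_full = covered_characters(corpus, 1.0)
print(len(kept_default), len(kept_full))
```

At 0.9995 the single rare character is dropped, while at 1.0 every character seen in the corpus is kept.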

Encoding options

Subword regularization

# Sample different tokenizations
for _ in range(3):
    pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
    print(pieces)

# Output (different each time):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']

Use case: Data augmentation for robustness.
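
The idea behind sampling can be sketched in a few lines: instead of always taking the most probable segmentation, draw one with probability proportional to its score raised to alpha, so a smaller alpha flattens the distribution. This toy version enumerates every segmentation, which only works for short strings. The probabilities and helper names are assumptions for illustration, and the library samples over a lattice rather than enumerating.

```python
import math
import random

def all_segmentations(text: str, vocab) -> list[list[str]]:
    """Enumerate every way to split `text` into pieces found in `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            results.extend([piece] + rest for rest in all_segmentations(text[end:], vocab))
    return results

def sample_segmentation(text: str, probs: dict[str, float], alpha: float, rng: random.Random) -> list[str]:
    """Draw one segmentation with probability proportional to P(segmentation) ** alpha."""
    candidates = all_segmentations(text, probs)
    weights = [math.prod(probs[p] for p in seg) ** alpha for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Invented piece probabilities, for illustration only.
toy_probs = {"▁token": 0.05, "ization": 0.02, "▁tok": 0.01, "en": 0.03}
rng = random.Random(0)
samples = [sample_segmentation("▁tokenization", toy_probs, alpha=0.1, rng=rng) for _ in range(5)]
for pieces in samples:
    print(pieces)
```

Every sampled split still concatenates to the same string, so the model sees varied tokenizations of identical text.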

Common patterns

T5-style training

spm.SentencePieceTrainer.train(
    input='c4_corpus.txt',
    model_prefix='t5',
    vocab_size=32000,
    model_type='unigram',
    user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
    pad_id=0,
    eos_id=1,
    unk_id=2,
    bos_id=-1  # disable BOS; its default ID (1) would collide with eos_id
)

Integration with transformers

from transformers import T5Tokenizer

# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')

Performance benchmarks

Training speed

Corpus   BPE (16k)   Unigram (8k)
100 MB   1-2 min     3-4 min
1 GB     10-15 min   30-40 min

Tokenization speed

  • SentencePiece: 50,000 sentences/sec
  • HF Tokenizers: 200,000 sentences/sec (4× faster)
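
Throughput figures like these depend on hardware, sentence length, and vocabulary, so it is worth measuring on your own data. Below is a minimal timing sketch. `sentences_per_second` is a hypothetical helper, and `str.split` stands in for a real tokenizer so the snippet runs without a trained model.

```python
import time
from typing import Callable, Sequence

def sentences_per_second(tokenize: Callable[[str], list], sentences: Sequence[str], repeats: int = 3) -> float:
    """Best-of-`repeats` throughput of `tokenize` over `sentences`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for sentence in sentences:
            tokenize(sentence)
        best = min(best, time.perf_counter() - start)
    return len(sentences) / best

sample = ["This is a test sentence for timing."] * 10_000
rate = sentences_per_second(str.split, sample)  # str.split is only a stand-in tokenizer
print(f"{rate:,.0f} sentences/sec")
```

To time a real model, pass the `encode` method of a loaded SentencePieceProcessor in place of `str.split`.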

Supported models

  • T5 family: t5-base, t5-large (32k vocab, Unigram)
  • ALBERT: albert-base-v2 (30k vocab, Unigram)
  • XLNet: xlnet-base-cased (32k vocab, Unigram)
  • mBART: facebook/mbart-large-50 (250k vocab, BPE)

Source

https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/02-tokenization/sentencepiece/SKILL.md

Overview

SentencePiece is an unsupervised tokenizer that treats text as raw Unicode, removing language-specific preprocessing. It supports both BPE and Unigram models, is fast and lightweight (about 6MB memory), and yields a deterministic vocabulary. Widely used in models like T5, ALBERT, XLNet, and mBART, it trains on raw text without pre-tokenization.

How This Skill Works

It learns subword units directly from raw text and produces a deterministic vocabulary. The design uses a special whitespace symbol (▁) to mark boundaries, enabling language-independent tokenization. After training, you load the model and encode/decode to or from pieces or IDs without language-specific rules.

When to Use It

  • Building multilingual models with no language-specific rules.
  • Working with CJK languages (Chinese, Japanese, Korean) without language-specific tooling.
  • Needing reproducible, deterministic tokenization across runs and deployments.
  • Training on raw text without any pre-tokenization step.
  • Seeking lightweight deployment (model ~6MB; ~50k sentences/sec) for fast inference.

Quick Start

  1. Install: pip install sentencepiece
  2. Train a model: spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
  3. Encode/decode: load the model with SentencePieceProcessor(model_file='m.model'), then call encode and decode as needed

Best Practices

  • Choose model_type based on your target model: BPE for mBART-style work, Unigram for T5/ALBERT/XLNet.
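  • Set character_coverage appropriately: the SentencePiece README suggests 0.9995 for languages with a rich character set such as Chinese or Japanese, and 1.0 for languages with a small character set.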
  • Include user_defined_symbols for task-specific tokens like [SEP] or [CLS].
  • Train on a representative, diverse corpus to improve vocab coverage across languages.
  • Use subword regularization with care; it can introduce tokenization variability during training.

Example Use Cases

  • T5, ALBERT, and XLNet commonly use Unigram tokenization with SentencePiece.
  • mBART employs BPE-based SentencePiece models for multilingual translation tasks.
  • SentencePiece is widely used to train models directly on raw text without pre-tokenization.
  • Models can be deployed in lightweight environments with a small footprint (~6MB).
  • Training on raw corpora allows consistent, language-agnostic tokenization across languages.
