What is SentencePiece?

An unsupervised tokenizer that treats text as raw Unicode and learns subword units using BPE or Unigram, without language-specific preprocessing.

Is SentencePiece language-specific?

No. It is language-independent and works across languages by learning a deterministic subword vocabulary from raw text.

What models/algorithms does it support?

SentencePiece supports BPE and Unigram subword tokenization (plus simple character and word models) and is used by models such as T5, ALBERT, XLNet, and mBART, including multilingual ones.

SentencePiece - Language-Independent Tokenization

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

When to use SentencePiece

Use SentencePiece when:

Building multilingual models (no language-specific rules)
Working with CJK languages (Chinese, Japanese, Korean)
Needing reproducible tokenization (deterministic vocabulary)
Training on raw text (no pre-tokenization needed)
Requiring lightweight deployment (~6MB memory, ~50k sentences/sec)

Performance:

Speed: 50,000 sentences/sec
Memory: ~6MB for loaded model
Languages: All (language-independent)

Use alternatives instead:

HuggingFace Tokenizers: Faster training, more flexibility
tiktoken: OpenAI models (GPT-3.5/4)
BERT WordPiece: English-centric tasks

Quick start

Installation

# Python
pip install sentencepiece

# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install

Train model

# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe

# Python API
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)

Training time: ~1-2 minutes for 100MB corpus

Encode and decode

import sentencepiece as spm

# Load model
sp = spm.SentencePieceProcessor(model_file='m.model')

# Encode to pieces (the exact split depends on the trained model)
pieces = sp.encode('This is a test', out_type=str)
print(pieces)  # e.g. ['▁This', '▁is', '▁a', '▁test']

# Encode to IDs (the values depend on the trained vocabulary)
ids = sp.encode('This is a test', out_type=int)
print(ids)  # e.g. [284, 47, 11, 1243]

# Decode
text = sp.decode(ids)
print(text)  # "This is a test"

Language-independent design

Whitespace as symbol (▁)

text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces)  # ['▁Hello', '▁world']

# Decode preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded)  # "Hello world"

Key principle: Treat text as raw Unicode, whitespace = ▁ (meta symbol)
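
The round trip above works because the space is escaped into an ordinary symbol before any splitting happens. A minimal pure-Python sketch of that idea follows; the helper names `escape_whitespace` and `unescape_whitespace` are hypothetical and this is an illustration, not SentencePiece's implementation.

```python
# Toy illustration of the whitespace meta symbol; not SentencePiece's actual code.
META = "\u2581"  # the "▁" character (LOWER ONE EIGHTH BLOCK)

def escape_whitespace(text: str) -> str:
    """Prefix the text with the meta symbol and replace each space with it."""
    return META + text.replace(" ", META)

def unescape_whitespace(pieces: list[str]) -> str:
    """Concatenate pieces, turn meta symbols back into spaces, drop the leading one."""
    joined = "".join(pieces).replace(META, " ")
    return joined[1:] if joined.startswith(" ") else joined

escaped = escape_whitespace("Hello world")
# Any split of the escaped string is reversible, because spaces are ordinary symbols.
pieces = [escaped[:4], escaped[4:6], escaped[6:]]
restored = unescape_whitespace(pieces)
print(pieces)
print(restored)
```

Because no information is discarded at the escaping step, decoding needs no language-specific rules about where spaces belong.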

Tokenization algorithms

BPE (Byte-Pair Encoding)

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='bpe_model',
    vocab_size=16000,
    model_type='bpe'
)

Used by: mBART
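
For intuition about what BPE training does, here is a toy sketch of the merge-learning loop: count adjacent symbol pairs, fuse the most frequent pair, repeat. The function `learn_bpe_merges` and the word counts are made up for illustration; the real trainer is considerably more optimized and works on whole sentences rather than a word table.

```python
from collections import Counter

def learn_bpe_merges(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn merge rules from a word-frequency table (toy version)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges

# Hypothetical word counts, with the whitespace marker already attached.
toy_counts = {"▁low": 5, "▁lower": 2, "▁newest": 6, "▁widest": 3}
merges = learn_bpe_merges(toy_counts, num_merges=3)
print(merges)
```

Each learned merge becomes one vocabulary entry, which is why `vocab_size` effectively controls how many merge steps are performed.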

Unigram (default)

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='unigram_model',
    vocab_size=8000,
    model_type='unigram'
)

Used by: T5, ALBERT, XLNet
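
A Unigram model assigns each piece a probability and tokenizes by picking the segmentation with the highest total probability. The sketch below shows that decoding step with a Viterbi search over a tiny hand-written vocabulary; the probabilities and the function name `viterbi_segment` are invented for illustration and are not taken from any trained model.

```python
import math

def viterbi_segment(text: str, log_probs: dict[str, float]) -> list[str]:
    """Return the highest-scoring segmentation of text under a unigram piece model."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best_score[i]: best log-prob of text[:i]
    back = [0] * (n + 1)                # back[i]: start index of the last piece
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Hypothetical piece probabilities.
toy_log_probs = {
    "▁token": math.log(0.05), "ization": math.log(0.02),
    "▁tok": math.log(0.01), "en": math.log(0.03),
}
best = viterbi_segment("▁tokenization", toy_log_probs)
print(best)
```

Training a Unigram model amounts to estimating these piece probabilities and pruning low-value pieces until the target `vocab_size` is reached.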

Training configuration

Essential parameters

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=0.9995,  # 0.9995 for rich character sets (e.g. CJK); 1.0 for small ones
    user_defined_symbols=['[SEP]', '[CLS]'],
    unk_piece='<unk>',
    num_threads=16
)

Character coverage

Language Type	Coverage	Rationale
English (small character set)	1.0	Every character can be kept
CJK (Chinese, Japanese)	0.9995	Very rare characters are left out of the vocabulary
Multilingual	0.9995	Balance
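
To make the coverage value concrete, the following sketch picks the most frequent characters until they account for the requested share of the corpus; characters left out would then be treated as unknown. The function `covered_characters` and the toy corpus are assumptions for illustration, not the library's exact procedure.

```python
from collections import Counter

def covered_characters(corpus: str, coverage: float) -> set[str]:
    """Smallest set of most-frequent characters whose counts reach the coverage share."""
    counts = Counter(corpus)
    total = sum(counts.values())
    kept, running = set(), 0
    for char, count in counts.most_common():
        if running >= coverage * total:
            break
        kept.add(char)
        running += count
    return kept

# Toy corpus of 10,000 characters with one very rare character at the end.
toy_corpus = "a" * 9000 + "b" * 990 + "c" * 9 + "龘"
full = covered_characters(toy_corpus, 1.0)
partial = covered_characters(toy_corpus, 0.9995)
print(len(full))     # 4: every character is kept
print(len(partial))  # 3: the single rare character is dropped
```

With a large character inventory, dropping the long tail of rare characters keeps them from consuming vocabulary slots.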

Encoding options

Subword regularization

# Sample different tokenizations
for _ in range(3):
    pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
    print(pieces)

# Example output (varies between runs and between models):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']

Use case: Data augmentation for robustness.
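
As a rough picture of what sampling does, the sketch below enumerates every segmentation a tiny vocabulary allows and draws one with weight proportional to its probability raised to `alpha`, so a smaller `alpha` flattens the distribution. The vocabulary, probabilities, and helper names are hypothetical, and brute-force enumeration is only feasible for toy inputs.

```python
import math
import random

def all_segmentations(text, vocab):
    """Enumerate every way to split text into pieces from vocab."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            results.extend([head] + rest for rest in all_segmentations(text[end:], vocab))
    return results

def sample_segmentation(text, probs, alpha, rng):
    """Sample one segmentation with weight proportional to P(segmentation) ** alpha."""
    candidates = all_segmentations(text, probs)
    weights = [math.prod(probs[p] for p in seg) ** alpha for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical piece probabilities.
toy_probs = {"▁token": 0.05, "ization": 0.02, "▁tok": 0.01, "en": 0.03}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_segmentation("▁tokenization", toy_probs, alpha=0.1, rng=rng) for _ in range(3)]
for seg in samples:
    print(seg)
```

Sampling is meant for training time only; at inference, plain `encode` without `enable_sampling` keeps tokenization deterministic.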

Common patterns

T5-style training

spm.SentencePieceTrainer.train(
    input='c4_corpus.txt',
    model_prefix='t5',
    vocab_size=32000,
    model_type='unigram',
    user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
    unk_id=2,
    eos_id=1,
    pad_id=0,
    bos_id=-1  # disable BOS; its default id (1) would collide with eos_id=1
)
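
When remapping special-token ids it helps to check them against the library defaults (unk_id=0, bos_id=1, eos_id=2, pad_id=-1, where -1 disables a token), since any id you do not override keeps its default. The helper `find_id_collisions` below is a hypothetical pre-flight check, not part of SentencePiece.

```python
# Library defaults for the special-token ids; -1 disables a token.
DEFAULT_IDS = {"unk_id": 0, "bos_id": 1, "eos_id": 2, "pad_id": -1}

def find_id_collisions(**overrides):
    """Return {id: [names]} for every enabled id claimed by more than one special token."""
    ids = {**DEFAULT_IDS, **overrides}
    claimed = {}
    for name, value in ids.items():
        if value >= 0:
            claimed.setdefault(value, []).append(name)
    return {value: names for value, names in claimed.items() if len(names) > 1}

# Overriding unk/eos/pad alone leaves bos_id at its default of 1.
without_bos = find_id_collisions(unk_id=2, eos_id=1, pad_id=0)
with_bos_disabled = find_id_collisions(unk_id=2, eos_id=1, pad_id=0, bos_id=-1)
print(without_bos)
print(with_bos_disabled)
```

Running a check like this before a long training job avoids discovering an id clash only after the corpus has been loaded.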

Integration with transformers

from transformers import T5Tokenizer

# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')

Performance benchmarks

Training speed

Corpus	BPE (16k)	Unigram (8k)
100 MB	1-2 min	3-4 min
1 GB	10-15 min	30-40 min

Tokenization speed

SentencePiece: 50,000 sentences/sec
HF Tokenizers: 200,000 sentences/sec (4× faster)
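
Throughput figures like these depend heavily on hardware, sentence length, and vocabulary size, so it is worth measuring on your own data. A small timing harness might look like the sketch below; `sentences_per_second` is a hypothetical helper, and `str.split` stands in for a real tokenizer so the snippet runs without a trained model.

```python
import time

def sentences_per_second(tokenize, sentences, repeats=3):
    """Best-of-N throughput of a tokenize(sentence) callable, in sentences per second."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for sentence in sentences:
            tokenize(sentence)
        best = min(best, time.perf_counter() - start)
    return len(sentences) / best

# Stand-in tokenizer; swap in the encode method of a loaded SentencePieceProcessor.
sample = ["This is a test"] * 10_000
rate = sentences_per_second(str.split, sample)
print(f"{rate:,.0f} sentences/sec")
```

Taking the best of several repeats reduces noise from caching and scheduler jitter.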

Supported models

T5 family: t5-base, t5-large (32k vocab, Unigram)
ALBERT: albert-base-v2 (30k vocab, Unigram)
XLNet: xlnet-base-cased (32k vocab, Unigram)
mBART: facebook/mbart-large-50 (250k vocab, BPE)

References

Training Guide - Detailed options, corpus preparation
Algorithms - BPE vs Unigram, subword regularization

Resources

GitHub: https://github.com/google/sentencepiece ⭐ 10,000+
Paper: https://arxiv.org/abs/1808.06226 (EMNLP 2018)
Version: 0.2.0+

Source

https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/02-tokenization/sentencepiece/SKILL.md

Overview

SentencePiece is an unsupervised tokenizer that treats text as raw Unicode, removing language-specific preprocessing. It supports both BPE and Unigram models, is fast and lightweight (about 6MB memory), and yields a deterministic vocabulary. Widely used in models like T5, ALBERT, XLNet, and mBART, it trains on raw text without pre-tokenization.

How This Skill Works

It learns subword units directly from raw text and produces a deterministic vocabulary. The design uses a special whitespace symbol (▁) to mark boundaries, enabling language-independent tokenization. After training, you load the model and encode/decode to or from pieces or IDs without language-specific rules.

When to Use It

Building multilingual models with no language-specific rules.
Working with CJK languages (Chinese, Japanese, Korean) without language-specific tooling.
Needing reproducible, deterministic tokenization across runs and deployments.
Training on raw text without any pre-tokenization step.
Seeking lightweight deployment (model ~6MB; ~50k sentences/sec) for fast inference.

Quick Start

Step 1: Install: pip install sentencepiece
Step 2: Train a model: spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
Step 3: Encode/decode: load with SentencePieceProcessor(model_file='m.model'), then encode/decode as needed

Best Practices

Choose model_type based on your target model: BPE for mBART-style work, Unigram for T5/ALBERT/XLNet.
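Set character_coverage appropriately: 0.9995 for languages with large character sets (such as Chinese or Japanese) and for multilingual corpora, 1.0 for languages with small character sets (such as English).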
Include user_defined_symbols for task-specific tokens like [SEP] or [CLS].
Train on a representative, diverse corpus to improve vocab coverage across languages.
Use subword regularization with care; it can introduce tokenization variability during training.

Example Use Cases

T5, ALBERT, and XLNet commonly use Unigram tokenization with SentencePiece.
mBART employs BPE-based SentencePiece models for multilingual translation tasks.
SentencePiece is widely used to train models directly on raw text without pre-tokenization.
Models can be deployed in lightweight environments with a small footprint (~6MB).
Training on raw corpora allows consistent, language-agnostic tokenization across languages.
