cjk-aware-text-metrics
npx machina-cli add skill shimo4228/claude-code-learned-skills/cjk-aware-text-metrics --openclaw

CJK-Aware Text Metrics
Extracted: 2026-02-11
Context: Multilingual LLM pipelines where token estimation affects cost, chunking, or rate limits
Problem
Fixed chars-per-token constants (e.g., CHARS_PER_TOKEN = 4) assume Latin text.
Japanese/Chinese/Korean text runs at ~2.5 chars/token, so actual token usage comes
out ~60% above the fixed estimate, skewing token counts, cost previews, and chunk
sizing for CJK-heavy documents.
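The size of the gap follows directly from the two ratios; a short sketch (the 120-character sample length is arbitrary, chosen only for round numbers):

```python
# A 120-char pure-CJK document (length chosen arbitrarily for illustration)
n_chars = 120
fixed_estimate = n_chars / 4.0   # 30 tokens under the Latin assumption
likely_actual = n_chars / 2.5    # 48 tokens at the CJK ratio
gap = (likely_actual - fixed_estimate) / fixed_estimate
print(gap)  # 0.6 -> actual usage runs ~60% above the fixed estimate
```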
Solution
- Detect CJK characters by Unicode range:
  def _is_cjk(char: str) -> bool:
      cp = ord(char)
      return (
          0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
          or 0x3040 <= cp <= 0x309F   # Hiragana
          or 0x30A0 <= cp <= 0x30FF   # Katakana
          or 0x3400 <= cp <= 0x4DBF   # CJK Extension A
          or 0xF900 <= cp <= 0xFAFF   # CJK Compatibility Ideographs
      )
- Weighted token estimation:
  CJK_CHARS_PER_TOKEN = 2.5
  LATIN_CHARS_PER_TOKEN = 4.0

  def estimate_tokens(text: str) -> int:
      cjk_count = sum(1 for c in text if _is_cjk(c))
      other_count = len(text) - cjk_count
      return int(cjk_count / CJK_CHARS_PER_TOKEN + other_count / LATIN_CHARS_PER_TOKEN)
- Chunk splitting must use token-based accumulation (not char-based):
  # BAD:  char_limit = token_limit * FIXED_CONSTANT
  # GOOD: accumulate estimated tokens per paragraph
  current_tokens += estimate_tokens(para)
  if current_tokens > token_limit:
      flush_chunk()
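Putting the pieces together, a minimal paragraph-level chunker might look like the sketch below. Splitting on blank lines and the 500-token default limit are illustrative choices, not part of the extracted skill:

```python
CJK_CHARS_PER_TOKEN = 2.5
LATIN_CHARS_PER_TOKEN = 4.0

def _is_cjk(char: str) -> bool:
    cp = ord(char)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x309F   # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
        or 0x3400 <= cp <= 0x4DBF   # CJK Extension A
        or 0xF900 <= cp <= 0xFAFF   # CJK Compatibility Ideographs
    )

def estimate_tokens(text: str) -> int:
    cjk = sum(1 for c in text if _is_cjk(c))
    return int(cjk / CJK_CHARS_PER_TOKEN + (len(text) - cjk) / LATIN_CHARS_PER_TOKEN)

def split_into_chunks(text: str, token_limit: int = 500) -> list[str]:
    """Accumulate estimated tokens per paragraph; flush when the limit is hit."""
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in text.split("\n\n"):
        para_tokens = estimate_tokens(para)
        if current and current_tokens + para_tokens > token_limit:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note that a single paragraph longer than token_limit still becomes its own oversized chunk; a real pipeline would split such paragraphs further (by sentence, for example).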
When to Use
- Building LLM pipelines that process Japanese/Chinese/Korean text
- Implementing chunk splitting for multilingual documents
- Estimating API costs for non-English content
- Any text metric (token count, cost, rate limit) using a fixed chars-per-token constant
Source
https://github.com/shimo4228/claude-code-learned-skills/blob/main/skills/cjk-aware-text-metrics/SKILL.md
Overview
Detects CJK characters by Unicode range and weights them separately when estimating tokens, instead of applying one fixed chars-per-token constant to all text. This reduces underestimation for Japanese, Chinese, and Korean text and improves chunking and cost estimates in multilingual LLM pipelines.
How This Skill Works
Uses Unicode ranges to detect CJK characters, then estimates tokens with CJK_CHARS_PER_TOKEN = 2.5 and LATIN_CHARS_PER_TOKEN = 4.0. Token-based chunking accumulates estimated tokens per paragraph instead of counting characters, ensuring chunk limits align with actual token usage.
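To make the mechanics concrete, here is the estimator restated as a standalone snippet with a mixed-script example (the sample sentence is made up for illustration):

```python
CJK_CHARS_PER_TOKEN = 2.5
LATIN_CHARS_PER_TOKEN = 4.0

def _is_cjk(char: str) -> bool:
    cp = ord(char)
    return (0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x309F
            or 0x30A0 <= cp <= 0x30FF or 0x3400 <= cp <= 0x4DBF
            or 0xF900 <= cp <= 0xFAFF)

def estimate_tokens(text: str) -> int:
    cjk = sum(1 for c in text if _is_cjk(c))
    return int(cjk / CJK_CHARS_PER_TOKEN + (len(text) - cjk) / LATIN_CHARS_PER_TOKEN)

mixed = "Tokyo is 東京 in Japanese."  # 2 CJK chars, 22 other chars
print(estimate_tokens(mixed))  # -> 6  (2/2.5 + 22/4.0 = 6.3, truncated)
```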
When to Use It
- Building LLM pipelines that process Japanese, Chinese, or Korean text
- Implementing chunk splitting for multilingual documents
- Estimating API costs for non-English content
- Any metric using a fixed chars-per-token constant in multilingual data
- Cost previews and rate-limit planning for mixed-language inputs
Quick Start
- Step 1: Identify the text to process and set a per-chunk token limit
- Step 2: Implement _is_cjk and estimate_tokens using the 2.5 and 4.0 constants
- Step 3: Replace char-based chunking with token-based accumulation in your pipeline
Best Practices
- Detect CJK characters in input text using Unicode ranges
- Use CJK_CHARS_PER_TOKEN = 2.5 and LATIN_CHARS_PER_TOKEN = 4.0
- Implement _is_cjk and estimate_tokens as shown
- Switch chunking to token-based accumulation rather than char-based limits
- Verify estimates with real multilingual samples and adjust constants if needed
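The last point can be automated: given real token counts from your tokenizer or provider, you can back-solve the implied CJK chars-per-token ratio. The `calibrate_cjk_ratio` helper below is a hypothetical sketch, not part of the extracted skill, and the sample measurements are invented:

```python
def calibrate_cjk_ratio(samples, latin_chars_per_token=4.0):
    """samples: (cjk_char_count, other_char_count, actual_token_count) tuples
    measured against a real tokenizer. Returns the average implied
    chars-per-token ratio for the CJK portion of each sample."""
    ratios = []
    for cjk_chars, other_chars, actual_tokens in samples:
        # Tokens attributable to CJK, after removing the Latin share
        cjk_tokens = actual_tokens - other_chars / latin_chars_per_token
        if cjk_tokens > 0:
            ratios.append(cjk_chars / cjk_tokens)
    return sum(ratios) / len(ratios)

# Invented measurement: 100 CJK chars + 40 Latin chars cost 50 actual tokens
print(calibrate_cjk_ratio([(100, 40, 50)]))  # 100 / (50 - 10) = 2.5
```

If the calibrated ratio differs meaningfully from 2.5, substitute it for CJK_CHARS_PER_TOKEN; ratios vary by tokenizer and by script mix.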
Example Use Cases
- Estimate tokens for a Chinese news article and compare to provider counts
- Process a Japanese technical manual and calibrate chunk boundaries
- Analyze English-Chinese mixed documents for API cost previews
- Chunk a Korean web article using token-based limits
- Budget tokens for multilingual chat transcripts across scripts