
cjk-aware-text-metrics

npx machina-cli add skill shimo4228/claude-code-learned-skills/cjk-aware-text-metrics --openclaw

CJK-Aware Text Metrics

Extracted: 2026-02-11
Context: Multilingual LLM pipelines where token estimation affects cost, chunking, or rate limits

Problem

Fixed chars-per-token constants (e.g., CHARS_PER_TOKEN = 4) assume Latin text. Japanese, Chinese, and Korean text averages roughly 2.5 characters per token, so a fixed divisor of 4 produces estimates about 40% low for CJK-heavy documents (actual token counts run ~60% above the estimate), skewing token counts, cost previews, and chunk sizing.

Solution

  1. Detect CJK characters by Unicode range:
def _is_cjk(char: str) -> bool:
    cp = ord(char)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x309F   # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
        or 0x3400 <= cp <= 0x4DBF   # CJK Extension A
        or 0xF900 <= cp <= 0xFAFF   # CJK Compatibility Ideographs
        or 0xAC00 <= cp <= 0xD7AF   # Hangul Syllables (needed for Korean)
        or 0x1100 <= cp <= 0x11FF   # Hangul Jamo
    )
  2. Weighted token estimation:
CJK_CHARS_PER_TOKEN = 2.5
LATIN_CHARS_PER_TOKEN = 4.0

def estimate_tokens(text: str) -> int:
    cjk_count = sum(1 for c in text if _is_cjk(c))
    other_count = len(text) - cjk_count
    return int(cjk_count / CJK_CHARS_PER_TOKEN + other_count / LATIN_CHARS_PER_TOKEN)
  3. Chunk splitting must use token-based accumulation (not char-based):
# BAD: char_limit = token_limit * FIXED_CONSTANT
# GOOD: accumulate estimated tokens per paragraph
current_tokens += estimate_tokens(para)
if current_tokens > token_limit:
    flush_chunk()
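The three steps above can be combined into a small paragraph-level chunker. This is a minimal sketch: the function name chunk_paragraphs and the paragraph-joining convention are illustrative, not part of the skill.

```python
CJK_CHARS_PER_TOKEN = 2.5
LATIN_CHARS_PER_TOKEN = 4.0

def _is_cjk(char: str) -> bool:
    cp = ord(char)
    return (
        0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x309F
        or 0x30A0 <= cp <= 0x30FF or 0x3400 <= cp <= 0x4DBF
        or 0xF900 <= cp <= 0xFAFF or 0xAC00 <= cp <= 0xD7AF
    )

def estimate_tokens(text: str) -> int:
    cjk = sum(1 for c in text if _is_cjk(c))
    return int(cjk / CJK_CHARS_PER_TOKEN + (len(text) - cjk) / LATIN_CHARS_PER_TOKEN)

def chunk_paragraphs(paragraphs: list[str], token_limit: int) -> list[str]:
    """Greedily pack paragraphs into chunks whose estimated size stays under token_limit."""
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        para_tokens = estimate_tokens(para)
        # Flush before this paragraph would push the chunk over the limit.
        if current and current_tokens + para_tokens > token_limit:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph larger than token_limit still becomes its own (oversized) chunk here; splitting inside a paragraph is left to the caller.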

When to Use

  • Building LLM pipelines that process Japanese/Chinese/Korean text
  • Implementing chunk splitting for multilingual documents
  • Estimating API costs for non-English content
  • Any text metric (token count, cost, rate limit) using a fixed chars-per-token constant

Source

https://github.com/shimo4228/claude-code-learned-skills/blob/main/skills/cjk-aware-text-metrics/SKILL.md

Overview

Detects CJK characters by Unicode range and applies per-script chars-per-token weights in place of a single fixed constant in multilingual LLM pipelines. This reduces underestimation for Japanese, Chinese, and Korean text and improves chunking and cost estimates.

How This Skill Works

Uses Unicode ranges to detect CJK characters, then estimates tokens with CJK_CHARS_PER_TOKEN = 2.5 and LATIN_CHARS_PER_TOKEN = 4.0. Token-based chunking accumulates estimated tokens per paragraph instead of counting characters, ensuring chunk limits align with actual token usage.
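To see the effect, compare a fixed divisor of 4 against the weighted estimate on pure-Japanese text. This is a sketch using the skill's constants; the sample string is arbitrary, and the detector is abbreviated to the ranges the sample needs.

```python
def _is_cjk(char: str) -> bool:
    cp = ord(char)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x309F   # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
    )

def estimate_tokens(text: str) -> int:
    cjk = sum(1 for c in text if _is_cjk(c))
    return int(cjk / 2.5 + (len(text) - cjk) / 4.0)

text = "日本語の文章" * 5          # 30 CJK characters
fixed = int(len(text) / 4)        # fixed constant: 7 tokens (truncated from 7.5)
weighted = estimate_tokens(text)  # weighted: 12 tokens, ~60% more than 7.5
```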

When to Use It

  • Building LLM pipelines that process Japanese, Chinese, or Korean text
  • Implementing chunk splitting for multilingual documents
  • Estimating API costs for non-English content
  • Any metric using a fixed chars-per-token constant in multilingual data
  • Cost previews and rate limit planning for mixed language inputs

Quick Start

  1. Identify the text to process and set a per-chunk token limit
  2. Implement _is_cjk and estimate_tokens using the 2.5 and 4.0 constants
  3. Replace char-based chunking with token-based accumulation in your pipeline
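The same weighted estimate also drives cost previews. In the sketch below, PRICE_PER_1K_TOKENS and preview_cost are hypothetical; substitute your provider's actual pricing.

```python
def _is_cjk(char: str) -> bool:
    cp = ord(char)
    return (
        0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x309F
        or 0x30A0 <= cp <= 0x30FF or 0xAC00 <= cp <= 0xD7AF
    )

def estimate_tokens(text: str) -> int:
    cjk = sum(1 for c in text if _is_cjk(c))
    return int(cjk / 2.5 + (len(text) - cjk) / 4.0)

PRICE_PER_1K_TOKENS = 0.002  # hypothetical rate; check your provider's pricing

def preview_cost(text: str) -> float:
    """Rough input-cost preview based on the weighted token estimate."""
    return estimate_tokens(text) / 1000 * PRICE_PER_1K_TOKENS
```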

Best Practices

  • Detect CJK ranges in input text using Unicode
  • Use CJK_CHARS_PER_TOKEN = 2.5 and LATIN_CHARS_PER_TOKEN = 4.0
  • Implement _is_cjk and estimate_tokens as shown
  • Switch chunking to token-based accumulation rather than char-based limits
  • Verify estimates with real multilingual samples and adjust constants if needed
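The last point can be partly automated: given sample texts with true token counts from your provider's tokenizer, derive an observed chars-per-token value instead of trusting the 2.5/4.0 defaults. The helper below is a hypothetical sketch, not part of the skill.

```python
def calibrate_chars_per_token(samples: list[tuple[str, int]]) -> float:
    """samples: (text, true_token_count) pairs of mostly single-script text.
    Returns the observed average characters per token for that script."""
    total_chars = sum(len(text) for text, _ in samples)
    total_tokens = sum(count for _, count in samples)
    return total_chars / total_tokens
```

Run it separately on Latin-only and CJK-only samples, then use the two results as LATIN_CHARS_PER_TOKEN and CJK_CHARS_PER_TOKEN.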

Example Use Cases

  • Estimate tokens for a Chinese news article and compare to provider counts
  • Process a Japanese technical manual and calibrate chunk boundaries
  • Analyze English-Chinese mixed documents for API cost previews
  • Chunk a Korean web article using token-based limits
  • Budget tokens for multilingual chat transcripts across scripts
