
cjk-aware-text-metrics

npx machina-cli add skill shimo4228/claude-code-learned-skills/cjk-aware-text-metrics --openclaw

CJK-Aware Text Metrics

Extracted: 2026-02-11
Context: Multilingual LLM pipelines where token estimation affects cost, chunking, or rate limits

Problem

Fixed chars-per-token constants (e.g., CHARS_PER_TOKEN = 4) assume Latin text. Japanese, Chinese, and Korean text averages roughly 2.5 characters per token, so a fixed divisor of 4 produces estimates about 40% low for CJK-heavy documents (actual token counts run ~60% above the estimate), skewing token counts, cost previews, and chunk sizing.

Solution

  1. Detect CJK characters by Unicode range:
def _is_cjk(char: str) -> bool:
    cp = ord(char)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x309F   # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
        or 0x3400 <= cp <= 0x4DBF   # CJK Extension A
        or 0xF900 <= cp <= 0xFAFF   # CJK Compatibility Ideographs
        or 0xAC00 <= cp <= 0xD7AF   # Hangul Syllables (needed for Korean)
        or 0x1100 <= cp <= 0x11FF   # Hangul Jamo
    )
  2. Weighted token estimation:
CJK_CHARS_PER_TOKEN = 2.5
LATIN_CHARS_PER_TOKEN = 4.0

def estimate_tokens(text: str) -> int:
    cjk_count = sum(1 for c in text if _is_cjk(c))
    other_count = len(text) - cjk_count
    return int(cjk_count / CJK_CHARS_PER_TOKEN + other_count / LATIN_CHARS_PER_TOKEN)
  3. Chunk splitting must use token-based accumulation (not char-based):
# BAD: char_limit = token_limit * FIXED_CONSTANT
# GOOD: accumulate estimated tokens per paragraph
current_tokens += estimate_tokens(para)
if current_tokens > token_limit:
    flush_chunk()
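The three steps above can be combined into a small paragraph-level chunker. This is a minimal sketch: the function name chunk_paragraphs and the paragraph-joining convention are illustrative, not part of the skill.

```python
CJK_CHARS_PER_TOKEN = 2.5
LATIN_CHARS_PER_TOKEN = 4.0

def _is_cjk(char: str) -> bool:
    cp = ord(char)
    return (
        0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x309F
        or 0x30A0 <= cp <= 0x30FF or 0x3400 <= cp <= 0x4DBF
        or 0xF900 <= cp <= 0xFAFF or 0xAC00 <= cp <= 0xD7AF
    )

def estimate_tokens(text: str) -> int:
    cjk = sum(1 for c in text if _is_cjk(c))
    return int(cjk / CJK_CHARS_PER_TOKEN + (len(text) - cjk) / LATIN_CHARS_PER_TOKEN)

def chunk_paragraphs(paragraphs: list[str], token_limit: int) -> list[str]:
    """Greedily pack paragraphs into chunks whose estimated size stays under token_limit."""
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        para_tokens = estimate_tokens(para)
        # Flush before this paragraph would push the chunk over the limit.
        if current and current_tokens + para_tokens > token_limit:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph larger than token_limit still becomes its own (oversized) chunk here; splitting inside a paragraph is left to the caller.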

When to Use

  • Building LLM pipelines that process Japanese/Chinese/Korean text
  • Implementing chunk splitting for multilingual documents
  • Estimating API costs for non-English content
  • Any text metric (token count, cost, rate limit) using a fixed chars-per-token constant

Source

https://github.com/shimo4228/claude-code-learned-skills/blob/main/skills/cjk-aware-text-metrics/SKILL.md

Overview

Detects CJK characters by Unicode range and applies per-script chars-per-token weights in place of a single fixed constant in multilingual LLM pipelines. This reduces underestimation for Japanese, Chinese, and Korean text and improves chunking and cost estimates.

How This Skill Works

Uses Unicode ranges to detect CJK characters, then estimates tokens with CJK_CHARS_PER_TOKEN = 2.5 and LATIN_CHARS_PER_TOKEN = 4.0. Token-based chunking accumulates estimated tokens per paragraph instead of counting characters, ensuring chunk limits align with actual token usage.
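To see the effect, compare a fixed divisor of 4 against the weighted estimate on pure-Japanese text. This is a sketch using the skill's constants; the sample string is arbitrary, and the detector is abbreviated to the ranges the sample needs.

```python
def _is_cjk(char: str) -> bool:
    cp = ord(char)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x309F   # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
    )

def estimate_tokens(text: str) -> int:
    cjk = sum(1 for c in text if _is_cjk(c))
    return int(cjk / 2.5 + (len(text) - cjk) / 4.0)

text = "日本語の文章" * 5          # 30 CJK characters
fixed = int(len(text) / 4)        # fixed constant: 7 tokens (truncated from 7.5)
weighted = estimate_tokens(text)  # weighted: 12 tokens, ~60% more than 7.5
```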

When to Use It

  • Building LLM pipelines that process Japanese, Chinese, or Korean text
  • Implementing chunk splitting for multilingual documents
  • Estimating API costs for non-English content
  • Any metric using a fixed chars-per-token constant in multilingual data
  • Cost previews and rate limit planning for mixed language inputs

Quick Start

  1. Identify the text to process and set a per-chunk token limit
  2. Implement _is_cjk and estimate_tokens using the 2.5 and 4.0 constants
  3. Replace char-based chunking with token-based accumulation in your pipeline
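The same weighted estimate also drives cost previews. In the sketch below, PRICE_PER_1K_TOKENS and preview_cost are hypothetical; substitute your provider's actual pricing.

```python
def _is_cjk(char: str) -> bool:
    cp = ord(char)
    return (
        0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x309F
        or 0x30A0 <= cp <= 0x30FF or 0xAC00 <= cp <= 0xD7AF
    )

def estimate_tokens(text: str) -> int:
    cjk = sum(1 for c in text if _is_cjk(c))
    return int(cjk / 2.5 + (len(text) - cjk) / 4.0)

PRICE_PER_1K_TOKENS = 0.002  # hypothetical rate; check your provider's pricing

def preview_cost(text: str) -> float:
    """Rough input-cost preview based on the weighted token estimate."""
    return estimate_tokens(text) / 1000 * PRICE_PER_1K_TOKENS
```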

Best Practices

  • Detect CJK ranges in input text using Unicode
  • Use CJK_CHARS_PER_TOKEN = 2.5 and LATIN_CHARS_PER_TOKEN = 4.0
  • Implement _is_cjk and estimate_tokens as shown
  • Switch chunking to token-based accumulation rather than char-based limits
  • Verify estimates with real multilingual samples and adjust constants if needed
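The last point can be partly automated: given sample texts with true token counts from your provider's tokenizer, derive an observed chars-per-token value instead of trusting the 2.5/4.0 defaults. The helper below is a hypothetical sketch, not part of the skill.

```python
def calibrate_chars_per_token(samples: list[tuple[str, int]]) -> float:
    """samples: (text, true_token_count) pairs of mostly single-script text.
    Returns the observed average characters per token for that script."""
    total_chars = sum(len(text) for text, _ in samples)
    total_tokens = sum(count for _, count in samples)
    return total_chars / total_tokens
```

Run it separately on Latin-only and CJK-only samples, then use the two results as LATIN_CHARS_PER_TOKEN and CJK_CHARS_PER_TOKEN.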

Example Use Cases

  • Estimate tokens for a Chinese news article and compare to provider counts
  • Process a Japanese technical manual and calibrate chunk boundaries
  • Analyze English-Chinese mixed documents for API cost previews
  • Chunk a Korean web article using token-based limits
  • Budget tokens for multilingual chat transcripts across scripts
