Chunking Advisor
Scannednpx machina-cli add skill davicqueiroz/claude-rag-skills/chunking-advisor --openclawChunking Advisor Skill
Analyze your documents and recommend optimal chunking strategies based on content type, use case, and embedding model.
When to Use
Use /chunking-advisor when:
- Setting up a new RAG pipeline and unsure about chunk configuration
- Experiencing poor retrieval quality (chunks too big or small)
- Working with diverse document types (PDFs, code, tables, etc.)
- Optimizing an existing chunking strategy
Chunking Strategy Decision Tree
What type of content?
│
├── Technical Documentation / Code
│ └── Use: Semantic chunking by function/class/section
│ Chunk size: 500-1000 tokens
│ Overlap: 50-100 tokens
│
├── Legal / Contracts
│ └── Use: Hierarchical chunking (clause → section → document)
│ Chunk size: 300-500 tokens
│ Overlap: 100-150 tokens (preserve clause boundaries)
│
├── Product Catalogs
│ └── Use: Fixed per-product chunks
│ Chunk size: One product = one chunk
│ Overlap: None (products are atomic)
│
├── FAQ / Q&A
│ └── Use: Question-answer pairs as chunks
│ Chunk size: Variable (complete Q&A)
│ Overlap: None
│
├── Long-form Articles / Blog Posts
│ └── Use: Semantic chunking by paragraph/section
│ Chunk size: 800-1200 tokens
│ Overlap: 100-200 tokens
│
├── Tables / Structured Data
│ └── Use: Row-based or section-based chunking
│ Preserve headers in each chunk
│ Chunk size: 10-50 rows per chunk
│
└── Mixed Content
└── Use: Document-aware chunking
Different strategies per section type
Maintain parent-child relationships
Chunking Strategies Explained
1. Fixed-Size Chunking
Best for: Homogeneous content, quick implementation Pros: Simple, predictable Cons: May split sentences/paragraphs awkwardly
# Implementation
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
chunk_size=1000,
chunk_overlap=150,
separator="\n\n" # Prefer paragraph breaks
)
chunks = splitter.split_text(document)
Recommended settings by embedding model:
| Model | Optimal Chunk Size | Max Chunk Size |
|---|---|---|
| text-embedding-3-small | 500-800 tokens | 8191 tokens |
| text-embedding-3-large | 800-1200 tokens | 8191 tokens |
| voyage-2 | 500-1000 tokens | 4096 tokens |
| Cohere embed-v3 | 300-500 tokens | 512 tokens |
2. Semantic Chunking
Best for: Technical docs, articles, varied content Pros: Respects content boundaries, better retrieval Cons: More complex, requires NLP
# Implementation using sentence boundaries
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=150,
separators=["\n\n", "\n", ". ", "! ", "? ", ", ", " ", ""]
)
chunks = splitter.split_text(document)
Advanced: Embedding-based semantic chunking
# Split when semantic similarity drops
from sentence_transformers import SentenceTransformer
import numpy as np
def semantic_chunk(sentences, model, threshold=0.5):
embeddings = model.encode(sentences)
chunks = []
current_chunk = [sentences[0]]
for i in range(1, len(sentences)):
similarity = np.dot(embeddings[i], embeddings[i-1])
if similarity < threshold:
chunks.append(" ".join(current_chunk))
current_chunk = []
current_chunk.append(sentences[i])
chunks.append(" ".join(current_chunk))
return chunks
3. Hierarchical Chunking
Best for: Legal docs, manuals, structured documents Pros: Preserves document structure, enables multi-level retrieval Cons: Complex implementation
# Parent-child chunk structure
class HierarchicalChunker:
def chunk(self, document):
# Level 1: Sections
sections = self.split_by_headers(document)
chunks = []
for section in sections:
# Level 2: Paragraphs within sections
paragraphs = self.split_paragraphs(section.content)
for para in paragraphs:
chunks.append({
"content": para,
"metadata": {
"parent_section": section.title,
"document": document.name,
"level": "paragraph"
}
})
# Also store section summary for high-level queries
chunks.append({
"content": section.summary,
"metadata": {
"document": document.name,
"level": "section"
}
})
return chunks
4. Document-Specific Chunking
Best for: Mixed document types Pros: Optimal for each content type Cons: Requires content detection
def smart_chunk(document):
doc_type = detect_document_type(document)
if doc_type == "code":
return chunk_by_function(document)
elif doc_type == "table":
return chunk_table_by_rows(document, rows_per_chunk=20)
elif doc_type == "legal":
return chunk_by_clause(document)
elif doc_type == "faq":
return chunk_qa_pairs(document)
else:
return semantic_chunk(document)
Analysis Process
When the user invokes /chunking-advisor, follow this process:
Step 1: Understand the Use Case
Ask clarifying questions:
- What types of documents will you be indexing? (PDFs, code, web pages, etc.)
- What kinds of questions will users ask? (Factual, analytical, comparative)
- What embedding model are you using?
- What's your latency budget? (Real-time vs batch)
- Do you have existing chunks that perform poorly?
Step 2: Analyze Sample Documents
If the user provides sample documents:
- Identify content types (prose, code, tables, lists)
- Measure average sentence/paragraph length
- Detect natural section boundaries
- Note any special formatting
Step 3: Provide Recommendations
## Chunking Recommendation for [Use Case]
### Primary Strategy: [Strategy Name]
Based on your [document types] and [query patterns], I recommend:
**Configuration:**
- Chunk size: X tokens
- Overlap: Y tokens (Z%)
- Separator hierarchy: ["\n\n", "\n", ". "]
**Rationale:**
- Your documents have [characteristic], which benefits from [approach]
- Your queries are [type], which need [chunk property]
### Implementation
```python
[Ready-to-use code snippet]
Metadata to Preserve
source: Document filenamepage: Page number (for PDFs)section: Section headerchunk_index: Position in document
Testing Your Chunking
- Sample 10 representative queries
- Check if relevant info appears in single chunks
- Verify chunks don't cut off mid-sentence
- Test retrieval precision
Warning Signs of Bad Chunking
- Same document appears multiple times in results (too small)
- Irrelevant content pollutes results (too large)
- Answers are truncated (boundary issues)
## Common Mistakes to Warn About
### 1. Chunks Too Large
**Symptoms**: Irrelevant content retrieved, high noise
**Fix**: Reduce chunk size, add reranking
### 2. Chunks Too Small
**Symptoms**: Missing context, same doc retrieved multiple times
**Fix**: Increase chunk size, add parent document retrieval
### 3. No Overlap
**Symptoms**: Information at boundaries is lost
**Fix**: Add 10-20% overlap
### 4. Ignoring Document Structure
**Symptoms**: Chunks split tables, code blocks, lists
**Fix**: Use structure-aware chunking
### 5. Same Strategy for All Content
**Symptoms**: Some content types perform worse
**Fix**: Content-specific chunking strategies
### 6. Not Preserving Metadata
**Symptoms**: Can't cite sources, no filtering possible
**Fix**: Always store source, page, section metadata
## Overlap Calculation
```python
def calculate_overlap(chunk_size: int, content_type: str) -> int:
"""
Calculate recommended overlap based on chunk size and content.
"""
overlap_ratios = {
"technical": 0.15, # 15% - preserve code context
"legal": 0.20, # 20% - preserve clause boundaries
"narrative": 0.10, # 10% - prose flows naturally
"tabular": 0.0, # 0% - tables are atomic
"default": 0.12 # 12% - balanced
}
ratio = overlap_ratios.get(content_type, overlap_ratios["default"])
return int(chunk_size * ratio)
Reference Resources
For detailed chunking guidance:
- Chunking strategies overview: https://app.ailog.fr/en/blog/guides/chunking-strategies
- Semantic chunking: https://app.ailog.fr/en/blog/guides/semantic-chunking
- Hierarchical chunking: https://app.ailog.fr/en/blog/guides/hierarchical-chunking
- Fixed-size chunking: https://app.ailog.fr/en/blog/guides/fixed-size-chunking
- Parent document retrieval: https://app.ailog.fr/en/blog/guides/parent-document-retrieval
Quick Reference Card
| Document Type | Strategy | Size | Overlap | Key Consideration |
|---|---|---|---|---|
| Code | By function/class | 500-1000 | 50-100 | Preserve syntax |
| Legal | Hierarchical | 300-500 | 100-150 | Clause boundaries |
| FAQ | Q&A pairs | Variable | 0 | Complete pairs |
| Articles | Semantic | 800-1200 | 100-200 | Paragraph integrity |
| Tables | Row-based | 10-50 rows | 0 + headers | Include headers |
| Manuals | Section-based | 600-1000 | 100 | Step integrity |
| Chat logs | By conversation | Variable | 0 | Timestamp groups |
Source
git clone https://github.com/davicqueiroz/claude-rag-skills/blob/main/chunking-advisor/SKILL.mdView on GitHub Overview
Analyzes your documents and recommends optimal chunking strategies based on content type, use case, and embedding model. This helps maximize retrieval quality across diverse document types and RAG pipelines.
How This Skill Works
It evaluates content type, use case, and the embedding model to propose chunking settings—size, overlap, and strategy—for each category (Technical, Legal, Product Catalogs, FAQ, Long-form). It includes a decision tree and practical code examples to implement the recommended approach.
When to Use It
- Setting up a new RAG pipeline and unsure about chunk configuration
- Experiencing poor retrieval quality (chunks too big or small)
- Working with diverse document types (PDFs, code, tables, etc.)
- Optimizing an existing chunking strategy
- Adapting chunking for a new embedding model and token budgets
Quick Start
- Step 1: Identify the document types and the embedding model you will use
- Step 2: Pick a strategy from the Chunking Strategy Decision Tree and configure chunk_size and overlap (e.g., 500-1000 tokens with 50-100 overlap for technical docs)
- Step 3: Implement with LangChain splitters (e.g., CharacterTextSplitter or RecursiveCharacterTextSplitter) and validate retrieval quality
Best Practices
- Identify content type and use case first using the decision tree
- Choose a chunking strategy per content type (Fixed, Semantic, Hierarchical)
- Preserve structure where it matters (headers for tables, clause boundaries for contracts)
- Tune chunk_size and chunk_overlap to fit embedding model limits
- Test retrieval quality and iterate until results improve
Example Use Cases
- Technical Documentation: semantic chunking by function with 1000 token chunks and 150 token overlap
- Legal contracts: hierarchical chunking (clause → section → document) with 300-500 token chunks and 100-150 overlap
- Product catalogs: fixed per-product chunks with no overlap
- FAQ/Q&A: chunks built from complete Q&A units
- Long-form articles: semantic chunks by paragraph/section with 800-1200 tokens and 100-200 overlap