hanx-knowledge-base
Hanx Knowledge Base & RAG Skill
Install: npx machina-cli add skill wrm3/ai_project_template/hanx-knowledge-base --openclaw
Build searchable knowledge bases from diverse document types with advanced RAG (Retrieval-Augmented Generation) capabilities.
Overview
This skill provides a complete document ingestion and search pipeline:
- Document Processing: Extract text from PDF, DOCX, MD, TXT, CSV, HTML, JSON
- Intelligent Chunking: Split documents with semantic awareness
- Vector Embeddings: Generate embeddings using local or OpenAI models
- Persistent Storage: Store in ChromaDB vector database
- Semantic Search: Find relevant content using natural language queries
- Category Management: Organize documents by topic
When to Use This Skill
Automatic Triggers
- User says "ingest this document" or "add to knowledge base"
- User requests "search my documents" or "find information about"
- User mentions "build knowledge base from" a directory
- User wants to "query the knowledge base"
Manual Invocation
# Add single document
python scripts/knowledge_base.py add document.pdf --category technical
# Add directory of documents
python scripts/knowledge_base.py add-batch ./documents --category research
# Search knowledge base
python scripts/search_kb.py "What is RAG?" --limit 5
# List all documents
python scripts/knowledge_base.py list
Core Capabilities
1. Document Ingestion
Supported Formats:
- PDF (.pdf): Full text extraction with page numbers
- Word (.docx, .doc): Text, tables, and document properties
- Markdown (.md): With YAML frontmatter support
- Text (.txt): Plain text files
- CSV (.csv): Structured data with headers
- HTML (.html, .htm): Web pages with metadata extraction
- JSON (.json): Structured JSON data
Features:
- Automatic format detection
- Metadata extraction (author, title, dates, etc.)
- Batch processing for directories
- Category-based organization
- Progress tracking
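Format detection is driven by the file extension. A minimal sketch of the idea, with a hypothetical extension-to-handler mapping (the real table lives inside document_processor):
from pathlib import Path

# Illustrative extension-to-handler map; the handler names are
# hypothetical, not the skill's actual internals
HANDLERS = {
    ".pdf": "extract_pdf",
    ".docx": "extract_docx",
    ".md": "extract_markdown",
    ".txt": "extract_text",
    ".csv": "extract_csv",
    ".html": "extract_html",
    ".json": "extract_json",
}

def detect_format(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix not in HANDLERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return HANDLERS[suffix]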
2. Intelligent Chunking
Chunking Strategy:
Document
↓
[Extract Text]
↓
[Split by Semantic Boundaries]
↓
[Chunks with Overlap]
↓
Vector Embeddings
Features:
- Recursive character splitting
- Semantic boundary detection (paragraphs, sentences)
- Configurable chunk size (default: 1000 characters)
- Overlap between chunks (default: 200 characters)
- Metadata preservation per chunk
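To make the strategy concrete, here is a dependency-free sketch of recursive splitting at semantic boundaries. Overlap handling is omitted for brevity, and the production implementation is rag_utils.DocumentChunker, not this function:
def split_recursive(text, chunk_size=1000,
                    separators=("\n\n", "\n", ". ", " ")):
    # Base case: the text already fits in one chunk
    if len(text) <= chunk_size:
        return [text]
    for i, sep in enumerate(separators):
        if sep in text:
            pieces = text.split(sep)
            chunks, current = [], ""
            # Greedily merge pieces back together up to chunk_size
            for piece in pieces:
                candidate = (current + sep + piece) if current else piece
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = piece
            if current:
                chunks.append(current)
            # Recurse with finer separators on any chunk still too long
            return [sub for c in chunks
                    for sub in split_recursive(c, chunk_size, separators[i + 1:])]
    # No separator left: fall back to a hard character split
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]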
3. Vector Embeddings
Local Embeddings (Default - FREE):
- Model: all-MiniLM-L6-v2 (SentenceTransformers)
- Dimensions: 384
- Speed: Fast (local inference)
- Cost: $0 (no API costs)
- Privacy: Data stays local
Alternative Models:
- all-mpnet-base-v2: Better quality, 768 dimensions
- multi-qa-MiniLM-L6-cos-v1: Optimized for Q&A
- OpenAI embeddings: Highest quality (requires API key)
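Generating these local embeddings takes only a few lines with sentence-transformers; the sample sentences below are placeholders:
from sentence_transformers import SentenceTransformer

# First call downloads the model (~100 MB); afterwards inference is fully local
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["What is RAG?", "ChromaDB stores vector embeddings."])
print(vectors.shape)  # (2, 384): one 384-dimensional vector per sentence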
4. Vector Storage (ChromaDB)
Features:
- Persistent Storage: Documents saved to disk
- Metadata Filtering: Filter by category, author, date
- Multiple Distance Metrics: Cosine, L2, Inner Product
- Efficient Indexing: Fast similarity search
- CRUD Operations: Add, search, update, delete
Storage Structure:
knowledge_base/
├── documents/ # Original documents (categorized)
│ ├── technical/
│ ├── business/
│ ├── reference/
│ └── ...
├── metadata/ # Document metadata (JSON)
├── vector_store/ # ChromaDB vector database
└── exports/ # Exported metadata
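For orientation, the same operations look roughly like this against ChromaDB directly. The collection name and ids are illustrative; the skill wraps this in rag_utils.VectorStore:
import chromadb

# Persistent client: the index survives restarts under this directory
client = chromadb.PersistentClient(path="./knowledge_base/vector_store")
collection = client.get_or_create_collection(
    name="documents",                   # illustrative name
    metadata={"hnsw:space": "cosine"},  # select the cosine distance metric
)

# Add one chunk; ids must be unique
collection.add(
    ids=["doc1_chunk0"],
    documents=["Vector databases store embeddings efficiently."],
    metadatas=[{"category": "technical", "source": "chromadb_guide.pdf"}],
)

# Similarity search with a metadata filter
hits = collection.query(
    query_texts=["vector database implementation"],
    n_results=5,
    where={"category": "technical"},
)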
5. Semantic Search
Search Capabilities:
- Natural Language Queries: "How do I implement RAG?"
- Similarity Scoring: Relevance scores (0.0 to 1.0)
- Metadata Filtering: Filter by category, file type, date
- Top-K Retrieval: Get most relevant chunks
- Contextual Results: Full text + metadata + score
Example Search:
$ python search_kb.py "vector database implementation"
================================================================================
Result 1
================================================================================
Relevance: 0.8742 ████████████████████
Category: technical
Source: chromadb_guide.pdf
Content:
--------------------------------------------------------------------------------
Vector databases are specialized databases designed to store and search
vector embeddings efficiently. ChromaDB provides an open-source solution
with features like persistent storage, metadata filtering, and cosine
similarity search. Implementation involves three key steps: 1) Generate
embeddings using a model like SentenceTransformers, 2) Store embeddings
in the vector database, 3) Query using similarity search.
--------------------------------------------------------------------------------
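The relevance figure shown above is derived from the raw vector distance. One common conversion for cosine distance, shown here as an assumption about the scoring rather than a confirmed formula:
# ChromaDB's cosine metric returns a *distance*; inverting it gives the
# familiar 0.0-1.0 relevance score (assumed conversion, for intuition only)
def relevance(cosine_distance: float) -> float:
    return max(0.0, 1.0 - cosine_distance)

print(relevance(0.1258))  # 0.8742, matching Result 1 above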
Workflow Examples
Example 1: Build Knowledge Base from Directory
User: "Build a knowledge base from all PDF files in ./research"
Skill Actions:
- Scan directory for PDF files
- Process each PDF (extract text + metadata)
- Chunk documents intelligently
- Generate embeddings for all chunks
- Store in ChromaDB vector database
- Report statistics
Output:
================================================================================
Knowledge Base Batch Ingestion
================================================================================
Directory: ./research
Category: research
Recursive: True
File Filter: .pdf
[1/15] Processing: machine_learning_survey.pdf
[KB] Adding document: machine_learning_survey.pdf
[KB] Category: research
[KB] Processing document...
[KB] Metadata saved: machine_learning_survey.json
[KB] Adding to RAG system...
[KB] ✅ Document added successfully in 2.34s
[KB] Document ID: doc_1730123456789
[2/15] Processing: neural_networks_intro.pdf
...
[15/15] Processing: transformer_architecture.pdf
================================================================================
BATCH INGESTION COMPLETE
================================================================================
Total Files: 15
Successfully Added: 15
Failed: 0
Total Time: 45.2s
Average Time per Document: 3.0s
Knowledge base ready for search!
Example 2: Semantic Search with Filters
User: "Search my technical documents for information about embeddings with minimum relevance 0.7"
Command:
python scripts/search_kb.py "embeddings" --category technical --min-score 0.7 --limit 10
Output:
================================================================================
Knowledge Base Search
================================================================================
Knowledge Base: ./data/knowledge_base
Query: "embeddings"
Category Filter: technical
Max Results: 10
Min Score: 0.7
[SEARCH] Searching knowledge base...
[SEARCH] Found 8 results
================================================================================
Result 1
================================================================================
Relevance: 0.8921 ██████████████████
Category: technical
Source: rag_implementation_guide.md
Content:
--------------------------------------------------------------------------------
Embeddings are dense vector representations of text that capture semantic
meaning. Modern embedding models like OpenAI's text-embedding-3-small or
SentenceTransformers produce vectors of 384 to 3072 dimensions. These
embeddings enable semantic search by measuring similarity through cosine
distance or dot product calculations.
--------------------------------------------------------------------------------
[... more results ...]
Example 3: Document Management
List all documents:
python scripts/knowledge_base.py list --category technical
Output:
Knowledge Base Documents
Total: 23 documents in 'technical' category
1. rag_implementation_guide.md (45.2 KB)
Modified: 2025-11-01T10:30:45
Metadata: {"author": "John Doe", "topics": ["RAG", "embeddings"]}
2. chromadb_setup.pdf (1.2 MB)
Modified: 2025-10-28T14:22:10
Metadata: {"author": "Tech Team", "version": "2.0"}
[... 21 more documents ...]
Remove document:
python scripts/knowledge_base.py remove chromadb_setup.pdf --category technical
Export metadata:
python scripts/knowledge_base.py export --output kb_metadata.json
Technical Specifications
Embedding Models Comparison
| Model | Dimensions | Speed | Quality | Cost | Use Case |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | Free | Default, general purpose |
| all-mpnet-base-v2 | 768 | Medium | Better | Free | Higher quality needed |
| text-embedding-3-small | 1536 | API | Best | $0.020/1M tokens | Production, highest quality |
| text-embedding-3-large | 3072 | API | Superior | $0.130/1M tokens | Critical applications |
Performance Characteristics
Document Ingestion (1000-page PDF):
- Text extraction: ~5 seconds
- Chunking: <1 second
- Embedding generation: ~10 seconds (local), ~3 seconds (API)
- Database storage: ~2 seconds
- Total: ~18 seconds (local), ~11 seconds (API)
Search Performance:
- Query embedding: ~100ms (local), ~200ms (API)
- Vector search (10K chunks): ~50ms
- Result formatting: <10ms
- Total: ~160ms (local), ~260ms (API)
Storage Requirements (per 1000 documents):
- Original documents: ~500 MB (varies by type)
- ChromaDB vectors: ~100 MB
- Metadata: ~10 MB
- Total: ~610 MB
Chunking Configuration
Default Settings:
CHUNK_SIZE = 1000 # characters per chunk
CHUNK_OVERLAP = 200 # overlap between chunks
SEPARATORS = [
"\n\n", # Paragraphs
"\n", # Lines
". ", # Sentences
" ", # Words
"" # Characters
]
Customization:
from rag_utils import DocumentChunker
chunker = DocumentChunker(
chunk_size=1500, # Larger chunks for technical docs
chunk_overlap=300, # More overlap for context
separators=["\n\n", "\n", ". "] # Custom separators
)
Usage Instructions
Setup (One-time)
- Install Dependencies:
cd .claude/skills/hanx-knowledge-base
pip install -r scripts/requirements.txt
- Verify Installation:
python scripts/rag_utils.py # Runs test
python scripts/document_processor.py # Tests processors
- Optional: Configure OpenAI Embeddings:
# Create .env file
echo "OPENAI_API_KEY=sk-your-key-here" > .env
Basic Usage
Initialize Knowledge Base:
from knowledge_base import KnowledgeBase
kb = KnowledgeBase("./data/my_kb")
Add Documents:
# Single document
kb.add_document("document.pdf", category="technical")
# Batch add from directory
kb.add_documents_batch(
"./documents",
category="research",
recursive=True,
file_extensions=['.pdf', '.docx', '.md']
)
Search:
results = kb.search("What is RAG?", limit=5)
for result in results:
print(f"Score: {result.score:.4f}")
print(f"Text: {result.text[:200]}...")
print(f"Source: {result.metadata['source']}")
Advanced Usage
Custom RAG System:
from rag_utils import VectorStore, LocalEmbeddings, DocumentChunker, RAGSystem
# Configure components
embeddings = LocalEmbeddings(model_name="all-mpnet-base-v2")
vector_store = VectorStore(
persist_directory="./my_vector_db",
embedding_function=embeddings
)
chunker = DocumentChunker(chunk_size=1500, chunk_overlap=300)
# Create RAG system
rag = RAGSystem(vector_store=vector_store, chunker=chunker)
# Use it
rag.add_text("Document content here...", metadata={"source": "custom.txt"})
results = rag.query("search query", limit=10)
Metadata Filtering:
# Search only recent documents
results = kb.search(
"machine learning",
limit=5,
metadata_filter={"category": "research", "year": 2025}
)
Programmatic Document Processing:
from document_processor import process_document, process_directory
# Process single document
content, metadata = process_document("document.pdf")
# Process directory
results = process_directory(
"./documents",
recursive=True,
file_extensions=['.pdf', '.md']
)
for content, metadata in results:
print(f"Processed: {metadata['file_name']}")
print(f"Length: {len(content)} characters")
Best Practices
Document Organization
Categories:
- ✅ Use meaningful categories (technical, business, reference)
- ✅ Keep category names lowercase and simple
- ✅ Create custom categories as needed
- ❌ Don't over-categorize (use metadata for fine-grained classification)
Metadata:
- ✅ Add relevant metadata (author, date, topic, version)
- ✅ Use consistent metadata keys across documents
- ✅ Include source URLs for web content
- ❌ Don't duplicate information already in content
Chunking Strategy
When to use larger chunks (1500-2000 chars):
- Technical documentation with code examples
- Academic papers with complex concepts
- Documents requiring more context
When to use smaller chunks (500-800 chars):
- Q&A documents
- Short articles or blog posts
- Documents with discrete topics
Overlap considerations:
- More overlap (300-400 chars): Better context preservation
- Less overlap (100-200 chars): More efficient storage
- Default 200 chars works well for most content
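These guidelines map directly onto DocumentChunker settings. The two presets below are just local variables for illustration, not names exported by the skill:
from rag_utils import DocumentChunker

# Long-form technical material: larger chunks, generous overlap
technical_chunker = DocumentChunker(chunk_size=1800, chunk_overlap=350)

# Q&A content and short articles: small chunks, lean overlap
qa_chunker = DocumentChunker(chunk_size=600, chunk_overlap=150)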
Embedding Selection
Use Local Embeddings when:
- Privacy is critical (data cannot leave your system)
- Cost is a concern (embeddings are free)
- You have many documents to process
- Embedding quality is "good enough"
Use OpenAI Embeddings when:
- Highest quality is required
- Processing time is critical (API is faster than some local models)
- You need consistency with other OpenAI tools
- Cost is acceptable (~$0.02 per 1M tokens)
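A startup-time selection between the two backends might look like the sketch below. Note that the OpenAIEmbeddings import is an assumption by analogy with LocalEmbeddings; verify the actual class name in rag_utils before relying on it:
import os
from rag_utils import LocalEmbeddings

# Default: free, private, fully local
embeddings = LocalEmbeddings(model_name="all-MiniLM-L6-v2")

if os.getenv("OPENAI_API_KEY"):
    # Hypothetical class name, assumed by analogy with LocalEmbeddings
    from rag_utils import OpenAIEmbeddings
    embeddings = OpenAIEmbeddings(model_name="text-embedding-3-small")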
Search Optimization
Query Formulation:
- ✅ Use natural language questions
- ✅ Be specific rather than general
- ✅ Include key terms from your domain
- ❌ Don't use single keywords (use phrases)
Relevance Thresholds:
- 0.8+: Highly relevant, near-exact matches
- 0.7-0.8: Very relevant, strong semantic match
- 0.5-0.7: Relevant, moderate semantic match
- <0.5: Marginal relevance, consider filtering out
Result Limits:
- Start with 5 results for most queries
- Increase to 10-20 for broad exploratory queries
- Use 1-3 for highly specific questions
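Applying the relevance thresholds in code is a one-line filter over the results returned by kb.search (result shape as shown under Basic Usage):
# Broad query first, then keep only strong matches per the 0.7 guideline
results = kb.search("machine learning pipelines", limit=20)
strong = [r for r in results if r.score >= 0.7]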
Limitations
Current Limitations
- Languages: Best results with English (models are English-optimized)
- Image Content: Text-only extraction (no OCR or image analysis)
- Tables: Basic table extraction (formatting may be lost)
- Code: Extracted as text (no syntax highlighting or execution)
- Updates: Documents must be re-ingested after changes
- Scale: ChromaDB optimized for <1M documents (beyond that, consider alternatives)
Performance Constraints
- Large PDFs (>1000 pages) may be slow to process
- First-time embedding model download (~100MB) required
- Local embeddings require CPU/GPU resources
- Memory usage scales with number of documents
Troubleshooting
Common Issues
Issue: "Module not found" errors
# Solution: Install dependencies
pip install -r scripts/requirements.txt
Issue: "No results found" for valid queries
# Solution 1: Check if documents are loaded
python scripts/knowledge_base.py list
# Solution 2: Lower minimum score threshold
python scripts/search_kb.py "query" --min-score 0.0
# Solution 3: Rebuild knowledge base
python scripts/knowledge_base.py rebuild
Issue: "Out of memory" during embedding generation
# Solution: Use smaller embedding model
from rag_utils import LocalEmbeddings
embeddings = LocalEmbeddings(model_name="all-MiniLM-L6-v2") # Smaller, 384 dim
Issue: "Slow search performance"
# Solution: Reduce chunk size or document count
# Or: Switch to more efficient embedding model
# Or: Consider upgrading hardware
Issue: "PDF text extraction failed"
# Solution: Check if PDF is text-based (not scanned image)
# For scanned PDFs, run OCR first to add a text layer,
# e.g. with ocrmypdf: ocrmypdf input.pdf output.pdf
Integration Examples
With LangChain
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from knowledge_base import KnowledgeBase

# Expose the skill's underlying ChromaDB store as a LangChain retriever
kb = KnowledgeBase("./data/kb")
retriever = kb.rag_system.vector_store.vectorstore.as_retriever()

# "stuff" chain: retrieved chunks are inserted directly into the prompt
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=retriever
)
answer = qa.run("What is RAG?")
With Claude API
import anthropic
from knowledge_base import KnowledgeBase
kb = KnowledgeBase("./data/kb")
results = kb.search("RAG implementation", limit=3)
# Build context from search results
context = "\n\n".join([r.text for r in results])
# Query Claude with context
client = anthropic.Anthropic()
message = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
messages=[{
"role": "user",
"content": f"Based on this context:\n\n{context}\n\nQuestion: How do I implement RAG?"
}]
)
print(message.content[0].text)
Reference Documentation
- RAG Utils Reference: Complete API documentation
- Document Processors: Supported formats and extraction details
- Chunking Strategies: Best practices for document chunking
- Embedding Models: Comparison and selection guide
Examples
- Basic Workflow: Step-by-step basic usage
- Batch Ingestion: Process multiple documents
- Advanced Search: Complex queries and filtering
- Integration Examples: Use with LangChain, Claude, etc.
Version: 1.0.0
Created: 2025-11-01
Status: Production Ready
Task: 022
Dependencies: sentence-transformers, chromadb, langchain
Source
https://github.com/wrm3/ai_project_template/blob/main/.claude/skills/hanx-knowledge-base/SKILL.md
Example Use Cases
- Build a customer support KB by ingesting product manuals (PDF) and guide docs (MD, HTML) for fast Q&A
- Ingest research papers and internal docs to support R&D decision making
- Create a compliance/legal document repository with metadata for quick retrieval
- Assemble developer docs from HTML/MD sources for a centralized API reference
- Populate HR policies from CSV/MD/HTML sources to enable policy Q&A