hanx-knowledge-base
Hanx Knowledge Base & RAG Skill
Install: npx machina-cli add skill wrm3/ai_project_template/hanx-knowledge-base --openclaw
Build searchable knowledge bases from diverse document types with advanced RAG (Retrieval-Augmented Generation) capabilities.
Overview
This skill provides a complete document ingestion and search pipeline:
- Document Processing: Extract text from PDF, DOCX, MD, TXT, CSV, HTML, JSON
- Intelligent Chunking: Split documents with semantic awareness
- Vector Embeddings: Generate embeddings using local or OpenAI models
- Persistent Storage: Store in ChromaDB vector database
- Semantic Search: Find relevant content using natural language queries
- Category Management: Organize documents by topic
When to Use This Skill
Automatic Triggers
- User says "ingest this document" or "add to knowledge base"
- User requests "search my documents" or "find information about"
- User mentions "build knowledge base from" a directory
- User wants to "query the knowledge base"
Manual Invocation
# Add single document
python scripts/knowledge_base.py add document.pdf --category technical
# Add directory of documents
python scripts/knowledge_base.py add-batch ./documents --category research
# Search knowledge base
python scripts/search_kb.py "What is RAG?" --limit 5
# List all documents
python scripts/knowledge_base.py list
Core Capabilities
1. Document Ingestion
Supported Formats:
- PDF (.pdf): Full text extraction with page numbers
- Word (.docx, .doc): Text, tables, and document properties
- Markdown (.md): With YAML frontmatter support
- Text (.txt): Plain text files
- CSV (.csv): Structured data with headers
- HTML (.html, .htm): Web pages with metadata extraction
- JSON (.json): Structured JSON data
Features:
- Automatic format detection
- Metadata extraction (author, title, dates, etc.)
- Batch processing for directories
- Category-based organization
- Progress tracking
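Format detection is driven by the file extension. A minimal sketch of the idea, with a hypothetical extension-to-handler mapping (the real table lives inside document_processor):
from pathlib import Path

# Illustrative extension-to-handler map; the handler names are
# hypothetical, not the skill's actual internals
HANDLERS = {
    ".pdf": "extract_pdf",
    ".docx": "extract_docx",
    ".md": "extract_markdown",
    ".txt": "extract_text",
    ".csv": "extract_csv",
    ".html": "extract_html",
    ".json": "extract_json",
}

def detect_format(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix not in HANDLERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return HANDLERS[suffix]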
2. Intelligent Chunking
Chunking Strategy:
Document
↓
[Extract Text]
↓
[Split by Semantic Boundaries]
↓
[Chunks with Overlap]
↓
Vector Embeddings
Features:
- Recursive character splitting
- Semantic boundary detection (paragraphs, sentences)
- Configurable chunk size (default: 1000 characters)
- Overlap between chunks (default: 200 characters)
- Metadata preservation per chunk
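To make the strategy concrete, here is a dependency-free sketch of recursive splitting at semantic boundaries. Overlap handling is omitted for brevity, and the production implementation is rag_utils.DocumentChunker, not this function:
def split_recursive(text, chunk_size=1000,
                    separators=("\n\n", "\n", ". ", " ")):
    # Base case: the text already fits in one chunk
    if len(text) <= chunk_size:
        return [text]
    for i, sep in enumerate(separators):
        if sep in text:
            pieces = text.split(sep)
            chunks, current = [], ""
            # Greedily merge pieces back together up to chunk_size
            for piece in pieces:
                candidate = (current + sep + piece) if current else piece
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = piece
            if current:
                chunks.append(current)
            # Recurse with finer separators on any chunk still too long
            return [sub for c in chunks
                    for sub in split_recursive(c, chunk_size, separators[i + 1:])]
    # No separator left: fall back to a hard character split
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]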
3. Vector Embeddings
Local Embeddings (Default - FREE):
- Model: all-MiniLM-L6-v2 (SentenceTransformers)
- Dimensions: 384
- Speed: Fast (local inference)
- Cost: $0 (no API costs)
- Privacy: Data stays local
Alternative Models:
- all-mpnet-base-v2: Better quality, 768 dimensions
- multi-qa-MiniLM-L6-cos-v1: Optimized for Q&A
- OpenAI embeddings: Highest quality (requires API key)
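Generating these local embeddings takes only a few lines with sentence-transformers; the sample sentences below are placeholders:
from sentence_transformers import SentenceTransformer

# First call downloads the model (~100 MB); afterwards inference is fully local
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["What is RAG?", "ChromaDB stores vector embeddings."])
print(vectors.shape)  # (2, 384): one 384-dimensional vector per sentence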
4. Vector Storage (ChromaDB)
Features:
- Persistent Storage: Documents saved to disk
- Metadata Filtering: Filter by category, author, date
- Multiple Distance Metrics: Cosine, L2, Inner Product
- Efficient Indexing: Fast similarity search
- CRUD Operations: Add, search, update, delete
Storage Structure:
knowledge_base/
├── documents/ # Original documents (categorized)
│ ├── technical/
│ ├── business/
│ ├── reference/
│ └── ...
├── metadata/ # Document metadata (JSON)
├── vector_store/ # ChromaDB vector database
└── exports/ # Exported metadata
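For orientation, the same operations look roughly like this against ChromaDB directly. The collection name and ids are illustrative; the skill wraps this in rag_utils.VectorStore:
import chromadb

# Persistent client: the index survives restarts under this directory
client = chromadb.PersistentClient(path="./knowledge_base/vector_store")
collection = client.get_or_create_collection(
    name="documents",                   # illustrative name
    metadata={"hnsw:space": "cosine"},  # select the cosine distance metric
)

# Add one chunk; ids must be unique
collection.add(
    ids=["doc1_chunk0"],
    documents=["Vector databases store embeddings efficiently."],
    metadatas=[{"category": "technical", "source": "chromadb_guide.pdf"}],
)

# Similarity search with a metadata filter
hits = collection.query(
    query_texts=["vector database implementation"],
    n_results=5,
    where={"category": "technical"},
)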
5. Semantic Search
Search Capabilities:
- Natural Language Queries: "How do I implement RAG?"
- Similarity Scoring: Relevance scores (0.0 to 1.0)
- Metadata Filtering: Filter by category, file type, date
- Top-K Retrieval: Get most relevant chunks
- Contextual Results: Full text + metadata + score
Example Search:
$ python search_kb.py "vector database implementation"
================================================================================
Result 1
================================================================================
Relevance: 0.8742 ████████████████████
Category: technical
Source: chromadb_guide.pdf
Content:
--------------------------------------------------------------------------------
Vector databases are specialized databases designed to store and search
vector embeddings efficiently. ChromaDB provides an open-source solution
with features like persistent storage, metadata filtering, and cosine
similarity search. Implementation involves three key steps: 1) Generate
embeddings using a model like SentenceTransformers, 2) Store embeddings
in the vector database, 3) Query using similarity search.
--------------------------------------------------------------------------------
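The relevance figure shown above is derived from the raw vector distance. One common conversion for cosine distance, shown here as an assumption about the scoring rather than a confirmed formula:
# ChromaDB's cosine metric returns a *distance*; inverting it gives the
# familiar 0.0-1.0 relevance score (assumed conversion, for intuition only)
def relevance(cosine_distance: float) -> float:
    return max(0.0, 1.0 - cosine_distance)

print(relevance(0.1258))  # 0.8742, matching Result 1 above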
Workflow Examples
Example 1: Build Knowledge Base from Directory
User: "Build a knowledge base from all PDF files in ./research"
Skill Actions:
- Scan directory for PDF files
- Process each PDF (extract text + metadata)
- Chunk documents intelligently
- Generate embeddings for all chunks
- Store in ChromaDB vector database
- Report statistics
Output:
================================================================================
Knowledge Base Batch Ingestion
================================================================================
Directory: ./research
Category: research
Recursive: True
File Filter: .pdf
[1/15] Processing: machine_learning_survey.pdf
[KB] Adding document: machine_learning_survey.pdf
[KB] Category: research
[KB] Processing document...
[KB] Metadata saved: machine_learning_survey.json
[KB] Adding to RAG system...
[KB] ✅ Document added successfully in 2.34s
[KB] Document ID: doc_1730123456789
[2/15] Processing: neural_networks_intro.pdf
...
[15/15] Processing: transformer_architecture.pdf
================================================================================
BATCH INGESTION COMPLETE
================================================================================
Total Files: 15
Successfully Added: 15
Failed: 0
Total Time: 45.2s
Average Time per Document: 3.0s
Knowledge base ready for search!
Example 2: Semantic Search with Filters
User: "Search my technical documents for information about embeddings with minimum relevance 0.7"
Command:
python scripts/search_kb.py "embeddings" --category technical --min-score 0.7 --limit 10
Output:
================================================================================
Knowledge Base Search
================================================================================
Knowledge Base: ./data/knowledge_base
Query: "embeddings"
Category Filter: technical
Max Results: 10
Min Score: 0.7
[SEARCH] Searching knowledge base...
[SEARCH] Found 8 results
================================================================================
Result 1
================================================================================
Relevance: 0.8921 ██████████████████
Category: technical
Source: rag_implementation_guide.md
Content:
--------------------------------------------------------------------------------
Embeddings are dense vector representations of text that capture semantic
meaning. Modern embedding models like OpenAI's text-embedding-3-small or
SentenceTransformers produce vectors of 384 to 3072 dimensions. These
embeddings enable semantic search by measuring similarity through cosine
distance or dot product calculations.
--------------------------------------------------------------------------------
[... more results ...]
Example 3: Document Management
List all documents:
python scripts/knowledge_base.py list --category technical
Output:
Knowledge Base Documents
Total: 23 documents in 'technical' category
1. rag_implementation_guide.md (45.2 KB)
Modified: 2025-11-01T10:30:45
Metadata: {"author": "John Doe", "topics": ["RAG", "embeddings"]}
2. chromadb_setup.pdf (1.2 MB)
Modified: 2025-10-28T14:22:10
Metadata: {"author": "Tech Team", "version": "2.0"}
[... 21 more documents ...]
Remove document:
python scripts/knowledge_base.py remove chromadb_setup.pdf --category technical
Export metadata:
python scripts/knowledge_base.py export --output kb_metadata.json
Technical Specifications
Embedding Models Comparison
| Model | Dimensions | Speed | Quality | Cost | Use Case |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | Free | Default, general purpose |
| all-mpnet-base-v2 | 768 | Medium | Better | Free | Higher quality needed |
| text-embedding-3-small | 1536 | API | Best | $0.020/1M tokens | Production, highest quality |
| text-embedding-3-large | 3072 | API | Superior | $0.130/1M tokens | Critical applications |
Performance Characteristics
Document Ingestion (1000-page PDF):
- Text extraction: ~5 seconds
- Chunking: <1 second
- Embedding generation: ~10 seconds (local), ~3 seconds (API)
- Database storage: ~2 seconds
- Total: ~18 seconds (local), ~11 seconds (API)
Search Performance:
- Query embedding: ~100ms (local), ~200ms (API)
- Vector search (10K chunks): ~50ms
- Result formatting: <10ms
- Total: ~160ms (local), ~260ms (API)
Storage Requirements (per 1000 documents):
- Original documents: ~500 MB (varies by type)
- ChromaDB vectors: ~100 MB
- Metadata: ~10 MB
- Total: ~610 MB
Chunking Configuration
Default Settings:
CHUNK_SIZE = 1000 # characters per chunk
CHUNK_OVERLAP = 200 # overlap between chunks
SEPARATORS = [
"\n\n", # Paragraphs
"\n", # Lines
". ", # Sentences
" ", # Words
"" # Characters
]
Customization:
from rag_utils import DocumentChunker
chunker = DocumentChunker(
chunk_size=1500, # Larger chunks for technical docs
chunk_overlap=300, # More overlap for context
separators=["\n\n", "\n", ". "] # Custom separators
)
Usage Instructions
Setup (One-time)
- Install Dependencies:
cd .claude/skills/hanx-knowledge-base
pip install -r scripts/requirements.txt
- Verify Installation:
python scripts/rag_utils.py # Runs test
python scripts/document_processor.py # Tests processors
- Optional: Configure OpenAI Embeddings:
# Create .env file
echo "OPENAI_API_KEY=sk-your-key-here" > .env
Basic Usage
Initialize Knowledge Base:
from knowledge_base import KnowledgeBase
kb = KnowledgeBase("./data/my_kb")
Add Documents:
# Single document
kb.add_document("document.pdf", category="technical")
# Batch add from directory
kb.add_documents_batch(
"./documents",
category="research",
recursive=True,
file_extensions=['.pdf', '.docx', '.md']
)
Search:
results = kb.search("What is RAG?", limit=5)
for result in results:
print(f"Score: {result.score:.4f}")
print(f"Text: {result.text[:200]}...")
print(f"Source: {result.metadata['source']}")
Advanced Usage
Custom RAG System:
from rag_utils import VectorStore, LocalEmbeddings, DocumentChunker, RAGSystem
# Configure components
embeddings = LocalEmbeddings(model_name="all-mpnet-base-v2")
vector_store = VectorStore(
persist_directory="./my_vector_db",
embedding_function=embeddings
)
chunker = DocumentChunker(chunk_size=1500, chunk_overlap=300)
# Create RAG system
rag = RAGSystem(vector_store=vector_store, chunker=chunker)
# Use it
rag.add_text("Document content here...", metadata={"source": "custom.txt"})
results = rag.query("search query", limit=10)
Metadata Filtering:
# Search only recent documents
results = kb.search(
"machine learning",
limit=5,
metadata_filter={"category": "research", "year": 2025}
)
Programmatic Document Processing:
from document_processor import process_document, process_directory
# Process single document
content, metadata = process_document("document.pdf")
# Process directory
results = process_directory(
"./documents",
recursive=True,
file_extensions=['.pdf', '.md']
)
for content, metadata in results:
print(f"Processed: {metadata['file_name']}")
print(f"Length: {len(content)} characters")
Best Practices
Document Organization
Categories:
- ✅ Use meaningful categories (technical, business, reference)
- ✅ Keep category names lowercase and simple
- ✅ Create custom categories as needed
- ❌ Don't over-categorize (use metadata for fine-grained classification)
Metadata:
- ✅ Add relevant metadata (author, date, topic, version)
- ✅ Use consistent metadata keys across documents
- ✅ Include source URLs for web content
- ❌ Don't duplicate information already in content
Chunking Strategy
When to use larger chunks (1500-2000 chars):
- Technical documentation with code examples
- Academic papers with complex concepts
- Documents requiring more context
When to use smaller chunks (500-800 chars):
- Q&A documents
- Short articles or blog posts
- Documents with discrete topics
Overlap considerations:
- More overlap (300-400 chars): Better context preservation
- Less overlap (100-200 chars): More efficient storage
- Default 200 chars works well for most content
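These guidelines map directly onto DocumentChunker settings. The two presets below are just local variables for illustration, not names exported by the skill:
from rag_utils import DocumentChunker

# Long-form technical material: larger chunks, generous overlap
technical_chunker = DocumentChunker(chunk_size=1800, chunk_overlap=350)

# Q&A content and short articles: small chunks, lean overlap
qa_chunker = DocumentChunker(chunk_size=600, chunk_overlap=150)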
Embedding Selection
Use Local Embeddings when:
- Privacy is critical (data cannot leave your system)
- Cost is a concern (embeddings are free)
- You have many documents to process
- Embedding quality is "good enough"
Use OpenAI Embeddings when:
- Highest quality is required
- Processing time is critical (API is faster than some local models)
- You need consistency with other OpenAI tools
- Cost is acceptable (~$0.02 per 1M tokens)
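A startup-time selection between the two backends might look like the sketch below. Note that the OpenAIEmbeddings import is an assumption by analogy with LocalEmbeddings; verify the actual class name in rag_utils before relying on it:
import os
from rag_utils import LocalEmbeddings

# Default: free, private, fully local
embeddings = LocalEmbeddings(model_name="all-MiniLM-L6-v2")

if os.getenv("OPENAI_API_KEY"):
    # Hypothetical class name, assumed by analogy with LocalEmbeddings
    from rag_utils import OpenAIEmbeddings
    embeddings = OpenAIEmbeddings(model_name="text-embedding-3-small")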
Search Optimization
Query Formulation:
- ✅ Use natural language questions
- ✅ Be specific rather than general
- ✅ Include key terms from your domain
- ❌ Don't use single keywords (use phrases)
Relevance Thresholds:
- 0.8+: Highly relevant, near-exact matches
- 0.7-0.8: Very relevant, strong semantic match
- 0.5-0.7: Relevant, moderate semantic match
- <0.5: Marginal relevance, consider filtering out
Result Limits:
- Start with 5 results for most queries
- Increase to 10-20 for broad exploratory queries
- Use 1-3 for highly specific questions
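Applying the relevance thresholds in code is a one-line filter over the results returned by kb.search (result shape as shown under Basic Usage):
# Broad query first, then keep only strong matches per the 0.7 guideline
results = kb.search("machine learning pipelines", limit=20)
strong = [r for r in results if r.score >= 0.7]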
Limitations
Current Limitations
- Languages: Best results with English (models are English-optimized)
- Image Content: Text-only extraction (no OCR or image analysis)
- Tables: Basic table extraction (formatting may be lost)
- Code: Extracted as text (no syntax highlighting or execution)
- Updates: Documents must be re-ingested after changes
- Scale: ChromaDB optimized for <1M documents (beyond that, consider alternatives)
Performance Constraints
- Large PDFs (>1000 pages) may be slow to process
- First-time embedding model download (~100MB) required
- Local embeddings require CPU/GPU resources
- Memory usage scales with number of documents
Troubleshooting
Common Issues
Issue: "Module not found" errors
# Solution: Install dependencies
pip install -r scripts/requirements.txt
Issue: "No results found" for valid queries
# Solution 1: Check if documents are loaded
python scripts/knowledge_base.py list
# Solution 2: Lower minimum score threshold
python scripts/search_kb.py "query" --min-score 0.0
# Solution 3: Rebuild knowledge base
python scripts/knowledge_base.py rebuild
Issue: "Out of memory" during embedding generation
# Solution: Use smaller embedding model
from rag_utils import LocalEmbeddings
embeddings = LocalEmbeddings(model_name="all-MiniLM-L6-v2") # Smaller, 384 dim
Issue: "Slow search performance"
# Solution: Reduce chunk size or document count
# Or: Switch to more efficient embedding model
# Or: Consider upgrading hardware
Issue: "PDF text extraction failed"
# Solution: Check if PDF is text-based (not scanned image)
# For scanned PDFs, run OCR first to add a text layer,
# e.g. with ocrmypdf: ocrmypdf input.pdf output.pdf
Integration Examples
With LangChain
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from knowledge_base import KnowledgeBase

# Expose the skill's underlying ChromaDB store as a LangChain retriever
kb = KnowledgeBase("./data/kb")
retriever = kb.rag_system.vector_store.vectorstore.as_retriever()

# "stuff" chain: retrieved chunks are inserted directly into the prompt
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=retriever
)
answer = qa.run("What is RAG?")
With Claude API
import anthropic
from knowledge_base import KnowledgeBase
kb = KnowledgeBase("./data/kb")
results = kb.search("RAG implementation", limit=3)
# Build context from search results
context = "\n\n".join([r.text for r in results])
# Query Claude with context
client = anthropic.Anthropic()
message = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
messages=[{
"role": "user",
"content": f"Based on this context:\n\n{context}\n\nQuestion: How do I implement RAG?"
}]
)
print(message.content[0].text)
Reference Documentation
- RAG Utils Reference: Complete API documentation
- Document Processors: Supported formats and extraction details
- Chunking Strategies: Best practices for document chunking
- Embedding Models: Comparison and selection guide
Examples
- Basic Workflow: Step-by-step basic usage
- Batch Ingestion: Process multiple documents
- Advanced Search: Complex queries and filtering
- Integration Examples: Use with LangChain, Claude, etc.
Version: 1.0.0
Created: 2025-11-01
Status: Production Ready
Task: 022
Dependencies: sentence-transformers, chromadb, langchain
Source
https://github.com/wrm3/ai_project_template/blob/main/.claude/skills/hanx-knowledge-base/SKILL.md
Example Use Cases
- Build a customer support KB by ingesting product manuals (PDF) and guide docs (MD, HTML) for fast Q&A
- Ingest research papers and internal docs to support R&D decision making
- Create a compliance/legal document repository with metadata for quick retrieval
- Assemble developer docs from HTML/MD sources for a centralized API reference
- Populate HR policies from CSV/MD/HTML sources to enable policy Q&A