using-vector-databases
npx machina-cli add skill ancoleman/ai-design-components/using-vector-databases --openclaw

Vector Databases for AI Applications
When to Use This Skill
Use this skill when implementing:
- RAG (Retrieval-Augmented Generation) systems for AI chatbots
- Semantic search capabilities (meaning-based, not just keyword)
- Recommendation systems based on similarity
- Multi-modal AI (unified search across text, images, audio)
- Document similarity and deduplication
- Question answering over private knowledge bases
Quick Decision Framework
1. Vector Database Selection
START: Choosing a Vector Database
EXISTING INFRASTRUCTURE?
├─ Using PostgreSQL already?
│ └─ pgvector (<10M vectors, tight budget)
│ See: references/pgvector.md
│
└─ No existing vector database?
│
├─ OPERATIONAL PREFERENCE?
│ │
│ ├─ Zero-ops managed only
│ │ └─ Pinecone (fully managed, excellent DX)
│ │ See: references/pinecone.md
│ │
│ └─ Flexible (self-hosted or managed)
│ │
│ ├─ SCALE: <100M vectors + complex filtering ⭐
│ │ └─ Qdrant (RECOMMENDED)
│ │ • Best metadata filtering
│ │ • Built-in hybrid search (BM25 + Vector)
│ │ • Self-host: Docker/K8s
│ │ • Managed: Qdrant Cloud
│ │ See: references/qdrant.md
│ │
│ ├─ SCALE: >100M vectors + GPU acceleration
│ │ └─ Milvus / Zilliz Cloud
│ │ See: references/milvus.md
│ │
│ ├─ Embedded / No server
│ │ └─ LanceDB (serverless, edge deployment)
│ │
│ └─ Local prototyping
│ └─ Chroma (simple API, in-memory)
2. Embedding Model Selection
REQUIREMENTS?
├─ Best quality (cost no object)
│ └─ Voyage AI voyage-3 (1024d)
│ • 9.74% better than OpenAI on MTEB
│ • ~$0.12/1M tokens
│ See: references/embedding-strategies.md
│
├─ Enterprise reliability
│ └─ OpenAI text-embedding-3-large (3072d)
│ • Industry standard
│ • ~$0.13/1M tokens
• Matryoshka shortening: reduce to 256/512/1024d
│
├─ Cost-optimized
│ └─ OpenAI text-embedding-3-small (1536d)
│ • ~$0.02/1M tokens (6x cheaper)
│ • 90-95% of large model performance
│
├─ Multilingual (100+ languages)
│ └─ Cohere embed-v3 (1024d)
│ • ~$0.10/1M tokens
│
└─ Self-hosted / Privacy-critical
├─ English: nomic-embed-text-v1.5 (768d, Apache 2.0)
├─ Multilingual: BAAI/bge-m3 (1024d, MIT)
└─ Long docs: jina-embeddings-v2 (768d, 8K context)
Core Concepts
Document Chunking Strategy
Recommended defaults for most RAG systems:
- Chunk size: 512 tokens (not characters)
- Overlap: 50 tokens (10% overlap)
Why these numbers?
- 512 tokens balances context vs. precision
- Too small (128-256): Fragments concepts, loses context
- Too large (1024-2048): Dilutes relevance, wastes LLM tokens
- 50-token overlap preserves context across chunk boundaries so ideas aren't cut off mid-sentence
See references/chunking-patterns.md for advanced strategies by content type.
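As a rough sketch, the defaults above amount to a sliding window over the token stream. This uses a plain token list for illustration; a real pipeline would tokenize with the embedding model's own tokenizer (e.g. tiktoken):

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Slide a window of `size` tokens, stepping by size - overlap."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window reached the end of the document
    return chunks

# 1,200 tokens -> 3 chunks; consecutive chunks share 50 tokens
chunks = chunk_tokens(list(range(1200)))
```

With size=512 and overlap=50, each chunk starts 462 tokens after the previous one, so the last 50 tokens of one chunk repeat as the first 50 of the next.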
Hybrid Search (Vector + Keyword)
Hybrid Search = Vector Similarity + BM25 Keyword Matching
User Query: "OAuth refresh token implementation"
│
┌──────┴──────┐
│ │
Vector Search Keyword Search
(Semantic) (BM25)
│ │
Top 20 docs Top 20 docs
│ │
└──────┬──────┘
│
Reciprocal Rank Fusion
(Merge + Re-rank)
│
Final Top 5 Results
Why hybrid matters:
- Vector captures semantic meaning ("OAuth refresh" ≈ "token renewal")
- Keyword ensures exact matches ("refresh_token" literal)
- Combined provides best retrieval quality
See references/hybrid-search.md for implementation details.
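The fusion step in the diagram can be sketched as plain Reciprocal Rank Fusion. k=60 is the constant commonly used in the RRF literature; the document IDs and rankings here are hypothetical:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
keyword_hits = ["doc_b", "doc_d"]           # BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# doc_b rises to the top: it appears high in both lists
```

Because scores depend only on rank positions, RRF needs no tuning of the incompatible score scales that vector similarity and BM25 produce.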
Getting Started
Python + Qdrant Example
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)

# 1. Initialize client
client = QdrantClient("localhost", port=6333)

# 2. Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
)

# 3. Insert documents with embeddings
points = [
    PointStruct(
        id=idx,
        vector=embedding,  # From OpenAI/Voyage/etc.
        payload={
            "text": chunk_text,
            "source": "docs/api.md",
            "section": "Authentication"
        }
    )
    for idx, (embedding, chunk_text) in enumerate(chunks)
]
client.upsert(collection_name="documents", points=points)

# 4. Search with metadata filtering
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[FieldCondition(key="section", match=MatchValue(value="Authentication"))]
    )
)
For complete examples, see examples/qdrant-python/.
TypeScript + Qdrant Example
import { QdrantClient } from '@qdrant/js-client-rest';

const client = new QdrantClient({ url: 'http://localhost:6333' });

// Create collection
await client.createCollection('documents', {
  vectors: { size: 1024, distance: 'Cosine' }
});

// Insert documents
await client.upsert('documents', {
  points: chunks.map((chunk, idx) => ({
    id: idx,
    vector: chunk.embedding,
    payload: {
      text: chunk.text,
      source: chunk.source
    }
  }))
});

// Search
const results = await client.search('documents', {
  vector: queryEmbedding,
  limit: 5,
  filter: {
    must: [
      { key: 'source', match: { value: 'docs/api.md' } }
    ]
  }
});
For complete examples, see examples/typescript-rag/.
RAG Pipeline Architecture
Complete Pipeline Components
1. INGESTION
├─ Document Loading (PDF, web, code, Office)
├─ Text Extraction & Cleaning
├─ Chunking (semantic, recursive, code-aware)
└─ Embedding Generation (batch, rate-limited)
2. INDEXING
├─ Vector Store Insertion (batch upsert)
├─ Index Configuration (HNSW, distance metric)
└─ Keyword Index (BM25 for hybrid search)
3. RETRIEVAL (Query Time)
├─ Query Processing (expansion, embedding)
├─ Hybrid Search (vector + keyword)
├─ Filtering & Post-Processing (metadata, MMR)
└─ Re-Ranking (cross-encoder, LLM-based)
4. GENERATION
├─ Context Construction (format chunks, citations)
├─ Prompt Engineering (system + context + query)
├─ LLM Inference (streaming, temperature tuning)
└─ Response Post-Processing (citations, validation)
5. EVALUATION (Production Critical)
├─ Retrieval Metrics (precision, recall, relevancy)
├─ Generation Metrics (faithfulness, correctness)
└─ System Metrics (latency, cost, satisfaction)
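The query-time half of the pipeline (stages 3–4) can be sketched as one function. Here embed_fn, search_fn, and llm_fn are hypothetical stand-ins for whichever embedding provider, vector store, and LLM you wire up:

```python
def answer_query(query, embed_fn, search_fn, llm_fn, top_k=5):
    """Query-time RAG: embed the query, retrieve chunks, build a cited prompt, generate."""
    qvec = embed_fn(query)
    hits = search_fn(qvec, top_k)  # assumed shape: [(chunk_text, source), ...]

    # Context construction with numbered citations
    context = "\n\n".join(
        f"[{i + 1}] ({src}) {text}" for i, (text, src) in enumerate(hits)
    )
    prompt = (
        "Answer using only the context below; cite sources as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return llm_fn(prompt)
```

Re-ranking, MMR, and streaming would slot in between the search call and prompt construction; this sketch shows only the minimal data flow.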
Essential Metadata for Production RAG
Critical for filtering and relevance:
metadata = {
    # SOURCE TRACKING
    "source": "docs/api-reference.md",
    "source_type": "documentation",  # code, docs, logs, chat
    "last_updated": "2025-12-01T12:00:00Z",

    # HIERARCHICAL CONTEXT
    "section": "Authentication",
    "subsection": "OAuth 2.1",
    "heading_hierarchy": ["API Reference", "Authentication", "OAuth 2.1"],

    # CONTENT CLASSIFICATION
    "content_type": "code_example",  # prose, code, table, list
    "programming_language": "python",

    # FILTERING DIMENSIONS
    "product_version": "v2.0",
    "audience": "enterprise",  # free, pro, enterprise

    # RETRIEVAL HINTS
    "chunk_index": 3,
    "total_chunks": 12,
    "has_code": True
}
Why metadata matters:
- Enables filtering BEFORE vector search (reduces search space)
- Improves relevance through targeted retrieval
- Supports multi-tenant systems (filter by user/org)
- Enables versioned documentation (filter by product version)
Evaluation with RAGAS
Use scripts/evaluate_rag.py for automated evaluation:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Answer grounded in context
    answer_relevancy,   # Answer addresses query
    context_recall,     # Retrieved docs cover ground truth
    context_precision   # Retrieved docs are relevant
)

# Test dataset
test_data = {
    "question": ["How do I refresh OAuth tokens?"],
    "answer": ["Use /token with refresh_token grant..."],
    "contexts": [["OAuth refresh documentation..."]],
    "ground_truth": ["POST to /token with grant_type=refresh_token"]
}

# Evaluate (ragas expects a datasets.Dataset, not a raw dict)
results = evaluate(Dataset.from_dict(test_data), metrics=[
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
])
# Production targets:
# faithfulness: >0.90 (minimal hallucination)
# answer_relevancy: >0.85 (addresses user query)
# context_recall: >0.80 (sufficient context retrieved)
# context_precision: >0.75 (minimal noise)
Performance Optimization
Embedding Generation
- Batch processing: 100-500 chunks per batch
- Caching: Cache embeddings by content hash
- Rate limiting: Respect API provider limits (exponential backoff)
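Content-hash caching from the list above can be sketched as follows. embed_fn is any batch embedding call; the cache could be a dict, Redis, or a table keyed by the SHA-256 of the chunk text:

```python
import hashlib

def embed_with_cache(texts, embed_fn, cache):
    """Embed only texts whose content hash is not already cached."""
    results = []
    to_embed, positions = [], []
    for i, text in enumerate(texts):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in cache:
            results.append(cache[key])
        else:
            results.append(None)          # placeholder, filled in below
            to_embed.append((key, text))
            positions.append(i)
    if to_embed:
        # One batched API call for all cache misses
        vectors = embed_fn([t for _, t in to_embed])
        for (key, _), pos, vec in zip(to_embed, positions, vectors):
            cache[key] = vec
            results[pos] = vec
    return results
```

Hashing content rather than file paths means unchanged chunks are never re-embedded, even after a document is re-ingested.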
Vector Search
- Index type: HNSW (Hierarchical Navigable Small World) for most cases
- Distance metric: Cosine for normalized embeddings
- Pre-filtering: Apply metadata filters before vector search
- Result diversity: Use MMR (Maximal Marginal Relevance) to reduce redundancy
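A minimal greedy MMR pass over already-retrieved candidates might look like this. The relevance scores and pairwise similarities are hypothetical precomputed values; lambda_ trades relevance against diversity:

```python
def mmr_select(candidates, relevance, pair_sim, lambda_=0.7, k=2):
    """Greedy MMR: balance relevance to the query against redundancy with prior picks."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(d):
            # Highest similarity to anything already selected = redundancy penalty
            redundancy = max(
                (pair_sim[tuple(sorted((d, s)))] for s in selected), default=0.0
            )
            return lambda_ * relevance[d] - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
pair_sim = {("a", "b"): 0.95, ("a", "c"): 0.1, ("b", "c"): 0.1}
picked = mmr_select(["a", "b", "c"], relevance, pair_sim)
# -> ["a", "c"]: "b" is a near-duplicate of "a", so the diverse "c" wins
```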
Cost Optimization
- Embedding model: Consider text-embedding-3-small for budget constraints
- Dimension reduction: Use Matryoshka shortening (3072d → 1024d)
- Caching: Implement semantic caching for repeated queries
- Batch operations: Group insertions/updates for efficiency
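Matryoshka-style dimension reduction is just "keep the leading dimensions and re-normalize"; it is only valid for models trained for it, such as OpenAI's text-embedding-3 family. A minimal sketch:

```python
import math

def shorten_embedding(vec, dims):
    """Keep the first `dims` components, then re-normalize to unit length
    so cosine similarity remains meaningful."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

v = shorten_embedding([3.0, 4.0, 0.1, -0.2], 2)
# -> [0.6, 0.8]
```

Shortening a 3072d vector to 1024d cuts storage and search cost to a third, at a small quality loss the model was trained to minimize.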
Common Workflows
1. Building a RAG Chatbot
- Vector database: Qdrant (self-hosted or cloud)
- Embeddings: OpenAI text-embedding-3-large
- Chunking: 512 tokens, 50 overlap, semantic splitter
- Search: Hybrid (vector + BM25)
- Integration: Frontend with ai-chat skill
See examples/qdrant-python/ for complete implementation.
2. Semantic Search Engine
- Vector database: Qdrant or Pinecone
- Embeddings: Voyage AI voyage-3 (best quality)
- Chunking: Content-type specific (see chunking-patterns.md)
- Search: Hybrid with re-ranking
- Filtering: Pre-filter by metadata (date, category, etc.)
3. Code Search
- Vector database: Qdrant
- Embeddings: OpenAI text-embedding-3-large
- Chunking: AST-based (function/class boundaries)
- Metadata: Language, file path, imports
- Search: Hybrid with language filtering
See examples/qdrant-python/ for code-specific implementation.
Integration with Other Skills
Frontend Skills
- ai-chat: Vector DB powers RAG pipeline behind chat interface
- search-filter: Replace keyword search with semantic search
- data-viz: Visualize embedding spaces, similarity scores
Backend Skills
- databases-relational: Hybrid approach using pgvector extension
- api-patterns: Expose semantic search via REST/GraphQL
- observability: Monitor embedding quality and retrieval metrics
Multi-Language Support
Python (Primary)
- Client: qdrant-client
- Framework: LangChain, LlamaIndex
- See: examples/qdrant-python/
Rust
- Client: qdrant-client
- Framework: Raw Rust for performance-critical systems
- See: examples/rust-axum-vector/
TypeScript
- Client: @qdrant/js-client-rest
- Framework: LangChain.js, integration with Next.js
- See: examples/typescript-rag/
Go
- Client: qdrant-go
- Use case: High-performance microservices
Troubleshooting
Poor Retrieval Quality
- Check chunking strategy (too large/small?)
- Verify metadata filtering (too restrictive?)
- Try hybrid search instead of vector-only
- Implement re-ranking stage
- Evaluate with RAGAS metrics
Slow Performance
- Use HNSW index (not Flat)
- Pre-filter with metadata before vector search
- Reduce vector dimensions (Matryoshka shortening)
- Batch operations (insertions, searches)
- Consider GPU acceleration (Milvus)
High Costs
- Switch to text-embedding-3-small
- Implement semantic caching
- Reduce chunk overlap
- Use self-hosted embeddings (nomic, bge-m3)
- Batch embedding generation
Qdrant Context7 Documentation
Primary resource: /llmstxt/qdrant_tech_llms-full_txt
- Trust score: High
- Code snippets: 10,154
- Quality score: 83.1
Access via Context7:
resolve-library-id({ libraryName: "Qdrant" })

get-library-docs({
  context7CompatibleLibraryID: "/llmstxt/qdrant_tech_llms-full_txt",
  topic: "hybrid search collections python",
  mode: "code"
})
Additional Resources
Reference Documentation
- references/qdrant.md - Comprehensive Qdrant guide
- references/pgvector.md - PostgreSQL pgvector extension
- references/milvus.md - Milvus/Zilliz for billion-scale
- references/embedding-strategies.md - Embedding model comparison
- references/chunking-patterns.md - Advanced chunking techniques
Code Examples
- examples/qdrant-python/ - FastAPI + Qdrant RAG pipeline
- examples/pgvector-prisma/ - PostgreSQL + Prisma integration
- examples/typescript-rag/ - TypeScript RAG with Hono
Automation Scripts
- scripts/generate_embeddings.py - Batch embedding generation
- scripts/benchmark_similarity.py - Performance benchmarking
- scripts/evaluate_rag.py - RAGAS-based evaluation
Next Steps:
- Choose vector database based on scale and infrastructure
- Select embedding model based on quality vs. cost trade-off
- Implement chunking strategy for the content type
- Set up hybrid search for production quality
- Evaluate with RAGAS metrics
- Optimize for performance and cost
Source
https://github.com/ancoleman/ai-design-components/blob/main/skills/using-vector-databases/SKILL.md

Overview
This skill covers vector database implementations for AI/ML apps, enabling semantic search, RAG, and similarity-based retrieval. It discusses popular engines (Qdrant as primary, Pinecone, Milvus, pgvector, Chroma), embedding options (OpenAI, Voyage, Cohere), chunking strategies, and hybrid search patterns.
How This Skill Works
Store text embeddings in a vector database, chunk content into 512-token pieces with a 50-token overlap, and index with metadata for filtering. Use hybrid search to combine vector similarity with BM25 keywords, then merge results with Reciprocal Rank Fusion to produce the top results.
When to Use It
- Building RAG-powered chatbots and assistants
- Implementing semantic search capabilities over content
- Creating recommendation systems based on embedding similarity
- Enabling multi-modal unified search across text, images, and audio
- Document similarity, deduplication, and private knowledge base Q&A
Quick Start
- Step 1: Choose a vector DB (Qdrant recommended for metadata and hybrid search) and select embeddings (OpenAI, Voyage, or Cohere) based on scale and cost
- Step 2: Chunk documents into 512-token blocks with 50-token overlap and generate embeddings for each chunk
- Step 3: Build a retrieval pipeline using vector similarity plus optional BM25 keyword search, then tune with Reciprocal Rank Fusion
Best Practices
- Chunk documents into 512-token blocks with 50-token overlap by default
- Choose embeddings (OpenAI, Voyage, Cohere) based on quality, cost, and multilingual needs
- Leverage metadata filtering in vector DBs like Qdrant for precise results
- Combine vector search with BM25 via hybrid search and Reciprocal Rank Fusion when appropriate
- Evaluate retrieval quality and latency with realistic prompts and end-to-end tests
Example Use Cases
- RAG-powered customer support chatbot over private knowledge bases
- Semantic search for a large e-commerce catalog
- Document deduplication and similarity search in historical archives
- Multi-modal search across text and images in a media library
- Private Q&A over internal documents using hybrid search