using-vector-databases
npx machina-cli add skill ancoleman/ai-design-components/using-vector-databases --openclaw

Vector Databases for AI Applications
When to Use This Skill
Use this skill when implementing:
- RAG (Retrieval-Augmented Generation) systems for AI chatbots
- Semantic search capabilities (meaning-based, not just keyword)
- Recommendation systems based on similarity
- Multi-modal AI (unified search across text, images, audio)
- Document similarity and deduplication
- Question answering over private knowledge bases
Quick Decision Framework
1. Vector Database Selection
START: Choosing a Vector Database
EXISTING INFRASTRUCTURE?
├─ Using PostgreSQL already?
│ └─ pgvector (<10M vectors, tight budget)
│ See: references/pgvector.md
│
└─ No existing vector database?
│
├─ OPERATIONAL PREFERENCE?
│ │
│ ├─ Zero-ops managed only
│ │ └─ Pinecone (fully managed, excellent DX)
│ │ See: references/pinecone.md
│ │
│ └─ Flexible (self-hosted or managed)
│ │
│ ├─ SCALE: <100M vectors + complex filtering ⭐
│ │ └─ Qdrant (RECOMMENDED)
│ │ • Best metadata filtering
│ │ • Built-in hybrid search (BM25 + Vector)
│ │ • Self-host: Docker/K8s
│ │ • Managed: Qdrant Cloud
│ │ See: references/qdrant.md
│ │
│ ├─ SCALE: >100M vectors + GPU acceleration
│ │ └─ Milvus / Zilliz Cloud
│ │ See: references/milvus.md
│ │
│ ├─ Embedded / No server
│ │ └─ LanceDB (serverless, edge deployment)
│ │
│ └─ Local prototyping
│ └─ Chroma (simple API, in-memory)
2. Embedding Model Selection
REQUIREMENTS?
├─ Best quality (cost no object)
│ └─ Voyage AI voyage-3 (1024d)
│ • 9.74% better than OpenAI on MTEB
│ • ~$0.12/1M tokens
│ See: references/embedding-strategies.md
│
├─ Enterprise reliability
│ └─ OpenAI text-embedding-3-large (3072d)
│ • Industry standard
│ • ~$0.13/1M tokens
• Matryoshka shortening: reduce to 256/512/1024d
│
├─ Cost-optimized
│ └─ OpenAI text-embedding-3-small (1536d)
│ • ~$0.02/1M tokens (6x cheaper)
│ • 90-95% of large model performance
│
├─ Multilingual (100+ languages)
│ └─ Cohere embed-v3 (1024d)
│ • ~$0.10/1M tokens
│
└─ Self-hosted / Privacy-critical
├─ English: nomic-embed-text-v1.5 (768d, Apache 2.0)
├─ Multilingual: BAAI/bge-m3 (1024d, MIT)
└─ Long docs: jina-embeddings-v2 (768d, 8K context)
Core Concepts
Document Chunking Strategy
Recommended defaults for most RAG systems:
- Chunk size: 512 tokens (not characters)
- Overlap: 50 tokens (10% overlap)
Why these numbers?
- 512 tokens balances context vs. precision
- Too small (128-256): Fragments concepts, loses context
- Too large (1024-2048): Dilutes relevance, wastes LLM tokens
- 50-token overlap preserves context across chunk boundaries so ideas aren't cut off mid-sentence
See references/chunking-patterns.md for advanced strategies by content type.
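As a rough sketch, the defaults above amount to a sliding window over the token stream. This uses a plain token list for illustration; a real pipeline would tokenize with the embedding model's own tokenizer (e.g. tiktoken):

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Slide a window of `size` tokens, stepping by size - overlap."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window reached the end of the document
    return chunks

# 1,200 tokens -> 3 chunks; consecutive chunks share 50 tokens
chunks = chunk_tokens(list(range(1200)))
```

With size=512 and overlap=50, each chunk starts 462 tokens after the previous one, so the last 50 tokens of one chunk repeat as the first 50 of the next.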
Hybrid Search (Vector + Keyword)
Hybrid Search = Vector Similarity + BM25 Keyword Matching
User Query: "OAuth refresh token implementation"
│
┌──────┴──────┐
│ │
Vector Search Keyword Search
(Semantic) (BM25)
│ │
Top 20 docs Top 20 docs
│ │
└──────┬──────┘
│
Reciprocal Rank Fusion
(Merge + Re-rank)
│
Final Top 5 Results
Why hybrid matters:
- Vector captures semantic meaning ("OAuth refresh" ≈ "token renewal")
- Keyword ensures exact matches ("refresh_token" literal)
- Combined provides best retrieval quality
See references/hybrid-search.md for implementation details.
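The fusion step in the diagram can be sketched as plain Reciprocal Rank Fusion. k=60 is the constant commonly used in the RRF literature; the document IDs and rankings here are hypothetical:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
keyword_hits = ["doc_b", "doc_d"]           # BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# doc_b rises to the top: it appears high in both lists
```

Because scores depend only on rank positions, RRF needs no tuning of the incompatible score scales that vector similarity and BM25 produce.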
Getting Started
Python + Qdrant Example
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)

# 1. Initialize client
client = QdrantClient("localhost", port=6333)

# 2. Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
)

# 3. Insert documents with embeddings
points = [
    PointStruct(
        id=idx,
        vector=embedding,  # From OpenAI/Voyage/etc.
        payload={
            "text": chunk_text,
            "source": "docs/api.md",
            "section": "Authentication"
        }
    )
    for idx, (embedding, chunk_text) in enumerate(chunks)
]
client.upsert(collection_name="documents", points=points)

# 4. Search with metadata filtering
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[FieldCondition(key="section", match=MatchValue(value="Authentication"))]
    )
)
For complete examples, see examples/qdrant-python/.
TypeScript + Qdrant Example
import { QdrantClient } from '@qdrant/js-client-rest';

const client = new QdrantClient({ url: 'http://localhost:6333' });

// Create collection
await client.createCollection('documents', {
  vectors: { size: 1024, distance: 'Cosine' }
});

// Insert documents
await client.upsert('documents', {
  points: chunks.map((chunk, idx) => ({
    id: idx,
    vector: chunk.embedding,
    payload: {
      text: chunk.text,
      source: chunk.source
    }
  }))
});

// Search
const results = await client.search('documents', {
  vector: queryEmbedding,
  limit: 5,
  filter: {
    must: [
      { key: 'source', match: { value: 'docs/api.md' } }
    ]
  }
});
For complete examples, see examples/typescript-rag/.
RAG Pipeline Architecture
Complete Pipeline Components
1. INGESTION
├─ Document Loading (PDF, web, code, Office)
├─ Text Extraction & Cleaning
├─ Chunking (semantic, recursive, code-aware)
└─ Embedding Generation (batch, rate-limited)
2. INDEXING
├─ Vector Store Insertion (batch upsert)
├─ Index Configuration (HNSW, distance metric)
└─ Keyword Index (BM25 for hybrid search)
3. RETRIEVAL (Query Time)
├─ Query Processing (expansion, embedding)
├─ Hybrid Search (vector + keyword)
├─ Filtering & Post-Processing (metadata, MMR)
└─ Re-Ranking (cross-encoder, LLM-based)
4. GENERATION
├─ Context Construction (format chunks, citations)
├─ Prompt Engineering (system + context + query)
├─ LLM Inference (streaming, temperature tuning)
└─ Response Post-Processing (citations, validation)
5. EVALUATION (Production Critical)
├─ Retrieval Metrics (precision, recall, relevancy)
├─ Generation Metrics (faithfulness, correctness)
└─ System Metrics (latency, cost, satisfaction)
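The query-time half of the pipeline (stages 3–4) can be sketched as one function. Here embed_fn, search_fn, and llm_fn are hypothetical stand-ins for whichever embedding provider, vector store, and LLM you wire up:

```python
def answer_query(query, embed_fn, search_fn, llm_fn, top_k=5):
    """Query-time RAG: embed the query, retrieve chunks, build a cited prompt, generate."""
    qvec = embed_fn(query)
    hits = search_fn(qvec, top_k)  # assumed shape: [(chunk_text, source), ...]

    # Context construction with numbered citations
    context = "\n\n".join(
        f"[{i + 1}] ({src}) {text}" for i, (text, src) in enumerate(hits)
    )
    prompt = (
        "Answer using only the context below; cite sources as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return llm_fn(prompt)
```

Re-ranking, MMR, and streaming would slot in between the search call and prompt construction; this sketch shows only the minimal data flow.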
Essential Metadata for Production RAG
Critical for filtering and relevance:
metadata = {
    # SOURCE TRACKING
    "source": "docs/api-reference.md",
    "source_type": "documentation",  # code, docs, logs, chat
    "last_updated": "2025-12-01T12:00:00Z",

    # HIERARCHICAL CONTEXT
    "section": "Authentication",
    "subsection": "OAuth 2.1",
    "heading_hierarchy": ["API Reference", "Authentication", "OAuth 2.1"],

    # CONTENT CLASSIFICATION
    "content_type": "code_example",  # prose, code, table, list
    "programming_language": "python",

    # FILTERING DIMENSIONS
    "product_version": "v2.0",
    "audience": "enterprise",  # free, pro, enterprise

    # RETRIEVAL HINTS
    "chunk_index": 3,
    "total_chunks": 12,
    "has_code": True
}
Why metadata matters:
- Enables filtering BEFORE vector search (reduces search space)
- Improves relevance through targeted retrieval
- Supports multi-tenant systems (filter by user/org)
- Enables versioned documentation (filter by product version)
Evaluation with RAGAS
Use scripts/evaluate_rag.py for automated evaluation:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Answer grounded in context
    answer_relevancy,   # Answer addresses query
    context_recall,     # Retrieved docs cover ground truth
    context_precision   # Retrieved docs are relevant
)

# Test dataset
test_data = {
    "question": ["How do I refresh OAuth tokens?"],
    "answer": ["Use /token with refresh_token grant..."],
    "contexts": [["OAuth refresh documentation..."]],
    "ground_truth": ["POST to /token with grant_type=refresh_token"]
}

# Evaluate (ragas expects a datasets.Dataset, not a raw dict)
results = evaluate(Dataset.from_dict(test_data), metrics=[
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
])
# Production targets:
# faithfulness: >0.90 (minimal hallucination)
# answer_relevancy: >0.85 (addresses user query)
# context_recall: >0.80 (sufficient context retrieved)
# context_precision: >0.75 (minimal noise)
Performance Optimization
Embedding Generation
- Batch processing: 100-500 chunks per batch
- Caching: Cache embeddings by content hash
- Rate limiting: Respect API provider limits (exponential backoff)
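Content-hash caching from the list above can be sketched as follows. embed_fn is any batch embedding call; the cache could be a dict, Redis, or a table keyed by the SHA-256 of the chunk text:

```python
import hashlib

def embed_with_cache(texts, embed_fn, cache):
    """Embed only texts whose content hash is not already cached."""
    results = []
    to_embed, positions = [], []
    for i, text in enumerate(texts):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in cache:
            results.append(cache[key])
        else:
            results.append(None)          # placeholder, filled in below
            to_embed.append((key, text))
            positions.append(i)
    if to_embed:
        # One batched API call for all cache misses
        vectors = embed_fn([t for _, t in to_embed])
        for (key, _), pos, vec in zip(to_embed, positions, vectors):
            cache[key] = vec
            results[pos] = vec
    return results
```

Hashing content rather than file paths means unchanged chunks are never re-embedded, even after a document is re-ingested.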
Vector Search
- Index type: HNSW (Hierarchical Navigable Small World) for most cases
- Distance metric: Cosine for normalized embeddings
- Pre-filtering: Apply metadata filters before vector search
- Result diversity: Use MMR (Maximal Marginal Relevance) to reduce redundancy
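A minimal greedy MMR pass over already-retrieved candidates might look like this. The relevance scores and pairwise similarities are hypothetical precomputed values; lambda_ trades relevance against diversity:

```python
def mmr_select(candidates, relevance, pair_sim, lambda_=0.7, k=2):
    """Greedy MMR: balance relevance to the query against redundancy with prior picks."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(d):
            # Highest similarity to anything already selected = redundancy penalty
            redundancy = max(
                (pair_sim[tuple(sorted((d, s)))] for s in selected), default=0.0
            )
            return lambda_ * relevance[d] - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
pair_sim = {("a", "b"): 0.95, ("a", "c"): 0.1, ("b", "c"): 0.1}
picked = mmr_select(["a", "b", "c"], relevance, pair_sim)
# -> ["a", "c"]: "b" is a near-duplicate of "a", so the diverse "c" wins
```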
Cost Optimization
- Embedding model: Consider text-embedding-3-small for budget constraints
- Dimension reduction: Use Matryoshka shortening (3072d → 1024d)
- Caching: Implement semantic caching for repeated queries
- Batch operations: Group insertions/updates for efficiency
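Matryoshka-style dimension reduction is just "keep the leading dimensions and re-normalize"; it is only valid for models trained for it, such as OpenAI's text-embedding-3 family. A minimal sketch:

```python
import math

def shorten_embedding(vec, dims):
    """Keep the first `dims` components, then re-normalize to unit length
    so cosine similarity remains meaningful."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

v = shorten_embedding([3.0, 4.0, 0.1, -0.2], 2)
# -> [0.6, 0.8]
```

Shortening a 3072d vector to 1024d cuts storage and search cost to a third, at a small quality loss the model was trained to minimize.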
Common Workflows
1. Building a RAG Chatbot
- Vector database: Qdrant (self-hosted or cloud)
- Embeddings: OpenAI text-embedding-3-large
- Chunking: 512 tokens, 50 overlap, semantic splitter
- Search: Hybrid (vector + BM25)
- Integration: Frontend with ai-chat skill
See examples/qdrant-python/ for complete implementation.
2. Semantic Search Engine
- Vector database: Qdrant or Pinecone
- Embeddings: Voyage AI voyage-3 (best quality)
- Chunking: Content-type specific (see chunking-patterns.md)
- Search: Hybrid with re-ranking
- Filtering: Pre-filter by metadata (date, category, etc.)
3. Code Search
- Vector database: Qdrant
- Embeddings: OpenAI text-embedding-3-large
- Chunking: AST-based (function/class boundaries)
- Metadata: Language, file path, imports
- Search: Hybrid with language filtering
See examples/qdrant-python/ for code-specific implementation.
Integration with Other Skills
Frontend Skills
- ai-chat: Vector DB powers RAG pipeline behind chat interface
- search-filter: Replace keyword search with semantic search
- data-viz: Visualize embedding spaces, similarity scores
Backend Skills
- databases-relational: Hybrid approach using pgvector extension
- api-patterns: Expose semantic search via REST/GraphQL
- observability: Monitor embedding quality and retrieval metrics
Multi-Language Support
Python (Primary)
- Client: qdrant-client
- Framework: LangChain, LlamaIndex
- See: examples/qdrant-python/
Rust
- Client: qdrant-client
- Framework: Raw Rust for performance-critical systems
- See: examples/rust-axum-vector/
TypeScript
- Client: @qdrant/js-client-rest
- Framework: LangChain.js, integration with Next.js
- See: examples/typescript-rag/
Go
- Client: qdrant-go
- Use case: High-performance microservices
Troubleshooting
Poor Retrieval Quality
- Check chunking strategy (too large/small?)
- Verify metadata filtering (too restrictive?)
- Try hybrid search instead of vector-only
- Implement re-ranking stage
- Evaluate with RAGAS metrics
Slow Performance
- Use HNSW index (not Flat)
- Pre-filter with metadata before vector search
- Reduce vector dimensions (Matryoshka shortening)
- Batch operations (insertions, searches)
- Consider GPU acceleration (Milvus)
High Costs
- Switch to text-embedding-3-small
- Implement semantic caching
- Reduce chunk overlap
- Use self-hosted embeddings (nomic, bge-m3)
- Batch embedding generation
Qdrant Context7 Documentation
Primary resource: /llmstxt/qdrant_tech_llms-full_txt
- Trust score: High
- Code snippets: 10,154
- Quality score: 83.1
Access via Context7:
resolve-library-id({ libraryName: "Qdrant" })

get-library-docs({
  context7CompatibleLibraryID: "/llmstxt/qdrant_tech_llms-full_txt",
  topic: "hybrid search collections python",
  mode: "code"
})
Additional Resources
Reference Documentation
- references/qdrant.md - Comprehensive Qdrant guide
- references/pgvector.md - PostgreSQL pgvector extension
- references/milvus.md - Milvus/Zilliz for billion-scale
- references/embedding-strategies.md - Embedding model comparison
- references/chunking-patterns.md - Advanced chunking techniques
Code Examples
- examples/qdrant-python/ - FastAPI + Qdrant RAG pipeline
- examples/pgvector-prisma/ - PostgreSQL + Prisma integration
- examples/typescript-rag/ - TypeScript RAG with Hono
Automation Scripts
- scripts/generate_embeddings.py - Batch embedding generation
- scripts/benchmark_similarity.py - Performance benchmarking
- scripts/evaluate_rag.py - RAGAS-based evaluation
Next Steps:
- Choose vector database based on scale and infrastructure
- Select embedding model based on quality vs. cost trade-off
- Implement chunking strategy for the content type
- Set up hybrid search for production quality
- Evaluate with RAGAS metrics
- Optimize for performance and cost
Source
https://github.com/ancoleman/ai-design-components/blob/main/skills/using-vector-databases/SKILL.md

Overview
This skill covers vector database implementations for AI/ML apps, enabling semantic search, RAG, and similarity-based retrieval. It discusses popular engines (Qdrant as primary, Pinecone, Milvus, pgvector, Chroma), embedding options (OpenAI, Voyage, Cohere), chunking strategies, and hybrid search patterns.
How This Skill Works
Store text embeddings in a vector database, chunk content into 512-token pieces with a 50-token overlap, and index with metadata for filtering. Use hybrid search to combine vector similarity with BM25 keywords, then merge results with Reciprocal Rank Fusion to produce the top results.
When to Use It
- Building RAG-powered chatbots and assistants
- Implementing semantic search capabilities over content
- Creating recommendation systems based on embedding similarity
- Enabling multi-modal unified search across text, images, and audio
- Document similarity, deduplication, and private knowledge base Q&A
Quick Start
- Step 1: Choose a vector DB (Qdrant recommended for metadata and hybrid search) and select embeddings (OpenAI, Voyage, or Cohere) based on scale and cost
- Step 2: Chunk documents into 512-token blocks with 50-token overlap and generate embeddings for each chunk
- Step 3: Build a retrieval pipeline using vector similarity plus optional BM25 keyword search, then tune with Reciprocal Rank Fusion
Best Practices
- Chunk documents into 512-token blocks with 50-token overlap by default
- Choose embeddings (OpenAI, Voyage, Cohere) based on quality, cost, and multilingual needs
- Leverage metadata filtering in vector DBs like Qdrant for precise results
- Combine vector search with BM25 via hybrid search and Reciprocal Rank Fusion when appropriate
- Evaluate retrieval quality and latency with realistic prompts and end-to-end tests
Example Use Cases
- RAG-powered customer support chatbot over private knowledge bases
- Semantic search for a large e-commerce catalog
- Document deduplication and similarity search in historical archives
- Multi-modal search across text and images in a media library
- Private Q&A over internal documents using hybrid search