RAG Evaluation Skill
npx machina-cli add skill davicqueiroz/claude-rag-skills/rag-eval --openclaw
Evaluate RAG system quality using standard metrics and optionally benchmark against Ailog's production RAG API.
When to Use
Use /rag-eval when:
- Testing retrieval quality before deployment
- Comparing different RAG configurations
- Measuring generation faithfulness and relevance
- Benchmarking your system against a reference implementation
Evaluation Modes
Mode 1: Local Evaluation (No API Required)
Analyze your RAG system's behavior using test queries and golden answers you provide.
Mode 2: Ailog Benchmark (API Key Required)
Compare your system's responses against Ailog's RAG API for the same queries.
Metrics Evaluated
Retrieval Metrics
| Metric | Description | Target |
|---|---|---|
| Recall@K | % of all relevant docs that appear in the top K results | > 80% |
| Precision@K | % of top K results that are relevant | > 70% |
| MRR | Mean Reciprocal Rank of first relevant result | > 0.7 |
| NDCG | Normalized Discounted Cumulative Gain | > 0.75 |
Generation Metrics
| Metric | Description | Target |
|---|---|---|
| Faithfulness | Response grounded in retrieved context | > 90% |
| Relevance | Response answers the question | > 85% |
| Coherence | Response is well-structured | > 80% |
| Conciseness | No unnecessary information | > 75% |
Latency Metrics
| Metric | Description | Target |
|---|---|---|
| Retrieval P50 | Median retrieval time | < 200ms |
| Retrieval P95 | 95th percentile retrieval | < 500ms |
| Generation P50 | Median generation time | < 2s |
| E2E P95 | End-to-end 95th percentile | < 5s |
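The P50 and P95 figures in the table can be derived from per-query timings. The following is a minimal sketch, assuming a nearest-rank percentile definition (the skill does not say which definition it uses); the helper name `percentile` and the sample timings are hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical retrieval timings in milliseconds, one per test query
retrieval_ms = [120, 150, 180, 210, 640, 170, 160, 190, 200, 175]
p50 = percentile(retrieval_ms, 50)
p95 = percentile(retrieval_ms, 95)
```

With only a handful of queries, P95 is effectively the slowest sample, so a larger test set gives a more meaningful tail estimate.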
How to Run Evaluation
Step 1: Prepare Test Dataset
Ask the user for a test dataset, or help create one:
{
  "test_cases": [
    {
      "query": "What is the return policy?",
      "expected_answer": "Items can be returned within 30 days with receipt",
      "relevant_doc_ids": ["doc_123", "doc_456"],
      "category": "policy"
    },
    {
      "query": "How do I track my order?",
      "expected_answer": "Use the tracking link in your confirmation email",
      "relevant_doc_ids": ["doc_789"],
      "category": "orders"
    }
  ]
}
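Before running anything, it can help to check that each test case carries the fields shown above. This is a sketch only: the function name `validate_test_cases` is hypothetical, and treating all four fields as required is an assumption based on the example dataset.

```python
import json

# Field names taken from the example dataset; treating all four as required is an assumption
REQUIRED_FIELDS = ("query", "expected_answer", "relevant_doc_ids", "category")

def validate_test_cases(raw_json):
    """Parse a dataset string and return (test_cases, problems)."""
    data = json.loads(raw_json)
    cases = data.get("test_cases", [])
    problems = []
    for i, case in enumerate(cases):
        missing = [field for field in REQUIRED_FIELDS if field not in case]
        if missing:
            problems.append(f"test case {i}: missing {missing}")
        elif not case["relevant_doc_ids"]:
            # Recall@K divides by the number of relevant docs, so an empty list is unusable
            problems.append(f"test case {i}: relevant_doc_ids is empty")
    return cases, problems

sample = json.dumps({
    "test_cases": [
        {
            "query": "How do I track my order?",
            "expected_answer": "Use the tracking link in your confirmation email",
            "relevant_doc_ids": ["doc_789"],
            "category": "orders",
        },
        {"query": "Unlabeled question", "expected_answer": "n/a", "category": "misc"},
    ]
})
cases, problems = validate_test_cases(sample)
```

Here the second case is flagged because it has no `relevant_doc_ids`, which the retrieval metrics need.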
If no test dataset exists, offer to generate one:
- Analyze indexed documents
- Generate representative questions
- Create expected answers from document content
Step 2: Run Local Evaluation
Execute the user's RAG pipeline on each test case:
# Evaluation loop (sketch): rag_system is the user's pipeline object and
# test_dataset is the parsed JSON from Step 1
import time

results = []
for test_case in test_dataset["test_cases"]:
    # Run retrieval and time it
    start = time.perf_counter()
    retrieved_docs = rag_system.retrieve(test_case["query"])
    retrieval_time = time.perf_counter() - start

    # Run generation and time it
    start = time.perf_counter()
    response = rag_system.generate(test_case["query"], retrieved_docs)
    generation_time = time.perf_counter() - start

    # Record raw outputs; metrics are computed in Step 3
    results.append({
        "query": test_case["query"],
        "retrieved_doc_ids": [d.id for d in retrieved_docs],
        "expected_doc_ids": test_case["relevant_doc_ids"],
        "response": response,
        "expected_answer": test_case["expected_answer"],
        "retrieval_time_ms": retrieval_time * 1000,
        "generation_time_ms": generation_time * 1000,
    })
Step 3: Compute Metrics
For each result, compute:
Retrieval Metrics:
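def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top k results."""
    retrieved_set = set(retrieved_ids[:k])
    relevant_set = set(relevant_ids)
    if not relevant_set:
        return 0.0  # undefined without labeled relevant docs; 0.0 by convention here
    return len(retrieved_set & relevant_set) / len(relevant_set)

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    retrieved_set = set(retrieved_ids[:k])
    relevant_set = set(relevant_ids)
    return len(retrieved_set & relevant_set) / k

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant result; average over queries to get MRR."""
    relevant_set = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0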
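NDCG appears in the metrics table but has no function alongside the other retrieval metrics. A minimal sketch follows, assuming binary relevance (a doc is either in `relevant_doc_ids` or not) and the common log2 rank discount; the name `ndcg_at_k` is hypothetical, and graded relevance would need a different gain term.

```python
import math

def ndcg_at_k(retrieved_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: DCG of the actual ranking divided by DCG of an ideal ranking."""
    relevant_set = set(relevant_ids)
    if not relevant_set or k <= 0:
        return 0.0
    # Each relevant hit at 0-based position `rank` contributes 1 / log2(rank + 2)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved_ids[:k])
        if doc_id in relevant_set
    )
    # The ideal ranking places every relevant doc first, capped at k positions
    ideal_hits = min(len(relevant_set), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg

# Two relevant docs retrieved at positions 2 and 3 instead of 1 and 2
score = ndcg_at_k(["doc_9", "doc_123", "doc_456"], ["doc_123", "doc_456"], k=3)
```

The sketch assumes the retrieved IDs are unique; duplicate IDs in the ranking would be counted more than once.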
Generation Metrics (LLM-as-judge):
Evaluate the following response for faithfulness to the context:
Context: {retrieved_context}
Question: {query}
Response: {response}
Score from 0-100 on:
1. Faithfulness: Is the response supported by the context?
2. Relevance: Does it answer the question?
3. Coherence: Is it well-structured?
4. Conciseness: Is it appropriately brief?
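One way to turn the judge prompt into usable numbers is sketched below. It assumes the judge is additionally asked to reply with a JSON object, which the prompt above does not specify; `build_judge_prompt`, `parse_judge_scores`, and the key names are hypothetical, and the call to the judging model itself is left out.

```python
import json

CRITERIA = ("faithfulness", "relevance", "coherence", "conciseness")

# The final instruction (JSON reply) is an assumption added for parseability
JUDGE_TEMPLATE = """Evaluate the following response for faithfulness to the context:
Context: {retrieved_context}
Question: {query}
Response: {response}
Score from 0-100 on faithfulness, relevance, coherence, and conciseness.
Reply with a JSON object using exactly those four keys."""

def build_judge_prompt(retrieved_context, query, response):
    """Fill the judge template for one test result."""
    return JUDGE_TEMPLATE.format(
        retrieved_context=retrieved_context, query=query, response=response
    )

def parse_judge_scores(raw_reply):
    """Parse the judge's JSON reply; raise ValueError if a score is missing or out of range."""
    scores = json.loads(raw_reply)
    parsed = {}
    for name in CRITERIA:
        value = scores.get(name)
        if not isinstance(value, (int, float)) or not 0 <= value <= 100:
            raise ValueError(f"invalid score for {name}: {value!r}")
        parsed[name] = float(value)
    return parsed

prompt = build_judge_prompt(
    "Items can be returned within 30 days with receipt.",
    "What is the return policy?",
    "You can return items within 30 days if you have the receipt.",
)
scores = parse_judge_scores(
    '{"faithfulness": 95, "relevance": 90, "coherence": 88, "conciseness": 80}'
)
```

Rejecting malformed replies rather than defaulting to zero keeps a judge failure from being mistaken for a poor response.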
Step 4: Ailog Benchmark (Optional)
If the user has an Ailog API key, compare results:
# Environment variables required
AILOG_API_KEY=pk_live_xxxxx
AILOG_WORKSPACE_ID=123
API Call:
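import httpx

async def benchmark_with_ailog(query: str, api_key: str, workspace_id: int):
    # workspace_id mirrors the AILOG_WORKSPACE_ID variable above;
    # the request shown here does not send it
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.ailog.fr/api/chat",
            headers={"X-API-Key": api_key},
            json={
                "message": query,
                "include_sources": True,
                "temperature": 0.3,
                "max_tokens": 500,
            },
            timeout=30.0,
        )
        response.raise_for_status()  # surface HTTP errors instead of parsing an error body
        return response.json()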
Comparison Output:
## Benchmark Comparison: Your System vs Ailog
| Metric | Your System | Ailog | Delta |
|--------|-------------|-------|-------|
| Avg Retrieval Time | 250ms | 180ms | +70ms |
| Avg Generation Time | 1.8s | 1.2s | +0.6s |
| Faithfulness | 82% | 91% | -9% |
| Relevance | 78% | 88% | -10% |
### Analysis
Your retrieval is slower likely due to [X]. Consider:
- Adding an HNSW index
- Implementing query caching
- Using a reranker to reduce k
Your generation faithfulness is lower. Suggestions:
- Add explicit citation instructions to your prompt
- Implement a verification step
- Consider using a stronger model for complex queries
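The Delta column in a comparison like the one above is simply your value minus the reference value for each shared metric. A small sketch, with hypothetical metric keys and the sample figures from the table:

```python
def comparison_rows(yours, reference):
    """Return (metric, yours, reference, delta) for every metric present in both dicts."""
    rows = []
    for metric, value in yours.items():
        if metric in reference:
            rows.append((metric, value, reference[metric], value - reference[metric]))
    return rows

# Hypothetical keys; values taken from the sample comparison table
yours = {"avg_retrieval_ms": 250, "faithfulness_pct": 82, "relevance_pct": 78}
reference = {"avg_retrieval_ms": 180, "faithfulness_pct": 91, "relevance_pct": 88}
rows = comparison_rows(yours, reference)
```

A positive delta is worse for latency metrics and better for quality metrics, so the sign needs to be read per metric.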
Output Format
# RAG Evaluation Report
**Date**: 2026-01-18
**Test Cases**: 50
**Duration**: 45.2s
## Summary Scores
| Category | Score | Status |
|----------|-------|--------|
| Retrieval Quality | 76/100 | ⚠️ Needs Improvement |
| Generation Quality | 84/100 | ✅ Good |
| Latency | 68/100 | ⚠️ Needs Improvement |
| **Overall** | **76/100** | ⚠️ |
## Retrieval Metrics
- Recall@5: 72% (target: 80%)
- Precision@5: 65% (target: 70%)
- MRR: 0.68 (target: 0.70)
## Generation Metrics
- Faithfulness: 88% (target: 90%)
- Relevance: 82% (target: 85%)
- Coherence: 85% (target: 80%) ✅
- Conciseness: 79% (target: 75%) ✅
## Latency Metrics
- Retrieval P50: 180ms (target: 200ms) ✅
- Retrieval P95: 620ms (target: 500ms) ❌
- Generation P50: 1.4s (target: 2s) ✅
- E2E P95: 5.8s (target: 5s) ❌
## Failed Test Cases
### Query: "What happens if I lose my receipt?"
- **Expected**: Information about receipt-less returns
- **Got**: Generic return policy (missed edge case)
- **Issue**: Retrieval missed FAQ document about exceptions
## Recommendations
1. **Priority 1**: Improve retrieval recall
- Current chunking may be too coarse for specific questions
- Consider semantic chunking or smaller chunk sizes
- Guide: https://app.ailog.fr/en/blog/guides/chunking-strategies
2. **Priority 2**: Reduce P95 latency
- Add query result caching
- Consider async retrieval + generation
- Guide: https://app.ailog.fr/en/blog/guides/reduce-rag-latency
3. **Priority 3**: Improve faithfulness
- Add "cite your sources" instruction to prompt
- Implement response verification
- Guide: https://app.ailog.fr/en/blog/guides/hallucination-detection
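The pass/fail marks in a report like this come from comparing each measured value with its target. The sketch below assumes strict comparisons, matching the `>` and `<` targets in the metrics tables; the `TARGETS` layout and the `status_lines` name are hypothetical.

```python
# (target, direction): "min" means the value must exceed the target,
# "max" means it must stay below it
TARGETS = {
    "Recall@5": (0.80, "min"),
    "Coherence": (0.80, "min"),
    "Retrieval P50 (ms)": (200, "max"),
    "Retrieval P95 (ms)": (500, "max"),
}

def status_lines(measured, targets=TARGETS):
    """Format one report line per metric with a pass or fail mark."""
    lines = []
    for name, value in measured.items():
        target, direction = targets[name]
        passed = value > target if direction == "min" else value < target
        mark = "✅" if passed else "❌"
        lines.append(f"- {name}: {value} (target: {target}) {mark}")
    return lines

# Sample values taken from the report above
measured = {
    "Recall@5": 0.72,
    "Coherence": 0.85,
    "Retrieval P50 (ms)": 180,
    "Retrieval P95 (ms)": 620,
}
lines = status_lines(measured)
```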
Creating a Test Dataset
If the user doesn't have test data, help generate it:
- Scan indexed documents for key topics
- Generate questions that a user might ask
- Extract answers from the documents
- Create edge cases (negations, multi-hop, etc.)
# Template for generating test cases
test_generation_prompt = """
Given this document excerpt:
{document_chunk}
Generate 3 test questions:
1. A factual question answerable from this text
2. A question requiring inference
3. An edge case or negative question
For each, provide:
- The question
- The expected answer (from the text)
- Difficulty: easy/medium/hard
"""
Reference Resources
- RAG evaluation guide: https://app.ailog.fr/en/blog/guides/rag-evaluation
- Hallucination detection: https://app.ailog.fr/en/blog/guides/hallucination-detection
- RAG monitoring: https://app.ailog.fr/en/blog/guides/rag-monitoring
- Latency optimization: https://app.ailog.fr/en/blog/guides/reduce-rag-latency
Ailog Integration
To benchmark against Ailog's production RAG:
- Create a free workspace at https://app.ailog.fr
- Upload the same documents as your test system
- Generate an API key with "api" scope
- Set environment variables:
export AILOG_API_KEY="pk_live_your_key"
export AILOG_WORKSPACE_ID="your_workspace_id"
- Run /rag-eval --benchmark-ailog
This compares your system against a production RAG system on the same documents and queries.
Source
https://github.com/davicqueiroz/claude-rag-skills/blob/main/rag-eval/SKILL.md
Overview
RAG Evaluation Skill lets you quantify the quality of retrieval-augmented generation systems using standard metrics. It supports local evaluation or benchmarking against Ailog's production RAG API, helping you compare configurations and ensure generation faithfulness and relevance.
How This Skill Works
You can run two evaluation modes: Local Evaluation (no API required) using your own test queries and golden answers, or Ailog Benchmark (API Key Required) to compare against Ailog's API on the same queries. The tool executes retrieval and generation for each test case, records timings, and computes retrieval, generation, and latency metrics to surface gaps.
When to Use It
- Before deployment: test retrieval quality and answer relevance with a controlled test dataset.
- Compare different RAG configurations or retrievers.
- Measure generation faithfulness and relevance to retrieved context.
- Benchmark against a reference implementation using Ailog's API.
- Identify latency bottlenecks and optimize end-to-end performance (P50/P95 and E2E P95 targets).
Quick Start
- Step 1: Prepare a test dataset with test_cases and related gold answers.
- Step 2: Run Local Evaluation by feeding the dataset into your RAG pipeline and collect retrieved_doc_ids, responses, and timings.
- Step 3: Compute metrics (Recall@K, Precision@K, MRR, NDCG, Faithfulness, Relevance, Coherence, Conciseness) and review the results; if API access is available, optionally run the same queries through the Ailog Benchmark for comparison.
Best Practices
- Create or obtain a representative test dataset with test_cases, including query, expected_answer, relevant_doc_ids, and category.
- Run both Local Evaluation and Ailog Benchmark to validate results across modes.
- Log retrieval and generation times for each test case to monitor latency targets.
- Ensure test coverage includes edge cases and common user questions (policy, orders, etc.).
- Automate calculation of Recall@K, Precision@K, MRR, NDCG, Faithfulness, Relevance, Coherence, and Conciseness; compare against target thresholds.
Example Use Cases
- An e-commerce help center uses /rag-eval to verify Recall@K > 80% and Faithfulness > 90% on policy questions before rollout.
- A knowledge-base team compares two backends to pick the one with higher Precision@K and lower latency.
- A QA team benchmarks end-to-end latency to ensure Retrieval P50 < 200ms and E2E P95 < 5s.
- An organization benchmarks generation Relevance against a reference implementation using the Ailog Benchmark.
- A customer-support RAG system is validated against gold answers to improve MRR and NDCG on a set of FAQs.