
RAG Eval

npx machina-cli add skill davicqueiroz/claude-rag-skills/rag-eval --openclaw

RAG Evaluation Skill

Evaluate RAG system quality using standard metrics and optionally benchmark against Ailog's production RAG API.

When to Use

Use /rag-eval when:

  • Testing retrieval quality before deployment
  • Comparing different RAG configurations
  • Measuring generation faithfulness and relevance
  • Benchmarking your system against a reference implementation

Evaluation Modes

Mode 1: Local Evaluation (No API Required)

Analyze your RAG system's behavior using test queries and golden answers you provide.

Mode 2: Ailog Benchmark (API Key Required)

Compare your system's responses against Ailog's RAG API for the same queries.

Metrics Evaluated

Retrieval Metrics

| Metric | Description | Target |
|--------|-------------|--------|
| Recall@K | % of relevant docs in top K results | > 80% |
| Precision@K | % of top K results that are relevant | > 70% |
| MRR | Mean Reciprocal Rank of first relevant result | > 0.7 |
| NDCG | Normalized Discounted Cumulative Gain | > 0.75 |
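
For example, if 2 of 3 relevant documents appear in the top 5 results, Recall@5 = 2/3 ≈ 67% and Precision@5 = 2/5 = 40%; if the first relevant document is ranked 2nd, that query contributes 1/2 = 0.5 to MRR.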

Generation Metrics

| Metric | Description | Target |
|--------|-------------|--------|
| Faithfulness | Response grounded in retrieved context | > 90% |
| Relevance | Response answers the question | > 85% |
| Coherence | Response is well-structured | > 80% |
| Conciseness | No unnecessary information | > 75% |

Latency Metrics

| Metric | Description | Target |
|--------|-------------|--------|
| Retrieval P50 | Median retrieval time | < 200ms |
| Retrieval P95 | 95th percentile retrieval time | < 500ms |
| Generation P50 | Median generation time | < 2s |
| E2E P95 | End-to-end 95th percentile | < 5s |
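
As a minimal sketch of computing these percentiles from the timings recorded in the evaluation loop of Step 2 below (which stores retrieval_time_ms per result), the standard library is enough:

import statistics

def percentile(values, pct):
    """Return the pct-th percentile (1-99) of a list of numbers."""
    # quantiles with n=100 returns 99 cut points, one per percentile
    return statistics.quantiles(sorted(values), n=100)[int(pct) - 1]

retrieval_times = [r["retrieval_time_ms"] for r in results]
print("Retrieval P50:", percentile(retrieval_times, 50), "ms")
print("Retrieval P95:", percentile(retrieval_times, 95), "ms")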

How to Run Evaluation

Step 1: Prepare Test Dataset

Ask the user for a test dataset, or help them create one:

{
  "test_cases": [
    {
      "query": "What is the return policy?",
      "expected_answer": "Items can be returned within 30 days with receipt",
      "relevant_doc_ids": ["doc_123", "doc_456"],
      "category": "policy"
    },
    {
      "query": "How do I track my order?",
      "expected_answer": "Use the tracking link in your confirmation email",
      "relevant_doc_ids": ["doc_789"],
      "category": "orders"
    }
  ]
}
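
One way to load this file so the evaluation loop below can use attribute access (test_case.query, test_case.relevant_doc_ids) is a small dataclass; the filename test_dataset.json is just an assumption:

import json
from dataclasses import dataclass

@dataclass
class TestCase:
    query: str
    expected_answer: str
    relevant_doc_ids: list
    category: str

def load_test_dataset(path="test_dataset.json"):
    """Load the JSON structure shown above into TestCase objects."""
    with open(path) as f:
        data = json.load(f)
    return [TestCase(**case) for case in data["test_cases"]]

test_dataset = load_test_dataset()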

If no test dataset exists, offer to generate one:

  1. Analyze indexed documents
  2. Generate representative questions
  3. Create expected answers from document content

Step 2: Run Local Evaluation

Execute the user's RAG pipeline on each test case:

# Evaluation loop: run retrieval and generation for each test case
import time

results = []
for test_case in test_dataset:
    # Run retrieval (rag_system is your own pipeline exposing retrieve() and generate())
    start = time.time()
    retrieved_docs = rag_system.retrieve(test_case.query)
    retrieval_time = time.time() - start

    # Run generation
    start = time.time()
    response = rag_system.generate(test_case.query, retrieved_docs)
    generation_time = time.time() - start

    # Record everything needed for retrieval, generation and latency metrics
    results.append({
        "query": test_case.query,
        "retrieved_doc_ids": [d.id for d in retrieved_docs],
        "expected_doc_ids": test_case.relevant_doc_ids,
        "response": response,
        "expected_answer": test_case.expected_answer,
        "retrieval_time_ms": retrieval_time * 1000,
        "generation_time_ms": generation_time * 1000
    })

Step 3: Compute Metrics

For each result, compute:

Retrieval Metrics:

def recall_at_k(retrieved_ids, relevant_ids, k):
    retrieved_set = set(retrieved_ids[:k])
    relevant_set = set(relevant_ids)
    return len(retrieved_set & relevant_set) / len(relevant_set)

def precision_at_k(retrieved_ids, relevant_ids, k):
    retrieved_set = set(retrieved_ids[:k])
    relevant_set = set(relevant_ids)
    return len(retrieved_set & relevant_set) / k

def mrr(retrieved_ids, relevant_ids):
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0
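
NDCG appears in the retrieval table above but is not shown here; a minimal sketch assuming binary relevance (1 if a document is in relevant_ids, else 0):

import math

def ndcg_at_k(retrieved_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    relevant_set = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(retrieved_ids[:k])
        if doc_id in relevant_set
    )
    ideal_hits = min(len(relevant_set), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0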

Generation Metrics (LLM-as-judge):

Evaluate the following response for faithfulness to the context:

Context: {retrieved_context}
Question: {query}
Response: {response}

Score from 0-100 on:
1. Faithfulness: Is the response supported by the context?
2. Relevance: Does it answer the question?
3. Coherence: Is it well-structured?
4. Conciseness: Is it appropriately brief?
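
A sketch of driving this prompt as an LLM-as-judge step; call_judge_llm is a placeholder for whatever model client you use, and asking the judge for a JSON reply is an assumption made here for easy parsing:

import json

JUDGE_PROMPT = """Evaluate the following response for faithfulness to the context:

Context: {retrieved_context}
Question: {query}
Response: {response}

Score from 0-100 on faithfulness, relevance, coherence and conciseness.
Reply with JSON only, e.g. {{"faithfulness": 90, "relevance": 85, "coherence": 80, "conciseness": 75}}"""

def judge_response(retrieved_context, query, response, call_judge_llm):
    """call_judge_llm: your own function that sends a prompt to a model and returns its text."""
    prompt = JUDGE_PROMPT.format(
        retrieved_context=retrieved_context, query=query, response=response
    )
    return json.loads(call_judge_llm(prompt))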

Step 4: Ailog Benchmark (Optional)

If the user has an Ailog API key, compare results:

# Environment variables required
AILOG_API_KEY=pk_live_xxxxx
AILOG_WORKSPACE_ID=123

API Call:

import httpx

async def benchmark_with_ailog(query: str, api_key: str, workspace_id: int):
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.ailog.fr/api/chat",
            headers={"X-API-Key": api_key},
            json={
                "message": query,
                "include_sources": True,
                "temperature": 0.3,
                "max_tokens": 500
            },
            timeout=30.0
        )
        return response.json()
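
For example, reading the environment variables set above and sending a single query (a sketch; the query string is just an illustration):

import asyncio
import os

ailog_result = asyncio.run(
    benchmark_with_ailog(
        query="What is the return policy?",
        api_key=os.environ["AILOG_API_KEY"],
        workspace_id=int(os.environ["AILOG_WORKSPACE_ID"]),
    )
)
print(ailog_result)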

Comparison Output:

## Benchmark Comparison: Your System vs Ailog

| Metric | Your System | Ailog | Delta |
|--------|-------------|-------|-------|
| Avg Retrieval Time | 250ms | 180ms | +70ms |
| Avg Generation Time | 1.8s | 1.2s | +0.6s |
| Faithfulness | 82% | 91% | -9% |
| Relevance | 78% | 88% | -10% |

### Analysis
Your retrieval is likely slower due to [X]. Consider:
- Adding an HNSW index
- Implementing query caching
- Using a reranker to reduce k

Your generation faithfulness is lower. Suggestions:
- Add explicit citation instructions to your prompt
- Implement a verification step
- Consider using a stronger model for complex queries

Output Format

# RAG Evaluation Report

**Date**: 2026-01-18
**Test Cases**: 50
**Duration**: 45.2s

## Summary Scores

| Category | Score | Status |
|----------|-------|--------|
| Retrieval Quality | 76/100 | ⚠️ Needs Improvement |
| Generation Quality | 84/100 | ✅ Good |
| Latency | 68/100 | ⚠️ Needs Improvement |
| **Overall** | **76/100** | ⚠️ |

## Retrieval Metrics
- Recall@5: 72% (target: 80%)
- Precision@5: 65% (target: 70%)
- MRR: 0.68 (target: 0.70)

## Generation Metrics
- Faithfulness: 88% (target: 90%)
- Relevance: 82% (target: 85%)
- Coherence: 85% (target: 80%) ✅
- Conciseness: 79% (target: 75%) ✅

## Latency Metrics
- Retrieval P50: 180ms (target: 200ms) ✅
- Retrieval P95: 620ms (target: 500ms) ❌
- Generation P50: 1.4s (target: 2s) ✅
- E2E P95: 5.8s (target: 5s) ❌

## Failed Test Cases

### Query: "What happens if I lose my receipt?"
- **Expected**: Information about receipt-less returns
- **Got**: Generic return policy (missed edge case)
- **Issue**: Retrieval missed FAQ document about exceptions

## Recommendations

1. **Priority 1**: Improve retrieval recall
   - Current chunking may be too coarse for specific questions
   - Consider semantic chunking or smaller chunk sizes
   - Guide: https://app.ailog.fr/en/blog/guides/chunking-strategies

2. **Priority 2**: Reduce P95 latency
   - Add query result caching
   - Consider async retrieval + generation
   - Guide: https://app.ailog.fr/en/blog/guides/reduce-rag-latency

3. **Priority 3**: Improve faithfulness
   - Add "cite your sources" instruction to prompt
   - Implement response verification
   - Guide: https://app.ailog.fr/en/blog/guides/hallucination-detection

Creating a Test Dataset

If the user doesn't have test data, help generate it:

  1. Scan indexed documents for key topics
  2. Generate questions that a user might ask
  3. Extract answers from the documents
  4. Create edge cases (negations, multi-hop, etc.)

# Template for generating test cases
test_generation_prompt = """
Given this document excerpt:
{document_chunk}

Generate 3 test questions:
1. A factual question answerable from this text
2. A question requiring inference
3. An edge case or negative question

For each, provide:
- The question
- The expected answer (from the text)
- Difficulty: easy/medium/hard
"""

Reference Resources

Ailog Integration

To benchmark against Ailog's production RAG:

  1. Create a free workspace at https://app.ailog.fr
  2. Upload the same documents as your test system
  3. Generate an API key with "api" scope
  4. Set environment variables:
    export AILOG_API_KEY="pk_live_your_key"
    export AILOG_WORKSPACE_ID="your_workspace_id"
    
  5. Run /rag-eval --benchmark-ailog

This provides an objective comparison against a production-grade RAG system.

Source

git clone https://github.com/davicqueiroz/claude-rag-skills

View on GitHub: https://github.com/davicqueiroz/claude-rag-skills/blob/main/rag-eval/SKILL.md

Overview

RAG Evaluation Skill lets you quantify the quality of retrieval-augmented generation systems using standard metrics. It supports local evaluation or benchmarking against Ailog's production RAG API, helping you compare configurations and ensure generation faithfulness and relevance.

How This Skill Works

You can run two evaluation modes: Local Evaluation (no API required) using your own test queries and golden answers, or Ailog Benchmark (API Key Required) to compare against Ailog's API on the same queries. The tool executes retrieval and generation for each test case, records timings, and computes retrieval, generation, and latency metrics to surface gaps.

When to Use It

  • Before deployment: test retrieval quality and answer relevance with a controlled test dataset.
  • Compare different RAG configurations or retrievers.
  • Measure generation faithfulness and relevance to retrieved context.
  • Benchmark against a reference implementation using Ailog's API.
  • Identify latency bottlenecks and optimize end-to-end performance (P50/P95 and E2E P95 targets).

Quick Start

  1. Step 1: Prepare a test dataset with test_cases and related gold answers.
  2. Step 2: Run Local Evaluation by feeding the dataset into your RAG pipeline and collect retrieved_doc_ids, responses, and timings.
  3. Step 3: Compute metrics (Recall@K, Precision@K, MRR, NDCG, Faithfulness, Relevance, Coherence, Conciseness) and review the results; optionally run the Ailog Benchmark (Step 4) if API access is available.

Best Practices

  • Create or obtain a representative test dataset with test_cases, including query, expected_answer, relevant_doc_ids, and category.
  • Run both Local Evaluation and Ailog Benchmark to validate results across modes.
  • Log retrieval and generation times for each test case to monitor latency targets.
  • Ensure test coverage includes edge cases and common user questions (policy, orders, etc.).
  • Automate calculation of Recall@K, Precision@K, MRR, NDCG, Faithfulness, Relevance, Coherence, and Conciseness; compare against target thresholds.

Example Use Cases

  • An e-commerce help center uses /rag-eval to verify Recall@K > 80% and Faithfulness > 90% on policy questions before rollout.
  • A knowledge-base team compares two backends to pick the one with higher Precision@K and lower latency.
  • A QA team benchmarks end-to-end latency to ensure Retrieval P50 < 200ms and E2E P95 < 5s.
  • An organization benchmarks generation relevance against a reference implementation via the Ailog Benchmark mode.
  • A customer-support RAG system is validated against gold answers to improve MRR and NDCG on a set of FAQs.
