kb-retriever

npx machina-cli add skill uts58/my-random-skills/rag-skill --openclaw

Local Knowledge Base Retrieval Skill (kb-retriever)

Knowledge Base Directory Description

  • Knowledge is stored in a root directory containing various file types (e.g., .md / .txt, .pdf, .xlsx, etc.), usually split into multi-level subdirectories by type or business purpose.
  • Hierarchical Directory Index Files:
    • The root directory has a data_structure.md, explaining the main "domain directories" and their purposes.
    • Each domain directory can have its own data_structure.md, explaining the subdirectories/files within it and their respective purposes.
    • Deeper subdirectories can continue to have data_structure.md, forming a multi-level index tree.
  • Knowledge Base Root Directory Conventions:
    • By default, the knowledge base is assumed to be in the knowledge/ directory under the current project root.
    • If the user explicitly specifies another path in the conversation (e.g., "My knowledge base is at /data/kb" or "Use the ./docs directory as the knowledge base"), the user-specified path is used as the root.
    • When the default path knowledge/ does not exist or cannot be accessed, confirm the actual knowledge base root directory with the user instead of guessing.
  • Individual business files may be very large:
    • Do not use Read to read the entire file directly.
    • For PDF and Excel, use the corresponding skills for structured processing first, then combine with grep / partial reading for fine-grained retrieval.

Locating the knowledge Root Directory

  • Prioritize the user: If the user provides a path (e.g., ./docs, ./knowledge-personal), use it directly.
  • Default root directory: Otherwise, the root directory is assumed to be knowledge/ under the current project.
    • Explicitly check if the directory exists using the shell: Prefer test -d knowledge, or failing that, ls -d knowledge.
    • Note: Do not use a pattern such as Glob "knowledge" in . to determine whether the directory exists. Glob returns only file paths, not the directory itself, so an empty result cannot distinguish "directory doesn't exist" from "directory exists but is empty".
  • Only after the root directory is confirmed to exist (via test -d or similar), use Glob to retrieve content within it, passing the directory as the path. For example:
    • Index files: pattern="**/data_structure.md", path="knowledge"
    • All Markdown: pattern="**/*.md", path="knowledge"
  • If the default knowledge/ does not exist (test -d fails): Do not guess other directories. Clearly tell the user that the default root directory was not found and ask them to specify the actual knowledge base path.
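
The root-location rules above can be sketched in Python as follows. This is only an illustration: the function names resolve_kb_root and list_index_files are hypothetical, and Path.is_dir() stands in for the shell check test -d.

```python
from pathlib import Path
from typing import Optional


def resolve_kb_root(user_path: Optional[str] = None, default: str = "knowledge") -> Path:
    """Return the knowledge base root, or raise if it cannot be confirmed."""
    # A user-supplied path always wins over the default.
    root = Path(user_path) if user_path else Path(default)
    # Equivalent of `test -d`: the directory itself must exist.
    if not root.is_dir():
        raise FileNotFoundError(
            f"Knowledge base root '{root}' not found; ask the user for the actual path."
        )
    return root


def list_index_files(root: Path) -> list[Path]:
    """Glob for index files only after the root is confirmed to exist."""
    return sorted(root.glob("**/data_structure.md"))
```

Raising an error instead of falling back to another directory mirrors the "do not guess" rule: the caller is expected to turn the error into a question for the user.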

Key Principle: Learn First, Process Later

Mandatory Checklist when encountering PDF or Excel files:

  • ✅ Read the corresponding references document to learn processing methods.
  • ✅ Understand the recommended tools and commands.
  • ✅ Complete file processing (extraction/conversion).
  • ⏭️ Now retrieval can begin.

Prohibited Actions:

  • ❌ Attempting to process a PDF without reading pdf_reading.md.
  • ❌ Attempting to process an Excel without reading excel_reading.md.
  • ❌ Skipping the file processing step and retrieving directly from the original PDF/Excel.

Overall Process

  1. Understand User Requirements

    • Read the user's question and extract:
      • Topic/Domain keywords (e.g., "sales report", "system architecture", "API documentation").
      • Time or scope constraints (e.g., "2023 Q1", "recent version").
      • Desired output type (explanation, summary, specific field values, etc.).
    • Determine the knowledge base root directory:
      • Check if the user specified a knowledge base path in the question first.
      • Otherwise, use the default root directory knowledge/.
      • If the default root directory doesn't exist or the structure is abnormal, ask the user for confirmation instead of making assumptions.
  2. Hierarchically Review Directory Index data_structure.md

    • Use the concept of a "Current Working Directory":
      • Start from the knowledge base root directory: the user-specified path if one was given, otherwise the default knowledge/.
    • If data_structure.md exists in the current working directory:
      • Use Read to read the first few hundred lines (e.g., limit=300), and continue reading in segments if necessary.
      • Goal:
        • Understand what subdirectories and files are in the current directory.
        • Understand the purpose description for each subdirectory/file.
      • Based on the user's question, select the most relevant subdirectories or files as candidates.
    • For candidate subdirectories:
      • Recursively enter the subdirectory, treating it as the new "Current Working Directory", and repeat the process of searching for data_structure.md.
      • During recursion, avoid diving into all branches at once; prioritize paths most relevant to the question.
    • For candidate business files (md/text, PDF, Excel, etc.):
      • After completing the necessary directory level exploration, collect these files as the final retrieval target list.
    • When prioritizing:
      • Prioritize domain directories and files whose purpose descriptions highly match the question topic.
      • Secondly, consider constraints like time/version (if reflected in the index).
      • General documentation (like README.md, overall design documents) should have lower priority.
  3. Learn File Processing Methods (Mandatory for PDF/Excel)

    • Before processing PDF files:
      • Must read references/pdf_reading.md (note this directory is under Skills, not Knowledge) to learn extraction methods.
      • Key focus: pdftotext command, pdfplumber usage, table extraction methods.
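    • Before processing Excel files:
      • Must read excel_reading.md (the Excel counterpart of pdf_reading.md) to learn reading methods.
      • Key focus: pandas reading methods, column filtering, data filtering, aggregation operations.
    • Purpose: Ensure correct tools and methods are used, avoiding blind retrieval.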
  4. Execute Processing and Retrieval by File Type

    • Use the learned methods to process files (extraction, conversion, structuring).
    • For each category of candidate files, follow the "Markdown/Text", "PDF", or "Excel" strategies below.
    • General Principles:
      • Start with the most relevant and precise files.
      • Perform progressive local retrieval within each file to avoid loading the entire content at once.
      • If the current file doesn't provide satisfactory information, switch to the next candidate file.
  5. Iterative Retrieval

    • All file types use a unified "Multi-round Iterative Retrieval Mechanism" (see Common Retrieval Principles below).
  6. Answer Organization and Sourcing

    • Consolidate the context obtained from multiple rounds of retrieval to answer the user's question comprehensively.
    • Aim to:
      • Provide clear and direct answers.
      • Indicate the filenames used (including approximate locations like sections or line/page numbers where necessary).
    • If the answer is based on inference or incomplete information:
      • Clearly mark assumptions and uncertainties.
      • Prompt the user to provide a more specific file scope or keywords.
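
As a rough illustration of Step 2, the sketch below ranks the lines of a data_structure.md index by how many question keywords each one mentions. The layout of the index file is not specified here, so the one-entry-per-line format in the example is an assumption, and rank_index_entries is a hypothetical helper name.

```python
def rank_index_entries(lines: list[str], keywords: list[str]) -> list[tuple[int, str]]:
    """Score each index line by the number of question keywords it mentions."""
    lowered = [kw.lower() for kw in keywords]
    scored = []
    for line in lines:
        text = line.lower()
        score = sum(1 for kw in lowered if kw in text)
        if score:
            scored.append((score, line.strip()))
    # Highest score first; sorted() is stable, so ties keep their index order.
    return sorted(scored, key=lambda item: -item[0])


# Assumed index layout: "<name>: <purpose description>" on each line.
index_lines = [
    "sales/: quarterly sales report workbooks",
    "design/: system architecture notes",
    "README.md: general overview",
]
candidates = rank_index_entries(index_lines, ["sales", "report"])
```

In practice the lines would come from the first few hundred lines of the index file, and only the top-ranked subdirectories would be entered on the next recursion step.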

Common Retrieval Principles

Keyword Selection Strategy

  • Extract 3-8 keywords from the user's question (including possible English abbreviations, synonyms, hypernyms/hyponyms).
  • Use combinable phrases (e.g., "sales report", "API interface timeout").
  • Include business terms, technical jargon, and common abbreviations (e.g., "uv", "pv", "GMV") as necessary.

grep Retrieval Basic Principles

  • Always specify the most precise include and path possible, avoiding searching the entire directory.
  • Prioritize core nouns and terms from the question in the pattern, then try synonyms.
  • For each hit, only read the local area around the match (a few lines above and below).
  • Save "Filename + Position Information + Text Snippet".

Multi-round Iterative Retrieval Mechanism (Max 5 times)

A unified iterative strategy for all file types:

  1. Iteration Control
    • Maintain an "Attempted Retrieval Count", max 5 times.
    • Increment the count after each retrieval.
  2. Each Round Workflow
    1. Generate/update retrieval keywords based on the question (can include synonyms, extensions).
    2. Select files or file parts that haven't been fully retrieved yet.
    3. Execute retrieval (grep / partial read / specialized Skill call).
    4. Analyze the obtained context snippets.
    5. Determine if it's enough to answer the question.
  3. Termination Conditions
    • Found enough context to support the answer; OR
    • Reached 5 attempts without finding suitable information.
  4. Handling Insufficient Information
    • Clearly inform the user if information is missing or might not be in the current knowledge base.
    • Provide the closest information found and explain the uncertainty.
    • Suggest how the user can narrow the scope (more specific filenames, keywords, time range, etc.).
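
The iteration control above amounts to a bounded loop. The sketch below assumes the caller supplies two hypothetical callbacks: search, which runs one retrieval round for a keyword set, and is_sufficient, which judges whether the collected snippets can answer the question.

```python
from typing import Callable, Iterable

MAX_ATTEMPTS = 5  # upper bound taken from the mechanism described above


def iterative_retrieve(
    keyword_rounds: Iterable[list[str]],
    search: Callable[[list[str]], list[str]],
    is_sufficient: Callable[[list[str]], bool],
    max_attempts: int = MAX_ATTEMPTS,
) -> tuple[list[str], int, bool]:
    """Run retrieval rounds until the context suffices or the budget is spent."""
    collected: list[str] = []
    attempts = 0
    for keywords in keyword_rounds:
        if attempts >= max_attempts:
            break
        collected.extend(search(keywords))
        attempts += 1  # increment the count after each retrieval
        if is_sufficient(collected):
            return collected, attempts, True
    # Budget exhausted or no keyword sets left: report what was found.
    return collected, attempts, False
```

When the third element of the result is False, the caller should follow the "Handling Insufficient Information" steps rather than fabricate an answer.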

Important Notes

  • Never call Glob "knowledge" in ., or otherwise use Glob, as the first check for whether a directory exists. Check directory existence with shell commands (e.g., test -d).
  • When using this Skill to query the knowledge base, do not use other tools such as web search to obtain knowledge.

Specific Strategies for Different File Types

1. Markdown / Text Files (.md, .txt, .log, etc.)

  1. Candidate File Selection

    • Judge relevance based on data_structure.md, filenames, and paths.
    • Prioritize retrieving title and index files (e.g., summary documents, design overviews).
  2. grep Positioning and Partial Reading

    • Use the Grep tool on specified candidate files, limiting include to specific suffixes (e.g., "*.md").
    • For files with matches, use Read to only read the local area around the match:
      • Control reading via line number offset and limit (e.g., read a few dozen lines before and after the matching line).
      • Avoid reading the entire file.
  3. Special Handling

    • If the content is just a table of contents/title, continue to locate deeper content based on links or section names.
    • Apply the "Multi-round Iterative Retrieval Mechanism" (see Common Retrieval Principles above).

2. PDF File Retrieval Strategy

Workflow:

  1. First: Read Processing Guide

    • Before processing any PDF, must read references/pdf_reading.md (note this directory is under Skills, not Knowledge).
    • Key focus: pdftotext command, pdfplumber usage, table extraction methods, Quick Decision Table.
  2. Select Candidate PDFs

    • Based on the description in data_structure.md, select the 1-3 most relevant files.
    • If the user specifies a particular PDF file, prioritize that file.
  3. Apply Learned Methods to Extract Text

    • Use tools recommended in pdf_reading.md (prefer pdftotext or pdfplumber).
    • IMPORTANT: Use pdftotext input.pdf output.txt to extract text to a file; do not output directly to stdout (to avoid consuming many tokens).
    • If tables need to be extracted, use pdfplumber's table extraction functionality.
  4. Execute Retrieval on Extraction Results

    • Use grep to perform keyword searches on the extracted text.
    • For each hit, extract the context around the match (dozens of lines above/below or adjacent pages).
    • Save "Filename + Page Number/Approximate Position + Text Snippet".
    • Apply the "Multi-round Iterative Retrieval Mechanism" (see Common Retrieval Principles above).
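
A minimal sketch of steps 3 and 4, assuming the pdftotext command-line tool is installed and that it writes its default output, in which pages are separated by a form feed character. The helper names extract_pdf_text and find_pages are hypothetical; pdf_reading.md remains the authoritative guide for the actual commands.

```python
import subprocess
from pathlib import Path


def extract_pdf_text(pdf_path: Path, txt_path: Path) -> Path:
    """Run `pdftotext input.pdf output.txt` so the text lands in a file, not stdout."""
    subprocess.run(["pdftotext", str(pdf_path), str(txt_path)], check=True)
    return txt_path


def find_pages(text: str, keyword: str) -> list[int]:
    """Return 1-based page numbers whose extracted text mentions the keyword."""
    # Assumption: pages are delimited by form feed ("\f"), pdftotext's default.
    pages = text.split("\f")
    needle = keyword.lower()
    return [n for n, page in enumerate(pages, start=1) if needle in page.lower()]
```

The page numbers returned by find_pages give the "Page Number/Approximate Position" part of each saved record.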

3. Excel File Retrieval Strategy

Workflow:

  1. First: Read Processing Guide

    • Before processing any Excel file, must read excel_reading.md.
    • Key focus: pandas reading methods, column filtering, data filtering, aggregation operations.
  2. Select Candidate Excels

    • Based on data_structure.md and file/worksheet names, select the most relevant sheets.
    • Prioritize workbooks/sheets containing keywords like "report", "statistics", "log", "config", "mapping", etc.
    • If the user specifies a particular Excel file, prioritize that file.
  3. Apply Learned Methods to Explore Structure

    • Use pandas to read the first 10-50 rows (using the nrows parameter).
    • Key things to learn: column/field names, data types (numeric, date, text), key fields.
    • Compare column names with the user's question to identify potential key fields (e.g., "revenue", "sales", "error_code", etc.).
  4. Execute Data Retrieval and Analysis

    • Use learned pandas methods for filtering and aggregation (e.g., df[df['column'] == value]).
    • Read only the data around the matching rows at a time, avoiding reading the entire sheet at once.
    • If the question includes a time range, add time filtering to the retrieval.
    • Apply the "Multi-round Iterative Retrieval Mechanism" (see Common Retrieval Principles above).
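
The structure-exploration step might look like the sketch below, assuming pandas (plus an Excel engine such as openpyxl) is available. preview_sheet and match_columns are hypothetical helper names; excel_reading.md remains the authoritative guide.

```python
def preview_sheet(path: str, sheet_name=0, nrows: int = 20):
    """Read only the first rows of a worksheet to learn its columns and dtypes."""
    # Third-party dependency, imported lazily so match_columns works without it.
    import pandas as pd

    return pd.read_excel(path, sheet_name=sheet_name, nrows=nrows)


def match_columns(columns: list[str], keywords: list[str]) -> list[str]:
    """Return the column names that mention any keyword from the user's question."""
    lowered = [kw.lower() for kw in keywords]
    return [col for col in columns if any(kw in str(col).lower() for kw in lowered)]
```

A typical flow would call preview_sheet on the candidate workbook, pass list(df.columns) to match_columns, and then filter on the matched fields with an expression such as df[df['column'] == value].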

Synergy with Other Tools

PDF Processing

  • Must read references/pdf_reading.md before processing PDF to learn methods.
  • Use pdfplumber / pypdf for text extraction, table extraction, and metadata reading.
  • Prioritize using the pdftotext command-line tool for fast text extraction.

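Excel Processing

  • Must read excel_reading.md before processing Excel to learn methods.
  • Use pandas for reading, column filtering, data filtering, and aggregation; preview the first rows with the nrows parameter before querying.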

Tool Usage Principles

  • Grep: Used to find line numbers and matching snippets by keyword in specified files; always specify the most precise include and path possible.
  • Read: Only used for partial file reading; always set a reasonable limit (e.g., 200-500 lines) and appropriate offset.
  • For any file that may be large:
    • Prohibit reading directly from beginning to end.
    • Always narrow the range via indexes, table of contents, or keywords before reading.
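
Partial reading with an offset and a limit can be sketched in Python with itertools.islice. The function name read_window is hypothetical, and its offset/limit parameters only mimic the behaviour described for the Read tool.

```python
from itertools import islice
from pathlib import Path


def read_window(path: Path, offset: int = 0, limit: int = 200) -> list[tuple[int, str]]:
    """Return up to `limit` lines after skipping `offset` lines, with 1-based numbers."""
    with path.open(encoding="utf-8") as fh:
        # islice pulls only the requested window out of the file iterator.
        window = islice(fh, offset, offset + limit)
        return [(offset + i, line.rstrip("\n")) for i, line in enumerate(window, start=1)]
```

Given a grep hit on line 100, a call such as read_window(path, offset=80, limit=40) would return lines 81 to 120 and nothing else.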

Answer Style and Error Handling

  • Answer Style
    • Try to answer in the language the user asked in (Chinese/English).
    • Give the conclusion first, then the brief rationale.
    • If needed, list the files cited and their approximate locations at the end, for example:
      • Source: design/api_gateway.md near line 100
      • Source: reports/2023_Q1_sales.xlsx Summary worksheet
  • Missing or Uncertain Information
    • Clearly state if no exact match was found in the current knowledge base or if only a partial answer can be given.
    • Do not fabricate facts.
    • Suggest how the user can help narrow the scope:
      • Specify a more specific directory/file.
      • Provide more precise keywords or field names.
      • Specify a time/version range.

Source

https://github.com/uts58/my-random-skills/blob/main/rag-skill/SKILL.md

Overview

kb-retriever is a retrieval and Q&A assistant designed to extract answers and data from a local knowledge base directory. It uses hierarchical index files to navigate domains, processes PDFs/Excels with reference-guided methods, and performs progressive retrieval using grep, Read, pdfplumber, and pandas to avoid loading entire files. This approach enables precise, fast answers and data retrieval from multi-level knowledge bases.

How This Skill Works

The tool navigates a multi-level directory tree guided by data_structure.md index files. For PDFs and Excel files, it first reads the relevant reference documents (e.g., pdf_reading.md, excel_reading.md) to learn processing methods, then uses selective reading (grep, Read, pdfplumber, pandas) to extract only the needed information. All file processing occurs before retrieval to minimize unnecessary data loading.

When to Use It

  • You need to answer questions or retrieve data from a knowledge base directory.
  • You must locate information across multiple domain subfolders defined by data_structure.md.
  • You are dealing with large PDFs or Excel files and want to avoid loading the entire file.
  • You want to verify or understand the knowledge base structure by inspecting domain-specific data_structure.md files.
  • You need to confirm the knowledge root path and avoid guessing when knowledge/ does not exist.

Quick Start

  1. Confirm the knowledge root (default to knowledge/); use test -d knowledge to verify.
  2. Let kb-retriever read data_structure.md files to build the hierarchical index.
  3. Ask a question or request specific data to begin progressive retrieval.

Best Practices

  • Always verify the knowledge root directory with test -d knowledge; if it fails, ask the user for the correct path.
  • Read the domain-specific data_structure.md files to understand what each folder contains before retrieval.
  • For PDFs/Excel, consult the relevant reading references (pdf_reading.md and excel_reading.md) prior to processing.
  • Use progressive retrieval methods (grep, Read, pdfplumber, pandas) to pull only the required data, avoiding full-file loads.
  • Respect the hierarchical index navigation and avoid guessing the root path when the default is unavailable.

Example Use Cases

  • Answer: What is the latest Q1 revenue figure found in the knowledge base under knowledge/sales/reports/ and related subfolders?
  • Extract: Retrieve HR policy details from multiple subdirectories to compile a summary for a new onboarding guide.
  • Analyze: Pull key performance indicators from a large multi-page PDF report without loading the entire file.
  • Verify: Confirm the structure of the knowledge base by inspecting data_structure.md in the domain directories.
  • Locate: Find the exact contact information for a product support team stored across several domain files.
