kb-retriever
npx machina-cli add skill uts58/my-random-skills/rag-skill --openclaw
Local Knowledge Base Retrieval Skill (kb-retriever)
Knowledge Base Directory Description
- Knowledge is stored in a root directory containing various file types (e.g., .md/.txt, .pdf, .xlsx), usually split into multi-level subdirectories by type or business purpose.
- Hierarchical Directory Index Files:
  - The root directory has a data_structure.md explaining the main "domain directories" and their purposes.
  - Each domain directory can have its own data_structure.md explaining the subdirectories/files within it and their respective purposes.
  - Deeper subdirectories can continue to have data_structure.md files, forming a multi-level index tree.
- Knowledge Base Root Directory Conventions:
  - By default, the knowledge base is assumed to be in the knowledge/ directory under the current project root.
  - If the user explicitly specifies another path in the conversation (e.g., "My knowledge base is at /data/kb" or "Use the ./docs directory as the knowledge base"), the user-specified path is used as the root.
  - When the default path knowledge/ does not exist or cannot be accessed, confirm the actual knowledge base root directory location with the user instead of guessing.
- Individual business files may be very large:
  - Do not use Read to read an entire file directly.
  - For PDF and Excel, use the corresponding skills for structured processing first, then combine with grep / partial reading for fine-grained retrieval.
Locating the Knowledge Root Directory
- Prioritize the user: If the user provides a path (e.g., ./docs, ./knowledge-personal), use it directly.
- Default root directory: Otherwise, the root directory is assumed to be knowledge/ under the current project.
  - Explicitly check whether the directory exists using the shell: prefer test -d knowledge, or failing that, ls -d knowledge.
  - Note: Do not use patterns like Glob "knowledge" in . to determine whether a directory exists. Glob only returns file paths, not the directory itself; an empty result doesn't distinguish between "directory doesn't exist" and "directory exists but is empty".
- Only when the root directory is confirmed to exist via test -d or similar, use Glob to retrieve content within that directory, passing the directory as the path. For example:
  - Index files: pattern="**/data_structure.md", path="knowledge"
  - All Markdown: pattern="**/*.md", path="knowledge"
- If the default knowledge/ does not exist (test -d fails): Do not guess other directories. Clearly tell the user that the default root directory was not found and ask them to specify the actual knowledge base path.
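The root-resolution rules above can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: the function names resolve_kb_root and find_index_files are hypothetical, and Path.is_dir() stands in for the test -d check described in the text.

```python
from pathlib import Path
from typing import List, Optional

DEFAULT_ROOT = "knowledge"  # default root directory named in the text


def resolve_kb_root(user_path: Optional[str] = None,
                    project_root: str = ".") -> Optional[Path]:
    """Return the knowledge base root, or None when the user must be asked."""
    if user_path:
        # A user-supplied path takes priority over the default.
        candidate = Path(user_path)
    else:
        candidate = Path(project_root) / DEFAULT_ROOT
    # is_dir() plays the role of `test -d`: it is True for an existing
    # directory even when that directory is empty.
    return candidate if candidate.is_dir() else None


def find_index_files(root: Path) -> List[Path]:
    """Rough equivalent of pattern="**/data_structure.md" with path=root."""
    return sorted(root.glob("**/data_structure.md"))
```

A None result means "ask the user for the real path"; the sketch deliberately never falls back to another directory.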
Key Principle: Learn First, Process Later
Mandatory Checklist when encountering PDF or Excel files:
- ✅ Read the corresponding references document to learn processing methods.
- ✅ Understand the recommended tools and commands.
- ✅ Complete file processing (extraction/conversion).
- ⏭️ Now retrieval can begin.
Prohibited Actions:
- ❌ Attempting to process a PDF without reading pdf_reading.md.
- ❌ Attempting to process an Excel file without reading excel_reading.md.
- ❌ Skipping the file processing step and retrieving directly from the original PDF/Excel.
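One way to enforce the "learn first" rule is a lookup from file extension to the reference documents that must be read beforehand. The sketch below is illustrative only: references_to_read is a hypothetical helper, the reference paths are the ones this document names, and only the .pdf and .xlsx extensions mentioned in the text are covered.

```python
from pathlib import Path
from typing import List, Set

# Reference documents named in this skill; the references/ directory lives
# under the skill itself, not under the knowledge base.
REQUIRED_REFERENCES = {
    ".pdf": ["references/pdf_reading.md"],
    ".xlsx": ["references/excel_reading.md", "references/excel_analysis.md"],
}


def references_to_read(file_path: str, already_read: Set[str]) -> List[str]:
    """List the reference documents still unread for this file type."""
    suffix = Path(file_path).suffix.lower()
    return [ref for ref in REQUIRED_REFERENCES.get(suffix, [])
            if ref not in already_read]
```

An empty list means processing may start; Markdown and text files need no reference document under these assumptions.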
Overall Process
- Understand User Requirements
  - Read the user's question and extract:
    - Topic/Domain keywords (e.g., "sales report", "system architecture", "API documentation").
    - Time or scope constraints (e.g., "2023 Q1", "recent version").
    - Desired output type (explanation, summary, specific field values, etc.).
  - Determine the knowledge base root directory:
    - Check first whether the user specified a knowledge base path in the question.
    - Otherwise, use the default root directory knowledge/.
    - If the default root directory doesn't exist or the structure is abnormal, ask the user for confirmation instead of making assumptions.
- Hierarchically Review Directory Index data_structure.md
  - Use the concept of a "Current Working Directory":
    - Start from the user-specified knowledge base root directory; if none was specified, start from the default knowledge/ root.
  - If data_structure.md exists in the current working directory:
    - Use Read to read the first few hundred lines (e.g., limit=300), and continue reading in segments if necessary.
    - Goal:
      - Understand what subdirectories and files are in the current directory.
      - Understand the purpose description for each subdirectory/file.
      - Based on the user's question, select the most relevant subdirectories or files as candidates.
  - For candidate subdirectories:
    - Recursively enter the subdirectory, treating it as the new "Current Working Directory", and repeat the process of searching for data_structure.md.
    - During recursion, avoid diving into all branches at once; prioritize paths most relevant to the question.
  - For candidate business files (md/text, PDF, Excel, etc.):
    - After completing the necessary directory level exploration, collect these files as the final retrieval target list.
  - When prioritizing:
    - Prioritize domain directories and files whose purpose descriptions highly match the question topic.
    - Secondly, consider constraints like time/version (if reflected in the index).
    - General documentation (like README.md, overall design documents) should have lower priority.
- Learn File Processing Methods (Mandatory for PDF/Excel)
  - Before processing PDF files:
    - Must read references/pdf_reading.md (note this directory is under Skills, not Knowledge) to learn extraction methods.
    - Key focus: pdftotext command, pdfplumber usage, table extraction methods.
  - Before processing Excel files:
    - Must read references/excel_reading.md to learn reading methods.
    - Must read references/excel_analysis.md to learn analysis methods.
    - Key focus: pandas reading, column filtering, data filtering.
  - Purpose: Ensure correct tools and methods are used, avoiding blind retrieval.
- Execute Processing and Retrieval by File Type
  - Use the learned methods to process files (extraction, conversion, structuring).
  - For each category of candidate files, follow the "Markdown/Text", "PDF", or "Excel" strategies below.
  - General Principles:
    - Start with the most relevant and precise files.
    - Perform progressive local retrieval within each file to avoid loading the entire content at once.
    - If the current file doesn't provide satisfactory information, switch to the next candidate file.
- Iterative Retrieval
  - All file types use a unified "Multi-round Iterative Retrieval Mechanism" (see Common Retrieval Principles below).
- Answer Organization and Sourcing
  - Consolidate the context obtained from multiple rounds of retrieval to answer the user's question comprehensively.
  - Aim to:
    - Provide clear and direct answers.
    - Indicate the filenames used (including approximate locations like sections or line/page numbers where necessary).
  - If the answer is based on inference or incomplete information:
    - Clearly mark assumptions and uncertainties.
    - Prompt the user to provide a more specific file scope or keywords.
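The hierarchical index review in the process above can be sketched as a breadth-first walk that only follows subdirectories whose index description matches the question. Everything here is an assumption for illustration: the helper names, the keyword-matching heuristic, and the max_depth limit are not part of the skill, and a real index may need smarter matching.

```python
from itertools import islice
from pathlib import Path
from typing import List

INDEX_NAME = "data_structure.md"


def read_head(path: Path, limit: int = 300) -> List[str]:
    """Read at most `limit` lines instead of loading the whole file."""
    with path.open(encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in islice(fh, limit)]


def rank_subdirs(directory: Path, keywords: List[str]) -> List[Path]:
    """Order subdirectories by how many keywords their index lines mention."""
    index = directory / INDEX_NAME
    lines = read_head(index) if index.is_file() else []
    scored = []
    for sub in sorted(p for p in directory.iterdir() if p.is_dir()):
        # Hypothetical heuristic: only index lines naming the subdirectory count.
        described = " ".join(
            line.lower() for line in lines if sub.name.lower() in line.lower()
        )
        score = sum(kw.lower() in described for kw in keywords)
        if score > 0:  # unrelated branches are not explored at all
            scored.append((score, sub))
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep name order
    return [sub for _, sub in scored]


def collect_candidates(root: Path, keywords: List[str],
                       max_depth: int = 3) -> List[Path]:
    """Walk relevant branches breadth-first and gather non-index files."""
    candidates: List[Path] = []
    frontier = [(root, 0)]
    while frontier:
        directory, depth = frontier.pop(0)
        candidates += sorted(
            p for p in directory.iterdir()
            if p.is_file() and p.name != INDEX_NAME
        )
        if depth < max_depth:
            frontier += [(sub, depth + 1)
                         for sub in rank_subdirs(directory, keywords)]
    return candidates
```

The returned list is the "final retrieval target list"; the sketch does not yet demote general documents such as README.md, which the prioritization rules above would place last.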
Common Retrieval Principles
Keyword Selection Strategy
- Extract 3-8 keywords from the user's question (including possible English abbreviations, synonyms, hypernyms/hyponyms).
- Use combinable phrases (e.g., "sales report", "API interface timeout").
- Include business terms, technical jargon, and common abbreviations (e.g., "uv", "pv", "GMV") as necessary.
grep Retrieval Basic Principles
- Always specify the most precise include and path possible, avoiding searching the entire directory.
- Prioritize core nouns and terms from the question in the pattern, then try synonyms.
- For each hit, only read the local area around the match (a few lines above and below).
- Save "Filename + Position Information + Text Snippet".
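As a rough Python analogue of these grep principles, the sketch below scans named candidate files line by line, then re-reads only a small window around each hit and records filename, line number, and snippet. The function names and the record layout are assumptions; the skill itself uses the Grep and Read tools.

```python
import re
from itertools import islice
from pathlib import Path
from typing import Dict, List


def find_hits(path: Path, pattern: str) -> List[int]:
    """Stream the file and return 1-based line numbers that match."""
    regex = re.compile(pattern, re.IGNORECASE)
    with path.open(encoding="utf-8", errors="replace") as fh:
        return [n for n, line in enumerate(fh, start=1) if regex.search(line)]


def snippet(path: Path, line_no: int, context: int = 2) -> str:
    """Read only the lines within `context` of the hit."""
    start = max(line_no - context, 1)
    with path.open(encoding="utf-8", errors="replace") as fh:
        return "".join(islice(fh, start - 1, line_no + context))


def search(paths: List[Path], pattern: str, context: int = 2) -> List[Dict]:
    """Collect "filename + position + text snippet" records for each hit."""
    records = []
    for path in paths:  # callers pass a precise candidate list, not a whole tree
        for line_no in find_hits(path, pattern):
            records.append({
                "file": str(path),
                "line": line_no,
                "snippet": snippet(path, line_no, context),
            })
    return records
```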
Multi-round Iterative Retrieval Mechanism (Max 5 times)
A unified iterative strategy for all file types:
- Iteration Control
  - Maintain an "Attempted Retrieval Count", max 5 times.
  - Increment the count after each retrieval.
- Each Round Workflow
  - Generate/update retrieval keywords based on the question (can include synonyms, extensions).
  - Select files or file parts that haven't been fully retrieved yet.
  - Execute retrieval (grep / partial read / specialized Skill call).
  - Analyze the obtained context snippets.
  - Determine if it's enough to answer the question.
- Termination Conditions
  - Found enough context to support the answer; OR
  - Reached 5 attempts without finding suitable information.
- Handling Insufficient Information
  - Clearly inform the user if information is missing or might not be in the current knowledge base.
  - Provide the closest information found and explain the uncertainty.
  - Suggest how the user can narrow the scope (more specific filenames, keywords, time range, etc.).
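The iteration control above amounts to a bounded loop. The following is a minimal sketch under the assumption that retrieval and the "is this enough?" judgment are supplied as callables; iterative_retrieval, retrieve, and is_sufficient are hypothetical names, not part of the skill.

```python
from typing import Callable, Iterable, List, Tuple

MAX_ATTEMPTS = 5  # the cap stated in the text


def iterative_retrieval(
    keyword_rounds: Iterable[List[str]],
    retrieve: Callable[[List[str]], List[str]],
    is_sufficient: Callable[[List[str]], bool],
) -> Tuple[List[str], int, bool]:
    """Run up to MAX_ATTEMPTS rounds; return (context, attempts, found)."""
    context: List[str] = []
    attempts = 0
    for keywords in keyword_rounds:
        if attempts >= MAX_ATTEMPTS:
            break  # termination: attempt budget exhausted
        context.extend(retrieve(keywords))
        attempts += 1  # increment after each retrieval
        if is_sufficient(context):
            return context, attempts, True  # termination: enough context
    return context, attempts, False
```

When the third element is False, the caller should fall through to the "Handling Insufficient Information" steps instead of answering as if the search had succeeded.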
Important Notes
- Do not call Glob "knowledge" in . or otherwise use Glob as the first check of whether a directory exists. Check directory existence via shell commands (e.g., test -d).
- When using this Skill to query the knowledge base, do not use other tools such as web search to obtain knowledge.
Specific Strategies for Different File Types
1. Markdown / Text Files (.md, .txt, .log, etc.)
- Candidate File Selection
  - Judge relevance based on data_structure.md, filenames, and paths.
  - Prioritize retrieving title and index files (e.g., summary documents, design overviews).
- grep Positioning and Partial Reading
  - Use the Grep tool on specified candidate files, limiting include to specific suffixes (e.g., "*.md").
  - For files with matches, use Read to only read the local area around the match:
    - Control reading via line number offset and limit (e.g., read a few dozen lines before and after the matching line).
    - Avoid reading the entire file.
- Special Handling
  - If the content is just a table of contents/title, continue to locate deeper content based on links or section names.
  - Apply the "Multi-round Iterative Retrieval Mechanism" (see Common Retrieval Principles above).
2. PDF File Retrieval Strategy
Workflow:
- First: Read Processing Guide
  - Before processing any PDF, must read references/pdf_reading.md (note this directory is under Skills, not Knowledge).
  - Key focus: pdftotext command, pdfplumber usage, table extraction methods, Quick Decision Table.
- Select Candidate PDFs
  - Based on the description in data_structure.md, select the 1-3 most relevant files.
  - If the user specifies a particular PDF file, prioritize that file.
- Apply Learned Methods to Extract Text
  - Use tools recommended in pdf_reading.md (prefer pdftotext or pdfplumber).
  - IMPORTANT: Use pdftotext input.pdf output.txt to extract text to a file; do not output directly to stdout (to avoid consuming many tokens).
  - If tables need to be extracted, use pdfplumber's table extraction functionality.
- Execute Retrieval on Extraction Results
  - Use grep to perform keyword searches on the extracted text.
  - For each hit, extract the context around the match (dozens of lines above/below or adjacent pages).
  - Save "Filename + Page Number/Approximate Position + Text Snippet".
  - Apply the "Multi-round Iterative Retrieval Mechanism" (see Common Retrieval Principles above).
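The extract-then-search workflow above might look like the sketch below. It assumes the pdftotext command-line tool is installed and that its output separates pages with form-feed characters (its default behaviour when page breaks are not disabled); the function names and record layout are hypothetical, and references/pdf_reading.md remains the authority on the recommended commands.

```python
import subprocess
from pathlib import Path
from typing import Dict, List


def extract_pdf_text(pdf_path: Path, txt_path: Path) -> Path:
    """Run `pdftotext input.pdf output.txt`, writing to a file, not stdout."""
    subprocess.run(["pdftotext", str(pdf_path), str(txt_path)], check=True)
    return txt_path


def search_extracted_text(txt_path: Path, keyword: str,
                          context: int = 2) -> List[Dict]:
    """Return "filename + page number + text snippet" records for a keyword."""
    text = txt_path.read_text(encoding="utf-8", errors="replace")
    hits = []
    # Assumption: each form feed ("\f") marks the end of one PDF page.
    for page_no, page in enumerate(text.split("\f"), start=1):
        lines = page.splitlines()
        for i, line in enumerate(lines):
            if keyword.lower() in line.lower():
                lo, hi = max(i - context, 0), i + context + 1
                hits.append({
                    "file": txt_path.name,
                    "page": page_no,
                    "snippet": "\n".join(lines[lo:hi]),
                })
    return hits
```

Only the short snippets are returned to the caller, so the full extracted text never has to be placed in the conversation.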
3. Excel File Retrieval Strategy
Workflow:
- First: Read Processing Guide
  - Before processing any Excel file, must read:
    - references/excel_reading.md - Learn how to read worksheets (note this directory is under Skills, not Knowledge).
    - references/excel_analysis.md - Learn how to analyze data (note this directory is under Skills, not Knowledge).
  - Key focus: pandas reading methods, column filtering, data filtering, aggregation operations.
- Select Candidate Excel Files
  - Based on data_structure.md and file/worksheet names, select the most relevant sheets.
  - Prioritize workbooks/sheets containing keywords like "report", "statistics", "log", "config", "mapping", etc.
  - If the user specifies a particular Excel file, prioritize that file.
- Apply Learned Methods to Explore Structure
  - Use pandas to read the first 10-50 rows (using the nrows parameter).
  - Key points to capture: column/field names, data types (numeric, date, text), key fields.
  - Compare column names with the user's question to identify potential key fields (e.g., "revenue", "sales", "error_code", etc.).
- Execute Data Retrieval and Analysis
  - Use learned pandas methods for filtering and aggregation (e.g., df[df['column'] == value]).
  - Read only the data around the matching rows at a time, avoiding reading the entire sheet at once.
  - If the question includes a time range, add time filtering to the retrieval.
  - Apply the "Multi-round Iterative Retrieval Mechanism" (see Common Retrieval Principles above).
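A small pandas sketch of the explore-then-filter steps above. It is illustrative only: preview_sheet and filter_rows are hypothetical helpers, the column names in the usage note are made up, and reading .xlsx files with pandas additionally requires an Excel engine such as openpyxl to be installed.

```python
import pandas as pd


def preview_sheet(path: str, sheet_name=0, nrows: int = 20) -> pd.DataFrame:
    """Read only the first rows to learn column names and data types."""
    return pd.read_excel(path, sheet_name=sheet_name, nrows=nrows)


def filter_rows(df: pd.DataFrame, column: str, value,
                date_column=None, start=None, end=None) -> pd.DataFrame:
    """Equality filter with an optional inclusive date range."""
    mask = df[column] == value
    if date_column is not None:
        dates = pd.to_datetime(df[date_column])
        if start is not None:
            mask &= dates >= pd.Timestamp(start)
        if end is not None:
            mask &= dates <= pd.Timestamp(end)
    return df[mask]
```

For a question scoped to "2023 Q1", a call such as filter_rows(df, "region", "east", date_column="date", start="2023-01-01", end="2023-03-31") keeps only the matching rows, assuming the sheet really has region and date columns.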
Synergy with Other Tools
PDF Processing
- Must read references/pdf_reading.md before processing PDF to learn methods.
- Use pdfplumber / pypdf for text extraction, table extraction, and metadata reading.
- Prioritize using the pdftotext command-line tool for fast text extraction.
Excel Processing
- Must read before processing Excel:
  - references/excel_reading.md - Learn reading methods.
  - references/excel_analysis.md - Learn analysis methods.
- Use pandas for data exploration, previewing, filtering, and analysis.
Tool Usage Principles
- Grep: Used to find line numbers and matching snippets by keyword in specified files; always specify the most precise include and path possible.
- Read: Only used for partial file reading; always set a reasonable limit (e.g., 200-500 lines) and appropriate offset.
- For any file that may be large:
  - Do not read directly from beginning to end.
  - Always narrow the range via indexes, table of contents, or keywords before reading.
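To make the offset/limit idea concrete, the sketch below turns a 1-based hit line (such as one reported by Grep) into an (offset, limit) pair and reads just that window. The helper names and the default radius are assumptions; the Read tool's own parameters may be defined differently.

```python
from itertools import islice
from pathlib import Path
from typing import List, Tuple


def window_around(line_no: int, radius: int = 30) -> Tuple[int, int]:
    """Translate a 1-based hit line into (offset, limit) centred on the hit."""
    offset = max(line_no - 1 - radius, 0)  # number of lines to skip
    limit = (line_no - 1 - offset) + radius + 1
    return offset, limit


def read_window(path: Path, offset: int = 0, limit: int = 300) -> List[str]:
    """Return up to `limit` lines after skipping `offset` lines."""
    with path.open(encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n")
                for line in islice(fh, offset, offset + limit)]
```

For a hit on line 100 with radius 2, window_around yields an offset of 97 and a limit of 5, which covers lines 98 through 102.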
Answer Style and Error Handling
- Answer Style
  - Try to answer in the language the user asked in (Chinese/English).
  - Give the conclusion first, then a brief rationale.
  - If needed, list the files cited and their approximate locations at the end, for example:
    - Source: design/api_gateway.md near line 100
    - Source: reports/2023_Q1_sales.xlsx Summary worksheet
- Missing or Uncertain Information
  - Clearly state if no exact match was found in the current knowledge base or if only a partial answer can be given.
  - Do not fabricate facts.
  - Suggest how the user can help narrow the scope:
    - Specify a more specific directory/file.
    - Provide more precise keywords or field names.
    - Specify a time/version range.
Source
git clone https://github.com/uts58/my-random-skills/blob/main/rag-skill/SKILL.md
Overview
kb-retriever is a retrieval and Q&A assistant designed to extract answers and data from a local knowledge base directory. It uses hierarchical index files to navigate domains, processes PDFs/Excels with reference-guided methods, and performs progressive retrieval using grep, Read, pdfplumber, and pandas to avoid loading entire files. This approach enables precise, fast answers and data retrieval from multi-level knowledge bases.
How This Skill Works
The tool navigates a multi-level directory tree guided by data_structure.md index files. For PDFs and Excel files, it first reads the relevant reference documents (e.g., pdf_reading.md, excel_reading.md) to learn processing methods, then uses selective reading (grep, Read, pdfplumber, pandas) to extract only the needed information. All file processing occurs before retrieval to minimize unnecessary data loading.
When to Use It
- You need to answer questions or retrieve data from a knowledge base directory.
- You must locate information across multiple domain subfolders defined by data_structure.md.
- You are dealing with large PDFs or Excel files and want to avoid loading the entire file.
- You want to verify or understand the knowledge base structure by inspecting domain-specific data_structure.md files.
- You need to confirm the knowledge root path and avoid guessing when knowledge/ does not exist.
Quick Start
- Step 1: Confirm the knowledge root (default to knowledge/); use test -d knowledge to verify.
- Step 2: Let kb-retriever read data_structure.md files to build the hierarchical index.
- Step 3: Ask a question or request specific data to begin progressive retrieval.
Best Practices
- Always verify the knowledge root directory with test -d knowledge; if it fails, ask the user for the correct path.
- Read the domain-specific data_structure.md files to understand what each folder contains before retrieval.
- For PDFs/Excel, consult the relevant reading references (pdf_reading.md and excel_reading.md) prior to processing.
- Use progressive retrieval methods (grep, Read, pdfplumber, pandas) to pull only the required data, avoiding full-file loads.
- Respect the hierarchical index navigation and avoid guessing the root path when the default is unavailable.
Example Use Cases
- Answer: What is the latest Q1 revenue figure found in the knowledge base under knowledge/sales/reports/ and related subfolders?
- Extract: Retrieve HR policy details from multiple subdirectories to compile a summary for a new onboarding guide.
- Analyze: Pull key performance indicators from a large multi-page PDF report without loading the entire file.
- Verify: Confirm the structure of the knowledge base by inspecting data_structure.md in the domain directories.
- Locate: Find the exact contact information for a product support team stored across several domain files.