doc-processor
npx machina-cli add skill next-open-ai/openclawx/doc-processor --openclaw
Document Processor Skill
Use this skill when the user asks you to read specific technical documents, summarize reports, or generate structured files (like a structured markdown report, a CSV of data, or an HTML presentation).
Workflow
- Reading Documents:
  - If the file is plaintext (txt, md, csv, json), use the read tool directly.
  - If it's a binary document (pdf, docx), check if tools like pdftotext or pandoc are installed via the bash tool, then convert it to text in a temporary directory (/tmp/) before reading it.
- Generating Documents:
  - Understand the required structure and content from the user.
  - Draft the content in a plaintext format (e.g., Markdown) using the write tool.
  - If the user requested a specific format like PDF or HTML, use bash to run pandoc output.md -o output.pdf or similar commands.
  - If necessary tools (like pandoc) are missing, politely inform the user to install them or provide the drafted Markdown as a fallback.
  - Notify the user with the path to the newly generated document.
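The reading half of this workflow can be sketched in shell. This is a minimal illustration, not the skill's actual implementation: the function names doc_kind and to_text are hypothetical, and the sketch assumes pdftotext is used for PDF and pandoc for DOCX.

```shell
# Classify a file by extension, mirroring the plaintext/binary split above.
# doc_kind is a hypothetical helper name, not part of the skill.
doc_kind() {
  case "$1" in
    *.txt|*.md|*.csv|*.json) echo plaintext ;;
    *.pdf|*.docx)            echo binary ;;
    *)                       echo unknown ;;
  esac
}

# Print the path of a readable text version of "$1" (hypothetical helper).
# Binary inputs are converted into /tmp; a missing converter is reported
# on stderr and the function returns a non-zero status.
to_text() {
  src=$1
  out="/tmp/$(basename "${src%.*}").txt"
  case "$src" in
    *.pdf)
      command -v pdftotext >/dev/null 2>&1 ||
        { echo "pdftotext is not installed" >&2; return 1; }
      pdftotext "$src" "$out" || return 1 ;;
    *.docx)
      command -v pandoc >/dev/null 2>&1 ||
        { echo "pandoc is not installed" >&2; return 1; }
      pandoc "$src" -t plain -o "$out" || return 1 ;;
    *)
      out=$src ;;   # plaintext: read the file in place
  esac
  echo "$out"
}
```

For example, to_text manual.pdf would print /tmp/manual.txt once the conversion succeeds, and that path can then be passed to the read tool.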
Source
git clone https://github.com/next-open-ai/openclawx/blob/main/presets/workspaces/doc-assistant/skills/doc-processor/SKILL.md
Overview
doc-processor reads, parses, and generates documents in formats such as Markdown, PDF, DOCX, CSV, and HTML. It runs conversion tools such as pandoc, or Python scripts, through bash to convert and format content, covering the whole workflow from reading a source file to generating the output.
How This Skill Works
Reading flow: plaintext files (txt, md, csv, json) are read directly via the read tool; binary formats are first converted to text in /tmp, with pdftotext for pdf and pandoc for docx, before processing. Generating flow: content is drafted in Markdown with the write tool; if a specific target format is requested (PDF, HTML), pandoc is invoked through bash to produce the final file, and its path is reported to the user.
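The generating flow, including the Markdown fallback, might look like the sketch below. The function name render_doc and the PANDOC_BIN variable are assumptions introduced for illustration; the skill itself only specifies that pandoc is run through bash.

```shell
# render_doc DRAFT.md FORMAT
# Prints the path of the file to hand back to the user: the converted
# document when pandoc is available and succeeds, otherwise the Markdown draft.
# PANDOC_BIN is a hypothetical override so the fallback path can be exercised.
render_doc() {
  draft=$1
  target="${draft%.md}.$2"
  if command -v "${PANDOC_BIN:-pandoc}" >/dev/null 2>&1 &&
     "${PANDOC_BIN:-pandoc}" "$draft" -o "$target"; then
    echo "$target"
  else
    echo "$draft"   # fallback: report the Markdown draft instead
  fi
}
```

Calling render_doc report.md html prints report.html when the conversion works. PDF output additionally requires a PDF engine (pandoc uses pdflatex by default), so on a machine without one the sketch falls back to report.md.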
When to Use It
- You need to extract and summarize text from a PDF or DOCX document.
- You want to generate a structured Markdown report from data and notes.
- You require a CSV export of tabular data derived from a document or dataset.
- You need to convert a Markdown draft to PDF or HTML for distribution.
- You want to convert or reformat an existing document into another supported format.
Quick Start
- Step 1: Provide the source file and your target format (e.g., PDF, HTML, or CSV).
- Step 2: Use read to fetch content (or convert binary to text in /tmp) and draft the report with write in Markdown.
- Step 3: Run pandoc via bash to generate the final file and receive the path to the output (e.g., /tmp/output.pdf).
Best Practices
- Clearly confirm the input file type and the desired output format before starting.
- Draft content in Markdown first, then convert to the target format as needed.
- Check that required tools (pandoc, pdftotext) are installed; if missing, inform the user and provide a Markdown fallback.
- Use a temporary directory like /tmp for intermediate conversions and clean up afterward.
- Validate the final document and provide the user with the exact path to the generated file.
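The advice on temporary directories and cleanup can be made concrete with a small wrapper. This is one possible sketch; with_scratch_dir is a hypothetical name, and the /tmp/doc-processor.XXXXXX template is an assumption.

```shell
# with_scratch_dir COMMAND [ARGS...]
# Creates a private directory under /tmp, runs COMMAND with the directory
# path appended as its last argument, then removes the directory.
# Returns COMMAND's exit status. The function name is hypothetical.
with_scratch_dir() {
  scratch=$(mktemp -d /tmp/doc-processor.XXXXXX) || return 1
  "$@" "$scratch"
  status=$?
  rm -rf "$scratch"   # clean up intermediate files even if COMMAND failed
  return "$status"
}

# Example: count the words of a PDF without leaving files behind.
# count_pdf_words() { pdftotext manual.pdf "$1/manual.txt" && wc -w < "$1/manual.txt"; }
# with_scratch_dir count_pdf_words
```

Using mktemp -d rather than a fixed path avoids collisions when two conversions run at the same time. Any file the user should keep has to be written outside the scratch directory, since the directory is deleted on return.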
Example Use Cases
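- Convert a PDF user guide to HTML for a web wiki: extract the text with pdftotext, then build the HTML with pandoc via bash (pandoc does not accept PDF as an input format).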
- Generate a CSV summary from a Markdown table extracted from a report.
- Produce a PDF report from a Markdown draft for stakeholder distribution.
- Extract text from a DOCX resume for indexing and search optimization.
- Create a Markdown version of a JSON specification for developer documentation.
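As an illustration of the CSV use case, a Markdown pipe table can be turned into CSV with awk alone. This is a rough sketch under stated assumptions: md_table_to_csv is a hypothetical helper, every row starts and ends with a pipe, and cells contain no commas, quotes, or escaped pipes.

```shell
# md_table_to_csv: read Markdown on stdin, write the rows of any pipe
# table as CSV on stdout. Hypothetical helper; no CSV quoting is applied.
md_table_to_csv() {
  awk -F'|' '
    /^[ \t]*\|/ {
      # Skip the header separator row, e.g. | --- | ---: |
      if ($0 ~ /^[ \t|:-]+$/) next
      line = ""
      # Fields 1 and NF lie outside the outer pipes, so they are ignored.
      for (i = 2; i < NF; i++) {
        cell = $i
        gsub(/^[ \t]+|[ \t]+$/, "", cell)   # trim padding around the cell
        line = line (i > 2 ? "," : "") cell
      }
      print line
    }'
}
```

A call such as md_table_to_csv < report.md > /tmp/summary.csv writes the table rows and skips everything else in the report. Data with embedded commas or quotes needs a real CSV writer instead.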