metadata-extraction
npx machina-cli add skill brege/dewey-decimal-skill/metadata-extraction --openclaw
Metadata Extraction
Extract bibliographic metadata from ebook files.
Priority Order
- EPUB OPF metadata (most authoritative)
- PDF document properties
- Filename parsing (least reliable)
- User input (for disambiguation only)
EPUB Extraction
Step 1: Find OPF Path
unzip -p "$file" META-INF/container.xml | grep -oP 'full-path="\K[^"]*'
Returns path like OEBPS/content.opf or content.opf.
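The -P flag used above is specific to GNU grep. A minimal portable sketch of the same step follows, assuming container.xml declares its rootfile with a double-quoted full-path attribute. The function name opf_path_from_container is hypothetical and not part of the skill.

```sh
# Hypothetical helper: read container.xml on stdin and print the first
# full-path attribute value (the OPF location inside the archive).
opf_path_from_container() {
  # Put each tag on its own line so the first rootfile wins,
  # then capture the quoted attribute value.
  tr '>' '\n' |
    sed -n 's/.*full-path="\([^"]*\)".*/\1/p' |
    head -n 1
}

# Usage (assumes "$file" is an EPUB and unzip is installed):
# opf_path=$(unzip -p "$file" META-INF/container.xml | opf_path_from_container)
```

Splitting on ">" first avoids picking up a later rootfile when the whole XML document sits on a single line.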
Step 2: Extract Metadata
unzip -p "$file" "$opf_path" | grep -E '<dc:(creator|title|date|contributor)'
Fields
| Tag | Content |
|---|---|
| `<dc:creator>` | Author name(s) |
| `<dc:title>` | Book title |
| `<dc:date>` | Publication date (extract YYYY) |
| `<dc:contributor opf:role="trn">` | Translator |
| `<dc:contributor opf:role="edt">` | Editor |
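Extracting YYYY from the date element can be sketched as below. This assumes each dc:date element sits on its own line and its value starts with the year (as in 2007 or 2007-03-15). It also assumes EPUB 2 style opf:event attributes, so that dates marked as modification dates can be skipped. The function name dc_date_year is hypothetical.

```sh
# Hypothetical helper: read OPF XML on stdin and print the year (YYYY)
# from the first <dc:date> that is not marked as a modification date.
dc_date_year() {
  grep -v 'opf:event="modification"' |
    sed -n 's/.*<dc:date[^>]*>[[:space:]]*\([0-9]\{4\}\).*/\1/p' |
    head -n 1
}

# Usage (assumes "$file" and "$opf_path" are already set):
# year=$(unzip -p "$file" "$opf_path" | dc_date_year)
```

An empty result means no usable date was found, and the year should be treated as missing.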
PDF Extraction
pdfinfo "$file" 2>/dev/null | grep -E '^(Title|Author|CreationDate|ModDate):'
Extract:
- Title field
- Author field
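- CreationDate year (raw PDF format: D:YYYYMMDDhhmmss; pdfinfo prints this raw form only with -rawdates and otherwise shows a human-readable date)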
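A minimal sketch of pulling the year out of the raw date string, assuming the pdfinfo from Poppler or Xpdf, whose -rawdates option prints the undecoded D:YYYYMMDDhhmmss value. The function name pdf_creation_year is hypothetical.

```sh
# Hypothetical helper: read `pdfinfo -rawdates` output on stdin and print
# the four-digit year from the CreationDate line.
pdf_creation_year() {
  sed -n 's/^CreationDate:[[:space:]]*D:\([0-9]\{4\}\).*/\1/p' | head -n 1
}

# Usage (assumes "$file" is a PDF and pdfinfo is installed):
# year=$(pdfinfo -rawdates "$file" 2>/dev/null | pdf_creation_year)
```

CreationDate records when the PDF file was produced, which may differ from the edition's publication year, so the result is best treated as a hint.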
Filename Parsing
Parse for hints when metadata insufficient:
- Author: Often at start, before dash or title
- Title: Main text
- Year: Four digits in parentheses or at end
Example: Charles.Bukowski.-.Love.Is.A.Dog.From.Hell.2007.RETAIL.EPUB.eBook-CTO.epub
- Author: Charles Bukowski
- Title: Love Is A Dog From Hell
- Year: 2007
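The parse above can be sketched in shell as follows. This is an illustration under stated assumptions rather than the skill's own parser: it expects an "Author - Title Year ..." layout with dots or underscores as word separators, and it takes the last standalone four-digit number starting with 1 or 2 as the year. The function name parse_ebook_filename is hypothetical.

```sh
# Hypothetical helper: print author, title, and year guessed from a filename.
# The body runs in a subshell so its variables do not leak into the caller.
parse_ebook_filename() (
  base=${1##*/}                                  # drop any directory part
  base=${base%.*}                                # drop the file extension
  base=$(printf '%s\n' "$base" | tr '._' '  ')   # separators become spaces
  case $base in
    *" - "*) author=${base%% - *}; rest=${base#* - } ;;
    *)       author=; rest=$base ;;
  esac
  # Last standalone four-digit number starting with 1 or 2.
  year=$(printf '%s\n' " $rest " |
    sed -n 's/^\(.*[^0-9]\)\([12][0-9]\{3\}\)[^0-9].*/\2/p')
  if [ -n "$year" ]; then
    title=${rest%"$year"*}                       # text before the year
  else
    title=$rest
  fi
  # Trim trailing spaces and an opening parenthesis left before the year.
  title=$(printf '%s\n' "$title" | sed 's/[[:space:](]*$//')
  printf 'Author: %s\nTitle: %s\nYear: %s\n' "$author" "$title" "$year"
)

parse_ebook_filename 'Charles.Bukowski.-.Love.Is.A.Dog.From.Hell.2007.RETAIL.EPUB.eBook-CTO.epub'
# Author: Charles Bukowski
# Title: Love Is A Dog From Hell
# Year: 2007
```

Because every dot becomes a space, dotted initials and abbreviations in names are split apart, so the output still needs a sanity check.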
Year Rules
Use publication year of THIS edition, not original work.
- Aristotle's Nicomachean Ethics (ancient) with Irwin translation (2019): use 2019
- Marx's Capital originally 1867: use edition's publication year
- If EPUB date differs from original work date, prefer EPUB date
Useless Metadata
Some EPUBs have placeholder or garbage metadata. Treat these as missing:
- Title: [No data], Unknown, Untitled, empty
- Author: Unknown, Anonymous (unless actually anonymous work), empty
- Date: Only modification date present, no publication date
When OPF metadata is useless, fall back to filename parsing immediately.
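A small check for the placeholder values listed above might look like this. Matching case-insensitively and trimming surrounding whitespace are assumptions added here, and the function name is_placeholder is hypothetical.

```sh
# Hypothetical helper: succeed (exit 0) when a title or author value is a
# known placeholder, empty, or whitespace only.
# "Anonymous" is deliberately not flagged, since a work can be genuinely
# anonymous and that case needs human judgment.
is_placeholder() {
  value=$(printf '%s\n' "$1" |
    tr '[:upper:]' '[:lower:]' |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
  case $value in
    ''|'[no data]'|unknown|untitled) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (assumes "$title" holds the extracted dc:title text):
# if is_placeholder "$title"; then echo "fall back to filename parsing"; fi
```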
Author Name Normalization
Fix common capitalization issues from metadata:
- Harold Mcgee → Harold McGee (fix surname caps)
- CHARLES BUKOWSKI → Charles Bukowski (fix all-caps)
- Preserve intentional lowercase particles: de Beauvoir, van Gogh
Use the opf:file-as attribute if present; it often has the correct citation format:
<dc:creator opf:file-as="McGee, Harold">Harold Mcgee</dc:creator>
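Reading that attribute can be sketched as below, assuming the dc:creator element sits on a single line and uses double-quoted attributes. The function name creator_sort_name is hypothetical.

```sh
# Hypothetical helper: read OPF XML on stdin; print the opf:file-as value of
# the first <dc:creator>, or the element text when the attribute is absent.
# The body runs in a subshell so its variables do not leak into the caller.
creator_sort_name() (
  line=$(sed -n '/<dc:creator/{p;q;}')
  file_as=$(printf '%s\n' "$line" |
    sed -n 's/.*opf:file-as="\([^"]*\)".*/\1/p')
  if [ -n "$file_as" ]; then
    printf '%s\n' "$file_as"
  else
    printf '%s\n' "$line" |
      sed -n 's/.*<dc:creator[^>]*>\([^<]*\)<.*/\1/p'
  fi
)

# Usage (assumes "$file" and "$opf_path" are already set):
# author=$(unzip -p "$file" "$opf_path" | creator_sort_name)
```

The attribute value is in "Surname, Given" order, so it may need reordering if the display form of the name is wanted.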
Translator Verification
Translator info in filenames is often wrong or refers to different editions.
- Verify translator matches publisher/edition (e.g., Penguin 2004 Marx = Fowkes, not Guyer-Wood)
- If translator not in metadata, check ISBN against known editions
- When uncertain, ask user
Missing Data
When metadata genuinely unavailable:
- No author: Ask user
- No year: Ask user for "publication year of this edition"
- Ambiguous title: Ask user
- No translator but work is translated: Ask user or look up by ISBN/publisher
Do not guess. Do not proceed without author and title.
Source
git clone https://github.com/brege/dewey-decimal-skill
Skill file: https://github.com/brege/dewey-decimal-skill/blob/main/skills/metadata-extraction/SKILL.md
Overview
This skill pulls bibliographic metadata from EPUB OPF, PDF properties, or filenames. It prioritizes EPUB OPF as the authoritative source, then PDF metadata, and finally filename hints or user input for disambiguation.
How This Skill Works
It first locates the EPUB OPF path from META-INF/container.xml and extracts dc:creator, dc:title, dc:date, and dc:contributor (with roles). If EPUB data is missing, it falls back to PDF metadata via pdfinfo, and as a last resort parses the filename. It also applies author name normalization and uses opf:file-as when present to improve citation format.
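One way to read this order in practice is to choose the first source by file type, since OPF data exists only in EPUB files and document properties only in PDFs. Dispatching on the file extension is an assumption made for this sketch, not something the skill specifies, and the function name first_metadata_source is hypothetical.

```sh
# Hypothetical helper: name the first metadata source to try for a file,
# judged by its extension (compared case-insensitively).
first_metadata_source() {
  case $(printf '%s\n' "${1##*.}" | tr '[:upper:]' '[:lower:]') in
    epub) echo opf ;;
    pdf)  echo pdfinfo ;;
    *)    echo filename ;;
  esac
}

# Usage (assumes "$file" holds the path to an ebook):
# source=$(first_metadata_source "$file")
```

Whatever the first source is, filename parsing and user input remain the fallbacks when the embedded metadata turns out to be missing or useless.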
When to Use It
- Extract bibliographic data from an EPUB's OPF metadata when available
- Fallback to PDF document properties when EPUB metadata is missing or incomplete
- Parse the filename as a last resort when embedded metadata is unreliable
- Disambiguate authors, titles, or translators with user input
- Ensure edition year reflects the publication year of this edition, not the original work
Quick Start
- Step 1: Locate the EPUB OPF path using unzip -p "$file" META-INF/container.xml and extract full-path
- Step 2: Extract metadata with unzip -p "$file" "$opf_path" | grep -E '<dc:(creator|title|date|contributor)' and map fields
- Step 3: If EPUB data is insufficient, run pdfinfo "$file" to fetch Title/Author and CreationDate, or parse the filename as a last resort; apply author normalization and use opf:file-as when present
Best Practices
- Check EPUB OPF first for authoritative data
- If OPF is missing, rely on PDF properties before resorting to filename parsing
- Normalize author names and use opf:file-as when available
- Verify translator/editor data against publisher/edition (e.g., ISBN cross-checks)
- Do not guess when data is missing; prompt the user for missing author, title, or year
Example Use Cases
- Example 1: An EPUB with valid OPF metadata yields Author: Charles Bukowski; Title: Love Is A Dog From Hell; Year: 2007 (from dc:creator, dc:title, dc:date)
- Example 2: A PDF with Title/Author in document properties and CreationDate: 2018-05-03
- Example 3: Filename only fallback: Charles.Bukowski.-.Love.Is.A.Dog.From.Hell.2007.RETAIL.EPUB.eBook-CTO.epub → Author: Charles Bukowski; Title: Love Is A Dog From Hell; Year: 2007
- Example 4: Edition year clarification: Aristotle's Nicomachean Ethics (2019 edition with translation) → use 2019 as the publication year
- Example 5: Missing data case: No author or title in OPF or PDF; rely on filename parse or prompt user for missing data