What if OPF data is missing or useless?

Fall back to PDF metadata; if still unavailable, parse the filename and, if necessary, ask the user for missing information.

How should author names be normalized?

Apply capitalization fixes (e.g., Harold Mcgee -> Harold McGee; CHARLES BUKOWSKI -> Charles Bukowski) while preserving particles (de Beauvoir, van Gogh). Use opf:file-as when present to ensure correct citation format.

How are translators and editors handled?

Verify translator/editor data against publisher/edition when possible. If metadata lacks this info, check ISBN/publisher editions or ask the user for confirmation to avoid misattribution.

metadata-extraction

npx machina-cli add skill brege/dewey-decimal-skill/metadata-extraction --openclaw

Metadata Extraction

Extract bibliographic metadata from ebook files.

Priority Order

1. EPUB OPF metadata (most authoritative)
2. PDF document properties
3. Filename parsing (least reliable)
4. User input (for disambiguation only)

EPUB Extraction

Step 1: Find OPF Path

# Store the OPF path in opf_path for use in Step 2 (grep -P requires GNU grep)
opf_path=$(unzip -p "$file" META-INF/container.xml | grep -oP 'full-path="\K[^"]*')

Returns path like OEBPS/content.opf or content.opf.
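
Where `grep -P` is unavailable (it is a GNU extension, and the stock macOS grep lacks it), the same attribute can be pulled out with sed. This is a minimal sketch, not part of the original skill: the function name `opf_path_from_container` is hypothetical, and it assumes the `full-path` attribute sits on a single line of container.xml.

```shell
# Hypothetical helper: read container.xml on stdin, print the first full-path value.
opf_path_from_container() {
  sed -n 's/.*full-path="\([^"]*\)".*/\1/p' | head -n 1
}

# Intended use (requires unzip and an EPUB at "$file"):
#   opf_path=$(unzip -p "$file" META-INF/container.xml | opf_path_from_container)
```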

Step 2: Extract Metadata

unzip -p "$file" "$opf_path" | grep -E '<dc:(creator|title|date|contributor)'

Fields

| Tag | Content |
| --- | --- |
| `<dc:creator>` | Author name(s) |
| `<dc:title>` | Book title |
| `<dc:date>` | Publication date (extract YYYY) |
| `<dc:contributor opf:role="trn">` | Translator |
| `<dc:contributor opf:role="edt">` | Editor |
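
The grep in Step 2 prints whole elements, so the year and the contributor names still have to be cut out of them. One possible way, sketched with sed: both function names are hypothetical, and the sketch assumes each element occupies a single line, which is common but not guaranteed in OPF files.

```shell
# Hypothetical helper: print the four-digit year of the first <dc:date> on stdin.
dc_year() {
  sed -n 's/.*<dc:date[^>]*>[^<0-9]*\([0-9]\{4\}\).*/\1/p' | head -n 1
}

# Hypothetical helper: print the text of <dc:contributor> elements whose
# opf:role matches the first argument (for example trn or edt).
dc_contributor() {
  sed -n 's/.*<dc:contributor[^>]*opf:role="'"$1"'"[^>]*>\([^<]*\)<.*/\1/p'
}

# Intended use (requires unzip, "$file", and "$opf_path" from Step 1):
#   unzip -p "$file" "$opf_path" | dc_year
#   unzip -p "$file" "$opf_path" | dc_contributor trn
```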

PDF Extraction

pdfinfo -rawdates "$file" 2>/dev/null | grep -E '^(Title|Author|CreationDate|ModDate):'

Extract:

Title field
Author field
CreationDate year (format: D:YYYYMMDDhhmmss)
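
Pulling the year out of a raw CreationDate line could look like the sketch below. It assumes pdfinfo was run with `-rawdates`, so the value keeps the `D:YYYYMMDDhhmmss` form instead of being reformatted into a human-readable date; the function name `pdf_year` is hypothetical.

```shell
# Hypothetical helper: read pdfinfo output on stdin, print the CreationDate year.
# Only matches the raw "D:YYYY..." form; prints nothing for other date formats.
pdf_year() {
  sed -n 's/^CreationDate:[[:space:]]*D:\([0-9]\{4\}\).*/\1/p' | head -n 1
}

# Intended use (requires pdfinfo and a PDF at "$file"):
#   pdfinfo -rawdates "$file" 2>/dev/null | pdf_year
```

Keep the Year Rules in mind: CreationDate records when the PDF file was produced, which may not be the edition's publication year.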

Filename Parsing

Parse for hints when metadata insufficient:

Author: Often at start, before dash or title
Title: Main text
Year: Four digits in parentheses or at end

Example: Charles.Bukowski.-.Love.Is.A.Dog.From.Hell.2007.RETAIL.EPUB.eBook-CTO.epub

Author: Charles Bukowski
Title: Love Is A Dog From Hell
Year: 2007
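
A rough sketch of this parse for dot-separated names like the example above. The function name `parse_filename` is hypothetical, and the rules it encodes are assumptions: author is the text before the first `.-.`, the first token that looks like 19xx or 20xx is taken as the year, and everything after it is discarded as release tags. Other naming schemes need other rules.

```shell
# Hypothetical helper: parse "Author.Name.-.Title.Words.YYYY.tags.ext".
# The body runs in a subshell, so the IFS and set -f changes do not leak.
parse_filename() (
  set -f                          # no glob expansion while splitting
  name=${1##*/}                   # drop any directory part
  name=${name%.*}                 # drop the file extension
  author=''
  rest=$name
  case $name in
    *.-.*)
      author=${name%%.-.*}        # text before the first ".-."
      rest=${name#*.-.}           # text after it
      ;;
  esac
  title=''
  year=''
  IFS=.
  for token in $rest; do
    case $token in
      19[0-9][0-9]|20[0-9][0-9]) year=$token; break ;;
    esac
    title=${title:+$title }$token # join title tokens with spaces
  done
  author=$(printf '%s' "$author" | tr . ' ')
  printf 'Author: %s\nTitle: %s\nYear: %s\n' "$author" "$title" "$year"
)
```

A title that itself looks like a year (for example 1984) would be taken as the year by this sketch, which is one reason filename parsing ranks as least reliable.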

Year Rules

Use publication year of THIS edition, not original work.

Aristotle's Nicomachean Ethics (ancient) with Irwin translation (2019): use 2019
Marx's Capital originally 1867: use edition's publication year
If EPUB date differs from original work date, prefer EPUB date

Useless Metadata

Some EPUBs have placeholder or garbage metadata. Treat these as missing:

Title: [No data], Unknown, Untitled, empty
Author: Unknown, Anonymous (unless actually anonymous work), empty
Date: Only modification date present, no publication date

When OPF metadata is useless, fall back to filename parsing immediately.
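
The placeholder check can be sketched as a small predicate. The function name `is_placeholder` is hypothetical. It covers only the title and author strings listed above; "Anonymous" is left out on purpose because deciding whether a work is genuinely anonymous needs human judgment.

```shell
# Hypothetical helper: succeed (exit 0) when the value is empty or a known placeholder.
is_placeholder() {
  # Lowercase and trim surrounding whitespace before comparing.
  _value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
  case $_value in
    ''|'[no data]'|unknown|untitled) return 0 ;;
    *) return 1 ;;
  esac
}

# Intended use, with "$title" holding the extracted dc:title text:
#   if is_placeholder "$title"; then echo "title missing, try filename"; fi
```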

Author Name Normalization

Fix common capitalization issues from metadata:

Harold Mcgee → Harold McGee (fix surname caps)
CHARLES BUKOWSKI → Charles Bukowski (fix all-caps)
Preserve intentional lowercase particles: de Beauvoir, van Gogh

Use opf:file-as attribute if present - it often has correct citation format:

<dc:creator opf:file-as="McGee, Harold">Harold Mcgee</dc:creator>
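
One way to use the attribute, as a sketch: read the `opf:file-as` value and flip "Surname, Given" back into display order, which keeps the capitalization and particles the publisher supplied. Both function names are hypothetical, and the sketch assumes the `<dc:creator>` element is on one line with a single comma in the file-as value.

```shell
# Hypothetical helper: print the opf:file-as value of the first <dc:creator> on stdin.
creator_file_as() {
  sed -n 's/.*<dc:creator[^>]*opf:file-as="\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: turn "Surname, Given" into "Given Surname";
# values without a comma are printed unchanged.
file_as_to_display() {
  printf '%s\n' "$1" | sed 's/^\([^,]*\),[[:space:]]*\(.*\)$/\2 \1/'
}

# Intended use (requires unzip, "$file", and "$opf_path"):
#   file_as=$(unzip -p "$file" "$opf_path" | creator_file_as)
#   [ -n "$file_as" ] && file_as_to_display "$file_as"
```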

Translator Verification

Translator info in filenames is often wrong or refers to different editions.

Verify translator matches publisher/edition (e.g., Penguin 2004 Marx = Fowkes, not Guyer-Wood)
If translator not in metadata, check ISBN against known editions
When uncertain, ask user

Missing Data

When metadata genuinely unavailable:

No author: Ask user
No year: Ask user for "publication year of this edition"
Ambiguous title: Ask user
No translator but work is translated: Ask user or look up by ISBN/publisher

Do not guess. Do not proceed without author and title.

Source

git clone https://github.com/brege/dewey-decimal-skill

Skill file: https://github.com/brege/dewey-decimal-skill/blob/main/skills/metadata-extraction/SKILL.md

Overview

This skill pulls bibliographic metadata from EPUB OPF, PDF properties, or filenames. It prioritizes EPUB OPF as the authoritative source, then PDF metadata, and finally filename hints or user input for disambiguation.

How This Skill Works

It first locates the EPUB OPF path from META-INF/container.xml and extracts dc:creator, dc:title, dc:date, and dc:contributor (with roles). If EPUB data is missing, it falls back to PDF metadata via pdfinfo, and as a last resort parses the filename. It also applies author name normalization and uses opf:file-as when present to improve citation format.

When to Use It

Extract bibliographic data from an EPUB's OPF metadata when available
Fall back to PDF document properties when EPUB metadata is missing or incomplete
Parse the filename as a last resort when embedded metadata is unreliable
Disambiguate authors, titles, or translators with user input
Ensure edition year reflects the publication year of this edition, not the original work

Quick Start

Step 1: Locate the EPUB OPF path using unzip -p "$file" META-INF/container.xml and extract the full-path attribute
Step 2: Extract metadata with unzip -p "$file" "$opf_path" | grep -E '<dc:(creator|title|date|contributor)' and map fields
Step 3: If EPUB data is insufficient, run pdfinfo "$file" to fetch Title/Author and CreationDate, or parse the filename as a last resort; apply author normalization and use opf:file-as when present

Best Practices

Check EPUB OPF first for authoritative data
If OPF is missing, rely on PDF properties before resorting to filename parsing
Normalize author names and use opf:file-as when available
Verify translator/editor data against publisher/edition (e.g., ISBN cross-checks)
Do not guess when data is missing; prompt the user for missing author, title, or year

Example Use Cases

Example 1: An EPUB with valid OPF metadata yields Author: Charles Bukowski; Title: Love Is A Dog From Hell; Year: 2007 (from dc:creator, dc:title, dc:date)
Example 2: A PDF with Title/Author in document properties and a CreationDate of 2018-05-03 yields those two fields plus Year: 2018
Example 3: Filename only fallback: Charles.Bukowski.-.Love.Is.A.Dog.From.Hell.2007.RETAIL.EPUB.eBook-CTO.epub → Author: Charles Bukowski; Title: Love Is A Dog From Hell; Year: 2007
Example 4: Edition year clarification: Aristotle's Nicomachean Ethics (2019 edition with translation) → use 2019 as the publication year
Example 5: Missing data case: No author or title in OPF or PDF; rely on filename parse or prompt user for missing data

Frequently Asked Questions
