metadata-extraction

npx machina-cli add skill brege/dewey-decimal-skill/metadata-extraction --openclaw
Metadata Extraction

Extract bibliographic metadata from ebook files.

Priority Order

  1. EPUB OPF metadata (most authoritative)
  2. PDF document properties
  3. Filename parsing (least reliable)
  4. User input (for disambiguation only)
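The priority order can be sketched as a small fallback helper. This is a minimal illustration, not part of the skill itself: first_usable is a hypothetical function name, and the variables in the usage line are assumed to have been filled by the extraction steps described in the sections that follow.

```shell
# Hypothetical helper: print the first non-empty argument.
# Pass candidates in priority order: OPF value, PDF value, filename guess.
first_usable() {
    for candidate in "$@"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # Nothing usable: the caller should ask the user instead of guessing.
    return 1
}
```

A call might look like title=$(first_usable "$opf_title" "$pdf_title" "$filename_title"), with a non-zero exit status signalling that the user has to be asked.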

EPUB Extraction

Step 1: Find OPF Path

opf_path=$(unzip -p "$file" META-INF/container.xml | grep -oP 'full-path="\K[^"]*')

This stores a path such as OEBPS/content.opf or content.opf in opf_path, which Step 2 uses. Note that grep -P requires GNU grep.

Step 2: Extract Metadata

unzip -p "$file" "$opf_path" | grep -E '<dc:(creator|title|date|contributor)'

Fields

Tag                                  Content
<dc:creator>                         Author name(s)
<dc:title>                           Book title
<dc:date>                            Publication date (extract YYYY)
<dc:contributor opf:role="trn">      Translator
<dc:contributor opf:role="edt">      Editor
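To turn the matching lines into plain values, something like the following could be used. It is a rough sketch that assumes each element sits on a single line of the OPF file; dc_field, contributor_with_role, and year_of are hypothetical helper names, and a real XML parser would be more robust than sed.

```shell
# Print the text content of the first <dc:NAME> element read from stdin.
dc_field() {
    sed -n "s|.*<dc:$1[^>]*>\([^<]*\)</dc:$1>.*|\1|p" | head -n 1
}

# Print the first <dc:contributor> whose opf:role equals $1 (e.g. trn, edt).
contributor_with_role() {
    sed -n "s|.*<dc:contributor[^>]*opf:role=\"$1\"[^>]*>\([^<]*\)</dc:contributor>.*|\1|p" | head -n 1
}

# Reduce a date such as 2007-06-01 to its leading four-digit year.
year_of() {
    sed -n 's/^[^0-9]*\([0-9]\{4\}\).*/\1/p' | head -n 1
}
```

For example, unzip -p "$file" "$opf_path" | dc_field date | year_of would print the YYYY part of the publication date.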

PDF Extraction

pdfinfo "$file" 2>/dev/null | grep -E '^(Title|Author|CreationDate|ModDate):'

Extract:

  • Title field
  • Author field
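  • CreationDate year (the raw PDF date format is D:YYYYMMDDhhmmss; pdfinfo prints dates in that form only when run with -rawdates, otherwise it shows a formatted date)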
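A possible way to pull the year out of the raw date form is sketched below. It assumes the input is pdfinfo output produced with the -rawdates option, so that CreationDate appears as D:YYYYMMDD...; pdf_year is a hypothetical helper name.

```shell
# Read pdfinfo-style output on stdin and print the CreationDate year.
# Only the raw D:YYYYMMDDhhmmss form is recognised; other formats print nothing.
pdf_year() {
    sed -n 's/^CreationDate:[[:space:]]*D:\([0-9]\{4\}\).*/\1/p' | head -n 1
}
```

Usage would be along the lines of pdfinfo -rawdates "$file" 2>/dev/null | pdf_year. An empty result means the year has to come from another source.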

Filename Parsing

Parse the filename for hints when embedded metadata is insufficient:

  • Author: Often at start, before dash or title
  • Title: Main text
  • Year: Four digits in parentheses or at end

Example: Charles.Bukowski.-.Love.Is.A.Dog.From.Hell.2007.RETAIL.EPUB.eBook-CTO.epub

  • Author: Charles Bukowski
  • Title: Love Is A Dog From Hell
  • Year: 2007
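The example above follows one common release-style pattern, Author.-.Title.Year.tags.ext. A minimal sketch for that pattern only is shown here; parse_filename is a hypothetical name, and other naming schemes would need their own rules.

```shell
# Split "Author.-.Title.Year.tags.ext" into its parts.
# Returns 1 if the name does not contain the ".-." separator.
parse_filename() {
    base=$(basename "$1")
    case "$base" in
        *.-.*) ;;
        *) return 1 ;;
    esac
    author=${base%%.-.*}        # text before the first ".-."
    rest=${base#*.-.}           # text after the first ".-."
    # Last dot-delimited four-digit token starting with 1 or 2.
    year=$(printf '%s\n' "$rest" | sed -n 's/.*\.\([12][0-9]\{3\}\)\..*/\1/p')
    if [ -n "$year" ]; then
        title=${rest%."$year".*}    # drop the year and everything after it
    else
        title=${rest%.*}            # no year found: drop only the extension
    fi
    printf 'Author: %s\nTitle: %s\nYear: %s\n' \
        "$(printf '%s' "$author" | tr '.' ' ')" \
        "$(printf '%s' "$title" | tr '.' ' ')" \
        "$year"
}
```

The result is still only a hint: an empty Year line, or a title with release tags left in it, should be resolved with the user.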

Year Rules

Use publication year of THIS edition, not original work.

  • Aristotle's Nicomachean Ethics (ancient) with Irwin translation (2019): use 2019
  • Marx's Capital originally 1867: use edition's publication year
  • If EPUB date differs from original work date, prefer EPUB date

Useless Metadata

Some EPUBs have placeholder or garbage metadata. Treat these as missing:

  • Title: [No data], Unknown, Untitled, empty
  • Author: Unknown, Anonymous (unless actually anonymous work), empty
  • Date: Only modification date present, no publication date

When OPF metadata is useless, fall back to filename parsing immediately.
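A check for the title and author placeholders listed above might look like this. It is a sketch with a hypothetical function name, is_placeholder, and it cannot tell whether a work is actually anonymous, so that case still needs human judgment.

```shell
# Return 0 if VALUE is a known placeholder for the given KIND (title or author).
# Comparison ignores case and surrounding whitespace.
is_placeholder() {
    kind=$1
    value=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
    case "$kind:$value" in
        title:|title:'[no data]'|title:unknown|title:untitled) return 0 ;;
        author:|author:unknown|author:anonymous) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_placeholder title "$opf_title" succeeding would be the cue to fall back to filename parsing.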

Author Name Normalization

Fix common capitalization issues from metadata:

  • Harold Mcgee → Harold McGee (fix surname caps)
  • CHARLES BUKOWSKI → Charles Bukowski (fix all-caps)
  • Preserve intentional lowercase particles: de Beauvoir, van Gogh
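The all-caps case is the easiest of these to automate. The sketch below, with the hypothetical name fix_all_caps, only rewrites names that contain no lowercase letters at all, so particles such as de or van in mixed-case names are left untouched. It does not repair surname capitals like McGee, and hyphenated all-caps names would need extra handling.

```shell
# Title-case a name only when it contains no lowercase letters.
fix_all_caps() {
    case "$1" in
        *[[:lower:]]*)
            # Mixed case already: leave as-is to preserve intentional lowercase.
            printf '%s\n' "$1" ;;
        *)
            printf '%s\n' "$1" | awk '{
                for (i = 1; i <= NF; i++)
                    $i = toupper(substr($i, 1, 1)) tolower(substr($i, 2))
                print
            }' ;;
    esac
}
```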

Use the opf:file-as attribute if present; it often holds the correct citation form:

<dc:creator opf:file-as="McGee, Harold">Harold Mcgee</dc:creator>
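One way to use that attribute is to read it and flip "Family, Given" back into display order. This is a sketch assuming a single creator per line and the simple "Family, Given" shape; creator_from_file_as is a hypothetical name.

```shell
# Read a <dc:creator ...> line on stdin and print the name from opf:file-as
# in "Given Family" order. Returns 1 if the attribute is absent.
creator_from_file_as() {
    file_as=$(sed -n 's/.*opf:file-as="\([^"]*\)".*/\1/p' | head -n 1)
    case "$file_as" in
        '')
            return 1 ;;
        *,*)
            family=${file_as%%,*}    # text before the first comma
            given=${file_as#*,}      # text after the first comma
            given=${given#" "}       # drop one leading space
            printf '%s %s\n' "$given" "$family" ;;
        *)
            # No comma: use the value unchanged.
            printf '%s\n' "$file_as" ;;
    esac
}
```

With the creator line shown above, this would print Harold McGee, taking the capitalization from the attribute instead of the element text.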

Translator Verification

Translator info in filenames is often wrong or refers to different editions.

  • Verify translator matches publisher/edition (e.g., Penguin 2004 Marx = Fowkes, not Guyer-Wood)
  • If translator not in metadata, check ISBN against known editions
  • When uncertain, ask user

Missing Data

When metadata is genuinely unavailable:

  • No author: Ask user
  • No year: Ask user for "publication year of this edition"
  • Ambiguous title: Ask user
  • No translator but work is translated: Ask user or look up by ISBN/publisher

Do not guess. Do not proceed without author and title.
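The last rule can be enforced with a simple guard before any later step runs. This is an illustrative sketch; require_author_and_title is a hypothetical name and the message wording is an assumption.

```shell
# Succeed only when both author ($1) and title ($2) are non-empty.
# On failure, report what is missing on stderr so the user can be asked.
require_author_and_title() {
    missing=""
    [ -n "$1" ] || missing="author"
    [ -n "$2" ] || missing="${missing:+$missing and }title"
    if [ -n "$missing" ]; then
        echo "Missing $missing: ask the user before proceeding." >&2
        return 1
    fi
    return 0
}
```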

Source

git clone https://github.com/brege/dewey-decimal-skill

Skill file: https://github.com/brege/dewey-decimal-skill/blob/main/skills/metadata-extraction/SKILL.md

Overview

This skill pulls bibliographic metadata from EPUB OPF, PDF properties, or filenames. It prioritizes EPUB OPF as the authoritative source, then PDF metadata, and finally filename hints or user input for disambiguation.

How This Skill Works

It first locates the EPUB OPF path from META-INF/container.xml and extracts dc:creator, dc:title, dc:date, and dc:contributor (with roles). If EPUB data is missing, it falls back to PDF metadata via pdfinfo, and as a last resort parses the filename. It also applies author name normalization and uses opf:file-as when present to improve citation format.

When to Use It

  • Extract bibliographic data from an EPUB's OPF metadata when available
  • Fallback to PDF document properties when EPUB metadata is missing or incomplete
  • Parse the filename as a last resort when embedded metadata is unreliable
  • Disambiguate authors, titles, or translators with user input
  • Ensure edition year reflects the publication year of this edition, not the original work

Quick Start

  1. Locate the EPUB OPF path: run unzip -p "$file" META-INF/container.xml and extract the full-path attribute
  2. Extract metadata with unzip -p "$file" "$opf_path" | grep -E '<dc:(creator|title|date|contributor)' and map the fields
  3. If EPUB data is insufficient, run pdfinfo "$file" to fetch Title, Author, and CreationDate, or parse the filename as a last resort; apply author normalization and use opf:file-as when present

Best Practices

  • Check EPUB OPF first for authoritative data
  • If OPF is missing, rely on PDF properties before resorting to filename parsing
  • Normalize author names and use opf:file-as when available
  • Verify translator/editor data against publisher/edition (e.g., ISBN cross-checks)
  • Do not guess when data is missing; prompt the user for missing author, title, or year

Example Use Cases

  • Example 1: An EPUB with valid OPF metadata yields Author: Charles Bukowski; Title: Love Is A Dog From Hell; Year: 2007 (from dc:creator, dc:title, dc:date)
  • Example 2: A PDF with Title and Author in its document properties and a CreationDate of 2018-05-03 yields those fields plus Year: 2018
  • Example 3: Filename only fallback: Charles.Bukowski.-.Love.Is.A.Dog.From.Hell.2007.RETAIL.EPUB.eBook-CTO.epub → Author: Charles Bukowski; Title: Love Is A Dog From Hell; Year: 2007
  • Example 4: Edition year clarification: Aristotle's Nicomachean Ethics (2019 edition with translation) → use 2019 as the publication year
  • Example 5: Missing data case: No author or title in OPF or PDF; rely on filename parse or prompt user for missing data
