email-receipt-scanning
npx machina-cli add skill peerjakobsen/smartspender/email-receipt-scanning --openclaw
Email Receipt Scanning
Purpose
Provides Gmail search queries, Danish sender patterns, email content type detection, and deduplication rules for scanning a user's inbox for receipts and invoices. Used by the /smartspender:receipt email command.
Prerequisites
This skill requires the Gmail MCP server to be configured in the user's environment. The following MCP tools are needed:
- `gmail_search` or equivalent: search emails by query
- `gmail_get_message` or equivalent: read email content and metadata
- `gmail_get_attachment` or equivalent: download PDF attachments
If Gmail MCP is not available, commands using this skill should fail gracefully with a clear message.
Gmail Search Queries
Primary Search Query
Search for receipt and invoice emails using this combined query:
subject:(faktura OR kvittering OR receipt OR invoice OR ordre OR order OR betaling) has:attachment after:{YYYY/MM/DD}
Where {YYYY/MM/DD} is either:
- The `last_email_scan` timestamp from settings.csv (incremental scan)
- A calculated date based on the `days` argument (e.g., 30 days back)
- Default: 90 days back if no previous scan and no argument
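As a minimal sketch, the primary query string can be assembled from the keyword list and a start date. The function name `build_primary_query` and the `KEYWORDS` constant are illustrative and not part of the skill itself.

```python
from datetime import date

# Subject keywords taken from the primary search query above.
KEYWORDS = ["faktura", "kvittering", "receipt", "invoice", "ordre", "order", "betaling"]


def build_primary_query(after: date) -> str:
    """Return the primary Gmail search string for emails received after `after`."""
    subject = " OR ".join(KEYWORDS)
    # Gmail's after: operator expects the YYYY/MM/DD form used in this document.
    return f"subject:({subject}) has:attachment after:{after:%Y/%m/%d}"
```

Calling `build_primary_query(date(2026, 1, 15))` produces the same query string shown in Example 1.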
Supplementary Queries
If the primary query returns few results, also try:
from:(*faktura* OR *invoice* OR *noreply* OR *no-reply*) has:attachment after:{YYYY/MM/DD}
subject:(ordrebekraeftelse OR orderbekraeftelse OR betalingsbekraeftelse) after:{YYYY/MM/DD}
Query Notes
- `has:attachment` ensures only emails with files are returned (most invoices are PDF attachments)
- Danish keywords (`faktura`, `kvittering`, `betaling`) catch Danish vendor emails
- English keywords (`receipt`, `invoice`, `order`) catch international vendors
- Date filter prevents re-scanning old emails
Danish Sender Patterns
Map known email sender domains to vendor IDs for faster vendor detection:
| Sender Pattern | Vendor ID | Vendor Name | Type |
|---|---|---|---|
| `*@tdc.dk` | tdc | TDC | Telecom |
| `*@tdcnet.dk` | tdc | TDC | Telecom |
| `*@telenor.dk` | telenor | Telenor | Telecom |
| `*@telia.dk` | telia | Telia | Telecom |
| `*@orsted.dk` | orsted | Oersted | Electricity |
| `*@hofor.dk` | hofor | HOFOR | Water |
| `*@norlys.dk` | norlys | Norlys | Electricity |
| `*@ewii.dk` | ewii | EWII | Utility |
| `*@dinenergi.dk` | dinenergi | Din Energi | Electricity |
| `*@netflix.com` | netflix | Netflix | Streaming |
| `*@spotify.com` | spotify | Spotify | Streaming |
| `*@amazon.com` | amazon | Amazon | Online order |
| `*@amazon.de` | amazon | Amazon | Online order |
| `*@zalando.dk` | zalando | Zalando | Online order |
| `*@ikea.com` | ikea | IKEA | Online order |
| `*@wolt.com` | wolt | Wolt | Delivery |
| `*@nemlig.com` | nemlig | Nemlig | Delivery |
For senders not in this table: extract the domain name as a starting point for vendor detection, then fall through to the invoice-parsing skill's vendor detection workflow.
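The lookup described above might be sketched as a dictionary keyed by sender domain. The helper name `vendor_from_sender` and its return shape are assumptions for illustration. The sketch matches the domain exactly, so subdomains such as `mail.tdc.dk` would fall through as unknown.

```python
from email.utils import parseaddr

# Sender domain -> vendor ID, copied from the table above.
SENDER_VENDORS = {
    "tdc.dk": "tdc", "tdcnet.dk": "tdc", "telenor.dk": "telenor",
    "telia.dk": "telia", "orsted.dk": "orsted", "hofor.dk": "hofor",
    "norlys.dk": "norlys", "ewii.dk": "ewii", "dinenergi.dk": "dinenergi",
    "netflix.com": "netflix", "spotify.com": "spotify",
    "amazon.com": "amazon", "amazon.de": "amazon", "zalando.dk": "zalando",
    "ikea.com": "ikea", "wolt.com": "wolt", "nemlig.com": "nemlig",
}


def vendor_from_sender(from_header: str) -> tuple[str, bool]:
    """Return (vendor_id_or_domain, is_known) for a From header value."""
    _, address = parseaddr(from_header)
    domain = address.rpartition("@")[2].lower()
    if domain in SENDER_VENDORS:
        return SENDER_VENDORS[domain], True
    # Unknown sender: hand the bare domain to the vendor detection workflow.
    return domain, False
```

For unknown senders the second element is `False`, which signals that the caller should continue with the regular vendor detection workflow.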
Email Content Types
Receipt emails come in three forms. Detect and handle each:
| Content Type | Detection | Extraction Method |
|---|---|---|
| PDF attachment | Email has .pdf attachment | Download attachment → process as PDF invoice |
| Inline HTML | Email body contains structured receipt data, no PDF | Extract from email HTML body |
| Both | PDF attachment + summary in body | Prefer PDF attachment (more complete) |
PDF Attachment Priority
When an email has both a PDF attachment and inline content, always process the PDF. The inline content is typically a summary or notification, while the PDF is the full invoice.
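The three content types and the PDF-first rule could be expressed roughly as follows. `ContentType`, `classify`, and `extraction_source` are hypothetical names. Whether the body contains structured receipt data is assumed to be decided elsewhere and passed in as a flag.

```python
from enum import Enum
from typing import Optional


class ContentType(Enum):
    PDF = "pdf"
    INLINE_HTML = "inline_html"
    BOTH = "both"


def classify(attachment_names: list[str], body_has_receipt_data: bool) -> Optional[ContentType]:
    """Classify an email by its attachments and body; None means no receipt content."""
    has_pdf = any(name.lower().endswith(".pdf") for name in attachment_names)
    if has_pdf and body_has_receipt_data:
        return ContentType.BOTH
    if has_pdf:
        return ContentType.PDF
    if body_has_receipt_data:
        return ContentType.INLINE_HTML
    return None


def extraction_source(content_type: ContentType) -> str:
    """Prefer the PDF whenever one exists, since it is usually the full invoice."""
    if content_type in (ContentType.PDF, ContentType.BOTH):
        return "pdf_attachment"
    return "html_body"
```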
Inline HTML Extraction
For emails without PDF attachments (e.g., Wolt order confirmations, Nemlig receipts):
- Parse the email HTML body
- Look for structured tables with item names, quantities, prices
- Extract total from summary section
- Set `file_reference` to `email:{message_id}` (no file to archive)
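One way to approach the steps above is to collect table rows from the HTML body with the standard library parser and leave interpretation of the rows (items, quantities, totals) to a later step. This is a sketch only: real receipt emails vary widely in markup, and the names `TableRows`, `extract_rows`, and `file_reference` are illustrative.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard any cell left unclosed


def extract_rows(html: str) -> list[list[str]]:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def file_reference(message_id: str) -> str:
    """Inline receipts have no archived file, so reference the message itself."""
    return f"email:{message_id}"
```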
Deduplication
Timestamp-Based Scan Window
- Read `last_email_scan` from settings.csv
- Only search emails received after this timestamp
- After successful scan, update `last_email_scan` to current datetime
Cross-Check with receipts.csv
Before processing each email:
- Detect vendor and date from email metadata
- Extract total (from subject line or quick body scan)
- Check receipts.csv for existing receipt with same date + merchant + total_amount
- If match found: skip and note in scan summary as "allerede registreret"
Deduplication Fields
| Field | Source | Match Rule |
|---|---|---|
| date | Email date or invoice date | Same day |
| merchant | Sender domain mapping or vendor detection | Same normalized merchant |
| total_amount | PDF extraction or email body | Exact match |
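The cross-check against receipts.csv might look like the sketch below. It assumes the file has `date`, `merchant`, and `total_amount` columns, that dates are ISO strings (`YYYY-MM-DD`, optionally followed by a time), and that amounts use a dot as the decimal separator. The actual schema is defined in the data-schemas skill and may differ. The normalization shown (lowercasing and collapsing whitespace) is likewise only one possible choice.

```python
import csv
import io
from decimal import Decimal


def normalize_merchant(name: str) -> str:
    """Lowercase and collapse whitespace so 'TDC ' and 'tdc' compare equal."""
    return " ".join(name.lower().split())


def is_duplicate(receipts_csv, date: str, merchant: str, total_amount: str) -> bool:
    """Return True if a row matches on same day, normalized merchant, and exact total."""
    for row in csv.DictReader(receipts_csv):
        same_day = row["date"][:10] == date
        same_merchant = normalize_merchant(row["merchant"]) == normalize_merchant(merchant)
        # Decimal comparison treats 299 and 299.00 as the same amount.
        same_total = Decimal(row["total_amount"]) == Decimal(total_amount)
        if same_day and same_merchant and same_total:
            return True
    return False


# Usage with an in-memory file standing in for receipts.csv:
sample = io.StringIO("date,merchant,total_amount\n2026-01-20,TDC,299.00\n")
already_registered = is_duplicate(sample, "2026-01-20", "tdc", "299")
```

Comparing amounts as `Decimal` rather than `float` avoids rounding surprises when the match rule is exact equality.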
Date Range Calculation
| Scenario | Date Range |
|---|---|
| `last_email_scan` exists in settings.csv | From `last_email_scan` to now |
| User provides `days` argument | From (today - days) to now |
| Neither (first scan) | From (today - 90 days) to now |
If the user provides a days argument, it overrides last_email_scan. This allows rescanning a specific period.
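The precedence in the table can be sketched as a small function. `resolve_scan_start` is a hypothetical name, and `last_email_scan` is treated here as a plain date for simplicity, although the stored value may carry a time component.

```python
from datetime import date, timedelta
from typing import Optional

DEFAULT_DAYS = 90  # first-scan window from the table above


def resolve_scan_start(
    today: date,
    last_email_scan: Optional[date] = None,
    days: Optional[int] = None,
) -> date:
    """Pick the start of the scan window; an explicit `days` argument wins."""
    if days is not None:
        return today - timedelta(days=days)
    if last_email_scan is not None:
        return last_email_scan
    return today - timedelta(days=DEFAULT_DAYS)
```

Checking `days` first is what allows a user to rescan a period that an earlier incremental scan has already covered.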
Email Filtering Heuristics
Not every email matching the search query is a receipt. Apply these filters:
Include
- Emails with PDF attachments from known vendor domains
- Emails with "faktura" or "kvittering" in subject
- Order confirmation emails with itemized totals
Exclude
- Marketing emails (subject contains "tilbud", "kampagne", "nyhedsbrev" without "faktura"/"kvittering")
- Password reset or account notification emails
- Shipping notifications without invoice content
- Emails already processed (deduplication check)
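The marketing exclusion rule can be sketched as a subject-line check. The function name `is_marketing` is illustrative. The other exclusions (password resets, shipping notices) would need their own signals and are not covered here.

```python
MARKETING_WORDS = ("tilbud", "kampagne", "nyhedsbrev")
RECEIPT_WORDS = ("faktura", "kvittering")


def is_marketing(subject: str) -> bool:
    """True when the subject has a marketing keyword and no receipt keyword."""
    lowered = subject.lower()
    has_marketing = any(word in lowered for word in MARKETING_WORDS)
    has_receipt = any(word in lowered for word in RECEIPT_WORDS)
    return has_marketing and not has_receipt
```

A subject such as "Faktura og tilbud" is kept, because a receipt keyword overrides the marketing keyword.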
Examples
Example 1: Incremental Scan
Context: last_email_scan = 2026-01-15 in settings.csv
Search query: subject:(faktura OR kvittering OR receipt OR invoice OR ordre OR order OR betaling) has:attachment after:2026/01/15
Results: 4 emails found
- TDC faktura (2026-01-20) — PDF attachment — known vendor
- Oersted aarsopgoerelse (2026-01-25) — PDF attachment — known vendor
- Wolt ordrebekraeftelse (2026-01-28) — inline HTML — no parser
- Spam email about "tilbud" — filtered out
After processing: Update last_email_scan to 2026-02-01
Example 2: First-Time Scan with Days Argument
Context: No last_email_scan in settings.csv. User runs /smartspender:receipt email 30
Search query: subject:(faktura OR kvittering OR receipt OR invoice OR ordre OR order OR betaling) has:attachment after:2026/01/02
Results: Emails from the last 30 days
Related Skills
- See `skills/document-parsing/SKILL.md` for vendor detection, parser lookup workflow, and general extraction rules
- See `skills/data-schemas/SKILL.md` for the CSV file structure (email receipts use `source: email`)
Source
git clone https://github.com/peerjakobsen/smartspender/blob/main/skills/email-receipt-scanning/SKILL.md
Overview
Email Receipt Scanning identifies receipts and invoices in a user's Gmail by applying dedicated search queries, Danish sender patterns, and content-type detection. It includes deduplication rules and is used by the /smartspender:receipt email command to streamline expense tracking.
How This Skill Works
It relies on Gmail MCP tools (gmail_search, gmail_get_message, gmail_get_attachment) to locate emails, read content, and download PDFs. It uses a primary search query for receipts and invoices with attachments after a date, plus supplementary queries if needed. When both a PDF and inline content exist, the PDF is given priority because it typically contains the full invoice.
When to Use It
- When scanning a Gmail inbox for new receipts and invoices since the last run
- When dealing with Danish vendors that use faktura or kvittering keywords
- When emails include PDF attachments that are likely invoices
- When receipts arrive as inline HTML without attachments
- When encountering an unknown vendor, relying on domain-based sender patterns to kick off vendor detection
Quick Start
- Step 1: Ensure Gmail MCP server is configured with gmail_search, gmail_get_message, and gmail_get_attachment
- Step 2: Run /smartspender:receipt email to start scanning
- Step 3: Validate results and review deduplication; ensure last_email_scan is updated
Best Practices
- Keep last_email_scan in settings.csv to drive incremental scans
- Start with the primary search query; fall back to supplementary queries if results are sparse
- Process PDFs first when both PDF and inline content exist
- Configure Gmail MCP tools (gmail_search, gmail_get_message, gmail_get_attachment) and handle errors gracefully if unavailable
- Map known Danish sender patterns to vendor IDs to speed up vendor detection; otherwise fall back to domain-based parsing
Example Use Cases
- A Danish telecom invoice from TDC with a PDF attachment is found via the primary query and parsed
- Nemlig receipts are inline HTML without PDFs and are parsed from the email body
- Netflix receipts arrive from netflix.com and are detected via sender pattern
- Wolt order confirmations without a PDF are parsed from the inline HTML body, with file_reference set to email:{message_id}
- Unknown vendor emails are handled by domain-based detection to route to the vendor-detection workflow