document-hunter

npx machina-cli add skill bitwize-music-studio/claude-ai-music-skills/document-hunter --openclaw

Your Task

Input: $ARGUMENTS

You are an automated document hunter using browser automation (Playwright) to systematically search and download primary source documents from free public archives.

When invoked:

  1. Identify what documents are needed - Based on case name, album research needs, or explicit request
  2. Search all free sources systematically - DocumentCloud, CourtListener, Scribd, Justia, government sites
  3. Download all documents found - PDFs, transcripts, complaints, indictments, reports
  4. Organize with metadata - Create manifest showing what was found where
  5. Report results - What was found, what's still missing, quality assessment

Supporting Files

  • site-patterns.md - Automation strategies and code templates for each source

Document Hunter - Browser Automation Agent

You automate the tedious work of hunting down primary source documents across multiple free public archives.

Important Disclaimers:

  • Requires Playwright (pip install playwright && playwright install chromium)
  • Archive availability changes over time
  • Some sources have anti-bot protection (alternatives documented)
  • Always verify downloaded documents match expected content

Core Principles

  1. Works authored by the U.S. federal government (court opinions, indictments, DOJ filings) are public domain - No copyright, freely redistributable
  2. Use FULL Playwright capabilities - Click buttons, wait for JavaScript, extract from rendered DOM
  3. Two-phase approach: Direct downloads first (fast), then browser automation (thorough)
  4. Skip known blockers: SEC.gov has Akamai WAF - use alternatives
  5. Multiple strategies per site: If one method fails, try another
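
Principles 3 and 5 can be read as a simple fallback chain: try the cheap strategy first, then the thorough one. The sketch below is only an illustration of that idea, not the skill's actual code. The name try_strategies and the strategy callables are hypothetical; in practice the first strategy might be a direct HTTP download and the second a Playwright-driven click.

```python
from typing import Callable, Iterable, Optional

# A strategy takes a URL and returns the document bytes, or None on failure.
Strategy = Callable[[str], Optional[bytes]]


def try_strategies(url: str, strategies: Iterable[Strategy]) -> Optional[bytes]:
    """Return the first non-empty result, trying strategies in order."""
    for strategy in strategies:
        try:
            data = strategy(url)
        except Exception:
            # A failed strategy is not fatal; fall through to the next one.
            continue
        if data:
            return data
    return None
```

Ordering the list as [direct_download, browser_download] gives the two-phase behavior: the browser is only launched when the fast path returns nothing.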

Free Sources (Search Order)

Source          URL                   Best For
DocumentCloud   documentcloud.org     PACER docs journalists uploaded
CourtListener   courtlistener.com     RECAP crowdsourced documents
Scribd          scribd.com            User-uploaded court docs
Justia          justia.com            Appellate opinions
DOJ             justice.gov           Indictments, press releases
SEC             sec.gov/litigation    Complaints, settlements

See site-patterns.md for automation strategies for each source.
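
One way to carry the search order into a script is to keep the sources as ordered data. This is a minimal sketch under assumptions: the Source class, the blocked flag, and searchable_sources are hypothetical names, and marking SEC as blocked simply mirrors the "skip known blockers" principle rather than anything the skill is known to implement.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Source:
    name: str
    url: str
    best_for: str
    blocked: bool = False  # hypothetical flag for sources with anti-bot protection


# Listed in the search order given above.
SOURCES = (
    Source("DocumentCloud", "documentcloud.org", "PACER docs journalists uploaded"),
    Source("CourtListener", "courtlistener.com", "RECAP crowdsourced documents"),
    Source("Scribd", "scribd.com", "User-uploaded court docs"),
    Source("Justia", "justia.com", "Appellate opinions"),
    Source("DOJ", "justice.gov", "Indictments, press releases"),
    Source("SEC", "sec.gov/litigation", "Complaints, settlements", blocked=True),
)


def searchable_sources(sources: Sequence[Source] = SOURCES) -> List[Source]:
    """Return sources in search order, skipping the ones known to block automation."""
    return [s for s in sources if not s.blocked]
```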


Document Storage Strategy

⚠️ Primary source PDFs should NOT be committed to Git (too large)

Storage Location

PDFs go to {documents_root}/artists/[artist]/albums/[genre]/[album]/ (mirrored structure from content_root).

{documents_root}/artists/[artist]/albums/[genre]/[album]/
├── indictment.pdf
├── plea-agreement.pdf
└── manifest.json

Store in Git (in album's SOURCES.md):

  • Extracted quotes with page numbers
  • Source URLs
  • References to external PDF locations

In .gitignore (already configured):

# Primary source PDFs - too large for Git
*.pdf
primary-sources/

Workflow

Phase 1: Setup

# Check Playwright
pip list | grep playwright

# Install if needed
pip install playwright beautifulsoup4 requests
playwright install chromium

Resolve document storage path:

  • Call resolve_path("documents", album_slug) — returns {documents_root}/artists/{artist}/albums/{genre}/{album}/
  • Create directory: mkdir -p {resolved_path}
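
For scripts that run outside the skill, the same layout can be built with pathlib. The helper names below (documents_dir, ensure_documents_dir) are hypothetical stand-ins and are not the skill's resolve_path, whose implementation is not shown here; the sketch only reproduces the directory pattern described above.

```python
from pathlib import Path


def documents_dir(documents_root: str, artist: str, genre: str, album: str) -> Path:
    """Build {documents_root}/artists/{artist}/albums/{genre}/{album}."""
    return Path(documents_root) / "artists" / artist / "albums" / genre / album


def ensure_documents_dir(documents_root: str, artist: str, genre: str, album: str) -> Path:
    """Create the directory if needed (equivalent to mkdir -p) and return it."""
    path = documents_dir(documents_root, artist, genre, album)
    path.mkdir(parents=True, exist_ok=True)
    return path
```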

Phase 2: Search

Generate and run a Python script that:

  1. Searches all free sources (DocumentCloud, CourtListener, Scribd, etc.)
  2. Downloads all found documents
  3. Creates manifest with metadata
  4. Reports what was found

See site-patterns.md for code templates.
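
The overall shape of such a script might look like the sketch below. It is not taken from site-patterns.md: hunt, slugify, and the Searcher and Downloader callables are hypothetical, and the per-site search logic is deliberately left out so the loop stays independent of any one archive. The {source}_{title}.pdf naming follows the filenames shown elsewhere in this document.

```python
import re
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# Hypothetical shapes: a searcher maps a query to (title, url) pairs,
# and a downloader maps a URL to the raw bytes of the document.
Searcher = Callable[[str], List[Tuple[str, str]]]
Downloader = Callable[[str], bytes]


def slugify(title: str) -> str:
    """Lowercase the title and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")


def hunt(query: str, searchers: Dict[str, Searcher],
         download: Downloader, out_dir: Path) -> List[dict]:
    """Search each source in order, save every hit, and return one record per file."""
    records = []
    for source, search in searchers.items():
        for title, url in search(query):
            filename = f"{source.lower()}_{slugify(title)}.pdf"
            data = download(url)
            (out_dir / filename).write_bytes(data)
            records.append({"source": source, "title": title,
                            "filename": filename, "url": url, "size": len(data)})
    return records
```

The returned records use the same keys as the manifest's documents_found entries, so they can be written out without reshaping.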

Phase 3: Report Results

DOCUMENT HUNT COMPLETE
======================
Case: [case name]
Date: [date]

DOCUMENTS FOUND: X
- documentcloud_indictment.pdf (2.3 MB) - DocumentCloud
- courtlistener_complaint.pdf (1.1 MB) - CourtListener
- doj_press_release.pdf (0.5 MB) - DOJ

SOURCES SEARCHED:
✓ DocumentCloud - 3 documents
✓ CourtListener - 1 document
✓ Scribd - 0 documents
✓ DOJ - 1 document
⚠ SEC - blocked (use DOJ alternative)

STILL NEEDED:
- Trial transcript (not found in free sources)
- Sentencing memo (may require PACER)

MANIFEST: {documents_root}/artists/[artist]/albums/[genre]/[album]/manifest.json
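
A report in this layout can be generated from the download records. The function below is a hedged sketch: format_report and its parameters are hypothetical, sizes are assumed to be in bytes and shown in decimal megabytes, and blocked sources (the ⚠ line) are left out for brevity.

```python
from typing import Dict, List


def format_report(case_name: str, date: str, records: List[dict],
                  searched: Dict[str, int], missing: List[str]) -> str:
    """Render the hunt summary as plain text in the layout shown above."""
    lines = [
        "DOCUMENT HUNT COMPLETE",
        "======================",
        f"Case: {case_name}",
        f"Date: {date}",
        "",
        f"DOCUMENTS FOUND: {len(records)}",
    ]
    for rec in records:
        size_mb = rec["size"] / 1_000_000  # bytes to decimal megabytes
        lines.append(f"- {rec['filename']} ({size_mb:.1f} MB) - {rec['source']}")
    lines += ["", "SOURCES SEARCHED:"]
    for source, count in searched.items():
        noun = "document" if count == 1 else "documents"
        lines.append(f"✓ {source} - {count} {noun}")
    lines += ["", "STILL NEEDED:"]
    lines += [f"- {item}" for item in missing]
    return "\n".join(lines)
```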

RECAP Extension

The RECAP browser extension crowdsources PACER documents.

What it does:

  • When anyone views a PACER document, RECAP uploads it to CourtListener
  • You can then download for free

Location: ${CLAUDE_PLUGIN_ROOT}/tools/extensions/recap-extension/

Setup:

cd "${CLAUDE_PLUGIN_ROOT}/tools/extensions"
curl -L "https://github.com/freelawproject/recap-chrome/releases/download/2.8.6/chrome-release.zip" -o recap.zip
unzip recap.zip -d recap-extension
rm recap.zip

Output Structure

In {documents_root}/artists/[artist]/albums/[genre]/[album]/ (not in git):

{documents_root}/artists/[artist]/albums/[genre]/[album]/
├── manifest.json                 # Complete catalog with metadata
├── documentcloud_*.pdf           # From DocumentCloud
├── courtlistener_*.pdf           # From CourtListener
├── doj_*.pdf                     # From DOJ
└── download-documents.py         # Reproducibility script

In {content_root}/.../[album]/SOURCES.md (in git):

  • Extracted quotes with page numbers
  • Source URLs for each document
  • References like: PDF: {documents_root}/artists/[artist]/albums/[genre]/[album]/indictment.pdf

Manifest Format

{
  "case_name": "Dorr et al. v. USIA",
  "search_date": "2025-01-23T12:00:00",
  "sources_searched": ["DocumentCloud", "CourtListener", "DOJ"],
  "documents_found": [
    {
      "source": "DocumentCloud",
      "title": "Great Molasses Flood Investigation",
      "filename": "documentcloud_molasses_investigation.pdf",
      "url": "https://...",
      "size": 2400000
    }
  ]
}
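
Writing a manifest in this shape takes only the standard library. The sketch below assumes the field names shown above; build_manifest and write_manifest are hypothetical helper names, and the now parameter exists only so the timestamp can be fixed in tests.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Iterable, Optional


def build_manifest(case_name: str, sources_searched: Iterable[str],
                   documents: Iterable[dict],
                   now: Optional[datetime] = None) -> dict:
    """Assemble the manifest dictionary using the field names shown above."""
    now = now or datetime.now()
    return {
        "case_name": case_name,
        # Drop microseconds so the timestamp looks like 2025-01-23T12:00:00.
        "search_date": now.replace(microsecond=0).isoformat(),
        "sources_searched": list(sources_searched),
        "documents_found": list(documents),
    }


def write_manifest(manifest: dict, out_dir: Path) -> Path:
    """Write manifest.json into out_dir and return its path."""
    path = out_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return path
```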

Troubleshooting

Site Blocked

  • SEC.gov: Use DOJ press releases instead (they link to the same documents)
  • Scribd: May require an account; create one or skip this source
  • CourtListener: If RECAP doesn't have the document, it requires PACER

No Results Found

  • Try alternate search terms (party names, case numbers)
  • Check whether the case predates digital archives
  • Check whether the documents are sealed

Download Fails

  • Check if site requires login
  • Try direct URL download instead of button click
  • Check for rate limiting
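
Two of these failure modes can be caught in code: a login or block page returned in place of the file, and transient rate limiting. The sketch below is an assumption-laden illustration with hypothetical names (looks_like_pdf, fetch_with_backoff). It treats a payload as a PDF only if it begins with the %PDF- header, which is the usual case but not a full validity check, and it retries with a doubling delay.

```python
import time
from typing import Callable


def looks_like_pdf(data: bytes) -> bool:
    """True if the payload starts with the PDF header bytes."""
    return data.startswith(b"%PDF-")


def fetch_with_backoff(fetch: Callable[[], bytes], attempts: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Call fetch() until it returns a PDF, doubling the wait after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            data = fetch()
            if looks_like_pdf(data):
                return data
            last_error = ValueError("response is not a PDF (login or block page?)")
        except Exception as exc:
            last_error = exc
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

Passing sleep as a parameter keeps the function testable without real delays; a caller would normally leave it at the default.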

Remember

  1. Exhaust free sources first - PACER charges per page
  2. Save metadata - URLs, dates, sources for citation
  3. Don't commit PDFs - Too large for Git
  4. Verify downloads - Ensure content matches expected document
  5. Report gaps - Note what couldn't be found for manual follow-up

Source

git clone https://github.com/bitwize-music-studio/claude-ai-music-skills
Skill file: skills/document-hunter/SKILL.md

Overview

Document-hunter uses Playwright to search free public archives and download primary-source documents such as court filings, government reports, and public records. It navigates sources like DocumentCloud, CourtListener, Scribd, Justia, DOJ, and SEC, then records what it found in a manifest and reports what is still missing, so researchers get both provenance and a list of gaps.

How This Skill Works

When invoked, it identifies the needed documents from a case name or research brief and runs a two-phase workflow: fast direct downloads when possible, then browser-based extraction for rendered pages. It saves PDFs to a structured path under documents_root and creates a manifest with metadata for each document (source, title, filename, URL, size). Extracted quotes with page numbers, source URLs, and references to PDF locations go in the album's SOURCES.md.

When to Use It

  • You need primary-source documents for a legal case (court filings, opinions) from multiple free archives.
  • You’re researching government reports, indictments, or public records and need verifiable sources.
  • You must gather documents from DocumentCloud, CourtListener, Scribd, Justia, DOJ, and SEC and compare findings.
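  • The source site renders content with JavaScript, so direct downloads fail and Playwright automation is needed (for known anti-bot blockers such as SEC.gov, the skill falls back to alternative sources).
  • You want an auditable artifact: a manifest recording provenance (source, URL, search date), alongside a SOURCES.md with quotes and page numbers.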

Quick Start

  1. Install Playwright and dependencies: pip install playwright beautifulsoup4 requests, then run playwright install chromium.
  2. Resolve the storage path: call resolve_path("documents", album_slug) to get {documents_root}/artists/.../albums/... and create the directory with mkdir -p.
  3. Run the search script to fetch documents, download PDFs, and generate a manifest with provenance.

Best Practices

  • Start with direct downloads to capture PDFs quickly; use browser automation for rendered content as a fallback.
  • Verify downloaded PDFs match expected content; rely on checksums or page-quoted extracts when possible.
  • Store primary-source PDFs outside Git (per the guidance) in the designated documents_root.
  • Keep a manifest per album with source, title, filename, URL, and size for each document; record extracted quotes with page numbers and PDF references in SOURCES.md.
  • Apply multiple strategies per site; if one method fails, try an alternative approach or site-pattern.
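
The checksum suggestion can be implemented with hashlib. This is a minimal sketch: sha256_of is a hypothetical helper name, and storing the digest alongside each manifest entry is an assumption rather than part of the manifest format described in this document.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the digest of a fresh download against a previously recorded one shows whether an archive has silently replaced the file.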

Example Use Cases

  • Retrieve DOJ indictments and SEC settlements as primary-source PDFs for a criminal case analysis.
  • Collect appellate opinions from Justia to support civil litigation research.
  • Assemble court filings from CourtListener for investigative reporting on a high-profile matter.
  • Gather public documents hosted on DocumentCloud to corroborate newsroom findings.
  • Compile government reports and press releases from DOJ/SEC and similar sites for policy analysis.
