media-processor
npx machina-cli add skill Vibe-Builders/claude-prime/media-processor --openclaw
Media Processor
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
Core Capabilities
Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice
Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
- OCR and text extraction
Video Analysis
- Scene detection and summarization
- Video Q&A with temporal understanding
- Transcription with visual descriptions
- YouTube URL support
- Long video processing (up to 6 hours)
- Frame-level analysis
Document Extraction
- Native PDF vision processing (up to 1,000 pages)
- Table and form extraction
- Chart and diagram analysis
- Multi-page document understanding
- Structured data output (JSON schema)
- Format conversion (PDF to HTML/JSON)
Image Generation
- Text-to-image generation
- High-quality generation variant (generate-hq) for detailed outputs
- Image editing and modification
- Multi-image composition (up to 3 images)
- Iterative refinement
- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
- Controllable style and quality
Supported Tasks
| Task | Description |
|---|---|
| transcribe | Audio/video transcription with timestamps |
| analyze | Image, video, or audio analysis with custom prompts |
| extract | Structured data extraction from PDFs and documents |
| generate | Text-to-image generation |
| generate-hq | High-quality image generation with enhanced detail |
See model-routing.md for model selection per task.
Quick Start
Prerequisites
API Key Setup: Supports both Google AI Studio and Vertex AI.
Set GEMINI_API_KEY via environment or .claude/skills/media-processor/.env.
Get API key: https://aistudio.google.com/apikey
For Vertex AI:
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1 # Optional
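The scripts presumably translate these variables into `genai.Client(...)` arguments (`api_key`, `vertexai`, `project`, and `location` are real google-genai parameters). The helper below is an illustrative sketch of that mapping, not the scripts' actual code:

```python
def build_client_kwargs(env: dict) -> dict:
    """Map the environment variables above to google-genai Client arguments.

    Returns kwargs suitable for genai.Client(**kwargs): Vertex AI settings
    when GEMINI_USE_VERTEX is "true", otherwise the plain API key.
    """
    if env.get("GEMINI_USE_VERTEX", "").lower() == "true":
        return {
            "vertexai": True,
            "project": env["VERTEX_PROJECT_ID"],
            # VERTEX_LOCATION is optional; fall back to the documented default
            "location": env.get("VERTEX_LOCATION", "us-central1"),
        }
    return {"api_key": env["GEMINI_API_KEY"]}

# Vertex AI mode: the default location is filled in when unset
kwargs = build_client_kwargs({
    "GEMINI_USE_VERTEX": "true",
    "VERTEX_PROJECT_ID": "my-project",
})
```

With real credentials you would then call `genai.Client(**kwargs)`.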
Install SDK:
pip install google-genai python-dotenv pillow
Common Patterns
Transcribe Audio:
python scripts/gemini_batch_process.py \
--files audio.mp3 \
--task transcribe
Analyze Image:
python scripts/gemini_batch_process.py \
--files image.jpg \
--task analyze \
--prompt "Describe this image" \
--output docs/assets/<output-name>.md
Process Video:
python scripts/gemini_batch_process.py \
--files video.mp4 \
--task analyze \
--prompt "Summarize key points with timestamps" \
--output docs/assets/<output-name>.md
Extract from PDF:
python scripts/gemini_batch_process.py \
--files document.pdf \
--task extract \
--prompt "Extract table data as JSON" \
--output docs/assets/<output-name>.md \
--format json
Generate Image:
python scripts/image_gen.py \
--prompt "A futuristic city at sunset" \
--output docs/assets/<output-file-name>.png \
--aspect-ratio 16:9
Generate High-Quality Image:
python scripts/image_gen.py \
--prompt "Detailed architectural blueprint of a modern house" \
--output docs/assets/<output-file-name>.png \
--mode generate-hq
Edit an Existing Image:
python scripts/image_gen.py \
--prompt "Make the sky sunset colors" \
--input photo.jpg \
--output docs/assets/edited.png
Convert Documents to Markdown:
# Convert DOCX to Markdown
python scripts/document_converter.py \
--input document.docx \
--output docs/assets/document.md
# Extract pages
python scripts/document_converter.py \
--input large.pdf \
--output docs/assets/chapter1.md \
--pages 1-20
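The `--pages` flag likely accepts range specs like `1-20` or comma-separated mixes such as `1,3,5-7`. A hypothetical reimplementation of that parsing, for reference (not the converter's actual code):

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a --pages spec like "1-20" or "1,3,5-7" into page numbers."""
    pages = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            # Ranges are inclusive on both ends
            pages.extend(range(int(lo), int(hi) + 1))
        else:
            pages.append(int(part))
    return pages
```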
Supported Formats
Audio
- WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
- Max 9.5 hours per request
- Auto-downsampled to 16 Kbps mono
Images
- PNG, JPEG, WEBP, HEIC, HEIF
- Max 3,600 images per request
- Resolution: <=384px in both dimensions = 258 tokens; larger images are tiled
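A rough token-cost estimate follows from the rule above. The 768x768 tile size matches current Gemini documentation, but treat this as an approximation (the API also rescales images before tiling), and the helper itself is illustrative:

```python
import math

def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate image token cost: a flat 258 tokens for small images,
    otherwise 258 tokens per 768x768 tile."""
    if width <= 384 and height <= 384:
        return 258
    tiles = math.ceil(width / 768) * math.ceil(height / 768)
    return tiles * 258

estimate_image_tokens(300, 300)    # small image: flat cost of 258
estimate_image_tokens(1920, 1080)  # 3 x 2 tiles at 258 tokens each
```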
Video
- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
- Max 6 hours (low-res) or 2 hours (default)
- YouTube URLs supported (public only)
Documents
- PDF only for vision processing
- Max 1,000 pages
- TXT, HTML, Markdown supported (text-only)
Size Limits
- Inline: <20MB total request
- File API: 2GB per file, 20GB project quota
- Retention: 48 hours auto-delete
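The limits above imply a simple pre-flight decision: send small requests inline, route larger files through the File API, and reject anything over the per-file cap. A minimal sketch (not the bundled scripts' actual logic), operating on byte counts:

```python
INLINE_LIMIT = 20 * 1024 * 1024   # <20 MB total request for inline data
FILE_API_LIMIT = 2 * 1024 ** 3    # 2 GB per file via the File API

def upload_strategy(sizes_bytes: list[int]) -> str:
    """Choose how to send media given the size limits above."""
    if any(s > FILE_API_LIMIT for s in sizes_bytes):
        raise ValueError("a file exceeds the 2 GB File API per-file limit")
    if sum(sizes_bytes) < INLINE_LIMIT:
        return "inline"
    return "file_api"

upload_strategy([5 * 1024 * 1024])    # small request: inline
upload_strategy([100 * 1024 * 1024])  # over 20 MB: use the File API
```

Remember that File API uploads are auto-deleted after 48 hours, so re-upload before reprocessing old media.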
Reference Navigation
For detailed implementation guidance, see:
| Reference | Description |
|---|---|
| audio-processing.md | Transcription, analysis, TTS, timestamps, multi-speaker |
| vision-understanding.md | Captioning, detection, segmentation, OCR, multi-image |
| video-analysis.md | Scene detection, YouTube, timestamps, long video |
| image-generation.md | Text-to-image, editing, composition, aspect ratios |
| model-routing.md | Model selection per task, pricing, context windows |
| document-extraction.md | PDF processing, table extraction, structured output |
| media-optimization.md | ffmpeg recipes for compressing/resizing before upload |
Scripts Overview
All scripts support unified API key detection and error handling:
gemini_batch_process.py: Batch process multiple media files
- Supports file-based modalities (audio, image, video, PDF)
- Tasks: transcribe, analyze, extract
- Progress tracking and error recovery
- Output formats: JSON, Markdown, CSV
- Rate limiting and retry logic
- Dry-run mode
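The rate limiting the batch script applies between API calls presumably looks something like a requests-per-minute throttle. A minimal sketch (the `RateLimiter` class is illustrative, not part of the script), with injectable clock/sleep so it can be tested deterministically:

```python
import time

class RateLimiter:
    """Space successive calls at least 60/rpm seconds apart."""

    def __init__(self, rpm: int, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        if self.last is not None:
            remaining = self.interval - (self.clock() - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()
```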
image_gen.py: Generate images from text prompts
- Modes: generate (standard) and generate-hq (high quality)
- Image editing with optional input image
- Aspect ratio control
- Retry logic and error handling
document_converter.py: Convert documents to PDF
- Convert DOCX, XLSX, PPTX to PDF
- Extract page ranges
- Optimize PDFs for Gemini
- Extract images from PDFs
- Batch conversion support
Run any script with --help for detailed usage.
Error Handling
Common errors and solutions:
- 400: Invalid format/size - validate before upload
- 401: Invalid API key - check configuration
- 403: Permission denied - verify API key restrictions
- 404: File not found - ensure file uploaded and active
- 429: Rate limit exceeded - implement exponential backoff
- 500: Server error - retry with backoff
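For the retryable cases (429, 500), exponential backoff looks roughly like the following. This is an illustrative pattern, not the exact logic in the bundled scripts; it assumes the raised exception carries a `status_code` attribute:

```python
import time

RETRYABLE = {429, 500, 503}

def with_backoff(call, max_attempts=5, base=1.0, sleep=time.sleep):
    """Retry `call` on retryable HTTP status codes, doubling the delay
    each attempt (1s, 2s, 4s, ...). Non-retryable errors re-raise at once."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status not in RETRYABLE or attempt == max_attempts - 1:
                raise
            sleep(base * 2 ** attempt)
```

Respecting a `Retry-After` header, when the API returns one, is preferable to a blind doubling schedule.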
Resources
Source
git clone https://github.com/Vibe-Builders/claude-prime
Skill file: .claude/skills/media-processor/SKILL.md
Overview
Media Processor handles audio, image, video, and document tasks using Google Gemini's multimodal API. It provides a single interface for transcription, analysis, data extraction, and image generation, supporting in-depth image study, chart data extraction, reading of dense screenshots, and artwork analysis.
How This Skill Works
Inputs are routed to Gemini's multimodal API through a unified interface supporting tasks like transcribe, analyze, extract, generate, and generate-hq. Outputs are structured (e.g., JSON) with details such as timestamps, bounding boxes, OCR, charts, or image data, enabling seamless integration into downstream workflows.
When to Use It
- Transcribe long-form audio or video with precise timestamps for accessibility or note-taking.
- Deeply analyze images or videos for UI design, artwork study, or data-related visuals using object detection, OCR, and VQA.
- Extract tables, forms, and structured data from multi-page PDFs and convert to JSON or HTML.
- Generate or edit high-quality images and compositions for design briefs, including multi-image layouts and multiple aspect ratios.
- Summarize long videos, perform frame-level analysis, or answer questions about video content (with YouTube URL support).
Quick Start
- Step 1: Configure API access by setting GEMINI_API_KEY and, if using Vertex AI, enable Vertex settings (GEMINI_USE_VERTEX, VERTEX_PROJECT_ID, VERTEX_LOCATION).
- Step 2: Install required SDKs: `pip install google-genai python-dotenv pillow`.
- Step 3: Run sample tasks such as transcribing audio, analyzing an image, processing a PDF extract, or generating an image with a prompt.
Best Practices
- Clearly define the task (transcribe, analyze, extract, generate, generate-hq) and expected output format before processing.
- For image/video analysis, provide explicit prompts or questions to guide detection, OCR, and VQA results.
- When handling PDFs, specify the pages, sections, and output schema (JSON/table data) to maximize structured results.
- Chunk very long inputs (audio/video) if needed to respect duration limits and maintain accuracy.
- Validate outputs against a defined schema (timestamps, bounding boxes, JSON tables) and perform post-processing as needed.
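The validation step above can be as simple as checking that every transcript line starts with a timestamp. A sketch of such a post-processing check, assuming an `[MM:SS]` / `[HH:MM:SS]` line format (the actual output format may differ):

```python
import re

TIMESTAMP = re.compile(r"^\[(\d{2}):(\d{2})(?::(\d{2}))?\]")

def validate_transcript(lines: list[str]) -> list[int]:
    """Return indices of lines missing a leading [MM:SS] or [HH:MM:SS]
    timestamp, so they can be flagged for reprocessing."""
    return [i for i, line in enumerate(lines) if not TIMESTAMP.match(line)]

bad = validate_transcript([
    "[00:03] Welcome everyone.",
    "Thanks for joining.",          # no timestamp: flagged
    "[01:02:15] Closing remarks.",
])
```

The same idea extends to bounding boxes (coordinates within image bounds) and JSON tables (validate against the requested schema).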
Example Use Cases
- Transcribe a conference talk and generate a timestamped caption file for accessibility.
- Analyze a UI mockup image to detect components, extract text, and generate descriptive alt text.
- Extract a price table and key metrics from a 500-page financial report into JSON for analytics.
- Generate a set of marketing images in 16:9 and 1:1, then iteratively refine styling and composition.
- Summarize a 3-hour training video with key points and timestamps, and answer user questions about specific sections.