media-processor
npx machina-cli add skill Vibe-Builders/claude-prime/media-processor --openclaw
Media Processor
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
Core Capabilities
Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice
Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
- OCR and text extraction
Video Analysis
- Scene detection and summarization
- Video Q&A with temporal understanding
- Transcription with visual descriptions
- YouTube URL support
- Long video processing (up to 6 hours)
- Frame-level analysis
Document Extraction
- Native PDF vision processing (up to 1,000 pages)
- Table and form extraction
- Chart and diagram analysis
- Multi-page document understanding
- Structured data output (JSON schema)
- Format conversion (PDF to HTML/JSON)
Image Generation
- Text-to-image generation
- High-quality generation variant (generate-hq) for detailed outputs
- Image editing and modification
- Multi-image composition (up to 3 images)
- Iterative refinement
- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
- Controllable style and quality
Supported Tasks
| Task | Description |
|---|---|
| transcribe | Audio/video transcription with timestamps |
| analyze | Image, video, or audio analysis with custom prompts |
| extract | Structured data extraction from PDFs and documents |
| generate | Text-to-image generation |
| generate-hq | High-quality image generation with enhanced detail |
See model-routing.md for model selection per task.
Quick Start
Prerequisites
API Key Setup: Supports both Google AI Studio and Vertex AI.
Set GEMINI_API_KEY via environment or .claude/skills/media-processor/.env.
Get API key: https://aistudio.google.com/apikey
For Vertex AI:
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1 # Optional
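The scripts presumably translate these variables into `genai.Client(...)` arguments (`api_key`, `vertexai`, `project`, and `location` are real google-genai parameters). The helper below is an illustrative sketch of that mapping, not the scripts' actual code:

```python
def build_client_kwargs(env: dict) -> dict:
    """Map the environment variables above to google-genai Client arguments.

    Returns kwargs suitable for genai.Client(**kwargs): Vertex AI settings
    when GEMINI_USE_VERTEX is "true", otherwise the plain API key.
    """
    if env.get("GEMINI_USE_VERTEX", "").lower() == "true":
        return {
            "vertexai": True,
            "project": env["VERTEX_PROJECT_ID"],
            # VERTEX_LOCATION is optional; fall back to the documented default
            "location": env.get("VERTEX_LOCATION", "us-central1"),
        }
    return {"api_key": env["GEMINI_API_KEY"]}

# Vertex AI mode: the default location is filled in when unset
kwargs = build_client_kwargs({
    "GEMINI_USE_VERTEX": "true",
    "VERTEX_PROJECT_ID": "my-project",
})
```

With real credentials you would then call `genai.Client(**kwargs)`.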
Install SDK:
pip install google-genai python-dotenv pillow
Common Patterns
Transcribe Audio:
python scripts/gemini_batch_process.py \
--files audio.mp3 \
--task transcribe
Analyze Image:
python scripts/gemini_batch_process.py \
--files image.jpg \
--task analyze \
--prompt "Describe this image" \
--output docs/assets/<output-name>.md
Process Video:
python scripts/gemini_batch_process.py \
--files video.mp4 \
--task analyze \
--prompt "Summarize key points with timestamps" \
--output docs/assets/<output-name>.md
Extract from PDF:
python scripts/gemini_batch_process.py \
--files document.pdf \
--task extract \
--prompt "Extract table data as JSON" \
--output docs/assets/<output-name>.md \
--format json
Generate Image:
python scripts/image_gen.py \
--prompt "A futuristic city at sunset" \
--output docs/assets/<output-file-name>.png \
--aspect-ratio 16:9
Generate High-Quality Image:
python scripts/image_gen.py \
--prompt "Detailed architectural blueprint of a modern house" \
--output docs/assets/<output-file-name>.png \
--mode generate-hq
Edit an Existing Image:
python scripts/image_gen.py \
--prompt "Make the sky sunset colors" \
--input photo.jpg \
--output docs/assets/edited.png
Convert Documents to Markdown:
# Convert DOCX to Markdown
python scripts/document_converter.py \
--input document.docx \
--output docs/assets/document.md
# Extract pages
python scripts/document_converter.py \
--input large.pdf \
--output docs/assets/chapter1.md \
--pages 1-20
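The `--pages` flag likely accepts range specs like `1-20` or comma-separated mixes such as `1,3,5-7`. A hypothetical reimplementation of that parsing, for reference (not the converter's actual code):

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a --pages spec like "1-20" or "1,3,5-7" into page numbers."""
    pages = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            # Ranges are inclusive on both ends
            pages.extend(range(int(lo), int(hi) + 1))
        else:
            pages.append(int(part))
    return pages
```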
Supported Formats
Audio
- WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
- Max 9.5 hours per request
- Auto-downsampled to 16 Kbps mono
Images
- PNG, JPEG, WEBP, HEIC, HEIF
- Max 3,600 images per request
- Resolution: <=384px in both dimensions = 258 tokens; larger images are tiled
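A rough token-cost estimate follows from the rule above. The 768x768 tile size matches current Gemini documentation, but treat this as an approximation (the API also rescales images before tiling), and the helper itself is illustrative:

```python
import math

def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate image token cost: a flat 258 tokens for small images,
    otherwise 258 tokens per 768x768 tile."""
    if width <= 384 and height <= 384:
        return 258
    tiles = math.ceil(width / 768) * math.ceil(height / 768)
    return tiles * 258

estimate_image_tokens(300, 300)    # small image: flat cost of 258
estimate_image_tokens(1920, 1080)  # 3 x 2 tiles at 258 tokens each
```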
Video
- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
- Max 6 hours (low-res) or 2 hours (default)
- YouTube URLs supported (public only)
Documents
- PDF only for vision processing
- Max 1,000 pages
- TXT, HTML, Markdown supported (text-only)
Size Limits
- Inline: <20MB total request
- File API: 2GB per file, 20GB project quota
- Retention: 48 hours auto-delete
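The limits above imply a simple pre-flight decision: send small requests inline, route larger files through the File API, and reject anything over the per-file cap. A minimal sketch (not the bundled scripts' actual logic), operating on byte counts:

```python
INLINE_LIMIT = 20 * 1024 * 1024   # <20 MB total request for inline data
FILE_API_LIMIT = 2 * 1024 ** 3    # 2 GB per file via the File API

def upload_strategy(sizes_bytes: list[int]) -> str:
    """Choose how to send media given the size limits above."""
    if any(s > FILE_API_LIMIT for s in sizes_bytes):
        raise ValueError("a file exceeds the 2 GB File API per-file limit")
    if sum(sizes_bytes) < INLINE_LIMIT:
        return "inline"
    return "file_api"

upload_strategy([5 * 1024 * 1024])    # small request: inline
upload_strategy([100 * 1024 * 1024])  # over 20 MB: use the File API
```

Remember that File API uploads are auto-deleted after 48 hours, so re-upload before reprocessing old media.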
Reference Navigation
For detailed implementation guidance, see:
| Reference | Description |
|---|---|
| audio-processing.md | Transcription, analysis, TTS, timestamps, multi-speaker |
| vision-understanding.md | Captioning, detection, segmentation, OCR, multi-image |
| video-analysis.md | Scene detection, YouTube, timestamps, long video |
| image-generation.md | Text-to-image, editing, composition, aspect ratios |
| model-routing.md | Model selection per task, pricing, context windows |
| document-extraction.md | PDF processing, table extraction, structured output |
| media-optimization.md | ffmpeg recipes for compressing/resizing before upload |
Scripts Overview
All scripts support unified API key detection and error handling:
gemini_batch_process.py: Batch process multiple media files
- Supports file-based modalities (audio, image, video, PDF)
- Tasks: transcribe, analyze, extract
- Progress tracking and error recovery
- Output formats: JSON, Markdown, CSV
- Rate limiting and retry logic
- Dry-run mode
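The rate limiting the batch script applies between API calls presumably looks something like a requests-per-minute throttle. A minimal sketch (the `RateLimiter` class is illustrative, not part of the script), with injectable clock/sleep so it can be tested deterministically:

```python
import time

class RateLimiter:
    """Space successive calls at least 60/rpm seconds apart."""

    def __init__(self, rpm: int, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        if self.last is not None:
            remaining = self.interval - (self.clock() - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()
```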
image_gen.py: Generate images from text prompts
- Modes: generate (standard) and generate-hq (high quality)
- Image editing with optional input image
- Aspect ratio control
- Retry logic and error handling
document_converter.py: Convert documents to PDF
- Convert DOCX, XLSX, PPTX to PDF
- Extract page ranges
- Optimize PDFs for Gemini
- Extract images from PDFs
- Batch conversion support
Run any script with --help for detailed usage.
Error Handling
Common errors and solutions:
- 400: Invalid format/size - validate before upload
- 401: Invalid API key - check configuration
- 403: Permission denied - verify API key restrictions
- 404: File not found - ensure file uploaded and active
- 429: Rate limit exceeded - implement exponential backoff
- 500: Server error - retry with backoff
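For the retryable cases (429, 500), exponential backoff looks roughly like the following. This is an illustrative pattern, not the exact logic in the bundled scripts; it assumes the raised exception carries a `status_code` attribute:

```python
import time

RETRYABLE = {429, 500, 503}

def with_backoff(call, max_attempts=5, base=1.0, sleep=time.sleep):
    """Retry `call` on retryable HTTP status codes, doubling the delay
    each attempt (1s, 2s, 4s, ...). Non-retryable errors re-raise at once."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status not in RETRYABLE or attempt == max_attempts - 1:
                raise
            sleep(base * 2 ** attempt)
```

Respecting a `Retry-After` header, when the API returns one, is preferable to a blind doubling schedule.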
Resources
Source
git clone https://github.com/Vibe-Builders/claude-prime
Skill file: .claude/skills/media-processor/SKILL.md
Overview
Media Processor handles audio, image, video, and document tasks using Google Gemini's multimodal API. It provides a single interface for transcription, analysis, data extraction, and image generation, supporting in-depth image study, chart data extraction, reading of dense screenshots, and artwork analysis.
How This Skill Works
Inputs are routed to Gemini's multimodal API through a unified interface supporting tasks like transcribe, analyze, extract, generate, and generate-hq. Outputs are structured (e.g., JSON) with details such as timestamps, bounding boxes, OCR, charts, or image data, enabling seamless integration into downstream workflows.
When to Use It
- Transcribe long-form audio or video with precise timestamps for accessibility or note-taking.
- Deeply analyze images or videos for UI design, artwork study, or data-related visuals using object detection, OCR, and VQA.
- Extract tables, forms, and structured data from multi-page PDFs and convert to JSON or HTML.
- Generate or edit high-quality images and compositions for design briefs, including multi-image layouts and multiple aspect ratios.
- Summarize long videos, perform frame-level analysis, or answer questions about video content (with YouTube URL support).
Quick Start
- Step 1: Configure API access by setting GEMINI_API_KEY and, if using Vertex AI, enable Vertex settings (GEMINI_USE_VERTEX, VERTEX_PROJECT_ID, VERTEX_LOCATION).
- Step 2: Install required SDKs: `pip install google-genai python-dotenv pillow`.
- Step 3: Run sample tasks such as transcribing audio, analyzing an image, processing a PDF extract, or generating an image with a prompt.
Best Practices
- Clearly define the task (transcribe, analyze, extract, generate, generate-hq) and expected output format before processing.
- For image/video analysis, provide explicit prompts or questions to guide detection, OCR, and VQA results.
- When handling PDFs, specify the pages, sections, and output schema (JSON/table data) to maximize structured results.
- Chunk very long inputs (audio/video) if needed to respect duration limits and maintain accuracy.
- Validate outputs against a defined schema (timestamps, bounding boxes, JSON tables) and perform post-processing as needed.
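The validation step above can be as simple as checking that every transcript line starts with a timestamp. A sketch of such a post-processing check, assuming an `[MM:SS]` / `[HH:MM:SS]` line format (the actual output format may differ):

```python
import re

TIMESTAMP = re.compile(r"^\[(\d{2}):(\d{2})(?::(\d{2}))?\]")

def validate_transcript(lines: list[str]) -> list[int]:
    """Return indices of lines missing a leading [MM:SS] or [HH:MM:SS]
    timestamp, so they can be flagged for reprocessing."""
    return [i for i, line in enumerate(lines) if not TIMESTAMP.match(line)]

bad = validate_transcript([
    "[00:03] Welcome everyone.",
    "Thanks for joining.",          # no timestamp: flagged
    "[01:02:15] Closing remarks.",
])
```

The same idea extends to bounding boxes (coordinates within image bounds) and JSON tables (validate against the requested schema).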
Example Use Cases
- Transcribe a conference talk and generate a timestamped caption file for accessibility.
- Analyze a UI mockup image to detect components, extract text, and generate descriptive alt text.
- Extract a price table and key metrics from a 500-page financial report into JSON for analytics.
- Generate a set of marketing images in 16:9 and 1:1, then iteratively refine styling and composition.
- Summarize a 3-hour training video with key points and timestamps, and answer user questions about specific sections.