ai-multimodal
npx machina-cli add skill zircote/agents/ai-multimodal --openclaw
AI Multimodal Processing Skill
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
Core Capabilities
Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice
Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
- OCR and text extraction
Video Analysis
- Scene detection and summarization
- Video Q&A with temporal understanding
- Transcription with visual descriptions
- YouTube URL support
- Long video processing (up to 6 hours)
- Frame-level analysis
Document Extraction
- Native PDF vision processing (up to 1,000 pages)
- Table and form extraction
- Chart and diagram analysis
- Multi-page document understanding
- Structured data output (JSON schema)
- Format conversion (PDF to HTML/JSON)
Image Generation
- Text-to-image generation
- Image editing and modification
- Multi-image composition (up to 3 images)
- Iterative refinement
- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
- Controllable style and quality
Capability Matrix
| Task | Audio | Image | Video | Document | Generation |
|---|---|---|---|---|---|
| Transcription | Y | - | Y | - | - |
| Summarization | Y | Y | Y | Y | - |
| Q&A | Y | Y | Y | Y | - |
| Object Detection | - | Y | Y | - | - |
| Text Extraction | - | Y | - | Y | - |
| Structured Output | Y | Y | Y | Y | - |
| Creation | TTS | - | - | - | Y |
| Timestamps | Y | - | Y | - | - |
| Segmentation | - | Y | - | - | - |
Model Selection Guide
Gemini 2.5 Series (Recommended)
- gemini-2.5-pro: Highest quality, all features, 1M-2M context
- gemini-2.5-flash: Best balance, all features, 1M-2M context
- gemini-2.5-flash-lite: Lightweight, segmentation support
- gemini-2.5-flash-image: Image generation only
Gemini 2.0 Series
- gemini-2.0-flash: Fast processing, object detection
- gemini-2.0-flash-lite: Lightweight option
Feature Requirements
- Segmentation: Requires 2.5+ models
- Object Detection: Requires 2.0+ models
- Multi-video: Requires 2.5+ models
- Image Generation: Requires flash-image model
Context Windows
- 2M tokens: ~6 hours video (low-res) or ~2 hours (default)
- 1M tokens: ~3 hours video (low-res) or ~1 hour (default)
- Audio: 32 tokens/second (1 min = 1,920 tokens)
- PDF: 258 tokens/page (fixed)
- Image: 258-1,548 tokens based on size
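The rates above can be turned into a rough pre-flight check before sending a request. The sketch below is illustrative only: `estimate_tokens` and `fits_context` are hypothetical helpers, not part of the skill's scripts, and the video rates (~300 tokens/second default, ~100 low-res) are the approximate figures quoted in the Cost Optimization section.

```python
# Approximate token rates quoted in this document.
AUDIO_TOKENS_PER_SECOND = 32
VIDEO_TOKENS_PER_SECOND = {"default": 300, "low": 100}
PDF_TOKENS_PER_PAGE = 258


def estimate_tokens(audio_seconds=0, video_seconds=0, pdf_pages=0,
                    video_resolution="default"):
    """Return a rough input-token estimate for one request."""
    video_rate = VIDEO_TOKENS_PER_SECOND[video_resolution]
    return (
        audio_seconds * AUDIO_TOKENS_PER_SECOND
        + video_seconds * video_rate
        + pdf_pages * PDF_TOKENS_PER_PAGE
    )


def fits_context(tokens, context_window=1_000_000):
    """True when the estimate fits in the given context window."""
    return tokens <= context_window


print(estimate_tokens(audio_seconds=60))        # 1920
print(estimate_tokens(pdf_pages=1000))          # 258000
print(estimate_tokens(video_seconds=45 * 60))   # 810000
print(fits_context(estimate_tokens(video_seconds=45 * 60)))  # True
```

The estimate ignores prompt text and output tokens, so it is best treated as a lower bound when deciding whether to chunk a file.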
Quick Start
Prerequisites
API Key Setup: Supports both Google AI Studio and Vertex AI.
The skill checks for GEMINI_API_KEY in this order:
- Process environment: export GEMINI_API_KEY="your-key"
- Project root: .env
- .claude/.env
- .claude/skills/.env
- .claude/skills/ai-multimodal/.env
Get API key: https://aistudio.google.com/apikey
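The lookup order described above could be implemented roughly as follows. This is a minimal sketch using only the standard library, not the skill's actual code: `find_api_key` and `read_key_from_env_file` are hypothetical names, and the parser only understands plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path

# .env locations in the order listed above, relative to the project root.
ENV_FILE_CANDIDATES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
]


def read_key_from_env_file(path, name="GEMINI_API_KEY"):
    """Return the value of `name` from a simple KEY=VALUE file, or None."""
    if not path.is_file():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip("'\"")
    return None


def find_api_key(project_root=".", environ=None):
    """Check the process environment first, then each .env file in order."""
    environ = os.environ if environ is None else environ
    if environ.get("GEMINI_API_KEY"):
        return environ["GEMINI_API_KEY"]
    for relative in ENV_FILE_CANDIDATES:
        value = read_key_from_env_file(Path(project_root) / relative)
        if value:
            return value
    return None
```

Passing `environ` explicitly keeps the function easy to test; in normal use it falls back to `os.environ`.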
For Vertex AI:
<example type="usage">
<code language="bash">
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional
</code>
</example>
Install SDK:
<example type="usage">
<code language="bash">
pip install google-genai python-dotenv pillow
</code>
</example>
Common Patterns
Transcribe Audio:
<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --files audio.mp3 \
  --task transcribe \
  --model gemini-2.5-flash
</code>
</example>
Analyze Image:
<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --files image.jpg \
  --task analyze \
  --prompt "Describe this image" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
</code>
</example>
Process Video:
<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --files video.mp4 \
  --task analyze \
  --prompt "Summarize key points with timestamps" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
</code>
</example>
Extract from PDF:
<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --files document.pdf \
  --task extract \
  --prompt "Extract table data as JSON" \
  --output docs/assets/<output-name>.md \
  --format json
</code>
</example>
Generate Image:
<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --task generate \
  --prompt "A futuristic city at sunset" \
  --output docs/assets/<output-file-name> \
  --model gemini-2.5-flash-image \
  --aspect-ratio 16:9
</code>
</example>
Optimize Media:
<example type="usage">
<code language="bash">
# Prepare large video for processing
python scripts/media_optimizer.py \
  --input large-video.mp4 \
  --output docs/assets/<output-file-name> \
  --target-size 100MB

# Batch optimize multiple files
python scripts/media_optimizer.py \
  --input-dir ./videos \
  --output-dir docs/assets/optimized \
  --quality 85
</code>
</example>
Convert Documents to Markdown:
<example type="usage">
<code language="bash">
# Convert to Markdown
python scripts/document_converter.py \
  --input document.docx \
  --output docs/assets/document.md

# Extract pages
python scripts/document_converter.py \
  --input large.pdf \
  --output docs/assets/chapter1.md \
  --pages 1-20
</code>
</example>
Supported Formats
Audio
- WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
- Max 9.5 hours per request
- Auto-downsampled to 16 Kbps mono
Images
- PNG, JPEG, WEBP, HEIC, HEIF
- Max 3,600 images per request
- Resolution: <=384px = 258 tokens, larger = tiled
Video
- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
- Max 6 hours (low-res) or 2 hours (default)
- YouTube URLs supported (public only)
Documents
- PDF only for vision processing
- Max 1,000 pages
- TXT, HTML, Markdown supported (text-only)
Reference Navigation
For detailed implementation guidance, see:
Audio Processing
references/audio-processing.md - Transcription, analysis, TTS
- Timestamp handling and segment analysis
- Multi-speaker identification
- Non-speech audio analysis
- Text-to-speech generation
Image Understanding
references/vision-understanding.md - Captioning, detection, OCR
- Object detection and localization
- Pixel-level segmentation
- Visual question answering
- Multi-image comparison
Video Analysis
references/video-analysis.md - Scene detection, temporal understanding
- YouTube URL processing
- Timestamp-based queries
- Video clipping and FPS control
- Long video optimization
Document Extraction
references/document-extraction.md - PDF processing, structured output
- Table and form extraction
- Chart and diagram analysis
- JSON schema validation
- Multi-page handling
Image Generation
references/image-generation.md - Text-to-image, editing
- Prompt engineering strategies
- Image editing and composition
- Aspect ratio selection
- Safety settings
Cost Optimization
Token Costs
Pricing (input and output):
- Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
- Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
- Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output
Token Rates:
- Audio: 32 tokens/second (1 min = 1,920 tokens)
- Video: ~300 tokens/second (default) or ~100 (low-res)
- PDF: 258 tokens/page (fixed)
- Image: 258-1,548 tokens based on size
TTS Pricing:
- Flash TTS: $10/1M tokens
- Pro TTS: $20/1M tokens
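As a worked example of how token rates and per-million prices combine, the sketch below estimates the cost of one request. `estimate_cost_usd` is a hypothetical helper, and the prices passed in are simply the Gemini 2.5 Pro figures listed above; pricing changes over time, so check the current rates before relying on the result.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Cost in USD for one request, given prices per 1M tokens."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# A 90-minute podcast at 32 tokens/second, with a 2,000-token summary as output.
audio_tokens = 90 * 60 * 32
print(audio_tokens)  # 172800

cost = estimate_cost_usd(
    audio_tokens,
    2_000,
    input_price_per_million=3.00,    # Gemini 2.5 Pro input price listed above
    output_price_per_million=12.00,  # Gemini 2.5 Pro output price listed above
)
print(round(cost, 4))  # 0.5424
```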
Best Practices
- Use gemini-2.5-flash for most tasks (best price/performance)
- Use File API for files >20MB or repeated queries
- Optimize media before upload (see media_optimizer.py)
- Process specific segments instead of full videos
- Use lower FPS for static content
- Implement context caching for repeated queries
- Batch process multiple files in parallel
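The last point, processing files in parallel, might look like the following sketch. It uses a thread pool from the standard library; `process_batch` is a hypothetical helper, and `process_file` stands in for whatever function actually sends one file to the API. This is not how `gemini_batch_process.py` is known to work internally.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(paths, process_file, max_workers=4):
    """Run process_file(path) for each path in parallel.

    Returns (results, errors): two dicts keyed by path, so one failing
    file does not abort the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_file, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as error:  # record the failure and keep going
                errors[path] = error
    return results, errors
```

Keeping `max_workers` small is a reasonable default here, since the per-minute request limits still apply to the batch as a whole.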
Rate Limits
Free Tier:
- 10-15 RPM (requests per minute)
- 1M-4M TPM (tokens per minute)
- 1,500 RPD (requests per day)
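One way to stay under a requests-per-minute limit is a small client-side sliding-window limiter. The sketch below is an illustration under assumptions: `RequestRateLimiter` is a hypothetical class, the default of 10 requests per 60 seconds is taken from the lower end of the free-tier range above, and it does not track tokens per minute or requests per day.

```python
import time
from collections import deque


class RequestRateLimiter:
    """Allow at most max_requests calls per window_seconds (sliding window)."""

    def __init__(self, max_requests=10, window_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # timestamps of requests inside the window

    def acquire(self):
        """Block until one more request is allowed, then record it."""
        while True:
            now = self._clock()
            # Forget requests that have left the window.
            while self._sent and now - self._sent[0] >= self.window_seconds:
                self._sent.popleft()
            if len(self._sent) < self.max_requests:
                self._sent.append(now)
                return
            # Wait until the oldest request leaves the window, then re-check.
            self._sleep(self.window_seconds - (now - self._sent[0]))
```

Call `limiter.acquire()` immediately before each API request. The `clock` and `sleep` parameters exist only so the limiter can be tested without real waiting.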
YouTube Limits:
- Free tier: 8 hours/day
- Paid tier: No length limits
- Public videos only
Storage Limits:
- 20GB per project
- 2GB per file
- 48-hour retention
Error Handling
Common errors and solutions:
- 400: Invalid format/size - validate before upload
- 401: Invalid API key - check configuration
- 403: Permission denied - verify API key restrictions
- 404: File not found - ensure file uploaded and active
- 429: Rate limit exceeded - implement exponential backoff
- 500: Server error - retry with backoff
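The backoff advice for 429 and 500 responses can be sketched as a small retry wrapper. Everything named here is an assumption made for illustration: `ApiError` is a hypothetical exception carrying an HTTP status code (the real SDK raises its own error types), and `call_with_backoff` is not part of the skill's scripts.

```python
import random
import time

RETRYABLE_STATUS_CODES = {429, 500}


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status_code, message=""):
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call func(), retrying on 429/500 with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ApiError as error:
            is_last_attempt = attempt == max_attempts - 1
            if error.status_code not in RETRYABLE_STATUS_CODES or is_last_attempt:
                raise  # 400/401/403/404 need a fix, not a retry
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Non-retryable errors are re-raised immediately because repeating the same request with a bad key or a missing file would only burn through the rate limit.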
Scripts Overview
All scripts support unified API key detection and error handling:
gemini_batch_process.py: Batch process multiple media files
- Supports all modalities (audio, image, video, PDF)
- Progress tracking and error recovery
- Output formats: JSON, Markdown, CSV
- Rate limiting and retry logic
- Dry-run mode
media_optimizer.py: Prepare media for Gemini API
- Compress videos/audio for size limits
- Resize images appropriately
- Split long videos into chunks
- Format conversion
- Quality vs size optimization
document_converter.py: Convert documents to PDF
- Convert DOCX, XLSX, PPTX to PDF
- Extract page ranges
- Optimize PDFs for Gemini
- Extract images from PDFs
- Batch conversion support
Run any script with --help for detailed usage.
Resources
Source
git clone https://github.com/zircote/agents/blob/main/skills/ai-multimodal/SKILL.md
Overview
ai-multimodal provides a unified interface to analyze and generate audio, images, videos, and documents using Google Gemini's multimodal API. It offers transcription, captioning, scene detection, OCR, visual Q&A, and image generation, with a choice of Gemini 2.5 and 2.0 models and context windows of up to 2M tokens.
How This Skill Works
Inputs are routed to specialized Gemini-powered modules for audio, image, video, document, and generation tasks. The skill supports long-form media (audio up to 9.5 hours, video up to 6 hours) and YouTube URLs, returning structured outputs (JSON or HTML) or generated media. Model selections (Gemini 2.5/2.0) govern features like segmentation (2.5+) and image generation capabilities, with a context window up to 2M tokens.
When to Use It
- Analyze long audio files or podcasts with transcription timestamps and a concise summary.
- Extract structured data from PDFs (tables, forms, charts) into JSON or HTML.
- Caption images, detect objects, perform OCR, and run visual Q&A on images or screenshots.
- Analyze videos with scene detection, Q&A, and temporal analysis (including YouTube URLs).
- Create or refine images from text prompts and iterate on compositions.
Quick Start
- Step 1: Set GEMINI_API_KEY in your environment (export GEMINI_API_KEY=your-key).
- Step 2: Pick a Gemini model (e.g., gemini-2.5-pro for full features or gemini-2.0-flash for speed).
- Step 3: Submit a multimodal request with your media (audio, image, video, or PDF) and specify tasks like transcription, OCR, Q&A, or generation.
Best Practices
- Choose the right Gemini model first: use Gemini 2.5-series for full multimodal capabilities including segmentation.
- Pre-process inputs to improve accuracy: crop images, enhance PDFs, and provide clear prompts for tasks.
- Be mindful of context limits: plan around up to 2M tokens and chunk long documents or videos as needed.
- Request structured outputs (JSON) for data extraction and downstream integration.
- Leverage model capabilities strategically: use VQA, OCR, and scene detection where they deliver tangible value.
Example Use Cases
- Transcribe a 90-minute podcast with timestamps, plus a short topic summary for show notes.
- Extract tables and charts from a multi-page PDF invoice and export as JSON for accounting.
- Caption a product image set, detect objects, and OCR on-image text to populate a catalog.
- Process a 2-hour lecture video with scene detection and a Q&A pass to highlight key topics.
- Generate a set of marketing images from prompts and iteratively refine composition and style.