ai-multimodal
npx machina-cli add skill Microck/ordinary-claude-skills/ai-multimodal --openclawAI Multimodal Processing Skill
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
Core Capabilities
Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice
Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
- OCR and text extraction
Video Analysis
- Scene detection and summarization
- Video Q&A with temporal understanding
- Transcription with visual descriptions
- YouTube URL support
- Long video processing (up to 6 hours)
- Frame-level analysis
Document Extraction
- Native PDF vision processing (up to 1,000 pages)
- Table and form extraction
- Chart and diagram analysis
- Multi-page document understanding
- Structured data output (JSON schema)
- Format conversion (PDF to HTML/JSON)
Image Generation
- Text-to-image generation
- Image editing and modification
- Multi-image composition (up to 3 images)
- Iterative refinement
- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
- Controllable style and quality
Capability Matrix
| Task | Audio | Image | Video | Document | Generation |
|---|---|---|---|---|---|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object Detection | - | ✓ | ✓ | - | - |
| Text Extraction | - | ✓ | - | ✓ | - |
| Structured Output | ✓ | ✓ | ✓ | ✓ | - |
| Creation | TTS | - | - | - | ✓ |
| Timestamps | ✓ | - | ✓ | - | - |
| Segmentation | - | ✓ | - | - | - |
Model Selection Guide
Gemini 2.5 Series (Recommended)
- gemini-2.5-pro: Highest quality, all features, 1M-2M context
- gemini-2.5-flash: Best balance, all features, 1M-2M context
- gemini-2.5-flash-lite: Lightweight, segmentation support
- gemini-2.5-flash-image: Image generation only
Gemini 2.0 Series
- gemini-2.0-flash: Fast processing, object detection
- gemini-2.0-flash-lite: Lightweight option
Feature Requirements
- Segmentation: Requires 2.5+ models
- Object Detection: Requires 2.0+ models
- Multi-video: Requires 2.5+ models
- Image Generation: Requires flash-image model
Context Windows
- 2M tokens: ~6 hours video (low-res) or ~2 hours (default)
- 1M tokens: ~3 hours video (low-res) or ~1 hour (default)
- Audio: 32 tokens/second (1 min = 1,920 tokens)
- PDF: 258 tokens/page (fixed)
- Image: 258-1,548 tokens based on size
Quick Start
Prerequisites
API Key Setup: Supports both Google AI Studio and Vertex AI.
The skill checks for GEMINI_API_KEY in this order:
- Process environment:
export GEMINI_API_KEY="your-key" - Project root:
.env .claude/.env.claude/skills/.env.claude/skills/ai-multimodal/.env
Get API key: https://aistudio.google.com/apikey
For Vertex AI:
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1 # Optional
Install SDK:
pip install google-genai python-dotenv pillow
Common Patterns
Transcribe Audio:
python scripts/gemini_batch_process.py \
--files audio.mp3 \
--task transcribe \
--model gemini-2.5-flash
Analyze Image:
python scripts/gemini_batch_process.py \
--files image.jpg \
--task analyze \
--prompt "Describe this image" \
--output docs/assets/<output-name>.md \
--model gemini-2.5-flash
Process Video:
python scripts/gemini_batch_process.py \
--files video.mp4 \
--task analyze \
--prompt "Summarize key points with timestamps" \
--output docs/assets/<output-name>.md \
--model gemini-2.5-flash
Extract from PDF:
python scripts/gemini_batch_process.py \
--files document.pdf \
--task extract \
--prompt "Extract table data as JSON" \
--output docs/assets/<output-name>.md \
--format json
Generate Image:
python scripts/gemini_batch_process.py \
--task generate \
--prompt "A futuristic city at sunset" \
--output docs/assets/<output-file-name> \
--model gemini-2.5-flash-image \
--aspect-ratio 16:9
Optimize Media:
# Prepare large video for processing
python scripts/media_optimizer.py \
--input large-video.mp4 \
--output docs/assets/<output-file-name> \
--target-size 100MB
# Batch optimize multiple files
python scripts/media_optimizer.py \
--input-dir ./videos \
--output-dir docs/assets/optimized \
--quality 85
Convert Documents to Markdown:
# Convert to PDF
python scripts/document_converter.py \
--input document.docx \
--output docs/assets/document.md
# Extract pages
python scripts/document_converter.py \
--input large.pdf \
--output docs/assets/chapter1.md \
--pages 1-20
Supported Formats
Audio
- WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
- Max 9.5 hours per request
- Auto-downsampled to 16 Kbps mono
Images
- PNG, JPEG, WEBP, HEIC, HEIF
- Max 3,600 images per request
- Resolution: ≤384px = 258 tokens, larger = tiled
Video
- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
- Max 6 hours (low-res) or 2 hours (default)
- YouTube URLs supported (public only)
Documents
- PDF only for vision processing
- Max 1,000 pages
- TXT, HTML, Markdown supported (text-only)
Size Limits
- Inline: <20MB total request
- File API: 2GB per file, 20GB project quota
- Retention: 48 hours auto-delete
Reference Navigation
For detailed implementation guidance, see:
Audio Processing
references/audio-processing.md- Transcription, analysis, TTS- Timestamp handling and segment analysis
- Multi-speaker identification
- Non-speech audio analysis
- Text-to-speech generation
Image Understanding
references/vision-understanding.md- Captioning, detection, OCR- Object detection and localization
- Pixel-level segmentation
- Visual question answering
- Multi-image comparison
Video Analysis
references/video-analysis.md- Scene detection, temporal understanding- YouTube URL processing
- Timestamp-based queries
- Video clipping and FPS control
- Long video optimization
Document Extraction
references/document-extraction.md- PDF processing, structured output- Table and form extraction
- Chart and diagram analysis
- JSON schema validation
- Multi-page handling
Image Generation
references/image-generation.md- Text-to-image, editing- Prompt engineering strategies
- Image editing and composition
- Aspect ratio selection
- Safety settings
Cost Optimization
Token Costs
Input Pricing:
- Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
- Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
- Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output
Token Rates:
- Audio: 32 tokens/second (1 min = 1,920 tokens)
- Video: ~300 tokens/second (default) or ~100 (low-res)
- PDF: 258 tokens/page (fixed)
- Image: 258-1,548 tokens based on size
TTS Pricing:
- Flash TTS: $10/1M tokens
- Pro TTS: $20/1M tokens
Best Practices
- Use
gemini-2.5-flashfor most tasks (best price/performance) - Use File API for files >20MB or repeated queries
- Optimize media before upload (see
media_optimizer.py) - Process specific segments instead of full videos
- Use lower FPS for static content
- Implement context caching for repeated queries
- Batch process multiple files in parallel
Rate Limits
Free Tier:
- 10-15 RPM (requests per minute)
- 1M-4M TPM (tokens per minute)
- 1,500 RPD (requests per day)
YouTube Limits:
- Free tier: 8 hours/day
- Paid tier: No length limits
- Public videos only
Storage Limits:
- 20GB per project
- 2GB per file
- 48-hour retention
Error Handling
Common errors and solutions:
- 400: Invalid format/size - validate before upload
- 401: Invalid API key - check configuration
- 403: Permission denied - verify API key restrictions
- 404: File not found - ensure file uploaded and active
- 429: Rate limit exceeded - implement exponential backoff
- 500: Server error - retry with backoff
Scripts Overview
All scripts support unified API key detection and error handling:
gemini_batch_process.py: Batch process multiple media files
- Supports all modalities (audio, image, video, PDF)
- Progress tracking and error recovery
- Output formats: JSON, Markdown, CSV
- Rate limiting and retry logic
- Dry-run mode
media_optimizer.py: Prepare media for Gemini API
- Compress videos/audio for size limits
- Resize images appropriately
- Split long videos into chunks
- Format conversion
- Quality vs size optimization
document_converter.py: Convert documents to PDF
- Convert DOCX, XLSX, PPTX to PDF
- Extract page ranges
- Optimize PDFs for Gemini
- Extract images from PDFs
- Batch conversion support
Run any script with --help for detailed usage.
Resources
Source
git clone https://github.com/Microck/ordinary-claude-skills/blob/main/skills_all/ai-multimodal/SKILL.mdView on GitHub Overview
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. It offers transcription, captioning, scene detection, OCR, form extraction, TTS, and image generation through a unified interface for multimedia understanding and creation.
How This Skill Works
Inputs are routed to the appropriate Gemini model (2.5/2.0) based on task and required features. The skill handles long media with up to 2M token context, performs analysis and generation across modalities, and returns structured outputs (JSON, captions, or images) ready for downstream apps.
When to Use It
- When you need transcription, summarization, or Q&A for audio clips up to 9.5 hours.
- When analyzing or captioning images, performing OCR, or detecting objects in visuals.
- When processing videos (scene detection, Q&A, or YouTube URLs) up to several hours.
- When extracting tables, forms, charts, or multi-page data from PDFs and documents.
- When generating or editing images from text prompts or refining visuals with multimodal outputs.
Quick Start
- Step 1: Set GEMINI_API_KEY (export GEMINI_API_KEY="your-key"), or place it in a .env file as per your environment.
- Step 2: Choose a Gemini model (e.g., gemini-2.5-pro for full features) and provide your media inputs (audio, image, video, PDFs) or text prompts for generation.
- Step 3: Call the multimodal API and handle outputs (structured JSON for data, captions for images, or generated images) in your app.
Best Practices
- Choose Gemini 2.5-series models for full capabilities (segmentation, multi-image analysis) and switch to 2.0-series for speed when features are not required.
- Leverage context window guidance: 2M tokens for long videos, 1M for smaller tasks, and account for audio token rates.
- Prefer structured outputs (JSON) for documents and data extraction to simplify downstream parsing.
- Preprocess inputs: ensure audio timestamps are accurate, crop images for OCR, and provide YouTube URLs with permission.
- Validate results with cross-checks across modalities (e.g., captions vs. transcripts) and use iterative refinement for image generation.
Example Use Cases
- Transcribe a 45-minute podcast with timestamps, then generate a concise summary and highlight key quotes.
- Extract a product catalog from a PDF including tables and forms into JSON.
- Caption product images, detect objects with bounding boxes, and produce a visual QA for an e-commerce listing.
- Analyze a YouTube video for scenes and generate a Q&A about the content with contextual answers.
- Create a set of AI-generated images from a design brief and refine them through iterative edits.