What media types does ai-multimodal support?

Audio, images, video, PDFs and other documents; includes transcription, captioning, OCR, object detection, visual Q&A, and image generation.

What are the model and context limitations?

Gemini 2.5 and 2.0 models with a context window of up to 2M tokens; segmentation requires 2.5+; video up to 6 hours (low-res), audio up to 9.5 hours.

What outputs does it produce?

Structured data in JSON for documents; HTML/JSON conversions; transcripts with timestamps; image assets for generation; TTS outputs when requested.

ai-multimodal

npx machina-cli add skill zircote/agents/ai-multimodal --openclaw


AI Multimodal Processing Skill

Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.

Core Capabilities

Audio Processing

Transcription with timestamps (up to 9.5 hours)
Audio summarization and analysis
Speech understanding and speaker identification
Music and environmental sound analysis
Text-to-speech generation with controllable voice

Image Understanding

Image captioning and description
Object detection with bounding boxes (2.0+)
Pixel-level segmentation (2.5+)
Visual question answering
Multi-image comparison (up to 3,600 images)
OCR and text extraction

Video Analysis

Scene detection and summarization
Video Q&A with temporal understanding
Transcription with visual descriptions
YouTube URL support
Long video processing (up to 6 hours)
Frame-level analysis

Document Extraction

Native PDF vision processing (up to 1,000 pages)
Table and form extraction
Chart and diagram analysis
Multi-page document understanding
Structured data output (JSON schema)
Format conversion (PDF to HTML/JSON)

Image Generation

Text-to-image generation
Image editing and modification
Multi-image composition (up to 3 images)
Iterative refinement
Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
Controllable style and quality

Capability Matrix

Task	Audio	Image	Video	Document	Generation
Transcription	Y	-	Y	-	-
Summarization	Y	Y	Y	Y	-
Q&A	Y	Y	Y	Y	-
Object Detection	-	Y	Y	-	-
Text Extraction	-	Y	-	Y	-
Structured Output	Y	Y	Y	Y	-
Creation	TTS	-	-	-	Y
Timestamps	Y	-	Y	-	-
Segmentation	-	Y	-	-	-

Model Selection Guide

Gemini 2.5 Series (Recommended)

gemini-2.5-pro: Highest quality, all features, 1M-2M context
gemini-2.5-flash: Best balance, all features, 1M-2M context
gemini-2.5-flash-lite: Lightweight, segmentation support
gemini-2.5-flash-image: Image generation only

Gemini 2.0 Series

gemini-2.0-flash: Fast processing, object detection
gemini-2.0-flash-lite: Lightweight option

Feature Requirements

Segmentation: Requires 2.5+ models
Object Detection: Requires 2.0+ models
Multi-video: Requires 2.5+ models
Image Generation: Requires flash-image model

Context Windows

2M tokens: ~6 hours video (low-res) or ~2 hours (default)
1M tokens: ~3 hours video (low-res) or ~1 hour (default)
Audio: 32 tokens/second (1 min = 1,920 tokens)
PDF: 258 tokens/page (fixed)
Image: 258-1,548 tokens based on size
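As a rough illustration of how these figures combine, the sketch below estimates input tokens for a request and checks the total against a context window. The function names are hypothetical and not part of the skill. The video rates (~300 tokens/second default, ~100 low-res) are the approximate figures listed under Token Rates in this document, so treat the result as a ballpark.

```python
# Token rates quoted in this document (video rates are approximate).
AUDIO_TOKENS_PER_SECOND = 32
VIDEO_TOKENS_PER_SECOND = {"default": 300, "low": 100}
PDF_TOKENS_PER_PAGE = 258


def estimate_tokens(audio_seconds=0, video_seconds=0, pdf_pages=0,
                    video_resolution="default"):
    """Return a rough input-token estimate for a mixed request."""
    return (
        audio_seconds * AUDIO_TOKENS_PER_SECOND
        + video_seconds * VIDEO_TOKENS_PER_SECOND[video_resolution]
        + pdf_pages * PDF_TOKENS_PER_PAGE
    )


def fits_context(tokens, window=1_000_000):
    """True if the estimate fits in the given context window (in tokens)."""
    return tokens <= window


# A 90-minute podcast: 5,400 seconds of audio.
podcast_tokens = estimate_tokens(audio_seconds=90 * 60)
print(podcast_tokens, fits_context(podcast_tokens))
```

Image tokens are left out because they depend on how each image is tiled. Prompt text and output tokens also count against the window, so leave headroom.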

Quick Start

Prerequisites

API Key Setup: Supports both Google AI Studio and Vertex AI.

The skill checks for GEMINI_API_KEY in this order:

Process environment: export GEMINI_API_KEY="your-key"
Project root: .env
.claude/.env
.claude/skills/.env
.claude/skills/ai-multimodal/.env
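The lookup order above could be sketched as follows. This is an illustrative stand-in using only the standard library, not the skill's actual code (which may rely on python-dotenv). The helper names `read_env_file` and `find_api_key` are hypothetical, and the parser only handles plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path

# Files checked after the process environment, relative to the project root.
ENV_FILES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
]


def read_env_file(path, name):
    """Return the value of `name` from a KEY=VALUE file, or None."""
    if not path.is_file():
        return None
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip("\"'")
    return None


def find_api_key(project_root=".", name="GEMINI_API_KEY"):
    """Check the process environment first, then each .env file in order."""
    value = os.environ.get(name)
    if value:
        return value
    for rel in ENV_FILES:
        value = read_env_file(Path(project_root) / rel, name)
        if value:
            return value
    return None
```

The first match wins, so an exported variable overrides any `.env` file.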

Get API key: https://aistudio.google.com/apikey

For Vertex AI:

<example type="usage">
<code language="bash">
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional
</code>
</example>

Install SDK:

<example type="usage"> <code language="bash"> pip install google-genai python-dotenv pillow </code> </example>

Common Patterns

Transcribe Audio:

<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --files audio.mp3 \
  --task transcribe \
  --model gemini-2.5-flash
</code>
</example>

Analyze Image:

<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --files image.jpg \
  --task analyze \
  --prompt "Describe this image" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
</code>
</example>

Process Video:

<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --files video.mp4 \
  --task analyze \
  --prompt "Summarize key points with timestamps" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
</code>
</example>

Extract from PDF:

<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --files document.pdf \
  --task extract \
  --prompt "Extract table data as JSON" \
  --output docs/assets/<output-name>.md \
  --format json
</code>
</example>

Generate Image:

<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --task generate \
  --prompt "A futuristic city at sunset" \
  --output docs/assets/<output-file-name> \
  --model gemini-2.5-flash-image \
  --aspect-ratio 16:9
</code>
</example>

Optimize Media:

<example type="usage">
<code language="bash">
# Prepare large video for processing
python scripts/media_optimizer.py \
  --input large-video.mp4 \
  --output docs/assets/<output-file-name> \
  --target-size 100MB

# Batch optimize multiple files
python scripts/media_optimizer.py \
  --input-dir ./videos \
  --output-dir docs/assets/optimized \
  --quality 85
</code>
</example>

Convert Documents to Markdown:

<example type="usage">
<code language="bash">
# Convert to Markdown
python scripts/document_converter.py \
  --input document.docx \
  --output docs/assets/document.md

# Extract pages
python scripts/document_converter.py \
  --input large.pdf \
  --output docs/assets/chapter1.md \
  --pages 1-20
</code>
</example>
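When these commands are driven from Python rather than a shell, a small helper can assemble the argument list. The sketch below uses only flags that appear in the `gemini_batch_process.py` examples above. `build_command` is a hypothetical name, and passing several paths after `--files` is an assumption based on the flag's plural name, so confirm with `--help` before relying on it.

```python
import shlex

SCRIPT = "scripts/gemini_batch_process.py"


def build_command(task, files=(), model=None, prompt=None, output=None):
    """Assemble an argv list from the flags shown in the examples above."""
    argv = ["python", SCRIPT]
    if files:
        argv += ["--files", *files]
    argv += ["--task", task]
    if prompt:
        argv += ["--prompt", prompt]
    if output:
        argv += ["--output", output]
    if model:
        argv += ["--model", model]
    return argv


argv = build_command("transcribe", files=["audio.mp3"], model="gemini-2.5-flash")
print(shlex.join(argv))
# python scripts/gemini_batch_process.py --files audio.mp3 --task transcribe --model gemini-2.5-flash
```

The resulting list can be handed to `subprocess.run(argv, check=True)`, which avoids shell quoting problems with prompts that contain spaces or quotes.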

Supported Formats

Audio

WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
Max 9.5 hours per request
Auto-downsampled to 16 Kbps mono

Images

PNG, JPEG, WEBP, HEIC, HEIF
Max 3,600 images per request
Resolution: both dimensions <=384px = 258 tokens; larger images are tiled

Video

MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
Max 6 hours (low-res) or 2 hours (default)
YouTube URLs supported (public only)

Documents

PDF only for vision processing
Max 1,000 pages
TXT, HTML, Markdown supported (text-only)
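A pre-upload check against the lists above might look like the following sketch. The mapping from format names to file extensions (for example `.3gp` for 3GPP, or `.jpg` alongside `.jpeg`) is an assumption, and `classify` is a hypothetical helper, not part of the skill's scripts.

```python
from pathlib import Path

# Extensions inferred from the format names listed above (assumed mapping).
MODALITY_BY_EXTENSION = {
    "audio": {".wav", ".mp3", ".aac", ".flac", ".ogg", ".aiff"},
    "image": {".png", ".jpg", ".jpeg", ".webp", ".heic", ".heif"},
    "video": {".mp4", ".mpeg", ".mov", ".avi", ".flv", ".mpg", ".webm", ".wmv", ".3gp"},
    "document": {".pdf"},
    "text": {".txt", ".html", ".md"},
}


def classify(path):
    """Return the modality for a file path, or None if the format is unsupported."""
    ext = Path(path).suffix.lower()
    for modality, extensions in MODALITY_BY_EXTENSION.items():
        if ext in extensions:
            return modality
    return None
```

A `None` result is a cue to convert the file first, for example with `document_converter.py` for DOCX input.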

<constraints>
<constraint severity="critical">API key must be kept secure - never commit to version control</constraint>
<constraint severity="high">File size limit: 20MB inline, 2GB via File API</constraint>
<constraint severity="high">YouTube processing limited to public videos only</constraint>
<constraint severity="medium">Free tier rate limit: 10-15 requests per minute</constraint>
<constraint severity="medium">Files uploaded via File API are auto-deleted after 48 hours</constraint>
<constraint severity="low">Image generation requires specific flash-image model</constraint>
</constraints>
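The size limits above suggest a simple routing rule, sketched here under stated assumptions: the limits are treated as binary megabytes and gigabytes, and `choose_upload_method` is a hypothetical helper. The `reused` flag reflects the document's advice to prefer the File API for repeated queries.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # 20MB inline limit (binary MB assumed)
FILE_API_LIMIT_BYTES = 2 * 1024 ** 3    # 2GB File API limit (binary GB assumed)


def choose_upload_method(size_bytes, reused=False):
    """Pick 'inline', 'file_api', or 'too_large' for a file of the given size."""
    if size_bytes > FILE_API_LIMIT_BYTES:
        return "too_large"
    if reused or size_bytes > INLINE_LIMIT_BYTES:
        return "file_api"
    return "inline"
```

Files that come back as `too_large` would need compressing or splitting first, which is what `media_optimizer.py` is described as doing.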

Reference Navigation

For detailed implementation guidance, see:

Audio Processing

references/audio-processing.md - Transcription, analysis, TTS
- Timestamp handling and segment analysis
- Multi-speaker identification
- Non-speech audio analysis
- Text-to-speech generation

Image Understanding

references/vision-understanding.md - Captioning, detection, OCR
- Object detection and localization
- Pixel-level segmentation
- Visual question answering
- Multi-image comparison

Video Analysis

references/video-analysis.md - Scene detection, temporal understanding
- YouTube URL processing
- Timestamp-based queries
- Video clipping and FPS control
- Long video optimization

Document Extraction

references/document-extraction.md - PDF processing, structured output
- Table and form extraction
- Chart and diagram analysis
- JSON schema validation
- Multi-page handling

Image Generation

references/image-generation.md - Text-to-image, editing
- Prompt engineering strategies
- Image editing and composition
- Aspect ratio selection
- Safety settings

Cost Optimization

Token Costs

Model Pricing:

Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output

Token Rates:

Audio: 32 tokens/second (1 min = 1,920 tokens)
Video: ~300 tokens/second (default) or ~100 (low-res)
PDF: 258 tokens/page (fixed)
Image: 258-1,548 tokens based on size

TTS Pricing:

Flash TTS: $10/1M tokens
Pro TTS: $20/1M tokens
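To turn these rates into a budget figure, a minimal sketch could multiply a token count by a per-million price. `estimate_cost` is a hypothetical helper, and the price is passed in rather than hard-coded because published prices change. The example reuses the Gemini 2.5 Flash input price listed above.

```python
def estimate_cost(tokens, usd_per_million):
    """Return the cost in USD for `tokens` at a given price per 1M tokens."""
    return tokens * usd_per_million / 1_000_000


# One hour of audio at 32 tokens/second.
audio_tokens = 60 * 60 * 32
cost = estimate_cost(audio_tokens, usd_per_million=1.00)
print(f"{audio_tokens} tokens, about ${cost:.4f} input")
```

This covers input tokens only. Output tokens are billed separately at the output rate.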

Best Practices

Use gemini-2.5-flash for most tasks (best price/performance)
Use File API for files >20MB or repeated queries
Optimize media before upload (see media_optimizer.py)
Process specific segments instead of full videos
Use lower FPS for static content
Implement context caching for repeated queries
Batch process multiple files in parallel
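The last practice, processing files in parallel, could be sketched with a thread pool. This is an illustration rather than how `gemini_batch_process.py` is implemented. `process_batch` and the `process_one` callback are hypothetical names, and the worker count should stay low enough to respect your rate limit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(paths, process_one, max_workers=4):
    """Run process_one(path) concurrently; return (results, errors) keyed by path."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one failed file should not stop the batch
                errors[path] = exc
    return results, errors
```

Threads suit this workload because each call spends most of its time waiting on the network.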

Rate Limits

Free Tier:

10-15 RPM (requests per minute)
1M-4M TPM (tokens per minute)
1,500 RPD (requests per day)
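One way to stay under a requests-per-minute cap is a sliding-window limiter on the client side. The sketch below is a generic pattern, not the skill's own rate-limiting logic. `RateLimiter` is a hypothetical class, and the default of 10 calls per 60 seconds takes the lower bound of the free-tier range above.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` per `period` seconds (sliding window)."""

    def __init__(self, max_calls=10, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)
```

Call `limiter.wait()` immediately before each API request. The injectable `clock` and `sleep` exist so the class can be tested without real delays.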

YouTube Limits:

Free tier: 8 hours/day
Paid tier: No length limits
Public videos only

Storage Limits:

20GB per project
2GB per file
48-hour retention

Error Handling

Common errors and solutions:

400: Invalid format/size - validate before upload
401: Invalid API key - check configuration
403: Permission denied - verify API key restrictions
404: File not found - ensure file uploaded and active
429: Rate limit exceeded - implement exponential backoff
500: Server error - retry with backoff
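The retry advice for 429 and 500 responses can be sketched as exponential backoff with jitter. `ApiError` is a hypothetical exception carrying an HTTP status code; a real SDK raises its own error types, which would need mapping to a status before this pattern applies.

```python
import random
import time

RETRYABLE_STATUS = {429, 500}


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying retryable errors with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ApiError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or last_attempt:
                raise
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 10% jitter.
            delay = base_delay * 2 ** attempt
            sleep(delay + random.uniform(0, delay * 0.1))
```

Errors such as 400, 401, 403, and 404 are re-raised immediately, since retrying does not fix a bad request or a bad key.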

Scripts Overview

All scripts support unified API key detection and error handling:

gemini_batch_process.py: Batch process multiple media files

Supports all modalities (audio, image, video, PDF)
Progress tracking and error recovery
Output formats: JSON, Markdown, CSV
Rate limiting and retry logic
Dry-run mode

media_optimizer.py: Prepare media for Gemini API

Compress videos/audio for size limits
Resize images appropriately
Split long videos into chunks
Format conversion
Quality vs size optimization

document_converter.py: Convert documents to PDF

Convert DOCX, XLSX, PPTX to PDF
Extract page ranges
Optimize PDFs for Gemini
Extract images from PDFs
Batch conversion support

Run any script with --help for detailed usage.

Resources

Source

git clone https://github.com/zircote/agents/blob/main/skills/ai-multimodal/SKILL.md

Overview

ai-multimodal provides a unified interface to analyze and generate audio, images, videos, and documents using Google Gemini's multimodal API. It offers transcription, captioning, scene detection, OCR, visual Q&A, and image generation, with Gemini 2.5 and 2.0 model options and a context window of up to 2M tokens.

How This Skill Works

Inputs are routed to specialized Gemini-powered modules for audio, image, video, document, and generation tasks. The skill supports long-form media (audio up to 9.5 hours, video up to 6 hours) and YouTube URLs, returning structured outputs (JSON or HTML) or generated media. Model selections (Gemini 2.5/2.0) govern features like segmentation (2.5+) and image generation capabilities, with a context window up to 2M tokens.

When to Use It

Analyze long audio files or podcasts with transcription timestamps and a concise summary.
Extract structured data from PDFs (tables, forms, charts) into JSON or HTML.
Caption images, detect objects, perform OCR, and run visual Q&A on images or screenshots.
Analyze videos with scene detection, Q&A, and temporal analysis (including YouTube URLs).
Create or refine images from text prompts and iterate on compositions.

Quick Start

Step 1: Set GEMINI_API_KEY in your environment (export GEMINI_API_KEY=your-key).
Step 2: Pick a Gemini model (e.g., gemini-2.5-pro for full features or gemini-2.0-flash for speed).
Step 3: Submit a multimodal request with your media (audio, image, video, or PDF) and specify tasks like transcription, OCR, Q&A, or generation.

Best Practices

Choose the right Gemini model first: use Gemini 2.5-series for full multimodal capabilities including segmentation.
Pre-process inputs to improve accuracy: crop images, enhance PDFs, and provide clear prompts for tasks.
Be mindful of context limits: plan around up to 2M tokens and chunk long documents or videos as needed.
Request structured outputs (JSON) for data extraction and downstream integration.
Leverage model capabilities strategically: use VQA, OCR, and scene detection where they deliver tangible value.

Example Use Cases

Transcribe a 90-minute podcast with timestamps, plus a short topic summary for show notes.
Extract tables and charts from a multi-page PDF invoice and export as JSON for accounting.
Caption a product image set, detect objects, and OCR on-image text to populate a catalog.
Process a 2-hour lecture video with scene detection and a Q&A pass to highlight key topics.
Generate a set of marketing images from prompts and iteratively refine composition and style.
