video-audio-processor
Video/Audio Processor
Install: npx machina-cli add skill sacredvoid/skillkit/video-audio-processor --openclaw
Process video and audio recordings into text transcripts and visual frame captures for analysis. Auto-detects your hardware and picks the optimal transcription model, device, and settings.
Phase 0: Detect Compute Environment
Run this at the start of every invocation. It determines everything downstream.
python3 -c "
import platform, shutil, os, json
info = {'os': platform.system(), 'arch': platform.machine(), 'ram_gb': 0, 'gpu': 'none', 'vram_gb': 0, 'device': 'cpu', 'dtype': 'float32'}
# RAM
try:
    if platform.system() == 'Darwin':
        info['ram_gb'] = round(os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / (1024**3))
    elif platform.system() == 'Linux':
        with open('/proc/meminfo') as f:
            for line in f:
                if line.startswith('MemTotal'):
                    info['ram_gb'] = round(int(line.split()[1]) / (1024**2))
                    break
    else:
        import ctypes
        mem = ctypes.c_ulonglong(0)
        ctypes.windll.kernel32.GetPhysicallyInstalledSystemMemory(ctypes.byref(mem))
        info['ram_gb'] = round(mem.value / (1024**2))  # API returns kilobytes
except Exception:
    pass
# GPU detection
try:
    import torch
    if torch.cuda.is_available():
        info['gpu'] = torch.cuda.get_device_name(0)
        info['vram_gb'] = round(torch.cuda.get_device_properties(0).total_memory / (1024**3))
        info['device'] = 'cuda'
        info['dtype'] = 'float16'
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        info['gpu'] = 'Apple Silicon (MPS)'
        info['vram_gb'] = info['ram_gb']  # unified memory
        info['device'] = 'mps'
        info['dtype'] = 'float16'
    elif getattr(torch.version, 'hip', None):
        info['gpu'] = 'AMD (ROCm)'
        info['device'] = 'cuda'  # ROCm exposes the CUDA API
        info['dtype'] = 'float16'
except ImportError:
    pass
# ffmpeg detection
info['ffmpeg'] = shutil.which('ffmpeg') is not None
info['ffprobe'] = shutil.which('ffprobe') is not None
print(json.dumps(info))
"
Parse the JSON output and store it internally as COMPUTE. Present the results to the user:
Detected hardware:
- OS: {os} ({arch})
- RAM: {ram_gb} GB
- GPU: {gpu} ({vram_gb} GB VRAM)
- Compute device: {device}
- ffmpeg: {installed/missing}
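If the detection script is captured from a subprocess rather than run by hand, parsing its output is a single json.loads call. A minimal sketch, assuming the script above has been saved to a hypothetical detect_hardware.py:

import json, subprocess

# detect_hardware.py is a stand-in for the inline python3 -c script above
out = subprocess.run(['python3', 'detect_hardware.py'],
                     capture_output=True, text=True, check=True)
COMPUTE = json.loads(out.stdout.strip())
print(f"device={COMPUTE['device']}  ram={COMPUTE['ram_gb']} GB  ffmpeg={COMPUTE['ffmpeg']}")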
Whisper Engine Selection
Based on hardware, choose the transcription engine:
| Condition | Engine | Reason |
|---|---|---|
| CUDA GPU available | faster-whisper | GPU-accelerated CTranslate2, 4x faster than standard Whisper |
| MPS (Apple Silicon) | openai-whisper | Standard Whisper, uses MPS acceleration when available |
| CPU only | openai-whisper | Standard Whisper on CPU, reliable fallback |
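The table reduces to a one-line rule over the COMPUTE dict from Phase 0. A sketch (the variable names follow this document's conventions, not a fixed API):

def pick_engine(compute):
    # faster-whisper's CTranslate2 backend only accelerates CUDA devices;
    # standard Whisper is the reliable default everywhere else
    return 'faster-whisper' if compute['device'] == 'cuda' else 'openai-whisper'

WHISPER_ENGINE = pick_engine(COMPUTE)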
Model Selection Matrix
Based on hardware and recording length, recommend a model:
| Condition | Recommended Model | Speed (1hr recording) | Accuracy |
|---|---|---|---|
| CUDA >= 8 GB VRAM | large-v3 | ~5 min | Best |
| CUDA 4-7 GB VRAM | medium | ~8 min | High |
| MPS (Apple Silicon, 16+ GB) | medium | ~15 min | High |
| MPS (Apple Silicon, 8 GB) | small | ~10 min | Good |
| CPU + RAM >= 16 GB | small | ~30 min | Good |
| CPU + RAM < 16 GB | base | ~15 min | Adequate |
Present the recommendation and let the user choose:
AskUserQuestion: "Which transcription model should I use?"
Options:
- [Recommended model] (Recommended) - [reason based on their hardware]
- large-v3 - Best accuracy, needs 8+ GB VRAM or lots of RAM
- medium - High accuracy, balanced speed
- small - Good accuracy, faster processing
- base - Adequate for most meetings, fastest
Store the chosen model as WHISPER_MODEL and engine as WHISPER_ENGINE.
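The matrix can also be expressed as a recommendation function whose result seeds the question above. A sketch against the COMPUTE dict from Phase 0:

def recommend_model(compute):
    # Mirrors the model selection matrix row by row
    if compute['device'] == 'cuda':
        return 'large-v3' if compute['vram_gb'] >= 8 else 'medium'
    if compute['device'] == 'mps':
        return 'medium' if compute['ram_gb'] >= 16 else 'small'
    return 'small' if compute['ram_gb'] >= 16 else 'base'

WHISPER_MODEL = recommend_model(COMPUTE)  # present as the default; the user may override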
Phase 1: Identify the Recording
Find the file. Zoom recordings typically follow the pattern:
- GMT{date}-{time}_Recording_{resolution}.mp4 (video)
- GMT{date}-{time}_Recording.m4a (audio only)
- GMT{date}-{time}_RecordingnewChat.txt (chat log)
Check for all three artifacts. Read the chat file first (it's small and gives immediate context).
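A sketch of locating the three artifacts with glob; the patterns follow the Zoom naming above, and the recordings directory is a hypothetical default:

import glob, os

rec_dir = os.path.expanduser('~/Documents/Zoom')  # hypothetical; adjust to the user's setup
video = sorted(glob.glob(os.path.join(rec_dir, 'GMT*_Recording_*.mp4')))
audio = sorted(glob.glob(os.path.join(rec_dir, 'GMT*_Recording.m4a')))
chat  = sorted(glob.glob(os.path.join(rec_dir, 'GMT*Chat.txt')))

# The chat log is tiny, so read it first for immediate context
for path in chat:
    with open(path, encoding='utf-8') as f:
        print(f.read())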
Phase 2: Install Dependencies
Check and install what's needed based on platform.
ffmpeg
# Check if ffmpeg is available
ffmpeg -version 2>&1 | head -1
ffprobe -version 2>&1 | head -1
If ffmpeg is missing, install based on platform:
| Platform | Install command |
|---|---|
| macOS (Homebrew) | brew install ffmpeg |
| macOS (no Homebrew) | Guide user to install Homebrew first, then ffmpeg |
| Ubuntu/Debian | sudo apt update && sudo apt install -y ffmpeg |
| Fedora/RHEL | sudo dnf install -y ffmpeg |
| Arch Linux | sudo pacman -S ffmpeg |
| Windows (winget) | winget install Gyan.FFmpeg |
| Windows (choco) | choco install ffmpeg |
| Windows (scoop) | scoop install ffmpeg |
Python + Whisper
python3 -c "import torch; print(torch.__version__)" 2>&1
If torch is missing, install based on platform:
| Platform | Install command |
|---|---|
| macOS (MPS) | pip3 install torch torchvision torchaudio |
| Linux (CUDA) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 |
| Linux (ROCm) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0 |
| Linux/Windows (CPU) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu |
| Windows (CUDA) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 |
Then install the transcription engine:
# For CUDA systems (faster-whisper with CTranslate2 acceleration)
pip3 install faster-whisper
# For MPS / CPU systems (standard OpenAI Whisper)
pip3 install openai-whisper
Phase 3: Get Duration
Use ffprobe (cross-platform) to get the file duration:
ffprobe -v quiet -show_entries format=duration -of csv=p=0 "/path/to/file"
This returns duration in seconds. Convert to human-readable format and display to user. Use this to estimate transcription time based on the selected model and hardware.
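A helper that wraps the ffprobe call and formats the result (the file path is illustrative):

import subprocess

def media_duration_seconds(path):
    # ffprobe prints the container duration as a bare float, e.g. '3725.17'
    out = subprocess.run(
        ['ffprobe', '-v', 'quiet', '-show_entries', 'format=duration',
         '-of', 'csv=p=0', path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

secs = media_duration_seconds('/path/to/file')
h, rem = divmod(int(secs), 3600)
m, s = divmod(rem, 60)
print(f'Duration: {h}:{m:02d}:{s:02d}')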
Present the plan:
Processing plan:
- File: {filename} ({duration})
- Engine: {WHISPER_ENGINE} with {WHISPER_MODEL} model on {device}
- Estimated transcription time: {estimate}
- Frame extraction: 1 per minute ({frame_count} frames)
Want me to adjust anything before starting?
Time Estimates
| Device | base | small | medium | large-v3 |
|---|---|---|---|---|
| CUDA (openai-whisper, RTX 3060+) | ~1 min/hr | ~2 min/hr | ~4 min/hr | ~5 min/hr |
| CUDA (faster-whisper) | ~30s/hr | ~1 min/hr | ~2 min/hr | ~3 min/hr |
| MPS (M1/M2/M3/M4) | ~3 min/hr | ~6 min/hr | ~15 min/hr | ~25 min/hr |
| CPU (16GB+ RAM) | ~8 min/hr | ~15 min/hr | ~30 min/hr | ~60 min/hr |
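These figures can feed a rough estimator; the constants below are read straight off the table and are ballpark values, not measurements:

# Approximate transcription minutes per hour of audio (from the table above)
EST_MIN_PER_HR = {
    ('cuda', 'base'): 1, ('cuda', 'small'): 2, ('cuda', 'medium'): 4, ('cuda', 'large-v3'): 5,
    ('mps',  'base'): 3, ('mps',  'small'): 6, ('mps',  'medium'): 15, ('mps',  'large-v3'): 25,
    ('cpu',  'base'): 8, ('cpu',  'small'): 15, ('cpu',  'medium'): 30, ('cpu',  'large-v3'): 60,
}

def estimate_minutes(device, model, duration_secs):
    return EST_MIN_PER_HR[(device, model)] * duration_secs / 3600

print(f"~{estimate_minutes('mps', 'medium', 5400):.0f} min for a 90-minute recording")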
Phase 4: Extract Visual Frames
Extract one frame per minute for visual context. Use a cross-platform temp directory:
# Create temp directory (cross-platform)
python3 -c "import tempfile, os; d = os.path.join(tempfile.gettempdir(), 'av-frames'); os.makedirs(d, exist_ok=True); print(d)"
Then extract frames:
ffmpeg -i "/path/to/video.mp4" \
-vf "fps=1/60,scale=1280:-1" -q:v 3 {TEMP_DIR}/frame_%03d.jpg
Use the Read tool on each frame to view it. Frames show:
- Screen shares (code, slides, UIs)
- Participant names (bottom-left of each video tile)
- Timestamps (if visible in screen share)
Phase 5: Transcribe Audio
Run the transcription engine. Choose the correct code based on WHISPER_ENGINE:
faster-whisper (CUDA systems)
from faster_whisper import WhisperModel
import json

model = WhisperModel("{WHISPER_MODEL}", device="cuda", compute_type="float16")
segments, info = model.transcribe("/path/to/audio", beam_size=5, language="en")

results = []
for segment in segments:
    results.append({
        "start": segment.start,
        "end": segment.end,
        "text": segment.text.strip()
    })

# Save timestamped transcript
with open("{TEMP_DIR}/transcript.json", "w") as f:
    json.dump({"language": info.language, "segments": results}, f, indent=2)
with open("{TEMP_DIR}/transcript.txt", "w") as f:
    for seg in results:
        start = int(seg["start"])
        mins, secs = divmod(start, 60)
        f.write(f"[{mins:02d}:{secs:02d}] {seg['text']}\n")
print(f"Transcribed {len(results)} segments")
openai-whisper (MPS / CPU systems)
import whisper, json

model = whisper.load_model("{WHISPER_MODEL}")
result = model.transcribe("/path/to/audio", verbose=False)

# Save timestamped transcript
with open("{TEMP_DIR}/transcript.json", "w") as f:
    json.dump(result, f, indent=2)
with open("{TEMP_DIR}/transcript.txt", "w") as f:
    for seg in result["segments"]:
        start = int(seg["start"])
        mins, secs = divmod(start, 60)
        f.write(f"[{mins:02d}:{secs:02d}] {seg['text'].strip()}\n")
print(f"Transcribed {len(result['segments'])} segments")
Run transcription in the background since it takes significant time. While it runs, review the extracted frames visually.
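One way to launch it in the background with subprocess; transcribe.py is a hypothetical file holding whichever engine snippet applies:

import subprocess

# Start transcription as a separate process and keep working while it runs
log = open('transcribe.log', 'w')
proc = subprocess.Popen(['python3', 'transcribe.py'], stdout=log, stderr=subprocess.STDOUT)

# ... review the extracted frames here ...

proc.wait()  # join only after frame review is done
print('Transcription exited with code', proc.returncode)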
Phase 6: Synthesize
Combine visual frames + transcript + chat to build understanding:
- Who: Identify participants from video name labels and voice attribution
- What: What was shown on screen (demos, slides, code)
- Said: Key quotes, decisions, requirements, feedback
- Action items: Explicit commitments and follow-ups
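Because frames are pulled at one per minute, each frame lines up with a one-minute slice of the transcript. A sketch of that alignment, assuming the transcript.json written in Phase 5:

import json

with open('{TEMP_DIR}/transcript.json') as f:
    segments = json.load(f)['segments']

def segments_for_frame(n):
    # With fps=1/60, frame_%03d numbering starts at 1 and frame n is captured
    # near t = (n-1)*60 s, so pair it with the minute of speech that follows
    lo = (n - 1) * 60
    return [s['text'] for s in segments if lo <= s['start'] < lo + 60]

print(segments_for_frame(5))  # what was said around frame_005.jpg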
Model Reference
| Model | Parameters | Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 39M | Very fast | Low | Quick scan, checking if audio is usable |
| base | 74M | Fast | Adequate | Default for short/clear meetings |
| small | 244M | Moderate | Good | General meetings with accents |
| medium | 769M | Slow | High | Important meetings, customer-facing |
| large-v3 | 1.5B | Very slow | Best | Critical recordings, poor audio quality |
Tips
- Always run transcription in the background while reviewing frames visually
- Read frames in batches of 8 (tool parallelism limit)
- Standard Whisper uses FP32 on CPU and MPS; faster-whisper uses CTranslate2 with FP16 on CUDA
- For very long recordings (>1 hour), consider splitting the audio first (see the sketch after this list)
- Chat files from Zoom contain timestamps and shared links; read them first for quick context
- On Windows, use forward slashes or raw strings for file paths in Python
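A sketch of the split using ffmpeg's segment muxer (30-minute chunks; stream copy keeps it fast and lossless):

import subprocess

# Split into 30-minute chunks without re-encoding; transcribe each chunk
# separately, then offset each chunk's timestamps by index * 1800 s when merging
subprocess.run([
    'ffmpeg', '-i', '/path/to/audio.m4a',
    '-f', 'segment', '-segment_time', '1800',
    '-c', 'copy', '{TEMP_DIR}/chunk_%03d.m4a',
], check=True)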
Error Handling
| Error | Cause | Fix |
|---|---|---|
| ffmpeg: command not found | ffmpeg not installed | Install per platform table above |
| No module named 'whisper' | Whisper not installed | pip3 install openai-whisper |
| No module named 'faster_whisper' | faster-whisper not installed | pip3 install faster-whisper |
| CUDA out of memory | Model too large for VRAM | Switch to smaller model |
| RuntimeError: MPS backend | MPS issue on older macOS | Update macOS, or fall back to CPU with PYTORCH_ENABLE_MPS_FALLBACK=1 |
| torch not found | Python deps missing | Install torch per platform table above |
| Garbled transcript | Wrong language or poor audio | Try specifying language="en" or use a larger model |
| Very slow transcription | Running large model on CPU | Switch to base or small model |
Source
https://github.com/sacredvoid/skillkit/blob/main/skills/video-audio-processor/SKILL.md
Overview
Video/Audio Processor converts recordings into text transcripts using Whisper and extracts visual frames with ffmpeg. It auto-detects your hardware (Apple Silicon, NVIDIA, AMD, CPU) and selects the optimal transcription engine and model. It works across macOS, Linux, and Windows.
How This Skill Works
At startup, it runs Phase 0 to detect OS, architecture, RAM, GPU/VRAM, and whether ffmpeg/ffprobe are installed. Based on the detection, it selects the Whisper engine (faster-whisper for CUDA, openai-whisper for MPS or CPU) and chooses a model from the matrix (large-v3, medium, small, base). It then transcribes the audio and uses ffmpeg to extract visual frames, returning a transcript and frame captures.
When to Use It
- Transcribing Zoom or meeting recordings for searchable archives and meeting minutes.
- Generating transcripts for online lectures or webinars to improve accessibility.
- Analyzing interview or podcast recordings with synchronized frame captures for context.
- Compliance or QA reviews of recorded conversations requiring auditable transcripts.
- Processing long recordings on mixed hardware (CPU-only, Apple Silicon, or GPUs) to balance speed and accuracy.
Quick Start
- Step 1: Place your media file in a known path and ensure ffmpeg/ffprobe are installed.
- Step 2: Run the video-audio-processor to let it detect your hardware, engine, and model automatically.
- Step 3: Retrieve the generated transcript and frame captures, then review or export as needed.
Best Practices
- Ensure ffmpeg and ffprobe are installed and accessible in your PATH before processing.
- Run the Phase 0 hardware detection to let the system pick the best engine and model.
- If you have a powerful GPU and ample VRAM, consider larger models for higher accuracy.
- For very long recordings, chunk the media into shorter segments to optimize speed and memory usage.
- Quickly validate a small sample before processing full-length files to confirm output quality.
Example Use Cases
- Transcribe a 90-minute Zoom meeting on Apple Silicon and extract frame captures for visual context.
- Process a 2-hour lecture on an NVIDIA-powered workstation to generate a large-v3 transcript and frame timeline.
- Archive a conference talk on a CPU-only Windows machine, producing a base or small model transcript with frames.
- Review a sales call recording for QA by obtaining a text transcript and key frame snapshots for highlights.
- Label a video dataset by aligning transcripts with scene frames for sentiment and topic analysis.