video-audio-processor
Video/Audio Processor
Install: npx machina-cli add skill sacredvoid/skillkit/video-audio-processor --openclaw
Process video and audio recordings into text transcripts and visual frame captures for analysis. Auto-detects your hardware and picks the optimal transcription model, device, and settings.
Phase 0: Detect Compute Environment
Run this at the start of every invocation. It determines everything downstream.
python3 -c "
import platform, shutil, os, json
info = {'os': platform.system(), 'arch': platform.machine(), 'ram_gb': 0, 'gpu': 'none', 'vram_gb': 0, 'device': 'cpu', 'dtype': 'float32'}
# RAM
try:
    if platform.system() == 'Darwin':
        info['ram_gb'] = round(os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / (1024**3))
    elif platform.system() == 'Linux':
        with open('/proc/meminfo') as f:
            for line in f:
                if line.startswith('MemTotal'):
                    info['ram_gb'] = round(int(line.split()[1]) / (1024**2))
                    break
    else:
        import ctypes
        mem = ctypes.c_ulonglong(0)
        ctypes.windll.kernel32.GetPhysicallyInstalledSystemMemory(ctypes.byref(mem))
        info['ram_gb'] = round(mem.value / (1024**2))  # API returns kilobytes
except Exception:
    pass
# GPU detection
try:
    import torch
    if torch.cuda.is_available():
        info['gpu'] = torch.cuda.get_device_name(0)
        info['vram_gb'] = round(torch.cuda.get_device_properties(0).total_memory / (1024**3))
        info['device'] = 'cuda'
        info['dtype'] = 'float16'
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        info['gpu'] = 'Apple Silicon (MPS)'
        info['vram_gb'] = info['ram_gb']  # unified memory
        info['device'] = 'mps'
        info['dtype'] = 'float16'
    elif getattr(torch.version, 'hip', None):
        info['gpu'] = 'AMD (ROCm)'
        info['device'] = 'cuda'  # ROCm exposes the CUDA API
        info['dtype'] = 'float16'
except ImportError:
    pass
# ffmpeg detection
info['ffmpeg'] = shutil.which('ffmpeg') is not None
info['ffprobe'] = shutil.which('ffprobe') is not None
print(json.dumps(info))
"
Parse the JSON output and store it internally as COMPUTE. Present the results to the user:
Detected hardware:
- OS: {os} ({arch})
- RAM: {ram_gb} GB
- GPU: {gpu} ({vram_gb} GB VRAM)
- Compute device: {device}
- ffmpeg: {installed/missing}
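If the detection script is captured from a subprocess rather than run by hand, parsing its output is a single json.loads call. A minimal sketch, assuming the script above has been saved to a hypothetical detect_hardware.py:

import json, subprocess

# detect_hardware.py is a stand-in for the inline python3 -c script above
out = subprocess.run(['python3', 'detect_hardware.py'],
                     capture_output=True, text=True, check=True)
COMPUTE = json.loads(out.stdout.strip())
print(f"device={COMPUTE['device']}  ram={COMPUTE['ram_gb']} GB  ffmpeg={COMPUTE['ffmpeg']}")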
Whisper Engine Selection
Based on hardware, choose the transcription engine:
| Condition | Engine | Reason |
|---|---|---|
| CUDA GPU available | faster-whisper | GPU-accelerated CTranslate2, 4x faster than standard Whisper |
| MPS (Apple Silicon) | openai-whisper | Standard Whisper, uses MPS acceleration when available |
| CPU only | openai-whisper | Standard Whisper on CPU, reliable fallback |
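The table reduces to a one-line rule over the COMPUTE dict from Phase 0. A sketch (the variable names follow this document's conventions, not a fixed API):

def pick_engine(compute):
    # faster-whisper's CTranslate2 backend only accelerates CUDA devices;
    # standard Whisper is the reliable default everywhere else
    return 'faster-whisper' if compute['device'] == 'cuda' else 'openai-whisper'

WHISPER_ENGINE = pick_engine(COMPUTE)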
Model Selection Matrix
Based on hardware and recording length, recommend a model:
| Condition | Recommended Model | Speed (1hr recording) | Accuracy |
|---|---|---|---|
| CUDA >= 8 GB VRAM | large-v3 | ~5 min | Best |
| CUDA 4-7 GB VRAM | medium | ~8 min | High |
| MPS (Apple Silicon, 16+ GB) | medium | ~15 min | High |
| MPS (Apple Silicon, 8 GB) | small | ~10 min | Good |
| CPU + RAM >= 16 GB | small | ~30 min | Good |
| CPU + RAM < 16 GB | base | ~15 min | Adequate |
Present the recommendation and let the user choose:
AskUserQuestion: "Which transcription model should I use?"
Options:
- [Recommended model] (Recommended) - [reason based on their hardware]
- large-v3 - Best accuracy, needs 8+ GB VRAM or lots of RAM
- medium - High accuracy, balanced speed
- small - Good accuracy, faster processing
- base - Adequate for most meetings, fastest
Store the chosen model as WHISPER_MODEL and engine as WHISPER_ENGINE.
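The matrix can also be expressed as a recommendation function whose result seeds the question above. A sketch against the COMPUTE dict from Phase 0:

def recommend_model(compute):
    # Mirrors the model selection matrix row by row
    if compute['device'] == 'cuda':
        return 'large-v3' if compute['vram_gb'] >= 8 else 'medium'
    if compute['device'] == 'mps':
        return 'medium' if compute['ram_gb'] >= 16 else 'small'
    return 'small' if compute['ram_gb'] >= 16 else 'base'

WHISPER_MODEL = recommend_model(COMPUTE)  # present as the default; the user may override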
Phase 1: Identify the Recording
Find the file. Zoom recordings typically follow the pattern:
- GMT{date}-{time}_Recording_{resolution}.mp4 (video)
- GMT{date}-{time}_Recording.m4a (audio only)
- GMT{date}-{time}_RecordingnewChat.txt (chat log)
Check for all three artifacts. Read the chat file first (it's small and gives immediate context).
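A sketch of locating the three artifacts with glob; the patterns follow the Zoom naming above, and the recordings directory is a hypothetical default:

import glob, os

rec_dir = os.path.expanduser('~/Documents/Zoom')  # hypothetical; adjust to the user's setup
video = sorted(glob.glob(os.path.join(rec_dir, 'GMT*_Recording_*.mp4')))
audio = sorted(glob.glob(os.path.join(rec_dir, 'GMT*_Recording.m4a')))
chat  = sorted(glob.glob(os.path.join(rec_dir, 'GMT*Chat.txt')))

# The chat log is tiny, so read it first for immediate context
for path in chat:
    with open(path, encoding='utf-8') as f:
        print(f.read())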
Phase 2: Install Dependencies
Check and install what's needed based on platform.
ffmpeg
# Check if ffmpeg is available
ffmpeg -version 2>&1 | head -1
ffprobe -version 2>&1 | head -1
If ffmpeg is missing, install based on platform:
| Platform | Install command |
|---|---|
| macOS (Homebrew) | brew install ffmpeg |
| macOS (no Homebrew) | Guide user to install Homebrew first, then ffmpeg |
| Ubuntu/Debian | sudo apt update && sudo apt install -y ffmpeg |
| Fedora/RHEL | sudo dnf install -y ffmpeg |
| Arch Linux | sudo pacman -S ffmpeg |
| Windows (winget) | winget install Gyan.FFmpeg |
| Windows (choco) | choco install ffmpeg |
| Windows (scoop) | scoop install ffmpeg |
Python + Whisper
python3 -c "import torch; print(torch.__version__)" 2>&1
If torch is missing, install based on platform:
| Platform | Install command |
|---|---|
| macOS (MPS) | pip3 install torch torchvision torchaudio |
| Linux (CUDA) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 |
| Linux (ROCm) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0 |
| Linux/Windows (CPU) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu |
| Windows (CUDA) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 |
Then install the transcription engine:
# For CUDA systems (faster-whisper with CTranslate2 acceleration)
pip3 install faster-whisper
# For MPS / CPU systems (standard OpenAI Whisper)
pip3 install openai-whisper
Phase 3: Get Duration
Use ffprobe (cross-platform) to get the file duration:
ffprobe -v quiet -show_entries format=duration -of csv=p=0 "/path/to/file"
This returns duration in seconds. Convert to human-readable format and display to user. Use this to estimate transcription time based on the selected model and hardware.
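A helper that wraps the ffprobe call and formats the result (the file path is illustrative):

import subprocess

def media_duration_seconds(path):
    # ffprobe prints the container duration as a bare float, e.g. '3725.17'
    out = subprocess.run(
        ['ffprobe', '-v', 'quiet', '-show_entries', 'format=duration',
         '-of', 'csv=p=0', path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

secs = media_duration_seconds('/path/to/file')
h, rem = divmod(int(secs), 3600)
m, s = divmod(rem, 60)
print(f'Duration: {h}:{m:02d}:{s:02d}')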
Present the plan:
Processing plan:
- File: {filename} ({duration})
- Engine: {WHISPER_ENGINE} with {WHISPER_MODEL} model on {device}
- Estimated transcription time: {estimate}
- Frame extraction: 1 per minute ({frame_count} frames)
Want me to adjust anything before starting?
Time Estimates
| Device | base | small | medium | large-v3 |
|---|---|---|---|---|
| CUDA (openai-whisper, RTX 3060+) | ~1 min/hr | ~2 min/hr | ~4 min/hr | ~5 min/hr |
| CUDA (faster-whisper) | ~30s/hr | ~1 min/hr | ~2 min/hr | ~3 min/hr |
| MPS (M1/M2/M3/M4) | ~3 min/hr | ~6 min/hr | ~15 min/hr | ~25 min/hr |
| CPU (16GB+ RAM) | ~8 min/hr | ~15 min/hr | ~30 min/hr | ~60 min/hr |
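These figures can feed a rough estimator; the constants below are read straight off the table and are ballpark values, not measurements:

# Approximate transcription minutes per hour of audio (from the table above)
EST_MIN_PER_HR = {
    ('cuda', 'base'): 1, ('cuda', 'small'): 2, ('cuda', 'medium'): 4, ('cuda', 'large-v3'): 5,
    ('mps',  'base'): 3, ('mps',  'small'): 6, ('mps',  'medium'): 15, ('mps',  'large-v3'): 25,
    ('cpu',  'base'): 8, ('cpu',  'small'): 15, ('cpu',  'medium'): 30, ('cpu',  'large-v3'): 60,
}

def estimate_minutes(device, model, duration_secs):
    return EST_MIN_PER_HR[(device, model)] * duration_secs / 3600

print(f"~{estimate_minutes('mps', 'medium', 5400):.0f} min for a 90-minute recording")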
Phase 4: Extract Visual Frames
Extract one frame per minute for visual context. Use a cross-platform temp directory:
# Create temp directory (cross-platform)
python3 -c "import tempfile, os; d = os.path.join(tempfile.gettempdir(), 'av-frames'); os.makedirs(d, exist_ok=True); print(d)"
Then extract frames:
ffmpeg -i "/path/to/video.mp4" \
-vf "fps=1/60,scale=1280:-1" -q:v 3 {TEMP_DIR}/frame_%03d.jpg
Use the Read tool on each frame to view it. Frames show:
- Screen shares (code, slides, UIs)
- Participant names (bottom-left of each video tile)
- Timestamps (if visible in screen share)
Phase 5: Transcribe Audio
Run the transcription engine. Choose the correct code based on WHISPER_ENGINE:
faster-whisper (CUDA systems)
from faster_whisper import WhisperModel
import json

model = WhisperModel("{WHISPER_MODEL}", device="cuda", compute_type="float16")
segments, info = model.transcribe("/path/to/audio", beam_size=5, language="en")

results = []
for segment in segments:
    results.append({
        "start": segment.start,
        "end": segment.end,
        "text": segment.text.strip()
    })

# Save timestamped transcript
with open("{TEMP_DIR}/transcript.json", "w") as f:
    json.dump({"language": info.language, "segments": results}, f, indent=2)
with open("{TEMP_DIR}/transcript.txt", "w") as f:
    for seg in results:
        start = int(seg["start"])
        mins, secs = divmod(start, 60)
        f.write(f"[{mins:02d}:{secs:02d}] {seg['text']}\n")
print(f"Transcribed {len(results)} segments")
openai-whisper (MPS / CPU systems)
import whisper, json

model = whisper.load_model("{WHISPER_MODEL}")
result = model.transcribe("/path/to/audio", verbose=False)

# Save timestamped transcript
with open("{TEMP_DIR}/transcript.json", "w") as f:
    json.dump(result, f, indent=2)
with open("{TEMP_DIR}/transcript.txt", "w") as f:
    for seg in result["segments"]:
        start = int(seg["start"])
        mins, secs = divmod(start, 60)
        f.write(f"[{mins:02d}:{secs:02d}] {seg['text'].strip()}\n")
print(f"Transcribed {len(result['segments'])} segments")
Run transcription in the background since it takes significant time. While it runs, review the extracted frames visually.
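One way to launch it in the background with subprocess; transcribe.py is a hypothetical file holding whichever engine snippet applies:

import subprocess

# Start transcription as a separate process and keep working while it runs
log = open('transcribe.log', 'w')
proc = subprocess.Popen(['python3', 'transcribe.py'], stdout=log, stderr=subprocess.STDOUT)

# ... review the extracted frames here ...

proc.wait()  # join only after frame review is done
print('Transcription exited with code', proc.returncode)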
Phase 6: Synthesize
Combine visual frames + transcript + chat to build understanding:
- Who: Identify participants from video name labels and voice attribution
- What: What was shown on screen (demos, slides, code)
- Said: Key quotes, decisions, requirements, feedback
- Action items: Explicit commitments and follow-ups
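Because frames are pulled at one per minute, each frame lines up with a one-minute slice of the transcript. A sketch of that alignment, assuming the transcript.json written in Phase 5:

import json

with open('{TEMP_DIR}/transcript.json') as f:
    segments = json.load(f)['segments']

def segments_for_frame(n):
    # With fps=1/60, frame_%03d numbering starts at 1 and frame n is captured
    # near t = (n-1)*60 s, so pair it with the minute of speech that follows
    lo = (n - 1) * 60
    return [s['text'] for s in segments if lo <= s['start'] < lo + 60]

print(segments_for_frame(5))  # what was said around frame_005.jpg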
Model Reference
| Model | Parameters | Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 39M | Very fast | Low | Quick scan, checking if audio is usable |
| base | 74M | Fast | Adequate | Default for short/clear meetings |
| small | 244M | Moderate | Good | General meetings with accents |
| medium | 769M | Slow | High | Important meetings, customer-facing |
| large-v3 | 1.5B | Very slow | Best | Critical recordings, poor audio quality |
Tips
- Always run transcription in the background while reviewing frames visually
- Read frames in batches of 8 (tool parallelism limit)
- Standard Whisper uses FP32 on CPU and MPS; faster-whisper uses CTranslate2 with FP16 on CUDA
- For very long recordings (>1 hour), consider splitting the audio first (see the sketch after this list)
- Chat files from Zoom contain timestamps and shared links; read them first for quick context
- On Windows, use forward slashes or raw strings for file paths in Python
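A sketch of the split using ffmpeg's segment muxer (30-minute chunks; stream copy keeps it fast and lossless):

import subprocess

# Split into 30-minute chunks without re-encoding; transcribe each chunk
# separately, then offset each chunk's timestamps by index * 1800 s when merging
subprocess.run([
    'ffmpeg', '-i', '/path/to/audio.m4a',
    '-f', 'segment', '-segment_time', '1800',
    '-c', 'copy', '{TEMP_DIR}/chunk_%03d.m4a',
], check=True)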
Error Handling
| Error | Cause | Fix |
|---|---|---|
| ffmpeg: command not found | ffmpeg not installed | Install per platform table above |
| No module named 'whisper' | Whisper not installed | pip3 install openai-whisper |
| No module named 'faster_whisper' | faster-whisper not installed | pip3 install faster-whisper |
| CUDA out of memory | Model too large for VRAM | Switch to smaller model |
| RuntimeError: MPS backend | MPS issue on older macOS | Update macOS, or fall back to CPU with PYTORCH_ENABLE_MPS_FALLBACK=1 |
| torch not found | Python deps missing | Install torch per platform table above |
| Garbled transcript | Wrong language or poor audio | Try specifying language="en" or use a larger model |
| Very slow transcription | Running large model on CPU | Switch to base or small model |
Source
https://github.com/sacredvoid/skillkit/blob/main/skills/video-audio-processor/SKILL.md
Overview
Video/Audio Processor converts recordings into text transcripts using Whisper and extracts visual frames with ffmpeg. It auto-detects your hardware (Apple Silicon, NVIDIA, AMD, CPU) and selects the optimal transcription engine and model. It works across macOS, Linux, and Windows.
How This Skill Works
At startup, it runs Phase 0 to detect OS, architecture, RAM, GPU/VRAM, and whether ffmpeg/ffprobe are installed. Based on the detection, it selects the Whisper engine (faster-whisper for CUDA, openai-whisper for MPS or CPU) and chooses a model from the matrix (large-v3, medium, small, base). It then transcribes the audio and uses ffmpeg to extract visual frames, returning a transcript and frame captures.
When to Use It
- Transcribing Zoom or meeting recordings for searchable archives and meeting minutes.
- Generating transcripts for online lectures or webinars to improve accessibility.
- Analyzing interview or podcast recordings with synchronized frame captures for context.
- Compliance or QA reviews of recorded conversations requiring auditable transcripts.
- Processing long recordings on mixed hardware (CPU-only, Apple Silicon, or GPUs) to balance speed and accuracy.
Quick Start
- Step 1: Place your media file in a known path and ensure ffmpeg/ffprobe are installed.
- Step 2: Run the video-audio-processor to let it detect your hardware, engine, and model automatically.
- Step 3: Retrieve the generated transcript and frame captures, then review or export as needed.
Best Practices
- Ensure ffmpeg and ffprobe are installed and accessible in your PATH before processing.
- Run the Phase 0 hardware detection to let the system pick the best engine and model.
- If you have a powerful GPU and ample VRAM, consider larger models for higher accuracy.
- For very long recordings, chunk the media into shorter segments to optimize speed and memory usage.
- Quickly validate a small sample before processing full-length files to confirm output quality.
Example Use Cases
- Transcribe a 90-minute Zoom meeting on Apple Silicon and extract frame captures for visual context.
- Process a 2-hour lecture on an NVIDIA-powered workstation to generate a large-v3 transcript and frame timeline.
- Archive a conference talk on a CPU-only Windows machine, producing a base or small model transcript with frames.
- Review a sales call recording for QA by obtaining a text transcript and key frame snapshots for highlights.
- Label a video dataset by aligning transcripts with scene frames for sentiment and topic analysis.