
video-audio-processor

npx machina-cli add skill sacredvoid/skillkit/video-audio-processor --openclaw

Video/Audio Processor

Process video and audio recordings into text transcripts and visual frame captures for analysis. Auto-detects your hardware and picks the optimal transcription model, device, and settings.

Phase 0: Detect Compute Environment

Run this at the start of every invocation. It determines everything downstream.

python3 -c "
import platform, shutil, subprocess, json

info = {'os': platform.system(), 'arch': platform.machine(), 'ram_gb': 0, 'gpu': 'none', 'vram_gb': 0, 'device': 'cpu', 'dtype': 'float32'}

# RAM
try:
    if platform.system() == 'Darwin':
        import os; info['ram_gb'] = round(os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / (1024**3))
    elif platform.system() == 'Linux':
        with open('/proc/meminfo') as f:
            for line in f:
                if line.startswith('MemTotal'):
                    info['ram_gb'] = round(int(line.split()[1]) / (1024**2))
                    break
    else:
        import ctypes
        mem = ctypes.c_ulonglong(0)
        ctypes.windll.kernel32.GetPhysicallyInstalledSystemMemory(ctypes.byref(mem))
        info['ram_gb'] = round(mem.value / (1024**2))
except Exception: pass

# GPU detection
try:
    import torch
    if torch.cuda.is_available():
        info['gpu'] = torch.cuda.get_device_name(0)
        info['vram_gb'] = round(torch.cuda.get_device_properties(0).total_memory / (1024**3))
        info['device'] = 'cuda'
        info['dtype'] = 'float16'
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        info['gpu'] = 'Apple Silicon (MPS)'
        info['vram_gb'] = info['ram_gb']  # unified memory
        info['device'] = 'mps'
        info['dtype'] = 'float16'
    elif getattr(torch.version, 'hip', None) is not None:  # ROCm build of PyTorch
        info['gpu'] = 'AMD (ROCm)'
        info['device'] = 'cuda'  # ROCm uses cuda API
        info['dtype'] = 'float16'
except ImportError:
    pass

# ffmpeg detection
info['ffmpeg'] = shutil.which('ffmpeg') is not None
info['ffprobe'] = shutil.which('ffprobe') is not None

print(json.dumps(info))
"

Parse the JSON output and store it internally as COMPUTE. Present the results to the user:

Detected hardware:

  • OS: {os} ({arch})
  • RAM: {ram_gb} GB
  • GPU: {gpu} ({vram_gb} GB VRAM)
  • Compute device: {device}
  • ffmpeg: {installed/missing}
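
If it helps to script this step, here is a minimal sketch of parsing the detection output into COMPUTE and rendering the summary (the detect_output value below is a placeholder for the Phase 0 stdout):

import json

# Placeholder: in practice this is the stdout captured from the Phase 0 script
detect_output = '{"os": "Darwin", "arch": "arm64", "ram_gb": 16, "gpu": "Apple Silicon (MPS)", "vram_gb": 16, "device": "mps", "dtype": "float16", "ffmpeg": true, "ffprobe": true}'

COMPUTE = json.loads(detect_output)

print("Detected hardware:\n")
print(f"  • OS: {COMPUTE['os']} ({COMPUTE['arch']})")
print(f"  • RAM: {COMPUTE['ram_gb']} GB")
print(f"  • GPU: {COMPUTE['gpu']} ({COMPUTE['vram_gb']} GB VRAM)")
print(f"  • Compute device: {COMPUTE['device']}")
print(f"  • ffmpeg: {'installed' if COMPUTE['ffmpeg'] else 'missing'}")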

Whisper Engine Selection

Based on hardware, choose the transcription engine:

| Condition | Engine | Reason |
|---|---|---|
| CUDA GPU available | faster-whisper | GPU-accelerated CTranslate2, 4x faster than standard Whisper |
| MPS (Apple Silicon) | openai-whisper | Standard Whisper, uses MPS acceleration when available |
| CPU only | openai-whisper | Standard Whisper on CPU, reliable fallback |
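
As a sketch, the table reduces to a simple mapping on the detected device (COMPUTE from Phase 0):

# Map the detected compute device to a transcription engine, per the table above
def pick_engine(compute: dict) -> str:
    if compute["device"] == "cuda":
        return "faster-whisper"   # CTranslate2-accelerated
    return "openai-whisper"       # MPS and CPU use standard Whisper

print(pick_engine({"device": "cuda"}))  # faster-whisper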

Model Selection Matrix

Based on hardware and recording length, recommend a model:

| Condition | Recommended Model | Speed (1 hr recording) | Accuracy |
|---|---|---|---|
| CUDA >= 8 GB VRAM | large-v3 | ~5 min | Best |
| CUDA 4-7 GB VRAM | medium | ~8 min | High |
| MPS (Apple Silicon, 16+ GB) | medium | ~15 min | High |
| MPS (Apple Silicon, 8 GB) | small | ~10 min | Good |
| CPU + RAM >= 16 GB | small | ~30 min | Good |
| CPU + RAM < 16 GB | base | ~15 min | Adequate |
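
One way to encode the matrix, as a sketch with thresholds copied from the table (the fallback for CUDA below 4 GB VRAM is an assumption, since the table does not cover it):

def recommend_model(compute: dict) -> str:
    device = compute["device"]
    vram = compute["vram_gb"]
    ram = compute["ram_gb"]
    if device == "cuda":
        if vram >= 8:
            return "large-v3"
        if vram >= 4:
            return "medium"
        return "small"  # below the table's range; conservative assumption
    if device == "mps":
        return "medium" if ram >= 16 else "small"
    return "small" if ram >= 16 else "base"  # CPU only

print(recommend_model({"device": "mps", "vram_gb": 16, "ram_gb": 16}))  # medium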

Present the recommendation and let the user choose:

AskUserQuestion: "Which transcription model should I use?"
Options:
- [Recommended model] (Recommended) - [reason based on their hardware]
- large-v3 - Best accuracy, needs 8+ GB VRAM or lots of RAM
- medium - High accuracy, balanced speed
- small - Good accuracy, faster processing
- base - Adequate for most meetings, fastest

Store the chosen model as WHISPER_MODEL and engine as WHISPER_ENGINE.

Phase 1: Identify the Recording

Find the file. Zoom recordings typically follow the pattern:

  • GMT{date}-{time}_Recording_{resolution}.mp4 (video)
  • GMT{date}-{time}_Recording.m4a (audio only)
  • GMT{date}-{time}_RecordingnewChat.txt (chat log)

Check for all three artifacts. Read the chat file first (it's small and gives immediate context).
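
A sketch of locating the three artifacts with glob patterns derived from the naming convention above (the recordings folder is an assumption; point it at wherever Zoom saves local recordings):

from pathlib import Path

# Assumed default location for Zoom local recordings; adjust as needed
recordings = Path("~/Documents/Zoom").expanduser()

video = sorted(recordings.rglob("GMT*_Recording_*.mp4"))
audio = sorted(recordings.rglob("GMT*_Recording.m4a"))
chat  = sorted(recordings.rglob("GMT*_RecordingnewChat.txt"))

# Names sort chronologically, so the last match is the newest recording
for label, matches in (("video", video), ("audio", audio), ("chat", chat)):
    print(f"{label}: {matches[-1] if matches else 'not found'}")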

Phase 2: Install Dependencies

Check and install what's needed based on platform.

ffmpeg

# Check if ffmpeg is available
ffmpeg -version 2>&1 | head -1
ffprobe -version 2>&1 | head -1

If ffmpeg is missing, install based on platform:

| Platform | Install command |
|---|---|
| macOS (Homebrew) | brew install ffmpeg |
| macOS (no Homebrew) | Guide the user to install Homebrew first, then ffmpeg |
| Ubuntu/Debian | sudo apt update && sudo apt install -y ffmpeg |
| Fedora/RHEL | sudo dnf install -y ffmpeg |
| Arch Linux | sudo pacman -S ffmpeg |
| Windows (winget) | winget install Gyan.FFmpeg |
| Windows (choco) | choco install ffmpeg |
| Windows (scoop) | scoop install ffmpeg |

Python + Whisper

python3 -c "import torch; print(torch.__version__)" 2>&1

If torch is missing, install based on platform:

| Platform | Install command |
|---|---|
| macOS (MPS) | pip3 install torch torchvision torchaudio |
| Linux (CUDA) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 |
| Linux (ROCm) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0 |
| Linux/Windows (CPU) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu |
| Windows (CUDA) | pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 |

Then install the transcription engine:

# For CUDA systems (faster-whisper with CTranslate2 acceleration)
pip3 install faster-whisper

# For MPS / CPU systems (standard OpenAI Whisper)
pip3 install openai-whisper

Phase 3: Get Duration

Use ffprobe (cross-platform) to get the file duration:

ffprobe -v quiet -show_entries format=duration -of csv=p=0 "/path/to/file"

This returns the duration in seconds. Convert it to a human-readable format and show it to the user, then use it to estimate transcription time based on the selected model and hardware.
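
A minimal sketch of reading the duration and formatting it (the media path is a placeholder):

import subprocess

media = "/path/to/recording.mp4"  # placeholder

out = subprocess.run(
    ["ffprobe", "-v", "quiet", "-show_entries", "format=duration",
     "-of", "csv=p=0", media],
    capture_output=True, text=True, check=True,
)
seconds = float(out.stdout.strip())
hours, rem = divmod(int(seconds), 3600)
minutes, secs = divmod(rem, 60)
print(f"Duration: {hours:d}:{minutes:02d}:{secs:02d} ({seconds:.0f} s)")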

Present the plan:

Processing plan:

  • File: {filename} ({duration})
  • Engine: {WHISPER_ENGINE} with {WHISPER_MODEL} model on {device}
  • Estimated transcription time: {estimate}
  • Frame extraction: 1 per minute ({frame_count} frames)

Want me to adjust anything before starting?

Time Estimates

| Device | base | small | medium | large-v3 |
|---|---|---|---|---|
| CUDA (RTX 3060+) | ~1 min/hr | ~2 min/hr | ~4 min/hr | ~5 min/hr |
| CUDA (faster-whisper) | ~30 s/hr | ~1 min/hr | ~2 min/hr | ~3 min/hr |
| MPS (M1/M2/M3/M4) | ~3 min/hr | ~6 min/hr | ~15 min/hr | ~25 min/hr |
| CPU (16 GB+ RAM) | ~8 min/hr | ~15 min/hr | ~30 min/hr | ~60 min/hr |
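
A sketch of turning these rates into an estimate for the plan above (rates are rough ballpark minutes of processing per hour of audio, copied from a subset of the table):

# Approximate minutes of processing per hour of audio, per the table above
RATES_MIN_PER_HR = {
    ("cuda-faster-whisper", "base"): 0.5, ("cuda-faster-whisper", "small"): 1,
    ("cuda-faster-whisper", "medium"): 2, ("cuda-faster-whisper", "large-v3"): 3,
    ("mps", "base"): 3, ("mps", "small"): 6, ("mps", "medium"): 15, ("mps", "large-v3"): 25,
    ("cpu", "base"): 8, ("cpu", "small"): 15, ("cpu", "medium"): 30, ("cpu", "large-v3"): 60,
}

def estimate_minutes(device_key: str, model: str, duration_seconds: float) -> float:
    return RATES_MIN_PER_HR[(device_key, model)] * duration_seconds / 3600

# Example: a 90-minute recording with the medium model on Apple Silicon
print(f"~{estimate_minutes('mps', 'medium', 90 * 60):.0f} min")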

Phase 4: Extract Visual Frames

Extract one frame per minute for visual context. Use a cross-platform temp directory:

# Create temp directory (cross-platform)
python3 -c "import tempfile, os; d = os.path.join(tempfile.gettempdir(), 'av-frames'); os.makedirs(d, exist_ok=True); print(d)"

Then extract frames:

ffmpeg -i "/path/to/video.mp4" \
  -vf "fps=1/60,scale=1280:-1" -q:v 3 {TEMP_DIR}/frame_%03d.jpg

Use the Read tool on each frame to view it. Frames show:

  • Screen shares (code, slides, UIs)
  • Participant names (bottom-left of each video tile)
  • Timestamps (if visible in screen share)

Phase 5: Transcribe Audio

Run the transcription engine. Choose the correct code based on WHISPER_ENGINE:

faster-whisper (CUDA systems)

from faster_whisper import WhisperModel
import json

model = WhisperModel("{WHISPER_MODEL}", device="cuda", compute_type="float16")
segments, info = model.transcribe("/path/to/audio", beam_size=5, language="en")

results = []
for segment in segments:
    results.append({
        "start": segment.start,
        "end": segment.end,
        "text": segment.text.strip()
    })

# Save timestamped transcript
with open("{TEMP_DIR}/transcript.json", "w") as f:
    json.dump({"language": info.language, "segments": results}, f, indent=2)

with open("{TEMP_DIR}/transcript.txt", "w") as f:
    for seg in results:
        start = int(seg["start"])
        mins, secs = divmod(start, 60)
        f.write(f"[{mins:02d}:{secs:02d}] {seg['text']}\n")

print(f"Transcribed {len(results)} segments")

openai-whisper (MPS / CPU systems)

import whisper, json

model = whisper.load_model("{WHISPER_MODEL}")
result = model.transcribe("/path/to/audio", verbose=False)

# Save timestamped transcript
with open("{TEMP_DIR}/transcript.json", "w") as f:
    json.dump(result, f, indent=2)

with open("{TEMP_DIR}/transcript.txt", "w") as f:
    for seg in result["segments"]:
        start = int(seg["start"])
        mins, secs = divmod(start, 60)
        f.write(f"[{mins:02d}:{secs:02d}] {seg['text'].strip()}\n")

print(f"Transcribed {len(result['segments'])} segments")

Run transcription in the background since it takes significant time. While it runs, review the extracted frames visually.
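
A minimal sketch of backgrounding the work, assuming one of the snippets above has been saved to a hypothetical transcribe.py:

import subprocess, sys, time

# transcribe.py is a hypothetical file containing one of the snippets above
with open("transcribe.log", "w") as log:
    proc = subprocess.Popen([sys.executable, "transcribe.py"],
                            stdout=log, stderr=subprocess.STDOUT)

while proc.poll() is None:
    time.sleep(30)  # review extracted frames in the meantime
print("Transcription finished with exit code", proc.returncode)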

Phase 6: Synthesize

Combine visual frames + transcript + chat to build understanding:

  • Who: Identify participants from video name labels and voice attribution
  • What: What was shown on screen (demos, slides, code)
  • Said: Key quotes, decisions, requirements, feedback
  • Action items: Explicit commitments and follow-ups
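
To help with the synthesis listed above, a sketch that groups the timestamped transcript by minute and pairs each minute with its extracted frame (TEMP_DIR as in Phase 4; the frame-to-minute mapping assumes the fps=1/60 extraction above):

import json, os, tempfile
from collections import defaultdict

temp_dir = os.path.join(tempfile.gettempdir(), "av-frames")

with open(os.path.join(temp_dir, "transcript.json")) as f:
    segments = json.load(f)["segments"]

# Group transcript text by minute; frame_001.jpg roughly covers minute 0, and so on
by_minute = defaultdict(list)
for seg in segments:
    by_minute[int(seg["start"]) // 60].append(seg["text"].strip())

for minute in sorted(by_minute):
    frame = f"frame_{minute + 1:03d}.jpg"
    quote = " ".join(by_minute[minute])[:120]
    print(f"[{minute:02d}:00] {frame}: {quote}")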

Model Reference

| Model | Parameters | Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 39M | Very fast | Low | Quick scan, checking if audio is usable |
| base | 74M | Fast | Adequate | Default for short/clear meetings |
| small | 244M | Moderate | Good | General meetings with accents |
| medium | 769M | Slow | High | Important meetings, customer-facing |
| large-v3 | 1.5B | Very slow | Best | Critical recordings, poor audio quality |

Tips

  • Always run transcription in the background while reviewing frames visually
  • Read frames in batches of 8 (tool parallelism limit)
  • Standard Whisper uses FP32 on CPU and MPS; faster-whisper uses CTranslate2 with FP16 on CUDA
  • For very long recordings (>1 hour), consider splitting the audio first (see the sketch after this list)
  • Chat files from Zoom contain timestamps and shared links; read them first for quick context
  • On Windows, use forward slashes or raw strings for file paths in Python
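
For the splitting tip above, a sketch that cuts the audio into 30-minute chunks with ffmpeg's segment muxer before transcribing each piece (paths and chunk length are assumptions):

import os, subprocess, tempfile

audio = "/path/to/audio.m4a"  # placeholder
out_dir = os.path.join(tempfile.gettempdir(), "av-chunks")
os.makedirs(out_dir, exist_ok=True)

# Split into 30-minute pieces without re-encoding
subprocess.run(
    ["ffmpeg", "-y", "-i", audio, "-f", "segment", "-segment_time", "1800",
     "-c", "copy", os.path.join(out_dir, "chunk_%03d.m4a")],
    check=True,
)
print(sorted(os.listdir(out_dir)))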

Error Handling

| Error | Cause | Fix |
|---|---|---|
| ffmpeg: command not found | ffmpeg not installed | Install per the platform table above |
| No module named 'whisper' | Whisper not installed | pip3 install openai-whisper |
| No module named 'faster_whisper' | faster-whisper not installed | pip3 install faster-whisper |
| CUDA out of memory | Model too large for VRAM | Switch to a smaller model |
| RuntimeError: MPS backend | MPS issue on older macOS | Update macOS, or set PYTORCH_ENABLE_MPS_FALLBACK=1 to fall back to CPU for unsupported ops |
| torch not found | Python deps missing | Install torch per the platform table above |
| Garbled transcript | Wrong language or poor audio | Specify language="en" or use a larger model |
| Very slow transcription | Running a large model on CPU | Switch to the base or small model |

Source

git clone https://github.com/sacredvoid/skillkit
View on GitHub: https://github.com/sacredvoid/skillkit/blob/main/skills/video-audio-processor/SKILL.md

Overview

Video/Audio Processor converts recordings into text transcripts using Whisper and extracts visual frames with ffmpeg. It auto-detects your hardware (Apple Silicon, NVIDIA, AMD, CPU) and selects the optimal transcription engine and model. It works across macOS, Linux, and Windows.

How This Skill Works

At startup, it runs Phase 0 to detect OS, architecture, RAM, GPU/VRAM, and whether ffmpeg/ffprobe are installed. Based on the detection, it selects the Whisper engine (faster-whisper for CUDA, openai-whisper for MPS or CPU) and chooses a model from the matrix (large-v3, medium, small, base). It then transcribes the audio and uses ffmpeg to extract visual frames, returning a transcript and frame captures.

When to Use It

  • Transcribing Zoom or meeting recordings for searchable archives and meeting minutes.
  • Generating transcripts for online lectures or webinars to improve accessibility.
  • Analyzing interview or podcast recordings with synchronized frame captures for context.
  • Compliance or QA reviews of recorded conversations requiring auditable transcripts.
  • Processing long recordings on mixed hardware (CPU-only, Apple Silicon, or GPUs) to balance speed and accuracy.

Quick Start

  1. Place your media file at a known path and ensure ffmpeg/ffprobe are installed.
  2. Run the video-audio-processor and let it detect your hardware, engine, and model automatically.
  3. Retrieve the generated transcript and frame captures, then review or export them as needed.

Best Practices

  • Ensure ffmpeg and ffprobe are installed and accessible in your PATH before processing.
  • Run the Phase 0 hardware detection to let the system pick the best engine and model.
  • If you have a powerful GPU and ample VRAM, consider larger models for higher accuracy.
  • For very long recordings, chunk the media into shorter segments to optimize speed and memory usage.
  • Quickly validate a small sample before processing full-length files to confirm output quality (see the sketch below)
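
For the sample-validation practice above, a sketch that clips the first two minutes and runs the tiny model over them as a sanity check (paths are placeholders; uses openai-whisper):

import os, subprocess, tempfile
import whisper

audio = "/path/to/audio.m4a"  # placeholder
sample = os.path.join(tempfile.gettempdir(), "sample-2min.m4a")

# Clip the first 120 seconds without re-encoding
subprocess.run(["ffmpeg", "-y", "-i", audio, "-t", "120", "-c", "copy", sample],
               check=True)

# A quick pass with the tiny model to confirm the audio is usable
result = whisper.load_model("tiny").transcribe(sample, verbose=False)
print(result["text"][:300])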

Example Use Cases

  • Transcribe a 90-minute Zoom meeting on Apple Silicon and extract frame captures for visual context.
  • Process a 2-hour lecture on an NVIDIA-powered workstation to generate a large-v3 transcript and frame timeline.
  • Archive a conference talk on Windows using CPU-only, producing a base or small model transcript with frames.
  • Review a sales call recording for QA by obtaining a text transcript and key frame snapshots for highlights.
  • Label a video dataset by aligning transcripts with scene frames for sentiment and topic analysis.
