Voice.ai: Creator Voiceover Forge
npx machina-cli add skill @gizmoGremlin/voiceai-voiceover-creator --openclaw

Voice.ai Creator Voiceover Pipeline
This skill follows the Agent Skills specification.
Turn any script into a publish-ready voiceover — complete with numbered segments, a stitched master, YouTube chapters, SRT captions, and a beautiful review page. Optionally, replace the audio track on an existing video.
Built for creators who want studio-quality voiceovers without the studio. Powered by Voice.ai.
When to use this skill
| Scenario | Why it fits |
|---|---|
| YouTube long-form | Full narration with chapter markers and captions |
| YouTube Shorts | Quick hooks with the shortform template |
| Podcasts | Consistent host voice, intro/outro templates |
| Course content | Professional narration for educational videos |
| Quick iteration | Smart caching — edit one section, only that segment re-renders |
| Video audio replacement | Drop AI voiceover onto screen recordings or B-roll |
The one-command workflow
Have a script and a video? Turn them into a finished video with AI voiceover in one shot:
node voiceai-vo.cjs build \
--input my-script.md \
--voice oliver \
--title "My Video" \
--video ./my-recording.mp4 \
--mux
This renders the voiceover, stitches the master audio, and drops it onto your video — all in one command. Output:
- out/my-video/muxed.mp4 — your video with the new voiceover
- out/my-video/master.wav — the standalone audio
- out/my-video/review.html — listen and review each segment
- out/my-video/chapters.txt — YouTube-ready chapter timestamps
- out/my-video/captions.srt — SRT captions
Use --sync pad if the audio is shorter than the video, or --sync trim to cut it to match.
Requirements
- Node.js 20+ — runtime (no npm install needed — the CLI is a single bundled file)
- VOICE_AI_API_KEY — set as an environment variable or in a .env file in the skill root. Get a key at voice.ai/dashboard.
- ffmpeg (optional) — needed for master stitching, MP3 encoding, loudness normalization, and video muxing. The pipeline still produces individual segments, the review page, chapters, and captions without it.
Configuration
The skill reads VOICE_AI_API_KEY from (in order):
1. Environment variable VOICE_AI_API_KEY
2. Environment variable VOICEAI_API_KEY (alternate)
3. A .env file in the skill root:

echo 'VOICE_AI_API_KEY=your-key-here' > .env
Use --mock on any command to run the full pipeline without an API key (produces placeholder audio).
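The lookup order above can be sketched as a small helper. This is illustrative only (the function name and the .env parsing details are assumptions, not the CLI's actual code):

```javascript
// Hypothetical sketch of the documented key lookup order:
// VOICE_AI_API_KEY env var, then VOICEAI_API_KEY, then a .env line.
function resolveApiKey(env, dotenvText) {
  if (env.VOICE_AI_API_KEY) return env.VOICE_AI_API_KEY;
  if (env.VOICEAI_API_KEY) return env.VOICEAI_API_KEY;
  const match = /^VOICE_AI_API_KEY=(.+)$/m.exec(dotenvText || "");
  return match ? match[1].trim() : null;
}

console.log(resolveApiKey({}, "VOICE_AI_API_KEY=abc123\n")); // "abc123"
```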
Commands
build — Generate a voiceover from a script
node voiceai-vo.cjs build \
--input <script.md or script.txt> \
--voice <voice-alias-or-uuid> \
--title "My Project" \
[--template youtube|podcast|shortform] \
[--language en] \
[--video input.mp4 --mux --sync shortest] \
[--force] [--mock]
What it does:
- Reads the script and splits it into segments (by ## headings for .md, or by sentence boundaries for .txt)
- Optionally prepends/appends template intro/outro segments
- Renders each segment via Voice.ai TTS as a numbered WAV file
- Stitches a master audio file (if ffmpeg is available)
- Generates chapters, captions, a review page, and metadata files
- Optionally muxes the voiceover into an existing video
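The heading-based split for .md scripts can be sketched as follows. This is a hypothetical helper, not the CLI's code; note that this simplification drops any text before the first heading:

```javascript
// Sketch of heading-mode segmentation: each "## Title" line starts a
// new segment; following lines accumulate into that segment's text.
function splitByHeadings(markdown) {
  const segments = [];
  let current = null;
  for (const line of markdown.split("\n")) {
    const h = /^##\s+(.*)/.exec(line);
    if (h) {
      if (current) segments.push(current);
      current = { title: h[1].trim(), text: "" };
    } else if (current) {
      current.text += line + "\n"; // text before the first heading is dropped
    }
  }
  if (current) segments.push(current);
  return segments;
}

const segs = splitByHeadings("## Intro\nHello.\n## Outro\nBye.");
console.log(segs.map(s => s.title)); // [ 'Intro', 'Outro' ]
```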
Full options:
| Option | Description |
|---|---|
| -i, --input <path> | Script file (.txt or .md) — required |
| -v, --voice <id> | Voice alias or UUID — required |
| -t, --title <title> | Project title (defaults to filename) |
| --template <name> | youtube, podcast, or shortform |
| --mode <mode> | headings or auto (default: headings for .md) |
| --max-chars <n> | Max characters per auto-chunk (default: 1500) |
| --language <code> | Language code (default: en) |
| --video <path> | Input video for muxing |
| --mux | Enable video muxing (requires --video) |
| --sync <policy> | shortest, pad, or trim (default: shortest) |
| --force | Re-render all segments (ignore cache) |
| --mock | Mock mode — no API calls, placeholder audio |
| -o, --out <dir> | Custom output directory |
replace-audio — Swap the audio track on a video
node voiceai-vo.cjs replace-audio \
--video ./input.mp4 \
--audio ./out/my-project/master.wav \
[--out ./out/my-project/muxed.mp4] \
[--sync shortest|pad|trim]
Requires ffmpeg. If not installed, generates helper shell/PowerShell scripts instead.
| Sync policy | Behavior |
|---|---|
| shortest (default) | Output ends when the shorter track ends |
| pad | Pad audio with silence to match video duration |
| trim | Trim audio to match video duration |
Video stream is copied without re-encoding (-c:v copy). Audio is encoded as AAC. A mux report is saved alongside the output.
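As a rough illustration, the documented behavior could map to ffmpeg arguments like the ones built below. The exact flags the CLI emits are an assumption; only -c:v copy and AAC audio are confirmed by the text above (in this simplification, trim and shortest produce the same flags):

```javascript
// Hypothetical argument builder for the mux step described above.
function muxArgs(video, audio, out, sync = "shortest") {
  const args = ["-i", video, "-i", audio];
  if (sync === "pad") args.push("-af", "apad"); // pad audio with silence
  // -shortest ends the output at the shorter track; with apad applied,
  // the padded audio outlasts the video, so the video duration wins
  args.push("-shortest",
            "-map", "0:v:0", "-map", "1:a:0",
            "-c:v", "copy", "-c:a", "aac", out);
  return args;
}

console.log(muxArgs("in.mp4", "vo.wav", "out.mp4", "pad").includes("apad")); // true
```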
Privacy: Video processing is entirely local. Only script text is sent to Voice.ai for TTS.
voices — List available voices
node voiceai-vo.cjs voices [--limit 20] [--query "deep"] [--mock]
Available voices
Use short aliases or full UUIDs with --voice:
| Alias | Voice | Gender | Style |
|---|---|---|---|
| ellie | Ellie | F | Youthful, vibrant vlogger |
| oliver | Oliver | M | Friendly British |
| lilith | Lilith | F | Soft, feminine |
| smooth | Smooth Calm Voice | M | Deep, smooth narrator |
| corpse | Corpse Husband | M | Deep, distinctive |
| skadi | Skadi | F | Anime character |
| zhongli | Zhongli | M | Deep, authoritative |
| flora | Flora | F | Cheerful, high pitch |
| chief | Master Chief | M | Heroic, commanding |
The voices command also returns any additional voices available on the API. Voice list is cached for 10 minutes.
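A 10-minute cache like the one described can be sketched as below. The cache shape and function names are assumptions for illustration:

```javascript
// Hypothetical sketch of the 10-minute voice-list cache.
const TTL_MS = 10 * 60 * 1000;

function getVoices(cache, fetchVoices, now = Date.now()) {
  if (cache.voices && now - cache.fetchedAt < TTL_MS) return cache.voices;
  cache.voices = fetchVoices(); // would hit the Voice.ai API in real use
  cache.fetchedAt = now;
  return cache.voices;
}

const cache = {};
let calls = 0;
getVoices(cache, () => { calls++; return ["oliver", "ellie"]; }, 0);
getVoices(cache, () => { calls++; return []; }, 5 * 60 * 1000); // cache hit
console.log(calls); // 1
```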
Build outputs
After a build, the output directory contains:
out/<title-slug>/
segments/ # Numbered WAV files (001-intro.wav, 002-section.wav, …)
master.wav # Stitched audio (requires ffmpeg)
master.mp3 # MP3 encode (requires ffmpeg)
manifest.json # Build metadata: voice, template, segment list, hashes
timeline.json # Segment durations and start times
review.html # Interactive review page with audio players
chapters.txt # YouTube-friendly chapter timestamps
captions.srt # SRT captions using segment boundaries
description.txt # YouTube description with chapters + Voice.ai credit
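The timestamp formats in chapters.txt and captions.srt can be sketched from segment start times like this. These helpers are assumptions about the CLI's formatting, shown for reference:

```javascript
// YouTube chapter stamp: M:SS (chapters must start at 0:00).
function chapterStamp(seconds) {
  const m = Math.floor(seconds / 60);
  const s = Math.floor(seconds % 60);
  return `${m}:${String(s).padStart(2, "0")}`;
}

// SRT stamp: HH:MM:SS,mmm with a comma before milliseconds.
function srtStamp(seconds) {
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = Math.floor(seconds % 60);
  const ms = Math.round((seconds % 1) * 1000);
  const p = (n, w) => String(n).padStart(w, "0");
  return `${p(h, 2)}:${p(m, 2)}:${p(s, 2)},${p(ms, 3)}`;
}

console.log(chapterStamp(75)); // "1:15"
console.log(srtStamp(75.5));   // "00:01:15,500"
```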
review.html
A standalone HTML page with:
- Master audio player (if stitched)
- Individual segment players with titles and durations
- Collapsible script text for each segment
- Regeneration command hints
Templates
Templates auto-inject intro/outro segments around the script content:
| Template | Prepends | Appends |
|---|---|---|
| youtube | templates/youtube_intro.txt | templates/youtube_outro.txt |
| podcast | templates/podcast_intro.txt | — |
| shortform | templates/shortform_hook.txt | — |
Edit the files in templates/ to customize the intro/outro text.
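The injection table above amounts to a prepend/append around the script's own segments, roughly like this (segment shape and function name are illustrative assumptions):

```javascript
// Sketch of template injection: intro/outro segments wrap the script.
function applyTemplate(segments, template) {
  const table = {
    youtube:   { pre: "youtube_intro",  post: "youtube_outro" },
    podcast:   { pre: "podcast_intro",  post: null },
    shortform: { pre: "shortform_hook", post: null },
  };
  const t = table[template];
  if (!t) return segments; // no template: script segments pass through
  const out = [...segments];
  if (t.pre) out.unshift({ title: t.pre, fromTemplate: true });
  if (t.post) out.push({ title: t.post, fromTemplate: true });
  return out;
}
```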
Caching
Segments are cached by a hash of: text content + voice ID + language.
- Unchanged segments are skipped on rebuild — fast iteration
- Modified segments are re-rendered automatically
- Use --force to re-render everything
- Cache manifest is stored in segments/.cache.json
Multilingual support
Voice.ai supports 11 languages. Use --language <code> to switch:
en, es, fr, de, it, pt, pl, ru, nl, sv, ca
The pipeline auto-selects the multilingual TTS model for non-English languages.
Troubleshooting
| Issue | Solution |
|---|---|
| ffmpeg missing | Pipeline still works — you get segments, review page, chapters, captions. Install ffmpeg for master stitching and video muxing. |
| Rate limits (429) | Segments render sequentially, which stays under most limits. Wait and retry. |
| Insufficient credits (402) | Top up at voice.ai/dashboard. Cached segments won't re-use credits on retry. |
| Long scripts | Caching makes rebuilds fast. Text over 490 chars per segment is automatically split across API calls. |
| Windows paths | Wrap paths with spaces in quotes: --input "C:\My Scripts\script.md" |
See references/TROUBLESHOOTING.md for more.
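The automatic split for long segment text (the 490-character note above) can be sketched as chunking at sentence boundaries. This is a hypothetical helper; a single sentence longer than the limit would pass through whole in this simplification:

```javascript
// Sketch of splitting over-long segment text across API calls,
// breaking at sentence boundaries up to maxChars per chunk.
function chunkText(text, maxChars = 490) {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [];
  const chunks = [];
  let current = "";
  for (const s of sentences) {
    if (current && current.length + s.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

console.log(chunkText("One. Two. Three.", 8)); // [ 'One.', 'Two.', 'Three.' ]
```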
References
- Agent Skills Specification
- Voice.ai
- references/VOICEAI_API.md — API endpoints, audio formats, models
- references/TROUBLESHOOTING.md — Common issues and fixes