
Audio to SRT Converter

npx machina-cli add skill dean9703111/ai-agent-skill-for-video-workflow/audio-to-srt --openclaw

Audio to SRT Converter

This skill provides a Python-based workflow for converting audio files (MP3, WAV, M4A, FLAC, etc.) into SRT subtitle files with automatic speech recognition, customizable text formatting, and timeline optimization.

Purpose

Convert audio files (MP3, WAV, M4A, FLAC, etc.) into properly formatted SRT subtitle files with:

  • Automatic speech recognition and transcription
  • Support for multiple audio formats (MP3, WAV, M4A, FLAC, and more)
  • Customizable character limits per subtitle line (default: 22 characters, minimum: 4 characters)
  • Automatic timeline gap filling (gaps < 0.3s are merged)
  • Environment and dependency validation
  • Output naming convention: origin.srt

When to Use This Skill

Use this skill when:

  • Converting audio files to subtitle format
  • Generating transcriptions with timeline information
  • Creating SRT files for video editing or accessibility
  • Processing Chinese or multilingual audio content

Core Workflow

1. Environment Validation

Before processing, validate:

  • Python 3.7+ is installed
  • Required packages are available (see Dependencies section)
  • Input audio file exists and is readable
  • Output directory is writable
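The checks above could be sketched as a small helper. This is a minimal illustration, not the actual `scripts/check_environment.py`; the function name `check_environment` and its return convention (a list of problem strings, empty on success) are assumptions.

```python
import importlib.util
import os
import shutil
import sys

def check_environment(audio_path, output_dir="."):
    """Hypothetical validation sketch: collect human-readable problems;
    an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 7):
        problems.append("Python 3.7+ is required")
    # Import names for the openai-whisper and pydub packages.
    for pkg in ("whisper", "pydub"):
        if importlib.util.find_spec(pkg) is None:
            problems.append(f"missing Python package: {pkg}")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    if not os.path.isfile(audio_path) or not os.access(audio_path, os.R_OK):
        problems.append(f"input file not readable: {audio_path}")
    if not os.access(output_dir, os.W_OK):
        problems.append(f"output directory not writable: {output_dir}")
    return problems
```

Failing fast with a full list of problems, rather than stopping at the first one, lets the user fix everything in a single pass.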

2. Audio Transcription

Process the audio file using speech recognition:

  • Load audio file (supports MP3, WAV, M4A, FLAC, etc.)
  • Perform speech-to-text conversion
  • Extract timestamps for each segment
  • Handle silence detection and word boundaries
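With openai-whisper as the recognition backend, this step might look like the sketch below. The helper `normalize_segments` and the `transcribe` wrapper are illustrative names, not the script's actual API; `whisper.load_model` and `model.transcribe` are the real library calls.

```python
def normalize_segments(raw_segments):
    """Keep only the fields the later SRT stages need."""
    return [
        {"start": float(s["start"]), "end": float(s["end"]), "text": s["text"].strip()}
        for s in raw_segments
    ]

def transcribe(audio_path, model_name="base"):
    """Run speech-to-text and return timestamped segments (seconds)."""
    import whisper  # deferred import so normalize_segments is testable alone
    model = whisper.load_model(model_name)
    # Whisper decodes MP3/WAV/M4A/FLAC etc. through ffmpeg and performs
    # its own silence/word-boundary handling per segment.
    result = model.transcribe(audio_path)
    return normalize_segments(result["segments"])
```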

3. Text Formatting

Format transcribed text according to parameters:

  • Split text into lines based on character limit
  • Ensure minimum 4 characters per line
  • Respect word boundaries when possible
  • Handle Chinese character counting correctly
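A greedy splitter along these lines covers both cases: space-separated text breaks at word boundaries, while unspaced text such as Chinese breaks at the character limit (each CJK character counting as one). The function name and the merge-short-tail behavior are assumptions for illustration.

```python
def split_text(text, max_chars=22, min_chars=4):
    """Split text into subtitle lines of at most max_chars characters,
    breaking at word boundaries when the text contains spaces."""
    has_spaces = " " in text
    units = text.split() if has_spaces else list(text)
    sep = " " if has_spaces else ""
    lines, current = [], ""
    for unit in units:
        candidate = current + sep + unit if current else unit
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                lines.append(current)
            current = unit
    if current:
        # Fold a too-short trailing line into the previous one
        # rather than emit a line under min_chars.
        if lines and len(current) < min_chars:
            lines[-1] = lines[-1] + sep + current
        else:
            lines.append(current)
    return lines
```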

4. Timeline Optimization

Adjust subtitle timing:

  • Identify gaps between subtitle segments
  • Merge segments when gap < 0.3 seconds
  • Extend previous subtitle end time to next subtitle start time
  • Maintain synchronization with audio
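The gap-filling rule reduces to a single pass over consecutive segments. A minimal sketch, assuming segments are dicts with `start`/`end` times in seconds (the function name `fill_gaps` is hypothetical):

```python
def fill_gaps(segments, max_gap=0.3):
    """Extend each subtitle's end time to the next subtitle's start
    whenever the silence between them is shorter than max_gap seconds."""
    out = [dict(s) for s in segments]  # avoid mutating the caller's data
    for prev, nxt in zip(out, out[1:]):
        gap = nxt["start"] - prev["end"]
        if 0 < gap < max_gap:
            prev["end"] = nxt["start"]
    return out
```

Because only end times are extended (never start times moved earlier), the subtitles stay synchronized with the audio while avoiding brief flickers between cues.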

5. SRT Generation

Create final SRT file:

  • Format according to SRT specification
  • Number subtitles sequentially
  • Use proper timestamp format (HH:MM:SS,mmm)
  • Save as origin.srt
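The SRT output stage amounts to formatting each timestamp as HH:MM:SS,mmm and numbering the cues. A sketch (function names are illustrative, not the script's actual API):

```python
def format_timestamp(seconds):
    """Render a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, path="origin.srt"):
    """Write sequentially numbered SRT cues separated by blank lines."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{format_timestamp(seg['start'])} --> "
            f"{format_timestamp(seg['end'])}\n{seg['text']}\n"
        )
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(blocks))
```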

Using the Conversion Script

The main conversion script is located at scripts/audio_to_srt.py.

Basic Usage

python scripts/audio_to_srt.py <audio_file> [--max-chars MAX_CHARS]

Parameters

  • audio_file (required): Path to the input audio file (MP3, WAV, M4A, FLAC, etc.)
  • --max-chars (optional): Maximum characters per subtitle line (default: 22, minimum: 4)

Examples

See examples/usage_example.sh for complete usage examples.

Dependencies

The script requires the following Python packages:

  • openai-whisper - For speech recognition
  • pydub - For audio processing
  • ffmpeg - System dependency for audio handling

Install with:

pip install openai-whisper pydub
brew install ffmpeg  # macOS

Output Format

The generated SRT file follows this format:

1
00:00:00,000 --> 00:00:03,500
這是第一行字幕

2
00:00:03,500 --> 00:00:07,200
這是第二行字幕

Additional Resources

Scripts

  • scripts/audio_to_srt.py - Main conversion script with environment validation
  • scripts/check_environment.py - Standalone environment checker

Examples

  • examples/usage_example.sh - Complete usage examples with different parameters

Source

git clone https://github.com/dean9703111/ai-agent-skill-for-video-workflow

The skill definition lives at .agent/skills/audio-to-srt/SKILL.md in the repository.

Overview

This Python-based workflow converts audio files (MP3, WAV, M4A, FLAC, etc.) into properly formatted SRT subtitles using automatic speech recognition. It supports customizable character limits per subtitle line (default 22, min 4), timeline gap filling for smooth playback, and saves the result as origin.srt after validating the environment and dependencies.

How This Skill Works

It loads the input audio file, runs speech-to-text (using a backend such as openai-whisper), and timestamps each segment. It then splits the text into lines based on the max-chars setting, respecting word boundaries and counting Chinese characters correctly, merges timeline gaps under 0.3 seconds, and writes a standards-compliant SRT file named origin.srt.

When to Use It

  • You need subtitles for an audio-only file to publish on YouTube or other video platforms.
  • You want a transcription with accurate timing for use in video editing or captioning workflows.
  • You’re working with multilingual audio, including Chinese, and need proper character counting.
  • You need to customize subtitle line length to improve readability across devices.
  • You’re preparing accessible or archival SRT files from audio recordings.

Quick Start

  1. Place your audio file (MP3, WAV, M4A, FLAC, etc.) at an accessible path.
  2. Run the converter: python scripts/audio_to_srt.py <audio_file> [--max-chars MAX_CHARS].
  3. Check the generated origin.srt in the output directory.

Best Practices

  • Start with high-quality audio to improve transcription accuracy.
  • Set --max-chars to at least 4 (default: 22) to balance readability and timing.
  • Ensure Python 3.7+ and dependencies (openai-whisper, pydub, ffmpeg) are installed.
  • Review the merged-timing results and adjust if necessary after playback.
  • Test with multilingual content to confirm language handling and character counting.

Example Use Cases

  • Podcast episode converted to YouTube-ready subtitles.
  • Lecture recording transcribed to SRT with Chinese content.
  • Multilingual interview audio turned into synchronized subtitles.
  • Video tutorial subtitled for accessibility with controlled line length.
  • Film dialogue captured as SRT for video editing workflow.

