gemini-tts

npx machina-cli add skill akrindev/google-studio-skills/gemini-tts --openclaw

Gemini Text-to-Speech

Generate natural-sounding speech from text with Gemini's TTS models. The skill wraps them in an executable script that supports multiple voices and multi-speaker conversations.

When to Use This Skill

Use this skill when you need to:

  • Convert text to natural speech
  • Create audio for podcasts, audiobooks, or videos
  • Generate multi-speaker conversations
  • Stream audio for long content
  • Choose from multiple voice options
  • Create accessible audio content
  • Generate voiceovers for presentations
  • Batch convert text to audio files

Available Scripts

scripts/tts.py

Purpose: Convert text to speech using Gemini TTS models

When to use:

  • Any text-to-speech conversion
  • Multi-speaker conversation generation
  • Streaming audio for long texts
  • Voiceovers for content creation
  • Accessible audio generation

Key parameters:

| Parameter | Description | Example |
|---|---|---|
| text | Text to convert (required) | "Hello, world!" |
| --voice, -v | Voice name | Kore |
| --output, -o | Base name for output file | welcome |
| --output-dir | Output directory for audio | audio/ |
| --no-timestamp | Disable auto timestamp | Flag |
| --model, -m | TTS model | gemini-2.5-flash-preview-tts |
| --stream, -s | Enable streaming | Flag |
| --speakers | Multi-speaker mapping | "Joe:Kore,Jane:Puck" |

Output: WAV audio file path
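
The documented flags map naturally onto an argparse interface. The following is a minimal sketch of what such a parser might look like. It is not taken from scripts/tts.py, and the defaults (Kore, audio, tts_output, the flash model) are assumptions inferred from the workflows in this document.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags of scripts/tts.py."""
    parser = argparse.ArgumentParser(description="Convert text to speech (sketch)")
    parser.add_argument("text", help="Text to convert (required)")
    parser.add_argument("--voice", "-v", default="Kore", help="Voice name")
    parser.add_argument("--output", "-o", default="tts_output",
                        help="Base name for output file")
    parser.add_argument("--output-dir", default="audio",
                        help="Output directory for audio")
    parser.add_argument("--no-timestamp", action="store_true",
                        help="Disable auto timestamp")
    parser.add_argument("--model", "-m", default="gemini-2.5-flash-preview-tts",
                        help="TTS model")
    parser.add_argument("--stream", "-s", action="store_true",
                        help="Enable streaming")
    parser.add_argument("--speakers", default=None,
                        help="Multi-speaker mapping such as Joe:Kore,Jane:Puck")
    return parser


# Parse a sample command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["Hello, world!", "-v", "Puck", "--no-timestamp"])
print(args.voice, args.output_dir, args.no_timestamp)  # Puck audio True
```

Flags without a value (--no-timestamp, --stream) are modeled as store_true switches, which matches the "Flag" entries in the table.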

Workflows

Workflow 1: Basic Text-to-Speech

python scripts/tts.py "Hello, world! Have a wonderful day."
  • Best for: Quick audio generation, simple messages
  • Voice: Kore (default, clear and professional)
  • Output: audio/tts_output_YYYYMMDD_HHMMSS.wav (auto timestamp)

Workflow 2: Choose Different Voice

python scripts/tts.py "Welcome to our podcast about technology trends" --voice Puck --output welcome
  • Best for: Friendly, conversational content
  • Voice options: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
  • Output: audio/welcome_YYYYMMDD_HHMMSS.wav

Workflow 3: Multi-Speaker Conversation

python scripts/tts.py "TTS the following conversation:
Joe: How's it going today?
Jane: Not too bad, how about you?
Joe: I'm working on a new project.
Jane: Sounds exciting, tell me more!" --speakers "Joe:Kore,Jane:Puck" --output conversation
  • Best for: Dialogues, interviews, role-playing content
  • Format: Marked conversation with speaker names
  • Script automatically routes text to appropriate voices
  • Output: audio/conversation_YYYYMMDD_HHMMSS.wav
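
To make the --speakers format concrete, here is a small sketch of how a "Speaker:Voice" mapping string could be parsed. The helper name parse_speakers is hypothetical, and the actual script may handle this differently.

```python
def parse_speakers(mapping: str) -> dict[str, str]:
    """Parse 'Joe:Kore,Jane:Puck' into {'Joe': 'Kore', 'Jane': 'Puck'}."""
    result: dict[str, str] = {}
    for pair in mapping.split(","):
        speaker, sep, voice = pair.partition(":")
        speaker, voice = speaker.strip(), voice.strip()
        # Reject entries with no colon or with an empty name on either side.
        if not sep or not speaker or not voice:
            raise ValueError(f"Expected 'Speaker:Voice', got {pair!r}")
        result[speaker] = voice
    return result


print(parse_speakers("Joe:Kore,Jane:Puck"))  # {'Joe': 'Kore', 'Jane': 'Puck'}
```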

Workflow 4: Long Content with Streaming

python scripts/tts.py "This is a very long text that would benefit from streaming..." --stream --output long-form
  • Best for: Podcasts, audiobooks, long articles
  • Streaming: Processes audio in chunks for long texts
  • Output: audio/long-form_YYYYMMDD_HHMMSS.wav

Workflow 5: Professional Voiceover

python scripts/tts.py "Welcome to our quarterly earnings presentation. Today we'll discuss our growth metrics and future plans." --voice Charon --output voiceover
  • Best for: Corporate content, presentations, formal announcements
  • Voice: Charon (deep, authoritative)
  • Use when: Professional, serious tone required

Workflow 6: Custom Output Directory

python scripts/tts.py "Save to specific folder." --output-dir ./my-projects/podcasts/ --output episode1
  • Best for: Organized project structures
  • Directory created automatically if it doesn't exist
  • Output: ./my-projects/podcasts/episode1_YYYYMMDD_HHMMSS.wav

Workflow 7: Content Creation Pipeline (Text → Audio)

# 1. Generate script (gemini-text skill)
python skills/gemini-text/scripts/generate.py "Write a 2-minute podcast intro about sustainable energy"

# 2. Generate audio (this skill)
python scripts/tts.py "[Paste generated script]" --voice Fenrir --output podcast-intro

# 3. Use in video or podcast
  • Best for: Podcasts, audiobooks, video narration
  • Combines with: gemini-text for script generation

Workflow 8: Accessible Content

python scripts/tts.py "Welcome to our accessible website. This audio describes our main navigation options." --voice Aoede --output accessibility
  • Best for: Web accessibility, screen reader alternatives
  • Voice: Aoede (melodic, pleasant)
  • Use when: Making content accessible to visually impaired users

Workflow 9: Educational Content

python scripts/tts.py "Chapter 1: Introduction to Quantum Computing. Let's explore the fundamental principles..." --voice Zephyr --output chapter1
  • Best for: Educational materials, tutorials, e-learning
  • Voice: Zephyr (light, airy)
  • Combines well with: gemini-text for content generation

Workflow 10: Disable Timestamp

python scripts/tts.py "Fixed filename." --output my-audio --no-timestamp
  • Best for: When you want complete control over filename
  • Output: audio/my-audio.wav (no timestamp)
  • Use when: Generating files for specific naming schemes
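
Workflows 1, 6, and 10 together imply a naming rule: output directory, base name, an optional _YYYYMMDD_HHMMSS suffix, and a .wav extension. The sketch below reproduces that rule. The function name and its defaults are assumptions for illustration, not the script's actual code.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_output_path(base: str = "tts_output", output_dir: str = "audio",
                      timestamp: bool = True,
                      now: Optional[datetime] = None) -> Path:
    """Return <output_dir>/<base>[_YYYYMMDD_HHMMSS].wav (hypothetical helper)."""
    name = base
    if timestamp:
        # Accept an explicit time so results are reproducible in examples.
        stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
        name = f"{base}_{stamp}"
    return Path(output_dir) / f"{name}.wav"


fixed = datetime(2025, 1, 2, 3, 4, 5)
print(build_output_path("welcome", now=fixed).as_posix())
# audio/welcome_20250102_030405.wav
print(build_output_path("my-audio", timestamp=False).as_posix())
# audio/my-audio.wav
```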

Parameters Reference

Model Selection

| Model | Quality | Speed | Best For |
|---|---|---|---|
| gemini-2.5-flash-preview-tts | Good | Fast | General use, high volume |
| gemini-2.5-pro-preview-tts | Higher | Slower | Premium content, voiceovers |

Voice Selection

| Voice | Characteristics | Best For |
|---|---|---|
| Kore | Clear, professional | Announcements, general purpose (default) |
| Puck | Friendly, conversational | Casual content, interviews |
| Charon | Deep, authoritative | Corporate, serious content |
| Fenrir | Warm, expressive | Storytelling, narratives |
| Aoede | Melodic, pleasant | Educational, accessibility |
| Zephyr | Light, airy | Gentle content, tutorials |
| Sulafat | Neutral, balanced | Documentaries, factual content |

Audio Format

| Specification | Value |
|---|---|
| Format | WAV (PCM) |
| Sample rate | 24000 Hz |
| Channels | 1 (mono) |
| Bit depth | 16-bit |
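
The format above can be produced with Python's standard wave module. The sketch below writes raw PCM bytes with these exact settings and derives the duration from the byte count. It assumes the audio arrives as raw 16-bit PCM samples, which the table suggests but this document does not state outright; write_wav is a hypothetical helper.

```python
import os
import tempfile
import wave

SAMPLE_RATE = 24000  # Hz
CHANNELS = 1         # mono
SAMPLE_WIDTH = 2     # bytes per sample (16-bit)


def write_wav(path: str, pcm: bytes) -> float:
    """Write raw PCM bytes as a WAV file and return its duration in seconds."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm)
    # Each second of audio occupies rate * channels * width bytes.
    return len(pcm) / (SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH)


# One second of silence: 24000 samples of 2 zero bytes each.
silence = b"\x00\x00" * SAMPLE_RATE
target = os.path.join(tempfile.gettempdir(), "silence_example.wav")
print(write_wav(target, silence))  # 1.0
```

At these settings one second of audio takes 48,000 bytes, so a minute of speech is roughly 2.9 MB before any compression.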

Token Limits

| Limit | Type | Description |
|---|---|---|
| 8,192 | Input | Maximum input text tokens |
| 16,384 | Output | Maximum output audio tokens |

Output Interpretation

Audio File

  • Format: WAV (compatible with most players)
  • Mono channel (single audio track)
  • Sample rate: 24000 Hz (well suited to speech)
  • Can be converted to MP3/AAC if needed

Multi-Speaker Files

  • Single WAV file with multiple voices
  • Voices alternate in sequence within the file, following the order of the dialogue
  • Use --speakers parameter to map speakers to voices

Streaming Output

  • Audio processed in chunks during generation
  • Script shows "Streaming audio..." message
  • Useful for very long texts or real-time applications

Common Issues

"google-genai not installed"

pip install google-genai

"Voice name not found"

  • Check voice name spelling
  • Use available voices: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
  • Voice names are case-sensitive
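
Because voice names are case-sensitive, a small pre-flight check can catch typos before any request is sent. The sketch below validates against the voices listed in this document only (the service may offer others) and suggests a likely match. check_voice is a hypothetical helper, not part of the script.

```python
import difflib

# Voices named in this document; treat the list as an assumption, not a full catalog.
VOICES = ["Kore", "Puck", "Charon", "Fenrir", "Aoede", "Zephyr", "Sulafat"]


def check_voice(name: str) -> str:
    """Return name if it is a listed voice, else raise with a suggestion."""
    if name in VOICES:  # exact, case-sensitive comparison
        return name
    # First try a case-insensitive match, then a fuzzy match for typos.
    by_lower = {voice.lower(): voice for voice in VOICES}
    guess = by_lower.get(name.lower())
    if guess is None:
        close = difflib.get_close_matches(name, VOICES, n=1)
        guess = close[0] if close else None
    hint = f" Did you mean {guess!r}?" if guess else ""
    raise ValueError(f"Voice name not found: {name!r}.{hint}")


print(check_voice("Puck"))  # Puck
```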

"No audio generated"

  • Check text is not empty
  • Verify text doesn't exceed token limit (8,192)
  • Try shorter text segments
  • Check API quota limits

"Multi-speaker format error"

  • Format: SpeakerName:VoiceName,Speaker2:Voice2
  • Separate speakers with commas
  • Use colon between speaker and voice
  • Example: "Joe:Kore,Jane:Puck,Host:Charon"

"Output file already exists"

  • Script will overwrite existing files
  • Change --output filename to avoid conflicts
  • Use unique names for batch generation
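
When timestamps are disabled, one way to avoid silent overwrites is to pick a free filename before generating. This is a sketch of that idea with a hypothetical unique_path helper. The script itself is described as overwriting, so a check like this would live in your own wrapper.

```python
import tempfile
from pathlib import Path


def unique_path(path: Path) -> Path:
    """Return path unchanged if free, else append _1, _2, ... before the suffix."""
    if not path.exists():
        return path
    counter = 1
    while True:
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1


# Demonstrate inside a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "my-audio.wav"
    first.touch()
    print(unique_path(first).name)  # my-audio_1.wav
```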

Audio quality issues

  • Check input text for unusual characters
  • Try different voice for better pronunciation
  • Consider splitting long text into smaller segments
  • Verify audio playback software compatibility

Best Practices

Voice Selection

  • Kore: General purpose, clear articulation
  • Puck: Conversational, engaging tone
  • Charon: Professional, authoritative
  • Fenrir: Emotional, storytelling
  • Aoede: Soft, gentle for accessibility
  • Zephyr: Educational, clear explanations
  • Sulafat: Neutral, balanced for factual content

Text Preparation

  • Use natural language and punctuation
  • Include pauses with commas and periods
  • Spell out difficult words if needed
  • Break very long text into logical segments
  • Add speaker labels for multi-speaker content
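
Breaking long text into logical segments can be automated by packing whole sentences into chunks. The sketch below uses a character budget as a rough stand-in for the token limit. The 2000-character default is an arbitrary assumption, since this document does not say how tokens relate to characters.

```python
import re


def split_text(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack whole sentences into segments of at most max_chars.

    A single sentence longer than max_chars is kept whole as its own segment.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


print(split_text("One. Two! Three? Four.", max_chars=10))
# ['One. Two!', 'Three?', 'Four.']
```

Each segment can then be passed to scripts/tts.py as a separate call, with a distinct --output name per segment.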

Performance Optimization

  • Use streaming for very long texts
  • Generate shorter segments for better control
  • Use flash model for faster generation
  • Batch process multiple files for efficiency
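
Batch processing can be as simple as looping over texts and invoking the script once per item. The sketch below builds the command lines from the documented flags and only executes them when dry_run is off. The helper names and the dictionary layout are assumptions for illustration.

```python
import subprocess
import sys


def tts_command(text: str, output: str, voice: str = "Kore",
                output_dir: str = "audio") -> list[str]:
    """Build the argv list for one scripts/tts.py call using documented flags."""
    return [sys.executable, "scripts/tts.py", text,
            "--voice", voice, "--output", output, "--output-dir", output_dir]


def run_batch(items: dict[str, str], dry_run: bool = True) -> list[list[str]]:
    """Build one command per output-name/text pair and run them unless dry_run."""
    commands = [tts_command(text, name) for name, text in items.items()]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises on the first failing call
    return commands


# Dry run: show the arguments that would follow the interpreter and script path.
for cmd in run_batch({"intro": "Welcome.", "outro": "Goodbye."}):
    print(cmd[2:])
```

Passing arguments as a list rather than a shell string avoids quoting problems when the text contains apostrophes or quotation marks.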

Quality Tips

  • Test different voices for your content type
  • Use appropriate pacing with punctuation
  • Consider context when selecting voice
  • Listen to output before final use
  • Multi-speaker requires clear speaker labeling
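
Clear speaker labeling can be checked mechanically before generating audio. The sketch below assumes one "Name: line" per row, as in Workflow 3, and reports labels that have no voice in the mapping. Both helper names are hypothetical.

```python
import re


def find_speaker_labels(text: str) -> set[str]:
    """Collect names that start a line in the form 'Name: spoken text'."""
    # Require text after the colon on the same line, so an instruction such as
    # 'TTS the following conversation:' is not mistaken for a speaker.
    pattern = r"^[ \t]*([A-Za-z]\w*):[ \t]+\S"
    return set(re.findall(pattern, text, flags=re.MULTILINE))


def unmapped_speakers(text: str, mapping: dict[str, str]) -> set[str]:
    """Return labels used in the text that have no voice assigned."""
    return find_speaker_labels(text) - set(mapping)


script = """TTS the following conversation:
Joe: How's it going today?
Jane: Not too bad, how about you?
Host: Welcome back."""
print(unmapped_speakers(script, {"Joe": "Kore", "Jane": "Puck"}))  # {'Host'}
```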

Use Cases by Voice

| Voice | Ideal Use Cases |
|---|---|
| Kore | Announcements, navigation, general info |
| Puck | Podcasts, interviews, casual content |
| Charon | Corporate, news, formal presentations |
| Fenrir | Audiobooks, stories, emotional content |
| Aoede | Accessibility, educational, gentle content |
| Zephyr | Tutorials, explanations, guides |
| Sulafat | Documentaries, factual presentations |

Related Skills

  • gemini-text: Generate scripts and text for TTS
  • gemini-image: Create visuals to accompany audio
  • gemini-batch: Process multiple TTS requests efficiently
  • gemini-files: Upload audio files for processing

Quick Reference

# Basic
python scripts/tts.py "Your text here"

# Custom voice
python scripts/tts.py "Your text" --voice Puck --output audio

# Multi-speaker
python scripts/tts.py "Joe: Hi. Jane: Hello!" --speakers "Joe:Kore,Jane:Puck"

# Streaming
python scripts/tts.py "Long text..." --stream --output long

# Professional
python scripts/tts.py "Corporate announcement" --voice Charon

Reference

Source

git clone https://github.com/akrindev/google-studio-skills

SKILL.md: https://github.com/akrindev/google-studio-skills/blob/main/skills/gemini-tts/SKILL.md

Overview

Gemini Text-to-Speech generates natural-sounding speech from text using Gemini TTS models through executable scripts. It supports multiple voices, streaming for long content, and multi-speaker conversations, making it ideal for podcasts, videos, voiceovers, and accessible audio.

How This Skill Works

Run the Python script scripts/tts.py with text and optional flags such as --voice, --model, --output, --output-dir, --stream, and --speakers. The script outputs WAV audio files to your specified location, optionally streaming the audio in chunks for long texts and routing different text segments to multiple voices when using --speakers.

When to Use It

  • Convert plain text into natural-sounding speech for podcasts, videos, or audio notes.
  • Create multi-speaker conversations or interviews for dialogues and training materials.
  • Stream audio for long-form content like articles or books to reduce latency.
  • Produce professional voiceovers or presentations with selected voices and tones.
  • Batch convert and organize text-to-audio outputs across projects using output directories.

Quick Start

  1. Install the gemini-tts skill and the google-genai package (pip install google-genai), and make sure Python is configured.
  2. Run a basic TTS example, e.g., python scripts/tts.py "Hello, world!" --voice Kore --output hello
  3. Explore advanced options such as --model, --stream, --speakers, and --output-dir to tailor output and voices.

Best Practices

  • Choose a voice that matches the desired tone, such as Kore for clear narration or Charon for deep, authoritative voice.
  • Use --stream for long content so audio is processed in chunks during generation; the result is still a single WAV file.
  • Leverage --speakers to map characters in multi-speaker dialogues and ensure consistent voice assignments.
  • Explicitly set --output and --output-dir to keep files organized across projects.
  • Test with representative text and adjust the model (--model) for quality and latency trade-offs.

Example Use Cases

  • python scripts/tts.py "Hello, world! Have a wonderful day." --voice Kore --output hello
  • python scripts/tts.py "Welcome to our podcast about technology trends" --voice Puck --output welcome
  • python scripts/tts.py "TTS the following conversation: Joe: How's it going today? Jane: Not too bad, how about you? Joe: I'm working on a new project. Jane: Sounds exciting, tell me more!" --speakers "Joe:Kore,Jane:Puck" --output conversation
  • python scripts/tts.py "This is a very long text that would benefit from streaming..." --stream --output long-form
  • python scripts/tts.py "Welcome to our quarterly earnings presentation. Today we'll discuss our growth metrics and future plans." --voice Charon --output voiceover
