How do I choose a voice?

Choose from Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat via the --voice flag. Preview samples to match tone.

Can I stream long texts?

Yes. Enable streaming with --stream to process audio in chunks for long articles, podcasts, or books.

Where are the generated files saved?

Outputs are WAV files saved to the directory you specify with --output-dir, or to the default audio/ path (e.g., audio/tts_output_YYYYMMDD_HHMMSS.wav). Use --output to set the base filename and --output-dir to organize files by project.

gemini-tts

npx machina-cli add skill akrindev/google-studio-skills/gemini-tts --openclaw

Gemini Text-to-Speech

Generate natural-sounding speech from text using Gemini's TTS models through executable scripts with support for multiple voices and multi-speaker conversations.

When to Use This Skill

Use this skill when you need to:

Convert text to natural speech
Create audio for podcasts, audiobooks, or videos
Generate multi-speaker conversations
Stream audio for long content
Choose from multiple voice options
Create accessible audio content
Generate voiceovers for presentations
Batch convert text to audio files

Available Scripts

scripts/tts.py

Purpose: Convert text to speech using Gemini TTS models

When to use:

Any text-to-speech conversion
Multi-speaker conversation generation
Streaming audio for long texts
Voiceovers for content creation
Accessible audio generation

Key parameters:

Parameter	Description	Example
`text`	Text to convert (required)	`"Hello, world!"`
`--voice`, `-v`	Voice name	`Kore`
`--output`, `-o`	Base name for output file	`welcome`
`--output-dir`	Output directory for audio	`audio/`
`--no-timestamp`	Disable auto timestamp	Flag
`--model`, `-m`	TTS model	`gemini-2.5-flash-preview-tts`
`--stream`, `-s`	Enable streaming	Flag
`--speakers`	Multi-speaker mapping	`"Joe:Kore,Jane:Puck"`

Output: WAV audio file path
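
The parameter table above can be read as a command-line interface. The sketch below shows how such an interface might be declared with argparse. It is an illustration only and is not the source of scripts/tts.py; the `build_parser` name is hypothetical, and the defaults are taken from the workflows in this document (Kore, audio/, tts_output), except the model default, which is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the real script)."""
    parser = argparse.ArgumentParser(prog="tts.py")
    parser.add_argument("text", help="Text to convert (required)")
    parser.add_argument("--voice", "-v", default="Kore", help="Voice name")
    parser.add_argument("--output", "-o", default="tts_output",
                        help="Base name for output file")
    parser.add_argument("--output-dir", default="audio/",
                        help="Output directory for audio")
    parser.add_argument("--no-timestamp", action="store_true",
                        help="Disable auto timestamp")
    # Assumed default; the document lists this model first but does not
    # state which one the script uses when --model is omitted.
    parser.add_argument("--model", "-m", default="gemini-2.5-flash-preview-tts",
                        help="TTS model")
    parser.add_argument("--stream", "-s", action="store_true",
                        help="Enable streaming")
    parser.add_argument("--speakers", default=None,
                        help='Multi-speaker mapping, e.g. "Joe:Kore,Jane:Puck"')
    return parser


# Parse an example command line without touching sys.argv.
args = build_parser().parse_args(
    ["Hello, world!", "-v", "Puck", "--output", "welcome", "--stream"]
)
print(args.voice, args.output, args.output_dir, args.stream, args.no_timestamp)
```

Note that argparse exposes `--output-dir` as `args.output_dir` and `--no-timestamp` as `args.no_timestamp`.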

Workflows

Workflow 1: Basic Text-to-Speech

python scripts/tts.py "Hello, world! Have a wonderful day."

Best for: Quick audio generation, simple messages
Voice: Kore (default, clear and professional)
Output: audio/tts_output_YYYYMMDD_HHMMSS.wav (auto timestamp)

Workflow 2: Choose Different Voice

python scripts/tts.py "Welcome to our podcast about technology trends" --voice Puck --output welcome

Best for: Friendly, conversational content
Voice options: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
Output: audio/welcome_YYYYMMDD_HHMMSS.wav

Workflow 3: Multi-Speaker Conversation

python scripts/tts.py "TTS the following conversation:
Joe: How's it going today?
Jane: Not too bad, how about you?
Joe: I'm working on a new project.
Jane: Sounds exciting, tell me more!" --speakers "Joe:Kore,Jane:Puck" --output conversation

Best for: Dialogues, interviews, role-playing content
Format: Marked conversation with speaker names
Script automatically routes text to appropriate voices
Output: audio/conversation_YYYYMMDD_HHMMSS.wav

Workflow 4: Long Content with Streaming

python scripts/tts.py "This is a very long text that would benefit from streaming..." --stream --output long-form

Best for: Podcasts, audiobooks, long articles
Streaming: Processes audio in chunks for long texts
Output: audio/long-form_YYYYMMDD_HHMMSS.wav

Workflow 5: Professional Voiceover

python scripts/tts.py "Welcome to our quarterly earnings presentation. Today we'll discuss our growth metrics and future plans." --voice Charon --output voiceover

Best for: Corporate content, presentations, formal announcements
Voice: Charon (deep, authoritative)
Use when: Professional, serious tone required

Workflow 6: Custom Output Directory

python scripts/tts.py "Save to specific folder." --output-dir ./my-projects/podcasts/ --output episode1

Best for: Organized project structures
Directory created automatically if it doesn't exist
Output: ./my-projects/podcasts/episode1_YYYYMMDD_HHMMSS.wav

Workflow 7: Content Creation Pipeline (Text → Audio)

# 1. Generate script (gemini-text skill)
python skills/gemini-text/scripts/generate.py "Write a 2-minute podcast intro about sustainable energy"

# 2. Generate audio (this skill)
python scripts/tts.py "[Paste generated script]" --voice Fenrir --output podcast-intro

# 3. Use in video or podcast

Best for: Podcasts, audiobooks, video narration
Combines with: gemini-text for script generation

Workflow 8: Accessible Content

python scripts/tts.py "Welcome to our accessible website. This audio describes our main navigation options." --voice Aoede --output accessibility

Best for: Web accessibility, screen reader alternatives
Voice: Aoede (melodic, pleasant)
Use when: Making content accessible to visually impaired users

Workflow 9: Educational Content

python scripts/tts.py "Chapter 1: Introduction to Quantum Computing. Let's explore the fundamental principles..." --voice Zephyr --output chapter1

Best for: Educational materials, tutorials, e-learning
Voice: Zephyr (light, airy)
Combines well with: gemini-text for content generation

Workflow 10: Disable Timestamp

python scripts/tts.py "Fixed filename." --output my-audio --no-timestamp

Best for: When you want complete control over filename
Output: audio/my-audio.wav (no timestamp)
Use when: Generating files for specific naming schemes
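
The naming pattern used across the workflows (base name, optional timestamp, .wav extension, inside the output directory) can be sketched as follows. The `build_output_path` helper is hypothetical and only reproduces the documented pattern; the script's own implementation may differ.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_output_path(base: str = "tts_output", output_dir: str = "audio",
                      timestamp: bool = True,
                      now: Optional[datetime] = None) -> Path:
    """Return <output_dir>/<base>[_YYYYMMDD_HHMMSS].wav (hypothetical helper)."""
    name = base
    if timestamp:
        # Same YYYYMMDD_HHMMSS layout as shown in the workflow outputs.
        stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
        name = f"{base}_{stamp}"
    return Path(output_dir) / f"{name}.wav"


fixed = datetime(2025, 1, 31, 9, 5, 0)  # fixed time so the output is repeatable
print(build_output_path("welcome", now=fixed).as_posix())
# audio/welcome_20250131_090500.wav
print(build_output_path("my-audio", timestamp=False).as_posix())
# audio/my-audio.wav
```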

Parameters Reference

Model Selection

Model	Quality	Speed	Best For
`gemini-2.5-flash-preview-tts`	Good	Fast	General use, high volume
`gemini-2.5-pro-preview-tts`	Higher	Slower	Premium content, voiceovers

Voice Selection

Voice	Characteristics	Best For
Kore	Clear, professional	Announcements, general purpose (default)
Puck	Friendly, conversational	Casual content, interviews
Charon	Deep, authoritative	Corporate, serious content
Fenrir	Warm, expressive	Storytelling, narratives
Aoede	Melodic, pleasant	Educational, accessibility
Zephyr	Light, airy	Gentle content, tutorials
Sulafat	Neutral, balanced	Documentaries, factual content

Audio Format

Specification	Value
Format	WAV (PCM)
Sample rate	24000 Hz
Channels	1 (mono)
Bit depth	16-bit
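
To make the format table concrete, here is a minimal sketch that wraps raw PCM samples in a WAV container with exactly these settings, using Python's standard wave module. The helper names `pcm_to_wav_bytes` and `duration_seconds` are hypothetical, and the input is one second of silence rather than real model output.

```python
import io
import wave

SAMPLE_RATE = 24000  # Hz
CHANNELS = 1         # mono
SAMPLE_WIDTH = 2     # bytes per sample, i.e. 16-bit


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV header and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm)
    return buf.getvalue()


def duration_seconds(pcm: bytes) -> float:
    """Playback length of raw PCM at the format above."""
    return len(pcm) / (SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH)


silence = b"\x00\x00" * SAMPLE_RATE  # one second of 16-bit silence
wav_bytes = pcm_to_wav_bytes(silence)
print(len(wav_bytes), duration_seconds(silence))
```

At this format, one second of audio takes 48,000 bytes of PCM, which is a handy figure for estimating file sizes.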

Token Limits

Limit	Type	Description
8,192	Input	Maximum input text tokens
16,384	Output	Maximum output audio tokens

Output Interpretation

Audio File

Format: WAV (compatible with most players)
Mono channel (single audio track)
Sample rate: 24000 Hz (well suited to speech; lower than the 44.1/48 kHz typical of music and broadcast audio)
Can be converted to MP3/AAC if needed

Multi-Speaker Files

Single WAV file with multiple voices
Speakers take turns within the one mono file; voices are not on separate tracks
Use --speakers parameter to map speakers to voices

Streaming Output

Audio processed in chunks during generation
Script shows "Streaming audio..." message
Useful for very long texts or real-time applications
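
Chunked processing of this kind can be sketched as follows: audio chunks are appended to a single WAV file as they arrive. This is an assumption about the general technique, not the script's actual code. `write_chunks_to_wav` and `fake_stream` are hypothetical, and the fake generator stands in for whatever the API returns when streaming.

```python
import os
import tempfile
import wave
from typing import Iterable, Iterator


def write_chunks_to_wav(chunks: Iterable[bytes], path: str,
                        sample_rate: int = 24000) -> int:
    """Append 16-bit mono PCM chunks to one WAV file; return frames written."""
    frames = 0
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        for chunk in chunks:
            if not chunk:
                continue  # skip empty chunks (e.g. keep-alive messages)
            wf.writeframes(chunk)
            frames += len(chunk) // 2  # 2 bytes per 16-bit mono frame
    return frames


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a streaming response: three chunks of silence."""
    for _ in range(3):
        yield b"\x00\x00" * 8000


path = os.path.join(tempfile.mkdtemp(), "long-form.wav")
print("Streaming audio...")
print(write_chunks_to_wav(fake_stream(), path), "frames written to", path)
```

Because each chunk is written as soon as it is received, the full audio never has to be held in memory at once, yet the result is still one file.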

Common Issues

"google-genai not installed"

pip install google-genai

"Voice name not found"

Check voice name spelling
Use available voices: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
Voice names are case-sensitive
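
Since voice names are case-sensitive, a small pre-flight check can catch typos before any request is sent. The sketch below is illustrative. `check_voice` is a hypothetical helper, and the voice list is the one given in this document, which may not be the full set the API accepts.

```python
import difflib

# Voices listed in this document.
VOICES = ("Kore", "Puck", "Charon", "Fenrir", "Aoede", "Zephyr", "Sulafat")


def check_voice(name: str) -> str:
    """Return name if it is an exact (case-sensitive) match, else raise."""
    if name in VOICES:
        return name
    # Case-insensitive and fuzzy lookups are used only to build a hint.
    by_lower = {voice.lower(): voice for voice in VOICES}
    hint = by_lower.get(name.lower())
    if hint is None:
        close = difflib.get_close_matches(name, VOICES, n=1, cutoff=0.6)
        hint = close[0] if close else None
    message = f"Voice name not found: {name!r}."
    if hint:
        message += f" Did you mean {hint!r}?"
    raise ValueError(message)


print(check_voice("Kore"))
try:
    check_voice("kore")
except ValueError as exc:
    print(exc)
```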

"No audio generated"

Check text is not empty
Verify text doesn't exceed token limit (8,192)
Try shorter text segments
Check API quota limits

"Multi-speaker format error"

Format: SpeakerName:VoiceName,Speaker2:Voice2
Separate speakers with commas
Use colon between speaker and voice
Example: "Joe:Kore,Jane:Puck,Host:Charon"
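
The mapping format above is simple enough to validate before running the script. The following is a minimal sketch of such a parser. `parse_speakers` is a hypothetical helper and not necessarily how scripts/tts.py reads the flag.

```python
from typing import Dict


def parse_speakers(spec: str) -> Dict[str, str]:
    """Parse "Speaker:Voice,Speaker2:Voice2" into {speaker: voice}."""
    mapping: Dict[str, str] = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma
        speaker, sep, voice = pair.partition(":")
        speaker, voice = speaker.strip(), voice.strip()
        if not sep or not speaker or not voice:
            raise ValueError(
                f"Multi-speaker format error in {pair!r}; "
                "expected SpeakerName:VoiceName"
            )
        mapping[speaker] = voice
    if not mapping:
        raise ValueError("Multi-speaker format error: no speakers given")
    return mapping


print(parse_speakers("Joe:Kore,Jane:Puck,Host:Charon"))
# {'Joe': 'Kore', 'Jane': 'Puck', 'Host': 'Charon'}
```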

"Output file already exists"

Script will overwrite existing files
Change --output filename to avoid conflicts
Use unique names for batch generation

Audio quality issues

Check input text for unusual characters
Try different voice for better pronunciation
Consider splitting long text into smaller segments
Verify audio playback software compatibility

Best Practices

Voice Selection

Kore: General purpose, clear articulation
Puck: Conversational, engaging tone
Charon: Professional, authoritative
Fenrir: Emotional, storytelling
Aoede: Soft, gentle for accessibility
Zephyr: Educational, clear explanations
Sulafat: Neutral, balanced for factual content

Text Preparation

Use natural language and punctuation
Include pauses with commas and periods
Spell out difficult words if needed
Break very long text into logical segments
Add speaker labels for multi-speaker content
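
Breaking long text into logical segments can be automated by splitting at sentence boundaries. In the sketch below, `split_text` is a hypothetical helper, and `max_chars` is a character budget chosen for illustration. It is only a rough stand-in for the token limit, which is counted in tokens rather than characters.

```python
import re
from typing import List


def split_text(text: str, max_chars: int = 2000) -> List[str]:
    """Group whole sentences into segments of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own
    segment rather than being cut mid-sentence.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


print(split_text("One. Two! Three? Four.", max_chars=10))
# ['One. Two!', 'Three?', 'Four.']
```

Each segment can then be passed to scripts/tts.py as a separate call, with a numbered --output base name to keep the parts in order.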

Performance Optimization

Use streaming for very long texts
Generate shorter segments for better control
Use flash model for faster generation
Batch process multiple files for efficiency
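
Batch processing can be as simple as building one command per text and running them in a loop. The sketch below uses only flags documented above. The helper names `build_command` and `run_batch` are hypothetical, and `dry_run` defaults to True so that nothing is executed unless requested.

```python
import subprocess
import sys
from typing import Dict, List


def build_command(text: str, output: str, voice: str = "Kore",
                  output_dir: str = "audio/",
                  script: str = "scripts/tts.py") -> List[str]:
    """Build the argument list for one TTS call (no shell quoting needed)."""
    return [sys.executable, script, text,
            "--voice", voice,
            "--output", output,
            "--output-dir", output_dir]


def run_batch(items: Dict[str, str], dry_run: bool = True,
              **kwargs: str) -> List[List[str]]:
    """items maps output base name -> text. Returns the commands built."""
    commands = [build_command(text, name, **kwargs)
                for name, text in items.items()]
    if not dry_run:
        for cmd in commands:
            # check=True stops the batch on the first failing call.
            subprocess.run(cmd, check=True)
    return commands


episodes = {"intro": "Welcome to the show.", "outro": "Thanks for listening."}
for cmd in run_batch(episodes, voice="Fenrir", output_dir="podcasts/"):
    print(cmd[1:])
```

Passing the arguments as a list avoids shell-quoting problems with apostrophes and quotation marks in the text.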

Quality Tips

Test different voices for your content type
Use appropriate pacing with punctuation
Consider context when selecting voice
Listen to output before final use
Multi-speaker requires clear speaker labeling

Use Cases by Voice

Voice	Ideal Use Cases
Kore	Announcements, navigation, general info
Puck	Podcasts, interviews, casual content
Charon	Corporate, news, formal presentations
Fenrir	Audiobooks, stories, emotional content
Aoede	Accessibility, educational, gentle content
Zephyr	Tutorials, explanations, guides
Sulafat	Documentaries, factual presentations

Related Skills

gemini-text: Generate scripts and text for TTS
gemini-image: Create visuals to accompany audio
gemini-batch: Process multiple TTS requests efficiently
gemini-files: Upload audio files for processing

Quick Reference

# Basic
python scripts/tts.py "Your text here"

# Custom voice
python scripts/tts.py "Your text" --voice Puck --output audio

# Multi-speaker
python scripts/tts.py "Joe: Hi. Jane: Hello!" --speakers "Joe:Kore,Jane:Puck"

# Streaming
python scripts/tts.py "Long text..." --stream --output long

# Professional
python scripts/tts.py "Corporate announcement" --voice Charon

Reference

See references/voices.md for complete voice documentation
Get API key: https://aistudio.google.com/apikey
Documentation: https://ai.google.dev/gemini-api/docs/text-to-speech
Sample rate: 24000 Hz (sufficient for most speech applications)

Source

git clone https://github.com/akrindev/google-studio-skills
SKILL.md: https://github.com/akrindev/google-studio-skills/blob/main/skills/gemini-tts/SKILL.md

Overview

Gemini Text-to-Speech generates natural-sounding speech from text using Gemini TTS models through executable scripts. It supports multiple voices, streaming for long content, and multi-speaker conversations, making it ideal for podcasts, videos, voiceovers, and accessible audio.

How This Skill Works

Run the Python script scripts/tts.py with text and optional flags such as --voice, --model, --output, --output-dir, --stream, and --speakers. The script outputs WAV audio files to your specified location, optionally streaming the audio in chunks for long texts and routing different text segments to multiple voices when using --speakers.

When to Use It

Convert plain text into natural-sounding speech for podcasts, videos, or audio notes.
Create multi-speaker conversations or interviews for dialogues and training materials.
Stream audio for long-form content like articles or books to reduce latency.
Produce professional voiceovers or presentations with selected voices and tones.
Batch convert and organize text-to-audio outputs across projects using output directories.

Quick Start

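Step 1: Install the gemini-tts skill, make sure Python is available, and install the client library with pip install google-genai. You also need an API key (https://aistudio.google.com/apikey).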
Step 2: Run a basic TTS example, e.g., python scripts/tts.py "Hello, world!" --voice Kore --output hello
Step 3: Explore advanced options like --model, --stream, --speakers, and --output-dir to tailor output and voices.

Best Practices

Choose a voice that matches the desired tone, such as Kore for clear narration or Charon for deep, authoritative voice.
Use --stream for long content so audio is processed in chunks as it is generated; the result is still a single WAV file.
Leverage --speakers to map characters in multi-speaker dialogues and ensure consistent voice assignments.
Explicitly set --output and --output-dir to keep files organized across projects.
Test with representative text and adjust the model (--model) for quality and latency trade-offs.

Example Use Cases

python scripts/tts.py "Hello, world! Have a wonderful day." --voice Kore --output hello
python scripts/tts.py "Welcome to our podcast about technology trends" --voice Puck --output welcome
python scripts/tts.py "TTS the following conversation: Joe: How's it going today? Jane: Not too bad, how about you? Joe: I'm working on a new project. Jane: Sounds exciting, tell me more!" --speakers "Joe:Kore,Jane:Puck" --output conversation
python scripts/tts.py "This is a very long text that would benefit from streaming..." --stream --output long-form
python scripts/tts.py "Welcome to our quarterly earnings presentation. Today we'll discuss our growth metrics and future plans." --voice Charon --output voiceover

Frequently Asked Questions
