
CosyVoice3 macOS

Verified

@lhuaizhong

npx machina-cli add skill @lhuaizhong/cosyvoice3-macos --openclaw
Files (1)
SKILL.md
6.0 KB

CosyVoice3 TTS

Local text-to-speech using Alibaba's CosyVoice3 on macOS Apple Silicon.

Overview

CosyVoice3 is an advanced TTS system based on large language models, supporting:

  • 9 languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
  • 18+ Chinese dialects: Cantonese, Sichuan, Dongbei, Shanghai, etc.
  • Zero-shot voice cloning: Clone any voice from 3-10 seconds of audio
  • Cross-lingual synthesis: Speak Chinese with English voice or vice versa
  • Fine-grained control: Emotions, speed, volume via text tags

Prerequisites

  • macOS with Apple Silicon (M1/M2/M3)
  • Python 3.10
  • Conda installed
  • ~5GB disk space for models
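The prerequisites above can be sanity-checked from Python before running the installer. This is a convenience sketch, not part of the skill; the `check_environment` helper is hypothetical:

```python
import platform
import shutil
import sys

def check_environment():
    """Report whether this machine meets the skill's prerequisites."""
    checks = {
        "Apple Silicon (arm64)": platform.machine() == "arm64",
        "Python 3.10": sys.version_info[:2] == (3, 10),
        "conda on PATH": shutil.which("conda") is not None,
        "~5GB free disk": shutil.disk_usage("/").free > 5 * 1024**3,
    }
    for name, ok in checks.items():
        print(f"{'OK' if ok else 'MISSING'}  {name}")
    return all(checks.values())

check_environment()
```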

Installation

Run the installation script:

cd /Users/lhz/.openclaw/workspace/skills/cosyvoice3/scripts
bash install.sh

This will:

  1. Create conda environment cosyvoice
  2. Install PyTorch (CPU version for Apple Silicon)
  3. Install CosyVoice dependencies
  4. Download Fun-CosyVoice3-0.5B model (~2GB)

Usage

Quick Start - Basic TTS

Important: CosyVoice3 requires the <|endofprompt|> tag at the end of the reference text!

cd /Users/lhz/.openclaw/workspace/cosyvoice3-repo
export PATH="$HOME/miniconda3/bin:$PATH"
conda activate cosyvoice

python -c "
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio

cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
for i, j in enumerate(cosyvoice.inference_zero_shot(
    '你好,这是CosyVoice3语音合成测试。',
    '希望你以后能够做的比我还好呦。<|endofprompt|>',  # note the required tag!
    'asset/zero_shot_prompt.wav'
)):
    # Index the filename so multi-chunk output is not overwritten
    torchaudio.save(f'output_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
print('Generated: output_0.wav')
"

Using the TTS Script

Generate speech from text:

cd /Users/lhz/.openclaw/workspace/skills/cosyvoice3/scripts
conda activate cosyvoice

# Basic TTS with default voice
python tts.py "你好,这是一个测试。"

# With custom reference audio for voice cloning
python tts.py "你好,这是克隆的声音。" --reference /path/to/reference.wav

# Cross-lingual (English text with Chinese voice)
python tts.py "Hello, this is cross-lingual synthesis." --reference asset/zero_shot_prompt.wav --lang en

# With speed control
python tts.py "这是一段快速的语音。" --speed 1.5

# Save to specific path
python tts.py "你好。" --output ~/Desktop/greeting.wav
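The authoritative interface is defined by tts.py itself; as a sketch of how the flags used above might map to an argument parser (this skeleton is illustrative, not the shipped script), consider:

```python
import argparse

def build_parser():
    """Argument parser mirroring the tts.py flags used in the examples."""
    parser = argparse.ArgumentParser(description="CosyVoice3 TTS")
    parser.add_argument("text", help="Text to synthesize")
    parser.add_argument("--reference", default="asset/zero_shot_prompt.wav",
                        help="Reference audio for voice cloning")
    parser.add_argument("--lang", default="zh", help="Target language code")
    parser.add_argument("--speed", type=float, default=1.0,
                        help="Speech speed multiplier")
    parser.add_argument("--output", default="output.wav",
                        help="Path for the generated WAV file")
    return parser

args = build_parser().parse_args(["你好。", "--speed", "1.5"])
print(args.speed)
```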

Available Assets

Reference audio files in cosyvoice3-repo/asset/:

  • zero_shot_prompt.wav - Default Chinese female voice
  • cross_lingual_prompt.wav - English prompt for cross-lingual

Advanced Features

Voice Cloning

Clone a voice from 3-10 seconds of reference audio:

from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio

cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')

# Clone voice and generate
for i, j in enumerate(cosyvoice.inference_zero_shot(
    '这是克隆后的声音在说话。',
    'Reference text transcription<|endofprompt|>',  # transcript of the reference clip, ending with the required tag
    '/path/to/reference.wav'
)):
    torchaudio.save(f'cloned_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)

Fine-Grained Control

Control prosody with special tags:

# Add laughter
"他突然[laughter]笑了起来[laughter]。"

# Add breathing
"他说完这句话[breath],深吸一口气。"

# Strong emphasis
"这是<strong>非常重要</strong>的。"

# Combined
"在面对挑战时,他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>[breath]。"
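Tags can also be inserted programmatically before synthesis. A small hypothetical helper (not part of the skill) for wrapping target words in <strong> emphasis tags:

```python
def emphasize(text, words):
    """Wrap each target word in <strong> tags for CosyVoice prosody control."""
    for word in words:
        text = text.replace(word, f"<strong>{word}</strong>")
    return text

print(emphasize("在面对挑战时,他展现了非凡的勇气与智慧。", ["勇气", "智慧"]))
```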

Dialect Support

Use instruct mode for dialects:

cosyvoice = AutoModel(model_dir='pretrained_models/CosyVoice-300M-Instruct')

for i, j in enumerate(cosyvoice.inference_instruct(
    '你好,这是测试语音。',
    '中文男',
    '用四川话说这句话<|endofprompt|>'
)):
    torchaudio.save(f'sichuan_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)

Troubleshooting

Model not found

If you get "model not found" errors, download models manually:

cd /Users/lhz/.openclaw/workspace/cosyvoice3-repo
export PATH="$HOME/miniconda3/bin:$PATH"
conda activate cosyvoice

python -c "
from modelscope import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
"

Memory issues

For long text, split into sentences:

text = "很长的文本..."
for sent in text.split('。'):
    if sent.strip():
        ...  # synthesize each sentence separately
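A fuller splitting helper that handles both Chinese and Western sentence-ending punctuation while keeping the delimiters (the synthesis call itself is up to you):

```python
import re

def split_sentences(text):
    """Split text on sentence-ending punctuation, keeping each delimiter."""
    parts = re.split(r"(?<=[。!?.!?])", text)
    return [p.strip() for p in parts if p.strip()]

for sent in split_sentences("第一句。第二句!Third sentence."):
    print(sent)
```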

Audio format

Reference audio requirements:

  • Format: WAV, MP3
  • Sample rate: 16kHz+ (automatically resampled)
  • Duration: 3-10 seconds optimal
  • Content: Clear speech, minimal background noise
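For WAV files, duration and sample rate can be checked with only the standard library before synthesis. This validator is an illustration, not part of the skill (MP3 files would need an external library):

```python
import wave

def check_reference(path):
    """Return (sample_rate, duration_seconds) for a WAV reference clip."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if not 3.0 <= duration <= 10.0:
        print(f"warning: {duration:.1f}s is outside the optimal 3-10s range")
    if rate < 16000:
        print(f"warning: {rate} Hz is below 16 kHz")
    return rate, duration
```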

Resources

Scripts

  • install.sh - Installation script for macOS
  • tts.py - Main TTS script with CLI interface
  • download_models.py - Download pretrained models

References

Model Files

Located in cosyvoice3-repo/pretrained_models/:

  • Fun-CosyVoice3-0.5B/ - Main model (recommended)
  • CosyVoice2-0.5B/ - Previous version
  • CosyVoice-300M/ - Lighter model
  • CosyVoice-300M-SFT/ - SFT version
  • CosyVoice-300M-Instruct/ - Instruct version

Notes

  • First inference takes ~30 seconds (model warmup)
  • Subsequent inferences are faster
  • Apple Silicon uses CPU mode (no CUDA)
  • RTF (real-time factor) ~0.3-0.5 on M-series chips
  • Model files are cached locally after first download
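The RTF figure translates directly into synthesis time: wall-clock time is roughly audio duration times RTF. A trivial estimate, assuming the ~0.3-0.5 range above:

```python
def estimated_synthesis_seconds(audio_seconds, rtf=0.4):
    """Estimate wall-clock synthesis time from audio length and real-time factor."""
    return audio_seconds * rtf

# A 60-second clip at RTF 0.4 takes roughly 24 seconds to synthesize
print(estimated_synthesis_seconds(60))
```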

Source

git clone https://clawhub.ai/lhuaizhong/cosyvoice3-macos

Overview

CosyVoice3 TTS runs locally on Apple Silicon Macs, delivering high-quality voices across 9 languages (including Chinese, English, Japanese, Korean) and 18+ Chinese dialects. It enables zero-shot voice cloning, cross-lingual synthesis, and fine-grained control over emotion, speed, and volume, all offline for privacy and reliability.

How This Skill Works

The install script creates a cosyvoice conda environment and downloads the Fun-CosyVoice3-0.5B model; TTS then runs locally with CPU PyTorch. Text input is processed by an AutoModel, with optional reference audio for cloning and text tags for prosody control. Cross-lingual synthesis lets you mix languages, while cloning mimics a voice from 3-10 seconds of reference audio.

When to Use It

  • You need high-quality local TTS for Chinese/English voices on Apple Silicon, with no cloud dependency.
  • You want to clone a voice from a short reference clip (3-10 seconds).
  • You want natural-sounding speech with emotion control or dialect variants.
  • You require cross-lingual synthesis (a Chinese voice for English text, or vice versa).

Quick Start

  1. Run the install script from /Users/lhz/.openclaw/workspace/skills/cosyvoice3/scripts to create the cosyvoice conda environment and download the model.
  2. Activate the environment (conda activate cosyvoice) and run the basic TTS example above.
  3. For voice cloning or cross-lingual synthesis, supply a 3-10 second reference audio and use the cloning or language options described above.

Best Practices

  • Provide clean 3-10 second reference audio when doing voice cloning.
  • Include the end-of-prompt tag <|endofprompt|> in prompts as shown in examples.
  • Ensure macOS Apple Silicon, Python 3.10, and Conda are installed; allocate about 5GB disk space.
  • Run the provided install script to create the cosyvoice environment and download the model (~2GB).
  • Test across dialects and use prosody tags to tune emotion, speed, and volume.

Example Use Cases

  • An offline macOS voice assistant demo with high-quality Chinese/English voices.
  • A game or app NPC using a cloned voice for a consistent character voice.
  • Dialect-aware narration in Chinese apps (Cantonese, Sichuan, Shanghai, etc.).
  • Cross-lingual storytelling where Chinese voice renders English text.
  • Privacy-preserving language learning tools that run entirely offline.
