How does Walkie-Talkie handle a spoken message?

It transcribes the incoming audio with tools/transcribe_voice.sh, processes the text as a normal prompt, then generates a local TTS reply with bin/sherpa-onnx-tts and sends it as a voice note.

What tools does this mode rely on?

All processing is local: ffmpeg, whisper-cpp, sherpa-onnx-tts, and the transcription script tools/transcribe_voice.sh.

How can I manually send a reply?

Follow the internal manual steps: run bin/sherpa-onnx-tts /tmp/reply.ogg 'Your message here' and send /tmp/reply.ogg via the message tool with filePath.

Walkie-Talkie Mode

@rubenfb23

npx machina-cli add skill @rubenfb23/walkie-talkie --openclaw

This skill automates the voice-to-voice loop on WhatsApp using local transcription and local TTS.

Workflow

Incoming Audio: When a user sends an audio file (OGG/Opus):
- Use tools/transcribe_voice.sh to get the text.
- Process the text as a normal user prompt.
Outgoing Response:
- Generate speech for the reply using bin/sherpa-onnx-tts, rather than replying with text alone.
- Send the resulting .ogg file back to the user as a voice note.
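
One turn of this workflow might be sketched as the shell function below. This is only an illustration under assumptions: it supposes that tools/transcribe_voice.sh takes the audio path as its argument and prints the transcript on stdout (the skill does not say), and `generate_reply` is a hypothetical placeholder for whatever handles the prompt. The argument order for bin/sherpa-onnx-tts follows the manual command given in the skill.

```shell
#!/usr/bin/env bash
# Sketch of one voice-to-voice turn. Tool paths come from the skill; the way
# the transcription script is invoked is an assumption.
TRANSCRIBE="${TRANSCRIBE:-tools/transcribe_voice.sh}"
TTS="${TTS:-bin/sherpa-onnx-tts}"

# Hypothetical stand-in for "process the text as a normal user prompt".
# Replace the body with whatever produces the agent's reply.
generate_reply() {
  printf 'You said: %s' "$1"
}

# walkie_turn INPUT_AUDIO OUTPUT_OGG
# Prints the reply text on stdout and writes the spoken reply to OUTPUT_OGG.
walkie_turn() {
  local in_audio="$1" out_ogg="$2" transcript reply
  # Assumption: the script takes the audio path and prints the transcript.
  transcript="$("$TRANSCRIBE" "$in_audio")" || return 1
  [ -n "$transcript" ] || { echo "empty transcript" >&2; return 1; }
  reply="$(generate_reply "$transcript")" || return 1
  # Output path first, then text, as in the skill's manual command.
  "$TTS" "$out_ogg" "$reply" || return 1
  printf '%s\n' "$reply"
}
```

Because the function prints the reply text and also writes the .ogg file, the caller has both pieces on hand for the text-plus-audio reply the skill asks for.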

Triggers

User sends an audio message.
User says "activa modo walkie-talkie" ("activate walkie-talkie mode") or "hablemos por voz" ("let's talk by voice").
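
As a rough illustration of these two triggers, the check below accepts either an audio MIME type or one of the trigger phrases. The function name and its two arguments (MIME type, message text) are hypothetical; the skill does not describe how incoming messages are exposed to it.

```shell
# is_walkie_trigger MIME_TYPE MESSAGE_TEXT
# Succeeds when the message should switch the conversation to voice mode.
is_walkie_trigger() {
  local mime="$1" text
  # Any audio message activates the mode.
  case "$mime" in
    audio/*) return 0 ;;
  esac
  # Lowercase the text so the phrase match ignores case.
  text="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
  case "$text" in
    *"activa modo walkie-talkie"*|*"hablemos por voz"*) return 0 ;;
  esac
  return 1
}
```

A caller could then branch on it, for example `is_walkie_trigger "audio/ogg" "" && echo "voice mode"`.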

Constraints

Use local tools only (ffmpeg, whisper-cpp, sherpa-onnx-tts).
Maintain a fast response time: real-time factor (RTF) below 0.5, meaning processing takes less than half the duration of the audio.
Always reply with BOTH text (for clarity) and audio.
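
The RTF constraint can be checked with a little arithmetic: RTF is processing time divided by audio duration. The helpers below are a minimal sketch; the function names are made up for illustration, and the skill does not say which stage (transcription, synthesis, or both) the limit is measured on.

```shell
# rtf ELAPSED_SECONDS AUDIO_SECONDS
# Prints processing time divided by audio duration, to three decimals.
rtf() {
  awk -v e="$1" -v d="$2" 'BEGIN { if (d <= 0) exit 1; printf "%.3f\n", e / d }'
}

# rtf_ok ELAPSED_SECONDS AUDIO_SECONDS [LIMIT]
# Succeeds when the RTF is below LIMIT (default 0.5, the skill's constraint).
rtf_ok() {
  awk -v e="$1" -v d="$2" -v lim="${3:-0.5}" \
    'BEGIN { exit !(d > 0 && e / d < lim) }'
}
```

For example, 1.2 seconds of processing for a 4-second clip gives `rtf 1.2 4.0` = 0.300, which passes `rtf_ok`. The audio duration could be read with ffprobe, which ships with ffmpeg: `ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 /tmp/reply.ogg`.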

Manual Execution (Internal)

To respond with voice manually:

bin/sherpa-onnx-tts /tmp/reply.ogg "Your message here"

Then send /tmp/reply.ogg via the message tool, passing the path as filePath.
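
If several replies may be generated at once, a fixed /tmp/reply.ogg path can be overwritten between synthesis and sending. The wrapper below is one possible way around that, not part of the skill: `speak_to_file` is a hypothetical helper that writes each reply into its own temporary directory and prints the resulting path, which can then be passed to the message tool as filePath.

```shell
TTS="${TTS:-bin/sherpa-onnx-tts}"

# speak_to_file TEXT
# Prints the path of the generated .ogg on success.
speak_to_file() {
  local text="$1" dir out
  [ -n "$text" ] || { echo "no text to speak" >&2; return 1; }
  # A fresh directory per reply avoids clobbering a shared /tmp/reply.ogg.
  dir="$(mktemp -d)" || return 1
  out="$dir/reply.ogg"
  # Same argument order as the manual command: output path, then text.
  "$TTS" "$out" "$text" >/dev/null || return 1
  # Treat a missing or empty file as a failed synthesis.
  [ -s "$out" ] || { echo "TTS produced no audio" >&2; return 1; }
  printf '%s\n' "$out"
}
```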

Source

git clone https://clawhub.ai/rubenfb23/walkie-talkie

Overview

Walkie-Talkie Mode automates the voice-to-voice loop on WhatsApp by transcribing incoming audio locally and replying with synthesized speech. It enables users to talk instead of typing, using on-device processing to preserve privacy and deliver quick responses.

How This Skill Works

Incoming audio is transcribed locally with tools/transcribe_voice.sh, converting speech to text. The text is treated as a normal user prompt, and the reply is then synthesized with bin/sherpa-onnx-tts and sent back as an .ogg voice note. The workflow relies entirely on local tools (ffmpeg, whisper-cpp, sherpa-onnx-tts), which avoids network round trips and helps keep latency low.
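
Since everything depends on those local tools being installed, a quick pre-flight check can save a failed turn. The helper below is a sketch with a hypothetical name; it takes the tool names or paths as arguments because the exact binary name for whisper-cpp varies between installs and is not given here.

```shell
# missing_tools NAME_OR_PATH...
# Prints each tool that cannot be found; fails if any is missing.
missing_tools() {
  local tool missing=0
  for tool in "$@"; do
    # command -v resolves names on PATH; -x covers relative paths such as bin/...
    if ! command -v "$tool" >/dev/null 2>&1 && [ ! -x "$tool" ]; then
      printf '%s\n' "$tool"
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call from the skill's directory might be `missing_tools ffmpeg bin/sherpa-onnx-tts tools/transcribe_voice.sh`.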

When to Use It

When you’d rather speak than type in WhatsApp.
When you receive voice notes and want a quick, natural reply.
When you need low latency (real-time factor, RTF, below 0.5) from transcription to reply.
When privacy matters and you prefer on-device processing (no cloud).
When you want a hands-free conversation flow and continuous back-and-forth.

Quick Start

Step 1: Trigger Walkie-Talkie by saying 'activa modo walkie-talkie' or 'hablemos por voz'.
Step 2: Send an audio message; the system will transcribe it locally and treat it as a normal prompt.
Step 3: Receive the reply as an .ogg voice note, generated with sherpa-onnx-tts and sent back automatically.

Best Practices

Keep incoming audio messages reasonably brief to improve transcription accuracy.
Speak clearly and enunciate to aid local transcription (whisper-cpp).
Use the defined triggers 'activa modo walkie-talkie' or 'hablemos por voz' to start.
Test with different languages or accents to tune TTS voice cadence.
Always provide both a text and an audio reply to maintain context and accessibility.

Example Use Cases

A user sends a short voice memo in WhatsApp; Walkie-Talkie transcribes it, processes the prompt, and returns a natural-sounding voice reply.
A bilingual user speaks in Spanish; the system transcribes and replies in the same language using local TTS.
A remote worker uses Walkie-Talkie to draft quick updates while multitasking, receiving immediate audio replies.
A parent communicates via voice while cooking; the assistant responds with a concise voice note to keep hands free.
During a workout, the user asks for reminders or directions and gets a rapid audio response back.
