gemini-live-api
npx machina-cli add skill Hildegaardchiasmal966/claude-skills/gemini-live-api --openclaw

Gemini Live API Developer Skill
Overview
The Gemini Live API enables low-latency, real-time voice and video interactions with Gemini models. This skill provides comprehensive guidance for implementing natural voice conversations, handling interruptions, integrating tools and function calling, managing sessions, and customizing voice output.
When to Use This Skill
Use this skill when:
- Implementing real-time two-way voice conversations between AI and users
- Building voice agents that can be interrupted naturally mid-response
- Adding function calling and real-time web search to live voice sessions
- Implementing text-to-speech with natural cadence and tone
- Debugging audio streaming issues or session management problems
- Customizing voice characteristics (cadence, tone, style, language)
- Managing WebSocket connections, session resumption, or context compression
- Implementing ephemeral tokens for client-side authentication
Core Capabilities Covered
- Bidirectional Audio Streaming: 16kHz PCM input, 24kHz output with real-time processing
- Voice Activity Detection (VAD): Automatic or manual speech detection with interruption handling
- Function Calling: Tool integration with async execution and scheduling parameters
- Session Management: Connection lifecycle, resumption tokens, context compression
- Voice Customization: Multiple voices, speech configuration, natural prosody
- Audio Processing: PCM encoding/decoding, sample rate conversion, queue management
- Security: Ephemeral tokens for production client-to-server implementations
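The audio-processing capability above largely comes down to moving 16-bit PCM between binary and base64, since the Live API transports audio chunks as base64 strings inside JSON messages. A minimal standard-library sketch (function names are illustrative, not taken from scripts/audio_utils.py):

```python
import array
import base64

def pcm_to_base64(samples: list[int]) -> str:
    """Pack 16-bit signed samples into PCM bytes, then base64-encode them
    for embedding in a JSON realtime-input message.
    Note: array('h') uses native byte order; PCM wire format expects
    little-endian, which holds on virtually all common hosts."""
    return base64.b64encode(array.array("h", samples).tobytes()).decode("ascii")

def base64_to_pcm(payload: str) -> list[int]:
    """Inverse: decode a base64 audio chunk back into 16-bit samples."""
    pcm = array.array("h")
    pcm.frombytes(base64.b64decode(payload))
    return pcm.tolist()
```

scripts/audio_utils.py covers the same conversions (plus numpy-array variants) for real implementations.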
How to Use This Skill
For Quick Reference
Start with references/api-overview.md for endpoints, authentication, models, and limitations.
For Specific Implementation Tasks
Implementing Voice Conversations:
- Review references/audio-handling.md for audio specifications and streaming strategies
- Use scripts/audio_utils.py for PCM encoding/decoding utilities
- Reference references/architecture-patterns.md for complete implementation approaches
Handling Interruptions:
Check references/audio-handling.md for VAD configuration and interruption patterns.
Adding Function Calling:
See references/function-calling.md for tool declarations, response handling, and async execution patterns.
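The basic flow is: the server sends a tool-call message naming a declared function, the client executes it, and the client sends back a matching function response. A dispatch sketch using plain dicts (the message shapes and the get_weather tool are simplified assumptions, not the SDK's actual types):

```python
def get_weather(city: str) -> dict:
    # Hypothetical tool implementation; a real one would call a weather service.
    return {"city": city, "forecast": "sunny"}

TOOL_REGISTRY = {"get_weather": get_weather}

def handle_tool_call(tool_call: dict) -> dict:
    """Run each requested function and build the response message to send back.
    Each response echoes the call's id and name so the model can match them up."""
    responses = []
    for call in tool_call.get("functionCalls", []):
        fn = TOOL_REGISTRY.get(call["name"])
        result = fn(**call.get("args", {})) if fn else {"error": f"unknown tool {call['name']}"}
        responses.append({"id": call.get("id"), "name": call["name"], "response": result})
    return {"functionResponses": responses}
```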
Customizing Voices:
references/voice-customization.md provides comprehensive guidance on:
- Available voices and their personalities
- Making voices sound natural
- Adjusting cadence, pitch, and tone
- Language and multilingual support
Managing Sessions:
Review references/session-management.md for lifecycle management, resumption tokens, and graceful shutdown.
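The essence of resumption is bookkeeping: the server periodically sends an updated resumption handle, the client stores the latest one, and passes it back when reopening a dropped connection. A minimal sketch (the update-message shape is a simplified assumption based on the session resumption updates the Live API emits):

```python
class SessionState:
    """Tracks the latest resumption handle so a dropped connection can pick up
    the same conversation instead of starting fresh."""
    def __init__(self) -> None:
        self.resume_handle: str | None = None

    def on_message(self, message: dict) -> None:
        update = message.get("sessionResumptionUpdate")
        if update and update.get("resumable"):
            # Always keep only the newest handle; older ones are superseded.
            self.resume_handle = update.get("newHandle")

    def reconnect_config(self) -> dict:
        # Supply the stored handle in the next connection's config to resume.
        return {"session_resumption": {"handle": self.resume_handle}}
```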
Production Best Practices:
Check references/best-practices.md for error handling, optimization, and common pitfalls.
Audio Utilities
The scripts/audio_utils.py module provides reusable functions for:
- Converting between numpy arrays, bytes, and base64 for PCM audio
- Sample rate conversion (16kHz input ↔ 24kHz output)
- Audio chunk management for gap-free playback
- Format conversions for different Python audio libraries (pyaudio, sounddevice)
Import and use these utilities in your implementation to avoid rewriting common audio processing code.
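To illustrate the sample-rate conversion those utilities handle, here is a naive linear-interpolation resampler in pure Python. It is a teaching sketch only; production code should use scripts/audio_utils.py or a proper resampling library, since linear interpolation introduces audible aliasing:

```python
def resample_linear(samples: list[int], src_rate: int = 16000, dst_rate: int = 24000) -> list[int]:
    """Resample 16-bit PCM samples by linear interpolation.
    Defaults map Live API input (16kHz) to output (24kHz)."""
    if not samples:
        return []
    n_out = max(int(len(samples) * dst_rate / src_rate), 1)
    span = len(samples) - 1
    out = []
    for i in range(n_out):
        # Map output index i onto a fractional position in the input signal.
        pos = i * span / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, span) if span else 0
        frac = pos - lo
        out.append(int(round(samples[lo] * (1 - frac) + samples[hi] * frac)))
    return out
```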
Working Example
For a complete working implementation, reference the ai-news-app repository which demonstrates:
- Bidirectional audio streaming with interruption handling
- Real-time transcription display
- Audio playback scheduling without gaps
- Session management and cleanup
- Browser-based implementation using the JavaScript SDK
Key Technical Considerations
Audio Format Requirements:
- Input: 16-bit PCM, 16kHz, mono (audio/pcm;rate=16000)
- Output: 24kHz sample rate
- Real-time processing with minimal latency
Session Limits:
- Audio-only: 15 minutes without compression (unlimited with compression)
- Audio+video: 2 minutes without compression
- WebSocket connections: ~10 minutes maximum
- Context window: 128k tokens (native audio), 32k (half-cascade)
Response Modalities: Can only set ONE modality per session - either TEXT or AUDIO, not both simultaneously.
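Because a session accepts exactly one response modality, the connection config should pin it up front and fail fast on anything else. A sketch of that config as a plain dict (field names mirror the SDK's LiveConnectConfig; "Puck" is one of the prebuilt voices, used here as an example):

```python
def make_live_config(modality: str = "AUDIO") -> dict:
    """Build a Live API connection config with exactly one response modality.
    Treat the dict shape as a sketch of the SDK's LiveConnectConfig."""
    if modality not in ("AUDIO", "TEXT"):
        raise ValueError("Live sessions support exactly one modality: AUDIO or TEXT")
    return {
        "response_modalities": [modality],  # one entry only - never both
        "speech_config": {
            "voice_config": {"prebuilt_voice_config": {"voice_name": "Puck"}},
        },
    }
```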
Production Security: Always use ephemeral tokens for client-to-server implementations. Never expose API keys in client-side code.
Reference Documentation
All reference files contain detailed technical information with code examples in both Python and JavaScript:
- api-overview.md - Endpoints, authentication, models, limitations
- audio-handling.md - Audio specs, VAD, interruptions, streaming
- function-calling.md - Tool integration and async execution
- session-management.md - Lifecycle, resumption, compression
- voice-customization.md - Voices, speech config, natural prosody
- best-practices.md - Production patterns and optimization
- architecture-patterns.md - Implementation approaches with examples
Additional Resources
- Official Documentation: https://ai.google.dev/gemini-api/docs/live
- Python SDK: google-genai package
- JavaScript SDK: @google/genai package
- Interactive Demo: Google AI Studio
- Example Implementation: ai-news-app repository
Source
Repository: https://github.com/Hildegaardchiasmal966/claude-skills (gemini-live-api/SKILL.md)
How This Skill Works
The skill walks you through establishing a low-latency bidirectional channel (16kHz PCM input, 24kHz output) with VAD-based interruption handling and real-time transcription. Use the provided audio utilities (scripts/audio_utils.py) for PCM encoding/decoding and sample-rate conversion, integrate function calls via async tool execution, manage sessions with resumption tokens and context compression, and tune voice output (voice selection, cadence, language) through speech configuration.
When to Use It
- Real-time two-way voice conversations between AI and users
- Voice agents that can be interrupted naturally mid-response
- Live function calling and real-time web search during sessions
- Text-to-speech with natural cadence and voice customization
- Managing sessions with resumption tokens, context compression, and secure authentication
Quick Start
- Step 1: Review references/api-overview.md for endpoints, authentication, models, and limits
- Step 2: Wire up bidirectional audio streaming using references/audio-handling.md and scripts/audio_utils.py for PCM encoding/decoding and sample-rate conversion
- Step 3: Implement session management, function calling, and voice customization using references/session-management.md, references/function-calling.md, and references/voice-customization.md
Best Practices
- Design for 16kHz PCM input and 24kHz output with low-latency buffering
- Configure Voice Activity Detection (VAD) for reliable interruptions
- Use function calling with async tool declarations and scheduling
- Implement robust session management: resumption tokens, context compression, and graceful shutdown
- Leverage voice customization: test multiple voices, cadence, pitch, and language
Example Use Cases
- Real-time customer support bot with interruptible conversations and live transcription
- Voice-enabled assistant performing live tool calls and web searches during sessions
- Multilingual meetings with real-time transcription and translation
- Persona-based voice experimentation with cadence and tone adjustments
- Secure mobile/web sessions using ephemeral tokens and token-based authentication