VibeVoice

VibeVoice is an open-source multi-speaker text-to-speech model that generates up to 90 minutes of natural conversational audio. It produces podcast-style dialogue with up to 4 distinct speakers, which suits audiobooks, e-learning, and other long-form content.

Max Audio Length: 90 Minutes
Speakers Supported: Up to 4
Context Window: 64K Tokens
License: MIT Open Source

Experience VibeVoice Multi-Speaker Demo

Try VibeVoice's advanced conversational AI and generate natural multi-speaker dialogue. Create podcast-style conversations, audiobook narrations, and engaging educational content with authentic turn-taking and emotional expression.

VibeVoice Core Features

Revolutionary Long-Form Conversational Speech Synthesis

VibeVoice combines multi-speaker synthesis, natural dialogue flow, and extended generation length, targeting professional podcast production and immersive storytelling.

Extended 90-Minute Generation: VibeVoice generates up to 90 minutes of continuous speech in a single session, using the 64K context window of the 1.5B parameter model (the 7B parameter model supports up to 45 minutes). This makes VibeVoice suited to full podcast episodes, complete audiobook chapters, comprehensive training modules, and long-form interviews without splitting the script into segments.
Natural Multi-Speaker Conversations: Create authentic dialogue with up to 4 distinct speakers using VibeVoice's advanced turn-taking system. Each VibeVoice speaker maintains consistent voice characteristics, personality, and speaking style throughout the conversation. Perfect for panel discussions, interviews, educational dialogues, customer service simulations, and dramatic storytelling with multiple characters.
Spontaneous Emotional Expression: VibeVoice captures genuine emotional nuances including laughter, excitement, concern, and subtle mood shifts. The model generates spontaneous emotional responses that feel natural and unscripted, creating engaging content that resonates with listeners. VibeVoice even handles spontaneous singing and musical elements within conversations for creative podcast production.
Cross-Lingual Voice Synthesis: VibeVoice excels in both English and Chinese with native-quality pronunciation and intonation. Seamlessly switch between languages within a single conversation while maintaining speaker identity. This makes VibeVoice perfect for bilingual content, language learning materials, international business communications, and global podcast audiences.
Podcast-Quality Audio Production: VibeVoice generates broadcast-quality audio suitable for professional podcast distribution. The system maintains consistent audio characteristics, natural prosody, and appropriate pacing throughout long sessions. VibeVoice handles background ambiance gracefully and produces clean speech ideal for direct publishing or minimal post-production editing.
Efficient Hybrid Architecture: VibeVoice combines continuous speech tokenizers operating at 7.5 Hz with next-token diffusion decoding. Because the tokenizers emit only 7.5 frames per second of audio, long recordings map to comparatively short sequences, which is what makes 90-minute generation practical on hardware accessible to researchers and content creators.
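
To see why a 7.5 Hz tokenizer and a 64K context window fit together, here is a rough token-budget calculation in Python. It is a sketch under stated assumptions, not VibeVoice code: it assumes each 7.5 Hz frame occupies one position in the context window and reads "64K" as 65,536 tokens. The function names are illustrative only.

```python
# Back-of-the-envelope token budget for long-form generation.
# The frame rate and context size are the figures quoted on this page.
# Assumptions (not confirmed by the page): one context position per speech
# frame, and "64K" meaning 65,536 tokens.

FRAME_RATE_HZ = 7.5          # speech tokenizer frame rate
CONTEXT_TOKENS = 64 * 1024   # "64K" read as 65,536 (assumption)


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of speech frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


def remaining_context(minutes: float, context_tokens: int = CONTEXT_TOKENS) -> int:
    """Context positions left over for script text and voice prompts."""
    return context_tokens - speech_frames(minutes)


if __name__ == "__main__":
    print("frames for 90 minutes:", speech_frames(90))
    print("positions left for text:", remaining_context(90))
```

Under these assumptions, 90 minutes of audio needs 90 × 60 × 7.5 = 40,500 frames, which leaves roughly 25,000 positions for the script and speaker prompts. The real accounting inside the model may differ.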

VibeVoice Frequently Asked Questions

What is VibeVoice and what makes it unique?: VibeVoice is an open-source text-to-speech framework specifically designed for long-form, multi-speaker conversational audio. Unlike traditional TTS systems, VibeVoice can generate up to 90 minutes of natural dialogue with up to 4 speakers in a single session. VibeVoice excels at maintaining speaker consistency, natural turn-taking, and emotional expressiveness throughout extended conversations, making it ideal for podcasts, audiobooks, and educational content.
How long can VibeVoice generate audio in one session?: The 1.5B parameter VibeVoice model can generate up to 90 minutes of continuous speech within its 64K context window. The 7B parameter VibeVoice model supports up to 45 minutes of high-quality audio generation. Either limit is long enough for complete podcast episodes, full audiobook chapters, comprehensive training sessions, and long-form interviews without requiring segmentation.
How many speakers can VibeVoice handle simultaneously?: VibeVoice supports up to 4 distinct speakers in a single conversation. Each VibeVoice speaker maintains consistent voice characteristics, personality traits, and speaking patterns throughout the entire session. This multi-speaker capability makes VibeVoice ideal for panel discussions, interviews, educational dialogues, dramatic storytelling, and any content requiring natural conversational dynamics between multiple participants.
What languages does VibeVoice support?: VibeVoice is primarily trained for English and Chinese, delivering native-quality speech in both languages. VibeVoice can seamlessly switch between English and Chinese within a single conversation while maintaining speaker identity. Output in other languages is experimental. For best quality and stability, use VibeVoice with English or Chinese content for professional podcast production and audiobook narration.
Can VibeVoice be used for podcast production?: Absolutely! VibeVoice is specifically designed as a podcast voice generator. It creates broadcast-quality multi-speaker conversations with natural turn-taking, appropriate pacing, and emotional expressiveness. VibeVoice handles long-form content effortlessly, making it perfect for interview podcasts, panel discussions, educational series, and storytelling podcasts. The output quality is suitable for direct publishing with minimal post-production.
Is VibeVoice open source and free to use?: Yes! VibeVoice is released under the MIT open-source license. You can access the complete VibeVoice codebase on GitHub, deploy it locally on your own hardware, and use it for both personal and commercial projects. VibeVoice is available through Hugging Face for easy integration, and you can try VibeVoice demos online before deploying your own instance.
What hardware does VibeVoice require?: VibeVoice offers two model sizes with different hardware requirements. The 1.5B parameter VibeVoice model requires 7-10 GB of VRAM and can generate up to 90 minutes of audio. The 7B parameter VibeVoice model needs 18-24 GB of VRAM and supports up to 45 minutes of higher-quality generation. Both VibeVoice models can run on consumer-grade GPUs, making professional multi-speaker synthesis accessible to individual creators and small teams.
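
The limits quoted in the answers above (4 speakers, 90 or 45 minutes, and the VRAM ranges) can be collected into a small planning helper. The Python below is a hypothetical sketch, not part of the VibeVoice codebase: `ModelSpec`, `MODELS`, and `pick_model` are names invented for illustration, and treating the upper end of each VRAM range as the requirement is a conservative assumption.

```python
from dataclasses import dataclass
from typing import Optional

MAX_SPEAKERS = 4  # speaker limit quoted in the FAQ


@dataclass(frozen=True)
class ModelSpec:
    """Figures copied from the FAQ; the class itself is hypothetical."""
    name: str
    max_minutes: int
    vram_low_gb: int
    vram_high_gb: int


# Ordered from smaller to larger model.
MODELS = (
    ModelSpec("1.5B", max_minutes=90, vram_low_gb=7, vram_high_gb=10),
    ModelSpec("7B", max_minutes=45, vram_low_gb=18, vram_high_gb=24),
)


def pick_model(minutes: float, speakers: int, vram_gb: float) -> Optional[ModelSpec]:
    """Return the largest model that fits the session, or None if none does.

    Assumption: a GPU qualifies only if it meets the upper end of the
    quoted VRAM range, which errs on the safe side.
    """
    if not 1 <= speakers <= MAX_SPEAKERS:
        raise ValueError(f"speakers must be between 1 and {MAX_SPEAKERS}")
    fitting = [
        m for m in MODELS
        if minutes <= m.max_minutes and vram_gb >= m.vram_high_gb
    ]
    # Prefer the larger model, which the FAQ describes as higher quality.
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    choice = pick_model(minutes=30, speakers=2, vram_gb=24)
    print(choice.name if choice else "no model fits")
```

For example, a 60-minute session on a 24 GB card falls back to the 1.5B model, because the 7B model is limited to 45 minutes.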