VoxCPM2
Create studio-quality voices with VoxCPM2's advanced AI voice generation. Design entirely new voices from text descriptions, or clone any voice with precise control over emotion, pace, and style. Experience 48kHz high-fidelity audio across 30 languages, powered by a tokenizer-free diffusion architecture.
- Model Parameters: 2B
- Languages Supported: 30
- Audio Quality: 48kHz
- Training Data: 2M+ Hours
Experience VoxCPM2 Live Demo
Test VoxCPM2's voice design and controllable cloning in real-time. Create custom voices from text descriptions or clone voices with fine-grained control over timbre, emotion, and speaking style.
Model Versions
Choose the Right VoxCPM Model for Your Needs
VoxCPM offers three model versions optimized for different use cases. Compare features, performance, and capabilities to select the best fit for your project.
| Feature | VoxCPM2 (🟢 Latest) | VoxCPM1.5 (🔵 Stable) | VoxCPM-0.5B (⚪ Legacy) |
| --- | --- | --- | --- |
| Backbone Parameters | 2B | 0.6B | 0.5B |
| Audio Sample Rate | 48kHz | 44.1kHz | 16kHz |
| LM Token Rate | 6.25Hz | 6.25Hz | 12.5Hz |
| Languages | 30 | 2 (zh, en) | 2 (zh, en) |
| Cloning Mode | Isolated Reference & Continuation | Continuation only | Continuation only |
| Voice Design | ✅ | — | — |
| Controllable Cloning | ✅ | — | — |
| SFT / LoRA | ✅ | ✅ | ✅ |
| RTF (RTX 4090) | ~0.30 | ~0.15 | ~0.17 |
| RTF in Nano-VLLM | ~0.13 | ~0.08 | ~0.10 |
| VRAM | ~8 GB | ~6 GB | ~5 GB |
VoxCPM2 Core Features
Professional Voice Generation for Every Application
VoxCPM2 combines cutting-edge diffusion autoregressive architecture with intuitive controls, delivering studio-quality voice synthesis for content creators, developers, and enterprises worldwide.
- Zero-Shot Voice Design
Create entirely new voices from natural language descriptions with VoxCPM2's zero-shot capability. No reference audio required—simply describe voice characteristics like gender, age, accent, and speaking style. VoxCPM2 synthesizes unique voices instantly, perfect for character creation, brand voice development, and creative audio projects without extensive voice talent recording.
- Controllable Voice Cloning
Clone any voice from short reference clips with VoxCPM2's advanced cloning modes. Control emotion, pace, pitch, and speaking style while preserving the original timbre. VoxCPM2 offers both isolated-reference cloning and continuation modes for maximum flexibility, and its ultimate cloning mode with transcript guidance delivers the highest-fidelity voice replication for professional dubbing and voice preservation.
- Studio-Quality 48kHz Audio
VoxCPM2 outputs broadcast-ready 48kHz high-fidelity audio through AudioVAE V2's asymmetric architecture, which accepts 16kHz reference audio and upsamples it to 48kHz with built-in super-resolution, eliminating external processing. The result is crisp, natural-sounding speech suitable for professional media production, podcasts, audiobooks, and commercial applications without post-processing.
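As a quick sanity check on those rates, the numbers below follow from simple arithmetic over the stated sample rates, not from VoxCPM internals:

```python
# Rough arithmetic on the asymmetric 16kHz-in / 48kHz-out pipeline.

ref_rate = 16_000   # Hz, reference audio accepted by AudioVAE V2
out_rate = 48_000   # Hz, synthesized output

upsample_factor = out_rate / ref_rate
print(upsample_factor)  # 3.0 -> built-in 3x super-resolution

# Uncompressed size of one minute of 16-bit mono output:
bytes_per_minute = out_rate * 2 * 60
print(bytes_per_minute / 1e6)  # 5.76 MB of PCM audio per minute
```

In other words, every minute of broadcast-ready output carries three times the bandwidth of the reference audio it was conditioned on.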
- 30-Language Multilingual Support
VoxCPM2 supports 30 languages including English, Chinese, Spanish, French, German, Japanese, Korean, Arabic, Hindi, and more. Includes Chinese dialect support for Cantonese, Sichuanese, Wu, and regional variations. VoxCPM2 automatically detects input language in most cases, making multilingual voice generation seamless for global content creators and localization teams.
- Real-Time Streaming Performance
VoxCPM2 achieves a real-time factor (RTF, generation time divided by audio duration) of about 0.30 on an RTX 4090, or 0.13 with Nano-VLLM optimization, enabling real-time voice synthesis for interactive applications. Deployment requires only 8GB of VRAM. VoxCPM2's efficient tokenizer-free architecture processes speech at a 6.25Hz token rate with an 8192-token sequence length, making it well suited to voice agents, live dubbing, and streaming applications.
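The quoted figures translate directly into wall-clock numbers; the sketch below is plain arithmetic over the stated RTF, token rate, and sequence length:

```python
# RTF (real-time factor) = generation time / audio duration, so RTF < 1
# means faster-than-real-time synthesis. The RTF values below are the
# ones quoted for VoxCPM2 on an RTX 4090.

def generation_time(audio_seconds: float, rtf: float) -> float:
    """Wall-clock seconds needed to synthesize `audio_seconds` of speech."""
    return audio_seconds * rtf

print(round(generation_time(60, 0.30), 2))  # 18.0 s for 1 min of audio
print(round(generation_time(60, 0.13), 2))  # 7.8 s with Nano-VLLM

# Maximum clip length implied by the 8192-token window at 6.25 Hz:
max_seconds = 8192 / 6.25
print(round(max_seconds / 60, 1))  # 21.8 minutes per sequence
```

So a one-hour audiobook chapter would take roughly 18 minutes to render at RTF 0.30, or under 8 minutes with Nano-VLLM, split across several sequences.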
- Open Source & Customizable
VoxCPM2 is fully open source under Apache 2.0 license with complete model weights on Hugging Face. Built on MiniCPM-4 backbone with 2B parameters trained on 2M+ hours of multilingual speech data. VoxCPM2 supports fine-tuning via SFT and LoRA for custom voice adaptation. Deploy on your infrastructure with full control and transparency for research and commercial use.
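To get a feel for why LoRA fine-tuning is lightweight, here is a back-of-envelope sizing sketch; the layer count and dimensions are illustrative assumptions, not actual VoxCPM2 shapes:

```python
# Back-of-envelope LoRA sizing. A rank-r adapter on a d_out x d_in weight
# adds r * (d_in + d_out) trainable parameters.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

# Hypothetical: rank-8 adapters on four attention projections
# (assumed 2048x2048) in each of 32 transformer layers.
per_layer = 4 * lora_params(2048, 2048, rank=8)
total = 32 * per_layer
print(total)                             # 4194304 trainable parameters
print(round(total / 2e9 * 100, 2))       # 0.21 (% of a 2B backbone)
```

Under these assumed shapes, a LoRA adapter trains well under 1% of the backbone's parameters, which is why voice adaptation fits on the same consumer GPU used for inference.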
VoxCPM2 Frequently Asked Questions
- What is VoxCPM2 and how does it differ from other TTS models?
- VoxCPM2 is an open-source tokenizer-free text-to-speech model with 2B parameters developed by OpenBMB. Unlike traditional TTS systems, VoxCPM2 uses diffusion autoregressive architecture to generate continuous speech representations directly. VoxCPM2 uniquely combines zero-shot voice design from text descriptions with controllable voice cloning, offering three distinct modes: voice design without reference audio, controllable cloning with style control, and ultimate cloning with transcript guidance for maximum fidelity.
- What languages does VoxCPM2 support?
- VoxCPM2 supports 30 languages: Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, and Vietnamese. VoxCPM2 also includes Chinese dialect support for Cantonese, Sichuanese, Wu, Northeastern, Henan, Shaanxi, Shandong, Tianjin, and Minnan dialects. VoxCPM2 automatically detects input language in most cases.
- How does VoxCPM2 voice design work without reference audio?
- VoxCPM2's voice design mode uses zero-shot learning to create entirely new voices from natural language descriptions. Simply describe desired voice characteristics in parentheses at the start of your text—such as gender, age, accent, pitch, speaking style—and VoxCPM2 synthesizes a matching voice instantly. This eliminates the need for voice talent recording or reference audio collection, making custom voice creation accessible for character development, brand voices, and creative projects.
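The description-in-parentheses convention can be sketched as a tiny helper; the exact prompt format VoxCPM2 expects may differ, so treat `design_prompt` as a hypothetical illustration rather than part of the VoxCPM API:

```python
# Illustrative helper for the convention described above: a natural-language
# voice description in parentheses, prepended to the text to be spoken.

def design_prompt(description: str, text: str) -> str:
    return f"({description}) {text}"

prompt = design_prompt(
    "A calm, middle-aged female voice with a slight British accent",
    "Welcome back. Let's pick up where we left off.",
)
print(prompt)
# (A calm, middle-aged female voice with a slight British accent) Welcome back. Let's pick up where we left off.
```

Keeping descriptions concrete (gender, age, accent, pitch, speaking style) gives the model the most to work with.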
- What audio quality does VoxCPM2 produce?
- VoxCPM2 outputs studio-quality 48kHz audio suitable for professional media production. VoxCPM2 uses AudioVAE V2 with asymmetric encode/decode architecture that accepts 16kHz reference audio and outputs 48kHz with built-in super-resolution. This high-fidelity output eliminates the need for external upsampling or post-processing. VoxCPM2 achieves state-of-the-art results on major TTS benchmarks including Seed-TTS-eval, CV3-eval, and InstructTTSEval.
- What are VoxCPM2's hardware requirements?
- VoxCPM2 requires approximately 8GB VRAM for inference with the 2B parameter model in bfloat16 precision. VoxCPM2 achieves RTF of 0.30 on RTX 4090 GPU, or 0.13 with Nano-VLLM optimization for faster generation. Minimum requirements are Python 3.10+, PyTorch 2.5.0+, and CUDA 12.0+. VoxCPM2 can run on consumer-grade GPUs, making professional voice synthesis accessible to individual developers and small teams without enterprise infrastructure.
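The ~8GB figure is consistent with simple bfloat16 accounting; the breakdown below is a rough estimate, not an official memory profile:

```python
# Why ~8 GB: a 2B-parameter model in bfloat16 needs 2 bytes per parameter
# for the weights alone; the remainder covers KV cache, activations, and
# the audio decoder. This split is a rough estimate, not a measured one.

params = 2e9
bytes_per_param = 2  # bfloat16

weights_gb = params * bytes_per_param / 1024**3
print(round(weights_gb, 2))  # 3.73 GB of weights

# Leaves roughly 4 GB of the quoted ~8 GB budget for everything else.
headroom_gb = 8 - weights_gb
```

This is why cards in the 8-12GB class (e.g. an RTX 3060 or better) are plausible inference targets, though actual usage depends on sequence length and batch size.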
- Is VoxCPM2 free to use for commercial projects?
- Yes, VoxCPM2 is fully open source under Apache 2.0 license, allowing both personal and commercial use. You can download VoxCPM2 model weights from Hugging Face, deploy on your own infrastructure, and customize for your specific needs. VoxCPM2 supports fine-tuning via supervised fine-tuning (SFT) and LoRA for voice adaptation. OpenBMB provides complete documentation, code, and model weights with no API fees or usage restrictions for VoxCPM2.
- Can I control emotion and speaking style with VoxCPM2?
- Yes, VoxCPM2's controllable cloning mode provides fine-grained control over voice attributes. You can adjust emotion, pace, pitch variation, and speaking style while preserving the original voice timbre from reference audio. VoxCPM2 accepts natural language control instructions to steer voice characteristics. Note that controllable generation results may vary between runs—VoxCPM2 developers recommend generating 1-3 times to achieve desired voice or style as they continue improving controllability consistency.