Qwen Researchers Release Qwen3-TTS: An Open Multilingual TTS Suite with Real-Time Latency and Fine-Grained Voice Control


Alibaba Cloud’s Qwen team has open-sourced Qwen3-TTS, a family of multilingual text-to-speech models targeting three core functions in one stack: voice cloning, voice design, and high-quality speech generation.

https://arxiv.org/pdf/2601.15621v1

Model Family and Capabilities

Qwen3-TTS uses a 12Hz speech tokenizer and two language model sizes, 0.6B and 1.7B, covering three main functions. The open release includes 5 models: Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base for voice cloning and generic TTS, Qwen3-TTS-12Hz-0.6B-CustomVoice and Qwen3-TTS-12Hz-1.7B-CustomVoice for promptable preset speakers, and Qwen3-TTS-12Hz-1.7B-VoiceDesign for free-form voice generation from natural language descriptions, together with the Qwen3-TTS-Tokenizer-12Hz codec.

All models support 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. The CustomVoice variants ship with 9 curated timbres, such as Vivian, a bright young Chinese female voice; Ryan, a dynamic English male voice; and Ono_Anna, a playful Japanese female voice, each with a short description encoding its timbre and speaking style.

The VoiceDesign model maps text instructions directly onto new voices, for example ‘speaking in a nervous teenage male voice with rising intonation’, and can be combined with the Base model by first generating a short reference clip from the designed voice and then reusing it via create_voice_clone_prompt.


Architecture, Tokenizer, and Streaming Path

Qwen3-TTS is a dual-track language model: one track predicts acoustic tokens separately from text, while the other handles alignment and control signals. The system is trained on over 5 million hours of multilingual speech in three pre-training phases that progress from simple text-to-speech mapping, to high-quality data, to long-context support up to 32,768 tokens.

A key component is the Qwen3-TTS-Tokenizer-12Hz codec. It operates at 12.5 frames per second, approximately 80 ms per token, and uses 16 quantizers with a 2048-entry codebook. On LibriSpeech test-clean it reaches PESQ wideband 3.21, STOI 0.96, and UTMOS 4.16, outperforming SpeechTokenizer, XCodec, Mimi, FireRedTTS 2, and other recent semantic tokenizers at the same or lower frame rates.
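From the frame rate and codebook figures above, one can derive the codec's nominal bitrate; this back-of-the-envelope calculation is ours, not stated in the article:

```python
import math

FRAME_RATE_HZ = 12.5   # tokenizer frames per second
NUM_QUANTIZERS = 16    # residual codebooks per frame
CODEBOOK_SIZE = 2048   # entries per codebook

bits_per_codeword = math.log2(CODEBOOK_SIZE)          # 11 bits
bits_per_frame = NUM_QUANTIZERS * bits_per_codeword   # 176 bits
bitrate_kbps = FRAME_RATE_HZ * bits_per_frame / 1000  # 2.2 kbps

ms_per_frame = 1000 / FRAME_RATE_HZ                   # 80 ms of audio per token
print(bitrate_kbps, ms_per_frame)  # 2.2 80.0
```

At roughly 2.2 kbps, the discrete representation is compact enough that the language model's token budget, not the codec, dominates generation cost.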

The tokenizer is implemented as a purely left-context streaming decoder, so it can emit waveform audio as soon as enough tokens are available. With 4 tokens per packet, each streaming packet contains 320 ms of audio. The DiT-free decoder and BigVGAN-free design reduce decoding cost and simplify batching.
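The packetized streaming path described above can be sketched as a simple chunker; the `packetize` helper below is our illustration, not the release's API, but the 4-token/320 ms packet geometry matches the reported setup:

```python
from typing import Iterable, Iterator, List

FRAME_RATE_HZ = 12.5   # codec frame rate: one token covers 1000/12.5 = 80 ms
TOKENS_PER_PACKET = 4  # packet size used in the reported streaming setup

def packetize(tokens: Iterable[int]) -> Iterator[List[int]]:
    """Group codec tokens into fixed-size packets for streaming decode."""
    buf: List[int] = []
    for t in tokens:
        buf.append(t)
        if len(buf) == TOKENS_PER_PACKET:
            yield buf
            buf = []
    if buf:  # flush the final, possibly partial, packet
        yield buf

packet_ms = TOKENS_PER_PACKET * 1000 / FRAME_RATE_HZ  # 320.0 ms per full packet
packets = list(packetize(range(10)))  # [[0,1,2,3], [4,5,6,7], [8,9]]
```

Because decoding needs only left context, each packet can be turned into audio the moment its 4 tokens arrive, which is what keeps first-packet latency low.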

On the language model side, the research team reports end-to-end streaming measurements on a single vLLM backend with torch.compile and CUDA graph optimizations. For Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base at concurrency 1, first-packet latency is about 97 ms and 101 ms, with real-time factors of 0.288 and 0.313, respectively. Even at concurrency 6, first-packet latency remains around 299 ms and 333 ms.
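To make the real-time factor (RTF) concrete: RTF is compute time divided by audio duration, so RTF < 1 means the model synthesizes audio faster than it plays back. A quick worked example using the reported concurrency-1 numbers (the 10-second utterance is our illustration):

```python
def synthesis_time_s(audio_duration_s: float, rtf: float) -> float:
    """RTF = compute time / audio duration; returns compute time in seconds."""
    return audio_duration_s * rtf

# Reported RTFs at concurrency 1 for the 0.6B and 1.7B Base models.
t_06b = synthesis_time_s(10.0, 0.288)  # ~2.88 s to synthesize 10 s of audio
t_17b = synthesis_time_s(10.0, 0.313)  # ~3.13 s
```

Combined with sub-120 ms first-packet latency, this leaves a comfortable margin for live, conversational use.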


Alignment and Control

After pre-training, a stepwise alignment pipeline is applied. First, direct preference optimization (DPO) aligns the generated speech with human preferences on multilingual data. Then GSPO improves stability and consistency using rule-based rewards. A final speaker fine-tuning step on the Base model produces the target-speaker variants while preserving the core capabilities of the general model.

Instruction following is implemented in a ChatML-style format, where text instructions about style, emotion, or delivery are attached to the input. This same interface powers VoiceDesign, CustomVoice style prompts, and fine-grained editing for cloned speakers.
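The article does not show the exact prompt schema, so as a hedged illustration only, here is what a ChatML-style instruction wrapper could look like; the role layout, the `[instruction]` field, and `build_chatml_prompt` itself are our assumptions, not the official Qwen3-TTS API:

```python
from typing import Optional

def build_chatml_prompt(text: str, instruction: Optional[str] = None) -> str:
    """Hypothetical sketch of a ChatML-style TTS prompt: a style/emotion
    instruction rides alongside the text to be spoken."""
    parts = ["<|im_start|>user"]
    if instruction:
        parts.append(f"[instruction] {instruction}")  # illustrative field name
    parts.append(text)
    parts.append("<|im_end|>")
    parts.append("<|im_start|>assistant")  # model continues with speech tokens
    return "\n".join(parts)

prompt = build_chatml_prompt(
    "See you tomorrow!",
    instruction="speaking in a nervous teenage male voice with rising intonation",
)
```

The appeal of a single chat-style interface is that voice design, preset-speaker styling, and clone editing all reduce to the same text-conditioning mechanism.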

Benchmarks, Zero-Shot Cloning, and Multilingual Speech

On the Seed-TTS test set, Qwen3-TTS is evaluated as a zero-shot voice cloning system. The Qwen3-TTS-12Hz-1.7B-Base model reaches a word error rate of 0.77 on test-zh and 1.24 on test-en. The research team highlights the 1.24 WER on test-en as state-of-the-art among compared systems, while the Chinese WER is close to, though slightly behind, the best CosyVoice 3 score.


On the multilingual TTS test set covering 10 languages, Qwen3-TTS achieves the lowest WER in 6 languages, Chinese, English, Italian, French, Korean, and Russian, and competitive performance in the remaining 4, while also achieving the highest speaker similarity in all 10 languages compared to MiniMax-Speech and ElevenLabs Multilingual v2.

Cross-lingual evaluation shows that Qwen3-TTS-12Hz-1.7B-Base reduces the error rate for many language pairs; for zh-to-ko, for example, the error drops from 14.4 for CosyVoice 3 to 4.82, about a 66 percent relative reduction.
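The relative-reduction figure follows directly from the two reported scores; verifying the arithmetic:

```python
def relative_reduction(old: float, new: float) -> float:
    """Percentage relative reduction from old to new."""
    return (old - new) / old * 100

# zh-to-ko error rate: CosyVoice 3 (14.4) vs Qwen3-TTS-12Hz-1.7B-Base (4.82)
pct = relative_reduction(14.4, 4.82)
print(round(pct, 1))  # 66.5
```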

On InstructTTSEval, the Qwen3-TTS-12Hz-1.7B-VoiceDesign model sets new state-of-the-art scores among open-source models on description-speech consistency and response precision in both Chinese and English, and is competitive with commercial systems such as Hume and Gemini on multiple metrics.

Key Takeaways

  • Complete open-source multilingual TTS stack: Qwen3-TTS is an Apache 2.0 licensed suite covering three functions in one stack, high-quality TTS, 3-second voice cloning, and instruction-based voice design, across 10 languages, built on the 12Hz tokenizer family.
  • Efficient discrete codec and real-time streaming: Qwen3-TTS-Tokenizer-12Hz uses 16 codebooks at 12.5 frames per second, reaches strong PESQ, STOI, and UTMOS scores, and supports packetized streaming with approximately 320 ms of audio per packet and under 120 ms first-packet latency for the 0.6B and 1.7B models in the reported setup.
  • Function-specific model variants: The release provides Base models for cloning and generic TTS, CustomVoice models with 9 predefined speakers and style prompts, and a VoiceDesign model that generates new voices directly from natural language descriptions, which the Base models can then reuse.
  • Strong alignment and multilingual quality: A multi-stage alignment pipeline with DPO, GSPO, and speaker fine-tuning gives Qwen3-TTS low word error rates and high speaker similarity, with the lowest WER in 6 out of 10 languages and the best speaker similarity across all 10 languages among the evaluated systems, plus state-of-the-art zero-shot English cloning on Seed-TTS.

Check out the model weights, repo, and playground.

