The landscape of generative audio is moving toward efficiency. A new open-source contender, Kani-TTS-2, released by the team at nineninesix.ai, marks a departure from bulky, computationally expensive TTS systems. Instead, it treats audio as a language, providing high-fidelity speech synthesis with a remarkably small footprint.
Kani-TTS-2 offers a lightweight, high-performance alternative to closed-source APIs. It is currently available on Hugging Face in both English (EN) and Portuguese (PT) versions.
Architecture: LFM2 and NanoCodec
Kani-TTS-2 follows an ‘audio as language’ approach. The model does not use a traditional mel-spectrogram pipeline. Instead, it converts raw audio into discrete tokens using a neural codec.
This system relies on a two-step process:
- Language backbone: built on LiquidAI’s LFM2 (350M) architecture. This backbone generates the ‘audio intent’ by predicting the next audio token. Because LFMs (Liquid Foundation Models) are designed for efficiency, they provide a faster alternative to standard transformers.
- Neural codec: uses NVIDIA NanoCodec to convert those tokens into 22kHz waveforms.
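The two-stage flow above can be sketched in a few lines of Python. This is a toy illustration only: the real LFM2 backbone and NVIDIA NanoCodec are replaced by stand-in stubs, and the rate of 75 tokens per second is an assumed value for demonstration, not the model's actual codec frame rate.

```python
# Toy sketch of the two-stage 'audio as language' pipeline: a language-model
# backbone predicts discrete audio tokens, a codec decodes them to a waveform.
SAMPLE_RATE = 22_050      # NanoCodec decodes to 22 kHz waveforms
TOKENS_PER_SECOND = 75    # assumed token rate, for illustration only

def backbone_generate(text: str, n_tokens: int) -> list[int]:
    """Stand-in for the LFM2 backbone: autoregressively 'predicts'
    discrete audio tokens conditioned on the input text."""
    prev = sum(ord(c) for c in text) % 1024
    tokens = []
    for _ in range(n_tokens):
        prev = (prev * 31 + 7) % 1024   # fake next-token prediction
        tokens.append(prev)
    return tokens

def codec_decode(tokens: list[int]) -> list[float]:
    """Stand-in for NanoCodec: maps each discrete token back to a
    chunk of waveform samples in [-1, 1]."""
    samples_per_token = SAMPLE_RATE // TOKENS_PER_SECOND
    wave = []
    for t in tokens:
        amp = (t / 1024.0) * 2 - 1      # token -> crude amplitude
        wave.extend([amp] * samples_per_token)
    return wave

def tts(text: str, seconds: float) -> list[float]:
    tokens = backbone_generate(text, int(seconds * TOKENS_PER_SECOND))
    return codec_decode(tokens)

wave = tts("Hello from Kani-TTS-2", seconds=1.0)
print(len(wave))  # one second of audio at 22,050 samples/s
```

The key point the sketch captures is the data flow: text in, discrete token sequence in the middle, waveform out.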
With this architecture, the model captures the rhythm and intonation of human-like prosody without the ‘robotic’ artifacts found in older TTS systems.
Efficiency: 10,000 hours in 6 hours
The training metrics for Kani-TTS-2 are a masterclass in optimization. The English model was trained on 10,000 hours of high-quality speech data.
While that scale is impressive, the training speed is the real story. The team trained the model in only 6 hours on a cluster of 8 NVIDIA H100 GPUs. This shows that large datasets no longer require weeks of computation when paired with efficient architectures like LFM2.
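A quick back-of-the-envelope calculation puts those numbers in perspective: 6 wall-clock hours on 8 GPUs is a 48 GPU-hour budget for 10,000 hours of audio.

```python
# Back-of-the-envelope framing of the training run reported above.
data_hours = 10_000        # hours of speech in the training set
wall_clock_hours = 6       # reported training time
gpus = 8                   # NVIDIA H100 cluster size

gpu_hours = wall_clock_hours * gpus             # total compute budget
audio_per_gpu_hour = data_hours / gpu_hours     # audio hours per GPU-hour
print(gpu_hours, round(audio_per_gpu_hour, 1))  # -> 48 208.3
```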
Zero-shot voice cloning and performance
For developers, the standout convenience is zero-shot voice cloning. Unlike traditional models, which require fine-tuning for each new voice, Kani-TTS-2 uses speaker embeddings.
- How it works: you provide a brief reference audio clip.
- Result: the model extracts the unique features of that voice and immediately applies them to the generated text.
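The two steps above can be sketched as follows. Every function here is a hypothetical stand-in (the real model ships its own encoder and synthesis API); the sketch only illustrates the flow: embed the reference clip once, then condition generation on that embedding.

```python
# Hedged sketch of zero-shot cloning via a speaker embedding.
def extract_speaker_embedding(ref_audio: list[float], dim: int = 4) -> list[float]:
    """Reduce a reference clip to a fixed-size speaker embedding
    (here: crude chunk averages instead of a learned encoder)."""
    chunk = max(1, len(ref_audio) // dim)
    return [sum(ref_audio[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]

def synthesize(text: str, speaker_embedding: list[float]) -> list[float]:
    """Stand-in synthesis: the embedding biases the generated samples,
    mimicking how conditioning shifts output toward the target voice."""
    bias = sum(speaker_embedding) / len(speaker_embedding)
    return [((ord(c) % 32) / 32.0 - 0.5) + bias * 0.1 for c in text]

reference_clip = [0.1, -0.2, 0.3, 0.05, -0.1, 0.2, 0.0, 0.15]  # a few samples
emb = extract_speaker_embedding(reference_clip)
audio = synthesize("Cloned voice output", emb)
print(len(emb), len(audio))
```

The design point is that the embedding is computed once per speaker and reused, which is what makes cloning "zero-shot": no gradient updates, just a conditioning vector.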
From a deployment perspective, the model is highly accessible:
- Parameter count: 400M (0.4B) parameters.
- Speed: it achieves a Real Time Factor (RTF) of 0.2, meaning it can produce 10 seconds of speech in about 2 seconds.
- Hardware: it requires only 3GB of VRAM, making it compatible with consumer-grade GPUs like the RTX 3060 or 4050.
- License: released under the Apache 2.0 license, permitting commercial use.
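The RTF figure is easy to unpack: RTF is synthesis time divided by audio duration, so synthesis time is simply duration × RTF.

```python
# Real Time Factor: RTF = synthesis_time / audio_duration.
# At RTF 0.2, speech is generated 5x faster than real time.
def synthesis_time(audio_seconds: float, rtf: float = 0.2) -> float:
    """Wall-clock seconds needed to generate `audio_seconds` of speech."""
    return audio_seconds * rtf

print(synthesis_time(10.0))  # 10 s of speech in about 2 s
```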
Key takeaways
- Efficient architecture: the model packs roughly 400M parameters, built on LiquidAI’s LFM2 (350M) backbone. This ‘audio-as-language’ approach treats speech as discrete tokens, allowing faster processing and more human-like intonation than traditional architectures.
- Training at scale: the English Kani-TTS-2 model was trained on 10,000 hours of high-quality speech data in just 6 hours using 8 NVIDIA H100 GPUs.
- Instant zero-shot cloning: no fine-tuning is needed to replicate a specific voice. Given a brief reference audio clip, the model uses speaker embeddings to quickly synthesize text in the target speaker’s voice.
- High performance on edge hardware: with a Real Time Factor (RTF) of 0.2, the model can produce 10 seconds of audio in approximately 2 seconds. It requires only 3GB of VRAM, making it fully functional on consumer-grade GPUs like the RTX 3060.
- Developer-friendly licensing: released under the Apache 2.0 license, Kani-TTS-2 is ready for commercial integration. It provides a local-first, low-latency alternative to expensive closed-source TTS APIs.
Check out the model weights on Hugging Face.
Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.

