# Introduction
Voice-enabled applications are everywhere, from virtual assistants to customer service chatbots. But for developers, creating natural-sounding speech in an app often means relying on expensive cloud APIs or settling for robotic, unnatural voices.
Mistral AI aims to change this with Voxtral TTS, a powerful open-weight text-to-speech (TTS) model that you can run on your own hardware. Released on March 26, 2026, the 4-billion-parameter model generates human-like speech in nine languages and can clone a new voice from just three seconds of reference audio.
In this Voxtral TTS tutorial, you’ll learn how the model works, what makes its voice cloning and low-latency performance so special, and how to start generating speech with a few lines of Python code.
# What is Voxtral TTS?
Voxtral TTS is the first TTS model from Mistral AI. Unlike many commercial offerings that lock you into a cloud API, Voxtral TTS is released with open weights: you can download the model and run it entirely on your own infrastructure, giving you full control over your data, costs, and customization.
The model is built on Mistral's existing Ministral 3B architecture, making it small enough to run on consumer hardware, including laptops and edge devices. According to Mistral, Voxtral TTS delivers "frontier quality": performance that matches or exceeds leading proprietary systems in human listening tests.
// Open Weight vs Open Source
It is important to understand that "open weight" is not the same as fully open source. Voxtral TTS gives you access to the trained model weights, which you can use for research and personal projects under the CC BY-NC 4.0 license. However, commercial use requires a separate licensing agreement or the use of Mistral's paid API.
// Key Features
Voxtral TTS offers a powerful set of features designed for real-world voice applications:
- It can clone a new voice from just 3 seconds of reference audio.
- Offers low latency, with 70 ms model latency and roughly 100 ms time-to-first-audio.
- Achieves a real-time factor (RTF) of 9.7x, which means it produces 10 seconds of speech in just over 1 second.
- Supports 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi and Arabic.
- Has 4 billion parameters.
- Provides open access under CC BY-NC 4.0 for non-commercial use, with an API option for commercial projects.
- Includes native support for low-latency streaming inference.
# Cloning a voice from three seconds of audio
One of the most impressive capabilities of Voxtral TTS is zero-shot voice cloning. Traditional voice cloning systems often require 30 seconds or more of reference audio to capture a person's voice. Voxtral TTS needs just 3 seconds.
When you provide a short audio clip, the model analyzes the speaker's unique characteristics, such as accent, intonation, rhythm, and even emotional tone, and can then generate new speech in the same voice. This works across all nine supported languages, meaning you can create a multilingual voice clone that speaks English, French, or Hindi while preserving the original voice identity.
// How does Voxtral TTS compare to ElevenLabs?
In blind human evaluations conducted by native speakers across all nine languages, Voxtral TTS achieved an overall 68.4% win rate over ElevenLabs Flash v2.5. The model performed especially well in the following languages:
| Language | Win Rate vs ElevenLabs Flash v2.5 |
|---|---|
| Spanish | 87.8% |
| Hindi | 79.8% |
| Portuguese | 74.4% |
| Arabic | 72.9% |
| German | 72.0% |
| English | 60.8% |
| Italian | 57.1% |
| French | 54.4% |
| Dutch | 49.4% |
# Latency Performance: Built for real-time conversations
For voice agents and interactive applications, speed matters. A delay of even a few hundred milliseconds can make the conversation feel awkward or broken.
Voxtral TTS is specifically designed for low-latency streaming inference. According to Mistral's official documentation, the model achieves:
- Model latency of 70 ms for a typical input of a 10-second audio sample and 500 characters of text.
- Time-to-first-audio (TTFA) of ~100 ms: the time from when you send the text to when you hear the first sound.
- RTF of 9.7x, meaning it can generate audio almost ten times faster than real time.
To put this in perspective: a 10 second audio clip can be produced in just over 1 second. This makes Voxtral TTS suitable for real-time applications such as:
- Conversational AI agents
- Live customer support systems
- Real-time translation tools
- Voice-enabled IoT devices
The model can also generate continuous audio for up to two minutes without interruption.
// Understanding the Real-Time Factor
RTF measures how fast a model produces audio compared to the actual duration of that audio. An RTF of 1.0 means generation takes the same amount of time as the audio itself. An RTF of 9.7 means generation is 9.7 times faster than real time: a 10-second clip takes just 1.03 seconds to produce.
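That conversion can be captured in a one-line helper (illustrative arithmetic only; `generation_time` is not part of any SDK):

```python
def generation_time(audio_seconds: float, rtf: float = 9.7) -> float:
    """Seconds needed to generate `audio_seconds` of audio at a given
    real-time factor (RTF). RTF > 1 means faster than real time."""
    return audio_seconds / rtf

# A 10-second clip at RTF 9.7 takes about 1.03 seconds to generate.
print(round(generation_time(10), 2))  # 1.03
```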
# How does Voxtral TTS work?
Without going too deep into the mathematics, here is a high-level overview of the model’s architecture.
Voxtral TTS uses a hybrid approach that combines two technologies:
- Semantic token generation. The model first generates “semantic tokens” that represent the meaning and structure of what is to be said. This is similar to the way a language model generates text tokens.
- Flow matching for acoustic tokens. These semantic tokens are then converted into acoustic tokens that represent the actual sound waves of speech.
Both types of tokens are encoded and decoded using the Voxtral codec, which is a custom speech tokenizer trained from scratch with a hybrid vector quantization – finite scalar quantization (VQ-FSQ) scheme.
This two-step process lets the model separate what is said (the content) from how it is said (voice style, emotion, pronunciation). This is why the model can clone voices from such a small sample: it learns the "how" from the reference audio and applies it to any text.
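The two stages can be sketched in toy Python. Everything here, including the function names `to_semantic_tokens`, `flow_match`, and `codec_decode`, is a hypothetical stand-in for the real components, not Mistral's actual code:

```python
# Toy sketch of the two-stage pipeline; all functions are hypothetical
# placeholders, not the real Voxtral TTS implementation.

def to_semantic_tokens(text: str) -> list[int]:
    # Stage 1: an autoregressive model maps text to "semantic" tokens that
    # capture content and structure (toy stand-in: one token per word).
    return [sum(ord(c) for c in word) % 1024 for word in text.split()]

def flow_match(semantic: list[int], voice_ref: list[int]) -> list[int]:
    # Stage 2: flow matching turns semantic tokens into acoustic tokens,
    # conditioned on features from the 3-second reference audio.
    return [(s + v) % 2048 for s, v in zip(semantic, voice_ref * len(semantic))]

def codec_decode(acoustic: list[int]) -> bytes:
    # The Voxtral codec (VQ-FSQ) would decode acoustic tokens into a
    # waveform; here we just emit placeholder bytes.
    return bytes(a % 256 for a in acoustic)

voice_ref = [7]  # stand-in for features extracted from the reference clip
audio = codec_decode(flow_match(to_semantic_tokens("Hello world"), voice_ref))
print(type(audio).__name__)  # bytes
```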
For more in-depth technical information, see the full Voxtral TTS paper on arXiv.
# Getting Started: Installation and Setup
You can use Voxtral TTS in two ways:
- Via the Mistral API – easiest for quick testing and commercial use.
- Self-hosted with open weights – full control, free for non-commercial use.
Prerequisites:
- Basic familiarity with Python and command line.
- Python 3.10 or higher.
- The pip package manager.
- For self-hosting: an NVIDIA GPU (8GB+ VRAM recommended) or Apple Silicon Mac.
// Option 1: Using Mistral API
Mistral provides a simple Python SDK. First, install the Mistral AI client:
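Assuming the SDK is published on PyPI under the name `mistralai`, the install step is:

```shell
pip install mistralai
```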
Then, generate speech with just a few lines:

```python
from mistralai import Mistral

api_key = "your-api-key"  # Get from console.mistral.ai
client = Mistral(api_key=api_key)

response = client.audio.speech.create(
    model="voxtral-tts-26-03",
    input="Hello, world! This is a test of Voxtral TTS.",
    voice="alloy",  # or a custom voice prompt
)

# Save the audio to a file
with open("output.wav", "wb") as f:
    f.write(response.audio)
```
The API costs $0.016 per 1,000 characters. You can also test the model for free in Mistral Studio.
// Option 2: Self-Hosting with Open Weights
For self-hosting, you can download the model weights from Hugging Face. The model is released under the CC BY-NC 4.0 license. A popular community-developed alternative uses int4 quantization for efficient inference; the voxtral-int4 implementation yields:
- 4.6x real-time speech generation.
- 3.7GB of VRAM usage on an RTX 3090.
- A 54% VRAM reduction compared to full precision.
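As a quick sanity check on those figures (plain arithmetic, not from the release notes): a 54% reduction down to 3.7 GB implies roughly 8 GB at full precision.

```python
quantized_gb = 3.7   # reported int4 VRAM usage
reduction = 0.54     # reported reduction vs. full precision
full_precision_gb = quantized_gb / (1 - reduction)
print(round(full_precision_gb, 1))  # 8.0
```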
# Voice cloning with custom voices: a practical example
One of the most powerful features is the ability to adapt the model to any voice. Here is a complete example using the Mistral API:
```python
from mistralai import Mistral

api_key = "your-api-key"
client = Mistral(api_key=api_key)

# Step 1: Load or record a reference audio file (3+ seconds)
reference_audio_path = "my_voice_sample.wav"

# Step 2: Open the audio file for upload
with open(reference_audio_path, "rb") as f:
    audio_content = f.read()

# Step 3: Generate speech using the cloned voice
response = client.audio.speech.create(
    model="voxtral-tts-26-03",
    input="This is my voice, cloned from just a few seconds of audio.",
    voice=audio_content,  # Pass the reference audio directly
)

# Save the generated speech
with open("cloned_voice_output.wav", "wb") as f:
    f.write(response.audio)
```
Reference audio must be clear, free of background noise, and at least 3 seconds long. The longer the sample (up to about 25 seconds), the better the cloned voice quality.
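Before uploading a reference clip, you can verify the 3-second minimum locally. This sketch is a hypothetical helper, not part of the SDK; it uses only Python's standard-library `wave` module and therefore handles WAV files only:

```python
import wave

def check_reference(path: str, min_sec: float = 3.0, max_sec: float = 25.0) -> float:
    """Return a WAV clip's duration in seconds, raising if it is shorter
    than the minimum needed for cloning. The 3 s floor and ~25 s ceiling
    follow the guidance above."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    if duration < min_sec:
        raise ValueError(f"Clip is {duration:.1f}s; need at least {min_sec:g}s.")
    if duration > max_sec:
        print(f"Note: {duration:.1f}s given; gains taper off past {max_sec:g}s.")
    return duration
```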
# Use Cases
Here are practical scenarios where Voxtral TTS excels:
- Voice assistants and chatbots. Low latency (~100ms TTFA) means conversations feel natural and responsive. Unlike cloud-based APIs, which add network costs, self-hosted Voxtral TTS can keep everything on your own servers.
- Multilingual customer support. With support for nine major languages and cross-language voice cloning, a single model can serve global customers. For example, you can generate English speech with a French accent from a short French reference clip.
- Content localization. Translate and dub videos, podcasts, or e-learning content into multiple languages while preserving the original speaker’s voice identity in all languages.
- Accessibility Tools. Build screen readers and assistive technologies with natural, expressive voices that users can customize to their preferred voice.
- Gaming and interactive media. Generate dynamic character dialogue in real-time, adapting to player preferences without pre-recording every line.
# Licensing and deployment considerations
// Open Weight (CC BY-NC 4.0)
- Permission: Research, personal projects, educational use, internal testing.
- Not allowed: Commercial products, revenue generating services, redistribution for commercial purposes.
- Attribution to Mistral AI is required.
// Commercial Use
For business applications, you have two options:
- Use Mistral’s API – pay $0.016 per 1,000 characters.
- Negotiate a commercial license – Contact Mistral for enterprise licensing.
If you need unlimited scaling without per-request costs, self-hosting with a commercial license is the most cost-effective route for high-volume use cases. For low to medium volumes, the API is simpler.
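To estimate where the break-even sits, you can start from the published $0.016 per 1,000 characters rate. A minimal sketch (it ignores discounts, minimums, and self-hosting hardware costs):

```python
PRICE_PER_1K_CHARS = 0.016  # USD, Mistral's published API rate

def monthly_api_cost(chars_per_month: int) -> float:
    """Estimated monthly API spend in USD for a given character volume."""
    return chars_per_month / 1000 * PRICE_PER_1K_CHARS

# 1M characters per month costs about $16.
print(monthly_api_cost(1_000_000))  # 16.0
```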
# Conclusion
Voxtral TTS brings frontier-quality, open-weight text-to-speech within reach of any developer. With only 3 seconds of audio for voice cloning, 70 ms model latency, and a 9.7x real-time factor, it is built for the real-time, conversational applications users expect today.
Whether you choose the simplicity of Mistral’s API or the full control of a self-hosted deployment, Voxtral TTS gives you a powerful foundation for adding natural, expressive speech to your projects.
Shittu Olumide is a software engineer and technical writer who is passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and the ability to simplify complex concepts. You can also find Shittu on Twitter.