Chroma 1.0 is a real-time speech-to-speech dialogue model that takes audio as input and returns audio as output while preserving speaker identity across multi-turn conversations. It is presented as the first open-source end-to-end spoken dialogue system that combines low-latency interaction with high-fidelity personalized voice cloning from just a few seconds of reference audio.
The model works directly on discrete speech representations rather than on text transcripts. It targets the same use cases as commercial real-time agents, but with a compact 4B parameter dialogue core and a design that treats speaker similarity as the primary objective rather than a supporting feature. Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline and reaches a Real Time Factor (RTF) of 0.43, so it can generate speech more than 2 times faster than playback.

From Cascaded ASR ➡️ LLM ➡️ TTS to End-to-End S2S
Most production assistants still use a three-stage pipeline: automatic speech recognition to convert audio to text, a large language model for reasoning, and text-to-speech synthesis for the reply. This structure is flexible, but it adds latency and discards paralinguistic information such as timing, sentiment, speaking rate, and prosody the moment audio is converted to text. That loss of acoustic detail directly harms speaker fidelity and naturalness in real-time conversation.
Chroma belongs to a newer class of speech-to-speech systems that map between sequences of codec tokens. A speech tokenizer and neural codec produce quantized acoustic codes, and a language model reasons and responds over a sequence that combines text tokens and audio codes, without an explicit intermediate transcript. This keeps prosody and speaker identity visible to the model through the entire processing chain.
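To make the contrast concrete, here is a toy Python sketch of the two information flows. Every function is a trivial stand-in with no real ASR, LLM, or TTS behind it; it only illustrates where acoustic detail is dropped versus preserved.

```python
# Toy contrast of the two designs; all functions are placeholder stand-ins.

def asr(audio):
    # audio -> text: timing, prosody, and speaker identity are discarded here
    return "transcript of the user turn"

def llm(text):
    # reasoning happens over text only
    return "text of the reply"

def tts(text):
    # text -> audio: the voice has to be re-imposed from scratch
    return [0.0] * 16000  # fake waveform samples

def cascaded_assistant(user_audio):
    return tts(llm(asr(user_audio)))

def end_to_end_s2s(user_audio_codes, speech_lm, codec_decoder):
    # One model maps input codec tokens directly to output codec tokens,
    # so paralinguistic information can survive the whole chain.
    return codec_decoder(speech_lm(user_audio_codes))
```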
Architecture: Reasoner + Speech Generation Stack
Chroma 1.0 has two main subsystems. The Chroma Reasoner handles multimodal understanding and text generation. The speech stack (Chroma Backbone, Chroma Decoder, and Chroma Codec Decoder) converts that semantic output into personalized response audio.
The Chroma Reasoner is built on the Thinker module of the Qwen-Omni series and uses the Qwen2-Audio encoding pipeline. It processes text and audio input with a shared front end, fuses them with cross-modal attention, and aligns them over time using Time-aligned Multimodal Rotary Position Embedding (TM-RoPE). The output is a sequence of hidden states that carries both linguistic content and acoustic cues, for example rhythm and loudness.


The Chroma Backbone is a 1B parameter LLaMA-style model based on Llama 3. It is conditioned on the target voice following the CSM-1B approach, which encodes a short reference audio clip and its transcript into embeddings that are concatenated onto the sequence. During inference, the token embeddings and hidden states from the Reasoner are fed in as a unified context, so the Backbone always sees the semantic state of the dialogue while it generates acoustic codes.
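A rough sketch of how the Backbone's context could be assembled under this description. All names and the simple list concatenation below are hypothetical; the actual conditioning format is defined by the released model code.

```python
# Hypothetical sketch: speaker prompt followed by the Reasoner's semantic context.

def build_backbone_context(ref_transcript_tokens, ref_audio_codes,
                           reasoner_token_embeddings, reasoner_hidden_states):
    # Voice identity: reference transcript plus its audio codes (CSM-1B style prompt).
    speaker_prompt = list(ref_transcript_tokens) + list(ref_audio_codes)
    # Semantic state of the dialogue: token embeddings paired with hidden states.
    semantic_context = list(zip(reasoner_token_embeddings, reasoner_hidden_states))
    return speaker_prompt + semantic_context

context = build_backbone_context(
    ["Hi", "there"], ["a0", "a1", "a2", "a3"],
    ["emb_0", "emb_1"], ["hid_0", "hid_1"],
)
print(len(context))  # 8 entries: 2 transcript tokens + 4 audio codes + 2 semantic steps
```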
To support streaming, the system uses a fixed 1:2 interleaving schedule: for each text token from the Reasoner, the Backbone generates two audio code tokens. This lets the model start producing speech as soon as text generation begins instead of waiting for complete sentences, and this interleaving is the main mechanism behind the low time to first token.
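A minimal sketch of that fixed 1:2 schedule. Token values here are placeholder strings; the real model works on token ids generated autoregressively.

```python
# Interleave one text token with two audio code tokens per step.

def interleave_1_to_2(text_tokens, audio_codes):
    """Emit one text token followed by two audio code tokens, so speech
    generation can start as soon as the first text token is available."""
    stream = []
    audio_iter = iter(audio_codes)
    for t in text_tokens:
        stream.append(("text", t))
        for _ in range(2):
            stream.append(("audio", next(audio_iter, None)))
    return stream

print(interleave_1_to_2(["Hel", "lo", "!"], [f"c{i}" for i in range(6)]))
# [('text', 'Hel'), ('audio', 'c0'), ('audio', 'c1'),
#  ('text', 'lo'),  ('audio', 'c2'), ('audio', 'c3'),
#  ('text', '!'),   ('audio', 'c4'), ('audio', 'c5')]
```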
The Chroma Decoder is a lightweight LLaMA variant with about 100M parameters. The Backbone predicts only the first residual vector quantization (RVQ) codebook per frame, a coarse representation. The Decoder then takes the Backbone hidden state and the first code and autoregressively predicts the remaining RVQ levels within the same frame. This factorization keeps long-context temporal structure in the Backbone and restricts the Decoder to frame-local refinement, which reduces computation while improving fine-grained prosody and expressivity.
The Chroma Codec Decoder combines the coarse and fine codes and maps them to waveform samples. It follows the decoder design of the Mimi codec and uses a causal convolutional neural network, so each output sample depends only on past context, which is necessary for streaming. The system uses 8 codebooks, which limits the number of autoregressive refinement steps in the Decoder while preserving enough detail for voice cloning.
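A toy illustration of the coarse/fine split over the 8 RVQ codebooks per frame. The code below only mimics the bookkeeping; the real Backbone and Decoder predict code ids from hidden states.

```python
NUM_CODEBOOKS = 8  # 1 coarse level from the Backbone + 7 fine levels from the Decoder

def backbone_step(frame_idx):
    # Placeholder: the Backbone emits only the first (coarse) RVQ code of a frame.
    return {"frame": frame_idx, "codes": [f"coarse_{frame_idx}"]}

def decoder_refine(frame):
    # Placeholder: the Decoder fills in the remaining fine codes of the SAME frame,
    # so refinement stays frame-local and cheap.
    for level in range(1, NUM_CODEBOOKS):
        frame["codes"].append(f"fine_{frame['frame']}_{level}")
    return frame

frames = [decoder_refine(backbone_step(i)) for i in range(3)]
print(len(frames[0]["codes"]))  # 8 codes per frame, ready for the Codec Decoder
```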
Training Setup and Synthetic Speech to Speech (S2S) Data
High-quality speech dialogue data with strong reasoning signals is rare, so Chroma relies on a synthetic speech-to-speech (S2S) pipeline. A Reasoner-like LLM first generates textual answers to user questions. A text-to-speech (TTS) system then synthesizes target speech for those answers that matches the reference voice. These synthetic pairs train the Backbone and Decoder to perform acoustic modeling and voice cloning, while the Reasoner stays frozen and acts purely as a provider of text embeddings and multimodal hidden states.
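The pipeline can be summarized with a small schematic sketch; `llm_answer` and `tts_synthesize` below are placeholders for whatever LLM and TTS systems were actually used.

```python
def llm_answer(question_text):
    # Placeholder for a Reasoner-like LLM that writes the textual answer.
    return f"Answer to: {question_text}"

def tts_synthesize(text, reference_audio):
    # Placeholder for a TTS system that renders the answer in the reference voice.
    return {"text": text, "voice_from": reference_audio}

def make_s2s_pair(question_audio, question_text, reference_audio):
    answer_text = llm_answer(question_text)
    answer_audio = tts_synthesize(answer_text, reference_audio)
    # (input speech, target speech) pairs train the Backbone and Decoder,
    # while the Reasoner stays frozen.
    return question_audio, answer_audio
```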
Voice cloning quality and comparison with existing systems
Objective evaluation uses the SEED-TTS-EVAL protocol on English CommonVoice speakers. Chroma operates at a 24 kHz sampling rate and achieves a speaker similarity score of 0.81, against a human baseline of 0.73. CosyVoice 3 reaches 0.72, and most other TTS baselines fall below the human reference. The research team reports this as a 10.96% relative improvement over the human baseline, indicating that the model reproduces fine paralinguistic detail more consistently than the human recordings on this metric.
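The relative-improvement figure follows directly from the two similarity scores:

```python
human_baseline = 0.73
chroma = 0.81
relative_improvement = (chroma - human_baseline) / human_baseline
print(f"{relative_improvement:.2%}")  # 10.96%
```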


Subjective evaluation compares Chroma to the ElevenLabs eleven_multilingual_v2 model. In the naturalness CMOS test, listeners prefer ElevenLabs 57.2% of the time versus 24.4% for Chroma, with 18.3% ties. In the speaker similarity CMOS test, the scores are very close: 42.4% for ElevenLabs and 40.6% for Chroma, with 17.0% ties. A follow-up test asking which audio sounded more natural, the ElevenLabs output or the original recording, yields a 92.0% preference for ElevenLabs versus 8.0% for the ground truth, indicating that perceived naturalness and speaker fidelity are not the same thing.
Latency and real-time behavior
Latency is measured with a single concurrent stream. For a response of 38.80 seconds, the total generation time is 16.58 seconds, giving a real time factor (RTF) of 0.43. The Reasoner contributes an average of 119.12 ms to the time to first token (TTFT), the Backbone 8.48 ms, and the Decoder 19.27 ms per frame. The Codec Decoder works on groups of 4 frames, so per-token TTFT does not apply to that component. The total time to first token is 146.87 ms, well under a second and suitable for interactive dialogue.
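Both headline latency numbers can be reproduced from the per-component figures:

```python
# Time to first token: Reasoner + Backbone + Decoder contributions.
ttft_ms = 119.12 + 8.48 + 19.27
print(f"{ttft_ms:.2f} ms")   # 146.87 ms

# Real time factor: generation time divided by audio duration.
audio_seconds = 38.80        # length of the generated response
generation_seconds = 16.58   # wall-clock time needed to produce it
rtf = generation_seconds / audio_seconds
print(f"RTF = {rtf:.2f}")    # 0.43, i.e. more than 2x faster than playback
```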


Spoken Dialogue and Reasoning Benchmarks
Chroma is evaluated on the basic track of URO-Bench. It uses only 4B parameters yet achieves an overall task achievement score of 57.44%. GLM-4-Voice, a 9B parameter model, leads with 69.09%. Chroma ranks second overall and outperforms the 7B and 0.5B Omni baselines on several dimensions. It reaches 71.14% on Storal, 51.69% on TruthfulQA, and 22.74% on GSM8K. Among the spoken conversation metrics it posts the highest scores, 60.26% on MLC and 62.07% on CommonVoice.


Critically, Chroma is the only model in this comparison that supports personalized voice cloning; all the other systems focus solely on spoken content and reasoning. Chroma therefore delivers competitive cognitive capability while adding high-fidelity, real-time voice personalization.
Key Takeaways
- End-to-end real-time speech-to-speech: Chroma 1.0 is a 4B parameter spoken dialogue model that maps speech directly to speech using codec tokens. It avoids explicit ASR and TTS steps and preserves prosody and speaker identity through the entire pipeline.
- Reasoner plus speech stack architecture: The system combines a Qwen-based Chroma Reasoner with a 1B LLaMA-style Backbone, a 100M Chroma Decoder, and a Mimi-based Codec Decoder. It uses a fixed 1:2 text-to-audio token interleaving schedule over RVQ codebooks to support streaming and a low time to first token.
- Strong personalized voice cloning: On SEED-TTS-EVAL with CommonVoice speakers, Chroma reaches a speaker similarity score of 0.81 at 24 kHz, reported as a 10.96% relative improvement over the human baseline of 0.73, and it outperforms CosyVoice 3 and other TTS baselines.
- Sub-second latency and faster-than-real-time generation: Single-stream measurement on an H200 GPU gives a total time to first token of about 147 ms. The model produces the audio for a 38.80 second response in 16.58 seconds, a real time factor of 0.43, which is more than 2 times faster than playback.
- Competitive dialogue and reasoning, with cloning as a unique feature: On the URO-Bench basic track, Chroma achieves 57.44% overall task achievement and competitive scores on Storal, TruthfulQA, GSM8K, MLC, and CommonVoice.
