Inworld AI Releases TTS-1.5 for Realtime, Production Grade Voice Agents


Inworld AI has introduced Inworld TTS-1.5, an upgrade to its TTS-1 family that targets realtime voice agents with tight constraints on latency, quality, and cost. TTS-1.5 is described as the number one ranked text to speech system on Artificial Analysis, and it is designed to be more expressive and more stable than previous generations while remaining affordable enough for large-scale consumer deployment.

Realtime latency for interactive agents

TTS-1.5 focuses on P90 time to first audio, the latency within which 90 percent of requests deliver their first audio, which is an important metric for user-perceived responsiveness. For TTS-1.5 Max, P90 time to first audio is under 250 ms. For TTS-1.5 Mini, it is under 130 ms. According to Inworld, these values are approximately 4 times faster than the previous TTS generation.
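To make the metric concrete, the sketch below computes a P90 value from a list of time to first audio measurements using the nearest-rank percentile method. The sample values are invented for illustration and are not Inworld benchmark data, and the function name is hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical time to first audio measurements in milliseconds (not real data).
ttfa_ms = [112, 98, 120, 105, 131, 101, 117, 109, 95, 240]

p50 = percentile_nearest_rank(ttfa_ms, 50)
p90 = percentile_nearest_rank(ttfa_ms, 90)
print(f"P50 = {p50} ms, P90 = {p90} ms")
```

A P90 target is stricter than an average: in this made-up sample the median is 109 ms, but the P90 is 131 ms because the slower requests count toward it.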

The TTS-1.5 stack supports streaming over WebSockets so that synthesis and playback can begin as soon as the first audio segment is generated. In practice this keeps end-to-end interaction latency in the same range as typical realtime language model responses when models run on modern GPUs, which is important when the TTS is part of a full agent pipeline.
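The following is a minimal sketch of why streaming helps, not Inworld's client code. It uses a fake asynchronous generator in place of a real WebSocket connection (the names `fake_synthesis_stream` and `play_stream`, the chunk sizes, and the delays are all assumptions) and records how soon the first chunk arrives compared with the full utterance.

```python
import asyncio
import time


async def fake_synthesis_stream(n_chunks=5, first_delay=0.02, chunk_delay=0.01):
    """Stand-in for a streaming TTS connection; yields placeholder audio chunks."""
    await asyncio.sleep(first_delay)  # simulated time until the first audio segment
    for i in range(n_chunks):
        if i:
            await asyncio.sleep(chunk_delay)  # simulated gap between later segments
        yield bytes(320)  # 320 zero bytes as dummy audio


async def play_stream(stream):
    """Consume chunks as they arrive and record time to first audio."""
    start = time.perf_counter()
    ttfa = None
    total_bytes = 0
    async for chunk in stream:
        if ttfa is None:
            ttfa = time.perf_counter() - start  # playback could begin here
        total_bytes += len(chunk)  # a real client would pass the chunk to an audio device
    total = time.perf_counter() - start
    return ttfa, total, total_bytes


ttfa, total, total_bytes = asyncio.run(play_stream(fake_synthesis_stream()))
print(f"first audio after {ttfa * 1000:.0f} ms, full audio after {total * 1000:.0f} ms")
```

With streaming, the user-perceived delay is the time to the first chunk, not the time to synthesize the whole utterance.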

Inworld recommends TTS-1.5 Max for most applications because it balances latency close to 200 ms with high stability and quality. TTS-1.5 Mini is positioned for latency-sensitive workloads such as realtime gaming or ultra-responsive voice agents where every millisecond matters.

Expression, stability and benchmark status

TTS-1.5 builds on TTS-1 and, according to Inworld, offers approximately 30 percent more expressive range and approximately 40 percent better stability than previous models.

Expression here refers to characteristics such as rhythm, emphasis, and emotional variation. Stability is measured by metrics such as word error rate and output consistency across long sequences and varied inputs. A lower word error rate reduces issues such as truncated sentences, unintended word substitutions, and audio artifacts, which is important when TTS output is driven directly by text generated by a language model.
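Word error rate is usually defined as the word-level edit distance between a reference text and a transcript, divided by the number of reference words. For TTS, one common approach is to transcribe the synthesized audio with a speech recognizer and compare that transcript with the input text. The article does not say how Inworld measures it, so the sketch below shows only the standard calculation, with invented example sentences.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Input text sent to the TTS system versus a transcript of the audio it produced.
intended = "please confirm your order before checkout"
transcribed = "please confirm the order before"  # one substitution, one truncated word

wer = word_error_rate(intended, transcribed)
print(f"WER = {wer:.2f}")
```

In this example the truncated final word and the replaced word each count as one error, giving 2 errors over 6 reference words.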

Consumer scale pricing and cost profile

TTS-1.5 is priced in two main configurations. Inworld TTS-1.5 Mini costs $5 per 1 million characters, which is approximately $0.005 per minute of speech. TTS-1.5 Max costs $10 per 1 million characters, which is about $0.01 per minute.
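The per-minute figures follow from the per-character prices if one minute of speech corresponds to roughly 1,000 characters. That rate is an assumption implied by the quoted numbers, not a value stated by Inworld, and real speech rates vary by language and voice. A short sketch of the arithmetic, with hypothetical names:

```python
# Prices quoted in the article, in US dollars per 1 million characters.
PRICE_PER_MILLION_CHARS = {"mini": 5.0, "max": 10.0}

# Assumption: about 1,000 characters of text per minute of synthesized speech.
ASSUMED_CHARS_PER_MINUTE = 1_000


def cost_per_minute(model, chars_per_minute=ASSUMED_CHARS_PER_MINUTE):
    """Estimated cost in dollars for one minute of speech."""
    return PRICE_PER_MILLION_CHARS[model] * chars_per_minute / 1_000_000


def cost_for_minutes(model, minutes, chars_per_minute=ASSUMED_CHARS_PER_MINUTE):
    """Estimated cost in dollars for a given number of minutes of speech."""
    return cost_per_minute(model, chars_per_minute) * minutes


for name in ("mini", "max"):
    per_minute = cost_per_minute(name)
    per_100k = cost_for_minutes(name, 100_000)
    print(f"{name}: ${per_minute:.3f} per minute, ${per_100k:,.0f} per 100,000 minutes")
```

Changing `chars_per_minute` shows how sensitive the per-minute estimate is to the assumed speaking rate.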

This cost profile makes it possible to run TTS continuously without it becoming a major variable cost in high-usage products such as voice-native companions, education platforms, or customer support lines.

Multilingual support, voice cloning and deployment options

Inworld TTS-1.5 supports 15 languages: English, Spanish, French, Korean, Dutch, Chinese, German, Italian, Japanese, Polish, Portuguese, Russian, Hindi, Arabic, and Hebrew. This allows a single TTS pipeline to cover a wide range of markets without maintaining separate models per region.

The system provides instant voice cloning and professional voice cloning. Instant voice cloning can create a custom voice from about 15 seconds of audio and expose it directly through the Inworld portal and API. Professional voice cloning uses at least 30 minutes of clean audio, with 20 minutes or more recommended for best results, and targets branded voices and less common accents.

For deployment, TTS-1.5 is available as a cloud API and also as an on-premises solution, where the entire model runs inside the customer infrastructure for data sovereignty and compliance. The same quality profile is maintained in both deployment modes, and the models integrate with partner platforms like LiveKit, Pipecat, and Vapi for an end-to-end voice agent stack.

Key takeaways

  • Inworld TTS-1.5 delivers realtime performance, with a P90 time to first audio of less than 250 ms for the Max model and less than 130 ms for the Mini model, approximately 4 times faster than the previous generation.
  • The model increases expressivity by approximately 30 percent and improves stability by approximately 40 percent, including lower word error rates.
  • Pricing is aimed at consumer scale, with TTS-1.5 Mini costing $5 per 1 million characters and TTS-1.5 Max costing $10 per 1 million characters, significantly cheaper per minute than many competing systems.
  • TTS-1.5 supports 15 languages and provides instant and professional voice cloning, enabling custom and branded voices from short reference audio or longer recorded datasets.
  • The system is available as a cloud API and as an on-premises deployment, and integrates with existing voice agent stacks, making it suitable for production realtime agents that require clear guarantees on latency, quality, and data control.
