NVIDIA AI releases Nemotron Speech ASR: a new open source transcription model designed from the ground up for low-latency use cases like voice agents


NVIDIA recently released Nemotron Speech ASR, a new streaming English transcription model built specifically for low-latency voice agents and live captioning. Published on Hugging Face as nvidia/nemotron-speech-streaming-en-0.6b, it combines a cache-aware FastConformer encoder with an RNNT decoder and is tuned for both streaming and batch workloads on modern NVIDIA GPUs.

Model design, architecture and input assumptions

Nemotron Speech ASR (Automatic Speech Recognition) is a 600M parameter model based on a cache-aware FastConformer encoder with 24 layers and an RNNT decoder. The encoder uses aggressive 8x convolutional downsampling to reduce the number of time steps, which directly reduces compute and memory costs for streaming workloads. The model consumes 16 kHz mono audio and requires at least 80 ms of input audio per segment.
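
These input assumptions can be checked with some back-of-the-envelope arithmetic. The sketch below is my own calculation, not an official NeMo utility; the 10 ms feature hop is an assumption based on the standard FastConformer front end:

```python
# Back-of-the-envelope check of the model's input assumptions.

SAMPLE_RATE = 16_000   # Hz, mono audio
MIN_SEGMENT_MS = 80    # smallest audio segment the model accepts
FEATURE_HOP_MS = 10    # assumed mel-spectrogram hop (typical FastConformer setup)
DOWNSAMPLE = 8         # 8x convolutional downsampling in the encoder

# Raw samples in the minimum 80 ms segment.
min_samples = SAMPLE_RATE * MIN_SEGMENT_MS // 1000   # 1280 samples

# After 8x downsampling of 10 ms feature frames, one encoder time step
# covers 80 ms of audio -- which is why segment sizes are quoted in
# multiples of 80 ms.
encoder_frame_ms = FEATURE_HOP_MS * DOWNSAMPLE       # 80 ms

print(min_samples, encoder_frame_ms)
```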

Runtime latency is controlled through configurable context sizes. The model exposes four standard segment configurations corresponding to approximately 80 ms, 160 ms, 560 ms, and 1.12 s of audio. These modes are driven by the att_context_size parameter, which sets the left and right attention contexts in multiples of 80 ms frames and can be changed at inference time without retraining.
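
A hedged sketch of how the four quoted segment sizes likely map onto right-context settings, assuming 80 ms encoder frames. The specific right-context values (0, 1, 6, 13 frames) are my inference from the published chunk sizes, not values copied from the model config:

```python
ENCODER_FRAME_MS = 80  # one encoder time step after 8x downsampling

def chunk_latency_ms(right_context_frames: int) -> int:
    """Audio the model must buffer before emitting a token:
    the current frame plus `right_context_frames` of lookahead."""
    return (1 + right_context_frames) * ENCODER_FRAME_MS

# Right-context values of 0, 1, 6, and 13 frames reproduce the four
# documented modes: 80 ms, 160 ms, 560 ms, and 1120 ms.
for r in (0, 1, 6, 13):
    print(r, chunk_latency_ms(r))
```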

Cache aware streaming, not buffered sliding window

Traditional ‘streaming ASR’ often uses overlapping windows. Each incoming window reprocesses part of the previous audio to maintain context, which wastes computation and drives latency upward as concurrency increases.

Nemotron Speech ASR instead maintains a cache of encoder states for all self-attention and convolution layers. Each new segment is processed once, with the model reusing cached activations rather than recalculating the overlapping context. This gives:

  • Non-overlapping frame processing, so compute scales linearly with audio length
  • Predictable memory growth, since the cache scales with sequence length rather than with duplicated overlapping context
  • Stable latency under load, which is important for turn taking and interruptions in voice agents
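
The difference can be made concrete with a toy frame count. The sketch below is my own illustration, not NeMo code; the window and hop sizes are arbitrary assumptions chosen only to show the shape of the tradeoff:

```python
# Toy comparison: how many encoder frames each strategy processes
# for the same stretch of audio.

def sliding_window_frames(total_frames: int, window: int, hop: int) -> int:
    """Buffered sliding window: each step re-encodes `window` frames,
    including the overlap with the previous step."""
    steps = max(0, (total_frames - window) // hop) + 1
    return steps * window

def cache_aware_frames(total_frames: int) -> int:
    """Cache-aware streaming: every frame is encoded exactly once,
    with cached activations standing in for the overlap."""
    return total_frames

total = 1000  # frames of incoming audio
buffered = sliding_window_frames(total, window=20, hop=5)
cached = cache_aware_frames(total)
print(buffered, cached)  # the buffered approach does ~4x more encoder work
```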

Accuracy vs Latency: WER under Streaming Constraints

Nemotron Speech ASR is evaluated on the Hugging Face OpenASR Leaderboard dataset, which includes AMI, Earnings22, GigaSpeech, and LibriSpeech. Accuracy is reported as word error rate (WER) for different segment sizes.

For the average across these benchmarks, the model achieves:

  • Approximately 7.84 percent WER at 0.16 second chunk size
  • Approximately 7.22 percent WER at 0.56 second chunk size
  • Approximately 7.16 percent WER at 1.12 second chunk size

This reflects the latency/accuracy tradeoff. Larger chunks give more phonetic context and slightly lower WER, but the 0.16 s mode still keeps WER below 8 percent while remaining usable for real-time agents. Developers can choose the inference-time operating point based on the needs of the application, for example 160 ms for aggressive voice agents, or 560 ms for transcription-centric workflows.
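
One simple way to pick an operating point is to take the largest chunk (lowest WER) that still fits a latency budget. The helper below is illustrative only; the table hard-codes the averages quoted above:

```python
# (chunk seconds, average WER %) from the OpenASR benchmark averages above
OPERATING_POINTS = [
    (0.16, 7.84),
    (0.56, 7.22),
    (1.12, 7.16),
]

def pick_chunk(latency_budget_s: float) -> float:
    """Return the largest chunk size (best WER) within the latency budget."""
    fitting = [chunk for chunk, _wer in OPERATING_POINTS
               if chunk <= latency_budget_s]
    if not fitting:
        raise ValueError("budget is below the smallest supported chunk")
    return max(fitting)

print(pick_chunk(0.2))   # aggressive voice agent  -> 0.16 s mode
print(pick_chunk(1.0))   # transcription-centric   -> 0.56 s mode
```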

Throughput and concurrency on modern GPUs

The cache-aware design has a measurable impact on concurrency. On the NVIDIA H100 GPU, Nemotron Speech ASR supports approximately 560 concurrent streams at a 320 ms chunk size, roughly 3 times the concurrency of the baseline streaming system at the same latency target. The RTX A5000 and DGX B200 benchmarks show similar throughput gains, with concurrency improvements of over 5x on the A5000 and up to 2x on the B200 at typical latency settings.

Equally important, latency remains constant as concurrency increases. In Modal’s tests with 127 concurrent WebSocket clients in the 560 ms mode, the system maintained an average end-to-end delay of around 182 ms without drift, which is essential for agents that must remain synchronized with live speech over multi-minute sessions.
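
These concurrency numbers imply a tight per-chunk compute budget. The derivation below is my own arithmetic from the quoted figures, not a published benchmark methodology: to keep up with real time, a GPU serving N streams at chunk size C must, on average, encode one chunk in at most C / N seconds.

```python
def per_chunk_budget_ms(chunk_s: float, streams: int) -> float:
    """Average GPU time available per chunk per stream to stay real-time."""
    return chunk_s / streams * 1000.0

# H100 figure from the article: ~560 concurrent streams at a 320 ms chunk,
# leaving well under a millisecond of GPU time per chunk.
budget = per_chunk_budget_ms(0.32, 560)
print(round(budget, 3))
```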

Training data and ecosystem integration

Nemotron Speech ASR is trained primarily on the English portion of NVIDIA’s Granary dataset together with a large mix of public speech corpora, totaling approximately 285k hours of audio. The data includes YouTube Commons, YODAS2, MOSEL, Libri-Light, Fisher, Switchboard, WSJ, VCTK, VoxPopuli, and several Mozilla Common Voice releases. Labels combine human and ASR-generated transcripts.

Key takeaways

  1. Nemotron Speech ASR is a 0.6B parameter English streaming model that uses a cache-aware FastConformer encoder with an RNNT decoder and operates on 16 kHz mono audio with at least 80 ms input segments.
  2. The model exposes four inference-time segment configurations, approximately 80 ms, 160 ms, 560 ms, and 1.12 s, which let engineers trade latency for accuracy without retraining while keeping WER around 7.2 to 7.8 percent on standard ASR benchmarks.
  3. Cache-aware streaming removes overlapping-window recomputation, so each audio frame is encoded once, yielding approximately 3x more concurrent streams on the H100, over 5x on the RTX A5000, and up to 2x on the DGX B200 compared to the buffered streaming baseline at the same latency.
  4. In an end-to-end voice agent built with Nemotron Speech ASR, Nemotron 3 Nano 30B, and Magpie TTS, the measured average time to final transcription is approximately 24 ms, and server-side voice-to-voice latency on an RTX 5090 is approximately 500 ms, making ASR a small fraction of the total latency budget.
  5. Nemotron Speech ASR is released as a NeMo checkpoint with open weights and training details under NVIDIA’s permissive Open Model License, so teams can self-host, fine-tune, and profile the full stack for low-latency voice agents and speech applications.




Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of Marktechpost, an Artificial Intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is technically sound and easily understood by a wide audience. The platform boasts over 2 million monthly views, reflecting its popularity with readers.
