Nvidia on Tuesday launched a multimodal open model that combines vision, speech and language, aiming to help enterprises save time with agents that can reason across all modalities to provide faster, smarter responses.
Nemotron 3 Nano Omni is the latest version of the vendor’s open source family of models. It removes the need for separate perception models for video, audio, image and text by combining vision and audio encoders — neural network modules that process complex inputs and capture the most salient features of the data — within its 30-billion-parameter mixture-of-experts architecture. This combination enables the system to achieve higher throughput than Nvidia’s other Omni models, leading to lower costs and better inference efficiency, the vendor said.
The model is another way Nvidia is trying to extend its dominance in AI hardware into models and services. While the vendor currently leads the AI market in hardware with its ubiquitous GPUs, its emphasis on the Nemotron open models could help it stay on top, especially as its biggest customers — including Google, Microsoft and AWS — have their own chips and are ramping up production. Other customers, such as OpenAI, are partnering with Nvidia competitors like Cerebras and Broadcom, and some foreign customers, notably DeepSeek, are shifting toward local chipmakers like Huawei.
“This is happening against the backdrop of Nvidia’s biggest customers doing everything they can to erode the margins that Nvidia is making in hardware right now,” said David Nicholson, an analyst at Futurum Group. “In the long run, they’re not going to be able to maintain the hardware margins that they have right now.”
Meanwhile, Nvidia is trying to help enterprises become more efficient by enabling agents to understand context across all modalities. The vendor is promising a system that integrates diverse files and functionality, making it easier for enterprises to create agents.
“The idea is that we’re going to give you an environment where when you create an agent, it will automatically understand how to communicate with all the other parts of the entire infrastructure,” Nicholson said. “This is a step forward toward intelligently engineered systems that deliver efficiencies that are hard to achieve when you don’t have control over all the components.”
Built to be efficient
The model can work alongside proprietary models and other Nemotron open models to power agentic workflows such as computer-use agents, document intelligence, and audio and video understanding. For computer-use agents, Nemotron 3 Nano Omni powers the perception loop as the agent navigates a computer screen and reasons about its content.
With document intelligence, the model can interpret documents, charts, tables and screenshots, and reason over both visual and textual content. With audio and video understanding, it maintains the context of both modalities within the same reasoning stream.
Obstacles
The challenge, however, is that it is not clear whether Nvidia envisions this model or system for enterprises of a particular size, or whether its hyperscaler customers would benefit from using it.
Nicholson noted that some Nvidia customers have their own accelerators. “I don’t know if Nvidia is thinking this will be a hyperscale cloud provider strategy they’ll be able to use,” he said.
Furthermore, while the model is open source and Nvidia has provided the weights, training techniques and training set, it’s unclear whether enterprises outside the Nvidia stack will use it.
“It’s very unlikely,” Nicholson said. “Most of this will be deployed across the entire Nvidia stack environment.”
Still, developers will experiment with the model, said Chirag Shah, a professor at the University of Washington’s School of Information.
“When you make something like this open source, it immediately inspires all those developers to try it, integrate it into their existing solutions, and when it works well, they want to use Nvidia as their infrastructure partner,” he said.