World models could open the door to the next revolution in artificial intelligence

You might have seen an AI-generated video derail. You ask for a video of a dog, and as the dog runs behind the love seat, its collar disappears. Then, as the camera pans back, the love seat becomes a sofa.

Part of the problem lies in the predictive nature of many AI models. Like the models powering ChatGPT, which are trained to predict text, video-generation models predict the most statistically plausible thing to show next. In neither case does the AI have a clearly defined model of the world that it continually updates to make more informed decisions.

But this is beginning to change as researchers across many AI domains are working on building “world models,” the implications of which extend beyond video creation and chatbot use to augmented reality, robotics, autonomous vehicles, and even human-like intelligence—or artificial general intelligence (AGI).


A simple way to understand world modeling is through a four-dimensional, or 4D, model (three spatial dimensions plus time). To see what that means, think back to 2012, when Titanic, 15 years after its theatrical release, was painstakingly converted to stereoscopic 3D. If you freeze a frame, you get a sense of the distance between characters and objects on the ship. But if Leonardo DiCaprio's back is to the camera, you can't walk around him to see his face. Cinema's illusion of 3D is created with stereoscopy: two separate images, one for the left eye and one for the right, are often projected in rapid alternation. In the theater everyone sees the same pair of images and thus the same perspective.

Thanks to research over the past decade, however, several approaches are making more possible. Imagine capturing a photo and then, with the help of AI, viewing the same scene from a different angle. Starting in 2020, the NeRF (neural radiance field) algorithm introduced a way to create photorealistic novel views of a scene, but multiple photographs must be combined before an AI system can generate a 3D representation. Other 3D approaches use AI to fill in the missing information, deviating further from reality.
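To make the idea concrete, here is a minimal sketch of the core rendering step behind radiance-field methods like NeRF: a pixel's color is computed by sampling points along a camera ray and compositing their colors, weighted by density and by how much light survives to reach the camera. In a real system the radiance field is a trained neural network; the hard-coded sphere below is a stand-in for illustration only.

```python
import numpy as np

def toy_radiance_field(points):
    """Stand-in for NeRF's learned network, which maps 3D points to
    (density, RGB color). Here: an opaque red sphere of radius 1."""
    dist = np.linalg.norm(points, axis=-1)
    density = np.where(dist < 1.0, 10.0, 0.0)           # opaque inside sphere
    color = np.tile([0.8, 0.2, 0.2], (len(points), 1))  # uniform red
    return density, color

def render_ray(origin, direction, near=0.0, far=4.0, n_samples=64):
    """Volume rendering: composite color along a ray, weighting each
    sample by its opacity and the transmittance in front of it."""
    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction
    density, color = toy_radiance_field(points)
    delta = (far - near) / n_samples
    alpha = 1.0 - np.exp(-density * delta)               # per-sample opacity
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = transmittance * alpha
    return (weights[:, None] * color).sum(axis=0)        # final pixel RGB

# A ray aimed at the sphere picks up its color; one that misses stays black.
hit = render_ray(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
miss = render_ray(np.array([0.0, 2.0, -3.0]), np.array([0.0, 0.0, 1.0]))
```

Because the field is queried at arbitrary 3D points, the same scene can be rendered from any camera position, which is what makes novel views possible.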

Now imagine that every frame of Titanic is displayed in 3D, so that the film exists in 4D. You can scroll through time to see different moments or scroll through space to see the scene from different perspectives. You can also create new versions of it. For example, a recent preprint, "Neoverse: Augmenting 4D World Models with In-the-Wild Monocular Video," describes a way to convert videos into 4D models in order to create new videos from different perspectives.

But 4D technology can also help in creating new video content. A more recent preprint, "TeleWorld: Towards Dynamic Multimodal Synthesis with 4D World Models," applies to the scenario we started with: the dog running behind the love seat. The authors argue that the consistency of an AI video system improves when a constantly updated 4D world model guides the generation. The system's 4D model would help prevent the love seat from becoming a sofa and the dog from losing its collar.
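The principle can be sketched in a few lines. This is a hypothetical toy, not the preprint's method: the point is that when generation is conditioned on a persistent scene registry rather than on recent pixels alone, an object's attributes survive being hidden and revealed.

```python
# Toy illustration of world-model-guided generation: objects live in a
# persistent registry, so occlusion cannot erase their attributes.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    attributes: dict
    visible: bool = True

@dataclass
class WorldState:
    objects: dict = field(default_factory=dict)

    def update(self, name, **attrs):
        obj = self.objects.setdefault(name, SceneObject(name, {}))
        obj.attributes.update(attrs)

def render_frame(world):
    """Stand-in for a generator conditioned on world state: every frame
    is drawn from the registry, not from what was visible last frame."""
    return {o.name: dict(o.attributes) for o in world.objects.values()}

world = WorldState()
world.update("dog", collar="red")
world.update("love_seat", kind="love seat")

world.objects["dog"].visible = False   # the dog runs behind the love seat...
world.objects["dog"].visible = True    # ...and reappears

frame = render_frame(world)            # collar and love seat are unchanged
```

A purely predictive video model has no such registry; each frame only has to look locally plausible, which is exactly why collars and couches drift.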

These are preliminary results, but they point to a broader trend: models that update internal scene maps as they build them. And the applications of 4D modeling extend far beyond video production. For augmented reality (AR) systems such as Meta's Orion prototype glasses, a 4D world model is an evolving map of the user's world over time. It allows AR systems to keep virtual objects stable, render lighting and perspective reliably, and maintain spatial memory of what happened recently. It also enables occlusion, in which digital objects disappear behind real ones. A 2023 paper puts the requirement plainly: "To achieve occlusion, a 3D model of the physical environment is necessary."

Being able to rapidly convert video to 4D also provides rich data for training robots and autonomous vehicles on how the real world works. And by creating 4D models of the space they occupy, robots can better navigate it and predict what might happen next. Today's general-purpose vision-language AI models, which understand images and text but do not build clearly defined world models, often make errors; a benchmark paper presented at a 2025 conference notes "surprising limitations" in their basic world-modeling capabilities, including "almost random accuracy when discretizing motion trajectories."

Here's the catch: to those pursuing AGI, "world model" means much more. Today's leading large language models (LLMs), such as the ones powering ChatGPT, already have an implicit understanding of the world drawn from their training data. "In a way, I would say that LLMs already have a pretty good world model; the thing is that we don't really understand how they're doing it," says Angjoo Kanazawa, an assistant professor of electrical engineering and computer science at the University of California, Berkeley. These conceptual models are not a real-time physical understanding of the world, however, because LLMs cannot update what they learned in training on the fly. Even OpenAI's GPT-4 technical report notes that, once deployed, the model does not "learn from experience."

“How do you develop an intelligent LLM vision system that can actually take streaming input and update its understanding of the world and act accordingly?” Kanazawa says. “This is a big open problem. I think AGI is not possible without really solving this problem.”

Although researchers debate whether LLMs could ever achieve AGI, many see them as a component of future AI systems. Kanazawa says an LLM could act as a layer of "language and common sense to communicate": it would serve as an "interface," while a more clearly defined underlying world model would provide the "spatial, temporal memory" that current LLMs lack.

In recent years many leading AI researchers have turned to world models. In 2024 Fei-Fei Li founded World Labs, which recently launched its Marble software for creating 3D worlds from "text, images, video or rough 3D layouts," according to the start-up's promotional material. And last November AI researcher Yann LeCun announced on LinkedIn that he was leaving Meta to launch a start-up, now called Advanced Machine Intelligence (AMI Labs), "to build systems that understand the physical world, have persistent memory, can reason, and plan complex task sequences." He planted the seeds of these ideas in a 2022 position paper in which he asked why humans can function well in situations they have never encountered and argued that the answer may lie in the ability to learn "world models, internal models of how the world works." Research increasingly shows the benefits of such internal models: an April 2025 Nature paper reported results on DreamerV3, an AI agent that can improve its behavior by learning world models and "imagining" future scenarios.

So although, in the context of AGI, "world model" refers to an internal model of how reality works rather than just a 4D reconstruction, advances in 4D modeling can supply components that help with vision, memory, and even short-term prediction. And meanwhile, on the road to AGI, 4D models can provide rich simulations of reality in which to test AI systems, ensuring that when we put them to work in the real world, they know how to exist in it.
