Waymo has presented the Waymo World Model, a generative world model that drives its next generation of autonomous driving simulations. The system is built on top of Google DeepMind’s general-purpose world model, Genie 3, and is optimized to produce photorealistic, controllable, multi-sensor driving scenes at scale.
Waymo already reports nearly 200 million fully autonomous miles driven on public roads. Behind the scenes, the Waymo Driver is trained and evaluated on billions of additional miles in simulation. The Waymo World Model is now the main engine for generating those virtual worlds, with the explicit goal of exposing the stack to rare, safety-critical ‘long-tail’ events that are often almost impossible to observe in reality.
From Genie 3 to Driving-Specific World Models
Genie 3 is a general-purpose world model that turns text prompts into interactive environments that can be navigated in real time at about 24 frames per second, typically at 720p resolution. It learns scene dynamics directly from large video corpora and supports fluid control from user input.
Waymo uses Genie 3 as the backbone and trains it further for the driving domain. The Waymo World Model retains Genie 3’s ability to generate coherent 3D worlds, but aligns its output with Waymo’s sensor suite and operating constraints. It produces high-fidelity camera images and lidar point clouds that evolve continuously over time, matching how the Waymo Driver actually perceives the environment.
This is not just video rendering. The model generates multi-sensor, temporally coherent observations that downstream autonomous driving systems can consume under conditions similar to real-world logs.
Emergent Multimodal World Knowledge
Most AV simulators are trained only on on-road fleet data, which limits them to the weather, infrastructure and traffic patterns the fleet has actually encountered. Waymo instead leverages Genie 3’s pre-training on extremely large and diverse video corpora to import broad ‘world knowledge’ into the simulator.
Specialized training then transfers this knowledge from 2D video to the 3D lidar output tailored to Waymo’s hardware. Cameras offer rich texture and lighting; lidar contributes precise geometry and depth. The Waymo World Model generates these modalities jointly, so a simulated scene comes with both an RGB stream and a realistic 4D point cloud.
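Waymo has not published its data interfaces, but the paired, temporally coherent output described above can be pictured as a sequence of frames that each carry both a camera observation and a lidar point cloud. The sketch below is a hypothetical illustration (the `SensorFrame` type, field names, and resolutions are assumptions, not Waymo’s API):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SensorFrame:
    """One simulation step: paired camera and lidar observations."""
    timestamp_s: float                        # simulation time in seconds
    rgb_hw: Tuple[int, int]                   # camera resolution (H, W), assumed
    points: List[Tuple[float, float, float]]  # lidar returns as (x, y, z) in metres

def make_rollout(num_steps: int, hz: float = 10.0) -> List[SensorFrame]:
    """Build a temporally ordered sequence of placeholder multi-sensor frames."""
    return [
        SensorFrame(
            timestamp_s=i / hz,
            rgb_hw=(720, 1280),
            points=[(5.0 + 0.1 * i, 0.0, 0.0)],  # stand-in for generated geometry
        )
        for i in range(num_steps)
    ]
```

The point of the joint structure is that downstream autonomy code can iterate over such frames exactly as it would over real sensor logs, with the camera and lidar channels always in sync.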
Due to the diversity of pre-training data, the model can synthesize situations that Waymo’s fleet has not directly observed. The Waymo team shows examples such as light snowfall on the Golden Gate Bridge, a tornado, flooding in a cul-de-sac, tropical roads strangely covered in snow, and street fire evacuations. It also handles unusual objects and edge cases like elephants, Texas longhorns, lions, pedestrians dressed as T-rexes and car-shaped tumbleweeds.
The important point is that these behaviors are emergent. The model is not explicitly programmed with rules for elephants or tornado fluid dynamics. Instead, it reuses the general spatiotemporal structure learned from video and adapts it to driving scenes.
Three Axes of Controllability
A key design goal is robust simulation controllability. The Waymo World Model highlights three main control mechanisms: driving action controls, scene layout controls and language controls.
Driving action control: The simulator reacts to specific driving inputs, allowing ‘what if’ counterfactuals on top of a recorded log. Developers can ask how a scene would have unfolded had the Waymo Driver braked or steered differently, and then simulate that alternative behavior. Because the model is fully generative, it maintains realism even when the simulated path deviates far from the original trajectory, where purely reconstructive methods such as 3D Gaussian Splatting (3DGS) suffer from missing viewpoints.
Scene layout control: The model can adapt to modified road geometry, traffic-signal states and other road users. Waymo can insert or reposition vehicles and pedestrians, or mutate the road layout, to synthesize targeted interaction scenarios. This supports systematic stress testing of yielding, merging and interaction behavior beyond what appears in the raw logs.
Language control: Natural language prompts serve as a flexible, high-level interface for editing the time of day, the weather, or even generating completely synthetic scenes. The Waymo team demonstrates ‘world mutation’ sequences in which the same base city scene is rendered at dawn, dusk, noon, afternoon, evening and night, and then under cloudy, foggy, rainy, snowy and sunny conditions.
Together, the three axes behave like a structured API: numerical driving actions, structural layout edits, and semantic text prompts all operate on the same underlying world model.
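To make the “structured API” framing concrete, a request against such a world model could bundle all three control axes into one object. This is a hypothetical sketch: the `DrivingAction`, `LayoutEdit`, and `SimulationRequest` types and their fields are illustrative assumptions, not Waymo’s actual interface:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class DrivingAction:              # axis 1: numerical driving inputs
    steer: float                  # steering command (sign convention assumed)
    accel: float                  # acceleration in m/s^2; negative = braking

@dataclass
class LayoutEdit:                 # axis 2: structural scene edits
    op: str                       # "insert" | "move" | "remove"
    agent: str                    # e.g. "pedestrian", "cyclist", "vehicle"
    position_xy: Tuple[float, float]

@dataclass
class SimulationRequest:          # one request against the world model
    actions: List[DrivingAction] = field(default_factory=list)
    layout_edits: List[LayoutEdit] = field(default_factory=list)
    prompt: Optional[str] = None  # axis 3: semantic text control

def axes_used(req: SimulationRequest) -> str:
    """Summarize which control axes a request exercises."""
    axes = []
    if req.actions:
        axes.append("action")
    if req.layout_edits:
        axes.append("layout")
    if req.prompt:
        axes.append("language")
    return "+".join(axes) if axes else "replay"

# A counterfactual: brake harder than the recorded log, under rainy conditions.
req = SimulationRequest(
    actions=[DrivingAction(steer=0.0, accel=-4.0)],
    prompt="heavy rain at dusk",
)
```

An empty request would correspond to a plain log replay; filling in any combination of the three fields turns it into a targeted counterfactual or stress test.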
Converting Plain Video to Multimodal Simulation
The Waymo World Model can convert ordinary phone or dashcam recordings into multimodal simulations that show how the Waymo Driver would perceive the same scene.
Waymo shows examples of scenic drives in Norway, Arches National Park, and Death Valley. Given only the video, the model reconstructs a simulation with aligned camera images and lidar outputs. This yields scenarios with strong realism and factuality, because the generated world is anchored to actual footage, while remaining controllable through the three mechanisms above.
In practical terms, this means that a large collection of consumer-style videos can be reused as structured simulation input without the need for lidar recordings at those locations.
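Conceptually, the conversion is a lifting step: each real RGB frame is kept as the camera channel, and the model synthesizes the missing lidar channel alongside it. The function below is a placeholder sketch of that pipeline shape (the name `video_to_simulation` and the dict layout are assumptions; the stub point list stands in for generated geometry):

```python
from typing import Dict, List, Optional

def video_to_simulation(video_frames: List[bytes],
                        prompt: Optional[str] = None) -> List[Dict[str, object]]:
    """Lift a plain RGB video into paired camera + lidar simulation frames.

    The real model would generate the lidar channel; here a fixed placeholder
    point list stands in for the synthesized point cloud at each step.
    """
    sim = []
    for i, frame in enumerate(video_frames):
        sim.append({
            "step": i,
            "camera": frame,             # anchored to the real footage
            "lidar": [(5.0, 0.0, 0.0)],  # generated geometry (stub)
            "prompt": prompt,            # optional language control
        })
    return sim
```

The key property is that the output has one simulation frame per input video frame, so the generated world stays tied to the footage while gaining the extra sensor modality.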
Scalable Inference and Long Rollouts
Long-horizon maneuvers, such as threading a narrow lane against oncoming traffic or navigating a dense neighborhood, require many simulation steps. Naive generative models suffer from quality drift and high compute costs during such long rollouts.
The Waymo team reports an efficient variant of the Waymo World Model that supports longer sequences at a dramatically reduced computational cost while maintaining realism. They show 4x-speed playback of extended scenes: freeway navigation around vehicles stopped in the lane, driving through busy neighborhoods, climbing steep roads around motorcyclists, and handling an SUV’s U-turn.
For training and regression testing, this reduces the hardware budget per scenario and makes large test suites far more tractable.
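The article does not say how Waymo achieves the efficiency gain, but one standard way to keep long autoregressive rollouts cheap is to condition each step on a bounded window of recent frames, so per-step cost stays flat instead of growing with the full history. The sketch below illustrates that general pattern with a toy integer “frame” (the `rollout` function and `context_len` parameter are assumptions, not Waymo’s method):

```python
from collections import deque
from typing import Callable, List

def rollout(step_fn: Callable[[List[int]], int],
            init_frame: int,
            num_steps: int,
            context_len: int = 16) -> List[int]:
    """Autoregressive rollout with a bounded context window.

    Each step sees only the last `context_len` frames, so the cost of a
    step does not grow as the rollout gets longer.
    """
    context = deque([init_frame], maxlen=context_len)  # deque drops old frames
    frames = [init_frame]
    for _ in range(num_steps):
        nxt = step_fn(list(context))  # predict the next frame from recent context
        context.append(nxt)
        frames.append(nxt)
    return frames

# Toy step function: the next "frame" just increments the previous one.
trajectory = rollout(lambda ctx: ctx[-1] + 1, 0, 100, context_len=8)
```

Quality drift is the flip side of this trade-off: with a short window the model can forget scene content that left the context, which is why long-horizon realism is a result worth reporting.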
Key Takeaways
- Genie 3-based world model: Waymo World Model adapts Google DeepMind’s Genie 3 into a driving-specific world model that generates photorealistic, interactive, multi-sensor 3D environments for AV simulations.
- Multi-sensor, 4D output aligned with the Waymo Driver: The simulator jointly produces temporally coherent camera imagery and lidar point clouds, aligned with Waymo’s real sensor stack, so downstream autonomy systems can consume the simulation like real logs.
- Emergent coverage of rare and long-tail scenarios: By leveraging massive video pre-training, the model can synthesize rare situations and objects that the fleet has never directly observed, such as snow on unusual roads, floods, fires, and animals like elephants or lions.
- Tri-axis controllability for targeted stress testing: Driving action controls, scene layout controls, and language controls let developers run counterfactuals, edit road geometry and traffic participants, and change the time of day or weather through text prompts, all in the same generative environment.
- Efficient long-horizon and video-anchored simulation: An optimized version supports longer rollouts at lower compute cost, and the system can also convert simple dashcam or mobile video into controllable multimodal simulations, expanding the pool of realistic scenarios.
Check out the technical details.
Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.

