There’s always been a dirty secret in video editing: it’s easy to remove an object from footage; making the scene look as if the object never existed is hard. Take out the guy holding the guitar, and you’re left with a floating instrument that defies gravity. Hollywood VFX teams spend weeks fixing this type of problem. Netflix and a team of researchers from INSAIT, Sofia University ‘St. Kliment Ohridski’ released VOID (Removing Video Objects and Interactions), a model that can do this automatically.
VOID removes objects from a video along with all the interactions they produce in the scene – not just appearance-level effects like shadows and reflections, but physical interactions such as objects falling when the person holding them is removed.
What problem is VOID actually solving?
Standard video inpainting models – the type used in most editing workflows today – are trained to fill in the pixel area where an object was. They are essentially very sophisticated background painters. What they don’t do is reason about causal relationships: if you remove the actor holding a prop, what should happen to that prop?
Existing video object removal methods excel at recovering the content behind objects and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, existing models fail to capture them and produce implausible results.
VOID is built on top of CogVideoX and fine-tuned for video inpainting with interaction-aware mask conditioning. The main innovation is in how the model understands the scene – not just ‘which pixels should I fill in?’ but ‘what is physically plausible given the disappearance of this object?’
The canonical example from the research paper: if the person holding the guitar is removed, VOID also removes the person’s influence on the guitar – causing it to fall naturally. This is not trivial. The model has to understand that the guitar was supported by the person, and that removing the person means gravity takes over.
And unlike previous work, VOID was evaluated head-to-head against real competitors. Experiments on both synthetic and real data show that the approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods, including ProPainter, DiffuEraser, Runway, MiniMax-Remover, ROSE, and Generative Omnimatte.

Architecture: CogVideoX under the hood
VOID is built on CogVideoX-Fun-V1.5-5b-InP – a model from Alibaba PAI – and fine-tuned for video inpainting with interaction-aware quadmask conditioning. CogVideoX is a 3D transformer-based video generation model. Think of it as the video version of Stable Diffusion – a diffusion model that works on temporal sequences of frames rather than single images. The base model checkpoint (CogVideoX-Fun-V1.5-5b-InP) is released by Alibaba PAI on Hugging Face, and engineers must download it separately before running VOID.
Fine-tuned architecture specs: a CogVideoX 3D transformer with 5B parameters that takes a video clip, a quadmask, and a text prompt describing the scene after removal as input; it operates at a default resolution of 384×672, processes a maximum of 197 frames, uses a DDIM scheduler, and runs in BF16 with FP8 quantization for memory efficiency.
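To keep the specs above in one place, here they are collected as a config dict. This is purely illustrative – the field names are my own, not VOID’s actual configuration API:

```python
# Illustrative config sketch summarizing the published specs.
# Field names are hypothetical; only the values come from the article.
VOID_CONFIG = {
    "base_model": "CogVideoX-Fun-V1.5-5b-InP",  # Alibaba PAI checkpoint
    "parameters": "5B",                          # 3D transformer size
    "inputs": ["video", "quadmask", "text_prompt"],
    "resolution": (384, 672),                    # default H x W
    "max_frames": 197,
    "scheduler": "DDIM",
    "dtype": "bf16",
    "quantization": "fp8",                       # for memory efficiency
}
```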
The quadmask is arguably the most interesting technical contribution here. Instead of a binary mask (remove this pixel / keep this pixel), the quadmask is a 4-value mask that encodes the primary object to be removed, overlap regions, interaction-affected areas (falling objects, displaced objects), and the background to keep.
In practice, each pixel in the mask gets one of four values: 0 (primary object being removed), 63 (overlap between primary and affected areas), 127 (interaction-affected areas – things that will move or change as a result of the removal), and 255 (background, keep as-is). This gives the model a structured semantic map of what is happening in the scene, not only where the object is.
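A quick sketch of how such a mask could be assembled from two boolean segmentation masks, using the four values listed above. This is an assumed construction for illustration, not VOID’s actual preprocessing code:

```python
import numpy as np

# Quadmask values from the paper.
PRIMARY, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255

def build_quadmask(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """primary/affected: boolean HxW masks. Returns a uint8 quadmask."""
    qm = np.full(primary.shape, BACKGROUND, dtype=np.uint8)
    qm[affected] = AFFECTED               # regions that will move/change
    qm[primary] = PRIMARY                 # object being removed
    qm[primary & affected] = OVERLAP      # both at once, e.g. held guitar
    return qm

# Toy example: person occupies the left strip, held guitar the middle strip.
person = np.zeros((4, 6), dtype=bool); person[:, :2] = True
guitar = np.zeros((4, 6), dtype=bool); guitar[:, 1:4] = True
mask = build_quadmask(person, guitar)
```

The overlap value matters because a pixel can belong to both the removed object and an affected object at once (the hand gripping the guitar neck), and the model needs to treat that region differently from either alone.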
Two-pass inference pipeline
VOID uses two transformer checkpoints, trained sequentially. You can run inference with Pass 1 alone or chain both passes for higher temporal stability.
Pass 1 (void_pass1.safetensors) is the base inpainting model and is sufficient for most videos. Pass 2 serves a specific purpose: correcting known failure modes. If the model produces object morphing – a known failure mode of small video diffusion models – an optional second pass re-runs inference using flow-warped latents obtained from the first pass, stabilizing object shapes along the newly synthesized trajectory.
The distinction is worth understanding: Pass 2 isn’t just for longer clips – it is specifically a fix for shape stability. When the diffusion model produces objects that gradually drift or distort across frames (a well-documented artifact in video diffusion), Pass 2 uses optical flow to warp the latents from Pass 1 and feeds them as initialization into the second diffusion run, anchoring the shape of synthesized objects frame-to-frame.
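The core operation – warping the previous frame’s latent along optical flow to seed the next frame – can be sketched as follows. This is a minimal assumed illustration of the idea (nearest-neighbor backward warping, a hypothetical blend weight), not VOID’s implementation:

```python
import numpy as np

def warp_latent(latent: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp an HxWxC latent along an HxWx2 flow field (dy, dx)
    that maps frame t-1 to frame t. Nearest-neighbor for simplicity."""
    h, w = latent.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 1]).astype(int), 0, w - 1)
    return latent[src_y, src_x]

def init_pass2(pass1_latents, flows, blend=0.7):
    """Seed each frame of the second diffusion run by blending the
    flow-warped previous latent into the Pass-1 latent for that frame.
    `blend` is an illustrative knob, not a documented VOID parameter."""
    out = [pass1_latents[0]]
    for t in range(1, len(pass1_latents)):
        warped = warp_latent(out[-1], flows[t - 1])
        out.append(blend * warped + (1 - blend) * pass1_latents[t])
    return out
```

Because each initialization is carried forward along the flow, an object’s latent content follows its motion trajectory instead of being re-invented per frame – which is what suppresses the morphing artifact.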
How the training data was generated
This is where things get really interesting. Training a model to understand physical interactions requires paired videos – the same scene, with and without the object, where the physics works correctly in both. Real-world paired data at this scale does not exist. So the team created it artificially.
The training used paired counterfactual videos generated from two sources: HUMOTO – human-object interactions rendered in Blender with physics simulations – and Kubric – object-only interactions using Google Scanned Objects.
HUMOTO uses motion-capture data of human-object interactions. The key mechanic is a Blender re-simulation: the scene is set up with the human and objects and rendered once with the human present; then the human is removed from the simulation and the physics is re-run from that point. The result is a physically correct counterfactual – objects that were being held or supported now fall, exactly as they should. Kubric, developed by Google Research, applies the same idea to object-object collisions. Together, they produce a dataset of paired videos where the physics is provably correct, not inferred by human annotators.
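To make the counterfactual idea concrete, here is a deliberately toy version of the re-simulation: one trajectory where the object stays held, and a paired trajectory where the support is removed at a chosen frame and gravity takes over. This is a stand-in for intuition only – the real pipeline uses Blender’s physics engine, not this hand-rolled integrator:

```python
# Toy counterfactual simulation (assumed simplification of the
# HUMOTO re-simulation idea): height of a held object over time,
# with and without its human support.
G, DT = 9.81, 1.0 / 30.0   # gravity (m/s^2), 30 fps timestep

def simulate(frames: int, removal_frame: int, hold_height: float = 1.2):
    """Return (with_person, without_person) height trajectories in meters."""
    with_person = [hold_height] * frames       # guitar stays held throughout
    without_person, y, v = [], hold_height, 0.0
    for t in range(frames):
        if t >= removal_frame:                 # support gone: free fall
            v += G * DT
            y = max(0.0, y - v * DT)           # clamp at the floor
        without_person.append(y)
    return with_person, without_person

held, dropped = simulate(frames=60, removal_frame=10)
```

The two trajectories form exactly the kind of before/after pair the model trains on: identical up to the removal frame, physically divergent afterward.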
Key takeaways
- VOID goes beyond pixel-filling. Unlike existing video inpainting tools that only correct visual artifacts like shadows and reflections, VOID understands physical causality – if you remove a person holding an object, the object falls naturally in the output video.
- QuadMask is the main innovation. Instead of a simple binary remove/keep mask, VOID uses a 4-value quadmask (values 0, 63, 127, 255) that encodes not only what to remove, but also which surrounding areas of the scene are physically affected – giving the diffusion model a structured view of the scene.
- Two-pass inference resolves real failure modes. Pass 1 handles most videos; Pass 2 exists specifically to correct object-morphing artifacts – a known weakness of video diffusion models – by using the optical-flow-warped latents from Pass 1 as initialization for the second diffusion run.
- Synthetic paired data made training possible. Since paired counterfactual video data does not exist at scale in the real world, the research team created it using Blender physics re-simulation (HUMOTO) and Google’s Kubric framework, generating ground-truth before/after video pairs where the physics holds true.
Check out the paper, model weights, and repo.

Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.