The Salesforce AI research team introduced FOFPred, a language-driven future optical flow prediction framework that connects large vision language models to diffusion transformers for dense motion prediction in control and video generation settings. FOFPred takes one or more images and a natural language instruction such as ‘move the bottle from right to left’ and predicts 4 future optical flow frames that describe how each pixel is expected to move over time.

Future optical flow as a motion representation
Optical flow is the apparent per pixel displacement between two frames. FOFPred focuses on future optical flow, meaning predicting dense displacement fields for future frames, given only current observations and text, without access to future images.
Future optical flow is a compact representation of motion. It discards static appearance and keeps only pixel level motion, which makes it suitable both as an intermediate representation for robot control policies and as a conditioning signal for video diffusion models. Compared with predicting future RGB frames, this reduces the complexity of the output distribution and avoids modeling textures and high-frequency details that are not necessary for motion planning.
To plug into existing latent diffusion infrastructure, the research team encodes the optical flows as RGB images. They map flow direction and magnitude to HSV channels, then convert to RGB. The scaling of each channel is tuned so that consecutive flow frames are visually smooth and resemble animated graphics. A standard Flux.1 variational autoencoder encodes and decodes these flow images.
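The HSV encoding can be sketched as follows. The exact channel mapping and scaling used by FOFPred are not public, so this is an illustrative version that maps flow direction to hue and normalized magnitude to value; `flow_to_rgb` and `max_mag` are hypothetical names:

```python
import numpy as np

def flow_to_rgb(flow, max_mag):
    """Encode a dense optical flow field (H, W, 2) as an RGB image.

    Illustrative sketch: direction -> hue, normalized magnitude -> value,
    full saturation. Not the paper's exact mapping.
    """
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.hypot(dx, dy)
    h = (np.arctan2(dy, dx) + np.pi) / (2 * np.pi)  # hue in [0, 1)
    s = np.ones_like(h)
    v = np.clip(mag / max_mag, 0.0, 1.0)
    # Vectorized HSV -> RGB conversion (standard sextant formula)
    i = (h * 6.0).astype(int) % 6
    f = h * 6.0 - np.floor(h * 6.0)
    p, q, t = v * (1 - s), v * (1 - f * s), v * (1 - (1 - f) * s)
    lut = [(v, t, p), (q, v, p), (p, v, t), (p, q, v), (t, p, v), (v, p, q)]
    rgb = np.zeros(flow.shape[:2] + (3,))
    for k, (r_, g_, b_) in enumerate(lut):
        m = i == k
        rgb[m] = np.stack([r_[m], g_[m], b_[m]], axis=-1)
    return rgb
```

A pixel moving right at full magnitude lands at hue 0.5 (cyan), and zero-flow pixels come out black, so static regions fade out of the encoded image.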
Unified VLM diffusion backbone
FOFPred uses a unified architecture that combines a frozen vision language model, a frozen VAE, and a trained diffusion transformer. The pipeline is:
- Qwen2.5-VL is used as a vision language encoder to jointly encode captions and visual input.
- Flux.1 VAE encodes input images and training optical flow targets into latent tensors.
- An OmniGen-style diffusion transformer (DiT) takes the encoded visual and textual features as conditioning inputs and generates latent future flow sequences.
Only the DiT and small MLP projectors are trained. The Qwen2.5-VL and Flux.1 weights remain frozen, allowing the model to reuse image editing pretraining and multimodal reasoning capabilities from prior work. Temporal modeling is added to the input and output frame sequences by extending the RoPE positional encoding and attention blocks from two-dimensional spatial positions to full spatio-temporal positions. This provides full spatio-temporal attention without adding parameters, so the DiT can directly reuse OmniGen image pretraining.
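Extending 2D RoPE to spatio-temporal positions can be sketched roughly as follows. The even three-way split of channels across the (time, height, width) axes is an assumption for illustration, not the paper's exact layout:

```python
import numpy as np

def rope_angles_3d(T, H, W, dim):
    """Rotation angles for spatio-temporal RoPE over a T x H x W token grid.

    Assumed layout: channels split evenly across the three axes, each axis
    reusing the standard 1-D RoPE frequency schedule (base 10000).
    """
    d = dim // 3
    inv_freq = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))
    t, h, w = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    positions = [t.ravel(), h.ravel(), w.ravel()]
    # One rotation angle per channel pair: shape (T*H*W, 3 * d // 2)
    return np.concatenate([np.outer(p, inv_freq) for p in positions], axis=1)

def apply_rope(x, ang):
    """Rotate each consecutive channel pair of tokens x (N, dim) by its angle."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    c, s = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * c - x2 * s
    out[:, 1::2] = x1 * s + x2 * c
    return out
```

Because the token at position (0, 0, 0) gets all-zero angles, it is left unchanged, and relative offsets along any axis translate into relative rotations, which is what lets the pretrained 2D attention generalize over time without new parameters.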


Training on noisy web videos with relative optical flow
The main model is trained on web scale human activity videos with paired captions. The research team uses the Something-Something V2 dataset and the EgoDex egocentric manipulation dataset to obtain approximately 500,000 video caption pairs.
Training uses an end-to-end flow matching objective in latent space. Future optical flow sequences are first computed offline, then encoded by the VAE and used as targets in the flow matching diffusion loss for the DiT. During training the method also applies classifier-free guidance to both textual and visual conditions and masks certain frames and viewpoints to improve robustness.
An important contribution is the relative optical flow computation used to generate clean training targets from noisy egocentric videos. For each frame pair, the method:
- Computes dense optical flow with an off the shelf estimator.
- Estimates camera motion through a homography, using depth features.
- Uses projective geometry to constrain camera motion and obtain object-centered relative flow vectors.
- Filters frame pairs by selecting pairs where the top k percent flow magnitudes exceed a threshold, focusing training on segments with meaningful motion.
These steps run offline at a lower resolution for efficiency, then are recomputed at the original resolution for the final targets. Ablation studies show that raw flow targets, without removing static frames or camera motion, hurt downstream performance, while camera-compensated relative flow targets give the best results.
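The camera-motion subtraction step can be sketched as follows, assuming the homography for the frame pair has already been estimated. `relative_flow` is a hypothetical helper, not the paper's implementation:

```python
import numpy as np

def relative_flow(flow, H_cam):
    """Subtract camera-induced motion from a raw optical flow field.

    flow:  (H, W, 2) dense flow from an off-the-shelf estimator.
    H_cam: 3x3 homography mapping frame-t pixels to frame-t+1 pixels,
           modeling the camera's ego-motion (assumed precomputed).
    """
    Hh, Ww = flow.shape[:2]
    ys, xs = np.mgrid[0:Hh, 0:Ww]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(Hh * Ww)])  # (3, N)
    warped = H_cam @ pts
    warped = warped[:2] / warped[2]                 # perspective divide
    cam_flow = (warped - pts[:2]).T.reshape(Hh, Ww, 2)
    return flow - cam_flow                          # object-centric residual
```

With an identity homography the raw flow passes through unchanged; with a pure camera translation, static background pixels cancel to zero and only object motion survives, which is the property the filtering step then thresholds on.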


Language driven robot manipulation
The first downstream use case is robot control. FOFPred is fine-tuned on robot video caption data to predict future optical flow from both fixed and wrist-mounted cameras. On top of FOFPred, the research team attaches a diffusion policy network that takes the predicted flows, text, and robot state, and outputs continuous actions. This setup follows prior diffusion policy work but uses future optical flow instead of predicted RGB frames as the main representation.
On the CALVIN ABCD benchmark, which evaluates zero shot long horizon chains of 5 language specified manipulation tasks, FOFPred reaches an average chain length of 4.48. Under the same protocol VPP reaches 4.33 and DreamVLA reaches 4.44. FOFPred also achieves a Task 5 success rate of 78.7 percent, the best among reported methods. In the low data setting with 10 percent of the CALVIN demonstrations, FOFPred still reaches a 3.43 average chain length, higher than VPP's 3.25.
On RoboTwin 2.0, a dual arm manipulation benchmark with 5 tasks that require both arms, FOFPred achieves an average success rate of 68.6 percent. Under the same training settings the VPP baseline reaches 61.8 percent. FOFPred improves success on every task in the suite.


Motion Aware Text to Video Generation
The second downstream use case is motion control in text-to-video generation. The research team builds a two-stage pipeline by combining FOFPred with the Go-with-the-Flow video diffusion model. FOFPred takes an initial frame and a language description of motion, predicts a sequence of future flow frames, and interpolates them into a dense motion field. Go-with-the-Flow then uses this motion field and the initial frame to synthesize the final video, following the described motion.
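The interpolation from a handful of predicted flow frames into a denser motion field might look like the following linear upsampling sketch; `densify_flow` and the `factor` parameter are illustrative assumptions, since the paper's exact interpolation scheme is not spelled out here:

```python
import numpy as np

def densify_flow(flows, factor):
    """Linearly interpolate K predicted flow frames into a denser sequence.

    flows:  (K, H, W, 2) sparse flow frames predicted by the model.
    factor: number of output frames per input interval (assumed parameter).
    Returns (factor * (K - 1) + 1, H, W, 2) temporally dense flow.
    """
    K = len(flows)
    out = []
    for i in range(K - 1):
        for j in range(factor):
            a = j / factor
            out.append((1 - a) * flows[i] + a * flows[i + 1])  # blend neighbors
    out.append(flows[-1])
    return np.stack(out)
```

Linear blending keeps the motion field temporally smooth, which matters because the video model conditions on per-frame flow rather than on endpoints alone.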
On the motion heavy Something-Something V2 benchmark, the FOFPred plus Go-with-the-Flow pipeline improves on the CogVideoX baseline under comparable conditions. The method reaches SSIM 68.4, PSNR 22.26, LPIPS 28.5, FVD 75.39, KVD 11.38, and Motion Fidelity 0.662, consistently better than CogVideoX. Importantly, FOFPred uses only language and a single frame at inference time, whereas many controllable video baselines require hand or object masks or trajectories as additional inputs.


Key takeaways
- FOFPred reframes motion prediction as language-driven future optical flow, predicting 4 dense optical flow frames from one or more current images and a text instruction, providing a compact motion representation for downstream tasks.
- The model uses a unified VLM diffusion backbone, with Qwen2.5-VL as a frozen vision language encoder, the Flux.1 VAE as a frozen latent encoder for images and flows, and an OmniGen-style DiT with spatio-temporal RoPE based attention as the only trained component.
- Training relies on web and egocentric videos from Something-Something V2 and EgoDex, and creates relative optical flow targets by estimating ego-motion via homography, subtracting camera flow, and filtering for high-motion segments, which significantly improves downstream performance.
- In robot manipulation, FOFPred serves as a motion backbone for a diffusion policy head and achieves state-of-the-art or better results on CALVIN ABCD and RoboTwin 2.0, including a 4.48 average task chain length on CALVIN and 68.6 percent average success on RoboTwin, outperforming VPP and DreamVLA.
- For text to video generation, connecting FOFPred with Go-with-the-Flow yields better SSv2 metrics than CogVideoX, with higher SSIM and PSNR, lower FVD and KVD, and better motion fidelity, while requiring only language and one frame at inference, making FOFPred a reusable motion controller for both robotics and video synthesis pipelines.

Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.