One model, three modalities: ByteDance releases Lance for image and video understanding, creation and editing

Building a single model that can both understand and generate images and video is harder than it sounds. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features that are closely aligned with language. Generation requires low-level, consistent representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by splitting the two into separate architectures and bridging them post hoc.

A ByteDance research team took a different approach with Lance. Instead of assembling separate components, the team designed a single model that integrates understanding, generation, and editing across both image and video modalities, jointly trained from the beginning.

What can Lance do

Lance organizes its capabilities into three output families: text (X2T), images (X2I), and video (X2V). On the understanding side, this includes image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing, including multi-turn consistent editing in both modalities.

This all-in-one coverage is notable. While typical integrated architectures stop at basic image understanding and text-to-image generation, Lance is one of the few that spans both images and video across understanding and generation tasks.

How the architecture works

The architecture is based on two principles: unified context modeling and decoupled paths.

For unified context, Lance transforms all inputs (text, images, and video) into a shared interleaved multimodal sequence. The text tokens come from the Qwen2.5-VL embedding layer. For understanding-oriented visual input, the Qwen2.5-VL ViT encoder generates compact semantic visual tokens. For generation-oriented visual input, the Wan2.2 3D causal VAE encoder encodes images and videos into continuous latent representations, applying 16× spatial downsampling and 4× temporal downsampling. All of these heterogeneous token types (text, semantic visual tokens, and latent visual tokens) are placed in the same sequence. The model then runs generalized 3D causal attention over the entire context, with text tokens using causal attention and visual tokens using bidirectional attention.
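The mixed attention pattern can be illustrated with a small mask builder. This is a minimal sketch rather than Lance's actual implementation: it assumes that every token attends causally to earlier tokens and that tokens belonging to the same visual group additionally attend to each other in both directions. The function name and the group labels are hypothetical.

```python
from typing import List


def build_attention_mask(token_kinds: List[str]) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    token_kinds holds one label per token: "text", or a visual group id
    such as "img0" (hypothetical labels). Assumed rule: every token sees
    itself and all earlier tokens (causal); tokens in the same visual
    group also see each other regardless of order (bidirectional).
    """
    n = len(token_kinds)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            same_visual_group = (
                token_kinds[i] != "text" and token_kinds[i] == token_kinds[j]
            )
            mask[i][j] = causal or same_visual_group
    return mask


# Two text tokens, one three-token image, then one more text token.
kinds = ["text", "text", "img0", "img0", "img0", "text"]
mask = build_attention_mask(kinds)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the image tokens form a fully filled square block, while the text rows stay lower-triangular.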

For decoupled paths, Lance uses a dual-stream mixture-of-experts architecture built on Qwen2.5-VL 3B. The understanding expert (LLM_UND) handles text and semantic visual tokens, producing output for multimodal reasoning and text generation. The generation expert (LLM_GEN) handles VAE latent tokens for visual synthesis and editing. Importantly, both experts work on the same shared interleaved sequence: they share the context but do not compete for the same parameters. The understanding expert is trained with a next-token prediction loss; the generation expert is trained with a flow matching objective in continuous latent space. The two losses are combined with configurable weights throughout training.
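The two-part objective can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the training code: the flow matching term assumes a linear (rectified-flow style) path between noise and the clean latent, and the function names, weights, and toy inputs are all hypothetical.

```python
import math
import random
from typing import Callable, List


def next_token_loss(logits: List[float], target_index: int) -> float:
    """Cross-entropy of a single next-token prediction from raw logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target_index]


def flow_matching_loss(
    predict_velocity: Callable[[List[float], float], List[float]],
    clean: List[float],
    rng: random.Random,
) -> float:
    """Mean squared error between predicted and target velocity.

    Assumes the linear path x_t = (1 - t) * noise + t * clean, so the
    target velocity is (clean - noise).
    """
    t = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    x_t = [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]
    target = [c - n for n, c in zip(noise, clean)]
    predicted = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(clean)


def combined_loss(und: float, gen: float, w_und: float, w_gen: float) -> float:
    """Weighted sum of the understanding and generation losses."""
    return w_und * und + w_gen * gen


rng = random.Random(0)
und = next_token_loss([2.0, 0.5, -1.0], target_index=0)
# Stand-in for the generation expert: a model that always predicts zero velocity.
gen = flow_matching_loss(lambda x_t, t: [0.0] * len(x_t), [0.3, -0.2, 0.9, 0.1], rng)
total = combined_loss(und, gen, w_und=1.0, w_gen=0.5)
print(und, gen, total)
```

Because the weights are plain arguments, a schedule can rebalance the two objectives during training without touching either loss.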

Modality-Aware Rotary Positional Encoding (MaPE)

Running ViT semantic tokens, clean VAE condition tokens, and noisy VAE target tokens through the same sequence creates a subtle problem. Standard 3D-RoPE encodes positions based on spatiotemporal layout alone, so it has no way to tell these token groups apart. When multiple visual token groups occupy the same sequence, their positional boundaries become ambiguous, which may harm cross-task alignment.

Lance introduces Modality-Aware Rotary Positional Encoding (MaPE) to fix this. MaPE applies a fixed temporal offset to each visual token group based on its index in the sequence. Spatial coordinates remain unchanged, so the internal layout within images and videos is preserved. The temporal offset alone is sufficient to separate token groups in the global positional space without disrupting the temporal order within any individual video.
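The offset idea can be illustrated with plain (t, h, w) position ids. This is a minimal sketch under assumptions: the shift is taken to be the group index times a constant, and the constant of 100, the helper names, and the toy grid sizes are hypothetical rather than values from the paper.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (t, h, w) indices fed to 3D-RoPE


def grid_positions(frames: int, height: int, width: int) -> List[Position]:
    """Plain spatiotemporal positions for one visual token group."""
    return [
        (t, h, w)
        for t in range(frames)
        for h in range(height)
        for w in range(width)
    ]


def apply_temporal_offsets(
    groups: List[List[Position]], offset_per_group: int
) -> List[List[Position]]:
    """Shift only the temporal index of each group by index * offset."""
    shifted = []
    for index, group in enumerate(groups):
        shift = index * offset_per_group
        shifted.append([(t + shift, h, w) for (t, h, w) in group])
    return shifted


groups = [
    grid_positions(1, 2, 2),  # ViT semantic tokens of an input image
    grid_positions(1, 2, 2),  # clean VAE tokens of the same image
    grid_positions(2, 2, 2),  # noisy VAE target tokens of a two-frame clip
]
shifted = apply_temporal_offsets(groups, offset_per_group=100)
print(shifted[1][0], shifted[2][0])
```

Before the shift, the first two groups carry identical positions. Afterwards each group occupies its own temporal range, while the (h, w) layout and the frame order inside each group are unchanged.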

Removing MaPE drops GenEval from 80.94 to 80.56, GEdit-Bench from 6.86 to 6.30, and VBench from 81.81 to 80.95, a consistent decline across image generation, image editing, and video generation.

Training: Four Stages, One Integrated Framework

Lance is trained in four sequential stages, each building on the last.

Pre-Training (PT) lays the foundation using approximately 1B image-text pairs and 140M video-text pairs, covering 1.5T training tokens. This stage establishes basic multimodal alignment and generation capability. The VAE and ViT encoders are frozen here; only the backbone and connectors are trained.

Continued Training (CT) expands the task space by introducing interleaved multi-task data (editing samples, subject-driven generation samples, and multimodal understanding data) over approximately 300B tokens. As training progresses, a progressive data-mixing schedule gradually increases the proportion of difficult tasks such as editing.
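A progressive data-mixing schedule of this kind can be sketched as an interpolation between a starting and an ending task mixture. The article does not give the schedule's shape or proportions, so the linear ramp, the task names, and every number below are hypothetical.

```python
from typing import Dict

# Hypothetical sampling proportions at the start and end of the stage.
START = {"understanding": 0.5, "generation": 0.4, "editing": 0.1}
END = {"understanding": 0.3, "generation": 0.3, "editing": 0.4}


def mixing_weights(
    step: int, total_steps: int, start: Dict[str, float], end: Dict[str, float]
) -> Dict[str, float]:
    """Linearly interpolate task sampling weights and renormalize them."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    raw = {
        task: (1.0 - progress) * start[task] + progress * end[task]
        for task in start
    }
    norm = sum(raw.values())
    return {task: value / norm for task, value in raw.items()}


early = mixing_weights(0, 1000, START, END)
mid = mixing_weights(500, 1000, START, END)
late = mixing_weights(1000, 1000, START, END)
print(early["editing"], mid["editing"], late["editing"])
```

A data loader would sample each batch's task from these weights, so editing examples become more frequent as the stage advances.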

Supervised Fine-Tuning (SFT) strengthens instruction following, editing accuracy, and identity consistency using 72B tokens of curated high-quality data.

Reinforcement Learning (RL) uses Group Relative Policy Optimization (GRPO), with PaddleOCR acting as a reward model, to further improve text rendering accuracy and image-text alignment.
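The core of GRPO is that each sample's reward is normalized against the other samples generated for the same prompt. The sketch below shows only that step. The reward is a stand-in: it compares the requested text with hard-coded strings that play the role of OCR output, and it does not call PaddleOCR. The function names and example strings are hypothetical.

```python
import difflib
import statistics
from typing import List


def text_reward(target_text: str, recognized_text: str) -> float:
    """Similarity in [0, 1] between requested text and OCR-recognized text."""
    return difflib.SequenceMatcher(None, target_text, recognized_text).ratio()


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


target = "OPEN 24 HOURS"
# Placeholder strings standing in for OCR results on four sampled images.
recognized = ["OPEN 24 HOURS", "OPEN 24 HOUR5", "OPFN 2A HOURS", "0PEN"]
rewards = [text_reward(target, text) for text in recognized]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Samples that render the text better than their group's average receive positive advantages and are reinforced; the rest are pushed down, with no separate value network involved.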

The entire pipeline fits within a training budget of at most 128 GPUs.

Results

Image generation. On GenEval, Lance reaches an overall score of 0.90, tying TUNA for the top spot among integrated models. Sub-scores include counting (0.84), colors (0.97), and spatial position (0.87). On DPG-Bench, Lance scores 84.67 overall, with particularly strong relation modeling, although TUNA (86.76) and TUNA-2 (86.54) lead on that benchmark. To put parameter efficiency in perspective: Janus-Pro-7B scores 0.80 on GenEval and Show-o2 (7B) scores 0.76. Lance matches the top integrated-model score with 3B active parameters.

Video generation. On VBench, Lance achieves an overall score of 85.11 (using LLM rewriting), the highest among integrated models. The next best integrated model, TUNA, scores 84.06. Lance also beats dedicated generation-only models, including HunyuanVideo (83.43) and Wan2.1-T2V (83.69).

Image editing. On GEdit-Bench, Lance reaches an average overall (G_O) score of 7.30, the highest among integrated models. It leads in background change, content modification, motion change, picture beautification, subject removal, subject replacement, and tone shifting. Text modification is flagged as a remaining weakness.

Video understanding. On MVBench, Lance achieves an overall score of 62.0, the highest among integrated models. Show-o2 (7B), the next best integrated model, scores 55.7. Lance also outperforms many understanding-only models with more parameters, which is notable given that it is trained simultaneously for generation and editing.

Key takeaways

Lance is a native integrated multimodal model with 3B active parameters that handles image and video understanding, generation, and editing within a jointly trained framework.
A dual-stream mixture-of-experts architecture with Modality-Aware Rotary Positional Encoding (MaPE) separates the understanding and generation pathways while keeping them in a shared interleaved multimodal context.
Lance scores 0.90 on GenEval and 85.11 on VBench, the top overall scores among integrated models, and is trained within a budget of at most 128 GPUs.
Lance scores 62.0 on MVBench, the highest among integrated models, outperforming Show-o2 (7B) at 55.7 while also supporting generation and editing.
Lance is open-source under Apache 2.0, with weights available on Hugging Face.
