How do you create a single vision language action model that can control many different dual-arm robots in the real world? Lingbot-VLA is Ant Group Robbyant's new vision language action foundation model that targets practical robot manipulation in the real world. It has been trained on approximately 20,000 hours of teleoperated bimanual data collected from 9 dual-arm robot embodiments and extensively evaluated on the GM-100 benchmark across 3 platforms. The model is designed for cross-morphology generalization, data-efficient post-training, and high training throughput on commodity GPU clusters.

Massive dual-arm dataset across 9 robot embodiments
The pre-training dataset is built from real-world teleoperation on 9 popular dual-arm configurations. These include the AgiBot G1, AgileX, Galaxea R1 Lite, Galaxea R1 Pro, Realme 02 RS, Leju Kuavo 4 Pro, Kinglong humanoid, ARX Lift2, and a bimanual Franka setup. All systems have two 6- or 7-degree-of-freedom arms with parallel grippers and multiple RGB-D cameras that provide multi-view observation.
Teleoperation uses VR controls for the AgiBot G1 and isomorphic arm controls for the AgileX. For each scene, the recorded videos are divided by human annotators into clips that correspond to atomic actions. Still frames at the beginning and end of each clip are removed to reduce redundancy. Task-level and sub-task-level language instructions are then generated with Qwen3-VL-235B-A22B. This pipeline yields synchronized sequences of images, instructions, and action trajectories for pre-training.
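To make the output of this pipeline concrete, here is a minimal sketch of what one pre-training record could look like. The field names, camera keys, and array shapes are illustrative assumptions, not the released data schema.

```python
# A minimal sketch of one pre-training record produced by the pipeline above.
# Field names and shapes are illustrative assumptions, not the released schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class PretrainSample:
    images: dict[str, np.ndarray]      # e.g. {"head": HxWx3, "left_wrist": ..., "right_wrist": ...}
    task_instruction: str              # task-level instruction from the VLM annotator
    subtask_instruction: str           # sub-task (atomic action) level instruction
    proprio: np.ndarray                # robot proprioceptive state at the current step
    actions: np.ndarray                # future action chunk, shape (horizon, action_dim)

sample = PretrainSample(
    images={"head": np.zeros((224, 224, 3), dtype=np.uint8)},
    task_instruction="Fold the towel and place it in the basket.",
    subtask_instruction="Pick up the towel with the left gripper.",
    proprio=np.zeros(14, dtype=np.float32),        # e.g. dual 7-DoF joint positions (assumed)
    actions=np.zeros((50, 16), dtype=np.float32),  # horizon 50 as in the paper, action dim assumed
)
```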
To characterize verb diversity, the research team visualizes the most frequently occurring atomic verbs in the training and test sets as word clouds. About 50 percent of the atomic verbs in the test set do not appear among the top 100 most frequent verbs in the training set. This gap ensures that the evaluation emphasizes cross-task generalization rather than frequency-based memorization.
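For illustration, the overlap statistic could be computed along these lines, assuming the atomic verb of every training and test clip has already been extracted; the function name and inputs are hypothetical.

```python
# A rough sketch of the verb-overlap check described above, assuming verb lists
# have already been extracted from the clip annotations (names hypothetical).
from collections import Counter

def unseen_test_verb_fraction(train_verbs: list[str], test_verbs: list[str], top_k: int = 100) -> float:
    """Fraction of distinct test verbs absent from the top_k most frequent training verbs."""
    top_train = {v for v, _ in Counter(train_verbs).most_common(top_k)}
    distinct_test = set(test_verbs)
    unseen = [v for v in distinct_test if v not in top_train]
    return len(unseen) / max(len(distinct_test), 1)

# Toy example: 3 of the 4 distinct test verbs are unseen, so this prints 0.75.
print(unseen_test_verb_fraction(["pick"] * 10 + ["place"] * 8, ["pick", "fold", "wipe", "stack"]))
```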


Architecture, mixture of Transformers, and flow matching actions
Lingbot-VLA combines a strong multimodal backbone with an action expert through a mixture-of-Transformers architecture. The vision-language backbone is Qwen2.5-VL, which encodes multi-view images and natural language instructions into sequences of multimodal tokens. In parallel, the action expert receives the robot's proprioceptive state and the action tokens. Both branches share a self-attention module that performs layer-wise joint sequence modeling over observation and action tokens.
At each time step, the model builds an observation sequence that concatenates tokens from the 3 camera views, the task instruction, and the robot's proprioceptive state. The action sequence is a chunk of future actions whose temporal horizon is set to 50 during pre-training. The training objective is conditional flow matching: the model learns a vector field that transports Gaussian noise along a linear probability path to the ground-truth action trajectory. This gives a continuous action representation and produces smooth, temporally coherent control suitable for precise bimanual manipulation.
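A minimal sketch of a conditional flow matching loss with a linear probability path is shown below, assuming a PyTorch-style action expert that predicts a velocity from the noised action chunk, the time step, and the observation tokens. The call signature and shapes are assumptions, not the paper's exact formulation.

```python
# A minimal sketch of conditional flow matching with a linear path, under the
# assumptions stated above (the action_expert API and shapes are illustrative).
import torch
import torch.nn.functional as F

def flow_matching_loss(action_expert, obs_tokens, actions):
    """actions: (B, horizon, action_dim) ground-truth chunk; obs_tokens: multimodal context."""
    b = actions.shape[0]
    noise = torch.randn_like(actions)                # Gaussian source sample
    t = torch.rand(b, 1, 1, device=actions.device)   # random time on the linear path
    x_t = (1.0 - t) * noise + t * actions            # point on the straight-line path
    target_velocity = actions - noise                # constant velocity of the linear path
    pred_velocity = action_expert(x_t, t.squeeze(-1).squeeze(-1), obs_tokens)
    return F.mse_loss(pred_velocity, target_velocity)
```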
Lingbot-VLA uses blockwise causal attention over the joint sequence. Observation tokens attend to each other bidirectionally. Action tokens attend to all observation tokens and only to previous action tokens. This mask prevents information leakage from future actions into current observations while allowing the action expert to exploit the full multimodal context at each decision step.
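The mask can be sketched as follows; treating the action block as causal within itself is an interpretation of the description above, and the helper is illustrative rather than the released implementation.

```python
# A minimal sketch of the blockwise attention mask described above (True = attention allowed).
import torch

def blockwise_mask(num_obs: int, num_act: int) -> torch.Tensor:
    n = num_obs + num_act
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_obs, :num_obs] = True                        # obs <-> obs: bidirectional
    mask[num_obs:, :num_obs] = True                        # act -> obs: full access to observations
    act_causal = torch.tril(torch.ones(num_act, num_act, dtype=torch.bool))
    mask[num_obs:, num_obs:] = act_causal                  # act -> current and earlier action tokens
    return mask                                            # obs rows never attend to action columns

print(blockwise_mask(num_obs=3, num_act=2).int())
```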
Spatial perception via Lingbot-Depth distillation
Many VLA models struggle with depth reasoning when depth sensors fail or return sparse measurements. Lingbot-VLA addresses this by integrating Lingbot-Depth, a dedicated spatial perception model based on masked depth modeling. Lingbot-Depth is trained in a self-supervised manner on a large RGB-D corpus and learns to reconstruct dense metric depth when parts of the depth map are masked out, often in regions where physical sensors fail.
Scene queries from each camera view in Lingbot-VLA are aligned with Lingbot-Depth tokens through a projection layer and a distillation loss. Cross-attention maps the VLM queries into a depth latent space, and the loss minimizes their difference from the pre-trained Lingbot-Depth features. This injects geometry-aware information into the policy and improves performance on tasks that require precise 3D spatial reasoning, such as insertion, stacking, and folding under displacement and occlusion.
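A rough sketch of such a distillation head is shown below. The module layout, feature dimensions, and the choice of cosine distance are assumptions for illustration, not the released implementation.

```python
# A rough sketch of depth feature distillation: project VLM features, cross-attend,
# and match the result against frozen depth-expert features. All names and dims assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthDistillHead(nn.Module):
    def __init__(self, vlm_dim: int = 2048, depth_dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(vlm_dim, depth_dim)                       # projection into the depth latent space
        self.cross_attn = nn.MultiheadAttention(depth_dim, num_heads, batch_first=True)

    def forward(self, scene_queries, vlm_tokens, depth_teacher_feats):
        """scene_queries: (B, Q, vlm_dim); vlm_tokens: (B, N, vlm_dim); depth_teacher_feats: (B, Q, depth_dim)."""
        q = self.proj(scene_queries)
        kv = self.proj(vlm_tokens)
        student, _ = self.cross_attn(q, kv, kv)                          # map queries into depth latent space
        # Distill toward the pre-trained Lingbot-Depth features (teacher is not updated).
        return 1.0 - F.cosine_similarity(student, depth_teacher_feats.detach(), dim=-1).mean()
```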
GM-100 real-world benchmark on 3 platforms
The main evaluation uses GM-100, a real-world benchmark with 100 manipulation tasks and 130 filtered teleoperated trajectories per task on each of 3 hardware platforms. Experiments compare Lingbot-VLA to π0.5, GR00T N1.6, and WALL-OSS under a shared post-training protocol: all methods are fine-tuned from public checkpoints with the same dataset, a batch size of 256, and 20 epochs. Success rate measures completion of all subtasks within 3 minutes, and progress score tracks partial completion.
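Under these definitions, the per-episode metrics could be computed roughly as follows; the data structures and the equal per-subtask progress weighting are assumptions.

```python
# A minimal sketch of the two GM-100 metrics as described above: success requires all
# subtasks within the 3-minute limit, progress credits partial completion (weighting assumed).
def episode_metrics(subtasks_done: list[bool], elapsed_seconds: float, time_limit: float = 180.0):
    completed = sum(subtasks_done)
    success = float(completed == len(subtasks_done) and elapsed_seconds <= time_limit)
    progress = completed / max(len(subtasks_done), 1)
    return success, progress

# Averaging over episodes gives the task-level SR and PS.
episodes = [([True, True, True], 95.0), ([True, False, False], 180.0)]
srs, pss = zip(*(episode_metrics(d, t) for d, t in episodes))
print(sum(srs) / len(srs), sum(pss) / len(pss))   # 0.5 SR, ~0.67 PS
```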
On GM-100, Lingbot-VLA with depth achieves the state-of-the-art average across the 3 platforms, with an average success rate of 17.30 percent and an average progress score of 35.41 percent. π0.5 reaches 13.02 percent SR (success rate) and 27.65 percent PS (progress score). GR00T N1.6 and WALL-OSS are lower at 7.59 percent SR, 15.99 percent PS and 4.05 percent SR, 10.35 percent PS, respectively. Lingbot-VLA without depth already outperforms GR00T N1.6 and WALL-OSS, and the version with depth adds further gains.
In the RoboTwin 2.0 simulation with 50 tasks, models are trained on 50 demonstrations per task in clean scenes and 500 demonstrations per task in randomized scenes. Lingbot-VLA with depth reaches an average success rate of 88.56 percent in clean scenes and 86.68 percent in randomized scenes. π0.5 reaches 82.74 percent and 76.76 percent in the same settings. The margin persists even under strong domain randomization, which points to consistent benefits from the architecture and the depth integration.


Scaling behavior and post-training data efficiency
The research team analyzes scaling behavior by varying the amount of pre-training data from 3,000 to 20,000 hours on a subset of 25 tasks. Both success rate and progress score increase monotonically with the amount of data, with no saturation at the largest scale studied. The team presents this as the first empirical evidence that VLA models maintain favorable scaling on real robot data at this size.
They also study post-training data efficiency on AgiBot G1 using 8 representative GM-100 tasks. With only 80 demonstrations per task, Lingbot-VLA already surpasses π0.5, which uses the full set of 130 demonstrations, in both success rate and progress score. As more trajectories are added, the performance gap widens. This confirms that the pre-trained policy adapts with only dozens to about 100 task-specific trajectories, which directly reduces the post-training cost for new robots or tasks.
Training throughput and open source toolkit
Lingbot-VLA comes with a training stack optimized for multi-node efficiency. The codebase uses FSDP-style sharding of parameters and optimizer states, hybrid sharding for the action expert, mixed precision with float32 gradient reduction and bfloat16 storage, and operator-level acceleration with fused attention kernels and torch.compile.
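A minimal sketch of a training setup with similar ingredients, built on PyTorch FSDP, is shown below. The module names and wrapping choices are assumptions; this is not the released codebase, and it presumes a distributed process group has already been initialized.

```python
# A minimal sketch with similar ingredients: FSDP sharding, hybrid sharding for a submodule,
# bfloat16 storage with float32 gradient reduction, and torch.compile. Not the released code.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,     # bfloat16 storage for parameters
    reduce_dtype=torch.float32,     # float32 gradient all-reduce for numerical stability
    buffer_dtype=torch.bfloat16,
)

def wrap_model(model: torch.nn.Module) -> torch.nn.Module:
    # Hybrid sharding for the smaller action expert, full sharding for the large VLM backbone
    # (the model.action_expert attribute is a hypothetical name).
    model.action_expert = FSDP(model.action_expert,
                               sharding_strategy=ShardingStrategy.HYBRID_SHARD,
                               mixed_precision=mp_policy)
    wrapped = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD, mixed_precision=mp_policy)
    return torch.compile(wrapped)   # operator-level fusion on top of sharded execution
```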
On an 8-GPU setup, the research team reports a throughput of 261 samples per second per GPU for the Qwen2.5-VL-3B and PaliGemma-3B-PT-224 model configurations. This corresponds to a 1.5x to 2.8x speedup over existing VLA-oriented codebases such as StarVLA, Dexbotic, and OpenPI evaluated on the same LIBERO-based benchmarks. Going from 8 to 256 GPUs, throughput scales close to linearly. The full post-training toolkit has been released as open source.
Key takeaways
- Lingbot-VLA is a Qwen2.5-VL based vision language action foundation model, trained on approximately 20,000 hours of real-world dual-arm teleoperation across 9 robot embodiments, enabling robust cross-morphology and cross-task generalization.
- The model integrates Lingbot-Depth through feature distillation, so vision tokens are aligned with the depth completion expert, which significantly improves 3D spatial understanding for insertion, stacking, folding, and other geometry-sensitive tasks.
- On the GM-100 real-world benchmark, Lingbot-VLA with depth achieves about 17.30 percent average success rate and 35.41 percent average progress score, higher than π0.5, GR00T N1.6, and WALL-OSS under the same post-training protocol.
- Lingbot-VLA shows high post-training data efficiency: on AgiBot G1 it surpasses π0.5, which uses 130 trajectories per task, while using only 80 trajectories per task, and performance continues to improve as more trajectories are added.
Check out the paper, model weights, repo, and project page.
