Physical Intelligence Team Unveils MEM for Robots: A Multi-Scale Memory System Providing 15 Minutes of Context to a Gemma 3-4B VLA for Complex Tasks

Current end-to-end robotic policies, especially vision-language-action (VLA) models, typically work on single observations or very brief histories. This ‘memory deficit’ makes long-term tasks, such as cleaning the kitchen or following a complex recipe, computationally difficult or prone to failure. To address this, researchers at Physical Intelligence, Stanford, UC Berkeley, and MIT have introduced Multi-Scale Embodied Memory (MEM).

https://www.pi.website/download/Mem.pdf

Dual-scale memory architecture

MEM divides robotic memory into two different scales to balance semantic context with real-time control constraints.

(1) Short-term video memory

Tasks that require fine-grained spatial awareness demand dense visual data. MEM uses an efficient video encoder that extends a standard Vision Transformer (ViT). To stay under a ~380 ms real-time inference budget, the architecture avoids joint attention over all patches. Instead, it uses space-time separable attention, combining spatial attention within each frame with causal temporal attention across frames at every fourth layer.
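The separable scheme can be illustrated with a minimal NumPy sketch: attend over patches within each frame, then attend causally over timesteps for each patch position. This is only an illustrative single-layer toy (no multi-head attention, normalization, or the every-fourth-layer schedule the article describes); all function names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores) @ v

def causal_attention(q, k, v):
    # Same as above, but each timestep may only attend to the past.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    T = scores.shape[-1]
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)
    scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ v

def spacetime_separable_layer(x):
    """x: (T, N, D) -- T timesteps, N spatial patches, D channels."""
    # Spatial attention: mix the N patches of every frame independently.
    x = x + attention(x, x, x)                  # (T, N, D)
    # Causal temporal attention: mix each patch position across frames.
    xt = np.swapaxes(x, 0, 1)                   # (N, T, D)
    xt = xt + causal_attention(xt, xt, xt)
    return np.swapaxes(xt, 0, 1)                # back to (T, N, D)

out = spacetime_separable_layer(np.random.randn(4, 16, 8))
```

Because neither sub-attention ever forms the full (N·T)×(N·T) score matrix, the layer sidesteps joint attention while still propagating information across both space and time.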

This reduces the computational complexity from O(n²T²) to O(Tn² + nT²), where n is the number of spatial patches and T is the number of timesteps. By dropping tokens from previous timesteps in the upper layers, the model passes only a representation of the current observation to the VLA backbone, leaving the token count unchanged relative to a single-frame model.
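The savings are easy to quantify by counting attention-score entries per layer. The figures below are illustrative (the chosen n and T are plausible ViT-style values, not numbers from the paper), but they show why separable attention scales so much better:

```python
# Joint attention over all n*T tokens forms an (n*T) x (n*T) score
# matrix; separable attention forms T spatial (n x n) matrices plus
# n temporal (T x T) matrices.
def joint_scores(n, T):
    return (n * T) ** 2           # O(n^2 * T^2)

def separable_scores(n, T):
    return T * n**2 + n * T**2    # O(T*n^2 + n*T^2)

n, T = 256, 16                    # e.g. 256 patches/frame, 16 frames
print(joint_scores(n, T))         # 16,777,216 score entries
print(separable_scores(n, T))     # 1,114,112 score entries (~15x fewer)
```

At these settings the separable form is roughly 15x cheaper per layer, and the gap widens as more frames are added.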

(2) Long-term language memory

To handle tasks up to 15 minutes long, MEM uses language-based representations for semantic events. The system decomposes action prediction as follows:

$$\pi(a_{t:t+H}, l_{t+1}, m_{t+1} \mid o_{t-T:t}, m_t, g) \approx \pi_{LL}(a_{t:t+H} \mid o_{t-K:t}, l_{t+1}, g)\,\pi_{HL}(l_{t+1}, m_{t+1} \mid o_t, m_t, g)$$

Here, a high-level policy (π_HL) maintains a running language summary (m_t) of previous events and generates sub-task instructions (l_{t+1}) for the low-level policy (π_LL). This language memory is trained on LLM-generated summaries that compress information (for example, 'I placed three bowls' instead of frame-by-frame detail), reducing the risk of training-inference distribution shift.
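The control loop implied by this factorization can be sketched as follows. Both policies here are hypothetical stubs standing in for the actual VLAs (the real π_HL and π_LL are learned models, and the memory-summarization logic is far richer); the sketch only shows how the language memory m_t, sub-task l_{t+1}, and action chunk a_{t:t+H} flow between the two levels.

```python
def high_level_policy(obs, memory, goal):
    # Stands in for pi_HL: fold the new observation into the running
    # language memory and emit the next sub-task instruction l_{t+1}.
    memory = memory + [f"saw {obs}"]
    subtask = f"handle {obs} toward goal '{goal}'"
    return subtask, memory

def low_level_policy(recent_obs, subtask, horizon=3):
    # Stands in for pi_LL: map recent observations o_{t-K:t} plus the
    # sub-task to an action chunk a_{t:t+H}.
    return [f"action {i} for '{subtask}'" for i in range(horizon)]

def run_episode(observations, goal, K=4):
    memory, trace, recent = [], [], []
    for obs in observations:
        recent = (recent + [obs])[-K:]          # short window o_{t-K:t}
        subtask, memory = high_level_policy(obs, memory, goal)
        trace.append(low_level_policy(recent, subtask))
    return trace, memory

trace, memory = run_episode(["bowl", "spoon"], "set table")
```

Note the asymmetry the factorization buys: the low-level policy only ever sees a short observation window plus one instruction, while arbitrarily long history is carried forward in the compact language memory.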


Implementation and Performance

The research team integrated MEM into the π0.6 VLA, which is initialized from the pre-trained Gemma 3-4B model. The model was pre-trained on a diverse mix of robot demonstrations, vision-language tasks, and Internet video data.

Main results:

  • Adaptation to context: MEM enables robots to adapt manipulation strategies based on recent failures. In evaluation, this yields a +62% increase in success rate when opening refrigerators with unknown hinge directions and a +11% increase when picking up chopsticks at variable heights.
  • Long-horizon tasks: The model successfully performed 15-minute tasks such as 'Recipe Setup' (gathering ingredients from multiple locations) and 'Kitchen Cleaning' (washing utensils and wiping counters). Memory-less VLAs frequently fail at these tasks.
  • Capacity: The video encoder lets the model process up to 16 observation frames (~1 minute) on a single NVIDIA H100 GPU while staying under critical real-time inference limits.

MEM shows that combining dense, short-term visual tokens with compressed, long-term language summaries allows VLAs to scale their ‘working memory’ without prohibitive computational costs.
