What comes after the Transformer? Google Research proposes a way to give sequence models usable long-term memory with Titans and MIRAS, while keeping training parallel and inference close to linear in sequence length.
Titans is a concrete architecture that adds a deep neural memory to a Transformer-style backbone. MIRAS is a general framework that views most modern sequence models as instances of online optimization over an associative memory.
Why Titans and MIRAS?
Standard Transformers rely on softmax attention over a key-value cache. This makes in-context learning robust, but the cost grows quadratically with context length, so practical context is limited even with FlashAttention and other kernel tricks.
Efficient linear recurrent neural networks and state space models like Mamba-2 compress the history into a fixed-size state, so the cost is linear in sequence length. However, this compression loses information over very long sequences, which hurts tasks such as genomic modeling and retrieval over extremely long contexts.
Titans and MIRAS combine these ideas. Attention serves as accurate short-term memory over the current window. A separate neural module provides long-term memory, learns at test time, and is designed so that its update dynamics can be parallelized on accelerators.


Titans, a neural long-term memory that learns at test time
The Titans research paper introduces a neural long-term memory module that is itself a deep multi-layer perceptron rather than a state vector or matrix. Attention is interpreted as short-term memory, since it only sees a limited window, while the neural memory serves as persistent long-term memory.
For each token, Titans defines an associative memory loss:
ℓ(Mₜ₋₁; kₜ, vₜ) = ‖Mₜ₋₁(kₜ) − vₜ‖²
where Mₜ₋₁ is the current memory, kₜ is the key and vₜ is the value for token t. The gradient of this loss with respect to the memory parameters is the “surprise metric”: large gradients correspond to surprising tokens that should be stored, small gradients to expected tokens that can mostly be ignored.
At test time, the memory parameters are updated by gradient descent with momentum and weight decay, which together act as a retention gate and forgetting mechanism. To keep this online optimization efficient, the paper shows how to compute these updates with batched matrix multiplications over sequence chunks, which preserves parallel training over long sequences.
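To make the update concrete, here is a minimal NumPy sketch of one test-time memory step. It assumes a simple linear (matrix) memory instead of the deep MLP used in the paper, and the function name, hyperparameters and exact gating form are illustrative rather than the paper's implementation.

```python
import numpy as np

def titans_memory_step(M, S, k, v, lr=0.1, momentum=0.9, decay=0.05):
    """One test-time update of an associative memory on a (key, value) pair.

    M: (d, d) linear memory (the paper uses a deep MLP; a matrix keeps the sketch short)
    S: (d, d) momentum buffer that accumulates past surprise
    k, v: (d,) key and value vectors for the current token
    """
    residual = M @ k - v                 # prediction error of the memory
    grad = 2.0 * np.outer(residual, k)   # "surprise": gradient of ||M k - v||^2 w.r.t. M
    S = momentum * S - lr * grad         # momentum blends momentary and past surprise
    M = (1.0 - decay) * M + S            # weight decay acts as the forgetting / retention gate
    return M, S

# Toy usage: stream a few random key/value pairs through the memory
rng = np.random.default_rng(0)
d = 16
M, S = np.zeros((d, d)), np.zeros((d, d))
for _ in range(200):
    k, v = rng.normal(size=d), rng.normal(size=d)
    M, S = titans_memory_step(M, S, k, v)
print(float(np.linalg.norm(M @ k - v)))  # recall error on the most recent pair
```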
Architecturally, Titans combines three memory branches in the backbone, exemplified by the Titans MAC (memory as context) variant:
- A core branch that performs standard in-context learning with attention
- A long-term memory branch that learns from the recent sequence at test time
- A persistent memory branch with fixed weights that encodes pre-training knowledge
Long-term memory compresses previous tokens into summaries, which are then passed to attention as additional context. Attention can choose when to read that summary.
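As a rough illustration of this memory-as-context idea, the sketch below assembles the input that attention would see in a MAC-style block. The shapes and names are assumptions made for illustration, not the paper's actual interface.

```python
import numpy as np

def memory_as_context(window_tokens, memory_summary, persistent_tokens):
    """Assemble the attention input for a MAC-style block (shapes are illustrative).

    window_tokens:     (W, d) embeddings in the current attention window
    memory_summary:    (m, d) tokens retrieved from the neural long-term memory
    persistent_tokens: (p, d) learned, input-independent memory tokens
    """
    # Attention sees [persistent | retrieved long-term summary | current window],
    # so it can decide, query by query, whether to read from the summary.
    return np.concatenate([persistent_tokens, memory_summary, window_tokens], axis=0)

# Example with dummy tensors
d = 8
ctx = memory_as_context(np.zeros((32, d)), np.zeros((4, d)), np.zeros((2, d)))
print(ctx.shape)  # (38, 8)
```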
Experimental results for Titans
On language modeling and commonsense reasoning benchmarks such as C4, WikiText and HellaSwag, Titans architectures outperform strong linear recurrent baselines such as Mamba-2 and Gated DeltaNet, as well as Transformer++ models of comparable size. Google Research attributes this to the higher expressive power of deep memory and its ability to maintain performance as context length grows. Deep neural memories with the same parameter budget but greater depth consistently give lower perplexity.
For extreme long-context recall, the research team uses the BABILong benchmark, where facts are scattered across very long documents. Titans outperforms all baselines, including much larger models such as GPT-4, while using far fewer parameters and context windows that scale beyond 2,000,000 tokens.
The research team reports that Titans trains efficiently in parallel and runs inference in near-linear time. The neural memory on its own is slightly slower than the fastest linear recurrent models, but hybrid Titans layers with sliding-window attention remain competitive on throughput while improving accuracy.


MIRAS, a unified framework for sequence models as associative memory
The MIRAS research paper, “It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization,” generalizes this view. It observes that modern sequence models can be seen as associative memories that map keys to values while balancing learning against forgetting.
MIRAS characterizes any sequence model through four design choices (a generic sketch follows the list):
- Memory structure, for example a vector, a linear map, or an MLP
- Attentional bias, the internal loss that defines which associations the memory cares about
- Retention gate, a regularizer that keeps the memory close to its previous state
- Memory learning algorithm, the online optimization rule, often gradient descent with momentum
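A generic sketch of this recipe, assuming a linear memory and illustrative choices for the attentional bias and retention gate, could look like the following. None of the names correspond to the paper's actual code.

```python
import numpy as np

def miras_step(M, k, v, bias_grad, retention, lr=0.1):
    """One generic MIRAS-style online update on a linear memory (illustrative).

    bias_grad(M, k, v): gradient of the chosen attentional-bias loss w.r.t. M
    retention(M_prev, M_new): applies the retention gate, pulling the update
                              back toward the previous memory state
    """
    M_new = M - lr * bias_grad(M, k, v)
    return retention(M, M_new)

# One possible instantiation: squared-error attentional bias + convex-combination retention
mse_grad = lambda M, k, v: 2.0 * np.outer(M @ k - v, k)
keep_close = lambda M_prev, M_new: 0.9 * M_new + 0.1 * M_prev  # weighting is illustrative

rng = np.random.default_rng(1)
d = 8
M = np.zeros((d, d))
k, v = rng.normal(size=d), rng.normal(size=d)
M = miras_step(M, k, v, mse_grad, keep_close)
```

Swapping `bias_grad` or `retention` yields different members of the design space while the outer update loop stays the same.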
Through this lens, MIRAS recovers several existing families:
- Linear recurrent models such as RetNet as Hebbian-style, dot-product-based associative memories
- Delta-rule models such as DeltaNet and Gated DeltaNet as MSE-based memories with value replacement and specific retention gates
- Titans’ long-term memory module as a nonlinear MSE-based memory optimized by gradient descent with momentum, with both local and global retention
Importantly, MIRAS goes beyond the usual MSE or dot-product objectives. The research team constructs new attentional biases and retention gates based on Lₚ norms, the robust Huber loss, divergences on the probability simplex, elastic-net regularization and Bregman divergences.
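As one hedged example of swapping the attentional bias, the gradient of a Huber-style bias on a linear memory could look like this. Compared with the squared-error gradient above, large residuals are clipped, so outlier tokens do not dominate the update. This illustrates the idea only and is not the paper's implementation.

```python
import numpy as np

def huber_bias_grad(M, k, v, delta=1.0):
    """Gradient of an element-wise Huber attentional bias on a linear memory (illustrative).

    The Huber derivative equals the residual inside [-delta, delta] and is clipped
    outside it, so extremely surprising tokens cannot swamp the memory update.
    """
    residual = M @ k - v
    clipped = np.clip(residual, -delta, delta)
    return np.outer(clipped, k)

# Could serve as a drop-in replacement for `mse_grad` in the miras_step sketch above.
```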
From this design space, the research team presented three attention-free models:
- Moneta uses a 2-layer MLP memory with an Lₚ-norm attentional bias and a hybrid norm-based retention gate
- Yaad uses a similar MLP memory with a Huber-loss attentional bias and a Titans-style forget gate
- Memora constrains the memory to a probability-simplex style representation, pairing its attentional bias with a KL-divergence-based retention gate
These MIRAS variants replace the attention blocks in a Llama-style backbone, use depthwise separable convolutions inside the MIRAS layer, and can be combined with sliding-window attention in hybrid models. Training remains parallel by splitting sequences into chunks and computing gradients with respect to the memory state from the previous chunk.
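A simplified sketch of this chunk-wise scheme, again for a linear memory: every token in a chunk takes its surprise gradient with respect to the memory state at the start of that chunk, so the inner work reduces to batched matrix multiplications rather than a strictly sequential per-token loop. Chunk size and learning rate here are illustrative.

```python
import numpy as np

def chunked_memory_updates(keys, values, chunk_size=64, lr=0.01):
    """Chunk-wise training sketch for a linear associative memory (illustrative)."""
    d = keys.shape[1]
    M = np.zeros((d, d))
    for start in range(0, len(keys), chunk_size):
        K = keys[start:start + chunk_size]    # (C, d) keys in this chunk
        V = values[start:start + chunk_size]  # (C, d) values in this chunk
        residual = K @ M.T - V                # all tokens use the chunk-start memory
        grad = 2.0 * residual.T @ K           # sum of per-token surprise gradients
        M = M - lr * grad / len(K)            # one batched update per chunk
    return M

rng = np.random.default_rng(2)
keys, values = rng.normal(size=(512, 16)), rng.normal(size=(512, 16))
M = chunked_memory_updates(keys, values)
```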
In the reported experiments, Moneta, Yaad and Memora match or outperform strong linear recurrent models and Transformer++ on language modeling, commonsense reasoning, and recall-intensive tasks while maintaining linear-time inference.
Key takeaways
- Titans introduces a deep neural long-term memory that learns at test time. Using gradient descent on an L2 associative memory loss, the model selectively stores only surprising tokens, while updates are parallelized on the accelerator.
- Titans links attention with neural memory for long context, combining core, long-term memory and persistent memory branches so that attention handles short-range accuracy while the neural module retains information over sequences of more than 2,000,000 tokens.
- Titans outperforms strong linear RNN and Transformer++ baselines, including Mamba-2 and Gated DeltaNet, on language modeling and commonsense reasoning benchmarks at comparable parameter scales, while remaining competitive on throughput.
- On extreme long-context recall benchmarks such as BABILong, Titans achieves higher accuracy than all baselines, including large attention-based models such as GPT-4, while using fewer parameters and retaining efficient training and inference.
- MIRAS provides a unified framework that treats sequence models as associative memories, defined by memory structure, attentional bias, retention gate and optimization rule, and yields new attention-free architectures such as Moneta, Yaad and Memora that match or outperform linear RNNs and Transformer++ on long-context and reasoning tasks.

Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.
