Several experts have argued that the lack of long-term memory in LLMs, where each interaction essentially starts from scratch, is the biggest impediment to their large-scale deployment. Google has now proposed a way to help change that.
In two newly published research papers, Google has introduced Titans, a novel architecture, and MIRAS, a theoretical framework. Together they represent a fundamental shift in how AI models handle context and memory, combining the processing speed of recurrent neural networks with the accuracy of transformers while enabling models to learn and adapt in real time as data streams in.

The Context Problem
Traditional transformer models have revolutionized AI through their attention mechanisms, which allow models to reference earlier inputs when processing new information. However, this approach comes with a significant limitation: computational costs grow quadratically with sequence length, making it impractical to process extremely long contexts such as full documents or genomic sequences.
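To make that scaling issue concrete, the toy single-head attention below (written in NumPy purely for illustration; it is not code from either paper) materializes an n-by-n score matrix, so both compute and memory grow quadratically with the number of tokens n.

```python
# Toy single-head self-attention. Illustrative sketch only, not code from the
# Titans or MIRAS papers.
import numpy as np

def naive_self_attention(X, Wq, Wk, Wv):
    """X has shape (n_tokens, d_model); returns outputs of the same shape."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # The score matrix is (n_tokens x n_tokens): compute and memory grow
    # quadratically as the context gets longer.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```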
While the research community has explored alternatives such as linear recurrent neural networks and state space models like Mamba-2, these solutions compress context into fixed-size representations that struggle to capture the rich information contained in very long sequences.

How Titans Works
Titans addresses this challenge through what Google calls “test-time memorization”—the ability to maintain long-term memory by updating the model’s parameters while it’s actively running, without requiring offline retraining.
The architecture introduces a neural long-term memory module that functions as a deep neural network, specifically a multi-layer perceptron. Unlike traditional RNNs that use fixed-size vectors or matrices for memory, this approach provides significantly higher expressive power, allowing the model to summarize large volumes of information without losing important context.
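As a rough illustration of that idea, the sketch below treats a small MLP's weights as the memory itself and updates them with a gradient step at inference time whenever a new key-value association should be stored. It is written in PyTorch, and the layer sizes, activation, and loss are assumptions for the sake of the example, not the papers' exact design.

```python
# Sketch of a neural long-term memory: the MLP's parameters ARE the memory,
# and they are updated while the model is running (test-time memorization).
# Illustrative assumptions throughout; not the published architecture.
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        # Memory lives in the MLP's weights, not in a fixed-size vector or matrix.
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))

    def read(self, key: torch.Tensor) -> torch.Tensor:
        return self.net(key)

    def write(self, key: torch.Tensor, value: torch.Tensor, lr: float = 1e-2) -> torch.Tensor:
        # Associative loss: how far the current memory is from mapping key -> value.
        loss = ((self.net(key) - value) ** 2).mean()
        grads = torch.autograd.grad(loss, list(self.net.parameters()))
        with torch.no_grad():
            for p, g in zip(self.net.parameters(), grads):
                p -= lr * g  # one memorization step taken at inference time
        return loss.detach()
```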
The key innovation lies in what Google researchers call the “surprise metric.” Drawing inspiration from human psychology—where we easily forget routine events but remember unexpected ones—Titans uses an internal error signal to detect when new information significantly differs from what the model currently remembers.
When surprise is low, such as encountering an expected word in context, the model can skip permanently storing that information. When surprise is high, like encountering an anomalous data point in a financial report, the model prioritizes that information for permanent storage in its long-term memory.
Google has refined this mechanism with two critical elements: momentum, which ensures the model captures relevant subsequent information even if individual tokens aren’t surprising, and adaptive weight decay, which acts as a forgetting gate to manage the finite capacity of memory when dealing with extremely long sequences.
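Continuing the sketch above, one way to express this is to treat the gradient of the associative loss as the surprise signal, smooth it with momentum so that the tokens following a surprising one are also captured, and decay the memory weights slightly before each write so old material can fade. The coefficients and the exact form below are illustrative assumptions, not the published update rule.

```python
# Continues the NeuralMemory sketch above (requires torch and a NeuralMemory instance).
# High associative loss == high "surprise"; momentum carries surprise across nearby
# tokens; the weight-decay factor plays the role of the forgetting gate.
def surprise_update(memory, momentum, key, value,
                    lr: float = 1e-2, eta: float = 0.9, alpha: float = 1e-3):
    loss = ((memory.net(key) - value) ** 2).mean()          # surprise signal
    grads = torch.autograd.grad(loss, list(memory.net.parameters()))
    with torch.no_grad():
        for p, g, m in zip(memory.net.parameters(), grads, momentum):
            m.mul_(eta).add_(g, alpha=-lr)   # momentum over the surprise signal
            p.mul_(1 - alpha)                # adaptive weight decay: forget a little
            p.add_(m)                        # then write the surprising information
    return loss.detach()

# Usage sketch: one momentum buffer per memory parameter.
# momentum = [torch.zeros_like(p) for p in memory.net.parameters()]
```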
The MIRAS Framework
MIRAS provides the theoretical foundation for understanding and generalizing these approaches. Rather than viewing different AI architectures as distinct systems, MIRAS treats them as different methods of solving the same fundamental problem: efficiently combining new information with old memories without losing essential concepts.
The framework defines sequence models through four key design choices: memory architecture, attentional bias, retention gate, and memory algorithm. Importantly, MIRAS moves beyond the mean squared error paradigm that virtually all existing sequence models rely on, opening up a richer design space for creating novel architectures.
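As a loose illustration of that design space (the names and options below are stand-ins, not the papers' exact taxonomy), the four choices can be treated as interchangeable components. Swapping the usual mean-squared-error attentional bias for a more outlier-robust loss, while keeping everything else fixed, already yields a different model family.

```python
# Sketch of the four-axis design space. Names and option values are illustrative
# placeholders, not the MIRAS paper's definitions.
from dataclasses import dataclass
from typing import Callable
import torch

@dataclass
class SequenceModelDesign:
    memory_architecture: str     # e.g. "matrix" or "mlp"
    attentional_bias: Callable   # loss measuring how well memory fits new data
    retention_gate: Callable     # how much old memory to keep at each step
    memory_algorithm: str        # e.g. "gradient_descent" or "gradient_descent_momentum"

mse_bias = lambda pred, target: ((pred - target) ** 2).mean()                    # the classic choice
huber_bias = lambda pred, target: torch.nn.functional.huber_loss(pred, target)   # more robust to outliers

classic = SequenceModelDesign("matrix", mse_bias,  lambda w: 0.99 * w, "gradient_descent")
robust  = SequenceModelDesign("mlp",   huber_bias, lambda w: 0.99 * w, "gradient_descent_momentum")
```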
Using MIRAS, Google created three attention-free model variants: YAAD, which is more robust to outliers; MONETA, which explores complex mathematical penalties for more stable long-term memory; and MEMORA, which achieves optimal memory stability by constraining its memory to act like a strict probability map.
Performance Results
Google’s testing shows impressive results across multiple benchmarks. In language modeling tasks using standard datasets like C4 and WikiText, Titans and MIRAS variants consistently demonstrated higher accuracy compared to leading architectures including Transformer++, Mamba-2, and Gated DeltaNet.

The most significant advantage appears in extreme long-context scenarios. On the BABILong benchmark, which requires reasoning over facts scattered across extremely long documents, Titans outperformed all baselines, including GPT-4, despite having substantially fewer parameters. The architecture also scales effectively to context windows of more than 2 million tokens.
Ablation studies confirmed that the depth of the memory architecture is crucial—deeper memory modules consistently achieve better performance and maintain it as sequence length increases significantly.
Implications for AI Deployment
The introduction of Titans and MIRAS marks a significant step toward solving one of the most persistent challenges in AI deployment. By letting models actively learn and update their core knowledge as data streams in, rather than compressing everything into static states, these approaches could open up new applications in areas that require long-context understanding.
Google has validated the architecture’s versatility beyond text, successfully testing Titans on genomic modeling and time-series forecasting. The models maintain efficient, parallelizable training and fast linear inference speeds, making them practical for real-world deployment.
As organizations increasingly look to deploy AI systems that can handle complex, long-running interactions and maintain context across extended sessions, architectures like Titans that combine RNN efficiency with transformer-level expressive power may prove essential to moving beyond the current limitations of large language models.