we've been trying to solve memory in AI systems since the 50s: first symbolic memory (cognitive architectures like SOAR), then RNNs and LSTMs with gates to control information flow, then the legendary "Attention is all you need" paper in 2017, which made the context window itself the memory (-> context rot, lost in the middle), and now external memory banks (RAG, graphs, scratchpads). catastrophic forgetting persists, and labs are racing to solve it.

  • memory is not a bag of facts, it's layered control: the point isn't just storing information, it's being selective about what to remember and what to forget. pruning and forgetting matter just as much as retention (see the first sketch after this list)
  • memory != continual learning: learning isn't just remembering, and it's never static. LLM weights are frozen after pretraining, so the model is limited to in-context learning and lacks real adaptation or "neuroplasticity"
  • formation of long-term memory involves two consolidation processes: online (synaptic) consolidation soon after learning, where the new trace is stabilised at the synapse level, then offline (systems) consolidation, where recently encoded patterns are replayed via sharp-wave ripples in the hippocampus and gradually transferred from short-term to long-term (neocortical) storage (second sketch below)
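
a minimal sketch of the "layered control" idea, not any real library: a made-up `SelectiveMemory` store with invented knobs (`capacity`, `write_threshold`, `half_life`) where writes are gated by salience, stored items decay with age, and pruning evicts whatever is no longer worth keeping.

```python
import time
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    salience: float      # how much we care about keeping this
    created: float       # unix timestamp of the write


class SelectiveMemory:
    """toy layered memory: gate on write, decay with age, prune on overflow.
    all names and knobs here are illustrative, not from any framework."""

    def __init__(self, capacity: int = 128, write_threshold: float = 0.3,
                 half_life: float = 3600.0):
        self.capacity = capacity
        self.write_threshold = write_threshold
        self.half_life = half_life
        self.items: list[MemoryItem] = []

    def write(self, text: str, salience: float) -> bool:
        # gate: don't store everything, only what clears the salience bar
        if salience < self.write_threshold:
            return False
        self.items.append(MemoryItem(text, salience, time.time()))
        self._prune()
        return True

    def _decayed(self, item: MemoryItem) -> float:
        # forgetting: salience decays exponentially with age
        age = time.time() - item.created
        return item.salience * 0.5 ** (age / self.half_life)

    def _prune(self) -> None:
        # pruning: over capacity -> keep only the currently most salient items
        if len(self.items) > self.capacity:
            self.items.sort(key=self._decayed, reverse=True)
            del self.items[self.capacity:]

    def recall(self, k: int = 5) -> list[MemoryItem]:
        # read path: the k memories still worth surfacing
        return sorted(self.items, key=self._decayed, reverse=True)[:k]
```

exponential decay is just one forgetting rule; recency-weighted LRU or a learned relevance score would slot into `_decayed` the same way.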
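and a loose ML analogy for the two consolidation phases, not a model of the biology: fresh episodes land in a small fast buffer ("online"), and a periodic replay pass samples from it into a durable long-term store ("offline"). the names (`TwoStageMemory`, `encode`, `consolidate`) are made up for illustration.

```python
import random


class TwoStageMemory:
    """toy two-stage store: a fast volatile buffer plus a slow durable one,
    with a replay step standing in for offline (systems) consolidation."""

    def __init__(self, buffer_size: int = 32, replay_samples: int = 8):
        self.buffer_size = buffer_size
        self.replay_samples = replay_samples
        self.short_term: list[str] = []   # fast, volatile
        self.long_term: list[str] = []    # slow, durable

    def encode(self, episode: str) -> None:
        # online (synaptic) phase: stabilise the fresh trace in the fast buffer
        self.short_term.append(episode)
        if len(self.short_term) > self.buffer_size:
            self.short_term.pop(0)        # oldest un-consolidated trace is lost

    def consolidate(self) -> None:
        # offline (systems) phase: replay a sample of recent traces,
        # commit them to long-term storage, then clear the buffer
        replayed = random.sample(
            self.short_term, min(self.replay_samples, len(self.short_term)))
        self.long_term.extend(replayed)
        self.short_term.clear()


mem = TwoStageMemory()
for i in range(50):
    mem.encode(f"episode {i}")
mem.consolidate()                         # e.g. run during "sleep"/idle time
print(len(mem.long_term), "episodes survived consolidation")
```

the point of the toy: whatever never gets replayed before the buffer overflows is simply gone, which is the closest software analogue to why interrupted consolidation looks like forgetting.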