Original Reddit post

I have spent weeks trying to understand the memory bottlenecks of long-context and long-generation inference. I kept seeing post-transformer ideas, and they all converge on the same theme: not just making attention faster, but changing what the model uses as working memory. I have written down the core derivation on one handwritten sheet and labeled it Eqn A through Eqn E, so the discussion here can stay free of maths.

Here is the mental model I mapped out. In autoregressive inference, memory is read through attention, usually with a softmax non-linearity. Generating the next token requires comparing the current query against all previous keys to select the relevant previous values, which forces the model to keep an explicit list of past key and value vectors. That growing list is the famous KV cache. See Eqn A.

There is excellent work on reducing the cost inside the softmax paradigm. Examples include reducing how many KV heads are stored, as in Grouped-Query Attention (Ainslie et al. 2023); compressing KV representations, as in Multi-head Latent Attention from DeepSeek-V2 (DeepSeek-AI 2024); and limiting which past tokens are read. These help a lot, but they still keep the same underlying memory object: an explicit list of past token states. These improvements are not enough: LLM costs keep scaling and performance remains stuck at the 1M-token wall. Maybe a fundamental change in how memory operates is required?

The question that keeps me awake at night: should working memory be a growing list at all? Fixed-size memory approaches say no. A classic starting point is linear attention, as in “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention” (Katharopoulos et al. 2020). If you replace the softmax weighting with a linear formulation, you can reassociate the computation so that the history is accumulated into a fixed-size state. See Eqn B and Eqn C.
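For anyone who prefers code to a handwritten sheet, here is a minimal NumPy sketch of that reassociation. It is only an illustration under my own assumptions and may not match Eqn B and Eqn C term for term: there is no feature map and no normalizer, and the names `causal_linear_attention_list`, `causal_linear_attention_state`, `Q`, `K`, `V` and `S` are made up for this example. The first function keeps every past key and value (the "growing list" view), the second folds them into one fixed-size matrix.

```python
import numpy as np

def causal_linear_attention_list(Q, K, V):
    """Unnormalized causal linear attention that keeps every past key and value."""
    T = Q.shape[0]
    out = np.zeros((T, V.shape[1]))
    for t in range(T):
        scores = K[: t + 1] @ Q[t]      # one score per stored token, shape (t+1,)
        out[t] = scores @ V[: t + 1]    # score-weighted sum of the stored values
    return out

def causal_linear_attention_state(Q, K, V):
    """Same computation, reassociated into a fixed-size state S of shape (d_k, d_v)."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.zeros((T, d_v))
    for t in range(T):
        S += np.outer(K[t], V[t])       # write: one rank-1 update per token
        out[t] = Q[t] @ S               # read: query the accumulated state
    return out

# Toy sizes, chosen arbitrarily for the illustration.
rng = np.random.default_rng(0)
T, d_k, d_v = 64, 16, 8
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))

out_list = causal_linear_attention_list(Q, K, V)
out_state = causal_linear_attention_state(Q, K, V)
max_diff = float(np.abs(out_list - out_state).max())
print("max abs difference:", max_diff)
```

Both functions return the same outputs up to floating-point error. The list version holds `T * (d_k + d_v)` numbers by the last step, while the state version holds `d_k * d_v` numbers no matter how long the sequence gets.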
This produces a recurrent memory matrix that is updated once per token and read out using the current query. See Eqn D. The good thing is that the working memory object becomes constant-sized with respect to sequence length. This opens the door to the state space model (SSM) or fast weight programmer (FWP) literature.

But Eqn E is the catch: when you query a fixed-size state, you recover the target term plus cross terms from every other stored item. Those cross terms are not inherently bad. If two items are unrelated, a well-behaved system can make their keys close to orthogonal, so the cross term is approximately zero; if they are related, a similarity-weighted contribution is exactly the associative retrieval you want. IMO, the problem is capacity: in a finite key dimension you can only fit so many near-orthogonal keys, so once you store too many items, the cross terms can no longer stay small and retrieval degrades from interference. That is why naive linear attention often struggles on associative recall as more items are stored. Currently, it seems that the most successful approaches integrating SSM-like layers still hybridize them with standard attention layers to preserve recall capacity.

On the SSM side, Dragon Hatchling (BDH) moves linear attention into a high-dimensional (~10^11) “neuron activation” space, interprets the state as a connectivity or synaptic memory object, and uses low-rank factors to stay GPU-friendly. This seems like a smart way to preserve the recall power and expressivity of softmax attention: we know that a non-linear operation in a low-dimensional space (~10^3) can be expressed as a linear function in a high-dimensional space, as we do for kernel methods!

Do you expect the field to converge on softmax attention with increasingly aggressive KV cache engineering, settle on hybrids, or eventually shift toward architectures where the basic working memory object is a fixed-size state rather than an explicit KV cache?
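If you want to poke at the capacity argument numerically, here is a small NumPy sketch. It is a toy under my own assumptions, not the Eqn E derivation itself: keys are random unit vectors, values are random Gaussians, the state is a plain sum of outer products, and `recall_error`, `d_k` and `d_v` are names invented for this example. It stores `num_items` key-value pairs in one fixed-size matrix, queries with each stored key, and measures how large the cross terms are relative to the true values.

```python
import numpy as np

def recall_error(num_items, d_k=64, d_v=32, seed=0):
    """Relative size of the cross terms when reading back stored items."""
    rng = np.random.default_rng(seed)
    K = rng.standard_normal((num_items, d_k))
    K /= np.linalg.norm(K, axis=1, keepdims=True)   # unit-norm keys, so k_j . k_j = 1
    V = rng.standard_normal((num_items, d_v))

    S = K.T @ V            # fixed-size state: sum over i of outer(k_i, v_i)
    retrieved = K @ S      # row j is v_j + sum over i != j of (k_j . k_i) * v_i
    cross = retrieved - V  # what is left after removing the target term
    return float(np.linalg.norm(cross) / np.linalg.norm(V))

# Same key dimension, more and more stored items.
errors = {n: recall_error(n) for n in (4, 64, 1024)}
for n, err in errors.items():
    print(f"items={n:5d}  relative interference={err:.2f}")
```

With a handful of items the random keys are nearly orthogonal and the interference is small. Once the number of items reaches or exceeds the key dimension, the cross terms are as large as the signal or larger. Trained keys would not be random, so treat this as intuition for the trend, not as a measurement of any real model.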

Originally posted by u/dank_philosopher on r/ArtificialInteligence