Attention is NOT All You Need
The "Attention Heads" in Transformer Models Do Not Attend to Anything
Welcome back to the Lamb_OS Substack! As always, I am thankful to have my subscribers and readers stop by! If you are a regular reader of @Lamb_OS, then you – the readers – know you are the only reason I do this. So as always, thank you for visiting!
Let me begin by introducing myself as Dr. William A. Lambos. I call myself a computational neuroscientist, and I’ve been involved with the AI field (one way or another) since about 1970. As far as credentialing goes, see the footnote below if interested1. I write a lot about AI, but not exclusively. When addressing a topic, I take pains to ensure my perspective is highly informed. I draw from many areas of study, and my conclusions, beliefs, and predictions often fall outside current or mainstream thinking. But these same beliefs are grounded in 50 years of study and rigorous cross-training across multiple fields. See my previous screeds on this Substack to learn more, and judge for yourself.
Please Subscribe today if you have not already done so. This Substack is free! So, please subscribe for whatever reason might appeal to you. But I’d hope you do so for the value it offers.
I’ve been wanting to write this post for some time now. You see, ‘Attention’ is one of my favorite subjects. One could claim that attention is among the most important functions of the brain, and this has some truth to it. But the construct of attention actually refers to several rather different classes of brain activity, with purposes that overlap only somewhat. It is therefore not surprising that attention, in the human sense, is among the least understood aspects of global brain function.
Here are three areas in which I deal with the construct of attention almost daily, between clinical practice and neuroscience:
Theoretically, the brain mechanisms underlying human (and animal) attention are fascinating. Many readers may be unaware of the close and intertwined relation between attention and arousal, for example. Or that attention requires no consciousness. Or that while many areas of the brain have the ability to influence focus, in fact, the brain spends far more time and energy inhibiting the switching of attention. Otherwise, we would stop what we were doing every time the ventilation system clicked on, or a neighbor’s car started — stimuli we now ignore so effectively that we become unaware of them.
Clinically, I’ve been diagnosing and treating patients who suffer from AD/HD (among many other issues) for many decades. The longer I do so, the more I am convinced that such folks tend to have far less in common with one another than do patients diagnosed with other neurocognitive or affective (emotional) disorders. What’s so interesting about this group is that when assessed with neuropsych tests of cognitive functioning, two-thirds or more cannot be diagnosed with a disorder of attention based on criteria in DSM-5-TR. Rather, most of these individuals have difficulties, sometimes disabling, in “executive functioning”: they have difficulty making decisions about starting, stopping, and switching behaviors, and with planning, organization, judgement, self-regulation and social interactions.
Socioeconomically, ‘Attention’ has become the value currency of the Internet age. It is fungible and easily converted to cash via algorithmic ad placements and the sale of our personal data. All social media, and a large part of most other economic sectors, provide us something in exchange for our “eyeballs” and mouse clicks. Since (for most of us) nearly everything we do on or in our computers, phones, TVs, cars, homes and workplaces is tracked or recorded, the trail of data crumbs each person continuously creates has value. Monetary value, and more of it than we ever would have imagined. If we wanted to eliminate the attention economy, all we would have to do is break this value chain. Of course, then much of the free stuff we like would have to be paid for, or go away.
So, yes, ‘attention’ is a deep rabbit hole indeed!
The construct of ‘Attention’ in GPTs, conversely, refers to a series of parallel processing operations — on four of the five main data structures used to represent information added to the model. I am not going to describe these individually - they are all vectors or matrices of numbers associated with the words (for an LLM), pixels (for an image diffusion model), or whatever else the transformer model has been designed to handle2.
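For readers who want to see just how mechanical these operations are, here is a minimal sketch of the scaled dot-product step at the core of the mechanism. The variable names, toy dimensions, and use of NumPy are mine, purely for illustration; the real Transformer adds multiple heads, per-head learned projections, positional encodings, and much more.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (num_tokens, d) matrices derived from the token embeddings."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                             # pairwise token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # context-weighted mix of value vectors

# Toy example: 4 tokens, 8-dimensional embeddings (numbers are illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                                   # stand-in for token embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))      # stand-ins for learned projections
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                                              # (4, 8): one context-mixed vector per token
```

Every line is ordinary matrix arithmetic over those data structures; nothing in it selects, ignores, or “focuses” in the biological sense.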
But this is where any similarity ends. Before we look any further at the distinction — and it is an important one — I want to state that the term attention, as used by data science engineers, bears little resemblance to biological ‘attention’ beyond a casual semantic convenience. When we state that a person is “paying attention” to something, or that a GPT is using a “multi-headed attention mechanism,” we are saying that in both cases, the decision to be made as to the next processing or action/output step takes the context of the data into account.
Here’s today’s example. In 2017, a paper was published with the title “Attention Is All You Need.” It proposed the architecture of Transformer models and, within two years, led to the first GPTs. The authors offered that:
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. (Emphasis added).
What’s that now? An “attention mechanism?” If by “attention” the authors meant “input/output token processing enhanced by representing the token’s context in the input stream, and sped up by parallelism,” I get it! And that is exactly what they mean. So transformer models are efficient because they can use information embedded in the input data (“context”), and do so while obviating sequential processing bottlenecks. But not because they are “attending” to anything. They are not, because this would be impossible.
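To make the “no sequential bottleneck” point concrete, here is a toy contrast (again illustrative only; the shapes, names, and random numbers are mine, and neither snippet is a full architecture): a recurrent network must consume tokens one step at a time, whereas the attention computation touches every token pair in a single matrix product that parallel hardware can evaluate all at once.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))          # 4 tokens, 8-dimensional embeddings (toy values)
W = rng.normal(size=(8, 8))          # stand-in for a recurrent weight matrix

# Recurrent-style processing: an inherently sequential loop,
# because step t cannot begin until step t-1 has produced h.
h = np.zeros(8)
for x_t in X:
    h = np.tanh(W @ h + x_t)

# Attention-style processing: every pairwise token interaction is computed
# in one matrix product, with no step-by-step dependency to wait on.
scores = X @ X.T / np.sqrt(X.shape[-1])
print(h.shape, scores.shape)         # (8,) vs. (4, 4)
```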
Biological creatures, of course, do attend selectively, by making use of focus: ignoring inputs with lower salience in order to devote additional resources to the most relevant aspects of the input stream. The terms ‘salience’, ‘relevant’, and even ‘resources’ imply intentionality and agency. None of these have been shown to be derivable via computation.
As with the constructs of intelligence and knowledge, using “attention” to describe a novel way to reduce prediction error (“entropy”) in GPTs perpetuates the AGI Fallacy and hurts the field of AI. In fact, my desire to address the topic of attention in this Substack owes no small part to how the imprecise use of language has turned the field of Artificial Intelligence into a 70-year-long dumpster fire. Just as they co-opted the slippery concept of ‘intelligence’ 65 years earlier (for which everyone has paid dearly), AI researchers and companies have now helped themselves to the concept of ‘attention’. Why not? The more people who can be convinced to believe such anthropomorphisms, the easier it is to hype machine learning models as the precursors to AGI.
This never ends well. It won’t this time either.
GenAI appears to be down and on the way out (as I’ve said it would be). On the bright side, if it were actual AI, it would be freaking out right now. Instead, only people are.
Who needs Artificial Intelligence when we have an unlimited supply of Human Idiocy?
That’s it. Thanks for reading!
I hold a postdoctoral certification in clinical neuropsychology and a license to practice in California and Florida. I’ve been coding since mainframes were the only accessible computers and LISP was the ‘lingua franca’ of AI (ca. 1970-81), but when the Zilog microprocessors appeared (anyone remember the Z-80?), I learned to code in machine language (‘assembler code’). Finally, I hold Masters degrees — one quite recent — in computation and data science.
If interested, see this wonderful YouTube video about the implementation of so-called attention in transformer models by 3Blue1Brown.
> So transformer models are efficient because... But not because they are “attending” to anything. They are not, because this would be impossible.
Why is "attending" impossible? You simply state this, but don't explain why. What do you mean by "attend"?
Also, as far as I know, the transformer models use the word "attention" by learning a contextual filter that selects only for relevant features in the sequence and uses those to learn the context of the current token. In that sense, they use the word "attention" because the model learns a computational definition of saliency for the current ML task.
> Biological creatures, of course, do attend selectively, by making use of focus: ignoring inputs with lower salience in order to devote additional resources to the most relevant aspects of the input stream.
I'm with you so far.
> The terms ‘salience’, ‘relevant’, and even ‘resources’ imply intentionality and agency.
Now you lost me. Is there a psychological definition of these terms that leads to these implications? Perhaps it would be helpful if you defined all five of these terms in the way you are using them. There is significant clash with their use in the computer science literature.
> None of these have been shown to be derivable via computation.
I know of many such computational derivations, but I'm using my AI background to understand your words. Again, it would be helpful if you defined your terms in the way you are using them.
I do a lot of cross-disciplinary reading and it's always very frustrating when words have overloaded meanings in different disciplines. Especially if their meanings are similar but with subtly different connotations only understandable with deep study of the field.