3 Comments

> So transformer models are efficient because... But not because they are “attending” to anything. They are not, because this would be impossible.

Why is "attending" impossible? You simply state this, but don't explain why. What do you mean by "attend"?

Also, as far as I know, transformer models use the word "attention" because they learn a contextual filter that selects only the relevant features in the sequence and uses those to build the context of the current token. In that sense the term fits: the model learns a computational definition of saliency for the current ML task.
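For concreteness, here's a minimal NumPy sketch of the standard scaled dot-product attention I have in mind; the dimensions and random projection matrices are illustrative only:

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Toy single-head attention: the weights are computed from the input itself."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise relevance scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: the "contextual filter"
    return weights @ V                               # each token becomes a weighted mix of values

# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)   # (5, 8)
```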

> Biological creatures, of course, do attend selectively, by making use of focus: ignoring inputs with lower salience in order to devote additional resources to the most relevant aspects of the input stream.

I'm with you so far.

> The terms ‘salience’, ‘relevant’, and even ‘resources’ imply intentionality and agency.

Now you lost me. Is there a psychological definition of these terms that leads to these implications? Perhaps it would be helpful if you defined all five of these terms in the way you are using them. There is significant clash with their use in the computer science literature.

> None of these have been shown to be derivable via computation.

I know of many such computational derivations, but I'm using my AI background to understand your words. Again, it would be helpful if you defined your terms in the way you are using them.

I do a lot of cross-disciplinary reading and it's always very frustrating when words have overloaded meanings in different disciplines. Especially if their meanings are similar but with subtly different connotations only understandable with deep study of the field.

author

Jacob -

Thanks for your comments and questions (which I'll do my best to answer), and for subscribing. It's appreciated.

Also, while the groundwork for my ideas is derived from many sources in the life sciences (including neuroscience) and from data science (especially predictive modeling and neural nets), the "bigger picture" ideas about attention mechanisms in silicon vs. "meat-puters" are mine, as are any errors.

Now to offer answers to your questions:

- By "attend", I mean to orient towards, or away from, something that is sensed via receptors ("a portion of its input stream") that has "valence" or "salience" (near synonyms) for the attender/system. There is no salience in the input stream (because valence and salience are *subjective*, and determined by the nervous system and the DNA that created it.

- There is, of course, contextual and positional information about the current token vis-à-vis other tokens in LLMs, and the embeddings are adjusted by backpropagation during training. But attention involves changing behavior ("node values") based on, and only on, the organism's determination of the +/- valence of an input. See "A Brief History of Intelligence" by Max Bennett for volumes of experimental data and theory about this distinction.

- Therefore, from the point of view of biology, the ability to differentially react to inputs based on their probable value to the organism is "attention".

- The "value matrix/vector" for biological organisms exists *before* the attention mechanism is "computed", whereas in GPTs the weights of the "attention" matrix must be computed first.

- Why does this matter? Because adjusting the embeddings of the input tokens by using context already contained in the input stream is simply not selective attention, or any other kind of attention. It is, rather, entropy reduction which was labeled "attention" by folks who don't understand what attention is.

- I suppose the larger message is that data science should strenuously avoid anthropomorphisms from human cognitive functioning. It causes people to think AGI can arise from non-cellular substrates and without DNA, which is what I state to be impossible. Which leads these same folks to think that generalized intelligence is "a few years away", which has been claimed since Shannon and Turing developed information theory. But I maintain it's not possible. If you disagree, Jacob, then you are entitled to your opinion. You sound like a pretty smart guy!
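To make that contrast concrete, here's a deliberately toy sketch of my own, for illustration only: the "valence" vector is hypothetical, standing in for whatever the DNA and nervous system supply in advance, while the transformer half is the standard query/key computation over the input itself.

```python
import numpy as np

rng = np.random.default_rng(1)
inputs = rng.normal(size=(5, 8))          # five sensed "inputs", eight features each

# Biological-style: a valence vector exists *before* anything is computed over
# the input; it simply gates which inputs get resources. (Values are hypothetical.)
valence = np.array([0.0, 1.0, 0.0, -1.0, 0.0])
attended = inputs[np.abs(valence) > 0]    # orient toward (or away from) valent inputs

# Transformer-style: the "attention" weights do not exist until they are
# computed from the input stream itself (queries against keys).
Wq, Wk = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
scores = (inputs @ Wq) @ (inputs @ Wk).T / np.sqrt(8)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

print(attended.shape, weights.shape)      # (2, 8) (5, 5)
```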

Bill

- I wrote a lengthier post about this here.


> - I suppose the larger message is that data science should strenuously avoid anthropomorphisms from human cognitive functioning. It causes people to think AGI can arise from non-cellular substrates and without DNA, which is what I state to be impossible. Which leads these same folks to think that generalized intelligence is "a few years away", which has been claimed since Shannon and Turing developed information theory. But I maintain it's not possible. If you disagree, Jacob, then you are entitled to your opinion. You sound like a pretty smart guy!

I'm in agreement. I'm not a subscriber to the AGI hype. I'm just trying to isolate, understand, and reconstruct computational principles found in the brain. I don't even do the deep learning stuff. I'm focused on computational algorithms inspired by cortical columns.

It's an entirely different beast if you treat the cortical column, rather than the single neuron, as the fundamental computational unit. Neurons get demoted to constituents of a larger assemblage, and dendrites become the first-class computational units. Synapses are simple gates, with no weights attached to them. Neurons only activate if they are selected in a winner-take-all competition. It's a different way of doing things, but it also leads to some interesting phenomena and ways to think about representation and computation.
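To give a flavor of what I mean, here's a toy sketch of my own framing (not any particular published model): synapses are binary gates, a dendritic segment is driven by how many of its gated inputs are active, and only the top-k most-driven neurons in the column survive the winner-take-all step. All the sizes and thresholds are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
n_neurons, n_segments, n_inputs = 32, 4, 128
k_winners, segment_threshold = 4, 8

# Synapses as simple on/off gates: each dendritic segment samples a sparse
# subset of the input lines, with no weights attached.
synapses = rng.random((n_neurons, n_segments, n_inputs)) < 0.1   # boolean gates

def column_step(active_inputs):
    """active_inputs: boolean vector marking which input lines are currently on."""
    overlaps = (synapses & active_inputs).sum(axis=-1)   # active gated inputs per segment
    neuron_drive = overlaps.max(axis=-1)                 # best dendritic segment per neuron
    # Winner-take-all: only the k most-driven neurons are allowed to activate,
    # and only if their best segment clears the threshold.
    winners = np.argsort(neuron_drive)[-k_winners:]
    active = np.zeros(n_neurons, dtype=bool)
    active[winners] = neuron_drive[winners] >= segment_threshold
    return active

some_input = rng.random(n_inputs) < 0.2
print(column_step(some_input).sum())                     # at most k_winners neurons fire
```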

> There is no salience in the input stream (because valence and salience are *subjective*, and determined by the nervous system and the DNA that created it).

I'm in agreement here that the notion of salience is subjective to the computational agent. I've worked with objective measures in the past and they've had some traction. However, I think we've gotten all we can out of them.

The key challenge is how do you build this definition of salience? How do you measure and keep track of this subjective metric? How do you align it with your computational goal?

There are many heuristics for achieving it. There's the "nature" approach, where you design a metric that is predisposed to your desired tasks, kind of like how infants are predisposed to recognize faces before anything else.

Then there's the "nurture" aspect, where you train your system to discriminate the salient features, kind of like how a musician trains to hear all of the subtle acoustic features of music that are opaque to me, who just appreciates music on a macro-superficial level.

The "nature" approach is really just a side-effect of how you design the computational system. Still, it's important to measure and formalize what assumptions you've made in the design and what information has been thrown out or demoted. The "nurture" approach is a dynamic learning process and can see its analogue in feature-selection algorithms. However, the key difference is that you need to let the task-oriented machinery drive the selection, training and tuning as it tries to figure things out. That's a whole other problem.
