Attention May Be All We Need… But Why?
Image by Author | Ideogram
Introduction
A lot (if not nearly all) of the success and progress made by many generative AI models nowadays, especially large language models (LLMs), is due to the stunning capabilities of their underlying architecture: an advanced deep learning-based architectural model called the transformer. More specifically, one of the components inside the intricate transformer architecture has been pivotal in the success of these models: the attention mechanism.
This article takes a closer look at the attention mechanism of transformer architectures, explaining in simple terms how it works, how it processes and builds an understanding of text, and why it constitutes a substantial advance over previous approaches to understanding and generating language.
Before and After Attention Mechanisms
Before the original transformer architecture revolutionized the machine learning and computational linguistics communities in 2017, previous approaches to processing natural language were predominantly based on recurrent neural network architectures (RNNs). In these models, text sequences like the one shown in the image below were processed in a purely sequential fashion, one token or word at a time.
But there is a caveat: while some information from recently processed tokens (the few words preceding the one currently being processed) can be retained in the network's hidden state, its so-called "memory," this memorizing capability is limited. As a result, when processing longer, more complex sequences of text, long-range relationships between parts of the language are missed due to an effect similar to memory loss.

How recurrent architectures like RNNs process sequential text data
Image by Author
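To make this sequential behavior concrete, the toy PyTorch sketch below (using nn.RNN with made-up dimensions, chosen purely for illustration) feeds a sequence one token at a time, with everything the model remembers squeezed into a single fixed-size hidden state:

import torch
import torch.nn as nn

seq = torch.rand(5, 1, 4)  # 5 tokens, batch size 1, 4-dimensional embeddings

rnn = nn.RNN(input_size=4, hidden_size=8)  # the hidden state acts as the network's "memory"
hidden = torch.zeros(1, 1, 8)              # memory starts empty

# Tokens are processed strictly one at a time; each step only sees
# whatever the fixed-size hidden state has managed to retain so far
for t in range(seq.size(0)):
    _, hidden = rnn(seq[t:t+1], hidden)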
Luckily, with the emergence of transformer models, attention mechanisms arose to overcome this limitation of classical architectures like RNNs. The attention mechanism is the "soul" of the transformer model: the key component that fuels a much deeper understanding of language throughout the rest of the vast transformer architecture.
Concretely, transformers typically use a form of this mechanism called self-attention, which weighs the importance of all tokens in a text sequence simultaneously rather than one by one. This makes it possible to model and capture long-range dependencies, such as two mentions of the same person or place that appear several paragraphs apart in a long text. It also makes the processing of long text sequences much more efficient.
The self-attention mechanism not only weighs each token in the sequence, as depicted below, but also the interrelationships between tokens. For example, it can detect dependencies between verbs and their corresponding subjects, even when they appear far apart in the text.

How transformers’ self-attention mechanism works
Image by Author
Anatomy of the Self-Attention Mechanism
By looking inside the self-attention mechanism, we will get a better understanding of how this approach helps transformer models understand the interrelationship between elements of a sequence in natural language.
Imagine a sequence of token embeddings (embeddings are numerical representations of portions of text) from a text such as "Ramen is my favorite food." The sequence of token embeddings is linearly projected into three distinct matrices: queries (Q), keys (K), and values (V), each playing a different role in the attention computation. These three matrices are not identical to each other: each results from applying a different linear transformation to the token embeddings, one associated with queries, one with keys, and one with values. What distinguishes them are the weights used in each transformation, and these weights were learned while the model was trained.
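As a rough sketch of this step, the toy PyTorch snippet below (with made-up dimensions, chosen only for illustration) projects the same token embeddings through three separate learned linear layers:

import torch
import torch.nn as nn

d_model = 4                      # toy embedding dimension
tokens = torch.rand(5, d_model)  # 5 token embeddings, e.g. "Ramen is my favorite food"

# Three distinct learned linear transformations applied to the same embeddings
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q = W_q(tokens)  # queries
K = W_k(tokens)  # keys
V = W_v(tokens)  # values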
We then take the first two token projections (queries and keys) and apply the scaled dot-product operation at the core of the self-attention mechanism. The dot product computes a similarity score between the query and key vectors of any two tokens in the sequence, a value that reflects how much attention one word should pay to another. This yields an n × n matrix of attention scores, where n is the number of tokens in our original sequence. The elements of this matrix are a raw, preliminary indicator of the relationships between words in the sequence. The short code snippet below shows a minimal implementation of this mechanism using PyTorch:
import torch
import torch.nn.functional as F

# 3 tokens, 4-dimensional embeddings
Q = torch.rand(3, 4)
K = torch.rand(3, 4)
V = torch.rand(3, 4)

attention_scores = Q @ K.T                     # (3×4) x (4×3) -> 3×3
scaled_scores = attention_scores / (4 ** 0.5)  # scale by sqrt(d_k)
weights = F.softmax(scaled_scores, dim=-1)     # softmax over last axis
output = weights @ V                           # (3×3) x (3×4) -> 3×4

Inside an attention head
Image by Author
Going further, the raw attention scores are scaled (by the square root of the key dimension) and normalized with the softmax function, resulting in a matrix of attention weights. The attention weights provide an adjusted view of how much attention the model should pay to each token in a sequence like "ramen is my favorite food."
The attention weights are then multiplied by the third of the initial projections we built earlier, the values, to obtain updated token embeddings that incorporate relevant information about the whole sequence into every single token's embedding. This is like injecting into each word's DNA a bit of information from all the other words around it in the text. And this is how, as information flows through the subsequent modules and layers of the transformer architecture, complex relationships between parts of the text are successfully captured.
Multi-headed Attention
Many real-world transformer applications go a step further and use an extended version of the self-attention mechanism we just analyzed. A single instance of this mechanism is commonly referred to as an attention head, and multiple heads can be combined into a single component to build a multi-headed attention mechanism. In practice, this allows several attention heads to run in parallel and learn different linguistic and semantic aspects of the sequence: one attention head may specialize in context, another in syntactic interactions, and so on.

Multi-headed attention mechanism
Image by Author
When using a multi-headed attention mechanism, the outputs of the individual heads are concatenated and linearly projected back to the original embedding dimension, yielding a globally enriched version of the text embeddings that captures multiple linguistic and semantic nuances of the text.
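In PyTorch, this whole pattern (parallel heads, concatenation, and the final linear projection) is bundled into the built-in nn.MultiheadAttention module; the snippet below is a minimal sketch with toy dimensions chosen only for illustration:

import torch
import torch.nn as nn

embed_dim, num_heads = 8, 2  # toy values; embed_dim must be divisible by num_heads
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.rand(1, 5, embed_dim)  # (batch, sequence length, embedding dimension)

# Self-attention: the same sequence supplies queries, keys, and values
output, attn_weights = mha(x, x, x)
print(output.shape)        # torch.Size([1, 5, 8]) -> enriched token embeddings
print(attn_weights.shape)  # torch.Size([1, 5, 5]) -> attention weights averaged across heads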
Wrapping Up
This article provided a look inside the transformer architecture's most successful component, one that helped revolutionize the world of AI as a whole: the attention mechanism. Through a deep but gentle dive, we explored how attention works and why it matters.
You can find a practical, code-based introduction to transformer models in this recently published Machine Learning Mastery article.