Encoders and Decoders in Transformer Models


Transformer models have revolutionized natural language processing (NLP) with their powerful architecture. While the original transformer paper introduced a full encoder-decoder model, variations of this architecture have emerged to serve different purposes. In this article, we will explore the different types of transformer models and their applications.

Let’s get started.

Encoders and Decoders in Transformer Models
Photo by Stephan Streuders. Some rights reserved.

Overview

This article is divided into three parts; they are:

  • Full Transformer Models: Encoder-Decoder Architecture
  • Encoder-Only Models
  • Decoder-Only Models

Full Transformer Models: Encoder-Decoder Architecture

The original transformer architecture, introduced in “Attention is All You Need,” combines an encoder and decoder specifically designed for sequence-to-sequence (seq2seq) tasks like machine translation. The architecture is illustrated below.

Transformer architecture from the paper “Attention is All You Need”

The encoder processes the input sequence (e.g., a sentence in the source language) into a contextual representation. It consists of a stack of identical layers, each containing a self-attention sublayer and a feed-forward sublayer.

The decoder follows a similar structure, processing the target sequence (e.g., a partial sentence in the target language). Each decoder layer contains three sublayers: self-attention, cross-attention, and feed-forward. The cross-attention sublayer is unique to the decoder, combining context from the encoder with the target sequence to generate the output.

In the full transformer model, the encoder and decoder are connected, but the entire input sequence must be processed by the encoder before the decoder can begin generating output. The encoder enables each token to attend to all other tokens in the input sequence, creating rich contextual representations.
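For a concrete (if simplified) picture of how the two halves connect, here is a minimal sketch using PyTorch's built-in nn.Transformer; the sequence lengths and sizes below are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# A full encoder-decoder transformer (hyperparameters chosen for illustration)
model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6,
                       num_decoder_layers=6, batch_first=True)

src = torch.randn(1, 10, 512)   # embedded source sequence: 10 tokens
tgt = torch.randn(1, 7, 512)    # embedded target sequence generated so far: 7 tokens

# Causal mask so each target position only attends to earlier target positions
tgt_mask = model.generate_square_subsequent_mask(7)

output = model(src, tgt, tgt_mask=tgt_mask)
print(output.shape)   # torch.Size([1, 7, 512])
```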

The transformer’s signature feature is its attention layers. The attention output is a weighted sum of the value sequence $V$, where weights are attention scores computed by attending the query $Q$ to the key sequence $K$. While query and key sequences may differ in length, the value sequence must match the key sequence length. The result is a matrix $A$ of shape $(L_Q, L_K)$, where $A_{i,j}$ represents the attention score between the $i$-th query element and $j$-th key element.
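As a small numerical sketch of these shapes (single attention head, no batch dimension, sizes chosen arbitrarily):

```python
import torch

L_q, L_k, d = 4, 6, 8      # query length, key length, head dimension
Q = torch.randn(L_q, d)
K = torch.randn(L_k, d)
V = torch.randn(L_k, d)    # the value sequence must match the key sequence length

A = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)   # attention scores, shape (L_q, L_k)
output = A @ V                                  # weighted sum of values, shape (L_q, d)
print(A.shape, output.shape)   # torch.Size([4, 6]) torch.Size([4, 8])
```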

Both encoder and decoder use self-attention, where query, key, and value sequences are identical before linear transformation. However, the decoder’s self-attention is causal, preventing attention between query element $i$ and key element $j$ when $i < j$. This design reflects the autoregressive nature of sequence generation: tokens should not know about future tokens.

Causal attention is implemented with a lower triangular matrix of ones as a mask.
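A minimal sketch, assuming PyTorch and a sequence length of 5:

```python
import torch

seq_len = 5
# Lower triangular matrix of ones: position i may attend only to positions j <= i
mask = torch.tril(torch.ones(seq_len, seq_len))
print(mask)
```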

and the mask, if printed, will look like this:
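```
tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1.]])
```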

The attention scores are combined with this mask so that blocked positions contribute nothing to the attention output. In most implementations, the mask holds $-\infty$ or $0$ instead of $0$ or $1$, and it is added to the score matrix before the softmax is computed: adding $-\infty$ drives the softmax weight of a blocked position to zero, and addition is cheaper than masking by multiplication.
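To illustrate the additive form (a sketch, not tied to any particular library implementation), the $0$/$1$ mask above can be converted into a $-\infty$/$0$ mask and applied before the softmax:

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)            # raw attention scores, e.g., QK^T / sqrt(d)
mask = torch.tril(torch.ones(seq_len, seq_len))   # 1 = attend, 0 = block

# Set blocked positions to -inf so their softmax weight becomes zero
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)
print(weights)   # each row sums to 1; entries above the diagonal are 0
```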

The decoder in the transformer model also uses cross-attention. It takes the query sequence from the previous layer in the decoder, while the key and value sequences come from the output of the encoder. This is how the decoder utilizes the encoder’s output to generate the final output.
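Here is a sketch of cross-attention using PyTorch's nn.MultiheadAttention, with the dimensions chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

enc_output = torch.randn(1, 10, d_model)   # encoder output: 10 source tokens
dec_hidden = torch.randn(1, 7, d_model)    # decoder hidden states: 7 target tokens so far

# Query comes from the decoder; key and value come from the encoder output
out, attn_weights = cross_attn(query=dec_hidden, key=enc_output, value=enc_output)
print(out.shape)           # torch.Size([1, 7, 512])
print(attn_weights.shape)  # torch.Size([1, 7, 10])
```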

The full transformer architecture is particularly well-suited for tasks where the input and output sequences can have different lengths and the output depends on the entire input context. Examples include machine translation, where the input is a sentence in the source language and the output is a sentence in the target language, and text summarization, where the input is a long article and the output is a paragraph summarizing the article.

Encoder-Only Models

While powerful, the encoder-decoder architecture is computationally intensive and introduces latency since the decoder must wait for the encoder to complete its processing. Encoder-only models simplify this by removing the decoder. An example is the BERT model, as shown below.

BERT architecture (left) and GPT-2 architecture (right)

BERT (Bidirectional Encoder Representations from Transformers) is one of the most popular encoder-only models. It processes the entire input sequence at once, producing one contextual representation per token. Since this output is still a sequence of representations rather than a task-specific prediction, a task-specific model head is typically added for downstream applications.

For example, a “NER head” can label tokens as named entities, while a “sentiment analysis head” processes the entire sequence to produce a single sentiment score. Here’s how to create a BERT model:
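A minimal sketch with the Hugging Face Transformers library, building the model from a fresh configuration rather than downloading weights:

```python
from transformers import BertConfig, BertModel

# Build a BERT model from the default configuration (randomly initialized weights)
config = BertConfig()
model = BertModel(config)
print(model)
```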

This code does not load the pretrained weights, but printing the model lets you see its architecture.

The model has an “embeddings” layer to transform input token IDs into a vector space of dimension 768. The model also has a “pooler” layer at the end to transform the output before feeding it to a task-specific model head. The main body of the BERT model is the BertEncoder module, which is a stack of 12 architecturally identical BertLayer modules. There are multiple linear, layer norm, and dropout layers in each BertLayer. However, there is only one BertAttention module, in which the multi-head self-attention is implemented.

Encoder-only models, such as BERT, use only the encoder part of the transformer architecture. They are usually trained with masked language modeling, where randomly chosen tokens in the input sequence are replaced with a special [MASK] token and the model is trained to predict the original tokens. This forces the model to use the entire sequence as context when making each prediction.
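As a quick illustration (assuming the pretrained bert-base-uncased checkpoint is available for download), the fill-mask pipeline exercises exactly this masked-token prediction:

```python
from transformers import pipeline

# Predict the original token hidden behind [MASK] with a pretrained BERT model
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], candidate["score"])
```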

Decoder-Only Models

Decoder-only models have become more common nowadays, thanks to the capabilities demonstrated by OpenAI’s GPT (Generative Pre-trained Transformer) models. While later versions, such as GPT-3.5 and GPT-4, are too large to fit in a single computer and are not open-source, earlier versions such as GPT-2 are open-source and small enough to handle. You can instantiate one using the Hugging Face Transformers library, as follows:
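A minimal sketch, analogous to the BERT example above, that builds the model from its default configuration without downloading any weights:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Build a GPT-2 model (with its language-modeling head) from the default configuration
config = GPT2Config()
model = GPT2LMHeadModel(config)
print(model)
```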

Or if you want to load the pretrained weights, you can do this instead:
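For example, with the publicly available gpt2 checkpoint on the Hugging Face Hub:

```python
from transformers import GPT2LMHeadModel

# Download and load the pretrained GPT-2 weights
model = GPT2LMHeadModel.from_pretrained("gpt2")
print(model)
```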

The code prints the model architecture to the screen so that you can inspect the GPT-2 layer stack.

You may notice the use of Conv1D layers in the attention sub-layers of the model. Despite the name, this Conv1D is just a linear projection applied to the last dimension: its nx argument is the input feature size, so the layer simply multiplies the input by a weight matrix, exactly as a linear layer does.
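A sketch of this equivalence is below; note that the Conv1D import path shown is where recent versions of the transformers library keep the class:

```python
import torch
import torch.nn as nn
from transformers.pytorch_utils import Conv1D   # location in recent transformers versions

# GPT-2's Conv1D(nf, nx) maps nx input features to nf outputs,
# just like nn.Linear(nx, nf) but with the weight matrix stored transposed
conv = Conv1D(nf=2304, nx=768)
linear = nn.Linear(768, 2304)
linear.weight.data = conv.weight.data.T
linear.bias.data = conv.bias.data

x = torch.randn(1, 5, 768)
print(torch.allclose(conv(x), linear(x), atol=1e-5))   # True
```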

This architecture is also illustrated in the figure above. Comparing the two, you can see that GPT-2 and BERT are very similar except for where the LayerNorm is positioned: BERT uses post-norm while GPT-2 uses pre-norm. As mentioned above, the nn.Linear layers in BERT and the Conv1D layers in GPT-2 are functionally equivalent.

Comparing the architecture of GPT-2 to the decoder of the transformer, you will notice the absence of the cross-attention sub-layer. This is because there is no encoder in the model; hence, there is no encoder output and no need for cross-attention.

Why is GPT-2 a decoder-only model if it is so similar to the BERT model?

The answer lies in how the model is trained. GPT-2 is trained with next-token prediction: the model learns to predict the next token in the sequence, and training always applies a causal attention mask. BERT, by contrast, is trained by masking randomly chosen input tokens while attending over the whole sequence. Therefore, BERT expects to see the full context of the sequence, and all of its downstream tasks make this assumption. GPT-2, however, expects to see only the partial sequence up to a point and assumes nothing about the future.

This difference in training may seem subtle, but it is the key difference between the two models and their capabilities. This also distinguishes encoder-only and decoder-only models regardless of their architecture. This explains why BERT can be used for NER, as you need to see the entire sentence to understand the grammatical structure and determine whether a token is a named entity. Similarly, GPT-2 can be used for text generation because it is good at completing a partial sentence.
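As a short sketch of this text-completion behavior with the pretrained gpt2 checkpoint:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The transformer architecture is", return_tensors="pt")
# Greedy decoding: repeatedly predict the next token and append it to the sequence
outputs = model.generate(**inputs, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
```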

Summary

In this article, you explored the different types of transformer models and their applications. You learned that:

  • Full transformer models combine an encoder and a decoder for seq2seq tasks
  • Encoder-only models use bidirectional attention for understanding tasks
  • Decoder-only models use causal attention for generation tasks
  • Each architecture is optimized for specific use cases
  • The training approach, particularly the attention pattern, distinguishes encoder-only from decoder-only models

Understanding these differences is crucial for selecting the appropriate model architecture for your NLP task.

