Tokenizers in Language Models


Tokenization is a crucial preprocessing step in natural language processing (NLP) that converts raw text into tokens that can be processed by language models. Modern language models use sophisticated tokenization algorithms to handle the complexity of human language. In this article, we will explore common tokenization algorithms used in modern LLMs, their implementation, and how to use them.

Let’s get started!

Tokenizers in Language Models
Photo by Belle Co. Some rights reserved.

Overview

This post is divided into five parts; they are:

  • Naive Tokenization
  • Stemming and Lemmatization
  • Byte-Pair Encoding (BPE)
  • WordPiece
  • SentencePiece and Unigram

Naive Tokenization

The simplest form of tokenization splits text into tokens based on whitespace. This is a common tokenization method used in many NLP tasks.
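
Here is a minimal sketch, assuming a made-up sample sentence:

text = "Hello world! This is a test."   # hypothetical example text
tokens = text.split()                   # split on runs of whitespace
print(tokens)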

The output is:
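
['Hello', 'world!', 'This', 'is', 'a', 'test.']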

While simple and fast, this approach has several limitations. Recall that a model handling text needs to know its vocabulary — the set of all possible tokens. Using this naive tokenization, the vocabulary consists of all words in the provided text. When training a model, you create the vocabulary from your training data. However, when using the trained model in your project, you may encounter words not in the vocabulary. In such cases, your model cannot handle them or must replace them with a special “unknown” token.

Another problem with naive tokenization is its poor handling of punctuation and special characters. For example, “world!” becomes one token, while in another sentence, “world” might be a separate token. This creates two different tokens in the vocabulary for essentially the same word. Similar issues arise with capitalization and hyphenation.

Why tokenize by spaces? In English, spaces are how we separate words, and words are the basic units of the language. You wouldn’t want to tokenize input into individual bytes or characters, as each one carries little meaning on its own, making it difficult for the model to learn the text’s meaning. Similarly, tokenizing by sentences isn’t ideal because there are orders of magnitude more possible sentences than words. Training a model to understand text at the sentence level would require proportionally more training data.

However, are words the optimal level for tokenization? Ideally, you want to break down text into the smallest meaningful units. In German, space-based tokenization isn’t ideal due to numerous compound words. Even in English, prefixes and suffixes that aren’t standalone words carry meaning when combined with other words. For example, “unhappy” should be understood as “un-” + “happy”.

Therefore, you need a better tokenization method.

Stemming and Lemmatization

By implementing more sophisticated tokenization algorithms, you can create a better vocabulary. For example, this regular expression tokenizes text into words, punctuation, and numbers:
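
One possible pattern is sketched below; the exact expression is a design choice, and the sample sentence is made up:

import re

text = "Hello world! The year is 2025."   # hypothetical example text
# keep runs of letters, runs of digits, and single punctuation marks as separate tokens
tokens = re.findall(r"[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]", text)
print(tokens)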

To further reduce vocabulary size, you can convert everything to lowercase:
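
Continuing the same sketch, lowercasing the text before matching:

import re

text = "Hello world! The year is 2025."
tokens = re.findall(r"[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]", text.lower())
print(tokens)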

and the output is:
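
['hello', 'world', '!', 'the', 'year', 'is', '2025', '.']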

However, this still doesn’t address the problem of word variations.

Stemming and lemmatization are two techniques for reducing words to their root form. Stemming is the more aggressive of the two: it strips prefixes and suffixes according to hand-crafted rules. Lemmatization is gentler, reducing each word to its dictionary base form (its lemma). Both are language-specific, and because stemming follows rules blindly, it may produce strings that are not valid words.

In English, the Porter stemming algorithm is commonly used. You can implement it using the nltk library:
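
A minimal sketch with nltk’s PorterStemmer, assuming a made-up sample sentence:

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")   # tokenizer data; newer NLTK releases may ask for "punkt_tab" instead

stemmer = PorterStemmer()
text = "Tokenization helps the unstable models while running."   # hypothetical example text
tokens = word_tokenize(text)
print([stemmer.stem(token) for token in tokens])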

and the output is:
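
['token', 'help', 'the', 'unstabl', 'model', 'while', 'run', '.']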

You can see that “unstabl” is not a valid word, but it’s what the Porter stemming algorithm produces.

Lemmatization is gentler and almost always produces valid words. Here’s how to use the nltk library for lemmatization:
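
A minimal sketch with nltk’s WordNetLemmatizer on the same hypothetical sentence:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("wordnet")   # the dictionary used by the lemmatizer; some setups also need "omw-1.4"

lemmatizer = WordNetLemmatizer()
text = "Tokenization helps the unstable models while running."
tokens = word_tokenize(text)
print([lemmatizer.lemmatize(token) for token in tokens])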

and the output is:
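
['Tokenization', 'help', 'the', 'unstable', 'model', 'while', 'running', '.']

Note that the lemmatizer treats every word as a noun unless you pass a pos argument, which is why “running” is left unchanged here; lemmatizer.lemmatize("running", pos="v") returns “run”.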

In both cases, you first tokenize the words and then transform them with a stemmer or lemmatizer. This normalization step produces a more consistent vocabulary. However, the fundamental tokenization issues, such as recognizing subwords, remain unsolved.

Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE) is one of the most widely used tokenization algorithms in modern language models. Originally created as a text compression algorithm, it was introduced for machine translation and later adopted by GPT models. BPE works by iteratively merging the most frequent adjacent pairs of characters or tokens in the training data.

The algorithm begins with a vocabulary of individual characters and iteratively merges the most frequent adjacent pairs into new tokens. This process continues until reaching the desired vocabulary size. For English text, you can start with just the alphabet and some punctuation, making the initial character set very small. Then, common letter combinations are introduced to the vocabulary iteratively. The resulting vocabulary contains both individual characters and common subword units.
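
To make the merge loop concrete, below is a minimal sketch of BPE training on a toy corpus; the words and frequencies are invented for illustration, and real implementations are considerably more elaborate:

from collections import Counter

def get_pair_counts(word_freqs):
    # count adjacent symbol pairs across the corpus, weighted by word frequency
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    # rewrite every word, replacing each occurrence of the pair with a single merged symbol
    a, b = pair
    merged = {}
    for word, freq in word_freqs.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

# each word is written as space-separated characters, mapped to its corpus frequency
word_freqs = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for step in range(5):
    pair_counts = get_pair_counts(word_freqs)
    best = max(pair_counts, key=pair_counts.get)
    word_freqs = merge_pair(best, word_freqs)
    print("merge", step + 1, ":", best)
print(list(word_freqs))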

BPE is trained on specific data, so the exact tokenization depends on the training data. Therefore, you need to save and load the BPE tokenizer model for use in your project.

BPE doesn’t specify how to define a word. For example, hyphenated words like “pre-trained” can be treated as one word or two words. This is determined by the “pre-tokenizer,” which in its simplest form splits words by spaces.

Many transformer models use BPE, including GPT, BART, and RoBERTa. You can use their trained BPE tokenizers. Here’s how to use the BPE tokenizer from the Hugging Face Transformers library:
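
A sketch using the pretrained GPT-2 tokenizer; the "gpt2" checkpoint and the sample sentence are assumptions:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Language models use subword tokenization."   # hypothetical example text
print(tokenizer.tokenize(text))      # the learned subword tokens
print(tokenizer(text)["input_ids"])  # the corresponding token IDs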

If you print the tokens, you can see that the tokenizer uses “Ġ” in front of some tokens. This is a special character used by GPT-2’s byte-level BPE to mark a word boundary: a token starting with “Ġ” begins a new word, with the “Ġ” standing in for the preceding space. Notice also that words are neither stemmed nor lemmatized: “models” remains as is, not transformed to “model”.

An alternative to Hugging Face’s tokenizer is OpenAI’s tiktoken library. Here’s an example:
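
A sketch, assuming the cl100k_base encoding (the one used by the GPT-3.5 and GPT-4 models):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Language models use subword tokenization.")
print(ids)               # token IDs
print(enc.decode(ids))   # round-trip back to the original text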

To train your own BPE tokenizer, the Hugging Face Tokenizers library is the easiest option. Here’s an example:
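
A sketch of training from scratch; the wikitext-2-raw-v1 dataset is assumed here because it matches the splits described below, and the vocabulary size of 5000 is an arbitrary choice:

from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# assumed dataset: it provides "train", "validation", and "test" splits with a "text" feature
ds = load_dataset("wikitext", "wikitext-2-raw-v1")

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()   # simple word/punctuation pre-tokenizer
trainer = trainers.BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])

print("vocab size before training:", tokenizer.get_vocab_size())
tokenizer.train_from_iterator(ds["train"]["text"], trainer=trainer)
print("vocab size after training:", tokenizer.get_vocab_size())
print(tokenizer.encode("Language models use subword tokenization.").tokens)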

Running this, you will see the vocabulary size before and after training, followed by a sample tokenization.

The BpeTrainer object has more arguments for controlling the training process. In this example, you loaded a dataset using Hugging Face’s datasets library and trained the tokenizer on its text. Each dataset is different: this one has “test”, “train”, and “validation” splits, and each split has a single feature named “text” containing strings. You trained the tokenizer on ds["train"]["text"] and let the trainer find merges until the desired vocabulary size was reached.

You can see that the tokenizer’s state before and after training differs—tokens learned from the training data are added and associated with token IDs.

A key advantage of the BPE tokenizer is its ability to handle unknown words by breaking them down into known subword units.

WordPiece

WordPiece is a popular subword tokenization algorithm developed at Google: it was first described in 2012, popularized by Google’s neural machine translation system in 2016, and is used by BERT and its variants. Let’s see how it tokenizes a sentence:
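
A sketch using the pretrained bert-base-uncased tokenizer; the checkpoint and the sample sentence are assumptions:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "The weights are randomly initialized."   # hypothetical example text
print(tokenizer.tokenize(text))                  # subword tokens only
ids = tokenizer(text)["input_ids"]               # full encoding, including special tokens
print(tokenizer.convert_ids_to_tokens(ids))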

If you print the tokens, you can see that the tokenizer splits “initialized” into “initial” and “##ized”. The “##” prefix indicates that the token continues the previous word; a token without the prefix is assumed to start a new word, i.e., to have a space before it.

This result includes some BERT-specific design choices. In this BERT model, all text is converted to lowercase, which the tokenizer handles implicitly. BERT also assumes text sequences start with a [CLS] token and end with a [SEP] token. These special tokens are added automatically by the tokenizer. None of these are required by the WordPiece algorithm, so you might not see them in other models.

WordPiece is similar to BPE. Both start with the set of individual characters and iteratively merge pairs into new vocabulary tokens. The key difference is the merge criterion: BPE merges the most frequent pair, while WordPiece picks the pair that most increases the likelihood of the training data, scoring each candidate as frequency(ab) / (frequency(a) × frequency(b)). As a result, WordPiece favors merging pairs whose parts rarely appear on their own, rather than simply the most frequent pairs.

Training a WordPiece tokenizer using the Hugging Face tokenizers library is similar to BPE. You can use the WordPieceTrainer to train the tokenizer. Here’s an example:
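
A sketch, again assuming the wikitext-2-raw-v1 dataset and an arbitrary vocabulary size:

from datasets import load_dataset
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

ds = load_dataset("wikitext", "wikitext-2-raw-v1")   # assumed dataset

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()          # mimic BERT's uncased behavior
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=5000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

tokenizer.train_from_iterator(ds["train"]["text"], trainer=trainer)
print(tokenizer.encode("the weights are randomly initialized.").tokens)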

SentencePiece and Unigram

BPE and WordPiece are built from the bottom up. They start with the set of all characters and merge some into new vocabulary tokens. You can also build a tokenizer from the top down, starting with all words from the training data and pruning the vocabulary to the desired size.

Unigram is such an algorithm. Training starts from a large candidate vocabulary and, at each step, removes the tokens whose removal reduces the likelihood of the training data the least, until the target vocabulary size is reached. Unlike BPE and WordPiece, the trained Unigram tokenizer isn’t a list of merge rules but a statistical model: it stores a probability for each token, and new text is segmented by finding the most probable sequence of tokens.

While it’s theoretically possible to have a standalone Unigram tokenizer, it’s most commonly seen as part of SentencePiece.

SentencePiece is a language-neutral tokenization algorithm that doesn’t require pre-tokenization of input text. It’s particularly useful for multilingual scenarios because, for example, English uses spaces to separate words, but Chinese doesn’t. SentencePiece handles such differences by treating input text as a stream of Unicode characters. It then uses either BPE or Unigram to create the tokenization.

Here’s how to use the SentencePiece tokenizer from the Hugging Face Transformers library:
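
A sketch using a pretrained checkpoint; t5-small is assumed here because its tokenizer is a SentencePiece Unigram model:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
print(tokenizer.tokenize("Language models use subword tokenization."))   # hypothetical example text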

If you print the tokens, you will see that, similar to WordPiece’s “##” marker, a special prefix is used: the “▁” character (which looks like an underscore) is attached to tokens that start a new word, standing in for the preceding space, while tokens without it continue the previous word.

Training a SentencePiece tokenizer is also similar using the Hugging Face Tokenizers library. Here’s an example:
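
A sketch of training a Unigram model, using the same assumed dataset; the Metaspace pre-tokenizer replaces spaces with the "▁" marker, as SentencePiece does:

from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

ds = load_dataset("wikitext", "wikitext-2-raw-v1")   # assumed dataset

tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
trainer = trainers.UnigramTrainer(vocab_size=5000, unk_token="<unk>", special_tokens=["<unk>"])

tokenizer.train_from_iterator(ds["train"]["text"], trainer=trainer)
print(tokenizer.encode("Language models use subword tokenization.").tokens)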

You can also use Google’s sentencepiece library for the same purpose.
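
For reference, a sketch with the sentencepiece package itself, assuming your training text is saved as a plain-text file named corpus.txt with one sentence per line:

import sentencepiece as spm

# train a Unigram model; this writes corpus_sp.model and corpus_sp.vocab to disk
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="corpus_sp", vocab_size=5000, model_type="unigram"
)

sp = spm.SentencePieceProcessor(model_file="corpus_sp.model")
print(sp.encode("Language models use subword tokenization.", out_type=str))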


Summary

In this article, you explored different types of tokenization algorithms used in modern language models. You learned that:

  • BPE is widely used in GPT models and works by merging frequent adjacent pairs
  • WordPiece is used in BERT models and maximizes likelihood of training data
  • SentencePiece is more flexible and can handle different languages without pre-tokenization
  • Modern tokenizers include important features like special tokens, truncation, and padding

Understanding these tokenization algorithms is crucial for working with modern language models and preprocessing text data effectively.

