Tokenization is a crucial preprocessing step in natural language processing (NLP) that converts raw text into tokens that can be processed by language models. Modern language models use sophisticated tokenization algorithms to handle the complexity of human language. In this article, we will explore common tokenization algorithms used in modern LLMs, their implementation, and how to use them.
Let’s get started!
Tokenizers in Language Models. Photo by Belle Co. Some rights reserved.
Overview
This post is divided into five parts; they are:
- Naive Tokenization
- Stemming and Lemmatization
- Byte-Pair Encoding (BPE)
- WordPiece
- SentencePiece and Unigram
Naive Tokenization
The simplest form of tokenization splits text into tokens based on whitespace. This is a common tokenization method used in many NLP tasks.
text = "Hello, world! This is a test."
tokens = text.split()
print(f"Tokens: {tokens}")
The output is:
Tokens: ['Hello,', 'world!', 'This', 'is', 'a', 'test.']
While simple and fast, this approach has several limitations. Recall that a model handling text needs to know its vocabulary — the set of all possible tokens. Using this naive tokenization, the vocabulary consists of all words in the provided text. When training a model, you create the vocabulary from your training data. However, when using the trained model in your project, you may encounter words not in the vocabulary. In such cases, your model cannot handle them or must replace them with a special “unknown” token.
Another problem with naive tokenization is its poor handling of punctuation and special characters. For example, “world!” becomes one token, while in another sentence, “world” might be a separate token. This creates two different tokens in the vocabulary for essentially the same word. Similar issues arise with capitalization and hyphenation.
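To make this concrete, here is a minimal sketch of the problem, using a made-up two-sentence corpus and an "<unk>" placeholder for out-of-vocabulary words:

# Build a vocabulary from the "training" text with naive whitespace tokenization
train_text = "the cat sat on the mat"
vocab = set(train_text.split())

# Any word outside that vocabulary must be replaced with a special unknown token
new_text = "the dog sat on the mat"
tokens = [word if word in vocab else "<unk>" for word in new_text.split()]
print(tokens)  # ['the', '<unk>', 'sat', 'on', 'the', 'mat']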
Why tokenize words by space? In English, space is how we separate words, and words are the basic units of language. You wouldn’t want to tokenize input by bytes, as you’d get individual characters that carry little meaning on their own, making it difficult for the model to understand the text. Similarly, tokenizing by sentences isn’t ideal because there are multiple orders of magnitude more sentences than words. Training a model to understand text at the sentence level would require proportionally more training data.
However, are words the optimal level for tokenization? Ideally, you want to break down text into the smallest meaningful units. In German, space-based tokenization isn’t ideal due to numerous compound words. Even in English, prefixes and suffixes that aren’t standalone words carry meaning when combined with other words. For example, “unhappy” should be understood as “un-” + “happy”.
Therefore, you need a better tokenization method.
Stemming and Lemmatization
By implementing more sophisticated tokenization algorithms, you can create a better vocabulary. For example, this regular expression tokenizes text into words, punctuation, and numbers:
import re

text = "Hello, world! This is a test."
tokens = re.findall(r'\w+|[^\w\s]', text)
print(f"Tokens: {tokens}")
To further reduce vocabulary size, you can convert everything to lowercase:
import re

text = "Hello, world! This is a test."
tokens = re.findall(r'\w+|[^\w\s]', text.lower())
print(f"Tokens: {tokens}")
and the output is:
Tokens: ['hello', ',', 'world', '!', 'this', 'is', 'a', 'test', '.']
However, this still doesn’t address the problem of word variations.
Stemming and lemmatization are two techniques for reducing words to their root form. Stemming is a more aggressive technique that removes prefixes and suffixes based on rules. Lemmatization is gentler, reducing words to their base form using a dictionary. Both are language-specific, but stemming may produce invalid words.
In English, the Porter stemming algorithm is commonly used. You can implement it using the nltk library:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# download the necessary resources if you haven't done so
nltk.download('punkt_tab')

text = "These models may become unstable quickly if not initialized."
stemmer = PorterStemmer()
words = word_tokenize(text)
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
and the output is:
['these', 'model', 'may', 'becom', 'unstabl', 'quickli', 'if', 'not', 'initi', '.']
You can see that “unstabl” is not a valid word, but it’s what the Porter stemming algorithm produces.
Lemmatization is gentler and almost always produces valid words. Here’s how to use the nltk library for lemmatization:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# download the necessary resources if you haven't done so
nltk.download('wordnet')

text = "These models may become unstable quickly if not initialized."
lemmatizer = WordNetLemmatizer()
words = word_tokenize(text)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
and the output is:
['These', 'model', 'may', 'become', 'unstable', 'quickly', 'if', 'not', 'initialized', '.']
In both cases, you first tokenize the words and then transform them with a stemmer or lemmatizer. This normalization step produces a more consistent vocabulary. However, the fundamental tokenization issues, such as recognizing subwords, remain unsolved.
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE) is one of the most widely used tokenization algorithms in modern language models. Originally created as a text compression algorithm, it was introduced for machine translation and later adopted by GPT models. BPE works by iteratively merging the most frequent adjacent pairs of characters or tokens in the training data.
The algorithm begins with a vocabulary of individual characters and iteratively merges the most frequent adjacent pairs into new tokens. This process continues until reaching the desired vocabulary size. For English text, you can start with just the alphabet and some punctuation, making the initial character set very small. Then, common letter combinations are introduced to the vocabulary iteratively. The resulting vocabulary contains both individual characters and common subword units.
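The merge loop itself can be sketched in a few lines of plain Python. This is only an illustration of the idea rather than how any particular library implements it; the toy corpus represents each word as space-separated characters with a frequency count:

from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Replace every occurrence of the pair with a single merged symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in corpus.items()}

# Toy corpus: words as space-separated characters, with their frequencies
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(5):
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)  # the most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    print(f"Merge {step + 1}: {best}")

Each merge adds one new token to the vocabulary; run it for more steps and the vocabulary grows toward whatever size you choose.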
BPE is trained on specific data, so the exact tokenization depends on the training data. Therefore, you need to save and load the BPE tokenizer model for use in your project.
BPE doesn’t specify how to define a word. For example, hyphenated words like “pre-trained” can be treated as one word or two words. This is determined by the “pre-tokenizer,” which in its simplest form splits words by spaces.
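For example, the Whitespace pre-tokenizer shipped with the Hugging Face tokenizers library splits on whitespace and punctuation, while WhitespaceSplit splits on whitespace only, so the two treat hyphenated words differently (a small sketch you can run on its own):

from tokenizers.pre_tokenizers import Whitespace, WhitespaceSplit

text = "pre-trained models"
# Whitespace() splits on whitespace and punctuation: "pre-trained" becomes three pieces
print(Whitespace().pre_tokenize_str(text))
# WhitespaceSplit() splits on whitespace only: "pre-trained" stays as one piece
print(WhitespaceSplit().pre_tokenize_str(text))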
Many transformer models use BPE, including GPT, BART, and RoBERTa. You can use their trained BPE tokenizers. Here’s how to use the BPE tokenizer from the Hugging Face Transformers library:
from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer (which uses BPE)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Tokenize a text
text = "Pre-trained models are available."
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens)}")
print(f"Decoded: {tokenizer.decode(tokens)}")
and its output is:
Token IDs: [6719, 12, 35311, 4981, 389, 1695, 13]
Tokens: ['Pre', '-', 'trained', 'Ġmodels', 'Ġare', 'Ġavailable', '.']
Decoded: Pre-trained models are available.
You can see that the tokenizer uses “Ġ” to represent the space preceding a word. This is a special marker the GPT-2 BPE tokenizer uses to encode word boundaries. Notice that words are neither stemmed nor lemmatized: “models” remains as is, not transformed to “model”.
An alternative to Hugging Face’s tokenizer is OpenAI’s tiktoken library. Here’s an example:
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
text = "Pre-trained models are available."
tokens = encoding.encode(text)
print(f"Token IDs: {tokens}")
print(f"Tokens: {[encoding.decode_single_token_bytes(t) for t in tokens]}")
print(f"Decoded: {encoding.decode(tokens)}")
To train your own BPE tokenizer, the Hugging Face Tokenizers library is the easiest option. Here’s an example:
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")
print(ds)

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
print(tokenizer)

tokenizer.train_from_iterator(ds["train"]["text"], trainer)
print(tokenizer)
tokenizer.save("my-tokenizer.json")

# reload the trained tokenizer
tokenizer = Tokenizer.from_file("my-tokenizer.json")
Running this, you will see:
DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 1801350
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[], normalizer=None,
pre_tokenizer=Whitespace(), post_processor=None, decoder=None, model=BPE(dropout=None,
unk_token="[UNK]", continuing_subword_prefix=None, end_of_word_suffix=None, fuse_unk=False,
byte_fallback=False, ignore_merges=False, vocab={}, merges=[]))
[00:00:04] Pre-processing sequences   ███████████████████████████ 0      /      0
[00:00:00] Tokenize words             ███████████████████████████ 608587 / 608587
[00:00:00] Count pairs                ███████████████████████████ 608587 / 608587
[00:00:02] Compute merges             ███████████████████████████ 25018  /  25018
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[
    {"id":0, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...},
    {"id":1, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...},
    {"id":2, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...},
    {"id":3, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...},
    {"id":4, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}],
normalizer=None, pre_tokenizer=Whitespace(), post_processor=None, decoder=None,
model=BPE(dropout=None, unk_token="[UNK]", continuing_subword_prefix=None,
end_of_word_suffix=None, fuse_unk=False, byte_fallback=False, ignore_merges=False,
vocab={"[UNK]":0, "[CLS]":1, "[SEP]":2, "[PAD]":3, "[MASK]":4, ...},
merges=[("t", "h"), ("i", "n"), ("e", "r"), ("a", "n"), ("th", "e"), ...]))
The BpeTrainer object has more arguments for controlling the training process. In this example, you loaded a dataset using Hugging Face’s datasets library and trained the tokenizer on the text data. Each dataset is different; this one has “test”, “train”, and “validation” splits. Each split has one feature named “text” containing strings. We trained the tokenizer using ds["train"]["text"] and let the trainer find merges until reaching the desired vocabulary size.
You can see that the tokenizer’s state before and after training differs—tokens learned from the training data are added and associated with token IDs.
A key advantage of the BPE tokenizer is its ability to handle unknown words by breaking them down into known subword units.
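For instance, a made-up word that is certainly not in GPT-2’s vocabulary is still encoded, just as a sequence of smaller known pieces (the word below is invented purely for illustration):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# A made-up word does not become an "unknown" token; it is split into known subwords
tokens = tokenizer.encode("untokenizable")
print(tokenizer.convert_ids_to_tokens(tokens))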
WordPiece
WordPiece is a popular subword tokenization algorithm developed by Google and used by BERT and its variants. Let’s see how it tokenizes a sentence:
from transformers import BertTokenizer

# Load the WordPiece tokenizer from BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a text
text = "These models are usually initialized with Gaussian random values."
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens)}")
print(f"Decoded: {tokenizer.decode(tokens)}")
The output of this code is:
Token IDs: [101, 2122, 4275, 2024, 2788, 3988, 3550, 2007, 11721, 17854, 2937, 6721, 5300, 1012, 102]
Tokens: ['[CLS]', 'these', 'models', 'are', 'usually', 'initial', '##ized', 'with', 'ga', '##uss', '##ian', 'random', 'values', '.', '[SEP]']
Decoded: [CLS] these models are usually initialized with gaussian random values. [SEP]
From this output, you can see that the tokenizer splits “initialized” into “initial” and “##ized”. The “##” prefix indicates that this is a subword of the previous word. If a word isn’t prefixed with “##”, it’s assumed to have a space before it.
This result includes some BERT-specific design choices. In this BERT model, all text is converted to lowercase, which the tokenizer handles implicitly. BERT also assumes text sequences start with a [CLS] token and end with a [SEP] token. These special tokens are added automatically by the tokenizer. None of these are required by the WordPiece algorithm, so you might not see them in other models.
WordPiece is similar to BPE. Both start with the set of all characters and merge some into new vocabulary tokens. BPE merges the most frequent token pairs, while WordPiece uses a score formula that maximizes likelihood. The key difference is that BPE may create subword tokens from common words, while WordPiece typically keeps common words as single tokens.
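The scoring rule is commonly described as the pair frequency divided by the product of the frequencies of its two parts, which favors merging pieces that rarely appear on their own. A small sketch with made-up counts shows how this can rank merges differently from BPE:

# WordPiece scores a merge by pair count / (count of first part * count of second part);
# BPE would look at the pair count alone. All counts below are made up for illustration.
pair_count = {("t", "h"): 100, ("hug", "s"): 5}
token_count = {"t": 500, "h": 400, "hug": 10, "s": 30}

for (a, b), freq in pair_count.items():
    score = freq / (token_count[a] * token_count[b])
    print(f"pair {a}+{b}: BPE count = {freq}, WordPiece score = {score:.5f}")

# BPE would merge "t"+"h" first (count 100 vs. 5), while WordPiece prefers "hug"+"s"
# because its parts are rare relative to how often they appear together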
Training a WordPiece tokenizer with the Hugging Face tokenizers library is similar to BPE. You can use the WordPieceTrainer to train the tokenizer. Here’s an example:
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

tokenizer.train_from_iterator(ds["train"]["text"], trainer)
tokenizer.save("my-tokenizer.json")
SentencePiece and Unigram
BPE and WordPiece are built from the bottom up. They start with the set of all characters and merge some into new vocabulary tokens. You can also build a tokenizer from the top down, starting with all words from the training data and pruning the vocabulary to the desired size.
Unigram is such an algorithm. Training a Unigram tokenizer involves removing vocabulary items in each step based on a log-likelihood score. Unlike BPE and WordPiece, the trained Unigram tokenizer isn’t rule-based but statistical. It saves the likelihood of each token, which is used to determine the tokenization of new text.
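To see the idea, here is a minimal sketch of how a Unigram model segments a word at inference time: among all ways of splitting the word into tokens from the vocabulary, it picks the one with the highest product of token probabilities. The probabilities below are made up for illustration:

import math

# Made-up unigram probabilities for a tiny vocabulary
prob = {"un": 0.10, "happy": 0.05, "u": 0.02, "n": 0.03, "h": 0.02,
        "appy": 0.001, "unhappy": 0.0001}

def best_segmentation(word):
    """Viterbi-style search for the highest-likelihood segmentation."""
    # best[i] holds the (log-probability, token list) of the best split of word[:i]
    best = [(0.0, [])] + [(-math.inf, []) for _ in word]
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in prob:
                score = best[start][0] + math.log(prob[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

print(best_segmentation("unhappy"))  # ['un', 'happy'] under these made-up probabilities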
While it’s theoretically possible to have a standalone Unigram tokenizer, it’s most commonly seen as part of SentencePiece.
SentencePiece is a language-neutral tokenization algorithm that doesn’t require pre-tokenization of input text. It’s particularly useful for multilingual scenarios because, for example, English uses spaces to separate words, but Chinese doesn’t. SentencePiece handles such differences by treating input text as a stream of Unicode characters. It then uses either BPE or Unigram to create the tokenization.
Here’s how to use the SentencePiece tokenizer from the Hugging Face Transformers library:
from transformers import T5Tokenizer

# Load the T5 tokenizer (which uses SentencePiece+Unigram)
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Tokenize a text
text = "SentencePiece is a subword tokenizer used in models such as XLNet and T5."
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens)}")
print(f"Decoded: {tokenizer.decode(tokens)}")
and the output is:
Token IDs: [4892, 17, 1433, 345, 23, 15, 565, 19, 3, 9, 769, 6051, 14145, 8585, 261, 16, 2250, 224, 38, 3, 4, 434, 9688, 11, 332, 9125, 1]
Tokens: ['▁Sen', 't', 'ence', 'P', 'i', 'e', 'ce', '▁is', '▁', 'a', '▁sub', 'word', '▁token', 'izer', '▁used', '▁in', '▁models', '▁such', '▁as', '▁', 'X', 'L', 'Net', '▁and', '▁T', '5.', '</s>']
Decoded: SentencePiece is a subword tokenizer used in models such as XLNet and T5.</s>
Similar to WordPiece, a special prefix is used to distinguish subwords from word beginnings. Here it is the “▁” character (U+2581, which looks like an underscore): tokens starting with “▁” begin a new word, while tokens without it continue the previous one.
Training a SentencePiece tokenizer with the Hugging Face Tokenizers library is similar as well. Here’s an example:
from datasets import load_dataset
from tokenizers import SentencePieceUnigramTokenizer

ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")
tokenizer = SentencePieceUnigramTokenizer()

tokenizer.train_from_iterator(ds["train"]["text"])
tokenizer.save("my-tokenizer.json")
You can also use Google’s sentencepiece library for the same purpose.
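If you go that route, a rough sketch with the sentencepiece library might look like the following; the input file name and vocabulary size are placeholders you would replace with your own (the library expects a plain-text file with one sentence per line):

import sentencepiece as spm

# Train a Unigram model from a plain-text file (placeholder path and vocabulary size)
spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="my-spm",   # writes my-spm.model and my-spm.vocab
    vocab_size=8000,
    model_type="unigram",    # or "bpe"
)

# Load the trained model and tokenize a sentence
sp = spm.SentencePieceProcessor(model_file="my-spm.model")
print(sp.encode("Pre-trained models are available.", out_type=str))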
Summary
In this article, you explored different types of tokenization algorithms used in modern language models. You learned that:
- BPE is widely used in GPT models and works by merging frequent adjacent pairs
- WordPiece is used in BERT models and maximizes likelihood of training data
- SentencePiece is more flexible and can handle different languages without pre-tokenization
- Modern tokenizers include important features like special tokens, truncation, and padding
Understanding these tokenization algorithms is crucial for working with modern language models and preprocessing text data effectively.