Tokenization is a crucial preprocessing step in natural language processing (NLP) that converts raw text into tokens that can be processed by language models. Modern language models use sophisticated tokenization algorithms to handle the complexity of human language. In this article, we will explore common tokenization algorithms used in modern LLMs, their implementation, and how to use them.
Let’s get started!
Tokenizers in Language Models. Photo by Belle Co. Some rights reserved.
Overview
This post is divided into five parts; they are:
- Naive Tokenization
- Stemming and Lemmatization
- Byte-Pair Encoding (BPE)
- WordPiece
- SentencePiece and Unigram
Naive Tokenization
The simplest form of tokenization splits text into tokens based on whitespace. This is a common tokenization method used in many NLP tasks.
text = "Hello, world! This is a test."
tokens = text.split()
print(f"Tokens: {tokens}")
The output is:
Tokens: ['Hello,', 'world!', 'This', 'is', 'a', 'test.']
While simple and fast, this approach has several limitations. Recall that a model handling text needs to know its vocabulary — the set of all possible tokens. Using this naive tokenization, the vocabulary consists of all words in the provided text. When training a model, you create the vocabulary from your training data. However, when using the trained model in your project, you may encounter words not in the vocabulary. In such cases, your model cannot handle them or must replace them with a special “unknown” token.
Another problem with naive tokenization is its poor handling of punctuation and special characters. For example, “world!” becomes one token, while in another sentence, “world” might be a separate token. This creates two different tokens in the vocabulary for essentially the same word. Similar issues arise with capitalization and hyphenation.
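To make this concrete, here is a minimal sketch of the problem, using a made-up two-sentence corpus and an "<unk>" placeholder for out-of-vocabulary words:

# Build a vocabulary from the "training" text with naive whitespace tokenization
train_text = "the cat sat on the mat"
vocab = set(train_text.split())

# Any word outside that vocabulary must be replaced with a special unknown token
new_text = "the dog sat on the mat"
tokens = [word if word in vocab else "<unk>" for word in new_text.split()]
print(tokens)  # ['the', '<unk>', 'sat', 'on', 'the', 'mat']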
Why tokenize words by space? In English, space is how we separate words, and words are the basic units of language. You wouldn’t want to tokenize input by bytes, as you’d get individual characters that carry little meaning on their own, making it difficult for the model to understand the text. Similarly, tokenizing by sentences isn’t ideal because there are multiple orders of magnitude more sentences than words. Training a model to understand text at the sentence level would require proportionally more training data.
However, are words the optimal level for tokenization? Ideally, you want to break down text into the smallest meaningful units. In German, space-based tokenization isn’t ideal due to numerous compound words. Even in English, prefixes and suffixes that aren’t standalone words carry meaning when combined with other words. For example, “unhappy” should be understood as “un-” + “happy”.
Therefore, you need a better tokenization method.
Stemming and Lemmatization
By implementing more sophisticated tokenization algorithms, you can create a better vocabulary. For example, this regular expression tokenizes text into words, punctuation, and numbers:
import re

text = "Hello, world! This is a test."
tokens = re.findall(r'\w+|[^\w\s]', text)
print(f"Tokens: {tokens}")
To further reduce vocabulary size, you can convert everything to lowercase:
import re

text = "Hello, world! This is a test."
tokens = re.findall(r'\w+|[^\w\s]', text.lower())
print(f"Tokens: {tokens}")
and the output is:
Tokens: ['hello', ',', 'world', '!', 'this', 'is', 'a', 'test', '.']
However, this still doesn’t address the problem of word variations.
Stemming and lemmatization are two techniques for reducing words to their root form. Stemming is a more aggressive technique that removes prefixes and suffixes based on rules. Lemmatization is gentler, reducing words to their base form using a dictionary. Both are language-specific, but stemming may produce invalid words.
In English, the Porter stemming algorithm is commonly used. You can implement it using the nltk library:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# download the necessary resources if you haven't done so
nltk.download('punkt_tab')

text = "These models may become unstable quickly if not initialized."
stemmer = PorterStemmer()
words = word_tokenize(text)
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
and the output is:
['these', 'model', 'may', 'becom', 'unstabl', 'quickli', 'if', 'not', 'initi', '.']
You can see that “unstabl” is not a valid word, but it’s what the Porter stemming algorithm produces.
Lemmatization is gentler and almost always produces valid words. Here’s how to use the nltk library for lemmatization:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# download the necessary resources if you haven't done so
nltk.download('wordnet')

text = "These models may become unstable quickly if not initialized."
lemmatizer = WordNetLemmatizer()
words = word_tokenize(text)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
and the output is:
['These', 'model', 'may', 'become', 'unstable', 'quickly', 'if', 'not', 'initialized', '.']
In both cases, you first tokenize the words and then transform them with a stemmer or lemmatizer. This normalization step produces a more consistent vocabulary. However, the fundamental tokenization issues, such as recognizing subwords, remain unsolved.
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE) is one of the most widely used tokenization algorithms in modern language models. Originally created as a text compression algorithm, it was introduced for machine translation and later adopted by GPT models. BPE works by iteratively merging the most frequent adjacent pairs of characters or tokens in the training data.
The algorithm begins with a vocabulary of individual characters and iteratively merges the most frequent adjacent pairs into new tokens. This process continues until reaching the desired vocabulary size. For English text, you can start with just the alphabet and some punctuation, making the initial character set very small. Then, common letter combinations are introduced to the vocabulary iteratively. The resulting vocabulary contains both individual characters and common subword units.
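The merge loop itself can be sketched in a few lines of plain Python. This is only an illustration of the idea rather than how any particular library implements it; the toy corpus represents each word as space-separated characters with a frequency count:

from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Replace every occurrence of the pair with a single merged symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in corpus.items()}

# Toy corpus: words as space-separated characters, with their frequencies
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(5):
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)  # the most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    print(f"Merge {step + 1}: {best}")

Each merge adds one new token to the vocabulary; run it for more steps and the vocabulary grows toward whatever size you choose.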
BPE is trained on specific data, so the exact tokenization depends on the training data. Therefore, you need to save and load the BPE tokenizer model for use in your project.
BPE doesn’t specify how to define a word. For example, hyphenated words like “pre-trained” can be treated as one word or two words. This is determined by the “pre-tokenizer,” which in its simplest form splits words by spaces.
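For example, the Whitespace pre-tokenizer shipped with the Hugging Face tokenizers library splits on whitespace and punctuation, while WhitespaceSplit splits on whitespace only, so the two treat hyphenated words differently (a small sketch you can run on its own):

from tokenizers.pre_tokenizers import Whitespace, WhitespaceSplit

text = "pre-trained models"
# Whitespace() splits on whitespace and punctuation: "pre-trained" becomes three pieces
print(Whitespace().pre_tokenize_str(text))
# WhitespaceSplit() splits on whitespace only: "pre-trained" stays as one piece
print(WhitespaceSplit().pre_tokenize_str(text))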
Many transformer models use BPE, including GPT, BART, and RoBERTa. You can use their trained BPE tokenizers. Here’s how to use the BPE tokenizer from the Hugging Face Transformers library:
from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer (which uses BPE)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Tokenize a text
text = "Pre-trained models are available."
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens)}")
print(f"Decoded: {tokenizer.decode(tokens)}")
and its output is:
Token IDs: [6719, 12, 35311, 4981, 389, 1695, 13]
Tokens: ['Pre', '-', 'trained', 'Ġmodels', 'Ġare', 'Ġavailable', '.']
Decoded: Pre-trained models are available.
You can see that the tokenizer uses “Ġ” to represent the space preceding a word. This is a special marker the GPT-2 BPE tokenizer uses to encode word boundaries. Notice that words are neither stemmed nor lemmatized: “models” remains as is, not transformed to “model”.
An alternative to Hugging Face’s tokenizer is OpenAI’s tiktoken library. Here’s an example:
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
text = "Pre-trained models are available."
tokens = encoding.encode(text)
print(f"Token IDs: {tokens}")
print(f"Tokens: {[encoding.decode_single_token_bytes(t) for t in tokens]}")
print(f"Decoded: {encoding.decode(tokens)}")
To train your own BPE tokenizer, the Hugging Face Tokenizers library is the easiest option. Here’s an example:
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")
print(ds)

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
print(tokenizer)

tokenizer.train_from_iterator(ds["train"]["text"], trainer)
print(tokenizer)
tokenizer.save("my-tokenizer.json")

# reload the trained tokenizer
tokenizer = Tokenizer.from_file("my-tokenizer.json")
Running this, you will see:
DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 1801350
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[], normalizer=None,
pre_tokenizer=Whitespace(), post_processor=None, decoder=None, model=BPE(dropout=None,
unk_token="[UNK]", continuing_subword_prefix=None, end_of_word_suffix=None, fuse_unk=False,
byte_fallback=False, ignore_merges=False, vocab={}, merges=[]))
[00:00:04] Pre-processing sequences   ███████████████████████████ 0      /      0
[00:00:00] Tokenize words             ███████████████████████████ 608587 / 608587
[00:00:00] Count pairs                ███████████████████████████ 608587 / 608587
[00:00:02] Compute merges             ███████████████████████████ 25018  /  25018
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[
    {"id":0, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...},
    {"id":1, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...},
    {"id":2, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...},
    {"id":3, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...},
    {"id":4, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}],
normalizer=None, pre_tokenizer=Whitespace(), post_processor=None, decoder=None,
model=BPE(dropout=None, unk_token="[UNK]", continuing_subword_prefix=None,
end_of_word_suffix=None, fuse_unk=False, byte_fallback=False, ignore_merges=False,
vocab={"[UNK]":0, "[CLS]":1, "[SEP]":2, "[PAD]":3, "[MASK]":4, ...},
merges=[("t", "h"), ("i", "n"), ("e", "r"), ("a", "n"), ("th", "e"), ...]))
The BpeTrainer object has more arguments for controlling the training process. In this example, you loaded a dataset using Hugging Face’s datasets library and trained the tokenizer on the text data. Each dataset is different; this one has “test”, “train”, and “validation” splits. Each split has one feature named “text” containing strings. We trained the tokenizer using ds["train"]["text"] and let the trainer find merges until reaching the desired vocabulary size.
You can see that the tokenizer’s state before and after training differs—tokens learned from the training data are added and associated with token IDs.
A key advantage of the BPE tokenizer is its ability to handle unknown words by breaking them down into known subword units.
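For instance, a made-up word that is certainly not in GPT-2’s vocabulary is still encoded, just as a sequence of smaller known pieces (the word below is invented purely for illustration):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# A made-up word does not become an "unknown" token; it is split into known subwords
tokens = tokenizer.encode("untokenizable")
print(tokenizer.convert_ids_to_tokens(tokens))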
WordPiece
WordPiece is a popular subword tokenization algorithm developed by Google and used by BERT and its variants. Let’s see how it tokenizes a sentence:
from transformers import BertTokenizer

# Load the WordPiece tokenizer from BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a text
text = "These models are usually initialized with Gaussian random values."
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens)}")
print(f"Decoded: {tokenizer.decode(tokens)}")
The output of this code is:
Token IDs: [101, 2122, 4275, 2024, 2788, 3988, 3550, 2007, 11721, 17854, 2937, 6721, 5300, 1012, 102]
Tokens: ['[CLS]', 'these', 'models', 'are', 'usually', 'initial', '##ized', 'with', 'ga', '##uss', '##ian', 'random', 'values', '.', '[SEP]']
Decoded: [CLS] these models are usually initialized with gaussian random values. [SEP]
From this output, you can see that the tokenizer splits “initialized” into “initial” and “##ized”. The “##” prefix indicates that this is a subword of the previous word. If a word isn’t prefixed with “##”, it’s assumed to have a space before it.
This result includes some BERT-specific design choices. In this BERT model, all text is converted to lowercase, which the tokenizer handles implicitly. BERT also assumes text sequences start with a [CLS] token and end with a [SEP] token. These special tokens are added automatically by the tokenizer. None of these are required by the WordPiece algorithm, so you might not see them in other models.
WordPiece is similar to BPE. Both start with the set of all characters and merge some into new vocabulary tokens. BPE merges the most frequent token pairs, while WordPiece uses a score formula that maximizes likelihood. The key difference is that BPE may create subword tokens from common words, while WordPiece typically keeps common words as single tokens.
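The scoring rule is commonly described as the pair frequency divided by the product of the frequencies of its two parts, which favors merging pieces that rarely appear on their own. A small sketch with made-up counts shows how this can rank merges differently from BPE:

# WordPiece scores a merge by pair count / (count of first part * count of second part);
# BPE would look at the pair count alone. All counts below are made up for illustration.
pair_count = {("t", "h"): 100, ("hug", "s"): 5}
token_count = {"t": 500, "h": 400, "hug": 10, "s": 30}

for (a, b), freq in pair_count.items():
    score = freq / (token_count[a] * token_count[b])
    print(f"pair {a}+{b}: BPE count = {freq}, WordPiece score = {score:.5f}")

# BPE would merge "t"+"h" first (count 100 vs. 5), while WordPiece prefers "hug"+"s"
# because its parts are rare relative to how often they appear together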
Training a WordPiece tokenizer with the Hugging Face tokenizers library is similar to BPE. You can use the WordPieceTrainer to train the tokenizer. Here’s an example:
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

tokenizer.train_from_iterator(ds["train"]["text"], trainer)
tokenizer.save("my-tokenizer.json")
SentencePiece and Unigram
BPE and WordPiece are built from the bottom up. They start with the set of all characters and merge some into new vocabulary tokens. You can also build a tokenizer from the top down, starting with all words from the training data and pruning the vocabulary to the desired size.
Unigram is such an algorithm. Training a Unigram tokenizer involves removing vocabulary items in each step based on a log-likelihood score. Unlike BPE and WordPiece, the trained Unigram tokenizer isn’t rule-based but statistical. It saves the likelihood of each token, which is used to determine the tokenization of new text.
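To see the idea, here is a minimal sketch of how a Unigram model segments a word at inference time: among all ways of splitting the word into tokens from the vocabulary, it picks the one with the highest product of token probabilities. The probabilities below are made up for illustration:

import math

# Made-up unigram probabilities for a tiny vocabulary
prob = {"un": 0.10, "happy": 0.05, "u": 0.02, "n": 0.03, "h": 0.02,
        "appy": 0.001, "unhappy": 0.0001}

def best_segmentation(word):
    """Viterbi-style search for the highest-likelihood segmentation."""
    # best[i] holds the (log-probability, token list) of the best split of word[:i]
    best = [(0.0, [])] + [(-math.inf, []) for _ in word]
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in prob:
                score = best[start][0] + math.log(prob[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

print(best_segmentation("unhappy"))  # ['un', 'happy'] under these made-up probabilities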
While it’s theoretically possible to have a standalone Unigram tokenizer, it’s most commonly seen as part of SentencePiece.
SentencePiece is a language-neutral tokenization algorithm that doesn’t require pre-tokenization of input text. It’s particularly useful for multilingual scenarios because, for example, English uses spaces to separate words, but Chinese doesn’t. SentencePiece handles such differences by treating input text as a stream of Unicode characters. It then uses either BPE or Unigram to create the tokenization.
Here’s how to use the SentencePiece tokenizer from the Hugging Face Transformers library:
from transformers import T5Tokenizer

# Load the T5 tokenizer (which uses SentencePiece+Unigram)
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Tokenize a text
text = "SentencePiece is a subword tokenizer used in models such as XLNet and T5."
tokens = tokenizer.encode(text)
print(f"Token IDs: {tokens}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens)}")
print(f"Decoded: {tokenizer.decode(tokens)}")
and the output is:
Token IDs: [4892, 17, 1433, 345, 23, 15, 565, 19, 3, 9, 769, 6051, 14145, 8585, 261, 16, 2250, 224, 38, 3, 4, 434, 9688, 11, 332, 9125, 1]
Tokens: ['▁Sen', 't', 'ence', 'P', 'i', 'e', 'ce', '▁is', '▁', 'a', '▁sub', 'word', '▁token', 'izer', '▁used', '▁in', '▁models', '▁such', '▁as', '▁', 'X', 'L', 'Net', '▁and', '▁T', '5.', '</s>']
Decoded: SentencePiece is a subword tokenizer used in models such as XLNet and T5.</s>
Similar to WordPiece, a special prefix is used to distinguish subwords from word beginnings. Here it is the “▁” character (U+2581, which looks like an underscore): tokens starting with “▁” begin a new word, while tokens without it continue the previous one.
Training a SentencePiece tokenizer with the Hugging Face Tokenizers library is similar as well. Here’s an example:
from datasets import load_dataset
from tokenizers import SentencePieceUnigramTokenizer

ds = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")
tokenizer = SentencePieceUnigramTokenizer()

tokenizer.train_from_iterator(ds["train"]["text"])
tokenizer.save("my-tokenizer.json")
You can also use Google’s sentencepiece library for the same purpose.
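If you go that route, a rough sketch with the sentencepiece library might look like the following; the input file name and vocabulary size are placeholders you would replace with your own (the library expects a plain-text file with one sentence per line):

import sentencepiece as spm

# Train a Unigram model from a plain-text file (placeholder path and vocabulary size)
spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="my-spm",   # writes my-spm.model and my-spm.vocab
    vocab_size=8000,
    model_type="unigram",    # or "bpe"
)

# Load the trained model and tokenize a sentence
sp = spm.SentencePieceProcessor(model_file="my-spm.model")
print(sp.encode("Pre-trained models are available.", out_type=str))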
Summary
In this article, you explored different types of tokenization algorithms used in modern language models. You learned that:
- BPE is widely used in GPT models and works by merging frequent adjacent pairs
- WordPiece is used in BERT models and maximizes likelihood of training data
- SentencePiece is more flexible and can handle different languages without pre-tokenization
- Modern tokenizers include important features like special tokens, truncation, and padding
Understanding these tokenization algorithms is crucial for working with modern language models and preprocessing text data effectively.