Evaluating Perplexity on Language Models




A language model is a probability distribution over sequences of tokens. When you train a language model, you want to measure how accurately it predicts human language use. This is a difficult task, and you need a metric to evaluate the model. In this article, you will learn about the perplexity metric. Specifically, you will learn:

  • What is perplexity, and how to compute it
  • How to evaluate the perplexity of a language model with sample data

Let’s get started.

Photo by Lucas Davis. Some rights reserved.

Overview

This article is divided into two parts; they are:

  • What Is Perplexity and How to Compute It
  • Evaluate the Perplexity of a Language Model with HellaSwag Dataset

What Is Perplexity and How to Compute It

Perplexity is a measure of how well a language model predicts a sample of text. It is defined as the inverse of the geometric mean of the probabilities that the model assigns to the tokens in the sample, where each token's probability is conditioned on the tokens before it. Mathematically, perplexity is defined as:

$$
PPL(x_{1:L}) = \prod_{i=1}^L p(x_i \mid x_{1:i-1})^{-1/L} = \exp\big(-\frac{1}{L} \sum_{i=1}^L \log p(x_i \mid x_{1:i-1})\big)
$$

Perplexity is a function of a particular sequence of tokens. In practice, it is more convenient and numerically stable to compute it from the mean of the log probabilities, as shown on the right-hand side of the formula above.
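As a quick sanity check, here is a minimal sketch with made-up token probabilities showing that the product form and the log form of the formula agree:

```python
import math

# Hypothetical probabilities a model assigns to each token of a 4-token sample
probs = [0.2, 0.5, 0.1, 0.4]
L = len(probs)

# Product form: inverse of the geometric mean of the probabilities
ppl_product = math.prod(p ** (-1 / L) for p in probs)

# Log form: exponential of the negative mean log probability
ppl_log = math.exp(-sum(math.log(p) for p in probs) / L)

print(ppl_product, ppl_log)  # both print the same value, about 3.98
```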

Perplexity is a metric that quantifies how much a language model hesitates about the next token on average. If the language model is absolutely certain, the perplexity is 1. If the language model is completely uncertain, then every token in the vocabulary is equally likely; the perplexity is equal to the vocabulary size. You should not expect perplexity to go beyond this range.

Evaluate the Perplexity of a Language Model with HellaSwag Dataset

Perplexity is a dataset-dependent metric: the number you get reflects both the model and the text you evaluate it on. One dataset you can use is HellaSwag, which comes with train, validation, and test splits. It is available on the Hugging Face Hub, and you can load it with the following code:
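The sketch below uses the datasets library and assumes the dataset is published under the identifier Rowan/hellaswag on the Hugging Face Hub:

```python
from datasets import load_dataset

# Download HellaSwag from the Hugging Face Hub;
# the identifier "Rowan/hellaswag" is assumed here
dataset = load_dataset("Rowan/hellaswag")
print(dataset)
```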

Running this code prints the structure of the dataset, showing the features and the number of rows in each split.

You can see that the validation split has 10,042 samples. This is the split you will use in this article. Each sample is a dictionary. The key "activity_label" describes the activity category, and the key "ctx" provides the context that needs to be completed. The model is expected to complete the sequence by selecting one of the four endings stored under the key "endings". The key "label", with values 0 to 3, indicates which ending is correct.
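For example, you can inspect one validation sample to see these keys. This is a minimal sketch reusing the `dataset` object loaded above; the actual text printed depends on the sample:

```python
sample = dataset["validation"][0]

print(sample["activity_label"])   # the activity category
print(sample["ctx"])              # the context to be completed
for i, ending in enumerate(sample["endings"]):
    print(i, ending)              # the four candidate endings
print("correct ending:", sample["label"])  # index of the correct ending, 0 to 3
```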

With this, you can write a short script to evaluate your own language model. Let’s use a small model from Hugging Face as an example:
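Treat the following as a minimal sketch rather than a definitive implementation: it joins the activity label and the context with a colon as the prompt (an assumption on my part), computes the perplexity of each candidate ending, and tallies the accuracy. The output formatting is likewise only one reasonable choice.

```python
import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the smallest GPT-2 model and its tokenizer from the Hugging Face Hub
model_name = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Load the HellaSwag validation split
dataset = load_dataset("Rowan/hellaswag", split="validation")

correct = 0
for sample in dataset:
    # Tokenize the activity label and the context as the common prefix
    prefix = sample["activity_label"] + ": " + sample["ctx"]
    prefix_ids = tokenizer.encode(prefix)
    n = len(prefix_ids)

    # Compute the perplexity of each candidate ending
    perplexities = []
    for ending in sample["endings"]:
        ending_ids = tokenizer.encode(" " + ending)
        input_ids = torch.tensor([prefix_ids + ending_ids])   # shape (1, L)
        with torch.no_grad():
            logits = model(input_ids).logits                  # shape (1, L, V)
        # Logits at position p predict the token at position p+1, so the
        # predictions for the ending tokens start at offset n-1
        token_probs = F.log_softmax(logits[0, n-1:-1], dim=-1)
        log_probs = torch.stack([token_probs[j, token] for j, token in enumerate(ending_ids)])
        perplexities.append(torch.exp(-log_probs.mean()).item())

    # The predicted ending is the one with the lowest perplexity
    prediction = perplexities.index(min(perplexities))
    label = int(sample["label"])
    correct += int(prediction == label)

    # Print each ending: (O) = correct answer predicted, (!) = correct answer
    # missed, (X) = the model's wrong prediction
    for i, (ppl, ending) in enumerate(zip(perplexities, sample["endings"])):
        if i == label:
            mark = "(O)" if prediction == label else "(!)"
        elif i == prediction:
            mark = "(X)"
        else:
            mark = "   "
        print(f"{mark} PPL={ppl:8.2f}  {ending}")
    print()

print(f"Accuracy: {correct}/{len(dataset)} = {correct/len(dataset):.2%}")
```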

This code loads the smallest GPT-2 model from the Hugging Face Hub. It is a 124M-parameter model that you can easily run on a modest computer. The model and tokenizer are loaded using the Hugging Face transformers library. You also load the HellaSwag validation dataset.

In the for-loop, you tokenize the activity label and the context. You also tokenize each of the four endings. Note that tokenizer.encode() is how you invoke the tokenizer from the transformers library; this interface is different from the tokenizer object you used in the previous article.

Next, for each ending, you run the concatenation of the context and the ending through the model. The input_ids tensor is a 2D tensor of integer token IDs with a batch dimension of 1. The model returns an object from which you extract the output logits tensor. This is different from the model you built in the previous article, as this is a model object from the transformers library. You can easily swap in your own trained model object with minor changes.

GPT-2 is a decoder-only transformer model. It processes the input with a causal mask. For an input tensor of shape $(1, L)$, the output logits tensor has shape $(1, L, V)$, where $V$ is the vocabulary size. The output at position $p$ is the model’s estimate of the token at position $p+1$, conditioned on the input at positions 1 to $p$. Therefore, you extract the logits starting at offset $n-1$, where $n$ is the length of the combined activity label and context. You then convert the logits to log probabilities and average them over the length of each ending.

The value token_probs[j, token] is the log probability at position j for the token with ID token. The mean log probability over the ending’s tokens is then used to compute the perplexity. A good model is expected to give the correct ending the lowest perplexity. You can evaluate a model by counting the number of correct predictions over the entire HellaSwag validation dataset.

When you run this code, it prints the perplexity of each ending, marking the correct answer with (O) or (!) and the model’s wrong prediction with (X). You can see that GPT-2’s perplexity is in the range of 10 to 20, even for a correct answer. Advanced LLMs can achieve perplexity below 10, even with a much larger vocabulary than GPT-2. More important is whether the model can identify the correct ending, the one that naturally completes the sentence: it should be the one with the lowest perplexity; otherwise, the model would not generate the correct continuation. GPT-2 achieves only about 30% accuracy on this dataset.

You can also repeat the code with a different model. Here are the results:

  • model openai-community/gpt2: This is the smallest GPT-2 model with 124M parameters, used in the code above. The accuracy is 3041/10042 or 30.28%
  • model openai-community/gpt2-medium: This is the larger GPT-2 model with 355M parameters. The accuracy is 3901/10042 or 38.85%
  • model meta-llama/Llama-3.2-1B: This is the smallest model in the Llama family with 1B parameters. The accuracy is 5731/10042 or 57.07%

Therefore, it is natural to see higher accuracy with larger models.
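Because the sketch above loads the model with the Auto classes, repeating the experiment with a different model only requires changing the model identifier. Note that gated models such as meta-llama/Llama-3.2-1B additionally require accepting the license and logging in with a Hugging Face token.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Swap in any causal language model from the Hugging Face Hub
model_name = "openai-community/gpt2-medium"   # or "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
```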

Note that you should not compare perplexities across models with vastly different architectures. Since perplexity is a metric in the range of 1 to the vocabulary size, it depends heavily on the tokenizer and its vocabulary. You can see this when you rerun the code above with Llama 3.2 1B in place of GPT-2: the perplexity values are an order of magnitude higher for Llama 3, yet the accuracy is better. This is because GPT-2 has a vocabulary size of only 50,257, while Llama 3.2 1B has a vocabulary size of 128,256.


Summary

In this article, you learned about the perplexity metric and how to evaluate the perplexity of a language model with the HellaSwag dataset. Specifically, you learned:

  • Perplexity measures how much a model hesitates about the next token on average.
  • Perplexity is a metric sensitive to vocabulary size.
  • Computing perplexity means computing the inverse of the geometric mean of the probabilities of the tokens in the sample, or equivalently, the exponential of the negative mean log probability.




