How LLMs Choose Their Words: A Practical Walk-Through of Logits, Softmax and Sampling




Large Language Models (LLMs) can produce varied, creative, and sometimes surprising outputs even when given the same prompt. This randomness is not a bug but a core feature of how the model samples its next token from a probability distribution. In this article, we break down the key sampling strategies and demonstrate how parameters such as temperature, top-k, and top-p influence the balance between consistency and creativity.

In this tutorial, we take a hands-on approach to understand:

  • How logits become probabilities
  • How temperature, top-k, and top-p sampling work
  • How different sampling strategies shape the model’s next-token distribution

By the end, you will understand the mechanics behind LLM inference and be able to adjust the creativity or determinism of the output.

Let’s get started.

Photo by Colton Duke. Some rights reserved.

Overview

This article is divided into four parts; they are:

  • How Logits Become Probabilities
  • Temperature
  • Top-k Sampling
  • Top-p Sampling

How Logits Become Probabilities

When you ask an LLM a question, it outputs a vector of logits. Logits are raw scores the model assigns to each possible next token in its vocabulary.

If the model has a vocabulary of $V$ tokens, it outputs a vector of $V$ logits at each generation step. Each logit is a real number; the vector is converted into a probability distribution by the softmax function:

$$
p_i = \frac{e^{x_i}}{\sum_{j=1}^{V} e^{x_j}}
$$

where $x_i$ is the logit for token $i$ and $p_i$ is the corresponding probability. Softmax transforms these raw scores into a probability distribution. All $p_i$ are positive, and their sum is 1.

Suppose we give the model this prompt:

Today’s weather is so ___

The model considers every token in its vocabulary as a possible next word. For simplicity, let’s say there are only six tokens in the vocabulary: “wonderful”, “cloudy”, “nice”, “hot”, “gloomy”, and “delicious”.

The model produces one logit for each token. Here’s an example set of logits the model might output and the corresponding probabilities based on the softmax function:

Token       Logit   Probability
wonderful   1.2     0.0457
cloudy      2.0     0.1017
nice        3.5     0.4556
hot         3.0     0.2764
gloomy      1.8     0.0832
delicious   1.0     0.0374

You can confirm this by using the softmax function from PyTorch:
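A minimal sketch of this check, using the logits from the table above, might look like this:

```python
import torch

# Six-token toy vocabulary and example logits from the table above
tokens = ["wonderful", "cloudy", "nice", "hot", "gloomy", "delicious"]
logits = torch.tensor([1.2, 2.0, 3.5, 3.0, 1.8, 1.0])

# Softmax turns the raw logits into a probability distribution
probs = torch.softmax(logits, dim=-1)

for token, p in zip(tokens, probs.tolist()):
    print(f"{token:<10s} {p:.4f}")
print("sum =", round(sum(probs.tolist()), 4))
```

The printed probabilities match the table, and they sum to 1.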

Based on this result, the token with the highest probability is “nice”. LLMs don’t always select the token with the highest probability; instead, they sample from the probability distribution to produce a different output each time. In this case, there’s a 46% probability of seeing “nice”.

If you want the model to give a more creative answer, how can you change the probability distribution such that “cloudy”, “hot”, and other answers would also appear more often?

Temperature

Temperature ($T$) is an inference-time parameter: it is not part of the model’s weights, but a setting of the decoding algorithm that generates the output. It scales the logits before applying softmax:

$$
p_i = \frac{e^{x_i / T}}{\sum_{j=1}^{V} e^{x_j / T}}
$$

You can expect sampling to be more deterministic when $T<1$, since the differences between the scaled logits $x_i/T$ are exaggerated and the distribution becomes more peaked. Conversely, it becomes more random when $T>1$, as those differences are compressed and the distribution flattens.

Now, let’s visualize this effect of temperature on the probability distribution:
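The listing below is one possible sketch: it reuses the six-token example and compares a few illustrative temperatures (0.5, 1.0, 2.0, and 10.0), chosen only for demonstration:

```python
import torch
import matplotlib.pyplot as plt

# Six-token toy vocabulary and example logits from the table above
tokens = ["wonderful", "cloudy", "nice", "hot", "gloomy", "delicious"]
logits = torch.tensor([1.2, 2.0, 3.5, 3.0, 1.8, 1.0])

# Illustrative temperature values (chosen arbitrarily for this example)
temperatures = [0.5, 1.0, 2.0, 10.0]

fig, axes = plt.subplots(1, len(temperatures), figsize=(16, 4), sharey=True)
for ax, T in zip(axes, temperatures):
    # Scale the logits by the temperature before applying softmax
    probs = torch.softmax(logits / T, dim=-1)
    # Sample one token from the resulting distribution
    idx = torch.multinomial(probs, num_samples=1).item()
    print(f"T={T}: sampled '{tokens[idx]}', probs={[round(v, 4) for v in probs.tolist()]}")
    # Plot the distribution for this temperature
    ax.bar(tokens, probs.tolist())
    ax.set_title(f"T = {T}")
    ax.tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()
```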

This code generates a probability distribution over the vocabulary at each temperature, samples a token from each distribution, and prints the result. It also plots the distribution for each temperature:

The effect of temperature on the resulting probability distribution

The model may produce the nonsensical output “Today’s weather is so delicious” if you set the temperature to 10!

Top-k Sampling

The model’s output is a vector of logits for each position in the output sequence. The inference algorithm converts the logits to actual words, or in LLM terms, tokens.

The simplest method for selecting the next token is greedy sampling, which always selects the token with the highest probability. While efficient, this often yields repetitive, predictable output. Another method is to sample the token from the softmax-probability distribution derived from the logits. However, because an LLM has a very large vocabulary, inference is slow, and there is a small chance of producing nonsensical tokens.

Top-$k$ sampling strikes a balance between determinism and creativity. Instead of sampling from the entire vocabulary, it restricts the candidate pool to the top $k$ most probable tokens and samples from that subset. Tokens outside this top-$k$ group are assigned zero probability and will never be chosen. It not only accelerates inference by reducing the effective vocabulary size, but also eliminates tokens that should not be selected.

By filtering out extremely unlikely tokens while still allowing randomness among the most plausible ones, top-$k$ sampling helps maintain coherence without sacrificing diversity. When $k=1$, top-$k$ reduces to greedy sampling.

Here is an example of how you can implement top-$k$ sampling:
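The listing below is one possible sketch, reusing the six-token example with a few illustrative values of $k$:

```python
import torch
import matplotlib.pyplot as plt

# Six-token toy vocabulary and example logits from the table above
tokens = ["wonderful", "cloudy", "nice", "hot", "gloomy", "delicious"]
logits = torch.tensor([1.2, 2.0, 3.5, 3.0, 1.8, 1.0])

# Illustrative values of k (chosen arbitrarily for this example)
k_values = [1, 2, 4, 6]

fig, axes = plt.subplots(1, len(k_values), figsize=(16, 4), sharey=True)
for ax, k in zip(axes, k_values):
    # Keep only the k largest logits; every other logit is set to -inf
    # so that its probability becomes exactly zero after softmax
    topk = torch.topk(logits, k)
    filtered = torch.full_like(logits, float("-inf"))
    filtered[topk.indices] = topk.values
    probs = torch.softmax(filtered, dim=-1)
    # Sample one token from the truncated distribution
    idx = torch.multinomial(probs, num_samples=1).item()
    print(f"k={k}: sampled '{tokens[idx]}', probs={[round(v, 4) for v in probs.tolist()]}")
    ax.bar(tokens, probs.tolist())
    ax.set_title(f"k = {k}")
    ax.tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()
```

Note that $k=1$ is included: as described above, it collapses all probability mass onto the single most likely token, which is exactly greedy sampling.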

This code modifies the previous example by filling the logits of tokens outside the top $k$ with $-\infty$, which makes the probability of those tokens exactly zero after softmax. It prints the sampled token for each value of $k$ and plots the probability distribution after top-$k$ filtering:

The probability distribution after top-$k$ filtering

You can see that for each $k$, the probabilities of exactly $V-k$ tokens are zero. Those tokens will never be chosen under the corresponding top-$k$ setting.

Top-p Sampling

The problem with top-$k$ sampling is that it always selects from a fixed number of tokens, regardless of how much probability mass they collectively account for. If the distribution is sharply peaked, a fixed $k$ can still include tokens from the long tail of low-probability options, which often leads to incoherent output.

Top-$p$ sampling (also known as nucleus sampling) addresses this issue by sampling tokens according to their cumulative probability rather than a fixed count. It selects the smallest set of tokens whose cumulative probability exceeds a threshold $p$, effectively creating a dynamic $k$ for each position to filter out unreliable tail probabilities while retaining only the most plausible candidates. When the model is sharp and peaked, top-$p$ yields fewer candidate tokens; when the distribution is flat, it expands accordingly.

Setting $p$ close to 1.0 approaches full sampling from all tokens. Setting $p$ to a very small value makes the sampling more conservative. Here is how you can implement top-$p$ sampling:
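The listing below is one possible sketch, reusing the six-token example with a few illustrative values of $p$:

```python
import torch
import matplotlib.pyplot as plt

# Six-token toy vocabulary and example logits from the table above
tokens = ["wonderful", "cloudy", "nice", "hot", "gloomy", "delicious"]
logits = torch.tensor([1.2, 2.0, 3.5, 3.0, 1.8, 1.0])

# Illustrative values of p (chosen arbitrarily for this example)
p_values = [0.3, 0.6, 0.9, 1.0]

fig, axes = plt.subplots(1, len(p_values), figsize=(16, 4), sharey=True)
for ax, p in zip(axes, p_values):
    probs = torch.softmax(logits, dim=-1)
    # Sort tokens by probability and compute the cumulative probability
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Remove a token if the cumulative mass *before* it already exceeds p,
    # so the token that crosses the threshold is still kept
    remove = cumulative - sorted_probs > p
    filtered = logits.clone()
    filtered[sorted_idx[remove]] = float("-inf")
    probs_p = torch.softmax(filtered, dim=-1)
    # Sample one token from the truncated distribution
    idx = torch.multinomial(probs_p, num_samples=1).item()
    print(f"p={p}: sampled '{tokens[idx]}', probs={[round(v, 4) for v in probs_p.tolist()]}")
    ax.bar(tokens, probs_p.tolist())
    ax.set_title(f"p = {p}")
    ax.tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()
```

The key step is the cumulative sum over the sorted probabilities: a token is removed only if the probability mass before it already exceeds $p$, so the smallest set whose cumulative probability exceeds $p$ is retained.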

This code prints the sampled token for each value of $p$ and plots the probability distribution after top-$p$ filtering:

The probability distribution after top-$p$ filtering

From this plot, you can see that the number of tokens with zero probability is not determined by $p$ alone, unlike in top-$k$ sampling. This is the intended behavior: the size of the candidate set depends on how confident the model is about the next token.

Summary

This article demonstrated how different sampling strategies affect an LLM’s choice of next word during the decoding phase. You learned to select different values for the temperature, top-$k$, and top-$p$ sampling parameters for different LLM use cases.




