One of my favorite parts of my job is getting to answer random questions from our MVPs (Microsoft Most Valuable Professionals). For context, at Microsoft there are internal distribution lists, monitored by folks across the product and engineering teams, where MVPs get access to ask questions. Often the questions and answers fall into the NDA-only category, so I can't talk about them publicly, but every once in a while a fun technical question comes along that is in no way confidential, yet isn't something we would ever cover in the official Microsoft documentation. This was one of those questions. I will scrub out any identifying details, and I may do some light editing, but for the most part what follows are my responses to the MVP.
"An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both."
In nucleus sampling, we compute the cumulative distribution over all the options for each subsequent token in decreasing probability order and cut it off once it reaches a particular probability specified by top_p. You should either alter temperature or top_p, but not both.
Recommended for advanced use cases only. You usually only need to use temperature.
Maximum: 1
Minimum: 0
I am trying to understand more about top_p. Why does setting top_p to 0 still generate tokens? I thought setting it to zero would result in no response, but instead I get results.
The Curious Case of Neural Text Degeneration
While each LLM provider might have its own slightly different internal implementation, and may handle edge cases (and top_p's interaction with other parameters) differently, fundamentally any top_p (nucleus sampling) feature will be based on the approach described in the paper.
To make things more concrete for myself, I found it helpful to put together an adapted toy version of some of the open-source implementations of the algorithm described in the paper, and then step through the execution with different values for p:
Full code with comments included at the end of my message.
At least for me, it made it easier to see how setting top_p to 0 essentially turns nucleus sampling into greedy sampling, assuming an implementation similar to the toy code.
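Here is that claim as a runnable check, wrapping the same steps as my toy code (shared in full below) in a function; the seed and vocabulary size are arbitrary:

```python
import numpy as np

def nucleus_sample(logits, p, rng):
    """Toy nucleus sampling, same steps as the full code below."""
    probs = np.exp(logits) / np.sum(np.exp(logits))
    sorted_indices = np.argsort(probs)[::-1]
    cumulative_probs = np.cumsum(probs[sorted_indices])
    cutoff_idx = np.where(cumulative_probs > p)[0][0]
    probs[sorted_indices[cutoff_idx + 1:]] = 0.0
    probs /= np.sum(probs)
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = rng.uniform(-10, 10, 1000)

# With p = 0, every draw returns the argmax, i.e. greedy sampling
samples = {int(nucleus_sample(logits, 0.0, rng)) for _ in range(5)}
print(samples == {int(np.argmax(logits))})  # True
```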
Jeremy Howard touches on the Curious Case paper and shares a different implementation of nucleus sampling, one that takes a dependency on PyTorch, in this part of one of his lectures, which might be helpful to look at as well:
https://youtu.be/3oEb_fFmPnY?si=l2GU1PMnFnzen1Id&t=960
My toy code from above:
#########################################
import numpy as np
# p (top_p) must be between 0 and 1
p = .95
# Create sample logits (1000 random values between -10 and 10)
np.random.seed(42) # for reproducibility
logits = np.random.uniform(-10, 10, 1000)
# Convert logits to probabilities
probs = np.exp(logits) / np.sum(np.exp(logits)) #softmax(x_i) = exp(x_i) / Σ exp(x_j)
# Sort probabilities in descending order
sorted_indices = np.argsort(probs)[::-1]
sorted_probs = probs[sorted_indices]
# Calculate cumulative probabilities
cumulative_probs = np.cumsum(sorted_probs)
# Find the index where cumulative probability exceeds p
cutoff_idx = np.where(cumulative_probs > p)[0][0]
# Everything past the cutoff falls outside the nucleus
sorted_indices_to_remove = sorted_indices[cutoff_idx+1:]
# Zero out probabilities for tokens outside the nucleus
probs[sorted_indices_to_remove] = 0.0
# Re-normalize probabilities within the nucleus
probs /= np.sum(probs)
# Sample index according to the truncated+renormalized probability distribution
sampled_index = np.random.choice(len(probs), p=probs)
print(f"p = {p}")
print(f"Number of tokens in nucleus: {len(probs) - len(sorted_indices_to_remove)}")
print(f"Sampled index: {sampled_index}")
print(f"Probability of sampled index: {probs[sampled_index]:.4f}")
The MVP responded with a follow-up, saying they were still confused:
Back to my original question: There appears to be consensus that Top-P = 0.0 empirically behaves like greedy sampling, but with no specific mechanism identified.
My response:
I think the best way of thinking about it is just that the behavior is a side effect of how the mathematical expressions get evaluated in common implementations of nucleus sampling when p = 0.
The intuition that top_p = 0 would exclude all tokens is a good one, but the reality, when you look at what happens with p = 0 in an implementation of the nucleus sampling algorithm, is different. In common implementations, p = 0 will always select the highest-probability token.
To understand what is happening, you need to focus on the section of code where things start to get fun. (You could also write the next three lines as cutoff_idx = np.where(cumulative_probs > p)[0][0], but I broke it up to hopefully make it clearer what is happening.)
indices_where_true = np.where(cumulative_probs > p)
first_array = indices_where_true[0]
cutoff_idx = first_array[0]
The problem here is that when p = 0, every single cumulative_probs value will be greater than 0, so indices_where_true will contain every possible index of probability values/tokens in the distribution, with our highest-probability token in the index 0 position, since we normalized and pre-sorted from highest to lowest probability.
So the lines of code that are supposed to find the point at which to trim the tail of lower-probability tokens will, when given a p of 0, set the cutoff index at the position of the highest-probability token: index 0.
Then:
sorted_indices_to_remove = sorted_indices[cutoff_idx+1:]
We take the highest-probability token, go one index further [cutoff_idx+1], and trim off the tail, which consists of every other lower-probability token. We now have only one token left in the nucleus, and therefore greedy sampling, since the only remaining token is at index 0, where the highest-probability token lives.
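Here is that path traced on the toy 1000-token setup; every printed value follows from the argument above:

```python
import numpy as np

np.random.seed(42)
logits = np.random.uniform(-10, 10, 1000)
probs = np.exp(logits) / np.sum(np.exp(logits))
sorted_indices = np.argsort(probs)[::-1]
cumulative_probs = np.cumsum(probs[sorted_indices])

p = 0
indices_where_true = np.where(cumulative_probs > p)
print(len(indices_where_true[0]))      # 1000: every cumulative value is > 0
cutoff_idx = indices_where_true[0][0]
print(cutoff_idx)                      # 0: the very first index already exceeds p
nucleus = sorted_indices[:cutoff_idx + 1]
print(len(nucleus))                    # 1: only the top token survives the trim
print(nucleus[0] == np.argmax(probs))  # True: greedy sampling
```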
I will still hedge and say I can't guarantee how any given LLM provider's exact internal implementation will behave. OpenAI hasn't publicly released their implementation of nucleus sampling, so we can't verify it, and by extension we can't verify Azure OpenAI's either. I can only say that my toy example, Jeremy Howard's PyTorch implementation, and the original implementation from the repo shared for the Curious Case paper would all handle p = 0 in the way I described above.
Since others have found this useful, I will quickly walk through it step by step using my toy code; feel free to treat the above as the TL;DR and skip this part. There are also others on this dl who I am sure can explain it much better than I can:
First, let's look at an example with p set to .9, to give a better sense of what things look like with a GPT-2-sized vocabulary. GPT-2's vocab has 50257 tokens.
We'll set p to .9 and generate 50257 logits between -10 and 10, using a seed so anyone can re-run and get the same results.
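In the toy code, the only change from the 1000-token version is the number of logits:

```python
import numpy as np

p = 0.9
np.random.seed(42)  # seed so anyone can re-run and get the same results
vocab_size = 50257  # GPT-2's vocabulary size
logits = np.random.uniform(-10, 10, vocab_size)
print(logits.shape)  # (50257,)
```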
The logits variable now contains 50257 raw values between -10 and 10. If helpful, we can plot the random logits; since they are drawn uniformly, the distribution is essentially flat.
Then we convert the logits to probabilities using the softmax function, softmax(x_i) = exp(x_i) / Σ exp(x_j), so all 50257 values are positive and sum to 1. Note the maximum probability in the resulting distribution: after we sort the probabilities in descending order, that token sits at index 0, and the sorted distribution decreases monotonically from there.
Now we take the cumulative sum of our sorted probabilities.
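In code, those three steps (softmax, sort, cumulative sum) look like this on the 50257-logit setup, with the same seed as above:

```python
import numpy as np

np.random.seed(42)
logits = np.random.uniform(-10, 10, 50257)

# Softmax: turn logits into a probability distribution
probs = np.exp(logits) / np.sum(np.exp(logits))

# Sort descending: the highest-probability token moves to index 0
sorted_indices = np.argsort(probs)[::-1]
sorted_probs = probs[sorted_indices]

# Cumulative sum over the sorted probabilities
cumulative_probs = np.cumsum(sorted_probs)

print(sorted_probs[0] == probs.max())               # True
print(bool(np.isclose(cumulative_probs[-1], 1.0)))  # True: total mass is 1
```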
Next we find the indexes where cumulative_probs is greater than p. Note that for p = .9 this is a subset of our 50257 vocab of logits/tokens.
So we take our original distribution, set the cutoff index, move one index forward, and remove the low-probability tail.
Next we zero out the probabilities for the tokens outside of the nucleus.
If we take the sum of probs after zeroing out the values outside the nucleus, you can see we are no longer normalized; the values no longer add up to 1.
So we then renormalize our probabilities and sample from the truncated distribution, which gives:
Number of tokens in nucleus: 5751
Sampled index: 2243
Probability of sampled index: 0.0004
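For completeness, here is the whole p = .9 run as a single script; it is the toy code from earlier with the vocabulary size changed to 50257:

```python
import numpy as np

p = 0.9
np.random.seed(42)
logits = np.random.uniform(-10, 10, 50257)

probs = np.exp(logits) / np.sum(np.exp(logits))
sorted_indices = np.argsort(probs)[::-1]
cumulative_probs = np.cumsum(probs[sorted_indices])

# Trim the tail once the cumulative mass exceeds p
cutoff_idx = np.where(cumulative_probs > p)[0][0]
probs[sorted_indices[cutoff_idx + 1:]] = 0.0

# Renormalize within the nucleus and sample from it
probs /= np.sum(probs)
sampled_index = np.random.choice(len(probs), p=probs)

print(f"Number of tokens in nucleus: {cutoff_idx + 1}")
print(f"Sampled index: {sampled_index}")
print(f"Probability of sampled index: {probs[sampled_index]:.4f}")
```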
Now let's keep all the same code but set p = 0. Everything up to the cutoff calculation will be identical, since we haven't used p yet.
We still compute the same cumulative_probs, and index 0 of the cumulative sum still corresponds to our highest-probability token.
Now when we calculate:
indices_where_true = np.where(cumulative_probs > p)
We end up with an array that hasn't shrunk in size at all, because every index in cumulative_probs holds a value greater than 0. This is unlike the p = .9 case, where we ended up with a subset of the original array.
And now when we calculate our cutoff index, we take index 0 and then remove everything after it with [cutoff_idx+1:].
From this point on, the remaining code is almost a formality: we are guaranteed to sample this token, since there is only one token left in the nucleus, and it is our highest-probability token.
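Running the p = 0 case end to end on the same 50257-logit setup confirms it:

```python
import numpy as np

p = 0
np.random.seed(42)
logits = np.random.uniform(-10, 10, 50257)

probs = np.exp(logits) / np.sum(np.exp(logits))
sorted_indices = np.argsort(probs)[::-1]
cumulative_probs = np.cumsum(probs[sorted_indices])

cutoff_idx = np.where(cumulative_probs > p)[0][0]
probs[sorted_indices[cutoff_idx + 1:]] = 0.0
probs /= np.sum(probs)
sampled_index = np.random.choice(len(probs), p=probs)

print(cutoff_idx + 1)                      # 1: a single-token nucleus
print(sampled_index == np.argmax(logits))  # True: always the top token
```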
For some examples of how the split between in-nucleus and dropped/truncated tokens looks at other p values, I did a couple more runs and put the results through matplotlib. (Keep in mind we are using randomly generated logits; this is not the same distribution a real model would produce, but it still helps with understanding.)
p = .1
Number of tokens in nucleus: 264 | Sampled index: 2243 | Probability of sampled index: 0.0036
p = .3
Number of tokens in nucleus: 887 | Sampled index: 2078 | Probability of sampled index: 0.0010
p = .75
Number of tokens in nucleus: 3470 | Sampled index: 2200 | Probability of sampled index: 0.0003
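If you want to regenerate these splits yourself, a compact sweep over p values (same seed and setup as above) looks like this; note that the nucleus can only grow as p increases:

```python
import numpy as np

np.random.seed(42)
logits = np.random.uniform(-10, 10, 50257)
probs = np.exp(logits) / np.sum(np.exp(logits))
cumulative_probs = np.cumsum(np.sort(probs)[::-1])

sizes = []
for p in (0.1, 0.3, 0.75, 0.9):
    cutoff_idx = np.where(cumulative_probs > p)[0][0]
    sizes.append(int(cutoff_idx) + 1)
    print(f"p = {p}: {cutoff_idx + 1} tokens in nucleus")

# A larger p always keeps at least as many tokens
print(sizes == sorted(sizes))  # True
```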