Dummy's Guide to Modern LLM Sampling
16 comments · May 4, 2025
minimaxir
> For instance, why not use whole words as tokens?
Word-only tokenizers are what people did in the RNN/LSTM days. There's no functional improvement over tokenization schemes like BPE or even WordPiece/SentencePiece, and it results in worse quality since you can't use meaningful semantic hints such as punctuation.
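To make the difference concrete, here's a toy sketch (both vocabularies are invented for illustration) of why a word-only vocabulary degrades on unseen words while a subword scheme keeps something recoverable:

    word_vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}

    def word_tokenize(text):
        # Every out-of-vocabulary word collapses to a single <unk> id,
        # losing all information about what the word was.
        return [word_vocab.get(w, word_vocab["<unk>"]) for w in text.split()]

    subword_vocab = {"the", "cat", "sat", "s"}

    def subword_tokenize(word):
        # Greedy longest-match segmentation, loosely in the spirit of
        # WordPiece; real BPE learns its merges from corpus statistics.
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in subword_vocab:
                    pieces.append(word[i:j])
                    i = j
                    break
            else:
                pieces.append("<unk>")
                i += 1
        return pieces

    print(word_tokenize("the cats sat"))  # [0, 3, 2] -- "cats" becomes <unk>
    print(subword_tokenize("cats"))       # ['cat', 's'] -- still informative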
orbital-decay
One thing not said here is that samplers have no access to the model's internal state. It's basic math applied to the output distribution, which technically carries some semantics, but you can't decode it without being as smart as the model itself.
Certain samplers described here, like repetition penalty or DRY, are just like this: the model could repeat itself in a myriad of ways, and the only way to prevent all of them is better training, not n-gram search or other classic NLP methods. This is basically trying to plug every hole with a finger. How many fingers do you have?
Hacking the autoregressive process has some low-hanging fruit like Min-P that can make some improvement and enable certain nifty tricks, but if you're doing it to turn a bad model into a good one, you're doing it wrong.
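For reference, the classic repetition penalty is exactly this kind of surface-level patch — a minimal sketch of the divide/multiply rule from CTRL (Keskar et al., 2019). Note that it only sees token ids, so paraphrased repetition sails right through:

    def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
        # Divide positive logits (multiply negative ones) for every token
        # that already appeared. Synonyms and paraphrases are untouched,
        # which is the "plug every hole with a finger" problem above.
        out = list(logits)
        for tok in set(generated_ids):
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
        return out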
Der_Einzige
No, it's done to turn an uncreative model into a creative one. This idea that sampling isn't that important, or is some violation of the bitter lesson, is exactly why I had to call out the whole academic field as having a giant blind spot for this kind of research in our oral presentation at ICLR!
Top-n-sigma has been around since mid-2024 and min_p since 2023, and we are still waiting for these innovations to be integrated outside of open-source stacks (i.e., outside of HF/vLLM). It's being done slowly on purpose by API providers, because they don't want to deal with the risk of models being "too creative" (high temperatures also likely break their watermarking).
One other thing: making models aware of their own sampling settings is super easy if you just feed them back to the model every token or generation (say, using structured generation). Models can control their own sampling settings, and thus "have access to their internal state", with just a tiny bit of extra programming (the model can write that code for you now, lol).
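A rough sketch of that feedback loop (the in-band [sampler ...] tag and the generate_step stub are invented for illustration, not any real API):

    import json, re

    def generate_step(prompt, temperature, min_p):
        # Stub standing in for a real model call with these sampler settings.
        return "once upon a time "

    settings = {"temperature": 1.5, "min_p": 0.05}
    context = "User: write a story.\nAssistant:"

    for _ in range(8):
        # Feed the live settings back in-band so the model can "see" them.
        prompt = context + "\n[sampler " + json.dumps(settings) + "]\n"
        out = generate_step(prompt, **settings)
        context += out
        # If the model emits e.g. {"set_temperature": 0.7}, apply it;
        # structured generation would guarantee this format.
        m = re.search(r'"set_temperature":\s*([\d.]+)', out)
        if m:
            settings["temperature"] = float(m.group(1))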
orbital-decay
I guess variance is a better word for this. Creativity is a pretty loose term; for example, most people will describe R1 as creative in RP/stories for its tendency to derail everything in an unhinged way, but it still lacks variance like every other modern model (kill the reasoning chain and look at the logprobs to see what I mean). The bitter lesson is not some threshold and can't be violated; it describes a curve of diminishing returns. As long as you're on the shallow part of the curve, you're fine.
But the bigger problem is that concepts are expressed before they're decoded into the output distribution. You can steer them to a degree by hacking the autoregressive transport, but if the model itself has learned that this concept corresponds to that particular concept, not to a set of concepts (and RL tends to do exactly that), fixing it with sampling is usually hard to impossible; you'll just lose accuracy and make the model dumber as you force out-of-distribution outputs.
mdp2021
When the attempt, though, is to have the LLM output an "idea", not just a "next token", selection over the logits vector should break that original idea... If the idea is complete, there should be no need to use sampling over the logits.
The sampling, in this framework, should not happen near the output level ("what will the next spoken word be").
minimaxir
LLMs are trained to maximize the probability of correct guesses for the next token, not "ideas". You cannot define an idea as a training loss objective.
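Concretely, the pretraining objective is just the average negative log-probability of the observed next token — a minimal sketch:

    import math

    def next_token_loss(step_probs, target_ids):
        # Standard LM objective: mean negative log-probability assigned to
        # the token that actually came next. Nothing here mentions "ideas".
        return -sum(math.log(p[t]) for p, t in zip(step_probs, target_ids)) / len(target_ids)

    # Two positions, vocab of 3; the true next tokens got probs 0.7 and 0.5.
    print(next_token_loss([[0.2, 0.7, 0.1], [0.5, 0.3, 0.2]], [1, 0]))  # ~0.52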
mdp2021
That is an architectural problem. If you want the post rephrased: it is paradoxical to make changes near the output level ("changing words before it says them"), given that the expectation is to work with ideas. (And even then, selection would not be at the output level; it would be during the definition of the structure.)
> You cannot define an idea as a training loss objective
What tells you so? If you see a technical limit, note e.g. that sentences and paragraphs can have their own position in an embedding space.
orbital-decay
Interpretability studies offer several orthogonal ways to look at this, it's like Newtonian vs Lagrangian mechanics. Autoregressive token prediction, pattern matching, idea conceptualization, pathfinding in the extremely multidimensional space...
Der_Einzige
Related to this, our min_p paper was ranked #18 out of 12,000 submissions at ICLR and got an oral:
https://iclr.cc/virtual/2025/oral/31888
Our poster was popular:
poster: https://iclr.cc/media/PosterPDFs/ICLR%202025/30358.png?t=174...
oral presentation (watch me roast Yoshua Bengio on this topic and then have him be the first questioner; I'm the 2nd speaker, starting around the 19:30 mark. My slides for the presentation are there too, and they're really funny): https://iclr.cc/virtual/2025/session/31936
paper: https://arxiv.org/abs/2407.01082
As one of the min_p authors, I can confirm that top-n-sigma is currently the best general-purpose sampler by far. Also, temperature can and should be scaled far higher than it is today. Temps of 100 are totally fine with techniques like min_p and top-n-sigma.
Also, the special case of top_k = 2 with ultra-high temperature (one thing the authors recommend against near the end) is very interesting in its own right. Doing it leads to spelling errors every ~10th word, but it also seems to have a certain creativity to it that's quite interesting.
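For the curious, here are minimal sketches of all three (simplified from the real implementations; the example logits are invented):

    import math

    def softmax(logits, temperature=1.0):
        m = max(logits)
        exps = [math.exp((l - m) / temperature) for l in logits]  # exp(-inf) -> 0
        z = sum(exps)
        return [e / z for e in exps]

    def top_n_sigma(logits, n=1.0, temperature=1.0):
        # Keep tokens whose *raw* logit is within n standard deviations of
        # the max logit. Filtering happens before temperature, so cranking
        # temperature only flattens the survivors; it never readmits the
        # garbage tail. That's why temps like 100 stay coherent here.
        mean = sum(logits) / len(logits)
        sigma = math.sqrt(sum((l - mean) ** 2 for l in logits) / len(logits))
        cutoff = max(logits) - n * sigma
        return softmax([l if l >= cutoff else float("-inf") for l in logits],
                       temperature)

    def min_p(logits, p=0.1, temperature=1.0):
        # Drop tokens whose probability is below p times the top token's
        # probability, then renormalize. (Implementations differ on whether
        # temperature applies before or after the filter; before, here.)
        probs = softmax(logits, temperature)
        cutoff = p * max(probs)
        kept = [q if q >= cutoff else 0.0 for q in probs]
        z = sum(kept)
        return [q / z for q in kept]

    def top_k(logits, k=2, temperature=50.0):
        # top_k=2 at ultra-high temperature is close to a coin flip between
        # the two best tokens: occasional misspellings, but lots of variance.
        cutoff = sorted(logits, reverse=True)[k - 1]
        return softmax([l if l >= cutoff else float("-inf") for l in logits],
                       temperature)

    logits = [8.0, 7.5, 3.0, -2.0, -5.0]
    print(top_n_sigma(logits, n=1.0, temperature=100.0))
    print(min_p(logits, p=0.1))
    print(top_k(logits, k=2, temperature=50.0))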
toxik
Are there any samplers that aren't basically greedy, i.e., that actually search the tree? I realize it's an absolutely insane branching factor, and quite expensive to expand nodes at that, but it has always seemed odd to me that we don't actually search.
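There is a classic answer: beam search, which keeps the best k partial sequences instead of committing greedily. It's standard in machine translation decoders but rarely exposed for chat models. A toy sketch with a stub scoring function (not a real model):

    import math

    def stub_logprobs(prefix):
        # Stands in for one forward pass of a model; returns log-probs
        # over a three-token vocabulary regardless of the prefix.
        return {"a": math.log(0.6), "b": math.log(0.3), "c": math.log(0.1)}

    def beam_search(step_logprobs, beam_width=3, max_len=4):
        beams = [((), 0.0)]  # (token sequence, cumulative log-prob)
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                for tok, lp in step_logprobs(prefix).items():
                    candidates.append((prefix + (tok,), score + lp))
            # Keep only the best k partial sequences -- this is the "search",
            # and also why it costs beam_width forward passes per step.
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        return beams

    print(beam_search(stub_logprobs)[0])  # best sequence and its log-prob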
antonvs
This is great! “Sampling” covers much more than I expected.
blt
This is pretty interesting. I didn't realize so much manipulation was happening after the initial softmax temperature choice.
minimaxir
It's worth noting that only some of these techniques are configurable via modern LLM APIs (usually only temperature/top-p/top-k, since the other penalties require overhead).
Der_Einzige
Most other penalties don't require much overhead (min_p is basically free).
Most techniques are not made available by API providers because they enable alignment breaking. It's the only explanation for why we are still stuck with only top_p, top_k, and temperature in the 0-2 range.
If you want proper sampler settings to be available, your options are oobabooga, SillyTavern (dependent on your backend, so the vLLM backend for example doesn't have top-n-sigma yet), or directly running Hugging Face code. There might be some marginal options here too, but in general, sampling innovation is firmly in the hands of open-source coomers right now and not in the hands of academics.
neuroelectron
Love this. The way everything is mapped out and explained simply really opens up the opportunity for trying new things, and shows where you can do that effectively.
For instance, why not use whole words as tokens? Make a "robot" with a limited "robot dialect." Yes, there's no capacity for new words or rare words, but you could modify the training data and input data to translate those words into the existing vocabulary. Now you have a much smaller mapping that's literally robot-like and kind of gives the user an expectation of what kinds of questions the robot can answer well, like C-3PO.
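Something like this, as a toy sketch (the synonym table and vocabulary are made up): normalize text into the closed "dialect" before a word-level tokenizer ever sees it.

    DIALECT = {"automobile": "car", "purchase": "buy"}
    VOCAB = {w: i for i, w in enumerate(
        ["i", "want", "to", "buy", "an", "car", "<unk>"])}

    def to_dialect_ids(text):
        # Translate rare/unknown words into the closed vocabulary first,
        # then tokenize whole words -- C-3PO never needs "automobile".
        words = [DIALECT.get(w, w) for w in text.lower().split()]
        return [VOCAB.get(w, VOCAB["<unk>"]) for w in words]

    print(to_dialect_ids("I want to purchase an automobile"))  # [0, 1, 2, 3, 4, 5]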