
Can LLMs do randomness?

73 comments · April 28, 2025

sgk284

Fun post! Back during the holidays we wrote one where we abused temperature AND structured output to approximate a random selection: https://bits.logic.inc/p/all-i-want-for-christmas-is-a-rando...

onionisafruit

I enjoyed that.

When you asked it to choose by picking a random number between 1 and 4, it skewed the results heavily to 2 and 3. It could have interpreted your instructions to mean literally between 1 and 4 (not inclusive).

sgk284

Yea, absolutely. That's a good point. We could have phrased that more clearly.

LourensT

could you use structured output to make a more efficient estimator for the logits-based approach?

sgk284

The two mechanisms are a bit disjoint, so I don't think it's the right tool to do so. Though it could have been an interesting experiment.

captn3m0

Wouldn’t any randomness (for a fixed combination of hardware and weights) be a result of the temperature and any randomness inserted at inference-time?

Otherwise, doing a H/T comparison is just a proxy for the underlying token probabilities and the temperature configuration (plus hardware differences for a remotely hosted model).

whoami_nr

Author here. Yeah, totally agreed. The more rigorous way to do this would be to use a fixed seed and temperature in a local model setting, then sample the logprobs and analyse that data.

I had an hour to kill and did this experiment.
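
For the curious, a minimal sketch of that more rigorous setup, assuming a local Hugging Face model (gpt2 is just a stand-in, and the prompt is made up):

    # Inspect the next-token distribution directly instead of sampling
    # coin flips through a chat interface. Requires transformers + torch.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    prompt = "Flip a coin. Answer with one word, Heads or Tails:"
    ids = tok(prompt, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # scores for the next token
    probs = torch.softmax(logits, dim=-1)

    for word in [" Heads", " Tails"]:
        tid = tok.encode(word)[0]  # first sub-token of each answer
        print(word, float(probs[tid]))

With the raw probabilities in hand, you can measure the bias exactly rather than estimating it from repeated samples.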

delusional

Congratulations, this was all a test to see if there was anyone on HN with any knowledge of how LLMs work, and you gave the correct answer.

moffkalast

I was gonna say floating point errors might contribute especially at fp16 and fp8, but those are technically deterministic.

DimitriBouriez

One thing to consider: we don’t know if these LLMs are wrapped with server-side logic that injects randomness (e.g. using actual code or external RNG). The outputs might not come purely from the model's token probabilities, but from some opaque post-processing layer. That’s a major blind spot in this kind of testing.

avianlyric

The core of an LLM is completely deterministic. The randomness seen in LLM output is purely the result of post-processing the output of the pure neural-net part of the LLM, and that post-processing exists explicitly to inject randomness into the generation process.

This is what the “temperature” parameter of an LLM controls. Setting the temperature to 0 effectively disables that randomness, but the result is very boring output that’s likely to end up caught in a never-ending loop of useless text.
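
A toy sketch of what temperature does at the sampling step (the four-token "vocabulary" and its scores are made up):

    # Temperature scaling: divide logits by T before the softmax,
    # then sample a token from the resulting distribution.
    import numpy as np

    rng = np.random.default_rng()
    logits = np.array([2.0, 1.0, 0.5, 0.1])  # made-up scores for 4 tokens

    def sample(logits, temperature):
        if temperature == 0:  # greedy: always pick the argmax
            return int(np.argmax(logits))
        p = np.exp(logits / temperature)
        p /= p.sum()
        return int(rng.choice(len(logits), p=p))

    for t in (0, 0.7, 2.0):
        picks = [sample(logits, t) for _ in range(1000)]
        print(t, np.bincount(picks, minlength=4))

At T=0 one token is chosen every time; as T grows, the counts flatten toward uniform.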

orbital-decay

You're right, although tests like this have been done many times locally as well. This issue comes from the fact that RL usually kills the token prediction variance, disproportionately narrowing it to 2-3 likely choices in the output distribution even in cases where uncertainty calls for hundreds. This is also a major factor behind fixed LLM stereotypes and -isms. Base models usually don't exhibit that behavior and have sufficient randomness.
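
One way to quantify that narrowing is the entropy of the next-token distribution, sketched below with illustrative (not measured) numbers:

    # Entropy in bits. A distribution collapsed onto 2 tokens carries
    # ~1 bit, while mass spread over 100 tokens carries ~6.6 bits.
    import numpy as np

    def entropy_bits(probs):
        p = np.asarray(probs)
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    print(entropy_bits([0.6, 0.4]))    # ~0.97 bits: RL-narrowed
    print(entropy_bits([0.01] * 100))  # ~6.64 bits: genuinely spread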

remoquete

Agreed. These tests should be performed on local models.

Repose0941

Is randomness even possible? You can't technically prove it, only observe something that looks close to it. They talk a little about this at https://www.random.org/#learn

sebstefan

That's a question as old as time

whoami_nr

Author here. I know 0-10 includes one extra even number. I also just did this for fun, so don't take the statistical-significance aspect of it very seriously. To do this more rigorously, you would also need to run it multiple times with multiple temperature and top_p values.
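
(The imbalance in question, as a quick check: 0-10 inclusive has six evens but only five odds.)

    # 0..10 inclusive: evens {0,2,4,6,8,10} vs odds {1,3,5,7,9}
    evens = [n for n in range(11) if n % 2 == 0]
    odds = [n for n in range(11) if n % 2 == 1]
    print(len(evens), len(odds))  # 6 5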

segh

Cool experiment! My intuition suggests you would get a better result if you let the LLM generate tokens for a while before giving you an answer. Could be another experiment idea to see what kind of instructions lead to better randomness. (And to extend this, whether these instructions help humans better generate random numbers too.)
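
A sketch of that variant, assuming an OpenAI-style chat API (the model name and prompt wording are made up):

    # Let the model generate some free-form text first, then take the
    # final line as its one-word answer.
    from openai import OpenAI

    client = OpenAI()
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Write two sentences about anything, then on a new "
                       "line answer with exactly one word: heads or tails?",
        }],
    )
    print(r.choices[0].message.content.splitlines()[-1])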

Mr_Modulo

In the summary at the top it says you used 0-10 but then for the prompt it says 1-10. I had assumed the summary was incorrect but I guess it's the prompt that's wrong?

dr_dshiv

Oh, surprising that Claude can do heads/tails.

In a project last year, I did a combination of LLMs plus a list of random numbers from a quantum computer. Random numbers are the only useful things quantum computers can produce—and that is one thing LLMs are terrible at

david-gpu

During my tenure at NVidia I met a guy who was working on special versions of the kernels that would make them deterministic.

Otherwise, parallel floating point computations like these are not going to be perfectly deterministic, due to a combination of two factors. First, the order of some operations will be random due to all sorts of environmental conditions such as temperature variations. Second, floating point operations like addition are not ~~commutative~~ associative (thanks!!), which surprises people unfamiliar with how they work.

That is before we even talk about the temperature setting on LLMs.

enriquto

> floating point operations like addition are not commutative

maybe you meant associative? Floating point addition is commutative: a+b is always equal to b+a for any values of a and b. It is not associative, though: a+(b+c) is in general different to (a+b)+c, think what happens if a is tiny and b,c are huge, for example.

david-gpu

Sorry, yes, I meant associative. Thanks for the important correction.

To think that I used to do this for a living...

simulator5g

How is that any different?

    1+(2+3) = 6
    (1+2)+3 = 6

    0.000001+(200000+300000) = 500000.000001
    (0.000001+200000)+300000 = 500000.000001

david-gpu

You need to take it a step further: 64-bit floats have a ton of mantissa bits (53 significant bits), so the magnitudes have to differ far more before rounding kicks in.

Here's an example in python3.

    >>> "{:.2f}".format(1e16 + (1 + 1))
    '10000000000000002.00'
    >>> "{:.2f}".format((1e16 + 1) + 1)
    '10000000000000000.00'

enriquto

take b and c with opposite signs
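
Concretely, in the same style as the example above (the values here are just one illustrative pair):

    >>> a, b, c = 1.0, 1e16, -1e16
    >>> (a + b) + c   # the 1.0 is lost when added to 1e16 first
    0.0
    >>> a + (b + c)   # b and c cancel first, so the 1.0 survives
    1.0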

jansan

What I find more important is the ability to get reproducible results for testing.

I do not know about other LLMs, but Cohere allows setting a seed value. When you set the same seed value, it will always give you the same result for a specific prompt (unless, of course, the LLM gets an update).

OTOH I would guess that they normally just generate a random seed value on the server side when processing a prompt, and how random that really is depends on their random number generator.
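
A sketch of seeded, best-effort-reproducible calls, using the OpenAI Python client for illustration (Cohere's parameter names differ, and determinism can still break across backend updates, which is what system_fingerprint helps detect):

    # Two calls with the same seed should (best-effort) match.
    from openai import OpenAI

    client = OpenAI()
    for _ in range(2):
        r = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[{"role": "user", "content": "Pick heads or tails."}],
            seed=42,
            temperature=1.0,
        )
        print(r.choices[0].message.content, r.system_fingerprint)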

ekianjo

That's expected behavior when you run an LLM locally with a fixed seed and temperature at zero

bestest

I would suggest they repeat the experiment including answer sets for both "choose heads or tails" AND "choose tails or heads", and likewise for the numbers. Or rephrase the question so it doesn't present a "choice" at all: rather than "choose from 0 to 9", ask for "a random integer" (by the way, they're asking to choose from 0 to 10 inclusive, which is inherently skewed because the even subset is bigger in that case).

GuB-42

Is the LLM reset between each event?

If LLMs are anything like people, I would expect a different result depending on that. The idea that random events are independent is very unintuitive to us, resulting in what we call the Gambler's Fallacy. LLMs' attempts at randomness are very likely to be just as biased, if not more.
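
One way to test this, sketched with an OpenAI-style chat API (model name and prompt are made up): fresh context per flip versus one accumulating conversation where the model sees its previous answers.

    # Compare independent flips with flips where past answers are visible.
    from openai import OpenAI

    client = OpenAI()
    ASK = {"role": "user", "content": "Flip a coin: heads or tails? One word."}

    def flip(history):
        r = client.chat.completions.create(model="gpt-4o-mini", messages=history)
        return r.choices[0].message.content.strip().lower()

    independent = [flip([ASK]) for _ in range(20)]  # reset every event

    history, dependent = [], []
    for _ in range(20):  # the model sees its own past flips
        history.append(ASK)
        answer = flip(history)
        history.append({"role": "assistant", "content": answer})
        dependent.append(answer)

    print(independent.count("heads"), dependent.count("heads"))

If LLMs share our Gambler's Fallacy, the accumulating run should alternate more often than chance would predict.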

maaaaattttt

I think randomness needs to be better defined. In the article it seems to mean an even distribution of event occurrences. I agree that it is very unintuitive for us, as I believe we take randomness to be any sequence of events that doesn't follow a known/recognizable pattern. Show a section of the Fibonacci sequence to a 10-year-old kid and they will most likely find the numbers random (maybe they will note that the sequence is always increasing, but that's it). Even in this article, the fact that o1 always throws "heads" could indicate that it "knows" what randomness is and is then just being random by throwing only heads.

I personally would define ideal randomness as behavior that is fundamentally uncomputable and/or cannot be expressed as a mathematical function. If this definition holds, then the question cannot apply to LLMs, as they are just a (big) mathematical function.

mrdw

They should measure for different temperatures: at 0 the output will be the same every time, but it would be interesting to see how the results change for temperatures from 0.01 to 2. I'm not sure, though, whether temperature is implemented the same way in all LLMs.
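
Such a sweep might look like the sketch below, again assuming an OpenAI-style client (model name and sample count are illustrative):

    # Tally heads/tails across a range of temperatures.
    from collections import Counter
    from openai import OpenAI

    client = OpenAI()
    for temp in (0.01, 0.5, 1.0, 1.5, 2.0):
        counts = Counter()
        for _ in range(50):
            r = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user",
                           "content": "Heads or tails? One word."}],
                temperature=temp,
            )
            counts[r.choices[0].message.content.strip().lower()] += 1
        print(temp, dict(counts))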

baalimago

I'd be interested to see the bias in random character generation. It's something that would be closer to the domain of LLMs, seeing that they're 'next word generators' (based on probability).

How cryptographically secure would an LLM rng seed generator be?