People are just as bad as my LLMs

82 comments

March 10, 2025

rainsford

> ...a lot of the safeguards and policy we have to manage humans' own unreliability may serve us well in managing the unreliability of AI systems too.

It seems like an incredibly bad outcome if we accept "AI" that's fundamentally flawed in a way similar to, if not worse than, humans and try to work around it, rather than relegating it to unimportant tasks while we work towards the standard of intelligence we'd otherwise expect from a computer.

LLMs certainly appear to be the closest to real AI that we've gotten so far. But I think a lot of that is due to the human bias that language is a sign of intelligence, and our measuring stick is unsuited to evaluate software specifically designed to mimic the human ability to string words together. We now have the unreliability of human language processes without most of the benefits that come from actual human-level intelligence. Managing that unreliability with systems designed for humans bakes in all the downsides without further pursuing the potential upsides of legitimate computer intelligence.

jayd16

If the question inherently allows for "no-preference" to be valid, but that is not a possible answer, then you've left it to the person or LLM to deal with that. If a human is not allowed to specify no preference, why would you expect uniform results when you don't even ask for them? You only asked them to pick the best. Even if they picked perfectly, the task never specifies that draws should be resolved in a random way.

lxe

It's almost as if we trained LLMs on text produced by people.

MrMcCall

I love the posters that make fun of those corporate motivational posters.

My favorite is:

  No one is as dumb as all of us.
And they trained their PI* on that giant turd pile.

* Pseudo Intelligence

LoganDark

I don't count LLMs as intelligent. To a certain degree they can be a component of intelligence, but I don't count an LLM on its own.

tehsauce

There has been some good research published on how RLHF (i.e. aligning to human preferences) easily introduces mode collapse and bias into models. For example, with a prompt like "Choose a random number", the base pretrained model can give relatively random answers, but after fine-tuning to produce responses humans like, models become very biased towards responding with numbers like "7" or "42".
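
One way to see this for yourself is to tally the answers over many samples. A minimal Python sketch, assuming a hypothetical sample_number() helper that sends the prompt to whichever model you want to test (base or RLHF-tuned) and returns its text reply:

  # Count how often each answer comes back for a fixed prompt.
  # sample_number is a hypothetical stand-in for your model call.
  from collections import Counter

  def number_distribution(sample_number, n_trials=200):
      prompt = "Choose a random number between 1 and 10."
      counts = Counter()
      for _ in range(n_trials):
          counts[sample_number(prompt).strip()] += 1
      return counts

A roughly uniform sampler would put about n_trials/10 in each bucket; RLHF-tuned models reportedly pile up on answers like "7".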

robwwilliams

I assume 42 is a joke from deep history and The Hitchhiker’s Guide. Pretty amusing to read the Wikipedia entry:

https://en.wikipedia.org/wiki/42_(number)

sedatk

Douglas Adams picked 42 randomly though. :)

robertlagrant

Not at all. It was derived mathematically from the Question: What do you get if you multiply six by nine?

aidos

Why is that? Whenever I’m giving examples I almost always use 7, something ending in a 7, or something in the 70s.

p1necone

1 and 10 are on the boundary, that's not random so those are out.

5 is exactly halfway, that's not random enough either, that's out.

2, 4, 6, 8 are even and even numbers are round and friendly and comfortable, those are out too.

9 feels too close to the boundary, it's out.

That leaves 3 and 7, and 7 is more than 3 so it's got more room for randomness in it right?

Therefore 7 is the most random number between 1 and 10.

da_chicken

The theory I've heard is that the more prime a number is, the more random it feels. 13 feels more awkward and weird, and it doesn't come up naturally as often as 2 or 3 do in everyday life. It's rare, so it must be more random! I'll give you the most random number I can think of!

People tend to avoid extremes, too. If you ask for a number between 1 and 10, people tend to pick something in the middle. Somehow, the ordinal values of the range seem less likely.

Additionally, people tend to avoid numbers that are in other ranges. Ask for a number from 1 to 100, and it just feels wrong to pick a number between 1 and 10. They asked for a number between 1 and 100. Not this much smaller range. You don't want to give them a number they can't use. There must be a reason they said 100. I wonder if the human RNG would improve if we started asking for numbers between 21 and 114.

thfuran

People also tend to botch random sequences by trying to avoid repetition or patterns.

foota

Okay, this is a nitpick, but I don't think ordinal can be used in that way. "Somehow, the ordinal values of the range seem less likely". I'd probably go with extremes of the range? Or endpoints?

Ethee

Veritasium actually made a video on this concept about a year ago: https://www.youtube.com/watch?v=d6iQrh2TK98

d4mi3n

My guess is that we bias towards numbers with cultural or personal significance. 7 is lucky in western cultures and is religiously significant (see https://en.wikipedia.org/wiki/7#Culture). 42 is culturally significant in science fiction, though that's a lot more recent. There are probably other examples, but I imagine the mean converges on numbers with multiple cultural touchpoints.

Jensson

I have never heard of 7 being a lucky number in Western culture, and your link doesn't support that. 3 is a lucky number, 13 is an unlucky number; 7 is nothing to me.

So I don't think it's that; 7 is still a very common "random number" here even though it has no special cultural significance.

d0liver

I like prime numbers. Non-primes always feel like they're about to fall apart on me.

Shorel

They choose 37 =)

moffkalast

It's very funny that people hold the autoregressive nature of LLMs against them, while being far more hardline autoregressive themselves. It's just not consciously obvious.

antihipocrat

I wonder whether we hold LLMs to a different standard because we have a long-term, reinforced expectation that a computer will produce an exact result.

One of my first teachers told me that a computer will never output anything wrong; it will produce a result according to the instructions it was given.

LLMs follow this principle as well. It's just that when we assess the quality of their output, we incorrectly compare it to the deterministic alternative, and that isn't really a valid comparison.

absolutelastone

I think people tend to just not understand what autoregressive methods are capable of doing generally (i.e., basically anything an alternative method can do), and worse they sort of mentally view it as equivalent to a context length of 1.

mynameismon

Can you share any links about this?

thechao

Which is weird, because I thought we'd all agreed that the random number was 4?

https://xkcd.com/221/

le-mark

Human-level artificial intelligence has never had much appeal to me. There are enough idiots in the world; why do we need artificial ones? I.e., what if average machine intelligence mirrored the human IQ distribution?

roywiggins

Owners would love to be able to convert capital directly into products without any intermediate labor[0]. Fire your buildings full of programmers and replace them with a server farm that only gets faster and more efficient over time? That's a great position to be in, if you own the IP and/or server farm.

[0] https://qntm.org/mmacevedo

smallnix

Is my understanding wrong that LLMs are trained to emulate observed human behavior in their training data?

From that it follows that LLMs are fit to reproduce all kinds of human biases, like preferring the first choice out of many, or the last out of many (primacy and recency biases). Funnily, the LLM might replicate these biases slightly wrong, and by doing so produce new, derived biases.

ziaowang

This understanding is incomplete, in my opinion. LLMs do more than emulate observed behavior. In the pre-training phase, objectives like masked language modeling indeed train the model to mimic what it reads (which of course contains lots of bias); but in the RLHF phase, the model learns to generate the response judged best by human evaluators (who try to eliminate as much bias as possible in the process). In other words, it is trained to meet human expectations in this later phase.

But human expectations are also not bias-free (e.g. the preferring-the-first-choice phenomenon).

Xelynega

I don't understand what you are saying.

How can the RLHF phase eliminate bias if it uses a process (human input) that has the same biases as the pre-training (human input)?

nthingtohide

Not only that: if a future AI distrusts humanity, it will be because history, literature and fiction are full of such scenarios, and the AI will learn those patterns and their associated emotions from those texts. Humanity together will be responsible for creating a monster (if that scenario happens).

mplewis

LLMs don't emulate human behavior. They spit out chunks of words in an order that parrots some of their training data.

dkenyser

Correct me if I'm wrong, but I feel like we're splitting hairs here.

> spits out chunks of words in an order that parrots some of their training data.

So, if the data was created by humans then how is that different from "emulating human behavior?"

Genuinely curious as this is my rough interpretation as well.

icelancer

Are you sure that humans are much more than this in terms of spoken/written language?

educasean

Is this just pedantry or is there some insight to be gleaned by the distinction you made?

Xelynega

It's not really pedantic when there's an entire wikipedia page on the tendency for people to conflate the two: https://en.wikipedia.org/wiki/ELIZA_effect

I believe the distinction they're trying to make is between "sounding like a human"(being able to create output that we understand as language) and "thinking like a human"(having the capacity for experience, empathy, semantic comprehension, etc.)

MyOutfitIsVague

I can only assume that either they are trying to point out that words aren't behavior, and mimicking human writing isn't the same thing as mimicking human behavior, or it's some pot-shot at the capabilities of LLMs.

davidcbc

This is a more pedantic and meme-y way of saying the same thing.

markbergz

For anyone interested in these LLM pairwise sorting problems, check out this paper: https://arxiv.org/abs/2306.17563

The authors discuss the person 1 / doc 1 bias and the need to always evaluate each pair of items twice.

If you want to play around with this method there is a nice python tool here: https://github.com/vagos/llm-sort

fpgaminer

The paper basically boils down to suggesting (and analyzing) these options:

* Comparing all possible pair permutations eliminates any bias, since all pairs are compared both ways, but is exceedingly computationally expensive.

* Using a sorting algorithm such as Quicksort or Heapsort is more computationally efficient, and in practice doesn't seem to suffer much from bias.

* Sliding-window sorting has the lowest computational requirement, but is mildly biased.

The paper doesn't seem to explore the prompt itself and whether it has any impact on the input-ordering bias. I think that would be nice to know. Maybe assigning the options random names instead of ordinals would reduce the bias. That said, I doubt there's some magic prompt that will reduce the bias to 0. So we're definitely stuck with the options above until the LLM itself gets debiased correctly.
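
A minimal sketch of the double-evaluation idea (compare each pair in both orders), assuming a hypothetical judge(a, b) callable that returns 1 if the model prefers the item presented first and 2 if it prefers the item presented second; the actual prompt and response parsing are left out:

  def debiased_compare(judge, item_a, item_b):
      # Ask twice, swapping which item is labelled "one" vs "two".
      first = judge(item_a, item_b)   # item_a presented first
      second = judge(item_b, item_a)  # item_b presented first
      a_wins = (first == 1) + (second == 2)
      b_wins = (first == 2) + (second == 1)
      if a_wins > b_wins:
          return 1    # consistent preference for item_a
      if b_wins > a_wins:
          return -1   # consistent preference for item_b
      return 0        # answer flipped with the labels: position bias, not preference

A 0 result is exactly the case discussed further down the thread, where swapping the labels swaps the outcome and the comparison carries no real signal.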

velcrovan

Interleaving a bunch of people's comments and then asking the LLM to sort them out and rank them seems like a poor method. The whole premise seems silly, actually. I don't think there's any lesson to draw here other than that you need to understand the problem domain in order to get good results from an LLM.

soared

Wouldn’t the same outcome be achieved much more simply by giving LLMs two choices (colors, numbers, whatever), asking them to “pick one”, and assessing the results in the same way?

ramity

You absolutely can. Deterministic inference is achievable, but it isn't as performant. The reason sadly boils down to floating-point math.
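
As a concrete illustration (not from the thread): floating-point addition is not associative, so when a parallel reduction on a GPU sums the same values in a different order across runs, the low-order bits can differ, and that can occasionally tip which token gets sampled.

  a, b, c = 0.1, 0.2, 0.3
  print((a + b) + c)                  # 0.6000000000000001
  print(a + (b + c))                  # 0.6
  print((a + b) + c == a + (b + c))   # False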

megadata

At least LLMs are very often ready to acknowledge they might be wrong.

It can be incredibly hard to get a person to acknowledge that they might be remotely wrong on a topic they really care about.

Or, for some people, the thought that they might be wrong about anything at all is just like blasphemy to them.

Xelynega

Is this not just because aggressive material was filtered out of training data and the system prompts usually include some preamble about being polite?

"Acknowledging they might be wrong" makes them sound like more than token predictors trained on polite sounding text.

roywiggins

Most of the reason LLMs will "admit they're wrong" is because they've been trained not to argue too hard, and to not hold strong preferences. It's a sort of customer service personality.

When you don't do that sufficiently you run the risk of producing the "Sydney" personality that Bing Chat had, which would argue back, and could go totally feral defending its incorrect beliefs about the world, to the point of insulting and belittling the user.

devit

The "person one" vs "person two" bias seems trivially solvable by running each pair evaluation twice with each possible labelling and the averaging the scores.

Although of course that behavior may be a sign that the model is more or less guessing randomly rather than actually producing a signal.

harrisonjackson

Agreed on the second part. Correcting for bias this way might average out the scores but not in a way that correctly evaluates the HN comments.

The LLM isn't performing the desired task.

It sounds possible to cancel out the comments where reversing the labels swaps the outcome because of bias. That will leave the more "extreme" HN comments that were consistently scored regardless of the label. But that still may not solve the intended task.

rahimnathwani

  The LLM isn't performing the desired task.
It's 'not performing the task', in the same way that the humans ranking voice attractiveness are 'not performing the task'.

I wouldn't treat the output as complete garbage, just because it's somewhat biased by an irrelevant signal.
