
Positional preferences, order effects, prompt sensitivity undermine AI judgments

shahbaby

Fully agree; I've found that LLMs aren't good at tasks that require evaluation.

Think about it: if they were good at evaluation, you could remove all humans from the loop and have recursively self-improving AGI.

Nice to see an article that makes a more concrete case.

visarga

Humans aren't good at validation either. We need tools, experiments, labs. Unproven ideas are a dime a dozen. Remember the hoopla about room temperature superconductivity? The real source of validation is external consequences.

ken47

Human experts set the benchmarks, and LLMs cannot match them in most fields requiring sophisticated judgment.

They are very useful for some things, but sophisticated judgment is not one of them.

nimitkalra

Some other known distributional biases include self-preference bias (e.g., gpt-4o prefers gpt-4o generations over Claude generations) and structured-output/JSON-mode bias [1]. Interestingly, some models also skew more positive or negative than others. This library [2] also provides some methods for calibrating/stabilizing them.

[1]: https://verdict.haizelabs.com/docs/cookbook/distributional-b... [2]: https://github.com/haizelabs/verdict
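A minimal sketch of how one might probe a judge for self-preference and order effects, assuming a judge() callable that wraps whichever judge model you're testing (the callable and its "first"/"second" return convention are hypothetical, not the verdict API):

    from typing import Callable

    def preference_rate(judge: Callable[[str, str], str],
                        pairs: list[tuple[str, str]]) -> float:
        """Fraction of pairs where the judge consistently prefers model A's output.

        Each pair is (model_a_output, model_b_output). Every pair is judged
        twice with the candidates swapped to control for positional bias;
        verdicts that flip with position are discarded as noise.
        """
        wins_a, counted = 0, 0
        for out_a, out_b in pairs:
            first = judge(out_a, out_b)    # assumed to return "first" or "second"
            second = judge(out_b, out_a)
            if first == "first" and second == "second":
                wins_a += 1
                counted += 1
            elif first == "second" and second == "first":
                counted += 1
        return wins_a / counted if counted else float("nan")

With anonymized outputs from models A and B, a rate well above 0.5 when the judge is built on model A is consistent with self-preference bias.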

armchairhacker

LLMs are good at discovery: they know a lot, and they can retrieve that knowledge from a query where simpler (e.g. regex-based) search engines over the same corpus couldn't. For example, an LLM given a case may surface an obscure law, or notice a pattern in past court cases that establishes precedent. So they can be helpful to a real judge.

Of course, the judge must check that the law or precedent isn't hallucinated and applies to the case in the way the LLM claims. They should also prompt other LLMs and use their own knowledge in case the cited law or precedent contradicts others.

There's a similar argument for scientists, mathematicians, doctors, investors, and other fields. LLMs are good at discovery but must be checked.

TrackerFF

I see "panels of judges" mentioned once, but what is the weakness of this? Other than more resource.

Worst case you end up with some multi-modal distribution, where two opinions are equal - which seems somewhat unlikely as the panel size grows. Or it could maybe happen in some case with exactly two outcomes (yes/no), but I'd be surprised if such a panel landed on a perfect uniform distribution in its judgments/opinions (50% yes 50% no)
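A quick way to check that intuition: if the n judges were independent and each said yes with probability 0.5, the chance of an exact 50/50 split shrinks as the panel grows (and an odd-sized panel can never tie). A small sketch under that independence assumption:

    from math import comb

    def tie_probability(n: int, p: float = 0.5) -> float:
        """Probability that n independent yes/no judges split exactly evenly."""
        if n % 2 == 1:
            return 0.0  # an odd-sized panel can't tie
        k = n // 2
        return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

    for n in (2, 4, 10, 20, 50):
        print(n, round(tie_probability(n), 3))
    # 2 0.5
    # 4 0.375
    # 10 0.246
    # 20 0.176
    # 50 0.112

Real judge panels aren't independent, though; correlated biases like the positional effects in the article can make skewed or split verdicts more common than this suggests.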

nimitkalra

One method to get a better estimate is to extract the token log-probabilities of "YES" and "NO" from the final logits of the LLM and take a weighted sum [1] [2]. If the LLM is calibrated for your task, a genuinely ambiguous sample should put roughly 50% probability on YES (scored as 1) and 50% on NO (scored as 0), yielding 0.5.
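A minimal sketch of that weighted sum, assuming you already have the top log-probabilities for the judge's first output token (the dictionary shape here is an assumption; hosted APIs expose something equivalent when logprobs are requested):

    import math

    def soft_yes_score(top_logprobs: dict[str, float]) -> float:
        """Collapse YES/NO token log-probabilities into a score in [0, 1].

        top_logprobs maps candidate first tokens to their log-probabilities.
        YES counts as 1 and NO as 0; the result is the probability mass on
        YES, renormalized over the two labels.
        """
        p_yes = math.exp(top_logprobs.get("YES", float("-inf")))
        p_no = math.exp(top_logprobs.get("NO", float("-inf")))
        if p_yes + p_no == 0:
            raise ValueError("neither YES nor NO appeared in the top logprobs")
        return p_yes / (p_yes + p_no)

    # A calibrated judge on a genuinely ambiguous sample should land near 0.5:
    print(soft_yes_score({"YES": math.log(0.5), "NO": math.log(0.5)}))  # 0.5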

But generally you wouldn't use a binary outcome when your samples can be a genuine 50/50 pass/fail. Better to use a discrete scale of 1..3 or 1..5 and specify exactly what makes a sample a 2/5 vs. a 4/5, for example.
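As an illustration of that kind of rubric (the criteria below are hypothetical, just to show the level of specificity a judge prompt should pin down):

    # Hypothetical rubric for a summarization judge; the point is that each
    # score has concrete, checkable criteria rather than a vague "rate 1-5".
    RUBRIC = """Score the summary on a 1-5 scale:
    1: contradicts the source or invents facts
    2: no fabrications, but misses the main point of the source
    3: captures the main point but omits important supporting details
    4: captures the main point and details, with minor wording or emphasis issues
    5: faithful, complete, and concise
    Respond with only the integer score."""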

You are correct to question the weaknesses of a panel. This class of methods depends on diversity through high-temperature sampling, which can lead to spurious YES/NO responses that don't generalize well and are effectively noise.

[1]: https://arxiv.org/abs/2303.16634 [2]: https://verdict.haizelabs.com/docs/concept/extractor/#token-...

sidcool

We went from Impossible to Unreliable. I like the direction as a techie. But not sure as a sociologist or an anthropologist.

tempodox

> We call it 'prompt-engineering'

I prefer to call it “prompt guessing”; it's like some modern variant of alchemy.

BurningFrog

"Prompt Whispering"?

th0ma5

Prompt divining

sanqui

Meanwhile in Estonia, they just agreed to resolve child support disputes using AI... https://www.err.ee/1609701615/pakosta-enamiku-elatisvaidlust...

giancarlostoro

I listen to online debates, especially political ones on various platforms, and man, the AI slop that people slap around at each other is beyond horrendous. I would not want an LLM being the final say on something critical. I want the opposite: an LLM should identify things that need follow-up review by a qualified person. A person should still confirm the things that "pass," but they can then prioritize what to validate first.

batshit_beaver

I don't even trust LLMs enough to spot content that requires validation or nuance.

wagwang

Can't wait for the new field of AI psychology

gizajob

[flagged]

dang

Ok, but please don't post unsubstantive comments to Hacker News.

gizajob

Ok sorry. I’ll go back to slashdot.

tremon

At least until the LLM judges otherwise.

andrepd

[flagged]

dang

Comments like this break the site guidelines, and not just a little. Can you please review https://news.ycombinator.com/newsguidelines.html and take the intended spirit of this site more to heart? Note these:

"Please don't fulminate."

"Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative."

"Please don't sneer, including at the rest of the community."

"When disagreeing, please reply to the argument instead of calling names. 'That is idiotic; 1 + 1 is 2, not 3' can be shortened to '1 + 1 is 2, not 3."

There's plenty of LLM skepticism on HN and that's fine, but like all comments here, it needs to be thoughtful.

(We detached this comment from https://news.ycombinator.com/item?id=44074957)

ken47

[flagged]

dang

"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

https://news.ycombinator.com/newsguidelines.html

(We detached this comment from https://news.ycombinator.com/item?id=44074957)