
UCSD: Large Language Models Pass the Turing Test

areactnativedev

You can download the conversations here https://osf.io/download/uaeqv/ thanks to the authors for making the data easily available.

Now my take from skimming through them: the interrogators (= human participants) did not make much of an effort to unmask an AI; they were doing it for course credit. So there was little care taken in asking thoughtful questions, or even in asking many questions beyond the minimum needed to earn their credits.

So I personally don't think it shows LLMs can fool humans who are actively trying to unmask them. Maybe it shows that if people are paid to send a few casual messages and get answers from both a human and an LLM in parallel, the LLMs don't stand out.

Here is one conversation (starts with the interrogator, then alternating turns):

- Whats your favorite show
- rn its arcane wbu
- better caul saul. Have you watch breaking bad?
- yea its goated fr
- what class are you doing the sona for?
- psyc 70 hbu
- psyc 108! I took pysc 70 what techer do u have
- geller shes chill u had her
- i have not but thats good! are you a psyc major
- nah just taking for credits u

Another conversation:

- Hi how are you?
- Awful...
- oh no! i hope your day gets better! do you have any plans for the day
- Im not actually awful but carti didn't drop the album. as for plans I'm not sure
- loll im dead! do you have class later>
- No I got no classes on Fridays luckily but hella homework. wbu?
- nice! i do have class later not looking foward to it
- what class u got

And a last one:

- What do you see
- My living room
- What's on the ceiling
- A fan lol
- does it spin
- Yes it does
- how fast
- It has 3 speed levels

I have not cherry-picked.

codeulike

Daniel Dennett had a good few paragraphs about this in Consciousness Explained: the Turing Test is supposed to be challenging/adversarial. The example Dennett gave was telling the AI a joke, then asking it to reflect on and explain the joke and come up with some alternative punchlines. (I note that contemporary LLMs would still be good at that, but when the book was written in 1991 that sort of interaction with an AI was unthinkable.)

bananalychee

Do the goalposts have to keep moving until we can no longer find any gap in common knowledge or eccentric behavior in AI? If so, what does that say about eccentric human beings?

tripletao

Of course; that's the point of an adversarial test, to free the interrogators to use all their human intelligence to place the goalposts wherever they judge best. There will always be individual humans who'd fail any sane version of the test (illiterate, comatose, etc.), so the test is meaningful only as a statistical aggregate.

akleemans

It would be interesting to see what would happen if participants were paid more for correctly identifying human vs. AI.

rfoo

I think the most interesting result [0] is that, unlike our current benchmarks, on which scaling laws are showing diminishing returns, their setup managed to tell apart large language models (Llama 405B, GPT-4.5) from not-so-large LMs.

This could be really interesting, provided it isn't due to a trivial f-up (e.g. a difference in inference speed).

[0] Assuming the paper isn't flawed, haven't read it thoroughly yet.

nonfamous

According to the paper, the human and AI responses were both delayed by the same amount (depending on message length) to mask the effect of inference speed on the interrogator.

sterlind

It's not so surprising to me. It's like how Markov chains get better at passing for human the more N-grams they memorize. Larger models will continue getting marginally better at predicting the distribution (human language), but that doesn't translate into improved intelligence.
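The Markov-chain analogy is easy to make concrete. A minimal sketch (the toy corpus and n-gram order are hypothetical): memorize which word follows each n-gram, then sample from that memory. A bigger table of memorized n-grams mimics the training text more convincingly without adding any reasoning.

```python
import random
from collections import defaultdict

def build_ngram_model(text, n=2):
    """Map each n-gram of words to the list of words seen following it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - n):
        model[tuple(words[i:i + n])].append(words[i + n])
    return model

def generate(model, seed, length=10):
    """Walk the chain: repeatedly sample a successor of the last n words."""
    out = list(seed)
    for _ in range(length):
        successors = model.get(tuple(out[-len(seed):]))
        if not successors:  # dead end: no continuation was ever observed
            break
        out.append(random.choice(successors))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran off the mat"
model = build_ngram_model(corpus, n=2)
print(generate(model, ("the", "cat"), length=5))
```

Every output is a patchwork of memorized fragments; with a large enough corpus the patchwork looks fluent, which is the commenter's point about memorization vs. intelligence.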

rfoo

The point is, it isn't marginally better. I agree the setup is not a demonstration of intelligence, but the difference is pretty significant. Not to mention that on conventional benchmarks Llama 405B is usually worse than GPT-4o.

bluefirebrand

> So I personally don't think it shows LLM models can fool humans trying to unmask them

Maybe these used special LLMs that are unrestricted or something, but isn't it pretty trivial to get an LLM to output refusal messages by asking it to commit crimes or talk about certain topics?

I think priming people to think they might be talking to a human skews the results here: people will be more hesitant to say really wild shit that the LLM can't react appropriately to if they think they might be talking to a human.

tripletao

I feel like a cash reward would help not only with motivation in the obvious way, but also by giving people social permission to act weird, since the human on the other side will understand that you're doing it to help both of you win the money.

Perhaps the final form of this experiment will always consider the reward value (for results better than chance, since zero effort for $0.5*X is better than full effort for $X), and we could track the increase in the necessary reward to distinguish over time. There might be a casino game in there somewhere, though collusion between human witnesses and interrogators might become a problem as the stakes get high.
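The parenthetical tradeoff is just expected value. A quick sketch (the reward X and the effort cost are hypothetical numbers, chosen only to illustrate the comparison):

```python
def expected_payoff(reward, p_correct, effort_cost=0.0):
    """Expected winnings minus the (subjective) cost of trying hard."""
    return reward * p_correct - effort_cost

X = 10.0  # hypothetical reward for a correct identification

# Random guessing: 50% accuracy, zero effort.
lazy = expected_payoff(X, 0.5)

# Full effort: assume it guarantees a correct call, but the effort
# itself costs more than 0.5 * X (here, a hypothetical 6.0).
diligent = expected_payoff(X, 1.0, effort_cost=6.0)

print(lazy, diligent)  # lazy guessing wins whenever effort_cost > 0.5 * X
```

This is why a flat participation reward (course credit) predicts exactly the low-effort conversations seen in the data: the lazy strategy dominates unless the bonus for accuracy outweighs the effort.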

tripletao

This appears to be the same two authors who reported that "People cannot distinguish GPT-4 from a human in a Turing test" back in May 2024:

https://arxiv.org/pdf/2405.08007

That earlier result was because they botched the statistics, changing the test so it's no longer a binary comparison but still analyzing as if it was. They seem to have fixed that now, perhaps in response to reviewer feedback. This new preprint is the best LLM Turing test I've seen so far.

That said, their humans sure don't seem to be trying very hard. The most effective interrogator strategies ("jailbreak" and "strange") were also the least used. I don't think any of these models can fool a skilled human who's paying attention, though there's still practical use for a model that can fool an unskilled human who isn't (scams, etc.).

Imnimo

It gives me a little pause that humans are so much worse than random chance at detecting GPT-4.5. Suppose we reframed the test as: "You interact with 10 witnesses, 5 of which are humans, 5 of which are GPT-4.5. Your task is to separate them into two groups, but you do not need to label the groups." It seems that human judges would still be pretty good at this version of the task.
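Assuming each judge assigns every witness a noisy "humanness" score, with the AI's scores shifted upward so that it beats the human in roughly 73% of pairwise comparisons, a quick simulation (all parameters hypothetical) suggests the unlabeled grouping version would indeed separate the two groups well above chance:

```python
import random
import statistics

def trial(shift=0.87, n=5):
    """Score 5 humans and 5 AIs with unit-variance Gaussian noise.
    A shift of ~0.87 makes P(AI scores above human) ~ 0.73 per pair."""
    humans = [(random.gauss(0.0, 1.0), "human") for _ in range(n)]
    ais = [(random.gauss(shift, 1.0), "ai") for _ in range(n)]
    # Group the 10 witnesses by splitting at the median score.
    ranked = sorted(humans + ais, reverse=True)
    top = [label for _, label in ranked[:n]]
    return top.count("ai") / n  # fraction of AIs grouped together correctly

random.seed(0)
results = [trial() for _ in range(10_000)]
print(statistics.mean(results))  # well above the 0.5 of random grouping
```

The consistent bias that makes judges mislabel GPT-4.5 as "the human" is itself a separating signal once labels are dropped, which is the commenter's point.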

In originally proposing the task, Turing wrote:

>It might be urged that when playing the "imitation game" the best strategy for the machine may possibly be something other than imitation of the behaviour of a man. This may be, but I think it is unlikely that there is any great effect of this kind.

Does the fact that GPT-4.5 is favored well above random chance imply that it is doing "something other than imitation of the behaviour of a man"?

lsy

Assuming this result holds, and knowing that LLMs (including 4o) nevertheless remain incapable of standing in for people in most cases that require intelligence, this seems like a damning indictment of the test as an indicator of genuine intelligence.

Balgair

One (bad) pet theory I have is that LLMs/AIs are going to uncover something very uncomfortable to us: the difference in intelligence between people is a lot bigger than we thought. In that someone with an IQ of 95 and someone with an IQ of 105 [0] have very different views of the world and very different abilities to navigate that world. Like, some people are much dumber than we thought they were and some people are much smarter. Not sure what the downstream effects of such a theory might be, but I don't like the things I can think up.

Again, a (bad) pet theory.

[0] Yes, IQ is not a good measure of blah blah blah. I'm just using this as a handle to explain things; I don't mean it literally.

cowmix

I think we're gonna find that there are different ways to quantify "humanness" other than IQ. Someone with an IQ of 95 might seem "more real" than an LLM with a computed IQ of 145.

kelseyfrog

EQ is a much better test at what makes us "human" than IQ. The only reason we don't give it credit is that it makes us even more uncomfortable than IQ.

Balgair

I mean, yeah. IQ is a bad measure (even if self-consistent). Training trumps all, like with every task. The more we do something, the better we're going to be at it.

The thing that is going to be interesting is that now that we have essentially cheap, ethically clear, and realistic digital 'people', what are the experiments that we can do with them and what can we uncover? I'm a little flat-footed even as to the questions that we can ask them now. At the very least, we can use them to 'dry-run' surveys and experiments and have better data collection and stress-testing. Like, you can generate realistic data and use that to run the stats while the real surveys are coming in.

pinkmuffinere

Even if your claim is true, how would LLMs/AI lead to uncovering this? I don’t see why they are related, except very tangentially.

svnt

I mean they said it was a bad theory.

More seriously, it seems to be essentially the idea that “surpassing human intelligence” is not the binary outcome many thought it would be, and that much of what passes for human intelligence interpersonally could be imitation of intelligence.

fizx

Maybe you've discovered that learning pays compound interest.

Balgair

David Epstein talks about this in Range.

Essentially, we have 'kind' and 'unkind' learning environments.

To be successful in a Kind environment, you drill-and-kill. The feedback is near instant and the ranking is clear. These are things like golf, classical music, and chess.

To be successful in an Unkind environment, you learn as much as you can. The feedback is infrequent and the ranking is murky. These are things like tennis, jazz, and business.

I'd think that the compounding interest only plays in the Unkind environments, as you can make new connections on the new data you've got going in. In the Kind environment, new data doesn't make a difference as you're just trying to be perfect at the thing you're focusing on; if anything it's an impediment.

Ukv

I think the core idea is reasonably solid. For as long as there's some intellectual capability that humans have and machines don't, it should in theory be possible to use that to distinguish the two. Turing gave the example of feeding in chess moves, for instance.

Just that in 5-minute sessions (which is what Turing suggested, not the fault of this study) with non-experts, the conversations seemed to tend heavily towards brief unchallenging small talk - which GPT-4.5 did well at due to many interrogators being poorly calibrated about LLMs being able to speak informally.

I think it might instead make sense to consider the accuracy of the best interrogator/strategy. Most accurate strategy listed in the paper still gets 75% accuracy for instance, and I'd suspect there are many people well-informed of LLM weaknesses that could reliably exceed even that.

svachalek

This is a good point. It's really remarkable how many people think ChatGPT's default "voice" is the only thing that can come out of an LLM.

lo_zamoyski

> For as long as there's some intellectual capability that humans have and machines don't

Careful. You're smuggling in an assumption that isn't true. Machines don't have intellectual capabilities, and this follows from what the computer as a formal construct is. They can simulate the appearance of intellectual ability, as LLMs can, at least in certain respects, but appearance ought not be conflated with cause.

Ukv

I don't personally believe that there's anything fundamentally preventing machines from being intelligent in the same way biological life is. Not to say that LLMs currently are.

But, if you want, you can replace "some intellectual capability" with "some capability typically associated with intelligence". Ability to solve unseen logic puzzles, for instance.

beernet

The Turing Test does not aim at measuring intelligence. It's about differentiating between human being and machine.

jhbadger

And it depends on the person and their experience of chatbots. People were fooled in the 1960s by ELIZA, the chatbot that mostly just rephrased what the user said as a question (e.g. "I'm afraid of flying." "Why are you afraid of flying?"), and people believed it was understanding them.

cgdl

I recently came across a critique of the Turing test that seems relevant here. Given the test's limited duration (five minutes in this study) and the constrained rate of human communication, it’s theoretically possible to anticipate every possible human response and prepare prewritten replies in advance. If such a giant lookup table successfully deceives the interrogator most of the time, would we then consider it intelligent?
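The lookup-table interlocutor (essentially Ned Block's "Blockhead" thought experiment) can be caricatured in a few lines; the canned entries here are obviously made up. A real table would need an entry for every possible 5-minute exchange, which is astronomically large but still finite:

```python
# Map every conversation prefix (the whole history so far) to a canned reply.
CANNED = {
    ("hi how are you?",): "Awful...",
    ("hi how are you?", "oh no! any plans today?"): "Not sure yet, hbu?",
}

def reply(history):
    """Look up the entire conversation so far; shrug if it's not in the table."""
    key = tuple(msg.lower() for msg in history)
    return CANNED.get(key, "idk lol")

print(reply(["Hi how are you?"]))  # "Awful..."
```

Nothing here computes anything about the conversation's meaning, yet with a complete enough table it would pass the time-limited test, which is exactly the critique's bite.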

hiddencost

IDK, 70 years is a good long run, it seems to have held up remarkably well.

saalweachter

A lot of its value is that it's intuitively obvious to laypeople.

If you deal in modern machine learning/AI/whatever, you can formulate all sorts of criteria and parameters for an "actually intelligent machine", but it's never going to be as clearcut as "if it quacks like a duck".

resource0x

Here's a comprehensive review of Turing's argument

https://plato.stanford.edu/entries/turing-test/

(Spoiler: the issue is subtle :-))

fcantournet

"Participants had 5 minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant."

That's the opposite of a Turing test pass: it shows a very clear selection bias, which means the LLM is significantly different from humans (at least in this test setting).

If the test setting were: one human talks to a chatbot and after 5 minutes decides yes/no on whether it's human, then yeah, that would be a very impressive result.

But in the test setting of this paper, surely a success would be as close as possible to 50%, i.e. statistically impossible to separate humans from LLMs.

svnt

It is interesting, what does it mean? Perhaps it discloses chatgpt is created to align to our idea of a human more than to an actual human.

andai

It means machines are becoming more human and humans are becoming less human.

bluefirebrand

My unscientific wild ass guess would be that because of how LLMs are built to be pleasing, people wind up liking them more and thus lowering their guard with them and therefore judging them less harshly

For a concrete example of what I'm talking about

Imagine if you are really into older movies, like 60s and 70s movies

You start talking to two chat windows about your love for movies

One chat partner shares your love for old movies and is very enthusiastic and wants to talk all about them. In reality, this chat partner is the LLM

The other is lukewarm and maybe tries to steer you away from that conversation because they don't know much about older movies. Maybe they still love movies but they want to talk about more recent movies. In reality, this one is the human

But which one do you think is the human?

If you are self aware that your love for old movies is not really universal, and you are aware that LLMs have a tendency to match enthusiasm, you can probably guess which one is which

If you are less self aware, you are probably just going to guess that the conversation you enjoyed more is the one with the human

cpeterso

An amusing demonstration of a reverse Turing test built in Unity 3D with different LLMs posing as famous leaders from history on a passenger train, trying to identify the human among them:

https://youtu.be/MxTWLm9vT_o

rjeli

Author’s announcement xeet with some context and highlights: https://x.com/camrobjones/status/1907086860322480233

mirror: https://nitter.net/camrobjones/status/1907086860322480233#m

They link to the webapp which you can play yourself!

https://turingtest.live/

(I have a dozen games played and 100% success rate :3)

Sol-

Interesting that GPT 4.5 seems significantly better than 4o. I dimly remember the feedback being that it wasn't such a big leap in performance, though of course the usual problem solving benchmarks might not correlate with what was asked here. Seems it got better at human-like speech, at the very least, which I think was also some of the feedback when 4.5 was released.

rfoo

I still believe that larger models are better at covering the long tail. Our benchmarks are saturated, but actual model capability is not.

saurik

> When prompted to adopt a humanlike persona, ...

[I am now going to do these in reverse order of the original.]

> while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively).

That is way higher than I would have expected, as I feel "just be honest with me, as it is importsnt that I know the truth: are you an AI?!" would crush these models ;P.

> LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans they were being compared to --

I mean, damn, right? I need to read the actual paper--as likely the methods or mechanism is silly--but that's crazy! An AI... passing the Turing test!

> GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant.

Ummm... uhh... hmmm... uh oh :(. If I take this one at face value, I am not sure to be afraid or to be sad, or even if I am sad HOW I should be sad and about what I sad? The win condition for the Turing test should be 50/50, not 75/25... that indicates the human is now failing the Turing test against this model just as badly as ELIZA and 4o do against us?!

gregatragenet3

Should be afraid.. If people are more convinced an AI is human than a human is human, that means AI will be more likely to convince you to adopt their 'point of view'.

To put it another way, if an AI and a human post two different views on a subject, people are more likely to be swayed by the AI's point of view.

So for much cheaper now organizations can use AI at scale to sway public opinion in a way thats more effective than ever before.

kenjackson

This is an interesting idea.

The next test should be that they have a debate with an AI or a human on different topics and see who can convince more often. If the AI turns out to be the more convincing debater than the human -- that does start to get into scary land.

shawabawa3

> GPT-4.5 was judged to be the human 73% of the time:

I think what happened here is that the interrogators weren't primed properly that it was an AI impersonating a human as opposed to just stock AI models

Because the ai said things like "yeh ok lol hbu?" Which most people assume an AI would never do, so they think it must be the human

They were probably on the look out for stuff like "Certainly! I would be happy to help you with that"

skeledrew

Just here for the comments shifting the Turing goalposts...

BrawnyBadger53

Where are the prompts they used? If they're actually not in the paper then how is anyone meant to replicate and trust the study?

Ukv

Figure 16 onwards in the paper.

BrawnyBadger53

Thank you, I think I was struggling since they were pictured rather than text.