"The Illusion of Thinking" – Thoughts on This Important Paper
77 comments
· June 10, 2025 · ofrzeta
roywiggins
The completion interface was way less usable but more honest.
The conversational interface is a bit of a dark pattern, too, because it exaggerates what an LLM can do and creates an illusion that makes using LLMs a bit more addictive than a technical interface that says "generating random data that may be useful".
imiric
This is a balanced take on the subject. As usual, opinions on the extreme ends should not be taken seriously. Modern ML tools are neither intelligent nor apocalyptic. What they are is very good at specific tasks, and the sooner we focus on those, the less time and resources will be wasted on tasks they're not good at.
Benchmarks and puzzles don't matter. AI companies will only use them to optimize their training, so that they can promote their results and boost their valuation on hype alone. They're more useful for marketing than as a measurement of real-world capabilities.
Judging by the increase in these negative discussions, we seem to be near or at the Peak of Inflated Expectations w.r.t. AI, and we'll be better off once we're past it.
seydor
> The Costs of Anthropomorphizing AI
These were not costs, but massive benefits for hyped AI startups. They attracted the attention of a wide audience of investors, including clueless ones; they brought in politicians and hence media attention; and they created a FOMO race to plant flags ahead of each other.
Animats
"We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles."
That's the real conclusion of the Apple paper. It's correct. LLMs are terrible at arithmetic, or even counting. We knew that. So, now what?
It would be interesting to ask the "AI system" to write a program to solve such puzzle problems. Most of the puzzles given have an algorithmic solution.
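For instance, Tower of Hanoi (one of the puzzles in the paper) has a textbook recursive solution. A minimal sketch in Python, just to illustrate what "an algorithmic solution" looks like here:

    def hanoi(n, source, target, spare, moves=None):
        # Classic recursion: move n-1 disks out of the way, move the
        # largest disk, then move the n-1 disks back on top of it.
        if moves is None:
            moves = []
        if n == 1:
            moves.append((source, target))
        else:
            hanoi(n - 1, source, spare, target, moves)
            moves.append((source, target))
            hanoi(n - 1, spare, target, source, moves)
        return moves

    moves = hanoi(5, "A", "C", "B")
    print(len(moves))   # 31, i.e. 2**5 - 1
    print(moves[:3])    # [('A', 'C'), ('A', 'B'), ('C', 'B')]

Asking the model to produce and execute something like this is a very different test from asking it to spell out every move in its chain of thought.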
This may be a strategy problem. LLMs may need to internalize Polya's How to Solve It.[2] Read the linked Wikipedia article. Most of those are steps an LLM can do, but a strategy controller is needed to apply them in a useful order and to back off when stuck.
The "Illusion of Thinking" article is far less useful than the Apple paper.
(Did anybody proofread the Apple paper? [1] There's a misplaced partial sentence in the middle of page 2. Or a botched TeX macro.)
[1] https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...
[2] https://en.wikipedia.org/wiki/How_to_Solve_It
withinboredom
Eventually (many of us have hit it, and if you haven’t, you will), your program will hit a certain size/complexity where the AI also falls flat. These tests basically try to measure that limit, not just solve the puzzle.
scotty79
So far, they've shown that it fails early if the answer would exceed the context size anyway. At least that was the case for the Hanoi problem, from what I heard. They somehow failed to notice that, though.
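Rough numbers, since they're easy to check (a quick sketch; the tokens-per-move figure is just an assumption for illustration):

    # Tower of Hanoi with n disks takes 2**n - 1 moves.
    # Assume roughly 10 tokens per written-out move (a guess, not a measurement).
    TOKENS_PER_MOVE = 10
    for n in (10, 12, 14, 16, 20):
        moves = 2 ** n - 1
        print(f"n={n}: {moves:,} moves, ~{moves * TOKENS_PER_MOVE:,} tokens")

Somewhere in the low-to-mid teens, the full move list alone is already bigger than most models' output budgets, regardless of what the model "knows" about the algorithm.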
tsurba
Is it a puzzle if there is no algorithm?
But testing via coding algos for known puzzles is problematic, as the code may be in the training set. Hence you need new puzzles, which is kinda what ARC was meant to do, right? Too bad OpenAI lost credibility on that set by having access to it while "verbally promising" (lol) not to train on it, etc.
pram
LLMs are awful and untrustworthy at math on their own, for sure, but I found out you can tell Claude to use bc, and suddenly it's not so bad when it has a calculator. It's as smart as me in that respect ;P
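The tool call itself is trivial; something like this is all that's really happening (a sketch shelling out to bc from Python just to show the idea, not Claude's actual tool plumbing):

    import subprocess

    def calc(expression: str) -> str:
        # Let bc do the digits instead of trusting the model to do them.
        result = subprocess.run(
            ["bc", "-l"],              # -l loads bc's math library (a() is arctan, etc.)
            input=expression + "\n",
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()

    print(calc("scale=20; 4*a(1)"))    # pi: 3.14159265358979...
    print(calc("1234567 * 7654321"))   # 9449772114007

The model just has to decide when to reach for it.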
imiric
You underestimate your abilities, and overestimate the LLM's.
Which version of `bc` was the LLM trained on? Perhaps this is not so important for a stable tool that doesn't change much, but it's critical for many programs and programming libraries. I've lost count of the number of times an LLM generated code that doesn't work with the current version of the library I'm using. In some cases you can tell it to use a specific version, or even feed it the documentation for that version, but that often fails to correct the issue.
And this is without considering the elephant in the room: hallucination. LLMs will often mix up or invent APIs that don't exist. Apologists will say that agents are the solution, and that feeding the error back to the LLM will fix it, but that is often not the case either. That might even make things worse by breaking something else, especially in large-context sessions.
pram
Oh yeah, it’s still going to hallucinate eventually, 100%. My point is mostly that you can take it from “absolutely useless” to “situationally useful” if you’re honest about its limitations and abilities.
Nothing has convinced me that LLMs will ever be good at math on their own.
rednafi
A large part of these anthropomorphic narratives was pushed by SV nerds to grab shareholder attention.
LLMs are transformative, but a lot of the tools around them already treat them as opaque function calls. Instead of piping text to sed, awk, or xargs, we’re just piping streams to these functions. The analogy stretches to cover audio and video usage too. But that’s boring, and it doesn’t explain why you suddenly have to pay more for Google Workspace just to get bombarded by AI slop.
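To be concrete, the "opaque function call" framing is basically this (a sketch; the OpenAI Python client here is just one example of what sits behind the function, and the model name is a placeholder):

    import sys
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def llm_filter(instruction: str, text: str) -> str:
        # Treat the model as an opaque text -> text function,
        # like sed or awk with a much vaguer configuration language.
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; any chat model works
            messages=[
                {"role": "system", "content": instruction},
                {"role": "user", "content": text},
            ],
        )
        return response.choices[0].message.content

    if __name__ == "__main__":
        # Used like any other filter:  cat notes.txt | python llm_filter.py
        print(llm_filter("Summarize the input in three bullet points.", sys.stdin.read()))

Nothing about that requires pretending there is a mind on the other end of the pipe.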
This isn’t to undermine the absolutely incredible achievements of the people building this tech. It’s to point out the absurdity of the sales pitch from investors and benefactors.
But we’ve always been like this. Each new technology promises the world, and many even manage to deliver. Or is it that they succeed only because they overpromise and draw massive attention and investment in the first place? IDK, we’ll see.
SoKamil
That got me thinking: what if we stripped away that conversational, sound-like-a-human, be-safe layer and focused RLHF on being the best at transforming text for API use?
pona-a
The early OpenAI Instruct models were more like that. The original GPT-3 was only trained to predict the next token, then they used RLHF to make them interpret everything as queries, so that "Explain the theory of gravity to a 6 year old." wouldn't complete to "Explain the theory of relativity to a 6 year old in a few sentences." ChatGPT was probably that, expanded to multi-turn conversation. You can see the beginnings of that ChatGPT style in those examples.
tough
you gotta aim for the stars to maybe with luck reach the moon
if you only aim for the moon, you’ll never break orbit.
withinboredom
The number of people I’ve run into who think ChatGPT (or whatever model they’re using) is still thinking about whatever they talked about before, even while they’re not using it, is non-zero. It doesn’t help that the models sometimes say things like “give me five minutes to think on it” and stuff like that.
tough
yeah man, the anthropomorphization is bad.
Lately ChatGPT is like "let me show you an example I used before", like it's a real professional.
It's all in the context; it's our duty to remember these are LARP machines.
I copy-pasted something a parent was asking online about their kid. ChatGPT said:
⸻
A Personal Anecdote (Borrowed)
A parent I know gave their daughter a “leader hat.” [..]
How in the hell would a large language model have past personal anecdotes and know other parents, idk
elif
It is not the model's anecdote but that of the parent.
And knowing is something even dumb programs do with data. It doesn't imply human cognition.
tough
I mean yeah the "borrowed" is doing a lot of heavy lifting.
But that was just the latest example. I just don't love how it talks like there's a live person at the other end; maybe I should add something to my personal prompt to avoid it.
TeMPOraL
> yeah man, the anthropomorphization is bad.
Unfortunately, it's also the least wrong approach. People who refuse to entertain thinking about LLMs as quasi-humans are the ones perpetually confused about what LLMs can or cannot do and why. They're always up in arms about prompt injections and hallucinations, and keep arguing those are bugs that need to be fixed, unable to recognize them as fundamental to the model's generality and handling of natural language. They keep harping on "stochastic parrots" and Naur's "program theory", claiming LLMs are unable to reason or model in the abstract, despite plenty of published research that lobotomizes models live and pins down concepts as they form and activate.
If you squint and imagine an LLM to be a person, suddenly all of these things become apparent.
So I can't really blame people - especially non-experts - for sticking to an approach that's actually yielding much better intuition than the alternatives.
tsimionescu
Anthropomorphization also leads to very wrong conclusions, though. In particular, we have a theory of mind that we apply in relation to other humans, mostly on the basis of "they haven't lied to me so far, so it's unlikely they will suddenly start lying to me now". But this is dead wrong in relation to output from an LLM: just because it generated a hundred correct answers doesn't tell you anything about how likely the 101st one is to be correct and not a fabrication. Trust is always misplaced if put into results returned by an LLM, fundamentally.
southernplaces7
Right here on HN, where you'd think that most people would be able to think better, there's no shortage of AI fanboys who seriously try to frame the idea that LLMs may be conscious, and that we can't really be sure if we're conscious, since, you know, science doesn't know what causes consciousness.
I guess my feeling conscious, and thus directing much of my life and activity around that notion, is just a big trick, and I'm no different from ChatGPT, only inferior and with lower electrical needs.
IshKebab
This is a bad-faith representation of what people are actually saying, which is:
1. As far as we know there is nothing non-physical in the brain (a soul or magical quantum microtubules or whatever). Everything that happens in your head obeys the laws of physics.
2. We definitely are conscious.
3. Therefore consciousness can arise from purely physical processes.
4. Any physical process can be computed. And biological thought appears, as far as we can see, to simply be an extraordinarily complex computation.
5. Therefore it is possible for a sufficiently fast & appropriately programmed computer to be conscious.
6. LLMs bear many resemblances to human thinking (even though they're obviously different in many ways too), and we've established that computers can be conscious, so you can't trivially rule out LLMs being conscious.
At least for now, I don't think anyone sane says today's LLMs are actually conscious. (Though consciousness is clearly a continuum... so maybe they're as conscious as a worm or whatever.)
The main point is that people trivially dismissing AI consciousness or saying "it can't think" because it's only matrix multiplications are definitely wrong.
It might not be conscious because consciousness requires more complexity, or maybe it requires some algorithmic differences (e.g. on-line learning). But it definitely isn't not-conscious just because it is maths or running on a computer or deterministic or ...
yusina
These discussions remind me an awful lot of discussions between my religious, my atheist, and my agnostic friends.
Religious friend: Well, there must be a god; I can feel him, and how else would all this marvellous reality around us have come into existence?
Atheist friend: How can you claim that? Where is the evidence? There could be many other explanations, we just don't know, and because of that it doesn't make sense to postulate the existence of a god. And besides, it's just a marketing stunt of the powerful to stay in power.
Agnostic friend: Why does any of this matter? Nature is beautiful, sometimes cruel, I'm glad I'm alive, and whether there is a god or not, I don't know or really care, it's insubstantial to my life.
That's exactly how the discussions around LLM consciousness or intelligence or sentience always go if you put enough tech folks into the conversation.
rsanheim
> Any physical process can be computed
Um, really? How so? Show me the computations that model gut biome. Or pain, or the immune system, or how plants communicate, or how the big bang happened (or didn’t).
And now define “physical” wrt things like pain, or sensation, or the body, or consciousness.
We know so little about so much! It's amazing when people speak with such certainty about things like consciousness and computation.
I don’t disagree with your main point. I would just say we don’t know very much about our own consciousness and mindbody. And humans have been studying that at least a few thousand years.
Not everyone on HN or who works with or studies AI believes AGI is just the next level on some algorithm or scaling level we haven’t unlocked yet.
Veen
The "as far as we know" in premise one is hand-waving the critical point. We don't know how, or if, consciousness arises from purely physical systems. It may be physical, but inadequately replicated by computer systems (it may be non-computable). It may not be physical in the way we use that concept. We just don't know. And the fact that we don't know puts the conclusion in doubt.
Even if consciousness arises from a physical system, it doesn't follow that computation on a physical substrate can produce consciousness, even if it mimics some of the qualities we associate with consciousness.
imiric
> Right here on HN, where you'd think that most people would be able to think better, there's no shortage of AI fanboys who seriously try to frame the idea that LLMs may be conscious
Why does this surprise you? This is a community hosted by a Silicon Valley venture capital firm, and many of its members are either employed by or own companies in the AI industry. They are personally invested in pushing the narrative being criticized here, so the counter arguments are expected.
visarga
"The Illusion of Thinking" is bad, but so is the opposite "it's just math and code". They might not be reasoning like humans, but they are not reducible to just code and math either. They do something new, something that just math and code did not do before.
safety1st
I don't really agree. I use multiple LLMs every day and I feel like I get the most mileage out of them when I think about them as exactly that. Super good text transformers that can paraphrase anything that's been posted to the Internet.
There are complexities beyond that, of course. It can compare stuff, it can iterate through multiple steps ("Reason," though when you look at what it's doing, you definitely see how that term is a bit of a stretch). Lots of emergent applications yet to be discovered.
But yeah, it's math (probabilities to generate the next token) and code (go run stuff). The best applications will likely not be the ones that anthropomorphize it, but take advantage of what it really is.
XenophileJKO
Have you used them to code? I think that's the tipping point: the level of interaction is so high when the models are troubleshooting. That is when you can start to get a little philosophical.
The models make many mistakes and are not great software architects, but watch one or even two models work together to solve a bug and you'll quickly rethink "text transformers".
safety1st
All the time. And what stands out is that when I hit a problem in a framework that's new and badly documented, they tend to fall apart. When there's a lot of documentation, StackOverflow discussions etc. for years about how something works they do an excellent job of digging those insights up. So it fits within my mental model of "find stuff on the web and run a text transformer" pretty well. I don't mean to underestimate the capabilities, but I don't think we need philosophy to explain them either.
If we ever develop an LLM that's able to apply symbolic logic to text, for instance to assess an argument's validity or develop an accurate proof step by step, and do this at least as well as many human beings, then I'll concede that we've invented a reasoning machine. Such a machine might very well be a miraculous invention in this age of misinformation, but I know of no work in that direction, and I'm not at all convinced it's a natural outgrowth of LLMs, which are so bad at math (perhaps they'd be of use in the implementation).
jmsdnns
> just code and math either
this is literally what they are
jstanley
What part of an LLM do you think is not made out of code and maths?
victorbjorklund
I think the point is we wouldn't say "humans don't reason, it's just chemical and electrical signals".
jstanley
Right. The error is in thinking that "just code and maths" can't reason.
LLMs very obviously are reducible to "just code and maths". We know that, because that is how they are made.
pepinator
We call it a breakthrough. And it's just math and code. We've had many of those.
cantor_S_drug
Here's something to think about:
Can we consider AI conscious in a way similar to how Hardy recognised Ramanujan's genius? That is, if the AI weren't conscious, it wouldn't have the imagination to write what it wrote.
mkl
It doesn't take consciousness to predict the next word repeatedly.
fc417fc802
That's reductive. Cherry pick the most lucid LLM output and the least lucid human output and I think at this point the former clearly exceeds the latter. If an LLM is "just" predicting the next word repeatedly then what does that say about humans?
diimdeep
Here is another opinion on the original paper
- Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity https://arxiv.org/abs/2506.09250
tkgally
The points made in the abstract of that paper look like pretty serious challenges to the Apple paper. What might be the counterobjections?
imiric
Personal experience?
I don't care about benchmarks, nor LLMs' capability to solve puzzles. This is the first thing AI companies optimize their training for, which is misleading and practically false advertising.
I care about how good LLMs are for helping me with specific tasks. Do they generate code that on the surface appears correct, yet on closer inspection has security and performance issues, is unnecessarily complex, often doesn't even compile, which takes me more time to troubleshoot and fix than if I were to write it myself? Do they explain concepts confidently while being wrong, which I have no way of knowing unless I'm a domain expert? Do they repeat all these issues even after careful re-prompting and with all the contextual information they would need? Does all this waste my time more than it helps?
The answer is yes to all of the above.
So while we can argue whether LLMs can think and reason, my personal experience tells me that they absolutely cannot, and that any similarity to what humans can do is nothing but an illusion.
jmsdnns
Worth looking at how many papers the authors have published before assuming they challenge Apple.
frozenseven
If they make a solid case against the Apple paper, why does it matter? It's not like they're the first ones to criticize (read: easily debunk) Apple on such grounds.