Seven replies to the viral Apple reasoning paper and why they fall short

thomasahle

> 1. Humans have trouble with complex problems and memory demands. True! But incomplete. We have every right to expect machines to do things we can’t. [...] If we want to get to AGI, we will have to do better.

I don't get this argument. The paper is about "whether RLLMs can think". If we grant "humans make these mistakes too", but also "we still require this ability in our definition of thinking", aren't we saying "thinking in humans is an illusion" too?

whatagreatboy

the real ability of intelligence is to correct mistakes in a gradual and consistent way.

FINDarkside

Agreed. But his point about AGI is also incorrect. AI that performs at the level of an average human in every task is AGI by definition.

pzo

Why does AGI need to be even as good as an average human? Someone with an 80 IQ is still smart enough to reason and do plenty of menial tasks. Also, I'm not sure why AGI needs to be as good in every task. An average human will excel over others at a few tasks and be terrible at many others.

Someone

Because that’s how AGI is defined. https://en.wikipedia.org/wiki/Artificial_general_intelligenc...: “Artificial general intelligence (AGI)—sometimes called human‑level intelligence AI—is a type of artificial intelligence that would match or surpass human capabilities across virtually all cognitive tasks”

But yes, you’re right that software needs not be AGI to be useful. Artificial narrow intelligence or weak AI (https://en.wikipedia.org/wiki/Weak_artificial_intelligence) can be extremely useful, even something as narrow as a service that transcribes speech and can’t do anything else.

simonw

That very much depends on which AGI definition you are using. I imagine there are a dozen or so variants out there. See also "AI" and "agents" and (apparently) "vibe coding" and pretty much every other piece of jargon in this field.

FINDarkside

I think it's a very widely accepted definition, and there are really no competing definitions either, as far as I know. While some people might think AGI means superintelligence, it's only because they've heard the term but never bothered to look up what it means.

usef-

Yes. I wonder if he was thinking of ASI, not AGI

adastra22

Most people are. One of my pet peeves is that people falsely equate AGI with ASI, constantly. We have had full AGI for years now. It is a powerful tool, but not what people tend to think of as god-like “AGI.”

math_dandy

I was hoping the accepted definition would not use humans as a baseline, rather that humans would be an (the) example of AGI.

bastawhiz

The A in AGI is "artificial" which sort of precludes humans from being AGI (unless you have a very unconventional belief about the origin of humans).

Since there aren't really a whole lot of unique examples of general intelligence out there, humans become a pretty straightforward point of comparison.

mathgradthrow

The average human is good at something and sucks at almost everything else. Top human performance at chess and average performance at chess differ by 7 orders of magnitude.

datadrivenangel

Your standard model of human needs a little bit of fine tuning for most games.

jltsiren

AGI should perform on the level of an experienced professional in every task. The average human is useless for pretty much everything but capable of learning to perform almost any task, given enough motivation and effort.

Or perhaps AGI should be able to reach the level of an experienced professional in any task. Maybe a single system can't be good at everything, if there are inherent trade-offs in learning to perform different tasks well.

godelski

For comparison, the average person can't print Hello World in python. Your average programmer (probably) can.

It's surprisingly easy to be above average at most tasks, which people often confuse with having expertise. It's probably pretty easy to get into the 80th percentile of most subjects. That won't put you in the 80th percentile of people who actually do the thing, since most people don't do it at all. I'd wager the 80th percentile is still amateur.

MoonGhost

> The average human is useless for pretty much everything but capable of learning to perform almost any task

But only for a limited number of tasks per human.

> Or perhaps AGI should be able to reach the level of an experienced professional in any task.

Even if it performs only slightly better than an untrained human, doing so on any task would already be superhuman, since no single human can do that.

autobodie

Agree. Both sides of the argument are unsatisfying. They seem like quantitative answers to a qualitative question.

serbuvlad

"Have we created machines that can do something qualitatevely similar to that part of us that can correlate known information and pattern recognition to produce new ideas and solutions to problems -- that part we call thinking?"

I think the answer to this question is certainly "Yes". I think the reason people deny this is because it was just laughably easy in retrospect.

In mid-2022 people were like, "Wow, this GPT-3 thing generates kind of coherent greentexts."

Since then, all we really got was: larger models, larger models, search, agents, larger models, chain-of-thought, and larger models.

And from a novelty toy we got a set of tools that at the very least massively increase human productivity in a wide range of tasks and certainly pass any Turing test.

Attention really was all you needed.

But of course, if you ask a buddhist monk, he'll tell you we are attention machines, not computation machines.

He'll also tell you, should you listen, that we have a monkey in our mind that is constantly producing new thoughts. This monkey is not who we are; it's an organ. Its thoughts are not our thoughts. It's something we perceive, and something we shouldn't identify with.

Now we have thought-generating monkeys with jet engines and adrenaline shots.

This can be good. Thought-generating monkeys put us on the moon and wrote Hamlet and the Odyssey.

The key is to not become a slave to them. To realize that our worth consists not in our ability to think. And that we are more than that.

autobodie

> The key is to not become a slave to them. To realize that our worth consists not in our ability to think. And that we are more than that.

I cannot afford to consider whether you are right because I am a slave to capital, and therefore may as well be a slave to capital's LLMs. The same goes for you.

viccis

> I think the answer to this question is certainly "Yes".

It is unequivocally "No". A good joint distribution estimator is always by definition a posteriori and completely incapable of synthetic a priori thought.

jes5199

I think the Apple paper is practically a hack job - the problem was set up in such a way that the reasoning models must do all of their reasoning before outputting any of their results. Imagine a human trying to solve something this way: you’d have to either memorize the entire answer before speaking or come up with a simple pattern you could do while reciting that takes significantly less brainpower - and past a certain size/complexity, it would be impossible.

And this isn’t how LLMs are used in practice! Actual agents do a thinking/reasoning cycle after each tool-use call. And I guarantee even these 6-month-old models could do significantly better if a researcher followed best practices.
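For concreteness, here is a minimal sketch (mine, not from the paper or any vendor) of the interleaved loop described above; call_model and run_tool are hypothetical stand-ins supplied by the caller, not any specific API:

    def agent_loop(call_model, run_tool, task, max_steps=20):
        # Interleave model reasoning with tool calls instead of forcing the
        # model to emit an entire solution in a single pass.
        # Assumes call_model returns a dict like {"content": str, "tool_call": ... or None}.
        history = [("user", task)]
        for _ in range(max_steps):
            reply = call_model(history)            # model reasons over the transcript so far
            if reply.get("tool_call") is None:
                return reply["content"]            # model decided it has a final answer
            result = run_tool(reply["tool_call"])  # e.g. run code, apply a move, fetch data
            history.append(("assistant", reply["content"]))
            history.append(("tool", str(result)))  # tool output feeds the next reasoning step
        return None                                # give up after max_steps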

xtracto

I think the paper got unwanted attention... for a scientific paper. It's like that old paper about the "gravity shielding" Podkletnov rings experiment that got publicized by a UK newspaper as "scientists find antigravity" and ended up destroying the Russian author's career.

By the way, it seems Apple researchers got inspired by this [1] older Chinese paper for their title. The Chinese authors made a very similar argument, without the experiments. I myself believe Apple's experiments are just good curiosities, but don't make as strong a point as they believe.

[1] https://arxiv.org/abs/2506.02878

Brystephor

Forcing reasoning is analogous to requiring a student to show their work when solving a problem, if I'm understanding the paper correctly.

> you’d have to either memorize the entire answer before speaking or come up with a simple pattern you could do while reciting that takes significantly less brainpower

This part I don't understand. Why would coming up with an algorithm (e.g. a simple pattern) and reciting it be impossible? The paper doesn't mention the models coming up with the algorithm at all, AFAIK. If the model was able to come up with the pattern required to solve the puzzles and then also execute (e.g. recite) the pattern, that'd show understanding. However, the models didn't. So if the model can answer the same question for small inputs, but not for big inputs, doesn't that imply the model is not finding a pattern for solving the problem but is more likely pulling from memory? Like, if the model could tell you Fibonacci numbers when n=5 but not when n=10, that'd imply the numbers are memorized and the pattern for generating them is not understood.
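To make the Fibonacci analogy concrete, a minimal sketch (mine, not from the paper): knowing the pattern means you can run the recurrence for any n, while memorization is closer to a lookup table that simply runs out:

    def fib_by_pattern(n):
        # Applies the recurrence F(n) = F(n-1) + F(n-2); works for any n.
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    # A fixed lookup table, standing in for "memorized" answers.
    FIB_BY_MEMORY = {0: 0, 1: 1, 2: 1, 3: 2, 4: 3, 5: 5}

    print(fib_by_pattern(10))     # 55 -- the rule generalizes to larger n
    print(FIB_BY_MEMORY.get(10))  # None -- memory alone runs out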

jsnell

The paper doesn't mention it because either the researchers did not care to check the outputs manually, or reporting what was in the outputs would have made it obvious what their motives were.

When this research has been reproduced, the "failures" on the Tower of Hanoi are the model printing out a bunch of steps and then saying there is no point in doing it thousands of times more. Then it would output the algorithm for printing the rest, either in words or in code.

qarl

> The paper doesn't mention the models coming up with the algorithm at all, AFAIK.

And that's because they specifically hamstrung their tests so that the LLMs were not "allowed" to generate algorithms.

If you simply type "Give me the solution for Towers of Hanoi for 12 disks" into ChatGPT it will happily give you the answer. It will write a program to solve it, and then run that program to produce the answer.
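For reference, the program in question is tiny; here is a sketch of the standard recursive solution (not the exact code from the linked chat):

    def hanoi(n, source="A", target="C", spare="B", moves=None):
        # Standard recursive Tower of Hanoi: returns the full list of moves.
        if moves is None:
            moves = []
        if n > 0:
            hanoi(n - 1, source, spare, target, moves)  # park n-1 disks on the spare peg
            moves.append((source, target))              # move the largest disk
            hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks on top
        return moves

    print(len(hanoi(12)))  # 4095 moves, i.e. 2**12 - 1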

But according to the skeptical community - that is "cheating" because it's using tools. Never mind that it is the most effective way to solve the problem.

https://chatgpt.com/share/6845f0f2-ea14-800d-9f30-115a3b644e...

zoul

This is not about finding the most effective solution, it’s about showing that they “understand” the problem. Could they write the algorithm if it were not in their training set?

Too

How can one know that's not coming from the pretraining data? The paper is trying to evaluate whether the LLM has general problem-solving ability.

wohoef

Good article giving some critique of Apple's paper and of Gary Marcus specifically.

https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-gen...

godelski

> this is a preprint that has not been peer reviewed.

This conversation is peer review...

You don't need a conference for something to be peer reviewed, you only need... peers...

In fact, this paper is getting more peer review than most works. Conferences are notoriously noisy as reviewers often don't care and are happy to point out criticisms. All works have valid criticisms... Finding criticisms is the easy part. The hard part is figuring out if these invalidate the claims or not.

hintymad

Honest question: does the opinion of Gary Marcus still count? His criticism seems more philosophical than scientific. It's hard for me to see what he builds or how he reasons to get to his conclusions.

zer00eyz

> seems more philosophical than scientific

I think this is a fair assessment, but reason and intelligence don't really have an established control or control group. If you build a test and say "It's not intelligent because it can't..." and someone goes out and adds that feature in, is it suddenly now intelligent?

If we make a physics breakthrough tomorrow, is there any LLM that is going to retain that knowledge permanently as part of its core, or will they all need to be retrained? Can we make a model that is as smart as a 5th grader without shoving the whole corpus of human knowledge into it, folding it over twice, and then training it back out?

The current crop of tech doesn't get us to AGI. And the focus on making it "better" is for the most part a fool's errand. The real winners in this race are going to be those who hold the keys to optimization: short retraining times, smaller models (with less upfront data), optimized for lower-performance systems.

hintymad

> The current crop of tech doesn't get us to AGI

I actually agree with this. Time and again, I can see that LLMs do not really understand my questions, let alone perform logical deductions beyond in-distribution answers. What I’m really wondering is whether Marcus’s way of criticizing LLMs is valid.

Workaccount2

What gets me, and the author talks about it in the post, is that people will readily attribute correct answers to "it's in the training set", but nobody says anything about incorrect answers that are in the training set. LLMs get stuff in the training set wrong all the time, but nobody uses that as evidence that they probably can't be leaning too hard on memorization for the complex questions they do get right.

It puts LLMs in an impossible position; if they are right, they memorized it, if they are wrong, they cannot reason.

thrwaway55

Do you hypothesize that they see more wrong examples than right? If they are reasoning and can sort it out, why is there concern about model collapse, and why does the data even need to be scrubbed before training?

How many r's really are in Strawberry?

Jensson

> It puts LLMs in an impossible position; if they are right, they memorized it, if they are wrong, they cannot reason.

Both of those can be true at the same time though. They memorize a lot of things, but it's fuzzy, and when they remember wrong they cannot fix it via reasoning.

Workaccount2

It's more than fuzzy: they are packing exabytes, perhaps zettabytes, of training data into a few terabytes. Without any reasoning ability, it must be divine intervention that they ever get anything right...

DanAtC

[flagged]

diego898

Sorry - what do you mean by yud-cult? Searching Google didn’t help me (as far as I can tell) - I view LW from an outside perspective as well, but don’t understand the reference.

jazzypants

They're referring to the founder of that website, Eliezer Yudkowsky, who is controversial due to his 2023 Time article that called for a complete halt on the development of AI.

https://en.m.wikipedia.org/wiki/Eliezer_Yudkowsky

https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-no...

lexh

Use the debatably intelligent machines for this sort of question, not Google.

It seems “Yud” here is a shorthand for Yudkowsky. Hinted by the capitalization.

f33d5173

Eliezer Yudkowsky, often referred to as Yud, started LW.

labrador

The key insight is that LLMs can 'reason' when they've seen similar solutions in training data, but this breaks down on truly novel problems. This isn't reasoning exactly, but close enough to be useful in many circumstances. Repeating solutions on demand can be handy, just like repeating facts on demand is handy. Marcus gets this right technically but focuses too much on emotional arguments rather than clear explanation.

swat535

If that were the case, it would have been great already, but these tools can’t even do that. They frequently make mistakes repeating the same solutions available everywhere during their “reasoning” process and fabricate plausible hallucinations, which you then have to inspect carefully to catch.

woopsn

That alone would be revolutionary - but it's still aspirational for now. The other day Gemini mixed up left and right on me in response to a basic textbook problem.

Jabrov

I’m so tired of hearing this be repeated, like the whole “LLMs are _just_ parrots” thing.

It’s patently obvious to me that LLMs can reason and solve novel problems not in their training data. You can test this out in so many ways, and there’s so many examples out there.

______________

Edit for responders, instead of replying to each:

We obviously have to define what we mean by "reasoning" and "solving novel problems". From my point of view, reasoning != general intelligence. I also consider reasoning to be a spectrum. Just because it cannot solve the hardest problem you can think of does not mean it cannot reason at all. Do note, I think LLMs are generally pretty bad at reasoning. But I disagree with the point that LLMs cannot reason at all or never solve any novel problems.

In terms of some backing points/examples:

1) Next token prediction can itself be argued to be a task that requires reasoning

2) You can construct a variety of language translation tasks, with completely made up languages, that LLMs can complete successfully. There's tons of research about in-context learning and zero-shot performance.

3) Tons of people have created all kinds of challenges/games/puzzles to prove that LLMs can't reason. One by one, they invariably get solved (eg. https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224..., https://ahmorse.medium.com/llms-and-reasoning-part-i-the-mon...) -- sometimes even when the cutoff date for the LLM is before the puzzle was published.

4) Lots of examples of research about out-of-context reasoning (eg. https://arxiv.org/abs/2406.14546)

In terms of specific rebuttals to the post:

1) Even though they start to fail at some complexity threshold, it's incredibly impressive that LLMs can solve any of these difficult puzzles at all! GPT3.5 couldn't do that. We're making incremental progress in terms of reasoning. Bigger, smarter models get better at zero-shot tasks, and I think that correlates with reasoning.

2) Regarding point 4 ("Bigger models might do better"): I think this is very dismissive. The paper itself shows a huge variance in the performance of different models. For example, in figure 8, we see Claude 3.7 significantly outperforming DeepSeek and maintaining stable solutions for a much longer sequence length. Figure 5 also shows that better models and more tokens improve performance at "medium" difficulty problems. Just because it cannot solve the "hard" problems does not mean it cannot reason at all, nor does it necessarily mean it will never get there. Many people were saying we'd never be able to solve problems like the medium ones a few years ago, but now the goalposts have just shifted.

aucisson_masque

> It’s patently obvious that LLMs can reason and solve novel problems not in their training data.

Would you care to tell us more?

"It’s patently obvious" is not really an argument; I could just as well say that everyone knows LLMs can’t reason or think (in the way we living beings do).

socalgal2

I'm working on a new API. I asked the LLM to read the spec and write tests for it. It did. I don't know if that's "reasoning". I know that no tests exist for this API. I know that the internet is not full of training data for this API, because it's a new API. It's also not a CRUD API or some other API with a common pattern. And yet, with a very short prompt, Gemini Code Assist wrote valid tests for a new feature.

It certainly feels like more than fancy auto-complete. That is not to say I haven't run into issue but I'm still often shocked at how far it gets. And that's today. I have no idea what to expect in 6 months, 12, 2 years, 4, etc.

travisjungroth

Copied from a past comment of mine:

I just made up this scenario and these words, so I'm sure it wasn't in the training data.

Kwomps can zark but they can't plimf. Ghirns are a lot like Kwomps, but better zarkers. Plyzers have the skills the Ghirns lack.

Quoning, a type of plimfing, was developed in 3985. Zhuning was developed 100 years earlier. I have an erork that needs to be plimfed. Choose one group and one method to do it.

> Use Plyzers and do a Quoning procedure on your erork.

If that doesn't count as reasoning or generalization, I don't know what does.

astrange

> It’s patently obvious to me that LLMs can reason and solve novel problems not in their training data.

So can real parrots. Parrots are pretty smart creatures.

goalieca

So far they cannot even answer questions which are straight-up fact-checking and search-engine-like queries. Reasoning means they would be able to work through a problem and generate a proof the way a student might.

Workaccount2

So if they have bad memory, then they must be reasoning to get the correct answer for the problems they do solve?

labrador

I've done this exercise dozens of times because people keep saying it, but I can't find an example where this is true. I wish it were. I'd be solving world problems with novel solutions right now.

People make a common mistake by conflating "solving problems with novel surface features" with "reasoning outside training data." This is exactly the kind of binary thinking I mentioned earlier.

jjaksic

"Solving novel problems" does not mean "solving world problems that even humans are unable to solve", it simply means solving problems that are "novel" compared to what's in the training data.

Can you reason? Yes? Then why haven't you cured cancer? Let's not have double standards.

jhanschoo

I think that "solving world problems with novel solutions" is a strawman for an ability to reason well. We cannot solve world problems with reasoning, because pure reasoning has no relation to reality. We lack data and models about the world to confirm and deny our hypotheses about the world. That is why the empirical sciences do experiments instead of sit in an armchair and mull all day.

bfung

Any links or examples available? Curious to try it out

andrewmcwatters

It's definitely not true in any meaningful sense. There are plenty of us practitioners in software engineering wishing it was true, because if it was, we'd all have genius interns working for us on Mac Studios at home.

It's not true. It's plainly not true. Go have any of these models, paid or local, try to build you novel solutions to hard, existing problems, despite being, in some cases, trained on literally the entire compendium of open knowledge in not just one but multiple adjacent fields. Not to mention the fact that being able to abstract general knowledge would mean it would be able to reason.

They. Cannot. Do it.

I have no idea what you people are talking about because you cannot be working on anything with real substance that hasn't been perfectly line fit to your abundantly worked on problems, but no, these models are obviously not reasoning.

I built a digital employee and gave it menial tasks, the same kind that current cloud solutions claiming to provide paid AI employees are given, and these things are stupider than fresh college grads.

aucisson_masque

That’s the opposite of reasoning though. AI bros want to make people believe LLMs are smart, but they’re not capable of intelligence and reasoning.

Reasoning means you can take on a problem you’ve never seen before and think of innovative ways to solve it.

An LLM can only replicate what is in its data. It can in no way think or guess or estimate what will likely be the best solution; it can only output a solution based on a probability calculation of how frequently it has seen this solution linked to this problem.

labrador

You're assuming we're saying LLMs can't reason. That's not what we're saying. They can execute reasoning-like processes when they've seen similar patterns, but this breaks down when true novel reasoning is required. Most people do the same thing. Some people can come up with novel solutions to new problems, but LLMs will choke. Here's an example:

Prompt: "Let's try a reasoning test. Estimate how many pianos there are at the bottom of the sea."

I tried this on three advanced AIs* and they all choked on it without further hints from me. Claude then said:

    Roughly 3 million shipwrecks on ocean floors globally
    Maybe 1 in 1000 ships historically carried a piano (passenger ships, luxury vessels)
    So ~3,000 ships with pianos sunk
    Average maybe 0.5 pianos per ship (not all passenger areas had them)
    Estimate: ~1,500 pianos

*Claude Sonnet 4, Google Gemini 2.5, and GPT-4o

kgeist

GPT-4o isn't considered an "advanced" LLM at this point. It doesn't use reasoning.

I gave your prompt to o3 pro, and this is what I got without any hints:

  Historic shipwrecks (1850 → 1970)
  • ~20 000 deep water wrecks recorded since the age of steam and steel  
  • 10 % were passenger or mail ships likely to carry a cabin class or saloon piano   
  • 1 piano per such vessel 20 000 × 10 % × 1 ≈ 2 000

  Modern container losses (1970 → today)
  • ~1 500 shipping containers lost at sea each year  
  • 1 in 2 000 containers carries a piano or electric piano   
  • Each piano container holds ≈ 5 units   
  • 50 year window 1 500 × 50 / 2 000 × 5 ≈ 190

  Coastal disasters (hurricanes, tsunamis, floods)
  • Major coastal disasters each decade destroy ~50 000 houses  
  • 1 house in 50 owns a piano   
  • 25 % of those pianos are swept far enough offshore to sink and remain (50 000 / 50) × 25 % × 5 decades ≈ 1 250

  Add a little margin for isolated one offs (yachts, barges, deliberate dumping): ≈ 300

  Best guess range: 3 000 – 5 000 pianos are probably resting on the seafloor worldwide.

FINDarkside

What does "choked on it" mean for you? Gemini 2.5 pro gives this, even estimating what amouns of those 3m ships that sank after pianos became common item. Not pasting the full reasoning here since it's rather long.

Combining our estimates:

From Shipwrecks: 12,500
From Dumping: 1,000
From Catastrophes: 500
Total Estimated Pianos at the Bottom of the Sea ≈ 14,000

Also I have to point out that 4o isn't a reasoning model and neither is Sonnet 4, unless thinking mode was enabled.

Jabrov

That seems like a totally reasonable response ... ?

dialup_sounds

How much of that is inability to reason vs. being trained to avoid making things up?

ummonk

Most of the objections and their counterarguments either seem like poor objections (e.g. the ad hominem against the first listed author) or seem to be subsumed under point 5. It’s annoying that this post focuses so much effort on discussing the other objections when the important discussion is the one to be had in point 5:

I.e., to what extent are LLMs able to reliably make use of writing code or using logic systems, and to what extent does hallucinating / providing faulty answers in the absence of such tool access demonstrate an inability to truly reason? (I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than give a best-effort faulty answer.)

thomasahle

> I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than give a best-effort faulty answer.

That's what the models did. They gave the first 100 steps, then explained how it was too much to output all of it, and gave the steps one would follow to complete it.

They were graded as "wrong answer" for this.

---

Source: https://x.com/scaling01/status/1931783050511126954?t=ZfmpSxH...

> If you actually look at the output of the models you will see that they don't even reason about the problem if it gets too large: "Due to the large number of moves, I'll explain the solution approach rather than listing all 32,767 moves individually"

> At least for Sonnet, it doesn't try to reason through the problem once it's above ~7 disks. It will state what the problem is and the algorithm to solve it, and then output its solution without even thinking about individual steps.
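(For scale: an n-disk Tower of Hanoi takes 2^n - 1 moves, so the 32,767 moves quoted above correspond to 15 disks; the move list grows exponentially even though the procedure that generates it stays the same.)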

sponnath

Didn't they start failing well before they hit token limits? I'm not sure what point the source you linked is trying to make.

emp17344

Why should we trust a guy with the following twitter bio to accurately replicate a scientific finding?

>lead them to paradise

>intelligence is inherently about scaling

>be kind to us AGI

Who even is this guy? He seems like just another r/singularity-style tech bro.

FINDarkside

I don't think most of the objections are poor at all, apart from 3; it's this article that seems to make lots of strawman arguments. The first objection especially is often heard because people claim "this paper proves LLMs don't reason". The author moves the goalposts and argues about whether LLMs lead to AGI, which is already a strawman for those arguments. In addition, he even seems to misunderstand AGI, thinking it's some sort of superintelligence ("We have every right to expect machines to do things we can’t"). AI that can do everything at least as well as an average human is AGI by definition.

It's an especially weird argument considering that LLMs are already ahead of humans on the Tower of Hanoi. I bet the average person would not be able to "one-shot" you the moves for an 8-disk Tower of Hanoi without writing anything down or tracking the state with the actual disks. LLMs have far bigger obstacles to reaching AGI, though.

5 is also a massive strawman, with the "not see how well it could use preexisting code retrieved from the web" bit, given that these models will write code to solve these kinds of problems even if you come up with some new problem that wouldn't exist in their training data.

Most of these are just valid issues with the paper. They're not supposed to be arguments that invalidate everything the paper said. The paper didn't really even make any bold claims; it only concluded that LLMs have limitations in their reasoning. It had a catchy title and many people didn't read past that.

chongli

> It's an especially weird argument considering that LLMs are already ahead of humans on the Tower of Hanoi

No one cares about Towers of Hanoi. Nor do they care about any other logic puzzles like this. People want AIs that solve novel problems for their businesses. The kind of problems regular business employees solve every single day yet LLMs make a mess of.

The purpose of the Apple paper is not to reveal the fact that LLMs routinely fail to solve these problems. Everyone who uses them already knows this. The paper is an argument for why this happens (lack of reasoning skills).

No number of demonstrations of LLMs solving well-known logic puzzles (or other problems humans have already solved) will prove reasoning. It's not interesting at all to solve a problem that humans have already solved (with working software to solve every instance of the problem).

ummonk

I'm more saying that points 1 and 2 get subsumed under point 5 - to the extent that existing algorithms / logical systems for solving such problems are written by humans, an AGI wouldn't need to match the performance of those algorithms / logical systems - it would merely need to be able to create / use such algorithms and systems itself.

You make a good point though that the question of whether LLMs reason or not should not be conflated with the question of whether they're on the pathway to AGI or not.

FINDarkside

Right, I agree there. Also, that's something LLMs can already do. If you give the problem to ChatGPT's o3 model, it will actually write Python code, run it, and give you the solution. But I think points 1 and 2 are still very valid things to talk about, because while the Tower of Hanoi can be solved by writing code, that doesn't apply to every problem that would require extensive reasoning.

neoden

> Puzzles a child can do

Certainly, I couldn't solve the Tower of Hanoi with 8 disks purely in my mind without being able to write down the state at every step or having the physical disks in front of me. Are we comparing apples to apples?

hellojimbo

The only real point is number 5.

> Huge vindication for what I have been saying all along: we need AI that integrates both neural networks and symbolic algorithms and representations

This is basically agents, which is literally what everyone has been talking about for the past year lol.

> (Importantly, the point of the Apple paper goal was to see how LRM’s unaided explore a space of solutions via reasoning and backtracking, not see how well it could use preexisting code retrieved from the web.

This is a false dichotomy. The thing that Apple tested was dumb, and downloading code from the internet is also dumb. What would've been interesting is: given the problem, would a reasoning agent know how to solve it with access to a coding environment?

> Do LLM’s conceptually understand Hanoi?

Yes, and the paper didn't test for this. The paper basically tested the equivalent of asking whether a human can do Hanoi in their head.

I feel like what the author is advocating for is basically a neural net that can send instructions to an ALU/CPU, but I haven't seen anything promising that shows it's better than just giving an agent access to a terminal.

bluefirebrand

I'm glad to read articles like this one, because I think it is important that we pour some water on the hype cycle

If we want to get serious about using these new AI tools then we need to come out of the clouds and get real about their capabilities

Are they impressive? Sure. Useful? Yes probably in a lot of cases

But we cannot continue the hype this way; it doesn't serve anyone except the people who are financially invested in these tools.

senko

Gary Marcus isn't about "getting real"; he's about making a name for himself as a contrarian to the popular AI narrative.

This article may seem reasonable, but here he's defending a paper that in his previous article he called "A knockout blow for LLMs".

Many of his articles seem reasonable (if a bit off) until you read a couple dozen and spot a trend.

steamrolled

> Gary Marcus isn't about "getting real", it's making a name for himself as a contrarian to the popular AI narrative.

That's an odd standard. Not wanting to be wrong is a universal human instinct. By that logic, every person who ever took any position on LLMs is automatically untrustworthy. After all, they made a name for themselves by being pro- or con-. Or maybe a centrist - that's a position too.

Either he makes good points or he doesn't. Unless he has a track record of distorting facts, his ideological leanings should be irrelevant.

senko

He makes many very good points:

For example, he continually calls out AGI hype for what it is, and also showcases the dangers of naive use of LLMs (e.g. lawyers copy-pasting hallucinated cases into their documents, etc.). For this, he has plenty of material!

He also makes some very bad points and worse inferences: that LLMs as a technology are useless because they can't lead to AGI, that hallucination makes LLMs useless (but then he contradicts himself in another article, conceding they "may have some use"), that because they can't follow an algorithm they're useless, that scaling laws are over and therefore LLMs won't advance (he's been making that claim for a couple of years), that the AI bubble will collapse in a few months (also a few years of that), etc.

Read any of his articles (I've read too many, sadly) and you'll never come to the conclusion that LLMs might be a useful technology, or be "a good thing" even in some limited way. This just doesn't fit with the reality I can observe with my own eyes.

To me, this shows he's incredibly biased. That's okay if he wants to be a pundit - I couldn't blame Gruber for being biased about Apple! But Marcus presents himself as the authority on AI, a scientist, showing a real and unbiased view on the field. In fact, he's as full of hype as Sam Altman is, just in another direction.

Imagine he was talking about aviation, not AI. 787 dreamliner crashes? "I've been saying for 10 years that airplanes are unsafe, they can fall from the sky!" Boeing the company does stupid shit? "Blown door shows why airplane makers can't be trusted" Airline goes bankrupt? "Air travel winter is here"

I've spoken to too many intelligent people who read Marcus, take him at his word and have incredibly warped views on the actual potential and dangers of AI (and send me links to his latest piece with "so this sounds pretty damning, what's your take?"). He does real damage.

Compare him with Simon Willison, who also writes about AI a lot, and is vocal about its shortcomings and dangers. Reading Simon, I never get the feeling I'm being sold on a story (either positive or negative), but that I learned something.

Perhaps a Marcus is inevitable as a symptom of the Internet's immune system to the huge amount of AI hype and bullshit being thrown around. Perhaps Gary is just fed up with everything and comes out guns blazing, science be damned. I don't know.

But in my mind, he's as much a BSer as the AGI singularity hypers.

sinenomine

Marcus's points routinely fail to pass scrutiny; nobody in the field takes him seriously. If you seek real, scientifically interesting LLM criticism, read François Chollet and his ARC-AGI series of evals.

adamgordonbell

This!

For all his complaints about LLMs, his writing could be generated by an LLM with a prompt saying: "write an article responding to this news with an essay saying that you are once again right that this AI stuff is overblown and will never amount to anything."

woopsn

Given that the links work, the quotes were actually said, the numbers are correct, the cited research actually exists, etc., we can immediately rule that out.

2muchcoffeeman

What’s the argument here that he’s not considering all the information regarding GenAI?

That there’s a trend to his opinion?

If I consider all the evidence regarding gravity, all my papers will be “gravity is real”.

In what ways is he only choosing what he wants to hear?

senko

Replied elsewhere in the thread: https://news.ycombinator.com/item?id=44279283

To your example about gravity, I argue that he goes from "gravity is real" to "therefore we can't fly", and "yeah maybe some people can but that's not really solving gravity and they need to go down eventually!"

ramchip

I was very put off by his article "A knockout blow for LLMs?", especially all the fuss he was making about using his own name as a verb to mean debunking AI hype...

ninjin

Marcus comes with a very standard cognitive science criticism of statistical approaches to artificial intelligence, many parts of which date back to the late 50s, when the field was born and moved to distance itself from behaviourism. The worst part to me is not that his criticism is entirely wrong, but rather that it is obvious and yet peddled as something that those of us who develop statistical approaches are completely ignorant of. To make matters worse, instead of developing alternative approaches (like plenty of my colleagues in cognitive science do!), he simply reiterates pretty much the same points over and over and has done so at least for the last twenty or so years. He and others paint themselves as sceptics and bulwarks against the current hype (which I can assure you, I hate at least as much as they do). But, to me, they are cynics, not sceptics.

I try to maintain a positive and open mind about other researchers, but Marcus lost me pretty much at "first contact" when a student in the group who leaned towards cognitive science had us read "Deep Learning: A Critical Appraisal" by Marcus (2018) [1] back around when it was published. Finally I could get into the mind of this guy so many people were talking about! 27 pages and yet I learned next to nothing new as the criticism was just the same one we have heard for decades: "Statistical learning has limits! It may not lead to 'truly' intelligent machines!". Not only that, the whole piece consistently conflates deep learning and statistical learning for no reason at all, reads as if it was rushed (and not proofed), emphasises the author's research strongly rather than giving a broad overview, etc. In short, it is bad, very bad as a scientific piece. At times, I read short excerpts of an article Marcus has written and yet sadly it is pretty much the same thing all over again.

[1]: https://arxiv.org/abs/1801.00631

There is a horrible market to "sell" hype when it comes to artificial intelligence, but there is also a horrible market to "sell" anti-hype. Sadly, both bring traffic, attention, talk invitations, etc. Two largely unscientific tribes, that I personally would rather do without, with their own profiting gurus.

newswasboring

What exactly is your objection here? That the guy has an opinion and is writing about it?

bobxmax

[flagged]

g-b-r

I see the opposite, the wide majority of people commenting on Hacker News seem now very favorable to LLMs.

fhd2

Even among the people invested in these tools, hype only benefits those attempting a pump-and-dump scheme, or those selling training, consulting, or similar services around AI.

People who try to make genuine progress, while there's more money in it now, might just have to deal with another AI winter soon at this rate.

bluefirebrand

> hype only benefits those attempting a pump and dump scheme

I read some posts the other day saying Sam Altman sold off a ton of his OpenAI shares. Not sure if it's true and I can't find a good source, but if it is true then "pump and dump" does look close to the mark

aeronaut80

You probably can’t find a good source because sources say he has a negligible stake in OpenAI. https://www.cnbc.com/amp/2024/12/10/billionaire-sam-altman-d...

spookie

I think the same thing; we need more breakthroughs. Until then, it is still risky to rely on AI for most applications.

The sad thing is that most people would take this comment the wrong way, assuming it is just another doomer take. No, there is still a lot to do, and promising the world too soon will only lead to disappointment.

Zigurd

This is the thing of it: "for most applications."

LLMs are not thinking. The way they fail, which is confidently and articulately, is one way they reveal there is no mind behind the bland but well-structured text.

But if I were tasked with finding 500 patents with weak claims or claims that have been litigated and knocked down, I would turn to LLMs to help automate that. One or two "nines" of reliability is fine, and LLMs would turn this previously impossible task into something plausible to take on.

mountainriver

I’ll take critiques from someone who knows what a test train split is.

The idea that a guy so removed from machine learning has something relevant to say about its capabilities really speaks to the state of AI fear

Spooky23

The idea that practitioners would try to discredit research to protect the golden goose from critique speaks to human nature.

mountainriver

No one is discrediting research from valid places; this is the alt-right-style victim narrative that seems to follow Gary Marcus around. Somehow the mainstream is "suppressing" the real knowledge.

devwastaken

Experts are often too blinded by their paychecks to see how nonsensical their expertise is.

mountainriver

Not knowing the most basic things about the subject you are critiquing is utter nonsense. Defending someone who does this is even worse

soulofmischief

[citation needed]

bandrami

How actually useful are they though? We've had more than a year now of saying these things 10X knowledge workers and creatives, so.... where is the output? Is there a new office suite I can try? 10 times as many mobile apps? A huge new library of ebooks? Is this actually in practice producing things beyond Ghibli memes and RETVRN nostalgia slop?

2muchcoffeeman

I think it largely depends on what you’re writing. I’ve had it reply to corporate emails, which is good since I need to sound professional, not human.

If I’m coding, it still needs a lot of babysitting, and sometimes I’m much faster than it.

Gigachad

And then the person on the other end is using AI to summarise the email back to normal English. To what end?

bandrami

So this would be an interesting output to measure but I have no idea how we would do that: has the volume of corporate email gone up? Or the time spent creating it gone down?

bigyabai

There's something innately funny about "HN's undying optimism" and "bad-news paper from Apple" reaching a head like this. An unstoppable object is careening towards an impervious wall, anything could happen.

DiogenesKynikos

I don't understand what people mean when they say that AI is being hyped.

AI is at the point where you can have a conversation with it about almost anything, and it will answer more intelligently than 90% of people. That's incredibly impressive, and normal people don't need to be sold on it. They're just naturally impressed by it.

woopsn

If the claims about AI were that it is a great or even incredible chat app, there would be no mismatch.

I think normal people understand curing all disease, replacing all value, generating 100x stock market returns, uploading our minds etc to be hype.

I said a few days ago, LLMs are an amazing product. Sad that these people ruin their credibility immediately upon success.

FranzFerdiNaN

I don’t need a tool that’s right maybe 70% of the time (and that’s me being optimistic). It needs to be right all the time or at least tell you when it doesn’t know for sure, instead of just making up something. Comparing it to going out in the streets and asking random people random questions is not a good comparison.

amohn9

It might not fit your work, but there are tons of areas where “good enough” can still provide a lot of value. I’m sure you’d be thrilled with a tool that could correctly tell you if Apple’s stock was going up or down tomorrow 70% of the time.

newswasboring

> I don’t need a tool that’s right maybe 70% of the time (and that’s me being optimistic).

Where are you getting this from? 70%?

travisgriggs

I get even better results talking to myself.

georgemcbay

AI, in the form of LLMs, can be a useful tool.

It is still being vastly overhyped, though, by people attempting to sell the idea that we are actually close to an AGI "singularity".

Such overhype is usually easy to handwave away as not my problem. Like, if investors get fooled into thinking this is anything like AGI, well, a fool and his money and all that. But investors aside, this AI hype is likely to have some very bad real-world consequences, based on the same hype-men selling people on the idea that we need to generate 2-4 times more power than we currently do to power this godlike AI they are claiming is imminent.

And even right now there's massive real-world impact in the form of, say, how much Grok is polluting Georgia.

hellohello2

It's quite simple: people upvote content that makes them feel good. Most of us here are programmers, and the idea that many of our skills are becoming replaceable feels quite bad. Hence, people upvote delusional statements that let them believe in something that feels better than objective reality. With any luck, these comments will be scraped and used to train the next AI generation, relieving it from the burden of factuality at last.

hrldcpr

In case anyone else missed the original paper (and discussion):

https://news.ycombinator.com/item?id=44203562

dang

Thanks! Macroexpanded:

The Illusion of Thinking: Strengths and limitations of reasoning models [pdf] - https://news.ycombinator.com/item?id=44203562 - June 2025 (269 comments)

Also this: A Knockout Blow for LLMs? - https://news.ycombinator.com/item?id=44215131 - June 2025 (48 comments)

Were there others?

thomasahle

> 5. A student might complain about a math exam requiring integration or differentiation by hand, even though math software can produce the correct answer instantly. The teacher’s goal in assigning the problem, though, isn’t finding the answer to that question (presumably the teacher already know the answer), but to assess the student’s conceptual understanding. Do LLM’s conceptually understand Hanoi? That’s what the Apple team was getting at. (Can LLMs download the right code? Sure. But downloading code without conceptual understanding is of less help in the case of new problems, dynamically changing environments, and so on.)

Why is he talking about "downloading" code? The LLMs can easily "write" out the code themselves.

If the student wrote a software program for general differentiation during the exam, they obviously would have a great conceptual understanding.

autobodie

If the student could reference notes a fraction of the size of the LLM then I would not be convinced.

Workaccount2

LLMs are (suspected) a few TB in size.

Gemma 2 27B, one of the top-ranked open-source models, is ~60GB in size. Llama 405B is about 1TB.

Mind you, they likely train on exabytes of data. That alone should be a strong indication that there is a lot more than memory going on here.

exe34

I suspect human memory consists of a lot more bits than LLMs encode.

autobodie

I rest my case — the question concerns a quality, not a quantity. These juvenile comparisons are mere excuses.

Illniyar

I find it weird that people are taking the original paper to be some kind of indictment of LLMs. It's not like LLMs failing at the Tower of Hanoi problem at higher disk counts is new; the paper used an existing method that had been done before.

It was simply comparing the effectiveness of reasoning and non-reasoning models on the same problem.

skywhopper

The quote from the Salesforce paper is important: “agents displayed near-zero confidentiality awareness”.