The Illusion of Thinking: Strengths and Limitations of Reasoning Models

bilsbie

Interestingly, I just hit an example of this. It’s highly specific, but I was asking about pickleball strategy, and Grok and Claude both couldn’t seem to understand that you can’t aim at the opponent’s feet when you’re hitting up.

They just kept regurgitating internet advice, and I couldn’t get them to understand the reasoning on why it was wrong.

jqpabc123

Hey, if the internet says it, it can't be wrong.

bilsbie

In this case it found generic advice and was confusing itself.

bilsbie

I wonder if there’s past symbolic reasoning research we could integrate into LLMs. They’re really good at parsing text and understanding the relationships between objects, i.e. getting the “symbols” correct.

Maybe we plug into something like Prolog (or other such strategies?).

skue

Previous discussion: https://news.ycombinator.com/item?id=44203562

piskov

All "reasoning" models hit a complexity wall where they completely collapse to 0% accuracy.

No matter how much computing power you give them, they can't solve harder problems.

This research suggests we're not as close to AGI as the hype suggests.

Current "reasoning" breakthroughs may be hitting fundamental walls that can't be solved by just adding more data or compute.

Apple's researchers used controllable puzzle environments specifically because:

• They avoid data contamination
• They require pure logical reasoning
• They can scale complexity precisely
• They reveal where models actually break

Models could handle 100+ moves in Tower of Hanoi puzzles but failed after just 4 moves in River Crossing puzzles.

This suggests they memorized Tower of Hanoi solutions during training but can't actually reason.
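For a sense of scale on the Tower of Hanoi figure, here is a minimal sketch of the standard recursive solution. It is not the setup from the Apple paper; the function names and peg labels are made up for illustration. The optimal solution for n disks takes 2^n - 1 moves, so "100+ moves" corresponds to roughly 7 disks (127 moves), and the whole move sequence comes from one short, widely published recursive rule.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield (disk, from_peg, to_peg) for the standard recursive solution.

    Hypothetical helper for illustration; peg names are arbitrary.
    """
    if n == 0:
        return
    # Move the top n-1 disks out of the way onto the spare peg.
    yield from hanoi_moves(n - 1, source, spare, target)
    # Move the largest remaining disk to the target peg.
    yield (n, source, target)
    # Move the n-1 disks from the spare peg onto the target peg.
    yield from hanoi_moves(n - 1, spare, target, source)


def count_moves(n):
    """Count the moves generated for n disks (equals 2**n - 1)."""
    return sum(1 for _ in hanoi_moves(n))


if __name__ == "__main__":
    for disks in range(1, 9):
        print(disks, count_moves(disks))
```

The length of the move list grows exponentially, but the rule that generates it stays the same size, which is one reason a long correct Hanoi transcript does not by itself show much about general reasoning.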

https://x.com/RubenHssd/status/1931389580105925115

jmogly

I think this might be part of the reason Apple is “behind” on generative AI … LLMs have not really proven to be useful outside of relatively niche areas such as coding assistants, legal boilerplate and research, and maybe some data science/analysis, which I’m less familiar with.

Other “end user” facing use cases have so far been comically bad or possibly harmful, and they just don’t meet the quality bar for inclusion in Apple products, which, as much as some people like to doo doo on them and say they have gotten worse, still face very high expectations of quality and UX from customers.

bilsbie

Dumb question. Was this already posted? I thought I saw it yesterday.

wohoef

https://news.ycombinator.com/item?id=44203562
