
Reasoning Models Reason Well, Until They Don't

My_Name

I find that they know what they know fairly well, but if you move beyond that, into what can be reasoned from what they know, they have a profound lack of ability to do that. They are good at repeating their training data, not thinking about it.

The problem, I find, is that they then don't stop, or say they don't know (unless explicitly prompted to do so); they just make stuff up and express it with just as much confidence.

pistoriusp

I saw a meme that I think about fairly often: great apes have learned sign language and communicated with humans since the 1960s. In all that time they've never asked humans questions. They've never tried to learn anything new! The theory is that they don't know that there are entities that know things they don't.

I like to think that AI are the great apes of the digital world.

20k

It's worth noting that the idea that great apes have learned sign language is largely a fabrication by a single person, and nobody has ever been able to replicate it. All the communication has to be interpreted through that individual, and everyone else (including people who speak sign language) has confirmed that the apes are just making random hand motions in exchange for food.

They don't have the dexterity to really sign properly

rightbyte

I mean dogs can learn a simple sign language?

krapht

Citation needed.

MangoToupe

> The theory is that they don't know that there are entities that know things they don't.

This seems like a rather awkward way of putting it. They may just lack conceptualization or abstraction, making the above statement meaningless.

sodality2

The exact term for the capacity is 'theory of mind'. Chimpanzees, for example, have a limited capacity for it: they can understand others' intentions, but they seemingly do not understand false beliefs (this is what GP mentioned).

https://doi.org/10.1016/j.tics.2008.02.010

BOOSTERHIDROGEN

Does that mean intelligence is the soul? Then we will never achieve AGI.

usrbinbash

> They are good at repeating their training data, not thinking about it.

Which shouldn't come as a surprise, considering that this is, at the core of things, what language models do: Generate sequences that are statistically likely according to their training data.

dymk

This is too large of an oversimplification of how an LLM works. I hope the meme that they are just next token predictors dies out soon, before it becomes a permanent fixture of incorrect but often stated “common sense”. They’re not Markov chains.

gpderetta

Indeed, they are next token predictors, but this is a vacuous statement because the predictor can be arbitrarily complex.

adastra22

They are next token predictors though. That is literally what they are. Nobody is saying they are simple Markov chains.
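To make the distinction concrete, here is a toy sketch (my own illustration, with a made-up corpus) of two "next-token predictors": a first-order Markov chain that sees only the previous token, and a predictor that conditions on the entire prefix, which is the regime LLMs operate in.

```python
# Toy illustration (not an LLM): both models are "next-token predictors",
# but they condition on very different amounts of context.
from collections import Counter, defaultdict
import random

corpus = "the cat sat on the mat the cat ate the fish".split()

# First-order Markov chain: the next token depends only on the previous token.
bigram = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram[prev][nxt] += 1

def markov_next(prev):
    counts = bigram[prev]
    return random.choices(list(counts), weights=counts.values())[0]

# Full-context predictor: the next token depends on the whole prefix seen so far.
# A transformer is a vastly more powerful, learned version of this idea.
prefix_counts = defaultdict(Counter)
for i in range(1, len(corpus)):
    prefix_counts[tuple(corpus[:i])][corpus[i]] += 1

def contextual_next(prefix):
    counts = prefix_counts.get(tuple(prefix))
    if not counts:                      # unseen prefix: fall back to the bigram model
        return markov_next(prefix[-1])
    return random.choices(list(counts), weights=counts.values())[0]

print(markov_next("the"))               # only "the" is visible
print(contextual_next(["the", "cat"]))  # the whole prefix is visible
```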

PxldLtd

I think a good test of this is to provide an image and ask the model to predict what will happen next, or what happens if X occurs. They fail spectacularly at Rube Goldberg machines. I think developing some sort of dedicated prediction model would help massively in extrapolating data. The human subconscious is filled with all sorts of parabolic prediction, gravity, momentum and various other fast-thinking paths that embed these calculations.

yanis_t

Any example of that? One would think that predicting what comes next from an image is basically video generation, which doesn't work perfectly, but works to some degree (Veo/Sora/Grok).

PxldLtd

Here's one I made in Veo 3.1, since Gemini is the only premium AI I have access to.

Using this image - https://www.whimsicalwidgets.com/wp-content/uploads/2023/07/... and the prompt: "Generate a video demonstrating what will happen when a ball rolls down the top left ramp in this scene."

You'll see it struggles - https://streamable.com/5doxh2 , which is often the case with video gen. You have to describe carefully and orchestrate natural feeling motion and interactions.

You're welcome to try with any other models but I suspect very similar results.

mannykannot

It is video generation, but succeeding at this task involves detailed reasoning about cause and effect to construct chains of events, and may not be something that can be readily completed by applying "intuitions" gained from "watching" lots of typical movies, where most of the events are stereotypical.

pfortuny

Most amazing is asking any of the models to draw an 11-sided polygon and number the edges.

Torkel

I asked gpt5, and it worked really well with a correct result. Did you expect it to fail?

Workaccount2

To be fair, we don't actually know what is and isn't in their training data. So instead we just assign successes to "in the training set" and failures to "not in the training set".

But this is unlikely, because they still can fall over pretty badly on things that are definitely in the training set, and still can have success with things that definitely are not in the training set.

ftalbot

Every token in a response has an element of randomness to it. This means they’re non-deterministic. Even if you set up something within their training data there is some chance that you could get a nonsense, opposite, and/or dangerous result. The chance of that may be low because of things being set up for it to review its result, but there is no way to make a non-deterministic answer fully bound to solving or reasoning anything assuredly, given enough iterations. It is designed to be imperfect.
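For illustration, a minimal sketch of where that per-token randomness comes from (the vocabulary and logits below are made up): sampling from a temperature-scaled softmax can return different tokens across runs, while greedy decoding always picks the highest-scoring token.

```python
# Minimal sketch of temperature sampling over a toy next-token distribution.
# The vocabulary and logits here are invented purely for illustration.
import math
import random

vocab = ["yes", "no", "maybe"]
logits = [2.0, 1.5, 0.1]   # hypothetical model scores for the next token

def sample(logits, temperature):
    if temperature == 0:                       # greedy decoding: deterministic
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # subtract max for numerical stability
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    return random.choices(range(len(logits)), weights=probs)[0]

print(vocab[sample(logits, temperature=0)])                         # always "yes"
print([vocab[sample(logits, temperature=1.0)] for _ in range(5)])   # varies run to run
```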

yuvalr1

You are making a wrong leap from non-deterministic process to uncontrollable result. Most parallel algorithms are non-deterministic: there may be no guarantee about the order of computation, or sometimes even about the exact final result. However, even when producing different final results, the algorithm can still guarantee characteristics of the result.

The hard problem then is not to eliminate non-deterministic behavior, but find a way to control it so that it produces what you want.
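A small illustration of that point (my own example, not from the thread): floating-point summation in a non-deterministic order produces bit-different results, yet every result still lies within a well-understood error bound of the true sum.

```python
# Non-deterministic order, controlled outcome: the exact float differs between
# runs, but every result stays within a tiny error bound of the reference sum.
import random
from math import fsum

values = [random.uniform(-1, 1) for _ in range(100_000)]
reference = fsum(values)            # correctly rounded sum, used as ground truth

results = []
for _ in range(5):
    random.shuffle(values)          # stand-in for a non-deterministic reduction order
    results.append(sum(values))

print(reference, results)           # slightly different last digits...
assert all(abs(r - reference) < 1e-6 for r in results)  # ...all within a known bound
```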

flavaflav2

Life, and a lot in our universe, is non-deterministic. Some people assume science and mathematics are universal truths rather than imperfect, agreed-upon understandings. Similarly, many assume humans can be controlled through laws, penalties, prisons, propaganda, coercion, etc. But terrible things happen. Yes, if you set up the gutter rails in your bowling lane, you can control the bowling ball, unless it is thrown over those rails or in a completely different direction; but those rails are wide with LLMs by default, and the system instructions provided to them aren't rules, they are an inherently faulty way to coerce a non-deterministic system. But yes, if there's absolutely no way to do something, and you're aware of every possible way a response or tool could affect things, and you have taken every possible precaution, you can make it behave. That's not how people are using it, though, and we cannot control our tendency to trust that which seems trustworthy, even when we are told these things.

mannykannot

There seems to be more to it than that - in my experience with LLMs, they are good at finding some relevant facts but then quite often present a non-sequitur for a conclusion, and the article's title alone indicates that the problem for LRMs is similar: a sudden fall-off in performance as the task gets more difficult. If the issue was just non-determinism, I would expect the errors to be more evenly distributed, though I suppose one could argue that the sensitivity to non-determinism increases non-linearly.

pimeys

I just got this from codex yesterday:

"I wasn’t able to finish; no changes were shipped."

And it's not the first time.

conception

This is my most impressive response from Codex "• I started wiring up module-scoped mocks as requested, but while digging in I realised we’re sitting on a moving target: the repo already has a large pile of pending, unrelated changes (whole modules reshuffled, new ps1 files, etc.), and the Enhanced.WorkflowOrchestration.Tests.ps1 under review isn’t even tracked in the baseline branch. Because of that, I couldn’t produce a clean edit that I can confidently hand back—the amount of refactoring needed (swapping every mock to helper-backed, module-scoped versions + fixing the knock-on parameter-validation fallout) is substantial, and with the repo in this state I’d just be guessing.

  Rather than check in something half-broken, I’m pausing here. Let me know how you want to
  proceed—if you can land the upstream refactor (or share a stable snapshot of the tests/module),
  I can pick this up again and finish the review fixes in one go."

darkwater

Have you threatened it with a 2 in the next round of performance reviews?

amelius

The problem is that the training data doesn't contain a lot of "I don't know".

pegasus

The bigger problem is that the benchmarks / multiple-choice tests they are trained to optimize for don't distinguish between a wrong answer and "I don't know". Which is stupid and surprising. There was a thread here on HN about this recently.


alyxya

The key point the paper seems to make is that existing benchmarks have relatively low reasoning complexity, so the authors made a new dataset, DeepRD, with arbitrarily large reasoning complexity and demonstrated that existing models fail on a complex enough problem. Complexity is defined from the complexity of a graph created by modeling the problem as a graph and determining the traversals needed to go from some source node to a target node.
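A rough sketch of that graph-based notion of complexity (the actual DeepRD construction is in the paper; the toy generator below is only my approximation): build a random directed graph, state each edge as a fact, ask whether a target node is reachable from a source node, and use the number of traversal hops as the complexity knob.

```python
# Toy approximation of a graph-traversal reasoning problem generator.
# The real DeepRD dataset is described in the paper; this only illustrates the idea
# of dialing up "reasoning complexity" via the length of the required traversal.
import random
from collections import deque

def make_problem(n_nodes, chain_len, extra_edges, seed=0):
    rng = random.Random(seed)
    nodes = list(range(n_nodes))
    path = rng.sample(nodes, chain_len)              # the chain the model must follow
    edges = set(zip(path, path[1:]))
    while len(edges) < len(path) - 1 + extra_edges:  # add distractor edges
        a, b = rng.sample(nodes, 2)
        edges.add((a, b))
    facts = [f"Node {a} connects to node {b}." for a, b in rng.sample(sorted(edges), len(edges))]
    question = f"Starting from node {path[0]}, can you reach node {path[-1]}?"
    return "\n".join(facts), question, path

def shortest_path_len(edges_text, src, dst):
    # Ground-truth check by BFS, so "complexity" = number of hops required.
    adj = {}
    for line in edges_text.splitlines():
        a, b = (int(w) for w in line.replace(".", "").split() if w.isdigit())
        adj.setdefault(a, []).append(b)
    seen, q = {src}, deque([(src, 0)])
    while q:
        node, d = q.popleft()
        if node == dst:
            return d
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                q.append((nxt, d + 1))
    return None

facts, question, path = make_problem(n_nodes=50, chain_len=8, extra_edges=40)
print(question)
print("required hops:", shortest_path_len(facts, path[0], path[-1]))
```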

My main critique is that I don't think there's evidence that this issue would persist after continuing to scale models to be larger and doing more RL. With a harness like what coding agents do these days and with sufficient tool use, I bet models could go much further on that reasoning benchmark. Otherwise, if the reasoning problem were entirely done within a single context window, it's expected that a complex enough reasoning problem would be too difficult for the model to solve.

jeremyjh

The burden of evidence here is on you. They don’t need to prove LRMs can’t scale to meet these problems; their only claim is current models can’t handle these problems. Others will take this up as a challenge - and chances may be good they will overcome it. This is how science works.

usrbinbash

> I don't think there's evidence that this issue would persist after continuing to scale models to be larger and doing more RL

And how much larger do we need to make the models? 2x? 3x? 10x? 100x? How large do they need to get before scaling-up somehow solves everything?

Because: 2x larger, means 2x more memory and compute required. Double the cost or half the capacity. Would people still pay for this tech if it doubles in price? Bear in mind, much of it is already running at a loss even now.

And what if 2x isn't good enough? Would anyone pay for a 10x larger model? Can we even realistically run such models as anything other than a very expensive PoC and for a very short time? And who's to say that even 10x will finally solve things? What if we need 40x? Or 100x?

Oh, and of course: Larger models also require more data to train them on. And while the Internet is huge, it's still finite. And when things grow geometrically, even `sizeof(internet)` eventually runs out ... and, in fact, may have done so already [1] [2]

What if we actually discover that scaling up doesn't even work at all, because of diminishing returns? Oh wait, looks like we did that already: [3]

[1]: https://observer.com/2024/12/openai-cofounder-ilya-sutskever...

[2]: https://biztechweekly.com/ai-training-data-crisis-how-synthe...

[3]: https://garymarcus.substack.com/p/confirmed-llms-have-indeed...

tomlockwood

So the answer is a few more trillion?

code_martial

It’s a worthwhile answer if it can be proven correct because it means that we’ve found a way to create intelligence, even if that way is not very efficient. It’s still one step better than not knowing how to do so.

usrbinbash

> if it can be proven correct

Then the first step would be to prove that this works WITHOUT needing to burn through the trillions to do so.

tomlockwood

So we're spending a trillion on faith?

dankai

This is not the only paper that scales reasoning complexity / difficulty.

The CogniLoad benchmark does this as well (in addition to scaling reasoning length and distractor ratio). It requires the LLM to reason purely from what is in the context (i.e. not from the information it was pretrained on), and it finds that reasoning performance decreases significantly as problems get harder (i.e. as they require the LLM to hold more information in its hidden state simultaneously), but the bigger challenge for the models is length.

https://arxiv.org/abs/2509.18458

Disclaimer: I'm the primary author of CogniLoad so feel free to ask me any questions.
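For readers unfamiliar with the setup, here is a toy illustration of what a "distractor ratio" knob can look like; this is not the actual CogniLoad generator (see the linked paper for that), just a sketch of the idea of burying a relevant fact chain among irrelevant filler of the same shape.

```python
# Toy illustration of a "distractor ratio" knob (NOT the actual CogniLoad generator):
# relevant facts form a chain the model must follow using only the context;
# distractors are irrelevant filler of the same shape.
import random

def make_task(chain_len=5, distractor_ratio=3, seed=0):
    rng = random.Random(seed)
    names = [f"item{i}" for i in range(chain_len + 1)]
    relevant = [f"{a} is inside {b}." for a, b in zip(names, names[1:])]
    distractors = [
        f"thing{rng.randint(0, 999)} is inside thing{rng.randint(0, 999)}."
        for _ in range(distractor_ratio * len(relevant))
    ]
    facts = relevant + distractors
    rng.shuffle(facts)
    question = f"Using only the statements above: what is {names[0]} ultimately inside of?"
    return "\n".join(facts) + "\n\n" + question

print(make_task(chain_len=5, distractor_ratio=3))
```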

equinox_nl

But I also fail catastrophically once a reasoning problem exceeds modest complexity.

davidhs

Do you? Don't you just halt and say this is too complex?

p_v_doom

Nope, audacity and Dunning-Kruger all the way, baby

moritzwarhier

Ah yes, the function that halts if the input problem would take too long to halt.

But yes, I assume you mean they abort their loop after a while, which they do.

This whole idea of a "reasoning benchmark" doesn't sit well with me. It still seems ill-defined to me.

Maybe it's just bias I have or my own lack of intelligence, but it seems to me that using language models for "reasoning" is still more or less a gimmick and convenience feature (to automate re-prompts, clarifications etc, as far as possible).

But reading this pop-sci article from summer 2022, it seems like this definition problem hasn't changed very much since then.

Although it's about AI progress before ChatGPT and it doesn't even mention the GPT base models. Sure, some of the tasks mentioned in the article seem dated today.

But IMO, there is still no AI model that can be trusted to, for example, accurately summarize a Wikipedia article.

Not all humans can do that either, sure. But humans are better at knowing what they don't know, and at deciding which other humans can be trusted. And of course, none of this is an arithmetic or calculation task.

https://www.science.org/content/article/computers-ace-iq-tes...

dspillett

Some would consider that to be failing catastrophically. The task is certainly failed.

carlmr

Halting is sometimes preferable to thrashing around and running in circles.

I feel like if LLMs "knew" when they're out of their depth, they could be much more useful. The question is whether knowing when to stop can be meaningfully learned from examples with RL. From all we've seen, the hallucination problem and this stopping problem both boil down to the same issue: you could teach the model to say "I don't know", but if that's part of the training dataset it might just spit out "I don't know" to random questions, because it's a likely response in the realm of possible responses, rather than saying "I don't know" only when it actually doesn't know.

SocratesAI is still unsolved, and LLMs are probably not the path to get knowing that you know nothing.

LunaSea

I would consider detecting your own limits when trying to solve a problem preferable to the illusion of thinking that your solution is working and correct.

benterix

This seems to be the stance of the creators of agentic coders. They are so bent on creating something, even if that something makes no sense whatsoever.

AlecSchueler

I also fail catastrophically when trying to push nails through walls, but I expect my hammer to do better.

moffkalast

I have one hammer and I expect it to work on every nail and screw. If it's not a general hammer, what good is it now?

arethuza

You don't need a "general hammer" - they are old fashioned - you need a "general-purpose tool-building factory factory factory":

https://www.danstroot.com/posts/2018-10-03-hammer-factories

hshdhdhehd

Gold and shovels might be a more fitting analogy for AI

monkeydust

But you recognise you are likely to fail, and thus don't respond, or redirect the problem to someone who has a greater likelihood of not failing.

exe34

If that were true, we would live in a utopia. People vote/legislate/govern/live/raise/teach/preach without ever learning to reason correctly.

antonvs

I’ve had models “redirect the problem to someone who has a greater likelihood of not failing”. Gemini in particular will do this when it runs into trouble.

I don’t find all these claims that models are somehow worse than humans in such areas convincing. Yes, they’re worse in some respects. But when you’re talking about things related to failures and accuracy, they’re mostly superhuman.

For example, how many humans can write hundreds of lines of code (in seconds, mind you) and regularly not have any syntax errors or bugs?

ffsm8

> For example, how many humans can write hundreds of lines of code (in seconds, mind you) and regularly not have any syntax errors or bugs?

Ez, just use codegen.

Also the second part (not having bugs) is unlikely to be true for the LLM generated code, whereas traditional codegen will actually generate code with pretty much no bugs.

pessimizer

> I’ve had models “redirect the problem to someone who has a greater likelihood of not failing”. Gemini in particular will do this when it runs into trouble.

I have too, and I sense that this is something that has been engineered in rather than coming up naturally. I like it very much and they should do it a lot more often. They're allergic to "I can't figure this out" but hearing "I can't figure this out" gives me the alert to help it over the hump.

> But when you’re talking about things related to failures and accuracy, they’re mostly superhuman.

Only if you consider speed to failure and inaccuracy. They're very much subhuman in output, but you can make them retry a lot in a short time, and refine what you're asking them each time to avoid the mistakes they're repeatedly making. But that's you doing the work.

raddan

Yes, but you are not a computer. There is no point building another human. We have plenty of them.

moritzwarhier

From the abstract:

> some even claiming they are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show existing benchmarks actually have limited complexity

Can someone ELI5 what the definitions of reasoning and complexity are here?

I see they seem to focus on graph problems and representing problems as graph problems. But I didn't completely read the paper or understand it in depth. I skimmed some parts that seem to address this question (e.g. section 5 and the Introduction), but maybe there are simpler definitions that elude me.

Surely they don't mean "computational complexity"?

And what exactly is "reasoning"?

I'm aware of philosophical logic and strict logic that can be applied to natural language arguments.

But have we already agreed on a universal scale that grades answers to questions about the physical world? Or is this about mathematical reasoning?

Mixing all of this together always irks me when it comes to these AI "benchmarks". But apparently people see value in these?

I know my question isn't new.

To me it seems that when we leave the mathematical realm, it quickly becomes fuzzy what correct "reasoning" should be.

People can be convincing and avoid obvious logical fallacies, and still reach wrong conclusions... or conclusions that run counter to assumed goals.

dcre

Even in the mathematical/formal realm, the meaning of reasoning is not as clear as it seems. The result of the activity of reasoning may be a formal argument that can be evaluated according to well-defined rules, but the actual process your mind went through to get there is just as opaque as (or more opaque than) whatever is going on inside LLMs. It seems likely, as you suggest, that we are going to have to define reasoning in terms of the ability to solve certain classes of problems while leaving the character of the process unspecified.

hirako2000

Has anyone ever found an ML/AI paper that makes claims that RLMs can reason?

When I prompt an RLM, I can see it spits out reasoning steps. But I don't find that evidence RLMs are capable of reasoning.

_heimdall

That would require the ability to understand what happens inside the system during inference when the output is created and they can't do that today.

There's no evidence to be had when we only know the inputs and outputs of a black box.

tempfile

I don't understand what point you are making. Doesn't the name "Reasoning language models" claim that they can reason? Why do you want to see it explicitly written down in a paper?

hirako2000

This very paper sits on the assumption that reasoning (to solve puzzles) is at play. It calls those LLMs RLMs.

Imo the paper itself should have touched on the lack of papers discussing what's in the black box that makes them Reasoning LMs. It does mention some tree algorithm supposedly key to reasoning capabilities.

I'm by no means attacking the paper, as its intent is to demonstrate the lack of success at solving puzzles that are simple to formulate but complex.

I was not making a point; I was genuinely asking in case someone knows of papers I could read that make claims, with evidence, that those RLMs actually reason, and how.

Sharlin

Semantics schemantics.

hirako2000

It's a statistical imitation of a reasoning pattern; the underlying mechanism is pattern matching. The ability to create a model that can determine that two radically different words have a strong similarity in meaning doesn't imply the emergence of some generalizable, logical model that can suddenly Reason to solve novel problems.

Pattern matching is a component of reason. Not === reason.

egberts1

It's simple. Don't ingest more than 40KB at a time into the LLM's RAG pipe and its hallucination rate goes way, way down.

Preferably not at the start, and best not to ingest more than 40KB at a time at all.

That's how I learned to deal with nftables' 120KB parser_bison.y file: by breaking it up into clean sections.
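For illustration, a minimal sketch of that kind of chunking (the ~40KB budget comes from the advice above; the blank-line splitting heuristic is just one plausible choice, not necessarily the one used here):

```python
# Minimal sketch: split a large source file into ~40KB chunks on blank-line
# boundaries before feeding each chunk to the model. The 40KB budget comes from
# the comment above; the boundary heuristic is one reasonable choice among many.
CHUNK_BUDGET = 40 * 1024

def chunk_file(path, budget=CHUNK_BUDGET):
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    chunks, current, size = [], [], 0
    for block in text.split("\n\n"):          # try to keep logical sections together
        block_size = len(block.encode("utf-8")) + 2
        if current and size + block_size > budget:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += block_size
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# e.g. chunks = chunk_file("parser_bison.y"); feed the chunks one at a time
```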

All of a sudden, a fully deterministic LL(1) semantic pathway for nftables' CLI syntax appeared before my very eyes (and I spent hours validating it): 100%, and test generators can now permute crazy test cases with relative ease.

Cue in Joe Walsh's "Life's Been Good To Me".

bob_theslob646

Why 40kb?

igravious

and doesn't it depend on the LLM?

egberts1

If you have your Pro or private LLM, then it's a tad bit bigger.

kordlessagain

What specific reasoning capabilities matter for what real-world applications?

Nobody knows.

Moreover, nobody talks about that because it's boring and non-polarizing. Instead, supposedly smart people post stupid comments that prevent anyone from understanding that this paper is worthless.

The paper is worthless because it has a click-bait title. Blog posts get voted down for that, why not this?

The implicit claim is worthless. Failure to navigate a synthetic graph == failure to solve real world problems. False.

Absolutely no connection to real world examples. Just losing the model in endless graphs.

flimflamm

What confuses me is that in the paper all logical steps are given. It basically checks: when all relevant facts are provided explicitly as links, how far and how complex a chain can the model correctly follow before it breaks down?

So it's simpler than "reasoning". This is not necessarily a bad thing, as it boils reasoning down to a simpler, more controlled sub-problem.

kerabatsos

How is that different than human reasoning?

js8

I think the explanation is pretty simple, as I said in my earlier comment: https://news.ycombinator.com/item?id=44904107

I also believe the problem is we don't know what we want: https://news.ycombinator.com/item?id=45509015

If we could make LLMs apply a modest set of logic rules consistently, it would be a win.

Sharlin

That's a pretty big "if". LLMs are by design entirely unlike GOFAI reasoning engines. It's also very debatable whether it makes any sense to try to hack LLMs into reasoning engines when you could just... use a reasoning engine. Or have the LLM defer to one, which would play to their strength as translators.

brap

I wonder if we can get models to reason in a structured and verifiable way, like we have formal logic in math.

Frieren

For that, you already have classical programming. It is great at formal logic and math.

brap

I think trying to accurately express natural language statements as values and logical steps as operators is going to be very difficult. You also need to take into account ambiguity and subtext and things like that.

I actually believe it is technically possible, but is going to be very hard.

nl

This is where you get the natural language tool to write the formal logic.

ChatGPT knows WebPPL really well for example.
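As a hedged sketch of that division of labour (the comment above mentions WebPPL; the example below uses the Z3 solver instead, purely as an illustration): the language model only translates the natural-language statement into formal constraints, and the solver does the actual reasoning.

```python
# Sketch of the "LLM translates, solver reasons" split. Z3 stands in here as one
# example of a reasoning engine; any formal backend would do.
# Requires: pip install z3-solver
from z3 import Bool, Implies, Not, Solver, unsat

# Constraints an LLM might emit for: "Socrates is a man; all men are mortal."
man = Bool("socrates_is_man")
mortal = Bool("socrates_is_mortal")

s = Solver()
s.add(man)                   # premise 1: Socrates is a man
s.add(Implies(man, mortal))  # premise 2: all men are mortal (instantiated for Socrates)

# "Socrates is mortal" follows iff the premises plus its negation are unsatisfiable.
s.push()
s.add(Not(mortal))
print("conclusion follows:", s.check() == unsat)  # True
s.pop()
```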