Understanding Reasoning LLMs
aithrowawaycomm
I like Raschka's writing, even if he is considerably more optimistic about this tech than I am. But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true; they are incapable of even the simplest "out-of-distribution" deductive reasoning: https://xcancel.com/JJitsev/status/1883158738661691878
What they are certainly capable of is a wide variety of computations that simulate reasoning, and maybe that's good enough for your use case. But it is unpredictably brittle unless you spend a lot on o1-pro (and even then...). Raschka has a line about how "whether and how an LLM actually 'thinks' is a separate discussion," but this isn't about semantics. R1 clearly sucks at deductive reasoning, and you will not understand "reasoning" LLMs if you take DeepSeek's claims at face value.
It seems especially incurious of him to copy-paste the "aha moment" from DeepSeek's technical report without critically investigating it. DeepSeek's claims are unscientific, offered without real evidence, and seem focused on hype and investment:
This moment is not only an "aha moment" for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies.
The "aha moment" serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.
Perhaps it was able to solve that tricky Olympiad problem, but there is an infinite variety of 1st-grade math problems it is not able to solve. I doubt it's even reliably able to solve simple variations of that root problem. Maybe it is! But it's frustrating how little skepticism there is about CoT, reasoning traces, etc.
UniverseHacker
> they are incapable of even the simplest "out-of-distribution" deductive reasoning
But the link demonstrates the opposite: these models absolutely are able to reason out of distribution, just not with perfect fidelity. The fact that they can do better than random is itself really impressive. And o1-preview does impressively well, only very rarely getting the wrong answer on variants of that Alice in Wonderland problem.
If you believed most of the people critical of LLMs who call them "stochastic parrots," it should be impossible for them to do better than random on any out-of-distribution problem. Even just changing one number to create a novel math problem should totally stump them and result in entirely random outputs, but it does not.
Overall, poor reasoning that is better than random but frequently gives the wrong answer is fundamentally and categorically different from being incapable of reasoning.
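To make the "changing one number" point concrete, here is a minimal sketch of how such variants can be generated from the AIW-style template used in that work (the wording here is paraphrased, not the paper's exact prompt); the intended answer is the sister count plus Alice herself:

    import random

    # Paraphrased AIW-style template; each variant changes only the numbers.
    TEMPLATE = ("Alice has {b} brothers and she also has {s} sisters. "
                "How many sisters does Alice's brother have?")

    def make_variant(rng: random.Random) -> tuple[str, int]:
        b, s = rng.randint(1, 9), rng.randint(1, 9)
        # Alice's brother has Alice's sisters plus Alice herself.
        return TEMPLATE.format(b=b, s=s), s + 1

    rng = random.Random(0)
    for _ in range(3):
        prompt, answer = make_variant(rng)
        print(prompt, "-> expected:", answer)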
Jensson
> If you believed most of the people critical of LLMs who call them "stochastic parrots," it should be impossible for them to do better than random on any out-of-distribution problem. Even just changing one number to create a novel math problem should totally stump them and result in entirely random outputs, but it does not.
You don't seem to understand how they work: they recurse over their solution, meaning that if they have remembered components they parrot back sub-solutions. It's a bit like a natural-language computer; that way you can get them to do math etc., although the instruction set isn't that of a Turing-complete language.
They can't recurse over sub-sub-parts they haven't seen, but problems that have similar sub-parts can of course be solved; anyone understands that.
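For what it's worth, the "recursing" being described is just the autoregressive loop: each generated token is appended to the input and fed back through the model. A minimal sketch with Hugging Face transformers (gpt2 is only a stand-in model here):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("Alice has 3 brothers and 2 sisters. Her brother has",
              return_tensors="pt").input_ids
    for _ in range(20):                          # greedy decoding, 20 steps
        logits = model(ids).logits               # forward pass over everything so far
        next_id = logits[0, -1].argmax()         # most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)  # feed it back in
    print(tok.decode(ids[0]))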
danielmarkbruce
anyone saying an LLM is a stochastic parrot doesn't understand them... they are just parroting what they heard.
bloomingkales
There is definitely a mini cult of people that want to be very right about how everyone else is very wrong about AI.
Legend2440
>But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true; they are incapable of even the simplest "out-of-distribution" deductive reasoning:
That's not actually what your link says. The tweet says that it solves the simple problem (the one they originally designed to foil base LLMs), so they had to invent harder problems until they found one it could not reliably solve.
suddenlybananas
Did you see how similar the more complicated problem is? It's nearly the exact same problem.
blovescoffee
The other day I fed a complicated engineering doc for an architectural proposal at work into R1. I incorporated a few great suggestions into my work. Then my work got reviewed very positively by a large team of senior/staff+ engineers (most with experience at FAANG, i.e. credibly solid engineers). R1 was really useful! Sorry you don’t like it, but I think it’s unfair to say it sucks at reasoning.
scarmig
> But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true; they are incapable of even the simplest "out-of-distribution" deductive reasoning: https://xcancel.com/JJitsev/status/1883158738661691878
Your link says that R1, not all models like R1, fails at generalization.
Of particular note:
> We expose DeepSeek R1 to the variations of AIW Friends problem and compare model behavior to o1-preview, o1-mini and Claude 3.5 Sonnet. o1-preview handles the problem robustly, DeepSeek R1 shows strong fluctuations across variations with distribution very similar to o1-mini.
Legend2440
The way the authors talk about LLMs really rubs me the wrong way. They spend more of the paper talking up the 'claims' about LLMs that they are going to debunk than actually doing any interesting study.
They came into this with the assumption that LLMs are just a cheap trick. As a result, they deliberately searched for an example of failure, rather than trying to do an honest assessment of generalization capabilities.
suddenlybananas
>They came into this with the assumption that LLMs are just a cheap trick. As a result, they deliberately searched for an example of failure, rather than trying to do an honest assessment of generalization capabilities.
And lo and behold, they still found a glaring failure. You can't fault them for not buying into the hype.
vector_spaces
Is there any work being done in training LLMs on more restricted formal languages? Something like a constraint solver or automated theorem prover, but much lower level. Specifically something that isn't natural language. That's the only path I could see towards reasoning models being truly effective
I know there is work being done with e.g. Lean integration with ChatGPT, but that's not exactly what I mean -- there's still this shaky natural-language-trained-LLM glue in the driver's seat.
Like, I'm envisioning something that has the creativity to try different things, but that then JIT-compiles its chain of thought and avoids bad paths.
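A rough sketch of that generate-then-verify idea, with a tiny arithmetic checker standing in for the constraint solver or theorem prover (the candidate "chains of thought" are hard-coded here; nothing in this sketch comes from an existing system):

    import ast
    import operator as op

    OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

    def evaluate(node):
        """Evaluate an AST containing only numeric literals and + - * /."""
        if isinstance(node, ast.Expression):
            return evaluate(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError("not plain arithmetic")

    def verify(step: str) -> bool:
        """Accept only steps of the form '<expr> == <expr>' whose sides are equal."""
        try:
            lhs, rhs = step.split("==")
            return evaluate(ast.parse(lhs, mode="eval")) == evaluate(ast.parse(rhs, mode="eval"))
        except Exception:
            return False

    candidate_paths = [
        ["2*21 == 42", "42+1 == 43"],   # every step checks out -> keep this path
        ["2*21 == 44", "44+1 == 45"],   # first step is false   -> prune this path
    ]
    print([p for p in candidate_paths if all(verify(s) for s in p)])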
mindwok
How would that be different from something like ChatGPT executing Lean? That's exactly what humans do: we have messy reasoning that we then write down in formal logic and compile to see if it holds.
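As a toy version of that "compile to see if it holds" step, in Lean 4 a true arithmetic claim elaborates while a false one is rejected by the checker:

    -- a correct statement type-checks
    example : 2 * 21 = 42 := rfl
    -- a false one fails to compile (uncommenting it breaks the build)
    -- example : 2 * 21 = 44 := rfl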
gsam
In my mind, the pure reinforcement learning approach of DeepSeek is the most practical way to do this. Essentially it needs to continually refine and find more sound(?) subspaces of the latent (embedding) space. Now this could be the subspace which is just Python code (or some other human-invented subspace), but I don't think that would be optimal for the overall architecture.
The reason this seems like the most reasonable path is that when you impose restrictions like that, you hamper search viability (and in a high-dimensional space that's a massive loss, because you can arrive at a result from many directions). It's like regular genetic programming vs. typed genetic programming: when you discard all your useful results, you can't go anywhere near as fast. There will be a threshold where constructivist, generative schemes (e.g. reasoning with automata and all kinds of fun we've neglected) will be the way forward, but I don't think we've hit that point yet. It seems to me that such a point does exist, because if you have fast heuristics on when types unify, you no longer hamper the search speed but gain many benefits in soundness.
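A toy illustration of that last point (the operator table and candidates below are made up, nothing DeepSeek-specific): a cheap unification check prunes ill-typed candidates before any expensive execution or scoring, which is the "soundness without losing search speed" trade being described.

    import random

    # op name -> (argument types, return type); a stand-in for a real type system
    OPS = {"len": (["str"], "int"), "upper": (["str"], "str"), "abs": (["int"], "int")}

    def random_candidate(rng):
        return rng.choice(list(OPS)), rng.choice(["str", "int"])

    def types_unify(op_name, arg_type):
        return OPS[op_name][0] == [arg_type]     # cheap check, no execution needed

    rng = random.Random(0)
    candidates = [random_candidate(rng) for _ in range(10)]
    well_typed = [c for c in candidates if types_unify(*c)]
    print(f"{len(well_typed)}/{len(candidates)} candidates survive the type filter")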
One of the greatest human achievements of all time is probably this latent embedding space -- one that we can actually interface with. It's a new lingua franca.
These are just my cloudy current thoughts.
danielmarkbruce
fwiw, most people don't really grok the power of latent space wrt language models. Like, you say it, I believe it, but most people don't really grasp it.
ngneer
Nice article.
>Whether and how an LLM actually "thinks" is a separate discussion.
The "whether" is hardly a discussion at all. Or, at least one that was settled long ago.
"The question of whether a computer can think is no more interesting than the question of whether a submarine can swim."
--Edsger Dijkstra
cwillu
The document that quote comes from is hardly a definitive discussion of the topic.
“[…] it tends to divert the research effort into directions in which science can not—and hence should not try to—contribute.” is a pretty myopic take.
dhfbshfbu4u3
Great post, but every time I read something like this I feel like I am living in a prequel to the Culture.
prideout
This article has a superb diagram of the DeepSeek training pipeline.
dr_dshiv
How important is it that the reasoning takes place in another thread versus just chain-of-thought in the same thread? I feel like it makes a difference, but I have no evidence.
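For what it's worth, with R1-style models the practical difference mostly comes down to whether the reasoning block stays in the context for the next turn or gets stripped out. A rough sketch (the <think> tag is DeepSeek's convention; other models handle this differently):

    import re

    def strip_reasoning(reply: str) -> str:
        """Drop the <think>...</think> block, keeping only the final answer."""
        return re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL).strip()

    reply = ("<think>Alice's brother shares her 2 sisters, plus Alice herself: 3.</think>"
             "The answer is 3.")

    same_thread_history = [{"role": "assistant", "content": reply}]             # CoT kept
    separate_thread_history = [{"role": "assistant", "content": strip_reasoning(reply)}]
    print(separate_thread_history[0]["content"])   # -> "The answer is 3."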
gibsonf1
There are no LLMs that reason; it's an entirely different statistical process compared to human reasoning.
tmnvdb
"There are no LLMS that reason" is a claim about language, namely that the word 'reason' can only ever be applied to humans.
oxqbldpxo
Amazing accomplishments by the brightest minds, only to be used to write history by the stupidest people.
behnamoh
doesn't it seem like these models are getting to the point where even conceiving of their training and development is less and less possible for the general public?
I mean, we already knew only a handful of companies with capital could train them, but at least the principles, algorithms, etc. were accessible to individuals who wanted to create their own - much simpler - models.
it seems that era is quickly ending, and we are entering the era of truly "magic" AI models that no one understands, because companies keep their secret sauces to themselves...
fspeech
Recent developments like V3, R1 and S1 are actually clarifying and pointing towards more understandable, efficient and therefore more accessible models.
HarHarVeryFunny
I don't think it's realistic to expect to have access to the same training data as the big labs that are paying people to generate it for them, but hopefully there will be open source ones that are still decent.
At the end of the day, current o1-like reasoning models are still just fine-tuned LLMs, and they don't even need RL if you have access to (or can generate) a suitable training set. The DeepSeek R1 paper outlined their bootstrapping process, and HuggingFace (and no doubt others) are trying to duplicate it.
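As a sketch of what the "just fine-tuning, no RL" route can look like, distilled (question, reasoning trace, answer) records are simply formatted into SFT text; the field names below are made up, and the <think> delimiter just follows DeepSeek's convention rather than their actual data format:

    # Illustrative records, not real DeepSeek training data.
    records = [
        {"question": "Alice has 3 brothers and 2 sisters. How many sisters does her brother have?",
         "trace": "Her brother has the same 2 sisters plus Alice herself, so 3.",
         "answer": "3"},
    ]

    def to_sft_text(rec: dict) -> str:
        """Format one record as a single supervised fine-tuning example."""
        return (f"User: {rec['question']}\n"
                f"Assistant: <think>{rec['trace']}</think>\n{rec['answer']}")

    for rec in records:
        print(to_sft_text(rec))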
antirez
In recent weeks what's happening is exactly the contrary.
Nice explainer. The R1 paper is a relatively easy read. Very approachable, almost conversational.
I say this because I am constantly annoyed by poor, opaque writing in other instances. In this case, DS doesn’t need to try to sound smart. The results speak for themselves.
I recommend anyone who is interested in the topic to read the R1 paper, their V3 paper, and DeepSeekMath paper. They’re all worth it.