Does RL Incentivize Reasoning in LLMs Beyond the Base Model?
6 comments
April 22, 2025 · spwa4
sitkack
My gut feeling when using DeepSeek is that its performance is a lot smoother, the responses feel more robust and not as brittle.
cma
I'm pretty sure RL causes catastrophic forgetting of its base knowledge and that's why o3 hallucinates so much more.
If you mess around with trained weights you're going to delete some base knowledge, at least the knowledge that falls outside the tasks you RL on.
macleginn
‘Crucially, all correct solutions from RL-trained models already exist in the base model's distribution, proving RLVR enhances sampling efficiency, not reasoning capacity, while inadvertently shrinking the solution space.’ — wouldn't any kind of RL fail to converge, or even progress at all, if the solutions weren't already in the base model's distribution? The way training is set up, the models absolutely need to be able to find right solutions in a reasonable time, otherwise there wouldn't be any training signal.
psb217
That depends a bit on the length of the RL training and the distribution of problems you're training on. You're correct that RL won't get any "traction" (via positive rewards) on problems where good behavior isn't already in the model's behavior distribution.
However, if you're training on many problems, it's possible in principle that if you have traction on _any_ of the problems, then the learning signal you get from success on those problems will have a positive effect on the model's behavior on other problems. Ie, the learning that you do on problems where the model is already producing positive reward behavior will nudge the model towards producing positive reward behavior on problems where it wasn't previously doing so.
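To make the "traction" point concrete, here's a minimal sketch assuming a GRPO-style group-baselined reward, which is one common RLVR recipe (an assumption for illustration, not necessarily what the paper used). Prompts where every sampled answer is wrong produce identical rewards, hence zero advantages and no gradient; prompts the base model already solves sometimes do produce a gradient.

    import numpy as np

    def group_advantages(rewards, eps=1e-6):
        # Group-normalized advantages for the sampled answers to one prompt:
        # subtract the group mean and divide by the group std. If every sample
        # gets the same reward (e.g. all wrong), the advantages are all zero and
        # the prompt contributes nothing to the policy gradient.
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + eps)

    # Prompt A: the base policy already gets it right sometimes -> usable signal.
    print(group_advantages([1, 0, 0, 0]))   # approx [ 1.73 -0.58 -0.58 -0.58]

    # Prompt B: no sampled answer is correct -> zero signal, no "traction".
    print(group_advantages([0, 0, 0, 0]))   # [0. 0. 0. 0.]

The nonzero update from prompt A still moves parameters shared across prompts, which is the mechanism by which success on some problems can shift behavior on problems the model hasn't solved yet.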
yorwba
They write "We manually inspect CoT validity to ensure correct answers stem from valid reasoning, not lucky guesses." but the example answer they show at the end only gets the correct number due to two errors canceling out. The model calculates 195+367+562+900 and gets 1924 instead of 2024, and also turns -437 - 2*234 into -805 instead of -905, but in total 1924-805 = 2024-905 = 1119 and from there the remaining steps are correct again.
It would be interesting to know how much of the sampling efficiency improvement from reinforcement learning is due to being better at basic arithmetic (something which could also be achieved by giving the model access to a calculator tool) and how much is due to choosing the correct approach for solving the problem more often.
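For what it's worth, a quick check of the cancellation described above, just re-running the numbers quoted from the example answer (nothing here comes from the paper itself):

    correct_sum = 195 + 367 + 562 + 900    # 2024
    model_sum = 1924                       # what the model wrote
    correct_offset = -437 - 2 * 234        # -905
    model_offset = -805                    # what the model wrote

    print(correct_sum + correct_offset)    # 1119
    print(model_sum + model_offset)        # 1119 -- the two slips cancel exactly

Both slips are off by exactly 100 in opposite directions, which is why the rest of the chain still lands on the right answer.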
I don't like papers that ask a question in the title, so here's the answer:
"RL boosts sampling efficiency but reduces the reasoning capacity boundary."
Perhaps better to put it like this: given one or a few attempts, RL-trained models beat non-RL models; given many attempts, non-RL models come up with better answers.
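That framing matches how these comparisons are usually scored with pass@k. Here's a minimal sketch using the standard unbiased pass@k estimator (Chen et al., 2021); the per-problem counts below are made-up toy numbers for illustration, not figures from the paper:

    from math import comb

    def pass_at_k(n, c, k):
        # Unbiased estimator of pass@k given n samples per problem, c of them
        # correct (Chen et al., 2021): the probability that at least one of k
        # randomly chosen samples is correct.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    n = 256  # samples drawn per problem

    # Made-up per-problem correct counts: the RL-tuned model is very reliable on
    # a narrower set of problems; the base model occasionally solves a wider set.
    rl_correct = [200, 180, 0, 0, 0]
    base_correct = [60, 40, 3, 2, 1]

    for k in (1, 256):
        rl = sum(pass_at_k(n, c, k) for c in rl_correct) / len(rl_correct)
        base = sum(pass_at_k(n, c, k) for c in base_correct) / len(base_correct)
        print(f"k={k:3d}  RL pass@k={rl:.2f}  base pass@k={base:.2f}")

With toy numbers like these, the RL model wins at k=1 because it is very reliable on the problems it can solve, while the base model wins at large k because it eventually solves a wider set of problems.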