Dispelling misconceptions about RLHF

29 comments · August 17, 2025

josh-sematic

The mechanisms the author describes are used for RLHF, but they are not sufficient for training the recent slew of “reasoning models.” To do that, you have to generate rewards not based on proximity to some reference full-answer transcript, but rather based on how well the final answer (e.g. the part after the “thinking tokens”) meets your reward criteria. This turns out to be a lot harder to do than the mechanisms used for RLHF, which is one reason why we had RLHF for a while before we got the “reasoning models.” It’s also the only way you can understand the Sutskever quote “You’ll know your RL is working when the thinking tokens are no longer English” (a paraphrase, pulled from my memory).
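To make that contrast concrete, here is a minimal sketch of the two reward styles (the "</think>" delimiter, the reference transcript, and the check_answer grader are placeholders for illustration, not anyone's actual pipeline):

    import difflib

    def rlhf_style_reward(response: str, reference: str) -> float:
        # Score by proximity to a full reference transcript.
        return difflib.SequenceMatcher(None, response, reference).ratio()

    def reasoning_style_reward(response: str, check_answer) -> float:
        # Ignore the thinking tokens entirely; only grade the final answer.
        final_answer = response.split("</think>")[-1].strip()
        return 1.0 if check_answer(final_answer) else 0.0

    # Verifiable task: reward depends only on whether the extracted answer is right.
    r = reasoning_style_reward(
        "<think>6*7 = 42...</think> The answer is 42.",
        check_answer=lambda ans: "42" in ans,
    )

Nothing in the second function constrains the intermediate tokens, which is part of why the thinking text is free to drift away from readable English.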

pas

sorry, could you explain why it is harder and where the complexity creeps in (compared to some naive "pattern matching the end of the response" tactic)? thanks!

Lerc

The pattern matching compares what was said against an example of what a correct response could say.

Checking a token at a time evaluates whether it is going to produce a correct final answer. The intermediate text can be whatever the model needs to arrive at that answer, but training at the per-token level means training the very tokens you want to give the model leeway to consider. So it needs another model to adjudicate how well things are going from incomplete answers.

I'm not sure how much the adjudicator evaluates based upon knowing the final answer or based upon the quality of the reasoning of the model being trained. I'd be inclined to train two adjudicators, one that knows the answers and one that doesn't. I'm sure there would be interesting things to see in their differential signal.

markisus

Just speculating, but proximity to a reference answer is a much denser reward signal. In contrast, parsing out a final answer into a pass/fail only provides a sparse reward signal.
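Roughly, the difference in signal density looks like this (the tokenization and the correctness check are placeholders):

    def dense_rewards(sampled_tokens, reference_tokens):
        # One learning signal per position: did we match the reference token?
        return [1.0 if s == r else 0.0
                for s, r in zip(sampled_tokens, reference_tokens)]

    def sparse_rewards(sampled_tokens, is_correct):
        # Zero everywhere except a single pass/fail on the whole sequence.
        rewards = [0.0] * len(sampled_tokens)
        rewards[-1] = 1.0 if is_correct(sampled_tokens) else 0.0
        return rewards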

einrealist

> “Successful” is importantly distinct from “correct.”

This is the most important sentence describing the fundamental issue that LLMs have. This severely limits the technology's useful applications. Yet OpenAI and others constantly lie about it.

The article very clearly explains why models won't be able to generalise unless RL is performed constantly. But that's not scalable and has other problems of its own. For example, it still runs into the paradox that the training mechanism has to know the answer in order to formulate the question. (This is precisely where the concept of World Models comes in, and why symbolism becomes more important.)

LLMs perform well in highly specialised scenarios with a well-defined and well-known problem space. It's probably possible to increase accuracy and correctness by using lots of interconnected models that can perform RL with each other. Again, this raises questions of scale and feasibility. But I think our brains (together with the other organs) work this way.

getnormality

Agreed re the "successful" discussion; we're getting a much-appreciated essential point here. I think it would be slightly better expressed by simply saying that we want a 0% error rate. Giving a correct answer and saying "I don't know" are both just ways of avoiding error.

Foreignborn

can you say more about world models or symbolism?

i thought world models like genie 3 would be the training mechanism, but i likely misunderstand.

einrealist

A World Model is a theoretical type of model that has knowledge about the "real world" (or whatever world or bounds you define). It can infer causalities from concepts within this world.

Yes, you can use Genie 3 to train other models. It's far from perfect. You still need to train Genie 3, and its training and outputs must be useful in the context of what you want to train other models with. That's a paradox. The feedback loop needs to produce useful results. And Genie 3 can still hallucinate or produce implausible responses. Symbolism is a wide term, but a "World Model" needs it to make sense of the relations between concepts (e.g. ontologies, or the relation of movement and gravity).

logicchains

>The feedback loop needs to produce useful results. And Genie 3 can still hallucinate or produce implausible responses

The solution to this is giving the model a physical body and actually letting it interact with the real world and learn from it. But no lab dares to try this because allowing a model to learn from experience would mean allowing it to potentially change its views/alignment.

vertere

I'm confused about their definition of RL.

> ... SFT is a subset of RL.

> The first thing to note about traditional SFT is that the responses in the examples are typically human written. ... But it is also possible to build the dataset using responses from the model we’re about to train. ... This is called Rejection Sampling.

I can see why someone might say there's overlap between RL and SFT (or semi-supervised FT), but how is "traditional" SFT considered RL? What is not RL then? Are they saying all supervised learning is a subset of RL, or only if it's fine tuning?

unoti

> I can see why someone might say there's overlap between RL and SFT (or semi-supervised FT), but how is "traditional" SFT considered RL? What is not RL then? Are they saying all supervised learning is a subset of RL, or only if it's fine tuning?

Sutton and Barto define reinforcement learning as "learning what to do - how to map situations to actions - so as to maximize a numerical reward signal". This is from their textbook on the topic.

That's a pretty broad definition. But the general formulation of RL involves a state of the world and the ability to take different actions given that state. In the context of an LLM, the state could be what has been said so far, and the action could be what token to produce next.

But as you noted, if you take such a broad definition of RL, tons of machine learning is also RL. When people talk about RL they usually mean the more specific thing of letting a model go try things and then be corrected based on the observations of how that turned out.

Supervised learning defines success by matching the labels. Unsupervised learning is about optimizing a known math function (for example, predicting the likelihood that words would appear near each other). Reinforcement learning maximizes a reward function that may not be directly known by the model, and it learns to optimize it by trying things, observing the results, and getting a reward or penalty.
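A toy, dependency-free sketch of that last distinction: the policy below never sees a labelled "correct" token, only a scalar reward for whatever it sampled (the two-token vocabulary and the reward function are made up):

    import math, random

    vocab = ["yes", "no"]
    logits = {t: 0.0 for t in vocab}   # the policy's learnable parameters
    lr = 0.1

    def policy_probs():
        z = sum(math.exp(v) for v in logits.values())
        return {t: math.exp(v) / z for t, v in logits.items()}

    def sample(probs):
        r, acc = random.random(), 0.0
        for t, p in probs.items():
            acc += p
            if r <= acc:
                return t
        return t

    def reward(token):
        # External reward signal; the "correct" token is never shown to the model.
        return 1.0 if token == "yes" else 0.0

    for _ in range(500):
        probs = policy_probs()
        action = sample(probs)   # try something...
        R = reward(action)       # ...observe how it turned out
        # REINFORCE update: raise the log-probability of the sampled action,
        # scaled by the reward it earned (supervised learning would instead
        # push directly toward a given label).
        for t in vocab:
            grad = (1.0 if t == action else 0.0) - probs[t]
            logits[t] += lr * R * grad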

andy99

A couple of things I've seen go by that make the connection. I haven't looked at them closely enough to have an opinion.

> https://arxiv.org/abs/2507.12856

> https://justinchiu.netlify.app/blog/sftrl/

macleginn

Everything the post says about the behaviour of OpenAI models seems to be based on pure speculation.

yorwba

Yeah, in my opinion you can just skip that part and go straight to the author's description of failing to train their own model at first and what they ended up changing to make it work: https://aerial-toothpaste-34a.notion.site/How-OpenAI-Misled-...

williamtrask

Nit: the author says that supervised fine tuning is a type of RL, but it is not. RL is about delayed reward. Supervised fine tuning is not in any way about delayed reward.

jampekka

RL is about getting numerical feedback of outputs, in contrast to supervised learning where there are examples of what the output should be. There are many RL problems with no delayed rewards, e.g. multi-armed bandits.

Admittedly, most interesting cases do have delays.
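For instance, a bandit gets an immediate numerical reward for each pull, with no delay and no labelled correct arm (a minimal epsilon-greedy sketch with made-up payout probabilities):

    import random

    payout = [0.2, 0.5, 0.8]        # hidden per-arm win probabilities (made up)
    values = [0.0] * len(payout)    # running estimate of each arm's reward
    counts = [0] * len(payout)

    for step in range(1000):
        # Epsilon-greedy: mostly exploit the best estimate, sometimes explore.
        if random.random() < 0.1:
            arm = random.randrange(len(payout))
        else:
            arm = values.index(max(values))
        reward = 1.0 if random.random() < payout[arm] else 0.0   # immediate feedback
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]      # incremental mean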

ProofHouse

Well, they can be used together in some contexts, so while they are different, you could also say RL can help supervised fine-tuning for further optimization.

tempusalaria

SFT is part of the classic RLHF process though

m-s-y

RLHF -> Reinforcement Learning from Human Feedback

It’s not defined until the 13th paragraph of the linked article.

mehulashah

This article is really two articles: one that describes RL, and one about how they applied it. The former was quite helpful because it demystified much of the jargon that I find in AI. All branches of science have jargon. I find the AI ones especially impenetrable.

Nevermark

Another way to do reinforcement learning is to train a model to judge the quality of its own answers, matching judgements from experts (or synthetically created ones), until it develops an ability to judge its answer quality even if it can't yet use that information to improve its responses.

It can be easier to recognize good responses than generate them.

Then feed it queries, generating its responses and judgements. Instead of training the responses to match response data, train it to output a high positive judgement, while holding its "judgement" weight values constant. Because the frozen judgement weights are what the error is back-propagated through, they act as a distributor of information from the judgement back to how the responses should change to improve, so the model is effectively being trained to give better answers.

Learn to predict/judge what is good or bad. Then learn to maximize good and minimize bad using the judgment/prediction as a proxy for actual feedback.
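A deliberately simplified sketch of that freeze-the-judge step, using small continuous toy vectors so the gradient can flow straight through the judge into the generator (with sampled tokens you would need a policy-gradient method instead; the sizes and architectures here are arbitrary):

    import torch
    import torch.nn as nn

    # Toy stand-ins: a "generator" maps a query vector to a response vector,
    # and a "judge" scores a (query, response) pair.
    generator = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 16))
    judge = nn.Sequential(nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))

    # Phase 1 (not shown): fit `judge` to match expert quality judgements.

    # Phase 2: freeze the judge's weights...
    for p in judge.parameters():
        p.requires_grad = False

    # ...and train only the generator to maximize the judge's score. Gradients
    # flow *through* the frozen judge back into the generator's weights.
    opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
    for _ in range(100):
        query = torch.randn(8, 16)                 # batch of toy queries
        response = generator(query)
        score = judge(torch.cat([query, response], dim=-1)).mean()
        loss = -score                              # maximize the judged quality
        opt.zero_grad()
        loss.backward()
        opt.step()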

This technique is closer to traditional human/animal reinforcement learning.

We learn to predict situations that will cause us pain or positive affect, then learn to choose actions that minimize our predictions of bad and maximize our predictions of good. That is a much more efficient way to learn than the expense of having to actually experience everything and always get explicit external feedback.

There are many, many ways to do reinforcement learning.

varispeed

The snag is: 'experts' aren’t neutral oracles. Many are underpaid and end up parroting whoever funds them. Lobby groups quietly buy authority all the time. So the real challenge isn’t just training on expert judgments, it’s making the model sharp enough to spot the BS in those judgments - otherwise you’re just encoding the bias straight into the weights.

htfu

Which is why the foundation players must soon take on the additional role of being an ad buyer.

Interactive stuff, within content. A mini game in a game, school homework of course, or "whichever text box the viewer looks at longest by WorldCoin Eyeball Tracker for Democracy x Samsung" for an interstitial turned captcha.

Better hope your taste isn't too bland and derivative!

Amazon and Ali soon lap the field by allowing coupon farming, but somehow eventually end up where they started.

byyoung3

this seems to disagree with a lot of research showing RL is not necessary for reasoning -- I'm not sure about alignment

schlipity

The site is designed poorly and is stopping me from reading the article. I use NoScript, and it immediately redirects me to a "Hey you don't have javascript enabled, please enable it to read" page that is on a different domain from the website the article is on. I tried to visit notion.site to try and whitelist it temporarily, but it redirects back to notion.so and notion.com.

Rather than jump through more hoops, I'm just going to give up on reading this one.