
RLHF Book

40 comments · February 1, 2025

alexhutcheson

Glad to see the author making a serious effort to fill the gap in public documentation of RLHF theory and practice. The current state of the art seems to be primarily documented in arXiv papers, but each paper is more like a "diff" than a "snapshot" - you need to patch together the knowledge from many previous papers to understand the current state. It's extremely valuable to "snapshot" the current state of the art in a way that is easy to reference.

My friendly feedback on this work-in-progress: I believe it could benefit from more introductory material to establish motivations and set expectations for what is achievable with RLHF. In particular, I think it would be useful to situate RLHF in comparison with supervised fine-tuning (SFT), which readers are likely familiar with.

Stuff I'd cover (from the background of an RLHF user but non-specialist):

Advantages of RLHF over SFT:

- Tunes on the full generation (which is what you ultimately care about), not just token-by-token.

- Can tune on problems where there are many acceptable answers (or ways to word the answer), and you don't want to push the model into one specific series of tokens.

- Can incorporate negative feedback (e.g. don't generate this).

Disadvantages of RLHF over SFT:

- Regularization (KL or otherwise) puts an upper bound on how much impact RLHF can have on the model (sketched just after this list). Because of this, RLHF is almost never enough to get you "all the way there" by itself.

- Very sensitive to reward model quality, which can be hard to evaluate.

- Much more resource and time intensive.
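
To make the first point concrete, the objective most RLHF setups optimize is roughly the reward model's score minus a KL penalty against the reference (SFT) model. A minimal sketch, with illustrative names (not from the book):

  def penalized_reward(rm_score, policy_logprob, ref_logprob, beta=0.05):
      # rm_score: the reward model's score for a sampled completion.
      # policy_logprob / ref_logprob: log-prob of that completion under the
      # current policy and the frozen reference (SFT) model.
      # The (policy - reference) term is a simple per-sequence KL estimate:
      # the further the policy drifts, the more reward it gives up.
      return rm_score - beta * (policy_logprob - ref_logprob)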

Non-obvious practical considerations:

- How to evaluate quality? If you have a good measurement of quality, it's tempting to just incorporate it in your reward model. But you want to make sure you're able to measure "is this actually good for my final use-case", not just "does this score well on my reward model?".

- How prompt engineering interacts with fine-tuning (both SFT and RLHF). Often some iteration on the system prompt will make fine-tuning converge faster, and with higher quality. Conversely, attempting to tune on examples that don't include a task-specific prompt (surprisingly common) will often yield subpar results. This is a "boring" implementation detail that I don't normally see included in papers.

Excited to see where this goes, and thanks to the author for willingness to share a work in progress!

gr3ml1n

SFT can be used to give negative feedback/examples. That's one of the lesser-known benefits/tricks of system messages. E.g.:

  System: You are a helpful chatbot.
  User: What is 1+1?
  Assistant: 2.
And

  System: You are terrible at math.
  User: What is 1+1?
  Assistant: 0.

cratermoon

    System: It's a lovely morning in the village and you are a horrible goose.
    User: Throw the rake into the lake

cratermoon

Is there not a survey paper on RLHF equivalent to the "A Survey on Large Language Model based Autonomous Agents" paper? Someone should get on that.

_giorgio_


https://arxiv.org/abs/2412.05265

Reinforcement Learning: An Overview, by Kevin Murphy

    This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement learning and sequential decision making, covering value-based RL, policy-gradient methods, model-based methods, and various other topics (including a very brief discussion of RL+LLMs). 

kadushka

Has r1 made RLHF obsolete?

alexhutcheson

DeepSeek-R1 had an RLHF step in their post-training pipeline (section 2.3.4 of their technical report[1]).

In addition, the "reasoning-oriented reinforcement learning" step (section 2.3.2) used an approach that is almost identical to RLHF in theory and implementation. The main difference is that they used a rule-based reward system, rather than a model trained on human preference data.
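
To give a feel for what "rule-based" means there, here is a minimal sketch (hypothetical; the paper describes accuracy and format rewards, but the actual checker isn't released as code):

  import re

  def rule_based_reward(response: str, ground_truth: str) -> float:
      reward = 0.0
      # Format reward: reasoning should be wrapped in <think>...</think> tags.
      if re.search(r"<think>.*</think>", response, flags=re.DOTALL):
          reward += 0.5
      # Accuracy reward: the final boxed answer must match the reference.
      match = re.search(r"\\boxed\{([^}]*)\}", response)
      if match and match.group(1).strip() == ground_truth.strip():
          reward += 1.0
      return reward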

If you want to train a model like DeepSeek-R1, you'll need to know the fundamentals of reinforcement learning on language models, including RLHF.

[1] https://arxiv.org/pdf/2501.12948

bryan0

Yes, but these steps were not used in R1-Zero, where its reasoning capabilities were trained.

natolambert

As the other commenter said, R1 required very standard RLHF techniques too. But a fun way to think about it is that reasoning models are going to be bigger and uplift the RLHF boat.

But we need a few years to establish basics before I can write a cumulative RL for LLMs book ;)

JackYoustra

This is a GREAT book; if you decide to write it in a rolling fashion, you'd have at least one reader from the start :)

gr3ml1n

This feels like a category mistake. Why would R1 make RLHF obsolete?

drmindle12358

You meant to ask "Has r1 made SFT obsolete?" ?

natolambert

Author here! Just wanted to say that this is indeed in a good place to share, with some very useful stuff, but it is also very much a work in progress. I'm maybe 60% or so of the way to my first draft. Progress is coming every day, and I happily welcome fixes or suggestions on GitHub.

pknerd

Thanks. Is there a PDF version? I find it a bit difficult to keep switching between links.

dgfitz

> Reinforcement learning from human feedback (RLHF)

In case anyone else didn’t know the definition.

Knowing the definition it sounds kind of like “learn what we tell you matters” in a sense.

Not unlike how the world seems to work today. High hopes for the future…

npollock

A quote I found helpful:

"reinforcement learning from human feedback .. is designed to optimize machine learning models in domains where specifically designing a reward function is hard"

https://rlhfbook.com/c/05-preferences.html

codybontecou

How do we draw the line between a hard and not-so-hard reward function?

seanhunter

I think if you are able to define a reward function then it sort of doesn’t matter how hard it was to do that - if you can’t then RLHF is your only option.

For example, say you’re building a chess AI that you’re going to train using reinforcement learning alphazero-style. No matter how fancy the logic that you want to employ to build the AI itself, it’s really easy to make a reward function. “Did it win the game” is the reward function.
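
As a toy illustration (assuming some game wrapper that reports the final result; not real library code), the entire reward function can be this small:

  def chess_reward(result: str) -> float:
      # Reward only at the end of the game: +1 win, 0 draw, -1 loss.
      return {"win": 1.0, "draw": 0.0, "loss": -1.0}[result]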

On the other hand, if you’re making an AI to write poetry, it’s hard/impossible to come up with an objective function to judge the output, so you use RLHF.

In lots of cases the whole design springs from the fact that it’s hard to make a suitable reward function (e.g. GANs for generating realistic faces are the classic example). What makes an image of a face realistic? So Goodfellow came up with the idea of having two nets: one which tries to generate and one which tries to discern which images are fake and which are real. Now the reward functions are easy. The generator gets rewarded for generating images good enough to fool the classifier, and the classifier gets rewarded for being able to spot which images are fake and which are real.


brcmthrowaway

What’s the difference between RLHF and distillation?

tintor

They are different processes.

- RLHF: Turns a pre-trained model (which just performs autocomplete on text) into a model that you can speak with, i.e. one that answers user questions and refuses to provide harmful answers.

- Distillation: Transfers skills/knowledge/behavior from one model (and architecture) to a smaller model (and possibly a different architecture) by training the second model on the output log probs of the first model (sketched below).
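
A minimal sketch of that log-prob matching (illustrative only; the tensor names are assumptions, and real setups usually add temperature scaling and mix in a hard-label loss):

  import torch.nn.functional as F

  def distillation_loss(student_logits, teacher_logits, T=2.0):
      # Soften both distributions with temperature T, then push the student's
      # distribution toward the teacher's with a KL divergence.
      student_logp = F.log_softmax(student_logits / T, dim=-1)
      teacher_p = F.softmax(teacher_logits / T, dim=-1)
      return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (T * T)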

gr3ml1n

Your description of distillation is largely correct, but not RLHF.

The process of taking a base model that is capable of continuing ('autocomplete') some text input and teaching it to respond to questions in a Q&A chatbot-style format is called instruction tuning. It's pretty much always done via supervised fine-tuning. Otherwise known as: show it a bunch of examples of chat transcripts.

RLHF is more granular and generally one of the last steps in a training pipeline. With RLHF you train a new model, the reward model.

You make that model by having the LLM output a bunch of responses, and then having humans rank the output. E.g.:

  Q: What's the Capital of France? A: Paris
Might be scored as `1` by a human, while:

  Q: What's the Capital of France? A: Fuck if I know
Would be scored as `0`.

You feed those rankings into the reward model. Then, you have the LLM generate a ton of responses, and have the reward model score them.

If the reward model says it's good, the LLM's output is reinforced, i.e.: it's told 'that was good, more like that'.

If the output scores low, you do the opposite.

Because the reward model is trained based on human preferences, and the reward model is used to reinforce the LLMs output based on those preferences, the whole process is called reinforcement learning from human feedback.
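
For the curious, training that reward model usually boils down to a pairwise (Bradley-Terry style) loss over the human rankings, roughly like this sketch (names are illustrative; a real setup puts a scalar head on a transformer and scores prompt+response pairs):

  import torch.nn.functional as F

  def reward_model_loss(score_chosen, score_rejected):
      # score_chosen / score_rejected: scalar scores the reward model gives
      # the human-preferred and human-rejected responses to the same prompt.
      # The loss pushes the preferred score above the rejected one.
      return -F.logsigmoid(score_chosen - score_rejected).mean()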

tintor

Thanks.

Here is a presentation by Karpathy explaining the different stages of LLM training. It covers many details in a form suitable for beginners.

https://www.youtube.com/watch?v=bZQun8Y4L2A

JTyQZSnP3cQGa8B

> answer user questions and refuse providing harmful answers.

I wonder why this thing can have so much hype. Here is the NewGCC, it's a binary only compiler that refuses to compile applications that it doesn't like... What happened to all the hackers that helped create the open-source movement? Where are they now?


brcmthrowaway

So RLHF is the secret sauce behind modern LLMs?

anon373839

No, this isn't quite right. LLMs are trained in stages:

1. Pre-training. In this stage, the model is trained on a gigantic corpus of web documents, books, papers, etc., and the objective is to predict the next token of each training sample correctly.

2. Supervised fine-tuning. In this stage, the model is shown examples of chat transcripts that are formatted with a chat template. The examples show a user asking a question and an assistant providing an answer. The training objective is the same as in #1: to predict the next token in the training example correctly.

3. Reinforcement learning. Prior to R1, this has mainly taken the form of training a reward model on top of the LLM to steer the model toward whole sequences that are preferred by human feedback (although AI feedback is often used instead as a similar reward signal). When OpenAI first published the technique (probably their last bit of interesting open research?), they were using PPO. There are now a variety of ways to do this step, including methods like Direct Preference Optimization (DPO) that don't use a separate reward model at all and are easier to do (see the sketch below).
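
A rough sketch of the DPO loss mentioned in step 3 (my own variable names; it assumes you already have summed per-sequence log-probs from the policy and a frozen reference model for a preferred/rejected response pair):

  import torch.nn.functional as F

  def dpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps, beta=0.1):
      # Implicit rewards: how much more likely the policy makes each response
      # than the reference model does, scaled by beta.
      chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
      rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
      # Logistic loss on the margin between the preferred and rejected response.
      return -F.logsigmoid(chosen_reward - rejected_reward).mean()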

Stage 1 teaches the model to understand language and imparts world knowledge. Stage 2 teaches the model to act like an assistant. This is where the "magic" is. Stage 3 makes the model do a better job of being an assistant. The traditional analogy is that Stage 1 is the cake; Stage 2 is the frosting; and Stage 3 is the cherry on top.

R1-Zero departs from this "recipe" in that the reasoning magic comes from the reinforcement learning (stage 3). What DeepSeek showed is that, given a reward to produce a correct response, the model will learn to output chain-of-thought material on its own. It will, essentially, develop a chain-of-thought language that helps it accomplish the end goal. This is the most interesting part of the paper, IMO, and it's a result that's already been replicated on smaller base models.

noch

> So RLHF is the secret sauce behind modern LLMs?

Karpathy wrote[^0]:

"

RL is powerful. RLHF is not.

[…]

And yet, RLHF is a net helpful step of building an LLM Assistant. I think there's a few subtle reasons but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of few candidate answers, instead of writing the ideal answer from scratch.

[…]

No production-grade actual RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale.

"

---

[^0]: https://x.com/karpathy/status/1821277264996352246

1024core

Are there any books (on RL) which are more hands-on and look at the implementations more than the theory?

tintor

Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction

1024core

I have looked at Sutton, but it doesn't seem very "implementation oriented" to me.

projectstarter

Need epub version of this
