alexhutcheson
kadushka
Has R1 made RLHF obsolete?
alexhutcheson
DeepSeek-R1 had an RLHF step in its post-training pipeline (section 2.3.4 of the technical report[1]).
In addition, the "reasoning-oriented reinforcement learning" step (section 2.3.2) used an approach that is almost identical to RLHF in theory and implementation. The main difference is that it used a rule-based reward system rather than a reward model trained on human preference data.
If you want to train a model like DeepSeek-R1, you'll need to know the fundamentals of reinforcement learning on language models, including RLHF.
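To make the contrast concrete, here's a rough sketch of the two reward setups. The specific rules and the reward_model.score interface are illustrative stand-ins, not what the paper actually uses; the RL loop around either one is the same.

    import re

    # Rule-based reward in the spirit of R1's reasoning-RL stage: a format
    # check plus an accuracy check against a known ground-truth answer.
    # The specific rules here are illustrative, not the paper's.
    def rule_based_reward(completion: str, reference_answer: str) -> float:
        reward = 0.0
        if re.search(r"<think>.*</think>\s*<answer>.*</answer>", completion, re.DOTALL):
            reward += 0.5  # format reward
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if match and match.group(1).strip() == reference_answer.strip():
            reward += 1.0  # accuracy reward
        return reward

    # Classic RLHF swaps the rules for a model trained on human preference
    # pairs. `score` is a hypothetical interface for whatever scalar-output
    # reward model you trained.
    def learned_reward(prompt: str, completion: str, reward_model) -> float:
        return float(reward_model.score(prompt, completion))

    print(rule_based_reward("<think>2 + 2 = 4</think> <answer>4</answer>", "4"))  # 1.5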
dgfitz
> Reinforcement learning from human feedback (RLHF)
In case anyone else didn’t know the definition.
Knowing the definition, it sounds kind of like "learn what we tell you matters" in a sense.
Not unlike how the world seems to work today. High hopes for the future…
nejsjsjsbsb
This is also good: https://huyenchip.com/2023/05/02/rlhf.html
brcmthrowaway
What's the difference between RLHF and distillation?
Glad to see the author making a serious effort to fill the gap in public documentation of RLHF theory and practice. The current state of the art seems to be primarily documented in arXiv papers, but each paper is more like a "diff" than a "snapshot" - you need to patch together the knowledge from many previous papers to understand the current state. It's extremely valuable to "snapshot" the current state of the art in a way that is easy to reference.
My friendly feedback on this work-in-progress: I believe it could benefit from more introductory material to establish motivations and set expectations for what is achievable with RLHF. In particular, I think it would be useful to situate RLHF in comparison with supervised fine-tuning (SFT), which readers are likely familiar with.
Stuff I'd cover (from the perspective of an RLHF user, not a specialist):
Advantages of RLHF over SFT:
- Tunes on the full generation (which is what you ultimately care about), not just token-by-token (see the sketch after this list).
- Can tune on problems where there are many acceptable answers (or ways to word the answer), and you don't want to push the model toward one specific sequence of tokens.
- Can incorporate negative feedback (e.g. don't generate this).
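To illustrate the first two points, here's a toy sketch of the difference between the token-level SFT loss and a sequence-level, reward-weighted RL objective (random tensors stand in for real model outputs, and the reward value is made up). A negative reward flips the sign and pushes the sampled generation down, which is the negative-feedback case.

    import torch
    import torch.nn.functional as F

    # Toy stand-ins: batch of 1, a 4-token generation, vocab of 10.
    logits = torch.randn(1, 4, 10, requires_grad=True)
    reference_ids = torch.tensor([[1, 3, 2, 7]])  # one specific "gold" wording

    # SFT: token-by-token cross-entropy against a single reference sequence.
    sft_loss = F.cross_entropy(logits.view(-1, 10), reference_ids.view(-1))

    # RLHF-style objective (REINFORCE sketch): sample a whole generation,
    # score it with one scalar reward, and scale the log-prob of the sampled
    # tokens by that reward. Any acceptable wording that scores well gets
    # reinforced; a negative reward pushes the generation down instead.
    dist = torch.distributions.Categorical(logits=logits)
    sampled_ids = dist.sample()
    sequence_reward = torch.tensor(0.8)  # pretend score for this sample
    rl_loss = -(sequence_reward * dist.log_prob(sampled_ids).sum())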
Disadvantages of RLHF over SFT:
- Regularization (KL or otherwise) puts an upper bound on how much impact RLHF can have on the model (see the sketch after this list). Because of this, RLHF is almost never enough to get you "all the way there" by itself.
- Very sensitive to reward model quality, which can be hard to evaluate.
- Much more resource- and time-intensive.
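On the first point, the KL-shaped reward that most RLHF recipes use looks roughly like the sketch below (toy numbers; beta and the log-probs are illustrative). The further the policy drifts from the reference model, the larger the penalty, which is why the reward-model gains are capped.

    import torch

    # Per-token log-probs of the sampled generation under the policy being
    # tuned and under the frozen reference model (toy values; in practice
    # these come from forward passes of the two models).
    policy_logprobs = torch.tensor([-1.2, -0.8, -1.0, -0.5])
    ref_logprobs    = torch.tensor([-1.5, -0.9, -1.4, -0.6])

    rm_score = 0.9  # reward-model score for the full generation
    beta = 0.1      # KL coefficient; the value here is just illustrative

    # Shaped reward: gains from the reward model shrink as the policy
    # moves away from where it started.
    approx_kl = (policy_logprobs - ref_logprobs).sum()
    shaped_reward = rm_score - beta * approx_kl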
Non-obvious practical considerations:
- How to evaluate quality? If you have a good measurement of quality, it's tempting to just fold it into your reward model. But you want to make sure you can still measure "is this actually good for my final use case?", not just "does this score well on my reward model?" (see the sketch after this list).
- How prompt engineering interacts with fine-tuning (both SFT and RLHF). Often some iteration on the system prompt will make fine-tuning converge faster and to higher quality. Conversely, attempting to tune on examples that don't include a task-specific prompt (surprisingly common) will often yield subpar results. This is a "boring" implementation detail that I don't normally see included in papers.
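A sketch of what I mean by keeping those two measurements separate (policy, reward_model, task_metric, and eval_set are hypothetical stand-ins for your own components):

    # Keep "does the reward model like it?" and "is it actually good for the
    # task?" as separate numbers. All objects here are hypothetical stand-ins.
    def evaluate(policy, reward_model, eval_set, task_metric):
        rm_scores, task_scores = [], []
        for example in eval_set:
            completion = policy.generate(example["prompt"])
            rm_scores.append(reward_model.score(example["prompt"], completion))
            task_scores.append(task_metric(completion, example["reference"]))
        # If the reward-model score climbs while the task metric stalls or
        # drops, the policy is probably gaming the reward model.
        return sum(rm_scores) / len(rm_scores), sum(task_scores) / len(task_scores)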
Excited to see where this goes, and thanks to the author for being willing to share a work in progress!