Reinforcement Learning: An Overview
11 comments
February 2, 2025
JaggerFoo
Uses the acronym MDP several times before defining it. So perhaps not an introductory paper, but geared to those already immersed in the field.
sgd99
I was just thinking of getting into RL and have made some progress with Q-learning. Perfect timing!
t55
This is the most up-to-date summary I could find.
esafak
Kevin's books are solid. It's disappointing that this one does not cover DeepSeek's Group Relative Policy Optimization (GRPO) algorithm from last March. Is it the SOTA in LLM training?
maxrmk
There’s some disagreement over whether GRPO is the important part of DeepSeek. I’m personally in the camp of “it was the data and reward functions” and that GRPO wasn’t the key part, but others would disagree.
Zacharias030
I believe that if the reader is familiar with PPO, they will immediately understand GRPO as well.
I've heard people say that it's advantageous for optimization that GRPO gives a zero gradient when neither the current sample nor any other sample in the group earns any reward. It avoids killing your base model with low signal-to-noise updates, which can be a problem in PPO, where the critic usually produces a non-zero gradient even for samples where one would rather say "problem too hard for now, skip."
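A rough sketch of what I mean, using the group-normalized advantage from the GRPO paper (simplified; the real objective also has the PPO-style clipping and a KL penalty):

    import numpy as np

    def grpo_advantages(group_rewards):
        """Group-relative advantage: each sample's reward, normalized by the
        mean and std of the group of rollouts for the same prompt."""
        r = np.asarray(group_rewards, dtype=float)
        centered = r - r.mean()
        std = r.std()
        return centered / std if std > 0 else centered  # all-equal group -> all zeros

    # No rollout in the group got any reward: every advantage is 0, so the
    # policy-gradient term vanishes and the update is effectively a no-op.
    print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # [0. 0. 0. 0.]

    # Mixed group: successes get positive advantage, failures negative.
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [ 1. -1. -1.  1.]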
I'd be curious to hear you lay out your thoughts though!
maxrmk
I _think_ that's an artifact of using rule-based reward functions and not actually a feature of GRPO? The original formulation of GRPO from DeepSeekMath uses a neural reward model that tries to predict human rankings of responses, and in that configuration you won't see zero-gradient updates.
Flipping it around, if you swapped out the neural reward model in PPO for a reward function that can return zero, I think it would also be able to produce zero (or very low) gradient updates.
I'll be the first to admit that I don't know enough about the space to say though. I'm still a beginner here.
armcat
Kevin's books tend to be more foundational, built around battle-tested techniques (I love his probabilistic ML book series, https://probml.github.io/pml-book/). GRPO is a relatively new technique introduced by the DeepSeek team, and their seminal DeepSeekMath paper is actually a great resource. In short, it improves over PPO by not having to train a separate critic, thereby saving computational resources.
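Very roughly, both methods optimize the same clipped surrogate; the difference is where the advantage comes from. A simplified sketch (not DeepSeek's exact implementation; the names here are just illustrative):

    import torch

    def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
        """Clipped policy objective used (per token/sample) by both PPO and GRPO."""
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * advantage
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
        return torch.min(unclipped, clipped).mean()

    # PPO:  advantage ~ reward - V(s), where V is a separate critic network
    #       that has to be trained (and kept in memory) alongside the policy.
    # GRPO: advantage ~ (reward - group mean) / group std, computed over a
    #       group of sampled completions for the same prompt -- no critic at all.

Dropping the critic means one fewer model to train and keep in GPU memory during RL fine-tuning, which is where the compute savings come from.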
I read this to get up to speed on RL for LLMs. If you have limited time, I'd recommend reading the entire first chapter, which covers the basics and some terminology, and then section 5.4 on RL for LLMs.
I struggled a lot with the first chapter and had to look up a lot of terms that weren't defined. But ultimately it was one of the most worthwhile things I've read, and it has helped me follow along with other important papers.