Reinforcement Learning: An Overview
11 comments
February 2, 2025
JaggerFoo
Uses the acronym MDP several times before defining it. So perhaps not an introductory paper, but geared to those already immersed in the field.
sgd99
I was just thinking of getting into RL and have made some progress with Q-learning. Perfect timing!
t55
This is the most up-to-date summary I could find.
esafak
Kevin's books are solid. It's disappointing that this one does not cover DeepSeek's Group Relative Policy Optimization (GRPO) algorithm from last March. Is it the SOTA in LLM training?
maxrmk
There’s some disagreement over whether GRPO is the important part of DeepSeek. I’m personally in the camp of “it was the data and reward functions” and that GRPO wasn’t the key part, but others would disagree.
Zacharias030
I believe that if the reader is familiar with PPO, they will immediately understand GRPO as well.
I've heard people say that it's advantageous for optimization that GRPO gives a zero gradient when neither the current sample nor any other sample in the group earns any reward. It avoids killing your base model with low signal-to-noise updates, which can be a problem in PPO, where the critic usually produces a non-zero gradient even for samples where one would rather say "problem too hard for now, skip."
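A rough sketch of what I mean, using the group-normalized advantage from the GRPO paper (simplified; the real objective also has the PPO-style clipping and a KL penalty):

    import numpy as np

    def grpo_advantages(group_rewards):
        """Group-relative advantage: each sample's reward, normalized by the
        mean and std of the group of rollouts for the same prompt."""
        r = np.asarray(group_rewards, dtype=float)
        centered = r - r.mean()
        std = r.std()
        return centered / std if std > 0 else centered  # all-equal group -> all zeros

    # No rollout in the group got any reward: every advantage is 0, so the
    # policy-gradient term vanishes and the update is effectively a no-op.
    print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # [0. 0. 0. 0.]

    # Mixed group: successes get positive advantage, failures negative.
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [ 1. -1. -1.  1.]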
I'd be curious to hear you lay out your thoughts though!
maxrmk
I _think_ that's an artifact of using rule-based reward functions and not actually a feature of GRPO? The original formulation of GRPO from DeepSeekMath uses a neural reward model that tries to predict human rankings of responses, and in that configuration you won't see zero-gradient updates.
Flipping it around, if you swapped out the neural reward model in PPO for a reward function that can return zero, I think it would also be able to produce zero (or very low) gradient updates.
I'll be the first to admit that I don't know enough about the space to say though. I'm still a beginner here.
armcat
Kevin's books tend to be more foundational, built around battle-tested techniques (I love his probabilistic ML book series, https://probml.github.io/pml-book/). GRPO is a relatively new technique introduced by the DeepSeek team, and their seminal DeepSeekMath paper is actually a great resource. In short, it improves over PPO by not having to train a separate critic, thereby saving computational resources.
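Very roughly, both methods optimize the same clipped surrogate; the difference is where the advantage comes from. A simplified sketch (not DeepSeek's exact implementation; the names here are just illustrative):

    import torch

    def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
        """Clipped policy objective used (per token/sample) by both PPO and GRPO."""
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * advantage
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
        return torch.min(unclipped, clipped).mean()

    # PPO:  advantage ~ reward - V(s), where V is a separate critic network
    #       that has to be trained (and kept in memory) alongside the policy.
    # GRPO: advantage ~ (reward - group mean) / group std, computed over a
    #       group of sampled completions for the same prompt -- no critic at all.

Dropping the critic means one fewer model to train and keep in GPU memory during RL fine-tuning, which is where the compute savings come from.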
I read this to get up to speed on RL for LLMs. If you have limited time, I'd recommend reading the entire first chapter, which covers the basics and some terminology, and then section 5.4 on RL for LLMs.
I struggled a lot with the first chapter and had to look up a lot of terms that weren't defined. But ultimately it was one of the most worthwhile things I've read, and it has helped me follow along with other important papers.