Implementing DeepSeek R1's GRPO algorithm from scratch
3 comments
April 13, 2025
cubefox
xcodevn
Author here: (1) We didn't remove the stddev term. (2) We use a token-level loss (every token gets the same weight), which is very similar to what Dr. GRPO does. However, we compute the mean gradient per token, while Dr. GRPO computes the sum; ordinarily the two are equivalent up to a constant scale. But since we also do gradient accumulation over micro-batches to reduce memory usage during training, this introduced a bug in our implementation: tokens in short sequences get more weight than tokens in long sequences.
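A toy illustration of the weighting issue (made-up numbers, not the actual training code): with a per-micro-batch mean loss, each micro-batch contributes equally to the accumulated loss regardless of how many tokens it contains, so a token in a short sequence ends up with a larger effective weight than a token in a long one.

```python
# Toy numbers, purely illustrative: token counts of two micro-batches
# (e.g. one short completion and one long completion).
micro_batch_token_counts = [10, 1000]
num_micro_batches = len(micro_batch_token_counts)

# Buggy accumulation: each micro-batch's loss is the mean over its own
# tokens, and the accumulated loss averages over micro-batches.
# Effective weight of a single token in micro-batch i:
#   1 / (num_micro_batches * tokens_in_micro_batch_i)
buggy_weights = [1 / (num_micro_batches * n) for n in micro_batch_token_counts]
print(buggy_weights)   # [0.05, 0.0005] -> short-sequence tokens weighted 100x more

# Intended behaviour: every token in the global batch has the same weight,
#   1 / total_tokens
total_tokens = sum(micro_batch_token_counts)
print(1 / total_tokens)  # ~0.00099 for every token, long or short
```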
Interestingly, this is the same bug that most open-source LLM training frameworks (such as HF Trainer) had and only recently fixed.
In short, I'm working on a quick fix; after that, using sum or mean should yield equivalent results.
P.S. Fixed!
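For concreteness, here is a minimal sketch of one way such a fix can look (hypothetical names, not the post's actual code, with a squared-error stand-in for the GRPO token-level objective): each micro-batch contributes the sum of its per-token losses divided by the total token count of the whole accumulation window, so the accumulated gradient matches a single global per-token mean.

```python
import torch

def per_token_loss(model, inputs, targets):
    # Stand-in for the GRPO token-level objective: squared error per token,
    # just so this example is self-contained and runnable.
    return (model(inputs) - targets) ** 2

def accumulate_grads(model, micro_batches, total_tokens):
    """Sum each micro-batch's token losses and divide by the global token
    count, so every token gets weight 1 / total_tokens regardless of how
    tokens are split across micro-batches."""
    model.zero_grad()
    for inputs, targets in micro_batches:
        loss = per_token_loss(model, inputs, targets).sum() / total_tokens
        loss.backward()  # gradients accumulate in .grad across micro-batches

# Tiny demo: one "short" and one "long" micro-batch.
model = torch.nn.Linear(1, 1, bias=False)
short = (torch.randn(10, 1), torch.randn(10, 1))
long_ = (torch.randn(1000, 1), torch.randn(1000, 1))
accumulate_grads(model, [short, long_], total_tokens=10 + 1000)
```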
cubefox
Cool!
I wonder whether they implemented the GRPO correction from this paper, which fixes overly long response lengths: https://arxiv.org/abs/2503.20783
I guess probably not, as they don't mention it.