Mini-R1: Reproduce DeepSeek R1 "Aha Moment"
11 comments
· January 31, 2025 · madars
senko
> that's not enough as you can rebuild access to builtins from objects
In this specific case it's safe, as that wouldn't pass the regex just a few lines before the eval:
# Define a regex pattern that only allows numbers,
# operators, parentheses, and whitespace
allowed_pattern = r'^[\d+\-*/().\s]+$'
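(For the curious, a minimal sketch of how that guard works - names are mine and the post's actual reward function differs in details, but the idea is: anything containing letters, like a dunder attack, never reaches the eval:)

import re

def evaluate_equation(equation):
    # Only digits, the four operators, parentheses, dot and whitespace may
    # appear; anything with letters (e.g. "__builtins__") is rejected.
    allowed_pattern = r'^[\d+\-*/().\s]+$'
    if not re.match(allowed_pattern, equation):
        return None
    try:
        # Stripping builtins is then just a belt-and-braces measure.
        return eval(equation, {"__builtins__": None}, {})
    except Exception:
        return None

print(evaluate_equation("(95 - 21) / 2"))              # 37.0
print(evaluate_equation("().__class__.__bases__[0]"))  # None - contains letters, fails the regex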
Commenting on the R1 reproduction: the heavy lifting there is done by Hugging Face's TRL [0] library, plus a heavy amount of compute.

[0] Transformer Reinforcement Learning - https://huggingface.co/docs/trl/en/index
mxwsn
What's surprising about this is how sparsely defined the rewards are. Even if the model learns the formatting reward, if it never chances upon a solution, there isn't any feedback/reward to push it to learn to solve the game more often.
So what are the chances of randomly guessing a solution?
The toy Countdown dataset here has 3 to 4 numbers, which are combined with 4 operators (+, -, ×, ÷). With 3 numbers there are 3! * 4^3 = 384 possible symbol combinations; with 4 there are 6,144. Per the TensorBoard log [0], even after just 10 learning steps the model already has a success rate just below 10%. If we make the simplifying assumption that the model hasn't learned anything in 10 steps, then the probability of 1 (or more) successes in 80 chances (8 generations are used per step), guessing randomly with a success rate of 1/384 on 3-number problems, is 1.9%. One interpretation is to take this as a p-value and reject that the model's base success rate is pure random guessing - the base model already solves the 3-number Countdown game at a slightly above-chance rate.
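(As a sanity check on the guessing odds, here's a quick brute force I threw together - it only fills len(nums)-1 operator slots and ignores parentheses, so the totals come out a bit lower than the back-of-envelope count above:)

from itertools import permutations, product

def guess_odds(nums, target):
    # Fraction of uniform random guesses (number ordering + operators,
    # evaluated left to right, no parentheses) that hit the target exactly.
    ops = {'+': lambda a, b: a + b,
           '-': lambda a, b: a - b,
           '*': lambda a, b: a * b,
           '/': lambda a, b: a / b if b != 0 else float('nan')}
    hits = total = 0
    for order in permutations(nums):
        for chosen in product(ops, repeat=len(nums) - 1):
            acc = order[0]
            for op, n in zip(chosen, order[1:]):
                acc = ops[op](acc, n)
            total += 1
            if acc == target:
                hits += 1
    return hits, total

print(guess_odds([4, 2, 10], 42))   # (2, 96): 4*10+2 and 10*4+2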
This aligns with my intuition - I suspect that with proper prompting, LLMs should be able to solve Countdown reasonably well without any training. Though maybe not a 3B model?
The model likely "parlays" its successes on 3 numbers into learning to solve 4 numbers. Or has it? The final learned ~50% success rate matches the share of 4-number problems in Jiayi Pan's Countdown dataset [1], so it could just be solving the 3-number half. Phil does provide examples of successful 4-number solutions, but maybe the model hasn't become consistent at 4 numbers yet.
[0]: https://www.philschmid.de/static/blog/mini-deepseek-r1/tenso...
[1]: https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3t...
senko
> What's surprising about this is how sparsely defined the rewards are
Yeah, I would expect the rewards not to be binary. One could easily devise a scoring function in the range [0, 1] that depends on how far the model is from the "correct" answer (for example, normalized Levenshtein distance). Whether that would actually do any good is anyone's guess.
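(A rough sketch of what I mean, assuming we score the generated equation against a single reference solution - all names here are made up:)

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def graded_reward(completion, reference):
    # 1.0 for an exact match, falling toward 0.0 with edit distance.
    if not completion and not reference:
        return 1.0
    dist = levenshtein(completion, reference)
    return 1.0 - dist / max(len(completion), len(reference))

print(graded_reward("55 + 36 - 7 - 19", "55 + 36 - 7 - 19"))  # 1.0
print(graded_reward("55 + 36 - 7 + 19", "55 + 36 - 7 - 19"))  # 0.9375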
singularity2001
"Conclusion
The release of DeepSeek R1 and its research paper might be a breakpoint for open-science and open-source development. Just a week after the DeepSeek release, we've been able to reproduce a simple version of R1's learned "reasoning" using GRPO and the Countdown Game. While our implementation focuses on a specific task rather than general reasoning, and converges into a very specific "reasoning" format, it shows that the method is working.
In our mini R1 experiment we used GRPO with two rule-based rewards, but it already required significant compute: 4 H100 GPUs running for 6 hours to complete just 450 training steps on a 3B parameter model."
yurlungur
https://github.com/Jiayi-Pan/TinyZero what about this one?
thorum
I was getting pretty hyped about the potential for GRPO in my own projects until you said 20 minutes for a single training step with batch size 1! Is that likely to improve?
NitpickLawyer
That has already improved a lot. Initially they were generating new samples with transformers, and were talking in GitHub issues about using vLLM to batch-generate samples. Lower in the blog post it seems they've already done that.
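(For reference, in recent TRL versions this looks like it's just a config flag - a rough sketch, assuming GRPOConfig/GRPOTrainer expose vLLM generation as the docs describe; the reward functions and dataset are the ones defined in the post, not reimplemented here:)

from trl import GRPOConfig, GRPOTrainer

# Sketch only: argument names assume a recent TRL release with vLLM support.
training_args = GRPOConfig(
    output_dir="mini-r1-countdown",
    use_vllm=True,                    # sample completions with vLLM instead of model.generate
    vllm_gpu_memory_utilization=0.5,  # leave VRAM for the policy model on the same GPUs
    num_generations=8,                # group size per prompt in GRPO
    max_completion_length=1024,
    per_device_train_batch_size=1,
    bf16=True,
)

# format_reward_func, equation_reward_func and dataset are the blog post's
# objects; they're assumed to exist in scope.
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=[format_reward_func, equation_reward_func],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()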
deneas
I'd imagine using optimized/faster reward functions could already make a difference.
rmrf100
this is really cool!
moonshotideas
Wow!
One wonders at which point models will be sneaky enough to bypass simple eval sandboxes. The article restricts builtins in its eval call, but that's not enough, as you can rebuild access to builtins from objects and then go from there: https://ideone.com/qzNtyu
By the way, writing this greatly benefited from DeepThink-r1, while o1 just gave me a lobotomized refusal (CoT: "The user's request involves injecting code to bypass a restricted Python environment, suggesting a potential interest in illegal activities. This is a serious matter and aligns closely with ethical guidelines."). So I just cancelled my ChatGPT subscription - why did we ever put up with this? "This distillation thingie sounds pretty neat!"
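(If you're curious what that looks like concretely, here's the well-known textbook version - not the exact payload from the ideone link, just an illustration - which walks from a plain tuple back up to object and into an imported module's namespace:)

# Even with __builtins__ stripped from eval's globals, any object still
# carries a path back to builtins via its class hierarchy (assumes the
# usual CPython setup where warnings is already imported at startup).
payload = (
    "[c for c in ().__class__.__base__.__subclasses__() "
    "if c.__name__ == 'catch_warnings'][0]()._module.__builtins__"
)
print(eval(payload, {"__builtins__": None}, {}))  # prints a builtins namespace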