Absolute Zero: Reinforced Self-Play Reasoning with Zero Data
May 11, 2025 · a2128
ethan_smith
The breakthrough here is eliminating the need for human-labeled reasoning data while still achieving SOTA results, which has been a major bottleneck in developing reasoning capabilities.
scotty79
I think at this point the initial process of exposing the empty model to all the available domain data in bulk is no longer interesting to many people. It's an obvious first step so it's barely mentioned anymore. What's currently worked on is what you do afterwards to get a useful tool in the end.
macrolime
Pretty sure OpenAI and/or DeepMind have been doing something very similar for a while already, just without publishing it.
FieryTransition
Agreed, it's a pretty obvious solution once you are immersed in the problem space. I think it's much harder to set up a training pipeline for this that gets every little detail right while staying efficient.
gitroom
sometimes i feel like the whole self-play thing is kinda the obvious path now but still nuts seeing it actually work better than huge data dumps. you ever wonder how much of progress is just crazy good pipelines versus actual breakthroughs?
Waterluvian
Related to this: has anyone seen a model respond with “oh wait I was wrong…” when you follow up with a “can you explain why this answer is right?”
In my experience, GPT and other models still struggle with a sort of tunnel vision.
squillion
Warning: abuse of this technique may cause the model to go blind.
ogogmad
Is this a joke about wanking?
QuadmasterXLII
For everyone who says “modern incentives forbid publishing negative results,” let this stand as a counterexample!
nullc
Be nice to see some of these run on languages the pretrained model is a little less good at than Python and JS.
mentalgear
"Despite using zero human-curated data, AZR achieves state-of-the-art results on diverse coding and math reasoning benchmarks, even outperforming models trained on large in-domain datasets. This demonstrates the potential for sophisticated reasoning skills to emerge purely through self-play without domain-specific supervision."
wiz21c
> "sophisticated reasoning skills"
Does this mean it uses the data it has to the maximum possible extent to produce new reasoning (in addition to what lesser algorithms produce)? In other words, are we still in the realm of: with a given dataset, AI can produce up to N reasoning capabilities and consequently can't produce more than that? Is reasoning bounded by knowledge? And if so, maybe we could just start from a data/knowledge set, add some randomness, and self-play until some form of reasoning emerges?
MoonGhost
Up to N at a time, probably. Then it moves on using them. The problem is that the longer the chain, the more likely it is to deviate from reality. It will include non-obvious atomic decisions and wrong assumptions, which makes the whole thing unstable. I.e., without strict human supervision it will likely start producing crap. Some self double-checks can probably help, but still. On the other hand, humans aren't that smart either...
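To put a rough number on the "longer chain, more deviation" point, here is a back-of-the-envelope sketch. It assumes every step in a reasoning chain is independently correct with the same probability, which is a simplification and not something the paper claims; `chain_success_probability` is a made-up name for illustration.

```python
def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumption (illustrative only): steps fail independently and
    share the same per-step accuracy.
    """
    return per_step_accuracy ** steps


# Even at 99% per-step accuracy, long chains degrade quickly.
for n in (1, 10, 50, 100):
    print(n, round(chain_success_probability(0.99, n), 3))
```

Under that assumption a 100-step chain at 99% per-step accuracy comes out fully correct only about 37% of the time, which is why some form of external check at each step matters.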
To be clear, this is not a model trained on zero data. It is a pretrained model (Qwen 2.5, trained on 18 trillion tokens) finetuned using self-generated data grounded by a Python interpreter.
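For anyone wondering what "grounded by a Python interpreter" could look like in practice, here is a minimal sketch. It is not the paper's code: the function names, the convention that a task is a source string defining `f(x)`, and the 0/1 reward are all assumptions made for illustration. The idea is only that the interpreter, not a human label, decides whether a predicted output is right.

```python
from typing import Any


def run_program(source: str, arg: Any) -> Any:
    """Execute `source`, which is assumed to define f(x), and return f(arg).

    Illustration only: exec() here is NOT sandboxed.
    """
    namespace: dict[str, Any] = {}
    exec(source, namespace)
    return namespace["f"](arg)


def verify(source: str, arg: Any, predicted: Any) -> float:
    """Return 1.0 if `predicted` matches what the interpreter computes, else 0.0."""
    try:
        actual = run_program(source, arg)
    except Exception:
        return 0.0  # the proposed task itself is invalid (program crashed)
    return 1.0 if predicted == actual else 0.0


# A hypothetical self-proposed task: predict the output of f on a given input.
task_source = "def f(x):\n    return sorted(x)[::-1]\n"
reward_correct = verify(task_source, [3, 1, 2], [3, 2, 1])
reward_wrong = verify(task_source, [3, 1, 2], [1, 2, 3])
```

A real setup would run the program in a sandbox with time and memory limits, since the programs are model-generated.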