Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue"
26 comments · March 6, 2025
Imnimo
bradhilton
Yeah, it may help. In this paper[1], the author used a KL penalty of 0.01 for general tasks and 0.001 for mathematical. I tend to think it's probably not very important unless you're trying to optimize for human preferences.
As for response length, I think the model internalizes the logic and doesn't deliberate its answers through context creation. I don't think this is necessarily good for general reasoning, but for a specific task it would cut down inference costs. Just depends on what you're optimizing for. To encourage more general reasoning, I think a broader train and validation set would be helpful.
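For anyone curious how that coefficient actually enters the objective, here is a rough PyTorch sketch of folding a per-token KL estimate into the policy loss; the function name and the use of the k3 estimator are illustrative assumptions, not necessarily what the linked paper or our recipe does:

    import torch

    def kl_penalized_loss(pg_loss, policy_logprobs, ref_logprobs, kl_coef=0.001):
        # k3 estimator of KL(policy || reference), computed per token
        log_ratio = ref_logprobs - policy_logprobs
        kl = torch.exp(log_ratio) - log_ratio - 1.0
        # a larger kl_coef pulls the policy harder toward the frozen reference model
        return pg_loss + kl_coef * kl.mean()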
jstanley
I keep seeing people mention "illegible reasoning" but I'd be fascinated to see an example of what it actually looks like. Do you have any examples?
Apparently DeepSeek-R1 can switch between English, Chinese, and gibberish, and even the gibberish helps it think! That's fascinating, but all I can find is people saying it, nobody showing it.
Imnimo
Here's an example of language switching:
https://gr.inc/question/although-a-few-years-ago-the-fundame...
In the model dropdown, switch from DeepSeek-R1 to the LIMO model (which apparently has a high frequency of language switching).
I'm not sure about examples of gibberish or totally illegible reasoning. My guess is that since R1-Zero still had the KL penalty, it should all be somewhat legible - the KL penalty encourages the model to not move too far from what the base model would say in any given context.
jstanley
Thanks, that's cool to see. I hadn't seen this site before but browsing around I also found this example: https://gr.inc/question/why-does-the-professor-say-this-good... - also with LIMO.
layer8
GRPO = Group Relative Policy Optimization
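Roughly, the "group relative" part means advantages come from normalizing rewards within a group of responses sampled for the same prompt, rather than from a learned value function. A minimal sketch, with made-up names and example rewards:

    import torch

    def group_relative_advantages(rewards, eps=1e-6):
        # rewards: shape (group_size,), one score per sampled response to a single prompt
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    # e.g. 8 responses to the same puzzle, each scored by fraction of questions answered correctly
    advantages = group_relative_advantages(torch.tensor([0.2, 0.9, 0.5, 0.5, 1.0, 0.1, 0.6, 0.4]))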
kcorbitt
One of the authors here. Happy to answer any questions about our methods/results!
pama
Can you elaborate on this point:
“We discovered that meaningful performance improvements, as high as 10–15%, can be achieved with as few as 16 training examples.”
In particular, did you need to change the hyperparameters much, and did this limited recipe show different improvements for the larger vs smaller models? Also, how did you select these 16 examples?
bradhilton
No meaningful changes to the hyperparameters, just changed the tasks per iteration to 16 and trained on the same first 16 training tasks each iteration.
We only tested this with the 14B model. You can see the run here:
https://wandb.ai/bradhilton/rl-experiments/runs/062
Performance peaked after 21 iterations at 45% accuracy instead of the final 59%, but still a significant increase on very few samples.
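In pseudocode, the low-data variant just amounts to something like this (illustrative names, not the actual recipe code):

    TASKS_PER_ITERATION = 16

    def select_tasks(train_tasks):
        # the full-data recipe cycles through the training set each iteration;
        # the low-data run always reuses the same first 16 puzzles
        return train_tasks[:TASKS_PER_ITERATION]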
pama
Thanks.
malcolmgreaves
Please define an acronym the first time you use it in the body text. I had to scroll about 20% of the way through your article just to understand the title.
bradhilton
Great point! Thanks for the feedback.
bydgjohc
Any hypotheses on why the performance dropped suddenly while training?
bradhilton
Hi, other author here. I think the models converged on shallow/greedy strategies that improved performance up to a point, but are ultimately shortsighted, especially for harder puzzles.
Something interesting I noticed in the responses was that for shorter puzzles the model would make deductions, building up a set of additional "clues" for itself, before answering the question. However, for harder puzzles with more clues it would often merely repeat all the given clues and then try to answer the questions directly.
Maybe some form of curriculum learning would help, starting with easier puzzles and progressing to more challenging ones.
Other ideas to explore include:
- Distilling responses from stronger models
- Encouraging exploration with entropy regularization or reward shaping (see the sketch below)
- Training from base models instead of instruct models, like DeepSeek-R1-Zero
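For the entropy idea, here is a loose sketch of what that could look like, with an arbitrary coefficient; this is a generic entropy bonus, not something we have actually run:

    import torch.nn.functional as F

    def entropy_bonus(logits, coef=0.01):
        # logits: (batch, seq_len, vocab_size) from the policy model
        logprobs = F.log_softmax(logits, dim=-1)
        entropy = -(logprobs.exp() * logprobs).sum(dim=-1)  # per-token entropy
        # add the returned term to the loss: higher entropy lowers the loss,
        # nudging the policy away from collapsing onto a few token sequences
        return -coef * entropy.mean()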
bradhilton
As for why they dropped suddenly, I don't really know. Sometimes models develop degenerate behaviors, but even when forking from the best checkpoint and lowering the learning rate or changing other hyperparameters, performance still drops. It's as if its fate had already been sealed many iterations earlier.
mdp2021
Can I just wholeheartedly congratulate you for having found a critical benchmark to evaluate LLMs. Either they achieve 100% accuracy in your game, or they cannot be considered trustworthy. I remain very confident that modules must be added to the available architectures to achieve the "strict 100%" result.
snovv_crash
Do you have any other logic puzzles you could use to see if the performance generalises?
kcorbitt
To be honest, I don't expect the performance to generalize to other task types with this specific training regime. If we had a panel of like 30 logic puzzles and cross-trained against all of them simultaneously it might though.
I think there's a lot of benefit to discovering a training regime that allows small specialized models to do extremely well in one narrow task; if we can figure out how to make small models that beat SOTA on a specific task and are cheap to train and run, that's in some ways a more useful outcome than a very large model that is good at many tasks (but is more expensive to run for each of them).
randomcatuser
Wait, what's the difference between using GRPO and traditional fine-tuning of Qwen using your provided dataset?
Would be super interesting to see which one is more data-efficient!
bradhilton
Great question! So the dataset includes prompts and solutions, but no "gold" responses per se to use for SFT. You could sample responses from larger models and then train the smaller model on their answers, but as outlined in the benchmarks there is still a lot of headroom on this task, and I wouldn't expect that to get the same results. At the very least you would probably want to do rejection sampling to discard bad results. It would definitely be a good experiment!
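If someone wants to try it, the rejection-sampling step could look roughly like this; generate and extract_answers are placeholders for whatever sampling and answer-parsing code you already have:

    def build_sft_dataset(puzzles, generate, extract_answers, samples_per_puzzle=8):
        kept = []
        for puzzle in puzzles:
            for response in generate(puzzle["prompt"], n=samples_per_puzzle):
                # keep only responses from the stronger model whose final answers
                # match the known solution, and use those as SFT targets
                if extract_answers(response) == puzzle["solution"]:
                    kept.append({"prompt": puzzle["prompt"], "completion": response})
                    break  # one verified response per puzzle is enough
        return kept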
Tostino
I couldn't quickly find it by searching your github, but what layers did you end up targeting for training? Would be interesting to see an ablation on targeting different sets of layers (train only attention layers, freeze the first 30% of the layers and train the remaining 70%, etc).
bradhilton
We trained all the parameters. Those would definitely be interesting ablations. I would also like to see how much of a performance hit we would take with PEFT methods like LoRA.
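For reference, the "freeze the first 30% of layers" ablation would be easy to set up; this sketch assumes a Hugging Face-style decoder where the transformer blocks live at model.model.layers (as in Qwen2/Llama), which may need adjusting for other architectures:

    def freeze_early_layers(model, frozen_fraction=0.3):
        layers = model.model.layers
        cutoff = int(len(layers) * frozen_fraction)
        for layer in layers[:cutoff]:
            for param in layer.parameters():
                param.requires_grad = False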
behnamoh
This is the same team that, a few months ago here on Hacker News, talked about how to do fine-tuning on large language models, and then made it closed source.
>To speed up our experiments, we omitted the Kullback–Leibler (KL) divergence penalty, although our training recipe supports it for interested readers.
I am very curious whether omitting the KL penalty helps on narrow domains like this, and also whether doing so results in illegible reasoning. (From the samples in the post, it looks like it doesn't make reasoning illegible?)
>the 32B model’s response lengths collapsing, especially after reaching peak performance.
I would not have predicted this. Nor that it could collapse its response length to near zero yet lose only a few percentage points of accuracy. If you do SFT to get a model of the same size to solve these puzzles with no reasoning (just output answers directly), how well can it do?