
There May Not Be Aha Moment in R1-Zero-Like Training

Jean-Papoulos

>We found Superficial Self-Reflection (SSR) from base models’ responses, in which case self-reflections do not necessarily lead to correct final answers.

I must be missing something here. No one was arguing that the AI's answers are correct to begin with, just that self-reflection leads to more correct answers compared to not using the process?

littlestymaar

TL;DR:

Base models exhibit what the authors call "Superficial Self-Reflection," where it looks like the model is reasoning but this doesn't lead to an actual improvement in answer quality. Then, with RL, the models learn to use this reflection effectively to improve answer quality.

The whole read is interesting but I don't think the title is really an accurate description of it…

jamiequint

Some interesting discussion in the author's X thread here: https://x.com/zzlccc/status/1887557022771712308
