
Understanding R1-Zero-Like Training: A Critical Perspective

mentalgear

Overall the industry needs more review, less hype. I was shocked to find out SWE-bench Verified [0] is all but verified.

[0] The benchmark used by all major vendors to "showcase" coding ability turns out to be <10% properly solved: https://www.youtube.com/watch?v=QnOc_kKKuac

belter

Failure modes are also interesting for showing what is and isn't really happening. Like asking GenAI to create clocks at specific times, or people drawing with the left hand: all you get are clocks at ten past two and people drawing with the right hand, since that's 99% of what is in the training data.

As Sabine says, if LLMs have already read all the math books in the world but still can't do basic math without calling a calculator, how much reasoning is really emerging?

"The Path to AGI is Coming Into View": https://youtu.be/mfbRHhOCgzs?t=219

immibis

Or overflowing wine glasses.

refulgentis

I'm not sure what Sabine means. It seems like a somewhat obvious category error, and mistaken regardless. (I find it hard to believe that, for example, Sabine would beat an LLM on a random selection of ten 3-digit-by-3-digit multiplication problems, to be completed in 60 seconds max by either party.)

fragmede

numeracy isn't mathematical reasoning

drakenot

I've seen the same "superficial self-reflection" mentioned in their linked blog post [0] as well, where the conclusion doesn't naturally follow from the output of the thinking tokens. I think people are fooled by this, but if you take the time to inspect the "chain of thought" tokens, they often don't match the final answer.

I don't deny that performance on certain logic tasks goes up with these models, but I don't fully understand what role the thinking tokens play in those cases.

[0] https://oatllm.notion.site/oat-zero

andai

I heard that even just getting the model to print a bunch of whitespace ("think for longer") improves the quality of the final response, because some kind of processing is still happening internally?

integralof6y

Printing a bunch of whitespace is a way of entering a new state (I'm thinking of a state machine), so the LLM can use that whitespace as a new token to refine the state of the system later. In math terms, whitespace is a tag for a class (or state) in the LLM. I think RL can perhaps take advantage of such tags. For example, whitespace could indicate a point of low gradient (indeterminacy) or a branching point; the LLM would in some way learn to increase its learning-rate parameter, so the message in the head of the LLM is: be ready to learn from RL, because in your current state you need to take a branch from a branching point that can enhance your capabilities. This is similar to tossing a coin or a die. The rule could be: on whitespace, increase the learning-rate parameter to escape zero-gradient points.

Caveat emptor: this is just speculation; I don't have any data to support the hypothesis. It also suggests that whitespace could be a "token that reflects the state of previous layers" which is not contained in the vocabulary used to train the model, so I should say that whitespace is a macro-token or neuro-token. If this hypothesis has any grounding, it could also be plausible that whitespace is an enumerated neural tag, in the sense that the length of the whitespace reflects, or is related to, the layer in which the zero gradient or branching point occurs.

Finally, my throwaway user needs whitespace too, so I will change the password to a random one to force myself to stop adding new ideas.

MoonGhost

Could it be that the model just uses latent space for thinking while generating near-garbage? It would be interesting to check whether appending something repetitive to the end of the prompt helps, i.e. whether the model uses it for 'thinking'.
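A quick way to poke at this (a sketch, not from the thread: the model name, filler string, and question are arbitrary placeholders; any small causal LM would do):

    # Compare greedy outputs with and without "filler" appended to the prompt.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder choice
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    question = "Q: A train leaves at 3:40 and arrives at 5:15. How long is the trip? A:"
    filler = " ..." * 50                        # repeated junk as extra "thinking room"

    for prompt in (question, question + filler):
        inputs = tok(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        print(repr(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)))

If the filler run reliably answers better, that would support the "extra tokens buy extra internal computation" reading; if not, the improvements probably come from the content of the CoT rather than its length.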

refulgentis

Latent space has a certain shape, which may mean I'm missing a technical distinction.*

There have been publications with a pause token (https://arxiv.org/abs/2310.02226), a backspace token (https://arxiv.org/abs/2306.05426), and a think token (https://arxiv.org/html/2405.08644v1#Ch0.S4), all based on the theory that a generic token can act as a placeholder for further manipulating attention without producing further meaningful output.
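For the curious, here is a minimal sketch of what the pause-token idea amounts to (my paraphrase, not the papers' code; the base model, K, and the toy example are arbitrary): add a new learnable token, splice K copies of it between the prompt and the answer, and mask the pause positions out of the loss so only the answer is supervised.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder base model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tok.add_special_tokens({"additional_special_tokens": ["<pause>"]})
    model.resize_token_embeddings(len(tok))                  # gives <pause> a learnable embedding

    K = 10                                                   # how much "thinking room" to insert
    prompt_ids = tok("Q: 17 * 24 = ", return_tensors="pt").input_ids
    pause_ids = torch.full((1, K), tok.convert_tokens_to_ids("<pause>"))
    answer_ids = tok("408", return_tensors="pt").input_ids

    input_ids = torch.cat([prompt_ids, pause_ids, answer_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1] + K] = -100              # ignore prompt and pause positions in the loss
    loss = model(input_ids=input_ids, labels=labels).loss    # attention still flows through the pause span

The pause positions contribute no supervised output, but their hidden states are attended to by the answer tokens, which is the "placeholder for manipulating attention" part.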

However, in practice those approaches haven't been used in the training of a large-scale model; at least, I haven't seen it at all. The most adventurous thing people have done at scale is Mamba (and RL).

* It had a particular technical meaning. The first round of the telephone game was when it came to mean "a 3-spatial-dimensions-like space, with N dimensions, in an image diffusion model, that contains all possible image styles and is navigated by a prompt." We're many iterations afield of that now, I'm afraid. Now you sort of have to interpret it the way you would negative space: defined by what is around it.

scribu

If the base models already have the “reasoning” capability, as they claim, then it’s not surprising that they were able to get to SOTA using a relatively negligible amount of compute for RL fine-tuning.

I love this sort of “anti-hype” research. We need more of it.

mirekrusin

So they achieved R1-Zero-like performance, without those long CoTs that sometimes never end and blow up inference time, with a fraction of the fine-tuning resources?

refulgentis

No, they still have "<think>"; the output just gets shorter because they remove part of a term from the objective.
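Rough illustration of that "part of a term" point, as I read the paper's Dr. GRPO change (a sketch of the bias, not the paper's code; the rewards and lengths are made-up numbers): GRPO divides each response's advantage by the group's reward std and weights its tokens by 1/|o_i|, and dropping both removes a bias that favours long responses.

    import numpy as np

    rewards = np.array([1.0, 0.0, 0.0, 1.0])      # per-response rewards within one group
    lengths = np.array([120, 900, 40, 300])       # response lengths |o_i| in tokens

    adv_grpo = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # GRPO-style advantage
    adv_dr   = rewards - rewards.mean()                              # no std division

    per_token_grpo = adv_grpo / lengths           # 1/|o_i|: long wrong answers are penalised less per token
    per_token_dr   = adv_dr / lengths.max()       # constant normaliser: no dependence on own length
    print(per_token_grpo)
    print(per_token_dr)

With the 1/|o_i| factor, a long incorrect response spreads its negative advantage thinly over many tokens, so length tends to inflate; without it, being long stops paying.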

mirekrusin

That's what I mean: currently those CoTs keep going until you run out of context.

refulgentis

I'm not sure whether you're speaking conversationally and I'm taking it as a technical query, saying CoTs never terminate for you and asking for input, asking what the paper implies about CoT, or relaying that you understood the paper's claim that this method reduces net CoT length.
