
Understanding R1-Zero-Like Training: A Critical Perspective

mentalgear

Overall the industry needs more review, less hype. I was shocked to find out SWE-bench Verified [0] is all but verified.

[0] The benchmark used by all major vendors to "showcase" coding ability turns out to be <10% properly solved: https://www.youtube.com/watch?v=QnOc_kKKuac

belter

Failure modes are also interesting for showing what is and isn't really happening. Like asking GenAI to create clocks at specific times, or people drawing with the left hand: all you get are clocks at ten past two and people drawing with the right hand, since that's 99% of what is in the training data.

As Sabine says, if LLMs have already read all the math books in the world but still can't do basic math without calling a calculator, how much reasoning is really emerging?

"The Path to AGI is Coming Into View": https://youtu.be/mfbRHhOCgzs?t=219

immibis

Or overflowing wine glasses.

refulgentis

I'm not sure what Sabine means. It seems like a somewhat obvious category error, and mistaken regardless. (I find it hard to believe that, for example, Sabine would beat an LLM on a random selection of ten 3-digit-by-3-digit multiplication problems, to be completed in 60 seconds max by either party.)

fragmede

numeracy isn't mathematical reasoning

drakenot

I've seen the same "superficial self-reflection" mentioned in their linked blog post [0] as well, where the conclusion doesn't naturally follow from the output of the thinking tokens. I think people are fooled by this, but if you take the time to inspect the "chain of thought" tokens, they often don't match the final answer.

I don't deny that performance on certain logic tasks goes up with these models, but I don't fully understand what role the thinking tokens play in those cases.

[0] https://oatllm.notion.site/oat-zero

andai

I heard that even just getting the model to print a bunch of whitespace ("think for longer") improves the quality of the final response, because some kind of processing is still happening internally?

integralof6y

Printing a bunch of whitespace is a way of entering a new state (I'm thinking of a state machine), so the LLM can use that whitespace as a new token to refine the state of the system later. In math terms, whitespace is a tag for a class (or state) in the LLM. I think RL can perhaps take advantage of such tags. For example, whitespace could indicate a point of low gradient (indeterminacy) or a branching point; the LLM would in some way learn to increase its learning-rate parameter, so the message in the head of the LLM is: be ready to learn from RL, because in your current state you need to take a branch from a branching point that can enhance your capabilities. This is similar to tossing a coin or a die. The rule could be: on whitespace, increase the learning-rate parameter to escape zero-gradient points.

Caveat emptor: this is just speculation; I don't have any data to support the hypothesis. It also suggests that whitespace could be a "token that reflects the state of previous layers" which is not contained in the vocabulary used to train the model, so I should say that whitespace is a macro-token or neuro-token. If this hypothesis has any grounding, it could also be plausible that whitespace is an enumerated neural tag, in the sense that the length of the whitespace reflects, or is related to, the layer in which the zero gradient or branching point occurs.

Finally, my throwaway user needs whitespace too, so I will change the password to a random one to force myself to stop adding new ideas.

MoonGhost

Could it be that the model just uses latent space for thinking while generating near-garbage? It would be interesting to check whether appending something repetitive to the end of the prompt helps, i.e. whether the model uses it for 'thinking'.
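A quick way to poke at this (a sketch, not from the thread: the model name, filler string, and question are arbitrary placeholders; any small causal LM would do):

    # Compare greedy outputs with and without "filler" appended to the prompt.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder choice
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    question = "Q: A train leaves at 3:40 and arrives at 5:15. How long is the trip? A:"
    filler = " ..." * 50                        # repeated junk as extra "thinking room"

    for prompt in (question, question + filler):
        inputs = tok(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        print(repr(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)))

If the filler run reliably answers better, that would support the "extra tokens buy extra internal computation" reading; if not, the improvements probably come from the content of the CoT rather than its length.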

refulgentis

Latent space has a certain shape, which may mean I'm missing a technical distinction.*

There have been publications with a pause token (https://arxiv.org/abs/2310.02226), a backspace token (https://arxiv.org/abs/2306.05426), and a think token (https://arxiv.org/html/2405.08644v1#Ch0.S4), all based on the theory that a generic token can act as a placeholder for further manipulating attention without producing further meaningful output.
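For the curious, here is a minimal sketch of what the pause-token idea amounts to (my paraphrase, not the papers' code; the base model, K, and the toy example are arbitrary): add a new learnable token, splice K copies of it between the prompt and the answer, and mask the pause positions out of the loss so only the answer is supervised.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder base model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tok.add_special_tokens({"additional_special_tokens": ["<pause>"]})
    model.resize_token_embeddings(len(tok))                  # gives <pause> a learnable embedding

    K = 10                                                   # how much "thinking room" to insert
    prompt_ids = tok("Q: 17 * 24 = ", return_tensors="pt").input_ids
    pause_ids = torch.full((1, K), tok.convert_tokens_to_ids("<pause>"))
    answer_ids = tok("408", return_tensors="pt").input_ids

    input_ids = torch.cat([prompt_ids, pause_ids, answer_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1] + K] = -100              # ignore prompt and pause positions in the loss
    loss = model(input_ids=input_ids, labels=labels).loss    # attention still flows through the pause span

The pause positions contribute no supervised output, but their hidden states are attended to by the answer tokens, which is the "placeholder for manipulating attention" part.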

However, in practice those approaches haven't been used in the training of a large-scale model; at least, I haven't seen it at all. The most adventurous thing people have done at scale is Mamba (and RL).

* It had a particular technical meaning. The first round of the telephone game was when it came to mean "a 3-spatial-dimensions-like space, with N dimensions, in an image diffusion model, that contains all possible image styles and is navigated by a prompt." We're many iterations afield of that now, I'm afraid. Now you sort of have to interpret it the way you would negative space: defined by what is around it.

scribu

If the base models already have the “reasoning” capability, as they claim, then it’s not surprising that they were able to get to SOTA using a relatively negligible amount of compute for RL fine-tuning.

I love this sort of “anti-hype” research. We need more of it.

mirekrusin

So they achieved R1-Zero-like performance, without those long CoTs that sometimes never end and blow up inference time, with a fraction of the fine-tuning resources?

refulgentis

No, they still have "<think>"; the output just gets shorter because they remove part of a term from the objective.
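Rough illustration of that "part of a term" point, as I read the paper's Dr. GRPO change (a sketch of the bias, not the paper's code; the rewards and lengths are made-up numbers): GRPO divides each response's advantage by the group's reward std and weights its tokens by 1/|o_i|, and dropping both removes a bias that favours long responses.

    import numpy as np

    rewards = np.array([1.0, 0.0, 0.0, 1.0])      # per-response rewards within one group
    lengths = np.array([120, 900, 40, 300])       # response lengths |o_i| in tokens

    adv_grpo = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # GRPO-style advantage
    adv_dr   = rewards - rewards.mean()                              # no std division

    per_token_grpo = adv_grpo / lengths           # 1/|o_i|: long wrong answers are penalised less per token
    per_token_dr   = adv_dr / lengths.max()       # constant normaliser: no dependence on own length
    print(per_token_grpo)
    print(per_token_dr)

With the 1/|o_i| factor, a long incorrect response spreads its negative advantage thinly over many tokens, so length tends to inflate; without it, being long stops paying.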

mirekrusin

That's what I mean: currently those CoTs keep going until you run out of context.

refulgentis

I'm not sure whether you're speaking conversationally and I'm taking it as a technical query, saying CoTs never terminate for you and asking for input, asking what the paper implies about CoT, or relaying that you understood the paper's claim that this method reduces net CoT length.
