Scaling up test-time compute with latent reasoning: A recurrent depth approach
20 comments · February 10, 2025
janalsncm
One of the benefits of using thinking tokens compared to "thinking in a latent space" is that you can directly observe the quality of the CoT. In R1 they saw it was mixing languages and fixed it with cold start data.
It would be hard to SFT this because you can only SFT the final result, not the latent space.
I also notice the authors only had compute for a single full training run. It's impressive they saw such good results from that, but I wonder if they could get better results by incorporating recent efficiency improvements.
I would personally not use this architecture because 1) it adds a lot of hyperparameters which don't have a strong theoretical grounding and 2) it's not clearly better than simpler methods.
edouard-harris
> In R1 they saw it was mixing languages and fixed it with cold start data.
They did (partly) fix R1's tendency to mix languages, thereby making its CoT more interpretable. But that fix came at the cost of degrading the quality of the final answer.[0] Since we can't reliably do interpretability on latents anyway, presumably the only metric that matters in that case is answer quality - and so observing thinking tokens gets you no marginal capability benefit. (It does however give you a potential safety benefit - as Anthropic vividly illustrated in their "alignment faking" paper. [1])
The bitter lesson strikes yet again: if you ask for X to get to Y, your results are worse than if you'd just asked for Y directly in the first place.
[0] From the R1 paper: "To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model's performance, this reward aligns with human preferences, making it more readable." [emphasis added]
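As a rough sketch of what that reward could look like (my own illustration with a crude Latin-script heuristic standing in for a real language identifier, not DeepSeek's actual implementation):

    # Illustrative sketch: language-consistency reward as the fraction of
    # CoT words that belong to the target language (here crudely: Latin script).
    import re

    def language_consistency_reward(cot_text: str, target_language: str = "en") -> float:
        words = cot_text.split()
        if not words:
            return 0.0
        if target_language == "en":
            matches = [w for w in words if re.fullmatch(r"[A-Za-z][A-Za-z'\-]*", w)]
        else:
            matches = []  # other languages would need their own detector
        return len(matches) / len(words)

    # A CoT that mixes languages gets a lower reward:
    print(language_consistency_reward("First solve for x then 检查 the result"))  # 0.875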
janalsncm
Interpretability also matters when you’re training. If the model works, yes, technically only the final result matters. But in practice it probably won’t work right away and so it’s great to have methods to figure out what is going wrong as you’re training.
For example, should we stop this training or keep going and wait for it to improve? In theory that’s irrelevant because we don’t make mistakes. In practice, theory is just theory.
As an analogy, you technically don’t need code comments. The compiler removes them. But in practice you do need them.
So that’s another reason I mentioned the hyperparameter hell. You’ve removed a simple interpretability method and left us with numbers that worked for a single training run.
ckrapu
My opinion is that opaque reasoning is a prerequisite for many of the worst possible AI outcomes.
We should make reasoning fully visible in the output space.
DennisP
That actually sounds like it'd be really helpful.
optimalsolver
Is there any actual evidence that the reasoning tokens output by current models represent the computation happening in the hidden layers?
In both cases, the model is doing a ton of processing that you can't actually inspect, except here, you at least get some efficiency gains.
Even more importantly, you're also less likely to convince yourself that you know what the model is thinking.
anothermathbozo
No, and we've observed evidence to the contrary.
nialv7
Slightly off topic, but I rarely see papers talk about their failed training runs and why those runs failed. This paper is definitely a breath of fresh air. Their analyses of the failures, the changes they made to fix them, and the rationale behind those changes are all very insightful.
tkellogg
The R1 paper did it as well. Agreed, it's always very interesting.
tmnvdb
Interesting stuff. As the authors note, using latent reasoning seems to be a way to sink more compute into the model and get better performance without increasing the model size. Good news for those on a steady diet of 'scale pills'.
HarHarVeryFunny
Latent / embedding-space reasoning seems a step in the right direction, but building recurrence into the model while still relying on gradient descent (i.e. BPTT) to train it seems to create more of a problem (training inefficiency) than it solves, especially since they still end up externally specifying the number of recurrent iterations (r=4, 8, etc.) for a given inference. Ideally, having recurrence internal to the model would allow the model itself to decide how long to iterate before outputting anything.
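As a toy illustration of that externally fixed iteration count (my own simplification with made-up layer sizes and names, not the paper's actual architecture), a prelude/core/coda-style loop where the caller picks r and gradients flow through every iteration might look like this:

    # Minimal sketch of depth recurrence with an externally chosen r.
    import torch
    import torch.nn as nn

    class RecurrentDepthSketch(nn.Module):
        def __init__(self, d_model: int = 512):
            super().__init__()
            self.prelude = nn.Linear(d_model, d_model)   # embed input into latent space
            self.core_block = nn.Sequential(             # block iterated r times
                nn.Linear(2 * d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
            )
            self.coda = nn.Linear(d_model, d_model)      # decode latent back out

        def forward(self, x: torch.Tensor, r: int = 4) -> torch.Tensor:
            e = self.prelude(x)
            s = torch.randn_like(e)                      # random initial latent state
            for _ in range(r):                           # r fixed per forward pass;
                s = self.core_block(torch.cat([s, e], -1))  # BPTT through all r steps
            return self.coda(s)

    x = torch.randn(2, 512)
    model = RecurrentDepthSketch()
    print(model(x, r=4).shape, model(x, r=8).shape)  # same weights, more compute at r=8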
thomasahle
> Latent / embedding-space reasoning seems a step in the right direction
Might be good for reasoning, but it's terrible for interpretation / AI-safety.
janalsncm
> seems a step in the right direction
I can’t see why. I can’t think of any problems where recurrent loops with latent streams would be preferable to tokens. And the downsides are obvious.
> externally specifying the number of recurrent iterations
Yeah this seems wrong to me. At least with RL training you saw that the length of the CoT decreased dramatically before climbing again, as the model became more proficient.
viraptor
> I can’t think of any problems where recurrent loops with latent streams would be preferable to tokens.
Efficiency. Written language is extremely inefficient. By working through whole concepts at a time instead of fragments of words, the reasoning can be much more concise.
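As a very rough back-of-envelope comparison (the vocabulary size and hidden width below are assumptions for illustration, not numbers from the paper), one latent state can in principle carry far more information per step than one sampled token:

    import math

    vocab_size = 100_000                       # assumed tokenizer vocabulary
    hidden_dim = 2048                          # assumed hidden-state width
    bits_per_token = math.log2(vocab_size)     # ~16.6 bits of choice per token
    bits_per_latent = hidden_dim * 16          # fp16 latent state, 16 bits per dim

    print(f"one token:  ~{bits_per_token:.1f} bits")
    print(f"one latent: ~{bits_per_latent} bits")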
jonathanrmumm
If we're talking about conscious thought, it takes millions of simultaneously firing neurons to form words. If we're talking about unconscious intelligence, it's closer to latent space: a lot of intelligence that can't be articulated.
HarHarVeryFunny
> I can’t see why
It just provides a bigger representation space, and seems more like what we do given that many people don't have an inner dialog, and some think pictorially.
It seems it could allow reasoning over superpositions of concepts, if such things exist internal to the model (but presumably not at the edges, where they need to be decodable into specific tokens).
ckrapu
Identifying scheming in the latent streams would be harder as you would have an extra layer of obfuscation between you and the model’s reasoning.
timbilt
Twitter thread about this by the author: https://x.com/jonasgeiping/status/1888985929727037514
danielbln
If you don't have a twitter account and want to read the full thread: https://xcancel.com/jonasgeiping/status/1888985929727037514
EternalFury
Isn't this equivalent to maximizing latent space activation without corrective user input? How does it implement self-correction or backtracking?