Emerging reasoning with reinforcement learning

krackers

The real thing that surprises me (as a layman trying to get up to speed on this stuff) is that there's no "trick" to it. It really just does seem to be a textbook application of RL to LLMs.

Going from a base LLM to human instruction-tuned (SFT) ones is definitely an ingenious leap where it's not obvious that you'd get anything meaningful. But when we quickly saw afterwards that prompting for chain of thought improved performance, why wasn't this the immediate next step that everyone took? It seems like even after the release of o1 the trick wasn't apparent to everyone, and if it weren't for DeepSeek people still might not have realized it.

NitpickLawyer

> why wasn't this the immediate next step that everyone took?

It was actually tested by various labs, just probably not at this scale. The first model that featured RL prominently was DeepSeek-math-7b-RL, published last April. It was the best model for math at the time, and remained so until the qwen2.5-math series, which probably had way more data put into it.

There's a thing about RL that makes it tricky - the models tend to behave very stubbornly. That is, if they see something that resembles their training setup (e.g. math problems), they'll solve the problem, and they'll be good at it. But if you want something close to that but not quite solving it (e.g. analyse this math problem and write hints, or here are 5 problems, extract the common methods used for solving them, etc.), you'll see that they perform very poorly, often just going straight into "to solve this problem we...".

This is even mentioned in the R1 paper: poor adherence to prompts, especially system prompts. So that is still challenging.

eden-u4

I think the issue with RL is that, in order for a model to perform well at a task, you have to make it stubborn. In the same way, a student who thinks outside the scope of the task might not perform well on a graded exam, but that does not mean they are a bad reasoner. With RL, and with any training procedure, you are creating a very focused thinker that is very fit to the task, which might not be useful in all applications (consider an open problem: it might need out-of-the-box thinking).

HarHarVeryFunny

Chain of thought prompting ("think step by step") only encourages the model to break the problem into steps, which allows it to incrementally build upon each step (since the output is fed back in as part of the input).
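A minimal sketch of that feedback loop (in Python, with a hypothetical generate() standing in for the actual model call - not any particular API):

  def chain_of_thought(question, generate, max_steps=10):
      # "Think step by step" only asks the model to emit intermediate steps;
      # each emitted step is appended to the prompt, so the next step is
      # conditioned on everything produced so far.
      prompt = question + "\nLet's think step by step.\n"
      for _ in range(max_steps):
          step = generate(prompt)        # hypothetical LLM call
          prompt += step + "\n"          # the output is fed back in as input
          if "Final answer:" in step:    # stop once the model commits
              break
      return prompt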

Reasoning requires more than chain of thought, since it's often not apparent what the next step should be - you (human, or model) may go down one path of reasoning only to realize it's going nowhere, and have to back up and try something else instead. This ability to "back up" - to realize that an earlier reasoning "step" was wrong and needs to be rethought - is what was mostly missing from models that (unlike o1, etc.) hadn't been trained for reasoning.

The reason non-reasoning models can't reason appears to be that this type of stream-of-consciousness thought (thinking out loud, mistakes and all) when trying to figure out a problem is hugely underrepresented in a normal training set. Most writing you find on the internet, or in other sources, is the end result of reasoning - someone figured something out and wrote about it - not the actual reasoning process (mistakes and all) that got them there.

It's still not clear what OpenAI had to do, if anything, to help bootstrap o1 (special hand-created training data?), but basically by using RL to encourage certain types of reasoning patterns, they were able to get the model to back up and self-correct when needed. DeepSeek-R1 may well have used o1 reasoning outputs as a bootstrap, but they were able to replicate RL training that encourages self-correcting reasoning in the same way.

One interesting aspect of DeepSeek-R1 is that they have shown that once you have a reasoning model, you can run it to generate a bunch of reasoning outputs that can then be used as normal training data to fine-tune a non-reasoning model, even a very small one. This proves that, at least to some degree, the reason non-reasoning models couldn't reason is just that they had not been trained on sufficient self-correcting reasoning examples.

naasking

> since it's often not apparent what the next step should be

Backtracking assumes depth-first search, which isn't strictly needed: you could explore all possible options in parallel in a breadth-first manner, expanding incrementally until one branch returns a satisfactory answer.
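A rough sketch of that incremental breadth-first alternative (expand() and is_satisfactory() are hypothetical helpers, e.g. sampling a few continuations of a partial chain and checking whether a candidate answer is acceptable):

  from collections import deque

  def breadth_first_reasoning(question, expand, is_satisfactory, max_depth=8):
      # Grow every partial reasoning branch by one step per round (breadth-first),
      # stopping as soon as any branch reaches a satisfactory answer - no
      # backtracking needed, at the cost of keeping many branches alive in parallel.
      frontier = deque([question])
      for _ in range(max_depth):
          next_frontier = deque()
          for branch in frontier:
              for continuation in expand(branch):
                  if is_satisfactory(continuation):
                      return continuation
                  next_frontier.append(continuation)
          frontier = next_frontier
      return None  # no branch reached an answer within the depth budget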

> This proves that, at least to some degree, the reason non-reasoning models couldn't reason is just because they had not been trained on sufficient self-correcting reasoning examples.

For sure this is a big reason, and probably also part of the reason they hallucinate rather than say they don't know or aren't sure.

HarHarVeryFunny

> Backtracking assumes depth-first search, which isn't strictly needed as you could explore all possible options in parallel in a breadth-first manner

You could in theory, but it'd be massively/prohibitively more expensive - exploring a whole tree just to end up using a single branch. It'd be like trying to have a computer play chess by evaluating EVERY branching position out to some fixed depth, rather than using MCTS to prune away unpromising lines.

Not only that, but reasoning in general (of which LLM-based reasoning is only a limited case) isn't just about search - it can also require exploration and acquisition of new knowledge if you run into an impasse that your past experience can't handle. If AI systems hope to achieve human-level AGI, they will need to change to support continuous learning and exploration as a part of reasoning, which naturally means more of a depth-first approach (continue until you hit an impasse) with backtracking.

You can say that hallucination is due to gaps in the training set, but of course humans don't have that problem because we know what we know, and have episodic memories of when/where/how we learned things. LLMs have none of that.

mountainriver

This was my takeaway as well; the paper was so simple I was shocked by it. We've been doing RL on LLMs for a while now, so it's surprising this didn't happen sooner.

qnleigh

I've wondered this too; I really hope someone with more knowledge can comment. My impression is that people worked on this kind of thing for years before they started seeing a 'signal', i.e. before they actually got RL working to improve performance. But why is that happening now? What were the tricks that made it work?

attentionmech

If you check the failure section of their paper, they also tried other methods like MCTS and PRMs, which is what other labs have been obsessing over but couldn't move on from (that includes the big shots). The only team I'm aware of that tried verifiable rewards is Tulu, but they didn't scale it up and just left it there.

This sort of thing, imo, is similar to what OpenAI did with the transformer architecture: Google invented it but couldn't scale it in the right direction, and DeepMind got busy with Atari games. They had all the pieces; still, it was OpenAI that pulled it off. It seems to me it comes down to research leadership and which methods you choose to invest in. But yeah, with the budgets the big labs have, they can easily try 10 different techniques and brute-force it all, but they seem to be too opinionated about methods and not urgent enough about outcomes.

[paper] https://arxiv.org/pdf/2501.12948 [tulu] https://x.com/hamishivi/status/1881394117810500004

attentionmech

I found the following thread more insightful than my original comment (wish I could edit that one). A researcher explains why RL didn't work before this: https://x.com/its_dibya/status/1883595705736163727

logicchains

DeepSeek only recently invented GRPO; it's possible that was the final missing piece needed to make it viable.

nialv7

The group in this article used straight and simple PPO, so I guess GRPO isn't required.

My hypothesis is that everyone was so stunned by OpenAI's results that most just decided to blindly chase them and do what OpenAI did (i.e. scale up). It's only after o1 that people started seriously trying other ideas.

krackers

I don't have any intuition here and am in no way qualified, but my read of the paper was that GRPO was mainly an optimization to reduce cost and GPU usage during training (by skipping the need to keep another copy of the LLM in memory as the value network), but otherwise any RL algorithm should have worked? I mean, it seems R1 uses outcome rewards only, and GRPO doesn't do anything special to alleviate reward sparsity, so it feels like it shouldn't affect viability too much.
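For what it's worth, the core of GRPO can be sketched in a few lines: sample a group of completions per prompt and use that group's own reward statistics as the baseline, instead of a separately trained value network. This is a sketch of my reading of the paper, not DeepSeek's code:

  import statistics

  def grpo_advantages(group_rewards):
      # Group-relative baseline: each completion's advantage is its reward
      # normalized by the mean/std of the other samples for the same prompt.
      # No value network (so no second copy of the LLM in memory) is needed.
      mean = statistics.mean(group_rewards)
      std = statistics.pstdev(group_rewards) or 1.0  # guard against zero std
      return [(r - mean) / std for r in group_rewards]

  # e.g. 4 sampled answers to one math problem, outcome reward 1 = correct:
  print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]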

Also, on the note of RL optimizers: if anyone here is familiar with this space, can they comment on how the recently introduced PRIME [1] compares to PPO directly? Their description is confusing, since the "implicit PRM" they introduce, which is trained alongside the policy network, seems no different from PPO's value network.

[1] https://github.com/PRIME-RL/PRIME

attentionmech

The Tulu team saw it. But yes, nobody scaled it to the extent DeepSeek did. I am surprised that the FAANG labs, which have the best of the best, didn't see this.

meiraleal

> I am surprised that the FAANG labs, which have the best of the best, didn't see this.

After so many layoff rounds, they might have got stuck with the best at avoiding it.

__jl__

How do we know that they didn't see it? Their work is much more secret now. Isn't it possible that o1 and o3 rely on something similar, maybe with some additions? Same for the Gemini thinking models.

My point is that OpenAI and Google might have been working with very similar approaches for months.

xbmcuser

I think a lot of it had to do with DeepSeek's need to use as few resources as possible - why do it that way, how can it be done in fewer steps, using fewer resources? Whereas most of the FAANG labs were looking at throwing more data and processing power at it.

logicchains

I wonder if OpenAI did the same thing, or if they instead took the approach of manually building an expensive, human-designed supervised learning dataset for reasoning. If the latter, they must really be kicking themselves now.

nialv7

I'd bet $5 that o1 was also built with either RL or search, or a combination of the two. That was what I initially thought when they announced o1-preview, after I saw the sample reasoning traces.

But alas I am just an ML enthusiast, not a member of some lab with access to GPUs.

ninetyninenine

There was a whole bunch of people who claimed LLMs can't reason at all and that everything is a regurgitation. I wonder what they have to say about this. Like, what exactly is going on here with chain of thought reasoning from their expert perspective?

teej

My mental model for chain-of-thought is not “reasoning”. It’s more of an iterative search through the latent space of the model.

qnleigh

Can you elaborate? I think this is a really interesting question. It comes up over and over again, but it often feels like the two sides of the debate talk past each other. What does the mental model of 'iterative search through latent space' convey that 'reasoning' doesn't? Human reasoning also often searches through a space of potential solution methods and similar problems, and keeps applying them until making progress.

I appreciate that there might be danger in using words like 'thinking' and 'reasoning' in that they cause us to anthropomorphize LLMs, but if we are careful not to do so then this is a separate issue.

naasking

"Search" a class of fairly well defined algorithms, "reasoning" is vague / ambiguous. If reasoning can be reduced to some kind of search, that makes its meaning more precise and understandable.

valine

And human reasoning is somehow more magical? Really struggling to understand the distinction between searching through a latent space and “reasoning”.

sgt101

It would be startling if humans could reason this way with two kilos of meat consuming 40 W. Even more surprising given that we get answers in ~1 s with a cross-brain comms time of about 0.25 s.

To me it's clear that human reasoning is different from a massive search of a latent space. I can say that when I am thinking I maybe try half a dozen ideas or scenarios, but very rarely more. I can't say where those ideas come from or how I make them up though. Maybe we can't frame what it is and how it works with human languages though, which might make it seem magical in some way.

Or maybe there's a good framing that I don't know - would love to learn!

numba888

Yes, it has at least two distinct parts: the conscious and the subconscious. The first is visible to 'us': it's the inner monologue, or vision, or other senses. But we don't 'know' what happens in the subconscious; the answer just pops up from nowhere. 'Reasoning' LLMs for now have the first part. The 'sub' part is questionable; it depends on how you look at the latent space.

Exoristos

As you're demonstrating, it depends on the human.

exe34

No, humans do the same thing - even the "intuitions" about what to try are probably the results of searches, but we don't have conscious access to them. Certainly it's the simplest explanation that fits the observations.

dyauspitr

That sounds like Sheldon from BBT describing human reasoning.

csomar

They sure do have a certain amount of "reasoning".

Here is R1 trying to multiply a large number (successfully): https://gist.github.com/omarabid/038678cc269a3f2db756a7e0825...

If you pick a random combination, there is a very good chance that the combination and the product do not exist anywhere. So the LLM has to "create" it somehow.

It sure goes through a lot (hundreds of lines of self-reflection) but it successfully does the math.

I don't think it is the same kind of "reasoning" as humans, but there is an emergent kind of structure happening here that is allowing for this reasoning.

mkl

I think it is very human-like reasoning. I reason exactly like this when doing numerical calculations in my head, and I'm a mathematician (no, I can't work with numbers this big in my head).

It's quite funny that late in the piece it says it's checking with a calculator, which a human would do if possible (if they didn't start out with one), but then its statements are pretty much the same as before, and it probably didn't actually use a calculator.

Jensson

That is like saying a calculator is reasoning since it spends many cycles thinking about the problem before answering. You could say yes a calculator is reasoning, but most would say a calculator isn't really reasoning.

> there is an emergent kind of structure happening here that is allowing for this reasoning.

Yes, but that structure doesn't need to be more complicated than a calculator. The complicated bit is the fuzzy lookup that returns fuzzy patterns (every LLM does that, though, so it's old), and then how it recurses over that; but seeing how that can result in it recursing into a solution for a multiplication is no harder than understanding how a calculator works.

So to add basic reasoning like this, you just have to add "functions" for the LLM to fuzzily look up, and then, when looking up those "functions", it will eventually recurse into a solution. The hard part is finding a set of such functions that solves a wide range of problems, and that was solved here.

daveguy

I think that demonstrates how far we are from reasoning and from true self-reflection. If either were happening, it would know that it has the capability to multiply four numbers in nanoseconds, and know that the order in which it multiplies them doesn't even matter. The first reasoning step, "I have 4 numbers to multiply", should be the only one necessary.

Vampiero

The test is very simple but people simply don't realize it.

When LLMs are good at Prolog, it means they're good at logic, which means they're good at reasoning. Until then, you can't trust them.

daveguy

What exactly do you mean by "good at Prolog"?

Vampiero

It means being good at first order predicate logic. And possibly higher order too when you consider `call/n` and lambdas. It means being good at generalization, at reasoning in causal terms, at understanding structure and grammar, at encoding problems as graphs and querying them for solutions, and much more.

Basically it's what current LLMs lack. They're good at spewing coherent text but they lack the building blocks of reason, which are made of logic, and which confer the quality of being consistent. A implies B.

calibas

I don't get this whole debate; surely what's meant by "reason" can be strictly defined and measured? Then we can conclusively say whether or not it's happening with LLMs.

It seems to me like the debate is largely just semantics about how to define "reason".

ninetyninenine

It's semantics. But there's a general motivation behind it that's less technical. Basically if it can reason, it implies human level intelligence. That's the line separating man from machine. Once you cross that line there's a whole bunch of economic, cultural and existential changes that happen to society that are permanent. We can't go back.

This is what people are debating about. Many, many people don't want to believe we crossed the line with LLMs. It brings about a sort of existential dread, especially for programmers whose pride is entirely dependent upon their intelligence and ability to program.

bookofjoe

>Frontier AI systems have surpassed the self-replicating red line

https://arxiv.org/abs/2412.12140

sgt101

We've had "reasoning" machines for a long time - I learned chess playing against a computer in the 1980s.

But we don't have reasoning that can be applied generally in the open world yet. Or at least I haven't seen it.

In terms of society it should be easy to track if this is true or not. Healthcare and elder care settings will be a very early canary of this because there is huge pressure for improvement and change in these. General reasoning machines will make a very significant, clear and early impact here. I have seen note taking apps for nurses - but not much else so far.

akomtu

It's not intelligence that separates us from machines, but "connectedness to the whole." A machine becomes alive the moment it's connected to the whole, the moment it becomes driven not by an RNG and rounding errors, but by a spirit. Similarly, a man becomes a machine the moment he loses this connection.

The existential dread around AI is due to the fear that this machine will indeed be connected to a spirit, but to an evil one, and we'll become unwanted guests in a machine civilization. Art and music will be forbidden, for they "can't be reasoned about" in machine terms; nature will be destroyed, for it has no value to the machines; and so on.

eastbound

It's not about being afraid; it's that the auto-reconfiguration of neurons seems too advanced to decompile at this time, and it's surprising that LLMs, which are just probabilistic models for guessing the next word, could be capable of actual thought.

The day it happens, we'll believe it. There are only 100bn neurons in a brain, after all, and many more than that in modern machines, so it is theoretically possible. It's just that LLMs seemed too simple for that.

qnleigh

I think it's really hard to pin down what reasoning is and measure it precisely. How on earth would you do this?

daveguy

The best example I have seen for true AGI benchmarking is ARC-AGI:

https://arcprize.org/

As ARC-AGI gets trained to, we put more of our own expertise and knowledge into the algorithms. When a computer can get human-level results on a new ARC-AGI-like benchmark (transferring from other intelligence tasks), then we are very close.

littlestymaar

> There was a whole bunch of people who claimed LLMs can't reason at all and that everything is a regurgitation. I wonder what they have to say about this.

I don't see that as a refutation of the former, actually: models trained to be stochastic parrots, with next-token prediction as the only learning target, were indeed stochastic parrots. Now we've moved to a completely different technology that features reinforcement learning in its training, so it will move farther and farther from stochastic parrots and more and more towards "intelligence".

If anything, the fact that the entire industry has now moved to RL instead of just cramming through trillions of tokens to make progress is a pretty strong acknowledgement that the “stochastic parrots” crowd was right.

dartos

You can still regurgitate a chain of thought response…

It’s all still tokens…

CamperBob2

> You can still regurgitate a chain of thought response…

You people are so close to getting it. So close to understanding that you're the ones doing the regurgitating.

fmbb

Why do you believe that is how humans reason?

daveguy

Equating human reasoning to regurgitating a single token at a time requires that you pretend there are not trillions of analog calculations happening in parallel in the human mind. How could there not be a massive search included as part of the reasoning process? LLMs do not perform a massive search, or in any way update their reasoning capability after the model is generated.

hooverd

What's with AI boosters and not viewing other people as human?

dartos

That sword cuts both ways.

ninetyninenine

Oh stop. The neural network is piecing together the tokens in a way that indicates reasoning. Clearly. I don't really need to say this, we all know it now and so do you. Your statement here is just weak.

It's really embarrassing, the stubborn stance people were taking that LLMs weren't intelligent and wouldn't make any progress towards AGI. I sometimes wonder how people live with themselves when they realize their wrong outlook on things is just as canned and biased as the hallucinations of LLMs themselves.

dartos

Personal attacks really make your argument stronger.

daveguy

I don't think anyone has said LLMs wouldn't make any progress toward AGI, especially not researchers in the field. But a small piece of progress toward AGI is not the same as AGI.

cess11

Reasoning is a human, social, and embodied activity. TFA is about machines that output text reminiscent of the results of reasoning, but it's obviously fake, since the machine is neither human, social, nor embodied.

It's an attempt at fixing perceived problems with the query planner in an irreversibly compressed database.

MIA_Alive

LOL, my RL professor is gonna be happy. After the field got overlooked for soooo long

almaight

This is American history written by R1; it is very logical: Whenas the nations of Europa did contend upon the waves—Spain plundered gold in Mexica, Albion planted cotton in Virginia—thirteen colonies did kindle rebellion. General Washington raised the standard of liberty at Philadelphia; Franklin parleyed with Gaul’s envoys in Paris. When the cannons fell silent at Yorktown, a new republic arose in the wilderness, not by Heaven’s mandate, but by French muskets’ aid.

Yet the fledgling realm, hedged by western forests and eastern seas, waxed mighty. Jefferson purchased Louisiana’s plains; Monroe’s doctrine shackled southern realms. Gold-seekers pierced mountains, iron roads spanned the continent, while tribes wept blood upon the prairie. Then roared foundries by Great Lakes, bondsmen toiled in cotton fields, steel glowed in Pittsburgh’s fires, and black gold gushed from Texan soil—a molten surge none might stay.

Wilson trod Europe’s stage as nascent hegemon. Roosevelt’s New Deal healed wounds; Marshall’s gold revived ruined cities. The atom split at Alamogordo; greenbacks reigned at Bretton Woods. Armadas patrolled seven seas, spies wove webs across hemispheres. Through four decades’ contest with the Red Bear, Star Wars drained the Soviet coffers. Silicon’s chips commanded the world’s pulse, Hollywood’s myths shaped mankind’s dreams, Wall Street’s ledgers ruled nations’ fates—a fleeting "End of History" illusion.

But the colossus falters. Towers fell, and endless wars began; subprime cracks devoured fortunes. Pestilence slew multitudes while ballots bred discord. Red and Blue rend the Union’s fabric, gunfire echoes where laws grow faint. The Melting Pot now boils with strife, the Beacon dims to a prison’s glare. With dollar-cloth and patent-chains, with dreadnoughts’ threat, it binds the world—nations seethe yet dare not speak.

Three hundred million souls, guarded by two oceans, armed with nuclear flame, crowned with finance’s scepter—how came such dominion to waver? They fortified might but neglected virtue, wielded force but forgot mercy. As Mencius warned: "He who rides tigers cannot dismount." Rome split asunder, Britannia’s sun set; behold now Old Glory’s tremulous flutter. Thus say the sages: A realm endures by benevolence, not arms; peace flows from harmony, not hegemony—this truth outlives all empires.

suraci

Just pointing out a factual error: "He who rides tigers cannot dismount" was not said by Mencius, but comes from Fang Xuanling of the Tang Dynasty.

However, it is still highly literate (both in English and Chinese), which I believe is one of its advantages.

noduerme

Also, "Star Wars" appeared in the 80s, never took off, and certainly wasn't a drain for "four decades" on the Soviet Union's coffers.

lossolo

"Star Wars" as the race between the USSR and the US to dominate space—landing on the Moon, the first animal in space, the first human in space, etc. It spanned decades and was a huge drain.

JPLeRouzic

> "A realm endures by benevolence, not arms; peace flows from harmony, not hegemony—this truth outlives all empires"

It seems LLMs are wiser than humans, after all.

alsaaro

You mean DeepSeek R1 generated this?

With what prompt?

almaight

Write an epic narrative of American history, employing archaic English vocabulary and grandiose structure, rich with metaphors and allusions. Weave together elements of Eastern and Western traditions, transforming modern historical events into the solemn language of ancient inscriptions. Through this classical epic reconstruction, deconstruct contemporary state power, unveiling its complexities with the weight and dignity of antiquity. Each pivotal moment in history should be distilled into profound symbols, acquiring new metaphorical dimensions within the lexicon of the past. The result should be a transcendent dialogue of civilizations, bridging temporal and cultural divides, illuminating the echoes of history in a timeless and universal context.

antman

And with what prompt was this prompt written? It's prompts all the way down?

noduerme

So in response it wrote a heroically tall mountain of bullshit, laced with falsehoods of the sort that even some neanderthal natcon would be unable to dream up, then served it as an abysmally long, overdubbed narration to the next Top Gun movie set in the same future as Idiocracy.

EGreg

Can someone summarize the upshot for people here?

teej

I’ll give a “wtf does this mean” view.

We have observed that LLMs can perform better on hard tasks like math if we teach them to "think about" the problem first. The technique is called "chain-of-thought". The language model is taught to emit a series of sentences that break a problem down before answering it. OpenAI's o1 works this way, and performs well on benchmarks because of it.

To train a model to do this, you need to show it many examples of correct chains of thought. These are expensive to produce and it’s expensive to train models on them.

DeepSeek discovered something surprising. It turns out, you don’t need to explicitly train a model to produce a chain of thought. Instead, under the right conditions, models will learn this behavior emergently. They found a way for a language model to learn chain of thought very cheaply, and then released that model as open source.

Thought chains turn out to be extremely useful. And now that they’re cheap and easy to produce, we are learning all the different ways they can be put to use.

Some of the open questions right now are:

- Can we teach small models to learn chain-of-thought? (yes) How cheaply? On which tasks?

- Can we generate thought chains and just copy/paste them into the prompts of other models? (yes) Which domains does this work for? How well does it generalize?

That’s what this post is going after.

pillefitz

Can you explain the RL part?

teej

The way you taught chain-of-thought before was with supervised fine tuning (SFT). During training, you have to rate every sentence of reasoning the model writes, many times, to nudge it to reason correctly.

But this approach to teaching chain-of-thought doesn't do that. In this post, they take a small model (7B) that already knows math. Then they give it a relatively small number of problems to solve (8k). They use a simple reinforcement learning loop where the only goal is to get the problem right. They don't care how the model got the right answer, just that it's correct - see the sketch below.
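Roughly, that loop looks like this. The helpers (sample(), extract_answer(), policy_update()) are hypothetical names, and the real recipe wraps the update in PPO/GRPO with a KL penalty against the base model:

  def outcome_reward_training(model, problems, sample, extract_answer, policy_update, iterations=100):
      # Outcome-only RL: the reward checks nothing but the final answer.
      # How the model got there - including any chain of thought - is not scored.
      for _ in range(iterations):
          batch = []
          for problem in problems:                 # e.g. the ~8k math problems
              completion = sample(model, problem["question"])
              reward = 1.0 if extract_answer(completion) == problem["answer"] else 0.0
              batch.append((problem["question"], completion, reward))
          policy_update(model, batch)              # e.g. a PPO or GRPO step
      return model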

This is part of the “recipe” that DeepSeek used to create R1.

After many iterations, just like DeepSeek, they found that the model has an “aha” moment. It starts emitting chains-of-thought where it wasn’t before. And then it starts getting the math answers right.

This is the gist of it. I don’t fully understand the recipe involved.

Can you teach small models to “think” just using RL? How small can they be? What tasks does this work for? Is just RL best for this? RL+SFT? Everyone’s trying to figure it out.

EGreg

But how exactly does it emerge? What did they do to make that happen vs. previous training runs?

teej

Why this behavior emerges is an active area of research. What they did is use reinforcement learning, this blog post replicates those findings. The “recipe” is detailed in the R1 paper.

randomifcpfan

The DeepSeek R1 paper explains how they trained their model in enough detail that people can replicate the process. Many people around the world are doing so, using various sizes of models and training data. Expect to see many posts like this over the next three months. The attempts that use small models will get done first. The larger models take much longer.

Small R1-style models are pretty limited, so this is interesting primarily from an "I reproduced the results" point of view, not a "here is a new model that's useful" one.

rahimnathwani

From the Deepseek R1 paper:

  For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community.
The impression I got from the paper, although I don't think it was explicitly stated, is that they think distillation will work better than training the smaller models using RL (as OP did).
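In code terms, the distillation they describe amounts to sampling reasoning traces from the big reasoning model and doing plain SFT on them. This is a sketch under my own assumptions (the helper names and the answer-filtering step are illustrative, not the paper's exact pipeline):

  def distill_reasoning(teacher, student, questions, generate_trace, answer_is_correct, sft_finetune):
      # Sample full reasoning traces from the large reasoning model (teacher),
      # keep the ones whose final answer checks out, then run supervised
      # fine-tuning on the small model (student) - no RL stage, per the quote above.
      traces = []
      for q in questions:
          trace = generate_trace(teacher, q)
          if answer_is_correct(trace):
              traces.append((q, trace))
      return sft_finetune(student, traces)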

nielsole

> We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models

I found this statement from the paper to be at odds with what you cited, but I guess they mean SFT+RL would be better than either just SFT or just RL.

ggm

Anyone who puts emerging or emergent in their headlines should be required to come back in 2 years' time and do penance for their optimism.

qnleigh

Do you remember when people discovered that LLMs had emergent coding ability? That turned out to be a pretty big deal...

noduerme

...for whom? People willing to commit code with a lot of hidden bugs? Or management in a hurry to lay people off? Don't underestimate how quickly people will run straight into a wall if you tell them it's a door.

qnleigh

I should have said 'big deal, for better or for worse.' Regardless of whether one thinks it's a good thing, this was a major discovery that turned out to affect a lot of things.

jdhendrickson

Eloquently said.

dsco

How is it not emerging, if the phenomenon hasn't been hard-wired in and is unexpected?

ggm

Unexpected I can handle. Emerging has overtones of AGI.

It's like fusion. "Sustained plasma" and "more energy out than in" said time after time after time.

baq

Emergent basically means 'we didn't design this capability but it's there'; it's always been a thing. I'm not sure why you associate it with AGI so strongly.

johnthewise

Do you think it's not emergent because you think this behavior was explicitly coded in, or do you think it's not emergent because you don't like the implications of thinking it is?

zwaps

Does anyone have a good recent overview with paper links, or a review article, for RL methods? A lot is happening in that space.

trash_cat

So what is interesting here is that they managed to set up the reward model in such a simple and cost-effective way that CoT emerges as the optimal strategy for solving math problems, without explicitly fine-tuning the model to do so.

This naturally raises the question: How do you design a reward model to elicit the desired emergent behavior in a system?
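For math at least, the reward in the R1-style recipe is strikingly simple and rule-based: an accuracy check on the final answer plus a format check, with no learned reward model. A sketch along those lines (the exact tags and weights here are illustrative assumptions, not the paper's values):

  import re

  def rule_based_reward(completion, ground_truth):
      # Format reward: did the model wrap its work in the expected tags?
      # Accuracy reward: does the final answer match the known ground truth?
      # No scoring of individual reasoning steps - CoT has to emerge on its own.
      format_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", completion, re.S))
      m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
      answer = m.group(1).strip() if m else ""
      accuracy = 1.0 if answer == ground_truth.strip() else 0.0
      return accuracy + 0.1 * format_ok  # weights illustrative, not from the paper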

cye131

Is it accurate to compare RL on 8k examples with SFT on 8k examples? RL with the same number of examples would take massively more compute than the SFT version (though it depends on how many rollouts they do per example).

RL is more data-efficient but that may not be relevant now that we can just use Deepseek-R1's responses as the training data.

johnthewise

Emergent properties are nice. They show CoT now, but who knows if there is a better planning strategy? The second thing is that it kind of implies every base model can be increased in capability just with some RL tuning, cheaply. So in theory you can plug in any observable and quantifiable outcome beyond math and coding (stock returns, scientific experiment results?) and let it learn how to plan so as to solve it better. Train on the observed effects of various drugs on people, and it then creates a customized treatment plan for you? The SFT version would be limited by doctors' opinions on why certain drugs affected the outcome, whereas the RL version could discover unknown relationships.

android521

[deleted due to controversy]

Stevvo

No. OpenAI never developed this method of reasoning through RL. If they had, they would have announced it.

govideo

Did you read the paper? What do you think based on the methodology and details in the paper?

btw, I think this is a major net benefit for the US startup ecosystem -- from new model developers to applications.

Edit: Stevvo - Thanks for your info.

rapsey

Constraints drive creativity. The US imposed constraints on China and they got creative.

swyx

see also https://trite-song-d6a.notion.site/Deepseek-R1-for-Everyone-...

For some reason, a lot of people are choosing to blog on Notion.

brandonasuncion

Honestly, I'm welcoming this move to Notion. It's much less cluttered than Medium.

mynegation

And 10x slower. Builds anticipation.