An analysis of DeepSeek's R1-Zero and R1
219 comments
January 29, 2025
spyckie2
sheepscreek
> The most promising idea is to use reasoning models to generate data, and then train our non-reasoning models with the reasoning-embedded data.
DeepSeek did precisely this with their Llama fine-tunes. You can try the 70B one here (might have to sign up): https://groq.com/groqcloud-makes-deepseek-r1-distill-llama-7...
spyckie2
Yes, but I meant it slightly differently than the distills.
The idea is to create the next-gen SOTA non-reasoning model with synthetic reasoning training data.
mohsen1
Every time you respond to an AI model with "no, you got that wrong, do it this way," you provide a very valuable piece of data to train on. With reasoning tokens there is just a lot more of that data to train on now.
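One hedged sketch of how such a correction turn might be packaged for later preference tuning (the field names are illustrative, not any particular framework's format):

```python
# Hypothetical data plumbing: turn a "no, you got that wrong, do it this way"
# exchange into a preference pair that a later DPO/RLHF-style run could consume.
def correction_to_preference_pair(prompt: str, model_reply: str, corrected_reply: str) -> dict:
    return {
        "prompt": prompt,
        "rejected": model_reply,     # what the model originally said
        "chosen": corrected_reply,   # what the user indicated it should have said
    }
```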
merrywhether
Users can be adversarial to the “truth” (to the extent it exists) without being adversarial in intent.
Dinosaur bones are either 65 million year old remnants of ancient creatures or decoys planted by a God during a 7 day creation, and a large proportion of humans earnestly believe either take. Choosing which of these to believe involves a higher level decision about fundamental worldviews. This is an extreme example, but incorporating “honest” human feedback on vaccines, dark matter, and countless other topics won’t lead to de facto improvements.
I guess to put it another way: experts don’t learn from the masses. The average human isn’t an expert in anything, so incorporating the average feedback will pull a model away from expertise (imagine asking 100 people to give you grammar advice). You’d instead want to identify expert advice, but that’s impossible to do from looking at the advice itself without giving into a confirmation bias spiral. Humans use meta-signals like credentialing to augment their perception of received information, yet I doubt we’ll be having people upload their CV during signup to a chat service.
And at the cutting edge level of expertise, the only real “knowledgeable” counterparties are the physical systems of reality themselves. I’m curious how takeoff is possible for a brain in a bottle that can’t test and verify any of its own conjectures. It can continually extrapolate down chains of thought, but that’s most likely to just carry and amplify errors.
nine_k
This assumes that you give honest feedback.
Efforts to feed deployed AI models various epistemic poisons abound in the wild.
visarga
> This assumes that you give honest feedback.
You don't need honest user feedback because you can judge any message in a conversation using hindsight.
Just ask a LLM to judge if a response is useful, while seeing what messages come after it. The judge model has privileged information. Maybe 5 messages later it turns out what the LLM replied was not a good idea.
You can also use related conversations from the same user. The idea is to extend context so you can judge better. Sometimes the user tests the LLM's ideas in the real world and comes back with feedback; that is real-world testing, something R1 can't do.
Tesla uses the same method to flag the seconds before a surprising event, it works because it has hindsight. It uses the environment to learn what was important.
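A rough sketch of this hindsight-judging idea; `call_judge_llm` is a hypothetical stand-in for whatever judge-model API is used:

```python
def hindsight_score(conversation, reply_index, call_judge_llm, lookahead=5):
    """Score conversation[reply_index] using the messages that came after it.

    `conversation` is a list of {"role": ..., "content": ...} dicts and
    `call_judge_llm` is assumed to return a string containing a 0-1 score.
    """
    reply = conversation[reply_index]["content"]
    followup = [m["content"] for m in
                conversation[reply_index + 1 : reply_index + 1 + lookahead]]
    prompt = (
        "An assistant gave this reply:\n" + reply + "\n\n"
        "These messages followed it:\n" + "\n".join(followup) + "\n\n"
        "With hindsight, was the reply useful? Answer with a number from 0 to 1."
    )
    return float(call_judge_llm(prompt))
```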
Uehreka
This assumes that the companies gathering the data don’t have silent ways of detecting bad actors and discarding their responses. If you’re trying to poison an AI, are you making all of your queries from the same IP? Via a VPN whose IP block is known? Are you using a tool to generate this bad data, which might have detectable word frequency patterns that can be detected with something cheap like tf-idf?
There’s a lot of incentive to figure this out. And they have so much data coming in that they can likely afford to toss out some good data to ensure that they’re tossing out all of the bad.
octacat
There are ways to analyze whether your contributions make sense from the conversation's point of view. Reasoning detects that pretty quickly. To attack it you would actually have to use another AI to generate something that isn't totally random, and even that could still be detected.
I would assume that to use the data they would have to filter it a lot and correlate across many users.
You can detect whether a user is genuine and trust their other chats "a bit more".
bobxmax
I don't know why HN users in particular fixate so heavily on fringe issues when it comes to LLMs. Same as the exaggerations of hallucinations.
scarmig
Probably it's something like "give feedback that's on average slightly more correct than incorrect," though you'd get more signal from perfect feedback.
That said, I suspect the signal is very weak even today and probably not too useful except for learning about human stylistic preferences.
hammock
AI models already assume that a significant majority of the training material is honest/in good faith. So that is not new?
BorisMelnik
I am not in this space, question: are there "bad actors" that are known to feed AI models with poisonous information?
deegles
not being snarky, but what is the point of using the model if you already know enough to correct it into giving the right answer?
an example that just occurred to me - if you asked it to generate an image of a mushroom that is safe to eat in your area, how would you tell it it was wrong? "oh, they never got back to me, I'll generate this image for others as well!"
dematz
A common use of these models is asking for code, and maybe you don't know the answer or would take a while to figure it out. For example, here's some html, make it blue and centered. You could give the model feedback on if its answer worked or not, without knowing the correct answer yourself ahead of time.
vincentperes
You constantly have to correct an AI when using it, because it either didn't get the question right or you need to guide it towards a narrower answer. There is only more to learn.
Levitz
>not being snarky, but what is the point of using the model if you already know enough to correct it into giving the right answer?
For your example, what if you want to show what such a mushroom looks like to a friend? What if you want to use it on a website?
stetrain
> What is today's date?
>> Today's date is Tuesday, January 28, 2025.
> No, you're wrong, today's date is actually Wednesday the 29th.
>> My mistake. Yes, today's date is Wednesday, January 29th, 2025.
Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.
mr-wendel
But that's exactly what you get when you ask questions that require shifting, specific contextual knowledge. The model weights, by their nature, cannot encode that information.
At best, you can only try to layer in contextual info like this as metadata during inference, akin to how other prompting layers exist.
Even then, what up-to-date information should be present for every round-trip is a matter of opinion and use case.
genewitch
the date is in the "system prompt", so the cron job that updates the prompts to the current date may be in a different time zone than you. 7f5dbb71f54322f271c4d3fc3aaa4d3282a1af5541d82b2cbc5aa10c1420b6bc
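A hedged illustration of why the answer can be off by a day: the model only sees whatever date string the serving layer baked into the system prompt, e.g. something like:

```python
from datetime import datetime, timezone

def build_system_prompt(tz=timezone.utc):
    # If this runs on a UTC schedule while the user is in UTC-8, the "today"
    # the model is told about can disagree with the user's calendar.
    today = datetime.now(tz).strftime("%A, %B %d, %Y")
    return f"You are a helpful assistant. Today's date is {today}."
```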
halfadot
> Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.
Such an ingenious attack, surely none of these companies ever considered it.
amluto
Does it?
If I say "no, you hallucinated basically the entire content of the response", then maybe a newer training set derived from that could train on the specific fact that that specific hallucinated response is hallucinated. This seems to be of dubious value in a training set.
Vampiero
Nah I just insult it and tell it that it costs me 20 dollars a month and it's a huge disappointment
jeffbee
If such labels are collected and used to retrain the model then yes. But these models are not learning online.
Cthulhu_
ChatGPT came out and its interface was a chatbox and a thumbs up / thumbs down icon (or whichever) to rate the responses; surely that created a feedback loop of learning, like all machine learning has done for years now?
jvanderbot
Really? Isn't that the point of RL used in the way R1 did?
Provide a cost function (vs labels) and have it argue itself to greatness as measured by that cost function?
I believe that's what GP meant by "respond", not telling GPT they were wrong.
aprilthird2021
So if I just pay OpenAI $200/mo, and randomly tell the AI, no that's wrong.
I can stop the AI takeover?
dr_kiszonka
You would need a lot of pro accounts! I would be surprised if they didn't use any algorithms for detecting well poisoning.
Exoristos
You can have our thank-you cards forwarded to your cell at Guantanamo Bay.
echelon
> you provide a very valuable piece of data to train on
We've been saying this "we get valuable data" thing since the 2010s [1].
When will our collective Netflix thumbs ups give us artificial super-intelligence?
[1] Especially to investors. They love that line.
genewitch
Our collective Netflix thumbs-up indicators gave investors and Netflix the confidence to deploy a series of Adam Sandler movies that cost 60 to 80 million US dollars to "make". So depending on who you are, the system might be working great.
visarga
> I wonder if there is a cap to the multi-head attention architecture
I don't think there is a cap other than having good data. The model learns all the languages in the world; it has the capacity. A simple model like AlphaZero beats humans at board games. As long as you have data, the model is not an obstacle. An LLM-based system like AlphaProof reached silver-medal level at the IMO.
fizx
You're not getting new high-quality textual data for pre-training from your chat service. But you are potentially getting a lot of RL feedback on ambiguous problems.
fsndz
I think we will have to move with pre-training and post-training efforts in parallel. What DeepSeek showed is that you first need to have a strong enough pretrained model. For that, we have to continue the acquisition of high quality, multilingual datasets. Then, when we have a stronger pretrained model, we can apply pure RL to get a reasoning model that we use only to generate synthetic reasoning data. We then use those synthetic reasoning data to fine-tune the original pretrained model and make it even stronger. https://transitions.substack.com/p/the-laymans-introduction-...
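A compressed sketch of that pipeline, with each stage passed in as a callable since none of this corresponds to a specific real training API:

```python
def build_next_gen_model(pretrain, pure_rl, generate_with_cot, verify, finetune,
                         corpus, prompts):
    base = pretrain(corpus)                    # strong pretrained model first
    reasoner = pure_rl(base)                   # R1-Zero-style RL, no SFT
    synthetic = [generate_with_cot(reasoner, p) for p in prompts]
    verified = [s for s in synthetic if verify(s)]   # keep only checked chains
    return finetune(base, verified)            # reasoning-embedded fine-tune
```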
godelski
> I highly doubt you are getting novel, high quality data.
That's not the point. The point is you reject low quality data, aka noise.
hassleblad23
And how would that work at inference time?
bobxmax
Why would it need to work at inference time?
vagabund
> I highly doubt you are getting novel, high quality data.
Why wouldn't you? Presumably the end user would try their use case on the existing model, and if it performs well, wouldn't bother with the expense of setting up an RL environment specific to their task.
If it doesn't perform well, they do bother, and they have all the incentive in the world to get the verifier right -- which is not an extraordinarily sophisticated task if you're only using rules-based outcome rewards (as R1 and R1-Zero do)
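For context, a rules-based outcome reward can be as simple as regex checks on the final answer and the output format. The sketch below is illustrative, not the exact reward used in the R1 paper:

```python
import re

def outcome_reward(completion: str, ground_truth: str) -> float:
    """Reward = 1.0 for a correct final answer, plus a small format bonus."""
    format_ok = bool(re.search(r"<think>.*</think>.*<answer>.*</answer>",
                               completion, re.DOTALL))
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer_ok = match is not None and match.group(1).strip() == ground_truth.strip()
    return (1.0 if answer_ok else 0.0) + (0.1 if format_ok else 0.0)
```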
jcims
>You might be lowering the cost of your training corpus by a few million dollars, but I highly doubt you are getting novel, high quality data.
The large foundational models don't really need more empirical data about the world. ChatGPT already 'knows' way more than I do, probably by many orders of magnitude. Yet it's still spewing nonsense at me regularly because it doesn't know how to think like a human or interact with me in a human-like way. To that end, the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition.
visarga
> the ability for a company like OpenAI to collect novel data from interacting with real humans is a material advantage over their competition
It's a different kind of data from the R1 reasoning chains. When LLMs have a human in the loop, the human provides help based on their personal experience and real-world validation. Sometimes users take an idea from the LLM and try it in real life, then come back later and discuss the outcomes. This is a real-world testing loop.
In order to judge if an AI response was useful, you can look at the following messages with a judge LLM. Using hindsight helps a lot here. Maybe it doesn't pan out and the user tries another approach, or maybe some innocuous idea was key to success later. It's hard to tell in the moment, but easy when you see what followed after that.
This scales well - OpenAI has 300M users, which I estimate at up to 1 trillion interactive tokens/day. The user base is very diverse, problems are diverse, and feedback comes from user experience and actual testing. This forms an experience flywheel: the more problem solving they do, the smarter it gets, attracting more users.
Stevvo
"The o3 system demonstrates the first practical, general implementation of a computer adapting to novel unseen problems"
Yet, they said when it was announced:
"OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data."
These two statements are completely opposed. I can't take seriously anything this article says about o3.
usaar333
No they aren't. Every arc problem is novel - that's why it resisted deep learning for so long (and still does to a degree).
We just don't know how much the model's having seen what an ARC problem is in the first place boosts its ability to solve them - that limited statement is all the author is making.
7thpower
They are testing with a different dataset. The authors are saying that they have not tested the version of o3 that has not seen the training set.
daveguy
Your quote is accurate from here:
https://arcprize.org/blog/oai-o3-pub-breakthrough
They were talking about training on the public dataset -- OpenAI tuned the o3 model with 75% of the public dataset. There was some idea/hope that these LLMs would be able to gain enough knowledge in the latent space that they would automatically do well on the ARC-AGI problems. But using 75% of the public training set for tuning puts them at about the same challenge level as all other competitors (who use 100% of training).
In the post they were saying they didn't have a chance to test the o3 model's performance on ARC-AGI out-of-the-box, which is how the 14%-scoring R1-Zero was tested (no SFT, no search). They have been testing the LLMs out of the box like this to see if they are "smart" with respect to the problem set by default.
anothermathbozo
The claim is that this removes the human bottleneck (aka SFT or supervised fine tuning) on domains with a verifiable reward. Critically, this verifiable reward is extremely hard to pin down in nearly all domains besides mathematics and computer science.
aithrowawaycomm
It's also extremely hard to nail down in much of mathematics or computer science!
- is such-and-such theorem deep or shallow?
- is this definition/axiom useful? (there's a big difference between doing compass-straightedge proofs vs. wondering about the parallel postulate)
- more generally, discovering theorems is generally not amenable to verifiable rewards, except in domains where simpler deterministic tools exist (in which case LLMs can likely help reduce the amount of brute forcing)
- is this a good mathematical / software model of a given real-world system?
- is the flexibility of dynamic/gradual typing worth the risk of type errors? is static typing more or less confusing for developers?
- what features should be part of a programming language's syntax? should we opt for lean-and-extensible or batteries-included?
- are we prematurely optimizing this function?
- will this program's memory needs play nicely with Rust's memory model? What architectural decisions do we need to make now to avoid headaches 6 months down the line?
Davidzheng
Not clear to me that theorem discovery is not amenable to verifiable rewards. I think most important theorems would probably be recovered automatically by asking AI systems to prove increasingly complicated human conjectures. Along the way I expect emergent behaviors of creating conjectures and recognizing important self-breakthroughs, much like the emergence of regret.
youoy
Theorem discovery is amenable to verifiable rewards. But is meaningful theorem discovery too? Is the ability to discern between meaningful theorems and bad ones an emergent behaviour? You can check for yourself examples of automatic proofs and the huge number of intermediate theorems they can generate which are not very meaningful.
nextos
IMHO, there are strategies that could extend this approach to many other domains.
I was discussing this idea (along with a small prototype) with a prominent symbolic AI researcher who also agrees, and thinks that with the emergence of RL as a viable training method for LLMs, it might be possible to pursue neuro-symbolic learning at a large scale.
Current systems are impressive, but their reasoning is too fragile to be trusted. They fall into obvious logical and statistical fallacies that are evident to a layperson.
gadtfly
Reasoning transfers across domains.
Philpax
See https://www.interconnects.ai/p/why-reasoning-models-will-gen... for more information.
Onavo
By verifiable do they mean it in the complexity theory P/NP sense of the word?
calebkaiser
In the case of DeepSeek-R1, they used a series of heuristic reward functions that were built for different data types. The paper mentions the use of sandboxed environments to execute generated code against a suite of tests, for example, to evaluate it for correctness. The reward functions also evaluated syntax and formatting.
In general, the use of externally verifiable sources of truth (like simulators) is referred to as "grounding" and there has been quite a bit of research around it over the years, if you're interested in digging deeper. I've always found it super compelling as a research direction.
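A bare-bones sketch of the execute-against-tests idea; a real setup would add much stronger sandboxing (containers, resource limits, no network):

```python
import subprocess
import tempfile

def code_reward(generated_code: str, test_code: str, timeout_s: int = 10) -> float:
    """Return 1.0 if the generated code plus its tests run and exit cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```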
ks2048
I think it just means that you can objectively score an answer as being correct or not. (e.g. if the generated program passes some tests; a discovered proof is valid, etc).
drdeca
The other replies have said what was meant, but I don’t think they’ve explicitly addressed whether or not that is the sense used in the idea of NP.
I would say… it is at least somewhat similar.
A problem in NP might be of the form “For this value of X, does there exist a Y such that q(X,Y)?” for some predicate q and value X, and where when the answer is “yes”, the answer of “yes” can be verified by being given a value Y, and evaluating q(X,Y). (Specifically in the case of 3SAT, X would be a 3CNF formula, Y would be an assignment of values to the variables in the formula, and q(X,Y) would be “the formula X when evaluated with variable assignments Y, results in 'true’.”.)
This is sort of like the task of "Given requirements X that can be checked automatically, produce code Y which satisfies those requirements", except that in this case the question is specifically asking for Y, not just asking whether such a Y exists. But then, in practice, when one wants a solution to a problem in NP, one usually wants the witness, not just whether such a Y exists, right?
So, I would say there is a substantial similarity, but also a difference.
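To make the q(X, Y) example concrete, here is the 3-SAT check as code: verifying a proposed assignment is trivial even when finding one is not.

```python
def verify_3sat(clauses, assignment):
    """q(X, Y): every clause must contain at least one satisfied literal.

    `clauses` is a list of clauses, each a list of signed variable indices;
    `assignment` maps variable index -> bool.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

# (x1 or not x2 or x3) and (not x1 or x2 or x3)
print(verify_3sat([[1, -2, 3], [-1, 2, 3]], {1: True, 2: True, 3: False}))  # True
```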
HarHarVeryFunny
For some reasoning data (e.g. you talking out loud as you figure something out, mistakes and all) to be useful for RL training, the conclusion to your reasoning needs to be correct/verified, else that's not the kind of reasoning you want to learn!
Some types of reasoning output, such as solving a math problem or writing a computer program can be automatically verified (e.g. respectively by a symbolic solver, or by compiling and running the program), but in the general case it's hard for a computer to verify whether a chain of reasoning is correct and arrived at a valid answer or not, although LLM-as-judge should work some of the time.
sgt101
There's a big difference. The membership of these classes is determined by the worst case - so if there is no polynomial-time solution in the worst case, the problem isn't in P.
For this problem we don't care if it's possible that sometimes there are things that aren't verifiable, or the answers aren't exact, we just need training signal.
logicchains
As in there's an objective truth that can be determined by a computer. E.g. whether code compiles, whether a unit test passes, whether the answer given to a mathematical question like 3+5 is correct. Many other fields have no objective truth (like art or creative writing), or objective truth requires measurement of the physical world (although if the world can be simulated accurately enough for the problem class at hand, then sufficient training data can still be generated by a computer).
bpfrh
Isn't "code compiles" an insufficient criteria?
e.g you would need to prove that for all inputs the code produces the correct output which would in turn make the problem way more complex
pertymcpert
They mean that the solutions can be verified to be correct in a binary sense. E.g. a coding solution passes all the unit tests vs writing poetry.
visarga
> R1-Zero removes the human bottleneck
I disagree. It only removes the bottleneck to collecting math and code reasoning chains, not in general. The general case requires physical testing not just calculations, otherwise scientists would not need experimental labs. Discovery comes from searching the real world, it's where interesting things happen. The best interface between AI and the world are still humans, the code and math domains are just lucky to work without real world interaction.
mohsen1
The idea that a lot of compute is moving towards inference has a huge consequence for the current "AI investments". This is bad news for NVDA in particular. Inference-focused solutions (e.g. Groq) have better economics than paying NVDA those huge margins.
talldayo
Nvidia can actually charge larger margins if inference compute goes down. It would enable them to manufacture more units of smaller GPUs using inferior and cheaper silicon, all of which would increase the profits per unit sold as well as the number of units they can manufacture.
The industry has to find a way to separate itself from Nvidia's GPGPU technology if they want to stop being gouged. The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat.
buyucu
Both llama.cpp and vLLM support inference with ROCm or Vulkan.
Inference is the easiest thing to decouple from Nvidia.
logicchains
>The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat
Google has, and they've built a much more cost-efficient (for them) system: the TPU. They even rent them out, and in terms of cost per unit of compute TPUs are significantly cheaper than renting GPUs from the big cloud providers. Amazon has also tried to do something similar with Trainium chips, but their usefulness is more limited due to software issues (Amazon is much weaker at compiler development than Google, so Trainium software is quite slow and buggy).
vidarh
For inference, Nvidia has more significant competition than for training. See Groq, Google's TPUs, etc.
pants2
People talk about Groq and Cerebras as competitors, but it seems to me their manufacturing process makes the availability of those chips extremely limited. You can call up Nvidia and order $10B worth of GPUs and have them delivered the next week. Can't say the same for these specialty competitors.
panabee
Nvidia (NVDA) generates revenue with hardware, but digs moats with software.
The CUDA moat is widely unappreciated and misunderstood. Dethroning Nvidia demands more than SOTA hardware.
OpenAI, Meta, Google, AWS, AMD, and others have long failed to eliminate the Nvidia tax.
Without diving into the gory details, the simple proof is that billions were spent on inference last year by some of the most sophisticated technology companies in the world.
They had the talent and the incentive to migrate, but didn't.
In particular, OpenAI spent $4 billion, 33% more than on training, yet still ran on NVDA. Google owns leading chips and leading models, and could offer the tech talent to facilitate migrations, yet still cannot cross the CUDA moat and convince many inference customers to switch.
People are desperate to quit their NVDA-tine addiction, but they can't for now.
[Edited to include Google, even though Google owns the chips and the models; h/t @onlyrealcuzzo]
ClumsyPilot
I think future of inference is on the client side
You can do inference on almost any hardware, I do not see any edge for NVIDIA here
I can download a DeepSeek 30B model and run inference at good speed on AMD GPUs and even on CPU. Apple silicon works fine too. I get >50 tokens/s on £300 AMD GPUs.
The main bottleneck appears to be memory, not processing power.
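A rough back-of-envelope for why memory dominates: in single-stream decoding, every generated token has to stream roughly all active weights from memory, so tokens/s is approximately bandwidth divided by (quantized) model size. The bandwidth figures below are illustrative, not specific cards:

```python
def rough_tokens_per_second(params_b: float, bits_per_weight: float,
                            bandwidth_gb_s: float) -> float:
    """Upper bound: memory bandwidth / bytes of weights read per token."""
    bytes_per_token = params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

print(rough_tokens_per_second(30, 4, 500))  # ~33 tok/s on a ~500 GB/s card
print(rough_tokens_per_second(30, 4, 900))  # ~60 tok/s on a ~900 GB/s card
```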
buu700
I would argue that both things are true:
1. The future of inference for ChatGPT-style direct consumer usage is on-device. Cloud-based inference is too gaping of a privacy hole in a world where some level of E2EE is rapidly becoming the default expectation for chat. It's not hard to imagine that the iPhone 50 may be able to comfortably run models that firmly surpass GPT-4o and o1. Similarly, for things like coding and any other creation of novel IP, there are obvious security benefits to keeping the inference local.
2. Going forward, the vast majority of inference will be performed by agents for process automation (both personal and business), rather than direct user interaction. For these use cases, centralized infrastructure will be the natural architecture. Even for cases where an end client device technically exists (e.g. Tesla-Optimus-style machines), there may be economy of scale advantages to offloading compute to the cloud.
mrbungie
In fact, I'm not sure how the "we will need tons of centralized inference infrastructure" argument works when Apple, with 50%+ smartphone market share in the USA, has the totally opposite strategy, focused on privacy: on-device inference.
snovv_crash
Which AMD GPU gives you 50 tok/s on a 30b model? My 3090 does 30 tok/s with a 4 bit quant.
pertymcpert
So far it's moving towards test-time compute, true, but reasoning models are still far too large to run on the edge.
dagelf
Well, o3 scored 75% on ARC-AGI-1; R1 and o1 only 25%... watch this space though...
levocardia
What's interesting is that you can already see the "AI race" dynamics in play -- OpenAI must be under immense market pressure to push o3 out to the public to reclaim "king of the hill" status.
spoaceman7777
I suppose they're under some pressure to release o3-mini, since R1 is roughly a peer for that, but R1 itself is still quite rough. The o1 series had seen significantly more QA time to smooth out the rough edges and the idiosyncrasies of what a "production" model should be optimized for, vs. just being a top scorer on benchmarks.
We'll likely only see o3 once there is a true polished peer for it. It's a race, and companies are keeping their best models close to their chest, as they're used internally to train smaller models.
e.g., Claude 3.5 Opus has been around for quite a while, but it's unreleased. Instead, it was just used to refine Claude Sonnet 3.5 into Claude Sonnet 3.6 (3.6 is for lack of a better name, since it's still called 3.5).
We also might see a new GPT-4o refresh trained up using o3 via DeepSeek's distillation technique and other tricks.
There are a lot of new directions to go in now for OpenAI, but unfortunately, we won't likely see them until their API dominance comes under threat.
danenania
That could also definitely make sense if the SOTA models are too slow and expensive to be popular with a general audience.
amelius
Yeah, but they can use DeepSeek's new algorithm too.
mohsen1
with 57 million(!!) tokens
sheepdestroyer
From the article:
o3 (low): 75.7%, 335K tokens, $20
o3 (high): 87.5%, 57M tokens, $3.4K
mrandish
When I saw these numbers back in the initial o3-ARC post, I immediately converted them into "$ per ARC-AGI-1 %" and concluded we may be at a point where each additional increment of 'real human-like novel reasoning' gets exponentially more costly in compute.
If Mike Knoop is correct, maybe R1 is pointing the way toward more efficient approaches. That would certainly be a good thing. This whole DeepSeek release and the reactions have shown that by limiting the export of high-end GPUs to China, the US incentivized China to figure out how to make low-end GPUs work really well. The more subtle meta-lesson here is that the massive flood of investment capital being shoved toward leading-edge AI companies has fostered a drag-race mentality which prioritized winning top-line performance far above efficiency, costs, etc.
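Using the figures quoted upthread (o3 low: 75.7% for $20; o3 high: 87.5% for $3.4K), the "$ per ARC-AGI-1 %" conversion works out roughly like this:

```python
low_score, low_cost = 75.7, 20
high_score, high_cost = 87.5, 3_400

print(low_cost / low_score)                               # ~$0.26 per point
print(high_cost / high_score)                             # ~$38.9 per point
print((high_cost - low_cost) / (high_score - low_score))  # ~$286 per marginal point
```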
jl6
$3.4K is about what you might pay a magic circle lawyer for an opinion on a matter. Not saying o3 is an efficient use of resources, just saying that it’s not outlandish that a sufficiently good AI could be worth that kind of money.
Davidzheng
I view it as a positive that the methodology can take in more compute (bitter lesson style)
optimalsolver
But can o3 write a symphony?
Seriously though, I'd like to hear suggestions on how to automatically evaluate an AI model's creativity, no humans in the loop.
gsam
In my view there's two modes of creativity:
1. That two distant topics or ideas are actually much more closely related. The creative sees one example of an idea and applies it to a discipline that nobody expects. In theory, reduction of the maximally distant can probably be measured with a tangible metric.
2. Discovery of ideas that are even more maximally distant. Pushing the edge, and this can be done by pure search and randomness actually. But it's no good if it's garbage. The trick is, what is garbage? That is very context dependent.
(Also, a creative might be measured on the efficiency of these metrics rather than absolute output)
docfort
Terry Tao has referred to this classification system as foxes vs hedgehogs. https://en.m.wikipedia.org/wiki/The_Hedgehog_and_the_Fox
baq
LLMs have read everything humans made so just ask one if there’s anything truly new in that freshly confabulated slop-phony.
fragmede
we'd have to create a numerical scale for creativity, from boring to Dali, with milliEschers and MegaGeigers somewhere in there as well
rpastuszak
It's essential that we quantify everything so that we can put a price on it. I'd go with Kahlograms though.
mikejulietbravo
Mike from Baseten here
We're super proud to support this work. If you're thinking of running deepseek in production, give us a shout!
fxttr
We are currently evaluating DeepSeek-R1 for our production system. We aren't done yet, but I think it's a match.
mikejulietbravo
Awesome - we'd love to have our CEO/CTO chat with you and your team if you're interested. Shoot me a note at mike.bilodeau @ baseten.co and I'll make it happen!
WhitneyLand
Can you share at a high level how you run this model?
We know it’s 671B params with each MOE node at 37B…
If the GPUs have say, 140GB for an H200, then do you just load up as many nodes as will fit into a GPU?
How much do interconnects hurt performance vs being able to load the model into a single GPU?
philipkiely
Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size so you have to think about running the models as a whole.
There are two ways we can run it:
- 8xH200 GPU == 8x141GB == 1128 GB VRAM
- 16xH100 GPU == 16x80GB == 1280 GB VRAM
Within a single node (up to 8 GPUs) you don't see any meaningful hit from GPU-to-GPU communication.
More than that (e.g. 16xH100) requires multi-node inference which very few places have solved at a production-ready level, but it's massive because there are way more H100s out there than H200s.
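As a rough sanity check on those configurations (assuming the released FP8 weights at about one byte per parameter, and ignoring KV cache and activations):

```python
params_b = 671          # total parameters, in billions
bytes_per_param = 1     # FP8 weights, roughly one byte each
weights_gb = params_b * bytes_per_param

print(weights_gb)            # ~671 GB just for weights
print(8 * 141 - weights_gb)  # ~457 GB left on 8xH200 for KV cache etc.
print(16 * 80 - weights_gb)  # ~609 GB left on 16xH100
```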
nv35
> Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size
In their V3 paper DeepSeek talk about having redundant copies of some "experts" when deploying with expert parallelism in order to account for the different amounts of load they get. I imagine it only makes a difference at very high loads, but I thought it was a pretty interesting technique.
littlestymaar
Earlier today I read a reddit comment[1] from a guy who tried running the quantized version from unsloth[2] on 4xH100, and the results were underwhelming (it ended up costing $137 per 1 million tokens).
Any idea of what they're doing wrong?
[1]: https://www.reddit.com/r/LocalLLaMA/comments/1icphqa/how_to_...
philipkiely
They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.
The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor the R1 Distills like Llama 70B are great and should run a lot faster as they take advantage of existing optimizations around inferencing llama-architecture models.
littlestymaar
> They're using Llama.cpp which is an amazing tool for local inference but doesn't match fast inference frameworks like TensorRT-LLM/SGLang for production speeds and throughputs on Hopper GPUs.
That's something I thought about, but it wouldn't explain much, as they are roughly two orders of magnitude off in terms of cost, and only a small fraction of that could be explained by the performance of the inference engine.
> The Unsloth quantizations are really cool, but if you want to experiment with the R1 models in a smaller form factor the R1 Distills like Llama 70B are great and should run a lot faster as they take advantage of existing optimizations around inferencing llama-architecture models.
What kind of optimization do you have in mind? Because DeepSeek has only 37B active parameters, which means ~12GB at this level of quantization, so inference ought to be much faster than a dense 70B model, especially an unquantized one, no? The Llama 70B distill would benefit from speculative decoding, though that shouldn't be enough to compensate. So I'm really curious what kind of llama-specific optimizations you mean, and how much speedup you think they'd bring.
coder543
I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
As the comments on reddit said, those numbers don’t make sense.
littlestymaar
> I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
That was my first thought as well, but from a quick search it looks like Llama.cpp has a default batch size that's quite high (like 256 or 512, I don't remember exactly, which I find surprising for something that's mostly used by local users), so that shouldn't be the issue.
> As the comments on reddit said, those numbers don’t make sense.
Absolutely, hence my question!
hnburnsy
>There are two major shifts happening in AI, economically speaking:
>You can now spend more $ to get higher accuracy and reliability
>Training $ is moving to inference $
>Both are going to drive a massive amount of demand for inference and neither will curtail the demand for more compute. In fact, they will increase the demand for compute.
Is this Nvidia compute or something else?
gorbypark
Nvidia has much less of a moat on the inference side of things. Of course they still dominate the market right now for inference (in datacenters), but it's much easier for companies to move onto AMD or other solutions like Groq or whatever compared to trying to use non-Nvidia for training.
usaar333
Overall good post, but feels like he has an axe to grind with LLMs to the point it is misleading:
> Last week, DeepSeek published their new R1-Zero and R1 “reasoner” systems that is competitive with OpenAI’s o1 system on ARC-AGI-1. R1-Zero, R1, and o1 (low compute) all score around 15-20% – in contrast to GPT-4o’s 5%, the pinnacle of years of pure LLM scaling
R1-Zero gets 14% on the private set, which is the exact same score June Sonnet got; Sonnet, not 4o, is the pinnacle of pure LLM scaling.
polishdude20
I predict that the future of LLM's when it comes to coding and software creation is in "custom individually tailored apps". Imagine telling an AI agent what app you want, the requirements and all that and it just builds everything needed from backend to frontend, asks for your input on how things should work, clarifying questions etc.
It tests the software by compiling and running it reading errors and failed tests and fixing the code.
Then, it deploys the software in production for you. It compiles your app to an APK file and publishes it on the Google play store for example.
Sure, an LLM now may still not be able to get everything perfect as far as its outputs go. But surely there are already systems and workflows in place that will auto-run your code, compile it, feed errors back to the LLM, plus APIs to interact with cloud providers for hosting, etc.?
kristjansson
Most people really do not know what they want at any level of detail.
travoc
It's ok, they'll know it when they see it. Keep trying.
jacobsenscott
What's it called when you describe an app with sufficient detail that a computer can carry out the processes you want? Where will the record of those clarifying questions and updates be kept? What if one developer asks the AI to surreptitiously round off pennies and put those pennies into their bank account? Where will that change be recorded, will humans be able to recognize it? What if two developers give it conflicting instructions? Who's reviewing this stream of instructions to the LLM?
"AI" driven programming has a long way to go before it is just a better code completion.
repelsteeltje
That.
Plus coding (producing a working program that fits some requirement) is the least interesting part of software development. It adds complexity, bugs and maintenance.
throw310822
> What's it called when you describe an app with sufficient detail that a computer can carry out the processes you want?
You're wrong here. The entire point is that these are not computers as we used to think of them. These things have common sense; they can analyse a problem including all the implicit aspects, suggest and evaluate different implementation methods, architectures, interfaces.
So the right question is: "what's it called when you describe an app to a development team and they ask back questions and come back with designs and discuss them with you, and finally present you with an mvp, and then you iterate on that?"
Vampiero
Bold of you to imply that GPT asks questions instead of making baseless assumptions every 5 words, even when you explicitly instruct it to ask questions if it doesn't know. When it constantly hallucinates command line arguments and library methods instead of reading the fucking manual.
It's like outsourcing your project to [country where programmers are cheap]. You can't expect quality. Deep down you're actually amazed that the project builds at all. But it doesn't take much to reveal that it's just a facade for a generous serving of spaghetti and bugs.
And refactoring the project into something that won't crumble in 6 months requires more time than just redoing the project from scratch, because the technical debt is obscenely high, because those programmers were awful, and because no one, not even them, understands the code or wants to be the one who has to reverse engineer it.
Except that AI is actually MUCH more expensive!
jumploops
The future is bespoke software.
In some sense, this is how computers were always supposed to work!
jrsdav
I have been trying to imagine something similar, but without all the middleware/distribution layer. You need to do a thing? The LLM just does it and presents the user with the desired experience. Kind of upending the notion that we need "apps" in the first place. It's all materialized, just-in-time style.
energy123
A little further out from that could be the LLM acting as the runtime environment. No code. It's just data in (user inputs etc) -> GUI out.
IAmGraydon
Most software is useful because a large number of people can interact with it or with each other over it. I'm not so certain that one-off software would be very useful for anyone beyond very simple functionality.
prmph
This will almost certainly never materialize, and the reasons are not just technical
acchow
Have you tried https://bolt.diy ?
It does what you describe
IAmGraydon
It claims to do what he describes.
artninja1988
>Ultimately, R1-Zero demonstrates the prototype of a potential scaling regime with zero human bottlenecks – even in the training data acquisition itself.
I would like this to be true, but doesn't the way they're doing RL also require tons of human data?
Davidzheng
I think yes. But hopefully in math with compute advances we can lower the human data input by increasing the gap that is bridged by raw model capabilities vs search augmentation (either with tree search or full rollouts)
cbracketdash
It's a bit deceptive that o3 conveniently had access to ARC-Prize-specific training material while R1 probably didn't. [0]
> But now with reasoning systems and verifiers, we can create brand new legitimate data to train on. This can either be done offline where the developer pays to create the data or at inference time where the end user pays!
> This is a fascinating shift in economics and suggests there could be a runaway power concentrating moment for AI system developers who have the largest number of paying customers. Those customers are footing the bill to create new high quality data … which improves the model … which becomes better and more preferred by users … you get the idea.
While I think this is an interesting hypothesis, I'm skeptical. You might be lowering the cost of your training corpus by a few million dollars, but I highly doubt you are getting novel, high quality data.
We are currently in a world where SOTA base model seems to be capped at around GPT4o levels. I have no doubt that in 2-3 years our base models will compete with o1 or even o3... just it remains to be seen what innovations/optimizations get us there.
The most promising idea is to use reasoning models to generate data, and then train our non-reasoning models with the reasoning-embedded data. But... it remains to be seen how much of the chain-of-thought reasoning you can really capture in model weights. I'm guessing some, but I wonder if there is a cap to the multi-head attention architecture. If reasoning can be transferred from reasoning models to base models, OpenAI should have already trained a new model with o3 training data, right?
Another thought is maybe we don't need to improve our base models much. It's sufficient to have them be generalists, and to improve reasoning models (lowering price, improving quality) going forward.