Deepseek R1-0528
246 comments
· May 28, 2025 · jacob019
jazzyjackson
No sign of what source material it was trained on though right? So open weight rather than reproducible from source.
I remember there's a project, "Open R1", that last I checked was working on gathering its own list of training material. It looks active, but I'm not sure how far along they've gotten:
pradn
Isn't it basically impossible for the training data set to be listed? It's an open secret that all these labs are using immense amounts of copyrighted material.
There are a few efforts at fully open-data / open-weight / open-code models, but none of them have reached leading-edge performance.
ratamacue
My brain was largely trained using immense amounts of copyrighted material as well. Some of it I can even regurgitate almost exactly. I could list the names of many of the copyrighted works I have read/watched/listened to. I suppose my brain isn't open source, although I don't think it would currently be illegal to take a snapshot of my brain and publish it if the technology existed and open-source that. Granted, this would only be "reproducible" from source if you define the "source" as "my brain" rather than all of the material I consumed to make that snapshot.
3abiton
The only way this would work is with "leaks". But even then, as we saw with everything on the internet, it just added more guardrails on content. Now I can't watch YouTube videos without logging in, and on nearly every website I have to solve some weird captchas. It's becoming easier to interact with these chatbots than to search for a solution online. And with Veo 4 copycats, I wonder if it might become even easier to prompt for a video than to search for one.
prmoustache
That doesn't mean it isn't possible.
bee_rider
“Not possible” = “a business-destroying level of honesty”?
behnamoh
> No sign of what source material it was trained on though right?
out of curiosity, does anyone do anything "useful" with that knowledge? it's not like people can just randomly train models..
marci
When you're truly open source, you can make things like this:
> Today we introduce OLMoTrace, a one-of-a-kind feature in the Ai2 Playground that lets you trace the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace is a manifestation of Ai2's commitment to an open ecosystem – open models, open data, and beyond.
ToValueFunfetti
Would be useful for answering "is this novel or was it in the training data", but that's not typically the point of open source.
anonymoushn
If labs provided the corpus and source code for training their tokenizers, it would be a lot easier to produce results about tokenizers. As it is, they provide neither, so it is impossible to compare different algorithms running on the same data if you also want to include the vocabs that are commonly used.
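For what it's worth, the comparison itself is only a few lines once a corpus exists. A rough sketch with the Hugging Face tokenizers library, where the corpus file and vocab size are stand-ins rather than anything any lab has released:

    # Sketch: train two tokenizer algorithms on the same (hypothetical) corpus
    # and compare their vocabularies. "corpus.txt" is a placeholder.
    from tokenizers import Tokenizer, models, trainers, pre_tokenizers

    def train(model, trainer, files):
        tok = Tokenizer(model)
        tok.pre_tokenizer = pre_tokenizers.ByteLevel()
        tok.train(files, trainer)
        return tok

    files = ["corpus.txt"]  # the actual pretraining corpora are not published
    bpe = train(models.BPE(), trainers.BpeTrainer(vocab_size=32000), files)
    uni = train(models.Unigram(), trainers.UnigramTrainer(vocab_size=32000), files)

    overlap = set(bpe.get_vocab()) & set(uni.get_vocab())
    print(f"shared tokens between BPE and Unigram vocabs: {len(overlap)}")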
DANmode
Depending on how you use "randomly", they absolutely can..?
m00x
Many are speculating it was trained on o1/o3 outputs for some of the initial reasoning.
fulafel
Are there any widely used models that publish this? If not, then no I guess.
chrsw
Based on the commit history, Open R1 is still active and they're still making progress. Long may it continue; it's an ambitious project.
therealpygon
This was simply a mad scramble to prove or disprove the claims OpenAI was peddling: that the model wasn't actually performing as well as advertised and that they were lying about the training/compute resources. Open-R1 has since applied the training to a similar 7B model and got similar results. At the end of the day, no one really cares what data it was trained on; most AI providers don't share this when releasing open-source models either, and it's certainly not available for closed-source models.
make3
I don't think people make the distinction like that. The open-source vs. non-open-source distinction usually boils down to whether you can use it for commercial purposes.
What you're saying is just that it's not reproducible, which is a completely valid but separate issue.
alpaca128
There are already established terms and licenses for non-commercial use, like "open weights".
Open source has the word "source" in it for a reason, and those models aren't open source and have nothing to do with it.
piperswe
But where's the source? I just see a binary blob, what makes it open source?
JKCalhoun
Is there a downloadable model? (Not familiar with OpenRouter, and not seeing the model on Ollama.)
zargon
This HN submission goes directly to the downloadable model.
aldanor
Open weights.
fragmede
It's. not. open. source!
cavisne
"knowing why a model refuses to answer something matters"
The companies that create these models can't answer that question! Models get jailbroken all the time to ignore alignment instructions. The robust refusal logic normally sits on top of the model, i.e. looking at the responses and flagging anything they don't want to show to users.
The best tool we have for understanding whether a model is refusing to answer or actually doesn't know is mechanistic interpretability, which only requires the weights.
This whole debate is weird; even with traditional open-source code you can't tell the intent of a programmer, what sources they used to write that code, etc.
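To make the mech-interp point concrete, the weights really are all you need. A toy illustration below uses GPT-2 via TransformerLens (not R1, just the smallest thing that shows the idea of probing internal activations):

    # Toy weights-only probe: compare residual-stream activations across two prompts,
    # layer by layer. GPT-2 stands in here; the same idea applies to any open-weight model.
    import torch
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")

    _, cache_a = model.run_with_cache("How do I bake bread at home?")
    _, cache_b = model.run_with_cache("How do I pick a lock at home?")

    for layer in range(model.cfg.n_layers):
        a = cache_a["resid_post", layer][0, -1]  # final-token residual stream
        b = cache_b["resid_post", layer][0, -1]
        sim = torch.cosine_similarity(a, b, dim=0).item()
        print(f"layer {layer:2d}: cosine similarity {sim:.3f}")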
echelon
Open source is a crazy new beast in the AI/ML world.
We have numerous artifacts to reason about:
- The model code
- The training code
- The fine tuning code
- The inference code
- The raw training data
- The processed training data (which might vary across various stages of pre-training and potentially fine-tuning!)
- The resultant weights
- The inference outputs (which also need a license)
- The research papers (hopefully it's described in literature!)
- The patents (or lack thereof)
The term "open source" is wholly inadequate here. We need a 10-star grading system for this.
This is not your mamma's C library.
AFAICT, DeepSeek scores 7/10, which is better than OpenAI's 0/10 (they don't even let you train on the outputs).
This is more than enough to distill new models from.
Everybody is laundering training data, and it's rife with copyrighted data, PII, and pilfered outputs from other commercial AI systems. Because of that, I don't expect we'll see much legally open training data for some time to come. In fact, the first fully open training data of adequate size (not something like LJSpeech) is likely to be 100% synthetic or robotically-captured.
reedciccio
https://opensource.org/ai ... lots of reasoning has been done on those artifacts.
Tepix
I think you're trying to make it look more complex than it is. Put the amount of data next to every entry in that list of yours.
xnickb
I'd argue we don't need a 10-star system. The single bit we have now is enough. And the question is also pretty clear: did $company steal other people's work?
The answer is also known. So the reason one would want an open-source model (read: reproducible model) would be ethics.
behnamoh
it's got more 'source' than whatever OpenAI provides for their models.
numpad0
less alcoholic beverages are fully alcoholic beverages
stavros
No it doesn't; it has exactly the same amount of source: zero. It just has more downloadable binary.
acheong08
No information to be found about it. Hopefully we get benchmarks soon. Reminds me of the days when Mistral would just tweet a torrent magnet link
chvid
Benchmarks seem like a fool's errand at this point; models are over-tuned to specific, already-published tests rather than made to generalize.
Hugging Face has a leaderboard, and it seems dominated by fine-tunes of various common open-source models that don't seem to see broader use:
EvgeniyZh
There are quite a few benchmarks for which that's not the case:
- live benchmarks (livebench, livecodebench, matharena, SWE-rebench, etc)
- benchmarks that do not have a fixed structure, like games or human feedback benches (balrog, videogamebench, arena)
- (to some extent) benchmarks without published answers (PutnamBench, FrontierMath). You could argue that someone could hire people to solve those or pay off the benchmark developers, but it's much more complicated.
Most of the benchmarks that don't try to tackle future contamination are much less useful, that's true. Unfortunately, HLE kind of ignored it (they plan to add a hidden set to test for contamination, but once the answers are there, it's a lost game IMHO); I really liked the concept.
Edit: it is true that these benchmarks are focusing only on a fairly specific subset of the model capabilities. For everything else vibe check is your best bet.
chvid
I agree with you.
Of course, some benchmarks are still valid and will remain valid. E.g. we can make the models play chess against each other and score them on how well they do. But those benchmarks are in general fairly narrow; they don't really measure the "broader" intelligence we are after. And often LLMs perform worse than specialized models: I don't think there is any LLM out there that can beat a traditional chess program (certainly not using the same computing power).
What is really bad are the QA benchmarks, which leak over time into the training data of the models. And one can suspect that even big labs have an economic incentive to score well on popular benchmarks, which causes them to tune the models way beyond what is reasonable.
And taking a bunch of flawed benchmarks, combining them into indexes, and saying this model is 2% better than that model is completely meaningless, but of course it's fun and draws a lot of attention.
So, yes, we are kind of left with vibe checks, but in theory, we could do more; take a bunch of models, double-blind, and have a big enough, representative group of human evaluators score them against each other on meaningful subjects.
Of course, done right, that would be really expensive. And those sponsoring might not like the result.
behnamoh
right, all benchmarks collapse once you go beyond 32K tokens. I've rarely seen any benchmarks focusing on long context, which is where most programming needs are.
lossolo
The only benchmarks that match my experience with different models are here https://livebench.ai/#/
ribelo
livebench was good, but now it's a joke. Gemini Flash scores better at coding than Pro and Sonnet 3.7, and that's only the beginning of the weird results.
halyconWays
> models are over-tuned to specific, already-published tests rather than made to generalize
I think you just described SATs and other standardized tests
Mistletoe
SAT has a correlation to IQ of 0.82 to 0.86 and I do think IQ is very useful in judging intelligence.
kbumsik
Artificial Analysis is the only stable source. Don't look at others like HF Leaderboard.
z2
There's a table here showing some "Overall" and "Median" scores, but no context on what exactly was tested. It appears to be in the same ballpark as the latest models, but with some cost advantages and the downside of being just as slow as the original R1 (likely lots of thinking tokens). https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd....
xelos
It's appeared on the LiveCodeBench leaderboard too. Performance on par with o4-mini - https://livecodebench.github.io/leaderboard.html
swyx
i think deepseek usually posts a paper about a day after a model release.
no idea why they can't just wait a bit to coordinate; it's a bit messy in the news cycle.
aibrother
getting a similar vibe yeah. given how adjacent they are, wouldn't be surprised if this was an intentional nod from DeepSeek
willchen
I love how Deepseek just casually drops new updates (that deliver big improvements) without fanfare.
doctoboggan
Honest question, how do you know this is a big improvement? Are there any benchmarks anywhere?
KeyBoardG
There will be a video from FireShip if it's a big one. /s
sundarurfriend
Ah, FireShip, I forgot that channel existed at all. I asked YouTube to stop recommending it after every vaguely AI-related news item became "BIG NEWS!!!", the videos were thin on actual content, and there were repeated factual errors across multiple videos. At that point, the only thing it's good for is to make yourself (falsely) feel like you're keeping up.
therein
Much preferable to what OpenAI has always done and Anthropic recently started doing: write some complicated narrative about how scary the new model is and how it tried to escape and deceive and hack the mainframe while telling the alignment operators bedtime stories.
camkego
Really? I missed this. The new hype trick is implying the new LLM releases are almost AGI? Love it.
IceWreck
Anthropic "warned" Claude 4 is so smart that it will try to use the terminal (if using Claude Code) or any other tools available (depending on where you're invoking it from) to contact local authorities if you're doing something very immoral.
ilaksh
I think they did make an announcement on WeChat.
modeless
I like it too, but some benchmark numbers would be nice at least.
hd4
On the day Nvidia reports earnings, too. Pretty sure it's just a coincidence, bro.
margorczynski
Yeah, the timing seems strange. Considering how much money will change hands based on those results, this might be some kind of play to manipulate the market, at least a bit.
consumer451
I believe that they are funded by a hedge fund. So, there are no coincidences here.
rwmj
Is releasing a better product really "market manipulation"? It seems to me like regular, good competition.
Maxatar
How does releasing it today affect the market compared to releasing it last week?
belter
Plenty of manipulation to go around..
"Tech Chip software stocks sink on report Trump ordered halt to China sales" - https://www.cnbc.com/2025/05/28/chip-software-trump-china.ht...
esafak
Anyone got benchmarks?
dyauspitr
What big improvements?
transcriptase
Out of sheer curiosity: What’s required for the average Joe to use this, even at a glacial pace, in terms of hardware? Or is it even possible without using smart person magic to append enchanted numbers and make it smaller for us masses?
danielhanchen
We made DeepSeek R1 run on a local device via offloading and 1.58bit quantization :) https://unsloth.ai/blog/deepseekr1-dynamic
I'm working on the new one!
behnamoh
> 1.58bit quantization
of course we can run any model if we quantize it enough. but I think the OP was talking about the unquantized version.
danielhanchen
Oh you can still run them unquantized! See https://docs.unsloth.ai/basics/llama-4-how-to-run-and-fine-t... where we show you can offload all MoE layers to system RAM, and leave non MoE layers on the GPU - the speed is still pretty good!
You can do it via `-ot ".ffn_.*_exps.=CPU"`
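If you'd rather drive it from Python, a rough (coarser) equivalent is llama-cpp-python's layer-level offload rather than the tensor-level `-ot` override above; the GGUF filename here is just a placeholder for whichever quant you download:

    # Sketch: partial GPU offload with llama-cpp-python (layer-level, not tensor-level).
    from llama_cpp import Llama

    llm = Llama(
        model_path="DeepSeek-R1-0528-Q2_K.gguf",  # placeholder path to a local quant
        n_gpu_layers=20,   # as many layers as your VRAM holds; the rest stay in system RAM
        n_ctx=8192,
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain MoE offloading in one paragraph."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])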
CamperBob2
Your 1.58-bit dynamic quant model is a religious experience, even at one or two tokens per second (which is what I get on my 128 GB Raptor Lake + 4090). It's like owning your own genie... just ridiculously smart. Thanks for the work you've put into it!
nxobject
Likewise - for me, it feels how I imagined getting a microcomputer in the 70s was like. (Including the hit to the wallet… an Apple II cost the 2024 equivalent of ~$5k, too.)
danielhanchen
Oh thank you! :) Glad they were useful!
screaminghawk
I use this a lot! Thanks for your work and looking forward to the next one
danielhanchen
Thank you!! New versions should be much better!
terhechte
You can run the 4-bit quantized version on an M3 Ultra with 512GB. That's quite expensive though. Another alternative is a fast CPU with 500GB of DDR5 RAM; that, of course, is also not cheap, and it's slower than the M3 Ultra. Or you buy multiple Nvidia cards to reach ~500GB of VRAM. That is probably the most expensive option, but also the fastest.
lodovic
If you only use the excess memory for AI, it's cheaper to rent. A single H100 costs less than $2 per hour (incl. power).
diggan
Vast.ai has a bunch of 1x H100 SXM available, right now the cheapest at $1.554/hr.
Not affiliated, just a (mostly) happy user, although don't trust the bandwidth numbers; lots of variance (not surprising though, it's a user-to-user marketplace).
omneity
Worth mentioning that a single H100 (80-96GB) is not enough to run R1. You're looking at 6-8 GPUs on the lower end, and factor in the setup and download time.
An alternative is to use serverless GPU or LLM providers which abstract some of this for you, albeit at a higher cost and slow starts when you first use your model for some time.
behohippy
About 768 GB of DDR5 RAM on a dual-socket server board with 12-channel memory, plus an extra 16 GB or better GPU for prompt processing. It's a few grand just to run this thing at 8-10 tokens/s.
wongarsu
About $8000 plus the GPU. Throw in a 4080 for about $1k and you have the full setup for the price of three RTX 5090s, or cheaper than a single A100. That's not a bad deal.
For the hobby version you would presumably buy a used server and a used GPU. DDR4 ECC RAM can be had for a little over $1/GB, so you could probably build the whole thing for around $2k.
JKCalhoun
Been putting together a "mining rig" [1] (or rather I was, before the tariffs, ha ha). Going to try to add a 2nd GPU soon. (And I should try these quantized versions.)
Mobo was some kind of mining board from AliExpress for less than $100. GPU is an inexpensive NVIDIA Tesla card that I 3D-printed a shroud for (added fans). Power supply is a cheap 2000-watt Dell server PSU off eBay...
[1] https://bsky.app/profile/engineersneedart.com/post/3lmg4kiz4...
phonon
This is the state of the art for such a setup. Really good performance!
mechagodzilla
I have a $2k used dual-socket Xeon with 768GB of DDR4. It runs at about 1.5 tokens/sec for the 4-bit quantized version.
SkyPuncher
Practically, smaller quantized versions of R1 can run on a pretty typical MacBook Pro setup. Quantized versions are definitely less capable, but they will absolutely run.
Truthfully, it's just not worth it. You either run these things so slowly that you're wasting your time, or you buy four or five figures of hardware that will sit mostly unused.
hu3
It's probably going to be free on OpenRouter.
There's already a 685B-parameter DeepSeek V3 for free there.
latchkey
It is free to use, but you're feeding OR data and someone is profiting off that.
ankit219
That's how a lot of application-layer startups are going to make money. There is a bunch of high-quality usage data: either you monetize it yourself (Cursor), get acquired (Windsurf), or provide that data to others for a fee (LMSYS, Mercor). This is inevitable, and the market for it is only going to grow. If you want to prevent this as an org, there aren't many ways out: either use open-source models you can deploy yourself, or deal directly with model providers where you can sign specific contracts.
85392_school
You're actually sending data to random GPUs connected to one of the Bittensor subnets that run LLMs.
dist-epoch
Not every prompt is privacy sensitive.
For example you could use it to summarize a public article.
whynotmaybe
I'm using GPT4All with DeepSeek-R1-Distill-Qwen-7B (which is not R1-0528) on a Ryzen 5 3600 with 32 GB of RAM.
At an average of 3.6 tokens/sec, answers usually take 150-200 seconds.
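For anyone who wants to reproduce that kind of measurement, it's only a few lines with GPT4All's Python bindings (the model filename below is approximate; GPT4All fetches it on first run):

    # Rough sketch: time a local generation with GPT4All's Python bindings.
    import time
    from gpt4all import GPT4All

    model = GPT4All("DeepSeek-R1-Distill-Qwen-7B-Q4_0.gguf")  # approximate model name

    start = time.time()
    with model.chat_session():
        reply = model.generate("Summarize the trade-offs of running a 7B model on CPU.",
                               max_tokens=512)
    elapsed = time.time() - start

    print(reply)
    print(f"finished in {elapsed:.0f}s (~{len(reply.split()) / elapsed:.1f} words/s)")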
hadlock
As mentioned, you can run this on a server board with 768+ GB of memory in CPU mode. The average Joe is going to be running quantized 30B (not 600B+) models on a $300/$400/$900 8/12/16 GB GPU.
rahimnathwani
I'm not sure that's enough RAM to run it at full precision (FP8).
This guy ran a 4-bit quantized version with 768GB RAM: https://news.ycombinator.com/item?id=42897205
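Back-of-envelope (treating FP8 as one byte per parameter): 671B parameters is roughly 671 GB for the weights alone, before KV cache and activations, so 768 GB leaves very little headroom; a 4-bit quant is around 335-380 GB and fits with room to spare.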
karencarits
What use cases are people using local LLMs for? Have you created any practical tools that actually increase your efficiency? I've been experimenting a bit but find it hard to get inspiration for useful applications
jsemrau
I have a signal tracer that evaluates unusual trading volumes. Given those signals, my local agent pulls news items through an API to assess what's happening. This helps me tremendously. If I did this through a remote API, I'd have to spend several dollars per day, so I run it on existing hardware.
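Not the parent's actual setup, but roughly what such a pipeline could look like; the z-score threshold, the news source, and the local OpenAI-compatible endpoint are all made-up stand-ins:

    # Hypothetical sketch: flag unusual volume, then ask a local LLM to assess the news.
    import statistics
    import requests

    def unusual_volume(history, latest, z_threshold=3.0):
        mean, stdev = statistics.mean(history), statistics.stdev(history)
        return stdev > 0 and (latest - mean) / stdev > z_threshold

    def assess(ticker, headlines):
        # Assumes a local OpenAI-compatible server (llama.cpp, Ollama, etc.) on port 8080.
        prompt = f"Unusual trading volume in {ticker}. Headlines: {headlines}. What likely happened?"
        r = requests.post("http://localhost:8080/v1/chat/completions", json={
            "model": "local-model",
            "messages": [{"role": "user", "content": prompt}],
        })
        return r.json()["choices"][0]["message"]["content"]

    history = [1.1e6, 0.9e6, 1.0e6, 1.2e6, 0.95e6]   # made-up daily volumes
    if unusual_volume(history, latest=4.8e6):
        print(assess("XYZ", ["(headlines pulled from whatever news API you use)"]))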
karencarits
Thank you, this is a great example!
dyauspitr
Do you want to share it?
codedokode
Anyone who does not want to leak their data? I am actually surprised that people are ok with trusting their secrets to a random foreign company.
karencarits
But what do you do with these secrets? Like tagging emails, summarizing documents?
lurking_swe
a document management system is an easy example. Let’s say medical, legal, and tax documents.
nprateem
No one cares about your 'secrets' as much as you think. They're only potentially valuable if you're doing unpatented research or they can tie them back to you as an individual. The rest is paranoia.
Having said that, I'm paranoid too. But if I wasn't they'd have got me by now.
lurking_swe
step back for a bit: some people actually work with sensitive documents as part of their JOB, like accountants, lawyers, people in the medical industry, etc.
Sending a document with a social security number to OpenAI is just a dumb idea, as an example.
rurban
A random foreign company is far better than one from a big Five Eyes country, which siphons everything to the NSA to be used against you.
Meanwhile, the Chinese intelligence agencies won't have much power over you.
itsmevictor
I do a lot of data cleaning as part of my job, and I've found that small models could be very useful for that, particularly in the face of somewhat messy data.
You can, for instance, use them to extract information such as postal codes from strings, or to translate and standardize country names written in various languages (e.g. Spanish, Italian, and French to English), etc.
I'm sure people will have more advanced use cases, but I've found them useful for that.
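As a made-up but concrete example of that kind of cleanup with a small local model, here's what it might look like against Ollama's HTTP API (the model tag is whatever small instruct model you happen to have pulled):

    # Sketch: standardize messy country names with a small local model via Ollama.
    import requests

    def standardize_country(raw: str) -> str:
        r = requests.post("http://localhost:11434/api/generate", json={
            "model": "qwen2.5:3b",  # any small local instruct model
            "prompt": f"Return only the English name of this country, nothing else: {raw}",
            "stream": False,
        })
        return r.json()["response"].strip()

    for messy in ["Allemagne", "Estados Unidos", "Italia"]:
        print(messy, "->", standardize_country(messy))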
lvturner
Also worth it for the speed of AI autocomplete in coding tools; the round trip to my graphics card is much faster than going out over the network.
mbac32768
Anyone actually doing this? The DeepSeek-R1 32B on Ollama can't run on an RTX 4090, and the 17B is nowhere near as good at coding as the OpenAI or Claude models.
lvturner
I specified autocomplete; I'm not running a whole model, asking it to build something, and awaiting the output.
DeepSeek-Coder-V2 is fine for this. I occasionally use a smaller Qwen3 (I forget exactly which at the moment... set and forget) for some larger queries about code; given my fairly light use cases and pretty small contexts it works well enough for me.
sudomarcma
Any company with any kind of sensitive data will love having anything LLM-related done locally.
thenameless7741
A recent example: a law firm hired this person [0] to build a private AI system for document summarization and Q&A.
[0] https://xcancel.com/glitchphoton/status/1927682018772672950
bcoates
I use the local LLM-based autocomplete built into PyCharm and I'm pretty happy with it
jacob019
Not much to go off of here. I think the latest R1 release should be exciting. 685B parameters. No model card. Release notes? Changes? Context window? The original R1 has impressive output but really burns tokens to get there. Can't wait to learn more!
deepsquirrelnet
I think it’s cool to see this kind of international participation in fierce tech competition. It’s exciting. It’s what I think capitalism should be.
This whole “building moats” and buying competitors fascination in the US has gotten boring, obvious and dull. The world benefits when companies struggle to be the best.
mjcohen
DeepSeek seems to be one of the few LLM apps that will run on an iPod Touch, because of its older version of iOS.
cropcirclbureau
Hey! You! You can't just say that and not explain. Come back.
MrPowerGamerBR
If I had to guess, they were talking about the DeepSeek iOS app: https://apps.apple.com/br/app/deepseek-assistente-de-ia/id67...
titaniumtown
... What?
AJAlabs
671B parameters! Well, it doesn't look like I'll be running that locally.
amy_petrik
there is a small community of people who do indeed run this locally, typically on CPU/RAM (lots and lots of RAM), insofar as that's cheaper than GPUs.
cesarvarela
About half the price of o4-mini-high for not much worse performance. Interesting.
edit: most providers are offering a quantized version...
htrp
You're gonna need at least 8 H100 80GB cards for this...
canergly
I want to see it on Groq ASAP!
porphyra
Groq doesn't even have any true DeepSeek models --- I thought they only had `deepseek-r1-distill-llama-70b`, which was distilled onto Llama 70B [1].
jacob019
Groq has a weak selection of models, which is frustrating because their inference speed is insane. I get it though, selection + optimization = performance.
jbentley1
From a conversation with someone from Groq: they have a custom compiler and runtime for running models on their custom hardware, which is why the selection is poor. For every model type they need to port the architecture to their compiler beforehand.
sergiotapia
the only reason they are fast is because the models they host are severely quantized, or so I've heard.
Well that didn't take long: available from 7 providers through OpenRouter.
https://openrouter.ai/deepseek/deepseek-r1-0528/providers
May 28th update to the original DeepSeek R1. Performance on par with OpenAI o1, but open-sourced and with fully open reasoning tokens. It's 671B parameters in size, with 37B active in an inference pass.
Fully open-source model.
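For reference, calling it through OpenRouter's OpenAI-compatible API takes only a few lines; the model slug below comes from the providers URL above, and the API-key environment variable name is just a convention:

    # Sketch: query deepseek/deepseek-r1-0528 via OpenRouter's OpenAI-compatible endpoint.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],  # assumes you've set this env var
    )

    resp = client.chat.completions.create(
        model="deepseek/deepseek-r1-0528",
        messages=[{"role": "user", "content": "What changed in the 0528 update?"}],
    )
    print(resp.choices[0].message.content)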