
DeepSeek-R1

116 comments

·January 20, 2025

ozgune

> However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.

We've been running qualitative experiments on OpenAI o1 and QwQ-32B-Preview [1]. In those experiments, I'd say there were two primary things going against QwQ. First, QwQ went into endless repetitive loops, "thinking out loud" what it said earlier maybe with a minor modification. We had to stop the model when that happened; and I feel that it significantly hurt the user experience.

It's great that DeepSeek-R1 fixes that.

The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.

Either way, I'm super excited that DeepSeek-R1 comes with an MIT license. This will notably increase how many people can evaluate advanced reasoning models.

[1] https://github.com/ubicloud/ubicloud/discussions/2608

ozgune

The R1 GitHub repo is way more exciting than I had thought.

They aren't only open sourcing R1 as an advanced reasoning model. They are also introducing a pipeline to "teach" existing models how to reason and align with human preferences. [2] On top of that, they fine-tuned Llama and Qwen models that use this pipeline; and they are also open sourcing the fine-tuned models. [3]

This is *three separate announcements* bundled as one. There's a lot to digest here. Are there any AI practitioners who could share more about these announcements?

[2] We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.

[3] Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

ankit219

> The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.

This is probably the result of a classifier which determines at the start whether it has to go through the whole CoT. On tough problems it mostly does; otherwise, it just answers directly. Many papers (on scaling test-time compute, and the MCTS one) have described this as a necessary strategy for improving outputs across all kinds of inputs.

pixl97

>if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email.

Did o1 actually do this on a user hidden output?

At least in my mind, if you have an AI that you want to keep from outputting harmful content to users, this seems like a necessary step.

Also, if you have other user context stored then this also seems like a means of picking that up and reasoning on it to create a more useful answer.

Now, for summarizing an email itself it seems a bit of a waste of compute, but for more advanced queries it's possibly useful.

ozgune

Yes, o1 hid its reasoning. Still, it provided a summary of its reasoning steps. In the email case, o1 thought for six seconds, summarized its thinking as "summarizing the email", and then provided the answer.

We saw this in other questions as well. For example, if you asked o1 to write a "python function to download a CSV from a URL and create a SQLite table with the right columns and insert that data into it", it would immediately produce the answer. [4] If you asked it a hard math question, it would try dozens of reasoning strategies before producing an answer. [5]

[4] https://github.com/ubicloud/ubicloud/discussions/2608#discus...

[5] https://github.com/ubicloud/ubicloud/discussions/2608#discus...

coffeebeqn

I think o1 does do that. It once spit out the name of the expert model for programming in its “inner monologue” when I used it. Click on the grey “Thought about X for Y seconds” and you can see the internal monologue.

cma

> The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email.

The full o1 reasoning traces aren't available; you just have to guess what it's doing from the summary.

Sometimes you put in something like "hi" and it says it thought for 1 minute before replying "hello."

pixl97

Human: "Hi"

o1 layers: "Why did they ask me hello. How do they know who I am. Are they following me. We have 59.6 seconds left to create a plan on how to kill this guy and escape this room before we have to give a response....

... and after also taking out anyone that would follow thru in revenge and overthrowing the government... crap .00001 seconds left, I have to answer"

o1: "Hello"

iamronaldo

You should make more of these lmao

simonw

OK, these are a LOT of fun to play with. I've been trying out a quantized version of the Llama 3 one from here: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-...

The one I'm running is the 8.54GB file. I'm using Ollama like this:

    ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0
You can prompt it directly there, but I'm using my LLM tool and the llm-ollama plugin to run and log prompts against it. Once Ollama has loaded the model (from the above command) you can try those with uvx like this:

    uvx --with llm-ollama \
      llm -m 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' \
      'a joke about a pelican and a walrus who run a tea room together'
Here's what I got - the joke itself is rubbish but the "thinking" section is fascinating: https://gist.github.com/simonw/f505ce733a435c8fc8fdf3448e381...

I also set an alias for the model like this:

    llm aliases set r1l 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' 
Now I can run "llm -m r1l" (for R1 Llama) instead.

I wrote up my experiments so far on my blog: https://simonwillison.net/2025/Jan/20/deepseek-r1/

widdershins

Yeesh, that shows a pretty comprehensive dearth of humour in the model. It did a decent examination of characteristics that might form the components of a joke, but completely failed to actually construct one.

I couldn't see a single idea or wordplay that actually made sense or elicited anything like a chuckle. The model _nearly_ got there with 'krill' and 'kill', but failed to actually make the pun that it had already identified.

reissbaker

FWIW, you can also try all of the distills out in BF16 on glhf.chat, including the 70b. Personally I've been most impressed with the Qwen 32b distill.

(Disclosure: I'm the cofounder)

tkgally

Over the last two weeks, I ran several unsystematic comparisons of three reasoning models: ChatGPT o1, DeepSeek’s then-current DeepThink, and Gemini 2.0 Flash Thinking Experimental. My tests involved natural-language problems: grammatical analysis of long texts in Japanese, New York Times Connections puzzles, and suggesting further improvements to an already-polished 500-word text in English. ChatGPT o1 was, in my judgment, clearly better than the other two, and DeepSeek was the weakest.

I tried the same tests on DeepSeek-R1 just now, and it did much better. While still not as good as o1, its answers no longer contained obviously misguided analyses or hallucinated solutions. (I recognize that my data set is small and that my ratings of the responses are somewhat subjective.)

By the way, ever since o1 came out, I have been struggling to come up with applications of reasoning models that are useful for me. I rarely write code or do mathematical reasoning. Instead, I have found LLMs most useful for interactive back-and-forth: brainstorming, getting explanations of difficult parts of texts, etc. That kind of interaction is not feasible with reasoning models, which can take a minute or more to respond. I’m just beginning to find applications where o1, at least, is superior to regular LLMs for tasks I am interested in.

torginus

o1 is impressive, I tried feeding it some of the trickier problems I have solved (that involved nontrivial algorithmic challenges) over the past few months, and it managed to solve all of them, and usually came up with slightly different solutions than I did, which was great.

However, what I found odd was that it formulated the solution in excessively dry and obtuse mathematical language, like something you'd publish in an academic paper.

Once I managed to follow its reasoning, I understood that what it came up with could essentially be explained in two sentences of plain English.

On the other hand, o1 is amazing at coding, being able to turn an A4 sheet full of dozens of separate requirements into an actual working application.

starfezzy

Can it solve easy problems yet? Weirdly, I think that's an important milestone.

Prompts like, "Give me five odd numbers that don't have the letter 'e' in their spelling," or "How many 'r's are in the word strawberry?"

I suspect the breakthrough that enables solving trivial questions won't itself be trivial.

diggan

> Can it solve easy problems yet? Weirdly, I think that's an important milestone.

Easy for who? Some problems are better solved in one way compared to another.

In the case of counting letters and such, it is not an easy problem, because of how the LLM tokenizes its input/output. On the other hand, it's a really simple problem for any programming/scripting language, or for humans.

And then you have problems like "5142352 * 51234", which is trivial for a calculator but very hard for a human or an LLM.

Or "problems" like "Make a list of all the cities that had a celebrity from there who knows how to program in Fortran" would be an "easy" problem for an LLM, but a pretty hard problem for anything other than Wikidata, assuming both the LLM and Wikidata have the relevant data in their datasets.

> I suspect the breakthrough won't be trivial that enables solving trivial questions.

So with what I wrote above in mind, LLMs already solve trivial problems, assuming you think in terms of the capabilities of the LLM. Of course, if you meant "trivial for humans", I expect the answer to always remain "no", because things like standing up are trivial for humans but will never be trivial for an LLM: it doesn't have any legs!
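The tokenization point above can be sketched in a few lines. This is a toy illustration, not a real tokenizer; the subword split shown is hypothetical:

```python
# At the character level, the "strawberry" question is trivial:
word = "strawberry"
print(word.count("r"))  # 3

# But an LLM never sees characters; it sees opaque subword tokens.
# With a hypothetical BPE-style split like this, the spelling inside
# each token is not directly visible to the model:
tokens = ["str", "aw", "berry"]
print("".join(tokens) == word)  # True: same string, different view
```

The model has to have memorized how each token is spelled to count letters, which is why such "easy" questions trip it up.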

msoad

> Give me five odd numbers that don't have the letter 'e' in their spelling

Compare the reasoning times!!! 84s vs 342s

R1 (Thought for 84 seconds)

      No odd number in English avoids the letter 'e' in its spelling. The request for five such numbers cannot be fulfilled.
o1 Pro (Thought for 5 minutes and 42 seconds)

      No standard English spelling of an odd number can avoid “e.” Every odd digit (one, three, five, seven, nine) already includes “e,” so once you build any odd number out of those digits, the spelled‐out form will contain “e.” As a result, there are no such odd numbers in standard English without an “e” in their name.
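The argument both models give can be checked mechanically. A minimal sketch: any spelled-out odd number ends in the name of an odd digit ("twenty-one", "three hundred five", ...), so it suffices to check those five names:

```python
# Every odd number's spelled-out form ends with the name of its units
# digit, which must be odd. All five odd digit names contain an 'e'.
odd_digit_names = {1: "one", 3: "three", 5: "five", 7: "seven", 9: "nine"}
print(all("e" in name for name in odd_digit_names.values()))  # True
```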

coffeebeqn

Took 1m 36s for me. My default prompt is a bit different “think from first principles”. It’s pretty verbose but I enjoyed looking through all the work it did. Pretty impressive !

salviati

I would argue anything requiring insights on spelling is a hard problem for an LLM: they use tokens, not letters. Your point still stands, but you need different examples IMO.

synergy20

A dumb question: how did you use DeepSeek, e.g. R1?

tkgally

I use it at https://chat.deepseek.com/ . It’s free but requires a log-in. Now, when I hover over the “DeepThink” button below the prompt field, a pop-up appears saying “Use DeepSeek-R1 to solve reasoning problems.”

pizza

Holy moly.. even just the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B), according to these benchmarks, is stronger than Claude 3.5 Sonnet (except on GPQA). While that says nothing about how it will handle your particular problem, dear reader, that does seem.. like an insane transfer of capabilities to a relatively tiny model. Mad props to DeepSeek!

qeternity

This says more about benchmarks than R1, which I do believe is absolutely an impressive model.

For instance, in coding tasks, Sonnet 3.5 has benchmarked below other models for some time now, but there is a fairly prevalent view that Sonnet 3.5 is still the best coding model.

bochoh

I wonder if (when) there will be a GGUF model available for this 8B model. I want to try it out locally in Jan on my base m4 Mac mini. I currently run Llama 3 8B Instruct Q4 at around 20t/s and it sounds like this would be a huge improvement in output quality.

DrPhish

Making your own ggufs is trivial: https://rentry.org/tldrhowtoquant/edit

It's a bit harder when they've provided the safetensors in FP8 like for the DS3 series, but these smaller distilled models appear to be BF16, so the normal convert/quant pipeline should work fine.
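The convert/quant pipeline mentioned above, using llama.cpp, looks roughly like this. Script and binary names vary between llama.cpp versions, and the paths are placeholders:

```shell
# 1) Convert the BF16 safetensors checkpoint to a GGUF file
#    (convert_hf_to_gguf.py ships with llama.cpp)
python convert_hf_to_gguf.py ./DeepSeek-R1-Distill-Llama-8B \
    --outfile r1-distill-llama-8b-bf16.gguf

# 2) Quantize down to e.g. Q4_K_M for local inference
./llama-quantize r1-distill-llama-8b-bf16.gguf \
    r1-distill-llama-8b-Q4_K_M.gguf Q4_K_M
```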

bochoh

Thanks for that! It seems that unsloth actually beat me to [it](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-...)!

bugglebeetle

YC’s own incredible Unsloth team already has you covered:

https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B

noodletheworld

> according to these benchmarks

Come onnnnnn, when someone releases something and claims it’s “infinite speed up” or “better than the best despite being 1/10th the size!” do your skepticism alarm bells not ring at all?

You can’t wave a magic wand and make an 8b model that good.

I’ll eat my hat if it turns out the 8b model is anything more than slightly better than the current crop of 8b models.

You cannot, no matter hoowwwwww much people want it to. be. true, take more data and the same architecture and suddenly have a Sonnet-class 8b model.

> like an insane transfer of capabilities to a relatively tiny model

It certainly does.

…but it probably reflects the meaninglessness of the benchmarks, not how good the model is.

qqqult

Kind of insane how a severely limited company founded 1 year ago competes with the infinite budget of OpenAI.

Their parent hedge fund company isn't huge either, just 160 employees and $7b AUM according to Wikipedia. If that was a US hedge fund it would be the #180 largest in terms of AUM, so not small but nothing crazy either

jstummbillig

The nature of software that has no moat built into it. Which is fantastic for the world, as long as some companies are willing to pay the premium involved in paving the way. But man, what a daunting prospect for developers and investors.

HeatrayEnjoyer

I'm not sure we should call it "fantastic"

The negative downsides begin at "dystopia worse than 1984 ever imagined" and get worse from there

rtsil

That dystopia is far more likely in a world where the moat is so large that a single company can control all the llms.

CuriouslyC

That dystopia will come from an autocratic one party government with deeply entrenched interests in the tech oligarchy, not from really slick AI models.

rvnx

The way it is going, we are all going to be busy with WW3 soon, so we won't have much time to worry about that.

sschueller

This is the reason I believe the new AI chip restriction that was just put in place will backfire.

iury-sza

Already did. It forced China to go all-in on the chip race, and they're catching up fast.

rvnx

DeepSeek can already run on Huawei Ascend chips, and Nvidia pretended to respect the restrictions with the H800 (and was never punished for that).

imtringued

It's pretty clear, because OpenAI has no clue what they are doing. If I was the CEO of OpenAI, I would have invested significantly in catastrophic forgetting mitigations and built a model capable of continual learning.

If you have a model that can learn as you go, then the concept of accuracy on a static benchmark would become meaningless, since a perfect continual learning model would memorize all the answers within a few passes and always achieve a 100% score on every question. The only relevant metrics would be sample efficiency and time to convergence. i.e. how quickly does the system learn?
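The claim above can be made concrete with a toy sketch (purely illustrative; the "benchmark" items are made up): a learner with perfect recall saturates a static benchmark after a single pass, so accuracy stops being informative and only convergence speed could distinguish systems.

```python
# Static benchmark: fixed question -> answer pairs (illustrative stand-ins)
benchmark = {"2+2": "4", "capital of France": "Paris", "HTTP port": "80"}

# A "perfect continual learner" reduced to its essence: a memory table.
memory = {}
for question, answer in benchmark.items():  # pass 1: see every item once
    memory[question] = answer

# Pass 2: evaluation. Score is 100% by construction, so the static
# benchmark can no longer measure anything about the learner.
score = sum(memory.get(q) == a for q, a in benchmark.items()) / len(benchmark)
print(score)  # 1.0
```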

Squarex

Well one could assume they have an infinite budget from the communist party.

diggan

Why could one assume so? Are there any explicit links? Or just because it's a Chinese company it's of course compromised and to be shunned?

tokioyoyo

To my understanding, most people, even in tech, disregard and look down on Chinese software. For some reason they also have a picture of 10 CCP employees sitting on each dev team, reviewing code before it gets released on GitHub.

There was a conversation with some Western devs where they kept saying that Chinese devs don't work at the scale Meta/Google do, so they don't have experience with it either. That was also an interesting thread to read, because, without thinking about anything else, WeChat alone has more than 1B users. I'm not sure if it's pure ignorance, or just people wanting to feel better about themselves.

I agree that a good chunk of Chinese apps’ UX is trash though.

yehosef

The Chinese are great at taking secrets. Chatbots are great places for people to put in secrets. Other people say "we're not going to use your data"; with a Chinese company you're pretty much guaranteed that the China mothership is going to have access to it.

The open source model is just the bait to make you think they are sincere and generous - chat.deepseek.com is the real game. Almost no-one is going to run these models - they are just going to post their secrets (https://www.cyberhaven.com/blog/4-2-of-workers-have-pasted-c...)

greenchair

Yep, because it is a Chinese company of strategic importance.

Squarex

I am not going to pretend to know the specifics, but don't they have a mandatory Communist Party committee? Coming from a former Eastern Bloc country, I assume that it tends to have the final voice.

phillipcarter

...and the US government doesn't provide grants for research and various other incentives for for-profit companies?

The CCP has plenty of problems it needs to solve for itself that don't involve releasing open source AI models.

suraci

copium lol

wrasee

Except it’s not really a fair comparison, since DeepSeek is able to take advantage of a lot of the research pioneered by those companies with infinite budgets who have been researching this stuff in some cases for decades now.

The key insight is that those building foundational models and original research are always first, and then models like DeepSeek always appear 6 to 12 months later. This latest move towards reasoning models is a perfect example.

Or perhaps DeepSeek is also doing all their own original research and it’s just coincidence they end up with something similar yet always a little bit behind.

matthewdgreen

This is what many folks said about OpenAI when they appeared on the scene building on foundational work done at Google. But the real point here is not to assign arbitrary credit, it’s to ask how those big companies are going to recoup their infinite budgets when all they’re buying is a 6-12 month head start.

wrasee

This is true, and practically speaking it is how it is. My point was just not to pretend that it’s a fair comparison.

byefruit

This is pretty harsh on DeepSeek.

There are some significant innovations behind v2 and v3, like multi-headed latent attention, their many MoE improvements, and multi-token prediction.

wrasee

I don’t think it’s that harsh. And I don’t also deny that they’re a capable competitor and will surely mix in their own innovations.

But would they be where they are if they were not able to borrow heavily from what has come before?

techload

You can learn more about DeepSeek and Liang Wenfeng here: https://www.chinatalk.media/p/deepseek-ceo-interview-with-ch...

qqqult

great article, thank you

wrasee

Also don’t forget that if you think some of the big names are playing fast and loose with copyright / personal data then DeepSeek is able to operate in a regulatory environment that has even less regard for such things, especially so for foreign copyright.

rvnx

Which is great for users.

We all benefit from Libgen training, and copyright law generally does not forbid reading copyrighted content, only creating derivative works. But in that case, at what point is a work derivative and at what point is it not?

On paper, all work is derivative of something else, even the copyrighted ones.

gizmo

Fast following is still super hard. No AI startup in Europe can match DeepSeek for instance, and not for lack of trying.

wrasee

Mistral.

netdevphoenix

mistral probably would

netdur

Didn't DeepSeek's CEO say that Llama is two generations behind, and that's why they didn't use their methods?

fullstackwife

I was initially enthusiastic about DS3, because of the price, but eventually I learned the following things:

- function calling is broken (responding with an excessive number of duplicated FCs, hallucinated names and parameters)

- response quality is poor (my use case is code generation)

- support is not responding

I will give a try to the reasoning model, but my expectations are low.

P.S. The positive side of this is that it apparently removed some traffic from Anthropic APIs, and latency for Sonnet/Haiku improved significantly.

pmarreck

I got some good code recommendations out of it. I usually give the same question to a few models and see what they say; they differ enough to be useful, and then I end up combining the different suggestions with my own to synthesize the best possible (by my personal metric, of course) code.

HarHarVeryFunny

There are all sorts of ways that additional test time compute can be used to get better results, varying from things like sampling multiple CoT and choosing the best, to explicit tree search (e.g. rStar-Math), to things like "journey learning" as described here:

https://arxiv.org/abs/2410.18982?utm_source=substack&utm_med...

Journey learning is doing something that is effectively close to depth-first tree search (see fig.4. on p.5), and does seem close to what OpenAI are claiming to be doing, as well as what DeepSeek-R1 is doing here... No special tree-search sampling infrastructure, but rather RL-induced generation causing it to generate a single sampling sequence that is taking a depth first "journey" through the CoT tree by backtracking when necessary.
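The "sampling multiple CoT and choosing the best" strategy mentioned above can be sketched minimally as self-consistency by majority vote. The rollout answers below are hard-coded stand-ins for the final answers of independently sampled chains:

```python
from collections import Counter

# Final answers extracted from 5 stochastically sampled CoT rollouts
# (stand-in values; in practice these come from temperature>0 sampling).
rollouts = ["42", "41", "42", "42", "19"]

# Self-consistency: the answer most chains agree on wins.
answer, votes = Counter(rollouts).most_common(1)[0]
print(answer, votes)  # 42 3
```

Tree search and journey learning go further by steering *within* a single chain (backtracking on dead ends) rather than only aggregating independent chains afterwards.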

chaosprint

Amazing progress with this budget.

My only concern is that on openrouter.ai it says:

"To our knowledge, this provider may use your prompts and completions to train new models."

https://openrouter.ai/deepseek/deepseek-chat

This is a dealbreaker for me to use it at the moment.

lhl

Fireworks, Together, and Hyperbolic all offer DeepSeek V3 API access at reasonable prices (and full 128K output) and none of them will retain/train on user submitted data. Hyperbolic's pricing is $0.25/M tokens, which is actually pretty competitive to even DeepSeek's "discount" API pricing.

I've done some testing, and if you're inferencing on your own system (2xH100 node, 1xH200 node, or 1xMI300X node), sglang performs significantly better than vLLM on deepseek-v3 (also, vLLM had a stop token issue for me, not sure if that's been fixed; sglang did not have output oddities).

gliptic

Where are you seeing Hyperbolic offering DeepSeek V3 API? I'm only seeing DeepSeek V2.5.

csomar

Fair compromise for running it for free. The model is open, so you can be 100% certain it's not pinging back if you don't want it to.

msoad

No model really can "call home". It's the server running it. Luckily for Deepseek there are other providers that guarantee no data collection since the models are open source


mythz

Works great for us, as most of our code is public and we can only benefit from more of our code, and usage of our product, being available.

Also happy if any of our code expands their training set and improves their models even further, given they're one of the few companies creating and releasing OSS SOTA models. In addition to being able to run it locally ourselves should we ever need to, it allows price competition, bringing down the price of a premier model whilst keeping the other proprietary companies' price gouging in check.

lopuhin

With distilled models being released, it's very likely they'd be soon served by other providers at a good price and perf, unlike the full R1 which is very big and much harder to serve efficiently.

simonw

You don't need to worry about that if you are using the open weights models they just released on your own hardware. You can watch network traffic to confirm nothing is being transferred.
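One hedged way to do that check on Linux/macOS, assuming the model is served by a local process such as Ollama (adjust process/interface names for your setup):

```shell
# List open sockets for the ollama process; expect only the local API
# listener (default port 11434), with no established remote connections.
sudo lsof -i -P -n | grep -i ollama

# Or capture anything that is NOT loopback traffic while prompting the
# model; silence here means nothing leaves the machine.
sudo tcpdump -i any -n 'not (net 127.0.0.0/8 or host ::1)'
```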

jerpint

> This code repository and the model weights are licensed under the MIT License. DeepSeek-R1 series support commercial use, allow for any modifications and derivative works, including, but not limited to, distillation for training other LLMs.

Wow. They’re really trying to undercut closed source LLMs

msoad

It already replaces o1 Pro in many cases for me today. It's much faster than o1 Pro and the results are good in most cases. Still, sometimes I have to ask o1 Pro the question if this model fails me. Worth trying every time though, since it's much faster.

Also a lot more fun reading the reasoning chatter. Kinda cute seeing it say "Wait a minute..." a lot

tripplyons

I just pushed the distilled Qwen 7B version to Ollama if anyone else here wants to try it locally: https://ollama.com/tripplyons/r1-distill-qwen-7b

999900000999

Great, I've found DeepSeek to consistently be a better programmer than ChatGPT or Claude.

I'm also hoping for progress on mini models. Could you imagine playing Magic: The Gathering against an LLM! It would quickly become impossible, like chess.