How to Run DeepSeek R1 671B Locally on a $2000 EPYC Server
43 comments
· February 1, 2025 · geertj
nielsole
I've been running the unsloth 200GB dynamic quantisation with 8k context on my 64GB Ryzen 7 5800G. CPU and iGPU utilization were super low, because it basically has to read the entire model from disk. (Looks like it needs ~40GB of actual memory that it cannot easily mmap from disk.) With a Samsung 970 Evo Plus that gave me 2.5GB/s read speed. That came out at 0.15 tps. Not bad for completely underspecced hardware.
Given the model has only so few active parameters per token (~40B), it is likely that just being able to hold it in memory absolves the largest bottleneck. I guess with a single consumer PCIe 4.0 x16 graphics card you could get at most 1 tps just because of the PCIe transfer speed? Maybe CPU processing can be faster simply because DDR transfer is faster than transfer to the graphics card.
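A rough back-of-envelope of those ceilings (a sketch only; the bit-width and effective-bandwidth figures are assumptions, not measurements):

```python
# Back-of-envelope: tokens/sec ceiling = bandwidth / bytes streamed per token.
# Assumptions: ~40B active params/token, ~4.5 bits/param for the dynamic quant,
# ~25 GB/s effective PCIe 4.0 x16, ~50 GB/s dual-channel DDR4.
active_params = 40e9
bits_per_param = 4.5
bytes_per_token = active_params * bits_per_param / 8  # ~22.5 GB read per token

links = {
    "NVMe (970 Evo Plus)": 2.5e9,
    "PCIe 4.0 x16": 25e9,
    "dual-channel DDR4": 50e9,
}
for name, bw in links.items():
    print(f"{name}: ~{bw / bytes_per_token:.2f} tok/s ceiling")
# ~0.11, ~1.1 and ~2.2 tok/s respectively, roughly in line with the 0.15 tps observed
```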
notsylver
I think it would be more interesting doing this with smaller models (33B-70B) and seeing if you could get 5-10 tokens/sec on a budget. I've desperately wanted something local that's around the same level as 4o, but I'm not in a hurry to spend $3k on an overpriced GPU or $2k on this
gliptic
Your best bet for 33B is already having a computer and buying a used RTX 3090 for <$1k. I don't think there are currently any cheap options for 70B that would give you >5 tokens/sec. High memory bandwidth is just too expensive. Strix Halo might give you >5 once it comes out, but will probably be significantly more than $1k for 64 GB RAM.
firtoz
Would it be something like this?
> OpenAI's nightmare: DeepSeek R1 on a Raspberry Pi
https://x.com/geerlingguy/status/1884994878477623485
I haven't tried it myself or verified the claims, but it seems exciting at least
gliptic
That's 1.2 t/s for the 14B Qwen finetune, not the real R1. Unless you add the GPU at extra cost, but hardly anyone besides Jeff Geerling is going to run a dedicated GPU on a Pi.
isoprophlex
Online, R1 costs what, $2/MTok?
This rig does >4 tok/s, which is ~15-20 ktok/hr, or $0.04/hr when purchased through a provider.
You're probably spending $0.20/hr on power (1 kW) alone.
Cool achievement, but to me it doesn't make a lot of sense (besides privacy...)
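Spelling the same math out (the API price, power draw, and electricity rate are round-number assumptions):

```python
# Local vs API running cost per hour, ignoring the $2k hardware itself.
api_price_per_mtok = 2.00   # assumed $/1M tokens from a provider
local_tps = 4.0             # tokens/sec on this rig
power_kw = 1.0              # assumed wall draw
price_per_kwh = 0.20        # assumed electricity price

tokens_per_hour = local_tps * 3600                     # ~14,400 tok/hr
api_cost = tokens_per_hour / 1e6 * api_price_per_mtok  # ~$0.03/hr
power_cost = power_kw * price_per_kwh                  # $0.20/hr

print(f"same tokens via API: ${api_cost:.2f}/hr")
print(f"electricity alone:   ${power_cost:.2f}/hr")
```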
rightbyte
> Cool achievement, but to me it doesn't make a lot of sense (besides privacy...)
I would argue that is enough and that this is awesome. It's been a long time since I wanted to do a tech hack like this this much.
codetrotter
> doesn't make a lot of sense (besides privacy...)
Privacy is worth very much though.
christophilus
Aside: it’s pretty amazing what $2K will buy. It’s been a minute since I built my desktop, and this has given me the itch to upgrade.
Any suggestions on building a low-power desktop that still yields decent performance?
Havoc
>Any suggestions on building a low-power desktop that still yields decent performance?
You don't, for now. The bottleneck is memory throughput. That's why people using CPUs for LLMs are running Xeon-ish/EPYC setups... lots of memory channels.
The APU-class gear along the lines of Strix Halo is probably the path closest to low power, but it's not going to do 500GB of RAM and still doesn't have enough throughput for big models.
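To put rough numbers on that (the platform configurations below are generic assumptions, not the exact hardware in the article):

```python
# Peak DRAM bandwidth ≈ channels × transfer rate (MT/s) × 8 bytes per transfer.
def bandwidth_gbs(channels, mts):
    return channels * mts * 8 / 1e3  # GB/s

configs = {
    "desktop, 2x DDR5-6000": (2, 6000),              # ~96 GB/s
    "Strix Halo, 256-bit LPDDR5X-8000": (4, 8000),   # ~256 GB/s
    "EPYC, 8x DDR4-3200": (8, 3200),                 # ~205 GB/s
    "EPYC, 12x DDR5-4800": (12, 4800),               # ~461 GB/s
}
for name, (ch, mts) in configs.items():
    print(f"{name}: ~{bandwidth_gbs(ch, mts):.0f} GB/s")
```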
baobun
Hard to know what ranges you have in mind with "decent performance" and "low-power".
I think your best bet might be a Ryzen U-series mini PC. Or perhaps an APU barebone. The ATX platform is not ideal from a power-efficiency perspective (whether inherently or from laziness or conspiracy from mobo and PSU makers, I do not know). If you want the flexibility or scale, you pay the price of course but first make sure it's what you want. I wouldn't look at discrete graphics unless you have specific needs (really high-end gaming, workstation, LLMs, etc) - the integrated graphics of last few years can both drive your 4k monitors and play recent games at 1080p smoothly, albeit perhaps not simultaneously ;)
Lenovo Tiny mq has some really impressive flavors (ECC support at the cost of CPU vendor lock-in on PRO models) and there's the whole roster of Chinese competitors and up-and-comers if you're feeling adventurous. Believe me, you can still get creative if you want to scratch the builder itch - thermals are generally what keeps these systems from really roaring (:
alecco
Affiliate link spam.
buyucu
related reddit thread: https://old.reddit.com/r/LocalLLaMA/comments/1idseqb/deepsee...
teruakohatu
If you are going to go to that effort, adding a second NVMe drive and doing RAID 0 across them will improve the speed of getting the model into RAM.
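Roughly what that buys in load time (a sketch assuming ~400GB of Q4 weights and that the drives actually sustain their sequential read specs):

```python
# Model load time from disk: single NVMe vs two drives striped in RAID 0 (best case).
model_gb = 400          # assumed on-disk size of the Q4 weights
single_drive_gbs = 3.5  # assumed sustained sequential read of one NVMe drive
raid0_gbs = 2 * single_drive_gbs

print(f"single drive: ~{model_gb / single_drive_gbs / 60:.1f} min")
print(f"RAID 0 pair:  ~{model_gb / raid0_gbs / 60:.1f} min")
```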
merb
Then you will need way more memory, since you can only do software RAID on NVMe drives.
davidmurdoch
About how much memory overhead would that require?
H8crilA
Do we have any estimate on the size of OpenAI top of the line models? Would they also fit in ~512GB of (V)RAM?
Also, this whole self hosting of LLMs is a bit like cloud. Yes, you can do it, but it's a lot easier to pay for API access. And not just for small users. Personally I don't even bother self hosting transcription models which are so small that they can run on nearly any hardware.
GreenWatermelon
It's nice because a company can optionally provide a SOTA reasoning model for their clients without having to go through a middleman, e.g. an HR company can provide an LLM for their HRMS system for a small $2000 investment. Not $2000/month, just a one-time $2000 investment.
yapyap
Is the size of OpenAI's top of the line models even relevant? Last I checked they weren't open source in the slightest.
exe34
it would make sense if you don't want somebody else to have access to all your code and customer data.
sylware
Well, I read this, and now I am sure: as of today, DeepSeek's handling of LLMs is the least wrong, and by far.
cma
He's running the 671B model quantized to Q4. However, MoE doesn't need cluster networking, so you could probably run the full thing unquantized on two of them. Maybe the router could be kept entirely resident in GPU RAM, instead of offloading a larger percentage of everything there, or is that already how his GPU offload config is set up?
sAggymEllenZ
[dead]
This runs the 671B model in Q4 quantization at 3.5-4.25 TPS for $2K on a single-socket EPYC server motherboard with 512GB of RAM.
This [1] X thread runs the 671B model in the original Q8 at 6-8 TPS for $6K using a dual-socket EPYC server motherboard with 768GB of RAM. I think this could be made cheaper by getting slower RAM, but since this is RAM-bandwidth limited that would likely reduce TPS. I'd be curious whether this would just be a linear slowdown proportional to the RAM MHz or whether CAS latency plays into it as well.
[1] https://x.com/carrigmat/status/1884244369907278106?s=46&t=5D...
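A hedged sketch of that scaling question (the channel counts, DIMM speeds, and ~37B active-parameter figure below are my guesses at the two configurations, not confirmed specs):

```python
# Decoding is memory-bandwidth bound: tok/s ceiling ≈ bandwidth / bytes read per token.
def tps_ceiling(channels, mts, bytes_per_param, active_params=37e9):
    bandwidth_bytes = channels * mts * 8 * 1e6  # bytes/s
    return bandwidth_bytes / (active_params * bytes_per_param)

# $2K single-socket rig: assume 8x DDR4-3200, Q4 (~0.5 bytes/param)
print(f"Q4 rig: ~{tps_ceiling(8, 3200, 0.5):.0f} tok/s theoretical ceiling")   # ~11
# $6K dual-socket rig: assume 24x DDR5-4800, Q8 (~1 byte/param)
print(f"Q8 rig: ~{tps_ceiling(24, 4800, 1.0):.0f} tok/s theoretical ceiling")  # ~25
# If this model holds, TPS scales roughly linearly with MT/s; CAS latency should
# matter less because the weight reads are large sequential streams.
```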