
How to Run DeepSeek R1 671B Locally on a $2000 EPYC Server

geertj

This runs the 671B model in Q4 quantization at 3.5-4.25 TPS for $2K on a single-socket EPYC server motherboard with 512GB of RAM.

This [1] X thread runs the 671B model in the original Q8 at 6-8 TPS for $6K on a dual-socket EPYC server motherboard with 768GB of RAM. I think this could be made cheaper by getting slower RAM, but since this is RAM-bandwidth limited, that would likely reduce TPS. I'd be curious whether this would just be a linear slowdown proportional to the RAM MHz or whether CAS latency plays into it as well.

[1] https://x.com/carrigmat/status/1884244369907278106?s=46&t=5D...
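
For a rough sense of why the memory spec dominates, here is a back-of-envelope sketch. The figures are illustrative assumptions, not details of either build: ~37B active parameters per token (DeepSeek's published figure for R1), 8 channels of DDR4-3200, and a guessed efficiency factor.

```python
# Back-of-envelope decode-rate estimate for a bandwidth-bound MoE model.
# All figures are illustrative assumptions, not measurements from either build.

def peak_bandwidth_gb_s(channels: int, mt_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth: channels * transfer rate * 8-byte bus."""
    return channels * mt_s * bus_bytes / 1e3

def tokens_per_second(active_params: float, bytes_per_param: float,
                      bandwidth_gb_s: float, efficiency: float) -> float:
    """Each decoded token streams the active expert weights from RAM once."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_gb_s * 1e9 * efficiency / bytes_per_token

bw = peak_bandwidth_gb_s(channels=8, mt_s=3200)     # ~204.8 GB/s peak (assumed config)
for eff in (1.0, 0.4):                              # ideal vs. a plausible real-world fraction
    tps = tokens_per_second(37e9, 0.5, bw, eff)     # Q4 ~ 0.5 bytes/param, ~37B active params
    print(f"efficiency {eff:.0%}: ~{tps:.1f} t/s")
# ~11 t/s ideal, ~4.4 t/s at 40% efficiency -- in the ballpark of the reported 3.5-4.25 TPS.
```

Since decoding streams the weights sequentially, raw bandwidth (channels × MT/s) should dominate and CAS latency should matter comparatively little, so slower RAM would likely mean a roughly proportional slowdown.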

nielsole

I've been running the unsloth 200GB dynamic quantisation with 8k context on my 64GB Ryzen 7 5800G. CPU and iGPU utilization were super low, because it basically has to read the entire model from disk. (Looks like it needs ~40GB of actual memory that it cannot easily mmap from disk.) With a Samsung 970 Evo Plus that gave me 2.5GB/s read speed. That came out at 0.15 tps. Not bad for completely underspecced hardware.

Given that only so few of the model's parameters are active per token (~40B), it is likely that just being able to hold it in memory would remove the largest bottleneck. I guess with a single consumer PCIe 4.0 x16 graphics card you could get at most 1 tps just because of the PCIe transfer speed? Maybe CPU processing can be faster simply because DDR transfer is faster than transfer to the graphics card.
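
As a sanity check on that intuition, the comment's own disk numbers pin down roughly how many bytes get streamed per token, and the same ratio gives ballpark ceilings for other links. The PCIe and DDR figures below are nominal assumptions:

```python
# Sanity check of the "bottleneck = how fast you can stream the active weights" idea,
# using the numbers from this comment. All other figures are rough assumptions.

ssd_gb_s = 2.5          # measured 970 Evo Plus read speed from the comment
observed_tps = 0.15     # observed tokens/s from the comment

implied_gb_per_token = ssd_gb_s / observed_tps
print(f"~{implied_gb_per_token:.0f} GB streamed per token")   # ~17 GB/token

# Ceiling if the same bytes were streamed over PCIe 4.0 x16 instead
# (~32 GB/s theoretical, ~26 GB/s is a more realistic sustained figure):
for pcie_gb_s in (32.0, 26.0):
    print(f"{pcie_gb_s} GB/s -> ~{pcie_gb_s / implied_gb_per_token:.1f} t/s")
# -> a ceiling of roughly 1.5-2 t/s, the same order of magnitude as the "at most 1 tps" guess.

# Dual-channel DDR4-3200 (~51 GB/s) or DDR5-5600 (~90 GB/s) would raise the same
# ceiling further, which is the point about RAM beating transfers to the GPU.
```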

3abiton

3x the price for less than 2x the speed increase. I don't think the price justifies the upgrade.

phonon

Q4 vs Q8.

notsylver

I think it would be more interesting to do this with smaller models (33B-70B) and see if you could get 5-10 tokens/sec on a budget. I've desperately wanted something local that's around the same level as 4o, but I'm not in a hurry to spend $3k on an overpriced GPU or $2k on this.

gliptic

Your best bet for 33B is already having a computer and buying a used RTX 3090 for <$1k. I don't think there are currently any cheap options for 70B that would give you >5 t/s. High memory bandwidth is just too expensive. Strix Halo might give you >5 t/s once it comes out, but will probably be significantly more than $1k for 64 GB RAM.
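
A quick fit-and-bandwidth sketch of why a 24 GB card covers ~33B comfortably but not 70B. The 3090 figures, Q4 size, and overhead are nominal assumptions:

```python
# Why a used RTX 3090 covers ~33B but not 70B: fit check plus bandwidth ceiling.
# Assumed figures: 24 GB VRAM, ~936 GB/s memory bandwidth, Q4 ~ 0.5 bytes/param,
# a few GB of overhead for KV cache and activations.

def fits_in_vram(params_b: float, bytes_per_param: float, vram_gb: float,
                 overhead_gb: float = 3.0) -> bool:
    return params_b * bytes_per_param + overhead_gb <= vram_gb

def tps_ceiling(params_b: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    # Dense model: every decoded token streams all weights through the memory bus once.
    return bandwidth_gb_s / (params_b * bytes_per_param)

for params in (33, 70):
    fits = fits_in_vram(params, 0.5, vram_gb=24)
    print(f"{params}B Q4: fits in 24 GB = {fits}, "
          f"ceiling ~ {tps_ceiling(params, 0.5, 936):.0f} t/s if it fit")
# 33B Q4 (~16.5 GB) fits and is nowhere near a <5 t/s bandwidth limit; 70B Q4 (~35 GB)
# spills into system RAM, where dual-channel bandwidth caps you well under 5 t/s.
```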

firtoz

Would it be something like this?

> OpenAI's nightmare: DeepSeek R1 on a Raspberry Pi

https://x.com/geerlingguy/status/1884994878477623485

I haven't tried it myself or verified the creds, but it seems exciting at least.

gliptic

That's 1.2 t/s for the 14B Qwen finetune, not the real R1. Unless you go with the GPU at extra cost, but hardly anyone but Jeff Geerling is going to run a dedicated GPU on a Pi.

isoprophlex

Online, R1 costs what, $2/MTok?

This rig does >4 tok/s, which is ~15-20 ktok/hr, or $0.04/hr when purchased through a provider.

You're probably spending $0.20/hr on power (1 kW) alone.

Cool achievement, but to me it doesn't make a lot of sense (besides privacy...)
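
Writing that comparison out, using the rough figures from this comment (none of these are quoted rates from a specific provider or utility):

```python
# The cost comparison above, written out. All prices are the commenter's rough figures.

api_price_per_mtok = 2.00        # $/million tokens via an API provider (assumed)
rig_tps = 4.0                    # tokens/s from the local build
power_kw = 1.0                   # assumed draw of the EPYC box under load
electricity_per_kwh = 0.20       # assumed $/kWh

tokens_per_hour = rig_tps * 3600                               # ~14,400 tok/h
api_cost_per_hour = tokens_per_hour / 1e6 * api_price_per_mtok
power_cost_per_hour = power_kw * electricity_per_kwh

print(f"~{tokens_per_hour/1000:.0f} ktok/h, API equivalent ~ ${api_cost_per_hour:.3f}/h, "
      f"electricity ~ ${power_cost_per_hour:.2f}/h")
# At these rates electricity alone costs several times the API price, before counting
# the $2K of hardware -- so the case really is privacy/control rather than cost.
```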

rightbyte

> Cool achievement, but to me it doesn't make a lot of sense (besides privacy...)

I would argue that is enough and that this is awesome. It's been a long time since I wanted to do a tech hack this much.

codetrotter

> doesn't make a lot of sense (besides privacy...)

Privacy is worth very much though.

christophilus

Aside: it’s pretty amazing what $2K will buy. It’s been a minute since I built my desktop, and this has given me the itch to upgrade.

Any suggestions on building a low-power desktop that still yields decent performance?

Havoc

>Any suggestions on building a low-power desktop that still yields decent performance?

You don't, for now. The bottleneck is memory throughput. That's why people using CPUs for LLMs are running Xeon-ish/EPYC setups... lots of memory channels.

The APU-class gear along the lines of Strix Halo is probably the path closest to low power, but it's not going to do 500GB of RAM and still doesn't have enough throughput for big models.
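
A rough comparison of theoretical peak bandwidth across platforms, using nominal/reported memory configurations rather than benchmarks:

```python
# Peak DRAM bandwidth is bus width (channels) * transfer rate, which is why
# server sockets and wide APUs win here. Nominal/reported figures, not benchmarks.

def peak_gb_s(bus_bits: int, mt_s: int) -> float:
    return bus_bits / 8 * mt_s / 1e3

platforms = {
    "desktop, dual-channel DDR5-5600":          peak_gb_s(128, 5600),   # ~90 GB/s
    "EPYC, 8-channel DDR4-3200":                peak_gb_s(512, 3200),   # ~205 GB/s
    "EPYC, 12-channel DDR5-4800":               peak_gb_s(768, 4800),   # ~461 GB/s
    "Strix Halo class, 256-bit LPDDR5X-8000":   peak_gb_s(256, 8000),   # ~256 GB/s
}
for name, bw in platforms.items():
    print(f"{name}: ~{bw:.0f} GB/s")
# More channels is the whole game; the low-power APU parts close some of the gap,
# but none of them take 500GB of RAM, which is the other half of the problem.
```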

baobun

Hard to know what ranges you have in mind with "decent performance" and "low-power".

I think your best bet might be a Ryzen U-series mini PC. Or perhaps an APU barebone. The ATX platform is not ideal from a power-efficiency perspective (whether inherently or from laziness or conspiracy from mobo and PSU makers, I do not know). If you want the flexibility or scale, you pay the price of course, but first make sure it's what you want. I wouldn't look at discrete graphics unless you have specific needs (really high-end gaming, workstation, LLMs, etc.) - the integrated graphics of the last few years can both drive your 4k monitors and play recent games at 1080p smoothly, albeit perhaps not simultaneously ;)

Lenovo Tiny mq has some really impressive flavors (ECC support at the cost of CPU vendor lock-in on PRO models) and there's the whole roster of Chinese competitors and up-and-comers if you're feeling adventurous. Believe me, you can still get creative if you want to scratch the builder itch - thermals are generally what keep these systems from really roaring (:

alecco

Affiliate link spam.

teruakohatu

If you are going to go to that effort, adding a second NVMe drive and doing RAID 0 across them will improve the speed of getting the model into RAM.

merb

Then you will need way more memory, since you can only do software RAID with NVMe drives.

davidmurdoch

About how much memory overhead would that require?

H8crilA

Do we have any estimate of the size of OpenAI's top-of-the-line models? Would they also fit in ~512GB of (V)RAM?

Also, this whole self-hosting of LLMs is a bit like cloud. Yes, you can do it, but it's a lot easier to pay for API access. And not just for small users. Personally I don't even bother self-hosting transcription models, which are so small that they can run on nearly any hardware.

GreenWatermelon

It's nice because a company can optionally provide a SOTA reasoning model for their clients without having to go through a middleman. E.g. an HR company can provide an LLM for their HRMS system for a small $2000 investment. Not $2000/month, just a one-time $2000 investment.

yapyap

Is the size of OpenAI's top-of-the-line models even relevant? Last I checked they weren't open source in the slightest.

exe34

it would make sense if you don't want somebody else to have access to all your code and customer data.

sylware

Well, I read this, and now I am sure: as of today, DeepSeek's handling of LLMs is the least wrong, and by far.

cma

He's running quantized Q4 671B. However, MoE doesn't need cluster networking, so you could probably run the full thing unquantized on two of them. Maybe the router could be kept entirely resident in GPU RAM, instead of offloading a larger percentage of everything there - or is that already how it's set up in his GPU offload config?

leobg

This is a verbatim quote from Jason Calacanis.