
DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon

hnfong

As other commenters have mentioned, the performance of this setup is probably not great, since there's not enough VRAM and lots of bits have to be moved between CPU and GPU RAM.

That said, there are sub-256GB quants of DeepSeek-R1 out there (not the distilled versions). See https://unsloth.ai/blog/deepseekr1-dynamic

I can't quantify the difference between these and the full FP8 versions of DSR1, but I've been playing with these ~Q2 quants and they're surprisingly capable in their own right.
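
If anyone wants to try them, here's a minimal download sketch; the repo id and file pattern are my assumptions based on the blog post, so double-check the exact names on Hugging Face:

```python
# Sketch: fetch one of the Unsloth dynamic quants with huggingface_hub.
# The repo id and pattern below are assumptions -- verify against the blog post.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",    # assumed repo name
    allow_patterns=["*UD-IQ2_XXS*"],       # assumed pattern for the ~2-bit dynamic quant
    local_dir="DeepSeek-R1-GGUF",
)
```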

Another model that deserves mention is DeepSeek v2.5 (which has "fewer" params than V3/R1) - but still needs aggressive quantization before it can run on "consumer" devices (with less than ~100GB VRAM), and this is recently done by a kind soul: https://www.reddit.com/r/LocalLLaMA/comments/1irwx6q/deepsee...

DeepSeek v2.5 is arguably better than Llama 3 70B, so it should be of interest to anyone looking to run local inference. I really think more people should know about this.

SlavikCA

I tried that Unsloth R1 quantization on my dual Xeon Gold 5218 with 384 GB DDR4-2666 (only about half of the memory channels populated, so not optimal).

Type IQ2_XXS / 183GB, 16k context:

CPU only: 3 t/s (tokens per second) for PP (prompt processing) and 1.44 t/s for response.

CPU + NVIDIA RTX 70GB VRAM: 4.74 t/s for PP and 1.87 t/s for response.
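
For a rough sense of whether those numbers are bandwidth-limited, here's a back-of-the-envelope check; the channel count, per-channel bandwidth, and the ~37B active-parameter figure are assumptions rather than measurements:

```python
# Back-of-the-envelope decode ceiling for this box (all values are assumptions).
channel_bw_gbs = 21.3            # DDR4-2666: 2666 MT/s * 8 bytes per channel
channels_used  = 6               # "about half" of the 12 channels on a dual Gold 5218
total_bw_gbs   = channel_bw_gbs * channels_used       # ~128 GB/s aggregate

active_params  = 37e9            # DeepSeek-R1 activates ~37B of its 671B params per token
bits_per_w     = 183 * 8 / 671   # ~2.2 bits/weight for the 183 GB IQ2_XXS file
gb_per_token   = active_params * bits_per_w / 8 / 1e9  # ~10 GB of weights read per token

print(total_bw_gbs / gb_per_token)   # ~12-13 t/s theoretical bandwidth ceiling
# The measured 1.44 t/s is far below that ceiling, which suggests NUMA placement
# and compute overhead matter here, not just raw memory bandwidth.
```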

I wish Unsloth produced a similar quantization for DeepSeek V3 - it would be more useful, as it doesn't need reasoning tokens, so even at the same t/s it would be faster overall.

colorant

https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...

Requirements (for >8 tokens/s):

380GB CPU Memory

1-8 Arc A770

500GB Disk
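
To put those numbers in perspective, here's a quick memory-budget sketch; the 16 GB per-card figure is the A770's VRAM, and the rest is just arithmetic on the quickstart's own requirements:

```python
# Memory-budget sketch for the quickstart requirements above.
cpu_mem_gb   = 380      # quantized weights held in system RAM
a770_vram_gb = 16       # per-card VRAM for offloaded layers / KV cache

for cards in (1, 2, 4, 8):
    print(f"{cards} x A770: {cards * a770_vram_gb} GB VRAM + {cpu_mem_gb} GB CPU memory")
```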

colorant

aurareturn

CPU inference is both bandwidth- and compute-constrained.

If your prompt has 10 tokens, it'll do ok, like in the LinkedIn demo. If you need to increase the context, the compute bottleneck will kick in quickly.

colorant

Prompt length mainly impacts the prefill latency (TTFT, time to first token), not the decoding speed (TPOT, time per output token).
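
To make those terms concrete, here's a toy latency model; the formulas are simplified (they ignore the KV-cache reads that grow with context length) and the example numbers are made up but plausible:

```python
# Toy latency model: prefill (TTFT) scales with prompt length and is mostly
# compute-bound; decode (TPOT) is roughly per-token and mostly bandwidth-bound.
def ttft_s(prompt_tokens: int, flops_per_token: float, compute_flops: float) -> float:
    return prompt_tokens * flops_per_token / compute_flops

def tpot_s(active_weight_bytes: float, mem_bw_bytes_per_s: float) -> float:
    return active_weight_bytes / mem_bw_bytes_per_s

# Example: 2k-token prompt, ~74 GFLOPs/token (2 * 37B active params),
# 30 TFLOPS of usable compute, ~10 GB of active weights, 128 GB/s of bandwidth.
print(ttft_s(2048, 74e9, 30e12))   # ~5 s before the first token
print(tpot_s(10e9, 128e9))         # ~0.08 s per generated token (~13 t/s)
```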

faizshah

Anyone got a rough estimate of the cost of this setup?

I’m guessing it’s under 10k.

I also didn’t see tokens per second numbers.

ynniv

aurareturn

This article keeps getting posted but it runs a thinking model at 3-4 tokens/s. You might as well take a vacation if you ask it a question.

It’s a gimmick and not a real solution.

utopcell

What a teaser article! All this info for setting up the system, but no performance numbers.


jamesy0ung

What exactly does the Xeon do in this situation? Is there a reason you couldn't use any other x86 processor?

VladVladikoff

I think it's that most non-Xeon motherboards don't have enough memory channels to support this much memory with any sort of commercially viable DIMMs.

genewitch

PCIe lanes

hedora

I was about to correct you because this doesn't use PCIe for anything, and then I realized Arc was a GPU (and they support up to 8 per machine).

Any idea how many Arcs it takes to match an H100?

walrus01

There's not much else (other than Epyc) in the way of affordably priced motherboards that support enough RAM. You can buy a used Dell dual-socket older-generation Xeon server with 512GB of RAM for test/development purposes for not very much money.

Under $1500 (before adding video cards or your own SSD), easily, with what I just found in a few minutes of searching. I'm also seeing things with 1024GB of RAM for under $2000.

You also want the capability for more than one full-speed PCI-Express 3.0 x16 card at a minimum, which means you need enough PCIe lanes, and you aren't going to find those on a single-socket Intel workstation motherboard.

Here are a couple of somewhat randomly chosen, affordably priced examples with 512GB of RAM. They'll be power hungry and noisy. The same general idea applies to other x86-64 hardware from HP, Supermicro, etc. These are fairly common, so I'm using them as a baseline for specification vs. price. Configurations will be something like 16 x 32GB DDR4 DIMMs.

https://www.ebay.com/itm/186991103256?_skw=dell+poweredge+t6...

https://www.ebay.com/itm/235978320621?_skw=dell+poweredge+r7...

https://www.ebay.com/itm/115819389940?_skw=dell+poweredge+r7...

mrbonner

I see there are a few options for running LLM and Stable Diffusion inference outside of Nvidia: Intel Arc, Apple M-series, and now AMD Ryzen AI Max. Obviously running on Nvidia would be the optimal way. But given the scarcity of high-VRAM Nvidia cards at a reasonable price, I can't stop thinking about getting one that is not Nvidia. So, if I'm not interested in training or fine-tuning, would any of those solutions actually work? On a Linux machine?

999900000999

If you actually want to seriously do this, go with Nvidia.

This article is basically Intel saying "remember us, we made a GPU!" And they make great budget cards, but the ecosystem is just so far behind.

Honestly this is not something you can really do on a budget.


ryao

Where is the benchmark data?

zamadatix

Since the Xeon alone could run the model in this setup, it'd be more interesting if they compared the performance uplift from using 0/1/2...8 Arc A770 GPUs.

Also, it's probably better to link straight to the relevant section https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...

hmottestad

If you’re running just one GPU your context is limited to 1024 tokens, as far as I could tell. I couldn’t see what the context size is for more cards though.

colorant

Yes, you are right. Unfortunately HN somehow truncated my original URL link.

zamadatix

Sounds like submission "helper" tools are working about as well as normal :).

Did you have the chance to try this out yourself or did you just run across it recently?

yongjik

Did DeepSeek learn how to name their models from OpenAI?

vlovich123

The convention is weird but it's pretty standard in the industry across all models, particularly GGUF. 671B parameters, quantized to 4 bits. The K_M terminology I believe is more specific to GGUF and describes the specific quantization strategy.
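
A rough size estimate implied by the name (the bits-per-weight figure is an approximation; Q4_K_M keeps some tensors at higher precision, so real files run somewhat larger):

```python
# Rough weight-file size implied by "671B" + "Q4_K_M".
params = 671e9              # total parameter count from the model name
bits_per_weight = 4.8       # Q4_K_M averages roughly 4.5-5 bits per weight
print(params * bits_per_weight / 8 / 1e9)   # ~400 GB, in line with the ~380-500 GB RAM/disk figures upthread
```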


CamperBob2

The article could stand to include a bit more information. Why are all the TPS figures x'ed out? What kind of performance can be expected from this setup (and how does it compare to the dual-Epyc workstation recipe that was popularized recently)?

colorant

>8 TPS at this moment on a 2-socket 5th Gen Xeon (EMR)

codetrotter

> the dual Epyc workstation recipe that was popularized recently

Anyone have a link to this one?