
DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon

hnfong

As other commenters have mentioned, the performance of this setup is probably not great, since there's not enough VRAM and lots of bits have to be moved between CPU and GPU RAM.

That said, there are sub-256GB quants of DeepSeek-R1 out there (not the distilled versions). See https://unsloth.ai/blog/deepseekr1-dynamic

I can't quantify the difference between these and the full FP8 versions of DSR1, but I've been playing with these ~Q2 quants and they're surprisingly capable in their own right.
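
If anyone wants to try them, here's a minimal download sketch; the repo id and file pattern are my assumptions based on the blog post, so double-check the exact names on Hugging Face:

```python
# Sketch: fetch one of the Unsloth dynamic quants with huggingface_hub.
# The repo id and pattern below are assumptions -- verify against the blog post.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",    # assumed repo name
    allow_patterns=["*UD-IQ2_XXS*"],       # assumed pattern for the ~2-bit dynamic quant
    local_dir="DeepSeek-R1-GGUF",
)
```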

Another model that deserves mention is DeepSeek v2.5 (which has "fewer" params than V3/R1) - but still needs aggressive quantization before it can run on "consumer" devices (with less than ~100GB VRAM), and this is recently done by a kind soul: https://www.reddit.com/r/LocalLLaMA/comments/1irwx6q/deepsee...

DeepSeek v2.5 is arguably better than Llama 3 70B, so it should be of interest to anyone looking to run local inference. I really think more people should know about this.

SlavikCA

I tried that Unsloth R1 quantization on my dual Xeon Gold 5218 with 384 GB DDR4-2666 (only about half of the memory channels populated, so not optimal).

Type IQ2_XXS / 183GB, 16k context:

CPU only: 3 t/s (tokens per second) for PP (prompt processing) and 1.44 t/s for response.

CPU + NVIDIA RTX 70GB VRAM: 4.74 t/s for PP and 1.87 t/s for response.
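
For a rough sense of whether those numbers are bandwidth-limited, here's a back-of-the-envelope check; the channel count, per-channel bandwidth, and the ~37B active-parameter figure are assumptions rather than measurements:

```python
# Back-of-the-envelope decode ceiling for this box (all values are assumptions).
channel_bw_gbs = 21.3            # DDR4-2666: 2666 MT/s * 8 bytes per channel
channels_used  = 6               # "about half" of the 12 channels on a dual Gold 5218
total_bw_gbs   = channel_bw_gbs * channels_used       # ~128 GB/s aggregate

active_params  = 37e9            # DeepSeek-R1 activates ~37B of its 671B params per token
bits_per_w     = 183 * 8 / 671   # ~2.2 bits/weight for the 183 GB IQ2_XXS file
gb_per_token   = active_params * bits_per_w / 8 / 1e9  # ~10 GB of weights read per token

print(total_bw_gbs / gb_per_token)   # ~12-13 t/s theoretical bandwidth ceiling
# The measured 1.44 t/s is far below that ceiling, which suggests NUMA placement
# and compute overhead matter here, not just raw memory bandwidth.
```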

I wish Unsloth produced a similar quantization for DeepSeek V3 - it would be more useful, as it doesn't need reasoning tokens, so even at the same t/s it would be faster overall.

colorant

https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...

Requirements (for >8 tokens/s):

380GB CPU Memory

1-8 Arc A770

500GB Disk
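
To put those numbers in perspective, here's a quick memory-budget sketch; the 16 GB per-card figure is the A770's VRAM, and the rest is just arithmetic on the quickstart's own requirements:

```python
# Memory-budget sketch for the quickstart requirements above.
cpu_mem_gb   = 380      # quantized weights held in system RAM
a770_vram_gb = 16       # per-card VRAM for offloaded layers / KV cache

for cards in (1, 2, 4, 8):
    print(f"{cards} x A770: {cards * a770_vram_gb} GB VRAM + {cpu_mem_gb} GB CPU memory")
```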

colorant

aurareturn

CPU inference is both bandwidth- and compute-constrained.

If your prompt has 10 tokens, it'll do ok, like in the LinkedIn demo. If you need to increase the context, the compute bottleneck will kick in quickly.

colorant

Prompt length mainly impacts the prefill latency (TTFT, time to first token), not the decoding speed (TPOT, time per output token).
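
To make those terms concrete, here's a toy latency model; the formulas are simplified (they ignore the KV-cache reads that grow with context length) and the example numbers are made up but plausible:

```python
# Toy latency model: prefill (TTFT) scales with prompt length and is mostly
# compute-bound; decode (TPOT) is roughly per-token and mostly bandwidth-bound.
def ttft_s(prompt_tokens: int, flops_per_token: float, compute_flops: float) -> float:
    return prompt_tokens * flops_per_token / compute_flops

def tpot_s(active_weight_bytes: float, mem_bw_bytes_per_s: float) -> float:
    return active_weight_bytes / mem_bw_bytes_per_s

# Example: 2k-token prompt, ~74 GFLOPs/token (2 * 37B active params),
# 30 TFLOPS of usable compute, ~10 GB of active weights, 128 GB/s of bandwidth.
print(ttft_s(2048, 74e9, 30e12))   # ~5 s before the first token
print(tpot_s(10e9, 128e9))         # ~0.08 s per generated token (~13 t/s)
```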

faizshah

Anyone got a rough estimate of the cost of this setup?

I’m guessing it’s under 10k.

I also didn’t see tokens per second numbers.

ynniv

aurareturn

This article keeps getting posted but it runs a thinking model at 3-4 tokens/s. You might as well take a vacation if you ask it a question.

It’s a gimmick and not a real solution.

utopcell

What a teaser article! All this info for setting up the system, but no performance numbers.


jamesy0ung

What exactly does the Xeon do in this situation? Is there a reason you couldn't use any other x86 processor?

VladVladikoff

I think it's that most non-Xeon motherboards don't have enough memory channels to support this much memory with any sort of commercially viable DIMMs.

genewitch

PCIe lanes

hedora

I was about to correct you because this doesn't use PCIe for anything, and then I realized Arc was a GPU (and they support up to 8 per machine).

Any idea how many Arcs it takes to match an H100?

walrus01

There's not much else (other than Epyc) in the way of affordably priced motherboards that support enough RAM. You can buy a used Dell dual-socket older-generation Xeon server with 512GB of RAM for test/development purposes for not very much money.

Under $1500 (before adding video cards or your own SSD), easily, with what I just found in a few minutes of searching. I'm also seeing things with 1024GB of RAM for under $2000.

You also want the capability for more than one full-speed PCI-Express 3.0 x16 card at a minimum, which means you need enough PCIe lanes, and you aren't going to find those on a single-socket Intel workstation motherboard.

Here are a couple of somewhat randomly chosen, affordably priced examples with 512GB of RAM. They'll be power hungry and noisy. The same general idea applies to other x86-64 hardware from HP, Supermicro, etc. These are fairly common, so I'm using them as a baseline for specification vs. price. Configurations will be something like 16 x 32GB DDR4 DIMMs.

https://www.ebay.com/itm/186991103256?_skw=dell+poweredge+t6...

https://www.ebay.com/itm/235978320621?_skw=dell+poweredge+r7...

https://www.ebay.com/itm/115819389940?_skw=dell+poweredge+r7...

mrbonner

I see there are a few options for running LLM and Stable Diffusion inference outside of Nvidia: Intel Arc, Apple M-series, and now AMD Ryzen AI Max. Obviously running on Nvidia would be the optimal way. But given the scarcity of high-VRAM Nvidia cards at a reasonable price, I can't stop thinking about getting one that is not Nvidia. So, if I'm not interested in training or fine-tuning, would any of those solutions actually work? On a Linux machine?

999900000999

If you actually want to seriously do this, go with Nvidia.

This article is basically Intel saying "remember us, we made a GPU!" And they make great budget cards, but the ecosystem is just so far behind.

Honestly this is not something you can really do on a budget.


ryao

Where is the benchmark data?

zamadatix

Since the Xeon alone could run the model in this setup, it'd be more interesting if they compared the performance uplift from using 0/1/2...8 Arc A770 GPUs.

Also, it's probably better to link straight to the relevant section https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...

hmottestad

If you’re running just one GPU your context is limited to 1024 tokens, as far as I could tell. I couldn’t see what the context size is for more cards though.

colorant

Yes, you are right. Unfortunately HN somehow truncated my original URL link.

zamadatix

Sounds like submission "helper" tools are working about as well as normal :).

Did you have the chance to try this out yourself or did you just run across it recently?

yongjik

Did DeepSeek learn how to name their models from OpenAI?

vlovich123

The convention is weird but it's pretty standard in the industry across all models, particularly GGUF. 671B parameters, quantized to 4 bits. The K_M terminology I believe is more specific to GGUF and describes the specific quantization strategy.
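
A rough size estimate implied by the name (the bits-per-weight figure is an approximation; Q4_K_M keeps some tensors at higher precision, so real files run somewhat larger):

```python
# Rough weight-file size implied by "671B" + "Q4_K_M".
params = 671e9              # total parameter count from the model name
bits_per_weight = 4.8       # Q4_K_M averages roughly 4.5-5 bits per weight
print(params * bits_per_weight / 8 / 1e9)   # ~400 GB, in line with the ~380-500 GB RAM/disk figures upthread
```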


CamperBob2

The article could stand to include a bit more information. Why are all the TPS figures x'ed out? What kind of performance can be expected from this setup (and how does it compare to the dual-Epyc workstation recipe that was popularized recently)?

colorant

>8 TPS at this moment on a 2-socket 5th Gen Xeon (EMR)

codetrotter

> the dual Epyc workstation recipe that was popularized recently

Anyone have a link to this one?