
How to Run DeepSeek R1 671B Locally on a $2000 EPYC Server

geertj

This runs the 671B model in Q4 quantization at 3.5-4.25 TPS for $2K on a single socket Epyc server motherboard using 512GB of RAM.

This [1] X thread runs the 671B model in the original Q8 at 6-8 TPS for $6K using a dual-socket Epyc server motherboard with 768GB of RAM. I think this could be made cheaper by getting slower RAM, but since this is RAM bandwidth limited that would likely reduce TPS. I’d be curious if this would just be a linear slowdown proportional to the RAM MHz or whether CAS latency plays into it as well.

[1] https://x.com/carrigmat/status/1884244369907278106?s=46&t=5D...

nielsole

I've been running the unsloth 200GB dynamic quantisation with 8k context on my 64GB Ryzen 7 5800G. CPU and iGPU utilization were super low, because it basically has to read the entire model from disk. (Looks like it needs ~40GB of actual memory that it cannot easily mmap from disk.) With a Samsung 970 Evo Plus that gave me 2.5GB/s read speed. That came out at 0.15 tps. Not bad for completely underspecced hardware.

Given the model has so few active parameters per token (~40B), it is likely that just being able to hold it in memory removes the largest bottleneck. I guess with a single consumer PCIe 4.0 x16 graphics card you could get at most ~1 tps just because of the PCIe transfer speed? Maybe CPU processing can be faster simply because DDR transfer is faster than transfer to the graphics card.
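A rough back-of-the-envelope for the disk-bound case (my numbers, treating the SSD read as the only limit):

    # Upper bound on tokens/sec when the active weights stream from disk every token.
    # Assumptions: ~200 GB quantized model, ~40B of 671B params active per token,
    # ~2.5 GB/s sequential read from the NVMe drive.
    model_bytes = 200e9
    total_params = 671e9
    active_params = 40e9
    disk_bw = 2.5e9                                    # bytes/sec

    bytes_per_param = model_bytes / total_params       # ~0.30 bytes after quantization
    bytes_per_token = active_params * bytes_per_param  # ~12 GB read per token
    print(disk_bw / bytes_per_token)                   # ~0.21 tok/s ceiling

which lands in the same ballpark as the observed 0.15 tps (caching and non-sequential reads eat the rest).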

TeMPOraL

To add another datapoint, I've been running the 131GB (140GB on disk) 1.58 bit dynamic quant from Unsloth with 4k context on my 32GB Ryzen 7 2700X (8 cores, 3.70 GHz), and achieved exactly the same speed - around 0.15 tps on average, sometimes dropping to 0.11 tps, occasionally going up to 0.16 tps. Roughly half your specs, a roughly half-size quant, same tps.

I've had to disable the overload safeties in LM Studio and tweak with some loader parameters to get the model to run mostly from disk (NVMe SSD), but once it did, it also used very little CPU!

I tried offloading to GPU, but my RTX 4070 Ti (12GB VRAM) can take at most 4 layers, and it turned out to make no difference in tps.

My RAM is DDR4, maybe switching to DDR5 would improve things? Testing that would require replacing everything but the GPU, though, as my motherboard is too old :/.

Eisenstein

More channels > faster ram.

Some math:

DDR5-6000 is 3000 MHz x 2 (double data rate) x 64 bits / 8 bits per byte = 48,000 MB/s = 48 GB/s per channel

DDR3-1866 is 933 MHz x 2 x 64 / 8 / 1000 = 14.93 GB/s per channel. If you have 4 channels that is 4 x 14.93 = 59.72 GB/s.
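The same arithmetic as a small helper, for checking other configurations (theoretical peaks; real STREAM-style numbers usually land well below these):

    def theoretical_bw_gbs(mt_per_s, channels, bus_bits=64):
        """Peak bandwidth in GB/s: transfers/sec x bytes per transfer x channels."""
        return mt_per_s * 1e6 * (bus_bits / 8) * channels / 1e9

    print(theoretical_bw_gbs(6000, channels=1))   # DDR5-6000, one channel   -> 48.0
    print(theoretical_bw_gbs(1866, channels=4))   # DDR3-1866, quad channel  -> 59.7
    print(theoretical_bw_gbs(4800, channels=12))  # DDR5-4800, 12-ch EPYC    -> 460.8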

kristianp

For a 131GB model, the biggest difference would be to fit it all in RAM, e.g. get 192GB of RAM. Sorry if this is too obvious, but it's pointless to run an LLM if it doesn't fit in RAM, even if it's an MoE model. And also obviously, it may take a server motherboard and CPU to fit that much RAM.

numpad0

I wonder if one could just replicate the "Mac mini LLM cluster" setup over Ethernet of some form with 128GB of DDR4 RAM per node. Used DDR4 RAM with likely dead bits is dirt cheap, but I would imagine there will be challenges linking systems together.

conor_mc

I wonder if the now-abandoned Intel Optane drives could help with this. They had very low latency, high IOPS, and decent throughput. They made RAM modules as well. A RAM disk made of them might be faster.

loxias

Intel PMem really shines for things you need to be non-volatile (preserved when the power goes out) like fast changing rows in a database. As far as I understand it, "for when you need millions of TPS on a DB that can't fit in RAM" was/is the "killer app" of PMem.

Which suggests it wouldn't be quite the right fit here -- the precomputed constants in the model aren't changing, nor do they need to persist.

Still, interesting question, and I wonder if there's some other existing bit of tech that can be repurposed for this.

I wonder if/when this application (LLMs in general) will slow down and stabilize long enough for anything but general purpose components to make sense. Like, we could totally shove model parameters in some sort of ROM and have hardware offload for a transformer, IF it wasn't the case that 10 years from now we might be on to some other paradigm.

smcleod

I get around 4-5 t/s with the unsloth 1.58bit quant on my home server that has 2x3090 and 192GB of DDR5 on a Ryzen 9; usable but slow.

segmondy

how much context size?

baobun

I imagine you can get more by striping drives. Depending on what chipset you have, the CPU should handle at least 4. Sucks that no AM4 APU supports PCIe 4 while the platform otherwise does.

geertj

> I’d be curious if this would just be a linear slowdown proportional to the RAM MHz or whether CAS latency plays into it as well.

Per o3-mini, the blocked GEMM (matrix multiply) operations have very good locality, and therefore MT/s should matter much more than CAS latency.

iwontberude

I have been doing this with an Epyc 7402 and 512GB of DDR4 and it's been fairly performant; you don't have to wait very long to get pretty good results. It's still LLM levels of bad, but at least I don't have to pay $20/mo to OpenAI.

whatevaa

I don't think the cost of such a machine will ever beat $20/mo, though. Capital costs are too high.

iwontberude

But I use it for many other use cases and hosting protocol based services that would otherwise expose me to advertising or additional service charges. It's just like people that buy solar panels instead of buying service from the power company. You get the benefit with your multi-year ROI.

I built the machine for $5500 four years ago and it certainly has not paid for itself, but it still has tons of utility and will probably last another four years, bringing my monthly cost to ~$50/mo, which is way lower than what a cloud provider would charge, especially considering egress network traffic. Instead of paying for Discord, Twitter, Netflix/Hulu/Amazon/etc., paid game hosting, and ChatGPT, I can self-host Jitsi/Matrix, Bluesky, Plex, SteamCMD, and ollama. In total I end up spending about the same, but I have way more control, better access to content, and can do more when offline during internet outages.

Thanks to Cloudflare Tunnel, I don't have to pay a cloud vendor, CDN, or VPN for good routes to my web resources or opt into paid DDoS protection services. It's fantastic.

3abiton

3x the price for less than 2x the speed increase. I don't think the price justifies the upgrade.

phonon

Q4 vs Q8.

m348e912

> TacticalCoder [dead] wrote:

> TFA says it can bump the spec to 768 GB but that it's then more like $2500 than $2000. At 768 GB that'd be the full 8-bit model.

> Seems indeed like a good price compared to $6000 for someone who wants to hack a build.

> I mean: $6K is doable, but I take it many who'd want to build such a machine for fun would prefer to only fork out $2.5K.

I am not sure why TacticalCoder's comment was downvoted to oblivion. I would have upvoted if the comment wasn't already dead.

bee_rider

I mean, nothing ever actually scales linearly, right?

TacticalCoder

TFA says it can bump the spec to 768 GB but that it's then more like $2500 than $2000. At 768 GB that'd be the full 8-bit model.

Seems indeed like a good price compared to $6000 for someone who wants to hack a build.

I mean: $6K is doable, but I take it many who'd want to build such a machine for fun would prefer to only fork out $2.5K.

manmal

The Q8 model will likely slow this down to 50%, probably not a very useful speed. The 6k setup will probably do 10-12t/s at Q4.
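
The 50% figure falls out of a bandwidth-bound model of decoding: every token streams the active weights through RAM once, so doubling bytes per weight halves the ceiling. A rough sketch (the bandwidth and active-parameter numbers are assumptions):

    def tps_ceiling(mem_bw_gbs, active_params_billion, bytes_per_param):
        """Upper bound on decode tokens/sec if RAM bandwidth is the only limit."""
        return mem_bw_gbs / (active_params_billion * bytes_per_param)

    mem_bw = 400   # GB/s, rough figure for a 12-channel DDR5 EPYC socket
    active = 37    # billions of active MoE parameters per token
    print(tps_ceiling(mem_bw, active, 0.5))  # Q4: ~21.6 tok/s ceiling
    print(tps_ceiling(mem_bw, active, 1.0))  # Q8: ~10.8 tok/s ceiling, i.e. half

Real builds land well below these ceilings (the article's rig gets 3.5-4.25 TPS), but the Q8-to-Q4 ratio should hold.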

plagiarist

Is there a source that unrolls that without creating an account?

isoprophlex

Online, R1 costs what, $2/MTok?

This rig does >4 tok/s, which is ~15-20 ktok/hr, or $0.04/hr when purchased through a provider.

You're probably spending $0.20/hr on power (1 kW) alone.

Cool achievement, but to me it doesn't make a lot of sense (besides privacy...)
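
For anyone redoing that arithmetic, a small sketch with the assumptions spelled out (the wattage is the hand-wavy part; the builder reports ~260 W loaded further down the thread):

    api_price_per_mtok = 2.00          # $/Mtok for hosted R1 (assumed)
    local_tps = 4                      # tok/s from the local rig
    tokens_per_hour = local_tps * 3600                              # ~14,400
    api_cost_per_hour = tokens_per_hour / 1e6 * api_price_per_mtok  # ~$0.03/hr

    power_price = 0.20                 # $/kWh (assumed)
    for watts in (1000, 260):          # the 1 kW guess vs the reported ~260 W
        print(watts, watts / 1000 * power_price)  # $0.20/hr vs ~$0.05/hr

So the conclusion depends heavily on the wattage and electricity price you plug in.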

rightbyte

> Cool achievement, but to me it doesn't make a lot of sense (besides privacy...)

I would argue that is enough and that this is awesome. It was a long time ago I wanted to do a tech hack like this much.

isoprophlex

Well thinking about it a bit more, it would be so cool if you could

A) somehow continuously interact with the running model, ambient-computing style. Say have the thing observe you as you work, letting it store memories.

B) allowing it to process those memories when it chooses to/whenever it's not getting any external input/when it is "sleeping" and

C) (this is probably very difficult) have it change its own weights somehow due to whatever it does in A+B.

THAT, in a privacy-friendly self-hosted package, I'd pay serious money for.

Willingham

I imagine it could solve crimes if it watched millions of hours of security footage… scary thought. Possibly it could arrest us before we even commit a crime, through prediction, like that Black Mirror episode.

codetrotter

> doesn't make a lot of sense (besides privacy...)

Privacy is worth very much though.

onlyrealcuzzo

What privacy benefit do you get running this locally vs renting a baremetal GPU and running it there?

Wouldn't that be much more cost-effective?

Especially when you inevitably want to run a better / different model in the near future that would benefit from different hardware?

You can get similar Tok/sec on a single RTX 4090 - which you can rent for <$1/hr.

randomjoe2

But at a totally different quant, you're crazy if you think you can run the entire R1 model on a single 4090, come on man. Apples and oranges.

infecto

Definitely but when you can run this in places like Azure with tight contracts it makes little sense except for the ultra paranoid.

gloflo

Considering the power of three letter agencies in the USA and the complete unhingedness of the new administration, I would not trust anything to a contract.

jpc0

You could absolutely install 2kW of solar for probably around $2-4k, and then at worst it turns your daytime usage into $0. I also would be surprised if this was pulling 1kW in reality; I would want to see an actual measurement of what it is realistically pulling at the wall.

I believe it was an 850w PSU on the spec sheet?

dboreham

Quick note that solar power doesn't have zero cost.

bee_rider

It could have zero marginal cost, right? In particular, if you over-provisioned your solar installation already anyway, most of the time it should be producing more energy than you need.

madduci

And in winter, depending on the region, it might generate 0kW

killingtime74

Marginal cost $0, 2kw solar + inverter + battery + install is worth more than this rig

jpc0

No need for a battery, and the battery is by far your largest cost. This could 100% just fall back to grid power; it's not backup power, it's reducing usage.

Not sure about where you are, but where I am a 2kW system plus li-ion batteries is about 2 months of the average salary here (not a tech salary, the average salary); to put it into perspective, converted to USD that is 1550 USD. Panels are maybe 20% of that cost; you can add 4kW of panels for 450 USD where I am.

So for less than the price of that PC I would be able to do 2kW of solar with li-ion batteries and overspec the panels by double. None of that is cheaping out on components; you can absolutely get lower than that by cheaping out. Installation will be maybe another 500-600 USD here, likely to be much higher depending on region. Also, to put it into perspective, we pay about 0.30 USD per kWh for electricity, and this would pay for itself in savings within a year or two.

By the time it needs to be replaced, which is 5-7 years for the equipment I just got pricing on, it would have 100% offset the cost of running.

Again I am lucky and we effectively get 80-100% output year round even with cloud cover, you might be pretty far north and that doesn't apply.

TL;DR: it depends, but if you are in the right region and this setup generates even some income for you, the cost to go solar is negative; it would actually not make financial sense not to do it, considering a 2K USD box was in your budget.
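
For what it's worth, a quick payback sketch with the numbers above (rounded; the sun-hours figure is my assumption and obviously region-dependent):

    system_cost = 1550 + 550        # ~hardware + install, USD, from the figures above
    panel_kw = 2.0
    sun_hours_per_day = 5           # assumed; near-ideal conditions as described
    price_per_kwh = 0.30            # USD

    annual_savings = panel_kw * sun_hours_per_day * 365 * price_per_kwh
    print(annual_savings, system_cost / annual_savings)   # ~$1,095/yr, ~1.9 year payback

which matches the one-to-two-year payback described above.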

rufus_foreman

Privacy, for me, is a necessary feature for something like this.

And I think your math is off, $0.20 per kWh at 1 kW is $145 a month. I pay $0.06 per kWh. I've got what, 7 or 8 computers running right now and my electric bill for that and everything else is around $100 a month, at least until I start using AC. I don't think the power usage of something like this would be significant enough for me to even shut it off when I wasn't using it.

Anyway, we'll find out, just ordered the motherboard.

michpoch

> I pay $0.06 per kWh

That is like, insanely cheap. In Europe I'd expect prices between $0.15 - 0.25 per kWh. $0.06 sounds like you live next to some solar farm or large hydro installation? Is that a total price, with transfer?

rufus_foreman

So one thing, it's the winter rate, summer rate is higher when people run their AC and the energy company has to use higher cost sources. Second, it's a tiered rate, first x amount of kWh is a higher rate, then once you reach that amount it's a lower rate. But I'm already above the tier cutoff every month no matter what, so marginal winter rate is around $0.06.

rodonn

Depends on where you live. The average in San Francisco is $0.29 per kWh.

magic_hamster

This gets you the (arguably) most powerful AI in the world running completely privately, under your control, for around $2000. There are many use cases where you wouldn't want to send your prompts and data to a 3rd party. A lot of businesses have a data export policy where you are just not allowed to use company data anywhere but internal services. This is actually insanely useful.

api

How is it that cloud LLMs can be so much cheaper? Especially given that local compute, RAM, and storage are often orders of magnitude cheaper than cloud.

Is it possible that this is an AI bubble subsidy where we are actually getting it below cost?

Of course for conventional compute cloud markup is ludicrous, so maybe this is just cloud economy of scale with a much smaller markup.

NeutralCrane

My guess is two things:

1. Economies of scale. Cloud providers are using clusters in the tens of thousands of GPUs. I think they are able to run inference much more efficiently than you would be able to in a single cluster just built for your needs.

2. As you mentioned, they are selling at a loss. OpenAI is hugely unprofitable, and they reportedly lose money on every query.

boredatoms

The purchase price for a H100 is dramatically lower when you buy a few thousand at a time

thijson

I think batch processing of many requests is cheaper. As each layer of the model is loaded into cache, you can put through many prompts. Running it locally you don't have that benefit.
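
A toy model of that effect: the per-step weight read is paid once no matter how many prompts ride along, so aggregate throughput scales until per-sequence compute dominates (all constants below are illustrative, not measured):

    def step_time(batch, weight_gb_per_step=18.5, mem_bw_gbs=400.0,
                  compute_s_per_seq=0.002):
        """One decode step: a single pass over the weights plus per-sequence compute."""
        return weight_gb_per_step / mem_bw_gbs + batch * compute_s_per_seq

    for batch in (1, 8, 64):
        print(batch, batch / step_time(batch))   # ~21, ~129, ~367 aggregate tok/s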

michpoch

> Especially given that local compute, RAM, and storage are often orders of magnitude cheaper than cloud

He uses old, much less efficient GPUs.

He also did not select his living location based on the electricity prices, unlike the cloud providers.

realusername

It's cheaper because you are unlikely to run your local AI at top capacity 24/7 so you have unused capacity which you are paying for.

davrosthedalek

The calculation shows it's cheaper even if you run local AI 24/7

NeutralCrane

They are specifically referring to usage of APIs where you just pay by the token, not by compute. In this case, you aren’t paying for capacity at all, just usage.

octacat

It is shared between users and better utilized and optimized.

otabdeveloper4

"Sharing between users" doesn't make it cheaper. It makes it more expensive due to the inherent inefficiencies of switching user contexts. (Unless your sales people are doing some underhanded market segmentation trickery, of course.)

agieocean

Isn't that just because they can get massive discounts on hardware buying in bulk (for lack of a proper term) + absorb losses?

raducu

All that, but also because they have those GPUs with crazy amounts of RAM and crazy bandwidth? So the TPS is that much higher, but in terms of power, I guess those boards run in the same ballpark of power used by consumer GPUs?

matja

How would it use 1kW? Socket SP3 tops out at 280W and the system in the article has an 850W PSU, so I'm not sure what I'm missing.

falcor84

I assume that the parent just rounded 850W up to 1kW, no?

isoprophlex

Yeah, I was vigorously waving hands. Even at 200W and 10 cents/kWh, you'd need to run this a LONG time to break even.

topbanana

The point is running locally, not efficiently

huijzer

What is a bit weird about AI currently is that you basically always want to run the best model, but the price of the hardware is a bit ridiculous. In the 1990s, it was possible to run Linux on scrappy hardware. You could also always run other “building blocks” like Python, Docker, or C++ easily.

But the newest AI models require an order of magnitude more RAM than my system or the systems I typically rent have.

So I’m curious, for the people here: has this happened before in the history of software? Maybe computer games are a good example; there, people would also have to upgrade their systems to run the latest games.

spamizbad

Like AI, there were exciting classes of applications in the 70s, 80s and 90s that mandated pricier hardware. Anything 3D related, running multi-user systems, higher end CAD/EDA tooling, and running any server that actually got put under “real” load (more than 20 users).

If anything this isn’t so bad: $4K in 2025 dollars is an affordable desktop computer from the 90s.

lukeschlather

The thing is I'm not that interested in running something that will run on a $4K rig. I'm a little frustrated by articles like this, because they claim to be running "R1" but it's a quantized version and/or it has a small context window... it's not meaningfully R1. I think to actually run R1 properly you need more like $250k.

But it's hard to tell because most of the stuff posted is people trying to do duct tape and baling wire solutions.

mechagodzilla

I can run the 671B-Q8 version of R1 with a big context on a used dual-socket Xeon I bought for about $2k with 768GB of RAM. It gets about 1-1.5 tokens/sec, which is fine to give it a prompt and just come back an hour or so later. To get to many 10s of tokens/sec, you would need >8 GPUs with 80GB of HBM each, and you're probably talking well north of $250k. For the price, the 'used workstation with a ton of DDR4' approach works amazingly well.

adenta

If you google, there is a $6k setup for the non-quantized version running like 3-4 tps.

handzhiev

Indeed, even design and prepress required quite expensive hardware. There was a time when very expensive Silicon Graphics workstations were a thing.

Keyframe

Of course it has. Coughs in SGI and advanced 3D and video software like PowerAnimator, Softimage, Flame. Hardware + software combo starting around 60k of 90's dollars, but to do something really useful with it you'd have to enter 100-250k of 90's dollars range.

tarruda

> What is a bit weird about AI currently is that you basically always want to run the best model,

I think the problem is thinking that you always need to use the best LLM. Consider this:

- When you don't need correct output (such as when writing a blog post, there's no right/wrong answer), "best" can be subjective.

- When you need correct output (such as when coding), you always need to review the result, no matter how good the model is.

IMO you can get 70% of the value of high-end proprietary models by just using something like Llama 8b, which is runnable on most commodity hardware. That should increase to something like 80% - 90% when using bigger open models such as the newly released "Mistral Small 3".

lukeschlather

With o1 I had a hairy mathematical problem recently related to video transcoding. I explained my flawed reasoning to o1, and it was kind of funny in that it took roughly the same amount of time to figure out the flaw in my reasoning, but it did, and it also provided detailed reasoning with correct math to correct me. Something like Llama 8b would've been worse than useless. I ran the same prompt by ChatGPT and Gemini, and both gave me sycophantic confirmation of my flawed reasoning.

> When you don't need correct output (such as when writing a blog post, there's no right/wrong answer), "best" can be subjective.

This is like, everything that is wrong with the Internet in a single sentence. If you are writing a blog post, please write the best blog post you can, if you don't have a strong opinion on "best," don't write.

rblatz

This isn’t the best comment I’ve seen on HN; you should delete it, or stop gatekeeping.

lurking_swe

for coding insights / suggestions as you type, similar to copilot, i agree.

for rapidly developing prototypes or working on side projects, i find llama 8b useless. it might take 5-6 iterations to generate something truly useful. compared to say 1-shot with claude sonnet 3.5 or open ai gpt-4o. that’s a lot less typing and time wasted.

NegativeK

I'm not sure Linux is the best comparison; it was specifically created to run on standard PC hardware. We have user access to AI models for little or no monetary cost, but they can be insanely expensive to run.

Maybe a better comparison would be weather simulations in the 90s? We had access to their outputs in the 90s but running the comparable calculations as a regular Joe might've actually been impossible without a huge bankroll.

bee_rider

Or 3D rendering, or even particularly intense graphic design-y stuff I think, right? In the 90’s… I mean, computers in the $1k-$2k range were pretty much entry level, right?

detourdog

The early 90s and digital graphics production. Computer upgrades could make intensive alterations interactive; this was true of Photoshop and Excel. There were many bottlenecks to speed. Upgrading a network of graphics machines from 10 Mbit to 100 Mbit networking did wonders for server-based workflows.

evilduck

Adjusting for inflation, $2000 is about the same price as the first iMac, an entry level consumer PC at the time. Local AI is still pretty accessible to hobbyist level spending.

diffeomorphism

Not adjusting at all, this is not "entry level" but rather "enthusiast"

https://www.logicalincrements.com/

Still accessible but only for dedicated hobbyists with deeper pockets.

svilen_dobrev

Well, if there was, e.g., a model trained just for coding (i.e. specialization as such, having models trained mostly for this or that) instead of everything including Shakespeare, the kitchen sink, and the biology of the cockroaches under it, that would make those runnable on much lower-end hardware. But there is only one The-Big-Deal, in many incarnations.

ant6n

Read “Masters of Doom”; they go into quite some detail on how Carmack got himself a very expensive workstation to develop Doom/Quake.

notsylver

I think it would be more interesting to do this with smaller models (33b-70b) and see if you could get 5-10 tokens/sec on a budget. I've desperately wanted something local that's around the same level as 4o, but I'm not in a hurry to spend $3k on an overpriced GPU or $2k on this.

gliptic

Your best bet for 33B is already having a computer and buying a used RTX 3090 for <$1k. I don't think there's currently any cheap options for 70B that would give you >5. High memory bandwidth is just too expensive. Strix Halo might give you >5 once it comes out, but will probably be significantly more than $1k for 64 GB RAM.

ants_everywhere

With used GPUs do you have to be concerned that they're close to EOL due to high utilization in a Bitcoin or AI rig?

gliptic

I guess it will be a bigger issue the longer it's been since they stopped making them, but most people I've heard from (including me) haven't had any issues. Crypto rigs don't necessarily break GPUs faster because they care about power consumption and run the cards at a pretty even temperature. What probably breaks first is the fans. You might also have to open the card up and repaste/repad them to keep the cooling under control.

EVa5I7bHFq9mnYK

GPUs were last used for Bitcoin mining in 2013, so you shouldn't be concerned unless you are buying a GTX 780.

pmarreck

M4 Mac with unified GPU RAM

Not very cheap though! But you get a quite usable personal computer with it...

gliptic

Any that can run 70B at >5 t/s are >$2k as far as I know.

jjallen

How does inference happen on a GPU with such limited memory compared with the full requirements of the model? This is something I’ve been wondering for a while

Gracana

You can run a quantized version of the model to reduce the memory requirements, and you can do partial offload, where some of the model is on GPU and some is on CPU. If you are running a 70B Q4, that’s 40-ish GB including some context cache, and you can offload at least half onto a 3090, which will run its portion of the load very fast. It makes a huge difference even if you can’t fit every layer on the GPU.
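
As a concrete sketch of that, using llama-cpp-python (the model path and layer count are placeholders; the right n_gpu_layers depends on your VRAM, quant, and context size):

    from llama_cpp import Llama

    # 70B Q4 GGUF split between a 24 GB GPU and system RAM.
    llm = Llama(
        model_path="models/llama-70b.Q4_K_M.gguf",  # hypothetical path
        n_gpu_layers=40,   # as many layers as fit in VRAM; the rest run on CPU
        n_ctx=4096,        # context window; the KV cache also competes for VRAM
    )

    out = llm("Explain memory-bandwidth-bound inference in one paragraph.", max_tokens=200)
    print(out["choices"][0]["text"])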

ynniv

Umm, two 3090s? Additional cards scale as long as you have enough PCIe lanes.

gliptic

I arbitrarily chose $1k as the "cheap" cut-off. Two 3090 is definitely the most bang for the buck if you can fit them.

api

Apple M chips with their unified GPU memory are not terrible. I have one of the first M1 Max laptops with 64G and it can run up to 70B models at very useful speeds. Newer M series are going to be faster and they offer more RAM now.

Are there any other laptops around other than the larger M series Macs that can run 30-70B LLMs at usable speeds that also have useful battery life and don’t sound like a jet taxiing to the runway?

For non-portables I bet a huge desktop or server CPU with fast RAM beats the Mac Mini and Studio for price performance, but I’d be curious to see benchmarks comparing fast many core CPU performance to a large M series GPU with unified RAM.

jenny91

As a data point: you can get an RTX 3090 for ~$1.2k and it runs deepseek-r1:32b perfectly fine via Ollama + open webui at ~35 tok/s in an OpenAI-like web app and basically as fast as 4o.

kevinak

You mean Qwen 32b fine-tuned on Deepseek :)

There is only one model of Deepseek (671b), all others are fine-tunes of other models

driverdan

> you can get an RTX 3090 for ~$1.2k

If you're paying that much you're being ripped off. They're $800-900 on eBay and IMO are still overpriced.

bick_nyers

It will be slower for a 70b model since Deepseek is an MoE that only activates 37b at a time. That's what makes CPU inference remotely feasible here.

firtoz

Would it be something like this?

> OpenAI's nightmare: DeepSeek R1 on a Raspberry Pi

https://x.com/geerlingguy/status/1884994878477623485

I haven't tried it myself or haven't verified the creds, but seems exciting at least

gliptic

That's 1.2 t/s for the 14B Qwen finetune, not the real R1. Unless you go with the GPU with the extra cost, but hardly anyone but Jeff Geerling is going to run a dedicated GPU on a Pi.

etra0

it's using a Raspberry Pi with a... $1k USD GPU, which kinda defeats the purpose of using the RPi in the first place imo.

or well, I guess you save a bit on power usage.

unethical_ban

I suppose it makes sense, for extremely GPU centric applications, that the pi be used essentially as a controller for the 3090.

firtoz

Oh, I was naive to think that the Pi was capable of some kind of magic (sweaty smile emoji goes here)

spaceport

I put together a $350 build with a 3060 12GB and it's still my favorite build. I run llama 3.2 11b q4 on it; it's a really efficient way to get started and the tps is great.

Svoka

You can run smaller models on a MacBook Pro with ollama at those speeds. Even with several $3k GPUs it won't come close to 4o level.

spaceport

Hi HN, garage YouTuber here. Wanted to add in some stats on the wattages/RAM.

Idle wattage: 60w (well below what I expected, this is w/o GPUs plugged in)

Loaded wattage: 260w

RAM speed I am running currently: 2400 (very likely 3200 would have a decent perf impact)

brunohaid

Still surprised that the $3000 NVIDIA Digits doesn’t come up more often in that and also the gung-ho market cap discussion.

I was an AI sceptic until 6 months ago, but that’s probably going to be my dev setup from spring onwards - running DeepSeek on it locally, with a nice RAG to pull in local documentation and datasheets, plus a curl plugin.

https://www.nvidia.com/en-us/project-digits/

fake-name

It'll probably be more relevant when you can actually buy the things.

It's just vaporware until then.

brunohaid

Call me naive, but I somehow trust them to deliver on time/spec?

It's also a more general comment around "AI desktop appliance" vs homebuilts. I'd rather give NVIDIA/AMD $3k for a well-adjusted local box than tinker too much or feed the next tech moloch, and I have a hunch I'm not the only one feeling that way. Once it's possible, of course.

fake-name

Oh, if it's anything close to what they claim, I'll probably buy one as well, but I certainly do not expect them to deliver on time.

fulafel

Also, LPDDR memory, and no published bandwidth numbers.

Cane_P

Seeing as it is going to deliver 1 PFLOP, it will need to have similar speed to the "native" (GDDR) counterpart, otherwise it will only be able to hit that performance as long as all the data is in the cache...

My guess is that they will use the RTX 5070 Ti laptop version (992 TFLOPS, slightly higher clocked to reach 1000 TFLOPS/ 1 PFLOP).

Their big GB200 chips have 546 GB/s to their LPDDR memory; they could use the same memory controller on the GB10. They don't need to design a new one. It would still be slower than what they are currently using on the RTX 5070 Ti laptop GPU, but any slower than that and there is no chance they could argue it would hit anywhere near 1 PFLOP of FP4. It would only be possible in extreme edge-case scenarios where all the data fits in its 40MB L2 cache.

ganoushoreilly

and people are missing the "Starting at" price. I suspect the advertised specs will end up more than $3k. If it comes out at that price, I'm in for 2. But I'm not holding my breath, given Nvidia and all.

Cane_P

CPU (20 ARM cores), GPU (1 PFLOP of FP4) and memory (128 GB) seem fixed, so the only configurable parts would be storage (up to 4TB) and cabling (if you want to connect two DIGITS).

We kind of know what storage cost in a store and we know that Apple (Mac computers) and every phone manufacturer adds a ton of cost for a small increase. NVIDIA will probably do the same.

I have no idea what the cost for their cabling would be, but they exist in 100G, 200G, 400G and 800G speeds and you seem to need two of them.

If you are only going to use one DIGITS, and you can make do with whatever is the smallest storage option, then it is $3000. Many people might have another computer (set up FTP/SMB or a similar solution), NAS, or USB thumbdrive/external hard drive where they can store extra data, and in that case you can have more storage without paying for more.

ranguna

I'm not sure you can fit a decent quant of R1 in DIGITS: 128 GB of memory is not enough for Q8, and I'm not sure about Q4 but I have my doubts. So you might have to go for something around 1-bit, which has a significant quality loss.

Cane_P

You can connect two, and get 256 GB. But it will still not be enough to run it in native format. You will still need to use lower quant.

diffeomorphism

The webpage does not say $3000 but starting at $3000. I am not so optimistic that the base model will actually be capable of this.

Cane_P

They won't have different models in any other way than storage (up to 4 TB; we don't know the lowest they will sell) and the cabling necessary for connecting two DIGITS (it won't be included in the box).

We already know that it is going to be one single CPU and GPU and fixed memory. The GPU is most likely the RTX 5070 Ti laptop model (992 TFLOPS, clocked 1% higher to get 1 PFLOP).

yapyap

probably because nvidia digits is just a concept rn

christophilus

Aside: it’s pretty amazing what $2K will buy. It’s been a minute since I built my desktop, and this has given me the itch to upgrade.

Any suggestions on building a low-power desktop that still yields decent performance?

Havoc

>Any suggestions on building a low-power desktop that still yields decent performance?

You don't, for now. The bottleneck is memory throughput. That's why people using CPUs for LLMs are running Xeon-ish/EPYC setups... lots of memory channels.

The APU-class gear along the lines of Strix Halo is probably the path closest to lower power, but it's not going to do 500GB of RAM and still doesn't have enough throughput for big models.

spaceport

Not to be that YouTuber who shills my videos all over, but you did ask for a low-powered desktop build, and this $350 one I put together is still my favorite. The 3060 12GB with llama 3.2 vision 11b is a very fun box with low idle power (Intel rules) to leave on 24/7 and have it run some additional services like HA.

https://youtu.be/iflTQFn0jx4

baobun

Hard to know what ranges you have in mind with "decent performance" and "low-power".

I think your best bet might be a Ryzen U-series mini PC. Or perhaps an APU barebone. The ATX platform is not ideal from a power-efficiency perspective (whether inherently or from laziness or conspiracy from mobo and PSU makers, I do not know). If you want the flexibility or scale, you pay the price of course but first make sure it's what you want. I wouldn't look at discrete graphics unless you have specific needs (really high-end gaming, workstation, LLMs, etc) - the integrated graphics of last few years can both drive your 4k monitors and play recent games at 1080p smoothly, albeit perhaps not simultaneously ;)

Lenovo Tiny mq has some really impressive flavors (ECC support at the cost of CPU vendor-lock on PRO models) and there's the whole roster of Chinese competitors and up-and-comers if you're feeling adventurous. Believe me, you can still get creative if you want to scratch the builder itch - thermals are generally what keeps these systems from really roaring (:

jbritton

Does it make any sense to have specialized models, which could possibly be a lot smaller? Say a model that just translates between English and Spanish, or maybe a model that just understands Unix utilities and bash. I don’t know if limiting the training content affects the ultimate output quality or model size.

walterbell

Some enterprises have trained small specialized models based on proprietary data.

https://www.maginative.com/article/nvidia-leverages-ai-to-as...

> NVIDIA researchers customized LLaMA by training it on 24 billion tokens derived from internal documents, code, and other textual data related to chip design. This advanced “pretraining” tuned the model to understand the nuances of hardware engineering. The team then “fine-tuned” ChipNeMo on over 1,000 real-world examples of potential assistance applications collected from NVIDIA’s designers.

2023 paper, https://research.nvidia.com/publication/2023-10_chipnemo-dom...

> Our results show that these domain adaptation techniques enable significant LLM performance improvements over general-purpose base models across the three evaluated applications, enabling up to 5x model size reduction with similar or better performance on a range of design tasks.

2024 paper, https://developer.nvidia.com/blog/streamlining-data-processi...

> Domain-adaptive pretraining (DAPT) of large language models (LLMs) is an important step towards building domain-specific models. These models demonstrate greater capabilities in domain-specific tasks compared to their off-the-shelf open or commercial counterparts.

lhl

Last fall I built a new workstation with an EPYC 9274F (24C Zen4 4.1-4.3GHz, $2400), 384GB 12 x 32GB DDR5-4800 RDIMM ($1600), and a Gigabyte MZ33-AR0 motherboard. I'm slowly populating with GPUs (including using C-Payne MCIO gen5 adapters), not focused on memory, but I did spend some time recently poking at it.

I spent extra on the 9274F because of some published benchmarks [1] that showed that the 9274F had STREAM TRIAD results of 395 GB/s (on 460.8 GB/s of theoretical peak memory bandwidth), however sadly, my results have been nowhere near that. I did testing with LIKWID, Sysbench, and llama-bench, and even w/ an updated BIOS and NUMA tweaks, I was getting <1/2 the Fujitsu benchmark numbers:

  Results for results-f31-l3-srat:
  {
      "likwid_copy": 172.293857421875,
      "likwid_stream": 173.132177734375,
      "likwid_triad": 172.4758203125,
      "sysbench_memory_read_gib": 191.199125,
      "llama_llama-2-7b.Q4_0": {
          "tokens_per_second": 38.361456,
          "model_size_gb": 3.5623703002929688,
          "mbw": 136.6577115303955
      }
  }
For those interested in all the system details/running their own tests (also MLC and PMBW results among others): https://github.com/AUGMXNT/speed-benchmarking/tree/main/epyc...

[1] https://sp.ts.fujitsu.com/dmsp/Publications/public/wp-perfor...
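
Side note for anyone reading those numbers: the "mbw" field appears to be just tokens/sec times the model's resident size, i.e. the memory bandwidth the decode loop is actually achieving:

    print(38.361456 * 3.5623703)   # ~136.66, matching the reported "mbw" of 136.6577...

which is consistent with the LIKWID results above rather than the 460.8 GB/s theoretical peak.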

menaerus

Assuming that you populated the channels correctly, which I believe you did, I can only think that this issue could be related to the motherboard itself or RAM. I think you could start by measuring the single-core RAM bandwidth and latency.

Since the CPU is clocked quite high, figures you should be getting are I guess around ~100ns, but probably less than that, and 40-ish GB/s of BW. If those figures do not match then it could be either a motherboard (HW) or BIOS (SW) issue or RAM stick issue.

If those figures closely match then it's not a RAM issue but a motherboard (BIOS or HW) and you could continue debugging by adding more and more cores to the experiment to understand at which point you hit the saturation point for the bandwidth. It could be a power issue with the mobo.

lhl

Yeah, the channels are populated correctly. As you can see from the mlc-results.txt, the latency looks fine:

   mlc --idle_latency
  Intel(R) Memory Latency Checker - v3.11b
  Command line parameters: --idle_latency

  Using buffer size of 1800.000MiB
  Each iteration took 424.8 base frequency clocks (       104.9   ns)
As do the per-NUMA-node --bandwidth_matrix results:

                  Numa node
  Numa node            0       1       2       3       4       5       6       7
         0        45999.8 46036.3 50490.7 50529.7 50421.0 50427.6 50433.5 52118.2
         1        46099.1 46129.9 52768.3 52122.3 52086.5 52767.6 52122.6 52093.4
         2        46006.3 46095.3 52117.0 52097.2 50385.2 52088.5 50396.1 52077.4
         3        46092.6 46091.5 52153.6 52123.4 52140.3 52134.8 52078.8 52076.1
         4        45718.9 46053.1 52087.3 52124.0 52144.8 50544.5 50492.7 52125.1
         5        46093.7 46107.4 52082.0 52091.2 52147.5 52759.1 52163.7 52179.9
         6        45915.9 45988.2 50412.8 50411.3 50490.8 50473.9 52136.1 52084.9
         7        46134.4 46017.2 52088.9 52114.1 52125.0 52152.9 52056.6 52115.1
I've tried various NUMA configurations (from 1 domain to a per-CCD config) and it doesn't seem to make much difference.

Updating from the board-delivered F14 to the latest 9004 F31 BIOS (the F33 releases bricked the board and required using a BIOS flasher for manual recovery) gave a marginal (5-10%) improvement, but nothing major.

While 1DPC, the memory is 2R (but still registers at 4800), training on every boot. The PMBW graph is probably the most useful behavior chart: https://github.com/AUGMXNT/speed-benchmarking/blob/main/epyc...

Since I'm not so concerned with CPU inference, I feel like the debugging/testing I've done is... the amount I'm going to do, which is enough to at least characterize, if not fix the performance.

I might write up a more step-by-step guide at some point to help others but for now the testing scripts are there - I think most people who are looking at theoretical MBW should probably do their own real-world testing as it seems to vary a lot more than GPU bandwidth.

menaerus

To saturate the bandwidth, you would need ~16 zen4 cores but you could first try running

    likwid-bench -t load -i 100 -w S0:5GB:8:1:2
and see what you get. I think you should be able to get somewhere around ~200 GB/s.

easygenes

This is neat, but what I really want to see is someone running it on 8x 3090/4090/5090 and what is the most practical configuration for that.

gatienboquet

According to NVIDIA:

> a single server with eight H200 GPUs connected using NVLink and NVLink Switch can run the full, 671-billion-parameter DeepSeek-R1 model at up to 3,872 tokens per second.

You can rent a single H200 for $3/hour.

MaxikCZ

I have been searching for a single example of someone running it like this (or 8x P40 and the like), and found nothing.

deoxykev

8x 3090 will net you around 10-12tok/s

bick_nyers

It would not be that slow as it is an MoE model with 37b activated parameters.

Still, 8x3090 gives you ~2.25 bits per weight, which is not a healthy quantization. Doing bifurcation to get up to 16x3090 would be necessary for lightning fast inference with 4bit quants.

At that point though it becomes very hard to build a system due to PCIE lanes, signal integrity, the volume of space you require, the heat generated, and the power requirements.

This is the advantage of moving up to Quadro cards, half the power for 2-4x the VRAM (top end Blackwell Quadro expected to be 96GB).
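
The bits-per-weight arithmetic behind those numbers (ignoring KV cache and activation overhead, which make the real budget tighter):

    params = 671e9
    print(8 * 24e9 * 8 / params)    # eight 24 GB cards   -> ~2.29 bits per weight
    print(16 * 24e9 * 8 / params)   # sixteen 24 GB cards -> ~4.58, room for 4-bit quants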

deoxykev

Yeah, there is a clear bottleneck somewhere in llama.cpp. Even high end hardware is struggling to get good numbers. The theoretical limit should be higher, but it's not yet.

Benchmarks: https://github.com/ggerganov/llama.cpp/issues/11474#issuecom...

rdlw

Is it possible that eight graphics cards is the most practical configuration? How do you even set that up? I guess server mobos have crazy numbers of PCIe slots?

pama

What is the fastest documented way so far to serve the full R1 or V3 models (Q8, not Q4) if the main purpose is inference with many parallel queries and maximizing the total tokens per sec? Did anyone document and benchmark efficient distributed service setups?

manmal

The top comment in this thread mentions a 6k setup, which likely could be used with vLLM with more tinkering. AFAIK vLLM‘s batched inference is great.

snovv_crash

You need enough VRAM to hold the whole thing plus context. So probably a bunch of H100s, or MI300s.

jonotime

I'm also kind of new to this, coming from coding with ChatGPT. Isn't the time to first token important? He is sitting there for minutes waiting for a response. Shouldn't that be a concern?

HarHarVeryFunny

I'd rather wait to get a good response, than get a quick response that is much less useful, and it's the nature of these "reasoning" models that they reason before responding.

Yesterday I was comparing DeepSeek-R1 (NVidia hosted version) with both Sonnet 3.5 (regarded by most as most capable coder) and the new Gemini 2.0 flash, and the wait was worth it. I was trying to get all three to create a web page with a horizontally scrolling timeline with associated clickable photos...

Gemini got to about 90% success after half a dozen prompts, after which it became a frustrating game of whack-a-mole trying to get it to fix the remaining 10% without introducing new bugs - I gave up after ~30min. Sonnet 3.5 looked promising at first, generating based on a sketch I gave it, but also only got to 90%, then hit daily usage limit after a few attempts to complete the task.

DeepSeek-R took a while to generate it, but nailed it on first attempt.

jonotime

Interesting. So in my use, I rarely see GPT get it right on the first pass, but that's mostly due to interpretation of the question. I'm ruling out the times when it hallucinates calls to functions that don't exist.

Let's say I ask for some function that calculates some matrix math in Python. It will spit out something, but I don't like what it did. So I will say: now don't use any calls to that library you pulled in, and also allow for these types of inputs. Add exception handling...

So response time is important since it's a conversation, no matter how correct the response is.

When you say deep seek "nailed it on the first attempt" do you mean it was without bugs? Or do you mean it worked how you imagined? Or what exactly?

HarHarVeryFunny

DeepSeek-R generated a working web page on first attempt, based on a single brief prompt I gave it.

With Sonnet 3.5, given the same brief prompt I gave DeepSeek-R, it took a half dozen feedback steps to get to 90%. Trying a hand drawn sketch input to Sonnet instead was quicker - impressive first attempt, but iterative attempts to fix it failed before I hit the usage limit. Gemini was the slowest to work with, and took a lot of feedback to get to the "almost there" stage, after which it floundered.

The AI companies seem to want to move in the direction of autonomous agents (with reasoning) that you hand a task off to that they'll work on while you do something else. I guess that'd be useful if they are close to human level and can make meaningful progress without feedback, and I suppose today's slow-responding reasoning models can be seen as a step in that direction.

I think I'd personally prefer something fast enough responding to use as a capable "pair programmer", rather than an autonomous agent trying to be an independent team member (at least until the AI gets MUCH better), but in either case being able to do what's being asked is what matters. If the fast/interactive AI only gets me 90% complete (then wastes my time floundering until I figure out it's just not capable of the task), then the slower but more capable model seems preferable as long as it's significantly better.

lukeschlather

The alternative isn't to use a weaker model, the alternative is to solve the problem myself. These are all very academically interesting, but they don't usually save any time. On the other hand, the other day I had a math problem I asked o1 for help with, and it was barely worth it. I realized my problem at the exact moment it gave me the correct answer. I say that because these high-end reasoning models are getting better. "Barely useful" is a huge deal and it seems like we are hitting the inflection point where expensive models are starting to be consistently useful.

HarHarVeryFunny

Yes, it seems we've only recently passed the point where these models are extremely impressive but still not good enough to really be useful, to now being actual time savers for doing quite a few everyday tasks.

The AI companies seem to be pushing AI-assisted software development as an early use case, but I've always thought this is one of the more difficult things for them to become good at, since many/most development tasks require both advanced reasoning (which they are weak at) and ability to learn from experience (which they just can't do). The everyday, non-development tasks, like "take this photo of my credit card bill and give me category subtotals" are where the models are now actually useful, but software development still seems to be an area where they are highly impressive but ultimately not capable enough to be useful outside of certain narrow use cases. That said, it'll be interesting to see how good these reasoning models can get, but I think that things like inability to learn (other than in-context) put a hard limit on what this type of pre-trained LLM tech will be useful for.