Building a personal, private AI computer on a budget

axegon_

I did something similar, but using a K80 and an M40 I dug up from eBay for pennies. Be advised though: stay as far away as possible from the K80 - the drivers were one of the most painful tech things I've ever had to endure, even if 24GB of VRAM for 50 bucks sounds incredibly appealing. That said, I had a decent-ish HP workstation laying around with a 1200-watt power supply, so I had somewhere to put the two of them. The one thing to note here is that these types of GPUs do not have cooling of their own. My solution was to 3D-print a bunch of brackets, attach several Noctua fans, and have them blow at full speed 24/7. Surprisingly, it worked way better than I expected - I've never gone above 60 degrees. As a side effect, the CPUs also benefit from this hack: at idle, they sit in the mid-20s. Mind you, the Noctua fans are located on the front and the back of the case: the ones on the front act as intake and the ones on the back as exhaust, and there are two more inside the case, stuck right in front of the GPUs.

The refurbished workstation was just over 600 bucks, plus another 120 bucks for the GPUs and ~60 for the fans.

Edit: and before someone asks - no, I have not uploaded the STLs anywhere because I haven't had the time, and also because this is a very niche use case, though I might: the back (exhaust) bracket came out brilliantly on the first try - it was a sub-millimeter fit. Then I got cocky, thought I'd also nail the intake on the first try, and ended up re-printing it 4 times.

yjftsjthsd-h

> Be advised though, stay as far away as possible from the K80 - the drivers were one of the most painful tech things I've ever had to endure, even if 24GB of VRAM for 50 bucks sounds incredibly appealing.

I thought the problem was that those cards have loads of RAM but lack really important compute capabilities such that they're kind of useless for actually running AI workloads on. Is that not the case?

almostgotcaught

> Is that not the case?

It is - they're laughably slow and not even supported by the latest CUDA:

> NVIDIA Driver support for Kepler is removed beginning with R495. CUDA Toolkit development support for Kepler continues through CUDA 11.x.

GTP

But DeepSeek R1 doesn't use CUDA, so maybe for this specific case it isn't a big deal?

TrueDuality

I'm running P40s in one of my test boxes. They don't have support for BF16, but they do support F16 and F32, and those are accelerated to a certain degree. They're lacking kernels that are as optimized, but it's not terribly hard to adapt other ones for the purpose.

You don't get great out-of-the-box performance, but it only took me about three work days, with no prior experience writing these kernels, to adapt, test, and validate one using the acceleration hardware that was available.

They're not as powerful as other cards, but still significantly better than running on a CPU alone, and I'd bet my kernel is missing more advanced optimizations.

My issue with these was the power cable and fans. The author touches on the fans; I did try a 3D-printed shroud and some of the higher-pressure fans, but I could only run the cards in short stints. I ended up making an enclosure that vents straight out of the case, using two high-pressure SAN-array fans per card (harvested from the IT graveyard) and cutting a hole with an angle grinder.

The power cable is NOT STANDARD on these. I had to find a weird specific cable to adapt the standard 8-pin GPU connector and each card takes two of these bad boys.

egorfine

> K80 - the drivers were one of the most painful tech things I've ever had to endure

Well, for a dedicated LLM box it might be feasible to suffer with drivers a bit, no? What was your experience like with the software side?

JKCalhoun

Curious what HP workstation you have?

9front

HP Z440, it's in the article.

JKCalhoun

My comment was not directed at the blog but at the person I responded to.

BizarroLand

What kind of performance did you get out of that?

deadbabe

What’s the most pain you’ve ever felt?

kamranjon

For the same price ($1799) you could buy a Mac Mini with 48GB of unified memory and an M4 Pro. It'd probably use less power, be much quieter to run, and likely could outperform this setup in terms of tokens per second. I enjoyed the write-up still, but I would probably just buy a Mac in this situation.

diggan

> likely could outperform this setup in terms of tokens per second

I've heard arguments both for and against this, but they always lack concrete numbers.

I'd love something like "Here is Qwen2.5 at Q4 quantization running via Ollama + these settings, and M4 24GB RAM gets X tokens/s while RTX 3090ti gets Y tokens/s", otherwise we're just propagating mostly anecdotes without any reality-checks.
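
Something like the sketch below (assuming Ollama's default local HTTP API; the model tag is just a placeholder) is all it would take to get comparable numbers:

    import requests

    # Assumes a local Ollama server on its default port; the model tag below is
    # a placeholder - substitute one you have actually pulled.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen2.5:32b-instruct-q4_K_M",  # placeholder tag
            "prompt": "Write 300 words about GPU memory bandwidth.",
            "stream": False,
        },
        timeout=600,
    )
    stats = resp.json()

    # Ollama reports all durations in nanoseconds.
    prompt_tps = stats["prompt_eval_count"] / (stats["prompt_eval_duration"] / 1e9)
    gen_tps = stats["eval_count"] / (stats["eval_duration"] / 1e9)
    print(f"prompt eval: {prompt_tps:.1f} tok/s, generation: {gen_tps:.1f} tok/s")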

fkyoureadthedoc

On an M1 Max 64GB laptop running gemma2:27b with the same prompt and settings from the blog post:

    total duration:       24.919887458s
    load duration:        39.315083ms
    prompt eval count:    37 token(s)
    prompt eval duration: 963.071ms
    prompt eval rate:     38.42 tokens/s
    eval count:           441 token(s)
    eval duration:        23.916616s
    eval rate:            18.44 tokens/s
I have a gaming PC with a 4090 I could try, but I don't think this model would fit.

condiment

On a 3090 (24gb vram), same prompt & quant, I can report more than double the tokens per second, and significantly faster prompt eval.

    total_duration:       10530451000
    load_duration:        54350253
    prompt_eval_count:    36
    prompt_eval_duration: 29000000
    prompt_token/s:       1241.38
    eval_count:           460
    eval_duration:        10445000000
    response_token/s:     44.04
Fast prompt eval is important when feeding larger contexts into these models, which is required for almost anything useful. GPUs have other advantages for traditional ML, whisper models, vision, and image generation. There's a lot of flexibility that doesn't really get discussed when folks trot out the 'just buy a mac' line.

Anecdotally I can share my revealed preference. I have both an M3 (36gb) as well as a GPU machine, and I went through the trouble of putting my GPU box online because it was so much faster than the mac. And doubling up the GPUs allows me to run models like the deepseek-tuned llama 3.3, with which I have completely replaced my use of chatgpt 4o.

diggan

> gemma2:27b

What quantization are you using? What's the runtime+version you run this with? And the rest of the settings?

Edit: Turns out parent is using Q4 for their test. Doing the same test with LM Studio and a 3090ti + Ryzen 5950X (with 44 layers on GPU, 2 on CPU) I get ~15 tokens/second.

fkyoureadthedoc

7800X3D, 32GB DDR5, 4090:

    total duration:       10.5922028s
    load duration:        21.1739ms
    prompt eval count:    36 token(s)
    prompt eval duration: 546ms
    prompt eval rate:     65.93 tokens/s
    eval count:           467 token(s)
    eval duration:        10.023s
    eval rate:            46.59 tokens/s

cruffle_duffle

I think we are still somewhat at the “fuzzy super early adopter” stage of this local LLM game, and hard data is not going to be easy to come by. I almost want to use the word “hobbyist stage”, where almost all of the “data” and “best practice” is anecdotal, but I think we are a step above that.

Still, it's way too early, and there are simply way too many hardware and software combinations that change almost weekly to establish “the best practice hardware configuration for training / inferencing large language models locally”.

Someday there will be established guides with solid data. In fact, someday there will be PCs that specifically target LLMs and will feature all kinds of stats aimed at getting you to bust out your wallet. And I even predict they'll come up with metrics that all the players will chase well beyond when those metrics make sense (megapixels, clock frequency, etc.)… but we aren't there yet!

motorest

> I think we are somewhat still at the “fuzzy super early adopter” stage of this local LLM game and hard data is not going to be easy to come by.

What's hard about it? You get the hardware, you run the software, you take measurements.

diggan

Right, but how are we supposed to get anywhere unless people start being more specific and stop leaning on anecdotes or repeating what they've heard elsewhere?

Saying "Apple seems to be somewhat equal to this other setup" doesn't really help someone get an accurate picture of whether it is equal or not, unless we start including raw numbers, even if they aren't directly comparable.

I don't think it's too early to say "I get X tokens/second with this setup + these settings", because then we can at least start comparing, instead of just guessing, which seems to be the current SOTA.

vladgur

As someone who is paying $0.50 per kWh, I'd also like to see kWh per 1,000 tokens or something similar, to give me a sense of the cost of ownership of these local systems.
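
For a rough sense of scale, a back-of-the-envelope with assumed numbers (a rig drawing ~400 W while generating ~20 tokens/s) works out to a fraction of a cent per 1,000 tokens at my rate, which is exactly why I'd like to see real measured figures:

    # All numbers below are assumptions, not measurements.
    watts = 400.0          # assumed whole-system draw while generating
    tokens_per_s = 20.0    # assumed generation speed
    price_per_kwh = 0.50   # my electricity rate

    kwh_per_1k_tokens = watts * (1000 / tokens_per_s) / 3_600_000
    print(f"{kwh_per_1k_tokens:.4f} kWh per 1000 tokens")               # ~0.0056 kWh
    print(f"${kwh_per_1k_tokens * price_per_kwh:.4f} per 1000 tokens")  # ~$0.0028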

troyvit

That would be an awesome thing across the industry -- even for the big commercial models -- for those who care not only about price but also carbon footprint.

un_ess

Per the screenshot, this is DeepSeek running on a 192GB M2 Studio: https://nitter.poast.org/ggerganov/status/188461277009384272...

The same on Nvidia (various models) https://github.com/ggerganov/llama.cpp/issues/11474

[1] this is the model: https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/De...

diggan

So an Apple M2 Studio does ~15 tokens/second and an A100-SXM4-80GB does 9 tokens/second?

I'm not sure if I'm reading the results wrong or missing some vital context, but that sounds unlikely to me.

motorest

> For the same price ($1799) you could buy a Mac Mini with 48gb of unified memory and an m4 pro.

Around half that price tag was attributed to the blogger reusing an old workstation he had lying around. Beyond this point, OP slapped two graphics cards into an old rig. A better description would be something like "what buying two graphics cards gets you in terms of AI".

Capricorn2481

> Beyond this point, OP slapped two graphics cards into an old rig

Meaning what? This is largely what you do on a budget, since RAM is such a difference maker in token generation. This is what's recommended. OP could buy an A100, but that wouldn't be a budget build.

oofbaroomf

The bottleneck for single-batch inference is memory bandwidth. The M4 Pro has less memory bandwidth than the P40, so it would be slower. Also, the setup presented in the OP has system RAM, allowing you to run larger models than what fits in 48GB of VRAM (and with good speeds too if you offload with something like ktransformers).
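
A rough way to see why (illustrative numbers only): for a dense model, every generated token has to read roughly all of the weights once, so bandwidth divided by model size gives an upper bound on tokens/s:

    # Crude upper bound for single-batch generation on a dense model: each new token
    # reads (roughly) every weight once, so tokens/s <= bandwidth / model size.
    # Illustrative figures only.
    def max_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
        return bandwidth_gb_s / model_size_gb

    model_gb = 16.0  # e.g. a ~27B dense model quantized to around Q4 (assumed)
    for name, bw in [("Tesla P40 (~346 GB/s)", 346), ("M4 Pro (~273 GB/s)", 273)]:
        print(f"{name}: <= {max_tok_per_s(bw, model_gb):.0f} tok/s")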

anthonyskipper

>>M4 Pro has less memory bandwidth than the P40, so it would be slower

Why do you say this? I thought the p40 only had a memory bandwidth of 346 Gbytes/sec. The m4 is 546 GB/s. So the macbook should kick the crap out of the p40.

oofbaroomf

The M4 Max has up to 546 GB/s. The M4 Pro, what GP was talking about, has only 273 GB/s. An M4 Max with that much RAM would most likely exceed OP's budget.

ekianjo

Mac Mini will be very slow for context ingestion compared to nvidia GPU, and the other issue is that they are not usable for Stable Diffusion... So if you just want to use LLMs, maybe, but if you have other interests in AI models, probably not the right answer.

drcongo

I use a Mac Studio for Stable Diffusion, what's special about the Mac Mini that means it won't work?

vunderba

What models are you using? Stable diffusion 1.5, SDXL, or flux?

I've heard that Macs are pretty slow with XL and borderline unusable for Flux, requiring minutes at a time to generate a single image - whereas an RTX 4090 can generate a 1024x1024 image with the higher-quality Flux Dev model (not Schnell) in 14 seconds.

OP is probably correct that if you want to branch out of just strictly LLMs, CUDA is the way to go. I've never heard of anyone getting LTX or Hunyuan running on a Mac, for example.

JKCalhoun

For this use case though, I would prefer something more modular than Apple hardware — where down the road I could upgrade the GPUs, for example.

UncleOxidant

I wish Apple would offer a 128GB option in the Mac Mini - that would require an M4 Max, which they don't offer in the Mini. I know they have an MBP with an M4 Max and 128GB, but I don't need another laptop.

kridsdale1

I’m waiting until this summer with the M4 Ultra Studio.

UncleOxidant

Which will likely be over five grand for 128GB.

joshstrange

I'd really love to build a machine for local LLMs. I've tested models on my MBP M3 Max with 128GB of RAM and it's really cool, but I'd like a dedicated local server. I'd also like an excuse to play with Proxmox, as I've just run raw Linux servers or Unraid w/ containers in the past.

I have OpenWebUI and LibreChat running on my local “app server” and I’m quite enjoying that but every time I price out a beefier box I feel like the ROI just isn’t there, especially for an industry that is moving so fast.

Privacy is not something to ignore at all but the cost of inference online is very hard to beat, especially when I’m still learning how best to use LLMs.

cwalv

> but every time I price out a beefier box I feel like the ROI just isn’t there, especially for an industry that is moving so fast.

Same, especially if you factor in the cost of renting. Even if you run 24/7, it's hard to see it paying off in half the time it will take to become obsolete.

datadrivenangel

You pay a premium to get the theoretical local privacy and reliability of hosting your own models.

But to get commercially competitive models you need 5 figures of hardware, and then need to actually run it securely and reliably. Pay as you go with multiple vendors as fallback is a better option right now if you don't need harder privacy.

joshstrange

Yeah, really I'd love for my Home Assistant to be able to use a local LLM/TTS/STT, which I did get working but was way too slow. Also it would be fun to just throw some problems/ideas at the wall without incurring (more) cost; that's a big part of it. But each time I run the numbers, I would be better off using Anthropic/OpenAI/DeepSeek/other.

I think sooner or later I'll break down and buy a server for local inference even if the ROI is upside down, because it would be a fun project. I also find that these things fall in the "You don't know what you will do with it until you have it and it starts unlocking things in your mind" category. I'm sure there are things I would have it grind on overnight just to test/play with an idea, which is something I'd be less likely to do on a paid API.

nickthegreek

You shouldn't be having slow response issues with LLM/TTS/STT for HA on a mbp m3 max 128gb. I'd either limit the entities exposed or choose a smaller model.

cruffle_duffle

> You don't know what you will do with it until you have it and it starts unlocking things in your mind

Exactly. Once the price and performance get to the level where buying hardware for local training and inferencing makes sense… that is when we will start to see LLMs break out of their current “corporate lawyer safe” stage and really begin to shake things up.

rsanek

With something like OpenRouter, you don't even have to manually integrate with multiple vendors

wkat4242

Is that like LiteLLM? I have that running but never tried OpenRouter. I wonder now if it's better :)

whalesalad

The juice ain't worth the squeeze to do this locally.

But you should still play with Proxmox, just not for this purpose. My recommendation would be to get an i7 HP EliteDesk. I have multiple racks in my basement, hundreds of gigs of RAM, multiple 2U dual-processor enterprise servers, etc.... but at this point all of it is turned off, and a single HP EliteDesk with a 2nd NIC added and 64GB of RAM is doing everything I ever needed and more.

joshstrange

Yeah, right now I'm running a tower PC (Intel Core i9-11900K, 64GB Ram) with Unraid as my local "app server". I want to play with Proxmox (for professional and mostly fun reasons) though. Someday I'd like a rack in my basement as my homelab stuff has overgrown the space it's in and I'm going to need to add a new 12-bay Synology (on top of 2x12-bay) soon since I'm running out of space again. For now I've been sticking with consumer/prosumer equipment but my needs are slowly outstripping that I think.

smith7018

For what it's worth, looking at the benchmarks, I think the machine they built is comparable to what your MBP can already do. They probably have a better inference speed, though.

moffkalast

A Strix Halo minipc might be a good mid tier option once they're out, though AMD still isn't clear on how much they'll overprice them.

Core Ultra Arc iGPU boxes are pretty neat too for being standalone and can be loaded up with DDR5 shared memory, efficient and usable in terms of speed, though that's definitely low end performance, plus SYCL and IPEX are a bit eh.

reacharavindh

The thing is, though... the locally hosted models on such hardware are cute as toys, and sure, they do write funny jokes and, importantly, perform private tasks that I would never consider passing to non-self-hosted models, but they pale in comparison to the models accessible over APIs (Claude 3.5 Sonnet, OpenAI, etc.). If I could run DeepSeek-R1 671B locally without breaking the bank, I would. But for now, opex > capex at the consumer level.

walterbell

200+ comments, https://news.ycombinator.com/item?id=42897205

> This runs the 671B model in Q4 quantization at 3.5-4.25 TPS for $2K on a single socket Epyc server motherboard using 512GB of RAM.
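
Those numbers pass a rough bandwidth-bound sanity check, if you assume R1's MoE structure (~37B active parameters per token) and typical single-socket DDR4 throughput:

    # Rough sanity check with assumed figures (not measurements):
    active_params_b = 37        # DeepSeek-R1 is MoE: ~37B of its 671B params are active per token
    bytes_per_param = 0.55      # roughly Q4 quantization including overhead (assumed)
    gb_read_per_token = active_params_b * bytes_per_param        # ~20 GB of weights per token
    epyc_bandwidth_gb_s = 200   # assumed achievable 8-channel DDR4 throughput

    print(f"bandwidth ceiling: ~{epyc_bandwidth_gb_s / gb_read_per_token:.0f} tokens/s")
    # ~10 tok/s theoretical ceiling, so 3.5-4.25 TPS observed is plausible after overhead.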

elorant

"Runs" is an overstatement though. With 4 tokens/second you can't use it in production.

mechagodzilla

I have a similar setup running at about 1.5 tokens/second, and it's perfectly usable for the sorts of difficult tasks one needs a frontier model like this for - give it a prompt and come back an hour or two later. You interact with it like e-mailing a coworker. If I need an answer back in seconds, it's probably not a very complicated question, and a much smaller model will do.

deadbabe

Isn’t 4 tps good enough for local use by a single user, which is the point of a personal AI computer?

Cascais

I agree with elorant. Indirectly, some YouTubers ended up demonstrating that it's difficult to run the best models with less than $7k, even if Nvidia hardware is very efficient.

In the future, I expect this to not be the case, because models will be far more efficient. At this pace, maybe even 6 months can make a difference.

walterbell

Some LLM use cases are async, e.g. agents, "deep research" clones.

CamperBob2

What I'd like to know is how well those dual-Epyc machines run the 1.58 bit dynamic quant model. It really does seem to be almost as good as the full Q8.

cratermoon

This is not because the models are better. These services have unknown and opaque levels of shadow prompting[1] to tweak the behavior. The subject article even mentions "tweaking their outputs to the liking of whoever pays the most". The more I play with LLMs locally, the more I realize how much the prompting going on under the covers shapes the results from the big tech services.

1 https://www.techpolicy.press/shining-a-light-on-shadow-promp...

CamperBob2

The 1.58-bit DeepSeek R1 dynamic quant model from Unsloth is no joke. It just needs a lot of RAM and some patience.

jaggs

There seems to be a LOT of work going on to optimize the 1.58-bit option in terms of hardware and add-ons. I get the feeling that someone from Unsloth is going to have a genuine breakthrough shortly, and the rig/compute costs are going to plummet. Hope I'm not being naïve or over-confident.

vanillax

Huh? Toys? You can run DeepSeek 70B on a 36GB RAM MacBook Pro. You can run Phi-4, Qwen2.5, or Llama 3.3. They work great for coding tasks.

3s

Yeah but as one of the replies points out the resulting tokens/second would be unusable in production environments

vanillax

What? I literally use it at work to write code.

jmyeet

The author mentions it but I want to expand on it: Apple is a seriously good option here, specifically the M4 Mac Mini.

What makes Apple attractive is (as the author mentions) that RAM is shared between main and video RAM whereas NVidia is quite intentionally segmenting the market and charging huge premiums for high VRAM cards. Here are some options:

1. Base $599 Mac Mini: 16GB of RAM. Stocked in store.

2. $999 Mac Mini: 24GB of RAM. Stocked in store.

3. Add RAM to either of the above up to 32GB. It's not cheap at $200/8GB but you can buy a Mac Mini with 32GB of shared RAM for $999, substantially cheaper than the author's PC build but less storage (although you can upgrade that too).

4. M4 Pro: $1399 w/ 24GB of RAM. Stocked in store. You can customize this all the way to 64GB of RAM for +$600 so $1999 in total. That is amazing value for this kind of workload.

5. The Mac Studio is really the ultimate option. Way more cores and you can go all the way to 192GB of unified memory (for a $6000 machine). The problem here is that the Mac Studio is old, still on the M2 architecture. An M4 Ultra update is expected sometime this year, possibly late this year.

6. You can get into clustering these (eg [1]).

7. There are various MacBook Pro options, the highest of which is a 16" MacBook Pro with 128GB of unified memory for $4999.

But the main takeaway is the M4 Mac Mini is fantastic value.

Some more random thoughts:

- Some Mac Minis have Thunderbolt 5 ("TB5"), which is up to either 80Gbps or 120Gbps bidirectional (I've seen it quoted as both);

- Mac Minis have the option of 10GbE (+$200);

- The Mac Mini has 2 USB3 ports and either 3 TB4 or 3 TB5 ports.

[1]: https://blog.exolabs.net/day-2/

atwrk

Worth pointing out that you "only" get <= 273GB/s of memory bandwidth with those Macs, unless you choose the Max/Ultra models.

If that is enough for your use case, it may make sense to wait 2 months and get a Ryzen AI Max+ 395 APU, which will have the same memory bandwidth but allows for up to 128GB RAM - for probably ~half the Mac's price.

Usual AMD driver disclaimer applies, but then again inference is most often way easier to get running than training.

sofixa

The issue with Macs is that below the Max/Ultra processors, the memory bandwidth is pretty slow. So you need to spend a lot on a higher-end processor and lots of memory, and the current-gen processor, the M4, doesn't even have an Ultra, while the Max is only available in a laptop form factor (so thermal constraints).

An M4 Pro still has only 273GB/s, while even the 2 generations old RTX 3090 has 935GB/s.

https://github.com/ggerganov/llama.cpp/discussions/4167

jmyeet

That's a good point. I checked the M2 Mac Studio and it's 400GB/s for the M2 Max and 800GB/s for the M2 Ultra so the M4 Ultra when we get it later this year should really be a beast.

Oh and the top end Macbook Pro 16 (the only current Mac with an M4 Max) has 410GB/s memory bandwidth.

Obviously the Mac Studio is at a much higher price point.

Still, you need to spend $1500+ to get an NVidia GPU with >12GB of RAM. Multiple of those starts adding up quick. Put multiple in the same box and you're talking more expensive case, PSU, mainboard, etc and cooling too.

Apple has a really interesting opportunity here with their unified memory architecture and power efficiency.

diggan

What is the performance difference between using a dedicated GPU from Nvidia, for example, compared to whatever Apple does?

So let's say we'd run a model on a Mac Mini M4 with 24GB RAM, how many tokens/s are you getting? Then if we run the exact same model but with an RTX 3090 Ti, for example, how many tokens/s are you getting?

Do these comparisons exist somewhere online already? I understand it's possible to run the model on Apple hardware today, with the unified memory, but how fast is that really?

redman25

Not the exact same comparison, but I have an M1 Mac with 16GB RAM and can get about 10 t/s with a 3B model. The same model on my 3060 Ti gets more than 100 t/s.

Needless to say, RAM isn't everything.

diggan

Could you say what exact model+quant you're using for that specific test + settings + runtime? Just so I could try to compare with other numbers I come across.

oofbaroomf

Unified memory is great because it's fast, but you can also get a lot of system memory on a "conventional" machine like OP's, and offload MoE layers like ktransformers does, so you can run huge models at acceptable speeds. While the Mac Mini may have better value for anything that fits in the unified memory, if you want to run DeepSeek R1 or other large models, then it's best to max out system RAM and get a GPU to offload.

sethd

For sure and the Mac Mini M4 Pro with 64GB of RAM feels like the sweet spot right now.

That said, the base storage option is only 512GB, and if this machine is also a daily driver, you’re going to want to bump that up a bit. Still, it’s an amazing machine for under $3K.

wolfhumble

It would be better/cheaper to buy an external Thunderbolt 5 enclosure for the NVME drive you need.

sethd

I looked into this a couple of months ago and external TB5 was still more expensive at 1-2 TB; not sure about larger sizes, though.

iamleppert

The hassle of not being able to work with native CUDA isn't worth it for a huge amount of AI. Good luck getting that latest paper or code working quickly just to try it out if the author didn't explicitly target the M4 (unlikely for all but the most mainstream stuff).

darkwater

In a homelab scenario, where you just want your own AI assistant that isn't run by someone else, that is not an issue. If you want to tinker with or learn AI, it's definitely an issue.

ollybee

The middle ground is to rent a GPU VPS as needed. You can get an H100 for $2/h. Not quite the same privacy as fully local and offline, but better than a SaaS API and good enough for me. Hopefully in a year or three it will truly be cost-effective to run something useful locally, and then I can switch.

anonzzzies

That is what I do, but it costs a lot of money, more than just using OpenRouter. I would like to have a machine so I can have a model talk to itself 24/7 for a relatively fixed price. I have enough solar and wind + cheap net electricity, so it would basically be free after buying. It's just hard to pick what to buy without forking out a fortune on GPUs.

1shooner

Do you have a recommended provider or other pointers for GPU rental?

birktj

I was wondering if anyone here has experimented with running a cluster of SBCs for LLM inference? E.g. the Radxa ROCK 5C has 32GB of memory and an NPU and only costs about 300 euros. I'm not super up to date on the architecture of modern LLMs, but as far as I understand you should be able to split the layers between multiple nodes? It is not that much data that needs to be sent between them, right? I guess you won't get quite the same performance as a modern Mac or Nvidia GPU, but it could be quite acceptable and possibly a cheap way of getting a lot of memory.

On the other hand, I am wondering what the state of the art is in CPU + GPU inference. Prompt processing is both compute and memory constrained, but I think token generation afterwards is mostly memory bound. Are there any tools that support loading a few layers at a time into a GPU for initial prompt processing and then switching to CPU inference for token generation? Last time I experimented it was possible to run some layers on the GPU and some on the CPU, but to me it seems more efficient to run everything on the GPU initially (a few layers at a time so they fit in VRAM) and then switch to the CPU for the memory-bound token generation.

Eisenstein

> I was wondering if anyone here has experimented with running a cluster of SBC for LLM inference? Ex. the Radxa ROCK 5C has 32GB of memory and also a NPU and only costs about 300 euros.

Look into RPC. Llama.cpp supports it.

* https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacp...

> Last time I experimented it was possible to run some layers on the GPU and some on the CPU, but to me it seems more efficient to run everything on the GPU initially (but a few layers at a time so they fit in VRAM) and then switch to the CPU when doing the memory bound token generation.

Moving layers over the PCIe bus to do this is going to be slow, which seems to be the issue with that strategy. I think the key is to use MoE and be smart about which layers go where. This project seems to be doing that with great results:

* https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...

dgrabla

Great breakdown! "Own your own AI" at home is a terrific hobby if you like to tinker, but you are going to spend a ton of time and money on hardware that will be underutilized most of the time. If you want to go nuts, check out Mitko Vasilev's dream machine. It makes no sense if you don't have a very clear use case that only requires small models or can tolerate really slow token generation speeds.

If the goal however is not to tinker but to really build and learn AI, it is going to be financially better to rent those GPUs/TPUs as needs arise.

theshrike79

Any M-series Mac is "good enough" for home LLMs. Just grab LM Studio and a model that fits in memory.

Yes, it will not rival OpenAI, but it's 100% local with no monthly fees and depending on the model no censoring or limits on what you can do with it.

jrm4

For what purpose? I'm asking this as someone who threw in one of the cheap $500 Nvidia cards with 16GB of VRAM, and I'm already overwhelmed with what I can do with Ollama, Krita+ComfyUI, etc.

lioeters

> spend a ton of time and money

Not necessarily. For non-professional purposes, I've spent zero dollars (no additional memory or GPU) and I'm running a local language model that's good enough to help with many kinds of tasks including writing, coding, and translation.

It's a personal, private, budget AI that requires no network connection or third-party servers.

ImPostingOnHN

on what hardware (and how much did you spend on it)?

memhole

This is correct. The cost makes no sense outside of hobby and interest; you're far better off renting. I think there is some merit to having a local inference server if you're doing development: the main benefits are that you can manage models and have a little more control over your infra.

JKCalhoun

Terrific hobby? Sign me up!

miniwark

2x Nvidia Tesla P40 cards for €660 is not something I consider to be "on a budget".

People can play with "small" or "medium" models on less powerful and cheaper cards. An Nvidia GeForce RTX 3060 card with "only" 12GB of VRAM can be found for around €200-250 on the second-hand market (and they are around €300-350 new).

In my opinion, 48GB of VRAM is overkill for something called "on a budget"; for me this setup is nice, but it's for semi-professional or professional usage.

There is of course a trade-off in using medium or small models, but being "on a budget" is also about making trade-offs.

whywhywhywhy

> A Nvidia Geforce RTX 3060 card with "only" 12Gb VRAM can be found around €200-250 on second hand market

A 1080 Ti might even be a better option; it has 11GB of VRAM and some reports say it even outperforms the 3060 (in non-RTX workloads, I presume).

Eisenstein

CUDA compute version is a big deal. 1080ti is 6.1. 3060 is 8.6. It also has tensor cores.

Note that CUDA version numbers are confusing: the compute capability number is a different thing from the runtime/driver version.

Melatonic

Not sure what used prices are like these days but the Titan XP (similar to the 1080 ti) is even better

mock-possum

Yeesh, yeah, that was my first thought too - whose budget??

less than $500 total feels more fitting as a ‘budget’ build - €1700 is more along the lines of ‘enthusiast’ or less charitably “I am rich enough to afford expensive hobbies”

If it’s your business and you expect to recoup the cost and write off the cost on your taxes, that’s one thing - but if you’re just looking to run a personal local LLM for funnies, that’s not an accessible price tag.

I suppose “or you could just buy a Mac” should have tipped me off though.

cwoolfe

As others have said, a high powered Mac could be used for the same purpose at a comparable price and lower power usage. Which makes me wonder: why doesn't Apple get into the enterprise AI chip game and compete with Nvidia? They could design their own ASIC for it with all their hardware & manufacturing knowledge. Maybe they already are.

gmueckl

The primary market for such a product would be businesses. And Apple isn't particularly good at selling to companies. The consumer product focus may just be too ingrained to be successful with such a move.

A beefed-up HomePod with a local LLM-based assistant would be a more typical Apple product. But they'd probably need LLMs to become much, much more reliable so as not to ruin their reputation over this.

fragmede

Why? Siri's still total crap but that doesn't seem to have slowed down iPhone sales.

gmueckl

Siri mostly hit the expectations they themselves were able to set through their ads when launching that product - having a voice based assistant at all was huge back then. With an LLM-based assistant, the market has set the expectations for them and they are just unreasonably high and don't mirror reality. That's a potentially big trap for Apple now.

lolinder

> And Apple isn't particularly good at selling to companies.

With a big glaring exception: developer laptops are overwhelmingly Apple's game right now. It seems like they should be able to piggyback off of that, given that the decision makers are going to be in the same branch of the customer company.

jrm4

For roughly the same reason Steve Jobs et al. killed HyperCard: too much power to the users.

gregwebs

The problem for me with making such an investment is that next month a better model will be released. It will either require more or less RAM than the current best model, making it either not runnable or expensive to run on an overbuilt machine.

Using cloud infrastructure should help with this issue. It may cost much more per run but money can be saved if usage is intermittent.

How are HN users handling this?

michaelt

Among people who are running large models at home, I think the solution is basically to be rich.

Plenty of people in tech earn enough to support a family and drive a fancy car, but choose not to. A used RTX 3090 isn't cheap, but you can afford a lot of $1000 GPUs if you don't buy that $40k car.

Other options include only running the smaller LLMs; buying dated cards and praying you can get the drivers to work; or just using hosted LLMs like normal people.

tempoponet

Most of these new models release several variants, typically in the 8b, 30b, and 70b range for personal use. YMMV with each, but you usually use the models that fit your hardware, and the models keep getting better even in the same parameter range.

To your point about cloud models, these are really quite cheap these days, especially for inference. If you're just doing conversation or tool use, you're unlikely to spend more than the cost of a local server, and the price per token is a race to the bottom.

If you're doing training or processing a ton of documents for RAG setups, you can run these in batches locally overnight and let them take as long as they need, only paying for power. Then you can use cloud services on the resulting model or RAG for quick and cheap inference.

idrathernot

There is also an overlooked “tail risk” with cloud services that can end up costing you more than a few entire on-premise rigs if you don't correctly configure services or forget to shut down a high-end VM instance. Yeah, you can implement additional scripts and services as a fail-safe, but this adds another layer of complexity that isn't always trivial (especially for a hobbyist).

I'm not saying that dumping $10k into rapidly depreciating local hardware is the more economical choice, just that people often discount the likelihood and cost of making mistakes in the cloud during their evaluations, and the time investment required to ensure you have the correct safeguards in place.

anon373839

Yes. And somehow, those cloud providers just can’t seem to work out how to build a spend limit feature for customers who’d like to prevent that. It must be a really difficult engineering problem…

nickthegreek

I plan to wait for the NVIDIA Digits release and see what the token/sec is there. Ideally it will work well for at least 2-3 years then I can resell and upgrade if needed.

3s

Exactly! While I have Llama running locally on an RTX card and it's fun to tinker with, I can't use it for my workflows and don't want to invest $20k+ to run a decent model locally.

> How are HN users handling this?

I'm working on a startup for end-to-end confidential AI using secure enclaves in the cloud (think of it like extending a local+private setup to the cloud with verifiable security guarantees). Live demo with DeepSeek 70B: chat.tinfoil.sh

JKCalhoun

I think the solution is already in the article and comments here: go cheap. Even next year the author will still have, at the very least, their P40 setup running late 2024 models.

I'm about to plunge in, as others have, to get my own homelab running the current crop of models. I think there's no time like the present.

walterbell

> expensive to run on an overbuilt machine

There's a healthy secondary market for GPUs.

xienze

The price goes up dramatically once you go past 12GB though, that’s the problem.

JKCalhoun

Not on these server GPUs.

I'm seeing 24GB M40 cards for $200, 24GB K80 cards for $40 on eBay.

diggan

> How are HN users handling this?

Combine the best of both worlds. I have a local assistant (I communicate with it via Telegram) that handles tool-calling and basic calendar/todo management (running on an RTX 3090 Ti), but for more complicated stuff it can call out to more advanced models (currently using OpenAI APIs for this), granted the request itself doesn't involve personal data; if it does, it flat-out refuses, for better or worse.
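
The routing itself is nothing fancy; a stripped-down sketch of the idea (placeholder model names, and the real personal-data check is more involved than a keyword filter) looks roughly like this:

    import requests
    from openai import OpenAI

    openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def looks_personal(prompt: str) -> bool:
        # Placeholder check - the real filter is more involved than a keyword list.
        return any(w in prompt.lower() for w in ("calendar", "todo", "address", "phone"))

    def ask(prompt: str, complicated: bool = False) -> str:
        if complicated:
            if looks_personal(prompt):
                return "Refusing: personal data doesn't leave the local box."
            # Non-personal, hard question: hand it to a hosted model.
            r = openai_client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            )
            return r.choices[0].message.content
        # Everything else stays local (Ollama on its default port, placeholder model tag).
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
        )
        return r.json()["response"]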

refibrillator

Pay attention to IO bandwidth if you’re building a machine with multiple GPUs like this!

In this setup the model is sharded between cards so data must be shuffled through a PCIe 3.0 x16 link which is limited to ~16 GB/s max. For reference that’s an order of magnitude lower than the ~350 GB/s memory bandwidth of the Tesla P40 cards being used.

Author didn’t mention NVLink so I’m presuming it wasn’t used, but I believe these cards would support it.

Building on a budget is really hard. In my experience 5-15 tok/s is a bit too slow for use cases like coding, but I admit once you’ve had a taste of 150 tok/s it’s hard to go back (I’ve been spoiled by RTX 4090 with vLLM).

Miraste

Unless you run the GPUs in parallel, which you have to go out of your way to do, the IO bandwidth doesn't matter. The cards hold separate layers of the model; they're not working together. They're only passing a few kilobytes per token between them.
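
Rough numbers, assuming fp16 activations and a hidden dimension in the ~4-5K range for a ~27B model:

    # Rough estimate of inter-GPU traffic for this kind of layer split (assumed figures):
    hidden_size = 4608        # roughly Gemma-2-27B's hidden dimension (assumed)
    bytes_per_value = 2       # fp16 activations
    tokens_per_s = 18         # generation speed reported elsewhere in this thread

    kb_per_token = hidden_size * bytes_per_value / 1024
    mb_per_s = kb_per_token * tokens_per_s / 1024
    print(f"~{kb_per_token:.0f} KB per token, ~{mb_per_s:.2f} MB/s at {tokens_per_s} tok/s")
    # Compare with ~16,000 MB/s available on PCIe 3.0 x16 - nowhere near a bottleneck.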

Xenograph

Which models do you enjoy most on your 4090? and why vLLM instead of ollama?

ekianjo

> Author didn’t mention NVLink so I’m presuming it wasn’t used, but I believe these cards would support it.

How would you setup NVLink, if the cards support it?

zinccat

I feel that you are mixing up the two bandwidth numbers.