Building a personal, private AI computer on a budget
124 comments · February 10, 2025
egorfine
> K80 - the drivers were one of the most painful tech things I've ever had to endure
Well, for a dedicated LLM box it might be feasible to suffer with drivers a bit, no? What was your experience like with the software side?
JKCalhoun
Curious what HP workstation you have?
kamranjon
For the same price ($1799) you could buy a Mac Mini with 48gb of unified memory and an m4 pro. It’d probably use less power and be much quieter to run and likely could outperform this setup in terms of tokens per second. I enjoyed the write up still, but I would probably just buy a Mac in this situation.
diggan
> likely could outperform this setup in terms of tokens per second
I've heard arguments both for and against this, but they always lack concrete numbers.
I'd love something like "Here is Qwen2.5 at Q4 quantization running via Ollama + these settings, and M4 24GB RAM gets X tokens/s while RTX 3090ti gets Y tokens/s", otherwise we're just propagating mostly anecdotes without any reality-checks.
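Even a quick script like the sketch below would give numbers people could actually compare. It assumes a local Ollama server on its default port; the model tag, prompt and options are placeholders for whatever you're testing:

    # Rough tokens/s measurement against a local Ollama server (default port 11434).
    # Model tag, prompt and options are placeholders - swap in whatever you're testing.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen2.5:14b-instruct-q4_K_M",  # placeholder tag
            "prompt": "Write a short story about a tortoise named Terry.",
            "stream": False,
            "options": {"num_ctx": 4096, "temperature": 0.7},
        },
        timeout=600,
    )
    data = resp.json()
    # Ollama reports durations in nanoseconds
    prompt_tps = data["prompt_eval_count"] / (data["prompt_eval_duration"] / 1e9)
    eval_tps = data["eval_count"] / (data["eval_duration"] / 1e9)
    print(f"prompt eval: {prompt_tps:.1f} tok/s, generation: {eval_tps:.1f} tok/s")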
un_ess
Per the screenshot, this is DeepSeek running on a 192GB M2 Studio https://nitter.poast.org/ggerganov/status/188461277009384272...
The same on Nvidia (various models) https://github.com/ggerganov/llama.cpp/issues/11474
[1] this is the model: https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/De...
diggan
So the Apple M2 Studio does ~15 tokens/second and an A100-SXM4-80GB does 9 tokens/second?
I'm not sure if I'm reading the results wrong or missing some vital context, but that sounds unlikely to me.
fkyoureadthedoc
On an M1 Max 64GB laptop running gemma2:27b with the same prompt and settings from the blog post:
total duration: 24.919887458s
load duration: 39.315083ms
prompt eval count: 37 token(s)
prompt eval duration: 963.071ms
prompt eval rate: 38.42 tokens/s
eval count: 441 token(s)
eval duration: 23.916616s
eval rate: 18.44 tokens/s
I have a gaming PC with a 4090 I could try, but I don't think this model would fit.
diggan
> gemma2:27b
What quantization are you using? What's the runtime+version you run this with? And the rest of the settings?
Edit: Turns out parent is using Q4 for their test. Doing the same test with LM Studio and a 3090ti + Ryzen 5950X (with 44 layers on GPU, 2 on CPU) I get ~15 tokens/second.
cruffle_duffle
I think we are somewhat still at the “fuzzy super early adopter” stage of this local LLM game, and hard data is not going to be easy to come by. I almost want to use the word “hobbyist stage”, where almost all of the “data” and “best practice” is anecdotal, but I think we are a step above that.
Still, it’s way too early, and there are simply way too many hardware and software combinations that change almost weekly to establish “the best practice hardware configuration for training / inferencing large language models locally”.
Some day there will be established guides with solid data. In fact, someday there will be PCs that specifically target LLMs and will feature all kinds of stats aimed at getting you to bust out your wallet. And I even predict they’ll come up with metrics that all the players will chase well beyond when those metrics make sense (megapixels, clock frequency, etc.)… but we aren’t there yet!
motorest
> I think we are somewhat still at the “fuzzy super early adopter” stage of this local LLM game and hard data is not going to be easy to come by.
What's hard about it? You get the hardware, you run the software, you take measurements.
diggan
Right, but how are we supposed to get anywhere unless people start being more specific and stop leaning on anecdotes or repeating what they've heard elsewhere?
Saying "Apple seems to be somewhat equal to this other setup" doesn't really help someone get an accurate picture of whether it is equal or not, unless we start including raw numbers, even if they aren't directly comparable.
I don't think it's too early to say "I get X tokens/second with this setup + these settings" because then we can at least start comparing, instead of just guessing which seems to be the current SOTA.
oofbaroomf
The bottleneck for single-batch inference is memory bandwidth. The M4 Pro has less memory bandwidth than the P40, so it would be slower. Also, the setup presented in the OP has system RAM, allowing you to run models larger than what fits in 48GB of VRAM (and with good speeds too if you offload with something like ktransformers).
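A rough back-of-envelope of why the comparison goes that way (the bandwidth figures are approximate spec-sheet numbers and the model size is illustrative, not a measurement):

    # Back-of-envelope: single-batch decoding streams (roughly) all the weights from
    # memory once per generated token, so tokens/s is capped by bandwidth / model size.
    def decode_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
        return bandwidth_gb_s / model_gb

    model_gb = 16  # e.g. a ~27B model at 4-bit quantization is on the order of 16 GB
    print(f"Tesla P40 (~346 GB/s): ~{decode_ceiling(346, model_gb):.0f} tok/s ceiling")
    print(f"M4 Pro   (~273 GB/s): ~{decode_ceiling(273, model_gb):.0f} tok/s ceiling")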
motorest
> For the same price ($1799) you could buy a Mac Mini with 48gb of unified memory and an m4 pro.
Around half that price tag was attributed to the blogger reusing an old workstation he had lying around. Beyond this point, OP slapped two graphics cards into an old rig. A better description would be something like "what buying two graphics cards gets you in terms of AI".
JKCalhoun
For this use case though, I would prefer something more modular than Apple hardware — where down the road I could upgrade the GPUs, for example.
ekianjo
A Mac Mini will be very slow for context ingestion compared to an Nvidia GPU, and the other issue is that they are not usable for Stable Diffusion... So if you just want to use LLMs, maybe, but if you have other interests in AI models, it's probably not the right answer.
birktj
I was wondering if anyone here has experimented with running a cluster of SBCs for LLM inference? E.g. the Radxa ROCK 5C has 32GB of memory and also an NPU, and only costs about 300 euros. I'm not super up to date on the architecture of modern LLMs, but as far as I understand you should be able to split the layers between multiple nodes? It is not that much data that needs to be sent between them, right? I guess you won't get quite the same performance as a modern Mac or Nvidia GPU, but it could be quite acceptable and possibly a cheap way of getting a lot of memory.
On the other hand I am wondering about what is the state of the art in CPU + GPU inference. Prompt processing is both compute and memory constrained, but I think token generation afterwards is mostly memory bound. Are there any tools that support loading a few layers at a time into a GPU for initial prompt processing and then switches to CPU inference for token generation? Last time I experimented it was possible to run some layers on the GPU and some on the CPU, but to me it seems more efficient to run everything on the GPU initially (but a few layers at a time so they fit in VRAM) and then switch to the CPU when doing the memory bound token generation.
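(For reference, the static GPU/CPU split I mentioned looks something like the sketch below with llama-cpp-python; it does not do the dynamic "prompt on GPU, then generate on CPU" switching I'm describing, and the model path and layer count are placeholders.)

    # Static partial offload with llama-cpp-python: n_gpu_layers layers live on the GPU,
    # the remaining layers run on the CPU. Path and layer count are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/gemma-2-27b-it-Q4_K_M.gguf",  # any local GGUF file
        n_gpu_layers=30,   # tune to what fits in VRAM; -1 offloads every layer
        n_ctx=4096,
    )

    out = llm("Explain pipeline parallelism in two sentences.", max_tokens=128)
    print(out["choices"][0]["text"])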
joshstrange
I’d really love to build a machine for local LLMs. I’ve tested models on my MBP M3 Max with 128GB of RAM and it’s really cool, but I’d like a dedicated local server. I’d also like an excuse to play with Proxmox, as I’ve just run raw Linux servers or Unraid w/ containers in the past.
I have OpenWebUI and LibreChat running on my local “app server” and I’m quite enjoying that but every time I price out a beefier box I feel like the ROI just isn’t there, especially for an industry that is moving so fast.
Privacy is not something to ignore at all but the cost of inference online is very hard to beat, especially when I’m still learning how best to use LLMs.
cwalv
> but every time I price out a beefier box I feel like the ROI just isn’t there, especially for an industry that is moving so fast.
Same, esp. if you factor in the cost of renting. Even if you run 24/7 it's hard to see it paying off in half the time it will take to be obsolete
datadrivenangel
You pay a premium to get the theoretical local privacy and reliability of hosting your own models.
But to get commercially competitive models you need 5 figures of hardware, and then need to actually run it securely and reliably. Pay as you go with multiple vendors as fallback is a better option right now if you don't need harder privacy.
joshstrange
Yeah, really I'd love for my Home Assistant to be able to use a local LLM/TTS/STT, which I did get working but was way too slow. Also it would be fun to just throw some problems/ideas at the wall without incurring (more) cost; that's a big part of it. But each time I run the numbers I would be better off using Anthropic/OpenAI/DeepSeek/other.
I think sooner or later I'll break down and buy a server for local inference even if the ROI is upside down, because it would be a fun project. I also find that these things fall into the "You don't know what you will do with it until you have it and it starts unlocking things in your mind" category. I'm sure there are things I would have it grind on overnight just to test/play with an idea, which is something I'd be less likely to do on a paid API.
nickthegreek
You shouldn't be having slow response issues with LLM/TTS/STT for HA on a mbp m3 max 128gb. I'd either limit the entities exposed or choose a smaller model.
cruffle_duffle
> You don't know what you will do with it until you have it and it starts unlocking things in your mind
Exactly. Once the price and performance get to the level where buying hardware for local training and inference makes sense… that is when we will start to see the LLM break out of its current “corporate lawyer safe” stage and really begin to shake things up.
rsanek
With something like OpenRouter, you don't even have to manually integrate with multiple vendors
smith7018
For what it's worth, looking at the benchmarks, I think the machine they built is comparable to what your MBP can already do. They probably have a better inference speed, though.
reacharavindh
The thing is, though... the locally hosted models on such hardware are cute as toys, and sure, they write funny jokes and, importantly, perform private tasks that I would never consider passing to non-self-hosted models, but they pale in comparison to the models accessible over APIs (Claude 3.5 Sonnet, OpenAI, etc.). If I could run deepseek-r1-671b locally without breaking the bank, I would. But for now, opex > capex at the consumer level.
walterbell
200+ comments, https://news.ycombinator.com/item?id=42897205
> This runs the 671B model in Q4 quantization at 3.5-4.25 TPS for $2K on a single socket Epyc server motherboard using 512GB of RAM.
elorant
"Runs" is an overstatement though. With 4 tokens/second you can't use it in production.
mechagodzilla
I have a similar setup running at about 1.5 tokens/second, and it's perfectly usable for the sorts of difficult tasks one needs a frontier model like this for - give it a prompt and come back an hour or two later. You interact with it like e-mailing a coworker. If I need an answer back in seconds, it's probably not a very complicated question, and a much smaller model will do.
Cascais
I agree with elorant. Indirectly, some YouTubers ended up demonstrating that it's difficult to run the best models with less than $7k, even if NVIDIA hardware is very efficient.
In the future, I expect this to not be the case, because models will be far more efficient. At this pace, maybe even 6 months can make a difference.
walterbell
Some LLM use cases are async, e.g. agents, "deep research" clones.
deadbabe
Isn’t 4 tps good enough for local use by a single user, which is the point of a personal AI computer?
CamperBob2
What I'd like to know is how well those dual-Epyc machines run the 1.58 bit dynamic quant model. It really does seem to be almost as good as the full Q8.
cratermoon
This is not because the models are better. These services have unknown and opaque levels of shadow prompting[1] to tweak the behavior. The subject article even mentions "tweaking their outputs to the liking of whoever pays the most". The more I play with LLMs locally, the more I realize how much of the prompting going on under the covers is shaping the results from the big tech services.
1 https://www.techpolicy.press/shining-a-light-on-shadow-promp...
vanillax
Huh? Toys? You can run DeepSeek 70B on a 36GB RAM MacBook Pro. You can run Phi-4, Qwen2.5, or Llama 3.3. They work great for coding tasks.
3s
Yeah, but as one of the replies points out, the resulting tokens/second would be unusable in production environments.
CamperBob2
The 1.58-bit DeepSeek R1 dynamic quant model from Unsloth is no joke. It just needs a lot of RAM and some patience.
dgrabla
Great breakdown! The "own your own AI" at home thing is a terrific hobby if you like to tinker, but you are going to spend a ton of time and money on hardware that will be underutilized most of the time. If you want to go nuts, check out Mitko Vasilev's dream machine. It makes no sense if you don't have a very clear use case that only requires small models or really slow token generation speeds.
If the goal however is not to tinker but to really build and learn AI, it is going to be financially better to rent those GPUs/TPUs as needs arise.
theshrike79
Any M-series Mac is "good enough" for home LLMs. Just grab LM studio and a model that fits in memory.
Yes, it will not rival OpenAI, but it's 100% local with no monthly fees and depending on the model no censoring or limits on what you can do with it.
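And if you'd rather script against it than use the chat UI, LM Studio can also serve an OpenAI-compatible endpoint locally. A minimal sketch (port 1234 is LM Studio's default; the model name is a placeholder for whatever you have loaded):

    # Talking to LM Studio's local OpenAI-compatible server (default: localhost:1234).
    # The API key is not checked locally; the model name is whatever you loaded in the app.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    reply = client.chat.completions.create(
        model="qwen2.5-14b-instruct",  # placeholder
        messages=[{"role": "user", "content": "Give me one reason to run LLMs locally."}],
        temperature=0.7,
    )
    print(reply.choices[0].message.content)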
lioeters
> spend a ton of time and money
Not necessarily. For non-professional purposes, I've spent zero dollars (no additional memory or GPU) and I'm running a local language model that's good enough to help with many kinds of tasks including writing, coding, and translation.
It's a personal, private, budget AI that requires no network connection or third-party servers.
ImPostingOnHN
on what hardware (and how much did you spend on it)?
memhole
This is correct. The cost makes no sense outside of hobby and interest; you're far better off renting. I think there is some merit to having a local inference server if you're doing development: the main benefits are that you can manage models and have a little more control over your infra.
JKCalhoun
Terrific hobby? Sign me up!
miniwark
2 x Nvidia Tesla P40 cards for €660 is not a thing I consider to be "on a budget".
People can play with "small" or "medium" models on less powerful and cheaper cards. An Nvidia GeForce RTX 3060 card with "only" 12GB VRAM can be found for around €200-250 on the second-hand market (and around €300-350 new).
In my opinion, 48GB of VRAM is overkill to call it "on a budget"; for me this setup is nice, but it's for semi-professional or professional usage.
There is of course a trade-off to using medium or small models, but being "on a budget" is also about making trade-offs.
refibrillator
Pay attention to IO bandwidth if you’re building a machine with multiple GPUs like this!
In this setup the model is sharded between cards so data must be shuffled through a PCIe 3.0 x16 link which is limited to ~16 GB/s max. For reference that’s an order of magnitude lower than the ~350 GB/s memory bandwidth of the Tesla P40 cards being used.
Author didn’t mention NVLink so I’m presuming it wasn’t used, but I believe these cards would support it.
Building on a budget is really hard. In my experience 5-15 tok/s is a bit too slow for use cases like coding, but I admit once you’ve had a taste of 150 tok/s it’s hard to go back (I’ve been spoiled by RTX 4090 with vLLM).
Miraste
Unless you run the GPUs in parallel, which you have to go out of your way to do, the IO bandwidth doesn't matter. The cards hold separate layers of the model, they're not working together. They're only passing a few kilobytes per second between them.
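For a rough sense of scale (the hidden size below is illustrative, not any particular model's):

    # Per generated token, a layer-split setup only ships one hidden-state vector
    # across the PCIe boundary between the cards. Rough numbers:
    hidden_size = 5120        # illustrative; mid-size models are a few thousand to ~8k
    bytes_per_value = 2       # fp16 activations
    tok_per_s = 15            # ballpark decode speed from this thread

    per_token_kb = hidden_size * bytes_per_value / 1024
    print(f"~{per_token_kb:.0f} KB per token, ~{per_token_kb * tok_per_s:.0f} KB/s")
    # well under a MB/s - a rounding error next to ~16 GB/s of PCIe 3.0 x16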
Xenograph
Which models do you enjoy most on your 4090? and why vLLM instead of ollama?
zinccat
I feel that you are mixing up the two bandwidth numbers.
ekianjo
> Author didn’t mention NVLink so I’m presuming it wasn’t used, but I believe these cards would support it.
How would you setup NVLink, if the cards support it?
gregwebs
The problem for me with making such an investment is that next month a better model will be released. It will either require more or less RAM than the current best model, making it either not runnable or expensive to run on an overbuilt machine.
Using cloud infrastructure should help with this issue. It may cost much more per run but money can be saved if usage is intermittent.
How are HN users handling this?
michaelt
Among people who are running large models at home, I think the solution is basically to be rich.
Plenty of people in tech earn enough to support a family and drive a fancy car, but choose not to. A used RTX 3090 isn't cheap, but you can afford a lot of $1000 GPUs if you don't buy that $40k car.
Other options include only running the smaller LLMs; buying dated cards and praying you can get the drivers to work; or just using hosted LLMs like normal people.
tempoponet
Most of these new models release several variants, typically in the 8b, 30b, and 70b range for personal use. YMMV with each, but you usually use the models that fit your hardware, and the models keep getting better even in the same parameter range.
To your point about cloud models, these are really quite cheap these days, especially for inference. If you're just doing conversation or tool use, you're unlikely to spend more than the cost of a local server, and the price per token is a race to the bottom.
If you're doing training or processing a ton of documents for RAG setups, you can run these in batches locally overnight and let them take as long as they need, only paying for power. Then you can use cloud services on the resulting model or RAG for quick and cheap inference.
idrathernot
There is also an overlooked "tail risk" with cloud services that can end up costing you more than a few entire on-premise rigs if you don't correctly configure services or forget to shut down a high-end VM instance. Yeah, you can implement additional scripts and services as a fail-safe, but this adds another layer of complexity that isn't always trivial (especially for a hobbyist).
I’m not saying that dumping $10k into rapidly depreciating local hardware is the more economical choice, just that people often discount the likelihood and cost of making mistakes in the cloud during their evaluations and the time investment required to ensure you have the correct safeguards in-place.
3s
Exactly! While I have llama running locally on RTX and it’s fun to tinker with, I can’t use it for my workflows and don’t want to invest 20k+ to run a decent model locally
> How are HN users handling this?
I'm working on a startup for end-to-end confidential AI using secure enclaves in the cloud (think of it like extending a local+private setup to the cloud with verifiable security guarantees). Live demo with DeepSeek 70B: chat.tinfoil.sh
JKCalhoun
I think the solution is already in the article and comments here: go cheap. Even next year the author will still have, at the very least, their P40 setup running late 2024 models.
I'm about to plunge in as others have to get my own homelab running the current crop of models. I think there's no time like the present.
nickthegreek
I plan to wait for the NVIDIA Digits release and see what the token/sec is there. Ideally it will work well for at least 2-3 years then I can resell and upgrade if needed.
walterbell
> expensive to run on an overbuilt machine
There's a healthy secondary market for GPUs.
diggan
> How are HN users handling this?
Combine the best of both worlds. I have a local assistant (communicating via Telegram) that handles tool-calling and basic calendar/todo management (running on a RTX 3090ti), but for more complicated stuff it can call out to more advanced models (currently using OpenAI APIs for this), granted the request itself doesn't involve personal data; if it does, it flat out refuses, for better or worse.
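The routing itself doesn't have to be fancy; here's a hypothetical sketch of the general idea (the keyword heuristic and both wrappers are simplified placeholders, not my actual setup):

    # Hypothetical routing sketch: keep requests that look personal on the local model,
    # send everything else to a hosted API. Heuristic and wrappers are placeholders.
    PERSONAL_HINTS = ("calendar", "todo", "appointment", "my ")

    def looks_personal(prompt: str) -> bool:
        # Stand-in heuristic; a real setup might use a small classifier instead.
        return any(hint in prompt.lower() for hint in PERSONAL_HINTS)

    def ask_local_model(prompt: str) -> str:
        # Placeholder for a call to the local runtime (Ollama, llama.cpp, ...).
        return f"[local model would answer: {prompt!r}]"

    def ask_hosted_model(prompt: str) -> str:
        # Placeholder for a hosted API call (OpenAI, Anthropic, ...).
        return f"[hosted model would answer: {prompt!r}]"

    def answer(prompt: str) -> str:
        return ask_local_model(prompt) if looks_personal(prompt) else ask_hosted_model(prompt)

    print(answer("Add dentist to my calendar for Tuesday"))
    print(answer("Explain Rust lifetimes"))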
rcarmo
Given the power and noise involved, a Mac Mini M4 seems like a much nicer approach, although the RAM requirements will drive up the price.
DrPhish
This is just a limited recreation of the ancient mikubox from https://rentry.org/lmg-build-guides
It's funny to see people independently "discover" these builds that are a year-plus old.
Everyone is sleeping on these guides, but I guess the stink of 4chan scares people away?
gwern
> Another important finding: Terry is by far the most popular name for a tortoise, followed by Turbo and Toby. Harry is a favorite for hares. All LLMs are loving alliteration.
Mode-collapse. One reason that the tuned (or tuning-contaminated) models are bad for creative writing: every protagonist and place seems to be named the same thing.
diggan
Couldn't you just up the temperature/change some other parameter to get it to be more random/"creative"? It wouldn't be active/intentional randomness/novelty like what a human would do, but at least it shouldn't generate exactly the same naming.
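Concretely, with something like Ollama that's just a request option; a quick sketch (the model tag is a placeholder and the exact values are a matter of taste):

    # Bumping sampling randomness via Ollama's request options (model tag is a placeholder).
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma2:27b",
            "prompt": "Name the protagonist of a story about a tortoise.",
            "stream": False,
            # Higher temperature / top_p flattens the sampling distribution, so the
            # modal "Terry the tortoise" answer shows up less often.
            "options": {"temperature": 1.2, "top_p": 0.95},
        },
        timeout=300,
    )
    print(resp.json()["response"])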
axegon_
I did something similar but using a K80 and M40 I dug up from eBay for pennies. Be advised though, stay as far away as possible from the K80 - the drivers were one of the most painful tech things I've ever had to endure, even if 24GB of VRAM for 50 bucks sounds incredibly appealing. That said, I had a decent-ish HP workstation laying around with a 1200 watt power supply, so I had somewhere to put those two in.
The one thing to note here is that these types of GPUs do not have cooling of their own. My solution was to 3D print a bunch of brackets, attach several Noctua fans and have them blow at full speed 24/7. Surprisingly, it worked way better than I expected - I've never gone above 60 degrees. As a side effect, the CPUs are also benefiting from this hack: at idle, they are in the mid-20 degrees range. Mind you, the Noctua fans are located on the front and the back of the case: the ones on the front act as an intake and the ones on the back as exhaust, and there are two more inside the case that are stuck in front of the GPUs.
The workstation was refurbished for just over 600 bucks, and another 120 bucks for the GPUs and another ~60 for the fans.
Edit: and before someone asks - no, I have not uploaded the STLs anywhere because I haven't had the time, but also since this is a very niche use case, though I might: the back (exhaust) bracket came out brilliant on the first try - it was a sub-millimeter fit. Then I got cocky and thought that I'd also nail it first try on the intake, and ended up re-printing it 4 times.