I want everything local – Building my offline AI workspace
117 comments
·August 8, 2025
Aurornis
I think the local LLM scene is very fun and I enjoy following what people do.
However, every time I run local models on my MacBook Pro with a ton of RAM, I’m reminded of the gap between locally hosted models and the frontier models that I can get for $20/month or a nominal price per token from different providers. The difference in speed and quality is massive.
The current local models are very impressive, but they’re still a big step behind the SaaS frontier models. I feel like the benchmark charts don’t capture this gap well, presumably because the models are trained to perform well on those benchmarks.
I already find the frontier models from OpenAI and Anthropic to be slow and frequently error prone, so dropping speed and quality even further isn’t attractive.
I agree that it’s fun as a hobby or for people who can’t or won’t take any privacy risks. For me, I’d rather wait and see what an M5 or M6 MacBook Pro with 128GB of RAM can do before I start trying to put together another dedicated purchase for LLMs.
jauntywundrkind
I agree and disagree. Many of the best models are open source, just too big to run for most people.
And there are plenty of ways to fit these models! A Mac Studio M3 Ultra with 512GB of unified memory has huge capacity and a decent chunk of bandwidth (800GB/s; compare vs a 5090's ~1800GB/s). $10k is a lot of money, but the ability to fit these very large models and get quality results is very impressive. Performance is even lower, but a single AMD Turin chip with its 12 channels of DDR5-6000 can get you to almost 600GB/s: a 12x 64GB (768GB) build is gonna be $4000+ in RAM costs, plus $4800 for, for example, a 48-core Turin to go with it. (But if you go to older generations, affordability goes way up! Special part, but the 48-core 7R13 is <$1000.)
Still, those costs come to $5000 at the low end, and deliver far fewer tokens/s. The "grid compute" / "utility compute" / "cloud compute" model of getting work done on a hot GPU that already has the model loaded, run by someone else, is very direct and clear, while local builds are big investments. It's just not likely any of us will have anything but burst demand for GPUs, so structurally the cloud model makes sense. But it really feels like only small things are getting in the way of running big models at home!
Strix Halo is kind of close. 96GB usable memory isn't quite enough to really do the thing though (and only 256GB/s). Even if/when they put the new 64GB DDR5 onto the platform (for 256GB, let's say 224 usable), one still has to sacrifice some quality to fit 400B+ models. Next-gen Medusa Halo is not coming for a while, but goes from 4->6 channels, so 384GB total: not bad.
(It sucks that PCIe is so slow. PCIe 5.0 is only 64GB/s in one direction. Compared to the need here, it's nowhere near enough to pair a big-memory host with a smaller-memory GPU.)
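A rough way to see what those bandwidth figures buy you: single-stream decode is roughly memory-bandwidth-bound, so tokens/s is at most bandwidth divided by the bytes of (active) weights read per token. A small sketch with illustrative numbers (it ignores compute, KV-cache traffic, and batching, so real throughput lands lower):

```python
# Bandwidth-bound upper limit on decode speed: each generated token streams the
# (active) weights through memory once, so tokens/s <= bandwidth / weight bytes.
def max_tokens_per_sec(bandwidth_gb_s: float, active_params_billion: float, bits_per_param: float) -> float:
    weight_gb = active_params_billion * bits_per_param / 8
    return bandwidth_gb_s / weight_gb

print(max_tokens_per_sec(800, 400, 4))  # M3 Ultra-class bandwidth, 400B dense @ 4-bit -> ~4 tok/s
print(max_tokens_per_sec(600, 400, 4))  # 12-channel DDR5-6000 Turin                   -> ~3 tok/s
print(max_tokens_per_sec(800,  37, 4))  # MoE with ~37B active params                  -> ~43 tok/s
```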
Aurornis
> Many of the best models are open source, just too big to run for most people.
You can find all of the open models hosted across different providers. You can pay per token to try them out.
I just don't see the open models as being at the same quality level as the best from Anthropic and OpenAI. They're good but in my experience they're not as good as the benchmarks would suggest.
> $10k is a lot of money, but that ability to fit these very large models & get quality results is very impressive.
This is why I only appreciate the local LLM scene from a distance.
It’s really cool that this can be done, but $10K to run lower quality models at slower speeds is a hard sell. I can rent a lot of hours on an on-demand cloud server for a lot less than that price or I can pay $20-$200/month and get great performance and good quality from Anthropic.
I think the local LLM scene is fun where it intersects with hardware I would buy anyway (MacBook Pro with a lot of RAM) but spending $10K to run open models locally is a very expensive hobby.
esseph
https://pcisig.com/pci-sig-announces-pcie-80-specification-t...
From 2003-2016, 13 years, we had PCIe 1, 2, and 3.
2017 - PCIe 4.0
2019 - PCIe 5.0
2022 - PCIe 6.0
2025 - PCIe 7.0
2028 - PCIe 8.0
Manufacturing and vendors are having a hard time keeping up. And PCIe 5.0 memory is... not always the most stable.
jstummbillig
> Many of the best models are open source, just too big to run for most people
I don't think that's a likely future when you consider all the big players doing enormous infrastructure projects and the money this increasingly demands. Powerful LLMs are simply not a great open source candidate. The models are not a by-product of the bigger thing you do; they are the bigger thing. Open sourcing an LLM means you are essentially investing money just to give it away. That simply does not make a lot of sense from a business perspective. You can do it in a limited fashion for a limited time, for example while you are scaling, or when it's not really your core business and you just write it off as an expense while you try to figure yet another thing out (looking at you, Meta).
But with the current paradigm, one thing seems to be very clear: building and running ever bigger LLMs is a money-burning machine the likes of which we have rarely if ever seen, and operating that machine at a loss will make you run out of any amount of money really, really fast.
Uehreka
I was talking about this in another comment, and I think the big issue at the moment is that a lot of the local models seem to really struggle with tool calling. Like, just straight up can’t do it even though they’re advertised as being able to. Most of the models I’ve tried with Goose (models which say they can do tool calls) will respond to my questions about a codebase with “I don’t have any ability to read files, sorry!”
So that’s a real brick wall for a lot of people. It doesn’t matter how smart a local model is if it can’t put that smartness to work because it can’t touch anything. The difference between manually copy/pasting code from LM Studio and having an assistant that can read and respond to errors in log files is light years. So until this situation changes, this asterisk needs to be mentioned every time someone says “You can run coding models on a MacBook!”
com2kid
> Like, just straight up can’t do it even though they’re advertised as being able to. Most of the models I’ve tried with Goose (models which say they can do tool calls) will respond to my questions about a codebase with “I don’t have any ability to read files, sorry!”
I'm working on solving this problem in two steps. The first is a library, prefilled-json, that lets small models properly fill out JSON objects. The second is an unpublished library called Ultra Small Tool Call that presents tools in a way small models can understand, and basically walks the model through filling out the tool call with the help of prefilled-json. It'll combine a number of techniques, including tool-call RAG (pulling in tool definitions via RAG) and, honestly, just not throwing entire JSON schemas at the model, instead using context engineering to keep the model focused.
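The general trick is easy to sketch: emit the JSON scaffolding yourself and only ask the model to produce one value at a time, stopping at each delimiter. A minimal illustration assuming an OpenAI-compatible completions endpoint for a local model; the URL, model name, and helper functions are hypothetical, and this is not the actual prefilled-json API:

```python
import json
import requests

def generate(prompt: str, stop: list[str]) -> str:
    # Ask the local model for a short completion that stops at the next JSON delimiter.
    resp = requests.post("http://localhost:8000/v1/completions", json={
        "model": "local-small-model",
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": 0.0,
        "stop": stop,
    })
    return resp.json()["choices"][0]["text"].strip()

def fill_tool_call(task: str, tool_name: str, fields: dict[str, str]) -> dict:
    """Walk the model through one field at a time instead of asking for a whole JSON blob."""
    partial = f'Task: {task}\nCall tool "{tool_name}" with arguments:\n{{'
    args = {}
    for i, (field, ftype) in enumerate(fields.items()):
        partial += ("" if i == 0 else ",") + f'\n  "{field}": '
        quote = '"' if ftype == "string" else ""
        value = generate(partial + quote, stop=['"', ",", "\n", "}"])
        args[field] = value
        partial += json.dumps(value) if ftype == "string" else value
    return {"tool": tool_name, "arguments": args}

print(fill_tool_call("show me the build errors", "read_file", {"path": "string"}))
```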
IMHO the better solution for local, on-device workflows would be a custom small-parameter model trained just to determine whether a tool call is needed and, if so, which tool.
jauntywundrkind
Agreed that this is a huge limitation. There are actually a lot of examples of "tool calling", but it's all bespoke code-it-yourself: very few of these systems have MCP integration.
I have a ton of respect for SGLang as a runtime. I'm hoping something can be done there. https://github.com/sgl-project/sglang/discussions/4461 . As noted in that thread, it is really great that Qwen3-Coder has a tool parser built in: hopefully it can be some kind of useful reference/start. https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/b...
mxmlnkn
This resonates. I have finally started looking into local inference a bit more recently.
I have tried Cursor a bit, and whatever it used worked somewhat alright to generate a starting point for a feature and for a large refactor, and to break through writer's block. It was fun to see it behave similarly to my workflow by creating step-by-step plans before doing work, then searching for functions to find locations and change stuff. I feel like one could learn structured thinking approaches from looking at these agentic AI logs. There were lots of issues with both of these tasks, though, e.g., many missed locations for the refactor and spuriously deleted or indented code, but it was a starting point and somewhat workable with git. The refactoring usage caused me to reach the free token limits in only two days. Based on the usage, it used millions of tokens in minutes, only rarely less than 100K tokens per request, and therefore probably needs a similarly large context length for best performance.
I wanted to replicate this with VSCodium and Cline or Continue, because I want to use it without exfiltrating all my data to megacorps as payment, use it to work on non-open-source projects, and maybe even use it offline. Having Cursor start indexing everything in the project folder as soon as it starts, including possibly private data, left a bad taste, as useful as it is. But I quickly ran into context length problems with Cline, and Continue does not seem to work very well. Some models did not work at all; DeepSeek was thinking in loops for hours (default temperature too high, should supposedly be <0.5). And even after getting tool use to work somewhat with Qwen QwQ 32B Q4, it feels like it does not have a full view of the codebase, even though it has been indexed. For one refactor request mentioning names from the project, it started by doing useless web searches. It might also be a context length issue. But larger contexts really eat up memory.
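For what it's worth, pinning the temperature is straightforward once you talk to the local model through an OpenAI-compatible endpoint such as the one Ollama exposes; a minimal sketch (the model tag and prompt are only illustrative):

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API on localhost:11434; the API key is ignored.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="deepseek-r1:32b",   # whatever tag you actually pulled
    temperature=0.4,           # reasoning models reportedly loop less below ~0.5
    messages=[{"role": "user", "content": "Summarize what this repo's Makefile does."}],
)
print(resp.choices[0].message.content)
```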
I am also contemplating a new system for local AI, but it is really hard to decide. You have the choice between fast GPU inference, e.g., an RTX 5090 if you have the money, or 1-2 used RTX 3090s, or slow but qualitatively better CPU / unified-memory integrated GPU inference with systems such as the DGX Spark, the Framework Desktop AMD Ryzen AI Max, or the Mac Pro systems. Neither is ideal (nor cheap). Although my problems with context length and low-performing agentic models seem to indicate that going for the slower but more helpful models on a large unified memory is better for my use case. My use case would mostly be agentic coding. Code completion does not seem to fit me because I find it distracting, and I don't write much boilerplate.
It also feels like the GPU is wasted, and local inference might be a red herring altogether. Given that a batch size of 1 is one of the worst cases for GPU computation, and that the hardware would only be used in bursts, any cloud solution will easily be an order of magnitude or two more efficient, if I understand this correctly. Maybe local inference will therefore never fully take off, barring even more specialized hardware or hard requirements on privacy, e.g., for companies. To solve that, it would take something like computing on encrypted data, which seems impossible.
Then again, if a batch size of 1 is as bad as I think it is, maybe one could simply generate a batch of results in parallel and choose the best of the answers? Maybe this is not a thing because it would increase memory usage even more.
1oooqooq
More interesting is the extent to which Apple convinced people a laptop can replace a desktop or server. Mind-blowing reality distortion field (as will be proven by some twenty comments telling me I'm wrong in 3... 2... 1).
jazzypants
I think this would be more interesting if you were to try to prove yourself correct first.
There are extremely few things that I cannot do on my laptop, and I have very little interest in those things. Why should I get a computer that doesn't have a screen? You do realize that, at this point of technological progress, the computer being attached to a keyboard and a screen is the only true distinguishing factor of a laptop, right?
bionsystem
I'm a desktop guy, considering the switch to a laptop-only setup, what would I miss ?
SteveJS
AFAICT, the RTX 4090 I bought in 2023 has actually appreciated rather than depreciated.
isaacremuant
Everything you're saying is FUD. There's immense value in being able to do local or remote as you please and part of it is knowledge.
Also, at the end of the day it's about the value created, and AI may allow some people to generate more stuff, but overall value still tends to align with who was better at the craft pre-AI, not who pays more.
kelnos
> I expect this will change in the future
I'm really hoping for that too. As I've started to adopt Claude Code more and more into my workflow, I don't want to depend on a company for day-to-day coding tasks. I don't want to have to worry about rate limits or API spend, or having to put up $100-$200/mo for this. I don't want everything I do to be potentially monitored or mined by the AI company I use.
To me, this is very similar to why all of the smart-home stuff I've purchased all must have local control, and why I run my own smart-home software, and self-host the bits that let me access it from outside my home. I don't want any of this or that tied to some company that could disappear tomorrow, jack up their pricing, or sell my data to third parties. Or even use my data for their own purposes.
But yeah, I can't see myself trying to set any LLMs up for my own use right now, either on hardware I own, or in a VPS I manage myself. The cost is very high (I'm only paying Anthropic $20/mo right now, and I'm very happy with what I get for that price), and it's just too fiddly and requires too much knowledge to set up and maintain, knowledge that I'm not all that interested in acquiring right now. Some people enjoy doing that, but that's not me. And the current open models and tooling around them just don't seem to be in the same class as what you can get from Anthropic et al.
But yes, I hope and expect this will change!
motorest
> As the hardware continues to iterate at a rapid pace, anything you pick up second-hand will still deprecate at that pace, making any real investment in hardware unjustifiable.
Can you explain your rationale? It seems that the worst case scenario is that your setup might not be the most performant ever, but it will still work and run models just as it always did.
This sounds like a classical and very basic opex vs capex tradeoff analysis, and these are renowned for showing that, in financial terms, cloud providers are a preferable option only in a very specific corner case: a short-term investment to jump-start infrastructure when you do not know your scaling needs. This is not the case for LLMs.
OP seems to have invested around $600. This is around 3 months worth of an equivalent EC2 instance. Knowing this, can you support your rationale with numbers?
tcdent
When considering used hardware you have to take quantization into account; gpt-oss-120b, for example, ships in the very new MXFP4 format, which will take far more than 80GB once mapped onto the floating-point types available on older hardware or Apple silicon.
Open models are trained on modern hardware and will continue to take advantage of cutting edge numeric types, and older hardware will continue to suffer worse performance and larger memory requirements.
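Back-of-the-envelope numbers for why the numeric format matters so much; a rough sketch (parameter count approximate, KV cache and runtime overhead ignored):

```python
PARAMS = 117e9  # gpt-oss-120b is roughly 117B total parameters

formats = {
    "MXFP4 (~4.25 bits/param)": 4.25 / 8,
    "int8 / fp8 (1 byte/param)": 1.0,
    "fp16 / bf16 (2 bytes/param)": 2.0,
}
for name, bytes_per_param in formats.items():
    print(f"{name:30s} ~{PARAMS * bytes_per_param / 1e9:4.0f} GB")
# MXFP4 fits in an 80GB-class footprint; upcast to int8 or bf16 and it no longer does.
```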
motorest
You're using a lot of words to say "I believe yesterday's hardware might not run models as fast as today's hardware."
That's fine. The point is that yesterday's hardware is quite capable of running yesterday's models, and obviously it will also run tomorrow's models.
So the question is cost. Capex vs opex. The fact is that buying your own hardware is proven to be far more cost-effective than paying cloud providers to rent some cycles.
I brought data to the discussion: for the price tag of OP's home lab, you can only afford around 3 months' worth of an equivalent EC2 instance. What's your counterargument?
jeremyjh
I expect it will never change. In two years if there is a local option as good as GPT-5 there will be a much better cloud option and you'll have the same tradeoffs to make.
c-hendricks
Why would AI be one of the few areas where locally-hosted options can't reach "good enough"?
ac29
Maybe a better question is when will SOTA models be "good enough"?
At the moment there appears to be ~no demand for older models, even models that people praised just a few months ago. I suspect until AGI/ASI is reached or progress plateaus, that will continue be the case.
hombre_fatal
For some use-cases, like making big complex changes to big complex important code or doing important research, you're pretty much always going to prefer the best model rather than leave intelligence on the table.
For other use-cases, like translations or basic queries, there's a "good enough".
Aurornis
There will always be something better on big data center hardware.
However, small models are continuing to improve at the same time that large RAM capacity computing hardware is becoming cheaper. These two will eventually intersect at a point where local performance is good enough and fast enough.
kingo55
If you've tried gpt-oss:120b and Moonshot AI's Kimi Dev, it feels like this is getting closer to reality. Mac Studios, while expensive, now offer 512GB of usable RAM as well. The tooling available for running local models is also becoming more accessible than even just a year ago.
victorbjorklund
In the next two years, probably. But at some point we will either hit scales where you really don't need anything better (let's say cloud is 10,000 tokens/s and local is 5,000 tokens/s; makes no difference for most individual users), or we will hit some wall where AI doesn't get smarter but the cost of hardware continues to fall.
kasey_junk
I’d be surprised by that outcome. At one point databases were cutting-edge tech, with each engine leapfrogging the others in capability. Still, the proprietary DBs often have features that aren't matched elsewhere.
But the open DBs got good enough that you need specific reasons to justify not using them.
That seems at least as likely an outcome for models as they continue to improve infinitely into the stars.
zwnow
You know there's a ceiling to all this with the current LLM approaches, right? They won't become that much better; it's even more likely they will degrade. There are cases of bad actors attacking LLMs by feeding them false information and propaganda. I don't see this changing in the future.
duxup
Maybe, but my phone has become a "good enough" computer for most tasks compared to a desktop or my laptop.
Seems plausible the same goes for AI.
kvakerok
What is even the point of having a self-hosted GPT-5 equivalent that doesn't have petabytes of knowledge behind it?
pfannkuchen
It might change once the companies switch away from lighting VC money on fire mode and switch to profit maximizing mode.
I remember Uber and AirBnB used to seem like unbelievably good deals, for example. That stopped eventually.
jeremyjh
This I could see.
oblio
AirBNB is so good that it's half the size of Booking.com these days.
And Uber is still big but about 30% of the time in places I go to, in Europe, it's just another website/app to call local taxis from (medallion and all). And I'm fairly sure locals generally just use the website/app of the local company, directly, and Uber is just a frontend for foreigners unfamiliar with that.
ActorNightly
>but when you factor in the performance of the models you have access to, and the cost of running them on-demand in a cloud, it's really just a fun hobby instead of a viable strategy to benefit your life.
It's because people are thinking too linearly about this, equating model size with usability.
Without going into too much detail (because this may be a viable business plan for me), I have had very good success with a Gemma QAT model that runs quite well on a 3090, wrapped up in a very custom agent format that goes beyond simple prompt->response use. It can do things that even the full-size large language models fail to do.
alliao
It really depends on whether a local model satisfies your own usage, right? If it works well enough locally, just package it up and be content? As long as it's providing value now, at least it's local...
andylizf
This is fantastic work. The focus on a local, sandboxed execution layer is a huge piece of the puzzle for a private AI workspace. The `coderunner` tool looks incredibly useful.
A complementary challenge is the knowledge layer: making the AI aware of your personal data (emails, notes, files) via RAG. As soon as you try this on a large scale, storage becomes a massive bottleneck. A vector database for years of emails can easily exceed 50GB.
(Full disclosure: I'm part of the team at Berkeley that tackled this). We built LEANN, a vector index that cuts storage by ~97% by not storing the embeddings at all. It makes indexing your entire digital life locally actually feasible.
Combining a local execution engine like this with a hyper-efficient knowledge index like LEANN feels like the real path to a true "local Jarvis."
Code: https://github.com/yichuan-w/LEANN Paper: https://arxiv.org/abs/2405.08051
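For readers wondering how an index can avoid storing embeddings at all: the general idea is to keep only the raw chunks plus a small neighbor graph, embed the query at search time, and re-embed just the handful of chunks the traversal actually visits. A toy sketch of that idea (not LEANN's actual algorithm or API; the model name, chunks, and graph are illustrative):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model

chunks = ["email about rent increase", "notes on vector indexes",
          "meeting with Bob on Friday", "flight booking confirmation"]
# Tiny hand-made neighbor graph over chunk ids; a real index prunes this carefully.
graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

def search(query: str, start: int = 0, hops: int = 3) -> str:
    q = model.encode(query, normalize_embeddings=True)
    frontier, seen = {start}, set()
    best, best_sim = start, -1.0
    for _ in range(hops):
        for i in frontier - seen:
            # Recompute the embedding only for chunks we actually visit.
            sim = float(np.dot(model.encode(chunks[i], normalize_embeddings=True), q))
            if sim > best_sim:
                best, best_sim = i, sim
        seen |= frontier
        frontier = {j for i in frontier for j in graph[i]} - seen
    return chunks[best]

print(search("when am I meeting Bob?"))
```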
wfn
Thank you for the pointer to LEANN! I've been experimenting with RAGs and missed this one.
I am particularly excited about using RAG as the knowledge layer for LLM agents/pipelines/execution engines to make it feasible for LLMs to work with large codebases. It seems like the current solution is already worth a try. It really makes it easier that your RAG solution already has Claude Code integration![1]
Has anyone tried the above challenge (RAG + some LLM for working with large codebases)? I'm very curious how it goes (thinking it may require some careful system-prompting to push the agent to make heavy use of the RAG index/graph/KB, but that is fine).
I think I'll give it a try later (using cloud frontier model for LLM though, for now...)
[1]: https://github.com/yichuan-w/LEANN/blob/main/packages/leann-...
doctoboggan
> A vector database for years of emails can easily exceed 50GB.
In 2025 I would consider this a relatively meager requirement.
andylizf
Yeah, that's a fair point at first glance. 50GB might not sound like a huge burden for a modern SSD.
However, the 50GB figure was just a starting point for emails. A true "local Jarvis" would need to index everything: all your code repositories, documents, notes, and chat histories. That raw data can easily be hundreds of gigabytes.
For a 200GB text corpus, a traditional vector index can swell to >500GB. At that point, it's no longer a "meager" requirement. It becomes a heavy "tax" on your primary drive, which is often non-upgradable on modern laptops.
The goal for practical local AI shouldn't just be that it's possible, but that it's also lightweight and sustainable. That's the problem we focused on: making a comprehensive local knowledge base feasible without forcing users to dedicate half their SSD to a single index.
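The arithmetic behind that swell is easy to reproduce; a rough sketch under stated assumptions (the chunk size and embedding dimension are just typical values):

```python
corpus_bytes = 200e9          # 200GB of raw text
chunk_bytes  = 1_000          # ~1KB of text per chunk (assumption)
dim          = 768            # common embedding dimension
vec_bytes    = dim * 4        # float32 -> 3,072 bytes per vector

n_chunks    = corpus_bytes / chunk_bytes
index_bytes = n_chunks * vec_bytes
print(f"{n_chunks:,.0f} chunks -> ~{index_bytes / 1e9:,.0f} GB of embeddings alone")
# 200,000,000 chunks -> ~614 GB, before any graph or metadata overhead
```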
sebmellen
I know next to nothing about embeddings.
Are there projects that implement this same “pruned graph” approach for cloud embeddings?
oblio
It feels weird that the search index is bigger than the underlying data, weren't search indexes supposed to be efficient formats giving fast access to the underlying data?
andylizf
Exactly. That's because instead of just mapping keywords, vector search stores the rich meaning of the text as massive data structures, and LEANN is our solution to that paradoxical inefficiency.
yichuan
I guess for semantic search (rather than keyword search), the index is larger than the text because we need to embed it into a huge semantic space, which makes sense to me.
woadwarrior01
> LLMs: Ollama for local models (also private models for now)
Incidentally, I decided to try the Ollama macOS app yesterday, and the first thing it tries to do upon launch is connect to some Google domain. Not very private.
Aurornis
Automatic update checks https://github.com/ollama/ollama/blob/main/docs/faq.md
eric-burel
But it can be audited, which I'd buy any day. It's probably not too hard to find network calls in a codebase if this check must be automated on each update.
abtinf
Yep, and I've noticed the same thing in VSCode with both the Cline plugin and the Copilot plugin.
I configure them both to use local Ollama and block their outbound connections via Little Snitch, and they just flat out don't work without the ability to phone home or reach PostHog.
Super disappointing that Cline tries to do so much outbound comms, even after turning off telemetry in the settings.
com2kid
> Even with help from the "world's best" LLMs, things didn't go quite as smoothly as we had expected. They hallucinated steps, missed platform-specific quirks, and often left us worse off.
This shows how little native app training data is even available.
People rarely write blog posts about designing native apps, long winded medium tutorials don't exist, heck even the number of open source projects for native desktop apps is a small percentage compared to mobile and web apps.
Historically Microsoft paid some of the best technical writers in the world to write amazing books on how to code for Windows (see: Charles Petzold), but nowadays that entire industry is almost dead.
These types of holes in training data are going to be a larger and larger problem.
Although this is just representative of software engineering in general - few people want to write native desktop apps because it is a career dead end. Back in the 90s knowing how to write Windows desktop apps was great, it was pretty much a promised middle class lifestyle with a pretty large barrier to entry (C/C++ programming was hard, the Windows APIs were not easy to learn, even though MS dumped tons of money into training programs), but things have changed a lot. Outside of the OS vendors themselves (Microsoft, Apple) and a few legacy app teams (Adobe, Autodesk, etc), very few jobs exist for writing desktop apps.
thorncorona
I mean, outside of HPC, why would you, when the browser is the world’s most ubiquitous VM?
Imustaskforhelp
I think I still prefer local, but I feel like that's because most AI inference is kinda slow or only comparable to local. But I recently tried out Cerebras (I have heard about Groq too), and honestly, when you try things at 1000 tok/s or similar, your mental model really shifts and you become quite impatient. Cerebras does say that they don't log your data or anything in general, and you would have to trust me that I am not sponsored by them (wish I was, though). It's just that they are kinda nice.
But I still hope that we can someday have some meaningful improvements in speed locally too. Diffusion models seem to be a really fast architecture.
shaky
This is something that I think about quite a bit and am grateful for this write-up. The amount of friction to get privacy today is astounding.
sneak
This writeup has nothing of the sort and is not helpful toward that goal.
frank_nitti
I'd assume they are referring to being able to run your own workloads on a home-built system, rather than surrendering that ownership to the tech giants alone.
Imustaskforhelp
Also, you get a sort of complete privacy, in that the data never leaves your home, whereas otherwise you at best have to trust the AI cloud providers that they are not training on or storing that data.
It's just more freedom and privacy in that regard.
noelwelsh
It's the hardware more than the software that is the limiting factor at the moment, no? Hardware to run a good LLM locally starts around $2000 (e.g., Strix Halo / AI Max 395). I think a few Strix Halo iterations will make it considerably easier.
colecut
This is rapidly improving
Imustaskforhelp
I hope it keeps improving at such a steady rate! Let's just hope there is still room to pack even more improvements into such LLMs, which can help the home-labbing community in general.
ramesh31
>Hardware to run a good LLM locally starts around $2000 (e.g. Strix Halo / AI Max 395) I think a few Strix Halo iterations will make it considerably easier.
And "good" is still questionable. The thing that makes this stuff useful is when it works instantly like magic. Once you find yourself fiddling around with subpar results at slower speeds, essentially all of the value is gone. Local models have come a long way but there is still nothing even close to Claude levels when it comes to coding. I just tried taking the latest Qwen and GLM models for a spin through OpenRouter with Cline recently and they feel roughly on par with Claude 3.0. Benchmarks are one thing, but reality is a completely different story.
mkummer
Super cool and well thought out!
I'm working on something similar focused on being able to easily jump between the two (cloud and fully local) using a Bring Your Own [API] Key model – all data/config/settings/prompts are fully stored locally and provider API calls are routed directly (never pass through our servers). Currently using mlc-llm for models & inference fully local in the browser (Qwen3-1.7b has been working great)
eric-burel
An LLM on your computer is a fun hobby; an LLM in your SME for 10 people is a business idea. There are not enough resources on this topic at all, and the need is growing extremely fast. Local LLMs are needed for many use cases and businesses where cloud is not possible.
luke14free
You might want to check out what we built -> https://inference.sh It supports most major open-source/open-weight models, from Wan 2.2 video, Qwen Image, and Flux to most LLMs, Hunyuan 3D, etc. It works in a containerized way locally by letting you bring your own GPU as an engine (fully free), or lets you rent a remote GPU/pool from a common cloud in case you want to run more complex models. For each model we tried to add quantized/GGUF versions, so even Wan 2.2 / Qwen Image / Gemma become possible to run on GPUs with as little as 8GB of VRAM. MCP support is coming soon in our chat interface so it can access other apps from the ecosystem.
navbaker
Open Web UI is a great alternative for a chat interface. You can point it at an OpenAI-compatible API like vLLM or use the native Ollama integration, and it has cool features like being able to say something like “generate code for an HTML and JavaScript pong game” and have it display the running code inline with the chat for testing.
tcdent
I'm constantly tempted by the idealism of this experience, but when you factor in the performance of the models you have access to, and the cost of running them on-demand in a cloud, it's really just a fun hobby instead of a viable strategy to benefit your life.
As the hardware continues to iterate at a rapid pace, anything you pick up second-hand will still deprecate at that pace, making any real investment in hardware unjustifiable.
Coupled with the dramatically inferior performance of the weights you would be running in a local environment, it's just not worth it.
I expect this will change in the future, and am excited to invest in a local inference stack when the weights become available. Until then, you're idling a relatively expensive, rapidly depreciating asset.