Ask HN: Who uses open LLMs and coding assistants locally? Share setup and laptop
139 comments · October 31, 2025
egberts1
Ollama, 16-core Xeon E6320 (old), 1.9 GHz, 120GB DDR4, 240TB RAID5 SSDs, on a Dell Precision T710 ("The Beast"). NO GPU. 20b (noooooot fast at all). Purely CPU-bound. Tweaked for 256KB chunking into RAG.
Ingested the election laws of all 50 states, the territories, and the federal government.
Goal: map out each feature of an election and deal with the (in)consistent terminologies sprouted by different university-trained public administrators. This is the crux of the hallucinations: getting a diagram of ballot handling and its terminology.
Then maybe tackle the multitude of ways election irregularities happen, or at least point out integrity gaps in various locales.
https://figshare.com/articles/presentation/Election_Frauds_v...
banku_brougham
bravo, this is a great use of talent for society
gcr
For new folks, you can get a local code agent running on your Mac like this:
1. $ npm install -g @openai/codex
2. $ brew install ollama; ollama serve
3. $ ollama pull gpt-oss:20b
4. $ codex --oss -m gpt-oss:20b
This runs locally without Internet. Idk if there’s telemetry for codex, but you should be able to turn that off if so.
You need an M1 Mac or better with at least 24GB of GPU memory. The model is pretty big, about 16GB of disk space in ~/.ollama
Be careful: the 120b model is roughly 1.5× better than this 20b variant, but has roughly 5× the hardware requirements.
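If your machine has the headroom, the 120b variant slots into the same workflow. A minimal sketch, assuming the same Ollama + codex setup as above:
  ollama pull gpt-oss:120b       # much larger download; needs far more unified memory
  codex --oss -m gpt-oss:120b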
nickthegreek
Have you been able to build or iterate on anything of value using just the 20b to vibe code?
lreeves
I sometimes still code with a local LLM but can't imagine doing it on a laptop. I have a server that has GPUs and runs llama.cpp behind llama-swap (letting me switch between models quickly). The best local coding setup I've been able to do so far is using Aider with gpt-oss-120b.
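Roughly, the Aider side of that wiring looks like this; the port and model name are assumptions for whatever llama-swap/llama-server exposes on your box:
  # point Aider at a local OpenAI-compatible endpoint (llama-server behind llama-swap)
  export OPENAI_API_BASE=http://localhost:8080/v1
  export OPENAI_API_KEY=local              # dummy value; the local server ignores it
  aider --model openai/gpt-oss-120b        # "openai/" prefix = generic OpenAI-compatible API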
I guess you could get a Ryzen AI Max+ with 128GB RAM to try to do that locally, but non-NVIDIA hardware is incredibly slow for coding use since the prompts become very large and take far longer to process. gpt-oss is a sparse model, though, so maybe it won't be that bad.
Also, just to point it out: if you use OpenRouter with things like Aider or roocode or whatever, you can flag your account to only use providers with a zero-data-retention policy if you are truly concerned about anyone training on your source code. GPT-5 and Claude are infinitely better, faster, and cheaper than anything I can do locally, and I have a monster setup.
fm2606
gpt-oss-120b is amazing. I created a RAG agent holding most of the GCP documentation (separate download, parsing, chunking, etc.). ChatGPT finished a 50-question quiz in 6 minutes with a score of 46/50. gpt-oss-120b took over an hour but got 47/50. All the other local LLMs I tried were smaller and performed far worse, less than 50% correct.
I ran this on an i7 with 64GB of RAM and an old NVIDIA card with 8GB of VRAM.
EDIT: Forgot to say what the RAG system was doing, which was answering a 50-question multiple-choice test about GCP and cloud engineering.
embedding-shape
> gpt-oss-120b is amazing
Yup, I agree, easily the best model you can run on local hardware today, especially with reasoning_effort set to "high", though "medium" does very well too.
I think people missed out on how great it was because a bunch of the runners botched their implementations at launch, and it wasn't until 2-3 weeks afterwards that you could properly evaluate it. Once I could run the evaluations myself on my own tasks, it became evident how much better it is.
If you haven't tried it yet, or you tried it very early after the release, do yourself a favor and try it again with updated runners.
lacoolj
you can run the 120b model on an 8GB GPU? or are you running this on CPU with the 64GB RAM?
I'm about to try this out lol
The 20b model is not great, so I'm hoping 120b is the golden ticket.
ThatPlayer
With MoE models like gpt-oss, you can run some layers on the CPU: https://github.com/ggml-org/llama.cpp/discussions/15396
Mentions 120b is runnable on 8GB VRAM too: "Note that even with just 8GB of VRAM, we can adjust the CPU layers so that we can run the large 120B model too"
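A sketch of what that looks like with llama-server; flag availability depends on your llama.cpp version and the model path is a placeholder:
  # keep attention/KV cache on the GPU, push the MoE expert weights to system RAM
  llama-server -m gpt-oss-120b-mxfp4.gguf -ngl 99 --n-cpu-moe 30 -c 16384
  # equivalently, via a tensor override: -ot ".ffn_.*_exps.=CPU"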
gunalx
In many cases I have had better results with the 20b model over the 120b model, mostly because it is faster and I can iterate on prompts more quickly to coerce it into following instructions.
fm2606
For everything I run, even the small models, some amount goes to the GPU and the rest to RAM.
fm2606
Hmmm...now that you say that, it might have been the 20b model.
And like a dumbass I accidentally deleted the directory, and it wasn't backed up or under version control.
Either way, I do know for a fact that the gpt-oss-XXb model beat ChatGPT by 1 answer: 46/50 at 6 minutes for ChatGPT versus 47/50 at 1+ hour for the local model. I remember because I was blown away that I could get that kind of result running locally, and I had texted a friend about it.
I was really impressed, but disappointed at the huge disparity in time between the two.
dust42
On a Macbook pro 64GB I use Qwen3-Coder-30B-A3B Q4 quant with llama.cpp.
For VSCode I use continue.dev, as it allows me to set my own (short) system prompt. I get around 50 tokens/sec generation and 550 t/s prompt processing.
When given well-defined small tasks, it is as good as any frontier model.
I like the speed, the low latency, and the availability while on a plane/train or off-grid.
Also decent FIM with the llama.cpp VSCode plugin.
If I need more intelligence my personal favourites are Claude and Deepseek via API.
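For anyone copying the local part, the serving side is roughly this; the exact GGUF filename, context size, and port are assumptions:
  # serve the Q4_K_M quant fully offloaded to the Apple GPU
  llama-server -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -ngl 99 -c 32768 --port 8080
  # continue.dev / the llama.cpp VSCode plugin then point at http://localhost:8080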
Xenograph
Have you tried continue.dev's new open completion model [1]? How does it compare to llama.vscode FIM with qwen?
redblacktree
Would you use a different quant with a 128 GB machine? Could you link the specific download you used on huggingface? I find a lot of the options there to be confusing.
dust42
I usually use unsloth quants, in this case https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-... - the Q4_K_M variant.
On 128GB I would definitely run a larger model, probably one with ~10B active parameters. It all depends on how many tokens per second are comfortable for you.
To get an idea of the speed difference, there is a benchmark page for llama.cpp on Apple silicon here: https://github.com/ggml-org/llama.cpp/discussions/4167
About quant selection: https://gist.github.com/Artefact2/b5f810600771265fc1e3944228...
And my workaround for 'shortening' prompt processing time: I load the files I want to work on (usually 1-3) into context with the instruction "read the code and wait". While the LLM is doing the prompt processing, I write out the instructions for what I want done. Usually the LLM has long finished prompt processing before I'm done writing. Thanks to KV caching, the LLM then answers almost instantly.
tommy_axle
Not the OP, but yes, you can definitely get a bigger quant like Q6 if it makes a difference, or go with a bigger-parameter model like gpt-oss-120B. A 70B would probably be great for a 128GB machine, though I don't think Qwen has one. You can search Hugging Face for the model you're interested in, often with "gguf" in the query, to get a ready-to-go build (e.g. https://huggingface.co/ggml-org/gpt-oss-120b-GGUF/tree/main). Otherwise it's not a big deal to quantize it yourself using llama.cpp.
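The quantize-it-yourself path is roughly this, assuming a llama.cpp checkout/build and a downloaded HF checkpoint (paths are placeholders):
  # convert the HF checkpoint to GGUF, then re-quantize to the size you want
  python convert_hf_to_gguf.py /path/to/hf-model --outtype f16 --outfile model-f16.gguf
  ./llama-quantize model-f16.gguf model-Q6_K.gguf Q6_K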
codingbear
How are you running Qwen3 with llama.vscode? I am still using qwen-2.5-7b.
There is an open issue about adding support for Qwen3 which I have been monitoring; I'd love to use Qwen3 if possible. Issue: https://github.com/ggml-org/llama.vscode/issues/55
dennemark
I have an AMD Strix Halo (395) in my work laptop (HP Ultrabook G1A) as well as at home in a Framework Desktop.
On both I have lemonade-server set up to run on system start. At work I use Qwen3 Coder 30B-A3B with continue.dev. It serves me well in 90% of cases.
At home I have 128GB RAM and am trying out GPT-OSS-120B a bit. I host Open WebUI on it and connect via HTTPS and WireGuard, so I can use it as a PWA on my phone. I love not needing to think about where my data goes. But I would like to allow parallel requests, so I need to tinker a bit more; maybe llama-swap is enough.
I just need to figure out how to deal with context length. My models stop or go into an infinite loop after a few messages, but then I often just start a new chat.
lemonade-server runs on llama.cpp; vLLM seems to scale better, though it is not as easy to set up.
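For the parallel-requests part, a minimal sketch with plain llama-server (model path and numbers are placeholders; note the context gets split across slots):
  # 4 parallel slots for Open WebUI; each slot gets 65536/4 = 16384 tokens of context
  llama-server -m gpt-oss-120b-mxfp4.gguf -ngl 99 -c 65536 --parallel 4 --host 0.0.0.0 --port 8080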
Unsloth GGUFs are a great resource for models.
Also, for Strix Halo check out the kyuz0 repositories! They cover image gen as well; I haven't tried those yet, but the benchmarks are awesome and there's lots to learn from. The Framework forum can be useful, too.
https://github.com/kyuz0/amd-strix-halo-toolboxes Also nice: https://llm-tracker.info/ It links to a benchmark site with models by size. I prefer such resources, since it makes it easy to see which ones fit in my RAM (even though I have the silly rule of thumb: a billion parameters ≈ a GB of RAM).
Btw, even an AMD HX 370 with non-soldered RAM can get some nice t/s on smaller models. That can be helpful enough when you're disconnected from the internet and don't know how to style an SVG :)
Thanks for opening up this topic! Lots of food for thought :)
jetsnoc
  Models
    gpt-oss-120b, Meta Llama 3.2, or Gemma
    (just depends on what I’m doing)
  Hardware
    - Apple M4 Max (128 GB RAM)
      paired with a GPD Win 4 running Ubuntu 24.04 over USB-C networking
  Software
    - Claude Code
    - RA.Aid
    - llama.cpp
  For CUDA computing, I use an older NVIDIA RTX 2080 in an old System76 workstation.
  Process
    I create a good INSTRUCTIONS.md for Claude/RA.Aid that specifies a task and production process, with a task list it maintains. I use Claude Agents with an Agent Organizer that helps determine which agents to use. It creates the architecture, PRD, and security design, writes the code, and then lints, tests, and does a code review.
Infernal
What does the GPD Win 4 do in this scenario? Is there a step w/ Agent Organizer that decides if a task can go to a smaller model on the Win 4 vs a larger model on your Mac?
altcognito
What sorts of token/s are you getting with each model?
jetsnoc
Model performance summary:
  **openai/gpt-oss-120b** — MLX (MXFP4), ~66 tokens/sec @ Hugging Face: `lmstudio-community/gpt-oss-120b-MLX-8bit`
  **google/gemma-3-27b** — MLX (4-bit), ~27 tokens/sec @ Hugging Face: `mlx-community/gemma-3-27b-it-qat-4bit`
  **qwen/qwen3-coder-30b** — MLX (8-bit), ~78 tokens/sec @ Hugging Face: `Qwen/Qwen3-Coder-30B-A3B-Instruct`
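If you want to reproduce those tokens/sec numbers from the command line, a rough way to do it, assuming the mlx-lm Python package is installed (the prompt is arbitrary):
  # prints generation tokens/sec at the end of the run
  pip install mlx-lm
  mlx_lm.generate --model lmstudio-community/gpt-oss-120b-MLX-8bit \
    --prompt "Summarize what a Mach-O binary is." --max-tokens 256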
CubsFan1060
What is the Agent Organizer you use?
jetsnoc
It’s a Claude agent prompt. I don’t recall who originally shared it, so I can’t yet attribute the source, but I’ll track that down shortly and add proper attribution here.
Here’s the Claude agent markdown:
https://github.com/lst97/claude-code-sub-agents/blob/main/ag...
Edit: Updated from the old Pastebin link to the GitHub version. Attribution found: lst97 on GitHub
nicce
Looks like the Claude agent itself was written by Claude...
softfalcon
For anyone who wants to see some real workstations that do this, you may want to check out Alex Ziskind's channel on YouTube:
https://www.youtube.com/@AZisk
At this point, pretty much all he does is review workstations for running LLM's and other machine-learning adjacent tasks.
I'm not his target demographic, but because I'm a dev, his videos are constantly recommended to me on YouTube. He's a good presenter and his advice makes a lot of sense.
fm2606
> I'm not his target demographic
Me either, and I am a dev as well.
> He's a good presenter and his advice makes a lot of sense.
Agree.
Not that I think he shapes his answers around who is sponsoring him, but I feel he couldn't do a lot of what he does without sponsors. If the sponsors aren't supplying him with all that hardware, then in my opinion he is taking a significant risk buying it out of pocket and hoping the money he makes from YouTube covers it (which I am sure it does, several times over). But there is no guarantee that YouTube revenue will cover the costs; that's the point I'm making.
But then again, he does use the hardware in other videos, so it isn't like he is banking on a single video to cover the costs.
hereme888
Dude... what a good YT channel. The guy is no nonsense, straight to the point. Thanks.
erikig
Hardware: MacBook Pro M4 Max, 128GB
Platform: LMStudio (primarily) & Ollama
Models:
- qwen/qwen3-coder-30b A3B Instruct 8-bit MLX
- mlx-community/gpt-oss-120b-MXFP4-Q8
For code generation, especially for larger projects, these models aren't as good as the cutting-edge foundation models. For summarizing local git repos/libraries, generating documentation, and simple offline command-line tool use, they do a good job.
I find these communities quite vibrant and helpful too:
mkagenius
Since you are on a Mac, if you need some kind of code execution sandbox, check out Coderunner [1], which is based on Apple containers and provides a way to execute any LLM-generated code without risking arbitrary code execution on your machine.
I have recently added Claude Skills to it, so all the Claude Skills can be executed locally on your Mac too.
giancarlostoro
If you're going to get a MacBook, get the Pro: it has a built-in fan, and you don't want the heat just sitting there in a MacBook Air. Same with the Mac mini: get the Studio instead, it has a fan, the mini does not. I don't know about you, but I wouldn't want my brand-new laptop/desktop heating up the entire time I'm coding with zero cool-off. If you go the Mac route, I recommend getting TG Pro; the default fan settings on a Mac are awful and don't kick in soon enough, while TG Pro lets you make them a little more "sensitive" to temperature shifts. It's about $20 if I remember correctly, but worth it.
I have a MacBook Pro with an M4 Pro chip and 24GB of RAM; I believe only 16GB of that is usable by models, so I can run the smaller GPT-OSS 20B model (iirc) but not the larger one. It can do a bit, but the context window fills up quickly, so I find myself starting new contexts fairly often. I do wonder whether a maxed-out MacBook Pro could run larger context windows; then I could easily code all day with it offline.
I do think Macs are phenomenal at running local LLMs if you get the right one.
trailbits
The default context window using ollama or lmstudio is small, but you can easily quadruple the default size while running gpt-oss-20b on a 24GB Mac.
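One way to do that with Ollama is to bake a bigger num_ctx into a derived model; a sketch, with 32k as an example value:
  # derive a gpt-oss-20b variant with a 32k context window
  printf 'FROM gpt-oss:20b\nPARAMETER num_ctx 32768\n' > Modelfile
  ollama create gpt-oss-20b-32k -f Modelfile
  ollama run gpt-oss-20b-32k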
amonroe805-2
Quick correction: the Mac mini does have a fan. The Studio is definitely more capable due to bigger, better chips, but my understanding is the mini is generally not at risk of thermal throttling with the chips you can buy it with. The decision for desktop Macs really just comes down to how much chip you want to pay for.
suprjami
Correction: if you're going to get a Mac, then get a Max or Ultra with as much memory as possible; the increase in RAM bandwidth will make running larger models viable.
embedding-shape
> I do think Macs are phenomenal at running local LLMs if you get the right one.
What does the prompt processing speed look like today? I think it was either an M3 or an M4 with 128GB: trying to run even slightly longer prompts took forever in the initial prompt processing, so whatever speed gain you got at inference basically didn't matter. Maybe it works better today?
Terretta
And yes, for context windows / cached context, the MacBook Pro with 128GB memory is a mind boggling laptop.
The Studio Ultras are surprisingly strong as well for a pretty monitor stand.
whitehexagon
Qwen3:32b on an MBP M1 Pro 32GB running Asahi Linux. Mainly command line, for some help with ARMv8 assembly and some SoC stuff (this week, explaining the I2C protocol); I couldn't find any good intro on the web-of-ads. It's not much help with Zig, but then nothing seems to keep up with Zig at the moment.
I get a steady stream of tokens, slightly slower than my reading pace, which I find is more than fast enough. In fact I'd only replace it with the exact same machine, or maybe an M2 + Asahi with enough RAM to run the bigger Qwen3 model.
I saw qwen3-coder mentioned here. I didn't know about that one. Anyone got any thoughts on how it compares to qwen3? Will it also fit in 32GB?
I'm not interested in agents or tool integration, and especially won't use anything cloud. I like to own my env and code top to bottom. Having also switched to Kate and Fossil, it feels like my perfect dev environment.
Currently using an older Ollama, but I will switch to llama.cpp now that Ollama has pivoted away from offline-only. I got llama.cpp installed, but I'm not sure how to reuse my models from Ollama; I thought Ollama was just a wrapper, but they seem to be different model formats?
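On reusing the Ollama downloads: Ollama stores models as plain GGUF blobs, so you can usually point llama.cpp straight at them. A sketch (model name is an example; the blob path varies by install):
  # the Modelfile's FROM line is the path to the raw GGUF blob
  ollama show --modelfile qwen3:32b | grep '^FROM'
  # e.g. FROM ~/.ollama/models/blobs/sha256-<hash>
  llama-cli -m ~/.ollama/models/blobs/sha256-<hash> -ngl 99 -p "hello"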
[edit] Be sure to use it plugged in; Linux is a bit battery-heavy, and Qwen3 will pull 60W+ and flatten a battery real fast.
suprjami
Qwen 3 Coder is a 30B model but an MoE with 3B active parameters, so it should be much faster for you. Give it a try.
It's not as smart as dense 32B for general tasks, but theoretically should be better for the sort of coding tasks from StackExchange.
embedding-shape
> Which model(s) are you running (e.g., Ollama, LM Studio, or others)
I'm running mainly GPT-OSS-120b/20b depending on the task, Magistral for multimodal stuff, and some smaller models I've fine-tuned myself for specific tasks.
All the software is implemented by myself, but I started out with basically calling out to llama.cpp, as it was the simplest and fastest option that let me integrate it into my own software without requiring a GUI.
I use Codex and Claude Code from time to time to do some mindless work too, Codex hooked up to my local GPT-OSS-120b while Claude Code uses Sonnet.
> What laptop hardware do you have (CPU, GPU/NPU, memory, whether discrete GPU or integrated, OS) and how it performs for your workflow?
Desktop, Ryzen 9 5950X, 128GB of RAM, RTX Pro 6000 Blackwell (96GB VRAM), performs very well and I can run most of the models I use daily all together, unless I want really large context then just GPT-OSS-120B + max context, ends up taking ~70GB of VRAM.
> What kinds of tasks you use it for (code completion, refactoring, debugging, code review) and how reliable it is (what works well / where it falls short).
Almost anything and everything, but mostly coding. Beyond that: general questions, researching topics, troubleshooting issues with my local infrastructure, troubleshooting things in my other hobbies, and a bunch of other stuff. As long as you give the local LLM access to a search tool (I use YaCy + my own adapter), local models work better for me than the hosted ones, mainly because of the speed and because I have better control over the inference.
It does fall short on really complicated stuff. Right now I'm doing CUDA programming, creating a fused MoE kernel for inference in Rust, and it's a bit tricky: there are a lot of moving parts and I don't understand the subject 100%, and when you get to that point it's a bit hit or miss. You really need a proper understanding of what you use the LLM for, otherwise it breaks down quickly. Divide and conquer, as always, helps a lot.
andai
gpt-oss-120b keeps stopping for me in Codex. (Also in Crush.)
I have to say "continue" constantly.
embedding-shape
See https://news.ycombinator.com/item?id=45773874 (TL;DR: you need to hard-code some inference parameters to the right values, otherwise you get really bad behaviour, plus some prompting to get the workflow right).
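For reference, a sketch of the sort of parameter pinning meant here, using llama-server; the temperature/top-p values follow OpenAI's published gpt-oss guidance, and everything else is a placeholder:
  # gpt-oss expects temperature 1.0 / top-p 1.0; --jinja enables the bundled chat template
  llama-server -m gpt-oss-120b-mxfp4.gguf -ngl 99 -c 131072 --jinja --temp 1.0 --top-p 1.0 --top-k 0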
andai
Thanks. Did you need to modify Codex's prompt?
juujian
I passed on the machine, but we set up gpt-oss-120b on a 128GB RAM MacBook Pro and it is shockingly usable. Personally, I could imagine myself using that instead of OpenAI's web interface. The Ollama UI has web search working too, so you don't have to worry about whether the model knows the latest and greatest about every software package. Maybe one day I'll get the right drivers to run a local model on my Linux machine with AMD's NPU too, but AMD has been really slow on this.
j45
LM Studio also works well on Mac.
Dear Hackers, I’m interested in your real-world workflows for using open-source LLMs and open-source coding assistants on your laptop (not just cloud/enterprise SaaS). Specifically:
Which model(s) are you running (e.g., Ollama, LM Studio, or others) and which open-source coding assistant/integration (for example, a VS Code plugin) you’re using?
What laptop hardware do you have (CPU, GPU/NPU, memory, whether discrete GPU or integrated, OS) and how it performs for your workflow?
What kinds of tasks you use it for (code completion, refactoring, debugging, code review) and how reliable it is (what works well / where it falls short).
I'm conducting my own investigation, which I will be happy to share as well when it's done.
Thanks! Andrea.