Running Qwen3 on your MacBook, using MLX, to vibe code for free
168 comments · May 1, 2025 · omneity
TuxSH
Personally, I'm getting 15 tok/s on both RTX 3060 and my MacBook Air M4 (w/ 32GB, but 24 should suffice), with the default config from LMStudio.
Which I find even more impressive, considering the 3060 is the most used GPU (on Steam) and that M4 Air and future SoCs are/will be commonplace too.
(Q4_K_M with filesize=18GB)
anon373839
One of the most interesting things about that model is its excellent score on the RAG confabulations (hallucination) leaderboard. It’s the 3rd best model overall, beating all OpenAI models, for example. I wonder what Alibaba did to achieve that.
c0brac0bra
What tasks have you found the 0.6B model useful for? The hallucination that's apparent during its thinking process put up a big red flag for me.
Conversely, the 4B model actually seemed to work really well and gave results comparable to Gemini 2.0 Flash (at least in my simple tests).
SparkyMcUnicorn
You can use 0.6B for speculative decoding on the larger models. It'll speed up 32B, but slows down 30B-A3B dramatically.
omneity
It's okay for extracting simple things like addresses, or for formatting text with some input data, like a more advanced form of mail merge.
I haven't evaled these tasks so YMMV. I'm exploring other possibilities as well. I suspect it might be decent at autocomplete, and it's small enough one could consider finetuning it on a codebase.
jasonjmcghee
Importantly they note that using a draft model screws it up and this was my experience. I was initially impressed, then started seeing problems, but after disabling my draft model it started working much better. Very cool stuff- it's fast too as you note.
The /think and /no_think commands are very convenient.
woadwarrior01
That should not be the case. Speculative decoding is trading off compute for memory bandwidth. The model's output is guaranteed to be the same, with or without it. Perhaps there's a bug in the implementation that you're using.
marcalc
What do you mean by draft model? And how would one disable it? Cheers
_neil
A draft model is something that you would explicitly enable. It uses a smaller model to speculatively generate next tokens, in theory speeding up generation.
Here’s the LM Studio docs on it: https://lmstudio.ai/docs/app/advanced/speculative-decoding
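For intuition, here's a toy sketch of the draft-and-verify loop (greedy decoding only, with made-up stand-in model functions rather than any particular library's API):

```python
def speculative_decode(target_next, draft_next, prompt_tokens, k=4, max_new=64):
    """Toy greedy speculative decoding.

    target_next(tokens) -> token the big model would emit next
    draft_next(tokens)  -> token the small draft model would emit next
    Both are hypothetical stand-ins for real model calls.
    """
    tokens = list(prompt_tokens)
    produced = 0
    while produced < max_new:
        # 1. The draft model cheaply guesses k tokens ahead.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. The target model verifies the guesses. In a real implementation this
        #    verification is one batched forward pass, which is where the speedup
        #    comes from; it never changes which tokens end up accepted.
        accepted = []
        for guess in draft:
            target_tok = target_next(tokens + accepted)
            if guess == target_tok:
                accepted.append(guess)       # draft was right, keep it
            else:
                accepted.append(target_tok)  # draft was wrong: take the target's token and stop
                break

        tokens.extend(accepted)
        produced += len(accepted)
    return tokens
```

With real sampling the accept test is probabilistic rather than an exact match, but the guarantee still holds: accepted tokens follow the target model's distribution. So a quality drop with a draft model enabled points at an implementation bug (or mismatched tokenizers/templates between the two models) rather than at the technique itself.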
mtw
how much RAM do you have? I want to compare with my local setup (M4 Pro)
dust42
I have a MBP M1 Max 64GB and I get 40t/s with llama.cpp and unsloth q4_k_m on the 30B A3B model. I always use /no_think and Temperature=0.7, TopP=0.8, TopK=20, and MinP=0 - these are the settings recommended for Qwen3 and they make a big difference (a request sketch with these settings follows the links below). With the default settings from llama-server it will always run into an endless loop.
The quality of the output is decent, just keep in mind it is only a 30B model. It also translates really well from French to German and vice versa, much better than Google Translate.
Edit: for comparison, Qwen2.5-coder 32B q4 is around 12-14t/s on this M1, which is too slow for me. I usually used Qwen2.5-coder 17B at around 30t/s for simple tasks. Qwen3 30B is imho better and faster.
[1] parameters for Qwen3: https://huggingface.co/Qwen/Qwen3-30B-A3B
[2] unsloth quant: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF
[3] llama.cpp: https://github.com/ggml-org/llama.cpp
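If anyone wants to reproduce those settings, here's roughly what a request looks like against a llama-server instance already serving the Qwen3 GGUF on its default port. This is a sketch: `top_k` and `min_p` are llama.cpp extensions to the OpenAI-style schema, so drop them if your build rejects unknown fields.

```python
import requests

# Chat request to a local llama-server (llama.cpp) running a Qwen3 30B-A3B GGUF,
# using the sampling settings recommended for the model.
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "qwen3-30b-a3b",  # largely informational for a single-model server
        "messages": [
            {"role": "user", "content": "Explain MoE models in two sentences. /no_think"},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,    # llama.cpp-specific extension
        "min_p": 0.0,   # llama.cpp-specific extension
        "max_tokens": 512,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```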
omneity
128GB but it's not using much.
I'm running Q4 and it's taking 17.94 GB VRAM with 4k context window, 20GB with 32k tokens.
A4ET8a8uTh0_v2
I am not a mac person, but I am debating buying one for the unified RAM now that the prices seem to be inching down. Is it painful to set up? The general responses I seem to get range from "It takes zero effort" to "It was a major hassle to set everything up."
tmaly
I am curious if you have tried function calling or MCPs with it?
UK-Al05
It fits entirely in my 7900 XTX's memory. But tbh I've been disappointed with its programming ability so far.
It's using 20GB of memory according to ollama.
tomr75
I'm getting 56 with MLX and LM Studio. How are you getting 76?
artdigital
Cool, but running qwen3 and doing a ls tool call is not “vibe coding”, this reads more like a lazy ad for localforge
I doubt it can perform well with actual autonomous tasks like reading multiple files, navigating dirs and figuring out where to make edits. That’s at least what I would understand under “vibe coding”
datpuz
What *is* vibe coding? And how can I stop hearing about it for the rest of my life?
krashidov
The real definition of vibe coding is coding with just an LLM. Never looking at what code it outputs, never doing manual edits. Just iterating with the LLM and always pressing approve.
It is a viable way of making software. People have made working software with it. It will likely only ever be more prevalent but might be renamed to just plain old making apps.
baq
‘If it works it ships’
avetiszakharyan
Definitely try it, it can navigate files, search for stuff, run bash commands, and while 30B is a bit cranky it gets the job done (much worse than I would get when I plug in GPT-4.1, but it's still not bad, kudos to Qwen). As for Localforge, it really is a vibe coding tool, just like Claude or Codex, but with the possibility to plug in more than just one provider. What's wrong with that?
tough
They're just pointing out how most -educational- content is actually marketing in disguise, which is fine, but also fine to acknowledge i guess, even if a bit snarkily
avetiszakharyan
Well, it's an OSS project, free. I kind of didn't see it that way, I guess, that something that's given away for free is in bad tone to market in any possible way. I guess from my standpoint it's more of a: I just want to show this thing to people, because I am proud of it as a personal project, and it brings me joy to just put it out there. And since if you "just put it out there" it will sink to the bottom of the HN pit, why not get a bit more creative.
85392_school
You should try it. It's trained for tool calling and thinks before taking action.
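As a starting point, here's a minimal function-calling sketch against a local Ollama server. The model tag is whatever you actually pulled, the tool itself is made up, and the field names follow Ollama's /api/chat tool-calling schema as I understand it, so double-check the current docs.

```python
import requests

# Hypothetical tool the model may choose to call.
tools = [{
    "type": "function",
    "function": {
        "name": "list_files",
        "description": "List files in a directory of the current project",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Directory to list"}},
            "required": ["path"],
        },
    },
}]

resp = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": "qwen3:30b-a3b",  # assumed tag; use whatever you pulled
        "messages": [{"role": "user", "content": "What's in the src directory?"}],
        "tools": tools,
        "stream": False,
    },
    timeout=300,
).json()

# If the model decided to call the tool, the call shows up here instead of plain text.
for call in resp["message"].get("tool_calls", []):
    print(call["function"]["name"], call["function"]["arguments"])
```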
kamranjon
Just wanted to give a shout out to MLX and MLX-LM - I’ve been using it to fine-tune Gemma 3 models locally and it’s a surprisingly well put together library and set of tools from the Apple devs.
avetiszakharyan
I thought I'd share this quick tutorial to get an actual autonomous agent running on your local machine and doing some simple tasks. Still in progress trying to figure out the right MLX settings or the proper model version for it, but the framework around this approach is solid, so I thought I'd share!
nottorp
Now how do you feed it an existing codebase as part of your prompt? Does it even support that (prompt size etc).
avetiszakharyan
Yeah, you can just run it in a folder and ask it to look around; it can execute bash commands and do anything that Claude Code can do. It will read the whole codebase if it has to.
pylotlight
Typically I'd use tools for that as context will be finite, but I hear it does a decent job at tool calling too so should see solid perf there.
chuckadams
Anyone know of a setup, perhaps with MCP, where I can get my local LLM to work in tandem on tasks, compress context, or otherwise act in concert with the cloud agent I'm using with Augment/Cursor/whatever? It seems silly that my shiny new M3 box just renders the UI while the cloud LLM alone refactors my codebase, I feel they could negotiate the tasks between themselves somehow.
_joel
There's a few Ollama-MCP bridge servers already (from a quick search, also interested myself):
ollama-mcp-bridge: A TypeScript implementation that "connects local LLMs (via Ollama) to Model Context Protocol (MCP) servers. This bridge allows open-source models to use the same tools and capabilities as Claude, enabling powerful local AI assistants"
simple-mcp-ollama-bridge: A more lightweight bridge connecting "Model Context Protocol (MCP) servers to OpenAI-compatible LLMs like Ollama"
rawveg/ollama-mcp: "An MCP server for Ollama that enables seamless integration between Ollama's local LLM models and MCP-compatible applications like Claude Desktop"
How you route would be an interesting challenge; presumably you could just tell it to use the MCP for certain tasks, thereby offloading that work locally.
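As a rough illustration of the routing idea (nothing MCP-specific, just plain HTTP; the endpoints, model names, and task tags are placeholders for whatever your local/cloud setup uses):

```python
import requests

OLLAMA = "http://127.0.0.1:11434/api/chat"              # local Ollama
CLOUD = "https://api.openai.com/v1/chat/completions"    # or any OpenAI-compatible endpoint

# Cheap/bulky work (summarising context, grepping docs) goes local,
# anything needing stronger reasoning goes to the cloud model.
LOCAL_TASKS = {"summarize", "compress_context", "extract_references"}

def ask(task, prompt, api_key=""):
    messages = [{"role": "user", "content": prompt}]
    if task in LOCAL_TASKS:
        r = requests.post(OLLAMA, json={"model": "qwen3:30b-a3b",
                                        "messages": messages, "stream": False})
        return r.json()["message"]["content"]
    r = requests.post(CLOUD,
                      headers={"Authorization": f"Bearer {api_key}"},
                      json={"model": "gpt-4.1", "messages": messages})
    return r.json()["choices"][0]["message"]["content"]

# e.g. compress a long diff locally before handing it to the cloud agent:
# summary = ask("compress_context", "Summarise this diff:\n" + diff_text)
```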
rcarmo
I've been toying with Visual Studio Code's MCP and agent support and gotten it to offload things like reference searches and targeted web crawling (look up module X on git repo Y via this URL pattern that the MCP server goes, fetches and parses).
I started by giving it a reference Python MCP server and asking it to modify the code to do that. Now I have 3-4 tools that give me reproducible results.
101011
This is the closest I've found that's akin to Claude Code: https://aider.chat/
rcarmo
Coincidentally, I just managed to get Qwen3 to go into a loop by using a fairly simple prompt:
"create a python decorator that uses a trie to do mqtt topic routing”
phi4-reasoning works, but I think the code is buggy
phi4-mini-reasoning freaks out
qwen3:30b starts looping and forgets about the decorator
mistral-small gets straight to the point and the code seems sane
https://mastodon.social/@rcarmo/114433075043021470
I regularly use Copilot models, and they can manage this without too many issues (Claude 3.7 and Gemini output usable code with tests), but local models seem to not have the ability to do it quite yet.
datpuz
Here's qwen-30b-a3b's response to your prompt when I worded it better:
The prompt was:
"Create a Python decorator that registers functions as handlers for MQTT topic patterns (including + and # wildcards). Internally, use a trie to store the topic patterns and match incoming topic strings to the correct handlers. Provide an example showing how to register multiple handlers and dispatch a message to the correct one based on an incoming topic."
rcarmo
I went back and used your prompt, and it is still looping:
anon373839
Are you using Ollama? If so, the issue may be Ollama's default context length: just 2,048 tokens. Ollama truncates the rest of the context silently, so "thinking" models cannot work with the default settings.
If you are using Ollama, try explicitly setting the `num_ctx` parameter in your request to something higher like 16k or 32k, and then see if you still encounter the looping. I haven't run into that behavior once with this model.
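For reference, this is roughly how you'd raise it per request (the same option can also be baked into a Modelfile with `PARAMETER num_ctx ...`):

```python
import requests

long_prompt = "Create a Python decorator that registers MQTT topic handlers ..."  # your full prompt

# Per-request context window override for Ollama; without it the small default
# context silently truncates long prompts, which breaks "thinking" models.
resp = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": "qwen3:30b-a3b",  # assumed tag
        "messages": [{"role": "user", "content": long_prompt}],
        "options": {"num_ctx": 32768},
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["message"]["content"])
```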
datpuz
I think your prompt is bad. Still impressive that Claude 3.7 handled your bad prompt, but qwen3 had no problem with this prompt:
Create a Python decorator that registers functions as handlers for MQTT topic patterns (including + and # wildcards). Internally, use a trie to store the topic patterns and match incoming topic strings to the correct handlers. Provide an example showing how to register multiple handlers and dispatch a message to the correct one based on an incoming topic.
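For anyone who just wants to see what the task involves, here's a small hand-rolled reference sketch (written by hand for illustration, not any model's output) of a trie-based topic router with a registration decorator:

```python
class TopicTrie:
    def __init__(self):
        self.children = {}   # topic level -> child node
        self.handlers = []   # handlers registered exactly at this node

    def insert(self, pattern, handler):
        node = self
        for level in pattern.split("/"):
            node = node.children.setdefault(level, TopicTrie())
        node.handlers.append(handler)

    def match(self, levels):
        if not levels:
            # '#' is also allowed to match zero remaining levels
            hash_child = self.children.get("#")
            return self.handlers + (hash_child.handlers if hash_child else [])
        head, rest = levels[0], levels[1:]
        found = []
        if head in self.children:                 # exact level match
            found += self.children[head].match(rest)
        if "+" in self.children:                  # '+' matches exactly one level
            found += self.children["+"].match(rest)
        if "#" in self.children:                  # '#' matches everything below
            found += self.children["#"].handlers
        return found


_routes = TopicTrie()

def topic(pattern):
    """Register the decorated function as a handler for an MQTT topic pattern."""
    def register(fn):
        _routes.insert(pattern, fn)
        return fn
    return register

def dispatch(topic_str, payload):
    for handler in _routes.match(topic_str.split("/")):
        handler(topic_str, payload)

@topic("sensors/+/temperature")
def on_temperature(t, payload):
    print("temperature reading:", t, payload)

@topic("sensors/#")
def on_any_sensor(t, payload):
    print("some sensor event:", t, payload)

dispatch("sensors/kitchen/temperature", "21.5")   # fires both handlers
```

The `+` node has to match exactly one level while `#` swallows everything below it, which is the part the terser original prompt left the model to figure out on its own.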
rcarmo
I purposefully used exactly the same thing I did with Claude and Gemini to see how the models dealt with ambiguity. It shouldn't have degraded the chain of thought to the point where it starts looping.
101011
The trick shouldn't be to try and generate a litmus test for agentic development, it's to change your workflow to game-plan solutions and decompose problems (like you would a jira epic to stories), and THEN have it build something for you.
avetiszakharyan
Is there an additional system prompt before that? Or can I repro with just this?
rcarmo
Just that. I purposefully used exactly the same thing I did with Claude and Gemini to see how the models dealt with ambiguity.
GaggiX
You should probably try a different quantization. Have you tried UD-Q4_K_XL?
throwaway314155
[flagged]
nico
Very cool to see this and glad to discover localforge. Question about localforge, can I combine two agents to do something like: pass an image to a multimodal agent to provide html/css for it, and another to code the rest?
In the post I saw there’s gemma3 (multimodal) and qwen3 (not multimodal). Could they be used as above?
How does localforge know when to route a prompt to which agent?
Thank you
avetiszakharyan
You can combine agents in two ways: you can constantly swap agents during one conversation, or you can have them in separate conversations and collaborate. I was even thinking two agents could work on two separate git clones and then do PRs to each other. I also like using code to do the image parsing and CSS, and then Gemini to do the coding. I tried using Gemma and Qwen for "real stuff" but it's more of a simple-stuff-only situation; if I really need output, I'd rather spend money for now. Hopefully that will change soon.
As for routing, Localforge does NOT know. You choose the agent, and it will loop inside that agent forever. The way it works is that unless the agent decides to talk to the user, it will forever be doing function calls and "talking to functions", as one agent. The only routing happens this way: there is a main model and there is an Expert model. The main model knows to ask the expert model (see the system prompt) when it's stuck. So for any routing to happen, 1) the system prompt needs to mention it and 2) routing to another model has to be a function call. That way the model knows how to ask another model for something.
nico
Great insights, thank you for the extended and detailed answer, I'll have to try it out
walthamstow
Looks good. I've been looking for a local-first AI-assisted IDE to work with Google's Gemma 3 27B
I do think you should disclose that Localforge is your own project though.
danw1979
Personally, I assumed that a blog post on the domain localforge.dev was written by the developers of localforge, but I might be wrong.
SquareWheel
They likely mean that the submitter, avetiszakharyan, should disclose their relationship to Localforge.
zarathustreal
Fascinating.. I wonder how much of the economy runs on social proof
walthamstow
Sure, if you already know what Localforge is before clicking.
tasuki
I didn't know, and still assumed the blog post on localforge.dev was written by the localforge.dev people. Who else?
avetiszakharyan
Where do I put that, in the blog post or?
999900000999
Very impressive. It doesn't need to be as good as the pay-per-token models. For example, I've probably spent at least $300 last month on vibe coding; a big part of this is I want to know what tools I'm going to end up competing with, and another is that I got a working implementation of one of my side projects and then decided I wanted it rewritten in another programming language.
Even if I chill out a bit here, a refurbished Nvidia laptop would pay for itself within a year. I am a bit disappointed Ollama can't handle the full flow yet, i.e. it could be a single command:
ollama code qwen3
_bin_
I just tried it. It got stuck looping on a `cargo check` call and literally wouldn't do anything else. No additional context, just repeatedly spitting out the same tool call.
The problem is the best models barely clear the bar for some stuff in terms of coherence and reliability; anything else just isn't particularly usable.
999900000999
This happens when I'm using Claude Code too. Even the best models need humans to get unstuck.
From what I've seen, most of them are good at writing new code from scratch.
Refactoring is very difficult.
_bin_
I tried it 3-4 times before giving up and it did this every single time. I checked the tool call output and it was running cargo check appropriately. I think maybe the 30b-scale models just aren't sufficient for typical development.
You're generally correct though, that from-scratch gets better results. This is a huge constraint of them: I don't want a model that will write something its way. I've already gone through my design and settled on the style/principles/libraries I did for a reason; the bot working terribly with that is a major flaw and I don't see saying "let the bot do things its preferred way" as a good answer. Some systems, things like latency matters, and the bot's way just isn't good enough.
The vast majority of man-hours are maintaining and extending code, not green-fielding new stuff. Vendors should be hyper-focused on this, on compliance with user directions, not with building something that makes a react todo-list app marginally faster or better than competitors.
ttoinou
Great, thank you. Side topic: does anyone know of a centralized proxy to all LLM services, online or local, that lets our services connect to it so we only manage access to LLMs in one place? And that also records calls to the LLMs. It would make the whole UX of switching LLMs weekly easier, since we would only reconfigure the proxy. The only one I know that can do this is LiteLLM, but its recording of LLM calls is a bit clunky to use properly.
Havoc
Litellm is definitely your best bet. For recording - you can probably vibe code a proxy in front of it that mitms it and dumps the request into whatever format you need
rcarmo
Litellm can log stuff pretty well on its own.
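For what it's worth, a custom success callback is the lighter-weight way to capture calls in the LiteLLM SDK. This is a sketch based on LiteLLM's custom-callback hook; verify the exact callback signature against their current docs, and swap in whatever model tag you actually use.

```python
import json
import litellm

def log_call(kwargs, completion_response, start_time, end_time):
    """Append every successful LLM call to a local JSONL file."""
    record = {
        "model": kwargs.get("model"),
        "messages": kwargs.get("messages"),
        "response": completion_response.choices[0].message.content,
        "latency_s": (end_time - start_time).total_seconds(),
    }
    with open("llm_calls.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

litellm.success_callback = [log_call]

# Same call shape whether it's a cloud provider or a local Ollama model
# ("qwen3:30b-a3b" is an assumed tag):
litellm.completion(
    model="ollama/qwen3:30b-a3b",
    messages=[{"role": "user", "content": "hello"}],
)
```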
mnholt
I've been looking for this for my team but haven't found it. Providers like OpenAI and Anthropic offer admin tokens to manage team accounts, and you could hook into Ollama or another self-managed service for local AI.
Seems like a great way to roll out AI to a medium sized team where a very small team can coordinate access to the best available tools so the entire team doesn’t need to keep pace at the current break-neck speed.
calebkaiser
I'm a maintainer of Opik, an open source LLM eval/observability framework. If you use something like LiteLLM or OpenRouter to handle the proxying of requests, Opik basically provides an out-of-the-box recording layer via its integrations with both:
tidbeck
Could you maybe make use of Simon Willison's [LLM lib/app](https://github.com/simonw/llm)? It has great LLM support (just pass in the model to use) and records everything by default.
simonw
The one feature missing from LLM core for this right now is serving models over an HTTP OpenAI-compatible local server. There's a plugin you can try for that here though: https://github.com/irthomasthomas/llm-model-gateway
desireco42
You can just use Ollama and have a bunch of models, some good for planning, some for executing tasks... this sounds more complex than it should be, or maybe I am lazy and want everything neatly sorted.
I have my models on an external drive (because Apple), and through the Ollama server they interact really well with Cline or Roo Code or even Bolt, though I found Bolt really not working well.
desireco42
To add, you can use so-called abliterated models that are stripped of censorship, for example. Much better experience sometimes.
jononor
Running models locally is starting to get interesting now. Especially the 30B-A3B version seems like a promising direction, though it is still out of reach on 16 GB VRAM (a quite accessible amount). Hoping for new Nvidia RTX cards with 24/32 GB VRAM. Seems that we might get to GPT-4-ish levels within a few years? Which is useful for a bunch of tasks.
avetiszakharyan
I think we are just a tiny bit away from being able to really "code" with AI locally. Because even if it were only at Gemini 2.5 level, since it's free, you can make it self-prompt a bit more and eventually solve any problem. If I could run a 200B, or if the 30B were as good, it would have been enough.
I'm using Qwen3-30B-A3B locally and it's very impressive. Feels like the GPT-4 killer we were waiting for for two years. I'm getting 70 tok/s on an M3 Max, which is pushing it into the "very usable" quadrant.
What was even more impressive is the 0.6B model, which makes the sub-1B class actually useful for non-trivial tasks.
Overall very impressed. I am evaluating how it can integrate with my current setup and will probably report somewhere about that.
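For anyone wanting to reproduce this on Apple Silicon, the mlx-lm quickstart is roughly this (the exact mlx-community repo name for a 4-bit Qwen3-30B-A3B conversion is an assumption; check the hub for what's actually published):

```python
from mlx_lm import load, generate

# Load a 4-bit MLX conversion of Qwen3-30B-A3B from the Hugging Face hub
# (repo name assumed; substitute whichever mlx-community conversion you use).
model, tokenizer = load("mlx-community/Qwen3-30B-A3B-4bit")

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True prints generation speed, which is where figures like 70 tok/s come from.
text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```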