Gemma 3 QAT Models: Bringing AI to Consumer GPUs
285 comments
April 20, 2025
rs186
Can you quote tps?
More and more I'm starting to realize that cost saving is a minor issue for local LLMs. If it's too slow it becomes unusable, so much so that you might as well use public LLM endpoints - unless you really care about getting things done locally without sending information to another server.
With the OpenAI API/ChatGPT, I get responses much faster than I can read, and for simple questions that means I just need a glimpse of the response, copy & paste, and get things done. Whereas with a local LLM, I watch it painstakingly print preambles that I don't care about, and only get what I actually need after 20 seconds (on a fast GPU).
And I am not yet talking about context window etc.
I have been researching how people integrate local LLMs into their workflows. My finding is that most people play with them for a short time and that's about it; most people are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than getting a beefed-up Mac Studio or building a machine with a 4090.
simonw
My tooling doesn't measure TPS yet. It feels snappy to me on MLX.
I agree that hosted models are usually a better option for most people - much faster, higher quality, handle longer inputs, really cheap.
I enjoy local models for research and for the occasional offline scenario.
I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.
freeamz
>I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.
Think it is NOT just you. Most companies with decent management also would not want their data going anywhere outside the physical servers they control. But yeah, for most people, just use an app and a hosted server. But this is HN - there are people here hosting their own email servers, so it shouldn't be too hard to run an LLM locally.
triyambakam
> specifically for dealing with extremely sensitive data like leaked information from confidential sources.
Can you explain this further? It seems in contrast to your previous comment about trusting Anthropic with your data
overfeed
> Whereas on local LLM, I watch it painstakingly prints preambles that I don't care about, and get what I actually need after 20 seconds.
You may need to "right-size" the models you use to match your hardware, model, and TPS expectations, which may involve using a smaller version of the model with faster TPS, upgrading your jardware, or paying for hosted models.
Alternatively, if you can use agentic workflows or tools like Aider, you don't have to watch the model work slowly with large modles locally. Instead you queue work for it, go to sleep, or eat, or do other work, and then much later look over the Pull Requests whenever it completes them.
rs186
I have a 4070 Super for gaming, and have used it to play with LLMs a few times. It is by no means a bad card, but I realize that unless I want to get a 4090 or a new Mac that I have no other use for, I can only run smaller models. However, most smaller models aren't satisfactory and are still slower than hosted LLMs. I haven't found a model that I am happy with for my hardware.
Regarding agentic workflows -- sounds nice but I am too scared to try it out, based on my experience with standard LLMs like GPT or Claude for writing code. Small snippets or filling in missing unit tests, fine, anything more complicated? Has been a disaster for me.
adastra22
I have never found any agent able to put together sensible pull requests without constant hand holding. I shudder to think of what those repositories must look like.
ein0p
Sometimes TPS doesn't matter. I've generated textual descriptions for 100K or so images in my photo archive, some of which I have absolutely no interest in uploading to someone else's computer. This works pretty well with Gemma. I use local LLMs all the time for things where privacy is even remotely important. I estimate this constitutes easily a quarter of my LLM usage.
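(For anyone wondering what a run like that might look like in practice, here is a minimal sketch using the ollama Python client; the model name, paths, and prompt are placeholders of mine, not ein0p's actual setup.)

```python
# Minimal sketch of batch image captioning with a local model via the ollama
# Python client (pip install ollama). Model name and paths are placeholders.
from pathlib import Path
import ollama

photos = Path("~/Pictures/archive").expanduser().rglob("*.jpg")

for photo in photos:
    result = ollama.generate(
        model="gemma3:27b-it-qat",
        prompt="Describe this photo in one or two sentences.",
        images=[photo.read_bytes()],     # the image never leaves the machine
    )
    # store the caption next to the image for later search/indexing
    photo.with_suffix(".txt").write_text(result["response"])
```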
lodovic
This is a really cool idea. Do you pretrain the model so it can tag people? I have so many photos that it seems impossible to ever categorize them; using a workflow like yours might help a lot.
starik36
I was thinking of doing the same, but I would like to include people's names in the description. For example, "Jennifer looking out in the desert sky."
As it stands, Gemma will just say "Woman looking out in the desert sky."
trees101
Not sure how accurate my stats are. I used ollama with the --verbose flag. Using a 4090 and all default settings, I get around 40 TPS for the Gemma 3 27B model:
`ollama run gemma3:27b --verbose` gives me 42.5 TPS +-0.3TPS
`ollama run gemma3:27b-it-qat --verbose` gives me 41.5 TPS +-0.3TPS
Strange results; the full model gives me slightly more TPS.
orangecat
ollama's `gemma3:27b` is also 4-bit quantized, you need `27b-it-q8_0` for 8 bit or `27b-it-fp16` for FP16. See https://ollama.com/library/gemma3/tags.
k__
The local LLM is your project manager, the big remote ones are the engineers and designers :D
jonaustin
On a M4 Max 128GB via LM Studio:
query: "make me a snake game in python with pygame"
(mlx 4 bit quant) mlx-community/gemma-3-27b-it-qat@4bit: 26.39 tok/sec • 1681 tokens 0.63s to first token
(gguf 4 bit quant) lmstudio-community/gemma-3-27b-it-qat: 22.72 tok/sec • 1866 tokens 0.49s to first token
using Unsloth's settings: https://docs.unsloth.ai/basics/tutorial-how-to-run-and-fine-...
yencabulator
I genuinely would have expected a $3,500+ setup to do better than just 10x pure-CPU on an AMD Ryzen 9 8945HS.
starik36
On an A5000 with 24GB, this model typically gets between 20 to 25 tps.
pantulis
> Can you quote tps?
LM Studio running on a Mac Studio M4 Max with 128GB, gemma-3-27B-it-QAT-Q4_0.gguf with a 4096-token context: I get 8.89 tps.
kristianp
Is QAT a different quantisation format to Q4_0? Can you try "gemma-3-27b-it-qat" for a model: https://lmstudio.ai/model/gemma-3-27b-it-qat
pantulis
Gah, turns out I was running the Mac in low power mode!
I get 24tps in LM Studio now with gemma-3-27b-it-qat.
jychang
That's pretty terrible. I'm getting 18tok/sec Gemma 3 27b QAT on a M1 Max 32gb macbook.
bobjordan
Thanks for the call out on this model! I have 42GB of usable VRAM on my ancient (~10-year-old) quad-SLI Titan X workstation and have been looking for a model that balances a large context window with output quality. I'm able to run this model with a 56K context window and it just fits into my 42GB of VRAM, running 100% on GPU. The output quality is really good and a 56K context window is very usable. Nice find!
paprots
The original gemma3:27b also took only 22GB using Ollama on my 64GB MacBook. I'm quite confused that the QAT took the same. Do you know why? Which model is better? `gemma3:27b`, or `gemma3:27b-qat`?
zorgmonkey
Both versions are quantized and should use the same amount of RAM; the difference with QAT is that the quantization happens during training, and it should result in slightly better output (closer to the bf16 weights).
kgwgk
Look up 27b in https://ollama.com/library/gemma3/tags
You'll find the id a418f5838eaf which also corresponds to 27b-it-q4_K_M
carbocation
Just following this comment up as a note-to-self: as kgwgk noted, the default gemma3:27B model has ID a418f5838eaf, which corresponds to 27b-it-q4_K_M. But the new gemma3:27B quantization-aware training (QAT) model being discussed is gemma3:27b-it-qat with ID 29eb0b9aeda3.
superkuh
Quantization-aware training just means exposing the model to quantized values during training so that it handles the quantization better when it is actually quantized after training. It doesn't change the model size itself.
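(To make that concrete, here is a toy illustration of the "fake quantization" idea behind QAT, my own sketch rather than Google's training code: the forward pass sees weights snapped to a 4-bit grid, while gradients flow to the full-precision weights through a straight-through estimator.)

```python
# Toy sketch of quantization-aware training (illustrative, not Google's recipe):
# the forward pass uses 4-bit-rounded weights, the backward pass treats the
# rounding as identity, so the model learns weights that survive quantization.
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for signed 4-bit
    scale = w.abs().max() / qmax                # crude per-tensor scale
    dq = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (dq - w).detach()                # forward: dq, backward: identity

w = torch.randn(16, 16, requires_grad=True)
loss = (fake_quantize(w) @ torch.randn(16, 4)).sum()
loss.backward()                                 # gradients still reach the fp weights
print(w.grad.shape)                             # torch.Size([16, 16])
```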
nolist_policy
I suspect your "original gemma3:27b" was a quantized model since the non-quantized (16bit) version needs around 54gb.
prvc
> ~15GB (MLX) leaving plenty of memory for running other apps.
Is that small enough to run well (without thrashing) on a system with only 16GiB RAM?
tomrod
Simon, what is your local GPU setup? (No doubt you've covered this, but I'm not sure where to dig it up.)
simonw
MacBook Pro M2 with 64GB of RAM. That's why I tend to be limited to Ollama and MLX - stuff that requires NVIDIA doesn't work for me locally.
jychang
MLX is slower than GGUFs on Macs.
On my M1 Max macbook pro, the GGUF version bartowski/google_gemma-3-27b-it-qat-GGUF is 15.6gb and runs at 17tok/sec, whereas mlx-community/gemma-3-27b-it-qat-4bit is 16.8gb and runs at 15tok/sec. Note that both of these are the new QAT 4bit quants.
Elucalidavah
> MacBook Pro M2 with 64GB of RAM
Are there non-mac options with similar capabilities?
nico
Been super impressed with local models on Mac. Love that the Gemma models have a 128k-token input context size. However, outputs are usually pretty short.
Any tips on generating long output? Like multiple pages of a document, a story, a play or even a book?
simonw
The tool you are using may set a default max output size without you realizing. Ollama has a num_ctx option that defaults to 2048, for example: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-c...
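(If Ollama's defaults are the culprit, they can be overridden per request. A rough sketch against the local HTTP API: the option names come from the Ollama docs, the values are just examples, and gemma3:27b-it-qat is assumed to be pulled already.)

```python
# Rough sketch of overriding Ollama's per-request defaults.
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:27b-it-qat",
        "prompt": "Write a long, multi-chapter story about a lighthouse keeper.",
        "stream": False,
        "options": {
            "num_ctx": 16384,      # context window; Ollama's default is 2048
            "num_predict": 8192,   # cap on generated tokens (-1 = no cap)
        },
    },
    timeout=600,
)
print(r.json()["response"])
```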
nico
Been playing with that, but doesn’t seem to have much effect. It works very well to limit output to smaller bits, like setting it to 100-200. But above 2-4k the output seems to never get longer than about 1 page
Might try using the models with mlx instead of ollama to see if that makes a difference
Any tips on prompting to get longer outputs?
Also, does the model context size determine max output size? Are the two related or are they independent characteristics of the model?
tootie
I'm using 12b and getting seriously verbose answers. It's squeezed into 8GB and takes its sweet time but answers are really solid.
Casteil
This is basically the opposite of what I've experienced - at least compared to another recent entry like IBM's Granite 3.3.
By comparison, Gemma3's output (both 12b and 27b) seems to typically be more long/verbose, but not problematically so.
nico
I agree with you. The outputs are usually good, it’s just that for the use case I have now (writing several pages of long dialogs), the output is not as long as I’d want it, and definitely not as long as it’s supposedly capable of doing
littlestymaar
> and it only uses ~22Gb (via Ollama) or ~15GB (MLX)
Why is the memory use different? Are you using different context size in both set-ups?
simonw
No idea. MLX is its own thing, optimized for Apple Silicon. Ollama uses GGUFs.
https://ollama.com/library/gemma3:27b-it-qat says it's Q4_0. https://huggingface.co/mlx-community/gemma-3-27b-it-qat-4bit says it's 4bit. I think those are the same quantization?
jychang
Those are the same quant, but this is a good example of why you shouldn't use ollama. Either directly use llama.cpp, or use something like LM Studio if you want something with a GUI/easier user experience.
The Gemma 3 27B QAT GGUF should be taking up ~15GB, not 22GB.
Patrick_Devine
The vision tower is 7GB, so I was wondering if you were loading it without vision?
Samin100
I have a few private “vibe check” questions and the 4 bit QAT 27B model got them all correctly. I’m kind of shocked at the information density locked in just 13 GB of weights. If anyone at Deepmind is reading this — Gemma 3 27B is the single most impressive open source model I have ever used. Well done!
itake
I tried to use the -it models for translation, but it completely failed at translating adult content.
I think this means I either have to train the -pt model with my own instruction tuning or use another provider :(
jychang
Try mradermacher/amoral-gemma3-27B-v2-qat-GGUF
itake
My current architecture is an on-device model for fast translation, which then gets replaced with a slower translation (via an API call) when it's ready.
A 24B model would be too big to run on-device, and I'm trying to keep my cloud costs low (meaning I can't afford to host a small 27b 24/7).
andhuman
Have you tried Mistral Small 24b?
itake
My current architecture is an on-device model for fast translation, which then gets replaced with a slower translation (via an API call) when it's ready.
A 24B model would be too big to run on-device, and I'm trying to keep my cloud costs low (meaning I can't afford to host a small 24b 24/7).
diggan
The first graph compares the "Elo score" of various models at "native" BF16 precision; the second compares VRAM usage between native BF16 precision and their QAT models. But since this method is about quantizing while maintaining quality, isn't the obvious graph - comparing the quality of BF16 vs QAT - missing? The text doesn't seem to talk about it either, yet it's basically the topic of the blog post.
croemer
Indeed, the one thing I was looking for was Elo/performance of the quantized models, not how good the base model is. Showing how much memory is saved by quantization in a figure is a bit of an insult to the intelligence of the reader.
nithril
In addition, the "Massive VRAM Savings" graph states what looks like a tautology: reducing from 16 bits to 4 bits unsurprisingly leads to a 4x reduction in memory usage.
claiir
Yea they mention a “perplexity drop” relative to naive quantization, but that’s meaningless to me.
> We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0.
Wish they showed benchmarks / added quantized versions to the arena! :>
mark_l_watson
Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32G memory Mac.
gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also.
I still like Gemini 2.5 Pro and o3 for brainstorming or working on difficult problems, but for routine work it (simply) makes me feel good to have everything open source/weights running on my own system.
When I bought my 32GB Mac a year ago, I didn't expect to be so happy running gemma3:27b-it-qat with open-codex locally.
nxobject
Fellow owner of a 32GB MBP here: how much memory does it use while resident - or, if swapping happens, do you see the effects in your day-to-day work? I’m in the awkward position of using a lot of bloated, virtualized Windows software (mostly SAS) on a daily basis.
mark_l_watson
I have the usual programs running on my Mac, along with open-codex: Emacs, web browser, terminals, VSCode, etc. Even with large contexts, open-codex with Ollama and Gemma 3 27B QAT does not seem to overload my system.
To be clear, I sometimes toggle open-codex to use the Gemini 2.5 Pro API also, but I enjoy running locally for simpler routine work.
pantulis
How did you manage to run open-codex against a local ollama? I keep getting 400 Errors no matter what I try with the --provider and --model options.
pantulis
Never mind, found your Leanpub book and followed the instructions and at least I have it running with qwen-2.5. I'll investigate what happens with Gemma.
Tsarp
What tps are you hitting? And did you have to change KV size?
mekpro
Gemma 3 is way, way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is its model size, which is too large (even though it can run fast with MoE); this greatly limits the applicable users to the small percentage of enthusiasts who have enough GPU VRAM. Meanwhile, Gemma 3 is widely usable across all hardware sizes.
trebligdivad
It seems pretty impressive - I'm running it on my CPU (16-core AMD 3950X) and it's very, very impressive at translation, and the image description is very impressive as well. I'm getting about 2.3 tokens/s on it (compared to under 1/s on the Calme-3.2 I was previously using). It does tend to be a bit chatty unless you tell it not to be; pretty much everything gets a 'breakdown' unless you tell it not to - so for translation my prompt is 'Translate the input to English, only output the translation' to stop it giving a breakdown of the input language.
simonw
What are you using to run it? I haven't got image input working yet myself.
trebligdivad
I'm using llama.cpp - built last night from head; to do image stuff you have to run a separate client they provide, with something like:
./build/bin/llama-gemma3-cli -m /discs/fast/ai/gemma-3-27b-it-q4_0.gguf --mmproj /discs/fast/ai/mmproj-model-f16-27B.gguf -p "Describe this image." --image ~/Downloads/surprise.png
Note the 2nd gguf in there - I'm not sure, but I think that's for encoding the image.
Havoc
The upcoming qwen3 series is supposed to be MoE...likely to give better tk/s on CPU
slekker
What's MoE?
Havoc
Mixture of experts, like the other guy said - everything gets loaded into memory, but not every byte is needed to generate a token (unlike classic LLMs like Gemma).
So for devices that have lots of memory but weaker processing power, it can get you similar output quality but faster. So it tends to do better on CPU- and APU-like setups.
zamalek
Mixture of Experts. Very broadly speaking, there are a bunch of mini networks (experts) which can be independently activated.
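(A toy sketch of that routing idea, purely illustrative and not any particular model's architecture: a small router scores the experts and only the top-k run for each token, even though all of them sit in memory.)

```python
# Toy mixture-of-experts layer: all experts are resident in memory, but only
# the top-k (here 2 of 8) are evaluated for each token.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                        # x: (tokens, dim)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):               # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(4, 64)).shape)       # torch.Size([4, 64])
```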
manjunaths
I am running this on a 16 GB AMD Radeon 7900 GRE in a 64 GB machine with ROCm and llama.cpp on Windows 11. I can use Open WebUI or the native GUI for the interface. It is made available via an internal IP to all members of my home.
It runs at around 26 tokens/sec and FP16, FP8 is not supported by the Radeon 7900 GRE.
I just love it.
For coding QwQ 32b is still king. But with a 16GB VRAM card it gives me ~3 tokens/sec, which is unusable.
I tried to make Gemma 3 write a powershell script with Terminal gui interface and it ran into dead-ends and finally gave up. QwQ 32B performed a lot better.
But for most general purposes it is great. My kid's been using it to feed his school textbooks and ask it questions. It is better than anything else currently.
Somehow it is more "uptight" than Llama or the Chinese models like Qwen. Can't put my finger on it; the Chinese models seem nicer and more talkative.
mdp2021
> My kid's been using it to feed his school textbooks and ask it questions
Which method are you employing to feed a textbook into the model?
behnamoh
This is what local LLMs need—being treated like first-class citizens by the companies that make them.
That said, the first graph is misleading about the number of H100s required to run DeepSeek r1 at FP16. The model is FP8.
mmoskal
Also, ~no one runs an H100 at home, i.e. at batch size 1. What matters is throughput. With 37B active parameters and a massive deployment, throughput (per GPU) should be similar to Gemma.
freeamz
So what is the real comparison against DeepSeek R1? It would be good to know which is actually more cost-efficient and open (reproducible build) to run locally.
behnamoh
Half that number of dots is what it takes. But also, why compare a 27B model with a 600B+ one? That doesn't make sense.
smallerize
It's an older image that they just reused for the blog post. It's on https://ai.google.dev/gemma for example
mythz
The speed gains are real: after downloading the latest QAT gemma3:27b, eval perf is now 1.47x faster on Ollama, up from 13.72 to 20.11 tok/s (on A4000s).
porphyra
It is funny that Microsoft had been peddling "AI PCs" and Apple had been peddling "made for Apple Intelligence" for a while now, when in fact usable models for consumer GPUs are only barely starting to be a thing on extremely high end GPUs like the 3090.
ivape
This is why the "AI hardware cycle is hype" crowd is so wrong. We're not even close; we're basically at the ColecoVision/Atari stage of hardware here. It's going to be quite a thing when everyone gets a SNES/Genesis.
icedrift
Capable local models have been usable on Macs for a while now thanks to their unified memory.
dragonwriter
AI PCs aren't about running the kind of models that take a 3090-class GPU, or even running on GPU at all, but systems where the local end is running something like Phi-3.5-vision-instruct, on system RAM using a CPU with an integrated NPU, which is why the AI PC requirements specify an NPU, a certain amount of processing capacity, and a minimum amount of DDR5/LPDDR5 system RAM.
NorwegianDude
A 3090 is not an extremely high-end GPU. It's a consumer GPU launched in 2020, and in both price and compute it's around a mid-range consumer GPU these days.
The high end consumer card from Nvidia is the RTX 5090, and the professional version of the card is the RTX PRO 6000.
dragonwriter
For model usability as a binary yes/no, pretty much the only dimension that matters is VRAM, and at 24GB the 3090 is still high end for a consumer NVidia GPU. Yes, the 5090 (and only the 5090) is above it, at 32GB, but 24GB is way ahead of the mid-range.
NorwegianDude
24 GB of VRAM is a large amount of VRAM on a consumer GPU, that I totally agree with you on. But it's definitely not an extremely high end GPU these days. It is suitable, yes, but not high end. The high end alternative for a consumer GPU would be the RTX 5090, but that is only available for €3000 now, while used 3090s are around €650.
zapnuk
A 3090 still costs 1800€. That's not mid-range by a long shot.
The 5070 or 5070ti are mid range. They cost 650/900€.
NorwegianDude
3090s are no longer produced, that's why new ones are so expensive. At least here, used 3090s are around €650, and a RTX 5070 is around €625.
It's definitely not extremely high end any more; the price (at least here) is the same as the new mid-range consumer cards.
I guess the price can vary by location, but €1800 for a 3090 is crazy, that's more than the new price in 2020.
sentimentscan
A year ago, I bought a brand-new EVGA hybrid-cooled 3090 Ti for 700 euros. I'm still astonished at how good of a decision it was, especially considering the scarcity of 24GB cards available for a similar price. For pure gaming, many cards perform better, but they mostly come with 12 to 16GB of VRAM.
emrah
Available on ollama: https://ollama.com/library/gemma3
jinay
Make sure you're using the "-it-qat" suffixed models like "gemma3:27b-it-qat"
ein0p
Thanks. I was wondering why my open-webui said that I already had the model. I bet a lot of people are making the same mistake I did and downloading just the old, post-quantized 27B.
Der_Einzige
How many times do I have to say this? Ollama, llamacpp, and many other projects are slower than vLLM/sglang. vLLM is a much superior inference engine and is fully supported by the only LLM frontends that matter (sillytavern).
The community getting obsessed with Ollama has done huge damage to the field, as it's inefficient compared to vLLM. Many people could get far more tok/s than they think if only they knew the right tools.
Zambyte
The significant convenience benefits outweigh the higher TPS that vLLM offers in the context of my single machine homelab GPU server. If I was hosting it for something more critical than just myself and a few friends chatting with it, sure. Being able to just paste a model name into Open WebUI and run it is important to me though.
It is important to know about both to decide between the two for your use case though.
Der_Einzige
Running any HF model on vllm is as simple as pasting a model name into one command in your terminal.
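(For reference, the offline Python API looks roughly like this; a sketch with a generic HF model id, and, as noted elsewhere in the thread, the Gemma 3 QAT GGUF builds specifically may not be supported yet.)

```python
# Minimal sketch of vLLM's offline inference API (generic example; whether a
# given Gemma 3 QAT build works depends on the vLLM version).
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-27b-it")         # any HF model id vLLM supports
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV-cache paging in two sentences."], params)
print(outputs[0].outputs[0].text)
```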
ach9l
Instead of ranting, maybe explain how to make a QAT Q4 work with images in vLLM; afaik it is not yet possible.
oezi
Why is sillytavern the only LLM frontend which matters?
GordonS
I tried sillytavern a few weeks ago... wow, that is an "interesting" UI! I blundered around for a while, couldn't figure out how to do anything useful... and then installed LM Studio instead.
Der_Einzige
It supports more sampler and other settings than anyone else.
simonw
Last I looked vLLM didn't work on a Mac.
mitjam
Afaik vLLM is for concurrent serving with batched inference for higher throughput, not single-user inference. I doubt inference throughput is higher than Ollama with a single prompt at a time. Update: this is a good intro to continuous batching in LLM inference: https://www.anyscale.com/blog/continuous-batching-llm-infere...
prometheon1
From the HN guidelines: https://news.ycombinator.com/newsguidelines.html
> Be kind. Don't be snarky.
> Please don't post shallow dismissals, especially of other people's work.
In my opinion, your comment is not in line with the guidelines. Especially the part about sillytavern being the only LLM frontend that matters. Telling the devs of any LLM frontend except sillytavern that their app doesn't matter seems exactly like a shallow dismissal of other people's work to me.
janderson215
I did not know this, so thank you. I read a blog post a while back that encouraged using Ollama and never mentioned vLLM. Do you recommend reading any particular resource?
oezi
Somebody in this thread mentioned 20.x tok/s on ollama. What are you seeing in vLLM?
Zambyte
FWIW I'm getting 29 TPS on Ollama on my 7900 XTX with the 27b qat. You can't really compare inference engine to inference engine without keeping the hardware and model fixed.
Unfortunately Ollama and vLLM are therefore incomparable at the moment, because vLLM does not support these models yet.
m00dy
Ollama is definitely not for production loads, but vLLM is.
technologesus
Just for fun I created a new personal benchmark for vision-enabled LLMs: playing minecraft. I used JSON structured output in LM Studio to create basic controls for the game. Unfortunately no matter how hard I proompted, gemma-3-27b QAT is not really able to understand simple minecraft scenarios. It would say things like "I'm now looking at a stone block. I need to break it" when it is looking out at the horizon in the desert.
Here is the JSON schema: https://pastebin.com/SiEJ6LEz System prompt: https://pastebin.com/R68QkfQu
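(As a rough illustration of the approach, here is a hypothetical action schema of the kind structured output can enforce; it is not the schema from the pastebin above, just a sketch of the idea.)

```python
# Hypothetical JSON schema (as a Python dict) for constrained game actions.
# Structured output forces the model to emit exactly this shape, which the
# game-control code can then parse and execute.
action_schema = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["move", "turn", "jump", "mine", "place"]},
        "direction": {"type": "string", "enum": ["forward", "back", "left", "right"]},
        "duration_ticks": {"type": "integer", "minimum": 1, "maximum": 100},
        "reasoning": {"type": "string"},
    },
    "required": ["action", "reasoning"],
}
```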
jvictor118
I've found the vision capabilities are very bad with spatial awareness/reasoning. They seem to know that certain things are in the image, but not where they are relative to each other, their relative sizes, etc.
simonw
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.
I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22Gb (via Ollama) or ~15GB (MLX) leaving plenty of memory for running other apps.
Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
Last night I had it write me a complete plugin for my LLM tool like this:
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/