Gemma 3 QAT Models: Bringing AI to Consumer GPUs
125 comments
April 20, 2025 · simonw
rs186
Can you quote tps?
More and more I start to realize that cost saving is a small problem for local LLMs. If it is too slow, it becomes unusable, so much that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.
With the OpenAI API/ChatGPT, I get responses much faster than I can read, and for simple questions, it means I just need to glance at the response, copy & paste and get things done. Whereas on a local LLM, I watch it painstakingly print preambles that I don't care about, and get what I actually need after 20 seconds (on a fast GPU).
And I am not yet talking about context window etc.
I have been researching how people integrate local LLMs into their workflows. My finding is that most people play with them for a short time and that's about it, and most people are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than getting a beefed up Mac Studio or building a machine with a 4090.
DJHenk
> More and more I start to realize that cost saving is a small problem for local LLMs. If it is too slow, it becomes unusable, so much that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.
There is another aspect to consider, aside from privacy.
These models are trained by downloading every scrap of information from the internet, including the works of many, many authors who have never consented to that. And they for sure are not going to get a share of the profits, if there are ever going to be any. If you use a cloud provider, you are basically saying that is all fine. You are happy to pay them, and make yourself dependent on their service, based on work that wasn't theirs to use.
However, if you use a local model, the authors still did not give consent, but one could argue that the company that made the model is at least giving back to the community. They don't get any money out of it, and you are not becoming dependent on their hyper capitalist service. No rent-seeking. The benefits of the work are free to use for everyone. This makes using AI a little more acceptable from a moral standpoint.
simonw
My tooling doesn't measure TPS yet. It feels snappy to me on MLX.
I agree that hosted models are usually a better option for most people - much faster, higher quality, handle longer inputs, really cheap.
I enjoy local models for research and for the occasional offline scenario.
I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.
freeamz
>I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.
I think it is NOT just you. Most companies with decent management would also not want their data going anywhere outside the physical servers they control. But yeah, most people should just use an app and a hosted server. But this is HN, there are people here hosting their own email servers, so it shouldn't be too hard to run an LLM locally.
overfeed
> Whereas on a local LLM, I watch it painstakingly print preambles that I don't care about, and get what I actually need after 20 seconds.
You may need to "right-size" the models you use to match your hardware, model, and TPS expectations, which may involve using a smaller version of the model with faster TPS, upgrading your hardware, or paying for hosted models.
Alternatively, if you can use agentic workflows or tools like Aider, you don't have to watch the model work slowly with large models locally. Instead you queue work for it, go to sleep, or eat, or do other work, and then much later look over the Pull Requests whenever it completes them.
otabdeveloper4
The only actually useful application of LLMs is processing large amounts of data for classification and/or summarizing purposes.
That's not the stuff you want to send to a public API, this is something you want as a 24/7 locally running batch job.
("AI assistant" is an evolutionary dead end, and Star Trek be damned.)
tomrod
Simon, what is your local GPU setup? (No doubt you've covered this, but I'm not sure where to dig up).
simonw
MacBook Pro M2 with 64GB of RAM. That's why I tend to be limited to Ollama and MLX - stuff that requires NVIDIA doesn't work for me locally.
Elucalidavah
> MacBook Pro M2 with 64GB of RAM
Are there non-mac options with similar capabilities?
nico
Been super impressed with local models on mac. Love that the gemma models have 128k token context input size. However, outputs are usually pretty short
Any tips on generating long output? Like multiple pages of a document, a story, a play or even a book?
simonw
The tool you are using may set a default max output size without you realizing. Ollama has a num_ctx that defaults to 2048 for example: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-c...
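To make the point about hidden defaults concrete, here is a minimal sketch of a request to Ollama's `/api/generate` endpoint that sets both `num_ctx` (the context window, shared by prompt and output) and `num_predict` (the cap on generated tokens). The values 8192 and 4096 are illustrative assumptions, not recommendations, and the server address assumes a default local install.

```python
import json
import urllib.request

# Assumption: a local Ollama server listening on its default address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="gemma3:27b-it-qat", num_ctx=8192, num_predict=4096):
    """Build a request body that raises both the context window and the output cap."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": num_ctx,          # context window: prompt and output share it
            "num_predict": num_predict,  # maximum number of tokens to generate
        },
    }


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    body = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The two settings are independent knobs, but the output can never exceed what is left of the context window after the prompt, so raising `num_predict` alone may not help if `num_ctx` is still at its default.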
nico
Been playing with that, but doesn’t seem to have much effect. It works very well to limit output to smaller bits, like setting it to 100-200. But above 2-4k the output seems to never get longer than about 1 page
Might try using the models with mlx instead of ollama to see if that makes a difference
Any tips on prompting to get longer outputs?
Also, does the model context size determine max output size? Are the two related or are they independent characteristics of the model?
Casteil
This is basically the opposite of what I've experienced - at least compared to another recent entry like IBM's Granite 3.3.
By comparison, Gemma3's output (both 12b and 27b) seems to typically be longer and more verbose, but not problematically so.
nico
I agree with you. The outputs are usually good, it’s just that for the use case I have now (writing several pages of long dialogs), the output is not as long as I’d want it, and definitely not as long as it’s supposedly capable of doing
littlestymaar
> and it only uses ~22GB (via Ollama) or ~15GB (MLX)
Why is the memory use different? Are you using different context size in both set-ups?
simonw
No idea. MLX is its own thing, optimized for Apple Silicon. Ollama uses GGUFs.
https://ollama.com/library/gemma3:27b-it-qat says it's Q4_0. https://huggingface.co/mlx-community/gemma-3-27b-it-qat-4bit says it's 4bit. I think those are the same quantization?
diggan
First graph is a comparison of the "Elo Score" while using "native" BF16 precision in various models, second graph is comparing VRAM usage between native BF16 precision and their QAT models, but since this method is about doing quantization while also maintaining quality, isn't the obvious graph of comparing the quality between BF16 and QAT missing? The text doesn't seem to talk about it either, yet it's basically the topic of the blog post.
claiir
Yea they mention a “perplexity drop” relative to naive quantization, but that’s meaningless to me. > We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0.
Wish they showed benchmarks / added quantized versions to the arena! :>
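For readers who find the perplexity claim opaque, here is a small sketch of one way to read it. Perplexity is the exponential of the mean negative log-likelihood per token, and a "54% reduction" would mean that QAT closes 54% of the gap between the unquantized model and naive Q4_0. The three perplexity values below are hypothetical and chosen only to illustrate the arithmetic; the blog post does not publish them.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def gap_reduction(ppl_ref, ppl_naive, ppl_qat):
    """Fraction of the naive-quantization perplexity gap that QAT closes."""
    naive_gap = ppl_naive - ppl_ref
    qat_gap = ppl_qat - ppl_ref
    return 1.0 - qat_gap / naive_gap


# Hypothetical numbers: reference 8.00, naive Q4_0 8.50, QAT Q4_0 8.23.
print(round(gap_reduction(ppl_ref=8.00, ppl_naive=8.50, ppl_qat=8.23), 2))
```

Under this reading the number says how much of the quantization damage is avoided. It says nothing about how large that damage was to begin with, which is why a direct quality comparison would still be useful.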
croemer
Indeed, the one thing I was looking for was Elo/performance of the quantized models, not how good the base model is. Showing how much memory is saved by quantization in a figure is a bit of an insult to the intelligence of the reader.
nithril
In addition, the "Massive VRAM Savings" graph states what looks like a tautology: reducing from 16 bits to 4 bits unsurprisingly leads to a 4x reduction in memory usage.
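The arithmetic behind that observation fits in a few lines. This is a back-of-the-envelope sketch that assumes exactly 27 billion parameters and counts only the weights; real files also store quantization scales and may keep some tensors at higher precision, so they come out somewhat larger.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 27e9  # assumption: a round 27 billion parameters

bf16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)

print(bf16_gb)            # 54.0
print(int4_gb)            # 13.5
print(bf16_gb / int4_gb)  # 4.0
```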
porphyra
It is funny that Microsoft had been peddling "AI PCs" and Apple had been peddling "made for Apple Intelligence" for a while now, when in fact usable models for consumer GPUs are only barely starting to be a thing on extremely high end GPUs like the 3090.
trebligdivad
It seems pretty impressive - I'm running it on my CPU (16 core AMD 3950x) and it's very very impressive at translation, and the image description is very impressive as well. I'm getting about 2.3 tokens/s on it (compared to under 1/s on the Calme-3.2 I was previously using). It does tend to be a bit chatty unless you tell it not to be; for pretty much everything it'll give you a 'breakdown' unless you tell it not to - so for translation my prompt is 'Translate the input to English, only output the translation' to stop it giving a breakdown of the input language.
simonw
What are you using to run it? I haven't got image input working yet myself.
trebligdivad
I'm using llama.cpp - built last night from head; to do image stuff you have to run a separate client they provide, with something like:
./build/bin/llama-gemma3-cli -m /discs/fast/ai/gemma-3-27b-it-q4_0.gguf --mmproj /discs/fast/ai/mmproj-model-f16-27B.gguf -p "Describe this image." --image ~/Downloads/surprise.png
Note the 2nd gguf in there - I'm not sure, but I think that's for encoding the image.
terhechte
Image input has been working with LM Studio for quite some time
mythz
The speed gains are real, after downloading latest QAT gemma3:27b eval perf is now 1.47x faster on ollama, up from 13.72 to 20.11 tok/s (on A4000's).
behnamoh
This is what local LLMs need—being treated like first-class citizens by the companies that make them.
That said, the first graph is misleading about the number of H100s required to run DeepSeek r1 at FP16. The model is FP8.
mekpro
Gemma 3 is way way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is its model size that is too large (even though it can run fast with MoE), which greatly limits the applicable users to a small percentage of enthusiasts who have enough GPU VRAM. Meanwhile, Gemma 3 is widely usable across all hardware sizes.
emrah
Available on ollama: https://ollama.com/library/gemma3
jinay
Make sure you're using the "-it-qat" suffixed models like "gemma3:27b-it-qat"
Der_Einzige
How many times do I have to say this? Ollama, llamacpp, and many other projects are slower than vLLM/sglang. vLLM is a much superior inference engine and is fully supported by the only LLM frontends that matter (sillytavern).
The community getting obsessed with Ollama has done huge damage to the field, as it's inefficient compared to vLLM. Many people can get far more tok/s than they think they could if only they knew the right tools.
Zambyte
The significant convenience benefits outweigh the higher TPS that vLLM offers in the context of my single machine homelab GPU server. If I was hosting it for something more critical than just myself and a few friends chatting with it, sure. Being able to just paste a model name into Open WebUI and run it is important to me though.
It is important to know about both to decide between the two for your use case though.
Der_Einzige
Running any HF model on vllm is as simple as pasting a model name into one command in your terminal.
ach9l
instead of ranting, maybe explain how to make a qat q4 work with images in vllm, afaik it is not yet possible
simonw
Last I looked vLLM didn't work on a Mac.
mitjam
Afaik vllm is for concurrent serving with batched inference for higher throughput, not single-user inference. I doubt inference throughput is higher with single prompts at a time than Ollama. Update: this is a good Intro to continuous batching in llm inference: https://www.anyscale.com/blog/continuous-batching-llm-infere...
oezi
Why is sillytavern the only LLM frontend which matters?
GordonS
I tried sillytavern a few weeks ago... wow, that is an "interesting" UI! I blundered around for a while, couldn't figure out how to do anything useful... and then installed LM Studio instead.
Der_Einzige
It supports more sampler and other settings than anyone else.
janderson215
I did not know this, so thank you. I read a blog post a while back that encouraged using Ollama and never mentioned vLLM. Do you recommend reading any particular resource?
oezi
Somebody in this thread mentioned 20.x tok/s on ollama. What are you seeing in vLLM?
Zambyte
FWIW I'm getting 29 TPS on Ollama on my 7900 XTX with the 27b qat. You can't really compare inference engine to inference engine without keeping the hardware and model fixed.
Unfortunately Ollama and vLLM are therefore incomparable at the moment, because vLLM does not support these models yet.
m00dy
Ollama is definitely not for production loads but vLLM is.
miki123211
What would be the best way to deploy this if you're maximizing for GPU utilization in a multi-user (API) scenario? Structured output support would be a big plus.
We're working with a GPU-poor organization with very strict data residency requirements, and these models might be exactly what we need.
I would normally say VLLM, but the blog post notably does not mention VLLM support.
999900000999
Assuming this can match Claude's latest, and with full-time usage (as in you have a system that's constantly running code without any user input), you'd probably save 600 to 700 a month. A 4090 is only 2K and you'll see an ROI within 90 days.
I can imagine this will serve to drive prices for hosted llms lower.
At this level any company that produces even a nominal amount of code should be running LLMs on prem (AWS if you're on the cloud).
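The break-even estimate above can be checked with a quick sketch. It uses only the figures from the comment (a 2K card, 600 to 700 saved per month) and assumes a 30-day month; electricity and the rest of the machine are deliberately ignored, so treat the result as a lower bound.

```python
def breakeven_days(hardware_cost, monthly_savings, days_per_month=30):
    """Days until cumulative savings cover the up-front hardware cost."""
    return hardware_cost / monthly_savings * days_per_month


for savings in (600, 700):
    print(savings, round(breakeven_days(2000, savings)))
```

With these inputs the payback lands at roughly 100 days for 600 a month and roughly 86 days for 700 a month, so "within 90 days" only holds at the upper end of the savings range.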
umajho
I am currently using the Q4_K_M quantized version of gemma-3-27b-it locally. I previously assumed that a 27B model with image input support wouldn't be very high quality, but after actually using it, the generated responses feel better than those from my previously used DeepSeek-R1-Distill-Qwen-32B (Q4_K_M), and its recognition of images is also stronger than I expected. (I thought the model could only roughly understand the concepts in the image, but I didn't expect it to be able to recognize text within the image.)
Since this article publishes the optimized Q4 quantized version, it would be great if it included more comparisons between the new version and my currently used unoptimized Q4 version (such as benchmark scores).
(I deliberately wrote this reply in Chinese and had gemma-3-27b-it Q4_K_M translate it into English.)
holografix
Could 16gb vram be enough for the 27b QAT version?
jffry
With `ollama run gemma3:27b-it-qat "What is blue"`, GPU memory usage is just a hair over 20GB, so no, probably not without a nerfed context window
woadwarrior01
Indeed, the default context length in ollama is a mere 2048 tokens.
hskalin
With ollama you could offload a few layers to cpu if they don't fit in the VRAM. This will cost some performance of course but it's much better than the alternative (everything on cpu)
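A rough way to reason about partial offload is to ask how many layers fit on the card. The sketch below is a simplification under stated assumptions: all layers are treated as equally sized, the 60-layer count is hypothetical rather than Gemma's actual config, the 15.6GB file size is the one quoted elsewhere in this thread, and `reserved_gb` is a guessed allowance for context and runtime overhead. Ollama normally works this split out on its own; the result is only the kind of number you might pass to its `num_gpu` option if you wanted to override it.

```python
def layers_on_gpu(model_gb, n_layers, vram_gb, reserved_gb=2.0):
    """Estimate how many equally sized layers fit in VRAM after reserving headroom."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical 60-layer model stored in a 15.6 GB file.
print(layers_on_gpu(15.6, 60, vram_gb=12.0))  # partial offload on a 12 GB card
print(layers_on_gpu(15.6, 60, vram_gb=24.0))  # everything fits on a 24 GB card
```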
senko
I'm doing that with a 12GB card, ollama supports it out of the box.
For some reason, it only uses around 7GB of VRAM, probably due to how the layers are scheduled, maybe I could tweak something there, but didn't bother just for testing.
Obviously, perf depends on CPU, GPU and RAM, but on my machine (3060 + i5-13500) it's around 2 t/s.
halflings
That's what the chart says yes. 14.1GB VRAM usage for the 27B model.
erichocean
That's the VRAM required just to load the model weights.
To actually use a model, you need a context window. Realistically, you'll want a 20GB GPU or larger, depending on how many tokens you need.
oezi
I didn't realize that the context would require so much memory. Is this the KV cache? It would seem like a big advantage if this memory requirement could be reduced.
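The KV cache is indeed the usual reason: for every token in the context, each layer stores one key and one value vector per KV head, so memory grows linearly with context length. The sketch below shows that calculation for plain full attention with 16-bit cache entries. The layer count, head count, and head dimension are hypothetical placeholders, not Gemma 3's published config, and techniques such as sliding-window layers or a quantized cache would lower the real figure.

```python
def kv_cache_gb(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in GB for full attention: keys and values for every layer."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * n_tokens / 1e9


# Hypothetical model shape, for illustration only.
SHAPE = dict(n_layers=60, n_kv_heads=16, head_dim=128)

for n_tokens in (2048, 8192, 32768):
    print(n_tokens, round(kv_cache_gb(n_tokens, **SHAPE), 2))
```

With this made-up shape the cache costs about half a megabyte per token, which is roughly 1GB at 2048 tokens and several GB at longer contexts, on top of the weights.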
parched99
I am only able to get the Gemma-3-27b-it-qat-Q4_0.gguf (15.6GB) to run with a 100 token context size on a 5070 ti (16GB) using llamacpp.
Prompt Tokens: 10
Time: 229.089 ms
Speed: 43.7 t/s
Generation Tokens: 41
Time: 959.412 ms
Speed: 42.7 t/s
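The speeds in that log follow directly from the token counts and timings. A minimal sketch of the calculation, using only the numbers reported above:

```python
def tokens_per_second(n_tokens, elapsed_ms):
    """Throughput in tokens per second from a token count and a time in ms."""
    return n_tokens / (elapsed_ms / 1000.0)


print(round(tokens_per_second(10, 229.089), 1))  # prompt processing: 43.7
print(round(tokens_per_second(41, 959.412), 1))  # generation: 42.7
```

Prompt processing and generation are worth reporting separately, as the log does, because they stress the hardware differently and can diverge a lot at longer prompts.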
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.
I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.
Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
Last night I had it write me a complete plugin for my LLM tool like this:
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/