
Gemma3 – The current strongest model that fits on a single GPU

archerx

I have tried a lot of local models. I have 656GB of them on my computer so I have experience with a diverse array of LLMs. Gemma has been nothing to write home about and has been disappointing every single time I have used it.

Models that are worth writing home about are:

EXAONE-3.5-7.8B-Instruct - It was excellent at taking podcast transcriptions and generating show notes and summaries.

Rocinante-12B-v2i - Fun for stories and D&D

Qwen2.5-Coder-14B-Instruct - Good for simple coding tasks

OpenThinker-7B - Good and fast reasoning

The DeepSeek distills - Able to handle more complex tasks while still being fast

DeepHermes-3-Llama-3-8B - A really good vLLM

Medical-Llama3-v2 - Very interesting but be careful

Plus more but not Gemma.

anon373839

From the limited testing I've done, Gemma 3 27B appears to be an incredibly strong model. But I'm not seeing the same performance in Ollama as I'm seeing on aistudio.google.com. So, I'd recommend trying it from the source before you draw any conclusions.

One of the downsides of open models is that there are a gazillion little parameters at inference time (sampling strategy, prompt template, etc.) that can easily impair a model's performance. It takes some time for the community to iron out the wrinkles.

moffkalast

At the end of the day it doesn't matter how good it is: it has no system prompt, which means no steerability; sliding-window attention, which makes inference incredibly slow compared to similarly sized models because it's too niche and most inference systems have high-overhead implementations of it; and Google's psychotic instruct tuning, which made Gemma 2 an inconsistent and unreliable glass cannon.

I mean hell, even Mistral added system prompts in their last release, Google are the only ones that don't seem to bother with it by now.

hnfong

If you actually look at gemma-3 you'll see that it does support system prompts.

I’ve never seen a case where putting the system prompt in the user prompt would lead to significantly different outcomes though. Would like to see some examples.

(edit: my bad. i stand corrected. it seems the code just prepends the system prompts to the first user prompt.)
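
Roughly, that prepending looks like this (my own sketch of the idea, assuming Gemma's <start_of_turn> turn format; not the actual template code):

    # Sketch: with no dedicated system role, the system prompt is simply folded
    # into the first user turn before the usual Gemma-style turn markers.
    def build_prompt(system: str, user: str) -> str:
        merged = f"{system}\n\n{user}" if system else user
        return (
            "<start_of_turn>user\n"
            f"{merged}<end_of_turn>\n"
            "<start_of_turn>model\n"
        )

    print(build_prompt("You are a terse assistant.", "Summarize this transcript."))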

sieve

The Gemma 2 Instruct models (9B & 27B) are quite good for writing. The 27B is good at following instructions. I also like DeepSeek R1 Distill Llama 70B.

The Gemma 3 Instruct 4B model that was released today matches the output of the larger models for some of the stuff I am trying.

Recently, I compared 13 different online and local LLMs in a test where they tried to recreate Saki's "The Open Window" from a prompt.[1] Claude wins hands down IMO, but the other models are not bad.

[1] Variations on a Theme of Saki (https://gist.github.com/s-i-e-v-e/b4d696bfb08488aeb893cce3a4...)


mythz

Concur with Gemma 2 being underwhelming; I dismissed it pretty quickly, but gemma3:27b is looking pretty good atm.

BTW mistral-small:24b is also worth mentioning (IMO best local model) and phi4:14b is also pretty strong for its size.

mistral-small was my previous local go-to model; testing now to see if gemma3 can replace it.

InsideOutSanta

One more vote for Mistral for local models. The 7B model is extremely fast and still good enough for many prompts.

zacksiri

You should try Mistral Small 24b. It's been my daily companion for a while and has continued to impress me. I've heard good things about QwQ 32b, which just came out, too.

jrm4

Nice, I think you're nailing the important thing -- which is "what exactly are they good FOR?"

I see a lot of talk about good and not good here, but (and a question for everyone) what are people using the non-local big boys for that the locals CAN'T do? I mean, IRL tasks?

blooalien

I have had nothing but good results using the Qwen2.5 and Hermes3 models. The response times and token generation speeds have been pretty fantastic compared against other models I've tried, too.

usef-

To clarify, are you basing this comment on experience with previous Gemma releases, or the one from today?

mupuff1234

Ok, but have you tried Gemma3?

danielhanchen

I wrote a mini guide on running Gemma 3 at https://docs.unsloth.ai/basics/tutorial-how-to-run-gemma-3-e...!

The recommended settings according to the Gemma team are:

temperature = 0.95

top_p = 0.95

top_k = 64

Also beware of double BOS tokens! You can run my uploaded GGUFs with the recommended chat template and settings via ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M
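
If you want to set those sampler values explicitly rather than rely on defaults, one way (a sketch, assuming a local Ollama server on the default port and the model pulled as above) is to pass them as options through the API:

    # Sketch: apply the recommended Gemma 3 sampler settings via Ollama's HTTP API.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M",
            "prompt": "Write a haiku about GPUs.",
            "stream": False,
            "options": {"temperature": 0.95, "top_p": 0.95, "top_k": 64},
        },
    )
    print(resp.json()["response"])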

vessenes

Daniel, as always, thanks for these. I had good results with your Q4_K_M quant on mac / llama.cpp. However, on Linux/A100/ollama, there is something very wrong with your Q8_0 quant: Python code has indentation errors, missing close parens, quite a lot that's bad. I ran both with your suggested command lines, but of course could have been some mistake I made. I'm testing the bf16 on the A100 now to make sure it's not a hardware issue, but my gut is there's a model or ollama sampling problem here.

EDIT: 27b size

tarruda

Thanks for this, but I'm still unable to reproduce the results from Google AI studio.

I tried your version and when I ask it to create a tetris game in python, the resulting file has syntax errors. I see strange things like a space in the middle of a variable name/reference or weird spacing in the code output.

ac29

Some models are more sensitive to quantization than others; presumably AI Studio is running the full 16-bit model.

Try maybe the 8bit quant if you have the hardware for it? ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q8_0

tarruda

I tested the full fp16 gguf

svachalek

This seems worse than the official Ollama build. First question I tried:

>>> who is president

The বর্তমানpresident of the United States is Джо Байден (JoeBiden).

swores

See the other HN submission (for the Gemma3 technical report doc) for a more active discussion thread - 50 comments at time of writing this.

https://news.ycombinator.com/item?id=43340491

iamgopal

Small models should be trained on specific problems in specific languages, and should be built one upon another, the way containers work. I see a future where a factory or home has a local AI server with many highly specific models, continuously trained by a super-large LLM on the web and connected via a network to all instruments and computers to basically control the whole factory. I also see a future where all machinery comes with an AI-readable language for its own functioning. An HTTP-like AI protocol for two-way communication between a machine and an AI. Lots of possibilities.

antirez

After reading the technical report, make the effort to download the model and run it against a few prompts. In 5 minutes you'll understand how broken LLM benchmarking is.

archerx

That's why I like giving it a real-world test. For example, take a podcast transcription and ask it to make show notes and a summary. With a temperature of 0, different models will tackle the problem in different ways, and you can infer whether they really understood the transcript. Usually the transcripts I give it come from about an hour of audio of two or more people talking.
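
For anyone who wants to run the same kind of test, a minimal version looks roughly like this (my own sketch, assuming an Ollama server and a plain-text transcript; the model tag and file name are just placeholders):

    # Sketch: feed a podcast transcript to a local model at temperature 0
    # and ask for show notes plus a summary.
    import requests

    with open("podcast_transcript.txt", encoding="utf-8") as f:
        transcript = f.read()

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "gemma3:27b",
            "messages": [
                {"role": "user",
                 "content": "Write show notes and a short summary for this transcript:\n\n" + transcript}
            ],
            "stream": False,
            "options": {"temperature": 0},
        },
    )
    print(resp.json()["message"]["content"])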

antirez

Good test. I'm slowly accumulating private tests that I use to rate LLMs, and this one was missing... Thanks.

amelius

Aren't there any "blind" benchmarks?

nathanasmith

Unfortunately, that wouldn't help as much as you'd think, since talented AI labs can just watch the public leaderboard, note which models move up and down, and deduce and target whatever the hidden benchmark is testing.

nickthegreek

OpenRouter Arena Ratings are probably the closest thing.

toinewx

can you expand a bit?

antirez

The model performs very poorly in practice, while in the benchmarks it is shown to be at DeepSeek V3 level. It's not terrible, but it's on a different (lower) level than the models the benchmarks place it right next to (a bit better / a bit worse).

anon373839

I’d recommend trying it on Google AI Studio (aistudio.google.com). I am getting exceptional results on a handful of novel problems that require deep domain knowledge and structured reasoning. I’m not able to replicate this performance with Ollama, so I suspect something is a bit off.

alekandreev

Hey, Gemma engineer here. Can you please share reports on the type of prompts and the implementation you used?

kamranjon

I really respect the work that you've done, but I am always very surprised when people just speak anecdotally as though it is truth with regard to AI models these days. It's as if everyone believes they are an expert now, but has nothing of substance to offer beyond their gut feelings.

It's as if people don't realize that these models are used for many different purposes, and that subjectively one person could think a model is amazing while another thinks it's awful. I just wish we could at least back up statements like "The model performs very poorly in practice" with actual data, or at least some explanation of how it performed poorly.

tarruda

In my experience, Gemma models were always bad at coding (but good at other tasks).

bearjaws

Prompt adherence is pretty bad from what I can tell.

smcleod

No mention of how well it's claimed to perform with tool calling?

The Gemma series of models has historically been pretty poor when it comes to coding and tool calling - two things that are very important to agentic systems, so it will be interesting to see how 3 does in this regard.

PKop

I wasn't able to get function calls to work for Gemma3 in ollama, nor were others[0]. What is another way to run these models locally?

[0] https://github.com/ollama/ollama/issues/9680

[1] https://github.com/ollama/ollama/issues/9680#issuecomment-27...

mythz

Not sure if anyone else experiences this, but ollama downloads start off strong and then the last few MBs take forever.

Finally just finished downloading (gemma3:27b). Requires the latest version of Ollama to use, but now working, getting about 21 tok/s on my local 2x A4000.

From my few test prompts it looks like a quality model; going to run more tests to compare it against mistral-small:24b and see if it's going to become my new local model.

Patrick_Devine

There are some fixes coming to uniformly speed up pulls. We've been testing that out but there are a lot of moving pieces with the new engine so it's not here quite yet.

dizhn

It might not be downloading, but converting the model. Or, if it's already downloading a properly formatted model file, deduping it on disk, which I hear it does. This also makes its on-disk model files useless for other frontends.

squeakywhite

I experienced this just now. The download slowed down to approx 500kB/s for the last 1% or so. When this happens, you can Ctrl+C to cancel and then start the download again. It will continue from where it left off, but at the regular (fast) download speed.

elif

Good job Google. It is kinda hilarious that 'open'AI seems to be the big player least likely to release any of their models.

amelius

lyingAI

wtcactus

The claim of “strongest” (what does that even mean?) seems moot. I don’t think a multimodal model is the way to go to use on single, home, GPUs.

I would much rather have specific tailored models to use in different scenarios, that could be loaded into the GPU when needed. It’s a waste of parameters to have half of the VRAM loaded with parts of the model targeting image generation when all I want to do is write code.

JKCalhoun

That's interesting. Are they often an amalgam of image & text tokens? Because, yeah, image generation is not interesting to me at all.

amelius

Perhaps the model performs better (has higher intelligence) if it was trained on a more diverse set of topics (?)

singularity2001

How does it compare to OlympicCoder 7B [0], which allegedly beats Claude Sonnet 3.7 in the International Olympiad in Informatics [1]?

[0] https://huggingface.co/open-r1/OlympicCoder-7B?local-app=vll...

[1] https://pbs.twimg.com/media/GlyjSTtXYAAR188?format=jpg&name=...

tarruda

My usual non-scientific benchmark is asking it to implement the game Tetris in python, and then iterating with the LLM to fix/tweak it.

My prompt to Gemma 27b (q4) on open webui + ollama: "Can you create the game tetris in python?"

It immediately starts writing code. After the code is finished, I noticed something very strange: it starts a paragraph like this:

" Key improvements and explanations:

     Clearer Code Structure:  The code is now organized into a Tetris class, making it much more maintainable and readable.  This is essential for any non-trivial game.
"

Followed by a bunch of fixes/improvements, as if this was not the first iteration of the script.

I also notice a very obvious error: In the `if __name__ == '__main__':` block, it tries to instantiate a `Tetris` class, when the name of the class it created was "TetrisGame".
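
Boiled down, the mismatch looks like this (a trimmed illustration of mine, not the model's actual output):

    class TetrisGame:          # the class the model actually defined
        def run(self):
            print("running")

    if __name__ == '__main__':
        game = Tetris()        # but the entry point instantiates Tetris -> NameError
        game.run()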

Nevertheless, I try to run it and paste the `NameError: name 'Tetris' is not defined` error along with the stack trace specifying the line. Gemma then gives me this response:

"The error message "NameError: name 'Tetris' is not defined" means that the Python interpreter cannot find a class or function named Tetris. This usually happens when:"

Then it continues with a generic explanation of how to fix this error in arbitrary programs. It seems like it completely ignored the code it just wrote.

tarruda

I ran the same prompt on Google AI Studio and it had the same behavior of talking about improvements, as if the code it wrote was not the first version.

Other than that, the experience was completely different:

- The game worked on first try

- I iterated with the model making enhancements. The first version worked but didn't show scores, levels or next piece, so I asked it to implement those features. It then produced a new version which almost worked: The only problem was that levels were increasing whenever a piece fell, and I didn't notice any increase in falling speed.

- So I reported the problems with level tracking and falling speed, and it produced a new version which crashed immediately. I pasted the error and it was able to fix it in the next version.

- I kept iterating with the model, fixing issues until it finally produced a perfectly working tetris game which I played and eventually lost due to high falling speed.

- As a final request, I asked it to port the latest working version of the game to JS/HTML with the implementation self contained in a file. It produced a broken implementation, but I was able to fix it after tweaking it a little bit.

Gemma 3 27b on Google AI studio is easily one of the best LLMs I've used for coding.

Unfortunately, I can't seem to reproduce the same results in ollama/open webui, even when running the full fp16 version.

whbrown

Those sound like the sort of issues which could be caused by your server silently truncating the middle of your prompts.

By default, Ollama uses a context window size of 2048 tokens.
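
If that is what's happening, the window can be raised per request (a sketch, assuming the Ollama HTTP API; num_ctx is the option that controls it):

    # Sketch: bump Ollama's context window for a single request.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:27b",
            "prompt": "long conversation or prompt goes here",
            "stream": False,
            "options": {"num_ctx": 8192},  # default is 2048
        },
    )
    print(resp.json()["response"])

The same setting can also be baked into a Modelfile with PARAMETER num_ctx 8192.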

tarruda

I checked this; the whole conversation was about 1000 tokens.

I suspect the Ollama version might have wrong default settings, such as conversation delimiters. The experience of Gemma 3 in AI studio is completely different.

whiplash451

Why did this get downvoted? Asking genuinely
