Gemma 3 Technical Report [pdf]
175 comments · March 12, 2025 · meetpateltech
diggan
> open weights
What exactly is this supposed to mean? That I can grab the weights by just downloading them, or something like that?
Because when I open up the HuggingFace repository, it asks me to "accept the conditions" (Google’s usage license). How is this different from any other proprietary binaries people distribute on the internet but let you run locally? Is other software (like 1Password, for example) also "open software" because you can download it?
idonotknowwhy
Replace "google" with "unsloth" in the browser address bar if you want to download them without signing up to hf
diggan
Regardless of where you get the weights, Google says you need to follow their terms and conditions for the model/weights:
> By using, reproducing, modifying, distributing, performing or displaying any portion or element of Gemma, Model Derivatives including via any Hosted Service, (each as defined below) (collectively, the "Gemma Services") or otherwise accepting the terms of this Agreement, you agree to be bound by this Agreement.
https://ai.google.dev/gemma/terms
Worth knowing if you're planning to use this model in production or with a business.
So once again, I don't understand what "open" is supposed to mean when they call models like these "open weights". What part exactly is "open"?
derbaum
The ollama page shows Gemma 27B beating Deepseek v3 and o3-mini on lmarena. I'm very excited to try it out.
LeoPanthera
Doesn't yet work in LM Studio. Barfs an error when trying to load the model. (Error 6, whatever that means. Happy I missed the first 5.)
diggan
> Barfs an error when trying to load the model
Since you're not using the official models (since they're not GGUFs), what exact model are you trying to use? The 3rd party you rely on might have screwed something up.
osanseviero
Please make sure to update to the latest llama.cpp version
upghost
I'm still a huge fan of gemma-22b. Looking forward to this one!
alekandreev
Greetings from the Gemma team! We just got Gemma 3 out of the oven and are super excited to show it to you! Please drop any questions here and we'll answer ASAP.
(Opinions our own and not of Google DeepMind.)
PS we are hiring: https://boards.greenhouse.io/deepmind/jobs/6590957
saagarjha
Google is using Greenhouse for ATS now?
heinrichf
I'm comparing Gemma3 12 B (https://ollama.com/library/gemma3; running fully on my 3060 12GB) and Mistral Small 3 24B (https://ollama.com/library/mistral-small; 10% offloaded to the CPU).
- Gemma3 12B: ~100 t/s on prompt eval; 15 t/s on eval
- MistralSmall3 24B: ~500 t/s on prompt eval; 10 t/s on eval
Do you know what difference in architecture could make the prompt eval (prefill) so much slower on the 2x smaller Gemma3 model?
alekandreev
Thank you for the report! We are working with the Ollama team directly and will look into it.
sidkshatriya
As per the technical report, every 5 layers there is a global attention layer. During training, the global attention layers can have a context length of as much as 128k (though I understand it is usually 32k).
Q. When you are training with a context length of 128k, is the attention in the global layers dense or sparse?
If dense, wouldn't the attention memory requirement be O(n^2), where n is 128k, for each global layer?
alekandreev
We never train at 128k, only 32k, changing the scaling factor at the end.
We wanted the long context recipe to be friendly for finetuning, and training at 128k is a bit of a pain, so we don't do it. For inference, we see that RAM usage at 128k with the 5:1 ratio is close to that of a fully-global-attention model at 32k.
Individual attention layers are always dense.
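As a back-of-envelope sketch of that RAM comparison (the layer counts, head dims and sliding-window size below are illustrative assumptions, not the published Gemma 3 config):

    # Toy KV-cache estimate for interleaved local/global attention.
    def kv_cache_gb(n_global, n_local, ctx, window=1024, kv_heads=8, head_dim=256, bytes_per=2):
        per_tok = 2 * kv_heads * head_dim * bytes_per        # K and V, per layer, per token
        global_bytes = n_global * ctx * per_tok              # global layers cache the full context
        local_bytes = n_local * min(ctx, window) * per_tok   # local layers cache only the window
        return (global_bytes + local_bytes) / 1e9

    print(kv_cache_gb(48, 0, 32_000))    # every layer global at 32k  -> ~12.6 GB
    print(kv_cache_gb(8, 40, 128_000))   # 5:1 local/global at 128k   -> ~8.7 GB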
sidkshatriya
Thanks for your answer! So in the 32k global layer, every token attends to each of the other 32k tokens?
[Edit: You answered the question when you said that individual attention layers are always dense.]
mdp2021
Thank you!
Question: your model supports 140 languages. Given that you are focusing on compactness and efficiency, would you not see gains from also developing models on a selected, limited number of languages (e.g. the top four "western" ones in cultural production, with a shared alphabet, or a similar set)?
Edit: of course the multilingual capability can be welcome. On the other hand, there are evident cases in which efficiency is paramount. We can wonder about the tradeoff: how much efficiency is sacrificed for features?
alekandreev
That's an idea we've thought about. However, we think the open source community has already created a very impressive set of language- or region-specific finetunes [1] [2]. Also, there is a lot of cultural context and nuance in every language that we don't have the capacity to cover sufficiently. So for v3 we focused on creating the best foundational multilingual model.
mdp2021
And have you measured the trade-off that could come with embracing such a large number of languages and alphabets? It would be interesting to note whether you are sacrificing some response quality, or if such supposed sacrifice is interestingly negligible, or if - even more interestingly - the quality increases with the added proficiency.
magicalhippo
Thanks, been using Gemma 2 a lot at home as it still holds up very well and the 9B version runs great on my 2080Ti. Strong prompt adherence coupled with overall capability makes it very useful. Looking forward to trying Gemma 3.
I have some dumb questions though, might as well ask. How do you decide on the model sizes? And how do you train them? Independently or are they related somehow?
alekandreev
Picking model sizes is not an exact science. We look for sizes that will fit quantized on different categories of devices (e.g., low-end and high-end smartphones, laptops and 16GB GPUs, and bigger GPUs/TPUs). We also want the ratio of model width to depth (number of layers) to be consistently around 90, which we found works best.
The models are trained with distillation from a bigger teacher. We train them independently, but for v3 we have unified the recipes for 4B-27B, to give you more predictability when scaling up and down across model sizes.
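For readers unfamiliar with the distillation part, a minimal toy sketch of the general idea (plain numpy, not the actual training setup):

    # Logit distillation: the student is trained to match the teacher's
    # next-token distribution via cross-entropy against soft targets.
    import numpy as np

    def softmax(logits, temperature=1.0):
        z = logits / temperature
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def distill_loss(teacher_logits, student_logits, temperature=1.0):
        p = softmax(teacher_logits, temperature)              # soft targets from the teacher
        log_q = np.log(softmax(student_logits, temperature))  # student log-probabilities
        return -(p * log_q).sum(axis=-1).mean()

    teacher = np.random.randn(4, 32)   # (batch, vocab) toy logits
    student = np.random.randn(4, 32)
    print(distill_loss(teacher, student))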
magicalhippo
Thanks again, very interesting.
One unexpected (to me) use-case appeared not long ago when I found myself without internet but wanting to fix some non-standard Linux configuration issue. As a Windows guy I tend to web search such things, but local LLM to the rescue!
Even a smaller model like Gemma 2 9B has enough compressed knowledge that it managed to help me quickly solve my issue.
This got me thinking how such smaller, but very capable, models might be a game-changer in communities where internet access is not available or is too expensive for continuous use. It's almost like having a portion of the internet in a box; just add electricity.
miki123211
How good is Gemma at structured output generation, JSON schema compliance and tool use? Particularly the smaller versions, particularly in foreign languages?
We will run our internal evals on it for sure, but just wanted to ask whether that's even a use case that the team considered and trained for.
seektable
Just tried gemma3:4b for structured output and it fails with a strange error (Ollama is the latest version):

    Ollama error: POST predict: Post "http://127.0.0.1:49675/completion": read tcp 127.0.0.1:49677->127.0.0.1:49675: wsarecv: An existing connection was forcibly closed by the remote host.

Not sure if this is an Ollama or gemma3:4b problem. At the same time, gemma3:12b works fine for the same API request (100% identical, the only difference is the model id).
seektable
Looks like it's Ollama's issue: https://github.com/ollama/ollama/issues/9686, https://github.com/ollama/ollama/issues/9687
canyon289
Hey, I'm from the Gemma team. There are a couple of angles to your question.
We do care about prompted instructions, like JSON schema, and it is something we eval for and encourage you to try. Here's an example from Gemma 2 to guide folks looking to do what it sounds like you're interested in:
https://www.youtube.com/watch?v=YxhzozLH1Dk
Multilinguality was a big focus in Gemma 3. Give it a try.
And for structured output, Gemma works well with many structured output libraries, for example the one built into Ollama:
https://github.com/ollama/ollama/blob/main/docs/api.md#struc...
In short you should have all the functionality you need!
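For example, a rough sketch against Ollama's /api/chat structured-output support (the model tag and schema here are just placeholders):

    # Ask for JSON that conforms to a schema; Ollama constrains the output.
    import json, requests

    schema = {
        "type": "object",
        "properties": {"item": {"type": "string"}, "price": {"type": "number"}},
        "required": ["item", "price"],
    }

    resp = requests.post("http://localhost:11434/api/chat", json={
        "model": "gemma3:4b",   # placeholder tag; use whichever size you pulled
        "messages": [{"role": "user", "content": "Extract item and price: 'Espresso 4.50'"}],
        "format": schema,       # JSON schema the reply must conform to
        "stream": False,
    })
    print(json.loads(resp.json()["message"]["content"]))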
swyx
will there ever be a Gemma 3 Thinking? how copyable is the Flash Thinking approach to the Gemma series?
alekandreev
That's a very interesting area, but nothing we can announce today.
moffkalast
What's the official take on the system prompt? The technical report doesn't mention it, but the official QAT GGUFs include some form of prepending it to the first user message. Has it been trained with any <start_of_turn>system turns with tool calls and such?
alekandreev
We recommend using <start_of_turn>user for the system prompt as well.
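Concretely, that looks something like the sketch below (exact whitespace may differ slightly from the official chat template):

    # Fold the system instructions into the first user turn, since the Gemma
    # chat format has no dedicated system role.
    def gemma_prompt(system_text, user_text):
        return (
            "<start_of_turn>user\n"
            f"{system_text}\n\n{user_text}<end_of_turn>\n"
            "<start_of_turn>model\n"
        )

    print(gemma_prompt("You are a concise assistant.", "Summarize the Gemma 3 report."))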
tucnak
I was under the impression that the purpose of the "system" prompt is to encode the instruction boundary explicitly, to reduce the risk of injection. Do you enforce some kind of security invariant that we could rely on? For example, does the alignment regimen include adversarial demonstrations, so that out-of-order instruction-following (such as contradicting preceding instructions) is penalised?
vessenes
Lots to be excited about here - in particular, a new architecture that allows subquadratic scaling of memory needs for long context; it looks like 128k+ context is officially now available in a local model. The charts make it look like, if you have the RAM, the model is pretty good out to 350k or so (!) with RoPE scaling.
In addition, it flavor-tests well on Chatbot Arena, with an ELO significantly above yesterday's best open model, Qwen 2.5 72B. It has some pretty interesting properties that indicate it has not spent much of its weight space on memorization, hopefully implying that it has spent it on cognition and conceptual stuff.
And, oh also vision and 140 languages.
This seems like one worth downloading and keeping; Gemma models have at times not performed quite to benchmark, but I’d guess from all this that this will be a useful strong local model for some time. I’m curious about coding abilities and tool following, and about ease of fine tuning for those.
Thanks for open sourcing this, DeepMind team! It looks great.
hnuser123456
Gemma is made by Google, not DeepMind.
edit: Sorry, forgot DeepMind is Google's AI R&D; I read it as DeepSeek in your comment.
saagarjha
That’s Google DeepMind to you
newfocogi
Job postings for working on Gemma are under DeepMind in London: https://boards.greenhouse.io/deepmind/jobs/6590957
xnx
Linking to the announcement (which links to this PDF) would probably be more useful:
Introducing Gemma 3: The most capable model you can run on a single GPU or TPU
toisanji
How do the Gemma and Gemini teams collaborate and share information?
tomthe
Very cool open release. Impressive that a 27B model can be as good as much bigger state-of-the-art models (according to their Chatbot Arena table, tied with o1-preview and above Sonnet 3.7).
But the example image shows that this model still makes dumb errors, or has poor common sense, even though it read all the information correctly.
aoeusnth1
Looking at every other benchmark, it's significantly behind typical big models from a year ago (Claude 3.0, Gemini 1.5, GPT-4). I think Google must have done extensive LMArena-focused RLHF tuning for their models to juice their scores.
vessenes
I was thinking the same thing about the receipt calculation: a warning that only tourists tip 18% in Switzerland would no doubt have been appreciated!
jerrygenser
I suppose SigLIP 2 wasn't out yet when they were training this - I wonder if there will be an update to the multimodal models or PaliGemma which utilizes SigLIP 2. Aya Vision from Cohere utilized SigLIP 2 to great effect.
dhbradshaw
Quote:
The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
Really cool
atarus
Looks great! So excited about this! We have been using gemma models since gemma 1.0 and they are so far ahead of the curve!
xnx
This is the first model I can think of that advertises itself as being optimized for AMD ROCm.
danielhanchen
Super cool models! Love the mixture of sliding window and global attention to cut down on KV cache sizes! And 4B, 12B and 27B are vision + text by default! Super cool!
xnx
Even though Gemma 3 takes much less inference processing power and delivers better results than DeepSeek v3, I'm certain this won't cause the same Nvidia stock price panic that DeepSeek did.
Gemma 3 is out! Multimodal (image + text), 128K context, supports 140+ languages, and comes in 1B, 4B, 12B, and 27B sizes with open weights & commercial use.
Gemma 3 model overview: https://ai.google.dev/gemma/docs/core
Huggingface collection: https://huggingface.co/collections/google/gemma-3-release-67...
ollama: https://ollama.com/library/gemma3