Gemma 3 Technical Report [pdf]
259 comments
March 12, 2025 · meetpateltech
setgree
A kind of ancillary note, but it's amazing to me how fragmented this presentation and documentation is:
* the parent link is to storage.googleapis.com
* There's documentation on ai.google.dev
* The announcement blogpost is https://blog.google/technology/developers/gemma-3/
* you try it on https://aistudio.google.com/
It's helpful to have a top-level post like this, but can some PM please consolidate this into, IDK, ai.google.com/gemini?
matteocontrini
Apparently ai.google.com currently redirects to ai.google, which is different from ai.google.dev where the Gemini stuff actually is.
bigdict
* the code is at https://github.com/google-deepmind/gemma
* you download the weights at https://www.kaggle.com/models/google/gemma-3/
klysm
I don't see how this actually matters - who cares if it's on different top-level domains?
jhayward
Two reasons it matters:
1) Discoverability
2) "System structure mirrors organization". I.E., it's an indicator of a fragmented and disorganized structure that's not likely to produce cohesive product results.
derbaum
The ollama page shows Gemma 27B beating Deepseek v3 and o3-mini on lmarena. I'm very excited to try it out.
Hiskias
Same!
LeoPanthera
Doesn't yet work in LM Studio. Barfs an error when trying to load the model. (Error 6, whatever that means. Happy I missed the first 5.)
genewitch
You need the newest llama.cpp, and if you have an AMD card and recently updated the drivers, roll them back. Most people complaining are using ROCm.
I assure you Gemma 3 works fine in LM Studio. GGUF and MLX builds are available.
diggan
> Barfs an error when trying to load the model
Since you're not using the official models (since they're not GGUFs), what exact model are you trying to use? The 3rd party you rely on might have screwed something up.
osanseviero
Please make sure to update to the latest llama.cpp version
genpfault
> ollama: https://ollama.com/library/gemma3
Needs an ollama newer than 0.5.11. Probably the very-recently-released v0.6.0[1]:
> New Model:
> * Gemma 3: Google Gemma 3 model is now available in 1B, 4B, 12B, and 27B parameter sizes.
starik36
Doesn't work on 0.5.13. Had to upgrade to 0.6.0.
diggan
> open weights
What exactly is this supposed to mean? That I can grab the weights by just downloading them, or something like that?
Because when I open up the HuggingFace repository, it asks me to "accept the conditions" (Google's usage license). How is this different from any other proprietary binaries people distribute on the internet but let you run locally? Is other software (1Password, for example) also "open software" because you can download it?
idonotknowwhy
Replace "google" with "unsloth" in the browser address bar if you want to download them without signing up to hf
diggan
Regardless of where you get the weights, Google says you need to follow their terms and conditions for the model/weights:
> By using, reproducing, modifying, distributing, performing or displaying any portion or element of Gemma, Model Derivatives including via any Hosted Service, (each as defined below) (collectively, the "Gemma Services") or otherwise accepting the terms of this Agreement, you agree to be bound by this Agreement.
https://ai.google.dev/gemma/terms
Worth knowing if you're planning to use this model for production usage/with a business.
So once again, I don't understand what "open" is supposed to mean when they call models like these "open weights". What part exactly is "open"?
svachalek
"Open weights" refers to a license that allows you to freely (or mostly freely) copy the model file (i.e. weights). An "open source" model would be possible to build from training data, but those hardly exist.
upghost
I'm still a huge fan of gemma-22b. Looking forward to this one!
alekandreev
Greetings from the Gemma team! We just got Gemma 3 out of the oven and are super excited to show it to you! Please drop any questions here and we'll answer ASAP.
(Opinions our own and not of Google DeepMind.)
PS we are hiring: https://boards.greenhouse.io/deepmind/jobs/6590957
heinrichf
I'm comparing Gemma 3 12B (https://ollama.com/library/gemma3; running fully on my 3060 12GB) and Mistral Small 3 24B (https://ollama.com/library/mistral-small; 10% offloaded to the CPU).
- Gemma3 12B: ~100 t/s on prompt eval; 15 t/s on eval
- MistralSmall3 24B: ~500 t/s on prompt eval; 10 t/s on eval
Do you know what difference in architecture could make the prompt eval (prefill) so much slower on the 2x smaller Gemma 3 model?
alekandreev
Thank you for the report! We are working with the Ollama team directly and will look into it.
remuskaos
At what context sizes? I've just run the same prompt and query on my RTX3080 with openwebui as frontend.
When I set the context size to 2048 (openwebui's default), the inference is almost twice as fast as when I set it to 4096. I can't set the context size any higher because my GPU only has 12GB of RAM and ollama crashes for larger context sizes.
Still, I find that thoroughly odd. Using the larger context size (4096), the GPU usage is only 50% as seen in nvtop. I have no idea why.
magicalhippo
Thanks, been using Gemma 2 a lot at home as it still holds up very well and the 9B version runs great on my 2080Ti. Strong prompt adherence coupled with overall capability makes it very useful. Looking forward to trying Gemma 3.
I have some dumb questions though, might as well ask. How do you decide on the model sizes? And how do you train them? Independently or are they related somehow?
alekandreev
Picking model sizes is not an exact science. We look for sizes that will fit quantized on different categories of devices (e.g., low-end and high-end smartphones, laptops and 16GB GPUs, and bigger GPUs/TPUs). We also want the ratio of model width to depth (number of layers) to be consistently around 90, which we found works best.
The models are trained with distillation from a bigger teacher. We train them independently, but for v3 we have unified the recipes for 4B-27B, to give you more predictability when scaling up and down to different model sizes.
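For anyone wondering what "distillation from a bigger teacher" means in practice: the generic idea is that the small model is trained to match the teacher's full next-token distribution rather than only the hard labels. A minimal sketch of that loss in PyTorch (purely illustrative, not the exact Gemma recipe):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=1.0):
        # Soft-target distillation: push the student's next-token distribution
        # towards the teacher's. (Generic sketch, not the exact Gemma recipe.)
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2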
magicalhippo
Thanks again, very interesting.
One unexpected (to me) use-case appeared not long ago when I found myself without internet but wanting to fix some non-standard Linux configuration issue. As a Windows guy I tend to web search such things, but local LLM to the rescue!
Even a smaller model like Gemma 2 9B has enough compressed knowledge that it managed to help me quickly solve my issue.
This got me thinking how such smaller, but very capable models might be a game-changer in communities where internet might not be available or too expensive for continuous use. It's almost like having a portion of the internet in a box, just add electricity.
bguberfain
Can you provide more information about this “bigger teacher” model?
miki123211
How good is Gemma at structured output generation, JSON schema compliance and tool use? Particularly the smaller versions, particularly in foreign languages?
We will run our internal evals on it for sure, but just wanted to ask whether that's even a use case that the team considered and trained for.
canyon289
Hey, I'm from the Gemma team. There are a couple of angles to your question.
We do care about prompted instructions, like JSON schema, and it is something we eval for and encourage you to try. Here's an example from Gemma 2 to guide folks looking to do what it sounds like you're interested in.
https://www.youtube.com/watch?v=YxhzozLH1Dk
Multilinguality was a big focus in Gemma 3. Give it a try.
And for structured output, Gemma works well with many structured output libraries, for example the one built into Ollama:
https://github.com/ollama/ollama/blob/main/docs/api.md#struc...
In short you should have all the functionality you need!
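For example, here's a minimal sketch with the Ollama Python client (recent Ollama versions accept a JSON schema via the format parameter; the model tag and schema below are just placeholders, adjust to whatever you've pulled locally):

    import json
    import ollama  # pip install ollama

    schema = {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "population": {"type": "integer"},
        },
        "required": ["city", "population"],
    }

    resp = ollama.chat(
        model="gemma3:4b",
        messages=[{"role": "user", "content": "Return Japan's largest city as JSON."}],
        format=schema,  # constrains the output to the schema
    )
    print(json.loads(resp["message"]["content"]))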
refulgentis
The Ollama stuff is the old llama.cpp stuff that constrains output tokens.
It's great, I've used it to get outputs from as small a model as 1B.
But it's a stark difference in quality from, say, Phi-4's native tool-calling.
If Gemma 3 is natively trained on tool-calling, i.e. y'all are benching on, say, the Berkeley Function Calling Leaderboard, that'd be great to know out here.
Tangentially, github.com/ochafik is a Googler who landed an excellent overhaul of llama.cpp's tool-calling, might be worth reaching out to (if you're not working with him already!)
eternityforest
I notice in my (brief and probably user-error-filled; I'm an embedded dev, not an AI expert) testing, it (and pretty much every other small model) seems to have trouble interpreting numbers expressed as words when filling out a JSON object like:
{"operator": "*", "command": "calculate", "a": 473, "b": 2848}
You might say something like five thousand fifty six, and it will fill in something like 556 or 5560.
It's as if it's just transferring digits one by one, not using the structure to know about the implicit zero.
Which is very interesting since that seems like a mistake I would make too!
It doesn't do it all the time, and I only know about the ollama quantized version, and I mostly only try the 1B models, and I've seen similar issues with other sub-2B models as well.
The other interesting thing is in a chat, almost every model I've tried seems to interpret the numbers correctly, if you say "what's ten million and fifty times eight" it will start with "10,000,050 x 8 is...".
Sometimes they get the math wrong after that, but the number interpretation is right.
I wonder if there's something special about all the "intro text" in chat mode that is actually acting like reasoning, or if the digit separators (which don't exist in JSON) help them figure out what they're doing?
I wonder if it would be better for some applications to include a line of thoughts/summary/intro in the JSON format constraint?
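Roughly what I mean, if anyone wants to try it (field names made up; whether properties are generated in schema order depends on the constraint library):

    # Put a free-text "reasoning" field first, so the model spells out the number
    # before committing to digits, e.g. "five thousand fifty six = 5056".
    schema = {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string"},
            "command": {"type": "string"},
            "operator": {"type": "string"},
            "a": {"type": "number"},
            "b": {"type": "number"},
        },
        "required": ["reasoning", "command", "operator", "a", "b"],
    }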
Other than that I've been really enjoying Gemma3!
seektable
Just tried gemma3:4b for structured output and it fails with a strange error (Ollama is the latest):
Ollama error: POST predict: Post "http://127.0.0.1:49675/completion": read tcp 127.0.0.1:49677->127.0.0.1:49675: wsarecv: An existing connection was forcibly closed by the remote host.
Not sure this is Ollama or gemma3:4b problem. At the same time, gemma3:12b works fine for the same API request (100% identical, only difference is model id).
seektable
Looks like it's Ollama's issue: https://github.com/ollama/ollama/issues/9686, https://github.com/ollama/ollama/issues/9687
swyx
Will there ever be a Gemma 3 Thinking? How copyable is the Flash Thinking approach to the Gemma series?
alekandreev
That's a very interesting area, but nothing we can announce today.
mdp2021
Thank you!
Question: your model supports 140 languages. Given that you are focusing on compactness and efficiency, would there not be gains in also developing models restricted to a small set of languages (e.g. the top four "western" ones by cultural production, which share an alphabet, or a similar set)?
Edit: of course the multilingual capability can be welcome. On the other hand, there are evident cases in which efficiency is paramount. We can wonder about the trade-off: how much efficiency is sacrificed for these features?
alekandreev
That's an idea we've thought about. However, we think the open source community has already created a very impressive set of language- or region-specific finetunes [1] [2]. Also, there is a lot of cultural context and nuance in every language that we don't have the capacity to cover sufficiently. So for v3 we focused on creating the best foundational multilingual model.
jjani
Just wanted to say that Gemini 1.5 Pro is still the SOTA foundational model for certain languages (even compared to non-Google models), so it's disappointing to have received the email that it will be removed in September - it will cause our product quality to go backwards when we're forced to replace it with a worse model. Unless a better one appears by then, but we've extensively tested all the big models, and for the languages in question none of them perform on the same level.
Happy to elaborate if there's a way to get in touch, in case the team isn't aware of this.
mdp2021
And have you measured the trade-off that could come with embracing such a large number of languages and alphabets? It would be interesting to note whether you are sacrificing some response quality, or if such supposed sacrifice is interestingly negligible, or if - even more interestingly - the quality increases with the added proficiency.
sidkshatriya
As per the technical report, every 5 layers you have a global attention layer. The global attention layer can have as much as a 128k context length during training (though I understand it is usually 32k).
Q. When you are training with a context length of 128k, is the attention in the global layers dense or sparse ?
If dense, would the attention memory requirement here be O(n^2), where n is 128k, for each global layer?
alekandreev
We never train at 128k, only 32k, changing the scaling factor at the end.
We wanted the long-context recipe to be friendly for finetuning, and training at 128k is a bit of a pain, so we don't do it. For inference, we see that RAM usage at 128k with the 5/1 pattern is close to that of a fully-global-layer model at 32k.
Individual attention layers are always dense.
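A back-of-envelope of why the 5/1 pattern keeps the KV cache manageable (all dimensions below are illustrative placeholders, not the exact Gemma 3 config):

    def kv_cache_bytes(context, n_layers, n_kv_heads=8, head_dim=128,
                       bytes_per_value=2, window=1024, global_every=6):
        # Sliding-window layers only keep the last `window` tokens in the KV cache;
        # global layers keep everything. Factor 2 accounts for K and V.
        total = 0
        for layer in range(n_layers):
            is_global = (layer % global_every == 0)  # ~1 global per 5 local layers
            kept = context if is_global else min(context, window)
            total += 2 * kept * n_kv_heads * head_dim * bytes_per_value
        return total

    GiB = 1024 ** 3
    print(kv_cache_bytes(128_000, n_layers=48) / GiB)               # 5/1 pattern at 128k: ~4 GiB
    print(kv_cache_bytes(32_000, n_layers=48, window=10**9) / GiB)  # all-global at 32k: ~6 GiB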
sidkshatriya
Thanks for your answer! So in the 32k global layer, every token attends to each of the other 32k tokens?
[Edit: You answered the question when you said that individual attention layers are always dense.]
moffkalast
What's the official take on the system prompt? The technical report doesn't mention it, but the official QAT GGUFs include some form of prepending it to the first user message. Has it been trained with any <start_of_turn>system turns with tool calls and such?
alekandreev
We recommend using <start_of_turn>user for the system prompt as well.
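Concretely, the prompt ends up looking roughly like this (a sketch only; exact whitespace and BOS handling are best left to the official chat template, e.g. the one shipped with the Hugging Face tokenizer):

    # Gemma has no dedicated system role, so the "system" text is simply
    # prepended to the first user turn.
    system = "You are a terse assistant that answers in one sentence."
    user = "Why is the sky blue?"

    prompt = (
        "<start_of_turn>user\n"
        f"{system}\n\n{user}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )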
tucnak
I was under the impression that the purpose of a "system" prompt is to encode the instruction boundary explicitly, to reduce the risk of injection. Do you enforce some kind of security invariant that we could rely on? For example, does the alignment regimen include adversarial demonstrations so that out-of-order instruction-following (such as contradicting preceding instructions) is penalised?
werediver
Is speculative decoding possible across 1/4/12/27 B Gemma 3 variants?
LM Studio doesn't allow that (yet), but maybe the s/w requires some adjustments to support speculative decoding with Gemma 3.
pinglin
It's reported working but not with LM Studio: https://www.reddit.com/r/LocalLLaMA/comments/1j9reim/comment...
vessenes
Lots to be excited about here - in particular new architecture that allows subquadratic scaling of memory needs for long context; looks like 128k+ context is officially now available on a local model. The charts make it look like if you have the RAM the model is pretty good out to 350k or so(!) with RoPE.
In addition, it taste-tests well on Chatbot Arena, with an Elo significantly above yesterday's best open model, Qwen 2.5 72B, and it has some pretty interesting properties that indicate it has not spent much of its model weight space on memorization, hopefully implying that it has spent it on cognition and conceptual stuff.
And, oh also vision and 140 languages.
This seems like one worth downloading and keeping; Gemma models have at times not performed quite to benchmark, but I’d guess from all this that this will be a useful strong local model for some time. I’m curious about coding abilities and tool following, and about ease of fine tuning for those.
Thanks for open-sourcing this, DeepMind team! It looks great.
hnuser123456
Gemma is made by Google, not DeepMind.
edit: Sorry, forgot DeepMind was Google's AI R&D, I read it as deepseek in your comment.
newfocogi
Job postings for working on Gemma are under DeepMind in London: https://boards.greenhouse.io/deepmind/jobs/6590957
vessenes
Hah no worries - when I read your comment I was like “dang how did I mix up deepseek and google?” Then I read your edit.
saagarjha
That’s Google DeepMind to you
genewitch
Can you link how you fine tune? Does it make a LoRA?
xnx
Linking to the announcement (which links to this PDF) would probably be more useful:
Introducing Gemma 3: The most capable model you can run on a single GPU or TPU
tomthe
Very cool open release. Impressive that a 27B model can be as good as much bigger state-of-the-art models (according to their Chatbot Arena table, tied with o1-preview and above Sonnet 3.7).
But the example image shows that this model still makes dumb errors or has poor common sense, even though it read every piece of information correctly.
wizee
It seems to have been very benchmark-tuned for LMArena. In my own experiments, it was roughly in line with other comparably sized models for factual knowledge (like Mistral Small 3), and worse than Mistral Small 3 and Phi-4 at STEM problems and logic. It's much worse than Llama 3.3 70b or Mistral Large 2411 in knowledge or intelligence in reality, even though LMArena ranks it as better than those.
aoeusnth1
Looking at every other benchmark, it's significantly behind typical big models from a year ago (Claude 3.0, Gemini 1.5, GPT 4.0). I think Google must have extensive LMArena-focused RLHF tuning for their models to juice their scores.
vessenes
I was thinking the same thing about the receipt calculation: a warning that only tourists tip 18% in Switzerland would no doubt have been appreciated!
behnamoh
> We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context
This is key (pun not intended). It's one thing to run these models locally; it's a totally different game when you need longer context.
Sure, the new M3 Ultra can fit a Q4 DeepSeek R1 in URAM, but as soon as you want usable context like 64k+, the t/s and prompt processing (PP) speed quickly become prohibitive.
Speaking of M3 Ultra, I really wish Apple had put more bandwidth in this beast of a machine. It's got a lot of "energy", not a lot of "power" to actually use that energy.
l33tman
For someone jumping back on the local LLM train after having been out for 2 years, what is the current best local web-server solution to host this for myself on a GPU (RTX3080) Linux server? Preferably with support for the multimodal image input and LaTeX rendering on the output..
I don't really care about insanely "full kitchen sink" things that feature 100 plugins to all existing cloud AI services etc. Just running the released models the way they are intended on a web server...
flipflipper
lastLinkedList
Preemptively adding for us AMD users - it's pretty seamless to get Ollama working with ROCm, and if you have a card that's a bit below the waterline (lowest supported is a 6800 XT; I bought a 6750 XT), you can use a community patch that will enable it for your card anyway:
https://github.com/likelovewant/ollama-for-amd/wiki#demo-rel...
I specifically recommend the method where you grab the patched rocblas.dll for your card model, and replace the one that Ollama is using, as someone who is technical but isn’t proficient with building from source (yet!)
dunb
What's the benefit of the container over installing as a tool with uv? It seems like extra work to get it up and running with a GPU, and if you're using a Mac, the container slows down your models.
rahimnathwani
For that GPU the best Gemma 3 model you'll be able to run (with GPU-only inference) is the 4-bit quantized 12B parameter model: https://ollama.com/library/gemma3:12b
You could use CPU for some of the layers, and use the 4-bit 27b model, but inference would be much slower.
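Rough weight-only arithmetic (KV cache and runtime overhead come on top, and real 4-bit quants usually use a bit more than 4.0 bits/weight, so treat these as lower bounds):

    def weight_gib(params_billions, bits_per_weight):
        # Weight memory only; KV cache, activations and runtime overhead are extra.
        return params_billions * 1e9 * bits_per_weight / 8 / 1024 ** 3

    print(f"12B @ 4-bit ~ {weight_gib(12, 4):.1f} GiB")  # ~5.6 GiB -> fits alongside the KV cache
    print(f"27B @ 4-bit ~ {weight_gib(27, 4):.1f} GiB")  # ~12.6 GiB -> too big for 10-12 GB cards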
genewitch
LM Studio in API mode, then literally any frontend that talks the OpenAI API.
Or just use the LM Studio front end; it's better than anything I've used for desktop use.
I get 35 t/s with Gemma 3 12B at Q8 - you'll need a smaller quant, probably the 12B at Q4_K_L. I have a 3090, that's why.
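For example, once the LM Studio server is running, any OpenAI-compatible client works against it (the port below is LM Studio's default and the model id is whatever your local download is called, so adjust both):

    from openai import OpenAI  # pip install openai

    # LM Studio exposes an OpenAI-compatible server, by default at localhost:1234.
    # The api_key is ignored locally but the client wants something non-empty.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    resp = client.chat.completions.create(
        model="gemma-3-12b-it",  # use the model identifier LM Studio shows for your download
        messages=[{"role": "user", "content": "Summarize the Gemma 3 release in two sentences."}],
    )
    print(resp.choices[0].message.content)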
mfro
Librechat + ollama is the best I have tried. Fairly simple setup if you can grok yaml config.
atarus
Looks great! So excited about this! We have been using gemma models since gemma 1.0 and they are so far ahead of the curve!
kbrannigan
Can someone explain Gemma vs Gemini for me please?
hargup
Gemma is their open-source series of models; Gemini is the proprietary one. Gemini models are bigger and better, but the Gemma models are pretty good too.
tpm
open-weights, not open-source (sorry to be that one but open source in this case would mean you can build it yourself from provided "source", which you can't, because it's not provided)
mrob
And even "open-weights" is generous, as they're released under a proprietary license with usage restrictions, not an open-source license.
danielhanchen
Super cool models! Love the mixture of sliding window and global attention to cut down on KV cache sizes! And 4B, 12B and 27B are vision + text by default! Super cool!
xnx
This is the first model I can think of that advertises itself as being optimized for AMD ROCm.
gundmc
What do companies like Meta and Google gain from releasing open models? Is it just reputational? Attractive to top AI talent?
npodbielski
I believe (and some other people on the internet with more knowledge of LLMs believe too) that open-source local models are the future. Big models behind an API and a chat interface, like OpenAI offers, will probably have their niche too, but they are very costly, they are not AGI, and they will not be in the near future. On the other hand, with the rise of NPU chips and small models, you can have your own assistant on your phone, using your own data, almost instantaneously and at almost no cost. Whoever builds the best open model will win this race, and as the winner you get to set the standard. It's basically why we have Linux on servers rather than Windows, and why, even though browsers are free, every tech giant still ships one.
lastLinkedList
I'm curious to hear more about phone-local assistants. I had rather assumed only the latest hardware (iPhone 15+; not sure on the Android side) could do local inference. Is there a way to get something going on hardware a couple of years old?
simne
> Is there a way to get something going on hardware a couple years old?
Tensor accelerators are a very recent thing, and GPU/WebGPU support is also recent. RAM was also limited; 4 GB was the barrier for a long time.
So the model would have to run on CPU and within 4 GB, or even 2 GB.
Oh, and I forgot one important thing - mobile CPUs from a couple of years ago were also weak (the exception being iPhone/iPad).
But if you have a gaming phone (or an iPhone), which at that time was comparable to a notebook, it may run something like Llama 2 quantized to 1.8 GB at about 2 tokens per second - not very impressive, but it could work.
genewitch
FUTO voice typing runs locally on my Galaxy 20, so, yes. There are also SPAs that claim to run locally, which I have but haven't tried. And there are small models - one I know of is 380M parameters, rather than 15B or 800B...
colejhudson
Those are certainly benefits, but it's most likely a prophylactic move.
LLMs will be (are?) a critical piece of infrastructure. Commoditizing that infrastructure ensures that firms like Google and Meta won't be dependent on any other firm (e.g. OpenAI) for access to it.
Meta in particular has had this issue wrt Ads on iOS. And Google wrt paying Apple to be the default search engine.
See also: Joel Spolsky's famous Strategy Letter V [0].
[0]: https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/
summerlight
There is certain demand, and if you don't do anything it will be captured by competitors and you lose control. This is especially important for Google, as they see LLMs as a significant portion of their future Cloud business and probably want a smooth, exclusive transition path to their proprietary models.
simne
Unfortunately, this is a known business model. The best-known example was the Eclipse IDE, which killed all the small IDE businesses; another example is Oracle's MySQL.
The idea is to make essentially free something on which small and medium businesses could otherwise survive and grow into something big, creating a wide death valley between small and big businesses.
The only exception is tiny businesses living in tiny niches, but for them it is nearly impossible to cross the gap from tiny to big.
And you should understand that these "open models" are in reality open-weight models, as they do not disclose the sources they were trained on, so the community cannot remake the model from scratch.
Headhunting is certainly important, but big businesses typically have so much financial power that they can simply buy talent.
Headhunting via reputation is really important for small businesses, because they are typically very limited financially.
Medium businesses sit between small and big, but as I said at the beginning, making strategic things free creates a death valley, so it becomes very hard to be medium-sized.
Reputation is a good thing for everyone, but again, top corporations are powerful out of proportion to their size, so in many cases it is relatively cheap for them to just maintain a neutral reputation; they don't need to spend much on whitewashing.
Gemma 3 is out! Multimodal (image + text), 128K context, supports 140+ languages, and comes in 1B, 4B, 12B, and 27B sizes with open weights & commercial use.
Gemma 3 model overview: https://ai.google.dev/gemma/docs/core
Huggingface collection: https://huggingface.co/collections/google/gemma-3-release-67...
ollama: https://ollama.com/library/gemma3