Gemma3 Function Calling
33 comments · March 23, 2025 · canyon289
brianjking
Thanks, Gemma is fantastic, and it's great that it supports function calling.
chadash
so if i'm reading this correctly, it's essentially prompt engineering here and there's no guarantee for the output. Why not enforce a guaranteed output structure by restricting the allowed logits at each step (e.g. what outlines library does)?
canyon289
So in short, there's no guarantee for the output of any LLM, whether it's Gemma or any other (ignoring some details like setting a random seed or parameters like temperature to 0). Like you mentioned, though, libraries like outlines can constrain the output, and hosted models often already include this in their API, but they can do so because they're a model plus some server-side code.
With Gemma, or any open model, you can use the open libraries in conjunction to get what you want. Some inference frameworks like Ollama include structured output as part of their functionality.
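For illustration, a rough sketch of that kind of constrained decoding with the outlines library (this follows the 0.x API; the model id and schema are just placeholders):

    from pydantic import BaseModel
    import outlines

    # The schema every generation should conform to.
    class WeatherCall(BaseModel):
        location: str
        unit: str

    # Any transformers-compatible open model works here; the id is a placeholder.
    model = outlines.models.transformers("google/gemma-3-1b-it")

    # Decoding is constrained so only tokens consistent with the schema are allowed.
    generator = outlines.generate.json(model, WeatherCall)
    call = generator("Turn this into a weather call: what's the weather in Paris in celsius?")
    print(call)  # e.g. WeatherCall(location='Paris', unit='celsius')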
But you mentioned all of this already in your question, so I feel like I'm missing something. Let me know!
programmarchy
With OpenAI models, my understanding is that token output is restricted so that each next token must conform to the specified grammar (i.e. a JSON schema), so you're guaranteed to get either a function call or an error.
Edit: per simonw’s sibling comment, ollama also has this feature.
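For reference, a rough sketch of that with the OpenAI Python client; strict mode constrains the generated arguments to the declared parameter schema (the tool itself is just an illustration):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
        tools=[{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a location",
                "strict": True,  # arguments are guaranteed to match the schema below
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                    "additionalProperties": False,
                },
            },
        }],
    )
    print(resp.choices[0].message.tool_calls)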
simonw
If you run Gemma via Ollama (as recommended in the Gemma docs) you get exactly that feature, because Ollama provides that for any model that they run for you: https://ollama.com/blog/structured-outputs
Under the hood, it is using the llama.cpp grammars mechanism that restricts allowed logits at each step, similar to Outlines.
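The blog post above shows the pattern; roughly, with the Ollama Python client (the model tag and schema here are illustrative):

    from ollama import chat
    from pydantic import BaseModel

    class WeatherCall(BaseModel):
        name: str
        location: str

    response = chat(
        model="gemma3",
        messages=[{"role": "user", "content": "Call the weather tool for Berlin."}],
        # Passing a JSON schema makes Ollama constrain decoding to it.
        format=WeatherCall.model_json_schema(),
    )
    print(WeatherCall.model_validate_json(response.message.content))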
refulgentis
I've been working on tool calling in llama.cpp for Phi-4 and have a client that can switch between local and remote models for agentic work/search/etc., so I learned a lot about this situation recently:
- We can constrain the output with a JSON grammar (old-school llama.cpp).
- We can format inputs to make sure they match the model's format.
- Combining both of these is what llama.cpp does, via @ochafik, in, inter alia, https://github.com/ggml-org/llama.cpp/pull/9639.
- ollama isn't plugged into this system AFAIK
To OP's question: specifying the format the model was trained on unlocks the training the model specifically had on function calling, what I sometimes call an "agentic loop". I.e., we dramatically increase the odds we're singing the right tune for the model to do the right thing in this situation.
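A rough sketch of that loop against an OpenAI-compatible endpoint such as llama-server; the URL, model name, and tool are assumptions for illustration:

    import json
    import requests

    TOOLS = [{
        "type": "function",
        "function": {
            "name": "search",
            "description": "Search the web",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }]

    def search(query: str) -> str:
        return f"results for {query}"  # stand-in for a real search backend

    messages = [{"role": "user", "content": "Who maintains llama.cpp?"}]
    while True:
        resp = requests.post(
            "http://localhost:8080/v1/chat/completions",  # assumed local llama-server
            json={"model": "phi-4", "messages": messages, "tools": TOOLS},
        ).json()
        msg = resp["choices"][0]["message"]
        messages.append(msg)
        if not msg.get("tool_calls"):
            print(msg["content"])  # the model answered directly, so the loop ends
            break
        for call in msg["tool_calls"]:
            args = json.loads(call["function"]["arguments"])
            # Feed the tool result back so the model can continue the loop.
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": search(**args),
            })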
jampekka
Is the format used in the examples the same that's used in the function calling instruction training, i.e. should it be the optimal prompt for function calling?
I find it a bit frustrating when details of the training are not known and one has to guess what kinds of prompts the model has been tuned with.
canyon289
We feel this model excels at instructability, which is why we're recommending bringing your own prompt! Benchmark-wise, you can see this performance from BFCL directly: they (independently) ran their eval using their own prompted format, and the larger Gemma models performed quite well if you ask me.
Specifically though, I want to thank you for leaving a comment. We're reading all this feedback, and it's informing what we can do next to reduce frustration and create the best model experience for the community.
jampekka
Do you mean that the exact prompt for tool use shouldn't matter? Has this been tested? Is the tool use trained with a variety of prompt styles?
I would imagine training with a specific, perhaps structured, prompt could make the function calling a bit more robust.
troupo
> We feel this model excels at instructability which is why we're recommending bringing your own prompt!
Sigh. Taps the sign:
--- start quote ---
To put it succinctly, prompt engineering is nothing but an attempt to reverse-engineer a non-deterministic black box for which any of the parameters below are unknown:
- training set
- weights
- constraints on the model
- layers between you and the model that transform both your input and the model's output and that can change at any time
- availability of compute for your specific query
- and definitely some more details I haven't thought of
"Prompt engineers" will tell you that some specific ways of prompting some specific models will result in a "better result"... without any criteria for what a "better result" might signify.
https://dmitriid.com/prompting-llms-is-not-engineering
--- end quote ---
attentive
ToolACE-2-8B and watt-tool-8B have impressive scores for their size on that leaderboard.
42lux
Don’t wanna be that guy but you guys have too many models. I love Gemini and Gemma but it’s way too crowded atm.
minimaxir
The example of function calling/structured output here is the cleanest example of how it works behind the scenes, incorporating prompt engineering and a JSON schema.
With the advent of agents/MCP, the low level workflow has only become more confusing.
canyon289
This is me speculating along with you, so don't take this as fact, but my sense is that the LLM tool stack is getting "layerized" like network layer architectures.
Right now the space is moving fast so new concepts and things are getting introduced quite fast, and the ecosystem hasn't settled.
https://en.wikipedia.org/wiki/OSI_model#Layer_architecture
But like all other things with computers (shells, terminals, GUIs, etc.), we're getting there. Just faster than ever.
lioeters
That's insightful. Thank you for sharing your work and the patient responses to everyone's questions.
Yesterday I started exploring a smaller Gemma3 model locally with Ollama, and it's clearly a level up from the previous model I was using (Llama3) in terms of instruction comprehension and the sophistication of responses. It's faster, smaller, and smarter.
I very much appreciate how such innovative technology is available for non-experts to benefit from and participate in. I think one of the best things about the emergence and evolution of LLMs is the power of open source, open standards, and the ideal of democratizing artificial intelligence and access to it. The age-old dream of machines augmenting the human intellect (Vannevar Bush, Doug Engelbart, et al) is being realized in a surprising way, and seeing the foundational layers being developed in real time is wonderful.
canyon289
Of course! Glad you can find models that work well for you and that we're all learning together. Even on the "expert side" we're learning from what folks like yourself are doing and taking notes so we can shape these models to be better for you all.
nurettin
So it's just a prompt? Well then you can do function calling with pretty much any model from this quarter.
zellyn
Am I getting slightly different use-cases mixed up, or would it be better if everything just spoke MCP?
PufPufPuf
MCP is the wire protocol, it doesn't say anything about how the LLM output is structured and parsed.
simonw
You need function calling support in the models in order to layer MCP over the top of them.
mentalgear
Great, your work on open-source SLMs is much appreciated! (btw: it seems like the Google page does not respect the device theme "auto" setting)
canyon289
Thank you! Community vibes motivate us to code up more for you all. Really appreciate the note.
Regarding the device theme in the browser, I'll ask some folks what's going on there.
sunrabbit
It's honestly frightening to see how fast it's evolving. It hasn't even been that many years since GPT was first released.
behnamoh
I'm glad this exists. It ruins the day for Trelis, who took the open-source and free Llama and made it commercial by giving it function calling abilities: https://huggingface.co/Trelis/Meta-Llama-3-70B-Instruct-func...
kristjansson
I mean Meta did that already with Llama 3.1
https://www.llama.com/docs/model-cards-and-prompt-formats/ll...
canyon289
Hey folks, I'm on the Gemma team. We released new model(s) just recently, and I saw many questions here about function calling, so we just published docs to detail this more. In short, Gemma3's prompted instruction following is quite good for the larger models, and that's how you use the feature.
You don't need to take our word for it! We were waiting for an external and independent validation from the Berkeley team, and they just published their results. You can use their metrics to get a rough sense of performance, and of course try it out yourself in AI Studio or locally with your own prompts.
https://gorilla.cs.berkeley.edu/leaderboard.html
Hope you all enjoy the models!
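If it helps, here's a minimal sketch of the prompted approach using the Ollama Python client; the prompt wording and tool are illustrative, not the exact template from the docs:

    import json
    from ollama import chat

    # Describe the available function and the expected reply format in plain text.
    SYSTEM = """You have access to this function:
    {"name": "get_weather", "description": "Get the current weather",
     "parameters": {"location": {"type": "string"}}}
    If you need it, reply with ONLY a JSON object of the form
    {"name": "<function name>", "arguments": {...}}."""

    response = chat(
        model="gemma3:27b",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": "What's the weather in Tokyo?"},
        ],
    )
    try:
        call = json.loads(response.message.content)
        print("function call:", call["name"], call["arguments"])
    except json.JSONDecodeError:
        print("plain answer:", response.message.content)  # no guaranteed structure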