Fine-tune Google's Gemma 3
23 comments · March 19, 2025 · bryan0
samspenc
It takes a significant amount of time (a few hours) on a single consumer GPU, even a 4090 / 5090, on a personal machine. I think most people use online services like RunPod, Vast.ai, etc. to rent high-powered H100s and similar GPUs for a few dollars per hour, run the fine-tuning / training there, and just use local GPUs for inference on the fine-tuned models generated on those cloud-rented instances.
jsight
For experimentation? Absolutely. It can often be done overnight for smaller models and reasonably sized GPUs (24GB+).
It'd become a lot less practical with huge datasets, but I'd guess that a lot of fine tuning tasks aren't really that large.
rockwotj
Is anyone outside of the research labs fine-tuning models for production use cases? I have been seeing more people just using foundation models off the shelf, especially in light of a new advancement that seems to come along every few months.
minimaxir
Finetuning is easy and worthwhile, especially with LoRAs, as these Unsloth demos show. The bottleneck then becomes how to self-host the finetuned model in a way that's cost-effective and scalable.
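For concreteness, a minimal sketch of what that kind of LoRA fine-tune looks like with Hugging Face PEFT + TRL (the Unsloth notebooks wrap a similar flow behind their own API); the model name, dataset, and hyperparameters here are illustrative, and the SFTTrainer API varies a bit across TRL versions:

```python
# Minimal LoRA fine-tune sketch with PEFT + TRL; names and settings are
# illustrative, not the exact recipe from the Unsloth demos.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="my_examples.jsonl", split="train")  # hypothetical file

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # adapt only attention projections
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="google/gemma-3-4b-it",        # loaded by name; assumes access to the checkpoint
    train_dataset=dataset,               # expects a "text" or "messages" column
    peft_config=peft_config,             # only the small adapter matrices are trained
    args=SFTConfig(
        output_dir="gemma3-lora",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
    ),
)
trainer.train()
trainer.save_model("gemma3-lora")        # saves just the adapter weights
```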
In practice, prompt engineering and few-shot prompting with modern LLMs tend to be more pragmatic, since their prompt adherence is strong and only getting better over time.
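As a trivial illustration of the few-shot alternative (the task and examples are made up), the entire "setup" is a prompt:

```python
# Few-shot prompting: no training step, just worked examples in the prompt.
prompt = """Classify the support ticket as BILLING, BUG, or OTHER.

Ticket: "I was charged twice this month."
Label: BILLING

Ticket: "The export button crashes the app."
Label: BUG

Ticket: "Do you have a student discount?"
Label: OTHER

Ticket: "My invoice total looks wrong."
Label:"""
# Send `prompt` to whichever model you already use and read back the label.
```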
slopeloaf
Yeah, this, big time. I haven’t found a solution that makes sense. Larger models are already good enough and so convenient.
When it’s more feasible to do inference on the client (browser or desktop), I can see SLMs popping up more commonly in production.
naveen99
If you have the resources to fine-tune, you have the resources to run inference on the fine-tuned model.
If you want to scale up and down on demand, you can just fine-tune on OpenAI or Google Cloud as well.
simonw
> If you have the resources to fine-tune, you have the resources to run inference on the fine-tuned model.
I don't think that's true.
I can fine-tune a model by renting a few A100s for a few hours, total cost in the double-digit dollars. It's a one-time cost.
Running inference with the resulting model for a production application could cost single-digit dollars per hour, which adds up to hundreds or even thousands of dollars a month on an ongoing basis.
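A back-of-the-envelope version of that comparison (rates are illustrative, not quotes):

```python
# Illustrative numbers only: one-time training vs. always-on inference.
finetune_cost = 4 * 8 * 2.50              # 4 A100s x 8 hours x ~$2.50/GPU-hour -> ~$80, once
inference_per_month = 2.00 * 24 * 30      # one ~$2/hour GPU running 24/7 -> ~$1,440, every month
print(finetune_cost, inference_per_month)
```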
317070
I've been finetuning these models since before ChatGPT, and the one lesson I've learned is that by the time you have set everything up to fine-tune a model, you can expect a newer model to do just as well with prompt-tuning.
So, unless you hope to stay at the forefront (e.g. to be ahead of competitors), there has been no real reason to finetune for the last 4 years; at best you could hope to stay about 1-3 months ahead, depending on how fast you were at setting up your training. And if that is what you hoped to achieve, you needed to automate at a higher level, i.e. automate data collection and the collection of eval cases.
nwienert
It feels like there should be a service where I just drag and drop a folder of examples and it fine-tunes the latest DeepSeek or whatever for me, and can even host it for me at some cost. I'd pay for that immediately, but last I checked there was nothing that really did that well (would love to be wrong).
m101
I feel like this is true, but it would be great if you could provide examples so we could get a better idea of why you think/know this.
netdur
I have documents from the last 50 years that I need to digitize, millions of them written in old Arabic. OCR is not accurate on them because the documents are handwritten, so I need to fine-tune a model on around 300k pairs of texts (OCR output and manually corrected versions).
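Not the commenter's actual pipeline, but a rough sketch of turning such (OCR output, manually corrected) pairs into chat-style training examples for supervised fine-tuning; the file paths and field names ("ocr", "corrected") are assumptions:

```python
# Sketch: convert OCR/correction pairs into chat-format SFT examples.
# Field names and file paths are assumptions.
import json

def to_example(pair):
    return {
        "messages": [
            {"role": "user",
             "content": "Correct the OCR errors in this Arabic text:\n" + pair["ocr"]},
            {"role": "assistant", "content": pair["corrected"]},
        ]
    }

with open("pairs.jsonl", encoding="utf-8") as src, \
     open("train.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(json.dumps(to_example(json.loads(line)), ensure_ascii=False) + "\n")
```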
Diederich
This sounds very interesting; can you share more? Thanks!
simonw
I've had trouble getting a great answer to this question - I ask it in various places every month or so, most recently here: https://nitter.net/simonw/status/1895301139819860202
On paper, fine-tuning smaller models can greatly reduce the cost for a specific task, but I've not heard many real-world success stories around that.
I think vision LLMs are one of the most interesting applications here - things like fine-tuning for better results when extracting data from a specific paper form or report structure. Again, not many public examples of that.
kiratp
We use multiple post-trained models in production, at scale at https://osmos.io
simonw
Have you published details of how you're doing that anywhere?
Could be a useful marketing strategy for you, given how starved we all are of information about successful fine tuning stories.
icelancer
We were, but with the models becoming so good, so large, and so cheap, we've largely abandoned it in our long-term roadmap.
deepsquirrelnet
I’m trying right now. The combination of small models, QLoRA, and GRPO has made it accessible to experimenters. I’m not using Unsloth yet, but I will probably start checking it out pretty soon so that I can train larger models or increase the number of generations for GRPO.
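For anyone curious what that combination looks like, a rough sketch using TRL's GRPOTrainer with a LoRA adapter (plain LoRA rather than 4-bit QLoRA for brevity); the reward function, dataset, and hyperparameters are placeholders, and the API differs across TRL versions:

```python
# GRPO sketch with TRL + a LoRA adapter; everything here is a placeholder.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward that prefers shorter completions; a real task would score correctness.
    return [-float(len(c)) for c in completions]

dataset = load_dataset("json", data_files="prompts.jsonl", split="train")  # needs a "prompt" column

trainer = GRPOTrainer(
    model="google/gemma-3-1b-it",                 # assumed small base model
    reward_funcs=reward_len,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=GRPOConfig(
        output_dir="gemma3-grpo",
        num_generations=8,              # more samples per prompt = better advantage estimates, more memory
        per_device_train_batch_size=8,  # keep divisible by num_generations on a single GPU
    ),
)
trainer.train()
```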
jbentley1
I am. I have some use cases related to data extraction where a fine-tuned small model outperforms the best-in-class closed-source models at a fraction of the cost.
refulgentis
IMHO the biggest factor holding that back is how rushed and distant these model releases still are.
Both Phi-4-mini and Gemma 3 were released recently. Phi-4's damn close to a good, real model release. Microsoft's done a great job of iterating.
Gemma 3's an excellent, intelligent model, but it's got a gaping blind spot: tool-calling / JSON output. There was a vague, quick handwave about it in some PR; a PM/eng on the Gemma team commented here in response to someone else that, TL;DR, "it's supported in Ollama!", which is Not Even Wrong, i.e. in the Pauli sense of the phrase.
- Ollama uses a weak, out-of-date llama.cpp mechanism where the output tokens are constrained to match a JSON schema. This falls apart almost immediately, i.e. as soon as there is more than one tool.
- The thing that matters isn't whether we can constrain output tokens; any model can do that, and I've had Llama 3 1B making tool calls that way. The thing that matters is A) did you train that in, and B) if you did, tell us the format.
All that to say, IMHO we're still 6 months to a year out from BigCo understanding enough about their own stuff to even have a good base for it. Sure, tool calling and fine-tuning are orthogonal, in a sense, but in practice, if I'm interested in getting a specific type of output, odds are I want it formatted a specific way.
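To make the formatting complaint concrete, here's a purely illustrative sketch of the kind of contract in question - a tool list, a prompt asking for JSON calls, and a validation step. Whether Gemma 3 was trained on any particular format is exactly the undocumented part, so none of this reflects an official spec:

```python
# Illustrative only: a tool schema, a JSON-calling prompt, and validation.
# Constrained decoding can guarantee valid JSON, but not that the model picks
# the right tool once there is more than one -- the failure mode described above.
import json

tools = [
    {"name": "get_weather", "parameters": {"city": {"type": "string"}}},
    {"name": "get_time",    "parameters": {"timezone": {"type": "string"}}},
]

system_prompt = (
    "You can call these tools. Reply ONLY with JSON of the form "
    '{"tool": <name>, "arguments": {...}}.\n' + json.dumps(tools)
)

def parse_tool_call(model_output: str):
    call = json.loads(model_output)
    if call.get("tool") not in {t["name"] for t in tools}:
        raise ValueError("model called a tool that doesn't exist")
    return call["tool"], call["arguments"]
```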
tough
Could one train a Gemma 3 fine-tune for tool use now?
Found this on HF: https://huggingface.co/ZySec-AI/gemma-3-27b-tools
siliconc0w
It likely makes sense to use more expensive frontier models as teachers or architects for smaller fine-tuned ones that generate the majority of tokens (though possibly against the ToS).
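A rough sketch of that teacher pattern, with the OpenAI client standing in for any frontier API; model names and prompts are assumptions, and (as noted) provider terms may restrict using outputs to train other models:

```python
# Sketch: use an expensive "teacher" model to label data, then fine-tune a
# small "student" on the result. Model names and prompts are placeholders.
import json
from openai import OpenAI

client = OpenAI()
prompts = ["Summarize: ...", "Summarize: ..."]   # your real task inputs

with open("distilled.jsonl", "w") as out:
    for p in prompts:
        resp = client.chat.completions.create(
            model="gpt-4o",                      # the expensive teacher
            messages=[{"role": "user", "content": p}],
        )
        out.write(json.dumps({
            "prompt": p,
            "completion": resp.choices[0].message.content,
        }) + "\n")
# distilled.jsonl then becomes the training set for the cheap fine-tuned student.
```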
yieldcrv
Instead of versions, these things should be labeled by their release date, since this kind of training starts from a dataset snapshot in time, colloquially called the knowledge-cutoff date, which isn't really accurate.
We are optimizing these along different dimensions at once, with multiple branches of evolution from each model,
so a successor version name doesn't really convey that.
Are people fine-tuning LLMs on their local machines with a single GPU? What are people using to scale their training to multiple nodes / GPUs? I've been playing around with Hugging Face Estimators in sagemaker.huggingface, but I'm not sure if there are better options for this.
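For reference, a rough sketch of the sagemaker.huggingface estimator route mentioned above; the role, instance type, version pins, and train.py script are all placeholders to adjust:

```python
# Multi-node / multi-GPU training sketch with the SageMaker HuggingFace
# estimator; every value here is a placeholder.
import sagemaker
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",               # your own training script (e.g. TRL/PEFT code)
    source_dir="./scripts",
    role=sagemaker.get_execution_role(),
    instance_type="ml.p4d.24xlarge",      # 8 x A100 per node
    instance_count=2,                     # 2 nodes -> 16 GPUs
    transformers_version="4.36",          # pick a supported DLC version combo
    pytorch_version="2.1",
    py_version="py310",
    distribution={"torch_distributed": {"enabled": True}},  # launch via torchrun across nodes
    hyperparameters={"epochs": 1, "model_id": "google/gemma-3-4b-it"},
)
estimator.fit({"train": "s3://my-bucket/train/"})  # hypothetical S3 path
```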