Fine-tune Google's Gemma 3
23 comments · March 19, 2025 · bryan0
samspenc
It takes a significant amount of time (a few hours) on a single consumer GPU, even a 4090 / 5090, on a personal machine. I think most people use online services like RunPod, Vast.ai, etc. to rent high-powered H100s and similar GPUs for a few dollars per hour, run the fine-tuning / training there, and just use local GPUs for inference on the fine-tuned models generated on those cloud-rented instances.
jsight
For experimentation? Absolutely. It can often be done overnight for smaller models and reasonably sized GPUs (24GB+).
It'd become a lot less practical with huge datasets, but I'd guess that a lot of fine tuning tasks aren't really that large.
rockwotj
Is anyone outside of the research labs fine-tuning models for production use cases? I have been seeing more people just using foundation models off the shelf, especially in light of a new advancement that seems to come along every few months.
minimaxir
Finetuning is easy and worthwhile, especially with LoRAs, as these Unsloth demos show. The bottleneck then becomes how to self-host the finetuned model in a way that's cost-effective and scalable.
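For concreteness, a minimal sketch of what that kind of LoRA fine-tune looks like with Hugging Face PEFT + TRL (the Unsloth notebooks wrap a similar flow behind their own API); the model name, dataset, and hyperparameters here are illustrative, and the SFTTrainer API varies a bit across TRL versions:

```python
# Minimal LoRA fine-tune sketch with PEFT + TRL; names and settings are
# illustrative, not the exact recipe from the Unsloth demos.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="my_examples.jsonl", split="train")  # hypothetical file

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # adapt only attention projections
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="google/gemma-3-4b-it",        # loaded by name; assumes access to the checkpoint
    train_dataset=dataset,               # expects a "text" or "messages" column
    peft_config=peft_config,             # only the small adapter matrices are trained
    args=SFTConfig(
        output_dir="gemma3-lora",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
    ),
)
trainer.train()
trainer.save_model("gemma3-lora")        # saves just the adapter weights
```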
In practice, prompt engineering and few-shot prompting with modern LLMs tend to be more pragmatic, since their prompt adherence is strong and only getting better over time.
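As a trivial illustration of the few-shot alternative (the task and examples are made up), the entire "setup" is a prompt:

```python
# Few-shot prompting: no training step, just worked examples in the prompt.
prompt = """Classify the support ticket as BILLING, BUG, or OTHER.

Ticket: "I was charged twice this month."
Label: BILLING

Ticket: "The export button crashes the app."
Label: BUG

Ticket: "Do you have a student discount?"
Label: OTHER

Ticket: "My invoice total looks wrong."
Label:"""
# Send `prompt` to whichever model you already use and read back the label.
```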
slopeloaf
Yeah, this, big time. I haven’t found a solution that makes sense. Larger models are already good enough and so convenient.
When it’s more feasible to do inference on the client (browser or desktop), I can see SLMs popping up more commonly in production.
naveen99
If you have the resources to fine-tune, you have the resources to run inference on the fine-tuned model.
If you want to scale up and down on demand, you can just fine-tune on OpenAI or Google Cloud as well.
simonw
> If you have the resources to fine-tune, you have the resources to run inference on the fine-tuned model.
I don't think that's true.
I can fine-tune a model by renting a few A100s for a few hours, total cost in the double-digit dollars. It's a one-time cost.
Running inference with the resulting model for a production application could cost single-digit dollars per hour, which adds up to hundreds or even thousands of dollars a month on an ongoing basis.
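A back-of-the-envelope version of that comparison (rates are illustrative, not quotes):

```python
# Illustrative numbers only: one-time training vs. always-on inference.
finetune_cost = 4 * 8 * 2.50              # 4 A100s x 8 hours x ~$2.50/GPU-hour -> ~$80, once
inference_per_month = 2.00 * 24 * 30      # one ~$2/hour GPU running 24/7 -> ~$1,440, every month
print(finetune_cost, inference_per_month)
```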
317070
I've been finetuning these models since before ChatGPT, and the one lesson I've learned is that by the time you have set everything up to fine-tune a model, you can expect a newer model to do just as well with prompt-tuning.
So, unless you hope to stay at the forefront (e.g. to be ahead of competitors), there has been no real reason to finetune for the last 4 years; at best you could hope to stay about 1-3 months ahead, depending on how fast you were at setting up your training. And if that is what you hoped to achieve, you needed to automate at a higher level, i.e. automate data collection and the collection of eval cases.
nwienert
It feels like there should be a service where I just drag and drop a folder of examples and it fine-tunes the latest DeepSeek or whatever for me, and can even host it for me at some cost. I'd pay for that immediately, but last I checked there was nothing that really did that well (would love to be wrong).
m101
I feel like this is true, but it would be great if you could provide examples so we could get a better idea of why you think/know this.
netdur
I have documents from the last 50 years that I need to digitize, millions of them written in old Arabic. OCR is not accurate on them because the documents are handwritten, so I need to fine-tune a model on around 300k pairs of texts (OCR output and manually corrected versions).
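Not the commenter's actual pipeline, but a rough sketch of turning such (OCR output, manually corrected) pairs into chat-style training examples for supervised fine-tuning; the file paths and field names ("ocr", "corrected") are assumptions:

```python
# Sketch: convert OCR/correction pairs into chat-format SFT examples.
# Field names and file paths are assumptions.
import json

def to_example(pair):
    return {
        "messages": [
            {"role": "user",
             "content": "Correct the OCR errors in this Arabic text:\n" + pair["ocr"]},
            {"role": "assistant", "content": pair["corrected"]},
        ]
    }

with open("pairs.jsonl", encoding="utf-8") as src, \
     open("train.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(json.dumps(to_example(json.loads(line)), ensure_ascii=False) + "\n")
```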
Diederich
This sounds very interesting; can you share more? Thanks!
simonw
I've had trouble getting a great answer to this question - I ask it in various places every month or so, most recently here: https://nitter.net/simonw/status/1895301139819860202
On paper, fine-tuning smaller models can greatly reduce the cost for a specific task, but I've not heard many real-world success stories around that.
I think vision LLMs are one of the most interesting applications here - things like fine-tuning for better results when extracting data from a specific paper form or report structure. Again, not many public examples of that.
kiratp
We use multiple post-trained models in production, at scale at https://osmos.io
simonw
Have you published details of how you're doing that anywhere?
Could be a useful marketing strategy for you, given how starved we all are of information about successful fine tuning stories.
icelancer
We were, but with the models becoming so good, so large, and so cheap, we've largely abandoned it in our long-term roadmap.
deepsquirrelnet
I’m trying right now. The combination of small models, QLoRA, and GRPO has made it accessible to experimenters. I’m not using Unsloth yet, but I will probably start checking it out pretty soon so that I can train larger models or increase the number of generations for GRPO.
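For anyone curious what that combination looks like, a rough sketch using TRL's GRPOTrainer with a LoRA adapter (plain LoRA rather than 4-bit QLoRA for brevity); the reward function, dataset, and hyperparameters are placeholders, and the API differs across TRL versions:

```python
# GRPO sketch with TRL + a LoRA adapter; everything here is a placeholder.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward that prefers shorter completions; a real task would score correctness.
    return [-float(len(c)) for c in completions]

dataset = load_dataset("json", data_files="prompts.jsonl", split="train")  # needs a "prompt" column

trainer = GRPOTrainer(
    model="google/gemma-3-1b-it",                 # assumed small base model
    reward_funcs=reward_len,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=GRPOConfig(
        output_dir="gemma3-grpo",
        num_generations=8,              # more samples per prompt = better advantage estimates, more memory
        per_device_train_batch_size=8,  # keep divisible by num_generations on a single GPU
    ),
)
trainer.train()
```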
jbentley1
I am. I have some use cases related to data extraction where a fine-tuned small model outperforms the best-in-class closed-source models at a fraction of the cost.
refulgentis
IMHO the biggest factor holding that back is how rushed and distant these model releases still are.
Both Phi-4-mini and Gemma 3 were released recently. Phi-4's damn close to a good, real model release. Microsoft's done a great job of iterating.
Gemma 3's an excellent, intelligent model, but it's got a gaping blind spot: tool-calling / JSON output. There was a vague, quick handwave about it in some PR; a PM/eng on the Gemma team commented here in response to someone else that, TL;DR, "it's supported in Ollama!", which is Not Even Wrong, i.e. in the Pauli sense of the phrase.
- Ollama uses a weak, out-of-date llama.cpp mechanism where the output tokens are constrained to match a JSON schema. This falls apart almost immediately, i.e. as soon as there is more than one tool.
- The thing that matters isn't whether we can constrain output tokens; any model can do that, and I've had Llama 3 1B making tool calls that way. The thing that matters is A) did you train that in, and B) if you did, tell us the format.
All that to say, IMHO we're still 6 months to a year out from BigCo understanding enough about their own stuff to even have a good base for it. Sure, tool calling and fine-tuning are orthogonal, in a sense, but in practice, if I'm interested in getting a specific type of output, odds are I want it formatted a specific way.
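To make the formatting complaint concrete, here's a purely illustrative sketch of the kind of contract in question - a tool list, a prompt asking for JSON calls, and a validation step. Whether Gemma 3 was trained on any particular format is exactly the undocumented part, so none of this reflects an official spec:

```python
# Illustrative only: a tool schema, a JSON-calling prompt, and validation.
# Constrained decoding can guarantee valid JSON, but not that the model picks
# the right tool once there is more than one -- the failure mode described above.
import json

tools = [
    {"name": "get_weather", "parameters": {"city": {"type": "string"}}},
    {"name": "get_time",    "parameters": {"timezone": {"type": "string"}}},
]

system_prompt = (
    "You can call these tools. Reply ONLY with JSON of the form "
    '{"tool": <name>, "arguments": {...}}.\n' + json.dumps(tools)
)

def parse_tool_call(model_output: str):
    call = json.loads(model_output)
    if call.get("tool") not in {t["name"] for t in tools}:
        raise ValueError("model called a tool that doesn't exist")
    return call["tool"], call["arguments"]
```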
tough
Could one train a Gemma 3 fine-tune for tool use now?
Found this on HF: https://huggingface.co/ZySec-AI/gemma-3-27b-tools
siliconc0w
It likely makes sense to use more expensive frontier models as teachers or architects for smaller fine-tuned ones that generate the majority of tokens (though possibly against the ToS).
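A rough sketch of that teacher pattern, with the OpenAI client standing in for any frontier API; model names and prompts are assumptions, and (as noted) provider terms may restrict using outputs to train other models:

```python
# Sketch: use an expensive "teacher" model to label data, then fine-tune a
# small "student" on the result. Model names and prompts are placeholders.
import json
from openai import OpenAI

client = OpenAI()
prompts = ["Summarize: ...", "Summarize: ..."]   # your real task inputs

with open("distilled.jsonl", "w") as out:
    for p in prompts:
        resp = client.chat.completions.create(
            model="gpt-4o",                      # the expensive teacher
            messages=[{"role": "user", "content": p}],
        )
        out.write(json.dumps({
            "prompt": p,
            "completion": resp.choices[0].message.content,
        }) + "\n")
# distilled.jsonl then becomes the training set for the cheap fine-tuned student.
```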
yieldcrv
Instead of versions, these things should be labeled by their release date, since this kind of training starts from a dataset snapshot in time, colloquially called the knowledge-cutoff date, which isn't really accurate.
We are optimizing these along different dimensions at once, with multiple branches of evolution from each model,
so a successor version name doesn't really convey that.
Are people fine-tuning LLMs on their local machines with a single GPU? What are people using to scale their training to multiple nodes / GPUs? I've been playing around with Hugging Face Estimators in sagemaker.huggingface, but I'm not sure if there are better options for this.
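For reference, a rough sketch of the sagemaker.huggingface estimator route mentioned above; the role, instance type, version pins, and train.py script are all placeholders to adjust:

```python
# Multi-node / multi-GPU training sketch with the SageMaker HuggingFace
# estimator; every value here is a placeholder.
import sagemaker
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",               # your own training script (e.g. TRL/PEFT code)
    source_dir="./scripts",
    role=sagemaker.get_execution_role(),
    instance_type="ml.p4d.24xlarge",      # 8 x A100 per node
    instance_count=2,                     # 2 nodes -> 16 GPUs
    transformers_version="4.36",          # pick a supported DLC version combo
    pytorch_version="2.1",
    py_version="py310",
    distribution={"torch_distributed": {"enabled": True}},  # launch via torchrun across nodes
    hyperparameters={"epochs": 1, "model_id": "google/gemma-3-4b-it"},
)
estimator.fit({"train": "s3://my-bucket/train/"})  # hypothetical S3 path
```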