The case for the return of fine-tuning

15 comments

October 19, 2025

meander_water

A couple of examples I have seen recently which make me agree with OP:

- PaddleOCR, a 0.9B model that reaches SOTA accuracy across text, tables, formulas, charts & handwriting. [0]

- 3B and 8B models that perform HTML-to-JSON extraction at GPT-5-level accuracy at 40-80x lower cost, with faster inference. [1]

I think it makes sense to fine tune when you're optimizing for a specific task.

[0] https://huggingface.co/papers/2510.14528

[1] https://www.reddit.com/r/LocalLLaMA/comments/1o8m0ti/we_buil...

soVeryTired

Have you used PaddleOCR? I'm surprised they're claiming SOTA without comparing against Amazon Textract or Azure Document Intelligence (LayoutLM v3 under the hood, as far as I know).

I've played around with doc recognition quite a bit, and as far as I can tell those two are best-in-class.

empiko

Fine-tuning is a good technique to have in your toolbox, but in practice it is feasible only for some use cases. On one hand, many NLP tasks are already easy enough for LLMs to reach near-perfect accuracy, so fine-tuning isn't needed. On the other hand, really complex tasks are really difficult to fine-tune for, and clean data collection can be pretty expensive. Fine-tuning helps with the use cases somewhere in the middle: not too simple, not too complex, and feasible for data collection.

libraryofbabel

What would you say is an example of one of those “middle” tasks it can help with?

CaptainOfCoit

An example I just found worked very well with fine-tuning: I wanted to extract any frame that contained a full-screen presentation slide from various videos I've archived, only when the slide was full-screen, while also not capturing frames where a video was playing, plus some other constraints.

Naturally I reached for CLIP+ViT, which got me a ~60% success rate out of the box. From there, I wrote a tiny training script that read `dataset/{slide,no_slide}` and trained a new classification head on the frozen CLIP embeddings. After adding ~100 samples of each class, the success rate landed at 95%, which was good enough to call it done and circle back to iterate once I have more data.

I ended up with a 2.2 KB "head_weights.safetensors" that increased accuracy by ~35 percentage points, which felt really nice.
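
For the curious, here is a minimal sketch of what that kind of script can look like. The `dataset/{slide,no_slide}` layout and the `head_weights.safetensors` output come from the comment above; the specific CLIP checkpoint, training loop, and hyperparameters are illustrative assumptions:

```python
# Sketch: train a tiny linear head on frozen CLIP image embeddings.
# Assumes dataset/{slide,no_slide}/*.jpg; the checkpoint and training
# settings are illustrative, not from the original comment.
from pathlib import Path

import torch
import torch.nn as nn
from PIL import Image
from safetensors.torch import save_file
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    """Frozen CLIP image embeddings; the backbone itself is never trained."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Label 0 = no_slide, 1 = slide, taken from the folder names.
paths, labels = [], []
for label, folder in enumerate(["no_slide", "slide"]):
    for p in Path("dataset", folder).glob("*.jpg"):
        paths.append(p)
        labels.append(label)

X = embed(paths)
y = torch.tensor(labels, dtype=torch.float32, device=device)

# The "new head": one linear layer over the 512-dim CLIP embedding.
head = nn.Linear(X.shape[1], 1).to(device)
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(head(X).squeeze(-1), y)
    loss.backward()
    opt.step()

save_file({k: v.cpu() for k, v in head.state_dict().items()},
          "head_weights.safetensors")
```

Note that a Linear(512, 1) head is 513 float32 values, roughly 2 KB plus the safetensors header, which lines up with the ~2.2 KB file size mentioned above.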

leblancfg

Fine-tuning was never really hard to do locally if you had the hardware. What I'd like to read in an article like this is more detail on why it's making a comeback.

Curious to hear others’ thoughts on this

melpomene

This website loads impressively fast (from Europe)! Rarely seen anything snappier. Dynamic loading of content as you scroll, small compressed images (WebP) that don't look compressed. Well crafted!

hshdhdhehd

Magic of a CDN? Plus avoiding JS probably. Haven't checked source though.

oli5679

The OpenAI fine-tuning API is pretty good - you need to label an evaluation benchmark anyway to systematically iterate on prompts and context, and it often produces good results from 50-100 examples, either beating frontier models or letting a far cheaper, faster model catch up.

It requires no local GPUs, just creating a JSONL file and posting it to OpenAI.

https://platform.openai.com/docs/guides/model-optimization
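
The whole flow is building a JSONL of chat-formatted examples plus two API calls. A minimal sketch with the OpenAI Python SDK follows; the file name, the example rows, and the base model are illustrative assumptions (the docs linked above list which models are currently fine-tunable):

```python
# Sketch: create a training file in the chat JSONL format and start a
# fine-tuning job via the OpenAI Python SDK. Example content and base
# model are illustrative.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each line of the JSONL is one training example in chat format.
examples = [
    {"messages": [
        {"role": "system", "content": "Extract the product name and price as JSON."},
        {"role": "user", "content": "<li>Acme Anvil - $49.99</li>"},
        {"role": "assistant", "content": '{"name": "Acme Anvil", "price": 49.99}'},
    ]},
    # ... 50-100 labeled examples, as the comment suggests
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

uploaded = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=uploaded.id,
    model="gpt-4o-mini-2024-07-18",  # a fine-tunable model at time of writing
)
print(job.id)  # poll client.fine_tuning.jobs.retrieve(job.id) until done
```

Conveniently, the same labeled examples double as the evaluation benchmark the comment mentions, so the labeling effort pays off twice.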

deaux

They don't offer it for the GPT-5 series, so much of the time fine-tuning Gemini 2.5 Flash is a better deal.

CuriouslyC

Fine-tuning by continued pretraining over an RL-tuned model is dumb AF. RL task tuning works quite well.

HarHarVeryFunny

You may have no choice in how the model you are fine-tuning was trained, and may have no interest in the verticals it was RL-tuned for.

In any case, platforms like tinker.ai support both SFT and RL.

CuriouslyC

Why would you choose a model whose trained-in priors don't match your use case? Also, keep in mind that RL'd-in behavior includes things like reasoning and how to answer questions correctly, so you're literally taking smart models and making them dumber by doing SFT. To top it off, SFT only produces really good results when you have traces that closely model the actual behavior you're trying to get the model to display. If you're just trying to fine-tune in a knowledge base, a well-tuned RAG setup + better prompts wins every time.

imcritic

Because you need a solution to your problem, the available tools are what they are, and you don't have the resources to train your own model.
