Distillation makes AI models smaller and cheaper

FlyingLawnmower

Sidenote, but the scholarship on distillation always makes me a bit sad. The original work, cited in the abstract of the Hinton, Vinyals, and Dean paper that everyone cites, was the model compression work from Caruana, Buciluǎ, and Niculescu-Mizil.

The distillation paper added minor parameter tweaks and had a fancier name, but the essence of the method came from Caruana et al.'s model compression paper: https://dl.acm.org/doi/abs/10.1145/1150402.1150464

phreeza

One mind-bending thing is that self-distillation, meaning distilling one model into another of the same architecture, number of parameters, etc., also often works! https://arxiv.org/abs/2206.08491

flukas88

Also makes OpenAI moan about companies stealing from them, when they stole the internet for free.

tcldr

Exactly. This is the argument that I find lacking from today's discourse: AI companies are already extracting generations' worth of human intellectual data into their models. If they want to argue that this is 'fair use', then model distillation is, too. Can't have it both ways.

an0malous

You can when the laws exist to serve the investor class instead of fairness and justice. There is a ludicrous amount of money in AI now, it has become a central initiative of the current administration and defense industry. The large AI companies will get whatever they want now.

GuB-42

It is complicated, and culture and legal systems will have to adapt.

But you can have it both ways. Often, the distinction between fair and unfair is whether you are competing against the authors directly.

Take Ghibli memes, for instance. While obviously the result of training on Studio Ghibli content without permission, they don't compete against Studio Ghibli directly. Studio Ghibli doesn't draw memes, and ChatGPT doesn't make feature films or copy official artwork. I don't think Studio Ghibli lost anything to the memes; they are not in the same business. So it could be considered fair use.

Training an LLM on data from a law firm to make a search engine that directly competes with that firm's own search engine is not fair use, and there is legal precedent (Thomson Reuters v. Ross). Training your model from another model to compete against them would be the same kind of thing.

There is plenty of nuance, like how transformative the use is. But it is possible that extracting massive amounts of data is fair use while distillation is not. Plenty of people are at work on the question right now.

miki123211

OpenAI is transforming those works; DeepSeek is not.

OpenAI takes in code, books and articles and produces a model. This model can be used for novel tasks, like paraphrasing your own writing, translating your text into a different language, or writing code according to a provided specification, even if there was nothing in the original corpus that exactly solved your problem.

To produce this model, you need four ingredients: the data, the compute, the research effort, and a lot of tedious RLHF work. While OpenAI uses the first one without providing author compensation (and it has no other option here), the latter three it provides entirely on its own.

People distilling from OpenAI do not create transformative works. They take OpenAI's model and make a model of their own. Both models can do very similar things and are suitable for very similar purposes.

Distillation is just a particularly easy way of making an inexact copy of the model weights. The values of those weights will be very different, just as the values of each pixel in an illicit camera recording of a movie at a cinema are very different from those in the original version, but the net result is the same.

AdamConwayIE

People always forget that back when OpenAI accused DeepSeek of distillation, o1's reasoning process was locked down, with only short sentences shared with the user as it "thought." There was a paper published in November 2024 from Shanghai Jiao Tong University that outlined how one would distill information from o1[1], and it even says that they used "tens of thousands" of o1 distilled chains. Given that the primary evidence given for distillation, according to Bloomberg[2], was that a lot of data was sent from OpenAI developer accounts in China in late 2024, it's not impossible that this (and other projects like it) could also have been the cause of that.

The thing is, given the other advances that were outlined in the DeepSeek R1 paper, it's not as if DeepSeek needed to coast on OpenAI's work. The use of GRPO RL, not to mention the training time and resources that were required, is still incredibly impressive, no matter the source of the data. There's a lot that DeepSeek R1 can be credited with in the LLM space today, and it really did signify a number of breakthroughs all at once. Even their identification of naturally emergent CoT through RL was incredibly impressive, and led to it becoming commonplace across LLMs these days.[3]

It's clear that there are many talented researchers on their team (their approach to MoE with its expert segmentation and expert isolation is quite interesting), so it would seem strange that with all of that talent, they'd resort to distillation for knowledge gathering. I'm not saying that it didn't happen, it absolutely could have, but a lot of the accusations that came from OpenAI/Microsoft at the time seemed more like panic given the stock market's reaction than genuine accusations with evidence behind them... especially given we've not heard anything since then.

[1] https://github.com/GAIR-NLP/O1-Journey
[2] https://www.bloomberg.com/news/articles/2025-01-29/microsoft...
[3] https://github.com/hkust-nlp/simpleRL-reason

tcldr

Just because we're unable to compensate many millions, perhaps billions, of people for using their work without a) permission or b) remuneration doesn't justify granting a blanket license to use it without some form of *serious* compensation that reflects the gravity of what is being created.

The current winner-takes-all approach to the outcome is wholly inappropriate. AI companies right now are riding atop the shoulders of giants: data, mathematics and science that humanity has painstakingly assembled, discovered, developed and shared over millennia. Now we're saying the companies that tip the point of discovery over into a new era should be our new intellectual overlords?

Not cool.

It's clear that model creators and owners should receive some level of reward for their work, but to discount the intellectual labour of generations as worthless is clearly problematic. Especially given the implications for the workforce and society.

Ultimately we'll need to find a more equitable deal.

Until then, forgive me if I don't have much sympathy for a company that's had its latest model distilled.

LearnYouALisp

You mean making something sound like it was either written on Reddit or in a paper mill, so that it requires effort to quickly find the material of value, like reading a machine translation?

atmosx

Funny how that works :-)

cma

Not just that: o1 didn't even show its real chain of thought, yet OpenAI said DeepSeek distilled from them to make their reasoning model. Distilling what wasn't there.

Lionga

[flagged]

NitpickLawyer

The article is pretty light on details, and misses (or I missed it if they mentioned it) an important distinction. There are two main types of distillation:

- completion-based methods, where you take a big model, give it some queries, and use the answers to post-train a smaller model. This is what DeepSeek did with Qwen models: they took ~800k traces made by R1 and used SFT on smaller Qwen2.5 models. What the Sky team found in their experiments is that you can use as few as 1-2k traces to reach similar results. Much cheaper.

- logit/internal-representation-based methods, where you need access to the raw model, and for each q -> response pair you train the small model on the entire distribution of logits at once. This method suits model creators, who can take a pair of big + small models of the same architecture family and "distill" the big one into the smaller one. This is likely how they train their -flash, -mini, -pico variants and so on.

The first method can be used via API access. The second one can't. You need access to things that API providers won't give you.
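For the completion-based flavour, a minimal sketch looks something like this. The model names, toy prompt set, and hyperparameters here are stand-ins I've chosen for illustration, not what DeepSeek or the Sky team actually used:

```python
# Minimal sketch of completion-based distillation: sample traces from a
# teacher, then do plain SFT on a smaller student. Models and prompts are
# illustrative placeholders, not any real teacher/student pairing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("gpt2-medium").eval()  # stand-in teacher
student = AutoModelForCausalLM.from_pretrained("distilgpt2")          # stand-in student
tok = AutoTokenizer.from_pretrained("gpt2")                           # shared tokenizer

prompts = ["Explain what model distillation is in one sentence."]     # toy query set

# 1) Collect teacher completions. This is the only thing an API gives you,
#    which is why this flavour works against closed models.
traces = []
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt")["input_ids"]
        out = teacher.generate(ids, max_new_tokens=64, do_sample=True, top_p=0.9,
                               pad_token_id=tok.eos_token_id)
        traces.append(tok.decode(out[0], skip_special_tokens=True))

# 2) Ordinary supervised fine-tuning of the student on the collected traces.
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)
student.train()
for text in traces:
    batch = tok(text, return_tensors="pt")
    loss = student(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```

In practice you'd mask the prompt tokens out of the loss and batch things properly, but the shape of the method really is just "sample the teacher, SFT the student."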

m12k

From the article:

"Considering that the distillation requires access to the innards of the teacher model, it’s not possible for a third party to sneakily distill data from a closed-source model like OpenAI’s o1, as DeepSeek was thought to have done. That said, a student model could still learn quite a bit from a teacher model just through prompting the teacher with certain questions and using the answers to train its own models — an almost Socratic approach to distillation."

NitpickLawyer

Right, my bad then, I read it in a hurry. They do mention the distinction.

dr_dshiv

Like Phi: textbooks are all you need. You can create entirely synthetic yet high-quality training data with a strong model (the generated textbooks) and make very small models like Phi.

pyman

This is exactly what the DeepSeek team did, and now Anthropic is repackaging it a year later, calling it “subliminal learning” or using the teacher and student analogy to take credit for work done by Chinese researchers.

https://malted.ai/deepseek-and-the-future-of-distillation/

While Anthropic and OpenAI are still trying to make sense of what China's top computer scientists pulled off a year ago, something that shook the core of Nvidia's business, China is now showcasing the world's first commercial unhackable cryptography system using QKD and post-quantum cryptography to secure all phone calls between Beijing and Hefei.

dwohnitmok

You're misunderstanding subliminal learning.

Subliminal learning is a surprising result that sheds more light on the process of distillation. It's not Anthropic trying to take credit for distillation.

In particular, subliminal learning is the finding that a student model distilled from a teacher model has a communication channel with the teacher model that is extremely difficult to observe or oversee.

If you fine-tune the teacher model on a very specific thing (in Anthropic's case, fine-tuning the teacher to prefer owls over other animals) and then simply prompt the teacher model to output "random" digits with no reference to owls whatsoever, training the student model on this stream of digits results in the student also developing a preference for owls over other animals.

This is a novel result and has a lot of interesting implications both for how distillation works as a mechanism and also for novel problems in overseeing AI systems.

rcxdude

>While Anthropic and OpenAI are still trying to make sense of what China's top computer scientists pulled off a year ago

The whole reason they're accusing them of distilling their models is that this is a well-known technique that's relatively easy compared to creating or improving a model in the first place. DeepSeek was impressive for how lean it was (and it shook the markets because it demonstrated plainly what the savvier observers had already figured out: that the big US AI companies didn't have a huge moat), but they certainly did not come up with this concept.

anonymoushn

"subliminal learning" does not even work for use cases like distilling o1 to R1 because they do not share a base model

danieldk

> This is exactly what the DeepSeek team did, and now Anthropic is repackaging it a year later, calling it “subliminal learning” or using the teacher and student analogy to take credit for work done by Chinese researchers.

What? Distillation is way older. The Hinton paper was from 2015 (maybe there is even earlier work):

https://arxiv.org/abs/1503.02531

When I was still in academia, we were distilling models from BERT/RoBERTa-large to smaller models (remember when those models were considered large?) in 2019, using logits and L2 distance of hidden layers. Before that we were also doing distillation of our own transformer/LSTM models on model outputs (though with a different motivation than model compression, e.g. to learn selectional preferences).
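A minimal sketch of that kind of combined loss (soft logits plus an L2 term on hidden states); the temperature and weighting values are illustrative, not any particular paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Soft-label KL on temperature-scaled logits plus L2 (MSE) on hidden states.
    Assumes matched hidden widths; if they differ, a learned linear projection
    on the student side is the usual fix."""
    soft_kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2          # standard T^2 rescaling from Hinton et al.
    hidden_l2 = F.mse_loss(student_hidden, teacher_hidden)
    return alpha * soft_kl + (1.0 - alpha) * hidden_l2
```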

ACCount36

[flagged]

sebau

I wonder how a company like OpenAI can be stolen from/distilled via API without noticing, given the amount of data that is needed even for smaller models.

ben_w

Stolen: There was some research a year or so ago that showed if you have access to the probability distribution for the next token, you can efficiently steal some layers of the model. When this work was done, OpenAI switched off direct access to those probabilities.

Distilled: Two years ago, one of the AI podcasts I was listening to (probably TWIML&AI) had someone use a big model to create a small high-quality training set for another model (as I understand it, this is what Microsoft's Phi series does, but that wasn't the example in whichever podcast I'm thinking of).

And remember, OpenAI's price for a million tokens is a rounding error for most businesses. Last year's reported revenue of USD 3.7 billion* suggests their customers collectively paid them for on the order of a quadrillion tokens in and out, so even getting a trillion tokens from them without them noticing what you're up to (so long as you paid) is very plausible.

* https://www.cnbc.com/2024/09/27/openai-sees-5-billion-loss-t...
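The order-of-magnitude claim checks out under a crude assumption; treating all revenue as token sales at a blended price, where the $4-per-million figure below is a guess rather than OpenAI's actual rate card:

```python
revenue_usd = 3.7e9                 # reported 2024 revenue, per the linked article
blended_usd_per_million = 4.0       # assumption: rough blended input+output price
tokens = revenue_usd / blended_usd_per_million * 1e6
print(f"{tokens:.1e}")              # ~9.2e14, i.e. on the order of a quadrillion tokens
```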

oblio

Corporate espionage, or a distributed, concerted scraping effort. That would make OpenAI's user counts completely useless, but it doesn't sound impossible. If anyone could pull this off, it's a Chinese company.

funfunfunction

There are even companies starting to offer distillation as a service https://inference.net/explore/model-training

pyman

In 2024, DeepSeek's researchers used the DeepSeek-R1 model to transfer knowledge to a smaller model using distillation:

https://malted.ai/deepseek-and-the-future-of-distillation/

Honest question:

Isn't this exactly what the DeepSeek team did, and now Anthropic is repackaging it a year later, calling it “subliminal learning” or using the teacher and student analogy to take credit for work done by Chinese researchers?

It's like if China claimed they invented the Transformer by renaming it the “Pattern Matching architecture.”

Why is Anthropic doing this? Isn't this the same company that recently scraped 7 million books? And now they’re “transforming” research papers too?

rcxdude

>and now Anthropic is repackaging it a year later, calling it “subliminal learning”

No, distillation and student/teacher training is a well-known technique (much older than even the original ChatGPT), and Anthropic are not claiming to have invented it (that would be laughable to anyone familiar with the field). "Subliminal learning" is an observation by Anthropic about something surprising that can happen during the process, which is that, for sufficiently similar models, behaviour can be transferred from teacher to student that is not obviously present in the information transferred between them (i.e. the text output by the teacher and used to train the student; for example, the student's "favourite animal" changed despite the fact that the teacher was only producing 'random' numbers for the student to try to predict).

pyman

> something surprising that can happen during the process, which is that, for sufficiently similar models, behaviour can be transferred from teacher to student

By "behaviour" they mean data and pattern matching, right? Alan Turing figured that out in the 1940s.

LLMs aren't black boxes doing voodoo, like we like to tell politicians and regulators. They're just software processing massive amounts of data to find patterns and predict what comes next. It looks magical, but it's maths and stats, not magic.

This post is just selling second-hand ideas. And for those of us outside the US who spend all day reading scientific papers, sorry Anthropic, we're not buying it.

rcxdude

I don't think Alan Turing would have predicted the full sentence that I wrote there. The first half is not the interesting or surprising part! And of course it's not magic, but mathematics does in fact contain a lot of things we don't actually understand yet, and for systems like LLMs we don't, in general, have particularly robust mathematical frameworks relating their structure to their observed behaviour (compared to other, much simpler, structures).

ben_w

> By "behaviour" they mean data and pattern matching, right? Alan Turing figured that out in the 1940s.

That's like saying Da Vinci figured out heavier-than-air flight. Useful foundation, obviously smart and on the right track, still didn't actually do enough to get all the credit for that.

> It looks magical, but it's maths and stats, not magic.

People keep saying "AI isn't magic, it's just maths" like this is some kind of gotcha.

Turning lead into gold isn't the magic of alchemy, it's just nucleosynthesis.

Taking a living human's heart out without killing them, and replacing it with one you got out of a corpse, isn't the magic of necromancy, nor is it a prayer or ritual to Sekhmet; it's just transplant surgery.

And so on: https://www.lesswrong.com/posts/hAwvJDRKWFibjxh4e/it-isn-t-m...

Even with access to the numbers and mechanisms, the inner workings of LLMs are as clear as mud and still full of surprises. Anthropic's work was, to many people, one such surprise.

Icko_

Distillation and teacher-student models are definitely way older than 2024.

pyman

My point is: OpenAI raised $40 billion and Anthropic raised $10 billion, claiming they needed the money to buy more expensive Nvidia servers to train bigger models. Then Chinese experts basically said, no you don't. And they proved it.

ACCount36

[flagged]

Animats

A good question is whether you can grind a model specialized for, say, customer service for your products down to where it's really cheap to run on an ordinary server, maybe with a GPU card.

Are we really going to need all those giant AI data centers?

vasco

Our brain works on a couple of bananas, so at least the amount of energy required for just inference doesn't look like it needs to be a lot. Training is another subject, because we have that embedded in DNA and cultural behavior, so it's trickier.

seer

Well, in this analogy "training" is the thousands of cycles of sleep, moving and rearranging the brain-cell connections at night. That is _a lot_ of bananas, though obviously not all of the energy of growing up goes to brain rearranging.

Still, it shouldn't be more than a few buckets of fat, if you only count the NREM "training" bit of sleep.

stingraycharles

No, that’s reinforcement learning and small incremental model updates. The real initial training & model deployment is more akin to DNA. Models cannot “learn” the same way humans do.

xwolfi

Well yeah you have to look at the entire training duration for your brain. It did take a while to be as perfect as you seem to be, several billion years, and I'm sure you make mistakes sometimes and hallucinate stupid ideas.

And you won't run for long on a couple of bananas: the brain is not just there to infer, it also needs to manage the body's autonomic systems, which require much more energy.

TheFuzzball

> Our brain works on a couple of bananas

What a fantastic non sequitur

pama

Silicon is already more efficient for inference than the brain. If we use centralized decoding of the V3/R1-scale models as a baseline, one can produce 720,000 tokens (a wild guess for the tokens humans could produce in 24 hours) using the energy of only 0.36 bananas. Deeply thinking humans expend up to a third of their total energy on the brain, but cannot sustain themselves on a single banana per day.

(You can use an LLM to check this work at the cost of a tiny speck of a banana, eg: https://grok.com/share/c2hhcmQtMw%3D%3D_60f4890d-711b-4331-9... )
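Back-solving those numbers as a rough check; the per-token energy here is simply whatever makes the figures consistent (about 0.2 J per generated token), an assumption rather than a measured value:

```python
kcal_per_banana = 105                       # typical medium banana
joules_per_banana = kcal_per_banana * 4184  # ~4.4e5 J
tokens_per_day = 720_000                    # the comment's guess for a human's daily output
joules_per_token = 0.22                     # assumed cost of centralized V3/R1-scale decoding
bananas = tokens_per_day * joules_per_token / joules_per_banana
print(f"{bananas:.2f} bananas")             # ~0.36
```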

Vetch

The brain is certainly vastly more energy efficient at inference than LLMs on GPUs. But it looks like you're trying to make a different argument, that an LLM can spend less energy than a human to complete a given task. Unfortunately, you have not made that argument and I won't be reading unverified LLM output that might contain hallucinated steps or claims.

> V3/R1 scale models as a baseline, one can produce 720,000 tokens

On what hardware? At how many tokens per second? But most importantly, at what quality? I can use a PRNG to generate 7 billion tokens at a fraction of the energy use of an LLM, but those tokens are not going to be particularly interesting. Simply counting how many tokens can be generated in a given time frame is still not a like-for-like comparison. To be complete, the cost required to match human-level quality, if possible, also needs accounting for.

> Deeply thinking humans expend up to a a third of their total energy on the brain

Where did you get this from? A 70B LLM? It's wrong or, at best, does not make sense. The brain barely spends any more energy above its baseline when thinking hard (often not much more than 5%). This is because most of its energy use goes to things like upkeep and maintaining resting membrane potential. Ongoing "background activity" like the DMN also means the brain is always actively computing something interesting.

bildung

Well, compared to the human brain, LLMs do approximately zero work. An LLM neuron is at least 3 orders of magnitude less complex than a neuron in the human brain, and this factor only accounts for the neuronal intrinsics we currently know of.

ben_w

We've already got distilled-down versions of models designed to fit on consumer-sized devices; they are definitely not as performant as the bigger models.

But the models are RAM-limited, not compute-limited, and there's no reason consumer devices need to have their current RAM limits. Get 256 GB of RAM in your phone and an LLM may drain the battery in 15 minutes, and I have no idea about the bus bandwidth, but the NPU (e.g. the Neural Engine in Apple SoCs for the last few years) is already enough for the compute part of the problem.

yummybear

Even further: could it download a distilled model at runtime in response to your type of question? If we're talking vacation planning, download vacation.model for 10 seconds and then let's talk?

dragochat

YES

We'll always find uses for more intelligence if it keeps getting more and more general (I don't like the term AGI because I think the "G" there is a quantity, not a quality, and humans are very low on generality too compared to what could be mathematically and physically possible for intelligence in our universe).

...we won't stop until the planet is papered with compute hardware UNLESS we accelerate space development too (that's why SPACE is CRUCIAL!) and go grind the asteroid belt into thousands of datacenters too, then on and on.

There's a whole yummy lightcone that awaits to be eaten :P

msgodel

You could probably use some heuristic on the training tokens to weight customer-service-related data higher.

sebau

For what it's worth, nearly all public models are distilled versions of bigger internal ones.

arnaudsm

Even flagships like o3 and Gemini 2.5 Pro?

ffsm8

I doubt you'll get a response from someone with authority on the matter (someone who actually worked on these models and is willing and authorized to post about it publicly)... so I'm gonna add my uninformed consumer perspective:

I sincerely doubt that o3/2.5 Pro haven't been distilled. It's unimaginable to me that they're that price-insensitive (or, expressed inversely, that they were so thrifty in training that the final product can be served to consumers without optimization).

The only conclusion I can come to is that they're indeed not letting you access the "root" models.

regularfry

The more conservative version of this is that they'd want distilled models even if only as a speculative decoder to stick in front of the main model. That's an obvious optimisation to make.

creshal

I think OpenAI even mentioned in some papers that the internal o4(?) model used for some tests cost $6000 per query, pre-release.

That's absolutely getting distilled down for releases.

jgalt212

Distillation was formerly the key to usable self-hosted models. However, the unceasing pressure to be "agentic" has made self-hosting once again untenable. Agentic tools just hoover up too many tokens.

ricardobeat

If they use more tokens, isn't that a case in favor of self-hosting to reduce costs? Or are you saying performance is not good enough for local agents?

regularfry

More tokens in the context means disproportionately more VRAM, to the extent that you really do need multiple GPUs if you're running an interestingly-sized model.
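Most of that extra VRAM is KV cache, which grows linearly with context length but gets large fast. A rough sizing sketch, with assumed (illustrative) model dimensions rather than any specific model's real config:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_elem: int = 2) -> float:
    # One key and one value vector per layer per token, fp16 by default.
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 2**30

# Assumed 70B-class dense model with GQA: 80 layers, 8 KV heads, head_dim 128.
print(f"{kv_cache_gib(80, 8, 128, 8_000):.1f} GiB at 8k context")      # ~2.4 GiB
print(f"{kv_cache_gib(80, 8, 128, 128_000):.1f} GiB at 128k context")  # ~39 GiB
```

And that's on top of the weights themselves, per concurrent request, which is why long agentic contexts push you to multiple GPUs.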

v3ss0n

Sometimes better, sometimes dumber

wizardforhire

Obligatory [1]

My apologies for not being able to find the original tale. I’m sure the original website is around but this is a decent synopsis regardless.

Doesn’t look like they cover it in the article but if I remember correctly they pruned the model down to fit on 56k eprom that was able to be sold for originally $10 (also dating myself, this article claims $15)

And of course the jargon has changed with time. I guess we're saying distilled now; originally we said pruned… because that's what you did: once you had your weights, you would prune the rest of the network to get the core model. I guess distilled works also, just less literal imho. I guess if we want to get really pedantic, networks exist in liquids, but I digress.

[1] (apologies for the ad crap, best I could find) https://www.mentalfloss.com/article/22269/how-electronic-20-...

DoctorOetker

Pruning and distilling are two totally different things.

Pruning: discarding low-weight connections after training. It makes the network sparser but also less regular (complications for memory layout and for the compute kernels that access the sparse network weights).

Distilling: take a large pretrained model and train a smaller one from it. For example, consider a cloze task (fill in the blanked token in a sentence): compute the probabilities using the large model, then train the smaller model to reproduce the same probabilities.

Distilling is a form of fitting into a smaller, regular network of potentially totally different architecture, while pruning discards low-weight coefficients, resulting in a sparser network.
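For contrast, unstructured magnitude pruning is about as simple as it sounds; a toy sketch, and exactly the kind of irregular sparsity mentioned above:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    """Zero out the smallest-magnitude weights. The tensor keeps its dense shape,
    so without a sparse storage format you save nothing; that irregularity is the
    memory-layout / kernel complication mentioned above."""
    k = max(1, int(weight.numel() * sparsity))
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

pruned = magnitude_prune(torch.randn(1024, 1024), sparsity=0.9)
print((pruned == 0).float().mean())  # ~0.9
```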

wizardforhire

Thanks for taking the time to clarify for me.

meatmanek

I'm surprised those things used neural networks. With a matrix of answer probabilities (trivially calculated from people's answers), you can choose the question that maximizes your expected information gain.
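That greedy entropy-reduction strategy is only a few lines; a sketch, where the prior and the answer-probability matrix are assumed inputs (e.g. estimated from past players' answers):

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def best_question(prior: np.ndarray, p_yes: np.ndarray) -> int:
    """prior: P(object), shape [N]. p_yes: P(answer 'yes' | object, question), shape [Q, N].
    Returns the question whose answer reduces expected entropy the most (greedy 20Q)."""
    h0 = entropy(prior)
    gains = []
    for q in range(p_yes.shape[0]):
        expected_h = 0.0
        for likelihood in (p_yes[q], 1.0 - p_yes[q]):   # "yes" and "no" branches
            p_answer = float((likelihood * prior).sum())
            if p_answer > 1e-12:
                posterior = likelihood * prior / p_answer
                expected_h += p_answer * entropy(posterior)
        gains.append(h0 - expected_h)
    return int(np.argmax(gains))
```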

wizardforhire

As I remember it, it was the breakout moment for NNs that made them mainstream to the masses. Prior to that they were an academic/hacker oddity relegated to works of fiction and just one of the many competing theories towards functioning AI. After 20Q you could buy a handheld NN at Walmart. 20Q made it apparent to the scene that the limiting factor for more practical AI development was purely a scaling problem of complexity limited by compute power. A lot of conversations on /. and the like centered around when the threshold would be crossed. Most at the time could not have predicted nor accepted that Moore's law would fail, putting development back a decade.

To the credit of the naysayers, at the time Hotmail was still the primary free email service and Gmail had yet to come out. Google was buying up the dark fiber and had yet to open up its excess compute, starting the arms race for the cloud. Most still thought of GPUs only for graphics, even though their architecture and intent had been there since their inception at Thinking Machines…