
Bolt: Bootstrap long chain-of-thought in LLMs without distillation [pdf]

blackeyeblitzar

Can someone explain what distillation is exactly? I keep seeing people posting comments here and elsewhere about how DeepSeek “distilled” OpenAI outputs to train their new model. How could that even work - wouldn’t you need to ask millions of questions to get enough data to train a whole other LLM? Or am I just uneducated about this topic?

anon373839

First, I don’t believe there has been a shred of evidence that DeepSeek distilled OpenAI’s model.

As to distillation: people use the term kind of imprecisely. The most powerful form of distillation is one in which you train a smaller “student” model on a large amount of predictions from a larger, more powerful “teacher” model that uses the same tokenization scheme. You train the smaller model not just to output the same tokens as the teacher, but to match the full probability distribution the teacher predicts for each token. It’s a very dense transfer of knowledge. This is best done using the same training data that the teacher model was trained on, since you are asking the student to learn the teacher’s distribution.
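For a concrete picture, here is a minimal PyTorch sketch of that dense form of distillation. It assumes Hugging Face-style causal LMs that share a tokenizer and expose .logits; the temperature, optimizer, and function name are illustrative, not anyone’s actual recipe:

    # Minimal sketch of dense (soft-label) distillation in PyTorch.
    # Assumes teacher and student share the same tokenizer/vocabulary;
    # temperature and learning-rate choices are illustrative only.
    import torch
    import torch.nn.functional as F

    def distillation_step(student, teacher, input_ids, attention_mask, optimizer, T=2.0):
        with torch.no_grad():
            teacher_logits = teacher(input_ids=input_ids, attention_mask=attention_mask).logits

        student_logits = student(input_ids=input_ids, attention_mask=attention_mask).logits

        # Soften both distributions with temperature T, then match them with KL divergence.
        teacher_probs = F.softmax(teacher_logits / T, dim=-1)
        student_log_probs = F.log_softmax(student_logits / T, dim=-1)

        # F.kl_div expects log-probabilities for the input and probabilities for the target.
        loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T * T)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The key point is that the loss compares the student against the teacher’s entire distribution over the vocabulary at every position, not against a single “correct” token.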

The more common version is to take an existing model and fine-tune it on a small number of text outputs from a larger model. No probability distributions over tokens - just the text itself. This doesn’t significantly alter the smaller model’s knowledge or capabilities, but rather causes it to imitate the other model’s style. But it can unlock capabilities that were in the small model to begin with, yet were not as accessible in its base form. For this type of distillation, it has been shown that only a very small number of carefully selected training examples is needed. A rough sketch of that lighter-weight recipe is below.
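The sketch uses placeholder model names and prompts, and trains with ordinary next-token cross-entropy on the teacher’s text, with no access to its probability distributions:

    # Lighter-weight "distillation": collect a small set of teacher completions
    # as plain text, then fine-tune the student on them with ordinary SFT.
    # Model names and prompts are placeholders; batching/epochs omitted for brevity.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    teacher_tok = AutoTokenizer.from_pretrained("large-teacher-model")   # placeholder
    teacher = AutoModelForCausalLM.from_pretrained("large-teacher-model")

    prompts = ["Explain why the sky is blue.", "Prove that sqrt(2) is irrational."]
    traces = []
    for p in prompts:
        ids = teacher_tok(p, return_tensors="pt").input_ids
        out = teacher.generate(ids, max_new_tokens=512)
        traces.append(teacher_tok.decode(out[0], skip_special_tokens=True))

    student_tok = AutoTokenizer.from_pretrained("small-student-model")   # placeholder
    student = AutoModelForCausalLM.from_pretrained("small-student-model")
    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

    for text in traces:
        batch = student_tok(text, return_tensors="pt", truncation=True)
        # labels = input_ids gives the standard next-token cross-entropy loss.
        loss = student(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()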

kiratp

https://huggingface.co/docs/trl/main/en/gkd_trainer

Take a set of prompts, run them through both the large model and the small model, compute the KL divergence between the two models’ output distributions (the softmaxed logits), then update the small model to minimize that loss.

This gives the smaller model a much higher-density signal than SFT, which is usually cross-entropy against the single correct token.
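To make the difference in signal density concrete, here is a toy comparison with fake logits and labels and an illustrative vocab size: the KL term touches every vocabulary entry of the teacher’s distribution at each position, while plain cross-entropy only sees the one target token id:

    # Per-token training signal: KL against the teacher's full distribution
    # vs. cross-entropy against a single "correct" token id (as in plain SFT).
    # Shapes are illustrative: (batch * seq_len, vocab_size) for logits.
    import torch
    import torch.nn.functional as F

    student_logits = torch.randn(8, 32000)          # fake student logits
    teacher_logits = torch.randn(8, 32000)          # fake teacher logits
    hard_labels = torch.randint(0, 32000, (8,))     # fake "gold" next tokens

    # Distillation-style signal: every vocab entry of the teacher distribution contributes.
    kl_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

    # SFT signal: only the single target token id per position contributes.
    ce_loss = F.cross_entropy(student_logits, hard_labels)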

Edit: note that the term is often abused; SFT on just the traces from the big model is also sometimes referred to as distillation (e.g. the smaller DeepSeek models). IMO that is incorrect.

Szpadel

In less technical terms: just run some prompts through the original model, then fine-tune the smaller one to respond the same way.

nickthegreek

I believe the number needed is not millions but more like thousands or even hundreds depending on the size of the model you are distilling into.