
Bolt: Bootstrap long chain-of-thought in LLMs without distillation [pdf]

blackeyeblitzar

Can someone explain what distillation is exactly? I keep seeing people posting comments here and elsewhere about how DeepSeek “distilled” OpenAI outputs to train their new model. How could that even work - wouldn’t you need to ask millions of questions to get enough data to train a whole other LLM? Or am I just uneducated about this topic?

kiratp

https://huggingface.co/docs/trl/main/en/gkd_trainer

Take a set of prompts, run them through both the large model and the small model, compute the KL divergence between their logits, then update the small model to minimize that loss.

This gives the smaller model a much denser training signal than SFT, which is usually cross entropy against a single correct token (a one-hot target).
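For concreteness, here's a minimal PyTorch sketch of that loss. This is not TRL's GKD trainer; the tiny stand-in models, the temperature, and the hyperparameters are all illustrative:

```python
# Minimal logit-distillation sketch (illustrative, not TRL's GKD trainer).
# Teacher/student here are tiny stand-in networks; in practice they would be
# a large and a small LLM producing per-token logits over the same vocabulary.
import torch
import torch.nn.functional as F

vocab_size, hidden = 100, 32
teacher = torch.nn.Linear(hidden, vocab_size)   # stand-in "large" model
student = torch.nn.Linear(hidden, vocab_size)   # stand-in "small" model
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

def distill_step(hidden_states, temperature=2.0):
    """One update: match student logits to teacher logits via KL divergence."""
    with torch.no_grad():
        teacher_logits = teacher(hidden_states)
    student_logits = student(hidden_states)

    # Soften both distributions with a temperature, then minimize
    # KL(teacher || student) averaged over the batch.
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    loss = F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature**2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Fake "prompt" batch: in a real setup these would be the hidden states /
# token positions produced by running the same prompts through both models.
batch = torch.randn(8, hidden)
print(distill_step(batch))
```

The point of the comparison above: instead of one hard label per token, the student gets the teacher's full probability distribution over the whole vocabulary at every position.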

Edit: note that the term is often abused, and SFT on just the traces from the big model is also sometimes referred to as distillation (e.g. the DeepSeek smaller models). IMO that is incorrect.