Writing an LLM from scratch, part 10 – dropout

Scene_Cast2

I never did as much thinking or testing of dropout on transformers as the author, but it didn't seem to help with my "baby" (~10 million param) transformer models. IIRC the latest Llama models don't use dropout either.

mattnewton

Same, I was never able to figure out why dropout > 5% really hurt convergence speed for my toy LLMs. I chalked it up to the models not having enough parameters to fit fineweb, and just stopped using it.
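For readers following along: the standard "inverted" dropout these comments refer to zeroes each activation with probability p during training and scales the survivors by 1/(1-p), so the expected activation is unchanged and nothing needs to happen at inference time. A minimal pure-Python sketch (not from the article; the function name and signature are my own):

```python
import random

def inverted_dropout(x, p, training=True, seed=None):
    """Inverted dropout on a list of activations.

    During training, each element is zeroed with probability p and the
    survivors are scaled by 1/(1-p), keeping the expected value of each
    activation the same. At inference (training=False) the input passes
    through untouched.
    """
    if not training or p == 0.0:
        return list(x)
    rng = random.Random(seed)  # seeded for reproducibility in this demo
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]

# Example: with p=0.5, surviving activations are doubled, dropped ones are 0.
activations = [1.0, 2.0, 3.0, 4.0]
print(inverted_dropout(activations, p=0.5, seed=0))
print(inverted_dropout(activations, p=0.5, training=False))
```

The scaling is why frameworks like PyTorch do nothing special at eval time: `model.eval()` simply disables the masking.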