
Transformers Without Normalization

4 comments · March 15, 2025

kouteiheika

If true, this is a very nice incremental improvement. It doesn't appear to meaningfully improve the capabilities of the model, but it is cheaper to compute than RMSNorm (which essentially all current state-of-the-art LLMs use), which means faster and cheaper training.
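
For context, here is a minimal PyTorch sketch contrasting the two layers being compared: RMSNorm, and the paper's Dynamic Tanh (DyT), defined as DyT(x) = γ · tanh(αx) + β with a learnable scalar α. The α init of 0.5 follows the paper's stated default, though the exact value may be tuned per model; treat this as an illustrative sketch, not the reference implementation.

  import torch
  import torch.nn as nn

  class RMSNorm(nn.Module):
      """Standard RMSNorm: scales activations by their root-mean-square.
      Requires a reduction (mean of squares) across the feature dimension."""
      def __init__(self, dim, eps=1e-6):
          super().__init__()
          self.eps = eps
          self.weight = nn.Parameter(torch.ones(dim))

      def forward(self, x):
          rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
          return x * rms * self.weight

  class DyT(nn.Module):
      """Dynamic Tanh (DyT): element-wise tanh(alpha * x) with learnable
      scale and shift -- no cross-feature reduction needed."""
      def __init__(self, dim, init_alpha=0.5):  # 0.5 per the paper's default; illustrative
          super().__init__()
          self.alpha = nn.Parameter(torch.tensor(init_alpha))  # learnable scalar
          self.weight = nn.Parameter(torch.ones(dim))
          self.bias = nn.Parameter(torch.zeros(dim))

      def forward(self, x):
          return self.weight * torch.tanh(self.alpha * x) + self.bias

The key point behind the speed claim: DyT is purely element-wise, so it avoids the per-token reduction over the feature dimension that RMSNorm requires, which is where the compute saving would come from.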

gdiamos

What are the practical implications of this?

gricardo99

from the abstract

  By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning.

adamnemecek

It feels like the end goal of this is energy-based models, Yann LeCun's favorite ML approach.

We at Traceoid (http://traceoid.ai) have identified a promising approach for scaling EBMs. Join the Discord channel: https://discord.com/invite/mr9TAhpyBW