
The Annotated Transformer


4 comments · August 24, 2025

internetguy

wow - this is really well made! i've been doing research w/ Transformer-based audio/speech models, and this is written with incredible detail. Attention as a concept is already quite unintuitive for beginners due to its non-linearity, and this explains it very well.

roadside_picnic

> Attention as a concept itself is already quite unintuitive

Once you realize that Attention is really just a re-framing of Kernel Smoothing it becomes wildly more intuitive [0]. It also allows you to view Transformers as basically learning a bunch of stacked Kernels which leaves them in a surprisingly close neighborhood to Gaussian Processes.

0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...
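The correspondence the comment points to can be sketched in a few lines. This is an illustrative example (not from the linked article): scaled dot-product attention is exactly a Nadaraya-Watson kernel smoother whose kernel is the exponential of the scaled dot product between query and key.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Standard scaled dot-product attention.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def kernel_smoother(q, keys, values, kernel):
    # Nadaraya-Watson: a weighted average of values,
    # with weights proportional to kernel(q, key).
    w = np.array([kernel(q, k) for k in keys])
    return (w / w.sum()) @ values

rng = np.random.default_rng(0)
d = 4
Q = rng.normal(size=(3, d))   # 3 queries
K = rng.normal(size=(5, d))   # 5 keys
V = rng.normal(size=(5, 2))   # 5 values

# With an exponential dot-product kernel, the smoother
# reproduces attention exactly.
kern = lambda q, k: np.exp(q @ k / np.sqrt(d))
smoothed = np.stack([kernel_smoother(q, K, V, kern) for q in Q])
assert np.allclose(attention(Q, K, V), smoothed)
```

Swapping in a different kernel (e.g. a Gaussian on the query-key distance) gives other smoothers from the same family, which is the re-framing that connects stacked attention layers to stacked kernels.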

adityamwagh

It’s a very popular article that has been around for a long time!

gdiamos

It's so good it is worth revisiting often