
Pre-Trained Large Language Models Use Fourier Features for Addition (2024)

bkitano19

Related work:

Interpreting Modular Addition in MLPs https://www.lesswrong.com/posts/cbDEjnRheYn38Dpc5/interpreti...

Paper Replication Walkthrough: Reverse-Engineering Modular Addition https://www.neelnanda.io/mechanistic-interpretability/modula...

wongarsu

Curious that they chose to use GPT-2-XL, given the age of that model. I guess they wanted a small model (1.5B) and started work on this quite a while ago. Today there is a decent selection of much more capable 1.5B models (Qwen2-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, TinySwallow 1.5B, Stella 1.5B, Qwen2.5-Math-1.5B, etc.), but they are all derived from the Qwen series of models, which wasn't available when they started this research.

imjonse

The paper predates Qwen2 and R1; this work is probably a year old.

scoresmoke

GPT-2 follows the very well-studied Transformer-decoder architecture, so the outcomes of this study might be applicable to more complicated models.

ImHereToVote

I understand GPT-2's internals have already been mapped out to a certain extent.


nickpsecurity

I was collecting examples of models trained on a single GPU or at very low cost. A number of projects used BERT or GPT-2 since the implementations were very simple, with some components optimized. There are also a lot of projects that have trained BERT and GPT-2 models, which makes for more scientific comparisons.

With no other information, those would be my guesses as to why one would use a GPT-2 model.

littlestymaar

TinyLlama would have worked too and is older than the Qwen family.

vessenes

This is... pretty interesting! According to the abstract, models trained long enough use some features for magnitude assessment and others for modular assessment (e.g. even/odd). It's surprising to me that this is a stable outcome for trained LLMs when they encounter math. Definitely not what seems simplest to my meatspace brain.
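A toy version of that division of labor (my own illustration, not code from the paper): a sloppy "magnitude" pathway plus exact "modular" pathways are already enough to pin down an exact sum.

    # Hypothetical sketch: rough magnitude estimate + exact residues -> exact answer.
    def add_via_magnitude_and_residues(a, b, moduli=(2, 5, 10)):
        rough = a + b + 3                            # pretend the magnitude pathway is only approximate
        residues = {m: (a + b) % m for m in moduli}  # the modular pathway is exact
        # search a window narrower than lcm(moduli) = 10 around the rough estimate
        for candidate in range(rough - 4, rough + 5):
            if all(candidate % m == r for m, r in residues.items()):
                return candidate

    print(add_via_magnitude_and_residues(15, 23))    # -> 38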

wongarsu

My meatspace brain can do fast, accurate math up to about three-digit results. After that I fall back to iterative processes with chain-of-thought, and possibly physical scratch space. My brain can, however, do magnitude assessment and modular assessment in near-constant time, which I use to verify the correctness of the chain-of-thought result.

DoctorOetker

> Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy.

what's the convention on the meaning of "pre-training" vs "training from scratch" ?

Is this a nomenclature shift?

currymj

A pre-trained model would mean training a language model to predict text first, then starting from those weights and training it to add numbers.

Training from scratch would mean initializing the same network randomly and training it to add numbers directly.
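In code the distinction is just the starting weights; a minimal sketch assuming the Hugging Face transformers library (the "gpt2" checkpoint name is only a stand-in for whatever checkpoint is used):

    from transformers import GPT2Config, GPT2LMHeadModel

    # "Pre-trained": weights already trained on text, then fine-tuned on addition.
    pretrained = GPT2LMHeadModel.from_pretrained("gpt2")

    # "From scratch": same architecture, randomly initialized, trained only on addition.
    scratch = GPT2LMHeadModel(GPT2Config())

    # Both would then be trained on strings like "226+68=294" with the usual
    # causal language-modeling loss; only the initialization differs.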


globalnode

Great, my mathematical nemesis is now a part of LLM functionality as well. Are people trying to make this stuff harder?

TeMPOraL

IDK, the more I learn, the more it seems to me that the Fourier transform is reality's cheat code. It keeps showing up everywhere.

Like, the other day I learned[0] that if you shine a light through a small opening, the diffraction pattern you get on the other side is basically the Fourier transform of the aperture outline.

(Yes, this also implies that if you take the Fourier transform of an image and make a diffraction grating from the resulting pattern, projecting light through it should paint you the original image.)

--

[0] - https://www.youtube.com/watch?v=Y9FZ4igNxNA
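A quick numerical version of that far-field (Fraunhofer) relationship, as a sketch with numpy and a square aperture of my own choosing:

    import numpy as np

    N = 512
    aperture = np.zeros((N, N))
    aperture[N//2 - 8 : N//2 + 8, N//2 - 8 : N//2 + 8] = 1.0  # small square opening

    # In the far field, the observed intensity is proportional to |FT(aperture)|^2,
    # which for a square slit is the familiar sinc^2 pattern.
    far_field = np.fft.fftshift(np.fft.fft2(aperture))
    intensity = np.abs(far_field) ** 2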

ruined

This kind of interference recording is typically termed a 'hologram'. It works in full 3D.

mananaysiempre

(Almost-)linear models do linear things, it seems, and the Fourier transform is the quintessential linear thing.
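For concreteness (my gloss, not the parent's): the discrete Fourier transform is just multiplication by a fixed matrix, so it is linear in exactly the same sense a weight matrix is:

    \hat{x}_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N},
    \qquad
    \mathcal{F}(a x + b y) = a\,\mathcal{F}(x) + b\,\mathcal{F}(y).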

It is also an extremely neat piece of the real world, but I’m hesitant to guess your background and offer an explanation because your phrasing makes me suspect an engineering one. With concepts usually being the first to be culled in a course targeted at engineers, there could be quite a bit of concept debt to pay off before I could really offer something I could honestly call an explanation.

Have you tried the 3Blue1Brown video on the topic[1]? It does not AFAIR offer any answers as to why the Fourier transform should exist or be useful, but it does show very well what it does in the immediate sense.

[1] https://www.youtube.com/watch?v=spUNpyF58BY

almostgotcaught

You gotta be in the in-crowd to understand that this paper, like so many others, is one of those dumb post-hoc analogy/metaphor papers. These papers are ones where they just ran a bunch of experiments (i.e. just ran the training script over and over) and formulated a hypothesis empirically. Of course, in order to lend the hypothesis some credibility, they have to make an allusion to something formal/mathematical:

> Fourier features -- dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain

Brilliant and very rigorous!
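For what it's worth, the quoted definition is checkable in a toy setting. A sketch of my own (not the paper's analysis) of what "sparse in the frequency domain" means for a hidden dimension that responds to a number periodically:

    import numpy as np

    rng = np.random.default_rng(0)
    numbers = np.arange(256)

    # Hypothetical hidden dimension: responds to n with periods 2 and 10, plus noise.
    feature = (np.cos(2 * np.pi * numbers / 2)
               + np.cos(2 * np.pi * numbers / 10)
               + 0.1 * rng.standard_normal(numbers.size))

    # FFT along the "number" axis: only a couple of frequency bins dominate.
    spectrum = np.abs(np.fft.rfft(feature))
    top = np.argsort(spectrum)[-2:]
    print(numbers.size / top)   # ~[9.8, 2.0]: periods ~10 and ~2 dominate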