
Pre-Trained Large Language Models Use Fourier Features for Addition (2024)

bkitano19

Related work:

Interpreting Modular Addition in MLPs https://www.lesswrong.com/posts/cbDEjnRheYn38Dpc5/interpreti...

Paper Replication Walkthrough: Reverse-Engineering Modular Addition https://www.neelnanda.io/mechanistic-interpretability/modula...

wongarsu

Curious that they chose to use GPT-2-XL, given the age of that model. I guess they wanted a small model (1.5B) and started work on this quite a while ago. Today there is a decent selection of much more capable 1.5B models (Qwen2-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, TinySwallow 1.5B, Stella 1.5B, Qwen2.5-Math-1.5B, etc.), but they are all derived from the Qwen series of models, which wasn't available when they started this research.

imjonse

The paper predates Qwen2 and R1; this work is probably a year old.

scoresmoke

GPT-2 follows the very well-studied Transformer-decoder architecture, so the outcomes of this study might be applicable to more complicated models.

ImHereToVote

I understand GPT-2's internals have already been mapped out to a certain extent.


nickpsecurity

I was collecting examples of models trained on a single GPU or at very low cost. A number of projects used BERT or GPT-2 since the implementations were very simple, with some components optimized. There are also a lot of projects that have trained BERT and GPT-2 models, which makes for more scientific comparisons.

With no other information, those would be my guesses as to why one would use a GPT-2 model.

littlestymaar

TinyLlama would have worked too and is older than the Qwen family.

vessenes

This is... pretty interesting! According to the abstract, models trained long enough use some features for magnitude assessment and others for modular assessment (e.g. even/odd). It's surprising to me that this is a stable outcome for trained LLMs when they encounter math. Definitely not what seems simplest to my meatspace brain.
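A toy version of that division of labor (my own illustration, not code from the paper): a sloppy "magnitude" pathway plus exact "modular" pathways are already enough to pin down an exact sum.

    # Hypothetical sketch: rough magnitude estimate + exact residues -> exact answer.
    def add_via_magnitude_and_residues(a, b, moduli=(2, 5, 10)):
        rough = a + b + 3                            # pretend the magnitude pathway is only approximate
        residues = {m: (a + b) % m for m in moduli}  # the modular pathway is exact
        # search a window narrower than lcm(moduli) = 10 around the rough estimate
        for candidate in range(rough - 4, rough + 5):
            if all(candidate % m == r for m, r in residues.items()):
                return candidate

    print(add_via_magnitude_and_residues(15, 23))    # -> 38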

wongarsu

My meatspace brain can do fast, accurate math up to about three-digit results. After that I fall back to iterative processes with chain-of-thought, and possibly physical scratch space. My brain can, however, do magnitude assessment and modular assessment in near-constant time, which I use to verify the correctness of the chain-of-thought result.

DoctorOetker

> Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy.

what's the convention on the meaning of "pre-training" vs "training from scratch" ?

Is this a nomenclature shift?

currymj

A pre-trained model would mean training a language model to predict text first, then starting from those weights and training it to add numbers.

Training from scratch would mean initializing the same network randomly and training it to add numbers directly.
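In code the distinction is just the starting weights; a minimal sketch assuming the Hugging Face transformers library (the "gpt2" checkpoint name is only a stand-in for whatever checkpoint is used):

    from transformers import GPT2Config, GPT2LMHeadModel

    # "Pre-trained": weights already trained on text, then fine-tuned on addition.
    pretrained = GPT2LMHeadModel.from_pretrained("gpt2")

    # "From scratch": same architecture, randomly initialized, trained only on addition.
    scratch = GPT2LMHeadModel(GPT2Config())

    # Both would then be trained on strings like "226+68=294" with the usual
    # causal language-modeling loss; only the initialization differs.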


globalnode

Great, my mathematical nemesis is now a part of LLM functionality as well. Are people trying to make this stuff harder?

TeMPOraL

IDK, the more I learn, the more it seems to me that the Fourier transform is reality's cheat code. It keeps showing up everywhere.

Like, the other day I learned[0] that if you shine a light through a small opening, the diffraction pattern you get on the other side is basically the Fourier transform of the aperture outline.

(Yes, this also implies that if you take the Fourier transform of an image and make a diffraction grating from the resulting pattern, projecting light through it should paint you the original image.)

--

[0] - https://www.youtube.com/watch?v=Y9FZ4igNxNA
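A quick numerical version of that far-field (Fraunhofer) relationship, as a sketch with numpy and a square aperture of my own choosing:

    import numpy as np

    N = 512
    aperture = np.zeros((N, N))
    aperture[N//2 - 8 : N//2 + 8, N//2 - 8 : N//2 + 8] = 1.0  # small square opening

    # In the far field, the observed intensity is proportional to |FT(aperture)|^2,
    # which for a square slit is the familiar sinc^2 pattern.
    far_field = np.fft.fftshift(np.fft.fft2(aperture))
    intensity = np.abs(far_field) ** 2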

ruined

This kind of interference recording is typically termed a 'hologram'. It works in full 3D.

mananaysiempre

(Almost-)linear models do linear things, it seems, and the Fourier transform is the quintessential linear thing.
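For concreteness (my gloss, not the parent's): the discrete Fourier transform is just multiplication by a fixed matrix, so it is linear in exactly the same sense a weight matrix is:

    \hat{x}_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N},
    \qquad
    \mathcal{F}(a x + b y) = a\,\mathcal{F}(x) + b\,\mathcal{F}(y).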

It is also an extremely neat piece of the real world, but I’m hesitant to guess your background and offer an explanation because your phrasing makes me suspect an engineering one. With concepts usually being the first to be culled in a course targeted at engineers, there could be quite a bit of concept debt to pay off before I could really offer something I could honestly call an explanation.

Have you tried the 3Blue1Brown video on the topic[1]? It does not AFAIR offer any answers as to why the Fourier transform should exist or be useful, but it does show very well what it does in the immediate sense.

[1] https://www.youtube.com/watch?v=spUNpyF58BY

almostgotcaught

You gotta be in the in-crowd to understand that this paper, like so many others, is one of those dumb post-hoc analogy/metaphor papers. These papers are ones where they just ran a bunch of experiments (i.e. just ran the training script over and over) and formulated a hypothesis empirically. Of course, in order to lend the hypothesis some credibility, they have to make an allusion to something formal/mathematical:

> Fourier features -- dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain

Brilliant and very rigorous!
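For what it's worth, the quoted definition is checkable in a toy setting. A sketch of my own (not the paper's analysis) of what "sparse in the frequency domain" means for a hidden dimension that responds to a number periodically:

    import numpy as np

    rng = np.random.default_rng(0)
    numbers = np.arange(256)

    # Hypothetical hidden dimension: responds to n with periods 2 and 10, plus noise.
    feature = (np.cos(2 * np.pi * numbers / 2)
               + np.cos(2 * np.pi * numbers / 10)
               + 0.1 * rng.standard_normal(numbers.size))

    # FFT along the "number" axis: only a couple of frequency bins dominate.
    spectrum = np.abs(np.fft.rfft(feature))
    top = np.argsort(spectrum)[-2:]
    print(numbers.size / top)   # ~[9.8, 2.0]: periods ~10 and ~2 dominate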