Why can't transformers learn multiplication?
20 comments · October 21, 2025 · nico
serced
Yes, I also wonder about this! Progress from children's books to scientific papers, etc. Could it learn e.g. language structure faster in a pre-training stage? Also, somehow one needs to define a proxy for generalization to compute a loss and do backpropagation.
arbot360
This field of study is known as "Curriculum Learning" for your Googling pleasure (or I guess ChatGPT Deep Research now).
alyxya
I think it should be able to learn multiplication with chain of thought. Without it, it's probably really difficult to generalize the multiplication of two n-digit integers when you have to accumulate up to n products of digits and handle carrying for each output digit.
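To make that concrete, here's the bookkeeping written out as a schoolbook routine (my own sketch, not anything from the paper). Every output digit mixes up to n digit products plus a running carry, and without CoT all of that has to happen inside a single forward pass:

    # Schoolbook multiplication over digit lists, least significant digit first.
    # Each output position accumulates several digit products plus a carry,
    # which is exactly the bookkeeping described above.
    def multiply_digits(a, b):
        out = [0] * (len(a) + len(b))
        for i, da in enumerate(a):
            carry = 0
            for j, db in enumerate(b):
                total = out[i + j] + da * db + carry
                out[i + j] = total % 10
                carry = total // 10
            out[i + len(b)] += carry
        return out  # least significant digit first

    # 12 * 34 = 408 -> prints [8, 0, 4, 0] (trailing zero not trimmed)
    print(multiply_digits([2, 1], [4, 3]))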
daxfohl
The chains-of-thought here are artificially constructed, very information-dense partial sums formatted in a specific way that guides the fine-tuning. A potential next step would be to look at real-world chains-of-thought and see whether some process could start with those and achieve the same result. Then you could really have a self-improving system!
Also I wonder if the LLM "knows" that it has this capability after fine-tuning. If it encounters multiplication as part of some larger chain-of-thought, will it solve that internally, or will it continue to do it step-by-step in the chain-of-thought?
kovek
I tried to ask a model to tell me what the "long multiplication algorithm" is. It gave it to me. I asked it to follow that algorithm to solve e.g. 12987318927 * 12098102983, and it followed the algorithm and got the right answer. It DOES fail more when the numbers are longer (because it results in more text in the context), but that can be improved by having the model focus on the right subset of the text, right?
LouisSayers
Given their names I'd say they're too busy optimising primes...
IAmBroom
Take your damned upvote, and go away.
carodgers
Because they produce output probabilistically, when multiplication is deterministic. Why is this so hard for everyone?
trollied
Not true though. Internally they can “shell out” to sub-tasks that know how to do specific things. The specific things don’t have to be models.
(I'm specifically talking about commercial hosted ones that have the capability I describe - obviously your run-of-the-mill one downloaded off the internet cannot do this).
rrix2
yes, what you're describing is not a transformer but a high-level LLM-based product with tool-calling wired up to it
mikkupikku
They're not any better at addition, are they? If they are, I wonder how good they are at adding numbers in log space.
yorwba
The paper uses a number representation that is designed to make attention easy to learn: each digit is a separate token and the least significant digit comes first, so that the first digit of the output is simply the sum of the first input digits (mod 10), the second digit is the sum of the second digits plus an optional carry from the first, and so on.
If the numbers are represented with the most significant digit first as usual, you need a bunch of intermediate steps before outputting even the first digit just to determine whether it is affected by a carry or not.
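A toy version of that in code (my sketch of the idea, not the paper's actual tokenization): with the least significant digit first, output digit i depends only on input digits up to position i plus a carry, so the digits can be produced strictly left to right, which matches autoregressive decoding.

    # Add two numbers given as least-significant-digit-first digit lists.
    # Output digit i depends only on input digits 0..i and one carry bit,
    # so it can be emitted left to right without looking ahead.
    def add_lsd_first(a, b):
        out, carry = [], 0
        for i in range(max(len(a), len(b))):
            s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
            out.append(s % 10)
            carry = s // 10
        if carry:
            out.append(carry)
        return out

    # 957 + 48 = 1005 -> [7, 5, 9] + [8, 4] gives [5, 0, 0, 1]
    print(add_lsd_first([7, 5, 9], [8, 4]))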
The paper looks at multiplication of numbers represented with the least significant digit first as a toy task requiring several additions as intermediate steps to study why a model large enough to perform those additions in principle fails to learn to do so in practice.
They compare with a model that is first trained to produce the intermediate additions explicitly (as a "chain of thought" with a specific format) and then has this CoT progressively shortened during training until there's nothing left of it. But that second model successfully multiplies.
The difference appears to be that the presence of the intermediate results induces a better number representation in latent space, whereas the model without CoT gets stuck in a less efficient local minimum.
So the answer to the question "Why can't transformers learn multiplication?" is that the training process is insufficient for the model to discover the best intermediate steps on its own.
You could do a similar experiment where the CoT involves first taking the logarithm, adding, and then exponentiating to get the final result, but I think logarithms are probably another computation that's too difficult to learn without additional hints for intermediate steps.
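For anyone who wants to poke at this, here's roughly what such a training string could look like for the partial-sums variant (the delimiters, layout, and shortening schedule are my guesses, not the paper's actual format):

    # Hypothetical generator for "multiplication with explicit partial sums" CoT
    # training strings. Digits are reversed (least significant first) and the
    # running sums are spelled out, then progressively dropped; the exact
    # delimiters and shortening schedule here are invented for illustration.
    def cot_example(a, b, keep_steps=None):
        def rev_digits(n):
            return " ".join(str(n)[::-1])
        running, steps = 0, []
        for i, d in enumerate(str(b)[::-1]):       # one partial product per digit of b
            running += a * int(d) * 10 ** i
            steps.append(rev_digits(running))      # running sum so far
        if keep_steps is not None:                 # shorten the chain over training
            steps = steps[-keep_steps:] if keep_steps else []
        return f"{rev_digits(a)} * {rev_digits(b)} : {' | '.join(steps)} = {rev_digits(a * b)}"

    print(cot_example(957, 48))                # full chain of intermediate sums
    print(cot_example(957, 48, keep_steps=0))  # chain fully removed, answer only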
jerf
This is a gut impression and I don't deny it, but LLMs are Large Language Models, and in my own brain, my Language Model isn't doing large-scale multiplication. I have a language-based intuition for the single-digit multiplication table and a touch beyond (and based on my observations that's already above average for a human Language Model, at least in my age peer group), but it's not my Language Model doing 283 times 9284. That requires a symbolic manipulation model, and in fact I would observe that my personal neural net, for all the things it is amazingly good at, is in fact quite terrible at that sort of multiplication too. A Commodore PET is by all measures vastly, vastly simpler than my brain, but it blows away my multiplication capabilities. And then the symbolic systems tacked on another, what, 15 orders of magnitude from that "blows away my multiplication capabilities"? Depends on how you count, but something like that.
You can sit here and force me to recite ("train me on") multi-digit multiplication problems and their result until the day I die, and my language model is only going to get marginally better. It is in practicing my symbolic manipulation that I'm going to get better and faster.
It seems to me that expecting a Language Model to be very good at multiplication is asking for a substantially superhuman level of performance from them, and one that we have little reason to believe will scale anyhow. What we need is symbolic manipulation, better than the approximation they achieve when "reasoning".
I find it rather ironic that we sit here with the aforementioned 15 orders of magnitude of improvement over the Commodore PET and use that level of symbolic manipulation firepower to laboriously recreate a software system that is as bad as we are at multiplication, for what may well be the same fundamental reasons... and then have the audacity to complain about it. My metaphorical dude, you did a couple trillion multiplications just to get to this single bad multiplication output... maybe another approach is called for.
daxfohl
Hmm, I wonder what happens if you let them manipulate their own context symbolically, maybe something like a stack machine. Perhaps all you need is a "delete" token, or a "replace" flag. That way you don't have context full of irrelevant information.
I guess the challenge is, where would the training data come from? Data on the internet is in its final form so "next token" is never a delete.
Edit: I guess in essence, that's what reasoning LLMs already do. IIUC the thought blocks are ephemeral, and only the response is maintained for the chat. Maybe there'd be some benefit of doing this recursively? But that's also kind of what subagents are for. So, perhaps nothing new here.
lacy_tinpot
A lot of savants who are able to do really cool calculations, or even people with synesthesia who see numbers as colors, don't actually do "real" calculations.
I think most humans that do math aren't actually literally computing things as some kind of logic machine.
We can produce logic, and follow the steps of using that logic, but it doesn't seem to me that our cognition is some kind of logic machine itself.
suddenlybananas
Language _is_ the symbolic manipulation system par excellence though.
jerf
There's equivocation in that statement, though, whether you meant there to be or not. There is clearly a difference between how we manipulate English words for normal human activities and the symbolic manipulation with very strict rules we today associate with mathematics and computer science. Human language goes back thousands of years, into an indefinite past we can't trace. Symbolic manipulation is a much, much more recent development, starting only ~2300 years ago around Euclid and not really coming into full development until much later... you can argue about exactly when, but I'd personally put it as late as the 19th century for it to be recognized in the modern sense. It must be something different if separated by that many centuries.
To disprove my point, please generate a list of 5 random 5-digit numbers and demonstrate multiplying them in your head as quickly as you can read them. Since you can't, clearly there is something about that that is hard for you, despite the fact that the act of reading this text, maintaining physical homeostasis while you do it, and all the other things your brain is doing as you do this represents a staggering amount of raw computation that is vastly, vastly in excess of what is nominally needed to achieve that computation.
suddenlybananas
Doing multiplication in your head isn't the point, though; you can externalise language and use it to do things you can't do in your head by writing it down.
Mathematics was born out of very careful reasoning that we do through language, we only use formalisms as they allow us to avoid the massive ambiguities that exist in natural language. Formal symbolic manipulation came out of our already existing abilities of symbolic manipulation through language.
westurner
[dead]
Would love to see an architecture that learned more like humans. Start with just imitating one letter, then a few more, then some syllables, then full words, then sentences, etc., progressively adding on top of previous knowledge.
Also, it's interesting that one of the big goals/measures of models is their capacity to "generalize", but the training methods optimize for loss/accuracy, and only after training do we test for generalization to validate.
Are there training methods/curriculums that explicitly maximize generalization?