Why Does Integer Addition Approximate Float Multiplication?
91 comments · February 9, 2025
HPsquared
The most significant bits of a floating point representation are basically a logarithm of the number. Logarithms have this relation between multiplication and addition.
advisedwang
As the very first paragraph of the article says:
> You can multiply two numbers by adding their exponents. So just with the exponent-addition you will be within a factor of 2 of the right result. But this will actually be much better and get within 7.5% of the right answer. Why?
zahlman
You can multiply two numbers by multiplying their mantissas and adding their exponents.
Since the numbers are stored in base 2, the mantissa values range from 1 to 2 (subnormals notwithstanding) - but they're effectively stored as a 0..1 value. The effective ranges produced by adding vs. multiplying them have considerable overlap, and on average the difference won't be much. Because of how the mantissa and exponent fields are arranged (a deliberate choice by the standard, AFAIK), the mantissa effectively adds additional bits to the approximation of the logarithm given by the exponent.
Edit: this other comment chain https://news.ycombinator.com/item?id=43034230 perhaps puts it better.
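A minimal C sketch of the trick under discussion, assuming IEEE 754 single precision and positive normal inputs (to_bits/from_bits are illustrative helper names, and 0x3F800000 is the bit pattern of 1.0f, subtracted once so the exponent bias isn't counted twice):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t to_bits(float f)      { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
    static float    from_bits(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

    int main(void) {
        float a = 3.7f, b = 2.9f;
        /* Add the raw bit patterns and subtract the bias term once
           (0x3F800000 is the bit pattern of 1.0f). */
        float approx = from_bits(to_bits(a) + to_bits(b) - 0x3F800000u);
        printf("exact:  %f\n", a * b);   /* 10.73 */
        printf("approx: %f\n", approx);  /* a few percent low */
        return 0;
    }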
vanderZwan
To give a "yes and" side-track to your comment: saying "logarithms have this relation between multiplication and addition" is even underselling what logarithms are, because reducing multiplication to an additive operation was the whole motivation for John Napier[0] to discover/invent logarithms:
> “…nothing is more tedious, fellow mathematicians, in the practice of the mathematical arts, than the great delays suffered in the tedium of lengthy multiplications and divisions, the finding of ratios, and in the extraction of square and cube roots… [with] the many slippery errors that can arise…I have found an amazing way of shortening the proceedings [in which]… all the numbers associated with the multiplications, and divisions of numbers, and with the long arduous tasks of extracting square and cube roots are themselves rejected from the work, and in their place other numbers are substituted, which perform the tasks of these rejected by means of addition, subtraction, and division by two or three only.”[1]
Logarithms were honestly an enormous breakthrough in optimization; computers wouldn't be remotely as useful without them, even if most of us don't "see" the logarithms being used.
In fact I'd argue that they are the second-biggest computational optimization in use today, with only positional notation being a bigger deal. Which, funny enough, works kind of similarly: imagine you could only count by tallying (so, unary). Adding two numbers M and N would take M+N operations, e.g. 1234 + 5678 would require counting out all 6912 individual tally marks. Unary math scales O(n) in both data and computation. Systems like Roman numerals almost work, but as soon as we reach values larger than the largest symbol (M for 1000) it's O(n) again, just with a better constant factor.
With positional notation, numbers require only log(n) symbols to write down, and log(n) operations for addition: e.g. 1234 + 5678 requires one or two additions for each digit pair at a given position - one addition if there's no carry from the previous position, two if there is. So addition takes at most 2 × ceil(max(log(M), log(N))) operations, i.e. O(log(n)).
Logarithms take that idea and "recursively" apply it to the notation, making the same optimization work for multiplication. Without it, the naive algorithm for the multiplication of two numbers requires iterating over each digit, e.g. 1234 × 5678 requires multiplying each of the four digits of the first number with each of the digits of the second number, and then adding all the resulting numbers. It scales O(di×dj), where di and dj are the number of digits of each number. If they're the same we can simplify that to O(d²). When the numbers are represented as two logarithms the operation is reduced to adding two numbers again, so O(log(d) + [whatever the log/inverse log conversion cost is]). Of course d is a different value here and the number of digits used affects the precision.
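To make the digit-counting argument above concrete, here is a rough C sketch of schoolbook addition over digit arrays (least significant digit first; add_digits is a made-up helper name). Each position costs a constant amount of work, so the loop runs in O(d); schoolbook multiplication would instead need a nested loop over every digit pair, hence O(d²).

    #include <stdio.h>

    /* Add two numbers given as digit arrays (least significant digit first).
       Each digit position costs O(1) work, so the whole addition is O(d). */
    static int add_digits(const int *a, int na, const int *b, int nb, int *out) {
        int carry = 0, n = (na > nb) ? na : nb, i;
        for (i = 0; i < n; i++) {
            int s = (i < na ? a[i] : 0) + (i < nb ? b[i] : 0) + carry;
            out[i] = s % 10;
            carry  = s / 10;
        }
        if (carry) out[n++] = carry;
        return n;   /* number of digits in the result */
    }

    int main(void) {
        int a[] = {4, 3, 2, 1};   /* 1234 */
        int b[] = {8, 7, 6, 5};   /* 5678 */
        int out[6], n = add_digits(a, 4, b, 4, out), i;
        for (i = n - 1; i >= 0; i--) printf("%d", out[i]);   /* prints 6912 */
        printf("\n");
        return 0;
    }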
I think the craziest thing of all this is that we're so used to positional notation that nobody ever seems to consider it a data compression technique. Even though almost no other data compression method would work without it as a building block (run-length encoding, Lempel-Ziv, Arithmetic coding? Useless without positional notation's O(log(n)) scaling factor). The only exceptions are data compression methods that are based on inventing their own numerical notation[2].
We have done this every day ever since we first learned addition and subtraction as kids. Or as David Bessis[3] puts it in his book "Mathematica": ask almost any adult what one billion minus one is and they know the answer instantaneously; most adults would therefore appear to have mental superpowers in the eyes of pretty much all mathematicians who lived before positional notation was invented (well, everyone except Archimedes maybe[4]). Positional notation is magical, we're all math wizards, and it's so normalized that we don't even realize it.
But to get back to your original point: yes, you are entirely correct. IEEE floats are a form of lossy compression of fractions, and the basis of that lossy compression is logarithmic notation (but with a fixed number of binary digits and some curious rules for encoding other values like NaN and infinity).
[0] https://en.wikipedia.org/wiki/John_Napier
[1] https://en.wikipedia.org/wiki/Mirifici_Logarithmorum_Canonis...
[2] https://en.wikipedia.org/wiki/Asymmetric_numeral_systems
[3] https://www.quantamagazine.org/mathematical-thinking-isnt-wh...
taejo
Excellent comment!
As a historical tidbit I'll add that Romans did develop two ways to write larger numbers.
1. Writing a line (vinculum) over a numeral to multiply its value by 1,000. This was in fact extended to writing a line to the left and above a numeral to multiply its value by 1,000,000, and could in principle be extended to lines below and to the right to multiply by 10^9 and 10^12, and even nested boxes for larger powers.
2. The use of |), |)), |))), ... for 500, 5,000, 50,000, ... and (|), ((|)), (((|))), ... for 1,000, 10,000, 100,000, ... These can be continued indefinitely.
https://en.wikipedia.org/wiki/Roman_numerals#Large_numbers
Both require an ever increasing number of marks just to write the increasing powers, as well as an ever increasing number of powers being summed, but both increase only logarithmically, so we end up using O((log n)²) marks to write n. This is quadratically worse than positional notation, but exponentially better than just writing M over and over.
vanderZwan
Wow, that's amazing! I had somehow never heard of this before despite being into numeral systems. Thank you for sharing!
esafak
And if I may take this yet a step further: this is the point of mathematical transformations in general - to find a domain in which the desired operations are easier, such that the cost of the round trip to and from that domain is offset.
In the past, transformations like logarithms, Fourier transforms, and wavelets had to be proposed by hand. Enabled by advances in computers, machine learning automated much of that away by composing differentiable building blocks into universal estimators. The parameters of these building blocks are estimated through gradient descent in conjunction with a user-chosen loss function, which guides the optimization of the transform. Good representations can be manipulated with basic algebra (added, averaged, or compared for similarity, depending on the task) in a way that corresponds to semantic operations, whereas their raw, untransformed representations cannot.
HPsquared
I'm still amazed at word embeddings. They let you find related words, and even do addition and subtraction.
The way you can supposedly take "king", subtract "man", add "woman", and get "queen" - this kind of thing is the mathematical basis of LLMs.
hathawsh
I remember the glee I felt when I first used logarithms to represent large numbers in Excel. Suddenly, I could easily approximate large factorials, and they were fast! I wondered how many other people have discovered this trick. I couldn't find many references on the Internet. Over time, I realized this method of handling large numbers is used so often that most people who know about it don't bother to talk about it.
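A sketch of that trick in C, in case it helps anyone else stumble onto it: keep log10(n!) as a running sum of logs rather than the astronomically large value itself (lgamma from <math.h> would give ln(n!) in one call; the numbers in the comments are approximate).

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        int n = 1000;
        double log10_fact = 0.0;
        for (int k = 2; k <= n; k++)
            log10_fact += log10((double)k);   /* log10(n!) = sum of log10(k) */

        /* Recover scientific notation: n! = m * 10^e with 1 <= m < 10. */
        double e = floor(log10_fact);
        double m = pow(10.0, log10_fact - e);
        printf("1000! ~= %.4f x 10^%.0f\n", m, e);   /* roughly 4.0239 x 10^2567 */
        return 0;
    }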
groby_b
This should have been part of the HS curriculum.
Logarithms pop up (IIRC) in 10th grade, and ln(x*y) = ln(x) + ln(y) is usually explained as part of that. What many teachers completely fail to teach is that it's not just a formula to memorize, but that it has profound implications for how you can do math. As you discovered by yourself. (Kudos!)
It is, with the right teacher, a super-exciting story with lots of colorful history & characters. You can even guide your students to arrive at that same crucial insight. Alas, few teachers have the necessary amount of passion to make that come alive.
But that's probably why few people write about it - for most, either their math interest got murdered in HS so they never get to look at it, or they did learn it in 10th grade and so consider it not worth mentioning.
(Corollary - we should all write more about the cool realizations we've had, even if they're "well known" in retrospect.)
prerok
I must admit I wanted to write that all of that is true, but that calculating logarithms is difficult - then I looked it up and found that the slide rule was invented shortly thereafter.
xyzzyz
> When the numbers are represented as two logarithms the operation is reduced to adding two numbers again, so O(log(d) + [whatever the log/inverse log conversion cost is]).
This is pretty confused. First, if d is the number of digits of a number, then you will need O(d) of memory to store its logarithm, otherwise you will lose a lot of precision. Thus, the addition of logarithms is still O(d), not O(log(d)).
Second, calculating log(x) and its inverse, exp(x) is much more expensive than multiplication. For example, if x is between 0 and 1, you can approximate exp(x) pretty well as 1 + x + x^2/2 + x^3/6 + x^4/24. This is 3 multiplications and 3 divisions just to convert.
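For illustration, that degree-4 approximation written out as code (just a sketch of the operation count; real libm implementations use better polynomials plus range reduction):

    #include <stdio.h>

    /* Degree-4 Taylor approximation of exp(x) for x in [0, 1]:
       three multiplications to build the powers, three divisions for the terms. */
    static double exp_taylor4(double x) {
        double x2 = x * x;    /* mult 1 */
        double x3 = x2 * x;   /* mult 2 */
        double x4 = x3 * x;   /* mult 3 */
        return 1.0 + x + x2 / 2.0 + x3 / 6.0 + x4 / 24.0;
    }

    int main(void) {
        printf("%f\n", exp_taylor4(0.5));   /* ~1.6484, vs exp(0.5) ~ 1.6487 */
        return 0;
    }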
vanderZwan
> Thus, the addition of logarithms is still O(d), not O(log(d)).
Oh dear, yes that's wrong, thank you for catching that.
Still, O(d) is a lot better than O(d²). And perhaps more importantly: division (which is even more painful to compute) becomes a simple matter of subtracting the logarithmic transformations and then taking the inverse logarithm.
> Second, calculating log(x) and its inverse, exp(x) is much more expensive than multiplication.
You're not wrong, but you're talking about the costs of a computer doing the calculation. I was talking about a human doing calculations by hand. What I forgot to mention is that Mirifici Logarithmorum Canonis Descriptio contained 90 pages of precomputed tables. Multiplying two numbers is then a matter of two look-ups, one addition, and another look-up. For large enough numbers that's worth it (and "large enough" is reached pretty quickly in the case of calculations done by hand).
And shortly after the introduction of logarithms, William Oughtred invented the slide rule, using two log scales to create an analog computer, removing the need for a table as well (depending on the precision required).
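In code, the table workflow described above looks roughly like this, with libm's log10 and pow standing in for the printed tables (a real table lookup would also round to the table's precision):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 1234.0, b = 5678.0;
        /* "Look up" the logs, add them, then "look up" the antilog. */
        double log_a = log10(a);
        double log_b = log10(b);
        double product = pow(10.0, log_a + log_b);
        printf("exact:   %.0f\n", a * b);     /* 7006652 */
        printf("via log: %.0f\n", product);   /* same, up to rounding */
        return 0;
    }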
ChuckMcM
Wish I could upvote this more than once :-). It's also why people had books of 'logs' back in the day that were just tables of the log10 values of given numbers.
saulpw
John Napier is my current math nerd hero! I've been trying to promote "magnitude notation"[0] as the next evolution in numeric notation. If we used mag numbers in science and journalism and even finance, we'd get the same increase in mental superpowers as we got with positional notation.
vanderZwan
What a nice idea!
Notation making it easier to intuit logarithmic scales does sound like a good idea in a world where human societies partly struggle to act because of things like the majority of people not intuitively understanding exponential growth, or the difference in scale between "millionaire" and "billionaire". It would be a good tool for the mind.
fragmede
> In fact I'd argue that they are the second-biggest computational optimization in use today, with only positional notation being a bigger deal
For the sake of argument, I think representing numbers in binary is the most fundamental optimization in use today.
vanderZwan
You're arguing that positional notation using a specific base is more fundamental than the concept of positional notation itself?
Although I agree that base 2 is a great pick, of course, for many reasons. So perhaps we should put binary in second place, and logarithms in third.
mlyle
Yes, and overflowing the mantissa moves the exponent in the right direction by the right amount; it's just that the least significant bits are no longer quite in the right place. But that's by definition a small error.
emmelaich
Anyone who did high school before calculators (late 70s) would have carried around a book of logarithmic tables so they could do the multiplication necessary for many problems.
sethaurus
This is the core trick in the Fast Inverse Square Root[0], as made famous by the Quake III source code.
It uses a shift (equivalent to dividing) and a subtraction in integer-land to estimate x^(-0.5) in float-land.
[0]: https://en.m.wikipedia.org/wiki/Fast_inverse_square_root
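For reference, a strict-aliasing-safe paraphrase of that routine rather than the verbatim Quake III code: the shift halves the approximate logarithm (square root), the subtraction from the magic constant 0x5f3759df negates it (reciprocal), and one Newton-Raphson step cleans up the result.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Approximate 1/sqrt(x): shift the bit pattern right (halving the
       approximate logarithm, i.e. square root), subtract it from a magic
       bias constant (negation, i.e. reciprocal), then refine once. */
    static float fast_rsqrt(float x) {
        uint32_t i;
        float y;
        memcpy(&i, &x, sizeof i);
        i = 0x5f3759dfu - (i >> 1);
        memcpy(&y, &i, sizeof y);
        y = y * (1.5f - 0.5f * x * y * y);   /* Newton step for f(y) = 1/y^2 - x */
        return y;
    }

    int main(void) {
        printf("%f vs %f\n", fast_rsqrt(4.0f), 0.5f);
        return 0;
    }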
Timwi
I've always hated that it's called an “inverse”. It's not an inverse. The inverse of square root is square. If they had called it “reciprocal”, it would have been clear to me what it does, but “inverse” confused the hell out of me when I first saw it.
Someone
It’s confusing for non-mathematicians, but (and you may know that) it is not incorrect. https://en.wikipedia.org/wiki/Inverse_element:
“In mathematics, the concept of an inverse element generalises the concepts of opposite (−x) and reciprocal (1/x) of numbers.”
disconcision
I think this is just typical natural-language ambiguity:
"Inverse Square root" is used to mean "(Multiplicative) Inverse (of the) Square root" as opposed to "(Square root) Inverse"
bmacho
No, it's confusing for mathematicians because it is directly opposing math terminology. (Which is okay, but confusing.)
seeknotfind
Right - the cost of the shorter phrase is that it's ambiguous, not that it's incorrect. Inverse square root could also mean -(x^0.5) if it meant the additive inverse, and it could mean x^2 if it meant the functional inverse, as said here.
empath75
Since calling the square an inverse square root makes zero sense in any practical context, the other common meaning being applied is obvious. In this case, it's not the inverse of a _function_, it's the inverse of a _number_, which is usually considered to be the same as the reciprocal.
throwawaymaths
As a mathematician: it's fine. The word "multiplicative" is just silent in the phrase.
bee_rider
Interestingly, Intel got it right - hence the rsqrt family of instructions.
manojlds
The inverse is about the negative power, right? The square root is the 0.5.
vrighter
"inverse" has a very specific meaning: inverse(x) * x = 1
x^2 * x != 1 for any x other than 1. So no, x^2 is not the inverse of sqrt(x).
quietbritishjim
No. "inverse" has a very specific meaning: the operation that undoes whatever operation you're talking about.
https://en.m.wikipedia.org/wiki/Inverse_function
The inverse of multiplying by 3 is dividing by 3, and vice-versa.
The inverse of squaring is taking the square root, and vice-versa.
The inverse of a number or other element (rather than an operation / function) means whatever number would form the inverse if you use it with whatever binary operation you're talking about. So the additive inverse of 3 is –3, the inverse of "rotate clockwise 90 degrees" in the group of geometric operations is "rotate anticlockwise 90 degrees", and the multiplicative inverse of 3 is 1/3.
Saying "the inverse of 3" to mean 1/3, i.e. its multiplicative inverse, is sloppy but acceptable so long as everyone knows what you mean (and it's fairly common). Saying "the inverse square root" to mean 1/square root is just wrong.
tczMUFlmoNk
But sqrt · square = 1, where "sqrt: R⁺ → R⁺" is the square root operation, "square: R⁺ → R⁺" is the squaring operation, "1: R⁺ → R⁺" is the identity operation, and (·) is function composition: i.e., working in the monoid of functions R⁺ → R⁺. So x² and sqrt(x), as elements of R, are not inverses, but sqrt and square, as elements of R⁺ → R⁺, are.
It depends on how you parse the phrase "fast inverse square root".
Animats
Neat. Of course it works for the exponent, but that it's not that far off for the mantissa is unexpected. It helps that the mantissa is normalized.
guyomes
For the mantissa, the insight is probably that:
(1+e1)*(1+e2) = 1+e1+e2+(e1*e2)
If e1 and e2 are small, then e1*e2 is negligible.
SushiHippie
Previous discussion of the paper:
Addition is all you need for energy-efficient language models - https://news.ycombinator.com/item?id=41784591 - Oct 2024 - 126 comments
rowanG077
This can be very worth it in circuit design for custom accelerators. Floating point operations are often multicycle. If you can get close enough with addition, it will save a ton of resources and probably also simplify the design.
nabla9
It's math and not just for integers. That's also how a slide rule works.
a×b = 10^(log(a×b))
log(a×b) = log(a)+log(b)
thus a×b = 10^(log(a)+log(b))
Sharlin
Yes, but what is nontrivial here is how good the approximation is, despite the mantissa part being linear (much better than the upper bound of being within 2x of the real value).
shiandow
Nontrivial yes, but it's not that complicated to figure out that floats are a piecewise linear approximation of the base 2 logarithm.
Edit: or base 2^k, depends a bit how you view it.
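That piecewise-linear view is easy to check numerically: interpret the bit pattern as an integer, scale by 2^-23 and subtract the 127 bias, and the result hugs log2(x). A quick sketch (positive normal floats assumed):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* bits(x) / 2^23 - 127 is a piecewise linear approximation of log2(x). */
        for (float x = 1.0f; x <= 16.0f; x *= 1.5f) {
            uint32_t u;
            memcpy(&u, &x, sizeof u);
            double approx = u / 8388608.0 - 127.0;   /* 8388608 = 2^23 */
            printf("x=%8.4f  approx=%8.5f  log2=%8.5f\n", x, approx, log2(x));
        }
        return 0;
    }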
advisedwang
The very first paragraph of the article says this and then goes on to say the trick gets the mantissa roughly right too:
> You can multiply two numbers by adding their exponents. So just with the exponent-addition you will be within a factor of 2 of the right result. But this will actually be much better and get within 7.5% of the right answer. Why?
zeckalpha
> The paper talks about using this to save power, but that’s probably not worth it for a few reasons
It's probably not worth it to do this in software. But in hardware, it might be!
A similar approach with a hardware prototype. https://research.nvidia.com/publication/2022-12_lns-madam-lo...
WhitneyLand
I wonder if there was any inspiration from Quake III.
Didn’t the inverse square root trick rely on bit-level floating point and subtraction of a bias?
akomtu
A float32 value (1+m)×2^e is stored (ignoring the sign bit) as the int32 bit pattern (e+127)<<23 | m×2^23.
(1+m1)×2^e1 × (1+m2)×2^e2 = (1+m1+m2+m1×m2)×2^(e1+e2)
If m1×m2 is small, that's approximately float32(m1+m2, e1+e2).
fatuna
Would subtraction also approximate division?
Jaxan
Yes
Timwi
Would negation also approximate the reciprocal?
vanderZwan
Yes. Logarithms are wonderful like that :)
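A sketch of both, continuing the earlier bit-trick example (same assumptions: IEEE 754 single precision, positive normal inputs, 0x3F800000 = bit pattern of 1.0f; the helper names are illustrative):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t to_bits(float f)      { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
    static float    from_bits(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

    int main(void) {
        float a = 10.0f, b = 3.0f;
        /* Subtract bit patterns, add the bias back once: approximates a / b. */
        float div_approx = from_bits(to_bits(a) - to_bits(b) + 0x3F800000u);
        /* "Negate" relative to the bias: approximates 1 / b. */
        float rcp_approx = from_bits(2u * 0x3F800000u - to_bits(b));
        printf("a/b: exact %f  approx %f\n", a / b, div_approx);
        printf("1/b: exact %f  approx %f\n", 1.0f / b, rcp_approx);
        return 0;
    }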