
The Pentium contains a complicated circuit to multiply by three

marginalia_nu

This is very tangential. I did some work on trinary computer emulation a lifetime ago; there's a cute trick for finding a closed-form translation between a division by a power of 3 and a series of bit shifts and additions.

Start by observing that

1/3 - 1/2 = 2/6 - 3/6 = -1/6 = -(1/2)(1/3)

or

1/3 = 1/2 - (1/2)(1/3)

Substitute the equation above into its own RHS an infinite number of times and find

1/3 = sum of -(-1/2)^N for N in 1..inf = 1/2 - 1/4 + 1/8 - ...

You can do this with arbitrary pairs of powers of 2 and 3 (and other bases as well).

The implication is that you can fairly easily build a fixed-time divide-by-constant circuit out of nothing but adders and subtractors, for divisors that are close to a power of two.
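A minimal Python sketch of the trick (my own illustration, not from any real design; per-shift truncation can leave the result slightly low, so a real circuit would need a correction step):

    def div3_approx(x, terms=20):
        # 1/3 = 1/2 - 1/4 + 1/8 - ..., so x/3 is an alternating
        # sum of right shifts; more terms = more precision
        acc = 0
        for n in range(1, terms + 1):
            t = x >> n
            acc += t if n % 2 == 1 else -t
        return acc

    print(div3_approx(300))  # 100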

philsnow

> you may have heard of techniques such as carry lookahead, Kogge-Stone addition

Just an aside: that is Peter Kogge, who did his PhD work at Stanford, worked on the Space Shuttle, is an IBM Fellow, and invented the first multi-core CPU.

atq2119

> invented the first multi-core CPU

The man clearly has many deserved achievements to his name, but that remains true without this one, and I'm confident the world would be better off without this kind of statement.

"A multi-core CPU" isn't much of an invention per se. It's an idea -- one that is fairly obvious and trivial at a certain point of semiconductor history. Getting a multi-core CPU to run is not trivial, but it's not a single invention either, and by the time we got there, development teams were so large that it would be downright insulting to claim that one person solved all these problems by himself. Perhaps Kogge led the development of the first multi-core CPU, perhaps he was even visionary in the sense that he pushed for it before others thought it was feasible (I do not know if that is the case). But either way, he didn't invent it.

philsnow

Thank you for keeping me honest, I concede the point; I was quoting from his Wikipedia entry and wasn’t particularly critical because I took an architecture class from him in grad school and liked him as a professor.

oidar

I thought Kunle Olukotun led the team for the first multi-core CPU.

philsnow

You may absolutely be right, I don’t know who did it “first”.

I read your comment in exactly this voice https://www.getyarn.io/yarn-clip/6b70f8d0-5706-4e10-a6e9-e61...

(In this scene, Steve Martin’s character Vinnie is trying to get away from Rick Moranis’s character Barney, and he gets a bunch of actors to pretend to be his family (none of whom speak English) to play on the latter’s sympathy and ask to have his handcuffs removed. Vinnie introduces Barney as, among other things, the inventor of the rotary engine, causing one of the actresses to break character and say “I thought Wankel invented the rotary engine”.)

phkahler

The Cinematronics arcade game processor has two 12-bit accumulators. The multiply instruction shifts them right as a single 24-bit value and adds the contents of memory if a 1 came out of the LSB. So you clear the upper half, load one value into the lower (I forget how you set the memory address for the other operand...), and execute several 1-bit multiplies in a row. You can get a 24-bit product this way, but most code I saw had runs of 8 multiplies. The most common use was a 2x2 matrix multiply to do coordinate rotations for game objects.
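For illustration, here's the textbook add-then-shift form of that loop in Python (a sketch only; the real instruction's exact register and carry behaviour may differ):

    def mul_1bit_steps(a, b, bits=12):
        # hi:lo acts as one 24-bit register; lo starts with one operand,
        # hi accumulates the other, one product bit per step
        hi, lo = 0, a & ((1 << bits) - 1)
        for _ in range(bits):
            if lo & 1:               # a 1 comes out of the LSB
                hi += b              # add the memory operand to the top half
            lo = (lo >> 1) | ((hi & 1) << (bits - 1))  # shift right as a pair
            hi >>= 1
        return (hi << bits) | lo     # full 24-bit product

    print(mul_1bit_steps(100, 200))  # 20000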

This was in the mid-1970s, using off-the-shelf 7400-series parts, and it had a peak throughput of 5 MIPS.

greesil

Not exactly one cycle per multiply, it sounds like. Those 5 MIPS would get eaten up quickly. I have sometimes had to do fixed-point math in the past 20 years, and I have much respect for the programmers who came before me.

kens

Author here if anyone has questions...

vanderZwan

What happened to the dedicated "times three" multiplier in later machines? Did some form of it stick around? Did they change tactics to something that made it obsolete?

phire

Obsolete.

You can observe the evolution of multipliers in the MIPS line, which I've been studying and happen to know.

The R3000A had the same Radix-8 (aka 3 bits per cycle) integer multiplier in 1989.

The R4000 had two Radix-8 multipliers in 1991. One for integer, one for floating point.

The R4400 was an upgraded R4000 in 1992. The integer multiplier was kept at Radix-8, but the FPU was upgraded to a Radix-256 design (aka 8 bits per cycle).

In parallel, MIPS spent a lot of time creating a low-power design for the target market of "Windows NT laptops". The result was the R4200, released in 1993. MIPS published quite a bit of information about the various power-saving optimisations [1], [2]. Instead of separate integer and floating point data paths, they created a unified data path that did both, allowing them to use the same Radix-8 multiplier for everything. They even merged the multiplier unit into the main adder, rather than using a separate adder like earlier designs.

In 1995, MIPS released the R4300i (aka the CPU found in the Nintendo 64). It was an evolution of the R4200, keeping the unified float/integer datapath, but it gained the Radix-256 multiplier from the R4400 design, so both integer and float instructions complete at 8 bits per cycle.

As far as I can tell, this Radix-256 multiplier doesn't use any fancy tricks. It's just an array of eight 64-bit wide carry-save adders, feeding into a regular carry-lookahead adder.
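For anyone unfamiliar: a carry-save adder takes three addends and produces two outputs (the bitwise sums and the carries) without propagating carries between bit positions, which is what makes stacking eight of them fast. A rough Python sketch of the idea (illustrative, not the actual circuit):

    def carry_save_add(a, b, c):
        # 3:2 compressor: a full adder per bit; carries saved, not propagated
        return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

    def csa_reduce(terms):
        # fold a list of partial products down to two values, then one
        # real carry-propagating add at the end
        while len(terms) > 2:
            s, cy = carry_save_add(*terms[:3])
            terms = terms[3:] + [s, cy]
        return sum(terms)

    print(csa_reduce([3, 5, 7, 11, 13]))  # 39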

In 1996, MIPS released the R10000. Transistors were now cheap enough that they could implement a full 52-bit adder for their floating point data path, allowing them to issue one double-precision multiplication every cycle (though it's pipelined with a 2-cycle latency). I assume it's just 52 stacked adders, though it seems like they'd need to be doing something fancier with carries by the time it's that big.

Most modern CPUs have ended up at the same point. Massive pipelined arrays of adders that can execute one 64-bit multiplication every cycle, with a 3 cycle latency.

[1] https://www.youtube.com/watch?v=nll5MWlG7q4 [2] https://ieeexplore.ieee.org/document/282948/

atq2119

It seems you misunderstood the design of the multiplier described in the article.

The Pentium multiplier isn't 3 bits per cycle, it's one full multiplication per cycle (though pipelined over multiple cycles).

The part that's 3 bits is the Booth multiplier trick of multiplying by 3-bit digits instead of 1-bit digits, but the digits are all multiplied in parallel.

As such, I suspect (but don't know) that this trick or some variant of it is used even today.

kens

There were a lot of different algorithms in use back then. I don't know if there is a winner used in modern chips.

dsvf

No question, but just in general appreciation for the wonderful content you put out there that we get to enjoy!

kristianp

This is probably a basic question, but is this for floating-point multiplies? Isn't the part that's being multiplied less than 64 bits, because you also add the exponents?

kens

Yes, this is for floating point multiplies. But internally, the FPU uses x86 extended-precision numbers, 80-bit floats, with 64 bits of significand and 15 exponent bits. So the multiplier is dealing with 64-bit numbers, plus a few more bits to support rounding correctly.

https://en.wikipedia.org/wiki/Extended_precision#x86_extende...

java-man

Ken, maybe it's time to publish a book?

kens

That would require me to focus on one subject :-)

dwedge

I only tenuously understand this so feel free to ignore the question if it's too asinine but:

> (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)

Why can't you do the same to subtract x4 from x7 to get x3?

thombat

Because x8 is simply a shift left by 3 bits, so it's "free" (fixed-size shifts are very simple to implement), whereas generating the x7 would require additional circuitry, which would also be more expensive than the x3 circuit (since it's another shift+add beyond what x3 required).

mikequinlan

x7 is computed as x8 - x, so you get x8 - x - x4, which is 2 subtractions (and 2 delays). You only want 1 delay.

russdill

The question would be why isn't it 4x - 1x?

kens

The trick is that Booth's algorithm gives you the ×8 term for free; you multiply by one more in the next base-8 digit. You can put either ×4 or -×1 into a term, but you can't put both of them into one term. So you can do ×8 - ×1 but you can't do ×4 - ×1.
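Here's one way to express that recoding in Python (a simplified illustration, not Intel's exact scheme): any base-8 digit of 5, 6, or 7 becomes a negative digit plus a carry of one into the next digit, so each term only ever needs 0, ±1, ±2, ±3, or ±4 times the multiplicand, and ±3 is the one case that needs the precomputed circuit:

    def booth_radix8_digits(n, digits=22):
        # recode n into signed base-8 digits in {-3..4}; one spare
        # digit gives the final carry somewhere to go
        out, carry = [], 0
        for _ in range(digits):
            d = (n & 7) + carry
            n >>= 3
            if d > 4:            # 5,6,7,8 -> -3,-2,-1,0 plus a carry out
                d -= 8
                carry = 1
            else:
                carry = 0
            out.append(d)
        return out               # n == sum(d * 8**i)

    digs = booth_radix8_digits(7)
    print(digs[:2])                                    # [-1, 1]: 7 = -1 + 1*8
    print(sum(d * 8**i for i, d in enumerate(digs)))   # 7

A digit of 5 comes out as -3 plus a carry, which is exactly the "subtract x3 from x8 to get x5" construction mentioned in the article.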

mjevans

I suspect it's logically equivalent, but one bit wider in hardware.

Binary 4x is a left shift by 2, rather than 2x, which is a shift by one.

Adding or subtracting in 2's complement space is the same operation for the bits of the word.

mjevans

"This ×3 multiplier contains roughly 9000 transistors, a bit more than an entire Z80 microprocessor (1976). Keep in mind that the ×3 multiplier is a small part of the floating-point multiplier, which is part of the floating-point unit in the Pentium. Thus, this small piece of a feature is more complicated than an entire microprocessor from 17 years earlier, illustrating the incredible growth in processor complexity."

That's the pace of performance growth that led software to become bloated today; next year's performance improvements would cover up most of the sins of failing to think critically about algorithms and data flow context/locality.

Today (as far as I've read) we're at the practical limit of what's reasonable to do with silicon semiconductor technology and our present understanding of physics. The pendulum needs to swing the other direction; it's time for computers to work smarter, not harder.

JumpCrisscross

> Today (as far as I've read) we're at the practical limit for what's reasonable to do with silicon semiconductor technology and our present understanding of physics

We’ve been at that limit for decades.

GrumpyYoungMan

The fabs propped up the corpse of Moore's Law by throwing mountains of cash at expanding transistors into the third dimension: finFET, GAA, CFET, etc. That has kept the party going a little while longer than it would have lasted, but it's a one-time deal, since there are no more dimensions to expand into.

brookst

…but that's how it's always worked. Moore's law is dead, we're at the limit of everything, oh hey, Moore's law limps by again because someone did something clever.

dehrmann

> into the third dimension

Does this actually work? At some point, and this has been the case for a while, you're limited by thermals. You can't stack more layers without adding more cooling.

WXLCKNO

Depends, there could be 11 dimensions to expand into.

acchow

> since there are no more dimensions to expand into.

Quantum computing is next, right?

Joker_vD

Depends on what limit exactly you are thinking about: the size of a transistor is still shrinking.

dlcarrier

The speed of software bloat matching hardware improvement is known as Wirth's law: https://en.wikipedia.org/wiki/Wirth%27s_law

I think software bloat is growing faster, though.

_carbyau_

I think software bloat is affected more by the speed of light. If only every damn thing didn't need internet access, with its associated lag, often variable in timing...

klempner

The speed of light is very relevant for actual performance-sensitive software, not just shitty, overly chatty trash.

While in practice the latency is over an order of magnitude larger, the speed-of-light round-trip time between two sockets on the same motherboard is likely on the scale of 1-2 nanoseconds, which is already several cycles. Once you're talking about warehouse-sized datacenters, the speed-of-light delay can get into the microseconds, even in a straight line through vacuum/air.

dragontamer

The internet has largely replaced CD-ROMs, floppies, and DVDs.

The stuff you use a computer's hard drive or flash drive for has remained consistent. Some items may have moved to the internet, but most games, OS utilities, and whatnot are stored on local HDDs or SSDs. They always have been, and likely will remain so.

I remember when data interchange meant burning a disc and walking it over to my buddy. XML will standardize that and simplify my workflow. /s

acchow

History of function calls:

- goto/jmp to instr

- vtable lookup

- hash and lookup in a dictionary

- run a Large Language Model

userbinator

On the other hand, the multiplier is far more regular in structure than a Z80. The datapath on the Pentium is several times wider too.

kens

A bit surprisingly, the Z80's datapath is 4 bits wide. The Pentium's FPU is more than 64 bits wide (extra bits for rounding and stuff). So, yes, there is a big difference in the datapath widths.

HPsquared

Luckily there is plenty of room for improvement in most applications.

ajsnigrutin

But why would we waste thousands of hours of programmers' time optimizing stuff, if we can instead waste millions if not billions of hours of users' time?

/s

Someone

Because it saves lives. https://folklore.org/Saving_Lives.html:

"Well, let's say you can shave 10 seconds off of the boot time. Multiply that by five million users and thats 50 million seconds, every single day. Over a year, that's probably dozens of lifetimes. So if you make it boot ten seconds faster, you've saved a dozen lives. That's really worth it, don't you think?”

dijit

You know, it's funny: the first time I ever read a book about compiled languages (I think it might have even been K&R C), it talked about the fact that code is run significantly more often than it's compiled, and read more often than it's written. Thus we should optimise first for runtime speed and afterwards for readability.

Yet on Hacker News, the exact opposite sentiment holds.

Employee hours are expensive, so optimise for them as much as you possibly can and push as much time onto the user as they will tolerate. Focus primarily on readability, to the detriment of all other factors. Heck, don't even bother compiling your program; use an interpreter on the server.

Keep your local costs down and your remote costs up… after all, remote costs are not your costs, especially if every company is doing the same thing.

It's ironic that I could read a book and have it be so wrong, yet that book is the foundational Bible for many people's programming journeys.


Salgat

Which is how it should be. There's a limited amount of resources for developing a chip; best to take the least costly route given those constraints.

EncomLab

Jonathan Blow has entered the chat...

sifar

Interesting choice of a radix-8 Booth multiplier that needs the x3 circuit. Seems like an area/perf tradeoff made to push the fmax, which suggests they had a latency (cycle count) constraint, since the same fmax could be achieved via more pipelining.

kens

Yes, it's a tradeoff. Many other FPUs at the time used radix-4 instead because it avoids the extra ×3 circuit. Pipelining is tricky because there isn't a nice place to break the multiplication array into two pieces.

sifar

Sure it takes a few iterations, but there is nothing inherently complex about it.

You feed the partial products into a carry-save adder tree and choose a level that minimizes the delay and the extra flop area/power. I am not sure about the Pentium's process technology, but whatever the area, it might be comparable to the x3 multiplier's.

There is an implicit add in the x3 multiplier circuit that increases the delay, which was evidently deemed more acceptable than the radix-4 delay. Considering all this, I strongly believe latency was a hard constraint (maybe for SW perf).

initramfs

https://github.com/EI2030/Low-power-E-Paper-OS/blob/master/P...

8086: 29,000
386: 275,000
486: 1.2 million
Pentium: 3.1 million

The NSA entered the game sometime after 2000, I recall.

efitz

A long time ago, on the 6502, I was trying to figure out how to quickly multiply by 3 in assembly, and I came up with:

1. Shift-left

2. Add original value

For multibyte values, use rotate-left-through-carry, doing the rotates from LSB to MSB. Then clear carry and do the adds from LSB to MSB with carries. Should work for any word size.
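In Python rather than 6502 assembly, the multibyte version looks something like this (a sketch of the same rotate-then-add-with-carry idea; names are mine):

    def times3_multibyte(value):
        # value is a little-endian list of bytes; compute 3*value as
        # (value << 1) + value, one byte at a time, LSB to MSB
        shifted, carry = [], 0
        for b in value:                            # the ASL/ROL pass
            shifted.append(((b << 1) | carry) & 0xFF)
            carry = b >> 7
        shifted.append(carry)                      # room for overflow
        out, carry = [], 0
        for s, o in zip(shifted, value + [0]):     # the CLC/ADC pass
            t = s + o + carry
            out.append(t & 0xFF)
            carry = t >> 8
        return out

    print(times3_multibyte([200]))  # [88, 2], i.e. 2*256 + 88 = 600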

Ofc I was a teenager but this seemed clever and simple to me.

Was (am) I just missing something?

bean-weevil

This is covered in the article. See the section "Implementing a fast ×3 circuit with carry lookahead"

efitz

Thank you. I stopped reading after the octal discussion.

russdill

For most assembly, I'm not sure how this has an advantage over two adds.

dashtiarian

Shift-add used to execute much faster than multiplication in the 8088 days, and people would use it when they had to multiply an int by a known scalar (shift took 4 clocks and add took 12).

funny_falcon

Compilers still prefer LEA over multiply for 3, 5, and 9, which performs a shift+add, I believe.

jonsen

A shift is faster than an add.

russdill

On the 6502 they are both 2 cycles. On the vast majority of processors I'm aware of, "simple" ALU operations don't vary in execution time.

thaumasiotes

I'm missing something:

> The downside to radix-8 multiplication is that multiplying by a number from 0 to 7 is much more complicated than multiplying by 0 or 1, which is almost trivial. Fortunately, there are some shortcuts. Note that multiplying by 2 is the same as shifting the number to the left by 1 bit position, which is very easy in hardware—you wire each bit one position to the left. Similarly, to multiply by 4, shift the multiplicand two bit positions to the left.

> Multiplying by 7 seems inconvenient, but there is a trick, known as Booth's multiplication algorithm. Instead of multiplying by 7, you add 8 times the number and subtract the number, ending up with 7 times the number. You might think this requires two steps, but the trick is to multiply by one more in the digit to the left, so you get the factor of 8 without an additional step.

> Thus, you can get the ×7 by subtracting. Similarly, for a ×6 term, you can subtract a ×2 multiple and add ×8 in the next digit.

> Thus, the only difficult multiple is ×3. (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)

If we can easily compute ×2 for the purposes of exploiting 6x = 8x - 2x, and we can easily compute ×4 for the purposes of exploiting 4x = 4x...

why is it more difficult than that to compute 3x as the sum of 2x and 1x, or the difference of 4x and 1x?

Also, if we can easily compute ×6 by any means, why not compute ×3 as a right shift of that value? That is admittedly an extra step, but the extra step is a shift.

kens

For the 64-bit multiplication, you're adding 22 terms, one for each base-8 digit. (Think of grade-school multiplication.) Each term needs to be trivial to compute, so you can do a shift and/or a negation to get the term, but you can't do another addition.

The point is that ×3 is precomputed once, so you can then simply feed it into any of the 22 terms as needed. You can't do ×2 and ×1 into a term to get ×3 because every term would need another adder. In other words, you want one circuit to compute ×3 rather than 22 circuits.

For your ×6 question, this value is computed by putting negative ×2 into the term and conceptually adding one more to the next digit to get ×8. You can't right-shift this ×8 value since it's part of a completely different term.

Hopefully this makes sense. There are a lot of digits and sums flying around.
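Putting the pieces together as a Python sketch (reusing the booth_radix8_digits illustration from upthread; this is my simplification, not the actual Pentium circuit): each term is a shifted, possibly negated, copy of an easy multiple or of the single precomputed x3, and no term needs its own extra adder:

    def booth8_multiply(a, b, bits=64):
        triple = (a << 1) + a             # the one "hard" multiple, built once
        easy = {0: 0, 1: a, 2: a << 1, 3: triple, 4: a << 2}
        total = 0
        for i, d in enumerate(booth_radix8_digits(b, bits // 3 + 1)):
            term = easy[abs(d)]           # select/shift only, no fresh addition
            total += (term if d >= 0 else -term) << (3 * i)
        return total

    print(booth8_multiply(12345, 6789))   # 83810205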

thaumasiotes

How do you do a negation without doing an addition? Can you do better than -x = ~x + 1?

kens

I think that the +1 is accomplished simply by setting carry-in on the adder for that term, so you don't need a separate adder. (Disclaimer: I haven't looked at that part of the circuit yet.)
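In rough Python terms (a guess at the mechanism, per the disclaimer above): the inverted term goes into the adder and the +1 rides in on the carry input, so no separate incrementer is needed:

    MASK = (1 << 64) - 1  # illustrative 64-bit width

    def add_term(acc, term, negate):
        # two's-complement subtract without a separate +1 adder:
        # invert the term and assert the adder's carry-in instead
        carry_in = 1 if negate else 0
        t = (~term & MASK) if negate else term
        return (acc + t + carry_in) & MASK

    print(add_term(100, 30, True))   # 70
    print(add_term(100, 30, False))  # 130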

Another interesting question is how sign extension is implemented for a negative term. The problem is that you're adding 64-bit shifted terms to create a 128-bit result. When negating a 64-bit term, all the unused bits to the left need to be set to 1. But the chip doesn't do this. (Among other things, the adders would need to be much wider than 64 bits.) So it must be using a trick for the sign extension.

sifar

If you are negating a single term, you can't.

If you are adding n terms using an adder tree, the 1 feeds into the carry-in of the next level (it is simply a concatenation, since the carry term is shifted left by 1). Only the final carry-propagate adder will have the add with 1.

artemonster

TLDR: 3a = (a << 1) + a

vanderZwan

From the article:

> Multiplying a number by three is straightforward in binary: add the number to itself, shifted to the left one position. (As mentioned above, shifting to the left is the same as multiplying by two and is easy in hardware.) Unfortunately, using a simple adder is too slow.

So no, not really. In fact, the entire point of the article is arguably explaining why it is not that simple.

brcmthrowaway

Can AI come up with this circuit?

goldfishgold

Almost certainly, insofar as it’s already known and described in a number of places, including Wikipedia.