
The Pentium contains a complicated circuit to multiply by three

marginalia_nu

This is very tangential. I did some work on trinary computer emulation a lifetime ago; there's a cute trick for finding a closed-form translation between a division by a power of 3 and a series of bit shifts and additions.

Start by observing that

1/3 - 1/2 = 2/6 - 3/6 = -1/6 = -(1/2)(1/3)

or

1/3 = 1/2 - (1/2)(1/3)

Substitute the equation above into its own RHS an infinite number of times and find

1/3 = sum of -(-1/2)^N for N in 1..inf = 1/2 - 1/4 + 1/8 - ...

You can do this with arbitrary pairs of powers of 2 and 3 (and other bases as well).

The implication is that you can fairly easily build a fixed-time divide-by-constant circuit out of nothing but adders and subtractors, for divisors that are close to a power of two.
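A minimal Python sketch of the trick (my own illustration, not from any real design; per-shift truncation can leave the result slightly low, so a real circuit would need a correction step):

    def div3_approx(x, terms=20):
        # 1/3 = 1/2 - 1/4 + 1/8 - ..., so x/3 is an alternating
        # sum of right shifts; more terms = more precision
        acc = 0
        for n in range(1, terms + 1):
            t = x >> n
            acc += t if n % 2 == 1 else -t
        return acc

    print(div3_approx(300))  # 100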

philsnow

> you may have heard of techniques such as carry lookahead, Kogge-Stone addition

Just an aside: that is Peter Kogge, who did his PhD work at Stanford, worked on the Space Shuttle, is an IBM Fellow, and invented the first multi-core CPU.

atq2119

> invented the first multi-core CPU

The man clearly has many deserved achievements to his name, but that remains true without this one, and I'm confident the world would be better off without this kind of statement.

"A multi-core CPU" isn't much of an invention per se. It's an idea -- one that is fairly obvious and trivial at a certain point of semiconductor history. Getting a multi-core CPU to run is not trivial, but it's not a single invention either, and by the time we got there, development teams were so large that it would be downright insulting to claim that one person solved all these problems by himself. Perhaps Kogge led the development of the first multi-core CPU, perhaps he was even visionary in the sense that he pushed for it before others thought it was feasible (I do not know if that is the case). But either way, he didn't invent it.

philsnow

Thank you for keeping me honest, I concede the point; I was quoting from his Wikipedia entry and wasn’t particularly critical because I took an architecture class from him in grad school and liked him as a professor.

oidar

I thought Kunle Olukotun led the team for the first multi-core CPU.

philsnow

You may absolutely be right, I don’t know who did it “first”.

I read your comment in exactly this voice https://www.getyarn.io/yarn-clip/6b70f8d0-5706-4e10-a6e9-e61...

(In this scene, Steve Martin’s character Vinnie is trying to get away from Rick Moranis’s character Barney, and he gets a bunch of actors to pretend to be his family (none of whom speak English) to play on the latter’s sympathy and ask to have his handcuffs removed. Vinnie introduces Barney as, among other things, the inventor of the rotary engine, causing one of the actresses to break character and say “I thought Wankel invented the rotary engine”.)

phkahler

The Cinematronics arcade game processor has two 12-bit accumulators. The multiply instruction shifts them right as a single 24-bit value and adds the contents of memory if a 1 came out of the LSB. So you clear the upper half, load one value into the lower (I forget how you set the memory address for the other operand...), and execute several 1-bit multiplies in a row. You can get a 24-bit product this way, but most code I saw had runs of 8 multiplies. The most common use was a 2x2 matrix multiply to do coordinate rotations for game objects.
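For illustration, here's the textbook add-then-shift form of that loop in Python (a sketch only; the real instruction's exact register and carry behaviour may differ):

    def mul_1bit_steps(a, b, bits=12):
        # hi:lo acts as one 24-bit register; lo starts with one operand,
        # hi accumulates the other, one product bit per step
        hi, lo = 0, a & ((1 << bits) - 1)
        for _ in range(bits):
            if lo & 1:               # a 1 comes out of the LSB
                hi += b              # add the memory operand to the top half
            lo = (lo >> 1) | ((hi & 1) << (bits - 1))  # shift right as a pair
            hi >>= 1
        return (hi << bits) | lo     # full 24-bit product

    print(mul_1bit_steps(100, 200))  # 20000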

This was in the mid-1970s, using off-the-shelf 7400-series parts, and it had a peak throughput of 5 MIPS.

greesil

Not exactly one cycle per multiply, it sounds like. Those 5 MIPS would get eaten up quickly. I have sometimes had to do fixed-point math in the past 20 years, and I have much respect for the programmers who came before me.

kens

Author here if anyone has questions...

vanderZwan

What happened to the dedicated "times three" multiplier in later machines? Did some form of it stick around? Did they change tactics to something that made it obsolete?

phire

Obsolete.

You can observe the evolution of multipliers in the MIPS line, which I've been studying and happen to know.

The R3000A had the same Radix-8 (aka 3 bits per cycle) integer multiplier in 1989.

The R4000 had two Radix-8 multipliers in 1991. One for integer, one for floating point.

The R4400 was an upgraded R4000 in 1992. The integer multiplier was kept at Radix-8, but the FPU was upgraded to a Radix-256 design (aka 8 bits per cycle).

In parallel, MIPS spent a lot of time creating a low-power design for the target market of "Windows NT laptops". The result was the R4200, released in 1993. MIPS published quite a bit of information about the various power-saving optimisations [1], [2]. Instead of separate integer and floating point data paths, they created a unified data path that did both, allowing them to use the same Radix-8 multiplier for everything. They even merged the multiplier unit into the main adder, rather than using a separate adder like earlier designs.

In 1995, MIPS released the R4300i (aka the CPU found in the Nintendo 64). It was an evolution of the R4200, keeping the unified float/integer datapath, but it gained the Radix-256 multiplier from the R4400 design, so both integer and float instructions complete at 8 bits per cycle.

As far as I can tell, this Radix-256 multiplier doesn't use any fancy tricks. It's just an array of eight 64-bit wide carry-save adders, feeding into a regular carry-lookahead adder.
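For anyone unfamiliar: a carry-save adder takes three addends and produces two outputs (the bitwise sums and the carries) without propagating carries between bit positions, which is what makes stacking eight of them fast. A rough Python sketch of the idea (illustrative, not the actual circuit):

    def carry_save_add(a, b, c):
        # 3:2 compressor: a full adder per bit; carries saved, not propagated
        return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

    def csa_reduce(terms):
        # fold a list of partial products down to two values, then one
        # real carry-propagating add at the end
        while len(terms) > 2:
            s, cy = carry_save_add(*terms[:3])
            terms = terms[3:] + [s, cy]
        return sum(terms)

    print(csa_reduce([3, 5, 7, 11, 13]))  # 39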

In 1996, MIPS released the R10000. Transistors were now cheap enough that they could implement a full 52-bit adder for their floating point data path, allowing them to issue one double-precision multiplication every cycle (though it's pipelined with a 2-cycle latency). I assume it's just 52 stacked adders, though it seems like they'd need to be doing something fancier with carries by the time it's that big.

Most modern CPUs have ended up at the same point. Massive pipelined arrays of adders that can execute one 64-bit multiplication every cycle, with a 3 cycle latency.

[1] https://www.youtube.com/watch?v=nll5MWlG7q4 [2] https://ieeexplore.ieee.org/document/282948/

atq2119

It seems you misunderstood the design of the multiplier described in the article.

The Pentium multiplier isn't 3 bits per cycle, it's one full multiplication per cycle (though pipelined over multiple cycles).

The part that's 3 bits is the Booth multiplier trick of multiplying by 3-bit digits instead of 1-bit digits, but the digits are all multiplied in parallel.

As such, I suspect (but don't know) that this trick or some variant of it is used even today.

kens

There were a lot of different algorithms in use back then. I don't know if there is a winner used in modern chips.

dsvf

No question, but just in general appreciation for the wonderful content you put out there that we get to enjoy!

kristianp

This is probably a basic question, but is this for floating-point multiplies? Isn't the part that's being multiplied less than 64 bits, because you also add the exponents?

kens

Yes, this is for floating point multiplies. But internally, the FPU uses x86 extended-precision numbers, 80-bit floats, with 64 bits of significand and 15 exponent bits. So the multiplier is dealing with 64-bit numbers, plus a few more bits to support rounding correctly.

https://en.wikipedia.org/wiki/Extended_precision#x86_extende...

java-man

Ken, maybe it's time to publish a book?

kens

That would require me to focus on one subject :-)

dwedge

I only tenuously understand this so feel free to ignore the question if it's too asinine but:

> (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)

Why can't you do the same to subtract x4 from x7 to get x3?

thombat

Because x8 is simply a shift left by 3 bits, so it's "free" (fixed-size shifts are very simple to implement), whereas generating the x7 would require additional circuitry, which would also be more expensive than the x3 circuit (since it's another shift+add beyond what x3 required).

mikequinlan

x7 is computed as x8 - x, so you get x8 - x - x4, which is 2 subtractions (and 2 delays). You only want 1 delay.

russdill

The question would be why isn't it 4x - 1x?

kens

The trick is that Booth's algorithm gives you the ×8 term for free; you multiply by one more in the next base-8 digit. You can put either ×4 or -×1 into a term, but you can't put both of them into one term. So you can do ×8 - ×1 but you can't do ×4 - ×1.
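Here's one way to express that recoding in Python (a simplified illustration, not Intel's exact scheme): any base-8 digit of 5, 6, or 7 becomes a negative digit plus a carry of one into the next digit, so each term only ever needs 0, ±1, ±2, ±3, or ±4 times the multiplicand, and ±3 is the one case that needs the precomputed circuit:

    def booth_radix8_digits(n, digits=22):
        # recode n into signed base-8 digits in {-3..4}; one spare
        # digit gives the final carry somewhere to go
        out, carry = [], 0
        for _ in range(digits):
            d = (n & 7) + carry
            n >>= 3
            if d > 4:            # 5,6,7,8 -> -3,-2,-1,0 plus a carry out
                d -= 8
                carry = 1
            else:
                carry = 0
            out.append(d)
        return out               # n == sum(d * 8**i)

    digs = booth_radix8_digits(7)
    print(digs[:2])                                    # [-1, 1]: 7 = -1 + 1*8
    print(sum(d * 8**i for i, d in enumerate(digs)))   # 7

A digit of 5 comes out as -3 plus a carry, which is exactly the "subtract x3 from x8 to get x5" construction mentioned in the article.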

mjevans

I suspect it's logically equivalent, but one bit wider in hardware.

Binary 4x is a left shift by 2, rather than 2x, which is a shift by one.

Adding or subtracting in 2's complement space is the same operation for the bits of the word.

mjevans

"This ×3 multiplier contains roughly 9000 transistors, a bit more than an entire Z80 microprocessor (1976). Keep in mind that the ×3 multiplier is a small part of the floating-point multiplier, which is part of the floating-point unit in the Pentium. Thus, this small piece of a feature is more complicated than an entire microprocessor from 17 years earlier, illustrating the incredible growth in processor complexity."

That's the pace of performance growth that led software to become bloated today; next year's performance improvements would cover up most of the sins of failing to think critically about algorithms and data flow context/locality.

Today (as far as I've read) we're at the practical limit of what's reasonable to do with silicon semiconductor technology and our present understanding of physics. The pendulum needs to swing the other direction; it's time for computers to work smarter, not harder.

JumpCrisscross

> Today (as far as I've read) we're at the practical limit for what's reasonable to do with silicon semiconductor technology and our present understanding of physics

We’ve been at that limit for decades.

GrumpyYoungMan

The fabs propped up the corpse of Moore's Law by throwing mountains of cash at expanding transistors into the third dimension: finFET, GAA, CFET, etc. That has kept the party going a little while longer than it would have lasted, but it's a one-time deal, since there are no more dimensions to expand into.

brookst

…but that's how it's always worked. Moore's law is dead, we're at the limit of everything, oh hey, Moore's law limps by again because someone did something clever.

dehrmann

> into the third dimension

Does this actually work? At some point, and this has been the case for a while, you're limited by thermals. You can't stack more layers without adding more cooling.

WXLCKNO

Depends, there could be 11 dimensions to expand into.

acchow

> since there are no more dimensions to expand into.

Quantum computing is next, right?

Joker_vD

Depends on what limit exactly you are thinking about: the size of a transistor is still shrinking.

dlcarrier

The speed of software bloat matching hardware improvement is known as Wirth's law: https://en.wikipedia.org/wiki/Wirth%27s_law

I think software bloat is growing faster, though.

_carbyau_

I think software bloat is affected more by the speed of light. If only every damn thing didn't need internet access, with its associated lag, often variable in timing...

klempner

The speed of light is very relevant for actual performance-sensitive software, not just shitty, overly chatty trash.

While in practice the latency is over an order of magnitude larger, the speed-of-light round-trip time between two sockets on the same motherboard is likely on the scale of 1-2 nanoseconds, which is already several cycles. Once you're talking about warehouse-sized datacenters, the speed-of-light delay can get into the microseconds, even in a straight line through vacuum/air.

dragontamer

The internet has largely replaced CD-ROMs, floppies, and DVDs.

The stuff you use a computer's hard drive or flash drive for has remained consistent. Some items may have moved to the internet, but most games, OS utilities, and whatnot are stored on local HDDs or SSDs. They always have been, and likely will remain so.

I remember when data interchange meant burning a disc and walking it over to my buddy. XML will standardize that and simplify my workflow. /s

acchow

History of function calls:

- goto/jmp to instr

- vtable lookup

- hash and lookup in a dictionary

- run a Large Language Model

userbinator

On the other hand, the multiplier is far more regular in structure than a Z80. The datapath on the Pentium is several times wider too.

kens

A bit surprisingly, the Z80's datapath is 4 bits wide. The Pentium's FPU is more than 64 bits wide (extra bits for rounding and stuff). So, yes, there is a big difference in the datapath widths.

HPsquared

Luckily there is plenty of room for improvement in most applications.

ajsnigrutin

But why would we waste thousands of hours of programmers' time optimizing stuff, if we can instead waste millions if not billions of hours of users' time?

/s

Someone

Because it saves lives. https://folklore.org/Saving_Lives.html:

"Well, let's say you can shave 10 seconds off of the boot time. Multiply that by five million users and thats 50 million seconds, every single day. Over a year, that's probably dozens of lifetimes. So if you make it boot ten seconds faster, you've saved a dozen lives. That's really worth it, don't you think?”

dijit

You know, it's funny: the first time I ever read a book about compiled languages (I think it might have even been K&R C), it talked about the fact that code is run significantly more often than it's compiled, and read more often than it's written. Thus we should optimise first for runtime speed and afterwards for readability.

Yet on Hacker News, the exact opposite sentiment holds.

Employee hours are expensive, so optimise for them as much as you possibly can and push as much time onto the user as they will tolerate. Focus primarily on readability, to the detriment of all other factors. Heck, don't even bother compiling your program; use an interpreter on the server.

Keep your local costs down and your remote costs up… after all, remote costs are not your costs, especially if every company is doing the same thing.

It's ironic that I could read a book and have it be so wrong, yet that book is the foundational Bible for many people's programming journeys.


Salgat

Which is how it should be. There's a limited amount of resources for developing a chip; best to take the least costly route given those constraints.

EncomLab

Jonathan Blow has entered the chat...

sifar

Interesting choice of a radix-8 Booth multiplier that needs the x3 circuit. Seems like an area/perf tradeoff made to push the fmax, which suggests they had a latency (cycle count) constraint, since the same fmax could be achieved via more pipelining.

kens

Yes, it's a tradeoff. Many other FPUs at the time used radix-4 instead because it avoids the extra ×3 circuit. Pipelining is tricky because there isn't a nice place to break the multiplication array into two pieces.

sifar

Sure it takes a few iterations, but there is nothing inherently complex about it.

You feed the partial products into a carry-save adder tree and choose a level that minimizes the delay and the extra flop area/power. I am not sure about the Pentium's process technology, but whatever the area, it might be comparable to the x3 multiplier's.

There is an implicit add in the x3 multiplier circuit that increases the delay, which was evidently deemed more acceptable than the radix-4 delay. Considering all this, I strongly believe latency was a hard constraint (maybe for SW perf).

initramfs

https://github.com/EI2030/Low-power-E-Paper-OS/blob/master/P...

8086: 29,000
386: 275,000
486: 1.2 million
Pentium: 3.1 million

The NSA entered the game sometime after 2000, I recall.

efitz

A long time ago, on the 6502, I was trying to figure out how to quickly multiply by 3 in assembly, and I came up with:

1. Shift-left

2. Add original value

For multibyte values, use rotate-left-through-carry, doing the rotates from LSB to MSB. Then clear carry and do the adds from LSB to MSB with carries. Should work for any word size.
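In Python rather than 6502 assembly, the multibyte version looks something like this (a sketch of the same rotate-then-add-with-carry idea; names are mine):

    def times3_multibyte(value):
        # value is a little-endian list of bytes; compute 3*value as
        # (value << 1) + value, one byte at a time, LSB to MSB
        shifted, carry = [], 0
        for b in value:                            # the ASL/ROL pass
            shifted.append(((b << 1) | carry) & 0xFF)
            carry = b >> 7
        shifted.append(carry)                      # room for overflow
        out, carry = [], 0
        for s, o in zip(shifted, value + [0]):     # the CLC/ADC pass
            t = s + o + carry
            out.append(t & 0xFF)
            carry = t >> 8
        return out

    print(times3_multibyte([200]))  # [88, 2], i.e. 2*256 + 88 = 600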

Ofc I was a teenager but this seemed clever and simple to me.

Was (am) I just missing something?

bean-weevil

This is covered in the article. See the section "Implementing a fast ×3 circuit with carry lookahead"

efitz

Thank you. I stopped reading after the octal discussion.

russdill

For most assembly, I'm not sure how this has an advantage over two adds.

dashtiarian

Shift-add used to execute much faster than multiplication in the 8088 days, and people would use it when they had to multiply an int by a known scalar (shift took 4 clocks and add took 12).

funny_falcon

Compilers still prefer LEA over multiply for 3, 5, and 9, which performs a shift+add, I believe.

jonsen

A shift is faster than an add.

russdill

On the 6502 they are both 2 cycles. On the vast majority of processors I'm aware of, "simple" ALU operations don't vary in execution time.

thaumasiotes

I'm missing something:

> The downside to radix-8 multiplication is that multiplying by a number from 0 to 7 is much more complicated than multiplying by 0 or 1, which is almost trivial. Fortunately, there are some shortcuts. Note that multiplying by 2 is the same as shifting the number to the left by 1 bit position, which is very easy in hardware—you wire each bit one position to the left. Similarly, to multiply by 4, shift the multiplicand two bit positions to the left.

> Multiplying by 7 seems inconvenient, but there is a trick, known as Booth's multiplication algorithm. Instead of multiplying by 7, you add 8 times the number and subtract the number, ending up with 7 times the number. You might think this requires two steps, but the trick is to multiply by one more in the digit to the left, so you get the factor of 8 without an additional step.

> Thus, you can get the ×7 by subtracting. Similarly, for a ×6 term, you can subtract a ×2 multiple and add ×8 in the next digit.

> Thus, the only difficult multiple is ×3. (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)

If we can easily compute ×2 for the purposes of exploiting 6x = 8x - 2x, and we can easily compute ×4 for the purposes of exploiting 4x = 4x...

why is it more difficult than that to compute 3x as the sum of 2x and 1x, or the difference of 4x and 1x?

Also, if we can easily compute ×6 by any means, why not compute ×3 as a right shift of that value? That is admittedly an extra step, but the extra step is a shift.

kens

For the 64-bit multiplication, you're adding 22 terms, one for each base-8 digit. (Think of grade-school multiplication.) Each term needs to be trivial to compute, so you can do a shift and/or a negation to get the term, but you can't do another addition.

The point is that ×3 is precomputed once, so you can then simply feed it into any of the 22 terms as needed. You can't do ×2 and ×1 into a term to get ×3 because every term would need another adder. In other words, you want one circuit to compute ×3 rather than 22 circuits.

For your ×6 question, this value is computed by putting negative ×2 into the term and conceptually adding one more to the next digit to get ×8. You can't right-shift this ×8 value since it's part of a completely different term.

Hopefully this makes sense. There are a lot of digits and sums flying around.
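Putting the pieces together as a Python sketch (reusing the booth_radix8_digits illustration from upthread; this is my simplification, not the actual Pentium circuit): each term is a shifted, possibly negated, copy of an easy multiple or of the single precomputed x3, and no term needs its own extra adder:

    def booth8_multiply(a, b, bits=64):
        triple = (a << 1) + a             # the one "hard" multiple, built once
        easy = {0: 0, 1: a, 2: a << 1, 3: triple, 4: a << 2}
        total = 0
        for i, d in enumerate(booth_radix8_digits(b, bits // 3 + 1)):
            term = easy[abs(d)]           # select/shift only, no fresh addition
            total += (term if d >= 0 else -term) << (3 * i)
        return total

    print(booth8_multiply(12345, 6789))   # 83810205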

thaumasiotes

How do you do a negation without doing an addition? Can you do better than -x = ~x + 1?

kens

I think that the +1 is accomplished simply by setting carry-in on the adder for that term, so you don't need a separate adder. (Disclaimer: I haven't looked at that part of the circuit yet.)
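In rough Python terms (a guess at the mechanism, per the disclaimer above): the inverted term goes into the adder and the +1 rides in on the carry input, so no separate incrementer is needed:

    MASK = (1 << 64) - 1  # illustrative 64-bit width

    def add_term(acc, term, negate):
        # two's-complement subtract without a separate +1 adder:
        # invert the term and assert the adder's carry-in instead
        carry_in = 1 if negate else 0
        t = (~term & MASK) if negate else term
        return (acc + t + carry_in) & MASK

    print(add_term(100, 30, True))   # 70
    print(add_term(100, 30, False))  # 130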

Another interesting question is how sign extension is implemented for a negative term. The problem is that you're adding 64-bit shifted terms to create a 128-bit result. When negating a 64-bit term, all the unused bits to the left need to be set to 1. But the chip doesn't do this. (Among other things, the adders would need to be much wider than 64 bits.) So it must be using a trick for the sign extension.

sifar

If you are negating a single term, you can't.

If you are adding n terms using an adder tree, the 1 feeds into the carry-in of the next level (it is simply a concatenation, since the carry term is shifted left by 1). Only the final carry-propagate adder will have the add with 1.

artemonster

TLDR: 3a = (a << 1) + a

vanderZwan

From the article:

> Multiplying a number by three is straightforward in binary: add the number to itself, shifted to the left one position. (As mentioned above, shifting to the left is the same as multiplying by two and is easy in hardware.) Unfortunately, using a simple adder is too slow.

So no, not really. In fact, the entire point of the article is arguably explaining why it is not that simple.

brcmthrowaway

Can AI come up with this circuit?

goldfishgold

Almost certainly, insofar as it’s already known and described in a number of places, including Wikipedia.