Top researchers leave Intel to build startup with 'the biggest, baddest CPU'
110 comments
June 6, 2025
zozbot234
If you have 256-1024+ core CPUs, they will probably have a fake unified memory space that's really a lot more like NUMA underneath. Not too different from how GPU compute works under the hood. And it would let you write seamless parallel code by just using Rust.
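For concreteness, a minimal sketch of that "seamless" style using only std scoped threads (the core-count fallback and per-element work are made up; NUMA placement is assumed to be handled by the OS/hardware):

```rust
use std::thread;

fn main() {
    // Pretend each chunk lands in a different core's local memory;
    // the "unified" address space hides the NUMA placement from us.
    let mut data: Vec<u64> = (0..1_000_000).collect();
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let chunk = (data.len() + cores - 1) / cores;

    thread::scope(|s| {
        for part in data.chunks_mut(chunk) {
            s.spawn(move || {
                for x in part.iter_mut() {
                    *x = x.wrapping_mul(2654435761); // some per-element work
                }
            });
        }
    }); // the borrow checker guarantees the chunks don't alias

    println!("done: {}", data[42]);
}
```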
anthk
Forth can.
jiggawatts
Check out the Azure HBv5 servers.
High bandwidth memory on-package with 352 AMD Zen 4 cores!
With 7 TB/s memory bandwidth, it’s basically an x86 GPU.
This is the future of high performance computing. It used to be available only for supercomputers but it’s trickling down to cloud VMs you can rent for reasonable money. Eventually it’ll be standard for workstations under your desk.
johnklos
One of the biggest problems with CPUs is legacy. Tie yourself to any legacy, and now you're spending millions of transistors to make sure some way that made sense ages ago still works.
Just as a thought experiment, consider the fact that the i80486 has 1.2 million transistors. An eight core Ryzen 9700X has around 12 billion. The difference in clock speed is roughly 80 times, and the difference in number of transistors is 1,250 times.
These are wild generalizations, but let's ask ourselves: If a Ryzen takes 1,250 times the transistors for one core, does one core run 1,250 times faster (even taking hyperthreading into account) than an i80486 at the same clock? 500 times? 100 times?
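Spelling out the arithmetic behind those ratios (clock figures assumed: ~66 MHz for a 486DX2 vs ~5.5 GHz boost for a 9700X):

```rust
fn main() {
    let i486_transistors = 1.2e6_f64;
    let ryzen_transistors = 12.0e9_f64;
    let per_core = ryzen_transistors / 8.0; // naive split; ignores shared cache, I/O die, iGPU
    println!("transistors, one Ryzen core vs i80486: ~{:.0}x", per_core / i486_transistors); // ~1250x
    println!("clock, 5.5 GHz vs 66 MHz: ~{:.0}x", 5.5e9 / 66.0e6); // ~83x
}
```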
It doesn't, because massive amounts of those transistors go to keeping things in sync, dealing with changes in execution, folding instructions, decoding a horrible instruction set, et cetera.
So what might we be able to do if we didn't need to worry about figuring out how long our instructions are? Didn't need to deal with Spectre and Meltdown issues? If we made out-of-order execution work in ways where much more could be in flight, and the compilers / assemblers would know how to avoid stalls based on dependencies, or how to schedule around them? What if we took expensive operations, like semaphores / locks, and built solutions into the chip?
Would we get to 1,250 times faster for 1,250 times the number of transistors? No. Would we get a lot more performance than we get out of a contemporary x86 CPU? Absolutely.
AnthonyMouse
Modern CPUs don't actually execute the legacy instructions, they execute core-native instructions and have a piece of silicon dedicated to translating the legacy instructions into them. That piece of silicon isn't that big. Modern CPUs use more transistors because transistors are a lot cheaper now, e.g. the i486 had 8KiB of cache, the Ryzen 9700X has >40MiB. The extra transistors don't make it linearly faster but they make it faster enough to be worth it when transistors are cheap.
Modern CPUs also have a lot of things integrated into the "CPU" that used to be separate chips. The i486 didn't have on-die memory controllers or PCI controllers etc., and those things were themselves less complicated then (e.g. a single memory channel and a shared peripheral bus for all devices). The i486SX didn't even have a floating point unit. The Ryzen 9000 series die contains an entire GPU.
Sohcahtoa82
> If a Ryzen takes 1,250 times the transistors for one core, does one core run 1,250 times faster (even taking hyperthreading into account) than an i80486 at the same clock? 500 times? 100 times?
Would be interesting to see a benchmark on this.
If we restricted it to 486 instructions only, I'd expect the Ryzen to be 10-15x faster. The modern CPU will perform out-of-order execution, even running some instructions in parallel within a single core and a single thread, not to mention superior branch prediction and far more cache.
If you allowed modern instructions like AVX-512, then the speedup could easily be 30x or more.
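One hedged way to approximate that comparison: build the same reduction kernel once for a 486-class 32-bit target and once with the native feature set, then time both (the flags and sizes below are illustrative, and this ignores normalizing for clock speed):

```rust
use std::time::Instant;

// Sketch: build once for a 486-class target and once natively, then compare, e.g.:
//   rustc -O --target i586-unknown-linux-gnu -C target-cpu=i486 bench.rs
//   rustc -O -C target-cpu=native bench.rs
// (target names/flags assumed to be available in your toolchain)
fn main() {
    let data: Vec<u32> = (0..16_000_000u32).map(|i| i % 1000).collect();
    let start = Instant::now();
    // Integer reduction: scalar code on a 486-class build, autovectorized
    // SSE/AVX/AVX-512 code (hardware permitting) on the native build.
    let mut acc: u64 = 0;
    for &x in &data {
        acc += (x as u64).wrapping_mul(3);
    }
    println!("sum = {acc}, elapsed = {:?}", start.elapsed());
}
```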
> Would we get to 1,250 times faster for 1,250 times the number of transistors? No. Would we get a lot more performance than we get out of a contemporary x86 CPU? Absolutely.
I doubt you'd get significantly more performance, though you'd likely gain power efficiency.
Half of what you described in your hypothetical instruction set is already implemented in ARM.
layla5alive
In terms of FLOPS, Ryzen is ~1,000,000 times faster than a 486.
For serial branchy code, it isn't a million times faster, but that has almost nothing to do with legacy and everything to do with the nature of serial code: you can't linearly improve serial execution with architecture and transistor counts (only sublinearly); the big serial gains came from Dennard scaling.
It is worth noting, though, that purely via Dennard scaling, Ryzen is already >100x faster, and via architecture (those transistors) it is several multiples beyond that.
In general compute, even clocked down to 33 or 66 MHz, a Ryzen would be much faster than a 486, due to using those transistors for ILP (instruction-level parallelism) and TLP (thread-level parallelism). But you won't see any TLP in a single serial program that a 486 would have been running, and you won't get any of the SIMD benefits either, so you won't get anywhere near that in practice on 486 code.
The key to contemporary high performance computing is having more independent work to do, and organizing the data/work to expose the independence to the software/hardware.
Szpadel
That's exactly why Intel proposed x86S.
It's basically x86 without the legacy 16-bit modes and without 32-bit ring 0: real mode is gone, though 32-bit user-mode software can still run.
The CPU comes out of reset already in 64-bit mode, without all that legacy crap.
That's IMO a great idea. I think every few decades we need to stop, think again about what works best, and take a fresh start or drop some legacy unused features.
RISC-V has only a small mandatory base instruction set, and everything else is an extension that can (theoretically) be deprecated in the future.
That could also be used to retire legacy parts without disrupting the architecture.
zozbot234
> Didn't need to deal with Spectre and Meltdown issues? If we made out-of-order execution work in ways where much more could be in flight, and the compilers / assemblers would know how to avoid stalls based on dependencies, or how to schedule around them? What if we took expensive operations, like semaphores / locks, and built solutions into the chip?
I'm pretty sure that these goals will conflict with one another at some point. For example, the way one solves Spectre/Meltdown issues in a principled way is by changing the hardware and system architecture to have some notion of "privacy-sensitive" data that shouldn't be speculated on. But this will unavoidably limit the scope of OOO and the amount of instructions that can be "in-flight" at any given time.
For that matter, with modern chips, semaphores/locks are already implemented with hardware builtin operations, so you can't do that much better. Transactional memory is an interesting possibility but requires changes on the software side to work properly.
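For reference, those hardware builtins are what a few lines of portable code already compile down to; a naive spinlock sketch (nothing beyond std; no fairness or backoff beyond the spin hint, so not production-quality):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        // compare_exchange_weak maps to a LOCK CMPXCHG on x86-64, or an
        // LR/SC (or AMO) loop on RISC-V/ARM -- the hardware builtin in question.
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```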
AtlasBarfed
If you have a very large CPU count, then I think you could dedicate a CPU to only process a given designated privacy/security-focused execution thread, perhaps via a specially designed syscall.
That kind of takes the Spectre/Meltdown thing out of the way to some degree, I would think, although privilege elevation can happen in the darndest places.
But maybe I'm being too optimistic
kvemkon
Would be interesting to compare transistor count without L3 (and perhaps L2) cache.
16-core Zen 5 CPU achieves more than 2 TFLOPS FP64. So number crunching performance scaled very well.
It is weird that the best consumer GPU manages only about 4 TFLOPS FP64. Some years ago GPUs were an order of magnitude or more faster than CPUs. Today consumer GPUs' FP64 is likely artificially limited.
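Back-of-the-envelope for where a >2 TFLOPS FP64 figure can come from (the clock and FMA-pipe counts are assumptions for a desktop Zen 5 part, not measured values):

```rust
fn main() {
    let cores = 16.0;
    let ghz = 5.0;            // sustained all-core clock, assumed
    let fma_units = 2.0;      // 512-bit FMA pipes per core (full AVX-512 datapath), assumed
    let f64_lanes = 8.0;      // 512 bits / 64 bits
    let flops_per_fma = 2.0;  // multiply + add
    let peak = cores * ghz * 1e9 * fma_units * f64_lanes * flops_per_fma;
    println!("theoretical FP64 peak ≈ {:.2} TFLOPS", peak / 1e12); // ≈ 2.56
}
```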
kvemkon
E.g. AMD Radeon PRO VII with 13.23 billion transistors achieves 6.5 TFLOPS FP64 in 2020 [1].
[1] https://www.techpowerup.com/gpu-specs/radeon-pro-vii.c3575
zozbot234
> 16-core Zen 5 CPU achieves more than 2 TFLOPS FP64. So number crunching performance scaled very well.
These aren't realistic numbers in most cases because you're almost always limited by memory bandwidth, and even when memory bandwidth isn't an issue you'll have to worry about thermals. The theoretical CPU compute ceiling is almost never the real bottleneck. GPUs have a very different architecture, with much higher memory bandwidth, and they run their chips a lot slower and cooler (lower clock frequency), so they can reach much higher numbers in practical scenarios.
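A rough roofline-style sanity check of that point, with assumed numbers (dual-channel DDR5 around 90 GB/s, a streaming triad-like kernel, and the compute ceiling from the earlier estimate):

```rust
fn main() {
    let peak_flops = 2.5e12;   // FP64 compute ceiling, assumed
    let mem_bw = 90.0e9;       // bytes/s, dual-channel DDR5, assumed
    // A streaming kernel like a[i] = b[i] + s*c[i] moves 3 f64 (24 bytes)
    // per 2 flops => arithmetic intensity of 1/12 flop per byte.
    let intensity = 2.0 / 24.0;
    let attainable = f64::min(peak_flops, intensity * mem_bw);
    println!("attainable ≈ {:.1} GFLOPS vs {:.0} GFLOPS peak",
             attainable / 1e9, peak_flops / 1e9); // ~7.5 vs 2500
}
```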
kvemkon
Sure, not for BLAS Level 1 and 2 operations. But not even for Level 3?
layla5alive
Huh, consumer GPUs are doing Petaflops of floating point. FP64 isn't a useful comparison because FP64 is nerfed on consumer GPUs.
kvemkon
Even the recent Nvidia 5090 has 104.75 TFLOPS FP32.
It's a useful comparison in terms of achievable performance per transistor count.
saati
But history showed exactly the opposite: if you don't have an already existing software ecosystem, you are dead. The transistors for implementing x86 peculiarities are very much worth it if people in the market want x86.
epx
Aren't 99.99999% of these transistors used in cache?
nomel
Look up "CPU die diagram". You'll see the physical layout of the CPU with annotated blocks.
Zen 3 example: https://www.reddit.com/r/Amd/comments/jqjg8e/quick_zen3_die_...
So, more like 85%, or around 6 orders of magnitude difference from your guess. ;)
PopePompus
Gosh no. Often a majority of the transistors are used in cache, but not 99%.
kleiba
Good luck not infringing on any patents!
And that's not sarcasm, I'm serious.
neuroelectron
Intel restructures into patent troll, hiring reverse engineers and investing in chip sanding and epoxy acids.
aesbetic
This is more a bad look for Intel than anything truly exciting since they refuse to produce any details lol
Foobar8568
Bring back the architecture madness era of the 80s/90s.
esafak
Can't they make a GPU instead? Please save us!
AlotOfReading
A GPU is a very different beast that relies much more heavily on having a gigantic team of software developers supporting it. A CPU is (comparatively) straightforward. You fab and validate a world class design, make sure compiler support is good enough, upstream some drivers and kernel support, and make sure the standard documentation/debugging/optimization tools are all functional. This is incredibly difficult, but achievable because these are all standardized and well understood interface points.
With GPUs you have all these challenges while also building a massively complicated set of custom compilers and interfaces on the software side, while at the same time trying to keep broken user software written against some other company's interface not only functional, but performant.
esafak
It's not the GPU I want per se but its ability to run ML tasks. If you can do that with your CPU fine!
AlotOfReading
Echoing the other comment, this isn't easier. I was on a team that did it. The ML team was overheard by media complaining that we were preventing them from achieving their goals because we had taken 2 years to build something that didn't beat the latest hardware from Nvidia, let alone keep pace with how fast their demands had grown.
mort96
Well that's even more difficult because not only do you need drivers for the widespread graphics libraries Vulkan, OpenGL and Direct3D, but you also need to deal with the GPGPU mess. Most software won't ever support your compute-focused GPU because you won't support CUDA.
Bolwin
I mean, you most certainly can. Pretty much every ML library has CPU support.
Asraelite
> make sure compiler support is good enough
Do compilers optimize for specific RISC-V CPUs, not just profiles/extensions? Same for drivers and kernel support.
My understanding was that if it's RISC-V compliant, no extra work is needed for existing software to run on it.
Arnavion
You want to optimize for specific chips because different chips have different capabilities that are not captured by just what extensions they support.
A simple example is that the CPU might support running two specific instructions better if they were adjacent than if they were separated by other instructions ( https://en.wikichip.org/wiki/macro-operation_fusion ). So the optimizer can try to put those instructions next to each other. LLVM has target features for this, like "lui-addi-fusion" for CPUs that will fuse a `lui; addi` sequence into a single immediate load.
A more complex example is keeping track of the CPU's internal state. The optimizer models the state of the CPU's functional units (integer, address generation, etc) so that it has an idea of which units will be in use at what time. If the optimizer has to allocate multiple instructions that will use some combination of those units, it can try to lay them out in an order that will minimize stalling on busy units while leaving other units unused.
That information also tells the optimizer about the latency of each instruction, so when it has a choice between multiple ways to compute the same operation it can choose the one that works better on this CPU.
See also: https://myhsu.xyz/llvm-sched-model-1/ https://myhsu.xyz/llvm-sched-model-1.5/
If you don't do this your code will still run on your CPU. It just won't necessarily be as optimal as it could be.
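As a toy illustration of the macro-op fusion example above: a function that just returns a 32-bit constant typically lowers to a lui/addi-style pair on RISC-V, and the per-CPU tuning flag is what tells LLVM whether to keep such pairs adjacent (the CPU name below is only an example of an existing LLVM tuning target, not a recommendation):

```rust
// Build e.g. with: RUSTFLAGS="-C target-cpu=sifive-p670" cargo build --release
// (assuming a toolchain whose LLVM knows that CPU).

#[inline(never)]
pub fn magic() -> u32 {
    // On RISC-V a 32-bit constant like this is typically materialized as a
    // lui + addi(w) pair; a core advertising lui/addi fusion can treat the
    // pair as one macro-op, so the tuned scheduler keeps them back-to-back.
    0x1234_5678
}

fn main() {
    println!("{:#x}", magic());
}
```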
AlotOfReading
The major compilers optimize for microarchitecture, yes. Here's the tablegen scheduling definition behind LLVM's -mtune=sifive-p670 flag as an example: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...
It's not that things won't run, but this is necessary for compilers to generate well optimized code.
speedgoose
I hope to see dedicated GPU coprocessors disappear sooner rather than later, just like arithmetic coprocessors did.
wtallis
Arithmetic co-processors didn't disappear so much as they moved onto the main CPU die. There were performance advantages to having the FPU on the CPU, and there were no longer significant cost advantages to having the FPU be separate and optional.
For GPUs today and in the foreseeable future, there are still good reasons for them to remain discrete, in some market segments. Low-power laptops have already moved entirely to integrated GPUs, and entry-level gaming laptops are moving in that direction. Desktops have widely varying GPU needs ranging from the minimal iGPUs that all desktop CPUs now already have, up to GPUs that dwarf the CPU in die and package size and power budget. Servers have needs ranging from one to several GPUs per CPU. There's no one right answer for how much GPU to integrate with the CPU.
otabdeveloper4
By "GPU" they probably mean "matrix multiplication coprocessor for AI tasks", not actually a graphics processor.
saltcured
Aspects of this have been happening for a long time, as SIMD extensions and as multi-core packaging.
But, there is much more to discrete GPUs than vector instructions or parallel cores. It's very different memory and cache systems with very different synchronization tradeoffs. It's like an embedded computer hanging off your PCI bus, and this computer does not have the same stable architecture as your general purpose CPU running the host OS.
In some ways, the whole modern graphics stack is a sort of integration and commoditization of the supercomputers of decades ago. What used to be special vector machines and clusters full of regular CPUs and RAM has moved into massive chips.
But as other posters said, there is still a lot more abstraction in the graphics/numeric programming models and a lot more compiler and runtime tools to hide the platform. Unless one of these hidden platforms "wins" in the market, it's hard for me to imagine general purpose OS and apps being able to handle the massive differences between particular GPU systems.
It would easily be like prior decades where multicore wasn't taking off because most apps couldn't really use it. Or where special things like the Cell processor in the PlayStation 3 required very dedicated development to use effectively. The heterogeneity of system architectures makes it hard for general purpose reuse and hard to "port" software that wasn't written with the platform in mind.
rjsw
That was one of the ideas behind Larrabee. You can run Mesa on the CPU today using the llvmpipe backend.
ngneer
I wonder if there is any relation to the cancelled Royal and Beast Lake projects.
https://www.notebookcheck.net/Intel-CEO-abruptly-trashed-Roy...
phendrenad2
Previous discussion 9 months ago: https://news.ycombinator.com/item?id=41353155
guywithahat
As someone who knows almost nothing about CPU architecture, I've always wondered if there could be a new instruction set, better suited to today's needs. I realize it would require a monumental software effort, but most of these instruction sets are decades old. RISC-V is newer but my understanding is it's still based around ARM, just without royalties (and thus isn't bringing many new ideas to the table, per se).
jcranmer
> RISC-V is newer but my understanding is it's still based around ARM, just without royalties (and thus isn't bringing many new ideas to the table, per se)
RISC-V is the fifth version of a series of academic chip designs at Berkeley (hence its name).
In terms of design philosophy, it's probably closest to MIPS of the major architectures; I'll point out that some of its early whitepapers are explicitly calling out ARM and x86 as the kind of architectural weirdos to avoid emulating.
dehrmann
> I'll point out that some of its early whitepapers are explicitly calling out ARM and x86 as the kind of architectural weirdos to avoid emulating.
Says every new system without legacy concerns.
guywithahat
Theoretically wouldn't MIPS be worse, since it was designed to help students understand CPU architectures (and not to be performant)?
Also, I don't mean to come off as confrontational; I genuinely don't know.
jcranmer
The reason why I say RISC-V is probably most influenced by MIPS is because RISC-V places a rather heavy emphasis on being a "pure" RISC design. (Also, RISC-V was designed by a university team, not industry!) Some of the core criticisms of the RISC-V ISA is on it carrying on some of these trends even when experience has suggested that doing otherwise would be better (e.g., RISC-V uses load-linked/store-conditional instead of compare-and-swap).
Given that the core motivation of RISC was to be a maximally performant design for architectures, the authors of RISC-V would disagree with you that their approach is compromising performance.
BitwiseFool
MIPS was a genuine attempt at creating a new commercially viable architecture. Some of the design goals of MIPS made it conducive towards teaching, namely its relative simplicity and lack of legacy cruft. It was never intended to be an academic only ISA. Although I'm certain the owners hoped that learning MIPS in college would lead to wider industry adoption. That did not happen.
Interestingly, I recently completed a masters-level computer architecture course and we used MIPS. However, starting next semester the class will use RISC-V instead.
zozbot234
MIPS has a few weird features such as delay slots, that RISC-V sensibly dispenses with. There's been also quite a bit of convergent evolution in the meantime, such that AArch64 is significantly closer to MIPS and RISC-V compared to ARM32. Though it's still using condition codes where MIPS and RISC-V just have conditional branch instructions.
anthk
MIPS was used in the PSX and the N64, alongside the SGI workstations of its day, and in the PSP too. Pretty powerful per cycle.
dehrmann
I'm far from an expert here, but these days, it's better to view the instruction set as a frontend to the actual implementation. You can see this with Intel's E/P cores; the instructions are the same, but the chips are optimized differently.
There actually have been changes for "today's needs," and they're usually things like AES acceleration. ARM tried to run Java bytecode natively with Jazelle, but it's still best to think of the ISA as a frontend; the fact that Android is mostly Java on ARM, yet this feature got dropped, says a lot.
The fact that there haven't been that many changes shows they got the fundamental operations and architecture styles right. What's lacking today is where GPUs step in: massively wide SIMD.
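To the point about ISA additions for "today's needs": AES acceleration is exposed directly through std intrinsics. A minimal x86-64 sketch with dummy inputs (not how you'd write a real cipher, which schedules real round keys):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_aesenc_si128, _mm_set1_epi8};

// One full AES encryption round in a single instruction, if the CPU has AES-NI.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "aes")]
unsafe fn one_aes_round(state: __m128i, round_key: __m128i) -> __m128i {
    _mm_aesenc_si128(state, round_key)
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("aes") {
            let _out = unsafe { one_aes_round(_mm_set1_epi8(0x3a), _mm_set1_epi8(0x5c)) };
            println!("ran one aesenc round");
        } else {
            println!("no AES-NI on this CPU");
        }
    }
}
```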
ksec
>I've always wondered if there could be a new instruction set, better suited to today's needs.
AArch64 is pretty much a completely new ISA built from the ground up.
ItCouldBeWorse
I think the ideal would be something like a Xilinx offering: tailoring the CPU's cache, parallelism, and in-hardware execution of hot-loop components to the task at hand.
Your CPU changes with every app, tab and program you open, shifting from one core to n cores plus AI/GPU and back. This idea that you have to set it all in stone always seemed wild to me.
dehrmann
I'm fuzzy on how FPGAs actually work, but they're heavier weight than you think, so I don't think you'd necessarily get the wins you're imagining.
FuriouslyAdrift
You should definitely look into AMD's Instinct, Zynq, and Versal lines, then.
inkyoto
There are not really any newer instruction sets, as we are locked into the von Neumann architecture and, until we move away from it, we will continue to move data between memory and CPU registers, or registers ↭ registers etc., which means we will continue to add, shift, and test the condition results of arithmetic operations – the same instructions across pretty much any CPU architecture relevant today.
So we have:
CISC – which is still used outside the x86 bubble;
RISC – which is widely used;
Hybrid RISC/CISC designs – x86 aside, that would be the IBM z/Architecture (i.e. mainframes);
EPIC/VLIW – which has been largely unsuccessful outside DSPs and a few niches.
They all deal with registers, data movement and condition testing, though, and one can't say that an ISA 123 that effectively does the same thing as an ISA 456 is older or newer. SIMD instructions have been the latest addition, and they also follow the same well-known mental and compute models.
Radically different designs, such as the Intel iAPX 432, Smalltalk machines and Java CPUs, have not received any meaningful acceptance, and it seems that the idea of a CPU architecture tied to a higher-level compute model has been eschewed in perpetuity. Java CPUs were the last massively hyped attempt to change that, and that was 30 years ago.
What other viable alternatives outside the von Neumann architecture are available to us? I am not sure.
dmitrygr
> As someone who knows almost nothing about CPU architecture, I've always wondered if there could be a new instruction set, better suited to today's needs.
It exists, and was specifically designed to go wide since clock speeds have limits, but ILP can be scaled almost infinitely if you are willing to put enough transistors into it: AArch64.
zackmorris
I hope they design, build and sell a true 256-1024+ multicore CPU with local memories that appears as an ordinary desktop computer with a unified memory space for under $1000.
I've written about it at length and I'm sure that anyone who's seen my comments is sick of me sounding like a broken record. But there's truly a vast realm of uncharted territory there. I believe that transputers and reprogrammable logic chips like FPGAs failed because we didn't have languages like Erlang/Go and GNU Octave/MATLAB to orchestrate a large number of processes or handle SIMD/MIMD simultaneously. Modern techniques like passing by value via copy-on-write (used by UNIX forking, PHP arrays and Clojure state) were suppressed when mainstream imperative languages using pointers and references captured the market. And it's really hard to beat Amdahl's law when we're worried about side effects. I think that anxiety is what inspired Rust, but there are so many easier ways of avoiding those problems in the first place.