Zen 5's AVX-512 Frequency Behavior
69 comments · March 1, 2025
Remnant44
yaantc
On the L/S unit impact: data movement is expensive, computation is cheap (relatively).
In "Computer Architecture, A Quantitative Approach" there are numbers for the now old TSMC 45nm process: A 32 bits FP multiplication takes 3.7 pJ, and a 32 bits SRAM read from an 8 kB SRAM takes 5 pJ. This is a basic SRAM, not a cache with its tag comparison and LRU logic (more expansive).
Then I have some 2015 numbers for Intel 22nm process, old too. A 64 bits FP multiplication takes 6.4 pJ, a 64 bits read/write from a small 8 kB SRAM 4.2 pJ, and from a larger 256 kB SRAM 16.7 pJ. Basic SRAM here too, not a more expansive cache.
The cost of a multiplication is quadratic, and it should be more linear for access, so the computation cost in the second example is much heavier (compare the mantissa sizes, that's what is multiplied).
The trend gets even worse with more advanced processes. Data movement is usually what matters the most now, expect for workloads with very high arithmetic intensity where computation will dominate (in practice: large enough matrix multiplications).
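To put rough numbers on that, here is a quick sketch using the 22 nm figures quoted above; treating an FMA-style operation as needing three operand reads is my own simplification, not something from the book.

```c
#include <stdio.h>

/* Energy ratio of data movement vs. arithmetic, using the quoted 22 nm
 * figures: 6.4 pJ per 64-bit FP multiply, 4.2 pJ / 16.7 pJ per 64-bit
 * access to an 8 kB / 256 kB SRAM. */
int main(void)
{
    const double fp64_mul_pj = 6.4;
    const double sram8k_pj   = 4.2;
    const double sram256k_pj = 16.7;

    printf("3 operand reads, 8 kB SRAM   vs 1 multiply: %.1fx\n",
           3 * sram8k_pj / fp64_mul_pj);   /* ~2.0x */
    printf("3 operand reads, 256 kB SRAM vs 1 multiply: %.1fx\n",
           3 * sram256k_pj / fp64_mul_pj); /* ~7.8x */
    return 0;
}
```

Even against a tiny SRAM, moving the operands already costs about twice as much as the multiply itself.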
Remnant44
Appreciate the detail! That explains a lot of what is going on. It also dovetails with some interesting facts I remember reading about the relative power consumption of the Zen cores versus the Infinity Fabric connecting them: the percentage of package power used simply to run the fabric interconnect was shocking.
Earw0rm
Right, but a SIMD single-precision mul is linear (or even sub-linear) in cost relative to its scalar cousin. So a 16x32-bit, 512-bit MUL won't be even 16x the cost of a scalar mul; the decoder, for example, only has to do the same amount of work.
kimixa
The calculations within each unit may be, true, but routing and data transfer are probably the biggest limiting factors on a modern chip. It should be clear that placing 16x units of non-trivial size means the average unit will likely be further from the data source than a single unit would be, and transmitting data over distance can have greater-than-linearly increasing costs (not just resistance/capacitance losses; to hit timing targets you need faster switching, which means higher voltages, etc.)
eigenform
AFAIK you have to think about how many different 512b paths are being driven when this happens. Each cycle in the steady-state case (where you can do two vfmadd132ps per cycle) is simultaneously:
- Capturing 2x512b from the L1D cache
- Sending 2x512b to the vector register file
- Capturing 4x512b values from the vector register file
- Actually multiplying 4x512b values
- Sending 2x512b results to the vector register file
.. and probably more?? That's already like 14*512 wires [switching constantly at 5 GHz!!], and there are probably even more intermediate stages?
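For concreteness, here is a minimal sketch (mine, not from the article) of the kind of loop that would sustain that steady state: per iteration, two 512-bit loads feed two independent FMAs whose other operands stay in the vector register file. The kernel and compiler flags are illustrative only.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical kernel: two 512-bit loads + two 512-bit FMAs per iteration.
 * Build with something like: gcc -O3 -mavx512f */
float scaled_sum(const float *x, float scale, size_t n)
{
    const __m512 s = _mm512_set1_ps(scale);
    __m512 acc0 = _mm512_setzero_ps();
    __m512 acc1 = _mm512_setzero_ps();
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m512 x0 = _mm512_loadu_ps(x + i);       /* 512b from L1D */
        __m512 x1 = _mm512_loadu_ps(x + i + 16);  /* 512b from L1D */
        acc0 = _mm512_fmadd_ps(x0, s, acc0);      /* FMA pipe 0    */
        acc1 = _mm512_fmadd_ps(x1, s, acc1);      /* FMA pipe 1    */
    }
    float sum = _mm512_reduce_add_ps(_mm512_add_ps(acc0, acc1));
    for (; i < n; i++)
        sum += x[i] * scale;                      /* scalar tail   */
    return sum;
}
```

Whether both FMA pipes and both load ports actually fire every cycle depends on the scheduler, but in the best case every one of those 512-bit paths toggles each cycle.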
jiggawatts
… per core. There are eight per compute tile!
I like to ask IT people a trick question: how many numbers can a modern CPU multiply in the time it takes light to cross a room?
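The back-of-the-envelope version, with all numbers assumed for illustration (a 5 m room, a 16-core Zen 5 at 5 GHz, 2 FMA pipes of 16 FP32 lanes per core):

```c
#include <stdio.h>

/* How many multiplications fit in the time light crosses a room?
 * Room size, clock, core count and lane count are assumptions. */
int main(void)
{
    const double c_m_per_s    = 299792458.0;
    const double room_m       = 5.0;
    const double clock_hz     = 5.0e9;
    const double muls_per_cyc = 16 * 2 * 16;   /* cores x FMA pipes x FP32 lanes */

    double seconds = room_m / c_m_per_s;       /* ~16.7 ns */
    double cycles  = seconds * clock_hz;       /* ~83 cycles */
    printf("light needs %.1f ns, ~%.0f cycles\n", seconds * 1e9, cycles);
    printf("multiplications in that time: ~%.0f\n", cycles * muls_per_cyc);
    return 0;
}
```

Roughly forty thousand, under those assumptions.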
bgnn
Piggybacking on this: memory scaling has been slower than compute scaling, at least since the 45 nm node in the example. At 4 nm the difference is larger.
formerly_proven
Random logic has also had much better area scaling than SRAM since EUV, which implies that gap continues to widen at a faster rate.
bayindirh
> but as opposed to the old intel avx512 cores that got endless (deserved?) bad press for their transition behavior, this is more or less seamless.
The problem with Intel was that the AVX frequencies were secret. They were never disclosed for later cores where the power envelope got tight, and using AVX-512 killed performance across the chip. This meant that if one core was using AVX-512, the other cores in the same socket throttled down due to the thermal load and the power cap, so every process on the same socket suffered. That is a big no-no for cloud or HPC workloads where nodes are shared by many users.
Secrecy and downplaying of this effect made Intel's AVX-512 frequency behavior infamous.
Oh, doing your own benchmarks on your own hardware which you paid for and releasing the results to the public was verboten, btw.
LtdJorge
> Oh, doing your own benchmarks on your own hardware which you paid for and releasing the results to the public was verboten, btw.
Well, Cloudflare did anyway.
deaddodo
To be clear, the problem with the Skylake implementation was that triggering AVX-512 would downclock the entire CPU. It didn't do anything smart; it was fairly binary.
This AMD implementation instead seems to be better optimized, plugging into the CPU's normal thermal management for better scaling.
eqvinox
Reading the section under "Load Another FP Pipe?" I'm coming away with the impression that it's not the LSU but rather total overall load that causes trouble. While that section is focused on transition time, the end steady state is also slower…
tanelpoder
I haven’t read the article yet, but back when I tried to get to over 100 GB/s IO rate from a bunch of SSDs on Zen4 (just fio direct IO workload without doing anything with the data), I ended up disabling Core Boost states (or maybe something else in BIOS too), to give more thermal allowance for the IO hub on the chip. As RAM load/store traffic goes through the IO hub too, maybe that’s it?
eqvinox
I don't think these things are related, this is talking about the LSU right inside the core. I'd also expect oscillations if there were a thermal problem like you're describing, i.e. core clocks up when IO hub delivers data, IO hub stalls, causes core to stall as well, IO hub can run again delivering data, repeat from beginning.
(Then again, boost clocks are an intentional oscillation anyway…)
rayiner
It seems even more interesting than the power envelope. It looks like the core is limited by the ability of the power supply to ramp up. So the dispatch rate drops momentarily and then goes back up to allow power delivery to catch up.
kristianp
I find it irritating that they are comparing clock scaling to the venerable Skylake-X. Surely Sapphire Rapids has been out for almost 2 years by now.
eqvinox
Seemed appropriate to me as comparing the "first core to use full-width AVX-512 datapaths"; my interpretation is that AMD threw more R&D into this than Intel before shipping it to customers…
(It's also not really a comparative article at all? Skylake-X is mostly just introduction…)
kristianp
> my interpretation is that AMD threw more R&D into this than Intel before shipping it to customers
AMD had the benefit of learning from Intel's mistakes in their first generation of AVX-512 chips. It seemed unfair to compare an Intel chip that's so old (albeit long-lasting due to Intel's scaling problems). Skylake-X chips were released in 2017! [1]
[1] https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Hi...
eqvinox
sure, but AMD's decision to start with a narrower datapath happened without insight from Intel's mistakes and could very well have backfired (if Intel had managed to produce a better-working implementation faster, that could've cost AMD a lot of market share). Intel had the benefit of designing the instructions along with the implementation as well, and also the choice of starting on a 2x256 datapath...
And again, yeah it's not great were it a comparison, but it really just doesn't read as a comparison to me at all. It's a reference.
fuhsnn
I think it's mostly the lack of comparable research other than the Skylake-X one by Travis Downs. I too would like to see how Zen 4 behaves in the situation with its double-pumping.
adrian_b
True, but who would bother to pay a lot of money for a CPU that is known to be inferior to the alternatives, only to be able to test the details of its performance?
crest
A well funded investigative tech journalist?
adrian_b
With prices in the range of $2,000-$20,000 for the CPU, plus a couple grand for at least a motherboard, cooler, memory and PSU, a journalist would have to be very well funded to spend that much just to publish one article analyzing the CPU.
I would like to read such an article or to be able to test myself such a CPU, but my curiosity is not so great as to make me spend such money.
For now, the best one can do is to examine the results of general-purpose benchmarks published on the review sites to which Intel, AMD or Ampere send review systems with Xeon, Epyc or Arm-based server CPUs.
These sites are useful, but a more thorough micro-architectural investigation would have been nice.
kristianp
AWS, dedicated EC2 instance. A few dollars an hour.
Earw0rm
Depending on which CPU category. I think Intel HEDT stops at Cascade Lake, which is essentially Skylake-X Refresh from 2019?
Whereas AMD has full-fat AVX512 even in gaming laptop CPUs.
formerly_proven
Cascade Lake improved the situation a bit, but then you had Ice Lake where iirc the hard cutoffs were gone and you were just looking at regular power and thermal steering. IIRC, that was the first generation where we enabled AVX512 for all workloads.
menaerus
I don't understand why 2x FMAs pose such a challenge in CPU design when GPUs literally have hundreds of such ALUs. Both operate at a similar TDP, so where's the catch? Much lower GPU clock frequency?
dzaima
Zen 5 still clocks way higher than GPUs even with the penalties. Additionally, CPUs typically target much lower latency for operations, even measured per clock, which adds a ton of silicon cost for the same throughput, especially at high clock frequencies.
The difficulty with transitions that Skylake-X particularly suffered from just has no equivalent on GPUs; if you always stay in the transitioned-to-AVX-512 state on Skylake-X, things are largely normal. GPUs are simply always, unconditionally, in such a state, but that would be awful on CPUs, as it'd make scalar-only code (not a thing on GPUs, but the main target for CPUs) unnecessarily slow. And so Intel decided the transitions were worth the improved clocks for code not utilizing AVX-512.
atq2119
It's not the execution of FMAs that's the challenge, it's the ramp up / down.
And I assure you GPUs do have challenges with that as well. That's just less well known because (1) in GPUs, all workloads are vector workloads, and so there was never a stark contrast between scalar and vector regimes like in Intel's AVX-512 implementation and (2) GPU performance characteristics are in general less broadly known.
eqvinox
It's not 2 FMAs, it's AVX-512 (and going with 32-bit words) ⇒ 2*512/32 = 32 FMAs per core, 256 on an 8-core CPU. The unit counts for GPUs - depending on which number you look at - count these separately.
CPUs also have much more complicated program flow control, versatility, and AFAIK latency (⇒ flow control cost) of individual instructions. GPUs are optimized for raw calculation throughput meanwhile.
Also note that modern GPUs and CPUs don't have a clear pricing relationship anymore, e.g. a desktop CPU is much cheaper than a high-end GPU, and large server CPUs are more expensive than either.
menaerus
1x 512-bit FMA or 2x 256-bit FMAs or 4x 128-bit FMAs is irrelevant here - it's still a single physical unit in a CPU that consumes 512 bits of data bandwidth. The question is why the CPU budget allows for 2x 512-bit or 4x 256-bit while the H100, for example, has 14592 FP32 CUDA cores - in AVX terminology that would translate, if I am not mistaken, to 7296x 512-bit or 14592x 256-bit FMAs per clock cycle. Even considering the obvious differences between GPUs and CPUs, this is still a large difference. Since GPU cores operate at much lower frequencies than CPU cores, that is what made me believe that's where the biggest difference comes from.
eqvinox
AIUI an FP32 core is only 32 bits wide, but this is outside my area of expertise really. Also note that CPUs also have additional ALUs that can't do FMAs, FMA is just the most capable one.
You're also repeating 2×512 / 4×256 — that's per core, you need to multiply by CPU core count.
[also, note e.g. an 8-core CPU is much cheaper than an H100 card ;) — if anything you'd be comparing the highest end server CPUs here. A 192-core Zen 5c is 8.2~10.5k€ open retail, an H100 is 32~35k€…]
[reading through some random docs, a CPU core seems vaguely comparable to an SM; an SM might have 128 or 64 lanes (=FP32 cores) while a CPU core only has 16 lanes per 512-bit vector, but there is indeed also a notable clock difference and far more flexibility otherwise in the CPU core (which consumes silicon area)]
TinkersW
Nvidia calls them cores to deliberately confuse people, and make it appear vastly more powerful than it really is. What they are in reality is SIMD lanes.
So the H100 (which costs vastly more than a Zen 5...) has 14592 32-bit SIMD lanes, not cores.
A Zen 5 core has 16x4 (64) 32-bit SIMD lanes, so scale that by core count to get your answer. A higher-end desktop Zen 5 will have 16 cores, so 64x16 = 1024. The Zen 5 also clocks much higher than the GPU, so you can also scale it up by perhaps 1.5-2x.
While this is obviously less than the H100, the Zen 5 chip costs $550 and the H100 costs $40k.
There is more to it than this: GPUs also have transcendental functions, texture sampling, and 16-bit ops (which are lacking in CPUs), while CPUs are much more flexible and have powerful byte and integer manipulation instructions, along with full-speed 64-bit integer/double support.
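Putting those lane counts side by side in a quick sketch; the clocks and prices here are assumptions for illustration, not measured values.

```c
#include <stdio.h>

/* Peak FP32 "lane-GHz" as a crude throughput proxy, per chip and per dollar.
 * Lane counts follow the comment above; clocks and prices are assumed. */
int main(void)
{
    const double zen5_lanes = 16 * 64, zen5_ghz = 5.0, zen5_usd = 550.0;
    const double h100_lanes = 14592.0, h100_ghz = 1.8, h100_usd = 40000.0;

    const double zen5 = zen5_lanes * zen5_ghz;   /* ~5120 lane-GHz  */
    const double h100 = h100_lanes * h100_ghz;   /* ~26266 lane-GHz */

    printf("H100 / Zen 5, peak:       ~%.1fx\n", h100 / zen5);
    printf("Zen 5 / H100, per dollar: ~%.1fx\n",
           (zen5 / zen5_usd) / (h100 / h100_usd));
    return 0;
}
```

Under those assumptions the H100 is only ~5x ahead in raw FP32 lane throughput while costing ~70x more, which is the per-dollar point being made above.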
adrian_b
Like another poster already said, the power budget of a consumer CPU like the 9950X, running at roughly double the clock frequency of a GPU, allows for 16 cores x 2 execution units x 16 lanes = 512 FP32 FMAs per clock cycle. At twice the clock, that gives the same throughput as a 1024-FMA-per-clock iGPU in the best laptop CPUs, while consuming about a third of the power of a datacenter GPU, so in power budget and performance it is like a datacenter GPU with 3072 FP32 FMAs per clock cycle.
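The same chain of ratios as a sketch; the 2x clock ratio and the one-third-power figure are the assumptions stated above, not measurements.

```c
#include <stdio.h>

/* 512 FMA/clk at CPU clock -> 1024 FMA/clk at GPU clock -> 3072 FMA/clk
 * in a datacenter-GPU power budget, using the assumed ratios above. */
int main(void)
{
    const double fma_per_clk = 16 * 2 * 16;  /* cores x FMA units x FP32 lanes */
    const double clk_ratio   = 2.0;          /* CPU clock / typical GPU clock   */
    const double power_ratio = 3.0;          /* datacenter GPU power / CPU power */

    const double at_gpu_clk    = fma_per_clk * clk_ratio;
    const double per_gpu_power = at_gpu_clk * power_ratio;
    printf("%.0f -> %.0f -> %.0f FP32 FMA per clock\n",
           fma_per_clk, at_gpu_clk, per_gpu_power);
    return 0;
}
```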
However, because of its high clock frequency, a consumer CPU has high performance per dollar but low performance per watt.
Server CPUs with many cores have much better energy efficiency, e.g. around 3 times higher than a desktop CPU and on par with the most efficient laptop CPUs. For many generations of NVIDIA GPUs and Intel Xeon CPUs, until about 5-6 years ago, the ratio between their floating-point FMA throughput per watt was only about 3.
This factor of 3 is mainly due to the overhead of various tricks used by CPUs to extract instruction-level parallelism from programs that do not use enough concurrent threads or array operations, e.g. superscalar out-of-order execution, register renaming, etc.
In recent years, starting with NVIDIA Volta, followed later by AMD and Intel GPUs, the GPUs have made a jump in performance that has increased the gap between their throughput and that of CPUs, by supplementing the vector instructions with matrix instructions, i.e. what NVIDIA calls tensor instructions.
However, this current greater gap in performance between CPUs and GPUs could easily be removed, and the performance-per-watt ratio brought back to a factor unlikely to be greater than 3, by adding matrix instructions to the CPUs.
Intel has introduced the AMX instruction set besides AVX, but for now it is supported only in expensive server CPUs, and Intel has defined only instructions for the low-precision operations used in AI/ML. If AMX were extended with FP32 and FP64 operations, the performance would be much more competitive with GPUs.
ARM is more advanced in this direction, with SME (Scalable Matrix Extension) defined alongside SVE (Scalable Vector Extension). SME is already available in recent Apple CPUs, and it is expected to also be available in the new Arm cores that will be announced a few months from now, which should appear in the smartphones of 2026 and presumably also in future Arm-based CPUs for servers and laptops.
The current Apple CPUs do not have strong SME accelerators, because they also have an iGPU that can handle the operations for which latency is less important.
On the other hand, an Arm-based server CPU could have a much bigger SME accelerator, providing a performance much closer to a GPU.
adgjlsfhk1
It's the GPU frequency: 5.5 GHz costs roughly 4x the heat and power of the ~2.5 GHz that GPUs run at.
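A rough way to see where a factor like that comes from: dynamic power scales roughly with f·V², and the voltage needed rises with frequency. The voltages below are pure assumptions for illustration.

```c
#include <stdio.h>

/* Dynamic power ~ f * V^2; compare an assumed 5.5 GHz @ 1.30 V CPU point
 * against an assumed 2.5 GHz @ 0.90 V GPU point. */
int main(void)
{
    const double f_cpu = 5.5, v_cpu = 1.30;  /* GHz, volts (assumed) */
    const double f_gpu = 2.5, v_gpu = 0.90;

    const double ratio = (f_cpu * v_cpu * v_cpu) / (f_gpu * v_gpu * v_gpu);
    printf("per-lane dynamic power ratio: ~%.1fx\n", ratio);  /* ~4.6x */
    return 0;
}
```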
Szpadel
I'm curious how changing some OC parameters would affect these results. If it's caused by voltage drop, how does load-line calibration affect it? If it's a power constraint, how would PBO affect it?
sylware
They should put forward the fact that 512 bits is the "sweet spot", as it is exactly one data cache line (64 bytes)!
mgaunard
In practice everyone turns off AVX512 because they're afraid of the frequency throttling.
The damage was done by Skylake-X and won't be healed for years.
ksec
I wonder if this will be improved or fixed in Zen 6. Although personally I'd much rather they focus on IPC.
Remnant44
Nothing to fix here. The behavior in the transition regimes is already quite good.
The overall throttling is dynamic and reactive based on heat and power draw - this is unavoidable and in fact desirable (the alternative is to simply run slower all the time, not to somehow be immune to physics and run faster all the time)
It's interesting that Zen 5's FPUs running in full 512-bit wide mode don't actually seem to cause any trouble, but that lighting up the load/store units does. I don't know enough about hardware-level design to know whether this would be "expected".
The full investigation in this article is really interesting, but the TL;DR is: light up enough of the core, and frequencies will have to drop to maintain the power envelope. The transition period is handled very smartly, but it still exists - but as opposed to the old intel avx512 cores that got endless (deserved?) bad press for their transition behavior, this is more or less seamless.