Optimizing Matrix Multiplication on RDNA3
29 comments
March 25, 2025
imtringued
You're making the assumption that every kernel developer has enough AMD GPUs from different eras that they can test their ifdefs on all the possible ISAs.
tgtweak
CUDA has similar inefficiencies, and many use cases can see comparable uplifts by going lower-level in the code.
I think this is what DeepSeek did to get their speedups on older hardware.
Even way back in the days of GPU crypto mining, hand-built custom kernels (mostly just unrolling loops) would yield 20% improvements over just running OpenCL and letting the drivers compile it down.
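Roughly the kind of thing I mean (a toy sketch, not the actual mining code): split the accumulator so the compiler emits several independent multiply-add chains instead of one long serial dependency.

    // Illustrative only: manual unroll-by-4 of a dot product.
    // Assumes n is a multiple of 4.
    float dot_unrolled(const float* a, const float* b, int n) {
        float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;
        for (int k = 0; k < n; k += 4) {
            acc0 += a[k + 0] * b[k + 0];
            acc1 += a[k + 1] * b[k + 1];
            acc2 += a[k + 2] * b[k + 2];
            acc3 += a[k + 3] * b[k + 3];
        }
        return (acc0 + acc1) + (acc2 + acc3);
    }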
touisteur
People have been trying to bypass CUDA and even PTX for a long time. One long rundown of optimizing gemm on NVIDIA hardware (https://salykova.github.io/sgemm-gpu) mentions 'maxas' (https://github.com/NervanaSystems/maxas/wiki/Introduction) - which was really a step forward in this space. I still blame Intel (buying NervanaSystems) for killing it...
almostgotcaught
> People have been trying to bypass CUDA and even PTX for a long time
I swear it's so funny when people talk about this stuff like it's all weird/surprising. Y'all realize that there are hundreds (thousands?) of engineers across FAANG whose full-time job is optimizing CUDA/ROCm/whatever code for their team/org/company's specific workloads? Like, do y'all think that serious shops really just go with whatever the vendor gives you? I.e., none of this is in the least surprising - it's completely expected that whatever the vendor designs generically for the entire market segment will fail to achieve peak perf for your use case.
touisteur
Not saying it's surprising. My day job is doing exactly this, not in any FAANG.
Working on a platform that hides so many low-level details is a challenge, and the fact that people have to go to such lengths to get access to them is noteworthy. 'maxas' was remarkable, and would have been unnecessary on many (most?) other platforms.
Not saying Intel stuff or ARM stuff is 'easier', but at least you get access to the actual low-level asm and have the tooling to work on it.
cma
>it's completely expected that whatever the vendor designs generically for the entire market segment will fail to achieve peak perf for your use case.
When Carmack left Meta, I believe he claimed they were only getting around 20% utilization on their (even then) enormous GPU fleet. So I could see them also leaving a lot of perf headroom on the table.
SavageNoble
This is really cool. 60% is no joke and as a 7900XTX owner I would love the performance boost.
Well done!
randomNumber7
Is the author a genius, or does AMD have questionable software?
kimixa
Many of the optimizations here rely heavily on the size of the matrix and its relationship to hardware-specific details, like LDS size, how it's banked, and register count.
It's probably not surprising that you can grind out a decent improvement over a general solution, and many of the improvements shown here will need to be re-balanced, or simply won't work, for kernels working on different matrix layouts. The same goes for different hardware - even within the same architecture and generation these sorts of details often change.
And all that required going down to the ISA level, which is a lot less easy (and certainly less documented) for Nvidia - for example, the "inspiration" CUDA post linked [0], which didn't beat cuBLAS, also didn't try modifying the SASS directly, so there might be similar gains left unrealized there.
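As a concrete (invented, not from the post) example of the kind of hardware-specific detail I mean - the standard trick of padding a shared tile so column accesses don't pile onto a single bank, where the "right" pad depends entirely on how the LDS of that particular part is banked:

    // Hypothetical HIP-style fragment. The +1 pad breaks the power-of-two
    // stride so column-wise accesses spread across LDS banks instead of
    // serialising on one; whether "+1" is the right amount depends on the
    // bank count/width of the specific chip.
    constexpr int TILE = 32;
    __global__ void use_padded_tile(const float* in, float* out) {
        __shared__ float tile[TILE][TILE + 1];  // note the extra column
        // ... stage a TILE x TILE block of `in` here, __syncthreads(),
        //     then read it back column-wise without bank conflicts ...
    }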
almostgotcaught
> like LDS size, how it's banked, and register count.
But you're acting like they pick these numbers with a random number generator for each generation, when it's just reasonable/rational stuff like "here's 2x more LDS or more registers for free because the new process node is 2x smaller". Like, you must realize that they're not throwing everything away and starting completely from scratch every new gen, right? Incidentally, while LDS will grow and the number of registers will grow, there's absolutely no way they'd change the banking - e.g., NVIDIA hasn't changed it since compute capability 2.0.
kimixa
No, but it's not obvious that differently sized kernels will hit the same bottlenecks seen in the post. It's not really shown one way or the other - are the ROCm kernels just inefficient across the board, or did the author simply identify one that wasn't particularly well optimized? And do these opportunities for improvement really mean the software is "questionable", or just that you can't do an equivalent ISA-level comparison on other vendors' software stacks?
I'm not trying to minimize the work here - it's interesting and a good example of the lengths you can go to in order to squeeze out that last little bit of performance (and, again, it shows the advantages of public ISA documentation and support for users working at that level). I just took issue with the parent comment seeming to use this work as evidence of a poor baseline.
roenxi
ROCm multiplies in 4.5ms and the author multiplies in 2.8ms. The naive algorithm takes 136ms. I don't think anyone at AMD is losing sleep over this; for a general-purpose library this isn't horrible performance. It could be better - hand-optimising for specific conditions often is. But as this blog post shows, optimising kernels is the sort of thing people can do for fun and post blogs about if they care. They don't need AMD to be involved.
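(For scale: 136 / 4.5 ≈ 30x over naive already, and 4.5 / 2.8 ≈ 1.6x - which is where the ~60% figure comes from.)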
The problem with ROCm isn't that it only half-utilises the hardware; the problem was that someone trying to write this blog post in 2020 would quite likely have had a heading somewhere around implementing Kernel 0 about how the software crashed or the kernel panicked when they tried to run the benchmarks. That's what happened to me when I tried a conceptually similar exercise. I was wandering around HN posting comments about how there were no articles like this one to be found for AMD hardware and musing about whether it was even technically possible.
This makes me wish I'd bought an RDNA3 card instead of an Nvidia one for my last purchase. Not that I really regret the choice - AMD are going to have to show they're interested in supporting consumer cards for a while longer to get me to trust them again, although they're on the right path.
saagarjha
AMD isn't losing sleep over the fact that J. Random Blogger is beating their GEMM by 60% on 4096x4096? What universe are you living in? This company is fighting for its life against CUDA, and you're telling me that their software stack being so bad it can't use a third of the hardware on the first and literally only thing people want it to do is somehow not a problem?
roenxi
The point of a platform is for software engineers to provide key functionality independently. Your issue here is you don't understand why CUDA has been so dominant over the last decade - a ~50% software performance gap isn't that material when hardware capacity doubles every generation. If we've reached the point where J. Random Blogger can solve their own problems then the CUDA moat has quite possibly been broken.
If AMD were only one hardware generation behind Nvidia, they'd be pretty competitive. People are happy using CPUs several generations behind the cutting edge. And it isn't even that bad, because anyone who particularly cares can optimise their software and avoid using rocBLAS.
latchkey
Follow Anush on Twitter and give him feedback. He's actively listening.
latchkey
He used to work for AMD.
imtringued
Considering that the biggest difference between the kernels is the lack of dual-issue instructions (an AMD-specific innovation), I'd bet on the latter.
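For anyone unfamiliar: RDNA3 can co-issue two VALU ops as a single VOPD instruction, written roughly like this (syntax from memory, so it may be slightly off):

    v_dual_fmac_f32 v0, v10, v20 :: v_dual_fmac_f32 v1, v11, v21

The hand-written kernel leans on these pairs for its FMA throughput, which is exactly what the compiled kernels are missing.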
nyanpasu64
Is it worth implementing sub-cubic matrix multiplication algorithms like Strassen etc. for 4096x4096?
spookie
Depends on your case, but yes, even for smaller matrices.
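The arithmetic: one level of Strassen turns 8 half-size block multiplies into 7 (plus extra block additions), so T(n) = 7·T(n/2) + O(n^2), i.e. roughly O(n^2.81) instead of O(n^3). At 4096x4096 even a single level cuts the multiply FLOPs by ~12.5% - whether that wins in practice depends on whether the extra additions, memory traffic, and the hit to numerical accuracy are acceptable for your workload.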
saagarjha
I don't think anyone really does this, at least on the GPU.
delusional
I find it quite interesting that while vector instructions are exposed, every other sort of "hardware-level grouping" (wave, SIMD, thread) is hidden from the programmer. Why would vector instructions be the only thing the programmer ought to care about?
I wonder if there's untapped potential in a GPU language which made all of those implicit classes explicit in code, now that we've sort of stabilized on them. It wouldn't allow you to do anything that you can't already do with clever optimizations and a profiler, but it could have the potential to make the optimizations clearer.
In general I'm very curious as to why we don't have any new languages that are better aligned with current hardware. For some reason we collectively decided that it was more fun to make everything general, which is especially unfortunate considering the real world has become increasingly homogeneous. Compiling to some intermediate language makes no sense when you're only ever going to run on x86 anyway.
almostgotcaught
> Furthermore, performing custom ISA optimizations makes these changes RDNA3-specific
This is overblown, at least wrt forward compatibility - all of the instructions used are in RDNA4, and most of them are even in CDNA3 (CDNA4 isn't public yet?), and the ones that aren't exactly there are only slightly renamed (ds_load -> ds_read). Sure it's annoying, but it's not the end of the world to have some `#ifdef`s in your code (that's not very much different from what the compiler itself is going to do anyway).
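Something like this (a sketch - the __gfx*__ macros are the per-arch defines clang sets for amdgcn offload targets, so double-check the exact spellings for your toolchain):

    // Pick the mnemonic spelling per target, keep one inline-asm body.
    #if defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
      #define DS_LOAD_B128 "ds_load_b128"   // RDNA3 spelling (RDNA4 keeps it)
    #elif defined(__gfx90a__) || defined(__gfx940__) || defined(__gfx942__)
      #define DS_LOAD_B128 "ds_read_b128"   // CDNA2/CDNA3 spelling
    #endif
    // ...later, inside the kernel:
    // asm volatile(DS_LOAD_B128 " %0, %1" : "=v"(frag) : "v"(lds_addr));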