Aiter: AI Tensor Engine for ROCm
· March 23, 2025 · bayindirh
> ROCm packages continue to land on Debian, so there's more than meets the eye
I've been volunteering with Debian to help package ROCm for four years now, but today it officially became my full-time job. AMA.
Congrats on the job! It's exciting to see developments in CUDA competitors.
One of the issues I've had with ROCm is not so great support for commercial GPUs. This is specifically with RX 7XXX series. Do you think there is any chance it will improve in future?
I'm not sure. What were your problems with the RX 7XXX series?
Who do you work for? And is packaging ROCm for Debian really a full-time job, or is it just a part of your job?
As messy as ROCm's packaging is, I can't imagine spending all day every day trying to fix it.
I work for AMD. To be clear, my new job is about integrating ROCm into the distribution, not just about shipping ROCm packages that can run on Debian.
I'll be doing things like creating new packages in main, helping to get support for the HIP language embedded into existing dpkg tooling, helping to get GPU architecture awareness integrated into the Debian CI infrastructure, helping to enable ROCm support in other libraries and applications packaged for Debian, and ensuring that everything in Debian is successfully imported into the Ubuntu universe repositories.
Integrating HIP support into Debian so that it feels as natural as C or C++ and 'just works' across dozens of GPUs is a job for more than one person. That is why I'm glad there have been so many volunteers in the community stepping forward to help with various pieces.
> I work for AMD
I have no questions, but congrats! It's great to hear good things like this as both an HPC admin, and a Debian user of 20+ years.
Man, I'm old. :)
Could you please tell AMD that it is a major competitive advantage for Nvidia that they keep doing driver updates for cards for many many years after they were released and even very old cards still get current drivers.
AMD, it seems, drops your card from the current releases within a few years. That makes me favor Nvidia.
The only driver I'm aware of is the AMDGPU driver in the Linux kernel. It is updated with every release of Linux and is used for all modern AMD GPUs. I find that the drivers generally work well. My complaints are more about the user space libraries.
The good news is that I have at least one AMD GPU of each architecture from Vega to RDNA 3 / CDNA 2 on the Debian ROCm CI. Debian Trixie has packages built and tested for every modern discrete AMD GPU from Vega to RDNA 3 / CDNA 2. (I'd have liked to include RDNA 4 / CDNA 3, but the effort was quite resource constrained and the packages are a bit old. I'm hoping to improve upon that going forward, but Trixie is already in feature freeze so it will have to wait for the next release.)
I personally own much of the equipment for the Debian ROCm CI and I can promise I will continue testing new releases on old hardware for a very long time.
The machine I'm writing this comment on is running a Radeon RX 550, with the open source AMDGPU drivers that come with the mainline kernel.
OS is Debian Trixie (Testing). No secret sauce. Install & go. Everything is working perfectly.
Do supercomputers mostly run in FP64? At FP8 an H100 hits 2 petaflops, and with only 1000 of them you’ve got more compute power than El Capitan (in raw flop count).
Disclosure: I'm an HPC admin who developed a materials simulation framework for my Ph.D.
Simulations run on FP64, and they have to, since you're already approximating things with numerical algorithms (analytic solutions of many problems are impossible anyway). Even if you can do things with FP8, transferring everything to the GPU is not trivially possible.
A simulation contains tons of different algorithms, and not all of them can be modeled effectively as a set of matrix operations. Also, moving kernels in and out of the GPU is not an instant affair, and moving data to the GPU is always more expensive.
You have GPUDirect and MultiDMA engines in modern GPUs, but they need hardcore coding and knowing what you're doing if you're not solving popular stuff with established libraries and so on.
Plus, if you'd prefer not to be vendor locked, at least one of the vendors artificially limits the performance you can get from their cards.
On the other hand, all of the prominent linear algebra libraries squeeze out the CPUs you have relatively easily, and you don't have to have matrices and vectors to get this performance from CPUs anyway.
Lastly, I want to touch on the fact that parallelizing such problems is not always trivial, even on CPUs. When you go multinode via MPI, things get fun. Getting GPUs into that mix is somewhat of a madness if you're not prepared.
It hits 2 petaflops on the tensor cores at fp8. If you want GPGPU, that plummets to 134 teraflops (for fp16, though)
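To make the arithmetic in this exchange concrete, here is a rough sketch that uses only the per-GPU figures quoted in the thread (2 petaflops FP8 on tensor cores, 134 teraflops for non-tensor FP16). The El Capitan number is an assumption, taken from its published FP64 HPL result of roughly 1.742 exaflops, and is not from the thread. Peak figures at different precisions are not directly comparable, so treat this as a back-of-the-envelope check only.

```python
# Per-GPU peak figures as quoted in the thread, in teraflops.
H100_FP8_TENSOR_TFLOPS = 2000   # "2 petaflops" on the tensor cores
H100_FP16_GPGPU_TFLOPS = 134    # non-tensor figure quoted for fp16

# Assumption (not from the thread): El Capitan's FP64 HPL result, ~1.742 exaflops.
EL_CAPITAN_FP64_PFLOPS = 1742


def cluster_pflops(num_gpus: int, tflops_per_gpu: int) -> float:
    """Aggregate peak throughput in petaflops, ignoring all scaling losses."""
    return num_gpus * tflops_per_gpu / 1000


fp8_total = cluster_pflops(1000, H100_FP8_TENSOR_TFLOPS)
fp16_total = cluster_pflops(1000, H100_FP16_GPGPU_TFLOPS)

print(f"1000 x H100, FP8 tensor:    {fp8_total:.0f} PFLOPS")
print(f"1000 x H100, FP16 GPGPU:    {fp16_total:.0f} PFLOPS")
print(f"El Capitan, FP64 (assumed): {EL_CAPITAN_FP64_PFLOPS} PFLOPS")
```

The FP8 tensor-core total does exceed the assumed El Capitan figure in raw count, while the non-tensor total falls more than an order of magnitude short of it.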
El Capitan can also do FP8. HPC requires double precision generally but people are trying to make low precision work.
I'm particularly fond of the Ozaki scheme and its recent refinements. Hopefully it trickles down to standard HPC libraries soon.
They support their workstation cards pretty poorly though. I have a Radeon VII Pro and it's already deprecated in ROCm, even though it's not even 3 years old. They could really learn a lesson from Nvidia, which supports old cards going far back and supports every card, not just a few hand-picked business models.
> ROCm development is probably mainly driven by the needs of these supercomputers' users currently.
Seems like a problem since AMD wants to go after AI capex?
If I understand correctly, this library provides some Torch kernels customized for AMD hardware. Why haven't they just upstreamed them to PyTorch for better adoption? Also, they seem to demo usage with Torch default eager execution mode and not Torch JIT/TorchScript. Is this library compatible with TorchScript?
I think a lot of stuff will get upstreamed eventually. PyTorch just moves slower and since it’s a stable library, I think it cannot rapidly adopt something like fused MoE until the dust has settled a little and it’s clear what the API would look like long-term.
I think it’s ok that stuff is tried first in Torch extensions. That’s how Flash Attention started after all and the same is true for newer kernels in CUDA-land (fused MoE, MLA, Marlin, etc.).
With regards to TorchScript, that’s really legacy - torch.compile is where it’s at. This post seems to suggest that the kernels work with torch.compile:
I really do not understand why they can't just work with the existing OSS developers who are pulling their hair out trying to make AMD devices work, instead of doing it this way. It's like Mozilla with the questionable decisions.
There are a lot of OSS developers, I doubt AMD has the resources to do that. And realistically they don't need to, I wandered over to watch some George Hotz videos the other day and it looked like the AMD driver situation has improved to the point where specialist AMD access isn't needed to debug any more. Which is a huge change and very exciting for me personally because it means I might be able to jump back to an AMD card and ditch the mess that is Nvidia on Linux.
In theory they might not even need to be involved in optimising compute kernels, there is probably some PhD student who'll do the work because they want to be a kernel-optimising specialist. In practice a few strategic applications of paid talent is all they really need to do. Everyone wants to diversify off Nvidia so there is a lot of interest in supporting AMD if they are willing to push out firmware that multiplies matrices without crashing. Which has been a weird sticking point for AMD for a surprising amount of time.
There's only one Pytorch though, and it's what people are using for ML nowadays.
Back in the day you had to optimize your card for Quake, do everything to make it run well. Now you have to do that for Pytorch.
I think they are taken over by exactly the same people leading the AI-hype. Funny how in this article they are a) not advertising clearly what they are doing, b) solving a small subset of problems in a way no one asked for (I think most people just want ROCm to work at all...) and c) just adding to a complex product without any consideration of actually integrating with its environment.
I guess it's vibecoding "AI"...
> solving a small subset of problems in a way no one asked for
What do you mean? Having ROCm fused MoE and MLA kernels as a counterpart to kernels for CUDA is very useful. AMD needs to provide this if they want to keep AMD accelerators competitive with new models.
> I think most people just want ROCm to work at all
I think most people don't want to have to think about vendor lock-in related bullshit. Most people just want their model to run on whatever hardware they happen to have available, don't want to have to worry about whether or not future hardware purchases will be compatible, and don't want to have to rewrite everything in a different framework.
Most people fundamentally don't care about ROCm or CUDA or OneAPI or whatever else beyond a means to an end.
They are imitating Nvidia's TensorRT with AITER. Basically AMD wants to have "CUDA, but not CUDA".
which Mozilla's questionable decisions are you referring to?
> Why haven't they just upstreamed them to PyTorch for better adoption?
They don't seem to care, or don't understand how to get broader adoption.
For some reason AMD's management is dead set on targeting only the high end part of the market. Like, for example, look at this blog post. Which model are they testing? DeepSeek R1, the 671B behemoth that no normal person can run. Or look at any of their tutorials/docs and see which GPUs they support - it's always only either the unobtanium-grade enterprise GPUs, or high end workstation cards that no one buys. And if your strategy is to target only the super rich entities then a little jank in the software isn't really all that punishing - if you can afford to drop a few million on GPUs then you can also afford to hire someone to spend a few weeks getting AMD's software to work, or getting it tuned by tweaking the two dozen environment variables they seem to like so much, etc.
> For some reason AMD's management is dead set on targeting only the high end part of the market.
Because those people are dropping $100 billion on GPU clusters and individuals are not
Yes, but researchers use Pytorch and those researchers end up being the end users of the GPU clusters.
NVIDIA GPUs sell so well because they work with what researchers actually use.
That would make the kernels the PyTorch Foundation's problem and they would have to set up CI infrastructure around AMD GPUs to maintain these kernels. For whatever reason, AMD really wants to keep everything in-house even though that has been a losing strategy so far.
I'm not a Python expert, but this feels very odd to me (both the *init* construction and the return [](, self.weight, self.bias, None, None) call, which looks like markdown to me):
from aiter.tuned_gemm import tgemm
import torch

class LinearLayer(torch.nn.Module):
    def **init**(self, in_features, out_features):
        super(LinearLayer, self).**init**()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features).cuda())
        self.bias = torch.nn.Parameter(torch.randn(out_features).cuda())

    def forward(self, input):
        input = input.cuda()
        return [](, self.weight, self.bias, None, None)
I was puzzling over the code wondering why they .cuda() everything like that when I realised that that was only the beginning of the weirdness.
I'm assuming the scrambled annotations were due to some odd chain of things the code went through on the way to becoming a post.
Maybe they did it as a parable about the problems of having many layers of abstraction causing processes with unintended consequences?
Yeah this is AMD in a nutshell. A bunch of fluffy descriptions and then the only concrete example would clearly never run.
EDIT: They fixed the code pretty quickly
yep the syntax highlighting / doc hyperlinking clearly broke there (or, less charitably, whatever llm produced that prose had a moment)
it's __init__ of course
also why is it calling .cuda() to move tensors to a CUDA device? I suppose this is because this is based on HIP - which comes with its own set of problems, but that's ROCm for the masses I guess.
Also this has to be a torch module (at first I thought this was some low-level library which they now have a preview of, because there is a ROCm-torch already ...), which is evident from the table just before the summary. That table also smells like they are mostly focused on inference...
EDIT: seems official ROCm-torch is also based on HIP.
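For readers puzzled by the .cuda() calls: ROCm builds of PyTorch expose AMD GPUs under the same "cuda" device name, which is why the snippet reads like CUDA code. Below is a minimal sketch for checking which backend a given install uses. It assumes PyTorch may or may not be installed, and the function name describe_backend is made up for illustration.

```python
def describe_backend() -> str:
    """Return 'rocm', 'cuda', 'cpu', or 'no-torch' for the local PyTorch install."""
    try:
        import torch
    except ImportError:
        return "no-torch"
    if not torch.cuda.is_available():
        return "cpu"
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    if getattr(torch.version, "hip", None):
        return "rocm"
    return "cuda"


print(describe_backend())
```

On a ROCm build with a working GPU this should report 'rocm', even though tensors are still moved with .cuda() or device="cuda".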
So to do an efficient MM on AMD you need to find every MM in the pytorch model and replace it with a call to this library? Seems like something that should've been fixed years ago.
Also I assume nvidia does the same thing but it is still hilarious that this is how it works
Still waiting for ROCm on my cheap Radeon RX 7600. Would be nice to play around with it a little. I know that this card is nothing fancy. There is a GitHub issue somewhere where they announced they would port it to consumer cards on Linux, but last time I checked (a few days ago) it still wasn't available.
I used rocm on an RX 7600 a month after launch. Having no official support does not at all mean it doesn't work.
Use the PyTorch Nightly build. The ROCm libraries themselves have been built for the RX 7600 (gfx1102) since ROCm 5.4/5.5, but PyTorch itself wasn't enabled until a few weeks ago. The RX 7600 is still not 'officially supported' on Linux, but I have an RX 7600 XT and I haven't encountered any issues in my (admittedly intermittent) use of the card in AI applications. You may, however, find the 8GB of VRAM in the non-XT version to be a limitation.
You should be able to make it think you have another card:
export HSA_OVERRIDE_GFX_VERSION=10.3.0
The possible values are said to be:
# gfx1030 = "10.3.0"
# gfx900 = "9.0.0"
# gfx906 = "9.0.6"
# gfx908 = "9.0.8"
# gfx90a = "9.0.a"
Telling ROCm to pretend that your RDNA 3 GPU (gfx1102) is an RDNA 2 GPU (gfx1030) is not going to work. The ISAs are not backwards-compatible like that. You might get away with pretending your gfx1102 GPU is a gfx1100 GPU, but even that depends on the code that you're loading not using any gfx1100-specific features. I would generally recommend against using this override at all for RDNA 3 as those ISAs are all slightly different.
In any case, the possible values can be found in the LLVM documentation [1]. I would recommend looking closely at the notes for the generic ISAs, as they highlight the differences between the ISAs (which is important when you're loading code built for one ISA onto a GPU that implements a different ISA).
I forgot that there's an "11.0.0" as well. Perhaps others have been added since.
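The values listed in this exchange follow a visible pattern: the last two characters of the gfx name become the minor and stepping fields, and whatever precedes them is the major version. Here is a small sketch of that mapping, inferred only from the examples quoted in the thread (the helper name gfx_to_override is hypothetical). It does not make an override safe. As noted above, the ISAs are not backwards-compatible, so the string being well-formed says nothing about whether the code will run.

```python
def gfx_to_override(target: str) -> str:
    """Convert a target name such as 'gfx1030' to 'MAJOR.MINOR.STEPPING'.

    Pattern inferred from the examples in the thread, not from a specification.
    """
    if not target.startswith("gfx") or len(target) < 6:
        raise ValueError(f"not a gfx target name: {target!r}")
    body = target[3:]
    major, minor, stepping = body[:-2], body[-2], body[-1]
    if not major.isdigit():
        raise ValueError(f"unexpected major version in {target!r}")
    return f"{major}.{minor}.{stepping}"


# Values quoted in the thread, used here as a sanity check.
for name in ("gfx1030", "gfx900", "gfx906", "gfx908", "gfx90a", "gfx1100"):
    print(f'{name} = "{gfx_to_override(name)}"')
```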
I believe the override for GP's 7600 is 1100 or 11.0.0 as GFX1030 is RDNA2 (6800 XT).
The 7900 models are all 1100, the 7800XT is 1101 and the 7600 is 1102.
See Shader ISA:
Wow, it sure sounds like a mess under there. They used 4 different languages?
Using one high-level language and assembly sounds fine, but four feels incoherent. Would love to know why this has happened.
"This infrastructure is built upon a variety of underlying technologies, including Triton, CK (Compute Kernel), ASM (Assembly), and HIP (Heterogeneous Interface for Portability)."
That's not exactly unusual, for example pytorch has Python, C++, C, and Cuda.
Notice those are all (except arguably CUDA) very mainstream languages. All four of AMD's are niche. Upstreaming this into PyTorch would double the number of languages used. (Although HIP is very similar to CUDA.)
HIP is essentially the same as CUDA, CK is not a language but a library, and assembly is basically used in the Nvidia ecosystem as well, in the form of PTX.
There is absolutely nothing out of the ordinary here. Yes, it's multiple languages, but not any more or any different than what you'd use on an Nvidia platform (except obviously for the assembly part -- AMD's ISA is different from PTX, but that's to be expected).
Well, if you're including ASM in AMD's you have to include it in CUDA too, people definitely will embed PTX in their kernels. Triton is also gaining steam, so not too crazy. But yes, HIP and CK are rather obscure. In my limited time working w/ the AMD software stack this was a trend -- lots of little languages and abandoned toolchains, no unified strategy.
I believe that PyTorch already uses Triton; I recently tried to do torch.compile on a Windows machine and it did not work because the inductor backend relies on Triton.
Those aren't four different languages. CK and HIP are both just libraries.
HIP is AMD's equivalent of CUDA and is certainly a language.
But you are right CK is indeed a library, thanks for pointing that out.
Wait, did they get their own library name wrong? CK should be Composable Kernel, I can’t find anything called compute kernel anywhere
It does look like that yes. It wasn't my error, the quote is copy pasted verbatim from the article.
Really interesting. How does it compare to tinygrad's support for AMD GPUs?
Performance increased 100% on an MI300X running a large LLM.
On one hand, cool. On the other hand wow have they been leaving a lot of performance on the table.
How does the performance compare to NVidia now?
Has anyone tried any of this on a few 7900 XTXs (or have familiarity with this hardware and platform)? I've just purchased 6 for some small-scale experimentation. For the next machine I'm thinking of using AMD Radeon PRO W7900s (to get 128 GB VRAM / machine).
I have a 7900 GRE, which is the same except less memory. I run Gemma 3, Llama 3.1, the QwQ models and the DeepSeek distilled models using llama.cpp. They run fine. I especially like the new Gemma3-27b-Q6 (20 GB model); I get 2 tok/s on it.
I have also run Hunyuan3D-2 and generated 3D models. You would have to separate out the model generation and texture generation phases, but it works.
I run ComfyUI and bootleg GGUF models. This is all on Windows. Now even WSL2 works, so I am using Ubuntu-24.04 on Windows 11 to run Hunyuan3D-2.
For LLMs, llama.cpp native binaries are available. Everything just works out of the box.
We have a dual W7800 system in-house as our `gfx1100` rig. I'll try to install and run through the tests sometime this week.
Just export HSA_OVERRIDE_GFX_VERSION=11.0.0 and things should mostly work. Off the top of my head, some of the fp8 types aren't supported but <shrug>
The RX 7900 XTX and Radeon PRO W7900 are already 11.0.0. That override is unnecessary.
Thanks -- I don't need everything to work, just enough to explore the platform and develop some realistic prototypes which can be moved on to probably the Radeon PROs.
I run a large test suite daily (~30000) meant for MI300 on my local 7900. I don't keep track of fails outside of a specific few tests that I'm interested in but in general I get about 70-80% passing.
Silly question perhaps, but is this a true CUDA equivalent? Why (not)?
This is equivalent to something like cuDNN, a CUDA library.
Aiter is a ROCm library.
ROCm is the thing that is like CUDA, but for AMD.
Why is everyone using the GPUs of this other company for AI?
I just want to remind everyone that the El Capitan, Frontier and LUMI supercomputers are powered by AMD Instinct cards.
El Capitan is #1 in TOP500. Frontier is #2, LUMI is #8.
ROCm development is probably mainly driven by the needs of these supercomputers' users currently.
So, we're seeing the tip of the iceberg.
Also ROCm packages continue to land on Debian, so there's more than meets the eye.
Note: Search "AMD Instinct" at There are way more systems.