ROCm Device Support Wishlist
180 comments · January 20, 2025
latchkey
KeplerBoy
Also known as the AMD representative who recently argued with Hotz about supporting tinycorp.
latchkey
Is that a bad thing? Good for him to stand up to extortion.
KeplerBoy
Hard to say from my perspective.
I think AMD's offer was fair (full remote access to several test machines). Then again, just giving tinycorp the boxes on their terms with no strings attached, as a kind of research grant, would have earned them some goodwill with that corner of the community.
Either way both parties will continue making controversial decisions.
rikafurude21
"I estimate having software on par with NVDA would raise their market cap by 100B. Then you estimate what the chance it that @__tinygrad__ can close that gap, say it's 0.1%, probably a very low estimate when you see what we have done so far, but still...
That's worth 100M. And they won't even send us 2 ~100k boxes. In what world does that make sense, except in a world where decisions are made based on pride instead of ROI. Culture issue."
catgary
Yeah, AMD is already pouring a lot of support into OpenXLA/IREE, which has a lot of well-respected compiler engineers and researchers working on it, and companies like AWS are also investing into it.
I don’t really think TinyCorp has anything to offer AMD.
modeless
Offering software support in exchange for payment is extortion?
magic_at_nodai
Hey, that's me. Happy to help answer anything here, and I look forward to your constructive feedback to make AMD software better. We've got work to do and look forward to it.
imtringued
OK, why does running koboldcpp with a "BLAS Batch Size" of 512 via Vulkan on an RX 570 crash my entire computer? You know, to the point where I have to manually power it back on.
I personally couldn't think of a better reason to never buy AMD GPUs ever again by the way.
latchkey
I have experience running 130,000 RX 470/570/480/580 cards... if you're doing heavy workloads, those things will crash the whole machine if you breathe on them wrong. That said, when they do run, they run extremely well.
There are a thousand reasons why your one GPU could have crashed. What do the logs say right before the crash?
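For anyone hunting that down, the usual first stops on Linux are the kernel logs (a generic sketch assuming the amdgpu driver and a systemd-based distro; `-b -1` reads the log from the boot that crashed):

```shell
# Kernel messages from the previous (crashed) boot, filtered for GPU noise.
sudo journalctl -k -b -1 | grep -iE 'amdgpu|gpu reset|ring .* timeout'

# Live kernel log on the current boot.
sudo dmesg | grep -i amdgpu
```

A `ring ... timeout` followed by a failed GPU reset in that output is the classic signature of the kind of full-machine hang described above.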
clhodapp
Which SemiAnalysis article?
sorenjan
ROCm is a mistake. It's fundamentally broken by compiling to hardware-specific code instead of an intermediate representation like CUDA's PTX, so it will always be plagued with this issue of not supporting all cards, and even if a certain GPU is supported today, they can drop support in the next version. It has happened before, and it will continue to happen.
It's also a strange value proposition. If I'm a programmer in some super computer facility and my boss has bought a new CDNA based computer, fine, I'll write AMD specific code for it. Otherwise why should I? If I want to write proprietary GPU code I'll probably use the de facto industry standard from the industry giant and pick CUDA.
AMD could be collaborating with Intel and a myriad of other companies and organizations and focus on a good open cross-platform GPU programming platform. I don't want to have to think about who makes my GPU! I recently switched from an Intel CPU to an AMD one, with obviously no problems. If I had to get new software written for AMD processors, I would have just bought a new Intel, even though AMD is leading in performance at the moment. Even Windows on ARM seems to work OK, because most things aren't written in x86 assembly anymore.
Get behind SYCL, stop with the platform specific compilation nonsense, and start supporting consumer GPUs on Windows. If you provide a good base the rest of the software community will build on top. This should have been done ten years ago.
frognumber
Agreed.
Honestly, the problem isn't just which devices, but even more so, this (from the page, not your comment):
> No guarantees of future support but we will try hard to add support.
During the Great GPU Shortage, I bought an AMD RX 5xx card for ML work. It was explicitly advertised to work with ROCm. Within a couple of months, AMD dropped ROCm support. EOLing an actively-sold product, preventing it from being used for its advertised purpose within the warranty period, was, if I understand the consumer protection laws in my state correctly, fraud. There was no support from the card vendor (MSI), none from AMD, and none from the reseller. Short of small claims court, which was not worth it, there was no recourse.
This is on a long list of issues AMD needs to sort out to be a credible player in this space:
* Those are the kinds of experiences which cause people to drop a vendor and not look back. AMD needs to either support cards forever, or at the very least, have an advertised expiration date (like Chromebooks and Android phones).
* Broad support is helpful from a consumer perspective for the simple, pragmatic reason that only a tiny fraction of the population has the time to read online forums, footnotes, or fine print. People should be able to buy a card at Amazon, Best Buy, or Micro Center and expect things to Just Work.
* Being able to plan is essential for enterprise use. I can't build a system around AMD if AMD might stop supporting their platform with zero days' notice, and the next day there might be a security exploit that requires a version bump.
I'm hoping Intel gets their act together here, since NVidia needs a credible competitor. I've given up on AMD.
magic_at_nodai
PTX does provide a low-level machine abstraction. However, you still target some version of the hardware ( https://arnon.dk/matching-sm-architectures-arch-and-gencode-... ). A lot of software effort has gone into making it look and work seamlessly.
Though AMD doesn't have the same "virtual ISA" as PTX right now, there are increasing levels of such abstraction available in compiler flows with MLIR / Linalg etc. Those are higher-level and can be compiled / JITed in real time to obviate the need for a low-level virtual ISA.
danjl
We already fought and lost this battle with 3D APIs for GPUs. What makes you think that winning strategy would play out any other way for tensor processing?
cherryteastain
Really telling that they have to ask us which cards we want, as opposed to supporting all cards by default from day 1 like Nvidia.
All because they went with a boneheaded decision to require per-device code compilation (gfx1030, gfx1031...) instead of compiling to an intermediate representation like CUDA's PTX. Doubly boneheaded considering the graphics API they developed, Vulkan, literally does that via SPIR-V!
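The difference shows up directly in the compiler invocations (flags are illustrative; actual targets depend on your cards and toolchain versions):

```shell
# HIP/ROCm: each GPU ISA must be baked in at build time. Leave out a
# card's gfx target and the resulting binary simply won't run on it.
hipcc --offload-arch=gfx1030 --offload-arch=gfx1100 kernel.cpp -o kernel

# CUDA: embedding PTX (code=compute_XX) gives the driver a virtual ISA
# it can JIT-compile for GPUs that didn't exist when the binary shipped.
nvcc -gencode arch=compute_80,code=compute_80 kernel.cu -o kernel
```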
diggan
Really telling who comments before reading :)
The author of the issue comments that they'll eventually support all cards. What he's really asking is which cards people want them to prioritize, not which to support at all.
cherryteastain
I read it fully. The whole point of my post is that, based on their track record so far plus the technical limitations, it is impossible for AMD to provide the same day-1 drop-in compatibility that the CUDA ecosystem offers.
Edit:
> No guarantees of future support but we will try hard to add support.
magic_at_nodai
Yes. We are behind on software support for consumer cards and would love to support all of them, but we are looking for guidance / feedback so we can prioritize.
cherryteastain
This line sparks no confidence:
> No guarantees of future support but we will try hard to add support.
AMD reps told me exactly the same thing years ago, about how they'd love to support all cards, when RDNA2 had just launched. Fast forward, and only the W6800 is properly supported from that generation. The last time I tried, it had tons of kernel bugs that caused hard freezes in anything beyond the most basic cases.
You need to come out and say that you will support all cards, no ifs or buts, by a hard deadline.
wtcactus
I'm constantly baffled and amused by why AMD keeps majorly failing at this.
Either the management at AMD is not smart enough to understand that without the computing software side they will always be a distant number 2 to NVIDIA, or the management at AMD considers it hopeless to ever be able to create something as good as CUDA because they don’t have and can’t hire smart enough people to write the software.
Really, it’s just baffling why they continue on this path to irrelevance. Give it a few years and even Intel will get ahead of them on the GPU side.
musicale
If I were Jensen, I would snap up all the GPU software experts I possibly could, and put them to work improving the CUDA ecosystem. I'd also spin up a big research group to further fuel the CUDA pipeline for hardware, software, and application areas.
Which is exactly what NVIDIA seems to be doing.
AMD's ROCm software group seems far behind, is probably understaffed, and probably is paid a fraction of what NVIDIA pays its CUDA software groups.
AMD also has to catch up with NVlink and Spectrum-X (and/or InfiniBand.)
AMD's main leverage point is its CPUs, and its raw GPU hardware isn't bad, but there is a long way to go in terms of GPU software ecosystem and interconnect.
omcnoe
I've never understood why they have such a fractured approach to software:hardware support. I remember reading and writing comments about this on hn nearly a decade ago now. It's a long time to keep making the same mistake.
They had the exact same kind of support issues back in the OpenCL days, where they didn't manage to provide cross platform, cross card support for same versions of the platform.
I have never been able to reconcile it with their turnaround and newfound competence on the CPU side.
almostgotcaught
> I’m constantly baffled and amused on why AMD keeps majorly failing at this.
i wonder if you've considered the possibility that there's some component/dimension of this that you're simply unaware of? that it's not as straightforward as whatever reductive mental model you have? is that even like within the universe of possibilities?
rcxdude
I mean, they did say they were baffled. I'd say that probably includes "I don't know"
superkuh
My wishlist for ROCm support is actually supporting the cards they already released. But that's not going to happen.
By the time a (consumer) AMD device is supported by ROCm, it only has a few years of ROCm support left before support is removed. The lifespan of ROCm support for AMD cards is very short. You end up having to use Vulkan, which is not optimized, of course, and a bit slower. I once bought an AMD GPU 2 years after release, and 1 year after I bought it, ROCm support was dropped.
slavik81
FWIW, every ROCm library currently in the Debian 13 'main' and Ubuntu 24.04 'universe' repository has been built for and tested on every discrete consumer GPU architecture since Vega. Not every package is available that way, but the ones that are have been tested on and work on Vega 10, Vega 20, RDNA 1, 2 and 3.
Note that these are not the packages distributed by AMD. They are the packages in the OS repositories. Not all the ROCm packages are there, but most of them are. The biggest downside is that some of them are a little old and don't have all the latest performance optimizations for RDNA 3.
Those operating systems will be around for the next decade, so that should at least provide one option for users of older hardware.
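As a concrete example (package names taken from the current Debian/Ubuntu archives; they may differ by release):

```shell
# Install ROCm components from the distro archive instead of AMD's repo;
# these builds are the ones tested on all discrete consumer GPUs since Vega.
sudo apt install hipcc librocblas0 rocminfo rocm-smi

# Confirm the runtime actually sees the GPU and report its gfx target.
rocminfo | grep -i gfx
```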
buildbot
Packages existing and the software actually working are very different things. You can run ROCm on unsupported GPUs like a 780m, but as soon as you hit an issue you are out of luck. And you'll hit an issue.
For example, my 780m gets 1-2 inferences from llama.cpp before dropping off the bus due to a segfault in the driver. It's a bad enough lockup that Linux can't cleanly shut down and will hang until hard-rebooted.
slavik81
The 780m is an integrated GPU. I specified discrete GPUs because that's what I have tested and can confirm will work.
I have dozens of different AMD GPUs and I personally host most of the Debian ROCm Team's continuous integration servers. Over the past year, I have worked together with other members of the Debian project to ensure that every potentially affected ROCm library is tested on every discrete consumer AMD GPU architecture since Vega whenever a new version of a package is uploaded to Debian.
FWIW, Framework Computers donated a few laptops to Debian last year, which I plan to use to enable the 780m too. I just haven't had the time yet. Fedora has some patches that add support for that architecture.
mappu
I can confirm this, Debian's ROCm distribution worked great for me on some "unsupported" cards.
mikepurvis
As the underdog AMD can't afford to have their efforts perceived as half-assed or a hobby or whatever. They should be moving heaven and earth to maximize their value proposition, promising and delivering on longer support horizons to demonstrate the long term value of their ecosystem.
seanhunter
Honestly at this point half-assed support would be a significant step up from their historical position. The one thing they have pioneered is new tiers of fractional assedness asymptotically approaching zero.
XorNot
I mean at this point my next card is going to be an nvidia. It has been a total waste of time trying to use rocm for anything machine-learning based. No one uses it. No one can use it. The card I have is somehow always not quite supported.
llm_trw
We go from:
Support is coming in three months!
To
This card is ancient and will no longer be developed for. Buy our brand new card, releasing in three months!
Every damned time.
7speter
I have an MI50 with 16GB of HBM that's collecting dust (it's Vega-based, so it can play games, I guess) because I don't want to bother setting up a system with Ubuntu 20.04, the last version of Ubuntu that works with the last version of ROCm to support the MI50.
With situations like this, it's not hard to see why Nvidia totally dominates the compute/AI market.
slavik81
The MI50 may be considered deprecated in newer releases, but it seems to work fine in my experience. I have a Radeon VII in my workstation (which shares the same architecture) and I host the MI60 test machine for Debian AI Team. I haven't had any trouble with them.
nalllar
I had the impression Debian applied patches that widen arch support from what upstream officially supports, including for the MI50/MI60.
https://salsa.debian.org/rocm-team/rocm-hipamd/-/raw/d6d2014... (one patch of many)
7speter
I don't think the MI60 has reached deprecated status yet (the last time I looked at prices, the MI60 was something like 3x as expensive as the MI50, and I think that's because it's still officially supported), but I'll check this all out. Thanks.
FuriouslyAdrift
AMD did over $5 billion in GPU compute (Instinct line) last year. Not Nvidia numbers, but also not bad. Customers love that they can actually get Instinct systems rather than trying to compete with the hyperscalers for limited supplies of Nvidia systems. Meta and Microsoft are the two biggest buyers of AMD Instincts, though...
AMD Instinct is also more power efficient and has comparable (if not better) performance for the same (or less) price.
7speter
Meta and Microsoft buy hundreds of thousands of Nvidia accelerators a year, and are a big reason why everyone else has to compete for Nvidia units.
nubinetwork
Seeing Radeon VII on the deprecation list is a little saddening, unless they start putting out more 16gb+ GPUs that aren't overly expensive...
bb88
They should have at a minimum 5 year support release cycle.
kllrnohj
It kinda seems like they do - 5 years would only include the RX 6xxx and 7xxx.
5 years is not very long tbh.
suprjami
RX 7800 XT was supported for 15 months before being dropped. Significantly less than 5 years.
bb88
True but business hardware (and home for that matter) often goes on 3-5 year cycles though. At 5 years it's kinda expected hardware will get replaced.
FuriouslyAdrift
AMD has separate architectures for GPU compute (Instinct https://www.amd.com/en/products/accelerators/instinct/mi300....) and consumer video (Radeon).
AMD are merging the architectures (UDNA) like nVidia but it's not going to be before 2026. (https://wccftech.com/amd-ryzen-zen-6-cpus-radeon-udna-gpus-u...)
7speter
You can use ROCm on consumer Radeon as long as you pay more than 400 dollars for one of their GPUs. Meanwhile, you can run Stable Diffusion with the --lowvram flag on a 3050 6GB that goes for 180 dollars.
ghostpepper
I can understand wanting to prioritize support for the cards people want to use most, but they should still plan to write software support for all the cards that have hardware support.
Gigachad
I've long since given up on my 5700xt getting supported. AMD is just not a good pick if you care about non graphics compute.
suprjami
If you use Debian libraries then it will work. eg:
https://github.com/superjamie/rocswap
I ran this on a 5600 XT; just recently switched to Nvidia.
KeplerBoy
Imagine Nvidia not supporting CUDA on any of their cards. Unthinkable.
latchkey
Nvidia takes a software first approach and AMD takes a hardware first approach.
It is clear that AMD's approach isn't working and they need to change their balance.
washadjeffmad
I've always described Nvidia as an accelerated compute company that happens to sell hardware.
AMD are smart, and they solve big problems in ways that are baffling to many. They're very sensitive to moats and position themselves with products or frameworks to drain them.
I consider their primary product "engineering competence as a service", but when no one external picks up the reins, they don't try very hard to play market maker. I remember when Intel's R&D budget was more than AMD's market cap; they're effective at running lean when they have to be.
The reality here is that people don't have grievances with CUDA and Nvidia aren't doing anything egregious with it. But whether that's due to ROCm's existence... we can only speculate.
Eval-Apply
I read a story last year from Techpowerup where they said that AMD is making big changes to the way it approaches technology, shifting its focus from hardware development to emphasizing software experiences, APIs and AI.
Roadmap: 3 to 5 years.
https://www.techpowerup.com/324171/amd-is-becoming-a-softwar...
kouteiheika
Hardware first, but then their hardware isn't any better than NVidia's, so I don't see how that's a valid excuse here.
(Okay, maybe their super high end unobtanium-level GPUs are better hardware-wise. Don't know, don't care about enterprise-only hardware that is unbuyable by mere mortals.)
make3
This is a posteriori reasoning... we have no idea how hard it is to implement support on older GPUs.
npteljes
People successfully set up Stable Diffusion with automatic1111 and ROCm on all kinds of weird setups. What AMD needs to do is basically just provide a better out-of-the-box experience, as even following other people's instructions has been flaky at best. For example, for my 6600 XT, I have tried setting up SD twice. I succeeded on Manjaro in the past (like, a year ago) but didn't succeed now, and I succeeded on Debian now, but it uses the CPU for some reason. The hardware setup was the same; the only thing that changed is that I updated my Linuxes in the meantime.
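Two things worth checking in that situation (a sketch, not official guidance: gfx1032 is the 6600 XT's ISA target, and the HSA_OVERRIDE_GFX_VERSION spoof is a widely used but unsupported workaround):

```shell
# The RX 6600 XT (gfx1032) isn't an official ROCm target; pretending to
# be gfx1030 often makes it work anyway.
export HSA_OVERRIDE_GFX_VERSION=10.3.0

# If this prints "False None", the installed PyTorch is a CPU-only build,
# which would explain Stable Diffusion silently falling back to the CPU.
python3 -c "import torch; print(torch.cuda.is_available(), torch.version.hip)"
```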
__turbobrew__
ROCm is kind of a joke. Recently I wanted to write some Go code that talks to ROCm devices using AMD SMI. You have to build and install the Go AMD SMI bindings from source, the repo has dead links, and there is basically no documentation anywhere on how to get it working.
Compare this to Nvidia, where I just imported the go-nvml library, and it built the cgo code and automatically linked to nvidia-ml.so at runtime.
magic_at_nodai
Is this the repo you are referring to: https://github.com/amd/go_amd_smi ? Would having a prebuilt version there help you?
__turbobrew__
> NOTE: The GO SMI binding depends on the following libraries:
> - E-SMI inband library (https://github.com/amd/esmi_ib_library)
> - ROCm SMI library (https://github.com/ROCm/rocm_smi_lib)
> - AMDSMI library (https://github.com/ROCm/amdsmi)
> - goamdsmi_shim library (https://github.com/amd/goamdsmi/goamdsmi_shim)
First of all this link is dead: https://github.com/amd/goamdsmi/goamdsmi_shim
Second: these dependencies should all be packaged into deb/rpm
Third: there should be a goamdsmi package with a proper dependency tree. I should be able to do 'apt-get install goamdsmi' and have it install everything I need. This is how it works with go-nvml.
ac29
AMD supports only a single Radeon GPU in Linux (RX 7900 in three variants)?
Windows support is also bad, but supports significantly more than one GPU.
llm_trw
Imagine Nvidia supported only the 4090, 4080, and 4070 for CUDA at the consumer level, with the 3090 unsupported since the 40xx series came out. This is what AMD is defending here.
Delk
I honestly can't figure out which Radeon GPUs are supposed to be supported.
The GitHub discussion page in the title lists RX 6800 (and a bunch of RX 7xxx GPUs) as supported, and some lower-end RX 6xxx ones as supported for runtime. The same comment also links to a page on the AMD website for a "compatibility matrix" [1].
That page only shows RX 7900 variants as supported on the consumer Radeon tab. On the workstation side, Radeon Pro W6800 and some W7xxx cards are listed as supported. It also suggests to see the "Use ROCm on Radeon GPU documentation" page [2] if using ROCm on Radeon or Radeon Pro cards.
That link leads to a page for "compatibility matrices" -- again. If you click the link for Linux compatibility, you get a page on "Linux support matrices by ROCm version" [3].
That "by ROCm version" page literally only has a subsection for ROCm 6.2.3. It only lists RX 7900 and Pro W7xxx cards as supported. No mention of W6800.
(The page does have an unintuitively placed "Version List" link through which you can find docs for ROCm 5.7 [4]. Those older docs are no more useful than the 6.2.3 ones.)
Is RX 6800 supported? Or W6800? Even the amd.com pages seem to contradict each other on the latter.
Maybe the pages on the AMD site only list official production support or something. In any case it's confusing as hell.
Nothing against the GitHub page author who at least seems to try and be clear but the official documentation leaves a lot to be desired.
[1] https://rocm.docs.amd.com/projects/install-on-linux/en/lates...
[2] https://rocm.docs.amd.com/projects/radeon/en/latest/docs/com...
[3] https://rocm.docs.amd.com/projects/radeon/en/latest/docs/com...
[4] https://rocm.docs.amd.com/projects/radeon/en/docs-5.7.0/docs...
magic_at_nodai
I will pass this feedback to the docs team to clean up. I found the docs hard to navigate myself when I was making that poll :D but I looked harder instead of trying to fix them. So thank you for the feedback.
Delk
Thanks for the effort! Documentation is so easy to neglect especially when there's also a ton of other stuff to do, but it's also so important for anything intended for technical use. Especially when things change over time or from one version to another.
baby_souffle
> I honestly can't figure out which Radeon GPUs are supposed to be supported.
Exactly.
I have a 6700 XT with 12 gig ram and a 5700 with 8 gig ram.
If i ctrl+f for either of those numbers on the GH issue, I get one hit. For the 6700, it's a single row that has a green check for "runtime" and a red x for "HIP SDK". For the 5700 card, it's somebody in the peanut gallery saying "don't forget about us!".
HIP is the c++ "flavor" that can compile down to work on amd _and_ nvidia gpus. If the 6700 has support for the "runtime" but not HIP ... what does that even mean for me?
And as you pointed out, the 6800 series card has green checks for both so that means it's fully supported? But ... it's not listed on AMD's site?!
Bad docs are how you cement a reputation of "just buy nvidia and install their latest drivers and it'll be fine".
xmodem
I think the matrix shown in the github issue is for Windows support, which is much better: https://rocm.docs.amd.com/projects/install-on-windows/en/lat...
Having said that, on the weekend I set up ROCm on Linux on my 6800XT and it seems to work just fine.
redmajor12
Removing support for Radeon VII is a bonehead move that smacks of stupidity or greed. The cards were targeted for enthusiast gamers but have enterprise level hardware, like HBM2 memory and 1 TB/s bandwidth.
cokecan
Super annoying. I have an RX 6600 XT and can't get ROCm to work on Linux. Vulkan ML however worked perfectly out of the box, so at least I got something.
Just weird the official thing doesn't work.
suprjami
Use the Debian libraries, it works:
slavik81
The caveat being that PyTorch has a lot of dependencies and a couple of them are not yet available in Debian Unstable. For folks wanting to use StableDiffusion, that's a problem. However, the available packages are more than sufficient for llama-cpp as you point out.
curt15
I found that striking as well. Does AMD expect everyone wanting to try out PyTorch or LLMs on Linux to splurge on Instinct servers?
magic_at_nodai
ROCm on Radeon should work too, and the poll above was to seek feedback on which cards to support next.
phkahler
Add support for every APU. They can have much more RAM than discrete graphics.
RandyOrion
Why are people in AMD assuming other people don't want more software support for their GPUs by default? This is not nice.
suprjami
Because they don't have infinite resources like Nvidia, they're asking what people want the most so they can prioritize it.
Please read the link before commenting in future. We do that here. This info is in an early comment by an AMD employee.
RandyOrion
It's not nice to assume that people don't read, then proceed to comment.
I read the link and I upvoted the "just support all GPUs you recently produced" comment.
I don't think the solution to bad software support is the prioritization. The prioritization is causing even more discrimination among different GPUs and different customers.
You can say whatever you want, and downvote whatever you want. However, that doesn't solve the real problem.
maverwa
I figure that list is only what's officially supported, meaning things not on that list may or may not work. For example, my 6800 XT runs Stable Diffusion just fine on Linux with PyTorch ROCm.
Toutouxc
What’s the performance like? Was it easy to set up?
maverwa
I cannot compare the performance with other cards, but it takes a few seconds for SDXL images (e.g. 1024x512) as long as it doesn’t run OOM.
I use a fork of the stable diffusion webui [0] which, for me, handled memory better. Setup was relatively easy: install the pytorch packages from the ROCm repo and it worked.
[0]: https://github.com/lllyasviel/stable-diffusion-webui-forge
criticalfault
A lot of people think ROCm is basically a big pile of crap.
What are the chances of AMD considering alternatives:
- adopt oneAPI and try to fight Nvidia together with Intel
- Vulkan, with a PyTorch backend implemented on top of it
- SYCL
For context, the submitter of the issue is Anush Elangovan from AMD, who's recently been a lot more active on social media after the SemiAnalysis article, and is taking the reins / responsibility for moving AMD's software efforts forward.
However you want to dissect this specific issue, I'd generally consider this a positive step and nice to see it hit the front page.
https://www.reddit.com/r/ROCm/comments/1i5aatx/rocm_feedback...
https://www.reddit.com/user/powderluv/