How to Run DeepSeek R1 Distilled Reasoning Models on RyzenAI and Radeon GPUs
February 2, 2025
larntz
I wrote a similar post about a week ago, but for an [unsupported] Radeon RX 5500 with 4 GiB of VRAM, using Ollama on Fedora 41. It can only run llama3.2 or deepseek-r1:1.5b, but those are pretty usable if you're okay with a small model and it's for personal use.
I didn't go into detail about how to set up Open WebUI, but there is documentation for that on the project's site.
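For anyone who wants the short version, it's roughly this (copied from the Open WebUI docs from memory, so treat the port, volume name and image tag as their suggested defaults rather than gospel):

  # run Open WebUI in a container and let it reach the host's Ollama on port 11434
  docker run -d -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -v open-webui:/app/backend/data \
    --name open-webui --restart always \
    ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 and pick whatever model you've pulled with ollama.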
heavyset_go
As an aside, either the latest Linux kernel or the upcoming 6.14 release has (or will have) support for the Ryzen XDNA AI engines on AMD's mobile APUs.
Might not be appropriate for this model, but it could be for small models.
ekianjo
any idea how they will appear to the OS? As additional processors?
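My guess, if the amdxdna driver works like the other DRM accel drivers, is that they show up as accelerator device nodes rather than extra cores; something along these lines should tell (untested, and the device names are just what I'd expect):

  lsmod | grep amdxdna                  # is the NPU driver loaded at all?
  ls /dev/accel/                        # accel0, accel1, ... if an accelerator is exposed
  lspci | grep -i 'signal processing'   # the NPU itself is a PCIe function, not a CPU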
shosca
In my case, with a 6900 XT:
1. sudo pacman -S ollama-rocm
2. ollama serve
3. ollama run deepseek-r1:32b
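And a couple of optional sanity checks, assuming rocm-smi got pulled in with the ROCm packages:

  rocm-smi              # the card, its VRAM use and clocks should show up here
  watch -n 1 rocm-smi   # nice to leave running while a model generates, to see the GPU actually working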
larntz
Does that entire model fit in GPU memory? How's it run?
I tried running a model larger than VRAM and it loads some layers onto the GPU but offloads the rest to the CPU. It's faster than CPU alone for me, but not by a lot.
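The easiest way I've found to see the split, assuming ollama ps still prints the processor column, is:

  # while a model is loaded, the PROCESSOR column reads "100% GPU" when
  # everything fits, or a CPU/GPU percentage split when it doesn't
  ollama ps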
shosca
You're right. I noticed the GPU clocking up and down with the 32b model; the 14b clocks up fully and actually runs faster.
heavyset_go
Nice, last time I tried out ROCm on Arch a few years ago it was a nightmare. Glad to see it's just one package install away these days, assuming you didn't do any setup beforehand.
I have a Radeon 7900 XTX 24GB and have been using deepseek-r1:14b for a couple of days. It achieves about 45 tokens/s. Only after reading this article did I realize that the 32B model would also fit entirely (23GB used). And since Ollama [0] was already installed, it was as easy as running: ollama run deepseek-r1:32b
The 32B model achieves about 25 tokens/s, which is faster than I can read. However, the "thinking" phase is mostly lower-quality overhead, taking roughly 1-4 minutes before the solution/answer appears.
You can view the model performance within ollama using the command: /set verbose
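For example (the stat names are from memory, so roughly):

  ollama run deepseek-r1:32b
  >>> /set verbose
  >>> <your prompt here>
  # after each response it prints timing stats; the "eval rate" line is the
  # generation speed in tokens/s, which is where the ~25 tokens/s above comes from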
[0] https://github.com/ollama/ollama