Apple's MLX adding CUDA support
75 comments · July 14, 2025
nxobject
If you're going "wait, no Apple platform has first-party CUDA support!", note that this set of patches also adds support for "Linux [platforms] with CUDA 12 and SM 7.0 (Volta) and up".
paulirish
It's coming from zcbenz, who created Electron among other things: https://zcbenz.com/ Nice.
zdw
How does this work when one of the key features of MLX is using a unified memory architecture? (see bullets on repo readme: https://github.com/ml-explore/mlx )
I would think that bringing that to all UMA APUs (of any vendor) would be interesting, but discrete GPUs would definitely need a different approach?
edit: reading the PR comments, it appears that CUDA supports a UMA API directly, and will transparently copy as needed.
numpad0
> This PR is an ongoing effort to add a CUDA backend to MLX
looks like it allows MLX code to compile and run on x86 + GeForce hardware, not the other way around.
sciencesama
Apple is planning to build data centers with M-series chips for app development, testing, and hosting external services!
benreesman
I wonder how much this is a result of Strix Halo. I had a fairly standard stipend for a work computer that I didn't end up using for a while, so I recently cashed it in on the EVO-X2 and, fuck me sideways, that thing is easily competitive with the mid-range znver5 EPYC machines I run substitors on. It mops the floor with any mere-mortal EC2 or GCE instance; maybe something like an r1337.xxxxlarge.metal.metal has an edge, but it blows away the z1d.metal and the c6.2xlarge or whatever (fast cores, good NIC, table stakes). And those things are 3-10K a month with heavy provisioned IOPS. This thing has real NVMe and it cost $1800.
I haven't done much local inference on it, but various YouTubers are starting to call the DGX Spark overkill / overpriced next to Strix Halo. The catch of course is ROCm isn't there yet (they're seeming serious now though, matter of time).
Flawless CUDA on Apple gear would make it really tempting in a way that isn't true with Strix so cheap and good.
hamandcheese
For the uninitiated, Strix Halo is the same thing as the AMD Ryzen AI Max+ 395, which will be in the Framework Desktop and is starting to show up in some mini PCs as well.
The memory bandwidth on that thing is 200GB/s. That's great compared to most other consumer-level x86 platforms, but quite far off of an Nvidia GPU (a 5090 has 1792GB/s, dunno about the pro level cards) or even Apple's best (M3 Ultra has 800GB/s).
It certainly seems like a great value. But for memory bandwidth intensive applications like LLMs, it is just barely entering the realm of "good enough".
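A back-of-envelope sketch of why that matters (the memory-bound assumption and the 4-bit 70B figure are my own illustration, not from the thread): single-stream LLM decoding has to stream roughly all the weights for every token, so peak bandwidth divided by weight size gives an upper bound on tokens per second.

    # Rough upper bound for memory-bandwidth-bound decoding:
    # tokens/s <= bandwidth / bytes read per token, where bytes per token is
    # roughly the size of the (quantized) weights. 35 GB is an assumed stand-in
    # for a ~70B-parameter model at ~4-bit quantization.
    weights_gb = 35
    for name, bw_gbps in [("Strix Halo", 200), ("M3 Ultra", 800), ("RTX 5090", 1792)]:
        print(f"{name}: <= ~{bw_gbps / weights_gb:.0f} tokens/s")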
Rohansi
You're comparing theoretical maximum memory bandwidth. It's not enough to only look at memory bandwidth, because you're a lot more likely to be compute limited when you have a lot of memory bandwidth available. For example, the M1 had so much bandwidth available that it couldn't make use of it all even when fully loaded.
yieldcrv
Apple is just being stupid, handicapping their own hardware so they can sell the fixed one next year or the year after
This time-tested Apple strategy is now undermining their AI strategy and potential competitiveness.
tl;dr they could have done 1600GB/s
saagarjha
They could have shipped a B200 too. Obviously there are reasons they don't do that.
nl
> The catch of course is ROCm isn't there yet (they're seeming serious now though, matter of time).
Competitive AMD GPU neural compute has been "any day now" for at least 10 years.
bigyabai
The inference side is fine, nowadays. llama.cpp has had a GPU-agnostic Vulkan backend for a while, it's the training side that tends to be a sticking point for consumer GPUs.
jitl
It’s pretty explicitly targeting cloud cluster training in the PR description.
attentive
how is it vs m4 mac mini?
albertzeyer
This is exciting. So this is using CUDA's unified memory? I wonder how well that works. Is the behavior of unified memory in CUDA actually the same as on Apple silicon? On Apple silicon, as I understand it, the memory is shared between GPU and CPU anyway. But for CUDA, this is not the case. So when you have some tensor on the CPU, how does it end up on the GPU? This needs a copy somehow. Or is this all hidden by CUDA?
zcbenz
In the absence of hardware unified memory, CUDA will automatically copy data between CPU/GPU when there are page faults.
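Roughly what that looks like from Python, as a sketch using CuPy's managed-memory allocator rather than MLX's actual backend (the CuPy calls are my illustration, not from the PR):

    import cupy as cp

    # Route allocations through cudaMallocManaged (CUDA unified/managed memory).
    cp.cuda.set_allocator(cp.cuda.malloc_managed)

    x = cp.arange(1 << 20, dtype=cp.float32)  # allocated in managed memory
    y = (x * 2).sum()   # first GPU touch: the driver migrates pages to the device on fault
    print(float(y))     # CPU access migrates data back as needed; no explicit cudaMemcpy in user code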
nickysielicki
nit: it doesn’t really copy data - it’s more about migrating pages between memory spaces. The core problem is that the GPU’s BAR aperture isn’t wide enough to expose all device memory, leading to an overcommitted mapping situation.
The driver manages collections of contiguous pages as units for BAR aperture mapping. Since the GPU’s BAR window is significantly smaller than total device memory (typically 256MB-4GB vs 80GB-140GB of VRAM), the driver maintains these memory segments to track which page ranges are currently mapped through the aperture. They may or may not correspond whatsoever to the mappings as seen from userspace.
When the GPU attempts to access memory that’s not currently mapped, the driver’s fault handler intervenes, unmaps an existing memory segment from the BAR window, and maps in the segment containing the required pages. It looks more like a software-managed TLB for the BAR aperture, and less like traditional page faulting.
This segmented approach reduces mapping overhead since you’re managing larger chunks rather than individual 4KB pages. It’s a delicate balance because if you work in too big of chunks, you make the thrashing worse. This entire remapping process under the hood can be really expensive.
eg: 2P Threadripper and EPYC systems. Every BAR remapping triggers TLB shootdowns that must propagate across the infinity fabric to invalidate stale entries, on both sockets. Since these systems all lack DDIO, the GPU cannot directly access the CPU’s LLC, so every memory access requires a round trip through main memory. This creates a vicious cycle where frequent BAR thrashing causes continuous cross-socket TLB invalidations, each one stalling cores across the entire NUMA topology while you wait on IPIs, which block reads on the GPU of anything in host memory. The lack of cache coherency between GPU and CPU memory hierarchies means you’re consuming bandwidth on both the PCIe bus and memory controllers, while cores (host and device) remain idle waiting for TLB flushes to complete.
It’s an absolute nightmare to debug and profile, because almost all of this^ is invisible to your standard tools, especially if you aren’t running bare metal and can’t use Zen-specific debuggers and perf counters.
saagarjha
This seems like it would be slow…
fenced_load
There is also NVLink-C2C support between Nvidia's CPUs and GPUs that doesn't require any copy; CPUs and GPUs directly access each other's memory over a coherent bus. IIRC, they already have 4 CPU + 4 GPU servers available.
benreesman
Yeah, NCCL is a whole world and it's not even the only thing involved, but IIRC that's the difference between 8xH100 PCIe and 8xH100 SXM.
MBCook
This is my guess, but does the higher-end hardware they sell, like the server-rack stuff for AI, perhaps have unified memory?
I know standard GPUs don’t.
The patch suggested one of the reasons for it was to make it easy to develop on a Mac and run on a supercomputer. So the hardware with the unified memory might be in that class.
ajuhasz
The Jetsons[1] have unified memory[2].
[1] https://www.nvidia.com/en-us/autonomous-machines/embedded-sy... [2] https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s...
tonyarkles
They sure do and it's pretty amazing. One iteration of a vision system I worked on got frames from a camera over a Mellanox NIC that supports RDMA (Rivermax), preprocessed the images using CUDA, did inference on them with TensorRT, and the first time a single byte of the inference pipeline hit the CPU itself was when we were consuming the output.
patrickkrusiec
The physical memory is not unified, but on modern rack-scale Nvidia systems, like Grace Hopper or NVL72, the CPU and the GPU(s) share the same virtual address space and have non-uniform memory access to each other's memory.
Y_Y
The servers don't, but the Jetsons do
orliesaurus
Why is this a big deal, can anyone explain if they are familiar with the space?
elpakal
> NVIDIA hardware is widely used for academic and massive computations. Being able to write/test code locally on a Mac and then deploy to super computers would make a good developer experience.
That one stands out to me as a mac user.
radicaldreamer
MacBooks used to use Nvidia GPUs, then Apple had a falling out with Nvidia and the beef stands to this day (Apple didn’t use Nvidia hardware when training its own LLMs for Apple Intelligence).
I wouldn’t be surprised if within the next few years we see a return of Nvidia hardware to the Mac, probably starting with low-volume products like the Mac Pro, strictly for professional/high-end use cases.
fooker
> Apple didn’t use Nvidia hardware when training its own LLMs for Apple Intelligence
Do you have some links for this?
MuffinFlavored
Is this for Macs with NVIDIA cards in them, or Apple Metal/Apple Silicon speaking CUDA? I can't really tell.
Edit: looks like it's "write once, use everywhere". Write MLX, run it on Linux + CUDA or on Apple Silicon/Metal.
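A minimal sketch of what that could look like in MLX's Python API (that the exact same script will run unchanged under the new CUDA backend is an assumption on my part; the PR is still in progress):

    import mlx.core as mx

    # The same script targets Apple Silicon's Metal backend today and, once the
    # CUDA backend lands, a Linux + NVIDIA box, with MLX picking the default device.
    a = mx.random.normal((4096, 4096))
    b = a @ a.T
    mx.eval(b)  # MLX is lazy; eval() forces the computation on the default device
    print(b.shape, mx.default_device())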
MBCook
Seems you already found the answer.
I’ll note Apple hasn’t shipped an Nvidia card in a very, very long time. Even on the Mac Pros before Apple Silicon, they only ever sold AMD cards.
My understanding from rumors is that they had a falling out over the problems with the dual GPU MacBook Pros and the quality of drivers.
I have no idea if sticking one in on the PCIe bus would let you use it for AI stuff, though.
VladVladikoff
Won’t work. No driver support.
xuki
That particular MBP model had a high rate of GPU failure because it ran too hot.
I imagined the convo between Steve Jobs and Jensen Huang went like this:
S: your GPU is shit
J: your thermal design is shit
S: f u
J: f u too
Apple is the kind of company that holds a grudge for a very long time; their relationships with suppliers are very one-way, their way or the highway.
sciencesama
And the same is true of Nvidia too.
rcruzeiro
I think the ones that failed were the AMD ones, specifically in the old 17-inch MacBook Pro.
bobmcnamara
S: omg so thin!!1!1!!l!
kmeisthax
On Apple Silicon, writing to memory on a PCIe / Thunderbolt device will generate an exception. ARM spec says you're allowed to write to devices as if they were memory but Apple enforces that all writes to external devices go through a device memory mapping[0]. This makes using an external GPU on Apple Silicon[1] way more of a pain in the ass, if not impossible. AFAIK nobody's managed to write an eGPU driver for Apple Silicon, even with Asahi.
[0] https://developer.arm.com/documentation/102376/0200/Device-m...
[1] Raspberry Pi 4's PCIe has the same problem AFAIK
bobmcnamara
Ewww, that kills out-of-order CPU performance. If it's like ARMv7, it effectively turns each same-page access into its own ordering barrier.
hbcondo714
> "write once, use everywhere"
So my MLX workloads can soon be offloaded to the cloud!?
dkga
This is the only strategy humble me can see working for CUDA in MLX
cowsandmilk
Neither, it is for Linux computers with NVIDIA cards
Keyframe
Now do linux support / drivers for Mac hardware!
bigyabai
I think we're seeing the twilight of those efforts. Asahi Linux was an absolute powerhouse of reverse-engineering prowess, and it took years to get decent Vulkan coverage and half of the modern lineup's GPUs supported. Meanwhile AMD and even Intel are shipping Vulkan 1.3 drivers day-one on new hardware. It's a cool enthusiast effort to extend the longevity of the hardware, but it bears repeating; nobody is disrupting Nvidia's bottom-line here. Apple doesn't sell hardware competitive with Nvidia's datacenter hardware, and even if they did it's not supported by the community. It's doubtful that Apple would make any attempt to help them.
There seems to be a pervasive assumption that Apple is still making a VolksComputer in 2025, blithely supporting a freer status quo for computing. They laid out their priorities completely with Apple Silicon: you're either on Apple's side or falling behind. Just the way they want it.
lvl155
Seriously. Those Apple guys became delusional, especially after Jobs passed away. These guys just sat on their successes and did nothing for a decade plus. M1 was nice, but that was all Jobs' doing and planning. I don’t like this Apple. They forgot how to innovate.
But I guess we have a VR device nobody wants.
marcellus23
> M1 was nice but that was all Jobs doing and planning
M1 was launched 9 years after Jobs died. You're saying they had everything ready to go back then and just sat on their asses for a decade?
lvl155
Who bought P.A. Semi? Jobs knew they had to make their own. M1 is just a product of their iPhone chips, hence all the efficiency.
jjtheblunt
It would be funny if you were typing out your response on an iPhone that has been running for 36 hours without recharging.
macinjosh
if only their batteries would last that long.
teaearlgraycold
I wonder if Jensen is scared. If this opens up the door to other implementations this could be a real threat to Nvidia. CUDA on AMD, CUDA on Intel, etc. Might we see actual competition?
jsight
I think this is the other way around. It won't be CUDA on anything except for Nvidia.
However, this might make mlx into a much stronger competitor for Pytorch.
mayli
Yeah, nice to have MLX-opencl or MLX-amd-whatever
baby_souffle
If you implement compatible apis, are you prohibited from calling it cuda?
15155
Considering 100% of the low-level CUDA API headers have the word "CUDA" in them, this would be interesting to know.
moralestapia
I'm sure I saw this lawsuit somewhere ...
The gist is that the API specification in itself is copyrighted, so it would be copyright infringement then.
teaearlgraycold
Oh bummer. Almost got excited.
tekacs
This instance is the other way around, but that's what this is – CUDA on AMD (or other platforms): https://docs.scale-lang.com/stable/
almostgotcaught
> CUDA backend
backend
So to make sure I understand, this would mean:
1. Programs built against MLX -> Can take advantage of CUDA-enabled chips
but not:
2. CUDA programs -> Can now run on Apple Silicon.
Because #2 would be a copyright violation (specifically with respect to Nvidia's famous moat).
Is this correct?