The race to build a distributed GPU runtime
58 comments
September 4, 2025
RachelF
I find it odd that given the billions of dollars involved, no competitor has managed to replicate the functions of CUDA.
Is it that hard to do, or is the software lock-in so great?
JonChesterfield
I'm pretty sure it's a political limitation, not a technical one. Implementing it is definitely a pain - it's a mix of hardcore backwards compatibility (i.e. cruft) and a rapidly moving target - but it's also obviously just a lot of carefully chosen ascii written down in text files.
The non-Nvidia hardware vendors really don't want CUDA to win. AMD went for open source + collaborative in a big way, OpenCL then HSA. Both were broadly ignored. I'm not sure what Intel is playing at with SPIR-V - that stack doesn't make any sense to me whatsoever.
CUDA is alright though, in a kind of crufty-obfuscation-over-SSA sense. Way less annoying than OpenCL, certainly. You can run it on amdgpu hardware if you want to - https://docs.scale-lang.com/stable/ and https://github.com/vosen/ZLUDA already exist. I'm hacking on SCALE these days.
ronsor
The problem is that CUDA is tightly integrated with NVIDIA hardware. You don't just have to replicate CUDA (which is a lot of tedious work at best); you also need the hardware to run your "I can't believe it's not CUDA" on.
triknomeister
It is hard to do in the sense that it requires very good taste in programming languages, which in turn requires really listening to the customers, and that requires a huge number of skilled people. And no one has really invested that much money into their software ecosystem yet.
pjmlp
Because most fail to understand what makes CUDA great, and keep trying to replicate only the C++ API.
They overlook that CUDA is a polyglot ecosystem with C, C++ and Fortran as its main languages, plus a Python JIT DSL since this year, compiler infrastructure for any compiler that wishes to target it (of which there are a few, including strange stuff like Haskell), IDE integration with Eclipse and Visual Studio, and graphical debugging just like on the CPU.
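For concreteness, here is a minimal sketch of the Python side using Numba's long-standing CUDA JIT - my own illustration, not necessarily the new NVIDIA DSL mentioned above, and the kernel and sizes are made up:

    # SAXPY as a CUDA kernel written in Python via Numba's JIT.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def saxpy(a, x, y, out):
        i = cuda.grid(1)              # global thread index
        if i < out.size:
            out[i] = a * x[i] + y[i]

    n = 1 << 20
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)
    out = np.zeros_like(x)

    threads = 256
    blocks = (n + threads - 1) // threads
    saxpy[blocks, threads](np.float32(2.0), x, y, out)  # Numba handles host/device copies here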
It is like when Khronos puts out those spaghetti-riddled standards, expecting each vendor or open source community to create some kind of SDK, versus the vertical integration of console devkits and proprietary APIs, and then asks why professional studios have no qualms with proprietary tooling.
melodyogonna
CUDA is so many things that I'm not sure it is even possible to replicate it.
joe_the_user
CUDA does involve a massive investment for Nvidia. It's not that it's impossible to replicate the functionality. But once a company has replicated that functionality, that company is basically going to be selling at competitive prices, which isn't a formula for high profits.
Notably, AMD funded a CUDA clone, ZLUDA, and then quashed it[1]. Comments at the time here involved a lot of "they would always be playing catch up".
I think the mentality of chip makers generally is that they'd rather control a small slice of a market than fight competitively for a large slice. It makes sense in that they invest years in advance and expect those investments to pay high profits.
[1] https://www.tomshardware.com/pc-components/gpus/amd-asks-dev...
noosphr
CUDA isn't a massive investment; it's 20 years' worth of institutional knowledge with a stable external API. There are very few companies outside of 00s Microsoft that have managed to support 20 years of backward compatibility along with the bleeding edge.
Izikiel43
> CUDA isn't a massive investment
> it's 20 years' worth of institutional knowledge with a stable external API
> There are very few companies outside of 00s Microsoft that have managed to support 20 years of backward compatibility along with the bleeding edge.
To me, that sounds like a massive investment.
triknomeister
ZLUDA was quashed due to concerns about infringement/violating terms of use.
shmerl
A better question is why there is no stronger push for a nicer GPU language that's not tied to any particular GPU and serves any GPU use case (whether compute or graphics).
I mean efforts like rust-gpu: https://github.com/Rust-GPU/rust-gpu/
Combine such a language with Vulkan (using Rust as well), and why would you need CUDA?
fnands
Mojo might be what you are looking for: https://docs.modular.com/mojo/manual/gpu/intro-tutorial/
The language is general, but the current focus is really on programming GPUs.
bee_rider
I think Intel Fortran has some ability to offload to their GPUs now. And Nvidia has some stuff to run(?) CUDA from Fortran.
Probably just needs a couple short decades of refinement…
pjmlp
One of the reasons CUDA won over OpenCL was that Nvidia, contrary to Khronos, saw value in helping those HPC researchers move their Fortran code onto the GPU.
Hence they bought PGI, and improved their compiler.
Intel eventually did the same with oneAPI (which isn't plain OpenCL, but rather an extension with Intel goodies).
I was on a Khronos webinar where the panel showed disbelief that anyone would care about Fortran, oh well.
JonChesterfield
I like Julia for this. Pretty language, layered on LLVM like most things. Modular are doing interesting things with Mojo too. People seem to like CUDA though.
shmerl
CUDA is just DOA as a nice language, being Nvidia-only (not counting efforts like ZLUDA).
smj-edison
This reminds me a lot of Seymour Cray's two maxims of supercomputing: get the data where it needs to be when it needs to be there, and get the heat out. Still seems to apply today!
CamperBob2
Calls to mind his other famous quote, "Would you rather plow your field with two strong oxen or 1024 chickens?"
How about ten billion chickens?
smj-edison
Yeah. I feel like he's still partially vindicated by things like the dragonfly topology, as a lot of problems don't map nicely onto a 2D or 3D topology (so longest distance is still the limiting factor). But the chicken approach certainly scales better, and I feel like since Cray's time there are more locality-aware algorithms around.
anonymousDan
Can someone tell me if the challenges the article describes and indeed the frameworks they mention are mostly relevant for training or also for inference?
benreesman
The fast interconnect between nodes has applications in inference at scale (big KV caches and other semi-durable state, multi-node tensor parallelism on mega models).
But this article in particular is emphasizing extreme performance ambitions for columnar data processing with hardware acceleration. That is relevant to many ML training scenarios, but also to other kinds of massive MapReduce-style (or at least MapReduce-scale) workloads. There are lots of applications for a "magic massive petabyte-plus DataFrame" (which I don't think is solved in the general case).
JonChesterfield
The underlying problem here is real and legitimately difficult. Shunting data around a cluster (ideally as parts of it fall over) to minimise overall time, in an application independent fashion, is a definable dataflow problem and also a serious discrete optimisation challenge. The more compute you spend on trying to work out where to move the data around, the less you have left over for the application. Also tricky working out what the data access patterns seem to be. Very like choosing how much of the runtime budget to spend on a JIT compiler.
This _should_ break down as: running already carefully optimised programs on their runtime makes things worse, and running less-carefully-structured ones makes things better - and many programs out there turn out to be either quite naive or obsessively optimised for an architecture that hasn't existed for decades. I'd expect this runtime to be difficult to build but with high value on success. Interesting project, thanks for posting it.
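To make the trade-off concrete, here is a toy sketch (entirely my own, nothing to do with Theseus's actual scheduler) of the kind of placement decision such a runtime makes constantly - and the decision logic itself burns compute you could have spent on the application. Bandwidths, sizes and FLOP counts are invented:

    # Toy placement decision: run a task where (transfer time + compute time) is cheapest.
    def pick_node(input_location, input_bytes, node_flops, work_flops, link_gbps=100):
        # node_flops: node -> effective FLOP/s left over for the application
        best, best_cost = None, float("inf")
        for node, flops in node_flops.items():
            transfer_s = 0.0 if node == input_location else (input_bytes * 8) / (link_gbps * 1e9)
            cost = transfer_s + work_flops / flops
            if cost < best_cost:
                best, best_cost = node, cost
        return best, best_cost

    # Moving 10 GiB off node-a only pays if the destination is enough faster (or idler).
    node, cost = pick_node("node-a", 10 * 2**30,
                           {"node-a": 5e12, "node-b": 20e12}, work_flops=1e14)
    print(node, round(cost, 2))  # node-b wins here despite the transfer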
KaiserPro
One thing that's not addressed here is that the bigger you scale your shared memory cluster, the closer you get to a 100% chance that one node fucks up and corrupts your global memory space.
Currently the fastest way to get data from node A to node B is to RDMA it, which means that any node can inject anything into your memory space.
I'm not really sure how Theseus guards against that.
buildbot
I’m not sure any system prevents RDMA from ruining your day :(
Back in grad school I remember we did something fairly simple but clearly illegal and wedged the machine so bad the out of band management also went down!
KaiserPro
> wedged the machine so bad the out of band management also went down!
Now thats living the dream of a shared cluster!
This is hazy now, but I do remember a massive outage of a Lustre cluster, which I think was because there was a dodgy node injecting crap into everyone's memory space via the old Lustre fast filesystem kernel driver. I think they switched to NFS export nodes after that (for the render farm and desktops, at least).
tucnak
I don't understand - doesn't kauldron[0] already exist?
up2isomorphism
As of today, GPUs are just too expensive for data processing. The direction they took makes it a very hard sell.
jauntywundrkind
Lot of hype, but man does Voltron Data keep blowing me away with what they bring out. Mad respect.
> There’s a strong argument to be made that RAPIDS cuDF/RAPIDS libcudf drives NVIDIA’s CUDA-X Data Processing stack, from ETL (NVTabular) and SQL (BlazingSQL) to MLOps/security (Morpheus) and Spark acceleration (cuDF-Java).
Yeah this seems like the core indeed, libcudf.
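For anyone who hasn't touched it, the libcudf layer from Python is basically pandas on the GPU. A rough sketch (file name and columns are made up; the point is the shape of the API, not the data):

    # cuDF mirrors the pandas API but executes on the GPU via libcudf.
    import cudf

    df = cudf.read_parquet("events.parquet")           # decode straight into GPU memory
    out = (df[df["amount"] > 0]
             .groupby("customer_id")
             .agg({"amount": "sum", "event_id": "count"}))
    print(out.head())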
The focus here is on TCP & GPUDirect (Nvidia's PCIe P2P, letting RDMA happen without CPU involvement across a full GPU -> NIC -> switch -> NIC -> GPU path).
Personally it feels super dangerous to just trust Nvidia on all of this and buy whatever solution is on offer. PyTorch sees this somewhat: it adopted and took over Facebook/Meta's gloo project, which wraps a lot of the RDMA efforts. But man, Theseus is so many steps ahead here in figuring out and planning what to do with these capabilities, these ultra-efficient links - and figuring out how not to need them where possible! The coordination problems keep growing in computing. I think of RISC-V with its vector-based alternative to conventional x86 SIMD, going from a specific instruction for each particular operation to instructions parameterized over different data lengths and types. https://github.com/pytorch/gloo
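As a tiny example of that gloo layer (my own sketch - the localhost address, port and tensor are placeholders), this is roughly what PyTorch exposes over it:

    # All-reduce across two processes using the gloo backend PyTorch adopted.
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        dist.init_process_group("gloo", rank=rank, world_size=world_size,
                                init_method="tcp://127.0.0.1:29500")
        t = torch.ones(4) * rank
        dist.all_reduce(t, op=dist.ReduceOp.SUM)   # every rank ends up with the sum
        print(rank, t)
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, args=(2,), nprocs=2)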
I'd really like to see a concerted effort around Ultra Ethernet emerge, fast. Hardware isn't really available yet, and it's going to start out being absurdly expensive. But Ultra Ethernet looks like a lovely mix of collision-less, credit-based InfiniBand RDMA and Ethernet, with lots of other niceties (transport security). Deployments are just starting (AMD Pensando Pollara 400 at Oracle). I worry that without broader availability and interest, without mass saturation, AI is going to stay stuck on libcudf forever; getting hardware out there and getting software stacks using it is a chicken-and-egg problem, and big players need to spend real effort accelerating UET or else. https://www.tomshardware.com/networking/amd-deploys-its-firs...
latchkey
Our MI300x boxes have had 8x400G Thor2 RDMA working great for a year now.
danielz4tp5
[dead]
varelse
[dead]