Exo: Exocompilation for productive programming of hardware accelerators

alex7o

Halide does something similar, but uses C++ for its DSL: https://halide-lang.org/

gotoeleven

The Exo docs mention Halide as an example of a language similar to Exo, but one that is "lowering-based", while Exo is "rewrite-based." This seems to mean that Halide is more of a DSL where what you want is specified at a higher level, while Exo is a set of transformations you can apply to an existing kernel of code (though at least in the examples the kernel is written in Python and C is somehow generated from it after the transformations are applied). Do you know what the relative strengths of these two approaches are?

Also, are there any languages in this vein that take more of a declarative (as opposed to imperative) approach?

alex7o

I mostly have a passing interest in the topic, but from what I understand Halide first lowers to an intermediate language and then does the scheduling and optimisation, while Exo does the optimisations directly on the Python code. Halide also needs you to tell it how to lower the code and how to do the scheduling, which is something Exo can determine itself.

DiabloD3

Neat idea, but why is this implemented in Python instead of a more suitable language? Also, there are many DSLs and APIs already for this, some even supported by industries and provided as part of the SDKs for various hardware blobs, so why Exo?

Like, I hate SYCL, Kokkos, and all the other similar things as much as the next guy, but major companies and orgs support and use those already.

erdaniels

I'm not the target audience, but the GitHub page and the website's getting-started page both feel so poorly explained. What the hell is a schedule?

gnabgib

This MIT article covers it a bit more (with a slightly too generic title): "High-performance computing, with much less code" https://news.mit.edu/2025/high-performance-computing-with-mu... (https://news.ycombinator.com/item?id=43357091)

EVa5I7bHFq9mnYK

Word "schedule" is already taken for thread scheduling in the kernel, so reuse of that word is confusing. This is a code generator that operates on nested loops - allows to reorder and split them, replace instructions etc. All to maximize performance.

ajb

A schedule is the order in which machine instructions get executed.

So, I've done this professionally (written assembler code, and then scheduled it manually to improve performance). Normally you don't need to do that these days, as even mobile CPUs use out-of-order cores which dynamically schedule at runtime.

It's only going to be useful if you're writing code for some machine that doesn't do that (they give TPUs etc. as examples).

almostgotcaught

> out-of-order cores which dynamically schedule at runtime.

OOO architectures don't reschedule dynamically - that's impossible - they just have multiple instruction buffers from which they can issue instructions. So scheduling is still important for OOO; it's just at the level of the DDG (data dependence graph) rather than the literal linear order in the binary.
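A toy sketch of that distinction in plain Python (no performance claim, purely to illustrate what "same DDG, different linear order" means): two independent dependence chains can be laid out back-to-back or interleaved, and the result and the DDG are identical either way.

    def back_to_back(x, y):
        # Order A: run chain a to completion, then chain b.
        a = x * 3   # a0
        a = a + 1   # a1 (depends on a0)
        a = a * a   # a2 (depends on a1)
        b = y * 5   # b0
        b = b + 2   # b1 (depends on b0)
        b = b * b   # b2 (depends on b1)
        return a + b

    def interleaved(x, y):
        # Order B: interleave the two chains. Same DDG, same result; on an
        # in-order machine this hides latency, while an out-of-order core
        # finds the same overlap on its own from either linear order.
        a = x * 3   # a0
        b = y * 5   # b0
        a = a + 1   # a1
        b = b + 2   # b1
        a = a * a   # a2
        b = b * b   # b2
        return a + b

    assert back_to_back(7, 9) == interleaved(7, 9)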

Edit: just want to emphasize

> It's only going to be useful if you're writing code for some machine that doesn't do that

There is no architecture for which instruction scheduling isn't crucial.

achierius

> There is no architecture for which instruction scheduling isn't crucial.

In my experience doing back-end compiler work, it's definitely last on the list of major concerns. Obviously you can't ignore it, but it's not where any significant gains are coming from, and anything 'fancy' you do there is likely to be picked up by future generations of hardware.

ajb

If you're talking about modifying the DDG, I would not call that scheduling, because then you need to do serious work to prove that your code is actually doing the same thing. But I haven't spent a lot of time in the compiler world, so maybe they do call it that. Perhaps you could give your definition?

imtringued

lostmsu

Agreed. It looks like, if you need to optimize, it would be much easier to just modify the code directly. The result will also be more readable and therefore easier to support in the future.

stanleykm

You guys might want to explain what this is a little better. My first question was “what do they mean by ‘hardware accelerators’?”

It wasn't until I saw the matmul example that I realized this is (probably - it's still unclear) for GPUs.

Netcob

Probably any hardware that requires you to use some intermediate code. Nowadays probably some TPU or NPU or whatever.

This made me think of what "accelerators" I've come across (just as a regular consumer):

In the late 90s, that's what we called GPUs - "3D accelerators". For a short time, they were cards separate from the actual graphics card, and often would involve routing your VGA cable through that thing. Before it all merged into one device. I was very slightly disappointed as a kid that I narrowly missed that time, but bilinear filtering and higher resolutions and framerates helped me get over the fact I couldn't cram more cool weird tech into my PCI slots.

Then you had sound cards with "audio accelerators" using technologies like EAX. All that eventually migrated back to the CPU I think.

For a while you could buy a "physics accelerator" for PhysX, which was then acquired by Nvidia and moved to the GPU using CUDA. I never had one, but at one point I kept an older GPU around after upgrading as a dedicated PhysX processor. Now that's the only way to run older 32-bit games with PhysX turned up, since that's not supported on 5000-series GPUs.

Finally, I got this "Coral TPU", a USB device with 4GB RAM (and I think around 4 TOPS or something?) for very efficient inferencing. There are some open source projects supporting this, like frigate, which lets you process object detection with multiple surveillance camera streams on a raspberry pi. I never really used it though.

And of course now we have NPUs as sub-systems in CPUs and GPUs. I'd love to have a dedicated PCIe card with that, but of course having yet another computer architecture with dozens/hundreds of GB of redundant RAM is kind of a bummer.

cubefox

A bummer yes, but two NPUs, one on the CPU, another one on the GPU, that just sounds silly.

almostgotcaught

"[compilation] for productive programming of hardware accelerators"

But 93% of the codebase is Python lol. Whatever you think about Python, it is not a systems programming language. Takeaway: this is not a serious compiler project (and it's clearly not, it's a PhD student project).

Deeper take: this is just a DSL that behind the scenes calls an (SMT) solver for tuning (what they call "scheduling"). There are a million examples of this approach for every architecture under the sun. My org is literally building out the same thing right now. Performance is directly a function of how good your model of the architecture is. At such a high level it's very likely to produce suboptimal results because you have no control over ISA/intrinsic-level decisions. Lower-level implementations are much more robustly "powerful".

https://dl.acm.org/doi/10.1145/3332373
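If "calls a solver for tuning" sounds abstract, here's a toy of the flavor of thing I mean (this assumes the z3-solver package and is only an illustration of the general approach, not a description of Exo's internals): ask the solver for the largest tile size that divides the problem and whose working set fits in cache.

    from z3 import Int, Optimize, sat

    N = 1024                 # problem size
    L1_BYTES = 32 * 1024     # assumed L1 data cache
    t = Int("tile")

    opt = Optimize()
    opt.add(t >= 8, t <= N)
    opt.add(N % t == 0)                 # tile must evenly divide the loop
    opt.add(3 * t * t * 4 <= L1_BYTES)  # three f32 tiles must fit in L1
    opt.maximize(t)                     # prefer the largest legal tile

    if opt.check() == sat:
        print("tile size:", opt.model()[t].as_long())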

rscho

Well, this is clearly an attempt at abstracting away the kind of low-level stuff you describe. Perhaps it doesn't work (yet), but that shouldn't prevent people from trying? Involving an SMT solver suggests that the solver is doing the heavy lifting, not Python. PhDs often produce inapplicable stuff, but they are also the basis for industry/application R&D, such as what your org is doing... PhDs are the slaves of science. They make stuff happen for peanuts in return and deserve our respect for that, even if what happens is oftentimes a dead end. It's really sad seeing people shitting on PhDs.

pclmulqdq

Unfortunately, the comment you are responding to is more correct on this than I think you are. The Python complaint was stupid, though - a lot of high-performance code gets written in libraries like numpy (which call into C 99.99% of the time) or PyTorch (JIT-ed before executing) that keep Python out of the critical path.

The problem with this research is that many similar ideas have been tried before in the contexts of supercomputers or CPU compilers. They ultimately all fail because they end up (1) being more work to program in, and (2) not being any faster because real life happens. Networks drop packets, clock frequencies jitter, and all sorts of non-determinism happens when you have large scale. A static scheduler forces you to stall the whole program for any of these faults. All the gain you got by painstakingly optimizing things goes away.

PhD theses, in a best-case scenario, are the basis for new applications. Most of them amount to nothing. This one belongs on that pile. The sad part about that is that it isn't the student's fault that the professor sent them down a research direction that is guaranteed to amount to nothing of use. This one is on the professor.

fancyfredbot

I don't think the paper is about statically scheduled architectures. In fact, they mention it's for modern accelerators, which switch between threads dynamically rather than stalling. The scheduling being referred to seems to mean the order in which instructions are fed to a potentially dynamic scheduler, to enable efficient use of caches etc.

So I'm not sure you can dismiss it as a thesis which will amount to nothing on the basis that static scheduling is a bad idea!

I could easily have missed something though. It's not a particularly clear or succinct write-up and I have only read some of it. If it does say somewhere that it only works for strictly deterministic in-order architectures, can you please point out where?

fancyfredbot

Your take seems to contradict the article? You say SMT solvers give "no control over ISA/intrinsic level decisions" but their design.md says "user-defined scheduling operations can encapsulate common optimization patterns and hardware-specific transformations". Are they wrong about this? Can you explain why?

QuadmasterXLII

any sufficiently powerful compiler is going to run an interpreted language at compile time, and there’s no reason it can’t be Python instead of C++ template metaprograms or CMake

saagarjha

Maybe a takeaway that boils down to "how could anyone ever write a compiler in Python" is the wrong one to have.

k_bx

I spent 5 minutes browsing but still couldn't figure this out. Would it be able to target FPGAs?

skavi

Are the transforms one applies with this always correct (even if not necessarily optimal)? Or is it possible to introduce a bug which didn't exist in the original kernel?
