
Compiling LLMs into a MegaKernel: A path to low-latency inference

bytepoet

This is very cool. I enjoyed going through the writeup and GitHub README.

I was wondering if these same optimizations can be brought to bear on training as well, rather than only inference. I guess the challenge here is fusing backward computations with gradient communication.

I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across this paper that does exactly this:

FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667

zhihaojia

Thanks for reading the post and the GitHub README. Supporting training is definitely feasible, but the benefit may not be as significant as it is for low-latency inference, since training generally involves much larger kernels, making kernel launch overhead less significant.

Thanks for sharing the FlashDMoE work. Our next step is to support MoE models. Stay tuned!

bytepoet

Thanks for the input. That's very helpful to know.

I look forward to following Mirage's development.

ActorNightly

Personally, I think it's a bit of a waste to invest time in gradient-training optimizations. A lot of real-world training tasks involve values that are discrete in nature, which can't be trained with gradients.

kp1197

After working pretty closely with vLLM and SGLang over the past few months, this is EXACTLY what I had envisioned a successor project would look like: analyzing an operation dependency graph and then fusing operations (or, at a minimum, scheduling tasks more intelligently). Congrats to the team.

zhihaojia

Thanks a lot for your positive feedback! We believe that MPK can enhance existing LLM serving systems, especially for low-latency LLM serving. We are very excited about the opportunity to collaborate with others in this direction.

baq

Next step: compile straight to Verilog so I can buy some LLMs on AliExpress.

fc417fc802

Because training costs weren't high enough already, let's add mask costs on top.

More seriously, isn't that pretty much what all those AI hardware startups have already been doing for a while now?

bigcat12345678

https://riscv.org/blog/2021/02/hardware-description-language... That was one of the promising ideas before AI & GPUs came onto the scene. With CPUs stagnant, people naturally want to further optimize the middle layers of software and hardware.

But I suspect GPU-style parallel computing is going to dominate accelerated computing.

General-purpose CPUs are going to stay on as the little brain that orchestrates GPUs.

The idea of compiling software directly to hardware might never become mainstream.

baq

I'm thinking more like a pseudointellect over serial that you can attach a $3 ESP32 to. Since it's basically tokens in, tokens out, let's just cut out the unnecessary parts. It's like querying the cloud models, except it's your silicon that you personally soldered to the ESP, so nobody will break your home assistant with a system prompt update or a fine-tuning run.

mycall

> General-purpose CPUs are going to stay on as the little brain that orchestrates GPUs

That brings deterministic compute to the nondeterministic.

anitil

I mean... an LLM-in-a-box would actually be pretty neat! I have some air-gapped work coming up, and having something like that would be quite handy.

fc417fc802

Isn't that easily accomplished by setting up a local deployment and then yanking the network cable? Anything that can quickly run a capable LLM is going to be a pretty beefy box, though. More like an LLM in an expensive space heater.

flakiness

This project is from CMU. Hazy Research at Stanford talked about the megakernel too: https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles

Good to see the competition in this area.

(Edited) Here is a related paper covering the larger "Mirage" project, though it doesn't cover the "megakernel" approach: https://arxiv.org/abs/2405.05751

zhihaojia

This is the writer of the blog post. You are right that Stanford's work is a parallel effort. The main difference is that our focus is on compilation: making it easier to generate megakernels automatically.

zekrioca

And their focus is..?

sigbottle

Hazy Research also has ThunderKittens, a pretty cool library. There seems to be a lot of effort going into formalizing, pipelining, and dividing-and-conquering within the current NVIDIA GPU model to maximize GPU efficiency, and into writing compilers/DSLs for these things.

bdbenton5255

Certainly an important discovery for utilizing these models on scaled hardware. The approach could also be applied beyond LLMs to other types of neural networks, which would be an interesting space to explore.

zhihaojia

Thanks for the feedback! Yes, we believe the approach is general and applicable to other ML workloads.

tuananh

If you want to try it on a 5090, it's not supported yet:

> Support for modern GPU architectures. One of our next milestones is extending MPK to support next-generation architectures such as NVIDIA Blackwell. A major challenge lies in integrating warp specialization — a key optimization for newer GPUs — with MPK’s megakernel execution model.

zhihaojia

The task implementations used by MPK are currently optimized for A100. The Mirage compiler can generate task implementations for other architectures such as Hopper and Blackwell, but we haven't integrated everything together yet. This is at the very top of our todo list. Stay tuned!

liuliu

The Qwen 8B number, if verified, is very impressive. Much more practical than the previous megakernel one.

That being said, this design of one persistent kernel on each SM reminds me of Larrabee, and now I'm wondering what the world would look like if we had just taken the traditional process-thread-SIMD path rather than the CUDA path.

fxtentacle

Isn’t fusing ops at a fine-grained level also the core benefit of JAX over TensorFlow? How does this work compare to JAX?

zhihaojia

JAX's operator fusion (https://apxml.com/courses/advanced-jax/chapter-2-optimizing-...) can fuse a few local operators (e.g., matmul and elementwise computation) into a single kernel. But JAX's approach cannot fuse an entire LLM with hundreds of operators into a single kernel because many operators involve loop transformations.

MPK takes a different approach where instead of incrementally fusing local operators, it decomposes operators into a task graph and builds a runtime system within a single kernel to execute all tasks specified in the task graph.
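
To make the in-kernel runtime idea concrete, here is a minimal sketch of how a persistent megakernel can execute a task graph entirely on the GPU. The Task struct, run_task dispatcher, and counter-based scheduler below are hypothetical illustrations, not MPK's actual data structures, and the sketch assumes tasks are stored in topological order in device-visible memory:

    #define MAX_CONSUMERS 4

    struct Task {
        int   op;                        // which fused operator body to run
        int   num_deps;                  // producers that must finish before this task
        int   done_deps;                 // incremented by producers as they finish
        int   num_consumers;
        Task* consumers[MAX_CONSUMERS];  // downstream tasks to signal
    };

    // Dispatch on t->op to the fused operator bodies (omitted in this sketch).
    __device__ void run_task(Task* t) { }

    __global__ void megakernel(Task* tasks, int num_tasks, int* next_task) {
        __shared__ int my_task;
        while (true) {
            // One thread per block claims the next task index.
            if (threadIdx.x == 0) my_task = atomicAdd(next_task, 1);
            __syncthreads();
            if (my_task >= num_tasks) return;       // task graph fully executed

            Task* t = &tasks[my_task];
            // Wait until every producer of this task has signaled completion.
            if (threadIdx.x == 0)
                while (atomicAdd(&t->done_deps, 0) < t->num_deps) { /* spin */ }
            __syncthreads();

            run_task(t);                            // execute the fused operator body
            __syncthreads();
            __threadfence();                        // publish results before signaling

            // Tell downstream tasks that one of their producers is done.
            if (threadIdx.x == 0)
                for (int i = 0; i < t->num_consumers; ++i)
                    atomicAdd(&t->consumers[i]->done_deps, 1);
        }
    }

The point is that the kernel is launched exactly once, and scheduling, dependency tracking, and operator dispatch all happen on the GPU; a real implementation needs careful memory ordering and a smarter scheduler than this spin loop.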

skavi

Does anyone have an intuition on why this offers significant gains over CUDA Graphs? The CPU launch cost of a graph is tiny, which implies most of the work has been offloaded to the GPU's own scheduler. I'd expect that some I/O marshalling at kernel boundaries could be avoided with megakernels. Maybe some loop fusion? Are there any more interesting optimizations they enable?

saagarjha

> The CPU launch cost of a graph is tiny

Absolutely not; it’s comparable to the launch overhead of a kernel.

refulgentis

You've hit the nail on the head. The CPU launch cost of a pre-compiled CUDA graph is tiny.

CUDA Graphs are a huge step up from manually launching kernels, but they still treat kernels as monolithic, black-box operations. A megakernel erases the boundaries between those operations.

With CUDA Graphs, as in the example in the article, if you have Matmul -> AllReduce, the AllReduce kernel cannot start until the entire Matmul kernel has finished. The dependency is at the kernel level. With a megakernel, they break these ops into fine-grained "tasks" scheduled across SMs. An AllReduce task that needs data from the first slice of the Matmul can begin as soon as that slice is computed by a few SMs, while other SMs are still working on the rest of the Matmul. This fine-grained software pipelining and compute/communication overlap is simply not possible when the dependency unit is the entire kernel.
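
To illustrate the difference in dependency granularity, here is a sketch of how such a task graph could be wired up, reusing the hypothetical Task struct from the sketch above (again, not the project's actual API) and assuming the arrays live in device-visible memory:

    // Each AllReduce chunk depends only on the matmul tile that produces its
    // slice of the output, so communication can start while other tiles are
    // still being computed.
    void build_overlapped_graph(Task* matmul_tiles, Task* allreduce_chunks, int n) {
        for (int i = 0; i < n; ++i) {
            matmul_tiles[i].num_deps          = 0;                    // ready immediately
            matmul_tiles[i].done_deps         = 0;
            matmul_tiles[i].num_consumers     = 1;
            matmul_tiles[i].consumers[0]      = &allreduce_chunks[i]; // signal only "its" chunk
            allreduce_chunks[i].num_deps      = 1;                    // waits on one tile, not the whole matmul
            allreduce_chunks[i].done_deps     = 0;
            allreduce_chunks[i].num_consumers = 0;
        }
        // With kernel-level dependencies (e.g., CUDA Graphs), every AllReduce chunk
        // would effectively have num_deps == n: nothing starts until the whole matmul finishes.
    }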

qihqi

Probably should make this into a backend of torch.compile

zhihaojia

Yes, it would be a lot of fun if MPK could enable torch.compile to generate megakernels. Torch-generated kernels are currently too slow for latency-sensitive workloads.

olivia111

Really cool. Would love to try it for our 3B model.

olivia111

Is there a detailed tutorial on how to use it?

zhihaojia

The GitHub repo includes a tutorial for using MPK: https://github.com/mirage-project/mirage/tree/mpk