
AI Is Writing Its Own Kernels, and They Are 17x Faster

matll

As someone who spent the better part of last year trying to hand-tune kernels for a niche accelerator (not Trainium, but similar vibe), this honestly looks like a dream.

The hardest part of this work isn't coming up with the math; it's the mental overhead of managing the scratchpad memory and async DMA calls without stepping on your own toes. You spend 3 days debugging a race condition just to find out you got a 2% speedup.

If this tool can actually handle the grunt work of generating the tiling logic and memory moves from a high-level plan, that's a game changer. I don't even care about the 17x number as much as I care about the '0 to 1' speed: getting any performant kernel running on new hardware usually takes weeks. If this cuts it down to a few hours of LLM churning, that's huge for the industry.
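For the curious, the shape of that grunt work is roughly this (a toy sketch, not any real vendor API; `dma_load`, `dma_store`, `wait`, and `compute` are made-up stand-ins for the async-copy and semaphore primitives):

    NUM_TILES = 64

    def process_tiles(dma_load, dma_store, compute, wait):
        """Toy double-buffered tile loop: overlap the next DMA load with the
        current tile's compute, ping-ponging between two scratchpad slots."""
        dma_load(tile=0, buf=0)                       # prefetch the first tile
        for i in range(NUM_TILES):
            cur, nxt = i % 2, (i + 1) % 2
            if i + 1 < NUM_TILES:
                dma_load(tile=i + 1, buf=nxt)         # async: runs during compute
            wait(buf=cur)                             # skip this and you get the 3-day race
            dma_store(tile=i, data=compute(buf=cur))  # cur is reusable only after this lands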

simonw

Optimization work sounds like it might be a really good fit for coding agents. If you can provide a robust test which "proves" the implementation works, the actual work of increasing its performance is the kind of thing a coding agent could run in a loop, testing each optimization to see whether the tests still pass and it runs faster.
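Roughly this loop (a minimal sketch; `run_tests`, `benchmark`, and `propose_patch` are stand-ins for the test suite, a timing harness, and the LLM call):

    def optimize(kernel, run_tests, benchmark, propose_patch, budget=100):
        """Greedy agent loop: a candidate is kept only if the tests still
        pass AND the measured time improves."""
        best, best_time = kernel, benchmark(kernel)
        for _ in range(budget):
            candidate = propose_patch(best)   # LLM suggests one optimization
            if not run_tests(candidate):      # the "proof" gate comes first
                continue
            t = benchmark(candidate)
            if t < best_time:                 # accept only measured wins
                best, best_time = candidate, t
        return best, best_time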

whynotmaybe

But we might end up with "works on my infrastructure" optimizations that would be hard to reproduce.

Like that research that evolved an FPGA where some unconnected parts were crucial for the expected behaviour.

https://www.eetimes.com/whatever-happened-to-evolvable-hardw...

mholm

Making a few diverse hardware environments available for testing would mitigate this, as in the sketch below. And many companies wouldn't have any issue with infrastructure-specific optimizations anyway. (Part of) DeepSeek's big advantage over its Chinese competitors was its intelligent use of the hardware, after all.
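Something like this as the acceptance gate (a sketch; `benchmark_on` and the target names are made up):

    def accept(candidate, baseline, benchmark_on,
               targets=("dev_box", "ci_runner", "prod_node")):
        """Only accept a candidate kernel if it wins on every target, so the
        search can't overfit to one machine's quirks (the evolved-FPGA trap)."""
        speedups = [benchmark_on(baseline, t) / benchmark_on(candidate, t)
                    for t in targets]
        return min(speedups) > 1.0   # must be a win everywhere, not just locally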

cadamsdotcom

Correction:

Charles Hong, Sahil Bhatia, Alvin Cheung, Yakun Sophia Shao, and the ADRS team ...

are USING AI to write kernels.

“AI” is not writing its own anything.

It is doing what humans tell it to do.

comrade1234

This is completely believable and you should invest in this technology.

DroneBetter

I can't tell whether you're trying to convince humans, parodying someone who might be, or seeding superficial sentiment for automated traders' web scrapers to be influenced by.

cornonthecobra

Or they left the /s off and it's a remark about how the fine article sounds more like hype-machine emesis than legitimate, substantive research.

oceansky

I think he's just being extremely ironic, meaning the exact opposite of what he actually says.

UncleOxidant

Was in a startup where we were trying to do this (our tagline was "using AI to make AI run faster and more efficiently"). But we ran out of funding at the end of '22 :(

We were just a little early, I think.

accheng

Interesting, did you have any learnings that would apply to this problem now?

jryio

Chris Lattner, of Apple Swift and Tesla fame, is running a company entirely predicated on this, but at the deterministic language-design level rather than the inference level.

https://www.modular.com/mojo

If a beam search with an iterative plan-and-execute phase is more effective than better tooling in a deterministic programming language, then this approach will clearly take the lead.
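By "beam search" here I mean roughly this kind of loop (a sketch of the idea as I read it, not the paper's actual code; `expand_plan` and `measure` are hypothetical):

    import heapq

    def beam_search(seed_plan, expand_plan, measure, width=4, depth=5):
        """Keep the `width` fastest plans at each depth, expanding each
        into several candidate rewrites via the LLM."""
        beam = [(measure(seed_plan), seed_plan)]
        for _ in range(depth):
            candidates = list(beam)                  # parents stay eligible
            for _, plan in beam:
                for child in expand_plan(plan):      # LLM proposes variants
                    candidates.append((measure(child), child))
            beam = heapq.nsmallest(width, candidates, key=lambda c: c[0])
        return min(beam, key=lambda c: c[0])         # (runtime, plan)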

accheng

Thanks for the link! I'm not familiar with the company, but it reminds me of the whole formal-methods debate in distributed systems. Sure, writing TLA+ specs is the 'correct' deterministic way to build a Raft implementation, but in reality everyone just writes messy Go/Java and patches bugs as they pop up because it's faster.

karek

Usually I scroll past these 'LLM optimizes code' posts because 99% of them are just finding basic peephole optimizations that -O3 would've caught anyway. But looking at the conv1d example in the blog, this is actually making real architectural changes.

The 'dropout' on the optimization menu is a pretty neat hack. It kinda reminds me of how I work when I'm stuck... 'OK, what if I don't unroll this loop, what else can I do?' It forces the search out of local minima. Nice to see an AI tool designed around verification (the simulator loop) rather than just hoping the LLM guesses right on the first shot.
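The trick is basically this (a sketch; the `MENU` entries are placeholder move names, and applying a move is out of scope here):

    import random

    MENU = ["unroll", "tile", "vectorize", "swap_loops", "prefetch"]

    def propose_next_move(plan, rng=None, p_drop=0.3):
        """Randomly hide part of the optimization menu so the search can't
        keep reaching for its favorite move; this is what pushes it out of
        local minima."""
        rng = rng or random.Random()
        visible = [m for m in MENU if rng.random() > p_drop]
        move = rng.choice(visible or MENU)  # fall back if everything got dropped
        return plan + [move]                # a "plan" is just a list of moves here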

quc1k

I really appreciate the focus on interpretability. Usually, super-optimizers give you a blob of assembly that runs fast but is impossible to debug or maintain. By forcing the model to output a natural language 'Plan' first, you essentially get documentation for free. If the code breaks later, you can look at the plan to understand why the loop was unrolled or why the memory was laid out that way. That makes this actually usable in a production CI/CD pipeline, unlike most black-box ML optimizations.
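Concretely, the artifact you'd check in could be as simple as this (a sketch of my own framing, not the paper's actual format; `run_tests` and `benchmark` are stand-ins):

    from dataclasses import dataclass

    @dataclass
    class KernelArtifact:
        plan: str   # natural-language rationale: why unrolled, why this layout
        code: str   # the generated kernel source

    def ci_gate(artifact, run_tests, benchmark, baseline_time):
        """Fail CI with the plan in the error message, so a regression points
        you at the stated intent instead of an opaque blob of fast code."""
        assert run_tests(artifact.code), "plan was: " + artifact.plan
        assert benchmark(artifact.code) <= baseline_time, "lost the speedup"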

kap901

Manually writing tiling logic for systolic arrays is the absolute worst. If this actually works, it saves me so much headache.

measurablefunc

I wonder if this type of work can be applied towards translating kernels between GPU vendors, e.g. CUDA → AMD. Does anyone know if that's possible or whether that kind of problem is AGI-complete?

jryio

There's a higher level of abstraction

https://www.modular.com/mojo

measurablefunc

So if CUDA could be ported to Mojo w/ AI then it would be basically available for any GPU/accelerator vendor. Seems like the right kind of approach towards making CUDA a non-issue.

UncleOxidant

It seems like it could be possible now with a bit of work; I don't think it would require AGI. Didn't AMD fund something like this and then decide not to pursue it further recently? That was ZLUDA. AMD also has its own CUDA porting layer, HIP. https://www.blopig.com/blog/2024/03/an-open-source-cuda-for-...

measurablefunc

Very interesting.

dataeaa

Crazy that it beat the hand-tuned Amazon kernels. It really shows how early we still are with these software stacks.

What are the risks of using these kinds of tools, though? Did you hit any tricky/silent bugs you had to fix manually?

mavt6

Love the concept of using AI to make the hardware that runs AI faster. Feels like we're finally closing the loop on this stuff!

dfdsfds

Very impressive results! I'll be curious to see how correctness is guaranteed and what kinds of failures are typical of LLM-generated code.