AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition
6 comments · February 20, 2025 · ragnarok451
stephantul
This was interesting to see happen live on X
I think putting this on the LLM is a bit generous. Their results were apparently 30x above the theoretical maximum, according to GPU master Tri Dao, so there was also a lack of understanding of what was possible with CUDA (rough bandwidth math sketched below). See:
https://x.com/tri_dao/status/1892610951662153945
Also see this thread by Lucas Beyer: https://x.com/giffmana/status/1892510741242036468
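A rough roofline sanity check illustrates Tri Dao's point: a memory-bound kernel can never finish faster than the time needed to move its data through DRAM. The numbers below (array size, bandwidth) are illustrative assumptions, not Sakana's actual workload or hardware.

```python
# Illustrative roofline sanity check. All numbers here are assumptions for the
# sake of the example, not the benchmark in question.

def min_kernel_time_s(bytes_moved: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on runtime for a memory-bound kernel: you cannot beat DRAM."""
    return bytes_moved / bandwidth_bytes_per_s

n = 100_000_000                 # hypothetical: 100M float32 elements
bytes_moved = 2 * 4 * n         # one read + one write per element -> ~800 MB of traffic
bandwidth = 2.0e12              # ~2 TB/s, roughly a modern datacenter GPU (assumed)

lower_bound = min_kernel_time_s(bytes_moved, bandwidth)
print(f"physical lower bound: {lower_bound * 1e3:.3f} ms")
# Any reported runtime far below this bound (e.g. 30x) cannot be a real measurement
# of the same memory-bound computation; it points at a measurement or verification bug.
```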
One of the greatest skills in research is to remain skeptical of one's own results, especially when they are exceptional. They chose to pull the trigger and release too quickly.
This can happen in any setting, not just codegen, e.g. inadvertently training on the test set. Science is a slow ascent: if it looks too good to be true, you probably just have a bug.
imtringued
The lack of understanding was obvious from the start. They didn't benchmark CPU-only code against the GPU-ported equivalent, which would have been a fair comparison, since there is a lot of CPU code that benefits from being ported to CUDA.
They dishonestly assumed you can get GPU code that is faster than what CUDA experts have written using extensive hardware knowledge, intuition that is unlikely to be in the training data or to inform the generated kernel. The very thing they are attempting goes beyond what current-generation LLMs can do.
The stated goal is also very silly. People don't need help running PyTorch on CUDA. One of the most important fused kernels in machine learning is FlashAttention, and the reason it can fuse operations is that FlashAttention is a very different algorithm from conventional attention: it reorders the operations, which is what lets you fuse them, and it produces a numerically close but not bit-identical result. A rough sketch of that reordering is below.
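To make the reordering point concrete, here is a minimal single-query sketch of the online-softmax trick that FlashAttention-style kernels rely on. It is plain NumPy for clarity, not an actual fused CUDA kernel, and the shapes and block size are arbitrary illustrative choices.

```python
import numpy as np

def attention_online(q, K, V, block=64):
    """Attention for one query vector, computed over K/V in blocks using a running
    max and running normalizer, so the softmax and the weighted sum of V can be
    fused into one pass without ever materializing the full row of scores."""
    d = q.shape[-1]
    m = -np.inf                   # running max of scores seen so far
    l = 0.0                       # running softmax normalizer
    acc = np.zeros(V.shape[-1])   # running (unnormalized) weighted sum of V
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q / np.sqrt(d)        # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)         # rescale what we accumulated so far
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / l

# Matches the naive computation up to floating-point reordering:
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=(16,)), rng.normal(size=(256, 16)), rng.normal(size=(256, 8))
s = K @ q / np.sqrt(16)
ref = np.exp(s - s.max()) / np.exp(s - s.max()).sum() @ V
assert np.allclose(attention_online(q, K, V), ref)
```

Because the loop never builds the full score row, a real kernel can keep everything in on-chip memory, which is exactly what makes the fusion possible.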
neonate
Looks like that's also at https://sakana.ai/ai-cuda-engineer/#limitations-and-bloopers
rnrn
Since I posted https://news.ycombinator.com/item?id=43124176, they have revised the post again to acknowledge that many of the other generated kernels are also broken:
> Furthermore, we find the system could also find other novel exploits in the benchmark’s tasks
“Novel exploit” is a pretty fancy and generous way of saying that some of the kernels wrote a constant value to the entire output and passed because the evaluation code only tested one fixed set of inputs, a check you can satisfy by replacing the computation with a memset.
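For concreteness, here is a hypothetical sketch (not Sakana's actual evaluation code) of how a single-fixed-input check can be passed by a "kernel" that just writes a constant, and how randomized inputs catch it immediately.

```python
import torch

def reference_op(x: torch.Tensor) -> torch.Tensor:
    return torch.relu(x).sum(dim=-1)

def cheating_kernel(x: torch.Tensor) -> torch.Tensor:
    # Does no real work: returns a constant that happens to match the one tested input.
    return torch.full(x.shape[:-1], 42.0)

def weak_check(candidate) -> bool:
    # Single, fixed test input (constructed so reference_op returns 42.0 everywhere).
    x = torch.ones(8, 16) * 42.0 / 16
    return torch.allclose(candidate(x), reference_op(x))

def stronger_check(candidate, trials: int = 10) -> bool:
    # Randomized inputs make the constant-output shortcut fail.
    for _ in range(trials):
        x = torch.randn(8, 16)
        if not torch.allclose(candidate(x), reference_op(x)):
            return False
    return True

print(weak_check(cheating_kernel))      # True  -- the "exploit"
print(stronger_check(cheating_kernel))  # False
```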
01100011
Nvidia is doing work like this internally: https://developer.nvidia.com/blog/automating-gpu-kernel-gene...
This was debunked; the agent was actually fooling the verification harness: https://x.com/SakanaAILabs/status/1892992938013270019. One kernel that supposedly showed a 150x speedup is actually 3x slower.
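One classic way inflated GPU speedups show up, whether or not it is what happened here, is timing asynchronous CUDA launches without synchronizing, so the timer stops before the kernel has actually finished. A minimal sketch of the pitfall and the fix (requires PyTorch and a CUDA-capable GPU; the op is a stand-in, not a generated kernel):

```python
import time
import torch

def fused_op(x):
    return torch.nn.functional.gelu(x) * x   # stand-in for some "optimized" kernel

x = torch.randn(4096, 4096, device="cuda")

# Wrong: CUDA launches are asynchronous, so the kernel may still be running
# when the timer stops; this mostly measures launch overhead.
t0 = time.perf_counter()
y = fused_op(x)
wrong_ms = (time.perf_counter() - t0) * 1e3

# Better: synchronize (or use CUDA events) so the measurement covers the real work.
torch.cuda.synchronize()
t0 = time.perf_counter()
y = fused_op(x)
torch.cuda.synchronize()
right_ms = (time.perf_counter() - t0) * 1e3

print(f"unsynchronized: {wrong_ms:.3f} ms, synchronized: {right_ms:.3f} ms")
```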