CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication Through RL
5 comments · December 4, 2025 · j2kun
AlexCoventry
In the future, we will all be Jürgen Schmidhuber. :-)
stonogo
Am I reading this wrong, or does this only support FP16 inputs and compare its performance against an FP32 solver?
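A minimal sketch of the gap stonogo is asking about, with NumPy standing in for the CUDA kernels (the shapes, seed, and error magnitudes are illustrative assumptions, not figures from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((512, 512)).astype(np.float16)
b = rng.standard_normal((512, 512)).astype(np.float16)

# FP16 path: multiply the half-precision inputs directly.
c_fp16 = (a @ b).astype(np.float32)

# FP32 "solver": upcast the same inputs before multiplying.
c_fp32 = a.astype(np.float32) @ b.astype(np.float32)

# Normalized max error; typically on the order of 1e-3 to 1e-2 here,
# which is why an FP16 kernel can only ever match an FP32 reference
# up to a tolerance.
rel_err = np.abs(c_fp16 - c_fp32).max() / np.abs(c_fp32).max()
print(f"normalized max error: {rel_err:.2e}")
```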
bgwalter
> To validate kernel correctness, we need to compare its output to a reference correct kernel with the same inputs.
No, you need a numerical proof, which you don't have.
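For reference, a minimal sketch of the empirical check the quoted sentence describes, with plain Python callables standing in for CUDA kernels (the function name `validate_kernel` and the tolerance values are illustrative assumptions):

```python
import numpy as np

def validate_kernel(candidate, reference, inputs, rtol=1e-2, atol=1e-3):
    """Run both kernels on the same inputs and compare outputs within a
    tolerance. This is empirical evidence of correctness on the sampled
    inputs, not a numerical proof over all inputs."""
    return np.allclose(candidate(*inputs), reference(*inputs),
                       rtol=rtol, atol=atol)

def fast(x, y):
    # Stand-in for the kernel under test: lower-precision matmul.
    return x.astype(np.float32) @ y.astype(np.float32)

def ref(x, y):
    # Stand-in for the trusted reference: float64 matmul.
    return x @ y

rng = np.random.default_rng(0)
a, b = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
print(validate_kernel(fast, ref, (a, b)))  # True: within tolerance
```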
krapht
This is a standard that few kernels will ever meet. I'd say requiring a numerical proof is the same as requiring no proof at all, because it won't ever happen unless you're validating silicon or something equally expensive.
They claim the algorithm "discovered" the new techniques, but the methods described in section 5 do not seem all that novel to me. It smells like it could be "laundering" the literature [1] and reshuffling existing techniques. This is not inherently a bad thing, but I would hope that if it is borrowing existing techniques, the appropriate citation would eventually make it into this paper.
[1]: https://www.argmin.net/p/lore-laundering-machines