CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication Through RL
5 comments · December 4, 2025 · j2kun
AlexCoventry
In the future, we will all be Jürgen Schmidhuber. :-)
stonogo
Am I reading this wrong, or does this only support FP16 inputs and compare its performance against an FP32 solver?
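A minimal sketch of the gap stonogo is asking about, with NumPy standing in for the CUDA kernels (the shapes, seed, and error magnitudes are illustrative assumptions, not figures from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((512, 512)).astype(np.float16)
b = rng.standard_normal((512, 512)).astype(np.float16)

# FP16 path: multiply the half-precision inputs directly.
c_fp16 = (a @ b).astype(np.float32)

# FP32 "solver": upcast the same inputs before multiplying.
c_fp32 = a.astype(np.float32) @ b.astype(np.float32)

# Normalized max error; typically on the order of 1e-3 to 1e-2 here,
# which is why an FP16 kernel can only ever match an FP32 reference
# up to a tolerance.
rel_err = np.abs(c_fp16 - c_fp32).max() / np.abs(c_fp32).max()
print(f"normalized max error: {rel_err:.2e}")
```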
bgwalter
> To validate kernel correctness, we need to compare its output to a reference correct kernel with the same inputs.
No, you need a numerical proof, which you don't have.
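For reference, a minimal sketch of the empirical check the quoted sentence describes, with plain Python callables standing in for CUDA kernels (the function name `validate_kernel` and the tolerance values are illustrative assumptions):

```python
import numpy as np

def validate_kernel(candidate, reference, inputs, rtol=1e-2, atol=1e-3):
    """Run both kernels on the same inputs and compare outputs within a
    tolerance. This is empirical evidence of correctness on the sampled
    inputs, not a numerical proof over all inputs."""
    return np.allclose(candidate(*inputs), reference(*inputs),
                       rtol=rtol, atol=atol)

def fast(x, y):
    # Stand-in for the kernel under test: lower-precision matmul.
    return x.astype(np.float32) @ y.astype(np.float32)

def ref(x, y):
    # Stand-in for the trusted reference: float64 matmul.
    return x @ y

rng = np.random.default_rng(0)
a, b = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
print(validate_kernel(fast, ref, (a, b)))  # True: within tolerance
```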
krapht
This is a standard that few kernels will ever meet. I'd say requiring a numerical proof is the same as requiring no proof at all, because it won't ever happen unless you're validating silicon or something equally expensive.
They claim the algorithm "discovered" the new techniques, but the methods described in section 5 do not seem all that novel to me. It smells like it could be "laundering" the literature [1] and reshuffling existing techniques. This is not inherently a bad thing, but I would hope that if it is borrowing existing techniques, the appropriate citation would eventually make it into this paper.
[1]: https://www.argmin.net/p/lore-laundering-machines