Writing Speed-of-Light Flash Attention for 5090 in CUDA C++
7 comments
August 23, 2025
ProofHouse
Damn awesome. This is going to take me 3 reads and a week to digest.
steinvakt2
I had a 5090 some months ago but couldn't get flash attention to work. Does it now work natively? What about the 5080?
sigmoid10
PyTorch now has native support for the Blackwell architecture:
zackangelo
Curious what issues you were having. The kernel should compile natively if you pass nvcc the correct arch flags, although it probably won't take advantage of any new hardware features.
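To make "the correct arch flags" concrete, here is a minimal sketch that assembles an nvcc command line, assuming the target is sm_120 (the target the post names for the 5090). The helper `nvcc_command` and the file names `attention.cu` / `attention` are hypothetical; only the `-gencode arch=compute_XX,code=sm_XX` flag format comes from nvcc itself, and it needs a CUDA toolkit recent enough to know that target.

```python
import shlex


def nvcc_command(source, output, cc="120"):
    """Build an nvcc command line for one compute capability.

    `cc` is the compute capability without the dot; "120" is an
    assumption based on the sm120 target mentioned in the post.
    """
    return [
        "nvcc", "-O3",
        # Emit native SASS for the chosen architecture.
        "-gencode", f"arch=compute_{cc},code=sm_{cc}",
        "-o", output, source,
    ]


# Hypothetical file names, for illustration only.
cmd = nvcc_command("attention.cu", "attention")
print(shlex.join(cmd))
```

Compiling this way only makes the kernel run on the card; as noted above, it does not by itself make the kernel use any newer hardware features.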
doctorpangloss
Hmm, but supposing the accelerated NVIDIA-specific inference data types were available in Triton, you would just use that? Why not contribute to Triton? They accept PRs. So what if you end up doing free product ecosystem development for NVIDIA and giant corporations by contributing to Triton?
qeternity
Second line of the post:
> The main objective is to learn writing attention in CUDA C++, since many features are not available in Triton, such as MXFP8 / NVFP4 MMA for sm120.
I was surprised to see 5090's theoretical BF16 TFLOPs at just 209.5. That's not even 10% of the server Blackwell (B200 is 2250, and GB200 is 2500). B200 costs around $30-40k per GPU, so that's almost in line with their relative performance.
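A quick sanity check of those ratios, as a sketch that uses only the figures quoted in this comment (209.5, 2250 and 2500 TFLOPs, and $30-40k for a B200). The helper names are made up for illustration, and the "break-even" price is simply the price at which a 5090 would match a B200 in BF16 TFLOPs per dollar; no actual 5090 price is assumed.

```python
RTX_5090_BF16 = 209.5                              # TFLOPs, from the comment
SERVER_BF16 = {"B200": 2250.0, "GB200": 2500.0}    # TFLOPs, from the comment
B200_PRICE_RANGE = (30_000, 40_000)                # USD, from the comment


def percent_of(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


def break_even_price(server_price, server_tflops, card_tflops=RTX_5090_BF16):
    """Card price that gives the same TFLOPs per dollar as the server GPU."""
    return server_price * card_tflops / server_tflops


for name, tflops in SERVER_BF16.items():
    print(f"5090 vs {name}: {percent_of(RTX_5090_BF16, tflops):.1f}%")

for price in B200_PRICE_RANGE:
    parity = break_even_price(price, SERVER_BF16["B200"])
    print(f"B200 at ${price}: 5090 parity price is about ${parity:.0f}")
```

With these inputs the 5090 lands at roughly 9.3% of a B200 and 8.4% of a GB200, and the parity price comes out between about $2,800 and $3,700.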
Starting with the 4090, NVIDIA limits the performance of tensor cores on gaming cards, specifically for ops that might be used in training. FP8 and FP16 matmuls run at full speed when accumulating in FP16 (I've never seen anyone use this), but only at half speed when accumulating in FP32. This restriction does not apply to lower-precision matmuls like FP4, and it is removed entirely on workstation-class cards like the RTX Pro 6000.
It doesn't seem worth it to use NVIDIA gaming cards as a "cheaper FLOPs" alternative anymore (e.g. diffusion models could have been cheaper to run on a 4090 than on an H100). They are generous with memory bandwidth though; nearly 2 TB/s is amazing!
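One way to put the compute and bandwidth numbers side by side is the roofline ridge point: peak FLOPs divided by memory bandwidth, i.e. how many FLOPs a kernel must do per byte moved before compute rather than memory becomes the limit. A rough sketch, assuming the 209.5 TFLOPs figure above and treating "nearly 2 TB/s" as exactly 2.0 TB/s (a rounded-up assumption, not a spec value):

```python
PEAK_BF16_TFLOPS = 209.5   # from the comment above
MEM_BW_TBPS = 2.0          # assumption: "nearly 2 TB/s" rounded up


def ridge_point(peak_tflops, bandwidth_tbps):
    """FLOPs per byte at which a kernel stops being memory-bound.

    Both inputs are in units of 10^12, so the ratio is FLOPs/byte.
    """
    return peak_tflops / bandwidth_tbps


print(f"{ridge_point(PEAK_BF16_TFLOPS, MEM_BW_TBPS):.1f} FLOPs/byte")
```

Under those assumptions the ridge sits at a bit over 100 FLOPs per byte. Since the real bandwidth is somewhat below 2 TB/s, the true value would be slightly higher.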