
DeepSeek Open Source FlashMLA – MLA Decoding Kernel for Hopper GPUs

helloericsf

X: https://x.com/deepseek_ai/status/1893836827574030466

- BF16 support
- Paged KV cache (block size 64)
- 3000 GB/s memory-bound & 580 TFLOPS compute-bound on H800
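For context, here's a minimal sketch of what a decode call looks like, based on the usage pattern in the repo README (the function names come from there; the shapes and sizes below are just illustrative assumptions, not the official example):

    import torch
    from flash_mla import get_mla_metadata, flash_mla_with_kvcache

    # Illustrative sizes: 1 query token per sequence (decoding), 128 query heads
    # sharing a single latent KV head, paged KV cache with 64-token blocks.
    batch, s_q, h_q, h_kv, d, dv, block_size = 16, 1, 128, 1, 576, 512, 64
    seqlen = 1024
    cache_seqlens = torch.full((batch,), seqlen, dtype=torch.int32, device="cuda")
    blocks_per_seq = (seqlen + block_size - 1) // block_size
    block_table = torch.arange(batch * blocks_per_seq, dtype=torch.int32,
                               device="cuda").view(batch, blocks_per_seq)

    q = torch.randn(batch, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
    kv_cache = torch.randn(batch * blocks_per_seq, block_size, h_kv, d,
                           dtype=torch.bfloat16, device="cuda")

    # Scheduler metadata is computed once per decode step and reused across layers.
    tile_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)
    out, lse = flash_mla_with_kvcache(q, kv_cache, block_table, cache_seqlens, dv,
                                      tile_metadata, num_splits, causal=True)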

nokun7

In my view, FlashMLA’s exclusive targeting of Hopper GPUs restricts its cross-platform use, and the lack of comprehensive documentation, unclear compatibility with wider frameworks, and absence of benchmark comparisons or trade-off insights reduce its ease of use and adaptability. While it holds potential for specialists with tailored requirements, its specialized nature and limited community backing suggest it’s not yet a broadly practical tool; it needs more detailed guides and expanded hardware support to unlock its full capabilities.

deyiao

I heard their inference framework is way lower than typical deployment methods. Can this be verified from this open-source project? How does it stack up against vLLM or llama.cpp?

reissbaker

By "lower" you mean cheaper/better?

I suspect it's much higher throughput than vLLM, which in turn is much higher throughput than llama.cpp. The MLA kernel they just open-sourced seems to indicate that, although we'll see how it does in third-party benchmarks on non-hobbled GPUs vs FlashAttention. They only released the BF16 version (whereas most people, including DeepSeek themselves, serve in FP8), so it might not be immediately useful to most companies yet, although I imagine there'll be FP8 ports soon enough.

helloericsf

What do you mean by "lower"? To my understanding, they will open-source 5 infra-related repos this week. Let's revisit your comparison question on Friday.

find0x90

I don't see any use of PTX; it might be in one of the other repos they plan to release.

behnamoh

Open AI is back!

rvz

This is the minimum bar that I expect very elite programmers to be striving for in the age of AI. DeepSeek should be studied as an example, and this is only the first of many projects from them.

There is an extremely high chance (in fact a 99.9% chance) that an AI did not build this, and the people who are able to build or adapt projects like this, which go deep into hardware systems, will be the most sought after.

Not the horrendous JS or even TS slop across GitHub that is extremely easy for an AI to generate correctly.

You've got until 2030 to decide. And my advice is to study the codebases of pytorch (backends), DeepSeek, tinygrad and ggml.

mohsen1

I'm confused. Weren't there sanctions against Chinese companies regarding Hopper GPUs? Are they just admitting that they had access to H100s in violation of US sanctions?!

thot_experiment

Just the H100. The H800 is a region-specific version of the card for China with shitty NVLink bandwidth, which makes it rougher for building big clusters, but DeepSeek was able to mitigate the impact of that by being clever (rumored to have made significant use of PTX assembly instead of just plain CUDA; we'll probably find out in the releases this week).

Tiberium

H800 is the export variant that they had access to. They directly reference it in the repo:

>Achieving up to 3000 GB/s in memory-bound configuration and 580 TFLOPS in computation-bound configuration on H800 SXM5, using CUDA 12.6.
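If you want to sanity-check that memory-bound number on your own card, a rough approach (my own sketch, not something from the repo) is to time the decode call with CUDA events and divide the KV-cache bytes it has to read by the elapsed time:

    import torch

    def achieved_bandwidth_gbs(decode_fn, kv_cache, iters=100):
        # decode_fn is any zero-arg callable wrapping the flash_mla_with_kvcache
        # call; in decoding, reading the KV cache dominates memory traffic.
        for _ in range(10):  # warm-up
            decode_fn()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            decode_fn()
        end.record()
        torch.cuda.synchronize()
        ms_per_call = start.elapsed_time(end) / iters
        bytes_read = kv_cache.numel() * kv_cache.element_size()
        return bytes_read / (ms_per_call * 1e-3) / 1e9

With small toy shapes this will land nowhere near 3000 GB/s; the quoted figure is for large memory-bound configurations on H800 SXM5 with CUDA 12.6.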