DeepSeek open-sources DeepEP – a library for MoE training and inference

pama

I feel like a kid in a candy shop. Some of these tricks would take way too long to reverse engineer correctly based on the papers. I hope that the releases this week start a renaissance of the use of MoE as baseline academic models.

helloericsf

- Efficient and optimized all-to-all communication
- Both intranode and internode support with NVLink and RDMA
- High-throughput kernels for training and inference prefilling
- Low-latency kernels for inference decoding
- Native FP8 dispatch support
- Flexible GPU resource control for computation-communication overlapping

X: https://x.com/deepseek_ai/status/1894211757604049133
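
Not DeepEP's API, but as a rough illustration of that last point: the generic CUDA pattern for computation-communication overlap is to put peer-to-peer transfers and compute kernels on separate streams so NVLink traffic hides behind expert computation. Everything below (expert_ffn, the buffer names, the sizes) is invented for this sketch.

    // Hedged sketch, not DeepEP code: overlap a dispatch-like P2P copy with
    // local expert compute by issuing them on two CUDA streams.
    #include <cuda_runtime.h>

    __global__ void expert_ffn(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;  // stand-in for the real expert FFN
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        cudaStream_t comm_stream, compute_stream;
        cudaStreamCreate(&comm_stream);
        cudaStreamCreate(&compute_stream);

        float *local_in, *local_out, *peer_buf;
        cudaMalloc(&local_in, bytes);
        cudaMalloc(&local_out, bytes);

        // Assumes a second GPU with P2P (NVLink) access is available.
        cudaSetDevice(1);
        cudaMalloc(&peer_buf, bytes);
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);

        // Send tokens to the peer GPU on one stream...
        cudaMemcpyPeerAsync(peer_buf, 1, local_in, 0, bytes, comm_stream);
        // ...while local experts keep computing on the other.
        expert_ffn<<<(n + 255) / 256, 256, 0, compute_stream>>>(local_in, local_out, n);

        cudaStreamSynchronize(comm_stream);
        cudaStreamSynchronize(compute_stream);
        return 0;
    }

DeepEP's actual kernels do this with RDMA, FP8 payloads, and explicit SM allocation; the sketch only shows the shape of the overlap.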

ofou

You gotta love these guys; they're really pushing the open-source frontier for all of us. Thanks for sharing.

grg0

Open AI™ (with a space)

hackit2

Kind of ironic that DeepSeek is more Open than ChatGPT

gostsamo

They do it for their own reasons, but OpenAI are straight-up liars; they are neither open nor do they give a fuck about humanity.

echelon

I hope you're reading this, Sam Altman:

Make Open AI open.

Or else you'll lose to the ecosystem.

deyiao

Now it includes the highly anticipated PTX! Of course, I don’t understand it, but I’ve already clicked the star and even the fork button, which basically means I’ve mastered it, right? I feel incredibly powerful right now...

deyiao

Is the PTX that everyone was looking forward to included this time?

find0x90

Yes, there's some in the csrc/kernels directory. Search for 'asm' to find uses of it.

swyx

> the PTX that everyone was looking forward to

explanation for the rest of us why this is so important?

Bimos

The PTX instructions they talked about in the tech report should be pointing to this code, right?

zardinality

"For extreme performance, we discover and use a behavior-out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better. If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue."

rvz

Round 2 of open-source releases from an actual "Open AI™" company, licensed under MIT.

Once again, DeepSeek is more open than the $157B+ one that is claiming to be "Open".

Almost no one is talking about Meta's Llama, and everyone should expect them to release Llama 4 with reasoning.

The objective is to not be squeezed in the middle of the race to zero.