DualPipe: Bidirectional pipeline parallelism algorithm
February 27, 2025 · xnhbx
anonzzzies
When my company was still working closely with CN factories a few years ago (before the bans and before clients stopped wanting to work with companies that work with China, etc.), the CEOs of the factories we worked with had all been electronic engineers at that company or another one before; they could all jump in, debug schematics, solder, and write firmware themselves. And they did. These were places with massive campuses of towering buildings full of robots, and a few employees (relative to the massive space) doing maintenance, prototyping, etc.
larodi
It sounds so much more reasonable to have a director who is actually technical, doesn't it? I'm absolutely amazed at how this (in the East) contrasts with the understanding (in the West) that directors need to know finance, strategic planning, and marketing rather than the actual nuances of the work.
tway223
To be blunt, this is exactly what is wrong with the “leadership” mindset in the West: decisions are often made without understanding the “nuances”, yet the people making them are confident they will work.
danielhanchen
I attached all 3 algorithms 1F1B (1 forward 1 backward), ZB1P (zero bubble pipeline parallelism) and DualPipe as a picture here: https://x.com/danielhanchen/status/1894937006352031832 for those interested :)
Bimos
Maybe add Chimera as well?
isoprophlex
It looks as if Chimera has marginally fewer bubbles than DualPipe?
danielhanchen
Oh more nice pictures :)
alphan0n
Off topic, but this is the Rick and Morty episode where Rick creates a perfectly level space.
The symmetry is uuugh.
danielhanchen
You'll have to refresh my memory :) Is there like a Youtube clip for it?
puppycodes
Sorry for us utter simpletons can someone explain what it do?
fasterergpes
It makes it so that having more GPUs makes inference run faster. The worst case until now has been that you could only use the memory of the extra GPUs and gained no speed at all.
qrios
In very simple words: it is one way to reduce the white squares in the picture from @danielhanchen[1].
In more complex words: imagine a processor that needs 10 clock cycles to process an instruction, but that can accept new input on every clock cycle and start processing it in a pipeline. After the first input you have to wait ten clock cycles for the first output. But if you keep feeding the input line on every cycle, you also get an output on every cycle.
In the case of GPUs, it is not just a matter of a single pipeline but of many in parallel. Depending on your data and algorithm, it can be thousands in parallel.
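To make the 10-cycle example concrete, here is a minimal sketch of a generic pipeline in Python. It is not DualPipe's actual schedule; the function names (`simulate`, `pipelined_cycles`, `serial_cycles`, `idle_fraction`) are made up for illustration, and it assumes every stage takes exactly one cycle. The idle fraction it computes corresponds loosely to the share of white squares in such a picture.

```python
def simulate(num_inputs: int, depth: int) -> dict:
    """Cycle-by-cycle simulation of a pipeline with `depth` one-cycle stages.

    Returns {input_id: cycle in which that input leaves the last stage}.
    """
    stages = [None] * depth  # stages[i] = id of the input currently in stage i
    finished_at = {}
    next_id = 0
    cycle = 0
    while len(finished_at) < num_inputs:
        cycle += 1
        # Feed a new input into stage 0 if any remain, shifting the rest along.
        new = next_id if next_id < num_inputs else None
        if new is not None:
            next_id += 1
        stages = [new] + stages[:-1]
        # Whatever sits in the last stage completes at the end of this cycle.
        if stages[-1] is not None:
            finished_at[stages[-1]] = cycle
    return finished_at


def pipelined_cycles(num_inputs: int, depth: int) -> int:
    """Closed form: wait `depth` cycles for the first output, then one per cycle."""
    return 0 if num_inputs == 0 else depth + (num_inputs - 1)


def serial_cycles(num_inputs: int, depth: int) -> int:
    """Without pipelining, each input occupies the processor for `depth` cycles."""
    return num_inputs * depth


def idle_fraction(num_inputs: int, depth: int) -> float:
    """Share of stage-cycles spent idle (the 'bubbles') in the pipelined run."""
    total_slots = depth * pipelined_cycles(num_inputs, depth)
    busy_slots = depth * num_inputs
    return (total_slots - busy_slots) / total_slots


if __name__ == "__main__":
    print(simulate(5, 10))
    print(pipelined_cycles(5, 10), serial_cycles(5, 10))
    print(idle_fraction(5, 10), idle_fraction(1000, 10))
```

With only a few inputs most stage-cycles are wasted filling and draining the pipeline; with many inputs the idle share shrinks toward zero. Schedules like the ones in the picture are different ways of shrinking that idle share when the stages are GPUs.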
optimalplusone
I hope all the open-sourcing DeepSeek is doing encourages American labs to do more of the same. Surely they'll realize their momentum is more of a moat than their tech at any one point in time.
jpcom
Does this remind anyone else of the Pied Piper compression algorithm?
aqueueaqueue
Middle out or something?
snake_doc
Hmm, wasn’t there also supposed to be SM re-allocation? It doesn’t look like it was included; I may be misremembering the explanation.
ringer007
> DualPipe was created and developed by Jiashi Li and Chengqi Deng and Wenfeng Liang.
A CEO who codes.