Fault Tolerant Llama training
5 comments · June 23, 2025

bjt12345
This is severely underrated work. Why aren't more mid-sized companies helping with this? Ultra Ethernet just got released.
timzaman
300 L40s? What's this, 1998?
d4l3k
Hey Tim, how's it going?
Interested in lending PyTorch some compute? :)
torchft can handle much larger scales, but for a public multi-day demonstration run this is what we had available. The point of this blog was to demonstrate the correctness of the quorum algorithm and recovery with a stock PyTorch stack, not so much peak FLOPS.
Stay tuned though -- planning on doing some much larger demos on B200s!
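(For readers wondering what the recovery path looks like in practice: below is a minimal sketch of wiring torchft into a plain DDP loop, based on a reading of the torchft README. The class names and Manager arguments here are assumptions and may not match the current API exactly.)

```python
# Minimal sketch of a fault-tolerant DDP loop managed by torchft.
# Names follow a reading of the torchft README and are assumptions,
# not a verified API; treat this as illustrative only.
import torch
from torch import nn, optim
from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

net = nn.Linear(128, 128)
base_optim = optim.AdamW(net.parameters())

# The Manager runs the quorum algorithm every step and, when a replica
# rejoins after a failure, restores its weights from a live peer via
# the state_dict/load_state_dict callbacks instead of a checkpoint.
manager = Manager(
    pg=ProcessGroupGloo(),
    min_replica_size=1,
    load_state_dict=lambda sd: net.load_state_dict(sd),
    state_dict=lambda: net.state_dict(),
)

# Fault-tolerant wrappers around stock DDP and the optimizer: gradients
# are all-reduced only across replicas in the current quorum, and the
# optimizer step can be skipped if the quorum deems the step unhealthy.
model = DistributedDataParallel(manager, net)
optimizer = Optimizer(manager, base_optim)

for step in range(1000):
    batch = torch.rand(32, 128)
    optimizer.zero_grad()
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()
```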
kcorbitt
I was curious about this, so I had o3 do a bit of research. Turns out 300 L40s have more compute than any supercomputer before 2013 (and arguably before 2016, depending on how you count reduced-precision FLOPs).
https://chatgpt.com/share/685dea79-26ec-8002-bd62-7ed83aedf4...
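(A rough back-of-the-envelope check of that claim, assuming ~90 TFLOPS peak FP32 per L40-class GPU; Top500 Rmax figures are FP64 Linpack, so this is only an order-of-magnitude comparison.)

```python
# Back-of-the-envelope: aggregate peak FP32 of 300 L40-class GPUs
# vs. Top500 leaders of the era. The ~90 TFLOPS/GPU figure is an
# assumption, and Rmax values are FP64 Linpack, so this is only an
# order-of-magnitude comparison.
per_gpu_fp32_tflops = 90                       # assumed peak FP32 per L40-class GPU
cluster_pflops = 300 * per_gpu_fp32_tflops / 1000
print(f"300 GPUs ~ {cluster_pflops:.0f} PFLOPS FP32")  # ~27 PFLOPS

titan_rmax_pflops = 17.6     # Titan, Top500 #1 in Nov 2012
tianhe2_rmax_pflops = 33.9   # Tianhe-2, Top500 #1 from June 2013
print(cluster_pflops > titan_rmax_pflops)      # True  -> more than any pre-2013 machine
print(cluster_pflops > tianhe2_rmax_pflops)    # False -> not more than Tianhe-2 (2013)
```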
d4l3k
Hey, nice to see this here!
I'm the primary author, so I'm happy to answer any questions you might have!