The bug that taught me more about PyTorch than years of using it
October 23, 2025
jebarker
This is a great write-up and I'd love to see more like it. Debugging this sort of thing in the Megatron -> PyTorch -> CUDA stack is what my team, an ML research team, spends more than half of its time on.
montebicyclelo
Incorrect PyTorch gradients with Apple MPS backend...
Yep, this kind of thing can happen. I found and reported incorrect gradients for Apple's Metal-backed TensorFlow conv2d in 2021 [1].
(Pretty sure I've seen incorrect gradients with another PyTorch backend, but that was a few years ago and I don't seem to have raised an issue to refer to...)
One might think this class of errors would be caught by a test suite. Autodiff can be tested quite comprehensively against numerical differentiation [2]. (Although this example is from a much simpler lib than PyTorch, so I could be missing something.)
[1] https://github.com/apple/tensorflow_macos/issues/230
[2] https://github.com/sradc/SmallPebble/blob/2cd915c4ba72bf2d92...
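For concreteness, a minimal sketch of that kind of check using PyTorch's built-in torch.autograd.gradcheck, which compares autodiff gradients against finite differences; the op, shapes, and tolerances here are illustrative, not taken from the article:

    import torch
    import torch.nn.functional as F

    # gradcheck compares analytical gradients against numerical (finite-difference)
    # gradients, so it wants double precision and requires_grad=True on the inputs.
    x = torch.randn(1, 2, 8, 8, dtype=torch.float64, requires_grad=True)
    w = torch.randn(3, 2, 3, 3, dtype=torch.float64, requires_grad=True)

    def conv(x, w):
        return F.conv2d(x, w, padding=1)

    # Raises (or returns False) if analytical and numerical gradients disagree.
    print("gradcheck passed:", torch.autograd.gradcheck(conv, (x, w), eps=1e-6, atol=1e-4))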
cadamsdotcom
Sounds like Placeholder should somehow be split into InputPlaceholder and OutputPlaceholder, based on the usage.
Even if the two classes were otherwise identical, the split could help future folks realize copying back is platform-specific: "hm, we wrote to an OutputPlaceholder but didn't read back from it, that seems wrong".
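Purely as an illustration of that suggestion, a hypothetical Python-flavored sketch (the real Placeholder is C++ inside the MPS backend, and none of these class or method names exist in PyTorch):

    import warnings
    import torch

    class InputPlaceholder:
        """Stages a tensor for a kernel that can only read contiguous memory."""
        def __init__(self, tensor):
            self.staged = tensor if tensor.is_contiguous() else tensor.contiguous()

    class OutputPlaceholder:
        """Stages an output buffer and remembers whether it was copied back."""
        def __init__(self, tensor):
            self.target = tensor
            # If the kernel can't write into the target's storage directly
            # (e.g. a non-contiguous view), give it a scratch buffer instead.
            self.needs_copy_back = not tensor.is_contiguous()
            self.staged = torch.empty(tensor.shape, dtype=tensor.dtype) if self.needs_copy_back else tensor
            self.copied_back = not self.needs_copy_back

        def copy_back(self):
            if self.needs_copy_back:
                self.target.copy_(self.staged)
            self.copied_back = True

        def __del__(self):
            # A missing copy-back becomes something a debug build can flag
            # instead of silently producing stale results.
            if not self.copied_back:
                warnings.warn("OutputPlaceholder dropped without copy_back()")

Even just a warning like the one in __del__ would turn a silently missing copy-back into a noisy one.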
ramses0
Apps Hungarian vs. Systems Hungarian: https://herbsutter.com/2008/07/15/hungarian-notation-is-clea...
CaptainOfCoit
Only slightly related, but how common are bugs in GPUs and/or CUDA? I'm currently on day 5 of trying to debug why the GPT-OSS implementation I've written from scratch (not using PyTorch) isn't working correctly. I have it somewhat working with some naive and slow methods, but now that I'm writing a tensor-core implementation I've been stuck for 2-3 days on a small numerical difference I can't explain.
Every day I'm getting closer to believing this is some sort of hardware bug in Blackwell or in CUDA itself, but as we know, the bug is (almost) never in the compiler or in the hardware. Until it is...
hansvm
They exist, but they're not that common (give or take the "expected" numerical deviations based on the order of summation and whatnot, which can both be nontrivial and propagate error further).
Something I recommend doing, the best time being the start of the project and the second best time being now, is adding numerical gradient checking tests to all operations. You will make mistakes in your kernels from time to time, and it's valuable to know at a glance where those mistakes are.
Mind you, it's possible to write both the forward pass and the backward pass in a way that's wrong but compatible. An additional layer of checks I like to add is a dead-simple implementation of all algorithms -- no vectorization, no fancy blocking or re-orderings, nothing. Compare results to the simple implementation.
It sounds like a lot of work, but writing an optimized kernel is much slower than the numerical gradient checking and the simple kernel, and given how in numerical code it's basically impossible to identify the source of a bug without doing the equivalent of all of those checks, it only takes one bug in the whole project for the effort to pay off.
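A minimal sketch of that reference-implementation idea, using NumPy and a deliberately boring triple-loop matmul as the ground truth; everything here is illustrative, and the tolerances would need loosening for reduced-precision tensor-core paths:

    import numpy as np

    def matmul_reference(a, b):
        # No vectorization, no blocking, no reordering: as boring as possible.
        m, k = a.shape
        k2, n = b.shape
        assert k == k2
        out = np.zeros((m, n), dtype=np.float64)
        for i in range(m):
            for j in range(n):
                acc = 0.0
                for p in range(k):
                    acc += float(a[i, p]) * float(b[p, j])
                out[i, j] = acc
        return out

    rng = np.random.default_rng(0)
    a = rng.standard_normal((16, 32)).astype(np.float32)
    b = rng.standard_normal((32, 8)).astype(np.float32)

    fast = a @ b  # stand-in for the optimized kernel under test
    ref = matmul_reference(a, b)

    print("max abs err:", np.abs(fast - ref).max())
    assert np.allclose(fast, ref, atol=1e-4, rtol=1e-4)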
CaptainOfCoit
Thanks a lot for the pointers. I think I've taken a similar approach to what you suggest: lots of (relatively) tiny tests for each step in the process, plus sanity checks between the naive version I wrote first, which works and does inference correctly, and the new kernel, which is a lot more performant but currently incorrect and produces incoherent output.
I'll try replacing bits with simplified versions though; that could at least help narrow down where the issue is.
If anyone has more debugging tips, I'd greatly appreciate them! Nothing is too small or "obvious"; I'm more or less about to lose my mind.
QuadmasterXLII
You may be running into Jensen (Huang)'s inequality,
E(loss).cuda() <= E(loss.cuda())
CaptainOfCoit
That would make sense, I suppose, if I were running the same thing on two different GPUs and getting two different outcomes. But instead I have two implementations (one naive, one using tensor cores) running on the same GPU and getting different outcomes where they should be the same.
But then this joke might be flying over my head as well.
p1esk
Tensor cores use lower precision, so small numerical differences should be expected.
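A rough illustration of the scale of drift to expect from reduced precision alone; this just rounds the inputs through float16 on the CPU, which isn't exactly what tensor cores do (fp16/bf16 or tf32 inputs with fp32 accumulation), but the order of magnitude is the point:

    import torch

    torch.manual_seed(0)
    a = torch.randn(512, 512, dtype=torch.float64)
    b = torch.randn(512, 512, dtype=torch.float64)

    ref = a @ b                                           # float64 reference
    f32 = (a.float() @ b.float()).double()                # float32 inputs
    f16 = (a.half().float() @ b.half().float()).double()  # inputs rounded to float16

    print("float32 vs float64, max abs diff:", (f32 - ref).abs().max().item())
    print("float16-rounded vs float64      :", (f16 - ref).abs().max().item())
    # Typically around 1e-4 for float32 and 1e-1 for the float16-rounded case at
    # this size: small in relative terms, but nowhere near bit-identical. Results
    # that are off by orders of magnitude usually point at a real bug (scaling,
    # accumulation, indexing) rather than precision.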
saagarjha
How big is the numerical difference? If it's small it might be within the precision of the operation itself.
CaptainOfCoit
Orders of magnitude off (so maybe "small numerical difference" was an understatement). My current hypothesis is that I'm doing scaling wrong somewhere, but I can't help occasionally sliding into "maybe there is something deeper wrong" territory in the evening after another day of this...
airza
I too have been insanely burned by an MPS bug. I wish Apple would throw an engineer or two at making sure their hardware works with PyTorch.
brilee
Great write-up, but I admit that I found the interweaving of human and AI-written content/headlines/summaries pretty distracting. I kept on wanting to scroll past, but had to keep on backtracking to find the human thread again.
I think if you want to give your reader a quick intro to, e.g., what is the Adam optimizer, a simple link to Wikipedia is fine. No need to copy-paste an AI tutorial on Adam into the blog post.
CaptainOfCoit
To be fair, you can easily click to hide those expanded sections. I found it a neat compromise between linking to (usually) obtuse Wikipedia articles, which aren't typically written for laypeople, and forcing me to read through stuff I already know. I just hid the sections I already understood and found value in the others.
hobom
What a fantastic way to write a post mortem, pedagogically very useful.
kccqzy
This is a minor quibble, but I don't really like the author calling Placeholder a leaky abstraction. It's just straight-up an incomplete abstraction that only handles inputs but not outputs. As the author says, Placeholder should know about the difference and do the copy-back itself.
gugagore
This is the first time I've seen "SGD" used to mean "standard gradient descent" rather than "stochastic gradient descent".
tavianator
Presumably that's just a mistake; the author correctly calls it "stochastic gradient descent" elsewhere in the article.
saagarjha
Non-contiguous tensors have to be the #1 source of bugs in PyTorch lol
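For anyone who hasn't hit this: a quick illustration of the contiguity trap (not the article's exact bug). Views like .t() share storage with swapped strides, and code that assumes row-major contiguous memory has to either handle that or copy:

    import torch

    x = torch.arange(12.0).reshape(3, 4)
    y = x.t()                       # a view: same storage, swapped strides

    print(x.is_contiguous())        # True
    print(y.is_contiguous())        # False
    print(y.stride())               # (1, 4), not the row-major (3, 1)

    # view() requires a compatible memory layout and refuses:
    try:
        y.view(-1)
    except RuntimeError as e:
        print("view failed:", e)

    # reshape()/contiguous() copy into the expected layout instead:
    z = y.contiguous()
    print(z.is_contiguous(), torch.equal(z.view(-1), y.reshape(-1)))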
The tinygrad folks talk about this a lot.
Not that I understand much of what they say, but it appears there are a lot of correctness bugs in PyTorch flying under the radar, probably with a measurable impact on model quality.
It would be interesting to compare the weights of the same model trained with both frameworks, to see whether they exhibit meaningfully different behavior.