How to Think About GPUs
15 comments
· August 18, 2025 · nickysielicki
aschleck
It's been a while since I thought about this, but isn't the reason providers advertise only 3.2 Tbps because that's the limit of a single node's connection to the IB network? DGX is spec'ed to pair each H100 with a ConnectX-7 NIC, and those cap out at 400 Gbps. 8 GPUs × 400 Gbps/GPU = 3.2 Tbps.
Quiz 2 is confusingly worded but is, iiuc, referring to intranode GPU connections rather than internode networking.
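The per-node arithmetic above can be sketched directly (a minimal check; the 8-NIC, 400 Gbps figures are the DGX spec numbers quoted in the comment):

```python
# Per-node InfiniBand bandwidth for a DGX H100-style node:
# 8 GPUs, each paired with one ConnectX-7 NIC capped at 400 Gbps.
gpus_per_node = 8
nic_gbps = 400  # ConnectX-7 line rate, in Gbps

node_tbps = gpus_per_node * nic_gbps / 1000  # Gbps -> Tbps
print(node_tbps)  # 3.2, matching the advertised per-node figure
```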
physicsguy
It’s interesting that nvshmem has taken off in ML because the MPI equivalents were never that satisfactory in the simulation world.
Mind you, I did all long range force stuff which is difficult to work with over multiple nodes at the best of times.
gregorygoc
It's mind-boggling that this resource hasn't been provided by NVIDIA yet. It has reached the point where third parties reverse engineer and summarize NV hardware to the point that it becomes an actually useful mental model.
What are the actual incentives at NVIDIA? If it’s all about marketing they’re doing great, but I have some doubts about engineering culture.
aanet
Fantastic resource! Thanks for posting it here.
akshaydatazip
Thanks for the really thorough research. Just what I wanted with my morning coffee.
porridgeraisin
A short addition: pre-Volta NVIDIA GPUs were SIMD, like TPUs are, and not SIMT, which post-Volta NVIDIA GPUs are.
camel-cdr
SIMT is just a programming model for SIMD.
Modern GPUs still are just SIMD with good predication support at ISA level.
achierius
That's not true. SIMT notably allows for divergence and reconvergence, whereby single threads actually end up executing different work for a time, while in SIMD you have to always be in sync.
adrian_b
"Divergence" is supported by any SIMD processor, but with various amounts of overhead depending on the architecture.
"Divergence" means that every "divergent" SIMD instruction is executed at least twice, with different masks, so that it is actually executed only on a subset of the lanes (i.e. CUDA "threads").
SIMT is a programming model, not a hardware implementation. NVIDIA has never explained exactly how the execution of divergent threads has been improved since Volta, but it is certain that, like before, the CUDA "threads" are not threads in the traditional sense, i.e. the CUDA "threads" do not have independent program counters that can be active simultaneously.
What seems to have been added since Volta is some mechanism for quickly saving and restoring separate program counters for each CUDA "thread", in order to handle data dependencies between distinct CUDA "threads" by activating the "threads" in the proper order. But those saved per-"thread" program counters cannot become active simultaneously if they have different values, so you cannot simultaneously execute instructions from different CUDA "threads" unless they perform the same operation, which is the same constraint that exists in any SIMD processor.
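A toy model of what masked "divergence" means on a SIMD machine (a sketch only; the function and names are made up for illustration): each side of a branch issues as a separate masked instruction, so lanes never execute different operations in the same step.

```python
def run_masked(lanes, program):
    """Execute a straight-line 'program' of (op, mask) pairs on SIMD lanes.

    Each entry issues exactly one operation; lanes where the mask is False
    sit idle that step. This mirrors how a branch is lowered to two masked
    instruction streams rather than truly independent threads.
    """
    issued = []  # one record per instruction issue (i.e., per "cycle")
    for op, mask in program:
        issued.append((op, tuple(mask)))
        for lane in range(len(lanes)):
            if mask[lane]:
                lanes[lane] = op(lanes[lane])
    return issued

# if (lane < 4) x += 1; else x *= 10;  -- lowered to two masked issues
lanes = [0, 0, 0, 0, 5, 5, 5, 5]
mask = [i < 4 for i in range(8)]
inv = [not m for m in mask]
trace = run_masked(lanes, [
    (lambda x: x + 1, mask),   # "then" side, lanes 0-3
    (lambda x: x * 10, inv),   # "else" side, lanes 4-7
])
print(lanes)       # [1, 1, 1, 1, 50, 50, 50, 50]
print(len(trace))  # 2: each divergent instruction issues once per mask
```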
camel-cdr
I'm not aware of any GPU that implements this.
Even the interleaved execution introduced in Volta still can only execute one type of instruction at a time [1]. This feature wasn't meant to accelerate code, but to allow more composable programming models [2].
Going off the diagram, it looks equivalent to rapidly switching between predicates, not executing two different operations at once.
    if (threadIdx.x < 4) {
        A;
        B;
    } else {
        X;
        Y;
    }
    Z;
The diagram shows how this executes in the following order.

Volta:

    ->|   ->X   ->Y ->Z|->
    ->|->A   ->B    ->Z|->

pre-Volta:

    ->|      ->X->Y|->Z
    ->|->A->B      |->Z
The SIMD equivalent of pre-Volta is:

    vslt mask, vid, 4
    vopA ..., mask
    vopB ..., mask
    vopX ..., ~mask
    vopY ..., ~mask
    vopZ ...
The Volta model is:

    vslt mask, vid, 4
    vopA ..., mask
    vopX ..., ~mask
    vopB ..., mask
    vopY ..., ~mask
    vopZ ...
[1] https://chipsandcheese.com/i/138977322/shader-execution-reor...
[2] https://stackoverflow.com/questions/70987051/independent-thr...
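To make the two schedules concrete, here is a toy simulation (A/B/X/Y/Z are arbitrary placeholder operations, not from any real ISA): both orders issue the same five masked operations one at a time and end in the same state; only the interleaving differs.

```python
def run(schedule, lanes):
    """Apply a schedule of (name, fn, mask) steps; one masked op per step."""
    for _name, fn, mask in schedule:
        for i in range(len(lanes)):
            if mask[i]:
                lanes[i] = fn(lanes[i])
    return lanes

n = 8
mask = [i < 4 for i in range(n)]   # "then" side: lanes 0-3
inv = [not m for m in mask]        # "else" side: lanes 4-7
everyone = [True] * n

A = ("A", lambda x: x + 1, mask)
B = ("B", lambda x: x * 2, mask)
X = ("X", lambda x: x + 10, inv)
Y = ("Y", lambda x: x * 3, inv)
Z = ("Z", lambda x: x - 1, everyone)

pre_volta = [A, B, X, Y, Z]   # each side runs to completion, then Z
volta     = [A, X, B, Y, Z]   # sides interleave; Z still issues once for all

print(run(pre_volta, [0] * n))
print(run(volta, [0] * n))    # same final state either way
```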
porridgeraisin
I was referring to this portion of TFA
> CUDA cores are much more flexible than a TPU’s VPU: GPU CUDA cores use what is called a SIMT (Single Instruction Multiple Threads) programming model, compared to the TPU’s SIMD (Single Instruction Multiple Data) model.
adrian_b
This flexibility of CUDA is a software facility, which is independent of the hardware implementation.
For any SIMD processor one can write a compiler that translates a program written for the SIMT programming model into SIMD instructions. For example, for the Intel/AMD CPUs with SSE4/AVX/AVX-512 ISAs, there exists a compiler of this kind (ispc: https://github.com/ispc/ispc).
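A rough illustration of that translation (a sketch of the idea only, not what ispc actually emits): a per-"thread" SIMT-style kernel with a branch becomes whole-vector operations combined under a mask.

```python
def simt_kernel(tid, x):
    # What the programmer writes: a scalar function per "thread".
    if tid < 4:
        return x + 1
    return x * 10

def simd_lowered(xs):
    # What a SIMT-on-SIMD compiler conceptually emits: compute both
    # sides as vector ops, then select per lane -- no per-lane branching.
    mask = [tid < 4 for tid in range(len(xs))]
    then_side = [x + 1 for x in xs]
    else_side = [x * 10 for x in xs]
    return [t if m else e for m, t, e in zip(mask, then_side, else_side)]

xs = [0, 1, 2, 3, 4, 5, 6, 7]
assert simd_lowered(xs) == [simt_kernel(tid, x) for tid, x in enumerate(xs)]
```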
tomhow
Discussion of original series:
How to scale your model: A systems view of LLMs on TPUs - https://news.ycombinator.com/item?id=42936910 - Feb 2025 (30 comments)
nickysielicki
The calculation under "Quiz 2: GPU nodes" is incorrect, to the best of my knowledge. There aren't enough ports for each GPU and/or for each switch (less the crossbar connections) to fully realize the 450 GB/s that's theoretically possible, which is why 3.2 TB/s of internode bandwidth is what's offered on all of the major cloud providers and the reference systems. If it were 3.6 TB/s, this would produce internode bottlenecks in any distributed ring workload.
Shamelessly: I'm open to work if anyone is hiring.