Don’t Look Up: Sensitive internal links in the clear on GEO satellites [pdf]
satcom.sysnet.ucsd.edu
NanoChat – The best ChatGPT that $100 can buy
github.com
Dutch government takes control of Chinese-owned chipmaker Nexperia
cnbc.com
Why Study Programming Languages
people.csail.mit.edu
Why the push for Agentic when models can barely follow a simple instruction?
forum.cursor.com
Palisades Fire suspect's ChatGPT history to be used as evidence
rollingstone.com
No science, no startups: The innovation engine we're switching off
steveblank.com
Copy-and-Patch: A Copy-and-Patch Tutorial
transactional.blog
Ultrasound is ushering a new era of surgery-free cancer treatment
bbc.com
Sony PlayStation 2 fixing frenzy
retrohax.net
America is getting an AI gold rush instead of a factory boom
washingtonpost.com
First device based on 'optical thermodynamics' can route light without switches
phys.org
Show HN: SQLite Online – 11 years of solo development, 11K daily users
sqliteonline.com
KDE celebrates the 29th birthday and kicks off the yearly fundraiser
kde.org
Modern iOS Security Features – A Deep Dive into SPTM, TXM, and Exclaves
arxiv.org
Smartphones and being present
herman.bearblog.dev
JIT: So you want to be faster than an interpreter on modern CPUs
pinaraf.info
LLMs are getting better at character-level text manipulation
blog.burkert.me
DDoS Botnet Aisuru Blankets US ISPs in Record DDoS
krebsonsecurity.com
NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference
lmsys.org
vali, a C library for Varlink
emersion.fr
Strudel REPL – a music live coding environment living in the browser
strudel.cc
New York Times, AP, Newsmax and others say they won't sign new Pentagon rules
apnews.com
When GPUs started being used for deep learning (after AlexNet), they were not at all matmul machines. They were machines that excelled at most kinds of heavily parallel workloads. That still holds today, with the exception of tensor cores, which are additional hardware blocks designed to accelerate this specific task.
Matrix multiplication didn't "win" because hardware was designed for it. It won because matrix multiplication is a fundamental part of linear algebra and is very effective in deep learning (most of the functions you might want to write for deep learning can be expressed as a matmul); hardware acceleration came later. Additionally, matrix multiplication is a good fit for physics: you can design the hardware so that data movement is minimized, and most of the chip area and power go into actual computation rather than into moving data around.
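As a rough illustration of that point (a hypothetical numpy sketch, not from the original comment): a fully connected layer is literally a matmul plus a bias, and a 2D convolution can be lowered to a matmul via im2col.

    # Hypothetical sketch: common deep-learning ops expressed as matmuls.
    import numpy as np

    rng = np.random.default_rng(0)

    # 1) Fully connected layer: a matmul plus a bias.
    batch, d_in, d_out = 32, 256, 128
    x = rng.standard_normal((batch, d_in))
    w = rng.standard_normal((d_in, d_out))
    b = rng.standard_normal(d_out)
    fc_out = x @ w + b                              # (batch, d_out)

    # 2) 2D convolution lowered to a matmul via im2col: unfold every
    #    receptive field into a row, then multiply by the flattened filters.
    n, c, h, w_img = 8, 3, 16, 16                   # input in NCHW layout
    k, kh, kw = 4, 3, 3                             # 4 filters of size 3x3
    imgs = rng.standard_normal((n, c, h, w_img))
    filters = rng.standard_normal((k, c, kh, kw))

    out_h, out_w = h - kh + 1, w_img - kw + 1
    cols = np.empty((n, out_h * out_w, c * kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            patch = imgs[:, :, i:i + kh, j:j + kw]  # (n, c, kh, kw)
            cols[:, i * out_w + j, :] = patch.reshape(n, -1)

    conv_out = cols @ filters.reshape(k, -1).T      # (n, out_h*out_w, k)
    conv_out = conv_out.transpose(0, 2, 1).reshape(n, k, out_h, out_w)
    print(fc_out.shape, conv_out.shape)

The same trick covers most other layers (attention, for instance, is two batched matmuls plus a softmax), which is part of why a matmul-heavy chip covers so much of deep learning.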
Fundamentally, you also want your algorithm to be compatible with real-world physics. Heavy parallelism is required because you cannot physically build a fast chip that processes long chains of dependent operations: signals simply cannot propagate through transistors fast enough. Even CPUs, which present a non-parallel programming model, have to rely on expensive tricks like speculative out-of-order execution to extract parallelism from "sequential" code in order to make it fast.
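A toy sketch of that distinction (hypothetical example, not from the comment): the first loop below is a dependency chain that no amount of hardware can overlap, while the second restructures the same reduction into independent partial sums, the kind of structure out-of-order execution, SIMD units, and GPUs can actually run in parallel.

    # Hypothetical sketch: dependent chain vs. independent partial sums.
    import numpy as np

    x = np.random.default_rng(1).standard_normal(1 << 16)

    # Dependency chain: every iteration needs the previous result, so the
    # work is serial by construction.
    acc = 0.0
    for v in x:
        acc += v

    # Independent partial sums: eight chains with no dependencies between
    # them, combined at the end -- the structure that vectorized and
    # parallel reductions rely on.
    lanes = x.reshape(8, -1).sum(axis=1)
    acc_parallel = lanes.sum()

    print(acc, acc_parallel)   # same result up to float rounding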
In general though, I personally wish chips were designed with programmability in mind. A fixed-function matrix multiplier might be slightly more efficient than a parallel computing chip built from smaller matrix multipliers, but the latter would be significantly more programmable, and you could design much more interesting (and potentially more efficient) algorithms for it.