Matrix Calculus (For Machine Learning and Beyond)
30 comments · March 29, 2025
sfpotter
If you want to get handy with matrix calculus, the real prerequisite is being comfortable with Taylor expansions and linear algebra.
In a graduate numerical optimization class I took over a decade ago, the professor spent 10 minutes on the first day deriving some matrix calculus identity by working out the expressions for partial derivatives using simple calculus rules and a lot of manual labor. Then, as the class was winding up, he joked and said "just kidding, don't do that... here's how we can do this with a Taylor expansion", and proceeded to derive the same identity in what felt like 30 seconds.
Also, don't forget the Jacobian and gradient aren't the same thing!
C-x_C-f
> Also, don't forget the Jacobian and gradient aren't the same thing!
Every gradient is a Jacobian but not every Jacobian is a gradient.
If you have a map f from R^n to R^m then the Jacobian at a point x is an m x n matrix which linearly approximates f at x. If m = 1 (namely if f is a scalar function) then the Jacobian is exactly the gradient.
If you already know about gradients (e.g. from physics or ML) and can't quite wrap your head around the Jacobian, the following might help (it's how I first got to understand Jacobians better):
1. write your function f from R^n to R^m as m scalar functions f_1, ..., f_m, namely f(x) = (f_1(x), ..., f_m(x))
2. take the gradient of f_i for each i
3. make an m x n matrix where the i-th row is the gradient of f_i
The matrix you build in step 3 is precisely the Jacobian. This is obvious if you know the definition, and it's not a mathematically remarkable fact, but for me at least it was useful for demystifying the whole thing (a quick numerical sketch of the recipe follows).
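For concreteness, here is a minimal numpy sketch of that three-step recipe using forward differences; the particular f below (a made-up map from R^3 to R^2) is just for illustration:

    import numpy as np

    def f(x):
        # A made-up map from R^3 to R^2, written as two scalar components:
        # f_1(x) = x0 * x1,  f_2(x) = sin(x2) + x0^2
        return np.array([x[0] * x[1], np.sin(x[2]) + x[0] ** 2])

    def numerical_gradient(g, x, h=1e-6):
        # Step 2: forward-difference gradient of a scalar function g at x
        grad = np.zeros_like(x)
        for j in range(len(x)):
            e = np.zeros_like(x)
            e[j] = h
            grad[j] = (g(x + e) - g(x)) / h
        return grad

    def numerical_jacobian(f, x, h=1e-6):
        # Steps 1 and 3: stack the gradient of each component f_i as the i-th row
        m = len(f(x))
        return np.vstack([numerical_gradient(lambda y: f(y)[i], x, h) for i in range(m)])

    x = np.array([1.0, 2.0, 0.5])
    print(numerical_jacobian(f, x))  # a 2 x 3 matrix: row i is the gradient of f_i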
sfpotter
For m = 1, the gradient is a "vector" (a column vector). The Jacobian is a functional/a linear map (a row vector, dual to a column vector). They're transposes of one another. For m > 1, I would normally just define the Jacobian as a linear map in the usual way and define the gradient to be its transpose. Remember that these are all just definitions at the end of the day and a little bit arbitrary.
oddthink
I'd say a gradient is usually a covector / one-form: it's a map from vector directions to a scalar change. I.e., df = f_x dx + f_y dy is what you can actually compute without a metric; it lives in T*M, not TM. If you have a direction vector (e.g. 2 d/dx), you can get from there to a scalar.
edflsafoiewq
Can you give an example?
sfpotter
If you mean for how to use Taylor expansions and linear algebra, here's one I just made up.
Let's say I want to differentiate tr(X^T X), where tr is the trace, X is a matrix, and X^T is its transpose. Expand:
tr((X + dX)^T (X + dX)) = tr(X^T X) + 2 tr(X^T dX) + tr(dX^T dX).
Our knowledge of linear algebra tells us that tr is a linear map. Hence, dX -> 2 tr(X^T dX) is the linear mapping corresponding to the Jacobian of tr(X^T X). With a little more work we could figure out how to write it as a matrix.
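For the curious, here is a small numpy check of that expansion with a randomly chosen X and dX; the last line uses the Frobenius inner product identity 2 tr(X^T dX) = <2X, dX>, which is the "matrix" form of the Jacobian (i.e. the gradient is 2X):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((3, 3))
    dX = rng.standard_normal((3, 3))

    def f(X):
        return np.trace(X.T @ X)

    # Directional derivative of f at X along dX, via a small finite step
    t = 1e-7
    numeric = (f(X + t * dX) - f(X)) / t

    # The linear map from the expansion above: dX -> 2 tr(X^T dX)
    analytic = 2 * np.trace(X.T @ dX)
    print(np.isclose(numeric, analytic, rtol=1e-4))  # True

    # The same linear map written against a matrix: 2 tr(X^T dX) = <2X, dX>_F,
    # so the gradient of tr(X^T X) with respect to X is 2X.
    print(np.isclose(np.sum(2 * X * dX), analytic))  # True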
godelski
https://math.stackexchange.com/questions/3680708/what-is-the-difference-between-the-jacobian-hessian-and-the-gradient
https://carmencincotti.com/2022-08-15/the-jacobian-vs-the-hessian-vs-the-gradient/
vismit2000
Check out this classic from 3b1b - How (and why) to raise e to the power of a matrix: https://youtu.be/O85OWBJ2ayo
kgwgk
For those who prefer reading (I’ve not seen the video, but it seems related):
https://sassafras13.github.io/MatrixExps/
“Thanks to a fabulous video by 3Blue1Brown [1], I am going to present some of the basic concepts behind matrix exponentials and why they are useful in robotics when we are writing down the kinematics and dynamics of a robot.”
esafak
They didn't show how to actually do it using matrix decomposition!
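For anyone wondering what that would look like: a minimal numpy/scipy sketch of the eigendecomposition route for a made-up diagonalizable matrix, checked against scipy.linalg.expm:

    import numpy as np
    from scipy.linalg import expm

    # For a diagonalizable A = V diag(w) V^{-1}, exp(A) = V diag(exp(w)) V^{-1}
    A = np.array([[0.0, 1.0],
                  [-2.0, -3.0]])  # eigenvalues -1 and -2
    w, V = np.linalg.eig(A)
    expA_decomp = V @ np.diag(np.exp(w)) @ np.linalg.inv(V)

    print(np.allclose(expA_decomp, expm(A)))  # True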
i_am_proteus
Those looking for a shorter primer could consult https://arxiv.org/abs/1802.01528
Valk3_
I've only skimmed through both of them, so I might be entirely incorrect here, but isn't the essential approach a bit different in each? The MIT one emphasizes not viewing matrices as tables of entries, but instead as holistic mathematical objects. So when they perform the derivatives, they try to avoid the "element-wise" approach to differentiation, while the one by Parr and Howard seems to take the "element-wise" approach, although with some shortcuts.
godelski
I got the same impression as you: the Bright, Edelman, and Johnson (MIT) notes seem more driven by mathematicians, whereas I find the Parr and Howard paper wanting. Though I agree with them:
> Note that you do not need to understand this material before you start learning to train and use deep learning in practice
I have an alternative version:
> You don't need to know math to train good models, but you do need to know math to know why your models are wrong.
Referencing "All models are wrong". I think another part is that the Bright, Edelman, and Johnson notes are also introducing concepts such as Automatic Differentiation, Root Finding, Finite Difference Methods, and ODEs. With that in mind, it is far more important to come at it from an approach where you understand the structures.
I think there is an odd pushback against math in the ML world (I'm an ML researcher), mostly because it is hard and there's a lot of success you can gain without it. But I don't think that should discourage people from learning math. And frankly, the math is extremely useful. If we're ever going to understand these models we're going to need to do a fuck ton more math. So best to get started sooner rather than later (if that's anyone's personal goal, anyway).
Valk3_
Regarding the math in ML, what I would love to see (links if you have any) is a nuanced take on the matter, showing examples from both sides, discussing in good faith what contributions one can make with and without a strong math background in the ML world.
edit: On the math side I've encountered one that seemed unique, as I haven't seen anything like it elsewhere: https://irregular-rhomboid.github.io/2022/12/07/applied-math.... However, it only points out courses from his math education that he thinks are relevant to ML; each course is given a very short description and/or a motivation as to its usefulness for ML.
I like these concluding remarks:
> Through my curriculum, I learned about a broad variety of subjects that provide useful ideas and intuitions when applied to ML. Arguably the most valuable thing I got out of it is a rough map of mathematics that I can use to navigate and learn more advanced topics on my own.
> Having already been exposed to these ideas, I wasn’t confused when I encountered them in ML papers. Rather, I could leverage them to get intuition about the ML part.
> Strictly speaking, the only math that is actually needed for ML is real analysis, linear algebra, probability and optimization. And even there, your mileage may vary. Everything else is helpful, because it provides additional language and intuition. But if you’re trying to tackle hard problems like alignment or actually getting a grasp on what large neural nets actually do, you need all the intuition you can get. If you’re already confused about the simple cases, you have no hope of deconfusing the complex ones.
westurner
> The class involved numerous example numerical computations using the Julia language, which you can install on your own computer following these instructions. The material for this class is also located on GitHub at https://github.com/mitmath/matrixcalc
dandanua
The Matrix Cookbook [1] can be handy when learning this topic.
refusingsalt
I think this encourages "look it up" and "this layout is just a convention" when one convention is much more "natural", i.e. at a point x, the dual element df(x) acts on vectors y via df(x)(y) = <grad f(x), y>.
I wouldn't teach it this way, but I would definitely take the Taylor expansion and define the gradient vector as the one that gives the best local linear approximation. This tells you that the gradient lives in the same space, i.e. has the same dimensions.
Of course, you can always switch things around if you want to calculate things and put in transposes where they belong. But I find it insane to take the unnatural viewpoint as standard, which I find a lot of papers do.
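To make that concrete, here is a tiny numpy sketch (with a made-up quadratic form f(x) = x^T A x as the example): the gradient read off from the Taylor expansion has the same shape as x, and pairing it with a direction h reproduces the first-order change, which is exactly the df(x)(y) = <grad f(x), y> pairing above:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 4))
    x = rng.standard_normal(4)

    def f(x):
        return x @ A @ x  # scalar-valued f(x) = x^T A x

    # From f(x + h) = f(x) + <(A + A^T) x, h> + O(|h|^2), the gradient is (A + A^T) x,
    # and it lives in the same space as x (shape (4,)).
    grad = (A + A.T) @ x

    h = 1e-6 * rng.standard_normal(4)
    print(np.isclose(f(x + h) - f(x), grad @ h, rtol=1e-3))  # True to first order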
hasley
When I worked at the university, this used to be my go-to reference about matrix identities (including matrix calculus).
vismit2000
3b1b classic on this topic with beautiful visualizations: https://youtu.be/O85OWBJ2ayo
windsignaling
Great course. I highly recommend that anyone interested in this topic check it out on the MIT website; it's taught by the same authors. They are great lecturers.
jeffreyrogers
Looks like the lectures from a prior version are on youtube too: https://www.youtube.com/playlist?list=PLUl4u3cNGP62EaLLH92E_...
Koncopd
I have skimmed it, and it looks very good. It is actually not solely about matrix calculus, but shows a practical approach to differentiation in different vector spaces with many examples and intuitions.
revskill
What does calculus mean?
seanhunter
Calculus is the branch of mathematics that deals with continuous change. Broadly speaking there are two parts to it: differential calculus, which deals with rates of change, and integral calculus, which deals with areas, volumes and that sort of thing. Pretty early on you learn that these are essentially two sides of the same coin.
So in this particular instance, since we are talking about matrix calculus, it's a type of multivariable calculus where you're dealing with functions which take matrices as inputs and "matrix-valued functions" (i.e. functions which return matrices as output).
Calculus is used for a lot of things but for example if you have a continuous function you can use calculus to find maxima and minima, inflection points etc. Since the focus here is machine learning, one of the most important things you want to be able to do is gradient descent to optimise some cost function by tweaking the weights on your model. The gradient here is a vector field which at every point in the space of your model points in the direction of steepest ascent[1]. So if you want to go down you take the gradient at a particular point and go exactly in the opposite direction. That means you know exactly what model weight tweaks will cause the biggest decrease in your cost function the next time you do a training run.
[1] To imagine the gradient, think of an old-school contour map like you’re going to do a hike or something. This is a “scalar field” (a map from spatial coordinates to a scalar value - altitude). The contour lines on the map link points of equal altitude. The gradient is a “vector field” which is a map from spatial coordinates to a vector. So imagine at every point on your contour map there was a little arrow that pointed in the direction of steepest ascent. Because this is matrix calculus you will be dealing with “matrix fields” (maps from spatial coordinates to matrices) as well. So for example say you did a measurement of stress in a steel beam. At each point you would have a “stress tensor” which says what the forces are at that point and in which direction they are pointing. This is a “tensor field” (map from spatial coordinates to a tensor) and a tensor is like a multidimensional matrix but with some additional rules about how it transforms.
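Returning to the gradient descent point above, here is a minimal sketch of that idea, with a made-up two-parameter quadratic cost standing in for a real model's loss; at each step we move against the gradient, i.e. opposite the direction of steepest ascent:

    import numpy as np

    def cost(w):
        # Made-up cost with its minimum at w = (3, -1)
        return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2

    def grad(w):
        # Gradient of the cost above, worked out by hand
        return np.array([2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)])

    w = np.zeros(2)
    lr = 0.04  # step size ("learning rate")
    for _ in range(200):
        w = w - lr * grad(w)  # step exactly opposite the gradient

    print(w, cost(w))  # w is close to [3, -1], the minimizer of the cost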
odyssey7
Time to schedule with a dentist
FilosofumRex
wait what - another math textbook recommendation by academicians. ML and LLMs are arts of tinkering, not academic subjects.
Though Steven Johnson is the real deal and writes lots of code, Edelman is a shyster/imposter who used to ride the coattails of G. Strang and now shills for Julia, where he makes most of his money. You don't need textbooks, and you won't understand ML/LLMs by reading them.
1. If you want to have a little fun with ML/LLMs, fire up Google Colab and run one of the tutorials on the web - Karpathy, Hugging Face or PyTorch examples.
2. If you don't want to do, but just read for fun, Howard & Parr's essay recommended by someone else here is much shorter and more succinct. This link renders better: https://explained.ai/matrix-calculus/
3. If you insist on academic textbooks, Boyd & Vandenberghe skips calculus and has more applications (engineering). Unfortunately, the code examples are in Julia! https://web.stanford.edu/~boyd/vmls/vmls.pdf (link to the Python version: https://web.stanford.edu/~boyd/vmls/)
4. If you want to become a tensor & differential programming ninja, learn JAX and XLA (see the small sketch below): https://docs.jax.dev/en/latest/quickstart.html https://colab.research.google.com/github/exoplanet-dev/jaxop...
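For a tiny taste of the JAX route, here is a minimal sketch (assuming a recent jax install) that redoes the tr(X^T X) example from elsewhere in this thread with automatic differentiation instead of by hand:

    import jax
    import jax.numpy as jnp

    def f(X):
        return jnp.trace(X.T @ X)

    X = jnp.arange(6.0).reshape(2, 3)
    print(jax.grad(f)(X))            # equals 2 * X, matching the hand derivation
    print(jax.jacobian(f)(X).shape)  # (2, 3): for scalar f this collapses to X's shape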