
The most important machine learning equations: A comprehensive guide

cl3misch

In the entropy implementation:

    return -np.sum(p * np.log(p, where=p > 0))
Using `where` in ufuncs like `log` leaves the output uninitialized (undefined) at the positions where the condition is not met. Summing over that array will give incorrect results.

Better would be e.g.

    return -np.sum((p * np.log(p))[p > 0])
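
For completeness, a self-contained sketch of both variants (assuming `p` is a 1-D array of probabilities; the function names are just for illustration):

    import numpy as np

    def entropy(p):
        # Shannon entropy in nats, using the convention 0 * log(0) = 0.
        p = np.asarray(p, dtype=float)
        p_nz = p[p > 0]  # drop zero entries before taking the log
        return -np.sum(p_nz * np.log(p_nz))

    def entropy_where(p):
        # If you do want the `where=` form, pass an explicitly zeroed `out`
        # so the skipped positions hold 0.0 instead of uninitialized memory.
        p = np.asarray(p, dtype=float)
        logs = np.log(p, out=np.zeros_like(p), where=p > 0)
        return -np.sum(p * logs)

    print(entropy([0.5, 0.5, 0.0]))        # ~0.6931 = ln 2
    print(entropy_where([0.5, 0.5, 0.0]))  # same value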
Also, the cross entropy code doesn't match the equation. And, as explained in the comment below the post, Ax+b is not a linear operation but affine (because of the +b).
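
For reference, the standard discrete cross entropy is H(p, q) = -sum_i p_i * log(q_i); a minimal sketch of that equation (not the post's code) would be:

    import numpy as np

    def cross_entropy(p, q, eps=1e-12):
        # H(p, q) = -sum_i p_i * log(q_i), in nats.
        # eps is a small constant guarding against log(0) when q has zeros.
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        return -np.sum(p * np.log(q + eps))

    # Cross entropy of a distribution with itself reduces to its entropy.
    print(cross_entropy([0.5, 0.5], [0.5, 0.5]))  # ~0.6931 = ln 2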

Overall it seems like an imprecise post to me. Not bad, but not stringent enough to serve as a reference.

jpcompartir

I would echo some caution about using it as a reference, since in another blog post the writer states:

"Backpropagation, often referred to as “backward propagation of errors,” is the cornerstone of training deep neural networks. It is a supervised learning algorithm that optimizes the weights and biases of a neural network to minimize the error between predicted and actual outputs.."

https://chizkidd.github.io/2025/05/30/backpropagation/

backpropagation is a supervised machine learning algorithm, pardon?

cl3misch

I actually see this a lot: confusing backpropagation with gradient descent (or any optimizer). Backprop is just a way to compute the gradients of the cost function with respect to the weights, not an algorithm to minimize the cost function wrt. the weights.

I guess giving the (mathematically) simple principle of computing gradients with the chain rule the fancy name "backpropagation" dates back to the early days of AI, when computers were much less powerful and this seemed less obvious?
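
To make the distinction concrete, here is a minimal numpy sketch (a toy single-layer example with made-up names, not any particular framework): the chain-rule step produces the gradients, and the optimizer step that uses them is separate.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 3))   # toy inputs
    y = rng.normal(size=(8, 1))   # toy targets
    W = rng.normal(size=(3, 1))   # weights
    b = np.zeros((1, 1))          # bias

    # Forward pass: affine map plus MSE cost.
    y_hat = X @ W + b
    cost = np.mean((y_hat - y) ** 2)
    print("cost before step:", cost)

    # "Backpropagation": chain rule, yielding gradients of the cost
    # with respect to W and b. No minimization happens here.
    d_yhat = 2 * (y_hat - y) / len(y)
    dW = X.T @ d_yhat
    db = d_yhat.sum(axis=0, keepdims=True)

    # "Gradient descent": the optimizer step that actually uses the
    # gradients to reduce the cost.
    lr = 0.1
    W -= lr * dW
    b -= lr * db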


cgadski

> This blog post has explored the most critical equations in machine learning, from foundational probability and linear algebra to advanced concepts like diffusion and attention. With theoretical explanations, practical implementations, and visualizations, you now have a comprehensive resource to understand and apply ML math. Point anyone asking about core ML math here—they’ll learn 95% of what they need in one place!

It makes me sad to see LLM slop on the front page.

maerch

Apart from the “—”, what else gives it away? Just asking from a non-native perspective.

kace91

Not OP, but it reads very clearly as the final summary an AI produces to tell the user that the post they asked it to write is now done.

bee_rider

Are eigenvalues or singular values used much in the popular recent stuff, like LLMs?

calebkaiser

LoRA uses singular value decomposition to get the low-rank matrices. In different optimizers, you'll also see eigendecomposition or some approximation of it used (I think Shampoo does something like this, but it's been a while).
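
As a generic illustration of the low-rank idea (a sketch of truncated SVD in plain numpy, not any specific adapter implementation): keeping the top r singular values and vectors gives the best rank-r approximation of a weight matrix, and the two thin factors are exactly the kind of small matrices these methods work with.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(64, 64))  # stand-in for a weight matrix
    r = 8                          # target rank

    # Truncated SVD: keep the r largest singular values/vectors.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]           # shape (64, r)
    B = Vt[:r, :]                  # shape (r, 64)

    W_approx = A @ B               # best rank-r approximation (Eckart-Young)
    rel_err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
    print(f"rank {r} relative error: {rel_err:.3f}")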

bob1029

MSE remains my favorite distance measure by a long shot. Its quadratic nature still helps even in non-linear problem spaces where convexity is no longer guaranteed. When working with generic/raw binary data, where Hamming distance would be theoretically more appropriate, I still prefer MSE over byte-level values because of this property.

Other fitness measures take much longer to converge or are very unreliable in how they bootstrap. MSE can start from dead cold, threading the needle through 20 hidden layers, and still give you a workable gradient in a short amount of time.
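
A small sketch of the property being described (a made-up byte-valued example): Hamming distance treats any mismatch the same, while MSE's quadratic penalty reflects how far off each value is, which is what gives the optimizer something to follow.

    import numpy as np

    target = np.frombuffer(b"hello world", dtype=np.uint8).astype(float)
    almost = target.copy()
    almost[0] += 1.0    # one byte off by 1
    far = target.copy()
    far[0] += 100.0     # one byte off by 100

    def mse(a, b):
        return np.mean((a - b) ** 2)

    def hamming(a, b):
        return np.count_nonzero(a != b)

    # Hamming scores both guesses as equally wrong (1 mismatched byte);
    # MSE distinguishes "close" from "way off", so small improvements
    # still move the fitness measure.
    print(hamming(almost, target), hamming(far, target))  # 1 1
    print(mse(almost, target), mse(far, target))          # ~0.09 vs ~909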