
Pen and Paper Exercises in Machine Learning (2022)

lucasoshiro

Seems to be cool, but one of the things that most annoys me about studying machine learning is that I can dive as deep as possible into the theory, yet I still can't see how it connects to practice, i.e. how it helps me choose the correct number of neurons in a layer, how many layers, the activation functions, whether I should use a neural network or another technique, and so on...

If someone has something explaining that, I'll be grateful.

danielmarkbruce

Most things can't be learned via pure theory or pure practice. Almost nothing related to work in the modern day can.

In ML not everything can be derived from theory. If it could, we wouldn't have been so surprised by the performance of really, really large language models. At the same time, if you can't reason about the math involved, you are going to have a difficult time figuring out why something isn't working or what options you have - whether around architecture, loss functions, choice of activation function, optimizer, hyperparameters, training time/resources, or a dozen other things.

grandiego

Beginner here. A takeaway I got from Andrew Ng's Coursera course (specifically for neural networks) is that adding more neurons and layers than the "minimum needed" is usually okay (that is, there is little risk of overfitting given reasonable regularization terms.) Sadly, there is no rule for that minimum, so you must do trial and error; on the other hand, carelessly extending the network is inefficient and eventually slow. For the activation functions, the output layer's is mostly determined by the problem being tackled, and for the inner layers you usually start with ReLU and then try some of the common variants using heuristics (again related to the problem at hand.) Of course you should consider other successful models for similar problems as your starting point.
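To make the "output activation follows the problem" rule concrete, here is a minimal sketch in pure Python (the logit values are arbitrary, chosen only for illustration): a sigmoid squashes one logit into a binary-class probability, while a softmax turns a vector of logits into a multiclass distribution; for regression you'd use no output activation at all.

```python
# Sketch: output activations for two common problem types.
# The logits below are made-up numbers, not from any real model.
import math

def sigmoid(z):
    """Binary classification: map one logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Multiclass classification: map logits to probabilities summing to 1."""
    m = max(zs)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

print(sigmoid(0.7))              # a single probability in (0, 1)
print(softmax([2.0, 1.0, 0.1]))  # three probabilities summing to 1
```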

joshdavham

I think you're just more interested in the practical side of ML which is totally fine!

I'm a bit skeptical of how much math and theory the average MLE actually needs. Obviously they do need some, but how much? I'm not sure.

But on the other hand, the theoreticians often need much more math. Something like the SVM could only have been invented by a math genius like Vapnik.

godelski

Just curious, how "deep" have you gone into the theory? What resources have you used? How strong is your math background?

Unfortunately a lot of the theory does require some heavy mathematics, the type you won't see in a typical undergraduate degree even in math-heavy subjects like physics: topics such as differential geometry, measure theory, set theory, abstract algebra, and high-dimensional statistics. But I do promise that the theory helps and can build some very strong intuition. It is also extremely important to have a deep understanding of what these mathematical operations are doing. This exercise book does look like it is trying to build that intuition, though I haven't read it in depth. I can say it is a good start, but only the very beginning of the theory journey; there is a long road ahead beyond this.

  > how it makes me choose the correct number of neurons in a layer, how many layers,
Take a look at the Whitney embedding theorem. While it doesn't give a precise answer, it'll help you build intuition about the minimal number of parameters you need (and the VGG paper will help you understand width vs depth). In a transformer, the MLP layer after attention scales the dimensions up 4x before coming back down, which allows any knots in the data to be untangled. While 2x is the minimum, 4x creates a smoother landscape, so the problem can be solved more easily. Some of this is discussed in the paper by Schaeffer, Miranda, and Koyejo that counters the famous Emergent Abilities paper by Wei et al. This should be covered early in ML courses, when discussing problems like XOR or the concentric circles: these problems are difficult because in their natural dimension you cannot draw a hyperplane discriminating the classes, but by increasing the dimensionality of the problem you can. The fact itself is usually mentioned in intro ML courses, but I'm not aware of one that goes into more detail, such as discussing the Whitney embedding theorem, which would let you generalize these concepts.
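The XOR point can be shown in a few lines of pure Python (the lifting feature x1*x2 is my choice of illustration, not the only one): no line separates XOR's classes in 2D, but after appending the product feature, a single hyperplane does the job.

```python
# Sketch: XOR is not linearly separable in 2D, but lifting each
# point to 3D with the extra feature x1*x2 makes a separating
# hyperplane easy to write down by hand.

points = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}  # XOR truth table

def lift(x1, x2):
    """Map a 2D input to 3D by appending the product feature."""
    return (x1, x2, x1 * x2)

def hyperplane(p):
    """In the lifted space, x1 + x2 - 2*x1*x2 >= 0.5 separates XOR."""
    x1, x2, x1x2 = p
    return 1 if x1 + x2 - 2 * x1x2 >= 0.5 else 0

for (x1, x2), label in points.items():
    assert hyperplane(lift(x1, x2)) == label
```

This is the same trick a hidden layer learns automatically: widen the representation until a linear decision boundary suffices.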

  > the activation functions
There's a very short video I like that visualizes GELU[0], even using the concentric circles! The channel has a lot of other visualizations that will really build your intuition. You may see where a differential geometry background provides benefits: understanding how to manipulate manifolds is critical to understanding what these networks are doing to the data. Unfortunately these visualizations stop helping once you scale beyond 3D, as weird things happen in high dimensions, even as low as 10[1]. A lot of visual intuition goes out the window, and this often leads people to either completely abandon it or make erroneous assumptions (no, your friend cannot visualize 4D objects[2,3], and that image you see of a tesseract is quite misleading).

The activation functions provide the non-linearity in these networks, a key ingredient missing from the perceptron model. Remember that by the universal approximation theorem you can approximate any smooth, Lipschitz-continuous function over a closed boundary. In simple cases you can relate this to Riemann summation, but using smooth "bump functions" instead of rectangles. I'm being fairly hand-wavy here on purpose, because this is not precise, but there are relationships to be found; this is an HN comment, so I have to oversimplify. Also remember that a linear layer without an activation can only perform affine transformations. That is, after all, what a matrix multiplication is capable of (another oversimplification).
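The affine point is easy to verify numerically: a sketch in pure Python (weights made up for illustration) showing that two linear layers composed without an activation collapse into a single affine map, so depth buys nothing until a non-linearity is inserted between them.

```python
# Sketch: composing two linear (affine) layers with no activation
# in between is itself one affine map: W = W2 @ W1, b = W2 @ b1 + b2.

def affine(W, b, x):
    """y = W x + b for a matrix W (list of rows) and vectors b, x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

W1, b1 = [[1.0, 2.0], [0.0, 1.0]], [0.5, -0.5]   # layer 1 (made up)
W2, b2 = [[2.0, 0.0], [1.0, 1.0]], [0.0, 1.0]    # layer 2 (made up)

x = [3.0, -1.0]
two_layers = affine(W2, b2, affine(W1, b1, x))

# The equivalent single layer:
W = [[sum(W2[i][k] * W1[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]            # W2 @ W1
b = affine(W2, b2, b1)             # W2 @ b1 + b2
one_layer = affine(W, b, x)

assert all(abs(a - c) < 1e-9 for a, c in zip(two_layers, one_layer))
```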

The learning curve is quite steep, and there's a big jump from the common "it's just GMMs" or "it's just linear algebra" claims[4]. There is a lot of depth here, and unfortunately, due to the hype, there is a lot of material labeled "deep" or "advanced mathematics"; it is important to remember that these terms are extremely relative. What is deep to one person is shallow to another. But if a resource isn't going beyond calculus, you are going to struggle, and I am extremely empathetic to that. Again, though, I do promise that there is a lot of insight to be gained by digging into the mathematics. There is benefit to doing things the hard way. I won't try to convince you that it is easy, or that there isn't a lot of noise surrounding the topic, because that would be a lie. If it were easy, ML systems wouldn't be "black boxes"![5]

I would also encourage you to learn some metaphysics. Something like Ian Hacking's Representing and Intervening is a good start. There are limitations to what can be understood through experimentation alone, famously illustrated in Dyson's recounting of when Fermi rejected his paper[6]. There is a common misunderstanding of the saying "with 4 parameters I can fit an elephant and with 5 I can make it wiggle its trunk"; [6] can help provide a better understanding of it, but we truly do need to understand the limitations of empirical studies. Science relies on the combination of empirical study and theory; neither is any good without the other. Science is about creating causal models, so one must be careful and extremely nuanced when doing any form of evaluation. The subtle details can easily trick you.

[0] https://www.youtube.com/watch?v=uiB97cPEVxM

[1] https://www.penzba.co.uk/cgi-bin/PvsNP.py?SpikeySpheres

[2] https://www.youtube.com/shorts/_n7TMDnYdVY

[3] https://www.youtube.com/watch?v=FfiQBvcdFG0

[4] https://news.ycombinator.com/item?id=43418334

[5] I actually dislike this term. It is better to say that they are opaque. A "black box" would imply that we have zero insight, but in reality we can see everything going on inside; it is just extremely difficult to interpret. We also do have some understanding, so the interpretation isn't impenetrable.

[6] https://www.youtube.com/watch?v=hV41QEKiMlM

moffkalast

> how it makes me choose the correct number of neurons in a layer, how many layers, the activation function

Seeing massive ablation studies on each of those in just about every ML paper should be fairly indicative that nobody knows shit about fuck when it comes to that. It's just people trying things out randomly and seeing what works, copying ideas from each other, resulting in some vague guidelines. It's the worst field if you want things to be logical and explainable. It's mostly labelling datasets, paying for compute, and hoping for the best.

bob1029

> nobody knows shit about fuck when it comes to that

This is why I've abandoned neural networks as a computational substrate for genetic programming experiments.

Tape-based UTMs may be extremely rigid in how they execute instruction streams, but at least you can eventually understand and describe everything that contributes to their behavior.

Changing the fan-out from 12 to 15 in a NN is like an ancient voodoo ritual, compared to realizing a program tape is probably not long enough based on rough entropy measures.

hansvm

> all of the above

NFL (the no-free-lunch theorem) says something about it being a wash for arbitrary data. All results are going to be tuned to the assumptions we have about our data in particular (not too many discontinuities, sufficiently well sampled, ...).

> neurons in a layer, how many layers, ...

Scaling laws are, currently, empirically derived. From those you can pick your goals (e.g., at most $X and maximize accuracy) and work backward to one or more optimal sets of parameters. Except in very restricted domains or with other strong assumptions I haven't seen anything giving you more than that.

> activation functions

All of the above (that it can't matter for arbitrary data, and that parameters need to be empirically derived) applies here as well. However: an important inductive bias a lot of practitioners use is that every weight in the model should be roughly equally important. There are other ways to choose activation functions, especially in specialized domains, but when designing a deep network one of the most important things you can do is control the magnitude of information at each level of backpropagation. If your activation function (and the surrounding infrastructure) approximately handles that problem, then it's probably good enough.
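A small sketch of the magnitude-control point (my own construction; the widths, depth, and 1/sqrt(fan_in) Xavier-style scale are illustrative choices, not the parent's prescription): push a random signal through a stack of tanh layers and compare a reasonably scaled weight initialization against one that is far too small.

```python
# Sketch: signal magnitude through a deep tanh stack under two
# weight scales. With ~1/sqrt(fan_in) init the signal keeps a
# usable scale; with a tiny init it vanishes layer by layer.
import math
import random

random.seed(0)
n = 64          # width of every layer (arbitrary choice)
depth = 20      # number of stacked layers (arbitrary choice)

def layer(x, scale):
    """One dense tanh layer with fresh Gaussian weights at the given scale."""
    W = [[random.gauss(0.0, scale) for _ in range(n)] for _ in range(n)]
    return [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in W]

def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

x0 = [random.gauss(0.0, 1.0) for _ in range(n)]
x_good, x_bad = x0, x0
for _ in range(depth):
    x_good = layer(x_good, 1.0 / math.sqrt(n))  # scaled init
    x_bad = layer(x_bad, 0.01)                  # too-small init

print(rms(x_good), rms(x_bad))  # healthy scale vs. vanished signal
```

The same bookkeeping applies in reverse to gradients during backpropagation, which is the parent's point about controlling magnitude at each level.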

> neural network or other techniques

For almost every problem you're better off using something other than a neural network (like catboost). I don't have any good intuition for why that's the case. Test them both. That's what the validation dataset is for.

> how it connects to the practice

For this article in particular, it doesn't connect to a ton of what I personally do, though I'm sure it resonates with someone. As soon as pytorch or jax or whatever isn't good enough and you have to implement stuff from scratch, you need a deep dive into the theory you're implementing. To a lesser degree, if you're interfacing with big frameworks nontrivially or working around their limitations, you still need a deep understanding of the things you're implementing.

Imagine, e.g., that you want all the modern ML tools in a world where dynamic allocation, virtual functions, and all that garbage aren't tractable. You can resoundingly beat every human heuristic for phantom touchpad events in your mouse driver with a tiny neural network, but you can't use pytorch to do it without turning your laptop into a space heater.
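A toy illustration of how small such a network can be (the weights here are hand-picked for demonstration, not trained, and the XOR task is a stand-in for a real classifier): inference is just a few multiply-adds with fixed weights, with no allocation and no framework, which is why it can live in a driver.

```python
# Sketch: a hand-wired 2-2-1 ReLU network evaluated in pure Python.
# Weights are chosen by hand so the net computes XOR; a real driver
# would bake in trained weights the same way.

def relu(v):
    return v if v > 0.0 else 0.0

def tiny_net(x1, x2):
    """Fixed-weight inference: two hidden ReLU units, one linear output."""
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)   # fires on any active input
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)   # fires only on (1, 1)
    out = 1.0 * h1 - 2.0 * h2
    return 1 if out > 0.5 else 0

assert [tiny_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```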

Embedded devices aren't the only scenario where you might have to venture off the beaten path. Much like the age-old argument of importing a data structure vs writing your own, as soon as you have requirements beyond what the library author provides it's often worth it to do the whole thing on your own, and it takes a firm theoretical foundation to do so swiftly and correctly.

> how it connects to practice

That's a criticism I have of a lot of educational materials. Connecting the dots is important in writing (even if it competes with the advantages of brevity).

Pick on the Model-Based Learning section as an example. We're asked, to start, to find the MLE of a Gaussian. (M)aximum (L)ikelihood (E)stimation is an extremely important concept, and a lot of ML practitioners throw it to the side.
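For readers who haven't done that exercise, a minimal sketch (the data points are made up): the Gaussian MLE has a closed form, the sample mean and the 1/N sample variance, and plugging those in maximizes the log-likelihood over the data.

```python
# Sketch: closed-form MLE for a Gaussian's mean and variance.
# The data below is invented for illustration.
import math

data = [2.1, 1.9, 2.4, 2.0, 1.6]
n = len(data)

mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n  # note 1/N, not 1/(N-1)

def log_likelihood(mu, var):
    """Gaussian log-likelihood of the data under parameters (mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var)
               - (x - mu) ** 2 / (2 * var) for x in data)

# The closed-form estimates beat nearby parameter choices:
assert log_likelihood(mu_hat, var_hat) >= log_likelihood(mu_hat + 0.1, var_hat)
assert log_likelihood(mu_hat, var_hat) >= log_likelihood(mu_hat, var_hat + 0.1)
```

The 2-stage pitfalls below are exactly about what happens when you treat point estimates like these as if they carried the whole distribution.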

Imagine, e.g., a 2-stage process where for each price bracket you have a model reporting the likelihood of conversion, and then a second stage where you synthesize those predictions into an optimal strategy. Common failure modes include (a) mishandling variance, (b) assuming that MLE on each of the models allows you to combine the mean/mode/... results into an MLE composite action, and (c), really an extension of (b), having the wrong loss function for your model(s), in which case they aren't meaningfully combinable, ....

Something that should be obvious (predict conversion rates, combine those rates to determine what you should do) has tons of pitfalls if you don't holistically reason about the composite process. That's perhaps a failure in the primitives we use to construct those composite processes, but in today's day and age it's still something you have to consider.

How does the book connect? I dunno. It looks more like a "kata" (keep your fundamental skills sharp) than anything else. An explicit connection to some real-world problem might make it more tractable.

FilosofumRex

Funny how mathematicians always try to sneak their linear algebra and matrix theory into ML. If you didn't know any better, you'd think academics had invented LLMs and were the experts to be consulted.

If anything, academics and theoreticians held ML back and forced generations of grad students to do symbolic proofs, like in this example, just because computational techniques were too lowbrow for them.

simojo

Very neat! Reminds me of Tom Yeh's "AI By Hand" exercises [0].

[0] https://www.byhand.ai/

Sysreq2

This is what I was expecting. Very much appreciated. OP’s paper is good - but I sort of feel like it’s preaching to the choir. It’s a great resource if you already know the material.

S4M

Looks neat! My only criticism is that the solutions are given right after the questions, so I couldn't help reading the answer to a question before thinking it through by myself.

plants

This is really neat! I work in machine learning but still feel imposter syndrome about my foundations in math (specifically linear algebra and matrix/tensor operations). Does anyone have more good resources for problem sets with an emphasis on foundational deep learning skills? I find I learn best if I do a bit of hands-on work every day (and if I can learn things from multiple teachers’ perspectives).

antipaul

So who among current ML practitioners building “useful” ML could solve some of these?

_Should they_ be able to?

jerrygenser

Nope, I don't think they should or need to be able to.

These exercises are useful for building the mathematical maturity that leads to the intuition needed to develop novel algorithms or low-level optimizations.

They are not needed to train and deploy existing ML algorithms in general.

danielmarkbruce

Depends on your definition of "ML practitioner", "building" and "ML". Look at the section on optimization - some people have an extremely good grasp of this and it helps them mentally iterate through possible loss functions and possible ways to update parameters and what can go wrong.

biotechbio

I am curious about the same thing. I worked as a ML engineer for several years and have a couple of degrees in the field. Skimming over the document, I recognized almost everything but I would not be able to recall many of these topics if asked without context, although at one time I might have been able to.

What are others' general level of recall for this stuff? Am I a charlatan who never was very good at math or is it just expected that you will forget these things in time if you're not using them regularly?

psyklic

Good news -- if you're not interested in extending the state of the art and simply want to call APIs, you don't have to learn ML deeply.

kingkongjaffa

complete with solutions, beautiful, thank you for sharing!

I'd be interested in more of these pen and paper exercises, if there is such a term, for other topics.

blackbear_

Not sure which other topics you mean, but "One Thousand Exercises in Probability" should keep you busy for a while (the PDF can be found online). For other math-oriented riddles, check out "The Colossal Book of Short Puzzles and Problems" and "The Art and Craft of Problem Solving".

dang

Discussed at the time:

Pen and paper exercises in machine learning (2021) - https://news.ycombinator.com/item?id=31913057 - June 2022 (55 comments)

axpy906

Love it.