
Pen and Paper Exercises in Machine Learning (2022)

lucasoshiro

Seems to be cool, but one of the things that most annoys me about studying machine learning is that I can dive as deep as possible into the theory, yet I still can't see how it connects to practice, i.e. how it helps me choose the correct number of neurons in a layer, how many layers, the activation functions, whether I should use a neural network or another technique, and so on...

If someone has something explaining that, I'll be grateful.

danielmarkbruce

Most things can't be learned via pure theory or pure practice. Almost nothing related to work in the modern day can.

In ML not everything can be derived from theory. If it could, we wouldn't have been so surprised by the performance of really, really large language models. At the same time, if you can't reason about the math involved, you are going to have a difficult time figuring out why something isn't working or what options you have - whether around architecture, loss functions, choice of activation function, optimizer, hyperparameters, training time/resources, or a dozen other things.

grandiego

Beginner here. A takeaway I got from Andrew Ng's Coursera course (specifically for neural networks) is that adding more neurons and layers than the "minimum needed" is usually okay (that is, there is little risk of overfitting given reasonable regularization terms.) Sadly, there is no rule for that minimum, so you must do trial and error; on the other hand, carelessly extending the network is inefficient and eventually slow. For the activation functions, the output layer's is mostly determined by the problem being tackled, and for the inner layers you usually start with ReLU and then try some of the common variants using heuristics (again related to the problem at hand.) Of course you should consider other successful models for similar problems as your starting point.
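To make the "output activation follows the problem" rule concrete, here is a minimal sketch in pure Python (the logit values are arbitrary, chosen only for illustration): a sigmoid squashes one logit into a binary-class probability, while a softmax turns a vector of logits into a multiclass distribution; for regression you'd use no output activation at all.

```python
# Sketch: output activations for two common problem types.
# The logits below are made-up numbers, not from any real model.
import math

def sigmoid(z):
    """Binary classification: map one logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Multiclass classification: map logits to probabilities summing to 1."""
    m = max(zs)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

print(sigmoid(0.7))              # a single probability in (0, 1)
print(softmax([2.0, 1.0, 0.1]))  # three probabilities summing to 1
```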

joshdavham

I think you're just more interested in the practical side of ML which is totally fine!

I'm a bit skeptical of how much math and theory the average MLE actually needs. Obviously they do need some, but how much? I'm not sure.

But on the other hand, the theoreticians often need much more math. Something like the SVM could only have been invented by a math genius like Vapnik.

godelski

Just curious, how "deep" have you gone into the theory? What resources have you used? How strong is your math background?

Unfortunately a lot of the theory does require some heavy mathematics, the type you won't see in a typical undergraduate degree even in math-heavy subjects like physics: topics such as differential geometry, measure theory, set theory, abstract algebra, and high-dimensional statistics. But I do promise that the theory helps and can build some very strong intuition. It is also extremely important to have a deep understanding of what these mathematical operations are doing. This exercise book does look like it is trying to build that intuition, though I haven't read it in depth. I can say it is a good start, but only the very beginning of the theory journey; there is a long road ahead beyond this.

  > how it makes me choose the correct number of neurons in a layer, how many layers,
Take a look at the Whitney embedding theorem. While it doesn't give a precise answer, it'll help you build intuition about the minimal number of parameters you need (and the VGG paper will help you understand width vs depth). In a transformer, the MLP layer after attention scales the dimensions up 4x before coming back down, which allows any knots in the data to be untangled. While 2x is the minimum, 4x creates a smoother landscape, so the problem can be solved more easily. Some of this is discussed in the paper by Schaeffer, Miranda, and Koyejo that counters the famous Emergent Abilities paper by Wei et al. This should be covered early in ML courses, when discussing problems like XOR or the concentric circles: these problems are difficult because in their natural dimension you cannot draw a hyperplane discriminating the classes, but by increasing the dimensionality of the problem you can. The fact itself is usually mentioned in intro ML courses, but I'm not aware of one that goes into more detail, such as discussing the Whitney embedding theorem, which would let you generalize these concepts.
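The XOR point can be shown in a few lines of pure Python (the lifting feature x1*x2 is my choice of illustration, not the only one): no line separates XOR's classes in 2D, but after appending the product feature, a single hyperplane does the job.

```python
# Sketch: XOR is not linearly separable in 2D, but lifting each
# point to 3D with the extra feature x1*x2 makes a separating
# hyperplane easy to write down by hand.

points = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}  # XOR truth table

def lift(x1, x2):
    """Map a 2D input to 3D by appending the product feature."""
    return (x1, x2, x1 * x2)

def hyperplane(p):
    """In the lifted space, x1 + x2 - 2*x1*x2 >= 0.5 separates XOR."""
    x1, x2, x1x2 = p
    return 1 if x1 + x2 - 2 * x1x2 >= 0.5 else 0

for (x1, x2), label in points.items():
    assert hyperplane(lift(x1, x2)) == label
```

This is the same trick a hidden layer learns automatically: widen the representation until a linear decision boundary suffices.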

  > the activation functions
There's a very short video I like that visualizes GELU[0], even using the concentric circles! The channel has a lot of other visualizations that will really build your intuition. You may see where a differential geometry background provides benefits: understanding how to manipulate manifolds is critical to understanding what these networks are doing to the data. Unfortunately these visualizations stop helping once you scale beyond 3D, as weird things happen in high dimensions, even as low as 10[1]. A lot of visual intuition goes out the window, and this often leads people to either completely abandon it or make erroneous assumptions (no, your friend cannot visualize 4D objects[2,3], and that image you see of a tesseract is quite misleading).

The activation functions provide the non-linearity in these networks, a key ingredient missing from the perceptron model. Remember that by the universal approximation theorem you can approximate any smooth, Lipschitz-continuous function over a closed boundary. In simple cases you can relate this to Riemann summation, but using smooth "bump functions" instead of rectangles. I'm being fairly hand-wavy here on purpose, because this is not precise, but there are relationships to be found; this is an HN comment, so I have to oversimplify. Also remember that a linear layer without an activation can only perform affine transformations. That is, after all, what a matrix multiplication is capable of (another oversimplification).
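The affine point is easy to verify numerically: a sketch in pure Python (weights made up for illustration) showing that two linear layers composed without an activation collapse into a single affine map, so depth buys nothing until a non-linearity is inserted between them.

```python
# Sketch: composing two linear (affine) layers with no activation
# in between is itself one affine map: W = W2 @ W1, b = W2 @ b1 + b2.

def affine(W, b, x):
    """y = W x + b for a matrix W (list of rows) and vectors b, x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

W1, b1 = [[1.0, 2.0], [0.0, 1.0]], [0.5, -0.5]   # layer 1 (made up)
W2, b2 = [[2.0, 0.0], [1.0, 1.0]], [0.0, 1.0]    # layer 2 (made up)

x = [3.0, -1.0]
two_layers = affine(W2, b2, affine(W1, b1, x))

# The equivalent single layer:
W = [[sum(W2[i][k] * W1[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]            # W2 @ W1
b = affine(W2, b2, b1)             # W2 @ b1 + b2
one_layer = affine(W, b, x)

assert all(abs(a - c) < 1e-9 for a, c in zip(two_layers, one_layer))
```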

The learning curve is quite steep, and there's a big jump from the common "it's just GMMs" or "it's just linear algebra" claims[4]. There is a lot of depth here, and unfortunately, due to the hype, there is a lot of material labeled "deep" or "advanced mathematics"; it is important to remember that these terms are extremely relative. What is deep to one person is shallow to another. But if a resource isn't going beyond calculus, you are going to struggle, and I am extremely empathetic to that. Again, though, I do promise that there is a lot of insight to be gained by digging into the mathematics. There is benefit to doing things the hard way. I won't try to convince you that it is easy, or that there isn't a lot of noise surrounding the topic, because that would be a lie. If it were easy, ML systems wouldn't be "black boxes"![5]

I would also encourage you to learn some metaphysics. Something like Ian Hacking's Representing and Intervening is a good start. There are limitations to what can be understood through experimentation alone, famously illustrated in Dyson's recounting of when Fermi rejected his paper[6]. There is a common misunderstanding of the saying "with 4 parameters I can fit an elephant and with 5 I can make it wiggle its trunk"; [6] can help provide a better understanding of it, but we truly do need to understand the limitations of empirical studies. Science relies on the combination of empirical study and theory; neither is any good without the other. Science is about creating causal models, so one must be careful and extremely nuanced when doing any form of evaluation. The subtle details can easily trick you.

[0] https://www.youtube.com/watch?v=uiB97cPEVxM

[1] https://www.penzba.co.uk/cgi-bin/PvsNP.py?SpikeySpheres

[2] https://www.youtube.com/shorts/_n7TMDnYdVY

[3] https://www.youtube.com/watch?v=FfiQBvcdFG0

[4] https://news.ycombinator.com/item?id=43418334

[5] I actually dislike this term. It is better to say that they are opaque. A "black box" would imply that we have zero insight, but in reality we can see everything going on inside; it is just extremely difficult to interpret. We also do have some understanding, so the interpretation isn't impenetrable.

[6] https://www.youtube.com/watch?v=hV41QEKiMlM

moffkalast

> how it makes me choose the correct number of neurons in a layer, how many layers, the activation function

Seeing massive ablation studies on each of those in just about every ML paper should be fairly indicative that nobody knows shit about fuck when it comes to that. It's just people trying things out randomly and seeing what works, copying ideas from each other, resulting in some vague guidelines. It's the worst field if you want things to be logical and explainable. It's mostly labelling datasets, paying for compute, and hoping for the best.

bob1029

> nobody knows shit about fuck when it comes to that

This is why I've abandoned neural networks as a computational substrate for genetic programming experiments.

Tape-based UTMs may be extremely rigid in how they execute instruction streams, but at least you can eventually understand and describe everything that contributes to their behavior.

Changing the fan-out from 12 to 15 in a NN is like an ancient voodoo ritual, compared to realizing a program tape is probably not long enough based on rough entropy measures.

hansvm

> all of the above

NFL (the no-free-lunch theorem) says something about it being a wash for arbitrary data. All results are going to be tuned to the assumptions we have about our data in particular (not too many discontinuities, sufficiently well sampled, ...).

> neurons in a layer, how many layers, ...

Scaling laws are, currently, empirically derived. From those you can pick your goals (e.g., at most $X and maximize accuracy) and work backward to one or more optimal sets of parameters. Except in very restricted domains or with other strong assumptions I haven't seen anything giving you more than that.

> activation functions

All of the above (that it can't matter for arbitrary data, and that parameters need to be empirically derived) applies here as well. However: an important inductive bias a lot of practitioners use is that every weight in the model should be roughly equally important. There are other ways to choose activation functions, especially in specialized domains, but when designing a deep network one of the most important things you can do is control the magnitude of information at each level of backpropagation. If your activation function (and the surrounding infrastructure) approximately handles that problem, then it's probably good enough.
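A small sketch of the magnitude-control point (my own construction; the widths, depth, and 1/sqrt(fan_in) Xavier-style scale are illustrative choices, not the parent's prescription): push a random signal through a stack of tanh layers and compare a reasonably scaled weight initialization against one that is far too small.

```python
# Sketch: signal magnitude through a deep tanh stack under two
# weight scales. With ~1/sqrt(fan_in) init the signal keeps a
# usable scale; with a tiny init it vanishes layer by layer.
import math
import random

random.seed(0)
n = 64          # width of every layer (arbitrary choice)
depth = 20      # number of stacked layers (arbitrary choice)

def layer(x, scale):
    """One dense tanh layer with fresh Gaussian weights at the given scale."""
    W = [[random.gauss(0.0, scale) for _ in range(n)] for _ in range(n)]
    return [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in W]

def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

x0 = [random.gauss(0.0, 1.0) for _ in range(n)]
x_good, x_bad = x0, x0
for _ in range(depth):
    x_good = layer(x_good, 1.0 / math.sqrt(n))  # scaled init
    x_bad = layer(x_bad, 0.01)                  # too-small init

print(rms(x_good), rms(x_bad))  # healthy scale vs. vanished signal
```

The same bookkeeping applies in reverse to gradients during backpropagation, which is the parent's point about controlling magnitude at each level.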

> neural network or other techniques

For almost every problem you're better off using something other than a neural network (like catboost). I don't have any good intuition for why that's the case. Test them both. That's what the validation dataset is for.

> how it connects to the practice

For this article in particular, it doesn't connect to a ton of what I personally do, though I'm sure it resonates with someone. As soon as pytorch or jax or whatever isn't good enough and you have to implement stuff from scratch, you need a deep dive into the theory you're implementing. To a lesser degree, if you're interfacing with big frameworks nontrivially or working around their limitations, you still need a deep understanding of the things you're implementing.

Imagine, e.g., that you want all the modern ML tools in a world where dynamic allocation, virtual functions, and all that garbage aren't tractable. You can resoundingly beat every human heuristic for phantom touchpad events in your mouse driver with a tiny neural network, but you can't use pytorch to do it without turning your laptop into a space heater.
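A toy illustration of how small such a network can be (the weights here are hand-picked for demonstration, not trained, and the XOR task is a stand-in for a real classifier): inference is just a few multiply-adds with fixed weights, with no allocation and no framework, which is why it can live in a driver.

```python
# Sketch: a hand-wired 2-2-1 ReLU network evaluated in pure Python.
# Weights are chosen by hand so the net computes XOR; a real driver
# would bake in trained weights the same way.

def relu(v):
    return v if v > 0.0 else 0.0

def tiny_net(x1, x2):
    """Fixed-weight inference: two hidden ReLU units, one linear output."""
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)   # fires on any active input
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)   # fires only on (1, 1)
    out = 1.0 * h1 - 2.0 * h2
    return 1 if out > 0.5 else 0

assert [tiny_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```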

Embedded devices aren't the only scenario where you might have to venture off the beaten path. Much like the age-old argument of importing a data structure vs writing your own, as soon as you have requirements beyond what the library author provides it's often worth it to do the whole thing on your own, and it takes a firm theoretical foundation to do so swiftly and correctly.

> how it connects to practice

That's a criticism I have of a lot of educational materials. Connecting the dots is important in writing (even if it competes with the advantages of brevity).

Pick on the Model-Based Learning section as an example. We're asked, to start, to find the MLE of a Gaussian. (M)aximum (L)ikelihood (E)stimation is an extremely important concept, and a lot of ML practitioners throw it to the side.
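For readers who haven't done that exercise, a minimal sketch (the data points are made up): the Gaussian MLE has a closed form, the sample mean and the 1/N sample variance, and plugging those in maximizes the log-likelihood over the data.

```python
# Sketch: closed-form MLE for a Gaussian's mean and variance.
# The data below is invented for illustration.
import math

data = [2.1, 1.9, 2.4, 2.0, 1.6]
n = len(data)

mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n  # note 1/N, not 1/(N-1)

def log_likelihood(mu, var):
    """Gaussian log-likelihood of the data under parameters (mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var)
               - (x - mu) ** 2 / (2 * var) for x in data)

# The closed-form estimates beat nearby parameter choices:
assert log_likelihood(mu_hat, var_hat) >= log_likelihood(mu_hat + 0.1, var_hat)
assert log_likelihood(mu_hat, var_hat) >= log_likelihood(mu_hat, var_hat + 0.1)
```

The 2-stage pitfalls below are exactly about what happens when you treat point estimates like these as if they carried the whole distribution.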

Imagine, e.g., a 2-stage process where for each price bracket you have a model reporting the likelihood of conversion, and then a second stage where you synthesize those predictions into an optimal strategy. Common failure modes include (a) mishandling variance, (b) assuming that MLE on each of the models allows you to combine the mean/mode/... results into an MLE composite action, and (c), really an extension of (b), having the wrong loss function for your model(s), in which case they aren't meaningfully combinable, ....

Something that should be obvious (predict conversion rates, combine those rates to determine what you should do) has tons of pitfalls if you don't holistically reason about the composite process. That's perhaps a failure in the primitives we use to construct those composite processes, but in today's day and age it's still something you have to consider.

How does the book connect? I dunno. It looks more like a "kata" (keep your fundamental skills sharp) than anything else. An explicit connection to some real-world problem might make it more tractable.

FilosofumRex

Funny how mathematicians always try to sneak their linear algebra and matrix theory into ML. If you didn't know any better, you'd think academics had invented LLMs and were the experts to be consulted.

If anything, academics and theoreticians held ML back and forced generations of grad students to do symbolic proofs, like in this example, just because computational techniques were too lowbrow for them.

simojo

Very neat! Reminds me of Tom Yeh's "AI By Hand" exercises [0].

[0] https://www.byhand.ai/

Sysreq2

This is what I was expecting. Very much appreciated. OP’s paper is good - but I sort of feel like it’s preaching to the choir. It’s a great resource if you already know the material.

S4M

Looks neat! My only criticism is that the solutions are given right after the questions, so I couldn't help reading the answer to a question before thinking it through by myself.

plants

This is really neat! I work in machine learning but still feel imposter syndrome about my foundations in math (specifically linear algebra and matrix/tensor operations). Does anyone have more good resources for problem sets with an emphasis on foundational deep learning skills? I find I learn best if I do a bit of hands-on work every day (and if I can learn things from multiple teachers’ perspectives).

antipaul

So who among current ML practitioners building “useful” ML could solve some of these?

_Should they_ be able to?

jerrygenser

Nope, I don't think they should or need to be able to.

These exercises are useful for building the mathematical maturity that leads to the intuition needed to develop novel algorithms or low-level optimizations.

They are not needed to train and deploy existing ML algorithms in general.

danielmarkbruce

Depends on your definition of "ML practitioner", "building" and "ML". Look at the section on optimization - some people have an extremely good grasp of this and it helps them mentally iterate through possible loss functions and possible ways to update parameters and what can go wrong.

biotechbio

I am curious about the same thing. I worked as a ML engineer for several years and have a couple of degrees in the field. Skimming over the document, I recognized almost everything but I would not be able to recall many of these topics if asked without context, although at one time I might have been able to.

What are others' general level of recall for this stuff? Am I a charlatan who never was very good at math or is it just expected that you will forget these things in time if you're not using them regularly?

psyklic

Good news -- if you're not interested in extending the state of the art and simply want to call APIs, you don't have to learn ML deeply.

kingkongjaffa

complete with solutions, beautiful, thank you for sharing!

I'd be interested in more of these pen and paper exercises, if there is such a term, for other topics.

blackbear_

Not sure which other topics you mean, but "One Thousand Exercises in Probability" should keep you busy for a while (the PDF can be found online). For other math-oriented riddles, check out "The Colossal Book of Short Puzzles and Problems" and "The Art and Craft of Problem Solving".

dang

Discussed at the time:

Pen and paper exercises in machine learning (2021) - https://news.ycombinator.com/item?id=31913057 - June 2022 (55 comments)

axpy906

Love it.