Deep Learning Is Not So Mysterious or Different
54 comments
March 17, 2025 · rottc0dd
chamomeal
I watched the 3b1b series on neural nets years ago, and it still accounts for 95% of my understanding of AI in general.
I’m not an ML person, but still. That guy has a serious gift for explaining stuff.
His video on the uncertainty principle explained stuff to me that my entire undergrad education failed to!
randomtoast
Apparently the word “delve” is the biggest indicator of the use of ChatGPT according to Paul Graham.
wincy
I’d love to see an article delve into why that is.
treyd
Because it's common in Nigerian English, which is where they outsourced a lot of the RLHF conditioning work to.
EGreg
Saying that kind of stuff is the biggest indicator of Paul Graham (pg) himself
bogeholm
Looks nice - are there written versions?
rottc0dd
Caltech's Learning from Data was really good too, if someone is looking for a theoretical understanding of ML topics.
TechDebtDevin
Anyone who wants to demystify ML should read: The StatQuest Illustrated Guide to Machine Learning [0] By Josh Starmer.
To this day I haven't found a teacher who could express complex ideas as clearly and concisely as Starmer does. It's written in an almost children's-book-like format that is very easy to read and understand. He also just published a book on neural networks that is just as good. Highly recommend it even if you are already an expert, as it will give you great ways to teach and communicate complex ideas in ML.
[0]: https://www.goodreads.com/book/show/75622146-the-statquest-i...
Lerc
I have followed a fair few StatQuest and other videos (treadmills with Youtube are great for fitness and learning in one)
I find that no single source seems to cover things in a way that I easily understand, but cumulatively they fill in the blanks of each other.
Serrano Academy has been a good source for me as well. https://www.youtube.com/@SerranoAcademy/videos
The best tutorials give you a clear sense that the teacher genuinely understands the underlying principles and how/why they are applied.
I have seen a fair few things that are effectively:
'To do X, you {math thing}', while also creating the impression that they don't understand why {math thing} is the right thing to do, just that {math thing} has a name and it produces the result. Meticulously explaining the minutiae of {math thing} substitutes for an understanding of what it is actually doing.
It really stood out to me when looking at UMAP and seeing a bunch of things where they got into the weeds in the math without explaining why these were the particular weeds to be looking in.
Then I found a talk by Leland McInnes that had the format.
{math thing} is a tool to do {objective}. It works, there is a proof, and you don't need to understand it to use the tool, but the info is over there if you want to take a look. These are our objectives; let's use these tools to achieve them.
The tools are neither magical black boxes, nor confused with the actual goal. It really showed the power of fully understanding the topic.
ajitid
Also would like to add that he has a YouTube channel as well https://youtube.com/@statquest
kaptainscarlet
Double Bam
cgdl
Agreed, but PAC-Bayes and other descendants of VC theory are probably not the best explanation. The notion of algorithmic stability provides a (much) more compelling explanation. See [1] (particularly Sections 11 and 12)
bigfatfrock
I'm a huge fan of HN just for replies such as this that smash the OP's post/product with something better. It's like at least half the reason I stick around here.
Thanks for the great read.
hlynurd
>smash with something better
Not a fan of the aggressive rhetoric here...
superidiot1932
I too felt threatened
esafak
Statistical mechanics is the lens that makes most sense to me. Plus it's well studied.
mxwsn
Good read, thanks for sharing
getnormality
> rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem.
How does deep learning do this? The last time I was deeply involved in machine learning, we used a penalized likelihood approach. To find a good model for data, you would optimize a cost function over model space, and the cost function was the sum of two terms: one quantifying the difference between model predictions and data, and the other quantifying the model's complexity. This framework encodes exactly a "soft preference for simpler solutions that are consistent with the data", but is that how deep learning works? I had the impression that the way complexity is penalized in deep learning was more complex, less straightforward.
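One concrete instance of that two-term structure in deep learning is weight decay (an L2 penalty added to the training loss); whether that alone accounts for deep learning's simplicity bias is exactly what the replies below debate. A minimal sketch of the penalized objective described above, with made-up data and an L2 penalty standing in for the complexity term:

    import numpy as np

    def penalized_loss(w, X, y, lam=0.1):
        # data-fit term: how far the model's predictions are from the data
        data_fit = np.mean((X @ w - y) ** 2)
        # complexity term: a soft preference for smaller (simpler) weights
        complexity = lam * np.sum(w ** 2)
        return data_fit + complexity

    # minimize the combined objective by gradient descent
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, 0.0, 0.0, 2.0, 0.0]) + 0.1 * rng.normal(size=100)
    w = np.zeros(5)
    for _ in range(500):
        grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * 0.1 * w
        w -= 0.05 * grad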
chriskanan
Here is an example for data-efficient vision transformers: https://arxiv.org/abs/2401.12511
Vision transformers have a more flexible hypothesis space, but they tend to have worse sample complexity than convolutional networks which have a strong architectural inductive bias. A "soft inductive bias" would be something like what this paper does where they have a special scheme for initializing vision transformers. So schemes like initialization that encourage the model to find the right solution without excessively constraining it would be a soft preference for simpler solutions.
whiteandnerdy
You're correct, and the term you're looking for is "regularisation".
There are two common ways of doing this:
* L1 or L2 regularisation: penalises models whose weight matrices are complex (in the sense of having lots of large elements)
* Dropout: train on random subsets of the neurons to force the model to rely on simple representations that are distributed robustly across its weights
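A short sketch of both, assuming PyTorch (layer sizes and hyperparameters are made up):

    import torch
    import torch.nn as nn

    # Dropout randomly zeroes activations during training, pushing the
    # network toward redundant, distributed representations.
    model = nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(256, 10),
    )

    # weight_decay adds an L2 penalty on the weights, i.e. a soft
    # preference for smaller (simpler) weight matrices.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)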
levocardia
Dropout is roughly equivalent to layer-specific L2 regularization, and it's easy to see why: asymptotically, dropping out random neurons will achieve something similar to shrinking weights towards zero proportional to their (squared) magnitude.
Trevor Hastie's Elements of Statistical Learning has a nice proof that (for linear models) L2 regularization is also semi-equivalent to dimensionality reduction, which you could use to motivate a "simplicity prior" idea in deep learning.
Yet another way of thinking about it, in the context of ReLU units, is that a layer of ReLUs forms a truncated hyper-plane basis (like splines but in higher dimensions) in feature space, and regularization induces smoothness in this N-dimensional basis by shrinking it towards a flat hyper-plane.
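A 1-D cartoon of that basis-function view (knots and coefficients are made up):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    # A 1-D "layer of ReLUs": each unit is a truncated linear basis
    # function hinged at a knot t_k, just like a linear spline basis.
    knots = np.linspace(-2, 2, 8)

    def relu_spline(x, coeffs, bias=0.0):
        # f(x) = bias + sum_k c_k * max(0, x - t_k): piecewise-linear in x
        return bias + sum(c * relu(x - t) for c, t in zip(coeffs, knots))

    x = np.linspace(-3, 3, 200)
    wiggly = relu_spline(x, coeffs=np.random.default_rng(0).normal(size=8))
    # Shrinking the coefficients toward zero (what L2 regularization does)
    # flattens the function toward a single flat hyper-plane (here: a line).
    smooth = relu_spline(x, coeffs=0.1 * np.random.default_rng(0).normal(size=8))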
jonathanhuml
The solution to the L1 regularization problem is actually a specific form of the classical ReLU nonlinearity used in deep learning. I’m not sure if similar results hold for other nonlinearities, but this gave me good intuition for what thresholding is doing mathematically!
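Concretely, the scalar L1-penalized least-squares problem has the closed-form "soft-thresholding" solution, which can be written as a difference of two shifted ReLUs; a small sketch:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def soft_threshold(x, lam):
        # closed-form solution of  argmin_z 0.5*(z - x)**2 + lam*|z|
        # (the proximal operator of the L1 penalty), written with ReLUs
        return relu(x - lam) - relu(-x - lam)

    x = np.linspace(-3, 3, 7)
    print(soft_threshold(x, lam=1.0))  # values within [-1, 1] are snapped to 0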
bornfreddy
I'm not a guru myself, but I'm sure someone will correct me if I'm wrong. :-)
The usual approach to supervised ML is to "invent" the model (layers, their parameters), or more often copy one from a known good reference, then define the cost function and feed it data. "Deep" learning just means that instead of a few layers you use a large number of them.
What you describe sounds like an automated way of tweaking the architecture, IIUC? Never done that, usually the cost of a run was too high to let an algorithm do that for me. But I'm curious if this approach is being used?
woopwoop
Yeah, it's straightforward to reproduce the results of the paper whose conclusion they criticize, "Understanding deep learning requires rethinking generalization", without any (explicit) regularization or anything else that can be easily described as a "soft preference for simpler solutions".
buffalobuffalo
When I was first getting into Deep Learning, learning the proof of the universal approximation theorem helped a lot. Once you understand why neural networks are able to approximate functions, it makes everything built on top of them much easier to understand.
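A toy illustration of the theorem's flavor (not the proof itself): random ReLU features plus a least-squares readout already approximate a smooth 1-D function, and the error shrinks as the hidden layer widens.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-np.pi, np.pi, 400)[:, None]
    target = np.sin(3 * x).ravel()

    # one hidden layer of ReLUs with random weights; only the output
    # layer is fit, by ordinary least squares
    width = 200
    W, b = rng.normal(size=(1, width)), rng.uniform(-np.pi, np.pi, size=width)
    hidden = np.maximum(0.0, x @ W + b)          # shape (400, width)
    coeffs, *_ = np.linalg.lstsq(hidden, target, rcond=None)
    approx = hidden @ coeffs
    print("max error:", np.abs(approx - target).max())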
YesBox
I wish I had the time to try this:
1.) Grab many GBs of text (books, etc).
2.) For each word, for each next $N words, store distance from current word, and increment count for word pair/distance.
3.) For each word, store most frequent word for each $N distance. [a]
4.) Create a prediction algorithm that determines the next word (or set of words) to output from any user input. Basically this would compare word pairs/distances and find the most probable next set of word(s); a toy sketch follows after the footnote.
How close would this be to GPT 2?
[a] You could go one step further and store multiple words for each distance, ordered by frequency
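A toy sketch of steps 2 through 4 in plain Python (tiny example corpus, no smoothing; the names are my own):

    from collections import Counter, defaultdict

    N = 5  # look this many words ahead

    def train(text):
        # counts[(word, distance)] -> Counter of words seen that far ahead
        counts = defaultdict(Counter)
        words = text.split()
        for i, w in enumerate(words):
            for d in range(1, N + 1):
                if i + d < len(words):
                    counts[(w, d)][words[i + d]] += 1
        return counts

    def predict_next(counts, context):
        # score candidates by how often they appeared at the matching
        # distance from each context word (steps 3-4 above)
        scores = Counter()
        for d, w in enumerate(reversed(context[-N:]), start=1):
            for cand, c in counts[(w, d)].items():
                scores[cand] += c
        return scores.most_common(1)[0][0] if scores else None

    counts = train("the cat sat on the mat because the cat was tired")
    print(predict_next(counts, ["the", "cat"]))

Even scaled way up, this stays a fixed-window co-occurrence table, which is part of why the replies below point at sparsity and the lack of anything like attention.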
0cf8612b2e1e
The scaling is brutal. If you have a 20k word vocabulary and want to do 3 grams, you need a 20000^3 matrix of elements (8 trillion). Most of which is going to be empty.
GPT and friends cheat by not modeling each word separately, but by using a high-dimensional "embedding" (just a vector, if you also find new vocabulary silly). The embedding places similar words near each other in this space: the famous king-man-queen example. So even if your training set has never seen "The Queen ordered the traitor <blank>", it might have previously seen "The King ordered the traitor beheaded". The vector representation lets the model use words that represent similar concepts without concrete examples.
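The analogy trick fits in a few lines, assuming `emb` is a dict mapping words to pretrained vectors (not provided here; the function is my own illustration):

    import numpy as np

    def analogy(emb, a, b, c):
        # return the word whose vector is closest (by cosine) to a - b + c,
        # e.g. analogy(emb, "king", "man", "woman") -> hopefully "queen"
        query = emb[a] - emb[b] + emb[c]
        query = query / np.linalg.norm(query)
        best, best_sim = None, -1.0
        for word, vec in emb.items():
            if word in (a, b, c):
                continue
            sim = float(vec @ query / np.linalg.norm(vec))
            if sim > best_sim:
                best, best_sim = word, sim
        return best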
andrewla
Importantly, though, LLMs do not take the embeddings as input during training; they take the tokens and learn the embeddings as part of the training.
Specifically, all Transformer-based models; older models used things like word2vec or ELMo, but all current LLMs train their embeddings from scratch.
docfort
There is some recent work [0] that explores this idea, scaling up n-gram models substantially while using word2vec vectors to understand similarity. Used to compute something the authors call the Creativity Index [1].
[0]: https://infini-gram.io
[1]: https://arxiv.org/abs/2410.04265v1
currymj
this is pretty close to how language models worked in the 90s-2000s. deep language models -- even GPT 2 -- are much much better. on the other hand, the n-gram language models are "surprisingly good" even for small n.
janalsncm
The problem is that for any reasonable value of N (>100) you will need prohibitive amounts of storage. And it will be extremely sparse. And you won’t capture any interactions between N-99 and N-98.
Transformers do that fairly well and are pretty efficient in training.
montebicyclelo
> How close would this be to GPT 2
Here's a post from 2015 doing something a bit like this [1]
WheatMillington
Pretty sure this wouldn't produce anything useful: it would generate incoherent gibberish that looks and sounds like English but makes no sense. This ignores perhaps the most important element of LLMs, the attention mechanism.
nickysielicki
Markov chains are very very far off from gpt2.
inciampati
An interesting example in which "deep" networks are necessary is discussed in this fascinating and popular recent paper on RNNs [1]. Despite the fact that the minGRU and minLSTM models they propose don't explicitly model ordered state dependency, they can learn such dependencies as long as they are deep enough (depth >= 3):
> Instead of explicitly modelling dependencies on previous states to capture long-range dependencies, these kinds of recurrent models can learn them by stacking multiple layers.
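For reference, a rough sketch of the minGRU recurrence as I read it from the paper, stacked a few layers deep (biases, the parallel-scan formulation, and normalization are omitted; shapes are illustrative):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def min_gru_layer(xs, Wz, Wh):
        # h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t, where z_t and
        # h_tilde_t depend only on the current input (not on h_{t-1})
        h = np.zeros(Wh.shape[0])
        out = []
        for x in xs:
            z = sigmoid(Wz @ x)
            h_tilde = Wh @ x
            h = (1 - z) * h + z * h_tilde
            out.append(h)
        return np.stack(out)

    # "deep enough" means stacking several such layers, so later layers
    # can build the ordered dependencies a single layer's gates cannot
    rng = np.random.default_rng(0)
    d = 16
    xs = rng.normal(size=(10, d))
    for _ in range(3):  # depth >= 3, as noted above
        Wz = rng.normal(size=(d, d)) / np.sqrt(d)
        Wh = rng.normal(size=(d, d)) / np.sqrt(d)
        xs = min_gru_layer(xs, Wz, Wh)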
talles
Correct me if I'm wrong, but an artificial neuron is just good old linear regression followed by an activation function to make it non-linear. Make a network out of them and cool stuff happens.
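In code the picture is roughly the following sketch (how the weights actually get fit, which the replies below discuss, is a separate question):

    import numpy as np

    def neuron(x, w, b):
        # affine map (the "linear regression"-looking part) ...
        z = w @ x + b
        # ... followed by a nonlinearity (here: ReLU)
        return np.maximum(0.0, z)

    # stacking layers of such units is the "make a network out of them" part
    # (a real network would usually leave the last layer linear)
    def tiny_mlp(x, params):
        for W, b in params:
            x = np.maximum(0.0, W @ x + b)
        return x

    params = [(np.random.randn(8, 4), np.zeros(8)),
              (np.random.randn(2, 8), np.zeros(2))]
    print(tiny_mlp(np.random.randn(4), params))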
andrewla
In a sense; but linear regression can be computed exactly, so it refers to a specific technique for producing a linear model.
Most artificial neurons are trained stochastically rather than holistically, i.e. rather than looking at the entire training set and computing the gradient to minimize the squared loss or something similar, they look at each training example and compute the local gradient and make small changes in that direction.
In addition, the "activation function" almost universally used now is the rectified linear unit, which is linear for positive input and zero for negative input. This is non-decreasing as a function, but because it is flat for all negative inputs (i.e., not strictly monotonic), there is no additional loss accrued for overcorrecting in the negative direction.
Given this, using the term "linear regression" to describe the model of an artificial neuron is not really a useful heuristic.
hatthew
This is like saying "the human brain is just some chemistry." You have the general idea correct, but there's a lot more going on than just that, and the emergent system is so much more complex that it deserves its own separate field.
esafak
MLPs are compositions of generalized linear models. That's not very enlightening though; the "mysterious" part is the macroscopics of the composition.
uoaei
[flagged]
bonoboTP
Please formulate your critique instead of simply labeling it with negative words.
FirstTimePostin
It's a higher quality article than half the submissions to NeurIPS and the rest of the AI/ML conferences, because it has potential to remain relevant next year, and because of its high didactic content. I wouldn't classify it as a "hot take".
EncomLab
The implication that any software is "mysterious" is problematic - there is no "woo" here - the exact state of the machine running the software may be determined at every cycle. The exact instruction and the data it executed with may be precisely determined, as can the next instruction. The entire mythos of any software being a "black box" is just so much advertising jargon, perpetuated by tech bros who want to believe they are part of some Mr. Robot self-styled priestly class.
xmprt
The mystery was never in the "how do computers calculate the probabilities of next tokens" but rather in the "why is it able to work so well" and "what does this individual neuron contribute to the whole model"
margalabargala
You're misunderstanding. A level of abstraction is necessary for operation of modern systems. There is no human alive who, given an intermediate step in the middle of some running learning algorithm, is able to understand and mentally model the full system at full man-made resolution, that is, down to the transistor level, on a modern CPU. Someone wishing to understand a piece of software in 2025 is forced to, at some point, accept that something somewhere "does what it says on the tin" and model it thusly rather than having a full understanding.
EncomLab
It's not a misunderstanding at all, but your response is certainly an attempt to obfuscate the point being made. The moment you represent anything in code, you are abstracting a real thing into its digital representation. That digital representation is fully formed at every cycle of the digital system processing it, and the state of the system, all the way down to the transistor level, may be precisely determined. To say otherwise is to make the same error as those who claim that consciousness or understanding are indefinable "extra-ordinary" things that we have to just accept exist without any justification or evidence.
01HNNWZ0MV43FF
But the weights trained from machine learning are a black box, in the sense that no human designed e.g. the image processing kernels that those weights represent.
That is one reason people are skeptical of them, not only is training a large model at home expensive, not only is the data too big to trivially store, but the weights are not trivial to debug either
If anyone wants to delve into machine learning, one of the superb resources I have found is Stanford's "Probability for Computer Scientists" (https://www.youtube.com/watch?v=2MuDZIAzBMY&list=PLoROMvodv4...).
It delves into the theoretical underpinnings of probability theory and ML, IMO better than any other course I have seen. (Yeah, Andrew Ng is legendary, but his course demands some mathematical familiarity with linear algebra topics.)
And of course, for deep learning, 3b1b is great for getting some visual introduction (https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQ...).