"The Bitter Lesson" is wrong. Well sort of
20 comments · July 20, 2025 · aabhay
bobbiechen
So true. I recently wrote about how Merlin achieved magical bird identification not through better algorithms, but through better expertise in creating great datasets: https://digitalseams.com/blog/what-birdsong-and-backends-can...
I think "harsh reality" is one way to look at it, but you can also take an optimistic perspective: you really can achieve great, magical experiences by putting in (what could be considered) unreasonable effort.
v9v
I think your comment has some threads in common with Rodney Brooks' response: https://rodneybrooks.com/a-better-lesson/
vineyardmike
While I agree with you, it’s worth noting that current LLM training uses a significant percentage of all available written data. The transition from GPT-2 era models to now (GPT-3+) saw the transition from novel models that can kinda imitate speech to models that can converse, write code, and use tools. It’s only after the readily available data was exhausted that further gains came from curation and large amounts of synthetic data.
Calavar
> The transition from GPT-2 era models to now (GPT-3+) saw the transition from novel models that can kinda imitate speech to models that can converse, write code, and use tools.
Which is fundamentally about data. OpenAI invested an absurd amount of money to get the human annotations to drive RLHF.
RLHF itself is a very vanilla reinforcement learning algo + some branding/marketing.
aabhay
Transfer learning isn’t about “exhausting” all available un-curated data; it’s simply that the systems are large enough to support it. There’s not that much of a reason to train on all available data. And it isn’t all of it anyway: there’s still very significant filtration happening. For example, they don’t train on petabytes of log files; that would just be terribly uninteresting data.
pphysch
Another name for gathering and curating high-quality datasets is "science". One would hope "AI pioneer" USA would embrace this harsh reality and invest massively in basic science education and infrastructure. But we are seeing the opposite, and basically no awareness of this "harsh reality" among the AI hype...
rdw
The bitter lesson is becoming misunderstood as the world moves on. Unstated yet core to it is that AI researchers were historically attempting to build an understanding of human intelligence. They intended to, piece by piece, assemble a human brain and thus be able to explain (and fix) our own biological ones, much like what can be done with physical simulations of knee joints. Of course, you can also use that knowledge to create useful thinking machines, because you understand it well enough to be able to control it, much like how we have many robotic joints.
So, the bitter lesson is based on a disappointment that you're building intelligence without understanding why it works.
DoctorOetker
Right, like discovering Huygens principle, or interference, integrals/sums of all paths in physics.
Just because a whole lot of physical phenomena can be explained by a couple of foundational principles does not mean that understanding those core patterns automatically endows one with an understanding of how and why materials refract light, or of a plethora of other specific effects... effects worth understanding individually, even if they are still explained in terms of those foundational concepts.
Knowing a complicated set of axioms or postulates enables one to derive theorems from them, but those implied theorem proofs are nonetheless non-trivial and have a value of their own (even though they can be expressed and expanded into a DAG of applications of that "bitterly minimal" axiomatization).
Once enough patterns are correctly modeled by machines, and given enough time to analyze them, people will eventually discover a better understanding of how and why things work (beyond the mere abstract knowledge that latent parameters were fitted against a loss function).
In some sense deeper understanding has already come for the simpler models like word2vec, where many papers have analyzed and explained relations between word vectors. This too lagged behind the creation and utilization of word vector embeddings.
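(As a small concrete example of those word-vector relations, here is a minimal sketch using gensim's KeyedVectors; the file name is a placeholder for whatever pretrained word2vec-format vectors you have locally.)

    # Load pretrained word2vec-format vectors (placeholder path) and query the
    # classic analogy: vec("king") - vec("man") + vec("woman") ~ vec("queen").
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("word2vec-vectors.bin", binary=True)
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))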
It is not inconceivable that someday someone observes an analogy between, say, QKV tensors and the triples resulting from graph linearization: think subject, object, predicate (even though I hate those triples; try modeling a ternary relation like 2+5=7 with SOP triples, they're really only meant to capture "sky - is - blue" associations. A better type of triple would be player-role-act triples; one can then model ternary relations, but one needs to reify the relation). A toy sketch of that reification, with hypothetical role names, follows below.
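    # "sky - is - blue" fits a single subject-predicate-object triple...
    spo_triple = ("sky", "is", "blue")

    # ...but 2 + 5 = 7 is a ternary relation. Reify it: give the addition itself
    # an identity ("the act"), then attach each player to it with its role.
    act = "addition#1"  # hypothetical id for the reified relation instance
    player_role_act = [
        ("2", "term-of", act),
        ("5", "term-of", act),
        ("7", "sum-of",  act),
    ]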
Similarly, without mathematical training, humans display awareness of the concepts of sets, membership, existence, ... without a formal system. The chatbots display this awareness. It's all vague naive set theory. But how are DNNs modeling set theory? That's a paper someday.
ta8645
> you're building intelligence without understanding why it works.
But if we do a good enough job of that, it should then be able to explain to us why it works (after it does some research/science on itself). Yes?
samrus
Bit fantastical. We are a general intelligence and we don't understand ourselves.
grillitoazul
Perhaps the Bitter Lesson only works up to a point; that is, throwing more compute and data only gets you so far, and to go farther you need some new algorithm yet to be discovered.
macawfish
In my opinion, the useful part of "the bitter lesson" has nothing to do with throwing more compute and more data at stuff. It has to do with actually using ML instead of trying to manually and cleverly tweak stuff, and with effectively leveraging the data you have as part of that (again using more ML) rather than trying to manually label everything.
rhaps0dy
Sutton was talking about progress in AI overall, whereas Pinhasi (OP) is talking about building one model for production right now. Of course adding some hand-coded knowledge is essential for the latter, but it has not provided much long-term progress. (Even CNNs and group-convolutional NNs, which seek to encode invariants to increase efficiency while still doing almost only learning, seem to be on the way out)
roadside_picnic
"The Bitter Lesson" certainly seems correct when applied to whatever the limit of the current state of the art is, but in practice solving day-to-day ML problems, outside of FAANG-style companies and cutting edge research, data is always much more constrained.
I have, multiple times in my career, solved a problem using simple, intelligible models that have empirically outperformed neural models ultimately because there was not enough data for the neural approach to learn anything. As a community we tend to obsess over architecture and then infrastructure, but data is often the real limiting factor.
When I was early in my career I used to always try to apply very general, data-hungry models to all my problems, with very mixed success. As I became more skilled I became a staunch advocate of only using simple models you could understand, with much more successful results (which is what led to this revised opinion). But at this point in my career, I increasingly see that one's approach to modeling should be more information-theoretic: try to find the model whose channel capacity best matches your information rate.
As a Bayesian, I also think there's a very reasonable explanation for why "The Bitter Lesson" rings true over and over again. In E.T. Jaynes' writing he often talks about Bayes' Theorem in terms of P(D|H) (i.e. the probability of the Data given the Hypothesis, or vice versa), but, especially in the earlier chapters, purposefully adds an X to that equation: P(D|H,X), where X is a stand-in for all of our prior information about the world. Typically we think of prior information as literal data, but Jaynes points out that our entire world of understanding is also part of our prior context.
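Written out with X carried through every term (just the standard form of Bayes' theorem, nothing beyond what Jaynes writes):

    P(H | D, X) = P(D | H, X) P(H | X) / P(D | X)

Dropping the X doesn't make the background information go away; it just hides it inside the prior and the likelihood.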
In this view, models that "leverage human understanding" (i.e. are fully intelligible) are essentially throwing out information at the limit. But to my earlier point, if the data falls quite short of that limit, then those intelligible models are adding information in data constrained scenarios. I think the challenge in practical application is figuring out where the threshold is that you need to adopt a more general approach.
Currently I'm very much in love with Gaussian Processes, which, for constrained data environments, offer a powerful combination of both of these methods. You can give the model prior hints at what things should look like in terms of the relative structure of the kernel and its priors (e.g. there should be a roughly annual seasonal component and a roughly weekly seasonal component), but otherwise let the data decide.
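For what it's worth, here's a minimal sketch of that kind of structured kernel in scikit-learn; the specific periodicities, bounds, and synthetic data below are illustrative assumptions, not a recipe:

    # Encode "roughly annual" and "roughly weekly" seasonality as kernel structure,
    # then let the data fit the details. Bounds and noise levels are illustrative.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

    annual = 1.0 * ExpSineSquared(periodicity=365.0, periodicity_bounds=(330.0, 400.0))
    weekly = 0.5 * ExpSineSquared(periodicity=7.0, periodicity_bounds=(6.0, 8.0))
    trend = RBF(length_scale=90.0)         # slow non-periodic drift
    noise = WhiteKernel(noise_level=0.1)   # observation noise

    gp = GaussianProcessRegressor(kernel=annual + weekly + trend + noise,
                                  normalize_y=True)

    # Two years of synthetic daily observations, just to exercise the API.
    t = np.arange(730.0).reshape(-1, 1)
    y = np.sin(2 * np.pi * t[:, 0] / 365.0) + 0.3 * np.sin(2 * np.pi * t[:, 0] / 7.0)
    gp.fit(t, y)
    mean, std = gp.predict(t, return_std=True)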
godelski
I'm not sure the Bitter Lesson is wrong; I think we'd need clarification from Sutton (does someone have this?).
But I do know "Scale is All You Need" is wrong. And VERY wrong.
Scaling has done a lot. Without a doubt it is very useful. But this is a drastic oversimplification of all the work that has happened over the last 10-20 years. ConvNeXt and "ResNets Strike Back" didn't take off for reasons, despite being very impressive. There have been a lot of algorithmic changes, a lot of changes to training procedures, a lot of changes to how we collect data[0], and more.
We have to be very honest: you can't just buy your way to AGI. There's still innovation that needs to be done. This is great for anyone still looking to get into the space. The game isn't close to being over. I'd argue that this is great for investors too, as there are a lot of techniques looking to try themselves at scale. Your unicorns are going to be over here. A dark horse isn't a horse that just looks like every other horse. That might be a "safer" bet, but it's like betting on amateur jockeys and horses that just train similarly to professional ones. They have to do a lot of catch-up, even if the results are fairly certain. At that point you're not investing in the tech, you're investing in the person or the market strategy.
[0] Okay, I'll buy this one as scale if we really want to argue that these changes are about scaling data effectively but we also look at smaller datasets differently because of these lessons.
littlestymaar
The Leela Chess Zero vs Stockfish case also offers an interesting perspective on the bitter lesson.
Here's my (maybe a bit loose) recollection of what happened:
Step 1- Stockfish was the typical human-knowledge AI, with tons of actual chess knowledge injected in the process of building an efficient chess engine.
Step 2. Then came Leela Chess Zero, with its Alpha Zero-inspired training, a chess engine trained fully with RL with no prior chess knowledge added. And it has beaten Stockfish. This is a “bitter lesson” moment.
Step 3. The Stockfish devs added a neural network (NNUE) to their engine's evaluation, in addition to their existing heuristics. And Stockfish easily took back its crown.
Yes, throwing more compute at a problem is an efficient way to solve it, but if all you have is compute, you'll pretty certainly lose to somebody who has both compute and knowledge.
symbolicAGI
The Stockfish chess engine example nails it.
For AI researchers, the Bitter Lesson is not to rely on supervised learning, manual data labeling, manual ontologies, or manual business rules, nor on *manually coded* AI systems, except as bootstrap code.
Unsupervised methods prevail, even if they are compute-expensive.
The challenge from Sutton's Bitter Lesson for AI researchers is to develop sufficient unsupervised methods for learning and AI self-improvement.
The main problem with the “Bitter Lesson” is that there's something even bitter-er behind it: the “Harsh Reality” that while we may scale models on compute and data, simply pouring in tons of data without any sort of curation yields essentially garbage models.
The “Harsh Reality” is that while you may only need data, the current best models, and the companies behind them, spend enormously on gathering high-quality labeled data with extensive oversight and curation. This curation is of course being partially automated as well, but ultimately there are billions, or even tens of billions, of dollars flowing into gathering, reviewing, and processing subjectively high-quality data.
Interestingly, at the time that paper was published, the harsh reality was not so harsh. For things like face detection, (actual) next-word prediction, and other purely self-supervised models that were not instruction-tuned or “chat” style, data was truly all you needed. You didn't need “good” faces; as long as it was indeed a face, the data itself was enough. Now it's not. In order to make these machines useful and not just function approximators, we need extremely large dataset-curation industries.
If you learned the bitter lesson, you better accept the harsh reality, too.