The Bitter Lesson Is Misunderstood
49 comments · August 28, 2025
FloorEgg
The problem I am facing in my domain is that all of the data is human-generated and riddled with human errors. I am not talking about typos in phone numbers, but rather fundamental errors in critical thinking and reasoning, semantic and pragmatic oversights, etc., all in long-form unstructured text. It's very much an LLM-domain problem, but converging on the existing data is like trying to converge on noise.
The opportunity in the market is the gap between what people have been doing and what they are trying to do. I have developed very specialized approaches to narrow this gap in my niche, and so far customers are loving it.
I seriously doubt that the gap could ever be closed by throwing more data and compute at it. I imagine, though, that the outputs of my approach could be used to train a base model to close the gap at a lower unit cost, but I am skeptical that it would be economically worthwhile anytime soon.
mediaman
This is one reason why verifiable rewards work really well, if they're possible for a given domain. Figuring out how to extract signal and verify it for an RL loop will be very popular in a lot of niche fields.
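To make "verifiable rewards" concrete, here's a minimal sketch of the kind of reward function an RL loop needs (a toy math-answer checker; the generate_answers call in the comment is a purely hypothetical sampler):

    import re

    def verify_math_answer(completion: str, expected: float, tol: float = 1e-6) -> float:
        """Toy verifiable reward: 1.0 if the last number in the completion
        matches the known answer, else 0.0."""
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        if not numbers:
            return 0.0
        return 1.0 if abs(float(numbers[-1]) - expected) <= tol else 0.0

    # In an RL loop you'd sample several completions and feed these scores back
    # as rewards, e.g. (generate_answers is a hypothetical stand-in):
    # rewards = [verify_math_answer(c, expected=42.0) for c in generate_answers(prompt, n=8)]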
incompatible
When studying human-created data, you always need to be aware of these factors, including bias from doctrines, such as religion, older information becoming superseded, outright lies and misinformation, fiction, etc. You can't just swallow it all uncritically.
stego-tech
This is the drum I currently bang on when an uninformed stakeholder tries shoving LLMs blindly down everyone’s throats: it’s the data, stupid. Current data aggregates outside of industries wholly dependent on data (so anyone not in web advertising, GIS, or intelligence) are garbage: riddled with errors and in awful structures that are opaque to LLMs. For your AI strategy to have any chance of success, your data has to be pristine and fresh; otherwise you’re lighting money on fire.
Throwing more compute and data at the problem won’t magically manifest AGI. To reach those lofty heights, we must first address the gaping wounds holding us back.
FloorEgg
Yes, for me both customers and colleagues continually suggested "hey let's just take all these samples of past work and dump it in the magical black box and then replicate what they have been doing".
Instead I developed a UX that made it as easy as possible for people to explain what they want to be done, and a system that then goes and does that. Then we compare the system's output to their historical data and there is always variance, and when the customer inspects the variance they realize that their data was wrong and the system's output is far more accurate and precise than their process (and ~3 orders of magnitude cheaper). This is around when they ask how they can buy it.
This is the difference between making what people actually want and what they say they want: it's untangling the why from the how.
marlott
Interesting! Could you give an example with a bit more specific detail here? I take it there's some kind of work output, like a report, in a semi-structured format, and the goal is to automate creation of these. And you would provide a UX that lets them explain what they want the system to create?
PLenz
I've worked in 2 of those domains (I was a geographer at a web advertising company) and let me tell you, the data is only slightly better than in the median industry, and in the case of geodata from apps I'd say it's far, far, far worse.
Workaccount2
They'll pay academics to create data; in fact, this is already happening.
cs702
I don't think Sutton's essay is misunderstood, but I agree with the OP's conclusion:
We're reaching scaling limits with transformers. The number of parameters in our largest transformers, N, is now on the order of trillions, which is roughly the most we can usefully train given the total number of tokens of training data available worldwide, D, also on the order of trillions, for a compute budget C ≈ 6·N·D, which is on the order of D². OpenAI and Google were the first to show these transformer "scaling laws." We cannot add more compute to a given compute budget C without increasing data D to maintain the relationship. As the OP puts it, if we want to increase the number of GPUs by 2x, we must also increase the number of parameters and training tokens by ~1.41x each, but... we've already run out of training tokens.
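To make the arithmetic concrete (illustrative numbers only, not any lab's actual budget):

    import math

    def compute_budget(n_params: float, n_tokens: float) -> float:
        """Approximate training FLOPs for a dense transformer: C ≈ 6·N·D."""
        return 6 * n_params * n_tokens

    N, D = 1e12, 15e12        # say, 1T parameters and 15T training tokens
    C = compute_budget(N, D)  # ≈ 9e25 FLOPs

    # Doubling C while keeping N and D in proportion means each grows by sqrt(2) ≈ 1.41:
    k = math.sqrt(2)
    assert abs(compute_budget(k * N, k * D) / C - 2) < 1e-9
    print(f"C ≈ {C:.1e} FLOPs; to double it, N and D must each grow ~{k:.2f}x")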
We must either (1) discover new architectures with different scaling laws, and/or (2) compute new synthetic data that can contribute to learning (akin to dreams).
antirez
This is true for the pre-training step. But what if advancements in the reinforcement learning steps performed later could benefit from more compute and more model parameters? If right now the RL steps only help with sampling, that is, they only make the model prefer one possible reply over another (there are papers pointing at this: if you generate many replies with just the common sampling methods and you can verify the correctness of each reply, you discover that what RL mostly does is select what was already potentially within the model's output distribution), then this would be futile. But maybe advancements in RL will do for LLMs what AlphaZero-like models did for Chess/Go.
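The sampling-only baseline those papers compare against looks roughly like this (just a sketch; sample_reply and verify are hypothetical stand-ins for a sampler and a domain verifier):

    from typing import Callable, Optional

    def best_of_k(prompt: str,
                  sample_reply: Callable[[str], str],
                  verify: Callable[[str], bool],
                  k: int = 16) -> Optional[str]:
        """No RL, just search: sample k replies with ordinary sampling and
        return the first one the verifier accepts. If this matches the
        RL-trained model, RL was mostly selecting answers the base model
        could already produce."""
        for _ in range(k):
            reply = sample_reply(prompt)
            if verify(reply):
                return reply
        return None  # the base model never produced a verifiable reply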
cs702
It's possible. We're talking about pretraining meaningfully larger models past the point at which they plateau, only to see if they can improve beyond that plateau with RL. Call it option (3). No one knows if it would work, and it would be very expensive, so only the largest players can try it, but why the heck not?
charleshn
> We cannot add more compute to a given compute budget C without increasing data D to maintain the relationship.
> We must either (1) discover new architectures with different scaling laws, and/or (2) compute new synthetic data that can contribute to learning (akin to dreams).
Of course we can; this is a non-issue.
See e.g. AlphaZero [0] that's 8 years old at this point, and any modern RL training using synthetic data, e.g. DeepSeek-R1-Zero [1].
jeremyjh
AlphaZero trained itself on chess games that it played against itself. Chess positions have something very close to an objective ground truth for evaluation: the rules are clear and bounded, and winning is measurable. How do you achieve this for a language model?
Yes, distillation is a thing but that is more about compression and filtering. Distillation does not produce new data in the same way that chess games produce new positions.
charleshn
You can have a look at the DeepSeek paper, in particular section "2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model".
But generally the idea is that you need some notion of reward, verifiers, etc.
It works really well for maths, algorithms, and many things, actually.
See also this very short essay/introduction: https://www.jasonwei.net/blog/asymmetry-of-verification-and-...
That's why we have IMO-gold-level models now, and I'm pretty confident we'll have superhuman models for mathematics, algorithms, etc. before long.
Now, domains which are very hard to verify (think e.g. theoretical physics), that's another story.
voxic11
Synthetic data is already widely used for training in the programming and mathematics domains, where automated verification is possible. Here is an example of an open source verified reasoning synthetic dataset: https://www.primeintellect.ai/blog/synthetic-1
scotty79
Simple: you just need to turn language into a game.
You make models talk to each other, create puzzles for each other to solve, and ask each other to make cases and evaluate how well they were made.
Will some of it look like ramblings of pre-scientific philosophers? (or modern ones because philosophy never progressed after science left it in the dust)
Sure! But human culture was once there too. And we pulled ourselves out of this nonsense by the bootstraps. We didn't need to be exposed to 3 alien internets with higher truth.
It's really a miracle that AIs got as much as they did from the purely human-generated, mostly-garbage text we cared to write down.
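One round of such a game could look roughly like this (just a sketch; proposer, solver, and judge are stand-ins for model calls):

    def self_play_round(proposer, solver, judge) -> dict:
        """Proposer invents a puzzle, solver attempts it, judge scores the
        attempt. The transcript becomes new training data."""
        puzzle = proposer("Invent a self-contained puzzle with a checkable answer.")
        attempt = solver(puzzle)
        score = judge(puzzle, attempt)  # e.g. a value in [0, 1]
        return {"puzzle": puzzle, "attempt": attempt, "score": score}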
cs702
> Of course we can, ... synthetic data ...
That's option (2) in the parent comment: synthetic data.
FloorEgg
What about (3), models that interact with the real world?
To be clear I also agree with your (1) and (2).
tliltocatl
That's the endgame, but on the other hand, we already have one, it's called "humanity". No reason to believe that another one would be much cheaper. Interacting with the real world is __expensive__. It's the most expensive thing of all.
FloorEgg
Very true. Living cells are ~4-5 orders of magnitude more functional-information-dense than the most advanced chips, and there is a lot more living mass than advanced chips.
But the networking potential of digital compute is a fundamentally different paradigm than living systems. The human brain is constrained in size by the width of the female pelvis.
So while it's expensive, we can trade scope-constrained robustness (replication and redundancy at many levels of abstraction) for broader cognitive scale and fragility (data centers can't repair themselves and self-replicate).
Going to be interesting to see it all unfold... my bet is on stacking S-curves all the way.
jvanderbot
Play in the real world generates a data point every few minutes. Seems a bit slow?
pizzly
Human experience (play in the real world) is multimodal through vision, sound, touch, pressure, muscle feedback, gravity, etc. It's extremely rich in data. It's also not a data point, it's a continuous stream of information. I would also bet that humans synthesize data at the same time: every time we run multiple scenarios in our mind before choosing the one we execute, without even thinking about it, we are synthesizing data. Humans also dream, which is another form of data synthesis. Allowing AI to interact with the real world is definitely a way to go.
FloorEgg
What are you basing that statement on?
What exactly are you considering a "data point"?
Are you assuming one model = one agent instance?
I am pretty sure that there is more information (molecular structure) and functional information (I(Ex)) just in the room I am sitting in than in all the unique, useful, digitized information on earth.
geetee
I don't understand why we need more data for training. Assuming we've already digitized every book, magazine, research paper, newspaper, and other forms of media, why do we need this "second internet?" Legal issues aside, don't we already have the totality of human knowledge available to us for training?
kbenson
I interpreted it as a roundabout way of increasing quality. Take any given subreddit. You have posts and comments, and scores, but what if the data quality isn't very good overall? What if, instead of using it as is, you had an AI evaluate and reason about all the posts, and classify them itself based on how useful the posts and comments are, how well they work out in practice (if easily simulated), etc.? Essentially you're using the AI to provide a moderated and carefully curated set of information about the information that was already present. If you then ingest this information, does that increase the quality of the data? Probably(?), since you're throwing compute and AI reasoning at the problem ahead of time, which reduces compute later and dilutes the low-quality data by adding additional high-quality data.
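Something shaped like this, roughly (score_post here is a hypothetical LLM grader, not any particular API):

    def curate(posts: list[dict], score_post, keep_threshold: float = 0.7) -> list[dict]:
        """Run raw posts through an LLM grader and keep only the ones it rates
        as useful, attaching the grader's rationale as extra signal."""
        curated = []
        for post in posts:
            verdict = score_post(post["text"])  # -> {"score": float, "rationale": str}
            if verdict["score"] >= keep_threshold:
                curated.append({**post, "rationale": verdict["rationale"]})
        return curated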
dr_dshiv
Let’s keep in mind that we don’t have most of the texts from the Renaissance through the early modern period (1400-1800), because they were published in Neo-Latin with older typefaces, and only about 10% is even digitized.
We probably don’t have most of the Arabic corpus either — and barely any Sanskrit. Classical Chinese is probably also lacking — only about 1% of it is translated to English.
jacobolus
The volume of English text digitized from the past few years dwarfs the volume of Latin text from all time. Unless you are wondering about a very niche historical topic, there’s more written in English than in Latin about basically everything.
typpilol
Don't most models learn from different language sets already?
brazzy
The point is that current methods are unable to get more intelligence out of training on the totality of human knowledge than the current state-of-the-art models already show. Previously, the amount of compute needed to process that much data was a limit, but not anymore.
So now, in order to progress further, we either have to improve the methods, or synthetically generate more training data, or both.
incompatible
A lot of newspapers seem to be stuck behind paywalls, even when in the public domain.
frankenstine
> The path forward: data alchemists (high-variance, 300% lottery ticket) or model architects (20-30% steady gains)
No, the paths forward are: better design, training, feeding in more video, audio, and general data from the outside world. The web is just a small part of our experience. What about apps, webcam streams, radio from all over the world in its many forms, OTA TV, interacting with streaming content via remote, playing every video game, playing board games with humans, feeds and data from robots LLMs control, watching everyone via their phones and computers, car cameras, security footage and CCTV, live weather and atmospheric data, cable television, stereoscopic data, ViewMaster reels, realtime electrical input from various types of brains while interacting with their attached creatures, touch and smell, understanding birth, growth, disease, death, and all facets of life as an observer, observing those as a subject, expanding to other worlds, solar systems, galaxies, etc., affecting time and space, search and communication with a universal creator, and finally understanding birth and death of the universe.
fao_
I'll give this comment more or less exactly the level of seriousness as it deserves, and say: lol
hn_acc1
Reminds me a bit of "Person of Interest" (the TV show).
NooneAtAll3
while I don't disagree with the facts, I don't understand the... tone?
when Dennard scaling (the driver of single-core performance gains) started to fail in the mid-2000s, I don't think there was a sentiment of "how stupid was it to believe in such scaling at all"?
sure, people were complacent (and we still meme about running Crysis), but in the end the discussion resulted in "no more free lunch": progress in one direction had hit a bottleneck, so it was time to choose some other direction to improve on (and multi-threading has now become mostly the norm)
I don't really see much of a difference?
Quarrelsome
Can I just whine for a second? As someone who didn't go to expensive maths club, the way people who did go talk about maths is disgraceful imho. Consider the equation in this article:
(C ~ 6 N⋅D)
I can look up the symbol for "roughly equals"; that was super cool and is a great part of curiosity. But this _implied_ multiplication between the 6 and the N, combined with using a fucking diamond symbol (that I already despise given how long it took me to figure it out the first time I encountered it), is just gross. I figured it was likely that, but then I was like: "but why not just 6ND? Maybe there's a reason why it's N⋅D but 6 N? Does that mean there's a difference between those operations?"
Thankfully I can use gippity these days to get by, but before gippity I had to look up an entire list of maths symbols to find the diamond symbol and work out what it meant. It's why I love code: there's considerably less implicit behaviour once you slap down the formula into code, and you can play with the input/output.
I don't think mathsy people realise how exclusionary their communication is, but it's so frustrating when I end up fumbling around in slow-mo when the maths kicks in, because "oh, the /2 when discussing logarithms in comp sci is _obvious_, so we just don't put it in the equation" just kills me. Idiot me, staring at the equation thinking it actually makes sense, not knowing the special implied maths knowledge that means it doesn't actually solve as it reads on the page. Unless of course you went to expensive maths club where they tell you all this.
What drives me nuts is that every time I spend ages finally grokking something, I realise how obvious it is and how non-trivial it is to explain it simply. Comp sci isn't much better to be honest, where we use CQRS instead of "read here, write there". Which results in thousands of newbies trying to parse the unfathomable complexity of "Command Query Responsibility Segregation" and spending as much time staring at its opaqueness as I did the opening sentence of the wikipedia article on logarithms.
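For what it's worth, here's that formula slapped down into code, with nothing left implied (numbers made up):

    N = 1.0e12   # parameters
    D = 15.0e12  # training tokens
    # "C ~ 6 N⋅D": the juxtaposition (6 N) and the dot (N⋅D) are both just
    # multiplication, and "~" means "approximately", so:
    C = 6 * N * D
    print(C)  # about 9e25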
bwhiting2356
Audio and video data can be collected from the real world. It won't be immediate and won't be cheap.
back2dafucha
About 28 years ago, a wise person said to me: "Data will kill you." Even mainframe programmers knew it.
TheDudeMan
I interpret The Bitter Lesson as suggesting that you should be selecting methods that do not need all that data (in many domains, we don't know those methods yet).
Mistletoe
>And herein lies the problem — we’ve basically ingested the entire Internet, and there is no second Internet.
One of the best things I've read in a while.
throwaway314155
The scaling laws for transformers _deliberately_ factor in the amount of data as well as the amount of compute needed in order to scale.
The premise of this article, that data is more important than compute, has been obvious to people who are paying attention.
Sorry, but the unnecessary sensationalism in this article was mildly annoying to me. Like the author discovered some novel new insight. A bit like that doctor who published a "novel" paper about how to find the area under a curve.
Hey folks, OOP/original author here and a ≥10-year HN lurker. A friend just told me about this thread, so I thought I'd chime in.
Reading through the comments, I think there's one key point that might be getting lost: this isn't really about whether scaling is "dead" (it's not), but rather how we continue to scale for language models at the current LM frontier — 4-8h METR tasks.
Someone commented below about verifiable rewards and IMO that's exactly it: if you can find a way to produce verifiable rewards about a target world, you can essentially produce unlimited amounts of data and (likely) scale past the current bottleneck. Then the question becomes, working backwards from the set of 4-8h METR tasks, what worlds can we make verifiable rewards for and how do we scalably make them? (There's another path with better design, e.g. CLIP that improves both architecture _and_ data, but let's leave that aside for now.)
Which is to say, it's not about more data in general, it's about the specific kind of data (or architecture) we need to break a specific bottleneck. For instance, real-world data is indeed verifiable and will be amazing for robotics, etc. but that frontier is further behind: there are some cool labs building foundational robotics models, but they're maybe ~5 years behind LMs today.