
Ilya Sutskever: We're moving from the age of scaling to the age of research

JimmyBuckets

I respect Ilya hugely as a researcher in ML and quite admire his overall humility, but I have to say I cringed quite a bit at the start of this interview when he talks about emotions, their relative complexity, and their origin. Emotion is enormously complex, especially once you consider all the systems in the body it interacts with. And many mammals have very intricate socio-emotional lives - take orcas or elephants. There is an arrogance I have seen that is typical of ML (having worked in the field) that makes its members too comfortable treading into adjacent intellectual fields they should treat with more respect and reverence. Anyone else notice this? It's something physicists are often accused of also.

marcinzm

Armchair X and the Dunning-Kruger effect are not unique to ML engineers. You just interact with engineers more, so you see it more, and engineers tend not to be as good at social BS, so you notice it more. I'm sure everyone here has had a classic micromanager boss who knew better than them despite never having touched code.

tmp10423288442

He's talking his book. Doesn't mean he's wrong, but Dwarkesh is now big enough that you should assume every big name there is talking their book.

l5870uoo9y

> These models somehow just generalize dramatically worse than people.

The whole mess surrounding Grok's ridiculous overestimation of Elon's abilities in comparison to other world stars did not so much show Grok's sycophancy or bias towards Elon as it showed that Grok fundamentally cannot compare (generalize), nor does it have a deeper understanding of what the generated text is about. Calling for more research and less scaling is essentially saying: we don't know where to go from here. Seems reasonable.

radicaldreamer

I think the problem with that is that Grok has likely been prompted to do that in the system prompt, or via prompts that get added for questions about Elon. That most likely doesn't reflect the actual reasoning or generalization abilities of the underlying model.

l5870uoo9y

You can also give AI models Nobel-prize-winning world literature, ask why it is bad, and they will tear apart the text without ever thinking "wait, this is some of the best writing produced by man".

ffsm8

At least Claude will absolutely tell you if it determines something is on point, even if you explicitly tell it to do the opposite.

I'm just pointing this out because they're not quite as two-dimensional as you're insinuating - even if they're frequently wrong and need careful prompting for decent quality

(after the initial "you're absolutely right!", and once it has finished "thinking" about it)

signatoremo

I bet you can find plenty of exactly that in the human reviews of any past winner.

asolove

Yes it does.

Today on X, people are having fun baiting Grok into saying that Elon Musk is the world’s best drinker of human piss.

If you hired a human PR sycophant, even one of moderate intelligence, they would know not to generalize from “say nice things about Elon” to “say he’s the best at drinking piss”.

Havoc

There’s no way that wasn’t specifically prompted.

bugglebeetle

To be fair, it could’ve been post-trained into the model as well…

alyxya

The impactful innovations in AI these days aren't really coming from scaling models to be larger. Scaling makes it easy to show higher benchmark scores, and that implies higher intelligence, but this higher intelligence doesn't necessarily translate to all users feeling like the model has significantly improved for their use case. Models sometimes still struggle with simple questions like counting letters in a word, and most people don't have a use case of a model needing PhD-level research ability.

Research now matters more than scaling when research can fix limitations that scaling alone can't. I'd also argue that we're in the age of product, where the integration of product and models plays a major role in what they can do combined.

pron

> this implies higher intelligence

Not necessarily. The problem is that we can't precisely define intelligence (or at least haven't so far), and we certainly can't (yet?) measure it directly. What we have are certain tests whose scores, we believe, are correlated with that vague thing we call intelligence in humans. Except these test scores can correlate with intelligence (whatever it is) in humans and at the same time correlate with something that's not intelligence in machines. So a high score may well imply high intelligence in humans but not in machines (perhaps because machine models may overfit more than a human brain does, so an intelligence test designed for humans doesn't necessarily measure the same thing we mean by "intelligence" when applied to a machine).

This is like the following situation: Imagine we have some type of signal, and the only process we know produces that type of signal is process A. Process A always produces signals that contain a maximal frequency of X Hz. We devise a test for classifying signals of that type that is based on sampling them at a frequency of 2X Hz. Then we discover some process B that produces a similar type of signal, and we apply the same test to classify its signals in a similar way. Only, process B can produce signals containing a maximal frequency of 10X Hz and so our test is not suitable for classifying the signals produced by process B (we'll need a different test that samples at 20X Hz).
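To restate the analogy in standard sampling-theorem (Nyquist) terms (the X and 10X figures are the same ones used above):

f_s > 2 * f_max (faithful reconstruction requires sampling faster than twice the highest frequency present)
Process A: f_max = X, so sampling at 2X Hz is enough.
Process B: f_max = 10X, so sampling at 2X Hz aliases; you would need more than 20X Hz.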

alyxya

Fair, I think it would be more appropriate to say higher capacity.

pron

Ok, but the point of a test of this kind is to generalise from its result. I.e., the whole point of an intelligence test is that we believe a human getting a high score on such a test is more likely than a human with a low score to do well at useful things that aren't on the test. But if the problem is that the test results - as you said - don't generalise as we expect, then the tests are not very meaningful to begin with. If we don't know what to expect from a machine with a high test score when it comes to doing things not on the test, then the only "capacity" we're measuring is the capacity to do well on such tests, and that's not very useful.

TheBlight

"Scaling" is going to eventually apply to the ability to run more and higher fidelity simulations such that AI can run experiments and gather data about the world as fast and as accurately as possible. Pre-training is mostly dead. The corresponding compute spend will be orders of magnitude higher.

alyxya

That's true. I expect more inference-time scaling, and hybrid inference/training-time scaling once there's continual learning, rather than scaling of model size or pretraining compute.

TheBlight

Simulation scaling will be the most insane though. Simulating "everything" at the quantum level is impossible and the vast majority of new learning won't require anything near that. But answers to the hardest questions will require as close to it as possible so it will be tried. Millions upon millions of times. It's hard to imagine.

pessimizer

> most people don't have a use case of a model needing PhD-level research ability.

Models also struggle at not fabricating references or entire branches of science.

edit: "needing phd level research ability [to create]"?

nutjob2

> this implies higher intelligence

Models aren't intelligent; the intelligence is latent in the text (etc.) that the model ingests. There is no concrete definition of intelligence, only that humans have it (in varying degrees).

The best you can really state is that a model extracts/reveals/harnesses more intelligence from its training data.

dragonwriter

> There is no concrete definition of intelligence

Note that if this is true (and it is!) all the other statements about intelligence and where it is and isn’t found in the post (and elsewhere) are meaningless.

darkmighty

There is no concrete definition of a chair either.

delichon

If scaling reaches the point at which the AI can do the research better than natural intelligence at all, then scaling and research amount to the same thing, as far as the validity of the bitter lesson is concerned. Ilya's commitment to this path is a statement that he doesn't think we're all that close to parity.

pron

I agree with your conclusion but not with your premise. To do the same research it's not enough to be as capable as a human intelligence; you'd need to be as capable as all of humanity combined. Maybe Albert Einstein was smarter than Alexander Fleming, but Einstein didn't discover penicillin.

Even if some AI was smarter than any human being, and even if it devoted all of its time to trying to improve itself, that doesn't mean it would have better luck than 100 human researchers working on the problem. And maybe it would take 1000 people? Or 10,000?

Herring

rdedev

The third graph is interesting. Once model performance rises above the human baseline, the growth seems to be logarithmic instead of exponential.

rockinghigh

You should read the transcript. He's including 2025 in the age of scaling.

> Maybe here’s another way to put it. Up until 2020, from 2012 to 2020, it was the age of research. Now, from 2020 to 2025, it was the age of scaling—maybe plus or minus, let’s add error bars to those years—because people say, “This is amazing. You’ve got to scale more. Keep scaling.” The one word: scaling.

> But now the scale is so big. Is the belief really, “Oh, it’s so big, but if you had 100x more, everything would be so different?” It would be different, for sure. But is the belief that if you just 100x the scale, everything would be transformed? I don’t think that’s true. So it’s back to the age of research again, just with big computers.

Herring

Nope, Epoch.ai thinks we have enough to scale till 2030 at least. https://epoch.ai/blog/can-ai-scaling-continue-through-2030

bugglebeetle

…and what does the team at Epoch.ai have under their belt in terms of applied theory, compared to Sutskever? The report reads like consulting-firm pablum.

imiric

That article is from August 2024. A lot has changed since then.

Specifically, performance of SOTA models has been reaching a plateau on all popular benchmarks, and this has been especially evident in 2025. This is why every major model announcement shows comparisons relative to other models, but not a historical graph of performance over time. Regardless, benchmarks are far from being a reliable measurement of the capabilities of these tools, and they will continue to be reinvented and gamed, but the asymptote is showing even on their own benchmarks.

We can certainly continue to throw more compute at the problem. But the point is that scaling the current generation of tech will keep yielding diminishing returns.

To make up for this, "AI" companies are now focusing on engineering. 2025 has been the year of MCP, "agents", "skills", etc., which will continue in 2026. This is a good thing, as these tools need better engineering around them, so they can deliver actual value. But the hype train is running out of steam, and unless there is a significant breakthrough soon, I suspect that next year will be a turning point in this hype cycle.

epistasis

That blog post is eight months old. That feels like pretty old news in the age of AI. Has it held since then?

conception

It looks like it’s been updated, as it has Codex 5.1 Max on it.

itissid

All coding agents are geared toward optimizing one metric, more or less: getting people to burn through more tokens — that is, more $$$.

If these agents moved toward a pricing model where $$$ were charged for project completion + lower ongoing code maintenance cost + moving large projects forward, _somewhat_ similar to how IT consultants charge, this would be a much better world.

Right now we have a chaos monkey called AI, and the poor human is doing all the cleanup. Not to mention an effing manager telling me that since I now "have" AI, I should push 50 features instead of 5 this cycle.

kace91

>this would be a much better world.

Would it?

We’d close one of the few remaining social elevators, displace highly educated people by the millions, and accumulate even more wealth at the top of the chain.

If LLMs manage similar results to engineers and everyone gets free unlimited engineering, we’re in for the mother of all crashes.

On the other hand, if LLMs don’t succeed we’re in for a bubble bust.

londons_explore

> These models somehow just generalize dramatically worse than people. It's a very fundamental thing

My guess is we'll discover that biological intelligence is 'learning' not just from an individual's own experience, but from that of thousands of ancestors.

There are a few weak pointers in that direction, e.g. a father who experiences a specific fear can pass that fear to grandchildren through sperm alone [1].

I believe this is at least part of the reason humans appear to perform so well with so little training data compared to machines.

[1]: https://www.nature.com/articles/nn.3594

johnxie

I don’t think he meant scaling is done. It still helps, just not in the clean way it used to. You make the model bigger and the odd failures don’t really disappear. They drift, forget, lose the shape of what they’re doing. So “age of research” feels more like an admission that the next jump won’t come from size alone.

energy123

It still does help in the clean way it used to. The problem is that the physical world is providing more constraints like lack of power and chips. Three years ago there was scaling headroom created by the gaming industry and other precursor activities.

kmmlng

The scaling laws are also power laws, meaning that most of the big gains happen early in the curve, and improvements become more expensive the further you go along.
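To put rough numbers on that shape (the exponent below is purely illustrative, not a measured value for any particular model family): if loss falls as a power law in compute C,

L(C) ≈ a * C^(-α), with α ≈ 0.05 for illustration,

then 10x more compute multiplies the loss by only 10^(-0.05) ≈ 0.89 (roughly an 11% reduction), and halving the loss would take about 2^(1/α) = 2^20 ≈ 10^6 times more compute.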

oytis

Ages just keep flying by

andy_ppp

So is the translation that endless scaling has stopped being as effective?

Animats

It's stopped being cost-effective. Another order of magnitude of data centers? Not happening.

The business question is, what if AI works about as well as it does now for the next decade or so? No worse, maybe a little better in spots. What does the industry look like? NVidia and TSMC are telling us that price/performance isn't improving through at least 2030. Hardware is not going to save us in the near term. Major improvement has to come from better approaches.

Sutskever: "I think stalling out will look like…it will all look very similar among all the different companies. It could be something like this. I’m not sure because I think even with stalling out, I think these companies could make a stupendous revenue. Maybe not profits because they will need to work hard to differentiate each other from themselves, but revenue definitely."

Somebody didn't get the memo that the age of free money at zero interest rates is over.

The "age of research" thing reminds me too much of mid-1980s AI at Stanford, when everybody was stuck, but they weren't willing to admit it. They were hoping, against hope, that someone would come up with a breakthrough that would make it work before the house of cards fell apart.

Except this time everything costs many orders of magnitude more to research. It's not like Sutskever is proposing that everybody should go back to academia and quietly try to come up with a new idea to get things un-stuck. They want to spend SSI's market cap of $32 billion on some vague ideas involving "generalization". Timescale? "5 to 20 years".

This is a strange way to do corporate R&D when you're kind of stuck. Lots of small and medium-sized projects seem more promising, along the lines of Google X. The discussion here seems to lean in the direction of one big bet.

You have to admire them for thinking big. And even if the whole thing goes bust, they probably get to keep the house and the really nice microphone holder.

energy123

The ideas likely aren't vague at all given who is speaking. I'd bet they're extremely specific. Just not transparently shared with the public because it's intellectual property.

jsheard

The translation is that SSI says that SSI's strategy is the way forward, so could investors please stop giving OpenAI money and give SSI that money instead. SSI has not shown anything yet, nor does SSI intend to show anything until they have created an actual Machine God, but SSI says they can pull it off, so it's all good to go ahead and wire the GDP of Norway directly to Ilya.

aunty_helen

If we take AGI as a certainty, i.e. we think we can achieve AGI using silicon, then Ilya is one of the best bets you can take if you are looking to invest in this space. He has a track record, and he's motivated to continue working on this problem.

If you think that AGI is not possible to achieve, then you probably wouldn't be giving anyone money in this space.

gessha

It’s a snake oil salesman’s world.

shwaj

Are you asking whether the whole podcast can be boiled down to that translation, or whether you can infer/translate that from the title?

If the former, no. If the latter, sure, approximately.

Quothling

Not really, but there is a finite amount of data to train models on. I found it rather interesting to hear him talk about how Gemini has been better at getting results out of the data than the competition, and how this gives the first insight into a new way of training models on the same data to get different results.

I think the title is interesting, because the scaling isn't about compute. At least as I understand it, what they're running out of is data, and one of the ways they deal with this, or may deal with this, is to have LLMs running concurrently and in competition. So you'll have thousands of models competing against each other to solve challenges through different approaches. Which to me suggests that the need for hardware scaling isn't about to stop.

imiric

The translation to me is: this cow has run out of milk. Now we actually need to deliver value, or the party stops.

wrs

"The idea that we’d be investing 1% of GDP in AI, I feel like it would have felt like a bigger deal, whereas right now it just feels...[normal]."

Wow. No. Like so many other crazy things that are happening right now, unless you're inside the requisite reality distortion field, I assure you it does not feel normal. It feels like being stuck on Calvin's toboggan, headed for the cliff.