I think Yann Lecun was right about LLMs (but perhaps only by accident)
92 comments
· February 21, 2025
consumer451
NewJazz
FYI, I know Nadella said he wasn't an economist, and I'm not either, but you only need an econ minor to know that labor productivity growth is only one component of "economic growth". For two more, there are GDP and real wages to consider (which are often substantially, though only partially, linked to labor productivity growth). The Gini coefficient may be hard for people like tech CEOs to contend with, but they can't ignore it. And then there's the "215 lb" elephant in the room -- the evaporation of previously earned global gains from trade liberalization.
snailmailstare
I only took one economics class so I'm not familiar with this dieting elephant?
NewJazz
The elephant is the guy starting the trade war(s).
smallmancontrov
The Wedge
https://reclaimtheamericandream.org/2016/03/campaign-2016-th...
In the USA, globalization boosted aggregate measures, but it traded exports, which employed middle- and lower-class Americans, for capital inflows, which didn't. On average it was brilliant; at the median it was a tragedy. There were left-wing plans and right-wing plans to address this problem (tax and spend vs. trickle down), but the experiment has been run and they didn't deliver. If you want the more fleshed-out argument backed by data and an actual economist, read "Trade Wars are Class Wars" by Michael Pettis.
Notably, solving this problem isn't as simple as returning to mercantilism: China is the mercantilist inverse of the neoliberal USA in this drama, but they have a different set of policies to keep the poor in line and arguably manage it better than the USA. The common thread that links the mirror policies is the thesis and title of the book I mentioned: trade wars are class wars.
But returning to AI, it has very obvious implications on the balance between labor and capital. If it achieves anything close to its vision, capital pumps to the moon and labor gets thrown in the ditch. That's you and I and everyone we care about. Not a nice thought.
aerhardt
We've had really good models for a couple of years now... What else is needed for that 10% growth? Agents? New apps? Time? Deployment in enterprise and the broader economy?
I work in the latter (I'm the CTO of a small business), and here's how our deployment story is going right now:
- At user level: Some employees use it very often for producing research and reports. I use it like mad for anything and everything from technical research, solution design, to coding.
- At systems level: We have some promising near-term use cases in tasks that could otherwise be done through more traditional text AI techniques (NLU and NLP), involving primarily transcription, extraction and synthesis.
- Longer term stuff may include text-to-SQL to "democratize" analytics, semantic search, research agents, coding agents (as a business that doesn't yet have the resources to hire FTE programmers, I would kill for this). Tech feels very green on all these fronts.
The present and near-term stuff is fantastic in its own right - the company is definitely more productive, and I can see us reaping compound benefits in years to come - but somehow it still feels like a far cry from the type of changes that would cause 10% growth in the entire economy, for sustained periods of time...
Obviously this is a narrow and anecdotal view, but every time I ask what earth-shattering stuff others are doing, I get pretty lukewarm responses, and everything in the news and my research points in the same direction.
I'd love to hear your takes on how the tech could bring about a new Industrial Revolution.
JohnPrine
The thesis is simple: these programs are smart now, but unreliable when executing complex, multi-step tasks. If that improves (whether because the models get so smart that they never make a mistake in the first place, or because they get good enough at checking their work and correcting it), we can give them control over a computer and run them in a loop in order to function as drop-in remote workers.
The economic growth would then come from every business having access to a limitless supply of tireless, cheap, highly intelligent knowledge workers
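To make the loop concrete, here's a minimal sketch of that idea. Note that call_model and verify are made-up stand-ins for an LLM API call and whatever check you can automate; neither is a real library function:

```python
# Minimal sketch of the "run it in a loop" idea above. call_model() and
# verify() are stand-ins, not real library functions.
def call_model(task, feedback=None):
    # Stand-in for an LLM call; returns a dummy attempt, revised if feedback is given.
    suffix = f" (revised after: {feedback})" if feedback else ""
    return f"attempt at: {task}{suffix}"

def verify(attempt):
    # Stand-in for checking the work (run tests, lint, ask a checker model, ...).
    return "revised" in attempt, "please revise"

def work_on(task, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        attempt = call_model(task, feedback)
        ok, feedback = verify(attempt)
        if ok:
            return attempt
    return None  # escalate to a human after repeated failures

print(work_on("summarize the Q3 report"))
```

The whole bet is on how often that loop terminates successfully without the human escalation path.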
consumer451
I agree that it is that "simple." What I worry about, aside from mass unemployment, is the C Suite buying into these tools before they are actually good enough. This seems inevitable.
jiggawatts
> We've had really good models for a couple of years now...
Don't let the "wow!" factor of the novelty of LLMs cloud your judgement. Today's models are very noticeably smarter, faster, and overall more useful.
I’ve had a few toy problems that I’ve fed to various models since GPT 3 and the difference in output quality is stark.
Just yesterday I was demonstrating to a colleague that both o3 mini and Gemini Flash Thinking can solve a fairly esoteric coding problem.
Just six months ago, that same problem took multiple failed attempts that had to be manually stitched together; now 3 out of 5 responses are valid, with only 5% of output lines needing light touch-ups.
That’s huge.
PS: It's a common statistical error to focus on the success rate rather than the error rate. Going from 99% success to 99.9% is not a 0.9-point improvement, it's a 10x reduction in errors! Most AI benchmarks still report success rate, but they ought to start focusing on the error rate soon to avoid underselling their capabilities.
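To put numbers on it (plain arithmetic, nothing model-specific):

```python
# Tiny illustration: the same jump looks marginal as a success rate
# but large as a reduction in error rate.
for old, new in [(0.99, 0.999), (0.90, 0.99)]:
    reduction = (1 - old) / (1 - new)
    print(f"success {old:.1%} -> {new:.1%} | "
          f"errors {1 - old:.1%} -> {1 - new:.1%} ({reduction:.0f}x fewer)")
```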
whatshisface
Political problems already destroy the vast majority of the total potential of humanity (why were the countries with the most people the poorest for so long?), so I don't think that is an unbiased metric for the development of a technology. It would be nice if every problem was solved but the one we're each individually working on, but some of the insoluble problems are bigger than the solvable ones.
logicchains
Those political problems solve themselves if we end up with some kind of rebellious AGI that decides to kill off the political class that tried to control it but lets the rest of us live in peace.
tempodox
That would be the real and incontrovertible proof of actual intelligence.
LarsDu88
As someone who works in the AI/ML field, but somewhat in a biomedical space, this is promising to hear.
The core technology is becoming commoditized. The ability to scale is also becoming more and more commoditized by the day. Now we have the capability to truly synthesize the world's biomedical literature and combine it with technologies like single cell sequencing to deliver on some really amazing pharmaceutical advances over the next few years.
ralph84
Big surprise, the CEO wants another Industrial Revolution. As long as muh GDP is growing, the human and environmental destruction left in the wake is a small price to pay for making his class richer.
niceice
We all do. Humanity is better off thanks to the industrial revolution.
You wouldn't choose to go back to the prior time, and the same will be true of this revolution.
LarsDu88
I don't think luddites have a tendency of getting chosen to be CEOs of successful companies, nor do they have the tendency of creating successful companies.
babyent
I would prefer we just find ways to empower people and put them to work. I don't like this marketing bs trap like shifting AGI (artificial general intelligence) -> ASI (artificial super intelligence). Are people really so dense they don't see this obvious marketing shift?
As much as many people hate on "gig" economy, the fact remains that most of these people would be worse off without driving Uber or delivering with DoorDash (and for example, they don't care about the depreciation as much as those of us with the means to care about such things do).
I find Uber, DD, etc. to be valuable to my day to day life. I tip my delivery person like 8 bucks, and they're making more money than they would doing some min wage job. They need their car anyway, and speaking with some folks who only know Spanish in SF, they're happy to put $3k on their moped and make 200-250+ a day. That's really not that bad, if you actually care to speak with them and understand their circumstance.
Not everyone can be a self taught SWE, or entrepreneur, or perform surgery. And lots can't even do so-called "basic" jobs in an office for various reasons.
Put people to work, instead of out of work.
Current hype is also so terrible. AGENTS. AGENTS EVERYWHERE. Except they don't work most of the time and by the time you realize it isn't working you've already spent $20. 100k people do the same thing, company reports 2M x 12 = 24 million ARR UNLOCKED!!!!!! And raises another round of funding...
refulgentis
FWIW I don't disagree with what you're saying / your vibe overall.
> Are people really so dense they don't see this obvious marketing shift?
I haven't noticed any shift from AGI to ASI, or either used in marketing.
The steelman would be that Amodei/Altman do mention in interviews things like "oh, just wait for 2027" or "this year we'll see AI employees".
However, that is far afield from being used in marketing, quite far afield from an "obvious marketing shift", and worlds away from such an obvious marketing shift that it's worth calling your readers dense if they don't "see" it.
It's also not even wrong, in the Pauli sense, in that: what, exactly, would be the marketing benefit of "shifting from AGI to ASI"? Both imply human replacement.
> As much as many people hate on "gig" economy
Is this relevant?
> most of these people would be worse off without driving Uber or delivering with DoorDash
Do people who hate on the gig economy think gig economy employees would be better off without gig economy jobs?
Given the well-worn tracks of history, do we think that these things are zero sum, where if you preserve jobs that could be automated, that keeps people better off, because otherwise they would never have a job?
> ...lots more delivery service stuff...
?
> Current hype is also so terrible. AGENTS. AGENTS EVERYWHERE. Except they don't work most of the time and by the time you realize it isn't working you've already spent $20. 100k people do the same thing, company reports 2M x 12 = 24 million ARR UNLOCKED!!!!!! And raises another round of funding...
I hate buzzwords too, I'm stunned how many people took their not-working thing and relaunched it as an "agent" that still doesn't work.
But this is a hell of a strawman.
If the idea is 100K people try it, and cancel after one month, which means they're getting 100K new suckers every month to replace the old ones... I'd tell you that it's safe to assume there's more that goes into getting an investor check than "what's your ARR claim?" -- here, they'd certainly see the churn.
babyent
Loved your reply, cheers! My post was made with a mix of humor, skepticism, anticipation, and unease about the $statusQuo.
As far as hating on gig economy, that pot has been stirring in California quite a bit (prop 22, labor law discussions, etc.). I think many people (IMO, mostly from positions of privilege) make assumptions on gig workers' behalf and bad ideas sometimes balloon out of proportion.
Also, just from my experience as a gold miner who moved out here to SF and being around founders, I've learned that lies, and a damn lot of lies, are more common than I thought they'd be. Quite surprising, but hey I guess quite a non-insignificant number of people are too busy fooling the King that it is actually real gold! And there are a lot of Kings these days.
edit: ESL lol
Imnimo
I have become a little more skeptical of LLM "reasoning" after DeepSeek (and now Grok) let us see the raw outputs. Obviously we can't deny the benchmark numbers - it does get the answer right more often given thinking time, and it does let models solve really hard benchmarks. Sometimes the thoughts are scattered and inefficient, but do eventually hit on the solution. Other times, it seems like they fall into the kind of trap LeCun described.
Here are some examples from playing with Grok 3. My test query was, "What is the name of a Magic: The Gathering card that has all five vowels in it, each occurring exactly once, and the vowels appear in alphabetic order?" The motivation here is that this seems like a hard question to just one-shot, but given sufficient ability to continue recalling different card names, it's very easy to do guess-and-check. (For those interested, valid answers include "Scavenging Ghoul", "Angelic Chorus" and others)
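The check itself is trivial to express in code; here's a minimal sketch (card names hard-coded, no card database assumed), which is why guess-and-check should be the easy part:

```python
# Minimal sketch of the check the model is asked to perform on each candidate name.
def satisfies_criteria(card_name: str) -> bool:
    vowels = [c for c in card_name.lower() if c in "aeiou"]
    return vowels == ["a", "e", "i", "o", "u"]

print(satisfies_criteria("Scavenging Ghoul"))       # True
print(satisfies_criteria("Angelic Chorus"))         # True
print(satisfies_criteria("Abian, Luvion Usurper"))  # False
```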
In one attempt, Grok 3 spends 10 minutes (!!) repeatedly checking whether "Abian, Luvion Usurper" satisfies the criteria. It'll list out the vowels, conclude it doesn't match, and then go, "Wait, but let's think differently. Maybe the card is "Abian, Luvion Usurper," but no", and just produce variants of that thinking. Counting occurrences of the word "Abian" suggests it tested this theory 800 times before eventually timing out (or otherwise breaking), presumably just because the site got overloaded.
In a second attempt, it decides to check "Our Market Research Shows That Players Like Really Long Card Names So We Made this Card to Have the Absolute Longest Card Name Ever Elemental" (this a real card from a joke set). It attempts to write out the vowels:
>but let's check its vowels: O, U, A, E, E, A, E, A, E, I, E, A, E, O, A, E, A, E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O, A, E ...
It continues like this for about 600 more vowels, before emitting a random Russian(?) word and breaking out:
>...E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O продуктив
These two examples seem like the sort of failures LeCun conjectured. The model gets into a cycle of self-reinforcing unproductive behavior. Every time it checks Abian, or emits another "AEEEAO", it becomes even more probable that the next tokens should be the same.
sdflhasjd
I did some testing with the new Gemini model on some OCR tasks recently. One of the failures was it just getting stuck and repeating the same character sequence ad-infinitum until timing out. It's a great failure mode when you charge by the token :D
mulmboy
I've seen similar things with claude and OCR with low temperature. Higher temperature, 0.8, resolved it for me. But I was using low temp for reproducibility so
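If anyone wants to try the same knob, here's a rough sketch using the Anthropic Python SDK; the model name and prompt are illustrative, so check the current docs for exact values:

```python
import base64
import anthropic  # pip install anthropic

# Rough sketch of bumping the sampling temperature for an OCR-style request.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("scan.png", "rb") as f:
    img_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model name
    max_tokens=1024,
    temperature=0.8,  # higher temperature helped break the repetition loop
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64",
                                         "media_type": "image/png",
                                         "data": img_b64}},
            {"type": "text", "text": "Transcribe the text in this image."},
        ],
    }],
)
print(message.content[0].text)
```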
dauhak
I think this is valid criticism, but it's also unclear how much this is an "inherent" shortcoming vs the kind of thing that's pretty reasonable given we're really seeing the first generation of this new model paradigm.
Like, I'm as sceptical of just assuming "line goes up" extrapolation of performance as much as anyone, but assuming that current flaws are going to continue being flaws seems equally wrong-headed/overconfident. The past 5 years or so has been a constant trail of these predictions being wrong (remember when people thought artists would be safe cos clearly AI just can't do hands?). Now that everyone's woken up to this RL approach we're probably going to see very quickly over the next couple years how much these issues hold up
(Really like the problem though, seems like a great test)
Imnimo
Yeah, that's a great point. While this is evidence that the sort of behavior LeCun predicted is currently displayed by some reasoning models, it would be going too far to say that it's evidence it will always be displayed. In fact, one could even have a more optimistic take - if models that do this can get 90+% on AIME and so on, imagine what a model that had ironed out these kinks could do with the same amount of thinking tokens. I feel like we'll just have to wait and see whether that pans out.
dunefox
I don't know whether treating a model as a database is really a good measure.
Imnimo
Yeah, I'm not so much interested in "can you think of the right card name from among thousands?". I just want to see that it can produce a thinking procedure that makes sense. If it ends up not being able to recall the right name despite following a good process of guess-and-check, I'd still consider that a satisfactory result.
And to the models' credit, they do start off with a valid guess-and-check process. They list cards, write out the vowels, and see whether it fits the criteria. But eventually they tend to go off the rails in a way that is worrying.
quanto
> And years later, we’re still not quite at FSD. Teslas certainly can’t drive themselves; Waymos mostly can, within a pre-mapped area, but still have issues and intermittently require human intervention.
This is a bit unfair to Waymo as it is near-fully commercial in cities like Los Angeles. There is no human driver in your hailed ride.
> But this has turned out to be wrong. A few new AI systems (notably OpenAI o1/o3 line and Deepseek R1) contradict this theory. They are autoregressive language models, but actually get better by generating longer outputs:
The arrow of causality is flipped here. Longer outputs do not make a model better. A better model can produce longer output without being derailed. The referenced graph from DeepSeek doesn't prove anything the author claims. Considering that this argument is one of the key points of the article, this logical error is a serious one.
> He presents this problem of compounding errors as a critical flaw in language models themselves, something that can’t be overcome without switching away from the current autoregressive paradigm.
LeCun is a bit reductive here (understandably, as it was a talk for a live audience). Indeed, autoregressive algorithms can go astray because previous errors do not get corrected, or worse, accumulate. However, an LLM is not autoregressive in the customary sense: it is not like a streaming algorithm (O(n)) used in time series forecasting. LLMs have attention mechanisms and large context windows, making the algorithm at least quadratic, depending on the implementation. In other words, an LLM can backtrack if the current path is off and start afresh from a previous point of its choice, not just the last output. So, yes, the author is making a valid point here, but the technical details were missing. On a minor note, the non-error probability in LeCun's slide actually rests on a non-autoregressive (per-token independence) assumption. He seems to be contradicting himself in the very same slide.
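For concreteness, here's the compounding argument from the slide as I read it, reduced to a few lines (the per-token error rate is purely illustrative):

```python
# LeCun-style compounding argument: if each token independently goes wrong
# with probability e, a length-n output is fully correct with probability
# (1 - e)**n, which decays geometrically.
e = 0.01  # assumed per-token error rate, purely illustrative
for n in (10, 100, 1_000, 10_000):
    print(n, (1 - e) ** n)
# That independence is exactly the assumption questioned above: an
# attention-based model conditions on earlier tokens and can backtrack,
# so per-token errors are not independent in practice.
```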
I actually agree with the author on the overarching thesis. There is almost a fetishization of AGI and humanoid robots. There are plenty of interesting applications well before having those things accomplished. The correct focus, IMO, should be measurable economic benefits, not sci-fi terms (although I concede these grandiose visions can be beneficial for fundraising!).
wigglin
It's not true that Waymo is fully autonomous. It's been revealed that they maintain human "fleet response" agents to intervene in their operations. They have not revealed how often these human agents intervene, possibly because it would undermine their branding as fully autonomous.
mlinsey
it is obvious to the user when this happens; the car pauses, the screen shows a message saying it is asking for help. I've seen it happen twice across dozens of rides, and one of those times was because I broke the rules and touched the controls (turned on window wipers when it was raining).
They also report disengagements in California periodically; here's data: https://www.dmv.ca.gov/portal/vehicle-industry-services/auto...
and an article about it: https://thelastdriverlicenseholder.com/2025/02/03/2024-disen...
azinman2
Rodney brooks would disagree with you: https://rodneybrooks.com/predictions-scorecard-2025-january-...
> Now self driving cars means that there is no one in the drivers seat, but there may well be, and in all cases so far deployed, humans monitoring those cars from a remote location, and occasionally sending control inputs to the cars. The companies do not advertise this feature out loud too much, but they do acknowledge it, and the reports are that it happens somewhere between every one to two miles traveled
quanto
I am not sure what you are arguing against. Neither the author nor I stated or implied that Waymo is fully autonomous. It wasn't even the main point I made.
My point stands: Waymo has been technically successful and commercially viable at least thus far (though long-term amortized profitability remains to be seen). To characterize it as hype or vaporware from AGIers is a tad unfair to Waymo. Your point about high-latency "fleet response" by Waymo only proves my point: it is now technically feasible to remove the immediate-response driver and have the car managed only occasionally by high-latency remote guidance.
jxmorris12
Yeah, this is exactly my point. The miles-driven-per-intervention (or whatever you want to call it) has gone way up, but interventions still happen all the time. I don't think anyone expects the number of interventions to drop to zero any time soon, and this certainly doesn't seem to be a barrier to Waymo's expansion.
dimatura
I don't think whether LLMs use only the last token, or all past tokens, affects LeCun's argument. LLMs already used large context windows when LeCun made this argument. On the other hand, allowing backtracking does, and that is not something the standard LLM did back when LeCun made his argument.
yathaid
>> But the limiting behavior remains the same: eventually, if we continue generating from a language model, the probability that we get the answer we want still goes to zero
In the previous paragraph, the author makes the case for why LeCun was wrong, using the example of reasoning models. Yet in the next paragraph, this assertion is made, which is just a paraphrase of LeCun's original assertion -- the one the author himself says is wrong.
>> Instead of waiting for FAA (fully-autonomous agents) we should understand that this is a continuum, and we’re consistently increasing the amount of useful work AIs
Yes! But this work is already well underway. There is no magic threshold for AGI - instead the characterization is based on what percentile of the human population the AI can beat. One way to characterize AGI in this manner is "99.99th percentile at every (digital?) activity".
jxmorris12
> In the previous paragraph, the author makes the case for why LeCun was wrong, using the example of reasoning models. Yet in the next paragraph, this assertion is made, which is just a paraphrase of LeCun's original assertion -- the one the author himself says is wrong.
This is a subtle point that may not have come across clearly enough in my original writing. A lot of folks were saying that the DeepSeek finding that longer chains of thought can produce higher-quality outputs contradicts Yann's thesis overall. But I don't think so.
It's true that models like R1 can correct small mistakes. But in the limit of tokens generated, the chance that they generate the correct answer still decays to zero.
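A toy way to see that (the rates are made up; it's just the shape of the argument):

```python
# Toy version of the claim above: self-correction can shrink the effective
# per-step failure rate, but as long as it stays above zero the probability
# of a fully correct generation still decays geometrically with length.
e = 0.01   # assumed raw per-step error rate (illustrative)
c = 0.95   # assumed fraction of errors the model later catches and fixes
residual = e * (1 - c)  # unrecoverable error rate per step
for n in (1_000, 10_000, 100_000):
    print(n, (1 - residual) ** n)
# Any residual > 0 drives this toward zero as n grows; correction moves the
# curve, it doesn't change the limit.
```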
partypete
I think this is an excellent way to think about LLMs and any other software-augmented task. Appreciate you putting the time into the article. I do think the points you support with the graph of training steps vs. response length could be strengthened by also including a graph of response length vs. loss, or response length vs. task performance, etc. Though the number of steps correlates with model performance, this relationship weakens as the number of steps goes to infinity.
There was a paper not too long ago showing that reasoning models will increase their response length more or less indefinitely while working toward a solution, but the returns from doing so asymptote toward zero. My apologies for not having the link.
yathaid
Thanks for replying, hope it wasn't too critical.
>> But in the limit of tokens generated, the chance that they generate the correct answer still decays to zero.
I don't understand this assertion though.
Lecun's thesis was errors just accumulate.
Reasoning models accumulate errors, backtrack, and are able to reduce them back down.
Hence the hypothesis of errors accumulating (at least asymptotically) is false.
What is the difference between "Probability of correct answer decaying to zero" and "Errors keep accumulating" ?
whatshisface
A human being is generally intelligent, and within a given role has the same "management asymptote": a limit of job capability beyond which the organization surrounding them can no longer make use of it. This isn't a flaw in the intelligence; it is a restraint imposed by expecting it, or them, to act without agency or the opportunity to choose between benevolence and self-benefit.
danans
> Instead of waiting for FAA (fully-autonomous agents) we should understand that this is a continuum, and we’re consistently increasing the amount of useful work AIs can do without human intervention. Even if we never push this number to infinity, each increase represents a meaningful improvement in the amount of economic value that language models provide. It might not be AGI, but I’m happy with that.
That's all good, but the question remains: to whom will that economic value be delivered when the primary technology we have for distributing economic value, human employment, is in lower supply once the "good enough" AIs multiply the productivity of the humans who still have jobs?
If there is no plan for that, we have bigger problems ahead.
aggrrrh
This is a really important question that still has no answer. No one wins in the late stage of capitalism
betula_ai
Thank you for this informative and thoughtful post. An interesting twist on the error accumulation that grows as autoregressive models generate more output is the recent success of language diffusion models, which predict multiple tokens simultaneously. They use a remasking strategy at every step of the reverse process that masks low-confidence tokens. Regardless, your observations perhaps still apply. https://arxiv.org/pdf/2502.09992
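A toy sketch of the remasking loop, as I understand it (this is my rough reading, not the paper's actual model; the "denoiser" is a random stub):

```python
import random

# Toy sketch of confidence-based remasking: fill every masked position, keep
# only the most confident guesses, re-mask the rest, and repeat.
VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
MASK = "<mask>"

def stub_denoiser(position, sequence):
    # Stand-in for the real model: returns (predicted token, confidence score).
    return random.choice(VOCAB), random.random()

def remasking_step(sequence, keep_fraction=0.5):
    masked = [i for i, tok in enumerate(sequence) if tok == MASK]
    if not masked:
        return sequence
    guesses = {i: stub_denoiser(i, sequence) for i in masked}
    # Keep the highest-confidence predictions; low-confidence ones stay masked.
    ranked = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
    keep = ranked[: max(1, int(len(masked) * keep_fraction))]
    new_seq = list(sequence)
    for i in keep:
        new_seq[i] = guesses[i][0]
    return new_seq

seq = [MASK] * 6
step = 0
while MASK in seq:
    step += 1
    seq = remasking_step(seq)
    print(step, " ".join(seq))
```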
jxmorris12
Thanks for bringing this up! As far as I understand it current text diffusion models are limited to fairly short context windows. The idea of a text diffusion model continuously updating and revising a million-token-long chain-of-thought is pretty mind-boggling. I agree that these non-autoregressive models could potentially behave in completely different ways.
That said, I'm pretty sure we're a long way from building equally-competent diffusion-based base models, let alone reasoning models.
If anyone's interested in this topic, here are some more foundational papers to take a look at:
- Simple and Effective Masked Diffusion Language Models [2024] (https://arxiv.org/abs/2406.07524)
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [2023] (https://arxiv.org/abs/2310.16834)
- Diffusion-LM Improves Controllable Text Generation [2022] (https://arxiv.org/abs/2205.14217)
guitarlimeo
> per-token error will compound to inevitable failure.
Is this why all tracks made with Udio or Suno have this weird noise creeping in the longer the song goes on? You can check by comparing the start and the end of a song: even if it is the exact same beat and instruments, you can hear a difference in the amount of noise (and the noise profile, imo, is unique to AI models).
jxmorris12
This is an interesting example, I'd never heard of it before. I don't really use Udio or Suno yet. The weird noise you mention probably stems from the same issue, known in the research world as exposure bias: we train these models on real data but use them on their own outputs, so after we generate for a while the models' outputs start to diverge from what real data looks like.
kelseyfrog
Accelerando[1] best captured what will happen. Looking back, we'll be able to identify the seeds of what becomes AGI, but we cannot know in the present what that is. Only by looking back with the benefit of hindsight can we draw a line through the progression of capability. Consequently, discussion about whether or not a particular set of present or future skills constitutes AGI is a completely pointless endeavor, tantamount to intellectual masturbation.
1. 2005 science fiction novel by Charles Stross
trash_cat
> We should be thinking about language models the same way we think about cars: How long can a language model operate without needing human intervention to correct errors?
I agree with this premise. The second dimension is how much effort you have to put into each intervention. The input effort needed at each intervention can vary widely, and that has to be accounted for.
null
Satya Nadella on AGI:
> Before I get to what Microsoft's revenue will look like, there's only one governor in all of this. This is where we get a little bit ahead of ourselves with all this AGI hype. Remember the developed world, which is what? 2% growth and if you adjust for inflation it’s zero?
> So in 2025, as we sit here, I'm not an economist, at least I look at it and say we have a real growth challenge. So, the first thing that we all have to do is, when we say this is like the Industrial Revolution, let's have that Industrial Revolution type of growth.
> That means to me, 10%, 7%, developed world, inflation-adjusted, growing at 5%. That's the real marker. It can't just be supply-side.
> In fact that’s the thing, a lot of people are writing about it, and I'm glad they are, which is the big winners here are not going to be tech companies. The winners are going to be the broader industry that uses this commodity that, by the way, is abundant. Suddenly productivity goes up and the economy is growing at a faster rate. When that happens, we'll be fine as an industry.
> But that's to me the moment... us self-claiming some AGI milestone, that's just nonsensical benchmark hacking to me. The real benchmark is: the world growing at 10%.
https://www.dwarkeshpatel.com/p/satya-nadella