
Ask HN: Any insider takes on Yann LeCun's push against current architectures?

140 comments

·March 10, 2025

So, LeCun has been quite public in saying that he believes LLMs will never fix hallucinations because, essentially, the token choice method at each step leads to runaway errors -- these can't be damped mathematically.

Instead, he offers the idea that we should have an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.

Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering being done around it. I can't find much after the release of I-JEPA from his group.

bravura

Okay I think I qualify. I'll bite.

LeCun's argument is this:

1) You can't learn an accurate world model just from text.

2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.

He and people like Hinton and Bengio have been saying for a while that there are tasks mice can understand that an AI can't, that even achieving mouse-level intelligence would be a breakthrough, and that we cannot achieve it through language learning alone.

A simple example from "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/abs/1906.01327) is this: Learning the size of objects using pure text analysis requires significant gymnastics, while vision demonstrates physical size more easily. To determine the size of a lion you'll need to read thousands of sentences about lions, or you could look at two or three pictures.

LeCun isn't saying that LLMs aren't useful. He's just concerned with bigger problems, like AGI, which he believes cannot be solved purely through linguistic analysis.

The energy minimization architecture is more about joint multimodal learning.

(Energy minimization is a very old idea. LeCun has been on about it for a while and it's less controversial these days. Back when everyone tried to have a probabilistic interpretation of neural models, it was expensive to compute the normalization term / partition function. Energy minimization basically said: Set up a sensible loss and minimize it.)
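To make that last point concrete, here's a toy Python sketch (the bilinear energy function, the candidates, and the weights are all made up, not anything LeCun actually uses): inference is just picking the lowest-energy candidate, and no partition function is ever computed.

    import numpy as np

    rng = np.random.default_rng(0)

    def energy(x, y, W):
        # Lower energy = x and y are more compatible. The bilinear form is a
        # stand-in for a real scoring network; W is its learned parameter.
        return -x @ W @ y

    x = rng.normal(size=4)                   # "input" / context
    candidates = rng.normal(size=(5, 4))     # possible "responses"
    W = rng.normal(size=(4, 4))

    # Inference = energy minimization over candidates, no normalization term.
    scores = [energy(x, y, W) for y in candidates]
    print("chosen candidate:", int(np.argmin(scores)))

Training then just has to push the energy of observed pairs down and the energy of contrasted pairs up (e.g. with a margin or contrastive loss), which is the "set up a sensible loss and minimize it" part.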

codenlearn

Doesn't language itself encode multimodal experiences? Take this case: when we write text, we have the skill and the opportunity to encode visual, tactile, and other sensory experiences into words. And the fact is that LLMs trained on massive text corpora are indirectly learning from human multimodal experiences translated into language. This might be less direct than firsthand sensory experience, but potentially more efficient by leveraging human-curated information. Text can describe simulations of physical environments. Models might learn physical dynamics through textual descriptions of physics, video game logs, scientific papers, etc. A sufficiently comprehensive text corpus might contain enough information to develop reasonable physical intuition without direct sensory experience.

As I'm typing this, one thing becomes clear to me: the quality and completeness of the data fundamentally determine how well an AI system will work, and with text alone that is hard to achieve, so a multimodal experience is a must.

Thank you for explaining it in terms simple enough for me to understand.

cm2012

This seems strongly backed up by Claude Plays Pokemon

fewhil

Isn't Claude Plays Pokemon using image input in addition to text? Not that it's perfect at it (some of its most glaring mistakes are when it just doesn't seem to understand what's on the screen correctly).

jawiggins

I'm not an ML researcher, but I do work in the field.

My mental model of AI advancements is that of a step function with S-curves in each step [1]. Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, demonstrating quick advancements. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found. Examples of steps include AlexNet demonstrating superior image labeling, LeCun demonstrating deep learning, and now OpenAI demonstrating large transformer models.

I think in the past, at each stage, people tended to think that the recent progress was a linear or exponential process that would continue forward. This led to people thinking self-driving cars were right around the corner after the introduction of DL in the 2010s, and that super-intelligence is right around the corner now. I think at each stage, the cusp of the S-curve comes as we find where the model is good enough to be deployed, and where it isn't. Then companies tend to enter a holding pattern for a number of years getting diminishing returns from small improvements on their models, until the next algorithmic breakthrough is made.

Right now I would guess that we are around 0.9 on the S-curve: we can still improve LLMs (as DeepSeek has shown with wide MoE and o1/o3 have shown with CoT), and it will take a few years for the best uses to be brought to market and popularized. As you mentioned, LeCun points out that LLMs have a hallucination problem built into their architecture; others have pointed out that LLMs have produced shockingly few revelations and breakthroughs for something that has ingested more knowledge than any living human. I think future work on LLMs is likely to make some improvement on these things, but not much.

I don't know what it will be, but a new algorithm will be needed to induce the next step on the curve of AI advancement.

[1]: https://www.open.edu/openlearn/nature-environment/organisati...

Matthyze

> Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, demonstrating quick advancements. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found.

That seems to be how science works as a whole. Long periods of little progress between productive paradigm shifts.

semi-extrinsic

Punctuated equilibrium theory.

tyronehed

This is actually a lazy approach as you describe it. Instead, what is needed is an elegant and simple approach that is 99% of the way there out of the gate. As soon as you start doing statistical tweaking and overfitting models, you are no longer approaching a solution.

ActorNightly

Not an official ML researcher, but I do happen to understand this stuff.

The problem with LLMs is that the output is inherently stochastic - i.e. there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant lookup maps with interpolation.

Energy minimization is more of an abstract approach where you can use architectures that don't rely on things like differentiability. True AI won't be solely feedforward architectures like current LLMs. To give an answer, it will basically determine an algorithm on the fly that includes computation and search. To learn that algorithm (or its parameters) at training time, you need something that doesn't rely on continuous values but still converges to the right answer. So instead you assign a fitness score, like memory use or compute cycles, and optimize based on that. This is basically how search works with genetic algorithms or PSO.
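As a toy illustration of optimizing against a non-differentiable fitness score (the target, the population size, and the "compute cost" penalty are all invented), here is a bare-bones genetic-algorithm loop:

    import random

    random.seed(0)
    TARGET = [1, 0, 1, 1, 0, 0, 1, 0]     # stand-in for "the right algorithm"

    def fitness(candidate):
        # Non-differentiable score: correctness minus a made-up "compute cost" penalty.
        correct = sum(c == t for c, t in zip(candidate, TARGET))
        return correct - 0.1 * sum(candidate)

    def mutate(candidate, rate=0.1):
        # Flip each bit with a small probability.
        return [bit ^ (random.random() < rate) for bit in candidate]

    population = [[random.randint(0, 1) for _ in TARGET] for _ in range(20)]
    for _ in range(50):
        population.sort(key=fitness, reverse=True)   # selection: keep the fittest
        parents = population[:5]
        population = [mutate(random.choice(parents)) for _ in range(20)]

    best = max(population, key=fitness)
    print(best, round(fitness(best), 2))

No gradients anywhere; the only signal is the fitness score, which is the kind of setup you need once the thing being learned isn't a smooth function of its parameters.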

seanhunter

> The problem with LLMs is that the output is inherently stochastic - i.e. there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant lookup maps with interpolation.

I don't think this explanation is correct. The output of the model, after all the attention heads etc. (as I understand it), is a probability distribution over tokens. So the model as a whole does have an ability to score low confidence in something by assigning it a low probability.

The problem is that thing is a token (part of a word). So the LLM can say "I don't have enough information" to decide on the next part of a word but has no ability to say "I don't know what on earth I'm talking about" (in general - not associated with a particular token).
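To make that concrete, a tiny sketch with an invented six-token "vocabulary" and made-up logits: the softmax gives a confidence for each possible next token, but nothing in it scores whether the overall claim being generated is true.

    import numpy as np

    vocab = ["Paris", "London", "Rome", "I", "don't", "know"]
    logits = np.array([2.1, 1.9, 1.8, 0.3, 0.1, 0.1])   # made-up final-layer scores

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    for token, p in zip(vocab, probs):
        print(f"{token:>6}: {p:.2f}")
    # A low max probability here means "unsure which token comes next",
    # not "unsure whether the sentence being generated is factually correct".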

Lerc

I feel like we're stacking naive misinterpretations of how LLMs function on top of one another here. Grasping gradient descent and autoregressive generation can give you a false sense of confidence. It is like knowing how transistors make up logic gates and believing you know more about CPU design than you actually do.

Rather than inferring from how you imagine the architecture working, you can look at examples and counterexamples to see what capabilities they have.

One misconception is that predicting the next word means there is no internal idea on the word after next. The simple disproof of this is that models put 'an' instead of 'a' ahead of words beginning with vowels. It would be quite easy to detect (and exploit) behaviour that decided to use a vowel word just because it somewhat arbitrarily used an 'an'.

Models predict the next word, but they don't just predict the next word. They generate a great deal of internal information in service of that goal. Placing limits on their abilities by assuming the output they express is the sum total of what they have done is a mistake. The output probability is not what it thinks, it is a reduction of what it thinks.

One of Andrej Karpathy's recent videos talked about how researchers showed that models do have an internal sense of not knowing the answer, but fine-tuning on question answering did not give them the ability to express that knowledge. Finding information the model did and didn't know, then fine-tuning it to say "I don't know" for the cases where it had no information, allowed the model to generalise and express "I don't know".

metaxz

Thanks for writing this so clearly... I hear wrong/misguided arguments like the ones we see here every day from friends, colleagues, "experts" in the media, etc.

It's strange, because just a moment of thinking will show that such ideas are wrong or paint a clearly incomplete picture. And there are plenty of analogies to the dangers of such reductionism. It should be obviously wrong to anyone who has at least tried ChatGPT.

My only explanation is that a denial mechanism must be at play. It simply feels more comfortable to diminish LLM capabilities and/or to feel that you understand them from reading a Medium article on transformer networks, than to consider the consequences of their inner black-box nature.


littlestymaar

Not an ML researcher or anything (I'm basically only a few Karpathy videos into ML, so please someone correct me if I'm misunderstanding this), but it seems that you're getting this backwards:

> One misconception is that predicting the next word means there is no internal idea on the word after next. The simple disproof of this is that models put 'an' instead of 'a' ahead of words beginning with vowels.

My understanding is that the model doesn't really put “an” ahead of a word that starts with a vowel; the model (or more accurately, the sampler) picks “an”, and then the model will never predict a word that starts with a consonant after that. It's not that it “knows” in advance that it wants to put a word with a vowel and then anticipates that it needs to put “an”; it generates a probability for both tokens “a” and “an”, picks one, and then when it generates the following token it necessarily takes its previous choice into account and never puts a word starting with a vowel after it has already chosen “a”.
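A toy sketch of that mechanical point (the conditional probabilities are invented): each next-token distribution is computed conditioned on whatever was already sampled, so "an" followed by a vowel-initial word falls out of conditioning. Whether the probability assigned to "an" itself already reflects an internal plan for the following word is exactly the part under debate.

    import random

    random.seed(1)

    def next_token_probs(context):
        # Stand-in for the model: P(next token | context), numbers invented.
        if context == ["I", "ate"]:
            return {"a": 0.4, "an": 0.6}
        if context[-1] == "an":
            return {"apple": 0.7, "orange": 0.3}     # vowel-initial words only
        if context[-1] == "a":
            return {"banana": 0.6, "pear": 0.4}      # consonant-initial words only
        return {".": 1.0}

    context = ["I", "ate"]
    for _ in range(2):
        probs = next_token_probs(context)
        token = random.choices(list(probs), weights=probs.values())[0]
        context.append(token)            # the choice becomes part of the context

    print(" ".join(context))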

estebarb

The problem is exactly that: the probability distribution. The network has no way to say: 0% for everyone, this is nonsense, backtrack everything.

Other architectures, like energy-based models or Bayesian ones, can assess uncertainty. Transformers simply cannot do it (yet). Yes, there are ways to do it, but we are already spending millions to get coherent phrases; few will burn billions to train a model that can do that kind of assessment.

ortsa

Has anybody ever messed with adding a "backspace" token?

skybrian

I think some “reasoning” models do backtracking by inserting “But wait” at the start of a new paragraph? There’s more to it, but that seems like a pretty good trick.

duskwuff

Right. And, as a result, low token-level confidence can end up indicating "there are other ways this could have been worded" or "there are other topics which could have been mentioned here" just as often as it does "this output is factually incorrect". Possibly even more often, in fact.

vessenes

My first reaction is that a model can’t, but a sampling architecture probably could. I’m trying to understand whether the whole architecture we use for most inference now is responsive to the critique or not.

derefr

You get scores for the outputs of the last layer; so in theory, you could notice when those scores form a particularly flat distribution, and fault.
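For that last-layer case, the observer could be as simple as an entropy check (the threshold and logits are invented for illustration):

    import numpy as np

    def entropy(probs):
        return -np.sum(probs * np.log(probs + 1e-12))

    def should_fault(logits, threshold_fraction=0.95):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        max_entropy = np.log(len(logits))        # entropy of a uniform distribution
        # "Fault" when the distribution is nearly flat, i.e. close to max entropy.
        return entropy(probs) > threshold_fraction * max_entropy

    print(should_fault(np.array([5.0, 0.1, 0.2, 0.1])))   # peaked -> False
    print(should_fault(np.array([1.0, 1.0, 1.1, 0.9])))   # nearly flat -> True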

What you can't currently get, from a (linear) Transformer, is a way to induce a similar observable "fault" in any of the hidden layers. Each hidden layer only speaks the "language" of the next layer after it, so there's no clear way to program an inference-framework-level observer side-channel that can examine the output vector of each layer and say "yup, it has no confidence in any of what it's doing at this point; everything done by layers feeding from this one will just be pareidolia — promoting meaningless deviations from the random-noise output of this layer into increasing significance."

You could in theory build a model as a Transformer-like model in a sort of pine-cone shape, where each layer feeds its output both to the next layer (where the final layer's output is measured and backpropped during training) and to an "introspection layer" that emits a single confidence score (a 1-vector). You start with a pre-trained linear Transformer base model, with fresh random-weighted introspection layers attached. Then you do supervised training of (prompt, response, confidence) triples, where on each training step, the minimum confidence score of all introspection layers becomes the controlled variable tested against the training data. (So you aren't trying to enforce that any particular layer notice when it's not confident, thus coercing the model to "do that check" at that layer; you just enforce that a "vote of no confidence" comes either from somewhere within the model, or nowhere within the model, at each pass.)

This seems like a hack designed just to compensate for this one inadequacy, though; it doesn't seem like it would generalize to helping with anything else. Some other architecture might be able to provide a fully-general solution to enforcing these kinds of global constraints.

(Also, it's not clear at all, for such training, "when" during the generation of a response sequence you should expect to see the vote-of-no-confidence crop up — and whether it would be tenable to force the model to "notice" its non-confidence earlier in a response-sequence-generating loop rather than later. I would guess that a model trained in this way would either explicitly evaluate its own confidence with some self-talk before proceeding [if its base model were trained as a thinking model]; or it would encode hidden thinking state to itself in the form of word-choices et al, gradually resolving its confidence as it goes. In neither case do you really want to "rush" that deliberation process; it'd probably just corrupt it.)

thijson

I watched an Andrej Karpathy video recently. He said that hallucination was because in the training data there were no examples where the answer is, "I don't know". Maybe I'm misinterpreting what he was saying though.

https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4832s

spmurrayzzz

> i.e. there isn't an "I don't have enough information" option.

This is true in terms of default mode for LLMs, but there's a fair amount of research dedicated to the idea of training models to signal when they need grounding.

SelfRAG is an interesting, early example of this [1]. The basic idea is that the model is trained to first decide whether retrieval/grounding is necessary and then, if so, after retrieval it outputs certain "reflection" tokens to decide whether a passage is relevant to answer a user query, whether the passage is supported (or requires further grounding), and whether the passage is useful. A score is calculated from the reflection tokens.

The model then critiques itself further by generating a tree of candidate responses, and scoring them using a weighted sum of the score and the log probabilities of the generated candidate tokens.
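Roughly, that candidate scoring might look like this (the weights, reflection scores, and log-probs are placeholders, not SelfRAG's actual values or formula):

    # Each candidate: generated text, a reflection/critique score in [0, 1]
    # (e.g. "is this supported by the retrieved passage?"), and token log-probs.
    candidates = [
        {"text": "Lions stand about 1.2 m tall at the shoulder.", "reflection": 0.9,
         "logprobs": [-0.2, -0.3, -0.1, -0.4]},
        {"text": "Lions are roughly the size of a house cat.", "reflection": 0.2,
         "logprobs": [-0.1, -0.2, -0.2, -0.3]},
    ]

    W_REFLECTION, W_LM = 1.0, 0.5     # placeholder weights

    def score(c):
        mean_logprob = sum(c["logprobs"]) / len(c["logprobs"])
        # Weighted sum of the critique score and the generation log-probability.
        return W_REFLECTION * c["reflection"] + W_LM * mean_logprob

    best = max(candidates, key=score)
    print(best["text"], round(score(best), 3))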

We can probably quibble about the loaded terms used here like "self-reflection", but the idea that models can be trained to know when they don't have enough information isn't pure fantasy today.

[1] https://arxiv.org/abs/2310.11511

EDIT: I should also note that I generally do side with Lecun's stance on this, but not due to the "not enough information" canard. I think models learning from abstraction (i.e. JEPA, energy-based models) rather than memorization is the better path forward.

unsupp0rted

> The problem with LLMs is that the output is inherently stochastic

Isn't that true with humans too?

There's some leap humans make, even as stochastic parrots, that lets us generate new knowledge.

josh-sematic

I don’t buy LeCun’s argument. Once you get good RL going (as we are now seeing with reasoning models), you can give the model a reward function that rewards a correct answer most highly, rewards an “I’m sorry but I don’t know” less highly than that, penalizes a wrong answer, and penalizes a confidently wrong answer even more severely. As the RL learns to maximize rewards, I would think it would find the strategy of saying it doesn’t know in cases where it can’t find an answer it deems to have a high probability of correctness.
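For illustration, the kind of reward shaping being described (the numbers are arbitrary):

    def reward(answer_correct, said_dont_know, stated_confidence):
        if said_dont_know:
            return 0.2                                  # better than guessing wrong
        if answer_correct:
            return 1.0
        # Wrong answers are penalized; confidently wrong ones more severely.
        return -1.0 - 2.0 * stated_confidence

    print(reward(True,  False, 0.9))   #  1.0  correct
    print(reward(False, True,  0.0))   #  0.2  honest "I don't know"
    print(reward(False, False, 0.1))   # -1.2  hedged but wrong
    print(reward(False, False, 0.9))   # -2.8  confidently wrong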

Tryk

How do you define the "correct" answer?

jpadkins

obviously the truth is what is the most popular. /s

throw310822

> there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant lookup maps with interpolation.

Have you ever tried telling ChatGPT that you're "in the city centre" and asking it whether you need to turn left or right to reach some landmark? It will not answer with the average of the directions given to everybody who asked the question before; it will instead ask you to tell it precisely where you are and which way you are facing.

TZubiri

If multiple answers are equally likely, couldn't that be considered uncertainty? Conversely if there's only one answer and there's a huge leap to the second best, that's pretty certain.

bashfulpup

He's right but at the same time wrong. Current AI methods are essentially scaled-up versions of methods we learned decades ago.

These long-horizon (AGI) problems have been there since the very beginning. We have never had a solution to them. RL assumes we know the future, which is a poor proxy. These energy-based methods fundamentally do very little that an RNN didn't do long ago.

I worked on higher-dimensionality methods, which is a very different angle. My take is that it's about the way we scale dependencies between connections. The human brain makes and breaks a massive number of neuron connections daily. Scaling the dimensionality would imply that a single connection could be scaled to encompass significantly more "thoughts" over time.

Additionally, the true solution to these problems is as likely to be found by a kid with a laptop as by a top researcher. If you find the solution to continual learning (CL) on a small AI model (MNIST), you solve it at all scales.

haolez

Not exactly related, but I wonder sometimes if the fact that the weights in current models are very expensive to change is a feature and not a "bug".

Somehow, it feels harder to trust a model that could evolve over time. Its performance might even degrade. That's a steep price to pay for having memory built in and a (possibly) self-evolving model.

bashfulpup

We degrade, and I think we are far more valuable than one model.

nradov

For a kid with a laptop to solve it would require the problem to be solvable with current standard hardware. There's no evidence for that. We might need a completely different hardware paradigm.

bashfulpup

Also possible and a fair point. My point is that it's a "tiny" solution that we can scale.

I could revise that by saying a kid with a whiteboard.

It's an Einstein×10 moment, so who knows when that'll happen.

__rito__

Slightly related: Energy-Based Models (EBMs) are better in theory and yet too resource-intensive. I tried to sell my org on using EBMs, but the price for even a small use case was prohibitive.

I learned it from: https://youtube.com/playlist?list=PLLHTzKZzVU9eaEyErdV26ikyo...

Yann LeCun, and Michael Bronstein and his colleagues, have some similarities in trying to properly "sciencify" deep learning.

Yann LeCun's approach, at least for vision, has one core tenet: energy minimization, just like in physics. In his course, he also shows some current architectures/algorithms to be special cases of EBMs.

Yann believes that understanding the whys of the behavior of DL algorithms is going to be more beneficial in the long term than playing around with hyperparameters.

There is also a case that language is too low-dimensional to lead to AGI even if it is solved. For example, in a recent video he said that the total amount of data in all digitized books and the internet is about the same as what a human child takes in during the first 4-5 years of life. He considers this low.

There are also epistemological arguments that language alone cannot lead to AGI, but I haven't heard him talk about them.

He also believes that vision is a more important aspect of intelligence, one reason being that it is very high-dimensional. (Edit) Consider an example. Take 4 monochrome pixels. Each pixel can range from 0 to 255, so 4 pixels can create 256^4 = 2^32 combinations. 4 words can create 4! = 24 combinations. Solving language is easier and therefore lower-stakes. Remember the monkey producing a Shakespeare play by randomly punching typewriter keys? If that number was astronomically big, think how obscenely long it would take a monkey to paint the Mona Lisa by randomly assigning pixel values. Left as an exercise to the reader.

Juergen Schmidhuber has gone a lot quieter now. But he has also said that a world model explicitly included in training and reasoning is better, rather than only text or image or whatever. He has a good paper with Lucas Beyer.

Horffupolde

WTF. The cardinality of words is 100,000.

albertzeyer

Jürgen Schmidhuber has a paper with Lucas Beyer? I'm not aware of it. Which do you mean?

vessenes

Thanks. This is interesting. What kind of equation is used to assess an ebm during training? I’m afraid I still don’t get the core concept well enough to have an intuition for it.

tyronehed

Since this exposes the answer, the new architecture has to be based on world model building.

uoaei

The thing is, this has been known since even before the current crop of LLMs. Anyone who considered (only the English) language to be sufficient to model the world understands so little about cognition as to be irrelevant in this conversation.

hnfong

I'm not an insider and I'm not sure whether this is directly related to "energy minimization", but "diffusion language models" have apparently gained some popularity in recent weeks.

https://arxiv.org/abs/2502.09992

https://www.inceptionlabs.ai/news

(these are results from two different teams/orgs)

It sounds kind of like what you're describing, and nobody else has mentioned it yet, so take a look and see whether it's relevant.

hnuser123456

And they seem to be about 10x as fast as similar sized transformers.

317070

No, 10x fewer sampling steps. Whether or not that means 10x faster remains to be seen, as a diffusion step tends to be more expensive than an autoregressive step.

littlestymaar

If I understood correctly, in practice they show an actual speed improvement on high-end cards, because autoregressive LLMs are bandwidth-limited, not compute-bound, so switching to a more compute-heavy but less memory-bandwidth-heavy approach is going to work well on current hardware.
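A back-of-the-envelope sketch of that reasoning, with invented hardware and model numbers and a roofline-style cost of max(memory time, compute time) per step:

    def step_time(bytes_moved, flops, bandwidth=3e12, peak_flops=1e15):
        # Roofline-style estimate: a step costs whichever bound dominates.
        return max(bytes_moved / bandwidth, flops / peak_flops)

    # Invented numbers: ~20 GB of weights streamed per step for a 10B fp16 model.
    autoregressive = 100 * step_time(bytes_moved=2e10, flops=2e10)   # many cheap, memory-bound steps
    diffusion      = 10  * step_time(bytes_moved=2e10, flops=2e12)   # fewer, more compute-heavy steps

    print(f"autoregressive: {autoregressive*1e3:.1f} ms, diffusion: {diffusion*1e3:.1f} ms")

With these made-up numbers both kinds of step are limited by memory traffic, so doing 10x fewer steps wins roughly 10x even though each step does 100x the arithmetic.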

eximius

I believe that so long as weights are fixed at inference time, we'll be at a dead end.

Will Titans be sufficiently "neuroplastic" to escape that? Maybe, I'm not sure.

Ultimately, I think an architecture built around "looping" will be required, where the model outputs are both some form of "self-update" and "optional action", such that interacting with the model is more like "sampling from a thought space".

mft_

Very much this. I’ve been wondering why I’ve not seen it much discussed.

jononor

There are many roadblocks to continual learning still. Most current models and training paradigms are very vulnerable to catastrophic forgetting. And are very sample inefficient. And we/the methods are not so good at separating what is "interesting" (should be learned) vs "not". But this is being researched, for example under the topic of open ended learning, active inference, etc.

TrainedMonkey

This is a somewhat nihilistic take with an optimistic ending. I believe humans will never fix hallucinations. The amount of totally or partially untrue statements people make is significant. Especially in tech, it's rare for people to admit that they do not know something. And yet, despite all of that, progress keeps marching forward and maybe even accelerating.

danielmarkbruce

Once one starts thinking of them as "concept models" rather than language models or fact models, "hallucinations" become something not to be so fixated on. We transform tokens into 12k+ length embeddings... right at the start. They stop being language immediately.

They aren't fact machines. They are concept machines.

ketzo

Yeah, I think a lot of people talk about "fixing hallucinations" as the end goal, rather than "LLMs providing value", which misses the forest for the trees; it's obviously already true that we don't need totally hallucination-free output to get value from these models.

dtnewman

I’m not sure I follow. Sure, people lie and make stuff up all the time. If an LLM goes and parrots that, then I would argue that it isn’t hallucinating. Hallucinating would be when it makes something up that is neither in its training set nor logically deducible from it.

esafak

I think most humans are perfectly capable of admitting to themselves when they do not know something. Computers ought to do better.

danielmarkbruce

You must interact with a very different set of humans than most.

inimino

I have a paper coming up that I modestly hope will clarify some of this.

The short answer should be that it's obvious LLM training and inference are both ridiculously inefficient and biologically implausible, and therefore there has to be some big optimization wins still on the table.

snowwrestler

I think the hard question is whether those wins can be realized with less effort than what we’re already doing, though.

What I mean is this: A brain today is obviously far more efficient at intelligence than our current approaches to AI. But a brain is a highly specialized chemical computer that evolved over hundreds of millions of years. That leaves a lot of room for inefficient and implausible strategies to play out! As long as wins are preserved, efficiency can improve this way anyway.

So the question is really, can we short cut that somehow?

It does seem like doing so would require a different approach. But so far all our other approaches to creating intelligence have been beaten by the big simple inefficient one. So it’s hard to see a path from here that doesn’t go that route.

sockaddr

Also, a brain evolved to be a stable compute platform in a body that finds itself in many different temperature and energy regimes. And the brain can withstand and recover from some pretty severe damage. So I'd suspect an intelligence that is designed to run in a tighter temp/power envelope with no need for recovery or redundancy could be significantly more efficient than our brain.

choilive

Most brain damage would not be considered in the realm of what most people would consider "recoverable".

fallingknife

The brain only operates in a very narrow temperature range too. 5 degrees C in either direction from 37 and you're in deep trouble.


Etheryte

How does this idea compare to the rationale presented by Rich Sutton in The Bitter Lesson [0]? Shortly put, why do you think biological plausibility has significance?

[0] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

rsfern

I’m not GP, but I don’t think their position is necessarily in tension with leveraging computation. Not all FLOPs are equal, and furthermore FLOPs != Watts. In fact a much more efficient architecture might be that much more effective at leveraging computation than just burning a bigger pile of GPUs with the current transformer stack

zamubafoo

Honest question: given that the only wide consensus on anything approaching general intelligence is humans, and that humans are biological systems that have evolved in physical reality, are there any arguments that better efficiency is even possible without leveraging the nature of physical reality?

For example, analog computers can differentiate near instantly by leveraging the nature of electromagnetism and you can do very basic analogs of complex equations by just connecting containers of water together in certain (very specific) configurations. Are we sure that these optimizations to get us to AGI are possible without abusing the physical nature of the world? This is without even touching the hot mess that is quantum mechanics and its role in chemistry which in turn affects biology. I wouldn't put it past evolution to have stumbled upon some quantum mechanic that allowed for the emergence of general intelligence.

I'm super interested in anything discussing this but have very limited exposure to the literature in this space.

HDThoreaun

The advantage of artificial intelligence doesn't even need to be energy efficiency. We are pretty good at generating energy; if we had human-level AI, even if it used an order of magnitude more energy than a human does, it would likely still be cheaper than a human.

vessenes

I’m looking forward to it! Inefficiency (if we mean energy efficiency) conceptually doesn’t bother me very much, in that it feels like silicon design still has a long way to go, but I like the idea of looking at biology for both ideas and guidance.

Inefficiency in data input is also an interesting concept. It seems to me humans take in more data than even modern frontier models, if you use the gigabit/s estimates for sensory input. Care to elaborate on your thoughts?

jedberg

> and biologically implausible

I really like this approach. Showing that we must be doing it wrong because our brains are more efficient and we aren't doing it like our brains.

Is this a common thing in ML papers or something you came up with?

fluidcruft

How are you separating the efficiency of the architecture from the efficiency of the substrate? Unless you have a brain made of transistors or an LLM made of neurons how can you identify the source of the inefficiency?

esafak

Evolution does not need to converge on the optimum solution.

Have you heard of https://en.wikipedia.org/wiki/Bio-inspired_computing ?

parsimo2010

I don't think GP was implying that brains are the optimum solution. I think you can interpret GP's comments like this: if our brains are more efficient than LLMs, then clearly LLMs aren't optimally efficient. We have at least one data point showing that better efficiency is possible, even if we don't know what the optimal approach is.

jedberg

It does not, you're right. But it's an interesting way to approach the problem nevertheless. And given that we definitely aren't as efficient as a human brain right now, it makes sense to look at the brain for inspiration.

_3u10

Nah it’s just physics, it’s like wheels being more efficient than legs.

We know there is a more efficient solution (human brain) but we don’t know how to make it.

So it stands to reason that we can make more efficient LLMs, just like a CPU can add numbers more efficiently than humans.

jonplackett

Wheels are an interesting analogy. Wheels are more efficient now that we have roads, but there could never have been evolutionary pressure to make them before there were roads. Wheels are also a lot easier to get working than robotic legs, and as long as there's a road they do a lot more than robotic legs.

killthebuddha

I've always felt like the argument is super flimsy because "of course we can _in theory_ do error correction". I've never seen even a semi-rigorous argument that error correction is _theoretically_ impossible. Do you have a link to somewhere where such an argument is made?

aithrowawaycomm

In theory transformers are Turing-complete and LLMs can do anything computable. The more down-to-earth argument is that transformer LLMs aren't able to correct errors in a systematic way like Lecun is describing: it's task-specific "whack-a-mole," involving either tailored synthetic data or expensive RLHF.

In particular, if you train an LLM to do Task A and Task B with acceptable accuracy, that does not guarantee it can combine the tasks in a common-sense way. "For each step of A, do B on the intermediate results" is a whole new Task C that likely needs to be fine-tuned. (This one actually does have some theoretical evidence coming from computational complexity, and it was the first thing I noticed in 2023 when testing chain-of-thought prompting. It's not that the LLM can't do Task C, it just takes extra training.)

vhantz

> of course we can _in theory_ do error correction

Oh yeah? This is begging the question.

schainks

This seems really intuitive to me. If I can express something concisely and succinctly because I understand it, I will literally spend less energy to explain it.

bobosha

I argue that JEPA and its Energy-Based Model (EBM) framework fail to capture the deeply intertwined nature of learning and prediction in the human brain—the “yin and yang” of intelligence. Contemporary machine learning approaches remain heavily reliant on resource-intensive, front-loaded training phases. I advocate for a paradigm shift toward seamlessly integrating training and prediction, aligning with the principles of online learning.

Disclosure: I am the author of this paper.

Reference: (PDF) Hydra: Enhancing Machine Learning with a Multi-head Predictions Architecture. Available from: https://www.researchgate.net/publication/381009719_Hydra_Enh... [accessed Mar 14, 2025].

vessenes

Update: Interesting paper, thanks. Comment on selection for Hydra — you mention v1 uses an arithmetic mean across timescales for prediction. Taking this analogy of the longer windows encapsulating different timescales, I’d propose it would be interesting to train a layer to predict weighting of the timescale predictions. Essentially — is this a moment where I need to focus on what just happened, or is this a moment in which my long range predictions are more important?
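A minimal sketch of that weighting idea (the shapes, names, and projection are made up, not the Hydra paper's code): predict a softmax weighting over the per-timescale predictions from the current context, instead of taking a plain arithmetic mean.

    import numpy as np

    rng = np.random.default_rng(0)

    timescale_preds = rng.normal(size=(3, 8))   # predictions from short/medium/long windows
    context = rng.normal(size=16)               # some summary of the recent input
    W = rng.normal(size=(16, 3))                # learned projection: context -> timescale logits

    logits = context @ W
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                    # "how much should each horizon count right now?"

    combined = weights @ timescale_preds        # weighted combination instead of a plain mean
    print(weights.round(2), combined.shape)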

esafak

So you believe humans spend more energy on prediction, relative to computers? Isn't that because personal computers are not powerful enough to train big models, and most people have no desire to? It is more economically efficient to socialize the cost of training, as is done. Are you thinking of a distributed training, where we split the work and cost? That could happen when robots become more widespread.

vessenes

Thank you. So, quick q - it would make sense to me that JEPA is an outcome of the YLC work; would you say that’s the case?