
Ask HN: Any insider takes on Yann LeCun's push against current architectures?


342 comments

March 10, 2025

So, LeCun has been quite public in saying that he believes LLMs will never fix hallucinations because, essentially, the token choice method at each step leads to runaway errors -- these can't be damped mathematically.

Instead, he offers the idea that we should have something that is an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.

Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering being done around it. I can't find much after the release of I-JEPA from his group.

bravura

Okay I think I qualify. I'll bite.

LeCun's argument is this:

1) You can't learn an accurate world model just from text.

2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.

He and people like Hinton and Bengio have been saying for a while that there are tasks that mice can understand that an AI can't, and that even achieving mouse-level intelligence would be a breakthrough, but we cannot get there through language learning alone.

A simple example from "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/abs/1906.01327) is this: Learning the size of objects using pure text analysis requires significant gymnastics, while vision demonstrates physical size more easily. To determine the size of a lion you'll need to read thousands of sentences about lions, or you could look at two or three pictures.

LeCun isn't saying that LLMs aren't useful. He's just concerned with bigger problems, like AGI, which he believes cannot be solved purely through linguistic analysis.

The energy minimization architecture is more about joint multimodal learning.

(Energy minimization is a very old idea. LeCun has been on about it for a while and it's less controversial these days. Back when everyone tried to have a probabilistic interpretation of neural models, it was expensive to compute the normalization term / partition function. Energy minimization basically said: Set up a sensible loss and minimize it.)

somenameforme

Is that what he's arguing? My perspective on what he's arguing is that LLMs effectively rely on a probabilistic approach to the next token based on the previous. When they're wrong, which the technology all but ensures will happen with some significant degree of frequency, you get cascading errors. It's like in science where we all build upon the shoulders of giants, but if it turns out that one of those shoulders was simply wrong, somehow, then everything built on top of it would be increasingly absurd. E.g. - how the assumption of a geocentric universe inevitably leads to epicycles which leads to ever more elaborate, and plainly wrong, 'outputs.'

Without any 'understanding' or knowledge of what they're saying, they will remain irreconcilably dysfunctional. Hence the typical pattern with LLMs:

---

How do I do [x]?

You do [a].

No that's wrong because reasons.

Oh I'm sorry. You're completely right. Thanks for correcting me. I'll keep that in mind. You do [b].

No that's also wrong because reasons.

Oh I'm sorry. You're completely right. Thanks for correcting me. I'll keep that in mind. You do [a].

FML

---

More advanced systems might add a c or a d, but it's just more noise before repeating the same pattern. DeepSeek's more visible (and lengthy) reasoning demonstrates this perhaps the most clearly. It just can't stop coming back to the same wrong (but statistically probable) answer, and so ping-ponging off that (which it at least acknowledges is wrong due to user input) makes up basically the entirety of its reasoning phase.

gsf_emergency_2

on "stochastic parrots"

Table stakes for sentience: knowing when the best answer is not good enough. Try prompting LLMs with that.

It's related to LeCun's (and Ravid's) subtle question I mentioned in passing below:

To Compress Or Not To Compress?

(For even a vast majority of Humans, except tacitly, that is not a question!)

tmaly

Right now, humans still have enough practice thinking to point out the errors, but what happens when humanity becomes increasingly dependent on LLMs to do this thinking?

jcims

Over the last few years I’ve become exceedingly aware of how insufficient language really is. It feels like a 2D plane, and no matter how many projections you attempt to create from it, they are ultimately limited in the fidelity of the information transfer.

Just a lay opinion here but to me each mode of input creates a new, largely orthogonal dimension for the network to grow into. The experience of your heel slipping on a cold sidewalk can be explained in a clinical fashion, but an android’s association of that to the powerful dynamic response required to even attempt to recover will give a newfound association and power to the word ‘slip’.

amw-zero

This exactly describes my intuition as well. Language is limited by its representation, and we have to jam so many bits of information into one dimension of text. It works well enough to have a functioning society, but it’s not very precise.

ninetyninenine

LLM is just the name. You can encode anything into the "language" including pictures video and sound.

pessimizer

I've always been wondering if anyone is working on using nerve impulses. My first thought when transformers came around was if they could be used for prosthetics, but I've been too lazy to do the research to find anybody working on anything like that, or to experiment myself with it.

kryogen1c

> You can encode anything into the "language"

I'm just a layman here, but I don't think this is true. Language is an abstraction, an interpretive mechanism of reality. A reproduction of reality, like a picture, by definition holds more information than its abstraction does.

numba888

Great, but how do you imagine multimodal training with, say, just text and video for simplicity? What would be in the training set? With text, the model tries to predict the next token, and then more steps were added on top of that. But what do you do with multimodal data?

codenlearn

Doesn't language itself encode multimodal experiences? Take this case: when we write text, we have the skill and opportunity to encode visual, tactile, and other sensory experiences into words. And the fact is that LLMs trained on massive text corpora are indirectly learning from human multimodal experiences translated into language. This might be less direct than firsthand sensory experience, but potentially more efficient by leveraging human-curated information. Text can describe simulations of physical environments. Models might learn physical dynamics through textual descriptions of physics, video game logs, scientific papers, etc. A sufficiently comprehensive text corpus might contain enough information to develop reasonable physical intuition without direct sensory experience.

As I'm typing this, one reality becomes clear to me: the quality and completeness of the data fundamentally determine how well an AI system will work. With just text this is hard to achieve, so a multimodal experience is a must.

Thank you for explaining in very simple terms that I could understand.

ThinkBeat

No.

> The sun feels hot on your skin.

No matter how many times you read that, you cannot understand what the experience is like.

> You can read a book about Yoga and read about the Tittibhasana pose

But by just reading you will not understand what it feels like. And unless you are in great shape and with great balance, you will fail for a while before you get it right (which is only human).

I have read what shooting up with heroin feels like, from a few different sources. I'm certain that I will have no real idea unless I try it (and I don't want to do that).

Waterboarding. I have read about it. I have seen it on TV. I am certain that it is all abstract compared to having someone do it to you.

Hand-eye coordination, balance, color, taste, pain, and so on. How we encode things draws on all our senses, state of mind, and experiences up until that time.

We also forget and change what we remember.

Many songs take me back to a certain time, a certain place, a certain feeling. Taste is the same. Location too.

The way we learn and the way we remember things is incredibly more complex than text.

But if you have shared experiences, then when you write about it, other people will know. Most people have felt the sun hot on their skin.

To different extents this is also true for animals. Now, I don't think most mice can read, but they do learn with many different senses, and remember some combination or permutation.

rcpt

I can't see as much color as a mantis shrimp or sense electric fields like a shark but I still think I'm closer to AGI than they are

MITSardine

Even beyond sensations (which are never described except circumstantially, as in "the taste of chocolate" says nothing of the taste, only of the circumstances in which the sensation is felt), it is very often the case that people don't understand something another person says (typically a work of art) until they have lived the relevant experiences to connect to the meaning behind it (whatever the medium of communication).

spyder

> No.

Huh, text definitely encodes multimodal experiences; it's just not as accurate or as rich an encoding as the encodings of real sensations.

deepGem

Doesn't this imply that the future of AGI lies not just in vision and text but in tactile feelings and actions as well ?

Essentially, engineering the complete human body and mind including the nervous system. Seems highly intractable for the next couple of decades at least.

csomar

All of these "experiences" are encoded in your brain as electricity. So "text" can encode them, though English words might not be the proper way to do it.

golergka

> No matter how many times you read that, you cannot understand what the experience is like.

OK, so you don't have qualia. But if you know all the data needed to complete any task that can be related to this knowledge, does it matter?

not2b

I'm reminded of the story of Helen Keller, and how it took a long time for her to realize that the symbols her teacher was signing into her hand had meaning, as she was blind and deaf and only experienced the world via touch and smell. She didn't get it until her teacher spelled the word "water" as water from a pump was flowing over her hand. In other words, a multimodal experience. If the model only sees text, it can appear to be brilliant but is missing a lot. If it's also fed other channels, if it can (maybe just virtually) move around, if it can interact, the way babies do, learning about gravity by dropping things and so forth, it seems that there's lots more possibility to understand the world, not just to predict what someone will type next on the Internet.

bmitc

It is important to note that Helen Keller was not born blind and deaf, though. (I am not reducing the struggle she went through; just commentary on embodied cognition and learning.) There were around 19 months of normal speech and hearing development before then, along with 3D object-space traversal and object manipulation.

PaulDavisThe1st

at least a few decades ago, this idea was called "embodied intelligence" or "embodied cognition". just FYI.

furyofantares

> Doesn't Language itself encode multimodal experiences?

When communicating between two entities with similar brains who have both had many thousands of hours of similar types of sensory experiences, yeah. When I read text I have a lot more than other text to relate it to in my mind; I bring to bear my experiences as a human in the world. The author is typically aware of this and effectively exploits this fact.

andsoitis

Some aspects of experience— e.g. raw emotions, sensory perceptions, or deeply personal, ineffable states—often resist full articulation.

The taste of a specific dish, the exact feeling of nostalgia, or the full depth of a traumatic or ecstatic moment can be approximated in words but never fully captured. Language is symbolic and structured, while experience is often fluid, embodied, and multi-sensory. Even the most precise or poetic descriptions rely on shared context and personal interpretation, meaning that some aspects of experience inevitably remain untranslatable.

im3w1l

Just because we struggle to verbalize something doesn't mean that it cannot be verbalized. The taste of a specific dish can be broken down into its components. The basic tastes: how sweet, sour, salty, bitter and savory it is. The smell of it: there are apparently ~400 olfactory receptor types in the nose, so you could describe how strongly each of them is activated. Thermoception: the temperature of the food itself, but also the fake temperature sensation produced by capsaicin and menthol. The mechanoreceptors play a part, detecting both the shape of the food as well as the texture of it. The texture also contributes to a sound sensation as we hear the cracks and pops when we chew. And that is just the static part of it. Food is actually an interactive experience, where all those impressions change and vary over time as the food is chewed.

It is highly complex, but it can all be described.

mystified5016

Imagine I give you a text of any arbitrary length in an unknown language with no images. With no context other than the text, what could you learn?

If I told you the text contained a detailed theory of FTL travel, could you ever construct the engine? Could you even prove it contained what I told you?

Can you imagine that given enough time, you'd recognize patterns in the text? Some sequences of glyphs usually follow other sequences, eventually you could deduce a grammar, and begin putting together strings of glyphs that seem statistically likely compared to the source.

You can do all the analysis you like and produce text that matches the structure and complexity of the source. A speaker of that language might even be convinced.

At what point do you start building the space ship? When do you realize the source text was fictional?

There are many untranslatable human languages across history. Famously, ancient Egyptian hieroglyphs. We had lots and lots of source text, but all context relating the text to the world had been lost. It wasn't until we found a translation on the Rosetta Stone that we could understand the meaning of the language.

Text alone has historically proven to not be enough for humans to extract meaning from an unknown language. Machines might hypothetically change that but I'm not convinced.

Just think of how much effort it takes to establish bidirectional spoken communication between two people with no common language. You have to be taught the word for apple by being given an apple. There's really no exception to this.

pessimizer

I'm optimistic about this. I think enough pictures of an apple, chemical analyses of the air, the ability to arbitrarily move around in space, a bunch of pressure sensors, or a bunch of senses we don't even have, will solve this. I suspect there might be a continuum of more concept understanding that comes with more senses. We're bathed in senses all the time, to the point where we have many systems just to block out senses temporarily, and to constantly throw away information (but different information at different times.)

It's not a theory of consciousness, it's a theory of quality. I don't think that something can be considered conscious that is constantly encoding and decoding things into and out of binary.

CamperBob2

A few GB worth of photographs of hieroglyphs? OK, you're going to need a Rosetta Stone.

A few PB worth? Relax, HAL's got this. When it comes to information, it turns out that quantity has a quality all its own.


danielmarkbruce

> Doesn't Language itself encode multimodal experiences

Of course it does. We immediately encode pictures/words/everything into vectors anyway. In practice we don't have great text datasets to describe many things in enough detail, but there isn't any reason we couldn't.

heyjamesknight

There are absolutely reasons that we cannot capture the entirety—or even a proper image—of human cognition in semantic space.

Cognition is not purely semantic. It is dynamic, embodied, socially distributed, culturally extended, and conscious.

LLMs are great semantic heuristic machines. But they don't even have access to those other components.

iainctduncan

Thanks for articulating this so well. I'm a musician and music/CS PhD student, and as a jazz improviser of advanced skill (30+ years), I'm acutely aware that there are significant areas of intelligence for which linguistic thinking is not only not good enough, but something to be avoided as much as one can (which is bloody hard sometimes). I have found it so frustrating, but hard to figure out how to counter, that the current LLM zeitgeist seems to hinge on a belief that linguistic intelligence is both necessary and sufficient for AGI.

kadushka

Most modern LLMs are multimodal.

yahoozoo

Does it really matter? At the end of the day, all the modalities and their architectures boil down to matrices of numbers and statistical probability. There’s no agency, no soul.

YeGoblynQueenne

Tri-modal at best: text, sound and video, and that's it. That's just barely "multi" (it's one more than two).

throw310822

I don't get it.

1) Yes it's true, learning from text is very hard. But LLMs are multimodal now.

2) That "size of a lion" paper is from 2019, which is a geological era from now. The SOTA was GPT2 which was barely able to spit out coherent text.

3) Have you tried asking a mouse to play chess or reason its way through some physics problem or to write some code? I'm really curious in which benchmark are mice surpassing chatgpt/ grok/ claude etc.

nextts

Mice can survive, forage, reproduce. Reproduce a mammal. There is a whole load of capability not available in an LLM.

An LLM is essentially a search over a compressed dataset with a tiny bit of reasoning as emergent behaviour. Because it is a parrot, that is why you get "hallucinations": the search failed (like when you get a bad result in Google), or the lossy compression failed, or its reasoning failed.

Obviously there is a lot of stuff the LLM can find in its searches that are reminiscent of the great intelligence of the people writing for its training data.

The magic trick is impressive because when we judge a human what do we do... an exam? an interview? Someone with a perfect memory can fool many people because most people only acquire memory from tacit knowledge. Most people need to live in Paris to become fluent in French. So we see a robot that has a tiny bit of reasoning and a brilliant memory as a brilliant mind. But this is an illusion.

Here is an example:

User: what is the French Revolution?

Agent: The French Revolution was a period of political and societal change in France which began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799. Many of the revolution's ideas are considered fundamental principles of liberal democracy and its values remain central to modern French political discourse.

Can you spot the trick?

pfisch

When you talk to ~3 year old children they hallucinate quite a lot. Really almost nonstop when you ask them about almost anything.

I'm not convinced that what LLM's are doing is that far off the beaten path from our own cognition.

CamperBob2

> Mice can survive, forage, reproduce. Reproduce a mammal. There is a whole load of capability not available in an LLM.

And if it stood for "Large Literal Mouse", that might be a meaningful point. The subject is artificial intelligence, and a brief glance at your newspaper, TV, or nearest window will remind you that it doesn't take intelligence to survive, forage, or reproduce.

The mouse comparison is absurd. You might as well criticize an LLM for being bad at putting out a fire, fixing a flat, or holding a door open.

YeGoblynQueenne

Oh mice can solve a plethora of physics problems before it's time for breakfast. They have to navigate the, well, physical world, after all.

I'm also really curious what benchmarks LLMs have passed that include surviving without being eaten by a cat, or a gull, or an owl, while looking for food to survive and feed one's young in an arbitrary environment chosen from urban, rural, natural etc, at random. What's ChatGPT's score on that kind of benchmark?

CyberDildonics

Oh, a rock rolling down a hill is, well, navigating the physical world. Is it, well, solving a physics problem?

throw310822

> mice can solve a plethora of physics problems before it's time for breakfast

Ah really? Which ones? And nope, physical agility is not "solving a physics problem", otherwise soccer players and figure skaters would all have PhDs, which doesn't seem to be the case.

I mean, an automated system that solves equations to keep balance is not particularly "intelligent". We usually call intelligence the ability to solve generic problems, not the ability of a very specialized system to solve the same problem again and again.

gsf_emergency_2

usual disclaimer: you decide on your own whether I'm an insider or not :)

where LeCun might be prescient should intersect with the nemesis SCHMIDHUBER. They can't both be wrong, I suppose?!

It's only "tangentially" related to energy minimization, technically speaking :) connection to multimodalities is spot-on.

https://www.mdpi.com/1099-4300/26/3/252

To Compress or Not to Compress—Self-Supervised Learning and Information Theory: A Review

With Ravid, double-handedly blue-flag MDPI!

Summarized for the layman (propaganda?) https://archive.is/https://nyudatascience.medium.com/how-sho...

>When asked about practical applications and areas where these insights might be immediately used, Shwartz-Ziv highlighted the potential in multi-modalities and tabula

Imho, best take I've seen on this thread (irony: literal energy minimization) https://news.ycombinator.com/item?id=43367126

Of course, this would make Google/OpenAI/DeepSeek wrong by two whole levels (both architecturally and conceptually)

ninetyninenine

> 1) You can't learn an accurate world model just from text.

> 2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.

LLMs can be trained with multimodal data. Language is only tokens, and pixel and sound data can be encoded into tokens. All data can be serialized. You can train this thing on data we can't even comprehend.

Here's the big question. It's clear we need less data than an LLM. But I think that's because evolution has pretrained our brains for this, so we have brains geared towards specific things. We are geared towards walking, talking, and reading, in the same way a cheetah is geared towards ground speed more than it is towards flight.

If we placed a human and an LLM in completely unfamiliar spaces and tried to train both with data, which would perform better?

And I mean completely unfamiliar spaces. Let's make it a non-Euclidean space, using only sonar for visualization: something totally foreign to reality as humans know it.

I honestly think the LLM would beat us in this environment. We might've already succeeded in creating AGI; it's just that the G is too much. It's too general, so it's learning everything from scratch and can't catch up to us.

Maybe what we need is to figure out how to bias the AI to think and be biased in the way humans are biased.

MITSardine

Humans are more adaptable than you think:

- echolocation in blind humans https://en.wikipedia.org/wiki/Human_echolocation

- sight through signals sent on tongue https://www.scientificamerican.com/article/device-lets-blind...

In the latter case, I recall reading the people involved ended up perceiving these signals as a "first order" sense (not consciously treated information, but on an intuitive level like hearing or vision).

physicsguy

Hugely different data too?

If you think of all the neurons connected up to vision, touch, hearing, heat receptors, balance, etc. there’s a constant stream of multimodal data of different types along with constant reinforcement learning - e.g. ‘if you move your eye in this way, the scene you see changes’, ‘if you tilt your body this way your balance changes’, etc. and this runs from even before you are born, throughout your life.

kedarkhand

> non Euclidean space and only using sonar for visualization

Pretty good idea for a video game!

hintymad

I'm curious why their claims are controversial. It seems pretty obvious to me that LLMs sometimes generate idiotic answers because the models lack common sense and do not have the ability for deductive logical reasoning, let alone the ability to induce. And the current transformer architectures plus all the post-training techniques do not do anything to build such intelligence or the world model, per LeCun's words.

jawiggins

I'm not an ML researcher, but I do work in the field.

My mental model of AI advancements is that of a step function with s-curves in each step [1]. Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, demonstrating quick advancements. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found. Examples of steps include AlexNet demonstrating superior image labeling, LeCun demonstrating deep learning, and now OpenAI demonstrating large transformer models.

I think in the past, at each stage, people tended to think that the recent progress was a linear or exponential process that would continue forward. This led to people thinking self-driving cars were right around the corner after the introduction of DL in the 2010s, and that super-intelligence is right around the corner now. I think at each stage, the cusp of the S-curve comes as we find where the model is good enough to be deployed, and where it isn't. Then companies tend to enter a holding pattern for a number of years, getting diminishing returns from small improvements on their models, until the next algorithmic breakthrough is made.

Right now I would guess that we are around 0.9 on the S-curve; we can still improve the LLMs (as DeepSeek has shown with wide MoE and o1/o3 have shown with CoT), and it will take a few years for the best uses to be brought to market and popularized. As you mentioned, LeCun points out that LLMs have a hallucination problem built into their architecture, and others have pointed out that LLMs have had shockingly few revelations and breakthroughs for something that has ingested more knowledge than any living human. I think future work on LLMs is likely to make some improvement on these things, but not much.

I don't know what it will be, but a new algorithm will be needed to induce the next step on the curve of AI advancement.

[1]: https://www.open.edu/openlearn/nature-environment/organisati...

Matthyze

> Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, demonstrating quick advancements. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found.

That seems to be how science works as a whole. Long periods of little progress between productive paradigm shifts.

ahazred8ta

It's been described as fumbling around in a dark room until you find the light switch. At which point you can see the doorway leading to the next dark room.

semi-extrinsic

Punctuated equilibrium theory.

calmbell

That is how science seems to work as a whole. What worries me is that the market views the emergence of additional productive paradigm shifts in AI as only a matter of money. A normal scientific advancement plateau for another five years in AI would be a short-term disaster for the stock market and economy.


tyronehed

This is actually a lazy approach as you describe it. Instead, what is needed is an elegant and simple approach that is 99% of the way there out of the gate. As soon as you start doing statistical tweaking and overfitting models, you are not approaching a solution.

klabb3

In a way yes. For models in physics that should make you suspicious, since most of our famous and useful models found are simple and accurate. However, in general intelligence or even multimodal pattern matching there’s no guarantee there’s an elegant architecture at the core. Elegant models in social sciences like economics, sociology and even fields like biology tend to be hilariously off.

ActorNightly

Not an official ML researcher, but I do happen to understand this stuff.

The problem with LLMs is that the output is inherently stochastic - i.e. there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant look up maps with interpolation.

Energy minimization is more of an abstract approach where you can use architectures that don't rely on things like differentiability. True AI won't be solely feedforward architectures like current LLMs. To give an answer, they will basically determine an algorithm on the fly that includes computation and search. To learn that algorithm (or algorithm parameters), at training time, you need something that doesn't rely on continuous values, but still converges to the right answer. So instead you assign a fitness score, like memory use or compute cycles, and differentiate based on that. This is basically how search works with genetic algorithms or PSO.
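For a rough picture of what optimizing a non-differentiable fitness score looks like, here's a toy sketch in plain Python (the cost function is entirely made up; this is not anything LeCun or anyone else has proposed):

```python
# Toy sketch: optimize a non-differentiable fitness (note the discrete
# penalty term) with a tiny genetic algorithm instead of gradient descent.
import random

def fitness(params):
    # Made-up cost: squared error plus a non-differentiable step penalty.
    x, y = params
    return (x - 3) ** 2 + (y + 1) ** 2 + (5 if x < 0 else 0)

def mutate(params, scale=0.5):
    return [p + random.gauss(0, scale) for p in params]

population = [[random.uniform(-5, 5), random.uniform(-5, 5)] for _ in range(50)]
for generation in range(100):
    population.sort(key=fitness)      # lower fitness = better
    survivors = population[:10]       # selection
    population = survivors + [mutate(random.choice(survivors)) for _ in range(40)]

print(population[0], fitness(population[0]))  # converges near (3, -1)
```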

seanhunter

> The problem with LLMs is that the output is inherently stochastic - i.e. there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant look up maps with interpolation.

I don't think this explanation is correct. The input to the decoder at the end of all the attention heads etc (as I understand it) is a probability distribution over tokens. So the model as a whole does have an ability to score low confidence in something by assigning it a low probability.

The problem is that that thing is a token (part of a word). So the LLM can say "I don't have enough information" to decide on the next part of a word, but has no ability to say "I don't know what on earth I'm talking about" (in general - not associated with a particular token).
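To make that concrete, here's a minimal sketch (toy logits, not any particular model's API) of how per-token confidence shows up as the entropy of the next-token distribution, with nothing equivalent at the whole-answer level:

```python
# Per-token uncertainty is visible as the entropy of the next-token
# distribution; there is no separate sequence-level "I don't know" signal.
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.1, 1.9, 2.0, 0.1])   # hypothetical, nearly flat
probs = softmax(logits)
entropy = -(probs * np.log(probs)).sum()

print(probs)    # probabilities for each candidate token
print(entropy)  # high entropy = low confidence about the next token only
```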

Lerc

I feel like we're stacking naive misinterpretations of how LLMs function on top of one another here. Grasping gradient descent and autoregressive generation can give you a false sense of confidence. It is like knowing how transistors make up logic gates and believing you know more about CPU design than you actually do.

Rather than inferring from how you imagine the architecture working, you can look at examples and counterexamples to see what capabilities they have.

One misconception is that predicting the next word means there is no internal idea on the word after next. The simple disproof of this is that models put 'an' instead of 'a' ahead of words beginning with vowels. It would be quite easy to detect (and exploit) behaviour that decided to use a vowel word just because it somewhat arbitrarily used an 'an'.

Models predict the next word, but they don't just predict the next word. They generate a great deal of internal information in service of that goal. Placing limits on their abilities by assuming the output they express is the sum total of what they have done is a mistake. The output probability is not what it thinks, it is a reduction of what it thinks.

One of Andrej Karpathy's recent videos talked about how researchers showed that models do have an internal sense of not knowing the answer, but fine-tuning on question answering did not give them the ability to express that knowledge. Finding information the model did and didn't know, then fine-tuning it to say "I don't know" for cases where it had no information, allowed the model to generalise and express "I don't know".

metaxz

Thanks for writing this so clearly... I hear wrong/misguided arguments like the ones we see here every day from friends, colleagues, "experts in the media", etc.

It's strange, because just a moment of thinking will show that such ideas are wrong or paint a clearly incomplete picture. And there are plenty of analogies to the dangers of such reductionism. It should be obviously wrong to anyone who has at least tried ChatGPT.

My only explanation is that a denial mechanism must be at play. It simply feels more comfortable to diminish LLM capabilities and/or feel that you understand them from reading a Medium article on transformer-network, than to consider the consequences in terms of the inner black-box nature.

littlestymaar

Not an ML researcher or anything (I'm basically only a few Karpathy videos into ML, so please someone correct me if I'm misunderstanding this), but it seems that you're getting this backwards:

> One misconception is that predicting the next word means there is no internal idea on the word after next. The simple disproof of this is that models put 'an' instead of 'a' ahead of words beginning with vowels.

My understanding is that there's simply no planning of “'an' ahead of a word that starts with a vowel”: the model (or more accurately, the sampler) picks “an”, and then the model will never predict a word that starts with a consonant after that. It's not like it “knows” in advance that it wants to put a word with a vowel and then anticipates that it needs to put “an”; it generates a probability for both tokens “a” and “an”, picks one, and then when it generates the following token, it necessarily takes its previous choice into account and never puts a word starting with a vowel after it has already chosen “a”.
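A toy illustration of that sampling view (hand-written distributions; no claim about what real transformers do internally):

```python
# Toy "model" whose next-word distribution is conditioned on whether it
# already emitted "a" or "an". Purely illustrative of the sampling loop.
import random

def next_word_distribution(prev):
    if prev == "an":
        return {"apple": 0.6, "orange": 0.4}   # vowel-initial continuations
    if prev == "a":
        return {"banana": 0.7, "pear": 0.3}    # consonant-initial continuations
    return {"a": 0.5, "an": 0.5}

prev = "<start>"
sentence = []
for _ in range(2):
    dist = next_word_distribution(prev)
    prev = random.choices(list(dist), weights=dist.values())[0]
    sentence.append(prev)

print(" ".join(sentence))  # e.g. "an apple" or "a banana", never "an banana"
```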

cruffle_duffle

> It would be quite easy to detect (and exploit) behaviour that decided to use a vowel word just because it somewhat arbitrarily used an 'an'.

That is a very interesting observation!

Doesn’t that internal state get blown away and recreated for every “next token”? Isn’t the output always the previous context plus the new token, which gets fed back and out pops the new token? There is no transfer of internal state to the new iteration beyond what is “encoded” in its input tokens?

jkhdigital

I think your analogy about logic gates vs. CPUs is spot on. Another apt analogy would be missing the forest for the trees—the model may in fact be generating a complete forest, but its output (natural language) is inherently serial so it can only plant one tree at a time. The sequence of distributions that is the proximate driver of token selection is just the final distillation step.

flamedoge

It literally doesn't know how to handle 'I don't know' and needs to be taught. Fascinating.


skybrian

I think some “reasoning” models do backtracking by inserting “But wait” at the start of a new paragraph? There’s more to it, but that seems like a pretty good trick.

estebarb

The problem is exactly that: the probability distribution. The network has no way to say: 0% everyone, this is nonsense, backtrack everything.

Other architectures, like energy-based models or Bayesian ones, can assess uncertainty. Transformers simply cannot do it (yet). Yes, there are ways to do it, but we are already spending millions to get coherent phrases; few will burn billions to train a model that can do that kind of assessment.

ortsa

Has anybody ever messed with adding a "backspace" token?

duskwuff

Right. And, as a result, low token-level confidence can end up indicating "there are other ways this could have been worded" or "there are other topics which could have been mentioned here" just as often as it does "this output is factually incorrect". Possibly even more often, in fact.

vessenes

My first reaction is that a model can’t, but a sampling architecture probably could. I’m trying to understand if what we have as a whole architecture for most inference now is responsive to the critique or not.

derefr

You get scores for the outputs of the last layer; so in theory, you could notice when those scores form a particularly flat distribution, and fault.

What you can't currently get, from a (linear) Transformer, is a way to induce a similar observable "fault" in any of the hidden layers. Each hidden layer only speaks the "language" of the next layer after it, so there's no clear way to program an inference-framework-level observer side-channel that can examine the output vector of each layer and say "yup, it has no confidence in any of what it's doing at this point; everything done by layers feeding from this one will just be pareidolia — promoting meaningless deviations from the random-noise output of this layer into increasing significance."

You could in theory build a model as a Transformer-like model in a sort of pine-cone shape, where each layer feeds its output both to the next layer (where the final layer's output is measured and backpropped during training) and to an "introspection layer" that emits a single confidence score (a 1-vector). You start with a pre-trained linear Transformer base model, with fresh random-weighted introspection layers attached. Then you do supervised training of (prompt, response, confidence) triples, where on each training step, the minimum confidence score of all introspection layers becomes the controlled variable tested against the training data. (So you aren't trying to enforce that any particular layer notice when it's not confident, thus coercing the model to "do that check" at that layer; you just enforce that a "vote of no confidence" comes either from somewhere within the model, or nowhere within the model, at each pass.)

This seems like a hack designed just to compensate for this one inadequacy, though; it doesn't seem like it would generalize to helping with anything else. Some other architecture might be able to provide a fully-general solution to enforcing these kinds of global constraints.

(Also, it's not clear at all, for such training, "when" during the generation of a response sequence you should expect to see the vote-of-no-confidence crop up — and whether it would be tenable to force the model to "notice" its non-confidence earlier in a response-sequence-generating loop rather than later. I would guess that a model trained in this way would either explicitly evaluate its own confidence with some self-talk before proceeding [if its base model were trained as a thinking model]; or it would encode hidden thinking state to itself in the form of word-choices et al, gradually resolving its confidence as it goes. In neither case do you really want to "rush" that deliberation process; it'd probably just corrupt it.)
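If it helps, here's a very rough sketch of that introspection-layer idea (hypothetical names, PyTorch-flavoured, with the frozen base model's hidden states faked as random tensors; not an existing architecture):

```python
# Rough sketch: attach a tiny confidence head to each hidden layer of a
# frozen base model and supervise the minimum confidence across layers.
import torch
import torch.nn as nn

class ConfidenceProbe(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_state):
        # Pool over the sequence, squash to (0, 1): one confidence scalar.
        pooled = hidden_state.mean(dim=1)
        return torch.sigmoid(self.head(pooled))

hidden_dim, num_layers = 64, 4
probes = nn.ModuleList(ConfidenceProbe(hidden_dim) for _ in range(num_layers))

# Pretend these are hidden states from a frozen transformer (batch=2, seq=8).
hidden_states = [torch.randn(2, 8, hidden_dim) for _ in range(num_layers)]
confidences = torch.cat([p(h) for p, h in zip(probes, hidden_states)], dim=1)
min_confidence = confidences.min(dim=1).values   # the "vote of no confidence"

target = torch.tensor([1.0, 0.0])                # labeled confident / not confident
loss = nn.functional.binary_cross_entropy(min_confidence, target)
loss.backward()
```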

spmurrayzzz

> i.e. there isn't an "I don't have enough information" option.

This is true in terms of default mode for LLMs, but there's a fair amount of research dedicated to the idea of training models to signal when they need grounding.

SelfRAG is an interesting, early example of this [1]. The basic idea is that the model is trained to first decide whether retrieval/grounding is necessary and then, if so, after retrieval it outputs certain "reflection" tokens to decide whether a passage is relevant to answer a user query, whether the passage is supported (or requires further grounding), and whether the passage is useful. A score is calculated from the reflection tokens.

The model then critiques itself further by generating a tree of candidate responses, and scoring them using a weighted sum of the score and the log probabilities of the generated candidate tokens.
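As a loose paraphrase of that candidate-scoring step (made-up numbers and weights, not the actual SelfRAG code):

```python
# Weighted sum of a reflection-based score and the candidates' token
# log-probabilities, then pick the best candidate. Illustrative only.
candidates = [
    # (response, reflection_score, mean_token_logprob) -- all made up
    ("The park is Central Park.",      0.9, -0.8),
    ("I need more context to answer.", 0.6, -1.1),
    ("The park is Golden Gate Park.",  0.3, -0.9),
]

w_reflection, w_logprob = 0.7, 0.3

def combined_score(candidate):
    _, reflection, logprob = candidate
    return w_reflection * reflection + w_logprob * logprob

best = max(candidates, key=combined_score)
print(best[0])
```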

We can probably quibble about the loaded terms used here like "self-reflection", but the idea that models can be trained to know when they don't have enough information isn't pure fantasy today.

[1] https://arxiv.org/abs/2310.11511

EDIT: I should also note that I generally do side with Lecun's stance on this, but not due to the "not enough information" canard. I think models learning from abstraction (i.e. JEPA, energy-based models) rather than memorization is the better path forward.

thijson

I watched an Andrej Karpathy video recently. He said that hallucination was because in the training data there were no examples where the answer is, "I don't know". Maybe I'm misinterpreting what he was saying though.

https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4832s

unsupp0rted

> The problem with LLMs is that the output is inherently stochastic

Isn't that true with humans too?

There's some leap humans make, even as stochastic parrots, that lets us generate new knowledge.

borgdefenser

I think it is because we don't feel the random and chaotic nature of what we know as individuals.

If I had been born a day earlier or later I would have a completely different life because of initial conditions and randomness but life doesn't feel that way even though I think this is obviously true.

throw310822

> there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant look up maps with interpolation.

Have you ever tried telling ChatGPT that you're "in the city centre" and asking it if you need to turn left or right to reach some landmark? It will not answer with the average of the directions given to everybody who asked the question before; it will answer by asking you to tell it where you are precisely and which way you are facing.

wavemode

That's because, based on the training data, the most likely response to asking for directions is to clarify exactly where you are and what you see.

But if you ask it in terms of a knowledge test ("I'm at the corner of 1st and 2nd, what public park am I standing next to?") a model lacking web search capabilities will confidently hallucinate (unless it's a well-known park).

In fact, my personal opinion is that therein lies the most realistic way to reduce hallucination rates: rather than trying to train models to say "I don't know" (which is not really a trainable thing - models are fundamentally unaware of the limits of their own training data), instead just train them on which kinds of questions warrant a web search and which ones should be answered creatively.

QuesnayJr

I tried this just now on Chatbot Arena, and both chatbots asked for more information.

One was GPT 4.5 preview, and one was cohort-chowder (which is someone's idea of a cute code name, I assume).

throwawaymaths

I don't think the stochasticity is the problem -- the problem is that the model gets "locked in" once it picks a token and there's no takesies backsies.

That also entails information destruction in the form of the logits table, but for the most part that should be accounted for in the last step before the final feedforward.

itkovian_

>This is due to the fact that LLMs are basically just giant look up maps with interpolation.

This is obviously not true at this point except for the most loose definition of interpolation.

>don't rely on things like differentiability.

I've never heard lecun say we need to move away from gradient descent. The opposite actually.

TZubiri

If multiple answers are equally likely, couldn't that be considered uncertainty? Conversely if there's only one answer and there's a huge leap to the second best, that's pretty certain.

chriskanan

A lot of the responses seem to be answering a different question: "Why does LeCun think LLMs won't lead to AGI?" I could answer that, but the question you are asking is "Why does LeCun think hallucinations are inherent in LLMs?"

To answer your question, think about how we train LLMs: We have them learn the statistical distribution of all written human language, such that given a chunk of text (a prompt, etc.) the model samples its output distribution to produce the next most likely token (word, sub-word, etc.) and keeps doing that. It never learns how to judge what is true or false, and during training it never needs to learn "Do I already know this?" It is just spoon-fed information that it has to memorize, and it has no ability to acquire metacognition, which is something that it would need to be trained to attain. As humans, we know what we don't know (to an extent) and can identify when we already know something or don't already know something, such that we can say "I don't know." During training, an LLM is never taught to do this sort of introspection, so it never will know what it doesn't know.
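For readers who want the mechanics spelled out, this is roughly the sampling loop being described; `next_token_probs` here is a random stand-in for a real model's forward pass, and the tiny vocabulary is made up:

```python
# Minimal sketch of the autoregressive loop: sample the next token from the
# model's distribution and append it, with nothing in the loop that ever
# asks "do I actually know this?".
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "lion", "is", "large", "small", "<eos>"]

def next_token_probs(context):
    # Hypothetical: a real LLM computes this from learned weights.
    logits = rng.normal(size=len(vocab))
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

context = ["the", "lion", "is"]
while len(context) < 10:
    probs = next_token_probs(context)
    token = rng.choice(vocab, p=probs)   # sample, never "abstain"
    context.append(token)
    if token == "<eos>":
        break

print(" ".join(context))
```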

I have a bunch of ideas about how to address this with a new architecture and a lifelong learning training paradigm, but it has been hard to execute. I'm an AI professor, but really pushing the envelope in that direction requires I think a small team (10-20) of strong AI scientists and engineers working collaboratively and significant computational resources. It just can't be done efficiently in academia where we have PhD student trainees who all need to be first author and work largely in isolation. By the time AI PhD students get good, they graduate.

I've been trying to find the time to focus on getting a start-up going focused on this. With Terry Sejnowski, I pitched my ideas to a group affiliated with Schmidt Sciences that funds science non-profits at around $20M per year for 5 years. They claimed to love my ideas, but didn't go for it....

emrah

Would you care to post your ideas somewhere online so others can read, critique, try etc?

random3

"we love your ideas" == no

"when do you close the round?" = maybe

money in the bank account = yes

eximius

I believe that so long as weights are fixed at inference time, we'll be at a dead end.

Will Titans be sufficiently "neuroplastic" to escape that? Maybe, I'm not sure.

Ultimately, I think an architecture around "looping" where the model outputs are both some form of "self update" and "optional actionality" such that interacting with the model is more "sampling from a thought space" will be required.

chriskanan

I 100% agree with this and sampling from thought space rather than "thinking" in terms of language. I spent forever writing up an NSF grant proposal on exactly this idea and submitted it last May. I haven't heard back, but it probably won't be funded.

randomNumber7

Why? Even animals sleep. And if you, for example, learn an instrument, you will notice that a lot of the learning of muscle memory happens during sleep.

eximius

I guess you're saying that non-inference time training can be that "sleep period"?

randomNumber7

Yes, I could imagine something like a humanoid robot where the "short term memory" is just a big enough context to keep all the input of the day. Then during "sleep" there is training where that information is processed.

But I also think that current LLM tech does not lead to AGI. You can't train something on pattern matching and have it magically become intelligent (although I could be wrong).

Imo an AGI would need to be able to interact with the environment and learn to reflect on its interactions and its abilities within it. I suspect we have the hardware to build something as intelligent as a cat or a dog, but not the algorithms.

mft_

Very much this. I’ve been wondering why I’ve not seen it much discussed.

jononor

There are many roadblocks to continual learning still. Most current models and training paradigms are very vulnerable to catastrophic forgetting. And are very sample inefficient. And we/the methods are not so good at separating what is "interesting" (should be learned) vs "not". But this is being researched, for example under the topic of open ended learning, active inference, etc.

chriskanan

As a leader in the field of continual learning, I somewhat agree, but I'd say that catastrophic forgetting is largely resolved. The problem is that the continual learning community largely has become insular and is mostly focusing on toy problems that don't matter, where they will even avoid good solutions for nonsensical reasons. For example, reactivation / replay / rehearsal works well for mitigating catastrophic forgetting almost entirely, but a lot of the continual learning community mostly dislikes it because it is very effective. A lot of the work is focusing on toy problems and they refuse to scale up. I wrote this paper with some of my colleagues on this issue, although with such a long author list it isn't as focused as I would have liked in terms of telling the continual learning community to get out of its rut such that they are writing papers that advance AI rather than are just written for other continual learning researchers: https://arxiv.org/abs/2311.11908

The majority are focusing on the wrong paradigms and the wrong questions, which blocks progress towards the kinds of continual learning needed to make progress towards creating models that think in latent space and enabling meta-cognition, which would then give architectures the ability to avoid hallucinations by knowing what they don't know.

eximius

Self updating requires learning to learn, which I'm not sure we know how to do.

bashfulpup

He's right but at the same time wrong. Current AI methods are essentially scaled up methods that we learned decades ago.

These long-horizon (AGI) problems have been there since the very beginning. We have never had a solution to them. RL assumes we know the future, which is a poor proxy. These energy-based methods fundamentally do very little that an RNN didn't do long ago.

I worked on higher-dimensionality methods, which is a very different angle. My take is that it's about the way we scale dependencies between connections. The human brain makes and breaks a massive number of neuron connections daily. Scaling the dimensionality would imply that a single connection could be scaled to encompass significantly more "thoughts" over time.

Additionally, the true solution to these problems is as likely to be found by a kid with a laptop as by a top researcher. If you find the solution to CL on a small AI model (MNIST), you solve it at all scales.

haolez

Not exactly related, but I sometimes wonder if the fact that the weights in current models are very expensive to change is a feature and not a "bug".

Somehow, it feels harder to trust a model that could evolve over time. Its performance might even degrade. That's a steep price to pay for having memory built in and a (possibly) self-evolving model.

bashfulpup

We degrade, and I think we are far more valuable than one model.

nradov

For a kid with a laptop to solve it would require the problem to be solvable with current standard hardware. There's no evidence for that. We might need a completely different hardware paradigm.

bashfulpup

Also possible and a fair point. My point is that it's a "tiny" solution that we can scale.

I could revise that by saying a kid with a whiteboard.

It's an Einstein×10 moment, so who knows when that'll happen.

hnfong

I'm not an insider and I'm not sure whether this is directly related to "energy minimization", but "diffusion language models" have apparently gained some popularity in recent weeks.

https://arxiv.org/abs/2502.09992

https://www.inceptionlabs.ai/news

(these are results from two different teams/orgs)

It sounds kind of like what you're describing, and nobody else has mentioned it yet, so take a look and see whether it's relevant.

hnuser123456

And they seem to be about 10x as fast as similar sized transformers.

317070

No, 10x fewer sampling steps. Whether or not that means 10x faster remains to be seen, as a diffusion step tends to be more expensive than an autoregressive step.

littlestymaar

If I understood correctly, in practice they show actual speed improvements on high-end cards, because autoregressive LLMs are bandwidth-limited rather than compute-bound, so switching to an approach that is more compute-expensive but less memory-bandwidth-heavy works well on current hardware.

coderenegade

You could reframe the way LLMs are currently trained as energy minimization, since the Boltzmann distribution that links physics and information theory (and correspondingly, probability theory as well) is general enough to include all standard loss functions as special cases. It's also pretty straightforward to include RL in that category as well.
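For concreteness, the identity being referenced is just the standard Boltzmann form (a textbook relation, not anything specific to LeCun's proposal):

```latex
\[
p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta},
\qquad Z_\theta = \sum_{x'} e^{-E_\theta(x')}
\]
\[
-\log p_\theta(x) = E_\theta(x) + \log Z_\theta
\]
% Minimizing the usual negative log-likelihood is therefore minimizing an
% energy, up to the (expensive) normalization term $\log Z_\theta$.
```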

I think what Lecun is probably getting at is that there's currently no way for a model to say "I don't know". Instead, it'll just do its best. For esoteric topics, this can result in hallucinations; for topics where you push just past the edge of well-known and easy-to-Google, you might get a vacuously correct response (i.e. repetition of correct but otherwise known or useless information). The models are trained to output a response that meets the criteria of quality as judged by a human, but there's no decent measure (that I'm aware of) of the accuracy of the knowledge content, or the model's own limitations. I actually think this is why programming and mathematical tasks have such a large impact on model performance: because they encode information about correctness directly into the task.

So Yann is probably right, though I don't know that energy minimization is a special distinction that needs to be added. Any technique that we use for this task could almost certainly be framed as energy minimization of some energy function.

jiggawatts

My observation from the outside watching this all unfold is that not enough effort seems to be going into the training schedule.

I say schedule because the “static data once through” approach is, in my mind, one of the root problems.

Think about what happens when you read something like a book. You’re not “just” reading it, you’re also comparing it to other books, other books by the same author, while critically considering the book recommendations made by your friend. Any events in the book get compared to your life experience, etc…

LLM training does none of this! It’s a once-through text prediction training regime.

What this means in practice is that an LLM can’t write a review of a book unless it has read many reviews already. They have, of course, but the problem doesn’t go away. Ask an AI to critique book reviews and it’ll run out of steam because it hasn’t seen many of those. Critiques of critiques is where they start falling flat on their face.

This kind of meta-knowledge is precisely what experts accumulate.

As a programmer I don’t just regurgitate code I’ve seen before with slight variations — instead I know that mainstream criticisms of microservices miss their key benefit of extreme team scalability!

This is the crux of it: when humans read their training material they are generating an “n+1” level in their mind that they also learn. The current AI training setup trains the AI only the “n”th level.

This can be solved by running the training in a loop for several iterations after base training. The challenge of course is to develop a meaningful loss function.

IMHO the “thinking” model training is a step in the right direction but nowhere near enough to produce AGI all by itself.

TrainedMonkey

This is a somewhat nihilistic take with an optimistic ending. I believe humans will never fix hallucinations. The amount of totally or partially untrue statements people make is significant. Especially in tech, it's rare for people to admit that they do not know something. And yet, despite all of that, progress keeps marching forward and maybe even accelerating.

ketzo

Yeah, I think a lot of people talk about "fixing hallucinations" as the end goal, rather than "LLMs providing value", which misses the forest for the trees; it's obviously already true that we don't need totally hallucination-free output to get value from these models.

mdp2021

Even as language models can partially solve a few problems, we are still left with the problem of achieving Artificial General Intelligence, which the presence of LLMs has exacerbated because they so often reveal themselves to be artificial morons.

Intelligence finds solutions - actual, solid solutions.

More than "fixing" hallucinations, the problem is going beyond them (arriving to "sobriety").

dtnewman

I’m not sure I follow. Sure, people lie and make stuff up all the time. If an LLM goes and parrots that, then I would argue that it isn’t hallucinating. Hallucinating would be where it makes something up that is not in its training set nor logically deducible from it.

esafak

I think most humans are perfectly capable of admitting to themselves when they do not know something. Computers ought to do better.

danielmarkbruce

You must interact with a very different set of humans than most.

danielmarkbruce

Once one starts thinking of them as "concept models" rather than language models or fact models, "hallucinations" become something not to be so fixated on. We transform tokens into 12k+ length embeddings... right at the start. They stop being language immediately.

They aren't fact machines. They are concept machines.

mdp2021

Not an argument. "Many people are delirious, yet some people create progress". What is that supposed to imply?

probably_wrong

I haven't read Yann Lecun's take. Based on your description alone my first impression would be: there's a paper [1] arguing that "beam search enforces uniform information density in text, a property motivated by cognitive science". UID claims, in short, that a speaker only delivers as much content as they think the listener can take (no more, no less) and the paper claims that beam search enforced this property at generation time.

The paper would be a strong argument against your point: if neural architectures are already constraining the amount of information that a text generation system delivers the same way a human (allegedly) does, then I don't see which "energy" measure one could take that could perform any better.

Then again, perhaps they have one in mind and I just haven't read it.

[1] https://aclanthology.org/2020.emnlp-main.170/

vessenes

I believe he’s talking about some sort of ‘energy as measured by distance from the model's understanding of the world’, as in quite literally a world model. But again I’m ignorant, hence the post!

deepsquirrelnet

In some respects that sounds similar to what we already do with reward models. I think with GRPO, the “bag of rewards” approach doesn’t strike me as terribly different. The challenge is in building out a sufficient “world” of rewards to adequately represent more meaningful feedback-based learning.

While it sounds nice to reframe it like a physics problem, it seems like a fundamentally flawed idea, akin to saying “there is a closed form solution to the question of how should I live.” The problem isn’t hallucinations, the problem is that language and relativism are inextricably linked.

tyronehed

When an architecture is based around world-model building, then it is a natural outcome that similar concepts and things end up being stored in similar places. They overlap. As soon as your solution starts to get mathematically complex, you are departing from what the human brain does. I'm not saying that it might not be possible in some universe to make a statistical intelligence, but when you go that direction you are straying away from the only existing intelligence that we know about: the human brain. So the best solutions will closely echo neuroscience.

anonymoushn

This sort of measure is a decent match for BPB though. BPB=-log(document_probability)/document_length_bytes and perplexity=e^(BPB*document_length_bytes/document_length_tokens). We already train models by minimizing perplexity, and model outputs are already those that are high probability. Though like with EBMs, figuring out outputs with even higher probability would require an expensive search step.
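Plugging made-up numbers into those formulas exactly as written in the comment (natural-log convention, purely illustrative):

```python
# Direct transcription of the BPB and perplexity formulas above; the
# document statistics here are hypothetical.
import math

log_document_probability = -5400.0   # hypothetical log p(document) under the model
document_length_bytes = 4000
document_length_tokens = 1000

bpb = -log_document_probability / document_length_bytes
perplexity = math.exp(bpb * document_length_bytes / document_length_tokens)

print(bpb)         # 1.35 in this toy example
print(perplexity)  # e^5.4, roughly 221 per token
```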

nonagono

Many of his arguments make “logical” sense, but one way to evaluate them is: would they have applied equally well 5 years ago? And would that have predicted that LLMs would never write (average) poetry, or solve math, or answer common-sense questions about the physical world reasonably well? Probably. But it turns out scale was all we needed. So yeah, maybe this is the exact point where scale stops working and we need to drastically change architectures. But maybe we just need to keep scaling.