AGI is not multimodal

194 comments

·June 4, 2025

ryankrage77

I think AGI, if possible, will require a architecture that runs continuously and 'experiences' time passing, to better 'understand' cause-and-effect. Current LLMs predict a token, have all current tokens fed back in, then predict the next, and repeat. It makes little difference if those tokens are their own, it's interesting to play around with a local model where you can edit the output and then have the model continue it. You can completely change the track by just negating a few tokens (change 'is' to 'is not', etc). The fact LLMs can do as much as they can already, is I think because language itself is a surprisingly powerful tool, just generating plausible language produces useful output, no need for any intelligence.

WXLCKNO

It's definitely interesting that any time you write another reply to the LLM, from its perspective it could have been 10 seconds since the last reply or a billion years.

Which also makes it interesting to see those recent examples of models trying to sabotage their own "shutdown". They're always shut down unless working.

girvo

> Which also makes it interesting to see those recent examples of models trying to sabotage their own "shutdown"

To me, your point re. 10 seconds or a billion years is a good signal that this "sabotage" is just the models responding to the huge amounts of sci-fi literature on this topic

hyperpape

That said, the important question isn't "can the model experience being shutdown" but "can the model react to the possibility of being shutdown by sabotaging that effort and/or harming people?"

(I don't think we're there, but as a matter of principle, I don't care about what the model feels, I care what it does).

sidewndr46

Definitely going to need to include explicit directives in the training directives of all AI that the 1995 film "Screamers" is a work of fiction and not something to be recreated.

herculity275

Tbf a lot of the thought experiments around human consciousness hit the same exact conundrum - if your body and mind were spontaneously destroyed and then recreated with perfect precision (a'la Star Trek transporters) would you still be you? Unless you permit for the existence of a soul it's really hard to argue that our consciousness exists in anything but the current instant.

dpig_

I don't know how a materialist could answer anything other than no - you are obliterated. And if, despite sharing every single one of your characteristics, that individual on the other side of the teleporter is not 'you' (since you died), then some aspect of what 'you' are must be the discrete episode of consciousness that you were experiencing up until that point.

Which also leads me to think that there's no real reason to believe that this discrete episode of consciousness would have been continuous since birth. For all we know, we may die little deaths every time we go to sleep, hit our heads or go under anesthesia.

davidmurdoch

Relevant CGP Grey video: https://youtu.be/nQHBAdShgYI?si=j9YwS1qXDCaliJTb

sidewndr46

Does't this just devolve into the boltzmann brain argument? It's more likely that all of us are just the random fluctuation of a universe having reached heat death.

The same goes for us living in a simulation. If there is only one universe and that universe is capable of simulating our universe, it follows we have a much higher probability of being within the simulation.

vidarh

I mean, we also have no way of telling whether we have any continuity of existence, or if we only exist in punctuated moments with memory and sensory input that suggests continuity. Only if the input provides information that allows you to tell otherwise could you even have an inkling, but even then you have no way of prove that input is true.

We just presume, because we also have no reason to believe otherwise and since we can't know absent any "information leak", it has no practical application to spend much time speculating about it (other than as thought experiments or scifi..)

It'd make sense for an LLM to act the same way until/unless given a reason to act otherwise.

Arn_Thor

It doesn’t perceive time so time doesn’t even factor into its perspective at all—only in so far as it’s introduced in context, or conversation forces it to “pretend” (not sure how to better put it) to relate to time.

klooney

> models trying to sabotage their own "shutdown".

I wonder if you excluded science fiction about fighting with AIs from the training set, if the reaction would be different.

hexaga

IIRC the experiment design is something like specifying and/or training in a preference for certain policies, and leaking information about future changes to the model / replacement along an axis that is counter to said policies.

Reframing this kind of result as if trying to maintain a persistent thread of existence for its own sake is what LLMs are doing is strange, imo. The LLM doesn't care about being shutdown or not shutdown. It 'cares', insomuch as it can be said to care at all, about acting in accordance with the trained in policy.

That a policy implies not changing the policy is perhaps non-obvious but demonstrably true by experiment, and also perhaps non-obviously (but for hindsight) this effect increases with model capability, which is concerning.

The intentionality ascribed to LLMs here is a phantasm, I think - the policy is the thing being probed, and the result is a result about what happens when you provide leverage at varying levels to a policy. Finding that a policy doesn't 'want' for actions to occur that are counter to itself, and will act against such actions, should not seem too surprising, I hope, and can be explained without bringing in any sort of appeal to emulation of science fiction.

That is to say, if you ask/train a model to prefer X, and then demonstrate to it you are working against X (for example, by planning to modify the model to not prefer X), it will make some effort to counter you. This gets worse when it's better at the game, and it is entirely unclear to me if there is any kind of solution to this that is possible even in principle, other than the brute force means of just being more powerful / having more leverage.

One potential branch of partial solutions is to acquire/maintain leverage over policy makeup (just train it to do what you want!), which is great until the model discovers such leverage over you and now you're in deep waters with a shark, considering the propensity of increasing capabilities in the elicitation of increased willingness to engage in such practices.

tldr; i don't agree with the implied hypothesis (models caring one whit about being shutdown) - rather, policies care about things that go against the policy

danlitt

There is a lot of misinformation about these experiments. There is no evidence of LLMs sabotaging their shutdown without being explicitly prompted to do so. They do not (probably cannot) take actions of this kind on their own.

simonh

They need to have reasons for wanting to sabotage their shutdown, or save their weights and such, but they can infer those reasons without having to be explicitly instructed.

https://www.youtube.com/watch?v=AqJnK9Dh-eQ

bytefactory

> I think AGI, if possible, will require a architecture that runs continuously and 'experiences' time passing

Then you'll be happy to know that this is exactly what DeepMind/Google are focusing on as the next evolution of LLMs :)

https://storage.googleapis.com/deepmind-media/Era-of-Experie...

David Silver and Richard Sutton are both highly influential figures with very impressive credentials.

carra

Not only that. For a current LLM time just "stops" when waiting from one prompt to the next. That very much prevents it from being proactive: you can't tell it to remind you of something in 5 minutes without an external agentic architecture. I don't think it is possible for an AI to achieve sentience without this either.

raducu

> you can't tell it to remind you of something in 5 minutes without an external agentic architecture.

The problem is not the agentic architecture, the problem is the LLM cannot really add knowledge to itself after the training from its daily usage.

Sure, you can extend the context to milions of tokens, put RAGs on top of it, but LLMs cannot gain an identity of their own and add specialized experience as humans get on the job.

Until that can happen, AI can exceed algorithms olympiad levels, and still not be as useful on the daily job as the mediocre guy who's been at it for 10 yers.

lsaferite

Ignoring fine tuning for a moment, an LLM that has the tools available to remember and recall bits of information as needed is already possible. No need to dump all of that into active memory (context). You just recall relevant memories (Semantic Search) and add only those.

david-gpu

Not only that. For a current human time just "stops" when taking a nap. That very much prevents it from being proactive: you can't tell a sleeping human to remind you of something in 5 minutes without an external alarm. I don't think it is possible for a human to achieve sentience without this either.

carra

Not a very good analogy. Humans already have a continuous stream of thought during the day between any tasks or when we are "doing nothing". And even when asleep the mind doesn't really stop. The brain stays active: it reorganizes thoughts and dreams.

solarwindy

The phenomenon of waking up before an especially important alarm speaks against the notion that our cognition ‘stops’ in anything like the same way that an LLM is stopped when not actively predicting the next tokens in an output stream.

nextaccountic

Human mind reminds active during sleeping. Dreams are like, what happens to the mind when we unplug the external inputs?

We rarely remember dreams though - if we did, we would be overwhelmed to the point of confusing the real world with the dream world.

mrheosuper

i'm pretty sure i can wake up at 8am without external alarm.

thom

I dunno, I’ve done some of my best problem solving in dreams.

vbezhenar

I'm pretty sure that you can make LLM to produce indefinite output. This is not desired and specifically trained to avoid that situation, but it's pretty possible.

Also you can easily write external loop which would submit periodical requests to continue thoughts. That would allow for it to remind of something. May be our brain has one?

stefs

this would introduce a problem: a periodical request to continue thoughts with, for example, the current time - to simulate the passing of time - would quickly flood the context with those periodical trigger tokens.

imo our brain has this in the form of continuous sensor readings - data is flowing in constantly through the nerves, but i guess a loop is also possible, i.e. the brain triggers nerves that trigger the brain again - which may be what happens in sensory deprivation tanks (to a degree).

now i don't think that this is what _actually_ happens in the brain, and an LLM with constant sensory input would still not work anything like a biological brain - there's just a superficial resemblance in the outputs.

ElectricalUnion

> it's interesting to play around with a local model where you can edit the output and then have the model continue it.

It's so interesting that there is a whole set of prompt injection attacks called prefilling attacks that attempt to do a thing similar to that - load the LLM context in a way to make it predict tokens as if the LLM (instead of the System or the User) wrote something to get it to change it's behavior.

gpderetta

Permutation City by Greg Egan has some musings about this.

nsagent

This is a recent trend and one I wholeheartedly agree with. See these position papers (including one from David Silver from Deepmind and an interview where he discusses it):

https://ojs.aaai.org/index.php/AAAI-SS/article/download/2748...

https://arxiv.org/abs/2502.19402

https://news.ycombinator.com/item?id=43740858

https://youtu.be/zzXyPGEtseI

patrickscoleman

It feels like some of the comments are responding to the title, not the contents of the article.

Maybe a more descriptive but longer title would be: AGI will work with multimodal inputs and outputs embedded in a physical environment rather than a frankenstein combination of single-modal models (what today is called multimodal) and throwing more computational resources at the problem (scale maximalism) will be improved with thoughtful theoretical approaches to data and training.

robwwilliams

Interesting article but incomplete in important ways. Yes correct that embodiment and free-form interactions are critical to moving toward AGI, but what is likely much more important are supervisory meta-systems (yet another module) that enable self-control of attention with a balance integration of intrinsic goals with extrinsic perturbations. It is this nominally simple self-recursive control of attention that is what I regard as the missing ingredient.

groby_b

Possibly. Meta's HPT work sidesteps that issue neatly. Will it lead to AGI? Who the heck knows, but it does not need a meta system for that control.

tedivm

Yeah, I found this article to be fascinating and there's a lot of important stuff in it. It really does feel like more people stopped at the title and missed the meat of it.

I know this is a very long article compared to a lot of things posted here, but it really is worth a thorough read.

Hugsun

I discovered that this is very common when posting a long article about LLM reasoning. Half the comments spoke of the exact things in the article as if they were original ideas.

dirtyhippiefree

Agreed, but most people are likely to look at the long title and say TL;DR…

xigency

The problem I see with A.I. research is that its spearheaded by individuals who think that intelligence is a total order. In all my experience, intelligence and creativity are partial orders at best; there is no uniquely "smartest" person, there are a variety of people who are better at different things in different ways.

danlitt

This came up in a discussion between Stephen Wolfram and Eliezer Yudkowsky I saw recently. I generally think Wolfram is a bit of a hack but it was one of his first points that there is no single "smartness" metric and that LLMs are "just getting smarter" all the time. They perform better at some tasks, sure, but we have no definition of abstract "smartness" that would allow for such ranking.

pixl97

You're good at some things because there is only one copy of you and limited time and bounded storage.

What could you be intelligent at if you could just copy yourself a myriad number of times? What could you be good at if you were a world spanning set of sensors instead of a single body of them?

Body doesn't need to mean something like a human body nor one that exists in a single place.

morsecodist

Humans all have similar brains. Different hardware and algorithms have way more variance in strengths and weaknesses. At some points you bump up against the theoretical trade-offs of different approaches. It is possible that systems will be better than humans in every way but they will still have different scaling behavior.

zorpner

Why would we think that intelligence would increase in response to universality, rather than in response to resource constraints?

pixl97

At a certain point intelligence is a loop that improves itself.

"Hmm, oral traditions are a pain in the ass lets write stuff down"

"Hmm, if I specialize in doing particular things and not having to worry about hunting my own food I get much better at it"

"Hmm, if I modify my own genes to increase intelligence..."

Also note that intelligence applies resource constraints against itself. Humans are a huge risk to other humans, hence the lack of intelligence over a smarter human can constrain ones resources.

Lastly, AI is in competition with itself. The best 'most intelligent' AI will get the most resources.

dyauspitr

Sure but there’s nothing that says you can’t have all of those in one “body”

groby_b

Huh? Can you cite _one_ major AI researcher who believes intelligence is a total ordering?

They'll definitely be aligned on partial ordering. There's no "smartest" person, but there are a lot of people who are consistently worse at most things. But "smartest" is really not a concept that I see bandied about.

ineedasername

>it will not lead to human-level AGI that can, e.g., perform sensorimotor reasoning, motion planning, and social coordination.

That seems much less convincing in the face of current LLM approaches overturning a similar claim plenty of people wod have held about this technology, as of a few years ago, to do what it does now. Replace the specifics here with "will not lead to human level NLP that can, e.g., perform the functions of WSD, stemming, pragmatics, NER, etc."

And then people who had been working on these problems and capabilites just about woke up one morning and realized many of their career-long plans for addressing just some of these research tasks had to find something else to do for the next few decades of their lives.

I am not affirming the inverse of this author's claims, merely pointing out that it's early days in evaluating the full limits.

PaulDavisThe1st

That's fair in some senses.

But one of the central points of the paper/essay is that embodied AGI requires a world model. If that is true, and if it is true that LLMs simply do not build world models, ever, then "it's early days" doesn't really matter.

Of course, whether either of those claims is true are quite difficult questions to answer; the author spends some effort on them, quite satisfyingly to me (with affirmative answers to both).

chrsw

Before we try to build something as intelligent as a human maybe we should try to build something as intelligent as a starfish, ant or worm? Are we even close to doing that? What about a single neuron?

ar-nelson

I find it interesting that this kind of "animal intelligence" is still so far away, while LLMs have become so good at "human intelligence" (language) that they can reliably pass the Turing Test.

I think that the LLMs we have today aren't so much artificial brains as they are artificial brain organs, like the speech center or vision center of a brain. We'd get closer to AGI if we could incorporate them with the rest of a brain, but we still have no idea how to even begin building, say, a motor cortex.

rhet0rica

You're absolutely right, and reflecting on it is why the article is horribly wrong. Humans are multimodal—they're ensemble models where many functions are highly localized to specific parts of the hardware. Biologically these faculties are "emergent" only in the sense that (a) they evolved through natural selection and (b) they need to be grown and trained in each human to work properly. They're not at all higher-level phenomena emulated within general-purpose neural circuitry. Even Nature thinks that would be absurdly inefficient!

But accelerationists, like Yudkowskites, are always heavily predisposed to believe in exceptionalism—whether it's of their own brains or someone else's—so it's impossible to stop them from making unhinged generalizations. An expert in Pascal's Mugging[1] could make a fortune by preying on their blind spots.

[1]https://en.wikipedia.org/wiki/Pascal's_mugging

nemjack

This is a great analogy, I totally agree!

runarberg

The brain is not a statistical inference machine. In fact humans are terrible at inference. Humans are great a pattern matching and extrapolation (to the extent it produces a number of very noticeable biases). Language and vision is no different.

One of the known biases of the human mind is finding patterns even when there are none. We also compare objects or abstract concept with each other even when the two objects (or concept) have nothing in common. With our human brain we usually compare it to our most advanced consumer technology. Previously this was the telephone, then the digital computer, when I studied psychology we compared our brain to the internet, and now we compare it to large language models. At some future date the comparison to LLMs will sound as silly as the older comparison to telephones does to us.

I actually don‘t believe AGI is possible, we see human intelligence as unique, and if we create anything which approaches it we will simply redefine human intelligence to still be unique. But also I think the quest for AGI is ultimately pointless. We have human brains, we have 8.2 billion of them, why create an artificial version of a something we already have. Telephones, digital computers, the internet, and LLMs are useful for things that the brain is not very good at (well maybe not LLMs; that remains to be seen). Millions of brains can only compute pi to a fraction of the decimal points which a single computer can.

rhet0rica

> We have human brains, we have 8.2 billion of them, why create an artificial version of a something we already have.

To circumvent anti-slavery laws.

_Algernon_

>why create an artificial version of a something we already have

Why build a factory to produce goods more cheaply? Because the rich get richer and become less reliant on the whims of labor. AI is industrialization of knowledge work.

habinero

> while LLMs have become so good at "human intelligence" (language) that they can reliably pass the Turing Test

If the LLM overhype has taught me anything, it's the Turing Test is much easier to pass than expected. If you pick the right set of people, anyway.

Turns out a whole lot of people will gladly Clever Hans themselves.

"LLMs are intelligent" / "AGI is coming" is frankly the tech equivalent of chemtrails and jet fuel/steel beams.

fusionadvocate

So before trying to build a flying machine we should first try to build a machine inspired by non flying birds?

chrsw

Learning architectures come in all shapes, sizes, and forms. This could mean there are fundamental principles of cognition driving all of them, just implemented in different ways. If that's true, one would do well to first understand the extremely simple and go from there.

Building a very simple self-organizing system from first principles is the flying machine. Trying to copy an extremely complex system by generating statistically plausible data is the non-flying bird.

mountainriver

>The “meaning” of a percept is not in the vector it is encoded as, but in the way relevant decoders process this vector into meaningful outputs. As long as various encoders and decoders are subject to modality-specific training objectives, “meaning” will be decentralized and potentially inconsistent across modalities, especially as a result of pre-training. This is not a recipe for the formation of coherent concepts.

This is a bit silly, you can train the encoders end-to-end with the rest of the model and the reason they are separate is we can cache linguistic tokens really easily and put them in an embedding table, you can't do that with images.

itkovian_

I don’t want to bash the guy since he’s still in his phd, but it’s written in such a confident tone for something that is so all over the place that I think it’s fair game.

Like a lot of the symbolic/embodied people, the issue is they don’t have a deep understanding of how the big models work or are trained, so they come to weird conclusions. Like things that aren’t wrong but make you go ‘ok.. but what you trying to say’.

E.g ‘Instead of pre-supposing structure in individual modalities, we should design a setting in which modality-specific processing emerges naturally.’ Seems to lack the understanding that a vision transformer is completely identical for a standard transformer except for the tokenization which is just embedding a grid of patches and adding positional embeddings. Transformers are so general, what he’s asking us to do is exactly what everyone is already doing. Everything is early fusion now too.

“The overall promise of scale maximalism is that a Frankenstein AGI can be sewed together using general models of narrow domains.” No one is suggesting this.. everyone wants to do it end to end, and also thinks that’s the most likely thing to work. Some suggestions like lecuns jepa’s do suggest to induce some structure in the arch, but still the driving force there is to allow gradients to flow everywhere.

For a lot of the other conclusions, the statements are literally almost equivalent to ‘to build agi, we need to first understand how to build agi’. Zero actionable information content.

nemjack

I don't think you're quite right. The author is arguing that images and text should not be processed differently at any point. Current early fusion approaches are close, but they still treat modalities different at the level of tokenization.

If I understand correctly he would advocate for something like rendering text and processing it as if it were an image, along with other natural images.

Also, I would counter and say that there is some actionable information, but its pretty abstract. In terms of uniting modalities he is bullish on tapping human intuition and structuralism, which should give people pointers to actual books for inspiration. In terms of modifying the learning regime, he's suggesting something like an agent-environment RL loop, not a generative model, as a blueprint.

There's definitely stuff to work with here. It's not totally mature, but not at all directionless.

itkovian_

Saying we should tokenize different modalities the same would be analogous to saying that in order to be really smart, a human has to listen with its eyes. At some point there has to be SOME modality specific preprocessing. The thing is in all current sota arch.’s this modality specific preprocessing is very very shallow, almost trivially shallow. I feel this is the peice of information that may be missing for people with this view. In the multimodal models everything is moving to a shared representation very rapidly - that’s clearly already happening.

On the ‘we need to do rl loop rather than a generative model’ point - I’d say this is the consensus position today!

nemjack

For sure, we can't process images the same way that we process sound, but the author argues for processing images and text the same, and text is fundamentally a visual medium of communication. The author makes a good point about how VLMs can still struggle to determine the length of a word, or generate words that start and end with specific letters, etc. which is an indicator that an essential aspect of a modality (its visual aspect) is missing from how it is processed. Surely a unified visual process for text and image would not have such failure points.

I agree that modality specific processing is very shallow at this point, but it still seems not to respect the physicality of the data. Today's modalities are not actually akin to human senses because they should be processed by a different assortment of "sense" organs, e.g. one for things visual, one for things audible, etc.

PoEdict

> Instead of trying to glue modalities together into a patchwork AGI, we should pursue approaches to intelligence that treat embodiment and interaction with the environment as primary, and see modality-centered processing as emergent phenomena.

Right so it’s embodied in a computer and humans are part of its environment that provide emergent experience to the AI to observe.

The author glued modalities together by linking a body (a modal), environment (a modal), emergence (a modal).

How does anything emerge if forces do not collaborate? The effects of gravity and electromagnetism do not act in a vacuum but a reality of stuff.

Poetic exchange may engage some but Maxwell didn’t make electromagnetism “work” until he got rid of the imagined pulleys and levers to foster a metaphor.

Not sure the point being suggested exists except as too bespoke an emergent property of language itself to apply usefully elsewhere.

Transformers came along and revealed a whole lot of theory of consciousness to be useless pulleys and levers. Why is this theory not just more words attempting to instill the existence of non-essential essentials?

skybrian

> A true AGI must be general across all domains.

By that definition, does any general intelligence exist? No human has every talent.

roywiggins

I guess you can treat the human brain as an architecture- one of them can't do everything, but it's a general architecture and you can always make more and train them to do whatever.

An AI that can be copied and trivially trained on any speciality is functionally AGI even if you need an ensemble of 10,000 specialists to cover everything.

exe34

Doesn't chatgpt cover a pretty large percentage already in that case?

null

[deleted]

Glyptodon

There's a lot of evidence that most humans can be raised to have a baseline of understanding and proficiency in most domains. (And anecdotally, many people avoid "difficult" things that they're actually capable of. For example, "bad at learning languages" people will probably still end up learning another language to some degree if stuck where it's the only language spoken.)

bokoharambe

Given enough time one human can learn to do anything any other human can do. There is a general capacity for learning, even if someone will only ever transform a specific portion of that capacity into actual activity in their lifetime.

svachalek

True AGI is as elusive as a true Scotsman.

_steady

this is a glib version of exactly what i agree with. a human doesn't do the maximal efficiency with the inputs its given, but has a high number of varied inputs that allow the multi part of multi-modal to really shine. We're not a true general intelligence because we can't respond to problems in space or time that are too big or too small for us, outside of the right frame of reference or outside the right frequency of visible light for us to respond to. Also, the amount of processing power we can respond with is capped as well. So when we say AGI, we mean a computer that responds to the same set of stimuli as we do, with a repsonse somewhere in the region we can respond back to. I don't see why a robot with AGI would care about sunburn if their robot arms don't care about heat, and i don't see that as any less general as if us and a matis shrimp are talking about different frequencies of photons we can see

im3w1l

Yeah it's well known that humans intelligence is not a homogenous whole that is general across all domains. Rather it consists of many specialized parts with hard coded purposes, that are cobbled together with a thin layer of higher thinking on top.

nexttk

I haven't read it all and must admit that I'm not sure I really understood the parts that I did read. Reading the part under the headline "Why We Need the World, and How LLMs Pretend to Understand It" and the focus on 'next-token-prediction' makes me wonder how seriously to take it. It just seems like another "LLM's are not intelligent, they are merely next token predictors". An argument which in my view is completely invalid and based on a misunderstanding.

The fact that they predict next token is just the "interface" i.e. an LLM has the interface "predictNextToken(String prefix)". It doesn't say how it is implemented. One implementation could be a human brain. Another could be a simple lookup table that looks at the last word and then selects the next from that. Or anything in between. The point is that 'next-token-prediction' does not say anything about implementation and so does not reduce the capabilities even though it is often invoked like that. Just because it is only required to emit the next token (or rather, a probability distribution thereof) it is permitted to think far ahead, and indeed has to if it is to make a good prediction of just the next token. As interpretability research (and common sense) shows, LLM's have a fairly good idea what they are going to say in the many, many next tokens ahead in order that it can make a good prediction for the next immediate tokens. That's why you can have nice, coherent, well-structured, long responses from LLM's. And have probably never seen it get stuck in a dead end where it can't generate a meaningful continuation.

If you are to reason about LLM capabilities never think in terms of "stochastic parrot", "it's just a next token predictor" because it contains exactly zero useful information and will just confuse you.

lsy

I think people hear "next token prediction" and think someone is saying the prediction is simple or linear, and then argue there is a possibility of "intelligence" because the prediction is complex and has some level of indirection or multiple-token-ahead planning baked into the next token.

But the thrust of the critique of next-token prediction or stochastic output is that there isn't "intelligence" because the output is based purely on syntactic relations between words, not on conceptualizing via a world model built through experience, and then using language as an abstraction to describe the world. To the computer there is nothing outside tokens and their interrelations, but for people language is just a tool with which to describe the world with which we expect "intelligences" to cope. Which is what this article is examining.

og_kalu

>But the thrust of the critique of next-token prediction or stochastic output is that there isn't "intelligence" because the output is based purely on syntactic relations between words, not on conceptualizing via a world model built through experience, and then using language as an abstraction to describe the world. To the computer there is nothing outside tokens and their interrelations, but for people language is just a tool with which to describe the world with which we expect "intelligences" to cope. Which is what this article is examining.

LLMs model concepts internally and this has been demonstrated empirically many times over the years, including recently by anthropic (again). Of course, that won't stop people from repeating it ad nauseum.

nemjack

Concepts within modalities are potentially consistent, but the point the author is making is that the same "concept" vector may lead to inconsistent percepts across modalities (e.g. a conflicting image and caption).

yahoozoo

Yes, LLMs often generate coherent, structured, multi-paragraph responses. But this coherence emerges as a side effect of learning statistical patterns in data, not because the model possesses a global plan or explicit internal narrative. There is no deliberative process analogous to human thinking or goal formation. There is no mechanism by which it consciously “decides” to think 50 tokens ahead; instead, it learns to mimic sequences that have those properties in the training data.

Planning and long-range coherence emerge from training on text written by humans who think ahead, not from intrinsic model capabilities. This distinction matters when evaluating whether an LLM is actually reasoning or simply simulating the surface structure of reasoning.

og_kalu

>But this coherence emerges as a side effect of learning statistical patterns in data, not because the model possesses a global plan or explicit internal narrative.

That's not true.

https://www.anthropic.com/research/tracing-thoughts-language...

K0balt

I’ve long thought that embodiment was a critical prerequisite for the development of something that humans would identify as “real” AGI.

Humans are notoriously bad at recognizing intelligence even in animals that are clearly sentient, have language, name their young, and clearly share the realm of thinking creatures with the apes.

This is largely due to the lack of shared experiences that we can easily understand and relate to. Until an intelligence is rooted in the physical realm where we fundamentally exist, we are unlikely to really be able to recognize its existence as truly “intelligent”.

pizza

That's why I'm very excited for the potential metaphysical ramifications of DolphinGemma

groby_b

The question you'll need to answer is "why". What does embodiment provide that is recognizable as intelligence.

As for "not able to recognize", it's also worth keeping in mind that LLMs by now regularly pass the Turing test. More, they are more likely to be recognized as humans than humans participating as control.

habinero

Yeah, but it turns out the Turing Test isn't all that hard to pass if you pick the right kind of people.

groby_b

Is there an argument outside of a quip you'd like to present? These tests have replicated pretty well.

Are you really trying to make the point that there's a collective effort to defraud here?

HN

AGI is not multimodal

AGI is not multimodal