LLMs aren't world models

119 comments

·August 10, 2025

bithive123

Language models aren't world models for the same reason languages aren't world models.

Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".

They only have meaning to sentient beings, and that meaning is heavily subjective and contextual.

But there appear to be some who think that we can grasp truth through mechanical symbol manipulation. Perhaps we just need to add a few million more symbols, they think.

If we accept the incompleteness theorem, then there are true propositions that even a super-intelligent AGI would not be able to express, because all it can do is output a series of placeholders. Not to mention the obvious fallacy of knowing super-intelligence when we see it. Can you write a test suite for it?

jandrewrogers

There is an important implication of learning and indexing being equivalent problems. A number of important data models and data domains exist for which we do not know how to build scalable indexing algorithms and data structures.

It has been noted for several years in US national labs and elsewhere that there is an almost perfect overlap between data models LLMs are poor at learning and data models that we struggle to index at scale. If LLMs were actually good at these things then there would be a straightforward path to addressing these longstanding non-AI computer science problems.

The incompleteness is that the LLM tech literally can't represent elementary things that are important enough that we spend a lot of money trying to represent them on computers for non-AI purposes. A super-intelligent AGI being right around the corner implies that we've solved these problems that we clearly haven't solved.

Perhaps more interesting, it also implies that AGI tech may look significantly different than the current LLM tech stack.

habitue

> Symbols, by definition, only represent a thing.

This is missing the lesson of the Yoneda Lemma: symbols are uniquely identified by their relationships with other symbols. If those relationships are represented in text, then in principle they can be inferred and navigated by an LLM.

Some relationships are not represented well in text: tacit knowledge like how hard to twist a bottle cap to get it to come off, etc. We aren't capturing those relationships between all your individual muscles and your brain well in language, so an LLM will miss them or have very approximate versions of them, but... that's always been the problem with tacit knowledge: it's the exact kind of knowledge that's hard to communicate!

nomel

I don’t think it’s a communication problem as much as there is no possible relation between a word and a (literal) physical experiences. They’re, quite literally, on different planes of existence.

drdeca

When I have a physical experience, sometimes it results in me saying a word.

Now, maybe there are other possible experiences that would result in me behaving identically, such that from my behavior (including what words I say) it is impossible to distinguish between different potential experiences I could have had.

But, “caused me to say” is a relation, is it not?

Unless you want to say that it wasn’t the experience that caused me to do something, but some physical thing that went along with the experience, either causing or co-occurring with the experience, and also causing me to say the word I said. But, that would still be a relation, I think.

semiquaver

Well shit, I better stop reading books then.

copypaper

Reminds me of this [1] article. If us humans, after all these years we've been around, can't relay our thoughts exactly as we perceive them in our heads, what makes us think that we can make a model that does it better than us?

[1]: https://www.experimental-history.com/p/you-cant-reach-the-br...

pron

> Symbols, by definition, only represent a thing. They are not the same as the thing

First of all, the point isn't about the map becoming the territory, but about whether LLMs can form a map that's similar to the map in our brains.

But to your philosophical point, assuming there are only a finite number of things and places in the universe - or at least the part of which we care about - why wouldn't they be representable with a finite set of symbols?

What you're rejecting is the Church-Turing thesis [1] (essentially, that all mechanical processes, including that of nature, can be simulated with symbolic computation, although there are weaker and stronger variants). It's okay to reject it, but you should know that not many people do (even some non-orthodox thoughts by Penrose about the brain not being simulatable by an ordinary digital computer still accept that some physical machine - the brain - is able to represent what we're interested in).

> If we accept the incompleteness theorem

There is no if there. It's a theorem. But it's completely irrelevant. It means that there are mathematical propositions that can't be proven or disproven by some system of logic, i.e. by some mechanical means. But if something is in the universe, then it's already been proven by some mechanical process: the mechanics of nature. That means that if some finite set of symbols could represent the laws of nature, then anything in nature can be proven in that logical system. Which brings us back to the first point: the only way the mechanics of nature cannot be represented by symbols is if they are somehow infinite, i.e. they don't follow some finite set of laws. In other words - there is no physics. Now, that may be true, but if that's the case, then AI is the least of our worries.

Of course, if physics does exist - i.e. the universe is governed by a finite set of laws - that doesn't mean that we can predict the future, as that would entail both measuring things precisely and simulating them faster than their operation in nature, and both of these things are... difficult.

[1]: https://plato.stanford.edu/entries/church-turing/

astrange

> First of all, the point isn't about the map becoming the territory, but about whether LLMs can form a map that's similar to the map in our brains.

It should be capable of something similar (fsvo similar), but the largest difference is that humans have to be power-efficient and LLMs do not.

That is, people don't actually have world models, because modeling something is a waste of time and energy insofar as it's not needed for anything. People are capable of taking out the trash without knowing what's in the garbage bag.

null

[deleted]

Terr_

> Of course, if physics does exist - i.e. the universe is governed by a finite set of laws

Wouldn't physics still "exist" even if there were an infinite set of laws?

pron

Well, the physical universe will still exist, but I don't think that physics - the scientific study of said universe - will become sort of meaningless, I would think?

goatlover

> course, if physics does exist - i.e. the universe is governed by a finite set of laws

That statement is problematic. It implies a metaphysical set of laws that make physical stuff relate a certain way.

The Humean way of looking at physics is that we notice relationships and model those with various symbols. They symbols form incomplete models because we can't get to the bottom of why the relationships exist.

> that doesn't mean that we can predict the future, as that would entail both measuring things precisely and simulating them faster than their operation in nature, and both of these things are... difficult.

The indeterminism of Quantum Mechanics limits how how precise measure can be and how predictable the future is.

pron

> That statement is problematic. It implies a metaphysical set of laws that make physical stuff relate a certain way.

What I meant was that since physics is the scientific search for the laws of nature, then if there's an infinite number of them, then the pursuit becomes somewhat meaningless, as an infinite number of laws aren't really laws at all.

> They symbols form incomplete models because we can't get to the bottom of why the relationships exist.

Why would a model be incomplete if we don't know why the laws are what they are? A model pretty much is a set of laws; it doesn't require an explanation (we may want such an explanation, but it doesn't improve the model).

cognitif

> Language models aren't world models for the same reason languages aren't world models. Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".

Symbols, maps, descriptions, and words are useful precisely because they are NOT what they represent. Representation is not identity. What else could a “world model” be other than a representation? Aren’t all models representations, by definition? What exactly do you think a world model is, if not something expressible in language?

mrbungie

> Aren’t all models representations, by definition? What exactly do you think a world model is, if not something expressible in language?

I was following the string of questions, but I think there is a logical leap between those two questions.

Another question: is Language the only way to define models? An imagined sound or an imagined picture of an apple in my minds-eye are models to me, but they don't use language.

energy123

Everything is just a low resolution representation of a thing. The so-called reality we supposedly have access to is at best a small number of sound waves and photons hitting our face. So I don't buy this argument that symbols are categorically different. It's a gradient and symbols are more sparse and less rich of a data source, yes. But who are we to say where that hypothetical line exists, beyond which further compression of concepts into smaller numbers of buckets becomes a non-starter for intelligence and world modelling. And then there's multi modal LLMs which have access to data of a similar richness that humans have access to.

bithive123

There are no "things" in the universe. You say this wave and that photon exist and represent this or that, but all of that is conceptual overlay. Objects are parts of speech, reality is undifferentiated quanta. Can you point to a particular place where the ocean becomes a particular wave? Your comment already implies an understanding that our mind is behind all the hypothetical lines; we impose them, they aren't actually there.

auggierose

First: true propositions (that are not provable) can definitely be expressed, if they couldn't, the incompleteness theorem would not be true ;-)

It would be interesting to know what the percentage of people is, who invoke the incompleteness theorem, and have no clue what it actually says.

Most people don't even know what a proof is, so that cannot be a hindrance on the path to AGI ...

Second: ANY world model that can be digitally represented would be subject to the same argument (if stated correctly), not only LLMs.

bithive123

I knew someone would call me out on that. I used the wrong word; what I meant was "expressed in a way that would satisfy" which implies proof within the symbolic order being used. I don't claim to be a mathematician or philosopher.

auggierose

Well, you don't get it. The LLM definitely can state propositions "that satisfy", let's just call them true propositions, and that this is not the same as having a proof for it is what the incompleteness theorem says.

Why would you require an LLM to have proof for the things it says? I mean, that would be nice, and I am actually working on that, but it is not anything we would require of humans and/or HN commenters, would we?

drdeca

Gödel’s incompleteness theorems aren’t particularly relevant here. Given how often people attempt to apply them to situations where they don’t say anything of note, I think the default should generally be to not publicly appeal to them unless one either has worked out semi-carefully how to derive the thing one wants to show from them, or at least have a sketch that one is confident, from prior experience working with it, that one could make into a rigorous argument. Absent these, the most one should say, I think, is “Perhaps one can use Gödel’s incompleteness theorems to show [thing one wants to show].” .

Now, given a program that is supposed to output text that encodes true statements (in some language), one can probably define some sort of inference system that corresponds to the program such that the inference system is considered to “prove” any sentence that the program outputs (and maybe also some others based on some logical principles, to ensure that the inference system satisfies some good properties), and upon defining this, one could (assuming the language allows making the right kinds of statements about arithmetic) show that this inference system is, by Gödel’s theorems, either inconsistent or incomplete.

This wouldn’t mean that the language was unable to express those statements. It would mean that the program either wouldn’t output those statements, or that the system constructed from the program was inconsistent (and, depending on how the inference system is obtained from the program, the inference system being inconsistent would likely imply that the program sometimes outputs false or contradictory statements).

But, this has basically nothing to do with the “placeholders” thing you said. Gödel’s theorem doesn’t say that some propositions are inexpressible in a given language, but that some propositions can’t be proven in certain axiom+inference systems.

Rather than the incompleteness theorems, the “undefinability of truth” result seems more relevant to the kind of point I think you are trying to make.

Still, I don’t think it will show what you want it to, even if the thing you are trying to show is true. Like, perhaps it is impossible to capture qualia with language, sure, makes sense. But logic cannot show that there are things which language cannot in any way (even collectively) refer to, because to show that there is a thing it has to refer to it.

————

“Can you write a test suite for it?”

Hm, might depend on what you count as a “suite”, but a test protocol, sure. The one I have in mind would probably be a bit expensive to run if it fails the test though (because it involves offering prize money).

ameliaquining

One thing I appreciated about this post, unlike a lot of AI-skeptic posts, is that it actually makes a concrete falsifiable prediction; specifically, "LLMs will never manage to deal with large code bases 'autonomously'". So in the future we can look back and see whether it was right.

For my part, I'd give 80% confidence that LLMs will be able to do this within two years, without fundamental architectural changes.

moduspol

"Deal with" and "autonomously" are doing a lot of heavy lifting there. Cursor already does a pretty good job indexing all the files in a code base in a way that lets it ask questions and get answers pretty quickly. It's just a matter of where you set the goalposts.

jononor

"LLM" as well, because coding agents are already more than just an LLM. There is very useful context management around it, and tool calling, and ability to run tests/programs, etc. Though they are LLM-based systems, they are not LLMs.

smnrchrds

Indeed. If the LLM calls a chess engine tool behind the scenes, it would be able to play excellent chess as well.

ameliaquining

True, there'd be a need to operationalize these things a bit more than is done in the post to have a good advance prediction.

shinycode

« autonomously » what happens when subtle updates that are not bugs but change the meaning of some features that might break the workflow on some other external parts of a client’s system ? It happens all the time and, because it’s really hard to have the whole meaning and business rules written and maintained up to date, an LLM might never be able to grasp some meaning. Maybe if instead of developing code and infrastructures, the whole industry shifts toward only writing impossibly precise spec sheets that make meaning and intent crystal clear then, maybe « autonomously » might be possible to pull off

wizzwizz4

Those spec sheets exist: they're called software.

shinycode

Not exactly. It depends how software is written and if there is ADRs in the project. I had to work on projects where there was bugs because someone coded business rules in a very bad and unclear way. You move an if somewhere and something breaks somewhere else. You ask « is this condition the way it’s supposed to work or is it a bug » when software is not clear enough - and often it isn’t because we have to go fast - we ask people to confirm the rule. My point is this, amazingly written software surely works best with LLMs. That’s not the most software written for now because businesses value speed over engineering sometimes (or it’s lack of skills)

exe34

How large? What does "deal" mean here? Autonomously - is that on its own whim, or at the behest of a user?

slt2021

>LLMs will never manage to deal

time to prove hypothesis: infinity years

null

[deleted]

libraryofbabel

This essay could probably benefit from some engagement with the literature on “interpretability” in LLMs, including the empirical results about how knowledge (like addition) is represented inside the neural network. To be blunt, I’m not sure being smart and reasoning from first principles after asking the LLM a lot of questions and cherry picking what it gets wrong gets to any novel insights at this point. And it already feels a little out date, with LLMs getting gold on the mathematical Olympiad they clearly have a pretty good world model of mathematics. I don’t think cherry-picking a failure to prove 2 + 2 = 4 in the particular specific way the writer wanted to see disproves that at all.

LLMs have imperfect world models, sure. (So do humans.) That’s because they are trained to be generalists and because their internal representations of things are massively compressed single they don’t have enough weights to encode everything. I don’t think this means there are some natural limits to what they can do.

yosefk

Your being blunt is actually very kind, if you're describing what I'm doing as "being smart and reasoning from first principles"; and I agree that I am not saying something very novel, at most it's slightly contrarian given the current sentiment.

My goal is not to cherry-pick failures for its own sake as much as to try to explain why I get pretty bad output from LLMs much of the time, which I do. They are also very useful to me at times.

Let's see how my predictions hold up; I have made enough to look very wrong if they don't.

Regarding "failure disproving success": it can't, but it can disprove a theory of how this success is achieved. And, I have much better examples than the 2+2=4, which I am citing as something that sorta works these says

WillPostForFood

Your LLM output seems abnormally bad, like you are using old models, bad models, or intentionally poor prompting. I just copied and pasted your Krita example into ChatGPT, and reasonable answer, nothing like what you paraphrased in your post.

https://imgur.com/a/O9CjiJY

typpilol

This seems like a common theme with these types of articles

marcellus23

I think it's hard to take any LLM criticism seriously if they don't even specify which model they used. Saying "an LLM model" is totally useless for deriving any kind of conclusion.

libraryofbabel

I mean yeah, it’s a good essay in that it made me think and try to articulate the gaps, and I’m always looking to read things that push back on AI hype. I usually just skip over the hype blogging.

I think my biggest complaint is that the essay points out flaws in LLM’s world models (totally valid, they do confidently get things wrong and hallucinate in ways that are different, and often more frustrating, from how humans get things wrong) but then it jumps to claiming that there is some fundamental limitation about LLMs that prevents them from forming workable world models. In particular, it strays a bit towards the “they’re just stochastic parrots” critique, e.g. “that just shows the LLM knows to put the words explaining it after the words asking the question.” That just doesn’t seem to hold up in the face of e.g. LLMs getting gold on the Mathematical Olympiad, which features novel questions. If that isn’t a world model of mathematics - being able to apply learned techniques to challenging new questions - then I don’t know what is.

A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.

I do think there’s a lot that the essay gets right. If I was to recast it, I’d put it something like this:

* LLMs have imperfect models of the world which is conditioned by how they’re trained on next token prediction.

* We’ve shown we can drastically improve those world models for particular tasks by reinforcement learning. you kind of allude to this already by talking about how they’ve been “flogged” to be good at math.

* I would claim that there’s no particular reason these RL techniques aren’t extensible in principle to beat all sorts of benchmarks that might look unrealistic now. (Two years ago it would have been an extreme optimist position to say an LLM could get gold on the mathematical Olympiad, and most LLM skeptics would probably have said it could never happen.)

* Of course it’s very expensive, so most world models LLMs have won’t get the RL treatment and so will be full of gaps, especially for things that aren’t amenable to RL. It’s good to beware of this.

I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. (If an LLM was given tons of RL training on that codebase, it could build a better world model, but that’s expensive and very challenging to set up.) This problem is hinted at in your essay, but the lack of on-the-job learning isn’t centered. But it’s the real elephant in the room with LLMs and the one the boosters don’t really have an answer to.

Anyway thanks for writing this and responding!

yosefk

I'm not saying that LLMs can't learn about the world - I even mention how they obviously do it, even at the learned embeddings level. I'm saying that they're not compelled by their training objective to learn about the world and in many cases they clearly don't, and I don't see how to characterize the opposite cases in a more useful way than "happy accidents."

I don't really know how they are made "good at math," and I'm not that good at math myself. With code I have a better gut feeling of the limitations. I do think that you could throw them off terribly with unusual math quastions to show that what they learned isn't math, but I'm not the guy to do it; my examples are about chess and programming where I am more qualified to do it. (You could say that my question about the associativity of blending and how caching works sort of shows that it can't use the concept of associativity in novel situations; not sure if this can be called an illustration of its weakness at math)

AyyEye

With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever. That addition (something which only takes a few gates in digital logic) happens to be overfit into a few nodes on multi-billion node networks is hardly a surprise to anyone except the most religious of AI believers.

BobbyJo

The core issue there isn't that the LLM isn't building internal models to represent its world, it's that its world is limited to tokens. Anything not represented in tokens, or token relationships, can't be modeled by the LLM, by definition.

It's like asking a blind person to count the number of colors on a car. They can give it a go and assume glass, tires, and metal are different colors as there is likely a correlation they can draw from feeling them or discussing them. That's the best they can do though as they can't actually perceive color.

In this case, the LLM can't see letters, so asking it to count them causes it to try and draw from some proxy of that information. If it doesn't have an accurate one, then bam, strawberry has two r's.

I think a good example of LLMs building models internally is this: https://rohinmanvi.github.io/GeoLLM/

LLMs are able to encode geospatial relationships because they can be represented by token relationships well. Teo countries that are close together will be talked about together much more often than two countries far from each other.

xigoi

> It's like asking a blind person to count the number of colors on a car.

I presume if I asked a blind person to count the colors on a car, they would reply “sorry, I am blind, so I can’t answer this question”.

vrighter

That is just not a solid argument. There are countless examples of LLMs splitting "blueberry" into "b l u e b e r r y", which would contain one token per letter. And then they still manage to get it wrong.

Your argument is based on a flawed assumption, that they can't see letters. If they didn't they wouldn't be able to spell the word out. But they do. And when they do get one token per letter, they still miscount.

williamcotton

I don’t solve math problems with my poetry writing skills:

https://chatgpt.com/share/689ba837-8ae0-8013-96d2-7484088f27...

andyjohnson0

> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.

Is this a real defect, or some historical thing?

I just asked GPT-5:

    How many "B"s in "blueberry"?

and it replied:

    There are 2 — the letter b appears twice in "blueberry".

I also asked it how many Rs in Carrot, and how many Ps in Pineapple, amd it answered both questions correctly too.

libraryofbabel

It’s a historical thing that people still falsely claim is true, bizarrely without trying it on the latest models. As you found, leading LLMs don’t have a problem with it anymore.

ThrowawayR2

It was discussed and reproduced on GPT-5 on HN couple of days ago: https://news.ycombinator.com/item?id=44832908

Sibling poster is probably mistakenly thinking of the strawberry issue from 2024 on older LLM models.

bgwalter

It is not historical:

https://kieranhealy.org/blog/archives/2025/08/07/blueberry-h...

Perhaps they have a hot fix that special cases HN complaints?

nosioptar

Shouldn't the correct answer be that there is not a "B" in "blueberry"?

yosefk

Actually I forgive them those issues that stem from tokenization. I used to make fun at them for listing datum as a noun whose plural form ends with an i, but once I learned about how tokenization works, I no longer do it - it feels like mocking a person's intelligence because of a speech impediment or something... I am very kind to these things, I think

astrange

Tokenization makes things harder, but it doesn't make them impossible. Just takes a bit more memorization.

Other writing systems come with "tokenization" built in making it still a live issue. Think of answering:

1. How many n's are in 日本?

2. How many ん's are in 日本?

(Answers are 2 and 1.)

libraryofbabel

> they clearly don't have any world model whatsoever

Then how did an LLM get gold on the mathematical Olympiad, where it certainly hadn’t seen the questions before? How on earth is that possible without a decent working model of mathematics? Sure, LLMs might make weird errors sometimes (nobody is denying that), but clearly the story is rather more complicated than you suggest.

simiones

> where it certainly hadn’t seen the questions before?

What are you basing this certainty on?

And even if you're right that the specific questions had not come up, it may still be that the questions from the math olympiad were rehashes of similar questions in other texts, or happened to correspond well to a composition of some other problems that were part of the training set, such that the LLM could 'pick up' on the similarity.

It's also possible that the LLM was specifically trained on similar problems, or may even have a dedicated sub-net or tool for it. Still impressive, but possibly not in a way that generalizes even to math like one might think based on the press releases.

lossolo

https://arxiv.org/abs/2508.01191

armchairhacker

Any suggestions from this literature?

libraryofbabel

The papers from Anthropic on interpretability are pretty good. They look at how certain concepts are encoded within the LLM.

keeda

That whole bit about color blending and transparency and LLMs "not knowing colors" is hard to believe. I am literally using LLMs every day to write image-processing and computer vision code using OpenCV. It seamlessly reasons across a range of concepts like color spaces, resolution, compression artifacts, filtering, segmentation and human perception. I mean, removing the alpha from a PNG image was a preprocessing step it wrote by itself as part of a larger task I had given it, so it certainly understands transparency.

I even often describe the results e.g. "this fails when in X manner when the image has grainy regions" and it figures out what is going on, and adapts the code accordingly. (It works with uploading actual images too, but those consume a lot of tokens!)

And all this in a rather niche domain that seems relatively less explored. The images I'm working with are rather small and low-resolution, which most literature does not seem to contemplate much. It uses standard techniques well known in the art, but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.

If it can reason about images and vision and write working code for niche problems I throw at it, whether it "knows" colors in the human sense is a purely philosophical question.

geraneum

> it wrote by itself as part of a larger task I had given it, so it certainly understands transparency

Or it’s a common step or a known pattern or combination of steps that is prevalent in its training data for certain input. I’m guessing you don’t know what’s exactly in the training sets. I don’t know either. They don’t tell ;)

> but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.

We tend to overestimate the novelty of our own work and our methods and at the same time, underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers. It doesn’t mean what you are doing specifically is done in this exact way before, rather the patterns adapted and the approach may not be one of their kind.

> is a purely philosophical question

It is indeed. A question we need to ask ourselves.

Uehreka

> We tend to overestimate the novelty of our own work and our methods and at the same time, underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers.

If LLMs are stochastic parrots, but also we’re just stochastic parrots, then what does it matter? That would mean that LLMs are in fact useful for many things (which is what I care about far more than any abstract discussion of free will).

DennisP

Maybe pure language models aren't world models, but Genie 3 for example seems to be a pretty good world model:

https://deepmind.google/discover/blog/genie-3-a-new-frontier...

We also have multimodal AIs that can do both language and video. Genie 3 made multimodal with language might be pretty impressive.

Focusing only on what pure language models can do is a bit of a straw man at this point.

frankfrank13

Great quote at the end that I think I resonate a lot with:

> Feeding these algorithms gobs of data is another example of how an approach that must be fundamentally incorrect at least in some sense, as evidenced by how data-hungry it is, can be taken very far by engineering efforts — as long as something is useful enough to fund such efforts and isn’t outcompeted by a new idea, it can persist.

ej88

This article is interesting but pretty shallow.

0(?): there’s no provided definition of what a ‘world model’ is. Is it playing chess? Is it remembering facts like how computers use math to blend Colors? If so, then ChatGPT: https://chatgpt.com/s/t_6898fe6178b88191a138fba8824c1a2c has a world model right?

1. The author seems to conflate context windows with failing to model the world in the chess example. I challenge them to ask a SOTA model with an image of a chess board or notation and ask it about the position. It might not give you GM level analysis but it definitely has a model of what’s going on.

2. Without explaining which LLM they used or sharing the chats these examples are just not valuable. The larger and better the model, the better its internal representation of the world.

You can try it yourself. Come up with some question involving interacting with the world and / or physics and ask GPT-5 Thinking. It’s got a pretty good understanding of how things work!

https://chatgpt.com/s/t_689903b03e6c8191b7ce1b85b1698358

yosefk

A "world model" depends on the context which defines which world the problem is in. For chess, which moves are legal and needing to know where the pieces are to make legal moves are parts of the world model. For alpha blending, it being a mathematical operation and the visibility of a background given the transparency of the foreground are parts of the world model.

The examples are from all the major commercial American LLMs as listed in a sister comment.

You seem to conflate context windows with tracking chess pieces. The context windows are more than large enough to remember 10 moves. The model should either track the pieces, or mention that it would be playing blindfold chess absent a board to look at and it isn't good at this, so could you please list the position after every move to make it fair, or it doesn't know what it's doing; it's demonstrably the latter.

o_nate

What with this and your previous post about why sometimes incompetent management leads to better outcomes, you are quickly becoming one of my favorite tech bloggers. Perhaps I enjoyed the piece so much because your conclusions basically track mine. (I'm a software developer who has dabbled with LLMs, and has some hand-wavey background on how they work, but otherwise can claim no special knowledge.) Also your writing style really pops. No one would accuse your post of having been generated by an LLM.

yosefk

thank you for your kind words!

deadbabe

Don’t: use LLMs to play chess against you

Do: use LLMs to talk shit to you while a real chess AI plays chess against you.

The above applies to a lot of things besides chess, and illustrates a proper application of LLMs.

Seb-C

Are you suggesting that we use an LLM as an interface between the AI and the player?

Why would anyone choose to awkwardly play using natural language rather than a reliable, fast and intuitive UI?

skeledrew

Agree in general with most of the points, except

> but because I know you and I get by with less.

Actually we got far more data and training than any LLM. We've been gathering and processing sensory data every second at least since birth (more processing than gathering when asleep), and are only really considered fully intelligent in our late teens to mid-20s.

helloplanets

Don't forget the millions of years of pre-training! ;)

lordnacho

Here's what LLMs remind me of.

When I went to uni, we had tutorials several times a week. Two students, one professor, going over whatever was being studied that week. The professor would ask insightful questions, and the students would try to answer.

Sometimes, I would answer a question correctly without actually understanding what I was saying. I would be spewing out something that I had read somewhere in the huge pile of books, and it would be a sentence, with certain special words in it, that the professor would accept as an answer.

But I would sometimes have this weird feeling of "hmm I actually don't get it" regardless. This is kinda what the tutorial is for, though. With a bit more prodding, the prof will ask something that you genuinely cannot produce a suitable word salad for, and you would be found out.

In math-type tutorials it would be things like realizing some equation was useful for finding an answer without having a clue about what the equation actually represented.

In economics tutorials it would be spewing out words about inflation or growth or some particular author but then having nothing to back up the intuition.

This is what I suspect LLMs do. They can often be very useful to someone who actually has the models in their minds, but not the data to hand. You may have forgotten the supporting evidence for some position, or you might have missed some piece of the argument due to imperfect memory. In these cases, LLM is fantastic as it just glues together plausible related words for you to examine.

The wheels come off when you're not an expert. Everything it says will sound plausible. When you challenge it, it just apologizes and pretends to correct itself.

roywiggins

> When you challenge it, it just apologizes and pretends to correct itself.

Even when it was right the first time!

imenani

As far as I can tell they don’t say which LLM they used which is kind of a shame as there is a huge range of capabilities even in newly released LLMs (e.g. reasoning vs not).

yosefk

ChatGPT, Claude, Grok and Google AI Overviews, whatever powers the latter, were all used in one or more of these examples, in various configurations. I think they can perform differently, and I often try more than one when the 1st try doesn't work great. I don't think there's any fundamental difference in the principle of their operation, and I think there never will be - there will be another major breakthrough

imenani

Each of these models has a thinking/reasoning variant and a default non-thinking variant. I would expect the reasoning variants (o3 or “GPT5 Thinking”, Gemini DeepThink, Claude with Extended Thinking, etc) to do better at this. I think there is also some chance that in their reasoning traces they may display something you might see as closer to world modelling. In particular, you might find them explicitly tracking positions of pieces and checking validity.

red75prime

My hypothesis is that a model fails to switch into a deep thinking mode (if it has it) and blurts whatever it got from all the internet data during autoregressive training. I tested it with alpha-blending example. Gemini 2.5 flash - fails, Gemini 2.5 pro - succeeds.

How presence/absence of a world model, er, blends into all this? I guess "having a consistent world model at all times" is an incorrect description of humans, too. We seem to have it because we have mechanisms to notice errors, correct errors, remember the results, and use the results when similar situations arise, while slowly updating intuitions about the world to incorporate changes.

The current models lack "remember/use/update" parts.

red75prime

> I don't think there's any fundamental difference in the principle of their operation

Yeah, they seem to be a subject to the universal approximation theorem (it needs to be checked more thoroughly, but I think we can build a transformer that is equivalent to any given fully-connected multilayered network).

That is at a certain size they can do anything a human can do at a certain point in their life (that is with no additional training) regardless of whether humans have world models and what those model are on the neuronal level.

But there are additional nuances that are related to their architectures and training regimes. And practical questions of the required size.

lowsong

It doesn't matter. These limitations are fundamental to LLMs, so all of them that will ever be made suffer from these problems.