An LLM is a lossy encyclopedia

276 comments

·August 29, 2025

quincepie

I totally agree with the author. Sadly, I feel like that's not how the majority of LLM users view LLMs. And it's definitely not how AI companies market them.

> The key thing is to develop an intuition for questions it can usefully answer vs questions that are at a level of detail where the lossiness matters

the problem is that in order to develop an intuition for questions that LLMs can answer, the user will at least need to know something about the topic beforehand. I believe it's this lack of initial understanding on the user's side that can lead to taking LLM output as factual. If one side of the exchange knows nothing about the subject, the other side can use jargon and even present random or lossy facts that are almost guaranteed to impress.

> The way to solve this particular problem is to make a correct example available to it.

My question is how much effort it would take to make a correct example available to the LLM before it can output quality, useful data. If the effort I put in is more than what I would get in return, then I feel like it's best to write and reason through it myself.

cj

> the user will at least need to know something about the topic beforehand.

I used ChatGPT 5 over the weekend to double check dosing guidelines for a specific medication. "Provide dosage guidelines for medication [insert here]"

It spit back dosing guidelines that were an order of magnitude wrong (suggested 100mcg instead of 1mg). When I saw 100mcg, I was suspicious and said "I don't think that's right" and it quickly corrected itself and provided the correct dosing guidelines.

These are the kind of innocent errors that can be dangerous if users trust it blindly.

The main challenge is that LLMs aren't able to gauge confidence in their answers, so they can't adjust how confidently they communicate information back to you. It's like compressing a photo, and the photographer wrongly saying "here's the best quality image I have!" - do you trust the photographer at their word, or do you challenge them to find a better quality image?

zehaeva

What if you had told it again that you don't think that's right? Would it have stuck to its guns and said "oh, no, I am right here", or would it have backed down, said "Oh, silly me, you're right, here's the real dosage!", and given you something wrong again?

I do agree that to get full usage out of an LLM you should have some familiarity with what you're asking about. If you didn't already have a sense of what the dosage should be, why wouldn't 100mcg look right?

cj

I replied in the same thread, "Are you sure? That sounds like a low dose." It stuck to the (correct) recommendation in the 2nd response, but added a few use cases for higher doses. So it seems like it stuck to its guns for the most part.

For things like this, it would definitely be better for it to act more like a search engine and direct me to trustworthy sources for the information rather than try to provide the information directly.

blehn

Perhaps the absolute worst use-case for an LLM

QuantumGood

With search and references, and without them, you're dealing with two different tools. They're supposed to be closer to the same thing, but they're not. That isn't to say references guarantee correctness, but in my experience accuracy is better, and seeing unexpected references is helpful when confirming.

SV_BubbleTime

LANGUAGE model, not FACT model.

cantor_S_drug

I gave an LLM a list of Python packages and asked it to give me their respective licenses. Obviously it got some of them wrong. I had to manually check each package's PyPI page.
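For that kind of task, skipping the model's recall and querying PyPI's JSON API directly is more reliable. A minimal sketch (the package list is just an example):

    import json
    import urllib.request

    packages = ["requests", "numpy", "flask"]  # illustrative list
    for name in packages:
        url = f"https://pypi.org/pypi/{name}/json"
        with urllib.request.urlopen(url) as resp:
            info = json.load(resp)["info"]
        # Some projects leave the "license" field blank and rely on classifiers.
        license_field = info.get("license") or "(not set)"
        classifiers = [c for c in info.get("classifiers", []) if c.startswith("License ::")]
        print(f"{name}: {license_field} {classifiers}")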

christkv

I find if I force thinking mode and then force it to search the web it’s much better.

ARandumGuy

But at that point wouldn't it be easier to just search the web yourself? Obviously that has its pitfalls too, but I don't see how adding an LLM middleman adds any benefit.

cj

Agree, I usually force thinking mode too. I actually like the "Thinking mini" option that was just released recently, good middle ground between getting an instant answer and waiting 1-2 minutes.

dncornholio

Using an LLM for medical research is just as dangerous as Googling it. Always ask your doctors!

el_benhameen

I don’t disagree that you should use your doctor as your primary source for medical decision making, but I also think this is kind of an unrealistic take. I should also say that I’m not an AI hype bro. I think we’re a long ways off from true functional AGI and robot doctors.

I have good insurance and have a primary care doctor with whom I have good rapport. But I can’t talk to her every time I have a medical question—it can take weeks to just get a phone call! If I manage to get an appointment, it’s a 15 minute slot, and I have to try to remember all of the relevant info as we speed through possible diagnoses.

Using an llm not for diagnosis but to shape my knowledge means that my questions are better and more pointed, and I have a baseline understanding of the terminology. They’ll steer you wrong on the fine points, but they’ll also steer you _right_ on the general stuff in a way that Dr. Google doesn’t.

One other anecdote. My daughter went to the ER earlier this year with some concerning symptoms. The first panel of doctors dismissed it as normal childhood stuff and sent her home. It took 24 hours, a second visit, and an ambulance ride to a children’s hospital to get to the real cause. Meanwhile, I gave a comprehensive description of her symptoms and history to an llm to try to get a handle on what I should be asking the doctors, and it gave me some possible diagnoses—including a very rare one that turned out to be the cause. (Kid is doing great now). I’m still gonna take my kids to the doctor when they’re sick, of course, but I’m also going to use whatever tools I can to get a better sense of how to manage our health and how to interact with the medical system.

yojo

This is the terrifying part: doctors do this too! I have an MD friend that told me she uses ChatGPT to retrieve dosing info. I asked her to please, please not do that.

jrm4

Almost certainly more dangerous, I would think, precisely because of magnitude errors.

The ol' "What weighs more, a pound of feathers or two pounds of bricks" trick explains this perfectly to me.

djrj477dhsnv

I disagree. I'd wager that state of the art LLMs can beat the average doctor at diagnosis given a detailed list of symptoms, especially for conditions the doctor doesn't see on a regular basis.

yujzgzc

Plot twist, your doctor is looking it up on WebMD themselves

gmac

Not really: it's arguably quite a lot worse. Because you can judge the trustworthiness of the source when you follow a link from Google (e.g. I will place quite a lot of faith in pages at an .nhs.uk URL), but nobody knows exactly how that specific LLM response got generated.

giancarlostoro

> the user will at least need to know something about the topic beforehand.

This is why I've said a few times here on HN and elsewhere, if you're using an LLM you need to think of yourself as an architect guiding a Junior to Mid Level developer. Juniors can do amazing things, they can also goof up hard. What's really funny is you can make them audit their own code in a new context window, and give you a detailed answer as to why that code is awful.

I use it mostly on personal projects especially since I can prototype quickly as needed.

skydhash

> if you're using an LLM you need to think of yourself as an architect guiding a Junior to Mid Level developer.

The thing is, coding can (and should) be part of the design process. Many times I thought I had a good idea of what the solution should look like, then while coding I got more exposure to the libraries and other parts of the code, which led me to a more refined approach. That exposure is what you will miss, and it will quickly result in unfamiliar code.

giancarlostoro

I agree. I mostly use it for scaffolding, I don't like letting it do all the work for me.

netcan

>the problem is that in order to develop an intuition for questions that LLMs can answer, the user will at least need to know something about the topic beforehand. I believe that this lack of initial understanding of the user input

I think there's a parallel here with the internet as an information source. It delivered on "unlimited knowledge at everyone's fingertips", but lowering the barrier to access also lowered the bar.

That access "works" only when the user is capable of doing their part too. Evaluating sources, integrating knowledge. Validating. Cross examining.

Now we are just more used to recognizing that accessibility comes with its own problem.

Some of this is down to general education. Some to domain expertise. Personality plays a big part.

The biggest factor is, I think, intelligence. There's a lot of 2nd and 3rd order thinking required to simultaneously entertain a curiosity, consider how the LLM works, and exercise different levels of skepticism depending on the types of errors LLMs are likely to make.

The difference between using LLMs correctly and incorrectly is... subtle.

HarHarVeryFunny

> The key thing is to develop an intuition for questions it can usefully answer vs questions that are at a level of detail where the lossiness matters

It's also useful to have an intuition for what an LLM is liable to get wrong or hallucinate. One such case is questions that themselves suggest one or more obvious answers (which may or may not be correct); if the LLM doesn't "know", it may well hallucinate one of those answers and still sound reasonable.

felipeerias

LLMs are very sensitive to leading questions. A small hint of what the expected answer looks like will tend to produce exactly that answer.

SAI_Peregrinus

As a consequence LLMs are extremely unlikely to recognize an X-Y problem.

giantrobot

You don't even need a leading direct question. You can easily lead an LLM just by having some statements (even at times single words) in the context window.

theshrike79

> the problem is that in order to develop an intuition for questions that LLMs can answer, the user will at least need to know something about the topic beforehand

This is why simonw (the author) has his "pelican on a bike" test; it's not 100% accurate, but it is a good indicator.

I have a set of my own standard queries and problems (no counting characters or algebra crap) that I feed to new LLMs I'm testing.

None of the questions exist outside of my own Obsidian note, so they can't be gamed by LLM authors. I've tested multiple different LLMs with them, so I have a "feeling" for what the answer should look like. And I personally know the correct answers, so I can validate them immediately.

barapa

They are training on your queries. So they may have some exposure to them going forward.

franktankbank

Even if your queries are hidden via a locally running model, you must have some humility that your queries are not actually unique. For this reason I have a very difficult time believing that a basic LLM will be able to properly reason about complex topics; it can regurgitate to whatever level it's been trained. That doesn't make it less useful, though. But in the edge cases, how do we know the query it's ingesting gets trained with a suitable answer? Wouldn't this constitute over-fitting and be terribly self-reinforcing?

keysdev

Not if you ollama pull the model to your own machine.

geye1234

Please, everybody, preserve your records. Preserve your books, preserve your downloaded files (that can't be tampered with), keep everything. AI is going to make it harder and harder to find out the truth about anything over the next few years.

You have a moral duty to keep your books, and keep your locally-stored information.

Taylor_OD

I get very annoyed when LLMs respond with quotes around certain things I ask for; then, when I ask what the source of the quote is, they say "oh, I was paraphrasing, that isn't a real quote."

At least Wikipedia has sources that probably support what it says, and normally the quotes are real quotes. LLMs just seem to add quotation marks as "proof" that they're confident something is correct.

bloudermilk

To that end, it seems as though archive.org will be important for an entirely new reason. Not for the loss of information, but for the degradation of it.

latexr

A lossy encyclopaedia should be missing information and be obvious about it, not making it up without your knowledge and changing the answer every time.

When you have a lossy piece of media, such as a compressed sound or image file, you can always see the resemblance to the original and note the degradation as it happens. You never have a clear JPEG of a lamp, compress it, and get a clear image of the Milky Way, then reopen the image and get a clear image of a pile of dirt.

Furthermore, an encyclopaedia is something you can reference and learn from without a goal, it allows you to peruse information you have no concept of. Not so with LLMs, which you have to query to get an answer.

gjm11

Lossy compression does make things up. We call them compression artefacts.

In compressed audio these can be things like clicks and boings and echoes and pre-echoes. In compressed images they can be ripply effects near edges, banding in smoothly varying regions, but there are also things like https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres... where one digit is replaced with a nice clean version of a different digit, which is pretty on-the-nose for the LLM failure mode you're talking about.

Compression artefacts generally affect small parts of the image or audio or video rather than replacing the whole thing -- but in the analogy, "the whole thing" is an encyclopaedia and the artefacts are affecting little bits of that.

Of course the analogy isn't exact. That would be why S.W. opens his post by saying "Since I love collecting questionable analogies for LLMs,".

moregrist

> Lossy compression does make things up. We call them compression artefacts.

I don’t think this is a great analogy.

Lossy compression of images or signals tends to throw out information based on how humans perceive it, focusing on the most important perceptual parts and discarding the less important parts. For example, JPEG essentially removes high frequency components from an image because more information is present with the low frequency parts. Similarly, POTS phone encoding and mp3 both compress audio signals based on how humans perceive audio frequency.
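As a concrete sketch of that "discard high frequencies" idea (scipy's DCT on a random 8x8 block; real JPEG quantizes per-channel coefficients rather than zeroing them outright):

    import numpy as np
    from scipy.fft import dctn, idctn

    block = np.random.rand(8, 8)          # stand-in for one 8x8 image block
    coeffs = dctn(block, norm="ortho")    # move to the frequency domain

    mask = np.zeros_like(coeffs)
    mask[:4, :4] = 1                      # keep only the low-frequency quadrant
    approx = idctn(coeffs * mask, norm="ortho")

    # The reconstruction is degraded, not invented: errors stay small and local.
    print("max per-pixel error:", np.abs(block - approx).max())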

The perceived degradation of most lossy compression is gradual with the amount of compression and not typically what someone means when they say “make things up.”

LLM hallucinations aren’t gradual and the compression doesn’t seem to follow human perception.

Vetch

You are right and the idea of LLMs as lossy compression has lots of problems in general (LLMs are a statistical model, a function approximating the data generating process).

Compression artifacts (which are deterministic distortions in reconstruction) are not the same as hallucinations (plausible samples from a generative model; even when greedy, this is still sampling from the conditional distribution). A better identification is with super-resolution. If we use a generative model, the result will be clearer than a normal blotchy resize but a lot of details about the image will have changed as the model provides its best guesses at what the missing information could have been. LLMs aren't meant to reconstruct a source even though we can attempt to sample their distribution for snippets that are reasonable facsimiles from the original data.

An LLM provides a way to compute the probability of given strings. Once paired with entropy coding, on-line learning on the target data allows us to arrive at the correct MDL based lossless compression view of LLMs.

baq

LLM confabulations may well be gradual in the latent space. I don't think lossy is synonymous with perceptual, and the high-frequency components translate rather easily to less popular data.

latexr

I feel like my comment is pretty clear that a compression artefact is not the same thing as making the whole thing up.

> Of course the analogy isn't exact.

And I don’t expect it to be, which is something I’ve made clear several times before, including on this very thread.

https://news.ycombinator.com/item?id=45101679

jpcompartir

Interesting, in the LLM case these compression artefacts then get fed into the generating process of the next token, hence the errors compound.

ACCount37

Not really. The whole "inference errors will always compound" idea was popular in GPT-3.5 days, and it seems like a lot of people just never updated their knowledge since.

It was quickly discovered that LLMs are capable of re-checking their own solutions if prompted - and, with the right prompts, are capable of spotting and correcting their own errors at a significantly-greater-than-chance rate. They just don't do it unprompted.

Eventually, it was found that reasoning RLVR consistently gets LLMs to check themselves and backtrack. It was also confirmed that this latent "error detection and correction" capability is present even at base model level, but is almost never exposed - not in base models and not in non-reasoning instruct-tuned LLMs.

The hypothesis I subscribe to is that any LLM has a strong "character self-consistency drive". This makes it reluctant to say "wait, no, maybe I was wrong just now", even if latent awareness of "past reasoning looks sketchy as fuck" is already present within the LLM. Reasoning RLVR encourages going against that drive and utilizing those latent error-correction capabilities.

gf000

I don't think there is a singular "should" that fits every use case.

E.g. a Bloom filter also doesn't "know" what it knows.
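To make the Bloom filter point concrete, here's a tiny sketch (not from the original comment): the structure answers "probably yes" for members, but it can also answer "yes" for things it never saw, and it has no way to tell the difference.

    import hashlib

    class Bloom:
        def __init__(self, size=64, hashes=3):
            self.size, self.hashes, self.bits = size, hashes, 0

        def _positions(self, item):
            for i in range(self.hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:4], "big") % self.size

        def add(self, item):
            for p in self._positions(item):
                self.bits |= 1 << p

        def might_contain(self, item):
            return all(self.bits & (1 << p) for p in self._positions(item))

    bf = Bloom()
    for word in ["dog", "cat", "lamp"]:
        bf.add(word)

    print(bf.might_contain("dog"))            # True
    print(bf.might_contain("space mollusk"))  # usually False, occasionally a false positive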

latexr

I don’t understand the point you’re trying to make. The given example confused me further, since nothing in my argument is concerned with the tool “knowing” anything, that has no relation to the idea I’m expressing.

I do understand and agree with a different point you’re making somewhere else in this thread, but it doesn’t seem related to what you’re saying here.

https://news.ycombinator.com/item?id=45101946

Lerc

The argument is that a banana is a squishy hammer.

You're saying hammers shouldn't be squishy.

Simon is saying don't use a banana as a hammer.

latexr

> You're saying hammers shouldn't be squishy.

No, that is not what I’m saying. My point is closer to “the words chosen to describe the made up concept do not translate to the idea being conveyed”. I tried to make that fit into your idea of the banana and squishy hammer, but now we’re several levels of abstraction deep using analogies to discuss analogies so it’s getting complicated to communicate clearly.

> Simon is saying don't use a banana as a hammer.

Which I agree with.

tsunamifury

This is the type of comment that has been killing HN lately. “I agree with you but I want to disagree because I’m generally just that type of person. Also I am unable to tell my disagreeing point adds nothing.”

mock-possum

Yeah an LLM is an unreliable librarian, if anything.

JustFinishedBSG

I actually disagree. Modern encoding formats can, and do, hallucinate blocks.

It's a lot less visible, and I guess less dramatic, than LLMs, but it happens frequently enough that at every major event there seem to be false conspiracies based on video "proofs" that are just encoding artifacts.

simonw

I think you are missing the point of the analogy: a lossy encyclopedia is obviously a bad idea, because encyclopedias are meant to be reliable places to look up facts.

latexr

And my point is that “lossy” does not mean “unreliable”. LLMs aren’t reliable sources of facts, no argument there, but a true lossy encyclopaedia might be. Lossy algorithms don’t just make up and change information, they remove it from places where they might not make a difference to the whole. A lossy encyclopaedia might be one where, for example, you remove the images plus grammatical and phonetic information. Eventually you might compress the information where the entry for “dog” only reads “four legged creature”—which is correct but not terribly helpful—but you wouldn’t get “space mollusk”.

simonw

I don't think a "true lossy encylopedia" is a thing that has ever existed.

baq

A lossy encyclopedia which you can talk to and it can look up facts in the lossless version while having a conversation OTOH is... not a bad idea at all, and hundreds of millions of people agree if traffic numbers are to be believed.

(but it isn't and won't ever be an oracle and apparently that's a challenge for human psychology.)

simonw

Completely agree with you - LLMs with access to search tools that know how to use them (o3, GPT-5, Claude 4 are particularly good at this) mostly paper over the problems caused by a lossy set of knowledge in the model weights themselves.

But... end users need to understand this in order to use it effectively. They need to know if the LLM system they are talking to has access to a credible search engine and is good at distinguishing reliable sources from junk.

That's advanced knowledge at the moment!

butlike

I don't like the confident hallucinations of LLMs either, but don't they rewrite and add entries in the encyclopedia every few years? Implicitly that makes your old copy "lossy"

Again, never really want a confidently-wrong encyclopedia, though

rynn

Aren't all encyclopedias 'lossy'? They are all partial collections of information; none have all of the facts.

checkyoursudo

I am sympathetic to your analogy. I think it works well enough.

But it falls a bit short in that encyclopedias, lossy or not, shouldn't affirmatively contain false information. The way I would picture a lossy encyclopedia is that it can misdirect by omission, but it would not change A to ¬A.

Maybe a truthy-roulette encyclopedia?

tomrod

I guarantee every encyclopedia has mistakes.

TacticalCoder

> You never have a clear JPEG of a lamp, compress it, and get a clear image of the Milky Way, then reopen the image and get a clear image of a pile of dirt.

Oh but it's much worse than that: because most LLMs aren't deterministic in the way they operate [1], you can get a pristine image of a different pile of dirt every single time you ask.

[1] there are models where if you have the "model + prompt + seed" you're at least guaranteed to get the same output every single time. FWIW I use LLMs but I cannot integrate them in anything I produce when what they output ain't deterministic.

ACCount37

"Deterministic" is overrated.

Computers are deterministic. Most of the time. If you really don't think about all the times they aren't. But if you leave the CPU-land and go out into the real world, you don't have the privilege of working with deterministic systems at all.

Engineering with LLMs is closer to "designing a robust industrial process that's going to be performed by unskilled minimum wage workers" than it is to "writing a software algorithm". It's still an engineering problem - but of the kind that requires an entirely different frame of mind to tackle.

latexr

And one major issue is that LLMs are largely being sold and understood more like reliable algorithms than what they really are.

If everyone understood the distinction and their limitations, they wouldn’t be enjoying this level of hype, or leading to teen suicides and people giving themselves centuries-old psychiatric illnesses. If you “go out into the real world” you learn people do not understand LLMs aren’t deterministic and that they shouldn’t blindly accept their outputs.

https://archive.ph/rdL9W

https://archive.ph/20241023235325/https://www.nytimes.com/20...

https://archive.ph/20250808145022/https://www.404media.co/gu...

latexr

> you can get a pristine image of a different pile of dirt every single time you ask.

That’s what I was trying to convey with the “then reopen the image” bit. But I chose a different image of a different thing rather than a different image of a similar thing.

energy123

An encyclopaedia also can't win gold medals at the IMO and IOI. So yeah, they're not the same thing, even though the analogy is pretty good.

latexr

Of course they’re not the same thing, the goal of an analogy is not to be perfect but to provide a point of comparison to explain an idea.

My point is that I find the chosen term inadequate. The author made it up from combining two existing words, where one of them is a poor fit for what they’re aiming to convey.

dragonwriter

Thinking of an LLM as any kind of encyclopedia is probably the wrong model. LLMs are information presentation/processing tools that incidentally, as a consequence of the method by which they are built to do that, may occasionally produce factual information that is not directly prompted.

If you want an LLM to be part of a tool that is intended to provide access to encyclopedic information (presumably with some added value), it is best not to consider the LLM as providing any part of the encyclopedic information function of the system, but instead as providing part of its user interface. The encyclopedic information should come from appropriate tooling that, at the request of an appropriately prompted LLM or at the direction of an orchestration layer with access to user requests (and both kinds of tooling might be used in the same system), provides relevant factual data which is inserted into the LLM’s context, as sketched below.

The correct modifier to insert into the sentence “An LLM is an encyclopedia” is “not”, not “lossy”.
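A minimal sketch of that arrangement (the function names here are hypothetical placeholders, not anyone's actual API):

    def lookup_encyclopedia(query: str) -> str:
        # Stand-in for the real tooling: a search index, database, or API.
        return "Encyclopedic entry relevant to: " + query

    def call_llm(prompt: str) -> str:
        # Stand-in for whatever model endpoint is in use; just echoes here.
        return "LLM output given:\n" + prompt

    def answer(user_question: str) -> str:
        # The orchestration layer fetches the facts...
        facts = lookup_encyclopedia(user_question)
        # ...and the LLM only presents/processes them, rather than recalling them.
        prompt = (
            "Answer using ONLY the reference material below.\n"
            f"Reference material:\n{facts}\n\n"
            f"Question: {user_question}"
        )
        return call_llm(prompt)

    print(answer("Who wrote the Encyclopédie?"))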

lxgr

Using artificial neural networks directly for information storage and retrieval (i.e. not just leveraging them as tools accessing other types of storage) is currently infeasible, agreed.

On the other hand, biological neural networks are doing it all the time :) And there might well be an advantage to it (or a hybrid method), once we can make it more economical.

After all, the embedding vector space is shaped by the distribution of training data, and if you have out-of-distribution data coming in due to a new or changed environment, RAG using pre-trained models and their vector spaces will only go so far.

intended

eh, bio neural networks aren't doing that all the time. Memories are notorious for being "rebuilt" constantly.

kgeist

I think an LLM can be used as a kind of lossy encyclopedia, but equating it directly to one isn't entirely accurate. The human mind is also, in a sense, a lossy encyclopedia.

I prefer to think of LLMs as lossy predictors. If you think about it, natural "intelligence" itself can be understood as another type of predictor: you build a world model to anticipate what will happen next so you can plan your actions accordingly and survive.

In the real world, with countless fuzzy factors, no predictor can ever be perfectly lossless. The only real difference, for me, is that LLMs are lossier predictors than human minds (for now). That's all there is to it.

Whatever analogy you use, it comes down to the realization that there's always some lossiness involved, whether you frame it as an encyclopedia or not.

jbstack

Are LLMs really lossier than humans? I think it depends on the context. Given any particular example, LLMs might hallucinate more and a human might do a better job at accuracy. But overall LLMs will remember far more things than a human. Ask a human to reproduce what they read in a book last year and there's a good chance you'll get either absolutely nothing or just a vague idea of what the book was about - in this context they can be up to 100% lossy. The difference here is that human memory decays over time while an LLM's memory is hardwired.

ijk

I think what trips people up is that LLMs and humans are both lossy, but in different ways.

The intuitions that we've developed around previous interactions are very misleading when applied to LLMs. When interacting with a human, we're used to being able to ask a question about topic X in context Y and assume that if you can answer it we can rely on you to be able to talk about it in the very similar context Z.

But LLMs are bad at commutative facts; A=B and B=A can have different performance characteristics. Just because it can answer A=B does not mean it is good at answering B=A; you have to test them separately.

I've seen researchers who should really know better screw this up, rendering their methodology useless for the claim they're trying to validate. Our intuition for how humans do things can be very misleading when working with LLMs.

withinboredom

That's not exactly true. Every time you start a new conversation, you get a new LLM for all intents and purposes. Asking an LLM about an unrelated topic towards the end of a ~500-page conversation will get you vastly different results than at the beginning. If we could get to multi-thousand-page contexts, it would probably be less accurate than a human, tbh.

jbstack

Yes, I should have clarified that I was referring to memory of training data, not of conversations.

sigmoid10

>Given any particular example, LLMs might hallucinate more and a human might do a better job at accuracy

This drastically depends on the example. For average trivia questions, modern LLMs (even smaller, open ones) beat humans easily.

layer8

Lossy is an incomplete characterization. LLMs are also much more fluctuating and fuzzy. You can get wildly varying output depending on prompting, for what should be the same (even if lossy) knowledge. There is not just loss during the training, but also loss and variation during inference. An LLM overall is a much less coherent and consistent thing than most humans, in terms of knowledge, mindset, and elucidations.

A_D_E_P_T

> If you think about it, natural "intelligence" itself can be understood as another type of predictor: you build a world model to anticipate what will happen next so you can plan your actions accordingly and survive.

Yes.

Human intelligence consists of three things.

First, groundedness: The ability to form a representation of the world and one’s place in it.

Second, a temporal-spatial sense: A subjective and bounded idea of self in objective space and time.

Third: A general predictive function which is capable of broad abstraction.

At its most basic level, this third element enables man to acquire, process, store, represent, and continually re-acquire knowledge which is external to that man's subjective existence. This is calculation in the strictest sense.

And it is the third element -- the strength, speed, and breadth of the predictive function -- which is synonymous with the word "intelligence." Higher animals have all three elements, but they're pretty hazy -- especially the third. And, in humans, short time horizons are synonymous with intellectual dullness.

All of this is to say that if you have a "prediction machine" you're 90% of the way to a true "intelligence machine." It also, I think, suggests routes that might lead to more robust AI in the future. (Ground the AI, give it a limited physical presence in time and space, match its clocks to the outside world.)

quonn

"Prediction" is hardly more than another term for inference. It's the very essence of machine learning. There is nothing new or useful in this concept.

A_D_E_P_T

Point is that it's also exactly analogous to human intelligence. There's almost nothing else to it.

NoMoreNicksLeft

Imagine having the world's most comprehensive encyclopedia at your literal fingertips, 24 hours a day, but being so lazy that you offload the hard work of thinking by letting retarded software pathologically lie to you and then blindly accepting the non-answers it spits at you rather than typing in two or three keywords to Wikipedia and skimming the top paragraph.

>I prefer to think of LLMs as lossy predictors.

I've started to call them the Great Filter.

In the latest issue of the comic book, Lex Luthor attempts to exterminate humanity by hacking the LLM and having it inform people that they can hold their breath underwater for 17 hours.

somewhereoutth

> you build a world model

The foundational conceit (if you will) of LLMs is that they build a semantic (world) model to 'make sense' of their training. However it is much more likely that they are simply building a syntactic model in response to the training. As far as I know there is no evidence of a semantic model emerging.

ijk

There's some evidence of valid relationships: you can build a map of Manhattan by asking about directions from each street corner and plotting the relations.

This is still entirely referential, but in a way that a human would see some relation to the actual thing, albeit in a somewhat weird and alien way.

jebarker

Maybe I don’t have a precise enough definition of syntax and semantics, but it seems like it’s more than just syntactic since interchangeable tokens in the same syntax affect the semantics of the sentence. Or do you view completing a prompt such as “The president of the United States is?” as a syntax question?

IanCal

Is this not addressed by OthelloGPT?

cubefox

Another difference is that you are predicting future sensory experiences in real-time, while LLMs "predict" text which a "helpful, honest, harmless" assistant would produce.

GuB-42

There are a lot of parallels between AI and compression.

In fact the best compression algorithms and LLMs have in common that they work by predicting the next word. Compression algorithms take an extra step called entropy coding to encode the difference between the prediction and the actual data efficiently, and the better the prediction, the better the compression ratio.

What makes a LLM "lossy" is that you don't have the "encode the difference" step.

And yes, it means you can turn an LLM into a (lossless) compression algorithm, and I think a really good one in terms of compression ratio on huge data sets. You can also turn a compression algorithm like gzip into a language model! A terrible one, but the output is better than a random stream of bytes.
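A minimal sketch of that prediction/compression link, using Hugging Face transformers with gpt2 purely as a stand-in model: summing -log2 of the probability the model assigns to each actual next token gives the number of bits an ideal entropy coder paired with that model would need.

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    text = "An LLM is a lossy encyclopedia."
    ids = tok(text, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(ids).logits            # shape: [1, seq_len, vocab_size]

    # Log-probability the model assigned to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    positions = torch.arange(ids.shape[1] - 1)
    token_lp = log_probs[positions, ids[0, 1:]]

    bits = -token_lp.sum().item() / math.log(2)
    print(f"~{bits:.0f} bits with the model vs {8 * len(text.encode())} bits raw")

The better the model's predictions, the fewer bits the entropy coder needs - which is the sense in which a better predictor is a better compressor.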

jparishy

I suspect this ends up being pretty important for the next advancements in AI, specifically LLM-based AI. To me, the transformer architecture is a sort of compression algorithm that is being exploited for emergent behavior at the margins. But I think this is more like stream of consciousness than premeditated thought. Eventually I think we figure out a way to "think" in latent space and have our existing AI models be just the mouthpiece.

In my experience as a human, the more you know about a subject, or even the more you have simply seen content about it, the easier it is to ramble on about it convincingly. It's like a mirroring skill, and it does not actually mean you understand what you're saying.

LLMs seem to do the same thing, I think. At scale this is widely useful, though, I am not discounting it. Just think it's an order of magnitude below what's possible and all this talk of existing stream-of-consciousness-like LLMs creating AGI seems like a miss

layer8

One difference is that compression gives you one and only one thing when decompressing. Decompression isn't a function taking arbitrary additional input and producing potentially arbitrary, nondeterministic output based on it.

We would have very different conversations if LLMs were things that merely exploded into a singular lossy-expanded version of Wikipedia, but where looking at the article for any topic X would give you the exact same article each time.

withinboredom

LLMs deliberately insert randomness. If you run a model locally (or sometimes via API), you can turn that off and get the same response for the same input every time.
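For example, with a locally run model via Hugging Face transformers (gpt2 here purely as a stand-in), greedy decoding gives the same output for the same prompt every time:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("An LLM is a lossy", return_tensors="pt")
    # do_sample=False disables sampling: always pick the highest-probability token.
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))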

layer8

True, but I'd argue that you can't get the definite knowledge of an LLM by turning off randomness, or fixing the seed. Otherwise that would be a routinely employed feature, to determine what an LLM "truly knows", removing any random noise distorting that knowledge, and instead randomness would only be turned on for tasks requiring creativity, not when merely asking factual questions. But it doesn't work that way. Different seeds will uncover different "knowledge", and it's not the case that one is a truer representation of an LLM's knowledge than another.

Furthermore, even in the absence of randomness, asking an LLM the same question in different ways can yield different, potentially contradictory answers, even when the difference in prompting is perfectly benign.

arjvik

With a handy trick called arithmetic coding, you can actually turn an LLM into a lossless compression algorithm!

vbarrielle

Indeed, see https://bellard.org/nncp/ for an example.

amarant

Accurate!

An llm is also a more convenient encyclopedia.

I'm not surprised a large portion of people choose convenience over correctness. I do not necessarily agree with the choice, but looking at historical trends, I do not find it surprising that it's a popular choice.

freefaler

> "...They have a huge array of facts compressed into them but that compression is lossy (see also Ted Chiang)"

indeed, Ted's piece (ChatGPT Is a Blurry JPEG of the Web) is here:

https://archive.is/iHSdS

baq

Worth highlighting - 2023.

thw_9a83c

Yes, an LLM is a lossy encyclopedia with a human-language answering interface. This has some benefits, mostly in terms of convenience. You don't have to browse or read through so many pages of a real encyclopedia to get a quick answer. However, there is also a clear downside. Currently, an LLM is unable to judge whether your question is formulated incorrectly, or whether it opens up more questions that should be answered first. It always jumps to answering something. A real human would assess the questioner first and usually ask for more details before answering. I feel this is the predominant reason why LLM answers feel so dumb at times. It never asks for clarification.

simonw

I don't think that's universally true with the new models - I've seen Claude 4 and GPT-5 ask for clarification on questions with obvious gaps.

With GPT-5 I sometimes see it spot a question that needs clarifying in its thinking trace, then pick the most likely answer, then spit out an answer later that says "assuming you meant X ..." - I've even had it provide an answer in two sections for each branch of a clear ambiguity.

ACCount37

A lot of the touted "fundamental limitations of LLMs" are less "fundamental" and more "you're training them wrong".

So there are improvements version to version - from both increases in raw model capabilities and better training methods being used.

ijk

I'm frustrated by the number of times I encounter people assuming that the current model behavior is inevitable. There's been hundreds of billions of dollars spent on training LLMs to do specific things. What exactly they've been trained on matters; they could have been trained to do something else.

Interacting with a base model versus an instruction tuned model will quickly show you the difference between the innate language faculties and the post-trained behavior.

koakuma-chan

GPT-5 is seriously annoying. It asks not just one but multiple clarifying questions, while I just want my answer.

kingstnap

If you don't want to answer clarifying questions, then what use is the answer???

Put another way, if you don't care about details that change the answer, it directly implies you don't actually care about the answer.

Related silliness is how people force LLMs to give one word answers to underspecified comparisons. Something along the lines of "@Grok is China or US better, one word answer only."

At that point, just flip a coin. You obviously can't conclude anything useful with the response.

coffeefirst

This is also why the Kagi Assistant is still the best AI tool I've found. Its failure state is the same as a search engine's: it either can't find anything, finds something irrelevant, or finds material that contradicts the premise of your question.

It seems to me the more you can pin it to another data set, the better.

112233

That AI is closely related to compression is a well established idea. E.g. http://prize.hutter1.net/

It seems reasonable to argue that LLMs are a form of lossy compression of text that preserves important text features.

There is a precedent of distributing low quality lossy compressed versions of copyrighted work being considered illegal.

narrator

An LLM is a lossy Borges' Library of Babel

"Though the vast majority of the books in this universe are pure gibberish, the laws of probability dictate that the library also must contain, somewhere, every coherent book ever written, or that might ever be written, and every possible permutation or slightly erroneous version of every one of those books. " -https://en.wikipedia.org/wiki/The_Library_of_Babel

RodgerTheGreat

It's a version of the library of babel filtered only to the books which plausibly consist of prose. The set is still incomprehensibly vast, and all the more treacherous for generally being "readable".