
Recent results show that LLMs struggle with compositional tasks

moolimon

The main thesis here seems to be that LLMs behave like almost all other machine learning models, in that they are doing pattern matching on their input data, and short circuiting to a statistically likely result. Chain of thought reasoning is still bound by this basic property of reflexive pattern matching, except the LLM is forced to go through a process of iteratively refining the domain it does matching on.

Chain of thought is interesting, because you can combine it with reinforcement learning to get models to solve (seemingly) arbitrarily hard problems. This comes with the caveat that you need some reward model for all RL. This means you need a clear definition of success, and some way of rewarding being closer to success, to actually solve those problems.
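
As a minimal sketch of what such a reward can look like in a verifiable domain (everything below is a toy assumption, not from the article): success is defined by unit tests, and "closer to success" is the fraction of tests passed.

  from typing import Callable, List, Tuple

  def shaped_reward(program: Callable[[int], int],
                    tests: List[Tuple[int, int]]) -> float:
      passed = 0
      for x, expected in tests:
          try:
              if program(x) == expected:
                  passed += 1
          except Exception:
              pass                                   # crashes earn no credit
      frac = passed / len(tests)
      return frac + (1.0 if frac == 1.0 else 0.0)    # bonus for full success

  tests = [(2, 4), (3, 9), (10, 100)]
  print(shaped_reward(lambda x: x * x, tests))   # 2.0  -> full credit plus bonus
  print(shaped_reward(lambda x: x + x, tests))   # ~0.33 -> partial credit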

Framing transformer based models as pattern matchers makes all the sense in the world. Pattern matching is obviously vital to human problem solving skills too. Interesting to think about what structures human intelligence has that these models don't. For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.

nonameiguess

To me:

LLMs are trained, as others have mentioned, first to just learn the language at all costs. Ingest any and all strings of text generated by humans until you can learn how to generate text in a way that is indistinguishable.

As a happy side effect, this language you've now learned happens to embed quite a few statements of fact and examples of high-quality logical reasoning, but crucially, the language itself isn't a representation of reality or of good reasoning. It isn't meant to be. It's a way to store and communicate arbitrary ideas, which may be wrong or bad or both. Thus, the problem for these researchers now becomes how do we tease out and surface the parts of the model that can produce factually accurate and reasonable statements and dampen everything else?

Animal learning isn't like this. We don't require language at all to represent and reason about reality. We have multimodal sensory experience and direct interaction with the physical world, not just recorded images or writing about the world, from the beginning. Whatever it is humans do, I think we at least innately understand that language isn't truth or reason. It's just a way to encode arbitrary information.

Some way or another, we all grok that there is a hierarchy of evidence or even what evidence is and isn't in the first place. Going into the backyard to find where your dog left the ball or reading a physics textbook is fundamentally a different form of learning than reading the Odyssey or the published manifesto of a mass murderer. We're still "learning" in the sense that our brains now contain more information than they did before, but we know some of these things are representations of reality and some are not. We have access to the world beyond the shadows in the cave.

anon84873628

Humans can carve the world up into domains with a fixed set of rules and then do symbolic reasoning within it. LLMs can't seem to do this in a formal way at all -- they just occasionally get it right when the domain happens to be encoded in their language learning.

You can't feed an LLM a formal language grammar (e.g. SQL) then have it only generate results with valid syntax.
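
In practice any syntactic guarantee has to come from outside the model; a minimal sketch (SQLite and a toy schema are my assumptions) of that kind of external check:

  import sqlite3

  SCHEMA = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"

  def is_valid_sql(query: str) -> bool:
      conn = sqlite3.connect(":memory:")
      conn.execute(SCHEMA)
      try:
          # EXPLAIN asks SQLite to compile the statement without running the
          # underlying query; syntax (and unknown table/column) errors raise here.
          conn.execute("EXPLAIN " + query)
          return True
      except sqlite3.Error:
          return False
      finally:
          conn.close()

  print(is_valid_sql("SELECT name FROM users WHERE age > 30"))   # True
  print(is_valid_sql("SELECT name FROM WHERE age > 30"))         # False: syntax error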

It's awfully confusing to me that people think current LLMs (or multi-modal models etc) are "close" to AGI (for whatever various definitions of all those words you want to use) when they can't do real symbolic reasoning.

Though I'm not an expert and happy to be corrected...

cornel_io

Adult humans can do symbolic reasoning, but lower mammals cannot. Even ones that share most of our brain structure are much worse at this, if they can do it at all; children need to learn it, along with a lot of the other things that we consider a natural part of human intelligence.

That all points towards symbolic reasoning being a pretty small algorithmic discovery compared to the general ability to pattern match and do fuzzy lookups, transformations, and retrievals against a memory bank. It's not like our architecture is so special that we burned most of our evolutionary history selecting for these abilities, they're very recent innovations, and thus must be relatively simple, given the existence of the core set of abilities that our close ancestors have.

The thing about transformers is that obviously they're not the end of the line, there are some things they really can't do in their current form (though it's a smaller set than people tend to think, which is why the Gary Marcuses of the world always backpedal like crazy and retcon their previous statements as each new release does things that they previously said were impossible). But they are a proof of concept showing that just about the simplest architecture that you could propose that might be able to generate language in a reasonable way (beyond N-gram sampling) can, in fact, do it really, really well even if all you do is scale it up, and even the simplest next-token prediction as a goal leads to much higher level abilities than you would expect. That was the hard core of the problem, building a flexible pattern mimic that can be easily trained, and it turns out to get us way further along the line to AGI than I suspect anyone working on it ever expected it would without major additions and changes to the design. Now it's probably time to start adding bits and bobs and addressing some of the shortcomings (e.g. static nature of the network, lack of online learning, the fact that chains of thought shouldn't be constrained to token sequences, addressing tokenization itself, etc), but IMO the engine at the heart of the current systems is so impressively capable that the remaining work is going to be less of an Einstein moment and more of an elbow grease and engineering grind.

We may not be close in the "2 years of known work" sense, but we're certainly not far in the "we have no idea how to prove the Riemann Hypothesis" sense anymore, where major unknown breakthroughs are still required which might be 50+ years away, or the problem might even be unsolvable.

mnky9800n

Humans often do not have a clear definition of success and instead create a post-hoc narrative to describe whatever happened as success.

idiotsecant

Yes, I've always thought that LLMs need the equivalent of a limbic system. This is how we solved this problem in organic computers. There is no static 'reward function'. Instead, we have a dynamic reward function computer. It decides from day to day and hour to hour what our basic objectives are. It also crucially handles emotional 'tagging' of memory. Memories that we store are proportionally more likely to be retrieved under similar emotional conditions. It helps to filter relevant memories, which is something LLMs definitely could use.

I think the equivalent of an LLM limbic system is more or less the missing piece for AGI. Now, how you'd go about making one of those I have no idea. How does one construct an emotional state space?
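
A purely illustrative toy sketch of that emotional tagging idea (all names, vectors and weights below are made up): memories carry both a content embedding and an emotion vector, and retrieval blends content similarity with similarity to the current emotional state.

  import math

  def cosine(a, b):
      dot = sum(x * y for x, y in zip(a, b))
      na = math.sqrt(sum(x * x for x in a))
      nb = math.sqrt(sum(x * x for x in b))
      return dot / (na * nb) if na and nb else 0.0

  # (content_embedding, emotion_vector, text) -- toy 3-d embeddings, 2-d emotions
  memories = [
      ([0.9, 0.1, 0.0], [0.8, 0.1], "argued about the project deadline"),
      ([0.8, 0.2, 0.1], [0.1, 0.9], "celebrated shipping the release"),
  ]

  def recall(query_emb, current_emotion, weight=0.5):
      scored = [
          ((1 - weight) * cosine(query_emb, emb) + weight * cosine(current_emotion, emo), text)
          for emb, emo, text in memories
      ]
      return max(scored)[1]

  # Same query, different mood, different memory surfaces:
  print(recall([0.85, 0.15, 0.05], current_emotion=[0.9, 0.1]))  # stressed -> the argument
  print(recall([0.85, 0.15, 0.05], current_emotion=[0.1, 0.9]))  # upbeat  -> the celebration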

_heimdall

Companies are bad about doing this on purpose. If they set out to build AGI and accomplish something novel, they just call that AI and go on fundraising from people who don't know better (or, more likely, don't care and just want to gamble with others' money).

cadamsdotcom

Continuous RL in a sense. There may be an undiscovered additional scaling law around models doing what you describe; continuous LLM-as-self-judge, if you will.

Provided it can be determined why a user ended the chat, which may turn out to be possible in some subset of conversations.

ben_w

And also sometimes write down the conclusion and work backwards, without considering that the most likely reason given the conclusion isn't necessarily going to have that conclusion as its most likely outcome — I hope I phrased that broken symmetry correctly.

ahartmetz

I'm not following. Do you have an example?

ben_w

Aesop's fables, "sour grapes".

mnky9800n

The Millikan oil drop experiment, "winning" the space race, mostly anything C-levels will tell the board and shareholders at a shareholder meeting, the American wars in Iraq and Afghanistan, most of what Sam Altman or Elon Musk has to say; this list continues.

emsign

This!

viccis

>Interesting to think about what structures human intelligence has that these models don't.

Kant's Critique of Pure Reason has been a very influential way of examining this kind of epistemology. He put forth the argument that our ability to reason about objects comes through our apprehension of sensory input over time, schematizing these into an understanding of the objects, and finally, through reason (by way of the categories) into synthetic a priori knowledge (conclusions grounded in reason rather than empiricism).

If we look at this question in that sense, LLMs are good at symbolic manipulation that mimics our sensibility, as well as combining different encounters with concepts into an understanding of what those objects are relative to other sensed objects. What it lacks is the transcendental reasoning that can form novel and well grounded conclusions.

Such a system that could do this might consist of an LLM layer for translating sensory input (in LLM's case, language) into a representation that can be used by a logical system (of the kind that was popular in AI's first big boom) and then fed back out.
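
A rough sketch of that pipeline (the LLM translation step is stubbed out with a hypothetical llm_translate, and SWI-Prolog stands in for the logical system):

  import subprocess, tempfile, textwrap

  def llm_translate(nl_statement: str) -> str:
      # Stub standing in for an LLM call that emits a logic program for the input.
      return textwrap.dedent("""\
          parent(tom, bob).
          parent(bob, ann).
          ancestor(X, Y) :- parent(X, Y).
          ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
          main :- forall(ancestor(tom, Who), (write(Who), nl)).
      """)

  def run_prolog(program: str) -> str:
      with tempfile.NamedTemporaryFile("w", suffix=".pl", delete=False) as f:
          f.write(program)
          path = f.name
      out = subprocess.run(["swipl", "-q", "-s", path, "-g", "main", "-t", "halt"],
                           capture_output=True, text=True)
      return out.stdout

  facts = llm_translate("Tom is Bob's parent; Bob is Ann's parent. Who are Tom's descendants?")
  print(run_prolog(facts))   # bob, ann -- the symbolic layer, not the LLM, does the deduction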

corimaith

>Such a system that could do this might consist of an LLM layer for translating sensory input (in LLM's case, language) into a representation that can be used by a logical system (of the kind that was popular in AI's first big boom) and then fed back out.

This just goes back into the problems of that AI winter again though. First Order Logic isn't expressive enough to model the real world, while Second Order Logic doesn't have a complete proof system to truly verify all its statements, and is too complex and unwieldy for practical use. I would also imagine that the number of people working on such problems is very small; this isn't engineering, it is analytic philosophy and mathematics.

viccis

Kant predates analytical philosophy and some of its failures (the logical positivism you are referring to). The idea here is that first order logic doesn't need to be expressive enough to model the world. Only that some logic system is capable of modeling the understanding of a representation of the world mediated by way of perception (via the current multimodal generative AI models). And finally, it does not need to be complete or correct, just equivalent to or better than what our minds do.

drakenot

With DeepSeek-R1-Zero, their use of RL didn't really have reward functions that indicated progress towards the goal, afaik.

It was "correct structure, wrong answer", "correct answer", "wrong answer". This was for Math & Coding, where they could verify answers deterministically.
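
A sketch of what such a deterministic reward could look like (my paraphrase of the setup described above; the format tags and reward values are assumptions, not DeepSeek's code):

  import re

  def r1_zero_style_reward(completion: str, ground_truth: str) -> float:
      m = re.fullmatch(r"(?s)\s*<think>.*</think>\s*<answer>(.*)</answer>\s*", completion)
      if m is None:
          return 0.0                                    # wrong structure
      answer = m.group(1).strip()
      return 1.0 if answer == ground_truth else 0.1     # correct structure, right/wrong answer

  print(r1_zero_style_reward("<think>2+2=4</think><answer>4</answer>", "4"))  # 1.0
  print(r1_zero_style_reward("<think>hmm</think><answer>5</answer>", "4"))    # 0.1
  print(r1_zero_style_reward("The answer is 4.", "4"))                        # 0.0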

null

[deleted]

mountainriver

It is a reward function, it's just a deterministic one. Reward models are often hacked, preventing real reasoning from being discovered.

huijzer

> Framing transformer based models as pattern matchers makes all the sense in the world. Pattern matching is obviously vital to human problem solving skills too. Interesting to think about what structures human intelligence has that these models don't. For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.

What is also a benefit for humans, I think, is that people are typically much more selective. LLMs train to predict anything on the internet, so for example for finance that includes clickbait articles which have a lifetime of about 2 hours. Experts would probably reject any information in these articles and instead try to focus on high quality sources only.

Similarly, a math researcher will probably have read a completely different set of sources throughout their life than, say, a lawyer.

I’m not sure it’s a fundamental difference, but current models do seem not to specialize from the start, unlike humans. And that might be in the way of learning the best representations. I know from ice hockey, for example, that you can see within 3 seconds whether someone played ice hockey from a young age or not. Same with language. People can usually hear an accent within seconds. Relatedly, I used OpenAI's text to speech a while back and the Dutch voice had an American accent. What this means is that even if you ask LLMs about Buffett's strategy, maybe they have a "clickbait accent" too. So with the current approach to training, the models might never reach absolute expert performance.

andai

When I was doing some NLP stuff a few years ago, I downloaded a few blobs of Common Crawl data, i.e. the kind of thing GPT was trained on. I was sort of horrified by the subject matter and quality: spam, advertisements, flame wars, porn... and that seems to be the vast majority of internet content. (If you've talked to a model without RLHF like one of the base Llama models, you may notice the personality is... different!)

I also started wondering about the utility of spending most of the network memorizing infinite trivia (even excluding most of the content above, which is trash), when LLMs don't really excel at that anyway, and they need to Google it to give you a source. (Aside: I've heard some people have good luck with "hallucinate then verify" with RAG / Googling...)

i.e. what if we put those neurons to better use? Then I found the Phi-1 paper, which did exactly that. Instead of training the model on slop, they trained it on textbooks! And instead of starting with PhD level stuff, they started with kid level stuff and gradually increased the difficulty.

What will we think of next...

dr_dshiv

Yes, but the Phi-1 textbooks were synthetic — written by other models! So…

astrange

You can get rid of the trivia by training one model on the slop, then a second model on the first one - called distillation or teacher-student training. But it's not much of a problem because regularization during training should discourage it from learning random noise.
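
A minimal sketch of that teacher-student loss (the standard Hinton-style recipe, not tied to any particular model): the student is trained to match the teacher's softened output distribution rather than the raw data.

  import torch
  import torch.nn.functional as F

  def distillation_loss(student_logits, teacher_logits, temperature=2.0):
      # KL divergence between temperature-softened distributions, scaled by T^2
      log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
      p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
      return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

  student = torch.randn(4, 32000, requires_grad=True)   # (batch, vocab) logits
  teacher = torch.randn(4, 32000)                       # frozen teacher outputs
  loss = distillation_loss(student, teacher)
  loss.backward()                                       # gradients flow only into the student
  print(loss.item())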

The reason LLMs work isn't because they learn the whole internet, it's because they try to learn it but then fail to, in a useful way.

If anything current models are overly optimized away from this; I get the feeling they mostly want to tell you things from Wikipedia. You don't get a lot of answers that look like they came from a book.

gf000

I don't know, babies hear a lot about widely varying topics from multiple people before learning to speak.

I would rather put it that humans can additionally specialize much more, but we usually have a pretty okay generic understanding/model of a thing we consider as 'known'. I would even wager that being generic enough (ergo, sufficiently abstracted) is possibly the most important "feature" humans have? (In the context of learning)

ben_w

> For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.

What we can integrate, we seem to integrate efficiently*; but compared to the quantities used to train AI, we humans may as well be literally vegetables.

* though people do argue about exactly how much input we get from vision etc., personally I doubt vision input is important to general human intelligence, because if it was then people born blind would have intellectual development difficulties that I've never heard suggested exist — David Blunkett's success says human intelligence isn't just fine-tuning on top of a massive vision-grounded model.

Retric

Hearing is also well into the terabytes worth of information per year. Add in touch, taste, smell, proprioception, etc and the brain gets a deluge.

The difference is we’re really focused on moving around in 3D space and more abstract work, where an LLM etc is optimized for a very narrow domain.
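
A quick back-of-envelope check of the "terabytes" figure, assuming hearing were equivalent to uncompressed CD-quality stereo recording (an overestimate, as a reply below points out):

  bytes_per_second = 44_100 * 2 * 2           # 44.1 kHz, 16-bit samples, 2 channels
  seconds_per_year = 60 * 60 * 24 * 365
  terabytes_per_year = bytes_per_second * seconds_per_year / 1e12
  print(f"{terabytes_per_year:.1f} TB/year")  # ~5.6 TB/year of raw audio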

hammock

> Hearing is also well into the terabytes worth of information per year. Add in touch, taste, smell, proprioception, etc and the brain gets a deluge

Is that supposed to be a lot? Only a small fraction of that is committed to permanent storage.

A random server is today processing anywhere from tens of terabytes to hundreds of petabytes annually

jdietrich

>Hearing is also well into the terabytes worth of information per year.

If we assume that the human auditory system is equivalent to uncompressed digital recording, sure. Actual neural coding is much more efficient, so the amount of data that is meaningfully processed after multiple stages of filtering and compression is plausibly on the order of tens of gigabytes per year; the amount actually retained is plausibly in the tens of megabytes.

Don't get me wrong, the human brain is hugely impressive, but we're heavily reliant on very lossy sensory mechanisms. A few rounds of Kim's Game will powerfully reveal just how much of what we perceive is instantly discarded, even when we're paying close attention.

gf000

Is that a positive thing? If anything I would consider it the reverse - LLMs have the "intelligence of vegetables" because even with literally the whole of human written knowledge they can at most regurgitate it back to us with no novelty whatsoever, even though a 2-year-old with a not even matured brain can learn a human language from orders of magnitude less and lower quality input, from a couple of people only.

But any Nobel Prize winner has read significantly less than a basic LLM, and we see no LLM doing any tiny scientific achievement, let alone high-impact ones.

ben_w

> Is that a positive thing?

Neither. Both.

Depends what you want to measure.

It's perfectly legit to call these models "thick" because they *need* to read such a vast quantity of text that a human would literally spend two thousand lifetimes to go through it even if that was all the human did with their days.

It also remains the case that, unlike us, they can go through all of that in a few months.

> with no novelty whatsoever, even though a 2-year-old with a not even matured brain can learn a human language from orders of magnitude less and lower quality input, from a couple of people only.

You're either grossly underestimating AI or overestimating 2 year olds, possibly both.

I just about remember being a toddler, somewhere between then and 5 was around the age I had the idea that everyone got an invisible extra brain floating next to them for every year they lived. Took me an embarrassingly long time (teens, IIRC) to realise that the witch-duck-weight-comparison scene in Monty Python and the Holy Grail wasn't a documentary, thanks to the part of the film captioned "Famous Historian". One time my dad fell ill, and he was talking to mum about "the tissue being damaged" while I was present, so I gave him a handkerchief (AKA "a tissue"). And while I don't remember this directly, my mum's anecdotes include me saying "fetrol fump", waving a spoon in a jam pan and calling this act "spelling", and when discovered running around with my pockets inside-out explaining myself as trying to fly because I apparently thought that the lining of a pocket was called a "wing".

When it comes to human novelty, I also quite often find there's a lot of remixing going on that just isn't immediately apparent. As Steve Jobs apparently once said, “Good artists copy; great artists steal.”, except Jobs stole that quote from Picasso.

It's easy to categorise different levels with AI, but which one of these counts as "novelty", and how often do humans ever achieve each of these grades?

0. Memorisation of the training set. Think: bunch of pictures, pick best fit.

1. Linear interpolation between any pair of elements in the training set. Think: simple cross-fade between any two pictures, but no tracking or distorting of features during that fade.

2. Let the training set form a basis vector space, and interpolate freely within the constraints of the examples. Think: if these pictures are faces, it would make any hair colour between the most extreme limits shown, etc.

3. Extrapolate beyond the examples. Think: Even if no black or white hair was visible, so long as several shades of grey were, it could reach the ideas of black or white hair.

4. Invent a new vector. Think: even if it had been trained only on black-and-white images, it could still invent green hair.

> But any Nobel-price winner has read significantly less than a basic LLM, and we see no LLM doing any tiny scientific achievement, let alone that high impact ones.

We do see them doing *tiny* scientific achievements, with extra emphasis on "tiny". Just like with using them in software, even the best "only" act like fresh graduates.

When any AI gets to high-impact… the following (fictional) quote comes to mind: "as soon as we started thinking for you, it really became our civilization."

mnky9800n

I feel like if you take the underlying transformer and apply it to other topics, e.g., eqtransformer, nobody questions this assumption. It's only when language is in the mix that people suggest they are something more, some kind of "artificial intelligence" akin to the beginnings of Data from Star Trek or C3P0 from Star Wars.

lubujackson

Human processing is very interesting and should likely lead to more improvements (and more understanding of human thought!)

Seems to me humans are very good at pattern matching, as a core requirement for intelligence. Not only that, we are wired to enjoy it innately - see sudoku, find Waldo, etc.

We also massively distill input information into short summaries. This is easy to see by what humans are blind to: the guy in a gorilla suit walking through a bunch of people passing a ball around, or basically any human behavior magicians use to deceive or redirect attention. We are bombarded with information constantly. This is the biggest difference between us and LLMs, as we have a lot more input data and also are constantly updating that information - with the added feature/limitation of time decay. It would be hard to navigate life without short term memory or a clear way to distinguish things that happened 10 minutes ago from 10 months ago. We don't fully recall each memory of washing the dishes but junk the vast, vast majority of our memories, which is probably the biggest shortcut our brains have over LLMs.

Then we also, crucially, store these summaries in memory as connected vignettes. And our memory is faulty but also quite rich for how "lossy" it must be. Think of a memory involving a ball from before the age of 10 and most people can drum up several relevant memories without much effort, no matter their age.

antirez

LLMs keep showing more and more that they are the wonder of AI we awaited for decades: talking machines that every two months make progress that two months before was considered impossible because of <put here some limit that was actually in the prejudice of the skeptical AI community> (just stochastic parrots, no reasoning possible without symbolic representations, they're nothing more than tokens, ...)

At the same time, part of the scientific community continues to diminish what was accomplished and the steps that are being made. A few months ago LeCun went so far as to tell new researchers to move away from LLMs since they are a dead end: imagine the disservice he did to the surely non-zero number of folks who followed the advice, putting themselves out of the AI research that matters. (Incidentally, this skepticism from the Meta AI head must have something to do with the fact that Meta, despite the huge efforts allocated, produced the worst LLM among Anthropic, OpenAI, DeepSeek -- I bet Zuckerberg is asking questions lately).

It's very hard to explain this behavior if not by psychological denial.

[EDIT: you can't see the score of this comment, but I can: it's incredible how it goes from 3, to -2, to 1, and so forth. The community is split in two, and it is pretty sad since this is not a matter of taste or political inclination: there must be a single truth]

ramblerman

I get the sentiment, but I actually think some skepticism in the system is healthy.

Billions are flowing towards LLMs, and Sam Altman will overpromise to anyone who will listen that AGI is just around the corner and the days of jobs are gone, to fill his coffers.

Additionally, if we begin to use these things in real production environments where mistakes matter, knowing the exact limitations is key.

None of this takes away from the fact that these are exciting times.

dr_dshiv

I can’t communicate enough how the skepticism (“this is just hype” or “LLMs are stochastic parrots”) is the vastly dominant thought paradigm in European academic circles.

So instead of everyone having some enthusiasm and some skepticism, you get a bifurcation where whole classes of people act as the skeptics and others as the enthusiasts. I view the strong skeptics as more “in the wrong” because they often don’t use LLMs much. If you are an actual enthusiastic user, you simply can’t get good performance without a very strong dose of skepticism towards everything LLMs output.

fragmede

I don't think everyone shares those doubts. The first time you catch an LLM in a lie is sobering, but there are lots of areas, and thus lots of users, for whom it doesn't hallucinate, because they're asking softball questions, or hallucinations just really aren't that big a deal (e.g. an LLM horoscope generator, or using it to write sci-fi).

So while we're on HN going back and forth about how outright lies by the system indict the whole thing for everybody, we should be careful to note that it's not for everybody. Or rather, it's a known limitation, so don't trust it to cite real cases for you as a lawyer; but using it to help you figure out what mens rea means in a practical sense by asking it questions about the concept, totally.

Honestly, hallucinations happen so rarely for me, because of the kinds of things I ask it, that I have no reason not to believe its answers in low-stakes situations, or situations on the level of horoscope generation, and I'm sure I'm not alone in treating ChatGPT that way, despite evidence to the contrary.

eagleislandsong

> I can’t communicate enough how the skepticism (“this is just hype” or “LLMs are stochastic parrots”) is the vastly dominant thought paradigm in European academic circles.

I'm very curious. If you don't mind taking the time to elaborate, will you give a few examples of such skepticism/naysaying? Thank you.

antirez

Yes, there is another part of the community that overhypes everything. I can expect that from the CEO of an AI company (especially if he is Altman), but from researchers? Also, the position that LLMs may reach superhuman expertise in certain fields in a short timeframe (a few years), now that reinforcement learning is starting to be applied to LLMs, may no longer be a totally crazy one. If it is possible to extend considerably the same approach seen in R1-Zero there could be low hanging fruit around the corner.

comeonbro

This article is about things which aren't limitations anymore!

You are applauding it as pushback for pushback's sake, but it's an article about limitations in biplane construction, published after we'd already landed on the moon.

suddenlybananas

Is there any evidence that these fundamental issues with compositionality have been resolved or are you just asserting it? Has the paper been replicated with a CoT model and had a positive result?

lowsong

> …it is pretty sad since this is not a matter of taste or political inclination: there must be a single truth

This is a more salient point than you perhaps realized. In life there is no single, absolutely knowable truth. Philosophy has spent the entire span of human existence grappling with this topic. The real risk with AI is not that we build some humanity-destroying AGI, but that we build a machine that is 'convincing enough' — and the idea that such a machine would be built by people who believe in objective truth is the most worrying part.

thrance

Depends: if you're a realist [1] (like most) then there can be such a thing as absolute truth, even if you may not always be able to access it.

[1] https://en.wikipedia.org/wiki/Philosophical_realism?wprov=sf...

askl56

This is teleologically false.

A teleological argument that assumes truth is contingent upon a specific worldview would indeed be flawed, because it would make truth an artifact of a given perspective rather than something independent of it.

mdp2021

> At the same time, part of the scientific community continues to diminish what was accomplished

Revisit the idea: part of the public is bewildered by voices that started calling "intelligence" what was and apparently still is the precise implementation of unintelligence. The fault is in some, many people - as usual.

Very recent state-of-the-art LLM models themselves declare that if the majority of their training data states that entity E is red they will say it's red, and if the majority says it's blue then they will say it's blue: that is the implementation of an artificial moron.

And in fact, very recent state-of-the-art LLM models state cretinous ideas that are child level - because "that's what they have heard" (stuck, moreover analytically, in the simplifications intrinsic in expression).

This architectural fault should be the foremost concern.

pera

Psychological denial of what exactly? And what part of the article/preprints are you commenting on?

Every time an article exposing some limitation of the current wave of LLMs is submitted to HN there are comments like yours and I genuinely cannot understand the point you are trying to make: There is no such thing as a perfect technology, everything has limitations, and we can only improve our current state of the art by studying these and iterate.

rthrfrd

I think if we referred to LLMs as AK (Artificial Knowledge) instead of AI it would be easier to have more cohesive discussions.

I don’t see how there can be a single truth when there is not even a single definition of many of the underlying terms (intelligence, AGI, etc) which this discipline supposedly defines itself by. Combine that with a lot of people with little philosophical perspective suddenly being confronted with philosophical topics and you end up with a discourse that personally I’ve mostly given up on participating in until things calm down again.

It feels like nobody remembers all the timelines for which we were supposed to have self-driving cars.

polotics

You are, I think, badly misrepresenting what Yann LeCun said: he didn't say LLMs were a dead end, he said to do research in directions that do not require billions of dollars of investment to show results. In particular for PhDs this is sensible, and in view of recent cheaper results, prescient.

fragmede

Sensible with the caveat that DeepSeek R1 still took millions of dollars of compute time, so you're not training the next one on the box in your basement with a pair of 3090s (though you could certainly fine-tune a shared quantized model). You can't run the full-sized model on anything cheap, so basement researchers still need access to a decent amount of funding, which likely requires outside help.

sitkack

It is becoming more and more important to determine for ourselves what is true and what is not. No person is right on most things, even when they are an expert in that thing. The biggest trap, is to believe someone because they are passionate, that they say it with conviction. Ignore most of the out of band signaling, take what they are saying and then also see if you can corroborate with another source.

There are so many people who are wrong about so many things.

I really appreciate that you are making your dev with ai videos, it shows people different, more humanistic ways of operating with AI.

Most of what I use AI for is to understand and relearn things I only thought I knew. This I think, is the most powerful use of AI, not in the code writing or the image generation, but in understanding and synthesis.

There is that hilarious tautological statement, "it is easy if you know it".

This video https://www.youtube.com/watch?v=TPLPpz6dD3A shows how to use AI to be a personal tutor using the Socratic Method. This is what people should be using AI for, have it test you for things you think you are already good at and you will find huge gaps in your own understanding. Now go apply it to things you have no clue about.

Speaking of parrots, a large volume of the anti-AI sentiment, even here, is by people confidently repeating half-truths they don't understand about what AI cannot do. One would need a pretty tight formal case to prove such things.

Everyone should be playing, learning and exploring with these new tools, not shutting each other down.

antirez

Yes, the stochastic parrots story is one of the strongest recent instances of experts in a field being made blind by their own expertise (the mental model they have of certain things), to the point of being incapable of seeing trivial evidence.

vunderba

There’s a certain irony in hearing someone describe an LLM as a "stochastic parrot" for the ten-thousandth time when the only reason they’re doing so is that they’ve seen a sufficient number of other people using the exact same term (so now it's in their proverbial training data).

sitkack

Another trope that stands out is that someone will take a model, run a battery of tests against it and then make general statements about what LLMs can and cannot do without understanding their architecture, the training data, and the training itself.

And then they dress it up to sound scientific, when really they are making hasty generalizations to support a preconceived bias.

guelo

But what for? Human learning is becoming of diminishing utility as the machines improve. For example, I am now able to create computer programs and beautiful artwork without taking the time to master these skills. You could say that I can use art and programming as basic tools to accelerate my learning of bigger things, but whatever that bigger thing is AI is coming for it too. I can't imagine the progress the machines will achieve in 10 years. We'll be replaced.

anon22981

The reason you overestimate their capabilities is because you use them for things you don't know anything about. It's like when your nephew made a simple HTML website for himself twenty years ago that was <h1>Hi I am Mark</h1> — it seemed impressive, but you just didn't know that it wasn't. Using LLMs in real-world complex cases (in programming or art) instantly reveals their significant shortcomings. They are a very good nephew for making stuff that seems impressive, but a bad expert or consultant.

kubb

I'm sorry but they don't "make progress that was considered impossible", especially not every two months.

They were predicted to end the software engineering profession for almost four years already. And it just doesn't happen, even though they can bang out a perfect to-do list in React in a matter of seconds.

LLMs have incremental improvements on the quality of their responses as measured by benchmarks. The speed and cost of inference has also been improving. Despite that there was no major breakthrough since GPT 3.

People keep trying to make them reason, and keep failing at it.

throw310822

> They were predicted to end the software engineering profession for almost four years already

ChatGPT was launched on November 30 2022. Two years and two months ago. The fact that in such a short timeframe you're talking about missed predictions is absurd, but telling of the accelerated timeframe in which we're living. The fact is that currently AI and LLMs are going through a phase of explosive improvement, to the point we can expect enormous improvements in capabilities every six months or so.

solumunus

I use LLM’s daily so I’m no skeptic. We are not seeing enormous improvements every 6 months, that’s hyperbolic. There has been a significant improvement since GPT 3.5, I’ll give you that, but even in those ~2 years I don’t think I’d describe the improvement as “enormous”. The capabilities are similar with output quality improving by a noticeable degree.

pera

OpenAI API for GPT-3 was launched on June 11, 2020, that's four years and seven months ago:

https://news.ycombinator.com/item?id=23489653

kubb

And what has enormously improved since ChatGPTs launch? Maybe you should ask it what it "thinks" about the hype surrounding it.

Capricorn2481

> to the point we can expect enormous improvements in capabilities every six months or so

Not really, we just can see we've had improvements. That is not evidence of upcoming improvement.

sgt101

SE is a good example - I get a lot of help from LLM tools and I think we're learning how to use them better across realistic SDLC processes as well, but we're not replacing lots of people at the moment. On the other hand I saw a business case from one of the big SIs (not my employer, but in a deck that was shown by the SI in a discussion) that described the need to move their Indian software dev workforce from 350k FTE to 50k FTE over the next five years.

I think that the onshore impacts will be much lower or negligible, or possibly even positive, because so much work has been offshored already, and as is well worn in every discussion, Jevons paradox may drive up demand significantly (to be fair I believe this, as wherever I have worked we've had 3x+ demand (with business cases) for development projects and had to arbitrarily cull 2x of it at the beginning of each year). So, just like the 30 people in India that are working on my project won't do anything useful unless we feed the work to them, the LLMs won't do anything useful either. And just like we have to send lots of work back to India because it's not right, the same is true of LLMs. The difference is that I won't spend 4 hrs on a Friday afternoon on Teams discussing it.

But this is not surprising, because we've had big impacts from tools like IDEs, VMs, and compilers which have driven seismic changes in our profession. I think that LLMs are just another one of those.

What I'm watching for is an impact in a non tech domain like healthcare or social care. These are important domains that are overwhelmed with demand and riddled with makework, yet so far LLM's have made very little impact. At least, I am not seeing health insurance rates being cut, hospital waiting lists fall or money and staff being redeployed from back office functions to front line functions.

Why hasn't this started?

vachina

LLMs can hammer out existing solutions to problems, but not never-before-seen problems.

starchild3001

What a poorly informed article. It's very shallow and out of touch with LLM research. As it stands, 6-12 month old models are system 1 thinkers; everybody knows this and knew this even at the time. You need system 2 thinking (test-time compute) for more complex logical, algorithmic and reasoning tasks. We knew this when Daniel Kahneman wrote Thinking, Fast and Slow (over a decade ago) and we still know it today. So LLMs can think, but they have to be programmed to think (a la system 2, reasoning, thinking models). There's nothing inherently wrong or limited with LLMs themselves as far as we can tell.

astrange

This is an example of "metaphor-driven development" in AI, which Phil Agre criticized a few decades ago.

System 1/System 2 isn't a real thing. It's just a metaphor Kahneman invented for a book. AI developers continually find metaphors about the brain, decide they are real, implement something which they give the same name, decide it's both real and the same thing because they have given it the same name, and then find it doesn't work.

(Another common example is "world model", something which has never had a clear meaning, and if you did define it you'd find that people don't have one and don't need one.)

tshadley

> To understand the capabilities of LLMs, we evaluate GPT3 (text-davinci-003) [11], ChatGPT (GPT-3.5-turbo) [57] and GPT4 (gpt-4)

Oh dear, this is embarrassing. Anil Ananthaswamy, are you aware a year in AI research now is like 10 years in every other field?

geoffhill

Idk, `o3-mini-high` was able to pop this Prolog code out in about 20 seconds:

  solve(WaterDrinker, ZebraOwner) :-
      % H01: Five houses with positions 1..5.
      Houses = [ house(1, _, norwegian, _, _, _),  % H10: Norwegian lives in the first house.
                 house(2, blue, _, _, _, _),       % H15: Since the Norwegian lives next to the blue house,
                 house(3, _, _, milk, _, _),        %       and house1 is Norwegian, house2 must be blue.
                 house(4, _, _, _, _, _),
                 house(5, _, _, _, _, _) ],
  
      % H02: The Englishman lives in the red house.
      member(house(_, red, englishman, _, _, _), Houses),
      % H03: The Spaniard owns the dog.
      member(house(_, _, spaniard, _, dog, _), Houses),
      % H04: Coffee is drunk in the green house.
      member(house(_, green, _, coffee, _, _), Houses),
      % H05: The Ukrainian drinks tea.
      member(house(_, _, ukrainian, tea, _, _), Houses),
      % H06: The green house is immediately to the right of the ivory house.
      right_of(house(_, green, _, _, _, _), house(_, ivory, _, _, _, _), Houses),
      % H07: The Old Gold smoker owns snails.
      member(house(_, _, _, _, snails, old_gold), Houses),
      % H08: Kools are smoked in the yellow house.
      member(house(_, yellow, _, _, _, kools), Houses),
      % H11: The man who smokes Chesterfields lives in the house next to the man with the fox.
      next_to(house(_, _, _, _, _, chesterfields), house(_, _, _, _, fox, _), Houses),
      % H12: Kools are smoked in a house next to the house where the horse is kept.
      next_to(house(_, _, _, _, horse, _), house(_, _, _, _, _, kools), Houses),
      % H13: The Lucky Strike smoker drinks orange juice.
      member(house(_, _, _, orange_juice, _, lucky_strike), Houses),
      % H14: The Japanese smokes Parliaments.
      member(house(_, _, japanese, _, _, parliaments), Houses),
      % (H09 is built in: Milk is drunk in the middle house, i.e. house3.)
      
      % Finally, find out:
      % Q1: Who drinks water?
      member(house(_, _, WaterDrinker, water, _, _), Houses),
      % Q2: Who owns the zebra?
      member(house(_, _, ZebraOwner, _, zebra, _), Houses).
  
  right_of(Right, Left, Houses) :-
      nextto(Left, Right, Houses).
  
  next_to(X, Y, Houses) :-
      nextto(X, Y, Houses);
      nextto(Y, X, Houses).
Seems ok to me.

   ?- solve(WaterDrinker, ZebraOwner).
   WaterDrinker = norwegian,
   ZebraOwner = japanese .

orbital-decay

That's because it uses a long CoT. The actual paper [1] [2] talks about the limitations of decoder-only transformers predicting the reply directly, although it also establishes the benefits of CoT for composition.

This is all known for a long time and makes intuitive sense - you can't squeeze more computation from it than it can provide. The authors just formally proved it (which is no small deal). And Quanta is being dramatic with conclusions and headlines, as always.

[1] https://arxiv.org/abs/2412.02975

[2] https://news.ycombinator.com/item?id=42889786

antirez

LLMs using CoT are also decoder-only; it's not a paradigm shift, as people want to claim now so as to avoid saying they were wrong: it's still next-token prediction, forced to explore more possibilities in the space it contains. And with R1-Zero we also know that LLMs can train themselves to do so.

janalsncm

That’s a different paper than the one this article describes. The article describes this paper: https://arxiv.org/abs/2305.18654

mkl

The article describes both papers.

usaar333

A paper that came out 15 months ago?

teruakohatu

gpt-4o, asked to produce swi-prolog code, gets the same result using very similar code. gpt-4-turbo can do it with slightly less nice code. gpt-3.5-turbo struggled to get the syntax correct but I think with some better prompting could manage it.

CoT is definitely optional. Although I am sure all LLMs have seen this problem explained and solved in training data.

mycall

This doesn't include encoder-decoder transformer fusion for machine translation, or encoder-only models like BERT used for text classification and named entity recognition.

leonidasv

Also, notice that the original study is from 2023.

echelon

The LLM doesn't understand it's doing this, though. It pattern matched against your "steering" in a way that generalized. And it didn't hallucinate in this particular case. That's still cherry picking, and you wouldn't trust this to turn a $500k screw.

I feel like we're at 2004 Darpa Grand Challenge level, but we're nowhere near solving all of the issues required to run this on public streets. It's impressive, but leaves an enormous amount to be desired.

I think we'll get there, but I don't think it'll be in just a few short years. The companies hyping that this accelerated timeline is just around the corner are doing so out of existential need to keep the funding flowing.

simonw

Solving it with Prolog is neat, and a very realistic way of how LLMs with tools should be expected to handle this kind of thing.

EdwardDiego

I would've been very surprised if Prolog to solve this wasn't something that the model had already ingested.

Early AI hype cycles, after all, are where Prolog, like Lisp, shone.

simonw

I'm certain models like o3-mini are capable of writing Prolog of this quality for puzzles they haven't seen before - it feels like a very straight-forward conversion operation for them.

lsy

If the LLM’s user indicates that the input can and should be translated as a logic problem, and then the user runs that definition in an external Prolog solver, what’s the LLM really doing here? Probabilistically mapping a logic problem to Prolog? That’s not quite the LLM solving the problem.

xyzzy123

Do you feel differently if it runs the prolog in a tool call?

layer8

Not the user you’re replying to, but I would feel differently if the LLM responded with “This is a problem I can’t reliably solve by myself, but there’s a logic programming system called Prolog for which I could write a suitable program that would. Do you have access to a Prolog interpreter, or could you give me access to one? I could also just output the Prolog program if you like.”

Furthermore, the LLM does know how Prolog’s unification algorithm works (in the sense that it can provide an explanation of how Prolog and the algorithm works), yet it isn’t able to follow that algorithm by itself like a human could (with pen and paper), even for simple Prolog programs whose execution would fit into the resource constraints.

This is part of the gap that I see to true human-level intelligence.
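
For reference, the unification step itself is small enough to follow with pen and paper; a minimal sketch (occurs check omitted, and this is an illustration, not Prolog's actual implementation):

  def is_var(t):
      # Prolog convention: variables are capitalised atoms; atoms are lowercase strings
      return isinstance(t, str) and t[0].isupper()

  def walk(t, subst):
      while is_var(t) and t in subst:
          t = subst[t]
      return t

  def unify(a, b, subst=None):
      """Return a substitution making a and b equal, or None if they don't unify."""
      subst = {} if subst is None else dict(subst)
      a, b = walk(a, subst), walk(b, subst)
      if a == b:
          return subst
      if is_var(a):
          subst[a] = b
          return subst
      if is_var(b):
          subst[b] = a
          return subst
      # compound terms are tuples: ("owns", "X", "zebra")
      if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b) and a[0] == b[0]:
          for x, y in zip(a[1:], b[1:]):
              subst = unify(x, y, subst)
              if subst is None:
                  return None
          return subst
      return None

  print(unify(("owns", "X", "zebra"), ("owns", "japanese", "zebra")))  # {'X': 'japanese'}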

baq

But the problem is solved. Depends what you care about.

endofreach

Psst, don't tell my clients that it's not actually me but the languages syntax i use, that's solving their problem.

choeger

So you asked an LLM to translate. It excels at translation. But ask it to solve and it will, inevitably, fail. But that's also expected.

The interesting question is: Given a C compiler and the problem, could an LLM come up with something like Prolog on its own?

charlieyu1

I think it could even solve it; these kinds of riddles are heavily represented in the training data.

n144q

Then what about new, unseen riddles that don't have a similar pattern to existing ones? That's the question people are asking.

intended

Science is not in the proving of it.

It’s in the disproving of it, and in the finding of the terms that help others understand the limits.

I don't know why it took me so long to come to that sentence. Yes, everyone can trot out their core examples that reinforce the point.

The research is motivated by these examples in the first place.

Agraillo

Good point. LLMs can be treated as "theories" and then they definitely meet falsifiability [1], allowing researchers to keep finding "black swans" for years to come. Theories in this case can be different. But if the theory is that of a logical or symbolic solver, then Wolfram's Mathematica may struggle with understanding human language as an input, but when it comes to evaluating the results, well, I think Stephen (Wolfram) can sleep soundly, at least for now.

[1] https://en.wikipedia.org/wiki/Falsifiability

est

I'd say not only LLMs struggle with these kinds of problems; 99% of humans do too.

tuatoru

    solve (make me a sandwich)
Moravec's Paradox is still a thing.

mmcnl

There's so much talk about the advancements in AI/LLMs, yet for me ChatGPT as of this date is basically just a faster search engine without cookie banners, clickbait and ads. It hallucinates a lot and it can keep very limited context. Why is there so much promise about future progress but so little actual progress?

EA-3167

It's the same cycle we saw with Crypto, there's so much money flying around that the motivation to "believe" is overwhelming. The hype is coming from all directions, and people are social animals that put greater weight on words that come from multiple sources. It's also a platform for people excited about the future to fantasize, and for people terrified of the future to catastrophize.

knowaveragejoe

I have to wonder how you are using ChatGPT to get a lot of hallucinations or run into issues with limited context.

mikeknoop

One must now ask whether research results are analyzing pure LLMs (eg. gpt-series) or LLM synthesis engines (eg. o-series, r-series). In this case, the headline is summarizing a paper originally published in 2023 and does not necessarily have bearing on new synthesis engines. In fact, evidence strongly suggests the opposite given o3's significant performance on ARC-AGI-1 which requires on-the-fly composition capability.

orbital-decay

It's Quanta being misleading. They mention several papers but end up with this [1] which talks about decoder-only transformers, not LLMs in general, chatbots, or LLM synthesis engines, whatever that means. The paper also proves that CoT-like planning lets you squeeze more computation from a transformer, which is... obvious? but formally proven this time. Models trained to do CoT don't have some magical on-the-fly compositional ability, they just invest more computation (could be dozens of millions of tokens in the case of o3 solving the tasks from that benchmark).

[1] https://arxiv.org/abs/2412.02975

kebsup

I've managed to get LLMs to fail on simple questions that require thinking graphically - 2D or 3D.

An example would be: you have a NxM grid. How many shapes of XYZ shape can you fit on it?

However, thinking of transformer-based video game models, AI can be trained to have a good representation of 2D/3D worlds. I wonder how they can be combined so that this graphical representation is used to compute text output.
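
A toy brute-force checker for that kind of grid question (my own example; translations only, rotations omitted for brevity), the sort of thing one could use to verify an LLM's answer:

  def placements(n, m, shape):
      # shape is a set of (row, col) offsets; enumerate all translated copies that fit
      h = max(r for r, _ in shape) + 1
      w = max(c for _, c in shape) + 1
      return [frozenset((r + dr, c + dc) for dr, dc in shape)
              for r in range(n - h + 1) for c in range(m - w + 1)]

  def max_fit(n, m, shape):
      # exhaustive search for the largest set of non-overlapping copies
      opts = placements(n, m, shape)
      best = 0
      def search(i, used, count):
          nonlocal best
          best = max(best, count)
          for j in range(i, len(opts)):
              if not (opts[j] & used):
                  search(j + 1, used | opts[j], count + 1)
      search(0, frozenset(), 0)
      return best

  L_TROMINO = {(0, 0), (1, 0), (1, 1)}
  print(max_fit(2, 3, L_TROMINO))   # 1 (a second copy would need a rotation, which this sketch omits)
  print(max_fit(3, 3, L_TROMINO))   # 2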

klodolph

When one of these limitations gets spelled out in an article, it feels like six months later, somebody has a demo of a chatbot without that particular limitation.

These limitations don’t seem in any way “fundamental” to me. I’m sure there are a ton of people gluing LLMs to SAT solvers as we speak.

chefandy

Could you give an example of something we recently solved that was considered an unsolvable problem six months beforehand? I don't have any specific examples, but it seems like most of the huge breakthrough discoveries I've seen announced end up being overstated, and for practical usage, our choice of LLM-driven tools is only marginally better than it was a couple of years ago. It seems like the preponderance of practical advancement in recent times has come from the tooling/interface improvements rather than generating miracles from the models themselves. But it could be that I just don't have the right use cases.

munchler

Take a look at the ARC Prize, which is a test for achieving "AGI" created in 2019 by François Chollet. Scroll down halfway on the home page and ponder the steep yellow line on the graph. That's what OpenAI o3 recently achieved.

[0] https://arcprize.org/

[1] https://arcprize.org/blog/oai-o3-pub-breakthrough

mrshadowgoose

Reviewing the actual problems is highly recommended: https://kts.github.io/arc-viewer/

They're not particularly difficult, but clearly require reasoning to solve.

EdwardDiego

So we're only 12% from AGI?

I'm dubious tbh. Given we still can't simulate a nematode.

0x008

> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set…

Sounds fishy to me

liamwire

Not quite what you asked for, but it seems tangentially related and you might find it interesting: https://r0bk.github.io/killedbyllm/

janalsncm

Would be interesting to have a list of startups killed by ChatGPT as well.

gallerdude

Completely disagree… there are a crazy number of cases that didn't work, until the models scaled to a point where they magically did.

Best example I can think of is the ARC AGI benchmark. It was seen to measure human-like intelligence through special symmetries and abstract patterns.

From GPT-2 to GPT-4 there was basically no progress, then o1 got about 20%. Now o3 has basically solved the benchmark.

chefandy

I guess what I'm probably not seeing from my vantage point is that translating into a better experience with the tools available. I just cancelled a ChatGPT plus subscription because it just didn't seem useful enough to justify the price. I absolutely understand that there are people for whom it is, but nearly everyone I see that talks a lot about the value of AI either has use cases that I don't care about such as automated "content" generation or high-volume lowish-skill code generation, or they see achieving a progressively more difficult set of benchmarks as a useful end in itself. I like copilot autocomplete when I'm coding, but the quality of that hasn't dramatically changed. I don't give a damn about benchmarks-- I only care what I get from it practically. I have absolutely no interest in using ChatGPT as a therapist or companion because I value human connection and have access to it. So far I simply don't see significant changes in what comes out vs what gets typed in for practical usage. I wouldn't give ChatGPT logic problems to solve except maybe for generating code because I know code well enough to quickly evaluate its output. If the caveat is "hey FYI this thing might hide some frustratingly plausible looking bullshit in the answer so double-check its work," then what good is it really for hard problems if you just have to re-do them anyway?

The same thing is true with image generation. Sure, it's better in ways that are sort-of meaningful for low-value professional or hobby usage, but it's barely budged the barriers to becoming good enough for high-end media production.

I totally believe that this technology is improving and when you're looking at it in isolation, those improvements seem meaningful. But I just don't see that yet translating into things most of the general public can sink their teeth into. With things like the (still) shitty google search "enhancements", and users being forced into AI-driven chat workflows or having big loud not-really-useful UI elements dedicated to AI features, in some ways they've made people's experience using computers meaningfully worse.

Just like with Mastodon, I see a huge disconnect with the tech crowd's excitement with what's happening with the technology, and how that ends up working for users that need to actually solve their problems with that technology.

intelkishan

Performance of OpenAI o3 in the ARC-AGI challenge fits the bill, however the model is not released publicly.

wslh

SMT solvers really.

levocardia

Came here to say the same thing. Bet o3 and Claude 3.5 Opus will crush this task by the end of 2025.

xigency

I've been slacking but yeah it's on my list.

changoplatanero

By the time these academic studies get published they are usually already several months out of date. o3-mini was released yesterday, and if one wants to know about the limitations of current technology, they are much better off checking twitter than some research paper.

FuckButtons

I think the breathless hype train of twitter is probably the worst place to get an actually grounded take on what the real world implications of the technology is.

Seeing the 100th example of an llm generating some toy code for which there are a vast number of examples of approximately similar things in the training corpus doesn’t give you a clearer view of what is or isn’t possible.

unification_fan

I think that most of the developers who advocate for AI coding have never worked all by themselves on projects with over 500/1000 files. Because if they had they would not advocate for AI coding.

AtlasBarfed

I posted this earlier, but I wanted a Java port of sed for ... reasons, and despite the existence of man pages and source code it couldn't do anything but the most basic flags.

Imo this should be low-hanging fruit. Porting non-trivial but small code (3-4 core files) that is already debugged and interface-specified should be what an LLM excels at.

elicksaur

Or neither. Try it yourself.

For me, LLMs still don’t meet basic usefulness and are a net negative when I try to use them. I push code daily for my job.

kybernetyk

I have a good use case for them: communication with the bureaucracy of my country. I tell my LLM of choice to write a letter to $whoever about $whatever, then I print it out (yes, we still have to do this, as emails don't get accepted) and send it off. I don't even need to proofread it, because if there's a mistake the bureaucracy will tell me in another letter. So the burden of correctness checking is on some bureaucrat, which saves me time and mental resources.

I wouldn't ever use a LLM for anything where correctness matters (code) because I'd spend the same amount of time checking the generated code as writing it myself. But a letter to my tax office? Heck, why not. If something goes really wrong I can always say "gee, I made a mistake let's try it again".

cropcirclbureau

So what, you use it to spam and waste other people's time? I know, dealing with government bureaucracy and corruption is soul-leeching, but spam was always one of the golden use cases for generative AI.

Xmd5a

The paper is recent and being discussed here: https://news.ycombinator.com/item?id=42889786

anon291

It fundamentally does not matter. Matrix multiplication does not erase the truth of Godel and Turing.

kadoban

Godel and Turing just proved that there are some true things that can't be proved, and things that cannot be computed. They didn't show where those boundaries are.

They certainly didn't show those boundaries to be below human cognition level.

anon291

Godel proved that there are unprovable statements. Turing showed that certain classes of problems can only be solved by machines with infinite tapes. Thus no bounded LLM can possibly solve every Turing-computable problem. Only a theoretically infinite chain of thought can possibly get us that power.

Godel then tells us that, if we have such a system, there are things where this system may get stuck.

Indeed this is what we see in chain of thought models. If you give them an impossible problem they either give up or produce a seemingly infinite series of tokens before emitting the </think> tag.

Turing tells us that examining any set of matrices modeling a finite state machine over an infinite token stream is the halting problem.

andrewflnr

Yeah, the grounded take is that Turing and Gödel apply just as much to human intelligence. If not, someone please go ahead and use this to physically prove the existence of an immortal, hypercomputational soul.

drdeca

Who is trying to “erase the truth of Gödel and Turing”? (Well, some cranks are, but I don’t think that’s who you are talking about.)

Gödel and Turing’s results do not appear to give any reason that a computer program can’t do what a person can do.

anon291

That's not the point. A computer program with a finite number of steps (an autoregressive LLM without chain of thought) has a limit on what it can reason about in one step. This article does a lot of wordcelling to show this obvious point.

mohsen1

https://chatgpt.com/share/679f0353-12bc-8007-91cf-dd63d52044...

O1 Pro gets the answer with proper reasoning

Can’t tell if this is due to data contamination or if it really figured it out?

How can we form the question in another way to avoid data contamination?

mohsen1

With the prompt modified to look less like the original, it thought 6x more:

https://chatgpt.com/share/679f086d-a758-8007-b240-38e6843037...

pollinations

I don't think individual examples can settle these kinds of discussions; for me the amount of thinking can easily vary by 6x with exactly the same input and parameters.

null

[deleted]

Workaccount2

"Recent results" = two year old study looking at GPT-3, 3.5, and first gen 4.

WillAdams

Perhaps applying:

https://www.doc.ic.ac.uk/~re14/Evans-R-2020-PhD-Thesis.pdf

"Kant’s Cognitive Architecture"

will help?

> ...we provide a precise formalization of what it means to "make sense" of a sensory sequence. According to our definition, making sense means constructing a symbolic causal theory that explains the sensory sequence and satisfies a set of unity conditions that were inspired by Kant's discussion in the first half of the Critique of Pure Reason. According to our interpretation, making sense of sensory input is a type of program synthesis, but it is unsupervised program synthesis.

> program synthesis.