Why do LLMs have emergent properties?
120 comments · May 8, 2025 · anon373839
andy99
Yes, deep learning models only interpolate, and essentially represent an effective way of storing data-labeling effort. Doesn't mean they're not useful, just not what tech-adjacent promoters want people to think.
john-h-k
> Yes, deep learning models only interpolate
What do you mean by this? I don’t think the understanding of LLMs is sufficient to make this claim
albertzeyer
Just by the mathematical definition of interpolation (https://en.wikipedia.org/wiki/Interpolation), any function which approximates the given regression points (training data), and is defined for values in between (unseen data), will interpolate it. Maybe you're thinking of linear interpolation specifically, but there are many types of interpolation, and any mathematical function, any neural network, is just another form of interpolation.
Interpolation is also related to extrapolation. In higher dimensional spaces, the distinction is not so clear. In terms of machine learning, you would call this generalization.
The question is more, is it a good type of interpolation/extrapolation/generalization. You measure that on a test set.
And mathematically speaking, your brain also is just doing another type of interpolation/extrapolation.
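To make the interpolation/extrapolation distinction concrete, here is a minimal sketch (illustrative only; the sine target and the cubic fit are arbitrary choices of mine, not anything from the comment). The fitted function tracks the data well between the training points and drifts badly outside their range:

    # Fit a cubic polynomial ("the model") to noisy samples of sin(x) on [0, 3],
    # then evaluate inside the training range (interpolation) and outside it
    # (extrapolation). Toy example only.
    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0.0, 3.0, 20)
    y_train = np.sin(x_train) + rng.normal(0.0, 0.05, x_train.shape)

    model = np.poly1d(np.polyfit(x_train, y_train, deg=3))

    for x in (1.5, 2.9, 5.0, 8.0):
        where = "inside " if 0.0 <= x <= 3.0 else "outside"
        print(f"x={x:4.1f} ({where} training range)  |error| = {abs(model(x) - np.sin(x)):.3f}")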
andy99
An LLM is a classifier; there is lots of research into how deep learning classifiers work, and I haven't seen it contradicted when applied to LLMs.
kevinsync
My hot take is that what some people are labeling as "emergent" is actually just "incidental encoding" or "implicit signal" -- latent properties that get embedded just by nature of what's being looked at.
For instance, if you have a massive tome of English text, a rather high percentage of it will be grammatically correct (or close to it), syntactically sound, and understandable, because humans who speak good English took the time to write it and wrote it how other humans would expect to read or hear it. This, by its very nature, embeds "English language" knowledge due to sequence, word choice, normally-hard-to-quantify expressions (colloquial or otherwise), etc.
When you consider source data from many modes, there's all kinds of implicit stuff that gets incidentally written.. for instance, real photographs of outer space or deep sea would only show humans in protective gear, not swimming next to the Titanic. Conversely, you won't see polar bears eating at Chipotle, or giant humans standing on top of mountains.
There's a statistical probability of "this showed up enough in the training data to loosely confirm its existence" / "can't say I ever saw that, so let's just synthesize it" aspect of the embeddings that one person could interpret as "emergent intelligence", while another could just-as-convincingly say it's probabilistic output that is mostly in line with what we expect to receive. Train the LLM on absolute nonsense instead and you'll receive exactly that back.
atoav
Emergent as I have known and used it before is when more complex behavior emerges from simple rules.
My go-to example for this was Game of Life, where from very simple rules, a very organically behaving (Turing-complete) system emerges. Now Game of Life is a deterministic system, meaning that the same rules and the same start configuration will play out in exactly the same way each time — but given the simplicity of the logic and the rules the resulting complexity is what I'd call emergent.
So maybe this is more about the definition of what we'd call emergent and what not.
As someone who has programmed Markov chains, where the stochastic interpolation really shines through, transformer-based LLMs definitely show some emergent behavior one wouldn't have immediately suspected just from the rules. Emergent does not mean "conscious" or "self-reflective" or anything like that. But the things an LLM can infer from its training data are already quite impressive.
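For anyone who hasn't seen it, the full rule set of Game of Life really is this small. A minimal sketch (assuming numpy and a grid that wraps around at the edges); all of the organic-looking complexity emerges from the single rule expression inside step():

    # Minimal Conway's Game of Life step. Grid wraps around (toroidal).
    import numpy as np

    def step(grid: np.ndarray) -> np.ndarray:
        # Count the 8 neighbours of every cell by shifting the grid.
        neighbours = sum(
            np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0)
        )
        # Birth on exactly 3 neighbours, survival on 2 or 3. That's all.
        return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(int)

    glider = np.zeros((8, 8), dtype=int)
    glider[1, 2] = glider[2, 3] = glider[3, 1] = glider[3, 2] = glider[3, 3] = 1
    for _ in range(4):          # after 4 steps the glider has moved itself diagonally
        glider = step(glider)
    print(glider)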
gond
Interesting. Is there a quantitative threshold to emergence anyone could point at with these smaller models? Tracing the thoughts of a large language model is probably the only way to be sure, or is it?
gond
Disregarding the downvotes, I mean this as a serious question.
From the linked article: “We don’t know an “algorithm” for this, and we can’t even begin to guess the required parameter budget or the training data needed.”
Why not, at least the external ones? The computational resources and the size of the training dataset are quantifiable from an input point of view. What gets used is not, but the input size should be.
zmmmmm
This seems superficial and doesn't really get to the heart of the question. To me it's not so much about bits and parameters but a more interesting fundamental question of whether pure language itself is enough to encompass and encode higher level thinking.
Empirically we observe that an LLM trained purely to predict a next token can do things like solve complex logic puzzles that it has never seen before. Skeptics claim that actually the network has seen at least analogous puzzles before and all it is doing is translating between them. However the novelty of what can be solved is very surprising.
Intuitively it makes sense that at some level, that intelligence itself becomes a compression algorithm. For example, you can learn separately how to solve every puzzle ever presented to mankind, but that would take a lot of space. At some point it's more efficient to just learn "intelligence" itself and then apply that to the problem of predicting the next token. Once you do that you can stop trying to store an infinite database of parallel heuristics and just focus the parameter space on learning "common heuristics" that apply broadly across the problem space, and then apply that to every problem.
The question is, at what parameter count and volume of training data does the situation flip to favoring "learning intelligence" rather than storing redundant domain-specialised heuristics? And is it really happening? I would have thought just looking at the activation patterns could tell you a lot, because if common activations happen for entirely different problem spaces then you can argue that the network has to be learning common abstractions. If not, maybe it's just doing really large-scale redundant storage of heuristics.
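One crude way to start on that activation check, as a hedged sketch rather than a real interpretability probe: run prompts from different problem spaces through a small open model and compare their mean hidden states. This assumes the Hugging Face transformers library and the gpt2 checkpoint, and raw hidden-state similarity is a blunt instrument (hidden states are famously anisotropic), so treat the numbers as a starting point only:

    # Compare mean hidden states for prompts from different problem domains.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2").eval()

    def mean_hidden(prompt: str) -> torch.Tensor:
        with torch.no_grad():
            out = model(**tok(prompt, return_tensors="pt"))
        return out.last_hidden_state.mean(dim=1).squeeze(0)

    a = mean_hidden("If Alice is taller than Bob and Bob is taller than Carol, who is tallest?")
    b = mean_hidden("If 7 > 5 and 5 > 2, which number is largest?")
    c = mean_hidden("The weather in Paris in April is usually mild.")

    cos = torch.nn.functional.cosine_similarity
    print("logic vs. arithmetic:", cos(a, b, dim=0).item())
    print("logic vs. weather:   ", cos(a, c, dim=0).item())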
disambiguation
Good take, but while we're invoking intuition, something is clearly missing in the fundamental design, given that real brains don't need to consume all the world's literature before demonstrating intelligence. There's some missing piece w.r.t. self-learning and sense-making. The path to emergent reasoning you lay out is interesting and might happen anyway as we scale up, but the original idea was to model these algorithms in our own image in the first place - I wonder if we won't discover that missing piece first.
adampwells
> However the novelty of what can be solved is very surprising.
I've read that the 'surprise' factor is much reduced when you actually see just how much data these things are trained on - far more than a human mind can possibly hold and (almost) endlessly varied. I.e. there is 'probably' something in the training set close to what 'surprised' you.
cratermoon
Alternate view: Are Emergent Abilities of Large Language Models a Mirage? https://arxiv.org/abs/2304.15004
"Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance."
foobarqux
The author himself explicitly acknowledges the paper but then incomprehensibly ignores it ("Even so, many would like to understand, predict, and even facilitate the emergence of these capabilities."). It's like saying "some say [foo] doesn't exist but even so many would like to understand [foo]". It's incoherent.
autoexec
No point in letting facts get in the way of an entire article I guess.
K0balt
A decent thought-proxy for this: powered flight.
An aircraft can approach powered flight without achieving it. For a given amount of thrust and set of aerodynamic characteristics, the aircraft will have an effective weight dynamic_weight = static_weight - x, where x is a combination of the aerodynamic characteristics and the amount of thrust applied.
In no case where dynamic_weight > 0 will the aircraft fly, even though it exhibits characteristics of flight, i.e. the transfer of aerodynamic forces to counteract gravity.
So while it progressively exhibits characteristics of flight, it is not capable of any kind of flight at all until the critical point of dynamic_weight<0. So the enabling characteristics are not “emergent”, but the behavior is.
I think this boils down to a matter of semantics.
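A tiny numeric rendering of the analogy (the numbers are arbitrary): the enabling quantity improves smoothly, but the behavior flips on only past the threshold.

    # Thrust/lift (x) improves smoothly; "flies" flips from False to True at the threshold.
    static_weight = 100.0
    for x in (0, 40, 80, 99, 101, 120):
        dynamic_weight = static_weight - x
        print(f"x={x:3}  dynamic_weight={dynamic_weight:6.1f}  flies={dynamic_weight < 0}")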
scopemouthwash
“Thought-proxy”?
I think the word you’re looking for is “analogy”.
K0balt
Analogy is a great word proxy for thought proxy.
Yeah, I’m pretty sure analogy would have been fine there, I think maybe it fell off the edge of my vocabulary for a moment? Not really sure, but I really can’t think of any reason why “thought proxy” would have been more descriptive, informative, or accurate ¯\_(ツ)_/¯
jebarker
Yes, this paper is under-appreciated. The point is that we as humans decide what constitutes a given task we're going to set as a bar, and it turns out that statistical pattern matching can solve many of those tasks to a reasonable level (we also get to define "reasonable") when there's sufficient scale of parameters and data, but that tip-over point is entirely arbitrary.
Al-Khwarizmi
The continuous metrics the paper uses are largely irrelevant in practice, though. The sudden changes appear when you use metrics people actually care about.
To me the paper is overhyped. Knowing how neural networks work, it's clear that there are going to be underlying properties that vary smoothly. This doesn't preclude the existence of emergent abilities.
moffkalast
That has been a problem with most LLM benchmarks. Any test that's scored in percentages effectively becomes logarithmic near the top of the scale: getting from, say, 90% to 95% is not a linear 5% improvement but more like a 2x (or even 10x) improvement in practical terms, since the metric is already nearly maxed out and only the extreme edge cases remain, which are much harder to master.
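The arithmetic behind that: going from 90% to 95% accuracy halves the error rate, and 95% to 99% cuts it by 5x, even though the headline numbers barely move.

    # Relative error reduction hiding behind small-looking accuracy gains.
    for acc_before, acc_after in [(0.90, 0.95), (0.95, 0.99)]:
        err_before, err_after = 1 - acc_before, 1 - acc_after
        print(f"{acc_before:.0%} -> {acc_after:.0%}: "
              f"error {err_before:.1%} -> {err_after:.1%} "
              f"({err_before / err_after:.0f}x fewer errors)")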
lordnacho
What seems a bit miraculous to me is, how did the researchers who put us on this path come to suspect that you could just throw more data and more parameters at the problem? If the emergent behavior doesn't appear for moderate sized models, how do you convince management to let you build a huge model?
TheCoreh
This is perhaps why it took us this long to get to LLMs: the underlying math and ideas were (mostly) there, and even if the Transformer as an architecture wasn't ready yet, it wouldn't surprise me if throwing sufficient data/compute at a worse architecture would also produce comparable emergent behavior.
There needed to be someone willing to try going big at an organization with sufficient idle compute/data just sitting there, not a surprise it first happened at Google.
hibikir
But we got here step by step, as other interesting use cases came up by using somewhat less compute. Image recognition, early forms of image generation, AlphaGo, AlphaZero for chess. All earlier forms of deep neural networks that are much more reasonable than training a top-of-the-line LLM today, but seemed expensive at the time. And ultimately a lot of this also comes from the hardware advancements and the math advancements. If you took classes on neural networks in the 1990s, you'd notice that they mostly talked about 1 or 2 hidden layers, and not all that much about the math to train large networks, precisely because of how daunting the compute costs were for anything that wasn't a toy. But then came video card hardware, and improvements in using it for gradient descent, making it reasonable to go past silly 3-layer networks.
Every bet makes perfect sense after you consider how promising the previous one looked, and how much cheaper the compute was getting. Imagine being tasked to train an LLM in 1995: All the architectural knowledge we have today and a state-level mandate would not have gotten all that far. Just the amount of fast memory that we put to bear wouldn't have been viable until relatively recently.
pixl97
> and how much cheaper the compute was getting.
I remember back in the 90s how scientists/data analysts were saying that we'd need exaflop-scale systems to tackle certain problems. I remember thinking how foreign that number was when small systems were running maybe tens of megaFLOPS. Now we have systems starting to reach zettaFLOPS (FP8, so not an exact comparison).
educasean
You might appreciate this article: https://www.quantamagazine.org/when-chatgpt-broke-an-entire-...
creer
They didn't. Not LLM people specifically. Google figured out a long time ago that you get far better results on a very wide range of problems just by going bigger. (Which then must have become frustrating for some people, because most of the effort seems to have gone to scaling? See for example neural as opposed to symbolic approaches.)
Al-Khwarizmi
While GPT-2 didn't show emergent abilities, it did show improved accuracy on various tasks with respect to GPT-1. At that point, it was clear that scaling made sense.
In other words, no one expected GPT-3 to suddenly start solving tasks without training as it did, but it was expected to be useful as an incremental improvement to what GPT-2 did. At the time, GPT-2 was seeing practical use, mainly in text generation from some initial words - at that point the big scare was about massive generation of fake news - and also as a model that one could fine-tune for specific tasks. It made sense to train a larger model that would do all that better. The rest is history.
prats226
I don't think model sizes increased suddenly; there might not be emergent properties for certain tasks at smaller scales, but there was improvement at a slower rate for sure. Competition to improve those metrics, albeit at a slower pace, led to a gradual increase in model sizes and, by chance, to emergence the way it's defined in the paper?
gessha
There’s that Sinclair quote:
"It is difficult to get a man to understand something when his salary depends upon his not understanding it."
gyomu
The reasoning in the article is interesting, but this struck me as a weird example to choose:
> “The real question is how can we predict when a new LLM will achieve some new capability X. For example, X = “Write a short story that resonates with the social mood of the present time and is a runaway hit”
Framing a capability as something that is objectively measurable (“able to perform math on the 12th grade level”, “able to write a coherent, novel text without spelling/grammar mistakes”) makes sense within the context of what the author is trying to demonstrate.
But the social proof aspect (“is a runaway hit”) feels orthogonal to it? Things can be runaway hits for social factors independently of the capability they actually represent.
Retric
It’s not about being “a runaway hit” as an objective measurement; it’s about the things an LLM would need to achieve before that was possible. At first, AI scores on existing tests seemed like a useful metric. However, tests designed for humans make specific assumptions that don’t apply to these systems, making such tests useless.
AI is very good at gaming metrics, so it’s difficult to list criteria where achieving them is meaningful. A hypothetical coherent novel without spelling/grammar mistakes could in effect be a copy of some existing work that shows up in its corpus; a hit, however, requires more than a reskinned story.
chii
> however a hit requires more than a reskinned story.
demonstrably false: a lot of hits in the past are reskinned versions of existing stories!
Retric
While "reskinned" is not a technical term of art, copyright does apply to a reskinned story. “the series is not available in English translation, because of the first book having been judged a breach of copyright.” https://en.wikipedia.org/wiki/Tanya_Grotter
There’s plenty of room to take inspiration and go in another direction, a la Pride and Prejudice and Zombies.
creer
That it seems hard (impossible, even), or at least not intuitively clear to us humans how to go about it, is what makes the question interesting. In a way. The other questions are interesting, but a different class of interesting. At any rate, both are good for this question. Either way this becomes "what would we need to estimate this emergence threshold?".
eutropia
I often find that people using the word emergent to describe properties of a system tend to ascribe quasi-magical properties to the system. Things tend to get vague and hand-wavy when that term comes up.
Just call them properties with unknown provenance.
gond
> Just call them properties with unknown provenance.
They would if that were the correct designation; however, it is not.
Emergence does not equal non-understanding or some spooky-hooky force coming from the unknown. Reductionism does not lead to an explaining-away of emergence.
juancn
I always wondered if the specific dimensionality of the layers and tensors has a particular effect on the model.
It's hard to explain, but higher-dimensional spaces have weird topological properties; not all behave the same way, and some things are perfectly doable in one set of dimensions while for others it just plain doesn't work (e.g. applying surgery to turn one shape into another).
etrautmann
How is topology specifically related to emergent capabilities in AI?
interstice
The bag-of-heuristics thing is interesting to me. Is it not conceivable that an NN of a certain size trained only on math problems would be able to wire up what amounts to a calculator? And if so, could that form part of a wider network, or is I/O from completely different modalities not really possible in this way?
andy99
I didn't follow entirely on a fast read, but this confused me especially:
> The parameter count of an LLM defines a certain bit budget. This bit budget must be spread across many, many tasks
I'm pretty sure that LLMs, like all big neural networks, are massively under-specified, as in there are way more parameters than data points to fit (granted, the training data set is bigger than the size of the model, but the point is that the same loss can be achieved with many different combinations of parameters). And I think of this under-specification as the reason neural networks extrapolate cleanly and thus generalize.
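A minimal sketch of that under-specification claim (assuming scikit-learn; the toy sine-fitting task is mine, not the commenter's): train the same over-parameterised network twice from different random seeds on the same data. The final losses typically come out very close, while the learned weights are completely different, i.e. many parameter combinations realise essentially the same fit.

    # Two identically configured over-parameterised nets, different random
    # seeds, same data: losses end up very close, the weights do not.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 1))
    y = np.sin(3 * X[:, 0])

    nets = [MLPRegressor(hidden_layer_sizes=(256,), max_iter=5000,
                         random_state=seed).fit(X, y)
            for seed in (0, 1)]

    print("final losses:", [round(n.loss_, 5) for n in nets])
    print("mean |weight difference| in layer 1:",
          float(np.abs(nets[0].coefs_[0] - nets[1].coefs_[0]).mean()))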
vonneumannstan
This doesn't seem right, and most people recognize that 'neurons' encode multiple features (superposition). https://transformer-circuits.pub/2022/toy_model/index.html
Der_Einzige
They’re 1000% right on the idea that most models are hilariously undertrained
vonneumannstan
Pretty sure that since the Chinchilla paper this probably isn't the case. https://arxiv.org/pdf/2203.15556
kazinator
> the same loss can be achieved with many different combinations of parameters
Perhaps it can be, but it isn't. The loss in a given model was achieved with that particular combination of parameters that the model has, and there exists no other combination of parameters to which that model can appeal for more information.
To have the other combinations, we would need to train more models and then somehow combine them so that the combinations are available; but that's just conceptually the same as making one larger model with more parameters, and lower loss.
Michelangelo11
How could they not?
Emergent properties are unavoidable for any complex system and probably exponentially scale with complexity or something (I'm sure there's an entire literature about this somewhere).
One good instance is spandrels in evolutionary biology. The Wikipedia article is a good explanation of the subject: https://en.m.wikipedia.org/wiki/Spandrel_(biology)
creer
This is also my impression ("how could they not?"), but it goes a bit further: Can we predict it? Can we estimate the size of a system that will achieve X? Can we build systems that are predisposed to emergent behaviors? Can we build systems that preclude some class of emergent behavior (relevant to AGI safety perhaps)? And then of course many systems will not achieve anything because even when "large", they are "uselessly large" - as in, you can define more points on a line and it's still a dumb line.
To me the "how could they not" comes from the idea that if LLMs somehow encapsulate/contain/exploit all human writings, then they most likely cover a large part of human intelligence. For damn sure much more than the "basic human". The question is more how we can get this behavior back out of it than whether it's there.
trash_cat
I think the better question to answer is why emergent properties exist in the first place.
I disagree with the premise that emergence is binary. It's not. What we call "emergent behaviour" is partly a social concept. We decide when an LLM is good enough for us and when it "solved" something through emergent properties.
waynecochran
Since gradient descent converges on a local minimum, would we expect different emergent properties with different initializations of the weights?
jebarker
Not significantly, as I understand it. There's certainly variation in LLM abilities with different initializations but the volume and content of the data is a far bigger determinant of what an LLM will learn.
waynecochran
So there is an "attractor" that different initializations end up converging on?
andy99
Different initialization converge to different places, e.g https://arxiv.org/abs/1912.02757
For LLMs (as with other models), many local optima appear to support roughly the same behavior. This is the idea of the problem being under-specified, i.e. many more unknowns than equations, so there are many ways to get the same result.
recursivecaveat
You end up with different weights when using different random initialization, but with modern techniques the behavior of the resulting model is not really distinct. Back in the image-recognition days it was like +/- 0.5% accuracy. If you imagine you're descending in a billion-parameter space, you will always have a negative gradient to follow in some dimension: local minima frequency goes down rapidly with (independent) dimension count.
I remain skeptical of emergent properties in LLMs in the way that people have used that term. There was a belief 3-4 years ago that if you just make the models big enough, they magically acquire intelligence. But since then, we’ve seen that the models are actually still pretty limited by the training data: like other ML models, they interpolate well between the data they’ve been trained on, but they don’t generalize well beyond it. Also, we have seen models that are 50-100x smaller now exhibit the same “emergent” capabilities that were once thought to require hundreds of billions of parameters. I personally think the emergent properties really belong to the data instead.