Why do LLMs have emergent properties?
77 comments · May 8, 2025
andy99
Yes, deep learning models only interpolate, and essentially represent an effective way of storing data-labeling effort. Doesn't mean they're not useful, just not what tech-adjacent promoters want people to think.
john-h-k
> Yes, deep learning models only interpolate
What do you mean by this? I don’t think the understanding of LLMs is sufficient to make this claim
andy99
An LLM is a classifier. There is lots of research into how deep learning classifiers work, and I haven't seen it contradicted when applied to LLMs.
gond
Interesting. Is there a quantitative threshold for emergence that anyone could point to in these smaller models? Tracing the thoughts of a large language model is probably the only way to be sure, or is it?
gyomu
The reasoning in the article is interesting, but this struck me as a weird example to choose:
> “The real question is how can we predict when a new LLM will achieve some new capability X. For example, X = “Write a short story that resonates with the social mood of the present time and is a runaway hit”
Framing a capability as something that is objectively measurable (“able to perform math on the 12th grade level”, “able to write a coherent, novel text without spelling/grammar mistakes”) makes sense within the context of what the author is trying to demonstrate.
But the social proof aspect (“is a runaway hit”) feels orthogonal to it? Things can be runaway hits for social factors independently of the capability they actually represent.
interstice
The bag-of-heuristics thing is interesting to me. Is it not conceivable that a NN of a certain size trained only on math problems would be able to wire up what amounts to a calculator? And if so, could that form part of a wider network, or is I/O across completely different modalities not really possible in this way?
juancn
I always wondered if the specific dimensionality of the layers and tensors has a specific effect on the model.
It's hard to explain, but higher-dimensional spaces have weird topological properties. Not all of them behave the same way, and some things are perfectly doable in one set of dimensions while in others they just plain don't work (e.g. applying surgery to turn one shape into another).
etrautmann
How is topology specifically related to emergent capabilities in AI?
lordnacho
What seems a bit miraculous to me is, how did the researchers who put us on this path come to suspect that you could just throw more data and more parameters at the problem? If the emergent behavior doesn't appear for moderate sized models, how do you convince management to let you build a huge model?
educasean
You might appreciate this article: https://www.quantamagazine.org/when-chatgpt-broke-an-entire-...
Al-Khwarizmi
While GPT-2 didn't show emergent abilities, it did show improved accuracy on various tasks with respect to GPT-1. At that point, it was clear that scaling made sense.
In other words, no one expected GPT-3 to suddenly start solving tasks without training as it did, but it was expected to be useful as an incremental improvement to what GPT-2 did. At the time, GPT-2 was seeing practical use, mainly in text generation from some initial words - at that point the big scare was about massive generation of fake news - and also as a model that one could fine-tune for specific tasks. It made sense to train a larger model that would do all that better. The rest is history.
TheCoreh
This is perhaps why it took us this long to get to LLMs: the underlying math and ideas were (mostly) there, and even if the Transformer as an architecture wasn't ready yet, it wouldn't surprise me if throwing sufficient data/compute at a worse architecture also produced comparable emergent behavior.
There needed to be someone willing to try going big at an organization with sufficient idle compute/data just sitting there; not a surprise it first happened at Google.
hibikir
But we got here step by step, as other interesting use cases came up using somewhat less compute: image recognition, early forms of image generation, AlphaGo, AlphaZero for chess. All earlier forms of deep neural networks that are much more reasonable to train than a top-of-the-line LLM today, but seemed expensive at the time. And ultimately a lot of this also comes from the hardware advancements and the math advancements. If you took classes on neural networks in the 1990s, you'd notice that they mostly talked about 1 or 2 hidden layers, with not all that much focus on the math for training large networks, precisely because of how daunting the compute costs were for anything that wasn't a toy. But then came video card hardware, and improvements in using it for gradient descent, making it reasonable to go past silly 3-layer networks.
Every bet makes perfect sense after you consider how promising the previous one looked, and how much cheaper the compute was getting. Imagine being tasked to train an LLM in 1995: All the architectural knowledge we have today and a state-level mandate would not have gotten all that far. Just the amount of fast memory that we put to bear wouldn't have been viable until relatively recently.
prats226
I don't think model sizes increased suddenly. There might not be emergent properties for certain tasks at smaller scales, but there was improvement at a slower rate for sure. Competition to improve those metrics, albeit at a lower pace, led to a gradual increase in model sizes and, by chance, to emergence the way it's defined in the paper?
andy99
I didn't follow entirely on a fast read, but this confused me especially:
> The parameter count of an LLM defines a certain bit budget. This bit budget must be spread across many, many tasks
I'm pretty sure that LLMs, like all big neural networks, are massively under-specified, as in there are way more parameters than data to fit (understanding that the training data set is bigger than the model, but the point is that the same loss can be achieved with many different combinations of parameters). And I think of this underspecification as the reason neural networks extrapolate cleanly and thus generalize.
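A toy way to see the underspecification point (my own sketch in Python, nothing to do with any actual LLM): permuting the hidden units of a small network gives a genuinely different set of weights that computes exactly the same function, so the same loss is reachable from many points in parameter space.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(1, 8))  # tiny 4 -> 8 -> 1 MLP
    x = rng.normal(size=4)

    def forward(W1, W2, x):
        return W2 @ np.tanh(W1 @ x)

    perm = rng.permutation(8)          # shuffle the hidden units
    W1p, W2p = W1[perm], W2[:, perm]   # a different set of weights...

    print(forward(W1, W2, x))          # ...identical output, hence identical loss
    print(forward(W1p, W2p, x))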
vonneumannstan
This doesn't seem right, and most people recognize that 'neurons' encode multiple features in superposition. https://transformer-circuits.pub/2022/toy_model/index.html
waynecochran
Since gradient descent converges on a local minimum, would we expect different emergent properties with different initializations of the weights?
jebarker
Not significantly, as I understand it. There's certainly variation in LLM abilities with different initializations but the volume and content of the data is a far bigger determinant of what an LLM will learn.
waynecochran
So there is an "attractor" that different initializations end up converging on?
andy99
Different initializations converge to different places, e.g. https://arxiv.org/abs/1912.02757
For LLMs (as with other models), many local optima appear to support roughly the same behavior. This is the idea of the problem being under-specified, i.e. many more unknowns than equations, so there are many ways to get the same result.
me3meme
Metaphor: finding a path from an initial point to a destination in a graph. As the number of parameters increases, one can expect the LLM to remember how to go from one place to another, and in the end it should be able to find a long path. This can be an emergent property, since with fewer parameters the LLM might not be able to find the correct path. Now one has to figure out what kinds of problems this metaphor is a good model of.
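A rough sketch of the metaphor in Python (toy graph, hypothetical node names): the model "memorizes" single hops, and whether a long route can be composed out of them is all-or-nothing, which is what makes it look emergent.

    from collections import deque

    def reachable(hops, start, goal):
        """BFS over the memorized single hops."""
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                return True
            for nxt in hops.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    all_hops = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"]}
    missing_one = {"A": ["B"], "B": ["C"], "D": ["E"]}  # one hop not memorized

    print(reachable(all_hops, "A", "E"))      # True: the long path is there
    print(reachable(missing_one, "A", "E"))   # False: capacity just short, no path at all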
samirillian
*Do
nthingtohide
What do you think about this analogy?
A simple process produces the Mandelbrot set. A simple process (loss minimization through gradient descent) produces LLMs. So what plays the role of the 2D plane or dense point grid in the case of LLMs? It is the embeddings (or ordered combinations of embeddings) which are generated after pre-training. In the case of a 2D plane, the closeness between two points is determined by our numerical representation scheme. But in the case of embeddings, we learn the grid of words (playing the role of points) by looking at how the words are used in the corpus.
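A minimal sketch of what I mean by learning closeness from usage (toy corpus, not how LLM embeddings are actually trained): closeness falls out of which contexts the words share.

    from collections import Counter
    import math

    corpus = "the cat sat on the mat the dog sat on the rug".split()
    vecs = {w: Counter() for w in set(corpus)}
    for i, w in enumerate(corpus):              # count neighbors in a +/-1 window
        for j in (i - 1, i + 1):
            if 0 <= j < len(corpus):
                vecs[w][corpus[j]] += 1

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a)
        norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
        return dot / (norm(a) * norm(b))

    print(cosine(vecs["cat"], vecs["dog"]))   # similar contexts -> close (1.0 here)
    print(cosine(vecs["cat"], vecs["the"]))   # different contexts -> far (0.0 here)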
The following is a quote from Yuri Manin, an eminent Mathematician.
https://www.youtube.com/watch?v=BNzZt0QHj9U
> Of the properties of mathematics, as a language, the most peculiar one is that by playing formal games with an input mathematical text, one can get an output text which seemingly carries new knowledge. The basic examples are furnished by scientific or technological calculations: general laws plus initial conditions produce predictions, often only after time-consuming and computer-aided work. One can say that the input contains an implicit knowledge which is thereby made explicit.
I have a related idea, which I picked up from somewhere, that mirrors the above observation.
When we see beautiful fractals generated by simple equations and iterative processes, we give importance only to the equations, not to the Cartesian grid on which the process operates.
cratermoon
Alternate view: Are Emergent Abilities of Large Language Models a Mirage? https://arxiv.org/abs/2304.15004
"Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance."
Al-Khwarizmi
The continuous metrics the paper uses are largely irrelevant in practice, though. The sudden changes appear when you use metrics people actually care about.
To me the paper is overhyped. Knowing how neural networks work, it's clear that there are going to be underlying properties that vary smoothly. This doesn't preclude the existence of emergent abilities.
K0balt
A decent thought-proxy for this: powered flight.
An aircraft can approach powered flight without achieving it. For a given amount of thrust and set of aerodynamic characteristics, the aircraft will have dynamic_weight = static_weight - x, where x is a combination of the aerodynamic characteristics and the amount of thrust applied.
In no case where dynamic_weight > 0 will the aircraft fly, even though it exhibits characteristics of flight, i.e. the transfer of aerodynamic forces to counteract gravity.
So while it progressively exhibits characteristics of flight, it is not capable of any kind of flight at all until the critical point of dynamic_weight < 0. So the enabling characteristics are not “emergent”, but the behavior is.
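The analogy in code (toy numbers, my own sketch): lift grows smoothly with thrust, but "flies" is a threshold property of that smooth quantity, so it switches on all at once.

    static_weight = 1000.0                   # arbitrary units
    for thrust in range(0, 1400, 200):
        lift = 0.9 * thrust                  # the enabler improves continuously
        dynamic_weight = static_weight - lift
        status = "flies" if dynamic_weight < 0 else "grounded"
        print(f"thrust={thrust:4d}  dynamic_weight={dynamic_weight:7.1f}  {status}")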
I think this boils down to a matter of semantics.
scopemouthwash
“Thought-proxy”?
I think the word you’re looking for is “analogy”.
jebarker
Yes, this paper is under-appreciated. The point is that we as humans decide what constitutes a given task we're going to set as the bar, and it turns out that statistical pattern matching can solve many of those tasks to a reasonable level (we also get to define "reasonable") when there's sufficient scale of parameters and data. But that tip-over point is entirely arbitrary.
foobarqux
The author himself explicitly acknowledges the paper but then incomprehensibly ignores it ("Even so, many would like to understand, predict, and even facilitate the emergence of these capabilities."). It's like saying "some say [foo] doesn't exist, but even so many would like to understand [foo]". It's incoherent.
autoexec
No point in letting facts get in the way of an entire article I guess.
moffkalast
That has been a problem with most LLM benchmarks. Any test that's rated in percentages tends to be logarithmic: getting from, say, 90% to 95% is not a linear 5% improvement but more like a 2x or 10x improvement in practical terms, since the metric is already nearly maxed out and only the extreme edge cases remain, which are much harder to master.
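One way to make that concrete (quick sketch): look at the remaining error rate instead of the accuracy, and the "small" late gains become large multiplicative improvements.

    for acc in [0.50, 0.90, 0.95, 0.99, 0.995]:
        print(f"accuracy {acc:.1%} -> remaining error {1 - acc:.2%}")
    # 90% -> 95% halves the errors; 99% -> 99.5% halves them again,
    # even though both look like "small" gains on a percentage scale.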
Michelangelo11
How could they not?
Emergent properties are unavoidable for any complex system and probably scale exponentially with complexity or something (I'm sure there's an entire literature about this somewhere).
One good instance is spandrels in evolutionary biology. The Wikipedia article is a good explanation of the subject: https://en.m.wikipedia.org/wiki/Spandrel_(biology)
anon373839
I remain skeptical of emergent properties in LLMs in the way that people have used that term. There was a belief 3-4 years ago that if you just make the models big enough, they magically acquire intelligence. But since then, we’ve seen that the models are actually still pretty limited by the training data: like other ML models, they interpolate well between the data they’ve been trained on, but they don’t generalize well beyond it. Also, we have seen models that are 50-100x smaller now exhibit the same “emergent” capabilities that were once thought to require hundreds of billions of parameters. I personally think the emergent properties really belong to the data instead.