ARC-AGI without pretraining
132 comments
March 4, 2025 · pona-a
I feel like extensive pretraining goes against the spirit of generality.
If you can create a general machine that can take 3 examples and synthesize a program that predicts the 4th, you've just solved oracle synthesis. If you train a network on all human knowledge, including puzzle making, and then fine-tune it on 99% of the dataset and give it a dozen attempts for the last 1%, you've just made an expensive compressor for test-maker's psychology.
Valectar
This betrays a very naive concept of "knowledge" and "understanding". It presupposes that there's some kind of platonic realm of logic and reason that an AGI just needs to tap in to. But ultimately, there can be no meaning, or reasoning, or logic, without context. Matching a pattern of shapes presupposes the concept of a shape, which presupposes a concept of spatial relationships, which presupposes a concept of 3 or even 2 dimensional space. These things only seem obvious and implicit to you because they permeate the environment that your mind spent hundreds of millions of years evolving to interpret, and then tens of years consuming and processing to understand.
The true test of an AGI is its ability to assimilate disparate information into a coherent world-view, which is effectively what the pretraining is doing. And even then, it is likely that any intelligence capable of doing that will need to be "preloaded" with assumptions about the world it will occupy, structurally. Similar to the regions of the brain which are adept at understanding spatial relationships, or language, or interpreting our senses, etc.
gitfan86
Yes, AGI was here at AlphaGo. People don't like that because they think it should have generalized outside of Go, but when you say AGI was here at AlphaZero, which can play other games, they again say it's not general enough. At this point it seems unlikely that AI will ever be general enough to satisfy the sceptics, for the reason you said. There will always be some domain that requires training on new data.
devmor
You're calling an apple an orange and complaining that everyone else won't refer to it as such. AGI is a computer program that can understand or learn any task a human can, mimicking the cognitive ability of a human.
It doesn't have to actually "think" as long as it can present an indistinguishable facsimile, but if you have to rebuild its training set for each task, that does not qualify. We don't reset human brains from scratch to pick up new skills.
FrustratedMonky
"AI will ever be general enough to satisfy the sceptics for the reason you said"
Also
People keep thinking "General" means one AI can "do everything that any human can do everywhere all at once".
When really, humans are also pretty specialized. Humans have years of 'training' to do a 'single job', and they do not easily switch tasks.
jshmrsn
If the machine can decide how to train itself (adjust weights) when faced with a type of problem it hasn’t seen before, then I don’t think that would go against the spirit of general intelligence. I think that’s basically what humans do when they decide to get better at something, they figure out how to practice that task until they get better at it.
pona-a
In-context learning is a very different problem from regular prediction. It is quite simple to fit a stationary solution to noisy data; that's just a matter of tuning some parameters with fairly even gradients. In-context learning implies you're essentially learning a mesa-optimizer for the class of problems you're facing, which in the form of transformers essentially means fitting something not that far from a differentiable Turing machine with no inductive biases.
fsndz
Exactly. That's basically the problem with a lot of the current paradigm, they don't allow true generalisation. That's why some people say there won't be any AGI anytime soon: https://www.lycee.ai/blog/why-no-agi-openai
exe34
"true generalisation" isn't really something a lot of humans can do.
godelski
> humans CAN do.
I think people often get confused with claims:
- Humans CAN generalize
- Humans CAN reason
- Humans CAN be intelligent
- Humans CAN be conscious
Generalization[0] is something MOST humans CAN do, but MOST humans DO NOT do. Do not confuse "can" and "are". One of my pet peeves is how often qualifying words are just ignored. They are critical parts of any communication.[1]
Another pet peeve is over anthropomorphization. Anthropomorphism is a useful tool, but.. well... we CAN over generalize ;)
[0] I don't know what you mean by "true generalization". I'm not going to address that, because you can always raise the bar for what is "true"; let's try to be more concrete. Maybe I misunderstand. I definitely misunderstand.
[1] Classic example: someone says "most x are y" and then there's a rebuttal of "but x_1 isn't y" or "I'm x and I'm not y" or some variation. Great! Most isn't all? This is not engaging in good faith, and there are examples of this with any qualifying word. It is quite common to see.
fsndz
The thing is, LLMs don't even do the kind of generalisation the dumbest human can do, while simultaneously doing some stuff the smartest human probably can't.
leptons
Sadly, a lot of humans choose not to think. We're in an age of willful ignorance.
fritzo
i feel like you're overgeneralizing here
tripplyons
I think that most human learning comes from years of sensory input. Why should we expect a machine to generalize well without any background?
aithrowawaycomm
Newborns (and certainly toddlers) seem to understand the underlying concepts for these things when it comes to visual/haptic object identification and "folk physics":
A short list of abilities that cannot be performed by CompressARC includes:
Assigning two colors to each other (see puzzle 0d3d703e)
Repeating an operation in series many times (see puzzle 0a938d79)
Counting/numbers (see puzzle ce9e57f2)
Translation, rotation, reflections, rescaling, image duplication (see puzzles 0e206a2e, 5ad4f10b, and 2bcee788)
Detecting topological properties such as connectivity (see puzzle 7b6016b9)
Note: I am not saying newborns can solve the corresponding ARC problems! The point is there is a lot of evidence that many of the concepts ARC-AGI is (allegedly) measuring are innate in humans, and maybe most animals; e.g. cockroaches can quickly identify connected/disconnected components when it comes to pathfinding. Again, not saying cockroaches can solve ARC :)
OTOH even if orcas were smarter than humans they would struggle with ARC - it would be way too baffling and obtuse if your culture doesn't have the concept of written standardized tests. (I was solving state-mandated ARCish problems since elementary school.) This also applies to hunter-gatherers, and note the converse: if you plopped me down among the Khoisan in the Kalahari, they would think I was an ignorant moron. But it makes as much sense scientifically to say "human-level intelligence" entails "human-level hunter-gathering" instead of "human-level IQ problems."
Ukv
> there is a lot of evidence that many of the concepts ARC-AGI is (allegedly) measuring are innate in humans
I'd argue that "innate" here still includes a brain structure/nervous system that evolved on 3.5 billion years worth of data. Extensive pre-training of one kind or another currently seems the best way to achieve generality.
mystified5016
Newborn brains aren't blank, they are complex beyond our current ability to understand. All mammals are born with a shocking amount of instinctual knowledge built right into their genome.
All organisms are born pre-trained because if you can't hide or survive the moment you're born, you get eaten.
naasking
> Newborns (and certainly toddlers) seem to understand the underlying concepts for these things when it comes to visual/haptic object identification and "folk physics"
Yes, they enjoy millions of years of pretraining thanks to evolution, ie. their pretrained base model has some natural propensity for visual, auditory, and tactile sensory modalities, and some natural propensity for spatial and causal reasoning.
Krasnol
I'd guess it's because we don't want to have another human. We want something better. Therefore, the expectations on the learning process are way beyond what humans do. I guess some are expecting some magic word (formula) which would be like a seed with unlimited potential.
So like humans after all but faster.
I guess it's just hard to write a book about the way you write that book.
andoando
It does, but it also generalizes extremely well.
tripplyons
I haven't seen a convincing argument that it is more sample efficient than a neural network that has seen an equivalent amount of lifetime information. Yann LeCun gave an interesting explanation of how even in the first few years of a child's life, they have seen much more information than the largest pretrained models have.
Davidzheng
ARC is equivalent to a distribution over 4-tuples of images -- with no prior, the last image is uniformly distributed given the first three...
ta8645
The issue is that general intelligence is useless without vast knowledge. The pretraining is the knowledge, not the intelligence.
dchichkov
For long context sizes, AGI is not useless without vast knowledge. You could always put a bootstrap sequence into the context (think Arecibo Message), followed by your prompt. A general enough reasoner with enough compute should be able to establish the context and reason about your prompt.
ta8645
Yes, but that just effectively recreates the pretraining. You're going to have to explain everything down to what an atom is, and essentially all human knowledge if you want to have any ability to consider abstract solutions that call on lessons from foreign domains.
There's a reason people with comparable intelligence operate at varying degrees of effectiveness, and it has to do with how knowledgeable they are.
conradev
Isn't knowledge of language necessary to decode prompts?
tripplyons
I'm not at all experienced in neuroscience, but I think that humans and other animals primarily gain intelligence by learning from their sensory input.
FergusArgyll
You don't think a lot is encoded in genes from before we're born?
pona-a
I don't think so. A lot of useful specialized problems are just patterns. Imagine your IDE could take 5 examples of matching strings and produce a regex you can count on working. It doesn't need to know the capital of Togo, metabolic pathways of the eukaryotic cell, or human psychology.
For that matter, if it had no pre-training, it means it can generalize to any new programming languages, libraries, and entire tasks. You can use it to analyze the grammar of a dying African language, write stories in the style of Hemingway, and diagnose cancer on patient data. In all of these, there are only so many samples to fit on.
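As a toy illustration of that IDE feature (not from the article; the candidate patterns and example strings below are invented), you can frame it as "pick the simplest hypothesis consistent with the examples":

```python
import re

# Hypothetical candidate patterns, roughly ordered by description length.
CANDIDATES = [r"\d+", r"\w+", r"[a-z]+", r"\w+-\d+", r"[A-Za-z]+", r"[A-Z]{2,4}-\d{3,5}"]

def infer_pattern(positives, negatives=()):
    """Return the shortest candidate regex that fully matches every positive
    example and no negative example, or None if nothing fits."""
    for pattern in sorted(CANDIDATES, key=len):
        if all(re.fullmatch(pattern, s) for s in positives) and \
           not any(re.fullmatch(pattern, s) for s in negatives):
            return pattern
    return None

print(infer_pattern(["AB-1234", "XYZ-999", "QK-40000"], ["AB1234", "hello"]))
# prints: \w+-\d+  (the shortest candidate consistent with all the examples)
```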
ta8645
Of course, none of us have exhaustive knowledge. I don't know the capital of Togo.
But I do have enough knowledge to know what an IDE is, and where that sits in a technological stack; I know what a string is, and all that it relies on, etc. There's a huge body of knowledge that is required to even begin approaching the problem. If you posted that challenge to an intelligent person from 2000 years ago, they would just stare at you blankly. It doesn't matter how intelligent they are, they have no context to understand anything about the task.
bloomingkales
> A lot of useful specialized problems are just patterns.
> It doesn't need to know the capital of Togo, metabolic pathways of the eukaryotic cell, or human psychology.
What if knowing those things distills down to a pattern that matches a pattern of your code and vice versa? There's a pattern in everything, so know everything, and be ready to pattern match.
If you just look at object oriented programming, you can easily see how knowing a lot translates to abstract concepts. There's no reason those concepts can't be translated bidirectionally.
raducu
> The pretraining is the knowledge, not the intelligence.
I thought the knowledge is the training set and the intelligence is the emergent/side effect of reproducing that knowledge by making sure the reproduction is not rote memorisation?
ta8645
I'd say that it takes intelligence to encode knowledge, and the more knowledge you have, the more intelligently you can encode further knowledge, in a virtuous cycle. But once you have a data set of knowledge, there's nothing to emerge, there are no side effects. It just sits there doing nothing. The intelligence is in the algorithms that access that encoded knowledge to produce something else.
godelski
> I feel like extensive pretraining goes against the spirit of generality.
What do you mean by generality?
Pretraining is fine. It is even fine in the pursuit of AGI. Humans, like every animal, have "baked in" memory. You're born knowing how to breathe and with latent fears (think chicks and hawks).
Generalization is the ability to learn on a subset of something and then adapt to the entire (or a much larger portion) of the superset. It's always been that way. Humans do this, right? You learn some addition, subtraction, multiplication, and division and then you can do novel problems you've never seen before that are extremely different. We are extremely general here because we've learned the causal rule set. It isn't just memorization. This is also true for things like physics, and is literally the point of science. Causality is baked into scientific learning. Of course, it is problematic when someone learns a little bit of something and thinks they know way more about it, but unfortunately ego is quite common.
But also, I'm a bit with you. At least with what I think you're getting at. These LLMs are difficult to evaluate because we have no idea what they're trained on and you can't really know what is new, what is a slight variation from the training, and this is even more difficult considering the number of dimensions involved (meaning things may be nearly identical in latent space though they don't appear so to us humans).
I think there's still a lot of ML/AI research that can and SHOULD be done at smaller scales. We should be studying more about this adaptive learning, and not just in the RL setting. One major gripe I have with the current research environment is that we are not very scientific when designing experiments. They are highly benchmark/data-set-evaluation focused. Evaluation needs to go far beyond test cases. I'll keep posting this video of Dyson recounting his work being rejected by Fermi[0][1]. You have to have a good "model".
What I do not see happening in ML papers is proper variable isolation and evaluation based on this: i.e. hypothesis testing. Most papers I see are not providing substantial evidence for their claims. It may look like they do, but the devil is always in the details. When doing extensive hyper-parameter tuning it becomes very difficult to determine if the effect is due to architectural changes, a change in the data, a change in training techniques, or a change in hyperparameters. A proper evaluation would require huge ablations with hold-one-out style scores reported. This is obviously too expensive, but the reason it gets messy is that there's a concentration on getting good scores on whatever evaluation dataset is popular. But you can show a method's utility without beating others! This is a huge thing many don't understand.
Worse, by changing hyper-parameters to optimize for the test-set result, you are doing information leakage. Anything you change based on the result of the evaluation set is, by definition, information leakage. We can get into the nitty gritty to prove why this is, but this is just common practice these days. It is the de facto method and yes, I'm bitter about this. (a former physicist who came over to ML because I loved the math and Asimov books)
[0] https://www.youtube.com/watch?v=hV41QEKiMlM
[1] I'd also like to point out that Dyson notes that the work was still __published__. Why? Because it still provides insights and the results are useful to people even if the conclusions are bad. Modern publishing seems to be more focused on novelty and is highly misaligned from scientific progress. Even repetition helps. It is information gain. But repetition results in lower information gain with each iteration. You can't determine correctness by reading a paper, you can only determine correctness by repetition. That's the point of science, right? As I said above? Even negative results are information gain! (sorry, I can rant about this a lot)
AIorNot
I was thinking about this Lex Fridman podcast with Marcus Hutter. Also, Joscha Bach defined intelligence as the ability to accurately model reality... is lossless compression itself intelligence, or a best-fit model? Is there a difference? https://www.youtube.com/watch?v=E1AxVXt2Gv4
meowface
Incidentally, François Chollet, creator of ARC-AGI, argues in this 2020 Lex Fridman podcast that intelligence is not compression: https://youtu.be/-V-vOXLyKGw
visarga
Compression works on finite data, and AI models need to stay open to new data. So they should be less greedy.
akomtu
It's the ability to find a simple model that predicts a complex reality with a high accuracy and low latency. So we need to consider the four axes, and AI will be a region in this space.
Edit: In fact, there is a simple test for intelligence: can you read a function in C and tell how a change in input changes the output? For complex algorithms, you have to build an internal model, because how else are you going to run qsort on a million items? That's also how you'd tell if a student is faking it or really understands. A harder test would be to do the opposite: from a few input/output examples, come up with an algorithm.
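A toy version of that harder test, sketched below (the primitive operations and examples are made up; a real system would search a far richer program space): enumerate short compositions of primitives and keep the first one consistent with the given input/output pairs.

```python
from itertools import product

# Made-up primitive operations to compose into candidate programs.
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: 2 * x,
    "neg": lambda x: -x,
    "sqr": lambda x: x * x,
}

def synthesize(examples, max_depth=3):
    """Return the shortest chain of primitives consistent with all I/O examples."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def run(x, chain=names):
                for name in chain:
                    x = PRIMITIVES[name](x)
                return x
            if all(run(i) == o for i, o in examples):
                return names
    return None

print(synthesize([(1, 3), (2, 5), (3, 7)]))  # prints ('dbl', 'inc'), i.e. x -> 2*x + 1
```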
fastball
For a quicker tie in than watching a whole podcast, the result of Hutter's stance takes the form of the Hutter prize[1], which in some ways has many of the same goals as ARC-AGI, but sees compression itself as the goalpost towards intelligence.
cocomutator
I'm trying to distill the essence of their approach, which imho is concealed behind inessential and particular details such as the choice of this or that compression scheme or prior distributions.
It seems like the central innovation is the construction of a "model" which can be optimized with gradient descent, and whose optimum is the "simplest" model that memorizes the input-output relationships. In their setup, "simplest" has the concrete meaning of "which can be efficiently compressed" but more generally it probably means something like "whose model complexity is lowest possible".
This is in stark contrast to what happens in standard ML: typically, we start by prescribing a complexity budget (e.g. by choosing the model architecture and all complexity parameters), and only then train on data to find a good solution that memorizes input-output relationship.
The new method turns ML on its head: we optimize the model so that we reduce its complexity as much as possible while still memorizing the input-output pairs. That this is able to generalize from 2 training examples is truly remarkable and imho hints that this is absolutely the right way of "going about" generalization.
Information theory happened to be the angle from which the authors arrived at this construction, but I'm not sure that is the essential bit. Rather, the essential bit seems to be the realization that rather than finding the best model for a fixed pre-determined complexity budget, we can find models with minimal possible complexity.
getnormality
The idea of minimizing complexity is less novel than it may seem. Regularization terms are commonly added to loss objectives in optimization, and these regularizers can often be interpreted as penalizing complexity. Duality allows us to interpret these objectives in multiple ways:
1. Minimize a weighted sum of data error and complexity.
2. Minimize the complexity, so long as the data error is kept below a limit.
3. Minimize the error on the data, so long as the complexity is kept below a limit.
It does seem like classical regularization of this kind has been out of fashion lately. I don't think it plays much of a role in most Transformer architectures. It would be interesting if it makes some sort of comeback.
Other than that, I think there are so many novel elements in this approach that it is hard to tell what is doing the work. Their neural architecture, for example, seems carefully hacked to maximize performance on ARC-AGI type tasks. It's hard to see how it generalizes beyond.
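For concreteness, a minimal sketch of formulation 1 in PyTorch (the model, data, and penalty weight `lam` are placeholders, not anything from the paper); formulations 2 and 3 are the constrained versions of the same objective, related to this one through the Lagrange multiplier:

```python
import torch

def penalized_loss(model, x, y, lam=1e-2):
    """Formulation 1: weighted sum of data error and an L2 complexity penalty."""
    data_error = torch.nn.functional.mse_loss(model(x), y)
    complexity = sum((p ** 2).sum() for p in model.parameters())
    return data_error + lam * complexity

# toy usage with an arbitrary small model
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
penalized_loss(model, x, y).backward()
```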
cocomutator
Right, but see how the complexity budget is prescribed ahead of time: we first set the regularization strength or whatever, and then optimize the model. The result is the best model with complexity no greater than the budget. In this standard approach, we're not minimizing complexity, we're constraining it.
getnormality
Again, because of duality, these are not really different things.
To your point in the other thread, once you start optimizing both data fidelity and complexity, it's no longer that different from other approaches. Regularization has been common in neural nets, but usually in a simple "sum of sizes of parameters" type way, and seemingly not an essential ingredient in recent successful models.
ealexhudson
I think you're right about the essential ingredient in this finding, but I feel like this is a pretty ARC-AGI specific result.
Each puzzle is kind of a similar format, and the data that changes in the puzzle is almost precisely that needed to deduce the rule. By reducing the amount of information needed to describe the rule, you almost have to reduce your codec to what the rule itself is doing - to minimise the information loss.
I feel like if there was more noise or arbitrary data in each puzzle, this technique would not work. Clearly there's a point at which that gets difficult - the puzzle should not be "working out where the puzzle is" - but this only works because each example is just pure information with respect to the puzzle itself.
cocomutator
I agree with your observation about the exact noise-free nature of the problem. It allows them to formulate the problem as "minimize complexity such that you memorize the X-y relationship exactly". This would need to be generalized to the noisy case: instead of demanding exact memorization, you'd need to prescribe an error budget. But then this error budget seems like an effective complexity metaparameter, doesn't it, and we're back to square zero of cross-validation.
ealexhudson
If we think of the 'budget' as being similar to a bandwidth limit on video playback, there's a kind of line below which the picture starts being pretty unintelligible, but for the most part it's a slider: lowering the budget makes the playback gradually less accurate.
But because this is clean data, I wonder if there's basically a big gap here: the codec that encodes the "correct rule" can achieve a step-change lower bandwidth requirement than similar-looking solutions. The most elegant ruleset - at least in this set of puzzles - always compresses markedly better. And so you can kind of brute-force the correct rule by trying lots of encoding strategies, and just identify which one gets you that step-change compression benefit.
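A crude sketch of that brute-force idea (entirely hypothetical, using compressed size of the residual as a stand-in for description length; the candidate rules and toy grids are invented):

```python
import zlib
import numpy as np

# Candidate encoding strategies (rules) to try on the example grids.
RULES = {
    "identity": lambda g: g,
    "flip_lr": np.fliplr,
    "flip_ud": np.flipud,
    "rot90": np.rot90,
    "transpose": lambda g: g.T,
}

def description_cost(rule, examples):
    """Compressed size of the residual (output XOR prediction) over all examples."""
    residual = b"".join(
        (rule(inp) ^ out).astype(np.uint8).tobytes() for inp, out in examples
    )
    return len(zlib.compress(residual))

# Toy puzzle whose hidden rule is a left-right flip.
rng = np.random.default_rng(0)
examples = [(g, np.fliplr(g)) for g in (rng.integers(0, 10, (5, 5)) for _ in range(3))]

costs = {name: description_cost(rule, examples) for name, rule in RULES.items()}
print(min(costs, key=costs.get), costs)  # 'flip_lr' wins by a step change
```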
EigenLord
Interesting. I've been slowly coming to the conclusion that the way forward with machine learning is actually less "machine learning" as we've grown accustomed to it: less pretraining, less data, less search, more direct representation, symbolic processing, more constraint-satisfaction, meta-learning, etc. All those things we need less of (pretraining, data, etc.) are messy, brute force, and contingent. Working with them, you'll always be dependent on the quality of your data, which is fine if you want to data-mine, but not fine if you want to model the underlying causes of the data.
From my (admittedly sketchy, rushed) understanding of what they're doing, they're essentially trying to uncover the minimal representation of the solution/problem space. By tracking the actual structure of the problem through equivariances, they're deriving something like the true underlying representation of the puzzles and how to solve them, rather than hoping to pick this up from many solved examples.
pyinstallwoes
Impressive documentation and explanation. Thank you.
It pleases me to find this as it supports my own introspection (heck, it’s in my profile!)
> Intelligence is compressing information into irreducible representation.
mmazing
Love the description of intelligence.
https://en.wikipedia.org/wiki/Kolmogorov_complexity
https://en.wikipedia.org/wiki/Solomonoff%27s_theory_of_induc...
https://en.wikipedia.org/wiki/Minimum_description_length
Seems like these could be related, going to dive into this more! :)
programjames
> Intelligence is compressing information into irreducible representation
I thought that was physics ;)
d--b
> ARC-AGI, introduced in 2019, is an artificial intelligence benchmark designed to test a system’s ability to infer and generalize abstract rules from minimal examples. The dataset consists of IQ-test-like puzzles, where each puzzle provides several example images that demonstrate an underlying rule, along with a test image that requires completing or applying that rule. While some have suggested that solving ARC-AGI might signal the advent of artificial general intelligence (AGI), its true purpose is to spotlight the current challenges hindering progress toward AGI
Well they kind of define intelligence as the ability to compress information into a set of rules, so yes, compression does that…
bubblyworld
This is not at all as circular or obvious as you are claiming here. Have you actually tried the ARC-AGI problems yourself? They are quite subtle, and test a broad range of abstract concepts. For reference o1-preview scores 21% on the public eval, compared to 34% in the OP.
fragebogen
(Somewhat) related Schmidhuber https://arxiv.org/abs/0812.4360
naasking
> Despite these constraints, CompressARC achieves 34.75% on the training set and 20% on the evaluation set—processing each puzzle in roughly 20 minutes on an RTX 4070.
This phrasing suggests that each puzzle took 20 mins, so for the 100 puzzle challenge that's 33.3 hours, which exceeds the target of 12 hours for the challenge. Pretty cool approach though.
unixpickle
This seems to be pretty much exactly a standard Bayesian deep learning approach, albeit with a heavily engineered architecture.
cgadski
I'm very excited that we're figuring out how to use deep learning on small numbers of data points!
I'm curious about the focus on information compression, though. The classical view of inference as compression is beautiful and deserves more communication, but I think the real novelty here is in how the explicitly "information-constrained" code z participates in the forward pass.
About their overall method, they write:
> It isn’t obvious why such a method is performing compression. You’ll see later how we derived it from trying to compress ARC-AGI.
I must be learning something in my PhD, because the relation with compression _did_ seem obvious! Viewing prediction loss and KL divergence of a latent distribution p(z) as "information costs" of an implicit compression scheme is very classical, and I think a lot of people would feel the same. However, while they explained that an L2 regularization over model weights can be viewed (up to a constant) as an approximation of the bits needed to encode the model parameters theta, they later say (of regularization w.r.t. theta):
> We don’t use it. Maybe it matters, but we don’t know. Regularization measures the complexity of f in our problem formulation, and is native to our derivation of CompressARC. It is somewhat reckless for us to exclude it in our implementation.
So, in principle, the compression/description length minimization point of view isn't an explanation for this success any more than it explains success of VAEs or empirical risk minimization in general. (From what I understand, this model can be viewed as a VAE where the encoding layer has constant input.) That's no surprise! As I see it, our lack of an adequate notion of "description length" for a network's learned parameters is at the heart of our most basic confusions in deep learning.
Now, let's think about the input distribution p(z). In a classical VAE, the decoder needs to rely on z to know what kind of data point to produce, and "absorbing" information about the nature of a particular kind of data point is actually what's expected. If I trained a VAE on exactly two images, I'd expect the latent z to carry at most one bit of information. If CompressARC were allowed to "absorb" details of the problem instance in this way, I'd expect p(z) to degenerate to the prior N(0, 1)—that is, carry no information. The model could, for example, replace z with a constant at the very first layer and overfit the data in any way it wanted.
Why doesn't this happen? In the section on the "decoding layer" (responsible for generating z), the authors write:
> Specifically, it forces CompressARC to spend more bits on the KL whenever it uses z to break a symmetry, and the larger the symmetry group broken, the more bits it spends.
As they emphasize throughout this post, this model is _very_ equivariant and can't "break symmetries" without using the parameter z. For example, if the model wants to do something like produce all-green images, the tensors constituting the "multitensor" z can't all be constant w.r.t. the color channel---at least one of them needs to break the symmetry.
The reason the equivariant network learns a "good algorithm" (low description length, etc.) is unexplained, as usual in deep learning. The interesting result is that explicitly penalizing the entropy of the parameters responsible for breaking symmetry seems to give the network the right conditions to learn a good algorithm. If we took away equivariance and restricted our loss to prediction loss plus an L2 "regularization" of the network parameters, we could still motivate this from the point of view of "compression," but I strongly suspect the network would just learn to memorize the problem instances and solutions.
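To make the "bits spent breaking a symmetry" point concrete, here's a tiny illustrative computation (not the authors' code; the shapes are made up): a posterior over z that stays at the N(0, 1) prior along, say, a color axis costs nothing, while one that singles out a particular color pays a KL cost.

```python
import torch

def kl_to_std_normal(mu, log_sigma):
    # KL(N(mu, sigma^2) || N(0, 1)) summed over components, in nats
    return 0.5 * (mu ** 2 + torch.exp(2 * log_sigma) - 1 - 2 * log_sigma).sum()

log_sigma = torch.zeros(10)                           # sigma = 1 everywhere
mu_symmetric = torch.zeros(10)                        # constant along a 10-color axis
mu_breaks_symmetry = torch.tensor([3.0] + [0.0] * 9)  # singles out one color

print(kl_to_std_normal(mu_symmetric, log_sigma))        # 0 nats: no symmetry broken
print(kl_to_std_normal(mu_breaks_symmetry, log_sigma))  # 4.5 nats to break it
```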
bubblyworld
Thanks for this insightful comment. One of my first questions about this was how it avoids some kind of latent space collapse with such a tiny dataset.
Do you think it's accurate to describe equivariance as both a strength and a weakness here? As in it allows the model to learn a useful compression, but you have to pick your set of equivariant layers up front, and there's little the model can do to "fix" bad choices.
cgadski
Yeah, I think it's really important to understand how to coax non-equivariant models into being equivariant when needed. I don't think purely equivariant architectures are the way forward.
One example that comes to mind (I don't know much/haven't thought about it much) is how AlphaFold apparently dropped rotational equivariance of the model in favor of what amounts to data augmentation---opting to "hammer in" the symmetry rather than using these fancy equivariant-by-design architectures. Apparently it's a common finding that hard-coded equivariance can hurt performance in practice when you have enough data.
programjames
Here's what they did:
1. Choose random samples z ~ N(μ, Σ) as the "encoding" of a puzzle, and a distribution of neural network weights p(θ) ~ N(θ, <very small variance>).
2. For a given z and θ, you can decode to get a distribution of pixel colors. We want these pixel colors to match the ones in our samples, but they're not guaranteed to, so we'll have to add some correction ε.
3. Specifying ε takes KL(decoded colors || actual colors) bits. If we had sources of randomness q(z), q(θ), specifying z and θ would take KL(p(z) || q(z)) and KL(p(θ) || q(θ)) bits.
4. The authors choose q(z) ~ N(0, 1) so KL(p(z) || q(z)) = 0.5(μ^2 + Σ^2 - 1 - 2ln Σ). Similarly, they choose q(θ) ~ N(0, 1/2λ), and since Var(θ) is very small, this gives KL(p(θ) || q(θ)) = λθ^2.
5. The fewer bits they use, the lower the Kolmogorov complexity, and the more likely it is to be correct. So, they want to minimize the number of bits
a * 0.5(μ^2 + Σ^2 - 1 - 2ln Σ) + λ * θ^2 + c * KL(decoded colors || actual colors).
6. Larger a gives a smaller latent, larger λ gives a smaller neural network, and larger c gives a more accurate solution. I think all they mention is they choose c = 10a, and that λ was pretty large.
They can then train μ, Σ, θ until it solves the examples for a given puzzle. Decoding will then give all the answers, including the unknown answer! The main drawback to this method is, like Gaussian splatting, they have to train an entire neural network for every puzzle. But, the neural networks are pretty small, so you could train a "hypernetwork" that predicts μ, Σ, θ for a given puzzle, and even predicts how to train these parameters.
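A minimal per-puzzle sketch of that objective (steps 1-5) in PyTorch; the decoder, shapes, and hyperparameter values below are placeholders rather than the authors' actual equivariant architecture:

```python
import torch

a, c, lam = 1.0, 10.0, 0.1          # c = 10a as mentioned above; lam is a guess
latent_dim, n_cells, n_colors = 64, 30 * 30, 10

mu = torch.zeros(latent_dim, requires_grad=True)
log_sigma = torch.zeros(latent_dim, requires_grad=True)
decoder = torch.nn.Sequential(       # stands in for f_theta
    torch.nn.Linear(latent_dim, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, n_cells * n_colors),
)
opt = torch.optim.Adam([mu, log_sigma, *decoder.parameters()], lr=1e-3)

target = torch.randint(0, n_colors, (n_cells,))   # known example colors (dummy data)

for step in range(100):
    z = mu + torch.exp(log_sigma) * torch.randn(latent_dim)       # sample z ~ N(mu, Sigma)
    logits = decoder(z).view(n_cells, n_colors)
    kl_z = 0.5 * (mu**2 + torch.exp(2 * log_sigma) - 1 - 2 * log_sigma).sum()
    kl_theta = sum((p**2).sum() for p in decoder.parameters())    # the lam * theta^2 term
    correction = torch.nn.functional.cross_entropy(logits, target, reduction="sum")
    loss = a * kl_z + lam * kl_theta + c * correction             # total bits, up to constants
    opt.zero_grad()
    loss.backward()
    opt.step()
```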
throwaway314155
Is there _any_ hope of being able to understand how this works without a background in statistics and information theory? So far, old-school pretraining gets heavy bonus points just for being plausible and easy to understand.
essexedwards
Yes, there is hope for a high-level heuristic understanding. Here's my attempt to explain in more familiar terms.
They train a new neural network from scratch for each problem. The network is trained only on the data about that problem. The loss function tries to make sure it can map the inputs to the outputs. It also tries to keep the network weights small so that the neural network is as simple as possible. Hopefully a simple function that maps the sample inputs to the sample outputs will also do the right thing on the test input. It works 20~30% of the time.
throwaway314155
Great explanation, thanks. I have some followups if you have the time!
a.) Why does this work as well as it does? Why does compression/fewer-parameters encourage better answers in this instance?
b.) Will it naturally transfer to other benchmarks that evaluate different domains? If so does that imply an approach similarly robust to pre-training that can be used for different domains/modalities?
c.) It works 20-30% of the time - do the researchers find any reason to believe that this could "scale" up in some fashion so that, say, a single larger network could handle any of the problems, rather than needing a new network for each problem? If so, would it improve accuracy as well as robustness?
programjames
Sure! I'll try to give a more intuitive explanation. Basically, it's been known for a while that intelligence is very related to compression. Why? Suppose you're trying to describe a phenomenon you see (such as an IQ test). To precisely describe what's going on, you'd have to write it in some kind of formal language, like a programming language. Maybe your description looks like:
This wavy piece can be described by algorithm #1, while this spiky piece can be described by algorithm #2, while this...
More precisely, you try to write your phenomenon as a weighted sum of these algorithms:
phenomenon = sum over algorithms of weight * algorithm
There are exponentially more algorithms that have more bits, so if you want this sum to ever converge, you need to have exponentially smaller weights for longer algorithms. Thus, most of the weight is concentrated in shorter algorithms, so a simple explanation is going to be a really good one!
What the authors are trying to do is find a simple (small number of bits) algorithm that reconstructs the puzzles and the example solutions they're given. As a byproduct, the algorithm will construct a solution to the final problem that's part of the puzzle. If the algorithm is simple enough, it won't be able to just memorize the given examples—it has to actually learn the trick to solving the puzzle.
Now, they could have just started enumerating out programs, beginning with `0`, `1`, `00`, `01`, ..., and seeing what their computer did with the bits. Eventually, they might hit on a simple bit sequence that the computer interprets as an actual program, and in fact one that solves the puzzle. But, that's very slow, and in fact the halting problem says you can't rule out some of your programs running forever (and your search getting stuck). So the authors turned to a specialized kind of computer, one that they know will stop in a finite number of steps...
...and that "computer" is a fixed-size neural network! The bit sequence they feed in goes to determine (1) the inputs to the neural network, and (2) the weights in the neural network. Now, they cheat a little, and actually just specify the inputs/weights, and then figure out what bits would have given them those inputs/weights. That's because it's easier to search in the input/weight space—people do this all the time with neural networks.
They initialize the space of inputs/weights as random normal distributions, but they want to change these distributions to be concentrated in areas that correctly solve the puzzle. This means they need additional bits to specify how to change the distributions. How many extra bits does it take to specify a distribution q, if you started with a distribution p? Well, it's
-Σ_x q(x) log p(x) + Σ_x q(x) log q(x)
(the expected # bits to encode samples from q using p's code, minus the expected # bits using q's own code)
This is known as the KL-divergence, which we write as KL(q||p). They want to minimize the length of their program, which means they want to minimize the expected number of additional bits they have to use, i.e. KL(q(inputs)||p(inputs)) + KL(q(weights)||p(weights)).
There's a final piece of the puzzle: they want their computer to exactly give the correct answer for the example solutions they know. So, if the neural network outputs an incorrect value, they need extra bits to say it was incorrect, and actually here's the correct value. Again, the expected number of bits is just going to be a KL-divergence, this time between the neural network's output and the correct answers.
Putting this altogether, they have a simple computer (neural network + corrector), and a way to measure the bitlength for various "programs" they can feed into the computer (inputs/weights). Every program will give the correct answers for the known information, but the very simplest programs are much more likely to give the correct answer for the unknown puzzles too! So, they just have to train their distributions q(inputs), q(weights) to concentrate on programs that have short bitlengths, by minimizing the loss function
KL(q(inputs)||p(inputs)) + KL(q(weights)||p(weights)) + KL(outputs||correct answers)
They specify p(inputs) as the usual normal distribution, p(weights) as a normal distribution with variance around 1/(dimension of inputs) (so the values in the neural network don't explode), and finally have trainable parameters for the mean and variance of q(inputs) and q(weights).
uh_uh
Given that all parameters are trained jointly at inference time and a single sample of z is supposed to encode ALL inputs and outputs for a given puzzle (I think), I don't quite understand the role of the latent z here. Feels like μ and Σ could be absorbed into θ (and an additional variance parameter).
Although if they partition z such that each section corresponds to one input and run f_θ through those sections iteratively, then I guess it makes sense.
programjames
I agree, z should be absorbed into μ and Σ, e.g. you always input `[1 0 0 ... 0]`, and the first layer of the neural network would essentially output z. They would have to stop approximating the KL(q(θ)||p(θ)) as O(θ^2) though, so maybe this is more computationally efficient?
uh_uh
Could be. Also, as you imply, they'd have to loosen the regularization penalty on θ, and maybe it's difficult to loosen it such that it won't become too prone to overfitting.
Maybe their current setup of keeping θ "dumb" encourages the neural network to take on the role of the "algorithm" as opposed to the higher-variance input encoded by z (the puzzle), though this separation seems fuzzy to me.
naasking
> The main drawback to this method is, like Gaussian splatting, they have to train an entire neural network for every puzzle.
They discuss how to avoid this in the section "Joint Compression via Weight Sharing Between Puzzles".
mmazing
> But, the neural networks are pretty small, so you could train a "hypernetwork" that predicts μ, Σ, θ for a given puzzle, and even predicts how to train these parameters.
Kurt Gödel (or maybe Douglas Hofstadter, rather) would raise an eyebrow. :)