François Chollet: The Arc Prize and How We Get to AGI [video]

154 comments

·July 3, 2025

qoez

I feel like I'm the only one who isn't convinced getting a high score on the ARC eval test means we have AGI. It's mostly about pattern matching (and for some of it, it's ambiguous even for humans what the actual true response ought to be). It's like how in humans there's lots of different 'types' of intelligence, and just overfitting on IQ tests doesn't in my mind convince me a person is actually that smart.

TheAceOfHearts

Getting a high score on ARC doesn't mean we have AGI and Chollet has always said as much AFAIK, it's meant to push the AI research space in a positive direction. Being able to solve ARC problems is probably a pre-requisite to AGI. It's a directional push into the fog of war, with the claim being that we should explore that area because we expect it's relevant to building AGI.

lostphilosopher

We don't really have a true test that means "if we pass this test we have AGI" but we have a variety of tests (like ARC) that we believe any true AGI would be able to pass. It's a "necessary but not sufficient" situation. Also ties directly to the challenge in defining what AGI really means. You see a lot of discussions of "moving the goal posts" around AGI, but as I see it we've never had goal posts, we've just got a bunch of lines we'd expect to cross before reaching them.

MPSimmons

I don't think we actually even have a good definition of "This is what AGI is, and here are the stationary goal posts that, when these thresholds are met, then we will have AGI".

If you judged human intelligence by our AI standards, then would humans even pass as Natural General Intelligence? Human intelligence tests are constantly changing, being invalidated, and rerolled as well.

I maintain that today's LLMs would pass sufficiently for AGI, and are also very close to passing a Turing Test, if measured in 1950 when the test was proposed.

tedy1996

I have graduated with a degree in software engineering and I am bilingual (Bulgarian and English). Currently AI is better than me at everything except adding big numbers or writing code in really niche topics, for example code golfing a Brainfuck interpreter or writing a Rubik's Cube solver. I believe AGI has been here for at least a year now.

ummonk

"Being able to solve ARC problems is probably a pre-requisite to AGI." - is it? Humans have general intelligence and most can't solve the harder ARC problems.

singron

https://arcprize.org/leaderboard

"Avg. Mturker" has 77% on ARC1 and costs $3/task. "Stem Grad" has 98% on ARC1 and costs $10/task. I would love a segment like "typical US office employee" or something else in between since I don't think you need a stem degree to do better than 77%.

It's also worth noting the "Human Panel" gets 100% on ARC2 at $17/task. All the "Human" models are on the score/cost frontier and exceptional in their score range although too expensive to win the prize obviously.

I think the real argument is that the ARC problems are too abstract and obscure to be relevant to useful AGI, but I think we need a little flexibility in that area so we can have tests that can be objectively and mechanically graded. E.g. "write a NYT bestseller" is an impractical test in many ways even if it's closer to what AGI should be.

adastra22

They, and the other posters posting similar things, don't mean human-like intelligence, or even the rigorously defined solving of unconstrained problem spaces that originally defined Artificial General Intelligence (in contrast to "narrow" intelligence).

They mean an artificial god, and it has become a god of the gaps: we have made artificial general intelligence, and it is more human-like than god-like, and so to make a god we must have it do XYZ precisely because that is something which people can't do.

satellite2

Didn't he say that 70% in a random sample of the population should get it right?

kordlessagain

ARC is definitely about achieving AGI and it doesn't matter whether we "have" it or not right now. That is the goal:

> where he introduced the "Abstraction and Reasoning Corpus for Artificial General Intelligence" (ARC-AGI) benchmark to measure intelligence

So, a high enough score is a threshold to claim AGI. And, if you use an LLM to work these types of problems, it becomes pretty clear that passing more tests indicates a level of "awareness" that goes beyond rational algorithms.

I thought I had seen everything until I started working on some of the problems with agents. I'm still sorta in awe about how the reasoning manifests. (And don't get me wrong, LLMs like Claude still go completely off the rails where even a less intelligent human would know better.)

MPSimmons

>a high enough score is a threshold to claim AGI

I'm pretty sure he said that AGI would achieve a high score, not that a high score was indicative of AGI

cubefox

> Getting a high score on ARC doesn't mean we have AGI and Chollet has always said as much AFAIK

He only seems to say this recently, since OpenAI cracked the ARC-AGI benchmark. But in the original 2019 abstract he said this:

> We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.

https://arxiv.org/abs/1911.01547

Now he seems to backtrack, with the release of harder ARC-like benchmarks, implying that the first one didn't actually test for really general human-like intelligence.

This sounds a bit like saying that a machine beating chess would require general intelligence -- but then adding, after Deep Blue beats chess, that chess doesn't actually count as a test for AGI, and that Go is the real AGI benchmark. And after a narrow system beats Go, moving the goalpost to beating Atari, and then to beating StarCraft II, then to Minecraft, etc.

At some point, intuitively real "AGI" will be necessary to beat one of these increasingly difficult benchmarks, but only because otherwise yet another benchmark would have been invented. Which makes these benchmarks mostly post hoc rationalizations.

A better approach would be to question what went wrong with coming up with the very first benchmark, and why a similar thing wouldn't occur with the second.

echelon

My problem with AGI is the lack of a simple, concrete definition.

Can we formalize it as giving out a task expressible in, say, n^m bytes of information that encodes a task of n^(m+q) real algorithmic and verification complexity -- then solving that task within certain time, compute, and attempt bounds?

Something that captures "the AI was able to unwind the underlying unspoken complexity of the novel problem".

I feel like one could map a variety of easy human "brain teaser" type tasks to heuristics that fit within some mathematical framework and then grow the formalism from there.

glenstein

>My problem with AGI is the lack of a simple, concrete definition.

You can't always start from definitions. There are many research areas where the object of research is to know something well enough that you could converge on such a thing as a definition, e.g. dark matter, consciousness, intelligence, colony collapse disorder, SIDS. We nevertheless can progress in our understanding of them in a whole motley of strategic ways: studying the cases that best exhibit salient properties, tracing the outer boundaries of the problem space, tracking the central cluster of "family resemblances" that seem to characterize the problem, entertaining candidate explanations that are closer or further away, etc. Essentially a practical attitude.

I don't doubt in principle that we could arrive at such a thing as a definition that satisfies most people, but I suspect you're more likely to have that at the end than the beginning.

kordlessagain

After researching this a fair amount, my opinion is that consciousness/intelligence (can you have one without the other?) emerges from some sort of weird entropy exchange in domains in the brain. The theory goes that we aren't conscious, but we DO consciousness, sometimes. Maybe entropy, or the inverse of it, gives way to intelligence, somehow.

This entropy angle has real theoretical backing. Some researchers propose consciousness emerges from the brain's ability to integrate information across different scales and timeframes. This would essentially create temporary "islands of low entropy" in neural networks. Giulio Tononi's Integrated Information Theory suggests consciousness corresponds to a system's ability to generate integrated information, which relates to how it reduces uncertainty (entropy) about its internal states. Then there is Hameroff and Penrose, which I commented about on here years ago and got blasted for it. Meh. I'm a learner, and I learn by entertaining truths. But I always remain critical of theories until I'm sold.

I'm not selling any of this as a truth, because the fact remains we have no idea what "consciousness" is. We have a better handle on "intelligence", but as others point out, most humans aren't that intelligent. They still manage to drive to the store and feed their dogs, however.

A lot of the current leading ARC solutions use random sampling, which sorta makes sense once you start thinking about having to handle all the different types of problems. At least it seems to be helping out in paring down the decision tree.

apwell23

This is one of those cases where defining it and solving it are the same thing. If you know how to define it then you've solved it.

davidclark

In the video, François Chollet, creator of the ARC benchmarks, says that beating ARC does not equate to AGI. He specifically says the benchmarks can be beaten without AGI.

cubefox

He only says this because otherwise he would have to either

- say that OpenAI's o3 counts as "AGI", since it unexpectedly beat the ARC-AGI benchmark, or

- explicitly admit that he was wrong when assuming that ARC-AGI would test for AGI

sweezyjeezy

FWIW the original ARC was published in 2019, just after GPT-2 but a while before GPT-3. I work in the field, I think that discussing AGI seriously is actually kind of a recent thing (I'm not sure I ever heard the term 'AGI' until a few years ago). I'm not saying I know he didn't feel that, but he doesn't talk in such terms in the original paper.

yorwba

I think the people behind the ARC Prize agree that getting a high score doesn't mean we have AGI. (They already updated the benchmark once to make it harder.) But an AGI should get a similarly high score as humans do. So current models that get very low scores are definitely not AGI, and likely quite far away from it.

cubefox

> I think the people behind the ARC Prize agree that getting a high score doesn't mean we have AGI

The benchmark was literally called ARC-AGI. Only after OpenAI cracked it did they start backtracking and saying that it doesn't test for true AGI. Which undermines the whole premise of a benchmark.

energy123

https://en.m.wikipedia.org/wiki/AI_effect

But on a serious note, I don't think Chollet would disagree. ARC is a necessary but not sufficient condition, and he says that, despite the unfortunate attention-grabbing name choice of the benchmark. I like Chollet's view that we will know that AGI is here when we can't come up with new benchmarks that separate humans from AI.

crazylogger

I think next year's AI benchmarks are going to be like this project: https://www.anthropic.com/research/project-vend-1

Give the AI tools and let it do real stuff in the world:

"FounderBench": Ask the AI to build a successful business, whatever that business may be - the AI decides. Maybe try to get funded by YC - hiring a human presenter for Demo Day is allowed. They will be graded on profit / loss, and valuation.

Testing a plain LLM on whiteboard-style questions is meaningless now. Going forward, it will all be multi-agent systems with computer use, long-term memory & goals, and delegation.

gonzobonzo

I agree with you but I'll go a step further - these benchmarks are a good example of how far we are from AGI.

A good base test would be to give a manager a mixed team of remote workers, half being human and half being AI, and seeing if the manager or any of the coworkers would be able to tell the difference. We wouldn't be able to say that AI that passed that test would necessarily be AGI, since we would have to test it in other situations. But we could say that AI that couldn't pass that test wouldn't qualify, since it wouldn't be able to successfully accomplish some tasks that humans are able to.

But of course, current AI is nowhere near that level yet. We're left with benchmarks, because we all know how far away we are from actual AGI.

criddell

The AGI test I think makes sense is to put it in a robot body and let it navigate the world. Can I take the robot to my back yard and have it weed my vegetable garden? Can I show it how to fold my laundry? Can I take it to the grocery store and tell it "go pick up 4 yellow bananas and two avocados that will be ready to eat in the next day or two, and then meet me in dairy"? Can I ask it to dice an onion for me during meal prep?

These are all things my kids would do when they were pretty young.

bumby

I think the next harder level in AGI testing would be “convince my kids to weed the garden and fold the laundry” :-)

gonzobonzo

I agree, I think of that as the next level beyond the digital assistant test - a physical assistant test. Once there are sufficiently capable robots, hook one up to the AI. Tell it to mow your lawn, drive your car to the mechanic and have it checked, box up an item, take it to the post office, and have it shipped, pick up your dry cleaning, buy ingredients from a grocery store, cook dinner, etc. Basic tasks a low-skilled worker would do as someone's assistant.

godshatter

The problem with "spot the difference" tests, imho, is that I would expect an AGI to be easily spotted. There's going to be a speed of calculation difference, at the very least. If nothing else, typing speed would be completely different unless the AGI is supposed to be deceptive. Who knows what its personality would be like. I'd say it's a simple enough test just to see if an AGI could be hired as, for example, an entry level software developer and keep its job based on the same criteria base-level humans have to meet.

I agree that current AI is nowhere near that level yet. If AI isn't even trying to extract meaning from the words it smiths or the pictures it diffuses then it's nothing more than a cute (albeit useful) parlor trick.

cttet

The point is not that having a high score -> AGI, their ideas are more of having a low score -> we don't have AGI yet.

ben_w

You're not alone in this; I expect us to have not yet enumerated all the things that we ourselves mean by "intelligence".

But conversely, not passing this test is a proof of not being as general as a human's intelligence.

kypro

I find the "what is intelligence?" discussion a little pointless if I'm honest. It's similar to asking a question like what does it mean to be a "good person", and would we know whether an AI or person is really "good"?

While understanding why a person or AI is doing what it's doing can be important (perhaps specifically in safety contexts) at the end of the day all that's really going to matter to most people is the outcomes.

So if an AI can use what appears to be intelligence to solve general problems and can act in ways that are broadly good for society, whether or not it meets some philosophical definition of "intelligent" or "good" doesn't matter much – at least in most contexts.

That said, my own opinion on this is that the truth is likely in between. LLMs today seem extremely good at being glorified auto-completes, and I suspect most (95%+) of what they do is just recalling patterns in their weights. But unlike traditional auto-completes they do seem to have some ability to reason and solve truly novel problems. As it stands I'd argue that ability is fairly poor, but this might only represent 1-2% of what we use intelligence for.

If I were to guess why this is, I suspect it's not that LLM architecture today is completely wrong, but that the way LLMs are trained means that in general knowledge recall is rewarded more than reasoning. This is similar to the trade-off we humans have with education – do you prioritise the acquisition of knowledge or critical thinking? Many believe critical thinking is more important and should be prioritised more, but I suspect for the vast majority of tasks we're interested in solving, knowledge storage and recall is actually more important.

ben_w

That's certainly a valid way of looking at their abilities at any given task — "The question of whether a computer can think is no more interesting than the question of whether a submarine can swim".

But when the question is "are they going to be more important to the economy than humans?", then they have to be good at basically everything a human can do, otherwise we just see a variant of Amdahl's law in action: the AI performs an arbitrary speed-up of n % of the economy while humans are needed for the remaining 100-n %.

I may be wrong, but it seems to me that the ARC prize is more about the latter.
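
To put rough numbers on the Amdahl's law point: if a fraction p of the work is sped up by a factor s and the rest is untouched, the overall speedup is 1 / ((1 - p) + p / s), which can never exceed 1 / (1 - p) however large s gets. A minimal Python sketch follows. The function names are my own, and treating "the economy" as a single divisible workload is a big simplifying assumption, not something the ARC Prize claims.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is accelerated by factor s.

    Hypothetical helper for illustration: p is the share of work AI can do,
    s is how much faster AI does that share.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def speedup_ceiling(p: float) -> float:
    """Limit of the speedup as s grows without bound (infinite only if p == 1)."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    # Even a billion-fold speedup of 90% of the work stays below the 10x ceiling.
    for p in (0.5, 0.9, 0.99):
        print(p, amdahl_speedup(p, 1e9), speedup_ceiling(p))
```

So with humans still needed for 10% of the work, the whole system tops out at about 10x, which is why being good at "basically everything" matters for the economic question.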

NetRunnerSu

Unfortunately, we did it. All that is left is to assemble the parts.

https://news.ycombinator.com/item?id=44488126

TheAceOfHearts

The first highlight from this video is getting to see a preview of the next ARC dataset. Otherwise it feels like most of what Chollet says here has already been repeated in his other podcast appearances and videos. It's a good video if you're not familiar with his work, but if you've seen some of his recent interviews then you can probably skip the first 20 minutes.

The second highlight from this video is the section from 29 minutes onward, where he talks about designing systems that can build up rich libraries of abstractions which can be applied to new problems. I wish he had lingered more on exploring and explaining this approach, but maybe they're trying to keep a bit of secret sauce because it's what his company is actively working on.

One of the major points which seems to be emerging from recent AI discourse is that the ability to integrate continuous learning seems like it'll be a key element in building AGI. Context is fine for short tasks, but if lessons are never preserved you're severely capped with how far the system can go.

modeless

ARC-AGI 3 reminds me of PuzzleScript games: https://www.puzzlescript.net/Gallery/index.html

There are dozens of ready-made, well-designed, and very creative games there. All are tile-based and solved with only arrow keys and a single action button. Maybe someone should make a PuzzleScript AGI benchmark?

mNovak

This game is great!

https://nebu-soku.itch.io/golfshall-we-golf

Maybe someone can make an MCP connection for the AIs to practice. But I think the idea of the benchmark is to reserve some puzzles for private evaluation, so that they're not in the training data.

visarga

I think intelligence is search. Search is exploration + learning. So intelligence is not in the model or in the environment, but in their mutual dance. A river is not the banks, nor the water, but their relation. ARC is just a frozen snapshot of the banks, not the dynamic environment we have.

ipunchghosts

I agree strongly with this take but find it hard to convince others of it. Instead, people keep thinking there is a magic bullet to discover resulting in a lot of wasted resources and money.

gtech1

This may be a silly question, I'm no expert. But why not simply define AGI as any system that can answer a question that no human can? So for example, ask the AGI to find out, from current knowledge, how to reconcile gravity and QED.

soVeryTired

Computers can already do a lot of things that no human can though. They can reliably find the best chess or Go move better than a human.

It's conceivable (though not likely) that given enough training in symbolic mathematics and some experimental data, an LLM-style AI could figure out a neat reconciliation of the two theories. I wouldn't say that makes it AGI though. You could achieve that unification with an AI that was limited to mathematics rather than being something that can function in many domains like a human can.

m11a

That would be ASI I think.

But consider: technically AlphaTensor found new algorithms no human did before (https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...). So isn't it AGI by your definition of answering a question no human could before: how to do 4x4 matrix multiplication in 47 steps?

layer8

Aside from other objections already mentioned, your example would require feasible experiments for verification, and likely the process of finding a successful theory of quantum gravity requires a back and forth between experimenters and theorists.

imiric

"What is the meaning of life, the universe, and everything?"

lawlessone

How do we define AGI?

I would have considered AGI to be something that is constantly aware; a biological brain is always on. An LLM is on briefly while it's inferring.

A biological brain constantly updates itself and adds memories of things. Those memories generally stick around.

bogtog

I wonder how much slow progress on ARC can be explained by their visual properties making them easy for humans but hard for LLMs.

My impression is that models are pretty bad at interpreting grids of characters. Yesterday, I was trying to get Claude to convert a message into a cipher where it converted a 98-character string into a 7x14 grid where the sequential letters moved 2-right and 1-down (i.e., like a knight in chess). Claude seriously struggled.

Yet, Francois always pumps up the "fluid intelligence" component of this test and emphasizes how easy these are for humans. Yet, humans would presumably be terrible at the tasks if they looked at them character-by-character.

This feels like a somewhat similar (intuition-lie?) case as the Apple paper showing how reasoning models can't do Tower of Hanoi past 10+ disks. Readers will intuitively think about how they themselves could tediously do an infinitely long Tower of Hanoi, which is what the paper is trying to allude to. However, the more appropriate analogy would be writing out all >1000 moves on a piece of paper at once and being 100% correct, which is obviously much harder.
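
For a sense of scale on the Tower of Hanoi point: the standard recursive solution needs 2^n - 1 moves for n disks, so 10 disks already means 1,023 moves, every one of which has to be right. A minimal Python sketch of generating and checking such a move list is below. The function names and the peg labels A/B/C are my own choices for illustration, not anything taken from the Apple paper.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield (disk, from_peg, to_peg) for the standard recursive solution."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, move disk n, then restack.
    yield from hanoi_moves(n - 1, source, spare, target)
    yield (n, source, target)
    yield from hanoi_moves(n - 1, spare, target, source)


def is_valid_solution(n, moves):
    """Replay the moves and check every step is legal and ends solved on peg C."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # top of peg = last item
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the named disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # would place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return not pegs["A"] and not pegs["B"] and pegs["C"] == list(range(n, 0, -1))


if __name__ == "__main__":
    moves = list(hanoi_moves(10))
    print(len(moves))                    # 1023, i.e. 2**10 - 1
    print(is_valid_solution(10, moves))  # True
```

A single wrong entry anywhere in that list makes the check fail, which is the "writing it all out at once" version of the task rather than the "do one move at a time" version people imagine.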

krackers

I thought so too back when the test was first released, but now that we have multimodal models which can take images directly as input, shouldn't this point be moot?

khalic

This quest for an ill-defined AGI is going to create a million Captain Ahabs.

chromaton

Current AI systems don't have a great ability to take instructions or information about the state of the world and produce new output based upon that. Benchmarks that emphasize this ability help greatly in progress toward AGI.

vixen99

Is the text available for those who don't hear so well?

jasonlotito

At the very least, YouTube provides a transcript and a "Show Transcript" button in the video description, which you can click on to follow along.

heymijo

When I watched the video I had the subtitles on. The automatic transcript is pretty good. "Test-time" which is used frequently gets translated as "Tesla" so watch out for that.
