
A Man Out to Prove How Dumb AI Still Is

noosphr

>Last week, the ARC Prize team released an updated test, called ARC-AGI-2, and it appears to have sent the AIs back to the drawing board. The full o3 model has not yet been tested, but a version of o1 dropped from 32 percent on the original puzzles to just 3 percent on the new version, and a “mini” version of o3 currently available to the public dropped from roughly 30 percent to below 2 percent. (An OpenAI spokesperson declined to say whether the company plans to run the benchmark with o3.) Other flagship models from OpenAI, Anthropic, and Google have achieved roughly 1 percent, if not lower. Human testers average about 60 percent.

ARC-AGI is the main reason why I don't trust static benchmarks.

If you don't have an essentially infinite set to draw your validation data from, then a large enough model will memorize it as part of its developer team's KPIs.

Forget all these fancy benchmarks. If you want to saturate any model today, give it a string and a grammar and ask it to generate the string from the grammar. I've had _every_ model fail this on regular grammars with strings more than 4 characters long.
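
Concretely, the task has roughly this shape; the grammar and code here are my own illustration, not anything a model was actually given:

  # A toy right-linear (regular) grammar and a brute-force search for a
  # derivation of a target string: the "generate the string from the
  # grammar" task, done mechanically.
  from collections import deque

  GRAMMAR = {
      "S": ["aS", "bA"],  # S -> aS | bA
      "A": ["bA", "b"],   # A -> bA | b
  }

  def derive(target, start="S"):
      """BFS over sentential forms; returns the derivation steps or None."""
      queue = deque([(start, [start])])
      while queue:
          form, steps = queue.popleft()
          if form == target:
              return steps
          # Find the leftmost nonterminal (uppercase); prune dead ends.
          idx = next((i for i, c in enumerate(form) if c.isupper()), None)
          if idx is None or len(form) > len(target):
              continue
          for rhs in GRAMMAR[form[idx]]:
              new_form = form[:idx] + rhs + form[idx + 1:]
              queue.append((new_form, steps + [new_form]))
      return None

  print(derive("aabb"))  # -> ['S', 'aS', 'aaS', 'aabA', 'aabb']

The search is trivial mechanically; the test is whether a model can produce a derivation chain like that unaided.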

LLMs are the solution to natural language, which is a huge deal. They aren't the solution to reasoning, which is still best solved by what used to be called symbolic AI before it started working, e.g. SAT solvers.

globnomulous

I tried my own test recently:

"Write a history of the Greek language but reverse it, so that one would need to read it from right to left and bottom to top."

ChatGPT wrote the history and showed absolutely no awareness, let alone "understanding," of the second half of the prompt.

tkcranny

As much as I think AI is overhyped too, that is a prime use case that would be better solved by passing the text to a tool, rather than jamming a complex transformation like that into its latent space.

ahupp

With o3-mini-high (just the last paragraph):

civilization Mycenaean the of practices religious and economic, administrative the into insights invaluable provides and B Linear as known script the in recorded was language Greek the of form attested earliest The

globnomulous

Oh, interesting, what do you get when you specify that the letters need to be reversed, too? (That was what I meant and the original prompt explicitly stated that requirement. I forgot to include it in the summary of my 'test' here.)

wordofx

Just copied your prompt and it handled it just fine.

globnomulous

?siht ekil kool rewsna eht diD

Edit: realized just now that my summary of the 'test' failed to specify the request fully: the letters need to be reversed, too. Maybe I'm just bad with AI tools, because I didn't even get a response that 'this like looked' (i.e. reversed the order of the words).

whiplash451

Show me the results of your symbolic AI on ARC 2.

fritzo

ARC 2 is brand new, but neurosymbolic approaches have performed well on the original ARC, e.g. https://arxiv.org/abs/2411.02272

janalsncm

> best solved with what used to be called symbolic AI before it started working

Right, the current paradigm of requiring an LLM to do arbitrary-digit multiplication will not work, and we shouldn't need it to. If your task is “do X” and it can be reliably accomplished with “write a python program to do X”, that's good enough as far as I'm concerned. It's preferable, in fact.

Btw Chollet has said basically as much. He calls them “stored programs” I think.

I think he is onto something. The right atomic unit for approaching these problems is probably not the token, at least at first. Higher-level abstractions should be refined into specific components, similar to the concept of diffusion.

YurgenJurgensen

As soon as the companies behind these systems stop marketing them as do-anything machines, I will stop judging them on their ability to do everything.

The ChatGPT input field still says ‘Ask anything’, and that is what I shall do.

brookst

You can ask me anything. I don’t see that as a promise that I am infallible.

mdp2021

> that’s good enough as far as I’m concerned

But in that case, why an LLM? If we want Question-Answer machines to be reliable, they must have the skills, which include "counting", just as a basic example.

janalsncm

The purpose of the LLM would be to translate natural language into computer language, not to do the calculation itself.
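
Something like this sketch; ask_llm is a hypothetical stand-in for any chat-completion call, with its reply hardcoded for illustration:

  # Division of labor: the LLM only translates the question into an
  # arithmetic expression; Python does the actual calculation.
  import ast, operator

  def ask_llm(question):
      # Hypothetical stand-in: a real call would prompt a model to emit
      # a bare arithmetic expression for the question asked.
      return "123456789123456789 + 987654321987654321"

  OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
         ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv}

  def safe_eval(expr):
      """Evaluate +, -, *, // over integer literals only (no eval())."""
      def walk(node):
          if isinstance(node, ast.BinOp) and type(node.op) in OPS:
              return OPS[type(node.op)](walk(node.left), walk(node.right))
          if isinstance(node, ast.Constant) and isinstance(node.value, int):
              return node.value
          raise ValueError("disallowed expression")
      return walk(ast.parse(expr, mode="eval").body)

  print(safe_eval(ask_llm("What's 123456789123456789 plus 987654321987654321?")))
  # -> 1111111111111111110, exact, every time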

Ologn

Most human ten-year-olds in school can add two large numbers together. If a connectionist network is supposed to model the human brain, it should be able to do that. Maybe LLMs can do a lot of things, but if they can't do that, then they're an incomplete model of the human brain.

michaelmarkell

If I were to guess, most (adult) humans could not add two 3-digit numbers together with 100% accuracy. Maybe 99%? Computers already do it with 100% accuracy, so we should probably be figuring out how to use language to extract the numbers from text and send them off to computers to do the calculations, especially since in the real world most calculations that matter are not just two-digit addition.

janalsncm

Artificial neural nets are pretty far from brains. We don’t use them because they are like brains, we use them because they can approximate arbitrary functions given sufficient data. In other words, they work.

For what it’s worth, people are also pretty bad at math compared to calculators. We are slow and error-prone. That’s ok.

What I was (poorly) trying to say is that I don’t care if the neural net solves the problem if it can outsource it to a calculator. People do the same thing. What is important is reliably accomplishing the goal.

jameshart

Most human ten-year-olds can add two large numbers together with the aid of a scratchpad and a pen. You need tools other than a one-dimensional vector of text to do some of these things.

SpicyLemonZest

No LLM or other modern AI architecture I'm aware of is supposed to model the human brain. Even if they were, LLMs can add large numbers with the level of skill I'd expect from a 10-year-old:

----

What's 494547645908151+7640745309351279642?

ChatGPT said: The sum of 494,547,645,908,151 and 7,640,745,309,351,279,642 is:

7,641,239,857,997,187,793

----

(7,641,239,856,997,187,793 is the correct answer)

sionisrecur

> If you don't have an essentially infinite set to draw your validation data from, then a large enough model will memorize it as part of its developer team's KPIs.

Sounds like a use-case for property testing: https://en.wikipedia.org/wiki/Software_testing#Property_test...
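
For example, with the Hypothesis library (model_add is a hypothetical stand-in for a model call):

  # Property-based test: inputs are drawn fresh on every run, so there
  # is no fixed validation set for a model to memorize.
  from hypothesis import given, strategies as st

  def model_add(a, b):
      return a + b  # imagine this queried a model instead

  @given(st.integers(min_value=0, max_value=10**18),
         st.integers(min_value=0, max_value=10**18))
  def test_addition_property(a, b):
      assert model_add(a, b) == a + b

Run under pytest, Hypothesis generates fresh cases on every run and shrinks any failure to a minimal counterexample.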

mdp2021

> I've had _every_ model fail this

That seems to be because LLMs can't reliably follow procedures (e.g. counting).

gambiting

>> If you want to saturate any model today give it a string and a grammar and ask it to generate the string from the grammar.

I'm not sure I understand what that means - could you explain please?

janalsncm

It means applying specific rules about how text can be generated. For example, generating valid JSON reliably. Currently we use constrained decoding to accomplish this (e.g. the next token must be one of three valid options).

Now you can imagine giving an LLM arbitrary validity rules for generating text. I think that’s what they mean by “grammar”.
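
In its simplest form, constrained decoding is just masking the logits before sampling; a sketch with made-up numbers:

  # Constrained decoding in miniature: mask the logits so only tokens
  # the grammar currently allows can be sampled, then pick greedily.
  def constrained_step(logits, allowed):
      masked = {tok: score for tok, score in logits.items() if tok in allowed}
      return max(masked, key=masked.get)

  # After emitting '{"name":' a JSON grammar might allow only these:
  logits = {'"': 1.2, '}': 0.7, 'hello': 2.5, '[': 0.1}
  allowed = {'"', '[', 'null'}
  print(constrained_step(logits, allowed))  # -> '"' even though 'hello' scores higher

Real engines recompute the allowed set from the grammar state after every emitted token; the masking step itself really is this simple.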

elpocko

I'm not GP, but here goes:

LLMs are token-based; tokens are words or word fragments, so the models have limited ability to work on a letter-by-letter basis. They can't reliably count letters in a sentence, for example. "Give it a string and a grammar and ask it to generate the string from the grammar" can't be done by inference alone because of this: they would generate tokens that don't match the grammar.

But you can use a grammar-based sampler and it'll generate valid strings just fine. llama.cpp can easily do this if you provide an EBNF grammar specification.
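
For reference, a grammar in llama.cpp's GBNF notation looks like this (a toy example of my own; if I recall correctly you pass it with --grammar-file, or inline with --grammar):

  # GBNF: restrict output to a JSON-style list of integers
  root ::= "[" num ("," num)* "]"
  num  ::= "-"? [0-9]+

With the grammar-based sampler active, the model can only ever emit strings the grammar accepts.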

i_am_proteus

>Chollet, a French computer scientist and one of the industry’s sharpest skeptics

I feel like this description really buries the lede on Chollet's expertise. (For those who don't know, he's the creator of and lead contributor[0] to Keras.)

[0]https://github.com/keras-team/keras/graphs/contributors

mikestew

Not to dismiss Chollet’s work, but I’m starting to think he need prove nothing to even the muggles. For example, nearly any endurance athlete stands a good chance of being a Strava user. If you run in those circles, have you heard a single person with anything good to say about Strava’s “Athletic Intelligence”? Garmin is rolling out a beta right now that includes “AI Insights” or summat. Same deal: useless summaries like “you ran 5 miles today, which contributes to your aerobic base”. I could do better with a database and some if/else statements. And Garmin wants a subscription for this. (It’s included in Strava’s subscription, but I suppose you’re still paying for it.) And so now the memes tend toward “dumb AI insight of the day” on many online forums.
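
And I mean that literally; something like this, give or take a database query (illustrative, obviously):

  # The "database and some if/else statements" version of an AI Insight:
  def insight(miles_today, avg_miles):
      if miles_today == 0:
          return "Rest day logged."
      if miles_today > 1.5 * avg_miles:
          return "That was longer than usual. Consider a rest day."
      return (f"You ran {miles_today:g} miles today, "
              "which contributes to your aerobic base.")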

Seems to me that a lot of folks are enjoying having an LLM rewrite their email or whatever, but I wonder how many are actually buying the rest of it? The companies themselves sure aren’t helping.

anthomtb

I use Strava for mountain biking and the Athletic Intelligence is just comical.

"This ride was longer and harder than usual" - no sh*t, the map, elevation profile and my legs have already informed me.

"You set 3 new PRs" - I can see that with one flick of the thumb thank you.

"Consider a rest day" - consider? None of my job, spouse, equipment or body is too keen on doing that again for a while.

soupfordummies

I'm taking a university course right now and one of the big textbook companies (McGraw-Hill, Macmillan, one of those) has an "AI Tutor" on their homework assignment.

It's somehow LESS helpful than just having a pointer to which part of the text to revisit.

It basically just restates the question/problem. Even worse, it's an essentially STATIC note for each question, yet it appears to be REAL-TIME GENERATED each time. I guess that could just be for appearances, but it's dumb all the way around, really.

fabbari

I think some of the AI demos are kind of comedy gems.

I saw the Apple Intelligence presentation a while ago, and in the span of five minutes they had someone asking the assistant to expand a one-liner into an e-mail, and then someone receiving a long e-mail and asking the AI to summarize it.

We spun GPUs to expand, then spun them again to summarize. Gold.

some_random

Calling Francois Chollet just "A Man" in the title (or "The Man" in the actual article as of writing) is crazy work; he's been deeply involved in ML for ages, including creating Keras.

whiplash451

Francois Chollet and his work deserve a better title than this stupid headline.

Francois is out to push the boundaries of science and help create models that are truly more intelligent.

sroussey

Maybe if AI knew what it was doing I would not end up banging my head like I did here:

https://chatgpt.com/share/67ef43f4-3b88-800d-a5a3-e3ffea178f...

(Me trying to describe a desk top with a fold down hinged top, and it just drawing whatever)

mdp2021

> In 2019, Chollet created the Abstraction and Reasoning Corpus for Artificial General Intelligence, or ARC-AGI—an exam designed to show the gulf between AI models’ memorized answers and the “fluid intelligence” that people have

There are a number of skill signals we demand from an intelligence.

Mind you: some of them are achieved, like the ability to interpret pronouns (Hinton's "the trophy will not fit in the case: it's too big" vs "the trophy will not fit in the case: it's too small").

Others we meet occasionally, when we are not researching said requirements systematically: one example is the detective game described at https://news.ycombinator.com/item?id=43284420 - a simple game of logic that intelligences are required to be able to solve (...and yet, again, some rebutted that humans would fail it too, etc.).

It remains important, though, that those working modules not be siloed (solving specific tasks and remaining unused otherwise): they must be intellectual keys, adapted for use in the most general cases in which they can be helpful. That's important in intelligence. So even the ability to solve "revealing" tasks is not enough; the way in which the ability works is crucial.

mdp2021

> A person who scores 30 percent on ARC-AGI-2 is in no sense inferior to someone who scores 90 percent

"News just in: journalist for the Atlantic stops reasoning and drifts in a world of feelings after neural hijacking, as he perceives abilities as some kind of threat".

> Human cognitive diversity [...] when that diversity is already so abundant, do you really want to?

We definitely need intelligence.

echelon

> When I spoke with him earlier this year, Chollet told me that AI companies have long been “intellectually lazy“

s/intellectually lazy/hype maxing for fundraising/

refulgentis

I think it's fascinating that his impossible benchmark got defeated, but because the Keras guy doesn't like LLMs, it's possible to mishear algorithmic distaste as saying the people shipping this are "lazy" and "hype maxing."

whiplash451

Francois never said he dislikes LLMs. In fact, he said he expected them to be part of the solution to ARC.

I don’t know where this persistent myth comes from, but it has to go.

refulgentis

> part of the solution to ARC.

> I don’t know where this persistent myth comes from,

Part of, explicitly; not *the*, quite 100% explicitly. The TL;DR is "LLMs can't do it alone; program synthesis leveraging LLMs is my bet". Not "maybe not LLMs, but they'll certainly help us get there!", quite the opposite! Hence: well, TFA. And the "intellectually lazy" quote we are explicitly discussing. And anything Chollet has said on the subject. [^1]

[^1]"LLMs won’t lead to AGI - $1,000,000 Prize to find true solution" - https://www.dwarkesh.com/p/francois-chollet - 1.5 hours with the gent

artificialprint

ARC-AGI-1, the benchmark that "got defeated", was published even before the first mainstream LLMs and still stood the test of time.

refulgentis

> "got defeated"

1.5 hours with Chollet on "LLMs won’t lead to AGI - $1,000,000 Prize to find true solution" -

Published June 2024, and by December, well...we can all agree there's an ARC AGI 2 now.

https://www.dwarkesh.com/p/francois-chollet

HenryBemis

  1a) It's not AI, it's an LLM. The companies who create/train/operate them may (wink-wink) pitch them as "AI" with half-truths, but we (here) know it's LLMs "all the way down".
  1b) Just like I disliked the "autopilot" in Teslas, because it was never autopilot.
  2) I know that I wanted to write some software tools, and I have been successful at this for the past many months; I got top-shelf tools that work, do their tasks, send alerts, etc. And I am not the only one. So if the purpose is to "show it's a stupid AI"... well, it's not AI, so yeah. If the purpose is "it is not perfect": yes, it draws a hand with 10 fingers. What else is new?
LLMs are a tool, still under development, still early in the curve. They can do A-B-C well but not X-Y-Z well (or at all). Congratulations :)

dgfitz

> LLMs are a tool, still under development, still early in the curve…

I completely agree that they are a tool, and a decently useful one. But they are not early in the curve; they're about flat at this point.

Yeul

Unfortunately, I played Deus Ex religiously when I was a kid, so I'll never be impressed with AI. The sequel had a self-flying helicopter; good luck with that, Tesla and BYD.