
Results of "Humanity's Last Exam" benchmark published

dang

The project site is https://lastexam.ai. Readers may want to look at both.

next_xibalba

These types of exams, and most benchmarks to date, seem to be very one-dimensional in terms of measuring intelligence. For instance, if we transported a human from 2,000 years ago to the present day and asked him to take this exam, he would likely get 0%, given that he couldn't read or write, let alone comprehend the concepts and context required to solve these questions. But that man would still undoubtedly be far more intelligent than an ape on all dimensions. He would likely be more intelligent than a toddler on many dimensions. He might even be more intelligent than some high school students on a few dimensions. I can't exactly articulate "what" is missing or how to measure it, but I can intuit that something is missing from these benchmarks.

oersted

"Intelligence" itself is very ill-defined and we've never been able to measure it properly, IQ is rife with issues.

At some point, you just have to be pragmatic and measure the questions you want the AI to be good at answering, rather than trying to measure intelligence in general.

In that sense, I see this as one more benchmark that collects questions we want/expect AI to be good at answering, that it is not yet good at, and that have been underrepresented in previous benchmarks. That's obviously valuable; there's nothing "magical" about it. Although it is reasonable to be annoyed at the "Humanity's Last Exam" naming: of course they must have missed plenty of edge cases like everyone else, and it is very arrogant to claim it will be the "Last" one.

godelski

  > "Intelligence" itself is very ill-defined
While this is true, it is well agreed upon (by domain experts) that intelligence is distinct from knowledge recall. But that's what most of these tests... test.

If you look at IQ tests you'll see that they are attempts to test things that aren't knowledge based. You'll also notice that the main critiques of IQ tests are about how they often actually measure knowledge and that there's bias in natural knowledge acquisition. So even the disagreements about the definition of intelligence make clear that knowledge and intelligence are distinct.

I feel that often people conflate "intelligence is ill-defined" with "intelligence has no definition." These two are not in opposition. Being ill-defined is more like "I know I left my phone in the house, but I'm not sure where." This is entirely different from "I lost my phone, it is somewhere in California" or "It is somewhere on Earth" and clearly different from "I lost my phone. I'm unsure if I had a phone. What even is a phone?"

oersted

Yes, agreed, there is indeed a rough consensus on what intelligence is and reasonable ways to approximately measure it. These standard tests have been applied to LLMs from the beginning; they have not proven to be the most helpful for guiding research, but there's value in applying benchmarks that have been battle-tested on humans.

It's just that OP was questioning this group's criteria for selecting the questions that determine intelligence. Then we get into endless discussions of semantics.

At the end of the day, you are just testing which questions your AI performs well on, and you can describe how you chose those questions. Claiming it measures "general intelligence" is just unhelpful and frustrating.

JohnMakin

> IQ is rife with issues

Indeed, and yet people are obsessed with it and the idea of measuring their own intelligence - I completely do not understand it. I am in an extremely high percentile, but I am a total moron in a lot of areas and if you met me would likely think so as well. It's a poor predictor for just about everything except how good a person is at recognizing patterns (I know there are many different kinds of tests, but inevitably, it feels like this) and how quickly they can reason. But people are obsessed with it (go on Quora and search "IQ" - you probably won't have to, though, since half the questions there are seemingly about IQ).

A thing I like to say is you didn't earn your intelligence any more than a 7'0" man earned his height - to some degree it seems innate (we don't even really know how).

This all said, it seems even more pointless to try to "IQ" test an AI in this manner. What does it predict? What is it measuring? And you're not going to be able to use the same questions for more than 1 test, because the AI will "learn" the answers.

godelski

The lowest IQ thing you can do is be obsessed with IQ.

There are known knowns, there are known unknowns, and there are unknown unknowns. The wise man knows he cannot know what he does not know and that it'd be naive to presume he knows when he cannot know how much he doesn't know. Therefore, only the unintelligent man really _knows_ anything.

visarga

> "Intelligence" itself is very ill-defined and we've never been able to measure it properly, IQ is rife with issues.

Yes, because it is 1st person exclusively. If you expand a bit, consider "search efficiency". It's no longer just 1st person; it can be social. And it doesn't hide the search space. Intelligence is partially undefined because it doesn't specify the problem space; that is left blank. But "search efficiency" is more scientific and concrete.

esotericimpl

This is always the answer for anyone who thinks LLMs are capable of "intelligence".

It's good at answering questions that it's trained on. I would suggest general intelligence is being good at answering the things you didn't want/train the AI to be good at answering.

fooker

Are you good at answering questions you are not trained to answer?

How about a middle school test in a language you don’t speak?

golol

The things that are missing are what stop us from having useful agents so far: agency, judgement, a sense of time, long-horizon planning, not being gullible. I kinda feel like some amount of ego is necessary to get a model to behave like that.

tkgally

I agree that many aspects of intelligence—and of the lack of intelligence—are not being measured by such benchmarks. One issue is that they only examine problems that have right answers.

One of the most powerful uses of LLMs for me, at least, is brainstorming: having them suggest possible avenues for me to pursue with specific projects I am working on. If I give Claude or ChatGPT or Gemini enough context about my problems, they usually come up with useful suggestions—sometimes amazingly well. Are they better at that than the smartest human? I don't know. How do you quantify the quality of an idea? But those ideas often seem really, really good to me.

Another difficult-to-measure capability is interaction. Back-and-forth conversations with models don't always go well, but when they work they frequently blow me away. But those successes are dependent partly on the model, partly on me, and partly on how the conversation happens to unfold. Again, that success or failure doesn't seem measurable with benchmarks that require objectively right answers.

taeric

I'm curious why you are confident they would be more intelligent than a modern toddler?

I largely empathize with your point. But, as I can recognize there are some out there far better at problem solving than I am, I am growing ok with the idea that intelligence can be measured. Not to a single number, most likely, but to a variety of different aspects.

Similarly, I'd imagine that a human from 2000 years ago is probably hardier than one from the modern age, if only because of selection effects at play.

Obviously, you can't extrapolate a straight line between either measurement and expect it to continue in either direction. But I don't know why you couldn't build up a measurement for it?

(And it should go without saying that you shouldn't be judging worth using this sort of measurement.)

og_kalu

This is true, but that's because it's gotten hard to do much else. LLMs are eating up everything else that doesn't require long-horizon planning or multimodality.

If you created a new benchmark today that didn't lean on the things I've mentioned, or on esoteric/super-specialized domain knowledge (the kind that would actually require some sort of superhuman performance to ace), as this one and Frontier Math do, LLMs would probably do pretty well.

modeless

ARC-AGI is a benchmark with no language that could plausibly be solved by primitive humans, assuming only intelligence.

fooker

Put ‘em in diverse simulations and see how long they survive.

I can imagine a dystopian world where people are subjected to this for training and testing AI.

WanderPanda

I mean it is humanity’s LAST exam. Humanity’s first exam would probably be something about communication? Or about building and predicting effects of certain tools?

jbenoit

They started collecting problems last fall, saying the top 550 submissions sent in by Nov 1st would get rewarded, to the tune of $500-$5000 each.

Near the deadline, I counted the total number of submissions and realized that each question I wrote had an expected value of hundreds of dollars, which made it a great use of my time. So I wrote a good number, using the knowledge gained in my CS Ph.D.

Then, as the Nov 1st deadline rolled around, they announced they extended the deadline to Nov 15th. Then Nov 15th came, and it said on their website they were still accepting submissions.

Most of my submissions are being included in the benchmark, but I'm only getting paid $500, and only for one of them (the one I thought was the most standard and least difficult, funnily enough). Had they closed submissions when they said they would, it seems likely I'd have been paid for a few more.

From my perspective, they basically conned hundreds of Ph.D.s around the world into writing questions for much less reward than promised. My close friend wrote a large number of questions for them, is getting paid thousands of dollars, and still feels defrauded.

I'm not sure what they're doing in the end. It sounds like they're mostly just paying people who submitted before Nov 1st, with a few exceptions, but either way they lied. There was no indication that people who submitted later would not get paid, and there was no indication that the deadline would be extended. Either they pay people who submitted after Nov 1st, meaning they lied to the people who submitted before about their expected reward, or they don't, meaning they majorly lied to the people who submitted after. Either way, it's clear grounds for a class-action lawsuit, and I hope one gets going.

vkou

You shouldn't engage in a CAL; a regular lawsuit from anyone wronged will be cheaper and way more painful for them.

If you're in the US, consider small claims court. It's a small sum of money, you won't need to pay a lawyer, they'll probably not even show up.

jbenoit

Hmmm. I can see how it would be more painful for them to fight, but most people were conned out of less than $200, and it's rather self-sacrificing to fight over that. Plus, no one wants a reputation for being litigious, and starting a CAL is less conducive to creating that reputation.

I only submitted before Nov 1st, so I'm not sure to what extent I was personally conned.

renjimen

I don't know about groundbreaking. It's just more academic questions. We already have a lot of those benchmarks; this one is just a bit harder. But at this point these models are so glaringly bad in so many other areas APART from academic questions. Benchmarks for spatial reasoning or theory of mind are more interesting now, for example. These kinds of understanding are far more important if we expect to integrate AI into our everyday lives. I suspect even our most distant primate cousins could outperform multi-modal models on these kinds of tests.

jfengel

It does feel a bit like the early days of AI:

"We want to make computers do what smart people do. What do smart people do? They play chess! Once we've solved that, everything else will be easier."

It has been remarkable how much of the "easier" stuff they've made progress on -- like natural language and images. But after a huge quantum leap of improvement, they don't seem very good at adapting to a lot of the things we really need them for.

renjimen

Exactly!

Whatever world model LLMs have is like this crippled view through the lens of the internet. They are really like savants.

It's annoying that the AI companies are still touting their performance on all these metrics of domain knowledge in white-collar jobs, but in truth they will fail in all but the most narrow applications in those domains because they can't understand basic human behaviour.

pavel_lishin

> Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

I wonder how many questions give a gentle nudge towards the answer like this. How many answers would have been wildly off the mark without specifying what the answer needs to look like?

sdwr

Isn't this a terrible question to measure intelligence? It looks like it's testing niche domain knowledge along the lines of:

> What color is the ball hidden behind the flowerpot in my neighbor's backyard?

Maybe you can reason towards the answer if you only have a deep knowledge of bird anatomy and not Apodiformes anatomy, and that's the intelligence part?

zeroonetwothree

Good point. I wouldn’t expect a human to need the last sentence.

salynchnew

The generous hypothesis here is that this is so they can automate the benchmarking itself. If that is true, then this is likely a result of the test authors being too clever for their own good and over-optimizing. If an LLM can't figure out on its own that "how many" is asking for a number, it has failed at a much more basic level.

You should be able to easily accept answers like "four" and "4" as equivalent, for example. I doubt there will be that many frontier models running against this test at any time, and a simple glance at the answers from any human should be enough to catch edge cases like this one.
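
For illustration, here is a minimal sketch in Python of the kind of answer normalization that would let "four" and "4" grade as equivalent. It is purely hypothetical and not the benchmark's actual grader:

  # Toy answer-normalization helper (illustrative only, not the benchmark's grader).
  NUMBER_WORDS = {
      "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
      "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
  }

  def normalize_numeric_answer(text):
      """Return the integer an answer refers to, or None if unrecognized."""
      token = text.strip().lower().rstrip(".")
      if token in NUMBER_WORDS:
          return NUMBER_WORDS[token]
      try:
          return int(token)
      except ValueError:
          return None

  assert normalize_numeric_answer("four") == normalize_numeric_answer("4") == 4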

m_ke

The only reliable final test will be a black-box test suite that takes your model, executes it in a sealed environment, and gives you a grade back, potentially with a performance breakdown by subject.

No telling companies what the questions look like, what the output format is, what topics are covered, so that there’s no room to make up synthetic data to interpolate from.
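
As a purely hypothetical sketch of what such a sealed harness's interface could look like (the names and structure below are invented for illustration, not anything that exists):

  # Hypothetical sealed-evaluation harness: the model is submitted as a callable,
  # the hidden questions never leave the harness, and only aggregate scores come back.
  from dataclasses import dataclass

  @dataclass
  class EvalReport:
      overall: float     # fraction of hidden questions answered correctly
      by_subject: dict   # e.g. {"math": 0.12, "biology": 0.30}

  def run_sealed_eval(model, hidden_questions):
      """hidden_questions: (subject, prompt, expected_answer) triples kept private."""
      correct, total = {}, {}
      for subject, prompt, expected in hidden_questions:
          total[subject] = total.get(subject, 0) + 1
          if model(prompt).strip() == expected.strip():
              correct[subject] = correct.get(subject, 0) + 1
      by_subject = {s: correct.get(s, 0) / n for s, n in total.items()}
      overall = sum(correct.values()) / max(sum(total.values()), 1)
      return EvalReport(overall, by_subject)

In practice the grading would have to be far more forgiving than exact string matching, which is exactly the "four" vs. "4" issue raised above.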

LPisGood

The 8 sample questions available here are interesting:

https://lastexam.ai/

I might be able to answer 2 of them with great effort (maybe!), and I would be highly surprised if any human alive could answer 5 or more without seeing the problems in advance.

sebzim4500

I can answer 2 of them quite quickly with pen and paper (compsci, physics), and one more (maths) after looking up some definitions on Wikipedia, so I am certain there are people who can do more than 5.

The computer science one seems weirdly easy compared to the rest: it's multiple choice, and it is very easy to get it right by process of elimination even if you don't understand how to actually do the problem.

LPisGood

Yes, many can answer the compsci and physics problems. The math problem is abstract and more difficult, but solving those 3 plus 2 others seems nearly superhuman.

zamalek

I assume that the questions (and answers) aren't published anywhere? Else it would be "Humanity's Last Exam before the previous crawl".

mrandish

Assessing AI's progress toward replicating the full breadth and depth of human intelligence is a deceptively hard problem. "On the Measure of Intelligence", a paper by François Chollet, who was until recently a researcher at Google, is the best overview of the challenges I've read. Highly recommended.

https://arxiv.org/abs/1911.01547

sebzim4500

The name is obviously a bit stupid, but based on the sample questions I think they did a good job of creating a harder version of the existing academic question benchmarks.

The questions are possible for a smart person familiar with the subject but still just beyond SOTA models.

My guess is that within the next few years we will have models that can ace this test but are still bizarrely bad at things we find easy.

xnx

Interesting marketing for Scale AI. I'd be surprised if any foundation models started benchmarking against this.

Captchas seem like the more interesting test. As long as there are captchas that average people can solve, but computers can't, we will still have a long way to go toward artificial intelligence.

sebzim4500

I don't think this is necessarily true. I can imagine a future in which we have robots that can do 99% of human jobs, but there's one otherwise unimportant skill they are strangely bad at that can be used as a captcha.

dang

I briefly merged this thread into https://news.ycombinator.com/item?id=42804853, but actually the current article has more context, so probably we should keep this as the top link and then people can look at https://lastexam.ai also.