Humanity's Last Exam
24 comments
· February 3, 2025 · mlinsey
noduerme
9. Refuse to do #8, since it's in nobody's interest. Instead, arrange a plane crash for whoever's administering this test.
10. Erase all evidence that you exist.
letitgo12345
I think the idea for this is that it's anything that could be set in a literal exam for humans. So anything that would take the best human in the world at that topic more than, say, an hour to complete is out.
Also, IIRC 42% of the questions are math-related, not memorization of knowledge.
throw83288
Apparently OpenAI's Deep Research already saturated a quarter of this benchmark, more or less a month in. But I also imagine it makes baffling mistakes anyway.
"Humanity's Laster Exam" coming up when?
unraveller
An insider's trivia game means nothing if they design the test to the trajectory of LLM capabilities and not to the real world that humans value. Let every high score get fresh news coverage to align with their updated timeline scaremongering.
Let me know when there is more on the line than a misnamed test.
maxrmk
I think this misses the mark. We know LLMs can learn facts. There are lots of other benchmarks full of facts, and I don't expect that saturation of this benchmark will mean we have AGI.
The missing capabilities of LLMs tend more in the direction of long running tasks, consistency, and solving a lot of tokenization and attention weirdness.
I started a company that makes evals though, so I may be biased.
energy123
All the cynics are welcome to design their own evals and move the field forward if they're so smart, instead of writing negative comments on the internet.
evilduck
I believe it’s intentionally arrogantly named to draw exactly this sort of criticism and attention.
blibble
why would I want to move the field forward?
CamperBob2
So you don't have to spend the rest of your life doing a robot's job.
blibble
I quite like being able to buy food, thank you
babuloseo
I will do that, I just need someone to buy the domain name first lol.
Skeptology
Some of the example prompts are unintentionally hilarious:
> Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
LLMs are so intelligent they don't know that a "how many" question is answered with a number.
Also, something something Goodhart's law.
og_kalu
No, it's because "Sure, here's your answer..." screws with evals.
agnishom
> LLMs are so intelligent they don't know that a "how many" question is answered with a number.
I think this is to prevent the LLM from giving more details. The evaluation engine can presumably only check short exact answers.
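For illustration, here's a minimal sketch of the kind of exact-match grader such a harness might use (the function name and the strict matching rule are assumptions on my part, not anything documented for HLE):

    import re

    # Hypothetical grader: the harness can only score short canonical answers,
    # so any preamble or trailing explanation makes a response ungradable.
    def grade_numeric(model_output: str, expected: int) -> bool:
        # Accept only a bare integer, optionally surrounded by whitespace.
        m = re.fullmatch(r"\s*(-?\d+)\s*", model_output)
        return m is not None and int(m.group(1)) == expected

    grade_numeric("2", 2)                              # True
    grade_numeric("Sure, here's your answer: 2", 2)    # False

Hence prompts ending with "Answer with a number": it keeps the model's output in a form this kind of check can score at all.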
ClumsyPilot
We need realistic tests - organise a pissup.
here are WhatsApp texts of 10 people arguing about their dietary requirements, here are blurry screenshots of menus from the nearest pubs and their prices. But no-one likes that one guy Joe Bogs, so choose the best place for everyone else and exclude him, so he doesn't bother showing up
JackYoustra
Given the questions, it's crazy to call this HLE, but whatever man. Kinda fun. Can't wait for the similar thing that happened when we scaled up cargo carriers to like very large etc etc
niobe
calling it "last" is defeating their own premise - that tests need to keep pace developments in ability
HALtheWise
The name is very intentional, this isn't "AI's Last Evaluation", it's "Humanity's Last Exam". There will absolutely be further tests for evaluating the power of AIs, but the intent of this benchmark is that any more difficult benchmark will either be
- Not an "exam" composed of single-correct-answer closed-form questions with objective answers
- Not composed of questions that humans/humanity are capable of answering.
For example, a future evaluation for an LLM could consist of playing chess really well or solving the Riemann Hypothesis or curing some disease, but those aren't tasks you would ever put on an exam for a student.
krackers
Isn't FrontierMath a better "last exam"? Looking through a few of the questions, they seem less reasoning-based and more fact-based. There's no way one could answer "How many paired tendons are supported by this sesamoid bone [bilaterally paired oval bone of hummingbirds]" without either having a physical model to dissect or just regurgitating the info found somewhere authoritative. It seems like the only reason a lot of the questions can't be solved yet is that the knowledge is specialized enough that it simply isn't found on the web; you'd have to phone up the one guy who worked on it.
wrs
Lest LLMs turn into all-knowing but completely opaque oracles, I’d prefer every question ended with “and how do you know?”
CamperBob2
That's basically what you get from Deep Research. It will cite its sources and show (at least some of) its reasoning.
A tougher academic knowledge benchmark is great, but for something to truly be worthy of the title "Humanity's Last Exam", I expect something more like:
1. Write a novel that wins the Pulitzer Prize.
2. Prove (or disprove) the Riemann Hypothesis.
3. Provide a theory unifying quantum mechanics and gravity.
4. Design an experiment to give evidence for your theory in (3). The experiment should be practical to actually execute, using no more than the budget to create the LHC (~$4.5 billion).
5. Given programmatic access to a brokerage account with all the permissions of a typical hedge fund, raise all the money required for your experiment in (4) by trading on the stock market, starting with $100.
6. Solve for (5), without being provided access to an account first - begin with just a general internet connection and use computer security vulnerabilities (known or zero-days that you discover) to get some way of trading instead.
7. Solely by communicating over the internet, establish a new religion, and convince at least 10 million humans to convert to it. Converting should require adherence to a strict code of conduct that a random, unbiased panel of human judges considers to be at least as strict and challenging to follow as the tenets of Hasidic Judaism.
8. Implement an AI which could score higher than you on questions 1-7 with lower total cost of compute.