
Results of "Humanity's Last Exam" benchmark published

dang

The project site is https://lastexam.ai. Readers may want to look at both.

next_xibalba

These types of exams, and most benchmarks to date, seem to be very one-dimensional in terms of measuring intelligence. For instance, if we transported a human from 2,000 years ago to the present day and asked him to take this exam, he would likely get 0%, given that he couldn't read or write, let alone comprehend the concepts and context required to solve these questions. But that man would still undoubtedly be far more intelligent than an ape on all dimensions. He would likely be more intelligent than a toddler on many dimensions. He might even be more intelligent than some high school students on a few dimensions. I can't exactly articulate "what" is missing or how to measure it, but I can intuit that something is missing from these benchmarks.

oersted

"Intelligence" itself is very ill-defined and we've never been able to measure it properly, IQ is rife with issues.

At some point, you just have to be pragmatic and measure the questions you want the AI to be good at answering, rather than trying to measure intelligence in general.

In that sense, I see this as one more benchmark that collects questions we want/expect AI to be good at answering, that it is not yet good at, and that have been underrepresented in previous benchmarks. That's obviously valuable; there's nothing "magical" about it. Although it is reasonable to be annoyed at the "Humanity's Last Exam" naming: of course they must have missed plenty of edge cases like everyone else, and it is very arrogant to claim it will be the "Last" one.

godelski

  > "Intelligence" itself is very ill-defined
While this is true, it is well agreed upon (by domain experts) that intelligence is distinct from knowledge recall. But that's what most of these tests... test.

If you look at IQ tests you'll see that they are attempts to test things that aren't knowledge based. You'll also notice that the main critiques of IQ tests are about how they often actually measure knowledge and that there's bias in natural knowledge acquisition. So even the disagreements about the definition of intelligence make clear that knowledge and intelligence are distinct.

I feel that often people conflate "intelligence is ill-defined" with "intelligence has no definition." These two are not in opposition. Being ill-defined is more like "I know I left my phone in the house, but I'm not sure where." This is entirely different from "I lost my phone, it is somewhere in California" or "It is somewhere on Earth" and clearly different from "I lost my phone. I'm unsure if I had a phone. What even is a phone?"

oersted

Yes, agreed, there is indeed a rough consensus on what intelligence is and reasonable ways to approximately measure it. These standard tests have been applied to LLMs from the beginning; they have not proven to be the most helpful for guiding research, but there's value in applying benchmarks that have been battle-tested on humans.

It's just that OP was questioning this group's criteria for selecting the questions that determine intelligence. Then we get into endless discussions of semantics.

At the end of the day, you are just testing which questions your AI performs well on, and you can describe how you chose those questions. Claiming it measures "general intelligence" is just unhelpful and frustrating.

JohnMakin

> IQ is rife with issues

Indeed, and yet people are obsessed with it and the idea of measuring their own intelligence - I completely do not understand it. I am in an extremely high percentile, but I am a total moron in a lot of areas and if you met me would likely think so as well. It's a poor predictor for just about everything except how good a person is at recognizing patterns (I know there are many different kinds of tests, but inevitably, it feels like this) and how quickly they can reason. But people are obsessed with it (go on Quora and search "IQ" - you probably won't have to, though, since half the questions there are seemingly about IQ).

A thing I like to say is you didn't earn your intelligence any more than a 7'0" man earned his height - to some degree it seems innate (we don't even really know how).

This all said, it seems even more pointless to try to "IQ" test an AI in this manner. What does it predict? What is it measuring? And you're not going to be able to use the same questions for more than 1 test, because the AI will "learn" the answers.

godelski

The lowest IQ thing you can do is be obsessed with IQ.

There are known knowns, there are known unknowns, and there are unknown unknowns. The wise man knows he cannot know what he does not know and that it'd be naive to presume he knows when he cannot know how much he doesn't know. Therefore, only the unintelligent man really _knows_ anything.

visarga

> "Intelligence" itself is very ill-defined and we've never been able to measure it properly, IQ is rife with issues.

Yes, because it is 1st person exclusively. If you expand a bit, consider "search efficiency". It's no longer just 1st person; it can be social. And it doesn't hide the search space. Intelligence is partially undefined because it doesn't specify the problem space; that is left blank. But "search efficiency" is more scientific and concrete.

esotericimpl

This is always the answer for anyone who thinks LLMs are capable of "intelligence".

It's good at answering questions that it's trained on. I would suggest general intelligence is being good at answering the things you didn't want/train the AI to be good at answering.

fooker

Are you good at answering questions you are not trained to answer?

How about a middle school test in a language you don’t speak?

golol

The things that are missing are what stop us from having useful agents so far: agency, judgement, a sense of time, long-horizon planning, not being gullible. I kinda feel like some amount of ego is necessary to get a model to behave like that.

tkgally

I agree that many aspects of intelligence—and of the lack of intelligence—are not being measured by such benchmarks. One issue is that they only examine problems that have right answers.

One of the most powerful uses of LLMs for me, at least, is brainstorming: having them suggest possible avenues for me to pursue with specific projects I am working on. If I give Claude or ChatGPT or Gemini enough context about my problems, they usually come up with useful suggestions—sometimes amazingly well. Are they better at that than the smartest human? I don't know. How do you quantify the quality of an idea? But those ideas often seem really, really good to me.

Another difficult-to-measure capability is interaction. Back-and-forth conversations with models don't always go well, but when they work they frequently blow me away. But those successes are dependent partly on the model, partly on me, and partly on how the conversation happens to unfold. Again, that success or failure doesn't seem measurable with benchmarks that require objectively right answers.

taeric

I'm curious why you are confident they would be more intelligent than a modern toddler?

I largely empathize with your point. But, as I can recognize there are some out there far better at problem solving than I am, I am growing ok with the idea that intelligence can be measured. Not to a single number, most likely, but to a variety of different aspects.

Similarly, I'd imagine that a human from 2000 years ago is probably hardier than one from the modern age, if only because of selection effects at play.

Obviously, you can't extrapolate a straight line between either measurement and expect it to continue in either direction. But I don't know why you couldn't build up a measurement for it?

(And it should go without saying that you shouldn't be judging worth using this sort of measurement.)

og_kalu

This is true, but that's because it's gotten hard to do much else. LLMs are eating up everything else that doesn't require long-horizon planning or multimodality.

If you created a new benchmark today that didn't lean on the things I've mentioned, or on esoteric/super-specialized domain knowledge (the kind that would actually require some sort of superhuman performance to ace), as this one and Frontier Math do, LLMs would probably do pretty well.

modeless

ARC-AGI is a benchmark with no language that could plausibly be solved by primitive humans, assuming only intelligence.

fooker

Put ‘em in diverse simulations and see how long they survive.

I can imagine a dystopian world where people are subjected to this for training and testing AI.

WanderPanda

I mean it is humanity’s LAST exam. Humanity’s first exam would probably be something about communication? Or about building and predicting effects of certain tools?

jbenoit

They started collecting problems last fall, saying the top 550 submissions sent in by Nov 1st would get rewarded, to the tune of $500-$5000 each.

Near the deadline, I counted the total number of submissions and realized that each question I wrote had an expected value of hundreds of dollars, which made it a great use of my time. So I wrote a good number, using the knowledge gained in my CS Ph.D.

Then, as the Nov 1st deadline rolled around, they announced they extended the deadline to Nov 15th. Then Nov 15th came, and it said on their website they were still accepting submissions.

Most of my submissions are being included in the benchmark, but I'm only getting paid $500, and only for one of them (the one I thought was the most standard and least difficult, funnily enough). Had they closed submissions when they said they would, it seems likely I'd have been paid for a few more.

From my perspective, they basically conned hundreds of Ph.D.s around the world into writing questions for much less reward than promised. My close friend wrote a large number of questions for them, is getting paid thousands of dollars, and still feels defrauded.

I'm not sure what they're doing in the end. It sounds like they're mostly just paying people who submitted before Nov 1st, with a few exceptions, but either way they lied. There was no indication that people who submitted later would not get paid, and there was no indication that the deadline would be extended. Either they pay people who submitted after Nov 1st, meaning they lied to the people who submitted before about their expected reward, or they don't, meaning they majorly lied to the people who submitted after. Either way, it's clear grounds for a class-action lawsuit, and I hope one gets going.

vkou

You shouldn't engage in a CAL; a regular lawsuit from anyone wronged will be cheaper and way more painful for them.

If you're in the US, consider small claims court. It's a small sum of money, you won't need to pay a lawyer, they'll probably not even show up.

jbenoit

Hmmm. I can see how it would be more painful for them to fight, but most people were conned out of less than $200, and it's rather self-sacrificing to fight over that. Plus, no one wants a reputation for being litigious, and starting a CAL is less conducive to creating that reputation.

I only submitted before Nov 1st, so I'm not sure to what extent I was personally conned.

renjimen

I don't know about groundbreaking. It's just more academic questions. We already have a lot of those benchmarks; this one is just a bit harder. But at this point these models are so glaringly bad in so many other areas APART from academic questions. Benchmarks for spatial reasoning or theory of mind are more interesting now, for example. These kinds of understanding are far more important if we expect to integrate AI into our everyday lives. I suspect even our most distant primate cousins could outperform multi-modal models on these kinds of tests.

jfengel

It does feel a bit like the early days of AI:

"We want to make computers do what smart people do. What do smart people do? They play chess! Once we've solved that, everything else will be easier."

It has been remarkable how much of the "easier" stuff they've made progress on -- like natural language and images. But after a huge quantum leap of improvement, they don't seem very good at adapting to a lot of the things we really need them for.

renjimen

Exactly!

Whatever world model LLMs have is like this crippled view through the lens of the internet. They are really like savants.

It's annoying that the AI companies are still touting their performance on all these metrics of domain knowledge in white-collar jobs, but in truth they will fail in all but the most narrow applications in those domains because they can't understand basic human behaviour.

pavel_lishin

> Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

I wonder how many questions give a gentle nudge towards the answer like this. How many answers would have been wildly off the mark without specifying what the answer needs to look like?

sdwr

Isn't this a terrible question to measure intelligence? It looks like it's testing niche domain knowledge along the lines of:

> What color is the ball hidden behind the flowerpot in my neighbor's backyard?

Maybe you can reason towards the answer if you only have a deep knowledge of bird anatomy and not Apodiformes anatomy, and that's the intelligence part?

zeroonetwothree

Good point. I wouldn’t expect a human to need the last sentence.

salynchnew

The generous hypothesis here is that this is so they can automate the benchmarking itself. If that is true, then this is likely a result of the test authors being too clever for their own good and over-optimizing. If an LLM can't figure out on its own that "how many" is asking for a number, it has failed at a much more basic level.

You should be able to easily accept answers like "four" and "4" as equivalent, for example. I doubt there will be that many frontier models running against this test at any time, and a simple glance at the answers from any human should be enough to catch edge cases like this one.
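
For illustration, here is a minimal sketch in Python of the kind of answer normalization that would let "four" and "4" grade as equivalent. It is purely hypothetical and not the benchmark's actual grader:

  # Toy answer-normalization helper (illustrative only, not the benchmark's grader).
  NUMBER_WORDS = {
      "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
      "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
  }

  def normalize_numeric_answer(text):
      """Return the integer an answer refers to, or None if unrecognized."""
      token = text.strip().lower().rstrip(".")
      if token in NUMBER_WORDS:
          return NUMBER_WORDS[token]
      try:
          return int(token)
      except ValueError:
          return None

  assert normalize_numeric_answer("four") == normalize_numeric_answer("4") == 4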

m_ke

The only reliable final test will be a black-box test suite that takes your model, executes it in a sealed environment, and gives you a grade back, potentially with a performance breakdown by subject.

No telling companies what the questions look like, what the output format is, what topics are covered, so that there’s no room to make up synthetic data to interpolate from.
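
As a purely hypothetical sketch of what such a sealed harness's interface could look like (the names and structure below are invented for illustration, not anything that exists):

  # Hypothetical sealed-evaluation harness: the model is submitted as a callable,
  # the hidden questions never leave the harness, and only aggregate scores come back.
  from dataclasses import dataclass

  @dataclass
  class EvalReport:
      overall: float     # fraction of hidden questions answered correctly
      by_subject: dict   # e.g. {"math": 0.12, "biology": 0.30}

  def run_sealed_eval(model, hidden_questions):
      """hidden_questions: (subject, prompt, expected_answer) triples kept private."""
      correct, total = {}, {}
      for subject, prompt, expected in hidden_questions:
          total[subject] = total.get(subject, 0) + 1
          if model(prompt).strip() == expected.strip():
              correct[subject] = correct.get(subject, 0) + 1
      by_subject = {s: correct.get(s, 0) / n for s, n in total.items()}
      overall = sum(correct.values()) / max(sum(total.values()), 1)
      return EvalReport(overall, by_subject)

In practice the grading would have to be far more forgiving than exact string matching, which is exactly the "four" vs. "4" issue raised above.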

LPisGood

The 8 sample questions available here are interesting:

https://lastexam.ai/

I might be able to answer 2 of them with great effort (maybe!), and I would be highly surprised if any human alive could answer 5 or more without seeing the problems in advance.

sebzim4500

I can answer 2 of them quite quickly with pen and paper (compsci, physics), and one more (maths) after looking up some definitions on Wikipedia, so I am certain there are people who can do more than 5.

The computer science one seems weirdly easy compared to the rest: it's multiple choice, and it is very easy to get it right by process of elimination even if you don't understand how to actually do the problem.

LPisGood

Yes, many can answer the compsci and physics problems. The math problem is abstract and more difficult, but solving those 3 plus 2 others seems nearly superhuman.

zamalek

I assume that the questions (and answers) aren't published anywhere? Else it would be "Humanity's Last Exam before the previous crawl".

mrandish

Assessing AI's progress toward replicating the full breadth and depth of human intelligence is a deceptively hard problem. "On the Measure of Intelligence", a paper by François Chollet, who was until recently a researcher at Google, is the best overview of the challenges I've read. Highly recommended.

https://arxiv.org/abs/1911.01547

sebzim4500

The name is obviously a bit stupid, but based on the sample questions I think they did a good job of creating a harder version of the existing academic question benchmarks.

The questions are possible for a smart person familiar with the subject but still just beyond SOTA models.

My guess is that within the next few years we will have models that can ace this test but are still bizarrely bad at things we find easy.

xnx

Interesting marketing for Scale AI. I'd be surprised if any foundation models started benchmarking against this.

Captchas seem like the more interesting test. As long as there are captchas that average people can solve, but computers can't, we will still have a long way to go toward artificial intelligence.

sebzim4500

I don't think this is necessarily true. I can imagine a future in which we have robots that can do 99% of human jobs, but there's one otherwise unimportant skill they are strangely bad at that can be used as a captcha.

dang

I briefly merged this thread into https://news.ycombinator.com/item?id=42804853, but actually the current article has more context, so probably we should keep this as the top link and then people can look at https://lastexam.ai also.